2019/Grants/Classifying Wikipedia Actors

From WikiConference North America


Title:

Classifying Wikipedia Actors

Name:

Carlin MacKenzie

Wikimedia username:

carlinmack

E-mail address:

carlin.mackenzie@gmail.com

Resume:

https://carlinmack.com/cv.html

Geographical impact:

English

Type of project:

Technology

What is your idea?

Wikipedia currently has good vandalism-detection tools for individual edits, but it lacks tooling to identify users who repeatedly engage in misconduct over time. This project will create a tool that identifies these users, reducing the load on volunteers, who currently must go through a user's edits manually to spot any patterns of bad behaviour. A machine process can instead alert human volunteers to the user histories worth examining. This will improve on current practice, which only catches editors with the most egregious editing histories.

I am currently doing research on misconduct on Wikipedia at the University of Virginia. My project will create a database of all edits on Wikipedia since 2001 for subsequent classification of good and bad actors. We have no funding for our current project, and we know its limits. If I am sponsored, we could pursue this project next year and add the following features and outcomes:
  • Detection of more complex forms of misconduct, such as complaining or discussing in bad faith
  • Developing a system that ingests information in an online fashion and flags users whose reputation scores are declining. This could be integrated into the Recent changes feed.
  • Creating online community discussion around adding AI-based reliability features to Wikipedia. Regardless of the direction of future technology, the social discussion about human/AI interaction should start now, using experiments like this for modelling.
  • Creating documentation to make research into Wikipedia more approachable
  • Community review to improve the current Research pages on Meta
  • Creating reports and visualisations of users, misconduct and editing patterns on Wikipedia
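The online flagging idea in the list above could work, in very rough outline, like this: each incoming edit already carries a per-edit "damaging" probability (e.g. from a classifier such as ORES), and a per-user score is updated incrementally as edits arrive. The sketch below is a toy illustration under those assumptions, not the project's actual design; the smoothing factor, threshold, and minimum-edit cutoff are all made up for the example.

```python
from collections import defaultdict

ALPHA = 0.3           # smoothing factor for the running score (assumed value)
FLAG_THRESHOLD = 0.5  # flag users whose smoothed "damaging" score exceeds this

class ReputationTracker:
    """Toy online tracker: ingest (user, damaging_probability) events one
    at a time and flag users whose smoothed score drifts upward."""

    def __init__(self):
        self.scores = defaultdict(float)  # smoothed damaging score per user
        self.counts = defaultdict(int)    # number of edits seen per user

    def ingest(self, user, damaging_prob):
        # Exponentially weighted moving average of per-edit scores, so
        # recent behaviour counts more than old behaviour.
        self.scores[user] = ALPHA * damaging_prob + (1 - ALPHA) * self.scores[user]
        self.counts[user] += 1
        return self.scores[user]

    def flagged(self, min_edits=3):
        # Only flag users with enough history for the score to be meaningful.
        return [u for u, s in self.scores.items()
                if self.counts[u] >= min_edits and s > FLAG_THRESHOLD]

tracker = ReputationTracker()
for prob in [0.1, 0.2, 0.1]:    # a mostly constructive editor
    tracker.ingest("alice", prob)
for prob in [0.8, 0.9, 0.95]:   # repeated likely-damaging edits
    tracker.ingest("bob", prob)

print(tracker.flagged())  # → ['bob']
```

A real system would need to handle far more nuance (talk-page behaviour, reverts, context), but the key property the sketch shows is that state per user is tiny and updates are constant-time, which is what would make integration with a live feed plausible.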

Why is it important?

There are currently no tools to detect users who engage in misconduct over time, behaviour which decreases the quality of Wikipedia. As a result, we do not know how big the problem is or which strategies are effective against it.

Additionally, this project aims to create documentation around Wikipedia research and improve existing resources so that the field is more approachable. This is a fertile area of research with applicability to network analysis, natural language processing, classification, prediction and the digital humanities. However, getting started with Wikipedia research involves a steep learning curve, and the information we create will encourage future research in these areas.

Is your project already in progress?

Yes, I am currently working on part one of this project, namely https://meta.wikimedia.org/wiki/Research:Classifying_Actors_on_Talk_Pages. This project lasts until June of 2020. If I am sponsored, I can continue this research until June 2021.

How is it relevant to credibility and Wikipedia? (max 500 words)

This project is relevant to Wikipedia's credibility because misinformation and misconduct routinely appear in the same topic areas: where there is misinformation, there tend to be higher rates of misconduct. Studying misconduct therefore yields two benefits: leads on topics where misinformation is at higher risk of appearing, and more civil, more enjoyable conversations around misinformation.

Additionally, this project will give us a handle on the scale of the issue and allow us to create strategies to deal with it.

What is the ultimate impact of this project?

A better understanding of the types of users on Wikipedia, and earlier detection of users engaging in misconduct.

Could it scale?

Hopefully, the resulting processing techniques will be lightweight enough to be integrated into the Recent changes feed.

Other avenues include a Wikipedia gadget or user script that displays per-user scores, much as the existing ORES service currently does for individual edits.

Why are you the people to do it?

I, as project lead, am familiar with the Wikipedia research space and have been a Wikipedian for several years. I have already started this project, and the database is being created with the requisite data. Additionally, I have collaborators and advisers for this project in academia, the Wikimedia community, the WMF and industry, who will be important for coordinating future strategy.

What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?

One type of misconduct is blaming and criticising other users. By decreasing this behaviour we can make Wikipedia more inclusive.

What are the challenges associated with this project and how will you overcome them?

One main challenge is taking the insights from the database and applying them to edits happening in real time. Fortunately, this is not completely new territory: many bots already monitor the Recent changes feed, and I also have support from tool developers.
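For context on what monitoring the feed involves: Wikimedia publishes recent changes as a stream of JSON events (the EventStreams service), and a consumer would extract just the fields a per-user tracker needs. The sketch below parses a single abbreviated sample event rather than subscribing to the live stream; the field names follow the public recentchange event format, but the sample values and the `extract_edit` helper are illustrative, not part of any existing tool.

```python
import json

# Abbreviated sample of a recentchange event. Field names follow the
# public recentchange schema; the values are made up for illustration.
SAMPLE_EVENT = """{
  "type": "edit",
  "title": "Example article",
  "user": "ExamplePlainUser",
  "bot": false,
  "wiki": "enwiki",
  "revision": {"old": 1000, "new": 1001}
}"""

def extract_edit(raw):
    """Return (user, title, new_revision_id) for human edits, else None."""
    event = json.loads(raw)
    if event.get("type") != "edit" or event.get("bot"):
        return None  # skip log/new-page events and bot edits
    return event["user"], event["title"], event["revision"]["new"]

print(extract_edit(SAMPLE_EVENT))  # → ('ExamplePlainUser', 'Example article', 1001)
```

The new revision ID is what a downstream scorer would use to fetch a per-edit quality prediction before updating the user's running score.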

How much money are you requesting?

$8,250

How will you spend the money?

My university will grant some storage and computation resources, but for me to develop this project beyond the pilot demonstration, I need to purchase the storage required to host all of Wikipedia. Beyond that, I would like to document and publish this project as a model for anyone who wants to perform research on Wikipedia.

  • Development - $4000
  • Community survey - $500
  • Documentation - $1500
  • Data analysis, reports and visualisations - $1500
  • Storage and compute space - $750

How long will your project take?

If I am sponsored, I will be able to pursue this research from September for the duration of the academic year, until June 2021.

My current research runs until June of this year.

Have you worked on projects for previous grants before?

I am currently working on this project, but without a grant.