2019/Grants/Classifying Wikipedia Actors



Title:

Classifying Wikipedia Actors

Name:

Carlin MacKenzie

Wikimedia username:

carlinmack

E-mail address:

carlin.mackenzie@gmail.com

Resume:

https://carlinmack.com/cv.html

Geographical impact:

English

Type of project:

Technology

What is your idea?

Wikipedia currently has good vandalism detection tools for individual edits, but it lacks tooling to identify users who repeatedly engage in misconduct over time. This project will create a tool that identifies these users. This will reduce the load on volunteer moderators, who currently have to go through a user's edits manually to spot patterns of bad behaviour. A machine process can instead alert human volunteers to which users' histories are worth examining. This will improve on current practice, which only catches editors with the most egregious editing histories.
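
To give a concrete sense of the per-user edit history that volunteers currently review by hand and that such a tool would consume, here is a minimal sketch that pulls a user's recent contributions from the standard MediaWiki Action API. It is illustrative only and not part of the current codebase; the username is a placeholder.

    # Minimal sketch: fetch a user's recent contributions via the MediaWiki Action API.
    # Illustrative only; "ExampleUser" is a placeholder username.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_contribs(username, limit=50):
        """Return recent edits (timestamp, title, comment, size change) for a user."""
        params = {
            "action": "query",
            "list": "usercontribs",
            "ucuser": username,
            "uclimit": limit,
            "ucprop": "timestamp|title|comment|sizediff",
            "format": "json",
        }
        resp = requests.get(API, params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()["query"]["usercontribs"]

    for edit in fetch_contribs("ExampleUser"):
        print(edit["timestamp"], edit["title"], edit.get("sizediff"))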

I am currently doing research on misconduct on Wikipedia at the University of Virginia. My project is creating a database of all edits on Wikipedia since 2001, for subsequent classification of good and bad actors. Our current project has no funding, and we know its limits. If I am sponsored, we could pursue this project next year and add the following features and outcomes:

  • Detection of more complex forms of misconduct, such as complaining or discussion in bad faith.
  • Developing a system that ingests information in an online fashion and flags users whose reputation scores are decreasing. This could be integrated into the Recent changes feed (a rough sketch follows this list).
  • Creating online community discussion around adding AI features for reliability to Wikipedia. Regardless of the direction of future technology, the social discussion about human/AI interaction should start now, using experiments like this for modelling.
  • Creating documentation to make research into Wikipedia more approachable.
  • Community review to improve the current Research pages on Meta.
  • Creating reports and visualisations of users, misconduct and editing patterns on Wikipedia.
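
As a rough illustration of the online ingestion idea in the second bullet above, the sketch below listens to the public Wikimedia EventStreams recentchange feed and passes each change to a reputation scorer. The update_reputation function and the flagging threshold are hypothetical placeholders, not existing components.

    # Rough sketch of online ingestion from the public EventStreams recentchange feed.
    # update_reputation and the flagging threshold are hypothetical placeholders.
    import json
    from sseclient import SSEClient as EventSource  # pip install sseclient

    STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

    def update_reputation(change):
        """Placeholder for the scoring model: update and return a user's reputation."""
        return 0.0

    for event in EventSource(STREAM):
        if event.event != "message" or not event.data:
            continue
        try:
            change = json.loads(event.data)
        except ValueError:
            continue
        if change.get("wiki") != "enwiki" or change.get("bot"):
            continue  # restrict to human edits on English Wikipedia
        if update_reputation(change) < -0.5:  # illustrative threshold
            print("flag for review:", change["user"], change["title"])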

Why is it important?

There are currently no tools to detect users who perform misconduct over time, which decreases the quality of Wikipedia. This means that we know neither how big the problem is nor which strategies are effective against it.

Additionally, this project aims to create documentation around Wikipedia research and improve the existing resources so that the field is more approachable. This is a fertile area of research with applications in network analysis, natural language processing, classification, prediction and the digital humanities. However, starting Wikipedia research involves a steep learning curve, and the information we create will encourage future research in these areas.

Is your project already in progress?

Yes, I am currently working on part one of this project, namely https://meta.wikimedia.org/wiki/Research:Classifying_Actors_on_Talk_Pages. This project runs until June 2020. If I am sponsored, I can continue this research until June 2021.

How is it relevant to credibility and Wikipedia? (max 500 words)

This project is relevant to Wikipedia's credibility because misinformation and misconduct routinely appear in the same topic areas: where there is misinformation, there tend to be higher rates of misconduct. Studying misconduct therefore gives us leads on topics where misinformation is at higher risk of appearing, and it also helps keep conversations around misinformation more civil and enjoyable.

Additionally, this project will give us a handle on the scale of the issue and allow us to create strategies to deal with it.

What is the ultimate impact of this project?

A better understanding of the types of users on Wikipedia, and catching users who perform misconduct earlier.

Could it scale?

Hopefully the resulting processing techniques could be light enough to be integrated into the Recent changes feed.

Other avenues include a Wikipedia gadget or user script that displays scores for users, similar to what the existing ORES tool currently does for individual edits.
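
For comparison with how per-edit scores are served today, the sketch below fetches the ORES "damaging" probability for a single revision; a per-user reputation score could be exposed through a similar endpoint. The revision ID is a placeholder, and the exact response layout is an assumption to check against the ORES documentation.

    # Rough sketch: fetch the ORES "damaging" probability for a single revision.
    # The revision ID is a placeholder; verify the response layout against the
    # ORES documentation before relying on it.
    import requests

    ORES = "https://ores.wikimedia.org/v3/scores/enwiki/"

    def damaging_probability(rev_id):
        """Return ORES's estimated probability that a revision is damaging."""
        resp = requests.get(ORES, params={"models": "damaging", "revids": rev_id},
                            timeout=30)
        resp.raise_for_status()
        score = resp.json()["enwiki"]["scores"][str(rev_id)]["damaging"]["score"]
        return score["probability"]["true"]

    print(damaging_probability(123456789))  # placeholder revision ID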

Why are you the people to do it?

As project lead, I am familiar with the Wikipedia research space and have been a Wikipedian for several years. I have already started this project, and the database is being populated with the requisite data. Additionally, I have collaborators and advisers for this project in academia, the Wikimedia community, the WMF and industry, which is important for coordinating future strategy.

What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?

One type of misconduct is blaming and criticising other users. By decreasing this behaviour we can make Wikipedia more inclusive.

What are the challenges associated with this project and how you will overcome them?

One main challenge is taking the insights from the database and applying them to edits happening in real time. Fortunately, this is not completely new territory: there are many bots monitoring the Recent changes feed, and I also have support from tool developers.

How much money are you requesting?

$8,250

How will you spend the money?

My university will grant some storage and computation resources, but to develop this project beyond the pilot demonstration I need to purchase the storage required to host all of Wikipedia. Beyond that, I would like to document and publish this project as a model for anyone wanting to perform research on Wikipedia.

  • Development - $4000
  • Community survey - $500
  • Documentation - $1500
  • Data analysis, reports and visualisations - $1500
  • Storage and compute space - $750

How long will your project take?

If I am sponsored, I will be able to pursue this research from September for the duration of the academic year, until June 2021.

My current research runs until June of this year.

Have you worked on projects for previous grants before?

I am currently doing this project, but not with a grant.