2019/Grants/Adding Context to Online Stories by Suggesting Related Wikipedia Articles
Title:
Adding Context to Online Stories by Suggesting Related Wikipedia Articles
Name:
Gabriel Altay
Wikimedia username:
Gabrielaltay
E-mail address:
gabriel.altaygmail.com
Resume:
Gabriel Altay - Machine Learning Engineer at Kensho Technologies
https://www.linkedin.com/in/gabriel-altay-75599126
https://en.wikipedia.org/wiki/User:Gabrielaltay
Jennifer Radel - Full Stack Engineer at Kensho Technologies
https://www.linkedin.com/in/jennifer-van/
Stephen Corwin - Front End Engineer at Kensho Technologies
https://github.com/stephencorwin
Jena Van - Computer Science Student at George Mason University
Geographical impact:
English readers globally but scalable to other languages.
Type of project:
Technology
What is your idea?
We hope to build a Google Chrome browser extension that will help people reach Wikipedia articles relevant to stories they are viewing online. We will do this by converting the entire English Wikipedia corpus from wikitext markup to plain text and using it to build explicit topic models. We will avoid any user-specific modeling and focus on simply connecting text snippets to relevant Wikipedia pages.
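To make the idea concrete, here is a minimal sketch of how an explicit topic model could map a text snippet to Wikipedia pages. The proposal does not fix a specific model, so this assumes a TF-IDF inverted index in which each article acts as one explicit topic. The three-article corpus, the weighting scheme, and the names `build_index` and `suggest` are all illustrative assumptions, not the project's actual design.

```python
import math
import re
from collections import Counter, defaultdict

# Toy stand-in for plain-text Wikipedia articles (hypothetical content).
ARTICLES = {
    "Vaccine": "a vaccine trains the immune system to fight a virus or disease",
    "Climate change": "climate change is driven by greenhouse gas emissions and global warming",
    "Election": "an election is a vote in which citizens choose a candidate for office",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(articles):
    """Build a sparse inverted index: term -> {article title: normalized TF-IDF weight}."""
    n_docs = len(articles)
    term_counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    index = defaultdict(dict)
    for title, counts in term_counts.items():
        weights = {t: c * math.log(1 + n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for term, weight in weights.items():
            index[term][title] = weight / norm
    return index


def suggest(snippet, index, top_k=3):
    """Score articles against a snippet; only terms present in the snippet are touched."""
    scores = defaultdict(float)
    for term, count in Counter(tokenize(snippet)).items():
        for title, weight in index.get(term, {}).items():
            scores[title] += count * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


INDEX = build_index(ARTICLES)
print(suggest("New study on global warming and emissions", INDEX))
```

The snippet is the only input to the scoring step, so no information about the user is needed. A real index would cover millions of articles and would need pruning to fit in a browser.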
Why is it important?
Misinformation is pervasive online, but Wikipedia is one of the “last best places on the Internet” [1]. This is because “Other platforms are finding they need to retrofit their products to address misinformation; but battling fake news has been a central principle of Wikipedia since the early days” [2]. Many social media platforms have implemented features that direct users to Wikipedia (e.g., see [3] and [4]). Our project seeks to extend these strategies so that any text online can be used to suggest relevant Wikipedia articles. This is important because it brings the good work being done on Wikipedia to a larger audience.
[1] https://www.wired.com/story/wikipedia-online-encyclopedia-best-place-internet/
[2] https://misinfocon.com/wikipedia-built-to-battle-fake-news-c36370fe2c0e
[3] https://techcrunch.com/2018/04/03/facebook-author-info/
[4] https://www.wired.com/story/youtube-will-link-directly-to-wikipedia-to-fight-conspiracies/
Is your project already in progress?
The project is not currently in progress, but we have done initial explorations into explicit topic models using Wikipedia and published them on Kaggle: https://www.kaggle.com/kenshoresearch/kdwd-explicit-topic-models
How is it relevant to credibility and Wikipedia? (max 500 words)
Wikipedia articles typically offer broad context, while news articles and social media posts typically focus on specific events. Giving readers a paved path from the specific to the broad will help them frame the content they are reading. Wikipedia is a venue designed to minimize misinformation. The more people who view the material there, the better informed the public will be.
What is the ultimate impact of this project?
Google Chrome has approximately 65% of the browser market share (https://netmarketshare.com/). This translates to more than a billion users. Reaching even a small fraction of these users would give this project a large impact.
Could it scale?
In some sense the project has scale built in from the beginning. Making it available as a browser extension allows a large number of users to install it. Designing the extension to operate on user-highlighted text lets us avoid bespoke HTML parsing that would otherwise have to be custom built for each web page format. Running the Wikipedia page recommendation model in the client’s browser makes it unnecessary to run a model service.
Why are you the people to do it?
Gabriel Altay, Jennifer Radel, and Stephen Corwin work at Kensho Technologies. Gabriel Altay has expertise dealing with raw Wikimedia data (see https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data) and natural language processing (https://www.kaggle.com/kenshoresearch/kdwd-explicit-topic-models). Jennifer Radel has expertise in data engineering and machine learning. They have both mentored many students, software engineers, and machine learning practitioners. Stephen Corwin has expertise in front-end technologies as well as design and has worked on multiple browser extensions before. Jena Van is a freshman computer science student at George Mason University and is motivated to work at the intersection of data mining and misinformation. They are all eager to learn more about developing browser extensions and can ask a network of skilled engineers at Kensho and in the Wikimedia community for help.
What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?
This project aims to help anyone with a web browser reach useful information on Wikipedia.
What are the challenges associated with this project and how you will overcome them?
There are several challenges associated with this project. The first is converting all of English Wikipedia into plain text. There are some off-the-shelf solutions, but we will need to build our own so that we can include the appropriate metadata when we produce the plain text. This includes the section structure of the pages, the number of incoming and outgoing links, and potentially the category structure. Gabriel has experience doing this and will open source the software used in this project. The second challenge is keeping the data in the browser extension up to date. Wikipedia releases full text dumps on a monthly basis, and Kensho already produces some Wikimedia data products at that cadence. We can design our browser extension with an update that happens on the same schedule. The third challenge is getting an explicit topic model to run in a browser client. This is what truly gives the project scale. We could design a proof of concept that relied on API calls to a service, but that would require maintaining and scaling that service as the number of users increased. If we can design and implement an explicit topic model in the browser extension environment (i.e., in JavaScript), then we can scale to far more users. This will take the bulk of the project's research and development time, but the explicit topic models we seek to build can be implemented using only sparse arrays, so we are hopeful.
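As an illustration of the first challenge, the sketch below converts a small piece of wikitext to plain text while keeping section structure and an outgoing link count. It is only a sketch under strong assumptions: it handles headings, internal links, bold and italic quotes, `<ref>` tags, and list markers with regular expressions, and it ignores templates, tables, and nested markup that a real converter must handle. The sample page, the "Introduction" label for the lead section, and the function name `to_plain_text` are hypothetical.

```python
import re

# Hypothetical sample page; real dumps contain far messier markup.
SAMPLE_WIKITEXT = """'''Example''' is a [[placeholder]] article about [[Topic A|topic A]].

== History ==
It was created in ''2019''.<ref>Some citation</ref>

== See also ==
* [[Topic B]]
"""

HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")          # == Title ==
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")    # [[Target|label]]
REF = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL)
QUOTES = re.compile(r"'{2,}")                                 # ''italic'' / '''bold'''
BULLET = re.compile(r"^[*#:;]+\s*")                           # list markers


def to_plain_text(wikitext):
    """Return plain-text sections plus an outgoing link count for simple wikitext."""
    text = REF.sub("", wikitext)
    outgoing_links = [target.strip() for target, _label in LINK.findall(text)]
    # Keep the visible label when present, otherwise the link target.
    text = LINK.sub(lambda m: m.group(2) or m.group(1), text)
    text = QUOTES.sub("", text)

    # Text before the first heading goes into an assumed "Introduction" section.
    sections = [{"title": "Introduction", "level": 1, "lines": []}]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        heading = HEADING.match(line)
        if heading:
            sections.append(
                {"title": heading.group(2), "level": len(heading.group(1)), "lines": []}
            )
        elif line:
            sections[-1]["lines"].append(BULLET.sub("", line))

    return {
        "sections": [
            {"title": s["title"], "level": s["level"], "text": " ".join(s["lines"])}
            for s in sections
        ],
        "outgoing_link_count": len(outgoing_links),
    }


page = to_plain_text(SAMPLE_WIKITEXT)
for section in page["sections"]:
    print(section["level"], section["title"], "->", section["text"])
print("outgoing links:", page["outgoing_link_count"])
```

Incoming link counts cannot be computed from a single page. They would require a second pass that aggregates the outgoing link targets collected across the whole dump.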
How much money are you requesting?
US$ 7,200
How will you spend the money?
All of the money will be used as a research stipend for Jena Van to work on this project full-time during the summer (May 20 - August 20) of 2020. We based our figure on the National Science Foundation’s Research Experiences for Undergraduates program (https://www.nsf.gov/pubs/2019/nsf19582/nsf19582.htm).
How long will your project take?
Three months.
Have you worked on projects for previous grants before?
This will be our first Wikimedia grant.