Talk:2019/Grants/Adding Context to Online Stories by Suggesting Related Wikipedia Articles
Additional information requested
Hi Gabriel, thanks for the submission. Could you say a little more about how this tool will identify and present relevant topics? In particular:
- can you explain whether there is something beyond doing a frequency check on meaningful terms that result from an NLP analysis? (is this how you determine what a relevant topic is?)
- because part of your application directly addresses misinformation, is there something in particular the tool would provide with respect to additional context in these cases? (consider https://en.wikipedia.org/wiki/Flat_Earth versus https://en.wikipedia.org/wiki/Myth_of_the_flat_Earth, or both?)
Also, as a final question related to challenges:
- Have you given thought to this backfiring in any way?
Thanks! -Connie, 7 April 2020
Response
Thanks for the questions Connie!
Our baseline for similarity between user-highlighted text and the text of a Wikipedia page will be an explicit topic model based on TF-IDF. We selected this baseline because it is straightforward to implement, runs quickly at inference time, is small enough to fit in a browser extension, and is easy to interpret (e.g. word x contributed y to the similarity score between the user-highlighted text and Wikipedia page z). We will also explore more modern word embedding approaches (e.g. Word2Vec, GloVe, and contextualized word embeddings such as BERT) to see if they improve on our baseline while maintaining the benefits of TF-IDF.
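To make the baseline concrete, here is a minimal sketch using scikit-learn for illustration; the `wikipedia_pages` dict is a stand-in for plain text parsed from a dump, and the production implementation may differ:

```python
# Minimal TF-IDF similarity baseline (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for {page title: plain text} parsed from a Wikimedia dump.
wikipedia_pages = {
    "Flat Earth": "The flat Earth model is an archaic conception of Earth's shape ...",
    "Myth of the flat Earth": "The myth of the flat Earth is a modern misconception ...",
}

titles = list(wikipedia_pages)
vectorizer = TfidfVectorizer(stop_words="english")
page_matrix = vectorizer.fit_transform(wikipedia_pages.values())

def suggest(highlighted_text: str, top_k: int = 5):
    """Return the top_k pages most similar to the user-highlighted text."""
    query_vec = vectorizer.transform([highlighted_text])
    scores = cosine_similarity(query_vec, page_matrix).ravel()
    ranked = sorted(zip(titles, scores), key=lambda ts: ts[1], reverse=True)
    return ranked[:top_k]

print(suggest("people who believe the earth is flat"))
```

Because the vocabulary and IDF weights are just a fixed table after training, this is the kind of model that can be shipped inside a browser extension and queried offline.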
However, text similarity is not the whole story, especially in a project focused on credibility. One of the major choices in our modeling is which Wikipedia pages to include as potential topics to return to the user. The concept of a “good” Wikipedia page / topic is hard to quantify, and much of the R&D for this project will go into investigating this question. Below we go into more detail about the signals and features we plan to examine, divided into “community” features - those made possible by humans reading and reviewing pages - and “machine” features - those derived algorithmically from raw Wikimedia data (note that Wikidata provides additional metadata about pages that can be used for filtering).
Community Features
- Wikipedia:Content_assessment: The Wikipedia community has established an assessment system that assigns grades to some pages. While not a comprehensive solution, incorporating this signal into our model will help it return quality pages. Alternatively, we could give the user the option to include only Featured_articles and Good_articles (approximately 36,000 pages).
- Wikipedia:WikiProject_Reliability: This Wikipedia project focuses on the core policies of verifiability and no original research. Their guidelines encourage usage of templates indicating reference quality. We can parse these templates when we convert wiki text markup to plain text and incorporate them into our model.
- Wikipedia:Reliable_sources/Perennial_sources: The perennial sources list tracks commonly discussed sources (both good and bad). We can detect whether pages use any sources on this list and use the source status (generally reliable, no consensus, generally unreliable, deprecated, blacklisted, …) to inform our relevancy/credibility score; one possible weighting is sketched below.
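As a rough illustration of the perennial-sources idea, here is one possible weighting. The status weights and the `page_sources`/`perennial_list` inputs are assumptions for illustration, not settled design:

```python
# Illustrative mapping from perennial-source status to a score weight.
PERENNIAL_STATUS_WEIGHTS = {
    "generally reliable": 1.0,
    "no consensus": 0.0,
    "generally unreliable": -1.0,
    "deprecated": -2.0,
    "blacklisted": -3.0,
}

def source_credibility(page_sources, perennial_list, neutral=0.0):
    """Average the status weights of a page's cited source domains.

    page_sources: iterable of domains cited by the page, e.g. ["nature.com"]
    perennial_list: {domain: status} parsed from the perennial sources list
    Domains not on the list count as neutral.
    """
    weights = [
        PERENNIAL_STATUS_WEIGHTS.get(perennial_list.get(domain), neutral)
        for domain in page_sources
    ]
    return sum(weights) / len(weights) if weights else neutral

# e.g. perennial_list = {"nature.com": "generally reliable",
#                        "dailymail.co.uk": "deprecated"}
```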
Machine Features
- incoming link count: It’s fairly easy to create a page with many outgoing links. It is not as easy to convincingly edit many pages such that a low quality page has many incoming links. Link counting is a baseline feature of many Wikipedia quality models.
- link graph structure: Wikipedia can be viewed as a graph with pages as nodes and links as edges. Graph metrics such as centrality may be useful for identifying quality pages.
- page length: Not a direct measure of quality, but very short pages are typically poor topics. Additionally, there are modifications to TF-IDF that take document length into account (e.g. http://singhal.info/pivoted-dln.pdf).
- disambiguation/list pages: Good topic pages cover one topic, while disambiguation pages focus on multiple meanings. List pages can be useful to readers, but they usually don’t use language in the same way as traditional pages and so tend to confuse TF-IDF models. Some of these pages can be identified by their titles (e.g. starts with “List of” or ends with “disambiguation”), but this is not always the case (consider the disambiguation page https://en.wikipedia.org/wiki/Ram). We have found some success using Wikidata to identify pages that are P31 (instance of) Q17442446 (Wikimedia internal item); see the sketch after this list.
- page views: Measured over a month or longer, page views are a good measure of popularity as opposed to quality, but they are useful for removing some noise at the very low page view end.
- categories: The Wikipedia category graph is a rich source of information. Some categories may be directly useful to us (e.g., Pseudoscience, Scientific_skepticism, Anti-vaccinationism, Conspiracy_theories) while the page-category graph as a whole may also yield interesting features.
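Here is the Wikidata filter sketch referenced in the disambiguation/list bullet above. It queries the public Special:EntityData endpoint for clarity; in practice we would read this information offline from the dumps rather than hit the live API:

```python
# Sketch: flag pages whose Wikidata item is P31 (instance of) Q17442446.
import requests

INTERNAL_ITEM_QID = "Q17442446"  # "Wikimedia internal item" (see bullet above)

def is_internal_item(item_qid: str) -> bool:
    """Check whether a Wikidata item is an instance of (P31) Q17442446."""
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{item_qid}.json"
    entity = requests.get(url, timeout=10).json()["entities"][item_qid]
    p31_claims = entity.get("claims", {}).get("P31", [])
    return any(
        claim["mainsnak"].get("datavalue", {}).get("value", {}).get("id")
        == INTERNAL_ITEM_QID
        for claim in p31_claims
    )
```

Pages flagged this way would simply be excluded from the pool of candidate topics before the similarity model is trained.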
Flat Earth Example
Your example comparing https://en.wikipedia.org/wiki/Flat_Earth with https://en.wikipedia.org/wiki/Myth_of_the_flat_Earth is an interesting one. There is a section in the first page titled “Modern Flat-Earthers”, but the page as a whole is not as explicit about misinformation as the second page. However, I would consider this project a success if it returned either page (given that the input text had something to do with the concept of a flat earth). This is because our underlying assumption is that Wikipedia is better at fighting misinformation (even if not perfect) than most other sources, and neither of these pages encourages a modern belief in a flat earth.
However, attempting to discover a signal that correlates with “this page is debunking a myth” is an interesting challenge that we will definitely investigate. Interestingly, the first page, Flat_Earth, uses the pseudoscience template and is in the category Obsolete scientific theories, while the second page, Myth_of_the_flat_Earth, is in the category Misconceptions. Under the assumption that Wikipedia itself would not be promoting actual pseudoscience or misconceptions, these signals could potentially be used to boost the relevance of a page. The user experience of our extension is still under design, but pages with templates and categories like these could perhaps be highlighted or given a special icon when suggestions are presented to the user.
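As a sketch of how such a boost might work (the signal names and boost factor below are illustrative assumptions, not a settled design):

```python
# Illustrative relevance boost for pages carrying debunking-related signals.
DEBUNKING_SIGNALS = {
    "template:pseudoscience",
    "category:obsolete scientific theories",
    "category:misconceptions",
}

def boosted_score(similarity: float, page_signals, boost: float = 1.2) -> float:
    """Scale the raw similarity score if the page carries any signal
    suggesting it debunks (rather than promotes) a claim."""
    signals = {s.lower() for s in page_signals}
    return similarity * boost if signals & DEBUNKING_SIGNALS else similarity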
Challenges
Regarding the challenges of this project and the potential for backfiring: there are several areas of concern, but I think they can be addressed via the UI of the extension and an open and honest hosting page.
- Suggestions come from a subset of Wikipedia: We will have to make sure that users understand that we are not providing a live search of all of English Wikipedia. We would not want to suggest that certain pages don’t exist just because we did not return them as suggestions. This information will be prominently displayed on the landing/hosting page and could be displayed with every set of suggestions (e.g. “these suggestions are drawn from the featured and good article subset of Wikipedia pages”).
- Suggestions come from monthly static dumps: We will have to make sure that users understand that models are trained on Wikimedia dumps that are produced once a month. The monthly dumps are versioned in the sense that they have a fixed timestamp. We can design the UI such that every suggestion page has a prominent display of the dump used for the model (e.g. “Suggestions produced from the 20200401 dump”). We can also use the UI in the same way to prompt users to update their model (e.g. “These suggestions come from the 20200101 model, but a newer 20200401 model is available; click here to download”). A sketch of this version check follows this list.
- Bad actors: Bad actors could potentially use a tool like this to target Wikipedia pages that are relevant to misinformation. This is true, but some version of this statement is always true, and the reverse statement also holds: good actors could potentially use a tool like this to find and improve Wikipedia pages relevant to misinformation. On our landing page we could invite users to help improve suggested pages by following the guidelines in Wikipedia:WikiProject_Reliability.
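Here is the version-check sketch referenced in the static-dumps bullet above. The model index endpoint is hypothetical; the real hosting scheme is still undecided:

```python
# Sketch: prompt the user when a model built from a newer dump exists.
from typing import Optional
import requests

MODEL_INDEX_URL = "https://example.org/models/latest.json"  # hypothetical endpoint

def check_for_newer_model(installed_dump: str) -> Optional[str]:
    """Return the newer dump timestamp if one is available, else None.

    Dump timestamps like '20200401' compare correctly as strings.
    """
    latest = requests.get(MODEL_INDEX_URL, timeout=10).json()["dump"]
    return latest if latest > installed_dump else None

# e.g. check_for_newer_model("20200101") -> "20200401"
```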
Please let us know if this helps clarify our project.
As a final note, we've recruited a new front end member to the team who has expertise in browser extensions (Stephen Corwin) and updated the team sections in the application.
GabrielKensho (talk) 17:45, 9 April 2020 (UTC)
Additional information requested x2
Thanks Gabriel! Just a few more:
- would it be possible to do this in Firefox as opposed to Chrome?
- can you tell us what kinds of information either extension might store about the users?
- what kinds of permissions will you be requiring?
- what license are you seeking on this extension?
Thanks again, have a great weekend. --Connie (talk) 19:18, 9 April 2020 (UTC)
Extension Concerns
Thanks for the questions Connie! You are definitely helping us think through some of the details in a structured way.
Our current understanding is that the extension will be easiest to develop first for Chrome and then port to Firefox (whose WebExtensions API is largely compatible with Chrome extensions) and Edge (which uses the Chromium engine). Developing for Firefox first would incur a fairly significant development burden if we want to support multiple browsers (and we do).
To be clear, we are explicitly not interested in reading or storing user information. We do not want our model to depend on any aspect of a user’s profile at all. We would like our model to return the same results for every user given identical input text. However, model queries at different times may return different results. For example, a page may be suggested by the on-disk model, but the extension then checks the live Wikipedia API to see whether the page still exists and removes it from the suggestions if it has been deleted.
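A minimal sketch of that live-existence check, using the standard MediaWiki query API (the function names are ours for illustration):

```python
# Sketch: drop suggested pages that no longer exist on live Wikipedia.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def page_exists(title: str) -> bool:
    """Return False if the page has been deleted (or never existed)."""
    params = {"action": "query", "titles": title, "format": "json"}
    pages = requests.get(API_URL, params=params, timeout=10).json()["query"]["pages"]
    # The API marks nonexistent titles with a "missing" key (page id -1).
    return not any("missing" in page for page in pages.values())

def filter_suggestions(titles):
    """Keep only suggestions that still exist on live Wikipedia."""
    return [t for t in titles if page_exists(t)]
```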
We will host the source code of the extension, the Wikimedia parsing code, and the model training code in public GitHub repos and put them under open licenses (we have used Apache in the past: https://github.com/kensho-technologies/qwikidata).
The most significant permission we will need for the browser extension is sandboxed local storage on the user’s machine. This is unavoidable if we want the model to run locally. We will still host the models online so that the extension can download updates, but the models will need to be persisted on the user's hard drive between browser sessions. For storage above 5 MB, extensions need to request the unfortunately named “unlimitedStorage” permission (https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/storage/local); this is true in both Firefox and Chrome.
For other permissions, we will choose the most restrictive set possible, but it is hard to determine the exact set before development begins. For example, we will need permission to grab highlighted text, open new tabs or display popups to present results, and hit Wikimedia APIs to determine whether a page suggested by the model still exists and/or whether it has new templates or categories.
We understand that some users will not be comfortable installing an extension that has access to a sandboxed portion of their hard drive (or any extension at all). To mitigate this, we can offer a detailed explanation of what the extension does on the extension store pages and commit to never incorporating user data into the model.
Thanks again for the questions! GabrielKensho (talk) 23:49, 9 April 2020 (UTC)