Talk:2019/Grants/Adding Context to Online Stories by Suggesting Related Wikipedia Articles
Additional information requested
Hi Gabriel, thanks for the submission. Could you say a little more about how this tool will present relevant topics? In particular:
- can you explain whether there is anything additional beyond a frequency check on meaningful terms resulting from an NLP analysis (is this how you determine what a relevant topic is?)
- because part of your application directly addresses misinformation, is there something particular it would provide w/r/t additional context in these cases? (consider https://en.wikipedia.org/wiki/Flat_Earth versus https://en.wikipedia.org/wiki/Myth_of_the_flat_Earth, both?)
Also, as a final question related to challenges:
- Have you given thought to this backfiring in any way?
Thanks! -Connie, 7 April 2020
Response
Thanks for the questions Connie!
Our baseline for similarity between user-highlighted text and the text of a Wikipedia page will be an explicit topic model based on TF-IDF. We selected this baseline because it is straightforward to implement, runs quickly at inference time, is small enough to fit in a browser extension, and is easy to interpret (e.g. word x contributed y to the similarity score between the user-highlighted text and Wikipedia page z). We will also explore more modern word-embedding approaches (e.g. Word2Vec, GloVe, and contextualized word embeddings such as BERT) to see if they improve on our baseline while still maintaining the benefits of TF-IDF.
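As a rough illustration, the TF-IDF baseline can be sketched in a few lines of pure Python (the tokenization and IDF smoothing below are toy choices for illustration, not our production model):

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for tokenized docs."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # smoothed IDF
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors. Each product
    w * v[t] in the dot product is one word's contribution to the
    score, which is what makes the model easy to interpret."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

At inference time, the user-highlighted text would be vectorized with the same IDF table and compared against each candidate page vector.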
However, text similarity is not the whole story, especially in a project focusing on credibility. One of the major choices to make in our modeling is which Wikipedia pages to include as potential topics to be returned to the user. The concept of a “good” Wikipedia page / topic is hard to quantify, but much of the R&D for this project will be investigating this question. Below we will go into more detail about the signals and features we plan to examine. We will divide these approaches into “community” features - those made possible by humans reading and reviewing pages - and “machine” features - those derived algorithmically from raw Wikimedia data (note that Wikidata provides additional metadata about pages that can be used for filtering).
Community Features
- Wikipedia:Content_assessment: The Wikipedia community has established an assessment system that assigns grades to some pages. While not a comprehensive solution, incorporating this signal into our model will help it return quality pages. Alternatively, we could give the user the option to include only Featured_articles and Good_articles (approximately 36,000 pages).
- Wikipedia:WikiProject_Reliability: This Wikipedia project focuses on the core policies of verifiability and no original research. Their guidelines encourage usage of templates indicating reference quality. We can parse these templates when we convert wiki text markup to plain text and incorporate them into our model.
- Wikipedia:Reliable_sources/Perennial_sources: The perennial sources list tracks commonly discussed sources (both good and bad). We can detect if pages use any sources on this list and use the source status (generally reliable, no consensus, generally unreliable, deprecated, blacklisted, …) to inform our relevancy/credibility score.
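As a sketch of how the perennial-sources signal might feed a relevancy/credibility score (the domain list and status weights below are toy placeholders, not the actual Wikipedia:Reliable_sources/Perennial_sources list):

```python
from urllib.parse import urlparse

# Hypothetical weights per source status; illustrative only.
STATUS_WEIGHT = {
    "generally reliable": 1.0,
    "no consensus": 0.5,
    "generally unreliable": 0.1,
    "deprecated": 0.0,
    "blacklisted": 0.0,
}

# Toy subset standing in for the parsed perennial sources list.
PERENNIAL = {
    "nature.com": "generally reliable",
    "dailymail.co.uk": "deprecated",
}

def source_score(reference_urls):
    """Average status weight over a page's references that appear on
    the list; domains not on the list are ignored."""
    weights = []
    for url in reference_urls:
        domain = urlparse(url).netloc
        if domain.startswith("www."):
            domain = domain[4:]
        status = PERENNIAL.get(domain)
        if status is not None:
            weights.append(STATUS_WEIGHT[status])
    return sum(weights) / len(weights) if weights else None
```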
Machine Features
- incoming link count: It’s fairly easy to create a page with many outgoing links. It is not as easy to convincingly edit many pages such that a low quality page has many incoming links. Link counting is a baseline feature of many Wikipedia quality models.
- link graph structure: Wikipedia can be viewed as a graph with pages as nodes and links as edges. Graph metrics such as centrality may be useful for identifying quality pages.
- page length: Not a direct measure of quality, but very short pages are typically poor topics. Additionally, there are modifications to TF-IDF that take document length into account (e.g. http://singhal.info/pivoted-dln.pdf).
- disambiguation/list pages: Good topic pages cover one topic while disambiguation pages focus on multiple meanings. List pages can be useful to readers, but usually don’t use language in the same way as traditional pages and so tend to confuse TF-IDF models. Some of these pages can be identified by their titles (e.g. starts with “List of” or ends with “disambiguation”) but this is not always the case (consider the disambiguation page https://en.wikipedia.org/wiki/Ram). We have found some success using Wikidata to identify pages that are P31 (instance of) Q17442446 (Wikimedia internal item).
- page views: Measured over a month or longer, page views are a good measure of popularity as opposed to quality, but they are useful for removing some noise at the very low page view end.
- categories: The Wikipedia category graph is a rich source of information. Some categories may be directly useful to us (e.g., Pseudoscience, Scientific_skepticism, Anti-vaccinationism, Conspiracy_theories) while the page-category graph as a whole may also yield interesting features.
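Several of the machine features above could combine into a single pre-indexing filter. A minimal sketch (the link threshold is arbitrary, and the P31 values are assumed to have been fetched from Wikidata already; Q17442446 is the "Wikimedia internal item" class mentioned above):

```python
# Wikidata classes we would exclude as non-topic pages.
# Q17442446 = Wikimedia internal item (per the note above);
# Q4167410  = Wikimedia disambiguation page.
INTERNAL_CLASSES = {"Q17442446", "Q4167410"}

def is_content_page(p31_values, incoming_links, min_links=5):
    """Return True if a page should be indexed as a candidate topic.

    p31_values: the page's Wikidata P31 (instance of) QIDs, pre-fetched.
    incoming_links: count of links pointing at the page.
    """
    if INTERNAL_CLASSES & set(p31_values):
        return False  # disambiguation/list/internal page
    return incoming_links >= min_links  # drop near-orphan pages
```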
Flat Earth Example
Your example comparing https://en.wikipedia.org/wiki/Flat_Earth with https://en.wikipedia.org/wiki/Myth_of_the_flat_Earth is an interesting one. There is a section in the first page titled "Modern Flat-Earthers", but the page as a whole is not as explicit about misinformation as the second page. However, I would consider this project a success if it returned either page (given that the input text had something to do with the concept of a flat earth). This is because our underlying assumption is that Wikipedia is better at fighting misinformation (even if not perfect) than most other sources, and neither of these pages encourages a modern belief in a flat earth.
However, attempting to discover a signal that correlates with "this page is debunking a myth" is an interesting challenge, and we will definitely investigate it. Interestingly, the first page, Flat_Earth, uses the pseudoscience template and is in the category Obsolete scientific theories, while the second page, Myth_of_the_flat_Earth, is in the category Misconceptions. Under the assumption that Wikipedia itself would not be promoting actual pseudoscience or misconceptions, these signals could potentially be used to boost the relevance of a page. The user experience of our extension is still under design, but perhaps pages with templates and categories similar to these could be highlighted or given a special icon when the page suggestions are presented to the user.
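If these template/category signals pan out, the boost could be as simple as the following sketch (the template and category names are just the two examples above, and the boost factor is an arbitrary placeholder):

```python
# Illustrative signal sets; a real model would learn or curate these.
DEBUNKING_TEMPLATES = {"pseudoscience"}
DEBUNKING_CATEGORIES = {"Obsolete scientific theories", "Misconceptions"}

def boost_relevance(base_score, templates, categories, factor=1.2):
    """Multiply a page's similarity score when it carries a
    myth-debunking template or category."""
    if (set(templates) & DEBUNKING_TEMPLATES
            or set(categories) & DEBUNKING_CATEGORIES):
        return base_score * factor
    return base_score
```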
Challenges
Regarding the challenges of this project and the potential for backfiring, there are several areas of concern, but I think they can be addressed via the extension's UI and an open and honest hosting page.
- Suggestions come from a subset of Wikipedia: We will have to make sure that users understand that we are not providing a live search of all of English Wikipedia. We would not want to imply that certain pages don't exist just because we did not return them as suggestions. This information will be prominently displayed on the landing/hosting page and could be shown with every set of suggestions (e.g. "These suggestions are drawn from the featured and good article subset of Wikipedia pages").
- Suggestions come from monthly static dumps: We will have to make sure that users understand that models are trained on Wikimedia dumps that are produced once a month. The monthly dumps are versioned in the sense that they have a fixed timestamp. We can design the UI such that every suggestion page has a prominent display of the dump used for the model (e.g. Suggestions produced from 20200401 model dump). We can also use the UI in the same way to prompt users to update their model (e.g. These suggestions come from the 20200101 model, but a newer 20200401 model is available, click here to download).
- Bad actors: Bad actors could potentially use a tool like this to target Wikipedia pages that are relevant to misinformation. This is true, but some version of this statement is always true, and the reverse statement also holds: good actors could use the same tool to find and improve those pages. On our landing page we could invite users to help improve suggested pages by following the guidelines in Wikipedia:WikiProject_Reliability.
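The dump-version prompt described in the second point above could be driven by a simple timestamp comparison (the message wording is illustrative, not final UI copy):

```python
def update_prompt(local_dump, latest_dump):
    """Compare YYYYMMDD dump timestamps (as strings, so lexicographic
    order matches chronological order) and return the UI message."""
    if latest_dump > local_dump:
        return (f"These suggestions come from the {local_dump} model, "
                f"but a newer {latest_dump} model is available.")
    return f"Suggestions produced from the {local_dump} dump."
```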
Please let us know if this helps clarify our project.
As a final note, we've recruited a new front end member to the team who has expertise in browser extensions (Stephen Corwin) and updated the team sections in the application.
GabrielKensho (talk) 17:45, 9 April 2020 (UTC)