Talk:2019/Grants/Sourceror: The Wikipedia community's platform against disinformation
Additional Information Requested
Hi Newslinger, I really appreciate this application because, if it is successful, our project would probably like to use it. Our NewsQ.net project has actually scraped and created a first version of what you propose here (https://newsq.net/newsq-signals/ -- search for Wikipedia; the status is about to be updated to 'Collected'), but we'd much prefer using a Wikipedia-generated API, because building this ourselves is tough! See below:
Based on our experience, here are a few questions:
- One of the challenges we had in ingesting this data is that sources are not always associated with websites, and a rating might apply only to a subsection of a source -- examples of this are Bloomberg and Fox News. Can you explain how you would handle this? (For example, are there objects in Wikidata that you would be tying to here?)
- Another challenge is Newsweek -- how might you represent the staged evaluations here in JSON?
- Blacklist -- is this worth *not* including, instead encouraging people to use whatever is latest and greatest from the spam lists? If the blacklist status is being maintained by hand on the perennial sources list, we thought it worth not including, in case of sync issues.
Thanks much for your thoughts. --Connie (talk) 19:05, 9 April 2020 (UTC)
Response
Hi Connie, I've taken a look at NewsQ and I think the initiative is very exciting! The amount of data you are collecting is incredible, and I love how you are working with the W3C to turn these signals into standards. I would be happy to cooperate with NewsQ to ensure that the Sourceror API becomes a suitable data provider for the "<Wikipedia Perennial Sources Reliability Guidance>" and "<Wikipedia Spam Blacklist Flag>" NewsQ signals.
One of the challenges we had in ingesting this data is that sources are not always associated with websites, and a rating might apply only to a subsection of a source -- examples of this are Bloomberg and Fox News. Can you explain how you would handle this? (For example, are there objects in Wikidata that you would be tying to here?)
Some domains, including the ones for Bloomberg and Fox News, are associated with more than one entry on the perennial sources list.
To handle these cases as well as possible, the Sourceror API would store a regular expression for each entry on the list. Any entry with a regular expression that matches a given URL is considered a relevant match for that URL. This handles cases like Bloomberg, which has an entry for its company/executive profiles and another entry for its publications. Here's what the expressions for the Bloomberg entries would look like, with the caveat that they are not yet fully tested:
- Bloomberg: \bbloomberg\.com\b(?!\/profile\b)
- Bloomberg profiles: \bbloomberg\.com\b\/profile\b
URLs for Bloomberg articles (such as this one) would only match the first entry, while URLs for Bloomberg profiles (such as this one) would only match the second entry.
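To make this concrete, here is a rough sketch of what an API response for a Bloomberg article URL might look like. This is not a final schema -- the url, matches, id, source, and pattern fields and the example URL are illustrative, and the classification shown is a placeholder (the live perennial sources list is authoritative):

{
  "url": "https://www.bloomberg.com/news/articles/example",
  "matches": [
    {
      "id": 1,
      "source": "Bloomberg",
      "status": "generally reliable",
      "pattern": "\\bbloomberg\\.com\\b(?!\\/profile\\b)",
      "blacklisted": false
    }
  ]
}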
Fox News is an interesting case: the vast majority of its online content corresponds to the "news and website" entry, while the separate "talk shows" entry was created as a compromise with editors who took exception to Fox News talk shows on cable TV (after a brief edit war broke out last year). Almost all citations of Fox News on Wikipedia are to its news content, not to videos of its talk shows.
Here's what the regular expressions for Fox News would look like, again with the caveat that they are not yet fully tested:
- Fox News (news and website): \bfoxnews\.com\b
- Fox News (talk shows): \b(video\.foxnews\.com|foxnews\.com\/shows)\b
Unfortunately, there is no way to distinguish news shows from talk shows on the video.foxnews.com subdomain. In cases like these, when it is impossible or impractical to identify the most applicable entry from the URL, the Sourceror API would return all relevant entries as an array, and the Sourceror Browser Extension would display all of these entries together. Sourceror would therefore match both Fox News entries for URLs under the video.foxnews.com subdomain.
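Under the same illustrative (not final) schema as above, a request for such a URL might return both entries in the matches array; the example URL and classifications are placeholders:

{
  "url": "https://video.foxnews.com/v/0000000000",
  "matches": [
    {
      "id": 2,
      "source": "Fox News (news and website)",
      "status": "generally reliable",
      "blacklisted": false
    },
    {
      "id": 3,
      "source": "Fox News (talk shows)",
      "status": "no consensus, unclear, or additional considerations apply",
      "blacklisted": false
    }
  ]
}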
I see that the "<Wikipedia Perennial Sources Reliability Guidance>" NewsQ signal is of a "categorical" data type. If that means the value needs to be an enumerated type, there are three ways we could handle multiple entries with different reliability classifications:
- The signal could fall back to "no consensus, unclear, or additional considerations apply".
- The signal could be set to a value of "mixed", which would indicate that more than one classification could apply to the URL.
- The set of possible values could include combinations of all possible reliability ranges of a URL. For example, nc-gr would indicate that a video.foxnews.com URL could be either "no consensus..." or "generally reliable", and gu-gr would indicate that a site could be anywhere from "generally unreliable" to "generally reliable".
My preference is option #3, which provides the most detailed data.
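As an illustrative sketch of option #3, the enumerated values could map to reliability ranges like this (the codes other than nc-gr and gu-gr, and the wording, are placeholders, not a final vocabulary):

{
  "gr": "generally reliable",
  "nc": "no consensus, unclear, or additional considerations apply",
  "gu": "generally unreliable",
  "nc-gr": "ranges from no consensus to generally reliable",
  "gu-nc": "ranges from generally unreliable to no consensus",
  "gu-gr": "ranges from generally unreliable to generally reliable"
}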
These regular expressions would be maintained by the Wikipedia community on a subpage of the perennial sources list, and scraped regularly by the Sourceror Bot. List entries are tracked by unique IDs. I decided against using Wikidata as a repository for these expressions, because I felt it would be more practical to keep them centralized in one place, which also helps keep vandalism under control.
For more complicated cases that are not distinguishable by URL, such as Forbes staff articles vs. Forbes.com contributor pieces, the Wikipedia community will eventually maintain a database of website authors and their roles (whether they are staff writers or non-staff contributors). Contributor platforms like the one used by Forbes are a long-term problem on Wikipedia. It will take some time for the community to build a database, but I intend to integrate it into Sourceror as soon as it reaches a usable state.
Another challenge is Newsweek -- how might you represent the staged evaluations here in JSON?
Unfortunately, Newsweek does not indicate the year of publication in the URL, which means that the Sourceror API would return both the pre-2013 and the 2013–present entries as an array. The NewsQ signal value would be nc-gr, which would indicate either "no consensus..." or "generally reliable".
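Here is a rough sketch of how the staged Newsweek evaluations might be represented in JSON, with each entry carrying the date range it applies to. The appliesFrom and appliesUntil fields and the example URL are illustrative, not part of a final schema:

{
  "url": "https://www.newsweek.com/example-article",
  "matches": [
    {
      "id": 4,
      "source": "Newsweek (pre-2013)",
      "appliesFrom": null,
      "appliesUntil": 2013,
      "status": "generally reliable"
    },
    {
      "id": 5,
      "source": "Newsweek (2013–present)",
      "appliesFrom": 2013,
      "appliesUntil": null,
      "status": "no consensus, unclear, or additional considerations apply"
    }
  ]
}

A consumer that knows the article's publication date could then select the entry whose date range contains it.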
The Sourceror Browser Extension is able to retrieve the publication date from a Newsweek article once it is loaded in the browser. This requires maintaining a CSS or XPath selector for the site. For Newsweek, the time CSS selector or the //time XPath selector would retrieve the <time> tag in any article, whose data-timestamp attribute can be used to determine the date of the article (with the caveat that this process is not yet fully tested). For example, for this article, the Extension would retrieve the timestamp indicating a date in 2017, then determine that the 2013–present entry is the only match.
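One way the selector could be stored is as a small per-domain configuration record, roughly like this (a sketch only; the key names are illustrative):

{
  "domain": "newsweek.com",
  "dateSelector": {
    "css": "time",
    "xpath": "//time",
    "attribute": "data-timestamp"
  }
}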
It is not feasible for the Sourceror API to scrape the website and determine the date of publication server-side, as that would be too expensive to implement.
Blacklist -- is this worth *not* including, instead encouraging people to use whatever is latest and greatest from the spam lists? If the blacklist status is being maintained by hand on the perennial sources list, we thought it worth not including, in case of sync issues.
The proposed JSON schema uses two separate variables to track a source's reliability classification (status, an enumerated value) and its blacklist status (blacklisted, a boolean). I think the Sourceror API should provide the blacklisted value to ensure a one-to-one correspondence between the variable and the source's entry. This makes it easy for applications such as the Sourceror Web App to visually indicate whether an entry is blacklisted, without additional processing to determine whether any or all of the (possibly multiple) domains are on the blacklist.
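For example, a blacklisted entry might look like this under the proposed schema (the source name and classification are placeholders):

{
  "id": 6,
  "source": "Example blacklisted source",
  "status": "deprecated",
  "blacklisted": true
}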
You are right that there is a lag time between when the spam blacklist is updated and when the perennial sources list is updated. NewsQ and most applications that seek data for specific URLs would most likely be better served by fetching the spam blacklist directly. Fortunately, it is easy to disregard the blacklisted value returned by the Sourceror API. If I have time, I will also implement a GraphQL API, which would allow clients to specify the desired contents and form of the data in the Sourceror API's response.
Thank you for reading my submission and response, Connie! I hope this answers your questions. If you have any additional questions, please don't hesitate to ask. — Newslinger talk 08:32, 10 April 2020 (UTC)