2019/Grants/Covid19Relay: a graph database of coronavirus research and concepts
James Hare, Scatter Inc.
Type of project:
What is your idea?
Covid19Relay is a MediaWiki/Wikibase site dedicated to sharing any and all information about the coronavirus pandemic, accepting contributions of longform text, linked data, and multimedia. It is operated as a public service. Users are authenticated through their Wikimedia login, a system that is both secure and private.
I would like to use this wiki to create a graph database of COVID-19 research, based on the COVID-19 Open Research Dataset Challenge (CORD-19) from the Allen Institute for AI and the Chan Zuckerberg Initiative. The idea is to create Wikibase entities for each journal article, venue, etc. described in the dataset's metadata; an existing Wikidata item for a journal article in the dataset serves as an example of this modeling. This is essentially an extension of my work with WikiCite, where I worked with colleagues around the world to describe millions of journal articles in Wikidata, including over 150 million citation relationships. I would like to do a specialized version of that project for COVID-19 research, hosted on a dedicated Wikibase with a dedicated query service, relieving some of the load on the primary Wikidata Query Service, which has been overloaded for some time.
To accomplish this, I am hiring a contract engineer to move the wiki to a scalable Kubernetes cluster. From there, I can work on bringing my old bot back, adjusting it so that it can work on multiple wikis. Then, I will get to work migrating the dataset to the Wikibase via bot. I am also currently negotiating an in-kind contribution of software that will make it easier for items on the Wikibase to link back to their corresponding Wikidata items where they exist.
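As a rough sketch of the migration step: the bot would map each row of the dataset's metadata to a `wbeditentity` payload for the Wikibase API. The property IDs (P1–P3) and field names below are hypothetical placeholders, not the wiki's actual data model, which will be defined as the project is set up:

```python
import json

# Hypothetical property IDs on the Covid19Relay Wikibase (placeholders):
PROP_TITLE = "P1"   # assumed: article title (monolingual text)
PROP_DOI = "P2"     # assumed: DOI (string)
PROP_VENUE = "P3"   # assumed: published in (item-valued)


def article_to_entity(meta, venue_qid):
    """Map one metadata row to a `data` payload for action=wbeditentity."""
    return {
        "labels": {"en": {"language": "en", "value": meta["title"]}},
        "claims": {
            PROP_TITLE: [{
                "mainsnak": {
                    "snaktype": "value",
                    "property": PROP_TITLE,
                    "datavalue": {
                        "value": {"text": meta["title"], "language": "en"},
                        "type": "monolingualtext",
                    },
                },
                "type": "statement", "rank": "normal",
            }],
            PROP_DOI: [{
                "mainsnak": {
                    "snaktype": "value",
                    "property": PROP_DOI,
                    "datavalue": {"value": meta["doi"], "type": "string"},
                },
                "type": "statement", "rank": "normal",
            }],
            PROP_VENUE: [{
                "mainsnak": {
                    "snaktype": "value",
                    "property": PROP_VENUE,
                    "datavalue": {
                        "value": {"entity-type": "item", "id": venue_qid},
                        "type": "wikibase-entityid",
                    },
                },
                "type": "statement", "rank": "normal",
            }],
        },
    }


# Example row (invented data, for illustration only):
row = {"title": "A SARS-CoV-2 spike protein study", "doi": "10.1000/example"}
payload = article_to_entity(row, venue_qid="Q42")
data_param = json.dumps(payload)  # would be POSTed as the `data` parameter
```

The bot would then POST `data_param` to the wiki's Action API with `action=wbeditentity`, an edit token, and appropriate rate limiting.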
With this dataset migrated to Covid19Relay, members of the public will be able to use, query, and annotate it, authenticating through Wikimedia single-user login. Tools and reports could be built on top of this data service, and comparisons could be made to other datasets, including the citations that appear in Wikipedia articles.
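To illustrate what querying the dedicated service might look like, here is a minimal sketch that builds a SPARQL request URL. The endpoint URL, prefixes, and property IDs are all assumptions for illustration, not the deployed configuration:

```python
from urllib.parse import urlencode

# Assumed endpoint; the real query service URL will be set at deployment.
ENDPOINT = "https://query.covid19relay.org/sparql"

# Hypothetical query: articles published in some journal item (Q42),
# assuming wd:/wdt: prefixes are configured as on Wikidata.
query = """
SELECT ?article ?title WHERE {
  ?article wdt:P3 wd:Q42 .   # published in a given venue (placeholder IDs)
  ?article wdt:P1 ?title .   # article title (placeholder ID)
}
LIMIT 10
""".strip()

# A GET request to this URL would return JSON-formatted results.
url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```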
Why is it important?
When there is a global pandemic, the quality of information can be a life or death matter. Wikidata and the Wikibase software stack have scaled to make over one billion statements about over 80 million different things, including 22 million publications (many of which have been cited on Wikipedia). Yet there is a scaling issue: putting all the world's data in one wiki can overwhelm that one wiki. Indeed, we have run into performance issues trying to scale bulk contributions of source metadata to Wikidata.
Covid19Relay presents a timely opportunity to scale this project by introducing additional Wikibases and query services. The more we learn from this experience, the better equipped we will be to carry out similar projects in similar subject areas. Eventually, we should be able to describe every source cited in Wikipedia.
This project could also directly support other projects funded through this grant, including "Credibility of information regarding COVID-19".
Is your project already in progress?
The wiki is running at covid19relay.org in a fully containerized setup on Docker Compose. Work is underway on sourcing a contract engineer, and I am also negotiating additional in-kind support from a software company. Additionally, much of this builds on years of work that has already been done.
How is it relevant to credibility and Wikipedia? (max 500 words)
This is an evolution of an earlier project related to credibility and Wikipedia: WikiCite.
Citations represent the chain of provenance of information. A Wikipedia article is trustworthy because it cites its sources. The tools and datasets that have come from WikiCite allow us to do a deeper dive into what is cited on Wikipedia.
What this specific project will allow is an analysis of the sources that appear in Wikipedia's COVID-19 coverage, including which citations point to reliable research and which do not. At the very least, it will represent progress on that front.
What is the ultimate impact of this project?
If successful, Covid19Relay will help lead the way in the development of a decentralized Wikibase "federation," which is necessary if we want to eventually describe every source cited on Wikipedia.
Could it scale?
This project is designed to scale through the use of highly scalable software, and through open standards and protocols that allow multiple data sources to pool into one data repository (using tools such as Hadoop).
Why are you the people to do it?
My past and present work focuses on Wikibase and source metadata:
- As part of the Librarybase project, funded by the Wikimedia Foundation, I wrote a report describing an ideal data model mapping Wikipedia citation events to sources described in Wikidata and in a separate Wikibase (at the time Librarybase, which was subsequently discontinued). On Wikidata in particular, I developed automated workflows that documented over 150 million citation relationships between journal articles, building increasingly scalable processes along the way.
- As a former consultant for the U.S. National Institute for Occupational Safety and Health, an operating division of the CDC, I wrote code which migrated over 50,000 bibliographic records to Wikidata.
- I have been a part of the Wikimedia community for over 15 years, and have deep knowledge of its communities and practices. As a part of the Wikimedia community I have managed volunteers and paid staff and have led software development efforts. My code helps the community do their work more efficiently.
What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?
The tools that will come out of Covid19Relay will allow us to do a deep dive on the kinds of sources included in Wikipedia articles. This allows us, for instance, to figure out if articles use a disproportionately high number of sources from a highly privileged, well represented part of the world. Additionally, a wiki like Covid19Relay could be used to seed a database of underrepresented sources, which could help fuel the development of new Wikipedia and Wikidata content.
As for the software infrastructure itself, if we make it trivially easy to scale up and deploy MediaWiki instances, this will give us an opportunity to create new content platforms outside of the strict requirements of the Wikimedia Foundation production environment.
What are the challenges associated with this project and how you will overcome them?
We are pushing the Wikibase software and the Blazegraph-based query engine to their limits by introducing ever more complex and interlinked data. The introduction of a separate query service helps alleviate some of this burden, and I am in close contact with Wikibase developers.
Also, there are not many high-profile Wikibases outside of Wikidata. Our experience with this Wikibase will help us figure out best practices for subsidiary/satellite Wikibases going forward.
How much money are you requesting?
How will you spend the money?
The money will be spent on a contract engineer to move the wiki to Kubernetes, on my own time to develop scripts and workflows to migrate the datasets, and estimated costs for six months of hosting (based on costs for a similar project).
How long will your project take?
The wiki migration should take 2-4 weeks. It should take a comparable amount of time to work on the migration scripts. The bulk editing itself might take a month or so.
Have you worked on projects for previous grants before?