Submissions:2016/Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes

From WikiConference North America
Revision as of 21:01, 1 September 2016 by FloNight (talk | contribs) (add categories)
Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes

Academic Peer Review option
Type of submission
Sebastian Burgstaller-Muehlbacher
E-mail address
The Scripps Research Institute

The English Wikipedia has articles on thousands of chemical compounds. All the data for these compounds are stored in wiki markup as parameters to infoboxes, which makes reusing them in a machine-readable way challenging. Furthermore, with the emergence of Wikidata, these data are essentially stored twice: once as infobox parameters in the Wikipedias and once as structured data in Wikidata. They may even be stored hundreds of times, as each Wikipedia language project has articles on chemical compounds in its own language, although basic chemical data such as structure and formula always stay the same.
My talk will focus on our efforts to improve chemical data quality in Wikidata and on how these data can then be used in the English Wikipedia (and, in principle, in any Wikipedia). First, we improved the chemical data by checking structural representations, namely SMILES (Simplified Molecular-Input Line-Entry System), InChI (International Chemical Identifier) and InChIKey (a short, hashed representation of the InChI), for consistency. We did this for the ~17,000 chemical compound items that link to the English Wikipedia and for the ~7,000 chemical compound items in Wikidata without such a link (~24,000 Wikidata items in total). To follow Wikidata and Wikipedia referencing standards and to increase the reliability of the data, we added references to authoritative resources for chemical compound data, such as PubChem and DrugBank.
After the data cleanup step, we generated draft infobox templates that retrieve the chemical compound data from Wikidata and display them in the chemical compound infoboxes on Wikipedia. Retrieval from Wikidata and rendering as infobox code are implemented as Lua modules with Scribunto.
Most importantly, the data in Wikidata, combined with the infobox Lua code, can serve as a centralized data repository and retrieval procedure for all Wikipedias, avoiding duplication of data and allowing high data quality to be maintained by Wikidata bots. These bots ensure the consistency of the chemical data and keep them in sync with the open primary repositories mentioned above. In summary, my talk will show how the vast space of chemical data in Wikidata is being made accessible and reusable from the Wikipedias, and how the data in Wikidata have been made reliable enough that they could even be used for scientific purposes.
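
To make the consistency check on structural identifiers more concrete, the following is a minimal sketch in Python. It is not the code used in the project. The dictionary keys (`inchi`, `inchi_key`, `formula`) and the function names are hypothetical, chosen only for illustration. The sketch checks only two things: that the InChIKey has the standard 27-character layout, and that the formula layer of the InChI matches the separately stored chemical formula. Verifying that an InChIKey really is the hash of its InChI, or that a SMILES string encodes the same structure, would need a cheminformatics toolkit and is not shown here.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHI_KEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Stored formulas may use Unicode subscript digits; map them to ASCII digits.
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def inchi_formula(inchi):
    """Return the first layer after the InChI prefix, or None if malformed.

    Simplification: for ordinary compounds this layer is the molecular
    formula; unusual InChIs without a formula layer are not handled.
    """
    if not inchi.startswith("InChI=1"):
        return None
    parts = inchi.split("/")
    return parts[1] if len(parts) > 1 else None


def check_item(item):
    """Return a list of problems found in one (hypothetical) compound record."""
    problems = []
    formula = item.get("formula", "").translate(SUBSCRIPTS)

    if not INCHI_KEY_RE.match(item.get("inchi_key", "")):
        problems.append("malformed InChIKey")

    layer = inchi_formula(item.get("inchi", ""))
    if layer is None:
        problems.append("malformed InChI")
    elif formula and layer != formula:
        # Assumes both formulas list the elements in the same order.
        problems.append("formula does not match InChI formula layer")

    return problems


# Example record for aspirin (acetylsalicylic acid).
aspirin = {
    "inchi": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    "inchi_key": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "formula": "C₉H₈O₄",
}
print(check_item(aspirin))
```

A bot could run such a check over every compound item and flag those with a non-empty problem list for manual review. The comparison is a plain string match, so a stored formula that lists its elements in a different order than the InChI would be flagged even though it is chemically identical.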

Length of presentation
30 min
Special schedule requests
Preferred room size
Will you attend WikiConference North America if your submission is not accepted?

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).

  1. Add your username here.