Submissions:2016/Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes
- Title
- Drug and chemical compound items in Wikidata as a data source for Wikipedia infoboxes
- Theme
- tech
- Type of submission
- presentation
- Author
- Sebastian Burgstaller-Muehlbacher
- E-mail address
- sburgs@scripps.edu
- Username
- Sebotic
- Affiliation
- The Scripps Research Institute
- Abstract
The English Wikipedia has articles on thousands of chemical compounds. All the data for these compounds are stored in wiki markup as parameters to infoboxes, which makes them hard to reuse in a machine-readable way. Furthermore, with the emergence of Wikidata, these data are now stored at least twice: once as infobox parameters in Wikipedia and once as structured data in Wikidata. In practice they may be stored up to hundreds of times, because each Wikipedia language edition has its own articles on chemical compounds, even though basic chemical data such as structure and formula are the same in every language.
My talk will focus on our efforts to improve the quality of chemical data in Wikidata and on how these data can then be used in the English Wikipedia (and, in principle, in any Wikipedia). First, we checked the structural representations of chemical compounds for consistency: SMILES (Simplified Molecular-Input Line-Entry System), InChI (International Chemical Identifier) and InChIKey (a short, hashed representation of the InChI). This covered the ~17,000 chemical compound items in Wikidata that link to the English Wikipedia and the ~7,000 items that do not (~24,000 Wikidata items in total). To follow Wikidata and Wikipedia referencing standards and to increase the reliability of the data, we added references to authoritative resources for chemical compound data, such as PubChem and DrugBank.
After the data cleanup step, we generated draft infobox templates that retrieve chemical compound data from Wikidata and display them in the chemical compound infoboxes on Wikipedia. Both the retrieval of the data from Wikidata and their rendering as infobox code are implemented as Lua modules using Scribunto.
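The consistency check on structural identifiers described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the code used in the project. The `CompoundItem` record, the function names and the placeholder item IDs are hypothetical. The sketch also assumes that the chemical formula has already been normalized to plain ASCII Hill notation, and it only checks the InChIKey layout and the formula layer of the InChI. Recomputing an InChIKey from an InChI would require a cheminformatics library.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Standard InChIKey layout: blocks of 14, 10 and 1 uppercase letters.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


@dataclass
class CompoundItem:
    """Hypothetical flat view of one chemical compound item."""
    qid: str        # placeholder item ID, not a real Wikidata ID
    formula: str    # assumed to be ASCII Hill notation, e.g. "C2H6O"
    inchi: str
    inchikey: str


def inchi_formula(inchi: str) -> Optional[str]:
    """Return the formula layer of an InChI string, or None if malformed."""
    parts = inchi.split("/")
    if len(parts) < 2 or parts[0] not in ("InChI=1S", "InChI=1"):
        return None
    return parts[1]


def check_item(item: CompoundItem) -> List[str]:
    """Return a list of human-readable problems found for one item."""
    problems = []
    if not INCHIKEY_RE.match(item.inchikey):
        problems.append("malformed InChIKey")
    layer = inchi_formula(item.inchi)
    if layer is None:
        problems.append("malformed InChI")
    elif layer != item.formula:
        problems.append(
            f"formula {item.formula!r} differs from InChI formula layer {layer!r}"
        )
    return problems


if __name__ == "__main__":
    ethanol = CompoundItem(
        qid="QX1",  # placeholder
        formula="C2H6O",
        inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        inchikey="LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    )
    print(ethanol.qid, check_item(ethanol) or "consistent")
```

A bot could run a check of this kind over all compound items and queue the flagged ones for review against a primary source such as PubChem.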
Most importantly, the data in Wikidata, combined with the infobox Lua code, can serve as a centralized data repository and retrieval procedure for all Wikipedias. This avoids duplicating the data and allows Wikidata bots to maintain high data quality: the bots check the chemical data for consistency and keep them in sync with the open primary repositories mentioned above. In summary, my talk will show how the large body of chemical data in Wikidata is being made accessible and reusable from the Wikipedias, and how these data have been made reliable enough that they could even be used for scientific purposes.
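The idea of one central data set serving infoboxes in several languages can be sketched as follows. The actual modules are written in Lua with Scribunto. This Python version is only an illustration, and the function name, the label table and the output format are assumptions made for the example. The property IDs are the Wikidata properties for chemical formula (P274), canonical SMILES (P233), InChIKey (P235) and PubChem CID (P662).

```python
from typing import Dict, List, Tuple

# Order in which properties appear in the sketched infobox.
PROPERTY_ORDER = ["P274", "P233", "P235", "P662"]

# Hypothetical, hand-written label table; a real module could instead
# read property labels from Wikidata in the wiki's language.
LABELS: Dict[str, Dict[str, str]] = {
    "en": {
        "P274": "Formula",
        "P233": "SMILES",
        "P235": "InChIKey",
        "P662": "PubChem CID",
    },
    "de": {
        "P274": "Summenformel",
        "P233": "SMILES",
        "P235": "InChIKey",
        "P662": "PubChem-CID",
    },
}


def infobox_rows(claims: Dict[str, str], lang: str) -> List[Tuple[str, str]]:
    """Build (label, value) rows for one language from shared claims.

    Properties missing from `claims` are skipped, so an incomplete item
    still yields a smaller but valid infobox.
    """
    labels = LABELS[lang]
    return [(labels[p], claims[p]) for p in PROPERTY_ORDER if p in claims]


if __name__ == "__main__":
    # One shared set of statements, here for ethanol.
    claims = {"P274": "C2H6O", "P233": "CCO", "P662": "702"}
    for lang in ("en", "de"):
        print(lang)
        for label, value in infobox_rows(claims, lang):
            print(f"  {label}: {value}")
```

Only the labels differ between languages. The values come from the same statements, so a correction made once in Wikidata reaches every infobox built this way.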
- Length of presentation
- 30 min
- Special schedule requests
- Preferred room size
- 50
- Will you attend WikiConference North America if your submission is not accepted?
- Yes
Interested attendees
If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).
- Add your username here.