2019/Grants/A bot to add reference support to Wikidata statements

From WikiConference North America


A bot to add reference support to Wikidata statements


Houcemeddine Turki

Wikimedia username:


E-mail address:



Born on May 24, 1994, Houcemeddine Turki is a long-term Wikimedian and a medical student at the University of Sfax, Tunisia. He is also a published researcher in Computational Linguistics, Scientometrics and Biomedical Informatics.

As a Wikimedian, I began contributing to Wikipedia in 2009 by writing non-stub articles and improving reference support in the French and English Wikipedias. Thanks to this effort, I was a semi-finalist in the 2015 WikiCup, and my work has been featured on the main page of English Wikipedia several times. I was also among the founding members of Wikimedia TN User Group and launched several initiatives and projects to enhance the coverage of Tunisia-related topics linked to science, cultural heritage, and sociolinguistics. I was a coordinator of The Wikipedia Library for two years, where I worked to provide Wikimedia communities with full access to paywalled scholarly resources. In 2017, I shifted my interest to Wikidata, and I retired from Wikipedia in 2019. I was aware that Wikidata can be useful for a variety of real-life applications, including health, industry, science, and linguistics. That is why, since 2018, I have worked on developing applications of Wikidata in medicine, and I have contributed to Wikidata by revising biomedical knowledge and adding new types of entities, particularly in the context of the COVID-19 pandemic.

Currently, I am vice-chair of Wikimedia TN User Group, a board member of Wikimedia and Libraries User Group, and an active member of Wikimedia Medicine. Over the last few months, I have also been working to co-found the first Wikimedia research unit in Tunisia under the supervision of the University of Sfax. This research unit is called "Data Engineering and Semantics" and will include all the research scientists of the University of Sfax working on Wikimedia-related research topics.

Geographical impact:


Type of project:


What is your idea?

My idea is to create a bot that processes news feeds and open-source search engines to find references for unsupported statements in Wikidata.

Why is it important?

Wikidata statements that are not supported by references are not trustworthy enough to be relied upon. Adding accessible reference URLs to them will make it possible to verify the accuracy of Wikidata statements and consequently enhance the quality of the Wikidata database.

Is your project already in progress?

I have already developed Python code to retrieve references for biomedical statements from PubMed Central. The principle of the algorithm is explained in https://www.jclinepi.com/article/S0895-4356(17)31073-9/abstract and in https://linkinghub.elsevier.com/retrieve/pii/S1532-0464(18)30094-7

In the context of medicine, the bot proceeds as follows:

  • Extracting unreferenced Wikidata statements: A SPARQL query is formulated for this purpose (https://w.wiki/S79)
  • Searching for the co-occurrence of the triple in PubMed Central publications (using Biopython)
  • Finding the sentence where the statement is confirmed from the search results of PubMed Central (using Biopython)
  • Aligning the PMC ID of each reference with its Wikidata ID (using Wikidata Hub)
  • Adding the obtained references to Wikidata (using QuickStatements API)
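
The steps above can be sketched in Python as follows. This is a minimal illustration and not the actual bot code: the function names are hypothetical, the SPARQL pattern is an assumption about what a query for unreferenced statements could look like (it is not the content of the short link above), and the sentence matching is a simple case-insensitive co-occurrence check. The PubMed Central search relies on Biopython's `Entrez.esearch`, imported lazily so that the other helpers run without Biopython or network access. The Wikidata Hub alignment step is omitted.

```python
import re


def build_unreferenced_query(prop: str, limit: int = 100) -> str:
    """Build a SPARQL query for statements of `prop` that carry no reference.

    Assumed pattern: a statement node without prov:wasDerivedFrom is unreferenced.
    """
    return (
        "SELECT ?item ?value WHERE {\n"
        f"  ?item p:{prop} ?statement .\n"
        f"  ?statement ps:{prop} ?value .\n"
        "  FILTER NOT EXISTS { ?statement prov:wasDerivedFrom ?ref . }\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_pmc_query(subject_label: str, object_label: str) -> str:
    """Search term requiring both labels of the triple to co-occur."""
    return f'"{subject_label}" AND "{object_label}"'


def search_pmc(query: str, email: str, retmax: int = 20) -> list:
    """Return PubMed Central IDs matching `query` (needs Biopython and network)."""
    from Bio import Entrez  # lazy import: only required for this step

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pmc", term=query, retmax=retmax)
    try:
        return list(Entrez.read(handle)["IdList"])
    finally:
        handle.close()


def find_supporting_sentence(text: str, subject_label: str, object_label: str):
    """Return the first sentence mentioning both labels, or None."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    subject, obj = subject_label.lower(), object_label.lower()
    for sentence in sentences:
        lowered = sentence.lower()
        if subject in lowered and obj in lowered:
            return sentence
    return None


def quickstatements_line(subject_qid: str, prop: str, object_qid: str,
                         source_qid: str) -> str:
    """Tab-separated QuickStatements V1 command with a 'stated in' (S248) source."""
    return "\t".join([subject_qid, prop, object_qid, "S248", source_qid])


if __name__ == "__main__":
    # Placeholder IDs and text, for illustration only.
    sample = ("Aspirin is widely used. Aspirin reduces the risk of "
              "myocardial infarction in adults. Other drugs exist.")
    print(find_supporting_sentence(sample, "aspirin", "myocardial infarction"))
    print(quickstatements_line("Q111", "P2175", "Q222", "Q333"))
```

Co-occurrence in a sentence is only a heuristic: a matching sentence may mention both terms without confirming the statement, so candidate references would still need review before they are added.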

How is it relevant to credibility and Wikipedia? (max 500 words)

Finding references for Wikidata statements will improve the quality of Wikidata-based, bot-generated Wikipedia articles, particularly in the context of the COVID-19 pandemic.

What is the ultimate impact of this project?

  • Reducing the number of unsupported Wikidata statements
  • Improving the reference support for Wikipedia articles

Could it scale?

Yes. The bot can later be extended to add references to Wikipedia articles as well.

Why are you the people to do it?

What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?

This bot can reduce deletion rates in Wikipedia and Wikidata. New editors will be encouraged to contribute more to Wikipedia and Wikidata when they see their work fixed rather than deleted.

What are the challenges associated with this project and how will you overcome them?

  • Internet connectivity: We will use a high-speed (4G) internet connection.
  • Legal concerns: We will use openly licensed tools and materials.
  • Large-scale data processing: We will buy a high-performance desktop computer.

How much money are you requesting?

7500 TND (2 579.09 USD)

How will you spend the money?

  • 500 TND (171.94 USD) will be used to purchase a high-speed internet connection for the project.
  • I will work on the project for 200 hours at 35 TND/h: 35 TND/h × 200 h = 7000 TND (2 407.15 USD)

How long will your project take?

6 months

  • Bot request: Months 1 and 2 (Deliverable: Bot request on Wikidata)
  • Bot development: Months 2, 3, and 4 (Deliverable: Python code on GitHub)
  • Data extraction and enrichment of Wikidata statements with references: Months 4, 5, and 6 (Deliverable: XTools Wikidata edit statistics for the bot)

Have you worked on projects for previous grants before?