2019/Grants/CiteFix: a tool to fix broken references
Title:
CiteFix: a tool to fix broken references
Name:
Jay Prakash
Krishna Chaitanya Velaga
Wikimedia username:
Jayprakash12345
KCVelaga
E-mail address:
0freerunninggmail.com
kcvelagagmail.com
Resume:
- Jay Prakash
- Krishna Chaitanya
Geographical impact:
global
Type of project:
Technology
What is your idea?
English Wikipedia has several maintenance categories that are auto-populated based on errors in referencing and citation templates. There are thousands of pages in these categories that are left unattended for a long time. While some of these are easy to fix manually, some of them aren’t. Our idea is to develop a tool named “CiteFix” which will suggest changes to users to fix the errors.
In layman's terms, the tool will scan the pages in a defined set of categories and suggest edits that fix the errors, for a user to act on. The interface will show both the preview and the edit space for the current version and for the suggested version. If the user feels that the suggested edit is fine, they will be able to fix the error with a click. They will also have the option to modify the text in the suggested edit space, in case the tool is not able to generate the suggestion properly. The user can also skip a suggestion if they are not sure whether it is right. The tool will mark all its edits with a hashtag, which can be used to track usage of the tool over time.
For this project, we will focus on the following errors:
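The accept, edit, and skip workflow described above can be sketched roughly as follows. This is only an illustration under assumptions: the `Suggestion` class, the `resolve` and `edit_summary` functions, and the `#citefix` hashtag are hypothetical names chosen for this sketch, not part of an existing codebase.

```python
from dataclasses import dataclass
from typing import Optional

TOOL_HASHTAG = "#citefix"  # hypothetical hashtag; the proposal does not fix the exact text


@dataclass
class Suggestion:
    """One proposed fix for one page (hypothetical data model)."""
    title: str
    current_text: str
    suggested_text: str
    error_type: str


def resolve(suggestion: Suggestion, action: str,
            edited_text: Optional[str] = None) -> Optional[str]:
    """Return the wikitext to save, or None when the user skips."""
    if action == "accept":
        return suggestion.suggested_text
    if action == "edit":
        if edited_text is None:
            raise ValueError("the 'edit' action needs the user's edited text")
        return edited_text
    if action == "skip":
        return None
    raise ValueError(f"unknown action: {action!r}")


def edit_summary(suggestion: Suggestion) -> str:
    """Build an edit summary that ends with the tracking hashtag."""
    return f"Fixed {suggestion.error_type} {TOOL_HASHTAG}"
```

Keeping the decision logic separate from the saving step would let the same function serve both the one-click path and the manual-edit path.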
- Cite error: There are <ref group=lower-alpha> tags or {{efn}} templates on this page, but the references will not show without a {{reflist|group=lower-alpha}} template or {{notelist}} template.
- Help page: https://w.wiki/uLe
- Example page: https://en.wikipedia.org/wiki/Education_in_India
- Fix: The tool will suggest adding {{reflist|group=lower-alpha}} at an appropriate position in the article, along with a section heading if one is not already there.
- Cite error: The named reference $1 was invoked but never defined.
- Help page: https://w.wiki/uLi
- Example page: https://en.wikipedia.org/wiki/Foundation_(cosmetics)
- Fix: This error is largely generated in two ways: a.) an editor might have removed the citation while editing the content in one of the previous revisions, unaware that the citation is named and used multiple times in the article; b.) an editor might have copied the content from another Wikipedia article in the same category. This happens especially in sports articles, when tables are listed. The tool will scan through the page's earlier revisions and through the pages (at depth 1) in the categories of the page in question. If it finds a reference defined with the required name, it will suggest adding the full reference to the current version of the article.
- Cite error: The named reference "sofield" was defined multiple times with different content (see the help page).
- Help page: https://w.wiki/uMX
- Example: https://en.wikipedia.org/wiki/Interstate_95
- Fix: The tool will suggest one of two options depending on the case. If the full reference text is the same in both (or all) definitions, the suggestion will be to remove the repeated full text and insert the invoke syntax instead. If the full reference texts are different, the suggestion will be to rename one or more of the references in question.
- CS1 error: bad url
- Help page: https://en.wikipedia.org/wiki/Help:CS1_errors#bad_url
- Fix: Though a “bad url” error can have a range of causes, we will focus on fixing the http:// or https:// prefix of the URL, which will resolve a good number of cases. For example, the error in this citation (https://en.wikipedia.org/wiki/2020_Indiana_Hoosiers_football_team#cite_note-48) is caused by using “Bryantawards.org” instead of “https://bryantawards.org/”.
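Three of the checks listed above could be prototyped along the following lines. This is a minimal sketch, not the tool's actual detection logic: the function names are hypothetical, the regular expressions cover only the most common spellings of the tags and templates, and defaulting to `https` assumes the target site serves HTTPS.

```python
import re

# Uses of the lower-alpha group: <ref group=lower-alpha ...> or {{efn|...}}.
LOWER_ALPHA_USE = re.compile(
    r"<ref[^>]*\bgroup\s*=\s*[\"']?lower-alpha[\"']?[^>]*>|\{\{\s*efn\s*[|}]",
    re.IGNORECASE,
)
# Templates that display that group: {{notelist}} or {{reflist|group=lower-alpha}}.
LOWER_ALPHA_LIST = re.compile(
    r"\{\{\s*(?:notelist|reflist\s*\|[^}]*group\s*=\s*[\"']?lower-alpha)",
    re.IGNORECASE,
)
# Any <ref ...> tag with a name; group 4 is "/" for a self-closing (invoking) tag.
NAMED_REF = re.compile(
    r"<ref\s+[^>]*?\bname\s*=\s*(?:\"([^\"]+)\"|'([^']+)'|([^\s/>]+))[^>]*?(/?)>",
    re.IGNORECASE,
)


def needs_notelist(wikitext: str) -> bool:
    """True when lower-alpha notes are used but no list template displays them."""
    return bool(LOWER_ALPHA_USE.search(wikitext)) and not LOWER_ALPHA_LIST.search(wikitext)


def undefined_ref_names(wikitext: str) -> set:
    """Names that are invoked with <ref name=... /> but never defined."""
    defined, invoked = set(), set()
    for m in NAMED_REF.finditer(wikitext):
        name = m.group(1) or m.group(2) or m.group(3)
        (invoked if m.group(4) else defined).add(name)
    return invoked - defined


def fix_url_scheme(url: str, default_scheme: str = "https") -> str:
    """Add or repair the scheme prefix of a URL; other parts are left untouched."""
    url = url.strip()
    if re.match(r"^[a-z][a-z0-9+.-]*://", url, re.IGNORECASE):
        return url  # already has a well-formed scheme
    m = re.match(r"^(https?):/*(.+)$", url, re.IGNORECASE)
    if m:  # malformed prefix such as "http:/example.org"
        return f"{m.group(1).lower()}://{m.group(2)}"
    return f"{default_scheme}://{url.lstrip('/')}"
```

For the example from the list, `fix_url_scheme("Bryantawards.org")` gives `https://Bryantawards.org`; a human reviewer would still need to confirm that the repaired link resolves.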
We will approach the problem from a technical standpoint in the following way:
- Helper library: While developing a large tool, it is important to organise the logic in an orderly manner, so that the code is easy for others to follow. A modular approach also makes it easier for new developers to get involved. So initially, we will create a library of helper functions for tasks such as getting the wikitext of pages, identifying patterns for reference errors, generating a diff of the suggested fix, and saving text back to Wikipedia.
- Web backend: After developing the helper library, we will integrate it with the backend framework. We will use the Flask framework for this tool because it is lightweight and modular, and integrates easily with other Python libraries. So far, we have developed 8 tools for the Wikimedia community using Flask.
- Integration with OAuth: The next big step is to integrate Wikimedia OAuth with the tool, so that edits are made from a user's account and not anonymously (from an IP address).
- Web frontend: We will create the frontend of this tool in the ReactJS framework, which is among the most widely used web frontend libraries. ReactJS allows us to manage a lot of data centrally with the Redux JavaScript library. Users will stay on a single page and the browser will not reload it, as we will use centrally stored data. There are two benefits when the browser does not reload:
- The whole web app loads once and does not use extra bandwidth to download the web page again and again, so the app will be able to run on slow connections.
- It provides a better user experience.
- Web frontend UI: User interfaces play a vital role in any task, and a good user interface will encourage users to contribute more. We will use Material-UI as the UI component library for the React app. Material design is currently very popular; websites such as Gmail, Reddit, Dropbox, and Pinterest use it. For a demo of Material-UI, please visit http://bodh.toolforge.org.
- I18n of tool: Internationalising a tool's interface language is very important and is part of the Wikimedia community's best practices. We will integrate our tool’s i18n with https://translatewiki.net. translatewiki.net is the localisation platform used for MediaWiki and many Wikimedia tools, and it allows community members to translate interfaces into their own language. By the end of this project, we will have the tool interface in at least 30 languages.
- Worker queue: After the previous steps, the tool will be ready for English Wikipedia, but a server with limited resources can serve only a limited number of users, so the tool would not be scalable. To make it scalable, we will set up 8 worker queues in the background, with the capacity to serve thousands of users at a time. An example of a worker queue system is http://quarry.wmflabs.org: its workers handle very heavy requests easily, and users can run queries that take minutes to execute.
- Deployment of the tool: We will use Wikimedia’s Cloud VPS for server hosting. Cloud VPS has several advantages over Toolforge. Unlike Toolforge, which is shared hosting, Cloud VPS gives us a separate server for deployment, which we can customise to the tool's requirements. As it is free, we do not need to pay anything for hosting, and the tool can stay there permanently.
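As a rough illustration of the helper library described in the first step, two of its functions might look like the sketch below. The function names are hypothetical; the request parameters follow the MediaWiki Action API's `action=query` / `prop=revisions` module, and the diff uses Python's standard `difflib`. The actual HTTP request and the saving step are left out on purpose.

```python
import difflib

API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia Action API endpoint


def wikitext_query_params(title: str) -> dict:
    """Parameters for fetching the latest wikitext of one page."""
    return {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }


def suggestion_diff(current: str, suggested: str, title: str = "page") -> str:
    """Unified diff between the current wikitext and the suggested wikitext."""
    lines = difflib.unified_diff(
        current.splitlines(),
        suggested.splitlines(),
        fromfile=f"{title} (current)",
        tofile=f"{title} (suggested)",
        lineterm="",
    )
    return "\n".join(lines)
```

Functions like these keep network access, pattern matching, and diffing apart, so each piece can be tested without touching Wikipedia.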
Why is it important?
Though many of these errors might be an easy fix for experienced users, that is not the case for new editors and readers of Wikipedia. From the perspective of readers in particular, these errors generate big red error messages, which do not make for a good reading experience. Also, a general reader who wants to verify the source will not be able to do so, as an error text is displayed instead of the full reference text.
There are at least ten thousand pages in all the above categories combined as of now, but the tool can be used to fix errors that can arise in the future as well. There is a high probability that new pages will be added to these error categories, as the citation templates are prone to errors, especially by new editors, and this can help them to easily fix things.
Is your project already in progress?
n/a
How is it relevant to credibility and Wikipedia? (max 500 words)
According to WP:WHYCITE, “By citing sources for Wikipedia content, you enable users to verify that the information given is supported by reliable sources, thus improving the credibility of Wikipedia while showing that the content is not original research. You also help users find additional information on the subject; and by giving attribution you avoid plagiarising the source of your words or ideas.”
Every citation added contributes to the credibility and verifiability of Wikipedia. However, it is not enough to add a citation; it should also be added the right way. If not, it defeats the whole purpose, as a general reader will not be able to access the source. In a community of thousands of volunteers, mistakes are inevitable. This tool helps to fix those mistakes in a simple, easy and engaging way, thereby contributing to the credibility of Wikipedia, at least a little.
What is the ultimate impact of this project?
We envision the ultimate impact of the project in two categories. The first one is internationalisation: though for this stage of the project we will focus on English Wikipedia only, we will ensure that the tool can be internationalised for other language Wikipedias without a huge effort. If the tool is used on more language Wikipedias, the impact will be more significant than on English Wikipedia alone. As we progress towards the end of the project, we will talk to at least one or two other Wikipedia communities to see whether they would like to adopt the tool and what is possible, and we will also develop documentation for this.
The second component is to use this tool in campaigns. A campaign focused on fixing such basic issues can be a great help for newcomers, and the tool can also be used in campaigns such as 1lib1ref.
Could it scale?
Yes, the answer to this mostly lies in the answer to the question “What is the ultimate impact of this project?” - scaling is mainly applicable to expansion to other languages.
Why are you the people to do it?
Jay Prakash is a seasoned developer who has created several tools and has been working on MediaWiki since 2017 (more than 500 of his fixes have been merged), as can be seen from his GitHub and Phabricator profiles. He is an expert in web applications and MediaWiki. Additionally, Jay was a Google Summer of Code intern for the Wikimedia Foundation in 2019. Krishna is a seasoned English Wikipedian; his experience will be useful in guiding Jay through the nuances of each of the problems mentioned. Moreover, he is an experienced project manager who has received several grants from the Wikimedia Foundation in the past.
Additionally, both of them have worked together on a project to redesign the wiki of the University Innovation Fellows program (of the Hasso Plattner Institute of Design at Stanford; https://universityinnovation.org/wiki/Main_Page). During the project, Jay developed several extensions used for management of learning, while Krishna focused on design and coordination of the overall framework. The experience of working together on that project, and on several other collaborations, will make us a great team for this project.
What is the impact of your idea on diversity and inclusiveness of the Wikimedia movement?
The project relates directly to the overall credibility of Wikipedia, which can support diversity in many ways, and since the tool can be used by people who are not tech-savvy, it is inclusive. Beyond that, there might be an unforeseen impact, but we are not able to establish a direct connection on this issue.
What are the challenges associated with this project and how you will overcome them?
- Integration with Wikimedia: Login by OAuth with Wikimedia projects is not easy for tools. We have to apply for app credentials on Meta-Wiki, and a steward then has to approve our application. After that, we will use the mwoauth Python library in our web backend to handle OAuth with Wikimedia projects.
- Reference error patterns: The tool primarily works by identifying patterns, which cannot be done with just one expression or function. Many patterns need to be identified. We will maintain a set of regular expressions, which will be updated constantly to cover new patterns. We will also create a form for users to report any unrecognized patterns.
- Mobile friendly: This will be a challenge, as we have worked on mobile-friendly web apps only a few times. We will use CSS media queries, which allow us to distinguish devices such as mobiles, tablets, and desktops; we will then style the app (margin, padding, box model, etc.) per device and set breakpoints based on screen size.
- Deployment of workers: Worker queues run in the background, so they need continuous monitoring. We will use Celery for the tool's workers and write a custom script that checks their status; if the script finds that a worker is down, it will restart it.
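The growing set of regular expressions mentioned under "Reference error pattern" could be kept in a small registry, sketched here for the duplicate-definition error. Everything in this block is an assumption made for illustration: the `PATTERNS` registry, the `register_pattern` helper, and the single pattern shown, which handles only double-quoted names.

```python
import re
from collections import defaultdict

# Hypothetical registry: pattern key -> compiled regular expression.
PATTERNS = {
    # A full named definition, e.g. <ref name="x">...</ref> (not the self-closing form).
    "named_ref_definition": re.compile(
        r"<ref\s+name\s*=\s*\"([^\"]+)\"\s*>(.*?)</ref>",
        re.IGNORECASE | re.DOTALL,
    ),
}


def register_pattern(key: str, regex: str, flags: int = 0) -> None:
    """Add or replace a pattern, for example after a user reports a new error shape."""
    PATTERNS[key] = re.compile(regex, flags)


def duplicate_definitions(wikitext: str) -> dict:
    """Map each multiply-defined reference name to a suggested action.

    "merge"  : all definitions have the same text, so keep one and invoke it elsewhere.
    "rename" : the definitions differ, so one or more of them need a new name.
    """
    bodies = defaultdict(list)
    for name, body in PATTERNS["named_ref_definition"].findall(wikitext):
        bodies[name].append(" ".join(body.split()))  # normalise whitespace
    return {
        name: ("merge" if len(set(texts)) == 1 else "rename")
        for name, texts in bodies.items()
        if len(texts) > 1
    }
```

Keeping patterns in one place means a reported case can be covered by adding a registry entry without touching the detection functions.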
How much money are you requesting?
USD 10,000
How will you spend the money?
- Developer: USD 6,500 (325 hrs * $20/hr)
- Product Manager: USD 3,000 (150 hrs * $20/hr)
- Development sprints: USD 500 (travel, per diem)
How long will your project take?
The following will be the broad timeline:
- Month 0: Research and prep
- Month 1: Development Sprint 1
- Month 2: Phase 1 Development concludes
- Month 3: User Testing Phase 1
- Month 3: Development Sprint - 2
- Month 4: User Testing Phase 2 & Campaign exploration
- Month 5: Phase 2 (and Final) Development concludes
- Month 6: Documentation and reporting
Have you worked on projects for previous grants before?
- https://meta.wikimedia.org/wiki/Grants:Conference/KCVelaga/Wikigraphists_Bootcamp_(2018_India)/Report
- https://meta.wikimedia.org/wiki/Grants:Project/Rapid/VVIT_WikiConnect/Annual_Plan_(2018%E2%80%932019)/Report
- https://commons.wikimedia.org/wiki/Commons:SVG_Translation_Campaign_2019_in_India/Report
- More at https://meta.wikimedia.org/wiki/User:KCVelaga/Outreach