Submissions:2016/Linking a controlled subject vocabulary to Wikipedia

From WikiConference North America
Linking a controlled subject vocabulary to Wikipedia
Academic Peer Review option
Type of submission
Diane Vizine-Goetz
E-mail address
OCLC Research, 6565 Kilgour Place, Dublin, Ohio USA 43017
A controlled vocabulary is described as “an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching. It typically includes preferred and variant terms and has a defined scope or describes a specific domain.”[1] This presentation will report on an investigation to link topical terms from the FAST (Faceted Application of Subject Terminology) vocabulary to Wikipedia articles.
The FAST vocabulary is derived from the Library of Congress Subject Headings (LCSH), the largest and most widely used controlled subject vocabulary in the library domain. FAST consists of eight categories of terms, or facets, which cover key attributes of information resources (topics, persons, organizations, events, geographic places, titles of works, time, and form/genre). FAST terms provide subject access to millions of resources in library collections, institutional repositories, and special collections in the GLAM community. FAST, like Wikipedia, is a general knowledge organization structure that covers all topics. Both FAST and Wikipedia are curated, aimed at non-experts, and have guidelines regarding which topics are to be included, i.e., notability for Wikipedia articles and literary warrant for FAST headings. The principles guiding the choice of Wikipedia article titles, e.g., use of natural language, precision, and consistency, echo many of the principles governing the formation of topical subject headings.[2][3]
With these characteristics in mind, automated techniques were developed to match FAST topical terms to Wikipedia article titles. Of approximately 183,000 candidate terms, 76,000 were matched to Wikipedia article titles with 95% accuracy. The presentation will describe the approach used to match controlled vocabulary terms to Wikipedia article titles and current efforts to publish the mappings. The mappings (links to Wikipedia and Wikidata) are available in the FAST authority file and FAST linked data. The links enable people and software applications to take advantage of information in both resources. Next steps, such as accumulating data on bad matches and correcting the mappings, will also be discussed.
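As a rough illustration of what matching vocabulary terms to article titles can involve, the following minimal sketch compares normalized labels and keeps only unambiguous matches. It is not the method used in this investigation, which the abstract does not detail; the function names `normalize` and `match_terms`, the normalization rules, and the sample labels are all assumptions made for the example.

```python
import re
import unicodedata


def normalize(label: str) -> str:
    """Build a comparison key: Unicode-normalize, casefold, and turn
    underscores and punctuation into single spaces."""
    text = unicodedata.normalize("NFKC", label).casefold()
    text = text.replace("_", " ")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def match_terms(fast_terms, article_titles):
    """Map each term to an article title when exactly one title shares
    its normalized key (hypothetical helper, illustrative only)."""
    index = {}
    for title in article_titles:
        index.setdefault(normalize(title), []).append(title)

    matches = {}
    for term in fast_terms:
        candidates = index.get(normalize(term), [])
        # Skip terms with no candidate or with several competing candidates.
        if len(candidates) == 1:
            matches[term] = candidates[0]
    return matches


if __name__ == "__main__":
    # Sample labels are made up for illustration, not taken from the FAST file.
    terms = ["Rock music", "Mercury (Planet)", "Nonexistent heading"]
    titles = ["Rock_music", "Mercury (planet)", "Mercury (element)"]
    print(match_terms(terms, titles))
```

Skipping ambiguous keys trades coverage for precision in this sketch; a real pipeline would likely need further evidence (redirects, disambiguation pages, variant terms) to resolve those cases.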
1. Harpring, Patricia. "What Are Controlled Vocabularies?" Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and Other Cultural Works. Ed. Murtha Baca. Los Angeles: Getty Research Institute, 2010. N. pag. Introduction to Controlled Vocabularies (Getty Research Institute). 2010. Web. 31 Aug. 2016.
2. "Wikipedia:Article Titles." Wikipedia. Wikimedia Foundation, 26 Aug. 2016. Web. 31 Aug. 2016.
3. "Session 2 History and Principles of LCSH." Basic Subject Cataloging Using LCSH: Instructor's Manual (Revised 2011). Ed. Lori Robare. Washington: Library of Congress, 2007. 51-61. Web. 31 Aug. 2016.

Length of presentation
30 min.
Special schedule requests
Preferred room size
Will you attend WikiConference North America if your submission is not accepted?

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with four tildes. (~~~~).

  1. Dcheney (talk) 00:44, 30 September 2016 (EDT)
  2. Add your username here.