Submissions:2016/Linking a controlled subject vocabulary to Wikipedia
- Title
- Linking a controlled subject vocabulary to Wikipedia
- Theme
- GLAM
- Type of submission
- presentation
- Author
- Diane Vizine-Goetz
- E-mail address
- vizine@oclc.org
- Username
- Diane.vg
- Affiliation
- OCLC Research, 6565 Kilgour Place, Dublin, Ohio USA 43017
- Abstract
- A controlled vocabulary is described as “an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching. It typically includes preferred and variant terms and has a defined scope or describes a specific domain.”[1] This presentation will report on an investigation to link topical terms from the FAST (Faceted Application of Subject Terminology) vocabulary to Wikipedia articles.
- The FAST vocabulary is derived from the Library of Congress Subject Headings (LCSH), the largest and most widely used controlled subject vocabulary in the library domain. FAST consists of eight categories of terms, or facets, which cover key attributes of information resources (topics, persons, organizations, events, geographic places, titles of works, time, and form/genre). FAST terms provide subject access to millions of resources in library collections, institutional repositories, and special collections in the GLAM community. FAST, like Wikipedia, is a general knowledge organization structure that covers all topics. Both FAST and Wikipedia are curated, aimed at non-experts, and have guidelines regarding which topics are to be included: notability for Wikipedia articles and literary warrant for FAST headings. The principles guiding the choice of Wikipedia article titles, such as use of natural language, precision, and consistency, echo many of the principles governing the formation of topical subject headings.[2][3]
- With these characteristics in mind, automated techniques were developed to match FAST topical terms to Wikipedia article titles. Of the approximately 183,000 candidate terms, 76,000 terms were matched to Wikipedia article titles with 95% accuracy. The presentation will describe the approach used to match controlled vocabulary terms to Wikipedia article titles and current efforts to publish the mappings. The mappings (links to Wikipedia and Wikidata) are available in the FAST authority file and FAST linked data. The links enable people and software applications to take advantage of information in both resources. Next steps, such as accumulating data on bad matches and correcting the mappings, will also be discussed.
- References
- 1. Harpring, Patricia. "What Are Controlled Vocabularies?" Introduction to Controlled Vocabularies: Terminology for Art, Architecture, and Other Cultural Works. Ed. Murtha Baca. Los Angeles: Getty Research Institute, 2010. N. pag. Introduction to Controlled Vocabularies (Getty Research Institute). 2010. Web. 31 Aug. 2016. <http://www.getty.edu/research/publications/electronic_publications/intro_controlled_vocab/what.html>.
- 2. "Wikipedia:Article Titles." Wikipedia. Wikimedia Foundation, 26 Aug. 2016. Web. 31 Aug. 2016. <https://en.wikipedia.org/wiki/Wikipedia:Article_titles>.
- 3. "Session 2 History and Principles of LCSH." Basic Subject Cataloging Using LCSH: Instructor's Manual (Revised 2011). Ed. Lori Robare. Washington: Library of Congress, 2007. 51-61. Web. 31 Aug. 2016. <http://www.loc.gov/catworkshop/courses/basicsubject/pdf/LCSH_Instructor_2011.pdf>.
- Length of presentation
- 30 min.
- Special schedule requests
- None
- Preferred room size
- 25
- Will you attend WikiConference North America if your submission is not accepted?
- likely
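Illustrative sketch: the abstract says that automated techniques matched FAST topical terms to Wikipedia article titles, but it does not describe those techniques. The Python below is therefore only a guess at the simplest possible baseline, exact matching on normalized labels. The function names (`normalize`, `match_terms`) and the sample terms and titles are hypothetical and are not taken from FAST data or from the presentation.

```python
import unicodedata


def normalize(label: str) -> str:
    """Build a comparison key (assumed rules, not the presenter's):
    Unicode NFKC, underscores to spaces, collapsed whitespace, case-folded."""
    text = unicodedata.normalize("NFKC", label).replace("_", " ")
    return " ".join(text.split()).casefold()


def match_terms(terms, titles):
    """Map each vocabulary term to an article title with the same key."""
    index = {}
    for title in titles:
        # Keep the first title seen for each key.
        index.setdefault(normalize(title), title)
    matches = {}
    for term in terms:
        title = index.get(normalize(term))
        if title is not None:
            matches[term] = title
    return matches


# Invented sample data, for illustration only.
sample_terms = ["Knitting", "Machine learning", "Solar eclipses", "Cooking (Garlic)"]
sample_titles = ["Knitting", "Machine_learning", "Solar_eclipse", "Garlic"]

matches = match_terms(sample_terms, sample_titles)
match_rate = len(matches) / len(sample_terms)
print(matches)
print(f"matched {len(matches)} of {len(sample_terms)} terms ({match_rate:.0%})")
```

In this invented sample, only the terms whose normalized form equals a title are linked; plural forms and parenthetical qualifiers are missed. A baseline like this would need further rules and manual review to approach the coverage and accuracy reported in the abstract.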
Interested attendees
If you are interested in attending this session, please sign with your username below. This will help reviewers decide which sessions are of high interest. Sign with four tildes (~~~~).