Submissions:2025/Beyond Wikivecs: How to Use Dense Vectors to Explore and Expand Wikipedia

From WikiConference North America

This submission has been noted and is pending review for WikiConference North America 2025.



Title:

Beyond Wikivecs: How to Use Dense Vectors to Explore and Expand Wikipedia

Type of session:

Lecture (15-30 min)

Session theme(s):

Missing pieces, Future of Wikipedia

Abstract:

Lightning talk from Wikipedia Day NYC 2025:

https://docs.google.com/presentation/d/15GX1wxEGAT9B65HeuV39_bXZwtOLtnlH2lre36xVys4/edit?slide=id.g2af49c7b022_0_272#slide=id.g2af49c7b022_0_272


Dense vector representations have opened up new ways to explore, understand, and improve Wikipedia. In early 2025, we released Wikivecs, the first fully open and reproducible dataset of dense vector embeddings for every article in multilingual Wikipedia. Built with a permissively licensed multilingual text encoder, the dataset aligns content across languages in a shared vector space, which enabled the rapid discovery of several content silos across languages.

This lecture will build on that foundation and focus on how contributors, researchers, and tool developers can use dense vectors to analyze and improve Wikipedia. Participants will learn the basics of working with vector representations, including:

- How to understand and interpret vector representations

- How to understand and interpret data maps

- How to identify missing or inconsistent content across language editions

- How to cluster and visualize articles by topic or conceptual similarity

We’ll also explore future directions for Wikipedia tooling with vectors, from bias detection to recommendation systems. No prior experience with machine learning is required; we’ll walk through everything with practical examples and open-source tools.
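
As a flavor of the practical examples mentioned above, here is a minimal sketch of how missing content across language editions might be flagged with vectors. It assumes each article is already embedded in a shared multilingual space; the titles, the three-dimensional toy vectors, the `find_gaps` helper, and the 0.8 threshold are all hypothetical and are not taken from the Wikivecs dataset or its tooling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_gaps(source, target, threshold=0.8):
    """Return (title, best_similarity) for every source article whose closest
    target-edition article scores below `threshold`.

    `source` and `target` map article titles to embedding vectors.
    The threshold is an illustrative assumption, not a recommended value.
    """
    gaps = []
    for title, vec in source.items():
        best = max((cosine(vec, other) for other in target.values()), default=0.0)
        if best < threshold:
            gaps.append((title, best))
    return gaps


# Toy, made-up embeddings standing in for two language editions.
en = {
    "Jazz": [0.9, 0.1, 0.0],
    "Quantum computing": [0.0, 0.2, 0.9],
    "Bagel": [0.1, 0.9, 0.1],
}
fr = {
    "Jazz (fr)": [0.88, 0.12, 0.02],
    "Informatique quantique": [0.05, 0.25, 0.92],
}

for title, score in find_gaps(en, fr):
    print(f"{title}: no close counterpart (best similarity {score:.2f})")
```

In this toy data, only the "Bagel" vector has no near neighbor in the second edition, so it is the one article flagged as a candidate gap. Real embeddings have hundreds of dimensions, and a nearest-neighbor index would likely replace the brute-force loop at Wikipedia scale.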

Author name(s):

Brandon Duderstadt

Wikimedia username(s):

Affiliated organization(s):

Nomic AI, Johns Hopkins University

Estimated length of session:

30 minutes

Will you be presenting remotely?

I will present in-person

Okay to livestream?

Livestreaming is okay

Previously presented?

I presented an earlier version of this work at Wikipedia Day NYC 2025.

Special requests:

The dataset referenced in this talk is currently under review as a submission to https://meta.wikimedia.org/wiki/NLP_for_Wikipedia_(ACL_2025)/Call_for_Papers