Towards a Yoruba DBpedia Chapter — GSoC 2026


Hello DBpedia community,

I am Jesujuwon Egbewale, a Python developer based in Ibadan, Nigeria, with experience in data science, machine learning, natural language processing, and low-resource languages. I would like to propose a GSoC 2026 project to build a Yoruba DBpedia Language Chapter.


Description

DBpedia extracts structured information from Wikipedia and publishes it as Linked Open Data, forming one of the most comprehensive multilingual knowledge graphs available. However, Yoruba, a language spoken by an estimated 55 million people across Nigeria, Benin, and Togo, has no DBpedia chapter despite having an active Wikipedia edition with over 33,000 articles.

As of early 2024, Yoruba Wikipedia had over 33,700 articles and 110 active editors, and it recorded over 25 million views in 2023 alone, making it the most-read Nigerian-language Wikipedia. Despite this active community and growing corpus, virtually none of this knowledge is represented in structured form within DBpedia or the Linked Open Data cloud.

This project proposes building the first Yoruba DBpedia Chapter by developing an end-to-end information extraction pipeline that transforms Yoruba Wikipedia content into structured RDF triples and integrates them into the DBpedia knowledge graph.
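
To make the target output concrete, here is a minimal sketch of how one extracted fact could be written as an N-Triples line. The `yo.dbpedia.org` resource namespace is an assumption modeled on other language chapters, and the helper names `resource_iri` and `to_ntriple` are hypothetical; the real chapter would rely on the DBpedia Extraction Framework for serialization.

```python
from urllib.parse import quote

# Assumption: the Yoruba chapter would use a yo.dbpedia.org resource namespace,
# as other language chapters do. The ontology namespace is the DBpedia one.
RESOURCE_NS = "http://yo.dbpedia.org/resource/"
ONTOLOGY_NS = "http://dbpedia.org/ontology/"


def resource_iri(title: str) -> str:
    """Build a resource IRI from a Wikipedia article title (hypothetical helper)."""
    # Wikipedia titles use underscores for spaces; percent-encode everything unsafe.
    return RESOURCE_NS + quote(title.replace(" ", "_"), safe="_()")


def to_ntriple(subject_title: str, prop: str, object_title: str) -> str:
    """Serialize one (subject, property, object) fact as a single N-Triples line."""
    subject = resource_iri(subject_title)
    obj = resource_iri(object_title)
    return f"<{subject}> <{ONTOLOGY_NS}{prop}> <{obj}> ."


print(to_ntriple("Wole Soyinka", "birthPlace", "Abeokuta"))
```

Percent-encoding keeps the sketch simple; whether the chapter should publish percent-encoded URIs or Unicode IRIs is a question I would settle with the mentors.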


Goal

  • Analyze Yoruba Wikipedia structure: infobox coverage, article categories, template usage, and linguistic characteristics relevant to extraction
  • Build an NLP preprocessing pipeline tailored to Yoruba (tokenization, diacritic normalization, stopword handling)
  • Develop a relational triple extraction pipeline using multilingual models (XLM-R, mBERT, or Aya, an open-source model with strong African language coverage)
  • Map extracted triples to DBpedia ontology classes and properties
  • Publish extracted triples as Linked Open Data following the DBpedia chapter format
  • Create a demo page and documentation following FAIR principles, modeled on the Amharic and Hindi chapter templates
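
The preprocessing goal above can be sketched with the Python standard library alone. This is a minimal illustration, not the final design: the stopword list is a tiny hypothetical sample, and the function names are placeholders. It assumes NFC as the normalization form, which matters for Yoruba because letters that carry both a dot below and a tone mark have no single precomposed code point, so the tokenizer has to keep combining marks attached to their base letter.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize to NFC so equivalent diacritic sequences compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample only; a real list would come from a curated Yoruba resource.
STOPWORDS = {normalize(w) for w in ("ni", "ti", "sì", "àti")}


def tokenize(text: str) -> list[str]:
    """Split text into word tokens, keeping combining marks with their base letter."""
    tokens, current = [], []
    for ch in normalize(text):
        # Combining marks (Unicode category M*) are not alphanumeric, so they
        # are checked explicitly; otherwise tone-marked words would be split.
        if ch.isalnum() or unicodedata.category(ch).startswith("M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def content_tokens(text: str) -> list[str]:
    """Tokenize and drop stopwords (compared case-insensitively)."""
    return [t for t in tokenize(text) if t.lower() not in STOPWORDS]


print(content_tokens("Ìbàdàn ni ìlú"))
```

A plain `\w+` regular expression would not work here, because Python's `\w` does not match combining marks, which is why the sketch walks the characters and checks their Unicode category instead.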

Material

  • Yoruba Wikipedia XML dump
  • DBpedia Extraction Framework and existing chapter codebases (Amharic, Hindi)
  • Multilingual pre-trained models: XLM-R, mBERT, CroissantLLM, Aya (Cohere)
  • DBpedia Ontology and mappings.dbpedia.org
  • Masakhane NLP resources for Yoruba

Project Size

Large — 350 hours


Impact

  • Creates the first structured Yoruba knowledge graph integrated into DBpedia, serving an estimated 55 million Yoruba speakers
  • Provides a reusable extraction pipeline extensible to other under-resourced Nigerian and West African languages (Hausa, Igbo, Ewe)
  • Contributes to linguistic diversity in the Linked Open Data cloud
  • Supports downstream NLP tasks: machine translation, question answering, and named entity recognition in Yoruba
  • Follows the proven DBpedia African chapter model (Amharic 2024, 2025) with a new language of equal or greater reach

Warm-up Tasks

  • Read the DBpedia Extraction Framework documentation and review the Amharic and Hindi chapter codebases and GSoC reports
  • Explore Yoruba Wikipedia structure: catalog infobox types, template usage, and article size distribution across major categories
  • Run the Amharic chapter pipeline locally on a sample Yoruba Wikipedia dump to understand adaptation requirements
  • Identify available Yoruba NLP resources: tokenizers, stopword lists, existing annotated corpora
  • Draft a preliminary mapping of Yoruba Wikipedia infobox templates to DBpedia ontology classes
  • Set up the development environment (DBpedia Extraction Framework, HuggingFace, RDF tooling)
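
For the infobox cataloging task above, a first pass could count template names directly in the article wikitext. The sketch below assumes that infobox templates on Yoruba Wikipedia have names starting with "Infobox", which I still need to verify against the dump; `count_infoboxes` and the regular expression are my own illustrative choices, not part of the Extraction Framework.

```python
import re
from collections import Counter

# Captures the template name right after "{{", up to the first "|", "}" or newline.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}\n]+?)\s*(?=[|}\n])")


def count_infoboxes(wikitexts, prefix="infobox"):
    """Count infobox template usage across an iterable of wikitext strings.

    Assumption: infobox template names start with `prefix` (case-insensitive).
    """
    counts = Counter()
    for text in wikitexts:
        for name in TEMPLATE_RE.findall(text):
            if name.lower().startswith(prefix):
                # MediaWiki treats the first letter of a template name as
                # case-insensitive and underscores as spaces.
                key = name.replace("_", " ")
                counts[key[0].upper() + key[1:]] += 1
    return counts


sample_pages = [
    "{{Infobox person\n| name = A}}",
    "{{infobox person|name=B}} {{cite web|url=x}}",
    "{{Infobox settlement|name=C}}",
]
print(count_infoboxes(sample_pages).most_common())
```

On the real dump, the wikitext strings would come from streaming the page elements of the XML file rather than from an in-memory list.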

I would greatly appreciate mentor guidance on whether this aligns with DBpedia’s 2026 goals, and which aspects of the existing chapter infrastructure to prioritize in the proposal.

Thank you for your time.