Building the Amharic DBpedia Language Chapter with Large Language Models (LLMs)
Description
DBpedia is a collaborative initiative that extracts structured information from Wikipedia and publishes it as Linked Open Data. This project continues the work of GSoC 2024 and GSoC 2025, in which we successfully integrated Amharic parsers and extractors into the DBpedia chapter. However, due to time constraints, we could not build a complete automation system to extract and build the artifacts. In this year’s GSoC, we would like to continue from last year’s progress.
Goal
The primary goal of this project is to enhance the existing Amharic DBpedia chapter:
- Integrate an automatic extraction framework and mapping by applying LLMs
- Class/Property/Relation prediction
- Build a demo page
- Update the home page
- Deploy the knowledge graph and make it available to end users via a web page.
- Create documentation for processes, tools, and techniques used for sustainable development, following FAIR principles.
Impact
- Enable users to access and utilize structured data in Amharic DBpedia more effectively.
- Promote linguistic diversity and support research, education, and applications that rely on multilingual knowledge graphs.
- NLP downstream tasks: Apply knowledge graphs from DBpedia to NLP applications such as machine translation and sentiment analysis.
- Community engagement: Encourage the community to contribute and collaborate to sustain and expand Amharic DBpedia.
Warmup Tasks
- Read the documentation for Amharic DBpedia at https://github.com/AmharicDBpedia/AmharicDBpediaChapter/wiki
- Explore Amharic Wikipedia
Skills Required
- A good understanding of Java and Python
- Optionally, good knowledge of SPARQL, RDF, and other Semantic Web technologies
- Machine Learning
- Good documentation and communication skills
Project Size
350 hours
Mentors
Hizkiel Alemayehu
Tilahun Tafa
Ricardo Usbeck
Andargachew Asfaw
Keywords
Amharic DBpedia, Semantic Web, Extraction Framework
Hiiii @hizclick and mentors,
I’ve been going through the warmup tasks: I read through the AmharicDBpedia wiki, explored the Amharic Wikipedia structure, and checked out the Arabic, Korean, and German DBpedia chapters to understand how other language editions handle similar challenges.
Really interested in working on the LLM-based extraction and mapping automation this year. The continuation from GSoC 2024/2025 makes a lot of sense.
Could you please advise on where you prefer a pre-proposal draft to be shared, here on the forum or via email, so I can align my approach with your expectations?
Thanks!
Hi @hizclick, @Tilahun_Tafa
I’ve been reviewing the AmharicDBpediaChapter repository and the Wiki progress from GSoC ’24 and ’25. The existing parsers are a great foundation, but I see exactly why the 2026 focus is on LLMs. Since Amharic is morphologically rich, standard regex rules often miss data trapped in unstructured text. I plan to implement a pipeline using multilingual LLMs, specifically testing open-source models, to extract class and property triples directly from the article body, rather than relying solely on infobox templates.
Regarding the deployment goal, I believe a raw SPARQL endpoint creates a barrier for general researchers. I’d love to build a lightweight React or Streamlit frontend that sits on top of the Virtuoso endpoint, allowing users to search for entities such as “Addis Ababa” and visualize their relationships in graph form. I also performed a quick audit of the Wiki against FAIR principles; while RDF interoperability is strong, I plan to improve Reusability by adding standardized metadata tags and a Data Dictionary to the extraction dumps.
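As a rough sketch of the search step behind such a frontend, the snippet below builds a label-lookup SPARQL query and the GET URL a client would send. The endpoint address (a local Virtuoso instance on its default port), the function names, and the choice of `rdfs:label` as the lookup property are assumptions for illustration, not details taken from the existing chapter setup.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a local Virtuoso instance on its default port.
ENDPOINT = "http://localhost:8890/sparql"


def label_search_query(label: str, lang: str = "en", limit: int = 25) -> str:
    """Build a query returning every property/value pair of entities with this label."""
    # Escape backslashes and double quotes so the label stays a valid SPARQL literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?property ?value WHERE {\n"
        f'  ?entity rdfs:label "{escaped}"@{lang} .\n'
        "  ?entity ?property ?value .\n"
        f"}} LIMIT {int(limit)}"
    )


def request_url(query: str) -> str:
    """Encode the query as the standard SPARQL protocol `query` parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(label_search_query("Addis Ababa"))
    print(request_url(label_search_query("Addis Ababa")))
```

The rows returned for `?property` and `?value` are what a graph view would draw as edges around the matched entity; an Amharic label would use `lang="am"`.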
I am finalizing my proposal now, but I wanted to ensure we are on the same page.
Best regards,
Anay Dongre
Hi @hizclick, @Tilahun_Tafa, and mentors,
I am really interested in the “Amharic DBpedia with LLMs” project. I’ve started going through the wiki and trying to understand the extraction process, mappings, and how everything fits together.
I am especially interested in working on automating the extraction pipeline and exploring how LLMs can help with class/property/relation prediction. I’m also thinking about adding a simple web interface so users can interact with the data more easily.
I just wanted to ask: which part of the project would you recommend focusing on the most this year? I want to make sure my proposal aligns well with your priorities.
Hi @abegail
Thank you for contacting us. The main goal of this year’s plan is to automate the DBpedia Extraction Framework (DEF) pipeline for predicting classes, properties, and relations using LLMs.
Hi @hizclick,
I had two questions regarding the project while writing my proposal:
- Which LLM: paid API (Claude, Gemini, GPT) or open-source local (Llama, Gemma)? Or does the project have an existing preference? (The neural-extraction-framework uses Gemini.)
- Output format: should the pipeline produce TemplateMapping files for mappings.dbpedia.org, or produce RDF triples directly from Python like the neural-extraction-framework?
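To make the second option concrete, here is a minimal sketch of what emitting a triple directly from Python could look like, serialized as one N-Triples line. The `http://am.dbpedia.org/resource/` namespace, the helper names, and the sample class prediction are assumptions for illustration; this is not how the neural-extraction-framework does it.

```python
# Assumed resource namespace for the Amharic chapter (not confirmed in this thread).
RESOURCE_NS = "http://am.dbpedia.org/resource/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBO = "http://dbpedia.org/ontology/"


def resource_iri(title: str) -> str:
    """Turn a Wikipedia page title into a resource IRI (spaces become underscores).

    No further escaping is done here; real titles may need it.
    """
    return RESOURCE_NS + title.strip().replace(" ", "_")


def to_ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format one triple whose three terms are all IRIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def type_triple(title: str, predicted_class: str) -> str:
    """Emit an rdf:type statement for a class name predicted by the model."""
    return to_ntriple(resource_iri(title), RDF_TYPE, DBO + predicted_class)


if __name__ == "__main__":
    # Sample prediction; in the pipeline this would come from the LLM output.
    print(type_triple("Addis Ababa", "City"))
```

With the first option, the same prediction would instead be written into a mapping file and the existing extraction framework would generate the triples.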
@hizclick
I am writing to express my strong interest in the GSoC project “Building the Amharic DBpedia Language Chapter with Large Language Models (LLMs)”. The idea of extending DBpedia with automated extraction and knowledge graph construction using LLMs is both impactful and technically exciting.
I am particularly drawn to this project because it combines Natural Language Processing, Knowledge Graphs, and Large Language Models, which aligns closely with my current work and interests in AI/ML systems.
From the project description, I am especially interested in contributing to:
- Designing an automatic extraction framework using LLMs
- Implementing class, property, and relation prediction
- Improving mapping pipelines for structured knowledge generation
- Building a demo interface to showcase the Amharic DBpedia chapter
- Assisting in deployment to make the knowledge graph accessible to end users
I have experience working with:
- Machine Learning and NLP concepts
- Python-based data pipelines
- Retrieval-Augmented Generation (RAG) and LLM-based workflows
- Open-source collaboration and structured project development
This project excites me because it contributes to multilingual knowledge accessibility, especially for underrepresented languages like Amharic, and leverages modern AI techniques to scale knowledge extraction.
I am eager to explore the existing codebase and previous GSoC work and would love guidance on where to begin contributing. I am fully committed to dedicating focused time to this project during GSoC.
Thank you for your time and consideration.