DBpedia Hindi Chapter — GSoC 2025

DBpedia is an open knowledge graph in continuous evolution. Unlike Wikidata, where the RDF content is directly edited as a wiki, DBpedia relies strictly on Wikipedia, meaning that every single triple in DBpedia — except for ontology statements — can be traced back to some infobox, sentence or table cell in Wikipedia.

The graph exposed at the root domain of DBpedia is derived solely from English Wikipedia (e.g. https://dbpedia.org/page/India). The purpose of this project is to create a graph derived solely from Hindi Wikipedia. Triples can be generated either with the Extraction Framework, which handles infobox extraction, or with novel NLP-based approaches such as the Neural Extraction Framework. Unfortunately, the latter currently supports only English. We thus welcome NLP and/or LLM-based solutions that target multilingual text. The first edition of the DBpedia Hindi Chapter was proposed for GSoC 2024; it covered configuring the different extractors of the DBpedia Extraction Framework for Hindi and included a neural pipeline for extracting triples directly from Hindi Wikipedia. This proposal extends that first edition with additional goals.

Goal

Extend the DBpedia Chapter for the Hindi language, to be reachable at hi.dbpedia.org. In particular:

  • Create the knowledge graph with data from Hindi Wikipedia by including more Indic neural extractors. The aim is to extract information in the form of relational triples (subject → predicate → object) from unstructured text in Hindi Wikipedia articles, so that it can be added to the DBpedia knowledge base.
  • Automate the enrichment of the knowledge graph using newly available LLMs and Indic embeddings, so that missing links can be generated automatically.
  • Create a SPARQL endpoint to make it queryable.
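
To make the subject → predicate → object shape concrete, here is a minimal Python sketch of how one extracted fact might be turned into an N-Triples line. It is only an illustration under stated assumptions: the `http://hi.dbpedia.org/resource/` namespace, the `Triple` class and both helper functions are hypothetical names chosen for this example, not part of the Extraction Framework, and the extractor output is hard-coded.

```python
from dataclasses import dataclass

# Assumption: the Hindi chapter would follow the usual chapter convention of
# minting resource IRIs under its own subdomain.
HI_RESOURCE = "http://hi.dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# Characters that may not appear unescaped inside an N-Triples IRI.
FORBIDDEN = set('<>"{}|^`\\')


@dataclass(frozen=True)
class Triple:
    """One relational fact: page titles for subject/object, an ontology property name."""
    subject: str
    predicate: str
    obj: str


def resource_iri(title: str) -> str:
    """Turn a Wikipedia-style page title into a resource IRI (spaces become underscores)."""
    name = title.strip().replace(" ", "_")
    if not name or FORBIDDEN & set(name):
        raise ValueError(f"title cannot be used in an IRI: {title!r}")
    return HI_RESOURCE + name


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose object is another resource as one N-Triples line."""
    return f"<{resource_iri(t.subject)}> <{DBO}{t.predicate}> <{resource_iri(t.obj)}> ."


if __name__ == "__main__":
    # Hypothetical extractor output for a sentence stating that New Delhi is India's capital.
    print(to_ntriples(Triple("भारत", "capital", "नई दिल्ली")))
```

Non-ASCII titles are kept as-is because N-Triples documents are UTF-8 and IRIs may contain Devanagari characters; a real pipeline would also need to handle literal objects and map predicted relations onto ontology properties.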

Material

See Warm-up tasks.

Project size

This project is medium-sized (175 hours).

Impact

  • Cultural and Educational Enrichment: Empower Hindi-speaking users with culturally relevant and easily accessible knowledge, fostering educational enrichment and linguistic inclusivity.
  • Semantic Search and NLP Applications: Enable advanced semantic search and natural language processing (NLP) applications in Hindi, opening avenues for innovation in information retrieval and analysis.
  • Community Engagement: Encourage community contributions, feedback, and collaboration in maintaining and expanding the Hindi ontology, ensuring continuous improvement and relevance.

In summary, this project seeks to contribute significantly to linguistic diversity in the semantic web domain by extending the DBpedia ontology to Hindi, promoting a more inclusive and accessible knowledge landscape for Hindi-speaking users.

Warm-up tasks

  1. Please read carefully our overview on creating new DBpedia Chapters.
  2. Read the paper Internationalization of Linked Data: The case of the Greek DBpedia edition by Kontokostas et al.
  3. Learn about the DBpedia Extraction Framework, the software used to transform Wikipedia infobox data into RDF triples.
  4. Check the Hindi mappings of the DBpedia ontology and the available Indic embeddings.
  5. Go through the list of current chapters, which can be found at this address, to get an idea of how they are structured.
  6. Get familiar with SPARQL on the DBpedia endpoint.
  7. Run a local DBpedia Virtuoso endpoint.
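
For tasks 6 and 7, a small Python sketch like the following may help to get started. It builds a SPARQL query asking for the Hindi label of a resource and sends it to an endpoint over HTTP using only the standard library. The function names are hypothetical and chosen for this example; `https://dbpedia.org/sparql` is the public endpoint, and the `endpoint` parameter is there so the same code can be pointed at a local Virtuoso instance (assumed here to listen on `http://localhost:8890/sparql`, the usual default, which may differ in your setup).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://dbpedia.org/sparql"


def build_label_query(resource_iri: str, lang: str) -> str:
    """Return a SPARQL query selecting the rdfs:label of a resource in one language."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT ?label WHERE {{ <{resource_iri}> rdfs:label ?label . "
        f'FILTER (lang(?label) = "{lang}") }}'
    )


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Prepare a GET request following the SPARQL protocol (query passed as a URL parameter)."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query: str, endpoint: str = ENDPOINT) -> list:
    """Send the query and return the values bound to ?label (needs network access)."""
    with urlopen(build_request(query, endpoint), timeout=30) as resp:
        data = json.load(resp)
    return [b["label"]["value"] for b in data["results"]["bindings"]]


if __name__ == "__main__":
    q = build_label_query("http://dbpedia.org/resource/India", "hi")
    print(q)
    # Uncomment to query the public endpoint, or pass a local one instead:
    # print(run_query(q))
    # print(run_query(q, endpoint="http://localhost:8890/sparql"))
```

Keeping query construction separate from the network call makes it easy to inspect the query text first and to reuse it unchanged against a local endpoint.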

Mentors

Sanju Tiwari (@tiwarisanju18), Debarghya Dutta, Ananya, Ronak Panchal

Hi @tiwarisanju18! I’m Aditya Venkatesh, currently pursuing my Master’s in CompSci at the University of Amsterdam. Being a native Hindi speaker, I’m particularly interested in this project because of the impact it can have on expanding knowledge bases in Hindi!

I have previously worked with SPARQL queries and have used the Babelscape/rebel-large model to extract RDF triplets from raw text (specifically, the outputs of LLMs). I implemented a fact checker that extracts these triplets from LLM outputs and then queries Wikidata using SPARQL to check their validity. This was done as part of a university project in Nov-Dec 2024, so it’s fairly recent. I was the main contributor to this project, and you can find it here: GitHub - advenk/wdps_group27
I believe my experience with extracting relational triplets can be extended to the DBpedia Hindi chapter. I’ve started looking into the warm-up tasks and exploring the DBpedia Extraction Framework. I hope to come up with a draft proposal over the next week, after completing the warm-up tasks and taking input from last year’s project progress.
I have a question: I could not find any repo showing last year’s project progress. Could you point me to it?

Thank you in advance!

Thank you for your interest. Please go ahead with the proposal.