DBpedia is an open knowledge graph in continuous evolution. Unlike Wikidata, where the RDF content is directly edited as a wiki, DBpedia relies strictly on Wikipedia, meaning that every single triple in DBpedia — except for ontology statements — can be traced back to some infobox, sentence or table cell in Wikipedia.
The graph exposed at the root domain of DBpedia is derived solely from English Wikipedia (e.g. https://dbpedia.org/page/India). The purpose of this project is to create a graph derived solely from Hindi Wikipedia. Triples are generated either with the Extraction Framework, which handles infobox extraction, or with novel NLP-based approaches such as the Neural Extraction Framework. Unfortunately, the latter approach only supports the English language. We thus welcome NLP and/or LLM-based solutions that target multilingual text. The first edition of the DBpedia Hindi Chapter was proposed for GSoC 2024; it covered configuring different extractors for Hindi in the DBpedia Extraction Framework and included a neural pipeline for extracting tuples directly from Hindi Wikipedia. This proposal extends that first edition with additional goals.
Goal
Extend the DBpedia Hindi Chapter, to be reachable at hi.dbpedia.org. In particular:
- Create the knowledge graph with data from Hindi Wikipedia by including more Indic neural extractors. The aim is to extract information in the form of relational triples (subject → predicate → object) from unstructured text in Hindi Wikipedia articles, so that it can be added to the DBpedia knowledge base.
- Automate the enrichment of the knowledge graph with newly available LLMs by creating Indic embeddings, so that missing links can be generated automatically.
- Create a SPARQL endpoint to make it queryable.
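To make the triple format in the goals above concrete, here is a minimal sketch of how an extracted (subject, predicate, object) tuple could be written as one N-Triples line. It is an illustration only, not part of the Extraction Framework: the helper names `escape_literal` and `to_ntriple` are hypothetical, and the `http://hi.dbpedia.org/resource/` namespace is an assumption based on the target domain named above.

```python
# Namespaces used for illustration. The Hindi resource namespace is an
# assumption derived from the hi.dbpedia.org target domain.
RESOURCE_NS = "http://hi.dbpedia.org/resource/"
ONTOLOGY_NS = "http://dbpedia.org/ontology/"


def escape_literal(text: str) -> str:
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriple(subject: str, predicate: str, obj: str,
               obj_is_literal: bool = False, lang: str = "hi") -> str:
    """Serialize one extracted tuple as a single N-Triples line.

    Spaces in resource names are replaced with underscores, following the
    Wikipedia page-title convention.
    """
    s = f"<{RESOURCE_NS}{subject.replace(' ', '_')}>"
    p = f"<{ONTOLOGY_NS}{predicate}>"
    if obj_is_literal:
        # Language-tagged string literal, e.g. "..."@hi
        o = f'"{escape_literal(obj)}"@{lang}'
    else:
        o = f"<{RESOURCE_NS}{obj.replace(' ', '_')}>"
    return f"{s} {p} {o} ."


# A tuple such as (भारत, capital, नई दिल्ली) becomes one line of RDF.
print(to_ntriple("भारत", "capital", "नई दिल्ली"))
```

Lines in this form can be concatenated into a `.nt` file and bulk-loaded into a triple store, which is one possible route to the SPARQL endpoint named in the last goal.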
Material
See Warm-up tasks.
Project size
This project is medium-sized (175 hours).
Impact
- Cultural and Educational Enrichment: Empower Hindi-speaking users with culturally relevant and easily accessible knowledge, fostering educational enrichment and linguistic inclusivity.
- Semantic Search and NLP Applications: Enable advanced semantic search and natural language processing (NLP) applications in Hindi, opening avenues for innovation in information retrieval and analysis.
- Community Engagement: Encourage community contributions, feedback, and collaboration in maintaining and expanding the Hindi ontology, ensuring continuous improvement and relevance.
In summary, this project seeks to contribute significantly to linguistic diversity in the semantic web domain by extending the DBpedia ontology to Hindi, promoting a more inclusive and accessible knowledge landscape for Hindi-speaking users.
Warm-up tasks
- Please carefully read our overview on creating new DBpedia Chapters.
- Read the paper "Internationalization of Linked Data: The case of the Greek DBpedia edition" by Kontokostas et al.
- Learn about the DBpedia Extraction Framework, the software used to transform Wikipedia infobox data into RDF triples.
- Check the Hindi mappings of the DBpedia ontology and the Indic embeddings.
- Go through the list of current chapters, which can be found at this address, to get an idea of how they are structured.
- Get familiar with SPARQL on the DBpedia endpoint.
- Run a local DBpedia Virtuoso endpoint.
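As a starting point for the two SPARQL warm-up tasks, the sketch below builds a query URL for the public endpoint at https://dbpedia.org/sparql and reads values out of a response in the standard SPARQL JSON results format. The function names are hypothetical, and the `format` request parameter is assumed to be honoured by the endpoint, as Virtuoso-based endpoints generally do. The example query asks for the Hindi label of the India resource mentioned earlier.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "https://dbpedia.org/sparql"

# Hindi label of the English DBpedia resource for India.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/India> rdfs:label ?label .
  FILTER (lang(?label) = "hi")
}
"""


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode a SPARQL query as an HTTP GET URL."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


def extract_values(results: dict, variable: str) -> list:
    """Collect the values bound to one variable in a SPARQL JSON result."""
    bindings = results["results"]["bindings"]
    return [row[variable]["value"] for row in bindings if variable in row]


def run_query(query: str, endpoint: str = ENDPOINT) -> dict:
    """Send the query over HTTP (needs network access) and parse the reply."""
    with urlopen(build_query_url(query, endpoint), timeout=30) as response:
        return json.load(response)
```

Calling `extract_values(run_query(QUERY), "label")` would return the matching labels. For a local Virtuoso instance, pass its address as `endpoint`; this is typically `http://localhost:8890/sparql`, but that depends on how the server is configured.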
Mentors
Sanju Tiwari (@tiwarisanju18), Debarghya Dutta, Ananya, Ronak Panchal