DBpedia Hindi Chapter 2026: Fine-Tuning Indic Models for Hindi Relational Triple Extraction + Human-in-the-Loop Feedback — GSoC 2026

Project Title:

DBpedia Hindi Chapter 2026: Fine-Tuning Indic Models for Hindi Relational Triple Extraction + Human-in-the-Loop Feedback

Description:

The DBpedia Hindi Chapter aims to expand the multilingual depth of DBpedia by extracting structured relational triples (subject → predicate → object) from Hindi Wikipedia and integrating them into the DBpedia knowledge graph. While DBpedia’s existing extraction framework provides strong support for infobox-based triples, the extraction of relations from free text using neural and NLP-driven methods—particularly for Hindi—remains underdeveloped. As a result, a large portion of relational knowledge present in Hindi Wikipedia articles is not yet represented in structured form within DBpedia.

Previous efforts have shown clear limitations in current approaches. Triple extraction using LLMs or prompt-only methods often suffers from inconsistency and a lack of robustness across different linguistic contexts, while rule-based extractors such as IndIE alone are insufficient to capture the richness and variability of Hindi language constructions. Furthermore, the absence of an integrated human validation and correction mechanism prevents systematic improvement of training datasets and limits iterative refinement of extraction models, ultimately affecting the quality and reliability of the resulting knowledge graph.

## Goal

This project proposes:

  1. Fine-tuning a small language model (SLM) for reliable relational triple extraction from Hindi text, improving over current prompt-only and rule-based baselines.

  2. Building a lightweight UI for human feedback on extracted triples, enabling iterative dataset improvement (via correction labels and edits) — crucial for error analysis, dataset curation, and future active learning.

This work supports the DBpedia extraction ecosystem and accelerates automated extraction methods for Hindi Wikipedia content, enhancing coverage, quality, and usability of the Hindi DBpedia chapter.

## Material

See Warm-up tasks.

## Project size

This project is medium-sized (175 hours).

## Impact

This project will directly enhance the DBpedia Hindi Chapter by improving the quality and coverage of relational triple extraction from Hindi Wikipedia. By fine-tuning a small language model and incorporating human-in-the-loop validation, it will produce more accurate, trustworthy triples suitable for DBpedia ingestion. The resulting open-source pipeline and feedback dataset will support sustainable growth of hi.dbpedia.org and provide a reusable foundation for extending neural extraction methods to other low-resource languages in DBpedia.

## Warm-up Tasks

  • Carefully read the DBpedia extraction framework documentation, the ontology, and existing Hindi DBpedia resources, and review past GSoC work related to DBpedia Hindi and Indic information extraction.
  • Explore Hindi BenchIE, understand its annotation schema and evaluation protocol, and reproduce baseline results using prompt-based extraction and rule-based systems (e.g., IndIE and the GSoC25 pipeline).
  • Analyze Hindi Wikipedia text characteristics relevant to relation extraction.
  • Set up the training environment (HuggingFace, evaluation scripts, reproducibility configs).
  • Run small-scale experiments with candidate SLMs (e.g., Gemma 3) on sample data.
  • Sketch the feedback workflow and data schema for triplet validation and build a minimal prototype to display extracted triples and collect annotations.
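As a starting point for the last task, the feedback step could be sketched as a small decision function. This is a minimal illustration; the field names (`decision`, `predicted`, `gold`) are assumptions, not a fixed DBpedia schema:

```python
import json

# Minimal sketch of the triple-validation feedback step.
# Field names ("decision", "predicted", "gold") are illustrative assumptions.
def apply_feedback(sentence, triple, decision, corrected=None):
    """Turn one annotator decision into a JSONL-ready training record."""
    if decision not in {"accept", "edit", "reject"}:
        raise ValueError(f"unknown decision: {decision}")
    record = {"sentence": sentence, "predicted": triple, "decision": decision}
    if decision == "edit":
        if corrected is None:
            raise ValueError("'edit' requires a corrected triple")
        record["gold"] = corrected
    elif decision == "accept":
        record["gold"] = triple  # accepted predictions become gold labels
    return record

rec = apply_feedback(
    "बर्लिन की दीवार का निर्माण 1961 में शुरू हुआ।",
    ["बर्लिन की दीवार", "निर्माण शुरू हुआ", "1961"],
    "accept",
)
line = json.dumps(rec, ensure_ascii=False)  # one JSONL line per annotation
```

Keeping the store append-only (one JSON object per line) makes the feedback dataset easy to diff, merge, and re-split for later fine-tuning rounds.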

These warm-up tasks will ensure early alignment with DBpedia goals, reduce onboarding time, and enable faster progress once the coding phase begins.

## Mentors

Sanju Tiwari (@tiwarisanju18), Aditya Venkatesh, Debarghya Dutta, Ronak Panchal


Hi @tiwarisanju18

This is a very interesting project; the move to fine-tuning SLMs (like Gemma 3) definitely seems like the right way to address the semantic issues in Hindi extraction.

Regarding the Codebase: Since this project shares infrastructure with the Neural Extraction Framework (which I am actively exploring), I decided to explore the GSoC25_H pipeline as well.

I noticed some code redundancy in the LLM connection modules and submitted PR #25. It refactors the LLMService to unify connection logic across IndIE and llm_IE, which should make future SLM integration cleaner.

Quick observation: My local tests with the “Berlin Wall” abstract confirm that the current embeddings struggle with Hindi predicates (e.g., mapping निर्माण शुरु हुआ, “construction began,” to dbo:constructionYear), so the switch to a fine-tuned SLM is definitely well-motivated.

Best, Siddharth

Hey @sid0858 @tiwarisanju18,
This is a very interesting and well-structured project proposal. The focus on fine-tuning Indic SLMs for Hindi relational triple extraction, along with a human-in-the-loop feedback mechanism, directly addresses some of the key limitations in current DBpedia Hindi extraction workflows.

I’m particularly interested in contributing to the warm-up tasks, especially exploring Hindi BenchIE, running baseline experiments with prompt-based and rule-based systems, and experimenting with candidate SLMs. I’d love to learn more and contribute to the project as a community contributor. Looking forward to collaborating with the mentors and the team!
Regards, Nitin.

Dear @sid0858 @5051a9b40975112c9b4a
Thank you for your inputs.
Please send your proposal to tiwarisanju18@ieee.org.

Hello @tiwarisanju18,

Thank you for the invitation! I would be happy to put together a proposal. I am currently consolidating my notes on the SLM fine-tuning strategy and the Human-in-the-Loop architecture to align them with the GSoC requirements and with the core Neural Extraction Framework. I will draft the Proposal of my technical approach and share it with you via email shortly for your feedback.

Looking forward to collaborating!

Best regards, Siddharth

Just sent, sir. Please check.

Hi @tiwarisanju18
This project sounds really interesting and aligns well with my academic background and research interests. The idea of fine-tuning and aligning small language models to extract higher-quality relational triples from Hindi Wikipedia feels both impactful and technically exciting. I am especially interested in the human-in-the-loop aspect and in using feedback to improve datasets and build a more reliable extraction pipeline.

I am keen to explore the DBpedia extraction framework and warm-up tasks in depth, especially experimenting with existing baselines and Hindi IE resources to better understand the current pipeline. I would love to learn more and contribute meaningfully to this project.
Regards, Vishakha

Dear Vishakha

Please send your proposal by email with all your complete details.

I have shared my draft proposal with complete details via email as requested.

Looking forward to your feedback.

Best regards,
Siddharth

Hello @tiwarisanju18 and mentors!

I have completed the warm-up tasks and sent my proposal via email for your review.

Looking forward to your feedback.

Best regards,
Syeda Alizah

I have mailed you my draft proposal with complete details. Looking forward to your feedback.

Regards,
Vishakha

Hi @tiwarisanju18,

I am Anushtup Ghosh, an undergraduate engineering student from Jadavpur University, India. I am very excited about the “Fine-Tuning Indic Models” project, as it sits right at the intersection of my experience with Small Language Models (SLMs) and Full-Stack Development. Also, being fluent in Hindi, Bengali, and English, I can actively help with the manual validation and linguistic nuances required for the dataset.

I have gone through the project goals and the warm-up tasks, and I wanted to share a few initial thoughts and questions as I set up my environment.

I noticed the mention of Gemma 3 in the warm-up tasks. I have been experimenting with Gemini Nano and other on-device SLMs for local inference, and I am currently setting up a pipeline to benchmark Gemma-2-2B-it against the existing prompt-based baselines. For the fine-tuning phase, are we prioritizing LoRA/QLoRA adapters to keep the compute “medium-sized,” or are we looking at full fine-tuning of these smaller models?
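For context on the compute question, here is my rough back-of-envelope estimate of how few parameters a rank-16 LoRA adapter would actually train. The dimensions below are assumptions for illustration, not the exact Gemma configuration:

```python
# Rough LoRA trainable-parameter count for a small decoder model.
# Hidden size and layer count are assumed values, not Gemma's real config.
def lora_params(d_in, d_out, rank):
    # LoRA replaces the update to a weight W (d_in x d_out) with two
    # low-rank factors: A (d_in x rank) and B (rank x d_out).
    return rank * (d_in + d_out)

hidden = 2048      # assumed hidden size
n_layers = 26      # assumed number of decoder layers
rank = 16
# Adapting the q, k, v, o projections, treated as hidden x hidden for simplicity.
per_layer = 4 * lora_params(hidden, hidden, rank)
total = n_layers * per_layer
print(f"~{total / 1e6:.1f}M trainable parameters")  # well under 1% of ~2B
```

Even under these rough assumptions the adapter stays in the single-digit millions of parameters, which is why LoRA/QLoRA seems compatible with a medium-sized compute budget.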

Since the goal is a “lightweight UI” for validation, I can leverage my web development background (React/Next.js) to build a fast annotation dashboard. I can prototype a simple Streamlit or Gradio interface this weekend that takes a raw Hindi sentence, displays the extracted triple, and allows a human annotator to “Accept/Edit/Reject” it.

Would you prefer the feedback data to be stored in a simple JSONL format for now, or should I look into integrating with a specific DBpedia ontology schema immediately?
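My current leaning is that the two options need not be exclusive: a JSONL record could carry the raw surface triple now and gain an optional DBpedia ontology mapping in a later alignment pass. A sketch of what one record might look like (all field names are my own illustrative assumptions):

```python
import json

# Illustrative JSONL record: raw surface triple plus an optional,
# later-filled DBpedia ontology mapping. Field names are assumptions.
record = {
    "sentence": "बर्लिन की दीवार का निर्माण 1961 में शुरू हुआ।",
    "triple": {
        "subject": "बर्लिन की दीवार",
        "predicate": "निर्माण शुरू हुआ",
        "object": "1961",
    },
    "decision": "accept",
    # Optional alignment added by a separate mapping pass:
    "dbo_mapping": {"predicate": "dbo:constructionYear"},
}
line = json.dumps(record, ensure_ascii=False)  # one line per annotation
```

That way annotation can start immediately in plain JSONL, and ontology alignment becomes a separate, restartable step rather than a blocker for the UI.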

I am currently reproducing the IndIE baseline on the Hindi BenchIE dataset as suggested. Looking forward to contributing!

Best,
Anushtup

Subject: Proposal: Robust SLM Fine-Tuning with Schema Guards & Feedback UI

Hi @tiwarisanju18 and everyone,

This is a fantastic initiative! As a native Hindi speaker, I’ve seen firsthand how standard extractors struggle with the nuances of Hindi grammar (as in constructions like “निर्माण शुरु हुआ,” “construction began”).

My Approach to the Goals: I plan to submit a proposal that integrates Fine-Tuning with a strict Validation Layer to ensure the “Human-in-the-Loop” process is efficient.

  1. Fine-Tuning (The Engine): I will fine-tune Gemma-2B/Llama-3-8B on Hindi DBpedia subsets to improve semantic understanding of local contexts.
  2. Schema Guardrails (The Safety Net): Small models often hallucinate output formats. I will adapt my work from PR #30 (Core Framework) to enforce strict JSON schemas after the SLM generates text. This ensures the “Human Feedback UI” receives clean, structured data—not broken strings.
  3. The UI (The Loop): I will build the feedback interface (React/Streamlit) to capture user corrections. These corrections will be fed back into the SLM for iterative improvement (Active Learning).
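To make point 2 concrete, here is a rough stdlib-only sketch of the guardrail idea. This is illustrative only, not the actual PR #30 code, and the expected keys are an assumed output contract:

```python
import json

# Post-generation guardrail sketch: accept the SLM's raw text only if it
# parses into the expected triple shape. Illustrative, not framework code.
REQUIRED_KEYS = {"subject", "predicate", "object"}

def validate_triple_output(raw_text):
    """Return a clean triple dict, or None if the output is malformed."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # broken string: route back for retry/repair, not to the UI
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    if not all(isinstance(v, str) and v.strip() for v in data.values()):
        return None
    return data

ok = validate_triple_output(
    '{"subject": "Berlin_Wall", "predicate": "dbo:constructionYear", "object": "1961"}'
)
bad = validate_triple_output("subject: Berlin_Wall ...")  # malformed, returns None
```

Anything that fails validation never reaches annotators, so human time is spent correcting semantics rather than repairing broken strings.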

I believe this “Structured Fine-Tuning” approach is the only way to make SLMs reliable enough for the main DBpedia graph.

Drafting the full proposal now!

Best, Aryan