DBpedia Hindi Chapter 2026: Fine-Tuning Indic Models for Hindi Relational Triple Extraction + Human-in-the-Loop Feedback — GSoC 2026

Hi @tiwarisanju18 mam and everyone

I’m interested in the DBpedia Hindi triple extraction project and have started working through the warm-up tasks. I’m currently going through the DBpedia extraction framework and exploring relation extraction approaches for Hindi text.

I’ve also begun setting up a small environment with HuggingFace to run initial experiments on Hindi data.

I had a couple of questions:

  1. For the initial baseline, would you suggest starting with prompt-based extraction or focusing early on fine-tuning a small model?
  2. Are there any recommended subsets of Hindi Wikipedia or datasets that would be good for early experimentation?

I’ll continue working through the warm-up tasks and would be happy to share results as I progress.

Thanks!

Hello @tiwarisanju18 ma’am, I have sent you the final proposal and would appreciate any feedback whenever you are available.

Regards,
Siddharth

Hi @tiwarisanju18 and mentors,

I am Krrish Kumar, and I am writing to express my strong interest in the DBpedia Hindi Chapter 2026 project.

I have been exploring the project’s focus on Small Language Models (SLMs) and the integration of human-in-the-loop feedback for Hindi relational extraction. To align with the project goals, I have already started working on the warm-up tasks, specifically:

  1. Setting up a standalone environment to run Gemma-3 benchmarks for triple extraction.
  2. Reviewing the Hindi BenchIE annotation schema to understand how to evaluate predicate mapping accuracy.
  3. Researching the Human-in-the-Loop workflow to design a lightweight interface for triplet validation.

I am currently developing a benchmark notebook to analyze the current zero-shot capabilities of Indic models on Wikipedia abstracts. I look forward to sharing my results and contributing to the Hindi DBpedia ecosystem.

Best regards, Krrish Kumar


2. Slack Message (Community Engagement)

Post this in the gsoc or #dbpedia-hindi channel.

“Hi everyone! :waving_hand: I’m Krrish. I’ve just introduced myself on the forum for the Hindi Relational Triple Extraction project. I’m currently diving into the Warm-up tasks, specifically focusing on reproducing baselines for Hindi BenchIE using Gemma-3. I’m also looking into the best practices for the Human-in-the-Loop feedback UI to ensure it integrates well with the extraction pipeline. Looking forward to connecting with the mentors! @tiwarisanju18”

Sounds cool

Dear Krrish, can you please send your full proposal by email to tiwarisanju18@ieee.org before the final deadline?

  1. A prompt-based baseline was already used in the earlier version.
  2. There is no particular subset.

Please share your proposal by email.

Hi Sanju Mam, @tiwarisanju18
Thanks a lot for clarifying. I am going through the GSoC25_H pipeline on GitHub. I’ve made significant progress on the warm-up tasks and wanted to share a quick update:

  1. Baseline Reproduction: I ran experiments with Gemma-2b on Hindi Wikipedia snippets. I’ve identified a critical ‘Semantic Inversion’ issue where the model separates Hindi postpositions (Kaaraks), as described in the IndIE paper, which confirms the findings in the BenchIE paper.
  2. Wikipedia Analysis: I’ve analyzed Hindi Wikipedia text characteristics, specifically focusing on free-word-order constructions that confuse standard neural extractors.
  3. Environment Setup: My HuggingFace/Colab environment is ready for LoRA fine-tuning.
  4. HITL Design: I am finalizing the Spring Boot schema for the feedback UI to ensure we collect ‘Minimality’ and ‘Exhaustiveness’ labels as defined by BenchIE.

I’m currently drafting the implementation timeline for my proposal and will share it shortly.

Thanks and regards,
Abhigyan Tiwari

Sure, please go ahead.

Warm-up Baseline Results: Zero-shot Gemma-3 on Hindi BenchIE Sample

Hi @tiwarisanju18 and mentors,

I’ve completed the baseline experiments and wanted to share concrete results.

I ran zero-shot Gemma-3-1b-it on 5 Hindi sentences with gold DBpedia annotations and evaluated subject, predicate, and object accuracy separately.

Results:

| Metric | Score |
|---|---|
| Subject Accuracy | 5/5 = 100% |
| Predicate Accuracy | 0/5 = 0% |
| Object Accuracy | 5/5 = 100% |
| Full Triple Match | 0/5 = 0% |

Key finding — Predicate Normalization is the core failure mode: The model correctly identifies entities but completely fails to map predicates to DBpedia ontology properties. Examples:

  • Extracted “का निर्माण” instead of dbo:builder
  • Extracted “was born in” (English!) instead of dbo:birthPlace
  • Extracted “है” instead of dbo:capital

This also reveals a secondary issue: language mixing — the model switches to English predicates for Hindi input, which breaks DBpedia ontology alignment entirely.

This confirms that predicate normalization via fine-tuning is the most critical gap to address, which is the central focus of my proposed approach.
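
For readers who want to reproduce this kind of per-slot scoring, a minimal sketch follows. It assumes exact string match on each slot; the `score_triples` helper and the two sample triples are hypothetical illustrations, not the evaluation script or data behind the numbers above.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def score_triples(predicted: List[Triple], gold: List[Triple]) -> Dict[str, float]:
    """Exact-match accuracy for each slot, plus full-triple match."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    counts = {"subject": 0, "predicate": 0, "object": 0, "full": 0}
    for (ps, pp, po), (gs, gp, go) in zip(predicted, gold):
        counts["subject"] += ps == gs      # bool adds as 0 or 1
        counts["predicate"] += pp == gp
        counts["object"] += po == go
        counts["full"] += (ps, pp, po) == (gs, gp, go)
    return {slot: hits / len(gold) for slot, hits in counts.items()}


# Made-up toy data: entities match, predicates are surface strings
# rather than ontology properties.
toy_predicted = [("ताजमहल", "का निर्माण", "शाहजहाँ"), ("भारत", "है", "नई दिल्ली")]
toy_gold = [("ताजमहल", "dbo:builder", "शाहजहाँ"), ("भारत", "dbo:capital", "नई दिल्ली")]
toy_scores = score_triples(toy_predicted, toy_gold)
```

On the toy data this gives full subject and object accuracy with zero predicate accuracy, the same pattern as the table. A real evaluation would likely need looser matching than exact string equality.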

Regards, Nitin

Warm-up Results: Gemma-3 Baseline + Error Taxonomy + Ontology Alignment + Working HITL Prototype

Hi mam @tiwarisanju18 and mentors,

I have completed all warm-up tasks and want to share concrete results across four areas. Everything below is reproducible.


1. Quantitative Baseline — Zero-shot Gemma-3-1b-it on Hindi BenchIE

I ran Gemma-3-1b-it in zero-shot mode on 5 Hindi sentences with gold DBpedia annotations and evaluated subject, predicate, and object accuracy separately.

| Metric | Score |
|---|---|
| Subject Accuracy | 5/5 = 100% |
| Predicate Accuracy | 0/5 = 0% |
| Object Accuracy | 5/5 = 100% |
| Full Triple Match | 0/5 = 0% |

The model understands Hindi entity boundaries perfectly but has zero ability to map predicates to DBpedia ontology properties.


2. Error Taxonomy — 3 Distinct Failure Modes Identified

Rather than reporting a single accuracy number, I categorized every failure by type:

| Error Type | Count | Example |
|---|---|---|
| Predicate Normalization Failure | 2/5 (40%) | Extracted “का निर्माण” instead of dbo:builder |
| Language Mixing | 2/5 (40%) | Extracted “was born in” (English) instead of dbo:birthPlace |
| Implicit Relation Error | 1/5 (20%) | Extracted “है” instead of dbo:capital |

Language Mixing is a previously undocumented failure mode — the model switches to English predicates for Hindi input, making DBpedia ontology alignment impossible downstream.

Implicit Relation errors (copula constructions like “X की राजधानी Y है”) are structurally distinct and require special handling — neither fine-tuning nor embedding similarity alone will fix them reliably.
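
One way such a taxonomy could be assigned automatically is sketched below. This is only a heuristic illustration: the `classify_predicate_error` function, the copula list, and the Latin-script check are assumptions, not the procedure used to produce the counts above.

```python
import re

LATIN = re.compile(r"[A-Za-z]")
# Assumed, non-exhaustive list of Hindi copula forms.
COPULAS = {"है", "हैं", "था", "थी", "थे"}


def classify_predicate_error(extracted: str, gold_property: str) -> str:
    """Label one extracted predicate against its gold DBpedia property."""
    text = extracted.strip()
    if text == gold_property:
        return "Correct"
    # Latin letters in a predicate that is not an ontology identifier
    # suggest the model switched to English.
    if LATIN.search(text) and not text.startswith("dbo:"):
        return "Language Mixing"
    # A bare copula carries no relation of its own.
    if text in COPULAS:
        return "Implicit Relation Error"
    return "Predicate Normalization Failure"
```

Applied to the three examples in the table, this returns "Predicate Normalization Failure" for “का निर्माण”, "Language Mixing" for “was born in”, and "Implicit Relation Error" for “है”.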


3. Ontology Alignment Layer — 0% → 80% Predicate Accuracy

I built a dedicated ontology alignment layer using multilingual sentence embeddings (paraphrase-multilingual-MiniLM-L12-v2) that maps extracted Hindi predicates to DBpedia ontology properties via cosine similarity.

| Stage | Predicate Accuracy |
|---|---|
| Zero-shot Gemma-3 alone | 0/5 = 0% |
| + Ontology Alignment Layer | 4/5 = 80% |

Key finding: The one remaining failure (“है” → dbo:capital) scored high confidence (0.691) but mapped to the wrong property. This is exactly where confidence-based flagging is critical — rather than silently passing a wrong triple into the knowledge graph, the system should flag it for human review. This directly motivates the HITL component.
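
The alignment step can be sketched as a nearest-neighbour lookup over property embeddings. The sketch below is a simplified illustration with hand-written toy vectors; in the setup described above the vectors would come from paraphrase-multilingual-MiniLM-L12-v2, and the `align_predicate` helper and its `threshold` parameter are hypothetical names, with the threshold value left to the caller because none is stated here.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align_predicate(
    pred_vec: Vector, property_vecs: Dict[str, Vector], threshold: float
) -> Tuple[str, float, bool]:
    """Return (best property, similarity, needs_review) for one predicate."""
    best_prop, best_score = max(
        ((prop, cosine(pred_vec, vec)) for prop, vec in property_vecs.items()),
        key=lambda item: item[1],
    )
    return best_prop, best_score, best_score < threshold


# Toy 3-d vectors standing in for real sentence embeddings (assumption).
toy_properties = {
    "dbo:builder": [1.0, 0.0, 0.0],
    "dbo:birthPlace": [0.0, 1.0, 0.0],
    "dbo:capital": [0.0, 0.0, 1.0],
}
clear_match = align_predicate([0.9, 0.1, 0.0], toy_properties, threshold=0.5)
weak_match = align_predicate([0.4, 0.3, 0.5], toy_properties, threshold=0.8)
```

A high-scoring wrong match such as the 0.691 case is only flagged if the threshold sits above that score, so the cutoff would need tuning on labeled data.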


4. Working HITL Feedback Prototype

I built a functional Human-in-the-Loop annotation interface in Streamlit that:

  • Displays each Hindi sentence alongside the model’s extracted triple
  • Shows the expected DBpedia property and detected error type
  • Allows reviewers to Accept, Reject, or Edit each triple
  • Captures structured error type labels from the taxonomy above
  • Outputs JSONL feedback data ready for retraining



The JSONL output already contains corrected DBpedia predicate mappings (e.g. “का निर्माण” → dbo:builder) that can directly feed back into fine-tuning — closing the human-in-the-loop cycle.
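
As a rough illustration of what one feedback record might look like, the sketch below serialises reviewer decisions to JSONL and reads them back. The `Feedback` class and all of its field names are assumptions made for this example, not the schema of the actual prototype.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Optional


@dataclass
class Feedback:
    sentence: str
    extracted_predicate: str
    decision: str                              # "accept", "reject", or "edit"
    corrected_predicate: Optional[str] = None  # filled in when decision == "edit"
    error_type: Optional[str] = None           # label from the error taxonomy


def to_jsonl(records: Iterable[Feedback]) -> str:
    """One JSON object per line; ensure_ascii=False keeps Devanagari readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text: str) -> List[Feedback]:
    """Parse JSONL back into records, skipping blank lines."""
    return [Feedback(**json.loads(line)) for line in text.splitlines() if line.strip()]


example = Feedback(
    sentence="X की राजधानी Y है",
    extracted_predicate="है",
    decision="edit",
    corrected_predicate="dbo:capital",
    error_type="Implicit Relation Error",
)
example_line = to_jsonl([example])
```

Records with a `corrected_predicate` could then be filtered out as (surface predicate, ontology property) pairs for the next fine-tuning round.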


Summary

The pipeline I have prototyped in these warm-up experiments directly validates the core project hypothesis: zero-shot Gemma-3 fails entirely at predicate normalization, the ontology alignment layer recovers 80% of that, and the remaining hard cases (implicit relations) are exactly what the HITL interface is designed to catch and correct.

Ma’am, I shared my proposal earlier and have since made some changes and improvements. I am sending the updated version to you by email.

Regards, Nitin Singh

Great Efforts @5051a9b40975112c9b4a

@tiwarisanju18 Thank you so much, ma’am. This kind message from you really makes my effort worth it. I have shared my final proposal by email and have also submitted it on the GSoC platform. Your suggestions or review would genuinely help me. The deadline is the 31st at 6 pm, so any feedback before then would mean a lot. I sent the email from nitinsingh3323@gmail.com (Nitin Singh) and have also added you as an editor on the proposal. Looking forward to contributing under your mentorship.

Warm-up Results: Hybrid Pipeline with Ontology Alignment + Spring Boot HITL — Abhigyan Tiwari
Hi @tiwarisanju18 mam and mentors,
I have completed my warm-up tasks. Rather than reporting isolated accuracy numbers, I built and evaluated a complete end-to-end pipeline where each component directly motivates the next.
Baseline — gemma-3-4b-it zero-shot:

I tested gemma-2b-it, gemma-3-1b-it, and gemma-3-4b-it. All except gemma-3-4b-it showed severe output instability (hallucination, format breaking, English switching mid-output). gemma-3-4b-it was selected as the stable baseline, which is itself evidence that prompting alone is insufficient.

Here’s the confidence scoring based on DBpedia properties, after extracting the triples:

Key finding from evaluation output:

[5] विराट कोहली एक भारतीय क्रिकेटर है।
  Expected: (विराट कोहली | rdf:type | dbo:Cricketer)
  Got:      (विराट कोहली | rdf:type | dbo:Person)
  Confidence: 0.231 → AUTO-FLAGGED for human review

This is the core insight: the model gets the structure right but the ontology class wrong — and the confidence score catches it. The triple routes automatically to the HITL dashboard rather than silently polluting the knowledge graph.

Pipeline: each stage feeds the next:

Postposition stripper → fixes Kaarak bleeding (increasing object accuracy)

Ontology alignment layer (MiniLM cosine similarity) → maps Hindi predicates to DBpedia properties, flags low-confidence triples (score < 0.45)

Spring Boot + MySQL HITL backend → low-confidence triples POST automatically as PENDING, annotators approve/reject via dashboard, approved triples become Gold Data for next fine-tuning cycle
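
The first stage, the postposition stripper, could look roughly like the sketch below. It assumes whitespace tokenisation and a small, non-exhaustive list of single-token postpositions; both the list and the `strip_postposition` name are illustrative assumptions rather than the prototype's actual code.

```python
# Assumed, non-exhaustive set of single-token Hindi postpositions.
POSTPOSITIONS = ("का", "की", "के", "को", "ने", "से", "में", "पर")


def strip_postposition(argument: str) -> str:
    """Drop trailing postposition tokens from an extracted subject or object.

    At least one token is always kept, so an argument that consists only of
    a postposition is returned unchanged instead of becoming empty.
    """
    tokens = argument.split()
    while len(tokens) > 1 and tokens[-1] in POSTPOSITIONS:
        tokens.pop()
    return " ".join(tokens)
```

For example, `strip_postposition("भारत की")` returns `"भारत"`, while `"विराट कोहली"` passes through untouched. Compound postpositions such as "के लिए" are not handled by this simple version.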

This is the key architectural difference from a standalone demo — the confidence threshold and the persistent backend will be connected. A wrong triple never silently enters the KG.
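
The routing rule itself is small. The sketch below splits scored triples at the 0.45 threshold mentioned above; the `ScoredTriple` class and `route` function are hypothetical names, the second example triple and its score are made up, and the HTTP POST to the Spring Boot backend is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Tuple

FLAG_THRESHOLD = 0.45  # triples scoring below this go to human review


@dataclass
class ScoredTriple:
    subject: str
    predicate: str
    obj: str
    confidence: float


def route(
    triples: List[ScoredTriple], threshold: float = FLAG_THRESHOLD
) -> Tuple[List[ScoredTriple], List[ScoredTriple]]:
    """Split triples into (accepted, pending_review) by confidence."""
    accepted = [t for t in triples if t.confidence >= threshold]
    pending = [t for t in triples if t.confidence < threshold]
    return accepted, pending


flagged = ScoredTriple("विराट कोहली", "rdf:type", "dbo:Person", 0.231)
# Made-up triple and score, included only to show the accepted branch.
confident = ScoredTriple("भारत", "dbo:capital", "नई दिल्ली", 0.80)
accepted, pending = route([flagged, confident])
```

Here the 0.231 triple from the evaluation output lands in `pending`, which in the full pipeline would be stored with PENDING status for annotators.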

I am integrating this full pipeline into my HITL dashboard, and it is roughly functional (you may check the full prototype through the link above).
I would appreciate feedback before today’s deadline, though I know you must all be busy evaluating proposals.

Regards, Abhigyan Tiwari
NIT Silchar | B.Tech CSE (2024–2028)

Hi @tiwarisanju18 mam and mentors,

Thank you for this incredible initiative and for being so responsive throughout the process — it made the warm-up period genuinely motivating.

Seeing @AbhigyanNoBC’s pipeline with the postposition stripper and confidence-based routing is exciting — the Kaarak bleeding issue is real and his architectural insight of auto-routing low-confidence triples to HITL rather than silently passing them is exactly the right instinct. It aligns closely with the confidence-threshold flagging I implemented in my ontology alignment layer, and it’s encouraging to see multiple approaches converging on the same core finding.

I wanted to briefly summarize where my submission stands as the deadline closes:

• Baseline: Zero-shot Gemma-3-1b-it — 100% subject/object accuracy, 0% predicate accuracy across 5 Hindi BenchIE sentences
• Error Taxonomy: Three failure modes identified and categorised — Predicate Normalization Failure (40%), Language Mixing (40%), Implicit Relation Error (20%)
• Ontology Alignment Layer: 0% → 80% predicate accuracy using paraphrase-multilingual-MiniLM-L12-v2 with cosine similarity and confidence-based flagging
• Working HITL Prototype: Functional Streamlit interface producing corrected JSONL output ready for fine-tuning retraining

My full proposal has been submitted on the GSoC platform and shared with you by email (nitinsingh3323@gmail.com). As a native Hindi speaker with hands-on NLP pipeline experience, I am deeply invested in improving Hindi’s representation in DBpedia — this is not just a coding project for me.

I look forward to any feedback you may have before the deadline, and I am grateful for the opportunity regardless of the outcome.

Regards,
Nitin Singh
KIIT University | B.Tech CSE (2023–2027)

Hi @5051a9b40975112c9b4a, thank you so much for the shout-out!

I really appreciated your breakdown of the Error Taxonomy—identifying that 40% ‘Language Mixing’ failure mode is a huge insight for the Hindi chapter. Your jump from 0% to 80% predicate accuracy using the paraphrase-multilingual model is seriously impressive and shows how much the ontology alignment layer was needed.

It’s great to see our approaches converging on the same core issues like ‘Kaarak bleeding.’ Whether it’s through confidence-based routing to HITL or your ontology alignment, the goal is the same: making Hindi triple extraction actually reliable.

On a personal note, diving deep into NLP and Transformers through this project has been an incredible experience. Engaging with the community and getting insights from the mentors and fellow contributors like you is exactly what inspires me to keep pushing my boundaries in this field.

Finally, a huge thanks to @tiwarisanju18 for the constant guidance and encouragement throughout this warm-up period. It’s been a steep but rewarding learning curve!

Looking forward to the results and potentially cross-pollinating these ideas.

Sure :+1:t2:

Hello @tiwarisanju18 mam and members,

I have opened issue #52 after finding something relevant.
Thank you.