Warm-up Results: Gemma-3 Baseline + Error Taxonomy + Ontology Alignment + Working HITL Prototype
Hi ma'am @tiwarisanju18 and mentors,
I have completed all warm-up tasks and want to share concrete results across four areas. Everything below is reproducible.
1. Quantitative Baseline — Zero-shot Gemma-3-1b-it on Hindi BenchIE
I ran Gemma-3-1b-it in zero-shot mode on 5 Hindi sentences with gold DBpedia annotations and evaluated subject, predicate, and object accuracy separately.
| Metric | Score |
| --- | --- |
| Subject Accuracy | 5/5 = 100% |
| Predicate Accuracy | 0/5 = 0% |
| Object Accuracy | 5/5 = 100% |
| Full Triple Match | 0/5 = 0% |
On these five sentences, the model identified the subject and object entities correctly every time, but it never mapped a predicate to the corresponding DBpedia ontology property.
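As a minimal sketch of how the per-slot scores above could be computed with exact matching: the `Triple` type, the `slot_accuracy` name, and the two example triples are illustrative assumptions, not the actual evaluation script or the five benchmark sentences.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def slot_accuracy(predicted, gold):
    """Exact-match accuracy for each slot and for the full triple."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    n = len(gold)

    def rate(key):
        # Fraction of sentence pairs on which the selected field matches.
        return sum(key(p) == key(g) for p, g in zip(predicted, gold)) / n

    return {
        "subject": rate(lambda t: t.subject),
        "predicate": rate(lambda t: t.predicate),
        "object": rate(lambda t: t.obj),
        "full_triple": rate(lambda t: t),
    }


# Hypothetical pairs for illustration only (not the experiment's data).
gold = [
    Triple("ताजमहल", "dbo:builder", "शाहजहाँ"),
    Triple("भारत", "dbo:capital", "नई दिल्ली"),
]
predicted = [
    Triple("ताजमहल", "का निर्माण", "शाहजहाँ"),
    Triple("भारत", "है", "नई दिल्ली"),
]
scores = slot_accuracy(predicted, gold)
```

Scoring each slot separately is what makes the pattern visible: a single full-triple score would hide the fact that only the predicate slot fails.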
2. Error Taxonomy — 3 Distinct Failure Modes Identified
Rather than reporting a single accuracy number, I categorized every failure by type:
| Error Type | Count | Example |
| --- | --- | --- |
| Predicate Normalization Failure | 2/5 (40%) | Extracted “का निर्माण” instead of dbo:builder |
| Language Mixing | 2/5 (40%) | Extracted “was born in” (English) instead of dbo:birthPlace |
| Implicit Relation Error | 1/5 (20%) | Extracted “है” instead of dbo:capital |
To my knowledge, Language Mixing has not been documented before as a failure mode for this task: the model switches to English predicates for Hindi input, which prevents direct matching against DBpedia ontology properties downstream.
Implicit Relation errors (copula constructions like “X की राजधानी Y है”) are structurally distinct and require special handling — neither fine-tuning nor embedding similarity alone will fix them reliably.
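The three categories could also be assigned automatically with a simple heuristic. The sketch below is an assumption about how that might look, not the procedure used for the table above: the copula list, the label strings, and the function name are all hypothetical, and it only handles free-text predicates.

```python
import re

# Assumed, non-exhaustive list of Hindi copula forms.
COPULAS = {"है", "हैं", "था", "थी", "थे"}

DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def classify_predicate_error(extracted: str, gold_property: str) -> str:
    """Assign one taxonomy label to an extracted predicate."""
    text = extracted.strip()
    if text == gold_property:
        return "correct"
    if text in COPULAS:
        # Bare copula: the relation is implicit in the sentence structure.
        return "implicit_relation"
    if LATIN.search(text) and not DEVANAGARI.search(text):
        # Latin-script predicate produced for a Hindi sentence.
        return "language_mixing"
    # Hindi surface form that was not normalized to an ontology property.
    return "predicate_normalization_failure"


labels = [
    classify_predicate_error("का निर्माण", "dbo:builder"),
    classify_predicate_error("was born in", "dbo:birthPlace"),
    classify_predicate_error("है", "dbo:capital"),
]
```

The copula check runs first because a bare “है” is also a Hindi surface form and would otherwise fall into the normalization bucket.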
3. Ontology Alignment Layer — 0% → 80% Predicate Accuracy
I built a dedicated ontology alignment layer using multilingual sentence embeddings (paraphrase-multilingual-MiniLM-L12-v2) that maps extracted Hindi predicates to DBpedia ontology properties via cosine similarity.
| Stage | Predicate Accuracy |
| --- | --- |
| Zero-shot Gemma-3 alone | 0/5 = 0% |
| + Ontology Alignment Layer | 4/5 = 80% |
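The cosine-similarity mapping can be sketched as follows. The 3-dimensional vectors are toy values so that the example runs without downloading a model; in a real run they would presumably come from encoding the predicate and each property label with paraphrase-multilingual-MiniLM-L12-v2 (for example through the sentence-transformers `SentenceTransformer(...).encode(...)` call). The function names and candidate property list are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_predicate(pred_vec, property_vecs):
    """Rank candidate ontology properties by similarity, best first."""
    scored = [(name, cosine(pred_vec, vec)) for name, vec in property_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical values, for illustration only).
property_vecs = {
    "dbo:builder": [0.9, 0.1, 0.0],
    "dbo:birthPlace": [0.1, 0.9, 0.1],
    "dbo:capital": [0.0, 0.2, 0.9],
}
pred_vec = [0.8, 0.2, 0.1]  # stands in for the embedding of “का निर्माण”

ranked = align_predicate(pred_vec, property_vecs)
best_property, best_score = ranked[0]
```

Returning the full ranking rather than only the top match keeps the runner-up score available, which is useful when deciding whether a mapping is trustworthy.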
Key finding: The one remaining failure (“है” → dbo:capital) scored high confidence (0.691) but mapped to the wrong property. This is exactly where confidence-based flagging is critical — rather than silently passing a wrong triple into the knowledge graph, the system should flag it for human review. This directly motivates the HITL component.
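One possible flagging rule is sketched below. Since a single score cutoff may or may not catch a case like 0.691 depending on where it is set, this version also checks the margin over the runner-up and routes bare copulas to review unconditionally. The thresholds, the copula set, and the placeholder property names are hypothetical, not values from the experiments.

```python
def needs_review(predicate, ranked, min_score=0.75, min_margin=0.10,
                 copulas=frozenset({"है", "हैं"})):
    """Decide whether an aligned predicate should go to a human reviewer.

    `ranked` is a list of (property, score) pairs sorted best first.
    All default thresholds are illustrative assumptions.
    """
    if not ranked:
        return True  # nothing to map to, so always ask a human
    top_score = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return (
        predicate.strip() in copulas          # implicit relation, always review
        or top_score < min_score              # low absolute confidence
        or (top_score - runner_up) < min_margin  # ambiguous between candidates
    )


# 0.691 is the score reported above; the property names and the
# runner-up score are placeholders.
flag_copula = needs_review("है", [("dbo:propertyA", 0.691), ("dbo:propertyB", 0.60)])
flag_clear = needs_review("का निर्माण", [("dbo:builder", 0.90), ("dbo:architect", 0.55)])
```

Flagged triples would then be the ones shown to reviewers first, while unflagged ones could be sampled for spot checks.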
4. Working HITL Feedback Prototype
I built a functional Human-in-the-Loop annotation interface in Streamlit that:
- Displays each Hindi sentence alongside the model’s extracted triple
- Shows the expected DBpedia property and detected error type
- Allows reviewers to Accept, Reject, or Edit each triple
- Captures structured error type labels from the taxonomy above
- Outputs JSONL feedback data ready for retraining
The JSONL output already contains corrected DBpedia predicate mappings (e.g. “का निर्माण” → dbo:builder) that can directly feed back into fine-tuning — closing the human-in-the-loop cycle.
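For illustration, one feedback record could be serialized roughly as below. The field names and the `FeedbackRecord` type are assumptions and may differ from the prototype's actual schema; the sentence is left as a placeholder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FeedbackRecord:
    sentence: str
    extracted_predicate: str
    corrected_predicate: str
    error_type: str
    decision: str  # "accept", "reject", or "edit"


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps Devanagari text readable in the output file.
    return "".join(
        json.dumps(asdict(record), ensure_ascii=False) + "\n" for record in records
    )


record = FeedbackRecord(
    sentence="<Hindi sentence>",  # placeholder, not real data
    extracted_predicate="का निर्माण",
    corrected_predicate="dbo:builder",
    error_type="predicate_normalization_failure",
    decision="edit",
)
jsonl_text = to_jsonl([record])
```

Keeping both the extracted and the corrected predicate in each record means the same file can serve as supervised pairs for fine-tuning and as an audit trail of reviewer decisions.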
Summary
The pipeline I have prototyped in these warm-up experiments directly validates the core project hypothesis: zero-shot Gemma-3 fails entirely at predicate normalization, the ontology alignment layer recovers 80% of that, and the remaining hard cases (implicit relations) are exactly what the HITL interface is designed to catch and correct.
Ma'am, I shared my proposal earlier and have since made some changes and improvements to it. I am sending the updated version to you by email.
Regards, Nitin Singh