New season of GSoC 2026

Hi DBpedians,

the new season of Google Summer of Code (GSoC) is about to start. Please check the timeline.

We can submit our application at the beginning of February 2026. We have around 3 weeks to come up with some cool new projects. So, mentors, we need your ideas!

Start collecting them here https://forum.dbpedia.org/c/projects/gsoc/8 and tag them with #gsoc2026-ideas. Please use the structure we have always used to provide the details of each project idea.

DESCRIPTION
Goal
Impact
Warm-up tasks
Mentors
Project size (90h, 175h or 350h)
Keywords

For any questions or remarks, just get back to us here in the forum or on Slack.

Cheers,

Julia
on behalf of the DBpedia Association

Hi @SyedaAlizah, thanks for sharing your intention to continue this project.

Just a small detail. Could you please remove the mentors’ names here? They will write their own post or reply to yours if they decide to join GSoC 2026 and continue this project. Thanks.

Thanks for pointing that out, @tsoru. I can’t edit the post anymore, so I’ll reply with a corrected version of the idea without the mentors’ names. Appreciate the clarification.

SyedaAlizah

## Stabilizing, Completing, and Upstreaming the Hindi DBpedia IE Pipeline

DESCRIPTION

During GSoC 2025, a comprehensive Hindi Information Extraction (IE) pipeline was developed for DBpedia, covering SLM-guided triplet extraction, IndIE enhancement, link prediction, predicate linking to the DBpedia ontology, and SPARQL endpoint deployment. While the core functionality has been implemented and evaluated, several critical components remain unfinished or unmerged, preventing the pipeline from being production-ready and fully integrated into DBpedia’s main infrastructure.

This project focuses on completing, stabilizing, and upstreaming the Hindi IE pipeline developed in GSoC 2025. The work emphasizes finalizing pending technical components (notably type/isA predicate linking, mappings integration, and finetuning), improving robustness and reproducibility, cleaning overlaps across modules, and preparing the existing pull requests and infrastructure for merge into the main DBpedia repositories.

Rather than introducing an entirely new pipeline, the project consolidates and strengthens a substantial existing contribution, ensuring long-term usability, maintainability, and extensibility of Hindi DBpedia.


GOAL

  1. Complete pending GSoC 2025 tasks, including:
  • Predicate linking for type (isA / rdf:type) relations, where objects are ontology classes rather than properties
  • Finalizing Hindi mappings via the DBpedia mappings UI
  • Optional finetuning of Gemma-3 using the existing filtered synthetic dataset
  2. Stabilize the Hindi IE pipeline by:
  • Improving error handling and logging across stages (IndIE, LLM-IE, predicate linking)
  • Ensuring reproducible runs using clean configuration, caching, and setup instructions
  • Reducing code duplication across IndIE, ReAct, and llm_IE modules
  3. Prepare and upstream existing work by:
  • Cleaning and finalizing pending PRs (e.g., neural-extraction-framework#20, extraction-framework#776)
  • Aligning the GSoC25_H fork with DBpedia’s main repository structure
  4. Enable production readiness by:
  • Preparing deployment-ready SPARQL setup for a permanent Hindi DBpedia endpoint
  • Documenting deployment, usage, and troubleshooting for maintainers and contributors
  5. Lower the entry barrier for new contributors through clear documentation and well-defined extension points.
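
To make the stabilization goal a bit more concrete, here is a minimal sketch of what shared error handling, logging, and a reproducibility key could look like. It is not taken from the GSoC25_H code: the names `run_stage`, `config_fingerprint`, and `normalize` are hypothetical, and the real stages (IndIE, LLM-IE, predicate linking) would be plugged in where the toy function is.

```python
import hashlib
import json
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("hindi_ie")  # hypothetical logger name


def config_fingerprint(config: dict) -> str:
    """Return a short, stable hash of a config dict (usable as a cache key)."""
    canonical = json.dumps(config, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run_stage(name, func, items):
    """Apply func to every item; log and collect failures instead of aborting."""
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            results.append(func(item))
        except Exception as exc:  # broad on purpose: one bad input must not stop the run
            log.warning("stage=%s item=%d failed: %r", name, index, exc)
            failures.append((index, repr(exc)))
    log.info("stage=%s ok=%d failed=%d", name, len(results), len(failures))
    return results, failures


def normalize(sentence: str) -> str:
    """Toy stand-in for a real pipeline stage."""
    cleaned = sentence.strip()
    if not cleaned:
        raise ValueError("empty sentence")
    return cleaned


# Toy run: the second input fails, the other two still go through.
config = {"stage": "normalize", "model": "placeholder", "seed": 42}
outputs, errors = run_stage("normalize", normalize, [" राम घर गया ", "   ", "सीता"])
cache_key = config_fingerprint(config)
```

Using one wrapper like this for every stage would give uniform log lines and a per-stage failure list, and the fingerprint changes only when the configuration changes, so cached outputs can be tied to the exact settings that produced them.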

IMPACT

Immediate impact

  • Delivers a stable, merge-ready Hindi IE pipeline for DBpedia
  • Enables consistent extraction, ontology linking, and querying over Hindi Wikipedia

Community impact

  • Reduces technical debt accumulated across multiple GSoC cycles
  • Makes Hindi DBpedia easier to maintain and extend by future contributors

Sustainability

  • Improves reproducibility and robustness for low-resource language IE pipelines
  • Establishes a cleaner foundation for multilingual expansion beyond Hindi

Research & practice

  • Strengthens reproducible evaluation for SLM-based IE and hybrid rule/LLM pipelines
  • Improves ontology alignment for non-English knowledge graphs

WARM-UP TASKS

  • Clone and run the complete Hindi IE pipeline (GSoC25_H) end-to-end
  • Reproduce published results on the Hindi-BenchIE dataset and report deviations
  • Execute predicate linking for both property and type (rdf:type) relations and analyze failure cases
  • Identify and fix at least one concrete issue (bug, logging gap, or documentation flaw)
  • Submit a small but meaningful PR (documentation, error handling, or predicate linking improvement)
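
For the predicate linking warm-up, the difference between property and type relations can be sketched as follows. This is only an illustration under stated assumptions: the lookup tables, the set of type-indicating predicates, the `link_triple` helper, and the `http://hi.dbpedia.org/resource/` namespace are assumptions made for the example, not the pipeline's actual linker. A `None` result marks a failure case worth analyzing.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DBO = "http://dbpedia.org/ontology/"
DBR_HI = "http://hi.dbpedia.org/resource/"  # assumed Hindi resource namespace

# Hypothetical lookup tables; a real linker would rank candidates instead.
PROPERTY_LINKS = {"जन्म स्थान": "birthPlace"}  # surface predicate -> ontology property
CLASS_LINKS = {"क्रिकेटर": "Cricketer"}         # surface object -> ontology class
TYPE_PREDICATES = {"है", "isA"}                # predicates treated as type relations


def to_resource(label: str) -> str:
    """Build a resource IRI from a label (spaces become underscores)."""
    return DBR_HI + label.strip().replace(" ", "_")


def link_triple(subject: str, predicate: str, obj: str):
    """Return an (s, p, o) IRI triple, or None when no ontology link is found."""
    s = to_resource(subject)
    if predicate in TYPE_PREDICATES:
        # Type relation: the predicate is fixed to rdf:type and the OBJECT
        # is linked to an ontology class.
        cls = CLASS_LINKS.get(obj)
        return (s, RDF_TYPE, DBO + cls) if cls else None
    # Property relation: the PREDICATE is linked to an ontology property
    # and the object stays a resource.
    prop = PROPERTY_LINKS.get(predicate)
    if prop is None:
        return None
    return (s, DBO + prop, to_resource(obj))


type_triple = link_triple("सचिन तेंदुलकर", "है", "क्रिकेटर")
property_triple = link_triple("सचिन तेंदुलकर", "जन्म स्थान", "मुंबई")
unlinked = link_triple("सचिन तेंदुलकर", "खेलता है", "क्रिकेट")  # no link known -> None
```

Counting how many extracted triples end up as `None` in each branch is one simple way to separate type-linking failures from property-linking failures when reporting results.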

PROJECT SIZE

175 hours (Medium)

KEYWORDS

Information Extraction, Knowledge Graphs, Hindi NLP, DBpedia, Predicate Linking,
Ontology Alignment, Low-Resource Languages, Pipeline Stabilization, SPARQL, Reproducibility

#gsoc2026-ideas