Towards a Neural Extraction Framework — GSoC 2026 ⚠️ mentors needed

This project started in 2021 and is looking forward to its 6th participation in DBpedia’s GSoC.

Description

Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity is — at the time of writing — semantically connected to 299 base entities.

However, only 9 of these 299 entities are also linked from :Berlin_Wall via another predicate. This suggests that, in the large majority of cases, it is not clear what kind of relationship exists between the entities. In other words, DBpedia does not know which specific RDF predicate links the subject (in our case, :Berlin_Wall) to any of the remaining 290 objects.
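For instance, the count of wiki-link objects that share no other predicate with the subject can be reproduced against the live endpoint with a query along these lines, here issued through the SPARQLWrapper package (a minimal sketch; the exact figures will drift as Wikipedia evolves):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Count wiki-link objects of :Berlin_Wall that no other predicate connects
    # to the subject (roughly the "remaining 290 objects" mentioned above).
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT (COUNT(DISTINCT ?o) AS ?unlabelled) WHERE {
      dbr:Berlin_Wall dbo:wikiPageWikiLink ?o .
      FILTER NOT EXISTS {
        dbr:Berlin_Wall ?p ?o .
        FILTER (?p != dbo:wikiPageWikiLink)
      }
    }
    """)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    print(result["results"]["bindings"][0]["unlabelled"]["value"])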

Currently, such relationships are extracted from tables and the infobox, usually found at the top right of a Wikipedia article, via the Extraction Framework. Instead of extracting RDF triples from semi-structured data only, we have built an end-to-end system that leverages information found in the entirety of a Wikipedia article, including the page text.

All source code is stored in the project repository: https://github.com/dbpedia/neural-extraction-framework

Goal

The goal of this project is to develop a framework for predicate resolution of wiki links among entities by:

  1. harnessing neural models to extract knowledge from text and
  2. validating their output against the DBpedia ontology.

During the first GSoC years, we employed a suite of machine-learning models to perform joint entity-relation extraction on open-domain text. Additionally, we implemented an end-to-end system that translates any English sentence into triples using the DBpedia vocabulary. Over the last two years, we improved the quality of output triples using a chain-of-thought approach powered by a large language model.

However, the current algorithm still suffers from the following issues; we now want to devise a method that solves as many of them as possible.

  1. The generated triples are not validated against the DBpedia ontology and may thus introduce inconsistencies into the data.
  2. The current models are not efficient enough to scale to millions of entities.
  3. The extracted relations are not categorised with respect to their semantics (e.g. reflexive/irreflexive, symmetric/antisymmetric/asymmetric, transitive, equivalence).
  4. Ideally, our algorithm should be able to adapt its output not only to the DBpedia vocabulary but to any specified one (e.g., SKOS, schema.org, Wikidata, RDFS, or even a combination of many).
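As a starting point for issue 3, many of these characteristics can simply be read off the ontology itself; a minimal sketch, assuming a local copy of the DBpedia ontology (the file name below is only illustrative), could look like this:

    from rdflib import Graph, RDF, Namespace

    OWL = Namespace("http://www.w3.org/2002/07/owl#")

    g = Graph()
    g.parse("dbpedia_ontology.nt")  # illustrative path to a local ontology dump

    characteristics = {
        OWL.SymmetricProperty: "symmetric",
        OWL.AsymmetricProperty: "asymmetric",
        OWL.TransitiveProperty: "transitive",
        OWL.ReflexiveProperty: "reflexive",
        OWL.IrreflexiveProperty: "irreflexive",
    }

    # Group every declared property by the OWL characteristics it carries.
    for owl_class, label in characteristics.items():
        for prop in g.subjects(RDF.type, owl_class):
            print(f"{prop} is {label}")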

Alternative extraction targets

The current pipeline targets relationships that are explicitly mentioned in the text. The contributor may also choose to extract complex relationships, such as:

  • Causality. (Addressed during GSoC 2021, but not completed.) The direct cause-effect relationship between events, e.g., from the text

The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.

extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification

  • Issuance. An abstract entity assigned to some agent, e.g., from the text

Messi won the award, his second consecutive Ballon d’Or victory.

extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi

Material

The contributor may use any LLM API and/or Python deep learning framework. The following resources are recommended (but not compulsory) for use in the project.

  • The project repository linked above and the machine-learning models mentioned in the readme files found in each GSoC folder.
  • The 2025 and 2024 blog posts, to understand the project’s status quo.
  • Python Wikipedia makes it easy to access and parse data from Wikipedia.
  • Huggingface Transformers for Natural Language Inference can be extremely useful to extract structured knowledge from text or perform zero-shot classification (see the sketch after this list).
  • DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
  • DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be available to the contributor (e.g., given an entity, list all strings that point to it).
  • An example of an excellent proposal that was accepted a few years ago.
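For example, a simple zero-shot relation scorer built on the Transformers pipeline mentioned above could look like the sketch below (model and candidate labels are only illustrative):

    from transformers import pipeline

    # Zero-shot relation scoring with an off-the-shelf NLI model.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

    sentence = ("The Peaceful Revolution led to the reunification of Germany "
                "in October 1990.")
    candidate_relations = ["effect", "location", "participant", "date"]

    result = classifier(sentence, candidate_labels=candidate_relations)
    print(list(zip(result["labels"], result["scores"])))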

Project size

The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 350).

Impact

This project will potentially generate millions of new statements. This new information could be released by DBpedia to the public as part of a new dataset. The creation of a neural extraction framework could introduce the use of robust parsers for a more accurate extraction of Wikipedia content.

Warm-up tasks

Mentors

Ara Yeroyan, @tsoru (tentative)


Hello @tsoru and everyone,

Thanks for posting this Project for GSoC 2026. I am strongly interested in contributing to the “Towards a Neural Extraction Framework” project.

I’ve been following the project’s evolution, and the focus on Scalability and Ontology Validation really stands out to me. Moving from experimental models to a production-ready system that can handle millions of entities is exactly the kind of engineering challenge I have tackled before and want to keep working on.

Current Progress on Warm-up Tasks: I’ve started auditing the GSoC25 codebase to understand the bottlenecks:

  • Stability Fixes: I noticed the legacy EntityLinking tests were crashing due to missing dependencies/imports. I submitted PR #27 to fix the test harness so it runs smoothly for new contributors.
  • Scalability & API Limits: I am testing Emeddings.py and confirmed that it currently fails to run due to Gemini 429 rate-limiting errors.

Next Steps: To address the scalability goal, I am prototyping a FAISS-based local predicate retriever.

Instead of hitting the Gemini API for every embedding (which causes the 429 errors), I am migrating the logic to a pre-computed local vector index. This should drastically reduce latency and allow us to run high-volume extractions (like the “Berlin Wall” dataset) offline.
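A minimal sketch of the retriever I am prototyping (the embedding model and the predicates.csv column name are assumptions; the real file layout may differ):

    import faiss
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    predicates = pd.read_csv("predicates.csv")["label"].tolist()  # column name assumed

    # Build the index once, offline; queries never touch a remote API.
    vectors = model.encode(predicates, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
    index.add(vectors)

    def retrieve(phrase: str, k: int = 5):
        query = model.encode([phrase], normalize_embeddings=True)
        scores, ids = index.search(query, k)
        return [(predicates[i], float(s)) for i, s in zip(ids[0], scores[0])]

    print(retrieve("was born in"))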

I plan to share the prototype results shortly. Please let me know if you have any specific feedback on my approach or on my PR #27.

Best, Siddharth

Hi @tsoru,

I’m continuing to explore the existing pipeline and current scalability/ontology-validation bottlenecks, and I’m drafting a proposal aligned with these objectives.

I’ll share updates from the local predicate retrieval experiments shortly. Please let me know the preferred next step for proposal submission.

Best regards,
Siddharth

Subject: Prototyping Strict Schema Validation for Ontology Compliance (PR #30)

Hi @tsoru and everyone,

I am Aryan Gairola, a Computer Science undergrad (2026). I’ve been following the Neural Extraction Framework’s evolution and am particularly interested in solving the “Ontology Validation” challenge mentioned in the project description.

What I Have Done So Far: While auditing the codebase to understand the “Berlin Wall” inconsistency problem, I identified a critical stability issue in NEF.py. The legacy index-based prompting was prone to “off-by-one” errors, and worse, it defaulted to allowed[0] when the model hallucinated—silently corrupting the Knowledge Graph with false positives.

To address this immediately, I submitted PR #30 (currently passing all CI checks), which refactors the extraction logic to use:

  1. Native Pydantic Schemas: enforcing strict structural validation on the LLM’s output using Gemini’s response_schema, rather than relying on prompt engineering to coax the model into the right format.
  2. Explicit Hallucination Rejection: the pipeline now safely returns None if the model invents a predicate, rather than forcing an incorrect relation that would corrupt the knowledge graph. (A minimal sketch of the idea follows below.)
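To make the mechanism concrete, here is a stripped-down sketch of what PR #30 does, based on my reading of the google-genai structured-output docs; the model name, prompt, and helper function are illustrative rather than the exact code in the PR:

    from google import genai
    from pydantic import BaseModel

    class Triple(BaseModel):
        subject: str
        predicate: str
        object: str

    client = genai.Client()  # picks up the API key from the environment

    def extract_triple(sentence: str, allowed: list[str]) -> Triple | None:
        response = client.models.generate_content(
            model="gemini-2.0-flash",  # illustrative model name
            contents=f"Extract one triple from: {sentence}. Allowed predicates: {allowed}",
            config={"response_mime_type": "application/json", "response_schema": Triple},
        )
        triple = response.parsed
        # Reject instead of silently falling back to allowed[0].
        if triple is None or triple.predicate not in allowed:
            return None
        return triple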

Future Contribution Plan:

  1. Semantic Validation: Beyond the JSON structural validation I implemented in PR #30, I will add a Semantic Layer. This module will query the DBpedia Ontology (checking rdfs:domain and rdfs:range) to reject logical hallucinations—e.g., preventing the model from assigning a dbo:birthDate (Property of a Person) to the :Berlin_Wall (a Place).
  2. Deterministic Inference: Instead of relying on the LLM to “guess” logic, I will implement a Post-Processing Inference Rule Engine. By reading standard OWL definitions (like owl:SymmetricProperty) directly from the ontology source, the system can automatically generate implicit facts (e.g., inferring :Bob spouse :Alice from :Alice spouse :Bob) with 100% accuracy and zero extra token cost.
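A minimal sketch of the semantic layer from point 1, checking a candidate predicate’s rdfs:domain against the subject’s type via the public SPARQL endpoint (a production version would query a local ontology copy, and would also need a policy for predicates that declare no domain at all):

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")

    def domain_compatible(subject_uri: str, predicate_uri: str) -> bool:
        # True only if the subject's type (or a superclass) matches the predicate's domain.
        endpoint.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        ASK {{
          <{predicate_uri}> rdfs:domain ?d .
          <{subject_uri}> a/rdfs:subClassOf* ?d .
        }}
        """)
        endpoint.setReturnFormat(JSON)
        return endpoint.query().convert()["boolean"]

    # dbo:birthDate has domain dbo:Person, so this should print False:
    print(domain_compatible("http://dbpedia.org/resource/Berlin_Wall",
                            "http://dbpedia.org/ontology/birthDate"))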

I’d love to hear your thoughts on whether this “Correctness-First” approach aligns with the team’s vision for the Hybrid Pipeline.

Ready for review on PR #30 whenever you have a moment!

Best regards,
Aryan Gairola

Hey @tsoru,

I went through the year-on-year progress with NEF since 2021, and it’s impressive how a variety of techniques were used to improve the resolution of RDF triples. Currently going through the latest GSoC 2025 Codebase, here are my early observations:

  1. Ontology Validation

The current pipeline (NEF.py) does not validate generated triples against the DBpedia ontology. Predicates are selected via embedding similarity from a pre-computed predicates.csv, with only a threshold filter (default 0.5) and word-count validation (1-3 words). I noticed there’s a prototype in TestFiles/predGen.py that loads the DBpedia ontology via RDFLib and validates predicate existence using SPARQL ASK queries—but this isn’t integrated into the main pipeline. Domain/range constraint checking is also absent, which could lead to semantically invalid triples (e.g., assigning dbo:birthPlace to a non-Person entity).

  2. Scalability

The Redis-backed entity linking is a solid foundation for scale, but the pipeline currently makes synchronous Gemini API calls per sentence for both extraction and disambiguation. For millions of entities, this would hit rate limits and cost constraints. There’s no batching, async processing, or local model fallback currently implemented.
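A bounded-concurrency wrapper would already go a long way here; a minimal sketch (where extract_from_sentence is just a placeholder for the existing per-sentence call) could look like this:

    import asyncio

    async def extract_from_sentence(sentence: str):
        # Placeholder: wrap the existing synchronous Gemini call so it does not
        # block the event loop.
        return await asyncio.to_thread(lambda: f"triples for: {sentence}")

    async def extract_batch(sentences: list[str], max_concurrency: int = 8):
        semaphore = asyncio.Semaphore(max_concurrency)

        async def guarded(sentence: str):
            async with semaphore:
                return await extract_from_sentence(sentence)

        return await asyncio.gather(*(guarded(s) for s in sentences))

    print(asyncio.run(extract_batch(["Berlin Wall fell in 1989."])))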

  3. Relation Semantics

There’s no categorization of extracted relations by their logical properties (reflexive, symmetric, transitive, etc.). The predicate retrieval is purely similarity-based without any semantic metadata attached. This would be valuable for downstream reasoning and consistency checking.

  4. Vocabulary Flexibility

The codebase is currently tightly coupled to DBpedia: URIs are hardcoded with About: http://dbpedia.org/resource/ prefixes, and the predicate embeddings are specific to the DBpedia ontology. There is no abstraction layer or configuration option for alternative vocabularies such as Wikidata or SKOS.

I’d be interested in working on these problem statements. Happy to discuss potential approaches!

Subject: Prototyping the Neuro-Symbolic Pipeline (Architecture, Reasoning & Validation)

Hi @tsoru and the whole DBpedia community!

To ensure the GSoC 2026 proposal directly addresses the core data quality and scalability bottlenecks identified in the project scope, I have spent the last week engineering and stress-testing a Minimum Viable System (MVS).

My goal was to move beyond theoretical models and prove that we can architecturally solve the three critical failures of the current framework: Ambiguity, Hallucination, and Latency.

Here are the results of the architectural experiments.

1. System Architecture (The Hybrid Resolver)

Instead of relying solely on LLMs (which are prone to hallucination) or rigid lookups (which fail on slang), I prototyped a multi-stage Neuro-Symbolic Pipeline.

  • Mechanism: It cascades through Strict Redirects (for acronyms like “UK”) → Opensearch API (for slang like “Barca”) → Fuzzy Matching (for context); a sketch follows below.
  • Impact: This structure mimics human research behavior to maximize recall without sacrificing precision.
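A condensed sketch of that cascade (the redirect table, fuzzy scoring, and example mentions are illustrative, not the exact prototype code):

    import requests
    from rapidfuzz import fuzz

    REDIRECTS = {"UK": "United Kingdom", "Barca": "FC Barcelona"}  # illustrative table

    def opensearch(mention: str) -> list[str]:
        params = {"action": "opensearch", "search": mention, "limit": 5, "format": "json"}
        return requests.get("https://en.wikipedia.org/w/api.php", params=params).json()[1]

    def resolve(mention: str, context: str):
        if mention in REDIRECTS:                       # 1. strict redirects
            return REDIRECTS[mention]
        candidates = opensearch(mention)               # 2. opensearch API
        if candidates:                                 # 3. fuzzy match against the context
            return max(candidates,
                       key=lambda c: fuzz.partial_ratio(c.lower(), context.lower()))
        return None

    print(resolve("Barca", "Messi scored twice as Barca beat Real Madrid."))
    print(resolve("Man City", "Man City won the league under Guardiola."))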

2. The “Ambiguity Solver” (Context-Aware Entity Resolution)

The Problem: Standard extractors often fail on nicknames (“Barca”) or context-dependent entities (e.g., distinguishing “Man City” the club from the city).
The Solution: I prototyped a context-aware linking layer that uses fuzzy matching and synonym expansion.

  • Result: As shown in the logs below, the system correctly resolved slang (Barca → FC Barcelona) and disambiguated entities within dense sentences (“Man City” → Manchester City F.C.) with high precision.


3. Multi-Relational Extraction (Handling Complex Sentences)

The Problem: Real-world Wikipedia sentences are dense, often containing multiple facts in a single line.
The Solution: I refined the extraction logic to handle multi-entity dependency chains.

  • Result: In the Complex Sentence Stress Test below, the pipeline successfully extracted three distinct relations from a single input (“Ronaldo plays for Real Madrid and they won the UCL under Zidane”), correctly mapping dbo:team, dbo:award, and dbo:manager simultaneously.

4. The Reasoning Engine (Graph Traversal)

The Problem: Extracting facts is not enough; we need to connect them.
The Solution: I implemented a client-side BFS (Breadth-First Search) pathfinder to discover semantic chains between disjoint entities.

  • Result: The system successfully reconstructed the semantic chain: Lionel Messi → birthPlace → Argentina National Team → Spain (visualized below).
  • The Infrastructure Gap: Standard SPARQL property paths timed out on the public endpoint for this 2-hop query. This empirically validates the need for the Dockerized Local Endpoint I am proposing, which will enable sub-second deep reasoning.
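For reference, a client-side pathfinder of this kind needs little more than the following sketch (depth and LIMIT values are deliberately small, which is exactly why the public endpoint becomes the bottleneck):

    from collections import deque
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")

    def neighbours(entity: str) -> list[str]:
        sparql.setQuery(f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?o WHERE {{ <{entity}> dbo:wikiPageWikiLink ?o }} LIMIT 200
        """)
        sparql.setReturnFormat(JSON)
        rows = sparql.query().convert()["results"]["bindings"]
        return [r["o"]["value"] for r in rows]

    def bfs_path(start: str, goal: str, max_depth: int = 2):
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            if len(path) > max_depth:
                continue
            for nxt in neighbours(path[-1]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None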

5. The “Hallucination Buster” (Ontology Validation)

The Problem: LLMs often invent relations (e.g., Elon Musk → founder of → Amazon).
The Solution: I built a Neuro-Symbolic Validation Layer that cross-references extracted triples against live DBpedia constraints.

  • Result: The system accepted valid historical links (Ronaldo → Real Madrid) but automatically rejected statistically probable but factually wrong hallucinations (Ronaldo → Man City).

6. Linguistic Robustness (Active vs. Passive Voice)

The Problem: The pipeline must handle linguistic variance, as noted in the project goal (extracting Ballon d’Or recipients).
The Solution: I verified the extraction logic against the specific “Messi / Ballon d’Or” example. The system correctly mapped the predicate regardless of sentence structure.

Next Steps: Infrastructure (Docker)

These experiments prove the logic is sound, but the infrastructure limits scale. To move this from “Prototype” to “Production,” I am finalizing the docker-compose setup for the newly merged codebase. This will allow us to run these heavy reasoning and validation jobs locally without hitting public API rate limits.

I will submit the Draft PR for containerization shortly.

Reproducibility: You can view the full source code, benchmark dataset (sentences.json), and execution logs in my fork here: Github fork of the prototype

I’d really appreciate feedback on my work so far on the Neural Extraction Framework project, as well as pointers on what I could work on next.

Best regards,
Nakul Singh

Update: PR Submitted

As promised, I have opened the Pull Request to integrate this Dockerized Neuro-Symbolic Infrastructure into the main repository.

PR #32: Feat: Dockerize Neural Extraction Pipeline for Reproducibility https://github.com/dbpedia/neural-extraction-framework/pull/32

This PR includes:

  1. Containerized Setup: One-command execution (docker compose up) for the entire pipeline.
  2. The Neuro-Symbolic Logic: The src/ modules for the Hybrid Resolver, Fact Validator, and Reasoning Engine.
  3. Docs: Updated README with architecture diagrams and benchmarks.

I’d really appreciate your reviews on this.

Best regards,
Nakul Singh

Hey, I’ve been on a wikiwalk and landed here. Super interesting project. I might run into this problem eventually in my current work, so I’ll keep you posted about collaborating.

Subject: Re: Prototyping the Neuro-Symbolic Pipeline (Architecture, Reasoning & Validation)

Hi everyone,

Quick update on the pipeline architecture.

Based on the excellent feedback from @tsoru over on GitHub regarding the GSoC25 infrastructure, I completely overhauled the prototype to drop the external Opensearch/Wikipedia APIs and transition to a fully Redis-backed, offline entity resolution system.

I have updated PR #32 with the new architecture.

Key Upgrades in the New Iteration:

  • Zero-Config Docker & Redis Setup: The pipeline now spins up a local Redis instance and seeds it with mock DBpedia fixtures (seed_redis.py) to test end-to-end extraction without needing the 50GB data dump.

  • Graph-Proximity Reasoning (Mentor Feedback Implemented): Taking the suggestion to exploit wikiPageWikiLink for relatedness, the Validator now has a two-tier system. If a strict ontology match fails (e.g., historical facts like Ronaldo → playsFor → Real Madrid), it automatically falls back to a BFS traversal to validate contextual graph proximity.

  • The “Hallucination Buster” in Action: The symbolic layer successfully intercepts and drops statistically probable but factually incorrect LLM hallucinations (e.g., rejecting Ronaldo → Chicago Bulls) while preserving valid complex sentences.

You can check out the updated visual logs and the exact pipeline execution in the updated PR description here: Link to PR #32

For a cleaner look at the isolated architecture, the updated Docker instructions, and my GSoC 2026 Roadmap, I have also organized the prototype into a standalone showcase repository here: Link to the dbpedia-entity-linker repo

With this architectural foundation proven, I am now translating these findings into my official GSoC 2026 Proposal draft. Thank you again to the mentors for pointing me toward the Redis integration—it made the system infinitely more robust!

Best regards,

Nakul Singh

Subject: Re: Docker Architecture - Handling Production Data vs. Mock Seeding

Hi @Nakul, I agree that seed_redis.py is excellent for isolated unit testing with small data samples.

However, I am concerned about the Scalability of that approach for the actual GSoC deliverables. The full DBpedia Redis Index (English/Hindi) is massive (tens of GBs). We cannot ‘seed’ this dynamically every time the container starts since it takes too long. For the Production Pipeline, the Docker container must be designed to mount the existing, pre-built Redis dumps as persistent volumes.

The Risk of Isolation: If we build a ‘self-contained’ pipeline that relies on seed_redis.py, we effectively decouple the Docker setup from the real-world data format used by the main IndIE framework.

Proposed Solution (Unified & Scalable): I have verified that we can Dockerize the existing extraction modules (using the fixes from PR #27) to connect directly to a persistent Redis volume. This allows us to use the Real Data without maintaining a separate “Mock Client.”

Proof of Concept: Here is the legacy test_redis.py running successfully inside a Docker container, connecting to the Redis service via environment variables. This confirms we don’t need to rewrite the client; we just need to configure the container correctly.
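For completeness, the connection logic itself stays trivial once it is driven by environment variables (the host variable name below is an assumption; only NEF_REDIS_PORT appears in the current code):

    import os
    import redis

    client = redis.Redis(
        host=os.getenv("NEF_REDIS_HOST", "localhost"),  # host variable name is an assumption
        port=int(os.getenv("NEF_REDIS_PORT", "6379")),
        decode_responses=True,
    )
    client.ping()  # raises redis.ConnectionError if the service is unreachable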

I suggest we merge the docker-compose infra with the existing codebase to ensure we support the full-scale data dumps from Day 1.

Best, Siddharth


Subject: Contribution and Proposal for DBpedia Hindi Chapter / Neural Extraction Framework (GSoC 2026)

Hello @tsoru and mentors,

My name is Nitin Singh, a Computer Science undergraduate at KIIT University, Bhubaneswar, and I am preparing a proposal for GSoC 2026 related to the DBpedia Hindi Chapter and the Neural Extraction Framework.

Over the past few days, I have been carefully studying the neural-extraction-framework repository and exploring the GSoC25 and GSoC25_H pipelines to understand the current architecture and limitations of the system.

While going through the codebase and attempting to run parts of the pipeline, I focused on understanding how the framework performs:

  • sentence extraction from Wikipedia text
  • entity linking
  • relation extraction using language models
  • predicate mapping to DBpedia ontology
  • RDF triple generation

During this exploration I identified several engineering and architectural issues that may affect reproducibility, scalability, and maintainability of the pipeline.

So far I have documented 7 potential issues in the repository, including:

  • heavy model initialization and resource downloads during module import in models.py
  • runtime downloads of models and NLTK resources rather than setup-time installation
  • use of sys.path modification in testing scripts instead of proper package imports
  • packaging and module structure improvements for better portability
  • opportunities to improve pipeline reproducibility and environment setup
  • improvements to the testing utilities around collector.py
  • minor maintainability improvements discovered while auditing the repository

I am currently preparing to open issues and contribute fixes for these areas to improve the reliability of the framework for new contributors and researchers.

Alongside this exploration, I have drafted a GSoC proposal focused on improving Hindi relational triple extraction through:

  • fine-tuning an Indic language model for relation extraction
  • adding an ontology alignment layer for DBpedia predicate normalization
  • building a lightweight human-in-the-loop feedback interface to iteratively improve training data
  • performing structured evaluation using Hindi BenchIE and error-type analysis

My goal is to contribute not only a model improvement but also a robust extraction pipeline and dataset infrastructure that future contributors can build upon.

I would really appreciate any feedback from the mentors on:

  1. whether these identified issues align with current priorities in the repository
  2. which parts of the pipeline would benefit most from contributions before the coding period
  3. whether the direction of my proposal aligns with the goals of the DBpedia Hindi Chapter / Neural Extraction Framework work

I am excited about contributing to DBpedia and would be happy to start submitting fixes and improvements to the repository.

Thank you for your time and guidance.

Best regards,
Nitin Singh
B.Tech CSE — KIIT University
GitHub: singhhnitin
LinkedIn: linkedin.com/in/nitin-singh12

Hi everyone and thanks for your interest in this project!

If you haven’t done it yet, please prepare a Google Doc with your project proposal and share it with my account (mommi84 at gmail dot com) so we can leave you our feedback before the 31st March deadline.

Hello @tsoru and mentors,

I’ve been reading through the GSoC25 implementation and documentation, especially NEF.py, Embeddings.py, and the 2025 project overview.

From my current understanding, the predicate retrieval layer appears tightly coupled to a DBpedia-only predicate inventory: embeddings are precomputed directly from the DBpedia ontology, and the pipeline also defaults to DBpedia resource URIs during grounding/output formatting.

Because of this, I’m particularly interested in the adaptability side of the project beyond a fixed DBpedia vocabulary. I would like to explore a configurable target-vocabulary adaptation layer for predicate resolution, with a lightweight semantic consistency layer as a secondary component for normalization quality.

Would this direction be considered within scope for the 2026 Neural Extraction Framework project?

Thank you.

Hi @tsoru and everyone,

I’m Gayun Bang, a Management Information Systems student interested in contributing to the Neural Extraction Framework for GSoC 2026.

As part of the warm-up task, I cloned the GSoC25 codebase and tried to run the pipeline locally (macOS, Python 3.10, google-genai==1.47.0). Here is what I found:

1. Crash before pipeline starts — NEF_REDIS_PORT not set

The argument parser contains:

p.add_argument("--redis-port", type=int, default=int(os.getenv("NEF_REDIS_PORT", "")))

When NEF_REDIS_PORT is not set, this raises a ValueError and the pipeline crashes before even attempting a Redis connection, which is notable since Redis is marked as [REQUIRED] in the pipeline documentation.

I wasn’t sure whether the intended fix is to default to 6379, or to allow None and skip Redis initialization when no configuration is provided. Happy to open a PR once the expected behavior is clarified.
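Concretely, the two variants I was considering look like this (mirroring the existing parser code):

    import argparse
    import os

    p = argparse.ArgumentParser()

    # Option A: fall back to the standard Redis port when NEF_REDIS_PORT is unset.
    p.add_argument("--redis-port", type=int,
                   default=int(os.getenv("NEF_REDIS_PORT", "6379")))

    # Option B (alternative, left as a comment so the parser stays valid):
    # allow None and skip Redis initialization when nothing is configured.
    #   port_env = os.getenv("NEF_REDIS_PORT")
    #   p.add_argument("--redis-port", type=int,
    #                  default=int(port_env) if port_env else None)

    print(p.parse_args([]).redis_port)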

2. Embedding step fails — deprecated Gemini model

Emeddings.py uses embedding-001 via the batchEmbedContents endpoint, both of which appear to no longer be supported (HTTP 404).

Updating the model name to text-embedding-004 also requires changing the BATCH_ENDPOINT URL, and batchEmbedContents does not appear to be supported for this model either. The same deprecated default (embed_model="embedding-001") also appears in PredicateEmbeddingRetriever in NEF.py.

Would migrating to embedContent per request, or switching to a local embedding model (e.g., via HuggingFace), be the preferred direction?
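If the local-model route is preferred, a drop-in sketch for Emeddings.py could be as simple as the following (the model name is only an example):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384-dimensional

    def embed(texts: list[str]):
        # Runs fully locally, so there are no quotas or deprecated endpoints to track.
        return model.encode(texts, normalize_embeddings=True, show_progress_bar=False)

    print(embed(["birth place", "capital of", "spouse"]).shape)  # (3, 384)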

Looking forward to contributing, and happy to help with fixes or open PRs once the preferred direction is confirmed.

Best regards,
Gayun Bang