Towards a Neural Extraction Framework — GSoC 2026

This project started in 2021 and is now looking forward to its sixth participation in DBpedia’s GSoC.

Description

Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity is — at the time of writing — semantically connected to 299 base entities.

However, only 9 of those 299 entities are also linked from :Berlin_Wall via another predicate. In the large majority of cases, therefore, it is not clear what kind of relationship exists between the entities: DBpedia does not know which specific RDF predicate links the subject (in our case, :Berlin_Wall) to each of the remaining 290 objects.
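For reference, counts like these can be obtained with a SPARQL query against the public DBpedia endpoint. A minimal sketch that only builds the query string (executing it, e.g. with SPARQLWrapper against https://dbpedia.org/sparql, is left out to keep the example offline):

```python
# Build a SPARQL query counting the wiki links leaving a DBpedia entity.
# Only the query string is constructed here; running it requires network
# access (e.g. via the SPARQLWrapper library).

def wikilink_count_query(entity: str) -> str:
    """Return a SPARQL query counting dbo:wikiPageWikiLink objects of `entity`."""
    return f"""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT (COUNT(?o) AS ?links) WHERE {{
        dbr:{entity} dbo:wikiPageWikiLink ?o .
    }}
    """

query = wikilink_count_query("Berlin_Wall")
print(query)
```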

Currently, such relationships are extracted via the Extraction Framework from tables and from the infobox usually found at the top right of a Wikipedia article. Instead of extracting RDF triples from semi-structured data only, we built an end-to-end system that leverages information found in the entirety of a Wikipedia article, including the page text.

The repository where all source code is stored is the following:

Goal

The goal of this project is to develop a framework for predicate resolution of wiki links among entities by:

  1. harnessing neural models to extract knowledge from text and
  2. validating their output against the DBpedia ontology.

During the first GSoC years, we employed a suite of machine-learning models to perform joint entity-relation extraction on open-domain text. Additionally, we implemented an end-to-end system that translates any English sentence into triples using the DBpedia vocabulary. Over the last two years, we improved the quality of output triples using a chain-of-thought approach powered by a large language model.

However, the current algorithm still has the issues listed below; we now want to devise a method that solves as many of them as possible.

  1. The generated triples are not validated against the DBpedia ontology and may thus introduce inconsistencies into the data.
  2. The current models are not efficient enough to scale to millions of entities.
  3. The extracted relations are not categorised with respect to their semantics (e.g. reflexive/irreflexive, symmetric/antisymmetric/asymmetric, transitive, equivalence).
  4. Ideally, our algorithm should be able to adapt its output not only to the DBpedia vocabulary but to any specified one (e.g., SKOS, schema.org, Wikidata, RDFS, or even a combination of many).
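To make the first issue concrete, here is a toy sketch of a domain check. The predicate URIs are real DBpedia properties, but the lookup table is hand-written for illustration; a real validator would query the ontology itself (rdfs:domain / rdfs:range, e.g. via RDFLib or SPARQL) and reason over the class hierarchy:

```python
# Toy domain table standing in for the DBpedia ontology. The predicate
# URIs are real DBpedia properties; the table itself is a hand-written
# illustration, not loaded from the ontology.
ONTOLOGY = {
    "dbo:birthDate": {"domain": "dbo:Person"},
    "dbo:effect": {"domain": "dbo:Event"},
}

def is_consistent(subject_type: str, predicate: str) -> bool:
    """Reject a triple whose subject type violates the predicate's rdfs:domain."""
    constraint = ONTOLOGY.get(predicate)
    if constraint is None:
        return False          # unknown predicate: reject rather than guess
    # A real check would also accept subclasses of the domain class.
    return subject_type == constraint["domain"]

# dbo:birthDate applies to persons, not to places such as :Berlin_Wall.
print(is_consistent("dbo:Place", "dbo:birthDate"))   # False
print(is_consistent("dbo:Event", "dbo:effect"))      # True
```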

Alternative extraction targets

The current pipeline targets relationships that are explicitly mentioned in the text. The contributor may also choose to extract complex relationships, such as:

  • Causality. (Addressed during GSoC 2021, but not completed.) The direct cause-effect relationship between events, e.g., from the text

The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.

extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification

  • Issuance. An abstract entity assigned to some agent, e.g., from the text

Messi won the award, his second consecutive Ballon d’Or victory.

extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi

Material

The contributor may use any LLM API and/or Python deep learning framework. The following resources are recommended (but not compulsory) for use in the project.

  • The project repository linked above and the machine-learning models mentioned in the readme files found in each GSoC folder.
  • The 2025 and 2024 blog posts, to understand the project’s status quo.
  • Python Wikipedia makes it easy to access and parse data from Wikipedia.
  • Huggingface Transformers for Natural Language Inference can be extremely useful to extract structured knowledge from text or perform zero-shot classification.
  • DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
  • DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be available to the contributor (e.g., given an entity, list all strings that point to it).
  • An example of an excellent proposal that was accepted a few years ago.

Project size

The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 350).

Impact

This project will potentially generate millions of new statements. This new information could be released by DBpedia to the public as part of a new dataset. The creation of a neural extraction framework could introduce the use of robust parsers for a more accurate extraction of Wikipedia content.

Warm-up tasks

Mentors (tentative)

@tsoru, @smilingprogrammer, @ronitblenz, @gnav


Hello @tsoru and everyone,

Thanks for posting this Project for GSoC 2026. I am strongly interested in contributing to the “Towards a Neural Extraction Framework” project.

I’ve been following the project’s evolution, and the focus on scalability and ontology validation really stands out to me. Moving from experimental models to a production-ready system that can handle millions of entities is exactly the kind of engineering challenge I enjoy tackling.

Current Progress on Warm-up Tasks: I’ve started auditing the GSoC25 codebase to understand the bottlenecks:

  • Stability Fixes: I noticed the legacy EntityLinking tests were crashing due to missing dependencies/imports. I submitted PR #27 to fix the test harness so it runs smoothly for new contributors.
  • Scalability & API Limits: I am currently testing Emeddings.py and confirmed that it fails to run due to Gemini 429 rate-limiting errors.

Next Steps: To address the scalability goal, I am prototyping a FAISS-based local predicate retriever.

Instead of hitting the Gemini API for every embedding (which causes the 429 errors), I am migrating the logic to a pre-computed local vector index. This should drastically reduce latency and allow us to run high-volume extractions (like the “Berlin Wall” dataset) offline.
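As a rough illustration of such a local retriever, here is a brute-force cosine-similarity search with numpy standing in for a FAISS index (the embeddings are random toy vectors, not the real pre-computed predicate embeddings):

```python
import numpy as np

# Toy pre-computed predicate embeddings; in the real pipeline these would
# be loaded from disk (the vectors behind predicates.csv), not generated.
rng = np.random.default_rng(0)
predicates = ["dbo:birthPlace", "dbo:spouse", "dbo:effect"]
index = rng.normal(size=(len(predicates), 8)).astype("float32")
index /= np.linalg.norm(index, axis=1, keepdims=True)   # unit vectors

def top_k(query_vec: np.ndarray, k: int = 1) -> list[str]:
    """Return the k predicates whose embeddings are closest to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q                      # cosine similarity
    best = np.argsort(scores)[::-1][:k]
    return [predicates[i] for i in best]

# Querying with an indexed vector must return that predicate first.
print(top_k(index[1]))  # ['dbo:spouse']
```

At scale, the same interface would be backed by a FAISS index (which adds approximate nearest-neighbour search) rather than a dense matrix product.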

I plan to share the prototype results shortly. Please let me know if you have any feedback on my approach or on PR #27.

Best, Siddharth

Hi @tsoru,

I’m continuing to explore the existing pipeline and current scalability/ontology-validation bottlenecks, and I’m drafting a proposal aligned with these objectives.

I’ll share updates from the local predicate retrieval experiments shortly. Please let me know the preferred next step for proposal submission.

Best regards,
Siddharth

Subject: Prototyping Strict Schema Validation for Ontology Compliance (PR #30)

Hi @tsoru and everyone,

I am Aryan Gairola, a Computer Science undergrad (2026). I’ve been following the Neural Extraction Framework’s evolution and am particularly interested in solving the “Ontology Validation” challenge mentioned in the project description.

What I Have Done So Far: While auditing the codebase to understand the “Berlin Wall” inconsistency problem, I identified a critical stability issue in NEF.py. The legacy index-based prompting was prone to “off-by-one” errors, and worse, it defaulted to allowed[0] when the model hallucinated—silently corrupting the Knowledge Graph with false positives.

To address this immediately, I submitted PR #30 (currently passing all CI checks), which refactors the extraction logic to use:

  1. Native Pydantic Schemas: enforcing strict structural validation of the LLM’s output using Gemini’s response_schema, rather than relying on prompt engineering to coax the model into the right format.
  2. Explicit Hallucination Rejection: the pipeline now safely returns None if the model invents a predicate, rather than forcing an incorrect relation that would corrupt the knowledge graph.
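The rejection logic can be sketched with the standard library alone (the JSON shape and the `allowed` set here are illustrative assumptions; the actual PR enforces structure via Gemini’s response_schema):

```python
import json

def parse_triple(llm_output: str, allowed: set[str]):
    """Parse an LLM response and reject any predicate outside the allowed set.

    Returns (subject, predicate, object) or None, instead of silently
    falling back to allowed[0] as the legacy code did.
    """
    try:
        data = json.loads(llm_output)
        s, p, o = data["subject"], data["predicate"], data["object"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None                      # structurally invalid output
    if p not in allowed:
        return None                      # hallucinated predicate: reject
    return (s, p, o)

allowed = {"dbo:effect", "dbo:recipient"}
ok = '{"subject": ":Peaceful_Revolution", "predicate": "dbo:effect", "object": ":German_reunification"}'
bad = '{"subject": ":Berlin_Wall", "predicate": "dbo:madeUpProperty", "object": ":X"}'
print(parse_triple(ok, allowed))   # (':Peaceful_Revolution', 'dbo:effect', ':German_reunification')
print(parse_triple(bad, allowed))  # None
```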

Future Contribution Plan:

  1. Semantic Validation: Beyond the JSON structural validation I implemented in PR #30, I will add a Semantic Layer. This module will query the DBpedia Ontology (checking rdfs:domain and rdfs:range) to reject logical hallucinations—e.g., preventing the model from assigning a dbo:birthDate (Property of a Person) to the :Berlin_Wall (a Place).
  2. Deterministic Inference: Instead of relying on the LLM to “guess” logic, I will implement a Post-Processing Inference Rule Engine. By reading standard OWL definitions (like owl:SymmetricProperty) directly from the ontology source, the system can automatically generate implicit facts (e.g., inferring :Bob spouse :Alice from :Alice spouse :Bob) with 100% accuracy and zero extra token cost.
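A minimal sketch of that rule engine, with the symmetric-property set hardcoded for illustration (in practice it would be read from the ontology’s owl:SymmetricProperty declarations, e.g. with RDFLib):

```python
# Hand-written property metadata standing in for the ontology; a real
# implementation would collect these from owl:SymmetricProperty
# declarations in the ontology source instead of hardcoding them.
SYMMETRIC = {"dbo:spouse", "dbo:relative"}

def infer(triples: set[tuple]) -> set[tuple]:
    """Add the implicit facts licensed by symmetric properties."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in SYMMETRIC:
            inferred.add((o, p, s))      # symmetry: (s p o) entails (o p s)
    return inferred

facts = {(":Alice", "dbo:spouse", ":Bob")}
print(sorted(infer(facts)))
# [(':Alice', 'dbo:spouse', ':Bob'), (':Bob', 'dbo:spouse', ':Alice')]
```

Transitive and inverse properties would follow the same pattern, each as a deterministic post-processing rule rather than an LLM call.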

I’d love to hear your thoughts on whether this “Correctness-First” approach aligns with the team’s vision for the Hybrid Pipeline.

Ready for review on PR #30 whenever you have a moment!

Best regards,
Aryan Gairola

Hey @tsoru,

I went through NEF’s year-on-year progress since 2021, and it’s impressive how a variety of techniques have been used to improve the resolution of RDF triples. I’m currently going through the latest GSoC 2025 codebase; here are my early observations:

  1. Ontology Validation

The current pipeline (NEF.py) does not validate generated triples against the DBpedia ontology. Predicates are selected via embedding similarity from a pre-computed predicates.csv, with only a threshold filter (default 0.5) and word-count validation (1-3 words). I noticed there’s a prototype in TestFiles/predGen.py that loads the DBpedia ontology via RDFLib and validates predicate existence using SPARQL ASK queries—but this isn’t integrated into the main pipeline. Domain/range constraint checking is also absent, which could lead to semantically invalid triples (e.g., assigning dbo:birthPlace to a non-Person entity).

  2. Scalability

The Redis-backed entity linking is a solid foundation for scale, but the pipeline currently makes synchronous Gemini API calls per sentence for both extraction and disambiguation. For millions of entities, this would hit rate limits and cost constraints. There’s no batching, async processing, or local model fallback currently implemented.
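A possible direction here, sketched with asyncio and a stub in place of the Gemini client (the concurrency cap and the `extract` stub are illustrative assumptions, not existing code):

```python
import asyncio

async def extract(sentence: str) -> str:
    """Stub standing in for an LLM extraction call; real client code goes here."""
    await asyncio.sleep(0)               # simulate I/O latency
    return f"triples({sentence})"

async def extract_all(sentences: list[str], max_concurrent: int = 5) -> list[str]:
    """Run extractions concurrently, capped to respect API rate limits."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(s: str) -> str:
        async with sem:
            return await extract(s)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(s) for s in sentences))

results = asyncio.run(extract_all([f"sentence {i}" for i in range(3)]))
print(results)  # ['triples(sentence 0)', 'triples(sentence 1)', 'triples(sentence 2)']
```

A retry-with-backoff wrapper around the client call would then handle the 429 responses that the synchronous pipeline currently hits.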

  3. Relation Semantics

There’s no categorization of extracted relations by their logical properties (reflexive, symmetric, transitive, etc.). The predicate retrieval is purely similarity-based without any semantic metadata attached. This would be valuable for downstream reasoning and consistency checking.

  4. Vocabulary Flexibility

The codebase is currently tightly coupled to DBpedia: URIs are hardcoded with the http://dbpedia.org/resource/ prefix, and the predicate embeddings are specific to the DBpedia ontology. There’s no abstraction layer or configuration for alternative vocabularies such as Wikidata or SKOS.
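One way to loosen this coupling would be a small vocabulary-configuration layer; a minimal sketch (the namespace prefixes are real, but the configuration scheme itself is a hypothetical illustration, not existing code):

```python
# Hypothetical configuration layer decoupling the pipeline from DBpedia URIs.
# The namespace prefixes below are the real DBpedia and Wikidata ones.
VOCABULARIES = {
    "dbpedia": {"resource": "http://dbpedia.org/resource/",
                "ontology": "http://dbpedia.org/ontology/"},
    "wikidata": {"resource": "http://www.wikidata.org/entity/",
                 "ontology": "http://www.wikidata.org/prop/direct/"},
}

def entity_uri(vocab: str, local_name: str) -> str:
    """Build a full entity URI for the configured vocabulary."""
    return VOCABULARIES[vocab]["resource"] + local_name

print(entity_uri("dbpedia", "Berlin_Wall"))
# http://dbpedia.org/resource/Berlin_Wall
print(entity_uri("wikidata", "Q42"))       # some Wikidata Q-id
# http://www.wikidata.org/entity/Q42
```

The extraction and linking stages would then take the vocabulary key as a parameter instead of hardcoded prefixes, which would also allow emitting triples in several vocabularies at once.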

I’d be interested in working on these problem statements. Happy to discuss potential approaches!