Towards a Neural Extraction Framework — GSoC 2024

tsoru · January 31, 2024, 8:06pm

This project started in 2021 and is looking forward to its fourth participation in DBpedia’s GSoC.

Description

Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity (at the time of writing this) is semantically connected to 299 base entities.

However, only 9 of these 299 base entities are also linked from :Berlin_Wall via another predicate. This suggests that in the large majority of cases, it is not clear what kind of relationship exists between the entities. In other words, DBpedia does not know what specific RDF predicate links the subject (in our case, :Berlin_Wall) to any of the objects above.
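To make this gap concrete, here is a minimal Python sketch that takes the outgoing (predicate, object) pairs of a single subject and separates the wiki-linked objects that are also reached through another predicate from those that are not. The function name, the ex:otherPredicate property, and the sample entities are illustrative assumptions, not actual DBpedia data.

```python
WIKI_LINK = "dbo:wikiPageWikiLink"

def split_linked_objects(pairs):
    """Split wiki-linked objects by whether another predicate also reaches them.

    `pairs` is an iterable of (predicate, object) tuples for one subject.
    Returns (typed, untyped), two sets of objects.
    """
    pairs = list(pairs)
    linked = {obj for pred, obj in pairs if pred == WIKI_LINK}
    other = {obj for pred, obj in pairs if pred != WIKI_LINK}
    typed = linked & other      # wiki link plus at least one other predicate
    untyped = linked - other    # wiki link only: relationship unknown
    return typed, untyped

# Hypothetical sample; the entities and ex:otherPredicate are made up.
sample = [
    (WIKI_LINK, ":Entity_A"),
    (WIKI_LINK, ":Entity_B"),
    (WIKI_LINK, ":Entity_C"),
    ("ex:otherPredicate", ":Entity_A"),
]
typed, untyped = split_linked_objects(sample)
print(sorted(typed), sorted(untyped))
```

The objects in the second set are the ones for which a predicate still has to be resolved.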

Currently, such relationships are extracted from tables and the infobox (usually found at the top right of a Wikipedia article) via the Extraction Framework. Instead of extracting RDF triples from semi-structured data only, we want to leverage information found in the entirety of a Wikipedia article, including page text.

The repository where all source code will be stored is the following:

Goal

The goal of this project is to develop a framework for predicate resolution of wiki links among entities.

During GSoC 2022, we employed a suite of machine-learning models to perform joint entity-relation extraction on open-domain text.
Last year, we implemented an end-to-end system that translates any English sentence into triples using the DBpedia vocabulary.

However, the current algorithm still has the following issues, and we now want to devise a method that solves as many of them as possible.

When an RDF property representing the predicate is not found, our algorithm cannot make any suggestions for the creation of a new property.
The current models are not efficient enough to scale to millions of entities.
The extracted relations are not categorised with respect to their semantics (e.g. reflexive/irreflexive, symmetric/antisymmetric/asymmetric, transitive, equivalence).
The generated triples are not validated against the DBpedia ontology and may thus lead to inconsistencies in the data.
Our algorithm should be able to adapt its output not only to the DBpedia vocabulary but to any specified one (e.g., SKOS, schema.org, Wikidata, RDFS, or even a combination of many).
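As an illustration of the semantic categorisation mentioned in the list above, the following sketch checks which algebraic properties hold for a finite set of extracted (subject, object) pairs of one predicate. It is only a sketch under two assumptions: reflexivity is judged over the entities that actually appear in the pairs, and the function name and toy data are hypothetical. Since extracted triples are usually incomplete, the result should be read as evidence for a property, not as proof.

```python
def relation_properties(pairs):
    """Report which algebraic properties hold for a finite binary relation."""
    rel = set(pairs)
    # Assumption: the domain is the set of entities seen in the pairs.
    nodes = {x for pair in rel for x in pair}

    reflexive = all((x, x) in rel for x in nodes)
    irreflexive = all((x, x) not in rel for x in nodes)
    symmetric = all((b, a) in rel for a, b in rel)
    antisymmetric = all(a == b or (b, a) not in rel for a, b in rel)
    transitive = all(
        (a, d) in rel
        for a, b in rel
        for c, d in rel
        if b == c
    )
    return {
        "reflexive": reflexive,
        "irreflexive": irreflexive,
        "symmetric": symmetric,
        "antisymmetric": antisymmetric,
        "asymmetric": irreflexive and antisymmetric,
        "transitive": transitive,
        "equivalence": reflexive and symmetric and transitive,
    }

# Toy data: a chain with its transitive shortcut.
print(relation_properties({("a", "b"), ("b", "c"), ("a", "c")}))
```

On real data, a tolerance threshold (for example, "symmetric in at least 95% of observed pairs") would probably be more useful than these strict checks.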

Extraction examples

The current pipeline targets relationships that are explicitly mentioned in the text. The contributor may also choose to extract complex relationships, such as:

Causality. (Addressed during GSoC 2021, but not completed.) The direct cause-effect relationship between events, e.g., from the text

The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.

extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification

Issuance. An abstract entity assigned to some agent, e.g., from the text

Messi won the award, his second consecutive Ballon d’Or victory.

extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi
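For readers who want to see what such an extracted statement looks like as data, here is a small sketch that expands the prefixed names used in the examples above and serialises a triple as one N-Triples line. It assumes that the empty prefix stands for the DBpedia resource namespace and that local names need no escaping. The helper names are hypothetical and not part of the project code.

```python
# Assumption: ":" denotes DBpedia resources, "dbo:" the DBpedia ontology.
PREFIXES = {
    "": "http://dbpedia.org/resource/",
    "dbo": "http://dbpedia.org/ontology/",
}

def expand(curie):
    """Turn a prefixed name such as 'dbo:effect' into a full IRI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local

def to_ntriple(subject, predicate, obj):
    """Serialise one triple of prefixed names as an N-Triples line."""
    return "<{}> <{}> <{}> .".format(
        expand(subject), expand(predicate), expand(obj)
    )

print(to_ntriple(":Peaceful_Revolution", "dbo:effect", ":German_reunification"))
```

A production pipeline would more likely rely on an RDF library for serialisation and escaping; the point here is only the shape of the output.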

Material

The contributor may use any Python deep learning framework and/or existing tool. The following resources are recommended (but not compulsory) for use in the project.

The project repository linked above and the machine-learning models mentioned in the readme files found in each GSoC folder.
Last year’s blog to understand the project status quo.
Python Wikipedia makes it easy to access and parse data from Wikipedia.
Huggingface Transformers for Natural Language Inference can be extremely useful to extract structured knowledge from text or perform zero-shot classification.
DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be available to the student (e.g., given an entity, list all strings that point to it).
An example of an excellent proposal that was accepted a few years ago.
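To give an idea of how the Natural Language Inference resource listed above could be used, the sketch below ranks a handful of candidate relation labels for a sentence with the Transformers zero-shot classification pipeline. The model name facebook/bart-large-mnli, the hypothesis template, and the candidate labels are choices made for this example only; how labels map to DBpedia properties is left open. The classifier can be passed in, so the helper can also be tried with a stub before downloading a model.

```python
def rank_predicates(sentence, candidate_labels, classifier=None):
    """Rank candidate relation labels for a sentence by NLI entailment score."""
    if classifier is None:
        # Imported lazily; this downloads the model on first use.
        from transformers import pipeline
        classifier = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",  # assumption: any NLI model works
        )
    result = classifier(
        sentence,
        candidate_labels=candidate_labels,
        # Hypothetical template; wording strongly affects the scores.
        hypothesis_template="This sentence expresses the relation: {}.",
    )
    # The pipeline returns labels sorted by decreasing score.
    return list(zip(result["labels"], result["scores"]))

if __name__ == "__main__":
    text = "Messi won the award, his second consecutive Ballon d'Or victory."
    for label, score in rank_predicates(text, ["recipient", "effect", "location"]):
        print(f"{label}: {score:.3f}")
```

This treats predicate resolution as classification over a fixed candidate set, so it does not by itself address the case where no suitable property exists.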

Project size

The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 300).

Impact

This project will potentially generate millions of new statements. This new information could be released by DBpedia to the public as part of a new dataset. The creation of a neural extraction framework could introduce the use of robust parsers for a more accurate extraction of Wikipedia content.

Warm-up tasks

Get familiar with SPARQL on the DBpedia endpoint.
Understand the science behind relation extraction.
Run and understand the pipeline implemented last year.
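As a possible starting point for the first warm-up task, this sketch builds a SPARQL query that counts the distinct wiki links of an entity and turns it into a request URL for the public endpoint at https://dbpedia.org/sparql. The function names are hypothetical, the format parameter is assumed to be accepted by the endpoint, and the prefixed-name shortcut only works for resource names without special characters. The request itself is not sent here.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"

def count_wikilinks_query(resource):
    """Build a query counting the distinct wiki-linked objects of a resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT (COUNT(DISTINCT ?o) AS ?links) WHERE {\n"
        f"  dbr:{resource} dbo:wikiPageWikiLink ?o .\n"
        "}"
    )

def request_url(query):
    """Encode the query as a GET URL that asks for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)

print(count_wikilinks_query("Berlin_Wall"))
print(request_url(count_wikilinks_query("Berlin_Wall")))
```

The printed URL can be opened in a browser or fetched with any HTTP client; the same query can also be pasted into the endpoint's web form.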

Mentors

@tsoru, @zoelevert, TBD

smilingprogrammer · February 2, 2024, 10:05am

Hi @tsoru @zoelevert I’m Abdulsobur Oyewale. It’s nice to be here again this year.
Just to clarify what we are trying to work towards here:
The goal here will be to address issues such as the inability to suggest new properties when an RDF property is not found, inefficiency in scaling to millions of entities, lack of semantic categorization of extracted relations, potential inconsistencies in data validation, and the need for our framework to adapt to different vocabularies.
Right?

tsoru · February 5, 2024, 2:27pm

Hi @smilingprogrammer,

that’s correct. You’ve listed all the features that we would like to be implemented.

yassanov · February 27, 2024, 8:44pm

Dear @tsoru,

My name is Yerden Assanov, and I am currently pursuing my M.Sc. degree in Engineering Cybernetics from the University of Stuttgart, Germany, where I have been engaged in courses such as machine learning, deep learning, probabilistic machine learning, and a deep learning seminar focusing on knowledge graphs, knowledge graph embeddings, and link prediction.

I am particularly excited about contributing to open-source projects, especially DBpedia. As a current research assistant working in the related field of the Semantic Web as part of the ATLAS project at the Institute of Artificial Intelligence of Stuttgart University, I believe I can bring valuable insights to the DBpedia community.

I have completed the warm-up tasks you provided and am eager to contribute to this project. I would greatly appreciate the opportunity to stay connected with you via email, DBpedia Slack, or any other means available, where I can seek your guidance.

Could you please advise me on the most convenient method for further discussions?

Looking forward to collaborating with you and making meaningful contributions to the community.

Warm regards,
Yerden Assanov

tsoru · March 1, 2024, 9:24am

Hi @yassanov and thank you for your interest.

@yassanov @smilingprogrammer and anyone who plans on applying, please prepare a Google Doc along the lines of this example of an excellent proposal that was accepted a few years ago.

When you are happy with your draft, please share it with my Google account (mommi84 at gmail dot com), and I will extend the invite to the other co-mentors. We will leave comments, questions, and help you with your submission.

yogeshkulkarni · April 3, 2024, 8:53am

Hi @tsoru, as discussed in Slack, I wish to be a co-mentor for this project.

koppalv · November 10, 2024, 3:34pm

@tsoru Hello Sir
I am Vedant Koppal, and I am currently pursuing a BTech degree in AIML at KIT’s College of Engineering, Kolhapur.
I liked this project and want to learn more about it and contribute to it.
Also, I am participating in GSoC 2025.
So where can I start contributing to DBpedia?
What resources would you suggest I look into?
I have experience in AI and ML, including fine-tuning LLMs and building ML models.
If you could point me to one resource I should look into, that would be a great help!
Thanks!