LG2RDF: Language generation to generate RDF from DBpedia abstracts - GSoC2022

Description

Recently, there have been significant improvements in generating triples from natural language text in the field of relation extraction, driven by generative language models. Several such improvements are the papers listed under the warm-up tasks below.

The objective of this work will be to use such advancements to create a system that can generate DBpedia triples from a Wikipedia abstract (or even the complete page). If a set of additional triples, complementary to the triples generated from infoboxes, can be generated with high confidence, it will be quite useful to the Semantic Web community.

For the prospective student, it would be useful to have some background in both Knowledge Graphs / the Semantic Web and Natural Language Processing (NLP), especially in relation extraction and linking. The project will involve deep learning approaches, with a focus on transformer-based generative language models such as BART or T5.
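To illustrate what such a model produces, here is a minimal sketch, assuming the Hugging Face transformers library and the publicly released Babelscape/rebel-large checkpoint (linked later in this thread); the abstract is just an illustrative example.

```python
# Sketch: generate relation triplets from a Wikipedia abstract with a
# BART-based seq2seq model (REBEL). Assumes `pip install transformers torch`.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "Babelscape/rebel-large"  # BART fine-tuned for end-to-end relation extraction
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

abstract = ("Punta Cana is a resort town in the municipality of Higuey, "
            "in La Altagracia Province, the easternmost province of the "
            "Dominican Republic.")

inputs = tokenizer(abstract, return_tensors="pt", truncation=True)
generated = model.generate(**inputs, max_length=256, num_beams=3)

# REBEL linearizes triples with <triplet>, <subj> and <obj> markers, so the
# special tokens must be kept and parsed rather than skipped.
print(tokenizer.batch_decode(generated, skip_special_tokens=False)[0])
```

For DBpedia, the open part of the project would then be mapping the generated surface forms to DBpedia resources and ontology properties.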

Goal

To pursue an approach similar to the ones mentioned in the description, this project has two main, related goals.

  1. Create a distant-supervision dataset for DBpedia that aligns DBpedia triples with natural language text in Wikipedia articles, to be used as training data (a minimal alignment sketch follows this list).
  2. Develop a system based on the aforementioned papers by building a model that can generate DBpedia triples from Wikipedia abstracts.
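To make goal 1 concrete, here is a minimal sketch of the classic distant-supervision heuristic; the entity, labels and triples below are hypothetical examples, and real pipelines (cf. T-REx) add entity linking, coreference resolution and date/number normalization on top of this.

```python
# Sketch: keep a DBpedia triple as a training example only when the surface
# forms of both its subject and object occur in the entity's abstract.
import re

def align(abstract, triples):
    aligned = []
    for subj_label, predicate, obj_label in triples:
        # Crude surface-form matching; see the lead-in for what real
        # alignment pipelines add on top of this.
        if re.search(re.escape(subj_label), abstract, re.IGNORECASE) and \
           re.search(re.escape(obj_label), abstract, re.IGNORECASE):
            aligned.append((subj_label, predicate, obj_label))
    return aligned

abstract = ("Lionel Messi is an Argentine professional footballer who "
            "plays as a forward for Paris Saint-Germain.")
triples = [
    ("Lionel Messi", "dbo:team", "Paris Saint-Germain"),
    ("Lionel Messi", "dbo:birthPlace", "Rosario"),  # not in abstract: dropped
]
print(align(abstract, triples))
# -> [('Lionel Messi', 'dbo:team', 'Paris Saint-Germain')]
```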

Impact

In previous approaches (e.g. this GSoC 2021 project) we used lexicalizations, a handcrafted process that relates an ontology's classes and properties to their verbalizations.
With this automatic method we remove humans from the equation or, at the very least, provide them with a first version of the lexicalization.

Warm up tasks

  1. Read the following papers:
  • Elsahar, Hady, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. “T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples.” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.

  • Huguet Cabot, Pere-Lluís, and Roberto Navigli. “REBEL: Relation Extraction By End-to-end Language generation.” In Findings of the Association for Computational Linguistics: EMNLP 2021, 7–11 November 2021, pages 2370–2381. Association for Computational Linguistics.

  • Cao, Jiarun, and Sophia Ananiadou. “GenerativeRE: Incorporating a Novel Copy Mechanism and Pretrained Model for Joint Entity and Relation Extraction.” In Findings of the Association for Computational Linguistics: EMNLP 2021.

  • Josifoski, Martin, Nicola De Cao, Maxime Peyrard, and Robert West. “GenIE: Generative Information Extraction.”

  2. Also, you should be fluent in managing the DBpedia datasets, in particular the abstracts.nt file (see the streaming sketch below).
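For instance, here is a minimal streaming sketch; the file name follows the warm-up task above, and the hand-rolled line parsing is deliberately crude (escape sequences are left unresolved), so a proper N-Triples parser such as rdflib's should be preferred for real work.

```python
# Sketch: stream DBpedia abstracts from an N-Triples dump without loading
# the (very large) file into memory. N-Triples is line-based:
#   <subject> <predicate> "object text"@lang .
def iter_abstracts(path="abstracts.nt"):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue
            subj, _pred, rest = line.split(" ", 2)
            # Drop the trailing language tag and final "." from the literal.
            text = rest.rsplit("@", 1)[0].strip().strip('"')
            yield subj.strip("<>"), text

for uri, abstract in iter_abstracts():
    print(uri, abstract[:80], "...")
    break
```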

Mentors

Mariano Rico
(co-mentor) Nandana Mihindukulasooriya

Project size

350h

Keywords

NLP, text parsing, machine learning, deep learning, RDF generation, relation extraction


Just to add a fuller citation:

  • Huguet Cabot, Pere-Lluís, and Roberto Navigli. “REBEL: Relation Extraction By End-to-end Language generation.” In Findings of the Association for Computational Linguistics: EMNLP 2021, 7–11 November 2021, pages 2370–2381. Association for Computational Linguistics. Unlicensed.

Hello @mariano_rico, @robbie.morrison,
To introduce myself briefly: I am working on the French DBpedia chapter as an R&D engineer.
I read the Cabot & Navigli article carefully and I am very interested in your project.

I put the git repos here as a reminder:
How to build a new corpus, should one be needed: https://github.com/Babelscape/crocodile
Pretrained model: https://github.com/Babelscape/rebel

The relation extraction subject is a recurring question in the DBpedia community.

However, these efforts have never passed a real integration step into the DBpedia workflow, even though such an extraction process could be a very pleasant way to exploit the information contained in the wiki articles!

I imagine that the old-fashioned way of extracting relations (NER + relation classification) was too tedious a job to be automated and launched on a regular basis (“as both Wikipedia and Wikidata are in constant change”). Autoregressive transformer models seem to facilitate the task through their end-to-end approach, and could be the opportunity to overcome these obstacles.

Are you also thinking of addressing the issue of mapping the predicates obtained in the relation triplets to the DBpedia ontology?

In brief, I am really enthusiastic and would be very happy to participate in your project as a GSoC student :slight_smile:


Hi @cringwald

Thanks for your interest in the project. Yes, in the latest papers, end-to-end neural approaches based on transformers are doing a very good job at relation extraction and related tasks. The goal here would be to use them to generate DBpedia triples from Wikipedia text. I hope you can submit a good proposal for the project. Please let us know if you need any further details.

Hi @cringwald. I cannot find your proposal. Did you finally submit it?