Description
Recently there have been significant advances in generating triples from natural language text in the field of relation extraction, driven by generative language models. A few such works are:
- "REBEL: Relation Extraction By End-to-end Language Generation" by Pere-Lluís Huguet Cabot and Roberto Navigli, presented at EMNLP 2021.
- "GenerativeRE: Incorporating a Novel Copy Mechanism and Pretrained Model for Joint Entity and Relation Extraction" by Jiarun Cao and Sophia Ananiadou, presented at EMNLP 2021.
- "GenIE: Generative Information Extraction" by Martin Josifoski, Nicola De Cao, Maxime Peyrard, and Robert West.
The objective of this work is to use such advancements to create a system that can generate DBpedia triples given a Wikipedia abstract (or even the complete page). If an additional set of triples, complementary to those extracted from infoboxes, can be generated with high confidence, it will be quite useful to the Semantic Web community.
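As a concrete illustration of what such a system emits, models like REBEL linearize triples into a token sequence that a seq2seq model learns to generate. A minimal sketch of decoding such a sequence back into (subject, relation, object) triples, assuming REBEL-style `<triplet>`, `<subj>`, `<obj>` marker tokens (the exact markers are an assumption here), might look like:

```python
def decode_linearized_triples(text: str):
    """Parse a REBEL-style linearized string into (subject, relation, object)
    triples. Each triple group is assumed to be encoded as:
      <triplet> subject <subj> object <obj> relation [<subj> object2 <obj> relation2 ...]
    """
    triples = []
    # Each chunk after a <triplet> marker holds one subject and its relations.
    for chunk in text.split("<triplet>")[1:]:
        parts = chunk.split("<subj>")
        subject = parts[0].strip()
        for pair in parts[1:]:
            if "<obj>" in pair:
                obj, relation = pair.split("<obj>", 1)
                triples.append((subject, relation.strip(), obj.strip()))
    return triples
```

The generated strings would then only need entity and relation linking to DBpedia URIs to become proper RDF triples.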
For the prospective student, it would be useful to have some background in both Knowledge Graph / Semantic Web and Natural Language Processing (NLP), especially in relation extraction and linking. The project will involve deep learning approaches especially focused on transformer-based generative language models such as BART or T5.
Goal
In order to follow an approach similar to the ones mentioned in the description, this project has two main related goals:
- Create a distant-supervision dataset for DBpedia that aligns DBpedia triples with the natural language text of Wikipedia articles, to be used as training data.
- Develop a system based on the aforementioned papers by building a model that can generate DBpedia triples from the Wikipedia abstracts.
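For the first goal, a simple alignment heuristic (in the spirit of T-REx) is distant supervision: assume a triple is expressed by a sentence when the surface forms of both its subject and object appear in it. A minimal sketch, where the function shape and the `labels` lookup are illustrative rather than any DBpedia API:

```python
def align_triples(sentences, triples, labels):
    """Distant-supervision alignment: pair each (s, p, o) triple with every
    sentence that mentions both the subject's and the object's label.

    sentences: list of sentence strings (e.g. from a Wikipedia abstract)
    triples:   list of (subject_uri, predicate_uri, object_uri) tuples
    labels:    dict mapping URIs to surface-form labels
    """
    aligned = []
    for sent in sentences:
        lowered = sent.lower()
        for s, p, o in triples:
            s_label = labels.get(s, "").lower()
            o_label = labels.get(o, "").lower()
            if s_label and o_label and s_label in lowered and o_label in lowered:
                aligned.append((sent, (s, p, o)))
    return aligned
```

Real pipelines refine this with entity linking and filtering of spurious matches, since co-occurrence alone is a noisy signal.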
Impact
In previous approaches (e.g. this GSoC 2021 project) we used lexicalizations, a handcrafted process that relates an ontology's classes and properties with their verbalizations.
With this automatic method we remove humans from the loop, or at least provide them with a first version of the lexicalization.
Warm up tasks
- Read the following papers:
  - Elsahar, Hady, Pavlos Vougiouklis, Arslen Remaci, Christophe Gravier, Jonathon Hare, Frederique Laforest, and Elena Simperl. "T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples." In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  - Huguet Cabot, Pere-Lluís, and Roberto Navigli. "REBEL: Relation Extraction By End-to-end Language Generation." In Findings of the Association for Computational Linguistics: EMNLP 2021, 7–11 November 2021, pages 2370–2381. Association for Computational Linguistics.
  - Cao, Jiarun, and Sophia Ananiadou. "GenerativeRE: Incorporating a Novel Copy Mechanism and Pretrained Model for Joint Entity and Relation Extraction." EMNLP 2021.
  - Josifoski, Martin, Nicola De Cao, Maxime Peyrard, and Robert West. "GenIE: Generative Information Extraction."
- You should also be fluent in managing the DBpedia datasets, in particular the abstracts.nt file.
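To get a feel for the data: each line of abstracts.nt is one N-Triples statement. A minimal stdlib-only sketch that extracts the subject URI and abstract text from one such line (real work would use a proper RDF parser such as rdflib) could be:

```python
import re

# Matches one N-Triples line of the form: <subject> <predicate> "literal"@lang .
NT_LINE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+)?\s*\.\s*$'
)

def parse_abstract_line(line: str):
    """Return (subject_uri, abstract_text) for one abstracts.nt line,
    or None if the line does not match the expected literal pattern."""
    m = NT_LINE.match(line)
    if not m:
        return None
    subject, _predicate, literal = m.groups()
    # Unescape the most common N-Triples string escapes (simplified).
    text = literal.replace('\\"', '"').replace("\\n", "\n")
    return subject, text
```

This regex-based approach is only a sketch; it ignores blank nodes, typed literals, and full escape handling, which a real RDF library covers.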
Mentors
Mariano Rico
(co-mentor) Nandana Mihindukulasooriya
Project size
350h
Keywords
NLP, text parsing, machine learning, deep learning, RDF generation, relation extraction