Description:
Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). While the community agrees that the output of these systems is text or speech, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A wide range of inputs has been used for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Recently, the generation of natural language from SW data, more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Several challenges have been proposed to investigate the quality of texts automatically generated from RDF (Colin et al., 2016). Moreover, RDF has proven well suited to supporting the creation of NLG benchmarks (Gardent et al., 2017). However, English is the only language that has been widely targeted. Although some studies explore the generation of content in languages other than English, our previous GSoC project, NABU, was the first approach to train a multilingual neural model for generating texts in different languages from RDF data.
Nowadays, Large Language Models (LLMs) are the state of the art in generating content from either data or text. At last year's WebNLG challenge, LLMs were widely used, and GPT-3.5 combined with Google Translate won the track for low-resource languages (Breton, Irish, Maltese, and Welsh), sometimes surpassing human references. Therefore, this year at GSoC 2024, we plan to combine the data from all previous WebNLG challenges, covering German, Russian, Breton, Irish, Maltese, and Welsh, along with the synthetic data we generated in previous GSoC editions for languages such as Portuguese and Hindi.
Project size
- 175 hours or 350 hours
Previous GSoC 2021/22/23
We published NABU at ISWC
Goals:
In this GSoC project, the candidate is expected to train and extend our multilingual neural model, which generates natural language sentences from DBpedia RDF triples in more than one language. The idea is to build on our last GSoC project by investigating other neural network architectures and open-source LLMs.
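To make the task concrete: RDF-to-text models typically first linearize each triple set into a single token sequence that a seq2seq model (or an LLM prompt) can consume. The sketch below illustrates this common preprocessing step; the tag scheme (`<S>`, `<P>`, `<O>`) and the helper name are illustrative assumptions, not the exact format used by NABU.

```python
def linearize_triples(triples):
    """Flatten (subject, predicate, object) triples into one input string
    for an RDF-to-text model. Tags <S>/<P>/<O> mark triple roles."""
    parts = []
    for subj, pred, obj in triples:
        # DBpedia resource names use underscores; replace them with spaces
        parts.append(f"<S> {subj.replace('_', ' ')} "
                     f"<P> {pred.replace('_', ' ')} "
                     f"<O> {obj.replace('_', ' ')}")
    return " ".join(parts)

triples = [
    ("Albert_Einstein", "birthPlace", "Ulm"),
    ("Albert_Einstein", "field", "Physics"),
]
print(linearize_triples(triples))
# <S> Albert Einstein <P> birthPlace <O> Ulm <S> Albert Einstein <P> field <O> Physics
```

The resulting string would then be paired with a reference sentence (e.g. from WebNLG) to train the model end to end.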
Impact:
The project may allow users to automatically generate short summaries of entities that lack a human-written abstract, using their RDF triples.
Warm-up tasks:
- Read the papers:
NABU
A Holistic Natural Language Generation Framework for the Semantic Web
Neural End-to-End vs Pipeline
NeuralREG: An end-to-end approach to referring expression generation
RDF2PT: Generating Brazilian Portuguese Texts from RDF Data
Attention Is All You Need
- Download and get familiar with the code of the papers above.
GitHub - dice-group/NABU: Multilingual RDF Verbalizer
GitHub - msobrevillac/Multilingual-RDF-Verbalizer
GitHub - DiegoMoussallem/RDF2NL: Triples to NL.
GitHub - dice-group/RDF2PT: Portuguese Verbalizer from RDF triples to NL sentences and summaries.
GitHub - ThiagoCF05/NeuralREG: Referring Expression Generation using Neural Networks
GitHub - ThiagoCF05/DeepNLG: A systematic comparison between pipeline and end-to-end architectures in the RDF-to-text task
- Get familiar with our last GSoC project: GitHub - dbpedia/neural-rdf-verbalizer: 🗣 Multilingual RDF Verbalizer – Google Summer of Code 2019
Mentors
@diegomoussallem, TBD, TBD
Keywords
NLG, Semantic Web, NLP, Knowledge Graphs