A Multilingual Neural RDF Verbalizer - GSoC2020


Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Although the community agrees on what these systems output, namely text and speech, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). NLG systems have been built for a wide range of inputs, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Recently, the generation of natural language from SW data, and more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Challenges have been proposed to investigate the quality of texts automatically generated from RDF (Colin et al., 2016). Moreover, RDF has shown promise as a basis for creating NLG benchmarks (Gardent et al., 2017). However, English is the only language that has been widely targeted. Although some studies explore generation in languages other than English, to the best of our knowledge no work has trained a single multilingual neural model that generates text in several languages from RDF data.


In this GSoC project, the candidate is expected to train a multilingual neural model capable of generating natural language sentences from DBpedia RDF triples in more than one language. The idea is to extend our last GSoC project by investigating other neural network (NN) architectures.
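To make the input side of such a model concrete, here is a minimal sketch of how a set of triples could be flattened into one source sequence for a sequence-to-sequence model. It is an illustration under stated assumptions, not the project's actual pipeline: the function names `linearize` and `clean`, the separator tokens `<S>`, `<P>`, `<O>`, and the target-language token of the form `<2de>` are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), given as prefixed names or full URIs.
Triple = Tuple[str, str, str]


def clean(term: str) -> str:
    """Reduce a URI or prefixed name to a readable label.

    Simplification for this sketch: keep what follows the last '/' and the
    first ':', then turn underscores into spaces. Literals that contain ':'
    and URIs that use '#' are not handled.
    """
    local = term.rsplit("/", 1)[-1].split(":", 1)[-1]
    return local.replace("_", " ")


def linearize(triples: List[Triple], target_lang: str) -> str:
    """Flatten triples into one string, prefixed with a target-language token.

    The language token (an assumption of this sketch) lets one shared model
    be asked for output in different languages.
    """
    parts = [f"<2{target_lang}>"]
    for subj, pred, obj in triples:
        parts += ["<S>", clean(subj), "<P>", clean(pred), "<O>", clean(obj)]
    return " ".join(parts)


if __name__ == "__main__":
    example = [("dbr:Alan_Turing", "dbo:birthPlace", "dbr:London")]
    print(linearize(example, "de"))
```

With this layout, the same set of triples can be paired with reference sentences in several languages simply by changing the language token, which is one common way to train a single model for multiple target languages.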


The project may allow users to automatically generate short summaries, from triples, for entities that lack a human-written abstract.

Warm-up tasks:


Diego Moussallem and Thiago Castro Ferreira


NLG, Semantic Web, NLP