A Multilingual Data-to-Text Generation Approach using Large Language Models — GSoC 2024

Description:

Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Although the community agrees on what these systems output (text and speech), there is far less consensus on what the input should be (Gatt and Krahmer, 2017). NLG systems have taken a wide range of inputs, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Recently, the generation of natural language from SW data, and more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Several challenges have been organized to investigate the quality of texts automatically generated from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). However, English is the only language that has been widely targeted. Although some studies explore generation in languages other than English, our previous GSoC project, NABU, was the first approach proposed to train a multilingual neural model for generating texts in different languages from RDF data.

Large Language Models (LLMs) have since become the state of the art for generating content from either data or text. At last year's WebNLG challenge, LLMs were widely used, and GPT-3.5 combined with Google Translate won the challenge for low-resource languages (Breton, Irish, Maltese, and Welsh), sometimes surpassing humans. Therefore, for GSoC 2024, we plan to combine the data from all previous WebNLG challenges, covering German, Russian, Breton, Irish, Maltese, and Welsh, with the synthetic data that we generated in previous GSoC projects for languages such as Portuguese and Hindi.
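As a rough illustration of what combining these sources could look like, the sketch below merges several corpora of (triples, text, language) examples into one training set and drops exact duplicates. The `Example` record, the `merge_corpora` helper, the source labels and the sample sentences are all hypothetical; the actual WebNLG and synthetic data formats may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One training example (hypothetical schema, not the WebNLG format)."""
    triples: tuple  # tuple of (subject, predicate, object) string tuples
    text: str       # reference verbalization
    lang: str       # ISO 639-1 code, e.g. "de", "ru", "br", "ga", "mt", "cy"
    source: str     # illustrative label for where the example came from


def merge_corpora(*corpora):
    """Concatenate corpora, keeping the first copy of each exact duplicate."""
    seen = set()
    merged = []
    for corpus in corpora:
        for ex in corpus:
            # The source label is ignored when checking for duplicates.
            key = (ex.triples, ex.text, ex.lang)
            if key not in seen:
                seen.add(key)
                merged.append(ex)
    return merged


# Made-up sample data, for illustration only.
triple = (("Ulm", "country", "Germany"),)
webnlg = [Example(triple, "Ulm liegt in Deutschland.", "de", "webnlg")]
synthetic = [
    Example(triple, "Ulm fica na Alemanha.", "pt", "synthetic"),
    Example(triple, "Ulm liegt in Deutschland.", "de", "synthetic"),  # duplicate
]

merged = merge_corpora(webnlg, synthetic)
per_language = Counter(ex.lang for ex in merged)
print(len(merged), dict(per_language))
```

Counting examples per language after merging is a quick way to see how unbalanced the combined set is before training.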

Project size

  • 175 hours or 350 hours

Previous GSoC 2021/22/23

We published NABU at ISWC

Goals:

In this GSoC project, the candidate is expected to train and extend our multilingual neural model, which generates natural language sentences from DBpedia RDF triples in more than one language. The idea is to build on our last GSoC project by investigating other neural network architectures and open-source LLMs.
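To make the task concrete, here is a minimal sketch of how a set of DBpedia triples might be linearized and wrapped in a prompt for an instruction-tuned LLM. The `<S>/<P>/<O>` markers, the prompt wording, and the helper names (`local_name`, `linearize`, `build_prompt`) are assumptions made for illustration; they are not taken from NABU or from any WebNLG system.

```python
# Hypothetical mapping from language codes to names used in the prompt.
LANGUAGE_NAMES = {
    "en": "English", "de": "German", "ru": "Russian", "br": "Breton",
    "ga": "Irish", "mt": "Maltese", "cy": "Welsh", "pt": "Portuguese",
    "hi": "Hindi",
}


def local_name(term):
    """Turn an IRI into a readable label; leave literals unchanged."""
    if not term.startswith("http"):
        return term
    name = term.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    return name.replace("_", " ")


def linearize(triples):
    """Flatten (subject, predicate, object) triples into one tagged string."""
    return " ".join(
        f"<S> {local_name(s)} <P> {local_name(p)} <O> {local_name(o)}"
        for s, p, o in triples
    )


def build_prompt(triples, lang):
    """Build an illustrative instruction prompt for the target language."""
    language = LANGUAGE_NAMES[lang]
    return (
        f"Verbalize the following RDF triples as fluent {language} text.\n"
        f"Triples: {linearize(triples)}\n"
        f"Text:"
    )


triples = [(
    "http://dbpedia.org/resource/Albert_Einstein",
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/resource/Ulm",
)]
print(build_prompt(triples, "de"))
```

The resulting string could be passed to any text-generation model; which model and decoding settings to use is left open here.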

Impact:

The project may allow users to automatically generate short summaries, based on triples, for entities that do not have a human-written abstract.

Warm-up tasks:

Mentors

@diegomoussallem @TBD , TBD

Keywords

NLG, Semantic Web, NLP, Knowledge Graphs


Additional paper: https://synalp.gitlabpages.inria.fr/webnlg-challenge/webnlg_2023_report.pdf

Hello @diegomoussallem,
My name is Raya Chakravarty, and I am currently pursuing Computer Science at Veermata Jijabai Technological Institute in Mumbai, India.

I have previous experience with Large Language Models (LLMs), having developed a Healthcare Chatbot using llama-2.
I am keenly interested in contributing to this project and have begun familiarizing myself with the research papers and GitHub repositories.

From what I understand, the project involves implementing a natural language generation framework to verbalize RDF triples in multiple languages, building on the work of previous GSoC projects. However, this year's emphasis is on leveraging Large Language Models (LLMs) to improve the generation of text from data, specifically RDF triples.

I have also started researching LLMs for this.

  1. Other than being open source, model size, and the languages supported, are there any other factors that should be taken into account when selecting an LLM?
  2. Will we be considering only the languages mentioned above, or others as well?
  3. I am already halfway through the warm-up tasks. Are there any other tasks that you would like me to do apart from these?

Hi @diegomoussallem

My name is Jerry Wang, and I am currently engaged as a Research Assistant, following my graduation with a degree in Data Science from The University of British Columbia, Canada. Here’s my LinkedIn profile for a detailed overview of my academic and professional journey.

I have related research experience in the Semantic Web, as evidenced by my paper accepted at ISWC 2023 (link to the paper). I also have extensive experience with LLMs, highlighted by my achievements in related Kaggle competitions, which you can see on my Kaggle profile. I am very interested in this project and want to contribute to open source, especially to DBpedia, from which I have greatly benefited.

I have started the warm-up tasks you posted. I think my background, which combines research and computer engineering experience, aligns well with the project's needs.

I would be grateful for the chance to stay connected with you and receive your guidance on drafting the proposal for this project. Could you please advise me on the most convenient method to reach out to you for further discussions?

Looking forward to collaborating with you and contributing to the community.

Warm regards,
Jerry Wang

Hi Raya,

nice to hear from you.

  1. I think you should focus on the smaller LLMs, i.e., 7B-parameter models.
  2. Feel free to suggest other languages.
  3. Nope.

Hi Jerry,

nice to hear from you.

You can reach me via the DBpedia Slack channels :)