A Multilingual Data-to-Text Generation Approach using Large Language Models — GSoC 2024

diegomoussallem · February 3, 2024, 1:52pm

Description:

Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A wide range of inputs has been used for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Recently, the generation of natural language from SW data, more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Some challenges have been proposed to investigate the quality of texts automatically generated from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). However, English is the only language that has been widely targeted. Even though some studies explore the generation of content in languages other than English, our previous GSoC project, NABU, was the first approach proposed to train a multilingual neural model for generating texts in different languages from RDF data.

Nowadays, Large Language Models (LLMs) have become the state of the art in generating content from either data or text. At last year's WebNLG challenge, LLMs were widely used, and GPT-3.5 combined with Google Translate won the challenge for low-resource languages (Breton, Irish, Maltese, and Welsh), sometimes surpassing humans. Therefore, this year at GSoC 2024, we plan to combine the data from all previous WebNLG challenges, covering German, Russian, Breton, Irish, Maltese, and Welsh, along with the synthetic data that we generated in previous GSoC projects for languages such as Portuguese and Hindi.

Project size

175 hours or 350 hours

Previous GSoC 2021/22/23

We published NABU at ISWC

Goals:

In this GSoC project, the candidate is expected to train and extend our multilingual neural model, which generates natural language sentences from DBpedia RDF triples in more than one language. The idea is to build on our last GSoC project by investigating other neural network architectures and open-source LLMs.
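
To make the task concrete, here is a minimal sketch of how a set of triples could be flattened into a prompt for an LLM. It is only an illustration under stated assumptions: the function names `linearize` and `build_prompt`, the separator format, and the prompt wording are hypothetical and are not taken from NABU or from any WebNLG system.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), as in an RDF statement.
Triple = Tuple[str, str, str]


def linearize(triples: List[Triple]) -> str:
    """Flatten triples into a single string (hypothetical format)."""
    return " | ".join(f"{s} : {p} : {o}" for s, p, o in triples)


def build_prompt(triples: List[Triple], language: str) -> str:
    """Wrap the linearized triples in an instruction naming the target language."""
    return (
        f"Verbalize the following RDF triples in {language}.\n"
        f"Triples: {linearize(triples)}\n"
        "Text:"
    )


# Illustrative input only; not drawn from a specific dataset split.
triples = [
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
    ("Alan_Bean", "occupation", "Test_pilot"),
]
prompt = build_prompt(triples, "Portuguese")
print(prompt)
```

The resulting string could then be passed to whichever open-source LLM is being evaluated; changing the `language` argument is the only difference between target languages in this sketch.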

Impact:

The project may allow users to automatically generate short summaries, using triples, for entities that do not have a human-written abstract.

Warm-up tasks:

Mentors

@diegomoussallem @TBD , TBD

Keywords

NLG, Semantic Web, NLP, Knowledge Graphs

diegomoussallem · February 3, 2024, 1:54pm

Additional paper: https://synalp.gitlabpages.inria.fr/webnlg-challenge/webnlg_2023_report.pdf

raya · February 12, 2024, 6:21pm

Hello @diegomoussallem,
My name is Raya Chakravarty, and I am currently pursuing Computer Science at Veermata Jijabai Technological Institute in Mumbai, India.

I have previous experience with Large Language Models (LLMs), having developed a healthcare chatbot using Llama 2.
I am keenly interested in contributing to this project and have begun familiarizing myself with the research papers and GitHub repositories.

From what I understand, the project involves implementing a natural language generation framework to verbalize RDF triples in multiple languages, building upon the work of previous GSoC projects. However, this year’s emphasis is on leveraging LLMs to improve the generation of text from data, specifically RDF triples.

I have also started researching LLMs for this.

Besides being open source, model size, and language coverage, are there any other factors that should be taken into account when selecting an LLM?
Also, will we be considering only the languages mentioned above, or others as well?
I am already halfway through the warm-up tasks. Are there any other tasks that you would like me to do apart from these?

submergence2000 · February 16, 2024, 8:27pm

Hi @diegomoussallem

My name is Jerry Wang, and I am currently engaged as a Research Assistant, following my graduation with a degree in Data Science from The University of British Columbia, Canada. Here’s my LinkedIn profile for a detailed overview of my academic and professional journey.

I have related research experience in the Semantic Web, as evidenced by my paper accepted at ISWC 2023 (link to the paper). I also have extensive experience with LLMs, highlighted by my achievements in related Kaggle competitions, which you can see on my Kaggle profile. I am very interested in this project and want to contribute to open source, especially to DBpedia, from which I have greatly benefited.

I have started the warm-up tasks you posted. I think my background, which combines research and computer engineering experience, aligns well with the project's needs.

I would be grateful for the chance to stay connected with you and receive your guidance on drafting the proposal for this project. Could you please advise me on the most convenient method to reach out to you for further discussions?

Looking forward to collaborating with you and contributing to the community.

Warm regards,
Jerry Wang

diegomoussallem · February 18, 2024, 6:33pm

Hi Raya,

Nice to hear from you.

I think you should focus on the smaller LLMs, i.e., 7B parameters.
Feel free to suggest other languages.
No, there are no other tasks apart from the warm-up ones.

diegomoussallem · February 18, 2024, 6:34pm

Hi Jerry,

Nice to hear from you.

You can reach me via the DBpedia Slack channels.

utraj080303 · March 8, 2024, 8:46pm

Dear @diegomoussallem,

I am Utkarsh Raj, a pre-final year student specializing in the cutting-edge domains of Machine Learning, Generative AI and Artificial Intelligence. My journey in the realm of technology has been marked by a diverse and illustrious skill set, honed through experiences ranging from fine-tuning intricate models to crafting code for open-source Large Language Models (LLMs). Notably, my expertise extends beyond conventional technologies, encompassing the nuanced landscape of Natural Language Processing (NLP). I have not only delved into the depths of pre-trained models like BERT and GPT-2 for tasks such as summarization but have also pioneered novel methodologies like code generation using the groundbreaking Magicoder LLM Model. I have a profound knack for deriving inspiration from research papers, which has enabled me to conceive and implement novel AI/ML models. This amalgamation of skills positions me as an ideal candidate to spearhead the extension and enhancement of the existing multilingual neural model for generating natural language sentences from DBpedia RDF triples across multiple languages. With my extensive background and proven capabilities, I am poised to make substantial contributions to the project, empowering users to effortlessly produce concise summaries for entities bereft of human abstracts using triples.

Looking forward to collaborating with the mentors and other brilliant people in achieving success on this project.

Could I get some more information about the project and the dataset intended to be used, so that I can start the work?

utraj080303 · March 16, 2024, 6:32am

I am unable to join the Slack channel, even after scanning the barcode and entering the key.

victoroliveira · March 25, 2024, 3:02pm

Hello @diegomoussallem,

I am Victor Oliveira, a Computer Science Master’s candidate at the Federal Rural University of Semi-Arid Region, Brazil, and a fellow researcher at the University of Twente, Netherlands. I have been researching web semantics, ontology engineering, machine learning, NLG, NLP, LLMs, RAG, etc. You can find all my academic and professional endeavors on my LinkedIn profile. Currently, I am developing an OWL ontology to describe the legal aspects of International Data Spaces (IDS), based on the Information Model Ontology and the Service Contract Ontology. For that purpose, I developed the Legal Interoperability Ontology for IDS (LegIOn-IDS). Furthermore, based on the instances of a contract within the ontology architecture, we may generate a natural language service contract using NLP and LLMs for text processing and generation.

Finally, I am also researching the amplified intelligence field and translation from RDF and natural language to SPARQL queries, to retrieve knowledge from KBs.

I have just started the warm-up tasks; coincidentally, I was already familiar with a few of the provided papers.

I have been a huge fan since I discovered your work, and I hope this will be an opportunity to work together. I hope we can connect on Slack.

Best regards,
Victor Oliveira

sumana-2705 · December 2, 2024, 11:59am

Hello @diegomoussallem

My name is Sumana Sree, and I am currently pursuing my master's at the Indian Institute of Technology (BHU). I am very interested in working on this project. Will it be open for GSoC 2025?