A Multilingual Neural Data-to-Text Generation - GSoC 2023

diegomoussallem · February 10, 2023, 11:42am

Description:

Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). NLG systems have taken many kinds of input, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Recently, the generation of natural language from SW data, more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Some challenges have been proposed to investigate the quality of automatically generated texts from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). However, English is the only language that has been widely targeted. Even though there are studies that explore the generation of content in languages other than English, to the best of our knowledge, only our previous GSoC project, NABU, has been proposed to train a multilingual neural model for generating texts in different languages from RDF data.

Previous GSoC 2020

We published NABU at ISWC

Goals:

In this GSoC project, the candidate is expected to train and extend our multilingual neural model, which generates natural language sentences from DBpedia RDF triples in more than one language. The idea is to build on our last GSoC project by investigating other neural network architectures.
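
To make the input side of this task concrete, here is a minimal sketch of one common way to turn a set of RDF triples into a single source sequence for a multilingual sequence-to-sequence model. The `<2xx>` target-language tag, the `<S>`/`<P>`/`<O>` markers, the `linearize` helper and the example triples are all illustrative assumptions; they are not taken from NABU and may differ from its actual input format.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), all as plain strings.
Triple = Tuple[str, str, str]


def linearize(triples: List[Triple], target_lang: str) -> str:
    """Flatten triples into one string, prefixed by a target-language tag.

    The tag and the <S>/<P>/<O> markers are hypothetical conventions
    chosen for this sketch, not NABU's documented format.
    """
    parts = [f"<2{target_lang}>"]
    for subj, pred, obj in triples:
        parts.append(f"<S> {subj} <P> {pred} <O> {obj}")
    return " ".join(parts)


# Hypothetical example triples about a single entity.
triples = [
    ("Albert_Einstein", "birthPlace", "Ulm"),
    ("Albert_Einstein", "field", "Physics"),
]

# The same triples can be paired with different target languages.
source_en = linearize(triples, "en")
source_de = linearize(triples, "de")
print(source_en)
print(source_de)
```

With this kind of encoding, one model can in principle be trained on pairs of linearized triples and reference sentences in several languages, with the leading tag selecting the output language.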

Impact:

The project may allow users to automatically generate short summaries from triples for entities that do not have a human-written abstract.

Warm-up tasks:

Mentors

@diegomoussallem @tsoru, TBD

Keywords

NLG, Semantic Web, NLP, Knowledge Graphs

5hv5hvnk · February 28, 2023, 12:02pm

Hello @diegomoussallem ,
I am Shashank Kirtania, currently a final-year computer major at Thapar Institute of Engineering and Technology, and I am interested in working on this project. I am proficient in Python, C++ and Solidity. I have used various Python computing packages and have experience in developing deep learning models. I am currently working as an ML intern at WadhwaniAI. My previous experience includes working at PyMC Labs, IIT Delhi and IIIT Allahabad in domains like computer vision and statistical modelling.
I have decent open source experience and a background in applied machine learning to contribute to this project positively. I have knowledge of auto-encoders, GANs and basic transformers, and a strong understanding of the basics of deep learning. I have also been through some of the vision transformers and have a basic understanding of them. I have skimmed through the NABU paper and have been reading through the NABU codebase to understand the implementation from a programmer's perspective. I believe I can be a potential fit for the project and deliver positively on it.
If you believe I might be a potential contributor, we can set up a Google Meet/Huddle call to discuss possibilities for our collaboration. I have also attached my resume for you to evaluate my profile.

Best Regards
Shashank Kirtania

wattsishaan · March 1, 2023, 7:22pm

Hi @diegomoussallem @tsoru
My name is Ishaan Watts, and I am a final year student at IIT Delhi majoring in Engineering Physics and minoring in Computer Science with a CGPA of 9.11. I would like to express my interest in being a part of this project.

I have experience working with graph neural techniques and NLP, having previously worked as a Data Scientist intern at Udaan, a Machine Learning Engineer intern at Torch Investment Management, and a research intern at Griffith University. During my internships, I have worked on various projects such as using Heterogeneous and Multi-relational Graph Auto Encoders for generating holistic user embeddings, optimizing regression models, performing NLP tasks such as sentiment analysis and topic classification on Twitter data, and using GCNs for malware detection.

I have also gone through your NABU paper, and its use of knowledge graphs further piqued my interest. I believe my previous experience with graph neural techniques and NLP will be a valuable contribution to the project.

Thank you for considering my application. I look forward to hearing from you soon. I have also attached my CV.

Best
Ishaan Watts

deborahdjon · March 4, 2023, 1:43am

Hi @diegomoussallem and @tsoru,

My name is Deborah, and I am an MSc computing (Maj. AI) student at Dublin City University. I would like to learn more about open-source development by getting more hands-on experience in the fields of semantic web and linked open data in this project. Learning two languages shows how difficult it can be to find proper translations for specific texts. With this project, I would like to use my love for languages and technical skills to contribute to DBpedia’s open-source code.

During my master’s and my computer science undergrad at the Baden-Württemberg Cooperative State University (Germany), I learned about the semantic web, SPARQL, programming in Python, and more in modules like Semantic Web, Artificial Intelligence and Information Seeking, and Big Data Technologies.

Projects Include:

Implementation of the reinforcement learning framework OpticalRLGym at the Nokia Bell Labs Paris-Saclay (Python, Java, JSON-RPC, Apache Kafka, Bash)
Creation of a Telegram chatbot for playing a card game (Python)
Student NLP thesis on analyzing movie scripts to create movie content-based summaries (Python, BeautifulSoup, NumPy, Pandas)

This is my GitHub, and my LinkedIn.

tsoru · March 7, 2023, 12:57pm

Hi @5hv5hvnk @wattsishaan @deborahdjon,

I have already replied to some of you privately. If not, thanks for your interest in our project. Please follow these steps:

if you haven’t already, start with the warm-up tasks;
prepare a Google Doc draft of a project proposal on the lines of this example of a successful proposal we received a few years ago;
when you have reached a few pages and are happy with your draft, please invite me as an editor (mommi84 at gmail dot com), and we will help you elaborate on your idea.

deborahdjon · March 8, 2023, 6:52pm

Perfect, thanks for the info. Working on it now

aditya-coys · March 14, 2023, 12:04am

Hey @diegomoussallem! I am Aditya Hari, a graduate student at IIIT-Hyderabad. My research primarily centers on NLP, and the thread that I have been working on for the past few months is precisely multilingual data-to-text generation, so I would be incredibly interested in working on this project. The problem statement that I have been working on is generating text from facts given as RDF triples in a cross-lingual manner, i.e., facts are given in language A (only English in our case) and the text is generated in language B. I already have a good understanding of different techniques for this problem, which I believe will be really helpful for this task. I would be happy to discuss this further if you’re interested.
Beyond this, I have plenty of experience with NLP, and with LLMs in particular.

I have already gone through the warm-up tasks, so can I also proceed with starting a draft of the proposal?

tsoru · March 19, 2023, 12:16pm

Hi @aditya-coys, I’ve just replied to you in the other conversation.

submergence2000 · April 1, 2023, 6:26pm

Hello @diegomoussallem and @tsoru

I am Junrui (Jerry) Wang. I am studying Data Science in Computational Linguistics at The University of British Columbia, Canada. I am interested in this Data-to-Text Generation project and want to contribute through the GSoC program.

I am familiar with the Semantic Web and knowledge graphs because my graduation thesis, written when I was an undergraduate studying Computer Science at Nanjing University, was about a co-clustering method between an RDF dataset and an ontology. My classmate and I submitted a paper to ISWC, but it was not accepted. I am very interested in this field, and after studying data science and natural language processing for almost a year, I think I am more qualified to work on projects in this field.

I only learned about this Google program yesterday, and I think it’s a little bit late to start preparing, but I’m willing to give my best shot for this far-reaching project. As the head of the GSoC program in DBpedia Association, could you tell me what I should do to prepare the project proposal as soon as possible?

What I’m doing to make up for my late discovery of the program:

Read relevant papers (some have already been read)
Get familiar with the last GSoC project
Fill in the Contributor Application Template

I wish you all the best.
Junrui Wang

tsoru · April 2, 2023, 5:17pm

Hi @submergence2000 and thanks for your interest in the project.

Please prepare a draft proposal on the lines of this example of a successful proposal and add mommi84 at gmail dot com as an editor, so we mentors can try to leave you feedback before the deadline, depending on our schedule.

submergence2000 · April 4, 2023, 4:40am

Thank you, @tsoru

I finally completed the whole proposal and added you as an editor. I am very grateful for your help. I have benefited a lot from the open-source community in my past studies. Thank you for your selfless contribution!

I wish you all the best!

Junrui Wang