Training a Model for Neural Question Answering over DBpedia - GSoC 2023

This project started in 2018 as ‘A Neural QA Model for DBpedia’ and is now heading into its 6th consecutive year at Google Summer of Code.

Introduction

Neural SPARQL Machines (NSpM) aim at building an end-to-end system to answer questions posed by users who are not versed in writing SPARQL queries.

Currently, billions of relationships on the Web are expressed in the RDF format. Accessing such data is difficult for a lay user who does not know how to write a SPARQL query. This GSoC project consists of building upon the NSpM question answering system, which tries to make this humongous body of linked data accessible to a larger user base in their natural language (as of now restricted to English), by improving and extending the existing codebase, which resides at the link below.

Documentation

Related work

The first three papers introduce and elaborate on Neural SPARQL Machines. The third was carried out by our GSoC 2019 student and published at KGSWC 2020. The fourth paper is an almost-complete survey of related approaches.

  1. SPARQL as a Foreign Language
  2. Neural Machine Translation for Query Construction and Composition
  3. Exploring Sequence-to-Sequence Models for SPARQL Pattern Composition
  4. Introduction to Neural Network based Approaches for Question Answering over Knowledge Graphs

GSoC Blogs

You may also check which problems past GSoC contributors worked on:

  1. [GSoC 2018] Aman’s Blog — building raw templates
  2. [GSoC 2019] Anand’s Blog — automating template creation
  3. [GSoC 2020] Zheyuan’s Blog — paraphrasing questions
  4. [GSoC 2021] Siddhant’s Blog — data augmentation
  5. [GSoC 2022] Saurav’s Blog — refining template discovery

Warm-up tasks

  1. Read the Medium post What is a Neural SPARQL Machine? to get a general idea about NSpM.
  2. Read through the most recent blogs and the reading list above to get a good understanding of both the code and the project.
  3. Run the pipelines in the ./gsoc/anand and ./gsoc/zheyuan folders of the base repository using examples of your choice.

Your proposal

Now that you have a good understanding of the current state of the project, we ask you to write your own proposal. Feel free to bring your own solutions to tackle the problem that the project currently faces, i.e. training a question-answering model using the dataset we have built over the years.

Although the original paper mentions a seq2seq model, the NSpM paradigm allows us to choose any model as our Learner to translate natural-language questions into SPARQL. You may even propose your own model or one from any other community (e.g., HuggingFace).
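Whatever Learner you pick, it helps to remember why translation into SPARQL is feasible at all: the query is first linearized into a sequence of word-like tokens, in the spirit of the "SPARQL as a Foreign Language" paper. The sketch below illustrates the idea; the specific token names (`brack_open`, `var_x`, etc.) are illustrative assumptions and not the exact NSpM vocabulary.

```python
# Minimal sketch of linearizing a SPARQL query into a token sequence that a
# seq2seq Learner can emit as plain "words". Token names are assumptions
# for illustration, not the exact NSpM encoding.
SPARQL_TO_TOKEN = {
    "{": "brack_open",
    "}": "brack_close",
    "SELECT": "select",
    "WHERE": "where",
    "?x": "var_x",
}
TOKEN_TO_SPARQL = {v: k for k, v in SPARQL_TO_TOKEN.items()}

def encode(query: str) -> str:
    """Replace SPARQL syntax with word-like tokens, leaving IRIs untouched."""
    return " ".join(SPARQL_TO_TOKEN.get(tok, tok) for tok in query.split())

def decode(tokens: str) -> str:
    """Invert the encoding to recover an executable query."""
    return " ".join(TOKEN_TO_SPARQL.get(tok, tok) for tok in tokens.split())

query = "SELECT ?x WHERE { ?x dbo:author dbr:Dan_Brown }"
encoded = encode(query)
assert decode(encoded) == query  # round-trip is lossless
```

Because the encoding is invertible, the model only ever sees (and produces) flat token sequences, and the decoder step turns its output back into a runnable query.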

Project size

The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 300).

Mentors

@tsoru, @panchbhai1969, @tiwarisanju18, @sauravjoshi23

Feel free to contact us for more information. We eagerly look forward to working with you and contributing towards making data accessible to all.


Hello @tsoru, my name is Abdulsobur. From the description you listed above, I would like to ask about the dataset available: has the dataset gone through data processing (cleaning) before training?

Also, from the description, it seems it is okay if we fine-tune an already-built model for the project. Right?

Also, from the Medium post you shared, will some of the datasets be generated by one of the Neural SPARQL Machine modules (the generator)?

Hi @smilingprogrammer and thanks for your interest in the project.

Has the dataset gone through data processing before training? (Cleaning of data)

The dataset is artificially built, so it’s likely that it won’t need any data cleaning. It could, however, contain outliers that we want to get rid of.

will some of the datasets be generated by one of the Neural SPARQL Machine modules (the generator)?

Yes, the working dataset is the result of the generator. Think of it as a typical case of machine translation, where natural-language questions are mapped to SPARQL queries.
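To make the generator idea concrete, here is a hypothetical sketch of how a template pair can be instantiated with DBpedia entities to produce training examples; the placeholder syntax (`<A>`) and the sample entity list are assumptions for illustration, not the actual NSpM templates.

```python
# Hypothetical sketch: a generator fills a paired question/query template
# with (label, DBpedia resource) entities, yielding one training example
# per entity for the machine-translation-style Learner.
question_template = "Who is the author of <A>?"
query_template = "SELECT ?x WHERE { <A> dbo:author ?x }"

entities = {  # sample (label, resource) pairs, for illustration only
    "The Da Vinci Code": "dbr:The_Da_Vinci_Code",
    "Inferno": "dbr:Inferno_(Dan_Brown_novel)",
}

def generate(label: str, resource: str) -> tuple[str, str]:
    """Fill both templates, producing one (question, query) pair."""
    return (question_template.replace("<A>", label),
            query_template.replace("<A>", resource))

dataset = [generate(lbl, res) for lbl, res in entities.items()]
```

Each resulting pair plays the role of a (source sentence, target sentence) example in a machine-translation setup.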

Thanks for clarifying this for me. @tsoru

Hello everyone, and particularly the mentors! I am Abdulsobur Oyewale, a machine learning engineer, and I am relatively new to open source. I am keen to participate in GSoC ’23, and DBpedia caught my eye while looking through previous organizations.

Since I’m trying out the warm-up tasks, is there any particular project I can start contributing to in order to get more in-depth with the community, or can I start writing my proposal on how I would approach the project? Thank you.

Hi @smilingprogrammer, I am @sauravjoshi23, junior mentor for the project, and I would love to help you succeed. I would personally focus on trying to understand the previous years’ projects and everything else in the Documentation section. There you will understand what we are trying to achieve and which problems we faced during previous years. Since you said you are doing the warm-up tasks, I believe you are done with the previous steps. Once you are done with the warm-up tasks, I think you can explore the domain further, read research papers, study similar GitHub projects, and then finally write proposal v1 and send it to us. We will help you make it more interesting and fill in any gaps.

Thanks for the guidance, @sauravjoshi23.
Looking forward to more of your advice and to collaborating. I have also been spending some time reading the papers you shared.


Greetings Respected sir

I am Arinjay Pathak, a final-year Bachelor of Engineering (B.E.) student. I am interested in machine learning, deep learning, and natural language processing, and I have worked on projects in these domains. I am a beginner in open source, and I want to work on summarization. I won the Smart India Hackathon, organized by the Government of India, where I built a semantic search tool using Sentence Transformers. My other projects include speech emotion recognition on the RAVDESS dataset using ensemble modelling techniques.

I have also worked on text classification problems, where I achieved 95%+ accuracy on the UCI spam classification dataset.

I am familiar with the theoretical and practical aspects of traditional NLP, large language models, and statistics, which will help me with training and fine-tuning models for question answering.

I am attaching my resume for your reference.
Thanks


Hi @arinjay11020, thanks for your interest in the project. It would be great if you can complete the warm-up tasks and familiarise yourself with the project. If you have any questions, we are here to help you!


Hello, I have done my PhD in the natural language processing area, working on low-resource languages. I would like to contribute to this.


Hi @smilingprogrammer, @arinjay11020, @kusumlata123
I have already replied to some of you privately. If not, thanks for your interest in our project. Please follow the next steps:

  1. if you haven’t already, start with the warm-up tasks;
  2. prepare a Google Doc draft of a project proposal along the lines of this example of a successful proposal we received a few years ago;
  3. when you have reached a few pages and are happy with your draft, please invite @sauravjoshi23 as an editor (sauravjoshi2362000 at gmail dot com), and we will help you elaborate on your idea.

Sure, I am already halfway through and have done the warm-up tasks; I will send you an edit request soon.


Hello community! I’m Mehrzad Shahin-Moghadam, a final-year PhD student based in Montreal. Although I come from a civil engineering background, I have in-depth knowledge of linked data principles and tools (RDF, OWL, SPARQL) and of NLP with SOTA deep learning.

I just finished a one-year internship at a software company, where we explored question answering over building graphs. Here you can find a short video of the demo app I developed. During this internship I gained valuable insights into neural/semantic search and hands-on experience with TensorFlow and HuggingFace.

My key motivation here is to give back to the community :raised_hands: I’ve been using tons of open-source modules in my work. Here’s an example notebook where we experimented with RDF2Vec. That said, I have not been active in the open-source community so far :sweat_smile:

I found this project particularly appealing, as in my PhD we investigated exactly the same idea: architects and mechanical/civil engineers need an intuitive NL-based interface to perform information retrieval from linked data graphs; they will NOT use SPARQL!

Dear mentors, I’ll start with warm-ups ASAP. Please let me know if you have any questions!

Regards,
Mehrzad


Hi @mehrzadshm and thank you for your interest in the project!

Please follow the steps mentioned above.


Hi @tsoru, thanks for the response!
Sure, I’m on it… will share the draft with @sauravjoshi23 soon.


@sauravjoshi23 Sir, I am Ridhiman, and I am keen on this project as part of GSoC 2023. I have prepared a draft proposal. Please let me know your email ID so that I can grant access to the Google Doc: https://docs.google.com/document/d/11VLNwHbqhHK4J9DlpXACoGhONgEf6hF6RDfmks3sOQ4/edit?usp=sharing


Hello all, here is my email ID: sauravjoshi2362000@gmail.com


@sauravjoshi23, thanks, Sir. I have shared the proposal draft in edit mode. I request you to review it and suggest improvements.
