This project started in 2018 as ‘A Neural QA Model for DBpedia’ and is now heading into its 6th consecutive year at Google Summer of Code.
Introduction
Neural SPARQL Machines (NSpM) aim at building an end-to-end system that answers questions posed by users who are not versed in writing SPARQL queries.
Currently, billions of relationships on the Web are expressed in the RDF format. Accessing such data is difficult for a lay user who does not know how to write a SPARQL query. This GSoC project consists of building upon the NSpM question-answering system, which tries to make this humongous amount of linked data accessible to a larger user base in their natural language (for now restricted to English), by improving and extending the existing codebase, which resides at the link below.
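To illustrate the gap NSpM bridges, here is a hypothetical question/query pair (not taken from the project's dataset): the user asks in plain English, but DBpedia only understands SPARQL.

```python
# A hypothetical example: the question a lay user asks, and the SPARQL
# query they would otherwise have to write themselves against DBpedia.
question = "What is the capital of Germany?"

sparql = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?capital WHERE {
  dbr:Germany dbo:capital ?capital .
}"""

print(question)
print(sparql)
```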
Documentation
Related work
The first 3 papers introduce and elaborate on Neural SPARQL Machines. Work number 3 was carried out by our GSoC 2019 student and published at KGSWC 2020. The 4th paper is an almost-complete survey of related approaches.
Read through the most recent blogs and the reading list to get a good understanding of the code. This will allow you to get a good idea about the project.
Run the pipelines in the ./gsoc/anand and ./gsoc/zheyuan folders of the base repository using examples of your choice.
Your proposal
Now that you have a good understanding of the current state of the project, we ask you to write your own proposal. Feel free to bring your own solutions to tackle the problem that the project currently faces, i.e. training a question-answering model using the dataset we have built over the years.
Although the original paper mentions a seq2seq model, the NSpM paradigm allows us to choose any model as our Learner to translate natural-language questions into SPARQL. You may even propose your own model or adopt one from any other community (e.g., HuggingFace).
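As one illustration of that freedom, a modern text-to-text Learner (such as a T5-style model) can be trained by simply prefixing each generated pair. The function name, prefix string, and sample pair below are assumptions for the sketch, not part of the current codebase:

```python
# Sketch: casting the NSpM translation task into the text-to-text format
# used by seq2seq models such as T5 (prefix and names are assumptions).
def to_text2text(pairs, prefix="translate English to SPARQL: "):
    """Turn (question, query) pairs into (input, target) training examples."""
    return [(prefix + question, query) for question, query in pairs]

pairs = [
    ("who is the mayor of Berlin",
     "SELECT ?x WHERE { dbr:Berlin dbo:mayor ?x }"),
]
examples = to_text2text(pairs)
print(examples[0][0])
```

Any seq2seq framework (Keras, fairseq, HuggingFace Transformers, ...) can then be fine-tuned on these (input, target) string pairs.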
Project size
The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 300).
Hello @tsoru, my name is Abdulsobur. From the description above, I would like a clarification about the available dataset: has the dataset gone through preprocessing (data cleaning) before training?
Also, from the description, it seems it is okay to fine-tune an already-built model for the project. Right?
Has the dataset gone through data processing before training? (Cleaning of data)
The dataset is artificially built, so it’s likely that it won’t need any data cleaning. It could, however, contain outliers that we want to get rid of.
Will some of the datasets be generated by one of the Neural SPARQL Machine modules (the generator)?
Yes, the working dataset is the result of the generator. Think of it as a typical case of machine translation, where natural-language questions are mapped to SPARQL queries.
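To make the machine-translation analogy concrete, the generator can be pictured as filling placeholders in paired question/query templates. This is a simplified sketch with made-up templates and entities; the actual NSpM template format differs:

```python
# Simplified sketch of NSpM-style data generation: a question template and
# a SPARQL template share a placeholder <A> filled with DBpedia entities.
question_tpl = "who wrote <A>"
query_tpl = "SELECT ?x WHERE { dbr:<A> dbo:author ?x }"

entities = ["The_Hobbit", "Dune_(novel)"]

dataset = [
    (question_tpl.replace("<A>", e.replace("_", " ")),  # NL side
     query_tpl.replace("<A>", e))                       # SPARQL side
    for e in entities
]

for question, query in dataset:
    print(question, "->", query)
```

The resulting (question, query) pairs are exactly the parallel corpus a machine-translation model expects.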
Hello everyone, and particularly the mentors! I am Abdulsobur Oyewale, a machine learning engineer, and I am relatively new to open source. I am keen to participate in GSoC ’23, and DBpedia caught my eye while looking through previous organizations.
Since I’m trying out the warm-up tasks, is there any particular project I can start contributing to in order to get more in-depth with the community, or can I start writing my proposal on how I would approach the project? Thank you.
Hi @smilingprogrammer, I am @sauravjoshi23, junior mentor for the project, and I would love to help you succeed. I would personally focus on understanding the previous years’ projects and everything else in the Documentation section. There you will see what we are trying to achieve and what problems we faced during previous years. Since you said you are doing the warm-up tasks, I believe you are done with the previous steps. Once you are done with the warm-up tasks, I think you can explore the domain further, read research papers, study similar GitHub projects, and then finally write proposal v1 and send it to us. We will help you make it more interesting and fill in any gaps.
Thanks for the guide @sauravjoshi23
Looking forward to more of your guidance and collaboration. I have also been spending some time reading the papers you shared.
I am Arinjay Pathak, a final-year Bachelor of Engineering (B.E.) student. I am interested in machine learning, deep learning, and natural language processing, and have worked on projects in these domains. I am a beginner in open source and I want to work on summarization. I won the Smart India Hackathon, organized by the Government of India, where I built a semantic search system using Sentence Transformers. My other projects include speech emotion recognition on the RAVDESS dataset using ensemble modeling techniques.
I have also worked on text classification problems, where I achieved 95%+ accuracy on the UCI spam classification dataset.
I am familiar with the theoretical and practical aspects of traditional NLP, large language models, and statistics, which will help me with training and fine-tuning models for question answering.
Hi @arinjay11020, thanks for your interest in the project. It would be great if you can complete the warm-up tasks and familiarise yourself with the project. If you have any questions, we are here to help you!
Once you have reached a few pages and are happy with your draft, please invite @sauravjoshi23 as an editor (sauravjoshi2362000 at gmail dot com), and we will help you elaborate on your idea.
Hello community! I’m Mehrzad Shahin-Moghadam, a final-year PhD student based in Montreal. Although I come from a civil engineering background, I have in-depth knowledge of linked data principles and tools (RDF, OWL, SPARQL) and of NLP with SOTA deep learning.
I just finished a one-year internship at a software company, where we explored question answering over building graphs. Here you can find a short video of the demo app I developed. During this internship I gained valuable insights into neural/semantic search, and hands-on experience with TensorFlow and HuggingFace.
My key motivation here is to give back to the community, as I’ve been using tons of open-source modules in my work. Here’s an example notebook where we experimented with RDF2Vec. That said, I have not been active in the open-source community so far.
I found this project particularly appealing, as in my PhD we investigated exactly the same idea: architects and mechanical/civil engineers need an intuitive NL-based interface to perform information retrieval from linked data graphs; they will NOT use SPARQL!
Dear mentors, I’ll start with the warm-ups ASAP. Please let me know if you have any questions!