This project started in 2018 as ‘A Neural QA Model for DBpedia’ and is now looking forward to its 7th year at Google Summer of Code after a three-year hiatus.
Introduction
Neural SPARQL Machines (NSpM) pioneered end-to-end approaches for answering questions posed by users who are not versed in writing SPARQL queries. This project takes that vision further with an agentic architecture.
Currently, billions of relationships on the Web are expressed in the RDF format. Accessing such data is difficult for a lay user, who does not know how to write a SPARQL query. This GSoC project consists of building an agentic question answering system over DBpedia, where an LLM-based agent can autonomously plan and execute queries by leveraging a set of tools — including entity linking indexes, ontology indexes, the DBpedia SPARQL endpoint, and other retrieval mechanisms — to answer natural-language questions (as of now restricted to English).
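To make the gap between a question and a query concrete, here is a minimal sketch of what "What is the capital of Germany?" might look like once translated to SPARQL and encoded for the public DBpedia endpoint. The helper names (`build_capital_query`, `build_request_url`) are illustrative assumptions, not part of any existing project code, and the sketch only builds the request without sending it.

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint; the helpers below are illustrative only.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_capital_query(country_iri: str) -> str:
    """Return a SPARQL query asking for the capital of the given country."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?capital WHERE {\n"
        f"  <{country_iri}> dbo:capital ?capital .\n"
        "}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET URL that asks for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


query = build_capital_query("http://dbpedia.org/resource/Germany")
print(query)
print(build_request_url(query))
```

An agent would need to produce the entity IRI and the property on its own, which is where the entity linking and ontology indexes come in.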
Documentation
Related work
The first three papers introduce and elaborate on Neural SPARQL Machines. The third was carried out by our GSoC 2019 student and published at KGSWC 2020. The fourth paper is an almost-complete survey of related approaches.
- SPARQL as a Foreign Language
- Neural Machine Translation for Query Construction and Composition
- Exploring Sequence-to-Sequence Models for SPARQL Pattern Composition
- Introduction to Neural Network based Approaches for Question Answering over Knowledge Graphs
GSoC Blogs
You may also check which problems past GSoC contributors worked on:
- [GSoC 2018] Aman’s Blog — building raw templates
- [GSoC 2019] Anand’s Blog — automating template creation
- [GSoC 2020] Zheyuan’s Blog — paraphrasing questions
- [GSoC 2021] Siddhant’s Blog — data augmentation
- [GSoC 2022] Saurav’s Blog — refining template discovery
- [GSoC 2023] Mehrzad’s Blog — fine-tuning code LLMs
Warm-up tasks
- Read the Medium post What is a Neural SPARQL Machine? to get a general idea about NSpM.
- Read through the most recent blogs and the reading list to understand the project and its current state.
- Understand the entity linking service, which maps strings to lists of candidate entities ranked by confidence value.
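As a rough mental model of the entity linking step, the sketch below maps a mention string to candidate entities ordered by confidence. The in-memory `TOY_INDEX`, its scores, and the `link_entities` function are hypothetical stand-ins for illustration; they do not reflect the interface or the data of the actual service.

```python
from typing import Dict, List, Tuple

# Hypothetical toy index: surface form -> candidate (IRI, confidence) pairs.
# The scores are made up for illustration and do not come from DBpedia.
TOY_INDEX: Dict[str, List[Tuple[str, float]]] = {
    "paris": [
        ("http://dbpedia.org/resource/Paris,_Texas", 0.07),
        ("http://dbpedia.org/resource/Paris", 0.90),
        ("http://dbpedia.org/resource/Paris_Hilton", 0.03),
    ],
}


def link_entities(mention: str, top_k: int = 2) -> List[Tuple[str, float]]:
    """Return up to top_k candidates for a mention, highest confidence first."""
    candidates = TOY_INDEX.get(mention.strip().lower(), [])
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


print(link_entities("Paris"))
```

Returning several ranked candidates rather than a single best match lets a downstream agent fall back to the next candidate when a query built from the first one returns nothing.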
Your proposal
Now that you have a good understanding of the current state of the project, we ask you to write your own proposal. The core challenge is designing an agent that can reliably answer natural-language questions over DBpedia by selecting and composing the right tools — entity linking, ontology lookup, SPARQL query construction and execution, result validation, and so on.
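One possible way to think about tool composition is a registry of named tools that each read and extend a shared state. The sketch below is only an assumption about how such a design could look: the tools are stubs with hard-coded outputs, and the fixed `plan` list stands in for the decisions an LLM-based planner would make at run time.

```python
from typing import Callable, Dict, List

State = Dict[str, str]


# All tools below are hypothetical stubs; a real agent would call an entity
# linker, an ontology index, and the DBpedia SPARQL endpoint instead.
def link_entity(state: State) -> State:
    state["entity"] = "http://dbpedia.org/resource/Germany"  # stubbed lookup
    return state


def lookup_property(state: State) -> State:
    state["property"] = "http://dbpedia.org/ontology/capital"  # stubbed lookup
    return state


def build_query(state: State) -> State:
    state["query"] = (
        f"SELECT ?x WHERE {{ <{state['entity']}> <{state['property']}> ?x }}"
    )
    return state


TOOLS: Dict[str, Callable[[State], State]] = {
    "link_entity": link_entity,
    "lookup_property": lookup_property,
    "build_query": build_query,
}


def run_plan(question: str, plan: List[str]) -> State:
    """Run the named tools in order, passing the shared state along."""
    state: State = {"question": question}
    for tool_name in plan:
        if tool_name not in TOOLS:
            raise KeyError(f"Unknown tool: {tool_name}")
        state = TOOLS[tool_name](state)
    return state


# A hard-coded plan; an LLM-based agent would choose these steps itself.
plan = ["link_entity", "lookup_property", "build_query"]
result = run_plan("What is the capital of Germany?", plan)
print(result["query"])
```

Keeping every tool behind the same signature makes it easy to add execution or validation steps to the registry without changing the loop.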
You are free to choose the LLM backbone, the agent framework (e.g., LangChain, LlamaIndex, custom), and the tool set. You may propose additional tools beyond those listed above, and you are encouraged to evaluate your system against existing QA benchmarks such as QALD.
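For evaluation, QA benchmarks over knowledge graphs commonly compare the set of answers a system returns with a gold answer set for each question. The sketch below computes precision, recall, and F1 per question and averages F1 across questions. The function names and the treatment of empty answer sets are assumptions made for illustration; the exact conventions of a given benchmark edition should be checked before reporting numbers.

```python
from typing import List, Set, Tuple


def precision_recall_f1(gold: Set[str], predicted: Set[str]) -> Tuple[float, float, float]:
    """Compare one predicted answer set against the gold answer set."""
    if not gold and not predicted:
        # Assumption: an empty prediction for an empty gold set counts as correct.
        return 1.0, 1.0, 1.0
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(pairs: List[Tuple[Set[str], Set[str]]]) -> float:
    """Average the per-question F1 over (gold, predicted) pairs."""
    if not pairs:
        return 0.0
    return sum(precision_recall_f1(g, p)[2] for g, p in pairs) / len(pairs)


examples = [
    ({"dbr:Berlin"}, {"dbr:Berlin"}),          # fully correct
    ({"dbr:A", "dbr:B"}, {"dbr:A", "dbr:C"}),  # half right
    ({"dbr:X"}, set()),                        # no answer returned
]
print(macro_f1(examples))
```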
Project size
The size of this project can be either medium or large. Please state in your proposal the total number of project hours you intend to dedicate to it (175 or 300).
Mentors
@tsoru, @smilingprogrammer, @ronitblenz, @gnav
Feel free to contact us for more information. We eagerly look forward to working with you and contributing towards making data accessible to all.