Towards a Neural Extraction Framework - GSoC2021


Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity is semantically connected to 299 base entities.

However, only 9 of those 299 base entities are also linked from :Berlin_Wall via another, more specific predicate. In the large majority of cases, then, it is not clear what kind of relationship holds between the entities. In other words, DBpedia does not know which RDF predicate links the subject (in our case, :Berlin_Wall) to each of the objects above.
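The gap described above can be made concrete with a small sketch. The triples below are invented toy data (not real DBpedia content): an object counts as "untyped" when the only predicate connecting it to the subject is dbo:wikiPageWikiLink.

```python
# Toy illustration: find wiki-linked objects that lack any more specific
# predicate. The triples are invented sample data, not real DBpedia content.
from collections import defaultdict

triples = [
    (":Berlin_Wall", "dbo:wikiPageWikiLink", ":East_Germany"),
    (":Berlin_Wall", "dbo:wikiPageWikiLink", ":Cold_War"),
    (":Berlin_Wall", "dbo:wikiPageWikiLink", ":Peaceful_Revolution"),
    (":Berlin_Wall", "dbo:location", ":East_Germany"),  # one typed link
]

def untyped_links(subject, triples):
    """Return objects linked from `subject` only via dbo:wikiPageWikiLink."""
    by_object = defaultdict(set)
    for s, p, o in triples:
        if s == subject:
            by_object[o].add(p)
    return {o for o, preds in by_object.items()
            if preds == {"dbo:wikiPageWikiLink"}}

print(sorted(untyped_links(":Berlin_Wall", triples)))
# [':Cold_War', ':Peaceful_Revolution']
```

Here :East_Germany is excluded because it is also reachable via dbo:location; the project's goal is to shrink the untyped set by predicting such missing predicates.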

Currently, such relationships are extracted from tables and the infobox (usually found at the top right of a Wikipedia article) via the Extraction Framework. Instead of extracting RDF triples from semi-structured data only, we want to leverage information found in the entirety of a Wikipedia article, including the page text.


The goal of this project is to develop a framework for predicate resolution of wiki links among entities. The student may choose to focus on a specific kind of relationship:

  • Causality. The direct cause-effect between events, e.g., from the text

The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.

extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification

  • Issuance. An abstract entity assigned to some agent, e.g., from the text

Messi won the award, his second consecutive Ballon d’Or victory.

extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi

  • Any other direct relationship which is not found in DBpedia.


The student may use any Python deep learning framework and/or any existing tool for semantic parsing; a neural approach, however, is not strictly mandatory. The following resources are recommended for use in the project.

  • Python Wikipedia makes it easy to access and parse data from Wikipedia.
  • DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
  • DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be available to the student (e.g., given an entity, list all strings that point to it).
  • FRED is a tool for automatically producing RDF/OWL ontologies and linked data from natural language sentences. Unfortunately, an API key is required and the number of daily requests is limited, but it is a very inspiring project.
  • An example of an excellent proposal that was accepted a few years ago. Please mind that the number of total project hours has changed from 300 to 175.


This project will potentially generate millions of new statements. This new information could be released by DBpedia to the public as part of a new dataset. The creation of a neural extraction framework could introduce the use of robust parsers for a more accurate extraction of Wikipedia content.

Warm-up tasks


Mentors: Tommaso Soru @tsoru, Zheyuan Bai @baizydl, TBD


Hello mentors, I am Gaurav, a sophomore from SVNIT, Surat, India. The project idea seems very interesting to me. I will surely do my part of the research, would love to connect with all of you to share my doubts, and would like to evaluate how my skills in ML/DL and Python will be relevant to this project idea. Thank you!!

Hi @gaurav and welcome. Glad you’re interested in the project.

Have you already gone through the warm-up tasks?


I will get started with those. I am also looking into the materials you have shared to get familiar with the idea and the logic that best serves the purpose ;).

Good. Feel free to ask any questions that might arise.

Hey @tsoru, so I have done the first and second warm-up tasks. I took some time trying to understand what the terminologies actually mean, because I am a beginner with databases, and I would like to know if I am understanding the goal of the project correctly.

So, we need to implement an extraction framework that extracts triples from the entire article, not just from semi-structured data, and establishes all the direct relationships which are not yet in DBpedia.

I also wanted to ask about constraints on the tech stack: the current extraction framework is written in Scala, and what I know best is Python for neural networks; I also have some experience with Java, JS, C, and C++. So, I wanted to know if you have decided on the tools and stacks that should specifically be used for development.

@tsoru, am I supposed to submit any coding task as part of the proposal? I am done with the warm-up tasks and was wondering whether there are any more prerequisites to work on, or whether I can try implementing a few simple models in Python as a starter.

Hello Mentors, I am Abdullah, an engineering grad student from Amity University, Noida. I really like this project idea and would like to be a part of this project. I intend to begin with the warm up problems.

Hello @tsoru!

My name is Aakriti Jain, and I am currently pursuing Data Science at RWTH Aachen University. Having taken a course on the Semantic Web last semester, I am well-versed in RDF and in languages such as Turtle and SPARQL. I am intrigued by this project, as we are to use ML techniques combined with Semantic Web concepts to come up with a more accurate framework.

From what I understand, the goal of this project is to utilise the content of entire Wikipedia articles to determine the predicates relating subjects and objects, instead of simply extracting RDF triples from semi-structured data?

A GSoC project is supposed to be only 175 hours long; therefore, for new ideas such as this one, we just require a prototype. As you already pointed out, the recommended language for deep learning (and prototyping in general) is Python, but we don’t require it. Students are free to decide what to use in the rest of the stack.

The coding tasks are only for you to warm up; however, the proposal would greatly benefit if you included your findings and considerations about the simple models you plan on implementing.


Hi @fauzank339 and welcome!

Hi @aakriti23 and welcome! The current extraction framework does already utilise the content of Wikipedia articles, but just at given locations (e.g., the top right infobox). Instead of that, we want to exploit the article’s text to extract RDF triples, where a triple is a relationship between the corresponding entity and another entity in Wikipedia/DBpedia.

Hi @tsoru.
I am Ishank, an undergraduate student at the Indian Institute of Technology, Delhi. I have done a course on Information Retrieval and a project involving Q&A over the YAGO knowledge base. I find this problem interesting and will try my best to come up with a unique and effective approach for solving it.

Hi @ishank_agrawal, welcome to the forum!

To @gaurav @fauzank339 @aakriti23 @ishank_agrawal and everyone else interested in the project.

On the 29th of March, the Google Summer of Code website will open the submission window for your proposals, which will remain open until the 13th of April.

Please draft a Google Doc along the lines of this example of an excellent proposal and share it in editor mode with me (mommi84 at gmail dot com). The other co-mentors and I will try to help you prepare your project proposal.

Important: Please mind that the number of total project hours has changed from 300 to 175. We do not expect your project to be as extensive as in the previous editions.


Hello mentors, thanks for your kind reminder.

Hi @zoelevert and welcome!

Hello @tsoru, I shared my project plan today. Please check whether you can access it when you are free.