Towards a Neural Extraction Framework - GSoC2021

tsoru · February 19, 2021, 7:16pm

Description

Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity is semantically connected to 299 base entities.

However, only 9 of the 299 base entities are also linked from :Berlin_Wall via another predicate. This suggests that, in the large majority of cases, it is not clear what kind of relationship exists between the entities. In other words, DBpedia does not know which specific RDF predicate links the subject (in our case, :Berlin_Wall) to most of the objects above.
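
The gap described above can be measured with two SPARQL count queries. The following is a minimal sketch, assuming the public endpoint at https://dbpedia.org/sparql and its `format` request parameter; the helper names are hypothetical, and the counts returned today may differ from the 9 and 299 quoted above.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # assumption: public DBpedia endpoint
DBR = "http://dbpedia.org/resource/"
PREFIX = "PREFIX dbo: <http://dbpedia.org/ontology/>\n"


def wikilink_query(entity: str) -> str:
    """Count every entity that `entity` points to via dbo:wikiPageWikiLink."""
    return PREFIX + (
        "SELECT (COUNT(DISTINCT ?o) AS ?n) WHERE {\n"
        f"  <{DBR}{entity}> dbo:wikiPageWikiLink ?o .\n"
        "}"
    )


def typed_link_query(entity: str) -> str:
    """Count linked entities that are also reached via some other predicate."""
    return PREFIX + (
        "SELECT (COUNT(DISTINCT ?o) AS ?n) WHERE {\n"
        f"  <{DBR}{entity}> dbo:wikiPageWikiLink ?o .\n"
        f"  <{DBR}{entity}> ?p ?o .\n"
        "  FILTER (?p != dbo:wikiPageWikiLink)\n"
        "}"
    )


def run_count(query: str) -> int:
    """Send the query over HTTP and return the single count (needs network)."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(f"{ENDPOINT}?{params}") as resp:
        data = json.load(resp)
    return int(data["results"]["bindings"][0]["n"]["value"])
```

Calling `run_count(wikilink_query("Berlin_Wall"))` and `run_count(typed_link_query("Berlin_Wall"))` would give the two numbers to compare; the difference is the set of links whose predicate is still unknown.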

Currently, such relationships are extracted from tables and the infobox (usually found at the top right of a Wikipedia article) via the Extraction Framework. Instead of extracting RDF triples from semi-structured data only, we want to leverage information found in the entirety of a Wikipedia article, including page text.

Goal

The goal of this project is to develop a framework for predicate resolution of wiki links among entities. The student may choose to focus on a specific kind of relationship:

Causality. The direct cause-effect relationship between events, e.g., from the text

The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.

extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification

Issuance. An abstract entity assigned to some agent, e.g., from the text

Messi won the award, his second consecutive Ballon d’Or victory.

extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi

Any other direct relationship which is not found in DBpedia.
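
The two example extractions above can be written down as RDF triples. Below is a small sketch, assuming the standard DBpedia namespaces for resources and ontology properties; the `Triple` class is a hypothetical helper for illustration, not part of the Extraction Framework.

```python
from dataclasses import dataclass

DBR = "http://dbpedia.org/resource/"  # namespace behind the ":" entities
DBO = "http://dbpedia.org/ontology/"  # namespace behind the "dbo:" prefix


@dataclass(frozen=True)
class Triple:
    subject: str    # DBpedia resource name, e.g. "Peaceful_Revolution"
    predicate: str  # DBpedia ontology property, e.g. "effect"
    obj: str        # DBpedia resource name of the linked entity

    def to_ntriples(self) -> str:
        """Serialise the triple as one N-Triples line."""
        return f"<{DBR}{self.subject}> <{DBO}{self.predicate}> <{DBR}{self.obj}> ."


examples = [
    Triple("Peaceful_Revolution", "effect", "German_reunification"),
    Triple("2010_FIFA_Ballon_d'Or", "recipient", "Lionel_Messi"),
]

for triple in examples:
    print(triple.to_ntriples())
```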

Material

The student may use any Python deep learning framework and/or any existing tool for semantic parsing; a neural approach is not strictly mandatory. The following resources are recommended for use in the project.

Python Wikipedia makes it easy to access and parse data from Wikipedia.
DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be available to the student (e.g., given an entity, list all strings that point to it).
FRED is a tool for automatically producing RDF/OWL ontologies and linked data from natural language sentences. Unfortunately, an API key is required and the number of daily requests is limited, but it is a very inspiring project.
An example of an excellent proposal that was accepted a few years ago. Please mind that the total number of project hours has changed from 300 to 175.
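
The two lookup directions mentioned in the list above (string to candidate entities, entity to strings) can be sketched as a pair of in-memory indexes. This is an illustration only: the sample pairs are made up, and the real Anchor text dataset and Lookup service have their own formats and APIs, which are not reproduced here.

```python
from collections import defaultdict

# Hypothetical (anchor text, target entity) pairs, invented for illustration.
SAMPLE_LINKS = [
    ("Berlin Wall", "Berlin_Wall"),
    ("the Wall", "Berlin_Wall"),
    ("Messi", "Lionel_Messi"),
    ("Lionel Messi", "Lionel_Messi"),
]


def build_indexes(links):
    """Return (text -> entities, entity -> texts) lookup tables."""
    by_text = defaultdict(set)
    by_entity = defaultdict(set)
    for text, entity in links:
        by_text[text.lower()].add(entity)  # case-insensitive string lookup
        by_entity[entity].add(text)        # keep the surface forms as written
    return by_text, by_entity


by_text, by_entity = build_indexes(SAMPLE_LINKS)
print(sorted(by_text["messi"]))          # entities a string may refer to
print(sorted(by_entity["Berlin_Wall"]))  # strings that point to an entity
```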

Impact

This project will potentially generate millions of new statements. This new information could be released by DBpedia to the public as part of a new dataset. The creation of a neural extraction framework could introduce the use of robust parsers for a more accurate extraction of Wikipedia content.

Warm-up tasks

Get familiar with SPARQL on the DBpedia endpoint.
Run a local DBpedia Virtuoso endpoint.
Understand the science behind relation extraction.
Use the FRED interface to parse a sentence from Wikipedia and analyse the generated graph.
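
For the first warm-up task, it helps to know the shape of what a SPARQL endpoint returns. The sketch below parses a response in the standard SPARQL JSON results format into plain Python rows; the sample response is hand-written for illustration (so the code runs offline) and the `rows` helper is hypothetical.

```python
import json

# Hand-written sample in the SPARQL JSON results format (illustrative values).
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["p", "o"]},
  "results": {"bindings": [
    {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageWikiLink"},
     "o": {"type": "uri", "value": "http://dbpedia.org/resource/East_Germany"}}
  ]}
}
"""


def rows(response_text):
    """Flatten each binding into a {variable: value} dictionary."""
    data = json.loads(response_text)
    variables = data["head"]["vars"]
    return [
        # A variable may be unbound in a given row, hence the membership check.
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]


for row in rows(SAMPLE_RESPONSE):
    print(row["p"], row["o"])
```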

Mentors

Tommaso Soru @tsoru, Zheyuan Bai @baizydl, TBD

gaurav · March 10, 2021, 7:26am

Hello mentors, I am Gaurav, a sophomore from SVNIT, Surat, India, and the project idea seems very interesting to me. I will surely do my part of the research and would love to connect with everyone to share my doubts. I would also like to evaluate how my skills in ML/DL and Python will be relevant to this project idea. Thank you!!

tsoru · March 10, 2021, 2:34pm

Hi @gaurav and welcome. Glad you’re interested in the project.

Have you already gone through the warm-up tasks?

gaurav · March 10, 2021, 6:38pm

I will get started with those. I am also looking into the materials you have shared to get familiar with the idea and the logic that best serves the purpose ;).

tsoru · March 11, 2021, 2:40pm

Good. Feel free to ask any questions that might arise.

gaurav · March 12, 2021, 4:52am

Hey @tsoru, I have done the first and second warm-up tasks. I took some time to understand what the terminology actually means, because I am a beginner with databases, and I would like to know whether I am understanding the goal of the project correctly.

So, we need to implement an extraction framework that will extract triples from the entire article and not just semi-structured data and establish all the direct relationships which are not yet there in DBpedia.

I also wanted to know the constraints on the tech stack, because the current extraction framework is written in Scala and what I know best is Python for neural networks; I also have some experience with Java, JS, C and C++. So, I wanted to know if you have decided on the specific tools and stacks that should be used for development.

gaurav · March 17, 2021, 11:06am

@tsoru, am I supposed to submit any coding task as part of the proposal? I am done with the warm-up tasks and was wondering whether there are any more prerequisites to work on, or whether I can try implementing a few simple models in Python for a start.

fauzank339 · March 18, 2021, 5:18pm

Hello Mentors, I am Abdullah, an engineering grad student from Amity University, Noida. I really like this project idea and would like to be a part of this project. I intend to begin with the warm up problems.

aakriti23 · March 19, 2021, 10:00am

Hello @tsoru!

My name is Aakriti Jain, and I am currently pursuing Data Science at RWTH Aachen University. Having taken a course on the Semantic Web last semester, I am well versed in RDF and in languages such as Turtle and SPARQL. I am intrigued by this project as we are to use techniques of ML combined with the concepts of the Semantic Web to come up with a more accurate framework.

From what I understand, the goal of this project is to utilise the content of several Wikipedia articles to determine the relationships between subject and predicate, instead of simply extracting RDF triplets?

tsoru · March 19, 2021, 4:19pm

A GSoC project is supposed to be only 175 hours long, therefore for new ideas such as this one, we just require a prototype. As you already pointed out, the recommended language for deep learning (and prototyping in general) is Python, but we don’t require it. Students are free to decide what to use in the rest of the stack.

The coding tasks are only for you to warm up; however, the proposal would greatly benefit if you included your findings and considerations about those simple models you plan on implementing.

tsoru · March 19, 2021, 4:20pm

Hi @fauzank339 and welcome!

tsoru · March 19, 2021, 4:33pm

Hi @aakriti23 and welcome! The current extraction framework does already utilise the content of Wikipedia articles, but just at given locations (e.g., the top right infobox). Instead of that, we want to exploit the article’s text to extract RDF triples, where a triple is a relationship between the corresponding entity and another entity in Wikipedia/DBpedia.
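
One way to picture the first step of this is candidate generation: find the sentences of an article that mention a linked entity, so that a later model can decide which predicate (if any) holds. The sketch below is a naive, non-neural illustration under stated assumptions: sentence splitting is a simple regular expression, mentions are found by substring match, and the function name and sample anchors are hypothetical.

```python
import re


def candidate_pairs(subject, text, anchors):
    """Yield (subject, object, sentence) for each sentence mentioning an anchor.

    `anchors` maps a surface string to the entity it links to.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        lowered = sentence.lower()
        for anchor, entity in anchors.items():
            if entity != subject and anchor.lower() in lowered:
                yield subject, entity, sentence


text = (
    "Messi won the award, his second consecutive Ballon d'Or victory. "
    "This sentence mentions no other entity."
)
anchors = {"Messi": "Lionel_Messi"}  # hypothetical anchor-to-entity mapping

pairs = list(candidate_pairs("2010_FIFA_Ballon_d'Or", text, anchors))
for subj, obj, sentence in pairs:
    print(subj, "?", obj, "<-", sentence)
```

Each yielded pair is only a candidate; resolving the actual predicate (for instance dbo:recipient here) is the part the project is meant to address.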

ishank_agrawal · March 22, 2021, 5:08pm

Hi @tsoru.
I am Ishank, an undergraduate student from the Indian Institute of Technology, Delhi. I have done a course on Information Retrieval and a project involving Q&A over the YAGO knowledge base. I find this problem interesting and will try my best to come up with a unique and effective approach for solving it.

tsoru · March 27, 2021, 8:17pm

Hi @ishank_agrawal, welcome to the forum!

tsoru · March 27, 2021, 9:57pm

To @gaurav @fauzank339 @aakriti23 @ishank_agrawal and everyone else interested in the project.

On the 29th of March, the Google Summer of Code website will open the submission window for your proposals, which will remain open until the 13th of April.

Please draft a Google Doc along the lines of this example of an excellent proposal and share it in editor mode with me (mommi84 at gmail dot com). The other co-mentors and I will try to help you prepare your project proposal.

Important: Please mind that the total number of project hours has changed from 300 to 175. We do not expect your project to be as extensive as in the previous editions.

zoelevert · April 6, 2021, 4:18pm

Hello mentors, thanks for your kind reminder.

tsoru · April 6, 2021, 6:48pm

Hi @zoelevert and welcome!

zoelevert · April 8, 2021, 8:45pm

Hello @tsoru, I shared my project plan today. Please check if you can access it when you are free.