Towards a Neural Extraction Framework - GSoC2022

tsoru · March 26, 2022, 3:12pm

This project started in 2021 and is looking forward to its second participation in DBpedia’s GSoC.

Description

Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity is semantically connected to 299 base entities.

However, only 9 of these 299 base entities are also linked from :Berlin_Wall via another predicate. This suggests that in the large majority of cases, it is not clear what kind of relationship exists between the entities. In other words, DBpedia does not know what specific RDF predicate links the subject (in our case, :Berlin_Wall) to any of the objects above.
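The gap described above can be sketched in a few lines of Python. This is only an illustration, not part of the Extraction Framework: the helper names `unresolved_links` and `count_query` are hypothetical, and the toy triples are made up for the example rather than taken from DBpedia. Only the `dbo:` and `dbr:` namespaces and the dbo:wikiPageWikiLink property are real.

```python
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
WIKILINK = DBO + "wikiPageWikiLink"


def unresolved_links(triples, subject):
    """Return objects wiki-linked from `subject` that no other predicate reaches."""
    linked = {o for s, p, o in triples if s == subject and p == WIKILINK}
    resolved = {o for s, p, o in triples if s == subject and p != WIKILINK}
    return linked - resolved


def count_query(subject_iri):
    """Build SPARQL text that counts wiki-linked objects also reached by another predicate."""
    return (
        "SELECT (COUNT(DISTINCT ?o) AS ?n) WHERE { "
        f"<{subject_iri}> <{WIKILINK}> ?o . "
        f"<{subject_iri}> ?p ?o . "
        f"FILTER(?p != <{WIKILINK}>) }}"
    )


# Toy data (illustrative only, not actual DBpedia statements).
toy = [
    (DBR + "Berlin_Wall", WIKILINK, DBR + "East_Germany"),
    (DBR + "Berlin_Wall", WIKILINK, DBR + "Peaceful_Revolution"),
    (DBR + "Berlin_Wall", DBO + "location", DBR + "East_Germany"),
]

print(unresolved_links(toy, DBR + "Berlin_Wall"))
print(count_query(DBR + "Berlin_Wall"))
```

The query text produced by `count_query` could be pasted into the public DBpedia SPARQL endpoint; the entities returned by `unresolved_links` are the ones whose predicate this project would try to resolve.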

Currently, such relationships are extracted from tables and the infobox (usually found at the top right of a Wikipedia article) via the Extraction Framework. Instead of extracting RDF triples from semi-structured data only, we want to leverage information found in the entirety of a Wikipedia article, including page text.

The repository where all source code will be stored is the following:

Goal

The goal of this project is to develop a framework for predicate resolution of wiki links among entities. The student may choose to focus on a specific kind of relationship:

Causality. (Addressed during GSoC 2021, but not completely.) The direct cause-effect relationship between events, e.g., from the text

The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.

extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification

Issuance. An abstract entity assigned to some agent, e.g., from the text

Messi won the award, his second consecutive Ballon d’Or victory.

extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi

Any other direct relationship which is not found in DBpedia.
Bonus points for a more general solution that targets multiple relationships at once.
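As a rough illustration of what the output of such a framework might look like, the two example extractions above can be written out as N-Triples lines. The helper `to_ntriple` is a hypothetical name introduced here, and the sketch assumes that the ":" prefix in the examples stands for the DBpedia resource namespace.

```python
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# Characters that may not appear unescaped inside an N-Triples IRI.
_FORBIDDEN = set('<>"{}|^`\\ ')


def to_ntriple(subject, predicate, obj):
    """Serialize one (resource, ontology property, resource) statement as N-Triples."""
    for name in (subject, predicate, obj):
        if _FORBIDDEN & set(name):
            raise ValueError(f"illegal character in local name: {name!r}")
    return f"<{DBR}{subject}> <{DBO}{predicate}> <{DBR}{obj}> ."


# The two extractions given as examples in the project description.
examples = [
    ("Peaceful_Revolution", "effect", "German_reunification"),
    ("2010_FIFA_Ballon_d'Or", "recipient", "Lionel_Messi"),
]

for s, p, o in examples:
    print(to_ntriple(s, p, o))
```

A real pipeline would more likely rely on an RDF library for serialization; plain string formatting is used here only to keep the sketch dependency-free.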

Material

The student may use any Python deep learning framework and/or existing tool. The following resources are recommended for use in the project.

Python Wikipedia makes it easy to access and parse data from Wikipedia.
Huggingface Transformers for Natural Language Inference can be extremely useful to extract structured knowledge from text or perform zero-shot classification.
DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be available to the student (e.g., given an entity, list all strings that point to it).
FRED is a tool for automatically producing RDF/OWL ontologies and linked data from natural language sentences. Unfortunately, an API key is required and the number of daily requests is limited, but it is a very inspiring project.
An example of an excellent proposal that was accepted a few years ago.
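The two lookup directions mentioned above (string to candidate entities, entity to surface strings) can be pictured with a small in-memory index. This sketch does not call DBpedia Lookup or read the Anchor text dataset; `build_indexes` is a hypothetical helper and the pairs are toy data invented for the example.

```python
from collections import defaultdict


def build_indexes(anchor_pairs):
    """Build both lookup directions from (anchor text, entity) pairs."""
    by_text = defaultdict(set)    # lowercased string -> candidate entities
    by_entity = defaultdict(set)  # entity -> strings that point to it
    for text, entity in anchor_pairs:
        by_text[text.lower()].add(entity)
        by_entity[entity].add(text)
    return by_text, by_entity


# Toy (anchor text, entity) pairs, for illustration only.
pairs = [
    ("Berlin Wall", "Berlin_Wall"),
    ("the Wall", "Berlin_Wall"),
    ("The Wall", "The_Wall"),
]

by_text, by_entity = build_indexes(pairs)
print(by_text["the wall"])        # candidate entities for an ambiguous string
print(by_entity["Berlin_Wall"])   # surface forms that link to one entity
```

Lowercasing the key is a design choice of this sketch; it makes the ambiguity of a string such as "the Wall" visible, which is the kind of case where surrounding context is needed to pick the right entity.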

Project size

The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 300).

Impact

This project will potentially generate millions of new statements. This new information could be released by DBpedia to the public as part of a new dataset. The creation of a neural extraction framework could introduce the use of robust parsers for a more accurate extraction of Wikipedia content.

Warm-up tasks

Get familiar with SPARQL on the DBpedia endpoint.
Run a local DBpedia Virtuoso endpoint.
Understand the science behind relation extraction.
Use the FRED interface to parse a sentence from Wikipedia and analyse the generated graph.

Mentors

@tsoru, @diegomoussallem, TBD

ananyaiitbhilai · March 29, 2022, 5:27am

Hello @tsoru @diegomoussallem, I am Ananya, a CSE sophomore at IIT Bhilai, India. I recently took the DS250 course (Data Analysis and Visualisation) and learnt about various ML models, NLP, knowledge graphs, GNNs, time series, etc. I found knowledge graphs and NLP very interesting and exciting, and I want to explore these areas more. Hence I would love to contribute to the idea, Towards a Neural Extraction Framework - GSoC2022.

I did a basic analysis of the Wikipedia graph: https://github.com/Ananyaiitbhilai/Wikipedia-Graph-Analysis/tree/master.

In this project, I analyzed the Wikipedia graph for Mathematics articles related to JEE Advanced. I generated sub-graphs of Wikipedia articles that are relevant to studying for the exam. Currently I am trying to give them an order in which they should be read (a traversal algorithm). These traversals can be organized by subject, etc.

My friends and I scraped and labeled articles according to their complexity for someone who has just passed 10th class. The dataset has labels such as: {Beginner(1), Intermediate(2), Advanced(3), Irrelevant(0)}

Tasks I worked on:

Build the Wikipedia Graph along with attributes (including keywords, tags, NLP features)
Create additional features based on Graph using concepts like centrality metrics, clustering coefficient.
Development of Node classification Models
Currently doing - Graph Traversal and Article Ordering Algorithm

You can look into my GitHub Repo for more info.

jenifer · April 2, 2022, 8:36pm

I am Jenifer, a final-year master’s student in data science at the Polytechnic University of Madrid. I have been taking a course on open data and knowledge graphs, which is where I found out about DBpedia. I am interested in the field and I would like to delve deeper into it in the future. I am working at the Ontology Engineering Group for the final project. My master’s thesis is close to being done; the topic is research software classification, where I use NLP techniques to develop a flexible methodology for classifying research software. You can check out the repo at this link: https://github.com/kuefmz/software_classification

I am familiar with SPARQL, RDF and relation extraction. I have gone through the warm-up exercises and tried out FRED, which is pretty cool. I would like to explore the possibilities of the project as much as possible, so I would like to apply for a long project. What do you think, could I be a good fit for the project?

tsoru · April 3, 2022, 1:29pm

Everyone who is interested in submitting a proposal for this project please follow the steps below:

Prepare a Google Docs draft along the lines of this example of an excellent proposal that was accepted a few years ago.
Share the draft proposal with my account (mommi84 at gmail dot com).
Address the comments that the other mentors and I will leave.
Submit the proposal to the official GSoC platform.

tsoru · April 12, 2022, 11:34am

Important message to @ananyaiitbhilai @jenifer and anyone interested in the project.

We are 7 DAYS away from the contributor application deadline, and we have so far received 0 (ZERO) applications for this project, hence there is still plenty of chance to get accepted in this year’s programme.

Please follow the steps below as soon as possible if you wish to get mentors’ feedback before your submission.