This project started in 2021 and is looking forward to its third participation in DBpedia’s GSoC.
Description
Every Wikipedia article links to a number of other articles. In DBpedia, we keep track of these links through the dbo:wikiPageWikiLink property. Thanks to them, we know that the :Berlin_Wall entity (at the time of writing) is semantically connected to 299 base entities. However, only 9 of these 299 entities are also linked from :Berlin_Wall via another predicate. This suggests that, in the large majority of cases, it is not clear what kind of relationship exists between the entities. In other words, DBpedia does not know which specific RDF predicate links the subject (in our case, :Berlin_Wall) to most of the objects above.
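These counts can be verified directly on the public SPARQL endpoint. Below is a minimal sketch using the SPARQLWrapper library; the exact figures will drift as DBpedia is updated.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

PREFIXES = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
"""

def count_objects(pattern: str) -> int:
    """Count distinct ?o bindings for a graph pattern on the public endpoint."""
    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setReturnFormat(JSON)
    endpoint.setQuery(f"{PREFIXES} SELECT (COUNT(DISTINCT ?o) AS ?n) WHERE {{ {pattern} }}")
    return int(endpoint.query().convert()["results"]["bindings"][0]["n"]["value"])

# All entities reachable from :Berlin_Wall via a wiki link.
linked = count_objects("dbr:Berlin_Wall dbo:wikiPageWikiLink ?o .")

# The subset that is also connected through some predicate other than the wiki link.
typed = count_objects("""
    dbr:Berlin_Wall dbo:wikiPageWikiLink ?o .
    dbr:Berlin_Wall ?p ?o .
    FILTER (?p != dbo:wikiPageWikiLink)
""")

print(linked, typed)  # 299 and 9 at the time of writing
```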
Currently, such relationships are extracted from tables and infoboxes (usually found at the top right of a Wikipedia article) via the Extraction Framework. Instead of extracting RDF triples from semi-structured data only, we want to leverage the information found in the entirety of a Wikipedia article, including its page text.
The repository where all source code will be stored is the following:
Goal
The goal of this project is to develop a framework for predicate resolution of wiki links among entities. During GSoC 2022, we employed a suite of machine-learning models to perform joint entity-relation extraction on open-domain text. However, after we extracted entities and relations from text, only some of these entities were disambiguated (i.e., mapped to a DBpedia entity) and none of these relations were disambiguated (i.e., mapped to a DBpedia property). Now we want to be able to link each node/entity and edge/relation to a resource in the DBpedia knowledge graph.
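On the entity side, a simple disambiguation baseline is candidate retrieval through DBpedia Lookup (listed under Material below). The sketch below is hedged: the query parameters and response fields reflect the public service at the time of writing and should be checked against its documentation.

```python
import requests

def lookup_candidates(surface_form: str, max_results: int = 5) -> list[str]:
    """Return candidate DBpedia resources for a surface form via DBpedia Lookup."""
    response = requests.get(
        "https://lookup.dbpedia.org/api/search",
        # Parameter names and the JSON layout below are assumptions about the
        # current public service; adjust if the API changes.
        params={"query": surface_form, "maxResults": max_results, "format": "json"},
        timeout=10,
    )
    response.raise_for_status()
    # Each hit ("doc") carries one or more candidate resource URIs.
    return [uri
            for doc in response.json().get("docs", [])
            for uri in doc.get("resource", [])]

print(lookup_candidates("Berlin Wall"))
# e.g., ['http://dbpedia.org/resource/Berlin_Wall', ...]
```

Relation disambiguation (mapping an extracted relation to a dbo: property) has no such off-the-shelf service; a similar candidate-generation step over the ontology’s property labels is one possible approach.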
Extraction examples
Last year we adopted a more general solution that targets multiple relationships at once; however, the contributor may choose to extract complex relationships (see the zero-shot sketch after this list), such as:
- Causality. (Addressed during GSoC 2021, but not completed.) The direct cause-effect relationship between events, e.g., from the text
“The Peaceful Revolution (German: Friedliche Revolution) was the process of sociopolitical change that led to the opening of East Germany’s borders with the west, the end of the Socialist Unity Party of Germany (SED) in the German Democratic Republic (GDR or East Germany) and the transition to a parliamentary democracy, which enabled the reunification of Germany in October 1990.”
extract: :Peaceful_Revolution –––dbo:effect––> :German_reunification
- Issuance. An abstract entity assigned to some agent, e.g., from the text
“Messi won the award, his second consecutive Ballon d’Or victory.”
extract: :2010_FIFA_Ballon_d'Or –––dbo:recipient––> :Lionel_Messi
- Any other direct relationship which is not found in DBpedia.
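To give a flavor of how Natural Language Inference (see Material below) can score such relationships without task-specific training data, the sketch below runs a zero-shot classifier over the causality example; the verbalized labels and the mapping to dbo:effect are illustrative assumptions, not a fixed design.

```python
from transformers import pipeline

# Zero-shot classification backed by an NLI model.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = (
    "The Peaceful Revolution was the process of sociopolitical change "
    "that enabled the reunification of Germany in October 1990."
)
# Each candidate label is slotted into the hypothesis template; in a full
# pipeline the labels would be verbalizations of actual dbo: properties.
labels = ["caused", "prevented", "followed", "is unrelated to"]
result = classifier(
    sentence,
    labels,
    hypothesis_template="The Peaceful Revolution {} the reunification of Germany.",
)

print(result["labels"][0])  # expected: "caused", which maps to dbo:effect
```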
Material
The contributor may use any Python deep learning framework and/or existing tool. The following resources are recommended for use in the project.
- The project repository linked above and the machine-learning models mentioned in the readme files found in each GSoC folder.
- Last year’s blog post, to understand the current status of the project.
- Python Wikipedia makes it easy to access and parse data from Wikipedia (see the sketch after this list).
- Huggingface Transformers for Natural Language Inference can be extremely useful to extract structured knowledge from text or perform zero-shot classification.
- DBpedia Lookup is a service available both online and offline (e.g., given a string, list all entities that may refer to it).
- DBpedia Anchor text is a dataset containing the text and the URL of all links in Wikipedia; the indexed dataset will be made available to the contributor (e.g., given an entity, list all strings that point to it).
- An example of an excellent proposal that was accepted a few years ago.
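As a concrete starting point for the Python Wikipedia resource above, the sketch below fetches the raw text and outgoing links of an article; disambiguation and error handling are omitted.

```python
import wikipedia  # the "wikipedia" package on PyPI

# Fetch an article; auto_suggest=False avoids silently landing on another page.
page = wikipedia.page("Berlin Wall", auto_suggest=False)

print(page.content[:300])  # plain article text, the input for neural extraction
print(page.links[:10])     # linked titles, i.e., the dbo:wikiPageWikiLink targets
```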
Project size
The size of this project can be either medium or large. Please state in your proposal the number of total project hours you intend to dedicate to it (175 or 300).
Impact
This project will potentially generate millions of new statements. DBpedia could release this new information to the public as part of a new dataset. Moreover, a neural extraction framework could introduce robust parsers for a more accurate extraction of Wikipedia content.
Warm-up tasks
- Get familiar with SPARQL on the DBpedia endpoint.
- Run a local DBpedia Virtuoso endpoint (see the smoke test after this list).
- Understand the science behind relation extraction.
- Run and understand the pipeline implemented in the Jupyter notebook.
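For the first two tasks, the same client code used against the public endpoint works on a local instance; below is a minimal smoke test, assuming Virtuoso’s default HTTP port 8890.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Default port 8890 is an assumption; adjust to your local installation.
local = SPARQLWrapper("http://localhost:8890/sparql")
local.setReturnFormat(JSON)
local.setQuery("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
print(local.query().convert()["results"]["bindings"][0]["triples"]["value"])
```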
Mentors
@tsoru, @zoelevert, @diegomoussallem, TBD