Description
The goal of this project is to extend the DBpedia extraction framework with a new component that enables the extraction of historical revisions of Wikipedia articles, in addition to the current state of the articles. Since the DBpedia extraction mainly focuses on infobox data, we want to be able to follow how the graph resulting from it is updated over time. The information extracted from the historical revisions can provide valuable insights into the evolution of the articles and the changes made over time.
Objectives: The objectives of this project are as follows:
- Develop a module for the DBpedia extraction framework that allows the extraction of historical revisions of Wikipedia articles.
- Implement a version control system to store and manage the extracted revisions, where each article revision is stored either in its own named graph or as RDF-Star annotations (see the sketch after this list).
- Design and implement an interface for users to query and access the extracted historical revisions.
- Evaluate the performance and accuracy of the new component on a large-scale dataset.
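As a minimal sketch of the named-graph option, the following Scala snippet uses Apache Jena to store the triples of one extracted revision in a dedicated named graph. The resource, property, revision id, and graph naming scheme are illustrative assumptions, not part of the framework:

```scala
import org.apache.jena.query.DatasetFactory
import org.apache.jena.rdf.model.ModelFactory

object RevisionGraphSketch {
  def main(args: Array[String]): Unit = {
    // One dataset holding one named graph per extracted article revision.
    val dataset = DatasetFactory.create()

    // Triples extracted from a hypothetical revision of the Berlin article.
    val model = ModelFactory.createDefaultModel()
    model.add(
      model.createResource("http://dbpedia.org/resource/Berlin"),
      model.createProperty("http://dbpedia.org/ontology/populationTotal"),
      model.createTypedLiteral(3769495)
    )

    // The graph name encodes the article and a hypothetical revision id,
    // so each revision of the extracted graph stays individually addressable.
    dataset.addNamedModel("http://dbpedia.org/revision/Berlin/1057284731", model)
  }
}
```

With RDF-Star, the same provenance could instead be attached to individual triples as annotations, trading graph-level granularity for statement-level metadata.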
Methodology: The project will be divided into several phases:
- Design and planning: In this phase, the requirements and specifications for the new component will be defined, and a project plan will be developed.
- Implementation: In this phase, the module for extracting historical revisions will be developed, along with the version control system, the reconciliation mechanism, and the query interface.
- Testing and evaluation: In this phase, the performance and accuracy of the new component will be evaluated using a large-scale dataset of Wikipedia revisions.
Deliverables: The deliverables of the project will be:
- A working extractor (or extension of the live module) for the DBpedia extraction framework that allows the extraction of historical graph revisions of Wikipedia articles.
- An interface for users to query and access the extracted historical revisions, either by maintaining a SPARQL endpoint or by providing a REST API (see the example query after this list).
- A report summarizing the project’s objectives, methodology, results, and conclusions.
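To illustrate what such a query interface could expose, the sketch below runs a SPARQL query over per-revision named graphs with Jena, retrieving every recorded value of a property across revisions. It assumes the illustrative graph naming scheme from the objectives sketch and an in-memory dataset standing in for the real store:

```scala
import org.apache.jena.query.{DatasetFactory, QueryExecutionFactory, QueryFactory}

object RevisionQuerySketch {
  def main(args: Array[String]): Unit = {
    // In a real deployment this dataset would be populated by the extractor.
    val dataset = DatasetFactory.create()

    // List every revision graph in which a population value was extracted.
    val query = QueryFactory.create(
      """SELECT ?graph ?population WHERE {
        |  GRAPH ?graph {
        |    <http://dbpedia.org/resource/Berlin>
        |      <http://dbpedia.org/ontology/populationTotal> ?population .
        |  }
        |}""".stripMargin)

    val exec = QueryExecutionFactory.create(query, dataset)
    try {
      val results = exec.execSelect()
      while (results.hasNext) {
        val row = results.next()
        println(s"${row.getResource("graph")} -> ${row.getLiteral("population")}")
      }
    } finally exec.close()
  }
}
```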
Timeline: The project will take approximately 12 weeks to complete, with the following timeline:
- Week 1-2: Design and planning.
- Week 3-8: Implementation.
- Week 9-10: Testing and evaluation.
- Week 11-12: Report writing and finalization.
Skills Required: The following skills are required for this project:
- Proficiency in programming languages such as Java, Scala, and Python.
- Familiarity with version control systems such as Git.
- Experience with database management and query languages such as SPARQL.
- Familiarity with the DBpedia extraction framework and related technologies.
The proposed project will extend the DBpedia extraction framework with a new component that allows the extraction of historical revisions of Wikipedia articles, providing valuable insights into the evolution of the articles over time. The project will require skills in programming (Java, Scala, Python), API management, Semantic Web technologies, database management, and query languages, and will be conducted over a 12-week period. The deliverables will include a working module, a version control system, a reconciliation mechanism, a query interface, and a report summarizing the project’s objectives, methodology, results, and conclusions.
Warm up tasks
- Execute a DIEF test extraction
- Try the Wikimedia revision API (see the sketch after this list)
- Read DBpedia papers 1,2,3
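As a starting point for the revision API task, here is a minimal sketch that lists the five most recent revisions of an article via the public MediaWiki API, using only the Scala standard library. The article title is an arbitrary example; a real client would add a proper HTTP library, JSON parsing, and continuation handling:

```scala
import scala.io.Source

object RevisionApiSketch {
  def main(args: Array[String]): Unit = {
    // MediaWiki API call returning the ids and timestamps of the five
    // most recent revisions of the "Berlin" article, as JSON.
    val url = "https://en.wikipedia.org/w/api.php" +
      "?action=query&prop=revisions&titles=Berlin" +
      "&rvlimit=5&rvprop=ids%7Ctimestamp&format=json"

    val response = Source.fromURL(url)
    try println(response.mkString)
    finally response.close()
  }
}
```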
Mentors
Marvin Hofer
Celian Ringwald
Mykola Medynskyi
Keywords
DBpedia, Knowledge Extraction, Temporal KG