Extending the DBpedia Extraction Framework to Extract Complete Historical Wikipedia Revisions - Temporal DBpedia Extraction - GSoC 2023

Description
The goal of this project is to extend the DBpedia extraction framework with a new component that extracts the historical revisions of Wikipedia articles in addition to their current state. Because the DBpedia extraction mainly focuses on infobox data, we want to be able to follow how the graph derived from it is updated over time. The information extracted from historical revisions can provide valuable insight into how articles evolve and which changes were made when.
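To make "following the update process of the graph" concrete, here is a minimal sketch of a diff between the triples extracted from two revisions of the same article. It is not part of the extraction framework: the `diff_revisions` helper, the `GraphDelta` type, and the sample values are all hypothetical and only illustrate the idea.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]


class GraphDelta(NamedTuple):
    """Triples that appeared in, or disappeared from, the newer revision."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def diff_revisions(old: Iterable[Triple], new: Iterable[Triple]) -> GraphDelta:
    """Compare the triple sets of two revisions (hypothetical helper)."""
    old_set, new_set = frozenset(old), frozenset(new)
    return GraphDelta(added=new_set - old_set, removed=old_set - new_set)


# Illustrative, made-up infobox values for two revisions of one article.
old_rev = [
    ("dbr:Example_City", "dbo:country", "dbr:Example_Country"),
    ("dbr:Example_City", "dbo:populationTotal", "1000"),
]
new_rev = [
    ("dbr:Example_City", "dbo:country", "dbr:Example_Country"),
    ("dbr:Example_City", "dbo:populationTotal", "1200"),
]

delta = diff_revisions(old_rev, new_rev)
```

Unchanged triples appear in neither set, so an infobox edit that touches one value yields exactly one removed and one added triple.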

Objectives: The objectives of this project are as follows:

  1. Develop a module for the DBpedia extraction framework that allows the extraction of historical revisions of Wikipedia articles.
  2. Implement a version control system to store and manage the extracted revisions. Each article revision is extracted into its own named graph or represented using RDF-star.
  3. Design and implement an interface for users to query and access the extracted historical revisions.
  4. Evaluate the performance and accuracy of the new component on a large-scale dataset.
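As a rough sketch of the "one named graph per revision" option from objective 2, the triples of a revision could be written out as N-Quads, with the graph name identifying the revision. The IRI scheme in `revision_graph_iri` is an assumption made for illustration, not an existing DBpedia convention, and both function names are hypothetical.

```python
from typing import Iterable, Tuple

# (subject IRI, predicate IRI, object term). The object is expected to be
# already serialized as an N-Triples term, e.g. "<http://...>" or "\"42\"".
Triple = Tuple[str, str, str]


def revision_graph_iri(page_iri: str, revision_id: int) -> str:
    """Build a graph name for one revision (assumed, illustrative scheme)."""
    return f"{page_iri}/revision/{revision_id}"


def to_nquads(triples: Iterable[Triple], graph_iri: str) -> str:
    """Serialize triples as N-Quads lines that all share one graph name."""
    lines = [f"<{s}> <{p}> {o} <{graph_iri}> ." for s, p, o in triples]
    return "\n".join(lines)


graph = revision_graph_iri("http://dbpedia.org/resource/Example_City", 12345)
quads = to_nquads(
    [(
        "http://dbpedia.org/resource/Example_City",
        "http://dbpedia.org/ontology/country",
        "<http://dbpedia.org/resource/Example_Country>",
    )],
    graph,
)
```

The sketch does no escaping or IRI validation; a real implementation would rely on an RDF library for that.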

Methodology: The project will be divided into several phases:

  1. Design and planning: In this phase, the requirements and specifications for the new component will be defined, and a project plan will be developed.
  2. Implementation: In this phase, the module for extracting historical revisions will be developed, along with the version control system, the reconciliation mechanism, and the query interface.
  3. Testing and evaluation: In this phase, the performance and accuracy of the new component will be evaluated using a large-scale dataset of Wikipedia revisions.

Deliverables: The deliverables of the project will be:

  1. A working extractor (or extension of the live module) for the DBpedia extraction framework that allows the extraction of historical graph revisions of Wikipedia articles.
  2. An interface for users to query and access the extracted historical revisions, either a maintained SPARQL endpoint or a REST API.
  3. A report summarizing the project’s objectives, methodology, results, and conclusions.
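If revisions end up stored as named graphs, a query against the SPARQL endpoint could restrict its pattern to one revision with a `GRAPH` clause. The sketch below only builds the query string; `build_revision_query` and the graph IRI are hypothetical, and the actual storage layout is still to be decided in the project.

```python
def build_revision_query(resource_iri: str, graph_iri: str) -> str:
    """Return a SPARQL SELECT for all properties of a resource in one revision graph.

    Hypothetical helper: the IRIs are interpolated without validation,
    so they must come from a trusted source.
    """
    return (
        "SELECT ?p ?o\n"
        "WHERE {\n"
        f"  GRAPH <{graph_iri}> {{\n"
        f"    <{resource_iri}> ?p ?o .\n"
        "  }\n"
        "}"
    )


query = build_revision_query(
    "http://dbpedia.org/resource/Example_City",
    "http://dbpedia.org/resource/Example_City/revision/12345",  # assumed graph name
)
```

A REST API could wrap the same query behind a path that carries the article and revision identifiers.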

Timeline: The project will take approximately 12 weeks to complete, with the following timeline:

  1. Week 1-2: Design and planning.
  2. Week 3-8: Implementation.
  3. Week 9-10: Testing and evaluation.
  4. Week 11-12: Report writing and finalization.

Skills Required: The following skills are required for this project:

  1. Proficiency in programming languages such as Java, Scala, and Python.
  2. Familiarity with version control systems such as Git.
  3. Experience with database management and query languages such as SPARQL.
  4. Familiarity with the DBpedia extraction framework and related technologies.

The proposed project will extend the DBpedia extraction framework with a new component that extracts the historical revisions of Wikipedia articles, providing valuable insight into how the articles evolve over time. The project requires skills in programming (in particular Scala), API management, the Semantic Web, database management, and query languages, and will be conducted over a 12-week period. The deliverables will include a working module, a version control system, a reconciliation mechanism, a query interface, and a report summarizing the project’s objectives, methodology, results, and conclusions.

Warm up tasks

Mentors
Marvin Hofer
Celian Ringwald
Mykola Medynskyi

Keywords
DBpedia, Knowledge Extraction, Temporal KG

Hey,
do you have a Slack channel where we can talk further about this project?

Hello, of course we can write in Slack. I have sent you a personal message.

Hey,

Do you have the full PDF of paper 2?
I can’t afford the full PDF, and I can’t access it through my institution either.

Thanks,
Akash Kumar

Sure, here you go https://svn.aksw.org/papers/2009/ODBASE_LiveExtraction/dbpedia_live_extraction_public.pdf

Hey, is there some work left here?
Do you think I can join?

Hi, I would love to hear more about the project. Could I contact you through Slack to ask for more information and share some thoughts about the project and its mission?

Thank you very much

@Matteo
Hi, you can also contact me in the DBpedia Slack. My handle is marvinh. I am happy to talk about this topic.

I am sorry that I missed this message, as last year's GSoC was already in progress.

Thank you so much for your availability, my contact is adilardimatteo001@gmail.com
I would love to join your workspace so that I can understand the dynamics and contribute to the goal of the project.

Thank you very much