Description
The goal of this project is to extend the DBpedia extraction framework with a new component that enables the extraction of historical revisions of Wikipedia articles, in addition to the current state of the articles. Since the DBpedia extraction mainly focuses on infobox data, we want to be able to follow how the graph resulting from it is updated over time. The information extracted from the historical revisions can provide valuable insights into the evolution of the articles and the changes made over time.
Objectives: The objectives of this project are as follows:
- Develop a module for the DBpedia extraction framework that allows the extraction of historical revisions of Wikipedia articles.
- Implement a version control system to store and manage the extracted revisions, where each article revision is stored either in its own named graph or as RDF-Star annotations (see the sketch after this list).
- Design and implement an interface for users to query and access the extracted historical revisions.
- Evaluate the performance and accuracy of the new component on a large-scale dataset.
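As a minimal sketch of the named-graph option, the following Scala snippet uses Apache Jena to store the triples of one extracted revision in a dedicated named graph. The resource, property, revision id, and graph naming scheme are illustrative assumptions, not part of the framework:

```scala
import org.apache.jena.query.DatasetFactory
import org.apache.jena.rdf.model.ModelFactory

object RevisionGraphSketch {
  def main(args: Array[String]): Unit = {
    // One dataset holding one named graph per extracted article revision.
    val dataset = DatasetFactory.create()

    // Triples extracted from a hypothetical revision of the Berlin article.
    val model = ModelFactory.createDefaultModel()
    model.add(
      model.createResource("http://dbpedia.org/resource/Berlin"),
      model.createProperty("http://dbpedia.org/ontology/populationTotal"),
      model.createTypedLiteral(3769495)
    )

    // The graph name encodes the article and a hypothetical revision id,
    // so each revision of the extracted graph stays individually addressable.
    dataset.addNamedModel("http://dbpedia.org/revision/Berlin/1057284731", model)
  }
}
```

With RDF-Star, the same provenance could instead be attached to individual triples as annotations, trading graph-level granularity for statement-level metadata.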
Methodology: The project will be divided into several phases:
- Design and planning: In this phase, the requirements and specifications for the new component will be defined, and a project plan will be developed.
- Implementation: In this phase, the module for extracting historical revisions will be developed, along with the version control system, the reconciliation mechanism, and the query interface.
- Testing and evaluation: In this phase, the performance and accuracy of the new component will be evaluated using a large-scale dataset of Wikipedia revisions.
Deliverables: The deliverables of the project will be:
- A working extractor (or extension of the live module) for the DBpedia extraction framework that allows the extraction of historical graph revisions of Wikipedia articles.
- An interface for users to query and access the extracted historical revisions, either by maintaining a SPARQL endpoint or by providing a REST API (see the example query after this list).
- A report summarizing the project’s objectives, methodology, results, and conclusions.
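To illustrate what such a query interface could expose, the sketch below runs a SPARQL query over per-revision named graphs with Jena, retrieving every recorded value of a property across revisions. It assumes the illustrative graph naming scheme from the objectives sketch and an in-memory dataset standing in for the real store:

```scala
import org.apache.jena.query.{DatasetFactory, QueryExecutionFactory, QueryFactory}

object RevisionQuerySketch {
  def main(args: Array[String]): Unit = {
    // In a real deployment this dataset would be populated by the extractor.
    val dataset = DatasetFactory.create()

    // List every revision graph in which a population value was extracted.
    val query = QueryFactory.create(
      """SELECT ?graph ?population WHERE {
        |  GRAPH ?graph {
        |    <http://dbpedia.org/resource/Berlin>
        |      <http://dbpedia.org/ontology/populationTotal> ?population .
        |  }
        |}""".stripMargin)

    val exec = QueryExecutionFactory.create(query, dataset)
    try {
      val results = exec.execSelect()
      while (results.hasNext) {
        val row = results.next()
        println(s"${row.getResource("graph")} -> ${row.getLiteral("population")}")
      }
    } finally exec.close()
  }
}
```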
Timeline: The project will take approximately 12 weeks to complete, with the following timeline:
- Week 1-2: Design and planning.
- Week 3-8: Implementation.
- Week 9-10: Testing and evaluation.
- Week 11-12: Report writing and finalization.
Skills Required: The following skills are required for this project:
- Proficiency in programming languages such as Java, Scala, and Python.
- Familiarity with version control systems such as Git.
- Experience with database management and query languages such as SPARQL.
- Familiarity with the DBpedia extraction framework and related technologies.
The proposed project will extend the DBpedia extraction framework with a new component that allows the extraction of historical revisions of Wikipedia articles, providing valuable insights into the evolution of the articles over time. The project will require skills in programming (Java, Scala, Python), API management, Semantic Web technologies, database management, and query languages, and will be conducted over a 12-week period. The deliverables will include a working module, a version control system, a reconciliation mechanism, a query interface, and a report summarizing the project’s objectives, methodology, results, and conclusions.
Warm up tasks
- Execute a DIEF test extraction
- Try the Wikimedia revision API (see the sketch after this list)
- Read DBpedia papers 1,2,3
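As a starting point for the revision API task, here is a minimal sketch that lists the five most recent revisions of an article via the public MediaWiki API, using only the Scala standard library. The article title is an arbitrary example; a real client would add a proper HTTP library, JSON parsing, and continuation handling:

```scala
import scala.io.Source

object RevisionApiSketch {
  def main(args: Array[String]): Unit = {
    // MediaWiki API call returning the ids and timestamps of the five
    // most recent revisions of the "Berlin" article, as JSON.
    val url = "https://en.wikipedia.org/w/api.php" +
      "?action=query&prop=revisions&titles=Berlin" +
      "&rvlimit=5&rvprop=ids%7Ctimestamp&format=json"

    val response = Source.fromURL(url)
    try println(response.mkString)
    finally response.close()
  }
}
```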
Mentors
Marvin Hofer
Celian Ringwald
Mykola Medynskyi
Keywords
DBpedia, Knowledge Extraction, Temporal KG