DESCRIPTION
DBpedia plans to release data much more frequently in the near future, using a new, improved release pipeline. As work on the extraction framework continues, the content of the released data will change over time. To ensure the consistency and completeness of new dataset releases, verification measures have to be designed and implemented. This starts with simple checks over file names and sizes and extends to more complicated tasks, such as tracking and documenting the path of triples through the DBpedia extraction framework or logging the impact of specific mapping changes.
Goal
The goal of this task is to improve the data quality of the continuously released DBpedia datasets by detecting erroneous changes in the extraction and release process. The result is a verification pipeline that compares the results and processes of previous and upcoming DBpedia releases.
Warm-up task:
Download the latest DBpedia release (2016-10) and the current pre-release (2019-08-30) and implement a simple check over file names and sizes to verify the completeness of the pre-release. The process should log any file of the 2016-10 release that cannot be matched to a file in the pre-release. Matched files should be compared by size and flagged when the pre-release file is more than 80% smaller or more than 200% larger than its 2016-10 counterpart (i.e., outside 0.2 to 3 times the original size).
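A minimal Python sketch of such a check is shown below. It assumes both releases have already been downloaded into local directories (the paths are hypothetical) and that corresponding files share the same relative path; in practice the naming scheme may differ between releases, so an additional name-mapping step could be needed.

```python
# Sketch of the warm-up check: compare two local release directories
# by file name and size. Paths below are hypothetical.
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

OLD_RELEASE = Path("releases/2016-10")      # hypothetical local path
NEW_RELEASE = Path("releases/2019-08-30")   # hypothetical local path


def list_files(root: Path) -> dict:
    """Map each file's path relative to root to its size in bytes."""
    return {p.relative_to(root): p.stat().st_size
            for p in root.rglob("*") if p.is_file()}


def verify(old_root: Path, new_root: Path) -> None:
    old_files = list_files(old_root)
    new_files = list_files(new_root)

    for rel_path, old_size in sorted(old_files.items()):
        new_size = new_files.get(rel_path)
        if new_size is None:
            # File from the 2016-10 release has no match in the pre-release.
            logging.warning("unmatched: %s", rel_path)
        elif old_size > 0 and not (0.2 * old_size <= new_size <= 3.0 * old_size):
            # Outside the tolerated range: more than 80% smaller (below 0.2x)
            # or more than 200% larger (above 3x) than the old file.
            logging.warning("size mismatch: %s (%d -> %d bytes)",
                            rel_path, old_size, new_size)


if __name__ == "__main__":
    verify(OLD_RELEASE, NEW_RELEASE)
```

Logging unmatched and mismatched files (rather than failing immediately) keeps the full report available for manual review, which matches the task's emphasis on documenting rather than blocking the release.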
Mentors
Jan Forberg
Keywords
Data Quality