DESCRIPTION
DBpedia currently provides monthly releases, with a stable release (named “snapshot”) every three months, published on the Databus (DBpedia's publishing platform).
A DBpedia release consists of various data artifacts, such as “disambiguations”, “labels”, “infobox-properties”, or “images”, each of them produced by a different component (extractor) of the DBpedia extraction framework.
One essential extractor is the abstract text extractor, which produces RDF triples of wiki page abstracts in the form of “long-abstracts” (entire wiki page abstracts) or “short-abstracts” (limited to 400 words).
There are currently two implementations of the Abstract Extractor (PlainAbstractExtractor and HtmlAbstractExtractor), based on different Wikimedia APIs. Both suffer from similar issues:
- Extraction covers 140+ languages and takes more than 30 days, often delaying the publication of releases.
- The extraction encounters many HTTP 429 (Too Many Requests) responses from the APIs. Although a retry mechanism waits after such a response, the current page is skipped after ten retries. The number of HTTP 429 responses fluctuates each month but can result in an enormous number of lost abstracts in a release (e.g., ~455,000 missing abstracts for English in the 2021-06-01 release). A sketch of this retry-and-skip behaviour follows this list.
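To make the problem concrete, here is a minimal sketch (not the actual DIEF code; the object name and defaults are illustrative) of a fetch loop that retries on HTTP 429 with exponential backoff, honours a Retry-After header when present, and gives up after a fixed number of retries, which corresponds to a skipped page and a lost abstract in the current extractors. It uses only the JDK and the Scala standard library.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object BackoffFetch {
  /** Fetch a URL, retrying on HTTP 429 with exponential backoff.
    * Returns None when all retries are exhausted, which is roughly
    * what happens to a skipped page in the current extractors. */
  def fetch(url: String, maxRetries: Int = 10): Option[String] = {
    var attempt = 0
    while (attempt <= maxRetries) {
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestProperty("User-Agent", "dbpedia-abstract-extraction-test")
      conn.getResponseCode match {
        case 200 =>
          val body = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
          conn.disconnect()
          return Some(body)
        case 429 =>
          // Prefer the server's Retry-After hint (seconds), otherwise back off exponentially.
          val retryAfterMs = Option(conn.getHeaderField("Retry-After"))
            .flatMap(s => scala.util.Try(s.toLong * 1000L).toOption)
          conn.disconnect()
          Thread.sleep(retryAfterMs.getOrElse(math.min(60000L, 1000L << attempt)))
          attempt += 1
        case _ =>
          conn.disconnect()
          return None // other errors are not retried in this sketch
      }
    }
    None // all retries exhausted: this page's abstract would be lost
  }
}
```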
Goal
- research possible methods to generate or fetch wiki page abstract text
- compare and benchmark the investigated methods (performance and fault tolerance)
- develop a new, improved Abstract Extractor based on the gained knowledge (e.g., by using batch requests to fetch abstracts for multiple pages in one request; see the sketch after this list)
- the implementation can either be done in the current DIEF using Java/Scala or as a new external project using a different language such as Python
- consider using a fallback value from the previous release when an abstract cannot be fetched
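As a starting point for the batch-request idea, the following hypothetical sketch (object and method names are not part of DIEF) fetches plain-text intro extracts for several titles in a single call to the extracts API used by the PlainAbstractExtractor, using only the JDK and the Scala standard library. Note that the API may lower exlimit to 1 for whole-article extracts, so batching is mainly useful for intro (short-abstract) requests.

```scala
import java.net.URLEncoder
import javax.xml.parsers.DocumentBuilderFactory
import scala.io.Source

object BatchAbstractFetch {
  /** Fetch plain-text intro extracts for several page titles in one request.
    * Multiple titles are separated by '|' in the titles parameter. */
  def fetchIntros(lang: String, titles: Seq[String]): Map[String, String] = {
    val titleParam = URLEncoder.encode(titles.mkString("|"), "UTF-8")
    val url = s"https://$lang.wikipedia.org/w/api.php?format=xml&action=query" +
      s"&prop=extracts&exlimit=max&explaintext&exintro&redirects=&titles=$titleParam"
    val xml = Source.fromURL(url, "UTF-8").mkString

    // Parse the <page title="..."><extract>...</extract></page> elements of the XML response.
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new java.io.ByteArrayInputStream(xml.getBytes("UTF-8")))
    val pages = doc.getElementsByTagName("page")
    (0 until pages.getLength).flatMap { i =>
      val page = pages.item(i).asInstanceOf[org.w3c.dom.Element]
      val extracts = page.getElementsByTagName("extract")
      if (extracts.getLength > 0) Some(page.getAttribute("title") -> extracts.item(0).getTextContent)
      else None
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // One request instead of three separate ones.
    fetchIntros("en", Seq("Berlin", "Hamburg", "Munich")).foreach {
      case (title, text) => println(s"$title: ${text.take(80)}...")
    }
  }
}
```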
Impact
The new abstract extraction process should be faster and more robust.
At a minimum, it should reduce the number of abstracts lost during extraction (by reducing or eliminating HTTP 429 errors).
The provided data releases will contain more consistent data.
Warm-up tasks
- run the DBpedia Extraction Framework
  - as a local installation with minidump tests
  - using the marvin-config to start a test extraction
- investigate Wikimedia APIs
  - This URL is used in the PlainAbstractExtractor to get abstracts: https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Berlin&redirects=
  - This URL is used in the HtmlAbstractExtractor to get the content of a page: https://en.wikipedia.org/w/api.php?uselang=en&format=xml&action=parse&prop=text&pageid=3354 . After the page content is retrieved, it is parsed and the long and short abstracts are produced.
  - Another way to retrieve content from Wikipedia pages is to set up MediaWiki on your local machine, which allows making requests to a local MediaWiki API without rate limits. GitHub repository of the MediaWiki TextExtracts extension: https://github.com/wikimedia/mediawiki-extensions-TextExtracts (a sketch that builds both request types, and can target a local instance, follows this list).
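The following hypothetical snippet (not part of DIEF; http://localhost/w/api.php is an assumed local endpoint) builds the two request URLs above programmatically, so the same queries can be pointed at either the public Wikipedia API or a local MediaWiki installation (with the TextExtracts extension installed for the extracts query).

```scala
import scala.io.Source

object WarmupQueries {
  // PlainAbstractExtractor-style request: plain-text intro extract for a title.
  def extractsUrl(apiBase: String, title: String): String =
    s"$apiBase?format=xml&action=query&prop=extracts&exlimit=max" +
      s"&explaintext&exintro&redirects=&titles=$title"

  // HtmlAbstractExtractor-style request: parsed HTML of a page by page id.
  def parseUrl(apiBase: String, lang: String, pageId: Long): String =
    s"$apiBase?uselang=$lang&format=xml&action=parse&prop=text&pageid=$pageId"

  def main(args: Array[String]): Unit = {
    // Against the public API these requests count toward rate limits;
    // against a local MediaWiki instance (assumed at localhost) they do not.
    val publicApi = "https://en.wikipedia.org/w/api.php"
    println(Source.fromURL(extractsUrl(publicApi, "Berlin"), "UTF-8").mkString.take(300))
    println(Source.fromURL(parseUrl(publicApi, "en", 3354L), "UTF-8").mkString.take(300))
  }
}
```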
Mentors
Mykola Medynskyi
Dr. Dimitris Kontokostas
Marvin Hofer
Project size (175h or 350h)
- estimated time: 350h
Keywords
- DIEF, Wikimedia, HTTP, RDF, Linked Data, Scala