DBpedia currently provides monthly releases, with a stable release (named ‘snapshot’) every three months, on the Databus (DBpedia's publishing platform).
A DBpedia release consists of various data artifacts, like “disambiguations”, “labels”, “infobox-properties”, or “images”, each of them extracted by a different part (extractor) of the DBpedia extraction framework.
Another essential extractor is the abstract text extractor, which produces RDF triples of Wiki page abstracts in the form of “long-abstracts” (entire wiki page abstracts) or “short-abstracts” (limited to 400 words).
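The 400-word limit for short abstracts can be illustrated with a small sketch. The function name and the plain word-based split are assumptions for illustration, not actual DIEF code:

```python
def short_abstract(long_abstract: str, max_words: int = 400) -> str:
    """Truncate a long abstract to at most `max_words` words.

    Illustrative only: the real extractor may cut on sentence
    boundaries rather than using a plain word split.
    """
    words = long_abstract.split()
    if len(words) <= max_words:
        return long_abstract
    return " ".join(words[:max_words]) + " ..."
```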
- The framework extracts data for 140+ languages, which takes more than 30 days and often delays the publishing of releases.
- The extraction encounters many HTTP 429 (Too Many Requests) responses from the APIs used. Although a retry mechanism waits after such a response, the current page is skipped after ten retries. The number of HTTP 429 responses fluctuates each month but can result in an enormous number of lost abstracts in a release (e.g., ~455,000 missing abstracts for the English language in the 2021-06-01 release).
- research possible methods to generate or fetch Wiki page abstract texts
- compare and benchmark the investigated methods (performance and fault tolerance)
- develop a new, improved abstract extractor based on the gained knowledge (e.g., by using batch requests to fetch abstracts for multiple pages in one request)
- The implementation can either be done in the current DIEF using Java/Scala or as a new external project in a different language such as Python.
- Consider using a fallback value from the previous release
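The batch-request idea above can be sketched by joining several page titles with `|`, which the MediaWiki query API accepts. The helper name is an assumption, the parameters mirror the extracts URL used by the current extractor, and the number of pages allowed per request is an assumption to verify against the API's documented limits:

```python
from urllib.parse import urlencode

def batch_extracts_url(titles: list[str], lang: str = "en") -> str:
    """Build one `prop=extracts` request covering several pages.

    MediaWiki's query API accepts multiple titles joined by '|';
    how many pages one request may cover is an assumption to check.
    """
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": 1,
        "exintro": 1,
        "redirects": 1,
        "titles": "|".join(titles),
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)
```

Fetching many abstracts per request would directly reduce the total number of requests and thus the chance of hitting 429 responses.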
The new abstract extraction process should be faster and safer.
At a minimum, it should reduce the number of abstracts lost during the extraction (fewer or no 429 errors).
The provided data releases will contain more consistent data.
- run the DBpedia Extraction framework
- investigate Wikimedia APIs
- This URL is used in the PlainAbstractExtractor to get abstracts: https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Berlin&redirects=
- This URL is used in the HtmlAbstractExtractor to get the content of the page: https://en.wikipedia.org/w/api.php?uselang=en&format=xml&action=parse&prop=text&pageid=3354 . After fetching the content of a page, it is parsed and both long-abstracts and short-abstracts are produced.
- Another way to retrieve content from Wikipedia pages is to set up the MediaWiki API on your local machine, which allows making requests against a local MediaWiki instance without rate limits. GitHub repository of the MediaWiki TextExtracts extension: https://github.com/wikimedia/mediawiki-extensions-TextExtracts
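Whichever endpoint is used, the XML response must be parsed into title-to-abstract pairs. A minimal sketch with Python's standard library, assuming the `<api><query><pages><page><extract>` shape returned by `format=xml` for `prop=extracts` (element names taken from a sample response and worth verifying against the live API):

```python
import xml.etree.ElementTree as ET

def parse_extracts(xml_text: str) -> dict[str, str]:
    """Map page titles to their abstract text.

    Assumes the <page><extract> response shape of `prop=extracts`
    with format=xml; verify against a live API response.
    """
    root = ET.fromstring(xml_text)
    abstracts = {}
    for page in root.iter("page"):
        extract = page.find("extract")
        if extract is not None and extract.text:
            abstracts[page.get("title")] = extract.text.strip()
    return abstracts
```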
Dr. Dimitris Kontokostas
- estimated time: 350h
- DIEF, Wikimedia, HTTP, RDF, Linked Data, Scala