DESCRIPTION
DBpedia currently provides monthly releases, with a stable release (named “snapshot”) every three months, published on the Databus (DBpedia's publishing platform).
A DBpedia release consists of various data artifacts, such as “disambiguations”, “labels”, “infobox-properties”, or “images”, each of them produced by a different component (extractor) of the DBpedia extraction framework.
One essential extractor is the abstract text extractor, which produces RDF triples of wiki page abstracts in the form of “long-abstracts” (entire wiki page abstracts) or “short-abstracts” (limited to 400 words).
There are currently two implementations of the Abstract Extractor (PlainAbstractExtractor and HtmlAbstractExtractor), based on different Wikimedia APIs. Both suffer from similar issues:
- Extraction covers 140+ languages and takes more than 30 days, often delaying the publication of releases.
- The extraction encounters many HTTP 429 (Too Many Requests) responses from the APIs. Although a retry mechanism waits after such a response, the current page is skipped after ten retries. The number of HTTP 429 responses fluctuates each month but can result in an enormous number of lost abstracts in a release (e.g., ~455,000 missing abstracts for English in the 2021-06-01 release). A sketch of this retry-and-skip behaviour follows this list.
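To make the problem concrete, here is a minimal sketch (not the actual DIEF code; the object name and defaults are illustrative) of a fetch loop that retries on HTTP 429 with exponential backoff, honours a Retry-After header when present, and gives up after a fixed number of retries, which corresponds to a skipped page and a lost abstract in the current extractors. It uses only the JDK and the Scala standard library.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object BackoffFetch {
  /** Fetch a URL, retrying on HTTP 429 with exponential backoff.
    * Returns None when all retries are exhausted, which is roughly
    * what happens to a skipped page in the current extractors. */
  def fetch(url: String, maxRetries: Int = 10): Option[String] = {
    var attempt = 0
    while (attempt <= maxRetries) {
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestProperty("User-Agent", "dbpedia-abstract-extraction-test")
      conn.getResponseCode match {
        case 200 =>
          val body = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
          conn.disconnect()
          return Some(body)
        case 429 =>
          // Prefer the server's Retry-After hint (seconds), otherwise back off exponentially.
          val retryAfterMs = Option(conn.getHeaderField("Retry-After"))
            .flatMap(s => scala.util.Try(s.toLong * 1000L).toOption)
          conn.disconnect()
          Thread.sleep(retryAfterMs.getOrElse(math.min(60000L, 1000L << attempt)))
          attempt += 1
        case _ =>
          conn.disconnect()
          return None // other errors are not retried in this sketch
      }
    }
    None // all retries exhausted: this page's abstract would be lost
  }
}
```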
Goal
- research possible methods to generate or fetch wiki page abstract text
- compare and benchmark the investigated methods (performance and fault tolerance)
- develop a new, improved Abstract Extractor based on the gained knowledge (e.g., by using batch requests to fetch abstracts for multiple pages in one request; see the sketch after this list)
- the implementation can either be done in the current DIEF using Java/Scala or as a new external project using a different language such as Python
- consider using a fallback value from the previous release when an abstract cannot be fetched
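As a starting point for the batch-request idea, the following hypothetical sketch (object and method names are not part of DIEF) fetches plain-text intro extracts for several titles in a single call to the extracts API used by the PlainAbstractExtractor, using only the JDK and the Scala standard library. Note that the API may lower exlimit to 1 for whole-article extracts, so batching is mainly useful for intro (short-abstract) requests.

```scala
import java.net.URLEncoder
import javax.xml.parsers.DocumentBuilderFactory
import scala.io.Source

object BatchAbstractFetch {
  /** Fetch plain-text intro extracts for several page titles in one request.
    * Multiple titles are separated by '|' in the titles parameter. */
  def fetchIntros(lang: String, titles: Seq[String]): Map[String, String] = {
    val titleParam = URLEncoder.encode(titles.mkString("|"), "UTF-8")
    val url = s"https://$lang.wikipedia.org/w/api.php?format=xml&action=query" +
      s"&prop=extracts&exlimit=max&explaintext&exintro&redirects=&titles=$titleParam"
    val xml = Source.fromURL(url, "UTF-8").mkString

    // Parse the <page title="..."><extract>...</extract></page> elements of the XML response.
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new java.io.ByteArrayInputStream(xml.getBytes("UTF-8")))
    val pages = doc.getElementsByTagName("page")
    (0 until pages.getLength).flatMap { i =>
      val page = pages.item(i).asInstanceOf[org.w3c.dom.Element]
      val extracts = page.getElementsByTagName("extract")
      if (extracts.getLength > 0) Some(page.getAttribute("title") -> extracts.item(0).getTextContent)
      else None
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // One request instead of three separate ones.
    fetchIntros("en", Seq("Berlin", "Hamburg", "Munich")).foreach {
      case (title, text) => println(s"$title: ${text.take(80)}...")
    }
  }
}
```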
Impact
The new abstract extraction process should be faster and more robust.
At a minimum, it should reduce the number of abstracts lost during extraction (by reducing or eliminating HTTP 429 errors).
The provided data releases will contain more consistent data.
Warm-up tasks
- run the DBpedia Extraction Framework
  - as a local installation with minidump tests
  - using the marvin-config to start a test extraction
- investigate Wikimedia APIs
  - This URL is used in the PlainAbstractExtractor to get abstracts: https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Berlin&redirects=
  - This URL is used in the HtmlAbstractExtractor to get the content of a page: https://en.wikipedia.org/w/api.php?uselang=en&format=xml&action=parse&prop=text&pageid=3354 . After the page content is retrieved, it is parsed and the long and short abstracts are produced.
  - Another way to retrieve content from Wikipedia pages is to set up MediaWiki on your local machine, which allows making requests to a local MediaWiki API without rate limits. GitHub repository of the MediaWiki TextExtracts extension: https://github.com/wikimedia/mediawiki-extensions-TextExtracts (a sketch that builds both request types, and can target a local instance, follows this list).
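The following hypothetical snippet (not part of DIEF; http://localhost/w/api.php is an assumed local endpoint) builds the two request URLs above programmatically, so the same queries can be pointed at either the public Wikipedia API or a local MediaWiki installation (with the TextExtracts extension installed for the extracts query).

```scala
import scala.io.Source

object WarmupQueries {
  // PlainAbstractExtractor-style request: plain-text intro extract for a title.
  def extractsUrl(apiBase: String, title: String): String =
    s"$apiBase?format=xml&action=query&prop=extracts&exlimit=max" +
      s"&explaintext&exintro&redirects=&titles=$title"

  // HtmlAbstractExtractor-style request: parsed HTML of a page by page id.
  def parseUrl(apiBase: String, lang: String, pageId: Long): String =
    s"$apiBase?uselang=$lang&format=xml&action=parse&prop=text&pageid=$pageId"

  def main(args: Array[String]): Unit = {
    // Against the public API these requests count toward rate limits;
    // against a local MediaWiki instance (assumed at localhost) they do not.
    val publicApi = "https://en.wikipedia.org/w/api.php"
    println(Source.fromURL(extractsUrl(publicApi, "Berlin"), "UTF-8").mkString.take(300))
    println(Source.fromURL(parseUrl(publicApi, "en", 3354L), "UTF-8").mkString.take(300))
  }
}
```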
Mentors
Mykola Medynskyi
Dr. Dimitris Kontokostas
Marvin Hofer
Project size (175h or 350h)
- estimated time: 350h
Keywords
- DIEF, Wikimedia, HTTP, RDF, Linked Data, Scala