Automatically adding Wikimedia Dumps on the Databus — GSoC 2025

Project Description:
Wikimedia publishes their dumps via https://dumps.wikimedia.org . At the moment, these dumps are only described via HTML, so the HTML pages serve as the metadata and must be parsed to find out whether new dumps are available. To automate retrieval of the data, the Databus (and MOSS, its extension for search over a flexible, multi-domain, multi-repository metadata catalog) provides a metadata knowledge graph, on which one can run queries such as “check whether a new version of x is available”. Since DBpedia uses the dumps to create knowledge graphs, it would be good to put the download links for the dumps and their metadata on the Databus.

Key Objectives

  • Build a Docker image that we can run daily on our infrastructure to crawl dumps.wikimedia.org, identify all newly finished dumps, and add a new record for each of them on the Databus (a rough sketch follows after this list).
  • The goal is to allow checking for new dumps via SPARQL: on the Databus, go to the example queries and see “Latest version of artifact”.
  • This would help us to (1) track new releases from Wikimedia, so the core team and the community can more systematically convert them to RDF, and (2) build more solid applications on top, e.g. DIEF or others.
  • Process-wise, I think an early prototype is necessary first, with iterations planned from there.
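
As a very rough illustration of the daily crawl step, here is a minimal Scala sketch. It assumes the overview page https://dumps.wikimedia.org/backup-index.html keeps roughly its current layout; the regular expression and the final Databus registration step are placeholders, not a confirmed API, and would need to be replaced with proper parsing and the actual Databus publish call in the real project:

```scala
import scala.io.Source
import scala.util.matching.Regex

object DumpCrawler {

  // Overview page listing the most recent dump run per wiki.
  val indexUrl = "https://dumps.wikimedia.org/backup-index.html"

  // Hypothetical pattern for finished runs; the real HTML layout must be
  // checked and the pattern (or a proper HTML parser) adjusted accordingly.
  val donePattern: Regex =
    """<a href="([a-z0-9_]+)/(\d{8})">[^<]+</a>:\s*<span class='done'>""".r

  def main(args: Array[String]): Unit = {
    val html = Source.fromURL(indexUrl, "UTF-8").mkString

    // Collect (wiki, date) pairs for all runs marked as finished.
    val finished = donePattern
      .findAllMatchIn(html)
      .map(m => (m.group(1), m.group(2)))
      .toList

    finished.foreach { case (wiki, date) =>
      // Placeholder: here the crawler would compare (wiki, date) against the
      // versions already registered on the Databus and publish a new record
      // for anything that is not there yet.
      println(s"finished dump: $wiki/$date -> https://dumps.wikimedia.org/$wiki/$date/")
    }
  }
}
```

The daily fetch-diff-publish loop shown here is the core of the project; the prototype would then be extended to read the per-dump status information instead of a regex and to call the Databus API for registration.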

Skills
The task is not very complex per se, but it requires some experience in executing a software project: a clean project setup, good code, tests, and a simple but well-thought-out process (simple because it will be more robust and maintainable than something complicated). We would prefer coding in Scala, but Python or other languages would also be fine. Some DevOps skills are required to produce a good Docker (Swarm) setup, but they can also be learned during the project.

Size
120 to 180 hours to do it properly. I would estimate that the final deployment also takes about a week to put it into effect and thoroughly verify that the result works well.


Dear Kurzum,

I hope you’re doing well! I’m excited about the opportunity to contribute to Wikimedia’s dump automation project as part of GSoC 2025. The idea of streamlining the metadata extraction process and integrating it into Databus for structured queries truly fascinates me.

With my experience in software development, competitive programming, and DevOps, I look forward to building a robust Docker-based solution to track new Wikimedia dumps efficiently. I strongly believe that automating this process will enhance the community’s ability to systematically convert data into RDF and foster more scalable applications.

I’d love to discuss potential approaches and start working on an early prototype to iterate from there. Looking forward to collaborating with you and the team to make this project a success!

Best regards,
Priyanshu