Containerized Installers for Data-centric Services using Databus Collections — GSoC 2025

Project Description:

This GSoC project aims to develop containerized installers for data-centric services utilizing Databus collections. Databus collections provide a framework for managing and sharing datasets across distributed systems, offering versioning, replication, and access control features.

One exemplary application of this project is integrating Databus collections with the Virtuoso Open-Source triple store, a widely used RDF service. This integration enables seamless deployment and loading of RDF datasets into Virtuoso instances within containerized environments.

Additionally, the project entails designing and documenting best practices for deploying other Databus-driven services, as well as implementing additional deployment-ready containers. These containers will encapsulate the components needed to pull data from Databus collections and install it with the associated services, ensuring ease of deployment and scalability.
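The "pull data from Databus collections" step could be sketched as follows: a Databus collection is backed by a SPARQL query whose results list the download URLs of the collection's files, so an installer can execute that query against the Databus endpoint and extract the URLs from the standard SPARQL JSON result. The helper name, the `?file` variable, and the sample payload below are illustrative assumptions, not the actual quickstart implementation:

```python
import json

def extract_download_urls(sparql_json: str) -> list[str]:
    """Extract file download URLs from a SPARQL SELECT result in
    application/sparql-results+json format, such as the Databus endpoint
    returns for a collection's query. Assumes the query projects a
    ?file variable (illustrative; the real variable name may differ)."""
    results = json.loads(sparql_json)
    return [
        binding["file"]["value"]
        for binding in results["results"]["bindings"]
        if "file" in binding
    ]

# Hypothetical sample of what the endpoint might return:
sample = json.dumps({
    "head": {"vars": ["file"]},
    "results": {"bindings": [
        {"file": {"type": "uri",
                  "value": "https://example.org/data/part1.ttl.bz2"}},
    ]},
})
```

An installer would then download each URL and hand the files to the service's bulk loader (e.g. Virtuoso's).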

Furthermore, the project may explore integration options with the Databus frontend or even metadata, enhancing discoverability and interoperability of the deployed services within the Databus ecosystem.

Expected Outcome:

  • A well-documented Databus-driven Virtuoso Quickstarter container that focuses on ease of deployment.

  • Documentation outlining best practices and guidelines for implementing, deploying and managing Databus-driven services.

  • 4–5 containerized installers for deploying data-centric services leveraging Databus collections.

  • Design proposal for integration of these services with the Databus frontend.

  • [Optional] integration with Databus frontend or even metadata for improved discoverability and usability.

Skills Required:

  • A good understanding of SPARQL, RDF and other Semantic Web technologies

  • Some proficiency in containerization technologies (e.g., Docker, Kubernetes).

  • Knowledge of the core concepts of the DBpedia Databus (see Overview | Databus Gitbook)

  • Good documentation and communication skills

Project Size:

Estimated at between 90 and 180 hours, depending on expertise and the number of tasks tackled.

Hi @janfo ,
I’m really interested in the Containerized Installers project. I have some hands-on experience with Docker and Kubernetes from my internship at Vodafone Egypt and am deepening my skills through the Data Engineering Zoomcamp 2025 by datatalksclub.

I noticed this project was on the GSoC 2024 ideas list but wasn’t picked — could you share why, and what its current priority is for GSoC 2025?

Thanks,
Mohamad Wahba

Hi all,
we already have this https://hub.docker.com/r/dbpedia/virtuoso-sparql-endpoint-quickstart which auto-loads RDF from the Databus into a SPARQL endpoint, i.e. everybody can easily create a DBpedia mirror. It has been used over 100k times. It would be good to also support another SPARQL DB like QLever (https://github.com/ad-freiburg/qlever), a very fast SPARQL engine that can handle very large knowledge graphs such as the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search; it is faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
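For reference, the quickstart image mentioned above is typically started via docker compose. The sketch below illustrates the general shape only; the exact service layout and environment variable names (including `COLLECTION_URI` and the collection URL shown) are assumptions here and should be taken from the image's README:

```yaml
# Sketch, not the official compose file — variable names are assumptions.
services:
  virtuoso:
    image: dbpedia/virtuoso-sparql-endpoint-quickstart
    ports:
      - "8890:8890"   # Virtuoso's default SPARQL endpoint / web UI port
    environment:
      # Databus collection whose files get downloaded and bulk-loaded:
      COLLECTION_URI: "https://databus.dbpedia.org/dbpedia/collections/latest-core"
```

The same pattern — one image, one collection URI as configuration — is what the other containerized installers in this project would replicate for services like QLever.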
Then it would be cool to go through the awesome-semantic-web list (https://github.com/semantalytics/awesome-semantic-web), pre-select some tools, and have the community vote or at least give feedback. It would be good to have either a decent variety or a selection targeting one popular use case.

It’s tough to make a good selection, but if you find the right combination of tools with good synergies, the value can be immense.