Databus-powered (FAIR) dataset mapping system for DBpedia's National Knowledge Graphs Initiative - GSoC2023



Recently, DBpedia has started to publish huge, fused knowledge graphs. The Dutch National Knowledge Graph (DNKG) initiative [3] focuses on linking, mapping and fusing datasets specific to the Netherlands to build a graph that connects Dutch data. The result is a knowledge graph that lets users discover more detailed data by following links, by federating over the SPARQL endpoints of the sources, or by loading the Knowledge Graph components [4] (so-called DBpedia Cartridges) of Dutch data. For a cartridge to integrate seamlessly with other DNKG cartridges, it is crucial to create and apply mappings for properties (and classes) between the underlying ontologies and vocabularies. DBpedia already uses a crowd-sourced approach to maintain mappings between Wikipedia infoboxes, Wikidata, and equivalent properties and classes of other datasets.

However, there is a need to make mappings more findable, accessible, interoperable and reusable (FAIR) to improve the efficiency of creation, management and maintenance of mappings. The DBpedia Databus and its ecosystem [1] (DBpedia Archivo augmented Ontology Archive, Databus Mods, etc.) have the potential to close the gap between datasets, ontologies, and mappings - at Web scale - while still allowing decentralized processes.


  • build a system (usable by humans and machines) on top of the DBpedia Databus and the Databus Mods architecture to register, manage and maintain schema/vocabulary mappings on the DBpedia Databus, making them tangible, modular, and FAIR
  • find or automatically discover and reuse mappings (from other Databus users) for datasets uploaded to the Databus


  • increasing interoperability of ontologies and datasets
  • enabling users to better integrate their data with other datasets/ontologies
  • supporting coordination and efficiency optimization of decentralized mapping and integration efforts with a global view
  • easier and more efficient creation of integrated / combined and derived knowledge graphs like the DNKG

Warm-up tasks

The tasks do not build on each other but should be done in sequential order to ease gently into the tools and technologies.

Easy but clicky:

  • E1: sign up for a Databus account; create and publish a Databus collection (you can use the collection editor [10]) that contains some of the N-Triples (.nt) dump files for the ontologies (pick them from the DBpedia Archivo Databus user [2]) of the datasets/cartridges contained in the DNKG pilot [0,3,8] (you can use the mapping dashboard [8], and the VoID mod can also help you identify the vocabularies/ontologies used [9])

Medium and smooth:

  • M1: write at least one pt-construct query [3] for a property that is not mapped yet (see the DNKG pilot mapping dashboards [8]) and create a PR on the DNKG-mappings repository [0]
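
As a rough illustration of M1, the sketch below builds a SPARQL CONSTRUCT query that re-emits every triple of one source property under a target property. It assumes a property mapping query can have this simple shape; the conventions actually expected by the DNKG-mappings repository [0] may differ. The helper name `build_property_construct` and the example source IRI are hypothetical.

```python
def build_property_construct(source_property: str, target_property: str) -> str:
    """Return a SPARQL CONSTRUCT query that copies every triple using
    source_property to a triple using target_property (hypothetical helper)."""
    for iri in (source_property, target_property):
        # Reject characters that may not appear inside a SPARQL IRI reference.
        if any(ch in iri for ch in '<>"{}|^`\\ '):
            raise ValueError(f"not a valid IRI: {iri!r}")
    return (
        "CONSTRUCT { ?s <" + target_property + "> ?o }\n"
        "WHERE     { ?s <" + source_property + "> ?o }"
    )


# The source IRI is a placeholder, not a property from the DNKG pilot.
query = build_property_construct(
    "http://example.org/source/vocab#name",
    "http://www.w3.org/2000/01/rdf-schema#label",
)
print(query)
```

Generating the query from two IRIs keeps one mapping per query, which makes each mapping easy to review in a PR.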

Harder but cooler:

  • H1: deploy the mod architecture locally [6], configure the master to use a tiny subset of the Databus (e.g. using the generated SPARQL query of the collection you created in E1) and start it so that you can run the VoID mod yourself on your own computer.
  • H2: create another collection (using the custom SPARQL query feature) that contains (almost) all the ontology N-Triples dump files from Archivo used by all DNKG datasets (cartridge-input group). Join the VoID mod statistics for Archivo and DNKG as a heuristic; use [10] as a starting point. Set up your private instance of a SPARQL endpoint with the Archivo ontologies loaded (see the instructions at Archivo - Ontology Access) and write a federated query between your SPARQL endpoint (which now serves as an index for all properties and ontologies) and the Mod-SPARQL endpoint for the VoID mods. Compare both lists of ontologies and conclude how well the heuristic worked. If you would like to go the extra mile, you could also modify the SPO mod to create RDF instead of TSV to achieve the same goal without needing a dedicated SPARQL endpoint with the ontologies.
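
The last two steps of H2 could be sketched roughly as follows. The query builder assumes that the VoID mod results expose used properties via `void:property` and that the loaded ontologies declare `rdfs:isDefinedBy` for their terms; neither is guaranteed, so adapt the patterns to the data you actually find. The endpoint URL, the function names and the sample ontology lists are placeholders.

```python
def build_federated_query(mods_endpoint: str) -> str:
    """Build a query to run on the local ontology endpoint. It fetches the
    properties used in the datasets from the remote Mod-SPARQL endpoint
    (SPARQL 1.1 SERVICE clause) and looks up their ontology locally."""
    return (
        "PREFIX void: <http://rdfs.org/ns/void#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?ontology WHERE {\n"
        "  SERVICE <" + mods_endpoint + "> {\n"
        "    ?partition void:property ?property .\n"  # assumed VoID shape
        "  }\n"
        "  ?property rdfs:isDefinedBy ?ontology .\n"  # assumed to be declared
        "}"
    )


def compare_ontology_lists(heuristic, reference):
    """Compare the heuristic ontology list against the reference list."""
    heuristic, reference = set(heuristic), set(reference)
    hits = heuristic & reference
    return {
        "matched": sorted(hits),
        "missed": sorted(reference - heuristic),    # in reference only
        "spurious": sorted(heuristic - reference),  # in heuristic only
        "precision": len(hits) / len(heuristic) if heuristic else 0.0,
        "recall": len(hits) / len(reference) if reference else 0.0,
    }


# Placeholder endpoint and sample lists, for illustration only.
federated_query = build_federated_query("http://localhost:8890/mods/sparql")
report = compare_ontology_lists(
    heuristic=["http://xmlns.com/foaf/0.1/", "http://purl.org/dc/terms/",
               "http://www.w3.org/2004/02/skos/core#"],
    reference=["http://xmlns.com/foaf/0.1/", "http://purl.org/dc/terms/",
               "http://schema.org/"],
)
print(report["missed"], report["spurious"])
```

Reporting missed and spurious ontologies separately shows whether the heuristic tends to overlook vocabularies or to include unused ones.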

Johannes Frey
Beyza Yaman, PhD
Marvin Hofer

Project size:


High-level task description

  • develop a system and strategy to manage created mappings (+metadata) on Databus such that they can be announced/discovered and reused in a unified and FAIR way by tools and users
  • extend the mapping dashboard concept to show available/reusable mappings for different kinds of Databus assets (collections, artifacts, files) in alignment with the existing Databus ecosystem
  • deploy/adapt/develop an interface (UI and API) to create, edit and enhance mappings, with tool-generated suggestions that users can manually correct or override, to enable collaborative workflows
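
One way to picture the first task is a small in-memory model of mapping records that can be looked up by the vocabularies they connect. This is only a sketch: the field names, the `MappingRegistry` class and all identifiers are hypothetical and do not reflect the actual Databus data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRecord:
    mapping_file: str   # identifier of the mapping file (placeholder)
    source_vocab: str   # vocabulary the mapping reads from
    target_vocab: str   # vocabulary the mapping writes to
    creator: str        # who published the mapping
    license: str        # reuse conditions


class MappingRegistry:
    def __init__(self):
        self._by_vocab = defaultdict(list)

    def register(self, record: MappingRecord) -> None:
        # Index the record under both vocabularies so it is discoverable
        # from either side of the mapping.
        self._by_vocab[record.source_vocab].append(record)
        self._by_vocab[record.target_vocab].append(record)

    def find_for_vocabularies(self, vocabularies):
        """Return each registered mapping that touches any given vocabulary."""
        seen, result = set(), []
        for vocab in vocabularies:
            for record in self._by_vocab.get(vocab, []):
                if record not in seen:
                    seen.add(record)
                    result.append(record)
        return result


registry = MappingRegistry()
registry.register(MappingRecord(
    mapping_file="https://example.org/mappings/name-to-label.sparql",
    source_vocab="http://example.org/source/vocab#",
    target_vocab="http://www.w3.org/2000/01/rdf-schema#",
    creator="https://example.org/users/alice",
    license="http://creativecommons.org/publicdomain/zero/1.0/",
))
found = registry.find_for_vocabularies(["http://www.w3.org/2000/01/rdf-schema#"])
print([record.mapping_file for record in found])
```

A dataset's used vocabularies (for example, as reported by the VoID mod) could serve as the lookup keys to suggest reusable mappings.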

Optional tasks to shape the project according to personal interests and ambitions

  • extend the Databus VoID mod to analyze the schema of RDF datasets in order to gain more sophisticated schema insights (with the goal of making it easier to create mappings between datasets semi-automatically)
  • implement a Databus mod that analyzes TSV/CSV schemas and makes it easier to manage, create and identify TARQL mappings
  • sync BioPortal mappings into the system
  • your own proposal for extending/enhancing the task
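
To give a feel for the TSV/CSV schema analysis, the sketch below profiles a delimited file by reading its header and inferring a coarse type per column. It is a minimal illustration using only the Python standard library, not the Databus Mods API; the function names, the type labels and the sample data are assumptions.

```python
import csv
import io


def infer_type(values):
    """Infer a coarse type label for one column from its string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    for name, cast in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                cast(v)  # raises ValueError if v does not parse
            return name
        except ValueError:
            continue
    return "string"


def profile_columns(text, delimiter="\t"):
    """Map each header column of a delimited text to its inferred type."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name in columns:
            # Missing cells in short rows are treated as empty strings.
            columns[name].append(row.get(name) or "")
    return {name: infer_type(values) for name, values in columns.items()}


# Made-up sample data, for illustration only.
sample = "id\tname\theight\n1\tAmsterdam\t-2.0\n2\tUtrecht\t5\n"
profile = profile_columns(sample)
print(profile)
```

Column names and types like these could then be matched against ontology properties to suggest candidate TARQL mappings.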


ontology mapping, data integration, FAIR principles (findability, accessibility, interoperability, reusability)

Technology know-hows and Skills

  • Linked Data, RDF(S), OWL, SPARQL
  • Java or Python
  • Web Technologies: HTML/JS, HTTP, REST
  • Docker

References & Literature & Links
