Databus-powered (FAIR) dataset mapping system for DBpedia's National Knowledge Graphs Initiative - GSoC 2023

DESCRIPTION

Background:

Recently, DBpedia has started to publish huge, fused knowledge graphs. The Dutch National Knowledge Graph (DNKG) initiative [3] focuses on linking, mapping, and fusing datasets specific to the Netherlands to build a graph that connects Dutch data. The result is a knowledge graph that allows discovery of more detailed data by following links, by federating over the SPARQL endpoints of the sources, or by loading the knowledge graph components [4] (so-called DBpedia Cartridges) of Dutch data. In order to make a cartridge integrate seamlessly with other DNKG cartridges, it is crucial to create and apply mappings for properties (and classes) between the underlying ontologies and vocabularies. DBpedia already uses a crowd-sourced approach to maintain mappings between Wikipedia infoboxes, Wikidata, and equivalent properties and classes of other datasets.

However, there is a need to make mappings more findable, accessible, interoperable, and reusable (FAIR) to improve the efficiency of creating, managing, and maintaining mappings. The DBpedia Databus and its ecosystem [1] (the DBpedia Archivo augmented ontology archive, Databus Mods, etc.) have the potential to close the gap between datasets, ontologies, and mappings - at Web scale - while still allowing decentralized processes.

Goal:

  • build a system (usable by both humans and machines) on top of the DBpedia Databus and the Databus Mods architecture to register and manage/maintain schema/vocabulary mappings on the DBpedia Databus, making them tangible, modular, and FAIR
  • find / automatically discover and reuse mappings (from other Databus users) for datasets uploaded to the Databus

Impact:

  • increasing interoperability of ontologies and datasets
  • giving users the facility to better integrate their data with other datasets/ontologies
  • supporting coordination and efficiency optimization of decentralized mapping and integration efforts with a global view
  • easier and more efficient creation of integrated / combined and derived knowledge graphs like the DNKG

Warm-up tasks

The tasks do not build on top of each other, but they should be done in sequential order to ease gently into the tools and technologies.

Easy but clicky:

  • E1: sign up for a Databus account; create and publish a Databus collection (you can use the collection editor [10]) that contains some of the N-Triples (.nt) dump files for the ontologies (pick them from the DBpedia Archivo Databus user [2]) of the datasets/cartridges contained in the DNKG pilot [0,3,8]; you can use the mapping dashboard [8], but the VoID mod can also help you to identify the used vocabularies/ontologies [9] (a sketch of such a collection query follows below)
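
For orientation, the collection editor's custom-query view produces Databus SPARQL queries roughly of the following shape. This is only a sketch: the dataid:/dcat: property names are assumptions based on the DataID vocabulary, so compare it against a query actually generated by the editor.

    PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
    PREFIX dcat:   <http://www.w3.org/ns/dcat#>

    # Sketch: select all .nt distribution files published by the Archivo
    # Databus user. The dataid:account / dataid:file / dataid:formatExtension
    # property names are assumptions - verify them against a generated query.
    SELECT DISTINCT ?file WHERE {
      ?dataset dataid:account <https://databus.dbpedia.org/ontologies> ;
               dcat:distribution ?distribution .
      ?distribution dataid:file ?file ;
                    dataid:formatExtension "nt" .
    }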

Medium and smooth:

  • M1: write at least one pt-construct query [3] for a property that is not mapped yet (check the DNKG pilot mapping dashboards [8]) and create a PR on the DNKG-mappings repository [0]; the general shape of such a query is sketched below
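
To illustrate the general shape of such a mapping query, the following sketch rewrites a made-up, unmapped source property into a DBpedia ontology property. The src: prefix and its property are placeholders; take the real source vocabulary and target property from the mapping dashboard [8] and the DNKG-mappings repository [0].

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX src: <http://example.org/source-vocab/>   # placeholder source vocabulary

    # Sketch of a property mapping as a CONSTRUCT query: rewrite an unmapped
    # source property (here a made-up Dutch "number of inhabitants" property)
    # into its DBpedia ontology equivalent.
    CONSTRUCT {
      ?s dbo:populationTotal ?o .
    }
    WHERE {
      ?s src:aantalInwoners ?o .
    }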

Harder but cooler:

  • H1: deploy the mod architecture locally [6], configure the master to use a tiny subset of the Databus (e.g. using the generated SPARQL query of the collection you created in E1), and start it so that you can run the VoID mod yourself on your computer.
  • H2: create another collection (using the custom SPARQL query feature) that contains (almost) all the ontology N-Triples dump files from Archivo used by all DNKG datasets (cartridge-input group). Join the VoID mod statistics for Archivo and DNKG as a heuristic - use [10] as a starting point. Set up your private instance of a SPARQL endpoint with the Archivo ontologies loaded (see the instructions at Archivo - Ontology Access) and write a federated query between your SPARQL endpoint (which now serves as an index of all properties and ontologies) and the Mod-SPARQL endpoint for the VoID mods (a sketch follows below). Compare both lists of ontologies and assess how well the heuristic worked. If you would like to go the extra mile, you could also modify the SPO mod to create RDF instead of TSV, achieving the same goal without the need for a dedicated SPARQL endpoint with the ontologies.
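
The federated query in H2 could look roughly like the sketch below, executed against your private Archivo endpoint. The Mod-SPARQL endpoint URL in the SERVICE clause and the exact shape of the VoID mod results (void:propertyPartition / void:property) are assumptions to verify against your actual deployment.

    PREFIX void: <http://rdfs.org/ns/void#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    # Sketch: fetch the properties observed by the VoID mod from the
    # Mod-SPARQL endpoint (the URL is a placeholder), then look up locally,
    # on the Archivo endpoint this query is sent to, which ontology
    # defines each property.
    SELECT DISTINCT ?property ?ontology WHERE {
      SERVICE <http://localhost:8890/sparql> {
        ?dataset void:propertyPartition ?partition .
        ?partition void:property ?property .
      }
      OPTIONAL { ?property rdfs:isDefinedBy ?ontology . }
    }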

Mentors:
Johannes Frey
Beyza Yaman, PhD
Marvin Hofer

Project size:
175h

TASKS:

High-level task description

  • develop a system and strategy to manage created mappings (plus metadata) on the Databus such that they can be announced, discovered, and reused in a unified and FAIR way by tools and users (see the discovery sketch after this list)
  • extend the mapping dashboard concept to view available/reusable mappings for different kinds of Databus assets (collections, artifacts, files), in alignment with the existing Databus ecosystem
  • deploy/adapt/develop an interface (UI and API) to create/edit/enhance mappings, with suggestions from tool support but manual correction/override, to enable collaborative workflows
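
To make the first point more concrete: mapping discovery could eventually boil down to a query like the following sketch. The map: vocabulary here is entirely hypothetical - designing the real metadata schema for registered mappings is part of this project.

    PREFIX map: <http://example.org/mapping-registry#>   # hypothetical vocabulary

    # Sketch: find registered, reusable mappings between two vocabularies.
    SELECT ?mapping ?file WHERE {
      ?mapping map:sourceVocabulary <http://example.org/source-vocab/> ;
               map:targetVocabulary <http://dbpedia.org/ontology/> ;
               map:mappingFile ?file .
    }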

Optional tasks to shape the project according to personal interests and ambitions

  • extend the Databus VoID mod to analyze the schema of RDF datasets in order to gain more sophisticated schema insights (with the goal of being able to create mappings between datasets semi-automatically)
  • implement a Databus mod for analyzing TSV/CSV schemas, making it easier to manage, create, and identify TARQL mappings (a minimal TARQL sketch follows this list)
  • sync BioPortal mappings into the system
  • your own proposal for extending/enhancing the tasks
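
For reference, TARQL mappings are ordinary SPARQL CONSTRUCT queries in which the CSV columns appear as variables. A minimal sketch for a hypothetical CSV with columns id and name:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # Sketch of a TARQL mapping: TARQL binds each CSV column of the header
    # row (here: id, name) to a variable of the same name, producing one
    # solution per data row.
    CONSTRUCT {
      ?uri a foaf:Person ;
           foaf:name ?name .
    }
    WHERE {
      BIND (URI(CONCAT('http://example.org/person/', ?id)) AS ?uri)
    }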

Keywords:

ontology mapping, data integration, FAIR principles (findability, accessibility, interoperability, reusability)

Technology know-how and skills

  • Linked Data, RDF(S), OWL, SPARQL
  • Java or Python
  • Web Technologies: HTML/JS, HTTP, REST
  • Docker

References & Literature & Links


I am Arinjay Pathak, a final-year Bachelor of Engineering student in Computer Engineering at Thapar University, Patiala. Currently, I am working as a research intern at the Indian Institute of Information Technology, Una.

I have a strong interest in machine learning, deep learning, and natural language processing, and have done projects in these domains. My main projects include Speech Emotion Recognition, in which I achieved an accuracy of 85.59% on the RAVDESS dataset using an ensemble model.

I also built a semantic search system using sentence transformers to search for articles, acts, and court cases related to provided keywords, which helped me win the Smart India Hackathon, organized by the Government of India.

I am an open-source beginner and am looking to learn and explore more about open source as a contributor through Google Summer of Code.

I am looking forward to learning more about mapping knowledge graphs, exploring further, and gaining experience working under your guidance.

I am attaching my resume to this proposal.


Thanks

👋 Hi @jfrey, @beyza and @hopver,

My name is Deborah, and I am an MSc Computing (Major in AI) student at Dublin City University. I want to learn more about open-source development by getting hands-on experience with the semantic web and linked open data in this project.

I have built several applications in Python and Java and worked with web technologies like REST, RPC, HTTP, and Docker during my internships at Nokia (a work-integrated study program). During my master’s and my computer science undergrad at the Baden-Württemberg Cooperative State University (Germany), I learned to work with OWL, ontologies, RDF, and SPARQL in modules like Semantic Web, Artificial Intelligence and Information Seeking, and Big Data Technologies.

I’m very excited about DBpedia, as I find knowledge graphs, knowledge management, and linked data very interesting. I will continue with the warm-up tasks in the following days. May I get in touch with you on Slack to discuss proposal ideas?

Projects I worked on include:

  • Implementation of the reinforcement learning framework OpticalRLGym at Nokia Bell Labs Paris-Saclay (Python, Java, JSON-RPC, Apache Kafka, Bash)
  • Creation of a Telegram chatbot for playing a card game (Python)
  • Student NLP thesis on analyzing movie scripts to create content-based movie summaries (Python, Beautiful Soup, NumPy, Pandas)

This is my GitHub, and my LinkedIn.

CV

Hi Deborah,

It sounds like you have a good background for this project, as you have worked with semantic technologies before.

We expect the students to write their own proposals. You can start by looking at the following proposals, which were accepted in previous years.

Example1
Example2

You can start a Google Doc and begin by proposing some solutions for the project. As you share your ideas, we will give you feedback and help you along the way.

Does that make sense to you?

Hi Arinjay,

It sounds like you have a good background for this project, as you have worked with semantic technologies before. However, since I have seen that you sent your CV to a couple of projects, it might be better for you to focus on specific ones, because we expect the students to write their own proposals. You can apply to more than one project, but make sure you have enough time for all of them.

You can start by looking at the following proposals, which were accepted in previous years.

Example1
Example2

You can start a Google Doc and begin by proposing some solutions for the project. As you share your ideas, we will give you feedback and help you along the way.

Does that make sense to you?


Sounds perfectly sensible. I’m on it 🙂

Sure, I will keep you updated with my regular progress.