Archivo Ontolysense - GSoC 2022

Description

DBpedia Archivo is one of the largest and most up-to-date ontology archives (check out the paper).

The idea of this proposal is to add an augmentation layer to DBpedia Archivo with a set of functionalities that help users understand the sense and (adequate) usage of an ontology and its parts (classes/properties).

Similar to advanced code completion tools that suggest very common procedure calls based on knowledge learned from public code repositories, these functionalities shall be based on how ontologies are actually used in Linked Open Data.

There are multiple ways to make use of, showcase, and apply the learned data:

  • as part of a REST API, e.g. to serve as a more sophisticated replacement for the autocompletion in the popular YASGUI
  • to generate, on the DBpedia Archivo Frontend, statistically relevant examples of the usage of a class/property
  • to generate SHACL tests that allow verifying the adequacy of a dataset.

The topic and realization can be shaped based on the interests and skill set of the GSoC applicant (e.g. full stack with a frontend prototype vs. more focus on the backend, or a focus on machine learning).

At its core, the idea is to manage the learned data using the Databus Mods Architecture, which allows augmenting Archivo without actually performing code integration in the Archivo backend itself.

Goal

  • determine the actual usage of ontologies (mainly classes and properties) in the real LOD cloud:
      • the de-facto "rdfs:range" of a property, e.g. the expected literal datatype or object class, or even a regex-like structure of the literal value (e.g. zip code \d{5}, telephone number)
      • common properties (and additional values) for a class (e.g. based on type-property co-occurrences)

  • provide said information for each ontology (version?) in Linked Data format on the Databus by using the Mods architecture

  • do something with this data as a proof of concept:
      • add a new REST API to the Archivo webpage that allows easy retrieval for autocompletion in common formats, e.g. https://archivo.dbpedia.org/ontolysense?uri={class/prop-uri}&type={usage, range} (just an idea, the design of this API is also part of the task; see the sketch after this list)
      • generate SHACL tests with the expected datatypes for the validation of datasets
      • provide a service similar to https://www.programcreek.com/, where ontology users can browse usage examples of certain classes/properties with provenance

  • the language to use depends on the part of the project and the way the student wants to tackle the issue:
      • for the data mining part the language is free to choose, although we recommend a language with proper RDF libraries (like Jena in JVM languages)
      • for the Databus Mod we recommend using a JVM-based language (Java, Kotlin, Scala) since there are already easy-to-use libraries implemented
      • since the Archivo web service is written in Python + Flask, that would be the easiest choice, but again the student is free to choose since it can also be a standalone service
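
To illustrate the REST API idea from the proof-of-concept list above, here is a minimal sketch in Python + Flask (matching the Archivo stack). The endpoint path follows the example URL above; the statistics file, its JSON layout, and all numbers are assumptions for illustration, not existing Archivo code — the actual API design is part of the task.

# Hypothetical sketch of an Ontolysense autocompletion endpoint.
# The statistics file and its schema are assumptions, e.g. produced by a Databus Mod:
# { "<property-uri>": { "usage": 1234, "ranges": { "xsd:date": 33, "xsd:string": 2 } } }
import json
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("ontolysense-stats.json") as f:
    STATS = json.load(f)

@app.route("/ontolysense")
def ontolysense():
    uri = request.args.get("uri")
    info_type = request.args.get("type", "usage")  # "usage" or "range"
    entry = STATS.get(uri)
    if entry is None:
        return jsonify({"error": "unknown class/property"}), 404
    if info_type == "range":
        return jsonify({"uri": uri, "ranges": entry.get("ranges", {})})
    return jsonify({"uri": uri, "usage": entry.get("usage", 0)})

if __name__ == "__main__":
    app.run(port=5001)

A client (e.g. a YASGUI autocompletion plugin) could then call /ontolysense?uri=http://dbpedia.org/ontology/birthDate&type=range to retrieve the observed datatypes.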

Impact

Ontology developers get a picture of how their ontology is used and can evaluate user consensus to identify clarity issues or the popularity / importance of elements for further development.

Ontology users (knowledge engineers) can be assisted with meaningful examples and statistics on how an ontology is used.

Data processing tools can use the statistics to measure interoperability indicators for datasets compared to other datasets in the LOD cloud.

Warm-up tasks

easy

  • Download the latest Archivo ontologies (as N-Triples files) and load them into a Virtuoso instance (see Archivo - Ontology Access for instructions). Write a SPARQL query to count all classes. Write a grep or awk command to count all classes directly on the downloaded files and check if there is a difference.
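
For orientation, a possible way to obtain both counts, sketched in Python (the Virtuoso endpoint URL, the reading of "class" as owl:Class instances, and the file pattern are assumptions):

# Sketch: count owl:Class resources via SPARQL against a local Virtuoso
# and compare with a plain-text count over the downloaded N-Triples files.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery("""
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT (COUNT(DISTINCT ?cls) AS ?classes)
    WHERE { ?cls a owl:Class . }
""")
sparql.setReturnFormat(JSON)
result = sparql.query().convert()
print("SPARQL count:", result["results"]["bindings"][0]["classes"]["value"])

# Rough file-based equivalent (results may differ, e.g. due to duplicates or
# classes only declared as rdfs:Class):
#   grep -h "<http://www.w3.org/2002/07/owl#Class>" *.nt | awk '{print $1}' | sort -u | wc -l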

medium

  • deploy the Mods Architecture and write a new Mod by extending the VoID Mod to generate statistics in RDF that give more information on the used datatypes for datatype properties (a sketch in human-readable form to illustrate the aim is given below):
20 rdfs:label relations to literals of type "langstring" with language "en"
33 dbo:birthdate relations to literals of type "xsd:date" 
2 dbo:birthdate relations to literals of type "xsd:string" 
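
One possible starting point for such a Mod, sketched as a SPARQL aggregation run from Python (the endpoint URL is an assumption; a real Mod would additionally serialize the result as RDF as required by the Mods Architecture):

# Sketch: per-property datatype/language statistics for literal objects,
# roughly matching the human-readable sketch above.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?p ?dt ?lang (COUNT(*) AS ?count)
WHERE {
  ?s ?p ?o .
  FILTER(isLiteral(?o))
  BIND(DATATYPE(?o) AS ?dt)
  BIND(LANG(?o) AS ?lang)
}
GROUP BY ?p ?dt ?lang
ORDER BY DESC(?count)
"""

sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["count"]["value"], row["p"]["value"],
          row.get("dt", {}).get("value", ""), row.get("lang", {}).get("value", ""))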

Mentors

Johannes Frey

Denis Streitmatter

Project size

350h

Keywords

Data Mining/Engineering, (OWL) Ontologies, RDF, Linked Data

Tools & Languages

RDF(S), OWL, SPARQL, Docker Compose
Java & Apache Jena in Mods Architecture
Python and JavaScript in Archivo

Hello everyone!

My name is Yuqicheng Zhu and I am an Electrical Engineering student at the Technical University of Munich, Germany, and an incoming Ph.D. student in Knowledge Graphs at the Bosch Center for Artificial Intelligence & University of Stuttgart (Prof. Steffen Staab). I am fully motivated to contribute to DBpedia as a GSoC '22 student.

I have been working as an AI engineer at Bosch since 2018, hold two German patents related to AI, and have carried out numerous real-world AI projects. You can see more details on my LinkedIn profile [https://www.linkedin.com/in/yuqicheng-zhu-531658161/] or on GitHub [https://github.com/ZhuYuqicheng].

Looking forward to discussing the project idea with you!

Best regards,
Yuqicheng Zhu

Hi mentors,

I downloaded some ontologies as OWL data, and I can analyze this format directly with Python. I'm also working on the warm-up task: I have set up Virtuoso, but to count the classes I need to write SPARQL, which I am not familiar with. Could you give me some key SPARQL commands or a good SPARQL tutorial? That would be very helpful. Thanks in advance! :slight_smile:

Best regards,
Yuqicheng Zhu

With regard to DBpedia there is "Running Basic SPARQL Queries Against DBpedia" by Daniel Heward-Mills (OpenLink Virtuoso Weblog, Medium), and when it comes to more advanced features the "Wikidata:SPARQL tutorial" on Wikidata could be helpful.

Thanks for the info!

Hi mentors,

From my point of view, this project can be considered as the implementation of code autocompletion/recommendation based on the usage of classes & properties of ontologies.

We could probably achieve this goal step by step:

  1. understand the usage of classes & properties of ontologies using simple statistical methods (like histograms)
  2. implement the autocompletion feature based on these statistics (e.g. rank the candidate properties by their probability; see the sketch after this list)
  3. many machine-learning-based methods are also available (e.g. papers on code completion algorithms using machine learning), so the next step could be the implementation of one of these methods in our case
  4. compare these two approaches and analyze the pros and cons
  5. deploy the better solution in the Archivo web service
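
As a concrete illustration of step 2, a minimal sketch of how ranked suggestions could be derived from class-property co-occurrence counts (the class/property names and counts are invented for illustration; in practice they would come from the mined LOD statistics):

# Sketch: rank candidate properties for a class by relative co-occurrence frequency.
from collections import Counter

cooccurrence = {
    "dbo:Person": Counter({
        "rdfs:label": 950,
        "dbo:birthDate": 620,
        "dbo:birthPlace": 580,
        "foaf:depiction": 210,
    }),
}

def suggest_properties(cls, top_k=3):
    counts = cooccurrence.get(cls, Counter())
    total = sum(counts.values())
    # Return (property, estimated probability) pairs, most frequent first.
    return [(prop, count / total) for prop, count in counts.most_common(top_k)]

print(suggest_properties("dbo:Person"))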

This is just a brief idea based on my current knowledge; please correct me if I have misunderstood some concepts, and feel free to add to or modify the idea. Thanks in advance!

Best regards,
Yuqicheng Zhu

Hi,

sorry for the late reply :slight_smile:
Your plan sounds good and the comparison of a simple approach vs. a complex one is always nice.

With regard to the plan I have the following comments:

  • Step one involves analyzing RDF data to calculate statistics. We already did that in a very simple way for class and property counts (on the LOD-a-lot data dump), but this needs to be extended (e.g. to domains / ranges of the properties).
  • Also part of this task is persisting/presenting the analysis results so that users (Archivo/ontology users, knowledge engineers) can understand them more easily.
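
One way the existing counts could be extended toward observed ranges, sketched as a SPARQL aggregation run from Python (the endpoint URL is an assumption; the analogous query with ?s a ?cls would yield observed domains):

# Sketch: count, per property, the classes of its object values ("observed ranges").
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?p ?rangeClass (COUNT(*) AS ?count)
WHERE {
  ?s ?p ?o .
  ?o a ?rangeClass .
}
GROUP BY ?p ?rangeClass
ORDER BY DESC(?count)
LIMIT 100
"""

sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["count"]["value"], row["p"]["value"], row["rangeClass"]["value"])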

Also I’d like to add a really nice playlist about the Semantic Web and Linked Data technologies to better understand the importance and usage of ontologies (jump to chapter 3.0 for that).

Please hit me (or jfrey) up once you have finished the warm-up tasks or if you have any problems :slight_smile:

Best Regards,
Denis Streitmatter

Please also note that the application period for GSoC students ends in 6 days.