Archivo Ontolysense - GSoC2023
Description
DBpedia Archivo is one of the biggest and most recent ontology archives (check out the paper).
The idea of this proposal is to add an augmentation layer to DBpedia Archivo with a set of functionalities that make it easier to understand the sense and the (adequate) usage of an ontology or of its parts (classes/properties).
Just as advanced code completion tools suggest very common procedure calls based on knowledge learned from public code repositories, these functionalities shall be based on how ontologies are actually used in Linked Open Data.
There are multiple ways to use and showcase the learned data:
- as part of a REST API, e.g. to serve as a more sophisticated replacement for the autocompletion in the popular YASGUI
- to generate examples on the DBpedia Archivo frontend that show statistically relevant usage of a class/property
- to generate SHACL tests that allow verifying the adequacy of a dataset.
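To make the autocompletion use case more concrete, here is a minimal sketch of ranking properties by how often they co-occur with a class in instance data. It is an illustration under assumptions, not part of Archivo: the function names `cooccurrence` and `suggest`, the in-memory list of `(subject, predicate, object)` tuples, and the `example.org` IRIs are all hypothetical.

```python
from collections import Counter, defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def cooccurrence(triples):
    """Count, per class, how many typed subjects use each property."""
    types = defaultdict(set)   # subject -> classes it is typed with
    props = defaultdict(set)   # subject -> properties it uses
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            props[s].add(p)
    counts = defaultdict(Counter)
    for s, classes in types.items():
        for cls in classes:
            # Each subject contributes at most once per property.
            counts[cls].update(props.get(s, set()))
    return counts

def suggest(counts, cls, k=3):
    """Return the k properties most frequently used with instances of cls."""
    return [p for p, _ in counts.get(cls, Counter()).most_common(k)]

# Hypothetical example data.
EX = "http://example.org/"
triples = [
    (EX + "a", RDF_TYPE, EX + "Person"),
    (EX + "a", EX + "name", "Alice"),
    (EX + "a", EX + "birthDate", "1990-01-01"),
    (EX + "b", RDF_TYPE, EX + "Person"),
    (EX + "b", EX + "name", "Bob"),
    (EX + "c", RDF_TYPE, EX + "Person"),
    (EX + "c", EX + "name", "Carol"),
    (EX + "c", EX + "birthDate", "1985-05-05"),
    (EX + "c", EX + "knows", EX + "a"),
]
counts = cooccurrence(triples)
top_two = suggest(counts, EX + "Person", k=2)
```

On real LOD dumps the triples would be streamed rather than held in memory; the ranking idea stays the same.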
The topic and realization can be shaped based on the interests and skill sets (e.g. full stack with frontend prototype vs. focus more on backend, or focus on machine learning) of the GSoC applicant.
At its core, the idea is to manage the learned data using the Databus Mods Architecture, which allows augmenting Archivo without actually performing code integration in the Archivo backend itself.
Goal
- determine the actual usage of ontologies (mainly classes and properties) in the real LOD cloud:
  - the "rdfs:range" of a property, e.g. the expected literal datatype or object class, or even a regex-like structure of the literal value (e.g. zip code \d{5}, telephone number)
  - common properties (and additional values) for a class (e.g. based on type-property co-occurrences)
- provide said information for each ontology (version?) in Linked Data format on the Databus by using the Mods architecture
- do something with this data as a proof of concept:
  - add a new REST API to the Archivo webpage that allows easy retrieval for autocompletion in common formats, e.g. https://archivo.dbpedia.org/ontolysense?uri={class/prop-uri}&type={usage, range} (just an idea, the design of this API is also part of the task)
  - generate SHACL tests with the expected datatypes for the validation of datasets
  - provide a service similar to https://www.programcreek.com/, where ontology users can browse usage examples of certain classes/properties with provenance
- the language to use depends on the part of the project and the way the student wants to tackle the issue:
  - for the data mining part the language is free to choose, although we recommend a language with proper RDF libraries (like Jena for JVM languages)
  - for the Databus Mod we recommend using a JVM-based language (Java, Kotlin, Scala), since easy-to-use libraries already exist
  - since the Archivo web service is written in Python + Flask, the easiest option would be to use that, but here again the student is free to choose, since it can also be a standalone service
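As one illustration of the SHACL proof of concept listed above, the following sketch turns observed datatype counts into a SHACL node shape serialized as Turtle. It is a sketch under assumptions, not the project's implementation: the function names `dominant_datatype` and `shacl_shape`, the 0.9 share threshold, the shape IRI, and the input layout (property IRI mapped to a `Counter` of datatype IRIs) are all hypothetical.

```python
from collections import Counter

XSD = "http://www.w3.org/2001/XMLSchema#"

def dominant_datatype(counts, min_share=0.9):
    """Return the most frequent datatype if its share reaches min_share."""
    total = sum(counts.values())
    if total == 0:
        return None
    datatype, n = counts.most_common(1)[0]
    return datatype if n / total >= min_share else None

def shacl_shape(shape_iri, target_class, property_counts, min_share=0.9):
    """Build a Turtle string with one sh:property per clearly typed property."""
    lines = [
        "@prefix sh: <http://www.w3.org/ns/shacl#> .",
        "",
        f"<{shape_iri}> a sh:NodeShape ;",
        f"    sh:targetClass <{target_class}>",
    ]
    for prop in sorted(property_counts):
        datatype = dominant_datatype(property_counts[prop], min_share)
        if datatype is None:
            continue  # usage too mixed to state an expectation
        lines[-1] += " ;"
        lines.append(
            f"    sh:property [ sh:path <{prop}> ; sh:datatype <{datatype}> ]"
        )
    lines[-1] += " ."
    return "\n".join(lines)

# Hypothetical observed usage: 33 xsd:date values and 2 xsd:string values.
observed = {
    "http://dbpedia.org/ontology/birthDate": Counter(
        {XSD + "date": 33, XSD + "string": 2}
    ),
}
ttl = shacl_shape(
    "http://example.org/shapes/PersonShape",
    "http://dbpedia.org/ontology/Person",
    observed,
)
strict_ttl = shacl_shape(
    "http://example.org/shapes/PersonShape",
    "http://dbpedia.org/ontology/Person",
    observed,
    min_share=0.95,
)
```

With the default threshold the shape expects `xsd:date` for the property, so the minority `xsd:string` values would be reported by a SHACL validator. Choosing the threshold is a design decision the project would need to evaluate.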
Impact
Ontology developers get a picture of how their ontology is used and can evaluate user consensus to identify clarity issues or the popularity / importance of elements for further development.
Ontology users (knowledge engineers) can be assisted with meaningful examples and statistics on how an ontology is used.
Data processing tools can use the statistics to measure interoperability indicators for datasets compared to other datasets in the LOD cloud.
Warm-up tasks
easy
- Download the latest Archivo ontologies (as N-Triples files) and load them into a Virtuoso instance (see Archivo - Ontology Access for instructions). Write a SPARQL query to count all classes that are defined in the downloaded ontologies. Write a grep or awk command to count all classes directly on the downloaded files and check if there is a difference.
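For comparison with the grep/awk approach, here is a minimal Python sketch of a line-based class count over N-Triples. It assumes that "class" means a subject typed as `owl:Class` or `rdfs:Class`; the function name `count_classes`, the `include_blank_nodes` switch, and the sample lines are hypothetical. Unlike a plain line count, it deduplicates subjects, which is one reason a file-based count and a SPARQL count can differ.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CLASS_TYPES = {
    "http://www.w3.org/2002/07/owl#Class",
    "http://www.w3.org/2000/01/rdf-schema#Class",
}

# Matches N-Triples lines whose object is an IRI: "<s> <p> <o> ." or "_:b <p> <o> ."
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')

def count_classes(lines, include_blank_nodes=False):
    """Count distinct subjects that are typed as owl:Class or rdfs:Class."""
    classes = set()
    for line in lines:
        match = TRIPLE_RE.match(line)
        if not match:
            continue  # literal objects, comments, blank lines
        subject, predicate, obj = match.groups()
        if predicate != RDF_TYPE or obj not in CLASS_TYPES:
            continue
        if subject.startswith("_:") and not include_blank_nodes:
            continue  # anonymous classes, e.g. OWL restrictions
        classes.add(subject)
    return len(classes)

# Hypothetical sample; real input would be the lines of the downloaded files.
sample = [
    '<http://example.org/A> <' + RDF_TYPE + '> <http://www.w3.org/2002/07/owl#Class> .',
    '<http://example.org/A> <' + RDF_TYPE + '> <http://www.w3.org/2000/01/rdf-schema#Class> .',
    '_:b0 <' + RDF_TYPE + '> <http://www.w3.org/2002/07/owl#Class> .',
    '<http://example.org/p> <' + RDF_TYPE + '> <http://www.w3.org/2002/07/owl#ObjectProperty> .',
    '<http://example.org/A> <http://www.w3.org/2000/01/rdf-schema#label> "A"@en .',
]
named = count_classes(sample)
with_anonymous = count_classes(sample, include_blank_nodes=True)
```

To run it on a file, pass an open file handle, e.g. `count_classes(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded N-Triples files.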
medium
- deploy the Mods Architecture and write a new Mod by extending the VoID Mod to generate statistics in RDF that give more information on the datatypes used with datatype properties (a sketch in human-readable form, to illustrate the aim, is given below):
20 rdfs:label relations to literals of type "rdf:langString" with language "en"
33 dbo:birthDate relations to literals of type "xsd:date"
2 dbo:birthDate relations to literals of type "xsd:string"
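The kind of counting behind the human-readable sketch above could look as follows. This is a minimal illustration in Python working directly on N-Triples lines, not the VoID Mod's actual code (the Mod itself would more likely be written in a JVM language); the function name `datatype_stats` and the sample lines are hypothetical. Following RDF 1.1, plain literals are counted as `xsd:string` and language-tagged literals as `rdf:langString`.

```python
import re
from collections import Counter

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

# Matches N-Triples lines with a literal object.
# Groups: 1 = property IRI, 2 = language tag (optional), 3 = datatype IRI (optional).
LITERAL_RE = re.compile(
    r'^\s*(?:<[^>]*>|_:\S+)\s+<([^>]*)>\s+'
    r'"(?:[^"\\]|\\.)*"'
    r'(?:@([A-Za-z0-9-]+)|\^\^<([^>]*)>)?\s*\.\s*$'
)

def datatype_stats(lines):
    """Count literal objects per (property, datatype, language) combination."""
    stats = Counter()
    for line in lines:
        match = LITERAL_RE.match(line)
        if not match:
            continue  # IRI or blank-node objects, comments, blank lines
        prop, lang, datatype = match.groups()
        if lang is not None:
            datatype = RDF_LANGSTRING
        elif datatype is None:
            datatype = XSD_STRING
        stats[(prop, datatype, lang)] += 1
    return stats

# Hypothetical sample lines.
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BIRTH = "http://dbpedia.org/ontology/birthDate"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"
sample = [
    '<http://example.org/a> <' + LABEL + '> "Alice"@en .',
    '<http://example.org/a> <' + BIRTH + '> "1990-01-01"^^<' + XSD_DATE + '> .',
    '<http://example.org/b> <' + BIRTH + '> "1985-05-05"^^<' + XSD_DATE + '> .',
    '<http://example.org/b> <' + BIRTH + '> "1985" .',
    '<http://example.org/b> <http://example.org/knows> <http://example.org/a> .',
]
stats = datatype_stats(sample)
```

Each key of `stats` corresponds to one line of the human-readable sketch; a Mod would serialize these counts as RDF instead of keeping them in a `Counter`.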
Mentors
Johannes Frey
Denis Streitmatter
Project size
350h
Keywords
Data Mining/Engineering, (OWL) Ontologies, RDF, Linked Data
Tools & Languages
RDF(S), OWL, SPARQL, Docker Compose
Java & Apache Jena in Mods Architecture
Python and JavaScript in Archivo