Archivo Ontolysense - GSoC 2023

Description

DBpedia Archivo is one of the largest and most up-to-date ontology archives (check out the paper).

The idea of this proposal is to add an augmentation layer to DBpedia Archivo with a set of functionalities that help users understand the sense and (adequate) usage of an ontology and its parts (classes/properties).

Similar to advanced code completion tools, which suggest common procedure calls based on knowledge learned from public code repositories, these functionalities shall be based on how ontologies are actually used in Linked Open Data.

There are multiple ways to use, showcase, and apply the learned data:

  • as part of a REST API, e.g. to serve as a more sophisticated replacement for the autocompletion in the popular YASGUI
  • to generate statistically relevant examples of the usage of a class/property and show them on the DBpedia Archivo frontend
  • to generate SHACL tests that allow verifying the adequacy of a dataset.

The topic and its realization can be shaped according to the interests and skill set of the GSoC applicant (e.g. full stack with a frontend prototype vs. a stronger focus on the backend or on machine learning).

At its core, the idea is to manage the learned data using the Databus Mods Architecture, which allows augmenting Archivo without actually integrating code into the Archivo backend itself.

Goal

  • determine the actual usage of ontologies (mainly classes and properties) in the real LOD cloud
  • the “rdfs:range” of a property, e.g. the expected literal datatype or object class, or even a regex-like structure of the literal value (e.g. a zip code [\d\d\d\d\d] or a telephone number); a SPARQL sketch for mining this kind of information is given after this list

  • common properties (and additional values) for a class (e.g. based on type-property co-occurrences)

  • provide said information for each ontology (version?) in Linked Data format on the Databus by using the Mods architecture

  • Do something with this data as a proof of concept:

  • add a new REST API to the Archivo webpage that allows easy retrieval of this information for autocompletion in common formats, e.g. https://archivo.dbpedia.org/ontolysense?uri={class/prop-uri}&type={usage, range} (just an idea, the design of this API is also part of the task)

  • generate SHACL tests with the expected datatypes for the validation of datasets

  • provide a service similar to https://www.programcreek.com/, where ontology users can browse usage examples of certain classes/properties with provenance

  • The language to use depends on the part of the project and the way the student wants to tackle the issue:
  • for the data mining part the language is free to choose, although we recommend a language with proper RDF libraries (like Jena in JVM languages)

  • for the Databus Mod we recommend using a JVM-based language (Java, Kotlin, Scala) since there are already easy-to-use libraries implemented

  • since the Archivo web service is written in Python + Flask, the easiest would be to use that, but again the student is free to choose, since it can also be a standalone service
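
As a rough sketch for the range-related goal above (the property dbo:birthDate is only an assumed example, and the query is one possible mining strategy rather than a prescribed design), the actual usage of a single property in a loaded LOD corpus could be inspected like this:

# Which literal datatypes appear as objects of the property, and which
# classes do object resources belong to? (a hint towards the effective rdfs:range)
SELECT ?datatype ?objectClass (COUNT(*) AS ?occurrences)
WHERE {
  ?s <http://dbpedia.org/ontology/birthDate> ?o .
  # DATATYPE() raises an error for IRIs, which simply leaves ?datatype unbound
  BIND(DATATYPE(?o) AS ?datatype)
  # for object property usage, collect the types of the object resources
  OPTIONAL { ?o a ?objectClass }
}
GROUP BY ?datatype ?objectClass
ORDER BY DESC(?occurrences)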

Impact

Ontology developers get a picture of the usage of their ontology and can evaluate user consensus to identify clarity issues or the popularity/importance of elements for further development.

Ontology users (knowledge engineers) can be assisted with meaningful examples and statistics on how an ontology is used.

Data processing tools can use the statistics to measure interoperability indicators for datasets compared to other datasets in the LOD cloud.

Warm-up tasks

easy

  • Download the latest Archivo ontologies (as N-Triples files) and load them into a Virtuoso instance (see Archivo - Ontology Access for instructions). Write a SPARQL query to count all classes that are defined in the downloaded ontologies. Write a grep or awk command to count all classes directly on the downloaded files and check if there is a difference.

medium

  • deploy the Mods Architecture and write a new Mod by extending the VoID Mod to generate statistics in RDF that give more information on the used datatypes for datatype properties (a sketch in human-readable form to illustrate the aim is given below):
20 rdfs:label relations to literals of type "langstring" with language "en"
33 dbo:birthdate relations to literals of type "xsd:date" 
2 dbo:birthdate relations to literals of type "xsd:string" 
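
One way such numbers could be computed from the loaded data is sketched below (a minimal, illustrative query; the actual Mod would additionally serialize these statistics as RDF):

SELECT ?property ?datatype ?lang (COUNT(*) AS ?relations)
WHERE {
  ?s ?property ?o .
  FILTER(isLiteral(?o))           # only literal-valued (datatype property) usage
  BIND(DATATYPE(?o) AS ?datatype)
  BIND(LANG(?o) AS ?lang)         # empty string if the literal has no language tag
}
GROUP BY ?property ?datatype ?lang
ORDER BY DESC(?relations)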

Mentors

Johannes Frey

Denis Streitmatter

Project size

350h

Keywords

Data Mining/Engineering, (OWL) Ontologies, RDF, Linked Data

Tools & Languages

RDF(S), OWL, SPARQL, Docker Compose
Java & Apache Jena in Mods Architecture
Python and Javascript in Archivo

Hi mentors,

I downloaded some ontologies as OWL data, and I can analyze this format directly with Python. I'm also trying to do the warm-up task. I have set up Virtuoso, but to count the classes I need to write SPARQL, which I am not familiar with. Could you give me some key SPARQL commands or a good SPARQL tutorial? That would be very helpful. Thanks in advance! :slight_smile:

Best regards,
Yuqicheng Zhu

With regard to DBpedia there is Running Basic SPARQL Queries Against DBpedia | by Daniel Heward-Mills | OpenLink Virtuoso Weblog | Medium, and when it comes to more advanced features this one could be helpful: Wikidata:SPARQL tutorial - Wikidata.
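
As a purely illustrative starting point (not the solution to the warm-up task), a first query against the public DBpedia endpoint at https://dbpedia.org/sparql could look like this:

# list the 10 classes most frequently used as instance types
SELECT ?class (COUNT(?s) AS ?instances)
WHERE {
  ?s a ?class .
}
GROUP BY ?class
ORDER BY DESC(?instances)
LIMIT 10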

Hi mentors,

From my point of view, this project can be considered as the implementation of code autocompletion/recommendation based on the usage of classes & properties of ontologies.

We could probably achieve this goal step by step:

  1. understand the usage of classes & properties of ontologies using simple statistical methods (like histograms)
  2. implement the autocompletion feature based on these statistics (e.g. rank the corresponding properties by probability; a rough sketch is given after this list)
  3. many machine-learning-based methods are also available (e.g. some papers on code completion algorithms using machine learning), so the next step could be implementing one of these methods in our case
  4. compare these two approaches and analyze the pros and cons
  5. deploy the optimal solution in the Archivo web service
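
To make steps 1 and 2 a bit more concrete, a rough sketch (purely illustrative; dbo:Person is just an assumed example class, not a fixed design choice) of ranking the properties that co-occur with a class could be:

# rank the properties used on instances of the example class by frequency
SELECT ?property (COUNT(*) AS ?usage)
WHERE {
  ?s a <http://dbpedia.org/ontology/Person> ;
     ?property ?o .
}
GROUP BY ?property
ORDER BY DESC(?usage)
LIMIT 20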

This is just a brief idea based on my current knowledge; please correct me if I misunderstand some concepts, and feel free to add to or modify the idea! Thanks in advance!

Best regards,
Yuqicheng Zhu

Hi,

sorry for the late reply :slight_smile:
Your plan sounds good and the comparison of a simple approach vs. a complex one is always nice.

With regard to the plan I have the following comments:

  • Step one involves analyzing RDF data to calculate statistics. We already did that in a very simple way for class and property counts (on the lod-a-lot data dump), but this needs to be extended (e.g. to domains / ranges of the properties)
  • Also part of this task is persisting/presenting the analysis results so that users (Archivo/ontology users, knowledge engineers) can understand them more easily.

Also I’d like to add a really nice playlist about the Semantic Web and Linked Data technologies to better understand the importance and usage of ontologies (jump to chapter 3.0 for that).

Please hit me (or jfrey) up once you have finished the warm-up tasks or if you have any problems :slight_smile:

Best Regards,
Denis Streitmatter

Hi Mentors,
I am pursuing my undergrad in CS and don't have any prior experience with ontologies and RDF. I felt like I would learn something new, so I decided to do some warm-up tasks.

Firstly, I downloaded the N-Triples data from the Databus collection (latest ontologies as nt). It was taking a lot of time and the size of the collection was also not specified, so I decided to suspend the task after downloading 5 GB of nt files; I also saw that the data would be downloaded automatically if I followed the docker-compose guide to set up the Virtuoso instance.

After setting up my Virtuoso instance, I ran the following query in the SPARQL Query Editor to find the count of classes.

SELECT (COUNT(DISTINCT ?class) as ?count)
WHERE {
    ?s a ?class
}

//Unfortunately I am unable to embed images as I am a new member

I tried to double-check the result with grep. I went to the data folder in the cloned Virtuoso repo where all the nt files were located and ran the following grep command:

grep -r -o "rdf:type" | wc -l

But from the grep result I got 23, which is more than the SPARQL result of 19:

data git:(master) ✗ grep -r -o "rdf:type" | wc -l
23

:melting_face: I don’t know what’s wrong with the query.

Also, I am not sure how to import local nt files into the same Virtuoso instance and query only that particular nt file rather than the whole Databus collection.
I'm sorry if it's a newbie question or if I did anything incorrectly.

Best Regards,
Aviral

Hey @codecavi, thanks for your feedback. Indeed, when we designed the warm-up task the latest ontologies were smaller. You could have another run from scratch (so delete all the files and the database) with this smaller example: https://databus.dbpedia.org/denis/collections/latest_ontologies_as_nt_sample/

Please also note that your grep does something different from your SPARQL query (the DISTINCT part).

If you look at the results of your query (i.e. the ?class values themselves, not just the count), you will find a hint that you did not count the classes defined in the ontologies but the classes that are used as instance types. These are indeed classes, but mostly only classes from the OWL and RDF(S) vocabularies themselves and not the actual classes that are defined in the ontologies. Maybe have a quick look at how you define a class in OWL and RDFS. I will update the task description to be clearer about that (thanks for the hint).
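
To make the difference concrete, a minimal sketch (not necessarily the complete answer to the warm-up task) of counting resources that are explicitly declared as classes could look like this:

SELECT (COUNT(DISTINCT ?class) AS ?count)
WHERE {
  # resources declared as classes, rather than merely used as instance types
  { ?class a <http://www.w3.org/2002/07/owl#Class> }
  UNION
  { ?class a <http://www.w3.org/2000/01/rdf-schema#Class> }
}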

If you want to find out the size of what you would like to download beforehand (you are right, there is a bug in the Databus UI), there is also a trick / tiny warm-up challenge: if you look at the query behind the collection, you will see a bytesize property. You could open the query in YASGUI and try to modify it so that it sums up all the sizes :slight_smile:
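
A hedged sketch of the summing part (assuming the size is attached via dcat:byteSize; the exact property name and the surrounding graph patterns have to be taken from the actual collection query):

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT (SUM(xsd:decimal(?size)) AS ?totalBytes)
WHERE {
  # ... the original graph patterns of the collection query go here,
  # selecting exactly the files that belong to the collection
  ?distribution dcat:byteSize ?size .
}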