A Tool to Assess the FAIRness of DBpedia (using the FAIR Principles) - GSoC 2020

Description:
The FAIR Data Principles are a set of guiding principles to make data findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). FAIR data enables both humans and machines to find and reuse data. There are several criteria that define the FAIRness of data [1] (a sketch of one possible automated check follows the list):

TO BE FINDABLE:

F1. (meta)data are assigned a globally unique and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.

TO BE ACCESSIBLE:

A1 (meta)data are retrievable by their identifier using a standardized communications protocol.
A1.1 the protocol is open, free, and universally implementable.
A1.2 the protocol allows for an authentication and authorization procedure, where necessary.
A2 metadata are accessible, even when the data are no longer available.

TO BE INTEROPERABLE:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.

TO BE RE-USABLE:

R1. meta(data) have a plurality of accurate and relevant attributes.
R1.1. (meta)data are released with a clear and accessible data usage license.
R1.2. (meta)data are associated with their provenance.
R1.3. (meta)data meet domain-relevant community standards.
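
As an illustration of how some of these criteria could be checked automatically, here is a minimal sketch in Java for F1/A1: it tests whether an identifier is an absolute HTTP(S) URI and whether it currently resolves. The class and method names are illustrative, not part of an existing tool, and true "eternal persistence" cannot be verified automatically; resolvability is only a necessary condition.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

/** Illustrative check for principle F1/A1: is the (meta)data identifier a
 *  globally unique, resolvable HTTP(S) URI? Names are hypothetical. */
public class F1IdentifierCheck {

    /** True if the identifier is an absolute HTTP(S) URI. */
    public static boolean isGloballyUnique(String identifier) {
        try {
            URI uri = URI.create(identifier);
            return uri.isAbsolute()
                && ("http".equals(uri.getScheme()) || "https".equals(uri.getScheme()));
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** True if the identifier currently resolves (HTTP status < 400).
     *  Persistence itself cannot be tested; this is a necessary condition only. */
    public static boolean resolves(String identifier) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(identifier).openConnection();
            conn.setRequestMethod("HEAD");
            conn.setInstanceFollowRedirects(true);
            return conn.getResponseCode() < 400;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String id = "http://dbpedia.org/resource/Berlin";
        System.out.println("F1 unique:     " + isGloballyUnique(id));
        System.out.println("A1 resolvable: " + resolves(id));
    }
}
```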

Goals:

  • create and implement FAIR data metrics in Java
  • assess and produce dataset metadata using standard vocabularies (see the sketch below)
  • integrate the tool with RDFUnit (https://github.com/AKSW/RDFUnit)
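
As an illustration of the second goal, here is a minimal sketch of producing dataset metadata with standard vocabularies (DCAT and Dublin Core Terms) using Apache Jena; all URIs below are example placeholders, not actual Databus identifiers:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

/** Sketch: describing a dataset with DCAT + Dublin Core via Apache Jena.
 *  The dataset URI is a HYPOTHETICAL placeholder. */
public class MetadataSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("dct", DCTerms.getURI());
        model.setNsPrefix("dcat", "http://www.w3.org/ns/dcat#");

        Resource dataset = model.createResource(
                "https://databus.dbpedia.org/example/dataset"); // hypothetical URI
        dataset.addProperty(RDF.type,
                model.createResource("http://www.w3.org/ns/dcat#Dataset"));
        dataset.addProperty(DCTerms.title, "Example DBpedia dataset");  // F2: rich metadata
        dataset.addProperty(DCTerms.identifier, dataset.getURI());      // F1/F4
        dataset.addProperty(DCTerms.license, model.createResource(
                "http://creativecommons.org/licenses/by-sa/4.0/"));     // R1.1

        model.write(System.out, "TURTLE");
    }
}
```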

Impact:

  • the project will increase the FAIRness of DBpedia

Warm-up tasks:

[1] https://www.force11.org/group/fairgroup/fairprinciples


Some of the criteria are very vague or hard to validate automatically. I have quoted the potentially problematic ones below.

But we are thinking of something similar to 5-star Linked Data in the scope of the DBpedia Databus and DBpedia FlexiFusion.
I think within the context of these technologies we could establish a labeling and metric framework to technically describe the “FAIRness”, or fitness for use, of data.
To give an example with respect to identifiers: we could evaluate how many subjects/objects are loaded into the DBpedia Global ID management, and we could also measure how many distinct entities are mapped to the same Global ID. With respect to interoperability, we could also analyze how many properties are mapped to a standard vocabulary in the DBpedia Mapping management (DBpedia global properties) and how well the data merges in FlexiFusion. A rough sketch of such a check appears below.
So I suggest that we have very clear, automatically evaluable quantitative and qualitative requirements in the scope of the DBpedia Databus “ecosystem”.
Otherwise this will end up in philosophical and community discussions, e.g. whether a license is clear or not, and the scope might reach the dimensions of a PhD thesis itself…
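
A rough sketch of the kind of quantitative identifier check suggested above, assuming the Global ID mappings are queryable via SPARQL; the endpoint URL and the mapping property are hypothetical placeholders, not the actual Global ID vocabulary:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

/** Sketch of a quantitative identifier metric: how many distinct local
 *  entities map to the same Global ID? Endpoint and property are
 *  HYPOTHETICAL placeholders. */
public class GlobalIdRatio {
    public static void main(String[] args) {
        String endpoint = "https://example.org/sparql";  // hypothetical endpoint
        String query =
            "PREFIX ex: <http://example.org/globalid#>\n" +  // hypothetical vocabulary
            "SELECT ?globalId (COUNT(DISTINCT ?local) AS ?n)\n" +
            "WHERE { ?local ex:hasGlobalId ?globalId }\n" +
            "GROUP BY ?globalId HAVING (COUNT(DISTINCT ?local) > 1)";

        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution row = rs.next();
                // Global IDs with more than one mapped local entity
                System.out.println(row.get("globalId") + " <- "
                        + row.getLiteral("n").getInt() + " local entities");
            }
        }
    }
}
```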

TO BE FINDABLE

F1. globally unique eternally persistent identifier.
F2. rich metadata.

TO BE ACCESSIBLE:

A1.1 the protocol is open, free, and universally implementable.
A1.2 the protocol allows for an authentication and authorization procedure, where necessary.
A2 metadata are accessible, even when the data are no longer available.

TO BE INTEROPERABLE:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.

TO BE RE-USABLE:

R1. meta(data) have a plurality of accurate and relevant attributes.
R1.1. (meta)data are released with a clear and accessible data usage license.
R1.3. (meta)data meet domain-relevant community standards.

The project is still good. The highlighted parts are a problem of FAIR itself, i.e. they are so vague that they are not useful for an application. I would say that the Databus would be like a best-practice implementation of FAIR.

Not sure if Java is a strict requirement.

@jfrey, as @kurzum mentioned, the FAIR requirements are defined as they are on the referenced website. Even though they are vague in nature, they also allow flexibility in the implementation. Like quality metrics, you can define your own metrics depending on your environment, but the work I pointed to has some well-defined metrics: https://github.com/FAIRMetrics

On the other hand, I foresee that the FAIR principles can be evaluated at different granularity levels. I agree with @kurzum that the best first approach would be to start with the Databus and DataID in this case, not the Global IDs. And if there is a direct link between a dataset’s DataID and the Global IDs, it would be even easier to produce them for the Global IDs as well. The granularity can be discussed according to DBpedia’s requirements, and if you wish you can suggest more features, @jfrey. But what you proposed w.r.t. the identifiers seems to me more related to LODStats than to FAIR, even though we can assume it makes the data richer; it doesn’t increase the findability of the data, which is the main goal. Interoperability, again, should IMHO be between DBpedia and other datasets in the first place, even though what you proposed might be a relevant requirement for your applications. So this project is not rocket science and doesn’t require a PhD thesis to be written on it, but it could make a good chapter in one if done right!
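
To keep such metrics automatically evaluable at different granularities, one possible shape (purely a hypothetical sketch, not an existing API) is a small metric interface plus per-principle implementations, e.g. a license-presence check for R1.1:

```java
import java.util.List;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;
import org.apache.jena.vocabulary.RDF;

/** HYPOTHETICAL sketch of a pluggable FAIR metric, so that checks at
 *  different granularities (Databus/DataID, Global IDs) share one shape. */
interface FairMetric {
    String principle();              // e.g. "R1.1"
    double assess(Model metadata);   // score in [0, 1]
}

/** Example implementation: fraction of dcat:Dataset resources
 *  carrying a dct:license statement (principle R1.1). */
class LicensePresenceMetric implements FairMetric {
    @Override public String principle() { return "R1.1"; }

    @Override public double assess(Model metadata) {
        Resource dcatDataset =
            metadata.createResource("http://www.w3.org/ns/dcat#Dataset");
        List<Resource> datasets =
            metadata.listSubjectsWithProperty(RDF.type, dcatDataset).toList();
        if (datasets.isEmpty()) return 0.0;
        long licensed = datasets.stream()
            .filter(d -> d.hasProperty(DCTerms.license))
            .count();
        return (double) licensed / datasets.size();
    }
}
```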

@kurzum, as you mentioned, it is not a strict rule that it is written in Java, but I thought it could make things easier when integrating with RDFUnit.

RDFUnit comes with a web service, so it can be called from other languages. Or you run it and take the output.
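
For illustration, a minimal sketch of calling such a web service from plain Java (java.net.http); the endpoint URL and query parameter below are hypothetical placeholders, not RDFUnit’s actual interface, so consult the RDFUnit documentation for the real one:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Sketch of invoking a validation web service over HTTP.
 *  ENDPOINT and its query parameter are HYPOTHETICAL placeholders. */
public class ValidationClient {
    private static final String ENDPOINT =
        "http://localhost:8080/validate?dataset=http://dbpedia.org/resource/Berlin";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
            .header("Accept", "text/turtle")  // assuming an RDF report format
            .GET()
            .build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```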

OK, so @beyza @kurzum, do you suggest that I propose my own GSoC topic, which would be a selection/combination of Contributing to DBpedia Ontology Management and some Databus quality metrics (like the ratio of Global IDs and global properties), with the following warm-up task: https://docs.google.com/document/d/1XB5HECpc_rrtAFdbBkGhbIUL9lWtC1YjZ1akPUvqTcw/edit#

Yes, sure, why not!