DBpedia ontology and public SPARQL endpoint

karlisc · August 16, 2020, 8:52pm

Hi there!

Could anybody explain what the intended relation is between the DBpedia ontology and the actual content of the public DBpedia SPARQL endpoint?

A first naive comparison of the two yields so many differences that it is hard to tell that they describe the same data.

Would I be correct in guessing that the classes that are in the ontology but are not the rdf:type of any instance in the public endpoint (e.g. dbo:Algorithm, dbo:Bishop, dbo:FileSystem, and many others) are used somewhere in DBpedia data that is not included in the current version of the public endpoint?
Or is there some other explanation?

In the other direction, it would be easier to accept that the SPARQL endpoint is more technical and includes a much wider scope of classes and properties.

My intention is to produce a visual query environment over DBpedia based on the ViziQuer tool. I could make some ad hoc choices regarding the supported classes and properties; however, I would prefer to understand what is going on.

Thanks in advance for your comments!

Kārlis

Kārlis Čerāns
Leading Researcher, Institute of Mathematics and Computer Science, University of Latvia

kurzum · August 18, 2020, 9:32am

Dear @karlisc,
DBpedia has become much more dynamic recently. We are rectifying documentation. You could read about the current release process here:

OpenLink Software (@pkleef ) posted a demo endpoint loading latest-core some days ago:

https://dbpedia.demo.openlinksw.com/sparql/

https://dbpedia.demo.openlinksw.com/resource/Leipzig

https://dbpedia.demo.openlinksw.com/fct/

This latest-core also contains the latest version of the ontology, which is also used in the extraction. Note that we are working on persisting all ontologies on the Databus: http://archivo.dbpedia.org . Archivo is like LOV + web archive: it persists and versions all ontologies every 8 hours.

Regarding the visualisation you are planning: a good practice is to make it part of the DBpedia Stack: https://wiki.dbpedia.org/tutorials/2nd-dbpedia-stack-tutorial . This basically means that you need to create a Databus-compatible Docker image, which we add to https://hub.docker.com/u/dbpedia . This way, the visualisation can be easily deployed by the chapters and by anybody downloading DBpedia. See also: https://github.com/dbpedia/Dockerized-DBpedia

cheers.

karlisc · August 18, 2020, 10:29am

Thanks, @kurzum, for the explanation and the suggestions regarding the visualization.
If I may ask a simple question: if I pose a simple query, e.g. to the OpenLink SPARQL endpoint:
select distinct ?x where {?x a owl:Class.
FILTER NOT EXISTS {[] a ?x}}
LIMIT 100
I get a lot of results, including e.g. Department, Type, Algorithm, etc.
Where are the instances of these classes? Are they available in any public endpoint? How can I get them for loading into my server if they are not in latest-core?
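
The logic of the query above can be tried offline on a few toy triples. The following is only a minimal Python sketch: the triples are made-up stand-ins for endpoint content (only the class names come from this thread), and `classes_without_instances` is a hypothetical helper name, not part of any DBpedia tooling.

```python
# Toy triples standing in for an endpoint's content (hypothetical data).
RDF_TYPE = "rdf:type"
OWL_CLASS = "owl:Class"

triples = [
    ("dbo:Actor", RDF_TYPE, OWL_CLASS),
    ("dbo:Algorithm", RDF_TYPE, OWL_CLASS),
    ("dbo:City", RDF_TYPE, OWL_CLASS),
    ("dbr:Leipzig", RDF_TYPE, "dbo:City"),
    ("dbr:Some_Actor", RDF_TYPE, "dbo:Actor"),
]


def classes_without_instances(triples):
    """Mirror of: ?x a owl:Class . FILTER NOT EXISTS { [] a ?x }"""
    # Classes declared as owl:Class.
    declared = {s for s, p, o in triples if p == RDF_TYPE and o == OWL_CLASS}
    # Everything that appears as the object of some rdf:type triple.
    used_as_type = {o for s, p, o in triples if p == RDF_TYPE}
    return sorted(declared - used_as_type)


empty_classes = classes_without_instances(triples)
print(empty_classes)
```

In this toy data only dbo:Algorithm is declared without having any instance, which is the situation the query reports for the real endpoint.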

Related to this:
how large would the data be that covers all classes that are instances of owl:Class within latest-core, and how can it be obtained?
What are the ways of obtaining subsets of data that go beyond latest-core?

Thanks again!
Kārlis

kurzum · August 18, 2020, 11:17am

@karlisc these classes are for non-English data: http://mappings.dbpedia.org/index.php/Mapping_ru:Алгоритм

Latest-core contains English (en) data only, together with several ontologies and enrichments, which are documented here: https://databus.dbpedia.org/dbpedia/collections/latest-core

We have a fused version here:

This is a merge of 140 languages and Wikidata, i.e. all types. Other than that, you can take this for individual languages (Wikidata is treated as a language version in DBpedia): https://databus.dbpedia.org/dbpedia/mappings/instance-types/ or this: https://databus.dbpedia.org/dbpedia/wikidata/instance-types/

@karlisc: I changed the title from DBPedia -> DBpedia

karlisc · August 18, 2020, 2:28pm

Thanks a lot! This clarifies the situation. One more issue:
Why is there a property dbo:rank in the ontology (as a datatype property), while the property in the data is dbp:rank and there are no triples with dbo:rank?
There are a lot of properties like this.
At the same time, there are also some dbo: namespace properties in the data.

karlisc · December 27, 2020, 8:46pm

While trying to use the DBpedia ontology to build a data schema that could guide a visual query building process over DBpedia, one question stands out. The ontology uses the http://dbpedia.org/ontology namespace for properties, whereas the SPARQL endpoint (both the old one and the new one) uses http://dbpedia.org/property. Is this difference by design? Can one assume that a property with the same local name, say xyz, in both namespaces is the same property? Thanks a lot for the clarification!

karlisc · December 27, 2020, 9:27pm

Sorry for making the question sound so bold. Some experiments with the timeout setting gave me more results for the simple query select distinct ?p where {?p rdfs:range ?c. [] ?p []. }
Still, the role of the properties in the dbp: namespace remains unclear to me.
Thanks!

kurzum · December 30, 2020, 3:56pm

@karlisc,
did you have a look at the main DBpedia paper, "DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia"? It is described on pages 6-8.

In short:
raw or generic extraction uses /property/; this covers all artifacts of the generic group:
https://databus.dbpedia.org/dbpedia/generic/ and in particular these:

mappingbased extraction
is everything that uses the mappings at mappings.dbpedia.org, i.e. https://databus.dbpedia.org/dbpedia/mappings/

Most data is extracted twice: via /property/ as is, and via /ontology/ in a mapped and more consistent way.
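
Since /property/ holds the raw extraction and /ontology/ the mapped one, a shared local name is a hint that two properties may be related, not a guarantee that they mean the same thing. A minimal Python sketch for listing such candidate pairs follows; the sample IRIs are illustrative (only rank and birthPlace are mentioned in this thread) and `local_names_in_both` is a hypothetical helper name.

```python
DBO = "http://dbpedia.org/ontology/"
DBP = "http://dbpedia.org/property/"


def local_names_in_both(property_iris):
    """Return local names that occur under both /ontology/ and /property/."""
    dbo_names = {iri[len(DBO):] for iri in property_iris if iri.startswith(DBO)}
    dbp_names = {iri[len(DBP):] for iri in property_iris if iri.startswith(DBP)}
    # A shared local name marks a candidate pair for manual inspection only.
    return sorted(dbo_names & dbp_names)


# Illustrative input; in practice this would be the list of distinct
# predicates obtained from an endpoint or a dump.
sample = [DBO + "rank", DBP + "rank", DBO + "birthPlace", DBP + "caption"]
shared = local_names_in_both(sample)
print(shared)
```

A schema for a visual query tool could then show such pairs side by side rather than merging them.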

karlisc · December 30, 2020, 8:45pm

Thanks, @kurzum, for the explanation!
One more question where your expert advice would be very valuable: are there any plans to also materialize the property domain and range assertions from the ontology in the SPARQL endpoint? If not, would such a materialization make sense (i.e. wouldn't a lot of meaningless class assertions appear)?
We are considering creating an enriched version of the SPARQL endpoint that would also take the domain and range assertions into account; however, we have yet to find a way to do that.
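
One way to picture such a materialization is the pair of standard RDFS entailment rules for rdfs:domain and rdfs:range: every subject of a property gets the domain class, every object gets the range class. The Python sketch below applies them to toy triples. It assumes object properties only (a literal object would need to be skipped), the data is made up, and dbo:Work as the domain of dbo:starring is an assumption for illustration; `materialize_domain_range` is a hypothetical name.

```python
RDF_TYPE = "rdf:type"


def materialize_domain_range(triples, domains, ranges):
    """Return rdf:type triples implied by domain/range that are not yet present."""
    existing = set(triples)
    inferred = set()
    for s, p, o in triples:
        if p in domains:  # rdfs:domain rule: the subject gets the domain class
            inferred.add((s, RDF_TYPE, domains[p]))
        if p in ranges:  # rdfs:range rule: the object gets the range class
            inferred.add((o, RDF_TYPE, ranges[p]))
    return inferred - existing


# Toy data (hypothetical resources).
triples = [
    ("dbr:Film_A", "dbo:starring", "dbr:Person_B"),
    ("dbr:Film_A", RDF_TYPE, "dbo:Work"),
]
domains = {"dbo:starring": "dbo:Work"}   # assumed for illustration
ranges = {"dbo:starring": "dbo:Actor"}

new_types = materialize_domain_range(triples, domains, ranges)
print(new_types)
```

Whether the added assertions are meaningful depends on how well the data respects the declared domains and ranges, which is exactly the open question here.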

kurzum · December 31, 2020, 12:23pm

@karlisc,
I am unsure whether this is necessary. In most cases, the domains and ranges should coincide with the existing classes by design. There might be some errors.

If you have a look here: https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2020.10.01
there is already a disjointness check in place for everything with /ontology/ .
In principle, I would wonder, if materializing the domain/range assertions would actually produce something different than the existing rdf:type statements.

Of course, I am not sure. If you find a good way to check it, please tell me. Maybe formulate a query, i.e. show where rdfs:domain of any property does not have a corresponding rdf:type triple or such…
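
Such a check could look roughly as follows. This is a Python sketch on toy triples rather than a query against the endpoint: it counts, per property, the subjects that lack a direct rdf:type to the declared domain class. Superclass reasoning is deliberately ignored, the resources are made up, and dbo:Person as the domain of dbo:birthPlace is an assumption for illustration; `missing_domain_types` is a hypothetical name.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def missing_domain_types(triples, domains):
    """Count, per property, subjects without a direct rdf:type to the domain class."""
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    missing = Counter()
    for s, p, o in triples:
        if p in domains and (s, domains[p]) not in typed:
            missing[p] += 1
    return dict(missing)


# Toy data (hypothetical resources).
triples = [
    ("dbr:P1", "dbo:birthPlace", "dbr:X"),
    ("dbr:P1", RDF_TYPE, "dbo:Person"),
    ("dbr:P2", "dbo:birthPlace", "dbr:Y"),  # no rdf:type triple for dbr:P2
]
domains = {"dbo:birthPlace": "dbo:Person"}  # assumed for illustration

domain_gaps = missing_domain_types(triples, domains)
print(domain_gaps)
```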

karlisc · December 31, 2020, 3:02pm

Hi, @kurzum!
For the property dbo:starring there are 520024 object values that are not of the class dbo:Actor, although the rdfs:range of dbo:starring is dbo:Actor.
I use the following SPARQL query to determine this:
select (count(?y) as ?cy) ?c where { ?x dbo:starring ?y. dbo:starring rdfs:range ?c. FILTER (NOT EXISTS {?y a ?c}) } group by ?c
By contrast, there are only 5330 instances of the class dbo:Actor altogether, so at most about 1% of all object values of dbo:starring have an rdf:type to the property range class.
For dbo:team there are 2627446 object values that are not of the property range class dbo:SportsTeam.
For dbo:country there are 25410 objects with a missing rdf:type to the property range class out of 790847. The figures are 225028 out of 1349054 for dbo:birthPlace and 44490 out of 433470 for dbo:occupation. The dbo:careerStation property is clean in this respect: all property targets have an rdf:type to their property range class.
To avoid time-outs, I ran the following SPARQL query to get the list of properties with ranges specified and their object counts, and then checked the counts with a missing rdf:type to the range class for each property separately:

> select ?p ?c ?cy where {
> {select ?p (count(?y) as ?cy) where { ?x ?p ?y. } group by ?p order by ?p}
> {select ?pp ?c where {?pp rdfs:range ?c} order by ?pp}
> FILTER (?p = ?pp) }
> order by desc(?cy)
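
The two-step procedure (list the properties with a range and their object counts, then check each property separately) can be sketched in Python on toy triples. The data is made up, dbo:Country as the range of dbo:country is an assumption for illustration, and `range_report` is a hypothetical name; like the queries above, it counts triples and looks at direct rdf:type only.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def range_report(triples, ranges):
    """Per property: (property, range class, objects missing the type, total)."""
    typed = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    # Step 1: object counts for properties with a declared range.
    totals = Counter(p for s, p, o in triples if p in ranges)
    report = []
    # Step 2: check each property separately, largest first.
    for prop, total in totals.most_common():
        missing = sum(
            1 for s, p, o in triples
            if p == prop and (o, ranges[prop]) not in typed
        )
        report.append((prop, ranges[prop], missing, total))
    return report


# Toy data (hypothetical resources).
triples = [
    ("dbr:Film_A", "dbo:starring", "dbr:P1"),
    ("dbr:Film_A", "dbo:starring", "dbr:P2"),
    ("dbr:Film_B", "dbo:starring", "dbr:P3"),
    ("dbr:P1", RDF_TYPE, "dbo:Actor"),
    ("dbr:Film_A", "dbo:country", "dbr:C1"),
    ("dbr:C1", RDF_TYPE, "dbo:Country"),
]
ranges = {"dbo:starring": "dbo:Actor", "dbo:country": "dbo:Country"}

report = range_report(triples, ranges)
for row in report:
    print(row)
```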

Regarding property domains, the situation is somewhat better, with only 466 subjects missing an rdf:type to the property domain class for dbo:birthPlace, 435 for dbo:birthDate and 10 for dbo:starring. Yet dbo:years and dbo:currentMember have all subjects missing an rdf:type to the property domain class (an error in the mappings?), and dbo:managerClub has 83669 out of 83680 subjects missing an rdf:type to the property domain class.
Can this be cleaned up or fixed on the DBpedia data side?
Thanks and Happy New Year!
Kārlis