Data discrepancy between latest-core and the DBpedia SPARQL endpoint?

I’ve been using the Dockerized-DBpedia repo to get the latest data, and I compared it to the official DBpedia SPARQL endpoint, which contains the 2016-10 core release, or more precisely:

2016-10/links
2016-10/core
instance_types_lhd_dbo_en.ttl.bz2
en/instance_types_lhd_ext_en.ttl.bz2

All triples:

SELECT (COUNT(*) AS ?Triples)
FROM <http://dbpedia.org>
WHERE { ?s ?p ?o }

Dbpedia.org/sparql: 438 336 346
Latest-core: 417 381 443

Number of companies:

SELECT (COUNT(DISTINCT ?id) AS ?companies)
WHERE {
    ?id a <http://dbpedia.org/ontology/Company> .
}

dbpedia sparql endpoint: 84 048
Latest-core: 109 629

Number of people:

SELECT (COUNT(DISTINCT ?id) AS ?people)
WHERE {
    ?id a <http://dbpedia.org/ontology/Person> .
}

dbpedia sparql endpoint: 1 818 071
Latest-core: 1 609 995
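
For anyone who wants to repeat this comparison, here is a minimal Python sketch that runs the same class count against two endpoints and reports the relative difference. The local endpoint URL (Virtuoso's usual port 8890) is an assumption about how the Dockerized-DBpedia container is exposed, and the helper names are made up for illustration; adjust both to your setup.

```python
import json
import urllib.parse
import urllib.request

PUBLIC_ENDPOINT = "https://dbpedia.org/sparql"
# Assumption: the local Virtuoso from Dockerized-DBpedia listens here.
LOCAL_ENDPOINT = "http://localhost:8890/sparql"


def count_query(class_iri):
    """Build a standard SPARQL 1.1 query counting distinct instances of a class."""
    return (
        "SELECT (COUNT(DISTINCT ?id) AS ?n) "
        "WHERE { ?id a <%s> . }" % class_iri
    )


def run_count(endpoint, query, timeout=60):
    """Send the query over the SPARQL protocol (HTTP GET) and return the count."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["n"]["value"])


def relative_difference(public_count, local_count):
    """Percentage by which the local count differs from the public one."""
    return (local_count - public_count) / public_count * 100.0


def main():
    """Compare the two classes discussed in this thread (needs network access)."""
    for name in ("Company", "Person"):
        query = count_query("http://dbpedia.org/ontology/" + name)
        public = run_count(PUBLIC_ENDPOINT, query)
        local = run_count(LOCAL_ENDPOINT, query)
        print(name, public, local, "%.1f%%" % relative_difference(public, local))
```

Calling main() performs the live queries. The other helpers are pure, so relative_difference(84048, 109629) can be checked offline and gives roughly +30% for the company counts quoted above.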

I would expect the latest data to be larger than the old data. Some plausible explanations I’ve come up with:

  • “A small part of this data (approx. 100 of 4000 files or 2.5%) is then selected into the latest-core collection”, meaning the collection stays about the same size as before even though Wikipedia has grown.

  • Extraction scripts have become better, meaning a bunch of bad data has been cleaned out and removed. This is probably not the case, since for instance http://dbpedia.org/page/Elkjøp is missing from the new data even though it is still an active company with a proper Wikipedia infobox and page. This resource/company is available on the public DBpedia SPARQL endpoint.

  • The extra data packages (links, instance_types_lhd_dbo_en, instance_types_lhd_ext_en) explain the additional data seen on dbpedia.org (2016-10 dump).

I guess my question is what I am missing, or which of the above (if any) is valid. The next question is of course how to get the missing data into Virtuoso. As far as I understand, the Dockerized-DBpedia loader loads everything in the downloads folder, so if I knew which files should be included in addition to latest-core, could I simply download the compressed files manually into that folder?

Many thanks in advance.

So I think I solved it:

The data is missing because the latest-core release lacks a bunch of rdf:type mappings.
Comparing http://dbpedia.org/resource/Eizo in latest-core and in 2016-10 (dbpedia.org/sparql), this is the only type for Eizo in latest-core by default:

umbel-rc:Business

These are the types on 2016-10 (dbpedia.org/sparql) by default:

owl:Thing
dbo:Company
dul:Agent
dul:SocialPerson
wikidata:Q24229398
wikidata:Q43229
dbo:Agent
dbo:Organisation
schema:Organization
umbel-rc:Business
yago:Abstraction100002137
yago:Company108058098
yago:ElectronicsCompany108003035
yago:Group100031264
yago:Institution108053576
yago:Organization108008335
yago:SocialGroup107950920
yago:YagoLegalActor
yago:YagoLegalActorGeo
yago:YagoPermanentlyLocatedEntity
yago:WikicatCompaniesBasedInIshikawaPrefecture
yago:WikicatCompaniesEstablishedIn1968
yago:WikicatCompaniesListedOnTheTokyoStockExchange
yago:WikicatComputerCompanies
yago:WikicatComputerHardwareCompanies
yago:WikicatDisplayTechnologyCompanies
yago:WikicatElectronicsCompanies
yago:WikicatElectronicsCompaniesOfJapan

The important difference, given my company query above, is that I’m looking for every resource whose rdf:type is dbo:Company. On latest-core that query misses Eizo, among a bunch of other companies.
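
To make the effect concrete, here is a small Python sketch that diffs the two type sets for Eizo. It uses only the non-YAGO types from the listing above, and the helper names are invented for illustration.

```python
# rdf:type values for dbr:Eizo as listed in this post (YAGO types omitted).
LATEST_CORE_TYPES = {"umbel-rc:Business"}
ENDPOINT_TYPES = {
    "owl:Thing",
    "dbo:Company",
    "dul:Agent",
    "dul:SocialPerson",
    "wikidata:Q24229398",
    "wikidata:Q43229",
    "dbo:Agent",
    "dbo:Organisation",
    "schema:Organization",
    "umbel-rc:Business",
}


def missing_types(reference, candidate):
    """Types present in the reference set but absent from the candidate set."""
    return sorted(reference - candidate)


def matches_class(types, wanted):
    """True if a query of the form `?id a <wanted>` would return this resource."""
    return wanted in types


if __name__ == "__main__":
    print("Missing on latest-core:", missing_types(ENDPOINT_TYPES, LATEST_CORE_TYPES))
    print("Found by the company query on latest-core:",
          matches_class(LATEST_CORE_TYPES, "dbo:Company"))
```

Since dbo:Company is among the missing types, the company query cannot match Eizo on latest-core, while it does match on the 2016-10 endpoint.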

I added all ontology files from https://wiki.dbpedia.org/services-resources/ontology:

DBpedia Ontology T-BOX (Schema)
DBpedia Ontology RDF type statements (Instance Data)
DBpedia Ontology other A-Box properties (Instance Data, mapping-based properties)
DBpedia Ontology other A-Box specific properties (Instance Data, mapping-based properties (specific))

to my Dockerized-DBpedia downloads folder, in addition to all files from latest-core (making sure to overwrite any conflicting files), and reloaded all the data. That took me from ~85k companies to ~130k companies, compared to 2016-10 (dbpedia.org/sparql), which has ~110k companies.

I have no idea why these mappings are missing (the files aren’t too big), but if you want your queries to work the same on the old 2016-10 dump (dbpedia.org/sparql) and on latest-core, you have to add these ontology files.
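
Before reloading, it can be worth checking that a downloaded file really carries the rdf:type statements you need. The following is a rough Python sketch that counts rdf:type objects in a .ttl.bz2 file. It assumes the dump keeps one triple per line, as N-Triples does, and the file name in the usage part is a placeholder, not an actual DBpedia file name.

```python
import bz2
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def count_types(path):
    """Count rdf:type objects in a bz2-compressed, one-triple-per-line file."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Split into subject, predicate and the rest of the line.
            parts = line.split(None, 2)
            if len(parts) == 3 and parts[1] == RDF_TYPE:
                # Drop the trailing " ." terminator to get the object IRI.
                obj = parts[2].rstrip().rstrip(".").strip()
                counts[obj] += 1
    return counts


def main(path="downloads/your_instance_types_file.ttl.bz2"):  # placeholder path
    """Print the most frequent classes found in the given file."""
    for cls, n in count_types(path).most_common(10):
        print(n, cls)
```

If <http://dbpedia.org/ontology/Company> does not show up in the counts for any file in the downloads folder, the company query above will keep returning too few results after loading.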

Hi klintan and welcome to the forums!
Sorry for the late reply. This really seems to be a missing dataset in the collection. We’ll make sure to add it to the latest-core collection.
Cheers!


@klintan there are still a number of missing datasets. We documented them at the end of:
https://wiki.dbpedia.org/develop/datasets/latest-core-dataset-releases
in the section “What’s missing?”