I’ve been using the Dockerized-DBpedia repo to get the latest data, and I compared it to the official DBpedia SPARQL endpoint, which contains the 2016-10 core, or more precisely:
2016-10/links
2016-10/core
instance_types_lhd_dbo_en.ttl.bz2
en/instance_types_lhd_ext_en.ttl.bz2
All triples:
SELECT (COUNT(*) as ?Triples)
FROM <http://dbpedia.org>
WHERE { ?s ?p ?o }
I would expect the latest data to be bigger than the old data. Some plausible explanations that I’ve come up with:
“A small part of this data (approx. 100 of 4000 files or 2.5%) is then selected into the latest-core collection”, meaning it will be of similar size as before even though the Wikipedia data has grown.
Extraction scripts have become better, meaning a bunch of bad data has been cleaned out and removed. This is probably not the case, since for instance we are missing http://dbpedia.org/page/Elkjøp in the new data, which is still an active company with a proper Wikipedia infobox and page. This resource/company is available on the public DBpedia SPARQL endpoint.
The extra data packages (links, instance_types_lhd_dbo_en, instance_types_lhd_ext_en) explain the additional data seen on dbpedia.org (2016-10 dump).
I guess my question is what I am missing (or which of the above, if any, are valid). The next question is of course how I would go about getting the missing data into Virtuoso. If I knew which files should be included in addition to latest-core, would the Dockerized-DBpedia loader just load everything from the downloads folder, meaning I can manually download the compressed files into it?
The reason data is missing is that the latest-core release is missing a bunch of rdf:type mappings.
Comparing http://dbpedia.org/resource/Eizo from latest-core and 2016-10 (dbpedia.org/sparql) this is the type for Eizo on latest-core by default:
The important difference here, given my company query above, is that I’m looking for everything with rdf:type company. This query will miss Eizo, among a bunch of other companies.
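For anyone who wants to reproduce the comparison, a company count along these lines can be run against both endpoints. This is only a sketch: the original company query is not shown in this thread, so the use of dbo:Company as the class is an assumption.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Assumption: "company" means the DBpedia ontology class dbo:Company.
# DISTINCT guards against counting a resource twice if the same
# type triple exists in more than one graph.
SELECT (COUNT(DISTINCT ?company) AS ?Companies)
WHERE { ?company rdf:type dbo:Company }
```

Any resource without a matching rdf:type triple is simply not counted, which is how a company like Eizo drops out of the result.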
After adding the following files:
DBpedia Ontology T-BOX (Schema)
DBpedia Ontology RDF type statements (Instance Data)
DBpedia Ontology other A-Box properties (Instance Data, mapping-based properties)
DBpedia Ontology other A-Box specific properties (Instance Data, mapping-based properties (specific))
to my Dockerized-DBpedia downloads folder, in addition to all files from latest-core (making sure to overwrite any conflicting files), and reloading all the data, I went from ~85k companies to ~130k companies, compared to 2016-10 (dbpedia.org/sparql), which has ~110k companies.
I have no idea why these mappings would be missing (the files aren’t too big), but if you want to make sure your queries work the same on the old 2016-10 dump (dbpedia.org/sparql) and on latest-core, you have to add these ontology files.
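A quick sanity check after reloading is to ask whether Eizo now carries the company type. The sketch below assumes the relevant class is dbo:Company; the exact class is not stated in this thread.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Assumption: dbo:Company is the class the company query filters on.
# Returns true only if this rdf:type triple for Eizo has been loaded.
ASK { <http://dbpedia.org/resource/Eizo> rdf:type dbo:Company }
```

If this returns false on the local Virtuoso but true on dbpedia.org/sparql, the type statements are probably still not loaded.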
Hi klintan and welcome to the forums!
Sorry for the late reply. This really seems to be a missing dataset in the collection. We’ll make sure to add it to the latest-core collection.
Cheers!