DBpedia Dataset 2019-08-30 (Pre-Release)

The link leading to this prerelease from the main DBpedia download page is broken - it is still in the old format.

The first link on https://wiki.dbpedia.org/develop/datasets (https://databus.dbpedia.org/dbpedia/collections/release-2019-08-30) returns “Unable to find the collection.”

We are working on it. There seems to be a bug on the website that prevents us from making changes to that page.

best

Sandra

Hello,

there appear to be no “Infobox Properties Mapped” files in this release. Is this intended?

In the 2016 release, there was infobox_properties_mapped_en.ttl.bz2 (preview).

For example,
dbr:Asus_Fonepad dbp:sound dbr:WAV
is missing in the linked 2019-08 dataset.

There are no dbr dbp dbr triples included – only infobox-properties (dbr dbp literal) and Cleaned object properties extracted with mappings (dbr dbo dbr).
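A rough way to check which triple shapes a dump actually contains is to classify each line by its object. The following is only a minimal Python sketch, assuming N-Triples-style content with one triple per line; the function name `count_object_kinds` is made up for illustration, and the second sample line is invented (only the dbp:sound triple comes from the example above).

```python
import re
from collections import Counter

# "<subject> <predicate> object ." with one triple per line (assumption).
TRIPLE_RE = re.compile(r'^(<[^>]*>)\s+(<[^>]*>)\s+(.*?)\s*\.\s*$')

def count_object_kinds(lines):
    """Count triples whose object is an IRI vs. a literal."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match is None:
            counts['unparsed'] += 1
            continue
        obj = match.group(3)
        if obj.startswith('<'):
            counts['iri'] += 1      # dbr dbp dbr shape
        elif obj.startswith('"'):
            counts['literal'] += 1  # dbr dbp literal shape
        else:
            counts['other'] += 1    # e.g. blank nodes
    return counts

sample = [
    # Taken from the example in this post.
    '<http://dbpedia.org/resource/Asus_Fonepad> <http://dbpedia.org/property/sound> <http://dbpedia.org/resource/WAV> .',
    # Invented line, for illustration only.
    '<http://dbpedia.org/resource/Asus_Fonepad> <http://dbpedia.org/property/name> "Asus Fonepad"@en .',
]
print(count_object_kinds(sample))
```

For a real dump, the lines could be streamed from the compressed file with `bz2.open(path, 'rt', encoding='utf-8')` and passed to the same function. An `iri` count of zero would point to the missing dbr dbp dbr triples.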

This gave me quite a headache, as I was trying to understand why so much data was missing when comparing 2019 to 2016… so, is it intended?

I assume it’s better to ask here instead of opening a new post since it is only related to this very release and no prior ones. Cheers

@janfo seems legit. Can you check that the files in the collection are the same as in this folder: http://downloads.dbpedia.org/2016-10/core/

Actually, what is currently loaded besides core in the main endpoint is documented here:
https://wiki.dbpedia.org/public-sparql-endpoint

@phil294 it seems to be an issue with the file itself:
previous: dbp/literal and dbp/dbr were in infobox-properties
new: infobox-properties only contains dbp/literal

Thanks for pointing this out. We didn’t notice because the file size still grew by almost 60%, so we just assumed it was fine.

If you want to do it databus-style, you could export the query and load the 2016-10 infobox-properties in addition, i.e. tweak the dependency to point to the last working version.

Ok, good to know. Yes, I am already using a custom collection https://databus.dbpedia.org/phil294/collections/test_partial_dbpedia, it works great. I’ll just add the previous infobox-properties then. Thanks!

I have some bad news. We were very much focused on the mappings and on encoding issues, and it seems we didn’t load generic/2016-10 on the bus.
I am running a regex over all versions to see whether there are any dbr/dbr/dbr triples:

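 # For each English file, print its name and the first triple without a quote (i.e. without a literal object).
 for i in $(find . -type f | grep "lang=en"); do echo "$i"; bzcat "$i" | grep -v '"' | head -1; done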

let’s see
This is a serious one, so it has high priority for us.

@phil294 Hey, I uploaded generic/infoboxproperties/2016 on the bus, so it is available: https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2016.10.01

This way you can include it into the collection until a fixed version is available.
I am close (I think) to fixing the dbr/dbr/dbr issue, so it should be available again. We are also fixing the rdf:langString without languageTag bug. Sometime after this, we will also upload generic/2016 completely.

Neat! I found combining 2016+2019 to be rather complicated, so – when, roughly, do you estimate the fixed 2019 infoboxproperties to be available? (or an entirely new release, for that matter)

@phil294 Soon, it’s on our desks to fine-tune this.

What was the issue? I assume that a feature is missing, where you copy a collection into your own collection space and then are able to edit it and change the version of one artifact. Is that correct? If yes, we or you could record that feature on GitHub: https://github.com/dbpedia/databus-maven-plugin/issues

That was not the issue, though I agree this might be an interesting feature for the Databus web UI; I’ll create an issue. But you can work around it by simply using the custom query feature. Not as pretty and formatted, but it also works.
Edit: It looks like this (now?) works by removing and adding artefacts as desired. I think this is fine for now.
Edit2: Deleting/creating is somewhat broken because the versions don’t reset upon deletion. Now I’ll make an issue.

The issues I had instead were about the data itself: infobox-properties 2019 contains corrected values for statements that were wrong in the 2016 dataset, for example dbr:R dbp:P "value1" (2016) becomes dbr:R dbp:P "value2". By including both, I suddenly have both value1 and value2 in the dataset, which is a pain to clean up. Also, values might have been removed, but this is not detectable because 2016 is missing many statements anyway.
I guess I’d be better off using only the 2016 infobox-properties, but instead I will wait for the next fix or release so that migrating to subsequent datasets will be easier.
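To make the value1/value2 problem visible before loading both versions, the triples can be grouped by subject and predicate and then compared. This is just a sketch in Python under stated assumptions: triples are already parsed into `(s, p, o)` string tuples, the helper names (`group_values`, `conflicting_pairs`, `merge_prefer_new`) are hypothetical, and "prefer the newer version" is one possible cleanup policy, not something DBpedia prescribes.

```python
from collections import defaultdict

def group_values(triples):
    """Map each (subject, predicate) pair to the set of its objects."""
    grouped = defaultdict(set)
    for s, p, o in triples:
        grouped[(s, p)].add(o)
    return grouped

def conflicting_pairs(old_triples, new_triples):
    """Return (s, p) pairs present in both versions with different objects."""
    old, new = group_values(old_triples), group_values(new_triples)
    return {
        key: (old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }

def merge_prefer_new(old_triples, new_triples):
    """Union of both versions; where both have an (s, p) pair, keep only the new objects."""
    merged = dict(group_values(old_triples))
    merged.update(group_values(new_triples))  # new objects replace old ones
    return sorted((s, p, o) for (s, p), objs in merged.items() for o in objs)

# Placeholder triples modelled on the examples in this thread.
triples_2016 = [
    ("dbr:R", "dbp:P", '"value1"'),
    ("dbr:Asus_Fonepad", "dbp:sound", "dbr:WAV"),
]
triples_2019 = [
    ("dbr:R", "dbp:P", '"value2"'),
]

print(conflicting_pairs(triples_2016, triples_2019))
print(merge_prefer_new(triples_2016, triples_2019))
```

Note that this policy also drops a 2016 resource object whenever 2019 has any value for the same subject and predicate, so it does not solve the case where values were legitimately removed or where 2019 only kept the literal variant.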

@kurzum I think the infobox property dbr/dbr/dbr issue is fixed in the 2020.02.01, isn’t it? If so, I might have found another bug:

First off, the 2020 infobox-properties is significantly smaller (291 MB) than the 2019 one (671 MB). Was it split into another artifact? I don’t see any.

Secondly, some values do indeed seem to be missing, e.g. the dbr: properties of http://dbpedia.org/page/IPhone are missing in the new one:

grep '^<http://dbpedia.org/resource/IPhone>' infobox-properties_lang\=en_redirected.ttl

returns an empty result when it probably shouldn’t.

The same thing seems to apply for infobox-property-definitions: 2020.02.01/infobox-property-definitions@en (0.3 MB) vs. 2019.08.30 (6.6 MB). For example, dbp:releaseDate definition is missing.
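A single grep only covers one resource; to get a broader picture, the subjects of two versions of an artifact can be compared. Below is a minimal Python sketch, assuming N-Triples-style files with one triple per line. The names `subjects` and `missing_subjects` are illustrative, and the sample lines are invented apart from the resource IRIs mentioned in this thread.

```python
def subjects(lines):
    """Collect the subject IRI of every triple line (text between '<' and the first '>')."""
    found = set()
    for line in lines:
        line = line.strip()
        if line.startswith('<') and '>' in line:
            found.add(line[1:line.index('>')])
    return found

def missing_subjects(old_lines, new_lines):
    """Subjects that occur in the old version but not in the new one."""
    return sorted(subjects(old_lines) - subjects(new_lines))

# Invented sample content standing in for two versions of the same artifact.
old_version = [
    '<http://dbpedia.org/resource/IPhone> <http://dbpedia.org/property/type> "Smartphone"@en .',
    '<http://dbpedia.org/resource/Asus_Fonepad> <http://dbpedia.org/property/sound> <http://dbpedia.org/resource/WAV> .',
]
new_version = [
    '<http://dbpedia.org/resource/Asus_Fonepad> <http://dbpedia.org/property/sound> <http://dbpedia.org/resource/WAV> .',
]

print(missing_subjects(old_version, new_version))
```

On full dumps the subject sets hold millions of entries, so this needs a fair amount of memory; comparing sorted subject lists on disk would be an alternative.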

I did not investigate this or other datasets any further. Hope it helps.

For some reason 2020.02.01 seems to be missing half the data. So the extractor was fixed, but somehow the whole release slice has a fault. We are currently trying to run 2020.03.01 to see if this persists.

@phil294 we did generic/2020-03-01. It looks fine now, can you check?

@kurzum Looks pretty good now. Over the last few days of working with it (2020/07 dataset), everything looked fine.

One thing however, shouldn’t the foaf: properties be contained inside mappingbased-objects?

SELECT (COUNT(*) AS ?count) WHERE { ?s foaf:gender ?o }

returns 1,418,206 in the 2016 public endpoint, but in my up-to-date local endpoint, these seem to be missing. I used these artifacts from my testing collection, and the above query returns 0.

Same thing with dbo:thumbnail: 1,695,460 in 2016, 0 locally.

And with foaf:depiction (which is kinda the same thing as dbo:thumbnail anyway): 1,698,622 in 2016, 300 locally (??)

Is this another error or am I missing some dataset?
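For anyone wanting to repeat this comparison, the per-predicate counts can be collected from two endpoints in one go. This is only a sketch in Python: it assumes both endpoints speak the standard SPARQL protocol over HTTP GET and return JSON results when asked via the Accept header, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Full IRIs of the predicates compared in this post.
PREDICATES = {
    "foaf:gender": "http://xmlns.com/foaf/0.1/gender",
    "dbo:thumbnail": "http://dbpedia.org/ontology/thumbnail",
    "foaf:depiction": "http://xmlns.com/foaf/0.1/depiction",
}

def count_query(predicate_iri):
    """Build a standard SPARQL 1.1 count query for one predicate."""
    return "SELECT (COUNT(*) AS ?n) WHERE { ?s <%s> ?o }" % predicate_iri

def fetch_count(endpoint, predicate_iri):
    """Run the count query against an endpoint and return the number as int."""
    params = urllib.parse.urlencode({"query": count_query(predicate_iri)})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return int(data["results"]["bindings"][0]["n"]["value"])

def compare_endpoints(endpoint_a, endpoint_b, fetch=fetch_count):
    """Return {prefixed name: (count at endpoint_a, count at endpoint_b)}."""
    return {
        name: (fetch(endpoint_a, iri), fetch(endpoint_b, iri))
        for name, iri in PREDICATES.items()
    }
```

A call could look like `compare_endpoints("https://dbpedia.org/sparql", "http://localhost:8890/sparql")`, where the second URL is only an assumed address for a local endpoint. The `fetch` parameter exists so the comparison logic can be tried without network access.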

@phil294 could be this one: https://databus.dbpedia.org/dbpedia/generic/persondata/
We excluded it because it is broken anyhow. Not sure if it is worth repairing. It seems like they come from other datasets, i.e. Wikidata or MusicBrainz. The latest fusion has them:
https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.12.15
filter tag by gender.

See the “Missing” section at https://wiki.dbpedia.org/develop/datasets/latest-core-dataset-releases:

ImageExtractor was malfunctioning and disabled, i.e. only images from infoboxes are extracted, no clean licenses. (Will be fixed with https://databus.dbpedia.org/dbpedia/wikidata/images/)

Huh, so for depictions/thumbnails, one needs to join the Wikidata resources over owl:sameAs and select from dbpedia/wikidata/images. That should work, thanks.

Not sure if it is worth repairing.

@kurzum I would assume it is worth it – otherwise, the dataset doesn’t contain even basic information like the gender of a person, which seems pretty important (?), without also taking the above-mentioned route with owl:sameAs/wikidata-resource

For example, http://dbpedia.org/page/Hans_Sarpei knows that Hans Sarpei is male, https://dbpedia.demo.openlinksw.com/page/Hans_Sarpei doesn’t.

@phil294 so you are right about the joining with other data. Wikipedia and Wikidata are not suitable for that, since they are copies of other data.
We are integrating data from national libraries, for example. The age of extraction is ending. The age of integration has started.