DBpedia Dataset 2019-08-30 (Pre-Release)

NOTE: this release is superseded by Latest Core DBpedia Releases

The DBpedia release 2019-08-30 can now be found here:
https://databus.dbpedia.org/dbpedia/collections/release-2019-08-30
https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30
Update: added pre- to the collection URI, so it won’t be mistaken for the release

How to retrieve the data
Tools for easier download and usage of collection data are in development. Until then, please follow these steps:

  • Retrieve the data query (Visit the collection page and click on Actions > Copy Query to Clipboard or run curl https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30 -H "accept: text/sparql")
  • Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)

More extensive information on DBpedia Databus Collections and how to use them will follow in the next few days.

@janfo I tested it with

# retrieve the SPARQL query from the collection
QUERY=$(curl -H "Accept: text/sparql" "https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30")
# retrieve the download URLs with the SPARQL query
DOWNLOADURLS=$(curl -X POST --data-urlencode query="$QUERY" --data-urlencode format="text/tab-separated-values" "https://databus.dbpedia.org/repo/sparql")
# remove the double quotes " from the download URLs, otherwise wget complains about a missing scheme
DOWNLOADURLS=$(echo $DOWNLOADURLS | sed 's/"//g')
# download (note: no ";" directly after "do", which is a syntax error)
for i in $DOWNLOADURLS ; do echo "Downloading" "$i" ; wget "$i" ; done

but it downloads too many files, e.g. https://downloads.dbpedia.org/repo/lts/generic/categories/2019.08.30/categories_lang=br_labels.ttl.bz2
whereas the http://downloads.dbpedia.org/2016-10/core/ folder contains:

  • all files only in English
  • just the text group has several languages, but they are en_uris and we don’t produce them yet. We could have a transition artifact.
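
As a stopgap for the "too many files" problem, the URL list can be filtered by the language tag in the file name before downloading. This is only a sketch: filter_lang is a hypothetical helper, and it assumes every file name carries a lang=<code> tag followed by "_" or ".", as in the example file above. It does not reproduce the exact 2016-10 core selection.

```shell
# Hypothetical helper: keep only URLs whose file name is tagged lang=<code>.
# Reads one URL per line from stdin; $1 is the language code, e.g. "en".
# The tag must be followed by "_" or "." so that "en" does not match a
# longer code that merely starts with "en".
filter_lang() {
  grep -E "lang=$1[_.]"
}

# Possible use with the DOWNLOADURLS variable from the script above
# (tr turns a space-separated list into one URL per line):
#   for i in $(echo "$DOWNLOADURLS" | tr ' ' '\n' | filter_lang en) ; do wget "$i" ; done
```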

Update: Ah yes, and these are supposed to be the main endpoint releases for http://dbpedia.org/sparql

@janfo it would also be cool to get a DataId / DCAT catalog in turtle, when doing "Accept: text/turtle" in the curl on the collection. These could also be available at
https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30.ttl with a 303 redirect. Could you record that feature in the issue tracker? I think it is not a priority, but very cool to have. It is related to some of the issues here: https://github.com/dbpedia/databus-maven-plugin/issues DataId needs some changes, but we decided to focus on data output and usability first.

Created an issue here: https://github.com/dbpedia/databus-maven-plugin/issues/101

The link leading to this prerelease from the main DBpedia download page is broken - it is still in the old format.

First link from https://wiki.dbpedia.org/develop/datasets (https://databus.dbpedia.org/dbpedia/collections/release-2019-08-30) returns “Unable to find the collection.”

We are working on it. There seems to be a bug on the website that prevents us from making changes to this page.

best

Sandra

Hello,

there appear to be no “Infobox Properties Mapped” files in this release. Is this intended?

In the 2016 release, there was infobox_properties_mapped_en.ttl.bz2 (preview).

For example,
dbr:Asus_Fonepad dbp:sound dbr:WAV
is missing in the linked 2019-08 dataset.

There are no dbr dbp dbr triples included – only infobox-properties (dbr dbp literal) and Cleaned object properties extracted with mappings (dbr dbo dbr).

This caused me quite a headache, as I was trying to understand why so much data was missing when comparing 2019 to 2016… so, is it intended?

I assume it’s better to ask here instead of opening a new post since it is only related to this very release and no prior ones. Cheers

@janfo seems legit. Can you check that the files in the collection are the same as in this folder: http://downloads.dbpedia.org/2016-10/core/

Actually, what is currently loaded besides core in the main endpoint is documented here:
https://wiki.dbpedia.org/public-sparql-endpoint

@phil294 it seems to be an issue with the file itself:
previous: dbp/literal and dbp/dbr were in infobox-properties
new: infobox-properties only contains dbp/literal

Thanks for pointing this out. We didn’t notice because the file size still grew by almost 60%, so we just assumed it was fine.

If you want to do it databus-style you could export the query and load the 2016-10 infobox-properties in addition, i.e. tweaking the dependency to point to the last working version.

Ok, good to know. Yes, I am already using a custom collection https://databus.dbpedia.org/phil294/collections/test_partial_dbpedia, it works great. I’ll just add the previous infobox-properties then. Thanks!

I have some bad news. We were very much focused on the mappings and on encoding issues, and it seems we didn’t load generic/2016-10 on the bus.
I am running a regex over all versions to see whether there are any dbr/dbr/dbr triples:

 for i in `find . | grep "lang=en"` ; do echo $i ; bzcat $i | grep -v '"' | head -1   ; done

let’s see
This is a serious one, so it has high priority for us.
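
The loop above prints the first line without a double quote for each file, so any non-literal triple counts as a hit. A stricter check, sketched here under assumptions, counts only triples of the shape dbr dbp dbr. The helper name count_dbr_dbp_dbr is made up for illustration, and it assumes the dumps hold one triple per line with full IRIs, N-Triples style.

```shell
# Hypothetical helper: count triples whose subject and object are DBpedia
# resources and whose predicate is in the http://dbpedia.org/property/
# namespace. Reads N-Triples from stdin and prints the count.
count_dbr_dbp_dbr() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # function usable under "set -e"; the 0 is still printed.
  grep -c -E '^<http://dbpedia\.org/resource/[^>]*> <http://dbpedia\.org/property/[^>]*> <http://dbpedia\.org/resource/[^>]*>[[:space:]]*\.[[:space:]]*$' || true
}

# Possible use inside the loop above:
#   bzcat "$i" | count_dbr_dbp_dbr
```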

@phil294 Hey, I uploaded generic/infoboxproperties/2016 on the bus, so it is available: https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2016.10.01

This way you can include it into the collection until a fixed version is available.
I am close (I think) to fixing the dbr/dbr/dbr issue, so it should be available again. We are also fixing the rdf:langString without languageTag bug. Sometime after this, we will also upload generic/2016 completely.

Neat! I found combining 2016+2019 to be rather complicated, so – when, roughly, do you estimate the fixed 2019 infoboxproperties to be available? (or an entirely new release, for that matter)

@phil294 soon, it’s on our desks to fine-tune this.

What was the issue? I assume that a feature is missing, where you copy a collection into your own collection space and then are able to edit it and change the version of one artifact. Is that correct? If yes, we or you could record that feature on GitHub: https://github.com/dbpedia/databus-maven-plugin/issues

That was not the issue, even though I agree this might be an interesting feature for the Databus web UI; I’ll create an issue. But you can work around it by simply using the custom query feature. Not as pretty and formatted, but it also works.
Edit: It looks like this (now?) works by removing and adding artefacts as desired, I think this is fine for now.
Edit2: Deleting/creating is somewhat broken because the versions don’t reset upon deletion, now I’ll make an issue.

The issues I had instead were about the data itself: infobox-properties 2019 contains fixed values of statements that were wrong in the 2016 dataset, for example dbr:R dbp:P "value1" (2016) becomes dbr:R dbp:P "value2". And by including both, I suddenly have both value1 and value2 in the dataset, which is a pain to clean up. Also, values might have been removed, but this is not detectable because 2016 is missing many statements anyway.
I guess I’ll be better off to only use 2016 infobox-properties, but instead I will wait for the next fix or release so migrating to subsequent datasets will be easier.
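
To get a feeling for how many statements are affected by the value1/value2 problem, one could list the subject/predicate pairs that occur in both dumps but carry a value in the newer file that the older one lacks. The following is a rough sketch only: conflicting_pairs is a hypothetical helper, it assumes one triple per line, it keeps the whole first file in memory, and it also flags legitimately multi-valued properties, so its output is a list of candidates rather than a list of errors.

```shell
# Hypothetical helper: print subject/predicate pairs present in both
# N-Triples files where the second file has an object value that the
# first file does not have for that pair.
# $1 = older file (e.g. 2016), $2 = newer file (e.g. 2019).
conflicting_pairs() {
  awk '
    # For every line: key is "<s> <p>", obj is the rest of the line.
    {
      key = $1 " " $2
      obj = $0
      sub(/^[^ ]+ [^ ]+ /, "", obj)
    }
    # First file: remember each pair and each value seen for it.
    NR == FNR { seen[key] = 1; val[key, obj] = 1; next }
    # Second file: report pairs known from the first file with a new value.
    (key in seen) && !((key, obj) in val) { print key }
  ' "$1" "$2" | sort -u
}

# Possible use with compressed dumps (bash process substitution):
#   conflicting_pairs <(bzcat old.ttl.bz2) <(bzcat new.ttl.bz2)
```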

@kurzum I think the infobox property dbr/dbr/dbr issue is fixed in the 2020.02.01, isn’t it? If so, I might have found another bug again:

First off, the 2020 infobox-properties is significantly smaller (291 MB) than the 2019 one (671 MB). Was it split into another artifact? I don’t see any.

Secondly, some values seem to be missing indeed, e.g. http://dbpedia.org/page/IPhone dbr: properties are missing in the new one:

grep '^<http://dbpedia.org/resource/IPhone>' infobox-properties_lang\=en_redirected.ttl

returns an empty result when it probably shouldn’t.

The same thing seems to apply for infobox-property-definitions: 2020.02.01/infobox-property-definitions@en (0.3 MB) vs. 2019.08.30 (6.6 MB). For example, dbp:releaseDate definition is missing.

I did not investigate this or other datasets any further, hope it helps
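
For comparing the same resource across versions, a small counting helper may be handier than eyeballing grep output. This is a sketch with a made-up name, count_subject. It matches the subject IRI literally (so dots in the IRI are not treated as regex wildcards) and assumes one triple per line.

```shell
# Hypothetical helper: count the triples on stdin whose subject is exactly
# the IRI given as $1. Always prints a number, 0 if there is no match.
count_subject() {
  awk -v prefix="<$1> " 'index($0, prefix) == 1 { n++ } END { print n + 0 }'
}

# Possible use, once per version of the file:
#   count_subject http://dbpedia.org/resource/IPhone < infobox-properties_lang=en_redirected.ttl
```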

For some reason 2020.02.01 seems to be missing half the data. So the extractor was fixed, but somehow the whole release slice has a fault. We are currently trying to run 2020.03.01 to see if this persists.

@phil294 we did generic/2020-03-01. It looks fine now, can you check?