How to retrieve the data
Tools for easier download and usage of collection data are in development. Until then, please follow these steps:
Retrieve the data query (visit the collection page and click on Actions > Copy Query to Clipboard, or run curl https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30 -H "accept: text/sparql")
Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)
More extensive information on DBpedia Databus Collections and how to use them will follow in the next few days.
#retrieve sparql query from collection
QUERY=`curl -H "Accept: text/sparql" "https://databus.dbpedia.org/dbpedia/collections/pre-release-2019-08-30"`
#retrieve downloadurls with sparql query
DOWNLOADURLS=`curl -X POST --data-urlencode query="$QUERY" --data-urlencode format="text/tab-separated-values" "https://databus.dbpedia.org/repo/sparql"`
#remove double quotes " from downloadurls, otherwise wget complains about a missing scheme
DOWNLOADURLS=`echo $DOWNLOADURLS| sed 's/"//g'`
# download
for i in $DOWNLOADURLS ; do echo "Downloading $i" ; wget "$i" ; done
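One thing to watch for with the script above: a tab-separated SPARQL result normally starts with a header row holding the variable name, which would also be handed to wget. A minimal sketch of a filter that keeps only URL-looking lines follows; `extract_urls` is a hypothetical helper name, not part of any Databus tooling, and it assumes one download URL per result row.

```shell
# Hypothetical helper (not part of the Databus tools): reads a TSV result
# from stdin, strips double quotes and carriage returns, and keeps only
# lines that start with http:// or https://, which drops the header row.
extract_urls() {
  tr -d '"\r' | grep -E '^https?://' || true
}

# Possible usage, assuming $QUERY was set as in the script above:
# curl -X POST --data-urlencode query="$QUERY" \
#      --data-urlencode format="text/tab-separated-values" \
#      "https://databus.dbpedia.org/repo/sparql" \
#   | extract_urls \
#   | while read -r url ; do echo "Downloading $url" ; wget "$url" ; done
```

Reading the URLs line by line also avoids relying on word splitting of an unquoted variable.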
For example, dbr:Asus_Fonepad dbp:sound dbr:WAV
is missing in the linked 2019-08 dataset.
There are no dbr dbp dbr triples included – only infobox-properties (dbr dbp literal) and cleaned object properties extracted with mappings (dbr dbo dbr).
This caused me quite a headache, as I was trying to understand why so much data was missing when comparing 2019 to 2016… so, is it intended?
I assume it’s better to ask here instead of opening a new post since it is only related to this very release and no prior ones. Cheers
@phil294 it seems to be an issue with the file itself:
previous: dbp/literal and dbp/dbr were in infobox-properties
new: infobox-properties only contains dbp/literal
Thanks for pointing this out. We didn’t notice it because the file size still grew by almost 60%, so we just assumed it was fine.
If you want to do it databus-style you could export the query and load the 2016-10 infobox-properties in addition, i.e. tweaking the dependency to point to the last working version.
I have some bad news. We were very much focused on the mappings and on encoding issues and it seems we didn’t load generic/2016-10 on the bus
I am running a regex over all versions to see whether there are any dbr/dbp/dbr triples:
for i in `find . | grep "lang=en"` ; do echo $i ; bzcat $i | grep -v '"' | head -1 ; done
let’s see
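The one-liner above only prints the first non-literal line per file. To get an actual count of dbr/dbp/dbr triples, something along these lines might work. It is a sketch that assumes the dumps are N-Triples with single spaces between terms and the usual namespaces (dbp: = http://dbpedia.org/property/, dbr: = http://dbpedia.org/resource/); `count_dbp_object_triples` is a made-up helper name.

```shell
# Hypothetical helper: counts lines on stdin whose predicate is in the dbp:
# namespace and whose object is a dbr: resource IRI (not a literal).
# Assumes N-Triples formatted as "<s> <p> <o> ." with single spaces.
count_dbp_object_triples() {
  grep -cE '^<[^>]+> <http://dbpedia\.org/property/[^>]+> <http://dbpedia\.org/resource/[^>]+> \.$' || true
}

# Possible usage over the same files as the loop above:
# for i in `find . | grep "lang=en"` ; do
#   echo "$i: `bzcat "$i" | count_dbp_object_triples`"
# done
```

A count of 0 for a version would point to the same problem described in this thread.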
This is a serious one, so it has high priority for us.
This way you can include it into the collection until a fixed version is available.
I am close (I think) to fixing the dbr/dbr/dbr issue, so it should be available again. We are also fixing the rdf:langString without languageTag bug. Sometime after this, we will also upload generic/2016 completely.
Neat! I found combining 2016+2019 to be rather complicated, so – when, roughly, do you estimate the fixed 2019 infobox-properties to be available? (or an entirely new release, for that matter)
@phil294 soon, it’s on our desks to fine-tune this.
What was the issue? I assume that a feature is missing, where you copy a collection into your own collection space and then are able to edit it and change the version of one artifact. Is that correct? If yes, we or you could record that feature on GitHub: https://github.com/dbpedia/databus-maven-plugin/issues
That was not the issue, although I agree this might be an interesting feature for the Databus web UI; I’ll create an issue. But you can work around it by simply using the custom query feature. Not as pretty and formatted, but it also works. Edit: It looks like this (now?) works by removing and adding artifacts as desired; I think this is fine for now. Edit2: Deleting/creating is somewhat broken because the versions don’t reset upon deletion, so now I’ll make an issue.
The issues I had instead were about the data itself: infobox-properties 2019 contains fixed values of statements that were wrong in the 2016 dataset, for example dbr:R dbp:P "value1" (2016) becomes dbr:R dbp:P "value2". By including both, I suddenly have both value1 and value2 in the dataset, which is a pain to clean up. Also, values might have been removed, but this is not detectable because 2016 is missing many statements anyway.
I guess I’d be better off using only the 2016 infobox-properties, but instead I will wait for the next fix or release so that migrating to subsequent datasets will be easier.
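For anyone who still wants to combine both versions, the overlapping statements could be listed beforehand with a rough sketch like the one below. It assumes both files are decompressed N-Triples; `conflicting_pairs` is a hypothetical helper, and it keeps the whole older file in memory, so it may not be practical for the full dumps.

```shell
# Hypothetical helper: prints each subject/predicate pair that occurs in the
# old file ($1) and for which the new file ($2) has a triple that is not
# present verbatim in the old file, i.e. the value was changed or added.
conflicting_pairs() {
  awk '
    { key = $1 " " $2 }                       # subject and predicate
    FILENAME == ARGV[1] { seen[key] = 1; old_line[$0] = 1; next }
    (key in seen) && !($0 in old_line) { print key }
  ' "$1" "$2" | sort -u
}

# Possible usage (file names are placeholders):
# conflicting_pairs infobox-properties-2016.nt infobox-properties-2019.nt
```

Pairs printed this way are the ones where merging would leave both an old and a new value side by side.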
For some reason 2020.02.01 seems to be missing half the data. So the extractor was fixed, but somehow the whole release slice has a fault. We are currently trying to run 2020.03.01 to see if this persists.