Most of us are in the same position. The people at the core do Knowledge Graph construction as research, so they can dedicate more time thanks to the synergies, but we all have other main projects. Speed is not an issue here; what matters is pushing consistently until the result is good.
I think we can remove the part of the script that relies on the extraction framework as software and replace it with a Databus data dependency, i.e. download the latest files from the bus instead of generating them. That should be faster, too.
I attached a script that replaces the "# DBpedia extraction:" part of the script.
It runs standalone. Before merging, you need to remove the first 5 lines:
#TODO REMOVE BEFORE MERGE
LANGUAGE="it"
WDIR=.
BASE_WDIR=.
# ===== DATABUS =====
echo " Downloading the latest version of the following artifacts:
* https://databus.dbpedia.org/dbpedia/generic/disambiguations
* https://databus.dbpedia.org/dbpedia/generic/redirects
* https://databus.dbpedia.org/dbpedia/mappings/instance-types
Note of deviation from original index_db.sh:
takes the direct AND transitive version of redirects and instance-types and the redirected version of disambiguation
"
cd "$BASE_WDIR"
QUERY="PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?file WHERE {
{
# Subselect latestVersion by artifact
SELECT ?artifact (max(?version) as ?latestVersion) WHERE {
?dataset dataid:artifact ?artifact .
?dataset dct:hasVersion ?version
FILTER (?artifact in (
# GENERIC
<https://databus.dbpedia.org/dbpedia/generic/disambiguations> ,
<https://databus.dbpedia.org/dbpedia/generic/redirects> ,
# MAPPINGS
<https://databus.dbpedia.org/dbpedia/mappings/instance-types>
# latest ontology, currently @denis account
# TODO not sure if needed for Spotlight
# <https://databus.dbpedia.org/denis/ontology/dbo-snapshots>
)) .
}GROUP BY ?artifact
}
?dataset dct:hasVersion ?latestVersion .
{
?dataset dataid:artifact ?artifact .
?dataset dcat:distribution ?distribution .
?distribution dcat:downloadURL ?file .
?distribution dataid:contentVariant '$LANGUAGE'^^xsd:string .
# remove debug info
MINUS {
?distribution dataid:contentVariant ?variants .
FILTER (?variants in ('disjointDomain'^^xsd:string, 'disjointRange'^^xsd:string))
}
}
} ORDER by ?artifact
"
# Execute the query, strip the double quotes, and drop the "file" header line from the result set
RESULT=$(curl --data-urlencode query="$QUERY" --data-urlencode format="text/tab-separated-values" https://databus.dbpedia.org/repo/sparql | sed 's/"//g' | grep -v "^file$")
# Download
TMPDOWN="dump-tmp-download"
mkdir -p "$TMPDOWN"
cd "$TMPDOWN" || exit 1
for i in $RESULT
do
wget "$i"
ls
echo $TMPDOWN
pwd
done
cd ..
echo "decompressing"
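# Each glob may match several files (e.g. direct and transitive); bzcat concatenates them
bzcat -v "$TMPDOWN"/instance-types*.ttl.bz2 > "$WDIR/instance_types.nt"
bzcat -v "$TMPDOWN"/disambiguations*.ttl.bz2 > "$WDIR/disambiguations.nt"
bzcat -v "$TMPDOWN"/redirects*.ttl.bz2 > "$WDIR/redirects.nt"
# Clean up the temporary download directory
rm -r "$TMPDOWN"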
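If the query returns no files for a language, the bzcat redirections above still create the three .nt files, just empty. A minimal sketch of a sanity check that could run after decompression follows; check_outputs is a hypothetical helper name and is not part of index_db.sh or the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of index_db.sh):
# return non-zero if any expected file is missing or empty.
check_outputs() {
    local dir="$1"
    shift
    local name status=0
    for name in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$name" ]; then
            echo "missing or empty: $dir/$name" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible usage, assuming WDIR is set as in the script above:
# check_outputs "$WDIR" instance_types.nt disambiguations.nt redirects.nt || exit 1
```

Placing the call right after the bzcat lines would stop the run early instead of letting later steps work on empty input.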