1. Write a script for a cronjob that generates these models for all languages.
2. Write a script that deploys them on the Databus.
3. We will deploy 1 and 2 on a production server.
4. Create a configurable Spotlight Docker image that loads the specified model or models from the Databus and deploys Spotlight.
This way anybody can make their own models based on 1 and deploy whatever Spotlight language they need with 4.
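As a rough idea of what the cronjob script in point 1 could look like, here is a minimal sketch of a per-language loop. The function names build_model and build_all_models are made up for illustration, and build_model is only a placeholder that prints what it would do; the real model build would go there.

```shell
#!/usr/bin/env bash
# Placeholder for the real model build (hypothetical; e.g. a wrapper around
# the model-building script). It only prints what it would do.
build_model() {
  echo "building model for $1"
}

# Build one model per language code given as argument.
# Keeps going after a failure and returns non-zero if any build failed.
build_all_models() {
  status=0
  for lang in "$@"; do
    if ! build_model "$lang"; then
      echo "model build failed for $lang" >&2
      status=1
    fi
  done
  return "$status"
}

build_all_models it de
```

Continuing after a failed language is a design choice: one broken dump should not block the models for all other languages, but the non-zero exit status still lets cron report the problem.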
To achieve 1, could you try to make a Spotlight model in a smaller (non-English) language first and then tell us the results, i.e. how long it took and how you did it?
Are you interested in any language in particular? Please send your public key if you need a server to run it, e.g. because your laptop is too small or you don’t have a server.
@klaus82 I contacted him some days ago and got no answer, so I think you can just take over. Communication is very important; otherwise, things get blocked. Just post progress and questions here and aashay or others can join in.
Thank you @kurzum.
Could you help me find some information on how to achieve point 1, for example for Italian (it)?
Can I edit this script to generate the spotlight model?
Thank you.
@klaus82 Seems about right. Could you check whether it runs as described?
I would be interested in what the output files are. I am sure they have other purposes besides Spotlight.
Hi @kurzum,
I tried to run the script, but I get some errors while building extraction-framework/core.
I think the problem is related to the "Hard point" of this issue, but I need time to investigate.
I work on this in my spare time, so I can’t predict how long the resolution will take.
We can keep in touch anyway; I’ll be back when I have some news.
Most of us do as well. People at the core do Knowledge Graph construction as research and can therefore dedicate more time due to synergies, but we all have other main projects. Speed is not an issue here; what matters more is consistently pushing it until the result is good.
I think we can remove the part of the script relying on the extraction framework as a software and instead use a databus data dependency, i.e. download the latest files from the bus instead of generating them. Should be faster, too.
I attached a script that replaces the "# DBpedia extraction:" part:
It runs standalone. Before merging, you need to remove the first 5 lines:
#TODO REMOVE BEFORE MERGE
LANGUAGE="it"
WDIR=.
BASE_WDIR=.
###### # ####### # ###### # # #####
# # # # # # # # # # # # #
# # # # # # # # # # # #
# # # # # # # ###### # # #####
# # ####### # ####### # # # # #
# # # # # # # # # # # # #
###### # # # # # ###### ##### #####
echo " Downloading the latest version of the following artifacts:
* https://databus.dbpedia.org/dbpedia/generic/disambiguations
* https://databus.dbpedia.org/dbpedia/generic/redirects
* https://databus.dbpedia.org/dbpedia/mappings/instance-types
Note of deviation from original index_db.sh:
takes the direct AND transitive version of redirects and instance-types and the redirected version of disambiguation
"
cd $BASE_WDIR
QUERY="PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?file WHERE {
{
# Subselect latestVersion by artifact
SELECT ?artifact (max(?version) as ?latestVersion) WHERE {
?dataset dataid:artifact ?artifact .
?dataset dct:hasVersion ?version
FILTER (?artifact in (
# GENERIC
<https://databus.dbpedia.org/dbpedia/generic/disambiguations> ,
<https://databus.dbpedia.org/dbpedia/generic/redirects> ,
# MAPPINGS
<https://databus.dbpedia.org/dbpedia/mappings/instance-types>
# latest ontology, currently @denis account
# TODO not sure if needed for Spotlight
# <https://databus.dbpedia.org/denis/ontology/dbo-snapshots>
)) .
}GROUP BY ?artifact
}
?dataset dct:hasVersion ?latestVersion .
{
?dataset dataid:artifact ?artifact .
?dataset dcat:distribution ?distribution .
?distribution dcat:downloadURL ?file .
?distribution dataid:contentVariant '$LANGUAGE'^^xsd:string .
# remove debug info
MINUS {
?distribution dataid:contentVariant ?variants .
FILTER (?variants in ('disjointDomain'^^xsd:string, 'disjointRange'^^xsd:string))
}
}
} ORDER by ?artifact
"
# execute query and trim " and first line from result set
RESULT=$(curl --data-urlencode query="$QUERY" --data-urlencode format="text/tab-separated-values" https://databus.dbpedia.org/repo/sparql | sed 's/"//g' | grep -v "^file$")
# Download
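TMPDOWN="dump-tmp-download"
# -p: do not fail if the directory is left over from an earlier run
mkdir -p "$TMPDOWN"
cd "$TMPDOWN"
# RESULT is intentionally unquoted: one URL per whitespace-separated word
for i in $RESULT
do
  wget "$i"
done
cd ..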
echo "decompressing"
bzcat -v "$TMPDOWN"/instance-types*.ttl.bz2 > "$WDIR/instance_types.nt"
bzcat -v "$TMPDOWN"/disambiguations*.ttl.bz2 > "$WDIR/disambiguations.nt"
bzcat -v "$TMPDOWN"/redirects*.ttl.bz2 > "$WDIR/redirects.nt"
# clean
rm -r "$TMPDOWN"
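To make the "trim quotes and first line" step of the script easier to follow, here is a small offline sketch that runs the same sed and grep pipeline on a made-up response. The two URLs are placeholders, not real Databus files, and the function names sample_response and trim_result are only for illustration.

```shell
# Made-up TSV response: a quoted header line followed by quoted download URLs.
# The URLs are placeholders, not real Databus files.
sample_response() {
  printf '%s\n' '"file"' \
    '"https://example.org/redirects_lang=it.ttl.bz2"' \
    '"https://example.org/instance-types_lang=it.ttl.bz2"'
}

# Same trimming as in the script: strip the double quotes,
# then drop the header line that only contains the variable name "file".
trim_result() {
  sed 's/"//g' | grep -v "^file$"
}

sample_response | trim_result
```

What remains is one plain URL per line, which is the form the download loop expects.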
Hello, I merged your script locally and it works fine.
I still have problems running the whole script, in particular during the build, because of this error: Could not resolve dependencies for project com.diffbot.wikistatsextractor:wikistatsextractor:jar:0.1-SNAPSHOT
I need more time to study the project and resolve this problem.
This reply is only to let you know that I’m still working on this.
Hello @kurzum,
happy new year!
I’m back because I built two Spotlight models (it and en).
To do this, I changed index_db.sh following your suggestion, installed Oracle JDK 8u231 on Ubuntu (with OpenJDK I had a lot of Scala compilation errors) and ran the script.
For it, it took about 3 hours; for en, it took about 6 hours, on an Ubuntu server with 8 cores and 32 GB RAM.
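Since the build turned out to be sensitive to the JDK, a small guard before a multi-hour run might save time. The following is only a sketch: java_major is a made-up helper that extracts the major version from a version string, and the commented line for reading the string from java -version is an assumption about its usual output format.

```shell
# Hypothetical helper: print the major Java version for a version string.
# Old scheme: 1.8.0_231 -> 8; new scheme: 11.0.2 -> 11.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# Possible way to obtain the string (assumes the usual 'version "..."' line):
#   ver=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)

java_major 1.8.0_231
```

A wrapper script could compare the result against 8 and stop early with a clear message instead of failing somewhere in the middle of the compilation.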
How can I push the index_db.sh changes? What are the community guidelines for this?
Thanks