Consolidate Update Interval of DBpedia Spotlight

kurzum · November 15, 2019, 9:53am

@aashay225 thanks for volunteering.

We should start by regenerating the input data for Spotlight. These are called pre-compiled models: https://www.dbpedia-spotlight.org/faq The latest pre-compiled models are some years old: https://sourceforge.net/projects/dbpedia-spotlight/files/

Could you check whether the instructions under "Where can I find the tools to build the models?" still work?

The goals are:

1. Write a script for a cronjob that generates these models for all languages.
2. Write a script that deploys them on the Databus.
3. We will deploy 1 and 2 on a production server.
4. Create a configurable Spotlight Docker image that loads the specified model or models from the Databus and deploys Spotlight.
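
The first goal (a cron-driven build for all languages) could be sketched roughly as follows. This is only a sketch under assumptions: `BUILD_CMD`, the function name `build_all_models`, and the paths in the crontab comment are placeholders and not part of any existing Spotlight tooling. By default it only prints what it would build.

```shell
#!/usr/bin/env bash
# Sketch of a cron-driven wrapper that builds one model per language.
# BUILD_CMD is a placeholder (assumption): point it at the real build script.
BUILD_CMD="${BUILD_CMD:-echo would-build}"

# Build a model for every language code passed as an argument.
# A failure for one language is reported but does not stop the others.
build_all_models() {
    local failed=0 lang
    for lang in "$@"; do
        # BUILD_CMD is left unquoted on purpose so it may contain arguments.
        if $BUILD_CMD "$lang"; then
            echo "OK   $lang"
        else
            echo "FAIL $lang" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every language was built.
    [ "$failed" -eq 0 ]
}

# Hypothetical crontab entry (path and schedule are assumptions):
# run at 03:00 on the 1st of every month.
# 0 3 1 * * /opt/spotlight/build_all_models.sh >> /var/log/spotlight-build.log 2>&1
```

Calling `build_all_models it en` would then run the configured build command once per language and return a non-zero status if any of them failed, which keeps one broken language from blocking the rest.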

This way, anybody can build their own models with 1 and deploy whatever Spotlight language they need with 4.

To achieve 1, could you first try to build a Spotlight model for a smaller (non-English) language and then tell us the results, i.e. how long it took and how you did it?
Are you interested in any language in particular? Please send your public key if you need a server to run it, i.e. if your laptop is too small and you don’t have a server.

klaus82 · December 4, 2019, 10:41am

Hi all,
I would like to know whether @aashay225 is still working on this issue and, if so, how I can help with this task.

kurzum · December 4, 2019, 11:17am

@klaus82 I contacted him some days ago and got no answer. So I think you can just take over. Communication is very important; otherwise, things get blocked. Just post progress and questions here, and aashay or others can join in.

klaus82 · December 4, 2019, 11:47am

Thank you @kurzum.
Could you help me find some information on how to achieve point 1, for example for Italian (IT)?
Can I edit this script to generate the Spotlight model?
Thank you.

kurzum · December 4, 2019, 12:18pm

@klaus82 Seems about right. Could you check whether it runs as described?
I would be interested in what the output files are. I am sure they have other uses besides Spotlight.

klaus82 · December 4, 2019, 1:16pm

OK, I’ll try.
I’ll be back as soon as I have results.

klaus82 · December 5, 2019, 11:13am

Hi @kurzum,
I tried to run the script, but I got some errors while building extraction-framework/core.
I think the problem is related to the Hard point of this issue, but I need time to investigate.
I work on this in my spare time, so I can’t predict how long the fix will take.
We can keep in touch anyway; I’ll be back when I have some news.

kurzum · December 5, 2019, 10:32pm

Most of us do as well. People at the core do Knowledge Graph construction as research and can therefore dedicate more time due to synergies, but we all have other main projects. Speed is not an issue here; what matters more is pushing it consistently until the result is good.

I think we can remove the part of the script that relies on the extraction framework as software and instead use a Databus data dependency, i.e. download the latest files from the bus instead of generating them. That should be faster, too.

I attached a script that replaces the # DBpedia extraction: part.
It runs standalone. Before merging, you need to remove the first 5 lines:


#TODO REMOVE BEFORE MERGE
LANGUAGE="it" 
WDIR=.
BASE_WDIR=.

######     #    #######    #    ######  #     #  #####
#     #   # #      #      # #   #     # #     # #     #
#     #  #   #     #     #   #  #     # #     # #
#     # #     #    #    #     # ######  #     #  #####
#     # #######    #    ####### #     # #     #       #
#     # #     #    #    #     # #     # #     # #     #
######  #     #    #    #     # ######   #####   #####

echo " Downloading the latest version of the following artifacts: 
* https://databus.dbpedia.org/dbpedia/generic/disambiguations
* https://databus.dbpedia.org/dbpedia/generic/redirects
* https://databus.dbpedia.org/dbpedia/mappings/instance-types

Note of deviation from original index_db.sh: 
takes the direct AND transitive version of redirects and instance-types and the redirected version of disambiguation 
"
cd $BASE_WDIR

QUERY="PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?file WHERE {
    {
        # Subselect latestVersion by artifact
        SELECT ?artifact (max(?version) as ?latestVersion) WHERE {
            ?dataset dataid:artifact ?artifact .
            ?dataset dct:hasVersion ?version
            FILTER (?artifact in (
                # GENERIC
                <https://databus.dbpedia.org/dbpedia/generic/disambiguations> ,
                <https://databus.dbpedia.org/dbpedia/generic/redirects> ,
                # MAPPINGS
                <https://databus.dbpedia.org/dbpedia/mappings/instance-types>
                # latest ontology, currently @denis account
                # TODO not sure if needed for Spotlight
                # <https://databus.dbpedia.org/denis/ontology/dbo-snapshots>
            )) .
        } GROUP BY ?artifact
    }

    ?dataset dct:hasVersion ?latestVersion .
    {
        ?dataset dataid:artifact ?artifact .
        ?dataset dcat:distribution ?distribution .
        ?distribution dcat:downloadURL ?file .
        ?distribution dataid:contentVariant '$LANGUAGE'^^xsd:string .
        # remove debug info
        MINUS {
            ?distribution dataid:contentVariant ?variants .
            FILTER (?variants in ('disjointDomain'^^xsd:string, 'disjointRange'^^xsd:string))
        }
    }
} ORDER BY ?artifact
"

# Execute the query, then strip the quotes and the "file" header line from the TSV result
RESULT=$(curl --data-urlencode query="$QUERY" --data-urlencode format="text/tab-separated-values" https://databus.dbpedia.org/repo/sparql | sed 's/"//g' | grep -v "^file$")

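# Download every file from the result set into a temporary directory
TMPDOWN="dump-tmp-download"
mkdir -p "$TMPDOWN"
cd "$TMPDOWN"
# RESULT is left unquoted on purpose: it holds one URL per line
for i in $RESULT
do
	wget "$i"
done

cd ..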

echo "decompressing"
bzcat -v "$TMPDOWN"/instance-types*.ttl.bz2 > "$WDIR/instance_types.nt"
bzcat -v "$TMPDOWN"/disambiguations*.ttl.bz2 > "$WDIR/disambiguations.nt"
bzcat -v "$TMPDOWN"/redirects*.ttl.bz2 > "$WDIR/redirects.nt"

# clean
rm -r "$TMPDOWN"
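
A possible sanity check to run after the decompression step, as a minimal sketch: if the query returns no files for a language, the `bzcat` redirections still create (empty) output files, so testing for non-empty files catches that case before the model build starts. The function name `check_outputs` is made up for illustration; the three file names are the ones the script writes.

```shell
# Succeeds only if all three decompressed files exist and are non-empty.
# Usage: check_outputs "$WDIR" || exit 1
check_outputs() {
    local dir="$1" f missing=0
    for f in instance_types.nt disambiguations.nt redirects.nt; do
        # -s is true only for files that exist and have a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
}
```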

klaus82 · December 21, 2019, 4:27pm

Hello, I merged your script locally and it works fine.
I still have problems running the whole script, in particular during the build: Could not resolve dependencies for project com.diffbot.wikistatsextractor:wikistatsextractor:jar:0.1-SNAPSHOT
I need more time to study the project and resolve this problem.
This reply is just to let you know that I’m still working on this.

klaus82 · January 13, 2020, 2:19pm

Hello @kurzum,
happy new year!
I’m back because I have built two Spotlight models (it and en).
To do this, I changed index_db.sh following your suggestion, installed Oracle JDK 8u231 on Ubuntu (with OpenJDK I got a lot of Scala compilation errors), and ran the script.
For it, the build took about 3 hours and for en about 6 hours, on an Ubuntu server with 8 cores and 32 GB RAM.
How can I push the index_db.sh changes? What are the community guidelines for this?
Thanks

klaus82 · January 14, 2020, 1:34pm

I’ve just created the pull request to dbpedia-spotlight/model-quickstarter.

kurzum · January 20, 2020, 12:40pm

@klaus82 sorry for the late reply (holiday and then sick). I will try to look at it and merge it this week.

kurzum · February 29, 2020, 6:41am

Hi @klaus82,
@JulioNoe checked it and added another line:

We are waiting for Spotlight GitHub access so that we can merge.

klaus82 · March 2, 2020, 3:12pm

Ok thank you!