Spotlight: low similarity scores / confidence for the possessive form in German

Hi,

Consider this sentence in English:

query with confidence parameter 0.8:
“Germany’s coast is divided into the Baltic Sea and the North Sea.”
(Germany - similarityScore 0.977)

query with confidence parameter 0.85:
(Germany -> not detected)

Now the German sentence:

query with confidence parameter 0.4:
“Deutschlands Küste ist in die Ostsee und die Nordsee unterteilt.”
(Deutschland - similarityScore 0.885)

query with confidence parameter 0.45:
(Deutschland -> not detected)

The similarityScore for German is much lower than for English.

Another example:

“It’s also very beautiful on Sicily’s east coast.”
(Sicily - similarityScore 0.997)

“An Siziliens Ostküste ist es auch sehr schön.”
(Sizilien - similarityScore 0.894)

My questions:

  1. Why is the similarityScore for ‘Deutschland’ and ‘Sizilien’ quite low (compared to the English form)?

  2. Why are the confidence parameter thresholds so different? The German possessive form is no longer detected once the confidence parameter exceeds roughly 0.4, whereas the English form is still detected up to about 0.8.

I used the demo page with the ‘n-best candidates’ option to determine the similarity scores:
https://demo.dbpedia-spotlight.org/
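
The same comparison could also be run outside the demo page. The following is only a minimal sketch: it assumes the public web service at https://api.dbpedia-spotlight.org/{lang}/annotate with the `text` and `confidence` query parameters and a JSON response shaped like the demo output. The base URL and the helper names `build_annotate_url`, `annotate` and `print_detections` are assumptions for illustration and are not taken from the demo page.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public Spotlight web service (not stated in this thread).
BASE_URL = "https://api.dbpedia-spotlight.org"


def build_annotate_url(lang: str, text: str, confidence: float) -> str:
    """Build the GET URL for one annotate request (hypothetical helper)."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return f"{BASE_URL}/{lang}/annotate?{query}"


def annotate(lang: str, text: str, confidence: float) -> list:
    """Send the request and return the detected resources (possibly empty)."""
    request = urllib.request.Request(
        build_annotate_url(lang, text, confidence),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Fall back to an empty list in case the "Resources" key is missing.
    return payload.get("Resources", [])


def print_detections(samples: dict, thresholds: tuple) -> None:
    """Print surface form and similarityScore per language and threshold."""
    for lang, text in samples.items():
        for confidence in thresholds:
            found = [
                (r["@surfaceForm"], r["@similarityScore"])
                for r in annotate(lang, text, confidence)
            ]
            print(lang, confidence, found)
```

Calling `print_detections({"en": "Germany’s coast is divided into the Baltic Sea and the North Sea.", "de": "Deutschlands Küste ist in die Ostsee und die Nordsee unterteilt."}, (0.4, 0.45, 0.8, 0.85))` would then list, for each language, the thresholds at which the possessive form is still detected.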

thanks!

Hi @christian,

Thank you for your questions.

Why is the similarityScore for ‘Deutschland’ and ‘Sizilien’ quite low (compared to the English form)?

The result depends on the language model. The English language model contains more elements (tokens, URLs, surface forms, pairs, etc.) than the German language model. As a result, the candidate selection and disambiguation steps have different data to work with in each language, which leads to different scores.

Why are the confidence parameter thresholds so different? The German possessive form is no longer detected once the confidence parameter exceeds roughly 0.4, whereas the English form is still detected up to about 0.8.

For this question, the stemming algorithm is the most likely cause. We are working on improving this part of DBpedia-Spotlight. If you are interested in this topic, please visit this link for more details.

Thanks for your questions; both help us improve DBpedia-Spotlight. If you have any other questions, please don’t hesitate to post them in the forum.

Hi @JulioNoe,
thanks for your reply.

Good to hear that. Right now it seems hard to extract useful information when using a confidence parameter below 0.45.

Consider this example:

“Woher sie kommen, wohin sie gehen: Das Schicksal der Umsiedler”

This produces very strange results (confidence parameter 0.4):

{
    "@text": "Woher sie kommen, wohin sie gehen: Das Schicksal der Umsiedler",
    "@confidence": "0.4",
    "@support": "0",
    "@types": "",
    "@sparql": "",
    "@policy": "whitelist",
    "Resources": [
        {
            "@URI": "http://de.dbpedia.org/resource/Angela_Merkel",
            "@support": "4444",
            "@types": "Wikidata:Q386724,Wikidata:Q234460,Schema:CreativeWork,DBpedia:Work,DBpedia:WrittenWork",
            "@surfaceForm": "sie",
            "@offset": "6",
            "@similarityScore": "0.7618334327883661",
            "@percentageOfSecondRank": "0.11505440921756185"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Kosovo",
            "@support": "7953",
            "@types": "Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country",
            "@surfaceForm": "kommen",
            "@offset": "10",
            "@similarityScore": "0.9959636611260666",
            "@percentageOfSecondRank": "0.004031950540247078"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Angela_Merkel",
            "@support": "4444",
            "@types": "Wikidata:Q386724,Wikidata:Q234460,Schema:CreativeWork,DBpedia:Work,DBpedia:WrittenWork",
            "@surfaceForm": "sie",
            "@offset": "24",
            "@similarityScore": "0.7618334327883661",
            "@percentageOfSecondRank": "0.11505440921756185"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Gehen",
            "@support": "221",
            "@types": "",
            "@surfaceForm": "gehen",
            "@offset": "28",
            "@similarityScore": "0.9999856227980902",
            "@percentageOfSecondRank": "0.0"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Das_Schicksal",
            "@support": "9",
            "@types": "Wikidata:Q386724,Wikidata:Q11424,Schema:Movie,Schema:CreativeWork,DBpedia:Work,DBpedia:Film",
            "@surfaceForm": "Das Schicksal",
            "@offset": "35",
            "@similarityScore": "0.9999999996082067",
            "@percentageOfSecondRank": "0.0"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Siddhartha_Gautama",
            "@support": "2928",
            "@types": "Http://xmlns.com/foaf/0.1/Person,Wikidata:Q5,Wikidata:Q24229398,Wikidata:Q215627,DUL:NaturalPerson,DUL:Agent,Schema:Person,DBpedia:Agent,DBpedia:Person",
            "@surfaceForm": "der",
            "@offset": "49",
            "@similarityScore": "0.5831854446108904",
            "@percentageOfSecondRank": "0.4274700030982275"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Umsiedler",
            "@support": "657",
            "@types": "",
            "@surfaceForm": "Umsiedler",
            "@offset": "53",
            "@similarityScore": "0.9999639826191368",
            "@percentageOfSecondRank": "2.343138361030307E-5"
        }
    ]
}

Are these results related to the stemming algorithm? When I increase the confidence parameter to 0.45, these entries go away, but I lose the ability to detect the possessive form in German.
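
In the meantime, one possible stopgap is to keep the confidence parameter at 0.4 and filter the response on the client side. The following is a minimal sketch of such a filter, assuming that entity mentions in German text start with an uppercase letter. That is a hypothetical heuristic, not something Spotlight itself does, and the helper name `drop_lowercase_surface_forms` is made up for illustration. The sample data is a trimmed copy of the response above.

```python
def drop_lowercase_surface_forms(response: dict) -> list:
    """Keep only resources whose surface form starts with an uppercase letter.

    Assumption: German nouns and names are capitalised, so lowercase
    matches such as "sie" or "der" are unlikely to be real entities.
    """
    kept = []
    for resource in response.get("Resources", []):
        surface_form = resource.get("@surfaceForm", "")
        if surface_form[:1].isupper():
            kept.append(resource)
    return kept


# Trimmed copy of the response above (only the fields used here).
sample_response = {
    "@confidence": "0.4",
    "Resources": [
        {"@surfaceForm": "sie", "@URI": "http://de.dbpedia.org/resource/Angela_Merkel"},
        {"@surfaceForm": "kommen", "@URI": "http://de.dbpedia.org/resource/Kosovo"},
        {"@surfaceForm": "sie", "@URI": "http://de.dbpedia.org/resource/Angela_Merkel"},
        {"@surfaceForm": "gehen", "@URI": "http://de.dbpedia.org/resource/Gehen"},
        {"@surfaceForm": "Das Schicksal", "@URI": "http://de.dbpedia.org/resource/Das_Schicksal"},
        {"@surfaceForm": "der", "@URI": "http://de.dbpedia.org/resource/Siddhartha_Gautama"},
        {"@surfaceForm": "Umsiedler", "@URI": "http://de.dbpedia.org/resource/Umsiedler"},
    ],
}

for resource in drop_lowercase_surface_forms(sample_response):
    print(resource["@surfaceForm"], resource["@URI"])
# Das Schicksal http://de.dbpedia.org/resource/Das_Schicksal
# Umsiedler http://de.dbpedia.org/resource/Umsiedler
```

A capitalised possessive such as ‘Deutschlands’ would pass this filter, but so would any capitalised word at the start of a sentence, so it only reduces the noise and does not address the underlying scoring.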

Any improvement in this area would be very welcome :)

thanks!