Spotlight: low similarity scores / confidence for possessive form in German

christian · November 19, 2020, 12:41pm

Hi,

Consider this sentence in English:

query with confidence parameter 0.8:
“Germany’s coast is divided into the Baltic Sea and the North Sea.”
(Germany - similarityScore 0.977)

query with confidence parameter 0.85:
(Germany -> not detected)

Now the German sentence:

query with confidence parameter 0.4:
“Deutschlands Küste ist in die Ostsee und die Nordsee unterteilt.”
(Deutschland - similarityScore 0.885)

query with confidence parameter 0.45:
(Deutschland -> not detected)

The similarityScore for German is much lower than for English.

Another example:

“It’s also very beautiful on Sicily’s east coast.”
(Sicily - similarityScore 0.997)

“An Siziliens Ostküste ist es auch sehr schön.”
(Sizilien - similarityScore 0.894)

My questions:

Why is the similarityScore for ‘Deutschland’ and ‘Sizilien’ quite low (compared to the English form)?
Why are the confidence parameter thresholds so different? The German possessive form is not detected if the confidence parameter is > 0.4 (for English, > 0.8).

I used the demo page with the ‘n-best candidates’ option to determine the similarity scores:
https://demo.dbpedia-spotlight.org/
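
For anyone who wants to repeat the comparison outside the demo page, here is a minimal sketch that only builds the request URL for a given language and confidence value. The base URL and the `text` / `confidence` parameter names are assumptions based on the public Spotlight REST service, and `build_annotate_url` is a hypothetical helper name; adjust them for a local instance.

```python
from urllib.parse import urlencode

# Assumed public endpoint; replace with the address of a local instance.
API_BASE = "https://api.dbpedia-spotlight.org"


def build_annotate_url(lang, text, confidence):
    """Build an annotate URL for one language model and confidence value."""
    query = urlencode({"text": text, "confidence": confidence})
    return f"{API_BASE}/{lang}/annotate?{query}"


if __name__ == "__main__":
    sentence = "Deutschlands Küste ist in die Ostsee und die Nordsee unterteilt."
    # Compare the two thresholds from the German example.
    for confidence in (0.4, 0.45):
        print(build_annotate_url("de", sentence, confidence))
```

Requesting the resulting URL with an `Accept: application/json` header should return the same kind of JSON as the demo page, assuming the service behaves as described.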

thanks!

JulioNoe · November 20, 2020, 1:54pm

Hi @christian,

Thank you for your questions.

Why is the similarityScore for ‘Deutschland’ and ‘Sizilien’ quite low (compared to the English form)?

The result depends on the language model. The English language model contains more elements (tokens, URLs, surface forms, pairs, etc.) than the German one, so the data available during candidate selection and disambiguation differs between the two languages.

Why are the confidence parameter thresholds so different? The German possessive form is not detected if the confidence parameter is > 0.4 (for English, > 0.8).

For this question, the stemmer algorithm is the most probable cause. We are working on improving this part of DBpedia-Spotlight. If you are interested in this topic, please visit this link for more details.

Thanks for your questions; both help us improve DBpedia-Spotlight. If you have any other questions, please don’t hesitate to post them in the forum. Thanks

christian · November 24, 2020, 11:27am

Hi @JulioNoe,
thanks for your reply.

Good to hear that. Right now it seems hard to extract useful information when using a confidence parameter < 0.45.

Consider this example:

“Woher sie kommen, wohin sie gehen: Das Schicksal der Umsiedler”

This produces very weird results (confidence parameter 0.4):

{
    "@text": "Woher sie kommen, wohin sie gehen: Das Schicksal der Umsiedler",
    "@confidence": "0.4",
    "@support": "0",
    "@types": "",
    "@sparql": "",
    "@policy": "whitelist",
    "Resources": [
        {
            "@URI": "http://de.dbpedia.org/resource/Angela_Merkel",
            "@support": "4444",
            "@types": "Wikidata:Q386724,Wikidata:Q234460,Schema:CreativeWork,DBpedia:Work,DBpedia:WrittenWork",
            "@surfaceForm": "sie",
            "@offset": "6",
            "@similarityScore": "0.7618334327883661",
            "@percentageOfSecondRank": "0.11505440921756185"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Kosovo",
            "@support": "7953",
            "@types": "Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country",
            "@surfaceForm": "kommen",
            "@offset": "10",
            "@similarityScore": "0.9959636611260666",
            "@percentageOfSecondRank": "0.004031950540247078"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Angela_Merkel",
            "@support": "4444",
            "@types": "Wikidata:Q386724,Wikidata:Q234460,Schema:CreativeWork,DBpedia:Work,DBpedia:WrittenWork",
            "@surfaceForm": "sie",
            "@offset": "24",
            "@similarityScore": "0.7618334327883661",
            "@percentageOfSecondRank": "0.11505440921756185"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Gehen",
            "@support": "221",
            "@types": "",
            "@surfaceForm": "gehen",
            "@offset": "28",
            "@similarityScore": "0.9999856227980902",
            "@percentageOfSecondRank": "0.0"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Das_Schicksal",
            "@support": "9",
            "@types": "Wikidata:Q386724,Wikidata:Q11424,Schema:Movie,Schema:CreativeWork,DBpedia:Work,DBpedia:Film",
            "@surfaceForm": "Das Schicksal",
            "@offset": "35",
            "@similarityScore": "0.9999999996082067",
            "@percentageOfSecondRank": "0.0"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Siddhartha_Gautama",
            "@support": "2928",
            "@types": "Http://xmlns.com/foaf/0.1/Person,Wikidata:Q5,Wikidata:Q24229398,Wikidata:Q215627,DUL:NaturalPerson,DUL:Agent,Schema:Person,DBpedia:Agent,DBpedia:Person",
            "@surfaceForm": "der",
            "@offset": "49",
            "@similarityScore": "0.5831854446108904",
            "@percentageOfSecondRank": "0.4274700030982275"
        },
        {
            "@URI": "http://de.dbpedia.org/resource/Umsiedler",
            "@support": "657",
            "@types": "",
            "@surfaceForm": "Umsiedler",
            "@offset": "53",
            "@similarityScore": "0.9999639826191368",
            "@percentageOfSecondRank": "2.343138361030307E-5"
        }
    ]
}

Are these results related to the stemming algorithm? When I increase the confidence parameter to 0.45 these entries go away, but I lose the ability to detect the possessive form in German.
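
One possible client-side workaround, sketched under assumptions: keep the low confidence value and post-filter the returned resources by surface form. Since German nouns are capitalized, this drops matches on lowercase words and on a small stopword list. The stopword list and the `filter_resources` helper are hypothetical and not part of Spotlight.

```python
# Hypothetical, deliberately tiny stopword list; a real one would be longer.
GERMAN_STOPWORDS = {"sie", "der", "die", "das", "er", "es", "und"}


def filter_resources(response, stopwords=GERMAN_STOPWORDS):
    """Return the resources whose surface form looks like a German noun phrase."""
    kept = []
    for resource in response.get("Resources", []):
        surface = resource["@surfaceForm"]
        # Drop articles and pronouns, also when capitalized at sentence start.
        if surface.lower() in stopwords:
            continue
        # German nouns are capitalized, so lowercase matches are suspicious.
        if not surface[:1].isupper():
            continue
        kept.append(resource)
    return kept


if __name__ == "__main__":
    # Surface forms taken from the response shown above.
    response = {
        "Resources": [
            {"@surfaceForm": "sie"},
            {"@surfaceForm": "kommen"},
            {"@surfaceForm": "gehen"},
            {"@surfaceForm": "Das Schicksal"},
            {"@surfaceForm": "der"},
            {"@surfaceForm": "Umsiedler"},
        ]
    }
    print([r["@surfaceForm"] for r in filter_resources(response)])
```

This does not fix the underlying scoring; it only hides the most obvious false positives, and it would also discard any genuine entity that happens to be written in lowercase.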

Any improvement in this area would be very welcome.

thanks!

JulioNoe · November 25, 2020, 11:55am

Hi @christian,

Thanks for your comments. We are working to improve DBpedia-Spotlight, and your suggestions are valuable. The stemmer algorithm is just one part of DBpedia-Spotlight and belongs to the language model.

The selection of named entities follows the process defined in the original paper and briefly explained on the DBpedia-Spotlight web page: Spotting, Candidate selection, Disambiguation, and Filtering.

In particular, the confidence value (or disambiguation confidence) determines how flexible the algorithm is when matching a named entity, with 1 being very strict and 0 very flexible. Quoting the original paper:

The rationale is that a confidence value of 0.7 will eliminate 70% of incorrectly disambiguated test cases.

So the main problem, as you mentioned, is the correct identification of the possessive form in German. Thanks again for your comments and for being specific about the problem; this will help us improve the results. If you have an idea for solving it, please don’t hesitate to share your proposal, or perhaps you can implement it yourself. Have a great day

My best regards

kurzum · December 2, 2020, 1:43pm

@christian a short note here. My colleague @bettinak wrote a paper on this a while back: https://link.springer.com/chapter/10.1007/978-3-319-73706-5_11
It is about the problems NER tools have with morphology; they have their issues there as well. NER is simpler than entity linking (which is what Spotlight does), so getting these inflected forms right is quite hard and generally an open research issue.

christian · January 22, 2021, 1:53pm

@kurzum interesting paper, thanks!

Let me quickly describe my use case: extracting geolocations from German text.

Now I wonder whether the generalized approach of Spotlight really fits my use case, or whether I should use a very simple, straightforward algorithm instead:

for each word in the text:
    query a local instance of DBpedia to check whether this word is a location
    if the word ends with ‘s’ (possessive form): remove the ‘s’ and query again

This should detect most of the locations with 100% confidence. Any thoughts on this approach?

thanks, christian

EDIT: the algorithm may need a sliding window to detect locations consisting of two words.
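
A minimal sketch of the word-by-word lookup described above, including the sliding window and the possessive ‘s’ check. The DBpedia query is replaced by a hypothetical in-memory set (`KNOWN_LOCATIONS`) so the example is self-contained; `is_location` is the place where a real lookup against a local instance would go.

```python
import re

# Hypothetical stand-in for "query local instance of dbpedia".
KNOWN_LOCATIONS = {"Deutschland", "Sizilien", "Ostsee", "Nordsee", "New York"}


def is_location(name):
    """Placeholder lookup; swap in a real query against a local DBpedia."""
    return name in KNOWN_LOCATIONS


def find_locations(text, max_words=2):
    """Scan the text with a sliding window of up to max_words tokens."""
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so "New York" wins over "New".
        for size in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if is_location(candidate):
                match = (candidate, size)
                break
            # Possessive form: strip a trailing 's' and look up again.
            if candidate.endswith("s") and is_location(candidate[:-1]):
                match = (candidate[:-1], size)
                break
        if match:
            found.append(match[0])
            i += match[1]
        else:
            i += 1
    return found


if __name__ == "__main__":
    print(find_locations("An Siziliens Ostküste ist es auch sehr schön."))
    print(find_locations("New Yorks Hafen ist groß."))
```

One caveat: an exact match is not the same as 100% confidence, because some place names are also ordinary words (‘Essen’ is both a city and the word for food), so the lookup alone cannot tell which sense is meant.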