Obtaining name property in non-English language

manuleo · April 23, 2021, 3:21pm

Dear DBpedia community,

I am doing some research on entity matching, and most of the Knowledge Graphs I’m exploring are based on DBpedia. Specifically, I’m now working on the multilingual datasets and I’m trying to align the English version of DBpedia with the French, German, and Japanese versions of DBpedia.

During the analysis of such datasets, I discovered that most of the non-English entities have a property called “name” (specifically, xmlns.com/foaf/0.1/name), which is most of the time in English. For example, (http://ja.dbpedia.org/resource/レッドブル・ザルツブルク, http://xmlns.com/foaf/0.1/name, “FC Red Bull Salzburg”) has an English name and even if I look at all the attributes of this entity, there is none in Japanese which would represent a name.

Hence, my question is: is there a way to obtain the Japanese name property for all the entities without looking at the Id?

Thank you so much for your help.

manuleo · May 10, 2021, 8:43am

Dear community,

During my analysis of this issue, I made some new and interesting findings about the multilingualism of the aforementioned datasets. More specifically, what I have discovered is:

Many attributes have “English” literals, even in non-English datasets, and also the other way around.
The language tag is not useful for these attributes as it is often incorrect.

Hence, the problem is not limited to the “name” attribute but is spread across all the attributes that make up a dataset. Below are a few examples that illustrate the issue:
Japanese DBpedia
<http://ja.dbpedia.org/resource/アムステルダム大学> <http://xmlns.com/foaf/0.1/name> “Athenaeum Illustre”@ja .
<http://ja.dbpedia.org/resource/ウォルター・ベッカー> <http://xmlns.com/foaf/0.1/givenName> “Walter Carl Becker”@ja .
French DBpedia
<http://fr.dbpedia.org/resource/Chris_Claremont> <http://dbpedia.org/ontology/homage> “Comics Buyer’s Guide” .
<http://fr.dbpedia.org/resource/Michigan> <http://xmlns.com/foaf/0.1/nick> “The Wolverine State, The Great Lakes State”@fr .
German DBpedia
<http://de.dbpedia.org/resource/Lakeway> <http://dbpedia.org/ontology/cityType> “City” .
<http://de.dbpedia.org/resource/New_Bedford_(Massachusetts)> <http://xmlns.com/foaf/0.1/nick> “The Whaling City”@de .
English DBpedia
<http://dbpedia.org/resource/Kanako_Maeda> <http://xmlns.com/foaf/0.1/name> “前田夏菜子”@en .
<http://dbpedia.org/resource/Bullet,_Switzerland> <http://dbpedia.org/ontology/demonym> “Lè Pi-Bot”@en .
<http://dbpedia.org/resource/North_Rhine-Westphalia> <http://xmlns.com/foaf/0.1/name> “Nordrhein-Westfalen”@en .
To make this explicit: the literal “City” for the “cityType” attribute of the entity “Lakeway” in the German DBpedia is clearly not in German, but in English.
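
To quantify how often this happens, one could tally the language tags per predicate directly from the dump. The following is only a minimal sketch: it assumes the triples are available as N-Triples lines with straight quotes, it ignores typed literals (`^^<...>`), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches one N-Triples line whose object is a plain or language-tagged literal.
# Simplification (assumption): typed literals with a ^^<datatype> suffix are skipped.
LITERAL_TRIPLE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+"(?P<o>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z][A-Za-z0-9-]*))?\s*\.\s*$'
)


def parse_literal_triple(line):
    """Return (subject, predicate, literal, lang) or None if the line does not match."""
    match = LITERAL_TRIPLE.match(line.strip())
    if match is None:
        return None
    return match.group("s"), match.group("p"), match.group("o"), match.group("lang")


def tag_counts_by_predicate(lines):
    """Count language tags per predicate; untagged literals are counted as '(none)'."""
    counts = {}
    for line in lines:
        parsed = parse_literal_triple(line)
        if parsed is None:
            continue
        _, predicate, _, lang = parsed
        counts.setdefault(predicate, Counter())[lang or "(none)"] += 1
    return counts


# Two of the triples quoted above, rewritten with straight quotes.
sample = [
    '<http://ja.dbpedia.org/resource/アムステルダム大学> '
    '<http://xmlns.com/foaf/0.1/name> "Athenaeum Illustre"@ja .',
    '<http://de.dbpedia.org/resource/Lakeway> '
    '<http://dbpedia.org/ontology/cityType> "City" .',
]

for predicate, tags in tag_counts_by_predicate(sample).items():
    print(predicate, dict(tags))
```

Note that this only reports which tags are present. It cannot tell whether a tag is right, which is exactly the problem described above.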

My question
The question is now more general with respect to what I asked before: is there a way to get the literals corresponding to entity attributes in the “correct” source language (i.e., the one of the dataset) without using any external translation or language recognition tools?

Thank you so much for your help. Looking forward to discussing this.

kurzum · May 10, 2021, 2:10pm

Dear @manuleo,
your post is very difficult to answer as we have no idea which sources you are talking about. In general, monthly dumps are produced, which all have a databus identifier, see release-dashboard or latest core or popular datasets. It would be better to have the exact version here, as the deployed instances at de.dbpedia.org or ja.dbpedia.org are all from different years. There is also a live tool where you can check the current state: DBpedia Test Extractors.

As I see it, you are mostly using data from the mappings extraction. In this extraction, the data is already normalized quite a lot, e.g. it is mapped to FOAF and the /ontology/ namespace. For your use case, I would look at the generic labels and generic facts.

See e.g. here dief.tools.dbpedia.org/server/extraction/fr/extract?title=Michigan&revid=&format=turt&extractors=custom

Subject	Predicate	Object	Triple Provenance
http://fr.dbpedia.org/resource/Michigan	surnom français	L’État du carcajou », « L’État des Grands Lacs	http://fr.wikipedia.org/wiki/Michigan?oldid=179586138#absolute-line=10&template=Infobox_État_des_États-Unis&property=surnom_français
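
For the labels mentioned above, a lookup could be sketched roughly as follows. This is an illustration only: it assumes the labels are exposed as `rdfs:label` with a language tag and that a SPARQL endpoint with this data is reachable. The endpoint URL in the usage lines is an assumption, and no request is actually sent.

```python
from urllib.parse import urlencode

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def build_label_query(resource_iri, lang):
    """Build a SPARQL query for the rdfs:label values of one resource in one language."""
    return (
        "SELECT ?label WHERE {\n"
        f"  <{resource_iri}> <{RDFS_LABEL}> ?label .\n"
        f'  FILTER(langMatches(lang(?label), "{lang}"))\n'
        "}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request URL (standard SPARQL protocol 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


query = build_label_query(
    "http://ja.dbpedia.org/resource/レッドブル・ザルツブルク", "ja"
)
print(query)
# Assumed endpoint; which dump version it serves depends on the deployment.
print(build_request_url("https://ja.dbpedia.org/sparql", query))
```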

manuleo · May 10, 2021, 2:47pm

Dear @kurzum,

Thank you for your answer!

I’m sorry I forgot to mention the source we are talking about, so I’ll give you a bit more context in the hope that you can help me. I am currently using the 2016-10 dump from DBpedia (which is the one usually used in the literature for this problem), and yes, I’m using the mapping extraction (i.e. the mappingbased_literals files from that dump). I have always used the mapping namespace because it is cleaner to use (as advised on the download page) and avoided the generic facts because they contain too much dirty information.

What I am looking for is then a way to understand inside this dump/mapping space which attributes are in the source (correct) language and which ones are instead in an incorrect language without using any external tool. Do you think that something like that would be achievable?
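
For the Japanese dump specifically, a standard-library-only check on the writing system gives a rough first cut. This is a heuristic sketch, not a real answer to the question: it is itself a crude form of language recognition, it cannot separate French or German from English (all Latin script), and kanji-only strings could equally be Chinese. The function names are hypothetical.

```python
import unicodedata

# Unicode character-name prefixes treated as "Japanese script" (assumption:
# kanji are accepted even though they are shared with Chinese).
JAPANESE_SCRIPT_PREFIXES = ("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")


def has_japanese_script(text):
    """Return True if at least one character is kana or a CJK ideograph."""
    return any(
        unicodedata.name(ch, "").startswith(JAPANESE_SCRIPT_PREFIXES)
        for ch in text
    )


def split_by_script(literals):
    """Split literals into (contains Japanese script, everything else)."""
    native, other = [], []
    for literal in literals:
        (native if has_japanese_script(literal) else other).append(literal)
    return native, other


native, other = split_by_script(
    ["レッドブル・ザルツブルク", "FC Red Bull Salzburg", "Walter Carl Becker"]
)
print("Japanese script:", native)
print("Other script:", other)
```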

Thank you again for your huge help. Please let me know if something is unclear and/or if you need additional information to help me solve this issue.

kurzum · May 10, 2021, 5:34pm

@manuleo well, the main goals of DBpedia are to:

1. reflect Wikipedia’s data with normalization
2. serve as a linking space for external data

Regarding 1, I can only refer you to the mappings wiki. There is the specification of how the mappings are done. There is a property called wikiPageUsesTemplate, e.g. for Michigan EN, this mapping is used: Mapping en:Infobox U.S. state - DBpedia Mappings
You can get an account and edit it there. Each month you will get better data with these mappings. I am not sure whether you can specify a language tag. You would need to check the manual: DBpedia Mappings

Not sure, as I don’t think there is a correct language. This seems to be a localization or i18n issue. There is transcription, and then there are words that might be correct in a language. “USA”@de would be correct for me, as we use it in spoken and written German even though it is an English abbreviation. You could try to edit the mappings. I think there is a validate or test button that will show you the effects of your edits.

I think that many datasets with correct localization link to us. This is actually the recommended use of DBpedia, i.e. follow links and get more data from Linked Data. You could check CLDR http://cldr.unicode.org/ or JRC-Names - European Commission

kurzum · May 11, 2021, 6:31am

A comment on this. Our advertisement might be misleading here. The generic facts are a treasure trove of information (with some clean up). Normally, you would want to operate in mixed mode, i.e. use mappings and then add some more from generic.