DBpedia outdated problem

moha92 · November 5, 2019, 2:40pm

Hey everyone,

I would like to know the purpose of using DBpedia (not DBpedia Live), given that the information it represents is outdated.
For example, at the resource http://dbpedia.org/page/United_States, dbo:leader still has the values of the previous Democratic administration (dbr:Barack_Obama, dbr:John_Roberts, dbr:Joe_Biden, dbr:Paul_Ryan).
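
The dbo:leader check can be reproduced with a short script. This is a minimal sketch, assuming the public SPARQL endpoint at https://dbpedia.org/sparql accepts `query` and `format` parameters; the helper names are illustrative, and the values returned depend on whichever release is loaded when the query runs.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request

ENDPOINT = "https://dbpedia.org/sparql"  # assumption: public endpoint of the online service


def leader_query(resource: str) -> str:
    """Return a SPARQL query listing the dbo:leader values of a resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        f"SELECT ?leader WHERE {{ <{resource}> dbo:leader ?leader }}"
    )


def request_url(query: str) -> str:
    """Encode the query as a GET request asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_leaders(resource: str) -> list[str]:
    """Run the query (needs network access) and return the leader IRIs."""
    url = request_url(leader_query(resource))
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return [b["leader"]["value"] for b in data["results"]["bindings"]]


# Usage (network required):
# fetch_leaders("http://dbpedia.org/resource/United_States")
```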

kurzum · November 6, 2019, 6:55pm

Ok, but this is the free online service, you can download the fresh data as I wrote here:

We will update it soon and also have a better schedule for updates of the online service. But you can also always host your own, it is easier now.

bdevel · November 8, 2019, 6:46am

I’m curious whether there have been talks about making DBpedia easier for people to edit and contribute to. Free edits could be supported if there were some consensus method in place, for example, a majority with a quorum.

This would greatly expand the depth of DBpedia.

kurzum · November 8, 2019, 8:15am

@bdevel You have to think bigger. Open Data and Linked Open Data are already there. So we allow contributions of whole datasets, which are then synced back to the source as well.
One example here is swissbib.ch, who contributed a mapping with which we could integrate DNB data.

So our contributions will take the form:

dataset in any format
dataset in RDF
Links and EquivalentProperty statements

Free edits are not effective, as this is data entered somewhere else. We will probably allow discussion of which source is the best.

See https://global.dbpedia.org/?s=https://en.wikipedia.org/wiki/Berlin

moha92 · November 8, 2019, 10:49am

According to Kingsley Uyi Idehen (CEO and Founder of OpenLink Software), enterprises such as Apple (via Siri), Google (via Freebase and Google Knowledge Graph), and IBM (via Watson) have used DBpedia in their projects. I’m wondering how they got reliable data.

kurzum · November 9, 2019, 7:03am

There is a trade off between coverage and quality. Nobody has all data in high/perfect quality. The generic dataset has high coverage and the mappings dataset higher consistency/correctness.

Also, DBpedia really shines in terms of structure, with its 8 taxonomies and ontologies and its high degree of connectivity. That is the real value here. The data as such is just a nice add-on.

moha92 · November 9, 2019, 11:30am

@kurzum I’m a PhD student working on the issue of outdated data in the LOD domain, compared to its data source, and I’m wondering whether DBpedia is a relevant case study.

kurzum · November 9, 2019, 3:43pm

What aspect are you working on?

DBpedia is being reengineered, and this reengineering is almost finished. It is subject to the normal innovation lifecycle, i.e. we have been doing the same thing for 10 years, but now we will do something better, because times have changed.

Databus allows publishing RDF dumps and deploying Linked Data from these dumps, and therefore makes LOD cheaper to persist. LOD being outdated means, at its core, that there is no economic model or incentive to keep it up to date. Are you working on that? Here are the slides from the last Community meeting, where 3 or 4 presentations were about a more maintainable LOD Cloud.
Meeting: https://wiki.dbpedia.org/events/14th-dbpedia-community-meeting-karlsruhe
Databus slides: http://tinyurl.com/dbpedia-databus-semantics-2019

moha92 · November 10, 2019, 11:21am

@kurzum I’m working on assessing DBpedia “freshness” with regard to property values, compared to Wikipedia.

kurzum · November 11, 2019, 7:34am

@moha92 The freshness depends on when and how you run the software.

The monthly releases are based on the dumps which are made by Wikimedia twice a month. Then the software needs 2-7 days to run depending on the module (generic/mappings/text/wikidata). Here you can run them yourself with some easy scripts: https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
If you run the extraction framework server you can do ad hoc extraction: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/de/extract?extractors=mappings&format=trix&title=Ethanol
Then the data is as up to date as the moment you make the request.
Finally, live.dbpedia.org processes each article.
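
The ad hoc extraction request above can be assembled programmatically. A minimal sketch: the host, path and parameter names (`extractors`, `format`, `title`) are copied from the example URL in this post, while the function name and the idea that other languages or titles follow the same pattern are assumptions.

```python
from urllib.parse import urlencode

# Host and port copied from the example URL; a self-hosted extraction
# server would use its own address.
BASE = "http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction"


def extraction_url(lang, title, extractors="mappings", fmt="trix"):
    """Build an ad hoc extraction URL for one article (hypothetical helper)."""
    query = urlencode({"extractors": extractors, "format": fmt, "title": title})
    return f"{BASE}/{lang}/extract?{query}"


# Reproduces the example request for the German article "Ethanol".
print(extraction_url("de", "Ethanol"))
```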

It would be more interesting to compare up-to-dateness between Wikipedia and Wikidata. Wikipedia should still be more up to date. You can use DBpedia for that as well, since we extract both. The data is here: https://databus.dbpedia.org/vehnem/flexifusion/prefusion/2019.11.01
There is also DNB and Musicbrainz in it.

moha92 · November 11, 2019, 11:43am

@kurzum In your view, is DBpedia Live more relevant than the DBpedia monthly release for assessing data freshness compared to Wikipedia? I have noticed a delay of a few weeks between Wikipedia and DBpedia Live for the same resource.
For example, the latest version of the movie Midway on DBpedia Live dates from 22/10/2019: http://live.dbpedia.org/page/Midway_(2019_film) , while on Wikipedia the latest revision of this article dates from 11/11/2019: https://en.wikipedia.org/w/index.php?title=Midway_(2019_film)&action=history
N.B.: This is not an isolated case.
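
To put a number on that delay, the two dates can be subtracted directly. A minimal sketch using only the dates quoted in this post; the function names are illustrative.

```python
from datetime import datetime


def parse_day(text):
    """Parse a DD/MM/YYYY date as written in this thread."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def lag_in_days(dbpedia_day, wikipedia_day):
    """Days by which the DBpedia Live version trails the Wikipedia revision."""
    return (parse_day(wikipedia_day) - parse_day(dbpedia_day)).days


# Midway (2019 film): DBpedia Live 22/10/2019 vs. Wikipedia 11/11/2019
print(lag_in_days("22/10/2019", "11/11/2019"))  # 20
```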

moha92 · November 11, 2019, 11:59am

@kurzum I just noticed that the latest changeset on DBpedia Live dates from 24/10/2019: http://dbpedia-live.openlinksw.com/live/
Isn’t this a problem for a version that is supposed to be synchronized with Wikipedia (with a delay of a few minutes at most)?

kurzum · November 11, 2019, 12:07pm

Yes, but I am not sure what exactly you intend to measure there. The DBpedia Association runs on 15k€ of membership fees yearly, which we mostly spend on travel grants so core community members can come to meetings.

Other than that, it is a free service, as-is, without guarantees. So when there is a hiccup, we need a couple of days/weeks to fix it.

Ideally, we would raise enough money to have a dedicated person maintaining the service, and then breakage or delay would no longer occur. So if you are measuring this, you are actually assessing an economic factor, i.e. DBpedia not having enough money and staff to keep this in pristine condition at all times and provide an SLA.

We could do a kickstarter…

Note that the delay here was caused by a wrongly configured log filling the boot partition, which prevented writing to downloads.dbpedia.org. This is being fixed, and we also added a check that lets us recognize such problems earlier. So it is improving, but only slowly.
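
How the check is actually implemented is not described here. Purely as an illustration of that kind of check, a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the 90% threshold are assumptions, not the DBpedia setup.

```python
import shutil


def used_fraction(path):
    """Fraction of the partition holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def partition_alert(path, threshold=0.9):
    """Return a warning string if the partition is at or above the threshold, else None."""
    fraction = used_fraction(path)
    if fraction >= threshold:
        return f"{path} is {fraction:.0%} full"
    return None


# Usage: run periodically (e.g. from cron) and notify someone on a warning.
warning = partition_alert("/")
if warning:
    print(warning)
```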

moha92 · November 11, 2019, 3:30pm

@kurzum I’m not looking at this through the eyes of a maintainer but of a user: the user cannot get reliable data from DBpedia Live during this delay.

kurzum · November 11, 2019, 4:26pm

Yes, but this is like any other open-source model:

use the free stuff as is
do-it-yourself, i.e. http://dev.dbpedia.org/DBpedia_Live
Volunteer to help maintain/improve the free service
Pay for a premium service with guarantees

Note that the updates have been fixed:
http://downloads.dbpedia.org/live/changesets/2019/

The “user” you are mentioning, does he expect to get everything for free? Like electricity, internet, database access, updates? If yes, he/she could be a communist or a free-rider

By the way, did you read about the update strategy when such a delay happens? I assume you have read the live documentation and papers, since you are a PhD student:

kurzum · November 12, 2019, 11:09am

Hm, maybe I understood the “outdated” topic better now. So there is always some technical time lag between Wikipedia and DBpedia as with all ETL-processes. Since data is extracted, transformed and loaded/dumped again, there is a delay and it is smaller in the ad-hoc and live extraction and greater in the dumps.

Is this what your PhD is about? Study delay in extraction processes?

moha92 · November 15, 2019, 2:46pm

@kurzum Yes, I based my work on this paper: Capturing the Currency of DBpedia Descriptions and Get Insight into their Validity, http://ceur-ws.org/Vol-1264/cold2014_RulaPPM.pdf
It studies DBpedia data freshness compared to the same data on Wikipedia. Do you have any related papers on this topic?

kurzum · November 15, 2019, 4:06pm

Hm, no, but maybe this helps: https://github.com/vrandezo/colabs/blob/master/Comparing_coverage_and_accuracy_of_DBpedia%2C_Freebase%2C_and_Wikidata_for_the_parent_predicate.ipynb

I already criticized Denny’s approach, since it is not an ongoing evaluation.

Also I don’t see yet how this will become a more systematic approach that shows where to optimize, but I still need to read it fully.

If you would like to do something useful, you could think of an evaluation scenario that we can run each month and that covers most of the properties. We also loaded Wikidata, DBpedia, DNB, Geonames and Musicbrainz into the comparison engine, so this would be useful for doing such a comparison:
http://dbpedia.informatik.uni-leipzig.de:9015/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2Feeb4&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Fcountry&src=general

But the user interface is still a work in progress. The data is already there, grouped by property: https://databus.dbpedia.org/vehnem/flexifusion/prefusion/2019.11.01
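
One possible shape for such a monthly evaluation is a per-property agreement rate between two sources. This is a minimal sketch under stated assumptions: the input is taken to be plain (subject, property, value) triples, which is not necessarily the layout of the prefusion files, and the identifiers in the example are made up.

```python
from collections import defaultdict


def index_values(triples):
    """Group values by (subject, property)."""
    idx = defaultdict(set)
    for subject, prop, value in triples:
        idx[(subject, prop)].add(value)
    return idx


def agreement_by_property(source_a, source_b):
    """For each property, the share of shared (subject, property) pairs
    whose value sets are identical in both sources."""
    a, b = index_values(source_a), index_values(source_b)
    same, total = defaultdict(int), defaultdict(int)
    for key in a.keys() & b.keys():
        prop = key[1]
        total[prop] += 1
        if a[key] == b[key]:
            same[prop] += 1
    return {prop: same[prop] / total[prop] for prop in total}


# Toy data with made-up identifiers, only to show the output format.
source_a = [("ex:A", "dbo:country", "ex:X"), ("ex:B", "dbo:country", "ex:Y"),
            ("ex:A", "dbo:leader", "ex:P")]
source_b = [("ex:A", "dbo:country", "ex:X"), ("ex:B", "dbo:country", "ex:Z"),
            ("ex:A", "dbo:leader", "ex:P")]
print(agreement_by_property(source_a, source_b))
```

Properties with a low agreement rate would be the first candidates for a closer look at freshness.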

moha92 · November 16, 2019, 7:24pm

So it took 3 weeks to release a new changeset? And the only solution is to hire a person responsible for keeping the service up to date? Couldn’t there be a technical solution?

kurzum · November 18, 2019, 7:25pm

The service runs by itself, of course, but if it stops, somebody needs to find the reason, fix it, and restart it. There is also the issue of merging the fixes from the master branch into the live branch, and then also writing and maintaining tests and improving overall robustness.

Right now, we have a small group of volunteers who fix any breakage when they have time; they all have normal jobs, so it sometimes takes 3 weeks.

If you want it really, really stable, feel free to clone the repo and set up your own live extraction; it is open source. Otherwise, the free service runs fine most of the time, but there is no guarantee.