Continuous extraction of live changesets

Hello DBpedia Folks!

I am continuing my journey through the French DBpedia Chapter, and today's stage is hosting the DBpedia Live process.

It involves three services: the extraction framework produces the changesets that the live-mirror needs to populate Virtuoso.
Before talking about it, I wanted to ask you a question about the changesets. The dumps used to be available here: Index of /live/changesets/.

  1. Is there a particular reason why you no longer provide these data dumps?

Coming back to the live part of the extraction framework, I have to report that many packages in the mvn dependencies need to be updated. I list them here (I could open issues if you think it useful):

  • socket.io-java-client is no longer available, but since the library is hosted at https://github.com/fatshotty/socket.io-java-client, it can still be pulled in through JitPack
  • mysql-connector-java, if like me you are working with version 8 of MySQL
  • fasterxml.Jackson.core: from 2.5 to 2.9.8
  • jackson-module-scala_2.11: from version 2.5.2 to 2.9.8

After these updates, it is possible to follow the instructions in the DBpedia dev bible:
http://dev.dbpedia.org/DBpedia_Live_Continuous_Extraction

But I think the documentation could be more explicit about how to configure it.
When I read your 2017 paper, I understood that the “update stream” mechanism was first handled through OAI, and that the choice was then made to migrate it to RCStream.

The live.ini file reflects that history by offering the choice between one and the other. Indeed, both can be configured:

;*********************
; OAI Configuration
;*********************

localApiURL = http://live.dbpedia.org/syncw/api.php


oaiUri = http://live.dbpedia.org/syncwiki/Special:OAIRepository
oaiPrefix = oai:live.dbpedia.org:dbpediawiki:
baseWikiUri = http://live.dbpedia.org/syncwiki/

mappingsOAIUri = http://mappings.dbpedia.org/index.php/Special:OAIRepository
mappingsOaiPrefix = oai:fr.wikipedia.org:frwiki:
mappingsBaseWikiUri = http://mappings.dbpedia.org/wiki/

and

;*********************
; FEEDERS
;*********************

feeder.rcstream.enabled = false
feeder.rcstream.room = fr.wikipedia.org
; Specify the namespace code of events you want to be processed
; Full list available at https://en.wikipedia.org/wiki/Wikipedia:Namespace
; Add at least namespace 6 "File:" to process files on commons.wikimedia.org
feeder.rcstream.allowedNamespaces = 0,10,14
; Specify how often the RCStream should try to reconnect (maxRetryCount)
; within a intervall of x minutes (maxRetryCountIntervall)
feeder.rcstream.maxRetryCount = 3
feeder.rcstream.maxRetryCountIntervall = 1

feeder.allpages.enabled = false
feeder.allpages.allowedNamespaces = 0,10,14

feeder.live.enabled = true
feeder.live.pollInterval = 3000
feeder.live.sleepInterval = 1000

feeder.mappings.enabled = true
#feeder.mappings.enabled = false
feeder.mappings.pollInterval = 3000
#feeder.mappings.pollInterval = 2000
feeder.mappings.sleepInterval = 1000

feeder.unmodified.enabled = true
feeder.unmodified.pollInterval = 2000
feeder.unmodified.sleepInterval = 1000

feeder.unmodified.minDaysAgo = 30
feeder.unmodified.chunk = 5000
feeder.unmodified.threshold = 500
feeder.unmodified.sleepTime = 30000

feeder.eventstreams.enabled = true
feeder.eventstreams.allowedNamespaces = 0,10,14
feeder.eventstreams.maxLineSize = 32768
feeder.eventstreams.maxEventSize = 65536
; see https://stream.wikimedia.org/?doc for documentation of the EventStreams API
feeder.eventstreams.baseURL = https://stream.wikimedia.org/v2/stream/
feeder.eventstreams.streams = recentchange
;sleeptime in milliseconds
feeder.eventstreams.sleepTime = 3000
feeder.eventstreams.minBackoffFactor = 2
feeder.eventstreams.maxBackoffFactor = 30
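
Since several feeders can be switched on at the same time, it can help to list which ones a given live.ini actually enables. The following is a minimal Python sketch and not part of the extraction framework: the function name `enabled_feeders`, the placeholder section name `live`, and the shortened sample are assumptions made for illustration.

```python
import configparser


def enabled_feeders(ini_text):
    """Return the names of feeders whose `feeder.<name>.enabled` value is true."""
    parser = configparser.ConfigParser(
        delimiters=("=",),            # values such as URLs contain ':'
        comment_prefixes=(";", "#"),  # live.ini uses both comment styles
        strict=False,                 # tolerate a key that appears twice
        interpolation=None,           # do not treat '%' as special
    )
    parser.optionxform = str          # keep key case unchanged
    # live.ini has no [section] header, so a placeholder one is prepended.
    parser.read_string("[live]\n" + ini_text)

    names = []
    for key, value in parser["live"].items():
        parts = key.split(".")
        is_enabled_key = (
            len(parts) == 3 and parts[0] == "feeder" and parts[2] == "enabled"
        )
        if is_enabled_key and value.strip().lower() == "true":
            names.append(parts[1])
    return names


# Shortened copy of the FEEDERS section shown above.
SAMPLE = """\
feeder.rcstream.enabled = false
feeder.allpages.enabled = false
feeder.live.enabled = true
feeder.mappings.enabled = true
#feeder.mappings.enabled = false
feeder.unmodified.enabled = true
feeder.eventstreams.enabled = true
"""

print(enabled_feeders(SAMPLE))
```

With the values shown above, `live`, `mappings` and `unmodified` are enabled alongside `eventstreams`, so more than one feeder would be started in the same run.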
  • Since the localApiURL no longer exists today, I enabled RCStream by switching the feeder.rcstream.enabled variable to “true”.
    I obtained this error:
    org.dbpedia.extraction.live.main.Main: An error in the RCStream connection occurred: Error while handshaking Trying to reconnect
  1. Do you understand why I can query the RCStream API from the GUI but not through the extraction framework?

While working on this question, I also read the following on the page “API:Recent changes stream - MediaWiki”:

"In 2017, wikitech:EventStreams was launched to expose arbitrary stream data over HTTP. This service replaces RCStream (described below)."
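
EventStreams delivers events as Server-Sent Events over HTTP, where each event carries a JSON payload on its `data:` lines. The sketch below is a hedged Python illustration of how such a stream could be filtered by wiki and namespace, mirroring the `allowedNamespaces` setting. It is not the framework's EventStreamsFeeder: the function names are hypothetical, the sample lines are invented, and the `server_name` and `namespace` field names are taken from the recentchange event schema as an assumption to verify against the EventStreams documentation.

```python
import json


def iter_sse_data(lines):
    """Yield the joined `data:` payload of each complete Server-Sent Event."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # SSE drops one leading space after the colon
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line terminates the current event.
            yield "\n".join(buffer)
            buffer = []


def filter_changes(lines, server_name, allowed_namespaces):
    """Yield decoded events for one wiki, restricted to the given namespaces."""
    for payload in iter_sse_data(lines):
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip payloads that are not valid JSON
        if (event.get("server_name") == server_name
                and event.get("namespace") in allowed_namespaces):
            yield event


# Invented sample standing in for lines read from the HTTP response.
SAMPLE_LINES = [
    "event: message\n",
    'data: {"server_name": "fr.wikipedia.org", "namespace": 0, "title": "Paris"}\n',
    "\n",
    'data: {"server_name": "en.wikipedia.org", "namespace": 0, "title": "London"}\n',
    "\n",
    'data: {"server_name": "fr.wikipedia.org", "namespace": 2, "title": "Utilisateur:X"}\n',
    "\n",
]

for change in filter_changes(SAMPLE_LINES, "fr.wikipedia.org", {0, 10, 14}):
    print(change["title"])
```

In a real consumer, `lines` would be the decoded line iterator of a streaming HTTP response instead of a list.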

So OK, the API is out of date. To fix it, I tried using the following API:
localApiURL = Aide de l’API MediaWiki — Wikipédia

  1. Is this a good way to fix it?

After this modification, however, I got the following errors:

2022-01-28 12:21:31,790 [main] WARN  EventStreamsFeeder: Resuming from date: 1970-01-20T00:28:22Z
[Fatal Error] :6:3: The element type "hr" must be terminated by the matching end-tag "</hr>".
2022-01-28 12:21:31,965 [Feeder_FeederLive] WARN  org.dbpedia.extraction.live.util.iterators.OAIRecordIterator: org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 3; The element type "hr" must be terminated by the matching end-tag "</hr>".
	at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
	at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
	at ORG.oclc.oai.harvester2.verb.HarvesterVerb.harvest(HarvesterVerb.java:260)
	at ORG.oclc.oai.harvester2.verb.HarvesterVerb.<init>(HarvesterVerb.java:183)
	at ORG.oclc.oai.harvester2.verb.ListRecords.<init>(ListRecords.java:52)
	at org.dbpedia.extraction.live.util.iterators.OAIRecordIterator.prefetch(OAIRecordIterator.java:97)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.preparePrefetch(PrefetchIterator.java:40)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.getCurrent(PrefetchIterator.java:50)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.hasNext(PrefetchIterator.java:57)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.getCurrent(PrefetchIterator.java:49)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.next(PrefetchIterator.java:62)
	at org.dbpedia.extraction.live.util.iterators.XPathQueryIterator.prefetch(XPathQueryIterator.java:37)
	at org.dbpedia.extraction.live.util.iterators.XPathQueryIterator.prefetch(XPathQueryIterator.java:20)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.preparePrefetch(PrefetchIterator.java:40)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.getCurrent(PrefetchIterator.java:50)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.hasNext(PrefetchIterator.java:57)
	at org.apache.commons.collections15.iterators.TransformIterator.hasNext(TransformIterator.java:79)
	at org.dbpedia.extraction.live.util.iterators.DuplicateFeederItemRemoverIterator.prefetch(DuplicateFeederItemRemoverIterator.java:41)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.preparePrefetch(PrefetchIterator.java:40)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.getCurrent(PrefetchIterator.java:50)
	at org.dbpedia.extraction.live.util.iterators.PrefetchIterator.next(PrefetchIterator.java:62)
	at org.dbpedia.extraction.live.feeder.OAIFeeder.getNextItems(OAIFeeder.java:56)
	at org.dbpedia.extraction.live.feeder.Feeder.run(Feeder.java:111)
  1. Did you already have to deal with this XML parsing error?
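
The message itself suggests that the OAI harvester received an HTML page rather than an OAI-PMH XML document: `<hr>` is a void element in HTML, and a strict XML parser rejects it when no closing tag follows. A minimal Python sketch of a check that could be run on the body returned by the configured OAI endpoint is given below. The function name, the return labels and both sample bodies are hypothetical; only the namespace URI comes from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def classify_response(body):
    """Classify a response body as 'oai-pmh', 'other-xml' or 'not-xml'."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Typical for HTML pages: void tags such as <hr> are never closed.
        return "not-xml"
    if root.tag == "{%s}OAI-PMH" % OAI_NS:
        return "oai-pmh"
    return "other-xml"


# Hypothetical HTML error page of the kind a web server may return.
HTML_ERROR_PAGE = """<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>server</center>
</body>
</html>"""

# Hypothetical, heavily shortened OAI-PMH response.
OAI_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<responseDate>2022-01-28T12:21:31Z</responseDate>"
    "</OAI-PMH>"
)

print(classify_response(HTML_ERROR_PAGE))
print(classify_response(OAI_RESPONSE))
```

If the body fetched from the endpoint is classified as `not-xml`, the parsing error would come from the endpoint (a moved or removed OAI repository) rather than from the iterator code in the stack trace.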

As usual, I still have many questions to ask you, but I will stop here for now and wait for your first feedback.

My best regards,

Célian

I think it makes sense to schedule a meeting with @kurzum and/or @hopver, because a lot of development effort went into creating DBpedia Live 2.0 (https://www.dbpedia.org/resources/live/dbpedia-live-sync/). The code you are trying to use is very old and does not seem reliable, and its architecture causes a lot of problems w.r.t. the coevolution of the extraction software and the actual live feeding mechanisms.
I also think it is not a good idea to update the dependencies first while you are still trying to get something running, since you can then no longer tell whether it fails because of the updates.

I can also contribute a few more details, but beyond that I cannot offer further help:

A general note: the OAI-PMH machinery is used to fetch mapping changes from the DBpedia mappings wiki (no longer from Wikipedia itself). I would suggest disabling it for now, since changes to mappings can also be reflected simply by re-extracting, which will eventually happen after a configurable number of days.

I am not sure whether the migration to RCStream was pushed entirely to master; see the following branch: https://github.com/dbpedia/extraction-framework/tree/live-deployed

Hello @jfrey!
But going through the sync API means that we could not get the raw data to apply our own custom extraction workflow, right?
From the presentation page, I understood that this API is later intended for commercial use. Is that also the case for the language chapters? We understand that hosting this kind of service has a cost, which we are ready to bear :slight_smile:
I tried to work on the branch you indicated, but the socket.io package is, as you said, completely out of date, which is why the RCStreamFeeder class needs to be rewritten…
We would be very happy to discuss this with your team, especially since we have many questions about DBpedia Live that are outside the scope of this ticket!

I suggest just contacting @kurzum. I am sick at the moment, and I was only involved in 2019 and partially in 2020, when it was a student project. Maybe, as a chapter, you could host your own sync API for French only and optionally commercialize it on your own terms, in alignment with the Association.

RCStreamFeeder is deprecated.
This is the new feeder for the recent changes (event) stream (not to be confused with RCStream).
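
One detail from the EventStreamsFeeder log earlier in this thread may be worth checking when moving to this feeder: the warning “Resuming from date: 1970-01-20T00:28:22Z”. A date in January 1970 is what an epoch value looks like when a number of seconds is read as milliseconds. This is only one possible reading and is not confirmed anywhere in the thread. The Python sketch below works through the arithmetic; the helper name `as_utc` is hypothetical.

```python
from datetime import datetime, timezone


def as_utc(epoch_value, unit):
    """Format an epoch value, given in 's' or 'ms', as an ISO 8601 UTC string."""
    seconds = epoch_value / 1000 if unit == "ms" else epoch_value
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# The date printed in the log, turned back into a millisecond count.
logged = datetime(1970, 1, 20, 0, 28, 22, tzinfo=timezone.utc)
raw_value = int(logged.timestamp() * 1000)

print(raw_value)
print(as_utc(raw_value, "ms"))  # same number read as milliseconds
print(as_utc(raw_value, "s"))   # same number read as seconds
```

Read as seconds, the same number falls on 27 January 2022, the day before the log entry, which would be a plausible “last processed” time. If that reading is right, the stored resume value and the code reading it disagree on the unit.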
