Text (long and short abstract) extraction updates

We are still having trouble providing text extraction.

with live

Our primary strategy was to run DBpedia Live for abstracts. There is an unmodified feeder, which updates any record older than 30 days, so by now every record should have an up-to-date short/long abstract triple. However, there is this bug: https://github.com/dbpedia/extraction-framework/issues/584 and we still have 16091283 old records: select count(*) from DBPEDIALIVE_CACHE where updated <= now() - interval 30 day;
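The stale-record count from the query above can be reproduced on a small scale. The following is only a sketch: it uses an in-memory SQLite table as a stand-in for the real cache database, the `pageID` column and the sample rows are hypothetical, and the fixed `now` value is arbitrary so that the result is reproducible. Only the table name `DBPEDIALIVE_CACHE`, the `updated` column, and the 30-day threshold come from the text.

```python
import sqlite3
from datetime import datetime, timedelta


def count_stale(conn, now, max_age_days=30):
    """Count cache rows whose `updated` timestamp is at least max_age_days old."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat(sep=" ")
    row = conn.execute(
        "SELECT COUNT(*) FROM DBPEDIALIVE_CACHE WHERE updated <= ?", (cutoff,)
    ).fetchone()
    return row[0]


# In-memory stand-in for the live cache; `pageID` is a hypothetical column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DBPEDIALIVE_CACHE (pageID INTEGER PRIMARY KEY, updated TEXT)"
)

now = datetime(2019, 6, 30, 12, 0, 0)  # arbitrary fixed "now" for the example
sample_rows = [
    (1, now - timedelta(days=45)),  # stale
    (2, now - timedelta(days=31)),  # stale
    (3, now - timedelta(days=2)),   # fresh
]
conn.executemany(
    "INSERT INTO DBPEDIALIVE_CACHE VALUES (?, ?)",
    [(page_id, ts.isoformat(sep=" ")) for page_id, ts in sample_rows],
)

stale_count = count_stale(conn, now)
print(stale_count)
```

If the feeder were keeping up, this count would stay close to zero. A count that does not shrink over time is what points to the bug linked above.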

The Live branch https://github.com/dbpedia/extraction-framework/tree/live-deployed contains an updated version of: <extractor name="org.dbpedia.extraction.mappings.AbstractExtractorWikipedia" status="ACTIVE" languages="en,ar,eu,ca,cs,nl,eo,fr,el,de,id,ga,it,ja,ko,pl,pt,es,sv,uk"></extractor>

AbstractExtractorWikipedia requests each page's HTML from Wikipedia and extracts the abstract from that HTML, so it should have the best data.
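To illustrate the general idea of taking an abstract from HTML rather than from wiki syntax, here is a minimal sketch. It is not the logic of AbstractExtractorWikipedia: the class `FirstParagraphParser`, the function `first_paragraph`, and the sample markup are all made up for illustration, and the sketch assumes the abstract is simply the first non-empty `<p>` with footnote markers in `<sup>` dropped.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first non-empty <p>, skipping <sup> content."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # currently inside a <p>
        self.skip = 0       # nesting depth of <sup> (footnote markers)
        self.done = False   # first non-empty paragraph already captured
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p" and not self.in_p:
            self.in_p = True
        elif self.in_p and tag == "sup":
            self.skip += 1

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "p" and self.in_p:
            self.in_p = False
            if "".join(self.parts).strip():
                self.done = True
            else:
                self.parts = []  # empty paragraph, keep looking
        elif tag == "sup" and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.in_p and not self.done and not self.skip:
            self.parts.append(data)


def first_paragraph(html):
    """Return the whitespace-normalized text of the first non-empty <p>."""
    parser = FirstParagraphParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Made-up sample markup, not a real Wikipedia response.
sample_html = (
    '<div><p class="mw-empty-elt"></p>'
    '<p><b>Berlin</b> is the capital of <a href="/wiki/Germany">Germany</a>.'
    "<sup>[1]</sup></p><p>Second paragraph.</p></div>"
)
abstract = first_paragraph(sample_html)
print(abstract)
```

Working on HTML means templates and links are already expanded, which is one plausible reason the HTML-based extractor gives cleaner text than an extractor that parses wiki syntax. The cost is one request per page.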

So one solution would be to fix this issue; then it is easy to dump the database whenever we want.

with extraction framework

A second strategy would be to run the framework once a month. There is an AbstractExtractor.scala in https://github.com/dbpedia/extraction-framework/tree/master/core/src/main/scala/org/dbpedia/extraction/mappings

We haven't had time yet to look into it in much detail, but it seems that:

  • I think it uses the wiki syntax.
  • Moving the framework to Apache Spark might have broken it. I tried running it, but it seems broken. We wrote a dump CI under dump, which uses selected articles in a minidump (mvn test). I added AbstractExtractor to the config, but it extracts nothing.
  • I didn't test AbstractExtractorWikipedia, as it does a request for each page. It could work, however. This HTML extractor has much better data quality.

Any help is appreciated. We can also give access to the live-abstracts server, if somebody wants to debug there.

@m1ci I got the extraction running on a minidump; see the fixing abstract extraction branch.
We might deploy it for de, nl, and en for now, as input for lhd.