This post shares a new PySpark application I developed for indexing DBpedia dumps into Elasticsearch. The application is open source under the Apache 2.0 license (https://github.com/vcutrona/elasticpedia), and the project was inspired by the original DBpedia indexer used by the Lookup service (https://github.com/dbpedia/lookup) - i.e., the same properties are mapped to the same fields used in Lookup.
So far, the application works only with fully-qualified .ttl files (like the actual DBpedia 2016-10 dump format). I also built a Docker image to help people lacking Spark expertise (a docker-compose example is available under the Usage section in the docs).
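To give an idea of what "fully-qualified .ttl" input looks like and how a property can be mapped to a Lookup-style field, here is a minimal, self-contained sketch in plain Python (not the actual ElasticPedia code, which does this with PySpark at scale); the field names and the mapping dict are illustrative assumptions:

```python
import re

# Matches one fully-qualified N-Triples line: <subj> <pred> object .
TRIPLE_RE = re.compile(r'<([^>]+)> <([^>]+)> (.+) \.\s*$')

# Hypothetical property-IRI -> field mapping; the real project follows
# the field names used by the DBpedia Lookup service.
FIELD_MAP = {
    "http://www.w3.org/2000/01/rdf-schema#label": "label",
    "http://dbpedia.org/ontology/abstract": "description",
}

def parse_triple(line):
    """Return (subject, field, value), or None if the line doesn't parse
    or its property is not in the mapping."""
    m = TRIPLE_RE.match(line)
    if not m:
        return None
    subj, pred, obj = m.groups()
    field = FIELD_MAP.get(pred)
    if field is None:
        return None
    # Strip quotes and an optional language tag from plain literals, e.g. "Rome"@en
    lit = re.match(r'"(.*)"(?:@[\w-]+)?$', obj)
    value = lit.group(1) if lit else obj.strip("<>")
    return subj, field, value

line = ('<http://dbpedia.org/resource/Rome> '
        '<http://www.w3.org/2000/01/rdf-schema#label> "Rome"@en .')
print(parse_triple(line))
# → ('http://dbpedia.org/resource/Rome', 'label', 'Rome')
```

In the real application the same kind of parsing and mapping happens on Spark RDDs/DataFrames before the documents are bulk-indexed into Elasticsearch.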
Hope many of you will find it useful!
The current version (1.0.1) is pretty stable: I tested it both locally (using Docker) and on a dedicated YARN cluster, successfully indexing the 2016-10 dump. I’m open to comments and suggestions, and bug reports and contributions are also welcome!