I am performing some experiments in which I need to check, for each wikilink in the abstract of a Wikipedia page (English chapter), whether there is a triple in DBpedia linking the page and the mentioned wikilink.

I first implemented this by crawling Wikipedia and running some scripts against the DBpedia SPARQL endpoint, but I’m dealing with a large number of entities, and after running for some time the bot starts to fail, I guess because I am sending too many requests in a short period. So I’m reimplementing my scripts to work against local data dumps of Wikipedia and DBpedia.

At the moment, the only triples I need from DBpedia are the ones linking entities in the http://dbpedia.org/resource/ namespace.
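
To make the filter concrete, here is a minimal sketch of what I have in mind. It assumes the dump files are line-based N-Triples (one triple per line); the function names and the sample lines are hypothetical, only there for illustration.

```python
import re

RESOURCE_NS = "http://dbpedia.org/resource/"

# Matches a line whose subject, predicate and object are all IRIs.
# Triples with a literal object do not match and are skipped.
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')


def resource_links(lines):
    """Yield (subject, predicate, object) for triples linking two DBpedia resources."""
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match is None:
            continue  # comment, blank line, or literal-valued triple
        s, p, o = match.groups()
        if s.startswith(RESOURCE_NS) and o.startswith(RESOURCE_NS):
            yield s, p, o


def build_link_index(lines):
    """Map each subject IRI to the set of resource IRIs it links to."""
    index = {}
    for s, _p, o in resource_links(lines):
        index.setdefault(s, set()).add(o)
    return index


# Illustrative sample lines, not taken from a real dump.
sample = [
    '<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/country> <http://dbpedia.org/resource/Germany> .',
    '<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
    '<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://www.wikidata.org/entity/Q64> .',
]
index = build_link_index(sample)
```

With a real dump I would pass an open file handle (for example from `bz2.open(path, "rt", encoding="utf-8")`) instead of `sample`, and then look up each wikilink of a page in `index[page_iri]`.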

So my question is: which files am I supposed to download from the endpoint to get those triples?

My guess is that the following ones may be enough, but I’d like to confirm that I am neither missing relevant content nor processing too much information:

  • mappingbased-objects
  • persondata
  • Extracted facts from Wikipedia Infoboxes? (Is this one redundant?)

Also, I’ve explained my background problem in case someone can figure out a better approach/tool to solve it =)

Hi Daniel,
In the future, there will be a Namespace Mod (see the dbpedia/databus-mods repository on GitHub: How To, Mod Ontology and Reference Implementation) that can help to find RDF files containing IRIs from a specific namespace.

For now, though, you can start with your selection: it sounds good and should, in my opinion, contain all triples relevant to your problem.

To set up your own SPARQL endpoint with data published on the DBpedia Databus, you can use dbpedia/virtuoso-sparql-endpoint-quickstart (on GitHub), which creates a Docker image with Virtuoso preloaded with the latest DBpedia dataset. I think you already have experience with Databus collections, so you can use one as your input. Otherwise, you can write a custom query and pass it as a variable to the Docker container.
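
Once a local endpoint is running, the per-wikilink check could look roughly like the sketch below. It is only an illustration under a few assumptions: that the endpoint is reachable at `http://localhost:8890/sparql` (adjust this to your Docker setup), that a link in either direction counts as a match, and that you already have both IRIs. The function names `build_ask_query` and `has_link` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the local SPARQL endpoint; change to match your setup.
ENDPOINT = "http://localhost:8890/sparql"


def build_ask_query(page_iri, link_iri):
    """Build an ASK query that is true if any triple links the two IRIs."""
    for iri in (page_iri, link_iri):
        # Reject characters that would break out of the <...> IRI syntax.
        if any(ch in iri for ch in '<>" \n'):
            raise ValueError(f"not a safe IRI: {iri!r}")
    return (
        "ASK { "
        f"{{ <{page_iri}> ?p <{link_iri}> }} UNION "
        f"{{ <{link_iri}> ?p <{page_iri}> }} "
        "}"
    )


def has_link(page_iri, link_iri, endpoint=ENDPOINT):
    """Send the ASK query over HTTP GET and return the boolean answer."""
    params = urllib.parse.urlencode({"query": build_ask_query(page_iri, link_iri)})
    request = urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return bool(json.load(response)["boolean"])
```

Since the endpoint is local, the request limits you ran into on the public endpoint should no longer apply, although one request per wikilink may still be slower than scanning the dump files directly.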

