How to extract these values from DBpedia programmatically?

sanyides · April 21, 2020, 1:57pm

Hello, I would like to create two indexes (from the English-language DBpedia) and store them in binary format:
1: keys are a form (the string containing the DBpedia page name) and the values are the list of meanings (DBpedia URIs) for that form.
2: keys are DBpedia URIs and values are a pair consisting of a label of the DBpedia page and the shortened abstract for the Wikipedia article.

Is there any library or method you suggest for extracting this information programmatically? By default, I’d use Java but I’m also open to other languages.

kurzum · April 22, 2020, 7:04am

Hi @sanyides,
the second one is easy, use:
https://databus.dbpedia.org/dbpedia/text/short-abstracts/
and
https://databus.dbpedia.org/dbpedia/generic/labels/
and build the index.
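A minimal sketch of building this second index in Python, under several assumptions that should be checked against the actual files: the dumps are line-based (one triple per line, N-Triples style), labels use rdfs:label, and short abstracts use rdfs:comment. The helper names (`literal_triples`, `build_uri_index`, `save_index`) are hypothetical, and the regex only handles simple literal triples without unescaping `\u` sequences, so it is not a full RDF parser.

```python
import pickle
import re

# Assumed predicates; verify them against the downloaded files.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"

# Matches '<subject> <predicate> "literal"@lang .' (or a typed literal).
# Lines with any other shape are skipped.
LITERAL_TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]*>)?\s*\.\s*$'
)


def literal_triples(lines):
    """Yield (subject URI, predicate URI, literal text) for matching lines."""
    for line in lines:
        match = LITERAL_TRIPLE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


def build_uri_index(label_lines, abstract_lines):
    """Map each URI to a (label, short abstract) pair; missing parts are None."""
    index = {}
    for subject, predicate, text in literal_triples(label_lines):
        if predicate == RDFS_LABEL:
            index[subject] = (text, None)
    for subject, predicate, text in literal_triples(abstract_lines):
        if predicate == RDFS_COMMENT:
            label = index.get(subject, (None, None))[0]
            index[subject] = (label, text)
    return index


def save_index(index, path):
    """Store the index in a binary file using pickle."""
    with open(path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
```

Both arguments of `build_uri_index` can be any iterable of lines, for example an open file handle, so the input files are read line by line. The resulting dictionary itself still lives in memory, which may or may not be acceptable for the full English dump.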

For the first one, you would need to aggregate some data, e.g. from:

kurzum · April 22, 2020, 7:13am

Are you asking how to download the packages from the Databus automatically, how to get the values out of the files, or which files to take?

sanyides · April 22, 2020, 10:31am

Thanks for the quick answer. After doing some research, I think these are the files I need: https://paste.debian.net/1142089/ (I put these here because I’m not allowed to link to more than 2 sites)

My question is not about how to download these files; it is about how to extract the different pieces of information from them programmatically.

kurzum · April 22, 2020, 11:32am

Either Jena https://jena.apache.org/ or RDF4J https://rdf4j.org/
But there are really hundreds of RDF libraries. Does this answer your question?
Another option is to load it into a triple store and query it with SPARQL. We recommend creating a Databus collection (which also gives you a persistent link to the file versions) and then using the Docker setup: https://github.com/dbpedia/Dockerized-DBpedia
For you, though, the Java libraries are probably the easiest option.

sanyides · April 26, 2020, 10:16am

Thank you. The main problem that I’m experiencing is that the files are too large. For example, page_lang=en_ids.ttl is over 2GB. Do I have to load it all into memory before doing anything? Are there any alternatives?

kurzum · April 27, 2020, 5:27pm

There is RDFSlice: http://aksw.org/Projects/RDFSlice
Jena also does Stream processing: https://jena.apache.org/documentation/io/streaming-io.html
For advanced users, I think you can also run star-shaped SPARQL queries over streams with: https://github.com/SmartDataAnalytics/SparqlIntegrate but I still need to try it.

For now, Jena streaming is the easiest option if you can’t load the file into main memory. Alternatively, load it into Jena TDB or another database, e.g. Virtuoso.
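The same streaming idea can be sketched without any RDF library, assuming the dump is line-based (one triple per line, as in N-Triples; this should be verified for the .ttl files). The Python sketch below reads one line at a time and writes the triples to an on-disk SQLite table in batches, so memory use stays roughly constant. The function names and the table layout are hypothetical, and the naive whitespace split is not a full RDF parser.

```python
import bz2
import sqlite3


def stream_triples(path):
    """Yield (subject, predicate, object) strings one line at a time."""
    # Assumption: .bz2 files are compressed, anything else is plain text.
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if line.endswith("."):
                line = line[:-1].rstrip()  # drop the terminating ' .'
            parts = line.split(None, 2)  # the object may contain spaces
            if len(parts) == 3:
                yield tuple(parts)


def load_into_sqlite(path, db_path, batch_size=10000):
    """Insert all triples into a SQLite table in batches; return the row count."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)")
    batch, count = [], 0
    for triple in stream_triples(path):
        batch.append(triple)
        if len(batch) >= batch_size:
            con.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
            count += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        con.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
        count += len(batch)
    con.commit()
    con.close()
    return count
```

Only one batch of triples is held in memory at any time; lookups afterwards go through SQL, ideally with an index on the subject column.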

Bash also works for some things, e.g. bzcat file.nt.bz2 | cut -f1 -d '>' | sort | uniq -c gives you the outdegree of each subject.
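For comparison, a rough Python equivalent of that pipeline, using only the standard library. The function name is hypothetical. Like the `cut` call, it treats everything up to the first `>` on a line as the subject, so it assumes one triple per line with a URI subject.

```python
import bz2
from collections import Counter


def subject_outdegree(path):
    """Count how many triples each subject has in a bz2-compressed dump."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            subject, sep, _ = line.partition(">")
            if sep:  # ignore lines without a '>' (blank lines, comments)
                counts[subject + ">"] += 1
    return counts
```

Unlike `sort | uniq -c`, which can spill to disk, this keeps one counter per distinct subject in memory, so it suits files whose set of subjects fits in RAM.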