How to extract these values from DBpedia programmatically?

Hello, I would like to create two indexes (from the English-language DBpedia) and store them in binary format:
1: keys are a form (the string containing the DBpedia page name) and the values are the list of meanings (DBpedia URIs) for that form.
2: keys are DBpedia URIs and values are a pair consisting of a label of the DBpedia page and the shortened abstract for the Wikipedia article.

Is there any library or method you suggest for extracting this information programmatically? By default, I’d use Java but I’m also open to other languages.

Hi @sanyides,
the second one is easy, use:
https://databus.dbpedia.org/dbpedia/text/short-abstracts/
and
https://databus.dbpedia.org/dbpedia/generic/labels/
and build the index.
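A minimal sketch of the "build the index" step, in Python with only the standard library. It assumes the downloaded files contain one triple per line in N-Triples style and are bzip2-compressed. It takes the literal object of every line as the label or abstract without checking the predicate. The function names and the choice of pickle as the binary format are illustrative, not something the Databus prescribes.

```python
import bz2
import pickle
import re

# One N-Triples line whose object is a literal: <s> <p> "text"@lang .
LITERAL_TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+"((?:[^"\\]|\\.)*)"'
    r'(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?\s*\.\s*$'
)
_ESCAPES = {"t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f"}


def unescape(text):
    """Decode N-Triples string escapes such as \\" and \\uXXXX."""
    def repl(match):
        tok = match.group(1)
        if tok[0] in "uU" and len(tok) > 1:
            return chr(int(tok[1:], 16))
        return _ESCAPES.get(tok, tok)  # \" -> ", \\ -> \
    return re.sub(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.)', repl, text)


def literal_triples(lines):
    """Yield (subject, predicate, literal) lazily; skip non-matching lines."""
    for line in lines:
        match = LITERAL_TRIPLE.match(line)
        if match:
            yield match.group(1), match.group(2), unescape(match.group(3))


def build_uri_index(label_lines, abstract_lines):
    """Map each URI to a (label, short abstract) pair."""
    index = {}
    for uri, _pred, label in literal_triples(label_lines):
        index[uri] = (label, "")
    for uri, _pred, abstract in literal_triples(abstract_lines):
        label, _ = index.get(uri, ("", ""))
        index[uri] = (label, abstract)
    return index


def save_uri_index(label_path, abstract_path, out_path):
    """Read both dumps as text streams and pickle the resulting dict."""
    with bz2.open(label_path, "rt", encoding="utf-8") as labels, \
         bz2.open(abstract_path, "rt", encoding="utf-8") as abstracts:
        index = build_uri_index(labels, abstracts)
    with open(out_path, "wb") as out:
        pickle.dump(index, out, protocol=pickle.HIGHEST_PROTOCOL)
```

The input files are read line by line, so only the finished index has to fit in memory. A call would look like save_uri_index("labels.ttl.bz2", "short-abstracts.ttl.bz2", "uri_index.pickle"), where all three file names are placeholders.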

For the first one, you would need to aggregate some data, e.g. from:

Were you asking how to download the packages from the bus automatically, how to get the values out of the files, or which files to take?

Thanks for the quick answer. After doing some research, I think these are the files I need: https://paste.debian.net/1142089/ (I put these here because I’m not allowed to link to more than 2 sites)

My question is not about how to download these files; it’s more about how to extract the different pieces of information programmatically.

Either Jena https://jena.apache.org/ or RDF4J https://rdf4j.org/
But there are really hundreds of RDF libraries. Does this answer your question?
Another option is to load it into a triple store and query it with SPARQL. We recommend creating a Databus collection (which also gives you a persistent link to the file versions) and then using Docker: https://github.com/dbpedia/Dockerized-DBpedia
But for you, the Java libraries are probably the easiest option.

Thank you. The main problem I’m experiencing is that the files are too large. For example, page_lang=en_ids.ttl is over 2GB. Must I load it all into memory before doing anything? Are there any alternatives?

There is RDFSlice: http://aksw.org/Projects/RDFSlice
Jena also does Stream processing: https://jena.apache.org/documentation/io/streaming-io.html
For advanced users, I think you can also run star-shaped SPARQL queries over streams with: https://github.com/SmartDataAnalytics/SparqlIntegrate but I still need to try it.

For now, Jena's streaming I/O is the easiest option if you can’t load the file into main memory. Alternatively, load it into Jena TDB or another database, e.g. Virtuoso.
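To illustrate the combination of streaming and a database without setting up Jena, here is a rough Python sketch that reads a dump line by line and writes form-to-URI pairs into an on-disk SQLite file. SQLite is swapped in here for TDB or Virtuoso purely because it ships with Python. The table layout and function names are made up for this example, and using the label as the form is an assumption. Pairs aggregated from other files could be inserted into the same table.

```python
import re
import sqlite3

# Subject URI and raw literal of one N-Triples line: <s> <p> "text"...
# Escape sequences inside the literal are kept as-is in this sketch.
LABEL_TRIPLE = re.compile(r'^<([^>]*)>\s+<[^>]*>\s+"((?:[^"\\]|\\.)*)"')


def build_form_index(lines, db_path, batch_size=100_000):
    """Stream (form, uri) pairs into SQLite in bounded-size batches."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS forms (form TEXT NOT NULL, uri TEXT NOT NULL)"
    )
    batch = []
    for line in lines:
        match = LABEL_TRIPLE.match(line)
        if not match:
            continue
        batch.append((match.group(2), match.group(1)))
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO forms VALUES (?, ?)", batch)
            batch.clear()
    if batch:
        conn.executemany("INSERT INTO forms VALUES (?, ?)", batch)
    # Create the lookup index after the bulk insert, then persist everything.
    conn.execute("CREATE INDEX IF NOT EXISTS forms_by_form ON forms (form)")
    conn.commit()
    return conn


def meanings(conn, form):
    """Return the sorted list of distinct URIs recorded for a form."""
    rows = conn.execute(
        "SELECT DISTINCT uri FROM forms WHERE form = ? ORDER BY uri", (form,)
    )
    return [uri for (uri,) in rows]
```

Memory use is bounded by batch_size rather than by the size of the dump, and the resulting .db file is itself a binary index that can be queried later without rebuilding it. Passing a bz2.open(path, "rt", encoding="utf-8") handle as lines keeps the input compressed on disk.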

Bash also works for some things, e.g. bzcat file.nt.bz2 | cut -f1 -d '>' | sort | uniq -c gives you the outdegree of each subject.
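If you would rather stay inside a program than a shell pipeline, roughly the same count can be sketched in Python. This assumes every triple sits on one line and starts with a subject in angle brackets; the function name is made up for the example.

```python
from collections import Counter


def subject_outdegree(lines):
    """Count how many triples each subject URI appears in as subject."""
    counts = Counter()
    for line in lines:
        if not line.startswith("<"):
            continue  # skip comments, blank lines, blank-node subjects
        subject, closing, _rest = line[1:].partition(">")
        if closing:
            counts[subject] += 1
    return counts
```

Unlike sort, which can spill to disk, the Counter keeps every distinct subject in memory, so for the largest files the shell pipeline may still be the safer choice. To read a compressed dump, pass a handle from bz2.open(path, "rt", encoding="utf-8") as lines.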