Hello, I would like to create two indexes from the English-language DBpedia and store them in binary format:
1: keys are forms (the string containing the DBpedia page name) and values are the lists of meanings (DBpedia URIs) for each form.
2: keys are DBpedia URIs and values are a pair consisting of a label of the DBpedia page and the shortened abstract for the Wikipedia article.
Is there any library or method you suggest for extracting this information programmatically? By default, I’d use Java but I’m also open to other languages.
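To make the target concrete, here is a minimal sketch of the two index shapes and one possible binary layout, using only the JDK. The class name `DbpediaIndexes`, the `PageInfo` holder, and the length-prefixed layout are all assumptions made for illustration; nothing here is a DBpedia format.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Hypothetical helper: names and binary layout are illustrative assumptions.
final class DbpediaIndexes {
    /** Value of the second index: page label plus shortened abstract. */
    static final class PageInfo {
        final String label;
        final String shortAbstract;
        PageInfo(String label, String shortAbstract) {
            this.label = label;
            this.shortAbstract = shortAbstract;
        }
    }

    // Length-prefixed UTF-8 avoids the 65535-byte limit of DataOutputStream.writeUTF.
    private static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    private static String readString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Index 1: form -> list of DBpedia URIs. */
    static void writeFormIndex(Map<String, List<String>> index, OutputStream target) throws IOException {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(target));
        out.writeInt(index.size());
        for (Map.Entry<String, List<String>> e : index.entrySet()) {
            writeString(out, e.getKey());
            out.writeInt(e.getValue().size());
            for (String uri : e.getValue()) writeString(out, uri);
        }
        out.flush();
    }

    static Map<String, List<String>> readFormIndex(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(source));
        int entries = in.readInt();
        Map<String, List<String>> index = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            String form = readString(in);
            int count = in.readInt();
            List<String> uris = new ArrayList<>(count);
            for (int j = 0; j < count; j++) uris.add(readString(in));
            index.put(form, uris);
        }
        return index;
    }

    /** Index 2: DBpedia URI -> (label, shortened abstract). */
    static void writePageIndex(Map<String, PageInfo> index, OutputStream target) throws IOException {
        DataOutputStream out = new DataOutputStream(new BufferedOutputStream(target));
        out.writeInt(index.size());
        for (Map.Entry<String, PageInfo> e : index.entrySet()) {
            writeString(out, e.getKey());
            writeString(out, e.getValue().label);
            writeString(out, e.getValue().shortAbstract);
        }
        out.flush();
    }

    static Map<String, PageInfo> readPageIndex(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(new BufferedInputStream(source));
        int entries = in.readInt();
        Map<String, PageInfo> index = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            String uri = readString(in);
            String label = readString(in);
            index.put(uri, new PageInfo(label, readString(in)));
        }
        return index;
    }
}
```

Both maps are held fully in memory in this sketch, so it only illustrates the shape of the data I am after, not how to get it out of the dumps.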
Thanks for the quick answer. After doing some research, I think these are the files I need: https://paste.debian.net/1142089/ (I put these here because I’m not allowed to link to more than 2 sites)
My question is not about how to download these files; it is about how to extract the different pieces of information from them programmatically.
You could use either Jena (https://jena.apache.org/) or RDF4J (https://rdf4j.org/).
But there are really hundreds of RDF libraries. Does this answer your question?
Another option is to load the data into a triple store and query it with SPARQL. We recommend creating a Databus collection (which also gives you a persistent link to the file versions) and then using Docker: https://github.com/dbpedia/Dockerized-DBpedia
But for you, the Java libraries are probably the easiest option.
Thank you. The main problem I’m experiencing is that the files are too large. For example, page_lang=en_ids.ttl is over 2GB. Do I have to load it all into memory before doing anything? Are there any alternatives?
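One alternative I can think of is to read the file as a stream and handle one triple at a time, so that only the index being built stays in memory. Below is a rough sketch of that idea using only the JDK. It assumes the dump is written one triple per line in N-Triples-compatible syntax (`<subject> <predicate> object .`), which I have not verified for every file. The class `NTriplesStream` and its methods are hypothetical names, and the line parser is deliberately minimal; a real RDF library with a streaming parser would be more robust.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; assumes one N-Triples-style triple per line.
final class NTriplesStream {
    static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";

    /** Returns {subject, predicate, object}, or null for comments, blank or unsupported lines. */
    static String[] parseLine(String line) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) return null;
        int sEnd = line.indexOf('>');
        int pStart = line.indexOf('<', sEnd + 1);
        int pEnd = line.indexOf('>', pStart + 1);
        if (!line.startsWith("<") || sEnd < 0 || pStart < 0 || pEnd < 0 || !line.endsWith(".")) {
            return null; // e.g. blank-node subjects are skipped in this sketch
        }
        String subject = line.substring(1, sEnd);
        String predicate = line.substring(pStart + 1, pEnd);
        String object = line.substring(pEnd + 1, line.length() - 1).trim();
        if (object.startsWith("\"")) {
            object = literalValue(object);                        // drop quotes, language tag, datatype
        } else if (object.startsWith("<") && object.endsWith(">")) {
            object = object.substring(1, object.length() - 1);    // IRI object
        }
        return new String[] {subject, predicate, object};
    }

    /** Extracts the lexical value of a quoted literal and resolves common escapes. */
    static String literalValue(String obj) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < obj.length(); i++) {
            char c = obj.charAt(i);
            if (c == '"') break; // closing quote; language tag or datatype follows
            if (c == '\\' && i + 1 < obj.length()) {
                char next = obj.charAt(++i);
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 't': sb.append('\t'); break;
                    case 'r': sb.append('\r'); break;
                    case 'u':
                        sb.append((char) Integer.parseInt(obj.substring(i + 1, i + 5), 16));
                        i += 4;
                        break;
                    case 'U':
                        sb.appendCodePoint(Integer.parseInt(obj.substring(i + 1, i + 9), 16));
                        i += 8;
                        break;
                    default: sb.append(next); // covers \" and \\
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds index 1 (label -> URIs) while reading the input line by line. */
    static Map<String, List<String>> buildFormIndex(BufferedReader reader) throws IOException {
        Map<String, List<String>> index = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] triple = parseLine(line);
            if (triple != null && RDFS_LABEL.equals(triple[1])) {
                index.computeIfAbsent(triple[2], k -> new ArrayList<>()).add(triple[0]);
            }
        }
        return index;
    }
}
```

With this approach the reader could come from `Files.newBufferedReader(...)` on the decompressed dump, and the 2GB file itself is never held in memory. The resulting map still is, so whether it fits depends on how many distinct forms there are.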