Best way to download specific parts of DBPedia

rogargon · June 19, 2021, 11:10am

We have been using a subset of files of the 2016 DBPedia dump to illustrate the capabilities of a semantic data exploration tool called Rhizomer. The subset includes just those parts that facilitate the user experience when exploring DBPedia with Rhizomer without overwhelming the user with too much information.

The details about the subset of the 2016 dump we are using are available from:

We would like to use the latest version of the DBPedia dumps, but after exploring the download mechanisms currently available, we haven’t been able to see a clear connection between the files we were using and those available now. Any guidance on how to obtain a similar subset of DBPedia using the new download mechanisms would be highly appreciated.

By the way, you can explore DBPedia using Rhizomer at: RhizomerEye

kurzum · June 21, 2021, 5:42pm

Hi @rogargon,
Two things up front: 1. Could you correct DBPedia to DBpedia wherever you find it (also in the Rhizomer documentation)? 2. I am describing some Databus features below, but we will soon release Databus 2.0, which has some small but breaking changes, so you may need to update some things later when that happens.

Other than that, the use case you have is one of the main features of the Databus. DBpedia now extracts 5000 files monthly and nobody needs all of them, so we refactored the releases into a Maven-style structure. I had a look at Rhizomer and it also uses Maven in the RhizomerAPI. Databus works similarly to software dependency management in Maven, using a user/group/artifact/version/file structure.
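To make the user/group/artifact/version/file structure concrete, here is a minimal shell sketch that splits a Databus file identifier into those five parts. The function name split_databus_id and the assumed URL layout are illustrative, not taken from the Databus documentation, and the layout may change with Databus 2.0.

```shell
# Split a Databus file identifier into user/group/artifact/version/file.
# Assumed layout (illustrative):
#   https://databus.dbpedia.org/<user>/<group>/<artifact>/<version>/<file>
# The function body runs in a subshell so its variables do not leak.
split_databus_id() (
  path="${1#https://databus.dbpedia.org/}"   # strip the assumed base URL
  # Split on "/"; the last variable receives whatever remains.
  IFS=/ read -r user group artifact version file <<EOF
$path
EOF
  printf 'user=%s group=%s artifact=%s version=%s file=%s\n' \
    "$user" "$group" "$artifact" "$version" "$file"
)
```

Called with a hypothetical identifier such as https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2021.06.01/mappingbased-literals_lang=en.ttl.bz2, it prints user=dbpedia group=mappings artifact=mappingbased-literals version=2021.06.01 file=mappingbased-literals_lang=en.ttl.bz2.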

For each file from the list, you need to select the appropriate artifact and version. Instead of writing <dependency><group><artifact> etc. in the pom.xml, this can be shortened with SPARQL. I made a query which covers your 18 files, see this link. Note that images are missing and abstracts are old; the latter is being fixed now. Alternatively, you could switch to the dev artifacts: https://databus.dbpedia.org/vehnem/text/short-abstracts and https://databus.dbpedia.org/vehnem/text/long-abstracts
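A minimal sketch of what such a query could look like for a single artifact, wrapped in a shell function so the artifact can be swapped. The function name latest_files_query is hypothetical, and the dataid/dcat/dct terms are an assumption about the Databus 1.x vocabulary; the query generated by the Databus website is the authoritative form and may differ, especially after Databus 2.0.

```shell
# Print a SPARQL query that selects the download URLs of the latest version
# of one Databus artifact. The vocabulary terms are assumptions about Databus 1.x.
latest_files_query() (
  artifact="$1"
  cat <<EOF
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dcat:   <http://www.w3.org/ns/dcat#>
PREFIX dct:    <http://purl.org/dc/terms/>
SELECT DISTINCT ?file WHERE {
  ?dataset dataid:artifact <${artifact}> ;
           dct:hasVersion ?latest ;
           dcat:distribution ?dist .
  ?dist dcat:downloadURL ?file .
  {
    # Subquery: the highest version string published for this artifact
    SELECT (MAX(?v) AS ?latest) WHERE {
      ?d dataid:artifact <${artifact}> ;
         dct:hasVersion ?v .
    }
  }
}
EOF
)
```

For example, latest_files_query https://databus.dbpedia.org/vehnem/text/short-abstracts prints a query for that artifact, which can then be sent to the Databus SPARQL endpoint.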

Another option is to 1. register at databus.dbpedia.org and 2. use the collection feature. It is like a shopping cart, where you can aggregate all files you need under one collection URL. This is what we did for latest-core. You could make a stable collection for Rhizomer once a year.

A further note: we built a Databus Client to help with conversion. It has a "download as" function, i.e. you pass it the query or collection URL and tell it that you would like the files as gzip, and the client will convert them on download.

Please let us know whether this worked, thanks.

oliverjohn2030 · June 22, 2021, 12:33pm

I’m also looking for the same answer

kurzum · June 22, 2021, 1:03pm

@oliverjohn2030 was my answer helpful? I wrote it in a very technical way, as I was assuming the Rhizomer people are familiar with Maven and SPARQL.

rogargon · June 22, 2021, 2:14pm

Thanks, @kurzum, I managed to create the collection: https://databus.dbpedia.org/rogargon/collections/browsable_core

We use images in the user interface whenever they are available. Are they going to be included at some point?

Regarding databus-client, we tried to build it on an EC2 instance with Java 8 (also tried Java 11) and Maven 3.8.1, but we get the following error:

[INFO] --- scala-maven-plugin:3.3.1:compile (default) @ databus-client ---
[INFO] /home/ec2-user/databus-client/src/main/scala:-1: info: compiling
[INFO] Compiling 31 source files to /home/ec2-user/databus-client/target/classes at 1624370890318
[ERROR] error: java.lang.NoClassDefFoundError: javax/tools/ToolProvider

I suppose we are missing some additional dependencies to build the client.

kurzum · June 22, 2021, 2:56pm

Nice. A small tip regarding the URL of https://databus.dbpedia.org/rogargon/collections/browsable_core . Here it is ok to put version info into the URL, e.g. you could call it browsable_core_2021 and then do browsable_core_2022 or you could make it dynamic, i.e. update browsable_core to always point to the latest working subset. But this decision is up to you.

Regarding the Databus client problem, Databus is also fully bash compatible for downloading:

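# Fetch the SPARQL query behind the collection
query=$(curl -s -H "Accept: text/sparql" https://databus.dbpedia.org/rogargon/collections/browsable_core)
# Run it against the Databus SPARQL endpoint; drop the CSV header line and the quotes
files=$(curl -s -H "Accept: text/csv" --data-urlencode "query=${query}" https://databus.dbpedia.org/repo/sparql | tail -n+2 | sed 's/"//g')
# Download each file (the URL is quoted so special characters survive)
while IFS= read -r file ; do wget "$file"; done <<< "$files"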

So you could also use this or rewrite it to do the bzip2 to gzip conversion.
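The bzip2 to gzip conversion mentioned here could be sketched as a small helper. This is a minimal sketch, not the Databus Client's implementation; the function name bz2_to_gz is hypothetical, and it assumes bzip2 and gzip are installed and that each downloaded file name ends in .bz2.

```shell
# Recompress one .bz2 file as .gz next to it, leaving the original in place.
# The function body runs in a subshell, so "exit" only ends the function.
bz2_to_gz() (
  src="$1"
  dst="${src%.bz2}.gz"               # e.g. data.ttl.bz2 -> data.ttl.gz
  bzip2 -t "$src" || exit 1          # refuse to convert a corrupt or partial download
  bzip2 -dc "$src" | gzip -c > "$dst"
)
```

Inside the download loop, the call would follow the download, for example: wget "$file" && bz2_to_gz "$(basename "$file")".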

@eisenbahnplatte can you look at the problem? This seems like a very good use case for the Databus Client, but it can’t be compiled.

eisenbahnplatte · June 22, 2021, 5:28pm

@rogargon I tried to reconstruct the exception you got, but with no success.

My Java version: 1.8.0_292
Initially I used Maven 3.3.9, but I switched to 3.8.1 to see if that works too, and it does. I also deleted my Maven .m2 folder to check that Maven downloads all the dependencies correctly.
Did you run mvn clean install first? Otherwise some dependencies may be missing.

Apart from that, I made minor changes because there was an issue with the handling of collection queries. The following commands work fine for me now.

git clone https://github.com/dbpedia/databus-client.git
cd databus-client
mvn clean install
./bin/DatabusClient -s "https://databus.dbpedia.org/rogargon/collections/browsable_core"

Another approach would be to use the released jar file of the DBpedia Databus-Client. Have you tried that already? It works out of the box, no maven commands needed. You can download it here:

The command for execution would be:

java -jar databus-client-1.0-SNAPSHOT.jar -s "https://databus.dbpedia.org/rogargon/collections/browsable_core"

rogargon · June 25, 2021, 6:01am

Thanks, it worked using the jar file.

Regarding the images, are they planned for the near future?

kurzum · June 27, 2021, 8:16am

Hi @rogargon,
Great.

I think the ‘generic/infobox-properties’ artifact contains many images as well.

The image extractor is just separating these into a separate file/artifact.

We are busy preparing a stable release for mid July. Not sure if we can slip it in by then. Let me ask @jlareck and @marvinh

hopver · June 28, 2021, 1:18pm

Hi,
I tested it on the DIEF minidump (a small subset of Wikipedia to test extractions), and it worked.
I will try to run the complete image extraction for the June dumps and publish it.

rogargon · August 20, 2021, 4:20pm

Hello, I’m unable to find the DBpedia images in the 2021-06 dump. Are they planned for the next dumps?

kurzum · August 31, 2021, 5:58am

@rogargon images will come with the next release; they didn’t make it into this one. We did a bit of a sloppy job writing everything down. Next time, we will keep much better track of everything in the collection notes: https://databus.dbpedia.org/dbpedia/collections/dbpedia-snapshot-2021-06

rogargon · November 2, 2021, 9:36am

Thanks @kurzum, all.

After the 2021.09 release, I updated the browsable_core collection to include images. I also simplified it a little to avoid too much information in the generated user interface. The procedure is now summarised at: https://github.com/rhizomik/rhizomerAPI/wiki/Deploy-Neptune-and-Rhizomer#prepare-dbpedia-dataset

Regarding the outcome, there seem to be some issues with images. For instance, for the first 10 insects, just 3 of them seem to have a picture of the actual insect: RhizomerEye

Moreover, looking to further simplify the UI, there are many properties that are repeated in the DBpedia Property and DBpedia Ontology namespaces (i.e. dbp and dbo). Are they kept separate in different dump files so I can choose to just load those from dbo?
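Whether or not the dump files separate the two namespaces, a client-side filter on the predicate is one option. A minimal sketch, assuming the dump is in a line-based, N-Triples-style layout (one triple per line, full IRIs in angle brackets); the function name only_dbo is hypothetical, and dbo is taken to be http://dbpedia.org/ontology/.

```shell
# Keep only triples whose predicate is in the DBpedia Ontology (dbo) namespace.
# Reads triples on stdin; the predicate is the second whitespace-separated field.
only_dbo() {
  awk 'index($2, "<http://dbpedia.org/ontology/") == 1'
}
```

A decompressed dump can be piped through it, for example: bzip2 -dc dump.ttl.bz2 | only_dbo > dbo-only.ttl (dump.ttl.bz2 is a placeholder file name).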

rogargon · November 12, 2021, 6:21pm

Hello, attaching a screenshot of the page showing the issues with images, as we are going to revert to the old dump and it will not be visible at the previous link.

As can be seen, only the two instances at the top have images that correspond to insects:

jfrey · November 29, 2021, 12:06pm

Could you please send the SPARQL query or the corresponding triples to our issue tracker?

jfrey · December 1, 2021, 11:25am

Update: we already found the reason for the issue. Thanks for reporting it.

vgimeerut · May 12, 2023, 11:40am

One way to download specific parts of DBpedia is to use the DBpedia extraction framework and select the datasets or subsets that you are interested in. The DBpedia extraction framework is a set of scripts and tools for extracting structured information from Wikipedia and publishing it as Linked Data.

To download specific parts of DBpedia using the extraction framework, you can follow these general steps:

1. Choose the datasets that you are interested in. DBpedia provides a wide range of datasets, such as the core dataset, which contains information about concepts and their properties, the ontology dataset, which describes the DBpedia ontology, and the mapping-based dataset, which contains information extracted from Wikipedia infobox templates.
2. Download and install the DBpedia extraction framework. The extraction framework is available on GitHub and can be installed using Maven.
3. Configure the extraction framework to extract the datasets that you are interested in. You can do this by editing the configuration files, which are located in the “extraction-framework/config” directory.
4. Run the extraction framework to extract the datasets. You can do this by running the “run” script, which is located in the “extraction-framework” directory.
5. Once the extraction is complete, you can access the datasets in the output directory, which is specified in the configuration files.

Note that the extraction process can be time-consuming and resource-intensive, so it may be helpful to use a server with sufficient computational resources to perform the extraction. Additionally, you should ensure that you have permission to use the data in the way that you intend, as DBpedia is licensed under the Creative Commons Attribution-ShareAlike 3.0 license.

jfrey · May 12, 2023, 1:21pm

Looks like a ChatGPT answer to me.