The goal of the Databus is to provide a replication and deployment infrastructure. Self-deployment is already implemented, although it is not well documented yet. There are two ways to set up your own SPARQL endpoint:
- Use the query together with the databus client: https://github.com/dbpedia/databus-client#docker-example-deploy-a-small-dataset-to-docker-sparql-endpoint
- Log in, create a collection, and load it with Dockerized-DBpedia: https://github.com/dbpedia/Dockerized-DBpedia
I understand that you want to get a better picture of the data before loading it. At the moment, we have a
preview: on the page, you can fold open the > to see the first 10 lines. There is also a property,
dataid:nonEmptyLines "140614"^^xsd:decimal ; but its value is still broken: the dataset is almost 4 GB, so it probably has far more than 140k lines.
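Until that property is fixed, you could count the lines yourself. This is only a rough sketch: the function name count_non_empty_lines is made up for illustration, and it assumes "non-empty" simply means a line with at least one character, which may not match exactly how dataid:nonEmptyLines is meant to be computed.

```shell
# Hypothetical helper: count lines on stdin that contain at least one character.
count_non_empty_lines() {
  # grep -c . prints the number of lines matching "any character".
  grep -c .
}

# Possible usage, with $downloadURL pointing at a bzip2-compressed file:
#   curl "$downloadURL" | lbzip2 -dc | count_non_empty_lines
```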
We are currently implementing a triple store that keeps an analysis of all files on the bus, including VoID (https://www.w3.org/TR/void/) statistics. VoID defines
void:distinctSubjects, which is what you are looking for. It will take 2-4 weeks (maybe more) until this is available.
In the meantime, you can run:
curl $downloadURL | lbzip2 -dc | cut -f1 -d '>' | sort -u --parallel=8 | wc -l
to get a count of all distinct subjects.
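The one-liner above keeps everything up to the first '>' on each line, which works as long as every subject is an IRI in angle brackets. If the file may also contain blank-node subjects or comment lines, a variant like the following might be more robust. It is a sketch under assumptions: the function name count_distinct_subjects is made up, and the input is assumed to be N-Triples with one triple per line and the subject as the first whitespace-separated token.

```shell
# Hypothetical helper: count distinct subjects in N-Triples read from stdin.
count_distinct_subjects() {
  # Skip blank lines and '#' comment lines, keep only the first token (the subject),
  # deduplicate with a byte-wise sort, and count what is left.
  # tr strips the padding that some wc implementations put before the number.
  awk 'NF && $1 !~ /^#/ { print $1 }' | LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Possible usage, with $downloadURL pointing at a bzip2-compressed N-Triples file:
#   curl "$downloadURL" | lbzip2 -dc | count_distinct_subjects
```

LC_ALL=C makes sort compare raw bytes, which is usually faster on large files and is enough here because only equality between subjects matters.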