WIP: CONJURE - The realization of workflows over data catalogs

aklakan · October 4, 2019, 12:51am

Internally, we have had many discussions about data clients, mods, caching, parallelization, etc., and I have now reached a state where I can present my efforts (from several projects) working together.

This is work in progress that puts many pieces of the puzzle together into one coherent framework and, in particular, pushes the underlying philosophy further. Some highlights: In addition to datasets as instances of data models, there are now data objects, which represent specific digital copies of a dataset. Uniform access to remote datasets and to Jena Models in the JVM is achieved using URIs that refer to standard JNDI entries. Connections are just tools for interfacing with data objects; they are not data objects themselves.
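
To make the distinction between data objects and connections concrete, here is a minimal sketch. It is not Conjure's actual API: the names `DataObject`, `Connection` and `Registry` are hypothetical, the `mem:` URI scheme is made up for illustration, a plain map stands in for a naming service such as JNDI, and lists of strings stand in for Jena Models.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DataObjectSketch {

    /** A data object: one specific digital copy of a dataset, identified by a URI. */
    static final class DataObject {
        final URI location;
        DataObject(URI location) { this.location = location; }
        /** Hypothetical convention: the "mem" scheme marks a copy held in the JVM. */
        boolean isInJvm() { return "mem".equals(location.getScheme()); }
    }

    /** A connection: a tool for reading a data object, not a data object itself. */
    interface Connection extends AutoCloseable {
        List<String> readTriples();
        @Override void close(); // narrowed so callers need not handle checked exceptions
    }

    /** Stand-in for a naming service: maps names to in-JVM triple lists. */
    static final class Registry {
        private final Map<String, List<String>> entries = new HashMap<>();

        void bind(String name, List<String> triples) {
            entries.put(name, new ArrayList<>(triples));
        }

        Connection connect(DataObject obj) {
            if (!obj.isInJvm()) {
                throw new UnsupportedOperationException(
                    "Remote access is outside this sketch: " + obj.location);
            }
            // For an opaque URI such as mem:people, this yields "people".
            String name = obj.location.getSchemeSpecificPart();
            List<String> triples = entries.get(name);
            if (triples == null) {
                throw new IllegalArgumentException("No entry bound for " + obj.location);
            }
            return new Connection() {
                @Override public List<String> readTriples() { return new ArrayList<>(triples); }
                @Override public void close() { /* nothing to release in memory */ }
            };
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.bind("people", Arrays.asList(
            "ex:alice rdf:type ex:Person", "ex:bob rdf:type ex:Person"));

        DataObject local = new DataObject(URI.create("mem:people"));
        try (Connection conn = registry.connect(local)) {
            System.out.println(conn.readTriples());
        }
    }
}
```

The point of the split is that the same `DataObject` can be handed around and stored in a catalog, while a `Connection` is opened only when the data is actually needed and closed afterwards.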

Conjure is an extremely powerful, light-weight, scalable, algebra-based approach to data processing workflow specifications for RDF. The algebraic approach lets you specify a blueprint for the construction of a dataset in terms of references to datasets and SPARQL queries. This blueprint can then be passed to an executor, which can interact with the triple store of your choice. Because the workflow specification is an expression, intermediate and/or final datasets can be cached. The dataset references give executors the chance to look up available distributions in data catalogs, which in turn allows for discovery and re-use of existing databases as well as automatic setup of local ones.
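
The idea of a blueprint that an executor evaluates and caches can be sketched roughly as follows. This is only an illustration under stated assumptions, not Conjure's actual classes: `Op`, `DataRef`, `Construct`, `Union` and `Executor` are hypothetical names, a map stands in for a data catalog, sets of strings stand in for RDF datasets, and a substring filter (`toyEngine`) stands in for a real SPARQL engine.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

final class ConjureSketch {

    /** A workflow expression: a blueprint for a dataset, not the data itself. */
    interface Op {
        /** Canonical string form; structurally equal expressions share a cache key. */
        String key();
    }

    /** Leaf: a reference to a dataset, resolved by the executor via a catalog. */
    static final class DataRef implements Op {
        final String datasetId;
        DataRef(String datasetId) { this.datasetId = datasetId; }
        @Override public String key() { return "ref(" + datasetId + ")"; }
    }

    /** Inner node: derive a dataset by running a query over a sub-expression. */
    static final class Construct implements Op {
        final Op input;
        final String query; // would be a SPARQL CONSTRUCT query in a real setup
        Construct(Op input, String query) { this.input = input; this.query = query; }
        @Override public String key() { return "construct(" + query + ", " + input.key() + ")"; }
    }

    /** Inner node: the merge of two sub-expressions' results. */
    static final class Union implements Op {
        final Op left, right;
        Union(Op left, Op right) { this.left = left; this.right = right; }
        @Override public String key() { return "union(" + left.key() + ", " + right.key() + ")"; }
    }

    /** Evaluates expressions bottom-up and caches every intermediate result. */
    static final class Executor {
        final Map<String, Set<String>> catalog;
        final BiFunction<String, Set<String>, Set<String>> engine;
        final Map<String, Set<String>> cache = new HashMap<>();
        int evaluations = 0; // counts cache misses, for demonstration only

        Executor(Map<String, Set<String>> catalog,
                 BiFunction<String, Set<String>, Set<String>> engine) {
            this.catalog = catalog;
            this.engine = engine;
        }

        Set<String> execute(Op op) {
            Set<String> cached = cache.get(op.key());
            if (cached != null) return cached;
            evaluations++;
            Set<String> result;
            if (op instanceof DataRef) {
                Set<String> data = catalog.get(((DataRef) op).datasetId);
                if (data == null) throw new IllegalArgumentException("Unknown dataset: " + op.key());
                result = new LinkedHashSet<>(data);
            } else if (op instanceof Construct) {
                Construct c = (Construct) op;
                result = engine.apply(c.query, execute(c.input));
            } else if (op instanceof Union) {
                Union u = (Union) op;
                result = new LinkedHashSet<>(execute(u.left)); // copy, so the cached input stays intact
                result.addAll(execute(u.right));
            } else {
                throw new IllegalArgumentException("Unsupported expression: " + op.key());
            }
            cache.put(op.key(), result);
            return result;
        }
    }

    /** Toy stand-in for a query engine: keeps triples that contain the query text. */
    static Set<String> toyEngine(String query, Set<String> input) {
        Set<String> out = new LinkedHashSet<>();
        for (String triple : input) {
            if (triple.contains(query)) out.add(triple);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> catalog = new HashMap<>();
        catalog.put("urn:example:people", new LinkedHashSet<>(Arrays.asList(
            "ex:alice rdf:type ex:Person", "ex:alice ex:age 30", "ex:bob rdf:type ex:Person")));

        Executor executor = new Executor(catalog, ConjureSketch::toyEngine);
        Op typesOnly = new Construct(new DataRef("urn:example:people"), "rdf:type");

        System.out.println(executor.execute(typesOnly));
        executor.execute(typesOnly); // second run is answered from the cache
        System.out.println("cache misses: " + executor.evaluations);
    }
}
```

Since the blueprint is plain data, swapping the in-memory executor for one backed by a triple store would not require touching the expression itself; only `Executor` would change.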

And here is what it looks like in code: jena-sparql-api/MainConjurePlayground.java at develop · SmartDataAnalytics/jena-sparql-api · GitHub

Because the approach is declarative, executors can be optimized and improved while the declarations stay the same (provided the model is adequate). The same declaration written today may therefore run much faster in the future.
Right now the executor only uses Jena in-memory models, but we (Lorenz is providing support) already have some foundations for TDB and Docker containers.