Combine DBpedia/Databus with the Interplanetary File System (IPFS) - GSoC2020

Description:
The DBpedia Databus has proven to be a working infrastructure for capturing and distributing the approximately 10,000 monthly generated DBpedia files plus the community extensions in an automatable way.
On top of this basic delivery, we build a fusion, i.e. an indexing system for DBpedia data and other datasets such as DNB, Geonames, MusicBrainz and many more, which is displayed on http://global.dbpedia.org and available for download here.
The Databus relies on well-structured, versioned dumps appearing in slow time slices. While this enables a global view on the data and better debugging, it is not a live, speedy ecosystem.
Recently, the underlay.org project has gained some traction, enabling graph sharing over the Interplanetary File System (IPFS).
Combined, these can build a lifecycle of static dumps for global analysis and live synchronization between the dump releases.

Goals:

  • find a meaningful way to combine underlay.org with the Databus
  • implement a use-case environment as a proof of concept showing how they can work together seamlessly

Impact:

Global comparison of values is possible: https://global.dbpedia.org/?s=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FJohann_Wolfgang_von_Goethe . Based on seeing the disparate values, the goal is to allow Underlay, if properly deployed, to sync specified values into local databases and feed the changes back.
Note that syncing only works into structured datasets such as the DNB source in the above example.

Hello! I’m Kirill.
Here are more details about me:

  • University of Freiburg. MS Computer Science. 2018 - now
  • Worked as a developer in industry. 3-4 years of experience in Java and 3-4 more years in Scala. Basic Python, JavaScript, and web development.
  • Contributed, and continue to contribute, to several open-source projects: akka, akka-http; small contributions to Apache JMeter, neosemantics, java-stix, swagger-akka-http.
  • Oracle Certified Professional (Java 8)

My LinkedIn: https://www.linkedin.com/in/kirill-yankov-8b0747118/
My GitHub: https://github.com/manonthegithub

This project seems interesting to me, but the description looks too abstract to me. Could you please give some hints:

  • Where can I look at the sources of Databus? Is it the databus-client repo on GitHub?
  • Can you please give some more details on what is expected to be done?
  • What technologies are expected to be used?

Hi Kirill,

The Databus is at its core a SPARQL endpoint, which loads DataIds, i.e. DCAT metadata. Around this there is tooling, like the website at databus.dbpedia.org, which renders the RDF data as HTML for humans. Weekly store dumps are here: https://databus.dbpedia.org/dbpedia/databus/databus-data/
Users host data on their own servers and then POST the DataId to https://databus.dbpedia.org/dataid-repo/dataid/upload to load it onto the bus.

Yes, it is here: https://github.com/dbpedia/databus-client for download, and for upload see http://dev.dbpedia.org/Databus_Upload_User_Manual and https://github.com/dbpedia/databus-maven-plugin

At the moment, the upload client we implemented publishes files by copying them to /var/www or via WebDAV to make them web-accessible, and then POSTs the dataid.ttl on the bus. This is done in the package or install step. However, this could also use underlay.org or the Interplanetary File System instead of HTTP publishing such as http://downloads.dbpedia.org/repo/
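
For illustration, here is a minimal Scala sketch of that last step, assuming the dataid-repo accepts the Turtle document in the request body; the real upload also involves authentication and parameters that the maven plugin handles, so treat the endpoint usage as an assumption:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.Paths

object PostDataIdSketch {
  def main(args: Array[String]): Unit = {
    val dataId = Paths.get(args(0)) // path to a local dataid.ttl
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
      .uri(URI.create("https://databus.dbpedia.org/dataid-repo/dataid/upload"))
      // Assumption: the repo accepts the Turtle document as the request body.
      .header("Content-Type", "text/turtle")
      .POST(HttpRequest.BodyPublishers.ofFile(dataId))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"Upload returned HTTP ${response.statusCode()}")
  }
}
```

An IPFS- or Underlay-based publisher would only change where the data files themselves end up; the DataId POST stays the same.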

Hmm. So I have looked into it in more detail. It may still not be fully clear to me, but I would like to explain what I understood and get your comments on it:

  • Databus is in fact a kind of graph database (TTL/RDF) or triple storage.
  • There are two parts: the endpoint, which can process SPARQL queries, and a tool (Maven plugin) for external users/customers to publish TTL files for the endpoint.
  • I cannot understand why you need to copy twice (copy to /var/www and then POST it on the bus); I think I am missing something here?
  • Underlay looks like another way to share data. As I understand it, they allow uploading data to their DB/repo and sharing it.

In this project you plan:

  • to add support for Underlay in the client (the ability to download data in a needed format or otherwise work with it)
  • to add support in the tool (maven-plugin) for publishing data to Underlay

Do I get it right? Please correct where I’m wrong.

Yes, that is the core. There is a whole process and an ontology behind it (DataId), but the main thing is loading around 25 triples of metadata per file into a Virtuoso triple store and then querying the dcat:downloadURL with SPARQL.

Like the data bus in your computer. Files are stored in a decentralized way, e.g. you have 10 TB of data, you copy it to your web space, and then you post the 25 triples per file into the store. So the Databus hosts the links to all data centrally, and the data itself is spread across different servers. E.g. CaLiGraph data is hosted here in Mannheim: http://data.dws.informatik.uni-mannheim.de/CaLiGraph/databus/repo/lts/ and on the bus here: https://databus.dbpedia.org/nheist/CaLiGraph/
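
To make that concrete, a hedged Scala sketch of querying the download links; the endpoint URL https://databus.dbpedia.org/repo/sparql is an assumption (the public Databus SPARQL endpoint at the time of writing), and the query only uses the generic dcat:downloadURL property rather than filtering for a specific artifact:

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets

object DownloadUrlQuerySketch {
  // Assumed public endpoint; adjust if the SPARQL service lives elsewhere.
  private val Endpoint = "https://databus.dbpedia.org/repo/sparql"

  private val Query =
    """PREFIX dcat: <http://www.w3.org/ns/dcat#>
      |SELECT ?file WHERE { ?distribution dcat:downloadURL ?file } LIMIT 10
      |""".stripMargin

  def main(args: Array[String]): Unit = {
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
      .uri(URI.create(Endpoint + "?query=" + URLEncoder.encode(Query, StandardCharsets.UTF_8)))
      .header("Accept", "application/sparql-results+json")
      .GET()
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // JSON bindings holding the dcat:downloadURL values
  }
}
```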

They are providing this centrally at the beginning… It is also intended to be spread among different servers later as self-deployments.

Very good. These are the two points where it can go.

  1. Either as a deployment of a downloaded artifact/collection via the Databus Download Client using e.g. Docker similar to https://github.com/dbpedia/Dockerized-DBpedia#example
  2. in the mvn package or mvn install phase (probably install) of the Databus Maven Plugin… A note here: the Maven plugin will be simplified a lot soon. In the end, all you need is the dataid.ttl and a POST request, so this could also be done in another programming language.

A concrete goal / testbed could be to make DBpedia available on underlay.org

I went into more detail and got a better understanding.
It raised several questions:

  • Underlay in essence seems to be a project similar to Databus; they aggregate databases or references to databases. The question is which one is bigger? Should Underlay keep links to Databus, or Databus to Underlay, or both? It looks like Underlay should be bigger, as they will potentially work with any type of database (not only TTL), but I am not sure. Also, the plans of Databus are not clear to me.
  • Maybe I am wrong, but currently Underlay is under construction and they have not published any tools or libraries. Do you know more about their progress? How would I integrate Databus with them if they have not published anything yet?
  1. One of the GSoC mentors might be from underlay, I will ping them again to write something here.
  2. Underlay has many good concepts. I see the projects as complementary, i.e. Databus can do some things that underlay can not and vice versa.

Databus just keeps the dataid.ttl as Turtle, i.e. the 25 triples per file (the files themselves can be in any format). There are also JSON, CSV and binary formats, and really anything, on the bus already. Underlay is still quite young, and I am not sure whether they will implement all that they plan. For Databus, we already have a lot of databases and end-user applications.

Databus focuses on RDF, because the DBpedia community is a big fan of RDF.

I will ping the devs from Underlay. As a basic solution, it would also be possible to integrate IPFS into Databus first and Underlay later.

I have made a draft of a project proposal. Can you please take a look?

Can you please have a look at my proposal? Is it fine? Should I make the schedule more fine-grained?

Hi @manonthegithub,
I talked to the underlay contact and his answer was this:

Talking to Joel, architect of the current protocols: the Underlay libraries and interfaces may not be ready for this by the summer.
We are just starting to hire someone to help build out our first central registry and related packaging tools. Do you think that creating a toolchain that lets a subcommunity create a databus package, and learning to use IPFS interfaces to store those packages on IPFS, might be sufficient for a full summer’s project? That’s already learning two different languages and testing with a data community.

So I guess it doesn’t make sense to go for Underlay yet, although it is very RDF-specific. Meanwhile, I had another look at IPFS and I think integrating Databus and IPFS would be a really cool project. There are two sides to this:

1 Publishing

At the moment, we are using Maven features extensively for publication. This means either copying files into /var/www of an Apache Web Server or NGINX, or, lately, using the Maven WebDAV Wagon. So the process here is:

  1. people run the upload script on the same server, copying to /var/www
  2. people run it on their own server/laptop and push to the publication server via WebDAV or SSH

So one question here is how to get local data into IPFS. Maybe an IPFS client needs to run, or maybe it is just simple file copying. Then you also need to get the IPFS hash, as this needs to go into the DataId and onto the Databus.
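
A minimal sketch of that step, assuming a local IPFS daemon is installed and the ipfs CLI is on the PATH; how the resulting hash would be recorded in the dataid.ttl (e.g. as an extra download location) is an open design question, not an existing feature:

```scala
import scala.sys.process._

object IpfsAddSketch {
  /** Adds a single local file to IPFS and returns its hash (CID).
    * Assumes the `ipfs` CLI and a running daemon; `-q` prints only the hash.
    */
  def addToIpfs(filePath: String): String =
    Seq("ipfs", "add", "-q", filePath).!!.trim

  def main(args: Array[String]): Unit = {
    val cid = addToIpfs(args(0))
    // The CID would then be written into the dataid.ttl, e.g. as an
    // additional gateway download location such as https://ipfs.io/ipfs/<cid>.
    println(cid)
  }
}
```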

2 Downloading

As far as I understood, there are several ways to download IPFS files. There are download clients in different implementations, and normally you are supposed to share the files again. There is also a Scala wrapper and a wget-like implementation. I didn’t get very deep into it, but these features should be possible:

  1. download without having a local IPFS node (see the gateway-based sketch after this list)
  2. download and share via a local IPFS node
    (Note: for us it wouldn’t be a problem to host a public gateway.)
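
As a sketch for option 1, downloading by CID through a public HTTP gateway needs no local node at all; the gateway host below (ipfs.io) is just an example, and a self-hosted gateway would work the same way:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.Paths

object IpfsGatewayDownloadSketch {
  def main(args: Array[String]): Unit = {
    val cid    = args(0)            // content hash of the file on IPFS
    val target = Paths.get(args(1)) // where to store the download locally
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"https://ipfs.io/ipfs/$cid")) // example public gateway
      .GET()
      .build()
    // Streams the gateway response straight into the target file.
    val saved = client.send(request, HttpResponse.BodyHandlers.ofFile(target)).body()
    println(s"Downloaded $cid to $saved")
  }
}
```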

then also:

  1. subscribe to new versions, i.e. we have collections such as https://databus.dbpedia.org/dbpedia/collections/latest-core where users can get the latest version of an artifact (sketched after this list). The collection resolves to a SPARQL query: curl -H "Accept: text/sparql" https://databus.dbpedia.org/dbpedia/collections/latest-core . I am not sure how IPFS reacts to file changes or updates, but combined with the Databus/Maven-like structure, it should be easy.
  2. there is the pinning feature of IPFS Cluster (https://cluster.ipfs.io/), which would be interesting as we could have a LOD Cloud cluster swarm that starts backing up the whole LOD Cloud. This would be a killer feature.
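
For point 1, a rough Scala sketch of the collection-based flow: resolve the collection to its SPARQL query (as in the curl example above) and run it against the Databus endpoint. The endpoint URL is an assumption, and the "subscription" part would just be a periodic re-run and diff of the returned download URLs:

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets

object LatestCoreCollectionSketch {
  private val client     = HttpClient.newHttpClient()
  private val Collection = "https://databus.dbpedia.org/dbpedia/collections/latest-core"
  private val Endpoint   = "https://databus.dbpedia.org/repo/sparql" // assumed endpoint

  private def get(uri: String, accept: String): String = {
    val request = HttpRequest.newBuilder()
      .uri(URI.create(uri))
      .header("Accept", accept)
      .GET()
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // 1. Resolve the collection to the SPARQL query it represents.
    val query = get(Collection, "text/sparql")
    // 2. Run the query to obtain the current download URLs of the collection.
    val results = get(
      Endpoint + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8),
      "application/sparql-results+json")
    // 3. A "subscription" would re-run this periodically and diff the URLs
    //    against what has already been mirrored/pinned on IPFS.
    println(results)
  }
}
```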

Your work plan

I think the skeleton of your work plan is realistic. Before Milestone 1 (implementation with JUnit tests), I would recommend an earlier milestone with some sort of hacky, vertical prototype in order to gain experience with IPFS and also to have something working with a small dataset. This follows agile, rapid-prototyping practice, and it is a good milestone to discuss and adjust the following tasks & timeline. Otherwise you spend four weeks on implementation and testing just to find out that the requirements are actually different (this happens to almost 50% of all software projects).

Great! Thank you for the detailed comments. I have updated my proposal according to the latest news and your comments. Please take a look once more at the same place. It looks much better to me now.

@manonthegithub the proposal is still a bit simple. It would be good to work out at least some of the details in advance. Some points here:

  1. The project description is just half a sentence now. Could you extend it to have a clearer goal in mind? Maybe summarize Databus and IPFS a little and then stress the goals of this integration (most of the points are in this thread, but maybe you need to read some IPFS material or this https://databus.dbpedia.org/dbpedia/publication/strategy/2019.09.09/strategy_databus_initiative.pdf)
  2. The layout is confusing, i.e. there are many #, although there should be only one, and not enough ## and ### to give it a sensible structure. Especially the Project Description and the Detailed Project Plan and Timeline need a ## or ###, as these describe the idea and how it is implemented.
  3. Everything before the technical skills section needs a brush-up regarding goal description, level of detail (e.g. more links), and layout/structure.

-> I am one of the mentors and I know the conversation in this thread, but the other mentors need to be convinced by the proposal itself in order to agree to give you a slot in GSoC.

Thanks for the review! I’ve made the corrections. It should be much better now. Please check.