Data Quality Dashboard - GSoC 2020

DESCRIPTION
DBpedia offers large quantities of structured data. However, parts of the data suffer from insufficient quality, which originates from different sources, e.g. incorrect extractions and value transformations in the extraction framework, inconsistent mappings, incorrect data in Wikipedia articles, and general incompleteness.

Goal
Visualize a set of metrics in an easy-to-read, interactive UI that facilitates deciding what should be fixed next in DBpedia.

Impact
The interface will help DBpedia contributors adopt a “data quality first” attitude and enable data-driven prioritization of development tasks.

This idea was already proposed in last year’s GSoC. In my opinion, it would be really useful to develop a Data Quality dashboard for DBpedia. What do you think about this proposal?

To clarify, did you propose it last year? I think @karankharecha proposed one. We would probably mentor two or three DQ dashboards this year.

The topic is highly difficult, though. DQ is a measure of fitness for use, so actually only a user can evaluate DQ. Per se, it is impossible to build something purely semantic here.

What makes sense here is to:

@lucav48 this is you? https://scholar.google.com/citations?user=2D771YsAAAAJ&hl=it
If you have a background in Network analysis/AI this can help a lot. But you need to commit to a specific task.

Hi @kurzum! This is not my idea, but last year I saw some proposals for building a Data Quality Dashboard, and none of them were accepted. I thought it would be nice to try proposing this idea again!

However, that is me :grin: I am willing to be a mentor for this project, but we should certainly define this idea better.

Ok, cool, welcome.
I tagged you as a GSoC mentor. @SandraPraetor @tsoru, we can work together on this. I am talking to Karan on Friday, who proposed it last year. He did quite good work: see Interactive Dashboard for Datasets and https://olympichistory.azurewebsites.net/

@lucav48 I merged some of your ideas into Dashboard for Language/National Knowledge Graphs
Data quality is a super hard topic, and this complexity is multiplied by having to manage errors and issues. I focused the other idea on visualisation, with the option to pick some additional features like data quality. If anybody does data quality, then focusing on one aspect is important, i.e. no generic solutions.

Thank you @kurzum! I’d be glad to contribute to this project.

Hi mentors, Prakhar here. Right now I am a sophomore at IIIT BBSR. I am interested in this project, but it is not clear from the project description which technologies we would be using. Could you clarify that, please?
Thanks and Regards
Prakhar

Hi Prakhar,

Languages like Python and Java are often used in DBpedia projects, and I think one of them could fit this case as well.

I think that in the proposal for this project you should come up with a possible architecture and then explain why it should be implemented that way.

For any other questions, feel free to contact me.

Luca

@pr4k, for the dashboard the technology is flexible. Normally, something like AngularJS, https://dart.dev/ or https://d3js.org/ can be used. Docker deployment is a MUST in my opinion; furthermore, it would be nice to have it a bit modular, so we could reuse some parts as widgets in other places.

For the data analytics side, we might also provide some stats, as it is on our todo list. We normally use Spark or https://sansa-stack.net/ for this, or parallelize with Scala/Reactive Streams. While coding a full-on framework for statistics generation might be out of the question, existing software or our stats could be used. However, the stats should be meaningful, i.e. a clear idea of which stats would be useful needs to be developed in this project.
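To make the "meaningful stats" point concrete, here is a minimal sketch of one candidate dashboard metric: per-property completeness, i.e. the fraction of subjects that carry a given property. This is a toy illustration in plain Python; a real pipeline would run on Spark or the SANSA stack over full DBpedia dumps, and the sample triples and property names below are made up for the example.

```python
# Toy sketch: per-property completeness as a candidate data-quality stat.
# The triples and the expected-property list here are hypothetical examples;
# a production version would aggregate over full DBpedia dumps with Spark/SANSA.
from collections import defaultdict

def property_completeness(triples, expected_properties):
    """For each expected property, return the fraction of subjects that have it."""
    props_by_subject = defaultdict(set)
    for s, p, o in triples:
        props_by_subject[s].add(p)
    n_subjects = len(props_by_subject)
    return {
        prop: sum(1 for props in props_by_subject.values() if prop in props) / n_subjects
        for prop in expected_properties
    }

triples = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Berlin", "dbo:populationTotal", "3644826"),
    ("dbr:Paris", "dbo:country", "dbr:France"),
]
stats = property_completeness(triples, ["dbo:country", "dbo:populationTotal"])
# dbo:country is present for both subjects (1.0),
# dbo:populationTotal only for one of two (0.5)
```

A dashboard widget could then render such per-property ratios as a bar chart per ontology class, which directly supports the "what should be fixed next" decision from the project goal.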

Hi mentors, it’s Simiao here. I am currently a postgraduate student in the Department of Artificial Intelligence at KU Leuven, Belgium. Before that, I obtained my bachelor’s degree in Software Engineering and worked in the tech industry for around three years.
I am quite interested in this project. I have experience in building the computing engine for a graph database, and I know that data quality is of crucial importance, as it is the very beginning of all mining. Besides, I am familiar with Scala/Java, Spark, and k8s.
Do you already have some ideas that you can share with us? With that guidance, we can think further. Thank you!