Data Quality Dashboard - GSoC2020

lucav48 · February 3, 2020, 9:31pm

DESCRIPTION
DBpedia offers large quantities of structured data. However, its data quality is insufficient in places, with problems originating from different sources, e.g. incorrect extractions and value transformations in the extraction framework, inconsistent mappings, incorrect data in Wikipedia articles, and general incompleteness.

Goal
Visualize a set of metrics in an easy to read interactive UI that facilitates the decision on what should be fixed next in DBpedia.

Impact
The interface will help DBpedia contributors adopt a “data quality first” attitude and enable data-driven prioritization of development tasks.

This idea was already proposed in last year’s GSoC. In my opinion, it could be really useful to develop a Data Quality dashboard for DBpedia. What do you think about this proposal?

kurzum · February 4, 2020, 3:52pm

To clarify, did you propose it last year? I think @karankharecha proposed one. We would probably mentor two or three DQ dashboards this year.

The topic is highly difficult, though. DQ is a measure of fitness for use, so only a user can actually evaluate DQ. So, per se, it is impossible to build something semantic here.

What makes sense here is to:

Check the technical quality of Databus artifacts, i.e. we already have an RDF parsing component, a URI character testing component, and a SHACL component, of which only RDF parsing is fast and reliable. So this can be picked up. We also have a diff component.
We are subspecialising DBpedia into 4-5 streams, i.e. for students (laptop size), for researchers, for national chapters, and for the main version. These can be evaluated by form (do they get bigger?) or function (do queries still work?).
There are the ontology and mappings, where the stats can be improved: DBpedia Mapping Statistics
There is also a new task to benchmark and improve equivalentClass/property linking and sameAs linking.
Finally, picking a domain could be good. Note that we now include external data as per https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
Here fusion means guessing the correct value(s) from a list of candidates for each property. It can be seen here in its prefused state: https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FfloorCount&src=general
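
To make "guessing the correct value(s) from a list of candidates" concrete, here is a minimal Python sketch that fuses candidates by simple majority vote. This is only an illustration and not necessarily the strategy FlexiFusion uses; the function name, the source labels, and the sample values are all hypothetical.

```python
from collections import Counter


def fuse_by_majority(candidates):
    """Return the value(s) asserted by the most sources.

    candidates: list of (source, value) pairs for one subject/property.
    Ties are kept, since a property may legitimately have several values.
    """
    if not candidates:
        return []
    # Deduplicate (source, value) pairs so a source cannot vote twice
    # for the same value.
    votes = Counter(value for _, value in set(candidates))
    top = max(votes.values())
    return sorted(value for value, n in votes.items() if n == top)


# Hypothetical prefused candidates for a floorCount property.
candidates = [
    ("source-a", "102"),
    ("source-b", "102"),
    ("source-c", "103"),
]
fused = fuse_by_majority(candidates)
```

A dashboard could then list the properties where sources disagree most, which is one narrow, measurable aspect of quality.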

kurzum · February 4, 2020, 3:54pm

@lucav48 is this you? https://scholar.google.com/citations?user=2D771YsAAAAJ&hl=it
If you have a background in Network analysis/AI this can help a lot. But you need to commit to a specific task.

lucav48 · February 4, 2020, 4:52pm

Hi @kurzum! This is not my idea, but last year I saw some proposals for building a Data Quality Dashboard, and none of them were accepted. I thought it could be nice to try proposing this idea again!

However, that is me. I am willing to be a mentor for this project, but we should certainly define this idea better.

kurzum · February 4, 2020, 6:30pm

ok, cool, welcome.
I tagged you as GSoC Mentor. @SandraPraetor @tsoru. We can work together on this. I am talking to Karan on Friday, who proposed it last year. He did quite good things: see Interactive Dashboard for Datasets and https://olympichistory.azurewebsites.net/

kurzum · February 7, 2020, 12:41pm

@lucav48 I merged some of your ideas into Dashboard for Language/National Knowledge Graphs
Data quality is a super hard topic, and this complexity is multiplied by managing errors and issues. I focused the other idea on visualisation, with the option to pick some more features like data quality. If anybody works on data quality, then focusing on one aspect is important, i.e. no generic solutions.

lucav48 · February 8, 2020, 9:45am

Thank you @kurzum! I’d be glad to contribute to this project.

pr4k · February 28, 2020, 12:43pm

Hi mentors, Prakhar here. Right now I am a sophomore at IIIT BBSR and I am interested in this project, but it is not clear from the project description which technologies we would be using. Could you clear that up, please?
Thanks and Regards
Prakhar

lucav48 · March 2, 2020, 10:16pm

Hi Prakhar,

Languages like Python and Java are often used in DBpedia projects, and I think that one of these could fit in this case as well.

I think that in the proposal for this project you should come up with a possible architecture and then explain why it should be implemented here.

For any other questions, feel free to contact me.

Luca

kurzum · March 5, 2020, 7:56am

@pr4k for the dashboard the technology is flexible. Normally, something like AngularJS, https://dart.dev/ or https://d3js.org/ can be used. Docker deployment is a MUST in my opinion. Furthermore, it would be nice to have it a bit modular, so we could reuse some parts as widgets in other places.

For the data analytics side, we might also provide some stats, as it is on our todo lists. We normally use Spark or sansa-stack.net/ for this, or parallelize with Scala/Reactive Streams. While coding a full-on framework for statistics generation might be out of the question, existing software or our stats could be used for this. However, the stats should be meaningful, i.e. this project should develop clear ideas about which stats would be useful.
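
As a small illustration of what a "useful stat" could look like, the following Python sketch counts how often each property occurs in a set of N-Triples lines. It is a single-machine sketch under simplifying assumptions (one triple per line, naive whitespace splitting); for full dumps the same idea would be run on Spark or similar. The function name and the sample resources are hypothetical.

```python
from collections import Counter


def property_counts(ntriples_lines):
    """Count triples per predicate IRI in an iterable of N-Triples lines."""
    counts = Counter()
    for line in ntriples_lines:
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Subject and predicate contain no whitespace in N-Triples,
        # so the first two tokens are enough to find the predicate.
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # Malformed line; a real tool would report it.
        counts[parts[1].strip("<>")] += 1
    return counts


# Hypothetical sample data.
sample = [
    '<http://dbpedia.org/resource/A> <http://dbpedia.org/ontology/floorCount> "102" .',
    '<http://dbpedia.org/resource/B> <http://dbpedia.org/ontology/floorCount> "50" .',
    '<http://dbpedia.org/resource/A> <http://www.w3.org/2000/01/rdf-schema#label> "A"@en .',
]
stats = property_counts(sample)
```

Comparing such counts between two releases would show which properties grew or shrank, which a dashboard widget could plot directly.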

simiaolin · April 4, 2021, 9:33am

Hi, mentors. Simiao here. Right now I am a postgraduate student in the Department of Artificial Intelligence at KU Leuven, Belgium. Before that, I obtained my bachelor’s degree in Software Engineering and worked in the tech industry for around three years.
I am quite interested in this project. I have experience in building the computing engine for a graph database, and I know that data quality is of crucial importance, as it is the starting point of all data mining. Besides, I am familiar with Scala/Java, Spark, and k8s.
Do you already have some ideas that you can share with us? With this guidance, we can think further. Thank you!