DBpedia Spotlight Dashboard: an integrated statistical information tool from the Wikipedia dumps and the DBpedia Extraction Framework artifacts - GSoC2021



In 2011, DBpedia released DBpedia Spotlight, a text annotation toolkit that has since become a reference in the state of the art. DBpedia Spotlight automatically annotates mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.

DBpedia Spotlight uses a language-specific model (e.g., English, German, French, or Portuguese) to annotate DBpedia resources in text. A language model is generated by a statistical extraction process over a Wikipedia dump, combined with the latest release artifacts from the DBpedia Extraction Framework (more precisely, the disambiguation, instance-types, and redirects artifacts). Both the resulting language models and the output of the statistical extraction process are published on the DBpedia Databus.

However, the raw information extracted from Wikipedia dumps is only available as tab-separated values (TSV), which makes its analysis difficult. This information could support Named Entity analysis based on DBpedia and Wikipedia sources, Wikipedia link analysis, or the analysis of the context words associated with a particular link. Representing the statistical information extracted from Wikipedia dumps as a linked data taxonomy, and integrating it with the DBpedia Extraction Framework artifacts (disambiguation, instance-types, and redirects), will provide valuable insight into the trends of DBpedia resources, Wikipedia links, and surface forms through descriptive measures (mean, median, standard deviation, interquartile range, etc.) computed over these data. These statistics will be presented in a dashboard that offers descriptive statistical information for each language and a way to compare information across languages. Additionally, the dashboard will present the updates introduced by a new language model release, e.g., the difference in the number of resources between the latest version and an earlier one.
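As a rough illustration of the kind of analysis the dashboard would automate, the sketch below computes dispersion measures over uriCounts-style data. The two-column row layout (`resource URI <TAB> mention count`) and the sample values are assumptions for illustration; a real run would stream the rows from a downloaded wikistats TSV file.

```python
import statistics

# Hypothetical uriCounts-style rows: "resource URI <TAB> mention count".
# In practice these would be read from a wikistats TSV dump file.
rows = [
    "http://dbpedia.org/resource/Berlin\t17",
    "http://dbpedia.org/resource/Paris\t42",
    "http://dbpedia.org/resource/Vienna\t8",
    "http://dbpedia.org/resource/Madrid\t23",
]

# Extract the trailing count column from each row.
counts = sorted(int(line.rsplit("\t", 1)[1]) for line in rows)

mean = statistics.mean(counts)
median = statistics.median(counts)
stdev = statistics.pstdev(counts)

# Interquartile range: distance between the 3rd and 1st quartiles.
q1, _, q3 = statistics.quantiles(counts, n=4)
iqr = q3 - q1

print(f"mean={mean} median={median} stdev={stdev:.2f} iqr={iqr}")
```

Per language, the same measures computed over each artifact would feed the comparison views of the dashboard.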


Goals

  • Define a taxonomy to represent the statistical information extracted from the Wikipedia dumps.
  • Integrate the statistical information extracted from Wikipedia dumps and the DBpedia Extraction Framework artifacts (disambiguation, instance-types, and redirects).
  • Generate a knowledge base from the integrated statistical information.
  • Generate a dashboard to summarize the statistical information computed from the integrated statistical knowledge base, applying dispersion measures.
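To make the knowledge-base goal concrete, the sketch below serializes a per-resource statistic as RDF in N-Triples form. The `spotlight-stats` namespace and the `uriCount`/`language` property names are hypothetical placeholders, not an existing ontology; the project's taxonomy would define the actual vocabulary.

```python
# The vocabulary namespace and property names below are hypothetical
# placeholders, not an existing ontology.
STATS_NS = "http://example.org/spotlight-stats#"
DBR = "http://dbpedia.org/resource/"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

def stat_triples(resource, uri_count, language):
    """Serialize one resource's mention count as N-Triples lines."""
    subject = f"<{DBR}{resource}>"
    return [
        f'{subject} <{STATS_NS}uriCount> "{uri_count}"^^<{XSD_INT}> .',
        f'{subject} <{STATS_NS}language> "{language}" .',
    ]

triples = stat_triples("Berlin", 17, "en")
print("\n".join(triples))
```

Accumulating such triples over all resources of a language model would yield the integrated statistical knowledge base that the dashboard queries.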


Impact

  • Represent the statistical information from Wikipedia dumps as a linked data taxonomy.
  • A statistical analysis of Wikipedia dumps and DBpedia resources.
  • Familiarize DBpedia Spotlight users with the annotation process by providing visual descriptive statistical information.
  • The historical record of the statistical information extracted from each Wikipedia dump to generate the corresponding language model.

Warm-up tasks

  • Read the documentation on the spotlight-wikistats page to become familiar with concepts such as uriCounts, pairCounts, etc.
  • Download the preferred language model statistical information from the spotlight-wikistats page.
  • Analyze the content of the language model selected.
  • Generate a preliminary statistical report on the downloaded language, applying the measures of center: mean, median, and mode.
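The last warm-up task could start from something like the sketch below, which applies the three measures of center to sample counts. The three-column pairCounts-style layout (`surface form <TAB> URI <TAB> count`) and the sample values are assumptions; a real report would stream the rows from the downloaded TSV file.

```python
import statistics

# Hypothetical pairCounts-style rows: (surface form, URI, count).
# A real report would read these from the downloaded TSV file instead.
rows = [
    ("Berlin", "http://dbpedia.org/resource/Berlin", 120),
    ("Paris", "http://dbpedia.org/resource/Paris", 95),
    ("Paris", "http://dbpedia.org/resource/Paris_Hilton", 12),
    ("Vienna", "http://dbpedia.org/resource/Vienna", 12),
]

counts = [count for _, _, count in rows]

report = {
    "mean": statistics.mean(counts),      # arithmetic average
    "median": statistics.median(counts),  # middle value
    "mode": statistics.mode(counts),      # most frequent value
}
print(report)
```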



DBpedia Spotlight, DBpedia Extraction Framework, Wikipedia, Statistical Analysis


Hi @JulioNoe. I’m a final-year Software Engineering undergraduate at IIT - Sri Lanka, and I have good knowledge of and experience with Deep Learning and NLP. I hope I’m not too late to contact the mentor and start the warm-up tasks. Looking forward to working on this project.

Hi @sahandilshan

Thanks for your interest; it is not too late. The first thing you should do is visit this link, which contains important information about GSoC, such as the timeline and the student info kit.


Hi @JulioNoe. Thank you for your reply. I’ll read those instructions and start to work with the warm-up task.

Hi @JulioNoe
I am a 1st-year graduate student at RWTH Aachen University. I have experience working with NLP and I am interested in working on this project. I have started working on the warm-up tasks. I am hoping to get a chance to work on this project.

Hi @aparnaj,

Thanks for your interest in this project. Please follow the instructions from this link and also from this link. The “Student Application Period” ends on Tuesday the 13th. If you have any questions, please let me know. Have a good day.