DBpedia Spotlight Dashboard: an integrated statistical information tool built from the Wikipedia dumps and the DBpedia Extraction Framework artifacts.
DESCRIPTION
In 2011, DBpedia released DBpedia Spotlight, a text annotation toolkit that has become a reference in the state of the art. DBpedia Spotlight automatically annotates mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.
DBpedia Spotlight makes use of a language-specific model (English, German, French, Portuguese, etc.) to annotate DBpedia resources in text. A language model is generated by a statistical extraction process over a Wikipedia dump, combined with the latest release artifacts from the DBpedia Extraction Framework (more precisely, the disambiguation, instance-types, and redirects artifacts). Both the resulting language models and the results of the statistical extraction process are published on the DBpedia Databus.
However, the raw information extracted from the Wikipedia dumps is only available as tab-separated values (TSV), which makes its analysis difficult. This information could support Named Entity analysis based on DBpedia and Wikipedia sources, the analysis of Wikipedia links, or the analysis of the context words associated with a particular link. Representing the statistical information extracted from the Wikipedia dumps as a linked data taxonomy and integrating it with the DBpedia Extraction Framework artifacts (disambiguation, instance-types, and redirects) will provide valuable information for understanding the trends of DBpedia resources, Wikipedia links, and surface forms by means of descriptive statistics (mean, median, standard deviation, interquartile range, etc.) computed over these data. These statistics will be presented in a dashboard that provides descriptive statistical information for each language and allows comparisons between languages. Additionally, the dashboard will present the updates introduced by a new language model release, e.g., the difference between the number of resources in the latest version and an earlier version.
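The dispersion measures mentioned above can be sketched over the TSV data as follows. This is a minimal illustration, assuming a uriCounts-style layout (resource URI, tab, occurrence count); the sample values and exact column layout are assumptions, not taken from the published files.

```python
import statistics

# Hypothetical excerpt of a uriCounts-style TSV (resource, occurrence
# count); the exact layout of the published files may differ.
sample_tsv = (
    "http://dbpedia.org/resource/Berlin\t120\n"
    "http://dbpedia.org/resource/Paris\t95\n"
    "http://dbpedia.org/resource/Lisbon\t40\n"
    "http://dbpedia.org/resource/Vienna\t40\n"
    "http://dbpedia.org/resource/Prague\t15\n"
)

# Parse the occurrence counts from the second column.
counts = [int(line.split("\t")[1]) for line in sample_tsv.splitlines()]

q1, _, q3 = statistics.quantiles(counts, n=4)  # quartiles (exclusive method)
dispersion = {
    "stdev": statistics.stdev(counts),  # sample standard deviation
    "iqr": q3 - q1,                     # interquartile range
}
print(dispersion)
```

The same per-resource counts would feed the dashboard's per-language summaries, so the parsing step is the only part that depends on the actual file layout.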
Goal
- Define a taxonomy to represent the statistical information extracted from the Wikipedia dumps.
- Integrate the statistical information extracted from Wikipedia dumps and the DBpedia Extraction Framework artifacts (disambiguation, instance-types, and redirects).
- Generate a knowledge base from the integrated statistical information.
- Generate a dashboard that summarizes the statistical information computed from the integrated statistical knowledge base, applying dispersion measures.
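One way to realize the taxonomy and knowledge base goals above is to serialize each computed statistic as RDF. A minimal sketch in plain Python, assuming a hypothetical vocabulary namespace (the `spotlight-stats` namespace and property names such as `meanUriCount` are illustrative only, not the taxonomy to be defined):

```python
# Hypothetical namespace for the statistics vocabulary (illustrative).
EX = "http://example.org/spotlight-stats#"


def stat_triple(lang: str, stat: str, value: float) -> str:
    """Serialize one statistic of a language model as an N-Triple line."""
    subject = f"<{EX}{lang}-model>"
    predicate = f"<{EX}{stat}>"
    obj = f'"{value}"^^<http://www.w3.org/2001/XMLSchema#double>'
    return f"{subject} {predicate} {obj} ."


# Example: two statistics computed for the English language model
# (the numeric values are placeholders).
triples = [
    stat_triple("en", "meanUriCount", 62.0),
    stat_triple("en", "medianUriCount", 40.0),
]
print("\n".join(triples))
```

In practice an RDF library would be a better fit than string formatting, but the sketch shows the shape of the knowledge base: one subject per language model, one property per statistic.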
Impact
- Represent the statistical information from Wikipedia dumps as a linked data taxonomy.
- A statistical analysis of Wikipedia dumps and DBpedia resources.
- Familiarization of DBpedia Spotlight users with the annotation process through visual descriptive statistical information.
- A historical record of the statistical information extracted from each Wikipedia dump used to generate the corresponding language model.
Warm-up tasks
- Read the documentation on the spotlight-wikistats page to become familiar with concepts such as uriCounts, pairCounts, etc.
- Download the statistical information for your preferred language model from the spotlight-wikistats page.
- Analyze the content of the selected language model.
- Generate a preliminary statistical report for the downloaded language, applying the measures of center: mean, median, and mode.
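The last warm-up task can be sketched with the standard library alone. The occurrence counts below are assumed sample values, standing in for the counts parsed from a downloaded uriCounts-style file:

```python
import statistics

# Illustrative occurrence counts, standing in for the counts parsed
# from a downloaded uriCounts-style file (values assumed).
counts = [120, 95, 40, 40, 15]

# Measures of center for the preliminary report.
report = {
    "mean": statistics.mean(counts),
    "median": statistics.median(counts),
    "mode": statistics.mode(counts),  # most frequent count
}
print(report)
```

`statistics.multimode` may be preferable to `statistics.mode` when several counts are tied for most frequent, which is common in long-tailed resource counts.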
Mentors
@JulioNoe …
Keywords
DBpedia Spotlight, DBpedia Extraction Framework, Wikipedia, Statistical Analysis