DBpedia-Spotlight: How to help improve the quality of multilingual entity extraction

What is DBpedia Spotlight?

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.

The new release is a Docker image that runs DBpedia Spotlight as a server with the most recent language models, which are downloaded from the DBpedia Databus repository, e.g., English (en), German (de), and Italian (it).

During the development of this release, we found a set of problems that we need the community's help to overcome. They are listed below, and the list will be updated if further problems are found.

Improving the quality of language models

The model-quickstarter project builds DBpedia Spotlight models for specific languages. Producing a model requires the following artifacts: redirects, disambiguations, and instance types. However, some languages are missing one or more of these artifacts, e.g., Swedish (sv_SE), Turkish (tr_TR), and Danish (da_DK).

The process for building a language model is explained in the model-quickstarter project. Some parts of this process must be checked or tested to make sure that the resulting language model is as good as it can be. These parts are listed below.

  1. [Check] Use the following query to identify whether an artifact is missing. If one is missing, please follow each link and refer to the corresponding README. As a first approach to resolving a missing artifact:

    • Instance-types: Create mappings in mappings.dbpedia.org

    • Disambiguations: The DBpedia Extraction Framework includes a configuration file (DisambiguationExtractorConfig.scala) that is used to detect disambiguation pages for specific languages. You could add a missing language to it, or test it for the languages whose disambiguations are missing (refer to the SPARQL query). The following is an example of the language-specific configuration:

      • Label indicating a disambiguation page, e.g., "cs" -> " (rozcestník)", "de" -> " (Begriffsklärung)", "el" -> " (αποσαφήνιση)", "en" -> " (disambiguation)".
    • Redirects: Please refer to the documentation on the Databus for more details.

  2. [Check] The language, locale, and stemmer are defined correctly. For example, English is defined as en_US-English, where en is the language, US is the locale, and English is the stemmer. If the language, locale, and/or stemmer is missing:

    • Refer to the BCP47 documentation to define the correct two-letter language and locale codes. A language and locale may already be defined, but the locale may not be the best choice.
    • Refer to the Snowball stemming algorithms to define the correct stemmer algorithm, if one exists.
    • We are looking for new stemmer algorithms for languages that have no Snowball stemming algorithm.
  3. [Check] The stopword list must exist at [LANG]/stopwords.list, where LANG is the two-letter language code, e.g., en, it, or fr for English, Italian, and French, respectively. If the stopword file does not exist:

    • Create the [LANG] folder in the main directory and create the [LANG]/stopwords.list file.
  4. [Test] Run the mainModelBuilder.sh script from the model-quickstarter project to generate the model.

    • All artifacts (instance types, disambiguations, and redirects) must be downloaded and decompressed into the corresponding folder.
  5. [Check] The “Wikipedia statistics extraction” section generates the files tokenCounts, uriCounts/uriCounts_all, and pairCounts in the wdir/[LANG]_[LOCALE] folder, where LANG is the two-letter language code and LOCALE is the two-letter locale code. If any of these files is missing:

    • One possible cause is a change in the template of the Wikipedia dump file for that language.
    • For any other kind of problem, please refer to the wikistat project.
  6. [Test] Run DBpedia Spotlight with the corresponding language model.

    • The DBpedia Spotlight server must be queryable through curl.
    • Every query must return a valid response in the selected format: TURTLE, JSON, or CSV.
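
The test in step 6 can also be scripted instead of run by hand with curl. The following is only a minimal sketch, not the project's own test: it assumes the server exposes the usual /rest/annotate endpoint with the text and confidence parameters, that the container was started with port 2222 mapped on localhost, and that the MIME types in ACCEPT are the ones the server uses for each format. All of these are assumptions to adjust for your setup.

```python
import urllib.parse
import urllib.request

# Assumption: host and port depend on how the Docker container was started.
BASE_URL = "http://localhost:2222/rest"

# Assumption: MIME types for the three formats named in step 6.
ACCEPT = {"json": "application/json", "turtle": "text/turtle", "csv": "text/csv"}


def build_request(text, fmt="json", confidence=0.5, base_url=BASE_URL):
    """Build (but do not send) a GET request for the annotate endpoint."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{base_url}/annotate?{query}",
        headers={"Accept": ACCEPT[fmt]},
    )


def annotate(text, fmt="json", **kwargs):
    """Send the request and return the response body as a string.

    urlopen raises an exception if the server is unreachable or answers
    with an HTTP error status.
    """
    request = build_request(text, fmt, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With a server running, passing the result of annotate("Berlin is the capital of Germany.") to json.loads is a quick way to confirm that the JSON response is well formed. The other formats need their own parsers to be validated.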

All of these steps must be checked or tested to produce a quality language model for the corresponding language.
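
The file-level checks in steps 2, 3, and 5 lend themselves to a small script. The sketch below is illustrative only: the function names are hypothetical, the identifier pattern is inferred from the en_US-English example, and "uriCounts/uriCounts_all" is read as "either name is acceptable", which is an assumption about the wording above.

```python
import re
from pathlib import Path

# Identifier such as "en_US-English": two-letter language, two-letter
# locale, then the stemmer name (pattern inferred from the example).
MODEL_ID = re.compile(r"^([a-z]{2})_([A-Z]{2})-(\w+)$")

# Entries expected in wdir/[LANG]_[LOCALE] after the statistics extraction.
# Each tuple lists names that are treated as alternatives (an assumption).
STAT_FILES = (("tokenCounts",), ("uriCounts", "uriCounts_all"), ("pairCounts",))


def parse_model_id(model_id):
    """Split an identifier into (language, locale, stemmer)."""
    match = MODEL_ID.match(model_id)
    if match is None:
        raise ValueError(f"not of the form LANG_LOCALE-Stemmer: {model_id!r}")
    return match.groups()


def missing_files(root, model_id):
    """Return the expected paths (relative to root) that do not exist."""
    lang, locale, _stemmer = parse_model_id(model_id)
    root = Path(root)
    missing = []
    # Step 3: the stopword list.
    if not (root / lang / "stopwords.list").is_file():
        missing.append(f"{lang}/stopwords.list")
    # Step 5: the statistics; any of the alternative names satisfies a check.
    stats_dir = root / "wdir" / f"{lang}_{locale}"
    for alternatives in STAT_FILES:
        if not any((stats_dir / name).exists() for name in alternatives):
            missing.append(f"wdir/{lang}_{locale}/{alternatives[0]}")
    return missing
```

Calling missing_files(".", "en_US-English") from the main directory returns an empty list when everything is in place, and otherwise names the first thing to investigate.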