DBpedia Spotlight: How to help improve the quality of multilingual entity extraction

What is DBpedia Spotlight?

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.

The new release is distributed as a Docker image that runs DBpedia Spotlight as a server with the most recent language models, downloaded from the DBpedia Databus repository, e.g., English (en), German (de), Italian (it), etc.
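As a sketch of how such a server could be started, the following assumes the image is published as `dbpedia/dbpedia-spotlight` and exposes the service on port 80 inside the container; the container name, volume name, and host port are illustrative, so check the release notes for the exact invocation:

```shell
# Start a Spotlight server for English (en); the model is cached in a
# named volume so it is not re-downloaded on every container restart.
docker run -tid \
    --name dbpedia-spotlight.en \
    --mount source=spotlight-model,target=/opt/spotlight \
    -p 2222:80 \
    dbpedia/dbpedia-spotlight spotlight.sh en
```

Replacing the trailing `en` with another language code (de, it, …) would start a server for that language, assuming a model for it is available on the Databus.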

While developing this new release, we found a set of problems, and we are asking the community for help in overcoming them. The list of problems below will be updated as new ones are found.

Improving the quality of language models

The model-quickstarter project builds a language model for DBpedia Spotlight; the process is explained in the project's README file. Some steps of this process must be checked or tested to produce a quality language model. The following list defines the actions to be taken (check or test) to produce correct output.

- Abbreviations

  • LANG: a two-letter lowercase language code, e.g., en, it, and fr are the codes for the English, Italian, and French languages, respectively.
  • LOC: a two-letter uppercase location code, e.g., US, IT, and FR are the codes for the United States of America, Italy, and France, respectively.
  • ./ : the local directory where the project was cloned (git clone …)

- Initial steps:

  1. [Check] The language, locale, and stemmer are defined correctly in the mainModelBuilder.sh file, e.g., the English language is defined as en_US-English, where en is the language, US is the locale, and English is the stemmer. If the language, locale, and/or stemmer is missing:

    • Refer to the BCP 47 documentation to define the correct two-letter language and locale codes. A language and locale may already be defined, but the chosen locale may not be the best selection.
    • Refer to the Snowball stemming algorithms to define the correct stemmer algorithm, if one exists.
      • If no stemmer algorithm is defined for a specific language, you could help us by sharing a possible solution to this problem. We are also looking for new stemmer algorithms for those languages that have no Snowball stemming algorithm defined.
  2. [Check] The stopword list file (stopwords.list) must exist in the ./LANG directory. If the stopword file does not exist:

    • Create the LANG folder in the main directory and create the ./LANG/stopwords.list file containing the corresponding stopwords.
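The stopword-list check above can be sketched as a small shell function; the `xx`/`yy` language codes and the demo file contents below are placeholders, not real project data:

```shell
# Demo setup: create a stopword list for a placeholder language code 'xx'.
mkdir -p ./xx
printf 'the\nand\nof\n' > ./xx/stopwords.list

check_stopwords() {
    # A valid list is an existing, non-empty ./LANG/stopwords.list file.
    if [ -s "./$1/stopwords.list" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1"
    fi
}

check_stopwords xx   # list exists and is non-empty
check_stopwords yy   # no such list was created
```

Running the same check for every language you intend to build would catch a missing list before the (long) model build starts.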

- Step 1: Preparing the data

  1. [Check] The Wikipedia dump file must be downloaded into the ./wdir/LANG_LOC directory
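As an illustration, a dump could be fetched manually from dumps.wikimedia.org; the filename pattern below is the standard one for the full articles dump, but check the project README for the exact file the build script expects:

```shell
# Download the latest English (en_US) articles dump into the working directory.
LANG_CODE=en
mkdir -p ./wdir/en_US
wget -P ./wdir/en_US \
    "https://dumps.wikimedia.org/${LANG_CODE}wiki/latest/${LANG_CODE}wiki-latest-pages-articles.xml.bz2"
```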

- Step 2: DBpedia extraction

  1. [Check] Identify whether any artifact is missing from the following query. If an artifact is missing, please refer to the corresponding README information for each link. The following list is a first approach to resolving a missing artifact:

    • Instance-types: Create mappings in mappings.dbpedia.org

    • Disambiguations: The DBpedia Extraction Framework includes a configuration file (DisambiguationExtractorConfig.scala) for discovering disambiguation pages in specific languages. You could try adding a missing language, or test it for those languages with missing disambiguations (refer to the SPARQL query). Each language-specific configuration maps a language code to the label that marks a disambiguation page, for example:

      • “cs” -> " (rozcestník)", “de” -> " (Begriffsklärung)", “el” -> " (αποσαφήνιση)", “en” -> " (disambiguation)"
    • Redirects: please refer to documentation in Databus for more details.

- Step 3: Extracting wiki stats.

  1. [Check] The following non-empty files must be created in the ./wdir/LANG_LOC/ directory:

    • pairCounts
    • sfAndTotalCounts
    • tokenCounts
    • uriCounts
  • If any of these files is not produced or is empty, a possible cause is an update to the template of the Wikipedia dump for that language.
  • For other kinds of problems, please refer to the wikistat project.
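The check in step 3 can be sketched as a loop over the four expected files; the `wdir/xx_XX` directory and the stand-in file contents below are placeholders created only for this demo:

```shell
# Demo setup: create the four wiki-stats output files with stand-in content.
STATS_DIR=./wdir/xx_XX
mkdir -p "$STATS_DIR"
for f in pairCounts sfAndTotalCounts tokenCounts uriCounts; do
    printf 'demo\t1\n' > "$STATS_DIR/$f"
done

# The actual check: every file must exist and be non-empty.
missing=0
for f in pairCounts sfAndTotalCounts tokenCounts uriCounts; do
    if [ -s "$STATS_DIR/$f" ]; then
        echo "ok: $f"
    else
        echo "missing or empty: $f"
        missing=1
    fi
done
echo "missing=$missing"
```

In a real build, `STATS_DIR` would point at ./wdir/LANG_LOC and the demo setup would be dropped; a non-zero `missing` flag means step 3 has to be investigated before continuing.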

- Step 4: Setting up Spotlight

  1. [Check] The dbpedia-spotlight-model project must be cloned into the ./wdir/dbpedia-spotlight directory

- Step 5: Build Spotlight model

  1. [Test] Run the script (mainModelBuilder.sh) to generate the language model from the model-quickstarter project.

  2. [Check] The language model must be produced in the ./models/LANG directory with the following structure:

    ├── fsa_dict.mem
    ├── model
    │   ├── candmap.mem
    │   ├── context.mem
    │   ├── quantized_counts.mem
    │   ├── res.mem
    │   ├── sf.mem
    │   └── tokens.mem
    ├── model.properties
    ├── spotter_thresholds.txt
    └── stopwords.list

  3. [Test] Run DBpedia Spotlight with the corresponding language model.

    • The DBpedia Spotlight server must be queryable through curl
    • Every query must return a valid response in the selected format: TURTLE, JSON, or CSV.
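A minimal smoke test could query the annotation endpoint with curl; this assumes the server is listening on localhost:2222 (adjust the host and port to your setup) and uses the standard Spotlight REST endpoint and Accept-header content negotiation:

```shell
# Annotate a short sentence and request the result as JSON.
curl -s "http://localhost:2222/rest/annotate" \
     -H "Accept: application/json" \
     --data-urlencode "text=Berlin is the capital of Germany." \
     --data "confidence=0.5"
```

A healthy server should return a JSON document whose `Resources` entries link the recognized surface forms (e.g., "Berlin") to DBpedia URIs; an empty or error response indicates a problem with the model.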

If everything is fine (all the steps were checked/tested), the final step is to open a pull request so that your changes can be verified and merged into the main project (model-quickstarter).