DBpedia Spotlight: How to help improve the quality of multilingual entity extraction

What is DBpedia Spotlight?

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.

The new release is distributed as a Docker image that runs DBpedia Spotlight as a server with the most recent language models, downloaded from the DBpedia Databus repository, e.g., English (en), German (de), Italian (it), etc.
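As a sketch of how such a server could be started, the following assumes the image is published as `dbpedia/dbpedia-spotlight` and exposes the service on port 80 inside the container; the container name, volume name, and host port are illustrative, so check the release notes for the exact invocation:

```shell
# Start a Spotlight server for English (en); the model is cached in a
# named volume so it is not re-downloaded on every container restart.
docker run -tid \
    --name dbpedia-spotlight.en \
    --mount source=spotlight-model,target=/opt/spotlight \
    -p 2222:80 \
    dbpedia/dbpedia-spotlight spotlight.sh en
```

Replacing the trailing `en` with another language code (de, it, …) would start a server for that language, assuming a model for it is available on the Databus.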

While developing this new release, we found a set of problems, and we are asking the community for help in overcoming them. The list of problems below will be updated as new ones are found.

Improving the quality of language models

The model-quickstarter project builds a language model for DBpedia Spotlight; the process is explained in the project's README file. Some steps of this process must be checked or tested to produce a quality language model. The following list defines the actions to be taken (check or test) to produce correct output.

- Abbreviations

  • LANG: a two-letter lowercase language code, e.g., en, it, and fr are the codes for the English, Italian, and French languages, respectively.
  • LOC: a two-letter uppercase location code, e.g., US, IT, and FR are the codes for the United States of America, Italy, and France, respectively.
  • ./ : the local directory where the project was cloned (git clone …)

- Initial steps:

  1. [Check] The language, locale, and stemmer are defined correctly in the mainModelBuilder.sh file, e.g., the English language is defined as en_US-English, where en is the language, US is the locale, and English is the stemmer. If the language, locale, and/or stemmer is missing:

    • Refer to the BCP 47 documentation to define the correct two-letter language and locale codes. A language and locale may already be defined, but the chosen locale may not be the best selection.
    • Refer to the Snowball stemming algorithms to define the correct stemmer algorithm, if one exists.
      • If no stemmer algorithm is defined for a specific language, you could help us by sharing a possible solution to this problem. We are also looking for new stemmer algorithms for those languages that have no Snowball stemming algorithm defined.
  2. [Check] The stopword list file (stopwords.list) must exist in the ./LANG directory. If the stopword file does not exist:

    • Create the LANG folder in the main directory and create the ./LANG/stopwords.list file containing the corresponding stopwords.
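The stopword-list check above can be sketched as a small shell function; the `xx`/`yy` language codes and the demo file contents below are placeholders, not real project data:

```shell
# Demo setup: create a stopword list for a placeholder language code 'xx'.
mkdir -p ./xx
printf 'the\nand\nof\n' > ./xx/stopwords.list

check_stopwords() {
    # A valid list is an existing, non-empty ./LANG/stopwords.list file.
    if [ -s "./$1/stopwords.list" ]; then
        echo "ok: $1"
    else
        echo "missing or empty: $1"
    fi
}

check_stopwords xx   # list exists and is non-empty
check_stopwords yy   # no such list was created
```

Running the same check for every language you intend to build would catch a missing list before the (long) model build starts.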

- Step 1: Preparing the data

  1. [Check] The Wikipedia dump file must be downloaded into the ./wdir/LANG_LOC directory
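As an illustration, a dump could be fetched manually from dumps.wikimedia.org; the filename pattern below is the standard one for the full articles dump, but check the project README for the exact file the build script expects:

```shell
# Download the latest English (en_US) articles dump into the working directory.
LANG_CODE=en
mkdir -p ./wdir/en_US
wget -P ./wdir/en_US \
    "https://dumps.wikimedia.org/${LANG_CODE}wiki/latest/${LANG_CODE}wiki-latest-pages-articles.xml.bz2"
```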

- Step 2: DBpedia extraction

  1. [Check] Identify whether any artifact is missing from the following query. If an artifact is missing, please refer to the corresponding README information for each link. The following list is a first approach to resolving a missing artifact:

    • Instance-types: Create mappings in mappings.dbpedia.org

    • Disambiguations: The DBpedia Extraction Framework includes a configuration file (DisambiguationExtractorConfig.scala) for discovering disambiguation pages in specific languages. You could try adding a missing language, or test it for those languages with missing disambiguations (refer to the SPARQL query). Each language-specific configuration maps a language code to the label that marks a disambiguation page, for example:

      • “cs” -> " (rozcestník)", “de” -> " (Begriffsklärung)", “el” -> " (αποσαφήνιση)", “en” -> " (disambiguation)"
    • Redirects: please refer to documentation in Databus for more details.

- Step 3: Extracting wiki stats.

  1. [Check] The following non-empty files must be created in the ./wdir/LANG_LOC/ directory:

    • pairCounts
    • sfAndTotalCounts
    • tokenCounts
    • uriCounts
  • If any of these files is not produced or is empty, a possible cause is an update to the template of the Wikipedia dump for that language.
  • For other kinds of problems, please refer to the wikistat project.
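The check in step 3 can be sketched as a loop over the four expected files; the `wdir/xx_XX` directory and the stand-in file contents below are placeholders created only for this demo:

```shell
# Demo setup: create the four wiki-stats output files with stand-in content.
STATS_DIR=./wdir/xx_XX
mkdir -p "$STATS_DIR"
for f in pairCounts sfAndTotalCounts tokenCounts uriCounts; do
    printf 'demo\t1\n' > "$STATS_DIR/$f"
done

# The actual check: every file must exist and be non-empty.
missing=0
for f in pairCounts sfAndTotalCounts tokenCounts uriCounts; do
    if [ -s "$STATS_DIR/$f" ]; then
        echo "ok: $f"
    else
        echo "missing or empty: $f"
        missing=1
    fi
done
echo "missing=$missing"
```

In a real build, `STATS_DIR` would point at ./wdir/LANG_LOC and the demo setup would be dropped; a non-zero `missing` flag means step 3 has to be investigated before continuing.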

- Step 4: Setting up Spotlight

  1. [Check] The dbpedia-spotlight-model project must be cloned into the ./wdir/dbpedia-spotlight directory

- Step 5: Build Spotlight model

  1. [Test] Run the script (mainModelBuilder.sh) to generate the language model from the model-quickstarter project.

  2. [Check] The language model must be produced in the ./models/LANG directory with the following structure:

    ├── fsa_dict.mem
    ├── model
    │   ├── candmap.mem
    │   ├── context.mem
    │   ├── quantized_counts.mem
    │   ├── res.mem
    │   ├── sf.mem
    │   └── tokens.mem
    ├── model.properties
    ├── spotter_thresholds.txt
    └── stopwords.list

  3. [Test] Run DBpedia Spotlight with the corresponding language model.

    • The DBpedia Spotlight server must be queryable through curl
    • Every query must return a valid response in the selected format: TURTLE, JSON, or CSV.
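A minimal smoke test could query the annotation endpoint with curl; this assumes the server is listening on localhost:2222 (adjust the host and port to your setup) and uses the standard Spotlight REST endpoint and Accept-header content negotiation:

```shell
# Annotate a short sentence and request the result as JSON.
curl -s "http://localhost:2222/rest/annotate" \
     -H "Accept: application/json" \
     --data-urlencode "text=Berlin is the capital of Germany." \
     --data "confidence=0.5"
```

A healthy server should return a JSON document whose `Resources` entries link the recognized surface forms (e.g., "Berlin") to DBpedia URIs; an empty or error response indicates a problem with the model.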

If everything is fine (all the steps were checked/tested), the final step is to open a pull request so that your changes can be verified and merged into the main project (model-quickstarter).