What is DBpedia Spotlight?
DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.
The new release is a Docker image that runs DBpedia Spotlight as a server with the most recent language models, downloaded from the DBpedia Databus repository, e.g., English (en), German (de), and Italian (it).
During the development of this release, we found a set of problems and are asking the community for help in overcoming them. The problems are listed below, and the list will be updated as new ones are found.
Improving the quality of language models
The model-quickstarter project builds a language model for DBpedia Spotlight; the process is explained in the project README file. Some steps of this process must be checked or tested to produce a quality language model. The following list defines the action to take (check or test) at each step to produce correct output.
- LANG: a two-letter lowercase language code, e.g., en, it, and fr are the codes for English, Italian, and French, respectively.
- LOC: a two-letter uppercase location code, e.g., US, IT, and FR are the codes for the United States of America, Italy, and France, respectively.
- ./ : is the local directory where the project was cloned (git clone …)
- Initial steps:
[Check] The language, locale, and Stemmer are defined correctly in the mainModelBuilder.sh file, e.g., the English language is defined as en_US-English, where en is the language, US is the locale, and English is the Stemmer. If there is no language, locale, and/or Stemmer:
- Refer to the BCP47 documentation to define the correct two-letter language and locale codes. A language and locale may already be defined, but the locale may not be the best choice.
- Refer to the Snowball stemming algorithms to define the correct stemmer algorithm, if one exists.
- If no stemmer algorithm is defined for a specific language, you could help us by sharing a possible solution for this problem. We are also looking for new stemmer algorithms for languages that have no Snowball stemming algorithm defined.
[Check] The stopword list file (stopwords.list) must exist in the ./LANG directory. If the stopword file does not exist:
- Create the LANG folder in the main directory and create the ./LANG/stopwords.list file containing the corresponding stopwords.
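The two initial checks can be scripted before starting a build. The following is a minimal sketch, assuming the language definition always follows the LANG_LOC-Stemmer pattern seen in en_US-English. The names parse_language_spec and has_stopwords are hypothetical helpers, not part of model-quickstarter, and the non-empty test is slightly stricter than the existence check described above.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Pattern inferred from the example "en_US-English" (an assumption):
# two lowercase letters, underscore, two uppercase letters, hyphen, stemmer name.
SPEC_PATTERN = re.compile(r"^([a-z]{2})_([A-Z]{2})-([A-Za-z]+)$")


class LanguageSpec(NamedTuple):
    lang: str
    loc: str
    stemmer: str


def parse_language_spec(spec: str) -> LanguageSpec:
    """Split a definition such as 'en_US-English' into its three parts."""
    match = SPEC_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unexpected language definition: {spec!r}")
    return LanguageSpec(*match.groups())


def has_stopwords(project_dir: Path, lang: str) -> bool:
    """Return True if <project_dir>/<lang>/stopwords.list exists and is not empty."""
    path = project_dir / lang / "stopwords.list"
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    spec = parse_language_spec("en_US-English")
    print(spec.lang, spec.loc, spec.stemmer)
    print(has_stopwords(Path("."), spec.lang))
```

Run from the cloned project directory, the script reports whether ./en/stopwords.list is in place for the example definition.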
- Step 1: Preparing the data
[Check] The Wikipedia dump file must be downloaded in
- Step 2: DBpedia extraction
[Check] Check whether any artifact is missing in the following query. If an artifact is missing, refer to the README information behind the corresponding link. The following list is a first approach to resolving a missing artifact:
Instance-types: Create mappings in mappings.dbpedia.org
Disambiguations: The DBpedia Extraction Framework integrates a configuration file to discover disambiguation pages for specific languages (DisambiguationExtractorConfig.scala). You could try to add a missing language or test it for those languages with missing disambiguation (refer to the SPARQL query). The following is an example of language-specific configurations:
- label indicating a disambiguation page, e.g. “cs” -> " (rozcestník)", “de” -> " (Begriffsklärung)", “el” -> " (αποσαφήνιση)", “en” -> " (disambiguation)",
Redirects: please refer to documentation in Databus for more details.
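To illustrate what the language-specific disambiguation labels represent, here is a minimal sketch, assuming a label is matched as a suffix of the page title. It is not the Extraction Framework's own code (that lives in DisambiguationExtractorConfig.scala); the dictionary only holds the four labels quoted above, and is_disambiguation_title is a hypothetical helper.

```python
# Labels copied from the examples quoted above; any other language
# would need its own entry.
DISAMBIGUATION_LABELS = {
    "cs": " (rozcestník)",
    "de": " (Begriffsklärung)",
    "el": " (αποσαφήνιση)",
    "en": " (disambiguation)",
}


def is_disambiguation_title(title: str, lang: str) -> bool:
    """Return True if the title ends with the disambiguation label for lang.

    Assumption: the label appears as a title suffix.
    """
    try:
        label = DISAMBIGUATION_LABELS[lang]
    except KeyError:
        raise KeyError(f"no disambiguation label configured for {lang!r}") from None
    return title.endswith(label)


if __name__ == "__main__":
    print(is_disambiguation_title("Mercury (disambiguation)", "en"))
    print(is_disambiguation_title("Mercury", "en"))
```

A language that raises KeyError here is one whose label has not been configured, which mirrors the "missing language" case described above.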
- Step 3: Extracting wiki stats.
[Check] The following non-empty files must be created in the
- If any of the files is not produced or is empty, a possible error could be related to an update in the template of the Wikipedia language dump file.
- For any other kind of problem, please refer to the wikistat project.
- Step 4: Setting up Spotlight
[Check] The dbpedia-spotlight-model project must be cloned in the
- Step 5: Build Spotlight model
[Test] Run the script (mainModelBuilder.sh) to generate the language model from the model-quickstarter project.
[Check] The language model must be produced in the ./models/LANG directory with the following structure:
│ ├── candmap.mem
│ ├── context.mem
│ ├── quantized_counts.mem
│ ├── res.mem
│ ├── sf.mem
│ └── tokens.mem
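The structure check can be automated. The sketch below assumes only the six file names listed above. Because the parent directory in the listing is not shown, it searches recursively under the model directory instead of guessing a path. missing_model_files is a hypothetical helper, and treating an empty file as missing is an added assumption.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
MODEL_FILES = (
    "candmap.mem",
    "context.mem",
    "quantized_counts.mem",
    "res.mem",
    "sf.mem",
    "tokens.mem",
)


def missing_model_files(model_dir: Path) -> List[str]:
    """Return the expected .mem files that are absent or empty under model_dir."""
    found = {
        path.name
        for path in model_dir.rglob("*.mem")
        if path.is_file() and path.stat().st_size > 0
    }
    return [name for name in MODEL_FILES if name not in found]


if __name__ == "__main__":
    # "en" stands in for LANG here.
    missing = missing_model_files(Path("./models/en"))
    print("missing or empty:", missing if missing else "none")
```

An empty result means all six files were found and are non-empty; it does not verify that their contents are valid.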
[Test] Run the DBpedia-Spotlight with the corresponding language model.
- The DBpedia-Spotlight server must be queryable through curl
- Every query must return a valid response in the selected format: TURTLE, JSON, or CSV.
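Besides curl, the server test can be run from a short script. This is a minimal sketch using only the Python standard library. The base URL (http://localhost:2222/rest) is an assumption about how the server was started, so adjust the host, port, and path to your setup. build_annotate_request and annotate_json are hypothetical helper names, and only the JSON format is covered.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of a locally running server; change it to match your setup.
BASE_URL = "http://localhost:2222/rest"


def build_annotate_request(text, confidence=0.5, base_url=BASE_URL,
                           accept="application/json"):
    """Build a GET request for the annotate endpoint without sending it."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{base_url}/annotate?{query}",
        headers={"Accept": accept},
    )


def annotate_json(text, confidence=0.5, base_url=BASE_URL):
    """Send the request and parse the JSON response (needs a running server)."""
    request = build_annotate_request(text, confidence, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    result = annotate_json("Berlin is the capital of Germany.")
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

If the call raises a connection error, the server is not reachable at the assumed address. If it raises a JSON decoding error, the server answered in a different format than the one requested.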
If everything is fine (all the steps were checked or tested), the final step is to open a pull request so that your changes can be verified and merged into the main project (model-quickstarter).