DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. The annotation process consists of four steps: spotting, candidate selection, disambiguation, and filtering (a brief description can be found on the DBpedia Spotlight webpage).
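The four steps can be illustrated with a toy sketch. This is not DBpedia Spotlight's implementation: the `LEXICON` dictionary, its scores, and every function name below are hypothetical and chosen only for illustration. In particular, the sketch spots single words only and disambiguates by picking the highest prior score, whereas a real annotator would also handle multi-word mentions and use the surrounding context.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon: surface form -> {candidate resource: prior score}.
# The scores are made-up illustration values, not real Spotlight statistics.
LEXICON = {
    "berlin": {"dbr:Berlin": 0.9, "dbr:Berlin,_New_Hampshire": 0.1},
    "germany": {"dbr:Germany": 1.0},
}


@dataclass
class Annotation:
    surface_form: str
    offset: int
    resource: str
    score: float


def spot(text):
    """Spotting: find words whose lowercased form is a known surface form."""
    return [
        (m.group(), m.start())
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in LEXICON
    ]


def select_candidates(surface_form):
    """Candidate selection: look up the resources a surface form may refer to."""
    return LEXICON[surface_form.lower()]


def disambiguate(candidates):
    """Disambiguation (simplified): keep the candidate with the highest score."""
    resource = max(candidates, key=candidates.get)
    return resource, candidates[resource]


def annotate(text, confidence=0.5):
    """Run the four steps; filtering drops annotations scoring below `confidence`."""
    annotations = []
    for surface_form, offset in spot(text):
        resource, score = disambiguate(select_candidates(surface_form))
        if score >= confidence:  # filtering
            annotations.append(Annotation(surface_form, offset, resource, score))
    return annotations


if __name__ == "__main__":
    for annotation in annotate("Berlin is the capital of Germany."):
        print(annotation)
```

Raising the `confidence` threshold makes the filtering step stricter, so fewer annotations survive.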
DBpedia analyzes Wikipedia and extracts triples for 140 languages (out of the 314 languages with active Wikipedias). At present, the DBpedia Spotlight annotation service is available for 17 languages. Ideally, it should support every language that DBpedia analyzes.
The model-quickstarter tool generates the language models used by the DBpedia Spotlight annotation service. When a stemming algorithm is available for a language, the generation process uses it to produce the model; otherwise, it falls back to a tokenization method. However, models generated with the tokenization method produce errors during the spotting step of the DBpedia Spotlight annotation process. Additionally, the tokenization method has not been used extensively to produce language models.
This project will increase the number of language models available for DBpedia Spotlight by replacing the current tokenization method with a multilingual tokenization technique, so that language models can be generated for languages with no available stemming algorithm. It will also develop an evaluation process to determine the accuracy of the results obtained with the multilingual tokenization method.
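The choice between stemming and tokenization can be sketched as follows. This is a minimal illustration under stated assumptions, not the model-quickstarter code: the `STEMMERS` registry, the toy English suffix stripper, and the function names are all hypothetical. The tokenizer relies only on Unicode word characters, so it works for any script that separates words with spaces or punctuation.

```python
import re
import unicodedata


def toy_english_stemmer(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Hypothetical registry: language code -> stemming function.
STEMMERS = {"en": toy_english_stemmer}


def tokenize(text):
    """Language-independent tokenization on Unicode word characters."""
    normalized = unicodedata.normalize("NFC", text)
    return re.findall(r"\w+", normalized.lower())


def normalize_tokens(text, lang):
    """Stem the tokens when a stemmer exists for `lang`; otherwise return plain tokens."""
    tokens = tokenize(text)
    stemmer = STEMMERS.get(lang)
    if stemmer is None:
        return tokens  # fallback: tokenization only
    return [stemmer(token) for token in tokens]


if __name__ == "__main__":
    print(normalize_tokens("Linking entities", "en"))
    print(normalize_tokens("Привет, мир", "ru"))
```

A regular expression on `\w+` does not segment text in languages written without spaces between words, so a multilingual method would need a dedicated segmenter for those cases.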
This project is divided into the following goals:
- Integrate stemming algorithms to provide support for additional languages.
- Implement a multilingual tokenization method to generate language models for languages with no available stemming algorithm.
- Evaluate the multilingual tokenization method.
- Increase the number of available language models for DBpedia Spotlight to annotate text.
- Attract users interested in tasks such as Named Entity Recognition (NER) and/or Named Entity Linking (NEL) to improve the language models generated through the model-quickstarter tool.
- Provide a new set of resources such as language models and their corresponding Wikipedia statistics to annotate text in any of the languages analyzed by DBpedia for tasks such as language modelling.
- Generate a language model with the model-quickstarter tool (see the table in the README).
- Analyze the data produced by the model generation process.
- Become familiar with the process followed by DBpedia Spotlight for annotating named entities (NEs), especially during the spotting step.
- Analyze the steps followed by the model-quickstarter (involving DBpedia Spotlight) to produce a language model.
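For the evaluation goal listed above, one possible way to score the spotting output is to compare predicted mention spans against a gold-standard annotation using precision, recall, and F1. The choice of these metrics is an assumption, since no specific measure is prescribed here, and the function name and the `(start, end)` span representation are hypothetical.

```python
def evaluate_spots(predicted, gold):
    """Compare predicted (start, end) spans with gold spans using exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold_spans = [(0, 6), (25, 32)]       # spans marked by a human annotator
    predicted_spans = [(0, 6), (10, 13)]  # spans produced by a model
    print(evaluate_spots(predicted_spans, gold_spans))
```

Exact span matching is strict; a partial-overlap criterion would be a reasonable alternative when tokenization boundaries differ between languages.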
tokenization, stemming, NER, NEL, language modeling