Integrate additional language models to the DBpedia Spotlight model generation through a multilingual tokenization process - GSoC2022

Description

DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. The annotation process is based on four steps: spotting, candidate selection, disambiguation, and filtering (a brief description can be found on the DBpedia Spotlight webpage).
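To make the four steps concrete, here is a minimal toy sketch in Python. It is not how DBpedia Spotlight is implemented: the surface-form dictionary, the scores, and all function names are invented for illustration, and disambiguation here just picks the highest prior score, whereas a real system also uses the surrounding context.

```python
import re

# Toy surface-form dictionary: mention text -> {candidate resource: score}.
# Entries and scores are invented for illustration only.
SURFACE_FORMS = {
    "berlin": {"dbr:Berlin": 0.9, "dbr:Berlin,_New_Hampshire": 0.1},
    "germany": {"dbr:Germany": 1.0},
}


def spot(text):
    """Spotting: find words that may refer to a known resource."""
    words = re.findall(r"\w+", text)
    return [w for w in words if w.lower() in SURFACE_FORMS]


def select_candidates(mention):
    """Candidate selection: look up the possible resources for a mention."""
    return SURFACE_FORMS[mention.lower()]


def disambiguate(candidates):
    """Disambiguation (simplified): keep the best-scoring candidate."""
    return max(candidates.items(), key=lambda item: item[1])


def annotate(text, confidence=0.5):
    """Run the four steps; filtering drops results below the threshold."""
    annotations = []
    for mention in spot(text):
        resource, score = disambiguate(select_candidates(mention))
        if score >= confidence:  # filtering
            annotations.append((mention, resource))
    return annotations


if __name__ == "__main__":
    print(annotate("Berlin is the capital of Germany"))
```

Raising the hypothetical `confidence` threshold removes low-scoring annotations, which is the role of the filtering step.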

DBpedia analyzes and extracts triples for 140 languages from Wikipedia (out of the 314 languages with active Wikipedias). Currently, the DBpedia Spotlight annotation service is available for 17 languages. Ideally, the DBpedia Spotlight annotation service should support the same number of languages as DBpedia analyzes.

The model-quickstarter tool generates the language models used by the DBpedia Spotlight annotation service. The generation process includes a step based on a stemming algorithm to produce the model for a specific language, and a tokenization method as an alternative for languages with no available stemming algorithm. However, the models generated with the tokenization method produce errors during the spotting step of the DBpedia Spotlight annotation process. Additionally, the tokenization method has not been used extensively to produce language models.
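The choice between stemming and plain tokenization described above can be sketched as follows. This is only an illustration in Python and does not reflect the actual model-quickstarter code: `toy_english_stemmer`, the `STEMMERS` registry, and `normalize` are hypothetical names, and the stemmer is a crude stand-in for a real stemming algorithm.

```python
import re


def toy_english_stemmer(token):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


# Hypothetical registry: language code -> stemming function.
STEMMERS = {"en": toy_english_stemmer}


def tokenize(text):
    """Unicode-aware word tokenization (lowercased)."""
    return re.findall(r"\w+", text.lower())


def normalize(text, lang):
    """Stem when a stemmer exists for `lang`; otherwise return plain tokens."""
    tokens = tokenize(text)
    stemmer = STEMMERS.get(lang)
    if stemmer is None:
        return tokens  # tokenization-only fallback
    return [stemmer(token) for token in tokens]


if __name__ == "__main__":
    print(normalize("Linked resources", "en"))
    print(normalize("Linked resources", "xx"))  # no stemmer registered
```

Note that a regular expression such as `\w+` returns a whole run of characters as one token for languages written without spaces between words, which is one reason a dedicated multilingual tokenizer is needed.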

This project will extend the number of language models available for DBpedia Spotlight by updating the implemented tokenization method with a multilingual tokenization technique, so that language models can be generated for languages with no available stemming algorithm. It will also develop an evaluation process to determine the accuracy of the results obtained by the multilingual tokenization method.

Goals

This project is divided into the following goals:

  1. Integrate stemming algorithms to provide support for additional languages.
  2. Implement a multilingual tokenization method to generate language models for languages with no available stemming algorithm.
  3. Evaluate the multilingual tokenization method.
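
One possible way to approach the evaluation goal is to compare the mentions spotted by a generated model against a manually annotated gold standard. The sketch below uses precision, recall, and F1, which is an assumption of this example and not a metric fixed by the project. Mentions are represented as hypothetical `(start_offset, surface_form)` pairs.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact-match mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative data only: offsets refer to
    # "Berlin is the capital of Germany".
    gold = [(0, "Berlin"), (25, "Germany")]
    predicted = [(0, "Berlin"), (14, "capital")]
    print(evaluate(predicted, gold))
```

Running the same comparison for a stemming-based model and a tokenization-based model of the same language would give a direct measure of how much the spotting step is affected.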

Impact

  • Increase the number of available language models for DBpedia Spotlight to annotate text.
  • Attract users interested in tasks such as Named Entity Recognition (NER) and/or Named Entity Linking (NEL) to improve the language models generated through the model-quickstarter tool.
  • Provide a new set of resources such as language models and their corresponding Wikipedia statistics to annotate text in any of the languages analyzed by DBpedia for tasks such as language modelling.

Warm-up tasks

  1. Generate a language model with the model-quickstarter tool (see the table in the README).
  2. Analyze the data produced by the model generation process.
  3. Become familiar with the process followed by DBpedia Spotlight for annotating named entities (NEs), especially during the spotting process.
  4. Analyze the steps followed by the model-quickstarter (involving DBpedia Spotlight) to produce a language model.

Mentors

Julio Hernandez

Project size (175h, 350h)

175

Keywords

tokenization, stemming, NER, NEL, language modeling

Hello everyone!

My name is Yuqicheng Zhu and I am an Electrical Engineering student at the Technical University of Munich, Germany. As an incoming Ph.D. student in knowledge graphs at the Bosch Center for Artificial Intelligence and the University of Stuttgart (Prof. Steffen Staab), I am fully motivated to contribute to DBpedia as a GSoC ’22 student.

I have been working as an AI engineer at Bosch since 2018, have two German patents related to AI, and have carried out numerous real-world AI projects. You can see more details on my LinkedIn profile [https://www.linkedin.com/in/yuqicheng-zhu-531658161/] or on GitHub [https://github.com/ZhuYuqicheng].

I’m particularly interested in integrating stemming algorithms to provide support for additional languages (Chinese). Looking forward to discussing the project idea with you!

Best regards,
Yuqicheng Zhu

Hi @zhuyuqicheng,

Thanks for getting in touch. The first step is to become familiar with the Google Summer of Code information on the DBpedia page (Google Summer of Code - DBpedia Association) and on the DBpedia_ Contributor Application Procedure Infos for GSoC page (DBpedia_ Contributor Application Procedure Infos for GSoC). There you will find important information such as the “Contributor Application Template”.

The next step is to prepare a proposal for your work (this is an example). In this case, you are interested in the main goal (integrate additional language models) and a specialized goal (integrate a stemming algorithm or develop a method for Chinese). We can polish this part later.

DBpedia is mainly written in Java and Scala. I saw that you have experience with Python, which is great, and I don’t think you will have problems with Java.

Thanks again for getting in touch. If you have any other questions, please let me know. Have a great day.

Hi Julio,
I am Fatma Chamekh, a GSoC mentor. Can I share the project mentoring with you? I am interested in this subject.

Best regards,

Hi @fatmachamekh,

Yes, of course; thanks for your interest. Any other ideas you have to improve the project are welcome. Have a great day.

My best regards

Julio

Hi @JulioNoe, we can discuss the project and all the details of the mentoring in a mentoring session.
Best regards,

Hi @fatmachamekh,

Sorry for the late reply. Yes, I have no problem, but we need a student or programmer interested in the project (also, I am not sure if we are in time to upload the proposal to the system). At this moment, nobody has uploaded a proposal or made contact to participate in the project; if you know somebody, that would be great. Have a great day.

My best regards

Julio