Extending Extraction Framework with Citations, Commons and Lexeme Extractors - GSoC2020


DBpedia is a crowd-sourced community effort to extract structured content from the various Wikimedia projects and make it publicly available to everyone on the Web. This project will extend the DBpedia extraction process (https://github.com/dbpedia/extraction-framework), which is continuously being developed by the community, with citation, Commons and lexeme information.


The student will develop the modules required to parse the information from each specific source. The developed modules will be used to extract a wider range of knowledge from Wikimedia, which will be made openly available to community members with different interests and language editions.


The triples created for each specific type of knowledge will be published for community use.
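
To give a rough idea of what "parse the source and create triples" can look like, here is a minimal sketch in Python. It is not the Extraction Framework's code (the framework itself is written in Scala), and everything in it is an assumption made for illustration: the `example.org` namespaces, the function names, and the simplification that a citation is a flat `{{cite ...}}` template whose fields are separated by `|`.

```python
import re

# Hypothetical namespaces, chosen for illustration only (not DBpedia's real ones).
RESOURCE = "http://example.org/resource/"
PROPERTY = "http://example.org/property/"

# Matches a flat "{{cite <type> | key=value | ...}}" template.
# Assumption: no nested templates or links inside the citation.
CITE_RE = re.compile(r"\{\{\s*cite (\w+)\s*\|([^{}]*)\}\}", re.IGNORECASE)


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def extract_citation_triples(page_title, wikitext):
    """Return one N-Triples line per key=value field of each citation template."""
    subject = "<" + RESOURCE + page_title.replace(" ", "_") + ">"
    triples = []
    for match in CITE_RE.finditer(wikitext):
        for field in match.group(2).split("|"):
            if "=" not in field:
                continue  # skip positional (unnamed) parameters
            key, value = (part.strip() for part in field.split("=", 1))
            if key and value:
                predicate = "<" + PROPERTY + key.replace(" ", "_") + ">"
                triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


sample = (
    "Berlin is a city.<ref>{{cite web |url=https://example.org/a "
    "|title=About Berlin}}</ref>"
)
for triple in extract_citation_triples("Berlin", sample):
    print(triple)
```

A real extractor would work on the parsed page rather than on regular expressions, and would map fields to ontology properties instead of reusing the raw parameter names.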

Warm up tasks

Preliminary experience with Extraction Framework




Extraction framework, text parsing, RDF generation

Would be willing to join in as co-mentor


Hi everyone,
My name is Jorão Gomes Junior. I am a master’s degree student in Computer Science at the Federal University of Juiz de Fora in Brazil, and I am interested in working with you on this project. I already have some experience with information extraction and text parsing (I have publications in this area: http://lattes.cnpq.br/4648512356800217) and have worked on a project on semantic extraction from noisy text (text generated by ASR tools). I will do the warm-up task now.

Hello everyone,

My name is Mykola Medynskyi. I’m a 2nd-year student studying Software Engineering at the Kyiv National Ukrainian university. I’d like to participate in GSoC 2020 and contribute to the Extraction Framework. My goal is to gain more experience coding in Scala, and DBpedia is the only organisation that offers such an opportunity. So I’ll do my best to make a useful contribution to the open source community.

I have a decent algorithmic background. Here’s my LeetCode page: https://leetcode.com/jlareck/

I’m currently writing a pet project in Scala. Here’s my GitHub account: https://github.com/jlareck. I also publish my university code there.

I’ve already done issue #8. Currently I am trying to improve the configuration for the Ukrainian language. Could you please advise me on what I should do to be accepted as a student?

Hi and welcome to DBpedia and GSoC.

@JulioNoe & @beyza maybe you could guide people in drafting their applications.

In case you need some more clarification or have a problem, please get back to me ASAP.

All the best


Could you please point me to the PDF of your publication? Is it in English?

Hi Mykola,

@mommi84 shared a successful project proposal, you can check that:

You should start writing up your idea: which tasks you want to perform and how you are going to achieve them.

Please ask if you have further questions!

Hi Beyza,
All of my publications are on my ResearchGate profile (https://www.researchgate.net/profile/Jorao_Gomes_Jr). I have papers in English (some still under peer review), but most are written in Portuguese. I also have my Undergraduate Thesis, which is an extended English version of one of my papers (http://www.monografias.ice.ufjf.br/tcc-web/tcc?id=442).

Very good, Jorao! Thanks for the pointers!

Hi Beyza,

I really want to program in Scala, but I am having trouble understanding how the extraction process works. Maybe you could tell me which areas of the extraction framework need improvement, or we could have a private conversation?

Hi Mykola,
Did you do the warm-up tasks? This is of course a complex framework. Did you read the GitHub page to understand how the extraction framework works [1]?

Hi Beyza,

I have already done two warm-up tasks and had my pull request merged. I’ve read the GitHub documentation and I see that some links are broken, for example /Development. I will keep trying to understand how the extraction process works, but meanwhile I can propose an idea: writing documentation for the extraction framework or some of its modules.

Hi, Beyza,
I got a better understanding of how the extraction process works over the weekend, and now I have a question regarding the citation extractor. The title of this GSoC idea says to extend the extraction framework with a citation extractor, but when I explored the code I noticed that a citation extractor is already implemented. Is this a misunderstanding, or does the citation extractor need improvement?

Hi Mykola! Good job! If you start writing your proposal, you can share it with me at beyza.yaman@adaptcentre.ie and I can give you suggestions on the draft if needed. You can also check out the following references:

You can check this paper : http://www.semantic-web-journal.net/system/files/swj1518.pdf

You can check this website: http://dev.dbpedia.org/Download_Data

Good catch! You are right about the citation extractor: it already exists, but it needs to be improved.

Hi Beyza,

Could you provide more details on what needs to be improved in the citation and commons extractors (I noticed that the commons extractor is already implemented too)?

@kurzum do you have something specific in mind that needs to be improved? Or would you like students to propose their own ideas?


@hopver or @jfrey do you have any specific task for this project?

No, I don’t even know what the topic is about. Improve/fix the citations extractor and commons: what and why?

@beyza Status is this:
We spent a lot of effort consolidating the framework as such, and also checked and tested the most popular extractors. There are certain extractors that we are running, but we have hardly any clue how well they are doing. So their status is “running, but we don’t know how well”. These include the ImageExtractor (there are several versions of this), the commons extractions (Commons changed a lot and has structured data now, but we haven’t looked at it, so we don’t know the specifics), and the Wikidata extractor, in particular mapping maintenance. We recently fixed the spotlight extraction, and there is a task with missing datasets here: Tasks for Volunteers, but it includes many languages and probably needs to be fixed over time.

So overall, the ImageExtractor, Commons and Wikidata mappings are the biggest construction sites here. For Wikidata mappings there is this: Wikidata mappings, effective for the Wikidata group (every artifact with mapping). Another task is checking the equivalentClass/Property mappings in the ontologies and the links we extract.
