DBpedia is a crowd-sourced community effort to extract structured content from the various Wikimedia projects and make it publicly available to everyone on the Web. This project will extend the DBpedia extraction framework (https://github.com/dbpedia/extraction-framework), which is continuously developed by the community, with citations, Commons and lexemes information.
Goals
The student will develop the required modules to parse information from each specific source. The developed modules will be used to extract a wider range of knowledge from the Wikimedia projects, which will be made openly available to community members across different interests and language editions.
Impact
The triples created for each specific type of knowledge will be published for community use.
Warm up tasks
Preliminary experience with Extraction Framework #8 #9
Mentors
TBA
Keywords
Extraction framework, text parsing, RDF generation
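To illustrate the kind of work involved (text parsing and RDF generation), here is a minimal, self-contained Scala sketch. It is not the actual DBpedia extraction framework API: the object name, the template regex, and the http://dbpedia.org/property/ predicates are illustrative assumptions only. It parses a single `{{cite web ...}}` citation template from wiki markup and emits simple N-Triples lines.

```scala
// Hypothetical sketch, NOT the real DBpedia CitationExtractor:
// parse one citation template and turn its fields into N-Triples.
object CitationSketch {

  // Extract key=value pairs from a single {{cite ...}} template.
  def parseCitation(wikiText: String): Map[String, String] = {
    val Template = """\{\{cite \w+\s*\|(.*)\}\}""".r
    wikiText.trim match {
      case Template(body) =>
        body.split('|').flatMap { field =>
          field.split("=", 2) match {
            case Array(k, v) => Some(k.trim -> v.trim)
            case _           => None // skip malformed fields
          }
        }.toMap
      case _ => Map.empty
    }
  }

  // Map the parsed fields to N-Triples lines (predicates are made up).
  def toTriples(subject: String, fields: Map[String, String]): Seq[String] =
    fields.toSeq.map { case (k, v) =>
      s"""<$subject> <http://dbpedia.org/property/$k> "$v" ."""
    }

  def main(args: Array[String]): Unit = {
    val fields = parseCitation(
      "{{cite web |title=DBpedia |url=https://dbpedia.org}}")
    toTriples("http://dbpedia.org/resource/DBpedia__cite1", fields)
      .foreach(println)
  }
}
```

The real extractors are considerably more involved (they walk a parsed page-node tree and emit `Quad` objects with datasets and provenance), but the parse-then-serialize shape above is the core idea.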
Hi everyone,
My name is Jorão Gomes Junior. I am a master's degree student in Computer Science at the Federal University of Juiz de Fora in Brazil, and I am interested in working with you on this project. I already have some experience with information extraction and text parsing (I have publications in this area: http://lattes.cnpq.br/4648512356800217) and worked on a project on semantic extraction from noisy text (text generated by ASR tools). I will do the warm-up task now.
My name is Mykola Medynskyi. I'm a 2nd-year student studying Software Engineering at the Kyiv National Ukrainian university. I'd like to participate in GSoC 2020 and contribute to the Extraction Framework. My goal is to get better experience coding in Scala, and DBpedia is the only organisation that provides such a possibility. So I'll do my best to make a useful contribution to the open-source community.
I’m currently writing a pet project in Scala, here’s my github account: https://github.com/jlareck. I also publish my university code there.
I've already done issue #8. Currently I am trying to improve the configuration for the Ukrainian language. Could you please advise what I should do to be accepted as a student?
I really want to program in Scala, but I am having trouble understanding how the extraction process works. Could you point me to which areas of the extraction framework need improvement, or could we have a private conversation?
Hi Mykola,
Did you do the warm-up tasks? This is, of course, a complex framework. Did you read the GitHub page to understand how the extraction framework works [1]?
I have already done two warm-up tasks and had my pull request merged. I've read the GitHub documentation and I see that some links are broken, for example /Development. I will keep trying to understand how the extraction process works, but in the meantime I can propose the idea of writing documentation for the extraction framework or some of its modules.
Hi Beyza,
I got a better understanding of how the extraction process works over the weekend, and now I have a question about the citation extractor. The title of this GSoC idea says to extend the extraction framework with a citation extractor, but when I explored the code I noticed that a citation extractor is already implemented. Is this a misunderstanding, or does the citation extractor need improvement?
Hi Mykola! Good job! If you start writing a proposal, you can share it with me at beyza.yaman@adaptcentre.ie and I can give you suggestions on the draft if needed. You can further check out the following references:
Could you provide more details on what needs to be improved in the citation and Commons extractors (I noticed that the Commons extractor is already implemented too)?
@beyza Status is this:
We spent a lot of effort consolidating the framework as such, and we also checked and tested the most popular extractors. There are certain extractors that we are running, but we have hardly any clue how well they are doing. So their status is "running, but we don't know how well". These include the ImageExtractor (there are several versions of this), the Commons extraction (Commons changed a lot and does structured data now, but we haven't looked at it, so we don't know specifics), and the Wikidata extractor, in particular mapping maintenance. We recently fixed the Spotlight extraction, and there is a task with missing datasets here: Tasks for Volunteers, but it includes many languages and probably needs to be fixed over time.
So overall, the ImageExtractor, Commons and Wikidata mappings are the biggest construction sites here. For Wikidata mappings there is the Wikidata mappings page, effective for the Wikidata group (every artifact with a mapping). Also worth checking are the equivalentClass/equivalentProperty mappings in the ontologies and the links we extract.