Enrichment of DBpedia NIF Dataset with various NLP tasks - GSoC2020

Hello all,

My Diploma Thesis is on the enrichment of the DBpedia NIF dataset, and I am looking for feedback from all of you. The goal of the thesis is to enrich the NIF dataset by performing several natural language processing tasks: sentence splitting, tokenization, part-of-speech tagging and link enhancement. I have performed these tasks for 5 languages: en, de, fr, es and ja.

Link to my GitHub repo with the implementation: https://github.com/pragal18/Enrichment_of_DBpedia_NIF_Dataset

Quick overview of what I did for sentence splitting, tokenization and POS tagging:

  1. Downloaded the NIF-context dataset for all 5 languages in ‘ttl’ format and extracted it.
  2. Created a Python script to separate the NIF-context file into individual articles (by article I mean a Wikipedia page, i.e. a resource in DBpedia). For example, let's say that nif_context_en.ttl contains 5 million articles. The script breaks nif_context_en into 5 million small files, each containing the triples of one specific article. These files are stored in a common directory, and each language has its own directory. It is also possible to separate only a subset of the articles, for example only the articles starting with the letter ‘A’: in this case it stores about 60,000 articles and ignores the rest.
  3. Obtained only the content of each article. This is stored as the object of the triple whose predicate is nif:isString.
  4. Used 6 different Python libraries (NLTK, spaCy, TextBlob, Pattern, konoha and nagisa) to perform these NLP tasks on the per-article content obtained in step 3.
  5. Stored the results of these NLP tasks as RDF triples in ‘ttl’ format.
  6. Compared the results of the different libraries for each task in terms of quality and efficiency.
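
As an illustration, steps 2 to 5 could look roughly like the sketch below for the sentence splitting task. This is not the code from the repository. It assumes the NIF-context file holds one triple per line, uses a naive regular expression as a stand-in for a real splitter such as NLTK, and the `nif=sentence&char=begin,end` URI scheme and all function names are hypothetical. Escape sequences inside the literal are not decoded, so offsets are only exact for text without escapes.

```python
import re
from collections import defaultdict

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Assumption: one triple per line, in the form <subject> <predicate> object .
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')
LITERAL = re.compile(r'^"(.*)"(?:\^\^<[^>]+>|@[\w-]+)?$', re.S)


def group_by_article(lines):
    """Step 2: group triples by subject URI, one group per article context."""
    articles = defaultdict(list)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match:
            articles[match.group(1)].append((match.group(2), match.group(3)))
    return articles


def article_text(triples):
    """Step 3: return the nif:isString literal of an article, or None."""
    for predicate, obj in triples:
        if predicate == NIF + "isString":
            match = LITERAL.match(obj)
            return match.group(1) if match else None
    return None


def split_sentences(text):
    """Step 4 (naive stand-in): yield (begin, end, sentence) offsets.

    Splits after '.', '!' or '?', so abbreviations such as "e.g." are
    split incorrectly; a real library handles these cases.
    """
    for match in re.finditer(r'\S[^.!?]*[.!?]*', text):
        sentence = match.group().rstrip()
        yield match.start(), match.start() + len(sentence), sentence


def sentence_triples(context_uri, text):
    """Step 5: emit one-triple-per-line output for each sentence."""
    # Hypothetical URI scheme for sentence resources.
    prefix = context_uri.replace("nif=context", "nif=sentence")
    for begin, end, _ in split_sentences(text):
        uri = f"<{prefix}&char={begin},{end}>"
        yield f"{uri} <{RDF_TYPE}> <{NIF}Sentence> ."
        yield f"{uri} <{NIF}referenceContext> <{context_uri}> ."
        yield f'{uri} <{NIF}beginIndex> "{begin}"^^<{XSD}nonNegativeInteger> .'
        yield f'{uri} <{NIF}endIndex> "{end}"^^<{XSD}nonNegativeInteger> .'
```

Grouping by subject keeps each article's triples together, so every article can be written to its own file or processed independently.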

It is possible to reproduce the dataset containing the results of these NLP tasks. I have created a shell script (which in turn executes the appropriate Python script) in which you can specify the number of articles, or a particular article, for which you would like to perform some or all of these NLP tasks (including link enhancement). It is possible to specify the language as well.

Link enhancement is an NLP task that focuses on increasing the number of links to other Wikipedia articles within each article.

Quick overview of the link enhancement task:

  1. Downloaded the NIF-Text Links dataset for all 5 languages and extracted it.
  2. Wrote a Python script to create a CSV file with three columns: surface form, link to the DBpedia resource, and part of speech.
  3. Wrote a Python script to remove the duplicates from the CSV file.
  4. The content of every small file created by separating NIF-context is parsed word by word. A link is added to a word (or group of words) if its surface form and part of speech match any record in the CSV file created in step 3.
  5. Stored the link dataset of each article as RDF triples in ‘ttl’ format.
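
Steps 3 and 4 could be sketched as below. This is only an illustration, not the repository code. It assumes the CSV columns are in the order surface form, resource URI, part of speech, that the first record wins when duplicates occur, and that a multi-word surface form is matched using the POS tag of its last token. All of these choices, and the function names, are hypothetical.

```python
import csv


def load_surface_forms(csv_lines):
    """Build {(surface form, POS): resource URI}, keeping the first record
    for every duplicate key (assumed de-duplication rule)."""
    table = {}
    for surface, uri, pos in csv.reader(csv_lines):
        table.setdefault((surface, pos), uri)
    return table


def link_tokens(tagged_tokens, table, max_len=3):
    """Return (begin, end, surface form, URI) for every matched span.

    tagged_tokens is a list of (word, pos, begin, end) tuples. Longer
    spans are tried first so "New York" wins over "New".
    """
    links = []
    i = 0
    while i < len(tagged_tokens):
        for n in range(min(max_len, len(tagged_tokens) - i), 0, -1):
            span = tagged_tokens[i:i + n]
            surface = " ".join(word for word, _, _, _ in span)
            pos = span[-1][1]  # assumption: POS of the last token in the span
            uri = table.get((surface, pos))
            if uri:
                links.append((span[0][2], span[-1][3], surface, uri))
                i += n  # continue after the linked span
                break
        else:
            i += 1  # no match starting at this token
    return links
```

The dictionary lookup keeps the per-word cost constant, which matters when millions of article files are scanned against a large surface form table.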

I believe the result of these NLP tasks could be used for performing more advanced NLP tasks.

Please go through the GitHub repository and try out its functionality: https://github.com/pragal18/Enrichment_of_DBpedia_NIF_Dataset

The processing steps are documented in readme.md.

Your feedback is much appreciated.

I am not very good at explaining things, so I apologize for that, and also if I have used any silly terms. Please get back to me if you have any further questions.

Thank you in advance,
Pragal

Hi Pragal,
Do you have some result files? Are you able to post them, or are they too large?
Do you have a draft of the thesis, or just the implementation for now?

Hello,

The result files can be generated by running the shell script. The GitHub repo also contains some result files: Files/sentence has results of the sentence splitting task, and Files/Tokens, Files/POS and Files/Links have the results of the tokenization, part-of-speech tagging and link enhancement tasks respectively.

I am currently writing my thesis, so only the implementation is complete as of now.

Yes, I am aware that the script could generate these. But do you have them generated already?

I have them generated for sentences, tokens and POS. I don’t have the complete result set for link enhancement. However, it is difficult to post them because there are millions of small files, and it takes time to collect all these files and share them somewhere.

They are on your laptop? If you use tar -czvf filename.tar.gz /path/to/dir1 how big is it?

Hello,

The results are on my external hard disk.
The size is roughly 5 GB for sentences.tar.gz, 17 GB for Tokens.tar.gz and 20 GB for Part-of-speech.tar.gz.