Hello all,
I have done my Diploma Thesis on Enrichment of DBpedia NIF Dataset and I am looking for feedback from all of you. The goal of my thesis is to enrich the NIF Dataset by performing various Natural language processing tasks namely sentence splitting, Tokenization, Part of speech tagging and Link enhancement. I have performed these tasks for 5 different languages - en, de, fr, es and ja .
Link to my Github repo of the implementation : https://github.com/pragal18/Enrichment_of_DBpedia_NIF_Dataset
Quick overview on what I did for sentence-splitting , tokenisation and POS :
- Downloaded the NIF-context dataset for all these 5 languages in ‘ttl’ format and extracted them.
- Created a python script to separate the NIF-context file into individual articles (By article I am trying to imply - a wikipedia page ie. a resource in DBpedia) . For eg: lets say that nif_context_en.ttl contains 5million articles inside it. This script breaks nif_context_en into 5 million small files with each containing triples corresponding to that specific article. These files are stored in a common directory. Each language has its own directory. It is also possible to separate only a subset of these articles like - separate only articles starting with alphabet ‘A’ from NIF-context : in this case it stores about 60000 articles and ignores the rest.
- Obtain only the content of each article. This is stored as an object to one of the triples with predicate as nif.isString .
- Have used 6 different libraries in python ( NLTK, SpacyIO, TextBlob , Pattern , konoha and nagisa) to perform the these NLP tasks on the individual files created on step 3.
- Store the results of these NLP tasks as RDF triples in ‘ttl’ format.
- Compared the results of these different libraries for each of the tasks in terms of quality and efficiency.
It is possible to reproduce this Dataset containing the results of these NLP tasks. I have created shell script (which in turn executes the appropriate python script) where in you can specify the number of articles or specify a particular article for which you would like to perform some or all of these NLP tasks(including Link enhancement) and it is possible to specify the language as well.
Link Enhancement is a NLP task focussing on increasing the number of links to other wikipedia articles, for each article.
Quick overview on Link Enhancement task :
- Downloaded the NIF-Text Links for all these 5 languages. Extracted it.
- A python script to Create a csv file containing surface form - link to the dbpedia resource - Part of speech.
- A python script to remove the duplicates in the csv file
- The content of every small file created from separating NIF-context is parsed word by word. A link is provided to those word(s) if their surface-form and part of speech matches to anyone record on the csv file created on step 3.
- Stores the link dataset of each article in RDF triples in ‘ttl’ format.
I believe the result of these NLP tasks could be used for performing more advanced NLP tasks.
Please go though the GitHub repository and try to use its functionality- https://github.com/pragal18/Enrichment_of_DBpedia_NIF_Dataset
The processing steps are documented on readme.md
Your feedback is much appreciated.
I am really bad at explanation. I am sorry for that and also apologize me if I had used some silly term(s). Please get back to me in case of any further questions.
Thank you in advance,
Pragal