We are using the dbpedia spotlight wikistatsextractor data (reference here). Are these 4 files related or linked in any way? It doesn't appear as though they are. We also have a use case where we want to push this data into a database (MySQL) and give users the ability to add new entities when they don't exist in DBpedia. How can we go about doing this, since we don't know how the stats were calculated?
Thanks for your interest in the wikistatsextractor project. Your idea sounds interesting, and I hope the following information helps you reach your goal.
- How does it work? The wikistatsextractor traverses the Wikipedia dump file three times to produce all the statistics required by DBpedia Spotlight. The first pass extracts the URIs and redirections; the second extracts the surface-form counts; and the third extracts the token counts.
- What is the output? The following files are produced by the wikistatsextractor:
  - uriCounts: the number of times each DBpedia resource (URI) appears in the Wikipedia dump
  - pairCounts: the number of times a text (surface form) is linked to a DBpedia resource
  - sfAndTotalCounts: the number of times a text (surface form) appears as an anchor (second column) and the number of times it appears just as plain text (third column)
  - tokenCounts: the number of times each word (token) appears in each Wikipedia article
- Where is the code? The Launcher class is where the aforementioned files (uriCounts, pairCounts, sfAndTotalCounts, and tokenCounts) are generated.
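If it helps to see the shapes concretely: the first three files are plain tab-separated text, so they can be loaded with a few lines of Python. This is only a sketch; the sample rows below follow the layouts described above (with data borrowed from the example later in this thread), and the exact separators in your own extraction may differ.

```python
import csv
import io

def parse_tsv(text):
    """Parse tab-separated wikistatsextractor-style output into rows of columns."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

# Sample pairCounts rows: surface form, DBpedia URI, count.
pair_counts = [(sf, uri, int(n)) for sf, uri, n in parse_tsv(
    "The Songstress\thttp://dbpedia.org/resource/The_Songstress\t18\n"
    "Angel\thttp://dbpedia.org/resource/The_Songstress\t7\n"
)]

# Sample sfAndTotalCounts row: surface form, anchor count, total count.
sf_totals = {sf: (int(anchor), int(total)) for sf, anchor, total in parse_tsv(
    "The Songstress\t23\t27\n"
)}
```

In practice you would pass a file handle instead of a `StringIO`; the point is just that each line is one record and the columns line up with the descriptions above.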
The model-quickstarter project could be your next step if you want to generate the language model from the aggregated entities and test it with DBpedia Spotlight service. I hope this information helps. Have a great day.
My best regards
Hello @JulioNoe ,
Thank you for your response, but I'm not sure that helped. I was asking how the files were related, or whether they had any linkage. I would say that the tokenCounts data cannot be joined with the other 3 raw data files, since tokenCounts uses a wikipedia_id which cannot be found in the other data.
Aside from that, we are already using the Spotlight language model to identify the concepts in text. What I was wondering is how I should go about adding new entities that cannot be found on Wikipedia to our own internal data store.
No problem, I will try to answer your questions more directly:
> I was asking how the files were related or if they had any linkage?
There is a direct relationship between all these files.
uriCounts contains the TOTAL number of times a DBpedia resource was identified in the Wikipedia dump file, e.g. 31 for the resource http://dbpedia.org/resource/The_Songstress used in the examples below.
The pairCounts file defines the relationship between these resources and a surface form (plain text), e.g.:
```
The Songstress                      http://dbpedia.org/resource/The_Songstress  18
You're The Best Thing Yet           http://dbpedia.org/resource/The_Songstress  2
Angel                               http://dbpedia.org/resource/The_Songstress  7
Angel (Anita Baker Song)            http://dbpedia.org/resource/The_Songstress  2
No More Tears (Anita Baker Song)    http://dbpedia.org/resource/The_Songstress  2
```
(The counts sum to 31, matching the total in uriCounts.)
In this example, there are multiple surface forms associated with the same DBpedia resource.
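If the files are consistent, that relationship can be checked mechanically: summing the pairCounts entries for a URI should reproduce its uriCounts total. A quick sketch with the rows above:

```python
from collections import defaultdict

# pairCounts rows from the example above: (surface form, URI, count).
pair_rows = [
    ("The Songstress", "http://dbpedia.org/resource/The_Songstress", 18),
    ("You're The Best Thing Yet", "http://dbpedia.org/resource/The_Songstress", 2),
    ("Angel", "http://dbpedia.org/resource/The_Songstress", 7),
    ("Angel (Anita Baker Song)", "http://dbpedia.org/resource/The_Songstress", 2),
    ("No More Tears (Anita Baker Song)", "http://dbpedia.org/resource/The_Songstress", 2),
]

# Aggregate the per-pair counts by URI.
sums = defaultdict(int)
for _sf, uri, count in pair_rows:
    sums[uri] += count

print(sums["http://dbpedia.org/resource/The_Songstress"])  # 31
```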
sfAndTotalCounts defines the number of times a surface form appears as anchor text and the number of times it appears without an anchor (plain text), e.g.:
```
The Songstress  23  27
```
tokenCounts contains the tokens associated with each article, identified by its source (the Wikipedia URL).
As you can see, the contents of these files are closely related.
> …the tokenCounts data cannot join with the other 3 raw data files since tokenCounts uses a wikipedia_id which cannot be found in the other data.
The surface form (`pairCounts`) could be used to link a DBpedia resource (`uriCounts`) with the wikipedia_id (`tokenCounts`), for example, by using text pattern searches.
> What I was wondering is how I should go about adding new entities that cannot be found on wikipedia to our own internal data store
wikistatsextractor is an NLP analysis over Wikipedia dump files that extracts statistics, associating surface forms with DBpedia resources. The output of this analysis is divided into 4 files. If you want to add a new surface form with its corresponding DBpedia resource URL, you need to update all 4 files to integrate it. The main difference is in the tokenCounts file, where you need to define your URL and associate the corresponding tokens.
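A minimal sketch of what "update all 4 files" could look like, assuming the tab-separated layouts discussed in this thread. Every name and number below is an invented placeholder, and the token serialization in particular should be matched against whatever your generated tokenCounts file actually contains:

```python
from collections import Counter

# Hypothetical new entity that does not exist in DBpedia/Wikipedia.
uri = "http://dbpedia.org/resource/My_Internal_Entity"
surface_form = "my internal entity"
article_text = "my internal entity is an internal concept used by our team"

# Token frequencies for the entity's descriptive text.
tokens = Counter(article_text.split())

# One new row per file. The counts (5, 8) are placeholders; ideally they
# would come from real usage statistics in your own annotated corpus.
uri_counts_row = f"{uri}\t5"
pair_counts_row = f"{surface_form}\t{uri}\t5"
sf_and_total_row = f"{surface_form}\t5\t8"
# Illustrative token serialization only; mirror the exact format your
# tokenCounts file uses before appending.
token_counts_row = uri + "\t" + " ".join(f"({t},{c})" for t, c in tokens.items())
```

Appending these rows to the respective files is the easy part; as noted below, getting counts that behave sensibly next to the Wikipedia-derived statistics is the harder problem.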
The tricky part (maybe) is updating the language-model creation process (here is the link to the code) to recognize the URL you defined in tokenCounts. But only if you want to create your own language model and test it with DBpedia Spotlight.
I hope this information helps you. Have a great day.
My best regards
Yes @JulioNoe, I think the request was how to best add rows to all these files to include custom entities (not included in DBpedia), to build a language model that also spots and annotates these entities.
I think @rileyhun is now wondering how to best “emulate” these numbers with his knowledge graph and text documents as training data.
But I think in order to proceed, @rileyhun, it could help to tell us what training data you actually have. I think you would need to derive some kind of popularity score for the entities in your knowledge graph as a replacement for the URI count. But without documents containing correctly "annotated" (entity-linked) text (Spotlight uses the hyperlinks of Wikipedia for this) for the token and surface-form statistics, I assume the task is almost impossible to achieve, or at least the resulting performance of the model could be really bad. (Probably this would be a task for DBpedia Lookup then.)
The surface forms themselves could also be enriched by using rdfs:label, using a total count of -1, right @JulioNoe, as per https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Raw-data? Or is it better to use 1, and for skos:prefLabel a higher count like 20 as a "booster" instead?
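To make the rdfs:label enrichment concrete, here is a sketch that emits sfAndTotalCounts-style rows from knowledge-graph labels. The URI and label strings are invented placeholders, and whether -1 or a boosted count is the right choice is exactly the open question above:

```python
# Labels pulled from a knowledge graph (e.g. via rdfs:label and
# skos:prefLabel); the URI and labels here are invented placeholders.
labels = {
    "http://example.org/resource/Internal_Entity": {
        "rdfs:label": ["Internal Entity"],
        "skos:prefLabel": ["internal entity"],
    }
}

rows = []
for uri, props in labels.items():
    for sf in props.get("rdfs:label", []):
        # Anchor count 1, total count -1 ("unknown").
        rows.append(f"{sf}\t1\t-1")
    for sf in props.get("skos:prefLabel", []):
        # Boosted anchor count for preferred labels.
        rows.append(f"{sf}\t20\t-1")

print(rows)
```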