Get a list of all [xyz]

Hello everyone,

TL;DR: How would you generate an up-to-date “list of smartphones” or of other entity types?

Motivation

DBpedia is awesome. However, what I feel is particularly missing is a browsing functionality for the average Joe that offers a filterable, sortable, shareable table of instances of a given type. For example, the-site.com/smartphones should print out a list of all smartphones there are – and it has to be fast, responsive and mobile-friendly. Remotely similar to ExConQuer or maybe the table view of SemLens. It is sad to see that almost all of the linked projects are either offline or unmaintained.

I am working on implementing such an application, but there is a lot to do. The end product shall be a free, independent, modern OS&OD product search engine or shopping search engine – the German Wikipedia even has an article on those (Produktsuchmaschine). The fact that we all still have to use proprietary product search engines like Amazon’s integrated one or Google Shopping to compare and filter product data when something like DBpedia exists is unbearable.

This project should fill the gap.

Data extraction problems

“Smartphones” is just an example entity type. But it is a great example, because while there is an ontology class MobilePhone, there is not a single RDF subject linked to it.

There is, however, a lot linked to the resource Smartphone through various properties. In some cases, the information about a resource being a smartphone is lost in a more recent dump: the 2016 DBpedia version of Huawei Honor 8 knows it is a dbp:type dbr:Smartphone, while in the live version this information is missing. The 2016 dump contains about five times more links to dbr:Smartphone in general.

Am I right to assume that, to get an as-complete-as-possible “list of smartphones”, I need to UNION a lot of different properties (hypernym, rdf:type, dbp:type, form etc.) and multiple datasets (live + 2016)? Also, the dataset should always be up to date, so I probably need to integrate downloads.dbpedia.org/live/changesets somehow.
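
For illustration, a minimal sketch of what I mean, run against the public SPARQL endpoint; the set of properties to UNION here is just my guess:

import requests

# Rough sketch: UNION several ways a resource can be marked as a smartphone.
# The property list is only a guess and would need to be extended.
QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX gold: <http://purl.org/linguistics/gold/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>

SELECT DISTINCT ?phone WHERE {
  { ?phone rdf:type dbo:MobilePhone . }
  UNION { ?phone dbp:type dbr:Smartphone . }
  UNION { ?phone dbo:type dbr:Smartphone . }
  UNION { ?phone gold:hypernym dbr:Smartphone . }
} LIMIT 100
"""

response = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
for binding in response.json()["results"]["bindings"]:
    print(binding["phone"]["value"])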

These issues seem to be present for all ontology classes, for example dbr:Cat (#122) vs. dbo:Cat (#0). I hope I am not misinterpreting things here.

Also, am I right to assume that the data contained in Wikipedia list articles (like the ones linked from the list of lists of lists, e.g. this list of smartphones) is not included in the data dumps?

Off topic vision

It would be great if the users of the website described above could add values / products / even categories, or edit invalid values in place, either feeding directly into Wikidata or serving as an intermediate layer for it.

I will apply with this project to the Prototype Fund funding initiative next month, but will develop it either way.

Thank you so much for taking the time. :slight_smile: Please do not hold back with criticism or comments either.

Philip

Some info here and then a short draft on how you can do it:

  • see the discussion for the GlobalFactSync project and read the linked study. It contains preliminary information on whether you can create an ultimate list, one that has clear entries and clear columns, or whether it will stay fuzzy or multi-valued (like population counts). Smartphones are the latter, with their versions and variants and also cross-branding. The processors they use are much easier to list, as would be a list of all libraries and their addresses.
  • debugging is best done either on the dumps with tests (e.g. including Huawei Honor 8 in the minidump via uris-lang.lst and running mvn test) or using this browser: http://global.dbpedia.org/?s=http://dbpedia.org/resource/Huawei_Honor_8
    global.dbpedia.org has all languages plus Wikidata loaded.

Building this might be done with:

  • using parts of DBpedia
  • adding more product datasets onto the Databus. The Databus is built for consumers, with synergies across consumers, so you can load the datasets you need, combine them with other data, and others can do the same (see the download sketch after this list). The reason to do this is that neither Wikipedia nor Wikidata is a detailed source for product data.
  • Databus is designed to create domain-specific DBpedias, like a Productpedia built on other sources
  • check out https://databus.dbpedia.org/sven-h/dbkwik
  • note that Databus can currently auto-deploy SPARQL endpoints and SOLR indexes (not documented yet) via Docker; we expect there to be Docker images for many more tools.
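
As a minimal sketch of the consumption side (the file URL is only a placeholder; take the real download link from an artifact or collection page on https://databus.dbpedia.org):

import bz2
import requests

# Placeholder URL: pick the actual .ttl.bz2 download link from the Databus page.
FILE_URL = "https://downloads.dbpedia.org/repo/.../infobox-properties_lang=en.ttl.bz2"

with requests.get(FILE_URL, stream=True) as resp:
    resp.raise_for_status()
    decompressor = bz2.BZ2Decompressor()
    with open("infobox-properties_lang=en.ttl", "wb") as out:
        # Stream-download and decompress the Turtle file chunk by chunk.
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(decompressor.decompress(chunk))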

If you can’t find good structured sources, you could scrape the HTML tables of https://geizhals.at/samsung-galaxy-s10e-duos-g970f-ds-128gb-schwarz-a1992668.html?hloc=at (rough sketch below).
The extraction framework contains an HTML page extractor used in the NIFExtractor, but there are other tools.
If you build it as a contribution platform, other users interested in the data can help you with the mappings. Much like DBpedia, just for other domains…
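
A rough scraping sketch with BeautifulSoup (the table selectors are guesses; inspect the real page and its robots.txt first):

import requests
from bs4 import BeautifulSoup

URL = ("https://geizhals.at/samsung-galaxy-s10e-duos-g970f-ds-128gb-schwarz-"
       "a1992668.html?hloc=at")

html = requests.get(URL, headers={"User-Agent": "product-search-prototype"}).text
soup = BeautifulSoup(html, "html.parser")

# Collect simple key/value rows from all spec tables on the page.
specs = {}
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) == 2:
            specs[cells[0]] = cells[1]

for key, value in specs.items():
    print(f"{key}: {value}")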

@phil294 we should investigate this, maybe look at smartphones in detail. At the moment the MobilePhone class is only used here http://mappings.dbpedia.org/index.php/Mapping_cs:Infobox_-_mobilní_telefon in the Czech DBpedia. It could become a sync target in [GlobalFactSync](https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE), where these micro-domains are currently being defined and measures devised to test completeness and correctness.

Thanks for your input, it really helped.

The new beta DBpedia deployment workflow (divided into mappings, generics, etc.), the new Collections system, the downloader and loader via dockerized-dbpedia, and the Databus publishing workflow are really neat and work smoothly. Unfortunately, it takes a long time to understand and get used to all those parts… but once you do, it all makes a lot of sense. Good job to those involved :slight_smile:

@kurzum Is there a dataset that includes the infobox values as-is, or is there another way to access them? For example, the dimensions of the Sony NEX-C3 are 1.3 in x 4.3 in x 2.4 in on Wikipedia, but in the latest 2019.08.30 dataset, it is only 1.3. Removing excess units makes sense for dbo: properties, but for the generic infobox properties, a lot of information is lost this way.
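
To reproduce what I mean, a small query sketch against the public endpoint (expecting a value like “1.3” instead of the full dimension string):

import requests

QUERY = """
SELECT ?dimensions WHERE {
  <http://dbpedia.org/resource/Sony_NEX-C3>
      <http://dbpedia.org/property/dimensions> ?dimensions .
}
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
for b in resp.json()["results"]["bindings"]:
    # The units and the other two dimension values are stripped away here.
    print(b["dimensions"]["value"])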

Yes – but correct me if I am wrong, isn’t that illegal? As far as I know, the availability of data on some website does not per se permit you to scrape it and unscrupulously republish it under a free licence.
Further data sources will, however, become a thing at some point, maybe even from manufacturers directly. Everything will be published on the Databus eventually.


For those interested in the current state of this project:

I found that, for a comprehensive result from DBpedia data, the generic dataset needs to be included as well. The result will then contain significantly more information than the ontology tables alone, while keeping all of that valuable data (starting with purchasable products only). There is a lot of mapping to do here, but I will make this contributable. The focus always needs to be on not reinventing the wheel.

Now, finding a “list of smartphones” as asked for in the initial question is really not that simple. I solved this by writing an interactive script that traverses the properties of relevant subjects and asks you whether the respective property is defining for this category, whether it contains relevant data, or whether it should be discarded.
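
A very rough sketch of that interactive loop (the helper names and file format are made up for illustration; the real script pulls the property list from DBpedia):

import json

def triage_properties(category, properties):
    """Ask for each property whether it is defining, relevant or to be discarded."""
    decisions = {}
    for prop in properties:
        answer = input(
            f"[{category}] {prop}: (d)efining / (r)elevant / (x) discard? "
        ).strip().lower()
        decisions[prop] = {"d": "defining", "r": "relevant"}.get(answer, "discard")
    return decisions

if __name__ == "__main__":
    # Hard-coded sample properties; the real script traverses them from the data.
    props = ["dbp:type", "dbo:operatingSystem", "dbp:weight", "dbp:caption"]
    decisions = triage_properties("Smartphone", props)
    with open("smartphone_property_decisions.json", "w") as f:
        json.dump(decisions, f, indent=2)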

Source will be published once everything works fine and a first website version is available.

I think it should be pretty safe. Just do it like Google and honor http://geizhals.at/robots.txt. Normally, scraping facts does not infringe copyright; if you re-use natural-language text longer than two sentences, it is no longer a citation and you need to honor the copyright. You are thinking of database law, where you are only allowed to use 5% for free. A website is not a database, though. You can also check for JSON-LD in the HTML meta data.
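
A quick sketch for the JSON-LD check (whether the page actually embeds schema.org Product data is something to verify):

import json
import requests
from bs4 import BeautifulSoup

def extract_json_ld(url):
    """Return all JSON-LD blocks embedded in <script type="application/ld+json">."""
    html = requests.get(url, headers={"User-Agent": "product-search-prototype"}).text
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            pass  # skip malformed blocks
    return blocks

for block in extract_json_ld(
    "https://geizhals.at/samsung-galaxy-s10e-duos-g970f-ds-128gb-schwarz-a1992668.html?hloc=at"
):
    if isinstance(block, dict):
        print(block.get("@type"), block.get("name"))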

@jfrey do we have this somewhere?

Thanks @phil294. There is also dev.dbpedia.org, where you have links to edit the Markdown docs on GitHub and then send them as a pull request. It is hard for us to write the documentation, since we are only a few people in Leipzig who have been working on the renovation of DBpedia for almost 2 years. So we really depend on consumers to make it more comprehensible and easier to understand.

Well, there is good and bad news.
At the moment, the template-test branch https://github.com/dbpedia/extraction-framework/tree/template-test could potentially help you out. We have a test version of it deployed online. You can extract data on demand for a single article (see e.g. http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?title=Sony+NEX-C3&revid=&format=json&extractors=custom ).
You will get a lot of extraction debugging information as JSON.
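
Something like this fetches the record for one article and prints the raw parser input/output (assuming the response is a list of records shaped like the example below):

import requests

resp = requests.get(
    "http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract",
    params={"title": "Sony NEX-C3", "revid": "", "format": "json",
            "extractors": "custom"},
)
for record in resp.json():
    triple = record.get("triple") or {}
    if triple.get("p", "").endswith("/dimensions"):
        for parser in record["metadata"]["extractor"]["parser"]:
            # wikiText is the raw infobox line, transformed the cleaned-up
            # version, and resultValue what ends up in the triple.
            print("wikiText:   ", parser["wikiText"])
            print("transformed:", parser["transformed"])
            print("resultValue:", parser["resultValue"])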

Let’s have a look at the dimension outcome.

{
  "gfhIri" : "http://prov.dbpedia.org/provenance/triple/5803085db296e2d6dae5996e9c7d9edf0b62a4001374f04afbf4419b91c2f95f",
  "triple" : {
    "s" : "http://dbpedia.org/resource/Sony_NEX-C3",
    "p" : "http://dbpedia.org/property/dimensions",
    "o" : "1.3",
    "l" : null,
    "d" : "http://www.w3.org/2001/XMLSchema#double"
  },
  "timeStamp" : 1580716998062,
  "metadata" : {
    "datasetIri" : "http://prov.dbpedia.org/datasets/infobox_properties",
    "node" : {
      "sourceUri" : "http://en.wikipedia.org/wiki/Sony_NEX-C3?oldid=908848723&ns=0",
      "nodeType" : "http://prov.dbpedia.org/wikinode/PropertyNode?githash=9f2fef95b2e3c4685a1bab505269747bbecb49b9",
      "revision" : 908848723,
      "namespace" : 0,
      "internalId" : -9223372032963309813,
      "language" : "en",
      "absoluteLine" : 41,
      "name" : "dimensions",
      "section" : null
    },
    "extractor" : {
      "uri" : "http://prov.dbpedia.org/extractor/InfoboxExtractor?githash=9f2fef95b2e3c4685a1bab505269747bbecb49b9",
      "parser" : [ {
        "uri" : "http://prov.dbpedia.org/parser/DoubleParser?githash=9f2fef95b2e3c4685a1bab505269747bbecb49b9",
        "wikiText" : " | dimensions       = 1.3&nbsp;in x 4.3&nbsp;in x 2.4&nbsp;in<ref>{{cite web |last=Pogue |first=David |title=Not a Dream -  Small Cameras, High Quality Images |publisher=NYTimes.com |date=3 August 2011 |url=https://www.nytimes.com/2011/08/04/technology/personaltech/not-a-dream-small-cameras-high-quality-images-state-of-the-art.html |access-date=19 June 2014 |archive-url=https://web.archive.org/web/20140619235106/http://www.nytimes.com/2011/08/04/technology/personaltech/not-a-dream-small-cameras-high-quality-images-state-of-the-art.html |archive-date=2014-06-19}}</ref>",
        "transformed" : "dimensions=1.3&nbsp;in x 4.3&nbsp;in x 2.4&nbsp;in\n ",
        "resultValue" : "1.3"
      } ],
      "splits" : 1,
      "property" : "dimensions",
      "templatesEncountered" : [ ],
      "templatesTransformed" : [ ],
      "alternativeValues" : [ ],
      "mappingTemplateUri" : "http://dbpedia.org/resource/Template:Infobox_camera",
      "mappingTemplateName" : "Infobox camera"
    },
    "transformer" : null,
    "replacing" : [ ]
  }
},

What might help you are the wikiText and transformed fields. But you can see the problems that arise when working with these messy strings. Moreover, the branch is at a dead end. The former developer promised to merge it back into master, but that has not happened so far and I don’t believe it ever will. So we also depend on someone from the community to merge it back.
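
For this specific example you could pull the values out of the transformed string, but such ad-hoc parsing does not generalize well:

import re

# The "transformed" value from the extraction output above.
transformed = "dimensions=1.3&nbsp;in x 4.3&nbsp;in x 2.4&nbsp;in\n "

clean = transformed.split("=", 1)[1].replace("&nbsp;", " ")
values = re.findall(r"(\d+(?:\.\d+)?)\s*in", clean)
print(values)  # ['1.3', '4.3', '2.4']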

The challenge is to extract this information in a sustainable way.
The ‘old’ DBpedia approach is to create a mapping (mappings.dbpedia.org) so that this property appears in a reliable way (for ~80% of the cases) as length, width and height using dbo properties. With the DBpedia Databus we now also encourage modular dataset extensions, so it would be possible that somebody creates a dimensions dataset and we include it virtually via the collection feature.