Get a list of all [xyz]

Hello everyone,

TL;DR: How would you generate an up-to-date “list of smartphones” or other entities?

Motivation

DBpedia is awesome. However, what I feel is particularly missing is a browsing functionality for the average Joe that offers a filterable, sortable, shareable table of instances of a given type. For example, the-site.com/smartphones should print out a list of all smartphones there are – and it has to be fast, responsive and mobile-friendly. It would be remotely similar to ExConQuer or maybe the table view of SemLens. It is sad to see that almost all of the linked projects are either offline or unmaintained.

I am working on implementing such an application, but there is a lot to do. The end product shall be a free, independent, modern and OS&OD product search engine or shopping search engine - the German Wikipedia even has an article on those (Produktsuchmaschine). It is unbearable that we all still have to use proprietary product search engines like Amazon’s integrated one or Google Shopping to compare and filter product data when there is something like DBpedia.

This project should fill the gap.

Data extraction problems

“Smartphones” is just an example entity. But it is a great example because while there is an ontology class MobilePhone, there is not a single RDF subject linked to it.

A lot is linked to the resource Smartphone, however, through various properties. In some cases, the information that a resource is a smartphone is lost in a more recent dump: in the 2016 DBpedia dump, Huawei Honor 8 has dbp:type dbr:Smartphone, while in the live version this information is missing. In general, the 2016 dump contains about five times more links to dbr:Smartphone.

Am I right to assume that to get the most complete possible “list of smartphones”, I need to UNION a lot of different properties (hypernym, rdf:type, dbp:type, form etc.) and multiple datasets (live + 2016)? Also, the dataset should always be up to date, so I probably need to integrate downloads.dbpedia.org/live/changesets somehow.
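To make the UNION idea concrete, here is a minimal sketch in Python that only builds the SPARQL query text and merges result sets; it does not contact any endpoint. The prefix IRIs, the mapping of "hypernym" to gold:hypernym and of "form" to dbp:form, and the helper names build_union_query and merge_results are assumptions made for illustration, so please check them against the endpoint you actually use.

```python
# Prefix IRIs as commonly seen on DBpedia endpoints (assumption, verify locally).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dbr": "http://dbpedia.org/resource/",
    "dbp": "http://dbpedia.org/property/",
    "gold": "http://purl.org/linguistics/gold/",
}


def build_union_query(target, properties, limit=None):
    """Return a SPARQL SELECT with one UNION branch per property."""
    properties = list(properties)
    if not properties:
        raise ValueError("at least one property is required")
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()]
    # Each branch matches subjects that point at the target via one property.
    branches = [f"  {{ ?item {prop} {target} }}" for prop in properties]
    lines = prefix_lines + [
        "SELECT DISTINCT ?item WHERE {",
        "\n  UNION\n".join(branches),
        "}",
    ]
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def merge_results(*result_sets):
    """Combine IRIs returned by several datasets (e.g. live and 2016) without duplicates."""
    merged = set()
    for results in result_sets:
        merged.update(results)
    return sorted(merged)


# Candidate properties taken from the question; the prefixed forms are guesses.
CANDIDATE_PROPERTIES = ["rdf:type", "dbp:type", "gold:hypernym", "dbp:form"]
query = build_union_query("dbr:Smartphone", CANDIDATE_PROPERTIES)
print(query)
```

The same query text could then be sent to each dataset's endpoint separately and the returned IRIs passed to merge_results, which keeps the dataset-specific part out of the query itself.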

These issues seem to be present for all ontologies, for example dbr:Cat (#122) vs dbo:Cat (#0). I hope I am not misinterpreting here.

Also, am I right to assume that data contained only in list articles, such as this list of smartphones (reachable via the list of lists of lists), is not included in the data dumps?

Off topic vision

It would be great if the users of the website described above could add values, products or even categories, or edit invalid values in place, either feeding directly into Wikidata or serving as an intermediate layer for it.

I will apply with this project to the Prototype Fund funding initiative next month, but I will develop it either way.


Thank you so much for taking the time. Please do not hold back with criticism or comments.

Philip


Some info here and then a short draft on how you can do it:

  • see a discussion for the GlobalFactSync Project and read the link for the study. It contains preliminary information on whether you can create an ultimate list, one that has clear entries and clear columns, or whether it will stay fuzzy or multi-valued (population count). Smartphones are the latter, with their versions and variants and also cross-branding. The processors they use are much easier to list, as is a list of all libraries and their addresses.
  • debugging is best done either on the dumps with tests (for example, by including Huawei Honor in the minidump in the uris-lang.lst on mvn test) or using this browser: http://global.dbpedia.org/?s=http://dbpedia.org/resource/Huawei_Honor_8
    Global.dbpedia.org has all languages + wikidata loaded.
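The "clear columns versus multi-valued" question can be checked mechanically once you have triples for a candidate list. Below is a minimal sketch in Python that pivots triples into rows and reports which properties carry more than one value for some entity. The ex: identifiers and sample values are made up for illustration and are not taken from DBpedia; pivot and multi_valued_columns are hypothetical helper names.

```python
from collections import defaultdict


def pivot(triples):
    """Group (subject, property, value) triples into subject -> property -> set of values."""
    table = defaultdict(lambda: defaultdict(set))
    for subject, prop, value in triples:
        table[subject][prop].add(value)
    return table


def multi_valued_columns(table):
    """Return the properties that have more than one value for at least one subject."""
    return sorted(
        {
            prop
            for row in table.values()
            for prop, values in row.items()
            if len(values) > 1
        }
    )


# Made-up sample data: PhoneA exists in two storage variants.
SAMPLE_TRIPLES = [
    ("ex:PhoneA", "ex:brand", "BrandX"),
    ("ex:PhoneA", "ex:storage", "64 GB"),
    ("ex:PhoneA", "ex:storage", "128 GB"),
    ("ex:PhoneB", "ex:brand", "BrandY"),
    ("ex:PhoneB", "ex:storage", "128 GB"),
]

fuzzy = multi_valued_columns(pivot(SAMPLE_TRIPLES))
print(fuzzy)
```

Properties reported here are the ones that would need either a multi-valued cell or a split into separate variant rows in the final table.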

Building this might be done with:

  • using parts of DBpedia
  • adding more product datasets onto the Databus. Databus is built for consumers, with synergies across consumers, so you can load the datasets you need and then combine them with other data, and others can do the same. The reason to do this is that neither Wikipedia nor Wikidata is a detailed source for product data.
  • Databus is designed to create domain-specific DBpedias, like a Productpedia based on other sources
  • check out https://databus.dbpedia.org/sven-h/dbkwik
  • note that Databus can currently auto-deploy SPARQL endpoints and SOLR indexes with Docker (not documented); we expect there to be Docker images for many more tools.

If you can’t find good structured sources, you could scrape the HTML tables of https://geizhals.at/samsung-galaxy-s10e-duos-g970f-ds-128gb-schwarz-a1992668.html?hloc=at
The extraction framework contains a HTMLpage extractor used in NIFExtractor, but there are other tools.
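As one example of those "other tools", here is a minimal sketch of table scraping using only Python's standard library html.parser. It runs on an inline sample string, not on the real page: the markup in SAMPLE_HTML is hypothetical and does not reflect the actual structure of geizhals.at, and TableRowParser and extract_rows are names invented for this sketch. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return all table rows found in the given HTML as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Property</th><th>Value</th></tr>
  <tr><td>Storage</td><td>128 GB</td></tr>
  <tr><td>Colour</td><td> black </td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Before scraping any real shop page, it is worth checking the site's terms of use and whether a structured export exists.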
If you build it like a contribution platform, other users interested in the data can help you with mappings. It would really be like DBpedia for other domains.

@phil294 we should investigate this, maybe look at smartphones in detail. At the moment the Mobile Phone class is only used here, in the Czech DBpedia: http://mappings.dbpedia.org/index.php/Mapping_cs:Infobox_-_mobilní_telefon. It could become a sync target in GlobalFactSync (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE). There, these micro-domains are currently being defined and measures devised to test completeness and correctness.