Data about Brazilian cities

Greetings!

I am trying to get information from DBPedia about Brazilian cities.

I’m interested in getting the home page of the local administration (prefeitura), which I can find through either foaf:homepage or dbo:wikiPageExternalLink properties.

I also need to know which Brazilian state the city belongs to.

I’ve come up with the following SPARQL query so far:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX dbp:<http://dbpedia.org/property/>
PREFIX foaf:<http://xmlns.com/foaf/0.1/>
PREFIX yago:<http://dbpedia.org/class/yago/>

SELECT ?city ?name ?state_abbr ?state_name ?link ?external_link WHERE {
    ?city a dbo:City ;
        dbo:country dbr:Brazil .
    OPTIONAL {
        ?city foaf:homepage ?link .
    }
    OPTIONAL {
        ?city dbo:wikiPageExternalLink ?external_link .
        FILTER REGEX(STR(?external_link), ".gov.br")
    }
    OPTIONAL {
        ?city rdfs:label ?name
        FILTER(LANG(?name) = "" || LANGMATCHES(LANG(?name), "pt"))
    }
    OPTIONAL {
        ?city dbo:isPartOf ?state .
        ?state a yago:WikicatStatesOfBrazil .
        ?state dbp:coordinatesRegion ?state_abbr .
    }
    OPTIONAL {
        ?city dbo:isPartOf ?state .
        ?state a yago:WikicatStatesOfBrazil .
        ?state rdfs:label ?state_name .
        FILTER(LANG(?state_name) = "" || LANGMATCHES(LANG(?state_name), "pt"))
    }
    OPTIONAL { # cities linked to a state whose URI has changed
        ?city dbo:isPartOf ?state_old_page .
        ?state_old_page dbo:wikiPageRedirects ?state .
        ?state a yago:WikicatStatesOfBrazil .
        ?state dbp:coordinatesRegion ?state_abbr .
    }
    OPTIONAL { # cities wrongfully linked to a city instead of state
        ?city dbo:isPartOf ?other_city .
        ?other_city dbo:isPartOf ?state .
        ?state a yago:WikicatStatesOfBrazil .
        ?state dbp:coordinatesRegion ?state_abbr .
    }
}

But sometimes a city doesn’t have a state assigned (such as Marataízes) or doesn’t have a link to a homepage. I can do some data cleaning myself, for my own use, but I wonder if and how I could contribute to improving the data on DBPedia.

The code for the queries is on GitHub, in case you have suggestions for improving them.
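
A side note on the .gov.br filter: in a regular expression an unescaped dot matches any character, and the pattern can match anywhere in the URL, including the path. Below is a minimal sketch of a stricter filter. It assumes the links are plain http(s) URLs without a port number, and the LIMIT is only there to keep a test run small.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?city ?external_link WHERE {
    ?city a dbo:City ;
        dbo:country dbr:Brazil ;
        dbo:wikiPageExternalLink ?external_link .
    # "\\." in a SPARQL string is the regex "\.", i.e. a literal dot.
    # The pattern requires the host name itself to end in .gov.br.
    FILTER REGEX(STR(?external_link), "^https?://[^/]*\\.gov\\.br(/|$)", "i")
}
LIMIT 100
```

Whether the stricter match drops links that are actually wanted would need checking against the real data.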

Hi, I invited Diego Moussalem to contribute. Let’s see if he can help.

Best

Sandra

Hi @herrmann,

This kind of problem is directly related to the mappings extraction from Wikipedia, http://mappings.dbpedia.org/index.php/Mapping_pt. We have the Portuguese DBpedia chapter, where we are constantly working on these problems. It is easier for you to contribute there than to release your cleaned dump on the “official DBpedia”, because on our server we can create different mirror endpoints and work on validations before releasing it. However, if you want to upload it to the official DBpedia endpoint, you have to go through the DBpedia Databus, https://databus.dbpedia.org/, and release your new cleaned dump there.

For your query,

Unfortunately, you have to work around the wrong mappings as you are doing, so here are my two cents, to be included as additional OPTIONAL blocks:

OPTIONAL { # 
    ?city <http://dbpedia.org/property/subdivisionType> ?state .
    ?state <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:States_of_Brazil> .
    ?state rdfs:label ?state_name .
    FILTER(LANG(?state_name) = "" || LANGMATCHES(LANG(?state_name), "pt"))
}
OPTIONAL { # 
    ?city dbo:isPartOf ?other_city .
    ?other_city dbo:isPartOf ?state .
    ?state <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:States_of_Brazil> .
    ?state rdfs:label ?state_name .
    FILTER(LANG(?state_name) = "" || LANGMATCHES(LANG(?state_name), "pt"))
}
OPTIONAL { # 
    ?city dbo:isPartOf ?state .
    ?state dbo:type <http://dbpedia.org/resource/States_of_Brazil> .
    ?state rdfs:label ?state_name .
    FILTER(LANG(?state_name) = "" || LANGMATCHES(LANG(?state_name), "pt"))
}
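
The fallbacks above could also be folded into a single OPTIONAL with UNION branches, so that ?state_name is bound in one place. This is only a sketch: it reuses the properties from the blocks above, introduces a dct: prefix for http://purl.org/dc/terms/, applies the category test to every branch (a simplification of the dbo:type variant), and has not been checked against the live endpoint.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT DISTINCT ?city ?state_name WHERE {
    ?city a dbo:City ;
        dbo:country dbr:Brazil .
    OPTIONAL {
        # Candidate routes from a city to something that may be its state
        { ?city dbp:subdivisionType ?state }
        UNION
        { ?city dbo:isPartOf ?state }
        UNION
        { ?city dbo:isPartOf/dbo:isPartOf ?state }
        # Keep only candidates categorized as states of Brazil
        ?state dct:subject <http://dbpedia.org/resource/Category:States_of_Brazil> ;
               rdfs:label ?state_name .
        FILTER(LANG(?state_name) = "" || LANGMATCHES(LANG(?state_name), "pt"))
    }
}
```

DISTINCT is there because a city reachable through more than one branch would otherwise appear several times.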

Let’s get in touch. It would be awesome having you on board. We are actually looking for collaborators.

See an example of our DBpediaPT http://pt.dbpedia.org/page/Marataízes.

Nowadays, we have the Military Institute of Engineering and the Federal University of Rio de Janeiro working with us, and RNP supports the infrastructure. My email is diegomoussallem@gmail.com. You can also easily find me on Skype.

Cheers,

Thanks, Diego. I already have enough information to figure out the state in my data extraction routine, so at the moment I don’t need to include more optionals.

But I do notice that the state information lacks the proper mappings. As you can see in this example, the link to the state lies in the subdivisão_tipo1 property, which, according to the statistics of the template, is not mapped to anything. So I suppose I could contribute by creating a mapping there.

Before I can contribute to the mappings, though, I noticed that the mappings.dbpedia.org server does not support HTTPS, so at the moment it is quite insecure. Nowadays anyone can obtain an SSL/TLS certificate for free by using Let’s Encrypt. I suggest adding one and enabling HTTPS to make the server more secure for people logging in. Who is the sysadmin of that server? Installing the certificate should be easy, but I can help if needed.

Your link gives me a 404 error.

Nice to see that you have a structured project backed by academic institutions. I have already emailed you as you suggested.

@diegomoussallem we can check whether we can make a group here for Portuguese or better Brazil, where you can discuss org things for Brazil or Portuguese.

Hi @herrmann, I will follow your suggestion regarding HTTPS for the Portuguese server. As for who the admin of mappings.dbpedia.org is, the one to answer this is @kurzum.

Btw, you have a nice pipeline and your contribution to these mappings would be really great.

Sorry, the link http://pt.dbpedia.org/page/Marataízes should be http://pt.dbpedia.org/resource/Marataízes.

I will see your email as soon as possible.

Best

@kurzum that would be great, could you then create a Portuguese one? We do not actually have this split between the Brazilian and Portuguese DBpedia communities at the moment (and I hope we never do). We are all Portuguese speakers. A split would make sense only if it were necessary for management matters, but I wonder how large the sub-Portuguese chapters would be: not only Brazilian and Portuguese, but also the other Lusophone communities.

I set up a Group :slight_smile:

@diegomoussallem I am not sure if this split is so ideal. Let’s say there is a call for proposals for Brazilian industrial data and you say that you are the Portuguese DBpedia? What if we establish a Brazilian office, would it run as a Portuguese venue? Data gets fragmented really fast, so by being national you have a better focus and can gain more weight. Or you hold a meeting… Then we can make international chapters based on domains as well, e.g. health/law.

Another point, which I discussed with @herrmann, is that we will try to turn things around now. For example, he converted a Brazilian city dataset from IBGE, which could go into the Brazilian endpoint, and from there we push it into pt.wikipedia.org. The main reason is that pt.wikipedia.org probably copied it from IBGE or other external data in the first place.

We have tried to improve Wikipedia for 12 years now and all approaches basically failed. So by including external sources, we can push them to Wikipedia/Wikidata via the mappings.

done! https://forum.dbpedia.org/g/DBpedia_Portuguese

I guess, for communication here in the forum it is fine to have it per language and also to coordinate. But in the end we need a more national influence as we want to grow the linked data idea in each country.

I think the reason @diegomoussallem argued for keeping the lusophone (Portuguese speaking) community together is because it is so in Wikipedia already, and Wikipedia is one of the main sources of data, right?

If I understand correctly, you’re thinking more in terms of administrative structures, like a corporation that has a local branch, while Diego is talking more about the community of Portuguese-speaking DBPedia users.

IMHO, both points of view make sense but are based on different assumptions.

I’m not really done with converting the data just yet. The main information I’m after is the URLs for local administration websites, as entered by Wikipedia users, probably after using a search engine to find them. My intention at the moment is to fix the DBPedia mappings so that DBPedia includes this information from Wikipedia, which is currently ignored.

I am now trying to fix the mapping of a Wikipedia infobox: Info/Município do Brasil.

From the mappings wiki, we have:

Property Mapping: template property link_brasão, ontology property foaf:homepage
Property Mapping: template property link_bandeira, ontology property foaf:homepage
Property Mapping: template property link_hino, ontology property foaf:homepage

Which is incorrect :warning:, as the foaf:homepage property has nothing to do with the semantics of those properties.

I was about to edit the mapping, when I decided to check out if these triples really do appear in the PT DBPedia. I took as an example the city of Curitiba, as its infobox is quite well filled in. From the PT DBPedia resource page about Curitiba:

dbp:linkBandeira Bandeira de Curitiba @pt
dbp:linkBrasão Brasão de Curitiba @pt
dbp:linkHino s:Hino do município de Curitiba @pt

Which are correct mappings. Furthermore, the template statistics tell me that the site_câmara and site_prefeitura are missing a mapping. But the resource page has everything correct there as well:

These are correct as well. :smiley:

With that, I did not edit the mapping there as I need to understand why this is happening in the first place.

I’m puzzled. :thinking: Is PT DBPedia using an updated, different mapping somewhere and I’m looking at an old version of the infobox mapping?

@diegomoussallem, can you help?

It is easy:

  • everything with dbp is generic. Generic means it uses the template parameter directly as property.
  • the mappings take this as input and you are able to modify it. It is on top of those. Let’s say another info box has linkh then you can merge both on the same dbo or foaf property.
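
To see the generic layer on its own, one could list every generic property attached to a single resource. This is a small sketch, assuming it is run on the PT DBpedia endpoint and that http://pt.dbpedia.org/resource/Curitiba is the resource URI for Curitiba:

```sparql
SELECT DISTINCT ?p WHERE {
    <http://pt.dbpedia.org/resource/Curitiba> ?p ?o .
    # Generic (unmapped) properties live in the /property/ namespace
    FILTER(STRSTARTS(STR(?p), "http://pt.dbpedia.org/property/"))
}
ORDER BY ?p
```

Anything that shows up here comes straight from template parameters, whether or not a mapping exists for it.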

Did you find the web service to test the mapping?

No, this doesn’t seem to be working properly.

What I described in the post above is that the mapping described in the wiki (I forgot to include the link before, but I have since edited my previous post and included it, check it out) seems not to be applied at all. Some mapped properties do not appear and unmapped properties do appear. See the examples above.

Could this be because I’m looking at the mappings from DBPedia and checking out PT DBPedia? Do they use the same or different mappings?

Yes. It just gives an error message.

There is a validate button next to the save page button of http://mappings.dbpedia.org/index.php/Mapping_pt:Info/Município_do_Brasil

must be one of the properties with the Unit:

So one of the mappings is wrong. Maybe this one: http://mappings.dbpedia.org/index.php?title=Mapping_pt%3AInfo%2FMunicípio_do_Brasil&diff=45909&oldid=17718

By the way, here is a recent deployment of the ad-hoc extraction, where you can check specific pages: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/pt/

I thought you meant the “test this mapping” link on the second line of that page. That one gives an exception.

Error

Exception: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 3; The element type "img" must be terminated by the matching end-tag "</img>".
Stacktrace: org.xml.sax.SAXParseException; lineNumber: 22; columnNumber: 3; The element type "img" must be terminated by the matching end-tag "</img>".
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1709)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2900)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at
(…)

Now I’ve noticed the validate button. The error message is not super helpful, as it indicates no line number, leaving to guesswork the job of finding exactly which property has a problem.

Anyway, I think I figured out a way to get the data the way I want it just by using a SPARQL query, so I have no need to edit the mapping. Once I get it done I’ll post it here so other people don’t have to jump through the same hoops.

Of course I could edit the mappings anyway, in order to improve the data on DBPedia, once I understand the discrepancies I reported a few messages back.

Here is my final version of the query (it works fine so far).

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX dbp:<http://dbpedia.org/property/>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX ptdbp:<http://pt.dbpedia.org/property/>
PREFIX foaf:<http://xmlns.com/foaf/0.1/>

SELECT ?city ?name ?state ?link ?link_prefeitura ?link_camara ?external_link WHERE {
    ?city a dbo:City .
    FILTER (
        EXISTS { ?city dbo:wikiPageWikiLink dbr:States_of_Brazil } ||
        EXISTS { ?city ptdbp:wikiPageUsesTemplate <http://pt.dbpedia.org/resource/Predefinição:Info/Município_do_Brasil> }
    )
    OPTIONAL { ?city foaf:homepage ?link }
    OPTIONAL {
        ?city dbo:wikiPageExternalLink ?external_link .
        FILTER REGEX(STR(?external_link), ".gov.br")
    }
    OPTIONAL {?city rdfs:label ?name}
    OPTIONAL {?city dbp:estado ?state}
    OPTIONAL {?city dbo:state/rdfs:label ?state}
    OPTIONAL {?city ptdbp:siteCâmara ?link_camara}
    OPTIONAL {?city ptdbp:sitePrefeitura ?link_prefeitura}
}

See:

I did not need to edit the mappings, after all. The mappings are wrong, but the data is right. I still don’t understand why. Perhaps another (correct) mapping is being used by the Portuguese DBPedia instead? @diegomoussallem? I could edit the mapping and try to improve it, once I understand this.
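
One detail about the two ?state OPTIONALs in the query above: they bind the same variable, so once dbp:estado has produced a value, the dbo:state/rdfs:label block can only match if it yields exactly the same term. A possible alternative, sketched here with the hypothetical helper variables ?state_raw and ?state_label, binds each source separately and picks the first available one with COALESCE:

```sparql
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:   <http://dbpedia.org/ontology/>
PREFIX dbp:   <http://dbpedia.org/property/>
PREFIX ptdbp: <http://pt.dbpedia.org/property/>

SELECT ?city ?state WHERE {
    ?city ptdbp:wikiPageUsesTemplate <http://pt.dbpedia.org/resource/Predefinição:Info/Município_do_Brasil> .
    # Bind each possible source to its own variable
    OPTIONAL { ?city dbp:estado ?state_raw }
    OPTIONAL { ?city dbo:state/rdfs:label ?state_label }
    # Prefer the label; fall back to the raw infobox value.
    # ?state stays unbound when neither source is present.
    BIND(COALESCE(?state_label, ?state_raw) AS ?state)
}
```

Which source should win is a judgment call. The order of the COALESCE arguments is the only thing that would need to change.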

@herrmann yeah, let me explain it to you.

Your query does the following:

  1. The first part selects the ?city resources you are interested in, using the two EXISTS clauses. ?city dbo:wikiPageWikiLink dbr:States_of_Brazil is quite imprecise. ?city a dbo:Settlement might be better.
  2. The second part is where you map and filter the raw data to what you want, i.e.
    REGEX(STR(?external_link), ".gov.br")

If you look at this query you would find more interesting dbp: properties:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX dbp:<http://dbpedia.org/property/>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX ptdbp:<http://pt.dbpedia.org/property/>

SELECT distinct ?p  WHERE {
 
    ?city dbo:wikiPageWikiLink dbr:States_of_Brazil
    FILTER NOT EXISTS { ?city ptdbp:wikiPageUsesTemplate <http://pt.dbpedia.org/resource/Predefinição:Info/Município_do_Brasil> }

    ?city ?p ?o .
}
ORDER by ?p

such as http://dbpedia.org/property/site and http://dbpedia.org/property/siteOficial

With the mappings you can map those a priori to the right ontology property and then the queries get easier. Otherwise, everybody needs to write more complicated queries, i.e. you could add:

OPTIONAL {
  {?city ptdbp:siteCâmara ?link_camara} UNION {?city dbp:site ?link_camara} }

(not sure if correct)

But the idea here is that the mappings help you to query the data more easily and sometimes help with parsing. It is still totally possible to write queries with many optionals, unions and other rules over dbp: and ptdbp:.

Thank you, @kurzum, your suggestions make total sense.

You’re right. By including dbo:Settlement type resources I got a lot more results to my query.

But (and this is important), the purpose of using ?city dbo:wikiPageWikiLink dbr:States_of_Brazil is to restrict the query to include only Brazilian cities, and not cities anywhere else around the world. I could find no better way to filter results. For instance, the dbp:país (country) property is totally unreliable – I couldn’t find a single resource that used this property pointing to Brazil. I welcome other suggestions on how to restrict the query to include Brazil only.

Yes, thank you, dbp:site and dbp:siteOficial did in fact give me more links to work with. On the other hand, other properties that sound useful like dbp:website and dbp:websiteGoverno were so sparsely used as to be almost useless.
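
How sparsely each of those properties is used can be checked with a count per property. This is a sketch, assuming the same dbo:wikiPageWikiLink restriction to Brazil as in the earlier queries and the four dbp: properties just mentioned:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?p (COUNT(DISTINCT ?city) AS ?cities) WHERE {
    # Candidate link properties to compare
    VALUES ?p { dbp:site dbp:siteOficial dbp:website dbp:websiteGoverno }
    ?city dbo:wikiPageWikiLink dbr:States_of_Brazil ;
          ?p ?o .
}
GROUP BY ?p
ORDER BY DESC(?cities)
```

Properties that are never used simply do not appear in the result, since there is nothing to group.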

However, looking at those properties gave me an idea to broaden the search even more, by including this part:

    UNION { ?city dbp:prefeito ?mayor }
    UNION { ?thing dbp:cidade ?city }

With that, here is my current version of the query:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:<http://dbpedia.org/ontology/>
PREFIX dbp:<http://dbpedia.org/property/>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX ptdbp:<http://pt.dbpedia.org/property/>
PREFIX foaf:<http://xmlns.com/foaf/0.1/>

SELECT ?city ?name ?state ?link ?link_prefeitura ?link_camara ?external_link ?link_site ?link_site_oficial WHERE {
    
    # select by classes
    {
        ?city a ?city_type .
        FILTER (?city_type IN (dbo:City, dbo:Settlement))
    }
    
    # select by properties
    UNION { ?city dbp:prefeito ?mayor }
    UNION { ?thing dbp:cidade ?city }
    
    # restrict query to make sure those are cities in Brazil
    FILTER (
        EXISTS { ?city dbo:wikiPageWikiLink dbr:States_of_Brazil } ||
        EXISTS { ?city ptdbp:wikiPageUsesTemplate <http://pt.dbpedia.org/resource/Predefinição:Info/Município_do_Brasil> }
    )
    
    # get the properties likely to contain links
    OPTIONAL { ?city foaf:homepage ?link }
    OPTIONAL {
        ?city dbo:wikiPageExternalLink ?external_link .
        FILTER REGEX(STR(?external_link), ".gov.br")
    }
    OPTIONAL {?city rdfs:label ?name}
    OPTIONAL {?city dbp:estado ?state}
    OPTIONAL {?city dbo:state/rdfs:label ?state}
    OPTIONAL {?city ptdbp:siteCâmara ?link_camara}
    OPTIONAL {?city ptdbp:sitePrefeitura ?link_prefeitura}
    OPTIONAL {?city ptdbp:site ?link_site}
    OPTIONAL {?city ptdbp:siteOficial ?link_site_oficial}
}

Yes, I suppose I could edit the mappings to simplify my query quite a bit. What I still don’t understand about the mappings (and I keep repeating this) is how the mappings don’t match the data. It’s as if the data were produced by a different version of the mapping. Take a look at this:

How is it possible to have an incorrect mapping produce correct data? :thinking: