Hi, I’m working on a project where I use extracted homepages.
While browsing the dataset, I noticed extracted URLs (foaf:homepage) that contain %7C, i.e., the | character.
My suspicion is that these are due to the parsing of the URL template: if an optional display text is used, it seems the | separator plus the part of this text up to the first space is incorrectly appended to the URL.
So as an example: if an infobox property contains {{URL|http://example.com|Example website}}, foaf:homepage and/or dbp:website will contain http://example.com%7CExample.
For the latest homepages_lang=en.ttl.bz2, I counted 15,075 such cases. Some examples: 1, 2, 3
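For anyone who wants to reproduce a count like this, here is a rough sketch in Python. It assumes the dump has one triple per line with full IRIs (N-Triples style); the function names are my own, and this is not necessarily how I arrived at the number above.

```python
import bz2

# Full IRI of foaf:homepage as it would appear in a line-based dump (assumption).
FOAF_HOMEPAGE = "<http://xmlns.com/foaf/0.1/homepage>"


def count_pipe_homepages(lines):
    """Count foaf:homepage triples whose object contains an encoded pipe (%7C)."""
    count = 0
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split(None, 2)  # subject, predicate, rest of the line
        if len(parts) < 3 or parts[1] != FOAF_HOMEPAGE:
            continue
        if "%7c" in parts[2].lower():  # match %7C and %7c alike
            count += 1
    return count


def count_in_dump(path):
    """Stream a .bz2 dump from disk and count affected triples."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return count_pipe_homepages(fh)
```

Calling count_in_dump("homepages_lang=en.ttl.bz2") on a local copy of the file should then give the number of affected foaf:homepage triples.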
As an added complication, this display text is now deprecated, so you could argue this should be fixed in Wikipedia itself, but it’s still used often enough to affect data quality on a moderate scale. Of course, the fix is easy (discard everything starting from %7C), but you have to be aware that the issue is there.
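As a minimal sketch of that workaround on the consumer side (Python; strip_display_text is a name I made up, not part of any DBpedia tooling):

```python
def strip_display_text(url):
    """Discard everything from the first encoded pipe (%7C) onwards."""
    idx = url.lower().find("%7c")  # case-insensitive: %7C or %7c
    return url if idx == -1 else url[:idx]


cleaned = strip_display_text("http://example.com%7CExample")
# cleaned is now "http://example.com"
```

Note that this would also truncate a URL that legitimately contains an encoded pipe, so it is a workaround for the extracted data rather than a proper fix in the extraction itself.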
I must admit that I’m a bit overwhelmed by the extractors/mappings/datasets to find out exactly where/how this could/should be fixed, but I hope that at least the awareness of the issue is already helpful.
(PS: This might also explain the parsing issues when the URL contains an integer.)