Hi, I’m working on a project where I use extracted homepages.
While browsing the dataset, I noticed extracted URLs (foaf:homepage) that contain %7C, i.e., the percent-encoded | character.
My suspicion is that these are due to the parsing of the URL template: if an optional display text is used, it seems the | separator plus the part of this text up to the first space is incorrectly appended to the URL.
So as an example: if an infobox property contains
dbp:website will contain
For the latest homepages_lang=en.ttl.bz2, I counted 15,075 such cases. Some examples: 1, 2, 3
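For anyone who wants to check their own copy of the dump, here is a minimal Python sketch of how such a count could be made. It is not the exact script behind the number above. It assumes the file has one triple per line with subject, predicate, and object all written as IRIs in angle brackets; the names count_affected and count_in_dump are made up for this illustration.

```python
import bz2
import re
from typing import Iterable

# Matches an IRI written in angle brackets, e.g. <http://example.org/>.
IRI_PATTERN = re.compile(r"<([^<>\s]*)>")


def count_affected(lines: Iterable[str], marker: str = "%7C") -> int:
    """Count triples whose object IRI contains the marker (case-insensitive)."""
    count = 0
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        iris = IRI_PATTERN.findall(line)
        # Assumption: subject, predicate, object are the first three IRIs.
        if len(iris) >= 3 and marker.lower() in iris[2].lower():
            count += 1
    return count


def count_in_dump(path: str) -> int:
    """Stream a bz2-compressed dump and count the affected triples."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_affected(handle)
```

Calling count_in_dump with the path to a local copy of the homepages file streams it without decompressing it to disk first.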
As an added complication, this display text is now deprecated, so you could argue this should be fixed in Wikipedia itself, but it’s still used often enough to affect data quality on a moderate scale. Of course, the fix is easy (discard everything starting from %7C), but you have to be aware that the issue is there.
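As a downstream workaround, the "discard everything starting from %7C" step could look roughly like the Python sketch below. The function name strip_display_text is hypothetical, and the sketch assumes that %7C never occurs legitimately in a homepage URL (for example inside a query string), which may not hold in every case.

```python
import re

# Percent-encoded pipe; the hex digit may be upper or lower case.
ENCODED_PIPE = re.compile(r"%7[Cc]")


def strip_display_text(url: str) -> str:
    """Return the URL with everything from the first %7C onward removed."""
    # Assumption: anything after an encoded pipe is leaked display text.
    return ENCODED_PIPE.split(url, maxsplit=1)[0]


# Hypothetical example values, not taken from the dataset.
print(strip_display_text("http://example.org/%7CExample"))
print(strip_display_text("http://example.org/"))
```

URLs without %7C pass through unchanged, so the function can be applied to every extracted homepage.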
I must admit that I’m a bit too overwhelmed by the extractors/mappings/datasets to find out exactly where and how this should be fixed, but I hope that raising awareness of the issue is already helpful.
(PS: This might also explain the parsing issues when the URL contains an integer.)