IRI Encoding Troubles
Recently, I was tracking down a bug in RDF::Trine submitted by Jonas Smedegaard relating to the handling of Unicode in IRIs. The bug involved the decoding of a punycode URI (an inadvertent “feature” that snuck into RDF::Trine due to a feature in the URI module):
http://www.xn--hestebedgrd-58a.dk/
Strangely, this URI was being treated differently than other punycode URIs (e.g. http://xn--df-oiy.ws/), causing encoding errors in a completely different part of the code.
Upon investigating the problem, I determined that the punycode decoding code in the URI package treats punycode URIs differently depending on whether they contain any Unicode codepoints that cannot be represented in the Latin-1 encoding. For IRIs with all low-codepoint characters such as http://www.hestebedgård.dk, the Perl value returned by the URI->as_iri method is a Latin-1 encoded string. High-codepoint IRIs such as http://✪df.ws/ are returned as UTF-8 encoded values. This difference caused problems when the values were subsequently passed into Redland’s C code, which expected UTF-8 values.
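The difference between the two kinds of value can be seen without URI at all. The following is a minimal sketch using only core Perl; the two strings are stand-ins built from code points, not actual as_iri output, and bytes::length is used only to peek at the internal buffer that C code would receive.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling byte semantics

# Stand-ins for the two kinds of value described above
# (assumed for illustration; not produced by URI->as_iri here).
my $low  = "hestebedg\x{E5}rd";  # U+00E5 fits in Latin-1
my $high = "\x{272A}df";         # U+272A does not

for my $s ($low, $high) {
    # length() counts characters; bytes::length() counts the bytes
    # of the internal representation; is_utf8() reports the flag.
    printf "%2d characters, %2d internal bytes, UTF-8 flag %s\n",
        length($s), bytes::length($s),
        utf8::is_utf8($s) ? "on" : "off";
}
```

For the first string the character and byte counts match and the flag is off, so a C library reading the buffer sees the single Latin-1 byte for å rather than a UTF-8 sequence.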
As Gisle Aas, the author of URI, noted in the bug report I submitted on this issue, the simple fix is to call utf8::upgrade on values returned by URI->as_iri. While I believe this should be done in the as_iri method (as it is documented to return “a Unicode string”), until that is the case I hope this post might save some time and trouble for the next person to come across this issue.
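As a rough sketch of that workaround, the helper below wraps utf8::upgrade. The name upgraded_copy is hypothetical, and the input string is a stand-in for a Latin-1 as_iri result so that the example runs with core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of a string whose internal
# representation is UTF-8. utf8::upgrade works in place, so the
# argument is copied first and the caller's value is left alone.
sub upgraded_copy {
    my ($string) = @_;
    utf8::upgrade($string);
    return $string;
}

# Stand-in for a value returned by as_iri for a low-codepoint IRI.
my $iri   = "http://www.hestebedg\x{E5}rd.dk/";
my $fixed = upgraded_copy($iri);

# The characters are unchanged; only the internal encoding differs.
print "same characters\n" if $fixed eq $iri;
print "flag before: ", (utf8::is_utf8($iri)   ? "on" : "off"), "\n";
print "flag after:  ", (utf8::is_utf8($fixed) ? "on" : "off"), "\n";
```

With the URI module installed, the call site would presumably look like `my $iri = upgraded_copy(URI->new($string)->as_iri);` before the value is handed to any C-backed library.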