IRI Encoding Troubles

July 10th, 2013 1:57 PM

Recently, I was tracking down a bug in RDF::Trine submitted by Jonas Smedegaard relating to the handling of Unicode in IRIs. The bug involved the decoding of a punycode URI (an inadvertent “feature” that snuck into RDF::Trine due to a feature in URI):

Strangely, this URI was being treated differently than other punycode URIs (e.g., causing encoding errors in a completely different part of the code).

Upon investigating the problem, I determined that the punycode decoding code in the URI package treats punycode URIs differently depending on whether they contain any Unicode codepoints that cannot be represented in the Latin-1 encoding. For IRIs with all low-codepoint characters such as http://www.hestebedgå, the Perl value returned by the URI->as_iri method is a Latin-1 encoded string. High-codepoint IRIs such as http://✪ are returned as UTF-8 encoded values. This difference caused problems when the values were subsequently passed into Redland’s C code, which expected UTF-8 values.
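The two representations can be seen without the URI module at all. The following is a minimal sketch using only core Perl; the variable names are my own, and the strings are hand-built stand-ins for the kinds of values described above rather than actual as_iri output.

```perl
use strict;
use warnings;

# Stand-in values (assumptions for illustration, not real as_iri output).
my $low  = "hestebedg\x{E5}";   # every code point fits in Latin-1
my $high = "\x{272A}";          # U+272A, outside the Latin-1 range

# utf8::is_utf8 reports whether Perl stores the string internally as UTF-8.
my $low_flag  = utf8::is_utf8($low)  ? 1 : 0;
my $high_flag = utf8::is_utf8($high) ? 1 : 0;

printf "low:  utf8 flag=%d, length=%d\n", $low_flag,  length($low);
printf "high: utf8 flag=%d, length=%d\n", $high_flag, length($high);
```

Both values are perfectly good character strings from Perl's point of view; the difference only matters once the underlying bytes are handed to code outside Perl.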

As Gisle Aas, the author of URI, noted in the bug report I submitted on this issue, the simple fix is to call utf8::upgrade on values returned by URI->as_iri. While I believe this should be done in the as_iri method (as it is documented to return “a Unicode string”), until that is the case I hope this post might save some time and trouble for the next person to come across this issue.
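As a rough sketch of that workaround: the helper below, iri_as_utf8, is a hypothetical name of my own (it is not part of URI or RDF::Trine) and assumes it is handed an object with an as_iri method. The rest of the snippet uses only core Perl to show what utf8::upgrade does to a Latin-1-representable string.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of URI or RDF::Trine: returns the IRI
# form of a URI object, forced into Perl's internal UTF-8 representation.
sub iri_as_utf8 {
    my ($uri) = @_;          # assumed to be an object with an as_iri method
    my $iri = $uri->as_iri;
    utf8::upgrade($iri);     # does nothing if the string is already upgraded
    return $iri;
}

# Core-only illustration of the effect of utf8::upgrade.
my $original = "hestebedg\x{E5}";   # stand-in value, fits in Latin-1
my $upgraded = $original;
utf8::upgrade($upgraded);

my $before = utf8::is_utf8($original) ? 1 : 0;
my $after  = utf8::is_utf8($upgraded) ? 1 : 0;
my $same   = ($original eq $upgraded) ? 1 : 0;

print "flag before=$before, flag after=$after, same characters=$same\n";
```

Because utf8::upgrade changes only the internal storage and not the characters, the upgraded string still compares equal to the original, so applying it to every as_iri result should be safe for the rest of the Perl code.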