sleske February 2016

What is the point of Tomcat's setting URIEncoding?

In Apache Tomcat, the URIEncoding connector attribute tells Tomcat how to interpret incoming URIs:

URIEncoding

This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

Apache Tomcat 7 - The HTTP Connector
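To make the effect of this setting concrete, here is a minimal Java sketch. It uses the standard java.net.URLDecoder rather than Tomcat's own decoder, and the class name UriDecodingDemo and the sample string are made up for illustration. It percent-decodes the same bytes once as UTF-8 and once as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UriDecodingDemo {

    // Percent-decode `raw`, then interpret the resulting bytes in the given charset.
    // Note: URLDecoder also turns '+' into a space (form-style decoding);
    // the sample below contains no '+', so that does not matter here.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // e-acute (U+00E9) is the two bytes C3 A9 in UTF-8.
        String raw = "caf%C3%A9";
        System.out.println(decode(raw, "UTF-8"));
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

Decoded as UTF-8 the two bytes form a single character and the result is "café"; decoded as ISO-8859-1 each byte becomes its own character and the result is "cafÃ©". This is the kind of garbling people see when the client sends UTF-8 but the connector decodes as ISO-8859-1.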

However, as explained for example in "What is the proper way to URL encode Unicode characters?", non-ASCII characters in URIs are always encoded as UTF-8 under current standards (RFC 3986 and RFC 3987).

So:

  • Why is there even a setting for something that is mandated by a standard?
  • Why is the default different from what the standard mandates? (ISO-8859-1 instead of UTF-8)

Is this simply because the Tomcat setting predates the standard, and was retained for backwards compatibility? Or is there some situation where a value different from UTF-8 makes sense?

Answers


Siderite Zackwehdex March 2016

As far as I can see, at least for Tomcat 6 and below, URIEncoding was not just important but necessary: many people ran into problems unless they explicitly set it to 'UTF-8'. As for your question, I can only assume that it is there for backward compatibility. Developers hate to remove code once they have written it, even if the chance of ever needing it again is zero :)


Alanmars March 2016

The description of the URIEncoding parameter in Tomcat 8 (Apache Tomcat 8 - The HTTP Connector):

This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, UTF-8 will be used unless the org.apache.catalina.STRICT_SERVLET_COMPLIANCE system property is set to true in which case ISO-8859-1 will be used.

So the description changed compared to Apache Tomcat 7. As of Apache Tomcat 8, org.apache.catalina.STRICT_SERVLET_COMPLIANCE defaults to false, so UTF-8 is the default value of URIEncoding. This means that Tomcat now follows the standard (and common usage).
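The rule in the Tomcat 8 documentation can be written out as a small Java sketch. This is not Tomcat's actual code: the class DefaultUriEncodingSketch and the helper effectiveUriEncoding are hypothetical names used only to spell out the documented behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultUriEncodingSketch {

    // Hypothetical helper mirroring the documented rule, not Tomcat's real code:
    // an explicit URIEncoding wins; otherwise the default depends on strict compliance.
    static Charset effectiveUriEncoding(String configured, boolean strictServletCompliance) {
        if (configured != null && !configured.isEmpty()) {
            return Charset.forName(configured);
        }
        return strictServletCompliance ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // Boolean.getBoolean returns false when the system property is unset.
        boolean strict = Boolean.getBoolean("org.apache.catalina.STRICT_SERVLET_COMPLIANCE");
        System.out.println(effectiveUriEncoding(null, strict));
    }
}
```

Under this reading, ISO-8859-1 only comes back as the default if the system property is explicitly set to true and no URIEncoding is configured.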


As to why Tomcat used ISO 8859-1 as the default URI encoding until Tomcat 7:

That seems to be because the Tomcat developers believed this to be what the Servlet specification requires (as the name of the setting STRICT_SERVLET_COMPLIANCE indicates).

As a matter of fact, the Servlet spec does not explicitly mention URI encoding in any version. It does, however, mention that POST data must be parsed as ISO 8859-1 if the Content-Type HTTP header does not specify an encoding via charset (Servlet Specification V2.5, "Request data encoding"). Apparently this was interpreted to mean that query parameters (and thus the whole URI) should also be decoded as ISO 8859-1 by default.
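The POST-data rule described above can be illustrated with a simplified Java sketch. The class BodyCharsetSketch and the method bodyCharset are hypothetical, and the header parsing is deliberately naive (for example, it does not handle an empty charset value); a real container does considerably more.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class BodyCharsetSketch {

    // Return the charset named in a Content-Type header value, or ISO-8859-1
    // when the header is missing or carries no charset parameter.
    static Charset bodyCharset(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type; the remaining parts are parameters.
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    return Charset.forName(name);
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(bodyCharset("application/x-www-form-urlencoded"));
        System.out.println(bodyCharset("application/x-www-form-urlencoded; charset=UTF-8"));
    }
}
```

Note that this fallback only concerns the request body. The URI has no Content-Type of its own, which is why a container needs some separate default for it.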

The root problem is arguably that the Servlet Specification does not specify the default encoding to use for decoding URIs, let alone a way to change this encoding. This in turn is probably because the URI spec originally did not allow for non-ASCII characters in URIs - this was only standardized by introducing IRIs, see RFC 3987 from January 2005. Therefore every servlet contai

