FilipLuch February 2016

How to decode unexpected strings from users?

I've published an app, and I find some of the comments to be like this: РекамедÑ

I have googled a lot, but I cannot decode it so that the comment is displayed properly. This is how it is stored in the database; it may be Cyrillic, but I could not decode it as Cyrillic either. Any clue on how to make sense of this kind of comment?


Pekka 웃 February 2016

These appear to be doubly encoded HTML entities. So, for example, & was turned into &amp; and that was then turned again into &amp;amp;
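To illustrate the double encoding, here is a minimal Python sketch that keeps decoding entities until the text stops changing. The helper name unescape_fully and the max_rounds limit are made up for this example; only html.unescape comes from the standard library.

```python
import html

def unescape_fully(text, max_rounds=5):
    """Decode HTML entities repeatedly until the text no longer changes.

    max_rounds is an arbitrary safety limit chosen for this sketch.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text

# "&amp;amp;" is "&" after two rounds of encoding
print(unescape_fully("Tom &amp;amp; Jerry"))
```

Note that decoding until nothing changes can over-decode text in which a user really did type a literal entity, so treat this as a diagnostic aid rather than a fix.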

When decoding the data twice using this online tool (there are many others) the result is


That could be Unicode data, e.g. UTF-8 in a non-western character set like Cyrillic or Arabic, that

  1. was misinterpreted as single-byte input
  2. was garbled by a misguided "sanitation" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function could be the whole source of the problem).
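The two steps above can be reproduced and reversed in a short Python sketch. Everything here is an assumption for illustration: the sample word is hypothetical, ISO-8859-1 is assumed as the wrong single-byte encoding (Windows-1252 is another common culprit), and to_entities only loosely imitates what an entity-encoding function does.

```python
import html
from html.entities import codepoint2name

def to_entities(text):
    """Replace characters that have a named HTML entity (loose imitation)."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        out.append(f"&{name};" if name else ch)
    return "".join(out)

def recover(garbled):
    """Undo two rounds of entity encoding, then the Latin-1 misreading."""
    mojibake = html.unescape(html.unescape(garbled))
    return mojibake.encode("latin-1").decode("utf-8")

original = "Привет"  # hypothetical sample comment

# Step 1: the UTF-8 bytes are misread as single-byte ISO-8859-1
misread = original.encode("utf-8").decode("latin-1")

# Step 2: the misread text is entity-encoded twice
garbled = to_entities(to_entities(misread))

print(garbled)
print(recover(garbled))
```

The Ð and Ñ characters in the sample comment are typical of Cyrillic UTF-8 that was read as a single-byte encoding. Recovery like this only works if no bytes were lost along the way, which is why fixing the encoding handling on the server is the more reliable route.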

The fix will likely need to be on the server side.

If you are using PHP, see UTF-8 all the way through for a handy guide.

