language agnostic - Detect if a string was double-encoded in UTF-8
I need to process a large list of short strings (mostly Russian; other languages are possible, including random garbage from a cat walking on the keyboard).
Some of these strings are encoded in UTF-8 twice.
I need to reliably detect whether a given string is double-encoded, and fix it. This should work without external libraries, just by inspecting the bytes, and detection should be as fast as possible.
The question is: how do I detect whether a given string was encoded in UTF-8 twice?
Update:
The original strings are in UTF-8. Here is the AS3 code that performs the second encoding (unfortunately I don't have control over the client code, so I can't fix it there):
    private function toUTF8(s : String) : String {
        var byteArray : ByteArray = new ByteArray();
        byteArray.writeUTFBytes(s);
        byteArray.position = 0;
        var res : String = "";
        while (byteArray.bytesAvailable) {
            res += String.fromCharCode(byteArray.readUnsignedByte());
        }
        return res;
    }

    myString = toUTF8(("" + myString).toLowerCase().substr(0, 64));
Note the toLowerCase() call. Maybe that can help?
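In Python terms (a sketch, not the actual client code), that routine effectively does this:

    def to_utf8(s: str) -> str:
        # Mirror of the AS3 toUTF8(): reinterpret each UTF-8 byte of the
        # string as a separate code point, producing latin-1-style mojibake.
        return "".join(chr(b) for b in s.encode("utf-8"))

    # When the result is later written out as UTF-8, each of those
    # byte-valued code points gets encoded again - hence the double encoding.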
In principle you can't, since you're allowing cat-garbage.
You don't say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.
Take a non-ASCII character and UTF-8 encode it. You get some bytes, and those bytes are all valid characters in CP1251 unless one of them happens to be 0x98, the only hole in CP1251.
So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of Russian characters. There's no way to tell whether the result came from incorrectly double-encoding one character, or from correctly single-encoding two characters.
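A concrete example, as a Python sketch:

    s = "и"                            # one Russian letter, U+0438
    once = s.encode("utf-8")           # b'\xd0\xb8'
    as_cp1251 = once.decode("cp1251")  # 'Рё' - two ordinary CP1251 characters
    twice = as_cp1251.encode("utf-8")  # b'\xd0\xa0\xd1\x91', the double-encoded form
    # 'Рё' is also a perfectly plausible two-character Russian string,
    # so the bytes in `twice` are ambiguous by construction.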
If you have any control over the original data, you could put a BOM at the start of it. Then when it comes to you, you can inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don't have that kind of control over the original text.
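If you did have that control, the check could look like this (a sketch assuming a CP1251 misinterpretation; the double-encoded BOM is derived rather than hard-coded):

    BOM = b"\xef\xbb\xbf"
    # What the BOM turns into after being misread as CP1251 and re-encoded:
    DOUBLE_BOM = BOM.decode("cp1251").encode("utf-8")  # b'\xd0\xbf\xc2\xbb\xd1\x97'

    def classify(data: bytes) -> str:
        if data.startswith(DOUBLE_BOM):
            return "double-encoded"
        if data.startswith(BOM):
            return "single-encoded"
        return "unknown"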
In practice you can guess - UTF-8 decode the data and then (both options are sketched in code below):
(a) Look at the character frequencies, character-pair frequencies, and the number of non-printable characters. This might allow you to tentatively declare the string nonsense, and hence possibly double-encoded. With enough non-printable characters, it may be so nonsensical that it couldn't realistically have been typed by mashing at the keyboard, unless maybe the Alt key was stuck.
(b) Attempt the second decode. That is, starting from the Unicode code points you got by decoding your UTF-8 data, first encode them to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid byte sequences), the data definitely wasn't double-encoded, at least not with CP1251 as the faulty interpretation.
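A minimal Python sketch of both checks (the 10% non-printable threshold in (a) is an arbitrary assumption to tune against real data):

    import re
    from typing import Optional

    # (a) C0/C1 control characters that nobody types, not even the cat.
    NON_PRINTABLE = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")

    def looks_like_nonsense(s: str) -> bool:
        return len(NON_PRINTABLE.findall(s)) > len(s) // 10

    # (b) Round-trip: re-encode as CP1251, then decode as UTF-8.
    def undo_double_encoding(s: str) -> Optional[str]:
        try:
            return s.encode("cp1251").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return None  # not double-encoded, at least not via CP1251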
This is more or less what you would do if you had bytes that might be UTF-8 or might be CP1251, and you didn't know which.
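That case can be handled the same way; a sketch (CP1251 happily decodes any byte except 0x98, so try the stricter UTF-8 first):

    def guess_encoding(data: bytes) -> str:
        # Try the stricter decoder first; CP1251 accepts almost anything.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "cp1251"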
You'll get some false positives from single-encoded cat-garbage that is indistinguishable from double-encoded data, and maybe a few false negatives from data that was double-encoded but that, by fluke, still looked like Russian after the first encode.
If your original encoding has more holes in it than CP1251, you'll have fewer false negatives.
Character encodings are hard.