language agnostic - Detect if a string was double-encoded in UTF-8 -


I need to process a large list of short strings (mostly Russian, but other languages are possible, including random garbage from a cat walking on the keyboard).

Some of these strings have been encoded in UTF-8 twice.

I need to reliably detect whether a given string is double-encoded, and fix it if so. It should work without external libraries, by inspecting the bytes, and detection should be as fast as possible.

The question is: how do I detect that a given string was encoded in UTF-8 twice?

Update:

The original strings are in UTF-8. Here is the AS3 code that performs the second encoding (unfortunately I don't have control over the client code, so I can't fix this):

private function toUTF8(s : String) : String {
    var byteArray : ByteArray = new ByteArray();
    byteArray.writeUTFBytes(s);
    byteArray.position = 0;

    var res : String = "";

    while (byteArray.bytesAvailable) {
        res += String.fromCharCode(byteArray.readUnsignedByte());
    }

    return res;
}

myString = toUTF8(("" + myString).toLowerCase().substr(0, 64));

Note the toLowerCase() call. Maybe that helps?
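For reference, here is a Python sketch (my own reconstruction, not production code) of what the AS3 function above does: writeUTFBytes produces the UTF-8 bytes of the string, and the readUnsignedByte / String.fromCharCode loop turns each byte back into one code point, which is exactly what decoding those bytes as latin-1 does:

```python
def to_utf8_as3_equivalent(s: str) -> str:
    # Each UTF-8 byte (0..255) becomes one character, mirroring
    # String.fromCharCode(byteArray.readUnsignedByte()) in the AS3 code.
    return s.encode("utf-8").decode("latin-1")

# 'п' is 0xD0 0xBF in UTF-8, so it becomes the two characters 'Ð' '¿'.
mangled = to_utf8_as3_equivalent("привет")
```

When this mangled string is later serialized as UTF-8 on the wire, the original text has effectively been UTF-8 encoded twice.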

In principle you can't, if you're allowing cat-garbage.

You don't say what the original character encoding of the data was, before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.

Take a non-ASCII character. UTF-8 encode it. You get some bytes, and all of those bytes are valid characters in CP1251, unless one of them happens to be 0x98, the only hole in CP1251.

So, if you convert those bytes from CP1251 to UTF-8, the result is the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of Russian characters. There's no way to tell whether the result comes from incorrectly double-encoding one character, or from correctly single-encoding two characters.
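A quick Python demonstration of this ambiguity (with CP1251 as the assumed intermediate encoding): the double encoding of the single character 'п' is byte-for-byte identical to the correct single encoding of the genuine two-character string "Рї":

```python
# 'п' → UTF-8 bytes 0xD0 0xBF → misread as CP1251 they are 'Р' and 'ї'.
double = "п".encode("utf-8").decode("cp1251").encode("utf-8")

# Correct single encoding of the real two-character string "Рї".
single = "Рї".encode("utf-8")

assert double == single  # indistinguishable on the wire
```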

If you had any control over the original data, you could put a BOM at the start of it. Then, when it comes to you, inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. I guess you don't have that kind of control over the original text, though.
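If the BOM trick were available, the check could look like this Python sketch (the function name is mine, and CP1251 is assumed as the intermediate encoding):

```python
BOM = b"\xef\xbb\xbf"                              # UTF-8 BOM
DOUBLE_BOM = BOM.decode("cp1251").encode("utf-8")  # BOM after a second encode

def encoding_state(raw: bytes) -> str:
    """Classify incoming bytes by how the leading BOM survived transit."""
    if raw.startswith(DOUBLE_BOM):
        return "double-encoded"
    if raw.startswith(BOM):
        return "single-encoded"
    return "no BOM, can't tell"
```

The double-encoded BOM comes out as the six bytes D0 BF C2 BB D1 97, which never collide with a genuine UTF-8 BOM prefix.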

In practice you can guess: UTF-8 decode, and then:

(a) Look at the character frequencies, character pair frequencies, and the number of non-printable characters. This might allow you to tentatively declare the string nonsense, and hence possibly double-encoded. With enough non-printable characters, it may be so nonsensical that you couldn't realistically produce it even by mashing the keyboard, unless maybe your Alt key was stuck.
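A crude Python sketch of heuristic (a). The character set and threshold below are my own assumptions (CP1251 intermediate, Russian source text) and would need tuning on real data:

```python
# Characters that exist in CP1251 but are rare in genuine Russian text;
# CP1251 mojibake of double-encoded Cyrillic is full of them (assumption).
SUSPICIOUS = set("їЇєЄђЂћЋњЊљЉџЏѓЃќЌ‚„“”‘’•–—…‹›«»†‡‰™¤¦§©¬®°²µ·¶")

def mojibake_score(s: str) -> float:
    """Fraction of suspicious characters; near 0 for real Russian text."""
    if not s:
        return 0.0
    return sum(1 for ch in s if ch in SUSPICIOUS) / len(s)
```

For example, the once-decoded form of double-encoded "привет" is "РїСЂРёР²РµС‚", which scores far above any plausible real-text baseline.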

(b) Attempt the second decode. That is, starting from the Unicode code points you got by decoding the UTF-8 data, first encode to CP1251 (or whatever), then decode the result from UTF-8. If either step fails (due to an invalid sequence of bytes), then it wasn't double-encoded, at least not using CP1251 as the faulty intermediate interpretation.
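Heuristic (b) maps directly onto a couple of codec calls; a Python sketch (function name mine, CP1251 assumed as the intermediate encoding):

```python
def undo_double_encoding(s: str, intermediate: str = "cp1251") -> str:
    """If s survives encode-to-CP1251 then decode-as-UTF-8, assume it was
    double-encoded and return the repaired text; otherwise return s as-is."""
    try:
        return s.encode(intermediate).decode("utf-8")
    except UnicodeError:  # either step failed: not double-encoded this way
        return s
```

Note that, per the ambiguity above, this will also "repair" genuine text that merely looks like mojibake, so in practice you'd gate it behind something like heuristic (a).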

This is more or less what you would do if you had bytes that might be UTF-8 or might be CP1251, and you didn't know which.

You'll get some false positives from single-encoded cat-garbage that's indistinguishable from double-encoded data, and maybe a very few false negatives where the data was double-encoded but, after the first encode, by fluke it still looked like Russian.

If the original encoding has more holes in it than CP1251 does, you'll have fewer false negatives.

Character encodings are hard.

