language agnostic - Detect if a string was double-encoded in UTF-8
I need to process a large list of short strings (mostly Russian; other languages are possible, including random garbage from a cat walking on the keyboard).
Some of these strings are encoded in UTF-8 twice.
I need to reliably detect whether a given string is double-encoded, and fix it. This should work without external libraries, just by inspecting the bytes, and detection should be as fast as possible.
The question is: how do I detect whether a given string was encoded in UTF-8 twice?
Update:
The original strings are in UTF-8. Here is the AS3 code that performs the second encoding (unfortunately I don't have control over the client code, so I can't fix it there):
    private function toUTF8(s : String) : String {
        var byteArray : ByteArray = new ByteArray();
        byteArray.writeUTFBytes(s);
        byteArray.position = 0;
        var res : String = "";
        while (byteArray.bytesAvailable) {
            res += String.fromCharCode(byteArray.readUnsignedByte());
        }
        return res;
    }

    myString = toUTF8(("" + myString).toLowerCase().substr(0, 64));
Note the toLowerCase() call. Maybe that can help?
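In Python terms (a sketch, not the actual client code), that routine effectively does this:

    def to_utf8(s: str) -> str:
        # Mirror of the AS3 toUTF8(): reinterpret each UTF-8 byte of the
        # string as a separate code point, producing latin-1-style mojibake.
        return "".join(chr(b) for b in s.encode("utf-8"))

    # When the result is later written out as UTF-8, each of those
    # byte-valued code points gets encoded again - hence the double encoding.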
In principle you can't, since you're allowing cat-garbage.
You don't say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.
Take a non-ASCII character and UTF-8 encode it. You get some bytes, and those bytes are all valid characters in CP1251 unless one of them happens to be 0x98, the only hole in CP1251.
So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of Russian characters. There's no way to tell whether the result came from incorrectly double-encoding one character, or from correctly single-encoding two characters.
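A concrete example, as a Python sketch:

    s = "и"                            # one Russian letter, U+0438
    once = s.encode("utf-8")           # b'\xd0\xb8'
    as_cp1251 = once.decode("cp1251")  # 'Рё' - two ordinary CP1251 characters
    twice = as_cp1251.encode("utf-8")  # b'\xd0\xa0\xd1\x91', the double-encoded form
    # 'Рё' is also a perfectly plausible two-character Russian string,
    # so the bytes in `twice` are ambiguous by construction.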
If you have any control over the original data, you could put a BOM at the start of it. Then when it comes to you, you can inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don't have that kind of control over the original text.
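If you did have that control, the check could look like this (a sketch assuming a CP1251 misinterpretation; the double-encoded BOM is derived rather than hard-coded):

    BOM = b"\xef\xbb\xbf"
    # What the BOM turns into after being misread as CP1251 and re-encoded:
    DOUBLE_BOM = BOM.decode("cp1251").encode("utf-8")  # b'\xd0\xbf\xc2\xbb\xd1\x97'

    def classify(data: bytes) -> str:
        if data.startswith(DOUBLE_BOM):
            return "double-encoded"
        if data.startswith(BOM):
            return "single-encoded"
        return "unknown"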
In practice you can guess - UTF-8 decode the data and then (both options are sketched in code below):
(a) Look at the character frequencies, character-pair frequencies, and the number of non-printable characters. This might allow you to tentatively declare the string nonsense, and hence possibly double-encoded. With enough non-printable characters, it may be so nonsensical that it couldn't realistically have been typed by mashing at the keyboard, unless maybe the Alt key was stuck.
(b) Attempt the second decode. That is, starting from the Unicode code points you got by decoding your UTF-8 data, first encode them to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid byte sequences), the data definitely wasn't double-encoded, at least not with CP1251 as the faulty interpretation.
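A minimal Python sketch of both checks (the 10% non-printable threshold in (a) is an arbitrary assumption to tune against real data):

    import re
    from typing import Optional

    # (a) C0/C1 control characters that nobody types, not even the cat.
    NON_PRINTABLE = re.compile(r"[\x00-\x08\x0b-\x1f\x7f-\x9f]")

    def looks_like_nonsense(s: str) -> bool:
        return len(NON_PRINTABLE.findall(s)) > len(s) // 10

    # (b) Round-trip: re-encode as CP1251, then decode as UTF-8.
    def undo_double_encoding(s: str) -> Optional[str]:
        try:
            return s.encode("cp1251").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return None  # not double-encoded, at least not via CP1251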
This is more or less what you would do if you had bytes that might be UTF-8 or might be CP1251, and you didn't know which.
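That case can be handled the same way; a sketch (CP1251 happily decodes any byte except 0x98, so try the stricter UTF-8 first):

    def guess_encoding(data: bytes) -> str:
        # Try the stricter decoder first; CP1251 accepts almost anything.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "cp1251"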
You'll get some false positives from single-encoded cat-garbage that is indistinguishable from double-encoded data, and maybe a few false negatives from data that was double-encoded but that, by fluke, still looked like Russian after the first encode.
If your original encoding has more holes in it than CP1251, you'll have fewer false negatives.
Character encodings are hard.