Wednesday, January 9, 2008

Lost in translation: ASC16-UTF8-1252

As soon as you start working with string data, you need to know which character encoding the string uses. The most common ones are ASCII, ISO-8859-1, Windows-1252 and the two Unicode encodings UTF-8 and UTF-16.

Inside a single program you can usually ignore which character encoding is in use, but as soon as you exchange data with an external source, or have to read data that an external source produced, you need to know the encoding.
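To see what happens when the encoding of external data is guessed wrong, here is a minimal Python sketch. The payload is a made-up example, not data from a real source; it only assumes that the sender used UTF-8 and the receiver assumed Windows-1252.

```python
# Hypothetical payload: the text "café" as a UTF-8 sender would transmit it.
payload = "café".encode("utf-8")

# Decoding with the sender's encoding recovers the original text.
right = payload.decode("utf-8")

# Decoding with a different encoding raises no error here,
# it silently produces different characters.
wrong = payload.decode("cp1252")

print(right)
print(wrong)
```

The wrong decode does not fail, which is why this kind of mistake often goes unnoticed until someone looks at the text.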

Here is the list of the most commonly used encodings:

ASCII (http://www.asciitable.com/) is very old but still very widely used. Every character is stored in one byte, but only the values 0 – 127 are defined, which makes it a 7-bit encoding. The values between 128 and 255 (the eighth bit) are the so-called "Extended ASCII", and which characters they map to is defined by a codepage. This means: if your data contains byte values above 127, you need to make sure that sender and receiver use the same codepage.
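A small Python sketch of both points, using Python's built-in codecs. The sample word and the two codepages (cp1252 and cp437) are only picked for illustration.

```python
# Plain ASCII has no character above 127, so encoding "ï" must fail.
text = "naïve"
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same byte above 127 means different things in different codepages.
high_byte = bytes([0xE9])
as_cp1252 = high_byte.decode("cp1252")  # Windows Western European
as_cp437 = high_byte.decode("cp437")    # original IBM PC codepage

print(ascii_ok, as_cp1252, as_cp437)
```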

ISO-8859-1 Latin-1 (http://en.wikipedia.org/wiki/ISO/IEC_8859-1) is very widely used on the internet since HTTP defines it as the default encoding for the "text/" MIME types. Basically it's a codepage that maps 0 – 127 to the same characters as ASCII and 160 – 255 to additional characters of the Latin alphabet, such as accented letters. The values 128 – 159 are not printable.

Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252) is based on ISO-8859-1 and is still the most used codepage on Windows. It differs in the 0x80 to 0x9F range, which contains non-printable control characters in ISO-8859-1 but mostly printable characters in Windows-1252, such as the euro sign and typographic quotes.
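The difference between the two codepages can be checked with a few bytes. This is a minimal Python sketch; the byte values are chosen for illustration, three from the 0x80 to 0x9F range and one from above it.

```python
# 0x80, 0x93, 0x94 lie in the range where the two codepages differ;
# 0xE9 lies above it, where they agree.
sample = bytes([0x80, 0x93, 0x94, 0xE9])

latin1 = sample.decode("latin-1")  # ISO-8859-1
cp1252 = sample.decode("cp1252")   # Windows-1252

# repr() makes the invisible control characters of the Latin-1 result visible.
print(repr(latin1))
print(repr(cp1252))
```

In the Windows-1252 result the first three bytes become the euro sign and a pair of curly quotes, while in the ISO-8859-1 result they stay control characters. The last byte is "é" in both.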

UTF-8 (http://en.wikipedia.org/wiki/Utf-8) is a Unicode character encoding that again maps 0 – 127 like ASCII but can also encode every Unicode character. UTF-8 is a variable-length encoding: the values 0 – 127 are stored in one byte, exactly as in ASCII, while every other character takes two, three or four bytes. Unlike in "Extended ASCII", the values 128 – 255 therefore need two bytes each. Because of this variable-length encoding and the backward compatibility with ASCII, it is the encoding you should use.
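The variable length is easy to observe. The following Python sketch encodes four sample characters, picked only because each one needs a different number of bytes in UTF-8.

```python
# One sample character per UTF-8 length: ASCII letter, accented letter,
# euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "é", "€", "😀"]

# Map each character to the number of bytes its UTF-8 encoding takes.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")

# An ASCII character encodes to the very same single byte in UTF-8.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```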

UTF-16 (http://en.wikipedia.org/wiki/UTF-16) is also a Unicode character encoding, but it is not byte-compatible with ASCII: even the characters 0 – 127 take two bytes, and every character uses either two or four bytes. For text that is mostly ASCII it therefore requires about twice the space, so it should only be used if something you depend on requires it. Use UTF-8 in any other case.
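A rough size comparison in Python. The helper `encoded_sizes` is a made-up name for this sketch, and it uses the "utf-16-le" codec so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Even a plain ASCII letter takes two bytes in UTF-16.
ascii_letter_utf16 = "A".encode("utf-16-le")

for sample in ["hello", "€", "😀"]:
    utf8_size, utf16_size = encoded_sizes(sample)
    print(f"{sample!r}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

For the ASCII-only sample UTF-16 needs twice the space. For a single character like the euro sign it is one byte smaller, so the size argument holds mainly for text that is mostly ASCII.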
