Internet Explorer and Foreign Languages

Internet Explorer is pretty good at displaying foreign languages. There are several mechanisms for displaying languages that have different character sets to our own.

The two most portable for web browsers are UTF-8 and the 8-bit code pages that extend normal ASCII.

ASCII Character Sets

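If you want to use a different code page, you simply specify the character set. For example, to specify a Russian code page such as windows-1251, use a meta tag like this:

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">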

It is worth noting that you should always specify a character set for your pages. Never assume that a page will be interpreted with the standard character set. If it is just the normal Latin one, set the charset to iso-8859-1.
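To see why the declared character set matters, here is a minimal Python sketch (Python is my choice for illustration, not something Internet Explorer uses). It assumes that windows-1251, known to Python as cp1251, stands in for the Russian code page mentioned above, and it decodes the same bytes under two different character sets.

```python
# The Russian word for "hello", stored in an 8-bit Russian code page.
# Assumption: cp1251 (windows-1251) is the code page in question.
russian_bytes = "Привет".encode("cp1251")

# An 8-bit code page uses exactly one byte per character.
byte_count = len(russian_bytes)  # 6

# Decoded with the right character set, the text comes back intact.
as_russian = russian_bytes.decode("cp1251")

# Decoded as iso-8859-1, the very same bytes turn into accented Latin letters.
as_latin1 = russian_bytes.decode("iso-8859-1")

print(as_russian)  # Привет
print(as_latin1)   # Ïðèâåò
```

The bytes never change; only the interpretation does. That is roughly what a browser is faced with when a page does not say which character set it uses.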

For more information about character sets, a good resource is http://czyborra.com/charsets/codepages.html

Unicode Character Sets

Unicode is a single character set; it defines a unique code point for every character in every language. At least that’s the idea. I understand that they are still defining some of the characters from obscure medieval languages, but most of our current ones have their characters in place already. This means we do not have to flip code pages to display different languages. We can just use the same character set throughout the whole application and it will have Russian characters, Chinese, and even good old Latin characters.
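As a small illustration of the single character set idea, the Python sketch below puts a Latin, a Russian and a Chinese character in one string and prints the Unicode number of each. The sample characters are my own choice.

```python
# One string mixing Latin, Russian and Chinese characters.
text = "AЯ中"

# ord() returns the Unicode number (code point) of a character.
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(code_points)  # ['U+0041', 'U+042F', 'U+4E2D']
```

No code page switching is involved; all three characters live in the same numbering.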

The bit where most people start to get confused is when we talk about how many bytes are used to represent each character. I was confused myself until I had spent quite a lot of time looking at the issue. On Windows we have ‘Wide’ characters, which are Unicode and are stored in 16 bits. This gives rise to the impression that every Unicode character is stored in 16 bits. You then come across UTF-8 being used in web browsers and things start to get murkier. These are simply different representations of the same Unicode character set; UTF stands for Unicode Transformation Format. The representation used by early versions of Windows NT is UCS-2, a 16-bit representation that can only reach the first 65,536 code points of the Unicode character set. Most languages in current use fit inside this range, but Unicode has since defined characters beyond it, which later versions of Windows reach through UTF-16 surrogate pairs.
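The difference between the character set and its representations can be sketched in a few lines of Python. The helper name `encodings` and the sample characters are mine, and "utf-16-le" is used here as a stand-in for the 16-bit wide characters on Windows.

```python
def encodings(ch):
    """Return (16-bit form, UTF-8 form) of one character as hex strings."""
    return ch.encode("utf-16-le").hex(), ch.encode("utf-8").hex()

# Same three code points, two different byte representations each.
for ch in "AЯ中":
    wide, utf8 = encodings(ch)
    print(f"U+{ord(ch):04X}  16-bit: {wide}  UTF-8: {utf8}")

# U+0041  16-bit: 4100  UTF-8: 41
# U+042F  16-bit: 2f04  UTF-8: d0af
# U+4E2D  16-bit: 2d4e  UTF-8: e4b8ad
```

Each of these characters always takes two bytes in the 16-bit form, while UTF-8 takes one, two or three bytes depending on the character. The code point in the first column is the same either way.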

UTF-8 encodes each character of the Unicode character set as a varying number of bytes, from one to four. The one quirk that makes it so loveable is that the 7-bit US-ASCII characters are represented by their 7-bit ASCII values. In other words, your standard English text is encoded no differently in UTF-8 than it currently is in ASCII.
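This property is easy to check for yourself. The short Python sketch below, using sample strings of my own, compares the ASCII and UTF-8 bytes of some English text and then counts the bytes UTF-8 needs for a few other characters.

```python
english = "Hello, world"

# Plain English text produces identical bytes in ASCII and UTF-8.
same_bytes = english.encode("utf-8") == english.encode("ascii")

# Characters outside US-ASCII take more than one byte in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in "Aé中"}

print(same_bytes)    # True
print(utf8_lengths)  # {'A': 1, 'é': 2, '中': 3}
```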

If you want to specify UTF-8 text, use this meta tag:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For a lot more detailed information go to http://www.unicode.org/
