While investigating some localization options, I stumbled across this as a save option in Visual Studio.
What is Unicode code page 1200 exactly?
The Microsoft documentation page Code Page Identifiers describes:
Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
So is Unicode code page 1200 really UTF-16, and does it therefore have a BOM?
Is it advisable to use this for JavaScript then, and if we have to use this, is a charset declaration necessary in the script tag?
Code page 1200 is UTF-16 little endian. The code page itself says nothing about whether a BOM is present.
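A minimal sketch in Python (not something the answer above used) of why byte order and BOM are separate matters. It assumes Python's `utf-16-le` codec corresponds to what code page 1200 describes; whether a given editor writes a BOM when saving is a separate question that this does not answer.

```python
import codecs

text = "hi"

# Explicit little-endian UTF-16: code units only, the codec adds no BOM.
no_bom = text.encode("utf-16-le")

# A writer that wants a BOM has to prepend it itself.
with_bom = codecs.BOM_UTF16_LE + no_bom


def starts_with_utf16le_bom(data: bytes) -> bool:
    """Return True if data begins with the UTF-16 LE byte order mark (FF FE)."""
    return data.startswith(codecs.BOM_UTF16_LE)


print(no_bom.hex(" "))
print(with_bom.hex(" "))
```

Both byte strings are valid little-endian UTF-16; only the second one announces its byte order to a reader.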
For anything web use UTF-8 (everything: css, html, javascript, etc.)
Use UTF-8 for JavaScript, don't bother with UTF-16 or any of its variants (for JavaScript; this advice doesn't apply generally).
According to Microsoft documentation about the Code Page Identifiers, code page 1200 means the following:
Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
Likewise, 1201 is the same, but with big endian byte order:
Unicode UTF-16, big endian byte order; available only to managed applications
Does any significant interchange take place in formats other than ASCII/UTF-8? Are there any fields where the UTF-16 and UTF-32 variants are used heavily? I ask as a writer of multiple libraries that work on Unicode text, and the burden of supporting all five major variants is quite high compared to the perceived utility.
Windows and Java both treat Unicode as UTF-16 internally, and older Python builds (before 3.3) use UTF-16 or UTF-32 depending on the platform. So more than just UTF-8 is important for these. These are just the cases I'm most familiar with; I'm sure there are others.
So, in my opinion, if you have a Unicode library, you should support UTF-16 and UTF-32. (I can't believe UTF-32 is too difficult, since there's no special processing involved besides byte ordering. Although, I'm not a Unicode library author :) )
One important point is XML: it can come in pretty much any encoding imaginable, but UTF-8 is by far the most common.
However, the XML spec says this:
All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode
So if your application/library handles XML in any way it must support UTF-16 at least in that portion. Note that a conforming parser that converts the data to UTF-8 for processing would be enough here.
When it comes to interchange, I guess you are right that UTF-8 is prevalent. Some cases of using UTF-16 are various binary protocols such as DCOM, Java RMI and (maybe???) CORBA.
As for UTF-32 I've never heard of a case where it is used for interchange.
It seems to me that if UTF-8 were the only encoding used everywhere, there would be far fewer issues with code:
Don't even need to think about encoding issues.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
More characters can be represented in UTF-8.
Other things I can't think of right now.
So why haven't the inferior encodings been nuked from space?
Don't even need to think about encoding issues.
True. Except for all the data that's still in the old ASCII format.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Incorrect. UTF-8 is variable length, from 1 to 4 bytes per character (the original design allowed sequences of up to 6 bytes).
Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
Browsers don't generally wait for the full page; they make a guess based on the first part of the page data.
You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
Except for all those other old web pages that use other non-UTF-8 encodings (the non-English speaking world is pretty big).
More characters can be represented in UTF-8.
True. Your problems of data validation just got harder, too.
Why are EBCDIC, Baudot, and Morse still not nuked from orbit? Why did the buggy-whip manufacturers not close their doors the day after Gottlieb Daimler shipped his first automobile?
Relegating a technology to history takes non-zero time.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Not true at all. UTF-8 is a mixed-width 1, 2, 3, and 4-byte encoding. You may have been thinking of UTF-16, but even that has had 4-byte characters for a while. If you want a “simple” fixed-width encoding, you need UTF-32.
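To make the widths concrete, here is a small Python sketch (the helper name `encoded_sizes` is made up for illustration) that counts how many bytes a single character needs in each encoding form. The `-le` codec variants are used so that no BOM is included in the count.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # -le: do not count a BOM
        "utf-32": len(ch.encode("utf-32-le")),
    }


# ASCII letter, accented Latin letter, euro sign, and an emoji outside the BMP.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only the UTF-32 column stays constant; UTF-8 ranges from 1 to 4 bytes and UTF-16 jumps from 2 to 4 bytes for the non-BMP character.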
You would never see ? and other random symbols on old web pages
Even with UTF-8 web pages, you still might not have a font that supports every Unicode character, so this is still a problem.
More characters can be represented in UTF-8.
Sometimes this is a disadvantage. Having more characters means more bits are required to encode the characters. And to keep track of which ones are letters, digits, etc. And to store the fonts for displaying those characters. And to deal with additional Unicode-related complexities like normalization.
This is probably a non-issue for modern computers with gigabytes of RAM, but don't expect your TI-83 to support Unicode any time soon.
But still, if you do need those extra characters, it's way easier to work with UTF-8 than with zillions of different 8-bit character encodings (plus a few non-self-synchronizing East Asian multibyte encodings).
So why haven't the inferior encodings been nuked from space?
In large part, this is because the “inferior” programming languages haven't been nuked from space. Lots of code is still written in languages like C and C++ (and even COBOL!) that predate Unicode and still don't have good support for it.
I badly wish we could get rid of the situation where some libraries use char-based strings encoded in UTF-8, while others think char is for legacy encodings and Unicode should always use wchar_t, and then you have to deal with whether wchar_t is UTF-16 or UTF-32 (or neither).
I don't think UTF-8 uses 2 bytes per character; it's variable length. Also, a lot of OS-level code uses UTF-16 or UTF-32, which means the remaining choice for Latin text is between ASCII and ISO-8859-1.
Well, your question is a bit of a why-is-the-world-so-bad complaint. It is because it is so. The pages written in encodings other than UTF-8 come from the times when UTF-8 was badly supported by operating systems and was not yet the de facto standard.
These pages will stay in their original encoding until someone changes them, which in many cases is not very likely. Many of them are no longer maintained by anyone.
There are also a lot of documents with non-Unicode encodings on the internet, in many formats. Someone COULD convert them, but, as above, that requires a lot of effort.
So, the support for non-unicode must also stay.
And for the present, keep it as a rule that when someone uses a non-Unicode encoding, a kitten dies.
I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no way to reliably detect an encoding. The best thing you could do is something like IE does and rely on letter distributions in different languages, as well as standard characters for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russians, and Greeks too, who all have their own encodings besides UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
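As a rough illustration of "try to decode and catch errors", here is a Python sketch. The helper names `can_decode` and `plausible_encodings` are invented for this example, and the codec names are those of Python's standard library rather than of any particular editor.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def plausible_encodings(data: bytes, candidates: list) -> list:
    """Return the candidate encodings that decode data without errors."""
    return [enc for enc in candidates if can_decode(data, enc)]


sample = "caf\u00e9".encode("cp1252")  # b"caf\xe9"
print(plausible_encodings(sample, ["utf-8", "cp1252", "iso8859_7"]))
```

The sample is rejected as UTF-8 but accepted by both CP-1252 and the Greek ISO-8859-7 codec, which is exactly why a successful decode alone does not prove the guess was right.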
You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If it is not Unicode, it's probably in the current platform's default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
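Steps 1 to 3 above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `guess_encoding` and its `fallback` parameter are made up here, the whole buffer is validated rather than a sample, and the Shift-JIS heuristic of step 4 and the user override of step 5 are left out.

```python
import codecs
import locale

# UTF-32 LE (FF FE 00 00) must be tested before UTF-16 LE (FF FE),
# because the shorter BOM is a prefix of the longer one.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "") -> str:
    """Guess a codec name: BOM first, then strict UTF-8, then a fallback."""
    # Step 1: a BOM decides the matter.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Step 2: no BOM, so see whether the bytes are valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: assume the platform default (or a caller-supplied codepage).
    return fallback or locale.getpreferredencoding(False)
```

The returned `utf-8-sig`, `utf-16` and `utf-32` codecs consume the BOM while decoding, so `data.decode(guess_encoding(data))` does not leave a stray U+FEFF at the start of the text. Pure ASCII input is reported as UTF-8, which is harmless because ASCII is a subset of it.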
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.
Could anyone give me concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from Ascii/Ansi/Windows 1252
I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
ﬁ (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used.
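The three readings listed above can be reproduced with a short Python sketch. The codec names are Python's own, and pairing each character with a particular codec (Latin-1, Mac Roman, ISO-8859-7) is my assumption about which character sets the answer had in mind.

```python
def interpretations(data: bytes, codec_names: list) -> dict:
    """Decode the same bytes under several codecs and return the results."""
    return {name: data.decode(name) for name in codec_names}


raw = b"\xde"
for name, text in interpretations(raw, ["latin-1", "mac_roman", "iso8859_7"]).items():
    # Show the codec, the resulting code point, and the character itself.
    print(f"{name:10} U+{ord(text):04X} {text}")
```

The byte never changes; only the table used to look it up does.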
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yes, I have some insight here. It might be wrong, but it has helped me to understand the topic.
Let's just take some text. It's stored in the computer's RAM as a series of bytes, and the codepage is simply the mapping table between those bytes and the characters you and I read. When something like Notepad comes along with the wrong codepage and translates the bytes to your screen, you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter than others at detecting the correct codepage to use, and some byte streams begin with a BOM (Byte Order Mark), which can indicate the correct Unicode encoding to use.
UTF-7, UTF-8, UTF-16, etc. are all just different codepages using different formats.
The same text stored as bytes using different codepages will have a different file size, because the bytes are stored differently.
They also don't really differ from Windows-1252, as that's just another codepage.
For a better, smarter answer, try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists, is present in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you want every character to occupy the same amount of storage.
By the way, UTF-7 is not a Unicode standard and was designed primarily for use in mail applications.
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of unicode". Unicode is a standard, the standard, to assign code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!". You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with an understanding of encodings you know that the error probably was this: when running myByteArray.toString(), your program assumed the bytes were encoded with the default system encoding. But maybe they weren't! Maybe they were UTF8 and your system is LATIN-SOMETHING, and so you should have run myByteArray.toString("UTF8") instead!
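The scenario can be reproduced in a few lines of Python. This is only a sketch: `latin-1` stands in for the "LATIN-SOMETHING" system default, and the variable names are made up.

```python
original = "\u00e4\u00f6\u00fc"        # the customer's text: a-umlaut, o-umlaut, u-umlaut
wire_bytes = original.encode("utf-8")  # what the sender actually transmitted

wrong = wire_bytes.decode("latin-1")   # receiver silently assumes its own default
right = wire_bytes.decode("utf-8")     # receiver uses the sender's encoding

print(repr(wrong))
print(repr(right))

# Pure ASCII survives the wrong assumption, which is why "Hello" hid the bug.
ascii_ok = "Hello".encode("utf-8").decode("latin-1")
print(ascii_ok)
```

Each two-byte UTF-8 sequence is read as two separate Latin-1 characters, so three letters turn into six characters of mojibake, while the ASCII-only test string comes through unchanged.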
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick to my brain since there are so many unimportant details.
As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.
If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.
What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Going down your list:
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
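To illustrate the surrogate pairs mentioned under UTF-16, here is a small sketch. It is written in Python rather than .NET, purely as an illustration, and the helper names are made up.

```python
def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big endian, no BOM, 2 bytes per unit
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit: int) -> bool:
    """Return True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


for ch in ["A", "\u20ac", "\U0001F600"]:
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X}", [f"{u:04X}" for u in units])
```

A BMP character yields one code unit, while the emoji U+1F600 yields two, both in the surrogate range. Code that counts UTF-16 code units as characters will therefore count it twice.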
There's more on my Unicode page and tips for debugging Unicode problems.
The other big resource, of course, is unicode.org, which contains more information than you'll ever be able to work your way through. Possibly the most useful bit is the code charts.
Some reading to get you started on character encodings: Joel on Software:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way - ASP.NET has nothing to do with it. Encodings are universal.