Why do we need both UCS and Unicode character sets? [closed] - unicode

I guess the codepoints of UCS and Unicode are the same, am I right?
In that case, why do we need two standards (UCS and Unicode)?

They are not two standards. The Universal Character Set (UCS) is not a standard but something defined in a standard, namely ISO 10646. This should not be confused with encodings, such as UCS-2.
It is difficult to guess whether you actually mean different encodings or different standards. But regarding the latter, Unicode and ISO 10646 were originally two distinct standardization efforts with different goals and strategies. They were, however, harmonized in the early 1990s to avoid the mess that two diverging standards would have caused, and they have been coordinated so that the code points are indeed the same.
They were kept distinct, though, partly because Unicode is defined by an industry consortium that can work flexibly and has great interest in standardizing things beyond simple code point assignments. The Unicode Standard defines a large number of principles and processing rules, not just the characters. ISO 10646 is a formal standard that can be referenced in standards and other documents of the ISO and its members.

The codepoints are the same but there are some differences.
From the Wikipedia entry about the differences between Unicode and ISO 10646 (i.e. UCS):
The difference between them is that Unicode adds rules and specifications that are outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic.
You might find it useful to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I think the differences come from the way the code points are encoded. UCS-x uses a fixed number of bytes to encode a code point. For example, UCS-2 uses two bytes, so UCS-2 cannot encode code points that would require more than 2 bytes. The UTF encodings, on the other hand, use a variable number of bytes. For example, UTF-8 uses at least one byte (for ASCII characters) but uses more bytes if the character is outside the ASCII range.
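To make the fixed-width versus variable-width distinction concrete, here is a minimal sketch (Python 3, chosen purely for illustration) that prints how many bytes a few characters need under UTF-8, UTF-16 and UTF-32:

    # Fixed-width vs. variable-width: byte counts per character.
    for ch in ["A", "é", "€", "😀"]:
        print(
            f"U+{ord(ch):04X}",
            "utf-8:", len(ch.encode("utf-8")),
            "utf-16:", len(ch.encode("utf-16-le")),
            "utf-32:", len(ch.encode("utf-32-le")),
        )
    # "😀" (U+1F600) lies outside the Basic Multilingual Plane, so a strict
    # UCS-2 codec could not represent it at all; UTF-16 stores it as a
    # surrogate pair (4 bytes) and UTF-8 as a 4-byte sequence.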

Related

terminology for Unicode characters outside the ASCII range [closed]

Is there an accepted terminology for referring to the Unicode characters that are above the ASCII range (above code point 127 decimal)?
I have seen these called "extended ASCII" and "Unicode characters", neither of which is satisfactory.
("Extended ASCII" is not well-defined, wrongly implies an "extension" to the ASCII standard, and in any event has historically only referred to characters up to 255 decimal, not the entire Unicode range. "Unicode" implies that ASCII characters are NOT Unicode, which is false)
tl;dr
The 144,697 characters in Unicode are organized into dozens of logical groupings known as blocks.
The 128 characters defined in the legacy encoding US-ASCII are known in Unicode as the Basic Latin block. Unicode is a superset of US-ASCII.
So no, there is no special name for the other 144,569 of 144,697 characters. If you mean Thai characters, those are found in the Thai block. If you mean the Cherokee characters, those are found in the Cherokee block. And so on.
Details
Unicode defines 144,697 characters, each assigned a number referred to as a code point. The code points range from zero to 1,114,111 decimal (10FFFF hex), giving 1,114,112 possible values, most of which are reserved or unassigned.
Those characters are grouped logically into a range of code points known as a block. The US-ASCII characters make up the Basic Latin block in Unicode, the first 128 code points, with Unicode being a superset of US-ASCII.
The next 128 code points, U+0080 to U+00FF, are known as the Latin-1 Supplement block.
You will find dozens more blocks listed in Wikipedia. For example, Greek and Coptic, Cyrillic, Arabic, Samaritan, Bengali, Tibetan, Arrows, Braille Patterns, Chess Symbols, and many more. If curious, browse a history of the blocks added to versions of Unicode.
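As a quick illustration (a Python 3 sketch, used here only because its standard library exposes character names), characters from different blocks are all handled the same way, each one being just a numbered, named code point:

    import unicodedata

    # One character each from Basic Latin, Latin-1 Supplement, Thai,
    # Cherokee, and Miscellaneous Symbols.
    for cp in [0x0041, 0x00E9, 0x0E01, 0x13A0, 0x265E]:
        print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
    # U+0041  LATIN CAPITAL LETTER A
    # U+00E9  LATIN SMALL LETTER E WITH ACUTE
    # U+0E01  THAI CHARACTER KO KAI
    # U+13A0  CHEROKEE LETTER A
    # U+265E  BLACK CHESS KNIGHT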
You asked:
Is there an accepted terminology for referring to the Unicode characters that are above the ASCII range (above code point 127 decimal)?
No official term that I know of. Some might say “non-ASCII”. Personally, I would say “beyond US-ASCII”, with the word “beyond” referring to the number range higher than 127 decimal.
You said:
I have seen these called "extended ASCII" and "Unicode characters", neither of which is satisfactory.
The label “extended ASCII” is unofficial, ambiguous, and unhelpful. The term usually refers to the positions 0 to 255 decimal in various pre-Unicode 8-bit character encodings. There are many "extended ASCII" encodings. So I suggest you avoid this term when discussing Unicode. I believe that in 2022 we can consider all of those "extended ASCII" encodings to be legacy.
As for “Unicode characters”, all 144,697 characters defined in Unicode are “Unicode characters” including the 128 characters of US-ASCII. (Again, Unicode is a superset of US-ASCII.) So referring to any subset of those 144,697 characters as “Unicode characters” is silly and unhelpful.
As an American myself, I have to say I note a bias in the Question. It appears to me that many Americans in the information technology industry carry a bias that somehow US-ASCII characters, containing the alphabet of basic American English, are “normal” and all other characters are “foreign” or “weird”. This view misses the very reason that Unicode was invented: To put all scripts around the world on an equal footing, all accounted for in a single set of code point assignments, all documented together in identical fashion by a single authoritative organization, and all implemented with the same technology.
So I suggest adjusting your thinking. Rather than attempting to bifurcate Unicode into ASCII & non-ASCII, learn to think in terms of the dozens of Unicode blocks. When dealing with legacy systems that use only US-ASCII, know that the Basic Latin block of Unicode corresponds. This block is no more or less important than any other block.
Nearly every modern operating system today supports Unicode, thankfully. That support means all of Unicode, never a subset. Regarding subsets, the only limit is fonts. No single font contains glyphs for every one of the 144,697 characters defined in Unicode, so most fonts focus on only a handful of the many blocks.
For those learning about these topics, I highly recommend the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. You may find it to be a surprisingly entertaining read.

Why are there multiple versions of Unicode? Why isn't everything UTF-8? [closed]

Again and again, I keep asking myself: Why do they always insist on over-complicating everything?!
I've tried to read up about and understand Unicode many times over the years. When they start talking about endians and BOMs and all that stuff, my eyes just "zone out". I physically cannot keep reading and retain what I'm seeing. I fundamentally don't get their desire for over-complicating everything.
Why do we need UTF-16 and UTF-32 and "big endian" and "little endian" and BOMs and all this nonsense? Why wasn't Unicode just defined as "compatible with ASCII, but you can also use multiple bytes to represent all these further characters"? That would've been nice and simple, but nooo... let's have all this other stuff, so that Microsoft chose UTF-16 for Windows NT and nothing is easy or straightforward!
As always, there probably is a reason, but I doubt it's good enough to justify all this confusion and all these problems arising from insisting on making it so complex and difficult to grasp.
Unicode started out as a 16-bit character set, so naturally every character was simply encoded as two consecutive bytes. However, it quickly became clear that this would not suffice, so the limit was increased. The problem was that some programming languages and operating systems had already started implementing Unicode as 16-bit and they couldn’t just throw out everything they had already built, so a new encoding was devised that stayed backwards-compatible with these 16-bit implementations while still allowing full Unicode support. This is UTF-16.
UTF-32 represents every character as a sequence of four bytes, which is utterly impractical and virtually never used to actually store text. However, it is very useful when implementing algorithms that operate on individual codepoints – such as the various mechanisms defined by the Unicode standard itself – because all codepoints are always the same length and iterating over them becomes trivial, so you will sometimes find it used internally for buffers and such.
UTF-8 meanwhile is what you actually want to use to store and transmit text. It is compatible with ASCII and self-synchronising (unlike the other two) and it is quite space-efficient (unlike UTF-32). It will also never produce eight binary zeroes in a row (unless you are trying to represent the literal NULL character) so UTF-8 can safely be used in legacy environments where strings are null-terminated.
Endianness is just an intrinsic property of data types whose smallest significant unit is larger than one byte. Computers simply don't always agree on what order to read a sequence of bytes in. For Unicode, this problem can be circumvented by including a Byte Order Mark in the text stream, because if you read its byte representation in the wrong order in UTF-16 or UTF-32, it produces U+FFFE, a code point permanently reserved as a noncharacter, so you know that this particular byte order cannot be the right one.
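Here is a small sketch (Python 3, used only as an illustration) of those byte-order variants, and of why UTF-8 sidesteps the whole issue:

    text = "hi"

    print(text.encode("utf-16-be"))  # b'\x00h\x00i'   big endian, no BOM
    print(text.encode("utf-16-le"))  # b'h\x00i\x00'   little endian, no BOM
    print(text.encode("utf-16"))     # BOM first, then native order, e.g. b'\xff\xfeh\x00i\x00'
    print(text.encode("utf-8"))      # b'hi'           no byte-order question at all

    # A reader that sees 0xFF 0xFE at the start knows the stream is little-endian
    # UTF-16; 0xFE 0xFF would mean big-endian. Note that UTF-8 also never produces
    # a zero byte here, which is what keeps it safe for null-terminated strings.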

what is meant by BOM? [closed]

What is meant by BOM? I tried reading this article but haven't really understood what it means.
I read that some text editors put a BOM before the beginning of a file. What is it meant for?
BOM stands for Byte Order Mark. In short, the BOM is a marker at the beginning of a file indicating whether the most significant byte or the least significant byte should come first.
It causes a lot of problems, especially with UTF-8. UTF-8 does not need a BOM, but there is a variant called UTF8Y (or UTF-8 with BOM) that includes a few extra bytes at the beginning of a file.
Sending a UTF8Y file declared as plain UTF-8 causes a few extra bytes to be sent at the beginning of the file and can cause all sorts of hard-to-track-down problems, including the DOCTYPE not being parsed correctly in IE, or JSON files failing to be decoded.
It has bitten me a few times with files from other people, when I didn't check the filetype carefully.
My recommendation: Be mindful it exists, never purposefully use it.
A byte order mark allows a program to determine how to read Unicode data. From your Wiki page:
Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in.
For UTF-8, there is no ambiguity over how to read the bytes and hence a BOM is often omitted. For UTF-16 and UTF-32 it is necessary to know how to interpret the bytes and a BOM can serve this purpose.
Note that Java has problems with reading UTF-8 BOMs and you must manually handle these characters if present (see Reading UTF-8 - BOM marker for some links to the related Sun bugs).
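For reference, here is a short sketch (Python 3, used only as an illustration) of what the BOM actually looks like in bytes, and one way to tolerate an unwanted UTF-8 BOM when reading:

    import codecs

    print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
    print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
    print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'

    # Contents of a "UTF-8 with BOM" file:
    data = codecs.BOM_UTF8 + "hello".encode("utf-8")
    print(data.decode("utf-8"))      # '\ufeffhello' -- the BOM leaks in as a character
    print(data.decode("utf-8-sig"))  # 'hello'       -- the utf-8-sig codec strips it if present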
I'm probably going to cover stuff you already know, but here goes...
To understand the purpose of a BOM, you need to understand (at least conceptually) what endian-ness is all about.
If you're dealing with a single byte (8 binary bits), its bits are ordered with increasing significance from right to left (just like reading a normal decimal number, like "19"). That's simple enough as long as you can contain the number in a single byte. Once you get to two bytes, you need to know which of the two bytes is more significant, which is where big endian and little endian come in. Big endian means that the lowest memory address (or the left-most position, to continue the analogy to writing) holds the most significant byte, continuing the trend of Western decimal numbers. Historically, Intel has been little endian, and Motorola has been big endian. (I haven't looked lately, that may be different now.)
The BOM is simply a marker saying which way to interpret the byte order of the data.
Today, this is simply meant to say, "This file is in UTF-8". Or, "This file is in UTF-16". While it is still the same BOM character in both cases, the way the BOM is encoded implies how all the rest will be encoded.
If you do not know what the first character is, you cannot deduce the document encoding from it reliably - you have to determine it from somewhere else, or more or less guess it.
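As a hedged sketch of what "the way the BOM is encoded implies how all the rest will be encoded" means in practice, here is a small Python 3 helper that sniffs an encoding from a leading BOM; the function name and the fallback default are inventions for illustration only:

    import codecs

    def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
        """Guess an encoding from a leading BOM, falling back to a default."""
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        # Check UTF-32 before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
        if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"   # the BOM tells the codec which byte order to use
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        return default        # no BOM: you have to know, or guess, the encoding

    print(sniff_encoding("hi".encode("utf-16")))  # utf-16
    print(sniff_encoding(b"plain ascii"))         # utf-8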
Post-downvote appendix:
Historically, the BOM had a different purpose: it doubled as a zero-width no-break space (that is, as invisible as a Unicode character can be, but still a character).
Lots of widely used software libraries such as .NET and Java add the BOM automatically or implicitly to written files or even byte arrays, which often tricks people into thinking that they are not using the BOM when they are. This often backfires when a stack of such libraries writes multiple BOMs at the beginning of the same file, because then your file begins with an illegal or unwanted character, the zero-width no-break space; and you do not even see it when you inspect the file!
No wonder the BOM technique is not popular with everyone.

Dummy's guide to Unicode

Could anyone give me a concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from ASCII/ANSI/Windows 1252
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
ﬁ (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used.
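A quick way to check that kind of ambiguity for yourself (a Python 3 sketch; the three encodings below are one plausible set that maps 0xDE to the characters listed above):

    import unicodedata

    for enc in ["latin-1", "mac-roman", "iso8859-7"]:
        ch = b"\xde".decode(enc)
        print(f"{enc:10} -> {ch}  U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # latin-1    -> Þ  U+00DE  LATIN CAPITAL LETTER THORN
    # mac-roman  -> ﬁ  U+FB01  LATIN SMALL LIGATURE FI
    # iso8859-7  -> ή  U+03AE  GREEK SMALL LETTER ETA WITH TONOS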
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yeah, I have some insight that might be wrong, but it has helped me to understand it.
Let's just take some text. It's stored in the computer's RAM as a series of bytes; the codepage is simply the mapping table between those bytes and the characters you and I read. So something like Notepad comes along with its codepage, translates the bytes to your screen, and you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter at detecting the correct codepage to use than others, and some streams of bytes in memory contain a BOM, which stands for Byte Order Mark, and this can declare the correct encoding to use.
UTF-7, UTF-8, UTF-16, etc. are, loosely speaking, all just different encodings that lay the same characters out as bytes in different formats.
The same file stored as bytes using different codepages will have a different filesize because the bytes are stored differently.
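For instance, here is a rough Python 3 sketch of both points: the same text takes a different number of bytes under different encodings, and reading it back with the wrong one shows garbage without the underlying data being damaged:

    text = "naïve café"

    for enc in ["cp1252", "utf-8", "utf-16"]:
        print(enc, len(text.encode(enc)), "bytes")
    # cp1252 10 bytes, utf-8 12 bytes, utf-16 22 bytes (including its BOM)

    # Decode UTF-8 bytes with the wrong codepage: classic mojibake, but reversible.
    print(text.encode("utf-8").decode("cp1252"))  # naÃ¯ve cafÃ©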
In that loose sense they also don't really differ from Windows-1252, as that's just another mapping of bytes to characters (though the UTF encodings can represent every Unicode character, while Windows-1252 covers only a tiny subset).
For a better, smarter answer, try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists is present in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32 bits) of data for each character.
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you need uniform, fixed-width storage for every character.
By the way, UTF-7 is not part of the Unicode standard; it was designed primarily for use in mail applications.
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of unicode". Unicode is a standard, the standard, to assign code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code your German customer calls: "We have a problem, äöü are not displayed correctly!". You start debugging the code, feeling pretty lost without a basic understanding of encodings. However, with an understanding of encodings you know that the error probably was this: When running myByteArray.toString(), your program assumed the string was encoded with the default system encoding. But maybe it wasn't! Maybe it was UTF8 and your system is LATIN-SOMETHING, and so you should have run myByteArray.toString("UTF8") instead!
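Translated into a runnable Python 3 sketch (the byte array, the sample text and the "default" encoding here are hypothetical stand-ins for whatever your platform actually uses), the bug and the fix look like this:

    # The bytes really are UTF-8, but the code decodes them with a different
    # "default system encoding" (simulated here as cp1252).
    my_byte_array = "Grüße".encode("utf-8")

    print(my_byte_array.decode("cp1252"))  # GrÃ¼ÃŸe  -- roughly what the German customer saw
    print(my_byte_array.decode("utf-8"))   # Grüße    -- decode with the encoding stated explicitly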
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick in my brain, since there are so many unimportant details.
As already mentioned http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go for to grasp the essence of unicode.
If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.

What do I need to know about Unicode? [closed]

Being an application developer, do I need to know Unicode?
Unicode is a standard that defines numeric codes (code points) for the characters used in written communication. Or, as they say it themselves:
The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.
There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.
First, go to the source for authoritative, detailed information and implementation guidelines.
As mentioned by others, Joel Spolsky has a good list of these errors.
I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
Developers should also watch out for canonical representation attacks.
Some of the key concepts you should be aware of are:
Glyphs—concrete graphics used to represent written characters.
Composition—combining glyphs to create another glyph.
Encoding—converting Unicode code points to a stream of bytes (see the short sketch after this list).
Collation—locale-sensitive comparison of Unicode strings.
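Here is a brief Python 3 sketch (illustrative only) of two of those concepts, composition and encoding: "é" can be a single composed code point or a base letter plus a combining accent, normalization converts between the two forms, and encoding turns either form into bytes.

    import unicodedata

    composed = "é"           # U+00E9, a single precomposed code point
    decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

    print(composed == decomposed)                                # False: different code points
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True: same after normalization
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))  # b'\xc3\xa9' b'e\xcc\x81'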
At the risk of just adding another link, unicode.org is a spectacular resource.
In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters - UTF-8, which is more or less the standard these days, works really hard to stay a single byte per character, and is identical to ASCII for the first 128 code points.
(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)
Unicode is an industry-agreed standard for consistently representing text that has the capacity to represent the world's character systems. All developers need to know about it, as globalization is a growing concern.
One (open) source of code for handling Unicode is ICU - Internationalization Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents C interface; uses C++ compiler).
You don't need to learn all of Unicode to use it; it's a hellishly complex standard. You just need to know the main issues and how your programming tools deal with it. To learn that, check Galwegian's link and your programming language and IDE documentation.
E.g.:
You can convert any character from Latin-1 to Unicode, but it doesn't work the other way around for all characters.
PHP lets you know that some functions (like stristr) do not work with Unicode.
Python declares Unicode strings this way: u"Hello World".
That's the kind of thing you must know.
Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.
Unicode is a character set that, unlike ASCII (which contains only letters for English, 128 characters, a quarter of them actually being non-printable control characters), has room for over a million characters, including characters of every language known (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (even lots of dead-language symbols not in use anymore, but useful for archiving ancient documents).
So instead of dealing with dozens of different character encodings, you have one character set for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch the encoding somewhere in the middle of the string). Actually there is still plenty of room left; we are far from having all of those code points in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of code-point space.
Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode is the name of the character set itself; how it is expressed as bytes is a different issue. There are several ways to write Unicode characters, like UTF-8 (one to four bytes represent a single character, depending on the code point: English is almost always one byte, other Roman-script languages might be two or three, Chinese/Japanese might be more), UTF-16 (most characters are two bytes, some rarely used ones are four bytes) and UTF-32, where every character is four bytes. There are others, but these are the dominant ones.
Unicode is the default for many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java uses Unicode as its default string representation, usually UTF-16 internally; I hear Python does as well). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode, the sooner the better.
Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.
Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".
There are various encoding schemes that can represent arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.