The statement below is from this article: http://www.joelonsoftware.com/articles/Unicode.html
"Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode". The author is trying to make a point that Unicode is not just "ASCII with more bytes"(extended ASCII), there is more to Unicode than just it appears. But I am not getting how Unicode is different? To me it appears as extended ASCII.
As Unicode has more numbers, it can map more characters.
Yes, that's it.
The ASCII character set defines a whopping 128 numbers (0 through 127), specifies which characters they represent, and says how they should be serialized as byte sequences: each number is encoded as a single byte, end of story.
Unicode has room for over a million such numbers, and specifies several different ways in which they may be serialized as byte sequences.
In addition, Unicode does quite a lot more than that. For example, it doesn't just map integers to characters; it also describes various properties and metadata for each character (the mapping from characters to glyphs, the graphical symbols in a font, is left to the fonts themselves). But the main thing is that Unicode defines a much bigger code space and separates the integer/character mapping from the encoding, so the same integer can be encoded as different byte sequences depending on whether you encode as UTF-8, UTF-16 or UTF-32.
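To make that last point concrete, here is a minimal Python sketch (the euro sign is just an example character, and bytes.hex with a separator needs Python 3.8+) showing one code point serialized three different ways:

    # One abstract code point, three different byte serializations.
    ch = "€"                                # U+20AC EURO SIGN
    print(hex(ord(ch)))                     # 0x20ac - the integer Unicode assigns
    print(ch.encode("utf-8").hex(" "))      # e2 82 ac       (3 bytes)
    print(ch.encode("utf-16-be").hex(" "))  # 20 ac          (2 bytes)
    print(ch.encode("utf-32-be").hex(" "))  # 00 00 20 ac    (4 bytes)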
Unicode assigns a unique integer to characters (and character modifiers). There are many encodings, but common ones are UTF-16 and UTF-8, which are both variable-width encodings.
ASCII is a fixed-width, 1-byte encoding of just a 128-character subset.
Is there an accepted terminology for referring to the Unicode characters that are above the ASCII range (above code point 127 decimal)?
I have seen these called "extended ASCII" and "Unicode characters", neither of which is satisfactory.
("Extended ASCII" is not well-defined, wrongly implies an "extension" to the ASCII standard, and in any event has historically only referred to characters up to 255 decimal, not the entire Unicode range. "Unicode" implies that ASCII characters are NOT Unicode, which is false)
tl;dr
The 144,697 characters in Unicode are organized into dozens of logical groupings known as blocks.
The 128 characters defined in the legacy encoding US-ASCII are known in Unicode as the Basic Latin block. Unicode is a superset of US-ASCII.
So no, there is no special name for the other 144,569 of 144,697 characters. If you mean Thai characters, those are found in the Thai block. If you mean the Cherokee characters, those are found in the Cherokee block. And so on.
Details
Unicode defines 144,697 characters, each assigned a number referred to as a code point. The code point numbers range from zero to 1,114,111 decimal (10FFFF hex), a space of 1,114,112 code points in total, most of them reserved or unassigned.
Those characters are grouped logically into ranges of code points known as blocks. The US-ASCII characters make up the Basic Latin block in Unicode, the first 128 code points, Unicode being a superset of US-ASCII.
The next 128 code points, U+0080 to U+00FF, are known as the Latin-1 Supplement block.
You will find dozens more blocks listed in Wikipedia. For example, Greek and Coptic, Cyrillic, Arabic, Samaritan, Bengali, Tibetan, Arrows, Braille Patterns, Chess Symbols, and many more. If curious, browse a history of the blocks added to versions of Unicode.
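If you want to poke at code points yourself, Python's standard unicodedata module is a handy way to explore; a quick sketch (it reports official character names, though not block names directly):

    import unicodedata

    # Print the code point and official Unicode name of characters
    # drawn from several different blocks (the sample string is arbitrary).
    for ch in "Aéπкก←":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0041  LATIN CAPITAL LETTER A            (Basic Latin, i.e. US-ASCII)
    # U+00E9  LATIN SMALL LETTER E WITH ACUTE   (Latin-1 Supplement)
    # U+03C0  GREEK SMALL LETTER PI             (Greek and Coptic)
    # ...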
You asked:
Is there an accepted terminology for referring to the Unicode characters that are above the ASCII range (above code point 127 decimal)?
No official term that I know of. Some might say “non-ASCII”. Personally, I would say “beyond US-ASCII”, with the word “beyond” referring to the number range higher than 127 decimal.
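If you need to detect such characters in code rather than in conversation, a tiny hypothetical helper does the job; in Python 3.7+ there is also the built-in str.isascii():

    def beyond_us_ascii(text):
        """Return the characters whose code points are above 127 decimal."""
        return [ch for ch in text if ord(ch) > 127]

    print(beyond_us_ascii("naïve café"))   # ['ï', 'é']
    print("plain old text".isascii())      # True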
You said:
I have seen these called "extended ASCII" and "Unicode characters", neither of which is satisfactory.
The label “extended ASCII” is unofficial, ambiguous, and unhelpful. The term usually refers to the positions 0 to 255 decimal in various pre-Unicode 8-bit character encodings. There are many "extended ASCII" encodings. So I suggest you avoid this term when discussing Unicode. I believe that in 2022 we can consider all of those "extended ASCII" encodings to be legacy.
As for “Unicode characters”, all 144,697 characters defined in Unicode are “Unicode characters” including the 128 characters of US-ASCII. (Again, Unicode is a superset of US-ASCII.) So referring to any subset of those 144,697 characters as “Unicode characters” is silly and unhelpful.
As an American myself, I have to say I note a bias in the Question. It appears to me that many Americans in the information technology industry carry a bias that somehow US-ASCII characters, containing the alphabet of basic American English, are “normal” and all other characters are “foreign” or “weird”. This view misses the very reason that Unicode was invented: To put all scripts around the world on an equal footing, all accounted for in a single set of code point assignments, all documented together in identical fashion by a single authoritative organization, and all implemented with the same technology.
So I suggest adjusting your thinking. Rather than attempting to bifurcate Unicode into ASCII & non-ASCII, learn to think in terms of the dozens of Unicode blocks. When dealing with legacy systems that use only US-ASCII, know that the Basic Latin block of Unicode corresponds. This block is no more or less important than any other block.
Nearly every modern operating system today supports Unicode, thankfully. That support means all of Unicode, never a subset. Regarding subsets, the only limit is fonts: no single font contains glyphs for every one of the 144,697 characters defined in Unicode, so most fonts cover only a few or several of the many blocks.
For those learning about these topics, I highly recommend the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. You may find it to be a surprisingly entertaining read.
I'm trying to encode some log files from US-ASCII to UTF-8
I am using iconv for that.
When I convert to UTF-8, there is an extra space at the very start of the first line, just one character.
I tried removing it with sed, but then the file is detected as US-ASCII again on the server, and when I convert it back to UTF-8 the space reappears.
I hope I was able to explain my problem.
I think you have a misconception about what character encodings are, and the relationship between ASCII and UTF-8. When we store text in a computer, we have to convert it into a binary sequence according to some code - we could choose something like "0001 means A, 0010 means B" and so on. To agree which code we're using, we give them names, like "ASCII" and "UTF-8".
If you look at a binary string, you can't tell what code its author was using; the best you can do is guess, by trying different codes and seeing which ones make sense. But some strings of bits will make sense in multiple codes - and, crucially for this question, they might mean exactly the same thing in multiple codes. For instance, two codes might both say that 0001 is an A, but one code says that 1110 means "?" and the other that it means "!". If all you have is a long line of A's, it will be written exactly the same way no matter which code you use.
In the case of UTF-8 and ASCII, this isn't coincidence; UTF-8 is deliberately designed so that anything written using ASCII will have exactly the same representation when written in UTF-8. The definition of UTF-8 basically begins with "if you can represent it using ASCII, do that; if you can't, follow these extra rules".
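A quick illustration of that compatibility, assuming nothing more than an ordinary Python interpreter:

    s = "plain ASCII text"
    print(s.encode("ascii") == s.encode("utf-8"))   # True: byte-for-byte identical
    print("café".encode("utf-8"))                   # b'caf\xc3\xa9' - only the
                                                    # non-ASCII character needs
                                                    # extra bytes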
To use a different analogy, imagine the customer asked you to make sure the text was in British English, not American English - "colour" rather than "color", and so on. If the text you want to send is "It is raining today", there is nothing you need to change - the same exact string is both American English and British English at the same time.
If your text includes characters which can't be represented in ASCII, then it is not in ASCII. In that case, you need to know what encoding it is actually in - there are many encodings which, like UTF-8, are designed to be compatible with ASCII, so the majority of characters will be the same no matter which one you try. ISO 8859-1 and its cousin Windows-1252 are very common; ISO 8859-15 possibly more so in Europe; others in other parts of the world where they're useful for writing the local language.
The extra "space" you're seeing at the start of the file is probably the so-called "Byte Order Mark", a Unicode character deliberately defined as meaning nothing, but having a different representation in different encodings. It's there to give a hint to programs which want to guess the encoding used in a piece of text, but it is not mandatory, and the string was already valid UTF-8 before it was added.
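If that turns out to be the culprit, one way to get rid of it is to strip the three UTF-8 BOM bytes yourself. A rough sketch, not a drop-in fix (the filename is a placeholder):

    # Remove a leading UTF-8 BOM (EF BB BF) if present; "app.log" is made up.
    with open("app.log", "rb") as f:
        data = f.read()
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    with open("app.log", "wb") as f:
        f.write(data)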
Again and again, I keep asking myself: Why do they always insist on over-complicating everything?!
I've tried to read up about and understand Unicode many times over the years. When they start talking about endians and BOMs and all that stuff, my eyes just "zone out". I physically cannot keep reading and retain what I'm seeing. I fundamentally don't get their desire for over-complicating everything.
Why do we need UTF-16 and UTF-32 and "big endian" and "little endian" and BOMs and all this nonsense? Why wasn't Unicode just defined as "compatible with ASCII, but you can also use multiple bytes to represent all these further characters"? That would have been nice and simple, but nooo... let's have all this other stuff, so that Microsoft chose UTF-16 for Windows NT and nothing is easy or straightforward!
As always, there probably is a reason, but I doubt it's good enough to justify all this confusion and all these problems arising from insisting on making it so complex and difficult to grasp.
Unicode started out as a 16-bit character set, so naturally every character was simply encoded as two consecutive bytes. However, it quickly became clear that this would not suffice, so the limit was increased. The problem was that some programming languages and operating systems had already started implementing Unicode as 16-bit and they couldn’t just throw out everything they had already built, so a new encoding was devised that stayed backwards-compatible with these 16-bit implementations while still allowing full Unicode support. This is UTF-16.
UTF-32 represents every character as a sequence of four bytes, which is utterly impractical and virtually never used to actually store text. However, it is very useful when implementing algorithms that operate on individual codepoints – such as the various mechanisms defined by the Unicode standard itself – because all codepoints are always the same length and iterating over them becomes trivial, so you will sometimes find it used internally for buffers and such.
UTF-8 meanwhile is what you actually want to use to store and transmit text. It is compatible with ASCII and self-synchronising (unlike the other two) and it is quite space-efficient (unlike UTF-32). It will also never produce eight binary zeroes in a row (unless you are trying to represent the literal NULL character) so UTF-8 can safely be used in legacy environments where strings are null-terminated.
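The no-embedded-zero-bytes claim is easy to sanity-check; a small sketch with an arbitrary sample string:

    sample = "Grüße, 世界!"
    encoded = sample.encode("utf-8")
    print(encoded.hex(" "))
    print(0 in encoded)                        # False: no zero bytes, so safe for
                                               # NUL-terminated C-style strings
    print(encoded.decode("utf-8") == sample)   # True: round-trips losslessly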
Endianness is just an intrinsic property of data types whose code units are larger than one byte: computers simply don’t always agree in what order to read a sequence of bytes. For Unicode, this problem can be circumvented by including a Byte Order Mark in the text stream, because if you read its byte representation in the wrong order in UTF-16 or UTF-32, it produces a noncharacter that has no reason to ever occur, so you know that this particular order cannot be the right one.
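As a concrete illustration of how the BOM settles the question, here is a short sketch:

    bom = "\ufeff"                         # U+FEFF, the Byte Order Mark
    print(bom.encode("utf-16-le").hex())   # fffe -> reader sees FF FE: little endian
    print(bom.encode("utf-16-be").hex())   # feff -> reader sees FE FF: big endian
    # Read in the wrong order, those bytes decode to U+FFFE, a noncharacter that
    # should never appear in real text, so a decoder knows that order is wrong.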
What does it mean when I save a text file as "Unicode" in Notepad? Is it UTF-8, UTF-16 or UTF-32? Thanks in advance.
In Notepad, as in Windows software in general, “Unicode” as an encoding name means UTF-16 Little Endian (UTF-16LE). (I first thought it was not real UTF-16, because Notepad++ recognizes it as UCS-2 and shows the content as garbage, but after re-checking with BabelPad, I concluded that Notepad can encode even non-BMP characters correctly.)
Similarly, “Unicode big endian” means UTF-16 Big Endian. And “ANSI” means the system’s native legacy encoding, e.g. the 8-bit windows-1252 encoding in Western versions of Windows.
All of these formats are "Unicode". But many editors simply mean UTF-8 by that, because it is ASCII-compatible below code point 128. UTF-8 can represent far more than 256 code points (the most that fit in a single 8-bit byte) by using multi-byte sequences: the lead byte signals that one or more following bytes belong to the same character.
If you look at the output in a terminal, say with vi, and you see what looks like a space (really a zero byte) between every two characters, you are looking at UTF-16, because there every character of ASCII text takes two bytes. If the characters have no gaps between them, that is an indication of UTF-8.
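If you would rather check programmatically than eyeball it in a terminal, a rough BOM-sniffing sketch along these lines works (the function name and the fallback are my own invention; the mapping to Notepad's menu labels follows the answer above):

    import codecs

    def guess_notepad_encoding(path):
        """Guess the encoding of a Notepad-saved file from its byte order mark."""
        with open(path, "rb") as f:
            head = f.read(3)
        if head.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"    # Notepad's "UTF-8" (older versions write a BOM)
        if head.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le"    # Notepad's "Unicode"
        if head.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be"    # Notepad's "Unicode big endian"
        return "cp1252"           # no BOM: likely "ANSI" on Western Windows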
The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): 0x0 to 0x10FFFF, minus the 2,048 surrogate values.
Does the Unicode Consortium intend to make UTF-16 run out of characters?
i.e. make a code point > 0x10FFFF
If not, why would anyone write a UTF-8 parser that can accept 5- or 6-byte sequences, since that would only add unnecessary instructions to the function?
Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?
As of 2011 we have consumed 109,449 characters and set aside another 137,468 code points for application use (6,400 + 131,068),
leaving room for over 860,000 unused code points: plenty for CJK Extension E (~10,000 chars) and 85 more sets just like it, so that in the event of contact with the Ferengi culture, we should be ready.
In November 2003, the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 set, nor 4-byte sequences encoding anything greater than 0x10FFFF.
Please add edits here listing character sets that threaten the Unicode code point limit, if they are over 1/3 the size of CJK Extension E (~10,000 chars):
CJK Extension E (~10,000 chars)
Ferengi culture characters (~5,000 chars)
At present, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine coding your app to reject characters above that point.
Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
Cutting to the chase:
It is indeed intentional that the encoding system only supports code points up to U+10FFFF.
It does not appear that there is any real risk of running out any time soon.
There is no reason to write a UTF-8 parser that supports 5-6 byte sequences, except for supporting any legacy systems that actually used them. The current official UTF-8 specification does not allow 5-6 byte sequences, in order to accommodate 100% lossless conversion to and from UTF-16. If there is ever a time when Unicode has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by then memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything, which as a raw 32-bit code unit could address up to U+FFFFFFFF, over 4 billion characters.
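For what it is worth, modern decoders already enforce that limit; a quick sketch of the behaviour in Python:

    # Decoders following RFC 3629 reject the obsolete 5-byte form and anything above U+10FFFF.
    five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])   # pre-2003 UTF-8 for U+200000
    try:
        five_byte.decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected:", err)             # invalid start byte

    chr(0x10FFFF)     # fine: the last code point in the space
    # chr(0x110000)   # would raise ValueError: arg not in range(0x110000)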