I see the term 'octet' popping up in literature about nonces for hashing, and it seems to be synonymous with 'character', although there is a kind of pattern to how the words are used.
This leads me to believe that there is a formal distinction between the two. If anyone could enlighten me to what it is, I'd appreciate it.
(and please, no lectures about octal character codes or octal (base 8) numbers; I'm talking about the noun 'octet', not the adjective)
EDIT: as it turns out, the word I was looking for, is 'octet'.
You are probably thinking of the term octet as it is often used: as a synonym for a single-byte (non-Unicode) character. Octet in this instance means eight bits. A character can be eight or sixteen or even more bits, but an octet is always eight bits.
An octet is an 8-bit piece of data, a byte (though bytes don't necessarily have 8 bits). A character is the smallest unit of text. They are completely separate concepts, and using them interchangeably betrays serious ignorance of the complexity of text encodings. Unfortunately, this particular element of ignorance is far too common, and that the C standard explicitly defines a char to have a size of 1 byte does not help.
In particular, I'd be very wary of any cryptographic text that uses "character" to mean "byte" (or "octet").
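To make the distinction concrete, here is a minimal Python sketch (Python is only my choice for illustration; the sample word is arbitrary) that counts the characters in a string and the octets it occupies once encoded:

```python
# Characters (code points) versus octets (8-bit bytes) for the same text.
text = "na\u00efve"  # "naive" with a diaeresis on the i: 5 characters

utf8_octets = text.encode("utf-8")       # U+00EF takes 2 octets in UTF-8
utf16_octets = text.encode("utf-16-le")  # each of these characters takes 2 octets

char_count = len(text)
utf8_count = len(utf8_octets)
utf16_count = len(utf16_octets)

print(char_count, utf8_count, utf16_count)
```

The character count stays the same while the octet count depends entirely on the encoding, which is why the two words should not be used interchangeably.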
Octet is the French word for a byte; it is called octet because it contains eight bits.
The term is used mainly in telecommunications, probably due to the heavy French influence in e.g. CCITT.
I know that this is probably a stupid question, but I need to be sure on this issue. So I need to know for example if a programming language says that its String type uses UTF-16 encoding, does that mean:
it will use 2 bytes for code points in the range of U+0000 to U+FFFF.
it will use surrogate pairs for code points larger than U+FFFF (4 bytes per code point).
Or do some programming languages use their own "tricks" when encoding and not follow this standard 100%?
UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.
I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit is comprised of two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
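As an illustration of the code-unit view, here is a rough Python sketch of how one code point maps to one or two 16-bit code units. The function name `utf16_code_units` is mine, not part of any library:

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # Code points up to U+FFFF are a single code unit.
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([hex(u) for u in utf16_code_units(0x1F600)])
```

How each code unit is then laid out as bytes (big- or little-endian) is the separate, second layer mentioned above.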
But looking the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (of value up to 2^21), and it's up to you how to represent and serialize those.
I would happily make the case that UTF-16 is a historic accident that we probably wouldn't have done if we had to redo everything now: It is a variable-length encoding just as UTF-8, so you gain no random access, as opposed to UTF-32, but it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.
The only reason (in my opinion) that UTF-16 exists is because at some early point people believed that 16 bit would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (like UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring maybe UCS-2 for pure Asian text.)
UTF-16 per se is standard. However most languages whose strings are based on 16-bit code units (whether or not they claim to ‘support’ UTF-16) can use any sequence of code units, including invalid surrogates. For example this is typically an acceptable string literal:
"x \uDC00 y \uD800 z"
and usually you only get an error when you attempt to write it to another encoding.
Python's optional encode/decode option surrogateescape uses such invalid surrogates to smuggle tokens representing the single bytes 0x80–0xFF into standalone surrogate code units U+DC80–U+DCFF, resulting in a string such as this. This is typically only used internally, so you're unlikely to meet it in files or on the wire; and it only applies to UTF-16 in as much as Python's str datatype is based on 16-bit code units (which is on ‘narrow’ builds between 3.0 and 3.3).
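A small Python 3 sketch of both behaviours, assuming a current CPython (the sample byte values are arbitrary): a string holding lone surrogates is accepted until it is encoded, and `surrogateescape` round-trips undecodable bytes through U+DC80–U+DCFF.

```python
# A str may hold lone surrogates; the error only shows up when encoding.
s = "x \udc00 y \ud800 z"
try:
    s.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape: undecodable bytes 0x80-0xFF become U+DC80-U+DCFF ...
raw = b"abc\xff\x80"
smuggled = raw.decode("utf-8", errors="surrogateescape")

# ... and are turned back into the original bytes when encoding.
round_trip = smuggled.encode("utf-8", errors="surrogateescape")

print(strict_failed, ascii(smuggled), round_trip == raw)
```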
I'm not aware of any other commonly-used extensions/variants of UTF-16.
What's the exact difference between Unicode and ASCII?
ASCII has a total of 128 characters (256 in the extended set).
Is there any size specification for Unicode characters?
ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 2^21 characters, which, similarly, map to numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved).
Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
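A quick Python check of these points, as a sketch (the euro sign is just an arbitrary non-ASCII example):

```python
a = "A"          # U+0041, inside the ASCII range
euro = "\u20ac"  # EURO SIGN, U+20AC, outside the ASCII range

a_number = ord(a)                      # same number in ASCII and Unicode
a_ascii = a.encode("ascii")            # one byte
a_utf8 = a.encode("utf-8")             # identical to the ASCII byte
euro_utf8 = euro.encode("utf-8")       # three bytes
euro_utf32 = euro.encode("utf-32-be")  # four bytes

print(a_number, a_ascii, a_utf8, euro_utf8.hex(), euro_utf32.hex())
```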
Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.
ASCII, Origins
As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.
Wait, 7 bits? But why not 1 byte (8 bits)?
The last (8th) bit was used as a parity bit to detect transmission errors.
This was relevant years ago.
Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.
See below the binary representation of a few characters in ASCII:
0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
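The bit patterns above can be reproduced with a short Python sketch; `ascii_bits` is a name I made up for this example:

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary representation of one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b")  # zero-padded to 7 binary digits


for ch in ["%", "A", "B", "C", "\r"]:
    print(ascii_bits(ch), "->", repr(ch), ord(ch))
```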
See the full ASCII table over here.
ASCII was meant for English only.
What? Why English only? So many languages out there!
Because the center of the computer industry was in the USA at that
time. As a consequence, they didn't need to support accents or other
marks such as á, ü, ç, ñ, etc. (aka diacritics).
ASCII Extended
Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é", in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters). And not 2^7 as before (128).
10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
The name for this "ASCII extended to 8 bits and not 7 bits as before" could simply be "extended ASCII" or "8-bit ASCII".
As @Tom pointed out in his comment below, there is no such thing as "extended ASCII", yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example ISO 8859-1, also called ISO Latin-1.
Unicode, The Rise
ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?
We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).
You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.
Encodings: UTF-8 vs UTF-16 vs UTF-32
This answer does a pretty good job at explaining the basics:
UTF-8 and UTF-16 are variable length encodings.
In UTF-8, a character may occupy a minimum of 8 bits.
In UTF-16, a character length starts with 16 bits.
UTF-32 is a fixed length encoding of 32 bits.
UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
Mnemonics:
UTF-8: minimum 8 bits.
UTF-16: minimum 16 bits.
UTF-32: minimum and maximum 32 bits.
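These sizes are easy to check in Python. The following is a small sketch; `encoded_sizes` is a helper name of my own and the sample characters are arbitrary:

```python
def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(
        len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")
    )


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows from 1 to 4 bytes, UTF-16 is 2 or 4 bytes, and UTF-32 is always 4 bytes.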
Note:
Why 2^7?
This is obvious for some, but just in case. We have seven slots available filled with either 0 or 1 (Binary Code).
Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
Source: Wikipedia, this great blog post and Mocki.co where I initially posted this summary.
ASCII has 128 code points, 0 through 127. It fits in a single 8-bit byte, so the values 128 through 255 tended to be used for other characters, with incompatible choices that caused the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guesses another code page.
Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. Version 2 later extended this to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.
Encoding in 16 bits was common when v2 came around; it was used by Microsoft and Apple operating systems, for example, and by language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16-bit units: an encoding called UTF-16, a variable-length encoding where one code point takes either 2 or 4 bytes. The original v1 code points take 2 bytes, the added ones take 4.
Another variable length encoding that's very common, used in *nix operating systems and tools is UTF-8, a code point can take between 1 and 4 bytes, the original ASCII codes take 1 byte the rest take more. The only non-variable length encoding is UTF-32, takes 4 bytes for a code point. Not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, widely ignored.
An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endian-ness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers about which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special code point (U+FEFF, zero width no-break space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianness and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it, so accidents are still pretty common.
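The byte-order issue and the BOM can be seen directly in Python. This is only a sketch using the standard `codecs` constants:

```python
import codecs

text = "A"
be = text.encode("utf-16-be")  # most significant byte first
le = text.encode("utf-16-le")  # least significant byte first

# The BOM is just U+FEFF encoded in the stream's own encoding.
bom_be = "\ufeff".encode("utf-16-be")
bom_le = "\ufeff".encode("utf-16-le")
bom_utf8 = "\ufeff".encode("utf-8")

# Python's plain "utf-16" codec writes a BOM in the machine's native order.
with_bom = text.encode("utf-16")

print(be, le, bom_be.hex(), bom_le.hex(), bom_utf8.hex(), with_bom)
```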
Java provides support for Unicode, i.e. it supports alphabets from all over the world. Hence the size of a char in Java is 2 bytes, with a range of 0 to 65535. (A char is one UTF-16 code unit, so code points above U+FFFF take two chars.)
ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).
Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, and many code points have been made permanently noncharacters (i.e. not used to encode any character ever), and most code points are not yet assigned.
The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The 128 first code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.
Sometimes, however, Unicode is characterized (even in the Unicode standard!) as “wide ASCII”. This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposite to using different codes in different systems and applications and for different languages.
Unicode as such defines only the “logical size” of characters: Each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.
ASCII and Unicode are two character encodings. Basically, they are standards on how to represent different characters in binary so that they can be written, stored, transmitted, and read in digital media. The main difference between the two is in the way they encode the character and the number of bits that they use for each. ASCII originally used seven bits to encode each character. This was later increased to eight with Extended ASCII to address the apparent inadequacy of the original. In contrast, Unicode uses a variable bit encoding program where you can choose between 32, 16, and 8-bit encodings. Using more bits lets you use more characters at the expense of larger files while fewer bits give you a limited choice but you save a lot of space. Using fewer bits (i.e. UTF-8 or ASCII) would probably be best if you are encoding a large document in English.
One of the main reasons for Unicode was the problem that arose from the many non-standard extended ASCII code pages. Unless you are using the prevalent page, which is used by Microsoft and most other software companies, you are likely to encounter problems with your characters appearing as boxes. Unicode virtually eliminates this problem as all the character code points were standardized.
Another major advantage of Unicode is that at its maximum it can accommodate a huge number of characters. Because of this, Unicode currently contains most written languages and still has room for even more. This includes typical left-to-right scripts like English and even right-to-left scripts like Arabic. Chinese, Japanese, and the many other variants are also represented within Unicode. So Unicode won’t be replaced anytime soon.
In order to maintain compatibility with the older ASCII, which was already in widespread use at the time, Unicode was designed in such a way that its first 128 code points match ASCII (and its first 256 match ISO 8859-1, the most popular extended ASCII page). So if you open an ASCII-encoded file as UTF-8, you still get the correct characters encoded in the file. This facilitated the adoption of Unicode as it lessened the impact of adopting a new encoding standard for those who were already using ASCII.
Summary:
1. ASCII uses a 7-bit encoding (8 bits in its extended variants) while Unicode offers encodings of several widths.
2. Unicode is a single standard while the extended ASCII code pages are mutually incompatible.
3. Unicode represents most written languages in the world while ASCII does not.
4. ASCII has its equivalent within Unicode.
Taken From: http://www.differencebetween.net/technology/software-technology/difference-between-unicode-and-ascii/#ixzz4zEjnxPhs
Storage
The numbers given are for storing a single character:
ASCII ⟶ 7 bits (stored in 1 byte)
Extended ASCII ⟶ 8 bits (1 byte)
UTF-8 ⟶ minimum 8, maximum 32 bits (min 1, max 4 bytes)
UTF-16 ⟶ minimum 16, maximum 32 bits (min 2, max 4 bytes)
UTF-32 ⟶ 32 bits (4 bytes)
ASCII defines 128 characters, whereas Unicode contains a repertoire of more than 120,000 characters.
Beyond how UTF is a superset of ASCII, another good difference to know between ASCII and UTF is in terms of disk file encoding and data representation and storage in random memory. Programs know that given data should be understood as an ASCII or UTF string either by detecting special byte order mark codes at the start of the data, or by assuming from programmer intent that the data is text and then checking it for patterns that indicate it is in one text encoding or another.
Using the conventional 0x prefix for hexadecimal data, a good basic reference is that ASCII text consists only of byte values 0x00 to 0x7F, each representing one of the possible ASCII character values. UTF text is often marked by a byte order mark at the start: the bytes 0xEF 0xBB 0xBF for UTF-8, and 0xFE 0xFF or 0xFF 0xFE for UTF-16, with the endianness of the text bytes indicated by the order of the start bytes. The simple presence of byte values outside the ASCII range also indicates that the data is probably UTF.
There are other byte order marks that use different codes to indicate data should be interpreted as text encoded in a certain encoding standard.
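The detection approach described above could be sketched in Python roughly as follows. `sniff_encoding` is a hypothetical helper, not a standard function, and real-world detection needs more heuristics than this:

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 ones, because the UTF-32-LE BOM
# (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes) -> str:
    """Guess a Python codec name from a BOM or from the byte values."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    if all(b < 0x80 for b in data):
        return "ascii"      # only 0x00-0x7F present
    try:
        data.decode("utf-8")
        return "utf-8"      # non-ASCII bytes that form valid UTF-8
    except UnicodeDecodeError:
        return "unknown"    # probably some legacy code page


print(sniff_encoding(b"hello"), sniff_encoding("h\u00e9".encode("utf-8")))
```

The returned names can be passed straight to `bytes.decode`, since the `utf-8-sig`, `utf-16` and `utf-32` codecs consume the BOM themselves.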
We know that codepoints can be in this interval 0..10FFFF which is less than 2^21. Then why do we need UTF-32 when all codepoints can be represented by 3 bytes? UTF-24 should be enough.
Computers are generally much better at dealing with data on 4 byte boundaries. The benefits in terms of reduced memory consumption are relatively small compared with the pain of working on 3-byte boundaries.
(I speculate there was also a reluctance to have a limit that was "only what we can currently imagine being useful" when coming up with the original design. After all, that's caused a lot of problems in the past, e.g. with IPv4. While I can't see us ever needing more than 24 bits, if 32 bits is more convenient anyway then it seems reasonable to avoid having a limit which might just be hit one day, via reserved ranges etc.)
I guess this is a bit like asking why we often have 8-bit, 16-bit, 32-bit and 64-bit integer datatypes (byte, int, long, whatever) but not 24-bit ones. I'm sure there are lots of occasions where we know that a number will never go beyond 2^21, but it's just simpler to use int than to create a 24-bit type.
First there were 2 character coding schemes: UCS-4 that coded each character into 32 bits, as an unsigned integer in range 0x00000000 - 0x7FFFFFFF, and UCS-2 that used 16 bits for each codepoint.
Later it was found out that using just the 65536 codepoints of UCS-2 would get one into problems anyway, but many programs (Windows, cough) relied on wide characters being 16 bits wide, so UTF-16 was created. UTF-16 encodes the codepoints in the range U+0000 - U+FFFF just like UCS-2; and U+10000 - U+10FFFF using surrogate pairs, i.e. a pair of two 16-bit values.
As this was a bit complicated, UTF-32 was introduced, as a simple one-to-one mapping for characters beyond U+FFFF. Now, since UTF-16 can only encode up to U+10FFFF, it was decided that this would be the maximum value that will ever be assigned, so that there will be no further compatibility problems, so UTF-32 indeed just uses 21 bits. As an added bonus, UTF-8, which was initially planned to be a 1-6-byte encoding, now never needs more than 4 bytes for each code point. Therefore it can be easily proven that it never requires more storage than UTF-32.
It is true that a hypothetical UTF-24 format would save memory. However its savings would be dubious anyway, as it would mostly consume more storage than UTF-8, except for just blasts of emoji or such - and not many interesting texts of significant length consist solely of emojis.
But, UTF-32 is used as in memory representation for text in programs that need to have simply-indexed access to codepoints - it is the only encoding where the Nth element in a C array is also the Nth codepoint - UTF-24 would do the same for 25 % memory savings but more complicated element accesses.
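The "Nth element is the Nth codepoint" property can be sketched in Python like this. `nth_code_point_utf32` is a made-up helper that assumes BOM-less little-endian UTF-32 input:

```python
def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Return the n-th code point of BOM-less UTF-32-LE data in O(1)."""
    chunk = data[4 * n : 4 * n + 4]  # fixed width: just multiply the index
    if len(chunk) != 4:
        raise IndexError("code point index out of range")
    return chr(int.from_bytes(chunk, "little"))


data = "a\U0001F600b".encode("utf-32-le")
print(ascii(nth_code_point_utf32(data, 1)))
```

With UTF-8 or UTF-16 the same lookup would need a scan from the start, because earlier code points may occupy a varying number of bytes.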
It's true that only 21 bits are required (reference), but modern computers are good at moving 32-bit units of things around and generally interacting with them. I don't think I've ever used a programming language that had a 24-bit integer or character type, nor a platform where that was a multiple of the processor's word size (not since I last used an 8-bit computer; UTF-24 would be reasonable on an 8-bit machine), though naturally there have been some.
UTF-32 is a multiple of 16 bits. Working with 32-bit quantities is much more common than working with 24-bit quantities and is usually better supported. It also helps keep each character 4-byte aligned (assuming the entire string is 4-byte aligned). Going from 1 byte to 2 bytes to 4 bytes is the most "logical" progression.
Apart from that: The Unicode standard is ever-growing. Codepoints outside of that range could eventually be assigned (it is somewhat unlikely in the near future, however, due to the huge number of unassigned codepoints still available).
I am asking for the count of all the possible valid combinations in Unicode with explanation. I know a char can be encoded as 1,2,3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
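The arithmetic behind the 1,111,998 figure, spelled out as a short Python sketch (the variable names are mine):

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000                  # 65,536

total = PLANES * CODE_POINTS_PER_PLANE           # U+0000..U+10FFFF
surrogates = 0xDFFF - 0xD800 + 1                 # U+D800..U+DFFF

# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * PLANES

scalar_values = total - surrogates
encodable_characters = scalar_values - noncharacters

print(total, surrogates, noncharacters, scalar_values, encodable_characters)
```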
I also don't understand why continuation bytes have restrictions even though starting byte of that char clears how long it should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
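The false positive is easy to reproduce in Python, which ships a `gb18030` codec. This is a sketch of a naive byte-level search for the digit 8:

```python
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

gb = sharp_s.encode("gb18030")   # four bytes, two of them in the ASCII digit range
utf8 = sharp_s.encode("utf-8")   # two bytes, both >= 0x80

# A naive byte search for the ASCII digit "8":
gb_false_positive = b"8" in gb
utf8_false_positive = b"8" in utf8

print(gb.hex(), gb_false_positive, utf8.hex(), utf8_false_positive)
```

In UTF-8 every byte of a multi-byte character is 0x80 or above, so it can never be mistaken for an ASCII character.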
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for the restrictions on the continuation bytes is presumably to make it easy to find the beginning of the next character (continuation bytes are always of the form 10xxxxxx, and a starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2048 surrogate code points, giving 1,112,064 scalar values. Of these, there are 66 non-characters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF to know that this byte is the start of a new character.
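A minimal Python sketch of that resynchronization step. `next_char_start` is a hypothetical helper name, and it does not validate the data otherwise:

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    UTF-8 continuation byte (0x80-0xBF), i.e. where a character starts."""
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")  # 61 | E2 82 AC | 62
start = next_char_start(data, 2)   # position 2 is in the middle of U+20AC
print(start, data[start:].decode("utf-8"))
```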
In theory, the encodings used today allow for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal length tweet can encode up to 4,340 bits' worth of data. (140 characters [valid and invalid], times 31 bits each.)
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode has the hexadecimal amount of 110000, which is 1114112
Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.
(All quotes are from RFC 3629)
Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets. The
only octet of a "sequence" of one has the higher-order bit set to 0,
the remaining 7 bits being used to encode the character number. In a
sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the number of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?
Section 2:
The octet values C0, C1, F5 to FF never appear.
This means we cannot encode UTF-8 values with 5 or 6 octets (or even some with 4 that aren't within the above range)?
Section 12:
Restricted the range of characters to 0000-10FFFF (the UTF-16
accessible range).
Looking at the previous RFC confirms this...they reduced the range of characters.
Section 10:
Another security issue occurs when encoding to UTF-8: the ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
a risk of buffer overflow if the range of character numbers is not
explicitly limited to U+10FFFF or if buffer sizing doesn't take into
account the possibility of 5- and 6-byte sequences.
So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?
Thanks in advance.
There are no Unicode characters beyond 10FFFF; the BMP covers 0000 through FFFF.
UTF-8 is well-defined for 0-10FFFF.
Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is to encode upper and lower surrogate halves (which UTF-16 uses) or values above U+10FFFF, which aren't legal Unicode.
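For illustration, here is a hand-rolled Python encoder that follows the bit layout quoted from RFC 3629 and rejects exactly those values. It is a sketch with a made-up name (`utf8_encode`), not a replacement for the built-in codec:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 per RFC 3629 (1 to 4 octets)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate halves are not encodable in UTF-8")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x10FFFF).hex())
```

The largest code point, U+10FFFF, already fits in four octets, which is why 5- and 6-octet sequences are never needed.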
Note that the BMP ends at U+FFFF.
I would have to say no: Unicode code points are valid for the range [0, 0x10FFFF], and those map to 1-4 octets. So, if you did come across a 5- or 6-octet UTF-8 encoded code point, it's not a valid code point - there's certainly nothing assigned there. I am a little baffled as to why they're there in the ISO standard - I couldn't find an explanation.
It does make you wonder, however, if perhaps someday in the future, they would expand past U+10FFFF. 0x10FFFF allows for over a million characters, but there are a lot of characters out there, and it would depend how much eventually gets encoded. (For sanity's sake, let's hope not, a million characters is a lot!) UTF-32 could handle more code points, and as you've discovered, UTF-8 could. It'd really be UTF-16 that's out of luck - more surrogate pairs would be needed somewhere in the spectrum of code points.