Why Unicode code points are always written with at least 2 bytes? - unicode

Why does Unicode code points are always written with 2 bytes (4 digits) even when that's not necessary ?
From the Wikipedia page about UTF-8 :
$ -> U+0024
¢ -> U+00A2

TL;DR This is all by convention of the Unicode Consortium.
Here is the formal definition, found in Appendix A: Notational Conventions of the Unicode standard (I've referenced the latest at this time, version 11):
In running text, an individual Unicode code point is expressed as U+n, where n is four to
six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15,
respectively). Leading zeros are omitted, unless the code point would have fewer than four
hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345,
U+102345.
They are hexadecimal digits, that represent Unicode Scalar Values. Initially only the first plane called the Basic Multilingual Plane were made available, which supported a range of U+0000 to U+FFFF to be defined. Initially the U+ encoding therefore always had 4 hexadecimal characters.
However, that only allows 64 Ki (65536) code points to use for characters (excluding some reserved values). So the single plane was later extended to 17 planes. Leading zero's are suppressed for values of U+10000 or higher, so the next character is written as U+10000, not U+010000. Currently there are 17 planes of 64Ki code points (some of which may be reserved), starting at U+0000, U+10000 ... U+90000 and finally U100000.
The U+xxxx notation does not follow UTF-8 encoding. Nor does it follow UTF-16, UTF-32 or the deprecated UCS encodings, either in big or little endian. The encoding of characters within the Basic Multilingual Plane is however identical to UTF-16(BE) in hexadecimal. Note that UTF-16 may contain surrogate code units that are used as escape to encode the characters in the other planes. The ranges of those code units are not mapped to characters and will therefore not be present in the textual code point representation.
See for instance, the PLUS-MINUS SIGN, ±:
Unicode code point: U+00B1 (as a textual string)
UTF-8 : 0xC2 0xB1 (as two bytes)
UTF-16 : 0x00B1
UTF-16BE : 0x00B1 as 0x00 0xB1 (as two bytes)
UTF-16LE : 0x00B1 as 0xB1 0x00 (as two bytes)
https://www.fileformat.info/info/unicode/char/00b1/index.htm
Much of this information can be found at sil.org.

Related

Why is Unicode restricted to 0x10FFFF?

Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent Unicode above this code point - for e.g. 0x10FFFF + 0x000001 = 0x110000 - through any encoding schemes like UTF-16, UTF-8?
It's because of UTF-16. Characters outside of the base multilingual plane (BMP) are represented using a surrogate pair in UTF-16 with the first code unit (CU) lies between 0xD800–0xDBFF and the second one between 0xDC00–0xDFFF. Each of the CU represents 10 bits of the code point, allowing total 20 bits of data (0x100000 characters) which is split into 16 planes (16×216 characters). The remaining BMP will represent 0x10000 characters (code points 0–0xFFFF)
Therefore the total number of characters is 17×216 = 0x100000 + 0x10000 = 0x110000 which allows for code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively the last representable code point can be calculated like this: Code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, which means the last code point that a surrogate pair represents is 0xFFFFF + 0x10000 = 0x10FFFF
That's guaranteed by Unicode Character Encoding Stability Policies that a code point above that will never be assigned
The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.
Historically UTF-8 allows up to U+7FFFFFFF using 6 bytes whereas UTF-32 can store twice the number of that. However due to the limit in UTF-16 the Unicode committee has decided that UTF-8 can never be longer than 4 bytes, resulting in the same range as UTF-16
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
https://en.wikipedia.org/wiki/UTF-8#History
The same has been applied to UTF-32
In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32
https://en.wikipedia.org/wiki/UTF-32
You can read this more detailed answer and
Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?
Does the Unicode Consortium Intend to make UTF-16 run out of characters?
How many characters can be mapped with Unicode?
Proposal to restrict the range of code positions to the values up to U-0010FFFF

How is a high unicode codepoint expressed as two codepoints?

I've seen that >2 byte unicode codepoints like U+10000 can be written as a pair, like \uD800\uDC00. They seem to start with the nibble d, but that's all I've noticed.
What is that splitting action called and how does it work?
UTF-8 means (using my own words) that the minimum atom of processing is a byte (the code unit is 1-byte long). I don't know if historically, but at least, conceptually spoken, the UCS-2 and UCS-4 Unicode encodings come first, and UTF-8/UTF-16 appear to solve some problems of UCS-*.
UCS-2 means that each character uses 2 bytes instead of one. It's a fixed-length encoding. UCS-2 saves the bytestring of each code point as you say. The problem is there are characters which codepoints require more than 2 bytes to store it. So, UCS-2 only can handle a subset of Unicode (the range U+0000 to U+FFFF of course).
UCS-4 uses 4 bytes for each character instead, and it's capable enough to store the bitstring of any Unicode code point, obviously (the Unicode range is from U+000000 to U+10FFFF).
The problem with UCS-4 is that characters outside the 2-bytes range are very, very uncommon, and any text encoded using UCS-4 will waste too much space. So, using UCS-2 is a better approach, unless you need characters outside the 2-bytes range.
But again, English texts, source code files and so on use mostly ASCII characters and UCS-2 has the same problem: wasting too much space for texts which use mostly ASCII characters (too many useless zeros).
That is what UTF-8 does. Characters inside the ASCII range are saved in UTF-8 texts as-is. It just takes the bitstring of the code point/ASCII value of each character. So, if a UTF-8 encoded text uses only ASCII characters, it is indistinguishable from any other Latin1 encoding. Clients without UTF-8 support can handle UTF-8 texts using only ASCII characters, because they look identical. It's a backward compatible encoding.
From then on (Unicode characters outside the ASCII range), UTF-8 texts use two, three or four bytes to save code points, depending on the character.
I don't know the exact method, but the bitestring is split in two, three or four bytes using known bit prefixes to know the amount of bytes used to save the code point. If a byte begins with 0, means the character is ASCII and uses only 1 byte (the ASCII range is 7-bits long). If it begins with 1, the character is encoded using two, three or four bytes depending on what bit comes next.
The problem with UTF-8 is that it requires too much processing (it must examine the first bits of each character to know its length), specially if the text is not English-like. For example, a text written in Greek will use mostly two-byte characters.
UTF-16 uses two-bytes code units to solve that problem for non-ASCII texts. That means that the atoms of processing are 16-bit words. If a character encoding doesn't fit in a two-byte code unit, then it uses 2 code units (four bytes) to encode the character. That pair of two code units is called a surrogate pair. I think a UTF-16 text using only characters inside the 2-byte range is equivalent to the same text using UCS-2.
UTF-32, in turn, uses 4-bytes code units, as UCS-4 does. I don't know the differences between them though.
The complete picture filling in your confusion is formatted below:
Referencing what I learned from the comments...
U+10000 is a Unicode code point (hexadecimal integer mapped to a character).
Unicode is a one-to-one mapping of code points to characters.
The inclusive range of code points from 0xD800 to 0xDFFF is reserved for UTF-161 (Unicode vs UTF) surrogate units (see below).
\uD800\uDC002 are two such surrogate units, called a surrogate pair. (A surrogate unit is a code unit that's part of a surrogate pair.)
Abstract representation: Code point (abstract character) --> Code unit (abstract UTF-16) --> Code unit (UTF-16 encoded bytes) --> Interpreted UTF-16
Actual usage example: Input data is bytes and may be wrapped in a second encoding, like ASCII for HTML entities and unicode escapes, or anything the parser handles --> Encoding interpreted; mapped to code point via scheme --> Font glyph --> Character on screen
How surrogate pairs work
Surrogate pair advantages:
There are only high and low units. A high must be followed by a low. No confusing high&low units.
UTF-16 can use 2 bytes for the first 63487 code points because surrogates cannot be mistaken for code points.
A range of 2048 code points is (2048/2)**2 to yield a range of 1048576 code points.
The processing is done on the less frequently used characters.
1 UTF-16 is the only UTF which uses surrogate pairs.
2 This is formatted as a unicode escape sequence.
Graphics describing character encoding:
Keep reading:
How does UTF-8 "variable-width encoding" work?
Unicode, UTF, ASCII, ANSI format differences
Code point

Why must unicode use utf-8?

as far I know, the UNICODE is the industry standard for character mapping.
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
do i make any sense?
Who says it must be encoded as UTF-8? There are several common encodings for Unicode, including UTF-16 (big- or little-endian), and some less common ones such as UTF-7 and UTF-32.
Unicode itself is not an encoding; it's merely a specification of numeric code points for several thousand characters.
The Unicode code point for lowercase a is 0x61 in hexadecimal, 97 in decimal, or 0141 in octal.
If you're suggesting that 'a' should be encoded as the 6-character ASCII string "U+0061", that would be terribly wasteful of space and more difficult to decode than UTF-8.
If you're suggesting storing the numeric values directly, that's what UTF-32 does: it stores each character as a 32-bit (4-octet) number that directly represents the code point. The trouble with that is that it's nearly as wasteful of space as "U+0061" (4 bytes per character vs. 6.)
The UTF-8 encoding has a number of advantages. One is that it's upward compatible with ASCII. Another is that it's reasonably efficient even for non-ASCII characters, as long as most of the encoded text is within the first few thousand code points.
UTF-16 has some other advantages, but I personally prefer UTF-8. MS Windows tends to use UTF-16, but mostly for historical reasons; Windows added Unicode support when there were fewer than 65536 defined code points, which made UTF-16 equvalent to UCS-2, which is a simpler representation.
UTF-8 is only one 'memory format' of Unicode. There is also UTF-16, UTF-32 and a number of other memory mapping formats.
UTF-8 has been used broadly because it is upwardly compatible with an 8 bit character code like Ascii.
You can tell a browser via html, mySQL at several levels, and Notepad++ vie encoding option to use other formats for the data they operate on.
DuckDuckGo or Google Unicode and you will find plenty of articles on this on the internet. Here is one: https://ssl.icu-project.org/docs/papers/forms_of_unicode/
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value
Stored data is a sequence of byte values, generally interpreted at the lowest level as numbers. We usually use bytes that can be one of 256 values, so we look at them as numbers in the range 0 to 255.
So when you say 'just stored as a String with "U+0061"' what sequence of numbers in the range 0-255 do you mean?
Unicode code points like U+0061 are written in hexadecimal. Hexadecimal 61 is the number 97 in the more familiar decimal system, so perhaps you think that the letter 'a' should be stored as a single byte with the value 97. You might be surprised to learn that this is exactly how the encoding UTF-8 represents this string.
Of course there are more than 256 characters defined in Unicode, so not all Unicode characters can be stored as bytes with the same value as their Unicode codepoint. UTF-8 has one way of dealing with this, and there are other encodings with different ways.
UTF-32, for example, is an encoding which uses 4 bytes together at a time to represent a codepoint. Since one byte has 256 values four bytes can have 256 × 256 × 256 × 256, or 4,294,967,296 different arrangements. We can number those arrangements of bytes from 0 to 4,294,967,295 and then store every Unicode codepoint as the arrangement of bytes that we've numbered with the number corresponding to the Unicode codepoint value. This is exactly what UTF-32 does.
(However, there are different ways to assign numbers to those arrangements of four bytes and so there are multiple versions of UTF-32, such as UTF-32BE and UTF-32LE. Typically a particular medium of storing or transmitting bytes specifies its own numbering scheme, and the encoding 'UTF-32' without further qualification implies that whatever the medium's native scheme is should be used.)
Read this article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
do i make any sense?
Not a lot! (Read on ...)
as far I know, the UNICODE (sic) is the industry standard for character mapping.
That is incorrect. Unicode IS NOT a standard for character mapping. It is a standard that defines a set of character codes and what they mean.
It is essentially a catalogue that defines a mapping of codes (Unicode "code points") to conceptual characters, but it is not a standard for mapping characters. It certainly DOES NOT define a standard way to represent the code points; i.e. a mapping to a representation. (That is what character encoding schemes do!)
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
That is incorrect. Character data DOES NOT have to be encoded in UTF-8. It can be encoded as UTF-8. But it can also be encoded in a number of other ways too:
The Unicode has specified a number of encoding schemes, including UTF-8, UTF-16 and UTF-32, and various historical variants.
There are many other standard encoding schemes (probably hundreds of them). This Wikipedia page lists some of the common ones.
The various different encoding schemes have different purposes (and different limitations). For example:
ASCII and LATIN-1 are 7 and 8-bit character sets (respectively) that encode a small subset of Unicode code-points. (ASCII encodes roman letters and numbers, some punctuation, and "control codes". LATIN-1 adds a number of accented latin letters using in Western Europe and some other common "typographical" characters.)
UTF-8 is a variable length encoding scheme that encodes Unicode code points as 1 to 5 bytes (octets). (It is biased towards western usage ... since it encodes all latin / roman letters and numbers as single bytes.)
UTF-16 is designed for encoding Unicode code points in 16-bit units. (Java Strings are essentially UTF-16 encoded.)
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
In fact, a Java String is represented as a sequence of char values. The char type is a 16-bit unsigned integer type; i.e. it has values 0 through 65535. And the char value that represents a lowercase "a" character is hex 0061 == octal 141 == decimal 97.
You are incorrect about "octal 0061" ... but I can't figure out what distinction you are actually trying to make here, so I can't really comment on that.

What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII?
ASCII has a total of 128 characters (256 in the extended set).
Is there any size specification for Unicode characters?
ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 221 characters, which, similarly, map to numbers 0–221 (though not all numbers are currently assigned, and some are reserved).
Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.
ASCII, Origins
As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*. Which means that we can represent 128 characters maximum.
Wait, 7 bits? But why not 1 byte (8 bits)?
The last bit (8th) is used for avoiding errors as parity bit.
This was relevant years ago.
Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.
See below the binary representation of a few characters in ASCII:
0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
See the full ASCII table over here.
ASCII was meant for English only.
What? Why English only? So many languages out there!
Because the center of the computer industry was in the USA at that
time. As a consequence, they didn't need to support accents or other
marks such as á, ü, ç, ñ, etc. (aka diacritics).
ASCII Extended
Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é", in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters). And not 2^7 as before (128).
10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
The name for this "ASCII extended to 8 bits and not 7 bits as before" could be just referred as "extended ASCII" or "8-bit ASCII".
As #Tom pointed out in his comment below there is no such thing as "extended ASCII" yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example, the ISO 8859-1, also called ISO Latin-1.
Unicode, The Rise
ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?
We would have needed an entirely new character set... that's the rational behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).
You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.
Encodings: UTF-8 vs UTF-16 vs UTF-32
This answer does a pretty good job at explaining the basics:
UTF-8 and UTF-16 are variable length encodings.
In UTF-8, a character may occupy a minimum of 8 bits.
In UTF-16, a character length starts with 16 bits.
UTF-32 is a fixed length encoding of 32 bits.
UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
Mnemonics:
UTF-8: minimum 8 bits.
UTF-16: minimum 16 bits.
UTF-32: minimum and maximum 32 bits.
Note:
Why 2^7?
This is obvious for some, but just in case. We have seven slots available filled with either 0 or 1 (Binary Code).
Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
Source: Wikipedia, this great blog post and Mocki.co where I initially posted this summary.
ASCII has 128 code points, 0 through 127. It can fit in a single 8-bit byte, the values 128 through 255 tended to be used for other characters. With incompatible choices, causing the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guessed at another code page.
Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. Later extended in version 2 to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.
Encoding in 16-bits was common when v2 came around, used by Microsoft and Apple operating systems for example. And language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16-bits. An encoding called UTF-16, a variable length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, added ones take 4.
Another variable length encoding that's very common, used in *nix operating systems and tools is UTF-8, a code point can take between 1 and 4 bytes, the original ASCII codes take 1 byte the rest take more. The only non-variable length encoding is UTF-32, takes 4 bytes for a code point. Not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, widely ignored.
An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endian-ness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special codepoint (U+FEFF, zero width space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianess and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it so accidents are still pretty common.
java provides support for Unicode i.e it supports all world wide alphabets. Hence the size of char in java is 2 bytes. And range is 0 to 65535.
ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).
Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, and many code points have been made permanently noncharacters (i.e. not used to encode any character ever), and most code points are not yet assigned.
The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The 128 first code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.
Sometimes, however, Unicode is characterized (even in the Unicode standard!) as “wide ASCII”. This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposite to using different codes in different systems and applications and for different languages.
Unicode as such defines only the “logical size” of characters: Each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.
ASCII and Unicode are two character encodings. Basically, they are standards on how to represent difference characters in binary so that they can be written, stored, transmitted, and read in digital media. The main difference between the two is in the way they encode the character and the number of bits that they use for each. ASCII originally used seven bits to encode each character. This was later increased to eight with Extended ASCII to address the apparent inadequacy of the original. In contrast, Unicode uses a variable bit encoding program where you can choose between 32, 16, and 8-bit encodings. Using more bits lets you use more characters at the expense of larger files while fewer bits give you a limited choice but you save a lot of space. Using fewer bits (i.e. UTF-8 or ASCII) would probably be best if you are encoding a large document in English.
One of the main reasons why Unicode was the problem arose from the many non-standard extended ASCII programs. Unless you are using the prevalent page, which is used by Microsoft and most other software companies, then you are likely to encounter problems with your characters appearing as boxes. Unicode virtually eliminates this problem as all the character code points were standardized.
Another major advantage of Unicode is that at its maximum it can accommodate a huge number of characters. Because of this, Unicode currently contains most written languages and still has room for even more. This includes typical left-to-right scripts like English and even right-to-left scripts like Arabic. Chinese, Japanese, and the many other variants are also represented within Unicode. So Unicode won’t be replaced anytime soon.
In order to maintain compatibility with the older ASCII, which was already in widespread use at the time, Unicode was designed in such a way that the first eight bits matched that of the most popular ASCII page. So if you open an ASCII encoded file with Unicode, you still get the correct characters encoded in the file. This facilitated the adoption of Unicode as it lessened the impact of adopting a new encoding standard for those who were already using ASCII.
Summary:
1.ASCII uses an 8-bit encoding while Unicode uses a variable bit encoding.
2.Unicode is standardized while ASCII isn’t.
3.Unicode represents most written languages in the world while ASCII does not.
4.ASCII has its equivalent within Unicode.
Taken From: http://www.differencebetween.net/technology/software-technology/difference-between-unicode-and-ascii/#ixzz4zEjnxPhs
Storage
Given numbers are only for storing 1 character
ASCII ⟶ 27 bits (1 byte)
Extended ASCII ⟶ 28 bits (1 byte)
UTF-8 ⟶ minimum 28, maximum 232 bits (min 1, max 4 bytes)
UTF-16 ⟶ minimum 216, maximum 232 bits (min 2, max 4 bytes)
UTF-32 ⟶ 232 bits (4 bytes)
Usage (as of Feb 2020)
ASCII defines 128 characters, as Unicode contains a repertoire of more than 120,000 characters.
Beyond how UTF is a superset of ASCII, another good difference to know between ASCII and UTF is in terms of disk file encoding and data representation and storage in random memory. Programs know that given data should be understood as an ASCII or UTF string either by detecting special byte order mark codes at the start of the data, or by assuming from programmer intent that the data is text and then checking it for patterns that indicate it is in one text encoding or another.
Using the conventional prefix notation of 0x for hexadecimal data, basic good reference is that ASCII text starts with byte values 0x00 to 0x7F representing one of the possible ASCII character values. UTF text is normally indicated by starting with the bytes 0xEF 0xBB 0xBF for UTF8. For UTF16, start bytes 0xFE 0xFF, or 0xFF 0xFE are used, with the endian-ness order of the text bytes indicated by the order of the start bytes. The simple presence of byte values that are not in the ASCII range of possible byte values also indicates that data is probably UTF.
There are other byte order marks that use different codes to indicate data should be interpreted as text encoded in a certain encoding standard.

Are 6 octet UTF-8 sequences valid?

Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.
(All quotes are from RFC 3629)
Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets. The
only octet of a "sequence" of one has the higher-order bit set to 0,
the remaining 7 bits being used to encode the character number. In a
sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the number of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?
Section 2:
The octet values C0, C1, F5 to FF never appear.
This means we cannot encode UTF-8 values with 5 or 6 octets (or even some with 4 that aren't within the above range)?
Section 12:
Restricted the range of characters to 0000-10FFFF (the UTF-16
accessible range).
Looking at the previous RFC confirms this...they reduced the range of characters.
Section 10:
Another security issue occurs when encoding to UTF-8: the ISO/IEC
10646 description of UTF-8 allows encoding character numbers up to
U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore
a risk of buffer overflow if the range of character numbers is not
explicitly limited to U+10FFFF or if buffer sizing doesn't take into
account the possibility of 5- and 6-byte sequences.
So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?
Thanks in advance.
They are no Unicode characters beyond 10FFFF, the BMP covers 0000 through FFFF.
UTF-8 is well-defined for 0-10FFFF.
Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is to encode upper and lower surrogate halves (which UTF-16 uses) or values above U+10FFFF, which aren't legal Unicode.
Note that the BMP ends at U+FFFF.
I would have to say no: Unicode code points are valid for the range [0, 0x10FFFF], and those map to 1-4 octets. So, if you did come across a 5- or 6-octet UTF-8 encoded code point, it's not a valid code point - there's certainly nothing assigned there. I am a little baffled as to why they're there in the ISO standard - I couldn't find an explanation.
It does make you wonder, however, if perhaps someday in the future, they would expand past U+10FFFF. 0x10FFFF allows for over a million characters, but there are a lot characters out there, and it would depend how much eventually gets encoded. (For sanity's sake, let's hope not, a million characters is a lot!) UTF-32 could handle more code points, and as you've discovered, UTF-8 could. It'd really be UTF-16 that's out of luck - more surrogate pairs would be needed somewhere in the spectrum of code points.