Understanding Unicode: Surrogate Blocks, Noncharacters - unicode

I am trying to actually understand the unicode standard and was poking through the xml spec where it reads:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Now I have a couple of questions:
What are the surrogate blocks? Are they the UTF-16 codes that indicate a 4 byte code point?
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here? Isn't it the task of an encoding to hide those encoding-related details from the space the encoding maps from?
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.
Thanks for clarification!

What are the surrogate blocks?
Unicode codepoints in the U+D800 to U+DFFF range, inclusive, which are reserved for exclusive use as UTF-16 surrogates and are illegal in any other context.
Are they the UTF-16 codes that indicate a 4 byte code point?
Yes.
Does #xXXXX refer to the code point or to the UTF-16 encoded value here?
The actual Unicode codepoints. Considering that the definition of Char includes values > #xFFFF, which individual encoded UTF-16 values cannot exceed. UTFs are byte encoding schemes for codepoint values. The XML spec is written in terms of codepoints, not encodings. An XML document can be encoded in any charset specified in the "encoding" attribute of the XML prolog, for purposes of storage and transmission, but the actual XML content is processed in terms of unencoded codepoints.
If it refers to the code point and my understanding of the surrogate blocks is correct: Why are the surrogate blocks mentioned here?
The surrogate codepoints are reserved and not allowed to appear unencoded in any textual content. The Char definition is simply enforcing that rule.
Why are non-characters like "U+FFFE" defined as part of the unicode standard? As to my understanding, Byte-order detection (as well as handling flexible sized code words) is up to the encoding.
Because the encoding is not always known ahead of time, and may have to be detected dynamically. U+FFFE is used as a BOM marker to help facilitate that. Early versions of Unicode allowed U+FFFE to be used as either a BOM or an actual non-breaking space character within textual content. That lead to ambiguity at times. So newer versions of Unicode reserve U+FFFE strictly as a BOM only, and non-breaking spacing is handled by U+2060 WORD JOINER instead to avoid any ambiguity.
That being said, in the context of XML, it doesn't make sense to use U+FFFE in any textual content. The entire document is encoded in a particular charset, and any BOM used would have to appear before the XML prolog. The XML spec defines BOM handling and charset detection outside of the XML document itself. So that is why the Char definition excludes U+FFFE.
U+FFFF is reserved and is not intended to ever be used in real content in practice. So that is why the Char definition excludes it.
So basically the Char definition allows all Unicode codepoints minus restricted codepoints.

Related

Character set vs codepage layout

1) Can anyone explain me why the ASCII and Latin-1 table is once in the chapter Character Set and once under Code page layout? I am fine if both terms are interchangbly used, but this is still inconsistent, or am I missing something?
2) Are ASCII and Latin-1 fully compatible? 0x00 to 0x1F don't seem to be defined in Latin-1, why?
A character set is a set of notional writing system concepts, such as capital Fraktur Z, line feed, or bicycle symbol. These include typographic style variations that have significant contexts for usage (e.g. mathematics) but not typical typeface (font) variations.
Each codepoint in a character set is an element in a mapping between the "character" and an integer.
A character encoding is an algorithm to convert between a codepoint in the character set and a sequence of one or more code units in the character encoding. Code units are integers. Integers wider than one byte have a byte order (endianness). A code unit is serialized to a sequence of bytes for streaming or storage. Character encoding functions often map both steps at once: between a codepoint and bytes.
Many character sets have one character encoding. Many character encodings have single-byte code units. This makes them easy to present with the concepts of codepoint, code unit and byte collapses as well as character set and character encoding collapsed.
This all has a long history. Terminology, focus and standards have evolved. The context can be a clue as to what is meant. "Code page" is/was often used when identifying a particular extension to ASCII. In some original standards, only the differences or extensions were documented. Vendor libraries often filled in gaps in the character sets so they would be completely defined over 256 codepoints. When the Unicode character set was being developed, transcoding tables between Unicode and other character set were accepted from vendors. This effectively standardized some character set to 256 codepoints. (You can see the Unicode codepoint in hexadecimal in your tables.)
ASCII and Latin-1 (effectively the same as ISO 8859-1) are compatible in a limited sense:
The first 128 codepoints and code unit values are the same. ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. Nobody likes a mess like that. That's why the members of Unicode just took the characters sets as they were used in the field when creating mappings between Unicode and other character sets.

Understanding encoding schemes

I cannot understand some key elements of encoding:
Is ASCII only a character or it also has its encoding scheme algorithm ?
Does other windows code pages such as Latin1 have their own encoding algorithm ?
Are UTF7, 8, 16, 32 the only encoding algorithms ?
Does the UTF alghoritms are used only with the UNICODE set ?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process ? More specifically, does Latin1/Big5 use their own encoding alghoritm or I have to use a UTF alghoritm ?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
no, there are others as well ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in the world has been assigned a unicode value [ numbered from 0 to ...]. It is actually an unique value. Now, it depends on an individual that how he wants to use that unicode value. He can even use it directly or can use some known encoding schemes like utf8, utf16 etc. Encoding schemes map that unicode value into some specific bit sequence [ can vary from 1 byte to 4 bytes or may be 8 in future if we get to know about all the languages of universe/aliens/multiverse ] so that it can be uniquely identified in the encoding scheme.
For example ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte for every character which is equivalent to utf8 representation. GSM7 is one other format which uses 7 bit per character to encode 128 characters from unicode character list.
Utf8:
It uses 1 byte for characters whose unicode value is till 127.
Beyond this it has its own way of representing the unicode values.
Uses 2 byte for Cyrillic then 3 bytes for Hindi characters.
Utf16:
It uses 2 byte for characters whose unicode value is till 127.
and it also uses 2 byte for Cyrillic, Hindi characters.
All the utf encoding schemes fixes initial bits in specific pattern [ eg: 110|restbits] and rest bits [eg: initialbits|11001] takes the unicode value to make a unique representation.
Wikipedia on utf8, utf16, unicode will make it clear.
I coded an utf translator which converts incoming utf8 text across all languages into its equivalent utf16 text.

Unicode Encodings

I have a question as to how programs parse strings if they do not a priori know the encoding that is used.
As I understand it, the UTF-8 encoding stores ASII characters with 1 byte, and all other chracters with up to as many as 6 (I think it's 6) bytes. Thus, for example, two spaces would be stored in memory as 0x2020.
How then, would a program be able to determine the difference between this string and the string`0x2020 encoded using the UTF-16 encoding which corresponds to the single character which evidently is a character that appears similar to the symbol sometimes used to denote the adjoint of an operator in mathematics (I just looked that up here).
It seems as if the parser would always have to know the encoding of a string before hand. If so, how is this implemented in practice? Is there a byte preceeding each string which tells the parser what encoding is used or something?
In general, it is not possible to know for certain the exact encoding used based solely on the stream of bytes that can represent text. However, if there is a byte order mark somewhere, you can use it at least as a hint as to what encoding is being used.
But with no hints or some kind of contract/exchange of metadata between the producer and consumer of the text, you can't be 100% sure. You can try using a heuristic, but then you get these kinds of problems if you end up guessing wrong.
If you want to be really sure, set up some kind of protocol or contract between the producer and the consumer of the text so that the text and the encoding scheme is known. You can hardcode the encoding scheme (for example, your program may parse UTF-8 and only UTF-8), or ensure the producer of the text always prepend a byte order mark or specially designed header bytes to communicate the encoding scheme.
Does the language always store strings in a certain encoding so that
the display function could safely assume that the string was encoded,
say, using UTF-8?
In depends on the language.
In C#, yes. A char is defined by the language specification (8.2.1) as a UTF-16 code unit, and thus a string is always UTF-16. Just like Java.
In Ruby 1.9, a string is a byte array with an associated Encoding.
But in pre-Unicode languages like C (and badly-designed post-Unicode languages like PHP), a string is just a byte array with no encoding information. You have to rely on convention. It's a real interesting experience to write a program that uses both a library that assumes UTF-8 strings and another that assumes windows-1252 strings.
A question that's equally relevant to all languages is: How do you determine the encoding of a byte array that contains encoded text? There are several different approaches:
Encoding declarations.
In protocols that use MIME types (notably, SMTP and HTTP), you can declare Content-Type: text/html; charset=UTF-8. In HTML, you can use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or the newer <meta charset="UTF-8">. In XML, there's <?xml version="1.0" encoding="UTF-8"?>. In Python source code, there's # -*- coding: UTF-8 -*-.
Unfortunately, such declarations aren't always accurate. And they aren't available at all for locally-stored plain .txt files, so then a different approach must be used.
Byte-order mark (BOM)
Putting the special character U+FEFF at the beginning of a file lets you distinguish between the various UTF encodings.
But it's not usable for legacy encodings like ISO-8859-x or Windows-125x, and not always used with UTF-8.
Validation
Some encodings have strict rules about what makes a valid string. The best-known is UTF-8, with its rigid separation of leading/trailing bytes, prohibition of "overlong" encodings, etc. UTF-32 is even easier to recognize because the restriction of Unicode to 17 "planes" means that every code unit must have the form 00 {00-10} xx xx (or xx xx {00-10} 00 for little-endian).
So if text validates as being UTF-8 or UTF-32, you can safely assume that it is. There's a possibility of false positives, but it's very low.
However, this approach doesn't work well for UTF-16, where the false-positive rate is too high. (The only way for an even-length byte array to not be valid UTF-16 is to contain unpaired surrogates, or U+FFFE or U+FFFF.)
Statistical analysis
Use character frequency tables of various language/encoding combinations. This is the approach used by chardet (in combination with BOM and validation).
Falling back on a default encoding
When all else fails, assume ISO-8859-1, windows-1252, or Encoding.Default.

What is the difference between charsets and character encoding

What is the difference between charsets and character encoding? When i say i am using utf-8 encoding then what will be my charset? Does it take unicode as charset by default?
UTF-8 is an encoding of the Unicode character set. Therefore if you're using UTF-8, the character set is Unicode, but you're not likely to have to specify this separately anywhere. The other main encoding of Unicode is UTF-16, which is not put into 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte sequence, it is certainly encoded as UTF-8.
Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage, encoding are often used interchangeably, or depending on the vendor. This is sloppy but creates no runtime problems.
The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.
Character set: definition which character has which numeric code point (ascii, jis, unicode)
Encoding: definition how the numeric code point is physically represented (utf, ucs, shiftjis)
According to Unicode terminology
ACR: Abstract Character Repertoire
= the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set
= a mapping from an abstract character repertoire to a set of nonnegative integers
CEF: Character Encoding Form
= a mapping from a set of nonnegative integers that are elements of a
CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme
= a reversible transformation from a set of sequences of code units (from one or more CEFs to a serialized sequence of bytes)
CM: Character Map
= a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
TES: Transfer Encoding Syntax
= a reversible transform of encoded data, which may or may not contain textual data
Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were though of as independent character repertoires instead of subsets of Unicode.
A character set defines the mapping between numbers and characters. Almost all char sets say 65 is A, and agree in general about mappings of numbers up to 127. But they might have different stands when it comes to numbers above 127.
There are a lot of character sets
EBCDIC
Double Byte Character Set
ANSI
Different OEM char sets
Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.
In UTF-8 encoding, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.
Charset is synonym for character encoding
Default encoding depends on the operating system and locale.
EDIT
http://www.w3.org/TR/REC-xml/#sec-TextDecl
http://www.w3.org/TR/REC-xml/#NT-EncodingDecl

How are unicode allocated for different languages?

It seems the most confusing issue to me.
How is the beginning of a new character recognized?
How are the codepoints allocated?
Let's take Chinese character for example.
What range of codepoints are allocated to them,
and why is it thus allocated,any reason?
EDIT:
Plz describe it in your own words,not by citation.
Or could you recommend a book that talks about Unicode systematically,which you think have made it clear(it's the most important).
The Unicode Consortium is responsible for the codepoint allocation. If you have want a new character or a code page allocated, you can apply there. See the proposal pipeline for examples.
Chapter 2 of the Unicode specification defines the general structure of Unicode, including what ranges are allocated for what kind of characters.
Take a look here for a general overview of Unicode that might be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)
Unicode is a standard specified by the Unicode Consortium. The specification defines Unicode’s character set, the Universal Character Set (UCS), and some encodings to encode that characters, the Unicode Transformation Formats UTF-7, UTF-8, UTF-16 and UTF-32.
How is the beginning of a new character recognized?
It depends on the encoding that’s been used. UTF-16 and UTF-32 are encodings with fixed code word lengths (16 and 32 bits respectively) while UTF-7 and UTF-8 have a variable code word length (from 8 bits up to 32 bits) depending on the character point that is to be encoded.
How are the codepoints allocated? Let's take Chinese character for example. What range of codepoints are allocated to them, and why is it thus allocated,any reason?
The UCS is separated into so called character planes. The first one is Basic Latin (U+0000–U+007F, encoded like ASCII), the second is Latin-1 Supplement (U+0080–U+00FF, encoded like ISO 8859-1) and so on.
It is better to say Character Encoding instead of Codepage
A Character Encoding is a way to map some character to some data (and also vice-versa!)
As Wikipedia says:
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers
Most popular character encodings are ASCII,UTF-16 and UTF-8
ASCII
First code-page that widely used in computers. in ANSI just one byte is allocated for each character. So ANSI could have a very limited set of characters (English letters, Numbers,...)
As I said, ASCII used videly in old operating systems like MS-DOS. But ASCII is not dead and still used. When you have a txt file with 10 characters and it is 10 bytes, you have a ASCII file!
UTF-16
In UTF-16, Two bytes is allocated of a character. So we can have 65536 different characters in UTF-16 !
Microsoft Windows uses UTF-16 internally.
UTF-8
UTF-8 is another popular way for encoding characters. it uses variable-length bytes (1byte to 4bytes) for characters. It is also compatible with ASCII because uses 1byte for ASCII characters.
Most Unix based systems uses UTF-8
Programming languages do not depend on code-pages. Maybe a specific implementation of a programming language do not support codepages (like Turbo C++)
You can use any code-page in modern programming languages. They also have some tools for converting the code-pages.
There is different Unicode versions like Utf-7,Utf-8,... You can read about them here (recommanded!) and maybe for more formal details here