XML and Unicode specifications: what’s a legal character?

My manager asked me to explain why I called jdom’s checkCharacterData before passing my string to an XMLStreamWriter, so I referred to the XML spec and then got confused.
XML 1.0 and XML 1.1 say that a valid XML character is “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.” That sounds stupid: tab, carriage return, and line feed are legal characters of Unicode. Then there’s the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF,” which was modified in XML 1.1 to refer to U+0000 – U+10FFFF excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there’s the Note that says authors are “discouraged” from using the compatibility characters including some characters that are already excluded by the BNF.
Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?

Question: What is/was a legal Unicode character?
The Unicode Glossary defines it thus:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)
NUL is a codepoint, and it falls under the definition of "abstract character" so it is a character by sense 2 above.
Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?
NUL has been a control character from early versions.
Appendix D contains a list of changes.
It says in Table D-2 that there have been 65 control characters from Version 1 through Version 3 without change.
Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
V1.0 V1.1 V2.0 V2.1 V3.0
...
Controls 65 65 65 65 65
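The figure of 65 can be sanity-checked against a Unicode character database. A minimal sketch, assuming Python's standard `unicodedata` module; its data reflects whatever Unicode version ships with the interpreter rather than Version 3.0, and the helper name `control_code_points` is made up for illustration:

```python
import sys
import unicodedata

def control_code_points():
    """Return every code point whose General_Category is Cc (control)."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Cc"]

controls = control_code_points()
print(len(controls))       # 65: U+0000-U+001F, U+007F, U+0080-U+009F
print(0x0000 in controls)  # True: NUL is one of the controls
```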
And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Writing specifications that are both complete and succinct is hard. When the text disagrees with the BNF, trust the BNF.

The use of the word “character” is intentionally fuzzy in the Unicode standard, but mostly it is used in a technical sense: a code point designated as an assigned character code point. This does not completely coincide with the intuitive concept of character. For example, the intuitive character that consists of letter i with macron and grave accent does not exist as a code point; in Unicode, it can only be represented as a sequence of two or three code points. As another example, the so-called control characters are not characters in the intuitive sense.
When other standards and specifications refer to “Unicode characters,” they refer to code points designated as assigned character code points. The set of Unicode characters varies by Unicode standard version, since new code points are assigned. Technically, the UnicodeData.txt file (at ftp://ftp.unicode.org/Public/UNIDATA/) indicates which code points are characters.
U+0000, conventionally denoted by NUL, has been a Unicode character since the beginning.
The XML specifications are inexact in many ways as regards characters, as you have observed. But the essential definition is the BNF production for “Char” and the statement “XML processors MUST accept any character in the range specified for Char.” This means that in XML specifications, the concept of character is broader than Unicode character. The ranges in the production contain unassigned code points, actually a huge number of them.
The comment to the “Char” production in XML specifications is best ignored. It is very confusing and even incorrect. The “Char” production simply refers to a set of Unicode code points (different sets in different versions of XML). The set includes code points that you should never use in character data, as well as code points that should be avoided for various reasons. But such rules are at a level different from the formal rules of XML and requirements on XML implementations.
When selecting or writing a routine for checking character data, it depends on the application and purpose what should be accepted and what should be done with code points that fail the test. Even surrogate code points might be processed in some way instead of being just discarded; they may well appear due to confusions with encodings (or e.g. when a Java string has been naively taken as a string of Unicode characters – it is as such just a sequence of 16-bit code units).

I would ignore the verbiage and just focus on the definitions:
XML 1.0:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
XML 1.1:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
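The two Char productions translate directly into range checks. The following is only a sketch of that translation, not JDOM's or any XML library's implementation; the function names are hypothetical:

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def is_xml11_char(cp: int) -> bool:
    """True if the code point matches the XML 1.1 Char production."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def first_illegal_xml10(text: str):
    """Return (index, code point) of the first non-Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index, ord(ch)
    return None

print(first_illegal_xml10("ok\x00"))  # (2, 0): NUL is excluded
print(is_xml10_char(0x1FFFE))         # True: discouraged, but legal
```

Note that the "discouraged" ranges pass these checks; whether to reject, escape, or strip them is an application decision, not something the Char production settles.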

It sounds stupid because it is stupid. The First Edition of XML (1998) read "the legal graphic characters of Unicode." For whatever reason, the word "graphic" was removed from the Second Edition of 2000, perhaps because it is inaccurate: XML allows many characters that are not graphic characters.
The definition in the Char production is indeed the right place to look.

Related

How do you determine the byte width of a UTF-16 character?

What are the rules for reading a UTF-16 byte stream, to determine how many bytes a character takes up? I've read the standards, but based on empirical observations of real-world UTF-16 encoded streams, it looks like there are certain cases where the standards don't hold true (or there's an aspect of the standard that I'm missing).
From reading the UTF-16 standard https://www.rfc-editor.org/rfc/rfc2781:
Value of leading 2 bytes | Resulting character length (bytes)
0x0000-0xD7FF | 2
0xD800-0xDBFF | 4
0xDC00-0xDFFF | Invalid sequence (RFC2781 2.2.2)
0xE000-0xFFFF | 2
In practice, this appears to hold true, for some cases at least. Using an ad-hoc SQL script (SQL Server 2019; UTF-16 collation), but also verified with an online decoder:
Character | Unicode Name | ISO 10646 | UTF-16 Encoding (hexadecimal, big endian) | Size (bytes)
A | LATIN CAPITAL LETTER A | U+0041 | 00 41 | 2
Б | CYRILLIC CAPITAL LETTER BE | U+0411 | 04 11 | 2
ァ | KATAKANA LETTER SMALL A | U+30A1 | 30 A1 | 2
🐰 | RABBIT FACE | U+1F430 | D8 3D DC 30 | 4
However when encoding the following ISO 10646 character into UTF-16, it appears to be 4 bytes, but reading the leading 2 bytes appears to give no indication that it will be this long:
Character | Unicode Name | UTF-16 Encoding (hexadecimal, big endian) | Size (bytes)
⚕️ | STAFF OF AESCULAPIUS | 26 95 FE 0F | 4
Whilst I'd rather keep my question software-agnostic; the following SQL will reproduce this behaviour on Microsoft SQL Server 2019, with default collation and default language. (Note that SQL Server is little endian).
select cast(N'⚕️' as varbinary);
----------
0x95260FFE
Quite simply, how/why do you read 0x2695 and think "I'll need to read in the next word for this character."? Why doesn't this appear to align with the published UTF-16 standard?
The formal definition of all of this is called an "extended grapheme cluster," and it's defined in the Unicode Text Segmentation report. As Joachim Sauer notes, it's wise to be careful with the term "character" in Unicode.
Code points are what the "U+...." syntax refers to, and each one attempts to capture a "unit" of written language, for example "an acute accent." But what a reader would think of as a character (for example "an e with an acute accent") is a "grapheme cluster" and is made up of one or more code points. What is ultimately rendered to the screen is a "glyph" which is both context- and font-dependent.
Grapheme clusters in Unicode are actually more subtle than this. Unicode attempts to define them in a "neutral" way. (There's really no such thing as "neutral" when thinking about languages, but Unicode does try.) For example, in Slovak, ch, dz, and dž are each one letter, but are considered two grapheme clusters in Unicode. (Try to count the "letters" in a Slovak word. There are words that contain the letter dz and other words that have the letter d followed by the letter z. Oh human writing systems. I love you so much.)
The mapping of grapheme clusters to glyphs is also complex. For example, in Arabic, the single glyph لا is actually two grapheme clusters, ل (ARABIC LETTER LAM) followed by ا (ARABIC LETTER ALEF). If you use your mouse to select the glyph, you'll see there are two selectable pieces, and if you copy and paste them to another window you'll see them transform into their component parts. (Just to make things even more complicated, Unicode also defines a single code point for the ligature, ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: ﻻ. If you try to select part of that one, you'll find you can't. It's one "character.")
Your specific case is a bit more special. The Variation Selector predates emoji, and is mostly designed to handle different variations of Han (Chinese) characters. However, as with every Unicode feature, it eventually has come to be used primarily for emoji. VS-16 is the "emoji" presentation form. The most famous example is the red heart, which is HEAVY BLACK HEART ❤, followed by VS-16: ❤️.
Similarly, your character U+2695 STAFF OF AESCULAPIUS is a single code point, and it looks like this by default (text style): ⚕. When you add VS-16, it is rendered in "emoji style": ⚕️. In some ways it's the same "character." Or is it? Depends on what you're using it for.
Emoji style is typically a bit larger and centered in its block, sometimes adding color. Notice where the period after the staff is drawn in each case (there are no extra spaces in the second example; the glyph is just much wider).
There are other combining systems as well:
U+0031: 1
U+0031 U+20e3: 1⃣ (+ COMBINING ENCLOSING KEYCAP, default text style)
U+0031 U+20e3 U+fe0f: 1⃣️ (+ VARIATION SELECTOR-16, emoji style)
All of these predate emoji. Modern emoji is dramatically more complicated, and includes several combining systems of its own (including two that are currently just used for flags).
But luckily, to your actual question, your wife is correct, and you can generally just consume all trailing code points that are marked "combining" to form an extended grapheme cluster, and that is kind of a "character" for some broad enough definition of "character."
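As a rough illustration of "consume the trailing combining code points", here is a sketch that attaches every mark (General_Category Mn, Mc, or Me) to the code point before it. It is an approximation only, not the extended grapheme cluster algorithm from the Unicode Text Segmentation report, and `rough_clusters` is a name invented for this example:

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")  # nonspacing, spacing, enclosing marks

def rough_clusters(text):
    """Group each base code point with the marks that follow it."""
    clusters = []
    for ch in text:
        if unicodedata.category(ch) in MARK_CATEGORIES and clusters:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

staff = "\u2695\ufe0f"         # STAFF OF AESCULAPIUS + VARIATION SELECTOR-16
keycap = "1\u20e3\ufe0f"       # DIGIT ONE + ENCLOSING KEYCAP + VS-16
print(len(staff), len(rough_clusters(staff)))    # 2 1
print(len(keycap), len(rough_clusters(keycap)))  # 3 1
```

This deliberately ignores zero-width-joiner sequences, flag pairs, Hangul syllables, and the other rules a real segmenter handles; for anything user-facing, a library that implements the full report is the safer choice.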
All of your assertions are completely correct; your interpretation of the UTF-16 standards is correct and complete.
In your empirical observations however, you've assumed that you only have one character. In actuality, you've run into a nuance of the Unicode implementation. Your "character" is actually two (albeit technically, not visually): U+2695 "STAFF OF AESCULAPIUS" followed by U+FE0F "VARIATION SELECTOR-16". The second character is a non-spacing mark which combines with the base character for the purpose of rendering a character variant.
This results in the byte sequence 26 95 FE 0F; however, as you note, neither of the two words falls within the surrogate range that UTF-16 reserves for its 4-byte form. That is because neither of them requires the 4-byte form. They're simply two discrete Unicode characters, each encoded in 2 bytes.
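To make the length rule concrete, here is a small sketch that walks big-endian UTF-16 bytes and reports each code point with its encoded length. The function name is hypothetical, and real code would normally just use the platform's decoder:

```python
def decode_utf16_be(data: bytes):
    """Yield (code_point, byte_length) pairs from big-endian UTF-16."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "big")
        if 0xD800 <= unit <= 0xDBFF:          # high surrogate: 4-byte form
            if i + 4 > len(data):
                raise ValueError("truncated surrogate pair")
            low = int.from_bytes(data[i + 2:i + 4], "big")
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("high surrogate without low surrogate")
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00), 4
            i += 4
        elif 0xDC00 <= unit <= 0xDFFF:        # low surrogate cannot lead
            raise ValueError("unpaired low surrogate")
        else:                                 # everything else: 2 bytes
            yield unit, 2
            i += 2

for cp, size in decode_utf16_be(bytes.fromhex("2695FE0F")):
    print(f"U+{cp:04X} {size}")   # U+2695 2, then U+FE0F 2
for cp, size in decode_utf16_be(bytes.fromhex("D83DDC30")):
    print(f"U+{cp:04X} {size}")   # U+1F430 4
```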
From 7.9 Combining Marks in ISO 10646: Universal Coded Character Set (UCS):
Combining marks are a special class of characters in the Unicode Standard that are
intended to combine with a preceding character, called their base.
Combining marks usually have a visible glyphic form... a combining mark may interact graphically with neighbouring characters in various ways.
http://unicode.org/L2/L2010/10038-fcd10646-main.pdf
To explain why I'm answering my own question; I had my SO question all ready to fire off. My wife came into my office; after looking over my shoulder she whispered into my ear, "You know combination characters are a thing, right?". I've however still asked the question and answered it myself, in case my wife's sweet nothings help another member of the community.

Unicode version of ABNF?

I want to write a grammar for a file format whose content can contain characters other than US-ASCII ones. Since I am used to ABNF, I try to use it...
However, neither RFC 5234 nor RFC 7405 is very friendly towards people who DO NOT use US-ASCII.
In fact, I'm looking for an ABNF version (and possibly some basic rules as well) which is character oriented rather than byte oriented; the only thing which RFC 5234 has to say about this is in section 2.4:
2.4. External Encodings
External representations of terminal value characters will vary
according to constraints in the storage or transmission environment.
Hence, the same ABNF-based grammar may have multiple external
encodings, such as one for a 7-bit US-ASCII environment, another for
a binary octet environment, and still a different one when 16-bit
Unicode is used. Encoding details are beyond the scope of ABNF,
although Appendix B provides definitions for a 7-bit US-ASCII
environment as has been common to much of the Internet.
By separating external encoding from the syntax, it is intended that
alternate encoding environments can be used for the same syntax.
That doesn't really clarify matters.
Is there a version of ABNF somewhere which is code point oriented rather than byte oriented?
Refer to section 2.3 of RFC 5234, which says:
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
Unicode is just the set of non-negative integers U+0000 through U+10FFFF minus the surrogate range D800-DFFF and there are various RFCs that use ABNF accordingly. An example is RFC 3987.
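Reading ABNF terminals as plain integers means a rule made of ranges can be checked against code points rather than bytes. A minimal sketch, using only the first three alternatives of the ucschar rule from RFC 3987 (the full rule has more); the helper names are made up for this example:

```python
# Partial transcription of RFC 3987's ucschar rule:
#   ucschar = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF / ...
# Only the first three alternatives are listed here.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]

def matches_rule(code_point, ranges):
    """True if the integer falls inside any of the rule's ranges."""
    return any(low <= code_point <= high for low, high in ranges)

def all_match(text, ranges):
    """Apply the rule to each code point of the text, not to its bytes."""
    return all(matches_rule(ord(ch), ranges) for ch in text)

print(all_match("\u00e9\u3042", UCSCHAR_RANGES))  # True
print(all_match("A", UCSCHAR_RANGES))             # False: below %xA0
```

The external encoding (UTF-8, UTF-16, and so on) is decided elsewhere, which is the separation section 2.4 of RFC 5234 describes.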
If the ABNF you're writing is intended for human reading, then I'd say just use the normal syntax and refer to code points instead of bytes. You could take a look at various language specifications that allow Unicode in source text, e.g. C#, Java, PowerShell, etc. They all have a grammar, and they all have to define Unicode characters somewhere (e.g. for identifiers).
E.g. the PowerShell grammar has lines like this:
double-quote-character:
       " (U+0022)
       Left double quotation mark (U+201C)
       Right double quotation mark (U+201D)
       Double low-9 quotation mark (U+201E)
Or in the Java specification:
UnicodeInputCharacter:
       UnicodeEscape
       RawInputCharacter
UnicodeEscape:
       \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
       u
       UnicodeMarker u
RawInputCharacter:
       any Unicode character
HexDigit: one of
       0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
The \, u, and hexadecimal digits here are all ASCII characters.
Note that there is surrounding text explaining the intent – which is always better than just dumping a heap of grammar on someone.
If it's for automatic parser generation, you may be better off finding a tool that allows you to specify a grammar both in Unicode and ABNF-like form and publish that instead. People writing parsers should be expected to understand either, though.

What is the meaning of the indicator XXX in the Unicode charts

Consider the Unicode chart for C1 Controls and Latin-1 Supplement in the Unicode Charts. If a character has a glyph, it is shown; if it does not have a glyph, a special dotted line and symbolic marker or identifier is given. In this case, both 0080 and 0081 seem to have some "invalid marker", which I think is what "XXX" means. Is that what it means?
Secondly, what should be the behaviour of a Unicode aware string type that has a value stored into the string of value 0x80 (hex) or 128 (decimal)? Should it be converted to some other point, such as the mapping like this:
Byte Value 128 in many ANSI Codepages is the EURO marker.
Storing a 128 decimal value is equivalent to storing U+20AC ?
The magic "non orthogonality" I have encountered in a particular language or operating system API implementation of its MBCS and Unicode types, and Java's interesting handling, leads me to wonder, what is the real intended use of the U+0080 character? This reference link confuses me by showing that Java treats this character as a Euro symbol (ANSI codepage to Unicode one way friendliness) but that its name is <control> which is not anything I know how to deal with. Wikipedia says it's PAD here.
Can anyone help me? Did I skip a foundational concepts day at Unicode School? What am I missing?
Update: The block from 0080 to 009F is non-printable control characters. This much I know. What I wonder is what does the XXX mean and how am I to think of this character when I am processing Unicode data with this value in it?
According to the explanation in Ch. 17 (About the Code Charts) of the Unicode Standard, p. 573, by the “Dashed Box Convention”, characters that have no visible rendering as such “are represented by a square dashed box. This box surrounds a short mnemonic abbreviation of the character’s name.” The characters referred to in the questions are control characters, in the C1 Controls area.
The Unicode Standard says, in Ch. 16, p. 544, about C0 and C1 Controls: “The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.” And the abbreviations in the square dashed boxes reflect the meanings given in ISO/IEC 6429:1992.
Some code points in the C1 Controls area are not defined in ISO/IEC 6429:1992. For them, such as U+0080, the code chart has “XXX” in place of a mnemonic abbreviation. So this indicates that the Unicode standard does not refer to any meaning for those code points, beyond their being control characters with some abstract properties.
Thus, “XXX” does not mean “invalid”, but rather “completely undefined meaning”. The meaning of such code points can be defined by various standards or other conventions, as long as they are consistent with the general definitions—e.g., it would be incompatible to define U+0080 as a graphic character.
Such code points must not be replaced or omitted in any character-level processing; applications that actually change data may do whatever they want, but any general conversion routines, for example, must keep these code points (characters) intact. They must not be treated as malformed or invalid; but an application may treat them as undefined. By Unicode principles, it’s OK to be ignorant of a character, but not completely wrong about it.
This has nothing to do with the meaning of bytes like 0x80 in 8-bit codes like Windows-1252. But if you send e.g. data labeled as ISO-8859-1 encoded (where e.g. 0x80 is in principle U+0080) to a web browser, it will actually treat it as Windows-1252 encoded. The reason is that characters like U+0080 are practically never used in ISO-8859-1 data; occurrence of 0x80 in ISO-8859-1 labeled data is virtually always either windows-1252 mislabeled or messed-up data that cannot be meaningfully processed. So browsers take the practical route and treat ISO-8859-1 as windows-1252; this is being formalized in HTML5 and related specifications.
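The difference between the byte 0x80 and the code point U+0080 shows up as soon as the same byte is decoded under the two labels. A short sketch using Python's built-in codecs:

```python
import unicodedata

raw = b"\x80"

as_latin1 = raw.decode("iso-8859-1")  # byte 0x80 maps straight to U+0080
as_cp1252 = raw.decode("cp1252")      # Windows-1252 puts the euro sign here

print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))  # U+0080 Cc
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))      # U+20AC EURO SIGN
```

So "storing 128" is only equivalent to storing U+20AC if the bytes are declared (or assumed) to be Windows-1252; U+0080 itself stays an undefined C1 control.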

What's the purpose of the noncharacters U+FDD0 to U+FDEF?

U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work.
U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense.
But I can't figure out, and The Unicode Standard doesn't really explain, why the set of noncharacters includes some random block within "Arabic Presentation Forms-A". What are these for? (Besides the eye of the basilisk?)
OK the question is "what are they for" and "Why are they in the middle of the Arabic Presentation Forms".
There was a need for a block of 32 non-characters "to make additional codes available to programmers to use for internal processing purposes" http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=IWS-Chapter04a#4d3110c8
It was required that it be in the Basic Multilingual Plane (BMP), i.e. 0x0000 to 0xFFFF, so that each of them could be represented as a single 16-bit code unit in UTF-16.
There was a block of unused codepoints in the Arabic Presentation Forms block.
It had been agreed not to encode any more Arabic Presentation Forms, so these were never going to be used.
http://www.unicode.org/mail-arch/unicode-ml/y2001-m10/0014.html
Therefore it was agreed that these codepoints, which were never going to be used otherwise, would be designated noncharacters so they could be used internally by applications/programmers.
These noncharacters are for internal use by applications and should not be interchanged.
I tried to explain based on what is said in Unicode standard.
Unicode has 66 noncharacters. Each of the 17 planes has two, the last two code points of the plane (ending in FFFE and FFFF). The other 32 noncharacters are the contiguous block U+FDD0 to U+FDEF.
So total count
17*2 + 32 = 66
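The same count falls out of a brute-force check over the whole code space. A small sketch (the function name is just for illustration):

```python
def is_noncharacter(cp: int) -> bool:
    """True for the Unicode noncharacter code points."""
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # The low 16 bits are FFFE or FFFF: the last two code points of a plane.
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or last_two_of_plane

count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)  # 66
```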
Read the following text from chapter 16 of the Unicode Standard, which says that the range is in a seemingly random place for "historical reasons". I'm curious too, but I don't think there is any ambiguity.
For historical reasons, the range U+FDD0..U+FDEF is contained within the Arabic Presentation Forms-A block, but those noncharacters are not
"Arabic noncharacters" or "right-to-left noncharacters," and are not distinguished in any
other way from the other noncharacters, except in their code point values
U+FEFF is the BOM, and U+FFFE is its byte-swapped form. Because U+FFFE is a noncharacter, a process that finds U+FFFE as the first character can take it as a signal that either the text is in the opposite byte order or the file is not valid Unicode text. It is only a signal, not a standard mechanism: the cause can be either reversed bytes or invalid text.
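The byte-order signal can be sketched as a check on the first two bytes of a stream. This is a simplified illustration with a made-up function name; among other things it does not distinguish a UTF-32 little-endian BOM, which begins with the same two bytes (FF FE 00 00):

```python
def sniff_utf16_byte_order(data: bytes) -> str:
    """Guess UTF-16 byte order from a leading U+FEFF, if present."""
    head = data[:2]
    if head == b"\xfe\xff":
        return "big-endian"     # bytes read as U+FEFF in big-endian order
    if head == b"\xff\xfe":
        return "little-endian"  # would read as U+FFFE in big-endian order
    return "unknown"            # no BOM: the order must come from elsewhere

print(sniff_utf16_byte_order(b"\xfe\xff\x00A"))  # big-endian
print(sniff_utf16_byte_order(b"\xff\xfeA\x00"))  # little-endian
```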
In the Unicode section 3.2 clause C2 says
C2 A process shall not interpret a noncharacter code point as an abstract character.
The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
So as an application developer you are free to use these code points as you wish. They can serve as sentinels or delimiters, or maybe as basilisk characters, but they should not be interchanged.
Section 16.7 says
In effect, noncharacters can be thought of as application-internal private-use code points.
Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which
are assigned characters and which are intended for use in open interchange, subject to
interpretation by private agreement, noncharacters are permanently reserved (unassigned)
and have no interpretation whatsoever outside of their possible application-internal private uses
Again U+FFFF is not reserved as sentinel by Unicode standard but just given the typical use case. Read in section 16.7
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being
associated with the largest code unit values for particular Unicode encoding forms. In
UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆
U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆
This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For
example, they might be used to indicate the end of a list, to represent a value in an index
guaranteed to be higher than any valid character value, and so on
As mentioned here at xkcd, U+FDD0 is actually the Unicode character for the eye of a basilisk. For (obvious) reasons of personal safety however, the character is not rendered to the screen... :)

What's the difference between an "encoding," a "character set," and a "code page"?

I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.
I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a LOT -- even in that article. I think at least two of them are talking about the same thing.
I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.
A ‘character set’ is just what it says: a properly-specified list of distinct characters.
An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.
UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
The confusion comes about because most other well-known encodings (eg.: ISO-8859-1) started out as separate character sets. Then when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
A ‘code page’ is a term stemming from IBM, where it chose which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.
When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
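The footnote's point, one character set with more than one byte encoding, can be seen by encoding the same character both ways. A brief sketch, assuming Python's bundled `shift_jis` and `iso2022_jp` codecs stand in for the two encodings named above:

```python
ch = "\u3042"  # HIRAGANA LETTER A, a JIS X 0208 character

sjis = ch.encode("shift_jis")   # high-byte-based encoding
jis = ch.encode("iso2022_jp")   # escape-switching encoding

print(sjis.hex())  # 82a0
print(jis)         # begins with the escape sequence ESC $ B

# Different bytes, same character once decoded again.
print(sjis.decode("shift_jis") == jis.decode("iso2022_jp") == ch)  # True
```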
A Character Set is just that, a set of characters that can be used.
Each of these characters is mapped to an integer called code point.
How these code points are represented in memory is the encoding. An encoding is just a method to transform a code-point (U+0041 - Unicode code-point for the character 'A') into raw data (bits and bytes).
I thought Joel's article was pretty much spot on - it is the history behind the evolution of character sets and storage which has brought this about.
FWIW, in my oversimplistic view
Character Sets (ASCII, EBCDIC, UNICODE) would be the numeric representation of characters, independent of storage considerations
Encoding would relate to the efficient storage of characters, ANSI, UTF-7, UTF-8 etc, for file, across the wire etc
Code Page would be the 'kluge' needed when the demand for the addition of new characters (without wanting to increase storage capacity) meant that (certain) characters were only knowable in the additional context of a code page.
IMHO Wikipedia currently doesn't help things by defining code page as 'another name for character encoding'
and redirecting 'character set' to 'character encoding'
A character set is a set of characters, i.e. "glyphs" i.e. visual symbols representing units of communication. The letter a is a glyph and so is € (euro sign). Character sets usually map integers (codepoints) to each character, but it's the encoding that dictates the binary/byte-level representation of the character.
I'm a ruby programmer, so here are some examples to help you understand the concepts.
This reveals how the Unicode character set maps codepoints to characters, but not how each byte is stored. (ruby 1.9 defaults to Unicode strings.)
>> 'a'.codepoints.to_a
=> [97]
>> '€'.codepoints.to_a
=> [8364]
Since 8364 (base 10) is too large to fit in one byte, various encoding strategies exist to specify a translation from Unicode codepoints into one or many bytes. The UTF-8 encoding is probably the most popular of these encodings. (Wikipedia shows the UTF-8 encoding algorithm, if you want to delve into the implementation.) Note that the UTF-8 encoding only makes sense in the context of the Unicode character set.
The following reveals how the UTF-8 encoding stores each Unicode character as bytes (0 thru 255 in base-10). (Ruby 1.9's default encoding is UTF-8.)
>> 'a'.bytes.to_a
=> [97]
>> '€'.bytes.to_a
=> [226, 130, 172]
Here's the same thing in ISO-8859-15 character set:
>> 'a'.encode('iso-8859-15').codepoints.to_a
=> [97]
>> '€'.encode('iso-8859-15').codepoints.to_a
=> [164]
And the ISO-8859-15 encoding:
>> 'a'.encode('iso-8859-15').bytes.to_a
=> [97]
>> '€'.encode('iso-8859-15').bytes.to_a
=> [164]
Notice that the ISO-8859-15 codepoints match the byte representation.
Here's a blog entry that might be helpful: http://graysoftinc.com/character-encodings/what-is-a-character-encoding. Entries 1 thru 3 are good if you don't want to get too ruby-specific.
The chapter on Unicode in the book Advanced Perl Programming contains the best description of encodings, character sets and the other entities of Unicode that I've come across. Unfortunately I don't think it's available for free online.