I have been attending a lecture on XML where it was written that "ISO-8859-1 is a Unicode format". That sounds wrong to me, but as I research it, I struggle to understand precisely what Unicode is.
Can you call ISO-8859-1 a Unicode format? What can you actually call Unicode?
ISO 8859-1 is not Unicode
ISO 8859-1 is also known as Latin-1. It is not directly a Unicode format.
However, it does have the unique privilege that its code points 0x00 .. 0xFF map one-to-one to the Unicode code points U+0000 .. U+00FF. So, the first 256 code points of Unicode, treated as 1-byte unsigned integers, map to ISO 8859-1.
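A minimal Python sketch of that one-to-one mapping (illustrative only):

    # Decode all 256 possible byte values as ISO 8859-1 (Latin-1).
    text = bytes(range(256)).decode("iso-8859-1")

    # Every byte value maps to the Unicode code point with the same number.
    assert all(ord(ch) == i for i, ch in enumerate(text))
    print(hex(ord(text[0xE9])))  # 0xe9 -> U+00E9, LATIN SMALL LETTER E WITH ACUTE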
Control characters
Peregring-lk observes that ISO 8859-1 does not define the control codes. The Unicode charts for U+0000..U+007F and U+0080..U+00FF suggest that the C0 controls found in positions U+0000..U+001F and U+007F come from ISO/IEC 6429:1992, and likewise the C1 controls found in positions U+0080..U+009F. Wikipedia's articles on the C0 and C1 controls suggest that the standard is ISO/IEC 2022 instead. Note that three of the C1 controls do not have a formal name.
In general parlance, the control code points of the ISO 8859-1 code set are assumed to be the C0 and C1 controls from ISO 6429 (or 2022).
ISO-8859-1 covers a subset of the Unicode character set, and that subset substantially overlaps with ASCII.
All ASCII text is valid UTF-8-encoded Unicode.
All the ISO 8859-1 (ISO Latin 1) characters with codes below 7F hex are ASCII compatible and UTF-8 compatible in one byte. The ligatures and characters with diacritics require multi-byte UTF-8 representations; they are precomposed characters that also have decomposed equivalents in Unicode.
Every single-byte UTF-8 character is an ASCII character.
UTF-8 also contains multi-byte sequences. Some of them are the collatable (i.e. sortable) decomposed equivalents of those precomposed characters, and others encode the characters of all the character sets beyond ASCII and ISO Latin 1.
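A short Python sketch of these relationships, using 'é' as an example character (illustrative only):

    # ASCII text is already valid UTF-8: the bytes are identical.
    assert "hello".encode("ascii") == "hello".encode("utf-8")

    # 'é' is one byte in ISO 8859-1 but two bytes in UTF-8.
    print("é".encode("iso-8859-1"))  # b'\xe9'
    print("é".encode("utf-8"))       # b'\xc3\xa9'

    # The precomposed U+00E9 also has a decomposed form: 'e' plus a combining acute accent.
    import unicodedata
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "é")])  # ['0x65', '0x301']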
No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.
Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.
No. ISO/IEC 8859-1 is older than Unicode; for example, you won't find € in it. Unicode is compatible with ISO 8859-1 up to a point. For the encoding of characters in Unicode, look at UCS / UTF-8 / UTF-16.
If you look at how characters are coded, you have layers something like this:
Abstract letters - the letters you are using
Code table - arranges the letters in some order (for example alphabetic) and gives each one a position
Code format - says how a position in the code table is represented (that is the UTF-8 or UTF-16 encoding)
Code scheme - if more than one code unit is used for a code position, in which order do they appear? (big endian vs. little endian in UTF-16; see the sketch after this list)
[plus the character encoding of control/markup instructions (e.g. < in XML)]
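For the byte-order point, a small Python sketch (the string "Az" is just an example):

    # The same code points serialized with different byte orders.
    s = "Az"
    print(s.encode("utf-16-be"))  # b'\x00A\x00z'  (big endian: high byte first)
    print(s.encode("utf-16-le"))  # b'A\x00z\x00'  (little endian: low byte first)
    print(s.encode("utf-16"))     # prepends a BOM (byte order mark) so readers can tell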
It depends on how you define "Unicode format."
I think most people would take it to mean an encoding capable of representing any codepoint in Unicode's range (U+0000 - U+10FFFF).
In that case, no, ISO 8859-1 is not a Unicode format.
However, some other definitions might be 'a character set that is a subset of the Unicode character set' or 'an encoding that can be considered to contain Unicode data (though not necessarily arbitrary Unicode data).' ISO 8859-1 meets both of these definitions.
Unicode is a number of things. It contains a character set, in which 'characters' are assigned codepoint values. It defines properties for characters and provides a database of characters and their properties. It defines many algorithms for doing various things with Unicode text data, such as ways of comparing strings, of dividing strings into grapheme clusters, words, etc. It defines a few special encodings that can encode any Unicode codepoint and have some other useful properties. It defines mappings between Unicode codepoints and codepoints of legacy character sets.
Here you can find a more complete answer: Unicode.org
Related
1) Can anyone explain to me why the ASCII and Latin-1 tables appear once in the chapter Character Set and once under Code page layout? I am fine if both terms are used interchangeably, but this is still inconsistent, or am I missing something?
2) Are ASCII and Latin-1 fully compatible? 0x00 to 0x1F don't seem to be defined in Latin-1; why?
A character set is a set of notional writing system concepts, such as capital Fraktur Z, line feed, or bicycle symbol. These include typographic style variations that have significant contexts for usage (e.g. mathematics) but not typical typeface (font) variations.
Each codepoint in a character set is an element in a mapping between the "character" and an integer.
A character encoding is an algorithm to convert between a codepoint in the character set and a sequence of one or more code units in the character encoding. Code units are integers. Integers wider than one byte have a byte order (endianness). A code unit is serialized to a sequence of bytes for streaming or storage. Character encoding functions often map both steps at once: between a codepoint and bytes.
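A minimal Python sketch of the two mappings, using '€' purely as an example:

    # Character set: a mapping between "characters" and integers (code points).
    cp = ord("€")            # 0x20AC in the Unicode character set
    assert chr(cp) == "€"

    # Character encoding: code point -> code units -> bytes.
    print("€".encode("utf-8"))      # three 8-bit code units: b'\xe2\x82\xac'
    print("€".encode("utf-16-be"))  # one 16-bit code unit 0x20AC, serialized as the bytes 0x20 0xAC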
Many character sets have exactly one character encoding, and many character encodings have single-byte code units. This makes them easy to present, with the concepts of codepoint, code unit and byte collapsed into one, and likewise character set and character encoding.
This all has a long history. Terminology, focus and standards have evolved. The context can be a clue as to what is meant. "Code page" is/was often used when identifying a particular extension to ASCII. In some original standards, only the differences or extensions were documented. Vendor libraries often filled in gaps in the character sets so they would be completely defined over 256 codepoints. When the Unicode character set was being developed, transcoding tables between Unicode and other character sets were accepted from vendors. This effectively standardized some character sets to 256 codepoints. (You can see the Unicode codepoints in hexadecimal in your tables.)
ASCII and Latin-1 (effectively the same as ISO 8859-1) are compatible in a limited sense:
The first 128 codepoints and code unit values are the same. ISO-8859-1 is the IANA preferred name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429. Nobody likes a mess like that. That's why the members of Unicode just took the character sets as they were used in the field when creating mappings between Unicode and other character sets.
I have read Joel's article about encodings. As I understand it, in the case of Unicode:
Unicode is a character set - a mapping between integer values and characters
UTF-8 is an encoding used to represent those Unicode integers in binary form
What about iso-8859-1? Is it encoding or character set or both?
ISO 8859-1 (Latin-1) is a single-byte encoding. It represents the first 256 Unicode characters. So, since it is a subset of the Unicode character set, I suppose it could be treated as both an encoding and a character set.
What about iso-8859-1? Is it encoding or character set or both?
Historically, it was described as a coded character set: it defined both a set of characters, and a mapping of those characters to byte values — what we would today call an encoding, but it was not explicitly described in those terms.
When Unicode was created, it was designed to encompass (nearly) all characters in widely-used character sets, and hence it recast the byte stream defined by the ISO-8859-1 coded character set as an encoding of the wider Universal Character Set.
So if you are working in a modern Unicode environment you would consider ISO-8859-1 to be an encoding. But it can't really be said to be wrong to consider it also a character set.
(There are other encodings which are definitely not character sets: for example the UTFs, and multibyte encodings like Shift-JIS, which was itself defined as an encoding for the JIS X 0208 character set prior to Unicode's extend-and-embrace.)
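A rough Python sketch of that modern view (the characters chosen are arbitrary examples):

    # ISO 8859-1 viewed as an encoding: only the first 256 Unicode characters fit.
    print("é".encode("iso-8859-1"))   # b'\xe9'
    try:
        "€".encode("iso-8859-1")      # U+20AC is outside the Latin-1 repertoire
    except UnicodeEncodeError as e:
        print("not representable:", e.reason)

    # Shift-JIS: a multi-byte encoding originally defined for JIS X 0208,
    # now usable as just another encoding of (a subset of) Unicode.
    data = "日本".encode("shift_jis")
    print(data, data.decode("shift_jis"))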
Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding.
It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?
In modern times, ASCII is now a subset of UTF-8, not its own scheme. UTF-8 is backwards compatible with ASCII.
Yes, except that UTF-8 is an encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (For some confusion, a UTF-16 scheme is called “Unicode” in Microsoft software.)
And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit, or so that 8-bit bytes used the extra bit for checking purposes (a parity bit) or for transfer control. But nowadays ASCII is used so that one ASCII character is encoded as one 8-bit byte with the first bit set to zero. This is the de facto standard encoding scheme and is implied in a large number of specifications, but strictly speaking it is not part of the ASCII standard.
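A small Python sketch of that de facto convention (the character 'A' is just an example):

    # Each ASCII character occupies 7 bits; stored in a byte, the top bit is 0.
    b = "A".encode("ascii")
    print(b, format(b[0], "08b"))   # b'A' 01000001

    # The same byte is valid ISO 8859-1 and valid UTF-8.
    assert b == "A".encode("iso-8859-1") == "A".encode("utf-8")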
Unicode and ASCII are both code point assignments plus encoding schemes.
Unicode (UTF-8) is a superset of ASCII, as UTF-8 is backward compatible with ASCII.
Conversion and representation (in binary/hexadecimal) of a string:
String := sequence of graphemes (a character is, roughly speaking, a subset of this notion).
A sequence of graphemes (characters) is mapped to code points (by the character set).
The code points are then encoded (converted) to binary/hex using an encoding scheme.
For the full range of graphemes that is UTF-8/UTF-32 (i.e. Unicode); for plain characters it is ASCII.
Unicode (UTF-8) supports 1,112,064 valid character code points (covering most graphemes from different languages; see the arithmetic sketched below).
ASCII supports 128 character code points (mostly English).
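Where the 1,112,064 figure comes from, as a minimal Python sketch of the arithmetic:

    # Unicode code space: 17 planes of 65,536 code points each (U+0000..U+10FFFF),
    # minus the 2,048 surrogate code points reserved for UTF-16.
    total = 17 * 65536            # 1,114,112 code points
    surrogates = 0xE000 - 0xD800  # 2,048 (U+D800..U+DFFF)
    print(total - surrogates)     # 1112064 valid Unicode scalar values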
What is the difference between charsets and character encodings? When I say I am using UTF-8 encoding, what will my charset be? Does it take Unicode as the charset by default?
UTF-8 is an encoding of the Unicode character set. Therefore, if you're using UTF-8, the character set is Unicode, but you're not likely to have to specify this separately anywhere. The other main encoding of Unicode is UTF-16, which is not put into 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte sequence, it is almost certainly encoded as UTF-8.
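To see the zero-byte point concretely, a small Python sketch (the string is just an example):

    # UTF-16 puts zero bytes into ordinary ASCII text; UTF-8 does not.
    print("Hi".encode("utf-16-le"))  # b'H\x00i\x00'
    print("Hi".encode("utf-8"))      # b'Hi'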
Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage and encoding are often used interchangeably, with the preferred term varying by vendor. This is sloppy but creates no runtime problems.
The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.
Character set: defines which character has which numeric code point (ASCII, JIS, Unicode)
Encoding: defines how the numeric code point is physically represented (UTF-8, UCS-2, Shift-JIS)
According to Unicode terminology:
ACR: Abstract Character Repertoire
= the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set
= a mapping from an abstract character repertoire to a set of nonnegative integers
CEF: Character Encoding Form
= a mapping from a set of nonnegative integers that are elements of a CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme
= a reversible transformation from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes
CM: Character Map
= a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes, bridging all four levels in a single operation
TES: Transfer Encoding Syntax
= a reversible transform of encoded data, which may or may not contain textual data
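Loosely mapping those terms onto concrete operations, a Python sketch (the string and the choice of UTF-16BE/UTF-8 are arbitrary examples; Python's encode() collapses CEF and CES into one step):

    s = "A€"                          # abstract characters drawn from a repertoire (ACR)
    cps = [hex(ord(c)) for c in s]    # CCS: characters -> code points
    units = s.encode("utf-16-be")     # CEF + CES: code points -> 16-bit code units -> big-endian bytes
    print(cps, units)                 # ['0x41', '0x20ac'] b'\x00A \xac'

    # CM (character map): the whole bridge, characters -> bytes, in one step.
    print(s.encode("utf-8"))          # b'A\xe2\x82\xac'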
Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were thought of as independent character repertoires rather than as subsets of Unicode.
A character set defines the mapping between numbers and characters. Almost all character sets say 65 is A, and they agree in general about the mappings of numbers up to 127, but they may disagree when it comes to numbers above 127.
There are a lot of character sets
EBCDIC
Double Byte Character Set
ANSI
Different OEM char sets
Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.
In UTF-8 encoding, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed sequences of up to 6 bytes, but Unicode's range only needs 4).
There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero.
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
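A small Python sketch of both behaviours (the sample text and the errors="replace" option are just one way to show it):

    # A legacy single-byte encoding cannot represent most code points;
    # with errors="replace" the unrepresentable ones become question marks.
    print("naïve 中文".encode("iso-8859-1", errors="replace"))  # b'na\xefve ??'

    # A UTF can store any code point correctly.
    print("naïve 中文".encode("utf-8"))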
This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.
Charset is a synonym for character encoding.
Default encoding depends on the operating system and locale.
EDIT
http://www.w3.org/TR/REC-xml/#sec-TextDecl
http://www.w3.org/TR/REC-xml/#NT-EncodingDecl
It seems the most confusing issue to me.
How is the beginning of a new character recognized?
How are the codepoints allocated?
Let's take Chinese characters for example. What range of codepoints is allocated to them, and why is it allocated that way? Is there any reason?
EDIT:
Please describe it in your own words, not by citation.
Or could you recommend a book that talks about Unicode systematically, one which you think has made it clear (that's the most important).
The Unicode Consortium is responsible for codepoint allocation. If you want a new character or block of characters allocated, you can apply there. See the proposal pipeline for examples.
Chapter 2 of the Unicode specification defines the general structure of Unicode, including what ranges are allocated for what kind of characters.
Take a look here for a general overview of Unicode that might be helpful: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses)
Unicode is a standard specified by the Unicode Consortium. The specification defines Unicode's character set, the Universal Character Set (UCS), and some encodings for those characters, the Unicode Transformation Formats UTF-7, UTF-8, UTF-16 and UTF-32.
How is the beginning of a new character recognized?
It depends on the encoding that's been used. UTF-32 is an encoding with a fixed code unit length of 32 bits per character, while UTF-7, UTF-8 and UTF-16 have a variable length: UTF-8 uses one to four 8-bit code units per character, and UTF-16 uses one 16-bit code unit for characters in the Basic Multilingual Plane and two (a surrogate pair) for the rest.
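A minimal Python sketch of those code-unit lengths (U+1D11E, a musical symbol, stands in for any character outside the Basic Multilingual Plane):

    # Encoded length in bytes of one character in each UTF.
    for ch in ("A", "€", "\U0001D11E"):
        print(hex(ord(ch)),
              len(ch.encode("utf-8")),      # 1, 3, 4 bytes
              len(ch.encode("utf-16-le")),  # 2, 2, 4 bytes (surrogate pair for the last)
              len(ch.encode("utf-32-le")))  # always 4 bytes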
How are the codepoints allocated? Let's take Chinese characters for example. What range of codepoints is allocated to them, and why is it allocated that way? Is there any reason?
The UCS is divided into planes, and each plane into blocks of related characters. The first blocks are Basic Latin (U+0000–U+007F, encoded like ASCII), Latin-1 Supplement (U+0080–U+00FF, encoded like ISO 8859-1) and so on. Chinese characters sit mainly in the CJK Unified Ideographs block (U+4E00–U+9FFF) and its extension blocks.
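A quick Python sketch for inspecting where a given character was allocated (the character '中' is just an example):

    import unicodedata

    # Print the code point and the official Unicode name of a character.
    ch = "中"
    print(hex(ord(ch)), unicodedata.name(ch))  # 0x4e2d CJK UNIFIED IDEOGRAPH-4E2D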
It is better to say Character Encoding instead of Codepage
A Character Encoding is a way to map a character to some data (and also vice versa!)
As Wikipedia says:
A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers
The most popular character encodings are ASCII, UTF-16 and UTF-8.
ASCII
The first code page widely used in computers. In ASCII, just one byte (in fact only 7 bits) is allocated for each character, so ASCII can have only a very limited set of characters (English letters, numbers, ...).
As I said, ASCII was widely used in old operating systems like MS-DOS. But ASCII is not dead; it is still used. When you have a txt file with 10 characters and it is 10 bytes, you most likely have an ASCII file!
UTF-16
In UTF-16, a character takes two or four bytes: one 16-bit code unit for characters in the Basic Multilingual Plane, or two code units (a surrogate pair) for the rest. The 16-bit units alone give 65,536 values; surrogate pairs extend UTF-16 to the full Unicode range.
Microsoft Windows uses UTF-16 internally.
UTF-8
UTF-8 is another popular way of encoding characters. It uses a variable number of bytes (1 byte to 4 bytes) per character. It is also compatible with ASCII because it uses 1 byte for ASCII characters.
Most Unix-based systems use UTF-8.
Programming languages do not depend on code pages, though a specific implementation of a programming language might not support code pages (like Turbo C++).
You can use any code page in modern programming languages. They also have some tools for converting between code pages.
There are different Unicode encodings, like UTF-7, UTF-8, ... You can read about them here (recommended!) and maybe, for more formal details, here.