I want a character which has different bit representation in different encoding. Like there could be some character whose codepoint is U+00AA. utf-8 should be representing it as 0x00 0xaa, but some other encoding say iso-latin-5 should be representing it as 0x01 0xab. I am learning about unicode and encoding and so want any such character to play around with it.
If you look at UTF-7 every code point is encoded with a different octet sequence to UTF-8.
Related
I have read Joel's article about encodings. As I understand in case of unicode:
unicode is a charater set - mapping between integer value and character
utf-8 is an encoding which is used for unicode integers to present them in binary view
What about iso-8859-1? Is it encoding or character set or both?
ISO 8859-1 (Latin-1) is a single-byte encoding. It represents the first 256 Unicode characters. So, as long as it is subset of Unicode character set, I suppose it could be treated as both encoding and character set.
What about iso-8859-1? Is it encoding or character set or both?
Historically, it was described as a coded character set: it defined both a set of characters, and a mapping of those characters to byte values — what we would today call an encoding, but it was not explicitly described in those terms.
When Unicode was created, it was designed to encompass (nearly) all characters in widely-used character sets, and hence it recast the byte stream defined by the ISO-8859-1 coded character set as an encoding of the wider Universal Character Set.
So if you are working in a modern Unicode environment you would consider ISO-8859-1 to be an encoding. But it can't really be said to be wrong to consider it also a character set.
(There are other encodings which are definitely not character sets: for example the UTFs, and multibyte encodings like Shift-JIS, which was itself defined as an encoding for the JIS X 0208 character set prior to Unicode's extend-and-embrace.)
I asked Google the question above and was sent to Difference between UTF-8 and UTF-16? which unfortunately doesn't answer the question.
From my understanding UTF-8 should be a subset of UTF-16 meaning: if my code uses UTF-16 and I hand in a UTF-8 encoded string everything should always be fine. The other way around (expecting UTF-8 and getting UTF-16) may cause problems.
Is that correct?
EDIT: To clarify why the linked SO question doesn't answer my question: My problem arose when trying to process a JSON string using WebClient.DownloadString, because the WebClient used the wrong encoding. The JSON I received from the request was encoded as UTF-8 and the question for me was: if I set webClient.Encoding = New System.Text.UnicodeEncoding (a.k.a UTF-16) would I be on the safe side, i.e. able to handle UTF-8 and UTF-16 request results, or should I use webClient.Encoding = New System.Text.UTF8Encoding?
It's not entirely clear what you mean by "compatible", so let's get some basics out of the way.
Unicode is the underlying concept, and UTF-16 and UTF-8 are two different ways to encode Unicode. They are obviously different -- otherwise, why would there be two different serialization formats?
Unicode by itself does not specify a serialization format. UTF-8 and UTF-16 are two alternative serialization formats.
There are several others, but these two are arguably the most widely used.
They are "compatible" in the sense that they can represent the same Unicode code points, but "incompatible" in that the representations are completely different, and irreconcileable.
There are two additional twists with UTF-16. Firstly, there are actually two different encodings, UTF-16LE and UTF-16BE. These differ in endianness. (UTF-8 is a byte encoding, so does not have endianness.) Secondly, legacy UTF-16 used to be restricted to 65,536 possible characters, which is less than Unicode currently contains. This is handled with surrogates, but really old and/or broken UTF-16 implementations (properly identified as UCS-2, not "real" UTF-16) do not support them.
For a bit of concretion, let's compare four different code points. We pick U+0041, U+00E5, U+201C, and U+1F4A9, as they illustrate the differences nicely.
U+0041 is a 7-bit character, so UTF-8 represents it simply with a single byte. U+00E5 is an 8-bit character, so UTF-8 needs to encode it. U+1F4A9 is outside the Basic Multilingual Plane, so UTF-16 represents it with a surrogate sequence. Finally, U+201C is none of the above.
Here are the representations of our candidate characters in UTF-8, UTF-16LE, and UTF-16BE.
Character
UTF-8
UTF-16LE
UTF-16BE
U+0041 (a)
0x41
0x41 0x00
0x00 0x41
U+00E5 (å)
0xC3 0xA5
0xE5 0x00
0x00 0xE5
U+201C (“)
0xE2 0x80 0x9C
0x1C 0x20
0x20 0x1C
U+1F4A9 (💩)
0xF0 0x9F 0x92 0xA9
0x3D 0xD8 0xA9 0xDC
0xD8 0x3D 0xDC 0xA9
To pick one obvious example, the UTF-8 encoding of U+00E5 would represent a completely different character if interpreted as UTF-16 (in UTF-16LE, it would be U+A5C3, and in UTF-16BE, U+C3A5.) Any UTF-8 sequence with an odd number of bytes is an incomplete 16-bit sequence. I suppose UTF-8 when interpreted as UTF-16 could also happen to encode an invalid surrogate sequence. Conversely, many of the UTF-16 codes are not valid UTF-8 sequences at all. So in this sense, UTF-8 and UTF-16 are completely and utterly incompatible.
These are byte values; in ASCII, 0x00 is the NUL character (sometimes represented as ^#), 0x41 is uppercase A, and 0xE5 is undefined; in e.g. Latin-1 it represents the character å (which is also conveniently U+00E5 in Unicode), but in KOI8-R it is the Cyrillic character Е (U+0415), etc.
Perhaps notice also how the last example requires a nontrivial transformation in UTF-16, too, using a pair of surrogate code points, in some sense superficially similarly to how UTF-8 encodes all multibyte code points.
In modern programming languages, your code should simply use Unicode, and let the language handle the nitty-gritty of encoding it in a way which is suitable for your platform and libraries. On a somewhat tangential note, see also http://utf8everywhere.org/
Exist some real double byte encoding (DBCS)?
Except UCS-2, UTF-16 of course.
I mean encoding, which save also ASCII as 2 bytes.
I mean with null bytes. (00 20 - space)
Please tell if it is used, if it is obsolete in standart / in use.
The same question for 4 bytes encoding, exists any(not UCS-4, UTF-32)?
Thanks.
There are certainly legacy character sets that use exactly two bytes for every character, but these generally do not encode ASCII characters at all, being intended to supplement a single-byte character set rather than replacing it. All of those that I am aware of exist to support Chinese, Japanese, and/or Korean ideograph characters.
There are plenty of legacy documents around that use such encodings, and I would not be surprised to find that in some places they are still used in new documents.
If you are trying to determine whether your software can ignore the existence of multi-byte character encodings other than the UTFs, then I'm afraid you won't come away with an easy answer. Of course your software can do so, in the same sense that it can ignore single-byte encodings other than ISO-8859-15, but only you can determine whether your program will adequately serve its purpose if it does so.
No, there are no double-byte character sets that satisfy your list of requirements. This is because designers back in the day used 7-bit ASCII as their starting point (good for compatibility), then put extra characters or multi-byte start codes in the upper half of the 256 byte values.
Similarly for quad-byte character sets, no serious standard before Unicode even tried to provision for more than 65536 characters.
To give one example, Chinese Big5 uses ASCII definitions for bytes 0x00 to 0x7F, uses 0x81 to 0xFF as a start byte for extended characters, and {0x40 to 0x7E, 0xA1 to 0xFE} for the second byte. This can code a maximum of 20067 different characters.
Is the ® symbol a 3-byte or 4-byte Unicode character? How can I tell?
Also known as \xAE
A Unicode character as such does not have any length in bytes. It is the character encoding that matters. You know the length of a character in bytes in a specific encoding from the definition of the encoding.
For example, in the ISO-8859-1 (ISO Larin 1) encoding, which encodes just a small subset of Unicode, including “®”, every character is 1 byte long.
In the UTF-16 encoding, all characters are either 2 or 4 bytes long, and characters in the range U+0000...U+FFFF, such as “®”, are 2 bytes
In the UTF-32 encoding, all characters are 4 bytes long.
In the UTF-8 encoding, characters take 1 to 4 bytes. A simple way to check this out is to use the Fileformat.info Character search (though this is not normative information, just a nice quick reference). E.g., the page about U+00AE shows the character in some encodings, including 0xC2 0xAE (that is, 2 bytes) in UTF-8.
It is unicode number U+00AE. It's in the range [0x80, 0x7ff] so in UTF-8 it'll be encoded as two bytes — the table at the top of the Wikipedia article explains in more detail*.
If you were using UTF-16 it'd also be two bytes, since no continuation is necessary.
(* my summary though: one of the features of UTF-8 is that you can jump midway into a byte stream and synchronise with the text without generating any spurious characters, because you can tell whether any byte is a continuation character without further context.
An unavoidable side effect is that only the 7-bit ASCII characters fit into a single byte and everything else takes multiple bytes. 0xae is sufficiently close to the 7-bit range to require only one extra byte. See Wikipedia for specifics.)
What is the difference between charsets and character encoding? When i say i am using utf-8 encoding then what will be my charset? Does it take unicode as charset by default?
UTF-8 is an encoding of the Unicode character set. Therefore if you're using UTF-8, the character set is Unicode, but you're not likely to have to specify this separately anywhere. The other main encoding of Unicode is UTF-16, which is not put into 8-bit byte streams because it contains zero bytes. If you are dealing with Unicode in a byte sequence, it is certainly encoded as UTF-8.
Other than Unicode, character sets are usually considered to have a single fixed encoding, and then terms like character set, charset, codepage, encoding are often used interchangeably, or depending on the vendor. This is sloppy but creates no runtime problems.
The only possible exceptions I can think of are East Asian: JIS and EUC originally defined multiple encodings for the same character set, but in practice today, each encoding is just treated separately.
Character set: definition which character has which numeric code point (ascii, jis, unicode)
Encoding: definition how the numeric code point is physically represented (utf, ucs, shiftjis)
According to Unicode terminology
ACR: Abstract Character Repertoire
= the set of characters to be encoded, for example, some alphabet or symbol set
CCS: Coded Character Set
= a mapping from an abstract character repertoire to a set of nonnegative integers
CEF: Character Encoding Form
= a mapping from a set of nonnegative integers that are elements of a
CCS to a set of sequences of particular code units of some specified width, such as 32-bit integers
CES: Character Encoding Scheme
= a reversible transformation from a set of sequences of code units (from one or more CEFs to a serialized sequence of bytes)
CM: Character Map
= a mapping from sequences of members of an abstract character repertoire to serialized sequences of bytes bridging all four levels in a single operation
TES: Transfer Encoding Syntax
= a reversible transform of encoded data, which may or may not contain textual data
Older protocols like MIME use "charset" when they really mean "character encoding scheme". Originally, different character encodings were though of as independent character repertoires instead of subsets of Unicode.
A character set defines the mapping between numbers and characters. Almost all char sets say 65 is A, and agree in general about mappings of numbers up to 127. But they might have different stands when it comes to numbers above 127.
There are a lot of character sets
EBCDIC
Double Byte Character Set
ANSI
Different OEM char sets
Unicode, an effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too.
When you say character encoding, you're talking about how a Unicode code point (a character) is stored internally.
In UTF-8 encoding, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero
There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language).
UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
This post is almost entirely based on Joel Spolsky's post on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Read it to get a better idea.
Charset is synonym for character encoding
Default encoding depends on the operating system and locale.
EDIT
http://www.w3.org/TR/REC-xml/#sec-TextDecl
http://www.w3.org/TR/REC-xml/#NT-EncodingDecl