I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.
I have read this very interesting and helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering how one would go about identifying individual glyphs given a buffer of Unicode data.
My questions are:
How would I go about parsing a Unicode string, say UTF-8?
Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?
That is, if I interpreted the method of storage correctly.
This is all related to a text display system I am designing to work with OpenGL.
I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
To have to represent every string as an array of shorts would require a significant amount of storage, considering everything I need to display.
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
How would I go about parsing a Unicode string, say UTF-8?
I'm assuming that by "parsing", you mean converting to code points.
Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
If you do need to convert to code points (UTF-32), the steps are roughly as follows (a short code sketch appears after the list):
Check the first byte to see how many bytes are in the character.
Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
Use bit masking and shifting to convert the bytes to the code point.
Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
Increment your pointer by the sequence length and repeat for the next character.
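For illustration, here is a rough Python sketch of those steps. It is not a full validator (it only partially rejects overlong forms and does not check for surrogates); in real code you would simply call bytes.decode("utf-8").

    def decode_utf8(data):
        """Convert UTF-8 bytes to a list of code points (illustrative only)."""
        code_points = []
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                      # 0xxxxxxx: ASCII, 1 byte
                cp, length = b, 1
            elif 0xC2 <= b <= 0xDF:           # 110xxxxx: 2-byte sequence
                cp, length = b & 0x1F, 2
            elif 0xE0 <= b <= 0xEF:           # 1110xxxx: 3-byte sequence
                cp, length = b & 0x0F, 3
            elif 0xF0 <= b <= 0xF4:           # 11110xxx: 4-byte sequence
                cp, length = b & 0x07, 4
            else:
                raise ValueError("invalid lead byte at offset %d" % i)
            if i + length > len(data):
                raise ValueError("truncated sequence at offset %d" % i)
            for j in range(1, length):
                trail = data[i + j]
                if not 0x80 <= trail <= 0xBF:    # trailing bytes must be 10xxxxxx
                    raise ValueError("invalid continuation byte at offset %d" % (i + j))
                cp = (cp << 6) | (trail & 0x3F)  # mask off marker bits, shift in 6 bits
            code_points.append(cp)
            i += length                          # advance to the next character
        return code_points

    print([hex(cp) for cp in decode_utf8("héllo €".encode("utf-8"))])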
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
It's not. Unicode was originally intended to be a fixed-width 16-bit encoding. It was later decided that 65,536 characters weren't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.
If you want a fixed-width encoding, you need 21 bits. But there aren't many languages that have a 21-bit integer type, so in practice you need 32 bits.
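As a small illustration (in Python, using the standard utf-32-le codec), a fixed-width representation simply spends 4 bytes on every code point:

    s = "A€🙂"                          # three code points: U+0041, U+20AC, U+1F642
    raw = s.encode("utf-32-le")          # little-endian, no byte order mark
    print(len(s), len(raw))              # 3 code points -> 12 bytes
    print([hex(int.from_bytes(raw[i:i+4], "little")) for i in range(0, len(raw), 4)])
    # ['0x41', '0x20ac', '0x1f642']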
Well, I think this answers it:
http://en.wikipedia.org/wiki/UTF-8
Why it didn't show up the first time I went searching, I have no idea.
I have read the answer here. I understand what a byte stream is (a stream of 1s and 0s), what encoding is (a mapping from that stream to the characters that we humans understand) and what decoding is (the reverse mapping from characters to the corresponding bytes).
I still cannot reconcile the entire concept in my head. In RAM we already have everything as bytes. And I guess my interpreter is inherently using some decoding scheme to show me the characters corresponding to that byte stream. What then do we mean by having to encode before saving to disk? If my interpreter is using 'utf-8' to show me this text that I am typing, and I ask it to save this text using 'cp-1252', have I changed the underlying byte stream?
There are different ways to see it.
One way to see it: "Hello World!" could be encoded in different ways. What you want is the semantics of the string: a salutation and a target. But if you save it to a UTF-8 file, you will get different byte values than in a UTF-16LE file, or in an EBCDIC encoding.
E.g. 'A' is 65 in the ASCII encoding, but 193 in EBCDIC (used e.g. by many IBM mainframes), and the byte pair 0 65 (or 65 0) in a UTF-16 encoding. Etc. So when you save text, you need to specify the encoding (as expected by the reader, so it may depend on the file format).
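For instance, in Python you can see the same abstract character come out as different bytes (cp500 here is one of several EBCDIC code pages; others differ in details):

    for enc in ("ascii", "cp500", "utf-16-le", "utf-16-be"):
        print(enc, list("A".encode(enc)))
    # ascii      [65]
    # cp500      [193]
    # utf-16-le  [65, 0]
    # utf-16-be  [0, 65]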
Also, a language's libraries may not handle every encoding (for every function). Usually it is better to decode on input, using the standard libraries, and then encode again when the data should go out. That way you only need to implement encoding and decoding (e.g. for EBCDIC), and not sorting, upper/lower-case handling, is_digit, is_symbol, etc. for every encoding.
It is standard practice to separate semantics from raw values, or display from logic. If you are a control freak, you can do everything without decoding values, but it is error prone, and you would have to know so many details that few people want to know them.
Another example: do you need to know the raw values of your data/strings? You have a number: is it encoded little-endian or big-endian, or maybe as a float (e.g. in JavaScript)? We only need to know when we save the data (e.g. to send it over the internet we need a way to tell the byte ordering; or when saving images we state the ordering, so that on some machines the bytes will be swapped when reading a large number).
Or another example: you take a selfie. You have an image, but you can save it as a PNG file or a JPEG file: you will get very different files, with different values. But you know the encoding (fortunately, for such image files, the first bytes describe the format, followed by a little data about the encoding). For you it is enough to know that it is your image. But do you think the computer works directly on the bytes of the two formats interchangeably? Probably not. When you read the image, it is converted to a different encoding in memory (which you probably do not need to care about): often an RGB (or RGBA) format, but with how many bits per channel, or what colour rendering (from profiles), you do not know [JPEG stores it as YCC].
Python takes a stricter, semantic view: you do not know how Python will encode a string internally. It may use 8 bits (ASCII/Latin-1), 16 bits (UCS-2), or 32 bits (UTF-32). It chooses the internal encoding dynamically, according to the most efficient way to store that particular string. You can still get a code point for each character, and use the many string/character functions. Only when you encode a string do you get a fixed sequence of numbers. On the string side you really do not know how strings are represented in memory. So this keeps the two different parts of Unicode clearly separated: the semantic values (the description of all characters) and the encoding/decoding (how to represent those values in bytes).
When you are handling a string in Python, you should care only about the semantics. The implementation (and so the physical layout of the string in memory) is not your business, and Python can change it (it has changed it in the past).
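A rough way to observe this from the outside (the exact numbers vary by CPython version and platform, so treat them as illustrative only):

    import sys
    # CPython (PEP 393) picks the narrowest internal layout that fits, so the
    # per-character cost grows with the widest code point in the string.
    print(sys.getsizeof("aaaa"))             # 1 byte per character internally
    print(sys.getsizeof("aaa\u20ac"))        # 2 bytes per character
    print(sys.getsizeof("aaa\U0001F642"))    # 4 bytes per character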
But with your example:
You may not notice much of this, because of recent standardisation: ASCII has become practically the universal encoding for the most common Latin letters and symbols. Latin-1 is compatible with ASCII, just extending it from 7 bits to 8 bits. "Windows ANSI" (cp-1252) uses Latin-1 and adds characters in its unallocated parts. Unicode based its first 256 code points on Latin-1. So nowadays you may see each character with one fixed number (or see it as unavailable), but this was not always the rule, even in early Windows.
So your cp-1252 is compatible with UTF-8 for most of the characters you commonly use (but not all of them). If you use other encodings, you have to do much more transcoding (converting from one encoding to another). But usually you do this only when you save: you keep the internal encoding, and encode a copy to be saved.
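A minimal transcoding sketch in Python: decode on the way in, keep the abstract string internally, and encode only on the way out:

    data_cp1252 = "café".encode("cp1252")   # b'caf\xe9'  (é is one byte, 0xE9)
    text = data_cp1252.decode("cp1252")     # back to the abstract string
    data_utf8 = text.encode("utf-8")        # b'caf\xc3\xa9' (é is two bytes)
    print(data_cp1252, data_utf8)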
A byte is 8 bits, whether it is in RAM, on disk, or on the wire.
A bit is the "atom" of computer data. A byte is the "molecule", except that there is only one kind of byte.
A bit is the smallest unit of information in computers. It is usually said to represent 0 or 1, or OFF or ON.
Whether you "interpret" a byte as a number (0 to 255), a signed number (-128 to +127), an "ascii" character, like the characters I am typing, depends on what you (or the computer) does with the byte. Or a byte can be part of a bigger number, one that requires several bytes to represent.
Because there are too many "letters" or "characters" (especially in Chinese) to fit in a byte, there is the additional concept that a "character" may be composed of multiple bytes. UTF-8 is the main standard today. Giacomo discusses several less common encodings that say what "character" is represented by a byte (or bytes). Remember, each byte is composed of 8 bits.
English letters, digits, and some punctuation are represented (encoded) as bytes in the same way by ASCII, Latin-1, cp-1252, and UTF-8 (and some other encodings). But as soon as you get into European accented letters, the encodings diverge.
A common thing you may hear of is to represent one byte as two hexadecimal digits.
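A quick Python illustration of both points: the bytes shown as pairs of hex digits, identical for plain ASCII but diverging once accents appear (errors="replace" just substitutes '?' where ASCII can't represent a character):

    for enc in ("ascii", "latin-1", "cp1252", "utf-8"):
        print("%-8s" % enc, "résumé".encode(enc, errors="replace").hex(" "))
    # ascii    72 3f 73 75 6d 3f
    # latin-1  72 e9 73 75 6d e9
    # cp1252   72 e9 73 75 6d e9
    # utf-8    72 c3 a9 73 75 6d c3 a9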
I read some articles about Unicode and UTF-8.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12CA‘. U+12CA is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA to the disk directly. I think the reasons are:
Unicode is not a self-synchronizing code, so if
10 represents A
110 represents B
10110 represents C
then when I see 10110 on the disk, I can't tell whether it's A followed by B or just C.
Storing the raw code points would also use much more space than UTF-8 or UTF-16.
Am I right?
Read about Unicode, UTF-8 and the UTF-8 everywhere website.
There are more than a million Unicode code points (you mentioned 1,114,111...). So you need at least 21 bits to be able to distinguish all of them (since 2^21 > 1,114,111).
So you can store Unicode characters directly, if you represent each of them by a wide enough integral type. In practice, that type would be a 32-bit integer (because it is not convenient to handle 3-byte, i.e. 24-bit, integers). This is called UCS-4 (essentially UTF-32), and some systems and software already handle their Unicode strings in such a format.
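As a quick check in Python:

    print((0x10FFFF).bit_length())        # 21 -> a fixed-width cell needs 21 bits
    print(len("🙂".encode("utf-32-le")))  # 4  -> in practice, one 32-bit unit per code point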
Notice also that displaying Unicode strings is quite difficult, because of the variety of human languages (and also because Unicode has combining characters). Some need to be displayed right to left (Arabic, Hebrew, ...), others left to right (English, French, Spanish, German, Russian, ...), and some top to bottom (traditional Chinese, ...). A library displaying Unicode strings should be capable of displaying a string containing English, Chinese and Arabic words... Then you see that decoding UTF-8 is the easy part of displaying Unicode strings (and storing UCS-4 strings won't help much).
But since English is the dominant language in IT (for economic reasons), it is very often cheaper to keep strings in UTF-8 form. If most of the strings handled by your system are English (or in some other European language using the Latin alphabet), it is cheaper and takes less space to keep them in UTF-8.
I guess that if China becomes a dominant power in IT, things might change (or maybe not).
(I have no idea of the most common encoding used today on Chinese supercomputers or smartphones; I guess it is still UTF-8)
In practice, use a library (perhaps libunistring or Glib in C), to process UTF-8 strings and another one (e.g. pango and GTK in C) to display them. You'll find many Unicode related libraries in various programming languages.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly.
How do you write 12CA to a disk directly? It is a bigger value than a byte can hold, so you need to write at least two bytes. Do you write 12 followed by CA? You just encoded it in UTF-16BE. That's what an encoding is...a definition of how to write an abstract number as bytes.
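A small Python sketch of exactly that observation:

    cp = 0x12CA
    raw = cp.to_bytes(2, "big")                   # the "direct" write: b'\x12\xca'
    print(raw == "\u12ca".encode("utf-16-be"))    # True: you just used UTF-16BE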
Other reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
For good and specific reasons, Unicode doesn't specify any particular encoding. If it makes sense for your scenario, you can specify your own.
Because Unicode doesn't specify any serialization, there is no way to "directly" store Unicode, just like you can't "directly" store a mathematical number or a flow chart to implement a program you designed. The question isn't really well-defined.
There are a number of existing serialization formats (encodings) so it is very likely that it makes the most sense to use an existing one unless your requirements are significantly different than what any existing encoding provides; even then, is it really worth the cost?
A stream of bits is just a stream of bits. Conventionally, we chop them up into groups of 8 and call that a "byte" and the latter half of your question is really "if it's not a byte, how can you tell which bits belong to which symbol?" There are many ways to do that, but the common ones generally define a sequence of some particular length (8, 16, and 32 are often convenient for reasons of compatibility with bus width on modern computers etc) but again, if you really wanted to, you could come up with something different. Huffman trees come to mind as one way to implement a way to communicate a structure of variable length (and is used for precisely that in many compression algorithms).
Consider one situation: even if you could save raw Unicode binary to disk and close the file, what happens when you open the file again? It's just a bunch of bytes; you don't know how many bytes each character occupied. If '🥶' (U+1F976) and 'A' are the contents of your file and you guess the wrong width when reading it back, the bytes get regrouped and you decode entirely different characters than the ones you saved.
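A sketch of that ambiguity in Python (the three raw code-point bytes of '🥶' followed by the single byte for 'A', read back assuming the wrong width):

    raw = (0x1F976).to_bytes(3, "big") + (0x41).to_bytes(1, "big")   # 01 f9 76 41
    # Read back assuming 2 bytes per character: two unrelated "characters".
    print([hex(int.from_bytes(raw[i:i+2], "big")) for i in range(0, len(raw), 2)])
    # ['0x1f9', '0x7641']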
As I understand it, UTF-8 is a superset of ASCII, and therefore includes the control characters which are not used to represent printable characters.
My question is: Are there any bytes (of the 256 different) that are not used by the UTF-8 encoding?
I wondered if you could convert/encode UTF-8 text to binary.
Here is my thought process:
I have no idea how the UTF-8 text encoding works or how it can encode so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)), but I know that ASCII text is valid UTF-8, so the control characters (bytes 0-31) are not used differently by the UTF-8 encoding, yet at the same time they are not used for displaying characters, right??
So of the 256 different bytes, only ~230 are used. For a 1000-byte Unicode text there are then only 230^1000 different possible texts, right?
If that is true, you could convert it to binary data which is smaller than 1000 bytes.
Wolfram alpha: 1000 bytes of unicode (assumption unicode only uses 230 of the 256 different bytes) --> 496 bytes
Yes, it is possible to devise encodings which are more space-efficient than UTF-8, but you have to weigh the advantages against the disadvantages.
For example, if your primary target is (say) ISO-8859-1, you could map the character codes 0xA0-0xFF to themselves, and only use 0x80-0x9F to select an extension map somewhat vaguely like UTF-8 uses (nearly) all of 0x80-0xFF to encode sequences which can represent all of Unicode > 0x80. You would gain a significant advantage when the majority of your text does not use characters in the ranges 0x80-0x9F or 0x0100-0x1EFFFFFFFF, but correspondingly lose when this is not the case.
Or you could require the user to keep a state variable which tells you which range of characters is currently selected, and have each byte in the stream act as an index into that range. This has significant disadvantages, but used to be how these things were done way back when (witness e.g. ISO-2022).
The original UTF-8 draft before Ken Thompson and Rob Pike famously intervened was probably also somewhat more space-efficient than the final specification, but the changes they introduced had some very attractive properties, trading (I assume) some space efficiency for lack of contextual ambiguity.
I would urge you to read the Wikipedia article about UTF-8 to understand the design desiderata -- the spec is possible to grasp in just a few minutes, although you might want to reserve an hour or more to follow footnotes etc. (The Thompson anecdote is currently footnote #7.)
All in all, unless you are working on space travel or some similarly efficiency-intensive application, losing UTF-8 compatibility is probably not worth the time you have already spent, and you should stop now.
0xF8-0xFF are not valid anywhere in UTF-8, and some other bytes are not valid at certain positions.
The lead byte of a character indicates the number of bytes used to encode the character, and each continuation byte has 10 as its two high order bits. This is so that you can pick any byte within the text and find the start of the character containing it. If you don't mind losing this ability, you could certainly come up with more efficient encoding.
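A tiny Python sketch of that property: continuation bytes all match the bit pattern 10xxxxxx, so from any offset you can back up to the start of the current character.

    def char_start(data, i):
        # Continuation bytes have their top two bits set to 10.
        while i > 0 and (data[i] & 0xC0) == 0x80:
            i -= 1
        return i

    data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
    print(char_start(data, 2))       # offset 2 is inside 'é'; its lead byte is at 1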
You have to distinguish characters, Unicode, and the UTF-8 encoding:
In encodings like ASCII, Latin-1, etc. there is a one-to-one relation between one character and one number between 0 and 255, so a character can be encoded by exactly one byte (e.g. "A" -> 65). To decode such a text you need to know which encoding was used (does 65 really mean "A"?).
To overcome this situation, Unicode assigns every character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (a so-called Unicode code point). As this range does not fit into one byte, the question is how to encode these numbers. There are several ways to do this; the simplest would be to always use 4 bytes per character. As this consumes a lot of space, a more efficient encoding is UTF-8: here every Unicode code point (= character) is encoded in one, two, three or four bytes (not all byte values from 0 to 255 are used for this encoding, but that is only a technical detail).
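For example, in Python:

    # UTF-8 spends 1 to 4 bytes per code point, depending on its value.
    for ch in ("A", "é", "€", "🙂"):
        print(ch, "U+%04X" % ord(ch), len(ch.encode("utf-8")), "bytes")
    # A U+0041 1 bytes
    # é U+00E9 2 bytes
    # € U+20AC 3 bytes
    # 🙂 U+1F642 4 bytes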
I have to write some code working with character encoding. Is there a good introduction to the subject to get me started?
First posted at What every developer should know about character encoding.
If you write code that touches a text file, you probably need this.
Let's start off with two key items:
1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. That's because in the vast majority of encoding schemes the first 127 byte values map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or code pages) developed early on. But we ended up with most everyone using a standard set of code pages where the first 127 values were identical in all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
And then for Asia, because 256 characters were not enough, some of the range 128 – 255 was used for what were called DBCS (double-byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS code page.
And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now let's look at UTF-8, because as the standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matches the standard code pages for the first 127 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs of the Asian code pages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 values to start a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes; those then each lead to a third byte, and those three bytes define the character. The original design went up to 6-byte sequences (the current standard caps sequences at 4 bytes). Using this MBCS (multi-byte character set) scheme you can write the equivalent of every Unicode character – and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that their text editor, using the code page for their region, inserts as a single byte – something like ß – and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte – an error.
Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
Now, what about when the code you are writing will read or write a file? We are not talking about binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
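For example, in Python (the file name here is just a placeholder):

    # Name the encoding explicitly instead of relying on the platform default.
    with open("notes.txt", "w", encoding="utf-8") as f:     # hypothetical file
        f.write("Grüße, København\n")
    with open("notes.txt", "r", encoding="utf-8") as f:
        print(f.read())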
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the endian preamble – the byte order mark – to the file.)
Ok, you're reading and writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what those encoders in the Java and .NET runtimes are designed to do: you read in and get Unicode, you write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type for characters. You probably have this right, because languages today don't give you much choice in the matter.
Point 5 – (For developers on languages that have been around a while) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
From Joel Spolsky
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
As usual, Wikipedia is a good starting point: http://en.wikipedia.org/wiki/Character_encoding
I have a very basic introduction on my blog, which also includes links to in-depth resources if you REALLY want to dig into the subject matter.
http://www.dotnetnoob.com/2011/12/introduction-to-character-encoding.html
I have several files that are in several different languages. I thought they were all encoded UTF-8, but now I'm not so sure. Some characters look fine, some do not. Is there a way that I can break out the strings and try to identify the character sets? Perhaps split on white space then identify each word? Finally, is there an easy way to translate characters from one set to UTF-8?
If you don't know the character set for sure, you can basically only guess. utf8::valid might help you with that, but you can't really know for certain. If you know that, when it isn't Unicode, it must be one specific character set (like Latin-1), you're lucky. If you have no idea, you're screwed. In any case, you should always assume the whole file is in the same character set, unless otherwise specified. You will lose your sanity if you don't.
As for your question of how to convert between character sets: Encode is there to do that for you.
Determining whether a file is probably UTF-8 or not should be pretty easy. Determining the encoding if it is not UTF-8 would be very difficult in general.
If the file is encoded with UTF-8, the high bits of each byte should follow a pattern. If a character is one byte, its high bit will be cleared (zero). Otherwise, an n byte character (where n is 2–4) will have the high n bits of the first byte set to one, followed by a single zero bit. The following n - 1 bytes should all have the highest bit set and the second-highest bit cleared.
If all the bytes in your file follow these rules, it's probably encoded with UTF-8. I say probably, because anyone can invent a new encoding that happens to follow the same rules, deliberately or by chance, but interprets the codes differently.
Note that a file encoded with US-ASCII will follow these rules, but the high bit of every byte is zero. It's okay to treat such a file as UTF-8, since they are compatible in this range. Otherwise, it's some other encoding, and there's not an inherent test to distinguish the encoding. You'll have to use some contextual knowledge to guess.
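A rough Python sketch of that check (in practice you would just try data.decode("utf-8") and catch UnicodeDecodeError, which also rejects overlong forms and surrogates that this simple pattern test lets through):

    def looks_like_utf8(data):
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:            # 0xxxxxxx: single byte
                n = 0
            elif b >> 5 == 0b110:   # 110xxxxx: expect 1 continuation byte
                n = 1
            elif b >> 4 == 0b1110:  # 1110xxxx: expect 2 continuation bytes
                n = 2
            elif b >> 3 == 0b11110: # 11110xxx: expect 3 continuation bytes
                n = 3
            else:
                return False
            if i + n >= len(data):
                return False        # truncated sequence
            if any(data[i + k] >> 6 != 0b10 for k in range(1, n + 1)):
                return False        # continuation bytes must be 10xxxxxx
            i += n + 1
        return True

    print(looks_like_utf8("naïve".encode("utf-8")))    # True
    print(looks_like_utf8("naïve".encode("latin-1")))  # False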
Take a look at iconv
http://www.gnu.org/software/libiconv/
Text::Iconv