"In language x strings are y - e.g. UTF-16 - by default" - what does that mean? - encoding

In many places we can read that, for example, "C# uses UTF-16 for its strings" (link). Technically, what does this mean?
My source file is just some text. Say I'm using Notepad++ to code a simple C# app; how the text is represented in bytes on disk, after I save the file, depends on N++, so that's probably not what people mean. Does that mean that:
The language specification requires/recommends that the compiler input be encoded as UTF-16?
The standard library functions are encoding-aware and treat the strings as UTF-16, for example String's operator [] (which returns the n-th character and not the n-th byte)?
Once the compiler produces an executable, the strings stored inside it are in UTF-16?
I've used C# as an example, but this question applies to any language of which one could say that it uses encoding Y for its strings.

"C# uses UTF-16 for its strings"
As far as I understand this concept, this is a simplification at best. A CLI runtime (such as the CLR) is required to store strings it loads from assemblies or that are generated at runtime in UTF-16 encoding in memory - or at least present them as such to the rest of the runtime and the application.
See CLI specification:
III.1.1.3 Character data type
A CLI char type occupies 2 bytes in memory and represents a Unicode code unit using UTF-16
encoding. For the purpose of stack operations char values are treated as unsigned 2-byte integers
(§III.1.1.1)
And C# specification:
4.2.4 The string type
Instances of the string class represent Unicode [being UTF-16 in .NET jargon] character strings.
I can't quickly find which source file encodings the C# compiler supports, but I'm quite sure you can have a source file stored in UTF-8 encoding, or even ASCII (or another non-Unicode code page).
The standard library functions are encoding-aware and treat the strings as UTF-16
No, the BCL just treats strings as strings, a wrapper around a char[] array. Only when transitioning outside the runtime, such as in a P/Invoke call, does the runtime "know" which platform functions to invoke and how to marshal a string to those functions. See for example C++/CLI Converting from System::String^ to std::string
Once the compiler produces an [assembly], the strings are stored inside it in UTF-16?
Yes.
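To make "strings are UTF-16 in the runtime" concrete, here is a minimal C# sketch (assuming any recent .NET runtime): Length and the indexer count UTF-16 code units rather than user-perceived characters, so a character outside the Basic Multilingual Plane occupies two chars (a surrogate pair), and Encoding.Unicode (UTF-16LE) simply dumps those code units as bytes.

using System;
using System.Text;

class Utf16Demo
{
    static void Main()
    {
        string s = "A\U0001D11E"; // "A" followed by U+1D11E MUSICAL SYMBOL G CLEF

        // Length counts UTF-16 code units, not code points: 1 + 2 = 3
        Console.WriteLine(s.Length);                     // 3

        // The indexer returns UTF-16 code units (chars), so the clef appears
        // as a surrogate pair: 0xD834 0xDD1E
        Console.WriteLine((int)s[1]);                    // 55348 (0xD834)
        Console.WriteLine((int)s[2]);                    // 56606 (0xDD1E)

        // Encoding.Unicode is UTF-16LE; the bytes are just the code units
        // written out little-endian, two bytes each.
        byte[] utf16 = Encoding.Unicode.GetBytes(s);
        Console.WriteLine(BitConverter.ToString(utf16)); // 41-00-34-D8-1E-DD
    }
}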

Let's take a look at the C/C++ char type. It is 8 bits long (1 byte). This means it can store 256 different values. Now let's think about what a single-byte code page (character set) actually is. It is something like a map: the values from 0 to 255 are mapped to symbols. Such a code page usually contains two alphabets (Cyrillic and Latin, for example) plus special symbols. There is not enough room within that 256-value limit to also hold Greek or Chinese letters.
Now let's see what UTF-8 is. It is an encoding that stores some characters in 1 byte and others in 2, 3 or 4 bytes. For example, if you type the word "word" in Notepad and save the file with UTF-8 encoding, the resulting file will be exactly 4 bytes long (ignoring the byte order mark Notepad may add), but if you type the word "дума", which is again 4 characters, it will take 8 bytes on your storage, because each Cyrillic letter is encoded as 2 bytes. So some letters are stored as 1 byte, others as 2 (or 3, or 4).
UTF-16 stores most characters in 2 bytes (and the rest in 4, as surrogate pairs), while UTF-32 always uses 4 bytes per character.
Let's see how this looks from the programming side. When you type characters in Notepad, they are stored in RAM (in some format that Notepad understands). When you save the file to disk, Notepad writes a sequence of bytes, and that sequence depends on the chosen encoding. When you read the file (with C# or some other language) you have to know its encoding; knowing it tells you how to interpret the byte sequence written on the disk.
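To see those sizes from code rather than from Notepad, here is a small C# sketch (the file name demo.txt is made up for illustration). It also shows the last point: when you read the bytes back, you have to tell the runtime which encoding they use.

using System;
using System.IO;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        // "word" is 4 ASCII letters -> 4 bytes in UTF-8;
        // "дума" is 4 Cyrillic letters, 2 bytes each -> 8 bytes in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetByteCount("word"));    // 4
        Console.WriteLine(Encoding.UTF8.GetByteCount("дума"));    // 8

        // In UTF-16 both words take 2 bytes per character -> 8 bytes each.
        Console.WriteLine(Encoding.Unicode.GetByteCount("word")); // 8
        Console.WriteLine(Encoding.Unicode.GetByteCount("дума")); // 8

        // Reading a file requires knowing the encoding the bytes were written in.
        File.WriteAllText("demo.txt", "дума", Encoding.UTF8);
        string back = File.ReadAllText("demo.txt", Encoding.UTF8);
        Console.WriteLine(back); // дума
    }
}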


Understanding encoding schemes

I cannot understand some key elements of encoding:
1. Is ASCII only a character set, or does it also have an encoding scheme/algorithm?
2. Do other Windows code pages, such as Latin-1, have their own encoding algorithms?
3. Are UTF-7, UTF-8, UTF-16 and UTF-32 the only encoding algorithms?
4. Are the UTF algorithms used only with the Unicode character set?
5. Given the ASCII text "Hello World", if I want to convert it into Latin-1 or Big5, which encoding algorithms are used in this process? More specifically, do Latin-1/Big5 use their own encoding algorithms, or do I have to use a UTF algorithm?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
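To illustrate the combining-character point, here is a hedged C# sketch: "é" can be stored either precombined (U+00E9) or as 'e' plus a combining acute accent (U+0301), and Unicode normalization converts between the two forms.

using System;
using System.Text;

class CombiningDemo
{
    static void Main()
    {
        string precombined = "\u00E9";  // é as a single code point
        string combining   = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT

        Console.WriteLine(precombined == combining); // False (different code points)
        Console.WriteLine(precombined.Length);       // 1
        Console.WriteLine(combining.Length);         // 2

        // Normalization maps between the two representations.
        Console.WriteLine(combining.Normalize(NormalizationForm.FormC) == precombined); // True
        Console.WriteLine(precombined.Normalize(NormalizationForm.FormD) == combining); // True
    }
}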
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm behind how the creators came up with those particular character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now, like UCS-2. Each one has its pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit systems like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
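A sketch of that two-step conversion in C#: decode bytes from encoding A into a .NET string (the Unicode middleman, held as UTF-16), then encode that string into encoding B. EBCDIC and Windows-1250 are not in the default encoding set on .NET Core / .NET 5+, so the sketch assumes the System.Text.Encoding.CodePages package is referenced.

using System;
using System.Text;

class ConvertViaUnicode
{
    static void Main()
    {
        // Legacy code pages need this provider on .NET Core / .NET 5+
        // (it ships in the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding ebcdic  = Encoding.GetEncoding(37);   // IBM EBCDIC (US/Canada)
        Encoding win1250 = Encoding.GetEncoding(1250); // Windows-1250

        byte[] ebcdicBytes = { 0xC1 }; // 'A' in EBCDIC

        // Step 1: bytes in encoding A -> Unicode string (code points).
        string text = ebcdic.GetString(ebcdicBytes);
        Console.WriteLine(text); // A

        // Step 2: Unicode string -> bytes in encoding B.
        byte[] win1250Bytes = win1250.GetBytes(text);
        Console.WriteLine(win1250Bytes[0].ToString("X2")); // 41

        // Encoding.Convert does both steps in one call.
        byte[] direct = Encoding.Convert(ebcdic, win1250, ebcdicBytes);
        Console.WriteLine(direct[0].ToString("X2")); // 41
    }
}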
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
1. ASCII is both: it defines 128 characters (some of them control codes) and how they are mapped to the byte values 0 to 127.
2. Yes, each encoding may define its own set of characters and how they are mapped to bytes.
3. No, there are others as well: ASCII, ISO-8859-1, ...
4. Unicode uses a two-step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings; the second step differs. Unicode has the ambition to contain all characters, which means most characters are in the "Unicode set".
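A short C# illustration of that two-step mapping: the code point of a character is the same number no matter what, while the byte sequence differs per UTF encoding.

using System;
using System.Text;

class CodePointVsBytes
{
    static void Main()
    {
        string s = "€"; // EURO SIGN

        // Step 1: character -> code point (independent of any encoding).
        int codePoint = char.ConvertToUtf32(s, 0);
        Console.WriteLine($"U+{codePoint:X4}"); // U+20AC

        // Step 2: code point -> bytes, which differs per encoding.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // E2-82-AC
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // AC-20 (UTF-16LE)
        Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(s)));   // AC-20-00-00
    }
}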
Every character in the world has been assigned a Unicode value [numbered from 0 upwards]. It is a unique value. Now it is up to the individual how they want to use that Unicode value: they can use it directly, or use one of the known encoding schemes like UTF-8, UTF-16, etc. An encoding scheme maps that Unicode value to a specific bit sequence [which can vary from 1 byte to 4 bytes, or maybe 8 in the future if we get to know all the languages of the universe/aliens/multiverse] so that it can be uniquely identified within the encoding scheme.
For example, ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte per character, and for those 128 characters the bytes are identical to their UTF-8 representation. GSM-7 is another format, which uses 7 bits per character to encode a set of 128 characters drawn from the Unicode character list.
UTF-8:
It uses 1 byte for characters whose Unicode value is up to 127.
Beyond that it has its own way of representing the code point values: 2 bytes for Cyrillic, 3 bytes for Hindi (Devanagari) characters, and 4 bytes for code points above U+FFFF.
UTF-16:
It uses 2 bytes for characters whose Unicode value is up to 127, and it also uses 2 bytes for Cyrillic and Hindi characters.
Code points above U+FFFF take 4 bytes (a surrogate pair).
UTF-8 fixes the leading bits of each byte in a specific pattern [e.g. 110xxxxx 10xxxxxx] and the remaining bits carry the Unicode value, so each character gets a unique representation; UTF-16 instead uses reserved surrogate code units for values above U+FFFF.
The Wikipedia articles on UTF-8, UTF-16 and Unicode will make this clear.
I coded a UTF translator which converts incoming UTF-8 text, in any language, into its equivalent UTF-16 text.
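To make the bit-pattern description concrete, here is a hedged C# sketch that hand-encodes a code point in the two-byte UTF-8 range (110xxxxx 10xxxxxx) and checks the result against the library encoder. It is only a sketch: the one-, three- and four-byte cases are left out.

using System;
using System.Text;

class Utf8BitPattern
{
    // Hand-encode a code point in the range U+0080..U+07FF as two UTF-8 bytes:
    // byte 1 = 110xxxxx (top 5 bits), byte 2 = 10xxxxxx (low 6 bits).
    static byte[] EncodeTwoByte(int codePoint)
    {
        if (codePoint < 0x80 || codePoint > 0x7FF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        byte b1 = (byte)(0b1100_0000 | (codePoint >> 6));
        byte b2 = (byte)(0b1000_0000 | (codePoint & 0b0011_1111));
        return new[] { b1, b2 };
    }

    static void Main()
    {
        int de = 'д'; // CYRILLIC SMALL LETTER DE, U+0434

        byte[] manual  = EncodeTwoByte(de);
        byte[] library = Encoding.UTF8.GetBytes("д");

        Console.WriteLine(BitConverter.ToString(manual));  // D0-B4
        Console.WriteLine(BitConverter.ToString(library)); // D0-B4
    }
}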

Why can't we store Unicode directly?

I read some article about Unicode and UTF-8.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12CA’. U+12CA is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly. I think the reason is:
Unicode is not a self-synchronizing code, so if
10 represents A
110 represents B
10110 represents C
then when I see 10110 on the disk, we can't tell whether it's A followed by B, or just C.
Storing Unicode code points directly would use much more space than UTF-8 or UTF-16.
Am I right?
Read about Unicode and UTF-8, and see the UTF-8 Everywhere website.
There are more than a million Unicode code points (you mentioned 1,114,111...). So you need at least 21 bits to be able to distinguish all of them (since 2^21 = 2,097,152 > 1,114,111).
So you can store Unicode characters directly, if you represent each of them by a wide enough integral type. In practice, that type would be a 32-bit integer (because it is not convenient to handle 3-byte, i.e. 24-bit, integers). This is called UCS-4 (equivalently, UTF-32), and some systems or software already handle their Unicode strings in such a format.
Notice also that displaying Unicode strings is quite difficult, because of the variety of human languages (and also because Unicode has combining characters). Some need to be displayed right to left (Arabic, Hebrew, ...), others left to right (English, French, Spanish, German, Russian, ...), and some top to bottom (Chinese, ...). A library displaying Unicode strings should be capable of displaying a string containing English, Chinese and Arabic words... Then you see that decoding UTF-8 is the easy part of displaying Unicode strings (and storing UCS-4 strings won't help much).
But since English is the dominant language in IT (for economic reasons), it is very often cheaper to keep strings in UTF-8 form. If most of the strings handled by your system are English (or in some other European language using the Latin alphabet), it is cheaper and takes less space to keep them in UTF-8.
I guess that when China becomes a dominant power in IT, things might change (or maybe not).
(I have no idea of the most common encoding used today on Chinese supercomputers or smartphones; I guess it is still UTF-8)
In practice, use a library (perhaps libunistring or GLib in C) to process UTF-8 strings, and another one (e.g. Pango and GTK in C) to display them. You'll find many Unicode-related libraries in various programming languages.
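A small C# comparison of the space tradeoff described above, with Encoding.UTF32 standing in for UCS-4: mostly-Latin text is much smaller in UTF-8, while every character costs four bytes in UTF-32 regardless.

using System;
using System.Text;

class SpaceTradeoff
{
    static void Main()
    {
        string english = "Hello, world"; // 12 characters, all ASCII
        string russian = "Привет, мир";  // 11 characters, 9 of them Cyrillic

        Console.WriteLine(Encoding.UTF8.GetByteCount(english));  // 12
        Console.WriteLine(Encoding.UTF32.GetByteCount(english)); // 48

        Console.WriteLine(Encoding.UTF8.GetByteCount(russian));  // 20 (9 x 2 bytes + ", " at 1 byte each)
        Console.WriteLine(Encoding.UTF32.GetByteCount(russian)); // 44
    }
}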
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly.
How do you write 0x12CA to a disk directly? It is a bigger value than a byte can hold, so you need to write at least two bytes. Do you write 0x12 followed by 0xCA? Then you have just encoded it in UTF-16BE. That's what an encoding is... a definition of how to write an abstract number as bytes.
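A quick C# check of that point: writing 0x12 followed by 0xCA is exactly what UTF-16 big-endian produces for U+12CA, while other encodings write the same abstract number differently.

using System;
using System.Text;

class WriteCodePoint
{
    static void Main()
    {
        string wi = "\u12CA"; // ETHIOPIC SYLLABLE WI

        // "12 followed by CA" is the UTF-16BE encoding...
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(wi))); // 12-CA

        // ...while other encodings serialize the same code point differently.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(wi))); // CA-12 (UTF-16LE)
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(wi)));    // E1-8B-8A
    }
}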
Other reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
For good and specific reasons, Unicode doesn't specify any particular encoding. If it makes sense for your scenario, you can specify your own.
Because Unicode doesn't specify any serialization, there is no way to "directly" store Unicode, just like you can't "directly" store a mathematical number or a flow chart to implement a program you designed. The question isn't really well-defined.
There are a number of existing serialization formats (encodings) so it is very likely that it makes the most sense to use an existing one unless your requirements are significantly different than what any existing encoding provides; even then, is it really worth the cost?
A stream of bits is just a stream of bits. Conventionally, we chop them up into groups of 8 and call each group a "byte", and the latter half of your question is really "if it's not a byte, how can you tell which bits belong to which symbol?" There are many ways to do that, but the common ones generally define units of some particular length (8, 16 and 32 bits are often convenient for compatibility with the bus width of modern computers, etc.); but again, if you really wanted to, you could come up with something different. Huffman coding comes to mind as one way to communicate variable-length structures (and it is used for precisely that purpose in many compression algorithms).
Consider one situation: even if you could save raw Unicode code point values directly to disk and close the file, what happens when you open the file again? It's just a bunch of bytes; you don't know how many bytes each character occupies. If '🥶' (U+1F976) and 'A' (U+0041) are the content of your file, U+1F976 needs at least three bytes while 'A' needs only one, so a reader that assumes a fixed one or two bytes per character will split the sequence in the wrong places and decode the wrong characters. An encoding is exactly the agreement that tells the reader where each character starts and ends.

unicode:characters_to_list seems doesn't work for utf8 list

I am trying to convert a UTF-8 string to a Unicode (code point) list with the Erlang library "unicode". My input data is the string "АБВ" (a Russian string, whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:
1> unicode:characters_to_list(<<208,144,208,145,208,146>>,utf8).
[1040,1041,1042]
it returns the correct value, but the following:
2> unicode:characters_to_list([208,144,208,145,208,146],utf8).
[208,144,208,145,208,146]
does not. Why does this happen? As I read in the documentation, the input data can be either a binary or a list of characters, so as far as I can tell I am doing everything right.
The signature of the function is unicode:characters_to_list(Data, InEncoding). It expects Data to be either a binary containing a string encoded in the InEncoding encoding, or a possibly deep list of characters (code points) and binaries in the InEncoding encoding. It returns a list of Unicode characters. Characters in Erlang are integers.
When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8) it correctly decodes the Unicode string (yes, the second is a no-op as long as Data is a list of integers). But when you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of 6 characters; since list elements are already treated as code points, the output is exactly the same list.
There is no byte type in Erlang, but you are assuming that unicode:characters_to_list/2 will accept a list of bytes and behave correctly.
To sum it up: there are two usual ways to represent a string in Erlang, binaries and lists of characters. unicode:characters_to_list(Data, InEncoding) takes a string Data in one of these representations (or a combination of them) in the InEncoding encoding and converts it to a list of Unicode code points.
If you have a list like [208,144,208,145,208,146], as in your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.
1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
The unicode module supports only Unicode and Latin-1. Thus, since the function expects code points of Unicode or Latin-1, characters_to_list does not need to do anything with the list in the case of a flat list of code points. However, the list may be deep (e.g. unicode:characters_to_list([[1040],1041,<<1042/utf8>>]).). That is the reason the list data type is supported for the Data argument.
<<208,144,208,145,208,146>> is a UTF-8 binary.
[208,144,208,145,208,146] is a list of bytes (not code points).
[1040,1041,1042] is a list of code points.
You are passing a list of bytes, but the function wants a list of chars or a binary.

Unicode byte vs code point (Python)

In http://nedbatchelder.com/text/unipain.html it is explained that:
In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points.
What's the difference between a code point and a byte? (I'm thinking not really in terms of Python per se, but just the concept in general.) Essentially it's just a bunch of bits, right? I think of a plain-old string literal as treating every 8 bits as a byte, which is handled as such, and we interpret the bytes as integers, which lets us map them to ASCII and the extended character sets. What's the difference between interpreting an integer as one of those characters and interpreting a "code point" as a Unicode character? It says Python's unicode object stores "code points". Isn't that just the same as plain old bytes, except possibly in the interpretation (where the bits of each Unicode character start and stop, as in UTF-8, for example)?
A code point is a number which acts as an identifier for a Unicode character. A code point itself cannot be stored; it must be encoded from Unicode into bytes using, e.g., UTF-16LE. While a certain byte or sequence of bytes can represent a specific code point in a given encoding, without the encoding information there is nothing to connect the code point to the bytes.
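Since the question is about the concept in general rather than Python specifically, here is the same distinction sketched in C# (System.Text.Rune is assumed to be available, i.e. .NET Core 3.0 or later): the bytes depend entirely on which encoding you pick, while the code points are the encoding-independent identifiers.

using System;
using System.Text;

class BytesVsCodePoints
{
    static void Main()
    {
        string s = "héllo\U0001F600"; // includes é and an emoji (U+1F600)

        // Bytes: different per encoding, meaningless without knowing which one.
        Console.WriteLine(Encoding.UTF8.GetBytes(s).Length);    // 10
        Console.WriteLine(Encoding.Unicode.GetBytes(s).Length); // 14 (UTF-16LE)

        // Code points: one number per character, independent of any encoding.
        foreach (Rune r in s.EnumerateRunes())
            Console.Write($"U+{r.Value:X4} ");
        // U+0068 U+00E9 U+006C U+006C U+006F U+1F600
    }
}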

What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.
The types are given on the website as:
char; // unsigned 8 bit UTF-8
wchar; // unsigned 16 bit UTF-16
dchar; // unsigned 32 bit UTF-32
Since we know that most of the Unicode Transformation (UTF) Format encodings represent characters with a variable bit-width, does this mean that a char in D can only contain the values that will fit in 8 bits, or does it expand in the machine's physical memory when you give it double byte characters? Perhaps there is some other possibility, like automatic casting into the next most appropriate type as you overload the variable?
Let's say, for example, I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or is it able to deal with Unicode characters more 'correctly', like in C#? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?
I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.
A single char or wchar represents a UTF code unit. This means that, on its own, a char can either represent an ASCII symbol (0-127) or be part of a UTF-8 sequence representing a Unicode character (code point). Only the dchar type can represent an entire Unicode character, because there are more than 65536 code points in Unicode.
Casting between the string types (string, wstring and dstring, which are simply dynamic arrays of the character types) will not automatically convert their contents to the corresponding UTF representation. In order to do this, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.
Further reading:
the String handling section in Wikipedia's entry on D
Text in D, by Daniel Keep