unicode:characters_to_list doesn't seem to work for a UTF-8 list

I am trying to convert a UTF-8 string to a Unicode (code point) list with the Erlang "unicode" library. My input data is the string "АБВ" (a Russian string, whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:
1> unicode:characters_to_list(<<208,144,208,145,208,146>>,utf8).
[1040,1041,1042]
it returns the correct value, but the following:
2> unicode:characters_to_list([208,144,208,145,208,146],utf8).
[208,144,208,145,208,146]
does not. Why does this happen? As I read in the documentation, the input data can be either a binary or a list of characters, so as far as I can tell I am doing everything right.

The signature of the function is unicode:characters_to_list(Data, InEncoding). It expects Data to be either a binary containing a string encoded in InEncoding, or a possibly deep list of characters (code points) and of binaries encoded in InEncoding. It returns a list of Unicode characters; characters in Erlang are just integers.
When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8), it correctly decodes the Unicode string (yes, the second call is a no-op, since Data is already a list of code points). But when you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of six characters; since integers in a list are already taken as code points, the output is exactly the same list.
There is no byte type in Erlang, but you assumed that unicode:characters_to_list/2 would accept a list of bytes and decode it correctly.
To sum it up: there are two usual ways to represent a string in Erlang, binaries (bitstrings) and lists of characters. unicode:characters_to_list(Data, InEncoding) takes a string Data in one of these representations (or a combination of them), interpreted according to InEncoding, and converts it to a list of Unicode code points.
If you have a list such as [208,144,208,145,208,146] from your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.
1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
The unicode module supports only Unicode and Latin-1. Thus, since the function expects code points of Unicode or Latin-1, characters_to_list does not need to do anything with the list in the case of a flat list of code points. However, the list may be deep (e.g. unicode:characters_to_list([[1040],1041,<<1042/utf8>>])), which is the reason the list data type is supported for the Data argument.
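For example, the deep list shown above, mixing code points and a UTF-8 binary, should flatten and decode to the same code point list (a quick shell check):
1> unicode:characters_to_list([[1040],1041,<<1042/utf8>>]).
[1040,1041,1042]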

<<208,144,208,145,208,146>> is a UTF-8 binary.
[208,144,208,145,208,146] is a list of bytes (not code points).
[1040,1041,1042] is a list of code points.
You are passing a list of bytes, but the function wants a list of chars or a binary.
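If you instead need the UTF-8 byte list back from a list of code points, you can go the other way with the unicode module (a quick shell check):
1> unicode:characters_to_binary([1040,1041,1042]).
<<208,144,208,145,208,146>>
2> binary_to_list(unicode:characters_to_binary([1040,1041,1042])).
[208,144,208,145,208,146]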

Related

"In language x strings are y - e.g. UTF-16 - by default" - what does that mean?

In many places we can read that, for example, "C# uses UTF-16 for its strings" (link). Technically, what does this mean?
My source file is just some text. Say I'm using Notepad++ to code a simple C# app; how the text is represented in bytes on disk, after I save the file, depends on N++, so that's probably not what people mean. Does that mean that:
The language specification requires/recommends that the compiler input be encoded as UTF-16?
The standard library functions are encoding-aware and treat the strings as UTF-16, for example String's operator [] (which returns the n-th character and not the n-th byte)?
Once the compiler produces an executable, the strings stored inside it are in UTF-16?
I've used C# as an example, but this question applies to any language of which one could say that it uses encoding Y for its strings.
"C# uses UTF-16 for its strings"
As far as I understand this concept, this is a simplification at best. A CLI runtime (such as the CLR) is required to store strings it loads from assemblies or that are generated at runtime in UTF-16 encoding in memory - or at least present them as such to the rest of the runtime and the application.
See CLI specification:
III.1.1.3 Character data type
A CLI char type occupies 2 bytes in memory and represents a Unicode code unit using UTF-16 encoding. For the purpose of stack operations char values are treated as unsigned 2-byte integers (§III.1.1.1).
And C# specification:
4.2.4 The string type
Instances of the string class represent Unicode [being UTF-16 in .NET jargon] character strings.
I can't quickly find which file encodings the C# compiler supports, but I'm quite sure you can have a source file stored in UTF-8 encoding, or even ASCII (or another non-Unicode code page).
The standard library functions are encoding-aware and treat the strings as UTF-16
No, the BCL just treats strings as strings, a wrapper around a char[] array. Only when transitioning outside the runtime, as in a P/Invoke call, does the runtime "know" which platform functions to invoke and how to marshal a string to those functions. See for example C++/CLI Converting from System::String^ to std::string
Once the compiler produces an [assembly], the strings are stored inside it in UTF-16?
Yes.
Let's take a look at the C/C++ char type. It is 8 bits long (1 byte), which means it can store 256 different values. Now let's think about what a single-byte character set (a code page) actually is. It is something like a map: values from 0 to 255 are mapped to symbols. Such code pages usually contain two alphabets (Cyrillic and Latin, for example) plus special symbols. There is not enough space within the 256-value limit to also hold Greek or Chinese letters.
Now let's see what UTF-8 is. It is an encoding that stores some symbols using one byte and others using two (or more) bytes. For example, if you type the word "word" in Notepad and save the file with UTF-8 encoding, the resulting file will be exactly 4 bytes long, but if you type the word "дума", which is again 4 symbols, it will use 8 bytes of storage. So some letters are stored as 1 byte, others as 2 or more.
UTF-16 stores every symbol in one or two 2-byte code units (so 2 or 4 bytes), and UTF-32 always uses 4 bytes per symbol.
Let's see how this looks from the programming side. When you type symbols in Notepad, they are stored in RAM (in some format that Notepad understands). When you save the file to disk, Notepad writes a sequence of bytes to the disk, and that sequence depends on the chosen encoding. When you read the file (with C# or some other language), you have to know its encoding; knowing it tells you how to interpret the sequence of bytes written on the disk.
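To make that last point concrete, here is a minimal Erlang sketch (Erlang being the language used in the other questions on this page), assuming a hypothetical file notes.txt that was saved as UTF-8; the bytes on disk only become text once you choose an encoding to interpret them with:
{ok, Bytes} = file:read_file("notes.txt"),
Text    = unicode:characters_to_list(Bytes, utf8),    %% correct only if the file really is UTF-8
AsLatin = unicode:characters_to_list(Bytes, latin1).  %% same bytes read as Latin-1: different text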

what's the definition of "encoding-agnostic"?

In lua 5.3 reference manual, we can see:
Lua is also encoding-agnostic; it makes no assumptions about the contents of a string.
I can't understand what the sentence says.
The same byte value in a string may represent different characters depending on the character encoding used for that string. For example, the same value \177 may represent ▒ in Code page 437 encoding or ± in Windows 1252 encoding.
Lua makes no assumption as to what the encoding of a given string is, and the ambiguity needs to be resolved at the script level; in other words, your script needs to know whether to treat a given byte sequence as a Windows 1252 string, a Code page 437 string, a UTF-8 string, or something else.
Essentially, a Lua string is a counted sequence of bytes. If you use a Lua string for binary data, the concept of character encodings is not relevant and does not interfere with the binary data. In that way, a string is encoding-agnostic.
There are functions in the standard string library that treat string values as text, that is, as a sequence of characters. But there is no text except encoded text: an encoding maps each member of a character set to a sequence of bytes, and a string holds the bytes for zero or more such encoded characters. To understand a string as text, you must know the character set and encoding. To use the string functions on it as text, the encoding should be compatible with os.setlocale().
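The same point can be sketched in Erlang (used elsewhere on this page): the byte 177 is a valid character in Latin-1 but is not, on its own, valid UTF-8, so its meaning really does depend on the declared encoding (your shell may show [177] rather than "±" depending on its printable-character settings):
1> unicode:characters_to_list(<<177>>, latin1).
"±"
2> unicode:characters_to_list(<<177>>, utf8).
{error,[],<<177>>}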

Unicode byte vs code point (Python)

In http://nedbatchelder.com/text/unipain.html it is explained that:
In Python 2, there are two different string data types. A plain-old
string literal gives you a "str" object, which stores bytes. If you
use a "u" prefix, you get a "unicode" object, which stores code
points.
What's the difference between a code point and a byte? (I'm thinking not really in terms of Python per se but just the concept in general.) Essentially it's just a bunch of bits, right? I think of a plain old string literal as treating each 8 bits as a byte, which is handled as such, and we interpret the bytes as integers, which allows us to map them to ASCII and the extended character sets. What's the difference between interpreting an integer as one of those characters and interpreting a "code point" as a Unicode character? It says Python's Unicode object stores "code points". Isn't that just the same as plain old bytes, except possibly in the interpretation (where the bits of each Unicode character start and stop, as in UTF-8, for example)?
A code point is a number which acts as an identifier for a Unicode character. A code point itself cannot be stored directly; it must be encoded into bytes using an encoding such as UTF-16LE. While a certain byte or sequence of bytes can represent a specific code point in a given encoding, without the encoding information there is nothing to connect the bytes to the code point.
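For a quick illustration in Erlang: the single code point 8482 (™) becomes different byte sequences depending on the chosen encoding (a sketch):
1> unicode:characters_to_binary([8482], unicode, utf8).
<<226,132,162>>
2> unicode:characters_to_binary([8482], unicode, utf32).
<<0,0,33,34>>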

yaws unicode symbols in {html, ...}

Why does {html, "доуч"++[1076,1086,1091,1095]} in a Yaws page give me the following error:
Yaws process died: {badarg,[{erlang,list_to_binary,
[[[[208,180,208,190,209,131,209,135,1076,
1086,1091,1095]],
...
"доуч" = [1076,1086,1091,1095] -> gives me exact match, but how yaws translate 2-byte per elem list in two times longer list with 1 byte per elem for "доуч", but doesnt do it for [1076,1086,1091,1095]. Is there some internal represintation of unicode data involed?
I want to output to the web pages lists like [1076,1086,1091,1095], but it crushed.
Erlang source files only supported the ISO Latin-1 charset at the time (UTF-8 source became the default in Erlang/OTP 17). The Erlang console can accept Unicode characters, but to enter them inside a source code file, you need to use this syntax:
K = "A weird K: \x{a740}".
See http://www.erlang.org/doc/apps/stdlib/unicode_usage.html for more info.
You have to do the following to make it work:
{html, "доуч"++ binary_to_list(unicode:characters_to_binary([1076,1086,1091,1095]))}
Why does it fail?
In a bit more detail, the list_to_binary call fails because it tries to convert each item in the list to a byte, which it cannot do because each value in [1076,1086,1091,1095] would take more than one byte.
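You can reproduce the failure directly in the shell (the exact error formatting may vary slightly between OTP releases):
1> list_to_binary([1076,1086,1091,1095]).
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([1076,1086,1091,1095])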
What is going on?
[1076,1086,1091,1095] is a pure Unicode (code point) representation of "доуч". Yaws tries to convert the string (list) into a binary directly using list_to_binary and thus fails. Since each Unicode character can take more than one byte, we need to convert the string into a byte sequence first. This can be done using:
unicode:characters_to_binary([1076,1086,1091,1095]).
<<208,180,208,190,209,131,209,135>>
This can now be safely converted back and forth between list and binary representations. See the unicode module for more details.
You can convert back to unicode as follows:
unicode:characters_to_list(<<208,180,208,190,209,131,209,135>>).
[1076,1086,1091,1095]
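Putting it together, a hedged sketch of what a Yaws handler could look like (the module name and the static text are made up for illustration): encode the code-point list to a UTF-8 binary before it reaches Yaws, so list_to_binary never sees values above 255.
-module(unicode_page).
-export([out/1]).

out(_Arg) ->
    Utf8 = unicode:characters_to_binary([1076,1086,1091,1095]),  %% <<208,180,208,190,209,131,209,135>>
    {html, ["static text: ", Utf8]}.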

Erlang, io_lib and unicode

I'm having a little trouble getting Erlang to give me a Unicode string.
Here's what works:
io:format("~ts~n", [<<226,132,162>>]).
™
ok
But instead of printing to the console, I want to assign it to a variable. So I thought:
T = lists:flatten(io_lib:format("~ts~n", [<<226,132,162>>])).
T.
[8482,10]
How can I get T in the io_lib example to contain the ™ symbol so I can write it to a network stream?
Instead of assigning the flattened version to a variable for sending over the network, can you rewrite the code that does the sending so that it accepts the binary in the first place and uses the formatted-write mechanism ~ts when writing to the socket?
That would also let you avoid the lists:flatten, which isn't needed for the built-in IO mechanisms.
It does contain the trademark symbol: as you can see here, 8482 is its code point. It isn't printed as ™ in the shell because, by default, the shell only prints as strings those lists whose elements are printable Latin-1 character codes. So [8482,10] is a Unicode string (a list of code points, essentially what UTF-32 encodes as one integer per character). If you want to convert it to a different encoding, use the unicode module.
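Continuing in the shell with T bound as above, a quick check (assuming default shell settings):
3> io:format("~ts", [T]).
™
ok
4> unicode:characters_to_binary(T).
<<226,132,162,10>>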
The first thing is knowing what you need to do; then you can adapt your code in whatever way fits best.
Erlang represents Unicode strings as lists of code points. Unicode code points are integers, not bytes. Since you can only send bytes over the network, things like Unicode strings need to be encoded into byte sequences by the sending side and decoded by the receiving side. UTF-8 is the most widely used encoding for Unicode strings, and that's what your binary is: the UTF-8 encoding of the Unicode string composed of the code point 8482.
What you get out of the io_lib:format call is the Erlang string representation of that code point plus the newline character.
A very reasonable way to send Unicode strings over the network is to encode them in UTF-8. Don't use io_lib:format for that, though; unicode:characters_to_binary/1 is the function meant to transform Unicode strings into UTF-8 encoded binaries.
On the receiving side (and probably, even better, throughout your whole application) you'll have to decide how to handle the strings, either as encoded binaries (or lists) or as plain lists of Unicode code points. But over the network the only choice is binaries (or iolists, which are possibly deep lists of bytes and binaries), and I'd bet the most reasonable encoding for your application will be UTF-8.
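A minimal sketch of that, assuming Sock is a connected gen_tcp socket (hypothetical; opened elsewhere with gen_tcp:connect/3):
%% encode the code-point list to UTF-8 bytes and send those bytes
send_tm(Sock) ->
    Utf8 = unicode:characters_to_binary([8482]),   %% <<226,132,162>>
    gen_tcp:send(Sock, Utf8).

%% on the receiving side, decode the bytes back into code points:
%% unicode:characters_to_list(<<226,132,162>>, utf8) =:= [8482]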