Erlang and binary with Cyrillic - unicode

I need to be able to use binaries with Cyrillic characters in them. I tried just writing <<"абвгд">> but I got a badarg error.
How can I work with Cyrillic (or unicode) strings in Erlang?

If you want to enter the expression above in the Erlang shell, read the user manual for the unicode module.
The functions characters_to_binary and characters_to_list are inverses of each other. Here is an example:
(emacs#yus-iMac.local)37> io:getopts().
[{expand_fun,#Fun<group.0.33302583>},
{echo,true},
{binary,false},
{encoding,unicode}]
(emacs#yus-iMac.local)40> A = unicode:characters_to_binary("上海").
<<228,184,138,230,181,183>>
(emacs#yus-iMac.local)41> unicode:characters_to_list(A).
[19978,28023]
(emacs#yus-iMac.local)45> io:format("~s~n",[ unicode:characters_to_list(A,utf8)]).
** exception error: bad argument
in function io:format/3
called as io:format(<0.30.0>,"~s~n",[[19978,28023]])
(emacs#yus-iMac.local)46> io:format("~ts~n",[ unicode:characters_to_list(A,utf8)]).
上海
ok
If you want to use unicode:characters_to_binary("上海"). directly in source code, it is a little more complicated. Try it first to see the difference.
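For readers who want to check the numbers in the transcript outside Erlang, here is a minimal Python sketch. Python is used only for illustration, and the variable names are hypothetical. It shows that the six bytes and the two code points are two views of the same text.

```python
# Illustrative cross-check of the Erlang transcript above.
text = "\u4e0a\u6d77"  # the two characters used in the transcript

# UTF-8 bytes, analogous to unicode:characters_to_binary/1
utf8_bytes = list(text.encode("utf-8"))

# Code points, analogous to unicode:characters_to_list/1
code_points = [ord(ch) for ch in text]

# Decoding the bytes restores the original text
round_trip = bytes(utf8_bytes).decode("utf-8")

print(utf8_bytes)
print(code_points)
print(round_trip == text)
```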

Older Erlang compilers (before Erlang/OTP 17) interpret source code as ISO-8859-1 encoded text, which limits you to Latin characters; from OTP 17 on, the default source encoding is UTF-8. Although you may be able to bang in some ISO characters that happen to have the same byte representation as what you want in Unicode, this is not a very good idea.
With the older compilers, you want to make sure your editor reads and writes ISO-8859-1, and you want to avoid using literals as much as possible. Source these strings from files.

Related

wxTextCtrl OSX mutated vowel

I am using wxMac 2.8 in a non-Unicode build. I try to read a file with mutated vowels (umlauts such as "ü") into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled - printing the string in gdb displays "\303\274" and similar per original char. Typing "ü" or similar into the textctrl is no problem. I tried various wxMBConv methods but the result is always the same. Is there a way to solve this?
Best regards,
If you use anything but 7 bit ASCII, you must use Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with Unicode build, use wxWidgets 2.9 instead where it will compile -- and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF8 world, then you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.
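The byte-level picture behind this advice can be sketched in a few lines of Python (used here only for illustration). Latin-1 as the misreading encoding is an assumption; the question does not say which single-byte encoding the non-Unicode build actually used.

```python
# "ü" (U+00FC) encoded as UTF-8 is two bytes: 0xC3 0xBC (octal \303 \274).
utf8_bytes = "\u00fc".encode("utf-8")

# Writing the literal with escapes keeps the source file pure ASCII,
# so the editor's encoding no longer matters.
escaped = b"\xc3\xbc"

# Misreading those bytes in a single-byte encoding yields two characters
# per original character, which is why the count doubles.
# Latin-1 is an assumption; the asker's actual encoding is not stated.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes == escaped)
print(len("\u00fc"), len(misread))
print(escaped.decode("utf-8"))
```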

Erlang, io_lib and unicode

I'm having a little trouble getting erlang to give me a unicode string.
Here's what works:
io:format("~ts~n", [<<226,132,162>>]).
™
ok
But instead of printing to the console, I want to assign it to a variable. So I thought:
T = lists:flatten(io_lib:format("~ts~n", [<<226,132,162>>])).
T.
[8482,10]
How can I get T in the io_lib example to contain the ™ symbol so I can write it to a network stream?
Instead of assigning the flattened version to a variable for sending on the network, can you instead re-write your code that sends over the network to accept the binary in the first place and use the formatted write mechanism ~ts when sending over the socket?
That would also let you avoid the lists:flatten, which isn't needed for the built-in IO mechanisms.
It does contain the trademark symbol: as you can see here, 8482 is its code. It isn't printed as ™ in the shell, because the shell only prints a list as a string when all of its elements are printable Latin-1 character codes. So [8482, 10] is a Unicode string (a list of code points, comparable to UTF-32). If you want to convert it to a different encoding, use the unicode module.
First thing is knowing what you need to do. Then you can adapt your code the best way you find.
Erlang represents unicode strings as lists of codepoints. Unicode codepoints are integers, not bytes. Since you can only send bytes over the network, things like unicode strings need to be encoded into byte sequences by the sending side and decoded by the receiving side. UTF-8 is the most widely used encoding for unicode strings, and that's what your binary is: the UTF-8 encoding of the unicode string composed of the codepoint 8482.
What you get out of the io_lib:format call is the erlang string representation of that codepoint plus the new line character.
A very reasonable way to send unicode strings over the network is encoding them in UTF-8. Don't use io_lib:format for that, though. unicode:characters_to_binary/1 is the function meant to transform unicode strings into UTF-8 encoded binaries.
On the receiving side (and probably even better in your whole application) you'll have to decide how you will handle the strings, either as encoded binaries (or lists) or as plain unicode lists. But over the network the only choice is using binaries (or iolists, which are possibly deep lists of bytes), and I'll bet the most reasonable encoding for your application will be UTF-8.
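The distinction between code points and encoded bytes can be illustrated with a small Python sketch (Python only for illustration; the variable names are hypothetical). It starts from the list the asker obtained and produces the bytes that would go on the wire.

```python
# The list the asker got back from io_lib:format: code points, not bytes.
code_points = [8482, 10]

# Build the text from the code points (8482 is U+2122, the trade mark sign).
text = "".join(chr(cp) for cp in code_points)

# Encode to UTF-8 before sending; this is the byte form a socket can carry.
payload = text.encode("utf-8")

# The receiving side decodes the bytes back into text.
received = payload.decode("utf-8")

print(list(payload))
print(received == text)
```

The first three bytes of the payload are the same ones that appear in the binary from the question.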

write()-ing an encoded string in Python 3.x

I've got a unicode string (s) which I want to write into a file.
In Python 2 I could write:
open('filename', 'w').write(s.encode('utf-8'))
But this fails for Python 3. Apparently, s.encode() returns something of type 'bytes', which the write() function does not accept:
TypeError: must be str, not bytes
Does anyone know how to port the above code to Python 3?
Edit:
Thanks to all of you who proposed using binary mode! Unfortunately, this causes a problem with the \n characters. Is there any way to achieve the same result I had with Python 2 (namely to encode non-ANSI characters in UTF-8 while keeping the OS-specific rendition of \n)?
Thanks!
You do not want to muck around with manually encoding each and every piece of data like that! Simply pass the encoding as an argument to open, like this:
#!/usr/bin/env python3.2
slist = [
    "Ca\N{LATIN SMALL LETTER N WITH TILDE}on City",
    "na\N{LATIN SMALL LETTER I WITH DIAERESIS}vet\N{LATIN SMALL LETTER E WITH ACUTE}",
    "fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade",
    "\N{GREEK SMALL LETTER BETA}-globulin"
]
with open("/tmp/sample.utf8", mode="w", encoding="utf8") as f:
    for s in slist:
        print(s, file=f)
Now if you cat the file you made, you'll see that it says:
$ cat /tmp/sample.utf8
Cañon City
naïveté
façade
β-globulin
And you can see that those are the right code points this way:
$ uniquote -x /tmp/sample.utf8
Ca\x{F1}on City
na\x{EF}vet\x{E9}
fa\x{E7}ade
\x{3B2}-globulin
See how much easier that is? Let the stream object handle any low-level encoding or decoding for you.
Summary: Don't call encode or decode yourself when all you are doing is processing a homogeneous stream that is entirely in one encoding. That's way too much bother for zero gain. Use the encoding argument once and for all.
Open the file in binary mode, that's the least invasive way in terms of changes.
On the other hand, you could set the output file encoding with open() and avoid explicit string encoding altogether.
You might want to read the manual of the open() function.
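As a rough sketch of the second option, the following writes a str through a text-mode file opened with an explicit encoding and then reads the raw bytes back. The temporary path is hypothetical and exists only for the demonstration. In text mode with the default newline setting, "\n" is translated to the platform line separator, which is the behaviour asked for in the edit to the question.

```python
import os
import tempfile

s = "Ca\N{LATIN SMALL LETTER N WITH TILDE}on City\n"

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.utf8")

# Text mode with an explicit encoding: write() takes str, the stream
# encodes it and translates "\n" to the platform line separator.
with open(path, "w", encoding="utf-8") as f:
    f.write(s)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```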
Open the file in binary mode
open('filename', 'wb').write(s.encode('utf-8'))

Perl encodings question

I need to get a string from <STDIN>, written in latin and russian mixed encodings, and convert it to some url:
$search_url = "http://searchengine.com/search?text=" . uri_escape($query);
But this process goes wrong and produces mojibake (a mixture of weird letters). What can I do in Perl to solve it?
Before you can get started, there's a few things you need to know.
You'll need to know the encoding of your input. "Latin" and "russian" aren't (character) encodings.
If you're dealing with multiple encodings, you'll need to know what is encoded using which encoding. "It's a mix" isn't good enough.
You'll need to know the encoding the site expects the query to use. This should be the same encoding as the page that contains the search form.
Then, it's just a matter of decoding the input using the correct encoding, and encoding the query using the correct encoding. That's the easy part. Encode provides functions decode and encode to do just that.
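The decode-then-encode flow described above can be sketched as follows. This uses Python rather than Perl, purely for illustration. The function name build_search_url and the choice of KOI8-R for the input and UTF-8 for the site are assumptions; the real encodings have to be determined as explained above.

```python
from urllib.parse import quote

def build_search_url(raw_input, input_encoding, site_encoding):
    # Decode the input bytes using the encoding they were actually written in.
    query = raw_input.decode(input_encoding)
    # Percent-encode the query using the encoding the site expects.
    escaped = quote(query, safe="", encoding=site_encoding)
    return "http://searchengine.com/search?text=" + escaped

# Assumed for the example: input arrives as KOI8-R, site expects UTF-8.
raw = "\u043f\u0440\u0438\u0432\u0435\u0442 world".encode("koi8_r")
url = build_search_url(raw, "koi8_r", "utf-8")
print(url)
```

Decoding with the wrong input encoding does not raise an error for most single-byte encodings; it silently produces different characters, which is how mojibake ends up in the URL.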

Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF8-encoded cyrillic characters in a series of files. Is there a tool to do this?
Thanks
If you are looking for a ready-made solution, you might want to try Enca.
However, if you only want to detect the presence of what could be decoded as UTF-8 Cyrillic characters (without any complete UTF-8 validity checks), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n or more consecutive UTF-8 encoded Russian Cyrillic characters). As an additional check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1).
Both methods have their good and bad sides and may sometimes give wrong results.
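The regexp approach can be tried out with a short Python sketch operating on raw bytes. The function names and the default of n=3 are hypothetical choices for illustration, and the decode-based validity check is only a rough stand-in for a tool such as isutf8(1).

```python
import re

def cyrillic_run_pattern(n):
    # Same byte ranges as the regexp above: D0 81, D0 90-BF, D1 80-8F, D1 91.
    return re.compile(
        rb"(?:\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){"
        + str(n).encode("ascii")
        + rb",}"
    )

def has_utf8_cyrillic(data, n=3):
    # True if data contains at least n consecutive Russian Cyrillic
    # letters encoded as UTF-8.
    return cyrillic_run_pattern(n).search(data) is not None

def is_valid_utf8(data):
    # Rough stand-in for a whole-file validity check.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

sample = "log: \u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")
print(has_utf8_cyrillic(sample), has_utf8_cyrillic(b"plain ascii"), is_valid_utf8(sample))
```

To scan a series of files, each one would be opened in binary mode and its contents passed to has_utf8_cyrillic.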
IIRC the ICU library has code that does character set detection. Though it's basically a best effort guess.
Edit: I did remember correctly, check out this paper / tutorial