Must UTF-8 binaries include /utf8 in the binary literal in Erlang? - unicode

In Erlang, when defining a UTF-8 binary string, I need to specify the encoding in the binary literal, like this:
Star = <<"★"/utf8>>.
> <<226,152,133>>
io:format("~ts~n", [Star]).
> ★
> ok
But if the /utf8 encoding is omitted, the Unicode characters are not handled correctly:
Star1 = <<"★">>.
> <<5>>
io:format("~ts~n", [Star1]).
> ^E
> ok
Is there a way that I can create literal binary strings like this without having to specify /utf8 in every binary I create? My code has quite a few binaries like this and things have become quite cluttered. Is there a way to set some sort of default encoding for binaries?

This is a result of the ambiguity of Erlang strings and lists. When you enter <<"★">>, what Erlang actually sees is <<[9733]>>, which is just a list containing an integer. Each element of a binary defaults to an 8-bit integer segment, so Erlang truncates 9733 (16#2605) to its low byte, and <<"★">> becomes <<5>>.
The /utf8 type specifier tells Erlang that the value is a Unicode code point to be encoded as UTF-8, so 9733 becomes the three bytes <<226,152,133>>.
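If it helps to see the arithmetic, here is a minimal sketch in Python (just an illustration of the byte-level truncation, not Erlang semantics):

codepoint = ord("★")             # 9733, i.e. 0x2605
print(codepoint & 0xFF)          # 5 -- the low byte, which is all <<"★">> keeps
print("★".encode("utf-8"))       # b'\xe2\x98\x85', i.e. <<226,152,133>>, what /utf8 produces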

Related

When using UTF-8, is it better to reference characters for international use using decimal or hex... and why?

When using UTF-8, which character reference is better, or more widely supported worldwide on various browsers... using decimal references or hex references?
UPDATE
For instance, for replacing quotation marks...
&#34; or &#x22;
which one is better to use, and why?
All HTML character references use only the ASCII subset, so the fact that you encode your document in UTF-8, as opposed to any other byte-oriented encoding which extends ASCII, is irrelevant here.
Anyway:
When using UTF-8, you can just copy and paste the relevant characters into the document, without references at all. E.g. StackOverflow does not convert this ⫅ to an entity (see the source of this page).
If you prefer using references, then I would use the hex form, purely because this is the way Unicode code points are usually written in the charts. References are so widely supported that I do not think you will have a compatibility problem with either hex or decimal references.
There is no functional difference between decimal references and hexadecimal references. Old browsers did not support the latter, but then we are talking about really old browsers like Netscape 4 and IE 4.
Hexadecimal references are usually more handy, because in character code standards and other reference works, characters are referred to by their code numbers in hexadecimal. Using them, you avoid the conversion from hexadecimal to decimal (and thereby may avoid some mistakes).
There is no reason to use either &#34; or &#x22; in text. (In attribute values, they, or &quot;, are needed in rare cases.)
This does not depend on the document encoding (UTF-8 or something else), except in the sense that when using UTF-8, you do not need the references (except for the markup-significant characters < and &). UTF-8 lets you enter any character as such, though you might still use references if you find that more comfortable than finding an editor that lets you enter the characters themselves.
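If you want to check that a decimal and a hexadecimal reference name the same character, here is a minimal Python sketch (Python 3.4+ for html.unescape; the references are the ones from the question):

import html

# &#34; (decimal) and &#x22; (hex) are one character in two notations
print(html.unescape("&#34;") == html.unescape("&#x22;"))   # True
print(html.unescape("&quot;") == html.unescape("&#34;"))   # True -- the named form is equivalent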

Working with strings with mixed encodings in python 3.x

I'm working with a binary file that references another file using absolute paths.
The path contains both Japanese and ASCII characters.
The length of the string is given, so I can just read that many bytes and convert it into a string.
However the problem is trying to convert the string. If I specify the encoding as ascii, it'll fail on the japanese characters. If I specify it as japanese encoding (shift-jis or something), it won't read the english characters properly.
One byte is used for each ASCII character, while two bytes are used for each Japanese character.
What is the fastest and cleanest way to convert these bytes into a string? The encodings are known. Will the same technique work in older versions of Python?
It sounds like you may have fallen victim to a misunderstanding of the basics of Unicode and encodings. Perhaps you have not, but misunderstandings here are common and understandable, while the situation you describe is not.
A byte string that contains mixed encodings is, by definition, invalid in any one of those encodings. If this really were the case, you would have to split the byte string into its parts and decode every part separately. Here that would probably mean splitting on the path separators, so it would be reasonably easy, but in other cases it would not. However, I seriously doubt that this is the case, as it would mean that your source is insane. That happens, but it is unlikely. :-)
If the source gives you one path as a byte string, that string most likely uses a single encoding. It can contain both Japanese and ASCII characters and still use one encoding. The most common encodings that handle both are UTF-8, Shift JIS, and EUC-JP. Note that since you write "one byte is used for each ASCII character, while two bytes are used for each Japanese character", it is probably not UTF-8, which uses three bytes for most Japanese characters; two bytes per Japanese character points at Shift JIS or EUC-JP. You say you already tried Shift JIS, so try EUC-JP as well, and make sure you decode the whole path with one codec.
If not, please explain what your source is, and give examples of the byte strings (in ASCII/HEX) that you are given.
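For illustration, a minimal Python 3 sketch of decoding a mixed ASCII-and-Japanese path with a single codec (the path and the encoding here are assumptions, not taken from your file):

# Simulate the raw bytes read from the binary file.
raw = "C:/projects/開発/readme.txt".encode("shift_jis")

# One decode call handles both scripts, because Shift JIS covers ASCII too.
path = raw.decode("shift_jis")
print(path)                     # C:/projects/開発/readme.txt

# If the file turns out to use EUC-JP or UTF-8, only the codec name changes.

The same bytes-to-text decode works in Python 2 as well; it just returns a unicode object instead of str.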

Command-line arguments as bytes instead of strings in python3

I'm writing a Python 3 program that gets the names of files to process from command-line arguments. I'm confused about the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it reaches them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling filesystem functions. When I receive a filename argument which is invalid, however, it is handed to me as a Unicode string with strange characters like \udce8. I do not know what these are, and trying to encode the string always fails, be it with UTF-8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is for reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode it in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me ?
When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8.
Those are surrogate characters. The low 8 bits of each one hold the original undecodable byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
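A minimal sketch of round-tripping such an argument and reporting it safely (assumes Python 3.2+ for os.fsencode; the names are illustrative):

import os
import sys

arg = sys.argv[1]               # may contain \udcXX surrogates (PEP 383)
raw = os.fsencode(arg)          # re-encodes with errors='surrogateescape',
                                # restoring the original argv bytes

# For error reports, replace anything undecodable instead of raising:
sys.stderr.buffer.write(arg.encode("utf-8", errors="replace") + b"\n")

os.fsencode addresses question 2, and the errors="replace" handler addresses question 3; errors="backslashreplace" is an alternative that keeps the offending byte values visible.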
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use bytes where you should use a string. A bytes object is a sequence of integers; a string is a sequence of characters. They are different concepts. What you're doing is like using an integer where you should use a boolean.
(Aside: Python stores all strings in memory as Unicode; all strings are stored the same way. An encoding specifies how Python converts between the bytes in a file and this in-memory format.)
Your operating system stores filenames as strings in a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the system's filename encoding, for example.

Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in ASCII-7. By efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, uses only 7 bits, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
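For example, a minimal Python sketch of that idea (zlib is just one choice of compressor, and bytes.hex needs Python 3.5+):

import zlib

def to_ascii7(s: str) -> str:
    # Hex digits survive arbitrary case-folding, so the result is
    # case-insensitive in the sense the question requires.
    return zlib.compress(s.encode("utf-8")).hex()

def from_ascii7(a: str) -> str:
    # bytes.fromhex accepts upper- and lowercase digits alike.
    return zlib.decompress(bytes.fromhex(a)).decode("utf-8")

assert from_ascii7(to_ascii7("déjà vu ★")) == "déjà vu ★"

Note that this fails the first requirement (ASCII alphanumerics do not stay the same), which is exactly the trade-off described above.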
If you're talking about non-standard schemes, URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
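You can see the trade-off with Python's binascii module (a quick sketch):

import binascii

qp = binascii.b2a_qp("café ★".encode("utf-8"))
print(qp)                                        # b'caf=C3=A9 =E2=98=85'
print(binascii.a2b_qp(qp).decode("utf-8"))       # café ★

Each non-ASCII byte becomes a three-character =XX escape, hence the 6-12 output characters per UTF-8 encoded character.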
Punycode is used for IDNA, but you can use it outside the restrictions IDNA imposes.
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(For IDNA, Python supplies another codec of the same name, "idna".)
Obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case), you should be good to go.

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does, but with limitations. I simply don't understand: are the limitations anything major, or not a big deal?
You can certainly store Unicode strings in Lua, as UTF-8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ASCII characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find(string_var, ".", 1, true) (the final true requests a plain substring search, since a bare "." is a pattern wildcard matching any character), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct UTF-8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in Unicode characters of a string, just count the number of bytes with the high bit zero (ASCII characters) and the number of bytes with the top two bits 11 (the "leading bytes" of non-ASCII characters); the length is the sum of those two, since the continuation bytes (top bits 10) are deliberately left out.
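Here is that rule as a minimal Python sketch (a direct transliteration to Lua using string.byte works the same way):

def utf8_len(b: bytes) -> int:
    # ASCII bytes look like 0xxxxxxx; leading bytes of multi-byte
    # sequences look like 11xxxxxx. Continuation bytes (10xxxxxx)
    # are skipped, so the total is the character count.
    return sum(1 for byte in b if byte < 0x80 or byte >= 0xC0)

assert utf8_len("開発.txt".encode("utf-8")) == 6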
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page.
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no', as the linked page puts it.
Lua supports Unicode in the sense that specifying, storing, and querying arbitrary byte values in strings is supported, so you can store a string in any Unicode encoding in a Lua string.
What is not supported is iteration by Unicode character; there is no standard function for string length in Unicode characters, and so on. The higher-level kind of Unicode support (like what is available in Python: length, lower-to-upper case conversion, encoding to arbitrary codings, etc.) is not available.
Lua 5.3 has now been released. It comes with a basic UTF-8 library.
You can use the utf8 library to work with UTF-8 encoding, like getting the length of a UTF-8 string in characters (not in bytes, as string.len does), iterating over each character (not each byte), etc.
It doesn't provide support beyond the encoding itself, such as answering "is this character a Chinese character?".
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.