addPortalMessage requires decode('utf-8') - encoding

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
If Titles in Plone are already UTF-8 encoded, the string is a unicode string and the underscore function is handled by i18ndude, I do not see a reason why we specifically need to decode utf-8. Usually I forget to add it and remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?

UTF-8 is a representation of Unicode, not Unicode and not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and representations of unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python will automatically convert a string to unicode if it needs to by using the ASCII decoder. That will fail if there are non-ASCII characters in the string -- because, for example, it is encoded in UTF-8.
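To make the difference concrete, here is a minimal sketch in Python 3, where `bytes` plays the role that `str` played in the Python 2 code from the question. The title value is made up for illustration; in Plone it would come from `doc_obj.Title()`.

```python
# Python 3 sketch: bytes stands in for the Python 2 str returned by Title().
raw_title = "Résumé".encode("utf-8")  # hypothetical UTF-8 encoded title

# Explicit decoding yields a text string that can be interpolated safely.
title = raw_title.decode("utf-8")
message = "This document (%s) has already been uploaded." % title

# Python 2's implicit conversion used the ASCII codec, which fails on this input.
try:
    raw_title.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(message)
print("ASCII decoding failed:", ascii_failed)
```

The explicit decode works because it names the encoding the bytes are really in; the ASCII attempt is what Python 2 did behind the scenes when mixing the two string types.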

Related

what's the definition of "encoding-agnostic"?

In lua 5.3 reference manual, we can see:
Lua is also encoding-agnostic; it makes no assumptions about the contents of a string.
I can't understand what the sentence says.
The same byte value in a string may represent different characters depending on the character encoding used for that string. For example, the same value \177 may represent ▒ in Code page 437 encoding or ± in Windows 1252 encoding.
Lua makes no assumption as to what the encoding of a given string is and the ambiguity needs to be resolved at the script level; in other words, your script needs to know whether to deal with the byte sequence as Windows 1252, Code page 437, UTF-8, or something else encoded string.
Essentially, a Lua string is a counted sequence of bytes. If you use a Lua string for binary data, the concept of character encodings is not relevant and does not interfere with the binary data. In that way, a string is encoding-agnostic.
There are functions in the standard string library that treat string values as text: an uncounted sequence of characters. There is no text but encoded text. An encoding maps a member of a character set to a sequence of bytes. A string would have the bytes for zero or more such encoded characters. To understand a string as text, you must know the character set and encoding. To use the string functions, the encoding should be compatible with os.setlocale().
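The ambiguity described above can be seen by decoding the same byte under different encodings. This is a small Python sketch rather than Lua, since Lua itself gives you only the raw bytes; the variable names are mine.

```python
# The single byte written as \177 (decimal 177) in a Lua string literal.
raw = bytes([177])

# The byte only becomes a character once an encoding is chosen.
as_cp437 = raw.decode("cp437")    # medium shade block in Code page 437
as_cp1252 = raw.decode("cp1252")  # plus-minus sign in Windows-1252

# On its own, the same byte is not even valid UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_cp437, as_cp1252, valid_utf8)
```

Nothing about the byte changes between the three attempts; only the interpretation chosen by the script does.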

Erlang, io_lib and unicode

I'm having a little trouble getting erlang to give me a unicode string.
Here's what works:
io:format("~ts~n", [<<226,132,162>>]).
™
ok
But instead of printing to the console, I want to assign it to a variable. So I thought:
T = lists:flatten(io_lib:format("~ts~n", [<<226,132,162>>])).
T.
[8482,10]
How can I get T in the io_lib example to contain the ™ symbol so I can write it to a network stream?
Instead of assigning the flattened version to a variable for sending on the network, can you instead re-write your code that sends over the network to accept the binary in the first place and use the formatted write mechanism ~ts when sending over the socket?
That would also let you avoid the lists:flatten, which isn't needed for the built-in IO mechanisms.
It does contain the trademark symbol: as you can see here, 8482 is its code. It isn't printed as ™ in the shell, because the shell prints as strings only lists which contain printable character code in Latin-1. So [8482, 10] is a Unicode string (in UTF-32 encoding). If you want to convert it to a different encoding, use the unicode module.
First thing is knowing what you need to do. Then you can adapt your code the best way you find.
Erlang represents unicode strings as lists of codepoints. Unicode codepoints are integers, not bytes. Since you can only send bytes over the network, things like unicode strings need to be encoded into byte sequences by the sending side and decoded by the receiving side. UTF-8 is the most used encoding for unicode strings, and that's what your binary is: the UTF-8 encoding of the unicode string composed of the codepoint 8482.
What you get out of the io_lib:format call is the erlang string representation of that codepoint plus the new line character.
A very reasonable way to send unicode strings over the network is encoding them in UTF-8. Don't use io_lib:format for that, though. unicode:characters_to_binary/1 is the function meant to transform unicode strings into UTF-8 encoded binaries.
On the receiving side (and probably even better in your whole application) you'll have to decide how you will handle the strings, either as encoded binaries (or lists) or as plain unicode lists. But over the network the only choice is using binaries (or iolists, which are possibly deep lists of bytes), and I'll bet the most reasonable encoding for your application will be UTF-8.
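The relationship between the three bytes in the binary and the code point 8482 can be checked in a few lines. This is a Python analogy, not Erlang: `decode` and `encode` here stand in roughly for what the unicode module does, and the variable names are only illustrative.

```python
# The three bytes from the question: the UTF-8 encoding of the trademark sign.
utf8_bytes = bytes([226, 132, 162])

# Decoding gives text; its code points match what io_lib:format returned.
text = utf8_bytes.decode("utf-8") + "\n"
code_points = [ord(ch) for ch in text]

# Encoding again gives the bytes that would actually go over the network.
wire_bytes = text.encode("utf-8")

print(code_points)
print(list(wire_bytes))
```

The list of code points and the list of bytes describe the same string at two different levels, which is why [8482,10] already "contains" the symbol.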

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
&acirc;&euro;&rdquo;
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters â€”. The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The XML output has incorrectly been encoded using CP1252. To revert this, convert â€” to bytes using CP1252 encoding and then convert those bytes back to a string using UTF-8 encoding.
Java based evidence:
String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console uses by itself UTF-8 to display the character.
In .NET, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns —.
SourceForge converted it to UTF-8, interpreted each of the bytes as a character in CP1252, then saved the characters as three separate entities using the actual Unicode code points for those characters.
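For anyone who wants to verify the round trip without Java or .NET, the same sequence can be sketched in Python. The variable names are mine, and this only reproduces the encoding steps described above, not the exporter itself.

```python
original = "\u2014"  # EM DASH

# Forward: encode as UTF-8, then misread each byte as a CP1252 character.
mangled = original.encode("utf-8").decode("cp1252")
mangled_code_points = ["U+%04X" % ord(ch) for ch in mangled]

# Reverse: turn the CP1252 characters back into bytes, then read them as UTF-8.
repaired = mangled.encode("cp1252").decode("utf-8")

print(mangled_code_points)
print(repaired == original)
```

The three code points produced by the forward step are the ones listed in the question, which supports the CP1252 explanation.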

base64 encoding: input character

I'm trying to understand what the input requirements are for base64 encoding. Nicholas Zakas, for whom I have tremendous respect, has an article on base64 in which he quotes a specification saying that an error should be thrown if the input contains any character with a code higher than 255:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters. Since base64 encoding requires eight bits per input character, any character with a code higher than 255 cannot be accurately represented. The specification indicates that an error should be thrown in this case:
if (/([^\u0000-\u00ff])/.test(text)){
    throw new Error("Can't base64 encode non-ASCII characters.");
}
He provides a link in another separate part of the article to the RFC 3548 but I don't see any input requirements other than:
Implementations MUST reject the encoding if it contains characters
outside the base alphabet when interpreting base encoded data, unless
the specification referring to this document explicitly states
otherwise.
Not sure what "base alphabet" means but perhaps this is what Zakas is referring to. But by saying they must reject the encoding it seems to imply that this is something that has already been encoded as opposed to the input (of course if the input is invalid it will also show up in the encoding so perhaps the point is moot).
A bit confused on what the standard is.
Fundamentally, it's a mistake to talk about "base64 encoding a string" where "string" is meant in terms of text.
Base64 encoding is applied to binary data (a sequence of bytes, or octets if you want to be even more picky), and the result is text. Every character in the output is printable ASCII text. The whole point of base64 is to provide a safe way of converting arbitrary binary data into a text format which can be reliably embedded in other text, transported etc. ASCII is compatible with almost all character sets, so you're very unlikely to be unable to encode ASCII text as part of something else.
When someone talks about "base64 encoding a string" they're really talking about encoding text as binary using some existing encoding (e.g. UTF-8), then applying a base64 encoding to the result. When decoding, you'd need to decode the base64 back to binary, and then decode that binary data with the original encoding, to get the original text.
For me the (first) linked article has a fundamental problem:
Before even attempting to base64 encode a string, you should check to see if the string contains only ASCII characters
You don't base64 encode strings. You base64 encode byte sequences. And when you're dealing with any kind of encoding work, it's extremely important to keep in mind this difference.
Also, his check for 'ASCII' actually lets through everything from 80 to ff, which aren't ASCII - ASCII is only 00 to 7f.
Now, if you have a string which you have checked is pure ASCII, you can then safely treat it as a byte sequence of the ASCII values of the characters in it - but this is a separate earlier step, nothing strictly to do with the act of base64 encoding.
(I should say that I do like his repeated urging for the reader to note that base64 encoding is not in any shape or form encryption)
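The two-step process described in the answers (text to bytes via a character encoding, then bytes to base64) might look like this in Python. The helper names are hypothetical, and UTF-8 is just an assumed default; any encoding works as long as both sides agree on it.

```python
import base64


def b64_encode_text(text, encoding="utf-8"):
    """Encode text to bytes first, then base64 encode those bytes."""
    raw = text.encode(encoding)
    return base64.b64encode(raw).decode("ascii")


def b64_decode_text(encoded, encoding="utf-8"):
    """Reverse both steps: base64 to bytes, then bytes to text."""
    raw = base64.b64decode(encoded)
    return raw.decode(encoding)


encoded = b64_encode_text("\u2122")  # trademark sign, code point above 255
decoded = b64_decode_text(encoded)

print(encoded, decoded)
```

Because the character encoding step comes first, a character with a code above 255 is no obstacle; base64 itself only ever sees bytes.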

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
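use Text::Unidecode;
# utf8::decode converts the UTF-8 bytes to characters in place and returns true on success
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode($input_to_be_converted);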
If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji signs?
If you get Cyrillic text there is no "closest ASCII representation" for many characters.
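The limitation is easy to demonstrate with a naive approach. The sketch below is not Text::Unidecode and not Perl; it is a different, simpler technique (Unicode NFKD decomposition followed by dropping everything outside ASCII) using only the Python standard library, and the function name and sample names are made up.

```python
import unicodedata


def strip_to_ascii(name):
    """Decompose accented letters, then drop every non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Accented Latin letters reduce to their base letters.
latin = strip_to_ascii("José Ştefan")

# Cyrillic letters have no ASCII base letter, so nothing survives.
cyrillic = strip_to_ascii("Подражанская")

print(repr(latin), repr(cyrillic))
```

Accented Latin names come through recognisably, while the Cyrillic name disappears entirely, so anything beyond accent stripping needs a real transliteration table such as the one Text::Unidecode ships with.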