How to decode mixed string with unicode symbols?

Yesterday I was confused by the output of FM SSFC_PARSE_CERTIFICATE. It serves for decoding the fields of an X.509 certificate into a readable format.
Everything is OK for Latin symbols, but Cyrillic letters are turned into something like \u041F\u0440\u0438\u0432\u0435\u0442.
Besides, if the original text contains mixed symbols, i.e. Latin, non-Latin, spaces and digits, the task becomes even more complex: Hello! \u041F\u0440\u0438\u0432\u0435\u0442 1234.
I wrote some code myself to scan the string character by character and decode single entities using CL_ABAP_CONV_IN_CE=>UCCP, and it seems to work well, but I'd like to know if there is a standard way to achieve the same result?

Well, it seems like in your input xstring all non-Latin character codes have been escaped instead of being encoded in UTF-8. So if you are not satisfied with your DIY solution, you should work upstream of the call to FM SSFC_PARSE_CERTIFICATE.
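For comparison, here is a minimal sketch of the same character-scanning idea in Python (illustrative only; decode_u_escapes is a made-up helper name, not a standard function): it replaces each \uXXXX escape with the character it names and leaves everything else untouched.

import re

def decode_u_escapes(text):
    # replace every literal \uXXXX escape with the character it names;
    # plain Latin letters, digits and spaces pass through unchanged
    return re.sub(r'\\u([0-9A-Fa-f]{4})',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(decode_u_escapes(r'Hello! \u041F\u0440\u0438\u0432\u0435\u0442 1234'))
# Hello! Привет 1234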

Related

Unicode character creation in Python 3.4

Using Python 3.4, suppose I have some data from a file, and it is literally the 6 individual characters \ u 0 0 C 0, but I need to convert it to the single Unicode character \u00C0. Is there a simple way of doing that conversion? I can't find anything in the Python 3.4 Unicode documentation that seems to provide that kind of conversion, except for a complex way using exec() of an assignment statement, which I'd like to avoid if possible.
Thanks.
Well, there is:
>>> b'\\u00C0'.decode('unicode-escape')
'À'
However, the unicode-escape codec is aimed at a particular format of string encoding: the Python string literal. It may produce unexpected results when faced with other escape sequences that are special in Python, such as \xC0, \n, \\ or \U000000C0, and it may not recognise escape sequences from other string-literal formats. It may also handle characters outside the Basic Multilingual Plane incorrectly (e.g. JSON would encode U+10000 as the surrogate pair \uD800\uDC00).
So unless your input data really is a Python string literal shorn of its quote delimiters, this isn't the right thing to do, and it will likely produce unwanted results for some of these edge cases. There are lots of formats that use \u to signal Unicode characters; you should try to find out exactly what format it is and use a decoder for that scheme. For example, if the file is JSON, the right thing to do would be to use a JSON parser instead of trying to deal with \u, \n, \\ etc. yourself.
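A small sketch of that edge case (not from the original answer; Python 3):

import json

print(b'\\u00C0'.decode('unicode-escape'))                # À
# a JSON-style surrogate pair is left as two lone surrogates:
print(ascii(b'\\uD800\\uDC00'.decode('unicode-escape')))  # '\ud800\udc00'
# a JSON parser recombines the pair into the intended character:
print(ascii(json.loads('"\\uD800\\uDC00"')))              # '\U00010000'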

addPortalMessage requires decode('utf-8')

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
If Titles in Plone are already UTF-8 encoded, the string is a unicode string, and the underscore function is handled by i18ndude, then I do not see a reason why we specifically need to decode UTF-8. Usually I forget to add it and only remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?
UTF-8 is a representation of Unicode, not Unicode and not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and representations of unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python will automatically convert a string to unicode if it needs to by using the ASCII decoder. That will fail if there are non-ASCII characters in the string -- because, for example, it is encoded in UTF-8.
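A minimal illustration of that behaviour under Python 2 (the title value is a made-up example, not taken from Plone):

# -*- coding: utf-8 -*-
title = 'Ber\xc3\xa5s'   # UTF-8 encoded byte string for u'Berås'

# an explicit decode yields a unicode string that mixes safely with other unicode:
msg = u'Document %s uploaded.' % title.decode('utf-8')

# without the decode, Python 2 falls back to the implicit ASCII codec and fails:
try:
    msg = u'Document %s uploaded.' % title
except UnicodeDecodeError:
    print 'implicit ASCII decode failed on the non-ASCII bytes'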

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database. I use a .Net script for this purpose. I sometimes notice garbled words/characters (Ullerهkersvنgen) inside the files, which causes a problem when saving to the database.
I want to filter all such characters and convert them to unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script does know the correct encoding of your text snippet, then this should be the regular expression to find all non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
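For illustration, the same pattern in Python rather than .NET (the character-class syntax is identical in both regex flavours; the sample string is the garbled one from the question):

import re

text = 'Uller\u0647kersv\u0646gen'           # the garbled "Ullerهkersvنgen"
print(re.findall(r'[^\x00-\x7F]+', text))    # ['ه', 'ن']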
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name with Scandinavian letters in it, misinterpreted according to a wrong character-encoding assumption or munged by an incorrect character-code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text is normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
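That re-reading step, sketched in Python for concreteness (cp1256 and cp1252 are the standard codec names for Windows Arabic and Windows Latin 1; nothing here comes from the original answer):

garbled = 'Uller\u0647kersv\u0646gen'              # "Ullerهkersvنgen"
# re-encode with the wrongly assumed Arabic codec to recover the original bytes,
# then decode those bytes with the codec the file was actually written in:
print(garbled.encode('cp1256').decode('cp1252'))   # Ulleråkersvägen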

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as —, like it should.
In the XML export, however, it shows up as:
&#226;&#8364;&#8221;
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters "â€”". The XML should have been &#8212;, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The XML output has incorrectly been encoded using CP1252. To revert this, convert "â€”" to bytes using CP1252 encoding and then convert those bytes back to a string/char using UTF-8 encoding.
Java-based evidence:
String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console itself uses UTF-8 to display the character.
In .Net, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns —.
SourceForge converted it to UTF-8, interpreted each of the bytes as a character in CP1252, then saved the characters as three separate entities using the actual Unicode code points for those characters.
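For completeness, the same round trip sketched in Python (not part of the original answers):

garbled = '\u00e2\u20ac\u201d'                      # the three characters â € ”
print(garbled.encode('cp1252').decode('utf-8'))     # —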

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this; it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode ($input_to_be_converted);
If you have to deal with UTF-8 data that are not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji signs?
If you get Cyrillic text, there is no "closest ASCII representation" for many characters.
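For what it's worth, the same idea is available in Python as the third-party unidecode package, a port of Text::Unidecode (a sketch, assuming the package is installed; the Cyrillic sample is just an example name):

# pip install Unidecode
from unidecode import unidecode

print(unidecode(u'Подражанская'))   # an ASCII transliteration along the lines of 'Podrazhanskaia'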