Cyrillic UTF-8 to ASCII with XQuery

I need to generate string IDs from Cyrillic data using only ASCII characters. I'm using XQuery to grab the data from XML files, and this is the point where I can convert any Cyrillic characters into ASCII characters.
I've tried using normalize-unicode($string, FORM) with FORM set to each of the four normalization forms, but I don't think it has any mappings from Cyrillic to Latin characters; at least, it doesn't change the $string input.
For example, I want:
normalize-unicode("Идејата", 'NFKD')
To produce something along the lines of:
"NAejaTa"
Are there any functions in XQuery that can do this sort of thing? Any help is greatly appreciated!
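As a rough illustration in Python rather than XQuery: Unicode normalization only composes or decomposes characters, it does not map between scripts, so it leaves "Идејата" unchanged and a transliteration needs an explicit mapping. The table below is a hypothetical, deliberately tiny mapping that covers only the letters of this one word; real data would need an entry for every Cyrillic letter it contains.

```python
import unicodedata

# Hypothetical mapping, limited to the letters of the example word.
CYRILLIC_TO_LATIN = str.maketrans({
    "И": "I", "д": "d", "е": "e", "ј": "j", "а": "a", "т": "t",
})


def transliterate(text: str) -> str:
    """Replace mapped Cyrillic letters; unmapped characters pass through."""
    return text.translate(CYRILLIC_TO_LATIN)


word = "Идејата"
normalized = unicodedata.normalize("NFKD", word)  # same as word: nothing to decompose
ascii_id = transliterate(word)
print(normalized == word, ascii_id)
```

The same idea carries over to XQuery as a character-for-character replacement (for example with translate()), as long as each Cyrillic letter maps to a single ASCII character.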

Related

How to decode mixed string with unicode symbols?

Yesterday I was confused by output of FM SSFC_PARSE_CERTIFICATE. It serves for decoding fields of X.509 certificate into readable format.
Everything is OK for latin symbols, but cyrillic letters are turned into something like \u041F\u0440\u0438\u0432\u0435\u0442.
Besides, if the original text contains mixed symbols, i.e. Latin, non-Latin, spaces and digits, the task becomes even more complex: Hello! \u041F\u0440\u0438\u0432\u0435\u0442 1234.
I wrote some code myself to scan the string character by character and decode single entities using CL_ABAP_CONV_IN_CE=>UCCP, and it seems to work well, but I'd like to know if there is a standard way to achieve the same result.
Well, it seems that in your input xstring all non-Latin character codes have been escaped instead of being encoded in UTF-8. So if you are not satisfied with your DIY solution, you should work upstream of the call to FM SSFC_PARSE_CERTIFICATE.
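For comparison, the decoding step described in the question can be sketched in a few lines of Python (not ABAP). The function name is made up for this example, and the sketch assumes every escape is exactly four hex digits and that no surrogate pairs occur.

```python
import re

# Matches a literal backslash, "u", and four hex digits.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text: str) -> str:
    """Replace each \\uXXXX escape with its character; leave other text alone."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


mixed = r"Hello! \u041F\u0440\u0438\u0432\u0435\u0442 1234"
decoded = decode_u_escapes(mixed)
print(decoded)
```

Latin letters, spaces and digits pass through untouched because only the escape sequences match the pattern.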

ASCII to Unicode conversion in doxygen blocks

I'm using doxygen with Fortran. Apparently some Fortran compilers still have problems with Unicode strings, even in comments, so I have to keep all input as ASCII. No problem, I set INPUT_ENCODING = ASCII in the Doxyfile, and I can use some HTML entities like &eacute;. But I have some questions:
The entities are maintained in the output (or converted to \'{e} in LaTeX). Can I have them converted to UTF-8 characters instead?
It seems numerical entities such as &#233; or &#xE9; are not recognized. Can I enter characters for which there is no named entity in some other way?

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database. I use a .NET script for this purpose. I sometimes notice garbled words or characters (Ullerهkersvنgen) inside the files, which causes a problem while saving to the database.
I want to filter all such characters and convert them to unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script does know the correct encoding of your text snippet, then this should be the regular expression to find all non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
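As a small sketch of how the regular expression above behaves, here it is applied in Python to the garbled word from the question. The helper name is invented for this example; the two Arabic letters are written as escapes so the sample is unambiguous.

```python
import re

# One or more consecutive characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def non_ascii_runs(text: str) -> list:
    """Return (offset, run) pairs for every run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]


sample = "Uller\u0647kersv\u0646gen"
runs = non_ascii_runs(sample)
print(runs)
```

This only locates the suspicious characters; deciding what they should have been still depends on knowing the file's real encoding.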
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or as munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text is normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
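The reasoning above can be checked with a short Python sketch: re-encode the garbled text as windows-1256 to recover the original bytes, then decode those bytes as windows-1252. This assumes the wrong decoding has already happened and lost nothing; when the raw file is still available, reading it as windows-1252 in the first place is simpler.

```python
# The garbled word, with the two Arabic letters written as escapes.
garbled = "Uller\u0647kersv\u0646gen"

# Undo the wrong interpretation: back to the bytes that were in the file.
raw_bytes = garbled.encode("cp1256")

# Decode those bytes with the encoding the file most likely uses.
repaired = raw_bytes.decode("cp1252")
print(raw_bytes, repaired)
```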

Unicode characters xn--ls8h

I want to know how to write the unicode Emoji characters in this form "xn--ls8h" <-- that is the pile of poo emoji unicode character. I had never seen this form, always something like &#5623*; (no asterisk) or something like that... What is this "xn--" form and how do I convert to it? Thanks!
xn-- is the prefix used in the ASCII representation of an Internationalized Domain Name, and ls8h is the Punycode representation of the character.
In Python, Punycode is one of the standard character encodings:
>>> b'ls8h'.decode('punycode')
'\U0001f4a9'
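Going the other way, from a character to the "xn--" form, uses the same codec in reverse. The sketch below handles one label only and skips the normalization and validity checks that full IDNA processing applies to domain names; the function names are made up for this example.

```python
ACE_PREFIX = "xn--"


def to_ace_label(label: str) -> str:
    """Punycode-encode one label and add the xn-- prefix."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace_label(ace: str) -> str:
    """Reverse of to_ace_label; labels without the prefix are returned as-is."""
    if not ace.startswith(ACE_PREFIX):
        return ace
    return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")


encoded = to_ace_label("\U0001F4A9")
print(encoded, from_ace_label(encoded) == "\U0001F4A9")
```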

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow up with the customer and ask him for a "Latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length byte encoding, whereas Text::Unidecode works on decoded Perl character strings, not on raw UTF-8 bytes. So that sentence should read:
Make sure that the input data really is a decoded character string.
That is, a string whose UTF-8 bytes have already been turned into Perl characters.
If you want to convert strings which really are utf8, you would do it like so:
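use Text::Unidecode;
# utf8::decode converts the UTF-8 bytes to characters in place; it returns false if the input is not valid UTF-8
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode($input_to_be_converted);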
If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji?
If you get Cyrillic text, there is no "closest ASCII representation" for many characters.
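To make the last point concrete, here is a hedged Python sketch (not Perl) of the detection step the question asks about. It uses only the standard library: names written in Latin letters with diacritics, such as Romanian ones, reduce to ASCII after decomposition, while Cyrillic names stay non-ASCII and would need a transliteration table or a module like Text::Unidecode. The function name and sample names are chosen for this example.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


romanian = strip_diacritics("Brâncuși")
cyrillic = strip_diacritics("Подражанская")

# str.isascii() (Python 3.7+) tells whether any non-ASCII characters remain.
print(romanian, romanian.isascii(), cyrillic.isascii())
```

A name that is still non-ASCII after this step is a candidate for transliteration or for the manual follow-up described in the question.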