Mapping arbitrary unicode alphanumeric characters to their ascii equivalents - unicode

When I encounter an arbitrary Unicode string, such as a hashtag, I would like to express only its alphanumeric components as a string of their ASCII equivalents. For example,
x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
would be rendered as
x='Patriot'
Since I cannot anticipate which Unicode characters could appear in such strings, I would like the method to be as general as possible. Any suggestions?

The unicodedata.normalize function maps Unicode code points to a normalized form; the compatibility forms (NFKC, NFKD) replace styled variants such as mathematical letters with their plain equivalents. Then encode the result to ASCII, ignoring characters that have no ASCII equivalent, and decode the resulting byte string back to a Unicode string:
>>> import unicodedata as ud
>>> x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
>>> ud.normalize('NFKC',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
If you need to remove accents from letters but still keep the base letter, use 'NFKD' instead.
>>> x='€𝙋𝙖𝙩𝙧𝙞ô𝙩'
>>> ud.normalize('NFKD',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
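The question asks for alphanumeric components only, so the two steps above can be wrapped in a small helper that also drops any remaining ASCII punctuation. This is a minimal sketch, not a definitive implementation; the function name ascii_alnum is made up for illustration, and scripts with no compatibility mapping to ASCII (Greek, Cyrillic, CJK and so on) are simply discarded rather than transliterated.

```python
import unicodedata


def ascii_alnum(text: str) -> str:
    """Return only the ASCII letters and digits recoverable from text.

    Hypothetical helper: NFKD splits accented letters into base letter
    plus combining mark and maps styled letters to their plain forms.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that has no ASCII representation (marks, symbols, ...).
    ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")
    # Keep letters and digits, dropping ASCII punctuation such as '#' or '_'.
    return "".join(ch for ch in ascii_only if ch.isalnum())


sample = "\u20ac\U0001d64b\U0001d656\U0001d669\U0001d667\U0001d65e\u00f4\U0001d669"
print(ascii_alnum(sample))
```

The sample string is the euro sign followed by mathematical sans-serif bold italic letters and one accented "o", written with escapes so that it survives copy and paste.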

Related

Formal definition of a unicode string

I am trying to understand what a "Unicode string" is, and the more I read the Unicode standard, the less I understand it. Let's start from a definition taken from the standard.
A unicode scalar value is any integer in between 0x0 and 0xD7FF included, or in between 0xE000 and 0x10FFFF included (D76, p:119)
My feeling was that a unicode string is a sequence of unicode scalar values. I would define a UTF-8 unicode string as a sequence of unicode scalar values encoded in UTF-8. But I am not sure that it is the case. Here is one of the many definitions we can see in the standard.
"Unicode string: A code unit sequence containing code units of a particular Unicode encoding form" (D80, p:120)
But to me this definition is very fuzzy. Just to understand how bad it is, here are a few other "definitions" or strange things in the standard.
(p: 43) "A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units."
According to this definition, any sequence of uint8 is a valid UTF-8. I would rule out this definition as it would accept anything as a unicode string!!!
(p: 122) "Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form. For example, it is perfectly reasonable to talk about an operation that takes the two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which contains an ill-formed UTF-16 code unit sequence, and concatenates them to form another Unicode string <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit sequence. The first two Unicode strings are not in UTF-16, but the resultant Unicode string is."
I would rule out this definition as it would be impossible to define a sequence of unicode scalar values for a unicode string encoded in UTF-16 as this definition would allow to cut surrogate pairs!!!
For a start, let's look for a clear definition of a UTF-8 unicode string. So far, I can propose 3 definitions, but the real one (if there is one) might be different:
(1) Any array of uint8
(2) Any array of uint8 that comes from the sequence of unicode scalar value encoded in UTF-8
(3) Any subarray of an array of uint8 that comes from the sequence of unicode scalar value encoded in UTF-8
To make things concrete, here are a few examples:
[ 0xFF ] would be a UTF-8 unicode string according to definition 1, but not according to definitions 2 and 3, as 0xFF can never appear in a sequence of code units that comes from a UTF-8 encoded unicode scalar value.
[ 0xB0 ] would be a UTF-8 unicode string according to definition 3, but not according to definition 2, as it is a continuation byte of a multi-byte sequence and cannot stand on its own.
I am just lost with this "standard". Do you have any clear definition?
My feeling was that a unicode string is a sequence of unicode scalar values.
No, a Unicode string is a sequence of code units. The standard doesn't contain "many definitions", but only a single one:
D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
This doesn't require the string to be well-formed (see the following definitions). None of your other quotes from the standard contradict this definition. To the contrary, they only illustrate that a Unicode string, as defined by the standard, can be ill-formed.
An application shall only create well-formed strings, of course:
If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.
But the standard also contains some sections on how to deal with ill-formed input sequences.
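The distinction between "Unicode string" (any code unit sequence) and "well-formed" can be checked with a strict decoder. The following is a rough sketch in Python, used here only as a convenient validator; the helper names is_well_formed and utf16_units are made up for illustration. It replays the examples from the question and the <004D D800> + <DF02 004D> concatenation quoted from the standard.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: True if data decodes strictly in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def utf16_units(*units: int) -> bytes:
    """Hypothetical helper: pack 16-bit code units as big-endian bytes."""
    return b"".join(unit.to_bytes(2, "big") for unit in units)


# UTF-8: both byte sequences are 8-bit code unit sequences, neither is well-formed.
print(is_well_formed(bytes([0xFF]), "utf-8"))
print(is_well_formed(bytes([0xB0]), "utf-8"))
# Yet 0xB0 does occur inside a well-formed sequence, e.g. U+00B0 encoded as C2 B0.
print(bytes([0xB0]) in "\u00b0".encode("utf-8"))

# UTF-16: two ill-formed halves concatenate into a well-formed whole.
first = utf16_units(0x004D, 0xD800)   # ends with an unpaired high surrogate
second = utf16_units(0xDF02, 0x004D)  # starts with an unpaired low surrogate
print(is_well_formed(first, "utf-16-be"), is_well_formed(second, "utf-16-be"))
print(is_well_formed(first + second, "utf-16-be"))
```

In the standard's terms, all five inputs are Unicode strings; only the ones that decode are in a Unicode encoding form.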

addPortalMessage requires decode('utf-8')

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
If Titles in Plone are already UTF-8 encoded, the string is a unicode string and the underscore function is handled by i18ndude, I do not see a reason why we specifically need to decode utf-8. Usually I forget to add it and remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?
UTF-8 is a representation of Unicode, not Unicode and not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and representations of unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python 2 will automatically convert a byte string to unicode when it needs to, using the ASCII decoder. That fails if the string contains non-ASCII bytes, which is the case, for example, when it holds UTF-8 encoded text with accented characters.
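The same failure can be reproduced explicitly. The sketch below is written for Python 3, where the implicit conversion no longer exists and the decode must always be spelled out; the sample title is made up, and the Plone calls from the question are left out so the snippet stays self-contained.

```python
# A UTF-8 encoded title, as a byte string (the sample value is an assumption).
title_bytes = "Caf\u00e9".encode("utf-8")

# Decoding with the right codec gives text that can be interpolated safely.
title = title_bytes.decode("utf-8")
message = "This document (%s) has already been uploaded." % title
print(message)

# Decoding with the ASCII codec, which is what Python 2 tried implicitly,
# fails as soon as a non-ASCII byte shows up.
ascii_error = None
try:
    title_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
print(type(ascii_error).__name__)
```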

Unicode characters xn--ls8h

I want to know how to write the unicode Emoji characters in this form "xn--ls8h" <-- that is the pile of poo emoji unicode character. I had never seen this form, always something like &#5623*; (no asterisk) or something like that... What is this "xn--" form and how do I convert to it? Thanks!
xn-- is the prefix used in the ASCII representation of an Internationalized Domain Name, and ls8h is the Punycode representation of the character.
In Python, Punycode is one of the standard character encodings:
>>> b'ls8h'.decode('punycode')
'\U0001f4a9'
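Going the other way, which is what the question asks for, is the matching encode call plus the xn-- prefix. The following is a simplified sketch with made-up helper names (to_ace_label, from_ace_label); it handles a single label and deliberately skips the normalization and validation steps that real IDNA processing performs.

```python
ACE_PREFIX = "xn--"


def to_ace_label(label: str) -> str:
    """Hypothetical helper: Punycode-encode one label, no IDNA validation."""
    if label.isascii():
        return label  # pure ASCII labels are left untouched
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace_label(label: str) -> str:
    """Hypothetical helper: inverse of to_ace_label."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(to_ace_label("\U0001f4a9"))
print(from_ace_label("xn--ls8h") == "\U0001f4a9")
```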

Efficient way to ASCII encode UTF-8

I'm looking for a simple and efficient way to store UTF-8 strings in 7-bit ASCII. By efficient I mean the following:
all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
the resulting string should be as short as possible
the operation needs to be reversible without any data loss
the resulting ASCII string should be case insensitive
there should be no restriction on the input length
the whole UTF-8 range should be allowed
My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.
Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.
UTF-7, or, slightly less transparent but more widespread, quoted-printable.
all ASCII chars in the input should stay ASCII chars in the output
(Obviously not fully possible as you need at least one character to act as an escape.)
Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, is 7-bits long, and encodes the full Unicode range is not possible.
Edited to add:
I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.
If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.
Since the hexadecimal representation can encode any arbitrary binary values, it might be possible to shrink the string by compressing them before taking the hex values.
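The hexadecimal idea, with the optional compression step, could look roughly like this. It is a sketch under assumptions: the function names and the choice of zlib are made up for illustration, and for short strings the compressed form is usually longer than the plain one, so compression only pays off on longer input.

```python
import zlib


def to_hex(text: str, compress: bool = False) -> str:
    """Hypothetical helper: UTF-8 bytes (optionally zlib-compressed) as hex."""
    raw = text.encode("utf-8")
    if compress:
        raw = zlib.compress(raw)
    return raw.hex()


def from_hex(encoded: str, compressed: bool = False) -> str:
    """Inverse of to_hex; bytes.fromhex accepts upper- and lowercase digits."""
    raw = bytes.fromhex(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")


original = "Gr\u00fc\u00dfe \U0001f4a9"
encoded = to_hex(original)
# Changing the case of the encoded form does not affect decoding.
print(from_hex(encoded.upper()) == original)
```

Note that this meets the case-insensitivity and reversibility requirements but not the one about ASCII alphanumerics staying unchanged, which is the trade-off the answer describes.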
If you're talking about non-standard schemes - MECE
URL encoding or numeric character references are two possible options.
It depends on the distribution of characters in your strings.
Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
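To see how the distribution of characters affects the result, the candidate encodings can simply be measured on representative input. This is a small sketch using Python's standard codecs; the function name encoded_sizes is made up, and the numbers for long strings include the soft line breaks that quoted-printable inserts.

```python
import base64
import quopri


def encoded_sizes(text: str) -> dict:
    """Hypothetical helper: length in bytes of text under several encodings."""
    raw = text.encode("utf-8")
    return {
        "utf-8": len(raw),
        "quoted-printable": len(quopri.encodestring(raw)),
        "base64": len(base64.b64encode(raw)),
        "utf-7": len(text.encode("utf-7")),
    }


for sample in ("hello", "\u00e9", "\U0001f4a9"):
    print(repr(sample), encoded_sizes(sample))
```

Plain ASCII passes through quoted-printable unchanged, while each non-ASCII UTF-8 byte becomes three characters, which is where the 6 to 12 bytes per character comes from.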
Punycode is used for IDNA, but you can use it outside the restrictions imposed by it
Per se, Punycode doesn't fail your last 2 requirements:
>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True
(for idna, python supplies another homonymous encoding)
obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore... but if you supply only lowercase (or if you don't care about the decoded case) you should be good to go

How to convert from unicode to ASCII

Is there any way to convert unicode values to ASCII?
To simply strip the accents from unicode characters you can use something like:
string.Concat(input.Normalize(NormalizationForm.FormD).Where(
c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
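For comparison, the same decompose-then-drop-marks approach can be sketched in Python, the language used elsewhere on this page. The function name strip_marks is made up for illustration; "Mn" is the Unicode general category for nonspacing marks, the counterpart of UnicodeCategory.NonSpacingMark.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical helper: remove combining marks, keep base characters."""
    # NFD (canonical decomposition) corresponds to NormalizationForm.FormD.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_marks("Cr\u00e8me br\u00fbl\u00e9e"))
```

Like the C# version, this only removes accents; characters without a decomposition, such as "ø" or "ß", are left as they are and remain non-ASCII.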
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have. Almost the only thing you can do that is even close to the right thing is to discard all characters above codepoint 127, and even that is very likely nowhere near what your requirements say. (The other possibility is to simplify accented or umlauted letters to make more than 128 characters 'nearly' expressible, but that still doesn't even begin to actually cover Unicode.)
Technically, yes you can by using Encoding.ASCII.
Example (from byte[] to ASCII):
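// Convert the string to UTF-16 bytes (Encoding.Unicode is UTF-16LE)
byte[] uni = Encoding.Unicode.GetBytes("Whatever unicode string you have");
// Re-encode the bytes as ASCII; characters with no ASCII equivalent become '?'
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, uni);
// Decoding the UTF-16 bytes directly as ASCII would interleave NUL characters
string ascii = Encoding.ASCII.GetString(asciiBytes);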
Just remember that Unicode is a much larger standard than ASCII, and there will be characters that simply cannot be correctly encoded. Have a look here for tables and a little more information on the two encodings.
This workaround might better suit your needs. It strips the non-ASCII characters from a string and keeps only the ASCII characters.
byte[] bytes = Encoding.ASCII.GetBytes("eéêëèiïaâäàåcç \u00A0test");
char[] chars = Encoding.ASCII.GetChars(bytes);
string line = new String(chars);
line = line.Replace("?", "");
//Results in "eiac test"
Please note that the 2nd "space" in the input string is a non-breaking space (U+00A0, typed as Alt+255), which is not an ASCII character and is therefore stripped as well.
It depends what you mean by "convert".
You can transliterate using the AnyAscii package.
// C#
using AnyAscii;
string s = "άνθρωποι".Transliterate();
// anthropoi
Well, seeing as how there's some 100,000+ unicode characters and only 128 ASCII characters, a 1-1 mapping is obviously impossible.
You can use the Encoding.ASCII object to get the ASCII byte values from a Unicode string, though.
If your metadata fields only accept ASCII input, Unicode characters can be converted to their TeX equivalents through MathJax. What is MathJax?
MathJax is a JavaScript display engine for rendering TeX or MathML-coded mathematics in browsers without requiring font installation or browser plug-ins. Any modern browser with JavaScript enabled will be MathJax-ready. For general information about MathJax, visit mathjax.org.