Unicode characters xn--ls8h - unicode

I want to know how to write the unicode Emoji characters in this form "xn--ls8h" <-- that is the pile of poo emoji unicode character. I had never seen this form, always something like &#5623*; (no asterisk) or something like that... What is this "xn--" form and how do I convert to it? Thanks!

xn-- is the prefix used in the ASCII representation of an Internationalized Domain Name, and ls8h is the Punycode representation of the character.
In Python, Punycode is one of the standard character encodings:
>>> b'ls8h'.decode('punycode')
'\U0001f4a9'

Related

Mapping arbitrary unicode alphanumeric characters to their ascii equivalents

When I encounter arbitrary unicode string, such as in a hashtag, I would like to express only its alphanumeric components in a string of their ascii equivalents. For example,
x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
would be rendered as
x='Patriot'
Since I cannot anticipate the unicode that could appear in such strings, I would like the method to be as general as possible. Any suggestions?
The unicodedata.normalize method can translate Unicode code points to a canonical value. Then, run the value through ascii encoding ignoring non-ASCII values for a byte string, then back through ascii decode to get a Unicode string again:
>>> x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
>>> ud.normalize('NFKC',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
If you need to removed accents from letters, but still keep the base letter, use 'NFKD' instead.
>>> x='€𝙋𝙖𝙩𝙧𝙞ô𝙩'
>>> ud.normalize('NFKD',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'

How to convert from unicode to ASCII

Is there any way to convert unicode values to ASCII?
To simply strip the accents from unicode characters you can use something like:
string.Concat(input.Normalize(NormalizationForm.FormD).Where(
c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have. Almost the only thing you can do that is even close to the right thing is to discard all characters above codepoint 128, and even that is very likely nowhere near what your requirements say. (The other possibility is to simplify accented or umlauted letters to make more than 128 characters 'nearly' expressible, but that still doesn't even begin to actually cover Unicode.)
Technically, yes you can by using Encoding.ASCII.
Example (from byte[] to ASCII):
// Convert Unicode to Bytes
byte[] uni = Encoding.Unicode.GetBytes("Whatever unicode string you have");
// Convert to ASCII
string Ascii = Encoding.ASCII.GetString(uni);
Just remember Unicode a much larger standard than Ascii and there will be characters that simply cannot be correctly encoded. Have a look here for tables and a little more information on the two encodings.
This workaround might better suit your needs. It strips the unicode chars from a string and only keeps the ASCII chars.
byte[] bytes = Encoding.ASCII.GetBytes("eéêëèiïaâäàåcç  test");
char[] chars = Encoding.ASCII.GetChars(bytes);
string line = new String(chars);
line = line.Replace("?", "");
//Results in "eiac test"
Please note that the 2nd "space" in the character input string is the char with ASCII value 255
It depends what you mean by "convert".
You can transliterate using the AnyAscii package.
// C#
using AnyAscii;
string s = "άνθρωποι".Transliterate();
// anthropoi
Well, seeing as how there's some 100,000+ unicode characters and only 128 ASCII characters, a 1-1 mapping is obviously impossible.
You can use the Encoding.ASCII object to get the ASCII byte values from a Unicode string, though.
If your metadata fields only accept ASCII input. Unicode characters can be converted to their TEX equivalent through MathJax. What is MathJax?
MathJax is a JavaScript display engine for rendering TEX or MathML-coded mathematics in browsers without requiring font installation or browser plug-ins. Any modern browser with JavaScript enabled will be MathJax-ready. For general information about MathJax, visit mathjax.org.

to extract characters of a particular language

how can i extract only the characters in a particular language from a file containing language characters, alphanumeric character english alphabets
This depends on a few factors:
Is the string encoded with UTF-8?
Do you want all non-English characters, including things like symbols and punctuation marks, or only non-symbol characters from written languages?
Do you want to capture characters that are non-English or non-Latin? What I mean is, would you want characters like é and ç or would you only want characters outside of Romantic and Germanic alphabets?
and finally,
What programming language are you wanting to do this in?
Assuming that you are using UTF-8, you don't want basic punctuation but are okay with other symbols, and that you don't want any standard Latin characters but would be okay with accented characters and the like, you could use a string regular expression function in whatever language you are using that searches for all non-Ascii characters. This would elimnate most of what you probably are trying to weed out.
In php it would be:
$string2 = preg_replace('/[^(\x00-\x7F)]*/','', $string1);
However, this would remove line endings, which you may or may not want.

Japanese ASCII Code

Where can I get a list of ASCII codes corresponding to Japanese kanji, hiragana and katakana characters. I am doing a java function and Javascript which determines wether it is a Japanese character. What is its range in the ASCII code?
ASCII stands for American Standard Code for Information Interchange, only includes 128 characters (not all of them even printable), and is based on the needs of American use circa 1960. It includes nothing related to any Japanese characters.
I believe you want the Unicode code points for some characters, which you can lookup in the charts provided by unicode.org.
Please see my similar question regarding Kanji/Kana characters. As #coobird mentions it may be tricky to decide what range you want to check against since many Kanji overlap with Chinese characters.
In short, the Unicode ranges for hiragana and katakana are:
Hiragana: Unicode: 3040-309F
Katakana: Unicode: 30A0–30FF
If you find this answer useful please upvote #coobird's answer to my question as well.
がんばって!
Well it has been a while, but here's a link to tables of hiragana, katakana, kanji etc and their Unicodes...
http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
BUT, as you probably know Unicodes are hexadecimal. You can translate them into decimal numbers using Windows Calc in programmer mode and then input that number as an ASCII code and it will produce the character you want, well depending on what you're putting it into. It will in MS Wordpad and Word(not Notepad).
For example the hiragana ぁ is 3041 in Unicode. 3041 is hexadecimal and translates to 12353 in decimal. If you enter 12353 as an ASCII code into Wordpad or Word i.e hold Alt, enter 12353 on the number-pad then release Alt, it will print ぁ. The range of Japanese characters seems to be Hiragana:3040 - 309f(12352-12447 in ASCII), Katakana:30a0 - 30ff(12448-12543 in ASCII), Kanji: 4e00-4DB5(19968-19893 ASCII), so there are several ranges. There's also a half-width katakana range on that chart.
Japanese characters won't be in the ASCII range, they'll be in Unicode. What do you want, just the char value for each character?
I won't rehash the ASCII part. Just have a look at the Unicode Code Charts.
Kanji will have a Unicode "Script" property of Hani, hiragana will have a "Script" property of Hira, and katakana have a "Script" property of Kana. In Java, you can determine the "Script" property of a character using the Character.UnicodeScript class: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.UnicodeScript.html I don't know if you can determine a character's "Script" property in Javascript.
Of course, most kanji are characters that are also used in Chinese; given a character like 猫, it is impossible to tell whether it's being used as a Chinese character or a Japanese character.
I think what you mean by ASCII code for Japanese is the SBCS (Single Byte Character Set) equivalent in Japanese. For Japanese you only have a MBCS (Multi-Byte Character Sets) that has a combination of single byte character and multibyte characters. So for a Japanese text file saved in MBCS you have non-Japanese characters (english letters and numbers and common non-alphanumeric characters) saved as one byte and Japanese characters saved as two bytes.
Assuming that you are not referring to UNICODE which is a uniform DBCS (Double Byte Character Set) where each character is exactly two bytes. Actually to be more correct lately UNICODE also has multiple DBCS because the character set could not accomodate other character anymore. Some UNICODE character consiste of 4 bytes already having the first two bytes as leading character.
If you are referring to The first one (MBCS) that and not UNICODE then there are a lot of Japanese character set like Shift-JIS (the more popular one). So I suggest that you search Shift-JIS character map. Although there are other Japanese character set map aside from Shift-JIS.

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unicode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode ($input_to_be_converted);
If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?
If you get cyrilic text there is no "closest ASCII representation" for many characters.