is there a way to decipher a given encoding? - unicode

On Twitter, this user: https://twitter.com/Rockprincess818
seems to have used creative encoding techniques to achieve special formatting:
They list their name as:
𝓛𝓲𝓼𝓪
And their bio as:
๐ˆ'๐ฆ ๐ง๐จ๐ญ ๐ก๐ž๐ซ๐ž ๐Ÿ๐จ๐ซ ๐ฒ๐จ๐ฎ๐ซ ๐š๐ฆ๐ฎ๐ฌ๐ž๐ฆ๐ž๐ง๐ญ. ๐˜๐จ๐ฎ'๐ซ๐ž ๐ก๐ž๐ซ๐ž ๐Ÿ๐จ๐ซ ๐ฆ๐ข๐ง๐ž.
None of this seems to be a standard encoding (nor even English -- though I could be wrong about this).
My questions:
What did they do to achieve this special formatting?
How does one decipher such non-normal text to understand what's going on?

1) There are many online generators (e.g. this one or this one) that let users convert normal text to a fancy graphical representation by replacing Latin-alphabet letters with similar-looking Unicode symbols.
2) The most obvious way to decipher such text back to normal Latin characters is to find out which tools the user used and what mappings those tools employ. You could then map the fancy Unicode codepoints back to Latin characters. You can discover the mappings, e.g., by converting "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" to "cursive" with those tools and analyzing the output, as sketched below.
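As a rough illustration of approach 2), here is a small Python sketch. The "fancy" alphabet is generated directly from the Mathematical Bold Script range (U+1D4D0..U+1D503) as a stand-in for whatever a particular generator actually emits; in practice you would paste the generator's own output there.

plain = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
# Stand-in for the generator's output for the line above (here: Mathematical Bold Script).
fancy = "".join(chr(0x1D4D0 + i) for i in range(52))

# Build the reverse mapping: fancy codepoint -> plain Latin letter.
to_plain = {f: p for f, p in zip(fancy, plain)}

def unfancy(text):
    # Replace every known fancy character; pass everything else through untouched.
    return "".join(to_plain.get(ch, ch) for ch in text)

print(unfancy("𝓛𝓲𝓼𝓪"))  # -> Lisa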

The Unicode standard has a concept of compatibility, which allows some codepoints to be defined as equivalent to others. Given the strings in the question, NFKC normalisation (Normalization Form KC: compatibility decomposition followed by canonical composition) can be applied to recover the equivalent Latin characters. Programming languages typically provide tools to apply normalisation programmatically.
In JavaScript, the String.prototype.normalize method may be used:
name = '𝓛𝓲𝓼𝓪'
"𝓛𝓲𝓼𝓪"
bio = "𝐈'𝐦 𝐧𝐨𝐭 𝐡𝐞𝐫𝐞 𝐟𝐨𝐫 𝐲𝐨𝐮𝐫 𝐚𝐦𝐮𝐬𝐞𝐦𝐞𝐧𝐭. 𝐘𝐨𝐮'𝐫𝐞 𝐡𝐞𝐫𝐞 𝐟𝐨𝐫 𝐦𝐢𝐧𝐞."
"𝐈'𝐦 𝐧𝐨𝐭 𝐡𝐞𝐫𝐞 𝐟𝐨𝐫 𝐲𝐨𝐮𝐫 𝐚𝐦𝐮𝐬𝐞𝐦𝐞𝐧𝐭. 𝐘𝐨𝐮'𝐫𝐞 𝐡𝐞𝐫𝐞 𝐟𝐨𝐫 𝐦𝐢𝐧𝐞."
name.normalize('NFKC')
"Lisa"
bio.normalize('NFKC')
"I'm not here for your amusement. You're here for mine."
In Python, the unicodedata.normalize function may be used:
>>> import unicodedata as ud
>>> name = '𝓛𝓲𝓼𝓪'
>>> bio = "𝐈'𝐦 𝐧𝐨𝐭 𝐡𝐞𝐫𝐞 𝐟𝐨𝐫 𝐲𝐨𝐮𝐫 𝐚𝐦𝐮𝐬𝐞𝐦𝐞𝐧𝐭. 𝐘𝐨𝐮'𝐫𝐞 𝐡𝐞𝐫𝐞 𝐟𝐨𝐫 𝐦𝐢𝐧𝐞."
>>> ud.normalize('NFKC', name)
'Lisa'
>>> ud.normalize('NFKC', bio)
"I'm not here for your amusement. You're here for mine."

Related

How can I normalize fonts?

Users sometimes use weird ASCII characters in a program, and I was wondering if there was a way to "normalize" it.
So basically, if the input is ᴀʙᴄᴅᴇꜰɢ, the output would be ABCDEFG. Is there a dictionary somewhere that does something like this? If not, is there a better method than just doing something like str.replace("ᴀ", "A") for all the different "fonts"?
This isn't a language-specific question -- if nothing like this exists, then I guess the next step is to create a dictionary myself.
Yes.
BTW, the technical terms are: Latin Capital Letters, from the C0 Controls and Basic Latin block, and Latin Letter Small Capitals, from the Phonetic Extensions block.
Anyway, the general topic for your question is Unicode confusables. The link is for a mapping. Unicode.org has more material on confusables and everything else Unicode.
(Normalization is always something to consider when processing Unicode text, but it doesn't particularly relate to this issue.)
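For concreteness, here is a minimal Python sketch of using that confusables mapping, assuming the published confusables.txt data file from UTS #39 (semicolon-separated fields: source codepoint, target codepoint sequence, type, with '#' starting a comment). Note that UTS #39 intends the resulting "skeleton" for detecting look-alike strings, not for producing polished display text, so the output may differ in case or form from what you would type by hand.

def load_confusables(path="confusables.txt"):
    # Parse https://www.unicode.org/Public/security/latest/confusables.txt
    mapping = {}
    with open(path, encoding="utf-8-sig") as f:        # utf-8-sig skips the BOM
        for line in f:
            line = line.split("#", 1)[0].strip()       # drop comments and blank lines
            if not line:
                continue
            fields = [field.strip() for field in line.split(";")]
            source, target = fields[0], fields[1]
            mapping[chr(int(source, 16))] = "".join(
                chr(int(cp, 16)) for cp in target.split()
            )
    return mapping

def skeleton(text, mapping):
    # Replace each confusable character by its prototype; leave the rest alone.
    # (The full UTS #39 skeleton also applies NFD before and after this step.)
    return "".join(mapping.get(ch, ch) for ch in text)

# Two strings are considered confusable when their skeletons compare equal.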
Your example seems to involve Unicode characters, not ASCII characters. Unicode normalization (FAQ) is a large and complex subject, with many different equivalence classes of characters, depending on what you are trying to do.

Strategy for defining Unicode Ranges by Culture

I am new to Unicode and have been given the requirement to look at some translated text, iterate over all of the characters of that translation, and determine whether all the characters are valid for the target culture (language and location).
For example, if I am translating a document from English to Greek, I want to detect any English/ASCII "A"s in the Greek translation and report that as an error. This could easily happen with corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there any existing strategy for developing this kind of grouping? I see that there is some grouping of characters at (http://www.unicode.org/charts/). But it seems that this is not quite what I am looking for at first glance.
Does anything exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"?
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (the Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: "main/standard", "auxiliary", "index", and "punctuation".
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data at CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that in normal Greek texts, Latin letters are not uncommon, especially in technical areas. For example, the international symbol for the ampere is "A" as a Latin letter; the symbol for the kilogram is "kg", in Latin letters, even though the word for it is written in Greek letters in Greek.
Thus, no matter how you run the analysis, the occurrence of Latin "A" in Greek text could be flagged as potentially suspicious, but not as an error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.
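As a crude illustration of the flagging idea (this is not CLDR data; it just uses Unicode character names as a proxy for script), here is a Python sketch that reports letters outside the expected script. Per the caveat above, a Latin "A" in Greek text is reported as suspicious rather than rejected outright. A production check should draw on the CLDR exemplar sets, e.g. through ICU.

import unicodedata

def suspicious_chars(text, expected_script_prefix="GREEK"):
    # Collect letters whose Unicode name does not start with the expected script name.
    flagged = []
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if not name.startswith(expected_script_prefix):
            flagged.append((ch, name))
    return flagged

print(suspicious_chars("Ρεύμα 2 A"))  # flags the Latin "A" used as the ampere symbol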

Unicode comparison of Cyrillic 'С' and Latin 'C'

I have a dataset which mixes use of the Unicode characters \u0421 ('С') and \u0043 ('C'). Is there some sort of Unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with "confusables": characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as "intentionally confusable" pairs, i.e. "characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design", which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated as similar in use to other codepoints; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when Punycode was devised. Other than that, your best bet might be to search the data for characters outside the expected range using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace(/с/g, 'c').
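If only a handful of known Cyrillic/Latin pairs matter for your data, an ad-hoc table along those lines may be enough. A small Python sketch (the pairs below are an illustrative subset, not the full UTS #39 confusables data):

# Hand-picked Cyrillic -> Latin look-alikes (illustrative subset only).
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0421": "C",  # С CYRILLIC CAPITAL LETTER ES -> Latin C
    "\u0441": "c",  # с CYRILLIC SMALL LETTER ES   -> Latin c
    "\u0410": "A",  # А CYRILLIC CAPITAL LETTER A  -> Latin A
    "\u0430": "a",  # а CYRILLIC SMALL LETTER A    -> Latin a
    "\u041E": "O",  # О CYRILLIC CAPITAL LETTER O  -> Latin O
    "\u043E": "o",  # о CYRILLIC SMALL LETTER O    -> Latin o
})

def fold_confusables(text):
    return text.translate(CYRILLIC_TO_LATIN)

print(fold_confusables("\u0421") == "C")  # True: Cyrillic ES becomes Latin C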

Unicode to ASCII: Standardized transcription?

My other question brought up a related question:
Is there a standard table of Unicode-to-ASCII transcriptions? Think, for instance, of German ü mapping to ue.
User bobince mentioned in a comment that other languages use the same character in a different way, and I fear they may not only use the same glyph but also the same codepoint. Hence mapping e.g. "ü" to "u" would also be acceptable (mapping by visual similarity), as is mapping ü to "u as done by iconv (see for instance the link posted by Juancho).
The methods shown in the link posted by Juancho are technically working solutions. However, is there a formal standard for such a mapping, or at least a mapping used as a quasi-standard? Ideally it would also include, for instance, phonetics-based transcriptions for non-Latin characters. I remember that one exists for Japanese kana and Greek characters. It shouldn't be a big problem in that regard either.
There is no formal standard on such mappings. Mappings that deal with Latin letters in general (like ü, é and ß), mapping them all to Ascii, are not really transcriptions or transliterations but just, well, mappings, which might be called simplifications or Asciifications. They are performed for various purposes, often in an ad hoc way.
Mapping ü to ue is rather common in German and might be called an unofficial or de facto standard for German names when ü cannot be used. But other languages use other rules, and it would be odd to Asciify French or Spanish that way; instead, the diacritic would just be dropped, mapping ü to u.
People may map e.g. ü to u" when they are forced (or they believe they are forced) to use Ascii and yet want to convey the message that the u has a diaeresis on it.
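To make the difference concrete, here is a small Python sketch of one common ad-hoc approach (not a standard): apply language-specific replacements first, then decompose with NFKD and drop the combining marks.

import unicodedata

# Illustrative German-specific rules; other languages would use different tables.
GERMAN_OVERRIDES = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def asciify(text, overrides=None):
    for src, repl in (overrides or {}).items():
        text = text.replace(src, repl)
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping combining marks leaves the bare base letters (ü -> u, é -> e).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(asciify("über", GERMAN_OVERRIDES))  # ueber
print(asciify("über"))                    # uber  (diacritic simply dropped)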

Romanization of Unicode text

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.
Examples:
Greek: Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")
Japanese: Romanize("しんばし") returns "shimbashi" (or "sinbasi")
Russian: Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")
It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data-driven and extensible, using data from the Unicode Consortium, the USA, the EU, or the UN. The code should be open source, written in .NET or Java.
Does such a library exist?
The problem is a lot more complex than you think.
Greek, Cyrillic, Indic scripts, Georgian -> trivial, you could program that in an hour (see the sketch after this list)
Thai, Japanese Kana -> doable with a bit more effort
Japanese Kanji, Chinese -> these are not alphabets/syllabaries, so you're not in fact transliterating; you're looking up the pronunciation of each symbol in a hopefully large dictionary (EDICT and CCDICT should work), and a lot of times you'll get it wrong unless you also consider the context, especially in Japanese
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database, I'm not aware of any
Arabic, Hebrew -> these languages don't write down short vowels, so a lot of times your transliteration will be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
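To illustrate the "trivial" end of that list, here is a deliberately incomplete Python sketch of character-by-character Greek romanization. The table is partial and roughly phonetic; a real scheme needs the full alphabet plus digraph, accent, and casing rules.

import unicodedata

GREEK_TO_LATIN = {  # partial table, lowercase only
    "α": "a", "β": "v", "η": "i", "ι": "i", "κ": "k", "λ": "l",
    "ο": "o", "σ": "s", "ς": "s", "τ": "t", "φ": "f",
}

def romanize_greek(text):
    # Strip accents first so accented vowels fall back to their plain mappings.
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    return "".join(GREEK_TO_LATIN.get(ch.lower(), ch) for ch in stripped)

print(romanize_greek("Αλφαβητικός"))  # alfavitikos (case handling omitted)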
You can use Unidecode Sharp:
It is a C# port of the Python Unidecode package, which is itself a port of the Perl unidecode module.
(There are also PHP and Ruby implementations available.)
Usage:
using BinaryAnalysis.UnidecodeSharp;

// ... elsewhere in your code:
string _Greek = "Αλφαβητικός";
MessageBox.Show(_Greek.Unidecode());    // ASCII approximation of the Greek text
string _Japan = "しんばし";
MessageBox.Show(_Japan.Unidecode());    // ASCII approximation of the kana
string _Russian = "яйца Фаберже";
MessageBox.Show(_Russian.Unidecode());  // ASCII approximation of the Russian text
I hope it will work well for you.
I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose that can deal with the icky cases like Chinese words, multiple Japanese readings, and incomplete Arabic orthography.
The Unicode Common Locale Data Repository has some transliteration mappings you could use.
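For example, ICU ships with transliterators built on the CLDR transform data. A quick sketch using the PyICU bindings (assuming PyICU is installed; the "Any-Latin; Latin-ASCII" compound transform first romanizes, then strips the remaining non-ASCII marks):

import icu  # PyICU

romanize = icu.Transliterator.createInstance("Any-Latin; Latin-ASCII")

for sample in ["Αλφαβητικός", "яйца Фаберже", "しんばし"]:
    print(sample, "->", romanize.transliterate(sample))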