What's the simplest and most reliable way to get a string containing nothing but _all Unicode emojis_?

Using the command line, a script, etc.
http://www.unicode.org/emoji/

"Unicode symbols" don't have a clean "plane of Unicode blocks" (see Wikipedia: Unicode symbols) and to make matters worse there doesn't exist a definite subset of Unicode symbols that can be considered "proper emojis" (as far as I know).
E.g. the emojis covered by Twitter don't cover all of the Unicode symbols but are dispersed all over the place (though particularly concentrated in some areas over others, such as Miscellaneous Symbols, Emoticons, Miscellaneous Symbols and Pictographs, Transport and Map Symbols, and Dingbats).
But once you've settled on the Unicode block ranges that you want you can easily print them out, I'll be using the "Emoticons (U+1F600–U+1F64F)" range as an example, since they are "the most universally considered emojis".
var emojis = "";
var code = parseInt("1F600", 16);                // start of the Emoticons block
while (code <= parseInt("1F64F", 16)) {          // end of the Emoticons block
    emojis += String.fromCodePoint(code);        // fromCodePoint takes numeric code points
    code += 1;
}
console.log(emojis);
parseInt() - JavaScript | MDN
String.fromCodePoint() - JavaScript | MDN
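The same approach extends naturally to more than one block. Here is a minimal sketch of the idea in Python; the ranges are the commonly cited blocks mentioned above, not an official list of all emoji, so treat them as an assumption and adjust to taste.

# Concatenate every code point in a handful of emoji-heavy blocks.
# NOTE: these ranges are illustrative; they are the blocks named above,
# not a definitive list of all emoji (some code points are unassigned).
blocks = [
    (0x2600, 0x26FF),    # Miscellaneous Symbols
    (0x2700, 0x27BF),    # Dingbats
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
]
emojis = "".join(chr(cp) for start, end in blocks for cp in range(start, end + 1))
print(emojis)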

Related

Unit Separator "us"

I've seen the unit separator represented as different symbols (I've provided links to each one). What's the difference between each one? I'm working on a project and the only symbol that works is the "us" symbol.
Unit Separator Symbol #1:
Unit Separator Symbol #2:
Unit Separator Symbol #3:
The unit separator is one of the many ASCII control codes, so it dates from very early computing: FS, GS, RS, and US can be used to split data into files, groups, records, and units (e.g. on a serial console).
Such control characters are still treated as control characters in Unicode (so in the modern world), and therefore have no real symbol of their own.
And then things may get complex. Text processors, shaping engines and/or fonts may handle control characters differently: either purely as controls, possibly ignoring them if they assign no semantics to them, or by trying to display something. One common convention is to use U+241F (SYMBOL FOR UNIT SEPARATOR) from the Control Pictures block (U+2400–U+243F), which contains a symbol for every ASCII control code. Note that fonts render these differently: some as boxed text with an abbreviation, some as small letters arranged diagonally.
Note that old 256-character fonts reused the control-code positions for extra symbols; see e.g. the default DOS code page, https://en.wikipedia.org/wiki/Code_page_437, where you can see your symbol: the black triangle ("black" in font terminology means filled, not just the outline). There were also special methods for printing those positions instead of interpreting them as control characters, and different systems put different symbols on the control codes.
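As a quick illustration of the Control Pictures convention mentioned above, a minimal Python sketch (it simply substitutes the visible stand-in; how it looks still depends on the font):

# The Control Pictures block starts at U+2400 and mirrors the C0 control codes,
# so the symbol for a control code 0x00-0x1F is just 0x2400 + code.
def control_picture(code):
    if 0x00 <= code <= 0x1F:
        return chr(0x2400 + code)
    raise ValueError("not a C0 control code")

print(control_picture(0x1F))  # U+241F SYMBOL FOR UNIT SEPARATOR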

Why do the printed unicode characters change?

How the Unicode symbol is displayed depends on whether I put the White Heavy Check Mark or the Negative Squared Cross Mark directly before it. If I do, the Warning Sign is coloured. If I put a space between the symbols, I get the mono-coloured, text-like version.
Why does this behaviour exist and can I force the coloured symbol somehow?
I tried a couple of different REPLs, the behaviour was the same.
; No colour
(str (char 0x274e) " " (char 0x26A0))
; Coloured
(str (char 0x274e) "" (char 0x26A0))
Clojure unicode display.
I expect the symbol to be displayed the same way regardless of which symbol comes before it.
Why does this behaviour exist
A vendor thought it would be a neat idea to render emoji glyphs in colour. The idea caught on.
https://en.wikipedia.org/wiki/Emoji#Emoji_versus_text_presentation
can I force the coloured symbol somehow
U+FE0E VARIATION SELECTOR-15 (requests text presentation) and U+FE0F VARIATION SELECTOR-16 (requests emoji presentation), appended immediately after the symbol.
http://mts.io/2015/04/21/unicode-symbol-render-text-emoji/
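For example, a minimal sketch of the variation-selector approach (shown in Python rather than Clojure purely for illustration; whether the selector is honoured still depends on the terminal and font):

# U+FE0E requests text (monochrome) presentation, U+FE0F requests emoji (colour)
# presentation for the preceding character.
WARNING = "\u26A0"
print(WARNING + "\uFE0E")  # text-style warning sign
print(WARNING + "\uFE0F")  # emoji-style (coloured) warning sign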
Unicode is about characters (code points), not glyphs (think of a glyph as the "image" of a character).
Fonts are free to (and should) merge neighbouring characters into a single glyph. In printed Latin scripts this is not very common (though we do have ligatures such as ff, fi, ffi), leaving aside the combining code points, which by definition combine with other characters to produce a single glyph.
Many other scripts require it, starting with cursive Latin, and most cursive scripts require contextual changes. Arabic, for example, has different glyphs for the initial, medial, final, or isolated form of a character (plus special combinations, as is common in cursive scripts). Indic scripts behave similarly.
So this behaviour is part of the foundations of Unicode, and good modern fonts should be able to do it.
It did not take long for emoji to use the same machinery, from country letters (which pair up into flags) to other common cases.
The Unicode documentation often tells you about such possibilities and about the special code points that can change behaviour, but it is then the font's task to fulfil the expected behaviour (and to find good glyphs).
In short: characters (Unicode code points) do not map one-to-one to designs (glyphs).
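As a concrete illustration of the flag case mentioned above (a minimal sketch; whether you actually see a single flag glyph depends on the font and renderer):

# Two regional indicator symbols combine into one flag glyph in emoji-capable fonts:
# U+1F1FA REGIONAL INDICATOR SYMBOL LETTER U + U+1F1F8 REGIONAL INDICATOR SYMBOL LETTER S
flag = "\U0001F1FA\U0001F1F8"
print(flag)       # rendered as one flag glyph by an emoji font, as two letters otherwise
print(len(flag))  # 2 -- still two code points underneath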

How to search for any unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich-text entries; these entries contain symbols such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-numbered symbols are not considered "standard UTF-8" by TexMaker's import function (and in any case probably won't be processed by XeLaTeX or other converters). I can't simply use the UnicodeString functions in DXL because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. below some numeric code value. For example, charOf(8217) should create a right curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I copied the curly quote from the DOORS module and verified via an online Unicode analyzer that it is definitely Unicode decimal value 8217.
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with, e.g., \textquoteright in the output stream.
My overall setup works for lower-numbered characters, since this works:
(c is a single character pulled from a string)
thedeg = charOf(176)   // 176 is the degree sign
if (thedeg == c)
{
    temp += "$\\degree$"
}
Got some help from DXL coding experts over at IBM forums.
Quoting the important part (there are some useful code snippets there as well):
Hey, you are right it seems intOf(char) and charOf(int) both do some
modulo 256 and therefore cut anything above that off. Try:
int i=8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.
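To see why the original comparison could never match, here is a quick check of the modulo-256 behaviour described in that quote (plain Python arithmetic, purely for illustration):

# If charOf()/intOf() truncate modulo 256, decimal 8217 (U+2019, RIGHT SINGLE
# QUOTATION MARK) collapses to 25, an ASCII control code -- so it can never
# compare equal to the real curly quote read from the module.
print(hex(8217))   # 0x2019
print(8217 % 256)  # 25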

Unicode comparison of Cyrillic 'С' and Latin 'C'

I have a dataset which mixes use of the Unicode characters \u0421 'С' and \u0043 'C'. Is there some sort of Unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
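Since, as noted above, you would probably need to code your own use of that data, here is a minimal sketch of the idea in Python. It assumes you have downloaded confusables.txt from the UTS #39 data files; the parsing is simplified and this is not a full implementation of the UTS #39 skeleton algorithm.

import unicodedata

def load_confusables(path="confusables.txt"):
    # Data lines look like:  source ; target ; type  # comment
    # with code points given as space-separated hex values.
    mapping = {}
    with open(path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.split("#")[0].strip()
            if not line:
                continue
            fields = [field.strip() for field in line.split(";")]
            source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
            target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
            if len(source) == 1:  # keep it simple: single-character sources only
                mapping[source] = target
    return mapping

def skeleton(text, mapping):
    # Rough version of the UTS #39 "skeleton": NFD, map confusables, NFD again.
    text = unicodedata.normalize("NFD", text)
    text = "".join(mapping.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFD", text)

confusables = load_confusables()
print(skeleton("\u0421", confusables) == skeleton("\u0043", confusables))  # Cyrillic Es vs Latin C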
When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated as being similar in use to other code points; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when Punycode was designed. Other than that, your best bet might be to search the data for characters outside the expected range using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace /с/, 'c'.

How to mark all CJK text in a document?

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to language, except for English, and output a new file. For example, here is a sample line:
The 恐龙 ate 鱼.
Since this line contains Chinese characters, it should be marked up like this:
The \language[cn]{恐龙} ate \language[cn]{鱼}.
The document is saved as UTF-8.
Text in Chinese should be marked \language[cn]{*}.
Text in Japanese should be marked \language[ja]{*}.
Text in Korean should be marked \language[ko]{*}.
The content never continues from one line to the next.
If the code is ever in doubt about whether something is Chinese, Japanese, or Korean, it is best if it defaults to Chinese.
How can I mark the text according to the language present?
A crude algorithm:
use 5.014;
use utf8;
binmode DATA,   ':encoding(UTF-8)';  # decode the sample text below
binmode STDOUT, ':encoding(UTF-8)';

while (<DATA>) {
    s{(\p{Hangul}+)}{\\language[ko]{$1}}g;
    s{(\p{Hani}+)}{\\language[cn]{$1}}g;
    s{(\p{Hiragana}+|\p{Katakana}+)}{\\language[ja]{$1}}g;
    say;
}

__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.
(Also see Detect chinese character using perl?)
There are problems with that. Daenyth comments that, e.g., 恐竜 is misidentified as Chinese. I find it unlikely that you are really working with mixed English-CJK text and suspect you are just giving a bad example. Perform a lexical analysis first to differentiate Chinese from Japanese.
I'd like to provide a Python solution. Whatever the language, the approach is based on the Unicode script information in the Unicode Character Database (UCD); Perl exposes a rather more detailed view of the UCD than Python does.
Python does not expose script information in its "unicodedata" module, but someone has added it here: https://gist.github.com/2204527 (tiny and useful). My implementation is based on it. By the way, it is not whitespace-sensitive (no lexical analysis is needed).
# coding=utf8
import unicodedata2  # from https://gist.github.com/2204527 -- adds Unicode script data

text = u"""The恐龙ate鱼.
The 恐竜ate 魚.
Theキョウリュウ ate うお.
The공룡 ate 물고기. """

langs = {
    'Han': 'cn',
    'Katakana': 'ja',
    'Hiragana': 'ja',
    'Hangul': 'ko'
}

# Pair each character with its Unicode script name
alist = [(x, unicodedata2.script_cat(x)[0]) for x in text]
# Sentinel entry so the final run gets flushed
alist.append(("", ""))

newlist = []
langlist = []
prevlang = ""
for raw, lang in alist:
    # When the script changes, close the current run of CJK characters
    if prevlang in langs and prevlang != lang:
        newlist.append("\\language[%s]{" % langs[prevlang] + "".join(langlist) + "}")
        langlist = []
    if lang not in langs:
        newlist.append(raw)
    else:
        langlist.append(raw)
    prevlang = lang

newtext = "".join(newlist)
print newtext
The output is:
$ python test.py
The\language[cn]{恐龙}ate\language[cn]{鱼}.
The \language[cn]{恐竜}ate \language[cn]{魚}.
The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
The\language[ko]{공룡} ate \language[ko]{물고기}.
While Korean doesn't use many sinograms [漢字/Kanji] any more, they still pop up sometimes. Some Japanese sinograms are uniquely Japanese, like 竜, but many are identical to either Simplified or Traditional Chinese, so to some extent you're stuck: you need to look at a full sentence once it contains "Han" characters. If it contains hiragana/katakana plus kanji, the probability is very high that it's Japanese. Likewise, a bunch of Hangul syllables plus a couple of sinograms tell you the sentence is Korean.
Then, if it's all Han characters, i.e. Chinese, you can look at whether some of the characters are simplified: the kZVariant property denotes a Simplified Chinese character, and kSpecializedSemanticVariant is very often used for Japanese-specific simplified characters. 内 and 內 may look the same to you, but the first is Japanese, the second Traditional Chinese and Korean (Korean uses Traditional Chinese characters as its standard).
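A rough sketch of how one might consult those variant properties (a sketch only: it assumes the Unihan_Variants.txt data file from the Unihan database on unicode.org carries the kZVariant entries, and it just illustrates the lookup, not the whole heuristic):

def load_variant_property(path, wanted="kZVariant"):
    # Unihan data lines look like:  U+XXXX<tab>kZVariant<tab>U+YYYY (possibly with annotations)
    chars = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            codepoint, prop, _value = line.rstrip("\n").split("\t", 2)
            if prop == wanted:
                chars.add(chr(int(codepoint[2:], 16)))
    return chars

zvariants = load_variant_property("Unihan_Variants.txt")
print("\u5185" in zvariants)  # does 内 carry a kZVariant entry?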
I have code somewhere that returns, for one codepoint, the script name. That could help. You go through a sentence, and see what's left at the end. I'll put up the code somewhere.
EDIT: the code
http://pastebin.com/e276zn6y
In response to the comment below:
The function above is built on data provided by Unicode.org. While not an expert per se, I contributed quite a bit to the Unihan database, and I happen to speak CJK. Yes, all three. I do have some code that takes advantage of the kXXX properties in the Unihan database, but A/ I wasn't aware we were supposed to write code for the OP, and B/ it would require logistics that might go beyond what the OP is ready to implement. My advice stands: with the function above, loop through one full sentence. If all code points are "Han" (or "Han" + "Latin"), chances are high that it's Chinese. If, on the other hand, the result is a mix of "Han" + "Hangul" (+ possibly "Latin"), you can't go wrong with Korean. Likewise, with a mix of "Han" and "Katakana"/"Hiragana" you have Japanese.
A QUICK TEST
Some code to be used with the function I linked to before.
// Relies on scriptName() from the pastebin code linked above
function guessLanguage(x) {
    var results = {};
    var s = '';
    var i, j = x.length;
    // Count how many characters belong to each script
    for (i = 0; i < j; i++) {
        s = scriptName(x.substr(i, 1));
        if (results.hasOwnProperty(s)) {
            results[s] += 1;
        } else {
            results[s] = 1;
        }
    }
    console.log(results);
    // Return the name of the most frequent script
    var mostCount = 0;
    var mostName = '';
    for (x in results) {
        if (results.hasOwnProperty(x)) {
            if (results[x] > mostCount) {
                mostCount = results[x];
                mostName = x;
            }
        }
    }
    return mostName;
}
Some tests:
r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ!");
Object
Common: 2
Han: 5
Hiragana: 9
Katakana: 4
__proto__: Object
"Hiragana"
The logged object contains the number of occurrences of each script. Hiragana is the most frequent, and Hiragana + Katakana make up two thirds of the sentence.
r=guessLanguage("我唔知道,佢講乜話.")
Object
Common: 2
Han: 8
__proto__: Object
"Han"
An obvious case of Chinese (Cantonese in this case).
r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
Common: 11
Han: 4
Hangul: 19
__proto__: Object
"Hangul"
Some Han characters, and a whole lot of Hangul. A Korean sentence, assuredly.