How to obtain a full list of Unicode emojis from the Unicode website - unicode

I'm building an application that requires the use of emojis, specifically generating large sequences of random emojis. This requires having a large list to pull from. Rather than taking the approach detailed here by looping over hardcode hex ranges, I decided to take a different approach and download and parse data from the Unicode website. From there, I do some code-generation and write all the unique emojis to disk which I can then pick up inside my application. All this happens either as a manual step or a build step for my app.
However, the Unicode specification is complicated and I'm unsure which data I should be pulling from to build up a definitive list. There are three files under the latest version of Unicode (14.0):
emoji-sequences.txt
emoji-test.txt
emoji-zwj-sequences.txt
There are also two files in the Unicode Character Database (UCD):
emoji-data.txt
emoji-variation-sequences.txt
There are definitely duplicates amongst all these lists such as 😀 and while I could download and parse all five files and reduce the list down to unique instances in my script, I'd like to keep my script as simple as I can without doing unnecessary work.
From what I understand:
emoji-test.txt is a grouping of emoji characters as you might see in a keyboard, grouped by category
emoji-sequences.txt is a list of emoji ranges, single emojis, and multi character emojis such as 🇦🇨 (1F1E6 1F1E8) or emojis combined with a variation selector like FE0F
emoji-zwj-sequences.txt is a list of emojis joined by the zero width joiner character
emoji-variation-sequences is a list of emojis that can be presented either in textual form or as emojis
emoji-data.txt seems to be a very comprehensive list of not just emojis but also emoji modifiers and the like
All this has left me rather perplexed as to which list or combination of lists would give me the most comprehensive list of emojis. emoji-data.txt seems to have a most wide-ranging list but I don't want things like emoji modifiers or emoji components; I'm only looking for emojis that a user can select with the keyboard (for example you can't select a skin tone modifier by itself).
Which lists or combination of lists would yield the most comprehensive, wide-ranging list of emojis that I could use in my app?

Use the union of emoji-sequences.txt and emoji-zwj-sequences.txt. That set comprises the emoji recommended for general interchange. see https://www.unicode.org/reports/tr51/tr51-19.html#def_rgi_set.

Related

The list of unicode unusual characters

Where can I get the complete list of all unicode characters that doesn't behave as simple characters. Examples: character 0x0363 (won't be printed without another one before), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters to replace them with something harmless to avoid unwanted output effects. Regular characters (those who not in this list) should use exactly one character place when printed (= cursor moved +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have some unicode string, usually consists of "usual" characters like 0x20-0x7E or cyrillic letters. Also, there are a lot of other unicode characters that are usual and may be safely assumed as having strlen() = 1. The string is printed on the terminal and I should know the resulting position of the cursor. I don't want to use some complex and non-stable libraries to do that, i want to have simplest possible logic to do that. Every problematic character may be replaced with U+0xFFFD or something like "<U+0363>" (ASCII string with its index instead of character itself). I want to have a list of "possibly-problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not much.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❤️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is DŽ, but many fixed-width fonts have a narrow glyph for it: DŽ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But DŽ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).

Swift: Unicode transformations: How to generate a rainbow infinity symbol

In xcode, developing for iOS "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}" is a rainbow flag.
"\u{1F3F3}" is a white flag, and "\u{1F308}" is a rainbow. The middle symbols "\u{FE0F}\u{200D}" are invisible symbols used to join these two together to make the rainbow flag symbol.
I am trying to combine unicode characters to make a rainbow infinity symbol, but not exactly sure how to implement this.
Not sure if there is an already existing unicode character or apple api I can use to do this, but would appreciate learning how to do this
I wouldn't mind having an infinity symbol over the rainbow flag either (like the apple anti-lgbt flag incident) as an alternative.
Emoji fonts are still just fonts. If they don’t contain a specific glyph, then they cannot display that glyph. The reason “🏳️‍🌈” looks like a rainbow flag is because someone drew a picture of a rainbow flag and then defined their font in such a way that the sequence <U+1F3F3, U+FE0F, U+200D, U+1F308> would be displayed using that specific image. Much like how someone first had to define the precise shape of the letter “A” in their font and then apply that glyph to the codepoint U+0041.
There is no image-rendering code that instinctively knows how to apply the colours of 🌈 to the shape of 🏳️ and then automatically generates a new glyph on the fly. It’s all explicitly pre-defined.
U+200D is the so-called Zero Width Joiner (ZWJ), so emoji sequences using that character are appropriately named Zero Width Joiner Sequences. They were originally invented by Apple to support emoji that weren’t part of the Unicode standard (in particular, variants of 💏, 💑, and 👪️ with different gender configurations), but later other vendors jumped on board as well and nowadays they are officially part of Unicode as an alternative way for defining new emoji without having to encode entirely new characters. Currently, about a third of all officially recommended emoji are ZWJ sequences.
In theory, any person can make up their own ZWJ sequences just by joining existing characters together (as was their original intent). In your case, “♾️+ZWJ+🌈” or <U+267E, U+FE0F, U+200D, U+1F308> would be an obvious sequence for a rainbow-coloured infinity symbol. You just have to create your own font containing the glyph you want, and then distribute that font to other people so that they can see the same glyph as you. There are just a few problems:
Making fonts with colourful glyphs is not easy. I couldn’t tell you whether there even exist freely available tools for that task.
There are four different formats for emoji fonts (used by Apple, Google, Microsoft, and Mozilla respectively) and they generally do not work on each other’s platforms, so you would need to create not just one, but several fonts unless you don’t care about people on other operating systems.
Installing your own fonts is not possible on most mobile phones, so your custom emoji would mostly only be available to desktop users.

Will precluding surrogate code points also impede entering Chinese characters?

I have a name input field in an app and would like to prevent users from entering emojis. My idea is to filter for any characters from the general categories "Cs" and "So" in the Unicode specification, as this would prevent the bulk of inappropriate characters but allow most characters for writing natural language.
But after reading the spec, I'm not sure if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points. (My understanding is still rough.)
Would excluding surrogates still leave most Chinese users with the characters they need to enter their names, or is the original Unicode space not big enough for that to be a reasonable expectation?
Your method would be both ineffective and too excessive.
Not all emoji are outside of the Basic Multilingual Plane (and thus don’t require surrogates in the first place), and not all emoji belong to the general category So. Filtering out only these two groups of characters would leave the following emoji intact:
#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣ ‼️ ⁉️ ℹ️ ↔️ ◼️ ◻️ ◾️ ◽️ ⤴️ ⤵️ 〰️ 〽️
At the same time, this approach would also exclude about 79,000 (and counting) non-emoji characters covering several dozen scripts – many of them historic, but some with active user communities. The majority of all Han (Chinese) characters for instance are encoded outside the BMP. While most of these are of scholarly interest only, you will need to support them regardless especially when you are dealing with personal names. You can never know how uncommon your users’ names might be.
This whole ordeal also hinges on the technical details of your app. Removing surrogates would only work if the framework you are using encodes strings in a format that actually employs surrogates (i.e. UTF-16) and if your framework is simultaneously not aware of how UTF-16 really works (as Java or JavaScript are, for example). Surrogates are never treated as actual characters; they are exceptionally reserved codepoints that exist for the sole purpose of allowing UTF-16 to deal with characters in the higher planes. Other Unicode encodings aren’t even allowed to use them at all.
If your app is written in a language that either uses a different encoding like UTF-8 or is smart enough to process surrogates correctly, then removing Cs characters on input is never going to have any effect because no individual surrogates are ever being exposed to your program. How these characters are entered by the user does not matter because all your app gets to see is the finished product (the actual character codepoints).
If your goal is to remove all emoji and only emoji, then you will have to put a lot of effort into designing your code because the Unicode emoji spec is incredibly convoluted. Most emoji nowadays are constructed out of multiple characters, not all of which are categorised as emoji by themselves. There is no easy way to filter out just emoji from a string other than maintaining an explicit list of every single official emoji which would need to be steadily updated.
Will precluding surrogate code points also impede entering Chinese characters? […] if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points.
You cannot intercept how characters are entered, whether via input method editor, copy-paste or dozens of other possibilities. You only get to see a character when it is completed (and an IME's work is done), or depending on the widget toolkit, even only after the text has been submitted. That leaves you with validation. Let's consider a realistic case. From Unihan_Readings.txt 12.0.0 (2018-11-09):
U+20009 ‹𠀉› (the same as U+4E18 丘) a hill; elder; empty; a name
U+22218 ‹𢈘› variant of 鹿 U+9E7F, a deer; surname
U+22489 ‹𢒉› a surname
U+224B9 ‹𢒹› surname
U+25874 ‹𥡴› surname
Assume the user enters 𠀉, then your unnamed – but hopefully Unicode compliant – programming language must consider the text on the grapheme level (1 grapheme cluster) or character level (1 character), not the code unit level (surrogate pair 0xD840 0xDC09). That means that it is okay to exclude characters with the Cs property.

What are the unicode ranges for Hindi accented characters?

I'm trying to gather a Unicode list of all the 'o' like shapes in the Hindi character-set. In fact, a list of any characters (in any language) that makes uses of separate characters to indicate an accent would be better.
I intend to use this unicode-list in a RegExp.
I been trying to edit a list of character-ranges by outputting them in an Input TextField, but editing this text causes weird issues (the keyboard-cursor isn't place on the correct character, selections suddenly dissappear / incorrectly warps... in other words... HINDI HELL!)
I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.
Anyways, all I want is a list of the accents.
An example of a few are in the image below (but I would need ALL accents):
Thanks!
You can find pdf's containing lists of unicode ranges, grouped by language, here: http://unicode.org/charts/
For Hindi, you probably want Devanagari or Devanagari Extended.
Here is the character class for Devanagari combining marks:
[\u901\u902\u903\u93c\u93e\u93f\u940\u941\u942\u943
\u944\u945\u946\u947\u948\u949\u94a\u94b\u94c\u94d
\u951\u952\u953\u954\u962\u963]
This is only the basic Devanagari block (not Devanagari Extended).
If you want the complete set (for all languages), you can do it problematically.
You start from the Unicode date file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions)
You can use the Canonical_Combining_Class field (see at http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want.
Can't be more precise, because "accent" a bit vague :-)
You might even have to also look at General_Category to get the filter right (and exclude certain marks, or symbols, or punctuation).
And a script doing this would definitely be better than trying to mess with text editors.
One of the characteristics of combining characters is that they combine :-)
So you might get all kind of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)

Where to get a reference image for any unicode code point?

I am looking for an online service (or collection of images) that can return an image for any unicode code point.
Unicode.org does not have an image for each one, consider for example
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=31cf
EDIT: I need to use these images programmatically, so the code chart PDFs provided at unicode.org are not useful.
The images in the PDF are copyrighted, so there are legal issues around extracting them. (I am not a lawyer.) I suspect that those legal issues prevent a simple solution from being provided, unless someone wants to go to the trouble of drawing all of those images. It might happen, but seems unlikely.
Your best bet is to download a selection of fonts that collectively cover the entire range of characters, and display the characters using those fonts. There are two difficulties with this approach: combining characters and invisible characters.
The combining characters can easily be detected from the Unicode database, and you can supply a base character (such as NBSP) to use for displaying them. (There is a special code point intended for this purpose, but I can't find it at the moment.)
Invisible characters could be displayed with a dotted square box containing the abbreviation for the character. Those you may have to locate manually and construct the necessary abbreviations. I am not aware of any shortcuts for that.