What is the unicode variation selector - unicode

I was wondering: what are the Unicode Variation Selectors U+FE00 to U+FE0F used for?
Example: ︀︁︂︂ 

The Unicode standard talks about this. Here's a bit of the relevant section from 3.2.0, annex 28 (I'm sure there are more recent versions around; this is the first I found):
Unicode characters can be represented by a wide variety of glyphs, as discussed in Chapter 2, General Structure in The Unicode Standard, Version 3.0. Occasionally the need arises in text processing to restrict or change the set of glyphs that are to be used to represent a character. Normally such changes are indicated by choice of font or style in rich-text documents. In special circumstances, such a variation from the normal range of appearance needs to be expressed side-by-side in the same document in plain-text contexts, where it is impossible or inconvenient to exchange formatted text. For example, in languages employing the Mongolian script, sometimes a specific variant range of glyphs is needed for a specific textual purpose for which the range of “generic” glyphs is considered inappropriate. The variation selectors are used when characters have essentially the same semantic.
Variation selectors provide a mechanism for specifying a restriction on the set of glyphs that are used to represent a particular character. They also provide a mechanism for specifying variants, such as for CJK Ideographs and Mongolian, that have essentially the same semantic but have substantially different ranges of glyphs. A variation sequence, which always consists of a base character followed by the variation selector, may be specified as part of the Unicode Standard. That sequence is referred to as a variant of the base character. The variation selector affects only the appearance of the base character,* and only in the variation sequences defined in this Standard. The variation selector is not used as a general code extension mechanism.
(It goes on...)
You may also be interested in the Standardized Variants (this time from 6.0.0).

This is not a complete answer to the question, but it's pertinent to emoji and variation selectors:
The ❤ character (U+2764 code point) is a Unicode character from 1993.
But the ❤️ emoji is actually the ❤ (U+2764) character followed by Variation Selector-16 (U+FE0F).
Why?
Exclusively speaking about Emojis (documentation):
VS15 and VS16 are reserved to determine whether or not a character
should be displayed as an emoji. [...]
Emoji variation sequences contain VS16 (U+FE0F) for emoji-style (with color) or VS15 (U+FE0E) for text style (monochrome)
If a character (or symbol, glyph, etc.) is also intended to be usable as an emoji, Variation Selector-16 tells the renderer to display it as an emoji. If the same character is followed by Variation Selector-15, the renderer is told to display it as plain text. If no variation selector is appended, the default presentation depends on the Unicode specification: for Emoticons the default is emoji; for other characters like ❤, the default is text.
Another example from Emoticons (Unicode_block)'s documentation:
Each emoticon has two variants:
U+FE0E (VARIATION SELECTOR-15) selects text presentation (e.g. 😊︎ 😐︎ ☹︎)
U+FE0F (VARIATION SELECTOR-16) selects emoji-style (e.g. 😊️ 😐️ ☹️).
If there is no variation selector appended, the default is the
emoji-style. Example:
U+1F610 (NEUTRAL FACE) 😐
U+1F610 (NEUTRAL FACE), U+FE0E (VARIATION SELECTOR-15) 😐︎
U+1F610 (NEUTRAL FACE), U+FE0F (VARIATION SELECTOR-16) 😐️
Note: VS15 and VS16 are not required for a valid emoji. Many emoji are used without variation selectors.
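To make the heart example concrete, here is a minimal Python sketch that builds both presentation sequences and lists the code points involved. The helper name `describe` is made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation


def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


heart = "\u2764"            # HEAVY BLACK HEART on its own
emoji_heart = heart + VS16  # asks the renderer for the emoji style
text_heart = heart + VS15   # asks the renderer for the text style

for line in describe(emoji_heart):
    print(line)
```

Whether the two sequences actually look different still depends on the font and renderer; the code only shows that the "emoji heart" is two code points, not one.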

Your guess is as good as mine, but according to this source:
Emoji Character Encoding Data Hints: (1) In iOS 5 / OSX 10.7, the underlying code that the Apple OS generates for this emoji was changed. (2) The code generated for this emoji was changed slightly in iOS 7 / OSX 10.9 (a variation selector was added) to make it easier for this emoji to be identified and shown in OSX and iOS. We don't mind Apple, thank you! We just love our emojis!
Their chart goes on to note that this "new", post-10.9 version
has a UTF-8 Character Count of 2 vs the previous 1... if that helps.

The Variation Selectors range was introduced with version 3.2 of the Unicode Standard, and is located in Plane 0, the Basic Multilingual Plane. Further selectors can be found in the Variation Selectors Supplement range.
Most Unicode characters can be represented by a wide variety of glyphs, and in rich text a particular glyph can be indicated by choosing a particular font or style. This mechanism is not available in plain text, and so variation selectors have been introduced as a way of indicating that the glyphs applicable to a particular character should be changed or restricted. The base character is followed by the variation selector, the combination being called a variation sequence. This is not intended to be a general-purpose mechanism, and the only permitted variation sequences are those defined in the Standardized Variants file, which forms part of the Unicode Character Database.
From http://www.alanwood.net/unicode/variation_selectors.html
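As a rough illustration of "base character followed by the variation selector", the following Python sketch finds such pairs in a string. The function names are hypothetical. The ranges checked are the Variation Selectors block (U+FE00 to U+FE0F) and the Variation Selectors Supplement (U+E0100 to U+E01EF). The sketch does not consult the Standardized Variants file, so it reports any base-plus-selector pair, whether or not the sequence is actually defined.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 (BMP block) and VS17-VS256 (supplement block)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def variation_sequences(text: str) -> list[tuple[str, str]]:
    """Return (base, selector) pairs found in text.

    Assumption: any non-selector followed by a selector counts as a pair;
    no check is made against the list of standardized sequences.
    """
    pairs = []
    for prev, cur in zip(text, text[1:]):
        if is_variation_selector(cur) and not is_variation_selector(prev):
            pairs.append((prev, cur))
    return pairs


sample = "a\u2764\uFE0Fb"
print(variation_sequences(sample))
```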

Related

Is it possible to use unicode combining characters to combine arbitrary characters?

Is it possible to use unicode combining characters to for example make the characters x and y appear to be partially overlapping each other?
I know that in layout systems like CSS there are other ways to achieve this, but I specifically want to know if it's possible with just Unicode so I can, for example, do it in Slack messages.
No, there is no Unicode mechanism to make arbitrary letters overlap each other. You can put an x above a y using the character U+036F COMBINING LATIN SMALL LETTER X like so: yͯ, but that’s about it.
Latin letters partially overlapping each other serves no semantic function, so it is not part of the Unicode standard. And if it was found to be used to convey actual meaning in some writing system, it would most likely not be encoded as a generalised mechanism but as individual characters representing specific such ligatures.
The Unicode Consortium does not consider styling features like that to be part of plain text. That is also why those bold and italic mathematical letters you sometimes see on Twitter (𝐀, 𝐴, 𝓐 etc.) aren’t implemented as the base letters plus some style modifiers, but as separate character codes entirely. A character that means “display the preceding letter as bold” would have been too general; non-crucial style variation should be dealt with through higher-level protocols (like the CSS you mentioned) which are much more powerful and enjoy more widespread support anyway.
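Both points can be checked from Python's standard `unicodedata` module. This is only an inspection sketch: it shows that the stacked form is a base letter plus a combining mark, and that the bold mathematical A is its own code point that merely has a compatibility mapping back to a plain A.

```python
import unicodedata

# "y" followed by U+036F: two code points, usually rendered as one stacked glyph.
stacked = "y" + "\u036F"
print(stacked, len(stacked))
print(unicodedata.name("\u036F"))      # name of the combining mark
print(unicodedata.category("\u036F"))  # Mn = nonspacing mark

# The bold "A" is a separate character, not "A" plus a style modifier.
bold_a = "\U0001D400"
print(unicodedata.name(bold_a))
print(unicodedata.normalize("NFKC", bold_a))  # compatibility form is plain "A"
```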

Will precluding surrogate code points also impede entering Chinese characters?

I have a name input field in an app and would like to prevent users from entering emojis. My idea is to filter for any characters from the general categories "Cs" and "So" in the Unicode specification, as this would prevent the bulk of inappropriate characters but allow most characters for writing natural language.
But after reading the spec, I'm not sure if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points. (My understanding is still rough.)
Would excluding surrogates still leave most Chinese users with the characters they need to enter their names, or is the original Unicode space not big enough for that to be a reasonable expectation?
Your method would be both ineffective and excessive.
Not all emoji are outside of the Basic Multilingual Plane (and thus don’t require surrogates in the first place), and not all emoji belong to the general category So. Filtering out only these two groups of characters would leave the following emoji intact:
#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣ ‼️ ⁉️ ℹ️ ↔️ ◼️ ◻️ ◾️ ◽️ ⤴️ ⤵️ 〰️ 〽️
At the same time, this approach would also exclude about 79,000 (and counting) non-emoji characters covering several dozen scripts – many of them historic, but some with active user communities. The majority of all Han (Chinese) characters for instance are encoded outside the BMP. While most of these are of scholarly interest only, you will need to support them regardless especially when you are dealing with personal names. You can never know how uncommon your users’ names might be.
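The first half of this argument (emoji that slip through a Cs/So filter) is easy to reproduce. The sketch below assumes a naive filter of the kind described in the question; `strip_cs_and_so` is a made-up name, and the general categories come from Python's bundled Unicode database.

```python
import unicodedata


def strip_cs_and_so(text: str) -> str:
    """Naive filter from the question: drop characters in categories Cs and So."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in {"Cs", "So"})


keycap_one = "1\uFE0F\u20E3"  # digit one + VS16 + combining enclosing keycap
info = "\u2139\uFE0F"         # information source + VS16
smiley = "\U0001F60A"         # smiling face with smiling eyes

for label, s in [("keycap", keycap_one), ("info", info), ("smiley", smiley)]:
    cats = [unicodedata.category(ch) for ch in s]
    print(label, cats, "survives" if strip_cs_and_so(s) == s else "filtered")
```

The keycap and information emoji are built from a digit or letter plus marks, so none of their code points is in So and the filter leaves them untouched.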
This whole ordeal also hinges on the technical details of your app. Removing surrogates would only work if the framework you are using encodes strings in a format that actually employs surrogates (i.e. UTF-16) and if your framework is simultaneously not aware of how UTF-16 really works (as Java or JavaScript are, for example). Surrogates are never treated as actual characters; they are exceptionally reserved codepoints that exist for the sole purpose of allowing UTF-16 to deal with characters in the higher planes. Other Unicode encodings aren’t even allowed to use them at all.
If your app is written in a language that either uses a different encoding like UTF-8 or is smart enough to process surrogates correctly, then removing Cs characters on input is never going to have any effect because no individual surrogates are ever being exposed to your program. How these characters are entered by the user does not matter because all your app gets to see is the finished product (the actual character codepoints).
If your goal is to remove all emoji and only emoji, then you will have to put a lot of effort into designing your code because the Unicode emoji spec is incredibly convoluted. Most emoji nowadays are constructed out of multiple characters, not all of which are categorised as emoji by themselves. There is no easy way to filter out just emoji from a string other than maintaining an explicit list of every single official emoji which would need to be steadily updated.
Will precluding surrogate code points also impede entering Chinese characters? […] if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points.
You cannot intercept how characters are entered, whether via input method editor, copy-paste or dozens of other possibilities. You only get to see a character when it is completed (and an IME's work is done), or depending on the widget toolkit, even only after the text has been submitted. That leaves you with validation. Let's consider a realistic case. From Unihan_Readings.txt 12.0.0 (2018-11-09):
U+20009 ‹𠀉› (the same as U+4E18 丘) a hill; elder; empty; a name
U+22218 ‹𢈘› variant of 鹿 U+9E7F, a deer; surname
U+22489 ‹𢒉› a surname
U+224B9 ‹𢒹› surname
U+25874 ‹𥡴› surname
Assume the user enters 𠀉, then your unnamed – but hopefully Unicode compliant – programming language must consider the text on the grapheme level (1 grapheme cluster) or character level (1 character), not the code unit level (surrogate pair 0xD840 0xDC09). That means that it is okay to exclude characters with the Cs property.
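Python is one language that works at the code point level, so it can serve as a sketch of this behaviour. The surrogate pair only appears once the string is explicitly encoded as UTF-16; the character itself has category Lo, not Cs.

```python
import unicodedata

han = "\U00020009"  # U+20009, a supplementary-plane Han character

# At the string level there is a single code point, and it is a letter.
print(len(han), unicodedata.category(han))

# Surrogates show up only in the UTF-16 encoded form.
utf16 = han.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])

# UTF-8 uses four bytes and no surrogates at all.
utf8 = han.encode("utf-8")
print(utf8.hex(" "))
```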

Two different eye emojis?

As far as I knew, there are currently two emojis for eyes. The pair of eyes (U+1F440) with hex code f09f9180 (👀), and a single eye (U+1F441) with hex code f09f9181 (👁).
I now found when using the emojis of the keyboard in my phone that another eye emoji exists, with hex code f09f9181efb88f (👁️).
The gajim messenger on the PC, and the Conversations app on the mobile phone, can display both. The gajim emoji-chooser only contains the short sequence and the Swiftkey-Keyboard Emoji-Chooser only the longer one.
When I copy and paste the emojis, e.g. into the Firefox URL address bar, they look the same (blue eye, while the messengers both display them in black). When I Google for the emojis, I only find pages describing the shorter code point.
Firefox renders both emojis the same, but Vivaldi (Chromium based) shows the one with the shorter code point as narrow black and white emoji and the other one as larger brown eye.
When I Google for the hex dump, I find a lot of emojipedia sites for the shorter dump, and nothing useful at all for the longer one.
Is there any documentation somewhere about the additional emoji? Why aren't both emojis available in both emoji choosers?
f0 9f 91 80 is the UTF-8 encoded form of codepoint U+1F440.
f0 9f 91 81 is the UTF-8 encoded form of codepoint U+1F441.
f0 9f 91 81 ef b8 8f is the UTF-8 encoded form of codepoints U+1F441 U+FE0F.
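These byte sequences can be reproduced, and the longer one decoded back into its two code points, with a few lines of Python. The helper name `utf8_hex` is made up for this sketch.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex (hypothetical helper)."""
    return text.encode("utf-8").hex(" ")


eyes = "\U0001F440"
eye = "\U0001F441"
eye_vs16 = eye + "\uFE0F"

print(utf8_hex(eyes))
print(utf8_hex(eye))
print(utf8_hex(eye_vs16))

# Going the other way: decode the long hex dump and list its code points.
decoded = bytes.fromhex("f09f9181efb88f").decode("utf-8")
print([f"U+{ord(ch):04X}" for ch in decoded])
```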
U+FE0F is a Variation Selector:
Variation Selectors is a Unicode block containing 16 Variation Selector format characters (designated VS1 through VS16). They are used to specify a specific glyph variant for a Unicode character. They are currently used to specify standardized variation sequences for mathematical symbols, emoji symbols, 'Phags-pa letters, and CJK unified ideographs corresponding to CJK compatibility ideographs. At present only standardized variation sequences with VS1, VS15 and VS16 have been defined.
Where U+FE0F is VARIATION SELECTOR-16:
U+FE0F was added to Unicode in version 3.2 (2002). It belongs to the block Variation Selectors in the Basic Multilingual Plane.
This character is a Nonspacing Mark and inherits its script property from the preceding character.
The glyph is not a composition. It has an Ambiguous East Asian Width. In bidirectional context it acts as Nonspacing Mark and is not mirrored. In text U+FE0F behaves as Combining Mark regarding line breaks. It has type Extend for sentence and Extend for word breaks. The Grapheme Cluster Break is Extend.
This codepoint may change the appearance of the preceding character. If that is a symbol, dingbat or emoji, U+FE0F forces it to be rendered as a colorful image as compared to a monochrome text variant. The Unicode standard defines some standardized variants. See also “Unicode symbol as text or emoji” for a discussion of this codepoint.
In other words, U+FE0F tells VS-aware software to render U+1F441 as a colorful emoji instead of as monochromatic text.
The singular ‘👁’ is used as an emoji, but is defined as being text-style (i.e. black-and-white rather than colourful) by default. This isn’t implemented consistently across all platforms, however, so sometimes the character will also display as emoji style instead. In order to explicitly force one or the other style, the characters U+FE0E and U+FE0F can be appended to 👁 to make it appear as text style (👁︎) or emoji style (👁️) respectively. Because of the inconsistencies I mentioned, some devices and applications automatically add U+FE0F to the character (resulting in the longer code your phone keyboard produced), while others leave the character as-is (leaving just the code for the eye itself).
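If the goal is to treat both forms of the eye as the same emoji (for example when comparing or searching messages), one possible approach is to strip the presentation selectors before comparing, or to force one style explicitly. The function names below are hypothetical, and `force_style` assumes its input is a single base character with at most one selector attached.

```python
TEXT_STYLE = "\uFE0E"   # VARIATION SELECTOR-15
EMOJI_STYLE = "\uFE0F"  # VARIATION SELECTOR-16


def strip_presentation_selectors(text: str) -> str:
    """Remove VS15/VS16 so that both spellings of an emoji compare equal."""
    return text.replace(TEXT_STYLE, "").replace(EMOJI_STYLE, "")


def force_style(base: str, selector: str) -> str:
    """Return base with exactly one presentation selector appended.

    Assumption: base is one character, optionally followed by VS15 or VS16.
    """
    return strip_presentation_selectors(base) + selector


eye = "\U0001F441"
from_keyboard = eye + EMOJI_STYLE  # what the phone keyboard produced

print(strip_presentation_selectors(from_keyboard) == eye)
print(force_style(eye, TEXT_STYLE).encode("utf-8").hex(" "))
```

Blindly stripping U+FE0F from longer emoji sequences can change how they render, so this is only suitable for comparison keys, not for rewriting text that will be displayed.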

What range of unicode characters should be kept in a @font-face web font for a US based website with a US audience?

As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like the Copyright and Registered Trademark symbols and fractions.
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only those ranges (0020 — 007F) and (00A0 — 00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included in the web font?
The point of a web-font is to make the main bodies of text and headlines look pretty, which the basic latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of unicode characters should be kept in a @font-face web font for a US based website with a US audience? Are there any best practices or guidelines for stripping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
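A quick way to see which of these code pages covers a given character is to try encoding it with Python's built-in codecs. This is only a coverage check on the character sets, not a font subsetting tool; `encodable` is a made-up helper name.

```python
def encodable(ch: str, codec: str) -> bool:
    """True if the character exists in the given legacy code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Python codec names for Windows-1252, ISO/IEC 8859-1 and ISO/IEC 8859-15.
CODECS = ("cp1252", "latin_1", "iso8859_15")

for ch in "éñ€™":
    print(ch, {codec: encodable(ch, codec) for codec in CODECS})
```

The accented letters are in all three sets, the Euro sign is missing from ISO/IEC 8859-1, and the trademark sign is only in Windows-1252.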
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exact requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)
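Once you have settled on a set of characters to keep, it can help to turn it into a list of code point ranges to hand to whatever subsetting tool you use. The sketch below is one way to do that in Python. Both function names are hypothetical, and the output is formatted in the `U+XXXX-YYYY` style that the CSS `unicode-range` descriptor accepts.

```python
def codepoint_ranges(chars: str) -> list[tuple[int, int]]:
    """Collapse a set of characters into sorted, inclusive code point ranges."""
    ranges: list[tuple[int, int]] = []
    for cp in sorted({ord(c) for c in chars}):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges


def format_ranges(ranges: list[tuple[int, int]]) -> str:
    """Format ranges as 'U+0061-0063, U+00E9'."""
    return ", ".join(
        f"U+{a:04X}" if a == b else f"U+{a:04X}-{b:04X}" for a, b in ranges
    )


# Basic Latin plus the extra characters suggested in the answer above.
BASIC_LATIN = "".join(chr(cp) for cp in range(0x20, 0x7F))
EXTRAS = "¡£°±²³¿ÉËÑéëñ‘’“”–—•€™"

print(format_ranges(codepoint_ranges(BASIC_LATIN + EXTRAS)))
```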

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument is that the characters, regardless of their shape, have the same meaning and therefore should be represented by the same code. Well, it's not meaningless to me, because I am analyzing individual characters, which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
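The whole-text heuristic quoted from the Unicode FAQ (a fair amount of kana suggests Japanese, a fair amount of hangul suggests Korean) can be sketched in a few lines of Python. The function name, the 10% threshold and the exact ranges counted (Hiragana and Katakana blocks, Hangul Syllables, the main CJK Unified Ideographs block) are assumptions chosen for illustration. The sketch does not attempt the simplified/traditional distinction.

```python
def guess_cjk_language(text: str, threshold: float = 0.1) -> str:
    """Very rough guess based on the share of kana and hangul in the text."""
    kana = sum(1 for ch in text if 0x3040 <= ord(ch) <= 0x30FF)    # Hiragana + Katakana
    hangul = sum(1 for ch in text if 0xAC00 <= ord(ch) <= 0xD7A3)  # Hangul Syllables
    han = sum(1 for ch in text if 0x4E00 <= ord(ch) <= 0x9FFF)     # CJK Unified Ideographs
    total = kana + hangul + han
    if total == 0:
        return "unknown"
    if kana / total >= threshold:
        return "probably Japanese"
    if hangul / total >= threshold:
        return "probably Korean"
    return "probably Chinese"


for sample in ["これは日本語の文章です", "한국어 문장입니다", "这是中文句子", "hello"]:
    print(sample, "->", guess_cjk_language(sample))
```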
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduce that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
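The ambiguity described here shows up as soon as you try to write the lookup down. The sketch below uses two tiny dictionaries hand-copied from the examples in this answer, not the real Unihan files, and the function name and labels are made up for illustration.

```python
# Hand-copied from the examples above; a real implementation would load
# the kSimplifiedVariant / kTraditionalVariant fields from the Unihan database.
K_SIMPLIFIED_VARIANT = {"麵": "面"}
K_TRADITIONAL_VARIANT = {"面": "麵", "韩": "韓"}


def classify(ch: str) -> str:
    """Classify a character from variant data alone (hypothetical labels)."""
    has_simplified = ch in K_SIMPLIFIED_VARIANT
    has_traditional = ch in K_TRADITIONAL_VARIANT
    if has_simplified and not has_traditional:
        return "traditional only"
    if has_traditional:
        # Could be simplified only (like 韩) or shared (like 面):
        # the variant data alone cannot tell these apart.
        return "simplified or shared (ambiguous)"
    return "no variant data"


for ch in "麵面韩":
    print(ch, classify(ch))
```

面 and 韩 end up with the same label even though only one of them is exclusively a simplified form, which is exactly the limitation the answer points out.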
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.