Unusual rendering and copy-paste for the character 誤 - unicode

I'm seeing somewhat unusual behavior around the rendering of 誤 in the browser (works across both Firefox and Chrome), which I'm having trouble explaining.
Specifically, check out the Wiktionary page for 誤:
Notice that there are 3 variations marked in black bold:
The top left one has 3 pieces: 言 + ⼝ + 天
The middle one kinda' has 4 pieces: 言 + ⼝ + a rotated ꒔ + ⼤
The bottom one has 3 pieces: ⻈ + ⼝ + 天
The relation between 2 and 3 is clear: 2 represents the traditional character and 1 represents the simplified character. But what does 1 represent? I've tried the following:
I tried copying character 1 but when I paste it, it ends up looking like character 2.
I tried various font combinations, both in the browser and in TextEdit, but the appearance and copy-pasting behavior persist.
So what is going on with this unusual character rendering and copy-pasting behavior? How can I reproduce character 1 (and not 2) in other applications?
FWIW, when I look at a Chinese dictionary, the stroke order shows character 2 even though the browser renders the character as 1.

This is a z-variant, and in this case probably an example of Han unification.
From https://www.zdic.net/hans/%E8%AA%A4:
You can see that the first character (marked as 内地 Mainland China) is what you're getting in the headword.
The headword on Wikipedia is formatted with lang=zh, whereas the example sentences use zh-Hans and zh-Hant respectively, and that's the core of this, along with likely subtags fallback.
Most systems dealing with locales perform locale fallback using likely subtags: So, Hans without any country specified typically implies CN, and Hant implies TW during fallback. The reverse is also true (and some other countries like HK imply Hant as well). Hans/Hant are script codes for Simplified and Traditional Chinese, and CN/TW are China and Taiwan respectively. zh on its own implies zh-CN (and thus zh-Hans-CN)
Fallback also need not always occur the same way, different fonts have different priorities (e.g. a Mainland Chinese font may assume CN by default unless explicitly told otherwise)
I made a little table, screenshot showing the rendering of different language tags on my system when run on Wikipedia (snippet at the bottom of this post)
The font's actually defaulting to Noto Sans CJK JP unless I put it in a class=Hant context (where it switches to Noto Sans CJK TC).
What's happening under the hood is: traditional vs simplified is not unified in Unicode, but such variants are. Even though zh implies zh-Hans-CN, because this is a traditional character, the font will not use the Hans to pick a Simplified character: it must pick a traditional character since Simplified is encoded differently. So you get the Mainland Chinese traditional variant in zh contexts (like the headword), but since zh-Hant implies zh-TW, the font is happy to oblige and give you the Taiwanese (still traditional) variant in the example sentence.
Note that not all cases stick to a single font: sometimes the choice of language can force a different font to be selected (or the precise CSS used). Additionally, you can have z-variants crop up in different contexts without needing to change the language, for example the Cantonese possessive 嘅 can be built as ⿰口既 or ⿰口旣 and the choice is not clearly locale based and seems to vary freely between fonts.
Code for table above:
<table>
<tr lang=zh><td>zh</td><td>誤</td></tr>
<tr lang=zh-Hans><td>zh-Hans</td><td>誤</td></tr>
<tr lang=zh-Hant><td>zh-Hant</td><td>誤</td></tr>
<tr lang=zh-CN><td>zh-CN</td><td>誤</td></tr>
<tr lang=zh-Hant-CN><td>zh-Hant-CN</td><td>誤</td></tr>
<tr lang=zh-Hans-CN><td>zh-Hans-CN</td><td>誤</td></tr>
<tr lang=zh-TW><td>zh-TW</td><td>誤</td></tr>
<tr lang=zh-HK><td>zh-HK</td><td>誤</td></tr>
<tr lang=zh-Hans-TW><td>zh-Hans-TW</td><td>誤</td></tr>
<tr lang=ja><td>ja</td><td>誤</td></tr>
<tr lang=ko><td>ko</td><td>誤</td></tr>
<tr lang=vi><td>vi</td><td>誤</td></tr>
</table>

(Based on a Twitter discussion with manishearth)
The difference is coming up due to variations across fonts (called z-variants). Specifically, based on the language tag, the browser can pick different fonts within the same font family (e.g. sans-serif). For example, on my device:
With lang="zh", the browser picks PingFang SC from sans-serif.
With lang="zh-Hant", the browser picks PingFang TC from sans-serif.
These two fonts render the character differently. The lang tag is different in different parts of the HTML, causing different font selection and hence different rendering.
Outside the browser, depending on the language context, the variant/language can also change. There is more discussion of this with examples on the Han Unification Wikipedia page.

Related

Swift: Unicode transformations: How to generate a rainbow infinity symbol

In xcode, developing for iOS "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}" is a rainbow flag.
"\u{1F3F3}" is a white flag, and "\u{1F308}" is a rainbow. The middle symbols "\u{FE0F}\u{200D}" are invisible symbols used to join these two together to make the rainbow flag symbol.
I am trying to combine unicode characters to make a rainbow infinity symbol, but not exactly sure how to implement this.
Not sure if there is an already existing unicode character or apple api I can use to do this, but would appreciate learning how to do this
I wouldn't mind having an infinity symbol over the rainbow flag either (like the apple anti-lgbt flag incident) as an alternative.
Emoji fonts are still just fonts. If they don’t contain a specific glyph, then they cannot display that glyph. The reason “🏳️‍🌈” looks like a rainbow flag is because someone drew a picture of a rainbow flag and then defined their font in such a way that the sequence <U+1F3F3, U+FE0F, U+200D, U+1F308> would be displayed using that specific image. Much like how someone first had to define the precise shape of the letter “A” in their font and then apply that glyph to the codepoint U+0041.
There is no image-rendering code that instinctively knows how to apply the colours of 🌈 to the shape of 🏳️ and then automatically generates a new glyph on the fly. It’s all explicitly pre-defined.
U+200D is the so-called Zero Width Joiner (ZWJ), so emoji sequences using that character are appropriately named Zero Width Joiner Sequences. They were originally invented by Apple to support emoji that weren’t part of the Unicode standard (in particular, variants of 💏, 💑, and 👪️ with different gender configurations), but later other vendors jumped on board as well and nowadays they are officially part of Unicode as an alternative way for defining new emoji without having to encode entirely new characters. Currently, about a third of all officially recommended emoji are ZWJ sequences.
In theory, any person can make up their own ZWJ sequences just by joining existing characters together (as was their original intent). In your case, “♾️+ZWJ+🌈” or <U+267E, U+FE0F, U+200D, U+1F308> would be an obvious sequence for a rainbow-coloured infinity symbol. You just have to create your own font containing the glyph you want, and then distribute that font to other people so that they can see the same glyph as you. There are just a few problems:
Making fonts with colourful glyphs is not easy. I couldn’t tell you whether there even exist freely available tools for that task.
There are four different formats for emoji fonts (used by Apple, Google, Microsoft, and Mozilla respectively) and they generally do not work on each other’s platforms, so you would need to create not just one, but several fonts unless you don’t care about people on other operating systems.
Installing your own fonts is not possible on most mobile phones, so your custom emoji would mostly only be available to desktop users.

What range of unicode characters should be kept in a #font-face web font for a US based website with a US audience?

As part of optimizing a web development project, we need to strip out unnecessary characters that are never going to be used to reduce the size of font files. I have searched Google and found nothing canonical on the subject of which characters are required and which are safe to remove.
I've found the following ranges that may be of interest:
0020 — 007F Basic Latin
00A0 — 00FF Latin-1 Supplement
0100 — 017F Latin Extended-A
0180 — 024F Latin Extended-B
0250 — 02AF IPA Extensions
02B0 — 02FF Spacing Modifier Letters
0300 — 036F Combining Diacritical Marks
27C0 — 27EF Miscellaneous Mathematical Symbols-A
It seems that the most aggressive approach would be to only keep "Basic Latin", 0020 — 007F, which provides upper and lower-case letters, numbers and a few basic symbols, like the $, +, (, ), etc.
Latin-1 Supplement contains some extra goodies like Trademark and Copyright symbols and fractions.
Latin Extended-A and -B contain letters with accent marks, and since our copy is in English, I'm not sure if these will ever be needed.
If we use only that ranges (0020 — 007F) and (00A0 — 00FF), will we run into problems down the line with missing characters, should some user decide to post a comment in Spanish (for example)? Or will the browser fall back to a default font for characters that aren't included the web font?
The point of a web-font is to make the main bodies of text and headlines look pretty, which the basic latin set should cover, but I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font, etc.
What range of unicode characters should be kept in a #font-face web
font for a US based website with a US audience? Are there any best practices or guidelines for striping unnecessary characters from a font for web use?
I would recommend subsetting to one of the common "code page" definitions that support US/Western Europe. Most code page definitions pre-date Unicode and typically have the bits and pieces needed for various regional support without including entire Unicode blocks. Suggestions:
Windows Code Page 1252
ISO/IEC 8859-1 "Latin 1"*
ISO/IEC 8859-15
*This is the same as Unicode Ranges 0020-007F Basic Latin + 00A0-00FF Latin-1 Supplement
These include much more than is strictly required for US English, though as noted above, several accented characters commonly appear in English text (é, ñ, as well as other punctuation marks and symbols). These sets include those characters, so you should be in good shape for the vast majority of text for a U.S. audience. Note also that in most fonts, these characters are typically "composites", which means that they use a reference to the components (e.g. 'é' is built from references to 'e' and '´'); as such, they don't normally require as much size to store them, so retaining them usually won't incur a major size penalty.
If you might encounter European financial text, I'd suggest either Windows 1252 or ISO/IEC 8859-15 which include the Euro currency symbol.
I don't know if there are hidden "gotchas" with stripping down to just the "Basic Latin" range, like accented characters showing as diamond question marks instead of falling back to a system font
Any characters that don't exist in the font you are using will fall back to any default font the browser can find with the characters in. This will likely be ugly when interleaved with other characters from your custom font, but modern OSes provide decent font coverage for commonly-used characters from the above blocks so typically it will still be readable.
So you should include characters based on whether you think they'll be used commonly enough that having them rendered in an ugly font is a deal-breaker. For what it's worth, a pretty minimal set I have used before for a similar purpose is ¡£°±²³¿ÉËÑéëñ‘’“”–—•€™, but your site's exactly requirements may vary. (For example, if you coöpted the New-Yorker-style diaeresis you would certainly want äëïöü.)
(How exactly default fallback fonts work varies between browsers and was famously troublesome in older versions of IE, and IE Mobile. But the basic accented Latin letters are pretty safe.)

Unicode Keystroke Characters?

Does unicode have characters in it similar to stuff like the things formed by the <kbd> tag in HTML? I want to use it as part of a game to indicate that the user can press a key to perform a certain action, for example:
Press R to reset, or S to open the settings menu.
Are there characters for that? I don't need anything fancy like ⇧ Shift or Tab ⇆, single-letter keys are plenty. I am looking for something that would work somewhat like the Enclosed Alphanumerics subrange.
If there are characters for that, where could I find a page describing them? All the google searches I tried turned only turned up "unicode character keyboard shortcuts" stuff.
If there are not characters for that, how can I display something like that as part of (or at least in line with) a text string in Processing 2.0.1?
(The rendering referred to is not the default rendering of kbd, which simply shows the content in the system’s default monospace font. But e.g. in StackOverflow pages, a style sheet is used to format kbd so that it looks like a keycap.)
Somewhat surprisingly, there is a Unicode way to create something that looks like a character in a keycap: enter the character, then immediately COMBINING ENCLOSING KEYCAP U+20E3.
Font support to this character is very limited but contains a few free fonts. Unfortunately, none of them is a sans-serif font, and the character to be shown inside should normally appear in such a font – after all, real keycaps contains very simple shapes for characters, without serifs. And generally, a character and an enclosing mark should be taken from the same font; otherwise they might be incompatible. However, it seems that taking the normal character from the sans-serif font (FreeSans) in GNU Freefont and the combining mark from the serif font (FreeSerif) of the same source creates a reasonable presentation:
I’m afraid it won’t work here in text, but I’ll try: A⃣ .
Whether this works depends on the use of suitable fonts, as mentioned, but also on the rendering software. Programs have been rather bad at displaying combining marks, but there has been some improvement. I tested this in Word 2007, where it works OK, and also on web browsers (Chrome, Firefox, IE) with good results using code like this:
<style>
.cap { font-family: FreeSerif; }
.cap span { font-family: FreeSans; }
</style>
<span class="cap"><span>A</span>⃣</span>
It isn’t perfect, when using the fonts mentioned. The character in the cap is not quite centered. Moreover, if I try to use the technique e.g. for the character Å (which is present on normal Nordic keyboards), the ring above A extends out of the cap. You could tweak this by setting the font size of the letter in the cap to, say, 85% of the font size of the combining mark, but then the horizontal position of the letter is even more off.
To summarize, it is possible to do such things at the character level, but if you can use other methods, like using a border or a background image for a character, you can probably achieve better rendering.

What is the unicode variation selector

I was wondering. What is the unicode Variation Selectors U-FE00 to U-FE0F used for.
Example: ︀︁︂︂ 
The Unicode standard talks about this. Here's a bit of the relevant section from 3.2.0, annex 28 (I'm sure there are more recent versions around; this is the first I found):
Unicode characters can be represented by a wide variety of glyphs, as discussed in Chapter 2, General Structure in The Unicode Standard, Version 3.0. Occasionally the need arises in text processing to restrict or change the set of glyphs that are to be used to represent a character. Normally such changes are indicated by choice of font or style in rich-text documents. In special circumstances, such a variation from the normal range of appearance needs to be expressed side-by-side in the same document in plain-text contexts, where it is impossible or inconvenient to exchange formatted text. For example, in languages employing the Mongolian script, sometimes a specific variant range of glyphs is needed for a specific textual purpose for which the range of “generic” glyphs is considered inappropriate. The variation selectors are used when characters have essentially the same semantic.
Variation selectors provide a mechanism for specifying a restriction on the set of glyphs that are used to represent a particular character. They also provide a mechanism for specifying variants, such as for CJK Ideographs and Mongolian, that have essentially the same semantic but have substantially different ranges of glyphs. A variation sequence, which always consists of a base character followed by the variation selector, may be specified as part of the Unicode Standard. That sequence is referred to as a variant of the base character. The variation selector affects only the appearance of the base character,* and only in the variation sequences defined in this Standard. The variation selector is not used as a general code extension mechanism.
(It goes on...)
You may also be interested in the Standardized Variants (this time from 6.0.0).
This is not a complete answer to the question, but it's pertinent to Emojis and Variant Selectors:
The ❤ character (U+2764 code point) is a Unicode character from 1993.
But the ❤️ emoji is actually the ❤ (U+2764) character followed by the Variant Selector-16 (U+FE0F).
Why?
Exclusively speaking about Emojis (documentation):
VS15 and VS16 are reserved to determine whether or not a character
should be displayed as an emoji. [...]
Emoji variation sequences contain VS16 (U+FE0F) for emoji-style (with color) or VS15 (U+FE0E) for text style (monochrome)
If there is a character (or symbol, glyph, etc...) that is intended to be also a emoji, the Variant Selector-16 will specify to the render, to renders it as Emoji. But if the same character is followed by the Variant Selector-15, it will specify to the render, to renders it as just text. If no Variant Selector is appended, than the default representation will depends on Unicode's specification. For Emoticons the default is Emoji. For other characters like ❤, the default is text...
Another example from Emoticons (Unicode_block)'s documentation:
Each emoticon has two variants:
U+FE0E (VARIATION SELECTOR-15) selects text presentation (e.g. 😊︎ 😐︎ ☹︎)
U+FE0F (VARIATION SELECTOR-16) selects emoji-style (e.g. 😊️ 😐️ ☹️).
If there is no variation selector appended, the default is the
emoji-style. Example:
U+1F610 (NEUTRAL FACE) 😐
U+1F610 (NEUTRAL FACE), U+FE0E (VARIATION SELECTOR-15) 😐︎
U+1F610 (NEUTRAL FACE), U+FE0F (VARIATION SELECTOR-16) 😐️
Note: The VS15 and VS16 are not mandatory to a valid emoji. There are a lot of emoji without Variant Selectors.
Your guess is as good as mine.. but according to this source...
has got it...
Emoji Character Encoding Data Hints: 1 In iOS 5 / OSX 10.7, the underlying code that the Apple OS generates for this emoji was changed.2 The code generated for this emoji was changed slightly in iOS 7 / OSX 10.9 (a variation selector was added) to make it easier for this emoji to be identified and shown in OSX and iOS. We don't mind Apple, thank you! We just love our emojis!
Their chart goes on to note that this "new", post-10.9 version
has a UTF-8 Character Count of 2 vs the previous 1... if that helps.
The Variation Selectors range was introduced with version 3.2 of the Unicode Standard, and is located in Plane 0, the Basic Multilingual Plane. Further selectors can be found in the Variation Selectors Supplement range.
Most Unicode characters can be represented by a wide variety of glyphs, and in rich text a particular glyph can be indicated by choosing a particular font or style. This mechanism is not available in plain text, and so variation selectors have been introduced as a way of indicating that the glyphs applicable to a particular character should be changed or restricted. The base character is followed by the variation selector, the combination being called a variation sequence. This is not intended to be general-purpose mechanism, and the only permitted variation sequences are those defined in the Standardized Variants file, which forms part of the Unicode Character Database.
From http://www.alanwood.net/unicode/variation_selectors.html

What's the unicode glyph used to indicate combining characters?

My application needs to display "orphaned" combining characters. I would like to use the same format as the "official" unicode charts, using the dotted circle placeholder. See, for example:
Combining Diacritical Marks (PDF)
A quick scan through the charts and I came up with U+25CC "DOTTED CIRCLE". That looks good, but the note on this character reads:
note that the reference glyph for this
character is intentionally larger than
the dotted circle glyph used to
indicate combining characters in this
standard; see, for example, 0300
Which says (I think) that U+25CC is not the correct character. (Or, if it is, perhaps just a poorly worded note.)
So: if the dotted circle used on the "Combining Diacritical Marks" is not U+25CC, what is the correct code for that little booger?
I have tried:
Copying the text from the PDF and inspecting it, but the copy is disabled in the PDF.
Emailing it to myself in Gmail and then viewing the attachment as HTML, but there is gets converted to U+0024 ("DOLLAR SIGN"). Which means that either the conversion failed or they are just playing some font rendering games in the PDF.
[Clarification] I realize that the U+25CC looks OK (assuming one's font supports it), but it sounds like the spec says that this is the wrong character. Many unicode characters have similar glyphs but are different characters, semantically speaking. "Latin Capital Letter A" (U+0041) and "Greek Capital Letter Alpha" (U+0391) will look identical for most fonts, but they have different semantic meanings and are not interchangable.
I don't think there is an official placeholder character. The way I read that note, they chose U+25CC arbitrarily, purely for display purposes. Then, in the chart where the "real" dotted circle is listed, they made it a little larger to emphasize that it's not being used as a placeholder there. (Or maybe they shrunk it in the other charts; as you said, the note's poorly worded.)
Whatever the case, I don't see any reason not to use U+25CC as your placeholder.
Just tried this: create a blank .html file, copy the text, and load in Firefox. Displays as expected (although I really didn't expect space+combining character to display correctly):
<html>
<body>
<font size="24pt">
◌̀
◌́
◌̂
◌̃
<br/>
À
Á
Â
Ã
<br/>
̀
́
̂
̃
</font>
</body>
</html>