Why do the printed unicode characters change?

Why do the printed unicode characters change? - unicode

The way the unicode symbol is displayed depends on whether I use the White Heavy Check Mark or the Negative Squared Cross Mark before it or not. If I do, the Warning Sign is coloured. If I put a space between the symbols, I get the mono-coloured text-like version.
Why does this behaviour exist and can I force the coloured symbol somehow?
I tried a couple of different REPLs, the behaviour was the same.
; No colour
(str (char 0x274e) " " (char 0x26A0))
; Coloured
(str (char 0x274e) "" (char 0x26A0))
Clojure unicode display.
I expect the symbol being displayed the same way regardless of which symbol comes before it.

Why does this behaviour exist
A vendor thought it would be a neat idea to render emoji glyhps in colour. The idea caught on.
https://en.wikipedia.org/wiki/Emoji#Emoji_versus_text_presentation
can I force the coloured symbol somehow
U+FE0E VARIATION SELECTOR-15 and U+FE0F VARIATION SELECTOR-16
http://mts.io/2015/04/21/unicode-symbol-render-text-emoji/

Unicode is about characters (code points), not glyphs (see it as "image" of a character).
Fonts are free to (and should) merge nearby characters into a single glyphs. In printed Latin scripts this is not very common (but we could have it e.g. ff,fi, ffi), without considering the combining codepoints which, per definition, should combine with other characters, to get just one glyph,
Many other scripts require it. Starting to cursive Latin scripts, but most cursive scripts requires changes. E.g. Arabic has different glyphs of initial, final, middle or separated character (+ special combination, common to cursive scripts). Indian scripts have similar behaviours.
So the base of Unicode has already this behaviour, and modern good fonts should be able to do it.
It was not so late, that emojii uses such functionality, e.g. country letters/flags to other common cases.
Often the Unicode documentation tell you of such possibilities, and the special code points which could change behaviour, but then it is task of the font to fullfil the expected behaviour (and to find good glyphs).
So: character (as unicode code point) is not one to one to a design (glyphs).

Related

The list of unicode unusual characters

Where can I get the complete list of all unicode characters that doesn't behave as simple characters. Examples: character 0x0363 (won't be printed without another one before), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters to replace them with something harmless to avoid unwanted output effects. Regular characters (those who not in this list) should use exactly one character place when printed (= cursor moved +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have some unicode string, usually consists of "usual" characters like 0x20-0x7E or cyrillic letters. Also, there are a lot of other unicode characters that are usual and may be safely assumed as having strlen() = 1. The string is printed on the terminal and I should know the resulting position of the cursor. I don't want to use some complex and non-stable libraries to do that, i want to have simplest possible logic to do that. Every problematic character may be replaced with U+0xFFFD or something like "<U+0363>" (ASCII string with its index instead of character itself). I want to have a list of "possibly-problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not much.

There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❤️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is Ǆ, but many fixed-width fonts have a narrow glyph for it: Ǆ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But Ǆ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).

Unit Separator "us"

I've seen the unit separator represented as different symbols (I've provided links to each one). What's the difference between each one? I'm working on a project and the only symbol that works is the "us" symbol.
Unit Separator Symbol #1:
Unit Separator Symbol #2:
Unit Separator Symbol #3:

Unit separator is one of the many ASCII control codes, so done for very old times. You see that you can use FS, GS, RS, and US, to split data (e.g. on a serial console).
Such control characters are interpreted as control character in Unicode (so in modern world), so without a real symbol.
And then things may get complex. Text processor, shaping engines and/or fonts may interpret control characters differently: either just as control, and so possibly ignoring them, if they do not have semantic for such control, or trying to display it. One common form it is to use U+241F (SYMBOL FOR UNIT SEPARATOR), in the Unicode block Control Pictures (U+2400 – U+243F), which includes symbols for all ASCII control codes. Note: fonts display it differently, some fonts as a boxed text with an abbreviation, some fonts as small letters in diagonal.
Note old fonts (with 256 symbols) used control character for extra symbols, see e.g. the default DOS code page: https://en.wikipedia.org/wiki/Code_page_437, where you see your symbol: the black triangle. ("Black" in font means filled, so not just the sides/contour). Note: there were also special methods on how to print them (instead of interpreting them as control characters), and different systems used different symbols on control codes.

Unicode Character COMBINING LATIN SMALL LETTER C

What is the likelihood that I'll run into COMBINING LATIN SMALL LETTER C (U+0368) in "real life" (besides clever Scottish folk)?
I'm asking since it's in both the Unicode Block Combining Diacritical Marks and the Category Mark, Nonspacing [Mn].
As a result, it seems to gets treated the same as characters such as COMBINING GRAVE ACCENT (U+0300) by Utilities such as the ICU Transliterator (using either the suggested "NFD; [:Nonspacing Mark:] Remove; NFC" or a straight "Latin-ASCII" transliteration).

The likelihood is very close to zero, but not exactly zero. You cannot prevent anyone from using a Unicode character as he likes. There is no specific information about U+0368 in the Unicode Standard, but it has definitely been defined as a combining character that will cause a symbol (c) to be displayed above the preceding character. I would expect to find it mostly in digitized forms of medieval manuscripts, or something like that.
Using it after a space character, as in the “clever” page mentioned, is not the intended use, but not invalid either. Unicode lets you use any combining mark after any character, whether it makes sense or not.
It has no canonical or compatibility decomposition, so there is no clear-cut way to deal with in a context where you cannot, or do not want to, retain the character.

The likelihood is utterly indeterminate except to say that if you expect it not to occur, then it will occur.

Visually-identical characters in Unicode

I want to find visually identical characters for a specific character in Unicode.
I know how to find canonical or compatibility decompositions of a character; but they do not give me what I want.
I want to find characters that are visually identical (not similar), and their only difference can be their sizes.
for example I want : (s,S), or (S,S) (whose code points are different).
I do not want (ß, β), or (e, é).
Any suggestions? Thanks.

For a particular character, you could start from annotations in the code charts in the Unicode standard. The annotations often refer to other characters for various reasons, including similarity or identity of shape. But the annotations are not meant to cover everything.
You could also draw your character at http://shapecatcher.com/ and ask it to recognize it. You often get a long list of visually similar alternatives.
As #TedHopp writes in his comment, visual identity is font-dependent. For example, “s” and “S” need not be identical in shape; in most fonts, they are not – the basic form is the same, but there are various differences in stroke width variation, curvature, serifs, etc. However, some characters can be expected to be visually identical in any font that contains them, such as Latin capital A, Greek capital alpha Α, and Cyrillic capital А.
You did not specify the purpose of the study, but you might be doing something that has been carried out to some extent by the Unicode Consortium. See UTR #6, Unicode Security Considerations, which also contains references to related work, including UTS #9, Unicode Security Mechanisms, which contains confusables.txt, Recommended confusable mapping for IDN (i.e., for a particular context, but it may be of interest for other purposes as well).

Which is the difference between the tick symbol U+2713 and U+2714

Appart from the visual aspect... do those tick symbols have any different semantics?
I mean: One is thin and the other bold. But... any special meaning for one or the other? Or it is just a matter of using one graphical aspect or another?

Unlike the majority of characters in Unicode, the Dingbats range U+27xx have no particular semantic content. The 'heavy' check mark has no meaning beyond 'a check mark that is visually bolder than the other one'; contrast this with the 'bold' letters in plane that have a mathematical meaning.
This range of characters comes from the symbol font Zapf Dingbats. Symbol fonts are visual in nature and don't fit well in Unicode, but Zapf Dingbats has historical significance as a one of the PostScript core font set guaranteed to be available on PS printers. Subsequently characters from Zapf Dingbats have commonly been used in document interchange, making it worthwhile to standardise them.

Appart from the visual aspect.
There's no appart here, the visual aspect is King in Unicode. U+2713 is a check mark. U+2714 is a heavy check mark. It should appear as a bolder version of U+2713 if you have a decent font.
These codepoints are in a group named Dingbats, a group of typographical symbols. Chess pieces, arrows, asterisks, that sort of thing. There's no semantic meaning attached to them. It is just heavier.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse