How is the transformation of a code point into the final character implemented in Unicode?

Characters inside the BMP are specified by 4 hex digits, and characters outside the BMP take 5 or 6 digits.
But my question is:
how is the final character drawn from the value of the code point?
Is a picture of each character stored on every computer, so that displaying a character just shows the matching picture?
Or is the final glyph a computed result of the code point itself?

Each Unicode character has a code. The software displaying the character obtains a glyph for that character code, usually from a font installed on the host computer, and then uses the obtained glyph to display the character.
If it can't find a glyph for that character (many fonts for Latin scripts completely omit the glyphs used for East Asian characters), it formally can't display it. It will then either indicate an error or use a substitute glyph meaning that the actual glyph can't be displayed (a question mark, a hollow box, or whatever).
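A sketch of that lookup step, using the third-party Python library fontTools to ask whether a font's cmap covers a given character (the font path is a placeholder):

```python
from fontTools.ttLib import TTFont

# Load a font and get its best available Unicode character-to-glyph map.
font = TTFont("/path/to/SomeFont.ttf")  # placeholder path
cmap = font.getBestCmap()               # dict: code point -> glyph name

for ch in "aä語":
    glyph_name = cmap.get(ord(ch))
    if glyph_name is None:
        # No glyph here: a renderer would fall back to another font,
        # or draw a replacement glyph instead.
        print(f"U+{ord(ch):04X} {ch!r}: not covered by this font")
    else:
        print(f"U+{ord(ch):04X} {ch!r}: glyph {glyph_name!r}")
```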

Related

Need a single-width Unicode character to indicate a wide character has been shortened for lack of space

I'm looking at formatting a UTF-8 free-text string to fit an exact column width on a terminal. I'm coding various truncation methods (left/middle/right) for long strings; however, when the truncation break point lands on a wide character, such as an emoji, the display column counting falls apart, and some form of padding is needed for the 'half-wide' column placement.
Is there a suitable narrow character that indicates we do have a valid Unicode character but insufficient display space to show it, as opposed to the replacement character � usually used for invalid Unicode?
Example: on a fixed-spacing terminal, fit two smiley emoji into the space that would hold 'aaa', e.g. "👨👨". So I need a, preferably standardized, substitute character for the second emoji/wide character, e.g. "👨⋮", to fit that three-column space.
A side issue is working out where decomposed composite characters start and end (also, are there combining prefixes?). It looks like the next code point needs to be read to see whether it is still zero-width (e.g. 'o' U+006F followed by umlaut U+0308, rather than the precomposed ö U+00F6; don't stop after the plain 'o').
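A rough Python sketch of the column counting, using only the standard unicodedata module: east_asian_width approximates what a terminal does (real terminals follow wcwidth(3)), combining marks count as zero columns, and the '⋮' filler is just the suggestion from the question, not a standardized character:

```python
import unicodedata

def char_width(ch: str) -> int:
    """Approximate terminal column width of one code point."""
    if unicodedata.combining(ch):          # combining marks take no column
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                           # wide and fullwidth characters
    return 1

def truncate_columns(s: str, width: int, filler: str = "⋮") -> str:
    """Right-truncate s to at most `width` terminal columns.

    If the cut would split a wide character in half, substitute a
    narrow filler so the result still lines up.
    """
    out, cols = [], 0
    for ch in s:
        w = char_width(ch)
        if cols + w > width:
            if width - cols >= 1:          # exactly one half-wide column left
                out.append(filler)
            break
        out.append(ch)
        cols += w
    return "".join(out)

print(truncate_columns("👨👨", 3))  # -> '👨⋮' (2 + 1 columns)
```

Note that this walks code points, not grapheme clusters, so ZWJ emoji sequences would still need the segmentation discussed in the side issue above.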

Are all "non-grapheme" code points invisible?

In a Unicode string, each grapheme consists of one or more code points. However, there are some code points, such as the ZERO WIDTH JOINER (ZWJ), which never form a grapheme on their own. The ZWJ is, in itself, invisible. Are all of those "non-grapheme" code points always invisible?
The Unicode representation of the Ogham script is notable for containing a non-invisible whitespace character (U+1680 OGHAM SPACE MARK).
Tom Scott made an excellent YouTube video on the subject: link
There are many combining characters which are intended to modify a base character. Whether they provide a grapheme on their own is, I expect, partially an implementation detail.
Example: o followed by U+0308 COMBINING DIAERESIS produces ö (the glyph in isolation is rendered by your browser as ̈)
List of all code points in this category: https://codepoints.net/search?lb=CM
Recent Unicode versions also have otherwise-invisible characters which modify how the preceding emoji is rendered, most famously to add a skin tone to emoji with human figures or faces. These by definition are not graphemes in their own right, though again, rendering engines are probably free to figure out a way to represent them if they are encountered in isolation.
Example: 👋 U+1F44B WAVING HAND SIGN followed by U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2 (which in isolation renders as 🏻) produces 👋🏻
Full catalog: https://www.unicode.org/emoji/charts/full-emoji-modifiers.html
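A quick way to see where these code points sit at the property level, using Python's standard unicodedata module. Note that the Fitzpatrick modifier is an ordinary symbol (category Sk) as far as character properties go; its special behavior comes from emoji-specific rules rather than a dedicated property:

```python
import unicodedata

samples = [
    0x200D,   # ZERO WIDTH JOINER: format control, invisible
    0x0308,   # COMBINING DIAERESIS: attaches to a base character
    0x1680,   # OGHAM SPACE MARK: whitespace, but visible
    0x1F3FB,  # EMOJI MODIFIER FITZPATRICK TYPE-1-2: modifies prior emoji
]

for cp in samples:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, "
          f"combining class={unicodedata.combining(ch)}")
```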

What the character codes are in the cmap table in TrueType fonts

Wondering what the "character codes" are in the cmap table in TrueType fonts. Microsoft talks about the Character to Glyph Index Mapping Table, but I don't see what the character code or glyph index mean.
Wondering if somewhere in the font file you specify the encoding, such as Unicode 11.0, and the character codes are then equal to Unicode code points, such as U+0061 for a. Or if the character codes are instead the "browser" character codes (decimal codes, I guess), such as 97 for a.
Basically, I'm wondering how you map keyboard characters to font glyphs, and what that really means. I think you don't so much want to map keyboard codes to font glyphs as Unicode code points like U+0061 to font glyphs, so that in JavaScript (for example) you can write \u03A9 and it will give you Ω if your font supports it.
I'm trying to understand the anatomy of a font file in terms of how it maps the mathematical glyphs, as vectors/paths, to characters or codes of some sort.
The short, but perhaps not desired, answer is of course "read the OpenType spec; it takes a while", so a slightly longer, but easier and less detailed, answer would be http://pomax.github.io/CFF-glyphlet-fonts, although that skips over TTF, so let's look at that here:
Your input code gets run through whichever cmap subtable is applicable in the context you're applying the font to, which maps the computer's code (ASCII code, Unicode code point, ISO-2022-JP, what have you) to a glyph ID. For TTF specifically, that ID is then used as an array offset into the "loca" table, which is the "glyph index to data location" table and specifies the byte offset into the "glyf" table for each glyph the font contains. You then consult the glyf table at that byte offset and start parsing the bytes as specified by https://learn.microsoft.com/en-us/typography/opentype/spec/glyf
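Here is that chain sketched with the third-party Python library fontTools, which resolves the loca offsets for you when you index the glyf table (the font path is a placeholder):

```python
from fontTools.ttLib import TTFont

font = TTFont("/path/to/SomeFont.ttf")   # placeholder path

# Step 1: cmap maps a character code (here a Unicode code point) to a glyph name.
cmap = font.getBestCmap()
code_point = ord("Ω")                    # U+03A9, as in the \u03A9 example above
glyph_name = cmap[code_point]            # KeyError here means the font has no Ω

# Step 2: the glyf table (indexed via loca internally) holds the outline data.
glyph = font["glyf"][glyph_name]
print(glyph_name, "contours:", glyph.numberOfContours)
```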

Code point of the 'missing glyph' box

When a text box, browser, or other program can't display a character, or the character is not valid Unicode, a white-box character is drawn instead to represent the missing glyph.
I assume that this box glyph is a Unicode character itself, so I am looking for its code point in order to use it. Does anyone know which code point is used, or is my assumption wrong and the box is not necessarily a character in the font?
At first I thought it might be WHITE SQUARE (U+25A1), but after comparing this glyph with an example, I found the white square was smaller. There are larger variants of it (medium and large), but these do not appear in the font under consideration, so they cannot be the ones I am seeing.
I managed to find my answer here on Stack Overflow: https://stackoverflow.com/a/22636426/2718186
In particular, the part that talks about the .notdef glyph. Fonts reserve a special glyph, not mapped to by any Unicode code point, to indicate that a character has no glyph in the current font.
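A sketch with the third-party Python library fontTools showing both halves of this: glyph index 0 is reserved for .notdef, and a code point the font doesn't cover simply has no cmap entry, so the renderer substitutes glyph 0 (the font path is a placeholder):

```python
from fontTools.ttLib import TTFont

font = TTFont("/path/to/SomeFont.ttf")   # placeholder path

# Glyph index 0 is reserved for the missing-glyph shape.
print(font.getGlyphOrder()[0])           # -> '.notdef'

# A code point absent from the cmap has no entry at all; the renderer
# substitutes glyph 0 rather than any Unicode character.
cmap = font.getBestCmap()
print(0x1F468 in cmap)                   # likely False for a Latin text font
```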

How can I detect any Unicode characters which have descenders, using .NET?

I am trying to minimize the vertical distance between controls on a programmatically constructed Windows Form (using C#). This involves setting the Height property appropriately.
I have found that if the text of the control does not contain any letters with descenders (i.e. none of the characters j, g, p, q or y), then the control's Height can be smaller than when it does contain such letters (and if it does contain letters with descenders, the descenders are chopped off when the Height isn't enough).
Testing for those five characters works fine as long as the language is English, or English-like, but I need to be able to cater for (just about) any language.
Is there a way, given some arbitrary Unicode character (and perhaps a font) to determine if that Unicode character has a descender or not?
There is no property defined for Unicode characters to indicate the presence of a descender, and it's really a feature of glyph design rather than of characters. For example, "Q" has a descender in many fonts, and "J" has one in some. Besides, given the context, you should also consider diacritical marks placed below a letter, not just the descenders of base letters. And probably diacritics above letters, too.
So you would need to read the font information (when available) about character dimensions, or tentatively draw characters in your software and measure their dimensions.
As a rule of thumb, any line height below 1.1 times the font size will cause problems with some characters and fonts. Using 1 ("setting solid") is not enough, because characters may in fact extend outside the em box implied by the font size.
In Windows, you can call GetPath() to get an array containing the X/Y coordinates of every point making up the perimeter, or outline, of the string of glyphs. Search the array for the min/max values, which gives you the rectangle exactly enclosing the string, right to the edge of the letters.
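As an illustration of the "draw and measure" approach (sketched here in Python with the third-party Pillow library rather than .NET, but the same measurement idea applies to the GetPath() route above): render the text and check whether its bounding box reaches below the baseline, which sits `ascent` pixels from the top of the line. The font path is a placeholder:

```python
from PIL import ImageFont

font = ImageFont.truetype("/path/to/SomeFont.ttf", size=48)  # placeholder path
ascent, descent = font.getmetrics()   # baseline sits `ascent` px below the line top

def has_descender(text: str) -> bool:
    """True if any glyph in `text` reaches below the baseline."""
    left, top, right, bottom = font.getbbox(text)  # bbox measured from the line top
    return bottom > ascent

for s in ["here", "jogger", "Ω", "ç"]:
    print(s, has_descender(s))   # 'jogger' and 'ç' should report True in most fonts
```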