Are all "non-grapheme" code points invisible?


In a Unicode string, each grapheme consists of one or more code points. However, there are some code points, such as the zero-width joiner (ZWJ), which are never a grapheme on their own. The ZWJ is, in itself, invisible. Are all of those "non-grapheme" code points always invisible?

The Unicode representation of the Ogham script is notable for containing a non-invisible whitespace character. (U+1680: OGHAM SPACE MARK)
Tom Scott made an excellent YouTube video on the subject: link

There are many joining characters which are intended to modify a base character. Whether they provide a grapheme on their own is partially an implementation detail, I expect.
Example: o followed by U+0308 COMBINING DIAERESIS produces ö (the glyph in isolation is rendered by your browser as ̈)
List of all code points in this category: https://codepoints.net/search?lb=CM
Recent Unicode versions also have invisible characters which modify how a previous emoji is being rendered, famously to add e.g. a skin color trait to emojis with human figures or faces. These by definition are not graphemes in their own right, though again, rendering engines are probably free to figure out a way to represent them if they are encountered in isolation.
Example: 👋 U+1F44B WAVING HAND SIGN followed by U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2 (which in isolation renders as 🏻) produces 👋🏻
Full catalog: https://www.unicode.org/emoji/charts/full-emoji-modifiers.html
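For anyone who wants to poke at these code points, here is a minimal sketch using Python's standard unicodedata module. It only reports the general category and combining class of the characters mentioned above; it does not do grapheme cluster segmentation, which the standard library does not provide. The helper name describe is made up for this illustration.

```python
import unicodedata

# Code points mentioned in the question and answers above.
SAMPLES = ["\u200d", "\u1680", "\u0308", "\U0001F3FB"]


def describe(ch):
    """Return (general category, canonical combining class) for one code point."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for ch in SAMPLES:
    category, combining_class = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {category}, ccc={combining_class}")

# "o" followed by COMBINING DIAERESIS composes to the single code point U+00F6 under NFC.
composed = unicodedata.normalize("NFC", "o\u0308")
print(len(composed), f"U+{ord(composed):04X}")
```

The ZWJ is a format character (Cf), the Ogham space mark is a space separator (Zs), the combining diaeresis is a nonspacing mark (Mn), and the skin tone modifier is a modifier symbol (Sk), so "not a grapheme on its own" covers several quite different kinds of characters.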

Related

Codepoint of the 'missing glyph'-box

When a textbox, browser or other program can't display a character, or the character is not valid unicode, a white-box character is drawn instead to represent the missing glyph.
I assume that this box-glyph is a Unicode character itself, thus I am looking for its codepoint so I can use it. Does anyone know which codepoint is used, or perhaps if my assumption is wrong and it is not necessarily a member of the font?
At first I thought it might be the White Square (U+25A1), but, after I compared this glyph with an example, I found white square was smaller. There is a larger variant of it (medium and large), but these do not appear in the font under consideration, so these can not be the ones I am seeing.

I managed to find my answer, here on Stack Overflow: https://stackoverflow.com/a/22636426/2718186
Particularly, the part that talks about the .notdef glyph. It seems that fonts reserve a special glyph, which no Unicode code point maps to, to indicate that a character has no glyph in the current font.

Is there a downwards double arrow with stroke unicode character?

I want the character ⇓ with stroke, just like ⇏ but downwards, but I can't find it. Does it exist?
Edit:
If you don't see the arrows (e.g. you use IE),
I want the character [downwards double arrow] with stroke, just like [rightwards double arrow with stroke] but downwards, but I can't find it. Does it exist?

There is no such character as a precomposed character (i.e., as a single encoded character, a code point assigned to a character), but you can in principle represent it using an arrow character followed by a combining overlay character.
The character “⇏” U+21CF RIGHTWARDS DOUBLE ARROW WITH STROKE has been defined as having the canonical decomposition RIGHTWARDS DOUBLE ARROW (U+21D2) COMBINING LONG SOLIDUS OVERLAY (U+0338). In principle, a character should be expected to be rendered the same way as its canonical decomposition. In practice, things don’t always go that way.
Along the same lines, a downwards double arrow with stroke could be written as the two-character sequence DOWNWARDS DOUBLE ARROW (U+21D3) COMBINING LONG SOLIDUS OVERLAY (U+0338) or, in HTML, as &#x21D3;&#x338;. In practice, few fonts contain these characters, and browsers may fail to implement the combination properly. Moreover, in many fonts, the result is awkward. In Arial Unicode MS and in DejaVu Serif, the result might be acceptable, but only the latter is free (can be legally used as a downloadable font via @font-face). Here’s the combination as rendered by your browser with the SO stylesheets in effect: ⇓̸.
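The decomposition can be checked with Python's standard unicodedata module. This is only a small sketch to confirm the encoding side of the argument; it says nothing about how a given font will render the sequence. The variable names are made up for this illustration.

```python
import unicodedata

RIGHT_WITH_STROKE = "\u21cf"  # RIGHTWARDS DOUBLE ARROW WITH STROKE

# The canonical decomposition is listed as a string of hex code points.
print(unicodedata.decomposition(RIGHT_WITH_STROKE))

# NFD expands the precomposed character into arrow + combining overlay.
decomposed = unicodedata.normalize("NFD", RIGHT_WITH_STROKE)
print([f"U+{ord(c):04X}" for c in decomposed])

# The downwards variant has no precomposed form, so NFC leaves the
# two-code-point sequence as it is.
down_with_stroke = "\u21d3\u0338"
recomposed = unicodedata.normalize("NFC", down_with_stroke)
print(len(recomposed))
```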

It doesn't seem to exist, according to this page (compared to this).

How can I detect any unicode characters which have descenders, using .NET

I am trying to minimize the vertical distance between controls on a programmatically constructed Windows Form (using C#). This involves setting the Height property appropriately.
I have found that if the text of the control does not contain any letters with descenders in them (i.e. does not have any of the characters j, g, p, q or y) then the control Height can be smaller than when it does contain such letters (if it does contain letters with descenders then the descenders are chopped off if the Height isn't enough).
It will work fine to test for any of the above 5 characters as long as the language is English, or English-like, but I need to be able to cater for (just about) any language.
Is there a way, given some arbitrary Unicode character (and perhaps a font) to determine if that Unicode character has a descender or not?

There is no property defined for Unicode characters to indicate the presence of a descender, and it’s really a feature of glyph design rather than of characters. For example, “Q” has a descender in many fonts, and “J” has one in some. Besides, given the context, you should also consider diacritic marks placed below a letter, not just descenders of base letters. And probably diacritics above letters, too.
So you would need to read the font information (when available) about character dimensions, or tentatively draw characters in your software and measure their dimensions.
As a rule of thumb, any line height below 1.1 times the font size will cause problems with some characters and fonts. Using 1 (“setting solid”) is not enough, because characters may in fact extend outside the font size.

In Windows, you call GetPath() to get an array containing the X/Y coordinates of every point making up the perimeter or outline of the string of glyphs. Search the array for min/max, which will get you the rectangle exactly enclosing the string. Right to the edge of the letters.
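The min/max search itself is simple. Here is a rough sketch in Python, assuming the outline points have already been obtained somehow (for example from GetPath()) as (x, y) pairs in device coordinates where y grows downward. The function names, the sample points and the baseline value are all made up for illustration; they are not part of any Windows or .NET API.

```python
def bounding_box(points):
    """Return (left, top, right, bottom) of a list of (x, y) outline points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def extends_below(points, baseline_y):
    """True if the outline reaches below the baseline (y grows downward here)."""
    _, _, _, bottom = bounding_box(points)
    return bottom > baseline_y


# Hypothetical outline points for two short strings, baseline at y = 20.
outline_without_descender = [(0, 5), (8, 5), (8, 20), (0, 20)]
outline_with_descender = [(0, 9), (7, 9), (7, 26), (2, 26), (0, 20)]

print(bounding_box(outline_with_descender))
print(extends_below(outline_without_descender, 20))
print(extends_below(outline_with_descender, 20))
```

The same rectangle also catches diacritics below or above the letters, since it measures the drawn outline rather than looking at which characters are in the string.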

iOS japanese handwriting input code help please

I have a series of questions about writing code for iOS that includes handwriting recognition of Japanese. I am a beginner, so be gentle and assume I am stupid ...
I'd like to present a Japanese word in hiragana (the Japanese phonetic alphabet), then have the user handwrite the appropriate kanji (Chinese character). Then, this is internally compared to the correct character. Then, the user gets feedback (whether they were correct or not).
My questions here revolve around the handwritten input.
I know that this type of input is normally possible if one uses the Chinese keyboard.
How can I implement something similar, without using the keyboard itself? Are there already library functions for this (I feel there must be, since that input is available on the Chinese keyboard)?
Also, kanji aren't exactly the same as Chinese characters. There are unique characters that Japanese people invented themselves. How would I be able to include these in my handwriting recognition?

We worked on a similar exercise back at University.
The stroke order of kanji is well defined, and there are only 8 (?) different strokes, so each kanji is basically a well-ordered sequence of strokes. For example, te (hand) is the sequence "the short falling backward stroke", then the "left to right stroke" twice, and finally "the long downward stroke with the little tip at the bottom". There are databases that give you this information.
Now the problem is almost reduced to identifying the correct stroke. You will still run into some ambiguities where you have to take into account the spatial relation between some strokes and others.
EDIT: For stroke recognition we snapped the freehand writing to 45-degree angles (where is the little circle symbol on the keyboard?), thus converting it into a sequence of vectors along one of these eight directions. Let's assume that direction 0 is from bottom to top, direction 1 from bottom right to top left, 2 from right to left, and so on counterclockwise.
Then the first stroke of te (手) would be [23]+ (as some write it falling and some horizontal)
The second and third stroke would be 6+
and the last would be 4+[123] (as with the little tip, every writer uses a different direction)
This coarse snapping was actually enough for us to recognize kanji. Maybe there are more sophisticated ways, but this simple solution managed to recognize about 90% of kanji. The only handwriting it couldn't handle was that of one professor, but then no human except himself could read his handwriting either.
EDIT2: It is important that your user "prints" the Kanji and doesn't write in calligraphy, since in calligraphy many strokes are merged into one. Like when writing a kanji with the radical of "rice field" in calligraphy, this radical morphs into something completely different. Or radicals with a lot of horizontal dashes (like the radical of "speech" iu) just become one long wriggly line.
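To make the snapping idea concrete, here is a minimal sketch in Python. It is not our original university code. It assumes each stroke arrives as a list of (x, y) touch points in screen coordinates (y grows downward), snaps every segment to one of the eight directions described above, and matches the resulting digit string against the patterns given for te. All function and variable names, and the sample points, are made up for illustration.

```python
import math
import re


def snap_direction(dx, dy):
    """Snap a screen-space vector to a direction 0..7 (0 = up, counted CCW)."""
    # Screen y grows downward (assumption), so -dy points "up".
    theta = math.atan2(-dx, -dy)  # angle measured counterclockwise from "up"
    return round(theta / (math.pi / 4)) % 8


def encode_stroke(points):
    """Turn a list of (x, y) points into a string of direction digits."""
    digits = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # skip repeated points
        digits.append(str(snap_direction(dx, dy)))
    return "".join(digits)


# Patterns for te as listed above: one regular expression per stroke.
TE_PATTERNS = [r"[23]+", r"6+", r"6+", r"4+[123]"]


def matches_kanji(strokes, patterns):
    """True if every stroke, in order, fully matches its pattern."""
    if len(strokes) != len(patterns):
        return False
    return all(
        re.fullmatch(pattern, encode_stroke(stroke)) is not None
        for stroke, pattern in zip(strokes, patterns)
    )


# Hypothetical touch points for a printed te.
te_strokes = [
    [(60, 10), (40, 12), (20, 18)],            # short stroke, right to left
    [(10, 30), (40, 30), (70, 30)],            # left to right
    [(5, 50), (40, 50), (75, 50)],             # left to right
    [(40, 5), (40, 40), (40, 80), (32, 74)],   # downward, then the little tip
]

print([encode_stroke(s) for s in te_strokes])
print(matches_kanji(te_strokes, TE_PATTERNS))
```

A real recognizer would also have to resample the points and handle the spatial ambiguities between strokes mentioned above; this sketch only covers the direction encoding and the pattern match.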

Unicode character that lines up with ⎮ but is as long as ⎢

Sorry if this isn't the right overflow for this question. I need a unicode character that is as long as ⎢ (23A2, LEFT SQUARE BRACKET EXTENSION) but lines up horizontally with ⎮ (23AE, INTEGRAL EXTENSION). Is there such a character?

Take a look at shapecatcher. If you draw a straight line, it shows plenty of different codepoints resembling |.
As already pointed out, exact placement and size may depend on the font, but if you know that the font is going to be a specific one (because you supply it), you could still find the character you're looking for.

It turns out this does depend on the font. If I use DejaVu Sans Mono, INTEGRAL EXTENSION is as long as I want it to be. This font appears to be almost exactly the same as the font I was using, Menlo, except for some small differences with some characters (including this one).
