What determines whether a combining character will combine? - unicode

If I have a string "\u05D2\u0308" I don't actually get a diaeresis on top of a gimel. It sits to the side, as a discrete glyph.
I don't actually want the combined glyph, but I'm confused in general. How does a combining diacritic like U+0308 decide whether to combine with the preceding character or hang out on its own?
And how much of this behavior is specified in the Unicode standard and how much is up to the individual text renderer or font?

Actually, it does combine.
You are perhaps using some environment where the text engine fails to render this correctly, but in fact your string is one character long (using the conventional sense of "character"), and a correct Unicode-compliant environment will tell you so.
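A quick way to see what the standard itself says about the two code points, independent of any renderer, is to look at their properties. This is only a sketch using Python's standard unicodedata module; the helper names describe and count_base_characters are mine, and counting code points with combining class 0 is a rough stand-in for real grapheme cluster segmentation (UAX #29), not a replacement for it.

```python
import unicodedata

s = "\u05D2\u0308"  # HEBREW LETTER GIMEL followed by COMBINING DIAERESIS

def describe(text):
    """Return (code point, General_Category, Canonical_Combining_Class) per code point."""
    return [
        (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.combining(c))
        for c in text
    ]

def count_base_characters(text):
    """Rough approximation: count code points whose combining class is 0."""
    return sum(1 for c in text if unicodedata.combining(c) == 0)

print(len(s))                    # number of code points
print(describe(s))               # properties of each code point
print(count_base_characters(s))  # approximate user-perceived characters
```

The string holds two code points, but the second one is a nonspacing mark (category Mn, combining class 230), so it attaches to the gimel. Whether you actually see it drawn on top is then down to the font and the text engine.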

Related

In Python (or any language) what does an "upper" function do to Hindi, Amharic and other non-Latin character sets?

Subject says it all. Been looking for an answer, but cannot seem to find it.
I am writing a web app that will store data in a database and also have language files translated into a wide variety of character sets. At various moments, the text will be presented. I want to control presentation such as spurious blank spaces at the beginning and end of strings. Also I want to ensure some letters are upper or lower case.
My question is: what happens in upper/lower case functions when the character set only has one case?
EDIT Sub question: Are there any unexpected side effects to be aware of?
My guess is that you simply get back the one and only character.
EDIT - Added Description
The main reason for asking this question is that I am writing a webapp that will be distributed and run on machines in remote areas with little or no chance to fix "on-the-spot" bugs. It's not a complicated webapp, but will run with many different language char sets. I want to be certain of my footing before releasing the server.
First of all, the upper() and lower() methods in Python can be applied to Hindi, Amharic and even non-letter characters.
The upper() method converts a lowercase character only if an equivalent uppercase character exists. If none exists, the character is returned unchanged.
Or better said, if there is nothing to convert, it stays the same.
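Here is a small sketch of that behavior in Python 3. The sample words are just illustrations I picked. The last lines show one side effect worth knowing about: for a few cased characters, upper() can change the length of the string.

```python
hindi = "हिन्दी"    # Devanagari has no upper/lower case distinction
amharic = "አማርኛ"  # Ethiopic has no case distinction either

# Caseless scripts come back unchanged from both methods.
print(hindi.upper() == hindi, hindi.lower() == hindi)
print(amharic.upper() == amharic)

# Digits and punctuation are left alone; cased letters around them are converted.
print("abc 123 !?".upper())

# Side effect: German sharp s uppercases to two letters, so the length changes.
sharp_s = "straße"
print(sharp_s.upper(), len(sharp_s), len(sharp_s.upper()))
```

So if your code assumes that len(s) == len(s.upper()), that assumption can fail for some Latin-script text even though it holds for the caseless scripts.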

The list of unicode unusual characters

Where can I get the complete list of all Unicode characters that don't behave as simple characters? Examples: character 0x0363 (won't be printed without another one before it), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters so I can replace them with something harmless and avoid unwanted output effects. Regular characters (those not in this list) should use exactly one character cell when printed (= cursor moved +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have some Unicode string, usually consisting of "usual" characters like 0x20-0x7E or Cyrillic letters. Also, there are a lot of other Unicode characters that are usual and may be safely assumed to have strlen() = 1. The string is printed on the terminal and I should know the resulting position of the cursor. I don't want to use some complex and non-stable libraries to do that; I want the simplest possible logic. Every problematic character may be replaced with U+FFFD or something like "<U+0363>" (an ASCII string with its index instead of the character itself). I want to have a list of "possibly-problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not many.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
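As a rough sketch of those checks, Python's standard unicodedata module exposes Bidi_Class, Canonical_Combining_Class and General_Category (it does not expose Joining_Type, so that check is missing here). The names SAFE_BIDI, is_simple and sanitize are my own. Note one assumption I had to add: a strict Bidi_Class == "L" test would reject the ASCII space, digits and punctuation, so this sketch also accepts a few weak and neutral classes.

```python
import unicodedata

# Assumption: besides "L", accept weak/neutral classes so that spaces,
# digits and ordinary punctuation are not flagged as problematic.
SAFE_BIDI = {"L", "EN", "ES", "ET", "CS", "WS", "ON"}

def is_simple(ch):
    """True if ch passes the property checks described above (minus Joining_Type)."""
    if unicodedata.category(ch)[0] in "MC":   # marks, controls, format, unassigned...
        return False
    if unicodedata.combining(ch) != 0:        # Canonical_Combining_Class != Not_Reordered
        return False
    return unicodedata.bidirectional(ch) in SAFE_BIDI

def sanitize(text):
    """Replace every character that fails is_simple with an ASCII <U+XXXX> tag."""
    return "".join(ch if is_simple(ch) else f"<U+{ord(ch):04X}>" for ch in text)

print(sanitize("a\u0363b"))     # combining mark gets replaced
print(sanitize("x\u0084y"))     # C1 control gets replaced
print(sanitize("привет 123"))   # plain Cyrillic, space and digits pass through
```

This uses whatever version of the Unicode Character Database ships with your Python build, so newer characters may show up as unassigned and get replaced.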
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❤️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
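To make the emoji point concrete, here is a minimal sketch. The heart is written with escapes so that the variation selector is visible; in_bmp is just a name I made up for the range check.

```python
import unicodedata

heart = "\u2764\uFE0F"  # HEAVY BLACK HEART + VARIATION SELECTOR-16

def in_bmp(ch):
    """True if the code point lies in the Basic Multilingual Plane (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF

print(len(heart))                         # two code points for one visible emoji
print([unicodedata.name(c) for c in heart])
print(in_bmp("\u2764"))                   # this emoji base is inside the BMP
print(in_bmp("\U0001F600"))               # most emoji are not
print(in_bmp("\U00012219"))               # neither is the cuneiform sign
```

The BMP test is a blunt filter: it throws out plenty of harmless characters and still lets the heart's base character through, so it only works as a first pass.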
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is DŽ, but many fixed-width fonts have a narrow glyph for it: DŽ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But DŽ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).
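For illustration, this is what the two encodings of é look like in Python and how NFC folds them together. It is only a sketch using the standard unicodedata module.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)            # different code point sequences
print(len(precomposed), len(decomposed))

# NFC composes the pair into the single precomposed code point where one exists.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed, len(nfc))
```

Normalizing to NFC before filtering means more text survives the "no combining marks" rule, but only for combinations that have a precomposed form.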

How do I embed arbitrary unicode without messing up the rest of the line?

So we have header|sequence|string1|string2|directive where string1 and string2 are arbitrary Unicode junk. Assuming the input can be really trashy Unicode (I'm expecting for it to contain things like right-to-left text, unbalanced Unicode direction control characters, etc) but not actually malicious, how can I get these strings to display in order?
The final website target is HTML, but we believe it's best to process the data as a string as far as possible. Blindly jamming a force-LTR before each | is not remotely acceptable, as this tends to carry into the text past the | and cause RTL to render as LTR.
First step: replace control codes with control pictures
Second step: fix RTL nonsense ??
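The first step could look something like this sketch. It maps the C0 control codes and DEL to the Control Pictures block (U+2400 and up); the function name to_control_pictures is hypothetical. C1 controls and the bidi formatting characters have no control pictures, so they are left untouched here and still need separate handling.

```python
def to_control_pictures(text):
    """Replace C0 controls (U+0000..U+001F) and DEL with their visible control pictures."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            out.append(chr(0x2400 + cp))  # e.g. TAB becomes U+2409
        elif cp == 0x7F:
            out.append("\u2421")          # SYMBOL FOR DELETE
        else:
            out.append(ch)
    return "".join(out)

print(to_control_pictures("a\tb\x00c\x7f"))
```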
I have to admit I was expecting the RTL stack to be simpler than it was. I cannot simply run the algorithm because there's no way to know the RTL-LTR-ness of a private use character.
We ended up with this kludgy method. It works. (Note that in the production code these inline styles turn into a class reference.)
<PRE><DIV DIR=LTR STYLE="display:inline-block;">|</DIV><DIV STYLE="display:inline-block;">something1</DIV><DIV DIR=LTR STYLE="display:inline-block;">|</DIV><DIV STYLE="display:inline-block;">something2</DIV><DIV DIR=LTR STYLE="display:inline-block;">|</DIV></PRE>
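If the markup is built from strings on the server, a generator along these lines would produce the same structure. This is a sketch, not our production code: render_line and SEP are names invented for the example, and html.escape is added on the assumption that the fields may contain &, < or >.

```python
from html import escape

# Separator cell: forced left-to-right so the pipe itself never reorders.
SEP = '<DIV DIR=LTR STYLE="display:inline-block;">|</DIV>'

def render_line(fields):
    """Wrap each field in its own inline-block so its direction cannot leak out."""
    cells = [
        f'<DIV STYLE="display:inline-block;">{escape(field)}</DIV>'
        for field in fields
    ]
    return "<PRE>" + SEP + SEP.join(cells) + SEP + "</PRE>"

print(render_line(["something1", "something2"]))
```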

Is it possible to use unicode combining characters to combine arbitrary characters?

Is it possible to use Unicode combining characters to, for example, make the characters x and y appear to be partially overlapping each other?
I know that in layout systems like CSS there are other ways to achieve this, but I specifically want to know if it's possible with just Unicode so I can, for example, do it in Slack messages.
No, there is no Unicode mechanism to make arbitrary letters overlap each other. You can put an x above a y using the character U+036F COMBINING LATIN SMALL LETTER X like so: yͯ, but that’s about it.
Latin letters partially overlapping each other serves no semantic function, so it is not part of the Unicode standard. And if it was found to be used to convey actual meaning in some writing system, it would most likely not be encoded as a generalised mechanism but as individual characters representing specific such ligatures.
The Unicode Consortium does not consider styling features like that to be part of plain text. That is also why those bold and italic mathematical letters you sometimes see on Twitter (𝐀, 𝐴, 𝓐 etc.) aren’t implemented as the base letters plus some style modifiers, but as separate character codes entirely. A character that means “display the preceding letter as bold” would have been too general; non-crucial style variation should be dealt with through higher-level protocols (like the CSS you mentioned) which are much more powerful and enjoy more widespread support anyway.
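Both points can be checked from Python's standard unicodedata module. This is just an illustrative sketch: the combining x is an ordinary character with its own name, and the bold mathematical A is a separate code point that only maps back to a plain A under compatibility normalization (NFKC).

```python
import unicodedata

stacked = "y\u036F"         # y with a small x placed above it
bold_a = "\U0001D400"       # the bold A seen in "styled" social media text

print(unicodedata.name("\u036F"))
print(len(stacked))         # two code points, rendered as one stacked glyph

print(unicodedata.name(bold_a))
print(bold_a == "A")        # a distinct character, not a styled A
print(unicodedata.normalize("NFKC", bold_a))
```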

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument is that the characters, regardless of their shape, have the same meaning and therefore should be represented by the same code. Well, the distinction is not meaningless to me, because I am analyzing individual characters, which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
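That whole-text heuristic could be sketched like this. The function name guess_cjk_language, the return labels and the 10% threshold for "a fair amount" are all my own assumptions, and the ranges cover only the main Hiragana/Katakana, Hangul Syllables and CJK Unified Ideographs blocks.

```python
def guess_cjk_language(text, threshold=0.1):
    """Guess the language of CJK text from the share of kana and hangul in it."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "unknown"
    kana = sum(1 for c in chars if "\u3040" <= c <= "\u30ff")    # Hiragana + Katakana
    hangul = sum(1 for c in chars if "\uac00" <= c <= "\ud7a3")  # Hangul Syllables
    han = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")     # CJK Unified Ideographs
    if kana / len(chars) >= threshold:
        return "probably Japanese"
    if hangul / len(chars) >= threshold:
        return "probably Korean"
    if han:
        return "Han only (Chinese, or too short to tell)"
    return "unknown"

print(guess_cjk_language("これは日本語のテキストです"))
print(guess_cjk_language("한국어 텍스트입니다"))
print(guess_cjk_language("这是中文"))
```

It says nothing about simplified versus traditional, which is exactly the part this question is about.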
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
1. Characters that are traditional only.
2. Characters that are simplified only.
3. Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduce that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
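In code, the logic and its limit look roughly like this. The two dictionaries are a tiny hand-made excerpt containing only the pairs mentioned above, not a parse of the real Unihan files, and classify and its labels are hypothetical names for the sake of the example.

```python
# Illustrative excerpt only: real data would come from the Unihan variants file.
K_SIMPLIFIED_VARIANT = {"麵": "面", "韓": "韩"}    # traditional -> simplified
K_TRADITIONAL_VARIANT = {"面": "麵", "韩": "韓"}   # simplified -> traditional

def classify(ch):
    """Classify one character as far as the variant fields alone allow."""
    has_simplified = ch in K_SIMPLIFIED_VARIANT
    has_traditional = ch in K_TRADITIONAL_VARIANT
    if has_simplified and not has_traditional:
        return "traditional only"
    if has_traditional:
        # Cannot tell a real simplified form (韩) from a shared one (面).
        return "simplified or shared (ambiguous)"
    return "no variant data"

for ch in "麵面韩韓人":
    print(ch, classify(ch))
```

Both 面 and 韩 land in the same ambiguous bucket, which is the breakage described above.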
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.