What is a practical application of having a combining character representation of a symbol in Unicode when a single code point mapping to the symbol will alone suffice?
What programming/non-programming advantage does it give us?
There is no particular programming advantage in using a decomposed presentation (base character and combining character) when a precomposed presentation exists, e.g. using U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT instead of U+00E9 LATIN SMALL LETTER E WITH ACUTE “é”. Such decomposed presentations are something that needs to be dealt with in programming, part of the problem, rather than an advantage. So it’s similar to asking about the benefits of having the letter U in the character code.
The reasons why decomposed presentations (or the letter U) are used in actual data and need to be handled are external to programming and hence off-topic at SO.
Decomposing all decomposable characters may have advantages in processing, as it makes the data more uniform, canonical. This would relate to some particular features of the processing needed, and it would be implemented by performing (with a library routine, usually) normalization to NFD or NFKD form. But this would normally be part of the processing, not something imposed on input format. If some string matching is performed, it is mostly desirable to treat decomposed and precomposed representations of a character as equivalent, and normalization makes this easy. But this is a way of dealing with the two different representations, not a cause for their existence, and it can equally well be performed by normalizing to NFC (i.e., precomposing everything that can be precomposed). See the Unicode FAQ section Normalization.
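As a minimal sketch of that kind of matching, assuming Python and its standard unicodedata module (the helper name canonically_equal is made up for illustration), both strings are normalized to the same form before they are compared:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render alike but differ code point by code point.
raw_equal = precomposed == decomposed

def canonically_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

nfc_equal = canonically_equal(precomposed, decomposed, "NFC")
nfd_equal = canonically_equal(precomposed, decomposed, "NFD")

# NFD splits the precomposed character into base letter plus mark.
nfd_length = len(unicodedata.normalize("NFD", precomposed))
```

Either form works for the comparison; what matters is that both sides go through the same one.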
Decomposed components are better for text editing, and they may (though not definitely) give a better compression ratio.
When editing text, there are times when a user wants to modify only the accent mark, but precomposed characters do not allow partial modifications. At other times, users may want to change the base character without removing the accent. Those kinds of editing are easier with decomposed characters.
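A rough sketch of such partial edits, assuming Python's unicodedata module; the helpers replace_base and replace_mark are hypothetical names, and they only handle a single character made of one base letter plus combining marks:

```python
import unicodedata

def replace_base(char: str, new_base: str) -> str:
    """Swap the base letter of one accented character, keeping its marks."""
    decomposed = unicodedata.normalize("NFD", char)
    # Everything after the base with a non-zero combining class is a mark.
    marks = "".join(c for c in decomposed[1:] if unicodedata.combining(c))
    return unicodedata.normalize("NFC", new_base + marks)

def replace_mark(char: str, old_mark: str, new_mark: str) -> str:
    """Swap one combining mark for another, keeping the base letter."""
    decomposed = unicodedata.normalize("NFD", char)
    return unicodedata.normalize("NFC", decomposed.replace(old_mark, new_mark))

a_acute = replace_base("\u00e9", "a")                  # é becomes á
e_grave = replace_mark("\u00e9", "\u0301", "\u0300")   # é becomes è
```

Both helpers decompose first, edit one component, and recompose, which is the step a precomposed-only representation cannot offer directly.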
As for compression ratio, this made more sense in the days of separate encodings per language. Back then, an 8-bit encoding per language allowed each language to have its own character set. Some languages compress better with decomposed characters. The small space of 8 bits meant that an encoding could fit only so many unique code points, so it used variable-width sequences built from decomposed characters.
Related
I am trying to represent Devanagari characters on a screen, but the dev environment I'm programming in has no Unicode support. So, to write characters, I use binary matrices to color the corresponding pixels on the screen. I sorted these matrices according to Unicode order. For languages that use the Latin alphabet I had no issues: I only needed to write the characters one after the other to represent a string. For Devanagari characters it's different.
In the Devanagari script, some characters, when placed next to others, can completely change the appearance of the word itself, both in the order and in the appearance of the characters. The result is considered a single character, but when read as Unicode it is actually 2 distinct characters.
This merging sometimes occurs in a simple way:
क + ् = क्
ग + ् = ग्
फ + ि = फि
But other times you get completely different characters:
क + ् + क = क्क
ग + ् + घ = ग्घ
क + ् + ष = क्ष
I found several papers describing the complex grammatical rules that determine how these characters merge (https://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf), but the more I look into it, the more I realize that I would need to learn Hindi to understand those rules and then create an algorithm.
I would like to understand the principles behind these character combinations without necessarily having to learn the Hindi language. I wonder if anyone before me has already solved this problem or found an alternative solution and would like to share it with me.
Whether Devanagari text is encoded using Unicode or ISCII, display of the text requires a combination of shaping engine and font data that maps a string of characters into an appropriate sequence of positioned glyphs. The set of glyphs needed for Devanagari will be a fair bit larger than the initial set of characters.
The shaping steps involve an analysis of clusters, re-ordering of certain elements within clusters, substitution of glyphs, and finally positioning adjustments to the glyphs. Consider this example:
क + ् + क + ि = क्कि
Cluster analysis is needed to recognize elements against a general cluster pattern: for example, which character is the "base" consonant within the cluster, which are additional consonants that conjoin to it, which are vowels, and what type each vowel is with regard to visual positioning. In that sequence, the <ka, virama, ka> sequence forms a base that vowels or other marks are positioned relative to. The second ka is the "base" consonant, and the initial <ka, virama> sequence conjoins as a "half" form. The short-i vowel is one that needs to be re-positioned to the left of the conjoined-consonant combination.
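To make the example concrete, here is a small sketch in Python (standard unicodedata module only) that lists the code points in that cluster and then applies a toy re-ordering step. The function toy_visual_order and the set PRE_BASE_VOWEL_SIGNS are made up for illustration and cover only the short-i vowel sign; a real shaping engine handles far more cases and also does glyph substitution and positioning.

```python
import unicodedata

# ka, virama, ka, vowel sign i, in logical (stored) order
cluster = "\u0915\u094d\u0915\u093f"

# Show what each stored code point is.
for ch in cluster:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

categories = [unicodedata.category(ch) for ch in cluster]

# Assumption for this toy: only the short-i vowel sign is drawn before the base.
PRE_BASE_VOWEL_SIGNS = {"\u093f"}

def toy_visual_order(cluster: str) -> str:
    """Move pre-base vowel signs to the front of a single cluster."""
    pre = [ch for ch in cluster if ch in PRE_BASE_VOWEL_SIGNS]
    rest = [ch for ch in cluster if ch not in PRE_BASE_VOWEL_SIGNS]
    return "".join(pre + rest)

visual = toy_visual_order(cluster)
```

The re-ordered string is only an intermediate step inside a shaper; it is not something to store or display as text.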
The Devanagari section in the Unicode Standard describes in a general way some of the actions that will be needed in display, but it's not a specific implementation guide.
The OpenType font specification supports display of scripts like Devanagari through a combination of "OpenType Layout" data in the font plus shaping implementations that interact with that data. You can find documentation specifically for Devanagari font implementations here:
https://learn.microsoft.com/en-us/typography/script-development/devanagari
You might also find helpful the specification for the "Universal Shaping Engine" that several implementations use (in combination with OpenType fonts) for shaping many different scripts:
https://learn.microsoft.com/en-us/typography/script-development/use
You don't necessarily need to use OpenType, but you will want some implementation with the functionality I've described. If you're running in a specific embedded OS environment that isn't, say, Windows IoT, you evidently can't take advantage of the OpenType shaping support built into Windows or other major OS platforms. But perhaps you could take advantage of HarfBuzz, which is an open-source OpenType shaping library:
https://github.com/harfbuzz/harfbuzz
This would need to be combined with Devanagari fonts that have appropriate OpenType Layout data, and there are plenty of those, including OSS options (e.g., Noto Sans Devanagari).
I totally understand the necessity of encoding integrals and brackets in pieces (2320 2321 239B-23AE), since it helps in building large notations.
But for the large summation pieces 23B2 23B3, if one stretches these two, they will still lose their shapes. I do not understand the logic behind splitting this character, or why there is no corresponding pair for the product symbol 220F. Furthermore, I wonder in which cases these two symbols should be used.
This doesn't answer the question fully, but the following paragraph from Unicode Technical Report #28: Unicode 3.2 is probably the most authoritative explanation you can find:
Symbol Pieces. [to follow “APL Functional Symbols”] The characters in the range U+239B..U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to make up extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.
This is followed by a table showing how pieces are intended to be used together to create specific symbols.
Unicode mostly standardises existing character repertoires, and keeps their peculiarities so that conversions round-trip properly. A corresponding product symbol is not part of Unicode because the originating character repertoire did not have one. Ask on https://www.unicode.org/consortium/distlist-unicode.html about the provenance of the summation top/bottom.
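A quick way to check which pieces exist is to look names up in the Unicode Character Database. The sketch below assumes Python's unicodedata module; the name "PRODUCT TOP" is a made-up guess used only to show that the lookup fails.

```python
import unicodedata

# Names of the integral and summation pieces, plus the whole product sign.
names = {
    cp: unicodedata.name(chr(cp))
    for cp in (0x2320, 0x2321, 0x23B2, 0x23B3, 0x220F)
}

def has_character(name: str) -> bool:
    """Return True if the database knows a character with this exact name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False

summation_top_exists = has_character("SUMMATION TOP")
product_top_exists = has_character("PRODUCT TOP")  # hypothetical name
```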
I am looking for large symbols in unicode like these:
∏ ∐ ∑ ∫
⨀ ⨁ ⨂
⊕ ⊖ ⊗ ⊘ ⊙
⎲
⎳
⌠
⌡
The only one I found is made by combining two Unicode symbols, ⎲ and ⎳. Not sure why that exists but a large product symbol does not. That's all I am really looking for (∏ over multiple lines, like the sigma). If any of the other ones exist over 2 lines, that would be great to know as well. Perhaps there is some way to manually make the large ∏ symbol out of smaller primitives.
⎲ and ⎳. Not sure why that exists
When a collection of existing glyphs is added to Unicode, it is desirable to make encoding between character sets round-trip safe. So glyphs that are duplicates or variants of each other are kept anyway.
As of Unicode 10, these are the available forms of the Greek letter pi (and characters that compatibility-decompose to it): ∏Ππϖᴨℼℿ There are no top and bottom halves as there are for the integral and the summation.
You should not attempt to build a glyph piecewise from other glyphs shifted into position. (You said "primitives", but Unicode does not work that way.) The result is not accessible and somewhat likely to break in rendering on systems other than yours.
The correct solution is to use the ∏ glyph and simply scale up its font size. Look into MathML if you are using only ad-hoc notation so far.
ASCII has versions of the whole Roman alphabet. I was surprised recently to learn that Unicode contains other version/s of those same characters. One example is "U+1D5C4: MATHEMATICAL SANS-SERIF SMALL K", or "𝗄".
Can't LaTeX math mode, or MS Word equation editor, or whatever other program just use a sans-serif font if it wants the letters in a mathematical formula to be sans-serif?
These characters exist so that the semantic distinction between them can be encoded in plain text, or where the specific font shape can't be controlled.
The block you mention is only intended for use in mathematical and technical contexts, where the distinction between, say, 𝑑 as a variable vs. d as a differential operator vs. 𝖽 as an object (in category theory) is important. TR #25 gives another example where losing the distinction between ℋ and H can completely change the meaning of an equation. Being able to encode this formatting into the text itself is also important for ISO 31-11.
All of these characters maintain compatibility mappings with their "normal" Latin and Greek counterparts, so the distinction between them should not affect searching and sorting.
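The compatibility mapping can be inspected directly. This is a small sketch assuming Python's unicodedata module: compatibility normalization (NFKC) folds the mathematical letter to plain "k", while canonical normalization (NFC) leaves it alone.

```python
import unicodedata

math_k = "\U0001d5c4"  # MATHEMATICAL SANS-SERIF SMALL K

name = unicodedata.name(math_k)
mapping = unicodedata.decomposition(math_k)      # compatibility mapping, tagged <font>
folded = unicodedata.normalize("NFKC", math_k)   # folds to the ordinary letter
unchanged = unicodedata.normalize("NFC", math_k) == math_k

def search_key(text: str) -> str:
    """A simple search key: compatibility-fold, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()

same_key = search_key(math_k) == search_key("K")
```

A search index built on such a key treats 𝗄 and k alike, at the cost of discarding the mathematical distinction for matching purposes.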
You are confusing the display mode with the encoding for texts.
The idea is that Unicode has ALL the symbols known to mankind that are used for writing, grouped by usage. That's why you will find many code points that look alike.
So a k in a formula is supposed to be different from a k in a word. The sans-serif part is just a description of the kind of k best used to display it. Tomorrow somebody might want to add a serif k, and then how would you describe the difference?
What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?
The Unicode Standard, Chapter 3, D52
Combining character: A character with the General Category of Combining Mark (M).
Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner. The combining character is said to apply to that base character.
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
The Unicode Standard, Chapter 3, D59
Grapheme extender: A character with the property Grapheme_Extend.
Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.
The difference in actual usage is that combining characters are defined as a General Category for rough classification of characters and grapheme extenders are mainly used for UAX #29 text segmentation.
EDIT: Since you offered a bounty, I can elaborate a bit.
Combining characters are characters that can't be used as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29: Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:
Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.
The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (spacing mark), it used to be seen as a separate user-perceived character and forms a new (legacy) grapheme cluster. In later versions of UAX #29, this was changed and extended grapheme clusters were introduced where most but not all spacing marks don't break a cluster.
I think the key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see property Other_Grapheme_Extend).
Example
U+0995 BENGALI LETTER KA:
ক
U+09C0 BENGALI VOWEL SIGN II (combining character, but no grapheme extender):
ী
Combination of the two:
কী
This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself, but it still counts as a legacy grapheme cluster. A text editor, for example, could allow the cursor to be placed between the two characters.
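The "combining character" half of the distinction can be checked with Python's standard unicodedata module, as in the sketch below. The helper is_combining_character is a made-up name that applies definition D52 (General Category M). The standard module does not expose the Grapheme_Extend property, so that half is only noted in comments, based on the D59 text quoted above.

```python
import unicodedata

ka = "\u0995"     # BENGALI LETTER KA
ii = "\u09c0"     # BENGALI VOWEL SIGN II
acute = "\u0301"  # COMBINING ACUTE ACCENT
zwnj = "\u200c"   # ZERO WIDTH NON-JOINER

def is_combining_character(ch: str) -> bool:
    """D52: General Category is a Combining Mark (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")

# The vowel sign is a spacing mark (Mc) with canonical combining class 0:
# a combining character, but per the answer above not a grapheme extender.
ii_info = (unicodedata.category(ii), unicodedata.combining(ii))

# A typical nonspacing accent has a non-zero combining class.
acute_info = (unicodedata.category(acute), unicodedata.combining(acute))

# ZWNJ is a format character (Cf), so it is not a combining character,
# although D59 as quoted above lists it among the grapheme extenders.
zwnj_is_combining = is_combining_character(zwnj)
```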
There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.
I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.
Tom Gewecke wrote:
I'm not an expert on this aspect of Unicode, but I understand that
"grapheme extender" is a finer distinction in character properties
designed to be used in certain specific and complex processes like
grapheme breaking. You might find this blog article helpful in seeing
where it comes into play:
http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html
PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.
Philippe Verdy wrote:
Many grapheme extenders are not "combining characters". Combining
characters are classified this way for legacy reasons (the very weak
"general category" property) and this property is normatively
stabilized. As well most combining characters have a non-zero
combining class and they are stabilized for the purpose of
normalization.
Grapheme extenders include characters that are also NOT combining
characters but controls (e.g. joiners). Some grapheme clusters are also
more complex in some scripts (there are extenders encoded BEFORE the
base character; and they cannot be classified as combining characters
because combining characters are always encoded AFTER a base
character)
For legacy reasons (and roundtrip compatibility with older standards)
not all scripts are encoded using the UCS character model using
combining characters. (E.g. the Thai script; not following the
"logical" encoding order; but following the model used in TIS-620 and
other standards based on it; including for Windows, and *nix/*nux).
Richard Wordingham wrote:
Spacing combining marks (category Mc) are in general not grapheme
extenders. The ones that are included are mostly included so that the
boundaries between 'legacy grapheme clusters'
http://www.unicode.org/reports/tr29/tr29-23.html are invariant under
canonical equivalence. There are six grapheme extenders that are not
nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ, U+302E HANGUL SINGLE DOT TONE MARK U+302F HANGUL DOUBLE DOT
TONE MARK U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK U+FF9F HALFWIDTH
KATAKANA SEMI-VOICED SOUND MARK
I can see that it will sometimes be helpful to ZWNJ and ZWJ along with
the previous base character. The fullwidth soundmarks U+3099 and
U+309A are included for reasons of canonical equivalence, so it makes
sense to include their halfwidth versions.
I don't actually see the logic for including U+302E and U+302F. If
you're going to encourage forcing someone who's typed the wrong base
character before a sequence of 3 non-spacing marks to retype the lot,
you may as well do the same with Hangul tone marks.
May I quote from Yannis Haralambous' Fonts and Encodings, page 116f.:
The idea is that a script or a system of notation is sometimes too
finely divided into characters. And when we have cut constructs up
into characters, there is no way to put them back together again to
rebuild larger characters. For example, Catalan has the ligature
‘ŀl’. This ligature is encoded as two Unicode characters: an ‘ŀ’
0x0140 latin small letter l with middle dot and an ordinary ‘l’. But
this division may not always be what we want.
Suppose that we wish to
place a circumflex accent over this ligature, as we might well wish to
do with the ligatures ‘œ’ and ‘æ’. How can this be done in Unicode?
To allow users to build up characters in constructs that play the rôle
of new characters, Unicode introduced three new properties (grapheme
base, grapheme extension, grapheme link) and one new character:
0x034F combining grapheme joiner.
So the way I see it, this means that grapheme extenders are used to apply (for example) accents on characters that are themselves composed of several characters.