What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?
The Unicode Standard, Chapter 3, D52
Combining character: A character with the General Category of Combining Mark (M).
Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner. The combining character is said to apply to that base character.
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
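As a rough illustration of D52, Python's standard unicodedata module exposes both the General Category and the canonical combining class. The helper name is_combining_character is made up for this sketch and is not part of any library; the sample characters are my own picks.

```python
import unicodedata

def is_combining_character(ch: str) -> bool:
    """D52 (hypothetical helper): General Category is Mn, Mc, or Me."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

samples = [
    "\u0301",  # COMBINING ACUTE ACCENT (nonspacing mark)
    "\u09c0",  # BENGALI VOWEL SIGN II (spacing mark)
    "\u20dd",  # COMBINING ENCLOSING CIRCLE (enclosing mark)
    "a",       # ordinary base letter, for contrast
]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),    # General Category
        unicodedata.combining(ch),   # canonical combining class
        is_combining_character(ch),
    )
```

The Bengali vowel sign is the interesting row: it is a combining character by category, yet its canonical combining class is zero, which is the "reverse is not the case" situation described in D52.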
The Unicode Standard, Chapter 3, D59
Grapheme extender: A character with the property Grapheme_Extend.
Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.
The difference in actual usage is that combining characters are defined as a General Category for rough classification of characters and grapheme extenders are mainly used for UAX #29 text segmentation.
EDIT: Since you offered a bounty, I can elaborate a bit.
Combining characters are characters that can't be used as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29: Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:
Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.
The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (spacing mark), it used to be seen as a separate user-perceived character and forms a new (legacy) grapheme cluster. In later versions of UAX #29, this was changed and extended grapheme clusters were introduced where most but not all spacing marks don't break a cluster.
I think the key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see property Other_Grapheme_Extend).
Example
U+0995 BENGALI LETTER KA:
ক
U+09C0 BENGALI VOWEL SIGN II (combining character, but no grapheme extender):
ী
Combination of the two:
কী
This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself but it still counts as a legacy grapheme cluster. A text editor, for example, could allow the cursor to be placed between the two characters.
There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.
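To make the Bengali example concrete, here is a minimal sketch in Python. The standard library does not expose the Grapheme_Extend property, so approx_grapheme_extend below is only an approximation built from the wording of D59 quoted above (nonspacing and enclosing marks plus ZWNJ, ZWJ, U+FF9E and U+FF9F); it leaves out the handful of spacing marks and may differ from the real property in a given Unicode version. The segmentation rule is also heavily simplified compared with UAX #29 (no CR/LF, Hangul, or control handling). All function names are hypothetical.

```python
import unicodedata

# Assumption: extras taken from the D59 text quoted above, not from PropList.txt.
EXTRA_EXTENDERS = {"\u200c", "\u200d", "\uff9e", "\uff9f"}

def is_combining(ch: str) -> bool:
    """Combining character per D52: General Category Mn, Mc, or Me."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def approx_grapheme_extend(ch: str) -> bool:
    """Approximation of Grapheme_Extend; omits the few spacing marks."""
    return unicodedata.category(ch) in ("Mn", "Me") or ch in EXTRA_EXTENDERS

def legacy_clusters(text: str) -> list[str]:
    """Simplified rule: an (approximate) extender attaches to the
    preceding cluster; every other character starts a new cluster."""
    clusters: list[str] = []
    for ch in text:
        if clusters and approx_grapheme_extend(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

ka_ii = "\u0995\u09c0"   # BENGALI LETTER KA + BENGALI VOWEL SIGN II
e_acute = "e\u0301"      # e + COMBINING ACUTE ACCENT

print(len(legacy_clusters(ka_ii)))    # vowel sign starts its own cluster
print(len(legacy_clusters(e_acute)))  # accent stays with its base
```

Under these assumptions the Bengali pair splits into two clusters while "e" plus the acute accent stays together, which matches the description above. For real segmentation you would want a library that implements UAX #29 in full.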
I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.
Tom Gewecke wrote:
I'm not an expert on this aspect of Unicode, but I understand that
"grapheme extender" is a finer distinction in character properties
designed to be used in certain specific and complex processes like
grapheme breaking. You might find this blog article helpful in seeing
where it comes into play:
http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html
PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.
Philippe Verdy wrote:
Many grapheme extenders are not "combining characters". Combining
characters are classified this way for legacy reasons (the very weak
"general category" property) and this property is normatively
stabilized. As well most combining characters have a non-zero
combining class and they are stabilized for the purpose of
normalization.
Grapheme extenders include characters that are also NOT combining
characters but controls (e.g. joiners). Some grapheme clusters are also
more complex in some scripts (there are extenders encoded BEFORE the
base character; and they cannot be classified as combining characters
because combining characters are always encoded AFTER a base
character)
For legacy reasons (and roundtrip compatibility with older standards)
not all scripts are encoded using the UCS character model using
combining characters. (E.g. the Thai script; not following the
"logical" encoding order; but following the model used in TIS-620 and
other standards based on it; including for Windows, and *nix/*nux).
Richard Wordingham wrote:
Spacing combining marks (category Mc) are in general not grapheme
extenders. The ones that are included are mostly included so that the
boundaries between 'legacy grapheme clusters'
http://www.unicode.org/reports/tr29/tr29-23.html are invariant under
canonical equivalence. There are six grapheme extenders that are not
nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ, U+302E HANGUL SINGLE DOT TONE MARK U+302F HANGUL DOUBLE DOT
TONE MARK U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK U+FF9F HALFWIDTH
KATAKANA SEMI-VOICED SOUND MARK
I can see that it will sometimes be helpful to ZWNJ and ZWJ along with
the previous base character. The fullwidth soundmarks U+3099 and
U+309A are included for reasons of canonical equivalence, so it makes
sense to include their halfwidth versions.
I don't actually see the logic for including U+302E and U+302F. If
you're going to encourage forcing someone who's typed the wrong base
character before a sequence of 3 non-spacing marks to retype the lot,
you may as well do the same with Hangul tone marks.
May I quote from Yannis Haralambous' Fonts and Encodings, page 116f.:
The idea is that a script or a system of notation is sometimes too
finely divided into characters. And when we have cut constructs up
into characters, there is no way to put them back together again to
rebuild larger characters. For example, Catalan has the ligature
‘ŀl’. This ligature is encoded as two Unicode characters: an ‘ŀ’
0x0140 latin small letter l with middle dot and an ordinary ‘l’. But
this division may not always be what we want.
Suppose that we wish to
place a circumflex accent over this ligature, as we might well wish to
do with the ligatures ‘œ’ and ‘æ’. How can this be done in Unicode?
To allow users to build up characters in constructs that play the rôle
of new characters, Unicode introduced three new properties (grapheme
base, grapheme extension, grapheme link) and one new character:
0x034F combining grapheme joiner.
So the way I see it, this means that grapheme extenders are used to apply (for example) accents on characters that are themselves composed of several characters.
Related
I am trying to represent Devanagari characters on a screen, but the dev environment I'm programming in has no Unicode support. So, to write characters, I use binary matrices to color the corresponding pixels on the screen. I sorted these matrices according to Unicode order. For languages that use the Latin alphabet I had no issues: I only needed to write the characters one after the other to represent a string. For Devanagari characters it's different.
In the Devanagari script some characters, when placed next to others, can completely change the appearance of the word itself, both in the order and in the appearance of the characters. The result is considered a single character, but when read as Unicode it is actually 2 or more distinct characters.
This merging sometimes occurs in a simple way:
क + ् = क्
ग + ् = ग्
फ + ि = फि
But other times you get completely different characters:
क + ् + क = क्क
ग + ् + घ = ग्घ
क + ् + ष = क्ष
I found several papers describing the complex grammatical rules that determine how these characters merge (https://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf), but the more I look into it, the more I realize that I would need to learn Hindi to understand those rules and then create an algorithm.
I would like to understand the principles behind these characters combinations but without necessarily having to learn the Hindi language. I wonder if anyone before me has already solved this problem or found an alternative solution and would like to share it with me.
Whether Devanagari text is encoded using Unicode or ISCII, display of the text requires a combination of shaping engine and font data that maps a string of characters into an appropriate sequence of positioned glyphs. The set of glyphs needed for Devanagari will be a fair bit larger than the initial set of characters.
The shaping step involves an analysis of clusters, re-ordering of certain elements within clusters, substitution of glyphs, and finally positioning adjustments to the glyphs. Consider this example:
क + ् + क + ि = क्कि
The cluster analysis is needed to recognize elements against a general cluster pattern: for example, which element is the "base" consonant within the cluster, which are additional consonants that will conjoin to it, which are vowels, and how each vowel is positioned visually. In that sequence, the <ka, virama, ka> sequence will form a base that vowel or other marks are positioned relative to. The second ka is the "base" consonant and the initial <ka, virama> sequence will conjoin as a "half" form. And the short-i vowel is one that needs to be re-positioned to the left of the conjoined-consonant combination.
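To see what a shaping engine starts from, it can help to dump the stored code points of that example. This small Python sketch uses only the standard unicodedata module; it does no shaping at all, it just shows the logical (in-memory) order.

```python
import unicodedata

# ka, virama, ka, vowel sign i: the example cluster from above
cluster = "\u0915\u094d\u0915\u093f"

for ch in cluster:
    # code point, General Category, character name
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```

Note that the vowel sign i is the last code point in memory even though it is drawn to the left of the conjunct. That re-ordering is exactly the kind of work the shaping step has to do, so drawing one bitmap per code point in stored order cannot give the right result.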
The Devanagari section in the Unicode Standard describes in a general way some of the actions that will be needed in display, but it's not a specific implementation guide.
The OpenType font specification supports display of scripts like Devanagari through a combination of "OpenType Layout" data in the font plus shaping implementations that interact with that data. You can find documentation specifically for Devanagari font implementations here:
https://learn.microsoft.com/en-us/typography/script-development/devanagari
You might also find helpful the specification for the "Universal Shaping Engine" that several implementations use (in combination with OpenType fonts) for shaping many different scripts:
https://learn.microsoft.com/en-us/typography/script-development/use
You don't necessarily need to use OpenType, but you will want some implementation with the functionality I've described. If you're running in a specific embedded OS environment that isn't, say, Windows IOT, you evidently can't take advantage of the OpenType shaping support built into Windows or other major OS platforms. But perhaps you could take advantage of Harfbuzz, which is an open-source OpenType shaping library:
https://github.com/harfbuzz/harfbuzz
This would need to be combined with Devanagari fonts that have appropriate OpenType Layout data, and there are plenty of those, including OSS options (e.g., Noto Sans Devanagari).
What is the difference between zero-width space (U+200B) and zero-width non-joiner (U+200C) from practical point of view?
I have already read Wikipedia articles, but I can't understand if these characters are interchangeable or not.
I think they are completely interchangeable, but then I can't understand why we have two in Unicode set instead of one.
A zero-width non-joiner is almost non-existing. Its only purpose is to split things into two. For example, 123 zero-width-non-joiner 456 is two numbers with nothing in between.
A zero-width space is a space character, just a very very narrow one. For example 123 zero-width-space 456 is two numbers with a space character in between.
A zero width non-joiner (ZWNJ) only interrupts ligatures. These are hard to notice in the Latin alphabet but are most frequent in serif fonts displaying some specific combinations of lowercase letters. There are a few scripts, such as the Arabic abjad, that use ligatures very prominently.
A zero width space (ZWSP) does everything a ZWNJ does, but it also creates opportunities for line breaks. Very good for displaying file paths and long URLs, but beware that it might screw up copy pasting.
By the way, I tested regular expression matching in Python 3.8 and JavaScript 1.5, and in neither of them do these two characters match \s. Unicode classifies them as format characters (similar to direction markers and such) rather than as spaces or punctuation. Other code points in the same Unicode block (e.g. Thin Space, U+2009) are classified as spaces by Unicode and do match \s.
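The Python side of that test can be reproduced with a few lines. This is a minimal sketch using the standard re and unicodedata modules; describe is a made-up helper name.

```python
import re
import unicodedata

def describe(ch: str) -> tuple[str, bool]:
    """Return (General Category, whether the character matches \\s)."""
    return unicodedata.category(ch), re.fullmatch(r"\s", ch) is not None

for label, ch in [
    ("ZERO WIDTH SPACE", "\u200b"),
    ("ZERO WIDTH NON-JOINER", "\u200c"),
    ("THIN SPACE", "\u2009"),
]:
    print(label, describe(ch))
```

Both zero-width characters come back with category Cf (format) and no \s match, while the thin space is Zs (space separator) and does match.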
What is a practical application of having a combining character representation of a symbol in Unicode when a single code point mapping to the symbol will alone suffice?
What programming/non-programming advantage does it give us?
There is no particular programming advantage in using a decomposed presentation (base character and combining character) when a precomposed presentation exists, e.g. using U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT instead of U+00E9 LATIN SMALL LETTER E WITH ACUTE “é”. Such decomposed presentations are something that needs to be dealt with in programming, part of the problem, rather than an advantage. So it’s similar to asking about the benefits of having the letter U in the character code.
The reasons why decomposed presentations (or the letter U) are used in actual data and need to be handled are external to programming and hence off-topic at SO.
Decomposing all decomposable characters may have advantages in processing, as it makes the data more uniform, canonical. This would relate to some particular features of the processing needed, and it would be implemented by performing (with a library routine, usually) normalization to NFD or NFKD form. But this would normally be part of the processing, not something imposed on input format. If some string matching is performed, it is mostly desirable to treat decomposed and precomposed representations of a character as equivalent, and normalization makes this easy. But this is a way of dealing with the two different representations, not a cause for their existence, and it can equally well be performed by normalizing to NFC (i.e., precompose everything that can be precomposed). See the Unicode FAQ section Normalization.
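As a small illustration of the matching point, here is a sketch using Python's standard unicodedata.normalize. The helper name equivalent is made up for this example.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(precomposed == decomposed)                   # raw comparison
print(equivalent(precomposed, decomposed))         # compare as NFC
print(equivalent(precomposed, decomposed, "NFD"))  # compare as NFD
```

The raw comparison fails because the code point sequences differ, while both normalized comparisons succeed. Which form you pick does not matter for matching, as long as both sides use the same one.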
Decomposed characters are better for text editing, and they may (though not necessarily) give a better compression ratio.
When editing text, there are times when the user wants to modify only the accent mark, but precomposed characters ("precomposed" is not a word according to the Firefox spell check) do not allow partial modifications. At other times, users may want to change the base character without removing the accent. Both kinds of editing are easier with decomposed characters.
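A rough sketch of that kind of partial edit in Python: decompose, change one piece, recompose. The function names replace_base and replace_mark are hypothetical, and the code assumes the input is a single user-perceived character whose first code point after NFD is the base.

```python
import unicodedata

def replace_base(char: str, new_base: str) -> str:
    """Swap the base letter but keep whatever marks follow it."""
    decomposed = unicodedata.normalize("NFD", char)
    return unicodedata.normalize("NFC", new_base + decomposed[1:])

def replace_mark(char: str, old_mark: str, new_mark: str) -> str:
    """Swap one combining mark but keep the base letter."""
    decomposed = unicodedata.normalize("NFD", char)
    return unicodedata.normalize("NFC", decomposed.replace(old_mark, new_mark))

e_acute = "\u00e9"
print(replace_base(e_acute, "a"))                 # a with acute
print(replace_mark(e_acute, "\u0301", "\u0300"))  # e with grave
```

With only the precomposed code point there is nothing to swap; the decomposed form is what makes the two halves separately addressable.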
As for the compression ratio, it made more sense in the days of separate encodings per language. Back then, an 8-bit encoding per language allowed each language to have its own character set. Some languages compress better with decomposed characters: the small 8-bit space could only fit so many unique code points, so those encodings used variable-width sequences built from decomposed characters.
The arrangement of the characters that can be used as super-/subscript letters seems completely chaotic. Most of them are obviously not meant to be used as sup/subscr. letters, but even those which are do not hint a very reasonable ordering. In Unicode 6.0 there is now at last an alphabetically-ordered subset of the subscript letters h-t in U+2095 through U+209C, but this was obviously rather squeezed into the remaining space in the block and encompasses less than 1/3 of all letters.
Why did the consortium not just allocate enough space for at least one sup and one subscript alphabet in lower case?
The disorganization in the arrangement of these characters is because they were encoded piecemeal as scripts that used them were encoded, and as round-trip compatibility with other character sets was added. Chapter 15 of the Unicode Standard has some discussion of their origins: for example superscript digits 1 to 3 were in ISO Latin-1 while the others were encoded to support the MARC-8 bibliographic character set (see table here); and U+2071 SUPERSCRIPT LATIN SMALL LETTER I and U+207F SUPERSCRIPT LATIN SMALL LETTER N were encoded to support the Uralic Phonetic Alphabet.
The Unicode Consortium have a general policy of not encoding characters unless there's some evidence that people are using the characters to make semantic distinctions that require encoding. So characters won't be encoded just to complete the set, or to make things look neat.
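The piecemeal encoding is easy to see for the superscript digits. This sketch looks each one up by its character name with Python's standard unicodedata.lookup and prints where it ended up.

```python
import unicodedata

DIGIT_NAMES = ["ZERO", "ONE", "TWO", "THREE", "FOUR",
               "FIVE", "SIX", "SEVEN", "EIGHT", "NINE"]

# Look up SUPERSCRIPT ZERO .. SUPERSCRIPT NINE by name
superscripts = [unicodedata.lookup(f"SUPERSCRIPT {name}") for name in DIGIT_NAMES]

for digit, ch in enumerate(superscripts):
    print(digit, f"U+{ord(ch):04X}")
```

Superscript 1, 2 and 3 sit in the Latin-1 range (U+00B9, U+00B2, U+00B3), while 0 and 4 to 9 are in the U+2070 block, so the digits are not even contiguous.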
What are the difficulties inherent in ASCII and Extended ASCII and how these difficulties are overcome by Unicode?
Can someone explain Unicode compatibility to me?
And what do the terms associated with Unicode mean, like Planes, Basic Multilingual Plane (BMP), Supplementary Multilingual Plane (SMP), Supplementary Ideographic Plane (SIP), Supplementary Special-purpose Plane (SSP) and Private Use Planes (PUP)?
I have found all these words very confusing.
ASCII
ASCII was more or less the first character encoding ever. In the days when a byte was very expensive and 1MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as on the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both lowercase and uppercase), the digits (0-9), the punctuation characters (space, dot, comma, colon, etcetera) and some special characters (the at sign, the sharp sign, the dollar sign, etcetera). All those characters fill up the space of 7 bits, half of the room a byte provides, with a total of 128 characters.
Extended ASCII and ISO 8859
Later the remaining bit of the byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the remaining room is used by special characters, such as diacritical characters and line drawing characters. But because everyone used the remaining room their own way (IBM, Commodore, universities, organizations, etcetera), it was not interchangeable. Characters which were originally encoded using encoding X will show up as Mojibake when they are decoded using a different encoding Y. Later ISO came up with standard character encoding definitions for 8-bit ASCII extensions, resulting in the well-known ISO 8859 character encoding standards built on top of ASCII, such as ISO 8859-1, so that text is more easily interchangeable.
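Mojibake is easy to reproduce with Python's built-in codecs. In this sketch the word "café" and the choice of code page 437 (the original IBM PC code page) as "encoding Y" are just my picks for illustration.

```python
text = "café"

# Encode with ISO 8859-1, then decode with a different 8-bit code page
raw = text.encode("latin-1")
as_cp437 = raw.decode("cp437")

# The classic modern variant: UTF-8 bytes read as ISO 8859-1
utf8_as_latin1 = text.encode("utf-8").decode("latin-1")

print(raw)             # one byte per character, the last one above 0x7F
print(as_cp437)        # last character comes out wrong
print(utf8_as_latin1)  # the two UTF-8 bytes of é show up as two characters
```

The ASCII letters survive every round trip because all of these encodings agree on the lower 128 values; only the byte above 0x7F is reinterpreted.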
Unicode
8 bits may be enough for the languages using the Latin alphabet, but it is certainly not enough for any one of the remaining non-Latin scripts and languages in the world, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera, let alone for all of them together. They developed their own non-ISO character encodings which were, again, not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character encoding standard built on top of ISO 8859-1 was established to cover all of the characters used in the world so that text is interchangeable everywhere: Unicode. It provides room for over a million characters, of which currently about 10% is filled. The UTF-8 character encoding is based on Unicode.
Unicode Planes
The Unicode characters are categorized in seventeen planes, each providing room for 65536 characters (16 bits).
Plane 0: Basic Multilingual Plane (BMP), it contains characters of all modern languages known in the world.
Plane 1: Supplementary Multilingual Plane (SMP), it contains historic languages/scripts as well as multilingual musical and mathematical symbols.
Plane 2: Supplementary Ideographic Plane (SIP), it contains "special" CJK (Chinese/Japanese/Korean) characters, of which there are quite a lot but which are very seldom used in modern writing. The "normal" CJK characters are already present in the BMP.
Planes 3-13: unused.
Plane 14: Supplementary Special-purpose Plane (SSP), so far it contains only some tag characters and glyph variation selectors. The tag characters are currently deprecated and may be removed in the future. The glyph variation selectors are to be used as a kind of metadata which you add to existing characters and which in turn can instruct the reader to give the character a slightly different glyph.
Planes 15-16: Private Use Planes (PUP), they provide room for (major) organizations or user initiatives to include their own special characters or symbols in the standard so that it is interchangeable everywhere. For example Emoji (Japanese-style smileys/emoticons).
Usually, you would only be interested in the BMP and in using UTF-8 as the standard character encoding throughout your entire application.
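Since each plane holds 65536 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch in Python; the function name plane and the sample code points are my own choices.

```python
def plane(code_point: int) -> int:
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane spans 0x10000 code points

samples = {
    "LATIN CAPITAL LETTER A": ord("A"),
    "GRINNING FACE": 0x1F600,
    "first code point of plane 2": 0x20000,
    "VARIATION SELECTOR-17": 0xE0100,
    "last code point": 0x10FFFF,
}

for label, cp in samples.items():
    print(f"U+{cp:04X} plane {plane(cp)}  {label}")

# Total room: 17 planes of 65536 code points each
print(17 * 0x10000)
```

This also shows where the "over a million" figure comes from: 17 × 65536 = 1,114,112 code points. UTF-8 can encode all of them; characters outside the BMP just take four bytes each.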