What is the difference between zero-width space (U+200B) and zero-width non-joiner (U+200C) from practical point of view?
I have already read Wikipedia articles, but I can't understand if these characters are interchangeable or not.
I think they are completely interchangeable, but then I can't understand why we have two in Unicode set instead of one.
A zero-width non-joiner is barely a character at all. Its only purpose is to split things in two. For example, 123 zero-width-non-joiner 456 is two numbers with nothing in between.
A zero-width space is a space character, just a very, very narrow one. For example, 123 zero-width-space 456 is two numbers with a space character in between.
A zero-width non-joiner (ZWNJ) only interrupts ligatures and cursive joining. Ligatures are hard to notice in the Latin alphabet, where they mostly appear in serif fonts for a few specific combinations of lowercase letters. Some scripts, such as Arabic (an abjad), use joining and ligatures very prominently.
A zero-width space (ZWSP) does everything a ZWNJ does, but it also creates a line-break opportunity. That makes it very good for displaying file paths and long URLs, but beware that the invisible character survives copy and paste and can break whatever the text is pasted into.
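As a minimal sketch of the file-path use case, the helpers below insert a ZWSP after every separator for display and strip it again before the text is reused. The function names are made up for this illustration, and whether a given renderer actually wraps at the ZWSP is up to that renderer.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def add_break_opportunities(path, separator="/"):
    """Insert a ZWSP after each separator so a renderer may wrap there."""
    return path.replace(separator, separator + ZWSP)

def strip_zwsp(text):
    """Remove ZWSPs again, e.g. before using copied text as a real path."""
    return text.replace(ZWSP, "")

path = "/usr/local/share/doc"
shown = add_break_opportunities(path)

# The displayed string looks identical but is longer than the real path.
print(len(path), len(shown))
print(strip_zwsp(shown) == path)
```

Stripping on the way back in is what guards against the copy-paste problem mentioned above.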
By the way, I tested regular expression matching in Python 3.8 and JavaScript 1.5, and in neither of them does \s match these two characters. Unicode classifies them as format characters (General Category Cf, like the directional marks) rather than as spaces or punctuation. Other code points in the same Unicode block (e.g. THIN SPACE, U+2009) are classified as spaces by Unicode and do match \s.
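The Python side of that test can be reproduced with the standard library alone. This is a small sketch covering only Python's re module, not JavaScript; the describe helper is just a name chosen for the example.

```python
import re
import unicodedata

# Code points discussed above; THIN SPACE is included for contrast.
CHARS = {
    "ZWSP": "\u200b",
    "ZWNJ": "\u200c",
    "THIN SPACE": "\u2009",
}

def describe(ch):
    """Return (General Category, whether the regex class \\s matches ch)."""
    return unicodedata.category(ch), re.fullmatch(r"\s", ch) is not None

for name, ch in CHARS.items():
    print(name, describe(ch))
```

The General Category is what drives the result: Cf for the two zero-width characters, Zs for the thin space.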
I understand why the integral and the brackets are encoded in pieces (U+2320, U+2321, U+239B-U+23AE): it helps in building large notation.
But the pieces of the large summation (U+23B2, U+23B3) will still lose their shape if one stretches them. I do not understand the logic behind splitting this character, or why there is no corresponding pair for the product symbol U+220F. Furthermore, I wonder in which cases these two symbols should be used.
This doesn't answer the question fully, but the following paragraph from Unicode Technical Report #28: Unicode 3.2 is probably the most authoritative explanation you can find:
Symbol Pieces. [to follow “APL Functional Symbols”] The characters in the range U+239B..U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to make up extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.
This is followed by a table showing how pieces are intended to be used together to create specific symbols.
Unicode mostly standardises existing character repertoires, and keeps their peculiarities so that conversions round-trip properly. A corresponding product symbol is not part of Unicode because the originating character repertoire did not have one. Ask on https://www.unicode.org/consortium/distlist-unicode.html about the provenance of the summation top/bottom.
ASCII contains the whole Roman alphabet. I was surprised recently to learn that Unicode contains other versions of those same characters. One example is "U+1D5C4: MATHEMATICAL SANS-SERIF SMALL K", or "𝗄".
Can't LaTeX math mode, or MS Word equation editor, or whatever other program just use a sans-serif font if it wants the letters in a mathematical formula to be sans-serif?
These characters exist so that the semantic distinction between them can be encoded in plain text, or where the specific font shape can't be controlled.
The block you mention is only intended for use in mathematical and technical contexts, where the distinction between, say, 𝑑 as a variable vs. d as a differential operator vs. 𝖽 as an object (in category theory) is important. TR #25 gives another example where losing the distinction between ℋ and H can completely change the meaning of an equation. Being able to encode this formatting into the text itself is also important for ISO 31-11.
All of these characters maintain compatibility mappings with their "normal" Latin and Greek counterparts, so the distinction between them should not affect searching and sorting.
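A short sketch of what those compatibility mappings look like in practice, using Python's unicodedata module. The fold helper is a name invented for this example; searching only ignores the distinction if the search code normalizes first, for instance with NFKC as assumed here.

```python
import unicodedata

sans_k = "\U0001D5C4"    # MATHEMATICAL SANS-SERIF SMALL K
script_h = "\u210b"      # SCRIPT CAPITAL H

def fold(text):
    """Apply compatibility normalization (NFKC) before comparing or searching."""
    return unicodedata.normalize("NFKC", text)

for ch in (sans_k, script_h):
    # decomposition() shows the <font> compatibility mapping to the plain letter.
    print(unicodedata.name(ch), "|", unicodedata.decomposition(ch), "|", fold(ch))
```

Note that normalizing this way throws the semantic distinction away, so it belongs in the search path rather than in the stored text.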
You are confusing the display mode with the encoding of text.
The idea is that Unicode covers ALL the symbols known to mankind that are used for writing, grouped by usage. That's why you will find many code points that look alike.
So a k in a formula is supposed to be different from a k in a word. The sans-serif part is just a description of the kind of k best used to display it. Tomorrow somebody might want to add a serif k, and then how would you describe the difference?
What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?
They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction?
The Unicode Standard, Chapter 3, D52
Combining character: A character with the General Category of Combining Mark (M).
Combining characters consist of all characters with the General Category values of Spacing Combining Mark (Mc), Nonspacing Mark (Mn), and Enclosing Mark (Me).
All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero canonical combining class.
The interpretation of private-use characters (Co) as combining characters or not is determined by the implementation.
These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
The graphic positioning of a combining character depends on the last preceding base character, unless they are separated by a character that is neither a combining character nor either zero width joiner or zero width non-joiner. The combining character is said to apply to that base character.
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character—for example, a carriage return, tab, or right-left mark. In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
The representative images of combining characters are depicted with a dotted circle in the code charts. When presented in graphical combination with a preceding base character, that base character is intended to appear in the position occupied by the dotted circle.
The Unicode Standard, Chapter 3, D59
Grapheme extender: A character with the property Grapheme_Extend.
Grapheme extender characters consist of all nonspacing marks, zero width joiner, zero width non-joiner, U+FF9E, U+FF9F, and a small number of spacing marks.
A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character.
zero width joiner and zero width non-joiner are formally defined to be grapheme extenders so that their presence does not break up a sequence of other grapheme extenders.
The small number of spacing marks that have the property Grapheme_Extend are all the second parts of a two-part combining mark.
The set of characters with the Grapheme_Extend property and the set of characters with the Grapheme_Base property are disjoint, by definition.
The difference in actual usage is that combining characters are defined as a General Category for rough classification of characters and grapheme extenders are mainly used for UAX #29 text segmentation.
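The D52 side of this can be checked directly in Python, as sketched below. The helper name is made up for the example, and the standard library does not expose the Grapheme_Extend property, so only the "combining character" half is shown.

```python
import unicodedata

SAMPLES = [
    "\u0301",  # COMBINING ACUTE ACCENT (Mn)
    "\u09c0",  # BENGALI VOWEL SIGN II (Mc)
    "\u20dd",  # COMBINING ENCLOSING CIRCLE (Me)
    "\u200c",  # ZERO WIDTH NON-JOINER (Cf, not a combining character)
]

def is_combining_character(ch):
    """D52: a character whose General Category is Mn, Mc or Me."""
    return unicodedata.category(ch) in {"Mn", "Mc", "Me"}

for ch in SAMPLES:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.combining(ch),   # canonical combining class
        is_combining_character(ch),
    )
```

The Bengali vowel sign and the enclosing circle illustrate the remark in D52 that a combining character can have a canonical combining class of zero.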
EDIT: Since you offered a bounty, I can elaborate a bit.
Combining characters are characters that can't be used as stand-alone characters but must be combined with another character. They're used to define combining character sequences.
Grapheme extenders were introduced in Unicode 3.2 to be used in Unicode Technical Report #29: Text Boundaries (then in a proposed status, now known as Unicode Standard Annex #29: Unicode Text Segmentation). The main use is to define grapheme clusters. Grapheme clusters are basically user-perceived characters. According to UAX #29:
Grapheme cluster boundaries are important for collation, regular expressions, UI interactions (such as mouse selection, arrow key movement, backspacing), segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text.
The main difference is that grapheme extenders don't include most of the spacing marks (the set is actually smaller than the set of combining characters). Most of the spacing marks are vowel signs for Asian scripts. In these scripts, vowels are sometimes written by modifying a consonant character. If this modification takes up horizontal space (spacing mark), it used to be seen as a separate user-perceived character and forms a new (legacy) grapheme cluster. In later versions of UAX #29, this was changed and extended grapheme clusters were introduced where most but not all spacing marks don't break a cluster.
I think the key sentence from the standard is: "A grapheme extender can be conceived of primarily as the kind of nonspacing graphical mark that is applied above or below another spacing character." Combining characters, on the other hand, also include spacing marks that are applied to the left or right. There are a few exceptions, though (see property Other_Grapheme_Extend).
Example
U+0995 BENGALI LETTER KA:
ক
U+09C0 BENGALI VOWEL SIGN II (combining character, but no grapheme extender):
ী
Combination of the two:
কী
This is a single combining character sequence consisting of two legacy grapheme clusters. The vowel sign can't be used by itself but it still counts as a legacy grapheme cluster. A text editor, for example, could allow the cursor to be placed between the two characters.
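To see the two code points behind that example, here is a small sketch in Python. It only lists code points and their General Categories; the standard library has no grapheme cluster segmentation, so the cluster boundaries themselves are not computed here. The code_points helper is a name chosen for the illustration.

```python
import unicodedata

ka_ii = "\u0995\u09c0"  # BENGALI LETTER KA followed by BENGALI VOWEL SIGN II

def code_points(text):
    """List (code point, General Category, name) for each character in text."""
    return [
        (f"U+{ord(c):04X}", unicodedata.category(c), unicodedata.name(c))
        for c in text
    ]

for entry in code_points(ka_ii):
    print(entry)
```

The base letter is Lo and the vowel sign is Mc, i.e. a spacing combining mark, which is why it is a combining character without being a grapheme extender.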
There are over 300 combining characters like this which do not extend graphemes, and four characters which are not combining but do extend graphemes.
I’ve posted this question on the Unicode mailing list and got some more responses. I’ll post some of them here.
Tom Gewecke wrote:
I'm not an expert on this aspect of Unicode, but I understand that
"grapheme extender" is a finer distinction in character properties
designed to be used in certain specific and complex processes like
grapheme breaking. You might find this blog article helpful in seeing
where it comes into play:
http://useless-factor.blogspot.com/2007/08/unicode-implementers-guide-part-4.html
PS The answer by nwellnhof at StackOverflow is an excellent explanation of this issue in my view.
Philippe Verdy wrote:
Many grapheme extenders are not "combining characters". Combining
characters are classified this way for legacy reasons (the very weak
"general category" property) and this property is normatively
stabilized. As well most combining characters have a non-zero
combining class and they are stabilized for the purpose of
normalization.
Grapheme extenders include characters that are also NOT combining
characters but controls (e.g. joiners). Some grapheme clusters are also
more complex in some scripts (there are extenders encoded BEFORE the
base character; and they cannot be classified as combining characters
because combining characters are always encoded AFTER a base
character)
For legacy reasons (and roundtrip compatibility with older standards)
not all scripts are encoded using the UCS character model using
combining characters. (E.g. the Thai script; not following the
"logical" encoding order; but following the model used in TIS-620 and
other standards based on it; including for Windows, and *nix/*nux).
Richard Wordingham wrote:
Spacing combining marks (category Mc) are in general not grapheme
extenders. The ones that are included are mostly included so that the
boundaries between 'legacy grapheme clusters'
http://www.unicode.org/reports/tr29/tr29-23.html are invariant under
canonical equivalence. There are six grapheme extenders that are not
nonspacing (Mn) or enclosing (Me) and are not needed by this rule:
ZWNJ, ZWJ, U+302E HANGUL SINGLE DOT TONE MARK U+302F HANGUL DOUBLE DOT
TONE MARK U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK U+FF9F HALFWIDTH
KATAKANA SEMI-VOICED SOUND MARK
I can see that it will sometimes be helpful to ZWNJ and ZWJ along with
the previous base character. The fullwidth soundmarks U+3099 and
U+309A are included for reasons of canonical equivalence, so it makes
sense to include their halfwidth versions.
I don't actually see the logic for including U+302E and U+302F. If
you're going to encourage forcing someone who's typed the wrong base
character before a sequence of 3 non-spacing marks to retype the lot,
you may as well do the same with Hangul tone marks.
May I quote from Yannis Haralambous' Fonts and Encodings, page 116f.:
The idea is that a script or a system of notation is sometimes too
finely divided into characters. And when we have cut constructs up
into characters, there is no way to put them back together again to
rebuild larger characters. For example, Catalan has the ligature
‘ŀl’. This ligature is encoded as two Unicode characters: an ‘ŀ’
0x0140 latin small letter l with middle dot and an ordinary ‘l’. But
this division may not always be what we want.
Suppose that we wish to
place a circumflex accent over this ligature, as we might well wish to
do with the ligatures ‘œ’ and ‘æ’. How can this be done in Unicode?
To allow users to build up characters in constructs that play the rôle
of new characters, Unicode introduced three new properties (grapheme
base, grapheme extension, grapheme link) and one new character:
0x034F combining grapheme joiner.
So the way I see it, this means that grapheme extenders are used to apply (for example) accents on characters that are themselves composed of several characters.
The arrangement of the characters that can be used as superscript or subscript letters seems completely chaotic. Most of them are obviously not meant to be used as super/subscript letters, but even those which are do not follow any reasonable ordering. Unicode 6.0 at last added an alphabetically ordered subset of subscript letters, h through t, at U+2095 through U+209C, but it was obviously squeezed into the remaining space in the block and covers less than a third of the alphabet.
Why did the consortium not just allocate enough space for at least one sup and one subscript alphabet in lower case?
The disorganization in the arrangement of these characters is because they were encoded piecemeal as scripts that used them were encoded, and as round-trip compatibility with other character sets was added. Chapter 15 of the Unicode Standard has some discussion of their origins: for example, superscript digits 1 to 3 were in ISO Latin-1 while the others were encoded to support the MARC-8 bibliographic character set; and U+2071 SUPERSCRIPT LATIN SMALL LETTER I and U+207F SUPERSCRIPT LATIN SMALL LETTER N were encoded to support the Uralic Phonetic Alphabet.
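The piecemeal encoding is easy to see by listing a few of these characters with their code points, as in the sketch below. The selection of characters and the plain helper are choices made for this example; the <super> compatibility mappings come from the Unicode Character Database as exposed by Python's unicodedata.

```python
import unicodedata

# Superscript 1, 2, 3, 4, i and n, in that logical order.
SUPERSCRIPTS = "\u00b9\u00b2\u00b3\u2074\u2071\u207f"

def plain(ch):
    """Map a superscript character to its ordinary counterpart via NFKC."""
    return unicodedata.normalize("NFKC", ch)

for ch in SUPERSCRIPTS:
    print(f"U+{ord(ch):04X}", unicodedata.decomposition(ch), plain(ch))
```

Superscript one sits after superscript two and three in the Latin-1 range, and superscript four onwards live in a different block altogether.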
The Unicode Consortium have a general policy of not encoding characters unless there's some evidence that people are using the characters to make semantic distinctions that require encoding. So characters won't be encoded just to complete the set, or to make things look neat.
What are the difficulties inherent in ASCII and Extended ASCII, and how are these difficulties overcome by Unicode?
Can someone explain Unicode compatibility to me?
And what do the terms associated with Unicode mean, like Planes, Basic Multilingual Plane (BMP), Supplementary Multilingual Plane (SMP), Supplementary Ideographic Plane (SIP), Supplementary Special-purpose Plane (SSP) and Private Use Planes (PUP)?
I have found all these terms very confusing.
ASCII
ASCII was more or less the first widely adopted character encoding. In the days when a byte was very expensive and 1 MHz was extremely fast, only the characters which appeared on those ancient US typewriters (as well as on the average US International keyboard nowadays) were covered by the charset of the ASCII character encoding. This includes the complete Latin alphabet (A-Z, in both lowercase and uppercase), the digits (0-9), punctuation (space, dot, comma, colon, etcetera), some special characters (the at sign, the sharp sign, the dollar sign, etcetera) and a set of control characters. All those characters fill up the space of 7 bits, half of the room a byte provides, with a total of 128 characters.
Extended ASCII and ISO 8859
Later the remaining bit of the byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the extra room is used by special characters, such as letters with diacritics and line drawing characters. But because everyone used the extra room in their own way (IBM, Commodore, universities, organizations, etcetera), text was not interchangeable. Characters which were originally encoded using encoding X show up as mojibake when they are decoded using a different encoding Y. Later ISO came up with standard character encoding definitions for 8-bit ASCII extensions, resulting in the ISO 8859 family of character encodings built on top of ASCII, such as ISO 8859-1, so that text became more interchangeable.
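A minimal sketch of how mojibake arises, using two members of the ISO 8859 family as stand-ins for "encoding X" and "encoding Y". The choice of ISO 8859-1 and ISO 8859-5 and of the sample word is arbitrary; any two 8-bit encodings that disagree in the upper half would do.

```python
original = "café"

# Encoding X: ISO 8859-1 stores each of these characters in one byte.
raw = original.encode("iso-8859-1")

# Decoding with the right encoding gives the text back...
right = raw.decode("iso-8859-1")

# ...while encoding Y (ISO 8859-5, Cyrillic) reads the same bytes differently.
wrong = raw.decode("iso-8859-5")

print(raw)
print(right)
print(wrong)
```

The ASCII part of the word survives because both encodings agree on the lower 128 values; only the byte above 0x7F is misread.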
Unicode
8 bits may be enough for languages using the Latin alphabet, but it is certainly not enough for the world's other languages and scripts, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera, let alone for all of them together. Their users developed their own non-ISO character encodings, which were, again, not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character encoding standard was established to cover all characters used in the world so that text is interchangeable everywhere: Unicode. Its first 256 code points coincide with ISO 8859-1. It provides room for over a million characters, of which about 10% was filled at the time of writing. UTF-8 is one of the encodings defined for Unicode.
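Two of these claims can be checked quickly in Python, as sketched below: that the first 256 Unicode code points line up with ISO 8859-1, and that UTF-8 spends a variable number of bytes per character. The utf8_length helper and the sample characters are choices made for this illustration.

```python
import sys

# Every possible byte value, decoded as ISO 8859-1...
as_latin1 = bytes(range(256)).decode("iso-8859-1")
# ...compared with the first 256 Unicode code points.
as_code_points = "".join(chr(n) for n in range(256))
print(as_latin1 == as_code_points)

# Size of the Unicode code space (code points 0 .. 0x10FFFF).
print(sys.maxunicode + 1)

def utf8_length(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

for ch in ("A", "é", "中", "\U0001D5C4"):
    print(f"U+{ord(ch):04X}", utf8_length(ch))
```

ASCII characters stay one byte each in UTF-8, which is why plain ASCII text is already valid UTF-8.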
Unicode Planes
The Unicode characters are categorized in seventeen planes, each providing room for 65536 characters (16 bits).
Plane 0: Basic Multilingual Plane (BMP), it contains the characters of almost all modern languages in the world.
Plane 1: Supplementary Multilingual Plane (SMP), it contains historic languages/scripts as well as multilingual musical and mathematical symbols.
Plane 2: Supplementary Ideographic Plane (SIP), it contains "special" CJK (Chinese/Japanese/Korean) characters, of which there are quite a lot but which are very seldom used in modern writing. The "normal" CJK characters are already present in the BMP.
Planes 3-13: unused at the time of writing.
Plane 14: Supplementary Special-purpose Plane (SSP), so far it contains only some tag characters and glyph variation selectors. The tag characters were deprecated at the time of writing, although Unicode's stability policy means that encoded characters are never removed. The variation selectors are a kind of metadata which you add after an existing character to ask the renderer for a slightly different glyph.
Planes 15-16: Private Use Planes (PUP), they provide room for (major) organizations or user initiatives to define their own special characters or symbols. These assignments are not part of the standard, so they are only interchangeable between parties that agree on them. For example, emoji (Japanese-style smileys/emoticons) were exchanged through private-use code points before they were standardized.
Usually, you would only be interested in the BMP and would use UTF-8 as the standard character encoding throughout your entire application.
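Since each plane holds 65536 code points, the plane of a character is simply its code point divided by 0x10000. The sketch below shows that arithmetic; the plane helper and the sample characters are picked for illustration only.

```python
PLANE_SIZE = 0x10000   # 65536 code points per plane
PLANE_COUNT = 17

def plane(ch):
    """Return the number of the plane that a character's code point falls in."""
    return ord(ch) // PLANE_SIZE

SAMPLES = {
    "LATIN CAPITAL LETTER A": "A",
    "MATHEMATICAL SANS-SERIF SMALL K": "\U0001D5C4",
    "first code point of plane 2": "\U00020000",
    "VARIATION SELECTOR-17": "\U000E0100",
    "first private-use code point of plane 15": "\U000F0000",
}

for label, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X}", "plane", plane(ch), label)

# Total size of the code space: 17 planes of 65536 code points each.
print(PLANE_COUNT * PLANE_SIZE)
```

The sans-serif k from the earlier question is an example of a character outside the BMP.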