Why does Unicode have multiple characters representing the same letter?

ASCII has versions of the whole Roman alphabet. I was surprised recently to learn that Unicode contains other versions of those same characters. One example is "U+1D5C4: MATHEMATICAL SANS-SERIF SMALL K", or "𝗄".
Can't LaTeX math mode, or MS Word equation editor, or whatever other program just use a sans-serif font if it wants the letters in a mathematical formula to be sans-serif?

These characters exist so that the semantic distinction between them can be encoded in plain text, including in contexts where the font cannot be controlled.
The block you mention is only intended for use in mathematical and technical contexts, where the distinction between, say, 𝑑 as a variable vs. d as a differential operator vs. 𝖽 as an object (in category theory) is important. TR #25 gives another example where losing the distinction between ℋ and H can completely change the meaning of an equation. Being able to encode this formatting into the text itself is also important for ISO 31-11.
All of these characters maintain compatibility mappings with their "normal" Latin and Greek counterparts, so the distinction between them should not affect searching and sorting.
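The compatibility mapping mentioned above can be inspected with Python's standard `unicodedata` module. This is only a minimal sketch; the variable names are illustrative, and real search or collation code would usually go through a dedicated library rather than normalizing by hand.

```python
import unicodedata

sans_k = "\U0001D5C4"  # MATHEMATICAL SANS-SERIF SMALL K

# The character has its own name and code point...
print(unicodedata.name(sans_k))

# ...but carries a <font> compatibility decomposition to plain "k" (U+006B).
print(unicodedata.decomposition(sans_k))

# NFKC normalization applies that mapping, which is one way a search
# routine can treat the styled letter like an ordinary "k".
folded = unicodedata.normalize("NFKC", sans_k)
print(folded == "k")
```

Note that folding like this discards exactly the distinction the mathematical characters were added to preserve, so it belongs in matching code, not in stored text.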

You are confusing how text is displayed with how it is encoded.
The idea is that Unicode contains all the symbols known to be used in writing, grouped by usage. That's why you will find many code points that look alike.
So a k in a formula is supposed to be different from a k in a word. The sans-serif part is just a description of the kind of k best used to display it. Tomorrow somebody might want to add a serif k, and then how would you describe the difference?

Related

How can I draw devanagari characters on a screen?

I am trying to represent Devanagari characters on a screen, but the development environment I'm programming in has no Unicode support. So, to write characters, I use binary matrices to color the corresponding pixels on the screen. I sorted these matrices according to Unicode order. For the languages that use the Latin alphabet I had no issues: I only needed to draw the characters one after the other to represent a string. For Devanagari characters it's different.
In the Devanagari script, some characters, when placed next to others, can completely change the appearance of the word itself, both in the order and in the shape of the characters. The result is considered a single character, but when read as Unicode it is actually 2 distinct characters.
This merging sometimes occurs in a simple way:
क + ् = क्
ग + ् = ग्
फ + ि = फि
But other times you get completely different characters:
क + ् + क = क्क
ग + ् + घ = ग्घ
क + ् + ष = क्ष
I found several papers describing the complex grammatical rules that determine how these characters merge (https://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf), but the more I look into it, the more I realize that I would need to learn Hindi to understand those rules and then create an algorithm.
I would like to understand the principles behind these character combinations without necessarily having to learn Hindi. I wonder if anyone before me has already solved this problem or found an alternative solution and would like to share it with me.
Whether Devanagari text is encoded using Unicode or ISCII, display of the text requires a combination of shaping engine and font data that maps a string of characters into an appropriate sequence of positioned glyphs. The set of glyphs needed for Devanagari will be a fair bit larger than the initial set of characters.
The shaping step involves an analysis of clusters, re-ordering of certain elements within clusters, substitution of glyphs, and finally positioning adjustments to the glyphs. Consider this example:
क + ् + क + ि = क्कि
The cluster analysis is needed to recognize elements against a general cluster pattern: e.g., which character is the "base" consonant within the cluster, which are additional consonants that will conjoin to it, which are vowels, and what type each vowel is with regard to visual positioning. In that sequence, the <ka, virama, ka> sequence will form a base that vowel or other marks are positioned relative to. The second ka is the "base" consonant and the initial <ka, virama> sequence will conjoin as a "half" form. And the short-i vowel is one that needs to be re-positioned to the left of the conjoined-consonant combination.
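To make the difference between stored order and display order concrete, here is a toy sketch in Python. It is not a shaping engine: the function `visual_order` is a hypothetical helper that handles only the single case described above (a trailing short-i vowel sign moved to the front of one cluster) and ignores half forms, reph, other vowel signs and all font data.

```python
import unicodedata

VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn to the left of its base


def visual_order(cluster: str) -> list[str]:
    """Toy reordering for one cluster: move a trailing short-i sign first.

    A real shaper would also pick half forms and conjunct glyphs from the font.
    """
    chars = list(cluster)
    if chars and chars[-1] == VOWEL_SIGN_I:
        return [VOWEL_SIGN_I] + chars[:-1]
    return chars


cluster = "\u0915\u094D\u0915\u093F"  # ka, virama, ka, vowel sign i

# Logical (stored) order: the vowel sign comes last.
for ch in cluster:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Display order in this toy model: the vowel sign comes first.
reordered = visual_order(cluster)
print([f"U+{ord(ch):04X}" for ch in reordered])
```

Even in this reduced form, the point is that the lookup into a glyph table cannot be done character by character in stored order; the cluster has to be identified first.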
The Devanagari section in the Unicode Standard describes in a general way some of the actions that will be needed in display, but it's not a specific implementation guide.
The OpenType font specification supports display of scripts like Devanagari through a combination of "OpenType Layout" data in the font plus shaping implementations that interact with that data. You can find documentation specifically for Devanagari font implementations here:
https://learn.microsoft.com/en-us/typography/script-development/devanagari
You might also find helpful the specification for the "Universal Shaping Engine" that several implementations use (in combination with OpenType fonts) for shaping many different scripts:
https://learn.microsoft.com/en-us/typography/script-development/use
You don't necessarily need to use OpenType, but you will want some implementation with the functionality I've described. If you're running in a specific embedded OS environment that isn't, say, Windows IoT, you evidently can't take advantage of the OpenType shaping support built into Windows or other major OS platforms. But perhaps you could take advantage of HarfBuzz, which is an open-source OpenType shaping library:
https://github.com/harfbuzz/harfbuzz
This would need to be combined with Devanagari fonts that have appropriate OpenType Layout data, and there are plenty of those, including OSS options (e.g., Noto Sans Devanagari).

Why summation top and bottom ⎲⎳?

I totally understand the need for integral and bracket pieces (2320 2321 239B-23AE), since they help in building large notations.
But for the large summation pieces 23B2 23B3, if one stretches these two, they will still lose their shape. I do not understand the logic behind splitting this character, or why there is no corresponding pair for the product symbol 220F. Furthermore, I wonder in which cases these two symbols should be used.
This doesn't answer the question fully, but the following paragraph from Unicode Technical Report #28: Unicode 3.2 is probably the most authoritative explanation you can find:
Symbol Pieces. [to follow “APL Functional Symbols”] The characters in the range U+239B..U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to make up extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.
This is followed by a table showing how pieces are intended to be used together to create specific symbols.
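As a rough illustration of the "build up the symbol from pieces" technique the report describes, the sketch below stacks the left-parenthesis pieces into a fence of a requested height. The function `tall_left_paren` is a hypothetical helper, and the output is the kind of thing a display driver might emit; per the quoted text, it is not something to store in mathematical text.

```python
import unicodedata

UPPER_HOOK = "\u239B"  # LEFT PARENTHESIS UPPER HOOK
EXTENSION = "\u239C"   # LEFT PARENTHESIS EXTENSION
LOWER_HOOK = "\u239D"  # LEFT PARENTHESIS LOWER HOOK


def tall_left_paren(height: int) -> list[str]:
    """Return one piece per text row for a left parenthesis `height` rows tall."""
    if height < 2:
        raise ValueError("need at least an upper and a lower hook")
    return [UPPER_HOOK] + [EXTENSION] * (height - 2) + [LOWER_HOOK]


rows = tall_left_paren(4)
print("\n".join(rows))

# The summation pieces from the question sit at the end of the same range.
for cp in (0x23B2, 0x23B3):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

The parenthesis has an extension piece that can be repeated, which is why it scales; the summation has only a top and a bottom, so it can be built at exactly one size.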
Unicode mostly standardises existing character repertoires, and keeps their peculiarities so that conversions round-trip properly. A corresponding product symbol is not part of Unicode because the originating character repertoire did not have one. Ask on https://www.unicode.org/consortium/distlist-unicode.html about the provenance of the summation top/bottom.

Large product ∏ symbol in unicode

I am looking for large symbols in unicode like these:
∏ ∐ ∑ ∫
⨀ ⨁ ⨂
⊕ ⊖ ⊗ ⊘ ⊙
⎲
⎳
⌠
⌑
The only one I found is made by combining two Unicode symbols, ⎲ and ⎳. I'm not sure why that exists when there is no large product symbol. That's all I am really looking for (∏ over multiple lines, like the sigma). If any of the other ones exist over 2 lines, that would be great to know as well. Perhaps there is some way to manually make the large ∏ symbol out of smaller primitives.
"⎲ and ⎳. Not sure why that exists"
When a collection of existing glyphs is added to Unicode, it is desirable to make encoding between character sets round-trip safe. So glyphs that are duplicates or variants of each other are kept anyway.
As of Unicode 10, these are the variants of the Greek letter pi (and its compatibility decompositions) available: ∏ Π π ϖ ᴨ ℼ ℿ. There are no top and bottom halves like those for the integral and summation.
You should not attempt to build a glyph piecewise from other glyphs shifted into position. (You said "primitives", but Unicode does not work that way.) The result is not accessible and somewhat likely to break in rendering on systems other than yours.
The correct solution is to use the ∏ glyph and simply scale up its font size. Look into MathML if you are using only ad-hoc notation so far.

Why are there holes in the Unicode table?

Given this area of the Unicode table, for instance:
...
𝑎 U+1D44E Dec:119886 MATHEMATICAL ITALIC SMALL A 𝑎
𝑏 U+1D44F Dec:119887 MATHEMATICAL ITALIC SMALL B 𝑏
𝑐 U+1D450 Dec:119888 MATHEMATICAL ITALIC SMALL C 𝑐
𝑑 U+1D451 Dec:119889 MATHEMATICAL ITALIC SMALL D 𝑑
𝑒 U+1D452 Dec:119890 MATHEMATICAL ITALIC SMALL E 𝑒
𝑓 U+1D453 Dec:119891 MATHEMATICAL ITALIC SMALL F 𝑓
𝑔 U+1D454 Dec:119892 MATHEMATICAL ITALIC SMALL G 𝑔
𝑖 U+1D456 Dec:119894 MATHEMATICAL ITALIC SMALL I 𝑖 # what?!
𝑗 U+1D457 Dec:119895 MATHEMATICAL ITALIC SMALL J 𝑗
𝑘 U+1D458 Dec:119896 MATHEMATICAL ITALIC SMALL K 𝑘
𝑙 U+1D459 Dec:119897 MATHEMATICAL ITALIC SMALL L 𝑙
𝑚 U+1D45A Dec:119898 MATHEMATICAL ITALIC SMALL M 𝑚
𝑛 U+1D45B Dec:119899 MATHEMATICAL ITALIC SMALL N 𝑛
𝑜 U+1D45C Dec:119900 MATHEMATICAL ITALIC SMALL O 𝑜
...
I would naturally expect U+1D455 to be MATHEMATICAL ITALIC SMALL H, but it does not seem to be defined in any table I have looked at.
Why are there holes in the Unicode table? (Also U+1D49D, U+1D53A, etc.)
Is there any way I can fill them?
[EDIT]: These links do state:
The "holes" in the alphabetic ranges are filled by previously defined characters in the Letter like Symbols block shown below.
and
The Unicode Consortium adds new codepoints to the standard all the time. Visit their website to find out about pending codepoints and whether this one is in the pipe. The following table shows typical representations of how the codepoint would look, if it existed. This may help you when debugging, but is not of real use otherwise.
But I just... don't understand what they mean :\
From the comments (cheers guys), I have learnt that these holes are due to some characters being already assigned in Unicode when the whole alphabet had been added.
For instance: before the U+1D4* MATHEMATICAL ITALIC SMALL * identifiers were defined, ℎ was already known in the table as
ℎ U+210E Dec:008462 PLANCK CONSTANT &planckh; # here it is
So, in order to keep the numbering consistent but NOT duplicate ℎ, a hole was left at position U+1D455.
Similarly, ℬ is known as U+212C SCRIPT CAPITAL B rather than U+1D49D, which is reserved in the MATHEMATICAL SCRIPT CAPITAL letters family.
Similarly, ℂ from the MATHEMATICAL DOUBLE-STRUCK CAPITAL letters family is not U+1D53A because it was already known as U+2102 DOUBLE-STRUCK CAPITAL C.
This was a difficult choice, having to deal with retro-compatibility, consistency and reliability altogether :)
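In code, "filling" a hole amounts to redirecting the reserved code point to the character that was encoded earlier. The following is a minimal sketch using Python's `unicodedata`; the `HOLES` table and the `math_italic_small` helper are hypothetical names, and only the one hole discussed above is listed.

```python
import unicodedata

ITALIC_SMALL_A = 0x1D44E  # MATHEMATICAL ITALIC SMALL A

# Hypothetical patch table: reserved code point -> earlier Letterlike Symbol.
HOLES = {0x1D455: "\u210E"}  # italic small h -> PLANCK CONSTANT


def math_italic_small(letter: str) -> str:
    """Map one ASCII lowercase letter to its mathematical italic form."""
    if len(letter) != 1 or not ("a" <= letter <= "z"):
        raise ValueError("expected one ASCII lowercase letter")
    cp = ITALIC_SMALL_A + ord(letter) - ord("a")
    return HOLES.get(cp, chr(cp))


# The hole itself has no name because no character is assigned there.
print(unicodedata.name(chr(0x1D455), "<unassigned>"))

# The replacement is the pre-existing character, which maps back to "h".
print(unicodedata.name(math_italic_small("h")))
print(unicodedata.decomposition("\u210E"))
```

The other holes (for example U+1D49D and U+1D53A) could be added to the same table with their Letterlike Symbols counterparts.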
First of all, sorry for necroposting, but I believe that if I ended up here through a Google search where it was the first or second result, many other people might, too, and they will be as confused as I was.
I don't have a final answer, but I wanted to point out that iago-lito's answer is not completely right. It seems to be a legitimate mistake, whether from the Unicode Consortium, the operating systems I've used to check, or the typeface designers. Well, at least in the case of that specific h: there is the ℎ that's used for the Planck constant, but there is no glyph that would fit what we would consider the mathematical italic small h, that is, a regular-width italic serif lowercase h.
My suspicion is that, at the time, most people used serif typefaces everywhere, as Times New Roman is both the default typeface for LaTeX and for many scientific writing guides, such as APA, not to mention browsers, which usually have Times New Roman as the default serif and default typeface. So it could be that the Planck constant h was always rendered as serif, but now, since we use sans-serif typefaces, it's displayed as sans-serif, and there seems to be no way to get a proper, regular-weight serif small letter h. Bear in mind that the Planck constant code point doesn't have a specific glyph; font files just "redirect" the code point to the glyph of whatever letter h they use, so that's why I think that's a possibility, even if it doesn't make much sense when you think about it.
It's also important to note that many characters have various identical versions throughout Unicode, and, in fact, there is the entire sans-serif alphabet between 0x1D5A0 (mathematical sans-serif capital a) and 0x1D5D3 (mathematical sans-serif small z), so it's puzzling why they decided not to add this one letter, though people have speculated that it's because of how 'famous' the other one was, and you do want backwards compatibility. But that doesn't answer it for me, as that actually wouldn't break compatibility. It would just mean that they used the wrong one, and now there is a right one.
Of course, I'm not entirely sure it is a problem in the Unicode Consortium's standard. It could be a mistake in the typeface; maybe the typefaces should have used a serif h as Planck's constant. But this seems to be wide-spread regardless of font file, and, at the very least, there isn't clarity on what typeface designers should have done.
I have now submitted a request for information to the Unicode Consortium as to whether they plan to add the letter. Hopefully, they will add it, as the code point does exist. At least they were this smart.
Meanwhile, you can use the mathematical bold italic small h, 𝒉, whose code point is U+1D489 and which can be written in HTML as &#x1D489;. That's all for now, at least.

What is the need of combining characters in Unicode?

What is a practical application of having a combining character representation of a symbol in Unicode when a single code point mapping to the symbol will alone suffice?
What programming/non-programming advantage does it give us?
There is no particular programming advantage in using a decomposed presentation (base character and combining character) when a precomposed presentation exists, e.g. using U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT instead of U+00E9 LATIN SMALL LETTER E WITH ACUTE “é”. Such decomposed presentations are something that needs to be dealt with in programming, part of the problem, rather than an advantage. So it's similar to asking about the benefits of having the letter U in the character code.
The reasons why decomposed presentations (or the letter U) are used in actual data and need to be handled are external to programming and hence off-topic at SO.
Decomposing all decomposable characters may have advantages in processing, as it makes the data more uniform, canonical. This would relate to some particular features of the processing needed, and it would be implemented by performing (with a library routine, usually) normalization to NFD or NFKD form. But this would normally be part of the processing, not something imposed on the input format. If some string matching is performed, it is mostly desirable to treat decomposed and precomposed representations of a character as equivalent, and normalization makes this easy. But this is a way of dealing with the two different representations, not a cause for their existence, and it can equally well be performed by normalizing to NFC (i.e., precomposing everything that can be precomposed). See the Unicode FAQ section Normalization.
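A short sketch of that matching step, using Python's standard `unicodedata.normalize` as the library routine; the helper `same_text` is an illustrative name, not part of any library.

```python
import unicodedata

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The raw code point sequences differ, so a naive comparison fails.
print(precomposed == decomposed)


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Either form works, as long as both sides use the same one.
print(same_text(precomposed, decomposed))
print(same_text(precomposed, decomposed, "NFD"))
```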
The decomposed components are better for text editing and may, though not necessarily, give a better compression ratio.
When editing text, there are times when a user wants to modify only the accent mark, but precomposed characters ("precomposed" is not a word according to the Firefox spell check) do not allow partial modifications. At other times, a user may want to modify the base character without removing the accent. That kind of editing favors decomposed characters.
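The editing case can be sketched as follows: decompose, change one component, then recompose. The helper `replace_component` is a hypothetical name, and the sketch replaces every occurrence of the component in the string, which a real editor would restrict to the character under the cursor.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def replace_component(text: str, old: str, new: str) -> str:
    """Swap one base letter or combining mark, working on the NFD form."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(old, new))


# Change only the accent: e-acute becomes e-grave.
accent_changed = replace_component("caf\u00E9", ACUTE, GRAVE)
print(accent_changed)

# Change only the base letter: e-acute becomes a-acute.
base_changed = replace_component("\u00E9", "e", "a")
print(base_changed)
```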
As for compression ratio, the argument made more sense in the days of separate encodings per language. Back then, an 8-bit encoding gave each language its own character set. The small space of 8 bits meant that such an encoding could fit only so many unique code points, so some languages were stored more compactly as variable-width sequences of decomposed characters.