Large product ∏ symbol in unicode - unicode

I am looking for large symbols in unicode like these:
∏ ∐ ∑ ∫
⨀ ⨁ ⨂
⊕ ⊖ ⊗ ⊘ ⊙
⎲
⎳
⌠
⌡
The only one I found is by combining two unicode symbols ⎲and ⎳. Not sure why that exists, but not a large product symbol. That's all I am really looking for (∏ over multiple lines like the sigma). If any of the other ones exist over 2 lines that would be great to know as well. Perhaps there is some way to manually make the large ∏ symbol out of smaller primitives.

⎲and ⎳. Not sure why that exists
When a collection of existing glyphs is added to Unicode, it is desirable to make encoding between character sets round-trip safe. So glyphs that are duplicates or variants of each other are kept anyway.
As of Unicode 10, these are the greek letter pi (and its compat decompositions) available: ∏Ππϖᴨℼℿ There are no top and bottom halves like for integral and summation.
You should not attempt to build a glyph piecewise from other glyphs shifted into position. (You said "primitives", but Unicode does not work that way.) The result is not accessible and somewhat likely to break in rendering on systems other than yours.
The correct solution is to use the ∏ glyph and simply scale up its font size. Look into MathML if you are using only ad-hoc notation so far.

Related

Why summation top and bottom ⎲⎳?

I totally understand the necessity of integral, and brackets by pieces (2320 2321 239B-23AE) Since it helps building large notations.
But the for the large summation 23B3 23B4, if one stretch these two, they will still lose their shapes. I do not understand what is the logic behind separating this character, or why not a corresponding one for the product symbol 220F. Furthermore, I wonder in which case these two symbols should be used.
This doesn't answer the question fully, but the following paragraph from Unicode Technical Report #28: Unicode 3.2 is probably the most authoritative explanation you can find:
Symbol Pieces. [to follow “APL Functional Symbols”] The characters in the range U+239B..U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to make up extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.
This is followed by a table showing how pieces are intended to be used together to create specific symbols.
Unicode mostly standardises existing character repertoires, and keeps their peculiarities so that conversions round-trip properly. A corresponding product symbol is not part of Unicode because the originating character repertoire did not have one. Ask on https://www.unicode.org/consortium/distlist-unicode.html about the provenance of the summation top/bottom.

Subscripted 'y' in unicode

I have to display $CₓH\subscript{y}$.
Is there any chance to display a subscripted 'y' in Unicode?
\u2093 represents the subscripted 'x'
Usually you do this with formatting. Unicode's selection of superscript and subscript characters doesn't stem from the need or desire to cover whole alphabets but rather to enable specific use cases, e.g. writing IPA. Furthermore, if you're using a good OpenType font it can also support proper subscripts for arbitrary characters at the font level (where a glyph isn't simply scaled down by the layout engine, but rather a specifically-designed subscript glyph from the font is used).
In fact, since you're already using TeX or something vaguely similar to it, just let one of the many implementations render it. There are lots of things you simply cannot do in plain text without formatting, and this is one of them.
The subscript and superscript characters in Unicode do not cover the whole alphabet.
See the Wiki article on this topic or this answer on SO.
In Sublime Text this subscripted y works: ᵧ. Copied from here: https://lingojam.com/SubscriptGenerator
EDIT This is actually the greek letter gamma

Why does unicode multiple characters representing the same letter?

ASCII has versions of the whole Roman alphabet. I was surprised recently to learn that Unicode contains other version/s of those same characters. One example is "U+1D5C4: MATHEMATICAL SANS-SERIF SMALL K", or "𝗄".
Can't LaTeX math mode, or MS Word equation editor, or whatever other program just use a sans-serif font if it wants the letters in a mathematical formula to be sans-serif?
These characters exist so that the semantic distinction between them can be encoded in plain text, or where the specific font shape can't be controlled.
The block you mention is only intended for use in mathematical and technical contexts, where the distinction between, say, 𝑑 as a variable vs. d as a differential operator vs. 𝖽 as an object (in category theory) is important. TR #25 gives another example where losing the distinction between ℋ and H can completely change the meaning of an equation. Being able to encode this formatting into the text itself is also important for ISO 31-11.
All of these characters maintain compatibility mappings with their "normal" Latin and Greek counterparts, so the distinction between them should not affect searching and sorting.
You are confusing the display mode with the encoding for texts.
The idea is that unicode has ALL the symbols used to write known to mankind grouped by usage. That's why you will find many code-points that look alike.
So a formula with a k is different is supposed to be different then a word written with a k. The sans-serif part is just a description of the kind of k best used to display. Tomorrow somebody might want to add a serif k and then how would you describe the difference?

What is the need of combining characters in Unicode?

What is a practical application of having a combining character representation of a symbol in Unicode when a single code point mapping to the symbol will alone suffice?
What programming/non-programming advantage does it give us?
There is no particular programming advantage in using a decomposed presentation (base character and combining character) when a precomposed presentation exists, e.g. using U+0065 U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT instead of U+00E9 LATIN SMALL LETTER E WITH ACUTE “é”. Such decomposed presentations are something that needs to be dealt with in programming, part of the problem, rather than an advantage. So it’s similar to asking about the benefits of having the letter U in the character code.
The reasons why decomposed presentations (or the letter U) are used in actual data and need to be handled are external to programming and hence off-topic at SO.
Decomposing all decomposable characters may have advantages in processing, as it makes the data more uniform, canonical. This would relate to some particular features of the processing needed, and it would be implemented by performing (with a library routine, usually) normalization to NFD or NFKD form. But this would normally be part of the processing, not something imposed on input format. If some string matching is performed, it is mostly desirable to treat decomposed and precomposed representations of a character as equivalent, and normalization makes this easy. But this a way of dealing with the two different representations, not a cause for their existence, and it can equally well be performed by normalizing to NFC (i.e., precompose everything that can be precomposed). See the Unicode FAQ section Normalization.
The decomposed components are better for text editing, and "possibly but not definite" with good compression ratio.
When editing text, there are times when modifying an accent mark is wanted, but precomposed (precomposed is not a word by Firefox spell check) characters do not allow partial modifications. Sometimes, users may want to modify the base character without removing the accent. Those kinds of editing prefers using decomposed characters.
About compression ratio, it makes more sense during the days of separate encoding per language. In such times, the 8-bit encoding per language allows each language to have their own character sets. Some languages have better compression ratio for decomposed character. The small space of the 8-bits means that they could only fit so many unique code points and use variable width with decomposed characters.

Space-saving character encoding for japanese?

In my opinion a common problem: character encoding in combination with a bitmap-font. Most multi-language encodings have an huge space between different character types and even a lot of unused code points there. So if I want to use them I waste a lot of memory (not only for saving multi-byte text - i mean specially for spaces in my bitmap-font) - and VRAM is mostly really valuable... So the only reasonable thing seems to be: Using an custom mapping on my texture for i.e. UTF-8 characters (so that no space is waste). BUT: This effort seems to be same with use an own proprietary character encoding (so also own order of characters in my texture). In my specially case I got texture space for 4096 different characters and need characters to display latin languages as well as japanese (its a mess with utf-8 that only support generall cjk codepages). Had somebody ever a similiar problem (I really wonder, if not)? If theres already any approach?
Edit: The same Problem is described here http://www.tonypottier.info/Unicode_And_Japanese_Kanji/ but it doesnt provide an real solution how to save these bitmapfont mappings to utf-8 space efficent. So any further help is welcome!
Edit2:
Thank you very much for your answer. Im sorry, that my problem wasn't clear enough described.
What I really want to solve, is: The CJK Unicode range is over 20000 characters. But only a subset of around 2000 characters are necessary to display japanese text properly. These characteres are spreaded in range from U+4E00 to U+9FA5. So I need to transform these Unicode Codepoints (only the 2000 for japanese) somehow to the coordinates of my created texture (where I can order the characters also like I want).
i.e. U+4E03 is a japanese character, but U+4E04, U+4E05, U+4E06 is not. Then U+4E07 is a japanese character as well. So the easiest solution, I can see: after character U+4E03 leave three spaces in my texture (or write the not necessary characters U+4E04, U+4E05, U+4E06 there) and then write U+4E07. But this would waste soo much texture space (20000 characters, even if only 2000 are necessary). So I want to be able to put in my texture only: "...U+4E03, U+4E07...". But I have no idea how to write my displayText function then - because I cant know where are the texture coordinates of the glyph I want to display. There would be a hashmap or something like this necessary, but I have no idea how to store these data (it would be a mess to write for every character something like ...{U+4E03, 128}, {U+4E07, 129}... to fill the hasmap).
To the questions:
1) No specific format - so I will write the displayText function by myself.
2) No reason against unicode - its only that CJK range problem for my bitmapfont.
3) I think, thats generally plattform & language independent, but in my case Im using C++ with OpenGL on Mac OS X/iOS.
Thank you very much for your help! If you have any further idea for this, it would really help me a lot!
What is the real problem you want to solve?
Is it that a UTF-8 encoded string occupies three bytes per character? If yes, switch to UTF-16. Otherwise don't blame UTF-8. (Explanation: UTF-8 is just an algorithm to convert a sequence of integers to a sequence of bytes. It has nothing to do with the grouping of characters in codepages. That in turn is what Unicode code points are for.)
Is it that the Unicode code points are distributed over many "codepages" (where a "codepage" means a block of 256 adjacent Unicode code points)? If yes, invent a mapping from the Unicode code points (0x000000 - 0x10FFFF) to a smaller set of integers. In terms of memory this should cost no more than 4 bytes times the number of characters you really need. The lookup time would be approximately 24 memory accesses, 24 integer comparisons and 24 branch instructions. (In fact, this would be a binary search in a tree map.) And if that's too expensive you could use a mapping based on a hash table.
Is it something else? Then please give us some examples, to better understand your problem.
As far as I understand it you should probably write a small utility program that takes as input a set of Unicode code points that you want to use in your application and then generates The code and data for displaying texts. This raises the questions:
Do you have to use a specific bitmap font format or will you write the displayText function yourself?
Is there any reason against using Unicode for all strings and to convert them to your bitmap-optimized encoding just for the time when you render text? The encoding conversion would of course be internal to the displayText method and not visible to the normal application code.
Just out of interest: Is the problem specific to a certain programming language or environment?
Update:
I am assuming that your main problem is some function like this:
Rectangle position(int codepoint)
If I had to do this, I would start by having one bitmap for each character. The bitmap's file name would be the codepoint, so that the "big picture" can be regenerated easily, just in case you find some more characters you need. The preparation consists of the following steps:
Load all the bitmaps and determine their dimensions. The result of this step is a map from integers to (width, height) pairs.
Compute a good layout for the character images in the big picture and remember where each character was placed. Save the big picture. Save the mapping from codepoints to (x, y, width, height) to another file. This can be a text file, or if you don't have disk space, a binary file. The details don't matter.
The displayText function would then work as follows:
void displayText(int x, int y, String s) {
for (char c : s.toCharArray()) { // TODO: handle code points correctly
int codepoint = c;
Rectangle position = positions.get(codepoint);
if (position != null) {
// draw bitmap
x += position.width;
}
}
}
Map<Integer, Rectangle> positions = loadPositionsFromFile();
Now the only problem that is left is how this map can be represented in memory using as little memory as possible, and still be fast enough. That, of course, depends on your programming language.
The in-memory representation could be a few arrays that contain x, y, width, height. For each element, a 16 bit integer should be enough. And probably you only need 8 bits for width and height anyway. Another array would then map the codepoint to the index into positionData (or some special value if the codepoint is not available). This would be an array of 20000 16 bit integers, so in summary you have:
2000 * (2 + 2 + 1 + 1) = 12000 bytes for positionX, positionY, positionWidth and positionHeight
20000 * 2 = 40000 bytes for codepointToIndexInPositionArrays, if you use an array instead of a map.
Compared to the size of the bitmap itself, this should be small enough. And since the arrays don't change they can be in read-only memory.
I believe the most efficient (lossless) method for encoding this data will be to use a Huffman encoding to store your document information. This is a classic information theory problem. You will need to perform a mapping to go from your compressed space to your character space.
This technique will compress your document as efficiently as possible, based on character frequency per document (or whatever domain/documents you choose to apply it to). Only the characters you use will be stored, and they will be stored in an efficient manner directly proportional to how often they are used.
I think the best way for you to solve this problem is to use an existing implementation (UTF16, UTF8...) This will be much less error prone than implementing your own Huffman coding in order to save a little bit of space. Disk space and bandwidth is cheap, errors that anger customers or managers are not. It is my belief that a Huffman encoding will theoretically be the most efficient (lossless) encoding possible, but not the most practical for this application. Check out the link though, this might help with some of these concepts.
-Brian J. Stinar-
UTF-8 is usually a very efficient encoding. If your application focuses primarily on Asia and other regions with multi-byte character sets, you may benefit more from using UTF-16. You could of course write your own encoding, but it won't save yo that much data and it will provide you with a lot of work.
If you really need to compact your data (and I wonder if and why) you could best use some algorithm to compress you UTF data. Most algorithms work more efficient on larger blocks of data, but there are also algorithms for compressing small chunks of text. I think you will save yourself a lot of time if you explore these instead of defining your own encoding.
The paper is pretty much obsolete, it isn't 1980 any more, scrounging bits is not a requirement of almost any display application. When developing an application, e.g. the iPhone you have to plan for l10n across multiple languages so saving a few bits for just Japanese is a bit pointless.
Japan is still on Shift-JIS because like China with GB18030, Hong Kong with BIG5, etc, they have a big, stable, and efficient resource pool already locked into locale encodings. Migrating to Unicode requires re-writing a significant amount of framework tools and the additional testing that ensues.
If you look at the iPod it saves bits by only supporting Latin, Chinese, Japanese, and Korean, skipping Thai and other scripts. As memory prices and dropped and storage increased with the iPhone Apple have been able to add support for more scripts.
UTF-8 is the way to save space, use UTF-8 for storage and convert to UCS-2 or higher for more convenient manipulation and display. The differences between Shift-JIS and Unicode are really pretty minor.
Chinese alone has more than 4096 characters, and I'm not talking punctuation, but characters that are used to form words. From Wikipedia:
The number of Chinese characters contained in the Kangxi dictionary is approximately 47,035, although a large number of these are rarely used variants accumulated throughout history.
Even though many of those are rarely used, even if 90% weren't needed you'd still exhaust your quota. (I think the actual number used in modern text is somewhere around 10 - 20k.)
If you know in advance which characters you'll need to use your best bet may be to create an indirection table of Unicode codepoints to indexes into your texture. Then you only have to put as many characters in your texture as you'll actually use. I believe Flash (and some PDFs) do something like this internally.
You could use multiple bitmaps and load them on demand, instead of a single bitmap that tries to encompass all possible characters.