Using characters larger than 0xFFFF - unicode

I have an OpenType font with some optional glyphs selected by features. I've opened it in FontForge and I can see that the associated Unicode code point is, for example, 0x1002a.
Is it possible to use this value to render the glyph in iText? I've tried calling showText() with a string containing the corresponding surrogate pair ("\uD800\uDC2A"), but nothing appears.
Is there another way to do this, or am I barking up the wrong tree?
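For reference, here is a minimal sketch (iText 2.x-style API; "myfont.otf" and the class name are placeholders) of how a supplementary-plane code point is normally passed to iText: build the string from the code point and draw it with an Identity-H encoded, embedded font. Whether the glyph actually appears still depends on the font's cmap mapping that code point; a glyph that is only reachable through an OpenType feature may not be addressable by code point at all.

import com.lowagie.text.Document;
import com.lowagie.text.Font;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.pdf.PdfWriter;
import java.io.FileOutputStream;

public class SupplementaryGlyphSketch {
    public static void main(String[] args) throws Exception {
        Document doc = new Document();
        PdfWriter.getInstance(doc, new FileOutputStream("out.pdf"));
        doc.open();
        // Identity-H is needed to address glyphs beyond a single-byte encoding.
        BaseFont bf = BaseFont.createFont("myfont.otf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        // Character.toChars(0x1002A) yields the surrogate pair "\uD800\uDC2A".
        String s = new String(Character.toChars(0x1002A));
        doc.add(new Paragraph(s, new Font(bf, 12f)));
        doc.close();
    }
}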

OpenSceneGraph: how to check whether osgText::Font supports a provided character?

I am trying to check whether a font contains a glyph for a given character. To achieve this, I load the font with readFontFile() and get a glyph for the character. Then I want to check whether the glyph texture is available. I tried the following code:
osg::ref_ptr<osgText::Font> font = osgText::readFontFile("path_to_fft_font_file");
auto glyph = font->getGlyph(std::make_pair(32.0f, 32.0f), charcode);
auto texture_info = glyph->getTextureInfo(osgText::ShaderTechnique::GREYSCALE);
For every char code (both those that the font really supports and those it does not), texture_info is nullptr.
I also tried checking glyph->getTotalDataSize(). It returns a non-zero value even when the character is not supported, because the font provides a glyph for missing characters (the .notdef glyph, usually rendered as ▯).
Is there a way to check whether the osgText::Font object contains a real (non-.notdef) glyph for a given character?

TextMeshPro does not recognize subscripts and superscripts

I am replacing my current Text components with TextMeshPro, and when replacing texts containing characters such as superscripts and subscripts, the characters are not recognized.
I am aware that TextMeshPro offers the <sub> </sub> and <sup> </sup> tags to create subscripts and superscripts, but I preferred to use the characters directly, as I did with the normal Text component.
What confuses me is that some characters are recognized and others are not:
² (recognized)
⁵ (recognized)
⁶ (not recognized)
None of the subscript characters are recognized at all.
This thread contains all the information on solving the problem.
https://forum.unity.com/threads/textmeshpro-does-not-recognize-subscripts-and-superscripts.1151834/
In short:
I changed the default font to one that supports superscripts and subscripts. From that font I created two TMP font assets: a static one containing the main ASCII characters and a dynamic one for the remaining characters. Then I added the dynamic font to the static font's fallback list, and finally assigned the static font to the TextMeshPro component. That's it.

Unicode converted text isn't shown properly in MS-Word

In a mapping editor, the display is correct after the legacy-to-Unicode conversion of DEVANAGARI text shown using a Unicode font (Arial Unicode MS). However, in MS Word the display isn't as expected for the same Unicode text, whether in Arial Unicode MS or any other Devanagari Unicode font. The expected sequence of code points is provided as per the documentation; the sequence can be seen in the left-hand table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the ZERO WIDTH JOINER? The halant (virama) by itself is enough to get the half-consonant (for some combinations), and in particular it may be that Word is using the presence of the ZWJ to keep the characters separate.
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[Aside: the way to tell whether it's being treated as a single run is to save the document as an XML file, open it with something like Notepad++, and look at the "w:t" elements (IIRC) associated with these characters. If they're in separate w:t elements, they're in separate runs. In that case, you might need to copy the text from Word into some other tool (e.g. Notepad++) and then copy it from there and paste it back into Word; that might cause it to be imported into Word as a single run.]
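If you want to experiment with the first suggestion before regenerating the text, stripping the ZWJ from the converted string is a one-liner. A hedged Java sketch (stripZwj is a hypothetical helper; whether your mapping actually needs the ZWJ is something only you can verify):

// Remove ZERO WIDTH JOINER (U+200D) so shaping relies on the halant/virama alone.
static String stripZwj(String devanagariText) {
    return devanagariText.replace("\u200D", "");
}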

iText -- How do I identify a single font that can print all the characters in a string?

This is with regard to iText 2.1.6.
I have a string containing characters from different languages, for which I'd like to pick a single font (among the registered fonts) that has glyphs for all these characters. I would like to avoid a situation where different substrings in the string are printed using different fonts, if I already have one font that can display all these glyphs.
If there's no such single font, I would still like to pick a minimal set of fonts that covers the characters in my string.
I'm aware of FontSelector, but it doesn't seem to try to find a minimal set of fonts for the given text. Correct? How do I do this?
iText 2.1.6 is obsolete. Please stop using it: http://itextpdf.com/salesfaq
I see two questions in one:
Is there a font that contains all characters for all languages?
Allow me to explain why this is impossible:
There are 1,114,112 code points in Unicode. Not all of these code points are used, but the possible number of different glyphs is huge.
A simple font only contains 256 characters (1 byte per character); a composite font uses CIDs from 0 to 65,535.
65,535 is much smaller than 1,114,112, which means it is technically impossible for a single font to contain all possible glyphs.
FontSelector doesn't find a minimal set of fonts!
FontSelector doesn't look for a minimal set of fonts. You have to tell FontSelector which fonts you want to use and in which order! Suppose that you have this code:
FontSelector selector = new FontSelector();
selector.addFont(font1);
selector.addFont(font2);
selector.addFont(font3);
In this case, FontSelector will first look in font1 for each specific glyph. If it isn't there, it will look in font2, and so on. Obviously font1, font2 and font3 may all contain a glyph for the same character, for instance the letter "a", each in its own design. Which glyph is used depends on the order in which you added the fonts.
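For completeness, the selector is then applied to the text with process(), which splits the string into chunks, each rendered with the first font that has the glyph. A minimal usage sketch continuing the snippet above (document is assumed to be an open com.lowagie.text.Document):

// Split the text over the registered fonts and add the result to the document.
Phrase phrase = selector.process("Some text mixing several scripts");
document.add(phrase);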
Bottom line:
Select a wide range of fonts that cover all the glyphs you need and add them to a FontSelector instance. Don't expect to find one single font that contains all the glyphs you need.
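As for the second part of the question (a minimal set of fonts): iText won't compute that for you, but you can approximate it yourself before feeding the result to a FontSelector. Below is an illustrative greedy sketch using BaseFont.charExists(int); the class and method names are hypothetical, and how reliably charExists answers for supplementary-plane code points depends on the font and encoding.

import com.lowagie.text.pdf.BaseFont;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MinimalFontSetSketch {
    // Greedy cover: repeatedly pick the font that covers the most still-uncovered code points.
    static List<BaseFont> minimalFontSet(String text, List<BaseFont> candidateFonts) {
        Set<Integer> uncovered = new HashSet<Integer>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            uncovered.add(cp);
            i += Character.charCount(cp);
        }
        List<BaseFont> chosen = new ArrayList<BaseFont>();
        while (!uncovered.isEmpty()) {
            BaseFont best = null;
            Set<Integer> bestCovered = Collections.emptySet();
            for (BaseFont bf : candidateFonts) {
                Set<Integer> covered = new HashSet<Integer>();
                for (int cp : uncovered) {
                    if (bf.charExists(cp)) {
                        covered.add(cp);
                    }
                }
                if (covered.size() > bestCovered.size()) {
                    best = bf;
                    bestCovered = covered;
                }
            }
            if (best == null) {
                break; // the remaining code points exist in none of the candidate fonts
            }
            chosen.add(best);
            uncovered.removeAll(bestCovered);
        }
        return chosen;
    }
}

If the returned list has exactly one entry, that single font covers the whole string; otherwise, add the fonts to a FontSelector in the returned order.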

What is this unicode invisible character?

While trying to parse some unicode text strings, I'm hitting an invisible character that I can't find any definition for. If I paste it in to a text editor and show invisibles, I can see that it looks like a bullet point (• alt-8), and by copy/pasting them, I can see it has an effect like a space or tab, but it's none of those.
I need to test for it, something like...
if(uniChar == L'\t')
But of course I need to provide something to match to.
It has bytes 0xc2 0xa0 in UTF-8.
If no-one has a definition, is there any devious way to test for something I can't define!?
(I happen to be using NSStrings in Objective-C, OSX, Xcode, but I don't think that has any bearing.)
Bytes C2 A0 in UTF-8 encode U+00A0 ɴᴏ-ʙʀᴇᴀᴋ sᴘᴀᴄᴇ, which can be used, for example, to display combining marks in isolation. It is written &nbsp; as a named HTML entity. It is almost the same as U+0020 sᴘᴀᴄᴇ, except that it prevents line breaks before or after it and acts as a numeric separator for bidirectional layout.
The dot you see when you ask a text editor to show invisibles just happens to be the glyph that editor chose for displaying spaces. It does not mean the character in question is U+00B7 ᴍɪᴅᴅʟᴇ ᴅᴏᴛ, which is definitely not invisible.
In code, if you have it as a unichar, you can compare it to L'\x00A0'.
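If it helps, here is the same check purely for illustration in Java (the question itself is Objective-C, where comparing the unichar against 0x00A0 is the equivalent):

// Detect U+00A0 NO-BREAK SPACE.
static boolean isNoBreakSpace(char c) {
    return c == '\u00A0';
}

// Or normalize it to a regular space before parsing.
static String replaceNbsp(String s) {
    return s.replace('\u00A0', ' ');
}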