How do I add a new Arabic vowel-sign in the PUA area of a font? - unicode

I am using Ubuntu 14.04, with FontForge compiled from the Git repo as of 31
July.
I'm trying to add a vowel-sign to an Arabic font, Graph, by Future Soft Egypt:
http://openfontlibrary.org/en/font/graph
I have added glyphs where the Unicode code-point already exists (eg peh,
U+067E), and that works fine. I am now trying to add a vowel sign where no
Unicode code-point exists - it is a "damma with tail", used by some writers in
Swahili to mean "o".
I decided to put it in the PUA at U+E909, and copied the font's damma (U+064F)
and added a tail:
http://kevindonnelly.org.uk/swahili/images/dammas.png
I generated the font, and set up the keyboard to emit that character.
The glyph comes up OK, but there are two problems, as can be seen here:
http://kevindonnelly.org.uk/swahili/images/output.png
showing at top "bubu", using the original damma, and at bottom "bobo", using
the new damma-with-tail.
(1) The damma-with-tail is too far to the left, even though the anchor points
in FF have not been moved.
(2) Worse, the damma-with-tail means that only the isolated versions of the
consonant glyphs get used - in the second line the two bs should be joined, as
in the first line.
I'm not sure whether this is a function of using the PUA, or whether it's due
to my missing some step I need to take in FF (eg the Encoding -> Add Encoding
Slots that needs to be done for the consonants), but if anyone could shed some
light on how to fix the two problems, I'd be very grateful.

Related

How to find the missing components of Chinese characters within Unicode?

I am currently working on the decomposition of Chinese characters (Japanese kanji, to be more exact) and I have found a few components that are seemingly either not included in the Unihan database or they cannot be properly displayed with any font I am aware of. Is there some way to locate these characters within UTF-8 or UTF-16 and make them to be properly displayed in their character form? The list of components is provided below:
渋 ---> 氵+ 止 + ??? ... I have not managed to find those four dots in the Unihan database ... even here the authors had to encode the component ... the same issue appears in kanji 楽 and 摂 and 率
龍 ---> 𦚏 + ??? .... the component on the right hand side seems not to be in the Unicode ... the same goes for 拝 or 継
制 ---> ??? + 刂 ... left component seems not to be in the Unicode (the closest probably is 韦) ... the same goes for the kanji 段 ---> ??? + 殳
祭 ---> ??? + 示
留 ---> ??? + 田 (it is possible to decompose into three components 𠫔 + 刀 + 田, but two would be better)
Thank you very much for your advice :-)
I went through the whole Unihan database (over 90 000 chars) and did not manage to find the missing components. I tried installing various fonts Babel Stone Han, simch5100, etc. but their coverage of Unicode is not 100%. Nevertheless, I am afraid that some of these components are not included within Unicode by themselves and they can be displayed only as a part another character.
You may want to have a look at the IDS.TXT data file maintained by Andrew West (BabelStone), which provides Ideographic Description Sequences (IDS) for all the 97,058 CJK unified ideographs defined in Unicode version 15.0.
It makes use of about 120 "numbered components" which are characters not yet defined in Unicode (although it seems they may be added later on, according to some official proposal). They are currently represented by glyphs found in an associated Private Use Area (PUA) font named BabelStone Han PUA, which can be freely downloaded from the bottom of the page.
There is also one open-source application making extensive use of this data in a graphical way, called Unicopedia Sinica, available on GitHub.

How to fix PowerShell 7 fonts not showing correctly | oh-my-posh

I've already installed Windows Terminal, set it up with "oh my posh" and everything working as intended.
Though whenever I launch PowerShell 7 (without the terminal), the font is messy as you can see at the image below
I have already tried to change the font, to the same one I used in terminal's .json but there are still some parts that are not rendering correctly and I cannot use it that way with VSCode
The problem is because the Windows Console doesn't fully support UTF-8:
Windows Console was created way back in the early days of Windows,
back before Unicode itself existed! Back then, a decision was made to
represent each text character as a fixed-length 16-bit value (UCS-2).
Thus, the Console’s text buffer contains 2-byte wchar_t values per
grid cell, x columns by y rows in size.
...
One problem, for example, is that because UCS-2 is a fixed-width
16-bit encoding, it is unable to represent all Unicode codepoints.
This means you have "partial" support for Unicode characters in the Windows Console (i.e. as long as the character can be represented in UCS-2), but won't support all potential (32-bit) Unicode regions.
When you see boxes, that means that the character that is being used is using a region outside of the UCS-2 range. You also tell this because you get 2 boxes (i.e. 2 x 16 bit values). That is why you can't have happy faces 😀 in your Windows Console (which makes me sad ☹️).
In order for it to work in all locations, you will have to modify your oh-my-posh themes to use a different character that can be represented with a UCS-2 character.
For Version 2 of Oh My Posh, to make the font changes you have to edit the $ThemeSettings variable. Follow the instructions on the GitHub on configuring Theme Settings. e.g.:
$ThemeSettings.GitSymbols.BranchSymbol = [char]::ConvertFromUtf32(0x2514)
For Version 3+ of Oh My Posh, you have to edit the JSON configuration file to make the changes, e.g.:
...
{
"type": "git",
"style": "powerline",
"powerline_symbol": "\u2514",
....

Unicode characters aren't combined properly

I am working with some Devanagari text data I want to display in the browser. Unfortunately, there's one combination of nonspacing combining characters that doesn't get rendered as a proberly combined character.
The problem occurs every time a base character is combined with the Devanagari Stress Sign Udatta ॑ (U+0951) and the Devanagari Sign Visarga ः (U+0903).
An example for this would be र॑ः, which is र (U+0930) + ॑ + ः and should be rendered as one character. But the stress sign and the other one don't seem to like each other (as you can see above!).
It's no problem to combine the base char with each of the other two signs alone, btw: र॑ / रः
I already tried to use several fonts which should be able to render Devanagari characters (some Noto fonts, Siddhanta, GentiumPlus) and tested it with different browsers, but the problem seems to be something else.
Does anyone have an idea? Is this not a valid combination of symbols?
EDIT: I just tried to switch around the two marks just to see what if - it renders as रः॑, so U+0951 and U+0903 don't seem to have the same function, as the stress sign gets rendered on top of the other mark.
It looks like i don't understand Unicode enough, yet.
This is NOT a solution for your problem, but might be useful information:
I am working with some Devanagari text data I want to display in the
browser.
Like you, I couldn't get this to work in any browser despite trying several fonts, including Arial Unicode MS:
The browser was simply rendering the text Devanagari Test: रः॑ from within the <body> of a JSP. The stress sign is clearly appearing above the Sign Visarga instead of the base character.
Is this not a valid combination of symbols?
It is a valid combination. I don't know Devanagari, so I don't know whether it is semantically "valid", but it is trivial to generate exactly the character you want from a Java application:
System.out.println("Devanagari test: \u0930\u0903\u0951");
This is the output from executing the println() call, showing the stress sign above the base character:
The screenshot above is from NetBeans 8.2 on Windows 10, but the rendering also worked fine using the latest releases of Eclipse and Intellij IDEA. The constraints are:
The three characters must be specified in that order in println() for the rendering to work.
The Sign Visarga and the Stress Sign Udatta must be presented in their Unicode form. Pasting their glyph representations into the source code won't work, although this can be done for the base character.
An appropriate font must be used for the display. I used Arial Unicode MS for the screen shot above, but other fonts such as Serif, SansSerif and Monospaced also worked.
Does anyone have an idea?
Unfortunately not, although it is clear that:
The grapheme you want to render exists, and is valid.
Although it won't render in a browser, it can be written to the console by a Java application.
The problem seems to be that all browsers apply the diacritic (Stress Sign Udatta) to the immediately preceding character rather than the base character.
See Why are some combining diacritics shifted to the right in some programs? for more information on this.

What character is this:?

EDIT
While posting the question, character I ask for was shown well to me, but after postig it does not show up anymore. As it does not appear, please look up in original site
EDIT2
I looked for Unicode chars associated with "alien", and found no matching ones. Here is how they are compared side by side:
I found, that some texts inside my database contain character like . I am not sure, how it would rendered with different fonts and environments, so here is the image, how I see it:
I tried to identify it with different ways. For example, when I paste it into Sublime Text, it automatically shows as control character <0x85>. When I tried to identify it in different unicode-detectors (http://www.babelstone.co.uk/Unicode/whatisit.html, https://unicode-table.com/en/, https://unicode-search.net/unicode-namesearch.pl), their conclusion is pretty match the same:
Uni­code code point char­acter U+0085
UTF-8 en­co­ding c2 85 hexa­decimal
194 133 deci­mal
0302 0205 octal
Uni­co­de char­ac­ter name <control>
Uni­co­de 1.0 char­act­er name (de­pre­ca­ted) NEXT LINE (NEL)
https://unicode-search.net/unicode-namesearch.pl
also included this information
HTML en­co­ding … … hexa­decimal
… … deci­mal
which gave me some vague hint, how it was possible, that … become ``. But this is not main problem here.
My question is: how is possible, that control character is shown up like this and what is the actual glyph used to represent it?
I tried to sketch into http://shapecatcher.com/ to identify it but without success. I did not find such a glyph in any Unicode table.
The alien symbol is not a Unicode character; but is in Microsoft's Webdings font, with character code 0x85. Running Start > Run > charmap, then selecting Webdings from the Font drop list, opens this window:
If I click that alien character in the leftmost column, the message Character Code : 0x85 is shown at the bottom of the window.
I can even copy that character from the Character Map and paste it into Microsoft Wordpad:
The WebDings symbols were included in Unicode Release 7: Pictographic symbols (including many emoji), geometric symbols, arrows, and ornaments originating from the Wingdings and Webdings sets. Therefore you would expect the alien symbol to also be in Unicode. However, I don't think the version of Webdings that was used included that alien symbol, since Windows 10 also has a ttf file for Webdings (version 5.01), and it also does not include the alien symbol:
So presumably what originally caught your attention was some text being rendered with an older version of the Webdings font which included that alien symbol.
The glyph is 👽 U+1F47D EXTRATERRESTRIAL ALIEN. I don't know why your system misrenders a control character.

Where to get a reference image for any unicode code point?

I am looking for an online service (or collection of images) that can return an image for any unicode code point.
Unicode.org does not have an image for each one, consider for example
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=31cf
EDIT: I need to use these images programmatically, so the code chart PDFs provided at unicode.org are not useful.
The images in the PDF are copyrighted, so there are legal issues around extracting them. (I am not a lawyer.) I suspect that those legal issues prevent a simple solution from being provided, unless someone wants to go to the trouble of drawing all of those images. It might happen, but seems unlikely.
Your best bet is to download a selection of fonts that collectively cover the entire range of characters, and display the characters using those fonts. There are two difficulties with this approach: combining characters and invisible characters.
The combining characters can easily be detected from the Unicode database, and you can supply a base character (such as NBSP) to use for displaying them. (There is a special code point intended for this purpose, but I can't find it at the moment.)
Invisible characters could be displayed with a dotted square box containing the abbreviation for the character. Those you may have to locate manually and construct the necessary abbreviations. I am not aware of any shortcuts for that.