Unifont & UnicodeData.txt: how do I deduce whether a character is full or half width (x-advance)?

Is there a reliable way to determine whether a glyph in Unifont is half width, like the Latin characters (i.e. all of chart 0002), which occupy only the left half of the cell, or full width, like character 0x06E9 (from chart 0006)?
Pixel analysis is not a good solution for me, as it would fail on many characters such as spaces.
I'd prefer to use information from UnicodeData.txt:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
Unfortunately, I'm not able to find a good match between Unifont and any field in that data.
Chart 0002: http://unifoundry.com/png/plane00/uni0002.png
Chart 0006: http://unifoundry.com/png/plane00/uni0006.png

Looks like you'll need the '.hex' source for the version of Unifont you're using and the appropriate versions of the Unicode Utilities from [1]. 'unigenwidth' [2] generates code related to the width of characters in Unifont; perhaps you'll need to write a parser to look through that output and extract what you want. (A sketch that reads the widths straight from the .hex file follows the references below.)
[1] http://unifoundry.com/unicode-utilities.html
[2] http://manpages.ubuntu.com/manpages/trusty/man1/unigenwidth.1.html
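If you only need the x-advance, one shortcut is to read the widths straight out of the '.hex' source: in the standard Unifont .hex format each line is CODEPOINT:BITMAP, where the bitmap is 32 hex digits for an 8x16 (half-width) glyph and 64 hex digits for a 16x16 (full-width) glyph. A minimal sketch along those lines (the file path is a placeholder):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class UnifontWidths {
    // Maps a code point to its advance in half-cells: 1 = half width (8 px), 2 = full width (16 px).
    static Map<Integer, Integer> load(String hexFile) throws IOException {
        Map<Integer, Integer> widths = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(hexFile))) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;                      // skip malformed lines
            int codePoint = Integer.parseInt(line.substring(0, colon), 16);
            int bitmapDigits = line.length() - colon - 1; // 32 digits -> 8x16 glyph, 64 -> 16x16 glyph
            widths.put(codePoint, bitmapDigits >= 64 ? 2 : 1);
        }
        return widths;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> widths = load("unifont.hex");       // path is an assumption
        System.out.println("U+0041 cells: " + widths.get(0x41));  // expect 1 (half width)
        System.out.println("U+06E9 cells: " + widths.get(0x6E9)); // expect 2, per the chart linked above
    }
}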

Related

How can I manage display and spacing on a Crystal Report where I have to display images between the text field?

I have a field that I'm displaying on a report which is a combination of text and codes that represent images. Some of those icons have ASCII symbols, so I've used a replace formula to display them as their ASCII versions. For two or three of the images I have no luck, and I have to display a mini picture instead.
The codes being sent are something like:
^he^ = ♥ ^st^ = ⭐ ^cl^ = 🍀 etc...
For the clover leaf there is no emoji support in my version of Crystal, and the ASCII icon I found online just shows as an empty square when the emoji isn't supported.
My workaround is to have a formula that converts all my icons to the appropriate ASCII characters where supported, and to leave two blank spaces for the unsupported icons.
stringvar gift_msg;
gift_msg := {DataTable1.gift_field};
gift_msg := replace(gift_msg,"^CL^"," ");
gift_msg := replace(gift_msg,"^HE^","♥");
gift_msg := replace(gift_msg,"^ST^","★");
gift_msg
I then put a suppression formula on each image that looks like this:
mid({DataTable1.gift_field},2,4)<>"^CL^"
I then duplicated the image along the length of the field and incremented the mid formula to match each position in the field. I also set the font to Consolas, so that it's fixed width, to remove any surprises in spacing. My issue is that this still creates very strange spacing, and I'm almost certain there's a much easier way to do this.
One option is to use a free service such as Calligraphr.com to convert your image to a font.
Given that your image relies on several colors, the font option might not work.
Another option is to build the expression as HTML with image source directives where you need them. You would then need to create or use a 3rd-party UFL to convert the full expression to an image that you can load on the fly using the Graphic Location expression. At least one of the UFLs listed by Ken Hamady here provides such a function.

Determine the individual unicode characters that make up a word

I'm having trouble breaking a word into its individual Unicode components. I'm working with the Devanagari script using Google Input Tools. An example is र्म (pronounced -rm), which I want to break into म (-m) and the hook at the top (-r). But I can't seem to find the Unicode character that corresponds to the hook at the top. Here are some of the solutions I tried:
1. Copy and paste र्म into MS Word and hit Alt+X. But this breaks the word into र् and म; it doesn't give me the Unicode character for the top hook.
2. I tried the site http://shapecatcher.com/. I found a character called Latin egyptological ain; while similar in shape, it cannot be used on top of another character. I'm looking for the conjunct version of the hook.
Any help would be appreciated. I'm using TekMaker on Windows 8.
The ‘hook at the top’ representing a preceding र् is an inseparable part of the glyph for a variety of biconsonantal ligatures. It's not a discrete, freely-combinable diacritical mark as we would understand it in Latin-like scripts.
Consequently the visual rendering element doesn't have its own Unicode representation distinct from its linguistic meaning र्, sorry!
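If it helps to see exactly which code points the cluster contains (matching the र् + म split that Word produced), here is a small sketch:
public class Decompose {
    public static void main(String[] args) {
        String cluster = "\u0930\u094D\u092E";  // र्म as stored in the text
        cluster.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
        // Prints:
        // U+0930 DEVANAGARI LETTER RA
        // U+094D DEVANAGARI SIGN VIRAMA
        // U+092E DEVANAGARI LETTER MA
    }
}
The hook (traditionally called a repha) only appears when a renderer shapes RA + VIRAMA before another consonant; it never gets a code point of its own.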

iText -- How do I identify a single font that can print all the characters in a string?

This is wrt iText 2.1.6.
I have a string containing characters from different languages, for which I'd like to pick a single font (among the registered fonts) that has glyphs for all these characters. I would like to avoid a situation where different substrings in the string are printed using different fonts, if I already have one font that can display all these glyphs.
If there's no such single font, I would still like to pick a minimal set of fonts that covers the characters in my string.
I'm aware of FontSelector, but it doesn't seem to try to find a minimal set of fonts for the given text. Correct? How do I do this?
iText 2.1.6 is obsolete. Please stop using it: http://itextpdf.com/salesfaq
I see two questions in one:
Is there a font that contains all characters for all languages?
Allow me to explain why this is impossible:
There are 1,114,112 code points in Unicode. Not all of these code points are used, but the possible number of different glyphs is huge.
A simple font only contains 256 characters (1 byte per character); a composite font uses CIDs from 0 to 65,535.
65,535 is much smaller than 1,114,112, which means that it is technically impossible to have a single font that contains all possible glyphs.
FontSelector doesn't find a minimal set of fonts!
FontSelector doesn't look for a minimal set of fonts. You have to tell FontSelector which fonts you want to use and in which order! Suppose that you have this code:
FontSelector selector = new FontSelector();
selector.addFont(font1);
selector.addFont(font2);
selector.addFont(font3);
In this case, FontSelector will first look at font1 for each specific glyph. If it's not there, it will look at font2, and so on. Obviously font1, font2 and font3 may each have their own glyph for the same character; for instance, each might draw the letter 'a' differently. Which glyph is used depends on the order in which you added the fonts.
Bottom line:
Select a wide range of fonts that cover all the glyphs you need and add them to a FontSelector instance. Don't expect to find one single font that contains all the glyphs you need.
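If you want the "one font where possible" behaviour, you can test your candidate fonts yourself before falling back to FontSelector. A rough sketch against the 2.1.6 API using BaseFont.charExists; the font paths and sample text are placeholders:
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import com.lowagie.text.DocumentException;
import com.lowagie.text.pdf.BaseFont;

public class CoveringFont {
    // Returns the first candidate that has a glyph for every code point in text, or null if none does.
    static BaseFont findSingleFont(String text, List<BaseFont> candidates) {
        for (BaseFont bf : candidates) {
            if (text.codePoints().allMatch(bf::charExists)) {
                return bf;
            }
        }
        return null;
    }

    public static void main(String[] args) throws DocumentException, IOException {
        // Placeholder font files -- use whatever fonts you have registered.
        List<BaseFont> candidates = Arrays.asList(
            BaseFont.createFont("c:/windows/fonts/arialuni.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED),
            BaseFont.createFont("c:/windows/fonts/msgothic.ttc,0", BaseFont.IDENTITY_H, BaseFont.EMBEDDED));
        BaseFont single = findSingleFont("Hello Привет こんにちは", candidates);
        if (single != null) {
            System.out.println("Use " + single.getPostscriptFontName() + " for the whole string.");
        } else {
            System.out.println("No single candidate covers the text; fall back to FontSelector.");
        }
    }
}
When no single candidate covers the text, picking a genuinely minimal set is a set-cover problem; in practice a greedy pass (repeatedly taking the font that covers the most remaining code points) is usually good enough.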

Where to get a reference image for any unicode code point?

I am looking for an online service (or collection of images) that can return an image for any unicode code point.
Unicode.org does not have an image for each one, consider for example
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=31cf
EDIT: I need to use these images programmatically, so the code chart PDFs provided at unicode.org are not useful.
The images in the PDF are copyrighted, so there are legal issues around extracting them. (I am not a lawyer.) I suspect that those legal issues prevent a simple solution from being provided, unless someone wants to go to the trouble of drawing all of those images. It might happen, but seems unlikely.
Your best bet is to download a selection of fonts that collectively cover the entire range of characters, and display the characters using those fonts. There are two difficulties with this approach: combining characters and invisible characters.
The combining characters can easily be detected from the Unicode database, and you can supply a base character (such as NBSP) to use for displaying them. (There is a special code point intended for this purpose, but I can't find it at the moment.)
Invisible characters could be displayed with a dotted square box containing the abbreviation for the character. Those you may have to locate manually and construct the necessary abbreviations. I am not aware of any shortcuts for that.
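As a rough sketch of that approach in plain AWT (the font list and image size are placeholders; U+25CC DOTTED CIRCLE is used here as the conventional base for combining marks, which may be the special code point alluded to above):
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class CodePointImage {
    // Placeholder font list -- in practice, load the fonts you downloaded to cover the full range.
    static final Font[] FONTS = {
        new Font("Serif", Font.PLAIN, 48),
        new Font("SansSerif", Font.PLAIN, 48)
    };

    static void render(int codePoint, File out) throws IOException {
        int type = Character.getType(codePoint);
        boolean combining = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        // Prefix combining marks with U+25CC DOTTED CIRCLE so they have something to attach to.
        String text = (combining ? "\u25CC" : "") + new String(Character.toChars(codePoint));

        BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 64, 64);
        g.setColor(Color.BLACK);
        for (Font f : FONTS) {              // first font that can display the code point wins
            if (f.canDisplay(codePoint)) {
                g.setFont(f);
                g.drawString(text, 8, 48);
                break;
            }
        }
        g.dispose();
        ImageIO.write(img, "png", out);
        // Invisible or unsupported characters come out blank here; drawing a boxed abbreviation
        // for them would still have to be handled separately, as noted above.
    }

    public static void main(String[] args) throws IOException {
        render(0x0301, new File("u0301.png"));  // COMBINING ACUTE ACCENT over a dotted circle
    }
}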

How do I use Unicode Character Combining with Kanji/Hanzi?

I'm trying to find a workaround to display old and rare characters in Unicode using character combining. Currently I'm converting some dictionaries from EPWING into text, and there are 36 different characters which cannot be reproduced using normal UTF-8. Below is the problem section of the EPWING gaiji-to-Unicode mappings for one of the dictionaries that I am converting; in some areas it uses an interesting syntax that is clearly meant to combine characters in different ways. I was hoping someone could identify what this syntax is and where I might find documentation or a tutorial on how to use it.
s/<?w=b02a>/𡓦/g
s/<?w=b04b>/者/g
s/<?w=b064>/<⾱ 𤰇>/g
s/<?w=b077>/<彳<匕\/匕>>/g
s/<?w=b07c>/<山\/⺀>/g
s/<?w=b12e>/𥝝/g
s/<?w=b155>/</>/g
s/<?w=b156>/<\/>/g
s/<?w=b157>/<\/\/>/g
s/<?w=b158>/<こ[1]/と|ヿ>/g
s/<?w=b16f>/<㗢>/g
s/<?w=b170>/<㗥>/g
s/<?w=b171>/ଏ/g
s/<?w=b175>/lb/g
s/<?w=b22a>//g
s/<?w=b234>/ff/g
s/<?w=b25e>/㯌/g
s/<?w=b271>/<扌 晉>/g
s/<?w=b36b>/𣴴/g
s/<?w=b373>/𥝱/g
s/<?w=b42c>/𦼠/g
s/<?w=b434>/<已\/大>/g
s/<?w=b438>/𩸽/g
s/<?w=b43a>/𩺊/g
s/<?w=b43f>/<㇀/丶>/g
s/<?w=b440>/𠂆/g
s/<?w=b45a>/<?>/g
s/<?w=b45b>/<|>/g
s/<?w=b53d>/<?>/g
s/<?w=b53e>/<?>/g
s/<?w=b540>/<o>/g
s/<?w=b537>/<ト モ>/g
s/<?w=b541>/<一/𠔀>/g
s/<?w=b544>/<?>/g
s/<?w=b546>/<[r45]卐>/g
s/<?w=b55f>/*/g
I know that this line is supposed to represent 彳 as a left vertical radical, with one 匕 stacked on top of another 匕 as the right vertical portion of the character:
s/<?w=b077>/<彳<匕\/匕>>/g
This one is also pretty obvious: it's a 卐 rotated 45 degrees:
s/<?w=b546>/<[r45]卐>/g
Note: the four-character hexadecimal code that comes after the ?w= is an identifier for the EPWING gaiji that the Unicode text is supposed to correspond to.
Thank you for your time.
Please see The Unicode Standard section 12.2, Ideographic Description Characters. It discusses your precise situation.
Unfortunately, you may find that software support for what you are trying to do is practically non-existent.
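For example, two of the mappings above can be written as Ideographic Description Sequences using U+2FF0 (compose left to right) and U+2FF1 (compose above to below); a small sketch:
public class IdsExample {
    public static void main(String[] args) {
        // <彳<匕/匕>>  ->  ⿰彳⿱匕匕  (彳 on the left, 匕 stacked over 匕 on the right)
        String ids1 = "\u2FF0" + "彳" + "\u2FF1" + "匕匕";
        // <山/⺀>      ->  ⿱山⺀     (山 above, ⺀ below)
        String ids2 = "\u2FF1" + "山⺀";
        System.out.println(ids1);
        System.out.println(ids2);
    }
}
Note that rotations such as <[r45]卐> have no equivalent among the description characters, and most renderers display an Ideographic Description Sequence literally rather than drawing a single composed glyph, which is the software-support problem mentioned above.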