iTextSharp: why is GetSingleSpaceWidth() returning 0 when a space is visually obvious?

Hi All,
This is a question about iTextSharp version 5.5.13.1. I am using a custom LocationTextExtractionStrategy implementation to extract sensible words from a PDF document. I call the GetSingleSpaceWidth method of TextRenderInfo to determine when to join two adjacent blocks of characters into a single word, as per the Stack Overflow question "itext java pdf to text creation".
This approach has generally worked well. However, in the attached document, the words "Credit" and "Extended" are giving me problems.
Why do all the characters encircled in the screen capture return a zero value for GetSingleSpaceWidth? As a result, instead of two separate words, my logic returns a single word, "CreditExtended".
I understand that iTextSharp 5 is no longer supported. Any suggestions would be highly appreciated.
Sample document
https://drive.google.com/open?id=1pPyNRXvnUyIA2CeRrv05-H9q0sTUN97d

As already conjectured in a comment, the cause is that the font in question does not contain a regular space glyph or, more precisely, does not map any of its glyphs to the Unicode value U+0020 in its ToUnicode map.
If a font has a ToUnicode map, iText uses only the information from that map. Thus, iText does not identify a space glyph in that font, so it cannot provide the actual SingleSpaceWidth value and returns 0 instead.
The font in question is named F5 and has this ToUnicode map:
/CIDInit /ProcSet findresource begin
14 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0004> <0041>
<0012> <0043>
<001C> <0045>
<002F> <0049>
endbfchar
1 beginbfrange
<0044> <0045> <004D>
endbfrange
13 beginbfchar
<0102> <0061>
<0110> <0063>
<011A> <0064>
<011E> <0065>
<0150> <0067>
<015D> <0069>
<016F> <006C>
<0176> <006E>
<017D> <006F>
<0189> <0070>
<018C> <0072>
<0190> <0073>
<019A> <0074>
endbfchar
5 beginbfrange
<01C0> <01C1> <0076>
<01C6> <01C7> <0078>
<0359> <0359> [<2026>]
<035A> <035B> <2018>
<035E> <035F> <201C>
endbfrange
1 beginbfchar
<0374> <2013>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
As you can see, there is no mapping to <0020>.
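To run the same check on another font, one could scan its ToUnicode CMap for a mapping to U+0020. The following is only a rough Python sketch, not iText code: the function names are made up for illustration, and it assumes the CMap stream has already been decompressed to text and that each target is a single code unit.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def unicode_targets(cmap_text):
    """Collect every Unicode value (as an int) that a ToUnicode CMap maps to."""
    targets = set()
    # bfchar sections contain pairs: <source code> <target>
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for _src, dst in re.findall(HEX + r"\s*" + HEX, body):
            targets.add(int(dst, 16))
    # bfrange sections contain <lo> <hi> <first target> or <lo> <hi> [<t1> <t2> ...]
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        pattern = HEX + r"\s*" + HEX + r"\s*(\[[^\]]*\]|<[0-9A-Fa-f]+>)"
        for lo, hi, dst in re.findall(pattern, body):
            if dst.startswith("["):
                targets.update(int(h, 16) for h in re.findall(HEX, dst))
            else:
                first = int(dst.strip("<>"), 16)
                count = int(hi, 16) - int(lo, 16) + 1
                targets.update(range(first, first + count))
    return targets

def has_space_mapping(cmap_text):
    """True if some code is mapped to U+0020 (space)."""
    return 0x20 in unicode_targets(cmap_text)

# Excerpt of the map quoted above
sample = """
4 beginbfchar
<0004> <0041>
<0012> <0043>
endbfchar
2 beginbfrange
<0044> <0045> <004D>
<0359> <0359> [<2026>]
endbfrange
"""
print(has_space_mapping(sample))
```

For the excerpt this reports that no space mapping exists, which matches what can be seen by eye in the full map.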
The use of fonts in this PDF page is quite funny, by the way:
Its body is (mostly) drawn using Calibri, but it uses two distinct PDF font objects for this, F4 which uses WinAnsiEncoding from character 32 through 122, i.e. including the space glyph, and F5 which uses Identity-H and provides the above quoted ToUnicode map without a space glyph. Each maximal sequence of glyphs without gap is drawn separately; if that whole sequence can be drawn using F4, that font is used, otherwise F5 is used.
Thus, the sequences "CMI", "(Credit", and "sub-indexes" are drawn using F4, while "I’ve", "“Credit", and "Extended”" are drawn using F5.
In your problem string “Credit Extended”, therefore, we see two consecutive sequences drawn using F5. Thus, you'll get a 0 SingleSpaceWidth both for the final "t" of “Credit and for the initial "E" of Extended”.
At first glance these are the only two consecutive sequences using F5, so you have that issue only there.
As a consequence you should develop a fallback strategy for the case of two consecutive characters both coming with a 0 SingleSpaceWidth, e.g. using something like a third of the font size.
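The fallback suggested above might look roughly like the following. This is a language-neutral sketch in Python rather than iTextSharp code; the function names, the divisor of 3 (the "third of the font size"), and the half-space threshold are all assumptions to be tuned against your documents.

```python
def effective_space_width(single_space_width, font_size, fallback_divisor=3.0):
    """Return the reported space width, or a guess if the font reports 0."""
    if single_space_width > 0:
        return single_space_width
    # Assumption: a third of the font size approximates a space
    return font_size / fallback_divisor

def is_word_break(gap, single_space_width, font_size, threshold=0.5):
    """Decide whether the gap between two chunks separates two words.

    The threshold (half a space width) is an assumed heuristic.
    """
    return gap > effective_space_width(single_space_width, font_size) * threshold

# Both neighbouring chunks report a space width of 0, as with font F5
print(is_word_break(gap=3.0, single_space_width=0.0, font_size=12.0))
```

In the extraction strategy you would take gap from the distance between the end of one chunk and the start of the next, and single_space_width from GetSingleSpaceWidth().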

Related

Showing glyphs with Unicodes higher than decimal 256

I am looking for help in printing the club symbol from the Arial font in postscript.
It has a Unicode of 9827 (2663 hexadecimal).
The ampersand is Unicode 38 (26 hexadecimal)
This postscript code snippet
%!PS
/ArialMT findfont
12 scalefont setfont
72 72 moveto
<26> show
showpage
produces the ampersand symbol when I run it through Adobe Distiller. It appears that PostScript understands Unicode with UTF-8 encoding by default.
I am unable to do the same with the club symbol.
My research indicates that I have to use Character Encoding, and this is where I am lost.
Could some kind soul show me (hopefully fairly short) how to show the club symbol using Character Encoding?
Alternatively if you could point me to a simple tutorial it would be greatly appreciated.
Reading the reference manual leaves my head spinning.
PostScript does not understand Unicode, not at all, or at least not as standard, though there are ways to deal with it.
Section 5.3 of the PostScript Language Reference Manual contains complete information on Character Encoding. You really need to read this in detail, what you are asking is a deceptively simple question, with no simple answer.
The way this works for PostScript fonts is that the characters in the document have a character code which lies between 0 and 255. When processing text, the interpreter takes the character code and looks it up in the Encoding attached to the font. If you didn't supply an Encoding to the font, then it will normally have a pre-defined StandardEncoding.
StandardEncoding largely agrees with ASCII (and therefore UTF-8) for character codes 0x7F and below, but it is not identical: for example, 0x27 maps to /quoteright and 0x60 to /quoteleft.
The Encoding maps the character code to a glyph name. For example, 0x41 in StandardEncoding maps to /A (that's a name in PostScript). Note that this is not UTF-8 or anything else; it's a mapping. It's entirely possible, and common practice for subset fonts, to map the first character used to character code 1, the second to character code 2 and so on.
So if we applied that scheme to 'Hello World' we would use an Encoding which maps
0x01->/H
0x02->/e
0x03->/l
0x04->/o
0x05->/space
0x06->/W
0x07->/r
0x08->/d
and then we might draw the text by :
<0102030304050604070308> show
Which, as you can see, bears no relation to UTF-8 at all.
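To make the subset scheme concrete, here is a small Python sketch that builds such an Encoding and the matching hex string. The function name and the glyph-name table are made up for illustration; a real subsetter would take glyph names from the font itself.

```python
def subset_encode(text, glyph_names):
    """Assign character codes 1, 2, 3... in order of first use.

    Returns (encoding, hex_string), where encoding maps each code to a glyph
    name and hex_string is what would be passed to show.
    """
    encoding = {}   # character code -> glyph name
    codes = {}      # character -> character code
    sequence = []
    for ch in text:
        if ch not in codes:
            code = len(codes) + 1
            codes[ch] = code
            # Letters are named after themselves; others need a lookup table
            encoding[code] = glyph_names.get(ch, ch)
        sequence.append(codes[ch])
    hex_string = "<" + "".join(f"{c:02X}" for c in sequence) + ">"
    return encoding, hex_string

encoding, hex_string = subset_encode("Hello World", {" ": "space"})
print(encoding)
print(hex_string)
```

For 'Hello World' this reproduces the mapping and the hex string shown above.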
Anyway, having retrieved the glyph name, the interpreter then looks in the CharStrings dictionary in the font and locates the key matching that glyph name. So for StandardEncoding we would map 0x41 to /A and then look in the CharStrings dictionary for the /A key. We then take the value associated with that key, which will be a PostScript glyph program, and run it.
Your problem is that you are trying to use a TrueType font. PostScript does not support TrueType fonts in that way; it does support them when they are defined as Type 42 fonts, because a Type 42 font carries around some additional information which allows the PostScript interpreter to treat them, broadly speaking, the same way as PostScript fonts.
Many modern PostScript interpreters will load a TrueType font and create a Type 42 font from it for you, but this involves guessing at the additional information, and there's no real way to tell in advance how any given interpreter will deal with this. I suspect that Adobe Distiller will behave similarly to Ghostscript and attempt to map the Type 42 font to a StandardEncoding.
Essentially the Encoding maps the character code to a key in the CharStrings dictionary and the value associated with that key is the GID. The GID is used to index the GLYF table in the TrueType font, the TrueType rasteriser then reads that glyph program.
So in order to create a type42 font with an Encoding which will map a character code to a club symbol, you would need to know what the GID of the club symbol in the font actually is. This can be derived from one of the CMAP subtables in the TrueType font, which is how PostScript interpreters such as Ghostscript build the required Encoding when they load a TrueType font as a Type 42. You would then need to alter the CharStrings dictionary in the type42 font so that it maps to the correct GID. You would also need to alter the Encoding; choose a character code that you want to use, map the character code to the key in the CharStrings dictionary.
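As a mental model only, the two-step lookup for a Type 42 font can be pictured with two dictionaries. This is a Python sketch, not PostScript; the character code 0x43 and the GID 389 are illustrative values, and the real GID depends on the particular font file.

```python
def glyph_id_for_code(code, encoding, charstrings):
    """Model the lookup: character code -> Encoding -> CharStrings key -> GID."""
    # Codes without an Encoding entry fall back to .notdef
    name = encoding.get(code, ".notdef")
    # Names missing from CharStrings also end up at .notdef
    return charstrings.get(name, charstrings[".notdef"])

# Hypothetical re-encoded font: code 0x43 is remapped to the club glyph
encoding = {0x43: "club"}
charstrings = {".notdef": 0, "club": 389}

print(glyph_id_for_code(0x43, encoding, charstrings))
print(glyph_id_for_code(0x44, encoding, charstrings))
```

Re-encoding the font amounts to changing entries in the first dictionary; fixing up CharStrings amounts to changing the second.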
You would have to determine what kind of keys the Encoding and CharStrings dictionary are using. They might be names, integers, or anything else. You could figure that out by looking at the content of the Encoding array.
In all honesty, unless you know a lot about TrueType fonts, I think it would be hard for you to reverse-engineer the font to retrieve the correct GID and then re-encode the font that gets loaded by the interpreter. You would also need to examine the contents of the font dictionary returned by findfont to see what the existing mapping is. And crucially, you may need to modify the CharStrings dictionary to map the key to the GID. It may be that Distiller returns a dictionary which is defined as no-access, which will prevent you from looking inside it (or at least inside parts of it). You might be able to get away with looking at the Encoding in the font dictionary and modifying that, if the CharStrings dictionary already contains a key for every glyph in the font, which it may well do.
I could probably guide you through doing this with Ghostscript, but I have no idea how Adobe Distiller defines TrueType fonts loaded from disk.
You could use a CIDFont instead. These are defined in section 5.11.1 and it may be that if you were to use something like the pre-defined Identity-H or UCS2 CMaps you could create a CID-Keyed instance of ArialMT with TrueType outlines which would work for your Unicode code point.
But that would mean defining the font yourself, so you would need to include the whole TrueType font as part of your PostScript program. Again this isn't simple.
There is some good information here: Show Unicode characters in PostScript
I also have the ArialMT.ttf and made the ArialMT.ttf.t42 just to look inside. I found the /club glyph with GID 389 as described by KenS and tried this as described in the linked post with good results:
%!
100 300 moveto
/ArialMT.ttf 46 selectfont (ArialMT) show
100 200 moveto /club glyphshow
showpage
Note: I use ArialMT.ttf because the TrueType font wasn't installed in the Ghostscript Fontmap, only in the current directory, so I used gs -P for that reason. The normal /ArialMT findfont should work when the TrueType font is already installed in the search path. This is my first attempt with these glyphs, and I was just using trial and error.
There is a comprehensive Adobe list of glyphs that map many of the Unicode characters: https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt.
If the desired Unicode character is in that list, say club;2663 or clubsuitblack;2663 or clubsuitwhite;2667, all one needs to say is /club glyphshow, and most modern fonts will know what to do. But @KenS says this "can cause problems".
Instead, the preferred scheme that emerges from the recommended references is to:
create a composite font in the preamble, one for each of the fonts you are using;
include the lower 256 characters as Font0;
add whatever Unicode characters you are planning to use, in chunks of 256 characters, as Font1, Font2 etc.;
remap the Unicode of the special characters onto a two-character sequence, of the sub-font index within the composite font, followed by the byte that is the index of the character within that sub-font.
The following is a complete example of both methods.
I use http://www.acumentraining.com/Acumen_Journal/AcumenJournal_May2002.zip, but with Font1 being a custom remapping of a portion of the same font as Font0, re-using some of the well-known ASCII characters.
This is a complete file.eps:
%!PS-Adobe-3.0 EPSF-3.0
%%BoundingBox: 0 0 792 612
%%LanguageLevel: 2
%%EndComments
%%BeginProlog
userdict begin
%%EndProlog
%%BeginSetup
% The following encodes a few useful Unicode glyphs, if only a few are needed.
% Based on https://stackoverflow.com/questions/54840594/show-unicode-characters-in-postscript
% Usage: /Times-Roman /Times-Roman-Uni UniVec new-font-encoding
/new-font-encoding { <<>> begin
/newcodesandnames exch def
/newfontname exch def
/basefontname exch def
/basefontdict basefontname findfont def % Get the font dictionary on which to base the re-encoded version.
/newfont basefontdict maxlength dict def % Create a dictionary to hold the description for the re-encoded font.
basefontdict
{ exch dup /FID ne % Copy all the entries in the base font dictionary to the new dictionary except for the FID field.
{ dup /Encoding eq
{ exch dup length array copy % Make a copy of the Encoding field.
newfont 3 1 roll put }
{ exch newfont 3 1 roll put }
ifelse
}
{ pop pop } % Ignore the FID pair.
ifelse
} forall
newfont /FontName newfontname put % Install the new name.
newcodesandnames aload pop % Modify the encoding vector. First load the new encoding and name pairs onto the operand stack.
newcodesandnames length 2 idiv
{ newfont /Encoding get 3 1 roll put}
repeat % For each pair on the stack, put the new name into the designated position in the encoding vector.
newfontname newfont definefont pop % Now make the re-encoded font description into a POSTSCRIPT font.
% Ignore the modified dictionary returned on the operand stack by the definefont operator.
end} def
/Helvetica /Helvetica-Uni [
16#43 /club % hex 43 = ASCII 'C' is remapped to /club
] new-font-encoding
/Helv
<<
/FontType 0
/FontMatrix [ 1 0 0 1 0 0 ]
/FDepVector [
/Helvetica findfont % this is Font0
/Helvetica-Uni findfont % this is Font1
]
/Encoding [ 0 1 ]
/FMapType 3
>> definefont pop
%%EndSetup
%%BeginScript
/Helv 20 selectfont
72 300 moveto
(The club character is \377\001C\377\000 a part of the string.) show
/Helvetica findfont 20 scalefont setfont
263 340 moveto
/club glyphshow
showpage
%%EOF
Which produces this
Obviously, this can be extended to more characters, but only 256 per sub-font. I am not aware of a "standard" convention for such re-encoding, although I would imagine a set of Greek letters alpha, beta, gamma... would map pretty obviously onto a, b, c... Perhaps somebody else is aware of such an implementation for all of the Unicode characters from the Adobe glyph list using multiple custom sub-fonts, and can provide a pointer.
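If the strings are generated by a program, the escape sequences used in the show string above can be produced mechanically. The following Python sketch is only an illustration of that idea, under the assumptions that the composite font uses escape mapping with escape byte 255 (octal 377) as in the example, that text starts out in sub-font 0, and that everything not listed in the remapping table is plain ASCII. The function name and the table are hypothetical.

```python
def escape_mapped_string(text, special, escape=0o377):
    """Build the body of a PostScript string for an escape-mapped composite font.

    special maps a Unicode character to (sub-font index, byte code within it).
    All other characters are assumed to be ASCII and to live in sub-font 0.
    """
    parts = []
    current_font = 0
    for ch in text:
        font, code = special.get(ch, (0, ord(ch)))
        if code > 255:
            raise ValueError(f"no remapping defined for {ch!r}")
        if font != current_font:
            # Escape byte followed by the sub-font index, both as octal escapes
            parts.append("\\%03o\\%03o" % (escape, font))
            current_font = font
        c = chr(code)
        # Parentheses and backslashes must be escaped inside ( ) strings
        parts.append("\\" + c if c in "()\\" else c)
    return "".join(parts)

# The club character U+2663 is remapped to 'C' (hex 43) in sub-font 1
table = {"\u2663": (1, 0x43)}
print("(" + escape_mapped_string("is \u2663 a part", table) + ") show")
```

The printed line contains the same \377\001C\377\000 sequence as the hand-written string in the example file.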

Using characters larger than 0xFFFF

I have an OpenType font with some optional glyphs selected by features. I've opened it in FontForge and I can see that the associated unicode code point is, for example, 0x1002a.
Is it possible to use this value to render the glyph in iText? I've tried calling showText() with a string containing the corresponding surrogate pairs ("\uD800\uDC2A") but nothing appears.
Is there another way to do this, or am I barking up the wrong tree?
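For reference, the surrogate pair for a code point above 0xFFFF follows from the standard UTF-16 arithmetic. A small Python sketch (the function name is made up) confirms the pair used in the question:

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF have surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1002A)
print(f"\\u{high:04X}\\u{low:04X}")
```

So the string in the question does encode U+1002A correctly; whether the glyph is then found depends on how the font and the library map that code point.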

Unicode converted text isn't shown properly in MS-Word

In a mapping editor, the display is correct after the legacy-to-Unicode conversion for Devanagari text shown in a Unicode font (Arial Unicode MS). However, in MS Word, the same Unicode text is not displayed as expected in that font (Arial Unicode MS) or in any other Devanagari Unicode font. The expected sequence of Unicode code points is provided as per the documentation. The sequence can be seen in the left-hand side table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero_width_joiner? The halant (virama) by itself is enough to get the half-consonant (for some combinations) and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
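A quick way to test the ZWJ theory is to strip U+200D from the converted text before pasting it into Word. A minimal Python sketch; the sample string (KA, virama, ZWJ, SSA) is only an assumed example, not the sequence from the question:

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

def strip_zwj(text):
    """Remove every zero-width joiner from the text."""
    return text.replace(ZWJ, "")

# KA + VIRAMA + ZWJ + SSA
sample_text = "\u0915\u094D\u200D\u0937"
cleaned = strip_zwj(sample_text)
print([f"U+{ord(c):04X}" for c in cleaned])
```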
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[aside: the way to tell if it's being treated as a single run is to save the document as an XML file, open it with something like Notepad++, and look at the XML "w:t" elements (IIRC) associated with these characters. If they're all in separate w:t elements, it means they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++) and then copy it from there and paste it back into Word -- that might cause it to be imported into Word as a single run.]
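Instead of reading the XML by eye, the w:t elements can be listed with a few lines of Python. This is a sketch under the assumption that the file uses the WordprocessingML namespace found in .docx documents; other "save as XML" flavours of Word use a different namespace, so W_NS may need adjusting. The sample XML is made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the one used by document.xml inside a .docx file
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def text_runs(document_xml):
    """Return the text of every w:t element, in document order."""
    root = ET.fromstring(document_xml)
    return [t.text or "" for t in root.iter(f"{{{W_NS}}}t")]

# Two runs: the conjunct has been split between them
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:t>\u0915\u094D</w:t></w:r>"
    "<w:r><w:t>\u0937</w:t></w:r>"
    "</w:p></w:body></w:document>"
)
for run in text_runs(sample_xml):
    print([f"U+{ord(c):04X}" for c in run])
```

If the characters that should form one conjunct show up in different list entries, they are in separate runs.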

iText -- How do I identify a single font that can print all the characters in a string?

This is with regard to iText 2.1.6.
I have a string containing characters from different languages, for which I'd like to pick a single font (among the registered fonts) that has glyphs for all these characters. I would like to avoid a situation where different substrings in the string are printed using different fonts, if I already have one font that can display all these glyphs.
If there's no such single font, I would still like to pick a minimal set of fonts that covers the characters in my string.
I'm aware of FontSelector, but it doesn't seem to try to find a minimal set of fonts for the given text. Correct? How do I do this?
iText 2.1.6 is obsolete. Please stop using it: http://itextpdf.com/salesfaq
I see two questions in one:
Is there a font that contains all characters for all languages?
Allow me to explain why this is impossible:
There are 1,114,112 code points in Unicode. Not all of these code points are used, but the possible number of different glyphs is huge.
A simple font only contains 256 characters (1 byte per character); a composite font uses CIDs from 0 to 65,535.
65,535 is much smaller than 1,114,112, which means that it is technically impossible to have a single font that contains all possible glyphs.
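The numbers above can be checked with a line or two of arithmetic (Python used as a calculator here; the variable names are just for illustration):

```python
PLANES = 17                      # Unicode planes 0 through 16
CODE_POINTS = PLANES * 0x10000   # 65,536 code points per plane
CID_COUNT = 0xFFFF + 1           # CIDs 0 through 65,535

print(CODE_POINTS)               # total number of Unicode code points
print(CODE_POINTS // CID_COUNT)  # CID ranges needed to cover all of them
```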
FontSelector doesn't find a minimal set of fonts!
FontSelector doesn't look for a minimal set of fonts. You have to tell FontSelector which fonts you want to use and in which order! Suppose that you have this code:
FontSelector selector = new FontSelector();
selector.addFont(font1);
selector.addFont(font2);
selector.addFont(font3);
In this case, FontSelector will first look at font1 for each specific glyph. If it's not there, it will look at font2, etc. Obviously, font1, font2 and font3 will each have their own glyph for the characters they have in common, for instance three differently shaped versions of the letter a. Which glyph will be used depends on the order in which you added the fonts.
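The first-match behaviour described above can be simulated in a few lines. This is a Python model of the idea, not iText's implementation; the font names and their coverage sets are hypothetical.

```python
def select_fonts(text, fonts):
    """Pick, for each character, the first font in the list that covers it.

    fonts is an ordered list of (name, set of covered characters).
    Returns a list of (font name, run of text); the name is None if no font fits.
    """
    runs = []
    for ch in text:
        name = next((n for n, covered in fonts if ch in covered), None)
        if runs and runs[-1][0] == name:
            # Same font as the previous character: extend the current run
            runs[-1] = (name, runs[-1][1] + ch)
        else:
            runs.append((name, ch))
    return runs

# font2 covers the whole string, but font1 comes first and wins where it can
fonts = [("font1", set("abc")), ("font2", set("abcxyz"))]
print(select_fonts("abxa", fonts))
```

The output shows the string being split across both fonts even though font2 alone would cover it, which is why the order in which fonts are added matters.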
Bottom line:
Select a wide range of fonts that cover all the glyphs you need and add them to a FontSelector instance. Don't expect to find one single font that contains all the glyphs you need.

Zebra Programming Language (ZPL) II using ^FB or ^TB truncates text at specific lengths

I am writing code to print labels for botanic gardens. Each label is printed individually but with different information on each label. Each label contains a scientific name which can vary greatly in size and thus can go over 2 lines (our label size is 10cm wide by 2.5cm high).
Our problem occurs mainly with the name when we go over 24 characters (See line with **).
If we choose a name that has 24 characters or less then it prints fine.
Anything more it will not print.
If we take all the other "items" off the label and just leave the "name" element then it prints only the first 24 characters and truncates the rest (we did this to test whether a possible overlap between our ^FB block and another element could be causing this problem).
We tried this with other elements that use a ^FB and we found that they displayed the same behaviour but varied in the length at which this issue occurred: for example "cc" (short for country code) had a limit of 21 characters.
For added information: we compile this code within a BASIC environment and use variables such as ":name:" and ":accdt:", as seen below. Our database provides this information, and we have checked for any internal routines that would have truncated long names etc. Our code was working fine in ZPL, but we recently had to move to ZPL II (we purchased a newer model, the GX430t) and had to modify our ZPL code, at which point this problem started to occur.
Here is our code:
^XA
^LH40,40
^MMT
^PW1200
^LL1200
^FO16,05^A0N,35,^FDAcc. num.^FS
^FO170,05^A0,35,^FV":accnum:"^FS
^FO360,05^A0,35,^FV":qual:"^FS
^FO350,35^A0N,30,^FDAcc.dt.^FS
^FO450,35^A0N,30,^FB790,3,0,L,
^FH\^FV":accdt:"^FS
^FO430,70^^A0N,25,^FB790,3,0,L,
^FH\^FDProv. type^FS
^FO560,70^A0N,25,^FV":provtype:"^FS
^FO800,225^A0N,30,^FB790,3,0,L,
^FV":cc:"^FS
**^FO10,100^A0N,40,^FB790,3,0,L,
^FV":name:"^FS**
^FO1000,05^A0,35,^FV":proptype:"^FS
^FO5,225^A0,25^FVColl.^FS
^FO55,225^A0,25^FV":coll:"^FS
^FO375,225^A0,25,^FV":consstat:"^FS
^FO1000,70^A0,25,^FV":reqby:"^FS
^FO535,180^BCN,55,N,N,N^FV":qual:"^FS
^FO60,45^BCN,35,N,N,N^FV":accnum:"^FS
^PQ1,0,1,Y
^XZ
Here is what we have tried to fix this (apologies if some seem like wild cards):
Changing font type, size, and location on label;
Changing ^FO to ^FT;
Looked at our internal database logic;
Taking away ^FH\;
Changing the values within the ^FB line (we tried nearly all possible permutations);
Manually typed in a name longer than 24 characters (using notepad - no database/compiler) - same issue.
Any thoughts on this would be greatly appreciated
Kerry
I've had this issue before, and across printer manufacturers, firmwares and languages.
First, some paraphrased explanations straight out of the 2014 ZPL II Programming Guide (P1012728-009 Rev. A).
"The ^TB command prints a text block with defined width and height. The text block has an automatic word-wrap function. If the text exceeds the block height, the text is truncated."
"The ^FB (Field Block) command allows you to print text into a defined block type format. It can format a ^FD (Field Data) string into a block of text using the origin, font, and rotation specified for the text string, and it contains an automatic word-wrap function."
Technically, the difference between a text block and a field block is that height is in dots for the former and in lines for the latter.
Also notice that, although it is not mentioned, the ^FB command also truncates text that does not fit in the number of lines specified, and this is where the font size of the ^A0 command and the line spacing of the ^FB command play an important role in determining whether the second or third line is shown or truncated.
Incidentally, in other languages such as TSPL there is no truncation of text blocks--if you tell the block to be 3 lines in height but there's enough text for 4 lines, line 4 overlaps line 3 to indicate this--which may seem awful, but it is better than the data loss of truncation, which is not obvious.
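To get a feel for when a field block will drop text, the wrap-then-truncate behaviour can be approximated offline. The Python sketch below is a very rough model, not the printer's algorithm: it assumes every character has the same width (the avg_char_width_dots parameter is hypothetical), which is not true for the proportional ^A0 font, so treat the result as an estimate only.

```python
def wrap_field_block(text, block_width_dots, max_lines, avg_char_width_dots):
    """Word-wrap text into a block and split it into kept and dropped lines."""
    chars_per_line = max(1, block_width_dots // avg_char_width_dots)
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= chars_per_line or not current:
            current = candidate
        else:
            # Word does not fit: close the line and start a new one
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    # Lines beyond max_lines are what a truncating printer would lose
    return lines[:max_lines], lines[max_lines:]

kept, dropped = wrap_field_block(
    "Quercus robur subsp. pedunculiflora",
    block_width_dots=200, max_lines=1, avg_char_width_dots=10,
)
print(kept, dropped)
```

Running the label data through something like this before printing gives an early warning when a name is likely to need more lines than the ^FB command allows.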
For both commands:
"Using ^FT (Field Typeset) for your data takes the baseline origin of the last possible line of text, meaning that the field block will be filled from bottom to top."
"Using ^FO (Field Origin) means that the field block will be filled from top to bottom."
In reality, I have only been able to make the ^FB command work as expected, but that may be because ^TB is not implemented in the firmware I've worked with (ZPL II "compliant" Bluetooth printers).
You can test the following snippet for a 2x2 label in the Labelary Viewer:
^XA
~TA0
^MTD
^MNW
^MMT
^MFN
~SD15
^PR6
^PON
^PMN
^PW406
^LS0
^LRN
^LL406
^LT0
^LH0,0
^CI0
^XZ
^XA
^FO324,10,0^FB386,2,0,C,0^A0R,36,28.8^FH^FD"The King" Cupcake^FS
^FO278,10,0^FB386,1,0,C,0^A0R,28,22.4^FH^FDUse By 11/24/2015 02:45 PM^FS
^FO152,10,0^FB386,1,0,C,0^A0R,24,19.2^FH^FD11/24/2015 02:45 PM^FS
^FO62,140,0^FB250,1,0,R,0^A0R,24,19.2^FH^FDSL: 4 hours^FS
^FO38,10,0^FB386,1,0,L,0^A0R,18,14.4^FH^FDPREP DATE:^FS
^FO8,10,0^FB386,1,0,L,0^A0R,28,22.4^FH^FD11/24/2015 10:45 AM^FS
^FO62,10,0^FB50,1,0,L,0^A0R,24,19.2^FH^FDEMP:^FS
^FO92,10,0^FB376,3,0,J,0^A0R,18,14.4^FH^FDIngredients: 1 1/2 cups all-purpose flour, 1 teaspoon baking powder, 1/2 teaspoon salt, 8 tablespoons (1 stick) unsalted butter, room temperature, 1 cup sugar, 3 large eggs, 1 1/2 teaspoons pure vanilla extract, 3/4 cup milk.^FS
^PQ3,,,Y
^XZ
In particular, I've preceded the ^A0 and ^FD commands with ^FB. Using the viewer, you can quickly test the effects of switching between ^FT and ^FO in the ingredients line, the effects of changing the ^A0 font sizes, and the effects of changing the ^FB number of lines from, say, 3 to 2 (the viewer does not truncate text, by the way).
Of course there is no match for actually printing a label, for your ZPL II "compliant" printer may or may not truncate text according to its manufacturer and firmware version.
I hope that helps!