Can I tell iText how to clip text to fit in a cell - itext

When I call setFixedHeight() on a PdfPCell, and add more text than fits in the given height, iText seems to print the prefix of the string which fits.
Can I control this clipping algorithm? For example:
Print a suffix of the string rather than a prefix.
Mark a substring of the string as not to be removed. This is for footnote references. If I add text saying "Hello World [1]", the [1] is a reference to a footnote and should not be removed. It's okay to remove the other characters of the string, like "World".
When there are multiple words in the string, iText seems to eliminate a word that doesn't fit, while I would like it partially printed. That is, if the string is "Hello World", and the cell has room only for "Hello Wo...", I would like that to be printed, rather than just "Hello", as iText prints.
Rather than printing characters in their entirety, print only part of them. Imagine printing the text to a PNG and chopping off the top and/or bottom part of the PNG to fit it in the space available. For example, notice that the top line and the bottom line are partially clipped here:
Are any of these possible? Does iText give me any control over how text is clipped? Thanks.
This is with reference to iText 2.1.6.

I have written a proof of concept, ClipCenterCellContent, where we try to fit the text "D2 is a cell with more content than we can fit into the cell." in a cell that is too small.
Just like in your other question ( iText -- How do I get the rendered dimensions of text? ), we add the content using a cell event, but we now add it twice: once in simulation mode (to find out how much space is needed vertically) and once for real (using an offset).
This adds the content in simulation mode (we use the width of the cell and an arbitrary height):
PdfContentByte canvas = canvases[PdfPTable.TEXTCANVAS];
ColumnText ct = new ColumnText(canvas);
ct.setSimpleColumn(new Rectangle(0, 0, position.getWidth(), -1000));
ct.addElement(content);
ct.go(true);
float spaceneeded = 0 - ct.getYLine();
System.out.println(String.format("The content requires %s pt whereas the height is %s pt.", spaceneeded, position.getHeight()));
We now know the needed height and we can add the content for real using an offset:
float offset = (position.getHeight() - spaceneeded) / 2;
System.out.println(String.format("The difference is %s pt; we'll need an offset of %s pt.", -2f * offset, offset));
PdfTemplate tmp = canvas.createTemplate(position.getWidth(), position.getHeight());
ct = new ColumnText(tmp);
ct.setSimpleColumn(0, offset, position.getWidth(), offset + spaceneeded);
ct.addElement(content);
ct.go();
canvas.addTemplate(tmp, position.getLeft(), position.getBottom());
In this case, I used a PdfTemplate to clip the content.
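The offset arithmetic can be checked without iText. The sketch below is plain Java and only reproduces the calculation; the class name, method name and the numbers in main are assumptions chosen for illustration, not values from the proof of concept.

```java
// Hypothetical helper: reproduces only the arithmetic of the cell event above.
final class CenterClipOffset {
    /** Bottom of the column inside the template; negative when the content is taller than the cell. */
    static float offset(float cellHeight, float spaceNeeded) {
        return (cellHeight - spaceNeeded) / 2;
    }

    public static void main(String[] args) {
        float cellHeight = 30f;   // assumed cell height in pt
        float spaceNeeded = 50f;  // assumed result of the simulation pass
        float offset = offset(cellHeight, spaceNeeded);
        // The column spans [offset, offset + spaceNeeded]; the template only shows [0, cellHeight],
        // so the same amount is cut off at the top and at the bottom.
        System.out.println("column from " + offset + " to " + (offset + spaceNeeded));
        System.out.println("clipped at each end: " + (-offset) + " pt");
    }
}
```

When the content is shorter than the cell, the offset is positive and the same formula simply centers the content vertically.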
I also have answers to your other questions, but I don't have the time to answer them right now.

For straight Text box clipping, I adapted the C# code given here
http://itextsharp.10939.n7.nabble.com/Limiting-Text-Width-using-PdfContentByte-td2481.html
to the Java code below. Everything outside this rectangle is clipped away, and newPath() ends the path without painting it, so you can still draw a rectangle at exactly the same coordinates.
cb.saveState();
cb.rectangle(left,top,width,height);
cb.clip();
cb.newPath();
// perform clipped output here
cb.restoreState();
I used a try/finally to ensure restoreState() was called.
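The try/finally arrangement could look roughly like this. To keep the sketch runnable without iText on the classpath, StateCanvas is a hypothetical stand-in for the two PdfContentByte methods involved, and ClipGuard and RecordingCanvas are names invented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PdfContentByte.saveState()/restoreState().
interface StateCanvas {
    void saveState();
    void restoreState();
}

final class ClipGuard {
    /** Runs the drawing code between saveState() and restoreState(), restoring even if it throws. */
    static void withSavedState(StateCanvas cb, Runnable drawing) {
        cb.saveState();
        try {
            // With iText, cb.rectangle(...), cb.clip() and cb.newPath() would go here.
            drawing.run();
        } finally {
            cb.restoreState();
        }
    }

    /** Records the calls so their order can be inspected. */
    static final class RecordingCanvas implements StateCanvas {
        final List<String> calls = new ArrayList<>();
        public void saveState() { calls.add("save"); }
        public void restoreState() { calls.add("restore"); }
    }

    public static void main(String[] args) {
        RecordingCanvas cb = new RecordingCanvas();
        try {
            withSavedState(cb, () -> { throw new IllegalStateException("drawing failed"); });
        } catch (IllegalStateException expected) {
            // The exception still propagates, but the state was restored first.
        }
        System.out.println(cb.calls);
    }
}
```

Without the finally block, an exception in the clipped output would leave the clipping path active for everything drawn afterwards.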

Related

ITextSharp: Extract text without small spaces

I am trying to extract the headlines of some PDF files in order to sort them. Unfortunately there is a space between every pair of letters, with the spaces between words bigger than those between letters of the same word.
Here's my extraction method:
PdfReader reader = new PdfReader(filename);
Rectangle rect = new Rectangle(0, 0, 1000, 1000);
RenderFilter regionFilter = new RegionTextRenderFilter(rect);
FontRenderFilter fontFilter = new FontRenderFilter();
FilteredTextRenderListener strategy = new FilteredTextRenderListener(
new LocationTextExtractionStrategy(), regionFilter, fontFilter);
string result = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();
Is there a way to filter out the smaller spaces?
iText uses the distance between rendered glyphs to decide whether a space is present. The general rule is that if the distance is larger than half the width of a normal space, a space character is recognized. While this works quite well in most cases, it does not work at all if the width of a space character cannot be determined for the font used. In my case the width of a space was reported as 0, so even the smallest distance between glyphs was recognized as a space. I based my solution on another answer from mkl to a question that is very similar to yours.
In short: You need to derive from e.g. SimpleTextExtractionStrategy or LocationTextExtractionStrategy and override the methods that convert the distance between glyphs into spaces (renderText or isChunkAtWordBoundary respectively).
You can also refer to the answer I gave here or the original solution by mkl.
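The rule itself is easy to sketch outside iText. The code below is plain Java, not the iText implementation: the class, the method names and the fallbackSpaceWidth parameter are assumptions of this sketch, and the gaps in main are made-up numbers.

```java
final class SpaceHeuristic {
    /**
     * A gap counts as a space when it exceeds half the space width.
     * fallbackSpaceWidth is an assumption of this sketch, used when the font reports 0.
     */
    static boolean isWordGap(float gap, float spaceWidth, float fallbackSpaceWidth) {
        float effective = spaceWidth > 0 ? spaceWidth : fallbackSpaceWidth;
        return gap > effective / 2f;
    }

    /** Joins glyph strings, inserting a space wherever the gap before a glyph is a word gap. */
    static String join(String[] glyphs, float[] gapBefore, float spaceWidth, float fallback) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && isWordGap(gapBefore[i], spaceWidth, fallback)) {
                sb.append(' ');
            }
            sb.append(glyphs[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] glyphs = {"H", "i", "y", "o", "u"};
        float[] gaps = {0f, 0.5f, 3f, 0.5f, 0.5f}; // made-up gaps in text-space units
        // The font reports a space width of 0, so the fallback decides.
        System.out.println(join(glyphs, gaps, 0f, 2.5f));
    }
}
```

With a reported width of 0 and no usable fallback, the threshold collapses to 0 and every positive gap becomes a space, which is the symptom described in the question.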

Drawing text using PdfTextArray in iTextSharp - how?

I am drawing text in a PDF page using iTextSharp, and I have two requirements:
1) the text needs to be searchable by Adobe Reader and such
2) I need character-level control over where the text is drawn.
I can draw the text word-by-word using PdfContentByte.ShowText(), but I don't have control over where each character is drawn.
I can draw the text character-by-character using PdfContentByte.ShowText() but then it isn't searchable.
I'm now trying to create a PdfTextArray, which would seem to satisfy both of my requirements, but I'm having trouble calculating the correct offsets.
So my first question is: do you agree that PdfTextArray is what I need to do, in order to satisfy both of my original requirements?
If so, I have the PdfTextArray working correctly (in that it's outputting text) but I can't figure out how to accurately calculate the positioning offset that needs to get put between each pair of characters (right now I'm just using the fixed value -200 just to prove that the function works).
I believe the positioning offset is the distance from the right edge of the previous character to the left edge of the new character, expressed in "thousandths of a unit of text space". That leaves me two problems:
1) How wide is the previous character (in points), as drawn in the specified font & height? (I know where its left edge is, since I drew it there)
2) How do I convert from points to "units of text space"?
I'm not doing any fancy scaling or rotating, so my transformation matrices should all be identity matrices, which should simplify the calculations ...
Thanks,
Chris
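For the two conversions asked about, a minimal sketch of the arithmetic, assuming identity matrices, 100% horizontal scaling, no character or word spacing, and font widths given in thousandths of a text-space unit. The class and method names are hypothetical; the sign flip reflects that the numbers in a text array are subtracted from the current position.

```java
final class TextArrayOffsets {
    /** Width in points of a glyph whose font width is given in 1/1000 text-space units. */
    static float glyphWidthPoints(float width1000, float fontSize) {
        return width1000 * fontSize / 1000f;
    }

    /** Text-array number that moves the next glyph extraPoints to the right (positive numbers move left). */
    static float tjAdjustment(float extraPoints, float fontSize) {
        return -extraPoints * 1000f / fontSize;
    }

    /** Adjustment needed so the next glyph starts at nextX, given where the previous glyph started. */
    static float adjustmentFor(float prevX, float prevWidth1000, float nextX, float fontSize) {
        float naturalNextX = prevX + glyphWidthPoints(prevWidth1000, fontSize);
        return tjAdjustment(nextX - naturalNextX, fontSize);
    }

    public static void main(String[] args) {
        // Assumed values: 12 pt font, previous glyph 500/1000 wide and drawn at x = 100,
        // next glyph wanted at x = 109.
        System.out.println(adjustmentFor(100f, 500f, 109f, 12f));
    }
}
```

The adjustment is applied on top of the previous glyph's own advance, so a value of 0 leaves the glyphs at their natural spacing.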

get the exact position of text from image in tesseract

Using the GetHOCRText(0) method in tesseract I'm able to retrieve the text as HTML, and on presenting the HTML in a web view I'm able to get the text, but the position of the text in the image is different from the output. Any idea would be very helpful.
tesseract->SetInputName("word");
tesseract->SetOutputName("xyz");
tesseract->Recognize(NULL);
char *utf8Text=tesseract->GetHOCRText(0);
and output image
If you have the hOCR output, you should have a tag for each word. These tags should have class="ocrx_word" and title="bbox x1 y1 x2 y2", where the x and y values are the top-left and bottom-right corners of the bounding box around the word. I don't think it's possible to automatically use this information to format a text document, since that would require translating pixel differences to a number of tabs or spaces. But you should be able to render the text at the given location.
The GetBoxText() method returns the exact position of each character.
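Pulling the four numbers out of such a tag is a small regex job. The sketch below is plain Java with a made-up sample tag; the class name HocrBBox is hypothetical, and the pattern only looks for the "bbox x1 y1 x2 y2" text rather than parsing the HTML properly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HocrBBox {
    // Matches "bbox" followed by four non-negative integers.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Returns {x1, y1, x2, y2} for the first bbox found in the fragment, or null if there is none. */
    static int[] parse(String fragment) {
        Matcher m = BBOX.matcher(fragment);
        if (!m.find()) {
            return null;
        }
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Integer.parseInt(m.group(i + 1));
        }
        return box;
    }

    public static void main(String[] args) {
        // Made-up sample word tag, for illustration only.
        String tag = "<span class='ocrx_word' title='bbox 36 92 96 116'>Hello</span>";
        int[] box = parse(tag);
        System.out.println("x=" + box[0] + " y=" + box[1]
                + " w=" + (box[2] - box[0]) + " h=" + (box[3] - box[1]));
    }
}
```

The coordinates are in image pixels, so they need to be scaled if the image is displayed at a different size.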
char *boxtext = _tesseract->GetBoxText(0);
NSString* aBoxText = [NSString stringWithUTF8String:boxtext];

Core Text - select text in iPhone?

I need to render rich text using Core Text in my view (simple formatting, multiple fonts in one line of texts, etc.). I am wondering if text rendered this way can be selected by user using (standard copy / paste function)?
I implemented text selection in Core Text. It is really hard work, but it's doable.
Basically you have to save all CTLine rects and origins using CTFrameGetLineOrigins(1), CTLineGetTypographicBounds(2), CTLineGetStringRange(3) and CTLineGetOffsetForStringIndex(4).
The line rect can be calculated using the origin(1), ascent(2), descent(2) and offset(3)(4) as shown below.
lineRect = CGRectMake(origin.x + offset,
origin.y - descent,
offset,
ascent + descent);
After doing that, you can test which line has the touched point looping the lines (always remember that CoreText uses inverse Y coordinates).
Knowing the line that has the touched point, you can know the letter that is located at that point (or the nearest letter) using CTLineGetStringIndexForPosition.
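The line lookup can be sketched independently of Core Text. The Java below is a simplified illustration, not the original Objective-C: it uses the line width rather than the offset from CTLineGetOffsetForStringIndex, assumes the touch point arrives in top-left coordinates, and all class and field names are hypothetical.

```java
final class LineHitTest {
    /** Line box in Core Text style coordinates: origin on the baseline, Y growing upward. */
    static final class LineBox {
        final double originX, originY, width, ascent, descent;

        LineBox(double originX, double originY, double width, double ascent, double descent) {
            this.originX = originX;
            this.originY = originY;
            this.width = width;
            this.ascent = ascent;
            this.descent = descent;
        }

        boolean contains(double x, double y) {
            return x >= originX && x <= originX + width
                    && y >= originY - descent && y <= originY + ascent;
        }
    }

    /** touchY is measured from the top of the view; it is flipped before testing. Returns -1 on a miss. */
    static int lineAt(LineBox[] lines, double touchX, double touchY, double viewHeight) {
        double flippedY = viewHeight - touchY;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains(touchX, flippedY)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        LineBox[] lines = {
            new LineBox(0, 80, 200, 12, 4), // first line, nearest the top of a 100 pt view
            new LineBox(0, 60, 200, 12, 4)
        };
        System.out.println(lineAt(lines, 10, 35, 100));
    }
}
```

Once the line index is known, the character index within it comes from CTLineGetStringIndexForPosition, as described above.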
Here's one screenshot.
For that loupe, I used the code shown in this post.
Edit:
To draw the blue background selection, you have to paint the rect using CGContextFillRect. Unfortunately, there's no background color in NSAttributedString.

PDF calculate Glyph sizes

I think I have every value needed for text rendering in a PDF.
* Position (Text Matrix)
* FontDescriptor with Widths Array
* FontBBox
* StemV/StemH
* FontName
* Descent
* Ascent
* CapHeight
* XHeight
* ItalicAngle
My problem is that I don't know what to do with these values. I went through the PDF Spec 1.7 a couple of times and cannot find a formula to calculate the real pixel sizes of every glyph in a PDF. Can you give me a hint?
Thank you.
What are you trying to do? Rendering PDF is a lot of work and you also need to factor in leading, Text raise, kerning, CTM and several other factors.
Position: (optional, you can avoid it)
Text Matrix: (optional, you can avoid it)
Widths Array: (use empty array [], PDF can read it directly from CFF (FontFile3 stream))
FontBBox: font file->'CFF ' table->Top DICT INDEX->DICT-> 4 operands for 'FontBBox' operator
StemV: (optional, you can avoid it)
StemH: (optional, you can avoid it)
FontName: font file->'name' table->records
or: font file->'CFF ' table->Top DICT INDEX->string by index 0 for 'fonts names' operator
Descent: font file->'hhea' table->'Descender' parameter
Ascent: font file->'hhea' table->'Ascender' parameter
CapHeight: font file->'OS/2' table->'sCapHeight' parameter
XHeight: font file->'OS/2' table->'sxHeight' parameter
ItalicAngle: font file->'post' table->'italicAngle' parameter
Actually, you can calculate Widths array. For each glyph:
Decoding array(PDF) -> Glyph name (PDF) -> Glyph index (CFF table of font file) -> table 'hmtx' -> Glyph 'hMetrics'[Glyph index] = array ('advanceWidth', 'leftSideBearing')
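The last step of that chain still needs a unit conversion, since advance widths are in font design units while PDF widths are in thousandths of a text-space unit. A minimal sketch, assuming the font file has a 'head' table supplying unitsPerEm; the class name and the sample numbers are made up.

```java
import java.util.Arrays;

final class PdfWidths {
    /** Converts 'hmtx' advance widths (font design units) to PDF widths (1/1000 of a text-space unit). */
    static int[] toPdfWidths(int[] advanceWidths, int unitsPerEm) {
        int[] out = new int[advanceWidths.length];
        for (int i = 0; i < advanceWidths.length; i++) {
            out[i] = Math.round(advanceWidths[i] * 1000f / unitsPerEm);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed values: a font with 2048 units per em and three glyph advances.
        int[] advances = {1024, 1139, 569};
        System.out.println(Arrays.toString(toPdfWidths(advances, 2048)));
    }
}
```

For a font with 1000 units per em the conversion is a no-op, which is why the scaling is easy to overlook.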
I spent a WEEK trying to understand it...
If you just want to highlight text, it's not necessary to calculate the text. You can add as many content objects to the page as you want (rectangle, image, line, semi-transparent stuff) and re-calculate the PDF structure. It is really simple. Ask your mouse for the selection coordinates.
These values are designed to properly typeset type, not draw glyphs, so you can't get the exact pixel size of each glyph from these attributes. The only way to get the exact pixel dimensions of a glyph is to draw the glyph into an image and analyze that image.
The FontBBox (font bounding box) is the smallest box that will hold each glyph. The Widths Array holds information on how far apart each character should be drawn, not the actual glyph image size. Some fonts will draw some glyphs outside that width.
When you highlight text in a typical text editor, the highlight will be the full height of the font, and the width of each individual character. This highlight is made by getting the FontBBox height, and each character's width from the Widths Array, and transforming those values to match the current font's attributes (size, etc.). This information is sufficient to make your app draw type like typical applications.
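As a rough sketch of that transformation, assuming an identity text matrix, widths and FontBBox values in thousandths of a text-space unit, and no horizontal scaling. The class name and the sample metrics are hypothetical.

```java
import java.util.Arrays;

final class HighlightBox {
    /** Returns {x, y, width, height} in points for one character drawn at (x, baseline). */
    static float[] rect(float x, float baseline, float charWidth1000,
                        float bboxBottom1000, float bboxTop1000, float fontSize) {
        float scale = fontSize / 1000f; // glyph-space units to points
        return new float[] {
            x,
            baseline + bboxBottom1000 * scale,          // bottom edge sits below the baseline
            charWidth1000 * scale,                      // width from the Widths array
            (bboxTop1000 - bboxBottom1000) * scale      // full FontBBox height
        };
    }

    public static void main(String[] args) {
        // Assumed metrics: width 556, FontBBox from -200 to 900, 10 pt font at (72, 700).
        System.out.println(Arrays.toString(rect(72f, 700f, 556f, -200f, 900f, 10f)));
    }
}
```

To highlight a run of text, the rectangles for consecutive characters can be merged by summing their widths.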