ITextSharp: Extract text without small spaces

ITextSharp: Extract text without small spaces - itext

I am trying to extract the headlines of some pdf files to sort them. Unfortunately there's a space between every letters with the spaces between words bigger than the ones between letters of the same word.
Here's my extraction method:
PdfReader reader = new PdfReader(filename);
Rectangle rect = new Rectangle(0, 0, 1000, 1000);
RenderFilter regionFilter = new RegionTextRenderFilter(rect);
FontRenderFilter fontFilter = new FontRenderFilter();
FilteredTextRenderListener strategy = new FilteredTextRenderListener(
new LocationTextExtractionStrategy(), regionFilter, fontFilter);
string result = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();
Is there a way to filter out the smaller spaces?

iText uses the distance of the rendered glyphs as base to decide if a space is present or not. The general rule applied is, if the distance is larger than the width of a normal space, devided by 2, than a space character is recognized. While this works quite well in most cases, it doesn't work at all, if the width of a space character could not be determined for the font used. In my case the width of a space was recognized as 0, thus the smallest distance between glyphs was recognized as a space. I based my solution on another answer from mkl to a question that is very similar to yours.
In short: You need to derive from e.g. SimpleTextExtractionStrategy or LocationTextExtractionStrategy and override the methods that convert the distance between glyphs into spaces (renderText or isChunkAtWordBoundary respectively).
You can also refer to the answer I gave here or the original solution by mkl.

Related

How to easily control spacing height between two paragraphs?

Currently, I am using
document.add( Chunk.NEWLINE );
after each paragraph to generate space between two paragraphs. What is the way to generate a spacing of any height I specify?

The space between two lines of the same Paragraph is called the leading. See Changing text line spacing
If you want to introduce extra spacing before or after a Paragraph, you can use the setSpacingBefore() or setSpacingAfter() method. See itext spacingBefore property applied to Paragraph causes new page
For instance:
Paragraph paragraph1 = new Paragraph("First paragraph");
paragraph1.setSpacingAfter(72f);
document.add(paragraph1);
Paragraph paragraph2 = new Paragraph("Second paragraph");
document.add(paragraph2);
This puts 72 user units of extra white space between paragraph1 and paragraph2. One user unit corresponds with one point, so by choosing 72, we've added an inch of white space.

iText, Reduce a mono-spaced font's pitch

Trying to reduce a mono-spaced(FontFactory.COURIER) font's pitch i.e reduce the spacing between the letters. Could not locate the method in the API.
Thanks in advance.

Use Chunk#setCharacterSp‌acing or PdfContentByte#setCh‌aracterSpacing, depending on your context.
Negative charSpacing values will decrease spacing between letters, positive ones will increase it.
Example from Bruno's book (slightly modified):
Chunk chunk = new Chunk(text, font1);
// reduce spacing
chunk.setCharacterSpacing(-0.5f);
document.add(new Paragraph(chunk));
// usual spacing
chunk = new Chunk(text, font1);
document.add(new Paragraph(chunk));
What you will get will look similar to this:
As you see, the first line has reduced char spacing, whereas the second one has regular spacing.

Can I tell iText how to clip text to fit in a cell

When I call setFixedHeight() on a PdfPCell, and add more text than fits in the given height, iText seems to print the prefix of the string which fits.
Can I control this clipping algorithm? For example:
Print a suffix of the string rather than a prefix.
Mark a substring of the string as not to be removed. This is with footnote references. If I add text saying "Hello World [1]", the [1] is a reference to a footnote and should not be removed. It's okay to remove the other characters of the string, like "World".
When there are multiple words in the string, iText seems to eliminate a word that doesn't fit, while I would like it partially printed. That is, if the string is "Hello World", and the cell has room only for "Hello Wo...", I would like that to be printed, rather than just "Hello", as iText prints.
Rather than printing characters in their entirety, print only part of them. Imagine printing the text to a PNG and chopping off the top and/or bottom part of the PNG to fit it in the space available. For example, notice that the top line and the bottom line are partially clipped here:
Are any of these possible? Does iText give me any control over how text is clipped? Thanks.
This is with reference to iText 2.1.6.

I have written a proof of concept, ClipCenterCellContent, where we try to fit the text "D2 is a cell with more content than we can fit into the cell." in a cell that is too small.
Just like in your other question ( iText -- How do I get the rendered dimensions of text? ), we add the content using a cell event, but we now add it twice: once in simulation mode (to find out how much space is needed vertically) and once for real (using an offset).
This adds the content in simulation mode (we use the width of the cell and an arbitrary height):
PdfContentByte canvas = canvases[PdfPTable.TEXTCANVAS];
ColumnText ct = new ColumnText(canvas);
ct.setSimpleColumn(new Rectangle(0, 0, position.getWidth(), -1000));
ct.addElement(content);
ct.go(true);
float spaceneeded = 0 - ct.getYLine();
System.out.println(String.format("The content requires %s pt whereas the height is %s pt.", spaceneeded, position.getHeight()));
We now know the needed height and we can add the content for real using an offset:
float offset = (position.getHeight() - spaceneeded) / 2;
System.out.println(String.format("The difference is %s pt; we'll need an offset of %s pt.", -2f * offset, offset));
PdfTemplate tmp = canvas.createTemplate(position.getWidth(), position.getHeight());
ct = new ColumnText(tmp);
ct.setSimpleColumn(0, offset, position.getWidth(), offset + spaceneeded);
ct.addElement(content);
ct.go();
canvas.addTemplate(tmp, position.getLeft(), position.getBottom());
In this case, I used a PdfTemplate to clip the content.
I also have answers to your other questions, but I don't have the time to answer them right now.

For straight Text box clipping, I adapted the C# code given here
http://itextsharp.10939.n7.nabble.com/Limiting-Text-Width-using-PdfContentByte-td2481.html
to the Java code below. The clipping area ends up outside this rectangle, so you can still draw a rectangle on the same exact coordinates.
cb.saveState();
cb.rectangle(left,top,width,height);
cb.clip();
cb.newPath();
// perform clipped output here
cb.restoreState();
I used a try/finally to ensure restoreState() was called.

Drawing text using PdfTextArray in iTextSharp - how?

I am drawing text in a PDF page using iTextSharp, and I have two requirements:
1) the text needs to be searchable by Adobe Reader and such
2) I need character-level control over where the text is drawn.
I can draw the text word-by-word using PdfContentByte.ShowText(), but I don't have control over where each character is drawn.
I can draw the text character-by-character using PdfContentByte.ShowText() but then it isn't searchable.
I'm now trying to create a PdfTextArray, which would seem to satisfy both of my requirements, but I'm having trouble calculating the correct offsets.
So my first question is: do you agree that PdfTextArray is what I need to do, in order to satisfy both of my original requirements?
If so, I have the PdfTextArray working correctly (in that it's outputting text) but I can't figure out how to accurately calculate the positioning offset that needs to get put between each pair of characters (right now I'm just using the fixed value -200 just to prove that the function works).
I believe the positioning offset is the distance from the right edge of the previous character to the left edge of the new character, expressed in "thousandths of a unit of text space". That leaves me two problems:
1) How wide is the previous character (in points), as drawn in the specified font & height? (I know where its left edge is, since I drew it there)
2) How do I convert from points to "units of text space"?
I'm not doing any fancy scaling or rotating, so my transformation matrices should all be identity matrices, which should simplify the calculations ...
Thanks,
Chris

FreeType2: Get global font bounding box in pixels?

I'm using FreeType2 for font rendering, and I need to get a global bounding box for all fonts, so I can align them in a nice grid. I call FT_Set_Char_Size followed by extracting the global bounds using
int pixels_x = ::FT_MulFix((face->bbox.xMax - face->bbox.xMin), face->size->metrics.x_scale );
int pixels_y = ::FT_MulFix((face->bbox.yMax - face->bbOx.yMin), face->size->metrics.y_scale );
return Size (pixels_x / 64, pixels_y / 64);
which works, but it's quite a bit too large. I also tried to compute using doubles (as described in the FreeType2 tutorial), but the results are practically the same. Even using just face->bbox.xMax results in bounding boxes which are too wide. Am I doing the right thing, or is there simply some huge glyph in my font (Arial.ttf in this case?) Any way to check which glyph is supposedly that big?

Why not calculate the min/max from the characters that you are using in the string that you want to align? Just loop through the characters and store the maximum and minimum from the characters that you are using. You can store these values after you rendered them so you don't need to look it up every time you render the glyphs.

I have a similar problem using freetype to render a bunch of text elements that will appear in a grid. Not all of the text elements are the same size, and I need to prerender them before I know where they would be laid out. The different sizes were the biggest problem when the heights changed, such as for letters with descending portions (like "j" or "Q").
I ended up using the height that is on the face (kind of like you did with the bbox). But like you mentioned, that value was much to big. It's supposed to be the baseline to baseline distance, but it appeared to be about twice that distance. So, I took the easy way out and divided the reported height by 2 and used that as a general height value. Most likely, the height is too big because there are some characters in the font that go way high or way low.
I suppose a better way might be to loop through all the characters expected to be used, get their glyph metrics and store the largest height found. But that doesn't seem all that robust either.

Your code is right.
It's not too large.
Because there are so many special symbols that is vary large than ascii charater. . view special big symbol
it's easy to traverse all unicode charcode, to find those large symbol.
if you only need ascii, my hack method is
FT_MulFix(face_->units_per_EM, face_->size->metrics.x_scale ) >> 6
FT_MulFix(face_->units_per_EM, face_->size->metrics.y_scale ) >> 6

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse