Remove unused image objects - itext

I have PDF files that are being created with a composition tool to produce financial statements.
The PDF files are around 5000 to 10000 pages each and use global image resources to maximise space efficiency.
These statements include many marketing images (about 3 MB worth), and not every statement uses all of them.
When I extract a single page (for example, a blank page at the start of the PDF file) using a tool that has been developed for this purpose, or Adobe Acrobat just for testing, the resulting extracted PDF is around 3 MB. Auditing the space usage shows that it consists of 3 MB of images.
Using iTextSharp (latest 5.4.4) I have attempted to iterate through each page and copy to a writer calling reader.RemoveUnusedObjects. But this does not reduce the size.
I also found another example to use a pdfstamper and tried the same thing. Same result.
I've also tried setting maximum compression and SetFullCompression. Neither made any difference.
Can anyone give me any pointers on what I might do? I'm hoping I can do it as a simple exercise and not have to parse the objects in the PDF file and manually remove the unused ones.
Code Below:
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(inputFile);
iTextSharp.text.Document document = new iTextSharp.text.Document(reader.GetPageSizeWithRotation(1));
// step 2: we create a writer that listens to the document
// step 3: we open the document
iTextSharp.text.pdf.PdfCopy pdfCpy = new iTextSharp.text.pdf.PdfCopy(document, new System.IO.FileStream(outputFile, System.IO.FileMode.Create));
document.Open();
iTextSharp.text.pdf.PdfContentByte cb = pdfCpy.DirectContent;
//pdfCpy.NewPage();
int objects = reader.RemoveUnusedObjects();
reader.RemoveFields();
reader.RemoveAnnotations();
// we retrieve the total number of pages
int numberofPages = reader.NumberOfPages;
int i = 0;
while (i < numberofPages)
{
i++;
document.SetPageSize(reader.GetPageSizeWithRotation(i));
document.NewPage();
iTextSharp.text.pdf.PdfImportedPage page = pdfCpy.GetImportedPage(reader, i);
pdfCpy.SetFullCompression();
reader.RemoveUnusedObjects();
reader.RemoveFields();
reader.RemoveAnnotations();
int rotation = reader.GetPageRotation(i);
if (rotation == 90 || rotation == 270)
{
cb.AddTemplate(page, 0, -1f, 1f, 0, 0, reader.GetPageSizeWithRotation(i).Height);
}
else
{
cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
pdfCpy.AddPage(page);
}
pdfCpy.NewPage();
pdfCpy.Add(new iTextSharp.text.Paragraph("This is added text"));
document.Close();
pdfCpy.CompressionLevel = iTextSharp.text.pdf.PdfStream.BEST_COMPRESSION;
pdfCpy.Close();
reader.Close();
Stamper example:
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(inputFile);
using (FileStream fs = new FileStream(outputFile + ".2" , FileMode.Create))
{
iTextSharp.text.pdf.PdfStamper stamper = new iTextSharp.text.pdf.PdfStamper(reader, fs, iTextSharp.text.pdf.PdfWriter.VERSION_1_5);
iTextSharp.text.pdf.PdfWriter writer = stamper.Writer;
writer.SetPdfVersion(iTextSharp.text.pdf.PdfWriter.PDF_VERSION_1_5);
writer.CompressionLevel = iTextSharp.text.pdf.PdfStream.BEST_COMPRESSION;
reader.RemoveFields();
reader.RemoveUnusedObjects();
stamper.Reader.RemoveUnusedObjects();
stamper.SetFullCompression();
stamper.Writer.SetFullCompression();
stamper.Close();
}
reader.Close();

Try using iTextSharp.text.pdf.PdfSmartCopy instead of PdfCopy.
For me it reduced a ~43 MB PDF to ~4 MB.
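For context, my understanding is that PdfSmartCopy saves space by noticing byte-identical streams (for example the same image attached to many pages) and writing them only once. The sketch below is plain Java, not iText code; the class StreamDeduper and its methods are hypothetical names made up for illustration, and SHA-256 is just an assumed choice of hash. It only shows the general idea of keying stored streams by a content hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of content-hash deduplication; not part of iText.
public class StreamDeduper {
    private final Map<String, Integer> idByHash = new HashMap<>();
    private final List<byte[]> stored = new ArrayList<>();

    /** Returns the id of the stored copy; the bytes are added only if the content is new. */
    public int store(byte[] stream) {
        String key = hash(stream);
        Integer existing = idByHash.get(key);
        if (existing != null) {
            return existing; // same content seen before: reuse it
        }
        stored.add(stream.clone());
        int id = stored.size() - 1;
        idByHash.put(key, id);
        return id;
    }

    /** Number of distinct streams actually kept. */
    public int storedCount() {
        return stored.size();
    }

    private static String hash(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(md.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StreamDeduper deduper = new StreamDeduper();
        byte[] logo = "fake image bytes".getBytes(StandardCharsets.UTF_8);
        int first = deduper.store(logo);
        int second = deduper.store(logo.clone()); // identical content, separate array
        int other = deduper.store("another image".getBytes(StandardCharsets.UTF_8));
        System.out.println(first + " " + second + " " + other
                + " stored=" + deduper.storedCount());
    }
}
```

Note that this kind of deduplication only removes repeated copies; whether it also drops images that a page's resources reference but never draw depends on the file.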

Related

iTextSharp IExtRenderListener and boundingbox [duplicate]

I have a PDF which consists of some data, followed by some whitespace. I don't know how large the data is, but I'd like to trim off the whitespace following the data.
PdfReader reader = new PdfReader(PDFLOCATION);
Rectangle rect = new Rectangle(700, 2000);
Document document = new Document(rect);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(SAVELCATION));
document.open();
int n = reader.getNumberOfPages();
PdfImportedPage page;
for (int i = 1; i <= n; i++) {
document.newPage();
page = writer.getImportedPage(reader, i);
Image instance = Image.getInstance(page);
document.add(instance);
}
document.close();
Is there a way to clip/trim the whitespace for each page in the new document?
This PDF contains vector graphics.
I'm using iTextPDF, but can switch to any Java library (mavenized, Apache license preferred).
As no actual solution has been posted, here are some pointers from the accompanying itext-questions mailing list thread:
As you want to merely trim pages, this is not a case of PdfWriter + getImportedPage usage but instead of PdfStamper usage. Your main code using a PdfStamper might look like this:
PdfReader reader = new PdfReader(resourceStream);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("target/test-outputs/test-trimmed-stamper.pdf"));
// Go through all pages
int n = reader.getNumberOfPages();
for (int i = 1; i <= n; i++)
{
Rectangle pageSize = reader.getPageSize(i);
Rectangle rect = getOutputPageSize(pageSize, reader, i);
PdfDictionary page = reader.getPageN(i);
page.put(PdfName.CROPBOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
stamper.markUsed(page);
}
stamper.close();
As you see I also added another argument to your getOutputPageSize method to-be. It is the page number. The amount of white space to trim might differ on different pages after all.
If the source document did not contain vector graphics, you could simply use the iText parser package classes. There even already is a TextMarginFinder based on them. In this case the getOutputPageSize method (with the additional page parameter) could look like this:
private Rectangle getOutputPageSize(Rectangle pageSize, PdfReader reader, int page) throws IOException
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextMarginFinder finder = parser.processContent(page, new TextMarginFinder());
Rectangle result = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
System.out.printf("Text/bitmap boundary: %f,%f to %f, %f\n", finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());
return result;
}
Using this method with your file test.pdf, the code trims the page according to the text (and bitmap image) content on it.
To find the bounding box respecting vector graphics, too, you essentially have to do the same but you have to extend the parser framework used here to inform its listeners (the TextMarginFinder essentially is a listener to drawing events sent from the parser framework) about vector graphics operations, too. This is non-trivial, especially if you don't know PDF syntax by heart yet.
If your PDFs to trim are not too generic but can be forced to include some text or bitmap graphics in relevant positions, though, you could use the sample code above (probably with minor changes) anyways.
E.g. if your PDFs always start with text on top and end with text at the bottom, you could change getOutputPageSize to create the result rectangle like this:
Rectangle result = new Rectangle(pageSize.getLeft(), finder.getLly(), pageSize.getRight(), finder.getUry());
This only trims the empty space at the top and bottom.
Depending on your input data pool and requirements this might suffice.
Or you can use some other heuristics depending on your knowledge on the input data. If you know something about the positioning of text (e.g. the heading to always be centered and some other text to always start at the left), you can easily extend the TextMarginFinder to take advantage of this knowledge.
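The two trimming strategies described above (trim to the content on all sides, or only trim top and bottom) can be sketched without iText as plain rectangle arithmetic. This is a minimal sketch: the Box class and both method names are hypothetical and only stand in for iText's Rectangle and the getOutputPageSize method, and clamping the result to the page is my own assumption, not something the original code does.

```java
// Hypothetical stand-alone sketch; Box stands in for iText's Rectangle.
public class CropBoxSketch {

    /** Immutable rectangle in PDF user space: lower-left and upper-right corners. */
    public static final class Box {
        public final float llx, lly, urx, ury;

        public Box(float llx, float lly, float urx, float ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        @Override
        public String toString() {
            return "[" + llx + ", " + lly + ", " + urx + ", " + ury + "]";
        }
    }

    /** Trim all four sides to the content bounds, never growing beyond the page (assumed clamp). */
    public static Box trimAllSides(Box page, Box content) {
        return new Box(
                Math.max(page.llx, content.llx),
                Math.max(page.lly, content.lly),
                Math.min(page.urx, content.urx),
                Math.min(page.ury, content.ury));
    }

    /** Keep the full page width; only trim empty space above and below the content. */
    public static Box trimTopAndBottom(Box page, Box content) {
        return new Box(
                page.llx,
                Math.max(page.lly, content.lly),
                page.urx,
                Math.min(page.ury, content.ury));
    }

    public static void main(String[] args) {
        Box page = new Box(0, 0, 700, 2000);        // page size used in the question
        Box content = new Box(50, 1500, 650, 1950); // made-up content bounds
        System.out.println("all sides:      " + trimAllSides(page, content));
        System.out.println("top and bottom: " + trimTopAndBottom(page, content));
    }
}
```

In the stamper loop above, the resulting four numbers would be what goes into the CropBox array.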
Recent (April 2015, iText 5.5.6-SNAPSHOT) improvements
The current development version, 5.5.6-SNAPSHOT, extends the parser package to also include vector graphics parsing. This allows for an extension of iText's original TextMarginFinder class implementing the new ExtRenderListener methods like this:
@Override
public void modifyPath(PathConstructionRenderInfo renderInfo)
{
List<Vector> points = new ArrayList<Vector>();
if (renderInfo.getOperation() == PathConstructionRenderInfo.RECT)
{
float x = renderInfo.getSegmentData().get(0);
float y = renderInfo.getSegmentData().get(1);
float w = renderInfo.getSegmentData().get(2);
float h = renderInfo.getSegmentData().get(3);
points.add(new Vector(x, y, 1));
points.add(new Vector(x+w, y, 1));
points.add(new Vector(x, y+h, 1));
points.add(new Vector(x+w, y+h, 1));
}
else if (renderInfo.getSegmentData() != null)
{
for (int i = 0; i < renderInfo.getSegmentData().size()-1; i+=2)
{
points.add(new Vector(renderInfo.getSegmentData().get(i), renderInfo.getSegmentData().get(i+1), 1));
}
}
for (Vector point: points)
{
point = point.cross(renderInfo.getCtm());
Rectangle2D.Float pointRectangle = new Rectangle2D.Float(point.get(Vector.I1), point.get(Vector.I2), 0, 0);
if (currentPathRectangle == null)
currentPathRectangle = pointRectangle;
else
currentPathRectangle.add(pointRectangle);
}
}
@Override
public Path renderPath(PathPaintingRenderInfo renderInfo)
{
if (renderInfo.getOperation() != PathPaintingRenderInfo.NO_OP)
{
if (textRectangle == null)
textRectangle = currentPathRectangle;
else
textRectangle.add(currentPathRectangle);
}
currentPathRectangle = null;
return null;
}
@Override
public void clipPath(int rule)
{
}
(Full source: MarginFinder.java)
Using this class to trim the white space gives a result which is pretty much what one would hope for.
Beware: The implementation above is far from optimal. It is not even correct as it includes all curve control points which is too much. Furthermore it ignores stuff like line width or wedge types. It actually merely is a proof-of-concept.
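To make the core step of modifyPath easier to follow in isolation, here is a minimal sketch of the same idea using only java.awt.geom instead of the iText Vector and render-info classes: map each path point through the current transformation matrix and grow a bounding rectangle. The class and method names (PathBoundsSketch, include, rectBounds) are hypothetical, and like the code above it ignores line width and treats every point it is given as part of the outline.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical stand-alone sketch; not part of iText.
public class PathBoundsSketch {

    /** Grows bounds (which may be null) to include the point (x, y) after applying ctm. */
    static Rectangle2D.Double include(Rectangle2D.Double bounds, AffineTransform ctm,
                                      double x, double y) {
        Point2D p = ctm.transform(new Point2D.Double(x, y), null);
        if (bounds == null) {
            return new Rectangle2D.Double(p.getX(), p.getY(), 0, 0);
        }
        bounds.add(p);
        return bounds;
    }

    /** Bounds of a rectangle operand (x, y, w, h), as in the RECT branch: transform all four corners. */
    static Rectangle2D.Double rectBounds(AffineTransform ctm,
                                         double x, double y, double w, double h) {
        Rectangle2D.Double bounds = null;
        bounds = include(bounds, ctm, x, y);
        bounds = include(bounds, ctm, x + w, y);
        bounds = include(bounds, ctm, x, y + h);
        bounds = include(bounds, ctm, x + w, y + h);
        return bounds;
    }

    public static void main(String[] args) {
        // Scale by 2 and translate by (10, 20): x' = 2x + 10, y' = 2y + 20.
        AffineTransform ctm = new AffineTransform(2, 0, 0, 2, 10, 20);
        System.out.println(rectBounds(ctm, 0, 0, 5, 5));
    }
}
```

Transforming all four corners rather than just two matters once the matrix contains a rotation, because the transformed rectangle's extremes can then come from any corner.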
All test code is in TestTrimPdfPage.java.

Issues converting certain TIF compressions to PDF using iTextSharp

I am using iTextSharp to convert and stitch single-page TIF files into a multi-page PDF file. The single-page TIF files have different bit depths and compressions.
Here is the code:
private void button1_Click(object sender, EventArgs e)
{
List<string> TIFfiles = new List<string>();
Document document;
PdfWriter pdfwriter;
Bitmap tifFile;
pdfFilename = <file path>.PDF;
TIFfiles = <load the path to each TIF file in this array>;
//Create document
document = new Document();
// creation of the different writers
pdfwriter = PdfWriter.GetInstance(document, new System.IO.FileStream(pdfFilename, FileMode.Create));
document.Open();
document.SetMargins(0, 0, 0, 0);
foreach (string file in TIFfiles)
{
//load the tiff image
tifFile = new Bitmap(file);
//Total number of pages
iTextSharp.text.Rectangle pgSize = new iTextSharp.text.Rectangle(tifFile.Width, tifFile.Height);
document.SetPageSize(pgSize);
document.NewPage();
PdfContentByte cb = pdfwriter.DirectContent;
tifFile.SelectActiveFrame(FrameDimension.Page, 0);
iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(tifFile, ImageFormat.Tiff);
// scale the image to fit in the page
img.SetAbsolutePosition(0, 0);
cb.AddImage(img);
}
document.Close();
}
This code works well and stitches and converts the TIFs to PDF. The issue is the processing time and the size of the PDF file it creates when processing certain types of TIFs.
For example:
Original TIF --> B&W/Bit depth 1/Compression CCITT T.6 --> Faster processing, PDF file size is ~1.1x the TIF file size.
Original TIF --> Color/Bit depth 8/Compression LZW --> Faster processing, PDF file size is ~1.1x the TIF file size.
Original TIF --> Color/Bit depth 24/Compression JPEG --> Slow processing, PDF file size is ~12.5x the TIF file size.
Why doesn't converting Color/Bit depth 24/Compression JPEG files give a similar result to the other TIF files?
Moreover, this issue is only with iTextSharp. I had a colleague test the same set of TIF samples using Java-iText and the resulting PDF was of smaller size (1.1x times) and had faster processing.
Unfortunately, I need to use .Net for this TIF to PDF conversion, so am stuck with using iTextSharp.
Any ideas/suggestions on how to get those Compression JPEG TIF files to create smaller size PDFs as it does for other TIF compressions?
Appreciate your help!
Regards,
AG
I was able to reproduce your problem with the code you supplied, but found that the problem went away once I used Image.GetInstance instead of the Bitmap used in your sample. When using the code below, the file size and run time were the same between Java and C#. If you have any questions about the sample, don't hesitate to ask.
List<string> TIFfiles = new List<string>();
Document document;
PdfWriter pdfwriter;
iTextSharp.text.Image tifFile;
String pdfFilename = pdfFile;
TIFfiles = new List<string>();
TIFfiles.Add(tifFile1);
TIFfiles.Add(tifFile2);
TIFfiles.Add(tifFile3);
TIFfiles.Add(tifFile4);
TIFfiles.Add(tifFile5);
TIFfiles.Add(tifFile6);
TIFfiles.Add(tifFile7);
//Create document
document = new Document();
// creation of the different writers
pdfwriter = PdfWriter.GetInstance(document, new System.IO.FileStream(pdfFilename, FileMode.Create));
document.Open();
document.SetMargins(0, 0, 0, 0);
int i = 0;
while (i < 50)
{
foreach (string file in TIFfiles)
{
//load the tiff image
tifFile = iTextSharp.text.Image.GetInstance(file);
//Total number of pages
iTextSharp.text.Rectangle pgSize = new iTextSharp.text.Rectangle(tifFile.Width, tifFile.Height);
document.SetPageSize(pgSize);
document.NewPage();
PdfContentByte cb = pdfwriter.DirectContent;
// scale the image to fit in the page
tifFile.SetAbsolutePosition(0, 0);
cb.AddImage(tifFile);
}
i++;
}
document.Close();

inserting pages with PdfStamper does not import form fields

I'm inserting pages into a PDF document. The pages don't get added at the end or the beginning; they need to be inserted somewhere in the middle (I have a way to determine the insert location with bookmarks).
The key is not to lose bookmarks, so I'm using PdfStamper to insert the pages. The problem is that the PDFs being inserted have form fields, and those fields are not coming through.
The code that does the inserting:
for (int pageNum = 1; pageNum <= readerPdfToAdd.NumberOfPages; pageNum++)
{
PdfImportedPage page = pdfStamper.GetImportedPage(readerPdfToAdd, pageNum);
pdfStamper.InsertPage(filesByCategory[i].PageOfInsert + pageNum, readerPdfToAdd.GetPageSizeWithRotation(pageNum));
var rotation = readerPdfToAdd.GetPageRotation(pageNum);
if (rotation == 90 || rotation == 270)
{
pdfStamper.GetUnderContent(filesByCategory[i].PageOfInsert + pageNum)
.AddTemplate(page, 0, -1f, 1f, 0, 0, readerPdfToAdd.GetPageSizeWithRotation(pageNum).Height);
}
else
{
pdfStamper.GetUnderContent(filesByCategory[i].PageOfInsert + pageNum).AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
}
I tried something like this to copy the fields, but it doesn't copy them exactly.
foreach (KeyValuePair<string, AcroFields.Item> kvp in pdfFormFields.Fields)
{
var s = pdfFormFields.GetFieldPositions("Date");
PdfArray r = kvp.Value.GetWidget(0).GetAsArray(PdfName.RECT);
var name = kvp.Value.GetWidget(0).GetAsArray(PdfName.NAME);
Rectangle rr = new Rectangle(r.GetAsNumber(0).FloatValue, r.GetAsNumber(1).FloatValue, r.GetAsNumber(2).FloatValue, r.GetAsNumber(3).FloatValue);
TextField field = new TextField(pdfStamper.Writer, rr, kvp.Value.GetWidget(0).Get(PdfName.T).ToString());
if (kvp.Value.GetWidget(0).Get(PdfName.V) != null)
field.Text = kvp.Value.GetWidget(0).Get(PdfName.V).ToString();
// add the field here, the second param is the page you want it on
pdfStamper.AddAnnotation(field.GetTextField(), filesByCategory[i].PageOfInsert + pageNum);
fields.SetField(kvp.Key, kvp.Value.ToString());
}
Is there a better way to do this? I've tried PdfCopy, but that loses the bookmarks of the source document.
Please use PdfCopy or PdfSmartCopy to assemble documents. As documented in chapter 6 of my book a PdfImportedPage only copies what is in the content stream.
You probably need PdfStamper to stamp some extra content on the original document, and that is fine, but you need to combine this with PdfCopy or PdfSmartCopy to assemble the final document.

iTextSharp z-index

I'm using iTextSharp to add annotations to a PDF document.
I have a PDF document that already contains an image saved in it; it's a stamp.
I draw some strokes on this PDF over the stamp, and everything is fine when I draw them in my WPF application, but when I send the PDF by email, using iTextSharp for the conversion, the line I drew is now below the stamp.
How can I solve this problem?
Thank you
The explanation you posted as an answer (BTW, more apropos would have been to edit your question to contain that data) explains the issue.
There are two principal types of objects visible on a PDF page:
the PDF page content;
annotations associated with the page.
The annotations are always displayed above the page content if they are displayed at all.
In your case you add the image to the PDF page content (using OverContent or UnderContent only changes where in relation to other PDF page content material your additions appear). The stamp, on the other hand, most likely is realized by means of an annotation. Thus, the stamp annotation always is above your additions.
If you want to have your additions appear above the stamp, you either have to add your additions as some kind of annotation, too, or you have to flatten the stamp annotation into the page content before adding your stuff.
Which of these variants is better depends on the requirements you have. Are there any requirements forcing the stamp to remain a stamp annotation? Are there any requirements forcing your additions to remain part of the content? Please elaborate on your requirements. As content and annotations have some different properties when displayed or printed, please state all requirements.
And furthermore, please supply sample documents.
As I said, the original PDF has a stamp saved inside it; if I open the PDF with Acrobat Reader I can move the stamp.
Here is my code to write some strokes:
using (var outputStream = new FileStream(outputPath, FileMode.Create, FileAccess.Write, FileShare.Read))
using (var intputStream = new FileStream(pathPdf, FileMode.Open, FileAccess.Read, FileShare.Read))
{
PdfReader reader = new PdfReader(intputStream);
using (var pdfStamper = new PdfStamper(reader, outputStream))
{
foreach (var page in pages)
{
if (page != null && page.ExportedImages.HasItems())
{
PdfContentByte pdfContent = pdfStamper.GetOverContent(page.PageIndex);
Rectangle pageSize = reader.GetPageSizeWithRotation(page.PageIndex);
PdfLayer pdfLayer = new PdfLayer(string.Format(ANNOTATIONNAMEWITHPAGENAME, page.PageIndex), pdfContent.PdfWriter);
foreach (ExporterEditPageInfoImage exportedInfo in page.ExportedImages)
{
Image image = PngImage.GetImage(exportedInfo.Path);
image.Layer = pdfLayer;
if (quality == PublishQuality.Normal || quality == PublishQuality.Medium || quality == PublishQuality.High)
{
float width = (float)Math.Ceiling((image.Width / image.DpiX) * 72);
float height = (float)Math.Ceiling((image.Height / image.DpiY) * 72);
image.ScaleAbsolute(width, height);
float x = (float)(exportedInfo.HorizontalTile * (page.TileSize * (72 / 96d)));
float y = (float)Math.Max(0, (pageSize.Height - ((exportedInfo.VerticalTile + 1) * (page.TileSize * (72 / 96d)))));
image.SetAbsolutePosition(x, y);
}
else
throw new NotSupportedException();
pdfContent.AddImage(image);
GC.Collect();
GC.WaitForPendingFinalizers();
}
}
}
pdfStamper.Close();
}
}
So my strokes are saved correctly in the PDF. The problem is that the stamp is always on top of everything, which I think is normal. Is there a workaround for this?

Add file with bookmark

I want to add a PDF file using iTextSharp, and if the PDF file contains bookmarks, they should be added as well.
Currently I'm using the following code:
Document document = new Document();
//Step 2: we create a writer that listens to the document
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(outputFileName, FileMode.Create));
writer.ViewerPreferences = PdfWriter.PageModeUseOutlines;
//Step 3: Open the document
document.Open();
PdfContentByte cb = writer.DirectContent;
//The current file path
string filename = "D:\\rtf\\2.pdf";
// we create a reader for the document
PdfReader reader = new PdfReader(filename);
//Chapter ch = new Chapter("", 1);
for (int pageNumber = 1; pageNumber < reader.NumberOfPages + 1; pageNumber++)
{
document.SetPageSize(reader.GetPageSizeWithRotation(1));
document.NewPage();
// Insert to Destination on the first page
if (pageNumber == 1)
{
Chunk fileRef = new Chunk(" ");
fileRef.SetLocalDestination(filename);
document.Add(fileRef);
}
PdfImportedPage page = writer.GetImportedPage(reader, pageNumber);
int rotation = reader.GetPageRotation(pageNumber);
if (rotation == 90 || rotation == 270)
{
cb.Add(page);
}
else
{
cb.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
}
}
document.Close();
Please read Chapter 6 of my book. In table 6.1, you'll read:
Can import pages from other PDF documents. The major downside is that all interactive features of the imported page (annotations, bookmarks, fields, and so forth) are lost in the process.
This is exactly what you experience. However, if you look at the other classes listed in that table, you'll discover PdfStamper, PdfCopy, etc... which are classes that do preserve interactive features.
PdfStamper will keep the bookmarks. If you want to use PdfCopy (or PdfSmartCopy), you need to read chapter 7 to find out how to keep them. Chapter 7 isn't available for free, but you can consult the examples here: Java / C#. You need the ConcatenateBookmarks example.
Note that your code currently looks convoluted because you're not using the correct classes. Using PdfStamper should significantly reduce the number of lines of code.