iText7 (pdfSweep) throws OutOfMemoryError when redacting a large PDF

I was trying to remove some content from a PDF using pdfSweep. Below is part of my code; I am using CompositeCleanupStrategy and adding one RegexBasedCleanupStrategy per keyword to it:
CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
for (int i = 0; i < keywordlist.size(); i++) {
    String kvalue = keywordlist.get(i);
    Loger.getLogger().info("keyword " + i + "=" + kvalue);
    strategy.add(new RegexBasedCleanupStrategy(kvalue).setRedactionColor(ColorConstants.GRAY));
}
try {
    PdfWriter writer = new PdfWriter(dest);
    writer.setCompressionLevel(0);
    PdfDocument pdf = new PdfDocument(new PdfReader(src), writer);
    // sweep
    PdfAutoSweep pdfAutoSweep = new PdfAutoSweep(strategy);
    pdfAutoSweep.cleanUp(pdf);
    // close the document
    pdf.close();
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
When the strategy is small, with only one or two keywords, the cleanup works fine. However, with 243 entries in keywordlist and a PDF of about 70 MB, I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toLowerCase(String.java:2590)
at java.lang.String.toLowerCase(String.java:2670)
at com.itextpdf.io.font.PdfEncodings.convertToString(PdfEncodings.java:287)
at com.itextpdf.kernel.pdf.PdfString.toUnicodeString(PdfString.java:163)
at com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo.getUnscaledBaselineWithOffset(TextRenderInfo.java:425)
at com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo.getBaseline(TextRenderInfo.java:213)
at com.itextpdf.kernel.pdf.canvas.parser.listener.CharacterRenderInfo.<init>(CharacterRenderInfo.java:112)
at com.itextpdf.kernel.pdf.canvas.parser.listener.RegexBasedLocationExtractionStrategy.toCRI(RegexBasedLocationExtractionStrategy.java:156)
at com.itextpdf.kernel.pdf.canvas.parser.listener.RegexBasedLocationExtractionStrategy.eventOccurred(RegexBasedLocationExtractionStrategy.java:135)
at com.itextpdf.pdfcleanup.autosweep.CompositeCleanupStrategy.eventOccurred(CompositeCleanupStrategy.java:115)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.eventOccurred(PdfCanvasProcessor.java:534)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.displayPdfString(PdfCanvasProcessor.java:549)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.access$4700(PdfCanvasProcessor.java:108)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor$ShowTextArrayOperator.invoke(PdfCanvasProcessor.java:617)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.invokeOperator(PdfCanvasProcessor.java:452)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processContent(PdfCanvasProcessor.java:281)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processPageContent(PdfCanvasProcessor.java:302)
at com.itextpdf.kernel.pdf.canvas.parser.PdfDocumentContentParser.processContent(PdfDocumentContentParser.java:77)
at com.itextpdf.kernel.pdf.canvas.parser.PdfDocumentContentParser.processContent(PdfDocumentContentParser.java:90)
at com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep.getPdfCleanUpLocations(PdfAutoSweep.java:130)
at com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep.cleanUp(PdfAutoSweep.java:186)

(Full disclosure: original author of RegexBasedCleanupStrategy here)
RegexBasedCleanupStrategy is not meant to be used like this.
You are creating 200 instances of this class, all of which will go over the document to see whether they can match (chunk by chunk) the PDF against the regular expression.
In order to do this, they will store all chunks in the document, sort them, and then loop over them.
So you are duplicating the document 200-something times in memory.
That is your bottleneck.
My suggestion: build a better regular expression.
You can match keywords a, b, c, etc. with a single regex:
(a)|(b)|(c)
This copies the document into memory only once and then matches the aggregate regex against it.
This has both performance and memory-footprint benefits.
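A minimal sketch of building such an aggregate regex from a keyword list. It assumes the keywords are literal strings, which is why each one is escaped with Pattern.quote; if the keywords are already regular expressions, the quoting should be dropped. The class and method names (KeywordRegex, combine) are made up for illustration and are not part of iText.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class KeywordRegex {
    // Joins the keywords into one alternation of the form (a)|(b)|(c).
    // Assumption: keywords are literals, so each is escaped with Pattern.quote.
    static String combine(List<String> keywords) {
        return keywords.stream()
                .map(k -> "(" + Pattern.quote(k) + ")")
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        String regex = combine(Arrays.asList("foo", "a.b", "bar"));
        System.out.println(regex);
        // The single combined expression would then go into one strategy
        // instead of one strategy per keyword, e.g.:
        // new RegexBasedCleanupStrategy(regex).setRedactionColor(ColorConstants.GRAY)
    }
}
```

With this, the loop over keywordlist in the question would only collect the keywords, and a single RegexBasedCleanupStrategy would be created afterwards.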

Word OpenXml Word Found Unreadable Content

We are trying to manipulate a Word document to remove a paragraph based on certain conditions. But the Word file produced always ends up corrupted; when we try to open it, we get the error:
Word found unreadable content
The below code corrupts the file but if we remove the line:
Document document = mdp.Document;
Then the file is saved and opens without issue. Is there an obvious issue that I am missing?
var readAllBytes = File.ReadAllBytes(@"C:\Original.docx");
using (var stream = new MemoryStream(readAllBytes))
{
    using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
    {
        MainDocumentPart mdp = wpd.MainDocumentPart;
        Document document = mdp.Document;
    }
}
File.WriteAllBytes(@"C:\New.docx", readAllBytes);
UPDATE:
using (WordprocessingDocument wpd = WordprocessingDocument.Open(@"C:\Original.docx", true))
{
    MainDocumentPart mdp = wpd.MainDocumentPart;
    Document document = mdp.Document;
    document.Save();
}
Running the code above on a physical file, we can still open Original.docx without the error, so the problem seems limited to modifying a stream.
Here's a method that reads a document into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var destStream = new MemoryStream(buffer.Length);
    destStream.Write(buffer, 0, buffer.Length);
    destStream.Seek(0, SeekOrigin.Begin);
    return destStream;
}
Note how the MemoryStream is instantiated. I am passing the capacity rather than the buffer (as in your own code). Why is that?
When using MemoryStream() or MemoryStream(int), you are creating a resizable MemoryStream instance, which you will want in case you make changes to your document. When using MemoryStream(byte[]) (as in your code), the MemoryStream instance is not resizable, which will be problematic unless you don't make any changes to your document or your changes will only ever make it shrink in size.
Now, to read a Word document into a MemoryStream, manipulate that Word document in memory, and end up with a consistent MemoryStream, you will have to do the following:
// Get a MemoryStream.
// In this example, the MemoryStream is created by reading a file stored
// in the file system. Depending on the Stream you "receive", it makes
// sense to copy the Stream to a MemoryStream before processing.
MemoryStream stream = ReadAllBytesToMemoryStream(@"C:\Original.docx");
// Open the Word document on the MemoryStream.
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
{
    MainDocumentPart mdp = wpd.MainDocumentPart;
    Document document = mdp.Document;
    // Manipulate document ...
}
// After having closed the WordprocessingDocument (by leaving the using statement),
// you can use the MemoryStream for whatever comes next, e.g., to write it to a
// file stored in the file system.
// ToArray() copies only the bytes actually written; GetBuffer() would also
// return any unused capacity at the end of the internal buffer.
File.WriteAllBytes(@"C:\New.docx", stream.ToArray());
Note that you will have to reset the stream.Position property by calling stream.Seek(0, SeekOrigin.Begin) whenever your next action depends on that MemoryStream.Position property (e.g., CopyTo, CopyToAsync). Right after having left the using statement, the stream's position will be equal to its length.

maximum page size in itextpdf

I'm creating an ADF project and generating reports using itextpdf 5.1.3.
For example, my customer table has 3000 rows. My jspx page has the customer table and a button with a file download listener, where I simply call the report.
The report is generated for only the first 50 rows (at most 4 pages).
Why are the remaining rows missing from the report?
private void generatePDFFile(FacesContext facesContext, java.io.OutputStream outputStream) {
    try {
        DCBindingContainer dcBindings = (DCBindingContainer)BindingContext.getCurrent().getCurrentBindingsEntry();
        DCIteratorBinding iterBind1 = (DCIteratorBinding)dcBindings.get("CustomerView1Iterator");
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream(FILE)); // file path on local disk: D:\mywork\customer.pdf
        document.open();
        Row[] rows= iterBind1.getAllRowsInRange();
        for(Row row:rows){
            custcode = (String) row.getAttribute("CustomerCode");
            custname = (String) row.getAttribute("CustomerNameE");
            addgroup(document);
        }
        document.close();
        facesContext = facesContext.getCurrentInstance();
        ServletContext context = (ServletContext)facesContext.getExternalContext().getContext();
        File file = new File(FILE);
        FileInputStream fileInputStream;
        byte[] b;
        System.out.println(file.getCanonicalPath());
        System.out.println(file.getAbsolutePath());
        fileInputStream = new FileInputStream(file);
        int n = fileInputStream.available();
        while (n > 0) {
            b = new byte[n];
            //b = new byte[8192];
            int result = fileInputStream.read(b);
            outputStream.write(b, 0, b.length);
            if (result == -1)
                break;
        }
        outputStream.flush();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
private static void addgroup(Document document) throws DocumentException{
    Paragraph preface1 = new Paragraph();
    Paragraph preface2 = new Paragraph();
    preface1.add(new Chunk("\n"));
    preface1.add(new Chunk("Customer : "+custcode+" "+custname,BlueFont));
    preface1.add(new Chunk("\n"));
    document.add(preface1);
    document.add(preface2);
}
The maximum size of a page in a PDF document is 14,400 by 14,400 user units. See the blog post "Help, I only see blank pages in my PDF". This is not a limitation introduced by iText; it's a limitation that exists in PDF viewers such as Adobe Reader. Other viewers (such as Apple Preview) have no problem displaying PDFs with larger pages.
This answers the question in the subject line of your post.
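To put the 14,400-unit limit in physical terms, here is a small worked calculation. It assumes the default user unit of 1/72 inch (no UserUnit override on the page); the class name PageSizeLimit is only illustrative.

```java
final class PageSizeLimit {
    static final double MAX_USER_UNITS = 14_400; // per-side limit quoted above
    static final double UNITS_PER_INCH = 72;     // assumption: default user unit of 1/72 inch

    // Largest page side in inches under the default user unit.
    static double maxInches() {
        return MAX_USER_UNITS / UNITS_PER_INCH;
    }

    // Same length in metres (1 inch = 0.0254 m).
    static double maxMetres() {
        return maxInches() * 0.0254;
    }

    public static void main(String[] args) {
        System.out.printf("%.0f in = %.2f m per side%n", maxInches(), maxMetres());
    }
}
```

So the limit corresponds to a page of 200 by 200 inches, roughly 5 metres per side, which is far larger than anything a 3000-row report needs.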
The body of your post contains a completely different question. Maybe you aren't asking about the page size, but rather about the file size. That question has also been answered in the past. See "What is the size limit of pdf file?"
Allow me to summarize that answer:
The maximum size of a PDF with a version older than PDF 1.5 is about 10 gigabytes.
The maximum size of a PDF with a version PDF 1.5 or higher using a compressed cross-reference stream depends only on the limitations of the software processing the PDF.
The maximum size of a PDF created with iText versions before 5.3 is 2 gigabytes. The maximum size of a PDF created with iText versions 5.3 and higher is 1 terabyte.
Since you are using iText 5.1.3, you can create a PDF of 2 GBytes. It beats me why you are using a version of iText that dates from November 2011 instead of a more recent version, but there is no reason why you wouldn't be able to put a table containing 3000 rows in a PDF that can be as large as 2 GBytes.
Chances are that your code is really bad. Unfortunately, you don't show us your code. Did you read the documentation on how to create Large Tables?
I can't comment on the part where you say:
My jspx page has the customer table and a button with a file download listener, where I simply call the report.
That doesn't seem to be relevant in the context of your question.

iText7 strategy for limiting memory consumption of PdfFont

most of the iText7 examples refer to the use of PdfFontFactory.createFont() to get handles to PdfFont instances for text operations. With moderation, this is fine...but PdfFont is a pretty heavy-weight object (PdfEncoding) that doesn't seem to go away until the PdfDocument is closed. So the following innocent block is gonna gobble up memory:
for (int i = 0; i < someLargeNumber; i++) {
    list.add(
        new ListItem("never gonna give")
            .setFont(PdfFontFactory.createFont("Helvetica-Oblique"))
    );
}
a trivial attempt at a solution using statics failed because it appears PdfFont instances cannot be used across more than one PdfDocument. And because my actual case is more complex than the example above, i don't want to have to pass a bunch of PdfFont references across a pretty deep stack.
in the iText7 API, there's no way to iterate over existing PdfFont's for the PdfDocument (is there?)
is the rule for PdfFont usage simply that a) it can be used as many times as you want b) within a single PdfDocument instance
(i.e. is a possible solution here to simply cache PdfFont instances using a PdfDocument + PdfFontProgram key?)
PdfFonts appear to be cacheable/reusable at the PdfDocument level. If using a WeakHashMap as the cache, both the keys and the values need to be weak refs, for example:
private static WeakHashMap<PdfDocument, Map<String, WeakReference<PdfFont>>> fontCache = new WeakHashMap<>();
public static synchronized PdfFont createFont(PdfDocument forDocument, String path) throws IOException {
    Map<String, WeakReference<PdfFont>> documentFontMap = fontCache.get(forDocument);
    if (documentFontMap == null) {
        documentFontMap = new HashMap<>();
        fontCache.put(forDocument, documentFontMap);
    }
    WeakReference<PdfFont> ref = documentFontMap.get(path);
    PdfFont font = (ref == null) ? null : ref.get();
    if (font == null) {
        // Not cached yet, or the weak reference was cleared by the GC: create the font again.
        font = PdfFontFactory.createFont(path);
        documentFontMap.put(path, new WeakReference<>(font));
    }
    return font;
}
care should also be taken for iText API that calls PdfFontFactory itself, such as Barcode1D derivatives configured to show a human readable value (i.e. creating a new Barcode1D instance per page w/out calling setFont() will quickly exhaust memory for large documents)

slow retrieval of BYTEA column

I'm supposed to be testing different methods to store PDF files in a Postgres Database using JDBC. Currently I'm trying it with BYTEA. Storing files works without problems, but the retrieval is super slow.
I am working with a couple of files of around 3 MB each. Storing them takes around 3 seconds (total), so that's alright. But when I try to retrieve them, it takes around 2 minutes between the output of how many files are in the DB and the program actually starting to create the files. Once it starts, it only takes around 5 seconds to finish. Why is Postgres taking so long for the query "SELECT file..."? The query takes equally long when I use pgAdmin. Not retrieving the file size doesn't change anything.
As far as I understand, the DB uses TOAST to split my files up, and when I want to retrieve them, it has to piece them back together first. But since splitting them (when uploading) only takes a couple of seconds, putting them back together shouldn't take that long, right?
Here are some code snippets:
public void saveToDB(File[] files) {
    try (PreparedStatement s = con.prepareStatement("INSERT INTO fileTable (filename, file) VALUES (?,?)")) {
        for (File f : files) {
            System.out.println(f.getName()+" (" + f.length() / 1024 + "KB)");
            s.setString(1, f.getName());
            s.setBinaryStream(2, new FileInputStream(f), f.length());
            s.executeUpdate();
        }
        con.commit();
    }
}
public void getFromDB(File dir) {
    dir.mkdirs();
    try (Statement s = con.createStatement(); ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM useByteA")) {
        rs.next();
        System.out.println("Files: " + rs.getInt(1));
    }
    try (Statement s = con.createStatement(); ResultSet rs = s.executeQuery("SELECT length(file), filename, file FROM fileTable")) {
        while (rs.next()) {
            // Column 2 is the file name, column 3 the BYTEA content.
            String filename = rs.getString(2);
            System.out.println(filename + " (" + rs.getInt(1) / 1024 + "KB)");
            File f = new File(dir, filename);
            f.createNewFile();
            try (FileOutputStream out = new FileOutputStream(f)) {
                out.write(rs.getBytes(3));
                out.flush();
            }
        }
    }
}

How to recognize PDF watermark and remove it using PDFBox

I'm trying to extract text, except watermark text, from PDF files with the Apache PDFBox library, so I want to remove the watermark first; the rest is what I want. Unfortunately, neither PDMetadata nor PDXObject can recognize the watermark. Any help will be appreciated. I found the code below.
// Open PDF document
PDDocument document = null;
try {
    document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
} catch (IOException e) {
    e.printStackTrace();
}
// Get all pages and loop through them
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while( iter.hasNext() ) {
    PDPage page = (PDPage)iter.next();
    PDResources resources = page.getResources();
    Map images = null;
    // Get all Images on page
    try {
        images = resources.getImages();//How to specify watermark instead of images??
    } catch (IOException e) {
        e.printStackTrace();
    }
    if( images != null ) {
        // Check all images for metadata
        Iterator imageIter = images.keySet().iterator();
        while( imageIter.hasNext() ) {
            String key = (String)imageIter.next();
            PDXObjectImage image = (PDXObjectImage)images.get( key );
            PDMetadata metadata = image.getMetadata();
            System.out.println("Found a image: Analyzing for Metadata");
            if (metadata == null) {
                System.out.println("No Metadata found for this image.");
            } else {
                InputStream xmlInputStream = null;
                try {
                    xmlInputStream = metadata.createInputStream();
                } catch (IOException e) {
                    e.printStackTrace();
                }
                try {
                    System.out.println("--------------------------------------------------------------------------------");
                    String mystring = convertStreamToString(xmlInputStream);
                    System.out.println(mystring);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            // Export the images
            String name = getUniqueFileName( key, image.getSuffix() );
            System.out.println( "Writing image:" + name );
            try {
                image.write2file( name );
            } catch (IOException e) {
                // TODO Auto-generated catch block
                //e.printStackTrace();
            }
            System.out.println("--------------------------------------------------------------------------------");
        }
    }
}
In contrast to your assumption, there is no explicit watermark object in a PDF by which watermarks could be recognized in generic PDFs.
Watermarks can be applied to a PDF page in many ways; each PDF creating library or application has its own way to add watermarks, some even offer multiple ways.
Watermarks can be
anything (Bitmap graphics, vector graphics, text, ...) drawn early in the content and, therefore, forming a background on which the rest of the content is drawn;
anything (Bitmap graphics, vector graphics, text, ...) drawn late in the content with transparency, forming a transparent overlay;
anything (Bitmap graphics, vector graphics, text, ...) drawn in the content stream of a watermark annotation which shall be used to represent graphics that shall be printed at a fixed size and position on a page, regardless of the dimensions of the printed page (cf. section 12.5.6.22 of the PDF specification ISO 32000-1).
Sometimes even mixed forms are used; have a look at this answer for an example: at the bottom you find a 'watermark' drawn above graphics but beneath text (to allow for easy reading).
The latter choice (the watermark annotation) obviously is easy to remove, but it actually also is the least often used choice, most likely because it is so easy to remove; people applying watermarks generally don't want their watermarks to get lost. Furthermore, annotations are sometimes handled incorrectly by PDF viewers, and code copying page content often ignores annotations.
If you do not handle generic documents but a specific type of documents (all generated alike), on the other hand, the very manner in which the watermarks are applied in them, probably can be recognized and an extraction routine might be feasible. If you have such a use case, please share a sample PDF for inspection.