We're trying to concatenate many dynamically generated pdf together using PdfSmartCopy as follow
Document document = new Document();
PdfCopy copy = new PdfSmartCopy(document, outputStream);
document.open();
for (PdfReader reader : list) {
int n = reader.getNumberOfPages();
// page numbers start at 1
for (int page = 1; page <= n; page++) {
copy.addPage(copy.getImportedPage(reader, page));
}
copy.freeReader(reader);
reader.close();
}
document.close();
Once in a while, we keep getting this exception InvalidPdfException: Rebuild failed: '>' not expected at file pointer 65537; Original message: Invalid object number. at file pointer
Or InvalidPdfException: Rebuild failed: Dictionary key 0 is not a name. at file pointer 65536
Have anyone encountered this before?
Related
When a collate is created from multiple documents, java.lang.OutOfMemoryError: Java heap space error is coming in the server and the application goes down.
Below is a sample code snippet,
PdfReader objReader = new PdfReader(new ByteArrayInputStream(content));
PdfDocument srcPdfDocument = new PdfDocument(objReader);
Document srcDocument = new Document(srcPdfDocument);
WriterProperties wp = new WriterProperties();
wp.setPdfVersion(PdfVersion.PDF_1_7);
PdfDocument destPdfDoc = new PdfDocument(new PdfWriter(baos,wp));
Document destDocument = new Document(destPdfDoc);
PdfMerger merger = new PdfMerger(destPdfDoc, true,true);
merger.merge(srcPdfDocument, 1, srcPdfDocument.getNumberOfPages());
//finally block
finally{
if(srcPdfDocument != null && !srcPdfDocument.isClosed()) {
srcPdfDocument.close();
}
if(srcDocument != null) {
srcDocument.close();
}
if(destPdfDoc != null && !destPdfDoc.isClosed()) {
destPdfDoc.close();
}
if(destDocument != null) {
destDocument.close();
}
}
If its large documents (file size in GBs) or documents with corrupted tag structure are collated (Error in server log -com.itextpdf.kernel.pdf.tagging.ParentTreeHandler Corrupted tag structure: encountered invalid marked content reference - it doesn't refer to any page or any mcid. This content reference will be ignored), the out of memory error is thrown.
Is there any way we can collate documents without keeping the bytes in the memory
using Itext 7. Please note, using Itext 5 (PdfCopy) the functionality works fine without issues.
a bit late reply.
I had the same problem itext 7.2.1.
I solved it by adding.
flushCopiedObjects()
Merging a 6000 page, 400 MiB file, works fine.
PdfDocument pdf = new PdfDocument(new PdfWriter(outfile));
PdfMerger merger = new PdfMerger(pdf);
for ( String file : listFilesToAdd ) {
PdfDocument fileAdd = new PdfDocument(new PdfReader(file));
merger.merge(fileAdd , 1, fileAdd.getNumberOfPages());
pdf.flushCopiedObjects(fileAdd);
fileAdd.close();
}
pdf.close();
I have multiple copies of a .pdf document that are commented by different users. I would like to merge all these comments into a new pdf "merged".
I wrote this sub inside a class called document with properties "path" and "directory".
Public Sub MergeComments(ByVal pdfDocuments As String())
Dim oSavePath As String = Directory & "\" & FileName & "_Merged.pdf"
Dim oPDFdocument As New iText.Kernel.Pdf.PdfDocument(New PdfReader(Path),
New PdfWriter(New IO.FileStream(oSavePath, IO.FileMode.Create)))
For Each oFile As String In pdfDocuments
Dim oSecundairyPDFdocument As New iText.Kernel.Pdf.PdfDocument(New PdfReader(oFile))
Dim oAnnotations As New PDFannotations
For i As Integer = 1 To oSecundairyPDFdocument.GetNumberOfPages
Dim pdfPage As PdfPage = oSecundairyPDFdocument.GetPage(i)
For Each oAnnotation As Annot.PdfAnnotation In pdfPage.GetAnnotations()
oPDFdocument.GetPage(i).AddAnnotation(oAnnotation)
Next
Next
Next
oPDFdocument.Close()
End Sub
This code results in an exception that I am failing to solve.
iText.Kernel.PdfException: 'Pdf indirect object belongs to other PDF document. Copy object to current pdf document.'
What do I need to change in order to perform this task? Or am I completely off with my code block?
You need to explicitly copy the underlying PDF object to the destination document. After that you will be easily able to add that object to the list of page annotations.
Instead of adding the annotation directly:
oPDFdocument.GetPage(i).AddAnnotation(oAnnotation)
Copy the object to the destination document first, wrap it into PdfAnnotation class with makeAnnotation method and then add it as usual. Code is in Java but you will easily be able to convert it into VB:
PdfObject annotObject = oAnnotation.getPdfObject().copyTo(pdfDocument);
pdfDocument.getPage(i).addAnnotation(PdfAnnotation.makeAnnotation(annotObject));
Here is a working Java code, with annotations copied from one document to other using the copyTo method.
PdfReader reader = new PdfReader(new
RandomAccessSourceFactory().createBestSource(sourceFileName), null);
PdfDocument document = new PdfDocument(reader);
PdfReader toMergeReader = new PdfReader(new RandomAccessSourceFactory().createBestSource(targetFileName), null);
PdfDocument toMergeDocument = new PdfDocument(toMergeReader);
PdfWriter writer = new PdfWriter(targetFileName + "_MergedVersion.pdf");
PdfDocument writeDocument = new PdfDocument(writer);
int pageCount = toMergeDocument.getNumberOfPages();
for (int i = 1; i <= pageCount; i++) {
PdfPage page = document.getPage(i);
writeDocument.addPage(page.copyTo(writeDocument));
PdfPage pdfPage = toMergeDocument.getPage(i);
List<PdfAnnotation> pageAnnots = pdfPage.getAnnotations();
if (pageAnnots != null) {
for (PdfAnnotation pdfAnnotation : pageAnnots) {
PdfObject annotObject = pdfAnnotation.getPdfObject().copyTo(writeDocument);
writeDocument.getPage(i).addAnnotation(PdfAnnotation.makeAnnotation(annotObject));
}
}
}
reader.close();
toMergeReader.close();
toMergeDocument.close();
document.close();
writeDocument.close();
writer.close();
We are in the last steps of evaluating iText7. We use iText 7.1.0 and html2pdf 2.0.0.
What we do: we send a json_encoded collection with pdf-data (which includes html for header, body and footer) to our Java app. There we iterate over the collection, create a byteArrayOutputStream for each pdf-data element and merge them together. We then send the results to a script which echoes it to e.g. a browser. Although the pdf is displayed correctly, we encounter errors while creating it:
com.itextpdf.io.IOException: Error at file pointer 226,416.
...
Caused by: com.itextpdf.io.IOException: xref subsection not found.
... 73 common frames omitted
If we create only one part of the collection, no error is thrown.
Iterate over collection and merge:
#RequestMapping(value = "/pdf", method = RequestMethod.POST, produces = MediaType.APPLICATION_PDF_VALUE)
public byte[] index(#RequestBody PDFDataModelCollection elements, Model model) throws IOException {
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfWriter writer = new PdfWriter(byteArrayOutputStream);
try (PdfDocument resultDoc = new PdfDocument(writer)) {
for (PDFDataModel pdfDataModel : elements.getElements()) {
PdfReader reader = new PdfReader(new ByteArrayInputStream(creationService.createDatasheet(pdfDataModel)));
try (PdfDocument sourceDoc = new PdfDocument(reader)) {
int n = sourceDoc.getNumberOfPages(); //<-- IOException on second iteration
for (int i = 1; i <= n; i++) {
PdfPage page = sourceDoc.getPage(i).copyTo(resultDoc);
resultDoc.addPage(page);
}
}
}
}
return byteArrayOutputStream.toByteArray(); //outputs the final pdf
}
Creation of part:
public byte[] createDatasheet(PDFDataModel pdfDataModel) throws IOException {
PdfWriter writer = new PdfWriter(byteArrayOutputStream);
//Initialize PDF document
PdfDocument pdfDoc = new PdfDocument(writer);
try (
Document document = new Document(pdfDoc)
) {
//header, footer, etc
//body
for (IElement element : HtmlConverter.convertToElements(pdfDataModel.getBody(), this.props)) {
document.add((IBlockElement) element);
}
footer.writeTotalNumberOnPages(pdfDoc);
}
return byteArrayOutputStream.toByteArray();
}
We are grateful for any suggestion.
In createDatasheet you appear to re-use some byteArrayOutputStream without clearing it first.
In the first iteration, therefore, everything works as desired, at the end of createDatasheet you have a single PDF file in it.
In the second iteration, though, you have two PDF files in that byteArrayOutputStream, one after the other. This concatenation does not form a valid single PDF.
Thus, byteArrayOutputStream.toByteArray() returns something broken.
To fix this, either make the byteArrayOutputStream local to createDatasheet and create a new instance every time or alternatively reset byteArrayOutputStream at the start of createDatasheet:
public byte[] createDatasheet(PDFDataModel pdfDataModel) throws IOException {
byteArrayOutputStream.reset();
PdfWriter writer = new PdfWriter(byteArrayOutputStream);
[...]
I currently use iTextSharp to read in some PDF files and parse them by using the string I receive. I have encountered a strange behavior with some PDF files. When getting the string back of a for example 4 page PDF, the string is filled with the pages in the following order:
1 2 1 3 1 4
My code for reading the files is as follows:
using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
if (!string.IsNullOrWhiteSpace(text))
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
}
Debug.WriteLine(sb.ToString());
}
Here is a link to a file with which this behaviour occurs:
https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf
Hope you guys can help me out!
Thanks to Chris Haas I found out was going wrong. The samples found online on how to use iTextSharp.Pdf are incorrect or incorrect for my implementation.
The SimpleTextExtractionStrategy needs to be instantiated for every page you try to read. Not doing this will multiply each previous page in the resulting string.
Also the line where the StringBuilder is being appended can be changed from:
sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
to
sb.Append(text);
Thus the following code gives the correct result:
using (PdfReader reader = new PdfReader(fileStream))
{
StringBuilder sb = new StringBuilder();
for (int page = 0; page < reader.NumberOfPages; page++)
{
string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
if (!string.IsNullOrWhiteSpace(text))
sb.Append(text);
}
Debug.WriteLine(sb.ToString());
}
I want to replace a particular text in PDF document. I am currently using itextSharp library to play with PDF documents.
I had extracted the bytes from pdfdocument and then replaced that byte and then write the document again with the bytes but it is not working. In the below example I am trying to replace string 1234 with 5678
Any advise on how to perform this would be helpful.
PdfReader reader = new PdfReader(opf.FileNames[i]);
byte[] pdfbytes = reader.GetPageContent(1);
PdfString oldstring = new PdfString("1234");
PdfString newstring = new PdfString("5678");
byte[] byte1022 = oldstring.GetOriginalBytes();
byte[] byte1067 = newstring.GetOriginalBytes();
int position = 0;
for (int j = 0; j <pdfbytes.Length ; j++)
{
if (pdfbytes[j] == byte1022[0])
{
if (pdfbytes[j+1] == byte1022[1])
{
if (pdfbytes[j+2] == byte1022[2])
{
if (pdfbytes[j+3] == byte1022[3])
{
position = j;
break;
}
}
}
}
}
pdfbytes[position] = byte1067[0];
pdfbytes[position + 1] = byte1067[1];
pdfbytes[position + 2] = byte1067[2];
pdfbytes[position + 3] = byte1067[3];
File.WriteAllBytes(opf.FileNames[i].Replace(".pdf","j.pdf"), pdfbytes);
What makes you think 1234 is part of the page's content stream and not of a form XObject? Your code is never going to work in general if you don't parse all the resources of a page.
Also: I see GetPageContent(), but I don't see you using SetPageContent() anywhere. How are the changes ever going to be stored in the PdfReader object?
Moreover, I don't see you using PdfStamper to write the altered PdfReader contents to a file.
Finally: I'm to shy to quote the words of Leonard Rosenthol, Adobe's PDF Architect, but ask him, and he'll tell you personally that you shouldn't do what you're trying to do. PDF is NOT a format for editing.Read the intro of chapter 6 of the book I wrote on iText: http://www.manning.com/lowagie2/samplechapter6.pdf