How do we use a PdfWriter of iTextSharp? - itext

If you do a Google search, you will find code samples more or less like this:
PdfWriter writer = PdfWriter.GetInstance(document, myIoStream);
And interestingly, most of the code samples don't use the writer object at all. So, what's the point of getting that handle? Thank you!

You can simply call PdfWriter.GetInstance(document, myIoStream);
This associates a writer with the document, so that whatever you add to the document is written to the stream.
If you keep the reference, as in PdfWriter writer = PdfWriter.GetInstance(document, myIoStream);
you can call writer methods such as AddAnnotation(), whose output is merged into the document, or use writer.DirectContent to add images and other content directly to the page.
Hope it helps.

Related

Using itextsharp to remove text from pdf

I'm working on a program to remove text from a specified area of a pdf.
It works well on most pdfs, but I've found it falls over with some pdfs which contain graphics using Indexed colorspace - it only works on CMYK or RGB. I'm afraid I'm really clueless on this subject so could really use some help.
Here's my code:
Dim source_file As String = "c:\test pdf\test.pdf"
Dim destination_file As String = "c:\test pdf\output.pdf"
Dim reader As PdfReader = New PdfReader(source_file)
Using outputPdfStream As Stream = New FileStream(destination_file, FileMode.Create, FileAccess.Write, FileShare.None)
    Dim stamper = New PdfStamper(reader, outputPdfStream)
    Dim Locs As New List(Of PdfCleanUpLocation)
    Locs.Add(New PdfCleanUpLocation(1, New Rectangle(97.0F, 405.0F, 480.0F, 445.0F), BaseColor.WHITE))
    Dim oCleaner As New PdfCleanUpProcessor(Locs, stamper)
    oCleaner.CleanUp()
    stamper.Close()
    reader.Close()
End Using
The error I'm getting is:
iTextSharp.text.exceptions.UnsupportedPdfException: 'The color space [/Indexed, /DeviceCMYK, 73, 13 0R] is not supported'
This comes up at the oCleaner.CleanUp() line
For reference, I originally extracted the code from the below link where someone was trying to do something similar, but a lot more involved, a few years ago:
https://www.vbforums.com/showthread.php?831051-RESOLVED-Confusion-converting-C-code
If anyone can suggest a way of getting this to work with pdfs featuring Indexed colorspace graphics I'd be extremely grateful!
Thanks for reading!
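For anyone puzzling over the error message: the array [/Indexed, /DeviceCMYK, 73, ...] describes a palette-based color space. Per the PDF specification, an Indexed space names a base space (here DeviceCMYK), a highest index (73, so 74 palette entries), and a lookup table holding one byte per base-space component per entry. The Java sketch below only illustrates how such a lookup turns an index into CMYK values. It is not iTextSharp code and it does not fix the cleanup problem; the class and method names are made up for the example.

```java
/** Illustration only: resolves a palette index in an Indexed color space. */
final class IndexedColorSketch {

    /**
     * Returns the base-space components (each scaled to 0.0 .. 1.0) for one index.
     *
     * @param table              raw lookup bytes, componentsPerEntry bytes per entry
     * @param componentsPerEntry 4 for DeviceCMYK, 3 for DeviceRGB
     * @param hival              highest valid index (73 in the error message above)
     * @param index              palette index read from the image data
     */
    static double[] lookup(byte[] table, int componentsPerEntry, int hival, int index) {
        if (table.length < componentsPerEntry * (hival + 1)) {
            throw new IllegalArgumentException("lookup table is too short");
        }
        // Out-of-range indices are adjusted to the nearest valid one.
        int clamped = Math.max(0, Math.min(hival, index));
        double[] components = new double[componentsPerEntry];
        for (int c = 0; c < componentsPerEntry; c++) {
            int raw = table[clamped * componentsPerEntry + c] & 0xFF; // unsigned byte
            components[c] = raw / 255.0;
        }
        return components;
    }

    public static void main(String[] args) {
        // Two-entry CMYK palette: entry 0 is pure black, entry 1 is pure cyan.
        byte[] table = {0, 0, 0, (byte) 255, (byte) 255, 0, 0, 0};
        System.out.println(java.util.Arrays.toString(lookup(table, 4, 1, 1)));
    }
}
```

Code that only understands DeviceRGB or DeviceCMYK directly has no such lookup step, which is consistent with the exception being thrown for Indexed images.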

iText - Remove Document Level Javascripts

Using the iText PDF libraries (v7), does anyone have any advice on how to remove 'Document-level' JavaScripts from PDFs? I have figured out how to remove Page-Level JavaScripts, but cannot seem to figure out how to remove those at the document-level. Thank you.
I got this resolved and below is the snippet of code (C#) in case anyone else needs it:
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SOURCE), new PdfWriter(TARGET));
PdfCatalog pdfCat = pdfDoc.GetCatalog();
PdfDictionary names = pdfCat.GetPdfObject().GetAsDictionary(PdfName.Names);
names.Remove(PdfName.JavaScript);
pdfDoc.Close();

Parsing HTML with iText's XMLWorkerHelper writes the content directly into the Document, but I want to set the data using ColumnText [duplicate]

I want to use iText to convert a series of HTML files to PDF.
For instance, if I have these files:
page1.html
page2.html
page3.html
...
Now I want to create a single PDF file, where page1.html is the first page, page2.html is the second page, and so on...
I know how to convert a single HTML file to a PDF, but I don't know how to combine these different PDFs resulting from this operation into a single PDF.
Before we start: I am not a C# developer, so I cannot give you an example in C#. All the iText examples I write are written in Java. Fortunately, iText and iTextSharp are always kept in sync. In the context of this question, you can rest assured that whatever works for iText will also work for iTextSharp, but you'll have to make small adaptations that are specific to C#. From what I hear from C# developers, this is usually not hard to achieve.
Regarding the answer: there are two answers and answer #2 is generally better than answer #1, but I'm giving both options because there may be specific cases where answer #1 is better.
Test data: I have created 3 simple HTML files, each containing some info about a State in the US:
page1.html: California
page2.html: New York
page3.html: Massachusetts
We are going to use XML Worker to parse these three files and we want a single PDF file as a result.
Answer #1: see ParseMultipleHtmlFiles1 for the full code sample and multiple_html_pages1.pdf for the resulting PDF.
You say that you already succeeded in converting one HTML file into one PDF file. I assume that you did it like this:
public byte[] parseHtml(String html) throws DocumentException, IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // step 1
    Document document = new Document();
    // step 2
    PdfWriter writer = PdfWriter.getInstance(document, baos);
    // step 3
    document.open();
    // step 4
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(html));
    // step 5
    document.close();
    // return the bytes of the PDF
    return baos.toByteArray();
}
This is not the most efficient way to parse an HTML file (there are other examples on the web site), but it's the simplest way.
As you can see, this method parses an HTML file into a PDF file and returns that PDF file in the form of a byte[]. As we want to create a single PDF, we can feed this byte array to a PdfCopy instance, so that we can concatenate multiple documents.
Suppose that we have three documents:
public static final String[] HTML = {
    "resources/xml/page1.html",
    "resources/xml/page2.html",
    "resources/xml/page3.html"
};
We can loop over these three documents, parse them one by one to a byte[], create a PdfReader instance with the PDF bytes, and add the document to the PdfCopy instance using the addDocument() method:
public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileOutputStream(file));
    document.open();
    PdfReader reader;
    for (String html : HTML) {
        reader = new PdfReader(parseHtml(html));
        copy.addDocument(reader);
        reader.close();
    }
    document.close();
}
This solves your problem, but why do I think it's not the optimal solution?
Suppose that you need to use a special font that needs to be embedded. In that case, every separate PDF file will contain a subset of that font. Different files will require different font subsets, and neither PdfCopy nor PdfSmartCopy can merge font subsets. This could result in a bloated PDF file with far too many subsets of the same font.
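To make the bloat concrete, here is a toy model in plain Java. It does not call iText and it does not perform real font subsetting; it simply treats each distinct character in a text as one glyph that a subset would have to carry. The class and method names are invented for this illustration, and the three strings stand in for the content of the three test pages.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: distinct characters stand in for the glyphs of a font subset. */
final class FontSubsetSketch {

    /** Distinct characters a document would need from an embedded font. */
    static Set<Integer> glyphsUsed(String text) {
        Set<Integer> glyphs = new TreeSet<>();
        text.codePoints().forEach(glyphs::add);
        return glyphs;
    }

    /** Concatenation style: one subset per source PDF, all kept side by side. */
    static int glyphsStoredSeparately(List<String> documents) {
        int total = 0;
        for (String doc : documents) {
            total += glyphsUsed(doc).size();
        }
        return total;
    }

    /** Single-document style: one subset that covers all the text. */
    static int glyphsStoredOnce(List<String> documents) {
        Set<Integer> union = new TreeSet<>();
        for (String doc : documents) {
            union.addAll(glyphsUsed(doc));
        }
        return union.size();
    }

    public static void main(String[] args) {
        List<String> docs = List.of("California", "New York", "Massachusetts");
        System.out.println("separate subsets: " + glyphsStoredSeparately(docs));
        System.out.println("single subset:    " + glyphsStoredOnce(docs));
    }
}
```

The gap between the two numbers grows with the number of files and with how much text they share, and in a real PDF each subset also carries its own font program overhead.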
How do we solve this? That's explained in answer #2.
Answer #2: See ParseMultipleHtmlFiles2 for the full code sample and multiple_html_pages2.pdf for the resulting PDF. You already see the difference in file size: 4.61 KB versus 5.05 KB (and we didn't even introduce embedded fonts).
In this case, we don't parse the HTML to a PDF file the way we did in the parseHtml() method from answer #1. Instead, we parse the HTML to an iText ElementList using the parseToElementList() method. This method requires two Strings. One containing the HTML code, the other one containing CSS values.
We use a utility method to read the HTML file into a String. As for the CSS value, we could pass null to parseToElementList(), but in that case, default styles will be ignored. You'll notice that the <h1> tag we introduced in our HTML will look completely different if you don't pass the default.css that is shipped with XML Worker.
Long story short, this is the code:
public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    String css = readCSS();
    for (String htmlfile : HTML) {
        String html = Utilities.readFileToString(htmlfile);
        ElementList list = XMLWorkerHelper.parseToElementList(html, css);
        for (Element e : list) {
            document.add(e);
        }
        document.newPage();
    }
    document.close();
}
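The readCSS() helper called in this method is not shown in the snippet (it is part of the full code sample). As a rough idea of what such a helper has to do, here is a minimal plain-Java sketch that reads any InputStream into a String. The class name, the method name and the choice of UTF-8 are my assumptions; in the real sample the stream would presumably point at the default.css that ships with XML Worker.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: loads a text resource (for example a CSS file) into a String. */
final class TextLoaderSketch {

    /** Reads the whole stream and decodes it as UTF-8; the caller closes the stream. */
    static String readToString(InputStream in) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // Wrapped so that callers are not forced to declare IOException.
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] css = "h1 { color: navy; }".getBytes(StandardCharsets.UTF_8);
        System.out.println(readToString(new java.io.ByteArrayInputStream(css)));
    }
}
```

The resulting String is what you would pass as the second argument of parseToElementList().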
We create a single Document and a single PdfWriter instance. We parse the different HTML files into ElementLists one by one, and we add all the elements to the Document.
Because you want a new page each time a new HTML file is parsed, I introduced a document.newPage() call. If you remove this line, the content of the three HTML files can flow onto a single page (which wouldn't be possible if you opted for answer #1).

Combine two PDF-a documents using ITextSharp

I'm hoping that someone can see the flaw in my code to merge two PDF/A documents using iTextSharp. Currently it complains about missing metadata, which PDF/A requires.
Document document = new Document();
MemoryStream ms = new MemoryStream();
using (PdfACopy pdfaCopy = new PdfACopy(document, ms, PdfAConformanceLevel.PDF_A_1A))
{
    document.Open();
    using (PdfReader reader = new PdfReader("Doc1.pdf"))
    {
        pdfaCopy.AddDocument(reader);
    }
    using (PdfReader reader = new PdfReader("doc2.pdf"))
    {
        pdfaCopy.AddDocument(reader);
    }
}
The exact error received is
Unhandled Exception: iTextSharp.text.pdf.PdfAConformanceException: The document catalog dictionary of a PDF/A conforming file shall contain
the Metadata key
I was hoping that the 'document catalog dictionary' would be copied as well, but I guess the 'new Document()' creates an empty non-conforming document or something.
Thanks! Hope you can help
Wouter
You need to add this line, where copy is your PdfACopy instance (pdfaCopy in your code):
copy.CreateXmpMetadata();
This will create some default XMP metadata. Of course: if you want to create your own XMP file containing info about the documents you're about to merge, you can also use:
copy.XmpMetadata = myMetaData;
where myMetaData is a byte array containing a correct XMP stream.
I hope you understand that iText can't automatically create the correct metadata. Providing metadata is something that needs human attention.
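To give an idea of what "a byte array containing a correct XMP stream" can look like, here is a plain-Java sketch that builds a very small XMP packet declaring PDF/A-1a through the pdfaid schema. It is Java rather than C#, it does not use iText, and the class and method names are invented. Treat it as a starting point only: a real PDF/A file usually needs more in its metadata (for instance, entries that agree with the document info), so this packet alone is probably not enough to pass validation.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only: a minimal XMP packet that identifies a file as PDF/A-1a. */
final class XmpPacketSketch {

    static byte[] pdfA1aPacket() {
        String xmp =
              "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
            + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description rdf:about=\"\"\n"
            + "      xmlns:pdfaid=\"http://www.aiim.org/pdfa/ns/id/\">\n"
            + "   <pdfaid:part>1</pdfaid:part>\n"                 // PDF/A part 1
            + "   <pdfaid:conformance>A</pdfaid:conformance>\n"   // conformance level A
            + "  </rdf:Description>\n"
            + " </rdf:RDF>\n"
            + "</x:xmpmeta>\n"
            + "<?xpacket end=\"w\"?>";
        // XMP packets embedded in PDF files are typically encoded as UTF-8.
        return xmp.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(pdfA1aPacket().length + " bytes of XMP");
    }
}
```

In C# the equivalent byte array is what you would assign to copy.XmpMetadata.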