Merge 2 PDF byte streams using iTextSharp

I have a method that returns a PDF byte stream (from a fillable PDF). Is there a straightforward way to merge two streams into one stream and make one PDF out of it? I need to run my method twice, but I need the two PDFs in one PDF stream. Thanks.

You didn't say whether you're flattening the filled forms with the PdfStamper, so I'll just say that you must flatten them before trying to merge them. Here's a working .ashx HTTP handler:
<%@ WebHandler Language="C#" Class="mergeByteForms" %>
using System;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
public class mergeByteForms : IHttpHandler {
    HttpServerUtility Server;
    public void ProcessRequest (HttpContext context) {
        Server = context.Server;
        HttpResponse Response = context.Response;
        Response.ContentType = "application/pdf";
        using (Document document = new Document()) {
            using (PdfSmartCopy copy = new PdfSmartCopy(
                document, Response.OutputStream) )
            {
                document.Open();
                // merge page 1 of each filled (and flattened) form
                for (int i = 0; i < 2; ++i) {
                    PdfReader reader = new PdfReader(_getPdfBtyeStream(i.ToString()));
                    copy.AddPage(copy.GetImportedPage(reader, 1));
                }
            }
        }
    }
    public bool IsReusable { get { return false; } }
    // simulate your method to use __one__ byte stream for __one__ PDF
    private byte[] _getPdfBtyeStream(string data) {
        // replace with __your__ PDF template
        string pdfTemplatePath = Server.MapPath(
            "~/app_data/template.pdf"
        );
        PdfReader reader = new PdfReader(pdfTemplatePath);
        using (MemoryStream ms = new MemoryStream()) {
            using (PdfStamper stamper = new PdfStamper(reader, ms)) {
                AcroFields form = stamper.AcroFields;
                // replace this with your form field data
                form.SetField("title", data);
                // ...
                // this is __VERY__ important; since you're using the same fillable
                // PDF, if you don't set this property to true the second page will
                // lose the filled fields.
                stamper.FormFlattening = true;
            }
            return ms.ToArray();
        }
    }
}
Hopefully the inline comments make sense. The _getPdfBtyeStream() method above simulates your PDF byte streams. You need to set FormFlattening to true because form field names in a PDF are supposed to be unique. In your case the second page is the same fillable PDF form, so it has the same field names as the first page, and the values you fill in for the second page are ignored. Comment out this line from the example above:
stamper.FormFlattening = true;
to see what I mean.
In other words, a lot of the generic PDF-merging code on the Internet, and even here on Stack Overflow, will not work for fillable forms because AcroFields are not accounted for. In fact, if you look at the "SO FAQ & Popular" entry for merging PDFs on Stack Overflow's itextsharp tag page, this is mentioned in the third comment on the accepted answer, by @Ray Cheng.
Another way to merge fillable PDFs (without flattening the form) is to rename the form fields on the second and following pages, but that's more work.

Related

iText7: com.itextpdf.kernel.PdfException: Dictionary doesn't have supported font data

I am trying to generate a TOC (table of contents) for my PDF, and I want to use an ITextExtractionStrategy to pull out the strings in xxx.pdf that look like chapter titles. But I get a com.itextpdf.kernel.PdfException when I run a test.
Here is my code:
@org.junit.Test
public void test() throws IOException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfDocument pdfDoc = new PdfDocument(new PdfReader("src/test/resources/template/xxx.pdf"),
new PdfWriter(baos));
pdfDoc.addNewPage(1);
Document document = new Document(pdfDoc);
// adding this code causes com.itextpdf.kernel.PdfException: Dictionary doesn't have supported font data.
Paragraph title = new Paragraph(new Text("index"))
.setTextAlignment(TextAlignment.CENTER);
document.add(title);
SimpleTextExtractionStrategy extractionStrategy = new SimpleTextExtractionStrategy();
for (int i = 1; i < pdfDoc.getNumberOfPages(); i++) {
PdfPage page = pdfDoc.getPage(i);
PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
parser.processPageContent(page);
}
...
document.close();
pdfDoc.close();
new FileOutputStream("./yyy.pdf").write(baos.toByteArray());
}
Here is the output:
com.itextpdf.kernel.PdfException: Dictionary doesn't have supported font data.
at com.itextpdf.kernel.font.PdfFontFactory.createFont(PdfFontFactory.java:123)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.getFont(PdfCanvasProcessor.java:490)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor$SetTextFontOperator.invoke(PdfCanvasProcessor.java:811)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.invokeOperator(PdfCanvasProcessor.java:454)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processContent(PdfCanvasProcessor.java:282)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processPageContent(PdfCanvasProcessor.java:303)
at com.example.pdf.util.Test.test(Test.java:138)
Whenever you add content to a PdfDocument like you do here
Document document = new Document(pdfDoc);
Paragraph title = new Paragraph(new Text("index"))
.setTextAlignment(TextAlignment.CENTER);
document.add(title);
you have to be aware that this content is not already stored in its final form; for example fonts used are not yet properly subset'ed. The final form is generated when you're closing the document.
Text extraction on the other hand requires the content to extract to be in its final form.
Thus, you should not apply text extraction to a document you're working on. In particular, don't apply text extraction to a page you've changed the content of.
If you need to extract text from the documents you create yourself, close your document first, open a new document from the output, and extract from that new document.

Insert HTML in docx file

I have made an application that fills Word files using custom XML parts. Now I am trying to put text into a text field, but the text contains HTML and I want the field to show its styling. I tried converting it to rich text format, but the raw markup just gets pasted into the Word file. Here is an example of the code:
var taskId = Guid.NewGuid();
var tempFilePath = $"{Path.GetTempPath()}/{taskId}";
using (var templateStream = new FileStream($"{tempFilePath}.docx", FileMode.CreateNew))
{
templateStream.Write(template, 0, template.Length);
// 1. Fill template.
using (WordprocessingDocument doc = WordprocessingDocument.Open(templateStream, true))
{
MainDocumentPart mainDocument = doc.MainDocumentPart;
if (mainDocument.CustomXmlParts != null)
{
mainDocument.DeleteParts<CustomXmlPart>(mainDocument.CustomXmlParts);
}
CustomXmlPart cxp = mainDocument.AddCustomXmlPart(CustomXmlPartType.CustomXml);
foreach (var line in data.Lines)
{
if (line.MoreInfo != null && line.MoreInfo != " ") {
}
}
var xmlData = ObjectToXml(data);
using (var stream = GenerateStreamFromString(tempFilePath, xmlData))
{
cxp.FeedData(stream);
}
mainDocument.Document.Save();
}
}
You can't just write HTML-formatted text into a DOCX field; you would need to convert it into WordprocessingML first.
However, there is another way you could try, which is to insert an "AltChunk" element. That element acts as a placeholder that can reference an HTML file; when the DOCX file is opened in MS Word, Word performs the HTML to WordprocessingML conversion for you. For details see: How to Use altChunk for Document Assembly
Alternatively, you could use a third-party library, like GemBox.Document, which can perform that HTML to WordprocessingML conversion for you.
For example, check this Set Content example:
// Set content using HTML tags
document.Sections[0].Blocks[4].Content.LoadText(
"Paragraph 5 <b>(part of this paragraph is bold)</b>", LoadOptions.HtmlDefault);

iText not returning text contents of a PDF after first page

I am trying to use the iText library with C# to capture the text portion of PDF files.
I created a PDF by exporting from Excel 2013 and then copied a sample from the web showing how to use iText (I added the library reference to the project).
It reads the first page perfectly, but I get garbled output after that: it keeps part of the first page and merges it with the text of the next page. The commented-out lines are from my attempts to solve the problem; the string "thePage" is now recreated inside the for loop.
Here is the code. I can email the pdf to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
//string[] theLines;
//theLines = new string[COLUMNS];
//string thePage;
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string thePage = "";
thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
string [] theLines = thePage.Split('\n');
foreach (var theLine in theLines)
{
text.AppendLine(theLine);
}
// text.AppendLine(" ");
// Array.Clear(theLines, 0, theLines.Length);
// thePage = "";
}
return text.ToString();
}
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.
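To make the reasoning concrete, here is a minimal toy model in plain Java. It does not use the iText API: `AccumulatingStrategy` is a hypothetical stand-in that only mimics the one property that matters here, namely that a strategy keeps everything it has been fed. Sharing one instance across pages makes every later page contain the earlier text, while a fresh instance per page does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StrategyPerPageDemo {
    /** Hypothetical stand-in for an extraction strategy: it only ever accumulates text. */
    static class AccumulatingStrategy {
        private final StringBuilder seen = new StringBuilder();
        void onText(String text) { seen.append(text); }
        String getResultantText() { return seen.toString(); }
    }

    /** Problem case: one strategy instance is reused for every page. */
    static List<String> extractShared(List<String> pages) {
        AccumulatingStrategy shared = new AccumulatingStrategy();
        List<String> out = new ArrayList<>();
        for (String page : pages) {
            shared.onText(page);
            out.add(shared.getResultantText()); // still contains earlier pages
        }
        return out;
    }

    /** Fix: a new strategy instance is created inside the loop, once per page. */
    static List<String> extractFresh(List<String> pages) {
        List<String> out = new ArrayList<>();
        for (String page : pages) {
            AccumulatingStrategy fresh = new AccumulatingStrategy();
            fresh.onText(page);
            out.add(fresh.getResultantText());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("page one", "page two");
        System.out.println(extractShared(pages)); // [page one, page onepage two]
        System.out.println(extractFresh(pages));  // [page one, page two]
    }
}
```

In the question's code, the equivalent change would be to create the LocationTextExtractionStrategy inside the for loop rather than once before it.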

itext AcroFields form onto second page, needs to keep same template

I have a form that I created in MS Word and then converted to a PDF form. I load it with a PdfReader and then create a stamper that fills in the fields. If I want to add a second page with the same template (form), how do I do this and populate some of the fields with the same information?
I have managed to get a new page with another reader, but how do I stamp information onto this page, given that the AcroFields will have the same names?
This is how I achieved that:
stamper.insertPage(1,PageSize.A4);
PdfReader reader = new PdfReader("/soaprintjobs/templates/STOTemplate.pdf"); // reads the original PDF template
PdfImportedPage page;
page = stamper.getImportedPage(reader,1); // import the first page of the original PDF
PdfContentByte newPageContent = stamper.getUnderContent(1); // get the under content of the first page of the new PDF
newPageContent.addTemplate(page, 0,0);
Thanks
Acroform fields have the property that fields with the same name are considered the same field. They have the same value. So if you have a field with the same name on page 1 and page 2, they will always display the same value. If you change the value on page 1, it will also change on page 2.
In some cases this is desirable. You may have a multi-page form with a reference number and want to repeat that reference number on each page. In that case you can use fields with the same name.
However, if you want to have multiple copies of the same form with different data in 1 document, you'll run into problems. You'll have to rename the form fields so they are unique.
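As a rough illustration of why the names must be unique, here is a toy model in plain Java. It is not iText code: each form is modelled as a hypothetical `Map` from field name to value, and the merged document as one map. Which value survives a name clash in a real PDF depends on the library; in this sketch the later form wins simply because `Map.put` overwrites. The suffix scheme (`name_1`, `name_2`) is the same idea as appending the document number to each field name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldRenameDemo {
    /** Naive merge: same-named fields collapse into a single entry. */
    static Map<String, String> mergeNaive(List<Map<String, String>> forms) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> form : forms) {
            merged.putAll(form); // a later form overwrites same-named fields
        }
        return merged;
    }

    /** Merge after renaming: every field gets a per-document suffix, so nothing collides. */
    static Map<String, String> mergeWithSuffix(List<Map<String, String>> forms) {
        Map<String, String> merged = new LinkedHashMap<>();
        int documentNumber = 0;
        for (Map<String, String> form : forms) {
            documentNumber++;
            for (Map.Entry<String, String> field : form.entrySet()) {
                merged.put(field.getKey() + "_" + documentNumber, field.getValue());
            }
        }
        return merged;
    }

    /** Builds two copies of the same hypothetical form, filled with different data. */
    static List<Map<String, String>> sampleForms() {
        List<Map<String, String>> forms = new ArrayList<>();
        Map<String, String> first = new LinkedHashMap<>();
        first.put("title", "A");
        first.put("ref", "1");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("title", "B");
        second.put("ref", "2");
        forms.add(first);
        forms.add(second);
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(mergeNaive(sampleForms()));      // {title=B, ref=2}
        System.out.println(mergeWithSuffix(sampleForms())); // {title_1=A, ref_1=1, title_2=B, ref_2=2}
    }
}
```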
In iText, you should not use getImportedPage() to copy Acroforms. Starting with iText 5.4.4 you can use the PdfCopy class. In earlier versions the PdfCopyFields class should be used.
Here's some sample code to copy Acroforms and rename fields. Code for iText 5.4.4 and up is in comments.
public static void main(String[] args) throws FileNotFoundException, DocumentException, IOException {
    String[] inputs = { "form1.pdf", "form2.pdf" };
    PdfCopyFields pcf = new PdfCopyFields(new FileOutputStream("out.pdf"));
    // iText 5.4.4+
    // Document document = new Document();
    // PdfCopy pcf = new PdfCopy(document, new FileOutputStream("out.pdf"));
    // pcf.setMergeFields();
    // document.open();
    int documentnumber = 0;
    for (String input : inputs) {
        PdfReader reader = new PdfReader(input);
        documentnumber++;
        // add suffix to each field name, in order to make them unique.
        renameFields(reader, documentnumber);
        pcf.addDocument(reader);
    }
    pcf.close();
    // iText 5.4.4+
    // document.close();
}
public static void renameFields(PdfReader reader, int documentnumber) {
    // copy the key set first, because renaming modifies the underlying field map
    Set<String> keys = new HashSet<String>(reader.getAcroFields()
        .getFields().keySet());
    for (String key : keys) {
        reader.getAcroFields().renameField(key, key + "_" + documentnumber);
    }
}

Bolding with Rich Text Values in iTextSharp

Is it possible to bold a single word within a sentence with iTextSharp? I'm working with large paragraphs of text coming from xml, and I am trying to bold several individual words without having to break the string into individual phrases.
Eg:
document.Add(new Paragraph("this is <b>bold</b> text"));
should output...
this is bold text
As @kuujinbo pointed out, there is the XMLWorker object, which is where most of the new HTML parsing work is being done. But if you've just got simple markup like bold or italic, you can use the native iTextSharp.text.html.simpleparser.HTMLWorker class. You could wrap it in a helper method such as:
private Paragraph CreateSimpleHtmlParagraph(String text) {
    //Our return object
    Paragraph p = new Paragraph();
    //ParseToList requires a TextReader instead of just text
    using (StringReader sr = new StringReader(text)) {
        //Parse and get a collection of elements
        List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, null);
        foreach (IElement e in elements) {
            //Add those elements to the paragraph
            p.Add(e);
        }
    }
    //Return the paragraph
    return p;
}
Then instead of this:
document.Add(new Paragraph("this is <b>bold</b> text"));
You could use this:
document.Add(CreateSimpleHtmlParagraph("this is <b>bold</b> text"));
document.Add(CreateSimpleHtmlParagraph("this is <i>italic</i> text"));
document.Add(CreateSimpleHtmlParagraph("this is <b><i>bold and italic</i></b> text"));
I know that this is an old question, but I could not get the other examples here to work for me. Adding the text in Chunks with different fonts did work, though.
//define a bold font to be used
Font boldFont = FontFactory.GetFont(FontFactory.HELVETICA_BOLD, 12);
//create a phrase and add Chunks to it
var phrase2 = new Phrase();
phrase2.Add(new Chunk("this is "));
phrase2.Add(new Chunk("bold", boldFont));
phrase2.Add(new Chunk(" text"));
document.Add(phrase2);
Not sure how complex your XML is, but try XMLWorker. Here's a working example with an ASP.NET HTTP handler:
<%@ WebHandler Language="C#" Class="boldText" %>
using System;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.xml;
using iTextSharp.tool.xml;
public class boldText : IHttpHandler {
public void ProcessRequest (HttpContext context) {
HttpResponse Response = context.Response;
Response.ContentType = "application/pdf";
StringReader xmlSnippet = new StringReader(
"<p>This is <b>bold</b> text</p>"
);
using (Document document = new Document()) {
PdfWriter writer = PdfWriter.GetInstance(
document, Response.OutputStream
);
document.Open();
XMLWorkerHelper.GetInstance().ParseXHtml(
writer, document, xmlSnippet
);
}
}
public bool IsReusable { get { return false; } }
}
You may have to pre-process your XML before sending it to XMLWorker (notice that the snippet is a bit different from yours). Support for parsing HTML/XML was released relatively recently, so your mileage may vary.
Here is another XMLWorker example that uses a different overload of ParseXHtml and returns a Phrase instead of writing directly to the document.
private static Phrase CreateSimpleHtmlParagraph(String text)
{
var p = new Phrase();
var mh = new MyElementHandler();
using (TextReader sr = new StringReader("<html><body><p>" + text + "</p></body></html>"))
{
XMLWorkerHelper.GetInstance().ParseXHtml(mh, sr);
}
foreach (var element in mh.elements)
{
foreach (var chunk in element.Chunks)
{
p.Add(chunk);
}
}
return p;
}
private class MyElementHandler : IElementHandler
{
public List<IElement> elements = new List<IElement>();
public void Add(IWritable w)
{
if (w is iTextSharp.tool.xml.pipeline.WritableElement)
{
elements.AddRange(((iTextSharp.tool.xml.pipeline.WritableElement)w).Elements());
}
}
}