I am trying to merge multiple documents into a single one by following the examples posted in this other post.
I am using AltChunk altChunk = new AltChunk(). When documents are merged, the result does not retain the separate headers of each document: the merged document only keeps the headers of the first document. If the first document being merged contains no headers, then the rest of the merged document contains no headers either, and vice versa.
My question is, how can I preserve different headers of the documents being merged?
Merge multiple word documents into one Open Xml
using System;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

namespace WordMergeProject
{
    public class Program
    {
        private static void Main(string[] args)
        {
            byte[] word1 = File.ReadAllBytes(@"..\..\word1.docx");
            byte[] word2 = File.ReadAllBytes(@"..\..\word2.docx");

            byte[] result = Merge(word1, word2);

            File.WriteAllBytes(@"..\..\word3.docx", result);
        }

        private static byte[] Merge(byte[] dest, byte[] src)
        {
            string altChunkId = "AltChunkId" + DateTime.Now.Ticks.ToString();

            var memoryStreamDest = new MemoryStream();
            memoryStreamDest.Write(dest, 0, dest.Length);
            memoryStreamDest.Seek(0, SeekOrigin.Begin);
            var memoryStreamSrc = new MemoryStream(src);

            using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStreamDest, true))
            {
                MainDocumentPart mainPart = doc.MainDocumentPart;
                AlternativeFormatImportPart altPart =
                    mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.WordprocessingML, altChunkId);
                altPart.FeedData(memoryStreamSrc);
                var altChunk = new AltChunk();
                altChunk.Id = altChunkId;

                OpenXmlElement lastElem = mainPart.Document.Body.Elements<AltChunk>().LastOrDefault();
                if (lastElem == null)
                {
                    lastElem = mainPart.Document.Body.Elements<Paragraph>().Last();
                }

                // Insert a page break before the appended document
                Paragraph pageBreakP = new Paragraph();
                Run pageBreakR = new Run();
                Break pageBreakBr = new Break() { Type = BreakValues.Page };

                pageBreakP.Append(pageBreakR);
                pageBreakR.Append(pageBreakBr);

                // Add the page break and the alt chunk after the last element
                mainPart.Document.Body.InsertAfter(pageBreakP, lastElem);
                mainPart.Document.Body.InsertAfter(altChunk, pageBreakP);
                mainPart.Document.Save();
            }

            return memoryStreamDest.ToArray();
        }
    }
}
I encountered this question a few years ago and spent quite some time on it; I eventually wrote a blog article that links to a sample file. Integrating files with headers and footers using AltChunk is not straightforward. I'll try to cover the essentials here. Depending on what kinds of content the headers and footers contain (and assuming Microsoft has not addressed any of the problems I originally ran into), it may not be possible to rely solely on AltChunk.
(Note also that there may be Tools/APIs that can handle this - I don't know and asking that on this site would be off-topic.)
Background
Before attacking the problem, it helps to understand how Word handles different headers and footers. To get a feel for it, start Word...
Section Breaks / Unlinking headers/footers
Type some text on the page and insert a header
Move the focus to the end of the page and go to the Page Layout tab in the Ribbon
Page Setup/Breaks/Next Page section break
Go into the Header area for this page and note the information in the blue "tags": you'll see a section identifier on the left and "Same as Previous" on the right. "Same as Previous" is the default; to create a different header, click the "Link to Previous" button in the Header & Footer Tools to turn the link off
So, the rule is:
a section break is required, with unlinked headers (and/or footers),
in order to have different header/footer content within a document.
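In Open XML terms, "unlinking" just means the new section carries its own explicit header reference; a sectPr with no headerReference inherits the previous section's header, which Word displays as "Same as Previous". A minimal sketch with the Open XML SDK (the helper name and the assumption that a HeaderPart already exists are mine):

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static SectionProperties BuildUnlinkedSection(MainDocumentPart mainPart, HeaderPart headerPart)
{
    // Pointing the new section at its own header part is what "unlinks" it;
    // a SectionProperties without any HeaderReference shows "Same as Previous".
    string headerRelId = mainPart.GetIdOfPart(headerPart);
    return new SectionProperties(
        new HeaderReference() { Type = HeaderFooterValues.Default, Id = headerRelId },
        new SectionType() { Val = SectionMarkValues.NextPage });
}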
Master/Sub-documents
Word has an (in)famous functionality called "Master Document" that enables linking outside ("sub") documents into a "master" document. Doing so automatically adds the necessary section breaks and unlinks the headers/footers so that the originals are retained.
Go to Word's Outline view
Click "Show Document"
Use "Insert" to insert other files
Notice that two section breaks are inserted, one of the type "Next page" and the other "Continuous". The first is inserted in the file coming in; the second in the "master" file.
Two section breaks are necessary when inserting a file because the last paragraph mark (which contains the section break for the end of the document) is not carried over to the target document. The section break in the target document carries the information to unlink the in-coming header from those already in the target document.
When the master is saved, closed and re-opened, the sub-documents are in a "collapsed" state (file names shown as hyperlinks instead of the content). They can be expanded by going back to the Outline view and clicking the "Expand" button. To fully incorporate a sub-document into the document, click the icon at the top left next to the sub-document, then click "Unlink".
Merging Word Open XML files
This, then, is the type of environment the Open XML SDK needs to create when merging files whose headers and footers need to be retained. Theoretically, either approach should work. Practically, I had problems with using only section breaks; I've never tested using the Master Document feature in Word Open XML.
Inserting section breaks
Here's the basic code for inserting a section break and unlinking headers before bringing in a file using AltChunk. Looking at my old posts and articles, as long as there's no complex page numbering involved, it works:
private void btnMergeWordDocs_Click(object sender, EventArgs e)
{
    string sourceFolder = @"C:\Test\MergeDocs\";
    string targetFolder = @"C:\Test\";
    string altChunkIdBase = "acID";
    int altChunkCounter = 1;
    string altChunkId = altChunkIdBase + altChunkCounter.ToString();
    MainDocumentPart wdDocTargetMainPart = null;
    Document docTarget = null;
    AlternativeFormatImportPartType afType;
    AlternativeFormatImportPart chunk = null;
    AltChunk ac = null;

    using (WordprocessingDocument wdPkgTarget = WordprocessingDocument.Create(targetFolder + "mergedDoc.docx", DocumentFormat.OpenXml.WordprocessingDocumentType.Document, true))
    {
        //Will create document in 2007 Compatibility Mode.
        //In order to make it 2010 a Settings part must be created and a CompatMode element for the Office version set.
        wdDocTargetMainPart = wdPkgTarget.MainDocumentPart;
        if (wdDocTargetMainPart == null)
        {
            wdDocTargetMainPart = wdPkgTarget.AddMainDocumentPart();
            Document wdDoc = new Document(
                new Body(
                    new Paragraph(
                        new Run(new Text() { Text = "First Para" })),
                    new Paragraph(new Run(new Text() { Text = "Second para" })),
                    new SectionProperties(
                        new SectionType() { Val = SectionMarkValues.NextPage },
                        new PageSize() { Code = 9 },
                        new PageMargin() { Gutter = 0, Bottom = 1134, Top = 1134, Left = 1318, Right = 1318, Footer = 709, Header = 709 },
                        new Columns() { Space = "708" },
                        new TitlePage())));
            wdDocTargetMainPart.Document = wdDoc;
        }
        docTarget = wdDocTargetMainPart.Document;
        SectionProperties secPropLast = docTarget.Body.Descendants<SectionProperties>().Last();
        SectionProperties secPropNew = (SectionProperties)secPropLast.CloneNode(true);
        //A section break must be in a ParagraphProperty
        Paragraph lastParaTarget = (Paragraph)docTarget.Body.Descendants<Paragraph>().Last();
        ParagraphProperties paraPropTarget = lastParaTarget.ParagraphProperties;
        if (paraPropTarget == null)
        {
            paraPropTarget = new ParagraphProperties();
        }
        paraPropTarget.Append(secPropNew);
        Run paraRun = lastParaTarget.Descendants<Run>().FirstOrDefault();
        //lastParaTarget.InsertBefore(paraPropTarget, paraRun);
        lastParaTarget.InsertAt(paraPropTarget, 0);

        //Process the individual files in the source folder.
        //Note that this process will permanently change the files by adding a section break.
        System.IO.DirectoryInfo di = new System.IO.DirectoryInfo(sourceFolder);
        IEnumerable<System.IO.FileInfo> docFiles = di.EnumerateFiles();
        foreach (System.IO.FileInfo fi in docFiles)
        {
            using (WordprocessingDocument pkgSourceDoc = WordprocessingDocument.Open(fi.FullName, true))
            {
                IEnumerable<HeaderPart> partsHeader = pkgSourceDoc.MainDocumentPart.GetPartsOfType<HeaderPart>();
                IEnumerable<FooterPart> partsFooter = pkgSourceDoc.MainDocumentPart.GetPartsOfType<FooterPart>();
                //If the source document has headers or footers we want to retain them.
                //This requires inserting a section break at the end of the document.
                if (partsHeader.Count() > 0 || partsFooter.Count() > 0)
                {
                    Body sourceBody = pkgSourceDoc.MainDocumentPart.Document.Body;
                    SectionProperties docSectionBreak = sourceBody.Descendants<SectionProperties>().Last();
                    //Make a copy of the document section break as this won't be imported into the target document.
                    //It needs to be appended to the last paragraph of the document
                    SectionProperties copySectionBreak = (SectionProperties)docSectionBreak.CloneNode(true);
                    Paragraph lastpara = sourceBody.Descendants<Paragraph>().Last();
                    ParagraphProperties paraProps = lastpara.ParagraphProperties;
                    if (paraProps == null)
                    {
                        paraProps = new ParagraphProperties();
                        lastpara.Append(paraProps);
                    }
                    paraProps.Append(copySectionBreak);
                }
                pkgSourceDoc.MainDocumentPart.Document.Save();
            }
            //Insert the source file into the target file using AltChunk
            afType = AlternativeFormatImportPartType.WordprocessingML;
            chunk = wdDocTargetMainPart.AddAlternativeFormatImportPart(afType, altChunkId);
            System.IO.FileStream fsSourceDocument = new System.IO.FileStream(fi.FullName, System.IO.FileMode.Open);
            chunk.FeedData(fsSourceDocument);
            //Create the chunk
            ac = new AltChunk();
            //Link it to the part
            ac.Id = altChunkId;
            docTarget.Body.InsertAfter(ac, docTarget.Body.Descendants<Paragraph>().Last());
            docTarget.Save();
            altChunkCounter += 1;
            altChunkId = altChunkIdBase + altChunkCounter.ToString();
            chunk = null;
            ac = null;
        }
    }
}
If there's complex page numbering (quoted from my blog article):
Unfortunately, there's a bug in the Word application when integrating Word document "chunks" into the main document. The process has the nasty habit of not retaining a number of SectionProperties, among them the one that sets whether a section has a Different First Page (<w:titlePg/>) and the one to restart Page Numbering (<w:pgNumType/>) in a section. As long as your documents don't need to manage these kinds of headers and footers you can probably use the "altChunk" approach.

But if you do need to handle complex headers and footers, the only method currently available to you is to copy in each document in its entirety, part by part. This is a non-trivial undertaking, as there are numerous possible types of Parts that can be associated not only with the main document body, but also with each header and footer part.
...or try the Master/Sub Document approach.
Master/Sub Document
This approach will certainly maintain all the information; the result will open as a Master document, however, and the Word application (either the user or automation code) is required to "unlink" the sub-documents in order to turn it into a single, integrated document.
Opening a Master Document file in the Open XML SDK Productivity Tool shows that inserting sub-documents into the master document is a fairly straightforward procedure:
The underlying Word Open XML for the document with one sub-document:
<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:pPr>
      <w:pStyle w:val="Heading1" />
    </w:pPr>
    <w:subDoc r:id="rId6" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" />
  </w:p>
  <w:sectPr>
    <w:headerReference w:type="default" r:id="rId7" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" />
    <w:type w:val="continuous" />
    <w:pgSz w:w="11906" w:h="16838" />
    <w:pgMar w:top="1417" w:right="1417" w:bottom="1134" w:left="1417" w:header="708" w:footer="708" w:gutter="0" />
    <w:cols w:space="708" />
    <w:docGrid w:linePitch="360" />
  </w:sectPr>
</w:body>
and the code:
public class GeneratedClass
{
    // Creates a Body instance and adds its children.
    public Body GenerateBody()
    {
        Body body1 = new Body();

        Paragraph paragraph1 = new Paragraph();
        ParagraphProperties paragraphProperties1 = new ParagraphProperties();
        ParagraphStyleId paragraphStyleId1 = new ParagraphStyleId(){ Val = "Heading1" };
        paragraphProperties1.Append(paragraphStyleId1);
        SubDocumentReference subDocumentReference1 = new SubDocumentReference(){ Id = "rId6" };
        paragraph1.Append(paragraphProperties1);
        paragraph1.Append(subDocumentReference1);

        SectionProperties sectionProperties1 = new SectionProperties();
        HeaderReference headerReference1 = new HeaderReference(){ Type = HeaderFooterValues.Default, Id = "rId7" };
        SectionType sectionType1 = new SectionType(){ Val = SectionMarkValues.Continuous };
        PageSize pageSize1 = new PageSize(){ Width = (UInt32Value)11906U, Height = (UInt32Value)16838U };
        PageMargin pageMargin1 = new PageMargin(){ Top = 1417, Right = (UInt32Value)1417U, Bottom = 1134, Left = (UInt32Value)1417U, Header = (UInt32Value)708U, Footer = (UInt32Value)708U, Gutter = (UInt32Value)0U };
        Columns columns1 = new Columns(){ Space = "708" };
        DocGrid docGrid1 = new DocGrid(){ LinePitch = 360 };
        sectionProperties1.Append(headerReference1);
        sectionProperties1.Append(sectionType1);
        sectionProperties1.Append(pageSize1);
        sectionProperties1.Append(pageMargin1);
        sectionProperties1.Append(columns1);
        sectionProperties1.Append(docGrid1);

        body1.Append(paragraph1);
        body1.Append(sectionProperties1);
        return body1;
    }
}
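To recreate this with the SDK, the r:id used by SubDocumentReference has to point at an external relationship to the sub-document file. A hedged sketch (the subDocument relationship type URI and the relative-path handling are my assumptions, not part of the Productivity Tool output):

using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static SubDocumentReference AddSubDocumentReference(MainDocumentPart mainPart, string subDocFileName)
{
    // Sub-documents are linked as external relationships; the id returned here
    // is what goes into SubDocumentReference.Id (w:subDoc r:id="...").
    ExternalRelationship rel = mainPart.AddExternalRelationship(
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/subDocument",
        new Uri(subDocFileName, UriKind.Relative));
    return new SubDocumentReference() { Id = rel.Id };
}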
Related
I am trying to generate a TOC (table of contents) for my PDF, and I want to extract some strings that look like chapter titles from xxx.pdf using ITextExtractionStrategy. But I get a com.itextpdf.kernel.PdfException when I run a test.
Here is my code:
@org.junit.Test
public void test() throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfDocument pdfDoc = new PdfDocument(new PdfReader("src/test/resources/template/xxx.pdf"),
            new PdfWriter(baos));
    pdfDoc.addNewPage(1);
    Document document = new Document(pdfDoc);
    // when I add this code, it throws com.itextpdf.kernel.PdfException: Dictionary doesn't have supported font data.
    Paragraph title = new Paragraph(new Text("index"))
            .setTextAlignment(TextAlignment.CENTER);
    document.add(title);
    SimpleTextExtractionStrategy extractionStrategy = new SimpleTextExtractionStrategy();
    for (int i = 1; i < pdfDoc.getNumberOfPages(); i++) {
        PdfPage page = pdfDoc.getPage(i);
        PdfCanvasProcessor parser = new PdfCanvasProcessor(extractionStrategy);
        parser.processPageContent(page);
    }
    ...
    document.close();
    pdfDoc.close();
    new FileOutputStream("./yyy.pdf").write(baos.toByteArray());
}
Here is the output:
com.itextpdf.kernel.PdfException: Dictionary doesn't have supported font data.
at com.itextpdf.kernel.font.PdfFontFactory.createFont(PdfFontFactory.java:123)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.getFont(PdfCanvasProcessor.java:490)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor$SetTextFontOperator.invoke(PdfCanvasProcessor.java:811)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.invokeOperator(PdfCanvasProcessor.java:454)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processContent(PdfCanvasProcessor.java:282)
at com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor.processPageContent(PdfCanvasProcessor.java:303)
at com.example.pdf.util.Test.test(Test.java:138)
Whenever you add content to a PdfDocument like you do here
Document document = new Document(pdfDoc);
Paragraph title = new Paragraph(new Text("index"))
.setTextAlignment(TextAlignment.CENTER);
document.add(title);
you have to be aware that this content is not yet stored in its final form; for example, the fonts used are not yet properly subsetted. The final form is generated when you close the document.
Text extraction on the other hand requires the content to extract to be in its final form.
Thus, you should not apply text extraction to a document you're working on. In particular, don't apply text extraction to a page you've changed the content of.
If you need to extract text from the documents you create yourself, close your document first, open a new document from the output, and extract from that new document.
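A minimal sketch of that workflow (shown here with iText 7 for .NET; the Java calls are analogous, and the file names are placeholders):

using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using iText.Layout;
using iText.Layout.Element;

// 1) Build the new document completely and close it.
var output = new MemoryStream();
var pdfDoc = new PdfDocument(new PdfReader("xxx.pdf"), new PdfWriter(output));
var document = new Document(pdfDoc);
document.Add(new Paragraph("index"));
document.Close();                      // fonts are subsetted and streams flushed here

// 2) Re-open the finished bytes and extract text from that copy.
var readDoc = new PdfDocument(new PdfReader(new MemoryStream(output.ToArray())));
for (int i = 1; i <= readDoc.GetNumberOfPages(); i++)
{
    string pageText = PdfTextExtractor.GetTextFromPage(readDoc.GetPage(i), new SimpleTextExtractionStrategy());
    // ... look for chapter-title-like lines in pageText ...
}
readDoc.Close();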
I have a full.docx which includes two math questions; the docx embeds some pictures and MathType equations (oleobject). I split the doc according to this and get two files (first.docx, second.docx). first.docx works fine; the second.docx, however, pops up a warning dialog when I try to open it:
"Word found unreadable content in second.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes."
After clicking "Yes", the doc can be opened and the content is also correct. I want to know what is wrong with second.docx. I have checked it with the Open XML SDK 2.5 Productivity Tool, but found no reason. Any help is very much appreciated. Thanks.
The three files have been uploaded here.
Here is some code:
byte[] templateBytes = System.IO.File.ReadAllBytes(TEMPLATE_YANG_FILE);
using (MemoryStream templateStream = new MemoryStream())
{
    templateStream.Write(templateBytes, 0, (int)templateBytes.Length);
    string guidStr = Guid.NewGuid().ToString();
    using (WordprocessingDocument document = WordprocessingDocument.Open(templateStream, true))
    {
        document.ChangeDocumentType(DocumentFormat.OpenXml.WordprocessingDocumentType.Document);
        MainDocumentPart mainPart = document.MainDocumentPart;
        mainPart.Document = new Document();
        Body bd = new Body();
        foreach (DocumentFormat.OpenXml.Wordprocessing.Paragraph clonedParagrph in lst)
        {
            bd.AppendChild<DocumentFormat.OpenXml.Wordprocessing.Paragraph>(clonedParagrph);
            clonedParagrph.Descendants<Blip>().ToList().ForEach(blip =>
            {
                var newRelation = document.CopyImage(blip.Embed, this.wordDocument);
                blip.Embed = newRelation;
            });
            clonedParagrph.Descendants<DocumentFormat.OpenXml.Vml.ImageData>().ToList().ForEach(imageData =>
            {
                var newRelation = document.CopyImage(imageData.RelationshipId, this.wordDocument);
                imageData.RelationshipId = newRelation;
            });
        }
        mainPart.Document.Body = bd;
        mainPart.Document.Save();
    }
    string subDocFile = System.IO.Path.Combine(this.outDir, guidStr + ".docx");
    this.subWordFileLst.Add(subDocFile);
    File.WriteAllBytes(subDocFile, templateStream.ToArray());
}
The lst contains Paragraphs cloned from the original docx using:
(DocumentFormat.OpenXml.Wordprocessing.Paragraph)p.Clone();
Using the Productivity Tool, I found that oleobjectx.bin was not copied, so I added the code below after copying the Blip and ImageData elements:
clonedParagrph.Descendants<OleObject>().ToList().ForEach(ole =>
{
    var newRelation = document.CopyOleObject(ole.Id, this.wordDocument);
    ole.Id = newRelation;
});
Solved the issue.
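For readers wondering about CopyImage/CopyOleObject: those are the poster's own helper methods, not part of the Open XML SDK. A hedged sketch of what a CopyOleObject-style extension method could look like (an assumption about its shape, not the poster's actual code):

using System.IO;
using DocumentFormat.OpenXml.Packaging;

public static class WordCopyExtensions
{
    public static string CopyOleObject(this WordprocessingDocument target,
                                       string sourceRelId,
                                       WordprocessingDocument source)
    {
        // Look up the embedded object part (oleObjectX.bin) in the source document.
        var sourcePart = (EmbeddedObjectPart)source.MainDocumentPart.GetPartById(sourceRelId);

        // Create a matching part in the target and copy the binary payload across.
        var targetPart = target.MainDocumentPart.AddEmbeddedObjectPart(sourcePart.ContentType);
        using (Stream data = sourcePart.GetStream())
        {
            targetPart.FeedData(data);
        }
        return target.MainDocumentPart.GetIdOfPart(targetPart);
    }
}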
Is it possible to concatenate a number of PDF/A files (with possibly different conformance levels: some PDF/A-1b, some PDF/A-3b, etc.) into a single PDF/A?
I was thinking that using the latest level (3a or 3b) would be OK, but I get errors when validating with veraPDF:
Here is my code:
public static byte[] CreateConformantCopy(List<byte[]> sourcePdfs)
{
    var version = PdfVersion.PDF_1_7;
    var type = PdfAType.PDF_A_3B;

    WriterProperties wp = new WriterProperties();
    wp.UseSmartMode();
    wp.SetPdfVersion(version.ToPdfVersion());

    PdfOutputIntent oi = new PdfOutputIntent("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", Assembly.GetExecutingAssembly().GetManifestResourceStream("xxx.Resources.sRGB_CS_profile.icm"));

    using (var mergedPdf = new MemoryStream())
    {
        var writer = new PdfWriter(mergedPdf, wp);
        using (PdfADocument newDoc = new PdfADocument(writer, type.ToPdfAConformanceLevel(), oi, new DocumentProperties() { }))
        {
            Document document = new Document(newDoc, PageSize.A4.Rotate());
            newDoc.SetTagged();
            newDoc.GetCatalog().SetLang(new PdfString(Thread.CurrentThread.CurrentUICulture.Name));
            newDoc.GetCatalog().SetViewerPreferences(
                new PdfViewerPreferences()
                    .SetDisplayDocTitle(true)
                    .SetCenterWindow(true)
            );
            PdfMerger merger = new PdfMerger(newDoc);

            for (int k = 0; k < sourcePdfs.Count; k++)
            {
                using (var inDoc = PdfHelper.GetDocument(sourcePdfs[k]))
                {
                    var numberOfPages = inDoc.GetNumberOfPages();
                    merger.Merge(inDoc, 1, numberOfPages);
                }
            }

            newDoc.Close();
        }
        return mergedPdf.ToArray();
    }
}
PDF/A-1 and PDF/A-2 have several differences in their requirements, so merging them together might not be possible. Looking at your validation errors, I think this is exactly the case. For example, the very first one is about XMP metadata: PDF/A-2 is stricter here, and you get this error because your first file (which is probably a valid PDF/A-1) does not actually satisfy the PDF/A-2 rules.
What is possible, however, is to attach a PDF/A-1 document to a PDF/A-2 one. This does not even require the use of PDF/A-3, which allows arbitrary attachments; the PDF/A-2 standard does allow attaching valid PDF/A-1 (as well as PDF/A-2) documents.
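A hedged sketch of what such an attachment could look like with iText 7 for .NET; the CreateEmbeddedFileSpec overload and AddFileAttachment call are written from memory of the API and may need adjusting, and the AFRelationship value is an assumption:

using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Filespec;
using iText.Pdfa;

static void AttachPdfA1(PdfADocument targetDoc, byte[] pdfA1Bytes, string displayName)
{
    // Embed the PDF/A-1 bytes as a file attachment of the PDF/A-2 document.
    PdfFileSpec spec = PdfFileSpec.CreateEmbeddedFileSpec(
        targetDoc,
        pdfA1Bytes,
        "Attached PDF/A-1 document",    // description
        displayName,                    // file name shown in the viewer
        new PdfName("application/pdf"), // mime type
        null,                           // no extra file parameters
        PdfName.Data);                  // AFRelationship value (assumption)
    targetDoc.AddFileAttachment(displayName, spec);
}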
With the help of some very kind community members here, I managed to programmatically create a function to replace text inside content controls in a Word document using Open XML. After the document is generated, the formatting of the replaced text is removed.
Any ideas on how I can keep the formatting in Word and still remove the content control tags?
This is my code:
using (var wordDoc = WordprocessingDocument.Open(mem, true))
{
    var mainPart = wordDoc.MainDocumentPart;

    ReplaceTags(mainPart, "FirstName", _firstName);
    ReplaceTags(mainPart, "LastName", _lastName);
    ReplaceTags(mainPart, "WorkPhoe", _workPhone);
    ReplaceTags(mainPart, "JobTitle", _jobTitle);

    mainPart.Document.Save();
    SaveFile(mem);
}

private static void ReplaceTags(MainDocumentPart mainPart, string tagName, string tagValue)
{
    //grab all the tag fields
    IEnumerable<SdtBlock> tagFields = mainPart.Document.Body.Descendants<SdtBlock>().Where
        (r => r.SdtProperties.GetFirstChild<Tag>().Val == tagName);

    foreach (var field in tagFields)
    {
        //remove all paragraphs from the content block
        field.SdtContentBlock.RemoveAllChildren<Paragraph>();

        //create a new paragraph containing a run and a text element
        Paragraph newParagraph = new Paragraph();
        Run newRun = new Run();
        Text newText = new Text(tagValue);

        newRun.Append(newText);
        newParagraph.Append(newRun);

        //add the new paragraph to the content block
        field.SdtContentBlock.Append(newParagraph);
    }
}
Keeping the style is a tricky problem as there could be more than one style applied to the text you are trying to replace. What should you do in that scenario?
Assuming a simple case of one style (but potentially spread over many Paragraphs, Runs and Texts), you could keep the first Text element you come across per SdtBlock, place your required value in that element, and then delete any further Text elements from the SdtBlock. The formatting from the first Text element will then be maintained. Obviously you can apply this approach to any of the Text elements; you don't have to use the first. The following code should show what I mean:
private static void ReplaceTags(MainDocumentPart mainPart, string tagName, string tagValue)
{
    IEnumerable<SdtBlock> tagFields = mainPart.Document.Body.Descendants<SdtBlock>().Where
        (r => r.SdtProperties.GetFirstChild<Tag>().Val == tagName);

    foreach (var field in tagFields)
    {
        //materialise the list first so that Remove() below doesn't disturb the enumeration
        IEnumerable<Text> texts = field.SdtContentBlock.Descendants<Text>().ToList();

        for (int i = 0; i < texts.Count(); i++)
        {
            Text text = texts.ElementAt(i);
            if (i == 0)
            {
                text.Text = tagValue;
            }
            else
            {
                text.Remove();
            }
        }
    }
}
I am trying to use the iText library with C# to capture the text portion of PDF files.
I created a PDF from Excel 2013 (exported it) and then copied a sample from the web showing how to use iText (and added the library reference to the project).
It reads the first page perfectly, but it gets garbled info after that: it keeps part of the first page and merges that text with the next page. The commented-out lines are from when I was trying to solve the problem; the string "thePage" is recreated inside the for loop.
Here is the code. I can email the pdf to whoever can help with this issue.
Thanks in advance
public static string ExtractTextFromPdf(string path)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();

    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        //string[] theLines;
        //theLines = new string[COLUMNS];
        //string thePage;

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            string thePage = "";
            thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
            string[] theLines = thePage.Split('\n');
            foreach (var theLine in theLines)
            {
                text.AppendLine(theLine);
            }
            // text.AppendLine(" ");
            // Array.Clear(theLines, 0, theLines.Length);
            // thePage = "";
        }
        return text.ToString();
    }
}
A strategy object collects text data and does not know if a new page has started or not.
Thus, use a new strategy object for each page.
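Applied to the method from the question, that simply means constructing the strategy inside the loop; a minimal sketch:

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // A fresh strategy per page, so its internal buffer does not carry
            // over text collected from the previous pages.
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
        }
        return text.ToString();
    }
}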