merge multiple docx files into a single docx file - openxml

I came across solutions on how to merge docx files using c#:
Append multiple DOCX files together
In this solution, he iterates through files and copies the body "outerxml" to a new document:
XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
newBody.Add(tempBody);
This looks like something specific to the C# API, but I am using Ruby. So far I have been able to edit the docx file and make changes to it by editing "word/document.xml". However, now I need to merge multiple docx files, and I would like to know if there is a specific XML file in Open XML that encompasses an entire document, so that I can copy it into another document.

The main document part (usually at word/document.xml) contains the text of the body of the docx. Headers/footers/comments/footnotes/endnotes are elsewhere.
The problem is that the main document part will often refer to other parts, and you need to manage these references.
Some of these references (eg images, headers, footers) are via "relationships" in the rels part; others are styles, comment ids etc.
If your documents are predictable and simple, you could handle those cases yourself. Otherwise, you'd be better off using http://openxmldeveloper.org/wiki/w/wiki/documentbuilder.aspx (C#) or our commercial MergeDocx component (Java).
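For the "predictable and simple" case, the idea can be sketched roughly as follows. This is a minimal sketch in Python (the same zip-and-XML steps should carry over to Ruby), and the function names append_body and merge_docx are hypothetical. It assumes the source document's body refers to no images, headers, footers, comments or custom styles, so no relationships or ids need fixing. Note also that ElementTree rewrites namespace prefixes on output, which may upset real Word files that rely on mc:Ignorable; a prefix-preserving XML library is probably needed in practice.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DOC_PART = "word/document.xml"


def append_body(target_xml: bytes, source_xml: bytes) -> bytes:
    """Append the body content of source_xml to the body of target_xml."""
    ET.register_namespace("w", W_NS)
    target = ET.fromstring(target_xml)
    source = ET.fromstring(source_xml)
    target_body = target.find(f"{{{W_NS}}}body")
    source_body = source.find(f"{{{W_NS}}}body")
    # The body-level w:sectPr must stay the last child, so detach it,
    # append the new content, then put it back at the end.
    sect_pr = target_body.find(f"{{{W_NS}}}sectPr")
    if sect_pr is not None:
        target_body.remove(sect_pr)
    for child in list(source_body):
        # Skip the source's own body-level section properties.
        if child.tag != f"{{{W_NS}}}sectPr":
            target_body.append(child)
    if sect_pr is not None:
        target_body.append(sect_pr)
    return ET.tostring(target, encoding="utf-8", xml_declaration=True)


def merge_docx(target: bytes, source: bytes) -> bytes:
    """Return a copy of the target package with the source body appended."""
    with zipfile.ZipFile(io.BytesIO(source)) as src:
        source_xml = src.read(DOC_PART)
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(target)) as tgt, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as merged:
        for item in tgt.infolist():
            data = tgt.read(item.filename)
            if item.filename == DOC_PART:
                data = append_body(data, source_xml)
            # Every other part of the target package is copied unchanged.
            merged.writestr(item, data)
    return out.getvalue()
```

Anything the source body references through the rels part (images, hyperlinks and so on) would be left dangling by this sketch, which is exactly the reference management problem described above.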

Related

Migrating from itext2 to itext7

Years ago, I wrote a small app in itext2 to gather reports on a weekly basis and concatenate them into one PDF. The app used com.lowagie.text.pdf.PdfCopy to copy and merge the PDFs. And it worked fine. Performed exactly as expected.
A few weeks ago I looked into migrating the application to itext7. To that end, I used the copyPagesTo method of com.itextpdf.kernel.pdf.PdfDocument. When run on the same file set, this produces warnings like:
WARN PdfNameTree - Name "section.1" already exists in the name tree; old value will be replaced by the new one.
When I click on the link to "section.1" in the first document of the merged PDF, I am taken to "section.1" of the last document. This is not what I expected and not what happens when using the itext2 app. In the PDFs produced by itext2, if I click on the link to "section.1" of the first document in the combined PDF, I am taken to section 1 of the first document.
There is a hint in Javadocs for copyPagesTo saying
If outlines destination names are the same in different documents, all
such outlines will lead to a single location in the resultant
document. In this case iText will log a warning. This can be avoided
by renaming destinations names in the source document.
There is, however, no explanation of how this should be done. I find it odd that this should be necessary in itext7, although it wasn't in itext2.
Is there a simple way to get around this problem?
I've also tried the Sejda desktop app and it produces correct results, but I would prefer to automate the process through a batch script.
My guess is iText 2 didn't even know it might be a problem.
If iText can't deduplicate destination names, the procedure is roughly:
Follow /Catalog -> /Names -> /Dests in each document to find the destination name tree.
Deduplicate the names, by adding suffixes. Remember that a name with a suffix added might be equal to an existing name in the same or another document. Be careful!
Now you can rewrite the destination name trees. Since you have only used suffixes, you can do this in place - the lexicographic ordering of the names is unaltered so the search tree structure is not broken.
Now, rewrite destination links in each PDF for the new names. For example any dictionary entry with key /Dest, or any /D in a /GoTo action.
Now, after all this preprocessing, the files will merge without name clashes.
(I know all this because I've just implemented it for my own PDF software. It's slightly hairy stuff, but not intractable.)
If you would like to test it, I can provide a devel version of cpdf with this functionality.
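The renaming step of the procedure above can be sketched like this. It is only a sketch in plain Python, with no PDF library involved: it assumes the destination names have already been read out of each document's name tree as strings, and the function name and the ".dupN" suffix format are hypothetical. Rewriting the name trees and the /Dest and /D entries with the resulting mapping is not shown.

```python
def dedupe_destination_names(docs):
    """Compute collision-free destination names.

    docs is a list with one list of destination names per PDF.
    Returns one dict per PDF mapping each old name to its new name.
    """
    # Every name that exists anywhere before renaming.
    all_original = {name for names in docs for name in names}
    used = set()
    mappings = []
    for names in docs:
        mapping = {}
        for name in names:
            new_name = name
            counter = 1
            # A suffixed name might equal an existing name in this or
            # another document, so keep trying until it is unique.
            while new_name in used or (
                new_name != name and new_name in all_original
            ):
                new_name = f"{name}.dup{counter}"
                counter += 1
            used.add(new_name)
            mapping[name] = new_name
        mappings.append(mapping)
    return mappings


if __name__ == "__main__":
    first = ["section.1", "section.2"]
    second = ["section.1", "section.1.dup1"]
    for mapping in dedupe_destination_names([first, second]):
        print(mapping)
```

The first document keeps its names, and later documents only receive a suffix where a clash would otherwise occur, so links inside the first document need no rewriting.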

Can OpenXML be used to launch a new Word instance?

I'm able to generate Word documents without issue. I save the resulting *.docx file to a temporary location and then need to launch the file in Word.
The requirement is not to "open" the file in Word (easily done with a Process.Start) but to have it load into Word as a new unsaved file. This is because certain proprietary integrations for Word need to take over when a user saves the file, and they don't kick in if the file is already saved to a location on disk.
I've achieved this by using Interop calls to the Word application, adding the new document to Word's workspace. My problem is with Interop, which tends to break on various client machines, particularly when Office upgrades take place (say a client had 32-bit Office but upgraded to a 64-bit version).
I'm somewhat new to OpenXML, but can it be used to automate Word or is Interop my only real option?
object oFilename = tmpFileName;
object oNewTemplate = false;
object oDocumentType = 0;
object oVisible = true;
Document document = _application.Documents.Add(ref oFilename, ref oNewTemplate, ref oDocumentType, ref oVisible);
No, the Open XML technology has no way of interacting with the Office (Word) application - it's for file creation/manipulation, only. The interop is required in order to do anything with the Word application.
There is sort of a way around this, and it's only possible with Word (no other Office application has this): convert the Open XML content to the OPC flat-file format. This "concatenates" the various parts that make up the zip file into a pure text string, essentially a single XML file.
XML content in the OPC flat-file format can then be written to an already opened (even newly created) Word document using the Range.InsertXML method via "the interop". In a way, this "streams" the Open XML content into the opened Word document.
The problem with this approach is that certain document-level properties are not written to the target document, so not all aspects of the opened document can be changed. For example: page size, orientation, headers, footers... So if this kind of thing also needs to be affected the interop is required for such settings.
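To make the flat-file idea concrete, here is a rough sketch in Python of converting a .docx package to Flat OPC XML. It reflects my understanding of the format (a pkg:package root with one pkg:part per zip entry, holding either pkg:xmlData or base64 pkg:binaryData) and has not been verified against Word itself. The function name docx_to_flat_opc is hypothetical, and ElementTree rewrites namespace prefixes, which may matter for documents that use mc:Ignorable. The resulting string is what would then be handed to Range.InsertXML through the interop.

```python
import base64
import io
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
CONTENT_TYPES = "[Content_Types].xml"


def docx_to_flat_opc(docx_bytes: bytes) -> str:
    """Flatten an Open XML package into one Flat OPC XML string."""
    ET.register_namespace("pkg", PKG_NS)
    package = ET.Element(f"{{{PKG_NS}}}package")
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        # Content types are looked up per part name or per file extension.
        types = ET.fromstring(z.read(CONTENT_TYPES))
        defaults = {d.get("Extension").lower(): d.get("ContentType")
                    for d in types.findall(f"{{{CT_NS}}}Default")}
        overrides = {o.get("PartName"): o.get("ContentType")
                     for o in types.findall(f"{{{CT_NS}}}Override")}
        for name in z.namelist():
            if name == CONTENT_TYPES or name.endswith("/"):
                continue
            part_name = "/" + name
            ext = name.rsplit(".", 1)[-1].lower()
            content_type = overrides.get(part_name) or defaults.get(ext)
            if content_type is None:
                raise ValueError(f"no content type declared for {part_name}")
            part = ET.SubElement(package, f"{{{PKG_NS}}}part", {
                f"{{{PKG_NS}}}name": part_name,
                f"{{{PKG_NS}}}contentType": content_type,
            })
            data = z.read(name)
            if content_type.endswith("xml"):
                # XML parts are embedded as child elements.
                holder = ET.SubElement(part, f"{{{PKG_NS}}}xmlData")
                holder.append(ET.fromstring(data))
            else:
                # Binary parts (images etc.) are embedded as base64 text.
                holder = ET.SubElement(part, f"{{{PKG_NS}}}binaryData")
                holder.text = base64.b64encode(data).decode("ascii")
    body = ET.tostring(package, encoding="unicode")
    return ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
            '<?mso-application progid="Word.Document"?>\n' + body)
```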

Jaspersoft REST API folder traversing, XML PDF extraction

I'm currently trying to retrieve an XML document that lists all the files in a particular folder using the Jaspersoft REST API. Specifically, I want to list the files in folder '02_17_2018', which is a sub-folder under '/LatestJLIPDFs' (see image below).
Basically, I'm trying to build a query of the form http://(host):(port)/jasperserver[-pro]/rest_v2/resources?(parameters) that reaches the '02_17_2018' folder. I've tried several different parameters, none of which seem to work. Here is a list of attempted parameters:
folderURI=/LatestJLIPDFs/02_17_2018&type=file
folderURI=/02_17_2018&type=file
folderURI=/LatestJLIPDFs&type=file&q=02_17_2018
folderURI=LatestJLIPDFs/02_17_2018&type=folder&q=02_17_2018
among many more attempts. Any hints to how the files in '02_17_2018' can be extracted?
I think your problem is related to the parameter name folderURI. In the documentation, it is written folderUri.
Try writing it in camel case, as in my example:
http://localhost:8080/jasperserver-pro/rest_v2/resources?folderUri=/public&type=file&recursive=false
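Applied to the folder from the question, the query string could be built like this. It is a small Python sketch: the helper name resources_url is hypothetical, and it assumes the folder's repository URI really is /LatestJLIPDFs/02_17_2018 and that the server lives at the localhost address used in the example above.

```python
from urllib.parse import urlencode


def resources_url(base_url, folder_uri, resource_type="file", recursive=False):
    """Build a rest_v2/resources search URL for one repository folder."""
    params = {
        "folderUri": folder_uri,  # camel case, not folderURI
        "type": resource_type,
        "recursive": str(recursive).lower(),
    }
    # Keep "/" unescaped so the folder path stays readable in the URL.
    return f"{base_url}/rest_v2/resources?{urlencode(params, safe='/')}"


if __name__ == "__main__":
    print(resources_url("http://localhost:8080/jasperserver-pro",
                        "/LatestJLIPDFs/02_17_2018"))
    # prints http://localhost:8080/jasperserver-pro/rest_v2/resources?folderUri=/LatestJLIPDFs/02_17_2018&type=file&recursive=false
```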

How to make a section optional when mapped to optional data in a Word OpenXml Part?

I'm using the OpenXml SDK to generate Word 2013 files. I'm running on a server (part of a server solution), so automation is not an option.
Basically I have an xml file that is output from a backend system. Here's a very simplified example:
<my:Data
    xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details>
      <my:Name>Customer Template</my:Name>
    </my:Details>
    <my:Orders>
      <my:Count>2</my:Count>
      <my:OrderList>
        <my:Order>
          <my:Id>1</my:Id>
          <my:Date>19/04/2017 10:16:04</my:Date>
        </my:Order>
        <my:Order>
          <my:Id>2</my:Id>
          <my:Date>20/04/2017 10:16:04</my:Date>
        </my:Order>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
Then I use Word's Xml Mapping pane to map this data to content control:
I simply duplicate the word file, and write new Xml data when generating new files.
This is working as expected. When I update the xml part, it reflects the data from my backend.
However, there's a case that does not work. If a customer has no orders, the template content is kept in the document. The XML data is:
<my:Data
    xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details>
      <my:Name>Some customer</my:Name>
    </my:Details>
    <my:Orders>
      <my:Count>0</my:Count>
      <my:OrderList>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
(see the empty order list).
In Word, the xml pane reflects the correct data (meaning no Order node):
But as you can see, the template content is still here.
Basically, I'd like to hide the order list when there's no order (or at least an empty table).
How can I do that?
PS: If it can help, I uploaded the word and xml files, and a small PowerShell script that injects the data : repro.zip
Thanks for sharing your files so we can better help you.
I had a difficult time trying to solve your problem with your existing Word Content Controls, XML files and the PowerShell script that added the XML to the Word document. I found what seemed to be Microsoft's VSTO example solution to your problem, but I couldn't get this to work cleanly.
I was however able to write a simple C# console application that generates a Word file based on your XML data. The OpenXML code to generate the Word file was generated code from the Open XML Productivity Tool. I then added some logic to read your XML file and generate the second table rows dynamically depending on how many orders there are in the data. I have uploaded the code for you to use if you are interested in this solution. Note: The xml data file should be in c:\temp and the generated word files will be in c:\temp also.
Another added bonus to this solution is if you were to add all of the customer data into one XML file, the application will create separate word files in your temp directory like so:
customer_<name1>.docx
customer_<name2>.docx
customer_<name3>.docx
etc.
Here is the document generated from the first xml file
Here is the document generated from the second xml file with the empty row
Hope this helps.
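The core of that logic, reading the orders and deciding whether there is anything to put in the table, can be sketched as below. This is not the uploaded C# code, just an illustration in Python with a hypothetical function name; it only covers reading the data, not writing the Word table.

```python
import xml.etree.ElementTree as ET

NS = {"my": "https://schemas.mycorp.com"}

SAMPLE = """
<my:Data xmlns:my="https://schemas.mycorp.com">
  <my:Customer>
    <my:Details><my:Name>Some customer</my:Name></my:Details>
    <my:Orders>
      <my:Count>0</my:Count>
      <my:OrderList>
      </my:OrderList>
    </my:Orders>
  </my:Customer>
</my:Data>
"""


def order_rows(xml_text):
    """Return one (id, date) tuple per my:Order, or [] when there are none."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall(".//my:OrderList/my:Order", NS):
        rows.append((order.findtext("my:Id", namespaces=NS),
                     order.findtext("my:Date", namespaces=NS)))
    return rows


if __name__ == "__main__":
    rows = order_rows(SAMPLE)
    if rows:
        for order_id, date in rows:
            print(order_id, date)  # one table row per order
    else:
        print("no orders: skip the order table")
```

Counting the my:Order elements rather than trusting my:Count keeps the decision tied to the rows that would actually be rendered.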

Combine two TCPDF documents

I'm using TCPDF to create two separate reports in different parts of my website. I would like the second report to be loaded at the end of the first report.
This is different from importing a PDF file, because the second report is also generated by TCPDF. Is there a way to do this?
I assume from your question that what you ultimately want to provide is one PDF file that consists of the first PDF concatenated with the second PDF.
One quick and dirty solution is to utilize the pdftk command line PDF processor and call it from within your PHP code using the exec() function. The pdftk command has many features and concatenating files is only one of them, but it does an awesome job. Depending on your hosting situation, this may or may not be an option for you.
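As an illustration of the pdftk route, the concatenation call could look like the sketch below. It is written in Python rather than PHP; from PHP, the same command line would be passed to exec(). The file names and function names are hypothetical, and the pdftk binary is assumed to be installed and on the PATH.

```python
import subprocess


def pdftk_cat_command(inputs, output):
    """Return the pdftk argument list that concatenates inputs into output."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    # Equivalent shell command: pdftk in1.pdf in2.pdf cat output out.pdf
    return ["pdftk", *inputs, "cat", "output", output]


def concatenate(inputs, output):
    """Run pdftk; raises CalledProcessError if pdftk reports a failure."""
    subprocess.run(pdftk_cat_command(inputs, output), check=True)


if __name__ == "__main__":
    # Only prints the command; call concatenate() to actually run pdftk.
    print(" ".join(pdftk_cat_command(["report1.pdf", "report2.pdf"],
                                     "combined.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names; in PHP, escapeshellarg() would serve the same purpose.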
The other option would be to use FPDI to import the two PDF files and concatenate them within your PHP code and then send the concatenated version to the user.
More information on using FPDI here:
Merge existing PDF with dynamically generated PDF using TCPDF
Given that you're already using TCPDF, importing the pre-existing file that you want to concatenate with the one you've just created shouldn't be too difficult.
Just add FPDI to your project/composer from:
https://www.setasign.com/products/fpdi/downloads/
You can still use TCPDF.
FPDI supports all the methods of TCPDF: just use new FPDI() instead of new TCPDF() and the result in your report will be the same. After you create your report, merge the files with the code from this page:
https://www.setasign.com/products/fpdi/about/
In a loop, first set the first file as the source, and after that set the second.
If you need help, I am here for you.