TLDR/Question
How can I best assign unique IDs to (ideally all of) the elements in the XML that describes a Word document, such that I can read/write those unique IDs from a Word (2013) Add-In?
Additionally, solutions describing ways I can get a good diff of two Word documents might be helpful but this is not the primary question.
Background
I'm creating an application-level add-in for Word (2013) using VSTO. Part of my task involves diffing an original Word document W with a modified W' so that I can then process the diff for another task. While Word clearly has the capability for diffs/merges (available in the "Review" panel in Word 2013), thus far I have not been able to find a way to programmatically extract the diffs.
Therefore, I plan to get the XML for the documents (e.g. using Range.WordOpenXML) and diff them. There are a number of published algorithms for diffing XML documents (i.e. Diff(W.XML, W'.XML)) where the accuracy of the diff is largely dependent on being able to properly match the XML elements from the two documents.
Proposed Solution and Its Problems
Therefore, I'd like to be able to assign a unique ID for every element in the XML of the Word document that I can access from my Add-In. In this case a solution would be something like importing a custom namespace into the package called mynamespace and adding the attribute mynamespace:ID=*** for every element in the DOCX package. The attribute would then be accessible via Range.WordOpenXML.
However, simply using mce:Ignorable, mce:ProcessContent, and mce:PreserveAttributes as detailed at http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2012/09/21/markup-compatibility-and-extensibility.aspx does not work. The modified Word document loads without any issues, but I cannot find any of the attributes afterwards, and saving the document removes all of the added markup.
From http://openxmldeveloper.org/discussions/formats/f/13/p/8078/163573.aspx it appears that this process of using custom xml via the Markup Compatibility and Extensibility (MCE) portion of the Office Open XML standard has become complicated over the years (patent issues, etc.). Therefore I'm guessing that my issues arise because Word's XML processor just removes all of the markup that it cannot natively process (maybe there is a way to hook into Word's XML processor and give it custom commands?).
For the future viewers:
1) There is absolutely no way to set any kind of ID on most elements that will survive in Word (you can use any custom tags or attributes, but once MS Word opens the document, they are gone).
2) Only two elements can be used as IDs: content controls, which have IDs, and bookmarks, whose names can serve as IDs (it is possible to make a hidden bookmark by adding an underscore before its name; this works only from code).
3) If tracking changes is enabled in Word, it is absolutely possible to see the diffs in XML, using Range.WordOpenXML and getting actual OpenXml from it, as explained here, for example.
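As a rough illustration of point 3, here is a minimal Python sketch (not the add-in code itself) that pulls tracked insertions and deletions out of WordprocessingML text such as the string returned by Range.WordOpenXML. It assumes the standard w:ins / w:del markup with w:author attributes; the function name and the sample XML are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def tracked_changes(xml_text):
    """Return (kind, author, text) tuples for every w:ins / w:del element."""
    root = ET.fromstring(xml_text)
    changes = []
    # Inserted text lives in w:t, deleted text in w:delText.
    for kind, text_tag in (("ins", "t"), ("del", "delText")):
        for el in root.iter(f"{{{W}}}{kind}"):
            author = el.get(f"{{{W}}}author")
            text = "".join(t.text or "" for t in el.iter(f"{{{W}}}{text_tag}"))
            changes.append((kind, author, text))
    return changes

# Hypothetical fragment with one insertion and one deletion.
SAMPLE = f'''<w:document xmlns:w="{W}"><w:body><w:p>
<w:r><w:t>Hello </w:t></w:r>
<w:ins w:id="1" w:author="Alice"><w:r><w:t>brave </w:t></w:r></w:ins>
<w:del w:id="2" w:author="Bob"><w:r><w:delText>old </w:delText></w:r></w:del>
<w:r><w:t>world</w:t></w:r></w:p></w:body></w:document>'''

for change in tracked_changes(SAMPLE):
    print(change)
```

The sketch lists all insertions first and then all deletions, and it ignores other revision markup (formatting changes, moved text), so treat it as a starting point only.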
Related
We regularly need to perform a handful of relatively simple tests against a bunch of MS Word documents. As these checks are currently done manually, I am striving for a way to automate this. For example:
Check if every page actually has a page number and verify that it is correct.
Verify that a version identifier in the page header is identical across all pages.
Check if the document has a table of contents.
Check if the document has a table of figures.
Check if every figure has a caption.
et cetera. Is this reasonably feasible using PowerShell in conjunction with a Word API?
PowerShell can access Word via its object model/Interop (on Windows, at any rate) and AIUI can also work with the Office Open XML (OOXML) API, so really you should be able to write any checks you want on the document content. What is slightly less obvious is how you verify that the document content will result in a particular "printed appearance". I'm going to start with some comments on the details first.
Just bear in mind that in the following notes I'm just pointing out a few things that you might have to deal with. If you're examining documents produced by an organisation where people are already broadly speaking following the same standards, it may be easier.
Of the 5 examples you give, without checking the details I couldn't say exactly how you would do them, and there could be difficulties with all of them, but for example
Check if every page actually has a page number and verify that it is correct.
Difficult using either OOXML or the object model, because what you would really be checking is that the header for a particular section had a visible { PAGE } field code. Because that field code might be nested inside other fields that in effect say "if X, don't display this field code", it's not so easy to be sure that there would be a page number.
Which is what I mean by checking the document's "printed appearance" - if, for example, you can use the object model to print to PDF and have some mechanism that lets PS inspect the PDF's content, that might be a better approach.
Verify that a version identifier in the page header is identical across all pages.
Similar problem to the above, IMO. It depends partly on how the version identifier might be inserted. Is it just a piece of text? Could it be constructed from a number of fields? Might it reference Document Properties or Variables, or Custom XML content?
Check if the document has a table of contents.
Perhaps enough to look for a TOC field that does not have certain options, such as a \c option that a Table of Figures would contain.
Check if the document has a table of figures.
Perhaps enough to check for a TOC field that does have a \c option, perhaps with a specific parameter such as "Figure"
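To make the two TOC checks concrete, here is a small Python sketch working on the raw document XML rather than through PowerShell. It assumes fields are stored either as w:fldSimple or as w:fldChar/w:instrText runs, and it treats any TOC field with a \c switch as a table of figures, as suggested above. The helper names and the sample XML are invented for the illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"

def field_instructions(xml_text):
    """Collect field instruction strings from simple and complex fields."""
    found, stack = [], []
    for el in ET.fromstring(xml_text).iter():
        if el.tag == q("fldSimple"):
            found.append(el.get(q("instr"), ""))
        elif el.tag == q("fldChar"):
            kind = el.get(q("fldCharType"))
            if kind == "begin":
                stack.append([])          # start collecting a (possibly nested) field
            elif kind == "end" and stack:
                found.append("".join(stack.pop()))
        elif el.tag == q("instrText") and stack:
            stack[-1].append(el.text or "")  # instructions may be split across runs
    return found

def classify_toc(instruction):
    """Return a label for TOC fields, or None for any other field."""
    tokens = instruction.split()
    if not tokens or tokens[0].upper() != "TOC":
        return None
    return "table of figures" if "\\c" in tokens[1:] else "table of contents"

# Hypothetical document: one complex TOC field, one simple table-of-figures field.
SAMPLE = rf'''<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText xml:space="preserve"> TOC \o </w:instrText></w:r>
<w:r><w:instrText xml:space="preserve">"1-3" \h \z \u </w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>Contents</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r></w:p>
<w:p><w:fldSimple w:instr=" TOC \h \z \c &quot;Figure&quot; "><w:r><w:t>Figures</w:t></w:r></w:fldSimple></w:p>
</w:body></w:document>'''

for instruction in field_instructions(SAMPLE):
    print(repr(instruction), "->", classify_toc(instruction))
```

This only tells you the field exists; it says nothing about whether the field is visible or up to date, which is the harder problem discussed below.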
Check if every figure has a caption.
Not sure that you can tell whether a particular image is "a Figure". But if you mean "verify that every graphic object has a caption", you could probably iterate through the inline and floating graphics in the document and verify that there was something that looked like a Word standard caption paragraph within a certain distance of that object. Word has two standard field code patterns for captions AFAIK (one where the chapter number is included and one where it isn't), so you could look for those. You could measure a distance between the image and the caption by ensuring that they were no more than a predefined number of paragraphs apart, or in the case of a floating image, perhaps that the paragraph anchoring the image was no more than so many paragraphs away from the caption.
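The paragraph-distance idea could be sketched roughly like this in Python. The assumptions are mine: a "graphic" is any paragraph containing w:drawing or w:pict, a "caption" is any paragraph containing a SEQ field, and distance is counted in paragraphs in document order (nested paragraphs in tables or text boxes are counted too). All names and the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    return f"{{{W}}}{name}"

def has_graphic(p):
    """True if the paragraph contains an inline/floating drawing or a VML picture."""
    return (p.find(f".//{q('drawing')}") is not None
            or p.find(f".//{q('pict')}") is not None)

def has_caption_field(p):
    """True if any field instruction in the paragraph contains a SEQ token."""
    complex_text = "".join(el.text or "" for el in p.iter(q("instrText")))
    simple_text = " ".join(el.get(q("instr"), "") for el in p.iter(q("fldSimple")))
    return "SEQ" in (complex_text + " " + simple_text).split()

def uncaptioned_graphics(xml_text, max_distance=1):
    """Indices of graphic paragraphs with no caption within max_distance paragraphs."""
    paragraphs = list(ET.fromstring(xml_text).iter(q("p")))
    captions = [i for i, p in enumerate(paragraphs) if has_caption_field(p)]
    return [i for i, p in enumerate(paragraphs)
            if has_graphic(p)
            and not any(abs(i - c) <= max_distance for c in captions)]

# Hypothetical document: a captioned image, then an uncaptioned one.
SAMPLE = rf'''<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:drawing/></w:r></w:p>
<w:p><w:r><w:t>Figure </w:t></w:r><w:fldSimple w:instr=" SEQ Figure \* ARABIC "><w:r><w:t>1</w:t></w:r></w:fldSimple></w:p>
<w:p><w:r><w:t>Some text</w:t></w:r></w:p>
<w:p><w:r><w:drawing/></w:r></w:p>
<w:p><w:r><w:t>More text</w:t></w:r></w:p>
</w:body></w:document>'''

print(uncaptioned_graphics(SAMPLE))
```

For floating images the paragraph found this way is the anchoring paragraph, which matches the suggestion above, but the right value for max_distance is something you would have to tune against your own documents.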
A couple of more general problems that you might have to deal with:
- just because a document contains a certain feature, such as a ToC field, does not mean that it is visible. A TOC field might have been formatted as not visible. Even harder to detect, it could have been formatted as colored white.
- change tracking. You might have to use the Word object model to "accept changes" before checking whether any given feature is actually there or not. Unless you can find existing code that would help you do that using the OOXML representation of the document, that's probably a strong case for doing checks via the object model.
Some final observations
for future checks, perhaps worth noting that in principle you could create a "DocumentInspector" that users could call from Word BackStage to perform checks on a document. Not sure you can force users to run it, or that you could create it in PS, but perhaps a useful tool.
longer term, if you are doing a very large number of checks, perhaps worth considering whether you could train an ML model to try to detect problems.
I want to merge two or more docx files (append them one after another) or move one part of a document (an XWPFParagraph) to another place.
The problem is that lists always break after such an operation. Say we have a list in one document that uses sequence numbers, and a list in another document that uses bullets or letters. After the copy, all of the bullets become numbers (or worse, numbers that continue from where the previous list ended).
I have tried several solutions:
- traversing BodyElements and copying Paragraphs and Tables by hand, like here.
- attaching a newBody to an existing one, like here.
Aside from page-scoped styles, they work well, but the lists never do. Does that mean the list symbols are stored as page-scoped information (otherwise they would be copied successfully with the XWPFParagraph)? If so, why and where?
I have dug into the javadoc: https://poi.apache.org/apidocs/dev/org/apache/poi/xwpf/usermodel/XWPFDocument.html
But I couldn't find anything about lists.
The Word numberings (numbered lists, but also bullet lists) in the Office Open XML file format are stored in /word/numbering.xml of the *.docx ZIP archive. There are abstractNum elements describing the list format and num elements referencing the abstractNum. The numId values of the num elements are referenced in paragraphs of /word/document.xml to set which numbering format shall be used in that paragraph. Paragraphs referencing the same numId are in the same list.
Paragraphs referencing different numId are in different lists.
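To illustrate the relationship between the two parts, here is a small sketch in Python (not apache poi) that resolves each numId to the format of its abstractNum and groups paragraphs by numId. It reads only the first w:lvl of each abstractNum and only numbering set directly on the paragraph (not via a style); the function names and the sample XML are invented for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def num_formats(numbering_xml):
    """Map each numId to the numFmt of the first level of its abstractNum."""
    root = ET.fromstring(numbering_xml)
    abstract = {}
    for a in root.iter(f"{{{W}}}abstractNum"):
        fmt = a.find("w:lvl/w:numFmt", NS)
        abstract[a.get(f"{{{W}}}abstractNumId")] = (
            fmt.get(f"{{{W}}}val") if fmt is not None else None)
    result = {}
    for n in root.iter(f"{{{W}}}num"):
        ref = n.find("w:abstractNumId", NS)
        if ref is not None:
            result[n.get(f"{{{W}}}numId")] = abstract.get(ref.get(f"{{{W}}}val"))
    return result

def paragraphs_by_list(document_xml):
    """Group paragraph texts by the numId referenced in w:pPr/w:numPr."""
    groups = {}
    for p in ET.fromstring(document_xml).iter(f"{{{W}}}p"):
        num = p.find("w:pPr/w:numPr/w:numId", NS)
        if num is None:
            continue
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        groups.setdefault(num.get(f"{{{W}}}val"), []).append(text)
    return groups

# Hypothetical, heavily trimmed numbering.xml and document.xml contents.
NUMBERING = f'''<w:numbering xmlns:w="{W}">
<w:abstractNum w:abstractNumId="0"><w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl></w:abstractNum>
<w:abstractNum w:abstractNumId="1"><w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl></w:abstractNum>
<w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
<w:num w:numId="2"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>'''

def _item(num_id, text):
    return (f'<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="{num_id}"/>'
            f'</w:numPr></w:pPr><w:r><w:t>{text}</w:t></w:r></w:p>')

DOCUMENT = (f'<w:document xmlns:w="{W}"><w:body>'
            + _item(1, "First") + _item(1, "Second") + _item(2, "Dot")
            + '</w:body></w:document>')

print(num_formats(NUMBERING))
print(paragraphs_by_list(DOCUMENT))
```

This also shows why copying only the paragraph loses the list format: the paragraph carries just the numId, while the format lives in the other part.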
In apache poi there are XWPFNumbering representing the document part /word/numbering.xml and XWPFAbstractNum representing the abstractNum.
Until now there is no way to create an XWPFAbstractNum from scratch without using the low-level ooxml-schemas classes.
Also, as far as I know, there is no simple way to merge the /word/numbering.xml document parts of different Word documents, because of the need to handle the different IDs in /word/numbering.xml as well as their occurrences in /word/document.xml. This is very complex and I do not know of any free library that can do this properly.
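To give an idea of what handling the IDs involves, here is a deliberately incomplete Python sketch of just the renumbering step: shifting the IDs of a second document so they cannot collide with those of the first. It is not a working merge. It ignores numbering references in styles.xml, numStyleLink/styleLink, and everything else a real merge needs, and the function name and offsets are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def q(name):
    return f"{{{W}}}{name}"

def shift_numbering_ids(numbering_root, document_root, num_offset, abstract_offset):
    """Shift abstractNumId/numId values in place in both parsed parts."""
    for a in numbering_root.iter(q("abstractNum")):
        a.set(q("abstractNumId"), str(int(a.get(q("abstractNumId"))) + abstract_offset))
    for n in numbering_root.iter(q("num")):
        n.set(q("numId"), str(int(n.get(q("numId"))) + num_offset))
        ref = n.find(q("abstractNumId"))
        if ref is not None:
            ref.set(q("val"), str(int(ref.get(q("val"))) + abstract_offset))
    # Paragraphs reference the list through w:numPr/w:numId.
    for ref in document_root.iter(q("numId")):
        value = int(ref.get(q("val")))
        if value != 0:  # numId 0 means "no numbering" and must stay 0
            ref.set(q("val"), str(value + num_offset))

# Hypothetical, trimmed parts of the second document.
numbering = ET.fromstring(
    f'<w:numbering xmlns:w="{W}"><w:abstractNum w:abstractNumId="0"/>'
    f'<w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num></w:numbering>')
document = ET.fromstring(
    f'<w:document xmlns:w="{W}"><w:body>'
    f'<w:p><w:pPr><w:numPr><w:numId w:val="1"/></w:numPr></w:pPr></w:p>'
    f'<w:p><w:pPr><w:numPr><w:numId w:val="0"/></w:numPr></w:pPr></w:p>'
    f'</w:body></w:document>')

shift_numbering_ids(numbering, document, num_offset=5, abstract_offset=3)
print([el.get(q("val")) for el in document.iter(q("numId"))])
```

In practice the offsets would be derived from the highest IDs already used in the first document, and the shifted abstractNum and num elements would then have to be appended to its numbering.xml in the order the schema expects.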
In general, as far as I know, there is no simple way to merge different Word documents together because of the complex storage in Word file formats. All the approaches available as free code are only halfway useful (traversing and copying), if not outright wrong and useless (simply attaching multiple document bodies one after the other).
I have an application that uses the DocumentFormat.OpenXml API to build a Word document from one or more originating documents, inserting and deleting chunks of data as it goes. In other words, the resulting document will be significantly different from the constituent documents.
I have already successfully created things like Custom Document Properties, Document Variables and Core File Properties.
But: is it possible to get the other metadata items (number of pages, words, paragraphs, etc.) refreshed, without actually having to calculate these?
Thank you to @Cindy Meister for the answer.
I was hoping that there might be some method or other in the DocumentFormat.OpenXML SDK that I could call, but it seems that is not possible.
I'm using a virtual document structure generated by a script to create a document from EA, and I'm trying to use the same template fragment several times with different elements and different headings.
For example, I have an element which describes the input data to one program, and the output data to another program, so I can't really store the information in the element I'm documenting.
Where it is the input, I want one heading (and similar references within the template), and where it is output I want different values for the headings.
I've tried using the ReportTitle tagged value in the individual <model document> element, but this appears to be ignored and only the <report package> value is used throughout (which makes me wonder why they are there in the first place).
I could create multiple templates all referring to the same fragment and hard-code the different headings, but that is messy, and as I already have fragments within fragments, it could result in a lot of almost identical templates and fragments. Variables that I can set for each <model document> would be much preferable.
Has anyone got a better approach than this? Thanks!
I don't think there's an easy solution.
If there is a way to determine based on the element, package, or diagram ID whether you should use one title or the other then you could use a script or SQL fragment to return the correct title.
If that is not the case I guess the only possibility is to hardcode the different titles in your templates.
To avoid too much duplication, you could create a template with only the title and use that on a model document. Since you are generating the model documents by a script anyway, that doesn't cost any user time.
I am trying to split a multi-paged PDF document that contains form fields (check boxes, text boxes). In case it is important to know, the PDF's form fields are initially all blank/empty/un-checked. I have tried using pdftk and pdfsam, but the form fields in the split pages do not work; they appear, but I am unable to use them. I can merge multiple PDF documents that contain form fields, and the final product has all the form fields functioning, so it works with merging.
I read the man page for pdftk, searched forums, and have begun reviewing the source (The source seems to be composed of C++ and Java, which are currently beyond my knowledge base), but I have not located an answer to the question: How can I split a PDF document that contains multiple pages with form fields AND keep the form fields functioning in the split documents?
I would prefer a solution that uses pdftk, and/or any open-source solution.