manipulating Microsoft Word DOCX files that have links and track changes using Python - ms-word

I have been using the excellent python-docx package to read, modify, and write Microsoft Word files. The package supports extracting the text from each paragraph. It also allows accessing a paragraph a "run" at a time, where the run is a set of characters that have the same font information. Unfortunately, when you access a paragraph by runs, you lose the links, because the package does not support links. The package also does not support accessing change tracking information.
My problem is that I need to access change tracking information. Or, more specifically, I need to copy paragraphs that have change tracking indicated from one document to another.
I've tried doing this at the XML level. For example, this code snippet appends the contents of file1.docx to file2.docx:
from docx import Document
doc1 = Document("file1.docx")
doc2 = Document("file2.docx")
doc2.element.body.append(doc1.element.body)
doc2.save("file2-appended.docx")
When I try to open the file on my Mac for complicated files, I get this error:
But if I click OK, the contents are there. The manipulation also works without problem for very simple files.
What am I missing?

The .element attribute is really an "internal" interface and should be named ._element. In most other places I have named it that. What you're getting there is the root element of the document part. You can see what it is by calling:
print(doc2.element.xml)
That element has one and only one w:body element below it, which is what you get when with doc2.element.body (.xml will work on that too, btw, if you want to inspect that element).
What your code is doing is appending one body element at the end of another w:body element and thereby forming invalid XML. The WordprocessingML vocabulary is quite strict about what element can follow another and how many and so forth. The only surprise for me is that it actually sometimes works for you, I take it :)
If you want to manipulate the XML directly, which is what the ._element attribute is there for, you need to do it carefully, in view of the (complex) WordprocessingML XML Schema.
Unlike when you stick to the published API, there's no safety net once ._element (or .element) appears in your code.
Inside the body XML can be relationships to external document parts, like images and hyperlinks. These will only be valid within the document in which they appear. This might explain why some files can be repaired.

Related

This file has Custom XML elements that are no longer supported in Word. Saving the file will remove these elements permanently

One of my customers has reported seeing the following message when working in documents created from templates that I provided:
This file has Custom XML elements that are no longer supported in Word. Saving the file will remove these elements permanently.
I understand that this message relates to Custom XML Markup (https://learn.microsoft.com/en-us/office/troubleshoot/word/custom-xml-elements-not-supported). I can confirm that there is no Custom XML Markup in the document.
Any idea what could be causing this error to display?

Can exams2moodle export additional metainfo such as idnumber and tags?

When I export the xml file of a multiple choice question, it contains the following lines:
<idnumber>arbitrary_id_set_by_user</idnumber>
<answernumbering>ABCD</answernumbering>
<tag></tag>
Is there a way to add idnumber, answernumbering and tag to the metainformation section of the question so that r-exams can export to moodle XML as <idnumber>idnumber</idnumber>,<answernumbering>ABCD</answernumbering>, <tag>tag1</tag>, and <tag>tag2</tag> etc?
The <answernumbering> tag can be set in exams2moodle() via the answernumbering= argument, see ?exams2moodle. The reason for this is that this is set in the same way for all exercises in a quiz. This is more consistent than setting it individually and potentially inconsistently in the meta-information of the different exercises.
The <idnumber> tag appears to be used by Moodle only for internal purposes. It is also not mentioned in the official Moodle XML documentation at https://docs.moodle.org/311/en/Moodle_XML_format. Hence we did not implement it in exams2moodle().
The <tag> is currently not supported in exams2moodle() because we felt that it would be more important to have tags in the Rmd (or Rnw) exercise itself and not the Moodle version of the exercise. For structuring the content on the Moodle side the exsection meta-information can be used, see boxhist for a worked example.
Finally, you can add arbitrary metainformation by using the exextra tag. This is used, for example, in the essayreg exercise template. However, there is no general way of using this extra metainformation to insert additional XML code in the exams2moodle() output. To do that, the source code underlying exams2moodle() would have to be adapted correspondingly.

Merging documents using OpenXml and section breaks causes empty paragraphs

I am stitching a couple of documents together with a requirement that each document should retain its header and footer information in the final document. Using AltChunk instead of raw OpenXml or DocumentBuilder saves a lot of effort with regards to styles, formatting, references, parts, etc.
Unfortunately, after a couple of days I can't seem to get a 100% working version due to a small and frustrating issue and I need some insight.
My code is loosly based on this article
I modify each sub document, prior to appending it (as an AltChunk) to a working document, by moving the last section properties into the last paragraph (in order to retain header and footer references), but Word seems to be adding a blank paragraph to each of these documents as it renders them in the final document. I end up with:
document 1 with correct header and footer
section properties/break
blank paragraph
document 2 with correct header and footer
section properties/break
blank paragraph
etc.
I cant remove the blank paragraphs afterwards, as I ideally don't want to use WAS to render the document first.
It seems as if you cannot have a next-page section break without a following paragraph?
After further investigation, it seems that will not be away around my usage scenario. I would need to place the last section properties in the body element, but due to my way of processing with nested AltChunk, it would not work.
I have changed my approach completely and went back to a more detailed append procedure using OpenXml Power Tools and some LINQ to Xml.
I'm using Document Builder and works perfectly for me!
var sources = new List<OpenXmlPowerTools.Source>();
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(#tempReportPart1)));
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(#tempReportPart2)));
var outputPath = #"C:\Users\xpto\Documents\TestFolder\myNewDocument.docx";
DocumentBuilder.BuildDocument(sources, outputPath);
I have the similar empty paragraph issue while importing HTML files.
My solution is,
After inserting HTML AltChunk, I add a GUID place holder. After processing the file, I will open the file again, locate the GUID and check if there is a empty paragraph before it, if so remove the empty paragraph and GUID. it seems work perfectly in my solution.
Hope it helps.

MS Word 2007 - How to set up placeholder text to mimic text but not formatting

I'm probably biting off more than I can chew with this particular problem, but I'll try to be as specific as possible in case it's within my scope. Disclaimer: I'm not terribly experienced with MS Word, beyond simple data entry/some formatting, and I have absolutely zero experience working with macros or VBasic. Unfortunately, I'm afraid the solution to my problem will come in the form of one of those last two.
THE GOAL:
What I want to do is to have placeholder text throughout my template document that will change content but not formatting when the first instance of it is changed. Basically, I'm writing a template for support manuals for a software suite. Each app has certain similar features like the menu bar, data entry screen, diagnostic log screen, transaction history, etc., so I am pre-writing those sections and using placeholders when I need to insert certain app specific properties.
I started off using the Insert->Quick Parts->Document Property->Subject tool which I used as a placeholder for the app name. I set the Property to [Subject] and then used Insert->Quick Parts->Field->Subject throughout the document, wherever I needed to include the app name. This worked fine in this case because the app name will always be capitalized. I simply change the text in the first [Subject] (which is content controlled) and update the fields throughout the document, and they all match nicely, easy-peasy, work done, go home and drink beer, right?
Not quite.
Our software handles part tracking via scanners and SQL Server, so while the interface and menu in the apps remains largely unchanged, the parts they track change from app to app. Because of this, I need to change the part name when I reference it within the text of the manuals; for example, if I'm working in ToiletPap.app and our TP is tracked by the roll, I need every mention of [Component] to be changed to roll. If I'm working in LightBulbs.app, I need [Component] to say bulb.
My first efforts went toward creating a custom doc property called Component using the Advanced tab under the Document Properties dropmenu. I then created a plaintext content control around my first [Component] titled Component and made my next [Component] a field with modified code: {COMPONENT * MERGEFORMAT}. This comes from copying what I can find when [Subject] works. This didn't work at all; updating the text in the first CC doesn't change the Content doc prop, and my fields return "!Undefined Bookmark, COMPONENT".
I got close to what I need by using the [Comments] doc property, set initially to [Component]. I used it just like [Subject], but (this is when I realized that capitalization was going to be an issue) when I mention my [component] in-text, as often as not, I need to to be lowercase instead of upper.
I've looked on MS's forums and a few others as well as here on SO, and I can't find anyone who's trying to do the same thing, much less an answer to how. Please keep in mind when answering, it would be a great help to me if you would include step-by-step instructions on how to enter/implement the code you provide because, as I mentioned, I have no idea how to go about editing macros/VBasic for MS Word.
To restate and summarize my overall question: How can I use a placeholder that displays the text "[Component]" so that, when I change the first instance of [Component] to something else, say "hopper", every subsequent instance of [Component] is updated to hopper but maintains its current capitalization and formatting scheme?
Apologies for the length of the request, but I wanted to make sure I explained the situation as accurately as possible. Thanks in advance for your consideration and responses.
I managed to solve this one after a couple extra hours of tinkering. I didn't need macros or VBasic, either.
On the first instance of [component] I created a plain-text content control to act as a container (not a necessity, but it makes it look nicer. Will likely cause a problem eventually, but for now, it's working as intended) and bookmarked it. Then, for all other instances of [container] I selected each and used Insert->Quick Parts->Field->Ref with the following field code:
REF Text1 \*Lower
Where "Text1" is my bookmark and "*Lower" indicates all lower case. The *Lower can be replaced with *Upper or *FirstCap to indicate all upper case or capitalize the first letter respectively. Now, each field reflects the text of the first with the capitalization appropriate to each field's location within the document. Just like using the doc prop with [Subject], ^a -> f9 is needed to update all fields within the document.

Appropriate design of DHTML / multipage CGI form

What is the standard method for implementing a "wizard" using successive web forms?
I'm implementing a CGI that accepts several options, files, etc. But some of these options have dependencies to one another, and allow or require other options to be used.
For example, one type of object that needs to be initialized by the CGI can be created using:
two files of type X
two strings
one file of type Y
In my command line version, I look whether two files of type X, two strings, or one file of type Y is provided, and construct the object in the appropriate manner.
In my CGI, I'd like to do this using multiple pages or DHTML (perhaps a radio button that specifies which arguments the user wishes to provide; changing the radio button will change the form to the right).
Anyway, I have this situation for 3 main groups of arguments. I thought it would be pleasing to the user to create a 6 "page" wizard (think online dating):
Page 1:
"How would you like to specify your proteins of interest?"
radio button:
Two FASTA files
Prefix and suffix strings that match all of my proteins (and match only my proteins)
A text file containing the proteins
Page 2:
"Great! Please choose your (either 'fasta files', 'prefix and suffix strings', or 'text file')."
(appropriate web form)
Unfortunately, if the form is split over different pages, I'm not sure how the 3rd, 4th, etc. pages will know the location of the temporary folder created for the uploaded files from pages 1 and 2.
I'd really appreciate your advice; I have a good command line app, but I am having a difficult time making beautiful interface code that will do what I want. And I'd be shocked if there isn't a very easy standard way to do this with Django or some other framework; it just seems it must come up very frequently.
There's a wizard plugin for jQuery.
http://plugins.jquery.com/project/formwizard
If you don't know jQuery, it is a javascript framework for doing DHTML.
Try the demo at http://thecodemine.org/