I can successfully manipulate fields in the header and footer sections of a DOCX document with TinyButStrong (TBS) through this code:
$TBS->PlugIn(OPENTBS_SELECT_HEADER);
$TBS->MergeField('abk', 'ainfo', true);
$TBS->PlugIn(OPENTBS_SELECT_FOOTER);
$TBS->MergeField('abk', 'ainfo', true);
However, this does not work with an ODT file that is just the DOCX file saved in a different format through LibreOffice.
I found out that I can make it work by manually selecting the enclosed file "style.xml", but this seems not the right way to do it as it does not address a document section in the abstract sense:
$TBS->PlugIn(OPENTBS_SELECT_FILE, 'styles.xml');
$TBS->MergeField('abk', 'ainfo', true);
Does anybody have a better solution?
I later found out that a multi-section DOCX document causes a similar problem and only its first section is processed. As Skrol29, the maintainer of openTBS, kindly noted, there is a small bug in version 1.10.0 that prevents proper selection of document parts. As a workaround do this:
$TBS->PlugIn(OPENTBS_SELECT_FILE, 'word/header2.xml', false);
$TBS->MergeField('abk', 'ainfo', true);
Proceed with all the sections (header3, header4...) you have. Note that the footers need extra selection.
Related
One of my customers has reported seeing the following message when working in documents created from templates that I provided:
This file has Custom XML elements that are no longer supported in Word. Saving the file will remove these elements permanently.
I understand that this message relates to Custom XML Markup (https://learn.microsoft.com/en-us/office/troubleshoot/word/custom-xml-elements-not-supported). I can confirm that there is no Custom XML Markup in the document.
Any idea what could be causing this error to display?
I have been using the excellent python-docx package to read, modify, and write Microsoft Word files. The package supports extracting the text from each paragraph. It also allows accessing a paragraph a "run" at a time, where the run is a set of characters that have the same font information. Unfortunately, when you access a paragraph by runs, you lose the links, because the package does not support links. The package also does not support accessing change tracking information.
My problem is that I need to access change tracking information. Or, more specifically, I need to copy paragraphs that have change tracking indicated from one document to another.
I've tried doing this at the XML level. For example, this code snippet appends the contents of file1.docx to file2.docx:
from docx import Document
doc1 = Document("file1.docx")
doc2 = Document("file2.docx")
doc2.element.body.append(doc1.element.body)
doc2.save("file2-appended.docx")
When I try to open the file on my Mac for complicated files, I get this error:
But if I click OK, the contents are there. The manipulation also works without problem for very simple files.
What am I missing?
The .element attribute is really an "internal" interface and should be named ._element. In most other places I have named it that. What you're getting there is the root element of the document part. You can see what it is by calling:
print(doc2.element.xml)
That element has one and only one w:body element below it, which is what you get when with doc2.element.body (.xml will work on that too, btw, if you want to inspect that element).
What your code is doing is appending one body element at the end of another w:body element and thereby forming invalid XML. The WordprocessingML vocabulary is quite strict about what element can follow another and how many and so forth. The only surprise for me is that it actually sometimes works for you, I take it :)
If you want to manipulate the XML directly, which is what the ._element attribute is there for, you need to do it carefully, in view of the (complex) WordprocessingML XML Schema.
Unlike when you stick to the published API, there's no safety net once ._element (or .element) appears in your code.
Inside the body XML can be relationships to external document parts, like images and hyperlinks. These will only be valid within the document in which they appear. This might explain why some files can be repaired.
When I try to using
pdftk my.pdf dump_data_fields >result.txt
have empty data result
Your file my.pdf may not be compatible with pdftk. Convert the file first using the following command:
>pdftk my.pdf output my_converted.pdf
Then try,
>pdftk my_converted.pdf dump_data_fields > result.txt
I've taken this from the following http://www.fpdf.org/en/script/script93.php where the converting process is suggested when the fields won't write to the pdf file so converting before dumping the fields may not help.
If your pdf has fields you it should be fillable in your pdf viewer. If in isn't fillable then it would seem that it has no fields.
This is most likely because the pdf you are using doesn't have any data fields to dump! Use a tool like Adobe Acrobat to open the pdf, go to wherever you need to to Edit Fields, and add fields anywhere you need them to show up. Make sure they are named so you can utilize them by using the attributes[] call in pdftk.
I recommend using snake case (i.e. text box named 'first_name') and then you should have access to it using attributes[:first_name] = 'your text'.
Hope this helps, let me know if you have any other questions/issues.
I am stitching a couple of documents together with a requirement that each document should retain its header and footer information in the final document. Using AltChunk instead of raw OpenXml or DocumentBuilder saves a lot of effort with regards to styles, formatting, references, parts, etc.
Unfortunately, after a couple of days I can't seem to get a 100% working version due to a small and frustrating issue and I need some insight.
My code is loosly based on this article
I modify each sub document, prior to appending it (as an AltChunk) to a working document, by moving the last section properties into the last paragraph (in order to retain header and footer references), but Word seems to be adding a blank paragraph to each of these documents as it renders them in the final document. I end up with:
document 1 with correct header and footer
section properties/break
blank paragraph
document 2 with correct header and footer
section properties/break
blank paragraph
etc.
I cant remove the blank paragraphs afterwards, as I ideally don't want to use WAS to render the document first.
It seems as if you cannot have a next-page section break without a following paragraph?
After further investigation, it seems that will not be away around my usage scenario. I would need to place the last section properties in the body element, but due to my way of processing with nested AltChunk, it would not work.
I have changed my approach completely and went back to a more detailed append procedure using OpenXml Power Tools and some LINQ to Xml.
I'm using Document Builder and works perfectly for me!
var sources = new List<OpenXmlPowerTools.Source>();
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(#tempReportPart1)));
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(#tempReportPart2)));
var outputPath = #"C:\Users\xpto\Documents\TestFolder\myNewDocument.docx";
DocumentBuilder.BuildDocument(sources, outputPath);
I have the similar empty paragraph issue while importing HTML files.
My solution is,
After inserting HTML AltChunk, I add a GUID place holder. After processing the file, I will open the file again, locate the GUID and check if there is a empty paragraph before it, if so remove the empty paragraph and GUID. it seems work perfectly in my solution.
Hope it helps.
is it possible to insert a content control into a Word document, then, get some sort of handle or context to the content control, and then insert HTML into it?
Essentially, the scenario that I am trying to create with the Office JavaScript API is to, upon the user's request, insert a rich text content control, and then populate it with HTML.
I am able to insert the content control from the JavaScript API using the approach suggested at http://social.msdn.microsoft.com/Forums/en-US/appsforoffice/thread/8c4809c7-743c-4388-aef0-bc6a6855c882. It requires a coercionType of ooxml. However, the content that I wish to populate with the ooxml is HTML based. So when I try to insert a content control with the following ooxml:
...Boiler ooxml to create content control...
<w:r><w:t><h1>Test header</h1><h2>Test subheader</h2><p>Test paragraph text</p></w:t></w:r>
The insert attempt fails. I'm assuming that's because you can't mix ooxml and html when inserting this into the document with a coercionType of ooxml.
Since this ooxml approach is the only way you can insert a content control, how can I then set the content control with HTML text? I have looked over the Document object help content at http://msdn.microsoft.com/en-us/library/fp142295.aspx, but I'm unsure how I can do this still, or if it's feasible.
Thanks
though I have not tried this with JS - it should be possible nontheless.
Try adding a altChunk Element, it can contain other open xml or html. I have used it a few times with success.
a few links on the issue:
http://blogs.msdn.com/b/brian_jones/archive/2008/12/08/the-easy-way-to-assemble-multiple-word-documents.aspx
http://blogs.msdn.com/b/ericwhite/archive/2008/10/27/how-to-use-altchunk-for-document-assembly.aspx
U should however try to use "strict"-xml - otherwise the above might not be possible.
I just found this example (sry it's german, but there should be an english version somewhere as well). In which coercionType is used like this:
Office.context.document.setSelectedDataAsync(
booksToRead,
{ coercionType: Office.CoercionType.Html },
function (result) {
// Access the results, if necessary.
});
This might do the trick as well.