Merging documents using OpenXml and section breaks causes empty paragraphs - ms-word

I am stitching a couple of documents together with a requirement that each document should retain its header and footer information in the final document. Using AltChunk instead of raw OpenXml or DocumentBuilder saves a lot of effort with regards to styles, formatting, references, parts, etc.
Unfortunately, after a couple of days I can't seem to get a 100% working version due to a small and frustrating issue and I need some insight.
My code is loosely based on this article.
I modify each sub document, prior to appending it (as an AltChunk) to a working document, by moving the last section properties into the last paragraph (in order to retain header and footer references), but Word seems to be adding a blank paragraph to each of these documents as it renders them in the final document. I end up with:
document 1 with correct header and footer
section properties/break
blank paragraph
document 2 with correct header and footer
section properties/break
blank paragraph
etc.
I can't remove the blank paragraphs afterwards, as I ideally don't want to use WAS to render the document first.
It seems as if you cannot have a next-page section break without a following paragraph?
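For readers unfamiliar with the step described above, moving the body-level section properties into the last paragraph can be sketched roughly as follows. This is only an illustration on a raw document.xml string using Python's standard library, not the code from the question; the function name and the sample XML are made up for the example, and the sketch ignores the case where the last paragraph already carries its own section properties.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag):
    """Return the namespace-qualified name of a WordprocessingML tag."""
    return f"{{{W}}}{tag}"


def move_sectpr_into_last_paragraph(document_xml):
    """Move the body-level w:sectPr into the w:pPr of the last paragraph."""
    root = ET.fromstring(document_xml)
    body = root.find(q("body"))
    sect_pr = body.find(q("sectPr"))    # section properties directly under w:body
    paragraphs = body.findall(q("p"))   # top-level paragraphs only
    if sect_pr is None or not paragraphs:
        return document_xml             # nothing to move
    last_p = paragraphs[-1]
    p_pr = last_p.find(q("pPr"))
    if p_pr is None:
        p_pr = ET.Element(q("pPr"))
        last_p.insert(0, p_pr)          # w:pPr has to be the first child of w:p
    body.remove(sect_pr)
    p_pr.append(sect_pr)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal document: one paragraph followed by the section properties.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
    '<w:sectPr><w:pgSz w:w="11906" w:h="16838"/></w:sectPr>'
    '</w:body></w:document>'
)
result = move_sectpr_into_last_paragraph(SAMPLE)
```

After this transformation the section ends with the paragraph that holds the properties, which is the point where the extra blank paragraph described above shows up once Word renders the chunk.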

After further investigation, it seems there is no way around this in my usage scenario. I would need to place the last section properties in the body element, but because I process the documents as nested AltChunks, that would not work.
I have changed my approach completely and went back to a more detailed append procedure using OpenXml Power Tools and some LINQ to Xml.

I'm using DocumentBuilder from OpenXml Power Tools and it works perfectly for me!
// tempReportPart1 and tempReportPart2 are assumed to hold the paths of the source .docx files
var sources = new List<OpenXmlPowerTools.Source>();
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(tempReportPart1)));
sources.Add(new OpenXmlPowerTools.Source(new WmlDocument(tempReportPart2)));
var outputPath = @"C:\Users\xpto\Documents\TestFolder\myNewDocument.docx";
DocumentBuilder.BuildDocument(sources, outputPath);

I had a similar empty-paragraph issue when importing HTML files.
My solution:
After inserting the HTML AltChunk, I add a GUID placeholder. After processing the file, I open it again, locate the GUID and check whether there is an empty paragraph before it. If so, I remove both the empty paragraph and the GUID. It seems to work perfectly in my solution.
Hope it helps.
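The clean-up pass described in this answer could look roughly like the sketch below. It is not the answerer's code: it works on a document.xml string with Python's standard library, and the marker value, the function names and the definition of a "blank" paragraph (no text, no drawings, no breaks, no section properties) are all assumptions made for the example. It also assumes the AltChunk has already been converted into ordinary paragraphs, which only happens once Word (or a similar renderer) has opened and saved the file.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag):
    """Return the namespace-qualified name of a WordprocessingML tag."""
    return f"{{{W}}}{tag}"


def paragraph_text(p):
    """Concatenate the text of all w:t elements inside a paragraph."""
    return "".join(t.text or "" for t in p.iter(q("t")))


def is_blank(p):
    """Treat a paragraph as blank if it has no text and carries nothing
    that must be preserved (section properties, drawings, breaks)."""
    keep = (q("sectPr"), q("drawing"), q("pict"), q("object"), q("br"))
    if any(el.tag in keep for el in p.iter()):
        return False
    return paragraph_text(p).strip() == ""


def remove_placeholder_and_blank(document_xml, marker):
    """Remove each marker paragraph and a blank paragraph directly before it."""
    root = ET.fromstring(document_xml)
    body = root.find(q("body"))
    children = list(body)
    for index, child in enumerate(children):
        if child.tag != q("p") or paragraph_text(child).strip() != marker:
            continue
        previous = children[index - 1] if index > 0 else None
        if previous is not None and previous.tag == q("p") and is_blank(previous):
            body.remove(previous)
        body.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical marker and sample document for illustration.
MARKER = "3f2b8c1e-0000-4000-8000-000000000000"
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:r><w:t>Imported</w:t></w:r></w:p>'
    '<w:p/>'
    f'<w:p><w:r><w:t>{MARKER}</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>After</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
cleaned = remove_placeholder_and_blank(SAMPLE, MARKER)
remaining = [paragraph_text(p) for p in ET.fromstring(cleaned).iter(q("p"))]
```

Paragraphs that hold section properties are deliberately never treated as blank, since removing them would also drop the header and footer references of that section.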

Related

DOCVARIABLE in ms word Field has disappeared, and yet still appears to be functioning. How can I get it back?

First off, sorry if this is really basic, but I've been working with fields in a Word document for the past few days and I'm finding them quite counterintuitive. I have a document with over 100 images, and I am sourcing those images using the INCLUDEPICTURE field. Inside that field there is a DOCVARIABLE which contains the path to the image. I set this up to display all 1000 images. I then copied this Word file and made a new one because I had a second set of images to display. So I copied and pasted a section of the image name in the field codes and replaced it with a new name, e.g. all "image_a" instances were replaced with "image_b", so instead of seeing "image_a_1.png" and "image_a_2.png", the field codes now show "image_b_1.png" and "image_b_2.png" etc. This successfully retrieved the correct images, so the document looks good.
However, after doing this I noticed that the codes in the fields have changed. Beforehand they appeared like this:
{ INCLUDEPICTURE "{ DOCVARIABLE "var_doc_path" }folderwithpics\\image_a_1.pgn" \d }
Now, however, after the copy and paste, this is what appears:
{ INCLUDEPICTURE "folderwithpics\\image_b_1.pgn" \* MERGEFORMAT \d }
The doc variable is no longer displayed. What's weird is that the correct image is still sourced and displayed in the Word document, so it seems that the DOCVARIABLE, which is essential for the field to reference the correct path, is still active.
There is a problem though: in a new Word document, I need to use INCLUDEPICTURE to source all of the 1000 images again, and they aren't getting displayed. I need to go back and manually enter the full path for each of the images in order for the new Word document to access them.
I think this must have something to do with the fact that the correct path is no longer displayed. Can anyone help me? I think I need to get the document to display { DOCVARIABLE "var_doc_path" } in the INCLUDEPICTURE field again.
As a side note, if anyone has a good guide they can recommend on working with fields, I think that would be a great help. Thanks!
Unless you copied the document via Windows or via SaveAs, rather than simply copying & pasting content from one document to another, the new document will not contain the Document Variable. By using the \d switch, Word is referencing a copy of the image stored in the document metadata rather than the one in the filepath it can no longer access via the DOCVARIABLE field.
FWIW, the \* MERGEFORMAT switch does nothing useful in an INCLUDEPICTURE field.

how to select header section in OpenOffice ODT document with TinyButStrong TBS

I can successfully manipulate fields in the header and footer sections of a DOCX document with TinyButStrong (TBS) through this code:
$TBS->PlugIn(OPENTBS_SELECT_HEADER);
$TBS->MergeField('abk', 'ainfo', true);
$TBS->PlugIn(OPENTBS_SELECT_FOOTER);
$TBS->MergeField('abk', 'ainfo', true);
However, this does not work with an ODT file that is just the DOCX file saved in a different format through LibreOffice.
I found out that I can make it work by manually selecting the enclosed file "styles.xml", but this does not seem the right way to do it, as it does not address a document section in the abstract sense:
$TBS->PlugIn(OPENTBS_SELECT_FILE, 'styles.xml');
$TBS->MergeField('abk', 'ainfo', true);
Does anybody have a better solution?
I later found out that a multi-section DOCX document causes a similar problem and only its first section is processed. As Skrol29, the maintainer of openTBS, kindly noted, there is a small bug in version 1.10.0 that prevents proper selection of document parts. As a workaround do this:
$TBS->PlugIn(OPENTBS_SELECT_FILE, 'word/header2.xml', false);
$TBS->MergeField('abk', 'ainfo', true);
Proceed with all the sections (header3, header4...) you have. Note that the footers need extra selection.

manipulating Microsoft Word DOCX files that have links and track changes using Python

I have been using the excellent python-docx package to read, modify, and write Microsoft Word files. The package supports extracting the text from each paragraph. It also allows accessing a paragraph a "run" at a time, where the run is a set of characters that have the same font information. Unfortunately, when you access a paragraph by runs, you lose the links, because the package does not support links. The package also does not support accessing change tracking information.
My problem is that I need to access change tracking information. Or, more specifically, I need to copy paragraphs that have change tracking indicated from one document to another.
I've tried doing this at the XML level. For example, this code snippet appends the contents of file1.docx to file2.docx:
from docx import Document
doc1 = Document("file1.docx")
doc2 = Document("file2.docx")
doc2.element.body.append(doc1.element.body)
doc2.save("file2-appended.docx")
When I try to open the file on my Mac for complicated files, I get this error:
But if I click OK, the contents are there. The manipulation also works without problem for very simple files.
What am I missing?
The .element attribute is really an "internal" interface and should be named ._element. In most other places I have named it that. What you're getting there is the root element of the document part. You can see what it is by calling:
print(doc2.element.xml)
That element has one and only one w:body element below it, which is what you get with doc2.element.body (.xml will work on that too, btw, if you want to inspect that element).
What your code is doing is appending one body element at the end of another w:body element and thereby forming invalid XML. The WordprocessingML vocabulary is quite strict about what element can follow another and how many and so forth. The only surprise for me is that it actually sometimes works for you, I take it :)
If you want to manipulate the XML directly, which is what the ._element attribute is there for, you need to do it carefully, in view of the (complex) WordprocessingML XML Schema.
Unlike when you stick to the published API, there's no safety net once ._element (or .element) appears in your code.
Inside the body XML can be relationships to external document parts, like images and hyperlinks. These will only be valid within the document in which they appear. This might explain why some files can be repaired.
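To make the structural point concrete, here is a rough sketch of appending the children of one w:body to another instead of nesting the bodies. It is not python-docx API: it uses Python's standard library on raw document.xml strings, and the function name and sample XML are invented for the illustration. It also does nothing about relationships (images, hyperlinks), so content that refers to other parts would still be broken after the copy.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(tag):
    """Return the namespace-qualified name of a WordprocessingML tag."""
    return f"{{{W}}}{tag}"


def append_body_content(target_xml, source_xml):
    """Copy the block-level children of the source body into the target body,
    keeping the target's final w:sectPr as the last child."""
    target_root = ET.fromstring(target_xml)
    source_root = ET.fromstring(source_xml)
    target_body = target_root.find(q("body"))
    source_body = source_root.find(q("body"))

    children = list(target_body)
    # A body-level w:sectPr, if present, has to stay the last child of w:body.
    if children and children[-1].tag == q("sectPr"):
        insert_at = len(children) - 1
    else:
        insert_at = len(children)

    for child in source_body:
        if child.tag == q("sectPr"):
            continue  # skip the source's body-level section properties
        target_body.insert(insert_at, copy.deepcopy(child))
        insert_at += 1
    return ET.tostring(target_root, encoding="unicode")


def _doc(text):
    """Build a hypothetical one-paragraph document for the demonstration."""
    return (
        f'<w:document xmlns:w="{W}"><w:body>'
        f'<w:p><w:r><w:t>{text}</w:t></w:r></w:p>'
        '<w:sectPr/></w:body></w:document>'
    )


merged = append_body_content(_doc("A"), _doc("B"))
merged_body = ET.fromstring(merged).find(q("body"))
merged_tags = [child.tag.split("}")[1] for child in merged_body]
merged_texts = [t.text for t in merged_body.iter(q("t"))]
```

The result still has a single w:body with its section properties at the end, which is what the schema expects.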

Trying to replace content zones in typo3

What I'm aiming to do!
I'm creating a template for a site in typo3, and I'd like to get rid of typo3's default content zones and replace them with my own.
I.E. On the page menu.
to remove left, content, border
and to keep/add. Header. Main. Right.
The problem!
I've found snippets around the web, and bluntly, what I'm expecting to happen, isn't happening. Where every post seems to be "Thank you, great success! ++", the code I paste isn't throwing any errors, and isn't doing anything, well, at all.
My attempt
Via the typo3 documentation http://typo3.org/documentation/snippets/sd/24/
I call mod.SHARED.colPos_list in order to choose the three sections to display
t3lib_extMgm::addPageTSConfig('
mod.SHARED.colPos_list = 0,1,3
');
And I edit the TCA in extTables.php to set them to my specs.
$TCA["tt_content"]["columns"]["colPos"]["config"]["items"] = array (
"1" => array ("Header||Header||||||||","1"),
"0" => array ("Main||Main||||||||","0"),
"3" => array ("Right||Right||||||||","3"),
);
extTables.php is definitely being called, as a die(); placed in it cuts the page.
I've cleared the cache and deleted typo3temp, logged out and in again.
But nothing happens.
My main guess: does this feature have anything to do with templavoila? I removed it as I felt like trying out the new(er) typo3 fluid templating system, and didn't feel that I needed a GUI editor.
Any ideas?
Well - the more pages and content elements you got the more problems you will have to face when using TemplaVoila. Having comma separated values in XML structures saved to a single database field will be a performance killer as soon as you want to collect content from more than one page (uncached teaser menus or the like). Handling of references and "unused elements" is questionable as well. Of course it will work for small to medium sites, but concept wise a clean approach looks different.
Backend layouts are available since TYPO3 4.5 and work flawlessly since they just represent a normalized relation between elements and pages based on colPos. If you need more, Grid Elements will take this principle to the next level, offering even nested structures but still based on normalized relations, which will make your life much easier when it comes to DB cleaning and other maintenance tasks.
Find an introduction to backend layouts here: http://www.youtube.com/watch?v=SsxfNd4TYbk
Instead of removing default columns you can just rename them...
TIP: Use TemplaVoila extension for templating, you'll find much more flexibility there.

MS Word 2007 - How to set up placeholder text to mimic text but not formatting

I'm probably biting off more than I can chew with this particular problem, but I'll try to be as specific as possible in case it's within my scope. Disclaimer: I'm not terribly experienced with MS Word, beyond simple data entry/some formatting, and I have absolutely zero experience working with macros or VBasic. Unfortunately, I'm afraid the solution to my problem will come in the form of one of those last two.
THE GOAL:
What I want to do is to have placeholder text throughout my template document that will change content but not formatting when the first instance of it is changed. Basically, I'm writing a template for support manuals for a software suite. Each app has certain similar features like the menu bar, data entry screen, diagnostic log screen, transaction history, etc., so I am pre-writing those sections and using placeholders when I need to insert certain app specific properties.
I started off using the Insert->Quick Parts->Document Property->Subject tool which I used as a placeholder for the app name. I set the Property to [Subject] and then used Insert->Quick Parts->Field->Subject throughout the document, wherever I needed to include the app name. This worked fine in this case because the app name will always be capitalized. I simply change the text in the first [Subject] (which is content controlled) and update the fields throughout the document, and they all match nicely, easy-peasy, work done, go home and drink beer, right?
Not quite.
Our software handles part tracking via scanners and SQL Server, so while the interface and menu in the apps remains largely unchanged, the parts they track change from app to app. Because of this, I need to change the part name when I reference it within the text of the manuals; for example, if I'm working in ToiletPap.app and our TP is tracked by the roll, I need every mention of [Component] to be changed to roll. If I'm working in LightBulbs.app, I need [Component] to say bulb.
My first efforts went toward creating a custom doc property called Component using the Advanced tab under the Document Properties dropmenu. I then created a plaintext content control around my first [Component] titled Component and made my next [Component] a field with modified code: {COMPONENT \* MERGEFORMAT}. This comes from copying what I can find when [Subject] works. This didn't work at all; updating the text in the first CC doesn't change the Component doc prop, and my fields return "!Undefined Bookmark, COMPONENT".
I got close to what I need by using the [Comments] doc property, set initially to [Component]. I used it just like [Subject], but (this is when I realized that capitalization was going to be an issue) when I mention my [component] in-text, as often as not, I need it to be lowercase instead of upper.
I've looked on MS's forums and a few others as well as here on SO, and I can't find anyone who's trying to do the same thing, much less an answer to how. Please keep in mind when answering, it would be a great help to me if you would include step-by-step instructions on how to enter/implement the code you provide because, as I mentioned, I have no idea how to go about editing macros/VBasic for MS Word.
To restate and summarize my overall question: How can I use a placeholder that displays the text "[Component]" so that, when I change the first instance of [Component] to something else, say "hopper", every subsequent instance of [Component] is updated to hopper but maintains its current capitalization and formatting scheme?
Apologies for the length of the request, but I wanted to make sure I explained the situation as accurately as possible. Thanks in advance for your consideration and responses.
I managed to solve this one after a couple extra hours of tinkering. I didn't need macros or VBasic, either.
On the first instance of [component] I created a plain-text content control to act as a container (not a necessity, but it makes it look nicer. Will likely cause a problem eventually, but for now, it's working as intended) and bookmarked it. Then, for all other instances of [component] I selected each and used Insert->Quick Parts->Field->Ref with the following field code:
REF Text1 \*Lower
Where "Text1" is my bookmark and "\*Lower" indicates all lower case. The \*Lower can be replaced with \*Upper or \*FirstCap to indicate all upper case or capitalize the first letter respectively. Now, each field reflects the text of the first with the capitalization appropriate to each field's location within the document. Just like using the doc prop with [Subject], Ctrl+A followed by F9 is needed to update all fields within the document.
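As a rough illustration of what the three switches mentioned above do to the bookmarked text, here is a small Python emulation. It is only a sketch of the effect and has nothing to do with how Word evaluates fields internally; the function name is made up, and the assumption that FirstCap leaves the rest of the text untouched is mine.

```python
def apply_format_switch(text, switch):
    """Emulate the effect of a Word capitalization switch on a piece of text."""
    name = switch.lstrip("\\* ").lower()   # accept "\*Lower", "\* Lower" or "Lower"
    if name == "lower":
        return text.lower()                # all lower case
    if name == "upper":
        return text.upper()                # all upper case
    if name == "firstcap":
        # Capitalize the first letter only (assumed to leave the rest as typed).
        return text[:1].upper() + text[1:]
    raise ValueError(f"unsupported switch: {switch!r}")


# The same bookmarked word as it would appear in different spots in the manual.
bookmark_text = "Hopper"
mid_sentence = apply_format_switch(bookmark_text, "\\*Lower")
heading = apply_format_switch(bookmark_text, "\\*Upper")
sentence_start = apply_format_switch("hopper", "\\*FirstCap")
```

So a single bookmarked value can appear as "hopper" mid-sentence, "HOPPER" in a heading and "Hopper" at the start of a sentence, depending on the switch used in each REF field.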