I need to compare two Word files and merge all insertions into a third one.
I have managed to do that with OpenXML-Power-Tools and WmlCompare, but how do I reject only deletions?
Accepting insertions is easy with OpenXmlPowerTools.RevisionAccepter, but I can't get the rejection of deletions to work. If I could, I would get a merged file without revisions.
Should I take this approach, or would you suggest a different one?
Rules are:
1) Text can be added anywhere in the file.
2) Text is always added to file, never deleted.
3) Files have a .docx file extension
It is relatively easy to reject, or undo, deletions. So, say you have the following sample paragraph with one w:del element that wraps a deleted w:r element.
<w:p>
<w:r>
<w:t xml:space="preserve">This is </w:t>
</w:r>
<w:del w:id="0" w:author="Thomas Barnekow" w:date="2020-02-16T14:37:00Z">
<w:r>
<w:delText xml:space="preserve">deleted </w:delText>
</w:r>
</w:del>
<w:r>
<w:t>text.</w:t>
</w:r>
</w:p>
To reject the deletion, you need to unwrap the deleted w:r and turn the w:delText into a w:t again. Without any further processing (see below), the result of your rejecting the deletion would look like this:
<w:p>
<w:r>
<w:t xml:space="preserve">This is </w:t>
</w:r>
<w:r>
<w:t xml:space="preserve">deleted </w:t>
</w:r>
<w:r>
<w:t>text.</w:t>
</w:r>
</w:p>
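The unwrapping described above can be sketched directly on the raw XML. This is a minimal illustration using Python's standard library, not the Open XML PowerTools implementation; the helper names `w` and `reject_deletions` are made up for this example, and it only covers w:del and w:delText (deleted field codes in w:delInstrText, for instance, are not handled).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def reject_deletions(root):
    """Unwrap each w:del and rename w:delText to w:t, modifying root in place."""
    for parent in list(root.iter()):
        # Walk the children back to front so earlier indexes stay valid.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != w("del"):
                continue
            parent.remove(child)
            for offset, moved in enumerate(list(child)):
                parent.insert(index + offset, moved)
    for deleted_text in root.iter(w("delText")):
        deleted_text.tag = w("t")
    return root


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">This is </w:t></w:r>
  <w:del w:id="0" w:author="Author" w:date="2020-02-16T14:37:00Z">
    <w:r><w:delText xml:space="preserve">deleted </w:delText></w:r>
  </w:del>
  <w:r><w:t>text.</w:t></w:r>
</w:p>
"""

paragraph = reject_deletions(ET.fromstring(SAMPLE))
print("".join(t.text for t in paragraph.iter(w("t"))))  # This is deleted text.
```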
As an optional step, using the MarkupSimplifier of the Open XML PowerTools, you could also coalesce adjacent runs having identical formatting, which would result in the following markup:
<w:p>
<w:r>
<w:t>This is deleted text.</w:t>
</w:r>
</w:p>
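To show what coalescing means at the XML level, here is a naive sketch in Python. It is not the MarkupSimplifier algorithm: it only merges neighbouring runs that consist of a single w:t (plus an optional w:rPr) and whose w:rPr serializes identically. The function names are invented for this example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def is_simple_run(element):
    """True for a w:r holding exactly one w:t and, optionally, a w:rPr."""
    if element.tag != w("r"):
        return False
    tags = [child.tag for child in element]
    return tags in ([w("t")], [w("rPr"), w("t")])


def formatting_key(run):
    """Serialized w:rPr of the run, or empty bytes when it has none."""
    rpr = run.find(w("rPr"))
    return b"" if rpr is None else ET.tostring(rpr).strip()


def coalesce_runs(paragraph):
    """Merge adjacent simple runs with identical formatting, in place."""
    previous = None
    for child in list(paragraph):
        if (previous is not None and is_simple_run(child)
                and formatting_key(child) == formatting_key(previous)):
            target = previous.find(w("t"))
            target.text = (target.text or "") + (child.find(w("t")).text or "")
            # Keep xml:space="preserve" only if edge whitespace needs it.
            if target.text != target.text.strip():
                target.set(XML_SPACE, "preserve")
            else:
                target.attrib.pop(XML_SPACE, None)
            paragraph.remove(child)
        else:
            previous = child if is_simple_run(child) else None
    return paragraph


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">This is </w:t></w:r>
  <w:r><w:t xml:space="preserve">deleted </w:t></w:r>
  <w:r><w:t>text.</w:t></w:r>
</w:p>
"""

paragraph = coalesce_runs(ET.fromstring(SAMPLE))
print(len(paragraph), paragraph.find(w("r")).find(w("t")).text)
```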
I have a field definition in Word 2016 (same problem in Office 365) like this:
<w:r>
<w:fldChar w:fldCharType="begin"/>
<w:instrText xml:space="preserve"> DOCPROPERTY "UohsDate" \* MERGEFORMAT </w:instrText>
<w:fldChar w:fldCharType="separate"/>
</w:r>
After filling the field with docx4j, editing the document in Word, and saving it again, my field definition is split across two runs:
<w:r>
<w:instrText xml:space="preserve"> DOCPROPERTY "UohsDa</w:instrText>
</w:r>
<w:r>
<w:instrText xml:space="preserve">te" \* MERGEFORMAT </w:instrText>
</w:r>
<w:r>
<w:fldChar w:fldCharType="separate"/>
</w:r>
Please, any suggestions? Thanks a lot
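Word is free to split a field instruction across several w:instrText runs, so a reader generally has to reassemble it rather than expect a single element. The sketch below (plain Python on the raw XML, not docx4j) concatenates every w:instrText between a "begin" and the following "separate" or "end" w:fldChar. The function name `field_instructions` is invented here, the sample adds a "begin" run that the excerpt above omits, and nested fields are not handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def field_instructions(root):
    """Return the reassembled instruction text of each complex field."""
    instructions = []
    parts = None  # None while we are outside a field instruction
    for element in root.iter():
        if element.tag == w("fldChar"):
            kind = element.get(w("fldCharType"))
            if kind == "begin":
                parts = []
            elif kind in ("separate", "end") and parts is not None:
                instructions.append("".join(parts).strip())
                parts = None
        elif element.tag == w("instrText") and parts is not None:
            parts.append(element.text or "")
    return instructions


SAMPLE = r"""
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> DOCPROPERTY "UohsDa</w:instrText></w:r>
  <w:r><w:instrText xml:space="preserve">te" \* MERGEFORMAT </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>value</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""

instructions = field_instructions(ET.fromstring(SAMPLE))
print(instructions)
```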
When attempting to get the OOXML from the document, I've found a variance.
On Windows, iPad, and Word Online, take a document with a cross-reference that is not yet broken but will break if you update it (meaning that if you right-click it and press Update, it changes to "Error! Reference source not found."). There, the XML contains the displayed text value:
<w:r>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> REF _Ref274197646 \n \h </w:instrText>
</w:r>
<w:r>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:bookmarkStart w:id="39" w:name="_Read0190056i"/>
<w:bookmarkStart w:id="40" w:name="_Read0190047i"/>
<w:r>
<w:t>(i)</w:t>
</w:r>
<w:bookmarkEnd w:id="39"/>
<w:bookmarkEnd w:id="40"/>
<w:r>
<w:fldChar w:fldCharType="end"/>
</w:r>
However, on the Mac, even though the screen displays the text, the XML shows the broken-reference text:
<w:r>
<w:rPr>
<w:b/>
<w:bCs/>
</w:rPr>
<w:t>Error! Reference source not found.</w:t>
</w:r>
This feels like a bug, as the different versions of Word should report largely the same XML.
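One way to compare what each platform returns is to pull out only the cached field result, that is, the text between the "separate" and "end" w:fldChar markers. A minimal Python sketch, assuming the OOXML has already been obtained as a string; `field_results` is a name made up for this example and nested fields are not handled:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def field_results(root):
    """Return the cached result text of each complex field."""
    results, parts, in_result = [], [], False
    for element in root.iter():
        if element.tag == w("fldChar"):
            kind = element.get(w("fldCharType"))
            if kind == "separate":
                parts, in_result = [], True
            elif kind == "end" and in_result:
                results.append("".join(parts))
                in_result = False
        elif element.tag == w("t") and in_result:
            parts.append(element.text or "")
    return results


SAMPLE = r"""
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> REF _Ref274197646 \n \h </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>(i)</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
"""

results = field_results(ET.fromstring(SAMPLE))
print(results)  # ['(i)']
```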
I turned on Track Changes (revisions) in Word and made some changes. All of the tracked changes show up in the Open XML content, except that I don't see the deleted listnum value, and the listnum values simply continue from the next paragraph. How can I track or get the deleted listnum value in Open XML?
More details on the issue: we have 5 paragraphs with listnums (a) to (e). I turned on Track Changes and deleted listnum value (b), so the second paragraph now has no listnum. I expected to find the value (b) in the Open XML since Track Changes was on, but I can't get the deleted value (b) from it.
Thanks,
Manu
A single bulletpoint may use the following xml. It's a single Paragraph containing the text 'Item1' in a Run. The ParagraphProperties applies the style 'ListParagraph' and refers to a numbering:
<w:p>
<w:pPr>
<w:pStyle w:val="ListParagraph" />
<w:numPr>
<w:ilvl w:val="0" />
<w:numId w:val="1" />
</w:numPr>
</w:pPr>
<w:r>
<w:t>Item1</w:t>
</w:r>
</w:p>
If Track Changes is enabled and I delete the text 'Item1' I get xml like the following:
<w:p>
<w:pPr>
<w:pStyle w:val="ListParagraph" />
<w:pPrChange w:author="Daniel Brixen" w:date="2017-02-16T09:37:00Z" w:id="0">
<w:pPr>
<w:pStyle w:val="ListParagraph" />
<w:numPr>
<w:numId w:val="1" />
</w:numPr>
<w:ind w:hanging="360" />
</w:pPr>
</w:pPrChange>
</w:pPr>
<w:del w:author="Daniel Brixen" w:date="2017-02-16T09:37:00Z" w:id="2">
<w:r>
<w:delText>Item1</w:delText>
</w:r>
</w:del>
</w:p>
Two things to note:
1) The deleted text is in a DeletedRun element (w:del).
2) The change in paragraph properties is recorded by a ParagraphPropertiesChange element (w:pPrChange).
So you should be able to find the deleted text by using something like this:
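using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
// Print the text of every tracked deletion (w:delText) in the main document body.
using (var doc = WordprocessingDocument.Open(@"c:\temp\test.docx", true))
{
    var deletedText = doc.MainDocumentPart.Document.Body.Descendants<DeletedText>();
    Console.WriteLine(String.Join(" ", deletedText.Select(t => t.Text)));
}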
Using Open XML Productivity Tool is helpful when debugging stuff like this.
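The second point, the recorded paragraph-property change, is where a removed list number shows up. As a rough sketch on the raw XML (plain Python rather than the Open XML SDK, with the invented function name `removed_numbering`), one can look for paragraphs that have no w:numPr now but do have one inside w:pPrChange. Note that this yields the old numId and level, not a rendered label such as "(b)": the label is computed by Word from the numbering definitions and is not stored in the paragraph.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def attribute_of(parent, child_name):
    """Return the w:val of a direct child element, or None if it is absent."""
    child = parent.find(w(child_name))
    return None if child is None else child.get(w("val"))


def removed_numbering(root):
    """List paragraphs whose numbering was removed as a tracked change."""
    found = []
    for paragraph in root.iter(w("p")):
        ppr = paragraph.find(w("pPr"))
        if ppr is None or ppr.find(w("numPr")) is not None:
            continue  # no properties, or the paragraph is still numbered
        old_num = ppr.find(f"{w('pPrChange')}/{w('pPr')}/{w('numPr')}")
        if old_num is None:
            continue
        found.append({
            "numId": attribute_of(old_num, "numId"),
            "ilvl": attribute_of(old_num, "ilvl"),
            "deleted_text": "".join(
                t.text or "" for t in paragraph.iter(w("delText"))),
        })
    return found


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:pStyle w:val="ListParagraph"/>
    <w:pPrChange w:author="Author" w:date="2017-02-16T09:37:00Z" w:id="0">
      <w:pPr>
        <w:pStyle w:val="ListParagraph"/>
        <w:numPr><w:numId w:val="1"/></w:numPr>
        <w:ind w:hanging="360"/>
      </w:pPr>
    </w:pPrChange>
  </w:pPr>
  <w:del w:author="Author" w:date="2017-02-16T09:37:00Z" w:id="2">
    <w:r><w:delText>Item1</w:delText></w:r>
  </w:del>
</w:p>
"""

changes = removed_numbering(ET.fromstring(SAMPLE))
print(changes)
```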
I'm looking at an OOXML WordprocessingML document that has bold enabled in the paragraph style, but no indication in the run level style. No other linked styles make any indications about bold status. I expected that inheritance would dictate the text to be bold, but when I view it in Mac Word 2016 the text ("Trailing Text") is unbolded. Why is that?
Here's the example:
<w:p>
<w:pPr>
<w:pStyle w:val="InconsequentialStyle"/>
<w:jc w:val="both"/>
<w:rPr>
<w:b/>
<w:color w:val="000000" w:themeColor="text1"/>
<w:sz w:val="28"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:b/>
<w:color w:val="000000" w:themeColor="text1"/>
<w:sz w:val="28"/>
</w:rPr>
<w:t xml:space="preserve">Leading text: </w:t>
</w:r>
<w:r>
<w:rPr>
<w:color w:val="000000" w:themeColor="text1"/>
<w:sz w:val="28"/>
</w:rPr>
<w:t xml:space="preserve">Trailing Text</w:t>
</w:r>
</w:p>
The text in the first run ("Leading text: ") is bolded by Word, which is my expectation. Does the lack of a <w:b/> element turn the style off? If so, then what's the point of allowing the tag in the paragraph's style? FWIW, Word's formatting does appear to be the formatting preference of the document's author. I just can't figure out why this code is producing the desired effect.
If it helps, this text is not in a table, list or anything else. This <w:p> is a child of <w:body> and <w:docDefaults> doesn't specify anything about bold.
I'm seeing the same behavior with the <w:color> style in a different paragraph, so it's not just toggle styles. Please help me understand how Word is interpreting this code.
If you want to apply direct formatting to a run of text you have to specify that formatting in the run properties of that run – and not in the run properties of the paragraph. That means the behavior you see is correct application behavior.
The run properties of the paragraph are used to format the paragraph mark. Word also uses this formatting for list numbers and bullets.
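The distinction can be made concrete with a small Python sketch that reads the bold toggle in both places: w:pPr/w:rPr (the paragraph mark) and each run's own w:rPr. It looks at direct formatting only; the effective appearance also depends on the style hierarchy, which this sketch ignores. The function names are invented for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def is_on(toggle):
    """Interpret a toggle element such as w:b; a missing w:val means on."""
    if toggle is None:
        return False
    return toggle.get(w("val"), "true") not in ("0", "false", "off")


def direct_bold(paragraph):
    """Return (bold on paragraph mark, [(run text, bold on run), ...])."""
    mark_bold = is_on(paragraph.find(f"{w('pPr')}/{w('rPr')}/{w('b')}"))
    runs = []
    for run in paragraph.findall(w("r")):
        text = "".join(t.text or "" for t in run.iter(w("t")))
        runs.append((text, is_on(run.find(f"{w('rPr')}/{w('b')}"))))
    return mark_bold, runs


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:pStyle w:val="InconsequentialStyle"/>
    <w:rPr><w:b/><w:sz w:val="28"/></w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr><w:b/><w:sz w:val="28"/></w:rPr>
    <w:t xml:space="preserve">Leading text: </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:sz w:val="28"/></w:rPr>
    <w:t xml:space="preserve">Trailing Text</w:t>
  </w:r>
</w:p>
"""

mark_bold, runs = direct_bold(ET.fromstring(SAMPLE))
print(mark_bold, runs)
```

Here the paragraph mark and the first run report bold, while the second run does not, which matches what Word displays.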
I have a Java program that searches for rsidR="00CA303F" inside document.xml (from the unzipped DOCX).
<w:sdtContent>
<w:r w:rsidR="00CA303F">
<w:rPr>
<w:rFonts w:cs="Arial"/>
<w:b/>
<w:sz w:val="18"/>
<w:szCs w:val="18"/>
<w:lang w:val="en-US"/>
</w:rPr>
<w:t>17-Jan-14</w:t>
</w:r>
</w:sdtContent>
The problem: if I change something in the docx, such as the date, and then save the file, this rsidR changes, and my program can no longer find it the next time.
How can I keep it fixed? Or which other stable element can I add to the w:r so that I can find it after saving the file?
Solutions I tried (not working): I added other attributes to this w:r, hoping they would not change, for example w:rsidRDefault, w:id, w:val, and w:rsidRPr, but then Word could not open the docx file.
Neither Word nor the Open XML file format offers a direct way to add an ID to an element that persists when the document is edited.
As a workaround, you can create a character style which you then apply to the run of text you are interested in. Then you can search for the w:rStyle element with the correct character style in the w:val attribute:
<w:r w:rsidRPr="00E05157">
<w:rPr>
<w:rStyle w:val="MyCharacterStyle"/>
</w:rPr>
<w:t>17-Jan-14</w:t>
</w:r>
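Searching for the marker style could look like the following Python sketch, which works on the raw document.xml content. The style name "MyCharacterStyle" comes from the example above; the function name `runs_with_style` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def runs_with_style(root, style_id):
    """Return every w:r whose w:rPr/w:rStyle has the given w:val."""
    matches = []
    for run in root.iter(w("r")):
        style = run.find(f"{w('rPr')}/{w('rStyle')}")
        if style is not None and style.get(w("val")) == style_id:
            matches.append(run)
    return matches


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Date: </w:t></w:r>
  <w:r w:rsidRPr="00E05157">
    <w:rPr><w:rStyle w:val="MyCharacterStyle"/></w:rPr>
    <w:t>17-Jan-14</w:t>
  </w:r>
</w:p>
"""

root = ET.fromstring(SAMPLE)
texts = [run.find(w("t")).text for run in runs_with_style(root, "MyCharacterStyle")]
print(texts)  # ['17-Jan-14']
```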
It should be possible to assign a unique id to the containing w:sdt (in the descendant w:sdtPr/w:id/@w:val). See for example the docx4java documentation for sdtPr.
A good explanation of rsids, and how MS Word uses them, is in "What's up with all those rsid's". In many applications it is harmless to ignore them completely.
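Looking up a content control by that id might be sketched like this in plain Python on the raw XML. The id value 123456789 and the function name `sdt_text_by_id` are assumptions for the example, not values from the original document.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified tag for a WordprocessingML name."""
    return f"{{{W}}}{name}"


def sdt_text_by_id(root, sdt_id):
    """Return the text inside the w:sdt with the given id, or None."""
    for sdt in root.iter(w("sdt")):
        id_element = sdt.find(f"{w('sdtPr')}/{w('id')}")
        if id_element is None or id_element.get(w("val")) != sdt_id:
            continue
        content = sdt.find(w("sdtContent"))
        if content is None:
            return ""
        return "".join(t.text or "" for t in content.iter(w("t")))
    return None


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:sdt>
    <w:sdtPr><w:id w:val="123456789"/></w:sdtPr>
    <w:sdtContent>
      <w:r w:rsidR="00CA303F"><w:t>17-Jan-14</w:t></w:r>
    </w:sdtContent>
  </w:sdt>
</w:p>
"""

root = ET.fromstring(SAMPLE)
print(sdt_text_by_id(root, "123456789"))  # 17-Jan-14
```

The lookup does not depend on the run's rsidR at all, so it keeps working when Word rewrites that attribute on save.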