Comparing 2 Word documents in Office 2003

We have a Word document (Office 2003) containing bookmarks (the template). The application under test generates a document from it (the final document, also Office 2003): based on the data we enter in the application, the bookmarks get filled and the document gets printed.
So now I need to compare the template with the final document.
What is the best approach to compare the two documents?
Note: I need to compare the margins, fonts, and all other formatting as well.
Initial analysis: I converted the 2003 template and the final document to the Word 2010 format and changed the file extension to .zip. Extracting the archives gave me numerous XML files, and I compared the XML of both documents. That does not add much value for this kind of test, because even when there is a discrepancy, it is really difficult to map the contents of the XML back to the actual contents of the Word document.
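One way to make the extracted XML easier to map back to the document is to reduce it to one readable line per paragraph before diffing. The following is a minimal sketch, not a complete comparison tool: it assumes both files have already been converted to .docx, it only looks at word/document.xml, and it only reports the run font, the font size (w:sz, in half-points), and the page margins (w:pgMar, in twentieths of a point). The function names are made up for this illustration.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}name form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def load_body(source):
    """Parse word/document.xml from a .docx (path or binary file object)."""
    with zipfile.ZipFile(source) as archive:
        return ET.fromstring(archive.read("word/document.xml"))


def paragraph_lines(root):
    """One line per paragraph: its text plus font and size of each run."""
    lines = []
    for para in root.iter(W + "p"):
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        formats = []
        for run in para.iter(W + "r"):
            rpr = run.find(W + "rPr")
            fonts = rpr.find(W + "rFonts") if rpr is not None else None
            size = rpr.find(W + "sz") if rpr is not None else None
            font = fonts.get(W + "ascii") if fonts is not None else None
            half_points = size.get(W + "val") if size is not None else None
            formats.append(f"font={font} sz={half_points}")
        joined = "; ".join(formats)
        lines.append(f"{text!r} [{joined}]")
    return lines


def margin_lines(root):
    """One line per section with its page margins."""
    lines = []
    for pg_mar in root.iter(W + "pgMar"):
        sides = ("top", "bottom", "left", "right")
        values = " ".join(f"{s}={pg_mar.get(W + s)}" for s in sides)
        lines.append("margins " + values)
    return lines


def describe(source):
    """Readable summary of a document: paragraphs first, then margins."""
    root = load_body(source)
    return paragraph_lines(root) + margin_lines(root)


def compare_docx(template, final):
    """Unified diff of the two summaries; an empty list means no difference."""
    return list(difflib.unified_diff(
        describe(template), describe(final),
        "template", "final", lineterm=""))
```

Calling compare_docx("template.docx", "final.docx") (hypothetical file names) returns diff lines in which each changed paragraph is quoted by its visible text, so a discrepancy can be located in Word by searching for that text. Formatting inherited from styles is not resolved here; that would require reading word/styles.xml as well.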

WinMerge has a plugin that handles Word 2003/2007 files.

Related

How to programmatically get the fragments called in an RTF template?

I need to programmatically find the fragments that are called by each RTF template.
So, for example in the figure, I would need to get the "GlossaryTermsAcronyms" fragment for the H2_terms_acronyms template.
I can't seem to find any query or script solution to do this. But this should be possible, right?

Unfortunately that is (almost) impossible.
The information is stored in the t_documents.bincontent column. It is binary encoded RTF.
Somewhere in that RTF there should be a reference to the template fragments that are used.
If you can figure out how to decode the bincontent to get to the actual RTF code of your template, you might have a chance.
Binary fields in EA are usually stored as a zipped text file.
In case the field is included in an xml file (or xml string in the database), it will be base64 encoded.
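A minimal sketch of that decoding step, assuming the field really is a zip archive (optionally wrapped in base64) as described above. This has not been checked against an actual EA repository, and unpack_zipped_field is a name invented for this illustration.

```python
import base64
import io
import zipfile


def unpack_zipped_field(value, is_base64=False):
    """Return {member name: bytes} for a zipped binary field.

    value     -- the raw bytes from the column, or the base64 text
                 taken from an XML export
    is_base64 -- set to True when the value came from XML
    """
    raw = base64.b64decode(value) if is_base64 else value
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

If the assumption holds, one of the returned members should contain the RTF source of the template, which can then be searched for the fragment names as plain text.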

Why aren't word processing program documents stored as plaintext?

Whenever an MS Word (or LibreOffice or other word processor) document is opened in its respective program, the words appear normally on the page, but when the document is opened in a text editor, most of it is Unicode gibberish.
I can understand why the document might have some parts that aren't legible, like bullet points or metadata, but why isn't at least some of the content stored as plaintext? Does every letter get encoded?

The latest Microsoft Word format, docx, is XML containing plain text, compressed with zip. You can unzip the file by renaming .docx to .zip and then open the extracted XML files in Notepad. So the content is stored partially as plain text, just compressed.
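To illustrate, here is a small sketch that opens the zip container directly (no renaming needed) and pulls the visible text out of word/document.xml. It ignores headers, footers, and footnotes, and docx_plain_text is a name made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's {uri}name form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_plain_text(source):
    """Return the body text of a .docx (path or binary file object).

    Each w:p element is a paragraph; the text itself sits in w:t elements.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(paragraphs)
```

The letters are therefore not individually encoded: they appear as ordinary text inside the XML, and only the zip compression makes the file look like gibberish in a text editor.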

I think it is probably a branding thing. If you want, you can export the document to a text file.
If you go to File > Export > Change File Type > Plain Text (*.txt), you can export the document there.

Get rid of Compatibility Mode warning

I use RTF as the default format. Is there a way to keep Word 2013 from showing “[Compatibility Mode]” after the title of every document? It is taking up lots of space in the taskbar and title bar and makes the titles of documents hard to read.

There's no way to suppress this using a setting. RTF hasn't been the "plain text" version of a Word document since Word 2007, when the Word Open XML format took over that role. Any RTF file opens in Compatibility Mode, since the format cannot support the newer Word functionality.
The only possibility you'd have would be to set the Caption property of the Application object, using VBA or another language automating the Word application. But you need to be aware that this is not permanent in any way.

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark with a box around it (I'm not sure of the name of this character). The data actually contains two sequences back to back, so CR+LF CR+LF produces two strange characters:
If I unzip the .docx, take the Custom XML part and do a hex dump, I can clearly see 0d0a 0d0a so the CR+LF is there. Word is just printing it weird.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
' Write the XML as UTF-8 without a byte order mark
Dim docStream As New MemoryStream()
Dim settings As New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False)
Dim docWriter As XmlWriter = XmlWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?

To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work though (with CR/LF, not w:br), if you bind to a plain text control, with multiline set to true.

How to determine the content type of a saved Office document?

The documentation of getFileAsync says the file will always be returned in Office Open XML (OOXML) format (.pptx or .docx).
Since Office 2016 this no longer holds true if one saves the file in OpenDocument format (*.odt).
How can I get the information about the file type? The name ends with *.odt, but in Word 2013 the name also ended with *.odt while the content was transferred as *.docx.
Example:
In the following case, the format of the binary file content cannot be determined:
Create an empty file in Word
Insert your TaskpaneApp
Save the file as *.odt to your PC in Word
Call getFileAsync(Compressed), and
get not docx but odt content in Word 2016, with the name .odt
get docx content in Word 2013, with the name .odt
For Word 2013 I fixed the problem by appending .docx to the provided name. Exactly this fix causes the problems with Word 2016, where the file really is an *.odt.

The input parameter of the getFileAsync method is precisely the file type you need. This is independent of the format in which you saved the file.
Office.js supports 3 file types: compressed (which is docx, pptx, etc.), text (plain text), and PDF. ODT is not a file format supported by the getFileAsync method. Check the article you referred to, to see which formats are supported in which Office applications.
Hope this clarification helps.
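If the returned bytes end up on a server and the name cannot be trusted, the container itself can be inspected. The sketch below is in Python rather than the add-in's JavaScript and assumes the bytes have already been reassembled from the slices. It relies on two package conventions: an OpenDocument file is a zip with a "mimetype" entry, and an OOXML file is a zip with a "[Content_Types].xml" entry. The function name is made up for this illustration.

```python
import io
import zipfile


def sniff_office_container(data):
    """Classify document bytes as 'odf', 'ooxml', or 'unknown'."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "unknown"
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        if "mimetype" in names:
            # e.g. application/vnd.oasis.opendocument.text for .odt
            mimetype = archive.read("mimetype").decode("ascii", "replace").strip()
            if mimetype.startswith("application/vnd.oasis.opendocument."):
                return "odf"
        if "[Content_Types].xml" in names:
            return "ooxml"
    return "unknown"
```

With a check like this, the .docx suffix would only be appended when the result is 'ooxml', which covers both the Word 2013 and the Word 2016 behaviour described in the question.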
