Huge docx filled with <w:p> tags - ms-word

My girlfriend is writing a Word document for a homework assignment. She's using the old .doc format, as required by her teacher ( :'( ).
At some point, the .doc file went from 150 kB to 2.6 MB with no noticeable change in content (as seen in the Dropbox history; sadly, Word's comparison function is no help because Word crashes). From that point on, she was unable to save her document without crashing Word.
I converted the .doc to .docx, unzipped it, and found an 18 MB document.xml file!
I can't even pretty-print the XML because it crashes Notepad++, but I can see that the file is filled with the same XML tag repeated over and over:
<w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>
Do you have any idea what could cause this?
EDIT: Here's the docx
EDIT2: The motivation for this question is more curiosity than looking for a fix. Thanks for your answers though.

If you're willing to edit the XML directly, you can just delete all the empty <w:p> tags and rezip.
If you're good with Python, you might give python-docx a try and use it to delete all empty paragraphs.
Hopefully that will at least recover the work she's done so far.
I'm not sure how this would happen, or whether it matters much. The only thing I can think of is a stuck Return key on the keyboard inserting a huge number of carriage returns, each of which creates a new paragraph. I've actually had that happen occasionally on a Windows virtual machine running on my Mac. No clue why it does it, though.
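For the "delete the empty <w:p> tags and rezip" route, here is a minimal sketch using only the Python standard library. The function names (strip_empty_paragraphs, clean_docx) are made up for illustration, and the sketch assumes the main part lives at word/document.xml and is UTF-8. It only removes self-closing paragraphs like the one quoted in the question. One caveat: a table cell needs at least one paragraph, so if some of the empty paragraphs sit inside table cells, removing all of them may leave a file Word complains about. Work on a copy.

```python
import re
import zipfile

# Matches only self-closing paragraphs such as <w:p/> or <w:p w:rsidR="..."/>.
# Requiring whitespace (or "/>") right after "<w:p" keeps <w:pPr/> from matching.
EMPTY_PARAGRAPH = re.compile(r'<w:p(?:\s[^<>]*)?/>')


def strip_empty_paragraphs(xml: str) -> str:
    """Return the XML with every self-closing <w:p .../> element removed."""
    return EMPTY_PARAGRAPH.sub('', xml)


def clean_docx(src, dst, part='word/document.xml'):
    """Copy the archive src to dst, cleaning only the main document part."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == part:
                # Assumption: the part is UTF-8, as Word normally writes it.
                cleaned = strip_empty_paragraphs(data.decode('utf-8'))
                data = cleaned.encode('utf-8')
            zout.writestr(item.filename, data)
```

Usage would be something like clean_docx('homework.docx', 'homework-clean.docx'), where both file names are placeholders. A regex is used instead of a full XML parser because the paragraphs in question are self-closing and have no children, and it avoids building a tree for an 18 MB file.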

The tag you are talking about is part of the Open XML format used to build Word documents. Open XML stores the document as a zipped file, and I am afraid what you are seeing is the unzipped document.xml file. If you want to keep working with the document, just convert the .doc file to .docx. Don't unzip it.

Related

Can't open a .docx file after overwriting it with a different encoding

This is not exactly a programming question, but I ran into this problem while trying to access a .docx document using Python.
Basically, I manually opened the .docx with Notepad and saved it over itself with UTF-8 encoding (ANSI was the default encoding). After I did this, if I try to open the document I see the following message: "We're sorry. We can't open filename because we found a problem with its contents". Clicking on Details shows "The file is corrupt and cannot be opened".
It doesn't matter if I save the file as ANSI again; it won't open. Later I tried with a new document and the same thing happened, and it also happens if I save it as "ANSI" (even though that's the default).
I can still open it with Notepad, so my question is: is there a way to recover my file or to convert it to a readable document?
I've tried every single method from the following link https://learn.microsoft.com/en-US/office/troubleshoot/word/damaged-documents-in-word and none of them worked.
Edit: If I open any Word document with Notepad and save it with any encoding, I won't be able to open it with Word anymore. I don't know why, but if I open the document and erase the first two letters (PK, which I believe marks a zip archive), I can open the file with Word, but it shows unreadable characters.
Thank you in advance
Word files are zip archives containing XML, which is already encoded in UTF-8. The zip archive itself is a binary format and has no text encoding. Notepad makes a guess, but it's wrong. That's why, when you reopen the Word file that you thought you saved as UTF-8, Notepad still thinks it's in ANSI format.
Unfortunately, your file is hosed. It's not even a zip archive anymore, so you can't open it to extract the text from the XML. Best to experiment on a copy next time.
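To make the point above concrete, here is a small sketch of what a "decode as text, save as UTF-8" round trip does to zip bytes, plus a read-only way to get at the XML next time. The names reencode_like_text_editor and read_document_xml are invented for this example, Latin-1 is only a stand-in for Notepad's ANSI code page (an assumption, since the real code page depends on the system), and the sample header bytes are illustrative rather than taken from a real file.

```python
import zipfile


def reencode_like_text_editor(raw: bytes) -> bytes:
    """Decode bytes as single-byte text, then save them again as UTF-8."""
    # Latin-1 stands in for Notepad's "ANSI" code page (an assumption).
    return raw.decode('latin-1').encode('utf-8')


def read_document_xml(path) -> str:
    """Read word/document.xml from a .docx without modifying the file."""
    if not zipfile.is_zipfile(path):
        raise ValueError('not a zip archive (possibly already re-encoded)')
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main part is stored as UTF-8 at this path.
        return archive.read('word/document.xml').decode('utf-8')


# A zip local file header starts with "PK\x03\x04". Every byte >= 0x80 that
# follows becomes a two-byte UTF-8 sequence, which shifts all later offsets.
header = b'PK\x03\x04\x14\x00\x06\x00\x08\x00\x00\x00\x21\x00\xdf\xa4\xd2\x6c'
damaged = reencode_like_text_editor(header)
print(len(header), len(damaged))
```

The file still starts with PK after the round trip, which is why it can look intact at a glance even though the sizes and offsets stored in the archive no longer line up with the data.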

Generate .hhk file From Word Document

I am trying to convert an MS Word file to a CHM file. I have a well-organized Word document, but I could not figure out how to turn the Word document, saved as an HTML file, into a CHM file. I know I can add the HTML file to a project I have created, but there is an issue I could not solve: how to convert the Word table of contents into an index file in the HTML Help Workshop program. I would be very happy if someone could provide an example of converting Word documents. (I am trying to achieve this through the HTML Help Workshop program.)
Best regards,
Converting a Word document to CHM format is difficult without special (often expensive) tools, and it has a learning curve.
You should consider whether the PDF format would be sufficient. That said, the CHM format, which is integrated into the Windows operating system, does have some popular features.
I recommend reading through Search and Index not working after converting from Word 2016 to CHM.
As I mentioned in my answer there, I had never used chmProcessor before (because I use other tools), but surprisingly it seems to be a good one for converting Word documents in a simple way.
Please try chmProcessor for your needs. You may want to ask a new question here on SO later.
Edit:
Maybe you have additional interest in the following CodeProject article:
How to Easily Write a User's Guide for Your Application using Different File Extensions
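If you would rather script the index yourself than rely on a converter, the .hhk file is plain HTML made of "text/sitemap" objects, so it can be generated from the headings of the HTML that Word saves. The following is only a rough sketch under several assumptions: the class and function names (HeadingCollector, build_hhk) are invented, it only looks at h1 to h3 headings, it uses the first <a name="..."> inside a heading as the link target if there is one, and I have not verified the output against HTML Help Workshop for a real Word export.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (title, anchor) for each h1-h3 heading in an HTML page."""

    HEADINGS = {'h1', 'h2', 'h3'}

    def __init__(self):
        super().__init__()
        self.entries = []
        self._text = None    # list of text chunks while inside a heading
        self._anchor = None  # first <a name=...> or <a id=...> in the heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._text, self._anchor = [], None
        elif tag == 'a' and self._text is not None and self._anchor is None:
            attrs = dict(attrs)
            self._anchor = attrs.get('name') or attrs.get('id')

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._text is not None:
            # Collapse line breaks and repeated spaces inside the heading.
            title = ' '.join(''.join(self._text).split())
            if title:
                self.entries.append((title, self._anchor))
            self._text = None


def build_hhk(html_text, html_file):
    """Return .hhk content with one index entry per heading in html_text."""
    parser = HeadingCollector()
    parser.feed(html_text)
    parser.close()
    lines = ['<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">',
             '<HTML><HEAD></HEAD><BODY>', '<UL>']
    for title, anchor in parser.entries:
        target = html_file + ('#' + anchor if anchor else '')
        lines.append('<LI><OBJECT type="text/sitemap">')
        lines.append('<param name="Name" value="%s">' % escape(title, quote=True))
        lines.append('<param name="Local" value="%s">' % escape(target, quote=True))
        lines.append('</OBJECT>')
    lines += ['</UL>', '</BODY></HTML>']
    return '\n'.join(lines) + '\n'
```

You would write the returned string to a file such as index.hhk (a placeholder name) and reference it from the project. The contents file (.hhc) uses the same sitemap object syntax with nested lists, so the same collector could probably be reused for it.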

What can I do to recover a UTF-8 binary file?

I somehow had a script running on my company's server that basically did a mongodump and then, for some reason, used recode to convert all the .bson files to UTF-8. Because of that, I can't use mongorestore, as it says every single .bson file is 268 MB.
Is there anything one can do to get the data back from a binary BSON file that was recoded to UTF-8? There's apparently no way to recode it back. Thanks.
OK. This works only on MongoDB, probably, but I'll put it as an answer because it may work for people with this exact problem:
BSON files, while binary, are somewhat readable, depending on your need. In my case, I had a product collection, and most of what I had to update was descriptions and such.
While not a perfect solution, it is possible to just use Notepad++ to turn hex characters into new lines or anything else, and try to parse the resulting file, if you know what you are doing.
Since all fields (name, _id, description) are still there, I recommend turning those into XML headers, for example.
That solved my problem. Thanks.
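Before parsing the files by hand, it may be worth checking whether the recode is reversible at all. That depends entirely on which source charset recode assumed, which the question does not say. The sketch below assumes it was Latin-1 (ISO-8859-1), where every byte maps to exactly one code point and the conversion can be inverted byte for byte. If the files contain the UTF-8 replacement character (bytes EF BF BD), some original bytes were discarded and this approach cannot bring them back. All function names here are invented for the example.

```python
import struct

REPLACEMENT = '\ufffd'.encode('utf-8')  # b'\xef\xbf\xbd'


def undo_latin1_to_utf8(recoded: bytes) -> bytes:
    """Invert a Latin-1 -> UTF-8 recode; raises if that was not the mapping."""
    return recoded.decode('utf-8').encode('latin-1')


def first_document_length(bson: bytes) -> int:
    """Read the little-endian int32 size that starts each BSON document."""
    return struct.unpack_from('<i', bson, 0)[0]


def looks_recoverable(recoded: bytes) -> bool:
    """Heuristic: reversible only if nothing was replaced or out of range."""
    if REPLACEMENT in recoded:
        return False  # U+FFFD means the original byte is gone
    try:
        undo_latin1_to_utf8(recoded)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return True
```

A sanity check after undoing the recode is to compare first_document_length with the data: a mongodump .bson file is a sequence of BSON documents, each starting with its own size, so the first size should be small and the document it describes should end with a zero byte. Try this on a copy of one file first.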

Prevent Word 2010 from saving o:gfxdata base64 or uuencoded VML?

I am working with .docx files containing several drawing canvases with images inserted and some lines and arrows drawn in Word 2010. I am using 2010 format with no compatibility mode.
Word inserts an o:gfxdata attribute into each v:shape and v:group element and fills it with some ASCII-encoded data. From what I have read, it may be a copy of the VML describing the v:shape or v:group. Maybe I just don't know what to look for, but I cannot determine what this data is for, as its removal has no apparent effect on my ability to read or edit the document in Word 2003, 2007, or 2010.
It does swell the document.xml to almost twice the (apparent) necessary size. This considerably slows OpenTBS' processing so I would like to remove it, if possible. Does anyone know of a way to tell Word 2010 to quit saving this extra data? Or what it is for? I have really struggled to find any documentation on it beyond this post.
Edit:
Here is a sample .docx. The document.xml is ~141KB and OpenTBS takes an average of 10.35 seconds to create a file that includes this as a subtemplate 21 times. If I remove all of the o:gfxdata attributes, the file size is reduced to ~37KB and OpenTBS takes only 2.99 seconds to produce the same file.
Edit 2:
After further investigation, it appears that removing o:gfxdata may cause Word 2003 with an older Compatibility Pack installed to reject the file with the following error:
"This is a pre-release version of the Compatibility Pack and can open
pre-release Office 2007 files only. Do you want to check for a newer
version of the Compatibility Pack?"
I have been able to open the file by installing a newer compatibility pack - though it prompts the user about the incompatibility and converts the file in order to open it. This does not damage my file, but it is something to look out for.
The o:gfxdata attribute is poorly documented on the web.
According to your investigations, it's some kind of extra compatibility information.
You can delete those attributes in your template using OpenTBS.
The cleaning can be done once on your template without any merging, and then save the cleaned template as a new template. Or you can perform the cleaning each time you open the template.
Cleaning the DOCX file:
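// Remove every o:gfxdata attribute from the part currently loaded in $TBS->Source.
while ($x = clsTbsXmlLoc::FindStartTagHavingAtt($TBS->Source, 'o:gfxdata', 0)) {
    // Empty the (large) attribute value first...
    $x->ReplaceAtt('o:gfxdata', '');
    // ...then drop the now-empty attribute from the source.
    $TBS->Source = str_replace(' o:gfxdata=""', '', $TBS->Source);
}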
Note that the class clsTbsXmlLoc is provided with OpenTBS and is undocumented.
The code should work with OpenTBS 1.8.0 and later (which is currently in stable beta).
I've noticed that once the o:gfxdata attributes are deleted, they do not come back immediately when you edit the docx.
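If you prefer to clean the template once, outside of OpenTBS, the same idea can be sketched in Python on the text of document.xml. This is not part of OpenTBS. The name strip_gfxdata is made up, and the pattern assumes the attribute value is double-quoted, which is how Word appears to write it.

```python
import re

# Matches one o:gfxdata attribute together with its leading whitespace.
GFXDATA_ATTR = re.compile(r'\s+o:gfxdata="[^"]*"')


def strip_gfxdata(xml: str) -> str:
    """Remove every o:gfxdata attribute from a WordprocessingML string."""
    return GFXDATA_ATTR.sub('', xml)
```

You would read word/document.xml out of the .docx, pass its text through strip_gfxdata, and write it back into a new archive, keeping all other parts unchanged.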

Understanding WordprocessingML tags and avoiding unnecessary tags

I am using the MS Word API to generate a .docx containing data fetched from a DB, to which I apply the respective styles, fonts, symbols, etc. If the amount of data fetched from the DB is quite large, there is a problem displaying it in the .docx file. I found that MS Word 2007 internally writes some content through tags that may not be needed to display the data. So I am trying to figure out which MS Word tags are actually necessary when converting to an .xml file, so that I can avoid the unnecessary ones and build only the tags needed to display the data. I am therefore planning to write my own .xml with only the MS Word tags that are needed, rather than generating an .xml from a .docx file.
My questions are:
1) Is it right that MS Word generates some tags that are not needed when a .docx is converted to document.xml, and that this makes the file heavy? If so, which tags are they, so that I can avoid them when I write my own .xml file?
2) Please send links that explain the MS Word tags and their purpose: which tags are needed and which are not?
3) Is my approach of writing a new .xml similar to document.xml worth pursuing, so that I can build the .xml with only the tags I need and improve the performance of the data display?
Please shed some light on this, and thanks in advance.
Thanks,
Rithu
You'll want to learn WordprocessingML in much more detail to do this. It certainly isn't impossible, but there is quite a learning curve to start with. Probably the best place to start is with this eBook. If you go the manual route, you'll need a zip library. If you're in Visual Studio, you can make writing all of this easier by using the Open XML SDK.
As to your questions on 'unnecessary tags', it's hard to believe that there would be much at all in the file that is unnecessary. But that depends on what you consider not needed. For example, if a word is flagged as misspelled, there will be a "dirty=1" attribute on the Run tag. If you're okay with displaying misspelled words, then that could be considered unnecessary. It really depends on what you're displaying it for and in what.
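To give an idea of how little markup the "manual route" strictly needs, here is a rough Python sketch that writes a .docx with just three parts: the content types, the package relationships, and a document.xml with one plain run per paragraph. The function names (build_document_xml, write_minimal_docx) are made up for this example, there are no styles, fonts, or tables, and I am assuming plain text rows as input. Treat it as a starting point for experiments, not as a replacement for the Open XML SDK.

```python
import zipfile
from xml.sax.saxutils import escape

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType='
    '"application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType='
    '"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

ROOT_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type='
    '"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"'
    ' Target="word/document.xml"/>'
    '</Relationships>'
)


def build_document_xml(paragraphs):
    """Build document.xml with one paragraph and one run per input string."""
    body = ''.join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    return ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
            '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body))


def write_minimal_docx(dst, paragraphs):
    """Write the three parts of a bare-bones .docx package to dst."""
    with zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as package:
        package.writestr('[Content_Types].xml', CONTENT_TYPES)
        package.writestr('_rels/.rels', ROOT_RELS)
        package.writestr('word/document.xml', build_document_xml(paragraphs))
```

For example, write_minimal_docx('rows.docx', ['first row', 'second row']) with a placeholder file name. Comparing a file like this with what Word saves for the same content is a practical way to see which elements and attributes Word adds on its own.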