This is not exactly a programming question, but I ran into this problem while trying to access a .docx document using Python.
Basically, I manually opened the .docx with Notepad and saved it over itself with UTF-8 encoding (ANSI was the default encoding). Now, when I try to open the document, I see the following message: "We're sorry. We can't open filename because we found a problem with its contents". Clicking on details shows "The file is corrupt and cannot be opened".
It doesn't matter if I save the file as ANSI again; it still won't open. Later I tried with a new document and the same thing happened, and it also happens if I overwrite it with "ANSI" (even though that's the default).
I can still open it with Notepad, so my question is: Is there a way to recover my file or to convert it to a readable document?
I've tried every method in the following link https://learn.microsoft.com/en-US/office/troubleshoot/word/damaged-documents-in-word and none of them worked.
Edit: If I open any MS Word document with Notepad and save it with any encoding, I can no longer open it with MS Word. I don't know why, but if I open the document and erase the first two letters (PK, which I believe stands for a zip document), I can open the file with MS Word, but it shows unreadable characters.
Thank you in advance
Word files are zip archives containing XML, which is already encoded in UTF-8. A zip archive is a binary format, not text, so it has no character encoding. Notepad guesses one anyway, and the guess is wrong. That's why, when you reopen the Word file that you thought you saved as UTF-8, Notepad still thinks it's in ANSI format.
Unfortunately, your file is hosed. It's not even a zip archive anymore, so you can't open it to extract the text from the XML. Best to experiment on a copy next time.
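To illustrate why a round trip through a text editor damages a binary file, here is a minimal Python sketch. It models "ANSI" as Latin-1, which is a simplification (Notepad's real code page and its handling of some bytes differ), and the sample bytes are made up for the demonstration.

```python
# Made-up sample: a ZIP local-file signature followed by arbitrary binary bytes.
original = b"PK\x03\x04\x14\x00\x80\xff\x9c"

# Assumption: model the editor's "ANSI" guess as Latin-1, where every byte
# maps to exactly one character.
as_text = original.decode("latin-1")

# Saving "as UTF-8" re-encodes those characters; every character above
# 0x7F turns into a two-byte sequence.
saved = as_text.encode("utf-8")

print(len(original), len(saved))  # the saved copy is longer
print(saved[:2])                  # the ASCII "PK" prefix survives unchanged
print(saved == original)          # but the data after it no longer matches
```

The ASCII-only "PK" prefix surviving is consistent with the file still looking like a zip at first glance while its contents are shifted and altered.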
I'm having a little problem automating a test with UFT (vbscript). I need to open a csv file, modify it, and then save it again. The problem is that when I open the file in Notepad++, it shows the encoding as "UCS-2 LE BOM". This file is then injected into our system for processing and if I change the encoding to ANSI, the injection will fail because the file seems to lose its column structure, and I'm not sure it is readable for the system anymore.
From what I understand, it's not possible to do this directly with VBScript, but any idea how I could do it with PowerShell, for example? Or is there a Notepad++ command-line option to change the encoding of a file?
Thanks
1) How can I differentiate doc and docx files from requests?
a) For instance, if I have
import requests
url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])
I get this:
application/vnd.openxmlformats-officedocument.wordprocessingml.document
This file is a docx.
b) If I have
url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])
I get this
application/msword
This file is a doc.
2) Are there other options?
3) If I save a docx file as doc or vice versa, might I run into recognition problems (for instance, when converting to PDF)? Is there any kind of best practice for dealing with this?
The MIME headers you get appear to be the correct ones; see "What is a correct mime type for docx, pptx etc?"
However, the sending software can only go on what file its user selected – and there still are a lot of people sending files with the wrong extension. Some software can handle this, others cannot. To see this in action, change the name of a PNG image to end with JPEG instead. I just did on my Mac and Preview still is able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it gets correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)
But after the file is downloaded, you can distinguish between a DOC and a DOCX file, even if the author got its extension wrong.
A DOC file starts with a Microsoft OLE header, which is quite a complicated structure. A DOCX file, on the other hand, is a container format holding lots of smaller XML files, packed together in a standard ZIP archive. Therefore, this file type will always start with the two characters PK.
This check is compatible with Python 2.7 and 3.x (only 3.x actually needs the decode):
import sys

if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    # Read in binary mode: the first two bytes identify a ZIP container.
    with open(sys.argv[1], 'rb') as testMe:
        startBytes = testMe.read(2).decode('latin1')
    print(startBytes)
    if startBytes == 'PK':
        print('This is a DOCX document')
    else:
        print('This is a DOC document')
Technically it will confidently state "This is a DOC document" for anything that does not start with PK, and, conversely, it will say "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you further process the file based on this decision, you may find out it's not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
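A somewhat stricter check is possible with the standard library alone. The sketch below is only one way to do it: the function name `sniff_word_format` and its return labels are made up for illustration, and it assumes the document body lives at the conventional `word/document.xml` path. It tests for the OLE compound-file signature instead of treating everything else as DOC, and it looks inside the ZIP instead of trusting the PK prefix.

```python
import io
import zipfile

# Signature at the start of every OLE compound file (legacy .doc, .xls, ...).
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_word_format(data):
    """Return 'doc?', 'docx', 'zip', or 'unknown' for the given file bytes."""
    if data.startswith(OLE_MAGIC):
        # An OLE container could be a .doc, but also an .xls, .ppt or .msi.
        return "doc?"
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            # Assumption: a Word document keeps its body under this name.
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "zip"
    return "unknown"
```

For a downloaded response this could be called as `sniff_word_format(r.content)`. The 'doc?' label is deliberately tentative, since the signature only proves the container type.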
I know that the Microsoft docx file format is a compressed zip archive. I analyzed it, and I think I can manipulate a document by changing the content of the /word/document.xml file inside this file structure.
But after I zip the folder again and try to open it, MS Word complains with a message like:
"The file ... cannot be opened because its content is causing problems."
What is the correct method to zip the XML files after manipulating them? Or is there something like a checksum I have overlooked?
MS Word complained about my manipulated docx file because the compressed file structure was nested inside another folder, which I had created to edit the document.xml file.
It is important that the XML files sit at the root of the archive, in the same folder structure as in the original file.
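As a rough sketch of that point, the following Python function zips the contents of an extracted folder so that entries such as `word/document.xml` land at the archive root. The function name and its arguments are hypothetical, and it assumes the output file is written outside the folder being zipped.

```python
import os
import zipfile


def zip_docx_folder(folder, output_path):
    """Zip the contents of `folder` so they sit at the root of the archive."""
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                # Store paths relative to `folder`, e.g. "word/document.xml",
                # not "my_folder/word/document.xml".
                relative = os.path.relpath(full_path, folder)
                archive.write(full_path, relative.replace(os.sep, "/"))
```

Listing the result with `zipfile.ZipFile(output_path).namelist()` is a quick way to confirm that no entry starts with the working folder's name.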
A similar question has been asked on StackOverflow here.
My girlfriend is writing a Word document for homework. She's using the old .doc format as required by her teacher ( :'( ).
At some point, the .doc file went from 150 kB to 2.6 MB with no noticeable change (as seen in the Dropbox history; sadly, Word's comparison function fails because Word crashes). From that point on, she was unable to save her document without crashing Word...
I converted the .doc to docx, unzipped it, and found an 18 MB document.xml file!
I can't even format the XML properly because it crashes Notepad++, but I can see that the file is filled with the same XML tag repeated over and over:
<w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>
Do you have any idea what could cause this?
EDIT: Here's the docx
EDIT2: The motivation for this question is more curiosity than looking for a fix. Thanks for your answers though.
If you're willing to edit the XML directly, you can just delete all the empty <w:p> tags and rezip.
If you're good with Python, you might give python-docx a try and use it to delete all empty paragraphs.
Hopefully that will at least recover the work she's done so far.
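For the direct XML route, a minimal sketch using only Python's standard library might look like this. It is a blunt regex pass rather than a real XML parser, the function name is made up, and it removes every self-closing paragraph, including any blank lines that were intentional, so it is best tried on a copy.

```python
import re

# Matches self-closing paragraph elements such as <w:p/> or <w:p w:rsidR="..."/>.
# It does not match <w:pPr/> or paragraphs that have content.
EMPTY_PARAGRAPH = re.compile(r"<w:p(?:\s[^<>]*)?/>")


def strip_empty_paragraphs(xml_text):
    """Return (cleaned_xml, number_of_removed_paragraphs)."""
    return EMPTY_PARAGRAPH.subn("", xml_text)
```

The function works on the text of document.xml; reading that file, writing the cleaned text back, and rezipping are left to the caller.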
Not sure how this would happen, or whether it matters much. Only thing I can think of is a sticking Return key on the keyboard that would insert a huge number of carriage returns. Those each insert a new paragraph. I've actually had that happen occasionally on a Windows virtual machine running on my Mac. No clue why it does it though.
The tag you are talking about is part of the OpenXML format used to build Word documents. OpenXML stores the document as a zipped file, and I am afraid you are seeing the unzipped document.xml file. If you want to keep working with the document, just convert the doc file to docx. Don't unzip it.
I could not find this answer in the man or info pages, nor with a search here or on Google. I have a file which is, in essence, a text file, but it somehow got screwed up upon saving. (I think there are a few strange bytes at the front of the file accidentally.)
I am able to open the file, and it makes sense, using head or cat, but not using any sort of editor.
In the end, all I wish to do is open the file in emacs, delete the "messy" characters, and save it once cleaned up. The file, however, is huge, so I need something powerful like emacs to be able to open it.
Otherwise, I suppose I can try to create a script to read this in line by line, forcing the script to read it in text format, then write it. But I wanted something quick, since I won't be doing this over & over.
Thanks!
Mike
perl -i.bk -pe 's/[^[:ascii:]]//g;' file
Found this perl one liner here: http://www.perlmonks.org/?node_id=619792
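For comparison, roughly the same filtering can be sketched in Python. This is an approximation of the one-liner, not a drop-in replacement: the function name is hypothetical, it copies from one binary stream to another instead of editing in place with a backup, and it works in chunks so a huge file never has to fit in memory.

```python
import io

# Every byte value outside the 7-bit ASCII range.
NON_ASCII = bytes(range(128, 256))


def strip_non_ascii(source, destination, chunk_size=1 << 20):
    """Copy a binary stream, dropping bytes >= 0x80, one chunk at a time."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk.translate(None, NON_ASCII))


# Usage with in-memory streams; for real files, open both in binary mode.
src = io.BytesIO(b"\xef\xbb\xbfhello\x00 w\xf6rld\n")
dst = io.BytesIO()
strip_non_ascii(src, dst)
print(dst.getvalue())
```

Control characters such as NUL are still ASCII, so they pass through untouched; only bytes with the high bit set are removed.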
Try M-x find-file-literally in Emacs.
You could edit the file using hexl-mode, which lets you edit the file in hexadecimal. That would let you see precisely what those offending characters are, and remove them.
It sounds like you either got a different line ending in the file (e.g. carriage returns on a *nix system) or it got saved in an unexpected encoding.
You could use strings to grab the "printable characters" in the file. You might have to play with the --encoding option, though I have only ever used it to grab ASCII strings from executable files.