How to manipulate the content of a MS Word compressed file format - ms-word

I know that the Microsoft .docx file format is a compressed zip archive. I analyzed it, and as far as I can tell I can manipulate a document by changing the content of the /word/document.xml file inside the archive.
But after I zip the folder again and try to open it, MS Word complains with a message like:
"The file ... cannot be opened because its content is causing problems."
What is the correct way to zip the XML files again after editing them? Or is there something like a checksum that I have overlooked?

MS Word complained about my manipulated .docx file because the re-zipped file structure was nested inside another folder, the one I had created to edit the document.xml file.
The XML files must sit at the same root-level positions in the archive as they do in the original file.
A similar question has been asked on StackOverflow here.
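For anyone scripting the re-zip step, here is a minimal Python sketch of the idea, using only the standard zipfile module. The function name zip_folder_contents and the paths in the usage line are made up for illustration; the point is only that each entry name is taken relative to the unpacked folder, so no wrapping folder ends up in the archive.

```python
import zipfile
from pathlib import Path


def zip_folder_contents(src_dir, dest_file):
    """Zip the *contents* of src_dir (hypothetical helper, not an Office API)."""
    src = Path(src_dir)
    with zipfile.ZipFile(dest_file, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Entry names are relative to src, so [Content_Types].xml
                # lands at the top level instead of under a parent folder.
                zf.write(path, path.relative_to(src).as_posix())
```

Usage would look something like zip_folder_contents("unpacked", "edited.docx"). Keep the destination file outside the folder being zipped, otherwise the archive would try to include itself.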

Related

How can I tell with Perl whether I have been given an XLSX or a ZIP file

I have a script which takes two parameters on the command line. One should be the name of a ZIP archive, the other must be an Excel (XLSX) file. Both parameters must be either relative or fully qualified paths.
Usually I use File::Type to check whether a file has the expected format, but for XLSX files the answer is "application/zip". I know this is right, because XLSX files are zipped, but how can I tell whether it is really an Excel file or whether the user made a mistake and passed the ZIP archive as the Excel file?
I also found File::LibMagic, but I can't get it running on Windows.
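Not a Perl answer, but the idea is language-independent: since both files are zip archives, one option is to look at the entry names inside. The sketch below is in Python and assumes that a workbook written by Excel contains [Content_Types].xml and xl/workbook.xml, which is the usual layout but strictly a convention. The function name classify_archive is invented for this example.

```python
import zipfile


def classify_archive(path):
    """Return 'xlsx', 'zip', or 'not-zip' for a path or binary file object."""
    if not zipfile.is_zipfile(path):
        return "not-zip"
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
    # Assumption: Excel workbooks normally carry both of these parts,
    # while an arbitrary zip archive does not.
    if "[Content_Types].xml" in names and "xl/workbook.xml" in names:
        return "xlsx"
    return "zip"
```

The same check should be expressible in Perl with any module that can list the members of a zip archive.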

Can't open a .docx file after overwriting it with a different encoding

This is not exactly a programming question, but I ran into this problem while trying to access a .docx document using Python.
Basically, I manually opened the .docx with Notepad and saved it over itself with UTF-8 encoding (ANSI was the default encoding). Since then, when I try to open the document I see this message: "We're sorry. We can't open filename because we found a problem with its contents". Clicking on Details shows "The file is corrupt and cannot be opened".
It doesn't matter if I save the file with ANSI again; it still won't open. Later I tried with a new document and the same thing happened, and it also happens if I save it as "ANSI" (even though that is the default).
I can still open it with Notepad, so my question is: is there a way to recover my file or to convert it to a readable document?
I've tried every method listed at https://learn.microsoft.com/en-US/office/troubleshoot/word/damaged-documents-in-word and none of them worked.
Edit: If I open any Word document with Notepad and save it with any encoding, I can no longer open it with MS Word. I don't know why, but if I open the document and erase the first two letters (PK, which I believe marks a zip document), I can open the file with MS Word, though it shows unreadable characters.
Thank you in advance
Word files are zip archives containing XML, which is normally already encoded in UTF-8. A zip archive itself is a binary format and has no text encoding. Notepad makes a guess, but the guess is wrong. That's why, when you reopen the Word file that you thought you had saved as UTF-8, Notepad still thinks it's in ANSI format.
Unfortunately, your file is hosed. It's not even a zip archive anymore, so you can't open it to extract the text from the XML. Best to experiment on a copy next time.
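To see why the damage cannot be undone, here is a small Python sketch that approximates "read the bytes as ANSI text, save as UTF-8". It is only an approximation: Notepad's exact behaviour is not reproduced, cp1252 stands in for "ANSI", and the eight sample bytes are a made-up fragment rather than a real document.

```python
def text_round_trip(data: bytes) -> bytes:
    """Approximate 'open as ANSI text, save as UTF-8' on raw bytes."""
    # Bytes that cp1252 does not define (such as 0x8f) become U+FFFD.
    text = data.decode("cp1252", errors="replace")
    return text.encode("utf-8")


# Made-up start of a zip file: the "PK" signature plus some binary values.
original = b"PK\x03\x04\x14\x00\x8f\xff"
mangled = text_round_trip(original)

# Attempt to undo it: U+FFFD has no cp1252 byte, so it turns into "?".
restored = mangled.decode("utf-8").encode("cp1252", errors="replace")

print(original.hex(" "))
print(mangled.hex(" "))
print(restored.hex(" "))
```

The "PK" at the front survives, which matches what the question describes seeing in Notepad, but the bytes after it are longer and different, and converting back does not restore the byte that had no text equivalent.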

Conversion between .xlsx and .zip

I want to manipulate the Office Open XML format of Excel, but even a simple round trip between .zip and .xlsx produces errors:
1. Create a very simple test.xlsx with Excel
2. Right-click test.xlsx => Rename as text.xlsx.zip
3. Right-click text.xlsx.zip => Extract all to a folder named text.xlsx
4. Right-click text.xlsx folder => Send to => Compressed (zipped) folder named text_2.xlsx.zip
5. Right-click text_2.xlsx.zip => Rename as text_2.xlsx
6. Open text_2.xlsx with Excel, then I got the following errors:
Does anyone know what's wrong there?
Xlsx files are just ordinary zip files, and it is definitely possible to do what you are trying to do.
"Does anyone know what's wrong there?"
I would guess step 4:
"Right-click text.xlsx folder => Send to => Compressed (zipped) folder named text_2.xlsx.zip"
You need to zip the contents of the folder and not the folder itself. The resulting zip file should have the [Content_Types].xml file at the top level with no parent folder.
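A quick way to check whether an archive has this problem is to look at where [Content_Types].xml sits among the entry names. The following is a rough Python sketch; package_root is a name invented for this example.

```python
import zipfile


def package_root(path):
    """Return '' if [Content_Types].xml is at the top level, the wrapping
    folder prefix if it is nested, or None if it is missing entirely."""
    marker = "[Content_Types].xml"
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name == marker:
                return ""
            if name.endswith("/" + marker):
                # e.g. "text.xlsx/[Content_Types].xml" -> "text.xlsx/"
                return name[: -len(marker)]
    return None
```

An empty string means the layout is as expected; a result such as "text.xlsx/" means the folder itself was zipped rather than its contents.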
.***x files are zip files, but the method of compression is different from the standard one used by Windows Explorer. Windows Explorer does absolute compression (whatever it can compress safely, it will), whereas MS Office and OpenXML leave necessary pieces uncompressed so the application can use them when the file is read.
Edit: I should add that you can zip the files back up and use them as an xlsx again, but you have to make sure you're using the same compression method as Excel or OpenXML.
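If you want to compare how the original and the re-zipped file were actually compressed, you can list the method recorded for each entry. This is a small Python sketch using the standard zipfile module; compression_methods is a hypothetical helper name, and it only reports what the archive says without judging which method Excel requires.

```python
import zipfile

# Human-readable names for the two methods most likely to show up.
METHOD_NAMES = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}


def compression_methods(path):
    """Map each entry name to the compression method recorded for it."""
    with zipfile.ZipFile(path) as zf:
        return {
            info.filename: METHOD_NAMES.get(info.compress_type, str(info.compress_type))
            for info in zf.infolist()
        }
```

Running it on both test.xlsx and text_2.xlsx and comparing the two dictionaries shows any per-entry differences directly.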

C# folder and subfolder

After numerous searches, I am here to see if someone has any idea how I should go about tackling this issue.
I have a folder with sub-folders. Each sub-folder contains files of different types, e.g. pdf, png, jpeg, tiff, avi and Word documents.
My goal is to write code in C# that goes into each sub-folder and combines all the files into one pdf named after the folder. The only exception is that a file such as an avi will not be converted to pdf, in which case I want a notice telling me which folder it is in and possibly the file name. I am trying to use a form-based approach, so that you can paste in the folder path and also the destination of the created pdf.
Thanks.
To start, create a FolderBrowserDialog to get the root folder. Alternatively, just make a textbox into which you paste the folder name (less preferred, since the first method gives you nicer error handling straight out of the box).
To iterate through the folders, see How to: Iterate Through a Directory Tree.
To find the file type, check System.IO.FileInfo.Extension for each file you iterate through. Add those to a list with the data you need (hint: create a list of objects in which each object holds the data you need, such as path, type, etc.). If it's an avi, don't put it in the list but show a warning (message box?) instead.
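The walk-and-sort step can be sketched as follows. This is Python rather than C#, purely to show the shape of the logic; the function name scan_tree and the CONVERTIBLE extension set are assumptions made for this example, and the warning is represented by a returned list instead of a message box.

```python
import os
from collections import defaultdict

# Assumed set of extensions that can end up in the pdf; adjust as needed.
CONVERTIBLE = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".doc", ".docx"}


def scan_tree(root):
    """Group convertible files by folder and collect everything else."""
    by_folder = defaultdict(list)
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext in CONVERTIBLE:
                by_folder[dirpath].append(full)
            else:
                # e.g. an .avi: report it instead of adding it to the list
                skipped.append(full)
    return dict(by_folder), skipped
```

Each key in the first result is a folder whose files would go into one pdf; the second result lists the files to warn about, with their full paths.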
From here the original question gets fuzzy. What exactly do you need in the pdf: just the file names and locations, or do you want to put the actual contents of each file into the pdf?

iPhone - reading .epub files

I am building an iPhone application that reads .epub files. Where can I find sample applications for unzipping and parsing these files? Can anyone point me to a good link? Thank you in advance.
An .epub file is just a .zip file. It contains a few directory files in XML format and the actual book content is usually XHTML. You can use Objective-Zip to unzip the .epub file and then use NSXMLParser to parse the XML files.
More info: Epub Format Construction Guide
On top of Ole's answer (that's a pretty good how-to guide), it's definitely worth reading the specification for the Open Container Format (OCF); sorry, it's a Word file. It's the formal specification for the zip structure used.
In brief, you parse the file by:
1. Checking that it's plausibly valid by looking for the text 'mimetype' starting at byte 30 and the text 'application/epub+zip' starting at byte 38.
2. Extracting the file META-INF/container.xml from the zip.
3. Parsing that file and extracting the value of the full-path attribute of the first rootfile element in it.
4. Loading the referenced file (the full-path attribute is a URL relative to the root of the zip file).
5. Parsing that file. It contains all the metadata required to reference all the other content (mostly XHTML/CSS/images). In particular, you want to read the contents of the spine element, which lists all content files in reading order.
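The steps above can be sketched roughly as follows. This is Python rather than Objective-C, using only the standard library, and it is a sketch rather than a complete reader: it skips error handling, assumes the usual container and OPF XML namespaces, and assumes that manifest hrefs are relative to the location of the OPF file. The function names are made up for this example.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "{http://www.idpf.org/2007/opf}"


def looks_like_epub(raw: bytes) -> bool:
    """Step 1: check the fixed byte offsets in the raw file data."""
    return raw[30:38] == b"mimetype" and raw[38:58] == b"application/epub+zip"


def reading_order(epub_file):
    """Steps 2 to 5: return the content file paths in spine order."""
    with zipfile.ZipFile(epub_file) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        # First rootfile element; its full-path is relative to the zip root.
        rootfile = container.find(f".//{CONTAINER_NS}rootfile")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))
    # The manifest maps item ids to hrefs; the spine lists ids in order.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.iter(f"{OPF_NS}item")
    }
    base = posixpath.dirname(opf_path)
    return [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in package.iter(f"{OPF_NS}itemref")
    ]
```

On the iPhone the same sequence would be carried out with an unzip library and an XML parser such as the ones named in the other answer; only the order of operations carries over from this sketch.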
If you want to do it right, you should probably handle DTBook content as well.
If you want to do this right, you need to read and understand the Open Packaging Format (OPF) and Open Publication Structure (OPS) specifications as well.