doc or docx: Is there a safe way to identify the type from 'requests' in python3? - ms-word

1) How can I differentiate doc and docx files from requests?
a) For instance, if I have
import requests
url = 'https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url, timeout=15)
print(r.headers['content-type'])
I get this:
application/vnd.openxmlformats-officedocument.wordprocessingml.document
This file is a docx.
b) If I have
url = 'https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url, timeout=15)
print(r.headers['content-type'])
I get this:
application/msword
This file is a doc.
2) Are there other options?
3) If I save a docx file as doc or vice versa, might I run into recognition problems (for instance, when converting to PDF)? Is there any kind of best practice for dealing with this?

The MIME headers you get appear to be the correct ones; see: What is a correct mime type for docx, pptx etc?
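For instance, based purely on the two header values shown in the question, a minimal lookup could be sketched like this (guess_word_extension is a made-up helper; any content type not in the table is treated as unknown):

import requests

# The two content types observed in the question, mapped to extensions.
MIME_TO_EXT = {
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document': '.docx',
    'application/msword': '.doc',
}

def guess_word_extension(url):
    r = requests.get(url, timeout=15)
    # Strip a possible '; charset=...' suffix before looking up the type.
    mime = r.headers.get('content-type', '').split(';')[0].strip()
    return MIME_TO_EXT.get(mime)  # None means: not recognised as DOC/DOCX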
However, the sending software can only go on whatever file its user selected, and there are still a lot of people sending files with the wrong extension. Some software can handle this, some cannot. To see this in action, rename a PNG image so it ends in JPEG instead. I just did this on my Mac, and Preview is still able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it is correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)
But after the file is downloaded, you can unambiguously distinguish between a DOC and a DOCX file, even if the author got the extension wrong.
A DOC file starts with a Microsoft OLE header, which is a quite complicated structure. A DOCX file, on the other hand, is a container holding lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.
This check is compatible with Python 2.7 and 3.x (the decode is only needed in Python 3):
import sys

if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    with open(sys.argv[1], 'rb') as testMe:
        startBytes = testMe.read(2).decode('latin1')
        print(startBytes)
        if startBytes == 'PK':
            print('This is a DOCX document')
        else:
            print('This is a DOC document')
Technically it will confidently state "This is a DOC document" for anything that does not start with PK, and, conversely, it will say "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you further process the file based on this decision, you may find out it's not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
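If you want a stricter check than "everything that is not a ZIP is a DOC", you can also test for the OLE signature explicitly. A sketch using the well-known magic numbers (ZIP local file headers begin with PK\x03\x04, OLE compound files with D0 CF 11 E0 A1 B1 1A E1):

import sys

OLE_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # OLE compound file header
ZIP_MAGIC = b'PK\x03\x04'                        # ZIP local file header

with open(sys.argv[1], 'rb') as f:
    start = f.read(8)

if start.startswith(ZIP_MAGIC):
    print('ZIP container - probably DOCX (could also be XLSX, EPUB, plain ZIP, ...)')
elif start.startswith(OLE_MAGIC):
    print('OLE compound file - probably DOC (could also be XLS, PPT, MSI, ...)')
else:
    print('Neither - not a Word document')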

Related

Can't open a .docx file after overwriting it with a different encoding

This is not exactly a programming question, but I ran into this problem while trying to access a .docx document using Python.
Basically, I manually opened the .docx with Notepad and overwrote it with UTF-8 encoding (ANSI was the default encoding). After I did this, if I try to open the document I see the following message: "We're sorry. We can't open filename because we found a problem with its contents". Clicking on Details shows "The file is corrupt and cannot be opened".
It doesn't matter if I save the file with ANSI again, it won't open. Later I tried it with a new document and the same thing happened; it even happens if I overwrite it with "ANSI" (even though that is the default).
I can still open it with Notepad, so my question is: Is there a way to recover my file or to convert it to a readable document?
I've tried every single method in the following link https://learn.microsoft.com/en-US/office/troubleshoot/word/damaged-documents-in-word and none of them worked.
Edit: If I open any ms-word file with Notepad and save it with any encoding, I won't be able to open it with ms-word anymore. I don't know why, but if I open the document and erase the first two letters (PK, which I believe stands for a zip document), I can open the file with ms-word, though it shows unreadable characters.
Thank you in advance
Word files are zip archives containing XML, which is already encoded in UTF-8. A zip archive is a binary format and has no text encoding of its own. Notepad makes a guess, but it's wrong. That's why, when you reopen the Word file that you thought you saved as UTF-8, Notepad still thinks it's in ANSI format.
Unfortunately, your file is hosed. It's not even a zip archive anymore, so you can't open it to extract the text from the XML. Best to experiment on a copy next time.
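If you want to verify up front whether a .docx has survived such an edit, a small sketch using only the standard library (word/document.xml is where the main body text of a DOCX lives):

import zipfile

def docx_body_xml(path):
    # A healthy DOCX must still be a readable zip archive.
    if not zipfile.is_zipfile(path):
        return None  # hosed, as described above
    with zipfile.ZipFile(path) as z:
        return z.read('word/document.xml')  # the main document part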

TCL fileutil::magic::mimetype not recognising Microsoft documents or mp3

I'm wondering if this is a limitation of fileutil::magic::mimetype, or whether something has gotten messed up in my installation. TCLLIB 1.15/TCL 8.5
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype e.g.
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns an empty string. The same happens for mp3 and other file formats. It does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
The standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm whether this is expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install in that area.
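I can't speak to the Tcl side, but as a fallback you can do the magic-number sniffing yourself, in the spirit of the DOC/DOCX answer earlier on this page. A rough Python sketch; the signature table is illustrative, not exhaustive:

MAGIC = [
    (b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1', 'application/msword (or another OLE file)'),
    (b'PK\x03\x04', 'application/zip (DOCX, EPUB, ...)'),
    (b'ID3', 'audio/mpeg (MP3 with an ID3v2 tag)'),
    (b'\x89PNG', 'image/png'),
]

def sniff(path):
    with open(path, 'rb') as f:
        start = f.read(8)
    for magic, mime in MAGIC:
        if start.startswith(magic):
            return mime
    return None  # unknown, like the empty string from fileutil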

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally the PDF is version 1.3. I'd like to decompress it, to see its low-level workings and make some changes. It's easy with GhostScript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite aren't a good combination. The PDF file you get out is NOT the same as the one you put in. This is because of the way Ghostscript and pdfwrite work: the input is fully interpreted into a sequence of graphics primitives, which are sent to the Ghostscript graphics library. These are then sent to the requested device; most devices render the result to a bitmap, but the pdfwrite family reassembles those graphics primitives into a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files, such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting it into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed.
There are tools that will decompress PDF files, and I would recommend one of our other products, MuPDF. Part of it is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file.
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
The tool has some issues with large PDFs (decompression can take a lot of time and memory), and it can produce incomplete output (with warnings in the console) for some partially broken or nonstandard PDFs.
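For reference, the usual invocation for this is "qpdf --qdf --object-streams=disable in.pdf out.pdf": QDF mode writes an uncompressed, regularised file that is intended for inspecting and hand-editing.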

Automating Localizable.strings?

So, in my project I have 10 languages, and 10 Localizable.strings files.
I just created Localizable.strings files, a file for each language. Now they contain "key" = "value" pairs, and both keys and values are in English (default language).
All my translations are done and stored in Excel files.
The question is, how can I insert all my languages in those files faster than just copying each word manually or writing a script for that?
Maybe there is an existing tool for this already?
Thanks.
I found an easy way to compose localizable.strings files from Excel documents.
In the Excel document, in specific columns, I insert the " " = " " symbols. It's easy to do for all the rows by dragging a cell down from its corner, so that Excel copies the contents of that cell to all the cells you drag over.
Thus the document contains the same symbols and words as localizable.strings does.
Then I just copy everything into a text file, remove the tabs, and change the extension to .strings.
(No comments are preserved, unfortunately.)
EDIT:
You can copy the content from Excel to Sublime Text, then Find & Replace any tabs, and copy the result into the proper Xcode .strings file.
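If you would rather script the conversion than clean it up by hand, here is a minimal sketch of the same idea, assuming the spreadsheet is exported to CSV first (the file name, column layout and .lproj paths are made up; adjust them to your project):

import csv

LANGS = {'en': 1, 'fr': 2, 'de': 3}  # column index per language in the sheet

with open('translations.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))[1:]  # skip the header row

for lang, col in LANGS.items():
    # Assumes the <lang>.lproj folders already exist.
    with open('%s.lproj/Localizable.strings' % lang, 'w', encoding='utf-8') as out:
        for row in rows:
            # Escape embedded quotes so the .strings file stays parseable.
            key, value = row[0], row[col].replace('"', '\\"')
            out.write('"%s" = "%s";\n' % (key, value))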
One application that will really save you a lot of time by automating and streamlining the localization procedure is Localization Suite. I do not know if it supports importing from Excel (to save you time transferring your string pairs), but it's free and seems like a complete solution.
I had an internal script at work for doing those tasks on iOS and Android, and I've just open-sourced it as a gem. You can take a look at it here: http://github.com/mrmans0n/localio
It can open spreadsheets from Google Drive as well as local Excel files, as requested.
You would just have to install the gem
gem install localio
and have a custom DSL file named Locfile in your project directory, with the info for your project and its localization files. In your case, where an Excel file is used, it could be as simple as:
platform :ios
source :xls, :path => 'YourExcelFileGoesInHere.xls'
output_path 'Resources/Localizables/'
The .xls file should have a certain format, which is probably very similar to what you have right now. You just have to clone the contents of this one and fill it with your translations: https://docs.google.com/spreadsheet/ccc?key=0AmX_w4-5HkOgdFFoZ19iSUlRSERnQTJ4NVZiblo2UXc
Hope this helps.
Here are the steps I followed:
Change the extension of the .strings file to .txt on Windows.
Open Excel and go to File > Open.
Choose the file to open. This should present an import wizard.
Follow the steps and specify the delimiting character as =.
You're done.

iPhone - reading .epub files

I am preparing an application for reading .epub files on the iPhone. Where can I find reference or sample applications for unzipping and parsing the files? Can anyone point me to a good link? Thank you in advance.
An .epub file is just a .zip file. It contains a few directory files in XML format and the actual book content is usually XHTML. You can use Objective-Zip to unzip the .epub file and then use NSXMLParser to parse the XML files.
More info: Epub Format Construction Guide
On top of Ole's answer (that's a pretty good how-to guide), it's definitely worth reading the specification for the Open Container Format (OCF); sorry, it's a Word file. It's the formal specification for the zip structure used.
In brief, you parse the file as follows (a code sketch follows the list):
Checking that it's plausibly valid by looking for the text 'mimetype' starting at byte 30 and the text 'application/epub+zip' starting at byte 38.
Extracting the file META-INF/container.xml from the zip.
Parsing that file and extracting the value of the full-path attribute of its first rootfile element.
Loading the referenced file (the full-path attribute is a URL relative to the root of the zip file).
Parsing that file. It contains all the metadata required to reference all the other content (mostly XHTML/CSS/images). In particular you want to read the contents of the spine element, which lists all content files in reading order.
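For concreteness, here is a sketch of steps 1-3 in Python (the Objective-C version follows the same structure; error handling omitted):

import zipfile
import xml.etree.ElementTree as ET

def epub_rootfile(path):
    # Step 1: the 'mimetype' entry must come first and be stored
    # uncompressed, so its name sits at byte 30 and its content at byte 38.
    with open(path, 'rb') as f:
        f.seek(30)
        if f.read(28) != b'mimetypeapplication/epub+zip':
            raise ValueError('not an EPUB container')
    # Step 2: extract META-INF/container.xml from the zip.
    with zipfile.ZipFile(path) as z:
        container = z.read('META-INF/container.xml')
    # Step 3: the full-path attribute of the first rootfile element
    # points at the package file, relative to the root of the zip.
    ns = {'c': 'urn:oasis:names:tc:opendocument:xmlns:container'}
    return ET.fromstring(container).find('.//c:rootfile', ns).get('full-path')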
If you want to do it right, you should probably also handle DTBook content as well.
If you want to do this right, you need to read and understand the Open Packaging Format (OPF) and Open Publication Structure (OPS) specifications as well.