TCL fileutil::magic::mimetype not recognising Microsoft documents or mp3

I'm wondering if this is a limitation of fileutil::magic::mimetype, or whether something has gotten messed up in my installation (Tcllib 1.15 / Tcl 8.5).
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype, e.g.:
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns an empty string. The same happens for mp3 and various other file formats, although it does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
Standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm whether this is expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install in that area.

Related

Can VSCode display a binary file if there is an executable that will convert it to an equivalent text file?

I sometimes use VSCode for a Delphi 7 project because I like VSCode's git functionality and for a few other reasons (superior string search, diff, etc).
Delphi 7 is a pain, and to get it to compile consistently I need to convert the dfm files to their binary version (all 2300 of them). This of course makes them unviewable in the diff viewer, or even when simply opening the file.
Is there a setting so that when I open such a file, it is first passed through the convert.exe (that's its actual name) utility and can be viewed as text? I understand this might be read-only, which would be sufficient for my needs (though if on save it could be passed back through, that'd be great too).
I'm having trouble figuring out what exactly to search for on Google (the keywords seem too generic), but I can imagine some generalized functionality that would work for other environments beyond just Delphi/Pascal.

doc or docx: Is there a safe way to identify the type from 'requests' in Python 3?

1) How can I differentiate doc and docx files from requests?
a) For instance, if I have
url='https://www.iadb.org/Document.cfm?id=36943997'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])
I get this:
application/vnd.openxmlformats-officedocument.wordprocessingml.document
This file is a docx.
b) If I have
url='https://www.iadb.org/Document.cfm?id=36943972'
r = requests.get(url,timeout=15)
print(r.headers['content-type'])
I get this:
application/msword
This file is a doc.
2) Are there other options?
3) If I save a docx file as doc or vice versa, might I run into recognition problems (for instance, when converting to pdf)? Is there any kind of best practice for dealing with this?
The mime headers you get appear to be the correct ones: What is a correct mime type for docx, pptx etc?
However, the sending software can only go on whichever file its user selected – and there are still a lot of people sending files with the wrong extension. Some software can handle this, others cannot. To see this in action, rename a PNG image so it ends with .jpeg instead. I just did this on my Mac, and Preview is still able to open it. When I press ⌘+I in the Finder it says it is a JPEG file, but when opened in Preview it is correctly identified as a "Portable Network Graphics" file. (Your OS may or may not be able to do this.)
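What Preview is doing there is content sniffing: checking the bytes rather than the name. In miniature (the is_png helper is my own illustration; the 8-byte signature itself is fixed by the PNG specification):

# Minimal content sniffing: a PNG file always begins with this
# signature, no matter what the file is called.
PNG_SIG = b'\x89PNG\r\n\x1a\n'

def is_png(path):
    with open(path, 'rb') as f:
        return f.read(8) == PNG_SIG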
But after the file is downloaded, you can unambiguously distinguish between a DOC and a DOCX file, even if the author got the extension wrong.
A DOC file starts with a Microsoft OLE header, which is a quite complicated structure. A DOCX file, on the other hand, is a container holding lots of smaller XML files, compressed together using standard ZIP compression. Therefore, this file type will always start with the two characters PK.
This check is compatible with Python 2.7 and 3.x (only Python 3 actually needs the decode):
import sys

if len(sys.argv) == 2:
    print('testing file: ' + sys.argv[1])
    with open(sys.argv[1], 'rb') as testMe:
        startBytes = testMe.read(2).decode('latin1')
        print(startBytes)
        if startBytes == 'PK':
            print('This is a DOCX document')
        else:
            print('This is a DOC document')
Technically it will confidently state "This is a DOC document" for anything that does not start with PK, and, conversely, it will say "This is a DOCX document" for any zipped file (or even a plain text file that happens to start with those two characters). So if you further process the file based on this decision, you may find out it's not a Microsoft Word document after all. But at least you will have tried with the proper decoder.
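If you need more confidence than the two-byte check, both tests can be tightened: a real DOC starts with the full 8-byte OLE header, and a real DOCX is a ZIP archive containing word/document.xml. A sketch (the word_kind helper and its return values are my own; the two signatures are documented):

import zipfile

OLE_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # OLE2 compound document header

def word_kind(path):
    with open(path, 'rb') as f:
        head = f.read(8)
    if head == OLE_MAGIC:
        return 'doc'  # an OLE container; note .xls and .ppt start the same way
    if head.startswith(b'PK') and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as z:
            if 'word/document.xml' in z.namelist():
                return 'docx'
        return 'zip, but not a Word document'
    return 'unknown'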

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally, the PDF is version 1.3. I'd like to decompress it to see its low-level workings and make some changes. It's easy with Ghostscript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite isn't a good combination. The PDF file you get out is NOT the same as the one you put in. This is because of the way Ghostscript and pdfwrite work: the input is fully interpreted into a sequence of graphics primitives, which are sent to the Ghostscript graphics library. These are then sent to the requested device; most devices render the result to a bitmap, but the pdfwrite family reassembles those graphics primitives into a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed....
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. Part of this is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file.
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
The tool has some issues with large PDFs (it can take too much time and memory for decompression), and it can produce incomplete output (with warnings in the console) for some partially broken or nonstandard PDFs.
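For reference, a typical invocation for this is "qpdf --qdf --object-streams=disable in.pdf out.pdf" (the file names are placeholders). QDF mode writes objects in a normalized, uncompressed form that can be edited in a text editor and then tidied up with the bundled fix-qdf tool.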

Cleaning up text files with sed?

I have a bunch of text files that need cleaning up. Example
`E..4B?#.#...
..9J5.....P0.z.n9.9.. ........
.k#a..5
E...y^#.r...J5..
E...y_#.r...J5..
..9.P..n9..0.z............
….2..3..9…n7…..#.yr`
Is there any way sed can do this? Like notice weird patterns?
For this answer, I will assume that you have access to standard unix/linux tools.
Your file might be in some word-processor format. If so, the best way to get rid of the junk is to open it with that program. You may be able to find out which with file:
$ file mysteryfile
mysteryfile: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1 ....
If that doesn't work, there is a standard unix utility for extracting text from binary files. It is called strings:
$ strings mysteryfile
Some
Recovered Text
...
The behavior of strings can be fine-tuned with several options; see man strings.
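For example (these particular flags are my suggestion, not part of the original answer), -n raises the minimum printable run length to cut down on noise, and -e l scans for 16-bit little-endian text, which is how many Windows programs store strings internally:
$ strings -n 8 -e l mysteryfile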

using eyeD3 to change encoding of mp3 tags

I found this question which is my exact starting point: Chinese-encoded metadata on mp3 files. I want to re-encode all my metadata as utf-8 so that Banshee can read it.
I can't figure out how to get eyeD3 to do that. I can decode individual tags as per that previous link, but I can't make eyeD3 change the actual text encoding of the mp3 file itself, so those tags can be rewritten in the proper encoding. I tried reading all the data into variables (below, 't' is the properly encoded title), then calling:
tag.clear()
tag.update(eyeD3.ID3_V2_4)
tag.setTitle(t)
That tells me: ValueError: ID3 vNone.None is not supported. Not what I was expecting.
I tried tag.setTextEncoding('utf-8'), but that tells me eyeD3.tag.TagException: Invalid encoding. All the other encodings I try give me the same error message.
eyeD3.TAGS2_2_TO_TAGS_2_3_AND_4 looks promising, but it's a dictionary of cryptic letter codes that mean nothing to me.
Can someone tell me how to change the version of the tags to something that supports utf-8, then change the file encoding to utf-8 and write the metadata back in?
Looks like somebody's already created something that does this:
http://code.google.com/p/id3-to-unicode/
It's pretty easy to use. Just download the latest version of the script from the website, make sure you have the eyeD3 and chardet python modules installed (a quick sudo apt-get install python-eyed3 python-chardet did the trick for me in ubuntu), and run the script with the -h flag to see how to use it.
My only complaint is that the script assumes that your music is organized like artist/album/01 track name.mp3, and uses path/file information to fill in missing tags. I disabled this in the latest version (http://id3-to-unicode.googlecode.com/files/id3_to_unicode_1.1.py) by commenting out lines 126-138.
Eric Abrahamsen figured out that setting the text encoding should look like
tag.setTextEncoding(eyeD3.UTF_8_ENCODING)
instead of
tag.setTextEncoding('utf-8').
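Putting the pieces of this thread together, the whole round trip under the old eyeD3 0.6.x API presumably looks something like this (the path is a placeholder, and I have not verified this against every tag version):

import eyeD3

tag = eyeD3.Tag()
tag.link('/path/to/song.mp3')               # placeholder path
t = tag.getTitle()                          # read (and, if needed, re-decode) the text first
tag.setTextEncoding(eyeD3.UTF_8_ENCODING)   # the constant, not the string 'utf-8'
tag.setTitle(t)
tag.update(eyeD3.ID3_V2_4)                  # ID3v2.4 supports UTF-8 text frames; v2.3 does not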