PyPDF2 Module and Encrypted PDF Files - pypdf

I'm currently using PyPDF2 to work with PDF files in Python.
When I run a script that loads some PDF files and extracts some keywords from them, it fails with:
PdfReadError: File has not been decrypted
So to try to get around this, I added:
if pathObj.isEncrypted:
    pathObj.decrypt('')
However, I'm then confronted with:
NotImplementedError: only algorithm code 1 and 2 are supported
Now, I kinda understand what the errors are telling me. What I don't understand is why I get them at all, since none of my PDFs are encrypted.
Does anyone know why files that are not encrypted are apparently encrypted? Is this some issue with PyPDF2?
Cheers

It would appear that these PDFs are encrypted with 128-bit AES. However, they can still be used in Adobe, just not with PyPDF2.
To get around the problem you have to install qpdf and call it from your code with these arguments (the password is left empty, matching the decrypt('') call in the question):
qpdf --password="" --decrypt in_put.pdf out_put.pdf
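For what it's worth, here is a minimal sketch of how that qpdf call might be wired into a Python script with subprocess. The helper names (build_qpdf_command, decrypt_with_qpdf) are made up for illustration, and the sketch assumes qpdf is on your PATH and that the files have an empty user password.

```python
import subprocess


def build_qpdf_command(src, dst, password=""):
    """Return the argument list for a qpdf decrypt call (hypothetical helper)."""
    # A list of arguments (rather than one shell string) avoids quoting
    # problems with spaces in file names.
    return ["qpdf", "--password=" + password, "--decrypt", src, dst]


def decrypt_with_qpdf(src, dst, password=""):
    """Write a decrypted copy of src to dst and return qpdf's exit status."""
    # Assumption: the qpdf executable is installed and on PATH.
    result = subprocess.run(build_qpdf_command(src, dst, password))
    return result.returncode


# Show the command that would be run for the file names used above.
print(build_qpdf_command("in_put.pdf", "out_put.pdf"))
```

The decrypted copy (out_put.pdf here) is then the file to open with PyPDF2 instead of the original.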

Related

TCL fileutil::magic::mimetype not recognising Microsoft documents or mp3

I'm wondering if this is a limitation of fileutil::magic::mimetype, or whether something has gotten messed up in my installation. I'm using Tcllib 1.15 with Tcl 8.5.
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype e.g.
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns an empty string. The same happens for mp3 and other file formats. It does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
Standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm whether this is expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install around that area.

Huge docx filled with <w:p> tags

My girlfriend is writing a Word document for homework. She's using the old .doc format, as required by her teacher ( :'( ).
At some point, the .doc file went from 150 kB to 2.6 MB with no noticeable change (as seen in the Dropbox history; sadly, Word's comparison function is no help because Word crashes). From that point on, she was unable to save her document without crashing Word...
I converted the .doc to .docx, unzipped it, and found an 18 MB document.xml file!
I can't even format the XML properly because it crashes Notepad++, but I can see that the file is filled with the same XML tag repeated over and over:
<w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>
Do you have any idea what could cause this?
EDIT: Here's the docx
EDIT2: The motivation for this question is more curiosity than looking for a fix. Thanks for your answers though.
If you're willing to edit the XML directly, you can just delete all the empty <w:p> tags and rezip.
If you're good with Python, you might give python-docx a try and use it to delete all empty paragraphs.
Hopefully that will at least recover the work she's done so far.
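The "delete the empty tags and rezip" route can be sketched with just the Python standard library. This is a rough sketch under a few assumptions: the main document part is word/document.xml, it is UTF-8 encoded, and every self-closing <w:p .../> element really is unwanted (intentional blank lines stored that way would be removed too). The function names are made up for illustration.

```python
import re
import zipfile

# Matches self-closing paragraph elements such as
# <w:p w:rsidR="002A70E5" w:rsidRDefault="002A70E5" w:rsidP="00565ED9"/>
# The (?=[\s/]) lookahead keeps other tags such as <w:pPr/> from matching.
EMPTY_PARAGRAPH = re.compile(r"<w:p(?=[\s/])[^<>]*/>")


def strip_empty_paragraphs(xml_text):
    """Return (cleaned_xml, number_of_removed_tags)."""
    return EMPTY_PARAGRAPH.subn("", xml_text)


def clean_docx(src, dst):
    """Copy a .docx from src to dst, dropping empty paragraphs on the way."""
    removed = 0
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                # Assumption: the part is UTF-8, which is what Word writes.
                text, removed = strip_empty_paragraphs(data.decode("utf-8"))
                data = text.encode("utf-8")
            # Every other part of the package is copied unchanged.
            zout.writestr(item, data)
    return removed
```

Calling clean_docx("broken.docx", "fixed.docx") (file names are placeholders) writes a new file and leaves the original alone, so nothing is lost if the result doesn't open.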
Not sure how this would happen, or whether it matters much. Only thing I can think of is a sticking Return key on the keyboard that would insert a huge number of carriage returns. Those each insert a new paragraph. I've actually had that happen occasionally on a Windows virtual machine running on my Mac. No clue why it does it though.
The tag you are talking about is part of the Open XML format used to build Word documents. Open XML stores the document as a zipped package, and I am afraid you are seeing the unzipped document.xml file. If you want to keep working with the document, just convert the .doc file to .docx. Don't unzip it.

SVG to PDF (with Perl Cairo?)

In a Perl script, I am trying to convert SVG files to PDF. This works great by just calling Inkscape:
system "inkscape -D -z --file=$in --export-pdf=$out";
But it is enormously slow, even for small 100 KB files. It can take minutes per file, which makes the script fail when it runs under a time-out constraint, e.g. on a web server.
To speed things up, I have read about svg2pdf as a standalone tool, but I never found a binary for Win7 or managed to compile it, even with the libcairo DLLs present.
My last idea now is to use the CPAN module Cairo. I was hoping that it could convert an SVG file to PDF, but in the documentation I only find drawings and surfaces, and no method to write/convert.
Does anyone have experience with that?
Making my comment an answer: You could try rsvg-convert which is part of the librsvg library. It's probably faster than Inkscape but it's still an external command.

Pipe multiple files into a zip file

I have several files in a GridFS Document Store, and what I'd like to do is pipe this data into a zip file via stdin in NodeJS, so that I end up with a zip file containing all these files.
My question is: how can I give the files valid filenames inside the zip file? I think I need to emulate/fake a file header containing the filename?
Any help is appreciated!
Thanks
I had problems writing zip files with Node.js not long ago. I ended up doing something similar to what is described in "Zip archives in node.js".
I can't help you directly with your problem, but at least I hope I can point out some things:
Don't try to use node-archive. Even though the description says it can create zip files, the moment I read the source code (since documentation is nonexistent) I realized that's just a lie. It only exposes methods for reading.
Using zip by spawning a process, as recommended in the linked post, seems to be the best way. Something that would work is copying the files to a local folder under whatever names you want, calling the zip command, and then deleting the files afterwards.
The other option, which seems OK, is to use zipper (https://github.com/rubenv/zipper, although it's better to just install it with npm). The reason I'm reluctant to use it is that there's not much flexibility; it seems to have been written in a day and hasn't been modified since the first commit, so I'm not sure it will be maintained (sure, you could just fork it...).
I swear, the day I have an entire free weekend with no work I will write a freaking module that does this as completely as possible. It's silly that there isn't one, and it shouldn't be this much of a struggle. blablablarant.
Edit:
Not sure if it was there before, but now I've been using the node-compress module (also using gzippo). It works fine.

Using eyeD3 to change encoding of mp3 tags

I found this question which is my exact starting point: Chinese-encoded metadata on mp3 files. I want to re-encode all my metadata as utf-8 so that Banshee can read it.
I can't figure out how to get eyeD3 to do that. I can decode individual tags as per that previous link, but I can't make eyeD3 change the actual text encoding of the mp3 file itself, so those tags can be rewritten in the proper encoding. I tried reading all the data into variables (below, 't' is the properly encoded title), then calling:
tag.clear()
tag.update(eyeD3.ID3_V2_4)
tag.setTitle(t)
That tells me: ValueError: ID3 vNone.None is not supported. Not what I was expecting.
I tried tag.setTextEncoding('utf-8'), but that tells me eyeD3.tag.TagException: Invalid encoding. All the other encodings I try give me the same error message.
eyeD3.TAGS2_2_TO_TAGS_2_3_AND_4 looks promising, but it's a dictionary of cryptic letter codes that mean nothing to me.
Can someone tell me how to change the version of the tags to something that supports utf-8, then change the file encoding to utf-8 and write the metadata back in?
Looks like somebody's already created something that does this:
http://code.google.com/p/id3-to-unicode/
It's pretty easy to use. Just download the latest version of the script from the website, make sure you have the eyeD3 and chardet Python modules installed (a quick sudo apt-get install python-eyed3 python-chardet did the trick for me on Ubuntu), and run the script with the -h flag to see how to use it.
My only complaint is that the script assumes that your music is organized like artist/album/01 track name.mp3, and uses path/file information to fill in missing tags. I disabled this in the latest version (http://id3-to-unicode.googlecode.com/files/id3_to_unicode_1.1.py) by commenting out lines 126-138.
Eric Abrahamsen figured out that setting the text encoding should look like
tag.setTextEncoding(eyeD3.UTF_8_ENCODING)
instead of
tag.setTextEncoding('utf-8')
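For the "decode individual tags" step that comes before writing anything back, a small standalone sketch may help. It assumes the usual cause of this kind of garbling: the tag bytes are really GBK but were read as Latin-1. Both encodings are assumptions for illustration (your files may use something else, which is why the script above relies on chardet), and repair_tag_text is a made-up helper, not part of eyeD3.

```python
def repair_tag_text(misread, real_encoding="gbk", assumed_encoding="latin-1"):
    """Undo a wrong decode: recover the raw bytes, then decode them properly."""
    raw = misread.encode(assumed_encoding)
    return raw.decode(real_encoding)


# Simulate a two-character Chinese title (U+4E2D U+6587) whose GBK bytes
# were wrongly read as Latin-1, which is how such a tag shows up garbled.
original = "\u4e2d\u6587"
garbled = original.encode("gbk").decode("latin-1")

# Reversing the wrong decode gives the intended text back.
fixed = repair_tag_text(garbled)
```

The repaired string is what would then be handed to tag.setTitle(t) once the tag's text encoding has been set to eyeD3.UTF_8_ENCODING.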