Does PDF::API2 support reading PDF 1.5+ with compressed XRef? - perl

It appears that PDF::API2 does not support PDF 1.5 (and later) compression of the xref table. This type of file is more common since Acrobat 9 & 10 write them by default. The other compression scheme is compressed object streams.
I get the following error:
Malformed xref in PDF file at /opt/local/lib/perl5/site_perl/5.12.3/PDF/API2/Basic/PDF/File.pm line 1140.
Do any of the Perl PDF modules support reading a PDF with a compressed XRef?
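To confirm that a given file really uses a cross-reference stream rather than a classic table, a rough check is to follow the last `startxref` offset and look at what sits there. The sketch below is only a heuristic, written in Python rather than Perl for brevity. The function name `xref_kind` and the 1024-byte window are assumptions made for illustration, not part of any PDF library.

```python
import re


def xref_kind(pdf_bytes: bytes) -> str:
    """Heuristically classify the cross-reference section that startxref points at."""
    # The last 'startxref' keyword is followed by the byte offset of the xref section.
    match = None
    for match in re.finditer(rb"startxref\s+(\d+)", pdf_bytes):
        pass
    if match is None:
        return "unknown"

    offset = int(match.group(1))
    section = pdf_bytes[offset:offset + 1024]  # assumed window size

    # A classic table starts with the literal keyword 'xref'.
    if section.lstrip().startswith(b"xref"):
        return "classic table"
    # A cross-reference stream is an ordinary object whose dictionary has /Type /XRef.
    if re.search(rb"/Type\s*/XRef", section):
        return "xref stream"
    return "unknown"
```

Calling `xref_kind(open("in.pdf", "rb").read())` on a file that triggers the error would be expected to report `"xref stream"`, though damaged or hybrid files may come back as `"unknown"`.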

CAM::PDF can read a compressed XRef. The documentation says:
The file format through PDF 1.5 is well-supported, with the exception
of the "linearized" or "optimized" output format, which this module
can read but not write.
I haven't worked with CAM::PDF myself, but I looked it over, and the API feels strange after coming from PDF::API2: it is noticeably lower-level. Both libraries have their advantages and disadvantages, though.
We use PDF::API2 at work and ask our designers to save as PDF 1.4 when they send us files. You can also use Ghostscript to convert files to PDF 1.4, which PDF::API2 supports:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o out.pdf in.pdf
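If only some incoming files need this treatment, one option is to read the `%PDF-x.y` header first and run Ghostscript only on files newer than 1.4. The following is a minimal sketch in Python; the helper names are made up for illustration, and it assumes the header version is accurate (a document can also declare a newer version in its catalog, which this does not check).

```python
import re
import subprocess


def pdf_header_version(path):
    """Return (major, minor) from the %PDF-x.y header, or None if it is missing."""
    with open(path, "rb") as handle:
        head = handle.read(1024)
    match = re.search(rb"%PDF-(\d)\.(\d)", head)
    return (int(match.group(1)), int(match.group(2))) if match else None


def needs_downgrade(path, max_version=(1, 4)):
    """True when the header reports a version newer than max_version."""
    version = pdf_header_version(path)
    return version is not None and version > max_version


def downgrade_command(src, dst, level="1.4"):
    """Build the same Ghostscript invocation as the one-liner above."""
    return ["gs", "-sDEVICE=pdfwrite", f"-dCompatibilityLevel={level}",
            "-o", str(dst), str(src)]


def downgrade_if_needed(src, dst):
    """Run Ghostscript only for files newer than PDF 1.4 (requires gs on PATH)."""
    if needs_downgrade(src):
        subprocess.run(downgrade_command(src, dst), check=True)
        return True
    return False
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to log or inspect what would be executed before touching any files.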

Related

TCL fileutil::magic::mimetype not recognising Microsoft documents or mp3

I'm wondering if this is a limitation of fileutil::magic::mimetype, or whether something has gotten messed up in my installation. TCLLIB 1.15/TCL 8.5
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype e.g.
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns an empty string. The same happens for mp3 and several other file formats. It does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
Standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm whether this is expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install around that area.

IPython notebook embed postscript

How can we render PostScript documents in an IPython notebook?
I saw there is support for other file formats such as jpg, png, pdf and svg, but couldn't find any mention of PostScript.
PostScript isn't a 'file format', it's a programming language. In order to render PostScript you will need a complete PostScript interpreter.
Presumably you could write one in Python. The last time I saw an estimate of the time required to write a full PostScript interpreter, it was 5 man-years; it's probably a bit more now.
Or you could render the program externally using Ghostscript to produce something you can already read. Since you say PDF is already supported, it would seem sensible to convert to that; since it's not a bitmap format, you won't lose scalability.
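As a rough illustration of the external-rendering route, the sketch below builds a Ghostscript command that converts a PostScript file to PDF (or to PNG if a bitmap is acceptable) from Python. The function names are hypothetical, `gs` is assumed to be on the PATH, and the 150 dpi default is an arbitrary choice.

```python
import subprocess
from pathlib import Path

# Output suffix -> Ghostscript device; both devices ship with standard builds.
DEVICES = {".pdf": "pdfwrite", ".png": "png16m"}


def ps_convert_command(ps_path, out_path, dpi=150):
    """Build a Ghostscript command converting PostScript to PDF or PNG."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in DEVICES:
        raise ValueError(f"unsupported output type: {suffix!r}")
    device = DEVICES[suffix]
    cmd = ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", f"-sDEVICE={device}"]
    if device != "pdfwrite":
        cmd.append(f"-r{dpi}")  # resolution only matters for bitmap output
    cmd += [f"-sOutputFile={out_path}", str(ps_path)]
    return cmd


def convert_ps(ps_path, out_path, dpi=150):
    """Run the conversion and return the output path."""
    subprocess.run(ps_convert_command(ps_path, out_path, dpi), check=True)
    return out_path
```

In a notebook cell, something like `IPython.display.IFrame(convert_ps("fig.ps", "fig.pdf"), width=600, height=400)` should then embed the PDF, and `IPython.display.Image(filename=convert_ps("fig.ps", "fig.png"))` would show the bitmap version; how well an embedded PDF displays depends on the browser.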

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally the PDF is version 1.3. I'd like to decompress it, to see its low-level workings and make some changes. It's easy with Ghostscript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite aren't a good combination for this. The PDF file you get out is NOT the same as the one you put in. This is because of the way that Ghostscript and pdfwrite work: the input is fully interpreted to a sequence of graphics primitives, which is sent to the Ghostscript graphics library. These are then sent to the requested device. Most devices render the result to a bitmap, but the pdfwrite family reassembles those graphics primitives into a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed.
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. A part of this is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file.
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
The tool has some issues with large PDFs (can take too much time and memory for decompression). The tool can produce incomplete output (with warnings in console) for some partially broken / nonstandard PDFs.
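For scripting either tool, a small wrapper along these lines might help. It is only a sketch: the function names are invented for illustration, the `mutool` invocation is the one quoted above, and the `qpdf` invocation uses its QDF mode with object streams disabled, which is one common way to get an uncompressed, editable file (other option combinations exist).

```python
import shutil
import subprocess


def decompress_command(src, dst, tool="qpdf"):
    """Build a command that rewrites src as an uncompressed PDF at dst."""
    if tool == "qpdf":
        # QDF mode writes uncompressed, human-readable streams.
        return ["qpdf", "--qdf", "--object-streams=disable", str(src), str(dst)]
    if tool == "mutool":
        return ["mutool", "clean", "-d", str(src), str(dst)]
    raise ValueError(f"unknown tool: {tool!r}")


def pick_tool(which=shutil.which):
    """Return the first of qpdf/mutool found on PATH, or None."""
    for tool in ("qpdf", "mutool"):
        if which(tool):
            return tool
    return None


def decompress(src, dst):
    """Decompress src into dst using whichever tool is installed."""
    tool = pick_tool()
    if tool is None:
        raise RuntimeError("neither qpdf nor mutool was found on PATH")
    subprocess.run(decompress_command(src, dst, tool), check=True)
    return tool
```

Whichever tool runs, it is worth opening the result and checking that the form fields still work before editing it by hand.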

How to convert Word 2007 document to PDF using Apache FOP

I am currently using Apache FOP and have a stylesheet (possibly from RenderX) that converts Word 2003 XML documents (Saved as XML option) to PDF. However, this does not work for Word 2007 XML documents.
I am looking for options and/or suggestions on how to accomplish one of the following tasks -
Get a stylesheet that will transform Word 2007 XML file to:
Word 2003 XML or
PDF using FOP (using a stylesheet to create xsl-fo)
I am also open to any other options you might have. If possible I would like to do this with little to no cost. However, I am limited to using Java so a C# type option is not possible.
You could try docx4j, an open source Java library (ASL v2) which uses FOP to create PDFs from docx files.
I'm not aware of any style sheets that do this transformation. It would be reasonably sophisticated. If you end up having to engineer another way of doing it, you might want to look at JODConverter (straight conversion - might be your best bet), the OpenOffice UNO API (very manual), JODReports or Docmosis (both can produce documents in various formats). All can produce PDFs from a Java environment. I think they all have free versions.
Hope that helps.

How can I use PDF 1.6 documents in Perl's CAM::PDF?

I am running into some problems using CAM::PDF with PDF documents which are %PDF-1.6
Is there a way to convert those into 1.3? (preferably a free batch-like way...)
What I am currently doing is printing the files with the free PDF995. The resulting PDF file is %PDF-1.3. However, it would take me forever to convert lots of documents this way.
You can use Ghostscript to do the job:
gs -dNOPAUSE -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -sOUTPUTFILE=out.pdf -dBATCH in.pdf
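Since the question asks for a batch-like approach, the same command can be generated for every PDF in a directory. The sketch below does that in Python; the function names and directory layout are assumptions for illustration, `gs` must be on the PATH, and the output switch is spelled `-sOutputFile`, as in the Ghostscript documentation.

```python
import subprocess
from pathlib import Path


def batch_commands(src_dir, dst_dir, level="1.3"):
    """Build one Ghostscript command per *.pdf in src_dir, in sorted order."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    commands = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        out = dst_dir / pdf.name  # keep the original file name
        commands.append([
            "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-dCompatibilityLevel={level}",
            f"-sOutputFile={out}", str(pdf),
        ])
    return commands


def convert_all(src_dir, dst_dir, level="1.3"):
    """Convert every PDF in src_dir, writing the results to dst_dir."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for cmd in batch_commands(src_dir, dst_dir, level):
        subprocess.run(cmd, check=True)
```

Writing to a separate output directory keeps the originals untouched, which matters because Ghostscript regenerates each file rather than editing it in place.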