Method to decompress a PDF (non-Adobe) while retaining form fields? - forms

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally the PDF is version 1.3. I'd like to decompress it, to see its low-level workings and make some changes. It's easy with GhostScript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.

Ghostscript and pdfwrite isn't a good combination. The PDF file you get out is NOT the same as the one you put in. This is because of the way that Ghostscript and pdfwrite work; the input is fully interpreted to a sequence of graphics primitives, which is sent to the Ghostscript graphics library. These are then sent to the requested device, most devices then render the result to a bitmap, but the pdfwrite family reassemble those graphics primitives int a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed....
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. A part of this is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file

QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
The tool has some issues with large PDFs (can take too much time and memory for decompression). The tool can produce incomplete output (with warnings in console) for some partially broken / nonstandard PDFs.

Related

Merge 2 pdf files and preserve forms

I'd like to merge at least 2 PDF files into one while preserving all the form elements in the original PDFs. The form elements include text fields, radio buttons, check boxes, drop down menus and others. Please have a look at this sample PDF file with forms:
http://foersom.com/net/HowTo/data/OoPdfFormExample.pdf
Now try to merge it with any other arbitrary PDF file.
Can you do it?
EDIT: As for the implementation, I'd ideally prefer a command line solution on a linux plattform using open source tools such as 'ghostscript', or any other tool that you think is appropriate to solve this task.
Of course, everybody is welcome to supply any working solution to this problem, including a coded solution that involves writing a script which makes some API calls to a pdf-processing library. However, I'd suggest to take the path of least resistance first (CMD Solution).
Best Regards
EDIT #2: Well there are indeed several CMD tools that merge PDFs. However, these tools don't seem to, AFAIK, to preserve the forms in the original PDFs! These tools appear to simply just concatenate the printouts of all those PDFs into a single Printout, which is then presented as a single PDF.
Furthermore, If you printout a PDF file with forms into a file, you lose all the forms in it. This clearly not what I'm looking for.
I have found success using pdftk, which is an open-source software that runs on linux and can be called from your terminal.
To concatenate multiple pdfs into one (and preserve form-fillable elements), you can use the following command:
pdftk input1.pdf input2.pdf cat output output-file.pdf

merging PDFs with Ghostscipt ignoring outline and using pdfmark instead

I am using a Batch script to merge different PDFs in one complete file.
%gsc% -dBATCH -sDEVICE=pdfwrite -sPAPERSIZE=letter -dEPSFitPage -o %dsk%%zus%%ext% %mfd% %pth%tmp\pdfmarks
%dsk%%zus%%ext%: Path and name of final (complete) document
%mfd%: Path and name of docs to be merged (c:\test\1.pdf c:\test\2.pdf ...)
%pth%tmp = path to the pdfmarks file
Additionally, I am creating a pdfmark document inside the script which gs uses to create the bookmarks. But unfortunately, some of the docs I am merging, have already their own bookmarks and I did not yet find a solution how to ignore those. GS should only use the bookmarks inside the pdfmarks file.
How can this be done?
Firstly; you are not 'merging' PDF files when you use Ghotscript's pdfwrite device. The process is described in detail here
The important point is that the way the input file(s) are constructed has no bearing on the way the output file is constructed. If any other software you use relies on the file being constructed in a particular fashion it may not work on the output PDF file.
The -dEPSFitPage switch only has any effect when the input is an EPS file. If you want to 'fit' PostScript or PDF files then you need to use -dPDFFitPage, -dPSFitPage or just -dFitPage. However, all of these rely on you first selecting a media size, and then preventing it being altered by setting -dFIXEDMEDIA. For EPS files you would more normally use -dEPSCrop which sets the media size to the EPS declared BoundingBox.
You can prevent the PDF interpreter reading the Outlines tree (which you are calling Bookmarks) and then creating a pdfmark from it to pass to the pdfwrite device by using the -dNO_PDFMARK_OUTLINES switch which oddly isn't documented, presumably an oversight.

TCL fileutil::magic::mimetype not recognising Microsoft documents or mp3

I'm wondering if this is a limitation of fileutil::magic::mimetype, or whether something has gotten messed up in my installation. TCLLIB 1.15/TCL 8.5
Take an ordinary Microsoft Word .doc file and pass it to fileutil::magic::mimetype e.g.
package require fileutil
package require fileutil::magic::mimetype
set result [fileutil::magic::mimetype "/tmp/test.doc"]
It returns empty string. Same for mp3, plus other file formats. It does recognise GIF, PNG, TIFF and other image formats.
Calling fileutil::fileType returns binary for the Word document.
Standard Linux command file -i returns "application/msword" for the same file.
Can anyone confirm if this expected behaviour? I'm a little confused about the relationship between the fileutil and fumagic libraries, so maybe I've broken something in my install around that area.

IPython notebook embed postscript

How can we render postscript documents in IPython notebook?
I saw there is support for other file formats such as jpg, png, pdf and svg but couldn't find any mention about postscript.
PostScript isn't a 'file format', its a programming language. In order to render PostScript you will need a complete PostScript interpreter.
Presumably you could write one in Python, the last time I saw an estimate for the amount of time required to write a full PostScript interpreter it was 5 man years, its probably a bit more now.
Or you could render the program externally using Ghostscript, to produce something you can already read. Since you say PDF is already supported it would seem sensible to convert to that instead; since its not a bitmap format you won't lose scalability.

How to programmatically convert PostScript to PDF with the fewest steps?

Is there any way to just slap on a header and use a PS file as a PDF, assuming that the PS is very simple and do anything complicated?
I want to do this programmatically, not using ps2pdf.
Thanks.
You can certainly *try" "just slapping on a header" ... but I don't think you'll get too far :-)
Personally, I'd suggest ps2pdf is the best solution (for example, invoke it with ShellExec() or system()).
But if you want a programmatic solution, ps2pdf is just a wrapper around Ghostscript. Have you considered using the Ghostscript libraries?
You cannot wrap a PostScript file into a PDF file.
Although a PDF file looks similar to a PostScript file,
a PDF file must have a special structure, including a cross-reference
table at the end with file offsets to different parts of the PDF file.
To understand the PDF file format you can download the PDF Reference from:
http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
If your software generates the PostScript file, maybe you can
extend it to write a PDF file too? It takes some time to understand
the PDF file format but it is not especially difficult if you are familiar with PostScript.
If this is too difficult, then use pdf2ps to do the hard work for you.