Scale down PDF page size with Perl

I process PDF files in an existing Perl framework. Some of the incoming files have very large page sizes. What I would like to do:
Check if the input file's page size is larger than a specific format f (A4/Letter)
If it is larger than f: scale it down to f.
This is the corresponding command in Unix using GhostScript:
gs -sPAPERSIZE=a4 -dFIXEDMEDIA -dPSFitPage -o <outputFile> -sDEVICE=pdfwrite <inputFile>
Is there a way to do this within Perl, i.e. without having to call external tools?
I checked the modules PDF and CAM::PDF, but the documentation does not really cover my use case and I couldn't find a straightforward solution.
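For what it's worth, here is a minimal sketch of the same operation using PDF::API2 (a module the question doesn't mention, so treat it as a suggestion rather than a confirmed solution; the file names and the hard-coded A4 size are placeholders). It copies each page into a new document as a form XObject and scales it down whenever it is larger than A4, roughly mirroring -dFIXEDMEDIA -dPSFitPage:

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);
use PDF::API2;

my ($a4_w, $a4_h) = (595, 842);   # A4 in PostScript points (rounded)

my $old = PDF::API2->open('input.pdf');
my $new = PDF::API2->new();

for my $i (1 .. $old->pages()) {
    my ($llx, $lly, $urx, $ury) = $old->openpage($i)->get_mediabox();
    my ($w, $h) = ($urx - $llx, $ury - $lly);

    # scale down only when the page exceeds A4 in either dimension
    my $scale = min($a4_w / $w, $a4_h / $h, 1);

    my $page = $new->page();
    $page->mediabox($a4_w, $a4_h);   # fixed media size, as with -dFIXEDMEDIA

    # import the source page as a form XObject and place it centred
    my $xo  = $new->importPageIntoForm($old, $i);
    my $gfx = $page->gfx();
    $gfx->formimage($xo,
                    ($a4_w - $w * $scale) / 2,
                    ($a4_h - $h * $scale) / 2,
                    $scale);
}

$new->saveas('output.pdf');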

Related

merging PDFs with Ghostscript, ignoring outlines and using pdfmark instead

I am using a Batch script to merge different PDFs into one complete file.
%gsc% -dBATCH -sDEVICE=pdfwrite -sPAPERSIZE=letter -dEPSFitPage -o %dsk%%zus%%ext% %mfd% %pth%tmp\pdfmarks
%dsk%%zus%%ext%: Path and name of final (complete) document
%mfd%: Path and name of docs to be merged (c:\test\1.pdf c:\test\2.pdf ...)
%pth%tmp: path to the pdfmarks file
Additionally, I am creating a pdfmark document inside the script, which gs uses to create the bookmarks. Unfortunately, some of the docs I am merging already have their own bookmarks, and I have not yet found a way to ignore those. GS should only use the bookmarks inside the pdfmarks file.
How can this be done?
Firstly, you are not 'merging' PDF files when you use Ghostscript's pdfwrite device. The process is described in detail here.
The important point is that the way the input file(s) are constructed has no bearing on the way the output file is constructed. If any other software you use relies on the file being constructed in a particular fashion it may not work on the output PDF file.
The -dEPSFitPage switch only has any effect when the input is an EPS file. If you want to 'fit' PostScript or PDF files then you need to use -dPDFFitPage, -dPSFitPage or just -dFitPage. However, all of these rely on you first selecting a media size, and then preventing it being altered by setting -dFIXEDMEDIA. For EPS files you would more normally use -dEPSCrop which sets the media size to the EPS declared BoundingBox.
You can prevent the PDF interpreter from reading the Outlines tree (which you are calling Bookmarks), and from creating a pdfmark from it to pass to the pdfwrite device, by using the -dNO_PDFMARK_OUTLINES switch, which oddly isn't documented, presumably an oversight.
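With the question's variables, the batch line would then presumably become something like:
%gsc% -dBATCH -dNO_PDFMARK_OUTLINES -sDEVICE=pdfwrite -sPAPERSIZE=letter -dFIXEDMEDIA -dPDFFitPage -o %dsk%%zus%%ext% %mfd% %pth%tmp\pdfmarks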

en masse inline editing in an uncompressed PDF

I have a large PDF (~20 MB, ~160 MB uncompressed).
I need to do a find-and-replace on the text in it, about 1000 times.
Here is what I tried.
Via SVG
Transform to SVG (Inkscape)
Read the SVG line by line and do the replacements in the file
Transform back to PDF
=> bad output; probably due to some geometric transform matrix in the SVG, the text is not rendered well
Creating ~1000 sed commands
Uncompress the PDF
Perform each replacement with a sed command
Recompress the PDF
=> way too long; each sed command takes about 20 seconds, leading to several hours of processing
Read line-by-line and replace
Uncompress the PDF
Read the PDF line by line
Find the text to be replaced
Replace it using Perl
Write the line to a new file
Compress the new file
=> because of the binary data streams left in the uncompressed PDF, the new file is apparently damaged (the binary data was written out as lines of text)
I wonder if it would be possible to read the uncompressed PDF line by line, but do the editing directly in it. How could I do this?
I have searched for Perl in-place editing, but it performs the changes on the whole file at once, whereas I'd like to edit a single line at a time.
Other ideas are more than welcome ;)
Following the advice given, I used CAM::PDF; this was the most efficient and simplest solution.
There is no difference between 2. and 3. sed reads the input file line by line and writes changed lines into the output file. If you pass the -i switch to it, sed just opens the input file, then unlinks it (which is what rm does), then opens an output file with the same name and writes into it. That's it; no magic involved. So if you damaged content with Perl but not with sed, you are doing something different from what sed does. The main difference is that you can make the Perl script much faster at replacing many strings. See Using sed on text files with a csv
The main trick is that you can compile one regexp for search and replace, which works in linear time.
# build one alternation regexp from all search keys so every
# replacement happens in a single pass over each line
my %replace = ( foo => 'bar' );
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
while (<>) {
    s/$re/$replace{$1}/g;
    print;    # emit the (possibly modified) line
}
You can use it with your original approach, but I would recommend doing it in a Perl script, which allows you to keep the regexp and replacement hash between PDF files. You can also try combining it with CAM::PDF; there is an example script, changepagestring.pl, distributed with it. You can also look at PDF::API2, which would require more work but may give better results. But remember, the PDF format is not intended for modification.
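For illustration, a minimal sketch along the lines of changepagestring.pl, applying the compiled regexp to every page's content stream with CAM::PDF (file names are placeholders):

use strict;
use warnings;
use CAM::PDF;

my %replace = ( foo => 'bar' );
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;

my $pdf = CAM::PDF->new('input.pdf') or die "$CAM::PDF::errstr\n";
for my $p (1 .. $pdf->numPages()) {
    my $content = $pdf->getPageContent($p);   # decompressed content stream
    next unless $content =~ s/$re/$replace{$1}/g;
    $pdf->setPageContent($p, $content);
}
$pdf->cleanoutput('output.pdf');

Note that this only works when the strings you search for appear literally in the content streams; PDF text is often split across kerning arrays or encoded through font mappings, in which case a plain substitution won't find it.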
You can follow the pdftk steps described in:
How to find and replace text in an existing PDF file with PDFTK (or other command line application)
You can first split the PDF into smaller documents with a few pages each, replace the text, and then merge them back together, all using pdftk (a rough sketch follows).
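A rough sketch of that pipeline (file names are placeholders; uncompress and compress are pdftk output filters):
pdftk big.pdf burst output page_%04d.pdf
pdftk page_0001.pdf output u_0001.pdf uncompress
(run your replacement script over each u_*.pdf, recompressing each result with the compress filter)
pdftk fixed_*.pdf cat output merged.pdf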
There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface; you can process individual pages and do find-and-replace using scripting commands. See if it loads your PDF.

perl, extract TOC from PDF file

I have checked CAM::PDF and other PDF-related modules, but cannot figure out whether there is a way to extract the table of contents from a PDF file.
If anyone has any ideas, I would be grateful!
I have not been able to find a library that reliably supports extracting PDF bookmarks (which is what I assume you mean by table of contents).
However, pdftk does a great job at this and can be run from the command line:
pdftk myfile.pdf dump_data | grep BookmarkTitle > outline.txt
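If you want the result in Perl rather than a text file, here is a small sketch that parses pdftk's dump_data output (the Bookmark* keys are the ones pdftk emits; the file name is a placeholder):

use strict;
use warnings;

open my $fh, '-|', 'pdftk', 'myfile.pdf', 'dump_data' or die "pdftk: $!\n";
my (@bookmarks, %cur);
while (<$fh>) {
    if    (/^BookmarkTitle:\s*(.*)/)       { $cur{title} = $1 }
    elsif (/^BookmarkLevel:\s*(\d+)/)      { $cur{level} = $1 }
    elsif (/^BookmarkPageNumber:\s*(\d+)/) {
        $cur{page} = $1;
        push @bookmarks, {%cur};   # one record per bookmark
        %cur = ();
    }
}
close $fh;

# print an indented outline
printf "%s%s (p. %d)\n", '  ' x ($_->{level} - 1), $_->{title}, $_->{page}
    for @bookmarks;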

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally, the PDF is version 1.3. I'd like to decompress it to see its low-level workings and make some changes. It's easy with Ghostscript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite isn't a good combination for this. The PDF file you get out is NOT the same as the one you put in. This is because of the way Ghostscript and pdfwrite work: the input is fully interpreted into a sequence of graphics primitives, which are sent to the Ghostscript graphics library. These are then sent to the requested device; most devices render the result to a bitmap, but the pdfwrite family reassembles those graphics primitives into a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files, such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file still wouldn't be the same as the original one decompressed.
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. Part of it is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file.
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
It has some issues with large PDFs (decompression can take a lot of time and memory), and it can produce incomplete output (with warnings in the console) for some partially broken or nonstandard PDFs.
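For reference, the usual invocation would be something like this (QDF mode writes a decompressed, editor-friendly file, and disabling object streams unpacks those as well):
qpdf --qdf --object-streams=disable input.pdf output.pdf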

How do I find the data segment of a Mac OS X executable with Perl?

I'm writing a tool in Perl that needs to scan for certain binary patterns inside an executable file on Mac OS X. To avoid getting too many false positives, I want to restrict my search to the data/text segments of the executable, excluding the code and a few other things. How can I accomplish this?
How about using otool?
-t Display the contents of the (__TEXT,__text) section.
-d Display the contents of the (__DATA,__data) section.
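From Perl you could then capture that output directly, e.g. (the path is a placeholder):
my $data_dump = `otool -d /path/to/binary`;   # hex dump of (__DATA,__data)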
You should look at the Mach-O file format specification (Mac OS X executables are Mach-O, not ELF). It contains headers and load commands that tell you exactly which segments live where. Parsing it is tedious but straightforward.
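If you want to stay in pure Perl, here is a rough sketch that walks the load commands of a 64-bit, single-architecture Mach-O binary to locate the __TEXT and __DATA segments (little-endian assumed; fat binaries and most error handling are left out):

#!/usr/bin/perl
use strict;
use warnings;

my $file = shift or die "usage: $0 <mach-o file>\n";
open my $fh, '<:raw', $file or die "open $file: $!\n";

# mach_header_64 is 32 bytes; we only need magic and ncmds
read $fh, my $header, 32 or die "short read\n";
my ($magic, undef, undef, undef, $ncmds) = unpack 'L<5', $header;
die "not a 64-bit Mach-O file\n" unless $magic == 0xfeedfacf;   # MH_MAGIC_64

for (1 .. $ncmds) {
    read $fh, my $cmdhdr, 8 or die "truncated load command\n";
    my ($cmd, $cmdsize) = unpack 'L<2', $cmdhdr;
    read $fh, my $body, $cmdsize - 8;
    next unless $cmd == 0x19;                                   # LC_SEGMENT_64
    my ($segname, $vmaddr, $vmsize, $fileoff, $filesize) = unpack 'Z16 Q<4', $body;
    printf "%-16s file offset 0x%x, size 0x%x\n", $segname, $fileoff, $filesize
        if $segname eq '__DATA' or $segname eq '__TEXT';
}

You could then seek to the reported file offset and read that many bytes, restricting your pattern scan to just those segments.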