Copy page content from a large pdf page to smaller pdf page - itext

I have a PDF file that is 22*17 and I am needing it to fit that page content in a 11*8.5 page.
Basically reducing existing page size. I am using iTextSharp.
How do I do it?

You could use Ghostscript and the pdfwrite device for this instead of itextsharp. Something like:
gs \
-sDEVICEHIGHTPOINTS=612 -sDEVICEWIDTHPOINTS=792 \
-dFIXEDMEDIA \
-dPDFFitPage \
-sDEVICE=pdfwrite \
-sOutputFile=output.pdf \
input.pdf
caveat: This will fully interpret the PDF file to marking operations and emit a brand new PDF (rescaled to fit the new media size), depending on the content of the file you may or may not like the result.

Related

Is it possible for Pandoc to detect cross-references when converting from Word?

I have a minimal Word (.docx) document that contains a figure and cross-reference to it (see image below; Figure 1: Some figure). When I Ctrl+click on that cross-reference, it puts me back on the image.
But when I convert this document to markdown or native format using pandoc -s file.docx -t markdown -o file.md or pandoc -s file.docx -t native -o file.txt, the cross-reference is parsed as regular text.
What I am wondering is if it is possible for Pandoc to detect and parse such instances in .docx document using some filter or some other method. Note that cross-references to tables, chapters, and subchapters are also relevant for me.

Scale down PDF page size with Perl

I process PDF files in an existing Perl framework. Some of the incoming files have very large page sizes. What I would like to do:
Check if the input file's page size is larger than a specific format f (A4/Letter)
If it is larger than f: scale it down to f.
This is the corresponding command in Unix using GhostScript:
gs -sPAPERSIZE=a4 -dFIXEDMEDIA -dPSFitPage -o <outputFile> -sDEVICE=pdfwrite <inputFile>
Is there a way to do this within Perl, i.e. without requiring to call external tools?
I checked the modules PDF and CAM::PDF, but the documentation does not really cover my issue and I couldn't find a straight-forward solution.

asciidoc: is there a way to create an anchor that will be visible in libreoffice writer?

Tl;dr;
What is the correct way to create an anchor in docbook? and is there a way that will make the anchor visible in writer?
Background
I am trying to split up documentation that was previously in single open office documents into smaller asciidoc documents which are both included in the main open office document and also converted to either or both of html & pdf.
I have this mostly working. I use asciidoctor to create html. asciidoctor-pdf to create pdf and a combination of asciidoctor and pandoc to create .odt files. I also tried the python implementation of asciidoc but found the interface less useable.
Round tripping between asciidoc and odt is obviously not possible. This is sort of a fusion where the master document is word processed but pieces of content that can be produced independently (think man pages - in fact that is one of several use cases) are included.
asciidoc to html:
asciidoctor -b html5 foo.adoc -o foo.html
asciidoc to pdf:
asciidoctor-pdf -b pdf foo.adoc -o foo.pdf
asciidoc to odt
asciidoctor -b docbook foo.adoc -o foo.docbook
pandoc --base-header-level=3 -V date:"" -V title:"" -f docbook foo.docbook -o foo.odt
With pandoc I have to nullify the date and title and set the header-level as desired for the section to be inserted as an extra complication.
I insert the resulting .odt into the main document using insert section inside open office.
Note that the main document is not a master document as I could not find a way of creating a master document without also automatically splitting the file on h1 boundaries.
I have two main problems to resolve with this set-up. I would like to add headings in the asciidoc document as cross references and also create entries for them in the alphabetical index (actually the first heading would be suffcient). Is there a way to do this?
Index markers in asciidoc do not result in entries in .odt file being created.
I am able to cross reference content in the inserted section using "insert reference/heading" and referencing the uniquely named header. However, whenever I use "update all" these cross references are invalidated. They are shown as "Error: Reference source not found".
[On a separate note I would also like a way to find broken cross references automatically]
I am currently using libreoffice - Version: 4.3.7.2
I am not adverse to switching version or flavours (i.e. apache) if one behaves better than the other.
I'm not sure if the answer is in the asciidoc or docbook parts of the chain. I would accept an answer which inserts a index entry at the start of the inserted section (top of the .adoc/docbook file) automatically.
I am also open to changing my toolchain to something that will work.
For example I tried the asciidoc-odt backend and fell foul of https://github.com/dagwieers/asciidoc-odf/issues/47 which does not inspire confidence.
Using asciidoc-odt I avoid the need to create an intermediate docbook file. However, I still can't get the anchor to appear.
I can get a macro to create an anchor but at present I haven't figured out how to run the macro from the command line.
To create an anchor in DocBook, make an inline anchor in the .adoc file. For example, giving this to asciidoctor:
[[X1]]Section1
---------------
produced this:
<title>
<anchor xml:id="X1" xreflabel="[X1]"/>
Section1
</title>
Conversely, putting this on separate lines did not create an anchor tag in my test:
[[X1]]
Section 1
Now for some bad news. From the Pandoc User's Guide:
Internal links are currently supported for HTML formats (including HTML slide shows and EPUB), LaTeX, and ConTeXt.
I interpret this to mean that currently, Pandoc does not create internal links in Writer. When I tried it, the link was ignored.
Note: It looks like I did not answer all of your questions. If you want to ask more about LibreOffice cross references and headings (the big bold paragraph towards the end of the question), maybe you could make a separate question just for that part.

Method to decompress a PDF (non-Adobe) while retaining form fields?

I found a similar question that involves Acrobat, but in this case the PDF was made with a combination of MS Word and CenoPDF v3, with which I'm unfamiliar. Additionally the PDF is version 1.3. I'd like to decompress it, to see its low-level workings and make some changes. It's easy with GhostScript's -dCompressPages=false parameter, but that simultaneously strips all the fill-in form functionality. Is there a method for decompressing the file while leaving everything else intact? A quick search of the docs for tcpdf and fpdi (cited in the link) didn't reveal a compression option.
Ghostscript and pdfwrite isn't a good combination. The PDF file you get out is NOT the same as the one you put in. This is because of the way that Ghostscript and pdfwrite work; the input is fully interpreted to a sequence of graphics primitives, which is sent to the Ghostscript graphics library. These are then sent to the requested device, most devices then render the result to a bitmap, but the pdfwrite family reassemble those graphics primitives int a new PDF file.
Note that the contents of the new PDF file have no relationship to the original, other than the appearance when rendered. Ghostscript and pdfwrite do maintain much of the non-marking content of PDF files such as hyperlinks and so on (which obviously don't get turned into graphics primitives), by interpreting them into pdfmark operations (an extension to the PostScript language defined by Adobe). However, even if Ghostscript and pdfwrite maintained all this content, the resulting PDF file wouldn't be the same as the original one decompressed....
There are tools which will decompress PDF files, and I would recommend one of our other products, MuPDF. A part of this is mutool, and "mutool clean -d in.pdf out.pdf" will decompress pretty much everything in a PDF file
QPDF can decompress PDF documents (among other things). I used this tool in the past and it preserved forms and data.
The tool has some issues with large PDFs (can take too much time and memory for decompression). The tool can produce incomplete output (with warnings in console) for some partially broken / nonstandard PDFs.

How do you trim the XMP XML contained within a jpg

Through the use of sanselan I've found that the root cause of iPhone photos imported to windows becoming uneditable is that there is content (white space?) after the actual XML (for more details and a linked example of the bad XMP XML see https://apple.stackexchange.com/questions/45326/why-can-i-not-edit-some-photos-imported-from-an-iphone-to-windows-vista).
I'd like to scan through my photo archive and 'trim' the XMP XML.
Is there an easy way to do this?
I have some java code that can recursively navigate my photo archive and DETECT the issue. I'm not sure how to trim and write the XML back though.
Obtain the existing XML using any means.
The following works if using the Apache Sanselan library:
String xmpXml = Sanselan.getXmpXml(new File('/path/to/jpeg'));
Then trim it...
xmpXml = xmpXml.trim();
Then write it back to the file using the solution to serializing Xmp XML to an existing jpeg.
try the following steps:
collect all of the photos in a single folder (e.g. folder xmlToConvert on your Desktop)
open a Terminal.app window
cd to the directory you put the files in (e.g. cd ~/Desktop/xmlToConvert)
run the following command from your command line prompt
mkdir converted ; for f in *.xml ; do cat $f | head -n $(wc -l $f) > converted/$f ; done
the converted/ sub-directory should now contain all the files without the whitespace at the end.
(i.e. a folder called converted in the xmlToConvert you created on your Desktop)
hth