perl, extract TOC from PDF file

I have checked through CAM::PDF and other PDF-related modules, but I cannot figure out whether there is a way to extract the table of contents from a plain PDF file.
If there are any ideas, I would be grateful!

I have not been able to find a library that reliably supports extracting PDF bookmarks (which is what I assume you mean by table of contents).
However, pdftk does a great job at this and can be run from the command line:
pdftk myfile.pdf dump_data | grep BookmarkTitle > outline.txt
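Since the question asks for Perl, here is a minimal sketch that drives pdftk from Perl and parses the BookmarkTitle/BookmarkLevel/BookmarkPageNumber lines of dump_data into a simple TOC structure. It assumes pdftk is installed and on your PATH:
use strict;
use warnings;

my $file = shift or die "usage: $0 file.pdf\n";
open my $fh, '-|', 'pdftk', $file, 'dump_data'
    or die "cannot run pdftk: $!";

my (@toc, %entry);
while (<$fh>) {
    if (/^BookmarkTitle:\s*(.*)/)          { $entry{title} = $1 }
    elsif (/^BookmarkLevel:\s*(\d+)/)      { $entry{level} = $1 }
    elsif (/^BookmarkPageNumber:\s*(\d+)/) {
        $entry{page} = $1;
        push @toc, { %entry };   # a bookmark record is complete once its page number arrives
        %entry = ();
    }
}
close $fh;

# Print an indented table of contents.
printf "%s%s (p. %d)\n", '  ' x ($_->{level} - 1), $_->{title}, $_->{page} for @toc;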

Related

Combining pdftk strings for specific pages

I've checked "Similar questions" and done a lot of searching, but I can't seem to find a way to combine the snippets I have already figured out; it would be awesome if someone is able to help.
I am using pdftk, alternatively running it through PowerShell.
I have two .pdf files (e.g. A = 1000 pages, B = 5000 pages) which I need to combine in a specific way to generate a new .pdf file. In detail, I need pages 1-3, 4-6 [...] of file A merged with pages 1-4, 4-8 [...] of file B, with a blank page between 1-3 and 4-6.
So far I have figured out how to burst the files, add a blank page, and combine them into a new .pdf file. Yet I'm only able to do that for one needed document at a time (a new file with 8 pages):
pdftk fileC.pdf fileD.pdf cat output fileE.pdf
pdftk A=fileE.pdf B=blankpage.pdf cat A1-1 B1-1 A2-4 output conclusion.pdf
Now I'm wondering if there's a way to output the complete file with a single command? Otherwise I'd have to do it for every merge of the two long files.
Thanks in advance!
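pdftk's cat operation accepts an arbitrarily long list of page ranges, so one way to avoid merging two small files at a time is to generate the whole range list in a loop and call pdftk once. A rough Perl sketch, assuming 3-page chunks from A, 4-page chunks from B, and a one-page blankpage.pdf between the chunks; the page totals, chunk sizes, and the exact ordering of the ranges are guesses taken from the question, so adjust them to your layout:
use strict;
use warnings;

my $pages_a = 1000;   # total pages in fileA.pdf
my $pages_b = 5000;   # total pages in fileB.pdf
my $chunk_a = 3;      # pages taken from A per block
my $chunk_b = 4;      # pages taken from B per block

my @spec;
my ($next_a, $next_b) = (1, 1);
while ($next_a <= $pages_a && $next_b <= $pages_b) {
    my $end_a = $next_a + $chunk_a - 1;  $end_a = $pages_a if $end_a > $pages_a;
    my $end_b = $next_b + $chunk_b - 1;  $end_b = $pages_b if $end_b > $pages_b;
    push @spec, "A$next_a-$end_a";   # next chunk of A
    push @spec, 'C1';                # one blank page between the chunks
    push @spec, "B$next_b-$end_b";   # next chunk of B
    $next_a = $end_a + 1;
    $next_b = $end_b + 1;
}

system('pdftk', 'A=fileA.pdf', 'B=fileB.pdf', 'C=blankpage.pdf',
       'cat', @spec, 'output', 'conclusion.pdf') == 0
    or die "pdftk failed: $?";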

Import docx file into emacs org-mode

I have been scouring the web for hours now and I haven't found a complete answer. I want to convert my .org file to .docx (and .docx to .org) while maintaining the sections and tables. I have found and tried using pandoc through PowerShell as a tool to do this, but I believe I am not doing it right.
Here is the command I type for pandoc:
pandoc report7a.org -s -o report7.docx
This shows the error:
pandoc.exe: Cannot decode byte '\xfe': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream
I have little experience with doing stuff like this.
Here is an image of the .org file I want to convert
I think that your editor put a byte order mark (BOM) at the beginning of the file. Check this post on how to remove it with Emacs.
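If the file really does start with a BOM (the \xfe in the error even suggests the file may be UTF-16 rather than UTF-8), another option is to re-save it as plain UTF-8 before running pandoc. A rough Perl sketch; the output file name and the UTF-16 guess are assumptions:
use strict;
use warnings;
use Encode qw(decode);

my $in  = 'report7a.org';
my $out = 'report7a-utf8.org';

open my $fh, '<:raw', $in or die "$in: $!";
my $bytes = do { local $/; <$fh> };
close $fh;

# Pick the source encoding from the BOM; fall back to UTF-8 with a BOM.
my $text = $bytes =~ /^(?:\xFE\xFF|\xFF\xFE)/
         ? decode('UTF-16', $bytes)
         : decode('UTF-8',  $bytes);
$text =~ s/^\x{FEFF}//;   # drop any remaining BOM character

open my $ofh, '>:encoding(UTF-8)', $out or die "$out: $!";
print {$ofh} $text;
close $ofh;
Then run pandoc on report7a-utf8.org instead.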

en masse inline editing in an uncompressed PDF

I have a large PDF (~20 MB, ~160 MB uncompressed).
I need to do a find and replace on the text in it, about 1000 times.
Here is what I tried.
Via SVG
Transform to SVG (Inkscape)
Read the SVG line by line and do the replace in the file
Transform back to PDF
=> bad output; probably due to some geometric transform matrix in the SVG, the text is not rendered well
Creating ~1000 sed commands
Uncompress the PDF
Perform each replace with a sed command
Recompress the PDF
=> way too long; each sed command takes about 20 seconds, leading to several hours of processing
Read line by line and replace
Uncompress the PDF
Read the PDF line by line
find the text to be replaced
replace it using Perl
write the line to a new file
Compress the new file
=> due to leftover data streams in the uncompressed PDF, the new file is apparently damaged (writing binary data as lines of text)
I wonder if it would be possible to read the uncompressed PDF line by line, but do the editing directly in it. How could I do this?
I have searched for Perl in-place editing, but it performs the changes on the whole file at once, while I'd like to edit a single line.
Other ideas are more than welcome ;)
Following advice, I used CAM::PDF; this was the simplest and most efficient solution.
There is no difference between approaches 2 and 3. sed reads the input file line by line and writes changed lines into the output file. If you pass the -i switch to it, sed just opens the input file, then unlinks it (which is what rm does), then opens an output file with the same name and writes into it. That's it; no magic involved. So if you damaged the content with Perl but not with sed, you are doing something differently than sed does. The main difference is that you can make a Perl script much faster at replacing many strings. See Using sed on text files with a csv.
The main trick is that you can compile one regexp for the search and replace, which works in linear time:
# Build one alternation regexp from all the search strings.
my %replace = ( foo => 'bar' );
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
while (<>) {
    s/$re/$replace{$1}/g;   # look up the replacement for whatever matched
    print;                  # write the (possibly changed) line out
}
You can use it with your original approach, but I would recommend doing it in a Perl script, which allows you to keep the regexp and the replace hash between PDF files. You can also try combining it with CAM::PDF; there is an example script, changepagestring.pl, in the distribution. You can also look at PDF::API2, which would require more work but may provide a better result. But remember, the PDF format is not intended for modification.
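For reference, a rough sketch along the lines of changepagestring.pl, combining the compiled regexp above with CAM::PDF. The file names and the %replace table are placeholders, and text that the PDF splits across several drawing operators will not match a simple substitution:
use strict;
use warnings;
use CAM::PDF;

my %replace = ( foo => 'bar' );
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;

my $pdf = CAM::PDF->new('input.pdf') or die $CAM::PDF::errstr;
for my $page (1 .. $pdf->numPages()) {
    my $content = $pdf->getPageContent($page);
    next unless $content =~ s/$re/$replace{$1}/g;   # only rewrite pages that changed
    $pdf->setPageContent($page, $content);
}
$pdf->cleanoutput('output.pdf');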
You can follow the pdftk steps as described in
How to find and replace text in a existing PDF file with PDFTK (or other command line application)
You can first split the PDF into smaller documents with a few pages each, replace the text, and then merge them back together, all using pdftk.
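A rough Perl sketch of that split/replace/merge workflow, driving pdftk with system calls. The file names and the OLD TEXT/NEW TEXT strings are placeholders, and it assumes the replacement does not change the length of the page's content stream (otherwise the stream's /Length entry can end up wrong):
use strict;
use warnings;

# Split the document into one file per page.
system('pdftk', 'big.pdf', 'burst', 'output', 'pg_%04d.pdf') == 0
    or die "burst failed: $?";

for my $page (glob 'pg_*.pdf') {
    # Uncompress the page so the text is visible as plain bytes.
    system('pdftk', $page, 'output', "u_$page", 'uncompress') == 0
        or die "uncompress failed for $page: $?";

    open my $in, '<:raw', "u_$page" or die $!;
    my $data = do { local $/; <$in> };
    close $in;

    $data =~ s/OLD TEXT/NEW TEXT/g;

    open my $out, '>:raw', "r_$page" or die $!;
    print {$out} $data;
    close $out;
}

# Recompress and merge the edited pages back into one document.
system('pdftk r_pg_*.pdf cat output replaced.pdf compress') == 0
    or die "merge failed: $?";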
There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface. You can process individual pages and then do a find and replace using scripting commands. See if it loads your PDF.

MATLAB: How to delete a specific page from a .pdf file?

I recently learned how to download .pdf files using urlwrite, but I was wondering if there is any way to specify which pages of the .pdf to save.
The files are always either 1 or 2 pages long, and I only want to keep the first page of the .pdf. Is there any way to directly download just the first page, and if not, is there a way to download the entire .pdf and then get rid of the 2nd page?
I know that it is possible to manually get rid of the second page in Preview or Adobe Acrobat and other applications, but it'd make things a lot easier if I could automate the process in MATLAB.
Any help would be greatly appreciated!
Find an appropriate command line tool (this example uses pdftk), and then you can make a call to it from MATLAB. Use sprintf to assemble the appropriate command and then pass it to system. This puts the output in a temporary file and then uses movefile to change the filename back:
temp = 'sometempfile.pdf';
urlwrite(someurl, filename);                % download the PDF to filename
system(sprintf('pdftk %s cat 1 output %s dont_ask', filename, temp));  % keep only page 1
movefile(temp, filename);                   % replace the original with the one-page file

How do you trim the XMP XML contained within a jpg

Through the use of Sanselan I've found that the root cause of iPhone photos imported to Windows becoming uneditable is that there is content (white space?) after the actual XML (for more details and a linked example of the bad XMP XML see https://apple.stackexchange.com/questions/45326/why-can-i-not-edit-some-photos-imported-from-an-iphone-to-windows-vista).
I'd like to scan through my photo archive and 'trim' the XMP XML.
Is there an easy way to do this?
I have some java code that can recursively navigate my photo archive and DETECT the issue. I'm not sure how to trim and write the XML back though.
Obtain the existing XML using any means.
The following works if using the Apache Sanselan library:
String xmpXml = Sanselan.getXmpXml(new File("/path/to/jpeg"));
Then trim it...
xmpXml = xmpXml.trim();
Then write it back to the file using the solution to serializing Xmp XML to an existing jpeg.
Try the following steps:
collect all of the photos in a single folder (e.g. folder xmlToConvert on your Desktop)
open a Terminal.app window
cd to the directory you put the files in (e.g. cd ~/Desktop/xmlToConvert)
run the following command from your command line prompt
mkdir converted ; for f in *.xml ; do head -n "$(wc -l < "$f")" "$f" > "converted/$f" ; done
the converted/ sub-directory should now contain all the files without the whitespace at the end.
(i.e. a folder called converted inside the xmlToConvert folder you created on your Desktop)
hth