PDFtk: Merge PDF Problems

I am using PDFtk (version 2.02, UNIX) to merge PDFs and am seeing the following problems in the output PDF:
The initial view of the PDF is changed (it should open with the Bookmarks panel and page).
Bookmarks don't point to the exact linked section as in the separate PDFs (they show the fit-page view of the section).
The original metadata is lost (the output should retain the first PDF's metadata).
Please suggest a workaround for the above points.
Regards,
Umesh

It's a little late to answer, but I came across this question while looking for a solution to the same problem. After taking a look at the pdftk man page, I found a solution and wrote a little script:
#!/usr/bin/env bash
# pdfcat: merge PDFs, keeping the metadata of the first input file
array=( "$@" )                                   # all arguments
len=${#array[@]}                                 # number of arguments
merged=${array[$len-1]}                          # last argument: output file
pdf2merge=( "${array[@]:0:$len-1}" )             # remaining arguments: input files
pdftk "$1" dump_data output metadata             # save the first PDF's metadata
pdftk "${pdf2merge[@]}" cat output "$merged"     # merge the input files
pdftk "$merged" update_info metadata output out  # re-apply the saved metadata
mv out "$merged"
rm metadata
exiftool "$merged"                               # print the result's metadata as a check
The script saves the metadata of the first PDF file (the first argument) to a file called metadata. It then uses pdftk's cat command to merge all the input files (the output file is the last argument). Finally, it loads the contents of metadata into the metadata of the resulting file and deletes metadata. The last line uses exiftool to print the metadata of the resulting file so you can check that everything went well.
You can save this script to your /home/username/bin directory and make it executable with:
$ chmod u+x scriptname
and then you can use it to merge files with the following syntax:
$ scriptname 1.pdf 2.pdf 3.pdf output.pdf
The resulting output.pdf will have the same metadata as the original 1.pdf file.

Related

Exiftool: Want to output to one text file using -w command

I'm currently trying to use exiftool in the Windows command prompt to read metadata from multiple files, then output it all to a single text file.
The exact command I last tried looked like this:
exiftool.exe -FileName -GPSPosition -CreateDate -d "%m:%d:%Y %H:%M:%S" -c "%d° %d' %.2f"\" -charset UTF-8 -ext jpg -w _Coordinate_Date.txt S:\Nick\Test\
When I run this, I get 7 individual text files, each containing the output for one corresponding file. However, I simply want all of it in one single text file. Any help is greatly appreciated.
The -w (textout) option can only be used to write multiple files. It is not meant to be used to output to a single file. As per the docs on -w:
It is not possible to specify a simple filename as an argument -- creating a single output file from multiple source files is typically done by shell redirection
Which is what you're doing with the >> ./output.txt part of your command. The -w _Coordinate_Date.txt isn't doing anything and would, I think, throw an Invalid TAG name: "w _Coordinate_Date.txt" error if quoted together like that, since it gets treated as a single argument. The -w option requires two parts: the -w itself and either an extension or a format string.
I actually figured it out: if you wrap the entire -w _Coordinate_Date.txt option in quotation marks and append it to a file, you can send all of the output to one text file.
i.e. "-w _Coordinate_Date.txt >> ./output.txt"

Using wget to recursively fetch .txt files in .php file, but filters break the command

I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.
Main page with list of accessions: http://signal.salk.edu/atg1001/download.php
Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]
I am only interested in the quality_variant_[accession] link (not the quality_variant_filtered_[accession] link), which takes you to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)
Running the command below, the files of interest are eventually listed (but not downloaded, because of the --spider argument), demonstrating that wget can follow the page's hyperlinks to the files I want.
wget --spider --recursive "http://signal.salk.edu/atg1001/download.php"
I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.
# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"
However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.
# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"
# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.
I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.
Thank you!
Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.
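A hedged sketch of the loop-based fallback mentioned in the question, assuming the accession IDs can be scraped from download.php itself (the grep pattern is a guess at the page's markup and has not been tested against the live site):
# Hypothetical: scrape accession IDs from the download page, then fetch each
# quality_variant_<id>.txt directly from the data directory.
wget -qO- "http://signal.salk.edu/atg1001/download.php" \
  | grep -oP 'accession\.php\?id=\K[^"&]+' \
  | sort -u \
  | while read -r id; do
      wget -nc "http://signal.salk.edu/atg1001/data/Salk/quality_variant_${id}.txt"
    done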

Can we wget with file list and renaming destination files?

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will skip X levels of remote directories when saving files locally.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nicely, no problem.
Now, to the issue:
Let's say my list-of-files-to-download.txt has these kinds of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of URLs the way I do now, keeping my parameters.
b.- get all files into the same directory and be able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However, the logic is sound, and with a bit of modification you can get it to work on other platforms and shells as well.
Sample pseudocode bash script -
for i in $(cat list-of-files-to-download.txt); do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check whether the file already exists on disk, and then decide what to rename the temp file to.
This offers much more control over your downloads.
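As an illustration only, here is a hedged sketch of that modification, building a unique local filename from each URL's path (the flag values are placeholders carried over from the question, and -N is dropped because, as noted above, it clashes with -O):
# Sketch: derive a local filename from the URL path so same-named files in
# different remote directories do not collide. Flag values are examples only.
destdir=/directory/for/downloaded/files
while read -r url; do
    # e.g. http://website/directory1/picture-same-name.jpg -> directory1_picture-same-name.jpg
    name=$(echo "$url" | sed 's|^[a-z]*://[^/]*/||; s|/|_|g')
    wget --user-agent='some-agent' --referer=http://some-referrer.html \
         --timeout=120 "$url" -O "$destdir/$name"
done < list-of-files-to-download.txt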

Can we give two files as input while using JasperStarter

I am using JasperStarter to create PDFs from several .jrprint files and then print them using JasperStarter functions.
I want to create one single PDF file from all the .jrprint files.
If I give command like:
jasperstarter pr a.jprint b.jprint -f pdf -o rep
It does not recognise the files after the first input file.
Can we create one single output file with many input jasper/jrprint files?
Please help.
Thanks,
Oshin
Looking at the documentation, this is not possible:
The command process (pr)
The command process is for processing a report.
In direct comparison to the command for compiling:
The command compile (cp)
The command compile is for compiling one report or all reports in a directory.
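The documentation aside, if a single merged PDF is still the goal, one possible workaround (not part of JasperStarter itself, sketched here on the assumption that each report exports cleanly on its own) is to export each report to PDF separately and then concatenate the results with pdftk, as in the first question above:
# Hypothetical two-step workaround: export each report, then merge the PDFs.
jasperstarter pr a.jrprint -f pdf -o a
jasperstarter pr b.jrprint -f pdf -o b
pdftk a.pdf b.pdf cat output rep.pdf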

How do I create a file of a hash of everything (individually) in a directory tree?

I have several pdf, jpg, png files inside an alphabetical directory tree. How do I produce a file of the hash of each individual file?
There are a lot of ways to do this.
Which OS are you using?
What is the exact format in which you want to save the results?
Here is an example of a simple bash (version 4) script on Linux that gives you the hash followed by the file name, one line per file, including all sub-directories.
#!/bin/bash
# Requires bash 4+ for globstar; ** then matches files in all sub-directories.
shopt -s globstar
OUTPUT=output.txt
for f in **
do
    # Hash regular files only; quoting handles names containing spaces.
    [ -f "$f" ] && md5sum "$f" >> "$OUTPUT"
done
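If bash 4 is not available, a find-based alternative (a sketch not taken from the original answer) does the same job in one line, excluding the output file itself from the hashing:
find . -type f ! -name output.txt -exec md5sum {} + > output.txt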