en masse inline editing in an uncompressed PDF - perl

I have a large PDF (~20 MB; 160 MB uncompressed).
I need to do a find-and-replace on its text, about 1000 replacements.
Here is what I tried.
1. Via SVG
Transform to SVG (Inkscape)
Read the SVG line by line and do the replacements in the file
Transform back to PDF
=> Bad output: probably because of a geometric transform matrix in the SVG, the text is not rendered well.
2. Creating ~1000 sed commands
Uncompress the PDF
Perform each replacement with its own sed command
Recompress the PDF
=> Way too long: each sed command takes about 20 seconds, leading to several hours of processing.
3. Reading line by line and replacing
Uncompress the PDF
Read the PDF line by line
Find the text to be replaced
Replace it using Perl
Write the line to a new file
Compress the new file
=> Because of the leftover binary data streams in the uncompressed PDF, the new file is damaged (binary data gets written as lines of text).
I wonder if it would be possible to read the uncompressed PDF line by line but do the editing directly in place. How could I do this?
I have searched for Perl in-place editing, but that performs the changes on the whole file at once, while I'd like to edit a single line at a time.
Other ideas are more than welcome ;)
Following the advice below, I used CAM::PDF; it was the most efficient and simplest solution.

There is no difference between 2. and 3. sed reads the input file line by line and writes the changed lines into the output file. If you pass the -i switch to it, sed just opens the input file, unlinks it (which is what rm does), then opens an output file with the same name and writes into it. That's it, no magic involved. So if you damaged the content with Perl but not with sed, you did something differently than sed does. The main difference is that you can make the Perl script far faster at replacing many strings; see Using sed on text files with a csv.
The main trick is that you can compile one regexp for all the search-and-replace pairs, which works in linear time:
my %replace = ( foo => 'bar' );                   # your search => replace pairs
my $re = join '|', map quotemeta, keys %replace;  # one alternation of all the keys
$re = qr/($re)/;                                  # compile it once
while (<>) {
    s/$re/$replace{$1}/g;                         # look up the replacement by the matched key
    print;                                        # emit the (possibly changed) line
}
You can use it with your original approach, but I would recommend putting it in a Perl script, which lets you keep the regexp and the replacement hash across PDF files. You can also try combining it with CAM::PDF; there is an example script, changepagestring.pl, bundled with it. You could also look at PDF::API2, which would require more work but may give a better result. But remember, the PDF format is not intended for modification.
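For illustration, here is a minimal sketch combining the compiled regexp above with CAM::PDF's page-content methods (the same approach changepagestring.pl takes); the file names and the contents of %replace are placeholders:

use strict;
use warnings;
use CAM::PDF;

my %replace = ( 'old text' => 'new text' );       # your ~1000 pairs go here
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;

my $pdf = CAM::PDF->new('input.pdf') or die $CAM::PDF::errstr;
for my $page (1 .. $pdf->numPages()) {
    my $content = $pdf->getPageContent($page);
    next unless $content =~ s/$re/$replace{$1}/g; # only touch pages that changed
    $pdf->setPageContent($page, $content);
}
$pdf->cleanoutput('output.pdf');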

You can follow the pdftk steps described in
How to find and replace text in a existing PDF file with PDFTK (or other command line application)
You can first split the PDF into smaller documents of a few pages each, replace the text, and then merge them back together, all using pdftk.
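As a rough sketch of that pipeline driven from Perl (pdftk's burst, uncompress, compress, and cat operations are real; the file names and the replacement are placeholders):

use strict;
use warnings;

# split into single-page documents named page_0001.pdf, page_0002.pdf, ...
system('pdftk', 'big.pdf', 'burst', 'output', 'page_%04d.pdf') == 0 or die "burst failed";

my @done;
for my $page (glob 'page_*.pdf') {
    (my $plain = $page) =~ s/^page_/plain_/;
    (my $fixed = $page) =~ s/^page_/fixed_/;
    system('pdftk', $page, 'output', $plain, 'uncompress') == 0 or die "uncompress failed";
    # in-place replacement on the uncompressed page streams
    system('perl', '-i', '-pe', 's/old text/new text/g', $plain) == 0 or die "replace failed";
    system('pdftk', $plain, 'output', $fixed, 'compress') == 0 or die "compress failed";
    push @done, $fixed;
}

# merge the processed pages back into one document
system('pdftk', @done, 'cat', 'output', 'result.pdf') == 0 or die "cat failed";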
There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface; you can process individual pages and then do a find-and-replace using scripting commands. See if it loads your PDF.

Related

I would like to extract the pspictures from a TeX file and put them in another file so they can be processed into PS or PDF files easily

I have a list of .tex files that contain fragments which build PS pictures and can be slow to process.
There are multiple fragments across multiple files, and the end delimiter is \end{pspicture}:
% this is the beginning of the fragment
\begin{pspicture}(0,0)(23,5)
\rput{0}(0,3){\crdKs}
\rput(1,3){\crdtres}
\rput(5,3){\crdAh}
\rput(6,3){\crdKh}
\rput(7,3){\crdsixh}
\rput(8,3){\crdtreh}
\rput(12,3){\crdQd}
\rput(13,3){\crdeigd}
\rput(14,3){\crdsixd}
\rput(15,3){\crdfived}
\rput(16,3){\crdtwod}
\rput(20,3){\crdKc}
\rput(21,3){\crdfourc}
\end{pspicture}
I would like to extract the fragments.
I am not sure how to go about this. Can awk or sed do it?
They seem to work line by line rather than on a whole fragment.
I am not really looking for a solution, just a good candidate tool.
sed -En '/^\\begin\{pspicture\}.*$/,/^\\end\{pspicture\}.*$/p' file
This uses sed with -E for extended regular expressions and -n to suppress the default printing of every line.
The /start/,/end/ range selects every line from one matching the beginning expression through the next one matching the ending expression, and p prints those lines.
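Perl is another natural candidate: its range (flip-flop) operator does the same start/end selection. A sketch against the same file:

perl -ne 'print if /^\\begin\{pspicture\}/ .. /^\\end\{pspicture\}/' file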

Perl: How to replace a string in file with a paragraph [duplicate]

This question already has an answer here: sed find and replace between two tags with multi line (1 answer)
Closed 8 years ago.
I need to replace a token in a file with a multi-line paragraph, one that contains several line breaks when represented as a string.
If I use sed in the usual way for a string-to-string replacement, the line breaks inside the new string cause errors.
So now I want to open the file, seek to the token's location, and write the new content into the file from there, but I am not sure how to achieve that. Can anybody help?
EDIT:
Looks like I can probably read both the file and the content to be inserted into arrays and then use splice in Perl. Might not be the easiest way, though.
perl -i -pe's/token/foo\nbar\nbaz\n/g' file
You can't really insert into a file. Just like inserting into a string, you must first move the remainder of the string out of the way. With files, it's easier just to copy the entire file.
The provided code opens the file, deletes it, creates a new file under the same name, then copies (with substitutions) from the still-open handle to the new handle.
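As a sketch, the one-liner behaves roughly like this explicit script (minimal error handling; 'file' is a placeholder):

use strict;
use warnings;

open my $in, '<', 'file' or die "open: $!";
unlink 'file' or die "unlink: $!";     # the directory entry disappears, but the
                                       # open handle still reads the old contents
open my $out, '>', 'file' or die "create: $!";
while (<$in>) {
    s/token/foo\nbar\nbaz\n/g;         # the substitution from the one-liner
    print $out $_;
}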
It's my understanding that sed can do this too, and that it likewise uses -i to enable the feature.
Check out: How do I change, delete, or insert a line in a file, or append to the beginning of a file?
The easiest solutions will be to either use Perl's $INPLACE_EDIT (the English-module name for $^I), optionally as a one-liner like the one demonstrated by ikegami, or perhaps to use Tie::File.
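Here is a minimal sketch of the Tie::File route, using the splice idea from the question's edit; the token and the paragraph lines are placeholders:

use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'file' or die "Cannot tie: $!";
for my $i (0 .. $#lines) {
    if ($lines[$i] =~ /token/) {
        # note: this swaps out the entire line containing the token
        splice @lines, $i, 1, 'foo', 'bar', 'baz';
        last;
    }
}
untie @lines;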

How can I force emacs (or any editor) to read a file as if it is in ASCII format?

I could not find this answer in the man or info pages, nor with a search here or on Google. I have a file which is essentially a text file, but it somehow got corrupted upon saving. (I think there are a few stray bytes at the front of the file.)
I am able to open the file, and it makes sense, using head or cat, but not with any sort of editor.
In the end, all I wish to do is open the file in emacs, delete the "messy" characters, and save it once cleaned up. The file, however, is huge, so I need something powerful like emacs to be able to open it.
Otherwise, I suppose I could write a script to read it in line by line, forcing the script to read it as text, and then write it back out. But I wanted something quick, since I won't be doing this over and over.
perl -i.bk -pe 's/[^[:ascii:]]//g;' file
Found this Perl one-liner here: http://www.perlmonks.org/?node_id=619792
Try M-x find-file-literally in Emacs.
You could edit the file using hexl-mode, which lets you edit the file in hexadecimal. That would let you see precisely what those offending characters are, and remove them.
It sounds like you either got different line endings in the file (e.g. carriage returns on a *nix system) or it got saved in an unexpected encoding.
You could use strings to grab the printable characters from the file. You might have to play with --encoding, though; I have only ever used it to grab ASCII strings from executable files.

Read and delete text between two strings in perl

I need a way to read and delete the text between two different strings found in a file, then delete the two strings themselves, like a "cut" command. I would like to have the text stored in a variable.
I saw the post about reading text between two strings, but I could not figure out how to delete it as well.
I intend to execute the stored text in bash. Efficiency is desirable: this script is not going to be used on large files, but it may be executed many times in sequence, so the faster it works the better.
The stored text will usually contain special characters.
Specify the beginning and ending strings via the environment, and the file to use on the perl command line:
export START_STRING='abc def'
export END_STRING='ghi jkl'
perl -0777 -i -wpe 's/\Q$ENV{START_STRING}\E(.*)\Q$ENV{END_STRING}\E//s; print STDERR $1' file_to_use 2>savedtext
Here -0777 slurps the whole file so the match can span lines, /s lets . match newlines, the matched span (delimiters included) is deleted in place, and the captured text goes to STDERR, which the 2>savedtext redirection stores in a file.
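Unpacked into a plain script, the same logic looks roughly like this (a sketch; it prints the edited content to STDOUT rather than editing in place):

use strict;
use warnings;

my ($start, $end) = @ENV{qw(START_STRING END_STRING)};
local $/;                                    # slurp mode, like -0777
my $text = <>;
if ($text =~ s/\Q$start\E(.*)\Q$end\E//s) {  # /s lets . span newlines
    print STDERR $1;                         # the removed text; 2>savedtext stores it
}
print $text;                                 # what remains of the file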

Perl and reading files with different encodings

I am using a Perl script to read in a file, but I am not sure what encoding the file is in. Basically, my file is a list of book titles, but each book has other info associated with it (author, publication date, etc.), so each title sits within a discrete chunk of data for that book. I iterate through the file line by line until I find the regular expression /Book Title: (.*)/ and take what is in the parentheses. Then I create a separate .txt file named after the book. However, on my Unix server, when I look at the name of the file, it is not, for example, 'LordOfTheFlies.txt' but rather 'LordOfTheFlies^M.txt'.
What is this '^M'? Is it a weird end-of-line encoding I am not taking into account? I tried chomp but it does not seem to be working. What is the best file encoding for working with Perl?
It's the additional carriage return character that Windows systems insert before line feed characters (M == 13th letter, hence ASCII 13 is visualised as ^M).
It has nothing to do with file encoding; it's just the line-ending policy biting you. Perl is usually good at handling line-ending characters correctly, but if they occur somewhere other than where it expects the end of a line, you have to deal with them yourself. You can use s/\r// instead of chomp() to get them out.
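A minimal sketch of reading the file with CRLF-tolerant line handling; the 'Book Title' pattern comes from the question, and the file name is a placeholder:

use strict;
use warnings;

open my $fh, '<', 'books.txt' or die "open: $!";
while (my $line = <$fh>) {
    $line =~ s/\r?\n\z//;                  # strips LF and CRLF endings alike
    if ($line =~ /Book Title: (.*)/) {
        my $title = $1;                    # now free of the stray \r
        open my $out, '>', "$title.txt" or die "create: $!";
        print $out "$line\n";              # write whatever belongs in the book's file
        close $out;
    }
}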
Before processing the file, you need to know its encoding, which is determined by whatever produced the file.
That "^M" is Control-M, a carriage return, which is not used as a line terminator on Unix file systems. It looks like the file was created on Windows and transferred to Unix; the character can also be introduced by ftp when text files are transferred as binaries.
Try chop in addition to chomp: chomp removes only the trailing newline, while chop removes the last character unconditionally, so it can take off the leftover \r. s/\r// is also good.
For your general question, you might want to use an appropriate module for the file type you have, to make your life with Perl easier and better.
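For instance, once you know the encoding, you can declare it as an I/O layer when opening the file (a sketch; UTF-8 and the file name are assumptions):

use strict;
use warnings;

open my $fh, '<:encoding(UTF-8)', 'books.txt' or die "open: $!";
while (my $line = <$fh>) {
    # $line arrives here already decoded to Perl's internal string form
}
close $fh;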