Read and delete text between two strings in Perl

I need a way to read and delete text between two different strings found in some file, then delete the two strings. Like a "cut command." I would like to have the text stored in a variable.
I saw the post about reading text between two strings, but I could not figure out how to delete it as well.
I intend to execute the stored text in bash. Efficiency is desirable. This script is not going to be used on large files, but it may be executed many times sequentially so the faster the script works the better.
The stored text will usually have special characters.
Thanks

Specify the beginning and ending strings via the environment, and the file to use on the perl command line:
export START_STRING='abc def'
export END_STRING='ghi jkl'
perl -0777 -i -wpe 's/\Q$ENV{START_STRING}\E(.*?)\Q$ENV{END_STRING}\E//s and print STDERR $1' file_to_use 2>savedtext
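For comparison, here is a minimal Python sketch of the same "cut" idea, not the answerer's method. The function name cut_between is made up for illustration. It uses plain substring search, so special characters in the markers or in the captured text need no escaping, and it leaves the file untouched if either marker is missing.

```python
from pathlib import Path


def cut_between(path, start, end):
    """Remove the first start...end span from the file and return the text between the markers.

    Returns None (and leaves the file untouched) if either marker is missing.
    """
    file = Path(path)
    data = file.read_text()
    begin = data.find(start)
    if begin == -1:
        return None
    stop = data.find(end, begin + len(start))
    if stop == -1:
        return None
    captured = data[begin + len(start):stop]
    # Drop both markers and everything between them, then write the file back.
    file.write_text(data[:begin] + data[stop + len(end):])
    return captured
```

Calling cut_between("file_to_use", "abc def", "ghi jkl") would return the text that the Perl one-liner writes to savedtext.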

Related

Perl Command Understanding

I have been working on some product code to resolve an issue, but I am stuck on one line of code.
Can anyone help me understand what exactly this command does?
perl -MText::CSV -lne 'BEGIN{$p = Text::CSV->new()} print join "|", $p->fields() if $p->parse($_)' /home/daily/${FULL_FILENAME} > /home/output.txt
I think it's meant to copy the file to my home location with some transformations, but I'm not sure exactly.
This is a slightly broken program that translates a comma-separated values (CSV) file to a pipe-separated values file.
The particular command-line switches are documented in perlrun. This is a "one-liner", so you can read about those to see what's going on there.
The Text::CSV module deals with CSV files, and the program is parsing a line from the file and re-outputting as a pipe-separated file.
But this program treats each line as a complete record. That might be fine for you, but at some point you might end up with a literal value that has a newline in it, like a,"b\nc",d. Reading line by line then breaks the program, since the quotes appear to be unclosed within the first line. Not only that, it blindly joins the parsed fields without considering whether any of them should be quoted. A pipe character in the data might be unlikely, but the problem isn't its rarity; it's the consequences and cost when one does show up.
The rewrite.pl example script in the related module Text::CSV_XS is a tool that could replace this one-liner. It properly reads the input and knows how to properly translate it.
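To illustrate the two failure modes described above, here is a small Python sketch using the standard csv module. It is not the rewrite.pl script, and the function name csv_to_pipe is made up for illustration. The reader parses whole records, so a quoted newline stays inside its field, and the writer quotes any output field that contains a pipe, a quote, or a line break.

```python
import csv
import io


def csv_to_pipe(src, dst):
    """Copy CSV records from src to dst as pipe-separated records."""
    reader = csv.reader(src)
    # QUOTE_MINIMAL quotes only fields containing the delimiter, a quote, or a line break.
    writer = csv.writer(dst, delimiter="|", quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for record in reader:
        writer.writerow(record)


# One field has an embedded newline, another has an embedded pipe.
source = io.StringIO('a,"b\nc",d\nx,"y|z",w\n', newline="")
target = io.StringIO(newline="")
csv_to_pipe(source, target)
```

With real files, open both with newline="" as the csv module documentation recommends.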

I would like to extract the pspictures from a TeX file and put them in another file so they can be processed into PS or PDF files easily

I have a list of .tex files that contain fragments that build PS pictures, which can be slow to process.
There are multiple fragments across multiple files, and the end delimiter is \end{pspicture}.
% this is the beginning of the fragment
\begin{pspicture}(0,0)(23,5)
\rput{0}(0,3){\crdKs}
\rput(1,3){\crdtres}
\rput(5,3){\crdAh}
\rput(6,3){\crdKh}
\rput(7,3){\crdsixh}
\rput(8,3){\crdtreh}
\rput(12,3){\crdQd}
\rput(13,3){\crdeigd}
\rput(14,3){\crdsixd}
\rput(15,3){\crdfived}
\rput(16,3){\crdtwod}
\rput(20,3){\crdKc}
\rput(21,3){\crdfourc}
\end{pspicture}
I would like to extract the fragments.
I am not sure how to go about this. Can awk or sed do this?
They seem to work line by line rather than on the whole fragment.
I am not really looking for a solution, just a good candidate tool.
sed -En '/^\\begin\{pspicture\}.*$/,/^\\end\{pspicture\}.*$/p' file
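This uses sed with -E for extended regular expressions and -n to suppress the default output.
The /start/,/end/ address range selects every line from one matching the start expression through the next one matching the end expression, and p prints those lines.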
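If you would rather treat each fragment as a single unit instead of a range of lines, a regular expression over the whole file also works. Here is a minimal Python sketch; the names PSPICTURE and extract_pspictures are made up for illustration.

```python
import re

# Non-greedy match from \begin{pspicture} through the nearest \end{pspicture};
# DOTALL lets "." cross line breaks so a whole fragment is one match.
PSPICTURE = re.compile(r"\\begin\{pspicture\}.*?\\end\{pspicture\}", re.DOTALL)


def extract_pspictures(tex):
    """Return every pspicture environment found in the given TeX source."""
    return PSPICTURE.findall(tex)
```

Each returned string is one complete fragment, ready to be written to its own file.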

en masse inline editing in an uncompressed PDF

I have a large PDF (~20 MB, 160 MB uncompressed).
I need to do a find and replace in the text in it, about 1000 times.
Here is what I tried.
1. Via SVG
Transform to SVG (inkscape)
Read the SVG line by line and do the replacements in the file
Transform back to PDF
=> bad output: the text is not rendered well, probably because of some geometric transform matrix in the SVG
2. Creating ~1000 sed commands
Uncompress the PDF
Perform each replacement with a sed command
Recompress the PDF
=> way too slow: each sed command takes about 20 seconds, which adds up to several hours of processing
3. Read line by line and replace
Uncompress the PDF
Read the PDF line by line
Find the text to be replaced
Replace it using perl
Write the line to a new file
Compress the new file
=> because of the binary data streams left in the uncompressed PDF, the new file is apparently damaged (binary data written as lines of text)
I wonder if it would be possible to read the uncompressed PDF line by line but do the editing directly in it. How could I do this?
I have searched for perl in-place editing, but it performs the changes on the whole file at once, while I'd like to edit a single line.
Other ideas are more than welcome ;)
Following the advice given, I used CAM::PDF; this was the most efficient and simplest solution.
There is no difference between approaches 2 and 3. sed reads the input file line by line and writes the changed lines to the output file. If you give it the -i switch, sed just opens the input file, unlinks it (which is what rm does), then opens an output file with the same name and writes to it. That's it; no magic involved. So if Perl damaged the content but sed did not, your Perl script must be doing something different from what sed does. The main difference is that a Perl script can be much faster when replacing many strings. See Using sed on text files with a csv
The main trick is that you can compile a single regexp for the search and replace, which works in linear time.
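my %replace = ( foo => 'bar' );
# Build one alternation of all the literal search strings.
my $re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
while (<>) {
    s/$re/$replace{$1}/g;
    print;    # write every line back out, changed or not
}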
You can use it with your original approach, but I would recommend doing it in a Perl script, which lets you keep the regexp and the replacement hash across PDF files. You can also try combining it with CAM::PDF, which ships with the example script changepagestring.pl. You can also look at PDF::API2, which would require more work but may give better results. But remember that the PDF format is not intended for modification.
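The same single-pattern idea can be sketched in Python. This is only an illustration of the technique, and build_replacer is a made-up name. It works on bytes rather than text, on the assumption that the uncompressed PDF still contains binary streams that must pass through unchanged, and it makes no claim about matching in linear time.

```python
import re


def build_replacer(table):
    """Compile one alternation from a dict of bytes -> bytes and return a replace function."""
    # Longest keys first so a longer key wins over a shorter key that is its prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(b"|".join(re.escape(k) for k in keys))

    def replace(data):
        # One pass over the data; each match is looked up in the table.
        return pattern.sub(lambda m: table[m.group(0)], data)

    return replace
```

Building the replacer once and reusing it for every file avoids recompiling the pattern, which is the point made above about keeping the regexp and hash between PDF files.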
You can follow the pdftk steps as described in
How to find and replace text in a existing PDF file with PDFTK (or other command line application)
You can first split the PDF into smaller documents with a few pages each, replace the text and again merge them together - all using pdftk.
There is also the PDFEdit software (http://pdfedit.cz/en/index.html). It is a GUI app with a scripting interface. You can process individual pages and then do a find replace using scripting commands. See if it loads your PDF.

SAS- Reading multiple compressed data files

I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My files' names are ordered so I have for example : o_equities_20080528.tas.zip o_equities_20080529.tas.zip o_equities_20080530.tas.zip ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename for each zip file or use some of 7-zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro to process one file at a time, using the above techniques and add this information while you're importing each dataset.

Rename file containing '©' character

We received as input in our application (running on Windows) a list of files. These files were automatically extracted from a database with a script.
Apparently some of the names contain special characters (like accents), and these characters are rendered as '©' on our side.
How can we programmatically rename these text files (around 900'000) to get rid of this character?
We can neither change the source nor re-extract the files.
The problem is that because of this character another program involved with our system does not accept the files.
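Have a look at the Unix command rename. It allows you to apply a Perl regex to the names of a bunch of files. In this case you might want something like the following (the g flag removes every offending character rather than just the first, and the dot, underscore, and hyphen are kept so that file extensions survive):
$ rename 's/[^a-zA-Z0-9._-]//g' *
In Debian the rename command is part of the perl package. It should also be available on CPAN.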
I ended up creating a new script that reads the input files and searches for special characters in their names.
It was quite easy indeed:
filename = filename.Replace("©", "e");
Since the '©' is in the filename, the script (in C#) is able to recognize it and replace the match accordingly. This way I can loop through all the folders and subfolders, simply reading each filename and changing the special characters.
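The loop over folders and subfolders described above could look roughly like this in Python. It is a sketch of the approach, not the C# script actually used, and rename_bad_names is a made-up name. It assumes that replacing '©' with 'e' is the mapping you want, and it skips a file rather than overwriting one that already has the target name.

```python
import os


def rename_bad_names(root, bad="©", good="e"):
    """Walk root and rename every file whose name contains `bad`; return (old, new) path pairs."""
    renamed = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if bad not in name:
                continue
            old = os.path.join(folder, name)
            new = os.path.join(folder, name.replace(bad, good))
            if os.path.exists(new):
                # Skip rather than overwrite a file that already has the target name.
                continue
            os.rename(old, new)
            renamed.append((old, new))
    return renamed
```

The returned list of pairs can be logged, so the renames can be reviewed or undone later.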
Thank you all for the contributions!