Perl script to modify file

I have Oracle files that I need to compare to CVS files, but the problem is that there are many files where I want to ignore the first line(s) as part of the diff. I want to run a script that opens each file and rewrites its contents so that 'CREATE OR REPLACE PACKAGE "TRON"."SOME_PACKAGE" IS' becomes 'CREATE OR REPLACE PACKAGE SOME_PACKAGE IS'. The problem I am having is that the statement can span several lines, so I have to consider a situation like 'CREATE OR REPLACE "TRON"."SOME_PACKAGE" IS'.
My approach (since this is part of a Jenkins job) is to loop through all the files in the workspace, modifying any files that meet these criteria. I can then use my existing Perl script that relies on File::Compare and Text::Diff::Table.
I've been testing Zaid's solution with little success, since it still does not deal with scenarios where the command string spans multiple lines. (My changes:)
use strict;
use warnings;
use Tie::File;
use Data::Dumper;
my @array;
tie @array, 'Tie::File', 'c:\cb_k_check_recon_mma.sps' or die "Unable to tie file";
my %unwanted = map  { $_ => 1 }
               map  { $_-1 .. $_-4, $_, $_+2 .. $_+4 }
               grep { $array[$_] =~ /^CREATE.*[IS|AS]$/ }
               0 .. $#array;
print Dumper \%unwanted;
@array = map { $array[$_] } grep { ! $unwanted{$_} } 0 .. $#array;
print Dumper \@array;
untie @array;

If the text can span several lines, for a single regex to work you need to read the file into a string, not line-by-line.
perl -0777 -pi.bak -e 's/CREATE\s+OR\s+REPLACE\s+PACKAGE\s+"TRON"\."SOME_PACKAGE"\s+IS/CREATE OR REPLACE PACKAGE SOME_PACKAGE IS/g' /path/*.pl
The -0777 switch tells perl to slurp the file, so the regex will only be run once. For that reason, I added the global /g modifier, in case more than one substitution per file is needed.
As you see, I use \s+ instead of a literal space, to match possible randomly inserted newlines. -pi in short means to perform an in-place edit on the target file(s), and .bak after -i means to save backups with that extension. It is advisable to save backups, but not required (except on Windows).
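If you need this inside the Jenkins-driven script rather than as a one-liner, the same slurp-and-substitute idea can be applied file by file. Here is a minimal sketch (the *.sps glob pattern is an assumption based on the file named in the question):

use strict;
use warnings;

# Loop over the workspace files (the glob pattern is an assumption).
for my $file ( glob '*.sps' ) {
    local $/;    # slurp mode: read the whole file as one string
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $text = <$in>;
    close $in;

    # Same substitution as the one-liner; \s+ tolerates line breaks.
    $text =~ s/CREATE\s+OR\s+REPLACE\s+PACKAGE\s+"TRON"\."SOME_PACKAGE"\s+IS/CREATE OR REPLACE PACKAGE SOME_PACKAGE IS/g;

    open my $out, '>', $file or die "Cannot write $file: $!";
    print $out $text;
    close $out;
}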

Related

Remove multiple duplicate lines from a file

I have a Perl script, run from crontab, that generates a file full of duplicate entries, because on each run it rewrites information previously written.
I would use sort -u on the file, but I would rather do it at the end of the Perl script itself.
My list
10/10/2017 00:01:39:000;Sagitter
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
10/12/2017 00:09:00:459;Sagitter
10/13/2017 01:11:03:009;Lupus
12/13/2017 04:29:00:609;Ariet
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
...
My code
#!/usr/bin/perl
# Libraries
use strict;
use warnings 'all';
%lines = ();
# Remove duplicate
open( TMP_GL_OUTPUT, '>', $OUTPUT_FILE ) or die $!;
while ( <TMP_GL_OUTPUT> ) {
    $lines{$_}++;
}
open( OUTFILE, '>', $TMPOUTPUT_FILE ) or die $!;
print OUTFILE keys %lines;
close( OUTFILE );
close( TMP_GL_OUTPUT );
Where am I going wrong? In shell it feels shorter than in Perl.
sort -u $TMPOUTPUT_FILE > $OUTPUT_FILE
As suggested by user ikegamy, I've done the following:
use File::Copy qw( move );
use IPC::Run  qw( run );
move $OUTPUT_FILE, $TMPOUTPUT_FILE; # Move the file aside
run [ 'sort', '-u', '--', $TMPOUTPUT_FILE ], '>', $OUTPUT_FILE; # Remove duplicates
unlink $TMPOUTPUT_FILE;
I think you are asking why your Perl program is longer than your shell script.
First of all, your shell script does something completely different than your Perl program.
Your shell script executes a program, and stores its out in a file.
Your Perl program reads a file, manipulates the data it read, and stores the output in a file.
The Perl equivalent to
sort -u -- "$TMPOUTPUT_FILE" > "$OUTPUT_FILE"
is
use IPC::Run qw( run );
run [ 'sort', '-u', '--', $TMPOUTPUT_FILE ], '>', $OUTPUT_FILE;
(There are differences in error handling between these two.)
They're not that different in length.
This brings up the second difference. The shell specializes in executing programs, but Perl is a general purpose language. It would be surprising if it wasn't longer in Perl!
(Now try comparing the size of your Perl program to the source of sort...)
List::Util is a core module.
use List::Util 'uniq';
print for uniq <>
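For example, the same thing can be run straight from the command line (input.txt and output.txt are placeholder names):
perl -MList::Util=uniq -e 'print for uniq <>' input.txt > output.txt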
Your code looks almost OK.
My suggestion is only to chomp each line before you save it as a key in the hash.
The reason is that the last line, if not terminated with a \n, may look just the same as one of the previous lines, but without chomp the previous line would contain the terminating \n whereas the last one would not.
The result is that these two lines would become different keys in the hash.
Compare my example program (working, presented below) with yours; there are no other significant differences, apart from reading from __DATA__ and writing to the console.
In my program, for demonstration purposes, I put two variants of the printout: one with key values (repetition counts) and another printing just the keys.
In your program, keep only the second printout.
use strict; use warnings; use feature qw(say);
my %lines;
while (<DATA>) {
    chomp;
    $lines{$_}++;
}
while (my ($key, $val) = each %lines) {
    printf "%-32s / %d\n", $key, $val;
}
say '========';
foreach my $key (keys %lines) {
    say $key;
}
__DATA__
10/10/2017 00:01:39:000;Sagitter
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
10/12/2017 00:09:00:459;Sagitter
10/13/2017 01:11:03:009;Lupus
12/13/2017 04:29:00:609;Ariet
10/11/2017 00:00:01:002;Lupus
10/12/2017 00:03:14:109;Leon
Edit
Your code assigns no values to $OUTPUT_FILE and $TMPOUTPUT_FILE; you didn't even declare these variables, but I assume that in your actual code you did.
Another detail is that %lines should be preceded with my; otherwise, since you use strict;, the compiler reports an error.
Edit 2
There is a quicker and shorter solution than yours.
Instead of writing lines to a hash and printing them only in a second pass, you can do it in a single loop:
Read the line.
Check whether the hash already contains a key equal to the line just read.
If not, then:
write the line to the hash, to block the printout if the same line occurs again,
print the line.
You can even write this program as a Perl one-liner:
perl -lne"print if !$lines{$_}++" input.txt
If you run the above command from the Windows cmd, it will print the output to the console. If you use Linux, use single quotes (apostrophes) instead of double quotes.
You may of course redirect the output to any file by adding > output.txt to the above command.
The code is executed for each input line, chomped due to the -l option.
If any other details concerning Perl one-liners are not known to you, search the web.
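If you prefer to keep it inside your existing script rather than as a one-liner, the same single-pass idea might look like this (the file names are placeholders standing in for your $TMPOUTPUT_FILE and $OUTPUT_FILE):

use strict;
use warnings;

my %seen;
open my $in,  '<', 'tmp_output.txt' or die $!;   # placeholder for $TMPOUTPUT_FILE
open my $out, '>', 'output.txt'     or die $!;   # placeholder for $OUTPUT_FILE
while ( my $line = <$in> ) {
    chomp $line;
    print {$out} "$line\n" unless $seen{$line}++;   # print only the first occurrence
}
close $in;
close $out;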

Perl in place editing within a script (rather than one liner)

So, I'm used to using perl -i the way I would use sed, to edit files in place.
The docs for $^I in perlvar:
$^I
The current value of the inplace-edit extension. Use undef to disable inplace editing.
OK. So this implies that I can perhaps mess around with 'in place' editing in a script?
The thing I'm having trouble with is this:
If I run:
perl -pi -e 's/^/fish/' test_file
And then deparse it:
BEGIN { $^I = ""; }
LINE: while (defined($_ = <ARGV>)) {
    s/^/fish/;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
Now - if I were to want to use $^I within a script, say to:
foreach my $file ( glob "*.csv" ) {
    # in-place edit these files - maybe using Text::CSV to manipulate?
}
How do I 'enable' this to happen? Is it a question of changing $_ (as s/something/somethingelse/ does by default) and letting perl implicitly print it? Or is there something else going on?
My major question is - can I do an 'in place edit' that applies a CSV transform (or XML tweak, or similar).
I appreciate I can open separate file handles, read/print etc. I was wondering if there was another way. (even if it is only situationally useful).
The edit-in-place behaviour that is enabled by the -i command-line option or by setting $^I works only on the ARGV file handle. That means the files must either be named on the command line or @ARGV must be set up within the program.
This program will change all lower-case letters to upper-case in all CSV files. Note that I have set $^I to a non-null string, which is advisable while you are testing so that your original data files are retained.
use strict;
use warnings;
our $^I = '.bak';
while ( my $file = glob '*.csv' ) {
    print "Processing $file\n";
    our @ARGV = ($file);
    while ( <ARGV> ) {
        tr/a-z/A-Z/;
        print;
    }
}
There is a much simpler answer, if your script is always going to do in-place editing and your OS uses shebang:
#!perl -i
while (<>) {
    print "LINE: $_";
}
Will add 'LINE: ' at the beginning of a line for each file it's given. (Note that you'd probably use the full path to perl, i.e., "#!/usr/bin/perl -i")
You can also call your script as:
% perl -i <script> <file1> <file2> ...
to run the script as an in-place editor on file1, file2, etc., if you don't have shebang support.

How to remove a specific word from a file in perl

A file contains:
rhost=localhost
ruserid=abcdefg_xxx
ldir=
lfile=
rdir=p01
rfile=
pgp=none
mainframe=no
ftpmode=binary
ftpcmd1=
ftpcmd2=
ftpcmd3=
ftpcmd1a=
ftpcmd2a=
notifycc=no
firstfwd=Yes
NOTIFYNYL=
decompress=no
compress=no
I want to write some simple code that removes the "_xxx" in that second line. Keep in mind that "_xxx" will never appear anywhere else in the file, so that should make it much easier. I'm just not too familiar with the syntax. Thanks!
The short answer:
Here's how you can remove just the literal '_xxx'.
perl -pli.bak -e 's/_xxx$//' filename
The detailed explanation:
Since Perl has a reputation for code that is indistinguishable from line noise, here's an explanation of the steps.
-p creates an implicit loop that looks something like this:
while( <> ) {
    # Your code goes here.
}
continue {
    print or die;
}
-l sort of acts like "auto-chomp", but also places the line ending back on the line before printing it again. It's more complicated than that, but in its simplest use, it changes your implicit loop to look like this:
while( <> ) {
    chomp;
    # Your code goes here.
}
continue {
    print $_, $/;
}
-i tells Perl to "edit in place." Behind the scenes it creates a separate output file and at the end it moves that temporary file to replace the original.
.bak tells Perl that it should create a backup named 'originalfile.bak' so that if you make a mistake it can be reversed easily enough.
Inside the substitution:
s/
    _xxx$   # Match (but don't capture) the final '_xxx' in the string.
//x;        # Replace the entire match with nothing.
The reference material:
For future reference, information on the command line switches used in Perl "one-liners" can be obtained in Perl's documentation at perlrun. A quick introduction to Perl's regular expressions can be found at perlrequick. And a quick overview of Perl's syntax is found at perlintro.
This overwrites the original file, getting rid of _xxx in the 2nd line:
use warnings;
use strict;
use Tie::File;
my $filename = shift;
tie my @lines, 'Tie::File', $filename or die $!;
$lines[1] =~ s/_xxx//;
untie @lines;
Maybe this can help
perl -ple 's/_.*// if /^ruserid/' < file
will remove anything after the first '_' (inclusive) in the lines that start with "ruserid".
One way using perl: in the second line ($. == 2), delete from the last _ to the end of the line:
perl -lpe 's/_[^_]*\Z// if $. == 2' infile

Extracting unique values from multiple files in Perl

I have several data files that are tab delimited. I need to extract all the unique values in a certain column of these data files (say column 25) and write these values into an output file for further processing. How might I do this in Perl? Remember I need to consider multiple files in the same folder.
Edit: The code I've written thus far looks like this.
#!/usr/bin/perl
use warnings;
use strict;
my @hhfilelist = glob "*.hh3";
for my $f (@hhfilelist) {
    open F, $f || die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @line = split /\t/;
        print "field is $line[24]\n";
    }
    close (F);
}
The question is how do I efficiently create the hash/array of unique values as I read each line of each file. Or is it faster if I populate the whole array and then remove duplicates?
Some tips on how to handle the problem:
Find files
For finding files within a directory, use glob: glob '.* *'
For finding files within a directory tree, use File::Find's find function
Open each file, use Text::CSV with \t character as the delimiter, extract wanted values and write to file
For a Perl solution, use the Text::CSV module to parse flat (X-separated) files - the constructor accepts a parameter specifying the separator character. Do this for every file in a loop, with the file list generated either by glob() for files in a given directory or by File::Find for subdirectories as well.
Then, to get the unique values, for each row store column #25 in a hash.
E.g. after retrieving the values:
$colref = $csv->getline($io);
$unique_values_hash{ $colref->[24] } = 1;
Then, iterate over hash keys and print to a file.
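Putting those pieces together, a sketch of the Text::CSV approach might look like this (the file pattern, the output file name and the error handling are assumptions; the column index comes from the question):

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new( { sep_char => "\t", binary => 1 } )
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my %unique_values_hash;
for my $file ( glob '*.hh3' ) {
    open my $io, '<', $file or die "Cannot open $file: $!";
    while ( my $colref = $csv->getline($io) ) {
        $unique_values_hash{ $colref->[24] } = 1;   # column 25 (0-based index 24)
    }
    close $io;
}

open my $out, '>', 'unique_values.txt' or die $!;   # assumed output file name
print {$out} "$_\n" for sort keys %unique_values_hash;
close $out;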
For a non-Perl shell solution, you can simply do:
cat MyFile_pattern | awk -F'\t' '{print $25}' | sort -u > MyUniqueValuesFile
You can replace awk with cut.
Please note that the non-Perl solution only works if the files don't contain TABs in the fields themselves and the columns aren't quoted.
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' inputs > output
perl -F/\\t/ -ane 'print"$F[24]\n" unless $seen{$F[24]}++' *.hh3 > output
Command-line switches -F/\\t/ -an mean iterate through every line in every input file and split the line on the tab character into the array @F.
$F[24] refers to the value in the 25th field of each line (between the 24th and 25th tab characters).
$seen{...} is a hashtable to keep track of which values have already been observed.
The first time a value is observed, $seen{VALUE} is 0 so Perl will execute the statement print"$F[24]\n". Every other time the value is observed, $seen{VALUE} will be non-zero and the statement won't be executed. This way each unique value gets printed out exactly once.
In a similar context to your larger script:
my @hhfilelist = glob "*.hh3";
my %values_in_field_25 = ();
for my $f (@hhfilelist) {
    open F, $f || die "Cannot open $f: $!";
    while (<F>) {
        my @F = split /\t/;
        $values_in_field_25{$F[24]} = 1;
    }
    close (F);
}
my #unique_values_in_field_25 = keys %values_in_field_25; # or sort keys ...

Is there a simple way to do bulk file text substitution in place?

I've been trying to code a Perl script to substitute some text on all source files of my project. I'm in need of something like:
perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi" *.{cs,aspx,ascx}
But one that parses all the files of a directory recursively.
I just started a script:
use File::Find::Rule;
use strict;
my @files = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
foreach my $f (@files) {
    if ($f =~ s/thisgoesout/thisgoesin/gi) {
        # In-place file editing, or something like that
    }
}
But now I'm stuck. Is there a simple way to edit all files in place using Perl?
Please note that I don't need to keep a copy of every modified file; I have 'em all subversioned =)
Update: I tried this on Cygwin,
perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi" {*,*/*,*/*/*}.{cs,aspx,ascx}
But it looks like my arguments list exploded to the maximum size allowed. In fact, I'm getting very strange errors on Cygwin...
If you assign @ARGV before using *ARGV (aka the diamond <>), $^I/-i will work on those files instead of what was specified on the command line.
use File::Find::Rule;
use strict;
@ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
$^I = '.bak'; # or set `-i` in the #! line or on the command-line
while (<>) {
    s/thisgoesout/thisgoesin/gi;
    print;
}
This should do exactly what you want.
If your pattern can span multiple lines, add undef $/; before the <> so that Perl operates on a whole file at a time instead of line by line.
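For example, a whole-file variant of the loop above might look like this (same file list as before; with $/ undefined, each read returns an entire file, so a pattern written with \s+ could match across line breaks):

use File::Find::Rule;
use strict;
@ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
$^I = '.bak';
undef $/;                        # slurp: each <> now returns a whole file
while (<>) {
    s/thisgoesout/thisgoesin/gi; # a \s+-style pattern could now span lines
    print;
}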
You may be interested in File::Transaction::Atomic or File::Transaction
The SYNOPSIS for F::T::A looks very similar to what you're trying to do:
# In this example, we wish to replace
# the word 'foo' with the word 'bar' in several files,
# with no risk of ending up with the replacement done
# in some files but not in others.
use File::Transaction::Atomic;
my $ft = File::Transaction::Atomic->new;
eval {
    foreach my $file (@list_of_file_names) {
        $ft->linewise_rewrite($file, sub {
            s#\bfoo\b#bar#g;
        });
    }
};
if ($@) {
    $ft->revert;
    die "update aborted: $@";
}
else {
    $ft->commit;
}
Couple that with the File::Find you've already written, and you should be good to go.
You can use Tie::File to scalably access large files and change them in place. See the manpage (man 3perl Tie::File).
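A minimal sketch of that approach, reusing the file list from your script (Tie::File writes each modified line straight back to the underlying file):

use strict;
use warnings;
use Tie::File;
use File::Find::Rule;

my @files = File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.');
for my $f (@files) {
    tie my @lines, 'Tie::File', $f or die "Cannot tie $f: $!";
    s/thisgoesout/thisgoesin/gi for @lines;   # each change is written back in place
    untie @lines;
}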
Change
foreach my $f (@files) {
    if ($f =~ s/thisgoesout/thisgoesin/gi) {
        # in-place file editing, or something like that
    }
}
To
foreach my $f (@files) {
    open my $in, '<', $f;
    open my $out, '>', "$f.out";
    while (my $line = <$in>) {
        chomp $line;
        $line =~ s/thisgoesout/thisgoesin/gi;
        print $out "$line\n";
    }
}
This assumes that the pattern doesn't span multiple lines. If the pattern might span lines, you'll need to slurp in the file contents. ("slurp" is a pretty common Perl term).
The chomp isn't actually necessary, I've just been bitten by lines that weren't chomped one too many times (if you drop the chomp, change print $out "$line\n"; to print $out $line;).
Likewise, you can change open my $out, '>', "$f.out"; to open my $out, '>', undef; to open a temporary file and then copy that file back over the original when the substitution's done. In fact, and especially if you slurp in the whole file, you can simply make the substitution in memory and then write over the original file. But I've made enough mistakes doing that that I always write to a new file, and verify the contents.
Note, I originally had an if statement in that code. That was most likely wrong. That would have only copied over lines that matched the regular expression "thisgoesout" (replacing it with "thisgoesin" of course) while silently gobbling up the rest.
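As a sketch of the slurp variant mentioned above (no backup copy is kept, which matches your setup since everything is under version control):

use strict;
use warnings;
use File::Find::Rule;

my @files = File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.');
foreach my $f (@files) {
    my $contents = do {          # slurp the whole file into one string
        local $/;
        open my $in, '<', $f or die "Cannot read $f: $!";
        <$in>;
    };
    $contents =~ s/thisgoesout/thisgoesin/gi;   # a pattern spanning lines would work here too
    open my $out, '>', $f or die "Cannot write $f: $!";
    print {$out} $contents;
    close $out;
}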
You could use find:
find . -name '*.{cs,aspx,ascx}' | xargs perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi"
This will list all the filenames recursively, then xargs will read its stdin and run the remainder of the command line with the filenames appended on the end. One nice thing about xargs is it will run the command line more than once if the command line it builds gets too long to run in one go.
Note that I'm not sure whether find completely understands all the shell methods of selecting files, so if the above doesn't work then perhaps try:
find . | grep -E '(cs|aspx|ascx)$' | xargs ...
When using pipelines like this, I like to build up the command line and run each part individually before proceeding, to make sure each program is getting the input it wants. So you could run the part without xargs first to check it.
It just occurred to me that although you didn't say so, you're probably on Windows due to the file suffixes you're looking for. In that case, the above pipeline could be run using Cygwin. It's possible to write a Perl script to do the same thing, as you started to do, but you'll have to do the in-place editing yourself because you can't take advantage of the -i switch in that situation.
Thanks to ephemient on this question and on this answer, I got this:
use File::Find::Rule;
use strict;
sub ReplaceText {
    my $regex   = shift;
    my $replace = shift;
    @ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
    $^I = '.bak';
    while (<>) {
        s/$regex/$replace->()/gie;
        print;
    }
}
ReplaceText qr/some(crazy)regexp/, sub { "some $1 text" };
Now I can even loop through a hash containing regexp=>subs entries!
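For example, such a hash-driven loop might look like this (the patterns and replacements are made up purely for illustration):

my %replacements = (
    qr/some(crazy)regexp/ => sub { "some $1 text" },
    qr/other(thing)here/  => sub { "another $1" },
);

while ( my ( $regex, $replace ) = each %replacements ) {
    ReplaceText( $regex, $replace );   # qr// keys stringify, but still work as patterns
}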