Can one concatenate two Perl scripts which use different input record separators? - perl

Two Perl scripts, using different input record separators, work together to convert a LaTeX file into something easily searched for human-readable phrases and sentences. Of course, they could be wrapped together by a single shell script. But I am curious whether they can be incorporated into a single Perl script.
The reason for these scripts: It would be a hassle to find "two three" inside short.tex, for instance. But after conversion, grep 'two three' will return the first paragraph.
For any LaTeX file (here, short.tex), the scripts are invoked as follows.
cat short.tex | try1.pl | try2.pl
try1.pl works on paragraphs. It gets rid of LaTeX comments. It makes sure that each word is separated from its neighbors by a single space, so that no sneaky tabs, form feeds, etc., lurk between words. The resulting paragraph occupies a single line, consisting of visible characters separated by single spaces --- and at the end, a sequence of at least two newlines.
try2.pl slurps the entire file. It makes sure that paragraphs are separated from each other by exactly two newlines. And it ensures that the last line of the file is non-trivial, containing visible character(s).
Can one elegantly concatenate two operations such as these, which depend on different input record separators, into a single Perl script, say big.pl? For instance, could the work of try1.pl and try2.pl be accomplished by two functions or bracketed segments inside the larger script?
Incidentally, is there a Stack Overflow keyword for "input record separator"?
###File try1.pl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
while (<>) {
s/[\x25].*\n/ /g; # remove all LaTeX comments. They start with %
s/[\t\f\r ]+/ /g; # collapse each "run" of whitespace to one single space
s/^\s*\n/\n/g; # any line that looks blank is converted to a pure newline;
s/(.)\n/$1/g; # Any line that does not look blank is joined to the subsequent line
print;
print "\n\n"; # make sure each paragraph is separated from its fellows by newlines
}
###File try2.pl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = undef; # input record separator: entire text or file is a single record.
while (<>) {
s/[\n][\n]+/\n\n/g; # exactly 2 blank lines separate paragraphs. Like cat -s
s/[\n]+$/\n/; # last line is nontrivial; no blank line at the end
print;
}
###File short.tex:
\paragraph{One}
% comment
two % also 2
three % or 3
% comment
% comment
% comment
% comment
% comment
% comment
So they said%
that they had done it.
% comment
% comment
% comment
Fleas.
% comment
% comment
After conversion:
\paragraph{One} two three
So they said that they had done it.
Fleas.

To combine try1.pl and try2.pl into a single script you could try:
local $/ = "";
my #lines;
while (<>) {
[...] # Same code as in try1.pl except print statements
push #lines, $_;
}
$lines[-1] =~ s/\n+$/\n/;
print for #lines;

A pipe connects the output of one process to the input of another process. Neither one knows about the other nor cares how it operates.
But, putting things together like this breaks the Unix pipeline philosophy of small tools that each excel at a very narrow job. Should you link these two things, you'll always have to do both tasks even if you want one (although you could get into configuration to turn off one, but that's a lot of work).
I process a lot of LaTeX, and I control everything through a Makefile. I don't really care about what the commands look like and I don't even have to remember what they are:
short-clean.tex: short.tex
cat short.tex | try1.pl | try2.pl > $#
Let's do it anyways
I'll limit myself to the constraint of basic concatenation instead of complete rewriting or rearranging, most because there are some interesting things to show.
Consider what happens should you concatenate those two programs by simply adding the text of the second program at the end of the text of the first program.
The output from the original first program still goes to standard output and the second program now doesn't get that output as input.
The input to the program is likely exhausted by the original first program and the second program now has nothing to read. That's fine because it would have read the unprocessed input to the first program.
There are various ways to fix this, but none of them make much sense when you already have two working program that do their job. I'd shove that in the Makefile and forget about it.
But, suppose you do want it all in one file.
Rewrite the first section to send its output to a filehandle connected to a string. It's output is now in the programs memory. This basically uses the same interface, and you can even use select to make that the default filehandle.
Rewrite the second section to read from a filehandle connected to that string.
Alternately, you can do the same thing by writing to a temporary file in the first part, then reading that temporary file in the second part.
A much more sophisticated program would the first program write to a pipe (inside the program) that the second program is simultaneously reading. However, you have to pretty much rewrite everything so the two programs are happening simultaneously.
Here's Program 1, which uppercases most of the letters:
#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
print tr/a-z/A-Z/r;
}
and here's Program 2, which collapses whitespace:
#!/usr/bin/perl
use v5.26;
$|++;
while( <<>> ) { # safer line input operator
print s/\s+/ /gr;
}
They work serially to get the job done:
$ perl program1.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D
$ perl program2.pl
The quick brown dog jumped over the lazy fox.
The quick brown dog jumped over the lazy fox.
^D
$ perl program1.pl | perl program2.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D
Now I want to combine those. First, I'll make some changes that don't affect the operation but will make it easier for me later. Instead of using implicit filehandles, I'll make those explicit and one level removed from the actual filehandles:
Program 1:
#!/usr/bin/perl
use v5.26;
$|++;
my $output_fh = \*STDOUT;
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
Program 2:
#!/usr/bin/perl
$|++;
my $input_fh = \*STDIN;
while( <$input_fh> ) { # safer line input operator
print s/\s+/ /gr;
}
Now I have the chance to change what those filehandles are without disturbing the meat of the program. The while doesn't know or care what that filehandle is, so let's start by writing to a file in Program 1 and reading from that same file in Program 2:
Program 1:
#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
Program 2:
#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
However, you can no longer run these in a pipeline because Program 1 doesn't use standard output and Program 2 doesn't read standard input:
% perl program1.pl
% perl program2.pl
You can, however, now join the programs, shebang and all:
#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
You can skip the file and use a string instead, but at this point, you've gone beyond merely concatenating files and need a little coordination for them to share the scalar with the data. Still, the meat of the program doesn't care how you made those filehandles:
#!/usr/bin/perl
use v5.26;
my $output_string;
open my $output_fh, '>', \ $output_string or die "$!";
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
#!/usr/bin/perl
$|++;
open my $input_fh, '<', \ $output_string or die "$!";
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
So let's go one step further and do what the shell was already doing for us.
#!/usr/bin/perl
use v5.26;
pipe my $input_fh, my $output_fh;
$output_fh->autoflush(1);
while( <<>> ) { # safer line input operator
print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
while( <$input_fh> ) { # safer line input operator
print s/\h+/ /gr;
}
From here, it gets a bit tricky and I'm not going to go to the next step with polling filehandles so one thing can write and the the next thing reads. There are plenty of things that do that for you. And, you're now doing a lot of work to avoid something that was already simple and working.
Instead of all that pipe nonsense, the next step is to separate code into functions (likely in a library), and deal with those chunks of code as named things that hide their details:
use Local::Util qw(remove_comments minify);
while( <<>> ) {
my $result = remove_comments($_);
$result = minify( $result );
...
}
That can get even fancier where you simply go through a series of steps without knowing what they are or how many of them there will be. And, since all the baby steps are separate and independent, you're basically back to the pipeline notion:
use Local::Util qw(get_input remove_comments minify);
my $result;
my #steps = qw(get_input remove_comments minify)
while( ! eof() ) { # or whatever
no strict 'refs'
$result = &{$_}( $result ) for #steps;
}
A better way makes that an object so you can skip the soft reference:
use Local::Processor;
my #steps = qw(get_input remove_comments minify);
my $processer = Local::Processor->new( #steps );
my $result;
while( ! eof() ) { # or whatever
$result = $processor->$_($result) for #steps;
}
Like I did before, the meat of the program doesn't care or know about the steps ahead of time. That means that you can move the sequence of steps to configuration and use the same program for any combination and sequence:
use Local::Config;
use Local::Processor;
my #steps = Local::Config->new->get_steps;
my $processer = Local::Processor->new;
my $result;
while( ! eof() ) { # or whatever
$result = $processor->$_($result) for #steps;
}
I write quite a bit about this sort of stuff in Mastering Perl and Effective Perl Programming. But, because you can do it doesn't mean you should. This reinvents a lot that make can already do for you. I don't do this sort of thing without good reason—bash and make have to be pretty annoying to motivate me to go this far.

The motivating problem was to generate a "cleaned" version of a LaTeX file, which would be easy to search, using regex, for complex phrases or sentences.
The following single Perl script does the job, whereas previously I required one shell script and two Perl scripts, entailing three invocations of Perl. This new, single script incorporates three consecutive loops, each with a different input record separator.
First loop:
input = STDIN, or a file passed as argument; record separator=default, loop by line; print result to fileafterperlLIN, a temporary
file on the hard drive.
Second loop:
input = fileafterperlLIN;
record separator = "", loop by paragraph;
print result to fileafterperlPRG, a temporary file on the hard drive.
Third loop:
input = fileafterperlPRG;
record separator = undef, slurp entire file
print result to STDOUT
This has the disadvantage of printing to and reading from two files on the hard drive, which may slow it down. Advantages are that the operation seems to require only one process; and all the code resides in a single file, which should make it easier to maintain.
#!/usr/bin/perl
# 2019v04v05vFriv17h18m41s
use strict;
use warnings;
use 5.18.2;
my $diagnose;
my $diagnosticstring;
my $exitcode;
my $userName = $ENV{'LOGNAME'};
my $scriptpath;
my $scriptname;
my $scriptdirectory;
my $cdld;
my $fileafterperlLIN;
my $fileafterperlPRG;
my $handlefileafterperlLIN;
my $handlefileafterperlPRG;
my $encoding;
my $count;
sub diagnosticmessage {
return unless ( $diagnose );
print STDERR "$scriptname: ";
foreach $diagnosticstring (#_) {
printf STDERR "$diagnosticstring\n";
}
}
# Routine setup
$scriptpath = $0;
$scriptname = $scriptpath;
$scriptname =~ s|.*\x2f([^\x2f]+)$|$1|;
$cdld = "$ENV{'cdld'}"; # A directory to hold temporary files used by scripts
$exitcode = system("test -d $cdld && test -w $cdld || { printf '%\n' 'cdld not a writeable directory'; exit 1; }");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$scriptdirectory = "$cdld/$scriptname"; # To hold temporary files used by this script
$exitcode = system("test -d $scriptdirectory || mkdir $scriptdirectory");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
diagnosticmessage ( "scriptdirectory=$scriptdirectory" );
$exitcode = system("test -w $scriptdirectory && test -x $scriptdirectory || exit 1;");
die "$scriptname: system returned exitcode=$exitcode: $scriptdirectory not writeable or not executable. bail\n" unless $exitcode == 0;
$fileafterperlLIN = "$scriptdirectory/afterperlLIN.tex";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
$exitcode = system("printf '' > $fileafterperlLIN;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$fileafterperlPRG = "$scriptdirectory/afterperlPRG.tex";
diagnosticmessage ( "fileafterperlPRG=$fileafterperlPRG" );
$exitcode=system("printf '' > $fileafterperlPRG;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
# This script's job: starting with a LaTeX file, which may compile beautifully in pdflatex but be difficult
# to read visually or search automatically,
# (1) convert any line that looks blank --- a "trivial line", containing only whitespace --- to a pure newline. This is because
# (a) LaTeX interprets any whitespace line following a non-blank or "nontrivial" line as end of paragraph, whereas
# (b) Perl needs two consecutive newlines to signal end of paragraph.
# (2) remove all LaTeX comments;
# (3) deal with the \unskip LaTeX construct, etc.
# The result will be
# (4) each LaTeX paragraph will occupy a unique line
# (5) exactly one pair of newlines --- visually, one blank line --- will divide each pair of consecutive paragraphs
# (6) first paragraph will be on first line (no opening blank line) and last paragraph will be on last line (no ending blank line)
# (7) whitespace in output will consist of only
# (a) a single space between readable strings, or
# (b) double newline between paragraphs
#
$handlefileafterperlLIN = undef;
$handlefileafterperlPRG = undef;
$encoding = ":encoding(UTF-8)";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
open($handlefileafterperlLIN, ">> $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for appending: $!";
# Loop 1 / line:
# Default input record separator: loop through one line at a time, delimited by \n
$count = 0;
while (<>) {
$count = $count + 1;
diagnosticmessage ( "line $count" );
s/^\s*\n/\n/mg; # Convert any trivial line to a pure newline.
print $handlefileafterperlLIN $_;
}
close($handlefileafterperlLIN);
open($handlefileafterperlLIN, "< $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for reading: $!";
open($handlefileafterperlPRG, ">> $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for appending: $!";
# Loop PRG / paragraph:
local $/ = ""; # Input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
$count = 0;
while (<$handlefileafterperlLIN>) {
$count = $count + 1;
diagnosticmessage ( "paragraph $count" );
s/(?<!\x5c)[\x25].*\n/ /g; # Remove all LaTeX comments.
# They start with % not \% and extend to end of line or newline character. Join to next line.
# s/(?<!\x5c)([\x24])/\x2a/g; # 2019v04v01vMonv13h44m09s any $ not preceded by backslash \, replace $ by * or something.
# This would be only if we are going to run detex on the output.
s/(.)\n/$1 /g; # Any line that has something other than newline, and then a newline, is joined to the subsequent line
s|([^\x2d])\s*(\x2d\x2d\x2d)([^\x2d])|$1 $2$3|g; # consistent treatment of triple hyphen as em dash
s|([^\x2d])(\x2d\x2d\x2d)\s*([^\x2d])|$1$2 $3|g; # consistent treatment of triple hyphen as em dash, continued
s/[\x0b\x09\x0c\x20]+/ /gm; # collapse each "run" of whitespace other than newline, to a single space.
s/\s*[\x5c]unskip(\x7b\x7d)?\s*(\S)/$2/g; # LaTeX whitespace-collapse across newlines
s/^\s*//; # Any nontrivial line: No indenting. No whitespace in first column.
print $handlefileafterperlPRG $_;
print $handlefileafterperlPRG "\n\n"; # make sure each paragraph ends with 2 newlines, hence at least 1 blank line.
}
close($handlefileafterperlPRG);
open($handlefileafterperlPRG, "< $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for reading: $!";
# Loop slurp
local $/ = undef; # Input record separator: entire file is a single record.
$count = 0;
while (<$handlefileafterperlPRG>) {
$count = $count + 1;
diagnosticmessage ( "slurp $count" );
s/[\n][\n]+/\n\n/g; # Exactly 2 blank lines (newlines) separate paragraphs. Like cat -s
s/[\n]+$/\n/; # Last line is visible or "nontrivial"; no trivial (blank) line at the end
s/^[\n]+//; # No trivial (blank) line at the start. The first line is "nontrivial."
print STDOUT;
}

Related

How to print result STDOUT to a temporary blank new file in the same directory in Perl?

I'm new in Perl, so it's maybe a very basic case that i still can't understand.
Case:
Program tell user to types the file name.
User types the file name (1 or more files).
Program read the content of file input.
If it's single file input, then it just prints the entire content of it.
if it's multi files input, then it combines the contents of each file in a sequence.
And then print result to a temporary new file, which located in the same directory with the program.pl .
file1.txt:
head
a
b
end
file2.txt:
head
c
d
e
f
end
SINGLE INPUT program ioSingle.pl:
#!/usr/bin/perl
print "File name: ";
$userinput = <STDIN>; chomp ($userinput);
#read content from input file
open ("FILEINPUT", $userinput) or die ("can't open file");
#PRINT CONTENT selama ada di file tsb
while (<FILEINPUT>) {
print ; }
close FILEINPUT;
SINGLE RESULT in cmd:
>perl ioSingle.pl
File name: file1.txt
head
a
b
end
I found tutorial code that combine content from multifiles input but cannot adapt the while argument to code above:
while ($userinput = <>) {
print ($userinput);
}
I was stucked at making it work for multifiles input,
How am i suppose to reformat the code so my program could give result like this?
EXPECTED MULTIFILES RESULT in cmd:
>perl ioMulti.pl
File name: file1.txt file2.txt
head
a
b
end
head
c
d
e
f
end
i appreciate your response :)
A good way to start working on a problem like this, is to break it down into smaller sections.
Your problem seems to break down to this:
get a list of filenames
for each file in the list
display the file contents
So think about writing subroutines that do each of these tasks. You already have something like a subroutine to display the contents of the file.
sub display_file_contents {
# filename is the first (and only argument) to the sub
my $filename = shift;
# Use lexical filehandl and three-arg open
open my $filehandle, '<', $filename or die $!;
# Shorter version of your code
print while <$filehandle>;
}
The next task is to get our list of files. You already have some of that too.
sub get_list_of_files {
print 'File name(s): ';
my $files = <STDIN>;
chomp $files;
# We might have more than one filename. Need to split input.
# Assume filenames are separated by whitespace
# (Might need to revisit that assumption - filenames can contain spaces!)
my #filenames = split /\s+/, $files;
return #filenames;
}
We can then put all of that together in the main program.
#!/usr/bin/perl
use strict;
use warnings;
my #list_of_files = get_list_of_files();
foreach my $file (#list_of_files) {
display_file_contents($file);
}
By breaking the task down into smaller tasks, each one becomes easier to deal with. And you don't need to carry the complexity of the whole program in you head at one time.
p.s. But like JRFerguson says, taking the list of files as command line parameters would make this far simpler.
The easy way is to use the diamond operator <> to open and read the files specified on the command line. This would achieve your objective:
while (<>) {
chomp;
print "$_\n";
}
Thus: ioSingle.pl file1.txt file2.txt
If this is the sole objective, you can reduce this to a command line script using the -p or -n switch like:
perl -pe '1' file1.txt file2.txt
perl -ne 'print' file1.txt file2.txt
These switches create implicit loops around the -e commands. The -p switch prints $_ after every loop as if you had written:
LINE:
while (<>) {
# your code...
} continue {
print;
}
Using -n creates:
LINE:
while (<>) {
# your code...
}
Thus, -p adds an implicit print statement.

Summing a column of numbers in a text file using Perl

Ok, so I'm very new to Perl. I have a text file and in the file there are 4 columns of data(date, time, size of files, files). I need to create a small script that can open the file and get the average size of the files. I've read so much online, but I still can't figure out how to do it. This is what I have so far, but I'm not sure if I'm even close to doing this correctly.
#!/usr/bin/perl
open FILE, "files.txt";
##array = File;
while(FILE){
#chomp;
($date, $time, $numbers, $type) = split(/ /,<FILE>);
$total += $numbers;
}
print"the total is $total\n";
This is how the data looks in the file. These are just a few of them. I need to get the numbers in the third column.
12/02/2002 12:16 AM 86016 a2p.exe
10/10/2004 11:33 AM 393 avgfsznew.pl
11/01/2003 04:42 PM 38124 c2ph.bat
Your program is reasonably close to working. With these changes it will do exactly what you want
Always use use strict and use warnings at the start of your program, and declare all of your variables using my. That will help you by finding many simple errors that you may otherwise overlook
Use lexical file handles, the three-parameter form of open, and always check the return status of any open call
Declare the $total variable outside the loop. Declaring it inside the loop means it will be created and destroyed each time around the loop and it won't be able to accumulate a total
Declare a $count variable in the same way. You will need it to calculate the average
Using while (FILE) {...} just tests that FILE is true. You need to read from it instead, so you must use the readline operator like <FILE>
You want the default call to split (without any parameters) which will return all the non-space fields in $_ as a list
You need to add a variable in the assignment to allow for athe AM or PM field in each line
Here is a modification of your code that works fine
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
my ($date, $time, $ampm, $numbers, $type) = split;
$total += $numbers;
$count += 1;
}
print "The total is $total\n";
print "The count is $count\n";
print "The average is ", $total / $count, "\n";
output
The total is 124533
The count is 3
The average is 41511
It's tempting to use Perl's awk-like auto-split option. There are 5 columns; three containing date and time information, then the size and then the name.
The first version of the script that I wrote is also the most verbose:
perl -n -a -e '$total += $F[3]; $num++; END { printf "%12.2f\n", $total / ($num + 0.0); }'
The -a (auto-split) option splits a line up on white space into the array #F. Combined with the -n option (which makes Perl run in a loop that reads the file name arguments in turn, or standard input, without printing each line), the code adds $F[3] (the fourth column, counting from 0) to $total, which is automagically initialized to zero on first use. It also counts the lines in $num. The END block is executed when all the input is read; it uses printf() to format the value. The + 0.0 ensures that the arithmetic is done in floating point, not integer arithmetic. This is very similar to the awk script:
awk '{ total += $4 } END { print total / NR }'
First drafts of programs are seldom optimal — or, at least, I'm not that good a programmer. Revisions help.
Perl was designed, in part, as an awk killer. There is still a program a2p distributed with Perl for converting awk scripts to Perl (and there's also s2p for converting sed scripts to Perl). And Perl does have an automatic (built-in) variable that keeps track of the number of lines read. It has several names. The tersest is $.; the mnemonic name $NR is available if you use English; in the script; so is $INPUT_LINE_NUMBER. So, using $num is not necessary. It also turns out that Perl does a floating point division anyway, so the + 0.0 part was unnecessary. This leads to the next versions:
perl -MEnglish -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $NR; }'
or:
perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $.; }'
You can tune the print format to suit your whims and fancies. This is essentially the script I'd use in the long term; it is fairly clear without being long-winded in any way. The script could be split over multiple lines if you desired. It is a simple enough task that the legibility of the one-line is not a problem, IMNSHO. And the beauty of this is that you don't have to futz around with split and arrays and read loops on your own; Perl does most of that for you. (Granted, it does blow up on empty input; that fix is trivial; see below.)
Recommended version
perl -n -a -e '$total += $F[3]; END { printf "%12.2f\n", $total / $. if $.; }'
The if $. tests whether the number of lines read is zero or not; the printf and division are omitted if $. is zero so the script outputs nothing when given no input.
There is a noble (or ignoble) game called 'Code Golf' that was much played in the early days of Stack Overflow, but Code Golf questions are no longer considered good questions. The object of Code Golf is to write a program that does a particular task in as few characters as possible. You can play Code Golf with this and compress it still further if you're not too worried about the format of the output and you're using at least Perl 5.10:
perl -Mv5.10 -n -a -e '$total += $F[3]; END { say $total / $. if $.; }'
And, clearly, there are a lot of unnecessary spaces and letters in there:
perl -Mv5.10 -nae '$t+=$F[3];END{say$t/$.if$.}'
That is not, however, as clear as the recommended version.
#!/usr/bin/perl
use warnings;
use strict;
open my $file, "<", "files.txt";
my ($total, $cnt);
while(<$file>){
$total += (split(/\s+/, $_))[3];
$cnt++;
}
close $file;
print "number of files: $cnt\n";
print "total size: $total\n";
printf "avg: %.2f\n", $total/$cnt;
Or you can use awk:
awk '{t+=$4} END{print t/NR}' files.txt
Try doing this :
#!/usr/bin/perl -l
use strict; use warnings;
open my $file, '<', "my_file" or die "open error [$!]";
my ($total, $count);
while (<$file>){
chomp;
next if /^$/;
my ($date, $time, $x, $numbers, $type) = split;
$total += $numbers;
$count++;
}
print "the average is " . $total/$count . " and the total is $total";
close $file;
It is as simple as this:
perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' your_file
tested below:
> cat temp
12/02/2002 12:16 AM 86016 a2p.exe
10/10/2004 11:33 AM 393 avgfsznew.pl
11/01/2003 04:42 PM 38124 c2ph.bat
Now the execution:
> perl -F -lane '$a+=$F[3];END{print "The average size is ".$a/$.}' temp
The average size is 41511
>
explanation:
-F -a says store the line in an array format.with the default separator as space or tab.
so nopw $F[3] has you size of the file.
sum up all the sizes in the 4th column untill all the lines are processed.
END will be executed after processing all the lines in the file.
so $. at the end will gives the number of lines.
so $a/$. will give the average.
This solution opens the file and loops through each line of the file. It then splits the file into the five variables in the line by splitting on 1 or more spaces.
open the file for reading, "<", and if it fails, raise an error or die "..."
my ($total, $cnt) are our column total and number of files added count
while(<FILE>) { ... } loops through each line of the file using the file handle and stores the line in $_
chomp removes the input record separator in $_. In unix, the default separator is a newline \n
split(/\s+/, $_) Splits the current line represented by$_, with the delimiter \s+. \s represents a space, the + afterward means "1 or more". So, we split the next line on 1 or more spaces.
Next we update $total and $cnt
#!/usr/bin/perl
open FILE, "<", "files.txt" or die "Error opening file: $!";
my ($total, $cnt);
while(<FILE>){
chomp;
my ($date, $time, $am_pm, $numbers, $type) = split(/\s+/, $_);
$total += $numbers;
$cnt++;
}
close FILE;
print"the total is $total and count of $cnt\n";`

Perl: How to add a line to sorted text file

I want to add a line to the text file in perl which has data in a sorted form. I have seen examples which show how to append data at the end of the file, but since I want the data in a sorted format.
Please guide me how can it be done.
Basically from what I have tried so far :
(I open a file, grep its content to see if the line which I want to add to the file already exists. If it does than exit else add it to the file (such that the data remains in a sorted format)
open(my $FH, $file) or die "Failed to open file $file \n";
#file_data = <$FH>;
close($FH);
my $line = grep (/$string1/, #file_data);
if($line) {
print "Found\n";
exit(1);
}
else
{
#add the line to the file
print "Not found!\n";
}
Here's an approach using Tie::File so that you can easily treat the file as an array, and List::BinarySearch's bsearch_str_pos function to quickly find the insert point. Once you've found the insert point, you check to see if the element at that point is equal to your insert string. If it's not, splice it into the array. If it is equal, don't splice it in. And finish up with untie so that the file gets closed cleanly.
use strict;
use warnings;
use Tie::File;
use List::BinarySearch qw(bsearch_str_pos);
my $insert_string = 'Whatever!';
my $file = 'something.txt';
my #array;
tie #array, 'Tie::File', $file or die $!;
my $idx = bsearch_str_pos $insert_string, #array;
splice #array, $idx, 0, $insert_string
if $array[$idx] ne $insert_string;
untie #array;
The bsearch_str_pos function from List::BinarySearch is an adaptation of a binary search implementation from Mastering Algorithms with Perl. Its convenient characteristic is that if the search string isn't found, it returns the index point where it could be inserted while maintaining the sort order.
Since you have to read the contents of the text file anyway, how about a different approach?
Read the lines in the file one-by-one, comparing against your target string. If you read a line equal to the target string, then you don't have to do anything.
Otherwise, you eventually read a line 'greater' than your current line according to your sort criteria, or you hit the end of the file. In the former case, you just insert the string at that position, and then copy the rest of the lines. In the latter case, you append the string to the end.
If you don't want to do it that way, you can do a binary search in #file_data to find the spot to add the line without having to examine all of the entries, then insert it into the array before outputting the array to the file.
Here's a simple version that reads from stdin (or filename(s) specified on command line) and appends 'string to append' to the output if it's not found in the input. Outuput is printed on stdout.
#! /usr/bin/perl
$found = 0;
$append='string to append';
while(<>) {
$found = 1 if (m/$append/o);
print
}
print "$append\n" unless ($found);;
Modifying it to edit a file in-place (with perl -i) and taking the append string from the command line would be quite simple.
A 'simple' one-liner to insert a line without using any module could be:
perl -ni -le '$insert="lemon"; $eq=($insert cmp $_); if ($eq == 0){$found++}elsif($eq==-1 && !$found){print$insert} print'
giver a list.txt whose context is:
ananas
apple
banana
pear
the output is:
ananas
apple
banana
lemon
pear
{
local ($^I, #ARGV) = ("", $file); # Enable in-place editing of $file
while (<>) {
# If we found the line exactly, bail out without printing it twice
last if $_ eq $insert;
# If we found the place where the line should be, insert it
if ($_ gt $insert) {
print $insert;
print;
last;
}
print;
}
# We've passed the insertion point, now output the rest of the file
print while <>;
}
Essentially the same answer as pavel's, except with a lot of readability added. Note that $insert should already contain a trailing newline.

Perl Regular Expressions + delete line if it starts with #

How to delete lines if they begin with a "#" character using Perl regular expressions?
For example (need to delete the following examples)
line="#a"
line=" #a"
line="# a"
line=" # a"
...
the required syntax
$line =~ s/......../..
or skip loop if line begins with "#"
from my code:
open my $IN ,'<', $file or die "can't open '$file' for reading: $!";
while( defined( $line = <$IN> ) ){
.
.
.
You don't delete lines with s///. (In a loop, you probably want next;)
In the snippet you posted, it would be:
while (my $line = <IN>) {
if ($line =~ /^\s*#/) { next; }
# will skip the rest of the code if a line matches
...
}
Shorter forms /^\s*#/ and next; and next if /^\s*#/; are possible.
perldoc perlre
/^\s*#/
^ - "the beginning of the line"
\s - "a whitespace character"
* - "0 or more times"
# - just a #
Based off Aristotle Pagaltzis's answer you could do:
perl -ni.bak -e'print unless m/^\s*#/' deletelines.txt
Here, the -n switch makes perl put a loop around the code you provide
which will read all the files you pass on the command line in
sequence. The -i switch (for “in-place”) says to collect the output
from your script and overwrite the original contents of each file with
it. The .bak parameter to the -i option tells perl to keep a backup of
the original file in a file named after the original file name with
.bak appended. For all of these bits, see perldoc perlrun.
deletelines.txt (initially):
#a
b
#a
# a
c
# a
becomes:
b
c
Program (Cut & paste whole thing including DATA section, adjust shebang line, run)
#!/usr/bin/perl
use strict;
use warnings;
while(<DATA>) {
next if /^\s*#/; # skip comments
print; # process data
}
__DATA__
# comment
data
# another comment
more data
Output
data
more data
$text ~= /^\s*#.*\n//g
That will delete all of the lines with # in the entire file of $text, without requiring that you loop through each line of the text manually.

How can I translate a shell script to Perl?

I have a shell script, pretty big one. Now my boss says I must rewrite it in Perl.
Is there any way to write a Perl script and use the existing shell code as is in my Perl script. Something similar to Inline::C.
Is there something like Inline::Shell? I had a look at inline module, but it supports only languages.
I'll answer seriously. I do not know of any program to translate a shell script into Perl, and I doubt any interpreter module would provide the performance benefits. So I'll give an outline of how I would go about it.
Now, you want to reuse your code as much as possible. In that case, I suggest selecting pieces of that code, write a Perl version of that, and then call the Perl script from the main script. That will enable you to do the conversion in small steps, assert that the converted part is working, and improve gradually your Perl knowledge.
As you can call outside programs from a Perl script, you can even replace some bigger logic with Perl, and call smaller shell scripts (or other commands) from Perl to do something you don't feel comfortable yet to convert. So you'll have a shell script calling a perl script calling another shell script. And, in fact, I did exactly that with my own very first Perl script.
Of course, it's important to select well what to convert. I'll explain, below, how many patterns common in shell scripts are written in Perl, so that you can identify them inside your script, and create replacements by as much cut&paste as possible.
First, both Perl scripts and Shell scripts are code+functions. Ie, anything which is not a function declaration is executed in the order it is encountered. You don't need to declare functions before use, though. That means the general layout of the script can be preserved, though the ability to keep things in memory (like a whole file, or a processed form of it) makes it possible to simplify tasks.
A Perl script, in Unix, starts with something like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
#other libraries
(rest of the code)
The first line, obviously, points to the commands to be used to run the script, just like normal shells do. The following two "use" lines make then language more strict, which should decrease the amount of bugs you encounter because you don't know the language well (or plain did something wrong). The third use line imports the "Dumper" function of the "Data" module. It's useful for debugging purposes. If you want to know the value of an array or hash table, just print Dumper(whatever).
Note also that comments are just like shell's, lines starting with "#".
Now, you call external programs and pipe to or pipe from them. For example:
open THIS, "cat $ARGV[0] |";
That will run cat, passing "$ARGV[0]", which would be $1 on shell -- the first argument passed to it. The result of that will be piped into your Perl script through "THIS", which you can use to read that from it, as I'll show later.
You can use "|" at the beginning or end of line, to indicate the mode "pipe to" or "pipe from", and specify a command to be run, and you can also use ">" or ">>" at the beginning, to open a file for writing with or without truncation, "<" to explicitly indicate opening a file for reading (the default), or "+<" and "+>" for read and write. Notice that the later will truncate the file first.
Another syntax for "open", which will avoid problems with files with such characters in their names, is having the opening mode as a second argument:
open THIS, "-|", "cat $ARGV[0]";
This will do the same thing. The mode "-|" stands for "pipe from" and "|-" stands for "pipe to". The rest of the modes can be used as they were (>, >>, <, +>, +<). While there is more than this to open, it should suffice for most things.
But you should avoid calling external programs as much as possible. You could open the file directly, by doing open THIS, "$ARGV[0]";, for example, and have much better performance.
So, what external programs you could cut out? Well, almost everything. But let's stay with the basics: cat, grep, cut, head, tail, uniq, wc, sort.
CAT
Well, there isn't much to be said about this one. Just remember that, if possible, read the file only once and keep it in memory. If the file is huge you won't do that, of course, but there are almost always ways to avoid reading a file more than once.
Anyway, the basic syntax for cat would be:
my $filename = "whatever";
open FILE, "$filename" or die "Could not open $filename!\n";
while(<FILE>) {
print $_;
}
close FILE;
This opens a file, and prints all it's contents ("while(<FILE>)" will loop until EOF, assigning each line to "$_"), and close it again.
If I wanted to direct the output to another file, I could do this:
my $filename = "whatever";
my $anotherfile = "another";
open (FILE, "$filename") || die "Could not open $filename!\n";
open OUT, ">", "$anotherfile" or die "Could not open $anotherfile for writing!\n";
while(<FILE>) {
print OUT $_;
}
close FILE;
This will print the line to the file indicated by "OUT". You can use STDIN, STDOUT and STDERR in the appropriate places as well, without having to open them first. In fact, "print" defaults to STDOUT, and "die" defaults to "STDERR".
Notice also the "or die ..." and "|| die ...". The operators or and || means it will only execute the following command if the first returns false (which means empty string, null reference, 0, and the like). The die command stops the script with an error message.
The main difference between "or" and "||" is priority. If "or" was replaced by "||" in the examples above, it would not work as expected, because the line would be interpreted as:
open FILE, ("$filename" || die "Could not open $filename!\n");
Which is not at all what is expected. As "or" has a lower priority, it works. In the line where "||" is used, the parameters to open are passed between parenthesis, making it possible to use "||".
Alas, there is something which is pretty much what cat does:
while(<>) {
print $_;
}
That will print all files in the command line, or anything passed through STDIN.
GREP
So, how would our "grep" script work? I'll assume "grep -E", because that's easier in Perl than simple grep. Anyway:
my $pattern = $ARGV[0];
shift #ARGV;
while(<>) {
print $_ if /$pattern/o;
}
The "o" passed to $patttern instructs Perl to compile that pattern only once, thus gaining you speed. Not the style "something if cond". It means it will only execute "something" if the condition is true. Finally, "/$pattern/", alone, is the same as "$_ =~ m/$pattern/", which means compare $_ with the regex pattern indicated. If you want standard grep behavior, ie, just substring matching, you could write:
print $_ if $_ =~ "$pattern";
CUT
Usually, you do better using regex groups to get the exact string than cut. What you would do with "sed", for instance. Anyway, here are two ways of reproducing cut:
while(<>) {
my #array = split ",";
print $array[3], "\n";
}
That will get you the fourth column of every line, using "," as separator. Note #array and $array[3]. The # sigil means "array" should be treated as an, well, array. It will receive an array composed of each column in the currently processed line. Next, the $ sigil means array[3] is a scalar value. It will return the column you are asking for.
This is not a good implementation, though, as "split" will scan the whole string. I once reduced a process from 30 minutes to 2 seconds just by not using split -- the lines where rather large, though. Anyway, the following has a superior performance if the lines are expected to be big, and the columns you want are low:
while(<>) {
my ($column) = /^(?:[^,]*,){3}([^,]*),/;
print $column, "\n";
}
This leverages regular expressions to get the desired information, and only that.
If you want positional columns, you can use:
while(<>) {
print substr($_, 5, 10), "\n";
}
Which will print 10 characters starting from the sixth (again, 0 means the first character).
HEAD
This one is pretty simple:
my $printlines = abs(shift);
my $lines = 0;
my $current;
while(<>) {
if($ARGV ne $current) {
$lines = 0;
$current = $ARGV;
}
print "$_" if $lines < $printlines;
$lines++;
}
Things to note here. I use "ne" to compare strings. Now, $ARGV will always point to the current file, being read, so I keep track of them to restart my counting once I'm reading a new file. Also note the more traditional syntax for "if", right along with the post-fixed one.
I also use a simplified syntax to get the number of lines to be printed. When you use "shift" by itself it will assume "shift #ARGV". Also, note that shift, besides modifying #ARGV, will return the element that was shifted out of it.
As with a shell, there is no distinction between a number and a string -- you just use it. Even things like "2"+"2" will work. In fact, Perl is even more lenient, cheerfully treating anything non-number as a 0, so you might want to be careful there.
This script is very inefficient, though, as it reads ALL file, not only the required lines. Let's improve it, and see a couple of important keywords in the process:
my $printlines = abs(shift);
my #files;
if(scalar(#ARGV) == 0) {
#files = ("-");
} else {
#files = #ARGV;
}
for my $file (#files) {
next unless -f $file && -r $file;
open FILE, "<", $file or next;
my $lines = 0;
while(<FILE>) {
last if $lines == $printlines;
print "$_";
$lines++;
}
close FILE;
}
The keywords "next" and "last" are very useful. First, "next" will tell Perl to go back to the loop condition, getting the next element if applicable. Here we use it to skip a file unless it is truly a file (not a directory) and readable. It will also skip if we couldn't open the file even then.
Then "last" is used to immediately jump out of a loop. We use it to stop reading the file once we have reached the required number of lines. It's true we read one line too many, but having "last" in that position shows clearly that the lines after it won't be executed.
There is also "redo", which will go back to the beginning of the loop, but without reevaluating the condition nor getting the next element.
TAIL
I'll do a little trick here.
my $skiplines = abs(shift);
my #lines;
my $current = "";
while(<>) {
if($ARGV ne $current) {
print #lines;
undef #lines;
$current = $ARGV;
}
push #lines, $_;
shift #lines if $#lines == $skiplines;
}
print #lines;
Ok, I'm combining "push", which appends a value to an array, with "shift", which takes something from the beginning of an array. If you want a stack, you can use push/pop or shift/unshift. Mix them, and you have a queue. I keep my queue with at most 10 elements with $#lines which will give me the index of the last element in the array. You could also get the number of elements in #lines with scalar(#lines).
UNIQ
Now, uniq only eliminates repeated consecutive lines, which should be easy with what you have seen so far. So I'll eliminate all of them:
my $current = "";
my %lines;
while(<>) {
if($ARGV ne $current) {
undef %lines;
$current = $ARGV;
}
print $_ unless defined($lines{$_});
$lines{$_} = "";
}
Now here I'm keeping the whole file in memory, inside %lines. The use of the % sigil indicates this is a hash table. I'm using the lines as keys, and storing nothing as value -- as I have no interest in the values. I check where the key exist with "defined($lines{$_})", which will test if the value associated with that key is defined or not; the keyword "unless" works just like "if", but with the opposite effect, so it only prints a line if the line is NOT defined.
Note, too, the syntax $lines{$_} = "" as a way to store something in a hash table. Note the use of {} for hash table, as opposed to [] for arrays.
WC
This will actually use a lot of stuff we have seen:
my $current;
my %lines;
my %words;
my %chars;
while(<>) {
$lines{"$ARGV"}++;
$chars{"$ARGV"} += length($_);
$words{"$ARGV"} += scalar(grep {$_ ne ""} split /\s/);
}
for my $file (keys %lines) {
print "$lines{$file} $words{$file} $chars{$file} $file\n";
}
Three new things. Two are the "+=" operator, which should be obvious, and the "for" expression. Basically, a "for" will assign each element of the array to the variable indicated. The "my" is there to declare the variable, though it's unneeded if declared previously. I could have an #array variable inside those parenthesis. The "keys %lines" expression will return as an array they keys (the filenames) which exist for the hash table "%lines". The rest should be obvious.
The third thing, which I actually added only revising the answer, is the "grep". The format here is:
grep { code } array
It will run "code" for each element of the array, passing the element as "$_". Then grep will return all elements for which the code evaluates to "true" (not 0, not "", etc). This avoids counting empty strings resulting from consecutive spaces.
Similar to "grep" there is "map", which I won't demonstrate here. Instead of filtering, it will return an array formed by the results of "code" for each element.
SORT
Finally, sort. This one is easy too:
my #lines;
my $current = "";
while(<>) {
if($ARGV ne $current) {
print sort #lines;
undef #lines;
$current = $ARGV;
}
push #lines, $_;
}
print sort #lines;
Here, "sort" will sort the array. Note that sort can receive a function to define the sorting criteria. For instance, if I wanted to sort numbers I could do this:
my #lines;
my $current = "";
while(<>) {
if($ARGV ne $current) {
print sort #lines;
undef #lines;
$current = $ARGV;
}
push #lines, $_;
}
print sort {$a <=> $b} #lines;
Here "$a" and "$b" receive the elements to be compared. "<=>" returns -1, 0 or 1 depending on whether the number is less than, equal to or greater than the other. For strings, "cmp" does the same thing.
HANDLING FILES, DIRECTORIES & OTHER STUFF
As for the rest, basic mathematical expressions should be easy to understand. You can test certain conditions about files this way:
for my $file (#ARGV) {
print "$file is a file\n" if -f "$file";
print "$file is a directory\n" if -d "$file";
print "I can read $file\n" if -r "$file";
print "I can write to $file\n" if -w "$file";
}
I'm not trying to be exaustive here, there are many other such tests. I can also do "glob" patterns, like shell's "*" and "?", like this:
for my $file (glob("*")) {
print $file;
print "*" if -x "$file" && ! -d "$file";
print "/" if -d "$file";
print "\t";
}
If you combined that with "chdir", you can emulate "find" as well:
sub list_dir($$) {
my ($dir, $prefix) = #_;
my $newprefix = $prefix;
if ($prefix eq "") {
$newprefix = $dir;
} else {
$newprefix .= "/$dir";
}
chdir $dir;
for my $file (glob("*")) {
print "$prefix/" if $prefix ne "";
print "$dir/$file\n";
list_dir($file, $newprefix) if -d "$file";
}
chdir "..";
}
list_dir(".", "");
Here we see, finally, a function. A function is declared with the syntax:
sub name (params) { code }
Strictly speakings, "(params)" is optional. The declared parameter I used, "($$)", means I'm receiving two scalar parameters. I could have "#" or "%" in there as well. The array "#_" has all the parameters passed. The line "my ($dir, $prefix) = #_" is just a simple way of assigning the first two elements of that array to the variables $dir and $prefix.
This function does not return anything (it's a procedure, really), but you can have functions which return values just by adding "return something;" to it, and have it return "something".
The rest of it should be pretty obvious.
MIXING EVERYTHING
Now I'll present a more involved example. I'll show some bad code to explain what's wrong with it, and then show better code.
For this first example, I have two files, the names.txt file, which names and phone numbers, the systems.txt, with systems and the name of the responsible for them. Here they are:
names.txt
John Doe, (555) 1234-4321
Jane Doe, (555) 5555-5555
The Boss, (666) 5555-5555
systems.txt
Sales, Jane Doe
Inventory, John Doe
Payment, That Guy
I want, then, to print the first file, with the system appended to the name of the person, if that person is responsible for that system. The first version might look like this:
#!/usr/bin/perl
use strict;
use warnings;
open FILE, "names.txt";
while(<FILE>) {
my ($name) = /^([^,]*),/;
my $system = get_system($name);
print $_ . ", $system\n";
}
close FILE;
sub get_system($) {
my ($name) = #_;
my $system = "";
open FILE, "systems.txt";
while(<FILE>) {
next unless /$name/o;
($system) = /([^,]*)/;
}
close FILE;
return $system;
}
This code won't work, though. Perl will complain that the function was used too early for the prototype to be checked, but that's just a warning. It will give an error on line 8 (the first while loop), complaining about a readline on a closed filehandle. What happened here is that "FILE" is global, so the function get_system is changing it. Let's rewrite it, fixing both things:
#!/usr/bin/perl
use strict;
use warnings;
sub get_system($) {
my ($name) = #_;
my $system = "";
open my $filehandle, "systems.txt";
while(<$filehandle>) {
next unless /$name/o;
($system) = /([^,]*)/;
}
close $filehandle;
return $system;
}
open FILE, "names.txt";
while(<FILE>) {
my ($name) = /^([^,]*),/;
my $system = get_system($name);
print $_ . ", $system\n";
}
close FILE;
This won't give any error or warnings, nor will it work. It returns just the sysems, but not the names and phone numbers! What happened? Well, what happened is that we are making a reference to "$_" after calling get_system, but, by reading the file, get_system is overwriting the value of $_!
To avoid that, we'll make $_ local inside get_system. This will give it a local scope, and the original value will then be restored once returned from get_system:
#!/usr/bin/perl
use strict;
use warnings;
sub get_system($) {
my ($name) = #_;
my $system = "";
local $_;
open my $filehandle, "systems.txt";
while(<$filehandle>) {
next unless /$name/o;
($system) = /([^,]*)/;
}
close $filehandle;
return $system;
}
open FILE, "names.txt";
while(<FILE>) {
my ($name) = /^([^,]*),/;
my $system = get_system($name);
print $_ . ", $system\n";
}
close FILE;
And that still doesn't work! It prints a newline between the name and the system. Well, Perl reads the line including any newline it might have. There is a neat command which will remove newlines from strings, "chomp", which we'll use to fix this problem. And since not every name has a system, we might, as well, avoid printing the comma when that happens:
#!/usr/bin/perl
use strict;
use warnings;
sub get_system($) {
my ($name) = #_;
my $system = "";
local $_;
open my $filehandle, "systems.txt";
while(<$filehandle>) {
next unless /$name/o;
($system) = /([^,]*)/;
}
close $filehandle;
return $system;
}
open FILE, "names.txt";
while(<FILE>) {
my ($name) = /^([^,]*),/;
my $system = get_system($name);
chomp;
print $_;
print ", $system" if $system ne "";
print "\n";
}
close FILE;
That works, but it also happens to be horribly inefficient. We read the whole systems file for every line in the names file. To avoid that, we'll read all data from systems once, and then use that to process names.
Now, sometimes a file is so big you can't read it into memory. When that happens, you should try to read into memory any other file needed to process it, so that you can do everything in a single pass for each file. Anyway, here is the first optimized version of it:
#!/usr/bin/perl
use strict;
use warnings;
our %systems;
open SYSTEMS, "systems.txt";
while(<SYSTEMS>) {
my ($system, $name) = /([^,]*),(.*)/;
$systems{$name} = $system;
}
close SYSTEMS;
open NAMES, "names.txt";
while(<NAMES>) {
my ($name) = /^([^,]*),/;
chomp;
print $_;
print ", $systems{$name}" if defined $systems{$name};
print "\n";
}
close NAMES;
Unfortunately, it doesn't work. No system ever appears! What has happened? Well, let's look into what "%systems" contains, by using Data::Dumper:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
our %systems;
open SYSTEMS, "systems.txt";
while(<SYSTEMS>) {
my ($system, $name) = /([^,]*),(.*)/;
$systems{$name} = $system;
}
close SYSTEMS;
print Dumper(%systems);
open NAMES, "names.txt";
while(<NAMES>) {
my ($name) = /^([^,]*),/;
chomp;
print $_;
print ", $systems{$name}" if defined $systems{$name};
print "\n";
}
close NAMES;
The output will be something like this:
$VAR1 = ' Jane Doe';
$VAR2 = 'Sales';
$VAR3 = ' That Guy';
$VAR4 = 'Payment';
$VAR5 = ' John Doe';
$VAR6 = 'Inventory';
John Doe, (555) 1234-4321
Jane Doe, (555) 5555-5555
The Boss, (666) 5555-5555
Those $VAR1/$VAR2/etc is how Dumper displays a hash table. The odd numbers are the keys, and the succeeding even numbers are the values. Now we can see that each name in %systems has a preceeding space! Silly regex mistake, let's fix it:
#!/usr/bin/perl
use strict;
use warnings;
our %systems;
open SYSTEMS, "systems.txt";
while(<SYSTEMS>) {
my ($system, $name) = /^\s*([^,]*?)\s*,\s*(.*?)\s*$/;
$systems{$name} = $system;
}
close SYSTEMS;
open NAMES, "names.txt";
while(<NAMES>) {
my ($name) = /^\s*([^,]*?)\s*,/;
chomp;
print $_;
print ", $systems{$name}" if defined $systems{$name};
print "\n";
}
close NAMES;
So, here, we are aggressively removing any spaces from the beginning or end of name and system. There are other ways to form that regex, but that's beside the point. There is still one problem with this script, which you'll have seen if your "names.txt" and/or "systems.txt" files have an empty line at the end. The warnings look like this:
Use of uninitialized value in hash element at ./exemplo3e.pl line 10, <SYSTEMS> line 4.
Use of uninitialized value in hash element at ./exemplo3e.pl line 10, <SYSTEMS> line 4.
John Doe, (555) 1234-4321, Inventory
Jane Doe, (555) 5555-5555, Sales
The Boss, (666) 5555-5555
Use of uninitialized value in hash element at ./exemplo3e.pl line 19, <NAMES> line 4.
What happened here is that nothing went into the "$name" variable when the empty line was processed. There are many ways around that, but I choose the following:
#!/usr/bin/perl
use strict;
use warnings;
our %systems;
open SYSTEMS, "systems.txt" or die "Could not open systems.txt!";
while(<SYSTEMS>) {
my ($system, $name) = /^\s*([^,]+?)\s*,\s*(.+?)\s*$/;
$systems{$name} = $system if defined $name;
}
close SYSTEMS;
open NAMES, "names.txt" or die "Could not open names.txt!";
while(<NAMES>) {
my ($name) = /^\s*([^,]+?)\s*,/;
chomp;
print $_;
print ", $systems{$name}" if defined($name) && defined($systems{$name});
print "\n";
}
close NAMES;
The regular expressions now require at least one character for name and system, and we test to see if "$name" is defined before we use it.
CONCLUSION
Well, then, these are the basic tools to translate a shell script. You can do MUCH more with Perl, but that was not your question, and it wouldn't fit here anyway.
Just as a basic overview of some important topics,
A Perl script that might be attacked by hackers need to be run with the -T option, so that Perl will complain about any vulnerable input which has not been properly handled.
There are libraries, called modules, for database accesses, XML&cia handling, Telnet, HTTP & other protocols. In fact, there are miriads of modules which can be found at CPAN.
As mentioned by someone else, if you make use of AWK or SED, you can translate those into Perl with A2P and S2P.
Perl can be written in an Object Oriented way.
There are multiple versions of Perl. As of this writing, the stable one is 5.8.8 and there is a 5.10.0 available. There is also a Perl 6 in development, but experience has taught everyone not to wait too eagerly for it.
There is a free, good, hands-on, hard & fast book about Perl called Learning Perl The Hard Way. It's style is similar to this very answer. It might be a good place to go from here.
I hope this helped.
DISCLAIMER
I'm NOT trying to teach Perl, and you will need to have at least some reference material. There are guidelines to good Perl habits, such as using "use strict;" and "use warnings;" at the beginning of the script, to make it less lenient of badly written code, or using STDOUT and STDERR on the print lines, to indicate the correct output pipe.
This is stuff I agree with, but I decided it would detract from the basic goal of showing patterns for common shell script utilities.
I don't know what's in your shell script, but don't forget there are tools like
a2p - awk-to-perl
s2p - sed-to-perl
and perhaps more. Worth taking a look around.
You may find that due to Perl's power/features, it's not such a big job, in that you may have been jumping through hoops with various bash features and utility programs to do something that comes out of Perl natively.
Like any migration project, it's useful to have some canned regression tests to run with both solutions, so if you don't have those, I'd generate those first.
I'm surprised no-one has yet mentioned the Shell module that is included with core Perl, which lets you execute external commands using function-call syntax. For example (adapted from the synopsis):
use Shell qw(cat ps cp);
$passwd = cat '</etc/passwd';
#pslines = ps '-ww';
cp "/etc/passwd", "/tmp/passwd";
Provided you use parens, you can even call other programs in the $PATH that you didn't mention on the use line, e.g.:
gcc('-o', 'foo', 'foo.c');
Note that Shell gathers up the subprocess's STDOUT and returns it as a string or array. This simplifies scripting, but it is not the most efficient way to go and may cause trouble if you rely on a command's output being unbuffered.
The module docs mention some shortcomings, such as that shell internal commands (e.g. cd) cannot be called using the same syntax. In fact they recommend that the module not be used for production systems! But it could certainly be a helpful crutch to lean on until you get your code ported across to "proper" Perl.
The inline shell thingy is called system. If you have user-defined functions you're trying to expose to Perl, you're out of luck. However, you can run short bits of shell using the same environment as your running Perl program. You can also gradually replace parts of the shell script with Perl. Start writing a module that replicates the shell script functionality and insert Perly bits into the shell script until you eventually have mostly Perl.
There's no shell-to-Perl translator. There was a long running joke about a csh-to-Perl translator that you could email your script to, but that was really just Tom Christainsen translating it for you to show you how cool Perl was back in the early 90s. Randal Schwartz uploaded a sh-to-Perl translator, but you have to check the upload date: it was April Fool's day. His script merely wrapped everything in system.
Whatever you do, don't lose the original shell script. :)
I agree that learning Perl and trying to write Perl instead of shell is for the greater good. I did the transfer once with the help of the "Replace" function of Notepad++.
However, I had a similar problem to the one initially asked while I was trying to create a Perl wrapper around a shell script (that could execute it).
I came with the following code that works in my case.
It might help.
#!perl
use strict;
use Data::Dumper;
use Cwd;
#Variables read from shell
our %VAR;
open SH, "<$ARGV[0]" or die "Error while trying to read $ARGV[0] ($!)\n";
my #SH=<SH>;
close SH;
sh2perl(#SH);
#Subroutine to execute shell from Perl (read from array)
sub sh2perl {
#Variables
my %case; #To store data from conditional block of "case"
my %if; #To store data from conditional block of "if"
foreach my $line (#_) {
#Remove blanks at the beginning and EOL character
$line=~s/^\s*//;
chomp $line;
#Comments and blank lines
if ($line=~/^(#.*|\s*)$/) {
#Do nothing
}
#Conditional block - Case
elsif ($line=~/case.*in/..$line=~/esac/) {
if ($line=~/case\s*(.*?)\s*\in/) {
$case{'var'}=transform($1);
} elsif ($line=~/esac/) {
delete $case{'curr_pattern'};
#Run conditional block
my $case;
map { $case=$_ if $case{'var'}=~/$_/ } #{$case{'list_patterns'}};
$case ? sh2perl(#{$case{'patterns'}->{$case}}) : sh2perl(#{$case{'patterns'}->{"*"}});
} elsif ($line=~/^\s*(.*?)\s*\)/) {
$case{'curr_pattern'}=$1;
push(#{$case{'list_patterns'}}, $case{'curr_pattern'}) unless ($line=~m%\*\)%)
} else {
push(#{$case{'patterns'}->{ $case{'curr_pattern'} }}, $line);
}
}
#Conditional block - if
elsif ($line=~/^if/..$line=~/^fi/) {
if ($line=~/if\s*\[\s*(.*\S)\s*\];/) {
$if{'condition'}=transform($1);
$if{'curr_cond'}="TRUE";
} elsif ($line=~/fi/) {
delete $if{'curr_cond'};
#Run conditional block
$if{'condition'} ? sh2perl(#{$if{'TRUE'}}) : sh2perl(#{$if{'FALSE'}});
} elsif ($line=~/^else/) {
$if{'curr_cond'}="FALSE";
} else {
push(#{$if{ $if{'curr_cond'} }}, $line);
}
}
#echo
elsif($line=~/^echo\s+"?(.*?[^"])"?\s*$/) {
my $str=$1;
#echo with redirection
if ($str=~m%[>\|]%) {
eval { system(transform($line)) };
if ($#) { warn "Error while evaluating $line: $#\n"; }
#print new line
} elsif ($line=~/^echo ""$/) {
print "\n";
#default
} else {
print transform($str),"\n";
}
}
#cd
elsif($line=~/^\s*cd\s+(.*)/) {
chdir $1;
}
#export
elsif($line=~/^export\s+((\w+).*)/) {
my ($var,$exported)=($2,$1);
if ($exported=~/^(\w+)\s*=\s*(.*)/) {
while($exported=~/(\w+)\s*=\s*"?(.*?\S)"?\s*(;(?:\s*export\s+)?|$)/g) { $VAR{$1}=transform($2); }
}
# export($var,$VAR{$var});
$ENV{$var}=$VAR{$var};
print "Exported variable $var = $VAR{$var}\n";
}
#Variable assignment
elsif ($line=~/^(\w+)\s*=\s*(.*)$/) {
$1 eq "" or $VAR{$1}=""; #Empty variable
while($line=~/(\w+)\s*=\s*"?(.*?\S)"?\s*(;|$)/g) {
$VAR{$1}=transform($2);
}
}
#Source
elsif ($line=~/^source\s*(.*\.sh)/) {
open SOURCE, "<$1" or die "Error while trying to open $1 ($!)\n";
my #SOURCE=<SOURCE>;
close SOURCE;
sh2perl(#SOURCE);
}
#Default (assuming running command)
else {
eval { map { system(transform($_)) } split(";",$line); };
if ($#) { warn "Error while doing system on \"$line\": $#\n"; }
}
}
}
sub transform {
my $src=$_[0];
#Variables $1 and similar
$src=~s/\$(\d+)/$ARGV[$1-1]/ge;
#Commands stored in variables "$(<cmd>)"
eval {
while ($src=~m%\$\((.*)\)%g) {
my ($cmd,$new_cmd)=($1,$1);
my $curr_dir=getcwd;
$new_cmd=~s/pwd/echo $curr_dir/g;
$src=~s%\$\($cmd\)%`$new_cmd`%e;
chomp $src;
}
};
if ($#) { warn "Wrong assessment for variable $_[0]:\n=> $#\n"; return "ERROR"; }
#Other variables
$src=~s/\$(\w+)/$VAR{$1}/g;
#Backsticks
$src=~s/`(.*)`/`$1`/e;
#Conditions
$src=~s/"(.*?)"\s*==\s*"(.*?)"/"$1" eq "$2" ? 1 : 0/e;
$src=~s/"(.*?)"\s*!=\s*"(.*?)"/"$1" ne "$2" ? 1 : 0/e;
$src=~s/(\S+)\s*==\s*(\S+)/$1 == $2 ? 1 : 0/e;
$src=~s/(\S+)\s*!=\s*(\S+)/$1 != $2 ? 1 : 0/e;
#Return Result
return $src;
}
You could start your "Perl" script with:
#!/bin/bash
Then, assuming bash was installed at that location, perl would automatically invoke the bash interpretor to run it.
Edit: Or maybe the OS would intercept the call and stop it getting to Perl. I'm finding it hard to track down the documentation on how this actually works. Comments to documentation would be welcomed.