Perl: How to get filename when using <> construct? - perl

Perl offers this very nice feature:
while ( <> )
{
# do something
}
...which allows the script to be used as script.pl <filename> as well as cat <filename> | script.pl.
Now, is there a way to determine if the script has been called in the former way, and if yes, what the filename was?
I know I knew this once, and I know I even used the construct, but I cannot remember where / how. And it proved very hard to search the 'net for this ("perl stdin filename"? No...).
Help, please?

The variable $ARGV holds the current file being processed.
$ echo hello1 > file1
$ echo hello2 > file2
$ echo hello3 > file3
$ perl -e 'while(<>){s/^/$ARGV:/; print;}' file*
file1:hello1
file2:hello2
file3:hello3
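To tell the two invocation styles apart, it may help that $ARGV is "-" while <> is reading standard input. The following is a minimal sketch, not a definitive implementation: labelled_lines and scratch_file are hypothetical names, and the demo sets @ARGV itself so that it runs standalone.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: label every line read through <> with its source.
# $ARGV is '-' while <> is reading standard input.
sub labelled_lines {
    my @out;
    while (my $line = <>) {
        my $source = $ARGV eq '-' ? 'STDIN' : $ARGV;
        push @out, "$source: $line";
    }
    return @out;
}

# Hypothetical helper: create a scratch file with the given content.
sub scratch_file {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close $path: $!";
    return $path;
}

# Demo: pretend the scratch file was named on the command line.
my $path = scratch_file("hello\n");
@ARGV = ($path);
print labelled_lines();
```

Note that an explicit "-" argument on the command line also yields $ARGV eq '-', so this check cannot distinguish script.pl - from a bare pipe.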

The I/O Operators section of perlop is very informative about this.
Essentially, the first time <> is executed, - is added to @ARGV if it started out empty. Opening - has the effect of cloning the STDIN file handle, and the variable $ARGV is set to the current element of @ARGV as it is processed.
Here's the full clip.
The null filehandle "<>" is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes a
list of filenames, doing the same to each line of input from all of
them. Input from "<>" comes either from standard input, or from each
file listed on the command line. Here's how it works: the first time
"<>" is evaluated, the @ARGV array is checked, and if it is empty,
$ARGV[0] is set to "-", which when opened gives you standard input. The
@ARGV array is then processed as a list of filenames. The loop
while (<>) {
... # code for each line
}
is equivalent to the following Perl-like pseudo code:
unshift(@ARGV, '-') unless @ARGV;
while ($ARGV = shift) {
open(ARGV, $ARGV);
while (<ARGV>) {
... # code for each line
}
}
except that it isn't so cumbersome to say, and will actually work. It
really does shift the @ARGV array and put the current filename into the
$ARGV variable. It also uses filehandle ARGV internally. "<>" is just
a synonym for "<ARGV>", which is magical. (The pseudo code above doesn't
work because it treats "<ARGV>" as non-magical.)

If you care to know about when <> switches to a new file (e.g. in my case - I wanted to record the new filename and line number), then the eof() function documentation offers a trick:
# reset line numbering on each input file
while (<>) {
next if /^\s*#/; # skip comments
print "$.\t$_";
} continue {
close ARGV if eof; # Not eof()!
}
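To record the file name and a per-file line number, the same eof trick can be combined with $ARGV and $. in one loop. This is a sketch under stated assumptions: numbered_lines and scratch_file are hypothetical names, and the demo creates its own input files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return "file:line: text" for every line read via <>,
# with line numbers restarting at 1 for each file.
sub numbered_lines {
    my @out;
    while (my $line = <>) {
        push @out, "$ARGV:$.: $line";
    }
    continue {
        # eof without parentheses tests the file just read; closing ARGV
        # there resets $. before <> opens the next file.
        close ARGV if eof;
    }
    return @out;
}

# Hypothetical helper: create a scratch file with the given content.
sub scratch_file {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close $path: $!";
    return $path;
}

# Demo: two files, so the counter can be seen restarting.
my @paths = (scratch_file("a\nb\n"), scratch_file("c\n"));
@ARGV = @paths;
print numbered_lines();
```

Without the close, $. would keep counting across files, so the single line of the second file would be reported as line 3 instead of line 1.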

Related

Is there a way to get the current file handle that would be used with the <> operator in perl?

I've seen that close ARGV can close the currently processed file, but it would seem that ARGV isn't actually a file handle, so I can't use it in a read call. Is there any way to get the current file handle, or am I going to have to explicitly open the files myself?
... but it would seem that ARGV isn't actually a file handle, so I can't use it in a read call
ARGV is a filehandle and it can be used within read.
To cite from perlvar:
"... a plain filehandle corresponding to the last file opened by <>"
So it is a filehandle and it can be used within read. But you need to use <> first so that the file actually gets opened. And it will not magically continue with the next file as <> would do.
To test simply do (UNIX shell syntax, you might need to adapt this for Windows):
perl -e '<>; read(ARGV, my $buf, 10); print $buf' file
The <> will open the given file and read the first line. The read then will read the next 10 bytes from the same file.
<> is short for readline( ARGV ).
The file handle used is ARGV.
However, readline has special code to open/reopen ARGV which read doesn't have.
You can, however, achieve a read using readline by manipulating $/.
$ echo abcdef | perl -Mv5.14 -e'local $/ = \2; $_ = <>; say "<<$_>>";'
<<ab>>
$ perl -Mv5.14 -e'local $/ = \2; $_ = <>; say "<<$_>>";' <( echo abcdef )
<<ab>>
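The $/ approach can be wrapped in a small function that reads fixed-size chunks while keeping the file-switching behaviour of <>. This is only a sketch: read_chunks and scratch_file are hypothetical names, and the demo supplies its own input file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read everything <> offers in records of $size bytes.
sub read_chunks {
    my ($size) = @_;
    local $/ = \$size;    # reference to an integer: fixed-size records
    my @chunks;
    while (defined(my $chunk = <>)) {
        push @chunks, $chunk;
    }
    return @chunks;
}

# Hypothetical helper: create a scratch file with the given content.
sub scratch_file {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close $path: $!";
    return $path;
}

# Demo: a 7-byte file read 3 bytes at a time.
my $path = scratch_file("abcdef\n");
@ARGV = ($path);
print "<<$_>>" for read_chunks(3);
print "\n";
```

The last record of each file may be shorter than the requested size, since the data ran out before the record was full.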

Perl: process string with shell command (pipe)

Assume a pipeline with three programs:
start | middle | end
If start and end are now part of one perl script, how can I pipe data through a shell command in the perl script, in order to pass through middle?
I tried the following (apologies for lack of strict mode, it was supposed to be a simple proof of concept):
#!/usr/bin/perl -n
# Output of "start" stage
$start = "a b c d\n";
# This shell command is "middle"
open (PR, "| sed -E 's/a/-/g' |") or die 'Failed to start sed';
# Pipe data from "start" into "middle"
print PR $start;
# Read data from "middle" into "end"
$end = "";
while (<PR>) {
$end .= $_;
}
close PR;
# Apply "end" and print output
$end =~ s/b/+/g;
print $end;
Expected output:
- + c d
Actual output:
none, until I hit ENTER, then I get - b c d. The middle command is receiving data from start and processing it, but the output is going to STDOUT instead of end. Also, the attempt to read from middle seems to be reading from STDIN instead (hence the relevance of hitting ENTER).
I'm aware that this could all easily be done in one line of perl (or sed); my problem is how to do piping in perl, not how to replace chars in a string.
You can use IPC::Open2 for this.
This code creates two file handles: $to_sed, which you can print to to send input to the program, and $from_sed which you can readline (or <$from_sed>) from to read the program's output.
use IPC::Open2;
my $pid = open2(my ($from_sed, $to_sed), "sed -E 's/a/-/g'");
Most often it is simplest to involve the shell, but there is an alternative call that allows you to bypass the shell and instead run a program and populate its argv directly. It is described in the linked documentation.
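Put together with the start and end stages from the question, the open2 call might be used as in the sketch below. It assumes a sed binary is on the PATH; pipe_through is a hypothetical helper name, and the list form of open2 is used so that no shell is involved.

```perl
use strict;
use warnings;
use IPC::Open2;

# Hypothetical helper: send $input through an external command and
# return everything the command wrote to its standard output.
sub pipe_through {
    my ($input, @cmd) = @_;
    my $pid = open2(my $from_child, my $to_child, @cmd);
    print {$to_child} $input;
    close $to_child;                    # signal EOF so the child can finish
    my $output = do { local $/; <$from_child> } // '';
    close $from_child;
    waitpid($pid, 0);                   # reap the child process
    return $output;
}

my $start = "a b c d\n";                                      # "start" stage
my $end   = pipe_through($start, 'sed', '-e', 's/a/-/g');     # "middle" stage
$end =~ s/b/+/g;                                              # "end" stage
print $end;
```

Writing all input before reading any output is fine for small data like this, but it can deadlock once the child's output fills the pipe buffer; larger inputs would need interleaved reads and writes or a temporary file.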
Your code does nothing until you hit Enter because you are using perl -n.
-n causes Perl to assume the following loop around your program, which makes it iterate over filename arguments
somewhat like sed -n or awk:
LINE:
while (<>) {
... # your program goes here
}
The part in your code where you read your file again returns nothing.
If you turn on warnings you will discover that perl doesn't do bi-directional pipes.

How to print without duplicates with perl?

My assignment is a little more in depth than the title but in the title is my main question. Here is the assignment:
Write a perl script that will grep for all occurrences of the regular expression in all regular files in the file/directory list as well as all regular files under the directories in the file/directory list. If a file is not a TEXT file then the file should first be operated on by the unix command strings (no switches) and the resulting lines searched. If the -l switch is given only the file name of the files containing the regular expression should be printed, one per line. A file name should occur a maximum of one time in this case. If the -l switch is not given then all matching lines should be printed, each preceded on the same line by the file name and a colon. An example invocation from the command line:
plgrep 'ba+d' file1 dir1 dir2 file2 file3 dir3
Here is my code:
#!/usr/bin/perl -w
use Getopt::Long;
my $fname = 0;
GetOptions ('l' => \$fname);
$pat = shift @ARGV;
while (<>) {
if (/$pat/) {
$fname ? print "$ARGV\n" : print "$ARGV:$_";
}
}
So far that code does everything it's supposed to except for reading non-text files and printing out duplicates of file names when using the -l switch. Here is an example of my output after entering the following on the command line: plgrep 'ba+d' file1 file2
file1:My dog is bad.
file1:My dog is very baaaaaad.
file2:I am bad at the guitar.
file2:Even though I am bad at the guitar, it is still fun to play!
Which is PERFECT!
But when I use the -l switch to print out only the file names this is what I get after entering the following on the command line: plgrep -l 'ba+d' file1 file2
file1
file1
file2
file2
How do I get rid of those duplicates so it only prints:
file1
file2
I have tried:
$pat = shift @ARGV;
while (<>) {
if (/$pat/) {
$seen{$ARGV}++;
$fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
}
}
But when I try to run it without the -l switch I only get:
file1:My dog is bad.
file2:I am bad at the guitar.
I also tried:
$fname ? print "$ARGV\n" unless ($ARGV > 1) : print "$ARGV:$_";
But I keep getting syntax error at plgrep line 17, near ""$ARGV\n" unless"
If someone could help me out with my duplicates issue as well as the italicized part of the assignment I would truly appreciate it. I don't even know where to start on that italicized part.
If you're printing only file names, you can exit the loop (using the last command) after the first match, since you already know the file matches. By not scanning the rest of the file, this will also prevent the name from being printed repeatedly.
Edited to add: In order to do it this way, you'll also need to switch from using <> to read the files to instead getting the names from @ARGV and opening them normally.
If you want to continue using <>, you'll instead need to watch $ARGV to see when it changes (indicating that you've started reading a new file) and keep a flag to indicate whether the current file has found any matches yet or not. However, this approach would require you to read every file in its entirety, which will be less efficient than only reading enough of each file to know whether it contains at least one match or not (i.e., skipping to the next file after the first match), so I would recommend switching to open instead.
The first syntax problem is simply an extra semicolon.
The second is that you may only use if/unless as a statement modifier at the end of a statement - you can't embed it in the middle of a conditional that way.
$fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
Becomes:
next if $fname && $seen{$ARGV} > 1; # suppress repeats only when -l is given
print $fname ? "$ARGV\n" : "$ARGV:$_";
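If only the names of matching files are wanted, one possibility that keeps <> is to close ARGV after the first match, which makes the next <> move on to the next name in @ARGV without scanning the rest of the file. The sketch below illustrates that idea; files_matching and scratch_file are hypothetical names, and the demo generates its own input files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return each file named in @ARGV that contains
# at least one line matching $pat, listed once.
sub files_matching {
    my ($pat) = @_;
    my @names;
    while (my $line = <>) {
        next unless $line =~ /$pat/;
        push @names, $ARGV;
        close ARGV;    # abandon this file; the next <> opens the next one
    }
    return @names;
}

# Hypothetical helper: create a scratch file with the given content.
sub scratch_file {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close $path: $!";
    return $path;
}

# Demo: two files with several matches each, one with none.
my @paths = (
    scratch_file("My dog is bad.\nMy dog is very baaaaaad.\n"),
    scratch_file("Nothing to see here.\n"),
    scratch_file("I am bad at the guitar.\nStill bad.\n"),
);
@ARGV = @paths;
print "$_\n" for files_matching(qr/ba+d/);
```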

What's the use of <> in Perl?

What's the use of <> in Perl. How to use it ?
If we simply write
<>;
and
while(<>)
what is that the program doing in both cases?
The answers above are all correct, but it might come across more plainly if you understand general UNIX command line usage. It is very common to want a command to work on multiple files. E.g.
ls -l *.c
The command line shell (bash et al) turns this into:
ls -l a.c b.c c.c ...
In other words, ls never sees '*.c' unless the pattern matches no files. Try this at a command prompt (not perl):
echo *
you'll notice that you do not get an *.
So, if the shell is handing you a bunch of file names, and you'd like to go through each one's data in turn, perl's <> operator gives you a nice way of doing that...it puts the next line of the next file (or stdin if no files are named) into $_ (the default scalar).
Here is a poor man's grep:
while(<>) {
print if m/pattern/;
}
Running this script:
./t.pl *
would print out all of the lines of all of the files that match the given pattern.
cat /etc/passwd | ./t.pl
would use cat to generate some lines of text that would then be checked for the pattern by the loop in perl.
So you see, while(<>) gets you a very standard UNIX command line behavior...process all of the files I give you, or process the thing I piped to you.
<>;
is a short way of writing
readline();
or if you add in the default argument,
readline(*ARGV);
readline is an operator that reads a line from the specified file handle. Reading from the special file handle ARGV will read from STDIN if @ARGV is empty or from the concatenation of the files named by @ARGV if it's not.
As for
while (<>)
It's a syntax error. If you had
while (<>) { ... }
it gets rewritten to
while (defined($_ = <>)) { ... }
And as previously explained, that means the same as
while (defined($_ = readline(*ARGV))) { ... }
That means it will read lines from (previously explained) ARGV until there are no more lines to read.
It is called the diamond operator. It returns lines from STDIN if @ARGV is empty, or otherwise from each of the files named in @ARGV in turn. This webpage http://docstore.mik.ua/orelly/perl/learn/ch06_02.htm explains it very well.
In many cases of programming with syntactical sugar like this, Deparse of O is helpful to find out what's happening:
$ perl -MO=Deparse -e 'while(<>){print 42}'
while (defined($_ = <ARGV>)) {
print 42;
}
-e syntax OK
Quoting perldoc perlop:
The null filehandle <> is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes
a list of filenames, doing the same to each line of input from all of
them. Input from <> comes either from standard input, or from each
file listed on the command line.
With no file arguments, it reads from STDIN (standard input):
> cat temp.pl
#!/usr/bin/perl
use strict;
use warnings;
my $count=<>;
print "$count"."\n";
>
below is the execution:
> temp.pl
3
3
>
So as soon as you execute the script, it waits until the user enters some input.
After 3 is entered, the script stores that line in $count and prints it in the next statement.

Need more explanation about what's being done here?

print reverse <>;
print sort <>;
What's the exact steps perl handles with these operations?
It seems for the 1st one,perl not just reverses the order of invocation parameters,but also the contents of each file...
print reverse <>;
<> is evaluated in list context, meaning it "slurps" its input: it reads everything and returns one element per line. In the case of the magic file represented by the files named in @ARGV, it will read the contents of all files in the order given by the command line arguments (@ARGV).
reverse then reverses the order of that list, meaning the last line from the last file comes first, and the first line from the first file comes last.
print then prints the array.
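The difference between scalar and list context can be seen in a small standalone sketch. The file contents and the scratch_file helper are made up for illustration, and @ARGV is set by hand so the example runs without command line arguments.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: create a scratch file with the given content.
sub scratch_file {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "close $path: $!";
    return $path;
}

@ARGV = (scratch_file("a1\na2\n"), scratch_file("b1\nb2\n"));

my $first = <>;    # scalar context: exactly one line
my @rest  = <>;    # list context: every remaining line of every file

print "first: $first";
print "rest, reversed:\n", reverse(@rest);
```

Here $first holds only the first line of the first file, while @rest holds the remaining three lines across both files, so reversing @rest puts the last line of the second file first.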
From your notes, you might want something like this:
perl -e 'sub BEGIN { @ARGV=reverse @ARGV; } print <>;' /etc/motd /etc/passwd
This is described in the docs for I/O operators. Here's an excerpt from the docs:
The null filehandle <> is special: it can be used to emulate the behavior of sed and awk. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-", which when opened gives you standard input. The @ARGV array is then processed as a list of filenames.
It's worth reading the entire doc, as it provides equivalent "non-magical" Perl code equivalent to <> in various use cases.