What happens internally when you have < FH >, <>, or < * > in perl? - perl

I apologize if this question sounds simple, my intention is to understand in depth how this (these?) particular operator(s) works and I was unable to find a satisfactory description in the perldocs (It probably exists somewhere, I just couldn't find it for the life of me)
Particularly, I am interested in knowing if
a) <>
b) <*> or whatever glob and
c) <FH>
are fundamentally similar or different, and how they are used internally.
I built my own testing functions to gain some insight on this (presented below). I still don't have a full understanding (my understanding might even be wrong) but this is what I've concluded:
<>
In Scalar Context: Reads the next line of the "current file" being read (provided in #ARGV). Questions: This seems like a very particular scenario, and I wonder why it is the way it is and whether it can be generalized or not. Also what is the "current file" that is being read? Is it in a file handle? What is the counter?
In List Context: Reads ALL of the files in #ARGV into an array
<list of globs>
In Scalar Context: Name of the first file found in current folder that matches the glob. Questions: Why the current folder? How do I change this? Is the only way to change this doing something like < /home/* > ?
In List Context: All the files that match the glob in the current folder.
<FH> just seems to return undef when assigned to a variable.
Questions: Why is it undef? Does it not have a type? Does this behave similarly when the FH is not a bareword filehandle?
General Question: What is it that handles the value of <> and the others during execution? In scalar context, is any sort of reference returned, or are the variables that we assign them to, at that point identical to any other non-ref scalar?
I also noticed that even though I am assigning them in sequence, the output is reset each time. i.e. I would have assumed that when I do
$thing_s = <>;
#thing_l = <>;
#thing_l would be missing the first item, since it was already received by $thing_s. Why is this not the case?
Code used for testing:
use strict;
use warnings;
use Switch;
use Data::Dumper;
die "Call with a list of files\n" if (#ARGV<1);
my #whats = ('<>','<* .*>','<FH>');
my $thing_s;
my #thing_l;
for my $what(#whats){
switch($what){
case('<>'){
$thing_s = <>;
#thing_l = <>;
}
case('<* .*>'){
$thing_s = <* .*>;
#thing_l = <* .*>;
}
case('<FH>'){
open FH, '<', $ARGV[0];
$thing_s = <FH>;
#thing_l = <FH>;
}
}
print "$what in scalar context is: \n".Dumper($thing_s)."\n";
print "$what in list context is: \n".Dumper(#thing_l)."\n";
}

The <> thingies are all iterators. All of these variants have common behaviour:
Used in list context, all remaining elements are returned.
Used in scalar context, only the next element is returned.
Used in scalar context, it returns undef once the iterator is exhausted.
These last two properties make it suitable for use as a condition in while loops.
There are two kinds of iterators that can be used with <>:
Filehandles. In this case <$fh> is equivalent to readline $fh.
Globs, so <* .*> is equivalent to glob '* .*'.
The <> is parsed as a readline when it contains either nothing, a bareword, or a simple scalar. More complex expression can be embedded like <{ ... }>.
It is parsed as a glob in all other cases. This can be made explicit by using quotes: <"* .*"> but you should really be explicit and use the glob function instead.
Some details differ, e.g. where the iterator state is kept:
When reading from a file handle, the file handle holds that iterator state.
When using the glob form, each glob expression has its own state.
Another part is if the iterator can restart:
glob restarts after returning one undef.
filehandles can only be restarted by seeking – not all FHs support this operation.
If no file handle is used in <>, then this defaults to the special ARGV file handle. The behaviour of <ARGV> is as follows:
If #ARGV is empty, then ARGV is STDIN.
Otherwise, the elements of #ARGV are treated as file names. The following pseudocode is executed:
$ARGV = shift #ARGV;
open ARGV, $ARGV or die ...; # careful! no open mode is used
The $ARGV scalar holds the filename, and the ARGV file handle holds that file handle.
When ARGV would be eof, the next file from #ARGV is opened.
Only when #ARGV is completely empty can <> return undef.
This can actually be used as a trick to read from many files:
local #ARGV = qw(foo.txt bar.txt baz.txt);
while (<>) {
...;
}

What is it that handles the value of <> and the others during execution?
The Perl compiler is very context-aware, and often has to choose between multiple ambiguous interpretations of a code segment. It will compile <> as a call to readline or to glob depending on what is inside the brackets.
In scalar context, is any sort of reference returned, or are the variables that we assign them to, at that point identical to any other non-ref scalar?
I'm not sure what you're asking here, or why you think the variables that take the result of a <> should be any different from other variables. They are always simple string values: either a filename returned by glob, or some file data returned by readline.
<FH> just seems to return undef when assigned to a variable. Questions: Why is it undef? Does it not have a type? Does this behave similarly when the FH is not a bareword filehandle?
This form will treat FH as a filehandle, and return the next line of data from the file if it is open and not at eof. Otherwise undef is returned, to indicate that nothing valid could be read. Perl is very flexible with types, but undef behaves as its own type, like Ruby's nil. The operator behaves the same whether FH is a global file handle or a (variable that contains) a reference to a typeglob.

Related

Clear #ARGV before getting input from <>?

I have noticed that the contents of #ARGV gets directed to the input of a <> command.
If I am going to get input from the keyboard using <>, should I clear #ARGV beforehand? Is that the only way to do it?
eg:
#ARGV = ();
$input = <>;
(I was quite surprised that #ARGV interferes with <>. How does that make sense?)
<> means <ARGV>. The ARGV filehandle refers to the concatenation of the files listed in #ARGV. If #ARGV is empty, it acts as if #ARGV = ('-'), which means reading from standard input. (It's magical that way.) See I/O Operators in perldoc perlop and perldoc -f readline.
<> is meant to emulate the common behavior of many unix tools (e.g. cat, sort, wc, ...) that read from standard input if passed no arguments, and otherwise read input from all files listed on the command line (treating - as a directive to read from standard input as well).
If you just want to read from STDIN, do this:
my $line = <STDIN>;
... or, my preference:
my $line = readline STDIN;
(Note that standard input does not necessarily refer to the keyboard. You can easily redirect it e.g. to a file: yourscript.pl < input.txt)

Difference between $var = <FH> and $_ in Perl

Recently I came across something like this in a certain perl script:
while(<FH>){
$var1 = <FH>; $var2 = $_
}
Since the diamond operator with file handle name inside works the same way as readline(FH); may I know are there any special meaning in writing like this?
Thanks a lot
Let's reach for the documentation for the direct question.
From readline
This is the internal function implementing the <EXPR> operator, but you can use it directly. The <EXPR> operator is discussed in more detail in I/O Operators in perlop
and in the I/O Operators we find the statement
<FILEHANDLE> may also be spelled readline(*FILEHANDLE). See readline.
Thus <FH> and readline(FH) are equivalent (we can pass *FH or FH to readline()).
Note that lexical filehandles are preferred to typeglobs. See Typeglobs and Filehandles in perldata for instance. So open your files like
open my $fh, '<', $file or die "Can't open $file: $!";
and then you do <$fh> to read "from the filehandle" (from the file associated with it).
The operator <> itself has a few other properties, though. See the extensive perlop discussion.
The rest of the code snippet in the question brings up other issues.
The <FH> inside the while condition is in the scalar context so it reads one line from the resource connected to FH. As we enter the loop body, the <FH> will again read a line, thus the next one, which is assigned to $var1.
When <$FH> is the sole thing inside the while conditional then the line gets assigned to the default variable $_. See I/O Operators linked above. So $var2 gets assigned this line.
Thus after the body of the loop executed, we have the first line in $var2 and the next line in $var1. This strange loop goes over two lines in each iteration, assigning first the second line of the two, and then the first one.

Perl glob returning a false positive

What seemed liked a straightforward piece of code most certainly didn't do what I wanted it to do.
Can somebody explain to me what it does do and why?
my $dir = './some/directory';
if ( -d $dir && <$dir/*> ) {
print "Dir exists and has non-hidden files in it\n";
}
else {
print "Dir either does not exist or has no non-hidden files in it\n";
}
In my test case, the directory did exist and it was empty. However, the then (first) section of the if triggered instead of the else section as expected.
I don't need anybody to suggest how to accomplish what I want to accomplish. I just want to understand Perl's interpretation of this code, which definitely does not match mine.
Using glob (aka <filepattern>) in a scalar context makes it an iterator; it will return one file at a time each time it is called, and will not respond to changes in the pattern (e.g. a different $dir) until it has finished iterating over the initial results; I suspect this is causing the trouble you see.
The easy answer is to always use it in list context, like so:
if( -d $dir && ( () = <$dir/*> ) ) {
glob may only really be used safely in scalar context in code you will execute more than once if you are absolutely sure you will exhaust the iterator before you try to start a new iteration. Most of the time it's just easier to avoid glob in scalar context altogether.
I believe that #ysth is on the right track, but repeated calls to glob in scalar context don't generate false positives.
For example
use strict;
use warnings;
use 5.010;
say scalar glob('/usr/*'), "\n";
say scalar glob('/usr/*'), "\n";
output
/usr/bin
/usr/bin
But what is true is that any single call to glob maintains a state, so if I have
use strict;
use warnings;
use 5.010;
for my $dir ( '/sys', '/usr', '/sys', '/usr' ) {
say scalar glob("$dir/*"), "\n";
}
output
/sys/block
/sys/bus
/sys/class
/sys/dev
So clearly that glob statement inside the loop is maintaining a state, and ignoring the changes to $dir.
This is similar to the way that the pos (and corresponding \G regex anchor) has a state per scalar variable, and how print without a specific file handle prints to the last selected handle. In the end it is how all of Perl works, with the it variable $_ being the ultimate example.

How is a Perl filehandle a scalar if it can return multiple lines?

I have kind of fundamental question about scalars in Perl. Everything I read says scalars hold one value:
A scalar may contain one single value in any of three different
flavors: a number, a string, or a reference. Although a scalar may not
directly hold multiple values, it may contain a reference to an array
or hash which in turn contains multiple values.
--from perldoc
Was curious how the code below works
open( $IN, "<", "phonebook.txt" )
or die "Cannot open the file\n";
while ( my $line = <$IN> ) {
chomp($line);
my ( $name, $area, $phone ) = split /\|/, $line;
print "$name $phone $phone\n";
}
close $IN;
Just to clarify the code above is opening a pipe delimited text file in the following format name|areacode|phone
It opens the file up and then it splits them into $name $area $phone; how does it go through the multiple lines of the file and print them out?
Going back to the perldoc quote from above "A scalar may contain a single value of a string, number, reference." I am assuming that it has to be a reference, but doesn't even really seem like a reference and if it is looks like it would a reference of a scalar? so I am wondering what is going on internally that allows Perl to iterate through all of the lines in the code?
Nothing urgent, just something I noticed and was curious about. Thanks.
It looks like Borodin zeroed in on the part you wanted, but I'll add to it.
There are variables, which store things for us, and there are operators, which do things for us. A file handle, the thing you have in $IN, isn't the file itself or the data in the file. It's a connection that the program to use to get information from the file.
When you use the line input operator, <>, you give it a file handle to tell it where to grab the next line from. By itself, it defaults to ARGV, but you can put any file handle in there. In this case, you have <$IN>. Borodin already explained the reference and bareword stuff.
So, when you use the line input operator, it look at the connection you give in then gets a line from that file and returns it. You might be able to grok this more easily with it's function form:
my $line = readline( $IN );
The thing you get back doesn't come out of $IN, but the thing it points to. Along the way, $IN keeps track of where it is in the file. See seek and tell.
Along the same lines are Perl's regexes. Many people call something like /foo.*bar/ a regular expression. They are slightly wrong. There's a regular expression inside the pattern match operator //. The pattern is the instructions, but it doesn't do anything by itself until the operator uses it.
I find in my classes if I emphasize the difference between the noun and verb parts of the syntax, people have a much easier time with this sort of stuff.
Old Answer
Through each iteration of the while loop, exactly one value is put into the scalar variables. When the loop is done with a line, everything is reset.
The value in $line is a single value: the entire line which you have not broken up yet. Perl doesn't care what that single value looks like. With each iteration, you deal with exactly one line and that's what's in $line. Remember, these are variables, which means you can modify and replace their values, so they can only hold one thing at a time, but there can be multiple times.
The scalars $name, $area, and $phone have single values, each produced by split. Those are lexical variables (my), so they are only visible inside the specific loop iteration where they are defined.
Beyond that, I'm not sure which scalar you might be confused about.
The old-fashioned way of opening files is to use a bare name for the file handle, like so
open IN, 'phonebook.txt'
A file handle is a special type of value, like scalar, hash, array etc. but it has no prefix symbol to differentiate it. (This isn't actually the full extent of the truth, but I am worried about confusing you if I add even more detail.)
Perl still works like this, but it is best avoided for a couple of reasons.
All such file handles are global, and there is no way to restrict access to them by scope
There is no way to pass the value to a subroutine or store it in a data structure
So Perl was enhanced several years ago so that you can use references to file handles. These can be stored in scalar variables, arrays, or hashes, and can be passed as subroutine parameters.
What happens now when you write
open my $in, '<', 'phonebook.txt'
is that perl autovivifies an anonymous file handle, and puts a reference to it in variable $in, so yes, you were right, it is a reference. (Another thing that was changed about the same time was the move to three-parameter open calls, which allow you to open a file called, say, >.txt for input.)
I hope that helps you to understand. It's an unnecessary level of detail, but it can often help you to remember the way Perl works to understand the underlying details.
Incidentally, it is best to keep to lower-case letters for lexical variables, even for file handle references. I often add fh to the end to indicate that the variable holds a file handle, like $in_fh. But there's no need to use capitals, which are generally reserved for global variables like Package::Names.
Update - The Rest of the Story
I thought I should add something to explain what I have mised out, for fear of misleading people who care about the gory detail.
Perl keeps a symbol table hash - a stash - that work very like ordinary Perl hashes. There is one such stash for each package, including the default package main. Note that this hash nothing to do with lexical variables - declared with my - which are stored entirely separately.
Ther indexes for the stashes are the names of the package variables, without the initial symbol. So, for example, if you have
our $val;
our #val;
our %val;
then the stash will have only a single element, with a key of val and a value which is a reference to an intermediate structure called a typeglob. This is another hash structure, with one element for each different type of variable that has been declared. In this case our val typeglob will have three elements, for the scalar, array, and hash varieties of the val variables.
One of these elements may also be an IO variable type, which is where file handles are kept. But, for historical reasons, the value that is passed around as a file handle is in fact a reference to the typeglob that contains it. That is why, if you write open my $in, '<', 'phonebook.txt' and then print $in you will see something like GLOB(0x269581c) - the GLOB being short for typeglob.
Apart from that, the account above is accurate. Perl autovivifies an anonymous typeglob in the current package, and uses only its IO slot for the file handle.
Scalars in Perl are denoted by a $ and they can indeed contain the type of values you mention in your questions but next to that they can also contain a file handle. You can create file handles in Perl in two ways one way is Lexical
open my $filehandle, '>', '/path/to/file' or die $!;
and the other is global
open FILEHANDLE, '>', '/path/to/file' or die $!;
You should use the Lexical version which is what you're doing.
The while loop in your code uses the <> operator on your lexical filehandle which returns a line out of your file every time it's called, until it's out of lines (when End Of File is reached) in which case it returns false.
I went into a bit more detail on file handles as it seems it's a concept you're not completely clear on.

Can someone suggest how this Perl script works?

I have to maintain the following Perl script:
#!/usr/bin/perl -w
die "Usage: $0 <file1> <file2>\n" unless scalar(#ARGV)>1;
undef $/;
my #f1 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
my #f2 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
die "Error: file1 has $#f1 serials, file2 has $#f2\n" if ($#f1 != $#f2);
foreach my $g (0 .. $#f1) {
print (($f2[$g] =~ m/RESULT:\s+PASS/) ? $f2[$g] : $f1[$g]);
}
print STDERR "$#f1 serials found\n";
I know pretty much what it does, but how it's done is difficult to follow. The calls to split() are particulary puzzling.
It's fairly idiomatic Perl and I would be grateful if a Perl expert could make a few clarifying suggestions about how it does it, so that if I need to use it on input files it can't deal with, I can attempt to modify it.
It combines the best results from two datalog files containing test results. The datalog files contain results for various serial numbers and the data for each serial number begins and ends with SERIAL NUMBER: n (I know this because my equipment creates the input files)
I could describe the format of the datalog files, but I think the only important aspect is the SERIAL NUMBER: n because that's all the Perl script checks for
The ternary operator is used to print a value from one input file or the other, so the output can be redirected to a third file.
This may not be what I would call "idiomatic" (that would be use Module::To::Do::Task) but they've certainly (ab)used some language features here. I'll see if I can't demystify some of this for you.
die "Usage: $0 <file1> <file2>\n" unless scalar(#ARGV)>1;
This exits with a usage message if they didn't give us any arguments. Command-line arguments are stored in #ARGV, which is like C's char **argv except the first element is the first argument, not the program name. scalar(#ARGV) converts #ARGV to "scalar context", which means that, while #ARGV is normally a list, we want to know about it's scalar (i.e. non-list) properties. When a list is converted to scalar context, we get the list's length. Therefore, the unless conditional is satisfied only if we passed no arguments.
This is rather misleading, because it will turn out your program needs two arguments. If I wrote this, I would write:
die "Usage: $0 <file1> <file2>\n" unless #ARGV == 2;
Notice I left off the scalar(#ARGV) and just wrote #ARGV. The scalar() function forces scalar context, but if we're comparing equality with a number, Perl can implicitly assume scalar context.
undef $/;
Oof. The $/ variable is a special Perl built-in variable that Perl uses to tell what a "line" of data from a file is. Normally, $/ is set to the string "\n", meaning when Perl tries to read a line it will read up until the next linefeed (or carriage return/linefeed on Windows). Your writer has undef-ed the variable, though, which means when you try to read a "line", Perl will just slurp up the whole file.
my #f1 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
This is a fun one. <> is a special filehandle that reads line-by-line from each file given on the command line. However, since we've told Perl that a "line" is an entire file, calling <> once will read in the entire file given in the first argument, and storing it temporarily as a string.
Then we take that string and split() it up into pieces, using the regex /(?=(?:SERIAL NUMBER:\s+\d+))/. This uses a lookahead, which tells our regex engine "only match if this stuff comes after our match, but this stuff isn't part of our match," essentially allowing us to look ahead of our match to check on more info. It basically splits the file into pieces, where each piece (except possibly the first) begins with "SERIAL NUMBER:", some arbitrary whitespace (the \s+ part), and then some digits (the \d+ part). I can't teach you regexes, so for more info I recommend reading perldoc perlretut - they explain all of that better than I ever will.
Once we've split the string into a list, we store that list in a list called #f1.
my #f2 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
This does the same thing as the last line, only to the second file, because <> has already read the entire first file, and storing the list in another variable called #f2.
die "Error: file1 has $#f1 serials, file2 has $#f2\n" if ($#f1 != $#f2);
This line prints an error message if #f1 and #f2 are different sizes. $#f1 is a special syntax for arrays - it returns the index of the last element, which will usually be the size of the list minus one (lists are 0-indexed, like in most languages). He uses this same value in his error message, which may be deceptive, as it will print 1 fewer than might be expected. I would write it as:
die "Error: file $ARGV[0] has ", $#f1 + 1, " serials, file $ARGV[1] has ", $#f2 + 1, "\n"
if $#f1 != $#f2;
Notice I changed "file1" to "file $ARGV[0]" - that way, it will print the name of the file you specified, rather than just the ambiguous "file1". Notice also that I split up the die() function and the if() condition on two lines. I think it's more readable that way. I also might write unless $#f1 == $#f2 instead of if $#f1 != $#f2, but that's just because I happen to think != is an ugly operator. There's more than one way to do it.
foreach my $g (0 .. $#f1) {
This is a common idiom in Perl. We normally use for() or foreach() (same thing, really) to iterate over each element of a list. However, sometimes we need the indices of that list (some languages might use the term "enumerate"), so we've used the range operator (..) to make a list that goes from 0 to $#f1, i.e., through all the indices of our list, since $#f1 is the value of the highest index in our list. Perl will loop through each index, and in each loop, will assign the value of that index to the lexically-scoped variable $g (though why they didn't use $i like any sane programmer, I don't know - come on, people, this tradition has been around since Fortran!). So the first time through the loop, $g will be 0, and the second time it will be 1, and so on until the last time it is $#f1.
print (($f2[$g] =~ m/RESULT:\s+PASS/) ? $f2[$g] : $f1[$g]);
This is the body of our loop, which uses the ternary conditional operator ?:. There's nothing wrong with the ternary operator, but if the code gives you trouble we can just change it to an if(). Let's just go ahead and rewrite it to use if():
if($f2[$g] =~ m/RESULT:\s+PASS/) {
print $f2[$g];
} else {
print $f1[$g];
}
Pretty simple - we do a regex check on $f2[$g] (the entry in our second file corresponding to the current entry in our first file) that basically checks whether or not that test passed. If it did, we print $f2[$g] (which will tell us that test passed), otherwise we print $f1[$g] (which will tell us the test that failed).
print STDERR "$#f1 serials found\n";
This just prints an ending diagnostic message telling us how many serials were found (minus one, again).
I personally would rewrite that whole hairy bit where he hacks with $/ and then does two reads from <> to be a loop, because I think that would be more readable, but this code should work fine, so if you don't have to change it too much you should be in good shape.
The undef $/ line deactivates the input record separator. Instead of reading records line by line, the interpreter will read whole files at once after that.
The <>, or 'diamond operator' reads from the files from the command line or standard input, whichever makes sense. In your case, the command line is explicitely checked, so it will be files. Input record separation has been deactivated, so each time you see a <>, you can think of it as a function call returning a whole file as a string.
The split operators take this string and cut it in chunks, each time it meets the regular expression in argument. The (?= ... ) construct means "the delimiter is this, but please keep it in the chunked result anyway."
That's all there is to it. There would always be a few optimizations, simplifications, or "other ways to do it," but this should get you running.
You can get quick glimpse how the script works, by translating it into java or scala. The inccode.com translator delivers following java code:
public class script extends CRoutineProcess implements IInProcess
{
VarArray arrF1 = new VarArray();
VarArray arrF2 = new VarArray();
VarBox call ()
{
// !/usr/bin/perl -w
if (!(BoxSystem.ProgramArguments.scalar().isGT(1)))
{
BoxSystem.die(BoxString.is(VarString.is("Usage: ").join(BoxSystem.foundArgument.get(0
).toString()).join(" <file1> <file2>\n")
)
);
}
BoxSystem.InputRecordSeparator.empty();
arrF1.setValue(BoxConsole.readLine().split(BoxPattern.is("(?=(?:SERIAL NUMBER:\\s+\\d+))")));
arrF2.setValue(BoxConsole.readLine().split(BoxPattern.is("(?=(?:SERIAL NUMBER:\\s+\\d+))")));
if ((arrF1.length().isNE(arrF2.length())))
{
BoxSystem.die("Error: file1 has $#f1 serials, file2 has $#f2\n");
}
for (
VarBox varG : VarRange.is(0,arrF1.length()))
{
BoxSystem.print((arrF2.get(varG).like(BoxPattern.is("RESULT:\\s+PASS"))) ? arrF2.get(varG) : arrF1.get(varG)
);
}
return STDERR.print("$#f1 serials found\n");
}
}