Difference between $var = <FH> and $_ in Perl - perl

Recently I came across something like this in a certain perl script:
while(<FH>){
$var1 = <FH>; $var2 = $_
}
Since the diamond operator with a filehandle name inside works the same way as readline(FH), may I know whether there is any special meaning in writing it like this?
Thanks a lot

Let's reach for the documentation for the direct question.
From readline
This is the internal function implementing the <EXPR> operator, but you can use it directly. The <EXPR> operator is discussed in more detail in I/O Operators in perlop
and in the I/O Operators we find the statement
<FILEHANDLE> may also be spelled readline(*FILEHANDLE). See readline.
Thus <FH> and readline(FH) are equivalent (we can pass *FH or FH to readline()).
Note that lexical filehandles are preferred to typeglobs. See Typeglobs and Filehandles in perldata for instance. So open your files like
open my $fh, '<', $file or die "Can't open $file: $!";
and then you do <$fh> to read "from the filehandle" (from the file associated with it).
The operator <> itself has a few other properties, though. See the extensive perlop discussion.
The rest of the code snippet in the question brings up other issues.
The <FH> inside the while condition is in scalar context, so it reads one line from the resource connected to FH. As we enter the loop body, the <FH> there reads another line, thus the next one, which is assigned to $var1.
When <FH> is the sole thing inside the while condition, the line that was read gets assigned to the default variable $_. See I/O Operators linked above. So $var2 gets assigned this line.
Thus after the body of the loop has executed, we have the first line in $var2 and the next line in $var1. This strange loop consumes two lines in each iteration, assigning first the second of the two lines and then the first one.
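To make the data flow explicit, here is a minimal sketch of what each iteration effectively does (assuming FH is an open filehandle, as in the question):
while (defined($_ = <FH>)) {   # reads line N of the file into $_
    $var1 = <FH>;              # reads line N+1 (the next line)
    $var2 = $_;                # copies line N out of $_
}
Note that if the file has an odd number of lines, the inner <FH> finds nothing to read in the last iteration and $var1 ends up undefined.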

Related

Perl, find a match and read next line in perl

I would like to use
myscript.pl targetfolder/*
to read some number from ASCII files.
myscript.pl
@list = <@ARGV>;
# Is the whole file or only 1st line is loaded?
foreach $file ( @list ) {
open (F, $file);
}
# is this correct to judge if there is still file to load?
while ( <F> ) {
match_replace()
}
sub match_replace {
# if I want to read the 5th line in downward, how to do that?
# if I would like to read multi lines in multi array[row],
# how to do that?
if ( /^\sName\s+/ ) {
$name = $1;
}
}
I would recommend a thorough read of perlintro - it will give you a lot of the information you need. Additional comments:
Always use strict and warnings. The first will enforce some good coding practices (like for example declaring variables), the second will inform you about potential mistakes. For example, one warning produced by the code you showed would be readline() on unopened filehandle F, giving you the hint that F is not open at that point (more on that below).
@list = <@ARGV>;: This is a bit tricky, I wouldn't recommend it - you're essentially using glob, and expanding targetfolder/* is something your shell should be doing, and if you're on Windows, I'd recommend Win32::Autoglob instead of doing it manually.
foreach ... { open ... }: You're not doing anything with the files once you've opened them - the loop to read from the files needs to be inside the foreach.
"Is the whole file or only 1st line is loaded?" open doesn't read anything from the file, it just opens it and provides a filehandle (which you've named F) that you then need to read from.
I'd strongly recommend you use the more modern three-argument form of open and check it for errors, as well as use lexical filehandles since their scope is not global, as in open my $fh, '<', $file or die "$file: $!";.
"is this correct to judge if there is still file to load?" Yes, while (<$filehandle>) is a good way to read a file line-by-line, and the loop will end when everything has been read from the file. You may want to use the more explicit form while (my $line = <$filehandle>), so that your variable has a name, instead of the default $_ variable - it does make the code a bit more verbose, but if you're just starting out that may be a good thing.
match_replace(): You're not passing any parameters to the sub. Even though this code might still "work", it's passing the current line to the sub through the global $_ variable, which is not a good practice because it will be confusing and error-prone once the script starts getting longer.
if (/^\sName\s+/){$name = $1;}: Since you've named the sub match_replace, I'm guessing you want to do a search-and-replace operation. In Perl, that's called s/search/replacement/, and you can read about it in perlrequick and perlretut. As for the code you've shown, you're using $1, but you don't have any "capture groups" ((...)) in your regular expression - you can read about that in those two links as well.
"if I want to read the 5th line in downward , how to do that ?" As always in Perl, There Is More Than One Way To Do It (TIMTOWTDI). One way is with the range operator .. - you can skip the first through fourth lines by saying next if 1..4; at the beginning of the while loop, this will test those line numbers against the special $. variable that keeps track of the most recently read line number.
"and if I would like to read multi lines in multi array[row], how to do that ?" One way is to use push to add the current line to the end of an array. Since keeping the lines of a file in an array can use up more memory, especially with large files, I'd strongly recommend making sure you think through the algorithm you want to use here. You haven't explained why you would want to keep things in an array, so I can't be more specific here.
So, having said all that, here's how I might have written that code. I've added some debugging code using Data::Dumper - it's always helpful to see the data that your script is working with.
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
for my $file (@ARGV) {
print Dumper($file); # debug
open my $fh, '<', $file or die "$file: $!";
while (my $line = <$fh>) {
next if 1..4;
chomp($line); # remove line ending
match_replace($line);
}
close $fh;
}
sub match_replace {
my ($line) = @_; # get argument(s) to sub
my $name;
if ( $line =~ /^\sName\s+(.*)$/ ) {
$name = $1;
}
print Data::Dumper->Dump([$line,$name],['line','name']); # debug
# ... do more here ...
}
The above code is explicitly looping over @ARGV and opening each file, and I did say above that more verbose code can be helpful in understanding what's going on. I just wanted to point out a nice feature of Perl, the "magic" <> operator (discussed in perlop under "I/O Operators"), which will automatically open the files in @ARGV and read lines from them. (There's just one small thing, if I want to use the $. variable and have it count the lines per file, I need to use the continue block I've shown below, this is explained in eof.) This would be a more "idiomatic" way of writing that first loop:
while (<>) { # reads line into $_
next if 1..4;
chomp; # automatically uses $_ variable
match_replace($_);
} continue { close ARGV if eof } # needed for $. (and range operator)

Reading a file line by line in Perl

I want to read a file line by line, but it's reading just the first line. How do I read all the lines?
My code:
open(file_E, $file_E);
while ( <file_E> ) {
/([^\n]*)/;
print $line1;
}
close($file_E);
Let's start by looking at your code.
open(file_E, $file_E);
while ( <file_E> ) {
/([^\n]*)/;
print $line1;
}
close($file_E);
On the first line you open a file named in $file_E using the bareword filehandle file_E. This should work so long as the file successfully opens. It would be better to also check the success of this operation in one of two ways: either put use autodie; at the top of your script (but then risk applying its semantics in places where your code is incompatible with this level of error handling), or change your open to look like this:
open(file_E, $file_E) or die "Failed to open $file_E: $!\n";
Now if you fail to open the file you will get an error message that will help track down the problem.
Next let's look at the while loop, because it's here that you have the issue causing the bug you are experiencing. On the first line of the while loop you have this:
while ( <file_E> ) {
By consulting perldoc perlsyn you will see that this line is special-cased to actually do this:
while (defined($_ = <file_E>)) {
So your code is implicitly assigning each line to $_ on successive iterations. Also, by consulting perldoc perlop you'll find that when the match operator (/.../ or m/.../) is invoked without binding the match explicitly using =~, the match will bind against $_. Still, so far so good. However, you are not actually doing anything useful with the match. The match operator will return Boolean truth / falsehood for whether or not the match succeeded. And because your pattern contains capturing parentheses, it will capture something into the capture variable $1. But you are never testing for match success, nor are you ever referring to $1 again.
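For comparison, here is a minimal sketch of testing the match before using the capture (the pattern and filehandle are hypothetical):
while (my $line = <$fh>) {
    if ($line =~ /^Name:\s*(\S+)/) {   # capture group in parentheses
        print "captured: $1\n";        # $1 is only reliable after a successful match
    }
}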
On the line that follows, you do this: print $line1. Where, in your code, is $line1 being assigned a value? Because it is never being assigned a value in what you've shown us.
I can only guess that your intent is to iterate over the lines of the file, capture the line but without the trailing newline, and then print it. It seems that you wish to print it without any newlines, so that all of the input file is printed as a single line of output.
open my $input_fh_e, '<', $file_E or die "Failed to open $file_E: $!\n";
while(my $line = <$input_fh_e>) {
chomp $line;
print $line;
}
close $input_fh_e or die "Failed to close $file_E: $!\n";
No need to capture anything -- if all that the capture is doing is just grabbing everything up to the newline, you can simply strip off the newline with chomp to begin with.
In my example I used a lexical filehandle (a file handle that is lexically scoped, declared with my). This is generally a better practice in modern Perl as it avoids using a bareword, avoids possible namespace collisions, and assures that the handle will get closed as soon as the lexical scope closes.
I also used the 'three arg' version of open, which is safer because it eliminates the potential for $file_E to be used to open a pipe or do some other nefarious or simply unintended shell manipulation.
I suggest also starting your script with use strict;, because had you done so, you would have gotten an error message at compile time telling you that $line1 was never declared. Also start your script with use warnings;, so that you would get a warning when you try to print $line1 before assigning a value to it.
Most of the issues in your code will be discussed in perldoc perlintro, which you can arrive at from your command line simply by typing perldoc perlintro, assuming you have Perl installed. It typically takes 20-40 minutes to read through perlintro. If ever there were a document that should constitute required reading before getting started writing Perl code, that reading would probably include perlintro.
Another alternative; note that $_ will include the newline, so you will need to chomp it if you don't want the newline in $line:
open(file_E, $file_E);
while ( <file_E> ) {
my $line = $_;
print $line;
}
close($file_E);

What is the meaning of the dot in this open() usage in Perl?

How can I understand the following usage of the open() function in Perl File I/O?
open(FHANDLE, ">" . $file )
I tried to find this type of syntax in the docs but did not find it; please note there is a . (dot) after ">".
All I cannot understand is the use of the dot; the rest I know.
This is an example of the old, two-argument form of open (which should be avoided now that three-argument open is available). In Perl, . is the append operator. It combines the two strings into a single string.
The line of code you posted is equivalent to open(FHANDLE, ">$file" ), it just uses a different method of combining the > and $file.
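A quick illustration of that equivalence, with a made-up filename:
my $file = 'report.txt';
my $spec = ">" . $file;    # concatenation: the string ">report.txt"
print "$spec\n";           # same result as interpolating with ">$file"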
The better way to do it these days would be open(my $fhandle, '>', $file), as shown in the documentation you linked to.
This is the two-argument open. The dot . is the string concatenation operator in Perl. If open is called with two arguments, the second argument contains both the mode and the path.
In your case, it will open the file named in $file for writing.
However, for several reasons you should not do this. It's more common to use the three-argument open, and lexical filehandles instead of global bareword (GLOB) filehandles.
A lexical filehandle makes sure Perl implicitly closes the handle for you as soon as it goes out of scope. Passing the mode and the filename as separate arguments also avoids a security concern: with the combined form, a malicious user could smuggle mode changes into the filename.
open my $fh, '>', $file or die $!;
In addition to the lexical filehandle and the separation of the mode and the filename, this code also checks for errors, which is always a good idea.
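A minimal sketch of the difference (the "filename" here is deliberately hostile and entirely made up):
my $name = "| echo pwned";              # imagine this came from user input
open FH, $name;                         # two-arg open: starts the echo command and opens a pipe to it
open my $fh, '<', $name
    or warn "open failed: $!";          # three-arg open: only tries to open a file literally named "| echo pwned"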

Perl replacement operator doesn't work under Windows when patterns contain slashes

I want to replace a string with a path:
my $somedir = "D:/somedir/someotherdir";
system("perl -pi.bak -e \"s{STRING_TO_BE_REPLACED}{$somedir}\" $file");
but under Windows it replaces the string with random symbols instead of slashes.
What's the problem?
I think it has something to do with a syntax detail needed on Windows, but I can't test that now.
However, as you are in a Perl script, why go out with system and run another Perl interpreter? It is far more complex and inefficient since it involves a syscall or a shell, and starts another program. Also, it is far harder to get it right -- you need to deal with syntax details, quoting and escaping, for system, your system's command interpreter, the other instance of Perl, and the regex.
The code below reads the whole file into an array first, which is fine if files aren't huge. In general it is better to process a file line by line, and how to do what you need in that way is discussed in detail in perlfaq5. See the comment at the end, with the link.
use warnings 'all';
use strict;
# your code ...
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
# Change @lines in-place. See the comment
s/STRING_TO_BE_REPLACED/$somedir/ for @lines;
open $fh, '>', $file or die "Can't open $file for write: $!";
print $fh @lines;
close $fh;
When we open $fh the second time, the handle is closed and re-opened, so there is no need for an explicit close in between. When an existing file is opened for writing ('>') it is clobbered, so this replaces it.
It's more to write but it is better.
Comment on the in-place change to @lines: this uses the fact that when iterating over an array, if we change the loop variable, here $_, the change is made in the original element. The loop variable is like an alias for the array element. perlsyn says:
If any element of LIST is an lvalue, you can modify it by modifying VAR inside the loop. Conversely, if any element of LIST is NOT an lvalue, any attempt to modify that element will fail. In other words, the foreach loop index variable is an implicit alias for each item in the list that you're looping over.
This has the benefit of not copying data and not touching elements that don't change so it is more efficient, potentially a lot more. However, it relies on a subtle property and thus it may be tricky and error prone, so I do not recommend it as a general practice.
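A tiny demonstration of that aliasing, using numbers for clarity:
my @nums = (1, 2, 3);
$_ *= 10 for @nums;    # $_ aliases each element, so the array itself is modified
print "@nums\n";       # prints "10 20 30"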
To copy the array, with modifications, to a new one
my @lines_new;
foreach my $line (@lines) {
$line =~ s{STRING_TO_BE_REPLACED}{$somedir};
push @lines_new, $line;
}
This also changes @lines, because $line aliases its elements. If it needs to be kept intact, do (my $new_line = $line) =~ s/.../ instead. Then write @lines_new to $file. Somewhere in between these two approaches is
@lines = map { s{STRING_TO_BE_REPLACED}{$somedir}; $_ } @lines;
what was posted originally. However, since the map changes elements of @lines and copies data to build the output list, while the whole statement also overwrites the array, on reflection I think it makes more sense to do either the in-place change or an explicit copy to a new array.
In principle it is better not to read the whole file at once but rather to process it line by line, unless the file is small enough. To do that, open the file for reading and a new one for writing; after you copy the file over (with changes), move the new one to replace $file. See the topic in perlfaq5.
The copied file is temporary, to be used to overwrite $file, so it can be named using the core module File::Temp to avoid accidents. To move a file use move from the core module File::Copy.
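Here is a minimal sketch of that line-by-line approach, assuming the same $somedir and a made-up $file (the temporary file name comes from File::Temp):
use warnings;
use strict;
use File::Temp qw(tempfile);
use File::Copy qw(move);

my $file    = 'config.txt';                  # hypothetical input file
my $somedir = 'D:/somedir/someotherdir';

open my $in, '<', $file or die "Can't open $file: $!";
my ($out, $tmpfile) = tempfile();            # filehandle and name of a temporary file

while (my $line = <$in>) {
    $line =~ s{STRING_TO_BE_REPLACED}{$somedir};
    print $out $line;                        # write the (possibly changed) line to the copy
}

close $in;
close $out or die "Can't close $tmpfile: $!";

move($tmpfile, $file) or die "Can't move $tmpfile over $file: $!";   # replace the original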

What happens internally when you have <FH>, <>, or <*> in Perl?

I apologize if this question sounds simple, my intention is to understand in depth how this (these?) particular operator(s) works and I was unable to find a satisfactory description in the perldocs (It probably exists somewhere, I just couldn't find it for the life of me)
Particularly, I am interested in knowing if
a) <>
b) <*> or whatever glob and
c) <FH>
are fundamentally similar or different, and how they are used internally.
I built my own testing functions to gain some insight on this (presented below). I still don't have a full understanding (my understanding might even be wrong) but this is what I've concluded:
<>
In Scalar Context: Reads the next line of the "current file" being read (provided in @ARGV). Questions: This seems like a very particular scenario, and I wonder why it is the way it is and whether it can be generalized or not. Also what is the "current file" that is being read? Is it in a file handle? What is the counter?
In List Context: Reads ALL of the files in @ARGV into an array
<list of globs>
In Scalar Context: Name of the first file found in current folder that matches the glob. Questions: Why the current folder? How do I change this? Is the only way to change this doing something like < /home/* > ?
In List Context: All the files that match the glob in the current folder.
<FH> just seems to return undef when assigned to a variable.
Questions: Why is it undef? Does it not have a type? Does this behave similarly when the FH is not a bareword filehandle?
General Question: What is it that handles the value of <> and the others during execution? In scalar context, is any sort of reference returned, or are the variables that we assign them to, at that point identical to any other non-ref scalar?
I also noticed that even though I am assigning them in sequence, the output is reset each time. i.e. I would have assumed that when I do
$thing_s = <>;
@thing_l = <>;
@thing_l would be missing the first item, since it was already received by $thing_s. Why is this not the case?
Code used for testing:
use strict;
use warnings;
use Switch;
use Data::Dumper;
die "Call with a list of files\n" if (#ARGV<1);
my #whats = ('<>','<* .*>','<FH>');
my $thing_s;
my #thing_l;
for my $what(#whats){
switch($what){
case('<>'){
$thing_s = <>;
@thing_l = <>;
}
case('<* .*>'){
$thing_s = <* .*>;
@thing_l = <* .*>;
}
case('<FH>'){
open FH, '<', $ARGV[0];
$thing_s = <FH>;
@thing_l = <FH>;
}
}
print "$what in scalar context is: \n".Dumper($thing_s)."\n";
print "$what in list context is: \n".Dumper(#thing_l)."\n";
}
The <> thingies are all iterators. All of these variants have common behaviour:
Used in list context, all remaining elements are returned.
Used in scalar context, only the next element is returned.
Used in scalar context, it returns undef once the iterator is exhausted.
These last two properties make it suitable for use as a condition in while loops.
There are two kinds of iterators that can be used with <>:
Filehandles. In this case <$fh> is equivalent to readline $fh.
Globs, so <* .*> is equivalent to glob '* .*'.
The <> is parsed as a readline when it contains either nothing, a bareword, or a simple scalar. More complex expressions can be embedded like <{ ... }>.
It is parsed as a glob in all other cases. This can be made explicit by using quotes: <"* .*"> but you should really be explicit and use the glob function instead.
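A small sketch of how the same angle-bracket syntax dispatches to the two functions (file names are made up):
open my $fh, '<', 'data.txt' or die $!;
my $line = <$fh>;           # simple scalar inside  -> readline($fh)
my @logs = <*.log>;         # anything more complex -> glob('*.log')
my @same = glob('*.log');   # clearer: call glob explicitly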
Some details differ, e.g. where the iterator state is kept:
When reading from a file handle, the file handle holds that iterator state.
When using the glob form, each glob expression has its own state.
Another difference is whether the iterator can restart:
glob restarts after returning one undef.
filehandles can only be restarted by seeking back to the start; not all filehandles support this operation.
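For a seekable handle (a plain file, not a pipe or socket), restarting looks like this sketch (the file name is made up):
use Fcntl qw(SEEK_SET);

open my $fh, '<', 'data.txt' or die $!;
my @all = <$fh>;                                  # exhausts the iterator
seek $fh, 0, SEEK_SET or die "Can't seek: $!";    # rewind to the beginning
my $first_again = <$fh>;                          # reads line 1 again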
If no file handle is used in <>, then this defaults to the special ARGV file handle. The behaviour of <ARGV> is as follows:
If #ARGV is empty, then ARGV is STDIN.
Otherwise, the elements of @ARGV are treated as file names. The following pseudocode is executed:
$ARGV = shift @ARGV;
open ARGV, $ARGV or die ...; # careful! no open mode is used
The $ARGV scalar holds the filename, and the ARGV file handle holds that file handle.
When ARGV reaches eof, the next file from @ARGV is opened.
Only when @ARGV is completely empty can <> return undef.
This can actually be used as a trick to read from many files:
local @ARGV = qw(foo.txt bar.txt baz.txt);
while (<>) {
...;
}
What is it that handles the value of <> and the others during execution?
The Perl compiler is very context-aware, and often has to choose between multiple ambiguous interpretations of a code segment. It will compile <> as a call to readline or to glob depending on what is inside the brackets.
In scalar context, is any sort of reference returned, or are the variables that we assign them to, at that point identical to any other non-ref scalar?
I'm not sure what you're asking here, or why you think the variables that take the result of a <> should be any different from other variables. They are always simple string values: either a filename returned by glob, or some file data returned by readline.
<FH> just seems to return undef when assigned to a variable. Questions: Why is it undef? Does it not have a type? Does this behave similarly when the FH is not a bareword filehandle?
This form will treat FH as a filehandle, and return the next line of data from the file if it is open and not at eof. Otherwise undef is returned, to indicate that nothing valid could be read. Perl is very flexible with types, but undef behaves as its own type, like Ruby's nil. The operator behaves the same whether FH is a global bareword filehandle or a variable that contains a reference to a typeglob.