I've recently been exposed to a bit of Perl code, and some aspects of it are still elusive to me. This is it:
@collection = <*>;
I understand that the at-symbol defines collection as an array. I've also searched around a bit and landed on perldoc, specifically the part about I/O Operators. I found the null filehandle particularly interesting; code follows.
while (<>) {
...
}
On the same topic, I noticed that this syntax is also valid:
while (<*.c>) {
...
}
According to perldoc, it actually calls an internal function that invokes glob in a manner similar to the following code:
open(FOO, "echo *.c | tr -s ' \t\r\f' '\\012\\012\\012\\012'|");
while (<FOO>) {
...
}
Question
What does the less-than, asterisk, more-than (<*>) symbol mentioned on the first line actually do? Is it a reference to an internally open and referenced glob? Would it be a special case, such as the null filehandle? Or can it be something entirely different, like a legacy implementation?
<> (the diamond operator) is used in two different syntaxes.
<*.c>, <*> etc. is shorthand for the glob built-in function. So <*> returns a list of all files and directories in the current directory. (Except those beginning with a dot; use <* .*> for that).
<$fh> is shorthand for calling readline($fh). If no filehandle is specified (<>) the magical *ARGV handle is assumed, which is a list of files specified as command line arguments, or standard input if none are provided. As you mention, the perldoc covers both in detail.
How does Perl distinguish the two? It checks if the thing inside <> is either a bare filehandle or a simple scalar reference to a filehandle (e.g. $fh). Otherwise, it calls glob() instead. This even applies to stuff like <$hash{$key}> or <$x > - it will be interpreted as a call to glob(). If you read the perldoc a bit further on, this is explained - and it's recommended that you use glob() explicitly if you're putting a variable inside <> to avoid these problems.
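A minimal sketch of that syntactic rule, assuming an in-memory filehandle and a made-up hash (`%pattern`) purely for illustration. A simple scalar inside the brackets is read from, while a hash element is treated as a glob pattern; a pattern with no wildcards should come back unchanged, which makes the difference visible:

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for a real file (illustrative data).
my $text = "first line\nsecond line\n";
open my $fh, '<', \$text or die "cannot open in-memory handle: $!";

# Simple scalar inside <>: parsed as readline($fh).
my $line = <$fh>;
chomp $line;

# Anything more complex than a simple scalar: parsed as glob().
# The hash value is interpolated and used as the pattern.
my %pattern = (key => 'no_wildcards_here');
my @globbed = <$pattern{key}>;

print "readline gave: $line\n";
print "glob gave:     @globbed\n";
```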
It collects the names of all files in the current directory, except those beginning with a dot, and saves them to the array @collection. It's the same as:
@collection = glob "*";
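To see the equivalence in practice, here is a small sketch that runs both forms in a scratch directory. The directory and file names are invented for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Cwd qw(getcwd);

# Hypothetical scratch directory so the result is predictable.
my $orig_dir = getcwd();
my $dir      = tempdir(CLEANUP => 1);
chdir $dir or die "chdir failed: $!";

# Create two visible files and one dotfile.
for my $name ('a.txt', 'b.c', '.hidden') {
    open my $out, '>', $name or die "cannot create $name: $!";
    close $out;
}

my @via_brackets = <*>;        # angle-bracket form
my @via_glob     = glob '*';   # explicit function call

print "brackets: @via_brackets\n";
print "glob:     @via_glob\n";

# Leave the scratch directory so it can be cleaned up at exit.
chdir $orig_dir or die "chdir back failed: $!";
```

Both lists should contain the two visible files and omit the dotfile.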
Related
use PDF::Extract;
$pdf=new PDF::Extract( PDFDoc=>"test.pdf");
$i=1;
$i++ while ( $pdf->savePDFExtract( PDFPages=>$i ) );
I am trying to understand the above Perl code. It appears to be instantiating an object from a module. What is the argument in the line that calls the constructor? What does the => mean? Is it a hash argument?
The constructor is called via indirect object syntax, which is discouraged (and usually a sign of old code). It would be better written as:
my $pdf = PDF::Extract->new(...);
The perlobj documentation recommends you avoid indirect object syntax for the following reasons:
First, it can be confusing to read. In the above example, it's not
clear if save is a method provided by the File class or simply a
subroutine that expects a file object as its first argument.
When used with class methods, the problem is even worse. Because Perl
allows subroutine names to be written as barewords, Perl has to guess
whether the bareword after the method is a class name or subroutine
name. In other words, Perl can resolve the syntax as either
File->new( $path, $data ) or new( File( $path, $data ) ).
To answer your second question, the => is known as the fat comma, and perlop has this to say about it:
The => operator (sometimes pronounced "fat comma") is a synonym for
the comma except that it causes a word on its left to be interpreted
as a string if it begins with a letter or underscore and is composed
only of letters, digits and underscores. This includes operands that
might otherwise be interpreted as operators, constants, single number
v-strings or function calls. If in doubt about this behavior, the left
operand can be quoted explicitly.
Otherwise, the => operator behaves exactly as the comma operator or
list argument separator, according to context.
In your example code, the constructor receives a list, just like if you had used a normal comma. In fact, your code is equivalent to this:
my $pdf = PDF::Extract->new('PDFDoc', 'test.pdf');
However, the thing that creates the hash is the assignment on the other side, which may look something like this:
sub new {
my $class = shift;
my %args = @_;
# ...
}
The fat comma isn't used exclusively with hashes (nor is it required to initialize a hash, as I pointed out above), but you will typically see it anywhere there's a key/value association in a list of arguments. The shape of the characters makes it clear that "this is related to that". It also saves some typing of quote characters, which is a nice side benefit.
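As a rough illustration, the sketch below uses a made-up class (`My::Doc`, not the real PDF::Extract) whose constructor turns its argument list into a hash. It shows that the fat comma and the plain comma hand the constructor the same list:

```perl
use strict;
use warnings;

package My::Doc;   # hypothetical class, not PDF::Extract

sub new {
    my $class = shift;
    my %args  = @_;              # the flat list becomes key/value pairs here
    return bless {%args}, $class;
}

package main;

# Both calls pass the same two-element list to new().
my $with_fat_comma = My::Doc->new(PDFDoc => 'test.pdf');
my $with_comma     = My::Doc->new('PDFDoc', 'test.pdf');

print "fat comma: $with_fat_comma->{PDFDoc}\n";
print "comma:     $with_comma->{PDFDoc}\n";
```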
I'm very new to perl, so I'm sure my confusion here is simply due to not understanding perl syntax and how it handles bare words. I'm failing to find good answers to my question online though.
I have code I'm refactoring; it used to look like this:
@month_dirs = <$log_directory/*>;
I changed $log_directory to be loaded with a config file (AppConfig to be exact). Now instead of exporting $log_directory we output $conf which is an AppConfig object. To access loaded variables you usually make a method call to the variable name so I tried ...
@month_dirs = <$conf->log_directory()."/*">
This fails, because I can't make a method call $conf->log_directory in a location where a bareword is expected. Just playing around I tried this instead
$month_directory_command = $conf->log_directory()."/*";
@month_dirs = <$month_directory_command>;
This still fails, silently, without any indication that there is a problem. I tried using a string directly in the diamond, but that fails too; apparently only barewords, not strings, are accepted by the diamond. I'm surprised that I can't use a string at all, since I thought a string could be used in most places a bareword can. Is this simply because most code implements separate logic to accept barewords and strings, but isn't required to?
I can make this work by emulating exactly the original syntax
$month_directory_command = $conf->log_directory();
@month_dirs = <$month_directory_command/*>;
However, this feels ugly to me. I'm also confused why I can do that, but I can't create a bare word with:
$bare_word = $conf->log_directory()/*
or
$month_directory_command = $conf->log_directory();
$bare_word = $month_directory_command/*;
@month_dirs = <$bare_word>;
Why do some variables work for barewords but not others? Why can I use a scalar variable, but not one returned from a method call?
I tried looking up perl syntax on barewords but didn't have much luck describing situations where they are not written directly, but are composed of variables.
I'm hoping someone can help me better understand the bareword syntax here. What defines when I can use a variable as part of a bare word and if I can save it as a variable?
I'd like to figure out a cleaner syntax for using the bareword in my diamond operator if one can be suggested, but more than that I'd like to understand the syntax so I know how to work with barewords in the future. I promise I did try hunting this down ahead of time, but without much luck.
Incidentally, it seems the suggestion is to not use barewords in Perl anyway? Is there some way I should be avoiding barewords in the diamond operator?
You're mistaken that the diamond operator <> only works with barewords:
$ perl -E'say for <"/*">'
/bin
/boot
/dev
...
(In fact, a bareword is just an identifier that doesn't have a sigil and is prohibited by use strict 'subs';, so none of your examples really qualify.)
This:
@month_dirs = <$log_directory/*>;
works because a level of double-quote interpolation is done inside <>, and scalar variables like $log_directory are interpolated.
It's equivalent to:
@month_dirs = glob("$log_directory/*");
This:
@month_dirs = <$conf->log_directory()."/*">
fails because the > in $conf->log_directory() closes the diamond operator prematurely, confusing the parser.
It's parsed as:
<$conf->
(a call to glob) followed by
log_directory()."/*">
which is a syntax error.
This:
$month_directory_command = $conf->log_directory()."/*";
@month_dirs = <$month_directory_command>;
fails because
<$month_directory_command>
is equivalent to
readline($month_directory_command)
and not to
glob("$month_directory_command")
From perldoc perlop:
If what the angle brackets contain is a simple scalar variable (for example, $foo), then that variable contains the name of the filehandle to input from, or its typeglob, or a reference to the same.
[...]
If what's within the angle brackets is neither a filehandle nor a simple scalar variable containing a filehandle name, typeglob, or typeglob reference, it is interpreted as a filename pattern to be globbed, and either a list of filenames or the next filename in the list is returned, depending on context. This distinction is determined on syntactic grounds alone. That means <$x> is always a readline() from an indirect handle, but <$hash{key}> is always a glob().
So you're trying to read from a filehandle ($month_directory_command) that hasn't been opened yet.
Turning on warnings with use warnings 'all'; would have alerted you to this:
readline() on unopened filehandle at foo line 6.
This:
$bare_word = $conf->log_directory()/*;
fails because you're trying to concatenate the result of a method call with a non-quoted string; to concatenate strings, you have to interpolate them into a double quoted string, or use the concatenation operator.
You could do:
$bare_word = $conf->log_directory() . "/*";
@month_dirs = <"$bare_word">;
(although $bare_word isn't a bareword at all, it's a scalar variable.)
Note that:
@month_dirs = <$bare_word>;
(without quotes) would be interpreted as readline, not glob, as explained in perlop above.
In general, though, it would probably be less confusing to use the glob operator directly:
@month_dirs = glob( $conf->log_directory() . "/*" );
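Here is a self-contained sketch of that last form. The `My::Config` class is a made-up stand-in for the AppConfig object, and the month directory names are invented. It also assumes the temporary directory path contains no whitespace or glob metacharacters, since glob would treat those specially:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

package My::Config;   # hypothetical stand-in for the AppConfig object
sub new           { my ($class, %args) = @_; return bless {%args}, $class }
sub log_directory { return $_[0]{log_directory} }

package main;

# Scratch log directory with two invented month subdirectories.
my $log_dir = tempdir(CLEANUP => 1);
mkdir "$log_dir/$_" or die "mkdir failed: $!" for qw(2023-01 2023-02);

my $conf = My::Config->new(log_directory => $log_dir);

# glob() takes an ordinary string, so any expression may build the pattern.
my @month_dirs = glob( $conf->log_directory() . '/*' );

print "$_\n" for @month_dirs;
```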
One of the main reasons to avoid the diamond operator like this is that it has two totally-unrelated meanings. The usual form you find diamond in is
$data = <$fh>;
This acts like a read function; the full (non-symbol) name for this function is readline. This line of source is equivalent to
$data = readline( $fh );
However, your original form given was
@month_dirs = <$log_directory/*>;
which is an entirely different form. This acts like a shell glob, returning a list of filename matches by scanning the filesystem. This form is better written out using the glob function:
@month_dirs = glob( "$log_directory/*" );
Note also that this being a normal function just takes a normal string argument. In this manner, you can use it with any of your provided examples, such as:
@month_dirs = glob( $conf->log_directory()."/*" );
An unquoted pattern can only appear inside the angle brackets <>; the syntax inside them is shell glob syntax rather than Perl syntax.
# wrong - outside quotes or angle brackets, /* is parsed as Perl operators
$bare_word = $month_directory_command/*;
# right - the star is allowed because it is inside a double-quoted string
$bare_word = "$month_directory_command/*";
# the star is allowed simply because it is inside the angle brackets
@month_dirs = <$month_directory_command/*>;
I am using the
my $file = <*.ext>;
function within Perl to test if a file exists (in this case, I need to know if there is a file with the .ext in the current working directory, otherwise I do not proceed) and throw an error if it doesn't. Such as:
my $file = <*.ext>;
if (-e $file) {
# we are good
}
else {
# file not found
}
As you can see I am bringing the input of the <*.ext> in to a scalar $variable, not an @array. This is probably not a good idea, but it's been working for me up until now and I have spent a while figuring out where my code was failing... and this seems to be it.
Seems that when switching directories (I am using "chdir", on a Windows machine) the current working directory switches properly but the input from the glob operator is very undefined and will look in previous directories, or is retaining past values.
I've been able to fix getting this to work by doing
my @file = <*.ext>;
if (-e $file[0]) {
}
and I'm wondering if anybody can explain this, because I've been unable to find where the return value of the glob operator is defined as an array or a scalar (if there is only one file, etc.)
Just trying to learn here to avoid future bugs. This was a requirement that it be in Perl on Windows and not something I regularly have to do, so my experience in this case is very thin. (More of a Python/C++ guy.)
Thanks.
perldoc -f glob explains this behavior
In list context, returns a (possibly empty) list of filename expansions on the value of EXPR such as the standard Unix shell /bin/csh would do. In scalar context, glob iterates through such filename expansions, returning undef when the list is exhausted.
So you are using the iterator version, which should be used with while to loop over the results (all the way until the list is exhausted).
As you clearly want only the first value, you can force list context:
my ($file) = <*.ext>;
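The difference between the two contexts can be seen with a small sketch. It assumes a scratch directory holding two invented files, `one.ext` and `two.ext`:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Cwd qw(getcwd);

# Hypothetical scratch directory with two matching files.
my $orig_dir = getcwd();
my $dir      = tempdir(CLEANUP => 1);
chdir $dir or die "chdir failed: $!";
for my $name ('one.ext', 'two.ext') {
    open my $out, '>', $name or die "cannot create $name: $!";
    close $out;
}

# Scalar context: the same glob operator is evaluated on every pass and
# hands back one name at a time, then undef once the list is exhausted.
my @seen;
for my $pass (1 .. 3) {
    my $file = <*.ext>;
    push @seen, defined $file ? $file : 'undef';
}

# List context: every match is returned at once; keep only the first.
my ($first) = <*.ext>;

print "scalar context passes: @seen\n";
print "list context first:    $first\n";

# Leave the scratch directory so it can be cleaned up at exit.
chdir $orig_dir or die "chdir back failed: $!";
```

The third scalar-context pass yields undef even though the files still exist, which is the kind of stale result described in the question.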
mpapec has already covered the perldoc glob documentation concerning the behavior in scalar context.
However, I'd just like to add that you can simplify your logic by putting the glob directly in the if
if (<*.ext>) {
# we are good
}
else {
# no files found
}
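The -e test is superfluous, as a file wouldn't have been returned by the glob if it didn't exist. Note, however, that a bare glob in a condition is still evaluated in scalar context, so the iterator caveat above applies if the test runs more than once. Additionally, if you want to actually operate on the found files, you can capture them inside the if, which also puts the glob in list context:
if (my @files = <*.ext>) {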
The following file does not compile:
sub s {
return 'foo';
}
sub foo {
my $s = s();
return $s if $s;
return 'baz?';
}
The error from perl -c is:
syntax error at foobar.pl line 5 near "return"
(Might be a runaway multi-line ;; string starting on line 3)
foobar.pl had compilation errors.
But if I replace s() with &s() it works fine. Can you explain why?
The & prefix definitively says you want to call your own function called "s", rather than any built-in with the same name. In this case, it's confusing it for a substitution operator (like $stuff =~ s///;, which can also be written s()()).
Here's a PerlMonks discussion about what the ampersand does.
The problem you have, as has already been pointed out, is that s() is interpreted as the s/// substitution operator. Prefixing the function name with an ampersand is a workaround, although I would not say necessarily the correct one. In perldoc perlsub the following is said about calling subroutines:
NAME(LIST); # & is optional with parentheses.
NAME LIST; # Parentheses optional if predeclared/imported.
&NAME(LIST); # Circumvent prototypes.
&NAME; # Makes current @_ visible to called subroutine.
What the ampersand does here is merely to distinguish between the built-in function and your own.
The "proper" way to deal with this, apart from renaming your subroutine, is to realize what's going on under the surface. When you say
s();
What you are really saying is
CORE::s();
When what you mean is
main::s();
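A short sketch of both workarounds, assuming the sub lives in package main as in the question:

```perl
use strict;
use warnings;

sub s {
    return 'foo';
}

# A plain s() here would be parsed as the start of a substitution operator.
my $with_ampersand  = &s();       # the ampersand forces a user-defined sub call
my $fully_qualified = main::s();  # a package-qualified name is not the s operator

print "$with_ampersand $fully_qualified\n";
```

Renaming the subroutine remains the least surprising fix; the sketch only shows that the two call forms reach the user-defined sub.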
my $s = 's'->();
works too--oddly enough with strict on.
I'm trying to wrap my head around the way Perl handles the parsing of arguments to print.
Why does this
print $fh $stufftowrite
write to the file handle as expected, but
print($fh, $stufftowrite)
writes the file handle to STDOUT instead?
My guess is that it has something to do with the warning in the documentation of print:
Be careful not to follow the print keyword with a left parenthesis unless you want the corresponding right parenthesis to terminate the arguments to the print; put parentheses around all arguments (or interpose a + , but that doesn't look as good).
Should I just get used to the first form (which just doesn't seem right to me, coming from languages that all use parentheses around function arguments), or is there a way to tell Perl to do what I want?
So far I've tried a lot of combination of parentheses around the first, second and both parameters, without success.
On lists
The structure bareword (LIST1), LIST2 means "apply the function bareword to the arguments LIST1", while bareword +(LIST1), LIST2 can, but doesn't necessarily, mean "apply bareword to the arguments of the combined list LIST1, LIST2". This is important for grouping arguments:
my ($a, $b, $c) = (0..2);
print ($a or $b), $c; # print $b
print +($a or $b), $c; # print $b, $c
The prefix + can also be used to distinguish hashrefs from blocks, and functions from barewords, e.g. when subscripting a hash: $hash{shift} returns the element whose key is the string "shift", while $hash{+shift} calls the function shift and returns the hash element keyed by its return value.
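The hash-subscript case can be sketched as follows; the table contents and sub names are invented for the example:

```perl
use strict;
use warnings;

# Hypothetical lookup table; the key 'shift' is an ordinary string here.
my %table = (shift => 'literal key', x => 'value of x');

sub lookup_literal { return $table{shift} }    # bareword: the string 'shift'
sub lookup_arg     { return $table{+shift} }   # unary plus: shift is called on @_

my $literal = lookup_literal('x');
my $by_arg  = lookup_arg('x');

print "literal: $literal\n";
print "by arg:  $by_arg\n";
```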
Indirect syntax
In object oriented Perl, you normally call methods on an object with the arrow syntax:
$object->method(LIST); # call `method` on `$object` with args `LIST`.
However, it is possible, but not recommended, to use an indirect notation that puts the verb first:
method $object (LIST); # the same, but stupid.
Because classes are just instances of themselves (in a syntactic sense), you can also call methods on them. This is why
new Class (ARGS); # bad style, but pretty
is the same as
Class->new(ARGS); # good style, but ugly
However, this can sometimes confuse the parser, so indirect style is not recommended.
But it does hint at what print does:
print $fh ARGS
is the same as
$fh->print(ARGS)
Indeed, the filehandle $fh is treated as an object of the class IO::Handle.
(While this is a valid syntactic explanation, it is not quite true. The source of IO::Handle itself uses the line print $this @_;. The print function is just defined this way.)
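A small sketch of the two spellings side by side, writing to an in-memory handle chosen only for illustration:

```perl
use strict;
use warnings;
use IO::Handle;   # makes the print method available for handles on older perls

my $buffer = '';
open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";

print $fh "builtin form\n";      # builtin print with a filehandle slot
$fh->print("method form\n");     # IO::Handle method call

close $fh or die "close failed: $!";
print $buffer;
```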
Looks like you have a typo. You have put a comma between the file handle and the argument in the second print statement. If you do that, the file handle will be seen as an argument. This seems to apply only to lexical file handles. If done with a global file handle, it will produce the fatal error
No comma allowed after filehandle at ...
So, to be clear, if you absolutely have to have parentheses for your print, do this:
print($fh $stufftowrite)
Although personally I prefer to not use parentheses unless I have to, as they just add clutter.
The Modern Perl book states in Chapter 11 ("What to Avoid"), section "Indirect Notation Scalar Limitations":
Another danger of the syntax is that the parser expects a single scalar expression as the object. Printing to a filehandle stored in an aggregate variable seems obvious, but it is not:
# DOES NOT WORK AS WRITTEN
say $config->{output} 'Fun diagnostic message!';
Perl will attempt to call say on the $config object.
print, close, and say—all builtins which operate on filehandles—operate in an indirect fashion. This was fine when filehandles were package globals, but lexical filehandles (Filehandle References) make the indirect object syntax problems obvious. To solve this, disambiguate the subexpression which produces the intended invocant:
say {$config->{output}} 'Fun diagnostic message!';
Of course, print({$fh} $stufftowrite) is also possible.
It's how the syntax of print is defined. It's really that simple. There's kind of nothing to fix. If you put a comma between the file handle and the rest of the arguments, the expression is parsed as print LIST rather than print FILEHANDLE LIST. Yes, that looks really weird. It is really weird.
The way not to get parsed as print LIST is to supply an expression that can legally be parsed as print FILEHANDLE LIST. If what you're trying to do is get parentheses around the arguments to print to make it look more like an ordinary function call, you can say
print($fh $stufftowrite); # note the lack of comma
You can also say
(print $fh $stufftowrite);
if what you're trying to do is set off the print expression from surrounding code. The key point is that including the comma changes the parse.
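The variants discussed above can be compared in one sketch. The in-memory handle and the `%config` hash are assumptions made for the example; each statement should append the same text to the buffer:

```perl
use strict;
use warnings;

my $buffer = '';
open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";

my $stufftowrite = "hello\n";
my %config       = (output => $fh);   # hypothetical aggregate holding the handle

print $fh $stufftowrite;                  # print FILEHANDLE LIST
print($fh $stufftowrite);                 # same, with parentheses and no comma
print {$fh} $stufftowrite;                # block form, handle given explicitly
print { $config{output} } $stufftowrite;  # block form is needed for a hash element

close $fh or die "close failed: $!";
print "wrote ", length($buffer), " bytes\n";
```

Adding a comma after the handle in any of these would switch the parse to print LIST and send the stringified handle to STDOUT instead.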