Hash argument to constructor in Perl? - perl

use PDF::Extract;
$pdf=new PDF::Extract( PDFDoc=>"test.pdf");
$i=1;
$i++ while ( $pdf->savePDFExtract( PDFPages=>$i ) );
I am trying to understand the above Perl code. It appears to be instantiating an object from a module. What is the argument in the line that calls the constructor? What does the => mean? Is it a hash argument?

The constructor is called via indirect object syntax, which is discouraged (and usually a sign of old code). It would be better written as:
my $pdf = PDF::Extract->new(...);
The perlobj documentation recommends you avoid indirect object syntax for the following reasons:
First, it can be confusing to read. In the above example, it's not
clear if save is a method provided by the File class or simply a
subroutine that expects a file object as its first argument.
When used with class methods, the problem is even worse. Because Perl
allows subroutine names to be written as barewords, Perl has to guess
whether the bareword after the method is a class name or subroutine
name. In other words, Perl can resolve the syntax as either
File->new( $path, $data ) or new( File( $path, $data ) ).
To answer your second question, the => is known as the fat comma, and perlop has this to say about it:
The => operator (sometimes pronounced "fat comma") is a synonym for
the comma except that it causes a word on its left to be interpreted
as a string if it begins with a letter or underscore and is composed
only of letters, digits and underscores. This includes operands that
might otherwise be interpreted as operators, constants, single number
v-strings or function calls. If in doubt about this behavior, the left
operand can be quoted explicitly.
Otherwise, the => operator behaves exactly as the comma operator or
list argument separator, according to context.
In your example code, the constructor receives a list, just like if you had used a normal comma. In fact, your code is equivalent to this:
my $pdf = PDF::Extract->new('PDFDoc', 'test.pdf');
However, the thing that creates the hash is the assignment on the other side, which may look something like this:
sub new {
my $class = shift;
my %args = @_;
# ...
}
The fat comma isn't used exclusively with hashes (nor is it required to initialize a hash, as I pointed out above), but you will typically see it anywhere there's a key/value association in a list of arguments. The shape of the characters makes it clear that "this is related to that". It also saves some typing of quote characters, which is a nice side benefit.
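To make the "list in, hash out" point concrete, here is a minimal sketch. The class name My::Widget is hypothetical and merely stands in for a module such as PDF::Extract; the assumption is only that the constructor copies @_ into a hash, as in the snippet above.

```perl
use strict;
use warnings;

package My::Widget;    # hypothetical class, for illustration only

sub new {
    my $class = shift;
    my %args  = @_;    # the flat argument list becomes key/value pairs here
    my $self  = {%args};
    return bless $self, $class;
}

package main;

# Both calls pass exactly the same two-element list to new().
my $with_fat_comma = My::Widget->new( PDFDoc => 'test.pdf' );
my $with_comma     = My::Widget->new( 'PDFDoc', 'test.pdf' );

print $with_fat_comma->{PDFDoc}, "\n";
print $with_comma->{PDFDoc},     "\n";
```

Both objects end up with the same PDFDoc entry, which is why the fat comma is best read as a readability aid rather than as hash syntax.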

Related

In perl, what does a parenthesized list of '$' mean in a sub declaration?

I have to debug someone else's code and ran across sub declarations that look like this...
sub mysub($$$$) {
<code here>
}
...also...
sub mysub($$$;$) {
<code here>
}
What does the parenthesized list of '$' (with optional ';') mean?
I ran an experiment and it doesn't seem to care if I pass more or fewer args to a sub declared this way than there are '$' in the list. I was thinking that it might be used to disambiguate two different subs with the same name, differing only by the number of args passed to it (as defined by the ($$$$) vs ($$$) vs ($$) etc... ). But that doesn't seem to be it.
That's a Perl subroutine prototype. It's an old-school way of letting the parser know how many arguments to demand. Unless you know what they are going to do for you, I suggest you avoid these for any new code. If you can avoid prototypes, avoid them. They don't gain you as much as you think. There's a newer but experimental way to do it better.
The elements after the ; are optional arguments. So, mysub($$$$) has four mandatory arguments, and mysub($$$;$) has three mandatory arguments and one optional argument.
A little about parsing
Perl lets you be a bit loose about parentheses when you want to specify arguments, so these are the same:
print "Hello World\n";
print( "Hello World\n" );
This is one of Perl's philosophical points. When we can omit boilerplate, we should be able to.
Also, Perl lets you pass as many arguments as you like to a subroutine and you don't have to say anything about parameters ahead of time:
sub some_sub { ... }
some_sub( 1, 2, 3, 4 );
some_sub 1, 2, 3, 4; # same
This is another foundational idea of Perl: we have scalars and lists. Many things work on a list, and we don't care what's in it or how many elements it has.
But, some builtins take a definite number of arguments. The sin takes exactly one argument (but print takes zero to effectively infinity):
print sin 5, 'a'; # -0.958924274663138a (a is from `a`)
The rand takes zero or one:
print rand; # 0.331390818188996
print rand 10; # 4.23956650382937
But then, you can define your own subroutines. Prototypes are a way to mimic that same behavior you see in the builtins (which I think is kinda cool but also not as motivating for production situations).
I tend to use parens in argument lists because I find it's easier for people to see what I intend (although not always with print, I guess):
print sin(5), 'a';
There's one interesting use of prototypes that I like. You can make your own syntax that works like map and grep block forms:
map { ... } @array;
If you want to play around with that (but still not subject maintenance programmers to it), check out Object::Iterate for a demonstration of it.
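As a rough illustration of the block form, the sketch below uses the (&@) prototype so that a sub can be called with a bare block followed by a list. The name apply_to_each is made up for this example and is not taken from Object::Iterate.

```perl
use strict;
use warnings;

# Hypothetical helper: the (&@) prototype lets callers pass a bare block
# first, the same way map and grep accept one.
sub apply_to_each(&@) {
    my ( $code, @items ) = @_;
    return map { $code->($_) } @items;
}

# No "sub" keyword and no comma after the block, just like map.
my @doubled = apply_to_each { $_[0] * 2 } 1, 2, 3;
print "@doubled\n";
```

Without the prototype, the same call would need an explicit sub { ... } followed by a comma.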
Experimental signatures
Perl v5.20 introduced an experimental signatures feature where you can give names to parameters. All of these are required:
use v5.20;
use feature qw(signatures);
sub mysub ( $name, $address, $phone ) { ... }
If you wanted an optional parameter, you can give it a default value:
sub mysub ( $name, $address, $phone = undef ) { ... }
Since this is an experimental feature, it warns whenever you use it. You can turn the warning off though:
no warnings qw(experimental::signatures);
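Putting those pieces together, a small sketch might look like the following. The sub name contact_line and its arguments are invented for the example. Unlike prototypes, signatures check the argument count when the sub is called, so the failure can be caught with a block eval.

```perl
use v5.20;
use feature qw(signatures);
no warnings qw(experimental::signatures);

# Hypothetical sub: two required parameters and one optional one.
sub contact_line ( $name, $address, $phone = undef ) {
    return defined $phone
        ? "$name, $address, $phone"
        : "$name, $address";
}

my $without_phone = contact_line( 'Ada', 'London' );
my $with_phone    = contact_line( 'Ada', 'London', '555-0100' );

# Too few arguments dies at call time rather than being silently accepted.
my $too_few_died = eval { contact_line('Ada'); 1 } ? 0 : 1;

say $without_phone;
say $with_phone;
say "calling with one argument died: $too_few_died";
```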
This is interesting.
I ran an experiment and it doesn't seem to care if I pass more or fewer args to a sub declared this way than there are '$' in the list.
Because, of course, that's exactly what the code's author was trying to enforce.
There are two ways to circumvent the parameter counting that prototypes are supposed to enforce.
Call the subroutine as a method on an object ($my_obj->my_sub(...)) or on a class (MyClass->my_sub(...)).
Call the subroutine using the "old-style" ampersand syntax (&my_sub(...)).
From which we learn:
Don't use prototypes on subroutines that are intended to be used as methods.
Don't use the ampersand syntax for calling subroutines.
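The ampersand bypass can be seen in a short sketch. The sub count_args is hypothetical. A normal call with the wrong number of arguments is rejected at compile time, so it is wrapped in a string eval here purely to capture the error.

```perl
use strict;
use warnings;

# Hypothetical sub with a two-scalar prototype.
sub count_args($$) { return scalar @_ }

# A normal call with three arguments fails to compile; the string eval
# turns that compile-time failure into a value we can inspect.
my $compiled = eval q{ count_args( 1, 2, 3 ); 1 };
my $error    = $compiled ? '' : $@;

# The ampersand form skips the prototype check entirely.
my $bypassed = &count_args( 1, 2, 3 );

print "normal call: $error";
print "ampersand call saw $bypassed arguments\n";
```

A method call such as MyClass->count_args(...) skips the check in the same way, because the sub to run is only resolved at run time.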

Confusion on syntax of diamond operator in parsing and barewords

I'm very new to perl, so I'm sure my confusion here is simply due to not understanding perl syntax and how it handles bare words. I'm failing to find good answers to my question online though.
I had code I'm refactoring; it used to look like this
@month_dirs = <$log_directory/*>;
I changed $log_directory to be loaded with a config file (AppConfig to be exact). Now instead of exporting $log_directory we output $conf which is an AppConfig object. To access loaded variables you usually make a method call to the variable name so I tried ...
@month_dirs = <$conf->log_directory()."/*">
This fails, because I can't make a method call $conf->log_directory in a location where a bareword is expected. Just playing around I tried this instead
$month_directory_command = $conf->log_directory()."/*";
@month_dirs = <$month_directory_command>;
This still fails, silently, without any indicator that this is a problem. I tried using a string directly in the diamond but it fails; apparently only barewords, not strings, are accepted by the diamond. I'm surprised by that, since I thought that most places where a bareword could be used would accept a string instead. Is this simply because most code implements separate logic to accept barewords vs strings, but is not required to be implemented this way?
I can make this work by emulating exactly the original syntax
$month_directory_command = $conf->log_directory();
@month_dirs = <$month_directory_command/*>;
However, this feels ugly to me. I'm also confused why I can do that, but I can't create a bare word with:
$bare_word = $conf->log_directory()/*
or
$month_directory_command = $conf->log_directory();
$bare_word = $month_directory_command/*;
@month_dirs = <$bare_word>;
Why do some variables work for bare words but not others? Why can I use a scalar variable but not if it's returned from a method call?
I tried looking up perl syntax on barewords but didn't have much luck describing situations where they are not written directly, but are composed of variables.
I'm hoping someone can help me better understand the bareword syntax here. What defines when I can use a variable as part of a bare word and if I can save it as a variable?
I'd like to figure out a cleaner syntax for using the bareword in my diamond operator if one can be suggested, but more than that I'd like to understand the syntax so I know how to work with barewords in the future. I promise I did try hunting this down ahead of time, but without much luck.
Incidentally, it seems the suggestion is to not use barewords in perl anyways? Is there some way I should avoid barewords in the diamond operator?
You're mistaken that the diamond operator <> only works with barewords:
$ perl -E'say for <"/*">'
/bin
/boot
/dev
...
(In fact, a bareword is just an identifier that doesn't have a sigil and is prohibited by use strict 'subs';, so none of your examples really qualify.)
This:
@month_dirs = <$log_directory/*>;
works because a level of double-quote interpolation is done inside <>, and scalar variables like $log_directory are interpolated.
It's equivalent to:
@month_dirs = glob("$log_directory/*");
This:
@month_dirs = <$conf->log_directory()."/*">
fails because the > in $conf->log_directory() closes the diamond operator prematurely, confusing the parser.
It's parsed as:
<$conf->
(a call to glob) followed by
log_directory()."/*">
which is a syntax error.
This:
$month_directory_command = $conf->log_directory()."/*";
@month_dirs = <$month_directory_command>;
fails because
<$month_directory_command>
is equivalent to
readline($month_directory_command)
and not to
glob("$month_directory_command")
From perldoc perlop:
If what the angle brackets contain is a simple scalar variable (for example, $foo), then that variable contains the name of the filehandle to input from, or its typeglob, or a reference to the same.
[...]
If what's within the angle brackets is neither a filehandle nor a simple scalar variable containing a filehandle name, typeglob, or typeglob reference, it is interpreted as a filename pattern to be globbed, and either a list of filenames or the next filename in the list is returned, depending on context. This distinction is determined on syntactic grounds alone. That means <$x> is always a readline() from an indirect handle, but <$hash{key}> is always a glob().
So you're trying to read from a filehandle ($month_directory_command) that hasn't been opened yet.
Turning on warnings with use warnings 'all'; would have alerted you to this:
readline() on unopened filehandle at foo line 6.
This:
$bare_word = $conf->log_directory()/*;
fails because you're trying to concatenate the result of a method call with a non-quoted string; to concatenate strings, you have to interpolate them into a double quoted string, or use the concatenation operator.
You could do:
$bare_word = $conf->log_directory() . "/*";
@month_dirs = <"$bare_word">;
(although $bare_word isn't a bareword at all, it's a scalar variable.)
Note that:
@month_dirs = <$bare_word>;
(without quotes) would be interpreted as readline, not glob, as explained in perlop above.
In general, though, it would probably be less confusing to use the glob operator directly:
@month_dirs = glob( $conf->log_directory() . "/*" );
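For anyone who wants to try the explicit glob form in isolation, here is a self-contained sketch. It creates a temporary directory with File::Temp as a stand-in for the log directory, and the month names are invented. One assumption to be aware of: glob splits its pattern on whitespace, so this relies on the temporary path containing no spaces.

```perl
use strict;
use warnings;
use File::Temp     qw(tempdir);
use File::Basename qw(basename);

# Stand-in for the configured log directory; removed when the script exits.
my $log_directory = tempdir( CLEANUP => 1 );

# Create two hypothetical month subdirectories to match against.
for my $month (qw(2024-01 2024-02)) {
    mkdir "$log_directory/$month" or die "mkdir failed: $!";
}

# glob takes an ordinary string, so any expression can build the pattern.
my @month_dirs = glob( $log_directory . "/*" );

print basename($_), "\n" for @month_dirs;
```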
One of the main reasons to avoid the diamond operator like this is that it has two totally-unrelated meanings. The usual form you find diamond in is
$data = <$fh>;
This acts like a read function; the full (non-symbol) name for this function is readline. This line of source is equivalent to
$data = readline( $fh );
However, your original form given was
@month_dirs = <$log_directory/*>;
which is an entirely different form. This acts like a shell glob, returning a list of filename matches by scanning the filesystem. This form is better written out using the glob function:
@month_dirs = glob( "$log_directory/*" );
Note also that this being a normal function just takes a normal string argument. In this manner, you can use it with any of your provided examples, such as:
@month_dirs = glob( $conf->log_directory()."/*" );
A bareword can only appear inside the angle brackets <>; the syntax inside them is shell syntax rather than Perl syntax.
# wrong - the star is outside any quotes
$bare_word = $month_directory_command/*;
# right - star is allowed because it is inside a quoted string (double quotes here, so the variable still interpolates)
$bare_word = "$month_directory_command/*";
# star is allowed simply because it is inside the bracket
@month_dirs = <$month_directory_command/*>;

Meaning of the <*> symbol

I've recently been exposed to a bit of Perl code, and some aspects of it are still elusive to me. This is it:
@collection = <*>;
I understand that the at-symbol defines collection as an array. I've also searched around a bit, and landed on perldoc, specifically at the part about I/O Operators. I found the null filehandle specifically interesting; code follows.
while (<>) {
...
}
On the same topic I have also noticed that this syntax is also valid:
while (<*.c>) {
...
}
According to perldoc, it is actually calling an internal function that invokes glob in a manner similar to the following code:
open(FOO, "echo *.c | tr -s ' \t\r\f' '\\012\\012\\012\\012'|");
while (<FOO>) {
...
}
Question
What does the less-than, asterisk, more-than (<*>) symbol mentioned on the first line actually do? Is it a reference to an internally open and referenced glob? Would it be a special case, such as the null filehandle? Or can it be something entirely different, like a legacy implementation?
<> (the diamond operator) is used in two different syntaxes.
<*.c>, <*> etc. is shorthand for the glob built-in function. So <*> returns a list of all files and directories in the current directory. (Except those beginning with a dot; use <* .*> for that).
<$fh> is shorthand for calling readline($fh). If no filehandle is specified (<>) the magical *ARGV handle is assumed, which is a list of files specified as command line arguments, or standard input if none are provided. As you mention, the perldoc covers both in detail.
How does Perl distinguish the two? It checks if the thing inside <> is either a bare filehandle or a simple scalar reference to a filehandle (e.g. $fh). Otherwise, it calls glob() instead. This even applies to stuff like <$hash{$key}> or <$x > - it will be interpreted as a call to glob(). If you read the perldoc a bit further on, this is explained - and it's recommended that you use glob() explicitly if you're putting a variable inside <> to avoid these problems.
It collects all filenames in the current directory, except those beginning with a dot, and saves them to the array collection. It's the same as:
@collection = glob "*";

How to tell perl to print to a file handle instead of printing the file handle?

I'm trying to wrap my head around the way Perl handles the parsing of arguments to print.
Why does this
print $fh $stufftowrite
write to the file handle as expected, but
print($fh, $stufftowrite)
writes the file handle to STDOUT instead?
My guess is that it has something to do with the warning in the documentation of print:
Be careful not to follow the print keyword with a left parenthesis unless you want the corresponding right parenthesis to terminate the arguments to the print; put parentheses around all arguments (or interpose a + , but that doesn't look as good).
Should I just get used to the first form (which just doesn't seem right to me, coming from languages that all use parentheses around function arguments), or is there a way to tell Perl to do what I want?
So far I've tried a lot of combination of parentheses around the first, second and both parameters, without success.
On lists
The structure bareword (LIST1), LIST2 means "apply the function bareword to the arguments LIST1", while bareword +(LIST1), LIST2 can, but doesn't necessarily, mean "apply bareword to the arguments of the combined list LIST1, LIST2". This is important for grouping arguments:
my ($a, $b, $c) = (0..2);
print ($a or $b), $c; # print $b
print +($a or $b), $c; # print $b, $c
The prefix + can also be used to distinguish hashrefs from blocks, and functions from barewords, e.g. when subscripting a hash: $hash{shift} returns the element whose key is the string 'shift', while $hash{+shift} calls the function shift and returns the hash element keyed by its return value.
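The difference between the two subscripts is easy to check with a tiny sketch. The hash contents and sub names below are invented for the demonstration.

```perl
use strict;
use warnings;

my %hash = (
    shift => 'found via the literal key',
    apple => 'found via the shifted argument',
);

# {shift} is treated as the string 'shift'; the argument is ignored.
sub lookup_literal { return $hash{shift} }

# {+shift} calls shift on @_ and uses the result as the key.
sub lookup_shifted { return $hash{+shift} }

my $literal = lookup_literal('apple');
my $shifted = lookup_shifted('apple');

print "$literal\n";
print "$shifted\n";
```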
Indirect syntax
In object oriented Perl, you normally call methods on an object with the arrow syntax:
$object->method(LIST); # call `method` on `$object` with args `LIST`.
However, it is possible, but not recommended, to use an indirect notation that puts the verb first:
method $object (LIST); # the same, but stupid.
Because classes are just instances of themselves (in a syntactic sense), you can also call methods on them. This is why
new Class (ARGS); # bad style, but pretty
is the same as
Class->new(ARGS); # good style, but ugly
However, this can sometimes confuse the parser, so indirect style is not recommended.
But it does hint on what print does:
print $fh ARGS
is the same as
$fh->print(ARGS)
Indeed, the filehandle $fh is treated as an object of the class IO::Handle.
(While this is a valid syntactic explanation, it is not quite true. The source of IO::Handle itself uses the line print $this @_;. The print function is just defined this way.)
Looks like you have a typo. You have put a comma between the file handle and the argument in the second print statement. If you do that, the file handle will be seen as an argument. This seems to apply only to lexical file handles. If done with a global file handle, it will produce the fatal error
No comma allowed after filehandle at ...
So, to be clear, if you absolutely have to have parentheses for your print, do this:
print($fh $stufftowrite)
Although personally I prefer to not use parentheses unless I have to, as they just add clutter.
The Modern Perl book states in Chapter 11 ("What to Avoid"), section "Indirect Notation Scalar Limitations":
Another danger of the syntax is that the parser expects a single scalar expression as the object. Printing to a filehandle stored in an aggregate variable seems obvious, but it is not:
# DOES NOT WORK AS WRITTEN
say $config->{output} 'Fun diagnostic message!';
Perl will attempt to call say on the $config object.
print, close, and say—all builtins which operate on filehandles—operate in an indirect fashion. This was fine when filehandles were package globals, but lexical filehandles (Filehandle References) make the indirect object syntax problems obvious. To solve this, disambiguate the subexpression which produces the intended invocant:
say {$config->{output}} 'Fun diagnostic message!';
Of course, print({$fh} $stufftowrite) is also possible.
It's how the syntax of print is defined. It's really that simple. There's kind of nothing to fix. If you put a comma between the file handle and the rest of the arguments, the expression is parsed as print LIST rather than print FILEHANDLE LIST. Yes, that looks really weird. It is really weird.
The way not to get parsed as print LIST is to supply an expression that can legally be parsed as print FILEHANDLE LIST. If what you're trying to do is get parentheses around the arguments to print to make it look more like an ordinary function call, you can say
print($fh $stufftowrite); # note the lack of comma
You can also say
(print $fh $stufftowrite);
if what you're trying to do is set off the print expression from surrounding code. The key point is that including the comma changes the parse.
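The forms discussed in these answers can be compared side by side with an in-memory filehandle, which keeps the sketch self-contained. The variable names and the %config hash are illustrative assumptions, not part of the original question.

```perl
use strict;
use warnings;

# Write into a string instead of a real file so the result can be inspected.
my $buffer = '';
open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";

my $stufftowrite = "line\n";

print $fh $stufftowrite;         # print FILEHANDLE LIST
print( $fh $stufftowrite );      # same, with parentheses and no comma
print {$fh} $stufftowrite;       # block form makes the handle explicit

# A handle stored in an aggregate needs the block form.
my %config = ( output => $fh );
print { $config{output} } $stufftowrite;

close $fh or die "Cannot close in-memory handle: $!";

print "wrote ", length($buffer), " characters\n";
```

All four statements write to the handle; only adding a comma after $fh would turn the handle into an ordinary list element.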

Why does my Perl print show HASH(0x100a2d018)?

Here I am thinking I know how to use lists in Perl, when this happens. If I do this (debugging code, prettiness not included):
#! /usr/bin/perl -w
use strict;
my $temp1 = "FOOBAR";
my $temp2 = "BARFOO!";
my @list = { $temp1, $temp2 };
print $temp1; #this works fine
print $list[0]; #this prints out HASH(0x100a2d018)
It looks like I am printing out the address of the second string. How do I get at the actual string stored inside the list? I assume it has something to do with references, but dunno for sure.
my @list = { $temp1, $temp2 };
should be
my @list = ( $temp1, $temp2 ); # Parentheses instead of curly braces.
What your original code did was store a reference to a hash {$temp1 => $temp2} into @list's first element ($list[0]). This is a perfectly valid thing to do (which is why you didn't get a syntax error), it's just not what you intended to do.
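A short sketch makes the difference visible. It reuses the variable names from the question and adds two arrays, @braces and @parens, purely for comparison.

```perl
use strict;
use warnings;

my $temp1 = "FOOBAR";
my $temp2 = "BARFOO!";

# Curly braces build an anonymous hash; the array holds one reference to it.
my @braces = { $temp1, $temp2 };

# Parentheses build a plain list; the array holds the two strings.
my @parens = ( $temp1, $temp2 );

print scalar(@braces), " element, of type ", ref( $braces[0] ), "\n";
print "value stored under $temp1: ", $braces[0]->{$temp1}, "\n";
print scalar(@parens), " elements, first is $parens[0]\n";
```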
You already got the answer to your question, don't use {}, because that creates an anonymous hash reference.
However, there is still the matter of the question you didn't know you asked.
What is the difference between an array and a list in Perl?
In your question, you use the term 'list' as if it were interchangeable with the term array, but the terms are not interchangeable. It is important to understand what the difference is.
An array is a type of variable. You can assign values to it. You can take references to it.
A list is an ordered group of zero or more scalars that is created when an expression is evaluated in a list context.
Say what?
Ok, consider the case of my $foo = (1,2,3). Here $foo is a scalar, and so the expression (1,2,3) is evaluated in a scalar context.
On the surface it is easy to look at (1,2,3) and say that's a literal list. But it is not.
It is a group of literal values strung together using the comma operator. In a scalar context, the comma operator returns the right hand value, so we really have ((1 ,2),3) which becomes ((2),3) and finally 3.
Now my @foo = (1,2,3) is very different. Assignment into an array occurs in a list context, so we evaluate (1,2,3) in list context. Here the comma operator inserts both sides into the list. So we have ((1,2),3) which evaluates to (list_of(1,2),3) and then list_of(list_of(1,2),3), since Perl flattens lists, this becomes list_of(1,2,3). The resulting list is assigned into @foo. Note that there is no list_of operator in Perl, I am trying to differentiate between what is commonly thought of as a literal list and an actual list. Since there is no way to directly express an actual list in Perl, I had to make one up.
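The effect of context can be observed directly. The sketch below uses the values 4, 5, 6 instead of 1, 2, 3 so that the element count and the last value differ; the variable names are illustrative only.

```perl
use strict;
use warnings;

my @array = ( 4, 5, 6 );

my $count   = @array;    # array in scalar context: number of elements
my ($first) = @array;    # list assignment: first element of the list

my $last = do {
    no warnings 'void';  # silence the void-context warning for this demo
    ( 4, 5, 6 );         # comma operator in scalar context: rightmost value
};

print "count=$count first=$first last=$last\n";
```

The array reports how many elements it has, while the parenthesized expression in scalar context simply yields its rightmost value.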
So, what does all this mean to someone who is just learning Perl? I'll boil it down to a couple of key points:
Learn about and pay attention to context.
Remember that your array variables are arrays and not lists.
Don't worry too much if this stuff seems confusing.
DWIM does, mostly--most of the time the right things will happen without worrying about the details.
While you are pondering issues of context, you might want to look at these links:
Start with the discussion of context in Programming Perl. Larry et alia explain it all much more clearly than I do.
Perlop means something entirely different when you pay attention to what each operator returns based on context.
A nice discussion of scalar and context on Perlmonks.
A short introductory article about context: Context is Everything.
MJD explains context.
The perldoc for scalar and wantarray