A better variable naming system? - perl

I'm new to programming. The task is to extract particular data from a string, and I chose to write the code as follows:
while ($line = <IN>) {
    chomp $line;
    @tmp = (split /\t/, $line);
    next if ($tmp[0] !~ /ch/);
    @tgt1 = @tmp[8..11];
    @tgt2 = @tmp[12..14];
    @tgt3 = @tmp[15..17];
    @tgt4 = @tmp[18..21];
    foreach (1..4) {
        print @tgt($_), "\n";
    }
}
I thought @tgt($_) would be interpreted as @tgt1, @tgt2, @tgt3, @tgt4, but I still get the error message that @tgt is a global symbol (@tgt1, @tgt2, @tgt3 and @tgt4 have been declared).
Q1. Did I misunderstand foreach loop?
Q2. Why couldn't perl see @tgt($_) as @tgt1, @tgt2, etc.?
Q3. From experience, this is probably a bad way to name variables. What would be a preferred way to name variables that have similar features?

Q2. Why couldn't perl see @tgt($_) as @tgt1, @tgt2, etc.?
Q3. From experience, this is probably a bad way to name variables. What would be a preferred way to name variables that have similar features?
I'll answer both together.
@tgt($_) does NOT mean what you hope it means
First off, it's invalid syntax (you can't use () after an array name; the perl interpreter will produce a compile error).
What you're trying to do is access distinct variables via an expression that evaluates to the variable's name (aka symbolic references). This IS possible, but it is typically a bad idea and poor-style Perl (as in, you CAN but you SHOULD NOT do it without a very, very good reason).
To access variable number $_ the way you tried, you would use the @{"tgt$_"} syntax (it works only for package variables and is rejected under use strict 'refs'). But I repeat: Do Not Do That, even if you can.
A correct idiomatic solution: use an array of arrayrefs, with your 1-4 (or rather 0-3) indexing the outer array:
# Old bad code: @tgt1 = @tmp[8..11];
# New correct code:
$tgt[0] = [ @tmp[8..11] ]; # [] creates an array reference from a list.
# etc... repeat 4 times - you can even do it in a smart loop later.
What this does is store a reference to a copy of the array slice in the zeroth element of a single @tgt array.
At the end, the @tgt array has 4 elements, each a reference to an array containing one of the slices.
Q1. Did I misunderstand foreach loop?
Your foreach loop (as opposed to its contents - see above) was correct, with one style caveat - again, while you CAN use a default $_ variable, you should almost never use it, instead always use named variables for readability.
You print the abovementioned array of arrayrefs as follows (ask separately if any of the syntax is unclear - this is a mid-level data structure handling, not for beginners):
foreach my $index (0..3) {
    print join(",", @{ $tgt[$index] }) . "\n";
}
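The "smart loop" mentioned above could look something like the following. This is only a sketch: the sample record and the @ranges list are made up for illustration (the column ranges are the ones from the question), and the real code would fill @tmp from each input line instead.

```perl
use strict;
use warnings;

# Hypothetical tab-separated record with 22 fields: chr1, f1 .. f21
my $line = join "\t", 'chr1', map { "f$_" } 1 .. 21;
my @tmp  = split /\t/, $line;

# Column ranges taken from the question (first and last index of each slice)
my @ranges = ( [ 8, 11 ], [ 12, 14 ], [ 15, 17 ], [ 18, 21 ] );

# Build one array of array references instead of @tgt1 .. @tgt4
my @tgt;
for my $range (@ranges) {
    my ( $from, $to ) = @$range;
    push @tgt, [ @tmp[ $from .. $to ] ];
}

# Print each slice on its own line, comma-separated
for my $index ( 0 .. $#tgt ) {
    print join( ",", @{ $tgt[$index] } ), "\n";
}
```

Using $#tgt as the upper bound means the printing loop keeps working if a range is added or removed later.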

How to avoid input modification in PDL subroutines

I would like to prevent the assignment operator .= from modifying the caller's input when it is used inside a subroutine.
One way to avoid this is to perform a copy of the input inside the subroutine. Is this the best way to proceed? Are there other solutions?
use PDL; use strict;
my $a = pdl(1);
f_0($a); print "$a\n";
f_1($a); print "$a\n";
sub f_0 {
    my ($input) = @_;
    my $x = $input->copy;
    $x .= 0;
}
sub f_1 {
    my ($input) = @_;
    $input .= 0;
}
In my case (perl 5.22.1), running the script above prints 1 and then 0 on two lines: f_0 does not modify the user's input in place, while f_1 does.
According to PDL FAQ 6.17, "What happens when I have several references to the same PDL object in different variables":
Piddles behave like Perl references in many respects. So when you say
$a = pdl [0,1,2,3]; $b = $a;
then both $b and $a point to the same
object, e.g. then saying
$b++;
will not create a copy of the original piddle but just
increment in place
[...]
It is important to keep the "reference nature" of piddles in mind when
passing piddles into subroutines. If you modify the input piddles you
modify the original argument, not a copy of it. This is different from
some other array processing languages but makes for very efficient
passing of piddles between subroutines. If you do not want to modify
the original argument but rather a copy of it just create a copy
explicitly...
So yes, to avoid modification of the original, create a copy as you did:
my $x = $input->copy;
or alternatively:
my $x = pdl( $input );
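Since the FAQ compares piddles to Perl references, the same effect can be seen without PDL at all. The sketch below uses a plain array reference as a stand-in for a piddle; the sub names are hypothetical and nothing here is PDL API.

```perl
use strict;
use warnings;

# Modifies the caller's data: $input points at the same array.
sub zero_in_place {
    my ($input) = @_;
    for my $element (@$input) {
        $element = 0;
    }
    return;
}

# Leaves the caller's data alone: works on a (shallow) copy.
sub zero_copy {
    my ($input) = @_;
    my @copy = @$input;
    for my $element (@copy) {
        $element = 0;
    }
    return \@copy;
}

my $data   = [ 1, 2, 3 ];
my $zeroed = zero_copy($data);
print "after zero_copy:     @$data\n";    # still 1 2 3
zero_in_place($data);
print "after zero_in_place: @$data\n";    # now 0 0 0
```

With PDL, the explicit copy is the $input->copy call quoted above; the plain-Perl version is only meant to show why the copy is needed.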

Perl subroutines

I have fixed most of my mistakes, thank you all. Is there any other advice on my hash at this point? How can I clean each word and put the word and its frequency in a hash, excluding the empty words? I think my code makes sense now.
So that you can focus on the key part of the algorithm, how about accepting input on STDIN and writing output to STDOUT? That way there's no argument checking, etc. Just a simple:
$ prog < words.txt
All you really need is a very simple algorithm:
Read a line
Split it into words
Record a count of the word
When done, display the counts
Here's a sample program
#! /usr/bin/perl -w
use strict;
my (%data);
while (<STDIN>) {
    chomp;
    my (@words) = split(/\s+/);
    foreach my $word (@words) {
        if (!defined($data{$word})) {
            $data{$word} = 0;
        }
        $data{$word}++;
    }
}
foreach (sort(keys(%data))) {
    print "$_: $data{$_}\n";
}
Once you understand this and have it working in your environment, you can extend it to meet your other requirements:
remove non-alphabetic characters from each word
print three results per line
use input and output files
put the algorithm into a subroutine
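As a rough sketch of two of these extensions (stripping non-alphabetic characters and moving the algorithm into a subroutine), something like the following might do. The sub name count_words and the sample lines are made up for illustration; a real program would pass in lines read from STDIN or a file.

```perl
use strict;
use warnings;

# Count the words in a list of lines; returns a reference to a hash
# mapping each cleaned word to its frequency.
sub count_words {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # split ' ' skips leading whitespace, so no empty first field
        for my $word ( split ' ', $line ) {
            ( my $clean = $word ) =~ s/[^A-Za-z]//g;   # strip non-alphabetic characters
            next if $clean eq '';                      # skip words that became empty
            $count{$clean}++;
        }
    }
    return \%count;
}

my $count = count_words( "The cat, the hat!", "A cat -- 42" );
for my $word ( sort keys %$count ) {
    print "$word: $count->{$word}\n";
}
```

Words are counted case-sensitively here, so "The" and "the" are separate entries; lowercasing with lc before counting would merge them if that is what is wanted.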
I agree that starting with dave's answer would be more productive, but if you are interested in your mistakes, here is what I see:
You assign the return value of checkArgs to a scalar variable $checkArgs, but return an array. This means that $checkArgs will always contain 2 (the size of the array) after this call, because the program dies if the number of arguments is not 2. It does no great harm since you do not use the value later, but why do you need it at all in this case?
You open files and close them immediately without reading from them. Does not make sense.
The statement
while (<>)
reads either from standard input or from all files named in the command line arguments. The latter variant is close to what you want, but your second argument is the output file, not an input file, and the diamond operator will try to read from it too. You have two options: a) use only one file name in the command line arguments, read the file with <>, write to standard output, and redirect the output to a file in the shell; b) use
while(<$file1>)
instead (before closing the files, of course). Option a) is the traditional Unix and Perl style, but b) makes for clearer code for beginners.
Statements
return $word;
and
return $str, $hash{$str};
return the corresponding values on the first iteration of their loops, so all the remaining data is never processed. In the first case, you should create a local array, store every $word in it and return the array as a whole. In the second case, you already have such a local %hash, so it is enough to return this hash. In both cases, you should assign the return values of the functions not to scalars, but to an array and a hash respectively. As it stands, you actually lose all your data.
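A small sketch of that advice, using hypothetical subs get_words and tally_words rather than the asker's original code: collect inside the loop, return after it, and assign the results to an array and a hash.

```perl
use strict;
use warnings;

# Returns every word, not just the first one.
sub get_words {
    my ($text) = @_;
    my @words;
    for my $word ( split ' ', $text ) {
        push @words, $word;    # collect; do not return inside the loop
    }
    return @words;             # return the whole list after the loop
}

# Returns the complete hash as a flat list of key/value pairs.
sub tally_words {
    my (@words) = @_;
    my %hash;
    for my $word (@words) {
        $hash{$word}++;
    }
    return %hash;
}

my @words = get_words("to be or not to be");
my %hash  = tally_words(@words);

# Assigning to a scalar instead gives only the number of elements.
my $count = get_words("to be or not to be");

print "words: @words\n";
print "to=$hash{to} be=$hash{be}\n";
print "scalar assignment got: $count\n";
```

For large data sets, returning references (\@words, \%hash) avoids copying, at the cost of dereferencing at the call site.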

Perl: Anonymous Multi-Dimensional Arrays

NOTE: See the end of this post for final explanation.
This is probably a very basic question, but I'm still trying to master a few of the fundamentals regarding references in Perl, and came across something in the perldsc page that I'd like to confirm. The following code is in the Generating Array of Arrays section:
while ( <> ) {
    push @AoA, [ split ];
}
Obviously, the <> operator in the while loop reads one line of input at a time. I am assuming at this point that the line is then put into an anonymous array via the [ ] brackets; we'll call this @zero. Then the split command places everything in a given line, separated by whitespace, within the array (e.g., the first word is assigned to $zero[0], the second to $zero[1] and so on). The scalar reference of @zero is then pushed onto @AoA.
The next line of input is passed via the <> operator and gets assigned to a completely new anonymous array (e.g. @one), and its scalar reference is pushed onto @AoA.
Once @AoA is populated, I could then access its contents with a nested foreach loop: the first iterating through the "rows" (e.g. for $row (@AoA)), and a second, inner foreach loop to access the columns of that particular row.
The latter (accessing said "columns") would be done by dereferencing (e.g., for $column (@$row)) the particular $row being read by the "outer" foreach loop.
Is my understanding correct? I'm assuming you could still access any element of @AoA just as you would if the inner arrays were named rather than anonymous? That is, $element = $AoA[8][1];
I want to verify my thought process here. Is the automatic creation of a unique, anonymous array each time through the loop part of autovivification in Perl? I guess that is what is throwing me off a bit. Thanks.
EDIT: Based on the comments below my understanding regarding the anonymous array is still unclear, so I want to take a shot at one more description to see if it meets everyone's understanding.
Starting with the push @AoA, [split]; statement, split takes in the line from $_ and returns a list parsed by whitespace. That list is captured by [ ], which then returns an array reference. That array reference (created by [ ]) is then pushed onto @AoA. Is this accurate re: [ ]? The next step (dereferencing / use of @AoA) was covered very well by @krico below.
FINAL ANSWER/EXPLANATION: Based on all of the comments / feedback here, some further research on my part, and testing, it seems my understanding was correct. I'll break it down here so others can easily reference it later. See @krico's response below for a more explicit code representation that follows the steps outlined here.
while ( <> ) {
    push @AoA, [ split ];
}
The <> operator reads one line of input at a time.
The split function takes that line in via $_ and parses it based on whitespace (the default).
split then returns a LIST.
The [ ] creates an anonymous array holding the list returned by split, and returns a reference to it.
The push @AoA pushes the reference to the anonymous array onto the end of @AoA, so the first one becomes element $AoA[0] (the second anonymous array reference will be put into $AoA[1], etc.).
This continues through the entire input file. Once completed, @AoA is a 2D array, holding references (scalar values) to each of the previously generated anonymous arrays.
From this point @AoA can be dereferenced appropriately to work with the underlying elements taken in from the input file. The default dereferencing technique is CIRCUMFIX (see perlref below); however, as of 5.19 a new method of dereferencing, POSTFIX, is available and will be released in 5.20. Articles are linked below.
References: Perl References Documentation, Perl References Tutorial, Perl References Question noted by @Eli Hubert, Mike Friedman's blog post about differences between arrays and lists, Upcoming Postfix dereferencing in Perl, and Postfix dereferencing Article
This is what is going on:
The <> will put the line into the default variable $_
The split function will read $_ and return a list
The [ ] brackets will return a scalar holding a reference to a new array containing that list
That reference is then pushed onto the @AoA array
When you do $AoA[8][2] you are implicitly dereferencing the scalar. It's the same as $AoA[8]->[2].
Here is the same code written out more explicitly, which should make it easier to follow.
my $line;
while ( $line = <STDIN> ) {
    my @parts = split ' ', $line;   # split the line on whitespace
    my $partsRef = \@parts;         # take a reference to the array
    push @AoA, $partsRef;
}
Now, if you wanted to print the 2nd part of the 5th line you could write:
my $ref = $AoA[4];
my @parts = @$ref;
print $parts[1];
Get it?
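To tie the steps together, here is a small self-contained sketch of building and walking such an array of arrays. The @input lines are made-up stand-ins for what <> would read; everything else follows the description above.

```perl
use strict;
use warnings;

# Stand-in for lines read from <> (hypothetical sample data)
my @input = ( "a b c", "d e", "f g h i" );

my @AoA;
for my $line (@input) {
    # split returns a list; [ ] copies it into a new anonymous array
    # and yields a reference, which is what gets pushed
    push @AoA, [ split ' ', $line ];
}

# Outer loop walks the rows (array references),
# inner loop dereferences each row to reach its columns
for my $row (@AoA) {
    for my $column (@$row) {
        print "$column ";
    }
    print "\n";
}

# Direct element access; both forms are equivalent
print $AoA[2][1],   "\n";    # g
print $AoA[2]->[1], "\n";    # g
```

Each pass through the first loop creates a fresh anonymous array, so the rows are independent of one another and may have different lengths.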

Error using intermediate variable to access Spreadsheet::Read sheets

I'm no expert at Perl, wondering why the first way of obtaining numSheets is okay, while the following way isn't:
use Spreadsheet::Read;
my $spreadsheet = ReadData("blah.xls");
my $n1 = $spreadsheet->[1]{sheets}; # okay
my %sh = %spreadsheet->[1]; # bad
my $n2 = $sh{label};
The next to last line gives the error
Global symbol "%spreadsheet" requires explicit package name at newexcel_display.pl line xxx
I'm pretty sure I have the right sigils; if I experiment I can only get different errors. I know spreadsheet is a reference to an array not directly an array. I don't know about the hash for the metadata or individual sheets, but experimenting with different assumptions leads nowhere (at least with my modest perl skill.)
My reference on Spreadsheet::Read workings is http://search.cpan.org/perldoc?Spreadsheet::Read If there are good examples somewhere online that show how to properly use Spreadsheet, I'd like to know where they are.
It's not okay because it's not valid Perl syntax. The why is because that's not how Larry defined his language.
The sigils in front of variables tell you what you are trying to do, not what sort of variable it is. A $ means single item, as in $scalar but also single element accesses to aggregates such as $array[0] and $hash{$key}. Don't use the sigils to coerce types. Perl 5 doesn't do that.
In your case, $spreadsheet is an array reference. The %spreadsheet variable, which is a named hash, is a completely separate variable unrelated to all other variables with the same identifier. $foo, @foo, and %foo come from different namespaces. Since you haven't declared a %spreadsheet, strict throws the error that you see.
It looks like you want to get a hash reference from $spreadsheet->[1]. All references are scalars, so you want to assign to a scalar:
my $hash_ref = $spreadsheet->[1];
Once you have the hash reference in the scalar, you dereference it to get its values:
my $n2 = $hash_ref->{sheets};
This is the stuff we cover in the first part of Intermediate Perl.
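The sigil rules can be tried out without the module. In the sketch below, $spreadsheet is a hand-built stand-in with a similar shape (an array reference whose elements are hash references); the keys and values are invented for illustration and are not actual Spreadsheet::Read output.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the structure returned by ReadData
my $spreadsheet = [
    { sheets => 1 },
    { label  => 'Sheet1', maxrow => 3 },
];

# A reference is a scalar, so it is stored in a scalar variable
my $hash_ref = $spreadsheet->[1];
my $label    = $hash_ref->{label};

# To get a real hash (a copy), dereference the reference with %{ ... }
my %sh     = %{ $spreadsheet->[1] };
my $maxrow = $sh{maxrow};    # single element of a hash: $ sigil

print "label:  $label\n";
print "maxrow: $maxrow\n";
```

The %{ ... } form is probably what the original %spreadsheet->[1] line was reaching for: the sigil applies to the dereferenced expression inside the braces, not to the variable name.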

In Perl, why does the `while(<HANDLE>) {...}` construct not localize `$_`?

What was the design (or technical) reason for Perl not automatically localizing $_ with the following syntax:
while (<HANDLE>) {...}
Which gets rewritten as:
while (defined( $_ = <HANDLE> )) {...}
All of the other constructs that implicitly write to $_ do so in a localized manner (for/foreach, map, grep), but with while, you must explicitly localize the variable:
local $_;
while (<HANDLE>) {...}
My guess is that it has something to do with using Perl in "Super-AWK" mode with command line switches, but that might be wrong.
So if anyone knows (or better yet was involved in the language design discussion), could you share with us the reasoning behind this behavior? More specifically, why was allowing the value of $_ to persist outside of the loop deemed important, despite the bugs it can cause (which I tend to see all over the place on SO and in other Perl code)?
In case it is not clear from the above, the reason why $_ must be localized with while is shown in this example:
sub read_handle {
while (<HANDLE>) { ... }
}
for (1 .. 10) {
print "$_: \n"; # works, prints a number from 1 .. 10
read_handle;
print "done with $_\n"; # does not work, prints the last line read from
# HANDLE or undef if the file was finished
}
From the thread on perlmonks.org:
There is a difference between foreach
and while because they are two totally
different things. foreach always
assigns to a variable when looping
over a list, while while normally
doesn't. It's just that while (<>) is
an exception and only when there's a
single diamond operator there's an
implicit assignment to $_.
And also:
One possible reason for why while(<>)
does not implicitly localize $_ as
part of its magic is that sometimes
you want to access the last value of
$_ outside the loop.
Quite simply, while never localises. No variable is associated with a while construct, so it doesn't even have anything to localise.
If you change some variable in the while loop expression or in a while loop body, it's your responsibility to adequately scope it.
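A minimal sketch of the difference, assuming an in-memory filehandle in place of HANDLE (the sub names and sample text are invented for illustration):

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";

# Reads with the implicit $_ and leaves the caller's $_ clobbered.
sub read_clobbering {
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while (<$fh>) { }    # assigns each line, and finally undef, to $_
    close $fh;
    return;
}

# Same loop, but the caller's $_ is restored when the sub returns.
sub read_localized {
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $_;
    while (<$fh>) { }
    close $fh;
    return;
}

my @after_clobber;
for ( 1 .. 2 ) {
    read_clobbering();
    push @after_clobber, defined $_ ? $_ : 'undef';
}

my @after_local;
for ( 1 .. 2 ) {
    read_localized();
    push @after_local, $_;
}

print "without local: @after_clobber\n";
print "with local:    @after_local\n";
```

Another common way to sidestep the problem is to avoid $_ altogether with while (my $line = <$fh>), which scopes the variable to the loop.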
Speculation: for and foreach are iterators and loop over values, while while operates on a condition. In the case of while (<FH>) the condition is that data was read from the file. The <FH> is what writes to $_, not the while. The implicit defined() test is just an affordance to prevent naive code from terminating the loop when a false value is read (such as a final line "0" with no trailing newline).
For other forms of while loops, e.g. while (/foo/) you wouldn't want to localize $_.
While I agree that it would be nice if while (<FH>) localized $_, it would have to be a very special case, which could cause other problems with recognizing when to trigger it and when not to, much like the rules for <EXPR> distinguishing being a handle read or a call to glob.
As a side note, we only write while(<$fh>) because Perl doesn't have real iterators. If Perl had proper iterators, <$fh> would return one. for would use that to iterate a line at a time rather than slurping the whole file into an array. There would be no need for while(<$fh>) or the special cases associated with it.