map documentation not clear enough - perl

I've tried to understand the map function by reading its documentation to no avail.
In the documentation it says "Evaluates the BLOCK or EXPR for each element of LIST"
However, how is one to know that one can also use file test operators as well as shown below?
map { [$_, -s] } ('perl.c', 'sv.c', 'hv.c', 'av.c');
Source of the above code is: http://www.stllinux.org/meeting_notes/1997/0918/schwtr.html
So basically, the result will be a hash of files along with its file size but how on earth was I supposed to know about this from the documentation alone?
Can you guys help me out to understand more?

Actually, it says
map BLOCK LIST
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value composed of the results
of each such evaluation. In scalar context, returns the total number
of elements so generated. Evaluates BLOCK or EXPR in list context, so
each element of LIST may produce zero, one, or more elements in the
returned value.
The important part is that $_ is localized to the BLOCK, containing the value of each element of the LIST. Much the same is true for a for loop, i.e. for (LIST).
The -s function is as you say a file test, and without explicit argument it operates on $_. This is the same default behaviour that many of Perl's built-in functions have, for example print, unpack, ord, length.
The code you are showing contains a single scalar expression: [$_, -s], which is an array ref containing the file name inside $_ and as you say, its size.
So, basically, what you are seeing here is basic Perl techniques. If there is anything that is still not clear, feel free to ask.
Update:
As for what this code in specific does, it is probably part of a Schwartzian transform, whereby you perform a more efficient sort on a list, where the sort criteria consists of an expensive operation. For example:
my #files = ('perl.c', 'sv.c', 'hv.c', 'av.c');
my #sorted = sort { -s $a <=> -s $b } #files; # sorting by file size
For a small list, this will not matter much, but with a larger list, it might not be very efficient to run file tests multiple times, so instead we cache the test result in an array ref:
my #sorted = map $_->[0], # restore original value
sort { $a->[1] <=> $b->[1] } # perform sort on element #2
map { [ $_, -s ] } #files; # your map statement
And this is then called a Schwartzian transform.

Related

Perl subroutines

Here I fixed most of my mistakes and thank you all, any other advice please with my hash at this point and how can I clear each word and puts the word and its frequency in a hash, excluding the empty words.. I think my code make since now.
So you can focus on the key part of the algorithm, how about accepting input on STDIN and output to STDOUT. That way there's no argument checking, etc. Just a simple:
$ prog < words.txt
All you really need is a very simple algorithm:
Read a line
Split it into words
Record a count of the word
When done, display the counts
Here's a sample program
#! /usr/bin/perl -w
use strict;
my (%data);
while (<STDIN>) {
chomp;
my(#words) = split(/\s+/);
foreach my $word (#words) {
if (!defined($data{$word})) {
$data{$word} = 0;
}
$data{$word}++;
}
}
foreach (sort(keys(%data))) {
print "$_: $data{$_}\n";
}
Once you understand this and have it working in your environment, you can extend it to meet your other requirements:
remove non-alphabetic characters from each word
print three results per line
use input and output files
put the algorithm into a subroutine
I agree that starting with dave's answer would be more productive, but if you are interested in your mistakes, here is what I see:
You assign the return value of checkArgs to a scalar variable $checkArgs, but return an array value. It means that $checkArgs will always contain 2 (the size of the array) after this call (because the program dies if the number of arguments is not 2). It is not very bad since you do not use the value later, but why you need it at all in this case?
You open files and close them immediately without reading from them. Does not make sense.
Statement
while (<>)
reads either from standard output or from all files in the command line arguments. The latter variant is like what you want, but your second argument is the output file, not input. The diamond operator will try to read from it too. You have two options: a) use only one file name in the command line arguments, read the file with <>, use standard output for output, and redirect output to a file in shell; b) use
while(<$file1>)
instead, of course, before closing files. Option a) is the traditional Unix- and Perl-style, but b) provides for clearer code for beginners.
Statements
return $word;
and
return $str, $hash{$str};
return corresponding values on the first iterations of the loops, all other data remain unprocessed. In the first case, you should create a local array, store all $word in it and return the array as a whole. In the second case, you already have such local %hash, it is enough to return this hash. In both cases, you need should assign the return values of the functions not to scalars, but to an array and a hash correspondingly. Now, you actually lose all you data.

Index of an element in an array in Perl

I am trying to find a way to get an index of an element in an array that partially matches a certain patten.
Let's say I have an array with values
Maria likes tomatoes,
Sonia likes plums,
Andrew likes oranges
If my search term is plums, I will get 1 returned as index.
Thank you!
Quick search didn't find a dupe, but I'm sure there is one. Meanwhile:
To find elements of an array that meet a certain condition, you use grep. If you want the indexes instead of the elements.. well, Perl 6 added a grep-index method to handle that case, but in Perl 5 the easiest way is to change the target of grep. That is, instead of running it on the original array, run it on a list of indexes - just with a condition that references the original array. In your case, that might look like this:
my #array = ( 'Maria likes tomatoes',
'Sonia likes plums',
'Andrew likes oranges');
grep { $array[$_] =~ /plums/ } 0..$#array; # 1
Relevant bits:
$#array returns the index of the last element of #array.
m..n generates a range of values between m and n (inclusive); in list context that becomes a list of those values.
grep { code } list returns the elements of list for which code produces a true value when the special variable $_ is set to the element.
These sorts of expressions read most easily from right to left. So, first we generate a list of all the indexes of the original array (0..$#array), then we use grep to test each index (represented by $_) to see if the corresponding element of #array ($array[$_]) matches (~=) the regular expression /plums/.
If it does, that index is included in the list returned by the grep; if not, it's left out. So the end result is a list of only those indexes for which the condition is true. In this case, that list contains only the value 1.
Added to reply to your comment: It's important to note that the return value of grep is normally a list of matching elements, even if there is only one match. If you assign the result to an array (e.g. with my #indexes = grep...), the array will contain all the matching values. However, grep is context-sensitive, and if you call it in scalar context (e.g. by assigning its return value to a scalar variable with something like my $count = grep...), you'll instead only get a number telling you how many matches there were. You might want to take a look at this tutorial on context sensitivity in Perl.
This is what firstidx from List::MoreUtils is for.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use List::MoreUtils 'firstidx';
my #array = ('Maria likes tomatoes',
'Sonia likes plums',
'Andrew likes oranges');
say firstidx { /plums/ } #array;
Update: I see that draegtun has answered your comment about getting multiple indexes. But I wonder why you couldn't just browse the List::MoreUtils documentation to see if there was a useful-looking function in there.

Perl: Anonymous Multi-Dimensional Arrays

NOTE: See the end of this post for final explanation.
This is probably a very basic question, but I'm still trying to master a few of the fundamentals regarding references in Perl, and came across something in the perldsc page that I'd like to confirm. The following code is in the Generating Array of Arrays section:
while ( <> ) {
push #AoA, [ split ];
}
Obviously, the <> operation in the while loop reads one line of input in at a time. I am assuming at this point that line is then put into an anonymous array via the [ ] brackets, we'll call this #zero. Then the split command places everything in a given line separated by whitespace within the array (e.g., the first word is assigned to $zero[0], the second to $zero[1] and so on). The scalar reference of #zero is then pushed onto #AoA.
The next line of input is passed via the <> operator and gets assigned to a completely new anonymous array (e.g. #one), and its scalar reference is pushed onto #AoA.
Once #AoA is populated, I could then access its contents with a nested foreach loop; the first iterating through the "rows" (e.g. for $row (#AoA)), and a second, inner loop, foreach to access the columns of that particular row.
The latter (accessing said "columns" would be done by dereferencing (e.g., for $column (#$row)) the particular $row being read by the previous, "outer" foreach loop.
Is my understanding correct? I'm assuming you could still access any element of the #AoA just as you would if it were assigned vs. being anonymous? That is $element = $AoA[8][1]; .
I'm want to verify my thought process here. Is the automatic declaration of a unique, anonymous array each time through the loop part of the autovivication in Perl? I guess that is what is throwing me off a bit. Thanks.
EDIT: Based on the comments below my understanding regarding the anonymous array is still unclear, so I want to take a shot at one more description to see if it meets everyone's understanding.
Starting with the push #AoA, [split]; statement, split takes in the line from $_ and returns a list parsed by whitepace. That list is captured by [ ], which then returns an array reference. That array reference (created by [ ]) is then pushed onto #AoA. Is this accurate re: [ ]? The next step (dereferencing / use of #AoA) was covered very well by #krico below.
FINAL ANSWER/EXPLANATION: Based on all of the comments / feedback here, some further research on my part, and testing it seems my understanding was correct. I'll break it down here, so others can easily reference it later. See #krico's response below for a more explicit code representation that follows the steps outlined here.
while ( <> ) {
push #AoA, [ split ];
}
One line of input is passed at a time to the <> operator
The split function takes that line in via $_ and parses it based on whitespace (the default).
split then returns a LIST.
The [ ] is an anonymous array that provides the perl data structure for the List passed by split.
The push #AoA pushes the reference to the anonymous array onto its queue as element $AoA[0] (the second anonymous array reference will be put into $AoA1, etc...).
This continues through the entire input file. Once completed, #AoA is a 2D array, holding reference values (scalar values) to each of the previously generated anonymous arrays.
From this point #AoA can be dereferenced appropriately to work with the underlying/reference elements taken in from the input file. The default dereferencing technique is CIRCUMFIX (see perlfef below); however as of 5.19 a new method of dereferencing is available and will be released in 5.20, POSTFIX. Articles are linked below.
References: Perl References Documentation, Perl References Tutorial, Perl References Question noted by #Eli Hubert, Mike Friedman's blog post about differences between arrays and lists, Upcoming Postfix dereferencing in Perl, and Postfix dereferencing Article
This is what is going on:
The <> will put the line into the default variable $_
The split function will read $_ and return an array
The [ ] brackets will return a scalar, in it there will be a reference to that array
That reference is then pushed into the #AoA array
When you do $AoA[8][2] you are implicitly dereferencing the scalar. It's the same as $AoA[8]->[2].
The same code a little more readeable and you should understand it.
my $line;
while ( $line = <STDIN> ) {
my #parts = split $line;
my $partsRef = \#parts;
push #AoA, $partsRef;
}
Now, if you wanted to print the 2nd part of the 5th line you could say.
my $ref = #AoA[4];
my #parts = #$ref;
print $parts[1];
Get it?

A better variable naming system?

A newbie to programming. The task is to extract a particular data from a string and I chose to write the code as follows -
while ($line =<IN>) {
chomp $line;
#tmp=(split /\t/, $line);
next if ($tmp[0] !~ /ch/);
#tgt1=#tmp[8..11];
#tgt2=#tmp[12..14];
#tgt3=#tmp[15..17];
#tgt4=#tmp[18..21];
foreach (1..4) {
print #tgt($_), "\n";
}
I thought #tgt($_) would be interpreted as #tgt1, #tgt2, #tgt3, #tgt4 but I still get the error message that #tgt is a global symbol (#tgt1, #tgt2, #tgt3, #tgt4` have been declared).
Q1. Did I misunderstand foreach loop?
Q2. Why couldn't perl see #tgt($_) as #tgt1, #tgt2 ..etc?
Q2. From the experience this is probably a bad way to name variables. What would be a preferred way to name variables that have similar features?
Q2. Why couldn't perl see #tgt($_) as #tgt1, #tgt2 ..etc?
Q2. From the experience this is probably a bad way to name variables. What would be a preferred way to name variables that have similar features?
I'll asnswer both together.
#tgt($_) does NOT mean what you hope it means
First off, it's an invalid syntax (you can't use () after an array name, perl interpeter will produce a compile error).
What you're trying to do is access distinct variables by accessing a variable via an expression resulting in its name (aka symbolic references). This IS possible to do; but is typically a bad idea and poor-style Perl (as in, you CAN but you SHOULD NOT do it, without a very very good reason).
To access element $_ the way you tried, you use #{"tgt$_"} syntax. But I repeat - Do Not Do That, even if you can.
A correct idiomatic solution: use an array of arrayrefs, with your 1-4 (or rather 0-3) indexing the outer array:
# Old bad code: #tgt1=#tmp[8..11];
# New correct code:
$tgt[0]=[ #tmp[8..11] ]; # [] creates an array reference from a list.
# etc... repeat 4 times - you can even do it in a smart loop later.
What this does is, it stores a reference to an array slice into a zeroth element of a single #tgt array.
At the end, #tgt array has 4 elements , each an array reference to an array containing one of the slices.
Q1. Did I misunderstand foreach loop?
Your foreach loop (as opposed to its contents - see above) was correct, with one style caveat - again, while you CAN use a default $_ variable, you should almost never use it, instead always use named variables for readability.
You print the abovementioned array of arrayrefs as follows (ask separately if any of the syntax is unclear - this is a mid-level data structure handling, not for beginners):
foreach my $index (0..3) {
print join(",", #{ $tgt[$index]}) . "\n";
}

Is there any advantage to using keys #array instead of 0 .. $#array?

I was quite surprised to find that the keys function happily works with arrays:
keys HASH
keys ARRAY
keys EXPR
Returns a list consisting of all the keys of the named hash, or the
indices of an array. (In scalar context, returns the number of keys or
indices.)
Is there any benefit in using keys #array instead of 0 .. $#array with respect to memory usage, speed, etc., or are the reasons for this functionality more of a historic origin?
Seeing that keys #array holds up to $[ modification, I'm guessing it's historic :
$ perl -Mstrict -wE 'local $[=4; my #array="a".."z"; say join ",", keys #array;'
Use of assignment to $[ is deprecated at -e line 1.
4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
Mark has it partly right, I think. What he's missing is that each now works on an array, and, like the each with hashes, each with arrays returns two items on each call. Where each %hash returns key and value, each #array also returns key (index) and value.
while (my ($idx, $val) = each #array)
{
if ($idx > 0 && $array[$idx-1] eq $val)
{
print "Duplicate indexes: ", $idx-1, "/", $idx, "\n";
}
}
Thanks to Zaid for asking, and jmcnamara for bringing it up on perlmonks' CB. I didn't see this before - I've often looped through an array and wanted to know what index I'm at. This is waaaay better than manually manipulating some $i variable created outside of a loop and incremented inside, as I expect that continue, redo, etc., will survive this better.
So, because we can now use each on arrays, we need to be able to reset that iterator, and thus we have keys.
The link you provided actually has one important reason you might use/not use keys:
As a side effect, calling keys() resets the internal interator of the HASH or ARRAY (see each). In particular, calling keys() in void context resets the iterator with no other overhead.
That would cause each to reset to the beginning of the array. Using keys and each with arrays might be important if they ever natively support sparse arrays as a real data-type.
All that said, with so many array-aware language constructs like foreach and join in perl, I can't remember the last time I used 0..$#array.
I actually think you've answered your own question: it returns the valid indices of the array, no matter what value you've set for $[. So from a generality point of view (especially for library usage), it's more preferred.
The version of Perl I have (5.10.1) doesn't support using keys with arrays, so it can't be for historic reasons.
Well in your example, you are putting them in a list; So, in a list context
keys #array will be replaced with all elements of array
whereas 0 .. $#array will do the same but as array slicing; So, instead $array[0 .. $#array] you can also mention $array[0 .. (some specific index)]