three questions on a Perl function - perl

I am trying to use an existing Perl program, which includes the following function of GetItems. The way to call this function is listed in the following.
I have several questions for this program:
what does foreach my $ref (#_) aim to do? I think #_ should be related to the parameters passed, but not quite sure.
In my #items = sort { $a <=> $b } keys %items; the "items" on the left side should be different from the "items" on the right side? Why do they use the same name?
What does $items{$items[$i]} = $i + 1; aim to do? Looks like it just sets up the value for the hash $items sequentially.
$items = GetItems($classes, $pVectors, $nVectors, $uVectors);
######################################
sub GetItems
######################################
{
my $classes = shift;
my %items = ();
foreach my $ref (#_)
{
foreach my $id (keys %$ref)
{
foreach my $cui (keys %{$ref->{$id}}) { $items{$cui} = 1 }
}
}
my #items = sort { $a <=> $b } keys %items;
open(VAL, "> $classes.items");
for my $i (0 .. $#items)
{
print VAL "$items[$i]\n";
$items{$items[$i]} = $i + 1;
}
close VAL;
return \%items;
}

When you enter a function, #_ starts out as an array of (aliases to) all the parameters passed into the function; but the my $classes = shift removes the first element of #_ and stores it in the variable $classes, so the foreach my $ref (#_) iterates over all the remaining parameters, storing (aliases to) them one at a time in $ref.
Scalars, hashes, and arrays are all distinguished by the syntax, so they're allowed to have the same name. You can have a $foo, a #foo, and a %foo all at the same time, and they don't have to have any relationship to each other. (This, together with the fact that $foo[0] refers to #foo and $foo{'a'} refers to %foo, causes a lot of confusion for newcomers to the language; you're not alone.)
Exactly. It sets each element of %items to a distinct integer ranging from one to the number of elements, proceeding in numeric (!) order by key.

foreach my $ref (#_) loops through each hash reference passed as a parameter to GetItems. If the call looks like this:
$items = GetItems($classes, $pVectors, $nVectors, $uVectors);
then the loop processes the hash refs in $pVector, $nVectors, and $uVectors.
#items and %items are COMPLETELY DIFFERENT VARIABLES!! #items is an array variable and %items is a hash variable.
$items{$items[$i]} = $i + 1 does exactly as you say. It sets the value of the %items hash whose key is $items[$i] to $i+1.

Here is an (nearly) line by line description of what is happening in the subroutine
Define a sub named GetItems.
sub GetItems {
Store the first value in the default array #_, and remove it from the array.
my $classes = shift;
Create a new hash named %items.
my %items;
Loop over the remaining values given to the subroutine, setting $ref to the value on each iteration.
for my $ref (#_){
This code assumes that the previous line set $ref to a hash ref. It loops over the unsorted keys of the hash referenced by $ref, storing the key in $id.
for my $id (keys %$ref){
Using the key ($id) given by the previous line, loop over the keys of the hash ref at that position in $ref. While also setting the value of $cui.
for my $cui (keys %{$ref->{$id}}) {
Set the value of %item at position $cui, to 1.
$items{$cui} = 1;
End of the loops on the previous lines.
}
}
}
Store a sorted list of the keys of %items in #items according to numeric value.
my #items = sort { $a <=> $b } keys %items;
Open the file named by $classes with .items appended to it. This uses the old-style two arg form of open. It also ignores the return value of open, so it continues on to the next line even on error. It stores the file handle in the global *VAL{IO}.
open(VAL, "> $classes.items");
Loop over a list of indexes of #items.
for my $i (0 .. $#items){
Print the value at that index on it's own line to *VAL{IO}.
print VAL "$items[$i]\n";
Using that same value as an index into %items (which it is a key of) to the index plus one.
$items{$items[$i]} = $i + 1;
End of loop.
}
Close the file handle *VAL{IO}.
close VAL;
Return a reference to the hash %items.
return \%items;
End of subroutine.
}

I have several questions for this program:
What does foreach my $ref (#_) aim to do? I think #_ should be related to the parameters passed, but not quite sure.
Yes, you are correct. When you pass parameters into a subroutine, they automatically are placed in the #_ array. (Called a list in Perl). The foreach my $ref (#_) begins a loop. This loop will be repeated for each item in the #_ array, and each time, the value of $ref will be assigned the next item in the array. See Perldoc's Perlsyn (Perl Syntax) section about for loops and foreach loops. Also look at Perldoc's Perlvar (Perl Variables) section of General variables for information about special variables like #_.
Now, the line my $classes = shift; is removing the first item in the #_ list and putting it into the variable $classes. Thus, the foreach loop will be repeated three times. Each time, $ref will be first set to the value of $pVectors, $nVectors, and finally $uVectors.
By the way, these aren't really scalar values. In Perl, you can have what is called a reference. This is the memory location of the data structure you're referencing. For example, I have five students, and each student has a series of tests they've taken. I want to store all the values of each test in a hash keyed by the student's ID.
Normally, each entry in the hash can only contain a single item. However, what if this item refers to a list that contains the student's grades?
Here's the list of student #100's grade:
#grades = (100, 93, 89, 95, 74);
And here's how I set Student 100's entry in my hash:
$student{100} = \#grades;
Now, I can talk about the first grade of the year for Student #100 as $student{100}[0]. See the Perldoc's Mark's very short tutorial about references.
In my #items = sort { $a <=> $b } keys %items; the "items" on the left side should be different from the "items" on the right side? Why do they use the same name?
In Perl, you have three major types of variables: Lists (what some people call Arrays), Hashes (what some people call Keyed Arrays), and Scalars. In Perl, it is perfectly legal to have different variable types have the same name. Thus, you can have $var, %var, and #var in your program, and they'll be treated as completely separate variables1.
This is usually a bad thing to do and is highly discouraged. It gets worse when you think of the individual values: $var refers to the scalar while $var[3] refers to the list, and $var{3} refers to the hash. Yes, it can be very, very confusing.
In this particular case, he has a hash (a keyed array) called %item, and he's converting the keys in this hash into a list sorted by the keys. This syntax could be simplified from:
my #items = sort { $a <=> $b } keys %items;
to just:
my #items = sort keys %items;
See the Perldocs on the sort function and the keys function.
What does $items{$items[$i]} = $i + 1; aim to do? Looks like it just sets up the value for the hash $items sequentially.
Let's look at the entire loop:
foreach my $i (0 .. $#items)
{
print VAL "$items[$i]\n";
$items{$items[$i]} = $i + 1;
}
The subroutine is going to loop through this loop once for each item in the #items list. This is the sorted list of keys to the old %items hash. The $#items means the largest index in the item list. For example, if #items = ("foo", "bar", and "foobar"), then $#item would be 2 because the last item in this list is $item[2] which equals foobar.
This way, he's hitting the index of each entry in #items. (REMEMBER: This is different from %item!).
The next line is a bit tricky:
$items{$items[$i]} = $i + 1;
Remember that $item{} refers to the old %items hash! He's creating a new %items hash. This is being keyed by each item in the #items list. And, the value is being set to the index of that item plus 1. Let's assume that:
#items = ("foo", "bar", "foobar")
In the end, he's doing this:
$item{foo} = 1;
$item{bar} = 2;
$item{foobar} = 3;
1 Well, this isn't 100% true. Perl stores each variable in a kind of hash structure. In memory, $var, #var, and %var will be stored in the same hash entry in memory, but in positions related to each variable type. 99.9999% of the time, this matters not one bit. As far as you are concerned, these are three completely different variables.
However, there are a few rare occasions where a programmer will take advantage of this when they futz directly with memory in Perl.

I want to show you how I would write that subroutine.
Bur first, I want to show you some of the steps of how, and why, I changed the code.
Reduce the number of for loops:
First off this loop doesn't need to set the value of $items{$cui} to anything in particular. It also doesn't have to be a loop at all.
foreach my $cui (keys %{$ref->{$id}}) { $items{$cui} = 1 }
This does practically the same thing. The only real difference is it sets them all to undef instead.
#items{ keys %{$ref->{$id}} } = ();
If you really needed to set the values to 1. Note that (1)x#keys returns a list of 1's with the same number of elements in #keys.
my #keys = keys %{$ref->{$id}};
#items{ #keys } = (1) x #keys;
If you are going to have to loop over a very large number of elements then a for loop may be a good idea, but only if you have to set the value to something other than undef. Since we are only using the loop variable once, to do something simple; I would use this code:
$items{$_} = 1 for keys %{$ref->{$id}};
Swap keys with values:
On the line before that we see:
foreach my $id (keys %$ref){
In case you didn't notice $id was used only once, and that was for getting the associated value.
That means we can use values and get rid of the %{$ref->{$id}} syntax.
for my $hash (values %$ref){
#items{ keys %$hash } = ();
}
( $hash isn't a good name, but I don't know what it represents. )
3 arg open:
It isn't recommended to use the two argument form of open, or to blindly use the bareword style of filehandles.
open(VAL, "> $classes.items");
As an aside, did you know there is also a one argument form of open. I don't really recommend it though, it's mostly there for backward compatibility.
our $VAL = "> $classes.items";
open(VAL);
The recommend way to do it, is with 3 arguments.
open my $val, '>', "$classes.items";
There may be some rare edge cases where you need/want to use the two argument version though.
Put it all together:
sub GetItems {
# this will cause open and close to die on error (in this subroutine only)
use autodie;
my $classes = shift;
my %items;
for my $vector_hash (#_){
# use values so that we don't have to use $ref->{$id}
for my $hash (values %$ref){
# create the keys in %items
#items{keys %$hash} = ();
}
}
# This assumes that the keys of %items are numbers
my #items = sort { $a <=> $b } keys %items;
# using 3 arg open
open my $output, '>', "$classes.items";
my $index; # = 0;
for $item (#items){
print {$output} $item, "\n";
$items{$item} = ++$index; # 1...
}
close $output;
return \%items;
}
Another option for that last for loop.
for my $index ( 1..#items ){
my $item = $items[$index-1];
print {$output} $item, "\n";
$items{$item} = $index;
}
If your version of Perl is 5.12 or newer, you could write that last for loop like this:
while( my($index,$item) = each #items ){
print {$output} $item, "\n";
$items{$item} = $index + 1;
}

Related

Perl split function basics reading each word from an input file

I'm having trouble understanding why this code will not output anything:
#!/usr/bin/perl -w
use strict;
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
while (<>) {
print "In the loop 1";
chomp;
print "Got here";
my #words = split(/\W+/,$_);
}
foreach my $val (my #words) {
print "$val\n";
}
And I run it from the terminal using the command:
perl wordfinder.pl < exampletext.txt
I would expect the code above to output each word from the input file, but it does not output anything other than "In the loop 1" and "Got here". I'm trying to separate the input file word by word, using the split parameter I specified.
Update 1: Here, I have declared the variables within their proper scope, which was my main issue. Now I am getting all of the words from the input file to output on the terminal:
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
my #words = ();
my $val;
while (<>) {
print "Inputting words into an array! \n";
chomp;
#words = split(/\W+/,$_);
}
print("Words have been input successfully, performing analysis: \n");
foreach $val (#words) {
print "$val\n";
}
UPDATE 2: Progress has been made. Now, we put all words from any input files into a hash, and then print each unique key (i.e. each unique word found across all input files) from the hash.
#!/usr/bin/perl -w
use strict;
# Description: We want to take ALL text files from the command line input and calculate
# the frequencies of the words contained therein.
# Step 1: Loop over all words in all input files, and put each new unique word in a
# hash (check to see if contained in hash, if not, put the word in; if the word already
# exists in the hash, then increase its "total" by 1). Also, keep a running total of
# all words.
print("Welcome to word frequency finder. \n");
my $running_total = 0;
my %words;
my $val;
while (<>) {
chomp;
foreach my $str (split(/\W+/,$_)) {
$words{$str}++;
$running_total++;
}
}
print("Words have been input successfully, performing analysis: \n");
# Step 2: Loop over all entries in the hash and look for the word (key) with the
# maximum amount, and then remove this from the hash and put in a separate list.
# Do this until the size of the separate list is 10, since we want the top 10 words.
foreach $val (keys %words) {
print "$val\n";
}
Since you've already completed step 1, you're left with getting your top ten most common words. Rather than looping through the hash and finding the most frequent entry, let's let Perl do the work for us by sorting the hash by its values.
To sort the %words hash by its keys, we can use the expression sort keys %words; to sort a hash by its values, but be able to access its keys, we need a more complex expression:
sort { $words{$a} <=> $words{$a} } keys %words
Breaking it down, to sort numerically, we use the expression
sort { $a <=> $b } #array
(see [perl sort][1] for more on the special variables $a and $b used in sorting)
sort { $a <=> $b } keys %words
would sort on the hash keys, so to sort on the values, we do
sort { $words{$a} <=> $words{$b} } keys %words
Note that the output is still the keys of the hash %words.
We actually want to sort from high to low, so swap $a and $b over to reverse the sort direction:
sort { $words{$b} <=> $words{$a} } keys %words
Since we're compiling a top ten list, we only want the first ten from our hash. It's possible to do this by taking a slice of the hash, but the easiest way is just to use an accumulator to keep count of how many entries we have in the top ten:
my %top_ten;
my $i = 0;
for (sort { $words{$b} <=> $words{$a} } keys %words) {
# $_ is the current hash key
$top_ten{$_} = $words{$_};
$i++;
last if $i == 10;
}
And we're done!

How do I return an array and a hashref?

I want to make a subroutine that adds elements (keys with values) to a previously-defined hash. This subroutine is called in a loop, so the hash grows. I don’t want the returning hash to overwrite existing elements.
At the end, I would like to output the whole accumulated hash.
Right now it doesn’t print anything. The final hash looks empty, but that shouldn’t be.
I’ve tried it with hash references, but it does not really work. In a short form, my code looks as follows:
sub main{
my %hash;
%hash=("hello"=>1); # entry for testing
my $counter=0;
while($counter>5){
my(#var, $hash)=analyse($one, $two, \%hash);
print ref($hash);
# try to dereference the returning hash reference,
# but the error msg says: its not an reference ...
# in my file this is line 82
%hash=%{$hash};
$counter++;
}
# here trying to print the final hash
print "hash:", map { "$_ => $hash{$_}\n" } keys %hash;
}
sub analyse{
my $one=shift;
my $two=shift;
my %hash=%{shift #_};
my #array; # gets filled some where here and will be returned later
# adding elements to %hash here as in
$hash{"j"} = 2; #used for testing if it works
# test here whether the key already exists or
# otherwise add it to the hash
return (#array, \%hash);
}
but this doesn’t work at all: the subroutine analyse receives the hash, but its returned hash reference is empty or I don’t know. At the end nothing got printed.
First it said, it's not a reference, now it says:
Can't use an undefined value as a HASH reference
at C:/Users/workspace/Perl_projekt/Extractor.pm line 82.
Where is my mistake?
I am thankful for any advice.
Arrays get flattened in perl, so your hashref gets slurped into #var.
Try something like this:
my ($array_ref, $hash_ref) = analyze(...)
sub analyze {
...
return (\#array, \#hash);
}
If you pass the hash by reference (as you're doing), you needn't return it as a subroutine return value. Your manipulation of the hash in the subroutine will stick.
my %h = ( test0 => 0 );
foreach my $i ( 1..5 ) {
do_something($i, \%h);
}
print "$k = $v\n" while ( my ($k,$v) = each %h );
sub do_something {
my $num = shift;
my $hash = shift;
$hash->{"test${num}"} = $num; # note the use of the -> deference operator
}
Your use of the #array inside the subroutine will need a separate question :-)

How to parse through many hashes using foreach?

foreach my %hash (%myhash1,%myhash2,%myhash3)
{
while (($keys,$$value) = each %hash)
{
#use key and value...
}
}
Why doesn't this work :
it says synta error on foreach line.
Pls tell me why is it wrong.
This is wrong because you seem to think that this allows you to access each hash as a separate construct, whereas what you are in fact doing is, besides a syntax error, accessing the hashes as a mixed-together new list. For example:
my %hash1 = qw(foo 1 bar 1);
my %hash2 = qw(abc 1 def 1);
for (%hash1, %hash2) # this list is now qw(foo 1 bar 1 abc 1 def 1)
When you place a hash (or array) in a list context statement, they are expanded into their elements, and their integrity is not preserved. Some built-in functions do allow this behaviour, but normal Perl code does not.
You also cannot assign a hash as the for iterator variable, that can only ever be a scalar value. What you can do is this:
for my $hash (\%myhash1, \%myhash2, \%myhash3) {
while (my ($key, $value) = each %$hash) {
...
Which is to say, you create a list of hash references and iterate over them. Note that you cannot tell the difference between the hashes with this approach.
Note also that I use my $hash because this variable must be a scalar.
The syntax should be like:
my $hash1 = {'a'=>1};
my $hash2 = {'b'=>1};
my #arr2 = ($hash1, $hash2);
foreach $hash (#arr2)
{
while(($key, $value) = each %$hash)
{
print $key, $value;
}
}
you need to reference and then dereference the hash.

Is there a simple way to validate a hash of hash element comparsion?

Is there a simple way to validate a hash of hash element comparsion ?
I need to validate a Perl hash of hash element $Table{$key1}{$key2}{K1}{Value} compare to all other elements in hash
third key will be k1 to kn and i want comprare those elements and other keys are same
if ($Table{$key1}{$key2}{K1}{Value} eq $Table{$key1}{$key2}{K2}{Value}
eq $Table{$key1}{$key2}{K3}{Value} )
{
#do whatever
}
Something like this may work:
use List::MoreUtils 'all';
my #keys = map "K$_", 1..10;
print "All keys equal"
if all { $Table{$key1}{$key2}{$keys[1]}{Value} eq $Table{$key1}{$key2}{$_}{Value} } #keys;
I would use Data::Dumper to help with a task like this, especially for a more general problem (where the third key is more arbitrary than 'K1'...'Kn'). Use Data::Dumper to stringify the data structures and then compare the strings.
use Data::Dumper;
# this line is needed to assure that hashes with the same keys output
# those keys in the same order.
$Data::Dumper::Sortkeys = 1;
my $string1= Data::Dumper->Dump($Table{$key1}{$key2}{k1});
for ($n=2; exists($Table{$key1}{$key2}{"k$n"}; $n++) {
my $string_n = Data::Dumper->Dump($Table{$key1}{$key2}{"k$n"});
if ($string1 ne $string_n) {
warn "key 'k$n' is different from 'k1'";
}
}
This can be used for the more general case where $Table{$key1}{$key2}{k7}{value} itself contains a complex data structure. When a difference is detected, though, it doesn't give you much help figuring out where that difference is.
A fairly complex structure. You should be looking into using object oriented programming techniques. That would greatly simplify your programming and the handling of these complex structures.
First of all, let's simplify a bit. When you say:
$Table{$key1}{$key2}{k1}{value}
Do you really mean:
my $value = $Table{$key1}->{$key2}->{k1};
or
my $actual_value = $Table{$key1}->{$key2}->{k1}->{Value};
I'm going to assume the first one. If I'm wrong, let me know, and I'll update my answer.
Let's simplify:
my %hash = %{$Table{$key1}->{$key2}};
Now, we're just dealing with a hash. There are two techniques you can use:
Sort the keys of this hash by value, then if two keys have the same value, they will be next to each other in the sorted list, making it easy to detect duplicates. The advantage is that all the duplicate keys would be printed together. The disadvantage is that this is a sort which takes time and resources.
Reverse the hash, so it's keyed by value and the value of that key is the key. If a key already exists, we know the other key has a duplicate value. This is faster than the first technique because no sorting is involved. However, duplicates will be detected, but not printed together.
Here's the first technique:
my %hash = %{$Table{$key1}->{$key2}};
my $previous_value;
my $previous_key;
foreach my $key (sort {$hash{$a} cmp $hash{$b}} keys %hash) {
if (defined $previous_key and $previous_value eq $hash{$key}) {
print "\$hash{$key} is a duplicate of \$hash{$previous_key}\n";
}
$previous_value = $hash{$key};
$previous_key = $key;
}
And the second:
my %hash = %{$Table{$key1}->{$key2}};
my %reverse_hash;
foreach $key (keys %hash) {
my $value = $hash{$key};
if (exists $reverse_hash{$value}) {
print "\$hash{$reverse_hash{$value}} has the same value as \$hash{$key}\n";
}
else {
$reverse_hash{$value} = $key;
}
}
Alternative approach to the problem is make utility function which will compare all keys if has same value returned from some function for all keys:
sub AllSame (&\%) {
my ($c, $h) = #_;
my #k = keys %$h;
my $ref;
$ref = $c->() for $h->{shift #k};
$ref ne $c->() and return for #$h{#k};
return 1
}
print "OK\n" if AllSame {$_->{Value}} %{$Table{$key1}{$key2}};
But if you start thinking in this way you can found this approach much more generic (recommended way):
sub AllSame (#) {
my $ref = shift;
$ref ne $_ and return for #_;
return 1
}
print "OK\n" if AllSame map {$_->{Value}} values %{$Table{$key1}{$key2}};
If mapping operation is expensive you can make lazy counterpart of same:
sub AllSameMap (&#) {
my $c = shift;
my $ref;
$ref = $c->() for shift;
$ref ne $c->() and return for #_;
return 1
}
print "OK\n" if AllSameMap {$_->{Value}} values %{$Table{$key1}{$key2}};
If you want only some subset of keys you can use hash slice syntax e.g.:
print "OK\n" if AllSame map {$_->{Value}} #{$Table{$key1}{$key2}}{map "K$_", 1..10};

How can I pass a hash to a Perl subroutine?

In one of my main( or primary) routines,I have two or more hashes. I want the subroutine foo() to recieve these possibly-multiple hashes as distinct hashes. Right now I have no preference if they go by value, or as references. I am struggling with this for the last many hours and would appreciate help, so that I dont have to leave perl for php! ( I am using mod_perl, or will be)
Right now I have got some answer to my requirement, shown here
From http://forums.gentoo.org/viewtopic-t-803720-start-0.html
# sub: dump the hash values with the keys '1' and '3'
sub dumpvals
{
foreach $h (#_)
{
print "1: $h->{1} 3: $h->{3}\n";
}
}
# initialize an array of anonymous hash references
#arr = ({1,2,3,4}, {1,7,3,8});
# create a new hash and add the reference to the array
$t{1} = 5;
$t{3} = 6;
push #arr, \%t;
# call the sub
dumpvals(#arr);
I only want to extend it so that in dumpvals I could do something like this:
foreach my %k ( keys #_[0]) {
# use $k and #_[0], and others
}
The syntax is wrong, but I suppose you can tell that I am trying to get the keys of the first hash ( hash1 or h1), and iterate over them.
How to do it in the latter code snippet above?
I believe this is what you're looking for:
sub dumpvals {
foreach my $key (keys %{$_[0]}) {
print "$key: $_[0]{$key}\n";
}
}
An element of the argument array is a scalar, so you access it as $_[0] not #_[0].
keys operates on hashes, not hash refs, so you need to dereference, using %
And of course, the keys are scalars, not hashes, so you use my $key, not my %key.
To have dumpvals dump the contents of all hashes passed to it, use
sub dumpvals {
foreach my $h (#_) {
foreach my $k (keys %$h) {
print "$k: $h->{$k}\n";
}
}
}
Its output when called as in your question is
1: 2
3: 4
1: 7
3: 8
1: 5
3: 6