Hash (Multihash?) Indexing (Perl) - perl

I have a function that counts frequencies of Trigrams in text. No knowledge of Computational Linguistics required, I just need help with Perl code.
This is the Function:
sub extract_frequencies {
for( my $i=0; $i<=$#tag; $i++ ) {
$wordtagfreq{"$word[$i]\t$tag[$i]"}++;
$tagfreq{$tag[$i]}++;
}
# count Tag-Trigramm-Frequencies
my #start = ("<s>","<s>");
unshift #tag, #start; # korrigiert
push #tag, "<s>";
for( my $i=2; $i<=$#tag; $i++ ) {
$ngramfreq[3]{"$tag[$i-2]\t$tag[$i-1]\t$tag[$i]"}++;
}
}
The particular code points that I do not understand are the following:
1) $ngramfreq[3]
What does the Index on the hash means here? Do I count for each Tag separately? Is it the length of the key? What is my end key (3 different tag keys?)?
2) $i<=$#tag
What does $# in Perl mean?
Haven't used Perl in a while, so I hope some Perl Monks will help me.

[0] is an array index, nothing to do with a hash. This implies that ngramfreq is actually an array of hashes:
my #ngramfreq = (
{ tag => 1, fish => 3 },
{ anothertag => 4 }
);
And thus $ngramfreq[0] gets you the first anon hash, and then you can access the tag.
$#tag is the last index in the array #tag. So with 3 elements, it would be 2, because the array indicies are 0,1,2
Data::Dumper is a good way of visualising a structure, to give you an idea of how it's layed out.
perldoc perldsc is worth a read, as it expands on data structures.

Related

Why I can use #list to call an array, but can't use %dict to call a hash in perl? [duplicate]

This question already has answers here:
Why do you need $ when accessing array and hash elements in Perl?
(9 answers)
Closed 8 years ago.
Today I start my perl journey, and now I'm exploring the data type.
My code looks like:
#list=(1,2,3,4,5);
%dict=(1,2,3,4,5);
print "$list[0]\n"; # using [ ] to wrap index
print "$dict{1}\n"; # using { } to wrap key
print "#list[2]\n";
print "%dict{2}\n";
it seems $ + var_name works for both array and hash, but # + var_name can be used to call an array, meanwhile % + var_name can't be used to call a hash.
Why?
#list[2] works because it is a slice of a list.
In Perl 5, a sigil indicates--in a non-technical sense--the context of your expression. Except from some of the non-standard behavior that slices have in a scalar context, the basic thought is that the sigil represents what you want to get out of the expression.
If you want a scalar out of a hash, it's $hash{key}.
If you want a scalar out of an array, it's $array[0]. However, Perl allows you to get slices of the aggregates. And that allows you to retrieve more than one value in a compact expression. Slices take a list of indexes. So,
#list = #hash{ qw<key1 key2> };
gives you a list of items from the hash. And,
#list2 = #list[0..3];
gives you the first four items from the array. --> For your case, #list[2] still has a "list" of indexes, it's just that list is the special case of a "list of one".
As scalar and list contexts were rather well defined, and there was no "hash context", it stayed pretty stable at $ for scalar and # for "lists" and until recently, Perl did not support addressing any variable with %. So neither %hash{#keys} nor %hash{key} had meaning. Now, however, you can dump out pairs of indexes with values by putting the % sigil on the front.
my %hash = qw<a 1 b 2>;
my #list = %hash{ qw<a b> }; # yields ( 'a', 1, 'b', 2 )
my #l2 = %list[0..2]; # yields ( 0, 'a', 1, '1', 2, 'b' )
So, I guess, if you have an older version of Perl, you can't, but if you have 5.20, you can.
But for a completist's sake, slices have a non-intuitive way that they work in a scalar context. Because the standard behavior of putting a list into a scalar context is to count the list, if a slice worked with that behavior:
( $item = #hash{ #keys } ) == scalar #keys;
Which would make the expression:
$item = #hash{ #keys };
no more valuable than:
scalar #keys;
So, Perl seems to treat it like the expression:
$s = ( $hash{$keys[0]}, $hash{$keys[1]}, ... , $hash{$keys[$#keys]} );
And when a comma-delimited list is evaluated in a scalar context, it assigns the last expression. So it really ends up that
$item = #hash{ #keys };
is no more valuable than:
$item = $hash{ $keys[-1] };
But it makes writing something like this:
$item = $hash{ source1(), source2(), #array3, $banana, ( map { "$_" } source4()};
slightly easier than writing:
$item = $hash{ [source1(), source2(), #array3, $banana, ( map { "$_" } source4()]->[-1] }
But only slightly.
Arrays are interpolated within double quotes, so you see the actual contents of the array printed.
On the other hand, %dict{1} works, but is not interpolated within double quotes. So, something like my %partial_dict = %dict{1,3} is valid and does what you expect i.e. %partial_dict will now have the value (1,2,3,4). But "%dict{1,3}" (in quotes) will still be printed as %dict{1,3}.
Perl Cookbook has some tips on printing hashes.

Multidimension array in perl

I am working on a short script in which two to three variables are linked with each other.
Example:
my #batch;
my #case;
my #type = {
back => "sticker",
front => "no sticker",
};
for (my $i=0; $i<$#batch; $i++{
for (my $j=0; $j<$#batch; $j++{
if ($batch[$i]=="health" && $case[$i]$j]=="pain"){
$type[$i][$j]->back = "checked";
}
}
}
In this short code I want to use #type as $type[$i][$j]->back & $type[$i][$j]->front, but I am getting error that array referenced not defined . Can anyone help me how to fix this ?
Perl two-dimensional arrays are just arrays of arrays: each element of the top level array contains a (reference to) another array. The best reference for this is perldoc perlreftut
From what I can understand, you want an array of arrays of hashes. $type[$i][$j]->back and $type[$i][$j]->front are method calls in Perl, and what you want is $type[$i][$j]{back} and $type[$i][$j]{front}.
use strict;
use warnings;
my #batch;
my #case;
# Populate #batch and #case
my #type;
for my $i (0 .. $#batch) {
for my $j (0 .. $#{ $batch[$i] } ) {
if ($batch[$i] eq 'health' and $case[$i][$j] eq 'pain') {
$type[$i][$j]{back} = 'checked';
}
}
}
But I am very worried about your design. #type will be full of undefined elements, with only occasional ones set to checked. A proper fix depends entirely on what you need to do with #type once you have built it.
I hope this helps
Perl doesn't have multiple dimension variables. To emulate multidimential arrays, you can use what are called references. A reference is a way of referring to a memory location of another Perl structure such as an array or hash.
References allows you to build up more complex structures. For example, you could have an array and instead of each element in the array having a distinct value, it could point to another array. Using this, I can treat my array of arrays as a two dimensional array. But it's not a two dimensional array.
In a two dimensional array, each column ($j) has the same length. That's guaranteed. In Perl, what you have is each row ($i), pointing to a different array of columns ($j), and each of those column arrays could have a different number of elements (or even none at all! That inner array $j may not even be defined!).
There for, I have to check each column and see exactly how many values it might have:
for my $i ( 0..$#array ) {
if ( ref $array[i] ne "ARRAY" ) {
die qq(There is no sub array! for \$array[$i]!\n);
}
my #temp_j_array = #{ $array[$i] } { # This is how you dereference a reference
for my $j ( 0..$#temp_j_array ) {
# Here be dragons...
}
}
Note that I have to see exactly how many columns are in my inner ($j) array before I can go through it.
By the way, notice how I use .. to index my arrays. It's a lot cleaner than using that three part for loop which is very error prone. For example, should you check $i < $#array or $i <= $#array`? See the difference?
Since you're already dealing with a very complex structure (an array of arrays), I'm going to make it even more complex: (An array of arrays of hashes). This added complexity allows me to get rid of three separate variables. Instead of trying to keep #batch #case and #type in sync with each other, I can make these keys to my inner most hash:
my #structure = ... # Some sort of structure...
for my $i ( 0..$#structure ) {
my #temp = #{ $structure[$i] }; # This is a reference to an array. Dereference it.
for my $j ( 0..$#temp ) {
if ( $structure[$i]->[$j]->{batch} eq "health"
and $structure[$i]->[$j]->{case} eq "pain" ) {
$structure[$i]->[$j]->{back} = "checked";
}
}
}
This is a very common way to use Perl references to build more complex data structures:
my %employees; # Keyed by employee number:
$employees{1001}->{NAME} = "Bob";
$employees{1001}->{JOB} = "Yes man";
$employees{1002}->{NAME} = "Susan";
$employees{1002}->{JOB} = "sycophant";
You had some syntax errors, and were using the wrong boolean operator (==) instead of (ne).

what is meaning of $# in perl?

Consider the following code:
my #candidates = get_candidates($marker);
CANDIDATE:
for my $i (0..$#candidates) {
next CANDIDATE if open_region($i);
$candidates[$i] = $incumbent{ $candidates[$i]{region} };
}
What is meaning $# in line 3?
It is a value of last index on array (in your case it is last index on candidates).
Since candidates is an array, $#candidates is the largest index (number of elements - 1)
For example:
my #x = (4,5,6);
print $#x;
will print 2 since that is the largest index.
Note that if the array is empty, $#candidates will be -1
EDIT: from perldoc perlvar:
$# is also used as sigil, which, when prepended on the name of
an array, gives the index of the last element in that array.
my #array = ("a", "b", "c");
my $last_index = $#array; # $last_index is 2
for my $i (0 .. $#array) {
print "The value of index $i is $array[$i]\n";
}
This means array_size - 1. It is the same as (scalar #array) - 1.
In perl ,we have several ways to get an array size ,such as print #arr,print scalar (#arr) ,print $#arr+1 and so on.No reason ,just use it.You will get familiar with some default usage in perl during your further contact with perl .Unlike C++/java ,perl use a lot of
special expression to simplify our coding , but sometimes it always make us more confused.

How to test if a value exist in a hash?

Let's say I have this
#!/usr/bin/perl
%x = ('a' => 1, 'b' => 2, 'c' => 3);
and I would like to know if the value 2 is a hash value in %x.
How is that done?
Fundamentally, a hash is a data structure optimized for solving the converse question, knowing whether the key 2 is present. But it's hard to judge without knowing, so let's assume that won't change.
Possibilities presented here will depend on:
how often you need to do it
how dynamic the hash is
One-time op
grep $_==2, values %x (also spelled grep {$_==1} values %x) will return a list of as many 2s as are present in the hash, or, in scalar context, the number of matches. Evaluated as a boolean in a condition, it yields just what you want.
grep works on versions of Perl as old as I can remember.
use List::Util qw(first); first {$_==2} values %x returns only the first match, undef if none. That makes it faster, as it will short-circuit (stop examining elements) as soon as it succeeds. This isn't a problem for 2, but take care that the returned element doesn't necessarily evaluate to boolean true. Use defined in those cases.
List::Util is a part of the Perl core since 5.8.
use List::MoreUtils qw(any); any {$_==2} values %x returns exactly the information you requested as a boolean, and exhibits the short-circuiting behavior.
List::MoreUtils is available from CPAN.
2 ~~ [values %x] returns exactly the information you requested as a boolean, and exhibits the short-circuiting behavior.
Smart matching is available in Perl since 5.10.
Repeated op, static hash
Construct a hash that maps values to keys, and use that one as a natural hash to test key existence.
my %r = reverse %x;
if ( exists $r{2} ) { ... }
Repeated op, dynamic hash
Use a reverse lookup as above. You'll need to keep it up to date, which is left as an exercise to the reader/editor. (hint: value collisions are tricky)
Shorter answer using smart match (Perl version 5.10 or later):
print 2 ~~ [values %x];
my %reverse = reverse %x;
if( defined( $reverse{2} ) ) {
print "2 is a value in the hash!\n";
}
If you want to find out the keys for which the value is 2:
foreach my $key ( keys %x ) {
print "2 is the value for $key\n" if $x{$key} == 2;
}
Everyone's answer so far was not performance-driven. While the smart-match (~~) solution short circuits (e.g. stops searching when something is found), the grep ones do not.
Therefore, here's a solution which may have better performance for Perl before 5.10 that doesn't have smart match operator:
use List::MoreUtils qw(any);
if (any { $_ == 2 } values %x) {
print "Found!\n";
}
Please note that this is just a specific example of searching in a list (values %x) in this case and as such, if you care about performance, the standard performance analysis of searching in a list apply as discussed in detail in this answer
grep and values
my %x = ('a' => 1, 'b' => 2, 'c' => 3);
if (grep { $_ == 2 } values %x ) {
print "2 is in hash\n";
}
else {
print "2 is not in hash\n";
}
See also: perldoc -q hash
Where $count would be the result:
my $count = grep { $_ == 2 } values %x;
This will not only show you if it's a value in the hash, but how many times it occurs as a value. Alternatively you can do it like this as well:
my $count = grep {/2/} values %x;

How fast is Perl's smartmatch operator when searching for a scalar in an array?

I want to repeatedly search for values in an array that does not change.
So far, I have been doing it this way: I put the values in a hash (so I have an array and a hash with essentially the same contents) and I search the hash using exists.
I don't like having two different variables (the array and the hash) that both store the same thing; however, the hash is much faster for searching.
I found out that there is a ~~ (smartmatch) operator in Perl 5.10. How efficient is it when searching for a scalar in an array?
If you want to search for a single scalar in an array, you can use List::Util's first subroutine. It stops as soon as it knows the answer. I don't expect this to be faster than a hash lookup if you already have the hash, but when you consider creating the hash and having it in memory, it might be more convenient for you to just search the array you already have.
As for the smarts of the smart-match operator, if you want to see how smart it is, test it. :)
There are at least three cases you want to examine. The worst case is that every element you want to find is at the end. The best case is that every element you want to find is at the beginning. The likely case is that the elements you want to find average out to being in the middle.
Now, before I start this benchmark, I expect that if the smart match can short circuit (and it can; its documented in perlsyn), that the best case times will stay the same despite the array size, while the other ones get increasingly worse. If it can't short circuit and has to scan the entire array every time, there should be no difference in the times because every case involves the same amount of work.
Here's a benchmark:
#!perl
use 5.12.2;
use strict;
use warnings;
use Benchmark qw(cmpthese);
my #hits = qw(A B C);
my #base = qw(one two three four five six) x ( $ARGV[0] || 1 );
my #at_end = ( #base, #hits );
my #at_beginning = ( #hits, #base );
my #in_middle = #base;
splice #in_middle, int( #in_middle / 2 ), 0, #hits;
my #random = #base;
foreach my $item ( #hits ) {
my $index = int rand #random;
splice #random, $index, 0, $item;
}
sub count {
my( $hits, $candidates ) = #_;
my $count;
foreach ( #$hits ) { when( $candidates ) { $count++ } }
$count;
}
cmpthese(-5, {
hits_beginning => sub { my $count = count( \#hits, \#at_beginning ) },
hits_end => sub { my $count = count( \#hits, \#at_end ) },
hits_middle => sub { my $count = count( \#hits, \#in_middle ) },
hits_random => sub { my $count = count( \#hits, \#random ) },
control => sub { my $count = count( [], [] ) },
}
);
Here's how the various parts did. Note that this is a logarithmic plot on both axes, so the slopes of the plunging lines aren't as close as they look:
So, it looks like the smart match operator is a bit smart, but that doesn't really help you because you still might have to scan the entire array. You probably don't know ahead of time where you'll find your elements. I expect a hash will perform the same as the best case smart match, even if you have to give up some memory for it.
Okay, so the smart match being smart times two is great, but the real question is "Should I use it?". The alternative is a hash lookup, and it's been bugging me that I haven't considered that case.
As with any benchmark, I start off thinking about what the results might be before I actually test them. I expect that if I already have the hash, looking up a value is going to be lightning fast. That case isn't a problem. I'm more interested in the case where I don't have the hash yet. How quickly can I make the hash and lookup a key? I expect that to perform not so well, but is it still better than the worst case smart match?
Before you see the benchmark, though, remember that there's almost never enough information about which technique you should use just by looking at the numbers. The context of the problem selects the best technique, not the fastest, contextless micro-benchmark. Consider a couple of cases that would select different techniques:
You have one array you will search repeatedly
You always get a new array that you only need to search once
You get very large arrays but have limited memory
Now, keeping those in mind, I add to my previous program:
my %old_hash = map {$_,1} #in_middle;
cmpthese(-5, {
...,
new_hash => sub {
my %h = map {$_,1} #in_middle;
my $count = 0;
foreach ( #hits ) { $count++ if exists $h{$_} }
$count;
},
old_hash => sub {
my $count = 0;
foreach ( #hits ) { $count++ if exists $old_hash{$_} }
$count;
},
control_hash => sub {
my $count = 0;
foreach ( #hits ) { $count++ }
$count;
},
}
);
Here's the plot. The colors are a bit difficult to distinguish. The lowest line there is the case where you have to create the hash any time you want to search it. That's pretty poor. The highest two (green) lines are the control for the hash (no hash actually there) and the existing hash lookup. This is a log/log plot; those two cases are faster than even the smart match control (which just calls a subroutine).
There are a few other things to note. The lines for the "random" case are a bit different. That's understandable because each benchmark (so, once per array scale run) randomly places the hit elements in the candidate array. Some runs put them a bit earlier and some a bit later, but since I only make the #random array once per run of the entire program, they move around a bit. That means that the bumps in the line aren't significant. If I tried all positions and averaged, I expect that "random" line to be the same as the "middle" line.
Now, looking at these results, I'd say that a smart-match is much faster in its worst case than the hash lookup is in its worst case. That makes sense. To create a hash, I have to visit every element of the array and also make the hash, which is a lot of copying. There's no copying with the smart match.
Here's a further case I won't examine though. When does the hash become better than the smart match? That is, when does the overhead of creating the hash spread out enough over repeated searches that the hash is the better choice?
Fast for small numbers of potential matches, but not faster than the hash. Hashes are really the right tool for testing set membership. Since hash access is O(log n) and smartmatch on an array is still O(n) linear scan (albeit short-circuiting, unlike grep), with larger numbers of values in the allowed matches, smartmatch gets relatively worse.
Benchmark code (matching against 3 values):
#!perl
use 5.12.0;
use Benchmark qw(cmpthese);
my #hits = qw(one two three);
my #candidates = qw(one two three four five six); # 50% hit rate
my %hash;
#hash{#hits} = ();
sub count_hits_hash {
my $count = 0;
for (#_) {
$count++ if exists $hash{$_};
}
$count;
}
sub count_hits_smartmatch {
my $count = 0;
for (#_) {
$count++ when #hits;
}
$count;
}
say count_hits_hash(#candidates);
say count_hits_smartmatch(#candidates);
cmpthese(-5, {
hash => sub { count_hits_hash((#candidates) x 1000) },
smartmatch => sub { count_hits_smartmatch((#candidates) x 1000) },
}
);
Benchmark results:
Rate smartmatch hash
smartmatch 404/s -- -65%
hash 1144/s 183% --
The "smart" in "smart match" isn't about the searching. It's about doing the right thing at the right time based on context.
The question of whether it's faster to loop through an array or index into a hash is something you'd have to benchmark, but in general, it'd have to be a pretty small array to be quicker to skim through than indexing into a hash.