Perl equivalent of Java Iterator - perl

In Java, if a class that represents a collection implements the Iterable interface, you can iterate over the collection with the for-each loop without knowing anything about its internals.
Can we do something similar in Perl? Suppose my Perl class is a collection. What is the best way of iterating over it?

If you use a Perl hash in place of a Collection, then you can use each to iterate over it.
Every call to each %hash will return the next key/value pair, or an empty list at the end.
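For example (the hash contents here are arbitrary):
my %h = (apples => 3, pears => 1);
while (my ($fruit, $count) = each %h) {
    print "$fruit: $count\n";
}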

Perl doesn't have Java-style iterators, because it doesn't need them. Perl has rather robust hashes and arrays built in, which supplant most custom collections you'll ever need, and you can already loop over them easily. And if you want to do something with each element (particularly if you want to build a list of the results), there are functions like map that you can pass a code ref to.
If you really want this Java'ism, though, you could build your own iterator class. It doesn't have to implement or extend any particular interface, because Perl duck-types -- and normal people don't use iterators anyway, so there's no standard interface in the first place. And it doesn't have anything to do with foreach. :P
Really, it's better to just use the standard collection types. I've never run into a case where I needed a custom collection.
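As a small illustration of the map approach mentioned above (the data is arbitrary):
my @words   = qw(apple banana cherry);
my @lengths = map { length $_ } @words; # (5, 6, 6)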

You can create arbitrarily complex iterators in Perl by using closures and functional programming techniques, for when you need something above and beyond Perl's built-in ability to iterate over arrays and hashes. Here is a simplistic example to illustrate the technique - in this instance it's pretty useless, since an iterator is unnecessary for this particular case, but the same technique can be applied usefully in other scenarios. This example walks over an array backwards, skipping elements that we want excluded (in this case, the letter 'c').
sub gen_iter {
    my @collection = @_;
    return sub {
        while (my $item = pop @collection) {
            next if $item eq 'c';
            return $item;
        }
    };
}
my $it = gen_iter(qw/a b c d e f g/);

# prints 'gfedba'
while (my $item = $it->()) {
    print $item;
}
I recommend grabbing the free ebook 'Higher-Order Perl' and digesting the chapter on iterators. It has some great, useful examples.

Perl's two built-in collections — arrays and hashes — can be iterated in several ways:
foreach loops: This loops through each item in the given list (the list is evaluated up front, not lazily). Using this on a hash directly is probably wrong; iterate over its keys or values instead:
foreach my $key (keys %hash) {
    do something;
}
The each iterator: You can iterate through "collections" with the each function. Each time it is called, it returns a two-element list of a key (or index) and a value. When no further items remain, it returns an empty list:
while (my ($key, $value) = each %hash) {
    do something;
}
You can only have one such iterator per array/hash. Using each on arrays is a relatively new feature (Perl 5.12 or later).
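For example, on an array each returns index/value pairs:
my @array = qw(a b c);
while (my ($i, $val) = each @array) {
    print "$i: $val\n"; # 0: a, 1: b, 2: c
}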
The each idiom can be extended via closures to create custom iterators:
my %hash = ...;
my @keys = keys %hash; # cache the current keys
my $i = 0;
my $iter = sub {
    return unless $i < @keys;
    my $key = $keys[$i++];
    return ($key, $hash{$key});
};
...;
while (my ($key, $value) = $iter->()) {
    do something;
}

Related

Perl: safely make hash from list, checking for duplicates

In Perl if you have a list with an even number of elements you can straightforwardly convert it to a hash:
my @a = qw(each peach pear plum);
my %h = @a;
However, if there are duplicate keys then they will be silently accepted, with the last occurrence being the one used. I would like to make a hash checking that there are no duplicates:
my @a = qw(a x a y);
my %h = safe_hash_from_list(@a); # prints error: duplicate key 'a'
Clearly I could write that routine myself:
sub safe_hash_from_list {
    die 'even sized list needed' if @_ % 2;
    my %r;
    while (@_) {
        my $k = shift;
        my $v = shift;
        die "duplicate key '$k'" if exists $r{$k};
        $r{$k} = $v;
    }
    return %r;
}
This, however, is quite a bit slower than the simple assignment. Moreover I do not want to use my own private routine if there is a CPAN module that already does the same job.
Is there a suitable routine on CPAN for safely turning lists into hashes? Ideally one that is a bit faster than the pure-Perl implementation above (though probably never quite as fast as the simple assignment).
If I may be allowed a related follow-up question, I'm also wondering about a tied hash class which allows each key to be assigned only once and dies on reassignment. That would be a more general case of the above problem. Again, I can write such a tied hash myself but I do not want to reinvent the wheel and I would prefer an optimized implementation if one already exists.
A quick way to check that no keys were duplicated is to count the keys and make sure the count equals half the number of items in the list:
my @a = ...;
my %h = @a;
if (keys %h == (@a / 2)) {
    print "Success!";
}
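As for the follow-up about a hash that dies on reassignment, here is a minimal sketch built on the standard Tie::StdHash base class from Tie::Hash (the package name My::WriteOnceHash is made up for illustration, and this says nothing about speed):
package My::WriteOnceHash;
use strict;
use warnings;
use Tie::Hash;                  # provides the Tie::StdHash base class
our @ISA = ('Tie::StdHash');

# Refuse to overwrite an existing key.
sub STORE {
    my ($self, $key, $value) = @_;
    die "duplicate key '$key'" if exists $self->{$key};
    $self->{$key} = $value;
}

package main;
tie my %h, 'My::WriteOnceHash';
%h = qw(each peach pear plum);  # ok
$h{each} = 'other';             # dies: duplicate key 'each'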

Efficiently get hash entry only if it exists in Perl

I am quite often writing fragments of code like this:
if (exists $myHash->{$key}) {
    $value = $myHash->{$key};
}
What I am trying to do is get the value from the hash if the hash has that key in it, and at the same time I want to avoid autovivifying the hash entry if it did not already exist.
However it strikes me that this is quite inefficient: I am doing a hash lookup to find out if a key exists, and then if it did exist I am doing another hash lookup of the same key to extract it.
It gets even more inefficient in a multilevel structure:
if (exists $myHash->{$key1}
    && exists $myHash->{$key1}{$key2}
    && exists $myHash->{$key1}{$key2}{$key3}) {
    $value = $myHash->{$key1}{$key2}{$key3};
}
Here I am presumably doing 9 hash lookups instead of 3!
Is perl smart enough to optimize this kind of case? Or is there some other idiom to get the value of a hash without either autovivifying the entry or doing two successive lookups?
I am aware of the autovivification module, but if possible I am looking for a solution that does not require an XS module to be installed. Also I have not had a chance to try this module out and I am not completely sure what happens in the case of a multilevel hash - the pod says that this:
$h->{$key}
would return undef if the key did not exist - does that mean that this:
$h->{$key1}{$key2}
would die if $key1 did not exist, on the grounds that I am trying to de-reference undef? If so, to avoid that presumably you would still need to do multi-level tests for existence.
I wouldn't worry about optimization, since hash lookups are fast. But for your first case, you can do the following (note that this tests the truth of the value rather than its existence, so it behaves differently when the stored value is false, such as 0 or the empty string):
if (my $v = $hash{$key}) {
    print "have $key => $v\n";
}
Similarly:
if ( ($v = $hash{key1}) && ($v = $v->{key2}) ) {
    print "Got $v\n";
}
Autovivification doesn't happen for single-level access so you can safely write
my $value = $hash{$key};
For multi-level access intermediate entries will be autovivified. e.g.
my $value = $hash{a}{b};
will create a reference to an empty hash if $hash{a} doesn't already exist. (If it does exist and isn't a hash reference, perl will throw an error and die.) To avoid that, you need to check each level first. You can write a subroutine to check existence of arbitrarily nested keys.
sub safe_exists {
    my $x = shift;
    foreach my $k (@_) {
        no warnings 'uninitialized';
        return unless ref $x eq ref {};
        return unless exists $x->{$k};
        $x = $x->{$k};
    }
    return 1;
}
if (safe_exists(\%hash, qw(a b))) {...}
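A companion helper in the same style for actually fetching the value might look like this sketch (safe_fetch is a made-up name, not a built-in or module function; it returns undef both for a missing path and for a stored undef):
sub safe_fetch {
    my $x = shift;
    foreach my $k (@_) {
        return undef unless ref $x eq ref {} && exists $x->{$k};
        $x = $x->{$k};
    }
    return $x;
}

my $value = safe_fetch(\%hash, qw(a b)); # nested value, or undef if any level is missing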
Depending on your algorithm (and why you're trying to avoid autovivification) locking your hash can be a useful alternative to no autovivification or multi-layer exists tests.
use feature 'say';
use Hash::Util;

my %hash = (a => { b => 1 });
Hash::Util::lock_hash_recurse(%hash);

say $hash{a}{b}; # 1
say $hash{a}{c}; # error!
I mostly use this as a way to detect programming errors when working with complex data structures. It's useful for detecting mis-typed key names or inadvertent modification of values.

Adding multiple values to key in perl hash

I need to create multi-dimensional hash.
for example I have done:
$hash{gene} = $mrna;
if (defined $exon) {
    $hash{gene}{$mrna} = $exon;
}
if (defined $cds) {
    $hash{gene}{$mrna} = $cds;
}
where $gene, $mrna, $exon, $cds are unique ids.
But, my issue is that I want some properties of $gene and $mrna to be included in the hash.
for example:
$hash{$gene}{'start_loc'} = $start;
$hash{gene}{mrna}{'start_loc'} = $start;
etc. But, is that a feasible way of declaring a hash? If I call $hash{$gene} both $mrna and start_loc will be printed. What could be the solution?
How would I add multiple values for the same key $gene and $mrna being the keys in this case.
Any suggestions will be appreciated.
What you need to do is to read the Perl Reference Tutorial.
Simple answer to your question:
Perl hashes can only take a single value to a key. However, that single value can be a reference to a memory location of another hash.
my %hash1 = ( foo => "bar", fu => "bur" ); # First hash
my %hash2;
$hash2{some_key} = \%hash1;                # Reference to %hash1
And, there's nothing stopping that first hash from containing a reference to another hash. It's turtles all the way down!
So yes, you can have a complex and convoluted structure as you like with as many sub-hashes as you want. Or mix in some arrays too.
For various reasons, I prefer the -> syntax when using these complex structures. I find that for more complex structures, it makes the code easier to read. However, the main thing is that it reminds you these are references and not actual multidimensional structures.
For example:
$hash{gene}->{mrna}->{start_loc} = $start; # Quotes not needed around a key that qualifies as a valid identifier.
The best thing to do is to think of your hash as a structure. For example:
my $person = {};                            # Person is a hash reference.
$person->{NAME}->{FIRST} = "Bob";
$person->{NAME}->{LAST}  = "Rogers";
$person->{PHONE}->{WORK}->[0] = "555-1234"; # An array ref. Might have > 1
$person->{PHONE}->{WORK}->[1] = "555-4444";
$person->{PHONE}->{CELL}->[0] = "555-4321";
...
my @people;
push @people, $person;
Now, I can load up my @people array with all my people, or maybe use a hash:
my %people_by_ssn;
$people_by_ssn{$bobs_ssn} = $person; # Now, all of Bob's info is indexed by his SSN.
So, the first thing you need to do is think about what your structure should look like. What are the fields? What are the sub-fields? Once you have that, set up your hash of hashes to match, and decide exactly how it will be stored and keyed.
Remember, this hash contains references to your genes (or whatever), so you want to choose your keys wisely.
Read the tutorial. Then, try your hand at it. It's not all that complicated to understand. However, it can be a bear to maintain.
When you say use strict;, you give yourself some protection:
my $foo = "bar";
say $Foo; #This won't work!
This won't work because you didn't declare $Foo, you declared $foo. The use strict; pragma can catch variable names that are mistyped, but:
my %var;
$var{foo} = "bar";
say $var{Foo}; #Whoops!
This will not be caught (except perhaps by a warning that $var{Foo} is uninitialized). The use strict; pragma can't detect typos in your hash keys.
The next step, after you've grown comfortable with references is to move onto object oriented Perl. There's a Tutorial for that too.
All Object Oriented Perl does is to take your hash references, and turns them into objects. Then, it creates subroutines that will help you keep track of manipulating objects. For example:
sub last_name {
    my $person    = shift; # Don't worry about this for now...
    my $last_name = shift;
    if ( defined $last_name ) {
        $person->{NAME}->{LAST} = $last_name;
    }
    return $person->{NAME}->{LAST};
}
When I set my last name using this subroutine ...I mean method, I guarantee that the key will be $person->{NAME}->{LAST} and not $person->{LAST}->{NAME} or $person->{LAST}->{NMAE}. or $person->{last}->{name}.
The main problem isn't learning the mechanisms, but learning to apply them. So, think about exactly how you want to represent your items. Think about what fields you want, and how you're going to pull up that information.
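Applied to the data in the question, one possible layout might look like this sketch; the field names and nesting are illustrative choices (and $mrna_start is a made-up variable for the mRNA's own start position), not the only way to do it:
my %genes;

# Per-gene properties sit next to a nested hash of mRNAs.
$genes{$gene}{start_loc}              = $start;
$genes{$gene}{mrna}{$mrna}{start_loc} = $mrna_start;
$genes{$gene}{mrna}{$mrna}{exon}      = $exon if defined $exon;
$genes{$gene}{mrna}{$mrna}{cds}       = $cds  if defined $cds;

# Pulling information back out:
print "$gene starts at $genes{$gene}{start_loc}\n";
for my $m (keys %{ $genes{$gene}{mrna} }) {
    print "  $m starts at $genes{$gene}{mrna}{$m}{start_loc}\n";
}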
You could try pushing each value onto a hash of arrays:
my (@gene, @mrna, @exon, @cds);
my %hash;
push @{ $hash{$gene[$_]} }, [ $mrna[$_], $exon[$_], $cds[$_] ] for 0 .. $#gene;
This way gene is the key, with multiple values ($mrna, $exon, $cds) associated with it.
Iterate over keys/values as follows:
for my $key (sort keys %hash) {
    print "Gene: $key\t";
    for my $value (@{ $hash{$key} }) {
        my ($mrna, $exon, $cds) = @$value; # De-references the array
        print "Values: [$mrna], [$exon], [$cds]\n";
    }
}
The answer to a question I've asked previously might be of help (Can a hash key have multiple 'subvalues' in perl?).

How can I prevent perl from reading past the end of a tied array that shrinks when accessed?

Is there any way to force Perl to call FETCHSIZE on a tied array before each call to FETCH? My tied array knows its maximum size, but could shrink from this size depending on the results of earlier FETCH calls. Here is a contrived example that filters a list to only the even elements with lazy evaluation:
use warnings;
use strict;

package VarSize;

sub TIEARRAY { bless $_[1] => $_[0] }

sub FETCH {
    my ($self, $index) = @_;
    splice @$self, $index, 1 while $$self[$index] % 2;
    $$self[$index]
}

sub FETCHSIZE { scalar @{$_[0]} }
my @source = 1 .. 10;
tie my @output => 'VarSize', [@source];

print "@output\n"; # array changes size as it is read; perl only checks the size
                   # at the start, so it runs off the end with warnings
print "@output\n"; # knows the correct size from the start, no warnings
For brevity I have omitted a bunch of error-checking code (such as how to deal with accesses starting from an index other than 0).
EDIT: rather than the above two print statements, if ONE of the following two lines is used, the first will work fine, the second will throw warnings.
print "$_ " for #output; # for loop "iterator context" is fine,
# checks FETCHSIZE before each FETCH, ends properly
print join " " => #output; # however a list context expansion
# calls FETCHSIZE at the start, and runs off the end
Update:
The actual module that implements a variable sized tied array is called List::Gen which is up on CPAN. The function is filter which behaves like grep, but works with List::Gen's lazy generators. Does anyone have any ideas that could make the implementation of filter better?
(the test function is similar, but returns undef in failed slots, keeping the array size constant, but that of course has different usage semantics than grep)
sub FETCH {
    my ($self, $index) = @_;
    my $size = $self->FETCHSIZE;
    ...
}
Ta da!
I suspect what you're missing is they're just methods. Methods called by tie magic, but still just methods you can call yourself.
Listing out the contents of a tied array basically boils down to this:
my @array;
my $tied_obj = tied @array;
for my $idx (0 .. $tied_obj->FETCHSIZE - 1) {
    push @array, $tied_obj->FETCH($idx);
}
return @array;
So you don't get any opportunity to control the number of iterations. Nor can FETCH reliably tell if it's being called from @array or $array[$idx] or @array[@idxs]. This sucks. Ties kinda suck, and they're really slow: about 3 times slower than a normal method call and 10 times slower than a regular array.
Your example already breaks expectations about arrays (10 elements go in, 5 elements come out). What happens when a user asks for $array[3]? Do they get undef? One alternative is to just use the object API: if your thing doesn't behave exactly like an array, pretending it does will only add confusion. Or you can use an object with array dereference overloaded.
So, what you're doing can be done, but it's difficult to get it to work well. What are you really trying to accomplish?
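As a rough sketch of the "object with array dereference overloaded" idea (the EvenFilter package and its behavior are invented here purely for illustration):
package EvenFilter;
use strict;
use warnings;
use overload '@{}' => sub {
    my $self = shift;
    # Build the filtered view on demand, every time the object is
    # dereferenced as an array.
    return [ grep { $_ % 2 == 0 } @{ $self->{source} } ];
};

sub new {
    my ($class, @source) = @_;
    return bless { source => \@source }, $class;
}

package main;
my $obj = EvenFilter->new(1 .. 10);
print "@$obj\n";       # 2 4 6 8 10
print $obj->[0], "\n"; # 2 (each dereference rebuilds the filtered list)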
I think the order in which perl calls the FETCH/FETCHSIZE methods can't be changed; it's internal to perl.
Why not just explicitly remove warnings:
sub FETCH {
    my ($self, $index) = @_;
    splice @$self, $index, 1 while ($$self[$index] || 0) % 2;
    exists $$self[$index] ? $$self[$index] : ''; ## replace '' with a default value
}

Is returning a whole array from a Perl subroutine inefficient?

I often have a subroutine in Perl that fills an array with some information. Since I'm also used to hacking in C++, I often find myself doing it like this in Perl, using references:
my @array;
getInfo(\@array);

sub getInfo {
    my ($arrayRef) = @_;
    push @$arrayRef, "obama";
    # ...
}
instead of the more straightforward version:
my @array = getInfo();

sub getInfo {
    my @array;
    push @array, "obama";
    # ...
    return @array;
}
The reason, of course, is that I don't want the array to be created locally in the subroutine and then copied on return.
Is that right? Or does Perl optimize that away anyway?
What about returning an array reference in the first place?
sub getInfo {
    my $array_ref = [];
    push @$array_ref, 'foo';
    # ...
    return $array_ref;
}

my $a_ref = getInfo();
# or if you want the array expanded
my @array = @{ getInfo() };
Edit according to dehmann's comment:
It's also possible to use a normal array in the function and return a reference to it.
sub getInfo {
    my @array;
    push @array, 'foo';
    # ...
    return \@array;
}
Passing references is more efficient, but the difference is not as big as in C++. The argument values themselves (that means: the values in the array) are always passed by reference anyway (returned values are copied though).
Question is: does it matter? Most of the time, it doesn't. If you're returning 5 elements, don't bother about it. If you're returning/passing 100'000 elements, use references. Only optimize it if it's a bottleneck.
Looking at your example and thinking about what you want to do, I would usually write it like this:
sub getInfo {
    my @array;
    push @array, 'obama';
    # ...
    return \@array;
}
This seems like the straightforward version for returning a large amount of data. There is no need to allocate the array outside the sub, as in your first code snippet, because my does it for you. In any case, don't optimize prematurely, as Leon Timmermans suggests.
To answer the final rumination, no, Perl does not optimize this away. It can't, really, because returning an array and returning a scalar are fundamentally different.
If you're dealing with large amounts of data or if performance is a major concern, then your C habits will serve you well - pass and return references to data structures rather than the structures themselves so that they won't need to be copied. But, as Leon Timmermans pointed out, the vast majority of the time, you're dealing with smaller amounts of data and performance isn't that big a deal, so do it in whatever way seems most readable.
This is the way I would normally return an array.
sub getInfo {
    my @array;
    push @array, 'foo';
    # ...
    return @array if wantarray;
    return \@array;
}
This way it will work the way you want, in scalar, or list contexts.
my $array = getInfo;
my @array = getInfo;

$array->[0] == $array[0];

# same length
@$array == @array;
I wouldn't try to optimize it unless you know it is a slow part of your code. Even then I would use benchmarks to see which subroutine is actually faster.
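A benchmark along those lines might look like this minimal sketch, using the core Benchmark module (the element count is an arbitrary choice):
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = (1 .. 100_000);

sub by_copy { my @copy = @data; return @copy  }
sub by_ref  { my @copy = @data; return \@copy }

# Run each variant for at least 2 CPU seconds and compare rates.
cmpthese(-2, {
    copy => sub { my @r = by_copy() },
    ref  => sub { my $r = by_ref()  },
});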
There's two considerations. The obvious one is how big is your array going to get? If it's less than a few dozen elements, then size is not a factor (unless you're micro-optimizing for some rapidly called function, but you'd have to do some memory profiling to prove that first).
That's the easy part. The oft overlooked second consideration is the interface. How is the returned array going to be used? This is important because whole array dereferencing is kinda awful in Perl. For example:
for my $info (@{ getInfo($some, $args) }) {
    ...
}
That's ugly. This is much better.
for my $info ( getInfo($some, $args) ) {
    ...
}
It also lends itself to mapping and grepping.
my @info = grep { ... } getInfo($some, $args);
But returning an array ref can be handy if you're going to pick out individual elements:
my $address = getInfo($some, $args)->[2];
That's simpler than:
my $address = (getInfo($some, $args))[2];
Or:
my @info = getInfo($some, $args);
my $address = $info[2];
But at that point, you should question whether @info is truly a list or a hash.
my $address = getInfo($some, $args)->{address};
What you should not do is have getInfo() return an array ref in scalar context and an array in list context. This muddles the traditional use of scalar context as array length which will surprise the user.
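To sketch the surprise being described, suppose getInfo() were such a context-sensitive version:
my $n = getInfo($some, $args); # reads like "give me the element count",
                               # but actually yields an array reference
print "$n\n";                  # prints something like ARRAY(0x5576...)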
Finally, I will plug my own module, Method::Signatures, because it offers a compromise for passing in array references without having to use the array ref syntax.
use Method::Signatures;

method foo(\@args) {
    print "@args";  # @args is not a copy
    push @args, 42; # this alters the caller's array
}

my @nums = (1, 2, 3);
Class->foo(\@nums); # prints 1 2 3
print "@nums";      # prints 1 2 3 42
This is done through the magic of Data::Alias.
3 other potentially LARGE performance improvements if you are reading an entire, largish file and slicing it into an array:
1. Turn off buffering by using sysread() instead of read() (the manual warns about mixing the two).
2. Pre-extend the array by valuing its last element - this saves memory allocations.
3. Use unpack() to swiftly split data such as uint16_t graphics channel data.
Passing an array ref to the function allows the main program to deal with a simple array while the write-once-and-forget worker function uses the more complicated "$#" and arrow ->[$II] access forms. Being quite C'ish, it is likely to be fast!
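As a small sketch of the pre-extend idea (the element count and the values stored are arbitrary):
my $n = 100_000;
my @samples;
$#samples = $n - 1;                    # pre-extend: allocate all slots up front
$samples[$_] = $_ * 2 for 0 .. $n - 1; # later stores reuse that allocation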
I know nothing about Perl so this is a language-neutral answer.
It is, in a sense, inefficient to copy an array from a subroutine into the calling program. The inefficiency arises in the extra memory used and the time taken to copy the data from one place to another. On the other hand, for all but the largest arrays, you might not give a damn, and might prefer to copy arrays out for elegance, cussedness or any other reason.
The efficient solution is for the subroutine to pass the calling program the address of the array. As I say, I haven't a clue about Perl's default behaviour in this respect. But some languages provide the programmer the option to choose which approach.