Grouping with Perl: finding a faster solution to recursion - perl

The Perl code below works, but it doesn't scale well even with considerable computer resources. I hoping that someone can help me find more efficient code such as by replacing recursion with iteration, if that's the problem.
my data structure looks like this:
my %REV_ALIGN;
$REV_ALIGN{$dna}{$rna} = ();
Any dna key may have multiple rna sub keys. The same rna sub key may appear with multiple different dna keys. The purpose is to group rna ( transcripts ) based on shared dna sequence elements. For example, if dnaA has RNA1, RNA8, RNA9, and RNA4, and dnaB has RNA11, RNA4, and RNA99, then we group all these transcripts together ( RNA1, RNA9, RNA4, RNA11, RNA99 ) and continue to proceed to try and add to the group by selecting other dna. My recusive solution to this problem works but doesn't scale so well when using data from whole genome to transcriptome alignment.
SO MY QUESTION IS: WHAT IS A MORE EFFICIENT SOLUTION TO THIS PROBLEM? THANK YOU VERY MUCH
my #groups;
while ( my $x =()= keys %REV_ALIGN )
{
my #DNA = keys %REV_ALIGN;
my $dna = shift #DNA;
# the corresponding list of rna
my #RNA = keys %{$REV_ALIGN{$dna}};
delete $REV_ALIGN{$dna};
if ( $x == 1 )
{
push #groups, \#RNA;
last;
}
my $ref = group_transcripts ( \#RNA, \%REV_ALIGN );
push #groups, $ref;
}
sub group_transcripts
{
my $tran_ref = shift;
my $align_ref = shift;
my #RNA_A = #$tran_ref;
my %RNA;
# create a null hash with seed list of transcripts
#RNA{#RNA_A} = ();
# get a list of all remaining dna sequences in the alignment
my #DNA = keys %{$align_ref};
my %count;
# select a different list of transcripts
for my $dna ( #DNA )
{
next unless exists $align_ref->{$dna};
my #RNA_B = keys %{$align_ref->{$dna}};
# check to see two list share and transcripts
for my $element ( #RNA_A, #RNA_B )
{
$count{$element}++;
}
for my $rna_a ( keys %count )
{
# if they do, add any new transcripts to the current group
if ( $count{$rna_a} == 2 )
{
for my $rna_b ( #RNA_B )
{
push #RNA_A, $rna_b if $count{$rna_b} == 1;
}
delete $align_ref->{$dna};
delete $count{$_} foreach keys %count;
# recurse to try and continue adding to list
#_ = ( \#RNA_A, $align_ref );
goto &group_transcripts;
}
}
delete $count{$_} foreach keys %count;
}
# if no more transcripts can be added, return a reference to the group
return \#RNA_A;
}

You have a loops nested four deep. It's an pretty safe bet that's why your code scales poorly.
If I understand correctly what you are trying to accomplish, the input
my %REV_ALIGN = (
"DNA1" => { map { $_ => undef } "RNA1", "RNA2" }, # \ Linked by RNA1 \
"DNA2" => { map { $_ => undef } "RNA1", "RNA3" }, # / \ Linked by RNA3 > Group
"DNA3" => { map { $_ => undef } "RNA3", "RNA4" }, # / /
"DNA4" => { map { $_ => undef } "RNA5", "RNA6" }, # \ Linked by RNA5 \ Group
"DNA5" => { map { $_ => undef } "RNA5", "RNA7" }, # / /
"DNA6" => { map { $_ => undef } "RNA8" }, # > Group
);
should result in
my #groups = (
[
dna => [ "DNA1", "DNA2", "DNA3" ],
rna => [ "RNA1", "RNA2", "RNA3", "RNA4" ],
],
[
dna => [ "DNA4", "DNA5" ],
rna => [ "RNA5", "RNA6", "RNA7" ],
],
[
dna => [ "DNA6" ],
rna => [ "RNA8" ],
],
);
If so, you can use the following:
use strict;
use warnings;
use Graph::Undirected qw( );
my %REV_ALIGN = (
"DNA1" => { map { $_ => undef } "RNA1", "RNA2" },
"DNA2" => { map { $_ => undef } "RNA1", "RNA3" },
"DNA3" => { map { $_ => undef } "RNA3", "RNA4" },
"DNA4" => { map { $_ => undef } "RNA5", "RNA6" },
"DNA5" => { map { $_ => undef } "RNA5", "RNA7" },
"DNA6" => { map { $_ => undef } "RNA8" },
);
my $g = Graph::Undirected->new();
for my $dna (keys(%REV_ALIGN)) {
for my $rna (keys(%{ $REV_ALIGN{$dna} })) {
$g->add_edge("dna:$dna", "rna:$rna");
}
}
my #groups;
for my $raw_group ($g->connected_components()) {
my %group = ( dna => [], rna => [] );
for (#$raw_group) {
my ($type, $val) = split(/:/, $_, 2);
push #{ $group{$type} }, $val;
}
push #groups, \%group;
}
use Data::Dumper qw( Dumper );
print(Dumper(\#groups));
If you just want the RNA, the final section simplifies to the following:
my #groups;
for my $raw_group ($g->connected_components()) {
my #group;
for (#$raw_group) {
my ($type, $val) = split(/:/, $_, 2);
push #group, $val if $type eq 'rna';
}
push #groups, \#group;
}

Related

looping through hash of hashes reference perl

I have a hash of hashes that I am passing to a subroutine. In the subroutine I need to loop over the hash of hashes and access the value of the inner hash based on the outer hash's key. I am having trouble referencing and dereferencing the hash of hashes.
Here is my code.
use List::Util qw( min max );
##testingWords is array of strings
foreach(#testingWords)
{
#skip values that are '[' or ']' and move onto next value in array.
if($_ eq '[' or $_ eq ']')
{
next;
}
#if value in array matches key in %trainingHashRaw (hash of hashes)pass key to getMax.
if($trainingHashRaw{$_})
{
#key is $_, value is returned string from getMax
#%trainingHashRelative is hash of hashes
$testingHash{$_} = getMax($_, \%trainingHashRelative);
}
}
sub getMax
{
my $key = shift;
my $hash = shift;
my #max = ();
my $max = 0;
my $tag = "";
for my $i(keys $hash)
{
for my $j(keys $hash->{$i})
{
if($key eq $i)
{
push(#max, $hash->{$i}->{$j});
}
}
if(#max)
{
$max = max #max;
}
}
for my $i(keys $hash)
{
for my $j(keys $hash->{$i})
{
if($max == $hash->{$i}->{$j})
{
$tag = $j;
}
}
}
return $tag;
}
It is not very clear what your data structures should contain. So I made up an example for what I think that you have meant. The max value can occur more than once, so I keep track of all occurrences.
use strict;
use warnings;
use List::Util qw( min max );
use Data::Dumper;
# for each outer key we want to get the max of values of the associated hashref
# outer_a: 3
# outer_b: 100
my %hash_of_hashes = (
outer_a => {
inner_a_x => 1,
inner_a_y => 2,
inner_a_z => 3,
},
outer_b => {
inner_a_x => -100,
inner_a_y => 100,
},
outer_c => {
inner_a_x => 100,
inner_a_y => 1,
}
);
my ($max_value, $keys_of_max_value) = get_max( \%hash_of_hashes );
print "The max value $max_value occured in ", join( ' ,', #{$keys_of_max_value} ), ".\n";
sub get_max {
my ($hoh_ref) = #_; # reference to a hash of hashes
# %tmp keeps track of the outer_key for the max of the inner values
# %tmp = (
# 3 => ['outer_a'],
# 100 => ['outer_b', 'outer_c']
# )
my %tmp;
while ( my ($outer_key, $inner_hashref) = each %{$hoh_ref} ) {
my #inner_values = values %{$inner_hashref};
my $max_inner_values = max( #inner_values );
$tmp{$max_inner_values} ||= []; # for clarity create the arref expiclitly
push #{$tmp{$max_inner_values}}, $outer_key;
}
my $max_value = max( keys %tmp ); # 100
return ( $max_value, $tmp{$max_value} ); # 100, ['outer_a', 'outer_b']
}

Get value from hash of hashes

I would like to get value from hash of hashes but i do not. My code is :
sub test {
my $filename = $_[0];
open INFILE, ${filename} or die $!;
my %hashCount;
my #firstline = split('\t',<INFILE>);
shift(#firstline);
while (my $line = <INFILE>)
{
my %temp;
chomp($line);
my #line = split('\t', $line);
foreach my $cpt (1..$#line) {
$temp{$firstline[$cpt-1]}=$line[$cpt];
}
$hashCount{$line[0]}={%temp};
}
return %hashCount;
}
sub get_hash_of_hash {
my $h = shift;
foreach my $key (keys %$h) {
if( ref $h->{$key}) {
get_hash_of_hash( $h->{$key} );
}
else {
say $h->{$key};
}
}
}
And when i display my hash :
$VAR10679 = 'M00967_43_1106_2493_14707';
$VAR10680 = {
'A' => '1',
'B' => '0',
'C' => '1',
'D' => '0',
'E' => '0'
};
My first function return my hash of hashes and i get my specific value with the second function.
So I want to get value like that :
my %hashTest = test("FILE.txt");
get_hash_of_hash(%hashTest,"M00967_43_1106_2493_14707","A")
//return value '1'
You can either access nested elements like
$hash{keyA}{keyB}
or we can write a function that walks the data structure, like
sub walk {
my ($hashref, #keys) = #_;
my $pointer = $hashref;
for my $key (#keys) {
if (exists $pointer->{$key}) {
$pointer = $pointer->{$key};
} else {
die "No value at ", join "->", #keys;
}
}
return $pointer;
}
which can be used like
my %hash = (
'M00967_43_1106_2493_14707' => {
'A' => '1',
'B' => '0',
'C' => '1',
'D' => '0',
'E' => '0'
},
);
say walk(\%hash, 'M00967_43_1106_2493_14707', 'A');
Note: When using Data::Dumper, pass references to the Dump function:
print Dump \%hash; # not print Dump %hash
This is neccessary to show the correct data structure.
Your hash holds references to hashes.
You can access them like this:
$hashTest{'M00967_43_1106_2493_14707'}{'A'};
See perlref for more info
Use this subroutine..
sub get_hash_of_hash {
my $h = shift;
foreach my $key (keys %$h) {
if( ref $h->{$key}) {
get_hash_of_hash( $h->{$key} );
}
else {
print $h->{$key};
}
}
}

Perl WWW:Mechanize Accessing data in a hash reference

I have a question I'm hoping you could help with?
This is the last part I need help with in understanding hash references
Code:
my $content_lengths; # this is at the top
foreach my $url ( # ... more stuff
# compare
if ( $mech->response->header('Content-Length') != $content_length ) {
print "$child_url: different content length: $content_length vs "
. $mech->response->header('Content-Length') . "!\n";
# store the urls that are found to have different content
# lengths to the base url only if the same url has not already been stored
$content_lengths->{$url}->{'different'}->{$child_url} = $mech->response->header('Content-Length');
} elsif ( $mech->response->header('Content-Length') == $content_length ) {
print "Content lengths are the same\n";
# store the urls that are found to have the same content length as the base
# url only if the same url has not already been stored
$content_lengths->{$url}->{'equal'}->{$child_url} = $mech->response->header('Content-Length');
}
What it looked like using Data::Dumper
$VAR1 = {
'http://www.superuser.com/' => {
'difference' => {
'http://www.superuser.com/questions' => '10735',
'http://www.superuser.com/faq' => '13095'
},
'equal' => {
'http://www.superuser.com/ ' => '20892'
}
},
'http://www.stackoverflow.com/' => {
'difference' => {
'http://www.stackoverflow.com/faq' => '13015',
'http://www.stackoverflow.com/questions' => '10506'
},
'equal' => {
'http://www.stackoverflow.com/ ' => '33362'
}
}
};
What I need help with:
I need help understanding the various ways of accessing the different parts in the hash reference and using them to do things, such as print them.
So for example how do I print all the $url from the hash reference (i.e from Data::Dumper that will be http://www.superuser.com/ and http://www.stackoverflow.com/)
and how do I print all the $child_url or a particular one/subset from $child_url and so on?
Your help with this is much appreciated,
thanks a lot
You can navigate your hashref thusly:
$hashref->{key1}{key2}{keyN};
For example, if you want the superuser equal branch:
my $urlArrayref = $hashref->{'http://www.superuser.com/'}{'equal'};
More to the point, to print the urls (first level key) of the hashref, you would do:
foreach my $key ( keys( %{$hashref} ) ) {
print( "key is '$key'\n" );
}
Then if you wanted the second level keys:
foreach my $firstLevelKey ( keys( %{$hashref} ) ) {
print( "first level key is '$firstLevelKey'\n" );
foreach my $secondLevelKey ( keys( %{$hashref->{$firstLevelKey}} ) ) {
print( "\tfirst level key is '$secondLevelKey'\n" );
}
}
And so forth...
----- EDIT -----
This is working sample code from your example above:
#!/usr/bin/perl
use strict;
use warnings;
my $content_lengths = {
'http://www.superuser.com/' => {
'difference' => {
'http://www.superuser.com/questions' => '10735',
'http://www.superuser.com/faq' => '13095'
},
'equal' => {
'http://www.superuser.com/ ' => '20892'
}
},
'http://www.stackoverflow.com/' => {
'difference' => {
'http://www.stackoverflow.com/faq' => '13015',
'http://www.stackoverflow.com/questions' => '10506'
},
'equal' => {
'http://www.stackoverflow.com/ ' => '33362'
}
}
};
foreach my $key1 ( keys( %{$content_lengths} ) ) {
print( "$key1\n" );
foreach my $key2 ( keys( %{$content_lengths->{$key1}} ) ) {
print( "\t$key2\n" );
foreach my $key3 ( keys( %{$content_lengths->{$key1}{$key2}} ) ) {
print( "\t\t$key3\n" );
}
}
}
Which results in this output:
http://www.superuser.com/
difference
http://www.superuser.com/questions
http://www.superuser.com/faq
equal
http://www.superuser.com/
http://www.stackoverflow.com/
difference
http://www.stackoverflow.com/faq
http://www.stackoverflow.com/questions
equal
http://www.stackoverflow.com/

In Perl, how can I skip an empty key when traversing a hash?

This is my problem, I'm not very knowledgeable in Perl, and I have this function that needs to be fixed.
When this function deviceModelMenu() is called, the CLI displays the following text:
The following models are available
==================================================
1.
2. Cisco1240
3. Catalyst3750
4. Catalyst3650
5. HP2524
The first item is empty, which is wrong, and I need to fix that, the piece of code that displays this menu is:
my $features = shift;
print "=" x 50, "\n";
print "The following models are available\n";
print "=" x 50, "\n";
my $i=1;
foreach (keys %{$features->{features}[0]->{deviceModel}})
{
print "$i. $_ \n";
$i++;
}
If I add the following line:
warn Dumper($features->{features}[0]->{deviceModel});
It dumps this:
$VAR1 = {
'deviceModel' => {
'' => {
'cfg' => []
},
'Cisco1240' => {
'cfg' => [
'cisco1240feature.cfg'
]
},
'Catalyst3750' => {
'cfg' => [
'catalyst3750feature.cfg'
]
},
'Catalyst3650' => {
'cfg' => [
'catalyst3650feature.cfg'
]
},
'HP2524' => {
'cfg' => [
'hp2524feature.cfg'
]
}
}
};
As you may notice, the first item is indeed empty. I added the following line to skip it, and just print the rest of the info:
if ($_ eq '') {
shift;
}
But it doesn't seem to work do what I want. I want to skip the item if it's empty.
Well, shifting #ARGV (implicit argument to shift in main program) nor shifting #_ (implicit argument of shift in a function) are not going to help you, because you are not printing either of them.
You can either:
Not add the '' entry in the first place (depends on how it's generated)
Remove the '' entry before printing:
delete $features->{features}[0]->{deviceModel}->{''};
Don't print the entry:
if($_ eq '') {
next;
}
or
if($_ ne '') {
print "$i. $_ \n";
$i++;
}
foreach (keys %{$features->{features}[0]->{deviceModel}})
{
next unless length($_);
print "$i. $_ \n";
$i++;
}
#!/usr/bin/env perl
use strict; use warnings;
my $devices = {
'deviceModel' => {
'' => { 'cfg' => [] },
'Cisco1240' => { 'cfg' => ['cisco1240feature.cfg' ] },
'Catalyst3750' => { 'cfg' => [ 'catalyst3750feature.cfg' ]},
'Catalyst3650' => { 'cfg' => [ 'catalyst3650feature.cfg' ]},
'HP2524' => { 'cfg' => [ 'hp2524feature.cfg' ]},
}
};
{
my $item = 1;
for my $d (grep length, keys %{ $devices->{deviceModel} }) {
printf "%2d. %s\n", $item++, $d;
}
}
Output:
1. Catalyst3750
2. Cisco1240
3. Catalyst3650
4. HP2524

Perl Working On Two Hash References

I would like to compare the values of two hash references.
The data dumper of my first hash is this:
$VAR1 = {
'42-MG-BA' => [
{
'chromosome' => '19',
'position' => '35770059',
'genotype' => 'TC'
},
{
'chromosome' => '2',
'position' => '68019584',
'genotype' => 'G'
},
{
'chromosome' => '16',
'position' => '9561557',
'genotype' => 'G'
},
And the second hash is similar to this but with more hashes in the array. I would like to compare the genotype of my first and second hash if the position and the choromosome matches.
map {print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n"}sort keys %$cave_snp_list;
map {print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n"}sort keys %$geno_seq_list;
I could do that for the first array of the hashes.
Could you help me in how to work for all the arrays?
This is my actual code in full
#!/software/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Benchmark;
use Config::Config qw(Sequenom.ini);
useDatabase::Conn;
use Data::Dumper;
GetOptions("sam=s" => \my $sample);
my $geno_seq_list = getseqgenotypes($sample);
my $cave_snp_list = getcavemansnpfile($sample);
#print Dumper($geno_seq_list);
print scalar %$geno_seq_list, "\n";
foreach my $sam (keys %{$geno_seq_list}) {
my $seq_used = $geno_seq_list->{$sam};
my $cave_used = $cave_snp_list->{$sam};
print scalar(#$geno_seq_list->{$_}) if sort keys %$geno_seq_list, "\n";
print scalar(#$cave_used), "\n";
#foreach my $seq2com (# {$seq_used } ){
# foreach my $cave2com( # {$cave_used} ){
# print $seq2com->{chromosome},":" ,$cave2com->{chromosome},"\n";
# }
#}
map { print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n" } sort keys %$cave_snp_list;
map { print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n" } sort keys %$geno_seq_list;
}
sub getseqgenotypes {
my $snpconn;
my $gen_list = {};
$snpconn = Database::Conn->new('live');
$snpconn->addConnection(DBI->connect('dbi:Oracle:pssd.world', 'sn', 'ss', { RaiseError => 1, AutoCommit => 0 }),
'pssd');
#my $conn2 =Database::Conn->new('live');
#$conn2->addConnection(DBI->connect('dbi:Oracle:COSI.world','nst_owner','nst_owner', {RaiseError =>1 , AutoCommit=>0}),'nst');
my $id_ind = $snpconn->execute('snp::Sequenom::getIdIndforExomeSample', $sample);
my $genotype = $snpconn->executeArrRef('snp::Sequenom::getGenotypeCallsPosition', $id_ind);
foreach my $geno (#{$genotype}) {
push #{ $gen_list->{ $geno->[1] } }, {
chromosome => $geno->[2],
position => $geno->[3],
genotype => $geno->[4],
};
}
return ($gen_list);
} #end of sub getseqgenotypes
sub getcavemansnpfile {
my $nstconn;
my $caveman_list = {};
$nstconn = Database::Conn->new('live');
$nstconn->addConnection(
DBI->connect('dbi:Oracle:CANP.world', 'nst_owner', 'NST_OWNER', { RaiseError => 1, AutoCommit => 0 }), 'nst');
my $id_sample = $nstconn->execute('nst::Caveman::getSampleid', $sample);
#print "IDSample: $id_sample\n";
my $file_location = $nstconn->execute('nst::Caveman::getCaveManSNPSFile', $id_sample);
open(SNPFILE, "<$file_location") || die "Error: Cannot open the file $file_location:$!\n";
while (<SNPFILE>) {
chomp;
next if /^>/;
my #data = split;
my ($nor_geno, $tumor_geno) = split /\//, $data[5];
# array of hash
push #{ $caveman_list->{$sample} }, {
chromosome => $data[0],
position => $data[1],
genotype => $nor_geno,
};
} #end of while loop
close(SNPFILE);
return ($caveman_list);
}
The problem that I see is that you're constructing a tree for generic storage of data, when what you want is a graph, specific to the task. While you are constructing the record, you could also be constructing the part that groups data together. Below is just one example.
my %genotype_for;
my $record
= { chromosome => $data[0]
, position => $data[1]
, genotype => $nor_geno
};
push #{ $gen_list->{ $geno->[1] } }, $record;
# $genotype_for{ position }{ chromosome }{ name of array } = genotype code
$genotype_for{ $data[1] }{ $data[0] }{ $sample } = $nor_geno;
...
return ( $caveman_list, \%genotype_for );
In the main line, you receive them like so:
my ( $cave_snp_list, $geno_lookup ) = getcavemansnpfile( $sample );
This approach at least allows you to locate similar position and chromosome values. If you're going to do much with this, I might suggest an OO approach.
Update
Assuming that you wouldn't have to store the label, we could change the lookup to
$genotype_for{ $data[1] }{ $data[0] } = $nor_geno;
And then the comparison could be written:
foreach my $pos ( keys %$small_lookup ) {
next unless _HASH( my $sh = $small_lookup->{ $pos } )
and _HASH( my $lh = $large_lookup->{ $pos } )
;
foreach my $chrom ( keys %$sh ) {
next unless my $sc = $sh->{ $chrom }
and my $lc = $lh->{ $chrom }
;
print "$sc:$sc";
}
}
However, if you had limited use for the larger list, you could construct the specific case
and pass that in as a filter when creating the longer list.
Thus, in whichever loop creates the longer list, you could just go
...
next unless $sample{ $position }{ $chromosome };
my $record
= { chromosome => $chromosome
, position => $position
, genotype => $genotype
};
...