I have below hash structure.
$VAR1 = {
'USA' => {
'Alabama' => {
'ISO3' => 'ISO3:3166-2:US',
'ISO2' => 'ISO2:4166-23:US',
'UNI' => 'UNIABR-A',
'UNDP' => 'UNDP-ZXC-1',
'FAOSTAT' => 'STAT20.98',
'GAUL' => 'UL-SD-20/40'
},
'Washington' => {
'ISO3' => 'ISO3:40-166-2:US',
'ISO2' => 'ISO2:30-23:US',
'UNI' => 'UNIISO-B',
'UNDP' => 'UNDP-YXC-2',
'FAOSTAT' => 'STAT30.98.78',
'GAUL' => 'UL-SD-30/60'
}
}
};
What i would like to achieve is to iterate through the above hash and get the statename and country name for value inside hash "ISO2:4166-23:US". what i have tried to do is:
I can get the required output with the below code.
my $find = "ISO2:4166-23:US";
my $statename;
while ( my ($country, $states) = each (%$VAR1) ) {
while (my ($states, $otherkeys) = each (%$states) ) {
while (my ($otherkeys, $value) = each %$otherkeys) {
$statename = $states if ($value eq $find);
}
}
}
print "State name for value [$find] is :: $statename \n"; ### Output : Alabama
Is there any way I can get -
Get the top-level key "USA" and second-level key "Alabama" if $value is equal to ISO2:4166-23:US. What i know is the value inside hash for which I need to get the output, key corresponding to my search value doesn't matter.
with one-liner grep command from above hash?
Any pointers in the right direction would be useful. Thanks.
Your variables are poorly named. Reusing $states for the state name? ouch.
my $find = "ISO2:4166-23:US";
my $found_state_name;
while ( my ($country_name, $states) = each(%$VAR1) ) {
while (my ($state_name, $state) = each(%$states) ) {
while ( my ($key, $value) = each(%$state) ) {
if ($value eq $find) {
$found_state_name = $state_name;
}
}
}
}
Now, it would be nice to stop searching as soon as you a result is found. We can't do that while still using each(cause it'll screw up later keys/values/each on those hashes).
my $find = "ISO2:4166-23:US";
my $found_state_name;
FIND:
for my $country_name (keys(%$VAR1)) {
my $country = $VAR1->{$country_name};
for my $state_name (keys(%$country)) {
my $state = $country->{$state_name};
for my $key (keys(%$state)) {
if ($state->{$key} eq $find) {
$found_state_name = $state_name;
last FIND;
}
}
}
}
We never use $country_name or $key except to get the value.
my $find = "ISO2:4166-23:US";
my $found_state_name;
FIND:
for my $states (values(%$VAR1)) {
for my $state_name (keys(%$states)) {
my $state = $country->{$state_name};
for my $value (values(%$state)) {
if ($value eq $find) {
$found_state_name = $state_name;
last FIND;
}
}
}
}
If you know you're looking for an ISO2 value, this simplifies to the following:
my $find = "ISO2:4166-23:US";
my $found_state_name;
FIND:
for my $states (values(%$VAR1)) {
for my $state_name (keys(%$states)) {
my $state = $states->{$state_name};
if ($state->{ISO2} eq $find) {
$found_state_name = $state_name;
last FIND;
}
}
}
You want to use grep, eh? Since you want a state name as a result, we need to grep a list of state names.
my #state_names = ...;
my ($found_state_name) =
grep { ... }
#state_names;
We can obtain the list of state name using
my #state_names =
map { keys(%$_) }
values(%$VAR1);
But that's not quite enough to perform the check. (For now, I'm going to assume that only the ISO2 property needs to be checked.)
my #state_names =
map { keys(%$_) }
values(%$VAR1);
my ($found_state_name) =
grep { $VAR1->{???}{$_}{ISO2} eq $find }
#state_names;
There are two solutions. You can work with country-state pairs.
my #country_state_name_pairs =;
map {
my $country_name = $_;
map { [ $country_name, $_ ] }
keys(%{ $VAR1->{$country_name} )
}
keys(%$VAR1);
my ($found_state_name) =
map { $_->[1] }
grep {
my ($country_name, $state_name) = #$_;
$VAR1->{$country_name}{$state_name}{ISO2} eq $find
}
#country_state_name_pairs;
Or you can create a flat lists of states and search that.
my #states_with_name =
map { [ $_, $VAR1->{$_} ] }
values(%$VAR1);
my ($found_state_name) =
map { $_->[0] }
grep { $_->[1]{ISO2} eq $find }
#states_with_name;
Noting stops us from merging the two statements.
my ($found_state_name) =
map { $_->[0] } # Get the state name.
grep { $_->[1]{ISO2} eq $find } # Filter out undesireable states.
map { [ $_, $VAR1->{$_} ] } # $state_name => [ $state_name, $state ]
values(%$VAR1); # Get the countries.
This last one isn't too bad!
Finally, there are two ways to modify each of the above to search all the fields instead of just ISO2). (I'm going to only the modifications to the latter of the above two solutions.)
my ($found_state_name) =
map { $_->[0] } # Get the state name.
grep { # Filter out undesireable states.
grep { $_ eq $find } # Filter out undesireable properties of the state.
values(%{ $_->[1] }) # Get the state's property values.
}
map { [ $_, $VAR1->{$_} ] } # $state_name => [ $state_name, $state ]
values(%$VAR1); # Get the countries.
or
my ($found_state_name) =
map { $_->[0] } # Get the state name.
grep { $_->[1] eq $find } # Filter out undesireable states.
map { # $state_name => multiple [ $state_name, $value ]
my $state_name = $_;
map { [ $state_name, $_ ] } # value => [ $state_name, $value ]
values(%{ $VAR1->{$_} ) # Get the state's property values.
}
values(%$VAR1); # Get the countries.
These aren't readable. They are best avoided.
Finally, if you are going to perform many searches based on ISO2, it would be best if you organized your data in terms of ISO2.
my %by_iso2 = (
'ISO2:4166-23:US' => {
country_name => 'USA',
state_name => 'Alabama',
ISO2 => 'ISO2:4166-23:US',
ISO3 => 'ISO3:3166-2:US',
...
},
'ISO2:4166-23:US' => {
country_name => 'USA',
state_name => 'Washington',
ISO2 => 'ISO2:30-23:US',
ISO3 => 'ISO3:40-166-2:US',
...
},
...
);
No. Hashes are one-directional, so you have to loop over all the values to find one you are searching for.
It sounds like what you really want is some sort of database and that a hash is the wrong tool for your job.
You can build your hash the other way around, using the ISO2 as the key, e.g.
$VAR1 = {
"ISO2:4166-23:US" => { Country => 'USA', State => 'Alabama' },
"ISO2:4166-23:US" => { Country => 'USA', State => 'Washington' }
}
If you are planning to do these lookups a lot, it might be worth the while. Doing that can be automated too. Using a similar loop to build a new hash to use as lookup.
Speed-wise, there is nothing wrong with looping over all the hash keys. The difference between direct lookup and looping will be negligible, unless you have a truly huge hash.
Related
I am trying to create a hash of hash of - the nesting depth depends on the number of arguments passed into #aGroupByFields array.
In the below implementation, I am getting the desired hash structure.But I have hard coded the fields [ example - $phBugRecord->{createdBy} ] instead of deriving it from the array.
I am not sure how to dynamically create this.
my (#aGroupByFields) = ['createdBy','status','devPriority'];
# In real case,these are passed in as arguments
my (%hTemp);
# This is the final hash which will be structured according to above fields
# %hBugDetails is the hash containing details of all bugs
foreach my $phBugRecord ( #{ $hBugDetails{records} } ) {
# The below statement needs to be generated dynamically as
# opposed to the hard-coded values.
push(
#{
$hTemp{ $phBugRecord->{createdBy} }{ $phBugRecord->{status} }
{ $phBugRecord->{devPriority} }
},
$phBugRecord
);
}
Any pointer will be a great help.Thanks.
Here is a working implementation with Data::Diver.
use strict;
use warnings;
use Data::Diver 'DiveVal';
use Data::Printer;
my %hBugDetails = (
records => [
{
createdBy => 'created_by1',
status => 'status1',
devPriority => 'dev_priority1',
foo => 'foo1',
bar => 'bar1',
},
{
createdBy => 'created_by1',
status => 'status2',
devPriority => 'dev_priority2',
foo => 'foo',
bar => 'bar',
},
],
);
# we want to group by these fields
my #group_by = ( 'createdBy', 'status', 'devPriority' );
my $grouped_bugs = {}; # for some reason we need to start with an empty hashref
foreach my $bug ( #{ $hBugDetails{records} } ) {
# this will auto-vivify the hash for us
push #{ DiveVal( $grouped_bugs, map { $bug->{$_} } #group_by ) }, $bug;
}
p $grouped_bugs;
The output looks like this.
\ {
created_by1 {
status1 {
dev_priority1 [
[0] {
bar "bar1",
createdBy "created_by1",
devPriority "dev_priority1",
foo "foo1",
status "status1"
}
]
},
status2 {
dev_priority2 [
[0] {
bar "bar",
createdBy "created_by1",
devPriority "dev_priority2",
foo "foo",
status "status2"
}
]
}
}
}
Note that I renamed your variables. It was very hard to read the code like that. It makes more sense to just use speaking names instead of cryptic abbreviations for the type of variable. The sigil already does that for you.
This code will do what you need
my #aGroupByFields = qw/ createdBy status devPriority /;
my %hTemp;
for my $phBugRecord ( #{ $hBugDetails{records} } ) {
my $hash = \%hTemp;
for my $field ( #aGroupByFields ) {
my $key = $phBugRecord->{$field};
if ( $field eq $aGroupByFields[-1] ) {
push #{ $hash->{ $key } }, $phBugRecord;
}
else {
$hash = $hash->{ $key } //= {};
}
}
}
The Perl code below works, but it doesn't scale well even with considerable computer resources. I hoping that someone can help me find more efficient code such as by replacing recursion with iteration, if that's the problem.
my data structure looks like this:
my %REV_ALIGN;
$REV_ALIGN{$dna}{$rna} = ();
Any dna key may have multiple rna sub keys. The same rna sub key may appear with multiple different dna keys. The purpose is to group rna ( transcripts ) based on shared dna sequence elements. For example, if dnaA has RNA1, RNA8, RNA9, and RNA4, and dnaB has RNA11, RNA4, and RNA99, then we group all these transcripts together ( RNA1, RNA9, RNA4, RNA11, RNA99 ) and continue to proceed to try and add to the group by selecting other dna. My recusive solution to this problem works but doesn't scale so well when using data from whole genome to transcriptome alignment.
SO MY QUESTION IS: WHAT IS A MORE EFFICIENT SOLUTION TO THIS PROBLEM? THANK YOU VERY MUCH
my #groups;
while ( my $x =()= keys %REV_ALIGN )
{
my #DNA = keys %REV_ALIGN;
my $dna = shift #DNA;
# the corresponding list of rna
my #RNA = keys %{$REV_ALIGN{$dna}};
delete $REV_ALIGN{$dna};
if ( $x == 1 )
{
push #groups, \#RNA;
last;
}
my $ref = group_transcripts ( \#RNA, \%REV_ALIGN );
push #groups, $ref;
}
sub group_transcripts
{
my $tran_ref = shift;
my $align_ref = shift;
my #RNA_A = #$tran_ref;
my %RNA;
# create a null hash with seed list of transcripts
#RNA{#RNA_A} = ();
# get a list of all remaining dna sequences in the alignment
my #DNA = keys %{$align_ref};
my %count;
# select a different list of transcripts
for my $dna ( #DNA )
{
next unless exists $align_ref->{$dna};
my #RNA_B = keys %{$align_ref->{$dna}};
# check to see two list share and transcripts
for my $element ( #RNA_A, #RNA_B )
{
$count{$element}++;
}
for my $rna_a ( keys %count )
{
# if they do, add any new transcripts to the current group
if ( $count{$rna_a} == 2 )
{
for my $rna_b ( #RNA_B )
{
push #RNA_A, $rna_b if $count{$rna_b} == 1;
}
delete $align_ref->{$dna};
delete $count{$_} foreach keys %count;
# recurse to try and continue adding to list
#_ = ( \#RNA_A, $align_ref );
goto &group_transcripts;
}
}
delete $count{$_} foreach keys %count;
}
# if no more transcripts can be added, return a reference to the group
return \#RNA_A;
}
You have a loops nested four deep. It's an pretty safe bet that's why your code scales poorly.
If I understand correctly what you are trying to accomplish, the input
my %REV_ALIGN = (
"DNA1" => { map { $_ => undef } "RNA1", "RNA2" }, # \ Linked by RNA1 \
"DNA2" => { map { $_ => undef } "RNA1", "RNA3" }, # / \ Linked by RNA3 > Group
"DNA3" => { map { $_ => undef } "RNA3", "RNA4" }, # / /
"DNA4" => { map { $_ => undef } "RNA5", "RNA6" }, # \ Linked by RNA5 \ Group
"DNA5" => { map { $_ => undef } "RNA5", "RNA7" }, # / /
"DNA6" => { map { $_ => undef } "RNA8" }, # > Group
);
should result in
my #groups = (
[
dna => [ "DNA1", "DNA2", "DNA3" ],
rna => [ "RNA1", "RNA2", "RNA3", "RNA4" ],
],
[
dna => [ "DNA4", "DNA5" ],
rna => [ "RNA5", "RNA6", "RNA7" ],
],
[
dna => [ "DNA6" ],
rna => [ "RNA8" ],
],
);
If so, you can use the following:
use strict;
use warnings;
use Graph::Undirected qw( );
my %REV_ALIGN = (
"DNA1" => { map { $_ => undef } "RNA1", "RNA2" },
"DNA2" => { map { $_ => undef } "RNA1", "RNA3" },
"DNA3" => { map { $_ => undef } "RNA3", "RNA4" },
"DNA4" => { map { $_ => undef } "RNA5", "RNA6" },
"DNA5" => { map { $_ => undef } "RNA5", "RNA7" },
"DNA6" => { map { $_ => undef } "RNA8" },
);
my $g = Graph::Undirected->new();
for my $dna (keys(%REV_ALIGN)) {
for my $rna (keys(%{ $REV_ALIGN{$dna} })) {
$g->add_edge("dna:$dna", "rna:$rna");
}
}
my #groups;
for my $raw_group ($g->connected_components()) {
my %group = ( dna => [], rna => [] );
for (#$raw_group) {
my ($type, $val) = split(/:/, $_, 2);
push #{ $group{$type} }, $val;
}
push #groups, \%group;
}
use Data::Dumper qw( Dumper );
print(Dumper(\#groups));
If you just want the RNA, the final section simplifies to the following:
my #groups;
for my $raw_group ($g->connected_components()) {
my #group;
for (#$raw_group) {
my ($type, $val) = split(/:/, $_, 2);
push #group, $val if $type eq 'rna';
}
push #groups, \#group;
}
I have a Perl data structurte like so
%myhash = (
k1 => v1,
kArray => [
{
name => "anonymous hash",
...
},
\&funcThatReturnsHash,
{
name => "another anonymous hash",
...
}
]
);
Elsewhere I iterate through the list in kArray which contains a bunch of hashes. I would like to either process the actual hash OR the hash returned by the function.
foreach my $elem( #{myhash{kArray}} ) {
if (ref($elem) == "CODE") {
%thisHash = &$elem;
}
else {
%thisHash = %$elem;
}
...
}
However ref ($elem) is always scalar or undefined. I tried func, &func, \&func, \%{&func}, in %myhash to no effect.
how do I extract the hash within the function in the main body?
Apart from the code sample you give being invalid Perl, the main problems seem to be that you are using == to compare strings instead of eq, and you are assigning a hash reference to a hash variable %thishash. I assure you that ref $elem never returns SCALAR with the data you show
It would help you enormously if you followed the common advice to use strict and use warnings at the top of your code
This will work for you
for my $elem ( #{ $myhash{kArray} } ) {
my $this_hash;
if ( ref $elem eq 'CODE' ) {
$this_hash = $elem->();
}
else {
$this_hash = $elem;
}
# Do stuff with $this_hash
}
or you could just use a map like this
use strict;
use warnings;
use 5.010;
use Data::Dump;
my %myhash = (
k1 => v1,
kArray => [
{
name => "anonymous hash",
},
\&funcThatReturnsHash,
{
name => "another anonymous hash",
}
]
);
for my $hash ( map { ref eq 'CODE' ? $_->() : $_ } #{ $myhash{kArray} } ) {
say $hash->{name};
}
sub funcThatReturnsHash {
{ name => 'a third anonymous hash' };
}
output
anonymous hash
a third anonymous hash
another anonymous hash
If you turn on strict and warnings, you'll see that:
foreach my $elem(#{mynahs{kArray}}) {
Isn't valid. You need at the very least a $ before mynahs.
But given something like this - your approach works - here's an example using map to 'run' the code references:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
sub gimme_hash {
return { 'fish' => 'paste' };
}
my $stuff =
[ { 'anon1' => 'value' },
\&gimme_hash,
{ 'anon2' => 'anothervalue' }, ];
my $newstuff = [ map { ref $_ eq "CODE" ? $_->() : $_ } #$stuff ];
print Dumper $newstuff;
Turns that hash into:
$VAR1 = [
{
'anon1' => 'value'
},
{
'fish' => 'paste'
},
{
'anon2' => 'anothervalue'
}
];
But your approach does work:
foreach my $element ( #$stuff ) {
my %myhash;
if ( ref $element eq "CODE" ) {
%myhash = %{$element -> ()};
}
else {
%myhash = %$element;
}
print Dumper \%myhash;
}
Gives:
$VAR1 = {
'anon1' => 'value'
};
$VAR1 = {
'fish' => 'paste'
};
$VAR1 = {
'anon2' => 'anothervalue'
};
I have written the following code in Perl. The code is reading a pdb file and getting some values. Ignore the top part of the code,where everything is working perfect.
Problem is in the sub-routine part, where I try to store arrays in the hash3 with model as key another key position
the array values can be accessed inside the if condition using this :
$hash3{$model}{$coordinates}[1].
but when I go out of all foreach loop and try to access the elements I only get one value.
Please look at the end foreach loop and tell me is it the wrong way to access the hash values.
The pdb file I am using can be downloaded from this link http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId=1NZS
#!/usr/bin/perl
open(IN,$ARGV[0]);
my #phosphosites;
my $model=1;
my %hash3;
while(<IN>)
{
#findmod(#line);
#finddist;
#findfreq;
if((/^MODRES/) && (/PHOSPHO/))
{
#line=split;
push(#phosphosites, $line[2]);
#print "$line[4]";
}
foreach $elements (#phosphosites){
if(/^HETATM\s+\d+\s+CA\s+$i/)
{
#line1=split;
#print "$line1[5]";
#print "$line1[6] $line1[7] $line1[8]\n";
push(#phosphositesnum, $line1[5]);
}
}
$pos=$line1[5];
#findspatial(\#line,\#line1);
}
my #ori_data=removeDuplicates(#phosphositesnum);
sub removeDuplicates {
my %seen = ();
my #vals = ();
foreach my $i (#_) {
unless ($seen{$i}) {
push #vals, $i;
$seen{$i} = 1;
}
}
return #vals;
}
$a=(#phosphosites);
print "$a\n";
print "#phosphosites\n";
print "#ori_data\n";
close(IN);
open(IN1,$ARGV[0]);
my (#data)=<IN1>;
spatial(\#ori_data);
sub spatial {
my #spatial_array1=#{$_[0]};
foreach $coordinates(#spatial_array1)
{
$model=1;
{foreach $data1(#data){
if($data1=~ m/^HETATM\s+\d+\s+CA\s+[A-Z]*\s+[A-Z]*\s+$coordinates/)
{
#cordivals=split(/\s+/,$data1);
push #{ $sphash{$model} },[$cordivals[6], $cordivals[7], $cordivals[8]];
$hash3{$model}{$coordinates}= \#cordivals;
#print "$model $coordinates $hash3{$model}{$coordinates}[6] $hash3{$model}{$coordinates}[7] $hash3{$model}{$coordinates}[8]\n";
#print "$model $sphash{$model}[$i][0] $sphash{$model}[$i][1] $sphash{$model}[$i][2]\n";
}
elsif($data1=~ m/^ENDMDL/)
{
$model++;
}
#print "$model $coordinates $hash3{$model}{$coordinates}[6] $hash3{$model}{$coordinates}[7] $hash3{$model}{$coordinates}[8]\n";
}
}
}
#foreach $z1 (sort keys %hash3)
# {
# foreach $z2(#spatial_array1){
# print "$z1 $z2";
# print "$hash3{$z1}{$z2}[6]\n";
# print "$z2\n";
# }
# }
}
After using the Data::Dumper option it is giving me this kind of output
$VAR1 = {
'11' => {
'334' => [
'HETATM',
'115',
'CA',
'SEP',
'A',
'343',
'-0.201',
'-2.884',
'1.022',
'1.00',
'99.99',
'C'
],
'342' => $VAR1->{'11'}{'334'},
'338' => $VAR1->{'11'}{'334'},
'335' => $VAR1->{'11'}{'334'},
'340' => $VAR1->{'11'}{'334'},
'343' => $VAR1->{'11'}{'334'},
'336' => $VAR1->{'11'}{'334'}
},
'7' => {
'334' => $VAR1->{'11'}{'334'},
'342' => $VAR1->{'11'}{'334'},
'338' => $VAR1->{'11'}{'334'},
'335' => $VAR1->{'11'}{'334'},
'340' => $VAR1->{'11'}{'334'},
'343' => $VAR1->{'11'}{'334'},
'336' => $VAR1->{'11'}{'334'}
},
'2' => {
'334' => $VAR1->{'11'}{'334'},
'342' => $VAR1->{'11'}{'334'},
...
Change:
#cordivals=split(/\s+/,$data1);
to:
my #cordivals=split(/\s+/,$data1);
What seems to be happening is that all the hash elements contain references to the same array variable, because you're not making the variable local to that iteration.
In general, you should use my with all variables.
I would like to compare the values of two hash references.
The data dumper of my first hash is this:
$VAR1 = {
'42-MG-BA' => [
{
'chromosome' => '19',
'position' => '35770059',
'genotype' => 'TC'
},
{
'chromosome' => '2',
'position' => '68019584',
'genotype' => 'G'
},
{
'chromosome' => '16',
'position' => '9561557',
'genotype' => 'G'
},
And the second hash is similar to this but with more hashes in the array. I would like to compare the genotype of my first and second hash if the position and the choromosome matches.
map {print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n"}sort keys %$cave_snp_list;
map {print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n"}sort keys %$geno_seq_list;
I could do that for the first array of the hashes.
Could you help me in how to work for all the arrays?
This is my actual code in full
#!/software/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Benchmark;
use Config::Config qw(Sequenom.ini);
useDatabase::Conn;
use Data::Dumper;
GetOptions("sam=s" => \my $sample);
my $geno_seq_list = getseqgenotypes($sample);
my $cave_snp_list = getcavemansnpfile($sample);
#print Dumper($geno_seq_list);
print scalar %$geno_seq_list, "\n";
foreach my $sam (keys %{$geno_seq_list}) {
my $seq_used = $geno_seq_list->{$sam};
my $cave_used = $cave_snp_list->{$sam};
print scalar(#$geno_seq_list->{$_}) if sort keys %$geno_seq_list, "\n";
print scalar(#$cave_used), "\n";
#foreach my $seq2com (# {$seq_used } ){
# foreach my $cave2com( # {$cave_used} ){
# print $seq2com->{chromosome},":" ,$cave2com->{chromosome},"\n";
# }
#}
map { print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n" } sort keys %$cave_snp_list;
map { print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n" } sort keys %$geno_seq_list;
}
sub getseqgenotypes {
my $snpconn;
my $gen_list = {};
$snpconn = Database::Conn->new('live');
$snpconn->addConnection(DBI->connect('dbi:Oracle:pssd.world', 'sn', 'ss', { RaiseError => 1, AutoCommit => 0 }),
'pssd');
#my $conn2 =Database::Conn->new('live');
#$conn2->addConnection(DBI->connect('dbi:Oracle:COSI.world','nst_owner','nst_owner', {RaiseError =>1 , AutoCommit=>0}),'nst');
my $id_ind = $snpconn->execute('snp::Sequenom::getIdIndforExomeSample', $sample);
my $genotype = $snpconn->executeArrRef('snp::Sequenom::getGenotypeCallsPosition', $id_ind);
foreach my $geno (#{$genotype}) {
push #{ $gen_list->{ $geno->[1] } }, {
chromosome => $geno->[2],
position => $geno->[3],
genotype => $geno->[4],
};
}
return ($gen_list);
} #end of sub getseqgenotypes
sub getcavemansnpfile {
my $nstconn;
my $caveman_list = {};
$nstconn = Database::Conn->new('live');
$nstconn->addConnection(
DBI->connect('dbi:Oracle:CANP.world', 'nst_owner', 'NST_OWNER', { RaiseError => 1, AutoCommit => 0 }), 'nst');
my $id_sample = $nstconn->execute('nst::Caveman::getSampleid', $sample);
#print "IDSample: $id_sample\n";
my $file_location = $nstconn->execute('nst::Caveman::getCaveManSNPSFile', $id_sample);
open(SNPFILE, "<$file_location") || die "Error: Cannot open the file $file_location:$!\n";
while (<SNPFILE>) {
chomp;
next if /^>/;
my #data = split;
my ($nor_geno, $tumor_geno) = split /\//, $data[5];
# array of hash
push #{ $caveman_list->{$sample} }, {
chromosome => $data[0],
position => $data[1],
genotype => $nor_geno,
};
} #end of while loop
close(SNPFILE);
return ($caveman_list);
}
The problem that I see is that you're constructing a tree for generic storage of data, when what you want is a graph, specific to the task. While you are constructing the record, you could also be constructing the part that groups data together. Below is just one example.
my %genotype_for;
my $record
= { chromosome => $data[0]
, position => $data[1]
, genotype => $nor_geno
};
push #{ $gen_list->{ $geno->[1] } }, $record;
# $genotype_for{ position }{ chromosome }{ name of array } = genotype code
$genotype_for{ $data[1] }{ $data[0] }{ $sample } = $nor_geno;
...
return ( $caveman_list, \%genotype_for );
In the main line, you receive them like so:
my ( $cave_snp_list, $geno_lookup ) = getcavemansnpfile( $sample );
This approach at least allows you to locate similar position and chromosome values. If you're going to do much with this, I might suggest an OO approach.
Update
Assuming that you wouldn't have to store the label, we could change the lookup to
$genotype_for{ $data[1] }{ $data[0] } = $nor_geno;
And then the comparison could be written:
foreach my $pos ( keys %$small_lookup ) {
next unless _HASH( my $sh = $small_lookup->{ $pos } )
and _HASH( my $lh = $large_lookup->{ $pos } )
;
foreach my $chrom ( keys %$sh ) {
next unless my $sc = $sh->{ $chrom }
and my $lc = $lh->{ $chrom }
;
print "$sc:$sc";
}
}
However, if you had limited use for the larger list, you could construct the specific case
and pass that in as a filter when creating the longer list.
Thus, in whichever loop creates the longer list, you could just go
...
next unless $sample{ $position }{ $chromosome };
my $record
= { chromosome => $chromosome
, position => $position
, genotype => $genotype
};
...