Split a hash into many hashes - perl

If I have a hash how could I "break" it/"split" it into multiple hashes containing equal number of keys?
Basically splice in arrays seems to be close to what I need (loop/slice) but that works only for arrays.
So what's the best way to do this?
Update:
Or a way to remove at most X number of key-values so as to simulate the splice of arrays
Update
{ foo => 1, bar => 2, bla =>3}
To be
{ foo => 1 }, { bar => 2 }, { bla => 3 } if X = 1
or { foo => 1, bar => 2 }, { bla => 3 } if X = 2
or { foo => 1, bar => 2, bla => 3 } if X = 3

This should do what you want. On 5.20+, you can probably use the new slice syntax to simplify the code.
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;
sub split_hash {
my ($x, $hash) = #_;
my #hashes;
while (%$hash) {
my #k = keys %$hash;
push #hashes, { map each %$hash, 1 .. $x };
delete #{ $hash }{ keys %{ $hashes[-1] } };
}
return #hashes
}
print Dumper([ split_hash($_, { foo => 1,
bar => 2,
bla => 3,
}
)]) for 1 .. 3;
Note that as written, the code deletes the original hash.

Similar to the solution provided by #choroba, but using splice, and doesn't modify the passed hash:
use Data::Dumper;
use strict;
use warnings;
sub split_hash {
my ( $x, $hash ) = #_;
my #keys = keys %$hash;
my #hashes;
while ( my #subset = splice( #keys, 0, $x ) ) {
push #hashes, { map { $_ => $hash->{$_} } #subset };
}
return \#hashes;
}
print Dumper( [
split_hash(
$_,
{
foo => 1,
bar => 2,
bla => 3,
} ) ] ) for 1 .. 3;

Related

Find key for greatest value in hash of hashes in Perl

I have a hash of hashes containing keys, values, and counts of the form ((k1, v1), c1). I am trying to write a subroutine that returns the value of the key passed as a parameter with the greatest count. For example, if I had:
%hash = (
"a" => {
"b" => 2,
"c" => 1,
},
"d" => {
"e" => 4,
},
);
and made the call:
print &function("a");
it should print "b" because key "a" has the highest count of 2 with "b" as its value. Here is the code I have so far:
sub function() {
$key = $_[0];
if(exists($hash{$key})) {
while (my ($value, $count) = each %{$hash{$key}}) {
#logic goes here
}
} else {
return "$key does not exist";
}
}
The sub doesn't need to know anything about the outer hash, so it makes far more sense to call the sub as follows:
print key_with_highest_val($hash{a});
The sub simply needs to iterate over all the elements of that hash, keeping track of the highest value seen, and the key at which it was seen.
sub key_with_highest_val {
my ($h) = #_;
my $hi_v;
my $hi_k;
for my $k (keys(%$h)) {
my $v = $h->{$k};
if (!defined($hi_v) || $v > $hi_v) {
$hi_v = $v;
$hi_k = $k;
}
}
return $hi_k;
}
As Chris Charley points out, List::Util's reduce can simply this function. With the calling convention I recommended above, the reduce solution becomes the following:
use List::Util qw( reduce );
sub key_with_highest_val {
my ($h) = #_;
return reduce { $h->{$a} >= $h->{$b} ? $a : $b } keys(%$h);
}
Both versions return an arbitrary key among those that tied when there's a tie.
Use the reduce function from List::Util (which is part of core perl).
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
my %hash = (
"a" => {
"b" => 2,
"c" => 1,
},
"d" => {
"e" => 4,
},
);
my $key = 'a';
print "For key: $key, max key is ", max_key($key, %hash), "\n";
sub max_key {
my ($key, %hash) = #_;
return "$key does not exist" unless exists $hash{$key};
my $href = $hash{$key};
return reduce { $href->{$a} > $href->{$b} ? $a : $b } keys %$href;
}
You should always include use strict and use warnings at the top of your programs to catch errors so you can find and fix them. This requires declaring of your variables with my, like my $key = 'a';, for example and my %hash = ...
This program prints:
For key: a, max key is b
This code makes the following assumptions:
The values of your nested hashes are always numeric.
You don't have duplicate values.
Anything else is left as an exercise for the reader.
use strict;
use warnings;
use Data::Dump;
use List::Util qw(max);
my %hash = (
a => {
b => 2,
c => 1,
},
d => {
e => 4,
},
);
dd(max_hash_value(\%hash, $_)) for 'a' .. 'd';
sub max_hash_value {
my ($hash_ref, $search_key) = #_;
return unless $hash_ref->{$search_key};
my %lookup;
while (my ($key, $value) = each(%{$hash_ref->{$search_key}})) {
$lookup{$value} = $key;
}
return $lookup{max(keys(%lookup))};
}
Output:
"b"
()
()
"e"

Adding each element in an array to a complex hash in Perl

I have an array with n number of elments. I want to add each of the elements to a complex hash, each as a key/value pair. If the number of elements were fixed, say three, I would do:
my %hash;
my #array = ("first", "second", "third");
$hash{$array[0]}{$array[1]}{$array[2]}++;
The structure I want to end up with, is this (printed with Data::Dumper):
$VAR1 = 'first';
$VAR2 = {
'second' => {
'third' => 1
};
But I am at loss at achieving the same structure when the number of elements in the array isn't fixed. Something with anonymous variables and iterating through the array, yes, but something like foreach #array{$hash{$_}++}; will obviously only make one entry per element, and not the desired structure. Help?
Something like this could build the structure you desire for N elements:
use strict;
use warnings;
use Data::Dumper;
my #array = qw(first second third four five six seven);
my $hash;
foreach my $key ( reverse #array ) {
$hash = { $key => $hash };
}
print Dumper $hash;
__END__
$VAR1 = {
'first' => {
'second' => {
'third' => {
'fourth' => {
'fifth' => {
'sixth' => {
'seventh' => undef
}
}
}
}
}
}
};
It is not clear what you really need this for. There may be a better solution if you explain your use-case a little more. Incrementing this structure doesn't appear to be very easy.
After playing around a little, you can increment by traversing the hash reference to the bottom then incrementing the value of the last element. It is not very pretty though :|
# incrementing
my $elem = $hash; # copy the reference
foreach my $key ( #array ) {
# found the bottom of the hash
unless ( $elem->{$key} && ref($elem->{$key}) ) {
$elem->{$key}++;
last;
}
# not at the bottom, move to the next level
$elem = $elem->{$key};
}
print Dumper $hash;
__END__
$VAR1 = {
'first' => {
'second' => {
'third' => {
'fourth' => {
'fifth' => {
'sixth' => {
'seventh' => 1
}
}
}
}
}
}
};
This is relatively simple if you maintain a current hash reference. This short program demonstrates
The first few steps make sure that each hash element exists and its value is a hash reference. $href is moved to the next level of hash at each stage. For the final element of the array, the latest hash level's element is incremented instead of being set to a hash reference.
Whether or not this data structure is the correct choice depends on what else you need to do with it once you have built it
use strict;
use warnings;
my %hash;
my #array = qw/ first second third fourth fifth /;
drill_hash(\%hash, #array);
use Data::Dump;
dd \%hash;
sub drill_hash {
my ($href, #list) = #_;
my $final = pop #list;
$href = $href->{$_} //= {} for #list;
++$href->{$final};
}
output
{
first => { second => { third => { fourth => { fifth => 1 } } } },
}
Update
Having understood your purpose, the simplest way to keep a count of occurrences of ngrams like that is to have a speficic hash key that is used to keep the count of the sequence of words so far.
This program uses the value _COUNT for that key, and you can see that, for example, {under}{a}{_COUNT} and {under}{a}{rock}{_COUNT} are both 1
use strict;
use warnings;
my %counts;
count_ngram(\%counts, qw/ under a /);
count_ngram(\%counts, qw/ a rock /);
count_ngram(\%counts, qw/ under a rock /);
count_ngram(\%counts, qw/ a tree /);
count_ngram(\%counts, qw/ under a tree /);
use Data::Dump;
dd \%counts;
sub count_ngram {
my ($href, #ngram) = #_;
my $final = pop #ngram;
$href = $href->{$_} //= {} for #ngram;
++$href->{$final}{_COUNT};
}
output
{
a => { rock => { _COUNT => 1 }, tree => { _COUNT => 1 } },
under => {
a => { _COUNT => 1, rock => { _COUNT => 1 }, tree => { _COUNT => 1 } },
},
}

Converting HoA to HoH with counting

Have this code:
use 5.020;
use warnings;
use Data::Dumper;
my %h = (
k1 => [qw(aa1 aa2 aa1)],
k2 => [qw(ab1 ab2 ab3)],
k3 => [qw(ac1 ac1 ac1)],
);
my %h2;
for my $k (keys %h) {
$h2{$k}{$_}++ for (#{$h{$k}});
}
say Dumper \%h2;
produces:
$VAR1 = {
'k1' => {
'aa2' => 1,
'aa1' => 2
},
'k3' => {
'ac1' => 3
},
'k2' => {
'ab1' => 1,
'ab3' => 1,
'ab2' => 1
}
};
Is possible to write the above code with "another way"? (e.g. simpler or more compact)?
Honestly, I don't like the number of times $h2{$k} is evaluated.
my %h2;
for my $k (keys %h) {
my $src = $h{$k};
my $dst = $h2{$k} = {};
++$dst->{$_} for #$src;
}
A subroutine can help make the intent more obvious. Maybe.
sub counts { my %c; ++$c{$_} for #_; \%c }
$h2{$_} = counts(#{ $h{$_} }) for keys %h;
That can be simplified if you do the change in-place.
sub counts { my %c; ++$c{$_} for #_; \%c }
$_ = counts(#$_) for values %h;

Perl: Sorting hash of hash by value descending order

data :
%HoH => (
abc => {
value => "12",
},
xyz => {
number => "100",
},
pqr => {
digit => "5",
}
)
How do I sort the hash of hash by value in descending order?
Output
100
12
5
You can't sort a hash, it won't hold the order. If you wanted to keep them sorted, you'll have to sort the keys based on the number and store the keys in an array.
#!/usr/bin/perl
use strict;
use warnings;
my %HoH = (
abc => { value => 12 },
xyz => { value => 100},
pqr => { value => 5},
def => { value => 15},
hij => { value => 30},
);
my #sorted_keys = map { $_->[0] }
sort { $b->[1] <=> $a->[1] } # use numeric comparison
map { my $temp;
if ( exists $HoH{$_}{'value'} ) {
$temp = $HoH{$_}{'value'};
} elsif ( exists $HoH{$_}{'number'} ) {
$temp = $HoH{$_}{'number'};
} elsif ( exists $HoH{$_}{'digit'} ) {
$temp = $HoH{$_}{'digit'};
} else {
$temp = 0;
}
{[$_, $temp]} }
(keys %HoH);
for my $key (#sorted_keys) {
my $temp;
if ( exists $HoH{$key}{'value'} ) {
$temp = $HoH{$key}{'value'};
} elsif ( exists $HoH{$key}{'number'} ) {
$temp = $HoH{$key}{'number'};
} elsif ( exists $HoH{$key}{'digit'} ) {
$temp = $HoH{$key}{'digit'};
} else {
$temp = 0;
}
print $key . ":" . $temp ."\n";
}
Output:
xyz:100
hij:30
def:15
abc:12
pqr:5
This technique to do the sorting is called Schwartzian Transform.
Given you're not actually using the keys for anything, you can flatten the data structure into a single array and then sort it:
use strict;
use warnings;
my %HoH = (
abc => {value => "12",},
xyz => {number => "100",},
pqr => {digit => "5",},
);
my #numbers = sort {$b <=> $a} map {values %$_} values %HoH;
print "$_\n" for #numbers;
Outputs:
100
12
5
However, if you want to use the additional key information, then you'll need fold your Hash of Hash into an array, and then you can sort however you like:
my #array;
while (my ($k, $ref) = each %HoH) {
while (my ($k2, $v) = each %$ref) {
push #array, [$k, $k2, $v];
}
}
#array = sort {$b->[2] <=> $a->[2]} #array;
use Data::Dump;
dd \#array;
Outputs:
[
["xyz", "number", 100],
["abc", "value", 12],
["pqr", "digit", 5],
]
I came up with this solution
#!/usr/bin/perl
use strict;
use warnings;
my %HoH = (
abc => {
value => "12",
},
xyz => {
number => "100",
},
pqr => {
digit => "5",
}
);
my %rever;
for my $TopKey(keys %HoH){
for my $value(values %{ $HoH{$TopKey} }){
push #{ $rever{$value} }, $TopKey;
}
}
my #nums = sort {$b <=> $a} (keys(%rever));
print $_, "\n" for #nums;
I reversed the values in case you still needed to use the key names.
This is how it looks after using Dumper.
$VAR1 = '100';
$VAR2 = [
'xyz'
];
$VAR3 = '12';
$VAR4 = [
'abc'
];
$VAR5 = '5';
$VAR6 = [
'pqr'
];

Perl Working On Two Hash References

I would like to compare the values of two hash references.
The data dumper of my first hash is this:
$VAR1 = {
'42-MG-BA' => [
{
'chromosome' => '19',
'position' => '35770059',
'genotype' => 'TC'
},
{
'chromosome' => '2',
'position' => '68019584',
'genotype' => 'G'
},
{
'chromosome' => '16',
'position' => '9561557',
'genotype' => 'G'
},
And the second hash is similar to this but with more hashes in the array. I would like to compare the genotype of my first and second hash if the position and the choromosome matches.
map {print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n"}sort keys %$cave_snp_list;
map {print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n"}sort keys %$geno_seq_list;
I could do that for the first array of the hashes.
Could you help me in how to work for all the arrays?
This is my actual code in full
#!/software/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Benchmark;
use Config::Config qw(Sequenom.ini);
useDatabase::Conn;
use Data::Dumper;
GetOptions("sam=s" => \my $sample);
my $geno_seq_list = getseqgenotypes($sample);
my $cave_snp_list = getcavemansnpfile($sample);
#print Dumper($geno_seq_list);
print scalar %$geno_seq_list, "\n";
foreach my $sam (keys %{$geno_seq_list}) {
my $seq_used = $geno_seq_list->{$sam};
my $cave_used = $cave_snp_list->{$sam};
print scalar(#$geno_seq_list->{$_}) if sort keys %$geno_seq_list, "\n";
print scalar(#$cave_used), "\n";
#foreach my $seq2com (# {$seq_used } ){
# foreach my $cave2com( # {$cave_used} ){
# print $seq2com->{chromosome},":" ,$cave2com->{chromosome},"\n";
# }
#}
map { print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n" } sort keys %$cave_snp_list;
map { print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n" } sort keys %$geno_seq_list;
}
sub getseqgenotypes {
my $snpconn;
my $gen_list = {};
$snpconn = Database::Conn->new('live');
$snpconn->addConnection(DBI->connect('dbi:Oracle:pssd.world', 'sn', 'ss', { RaiseError => 1, AutoCommit => 0 }),
'pssd');
#my $conn2 =Database::Conn->new('live');
#$conn2->addConnection(DBI->connect('dbi:Oracle:COSI.world','nst_owner','nst_owner', {RaiseError =>1 , AutoCommit=>0}),'nst');
my $id_ind = $snpconn->execute('snp::Sequenom::getIdIndforExomeSample', $sample);
my $genotype = $snpconn->executeArrRef('snp::Sequenom::getGenotypeCallsPosition', $id_ind);
foreach my $geno (#{$genotype}) {
push #{ $gen_list->{ $geno->[1] } }, {
chromosome => $geno->[2],
position => $geno->[3],
genotype => $geno->[4],
};
}
return ($gen_list);
} #end of sub getseqgenotypes
sub getcavemansnpfile {
my $nstconn;
my $caveman_list = {};
$nstconn = Database::Conn->new('live');
$nstconn->addConnection(
DBI->connect('dbi:Oracle:CANP.world', 'nst_owner', 'NST_OWNER', { RaiseError => 1, AutoCommit => 0 }), 'nst');
my $id_sample = $nstconn->execute('nst::Caveman::getSampleid', $sample);
#print "IDSample: $id_sample\n";
my $file_location = $nstconn->execute('nst::Caveman::getCaveManSNPSFile', $id_sample);
open(SNPFILE, "<$file_location") || die "Error: Cannot open the file $file_location:$!\n";
while (<SNPFILE>) {
chomp;
next if /^>/;
my #data = split;
my ($nor_geno, $tumor_geno) = split /\//, $data[5];
# array of hash
push #{ $caveman_list->{$sample} }, {
chromosome => $data[0],
position => $data[1],
genotype => $nor_geno,
};
} #end of while loop
close(SNPFILE);
return ($caveman_list);
}
The problem that I see is that you're constructing a tree for generic storage of data, when what you want is a graph, specific to the task. While you are constructing the record, you could also be constructing the part that groups data together. Below is just one example.
my %genotype_for;
my $record
= { chromosome => $data[0]
, position => $data[1]
, genotype => $nor_geno
};
push #{ $gen_list->{ $geno->[1] } }, $record;
# $genotype_for{ position }{ chromosome }{ name of array } = genotype code
$genotype_for{ $data[1] }{ $data[0] }{ $sample } = $nor_geno;
...
return ( $caveman_list, \%genotype_for );
In the main line, you receive them like so:
my ( $cave_snp_list, $geno_lookup ) = getcavemansnpfile( $sample );
This approach at least allows you to locate similar position and chromosome values. If you're going to do much with this, I might suggest an OO approach.
Update
Assuming that you wouldn't have to store the label, we could change the lookup to
$genotype_for{ $data[1] }{ $data[0] } = $nor_geno;
And then the comparison could be written:
foreach my $pos ( keys %$small_lookup ) {
next unless _HASH( my $sh = $small_lookup->{ $pos } )
and _HASH( my $lh = $large_lookup->{ $pos } )
;
foreach my $chrom ( keys %$sh ) {
next unless my $sc = $sh->{ $chrom }
and my $lc = $lh->{ $chrom }
;
print "$sc:$sc";
}
}
However, if you had limited use for the larger list, you could construct the specific case
and pass that in as a filter when creating the longer list.
Thus, in whichever loop creates the longer list, you could just go
...
next unless $sample{ $position }{ $chromosome };
my $record
= { chromosome => $chromosome
, position => $position
, genotype => $genotype
};
...