Mapping values with Column header and row header - perl

I have some files with below data.
sample File 1:
sitename1,2009-07-19,"A1",11975,17.23
sitename1,2009-07-19,"A2",11,0.02
sitename1,2009-07-20,"A1",2000,17.23
sitename1,2009-07-20,"A2",538,0.02
I want to map the values in column 4 with column 2 and 3 as shown below.
Output required.
Site,Type,2009-07-19,2009-07-20
sitename1,"A1",11975,2000
sitename1,"A2",11,538
Here is what I have tried so far:
#! /usr/bin/perl -w
use strict;
use warnings;
my $column_header=["Site,Type"];
my $position={};
my $last_position=0;
my $current_event=[];
my $events=[];
while (<STDIN>) {
my ($site,$date,$type,$value,$percent) = split /[,\n]/, $_;
my $event_key = $date;
if (not defined $position->{$event_key}) {
$last_position+=1;
$position->{$event_key}=$last_position;
push #$column_header,$event_key;
}
my $pos = $position->{$event_key};
if (defined $current_event->[$pos]) {
dumpEvent();
}
if (not defined $current_event->[0]) {
$current_event->[0]="$site,$type";
}
$current_event->[$pos]=$value;
}
dumpEvent();
my $order = [];
for (my $scan=0; $scan<scalar(#$column_header); $scan++) {
push #$order,$scan;
}
printLine($column_header);
map { printLine($_) } #$events;
sub printLine {
my $record=shift;
my #result=();
foreach my $offset (#$order) {
if (defined $record->[$offset]) {
push #result,$record->[$offset];
} else {
push #result,"";
}
}
print join(",",#result)."\n";
}
sub dumpEvent {
return unless defined $current_event->[0];
push #$events,$current_event;
$current_event=[];
}
The output i am getting is as below.
*Site,Type,2009-07-19,2009-07-20*
sitename1,"A1",11975,
sitename1,"A2",11,
sitename1,"A1",,14620
sitename1,"A2",,538

If I understand you correctly (and I have to admit I'm only guessing), you have several types of things at different dates and a value for each. Thus you need a data structure like this hash for each site:
$foo = {
site => 'sitename1',
type => 'A1',
dates => [
{
date => '2009-07-19',
value => 11975,
},
{
date => '2009-07-20',
value => 538,
},
],
};
Is that even close?

The folowing code produces the expected result and makes "some" sense. I don't know if it makes real sense.
my %dates;
my %SiteType;
while (<DATA>) {
chomp;
my ($site,$date,$type,$value,$percent) = split /,/;
$dates{$date} = '1';
push #{$SiteType{"$site,$type"}}, $value ;
};
print 'Site,Type,', join(',', sort keys %dates), "\n";
foreach ( sort keys %SiteType) {
print $_, ',', join(',', #{$SiteType{$_}}), "\n";
};

Related

need to modify perl Deep::Hash::Utils

i am very new to perl. i am trying to use the below code from CPAN.
my $C;
# Recursive version of C<each>;
sub reach {
my $ref = shift;
if (ref $ref eq 'HASH') {
if (defined $C->{$ref}{v}) {
if (ref $C->{$ref}{v} eq 'HASH') {
if (my #rec = reach($C->{$ref}{v})) {
return ($C->{$ref}{k},#rec);
}
} elsif (ref $C->{$ref}{v} eq 'ARRAY') {
if (my #rec = reach($C->{$ref}{v})) {
if (defined $C->{$ref}{k}) {
return $C->{$ref}{k},#rec;
}
return #rec;
}
}
undef $C->{$ref};
}
if (my ($k,$v) = each %$ref) {
$C->{$ref}{v} = $v;
$C->{$ref}{k} = $k;
return ($k,reach($v));
}
return ();
} elsif (ref $ref eq 'ARRAY') {
if (defined $C->{$ref}{v}) {
if (ref $C->{$ref}{v} eq 'HASH' ||
ref $C->{$ref}{v} eq 'ARRAY') {
if (my #rec = reach($C->{$ref}{v})) {
if (defined $C->{$ref}{k}) {
return $C->{$ref}{k},#rec;
}
return #rec;
}
}
}
if (my $v = $ref->[$C->{$ref}{i}++ || 0]) {
$C->{$ref}{v} = $v;
return (reach($v));
}
return ();
}
return $ref;
}
input:
bar => {cmd_opts => { gld_upf => ['abc' , 'def']} }
current output:
[bar, cmd_opts, gld_upf, abc]
[bar, cmd_opts, gld_upf, def]
desired output:
[bar, cmd_opts, gld_upf, ['abc', 'def']]
also, what are the concepts that are being used in this code?
are there any books/courses i can take for this?
also, what are the concepts that are being used in this code? are there any books/courses i can take for this?
The code mentioned by you from the Deep::Hash::Utils CPAN module is mainly handling nested data structures.
A couple of places to read about these:
the official docs: perldsc ; perlreftut ; perlref ;
Modern Perl by chromatic has a section on Nested Data Structures around page 60
Intermediate Perl: Beyond The Basics of Learning Perl 2nd edition has a section about Nested Data Structures around page 44.
In the most basic case, in these nested data structures, every node has one of the following types:
scalar
hashref
arrayref
In turn, the values in the array pointed to by an arrayref can be of type scalar/hashref/arrayref.
The same goes for the values of the hash pointed to by an arrayref, it can be of type scalar/hashref/arrayref.
This induces a tree-like structure. The algorithm for traversing such a tree is depth-first search
where some additional logic is required to check the type of the node and depending on the type decide how to proceed further down the tree.
To make a parallel, all of this is not that much different from traversing a filesystem hierarchy (see link1, link2).
A bigger list called perlres on Perl resources is available.
In this specific case, the function reach from Deep::Hash::Utils acts as an iterator, and it returns all paths descending from the root down to each leaf.
Whenever a #path to a leaf is found, its elements are compared side-by-side with another list called #output, and there are three cases:
there's no element on that position, so we store it
the elements are equal, so we skip them
the elements are different, so we merge them together in a list
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Deep::Hash::Utils qw/reach/;
my $input = { bar => {cmd_opts => { gld_upf => ['abc' , 'def']} } };
my #output = ();
while (my #path = reach($input)) {
for(my $i=0;$i<=$#path;$i++){
if(defined $output[$i]) {
if(ref($output[$i]) eq "") {
if($output[$i] eq $path[$i]) {
next;
};
my $e1 = $output[$i];
my $e2 = $path[$i];
$output[$i] = [$e1,$e2];
}elsif(ref($output[$i]) eq "ARRAY"){
push #{$output[$i]}, $path[$i];
};
} else {
$output[$i] = $path[$i];
};
};
}
print Dumper \#output;
OUTPUT:
$VAR1 = [
'bar',
'cmd_opts',
'gld_upf',
[
'abc',
'def'
]
];

Split string into a hash of hashes (perl)

at the moment im a little confused..
I am looking for a way to write a string with an indefinite number of words (separated by a slash) in a recursive hash.
These "strings" are output from a text database.
Given is for example
"office/1/hardware/mouse/count/200"
the next one can be longer or shorter..
This must be created from it:
{
office {
1{
hardware {
mouse {
count => 200
}
}
}
}
}
Any idea ?
Work backwards. Split the string. Use the last two elements to make the inner-most hash. While more words exist, make each one the key of a new hash, with the inner hash as its value.
my $s = "office/1/hardware/mouse/count/200";
my #word = split(/\//, $s);
# Bottom level taken explicitly
my $val = pop #word;
my $key = pop #word;
my $h = { $key => $val };
while ( my $key = pop #word )
{
$h = { $key => $h };
}
Simple recursive function should do
use strict;
use warnings;
use Data::Dumper;
sub foo {
my $str = shift;
my ($key, $rest) = split m|/|, $str, 2;
if (defined $rest) {
return { $key => foo($rest) };
} else {
return $key;
}
}
my $hash = foo("foo/bar/baz/2");
print Dumper $hash;
Gives output
$VAR1 = {
'foo' => {
'bar' => {
'baz' => '2'
}
}
};
But like I said in the comment: What do you intend to use this for? It is not a terribly useful structure.
If there are many lines to be read into a single hash and the lines have a variable number of fields, you have big problems and the other two answers will clobber data by either smashing sibling keys or overwriting final values. I'm supposing this because there is no rational reason to convert a single line into a hash.
You will have to walk down the hash with each field. This will also give you the most control over the process.
our $hash = {};
our $eolmark = "\000";
while (my $line = <...>) {
chomp $line;
my #fields = split /\//, $line;
my $count = #fields;
my $h = $hash;
my $i = 0;
map { (++$i == $count) ?
($h->{$_}{$eolmark} = 1) :
($h = $h->{$_} ||= {});
} #fields;
}
$h->{$_}{$eolmark} = 1 You need the special "end of line" key so that you can recognize the end of a record and still permit longer records to coexist. If you had two records
foo/bar/baz foo/bar/baz/quux, the second would overwrite the final value of the first.
$h = $h->{$_} ||= {} This statement is a very handy idiom to both create and populate a cache in one step and then take a shortcut reference to it. Never do a hash lookup more than once.
HTH

Printing Hash of Hash into a Matrix Table in Perl

I have a data structure like this:
#!/usr/bin/perl -w
my $hash = {
'abTcells' => {
'mesenteric_lymph_node' => {
'Itm2a' => '664.661',
'Gm16452' => '18.1425',
'Sergef' => '142.8205'
},
'spleen' => {
'Itm2a' => '58.07155',
'Dhx9' => '815.2795',
'Ssu72' => '292.889'
}
}
};
What I want to do is to print it out into this format:
mesenteric_lymph_node spleen
Itm2a 664.661 58.07155
Gm16452 18.1425 NA
Sergef 142.8205 NA
Dhx9 NA 815.2795
Ssu72 NA 292.889
What's the way to do it.
I'm currently stuck with the following code https://eval.in/44207
foreach my $ct (keys %{$hash}) {
print "$ct\n\n";
my %hash2 = %{$hash->{$ct}};
foreach my $ts (keys %hash2) {
print "$ts\n";
my %hash3 = %{$hash2{$ts}};
foreach my $gn (keys %hash3) {
print "$gn $hash3{$gn}\n";
}
}
}
Use Text::Table for output. Beautify to taste.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Table;
my $hash = {
'abTcells' => {
'mesenteric_lymph_node' => {
'Itm2a' => '664.661',
'Gm16452' => '18.1425',
'Sergef' => '142.8205'
},
'spleen' => {
'Itm2a' => '58.07155',
'Dhx9' => '815.2795',
'Ssu72' => '292.889'
}
}
};
my $struct = $hash->{abTcells};
my #cols = sort keys %{ $struct };
my #rows = sort keys %{ { map {
my $x = $_;
map { $_ => undef }
keys %{ $struct->{$x} }
} #cols } };
my $tb = Text::Table->new('', #cols);
for my $r (#rows) {
$tb->add($r, map $struct->{$_}{$r} // 'NA', #cols);
}
print $tb;
Output:
mesenteric_lymph_node spleen
Dhx9 NA 815.2795
Gm16452 18.1425 NA
Itm2a 664.661 58.07155
Sergef 142.8205 NA
Ssu72 NA 292.889
Now, the order of the rows above is different than the one you show because I wanted it to be consistent. If you know the set of all possible rows, then you can specify another order obviously.
First thing would be to separate out the two hashes:
my %lymph_node = %{ $hash->{abTcells}->{mesenteric_lymph_node} };
my %spleen = %{ $hash->{abTcells}->{spleen} };
Now, you have two separate hashes that contains the data you want.
What we need is a list of all the keys. Let's make a third hash that contains your keys.
my %keys;
map { $keys{$_} = 1; } keys %lymph_node, keys %spleen;
Now, we can go through all your keys and print the value for each of the two hashes. If one of the hashes doesn't have the data, we'll set it to NA:
for my $value ( sort keys %keys ) {
my $spleen_value;
my $lymph_nodes_value;
$spleen_value = exists $spleen{$value} ? $spleen{$value} : "NA";
$lymph_node_value = exists $lymph_node{$value} ? $lymph_node{$value} : "NA";
printf "%-20.20s %-9.5f %-9.5f\n", $key, $lymph_node_value, $spleen_value;
}
The printf statement is a nice way to tabularize data. You'll have to create the headings yourself. The ... ? ... : ... statement is an abbreviated if/then/else If the statement before the ? is true, then the value is the value between the ? and the :. Else, the value is the value after the :.
Both of your inner hashes have the same keys, So do a foreach on one of the hashes to get the key, and then print both.

Equivalent of "shift" for a hash to create a $class->next() method

I almost feel like saying "it's me again!".
Anyway, here we go.
I like using while $object->next() style constructs. They appeal to me and seem "neat".
Now, when the thing I'm iterating over is an array, it's straightforward ("shift #ary or return undef")
sub next {
my ( $self, $args ) = #_;
my $next = shift #{ $self->{list_of_things} } or return undef;
my ( $car, $engine_size, $color )
= split( /\Q$opts->{fieldsep}/, $next );
$self->car = $host;
$self->engine_size = $engine_size;
$self->color = $color;
}
In this example I use AUTOLOAD to create the getters and setters and then have those instance variables available in my object during the while loop.
I'd like to do something similar but with the "list_of_things" being a %hash.
Here's a non-OO example that doesn't make it into the first iteration. Any ideas why?
(The total "list_of_things" is not that big - maybe 100 entries - so to do a keys(%{$hash}) every time doesn't seem too wasteful to me).
use strict;
use warnings;
use Data::Dumper;
my $list_of_things = {
volvo => {
color => "red",
engine_size => 2000,
},
bmw => {
color => "black",
engine_size => 2500,
},
mini => {
color => "british racing green",
engine_size => 1200,
}
};
sub next {
my $args = $_;
my #list = keys( %{$list_of_things} );
return undef if scalar #list == "0";
my $next = $list_of_things->{ $list[0] };
delete $list_of_things->{ $list[0] };
return $next;
}
while ( next()) {
print Dumper $_;
print scalar keys %{ $list_of_things }
}
Is there a better way of doing this? Am I doing something crazy?
EDIT:
I tried Ikegami's suggestion. Of course, Ikegami's example works flawlessly. When I try and abstract a little, so that all that is exposed to the object is a next->() method, I get the same "perl-going-to-100%-cpu" problem as in my original example.
Here's a non-OO example:
use Data::Dumper qw( Dumper );
sub make_list_iter {
my #list = #_;
return sub { #list ? shift(#list) : () };
}
sub next {
make_list_iter( keys %$hash );
}
my $hash = { ... };
while ( my ($k) = next->() ) {
print Dumper $hash->{$k};
}
It does not seem to get past the first step of the while() loop.
I am obviously missing something here...
If you don't want to rely on the hash's builtin iterator (used by each, keys and values), there's nothing stopping you from making your own.
use Data::Dumper qw( Dumper );
sub make_list_iter {
my #list = #_;
return sub { #list ? shift(#list) : () };
}
my $list_of_things = { ... };
my $i = make_list_iter(keys %$list_of_things);
while (my ($k) = $i->()) {
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Indent = 0;
say "$k: " . Dumper($list_of_things->{$k});
}
The each operator is a builtin that iterates over hashes. It returns undef when it runs out of elements to return. So you could so something like
package SomeObject;
# creates new object instance
sub new {
my $class = shift;
return bless { hash_of_things => { #_ } }, $class
}
sub next {
my $self = shift;
my ($key,$value) = each %{ $self->{hash_of_things} };
return $key; # or return $value
}
Calling keys on the hash will reset the each iterator. It's good to know this so you can reset it on purpose:
sub reset {
my $self = shift;
keys %{ $self->{hash_of_things} }
}
and so you can avoid resetting it on accident.
The section on tie'ing hashes in perltie also has an example like this.
Here's how List::Gen could be used to create an iterator from a list:
use strict;
use warnings;
use List::Gen 'makegen';
my #list_of_things = ( # This structure is more suitable IMO
{
make => 'volvo',
color => 'red',
engine_size => 2000,
},
{
make => 'bmw',
color => 'black',
engine_size => 2500,
},
{
make => 'mini',
color => 'british racing green',
engine_size => 1200,
}
);
my $cars = makegen #list_of_things;
print $_->{make}, "\n" while $cars->next;
Well, if you don't need $list_of_things for later, you can always do something like
while(keys %$list_of_things)
{
my $temp=(sort keys %$list_of_things)[0];
print "key: $temp, value array: " . join(",",#{$list_of_things->{$temp}}) . "\n";
delete $list_of_things->{$temp};
}
And if you do need it, you can always assign it to a temporary hash reference and perform the same while loop on it.

Perl Working On Two Hash References

I would like to compare the values of two hash references.
The data dumper of my first hash is this:
$VAR1 = {
'42-MG-BA' => [
{
'chromosome' => '19',
'position' => '35770059',
'genotype' => 'TC'
},
{
'chromosome' => '2',
'position' => '68019584',
'genotype' => 'G'
},
{
'chromosome' => '16',
'position' => '9561557',
'genotype' => 'G'
},
And the second hash is similar to this but with more hashes in the array. I would like to compare the genotype of my first and second hash if the position and the choromosome matches.
map {print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n"}sort keys %$cave_snp_list;
map {print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n"}sort keys %$geno_seq_list;
I could do that for the first array of the hashes.
Could you help me in how to work for all the arrays?
This is my actual code in full
#!/software/bin/perl
use strict;
use warnings;
use Getopt::Long;
use Benchmark;
use Config::Config qw(Sequenom.ini);
useDatabase::Conn;
use Data::Dumper;
GetOptions("sam=s" => \my $sample);
my $geno_seq_list = getseqgenotypes($sample);
my $cave_snp_list = getcavemansnpfile($sample);
#print Dumper($geno_seq_list);
print scalar %$geno_seq_list, "\n";
foreach my $sam (keys %{$geno_seq_list}) {
my $seq_used = $geno_seq_list->{$sam};
my $cave_used = $cave_snp_list->{$sam};
print scalar(#$geno_seq_list->{$_}) if sort keys %$geno_seq_list, "\n";
print scalar(#$cave_used), "\n";
#foreach my $seq2com (# {$seq_used } ){
# foreach my $cave2com( # {$cave_used} ){
# print $seq2com->{chromosome},":" ,$cave2com->{chromosome},"\n";
# }
#}
map { print "$_= $cave_snp_list->{$_}->[0]->{chromosome}\n" } sort keys %$cave_snp_list;
map { print "$_= $geno_seq_list->{$_}->[0]->{chromosome}\n" } sort keys %$geno_seq_list;
}
sub getseqgenotypes {
my $snpconn;
my $gen_list = {};
$snpconn = Database::Conn->new('live');
$snpconn->addConnection(DBI->connect('dbi:Oracle:pssd.world', 'sn', 'ss', { RaiseError => 1, AutoCommit => 0 }),
'pssd');
#my $conn2 =Database::Conn->new('live');
#$conn2->addConnection(DBI->connect('dbi:Oracle:COSI.world','nst_owner','nst_owner', {RaiseError =>1 , AutoCommit=>0}),'nst');
my $id_ind = $snpconn->execute('snp::Sequenom::getIdIndforExomeSample', $sample);
my $genotype = $snpconn->executeArrRef('snp::Sequenom::getGenotypeCallsPosition', $id_ind);
foreach my $geno (#{$genotype}) {
push #{ $gen_list->{ $geno->[1] } }, {
chromosome => $geno->[2],
position => $geno->[3],
genotype => $geno->[4],
};
}
return ($gen_list);
} #end of sub getseqgenotypes
sub getcavemansnpfile {
my $nstconn;
my $caveman_list = {};
$nstconn = Database::Conn->new('live');
$nstconn->addConnection(
DBI->connect('dbi:Oracle:CANP.world', 'nst_owner', 'NST_OWNER', { RaiseError => 1, AutoCommit => 0 }), 'nst');
my $id_sample = $nstconn->execute('nst::Caveman::getSampleid', $sample);
#print "IDSample: $id_sample\n";
my $file_location = $nstconn->execute('nst::Caveman::getCaveManSNPSFile', $id_sample);
open(SNPFILE, "<$file_location") || die "Error: Cannot open the file $file_location:$!\n";
while (<SNPFILE>) {
chomp;
next if /^>/;
my #data = split;
my ($nor_geno, $tumor_geno) = split /\//, $data[5];
# array of hash
push #{ $caveman_list->{$sample} }, {
chromosome => $data[0],
position => $data[1],
genotype => $nor_geno,
};
} #end of while loop
close(SNPFILE);
return ($caveman_list);
}
The problem that I see is that you're constructing a tree for generic storage of data, when what you want is a graph, specific to the task. While you are constructing the record, you could also be constructing the part that groups data together. Below is just one example.
my %genotype_for;
my $record
= { chromosome => $data[0]
, position => $data[1]
, genotype => $nor_geno
};
push #{ $gen_list->{ $geno->[1] } }, $record;
# $genotype_for{ position }{ chromosome }{ name of array } = genotype code
$genotype_for{ $data[1] }{ $data[0] }{ $sample } = $nor_geno;
...
return ( $caveman_list, \%genotype_for );
In the main line, you receive them like so:
my ( $cave_snp_list, $geno_lookup ) = getcavemansnpfile( $sample );
This approach at least allows you to locate similar position and chromosome values. If you're going to do much with this, I might suggest an OO approach.
Update
Assuming that you wouldn't have to store the label, we could change the lookup to
$genotype_for{ $data[1] }{ $data[0] } = $nor_geno;
And then the comparison could be written:
foreach my $pos ( keys %$small_lookup ) {
next unless _HASH( my $sh = $small_lookup->{ $pos } )
and _HASH( my $lh = $large_lookup->{ $pos } )
;
foreach my $chrom ( keys %$sh ) {
next unless my $sc = $sh->{ $chrom }
and my $lc = $lh->{ $chrom }
;
print "$sc:$sc";
}
}
However, if you had limited use for the larger list, you could construct the specific case
and pass that in as a filter when creating the longer list.
Thus, in whichever loop creates the longer list, you could just go
...
next unless $sample{ $position }{ $chromosome };
my $record
= { chromosome => $chromosome
, position => $position
, genotype => $genotype
};
...