error pushing value into hash of array - perl

I am parsing an output report from psiblast. I used COG alignments and searched a gene database for matches (homologues). One thing that I want to do is to find out which genes match to more than one COG. My partial script is below.
I am specifically having problems creating an array that holds all of the COGs for the genes that are assigned to multiple COGs.
I am getting the following error "Can't use string ("COG0003") as an ARRAY ref while "strict refs" in use at parse_POG_reports.pl line 26, line 67.".
I have looked at other posted relating to pushing elements into hashes of arrays. But I think the error might be occurring when one gene has 2 matches to the same COG, and it tries to push the same COG into the array (ie. the last 2 lines of the sample input). Does this make sense? If so, how can I avoid this problem?
use strict;
use warnings;
my %maxBits;my %COGhit_count;
my $Hohits={};my %COGhits;
my $COG_psi_report=$ARGV[0];
open (IN, $COG_psi_report) or die "cannot open $COG_psi_report\n";
while (my $line=<IN>){
next if ($line =~/^#/);
chomp $line;
my #columns = split(/\t/,$line);
my $bits=$columns[11];
my $COG=$columns[0];
my $hit=$columns[1];
my $Eval=$columns[10];
next if ($Eval > 0.00001); # threshold for significant hits set by DK
$COGhit_count{$hit}++; # count how many COGs each gene is homologous to
$COGhits{$hit}=$COG;
if ($COGhit_count{$hit}>1) {
push #{$COGhits{$hit}}, $COG; #
}
## for those that there are multiple hits we need to select top hit ##
if (!exists $maxBits{$hit}){
$maxBits{$hit}=$bits;
}
elsif (exists $maxBits{$hit} && $bits > $maxBits{$hit}){
$maxBits{$hit}=$bits;
}
$Hohits->{$hit}->{$bits}=$COG;
}
close (IN);
example Input:
POG0002 764184357-stool1_revised_scaffold22981_1_gene47608 23.90 159 112 3 1 156 1 153 2e-06 54.2
POG0002 764062976-stool2_revised_C999233_1_gene54902 23.63 182 121 5 3 169 2 180 2e-06 53.9
POG0002 763901136-stool1_revised_scaffold39447_1_gene145241 26.45 155 89 3 3 137 5 154 3e-06 53.9
POG0002 765701615-stool1_revised_C1349270_1_gene168522 23.53 187 115 5 3 169 2 180 5e-06 53.1
POG0002 158802708-stool2_revised_C1077267_1_gene26470 22.69 216 158 5 3 213 5 216 5e-06 52.7
POG0003 160502038-stool1_revised_scaffold47906_2_gene161164 33.00 297 154 6 169 424 334 626 6e-40 157
POG0003 160502038-stool1_revised_scaffold47906_2_gene161164 16.28 172 128 4 23 192 46 203 1e-06 56.6
POG0003 158337416-stool1_revised_C1254444_1_gene13533 30.06 346 184 7 133 424 57 398 6e-40 155
POG0003 158337416-stool1_revised_scaffold29713_1_gene153054 28.61 332 194 8 132 424 272 599 2e-38 152
POG0003 158337416-stool1_revised_scaffold29713_1_gene153054 24.00 200 131 5 1 193 5 190 9e-11 69.3

You need to get rid of line 24 (counting backwards):
$COGhits{$hit}=$COG;
In it, you are setting $COGhits{$hit} to a scalar value (the value of $COG). Later, in line 26 you are trying to dereference $COGhits{$hit} as an array to push into it. That doesn't work because there's a scalar in there.
Just remove the if and change those lines into this. That should do the trick as now all those $hits are stored in array references.
$COGhit_count{$hit}++; # count how many COGs each gene is homologous to
push #{$COGhits{$hit}}, $COG;
Output of $COGhits:
$VAR4 = {
'158802708-stool2_revised_C1077267_1_gene26470' => [
'POG0002'
],
'764062976-stool2_revised_C999233_1_gene54902' => [
'POG0002'
],
'764184357-stool1_revised_scaffold22981_1_gene47608' => [
'POG0002'
],
'765701615-stool1_revised_C1349270_1_gene168522' => [
'POG0002'
],
'763901136-stool1_revised_scaffold39447_1_gene145241' => [
'POG0002'
],
'160502038-stool1_revised_scaffold47906_2_gene161164' => [
'POG0003',
'POG0003'
]
};
If you however want both the scalar and the array ref, try this code. I don't recommend this, though.
$COGhit_count{$hit}++; # count how many COGs each gene is homologous to
if ($COGhit_count{$hit} == 1) {
$COGhits{$hit}=$COG; # Save as scalar
}
elsif ($COGhit_count{$hit} == 2) { # If we've just found the second hit,
my $temp = $COGhits{$hit}; # save the first and convert $COGhits{$hit}
$COGhits{$hit} = []; # to an array ref, then push both the old and
push #{$COGhits{$hit}}, $temp, $COG; # the new value in it.
} elsif ($COGhit_count{$hit} > 2) {
push #{$COGhits{$hit}}, $COG; # Just push the new value in
}
Thought: You probably had $COGhits{$hit}=$COG first but then noticed that sometimes there can be more than one value, so you added the push line, but you did not realized that you in fact had to replace the old line.

It's telling you exactly what you're doing wrong.
$COGhits{$hit}=$COG; # <--- scalar
if ($COGhit_count{$hit}>1) {
push #{$COGhits{$hit}}, $COG; # <--- array
}
You can't assign the value as a non-reference type and then try to autovivify it as a reference type. Perl will do the latter, but not if you've already stored a conflicting data type in that location.
Also, if this by some miracle worked the first time (it won't), and you ran this more than once, any array that you might have autoviv-ed by the push, would be clobbered by the scalar non-reference assignment.
I'm not sure what you're after, but the first line should probably be deleted.
Instead of that construct, you want to decide whether there will ever be more than one specification of $COG for a value of $hit. If there can be, simply replacing those 4 lines with the push is the way to go.
I have done multi-purpose structure slots before, and they are largely a pain to maintain. But if you wanted to do something like that, you can do this:
my $ref = \$hashref->{ $key }; # autovivifies slot as simple scalar.
# it starts out as undefined.
if ( ref $$ref ) { # ref $ref will always be true
push #$$ref, $value;
}
else {
$$ref = defined( $$ref ) ? [ $$ref, $value ] : $value;
}
But you have to write bifurcated logic every time you want to access the mixed tree in some different way. The performance savings that you get with scalars, is somewhat eaten up by the tests and branching.
So I don't do too much of this anymore. I decide beforehand whether the relationship is 1-1 or 1-n. Routines like the ones below can make it more straightforward dealing with these kinds of tables, to a degree.
sub get_list_from_hash {
my ( $hash, $key ) = #_;
my $ref = \$hash->{ $key };
return unless defined( $$ref );
return ref( $$ref ) ? #$$ref : $$ref;
}
sub store_in_hash {
$_[0] = {} unless ref $_[0];
my ( $hash, $key, #values ) = #_;
my #defined = grep {; defined } #values;
unless ( #defined ) {
delete $hash->{ $key };
return;
}
my $ref = \$hash->{ $key };
if ( ref $$ref ) {
push #$$ref, #defined;
}
elsif ( defined $$ref ) {
$$ref = [ $$ref, #defined ];
}
elsif ( #values > 1 ) {
#$$ref = #defined;
}
else {
( $$ref ) = #defined;
}
}

Related

Merging N no of files based on their first column in perl

My question is similar to this question posted earlier.
I am having many files which I need to merge them based on the presence or absence of the first column ID, but while merging I am getting lots of empty values in my output file, I want those empty values to be zero if it is not present in another file. The example below is based on only two files content, but I have many sample files like this format (tabular).
For example:
File1
ID Value
123 1
231 2
323 3
541 7
File2
ID Value
541 6
123 1
312 3
211 4
Expected Output:
ID File1 File2
123 1 1
231 2 0
323 3 0
541 7 6
312 0 3
211 0 4
Obtaining Output:
ID File1 File2
123 1 1
231 2
323 3
541 7 6
312 undef 3
211 undef 4
As you can see above I am getting output but in file2 column, it's not adding zero or leaving empty and in case of file1 column it is having undef value. I have checked undef values and then my final output gives zeros in place of undef values but still I am having those empty spaces. Please find my code below (hardcoded only for two files).
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Data::Dumper;
my $path = "/home/pranjay/Projects/test";
my #files = ("s1.txt","s2.txt");
my %classic_com;
my $cnt;
my $classic_txt;
my $sample_cnt = 0;
my $classic_txtcomb = "test_classic.txt";
open($classic_txt,">$path/$classic_txtcomb") or die "Couldn't open file
$classic_txtcomb for writing,$!";
print $classic_txt "#ID\t"."file1\tfile2\n";
foreach my $file(#files){
$sample_cnt++;
print "$sample_cnt\n";
open($cnt,"<$path/$file")or die "Couldn't open file $file for reading,$!";
while(<$cnt>){
chomp($_);
my #count = ();
next if($_=~/^ID/);
my #record=();
#record=split(/\t/,$_);
my $scnt = $sample_cnt -1;
if((exists($classic_com{$record[0]})) and ($sample_cnt > 0)){
${$classic_com{$record[0]}}[$scnt]=$record[1];
}else{
$count[$scnt] = "$record[1]";
$classic_com{$record[0]}= [#count];
}
}
}
my %final_txt=();
foreach my $key ( keys %classic_com ) {
#print "$key: ";
my #val = #{ $classic_com{$key} };
my #v;
foreach my $i ( #val ) {
if(not defined($i)){
$i = 0;
push(#v, $i);
}else{
push(#v, $i);
next;
}
}
$final_txt{$key} = [#v];
}
#print Dumper %classic_com;
while(my($key,$value)=each(%final_txt)){
my $val=join("\t", #{$value});
print $classic_txt "$key\t"."#{$value}"."\n";
}
Just read the input files into a hash of arrays. The topmost key is the ID, each inner array contains the value for file i on the i-th position. When printing, use the // defined-or operator to replace undefs with zeroes:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my %merged;
my $file_tally = 0;
while (my $file = shift) {
open my $in, '<', $file or die "$file: $!";
<$in>; # skip the header
while (<$in>) {
my ($id, $value) = split;
$merged{$id}[$file_tally] = $value;
}
++$file_tally;
}
for my $value (keys %merged) {
my #values = #{ $merged{$value} };
say join "\t", $value, map $_ // 0, #values[0 .. $file_tally - 1];
}
program.pl
my %val;
/ (\d+) \s+ (\d+) /x and $val{$1}{$ARGV} = $2 while <>;
pr( 'ID', my #f = sort keys %{{map%$_,values%val}} );
pr( $_, map$_//0, #{$val{$_}}{#f} ) for sort keys %val;
sub pr{ print join("\t",#_)."\n" }
Run:
perl program.pl s1.txt s2.txt
ID s1.txt s2.txt
123 1 1
211 0 4
231 2 0
312 0 3
323 3 0
541 7 6

Backslash before a subroutine call

As I was understanding the difference between [] and \ in references,I used both on subroutine the former was fine but when I tried later I thought it should give error but the below program in perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #b;
for my $i ( 0 .. 10 ) {
$b[$i] = \somefunc($i);
}
print Dumper( \#b );
sub somefunc {
my $n = shift;
my ( #a, $k );
for my $j ( 11 .. 13 ) {
$k = $n * $j;
push( #a, $k );
}
print "a: #a \n";
return #a;
}
gives output as :
a: 0 0 0
a: 11 12 13
a: 22 24 26
a: 33 36 39
a: 44 48 52
a: 55 60 65
a: 66 72 78
a: 77 84 91
a: 88 96 104
a: 99 108 117
a: 110 120 130
$VAR1 = [
\0,
\13,
\26,
\39,
\52,
\65,
\78,
\91,
\104,
\117,
\130
];
I was unable to understand the output.Need explanation.
What is happening here is:
You are returning an array from somefunc.
But you are assigning it to a scalar. What this is effectively doing therefore, is simply putting the last value in the array, into the scalar value.
my $value = ( 110, 120, 130 );
print $value;
When you do this - $value is set to the last value in the array. This is what's actually happening in your code. See for example perldata:
List values are denoted by separating individual values by commas (and enclosing the list in parentheses where precedence requires it):
(LIST)
In a context not requiring a list value, the value of what appears to be a list literal is simply the value of the final element, as with the C comma operator. For example,
#foo = ('cc', '-E', $bar);
assigns the entire list value to array #foo, but
foo = ('cc', '-E', $bar);
assigns the value of variable $bar to the scalar variable $foo. Note that the value of an actual array in scalar context is the length of the array; the following assigns the value 3 to $foo:
#foo = ('cc', '-E', $bar);
$foo = #foo; # $foo gets 3
It's this latter case that's often the gotcha, because it's a list in a scalar context.
And in your example - the backslash prefix denotes 'reference to' - which is largely meaningless because it's a reference to a number.
But for a scalar, it might be more meaningful:
my $newvalue = "fish";
my $value = ( 110, 120, 130, \$newvalue );
print Dumper $value;
$newvalue = 'barg';
print Dumper $value;
Gives:
$VAR1 = \'fish';
$VAR1 = \'barg';
That's why you're getting the results. Prefix with the slash indicates that you're getting a reference to the result, not a reference to the sub. Reference to 130 isn't actually all that meaningful.
Normally, when doing the assignment above - you'd get a warning about Useless use of a constant (110) in void context but this doesn't apply when you've got a subroutine return.
If you wanted to insert a sub reference, you'd need to add &, but if you just want to insert the returned array by reference - you either need to:
$b[$i] = [somefunc($i)]
Or:
return \#a;

How can I use the until function with appropriate way

I have a file that I want to filter which is like that:
##matrix=axtChain 16 91,-114,-31,-123,-114,100,-125,-31,-31,-125,100,-114,-123,-31,-114,91
##gapPenalties=axtChain O=400 E=30
chain 21455232 chr20 14302601 + 37457 14119338 chr22 14786829 + 3573 14759345 1
189 159 123
24 30 22
165 21 20
231 105 0
171 17 19
261 0 2231
222 2 0
253 56 48
chain 164224 chr20 14302601 + 1105938 1125118 chr22 14786829 + 1081744 1100586 8
221 352 334
24 100 112
34 56 56
26 50 47
…………………….
chain 143824 chr20 14302601 + 1105938 1125118 chr22 14786829 + 1081744 1100586 8
So, briefly,there are blocks separated by a blank line.
Each block begins with the line " chain xxxxx " and continues with lines with numbers.
I want to filter out the file and keep just the blocks with chain and the number that follows be greater than 3000.
I wrote the following script to do that:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX;
my $chain = $ARGV[0];
#It filters the chains with chains >= 3000.
open my $chain_file, $chain or die "Could not open $chain: $!";
my #array;
while( my $cline = <$chain_file>) {
#next if /^\s*#/;
chomp $cline;
#my #lines = split (/ /, $cline);
if ($cline =~/^chain/) {
my #lines = split (/\s/, $cline);
if ($lines[1] >= 3000) {
#print $lines[1];
#my #lines = split (/ /, $cline);
#print "$cline\n";
push (#array, $cline);
}
}
until ($cline ne ' ') {
push (#array, $cline);
}
foreach (#array) {
print "$_\n";
}
undef(#array);
}
The problem is that I can print just the headers (chain XXXXX…..) and not the numbers that follows at the next lines of each block.
I'm using the until function till will find the blank line, but it doesn't work.
If someone could help me with that….
Thank you very much in advance,
Vasilis.
The first problem here is that ' ' is a single space, not a blank line ("" or '' should be fine since you've already chomp-ed the line.
The second problem is that
until ( $cline ne "" )
is the same as
while ( $cline eq "" )
which is the opposite of what you need to push lines to #array.
That said, the flip-flop operator is probably a more suitable construct for what you're after:
my #array;
while ( <$chain_file> ) { # Using $_ instead of $cline
chomp;
if ( do { /^chain\s+(\d+)/ && $1 >= 3000 } .. /^$/ ) {
# Accumulate lines in #array
push #array, $_; # False until LHS evaluates to true ...
} # ... then true until RHS evaluates to true
else {
for ( #array ) {
print $_, "\n"; # Print matches
}
#array = (); # Reset/clear out #array
}
}
It's usually best not to use unless instead of while. It negates the boolean expression many times leaving you with a double negative to solve. Here's an example
while ( $foo ne $bar ) {
Even though this is a negative expression, I can pretty easily figure out when to exit my loop. However:
until ( $foo eq $bar ) {
Just takes time to figure out.
Also, ' ' does not make a blank line: Use the regular expression $cline =~ /^\s*$/. However, even beyond that the loop:
until ($cline ne ' ') {
push (#array, $cline);
}
will go on forever if $cline does equal blank. You're never changing the value of $cline.
You can use what I use to call state variables (until Perl actually created a variable type called state and now I have no idea what to call them.) This variable tracks where you are in your file. Are you inside a chain section of the file? Do you want these lines or not? This way, you only have a single loop. You set your state variables and then process your loop.
In this example, I have a state variable called $keep_line which is asking whether or not I want to keep the lines I want to read in. If the line starts with chain and the second field is greater than 3000, I want to keep the entire block (if I understand what you're attempting to do). (By the way, I'm keeping blank lines. Is that okay?)
my $keep_lines = 0; # Aren't in lines I want to keep
my #good_lines; # Array where you want to keep the good lines
while ( my $line = <$chain_file> ) {
chomp $line; # Good habit to always chomp a input as soon as it's read.
if ( $line =~ /^chain/ ) { # This is a chain line. Do I want to keep this group?
my #fields = ( /\s+/, $line );
if ( $field[1] > 3000 ) { # Yes, if the second field is greater than 3000
$keep_lines = 1; # Keep this and all other lines until the next chain line
}
else {
$keep_lines = 0; # Skip until the next chain line
}
}
if ( $keep_lines ) {
push #good_lines, $line;
}
}
I also smell a function here: Instead of the tangle of if clauses, I would probably make this a function that returns the value I set $keep_lines to.
while ( my $line = <$chain_file> ) {
chomp $line; # Good habit to always chomp a input as soon as it's read.
$keep_lines = keep_this_section( $line );
if ( $keep_lines ) {
push #good_lines, $line;
}
}
Simplifies the code quite a bit.
I would also declare some constants to remove those Mysterious Moes. Those are things like 3000 and /^chain/ that have some sort of mysterious, but important meaning in your program. You can use constant as a pragma to define Perl constants. It's part of standard Perl:
use constant {
KEEP_LIMIT => 3000,
SECTION_HEADER => qr/^chain/,
};
Now, I can do things like this:
if ( $line =~ SECTION_HEADER ) {
instead of:
if ( $line =~ /^chain/ ) {
and
if ( $field[1] > KEEP_LIMIT ) {
instead of
if ( $field[1] > 3000 ) {
There are problems with the constant pragma. The biggest is that it just doesn't interpolate in places where Perl will normally interpolate variables. This include double quoted strings and hash keys. If I have $foo{KEEP_LIMIT}, Perl will interpret the key as a string KEEP_LIMIT and not as a constant of KEEP_LIMIT.
Many developers use Readonly which is just so much better in so many ways. Unfortunately, Readonly isn't a standard Perl module, so you have to install it via CPAN, and that's sometimes not possible or desirable to do. So, I tend to use constant.

Distance between one point to all other in a PDB file

I have a PDB file. Now it has two parts separated by TER. Before TER I call it part 1. I want to take x,y,z of ATOM 1 of first part i.e before TER and find distance to all x,y,z co ordinates after TER and then second ATOM of part one to all ATOMS of part second. This has to be repeated for all ATOMS of first part= to all ATOMS of second part. I have to automate it for 20 files. names of my files begin like 1_0.pdb,2_0.pdb....20_0.pdb.
This is a distance calculation. I have tried something in PERL but its very rough. Can someone help a bit.
The File looks like:
----long file (I truncated it)----
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
The code is: In the end it finds the maximum distance and its co ordinates
my #points = ();
open(IN, #ARGV[0]) or die "$!";
while (my $line = <IN>) {
chomp($line);
my #array = (split (/\s+/, $line))[5, 6, 7];
print "#array\n";
push #points, [ #array ];
}
close(IN);
$max=0;
for my $i1 ( 0 .. $#points )
{
my ( $x1, $y1, $z1 ) = #{ $points[$i1] };
my $dist = sqrt( ($x1+1.925)**2 + ($y1+11.270)**2 + ($z1-1.404)**2 );
print "distance from (-1.925 -11.270 1.404) to ( $x1, $y1, $z1 ) is $dist\n";
if ( $dist > $max )
{ $max = $dist;
$x=$x1;
$y=$y1;
$z=$z1;
}}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";
Not sure I clearly understand what you want, but how about:
#!/usr/local/bin/perl
use strict;
use warnings;
my (#refer, #points);
my $part = 0;
while (my $line = <DATA>) {
chomp($line);
if ($line =~ /^TER/) {
$part++;
next;
}
my #array = (split (/\s+/, $line))[5, 6, 7];
if ($part == 0) {
push #refer, [ #array ];
} else {
push #points, [ #array ];
}
}
my %max = (val=>0, x=>0, y=>0, z=>0);
foreach my $ref(#refer) {
my ($x1, $y1, $z1) = #{$ref};
foreach my $atom(#points) {
my ($x, $y, $z) = #{$atom};
my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
if ($dist > $max{val}) {
$max{val} = $dist;
$max{x} = $x;
$max{y} = $y;
$max{z} = $z;
}
}
}
print "max is $max{val}; coord: x=$max{x}, y=$max{y}, z=$max{z}\n";
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008
output:
max is 35.9813670807545; coord: x=30.175, y=4.389, z=5.768
The main issue here is reading the data. First, note that one cannot use split with PDB text files since the fields are defined by position and not by separators. See Coordinate File Description (PDB Format).
To separate the ATOM record of different polymer chains you can start with a simplified version like
my $iblock = 0;
my #atoms = ();
while (my $line = <IN>) {
chomp($line);
# Switch blocks at TER lines
if ($line =~ /^TER/) {
$iblock++;
# Read ATOM lines
} elsif ($line =~ m/^ATOM/) {
my #xyz = (substr($line,7-1,9),substr($line,16-1,9),substr($line,25-1,9));
printf "Block %d: atom at (%s)\n",$iblock,join (",",#xyz);
push #{$atoms[$iblock]},\#xyz;
# Parse additional line types (if needed)
} else {
...
}
}
Followed by a loop over all pairs of coordinates from different blocks, structured as follows:
# 1st block
for my $iblock1 (0..$#atoms) {
# 2nd block
for my $iblock2 ($iblock1+1..$#atoms) {
# Compare all pairs of atoms
...
my $xyz1 (#{$atoms[$iblock1]}) {
for my $xyz2 (#{$atoms[$iblock2]}) {
# Calculate distance and compare with $max_dist
...
}
}
# Print the maximal distance between these two blocks
...
}
}
Of course, the code could be more general if a more elaborate data structure is used or by applying one of the available PDB parsers, such as Bioperl's.
With proper encapsulation, this is pretty simple, and requires minor modifications of your code.
ETA: Added fixed width solution I had on hand. It would probably be best to read all the fields instead of discarding the first 31 chars, and then return them all in a hash reference. That way, you could process all the lines with the same subroutine, and simply switch between parts when the first field turns out to be TER. It should be easy for you to extrapolate this from the given code.
You'll note that the reference values are read in with a loop, because we need to break the loop at the break point. The rest of the values are slurped up with a map statement. Then we simply feed the data to the subroutine we made from your initial code (with some improvements). I used the same names for the lexical variables to make it easier to read the code.
use strict;
use warnings;
my #points;
while (<DATA>) {
last if /^TER$/;
push #points, getpoints($_);
}
my #ref = map getpoints($_), <DATA>;
for my $p (#points) {
getcoords($p, \#ref);
}
sub getpoints {
my $line = shift;
my #data = unpack "A31 A8 A8 A8", $line;
shift #data;
return \#data;
}
sub getcoords {
my ($p, $ref) = #_;
my ($p1,$p2,$p3) = #$p;
my $max=0;
my ($x,$y,$z);
for my $aref ( #$ref ) {
my ( $x1, $y1, $z1 ) = #$aref;
my $dist = sqrt(
($x1-$p1)**2 +
($y1-$p2)**2 +
($z1-$p3)**2
);
print "distance from ($p1 $p2 $p3) to ( $x1, $y1, $z1 ) is $dist\n";
if ( $dist > $max ) {
$max = $dist;
$x=$x1;
$y=$y1;
$z=$z1;
}
}
print "maximum value is : $max\n";
print "co ordinates are : $x $y $z\n";
}
__DATA__
ATOM 1279 C ALA 81 -1.925 -11.270 1.404
ATOM 1280 O ALA 81 -0.279 9.355 15.557
ATOM 1281 OXT ALA 81 -2.188 10.341 15.346
TER
ATOM 1282 N THR 82 29.632 5.205 5.525
ATOM 1283 H1 THR 82 30.175 4.389 5.768
ATOM 1284 H2 THR 82 28.816 4.910 5.008

perl text::csv - filtering specific columns in a csv document and discarding others

I would like to filter out particular columns with a regex and discard others. For example, if I had the following column names:
date
mem_total
cpu.usagemhz.average_0
cpu.usagemhz.average_1
cpu.usagemhz.average_2
I would like to capture only columns that begin with "cpu.usage.mhz.average"
Is their a particular function of text::csv that will help me do a quick check of the column names?
Thanks!
JD
* Update **
I tried jimtut answer and it is extremely close to what I am looking for. Thanks Again Everyone!
Here is the code from jimtut with one small edit on the print statement at the bottom. I added the print $colCount just to see what was going on with the data;
use Text::CSV;
my $file = "foo.csv";
my $pattern = ".*In";
open(F, $file) or warn "Warning! Unable to open $file\n";
my $lineCount = 0;
my %desiredColumns;
while(<F>) {
$lineCount++;
my $csv = Text::CSV->new();
my $status = $csv->parse($_); # should really check this!
my #fields = $csv->fields();
my $colCount = 0;
if ($lineCount == 1) {
# Let's look at the column headings.
foreach my $field (#fields) {
$colCount++;
if ($field =~ m/$pattern/) {
# This heading matches, save the column #.
$desiredColumns{$colCount} = 1;
}
}
}
else {
# Not the header row. Parse the body of the file.
foreach my $field (#fields) {
$colCount++;
if (exists $desiredColumns{$colCount}) {
# This is one of the desired columns.
# Do whatever you want to do with this column!
print "$colCount\t$field\n";
}
}
}
}
close(F);
Here is the results
colCount | $field
12 565
13 73
14 36
15 32
16 127
17 40
18 32
19 42
20 171
12 464
13 62
14 32
15 24
16 109
17 21
18 19
19 39
20 150
12 515
13 76
14 28
15 30
16 119
17 15
18 25
19 46
20 169
12 500
13 71
14 30
15 28
16 111
17 20
18 18
19 40
20 167
I would like to add this data to individual arrays or hashes. what do you think? something like...
foreach column {
check to see if a hash already exists with that column number. If not then create hash.
}
Then go through each field and add the field data to the appropriate hash.
Do you think this is the right way to go about solving this?
No, not a specific function in Text::CSV. I would do something like this:
use Text::CSV;
my $file = "foo.csv";
my $pattern = "cpu.usage.mhz.average.*";
open(F, $file) or die "Unable to open $file: $!\n";
my $lineCount = 0;
my %desiredColumns;
my %columnContents;
while(<F>) {
$lineCount++;
my $csv = Text::CSV->new();
my $status = $csv->parse($_); # should really check this!
my #fields = $csv->fields();
my $colCount = 0;
if ($lineCount == 1) {
# Let's look at the column headings.
foreach my $field (#fields) {
$colCount++;
if ($field =~ m/$pattern/) {
# This heading matches, save the column #.
$desiredColumns{$colCount} = 1;
}
}
}
else {
# Not the header row. Parse the body of the file.
foreach my $field (#fields) {
$colCount++;
if (exists $desiredColumns{$colCount}) {
# This is one of the desired columns.
# Do whatever you want to do with this column!
push(#{$columnContents{$colCount}}, $field);
}
}
}
}
close(F);
foreach my $key (sort keys %columnContents) {
print "Column $key: " . join(",", #{$columnContents{$key}}) . "\n\n";
}
Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...
Since your fields of interest are at index 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them but you can do whatever you like to them.
use Text::CSV; # load the module
my $csv = Text::CSV->new (); # instantiate
open $fh, "<somefile"; # open the input
while ( my $fields = $csv->getline($fh) ) { # read a line, and parse it into fields
print "I got #{$fields}[2..4]\n"; # print the fields of interest
}
close ($fh) # close when done
WHY are you trying to do this? Is it to minimize storage? Eliminate processing costs for parsing many un-needed columns?
If the latter, you can't avoid that processing cost. Any solution you come up with would STILL read and parse 100% of the file.
If the former, there are many methods, some are more efficient than the others.
Also, what exactly do you mean "help me do a quick check of the column names?"? If you want to get the column names, there's column_names() method provided you previously set the column names using column_names(getline($fh)).
If you want to only return specific column names in a hash to avid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of getline_hr() method:
For the former (roll your own), you can do something like:
my $headers = $csv->getline( $fh ); # First line is headers.
my #headers_keep = map { /^cpu.usage.mhz.average/ ? 1 : 0 } #$headers;
while ( my $row = $csv->getline( $fh ) ) {
my $i = 0;
my #row_new = grep { $headers_keep[$i++] } $#row;
push #rows, \#row_new;
}
BUT you can either roll your own OR .
You can also use a "feature" of "getline_hr()" which doesn't assign values into a hash if the column name is a duplicate (only the LAST version gets assigned) \
In your case, for column names: date,mem_total,cpu.usagemhz.average_0,cpu.usagemhz.average_1,cpu.usagemhz.average_2, merely set the column_names array to contain "cpu.usagemhz.average_0" value in the first 2 eements of the array - they will NOT be then saved by getline_hr().
You can go over the list of columns, find the consecutive range of "not needed" columns, and replace their names with the name of the first needed column follwing that range. The only stiking point is if the "un-needed" range is at the very end of the columns - replace with "JUNK" or something.