Manipulating digits - perl

This is a program which grabs lines which contains the $position AND $amino value in the first two columns.
Code:
#!/usr/bin/perl
my $id = $ARGV[0];
my $position = $ARGV[1]; # POSITION OF THE RESIDUE
my $amino= $ARGV[2]; #THREE LETTER AMINO ACID CODE IN CAPITALS
my #grabbed;
open (FILE, $id.$amino.$position.".hb2");
#CREATES AN ARRAY WITH ONLY THE VALUES FROM THE HB2 FILE. REMOVES THE HEADER OF THE FILE.
while (<FILE>) {
if (/^-/) {
push #grabbed, $_;
while (<FILE>) {
last if /^$/;
push #grabbed, $_;
}
}
}
close (FILE);
for ( #grabbed ) {
my #f = split;
if (( $f[2] == "-"."00".$position."-".$amino ) or ($f[0] == "-"."00".$position."-".$amino)) {
push #line, $id.$amino.$position, " ",$_;
}
}
print #line;
Partial input data :
-0007-ARG NH2 -0009-GLN OE1 3.24 SS 2 6.00 143.3 2.38 105.9 95.8 1 #CASE 1
-0008-GLU N -0008-GLU OE1 2.62 MS 0 -1.00 120.8 1.96 102.3 103.4 2
-0011-ILE N -0117-ARG O 2.87 MM 106 4.90 144.0 2.00 127.5 139.0 3
-0117-ARG N -0011-ILE O 2.75 MM 106 4.90 160.4 1.79 153.2 148.6 4 #CASE 2
-0016-SER N -0012-THR O 2.89 MM 4 6.00 156.2 1.95 149.8 154.8 5 #CASE 3
-0017-ALA N -0013-LEU O 3.10 MM 4 6.24 152.8 2.17 143.4 149.7 6
-0018-GLU N -0014-ARG O 3.04 MM 4 6.24 154.1 2.11 147.2 154.2 7
-0019-ILE N -0015-GLY O 2.90 MM 4 6.16 155.8 1.96 150.7 156.2 8
-0016-SER OG -0188-THR OG1 2.72 SS 172 5.92 172.0 1.73 98.9 99.6 9
-0188-THR OG1 -0016-SER OG 2.72 SS 172 5.92 163.7 1.75 116.4 115.1 10
Question :
In order to generalize the program I made the match as :
( $f[2] == "-"."00".$position."-".$amino ) or ($f[0] == "-"."00".$position."-".$amino)
The format is always four digits after "-" before $amino (-0188-THR). I suddenly realized that my code wouldnt work if the $position input is "one digit(like CASE 1)" or "three digit (like CASE 2, column 1)". Since I hard coded it as format as "-" followed by two zeros and THEN position, it has to always be two digit input to work.
I am stumped to generalize this code so that I could put in 1/2/3 digits. The remaining digits would always be replaced by zeros.

You can format the string using sprintf:
my $mstring = sprintf("-%04d-%s", $position, $amino);
if ( ($f[2] eq $mstring) or ($f[0] eq $mstring) ) {
# ...
}
Here, %04d adds 0's to the left of position to make it 4 digits long.

First, == operator in perl used only for comparing arithmetic expressions
To compare strings you should use eq operator
Second, to format strings from digits you can use sprintf function.
if ($f[2] eq "-".sprintf("%04d", $position)."-".$amino ...

Related

Comparing two files, where one piece of information can be flexible

Comparing two files. So easy, but comparing two files where one piece of information can be flexible is proving to be very challenging for me.
fileA
4 "dup" 37036335 37044984
3 "dup" 100146708 100147504
7 "del" 100 203
2 "dup" 34 89
fileB
4 "dup" 37036335 37036735
3 "dup" 100146708 100147504
4 "dup" 68 109
Anticipated output:
output_file1 (matching hits)
fileA: 4 "dup" 37036335 37044984
fileB: 4 "dup" 37036335 37036735
fileA: 3 "dup" 100146708 100147504
fileB: 3 "dup" 100146708 100147504
output_file2 (found in fileA, but not in FileB including non-overlap)
7 "del" 100 203
2 "dup" 34 89
output_file3 (found in fileB, but not in FileA including non-overlap)
4 "dup" 68 109
The credentials are...
I need field 1 and field 2 in the first file to exactly match the second file and the coordinates in field 3 to be exact or overlap.
This would mean these are the same.
fileA :4 "dup" 37036335 37044984
fileB :4 "dup" 37036335 37036735
I also need to find differences between the two files. (no-overlap, 1 row isn't present in one file, but not in the other, etc)
Here's the gist of what I've tried. I've written this code probably 4 different ways, alas, still no success. I've put both files into arrays (I've tried a hash too...idk)
## if no hits in original, but hits in calculated
if((! #ori) && (#calc)){}
## if CNV calls in original, but none in calculated
if((#ori) && (! #calc)){}
## if CNV calls in both
if((#ori) && (#calc)){
## compare calls with double 'for' loop
foreach my $l (#ori){
my #l = split(/\s/,$l);
my $Ochromosome = $l[0];
my $Ostart = $l[2];
my $Oend = $l[3];
my $Otype = $l[1];
foreach my $l (#calc){
my #l = split(/\s/,$l);
my $Cchromosome = $l[0];
my $Cstart = $l[2];
my $Cend = $l[3];
my $Ctype = $l[1];
## check chromosome and type here
if(($Ochromosome eq $Cchromosome) && ($Otype eq $Ctype)){ ## what if there are two duplications on the same chromosome?
## check coordinates
if(($Ostart <= $Cend) && ($Cstart <= $Oend)){
## overlap
}else{
## noOverlap
}
}else{
## what if there is something found in one, but not in the other and they both have calls?
## ahhhh
}
}
}
Here is a simple solution which is also fairly efficient.
Iterate over lines of one file, checking each against all lines of the other (until a match is found). This is the very least we must do complexity wise, given all information that needs to be gathered.
If a line from A is not found in B, it is added to #not_in_B. To determine which lines in B are not in A, we prepare a hash where each element of B is a key with a value 0. Once/if an element of B is found, the value of its key in the hash is set to 1. Those that are not 1 at the end have never been found by elements of A, and so are the extra ones. They go in #not_in_A.
Both files are first read into arrays for simplicity (but this is needed for the inner loop).
use warnings;
use strict;
use feature 'say';
my $f1 = 'f1.txt';
my $f2 = 'f2.txt';
open my $fh, '<', $f1;
my #a1 = <$fh>; chomp(#a1);
open $fh, '<', $f2;
my #a2 = <$fh>; chomp(#a2);
close $fh;
my (#not_in_A, #not_in_B);
my %Bs_in_A = map { $_ => 0 } #a2;
foreach my $e1 (#a1)
{
my $match = 0;
foreach my $e2 (#a2)
{
if ( lines_match($e1, $e2) ) {
$match = 1;
say "Match:\n\tf1: $e1\n\tf2: $e2";
$Bs_in_A{$e2} = 1;
last;
}
}
push #not_in_B, $e1 if not $match;
}
#not_in_A = grep { $Bs_in_A{$_} == 0 } keys %Bs_in_A;
say '---';
say "Elements of A that are not in B:";
say "\t$_" for #not_in_B;
say "Elements of B that are not in A:";
say "\t$_" for #not_in_A;
sub lines_match
{
my ($l1, $l2) = #_;
my #t1 = split ' ', $l1;
my #t2 = split ' ', $l2;
# First two fields must be the same
return if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];
# Third-to-fourth-field ranges must overlap
return
if ($t1[2] < $t2[2] and $t1[3] < $t2[2])
or ($t1[2] > $t2[3] and $t1[3] > $t2[3]);
return 1; # match
}
Output
Match:
f1: 4 "dup" 37036335 37044984
f2: 4 "dup" 37036335 37036735
Match:
f1: 3 "dup" 100146708 100147504
f2: 3 "dup" 100146708 100147504
---
Elements of A that are not in B:
7 "del" 100 203
2 "dup" 34 89
Elements of B that are not in A:
4 "dup" 68 109
Note that I've used 1 in place of A and 2 in place of B.

Perl - Print first letter of column

I'm trying to print the first letter of column2 of an input file as well as other columns of interest. I'm not sure why the following script, adapted from Matching first letter of word gives me an 'Use of uninitialized value $columns[2]' warning.
Input File Example:
ATOM 1 CAY GLY X 1 -0.124 0.401 -0.153 1.00 2.67 PEP
ATOM 2 HY1 GLY X 1 -0.648 0.043 -1.064 1.00 0.00 PEP
ATOM 3 HY2 GLY X 1 -0.208 1.509 -0.145 1.00 0.00 PEP
Output File Example:
1 C -0.124 0.401 -0.153 1.00 2.67
2 H -0.648 0.043 -1.064 1.00 0.00
3 H -0.208 1.509 -0.145 1.00 0.00
Script
open (my $input_fh, "<", $filename) or die $!;
while (my $data = <$input_fh>) {
chomp $data;
my #columns = split(/\t/, $data);
my ($firstletter) = ($columns[2] =~ m/^\d+(\w)/);
if (/CAY/../HT2/)
print $output_fh join ("\t", $columns[1], $firstletter, $columns[6], $columns[7], $columns[8]), "\n";
}
UPDATE The warning occurred due to the if (/CAY/../HT2/) statement for some reason -- but since the input files are identical, I don't really need this condition. Also, since there are no digits in column2 it is more appropriate to use the /^(\w)/ regex.
Is there some particular reason that you must split on tabs? Getting various kinds of white space in an arbitrary text file correctly can be picky. If not necessary, it seems fully fitting to just split by (any) space, then grab the first letter
my #cols = split '\s+', $data;
my ($firstletter) = $cols[1] =~ m/^(\w)/;
I am not sure what the rest does but you can easily pluck the columns you need.
Try to debug what you get after splitting:
my #columns = split(/\t/, $data);
local $" = "\n"; print "$data\nSplitted into:\n#columns";
As guess your file have double \t characters. I mean you probably have:
ATOM\t\t1 CAY GLY X... so second column is undef
It sounds to me like the code that gave that warning was not what you show but instead had something like
($columns[2]) = ($columns[2] =~ m/^\d+(\w)/);
And you are getting the warning because the regex is failing due to not finding a digit. Maybe you meant \d*?
For me, maybe i would like to use cut command and pipeline, then split command to get the exact info you want.

Perl Range Operator with Letters and Numbers

Is there a way to use Perl's range operator .. to use letters AND numbers?
For example, have:
for my $i ('X'..'9') {
print "$i ";
}
output X Y Z 1 2 3 4 5 6 7 8 9
Not unless for my $i ('X' .. 'Z', 1 .. 9) counts. "Z" will never increment to 1.

Perl decimal to binary conversion

I need to convert a number from decimal to binary in Perl where my constraint is that the binary number width is set by a variable:
for (my $i = 0; $i<32; $i++)
{
sprintf("%b",$i) # This will give me a binary number whose width is not fixed
sprintf("%5b",$i) # This will give me binary number of width 5
# Here is what I need:
sprintf (%b"$MY_GENERIC_WIDTH"b, $i)
}
I can probably use a work-around in my print statements, but the code would be much cleaner if I can do the aforementioned.
Your question amounts to the following:
How do I build the string %5b where 5 is variable?
Using concatenation.
"%".$width."b"
That can also be written as
"%${width}b"
In more complex cases, you might want to use the following, but it's overkill here.
join('', "%", $width, "b")
Note that sprintf accepts a * as a placeholder for a value to be provided in a variable.
sprintf("%*b", $width, $num)
If you want leading zeroes instead of leading spaces, just add a 0 immediately after the %.
You can interpolate the width into the format string:
my $width = 5;
for my $i (0..31) {
printf "%${width}b\n", $i;
}
Or use a * to input it via a variable:
my $width = 5;
for my $i (0..31) {
printf "%*b\n", $width, $i;
}
Both outputs:
0
1
10
11
100
101
110
111
1000
1001
1010
1011
1100
1101
1110
1111
10000
10001
10010
10011
10100
10101
10110
10111
11000
11001
11010
11011
11100
11101
11110
11111

How to code in perl using subroutines

I need to score my string patches following certain criteria:
Column 1: B for buried or E for Exposed - Threshold: 25%
Column 2: Amino acid
Column 3: Sequence name
Column 4: Amino acid number
Column 5: Relative Surface Accessibility - RSA
Column 6: Absolute Surface Accessibility
Column 7: Z-fit score for RSA prediction
Column 8: Probability for Alpha-Helix
Column 9: Probability for Beta-strand
Column 10: Probability for Coil
E K 132L_A_PDBID_CHAIN_SEQUENCE 1 0.716 147.261 1.150 0.016 0.005 0.979
E V 132L_A_PDBID_CHAIN_SEQUENCE 2 0.514 79.033 1.252 0.191 0.086 0.723
B F 132L_A_PDBID_CHAIN_SEQUENCE 3 0.134 26.793 -0.325 0.191 0.086 0.723
E G 132L_A_PDBID_CHAIN_SEQUENCE 4 0.570 44.835 1.012 0.354 0.048 0.598
Remember, the last three columns are the probabilities for either Helix/Sheet/Coil.......
So first we need to classify whether a certain residue falls under Helix/Sheet/Coil using some criterion function....based on the max. probability within the last 3 columns...
Then one we get the structural preferences, we need to score the sequences breaking into patches of 10......
My scoring criteria is this:
EXPOSED = 1; # +1 for letters that exposed
BURIED = 0; # 0 for letters that are buried
COIL = 3; # +3 for any coil
HELIX = 2; # +2 for any helix
SHEET = 1; # +1 for any sheet
The link below is for breaking a string into patches of 10~11
http://pastebin.com/GeW5AKF3
The problem I am facing is that I have splitted in string into horizontal patches as in the above link, but the file is vertically aligned......
Thanks for help....... Waiting for reply
This should get you going:
open my $fh, "<", "input.txt";
my #data;
while(my $line = <$fh>) # If we got line from file
{
chomp $line; # remove carraigereturn/linefeed
my #parts = split /\s+/, $line; # split based on values seperated by one or more spaces
push #data, [#parts] # Add array of split parts to data array
}
Thats puts everything into #data. You access it like this:
# now access whatever you want...
# example: line 3 column 6 (perl arrays start from 0 not 1):
print $data[2][5] . "\n"; #prints 26.793
# line 4 column 2:
print $data[3][1] . "\n"; #prints G
Then you can sort like this. (Example sort by Col1 then by Col5(RSA):)
#data = sort { if ( $a->[0] eq $b->[0] ) { $a->[4] <=> $b->[4] } else { $a->[0] cmp $b->[0] } } #data;
Then output data like this:
foreach my $line (#data)
{
foreach my $field (#$line)
{
print $field."\t";
}
print "\n";
}
Output is:
B F 132L_A_PDBID_CHAIN_SEQUENCE 3 0.134 26.793 -0.325 0.191 0.086 0.723
E V 132L_A_PDBID_CHAIN_SEQUENCE 2 0.514 79.033 1.252 0.191 0.086 0.723
E G 132L_A_PDBID_CHAIN_SEQUENCE 4 0.570 44.835 1.012 0.354 0.048 0.598
E K 132L_A_PDBID_CHAIN_SEQUENCE 1 0.716 147.261 1.150 0.016 0.005 0.979