How to code in perl using subroutines - perl

I need to score my string patches following certain criteria:
Column 1: B for buried or E for Exposed - Threshold: 25%
Column 2: Amino acid
Column 3: Sequence name
Column 4: Amino acid number
Column 5: Relative Surface Accessibility - RSA
Column 6: Absolute Surface Accessibility
Column 7: Z-fit score for RSA prediction
Column 8: Probability for Alpha-Helix
Column 9: Probability for Beta-strand
Column 10: Probability for Coil
E K 132L_A_PDBID_CHAIN_SEQUENCE 1 0.716 147.261 1.150 0.016 0.005 0.979
E V 132L_A_PDBID_CHAIN_SEQUENCE 2 0.514 79.033 1.252 0.191 0.086 0.723
B F 132L_A_PDBID_CHAIN_SEQUENCE 3 0.134 26.793 -0.325 0.191 0.086 0.723
E G 132L_A_PDBID_CHAIN_SEQUENCE 4 0.570 44.835 1.012 0.354 0.048 0.598
Remember, the last three columns are the probabilities for either Helix/Sheet/Coil.......
So first we need to classify whether a certain residue falls under Helix/Sheet/Coil using some criterion function....based on the max. probability within the last 3 columns...
Then one we get the structural preferences, we need to score the sequences breaking into patches of 10......
My scoring criteria is this:
EXPOSED = 1; # +1 for letters that exposed
BURIED = 0; # 0 for letters that are buried
COIL = 3; # +3 for any coil
HELIX = 2; # +2 for any helix
SHEET = 1; # +1 for any sheet
The link below is for breaking a string into patches of 10~11
http://pastebin.com/GeW5AKF3
The problem I am facing is that I have splitted in string into horizontal patches as in the above link, but the file is vertically aligned......
Thanks for help....... Waiting for reply

This should get you going:
open my $fh, "<", "input.txt";
my #data;
while(my $line = <$fh>) # If we got line from file
{
chomp $line; # remove carraigereturn/linefeed
my #parts = split /\s+/, $line; # split based on values seperated by one or more spaces
push #data, [#parts] # Add array of split parts to data array
}
Thats puts everything into #data. You access it like this:
# now access whatever you want...
# example: line 3 column 6 (perl arrays start from 0 not 1):
print $data[2][5] . "\n"; #prints 26.793
# line 4 column 2:
print $data[3][1] . "\n"; #prints G
Then you can sort like this. (Example sort by Col1 then by Col5(RSA):)
#data = sort { if ( $a->[0] eq $b->[0] ) { $a->[4] <=> $b->[4] } else { $a->[0] cmp $b->[0] } } #data;
Then output data like this:
foreach my $line (#data)
{
foreach my $field (#$line)
{
print $field."\t";
}
print "\n";
}
Output is:
B F 132L_A_PDBID_CHAIN_SEQUENCE 3 0.134 26.793 -0.325 0.191 0.086 0.723
E V 132L_A_PDBID_CHAIN_SEQUENCE 2 0.514 79.033 1.252 0.191 0.086 0.723
E G 132L_A_PDBID_CHAIN_SEQUENCE 4 0.570 44.835 1.012 0.354 0.048 0.598
E K 132L_A_PDBID_CHAIN_SEQUENCE 1 0.716 147.261 1.150 0.016 0.005 0.979

Related

Comparing two files, where one piece of information can be flexible

Comparing two files. So easy, but comparing two files where one piece of information can be flexible is proving to be very challenging for me.
fileA
4 "dup" 37036335 37044984
3 "dup" 100146708 100147504
7 "del" 100 203
2 "dup" 34 89
fileB
4 "dup" 37036335 37036735
3 "dup" 100146708 100147504
4 "dup" 68 109
Anticipated output:
output_file1 (matching hits)
fileA: 4 "dup" 37036335 37044984
fileB: 4 "dup" 37036335 37036735
fileA: 3 "dup" 100146708 100147504
fileB: 3 "dup" 100146708 100147504
output_file2 (found in fileA, but not in FileB including non-overlap)
7 "del" 100 203
2 "dup" 34 89
output_file3 (found in fileB, but not in FileA including non-overlap)
4 "dup" 68 109
The credentials are...
I need field 1 and field 2 in the first file to exactly match the second file and the coordinates in field 3 to be exact or overlap.
This would mean these are the same.
fileA :4 "dup" 37036335 37044984
fileB :4 "dup" 37036335 37036735
I also need to find differences between the two files. (no-overlap, 1 row isn't present in one file, but not in the other, etc)
Here's the gist of what I've tried. I've written this code probably 4 different ways, alas, still no success. I've put both files into arrays (I've tried a hash too...idk)
## if no hits in original, but hits in calculated
if((! #ori) && (#calc)){}
## if CNV calls in original, but none in calculated
if((#ori) && (! #calc)){}
## if CNV calls in both
if((#ori) && (#calc)){
## compare calls with double 'for' loop
foreach my $l (#ori){
my #l = split(/\s/,$l);
my $Ochromosome = $l[0];
my $Ostart = $l[2];
my $Oend = $l[3];
my $Otype = $l[1];
foreach my $l (#calc){
my #l = split(/\s/,$l);
my $Cchromosome = $l[0];
my $Cstart = $l[2];
my $Cend = $l[3];
my $Ctype = $l[1];
## check chromosome and type here
if(($Ochromosome eq $Cchromosome) && ($Otype eq $Ctype)){ ## what if there are two duplications on the same chromosome?
## check coordinates
if(($Ostart <= $Cend) && ($Cstart <= $Oend)){
## overlap
}else{
## noOverlap
}
}else{
## what if there is something found in one, but not in the other and they both have calls?
## ahhhh
}
}
}
Here is a simple solution which is also fairly efficient.
Iterate over lines of one file, checking each against all lines of the other (until a match is found). This is the very least we must do complexity wise, given all information that needs to be gathered.
If a line from A is not found in B, it is added to #not_in_B. To determine which lines in B are not in A, we prepare a hash where each element of B is a key with a value 0. Once/if an element of B is found, the value of its key in the hash is set to 1. Those that are not 1 at the end have never been found by elements of A, and so are the extra ones. They go in #not_in_A.
Both files are first read into arrays for simplicity (but this is needed for the inner loop).
use warnings;
use strict;
use feature 'say';
my $f1 = 'f1.txt';
my $f2 = 'f2.txt';
open my $fh, '<', $f1;
my #a1 = <$fh>; chomp(#a1);
open $fh, '<', $f2;
my #a2 = <$fh>; chomp(#a2);
close $fh;
my (#not_in_A, #not_in_B);
my %Bs_in_A = map { $_ => 0 } #a2;
foreach my $e1 (#a1)
{
my $match = 0;
foreach my $e2 (#a2)
{
if ( lines_match($e1, $e2) ) {
$match = 1;
say "Match:\n\tf1: $e1\n\tf2: $e2";
$Bs_in_A{$e2} = 1;
last;
}
}
push #not_in_B, $e1 if not $match;
}
#not_in_A = grep { $Bs_in_A{$_} == 0 } keys %Bs_in_A;
say '---';
say "Elements of A that are not in B:";
say "\t$_" for #not_in_B;
say "Elements of B that are not in A:";
say "\t$_" for #not_in_A;
sub lines_match
{
my ($l1, $l2) = #_;
my #t1 = split ' ', $l1;
my #t2 = split ' ', $l2;
# First two fields must be the same
return if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];
# Third-to-fourth-field ranges must overlap
return
if ($t1[2] < $t2[2] and $t1[3] < $t2[2])
or ($t1[2] > $t2[3] and $t1[3] > $t2[3]);
return 1; # match
}
Output
Match:
f1: 4 "dup" 37036335 37044984
f2: 4 "dup" 37036335 37036735
Match:
f1: 3 "dup" 100146708 100147504
f2: 3 "dup" 100146708 100147504
---
Elements of A that are not in B:
7 "del" 100 203
2 "dup" 34 89
Elements of B that are not in A:
4 "dup" 68 109
Note that I've used 1 in place of A and 2 in place of B.

Perl script to check another array values depending on current array index

I'm working on a perl assignment, that has three arrays - #array_A, #array_B and array_C with some values in it, I grep for a string "CAT" on array A and fetching its indices too
my #index = grep { $#array_A[$_] =~ 'CAT' } 0..$#array_A;
print "Index : #index\n";
Output: Index : 2 5
I have to take this as an input and check the value of other two arrays at indices 2 and 5 and print it to a file.
Trick is the position of the string - "CAT" varies. (Index might be 5 , 7 and 9)
I'm not quite getting the logic here , looking for some help with the logic.
Here's an overly verbose example of how to extract the values you want as to show what's happening, while hopefully leaving some room for you to have to further investigate. Note that it's idiomatic Perl to use regex delimiters when using =~. eg: $name =~ /steve/.
use warnings;
use strict;
my #a1 = qw(AT SAT CAT BAT MAT CAT SLAT);
my #a2 = qw(a b c d e f g);
my #a3 = qw(1 2 3 4 5 6 7);
# note the difference in the next line... no # symbol...
my #indexes = grep { $a1[$_] =~ /CAT/ } 0..$#a1;
for my $index (#indexes){
my $a2_value = $a2[$index];
my $a3_value = $a3[$index];
print "a1 index: $index\n" .
"a2 value: $a2_value\n" .
"a3 value: $a3_value\n" .
"\n";
}
Output:
a1 index: 2
a2 value: c
a3 value: 3
a1 index: 5
a2 value: f
a3 value: 6

How can I plot p-values for SNPs that are spread across thousands of scaffolds on a single continuous axis?

I have association mapping derived P-values for SNPs that are scattered across thousands of scaffolds in a non-model organism. I would like plot the P-value of each SNP on a Manhattan-style plot. I do not care about the order of the scaffolds, but I would like to retain the relative order and spacing of SNP positions on their respective scaffolds. I simply want to visualize roughly how many genomic regions are significantly associated with a phenotype. For example:
My data looks something like this:
SCAFFOLD POSITION
1 8967
1 8986
1 9002
1 9025
1 9064
2 60995
2 61091
2 61642
2 61898
2 61921
2 62034
2 62133
2 62202
2 62219
2 62220
3 731894
3 731907
3 731962
3 731999
3 732000
3 732050
3 732076
3 732097
I would like to write a perl code to create a third column that retains the distance between SNPs on the same scaffold, while arbitrarily spacing scaffolds by some number (100 in the following example):
SCAFFOLD POSITION CONTINUOUS_AXIS
1 8967 8967
1 8986 8986
1 9002 9002
1 9025 9025
1 9064 9064
2 60995 9164
2 61091 9260
2 61642 9811
2 61898 10067
2 61921 10090
2 62034 10203
2 62133 10302
2 62202 10371
2 62219 10388
2 62220 10389
3 731894 10489
3 731907 10502
3 731962 10557
3 731999 10594
3 732000 10595
3 732050 10645
3 732076 10671
3 732097 10692
Thank you to anyone who might have a good strategy.
Something like the following should work:
#!/usr/bin/env perl
use strict;
use warnings;
use constant SCAFFOLD_SPACING => 100;
my ($last_scaffold, $last_position, $continuous_axis, $found_data);
my $input = './input';
open my $fh, "<$input"
or die "Unable to open '$input' for reading : $!";
print join( "\t", qw( SCAFFOLD POSITION CONTINUOUS_AXIS ) ) . "\n"; # Output Header
while (<$fh>) {
next unless m|\d|; # Skip non-data lines
my ($scaffold, $position) = split /\s+/; # Split on whitespace
unless ($found_data++) {
# Initialize
$last_scaffold = $scaffold; # Set to first data value
$last_position = $position; # Set to first data value
$continuous_axis = $position; # Start continuous axis at first position
}
my $position_diff = $position - $last_position;
my $scaffold_diff = $scaffold - $last_scaffold;
if ($scaffold_diff == 0) {
$continuous_axis += $position_diff;
} else {
$continuous_axis += SCAFFOLD_SPACING;
}
print join( "\t", $scaffold, $position, $continuous_axis ) . "\n";
# Update
$last_scaffold = $scaffold;
$last_position = $position;
}

perl add contents of a column of a file

Column A | Column B | Column C | Column D
35627799100 8 8 2
35627788000 60 34 45
35627799200 10 21 21
35627780000 60 5 8
Basically I have a file as shown above and would like to add the contents of Column B i.e 8+60+10+60. To be frank I'm not sure if need to remove the first line being text and if I can use the split function and put it in a hash something along the lines:
my %hash = map {split/\s+/,$_,4} <$file>;
Thanks in advance for the help.
If you just want to sum up the second column, a hash is overkill. You can do something like this and calculate the sum directly in the map.
my $sum;
$sum += (split /\s+/, $_)[1] while <$file>;
Edit: If you have header rows or other rows with non-numeric values in column 2, then as the comments below indicate, you will run into problems. You can avoid this by trading split for a regular expression, like so:
my $sum = 0;
while (<STDIN>)
{
$sum += $1 if $_ =~ /^\S+\s+(\d+)/;
}
If it's possible that column 1 has no text (ie. the line starts with a single blank and the first non-blank represents the second column), then change the first part of the pattern from ^\S+ to ^\S*.
This is an example based on your data:
use strict;
use warnings;
my $sum_column_b = 0;
<DATA>; #drop header
while( my $line = <DATA>) {
$line =~ m/\s+(\d+)/; #regexpr to catch second column values
$sum_column_b += $1;
}
print $sum_column_b, "\n"; #<-- prints: 138
__DATA__
Column A | Column B | Column C | Column D
35627799100 8 8 2
35627788000 60 34 45
35627799200 10 21 21
35627780000 60 5 8

Manipulating digits

This is a program which grabs lines which contains the $position AND $amino value in the first two columns.
Code:
#!/usr/bin/perl
my $id = $ARGV[0];
my $position = $ARGV[1]; # POSITION OF THE RESIDUE
my $amino= $ARGV[2]; #THREE LETTER AMINO ACID CODE IN CAPITALS
my #grabbed;
open (FILE, $id.$amino.$position.".hb2");
#CREATES AN ARRAY WITH ONLY THE VALUES FROM THE HB2 FILE. REMOVES THE HEADER OF THE FILE.
while (<FILE>) {
if (/^-/) {
push #grabbed, $_;
while (<FILE>) {
last if /^$/;
push #grabbed, $_;
}
}
}
close (FILE);
for ( #grabbed ) {
my #f = split;
if (( $f[2] == "-"."00".$position."-".$amino ) or ($f[0] == "-"."00".$position."-".$amino)) {
push #line, $id.$amino.$position, " ",$_;
}
}
print #line;
Partial input data :
-0007-ARG NH2 -0009-GLN OE1 3.24 SS 2 6.00 143.3 2.38 105.9 95.8 1 #CASE 1
-0008-GLU N -0008-GLU OE1 2.62 MS 0 -1.00 120.8 1.96 102.3 103.4 2
-0011-ILE N -0117-ARG O 2.87 MM 106 4.90 144.0 2.00 127.5 139.0 3
-0117-ARG N -0011-ILE O 2.75 MM 106 4.90 160.4 1.79 153.2 148.6 4 #CASE 2
-0016-SER N -0012-THR O 2.89 MM 4 6.00 156.2 1.95 149.8 154.8 5 #CASE 3
-0017-ALA N -0013-LEU O 3.10 MM 4 6.24 152.8 2.17 143.4 149.7 6
-0018-GLU N -0014-ARG O 3.04 MM 4 6.24 154.1 2.11 147.2 154.2 7
-0019-ILE N -0015-GLY O 2.90 MM 4 6.16 155.8 1.96 150.7 156.2 8
-0016-SER OG -0188-THR OG1 2.72 SS 172 5.92 172.0 1.73 98.9 99.6 9
-0188-THR OG1 -0016-SER OG 2.72 SS 172 5.92 163.7 1.75 116.4 115.1 10
Question :
In order to generalize the program I made the match as :
( $f[2] == "-"."00".$position."-".$amino ) or ($f[0] == "-"."00".$position."-".$amino)
The format is always four digits after "-" before $amino (-0188-THR). I suddenly realized that my code wouldnt work if the $position input is "one digit(like CASE 1)" or "three digit (like CASE 2, column 1)". Since I hard coded it as format as "-" followed by two zeros and THEN position, it has to always be two digit input to work.
I am stumped to generalize this code so that I could put in 1/2/3 digits. The remaining digits would always be replaced by zeros.
You can format the string using sprintf:
my $mstring = sprintf("-%04d-%s", $position, $amino);
if ( ($f[2] eq $mstring) or ($f[0] eq $mstring) ) {
# ...
}
Here, %04d adds 0's to the left of position to make it 4 digits long.
First, == operator in perl used only for comparing arithmetic expressions
To compare strings you should use eq operator
Second, to format strings from digits you can use sprintf function.
if ($f[2] eq "-".sprintf("%04d", $position)."-".$amino ...