for loop only runs once PERL, Argument "" isn't numeric in substr - perl

I have a for loop with nested if, else, and elsif statements within. The for loop runs correctly, but it only runs once for some reason. I am looking to count the A's, C's, G's, and T's in a sequence, but I want to count them in two groups - a motif group and a background group. The motif group counts need to be position specific while the background counts do not.
Here is what is contained in my .dna file (a .txt would work fine):
AGGCT
Here is what I have so far:
use strict;
use warnings;
#Upload sequence
print "Please enter the filename of the first sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and ignore comment lines
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
for (<FILE>)
{
next if /^#/;
next if /^>/;
$dna .= $_;
}
close FILE;
#Remove white spaces
$dna =~ s/[\s\d]//g;
$dna =~ /./g;
#User specifies motif width
print "Please enter the motif width:\n";
my $width = <STDIN>;
#Remove newline from file
chomp $width;
#Omitting code for non-negative widths to keep this shorter
#Initialize counts and arrays for motif positions
my #motA;
my #motC;
my #motG;
my #motT;
#Define length of motif arrays per width
for(0..($width-1))
{
$motA[$_] = 0;
$motC[$_] = 0;
$motG[$_] = 0;
$motT[$_] = 0;
}
#Initialize background counts
my $bgA = 0;
my $bgC = 0;
my $bgG = 0;
my $bgT = 0;
#Generate random start site in the sequence
#for motif to start from
my $ms = int(rand(((length($dna)+1)-$width)));
#Within a motif, count the bases at the positions
for (my $pos = 0..(length($dna)-1))
{
my $base = substr($dna, $pos, 1);
if ($pos = $ms..($ms + $width))
{
#Add to motif counts
if($base eq 'A')
{
$motA[$pos-$ms] = $motA[$pos-$ms] + 1;
}
elsif($base eq 'C')
{
$motC[$pos-$ms] = $motC[$pos-$ms] + 1;
}
elsif($base eq 'G')
{
$motG[$pos-$ms] = $motG[$pos-$ms] + 1;
}
elsif($base eq 'T')
{
$motT[$pos-$ms] = $motT[$pos-$ms] + 1;
}
}
else
{
#Create background counts
if ($base eq 'A')
{
$bgA = $bgA + 1;
}
elsif ($base eq 'C')
{
$bgC = $bgC + 1;
}
elsif ($base eq 'G')
{
$bgG = $bgG + 1;
}
elsif ($base eq 'T')
{
$bgT = $bgT + 1;
}
}
}
print "A #motA\nC #motC\nG #motG\nT #motT\n\n";
print "bgA = $bgA\n
bgC = $bgC\n
bgG = $bgG\n
bgT = $bgT";
The output looks like this:
Please enter the filename of the first sequence data: sample.dna
Please enter the motif width:
3
Argument "" isn't numeric in substr at line 62, <STDIN> line2.
A 0 1 0
C 0 0 0
G 0 0 0
T 0 0 0
bgA = 0
bgC = 0
bgG = 0
bgT = 0
I know that this is most likely because my $dna or $pos in the line with substr contains the "" (empty string?), but I am unsure how to solve this problem. I thought the initialization of $pos took care of that, but that's why I want to ask the masters to see what to do. I THINK that this will solve the for loop problem as well. As always, any and all help is useful. I appreciate it in advance!

This:
for (my $pos = 0..length($dna))
{
my $base = substr($dna, $pos, 1);
probably is meant to be 0..length($dna)-1 instead?
When $pos is the length, the substring is going to be an empty string.
And that's not the proper syntax for a for loop iterating over a range. It should be
for my $pos (0..length($dna)-1)
This:
if ($pos = $ms..($ms + $width))
if I understand correctly should be
if ($pos >= $ms && $pos < $ms + $width)
What you have is assigning to $pos the result of a flipflop operation, which is not going to be anything useful.
It looks like this:
my $ms = int(rand(((length($dna)+1)-$width)));
should be
my $ms = int(rand(((length($dna))-$width)));
E.g. if the $dna length is 10 and width is 3, you want the possible starting offsets to be 0 through 7, not 0 through 8.
And it looks like your counting within the motif should be using the position within the motif, not $pos; this:
$motA[$pos] = $motA[$pos] + 1;
should be
$motA[$pos-$ms] = $motA[$pos-$ms] + 1;

Related

How to read multiple text file filled with columns calculating statistics perl

I'm running some network measurements where I now get a ton of timestamps according to this picture:
http://i.stack.imgur.com/TNnBf.png
I want to get the statistics of Mean, Var & CoV and the problem I have is that I don't really know how to read in the text files so I get the right values in the right array.
I saw that Bash was considered poor in this area so I decided that perl should be better. This is my attempt so far:
#!/usr/bin/perl
#foreach $line
$pkt_number = $1;
$pkt_arrival = $2;
$pkt_size = $6;
$nr_of_pkg = 16;
$linespeed = 100000;
$sum_of_throughput = 0;
$throughput = 0;
$sum_of_throughputsquared = 0;
$mean = 0;
$var = 0;
$co_v = 0;
$duration = $pkt_arrival[0] - $pkt_arrival[15];
for (my $i = 0 ; $nr_of_pkg > $pkt_number[i] ; $i++) {
$throughput[i] = $packet_size[i] / $pkt_arrival[i];
$sum_of_throughput = $sum_of_throughput + $throughput[i];
$sum_of_throughput_squared = $sum_of_throughput_squared + $throughput[i] * $throughput[i];
}
$mean = $sum_of_throughput / $nr_of_pkg;
$var = (($sum_of_throughput * $sum_of_throughput) - $sum_of_throughput_squared);
$co_v = sqrt($var)
This is the code to read and parse the text file.
my #pkt_number;
my #pkt_arrival;
my #pkt_size;
my $i = 0;
my $line;
while ( <> ) {
$line = " $_"; # make sure it always start with a whitespace
chomp $line;
( $pkt_number[$i], $pkt_arrival[$i], $pkt_size[$i] )
= (split( /\s+/, $line, 8 ))[ 1, 2, 6 ];
printf "[%s]\n", join("/", $pkt_number[$i], $pkt_arrival[$i], $pkt_size[$i]);
$i++;
}
And this is how you run it on Linux :
% cat result.txt | ./your_perl.pl

Counting the frequencies of bases using for loop and substr in perl

I'm trying to count the number of bases using a for loop and the substr function but the counts are off and I'm not sure why! Please help! I have to use these functions in my assignment. Where am I going wrong? Here is my code:
use strict;
use warnings;
my $user_input = "accgtutf5";
#initalizing the lengths
my $a_base_total = 0;
my $c_base_total = 0;
my $g_base_total = 0;
my $t_base_total = 0;
my $other_total = 0;
for ( my $position = 0; $position < length $user_input; $position++ ) {
my $nucleotide = substr( $user_input, $position, 1 );
if ( $nucleotide eq "a" ) {
$a_base_total++;
} elsif ( $nucleotide eq "c" ) {
$c_base_total++;
} elsif ( $nucleotide eq "g" ) {
$g_base_total++;
} elsif ( $nucleotide eq "t" ) {
$t_base_total++;
} else {
$other_total++;
}
$position++;
}
print "a = $a_base_total\n";
print "c = $c_base_total\n";
print "g = $g_base_total\n";
print "t = $t_base_total\n";
print "other = $other_total\n";
The output I'm getting is :
a=1
c=1
g=0
t=2
other=1
When it should be:
a = 1
c = 2
g = 1
t = 2
other = 3
Thanks in advance! :)
You're incrementing twice.
Simply remove this line:
$position++;
Also, instead of iterating on position, I would suggest iterating on character.
Your script can be simplified to just:
use strict;
use warnings;
my $user_input = "accgtutf5";
my %count;
for my $nucleotide (split '', $user_input) {
$nucleotide = 'other' unless $nucleotide =~ /[acgt]/;
$count{$nucleotide}++;
}
printf "%s = %d\n", $_, $count{$_} // 0 for qw(a c g t other);
You are incrementing $position twice: once at the for and once at the end of the loop. Remove the second $position++.

Obtain 15 characters from a string that contain less than 15 characters in Perl

I have a sequence and a number representing the location of a residue(character). I want to take 7 residues from each side of the residue. This is the code to do that:
my $seq = substr($sequence, $location-8, 14);
This grabs 7 from each side of the residue. However, there are some sequences where there is less than 7 residues on either side. So when this occurs, I get an error saying:
substr outside of string at test9.pl line 52 (#1) (W substr)(F) You tried to reference a substr() that pointed outside of a string. That is, the absolute value of the offset was larger than the length of the string.
How can I change the empty places and replace them with another letter (X for example).
For example, if there is a sequence
ABCDEFGH
and $location points to D, I need 7 on each side so the result would be:
XXXXABCDEFGHXXX
Expanding on my comment above. I would create a my_substr function that encapsulates the padding and location shift.
my $sequence = "ABCDEFGH";
my $location = 3;
sub my_substr {
my ($seq, $location, $pad_length) = #_;
my $pad = "X"x$pad_length;
return substr("$pad$seq$pad", $location, (2*$pad_length+1));
}
print my_substr($sequence, $location, 7) . "\n";
yields
XXXXABCDEFGHXXX
This is an very verbose answer, but more or less gets you what you want:
use strict;
use warnings;
my $sequence = 'ABCDEFGH';
my $wings = 7;
my $location = index $sequence, 'D';
die "D not found" if $location == -1;
my $start = $location - $wings;
my $length = 1 + 2 * $wings;
my $leftpad = 0;
if ($start < 0) {
$leftpad = -1 * $start;
$start = 0;
}
my $seq = substr($sequence, $start, $length);
$seq = ('X' x $leftpad) . $seq if $leftpad;
my $rightpad = $length - length ($seq);
$seq .= 'X' x $rightpad if $rightpad > 0;
print $seq;
Or to avoid all the extra work, could just create a new $sequence variable containing padding:
my $sequence = 'ABCDEFGH';
my $wings = 7;
my $location = index $sequence, 'D';
die "D not found" if $location == -1;
my $paddedseq = ('X' x $wings) . $sequence . ('X' x $wings);
my $seq = substr($paddedseq, $location, 1 + 2 * $wings);
print $seq;

How I can take the coordinates of a block of numbers?

I have a problem tha bothers me a lot...
I have a file with two columns (thanks to your help in a previous question) like:
14430001 0.040
14430002 0.000
14430003 0.990
14430004 1.000
14430005 0.050
14430006 0.490
....................
the first column is coordinates the second probabilities.
I am trying to find the blocks with probability >=0.990 and to be more than 100 in size.
As output I want to be like this:
14430001 14430250
14431100 14431328
18750003 18750345
.......................
where the first column has the coordinate of the start of each block and the second the end of it.
I wrote this script:
use strict;
#use warnings;
use POSIX;
my $scores_file = $ARGV[0];
#finds the highly conserved subsequences
open my $scores_info, $scores_file or die "Could not open $scores_file: $!";
#open(my $fh, '>', $coords_file) or die;
my $count = 0;
my $cons = "";
my $newcons = "";
while( my $sline = <$scores_info>) {
my #data = split('\t', $sline);
my $coord = $data[0];
my $prob = $data[1];
if ($data[1] >= 0.990) {
#$cons = "$cons + '\n' + $sline + '\n'";
$cons = join("\n", $cons, $sline);
# print $cons;
$count++;
if($count >= 100) {
$newcons = join("\n", $newcons, $cons);
my #array = split /'\n'/, $newcons;
print #array;
}
}
else {
$cons = "";
$count = 0;
}
}
It gives me the lines with probability >=0.990 (the first if works) but the coordinates are wrong. When Im trying to print it in a file it stacks, so I have only one sample to check it.
Im terrible sorry if my explanations aren't helpful, but I am new in programming.
Please, I need your help...
Thank you very much in advance!!!
You seem to be using too much variables. Also, after splitting the array and assigning its parts to variables, use the new variables rather than the original array.
sub output {
my ($from, $to) = #_;
print "$from\t$to\n";
}
my $threshold = 0.980; # Or is it 0.990?
my $count = 0;
my ($start, $last);
while (my $sline = <$scores_info>) {
my ($coord, $prob) = split /\t/, $sline;
if ($prob >= $threshold) {
$count++;
defined $start or $start = $coord;
$last = $coord;
} else {
output($start, $last) if $count > 100;
undef $start;
$count = 0;
}
}
output($start, $last) if $count > 100;
(untested)

How can I extract/parse tabular data from a text file in Perl?

I am looking for something like HTML::TableExtract, just not for HTML input, but for plain text input that contains "tables" formatted with indentation and spacing.
Data could look like this:
Here is some header text.
Column One Column Two Column Three
a b
a b c
Some more text
Another Table Another Column
abdbdbdb aaaa
Not aware of any packaged solution, but something not very flexible is fairly simple to do assuming you can do two passes over the file: (the following is partially Perlish pseudocode example)
Assumption: data may contain spaces and is NOT quoted ala CSV if there's a space - if this is not the case, just use Text::CSV(_XS).
Assumption: no tabs used for formatting.
The logic defines a "column separator" to be any consecutive set of vertical rows populated 100% with spaces.
If by accident every row has a space which is part of the data at offset M characters, the logic will consider offset M to be a column separator, since it can't know any better. The ONLY way it can know better is if you require column separation to be at least X spaces where X>1 - see the second code fragment for that.
Sample code:
my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
# 0 means from entire file
my $lines_scanned = 0;
my #non_spaces=[];
# First pass - find which character columns in the file have all spaces and which don't
my $fh = open(...) or die;
while (<$fh>) {
last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
chomp;
my $line = $_;
my #chars = split(//, $line);
for (my $i = 0; $i < #chars; $i++) { # Probably can be done prettier via map?
$non_spaces[$i] = 1 if $chars[$i] ne " ";
}
}
close $fh or die;
# Find columns, defined as consecutive "non-spaces" slices.
my #starts, #ends; # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < #non_spaces; $i++) {
next if $state eq " " && !$non_spaces[$i];
next if $state eq "c" && $non_spaces[$i];
if ($state eq " ") { # && $non_spaces[$i] of course => start column
$state = "c";
push #starts, $i;
} else { # meaning $state eq "c" && !$non_spaces[$i] => end column
$state = " ";
push #ends, $i-1;
}
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
push #ends, $#non_spaces;
}
# Now split lines
my $fh = open(...) or die;
my #rows = ();
while (<$fh>) {
my #columns = ();
push #rows, \#columns;
chomp;
my $line = $_;
for (my $col_num = 0; $col_num < #starts; $col_num++) {
$columns[$col_num] = substr($_, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1);
}
}
close $fh or die;
Now, if you require column separation to be at least X spaces where X>1, it's also doable but the parser of column locations needs to be a bit more complex :
# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my #starts, #ends; # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < #non_spaces; $i++) {
if ($state eq "S") { # done with last column, inside a separator
if ($non_spaces[$i]) { # start a new column
$state = "c";
push #starts, $i;
}
next;
}
if ($state eq "c") { # Processing a column
if (!$non_spaces[$i]) { # First space after non-space
# Could be beginning of separator? check next X chars!
for (my $j = $i+1; $j < #non_spaces
|| $j < $i+$min_col_separator_is_X_spaces; $j++) {
if ($non_spaces[$j]) {
$i = $j++; # No need to re-scan again
next NEXT_CHAR; # OUTER loop
}
# If we reach here, next X chars are spaces! Column ended!
push #ends, $i-1;
$state = "S";
$i = $i + $min_col_separator_is_X_spaces;
}
}
next;
}
}
Here's a very quick solution, commented with an overview. (My apologies for the length.) Basically, if a "word" appears after the start of column header n, then it ends up in column n, unless most of its body trails into column n + 1, in which case it ends up there instead. Tidying this up, extending it to support multiple different tables, etc. are left as an exercise. You could also use something other than the left offset of the column header as the boundary mark, such as the centre, or some value determined by the column number.
#!/usr/bin/perl
use warnings;
use strict;
# Just plug your headers in here...
my #headers = ('Column One', 'Column Two', 'Column Three');
# ...and get your results as an array of arrays of strings.
my #result = ();
my $all_headers = '(' . (join ').*(', #headers) . ')';
my $found = 0;
my #header_positions;
my $line = '';
my $row = 0;
push #result, [] for (1 .. #headers);
# Get lines from file until a line matching the headers is found.
while (defined($line = <DATA>)) {
# Get the positions of each header within that line.
if ($line =~ /$all_headers/) {
#header_positions = #-[1 .. #headers];
$found = 1;
last;
}
}
$found or die "Table not found! :<\n";
# For each subsequent nonblank line:
while (defined($line = <DATA>)) {
last if $line =~ /^$/;
push #{$_}, "" for (#result);
++$row;
# For each word in line:
while ($line =~ /(\S+)/g) {
my $word = $1;
my $position = $-[1];
my $length = $+[1] - $position;
my $column = -1;
# Get column in which word starts.
while ($column < $#headers &&
$position >= $header_positions[$column + 1]) {
++$column;
}
# If word is not fully within that column,
# and more of it is in the next one, put it in the next one.
if (!($column == $#headers ||
$position + $length < $header_positions[$column + 1]) &&
$header_positions[$column + 1] - $position <
$position + $length - $header_positions[$column + 1]) {
my $element = \$result[$column + 1]->[$row];
$$element .= " $word";
# Otherwise, put it in the one it started in.
} else {
my $element = \$result[$column]->[$row];
$$element .= " $word";
}
}
}
# Output! Eight-column tabs work best for this demonstration. :P
foreach my $i (0 .. $#headers) {
print $headers[$i] . ": ";
foreach my $c (#{$result[$i]}) {
print "$c\t";
}
print "\n";
}
__DATA__
This line ought to be ignored.
Column One Column Two Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.
This line ought to be ignored also.
Sample output:
Column One: These lines are The data are split
Column Two: part of the tabular based on how
Column Three: data to be processed. much words overlap columns.