Counting nucleotide frequency using perl script [closed] - perl

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 5 years ago.
Improve this question
I have this perl script below to calculate sequence length and their frequency along with nucleotide frequency(A,T,G and C). This script works fine for a file with large number of sequences, but it does not give the right result for a file of small size like this:
infile.fasta
>NC_013116_1051_1114
TTGTCCCTTTGAGTCTCTGG
>NC_013116_1051_1114
GCGCAGCCGATATGGATAA
>NC_013116_1051_1114
TCGAGACTTTGTAATGTTTGGG
>NC_013116_1051_1114
TATTCCACGTCAGGTGCTTTT
>NC_013116_1051_1114
TAGAGCCGATTCCAGACTGTTCC
>NC_013116_1051_1114
TACAGGACCAAGCTCTTCACTC
>NC_013116_1051_1114
CCGTCAAGTTCAGCTCCAATAA
>NC_013116_6_301
CCACGCAACGGACAATCAAACA
>NC_013116_6_301
GGACACTTCCAACTATAAATA
>NC_013116_6_301
CCACGCAACGGACAATCAAACA
>NC_013116_1051_1114
GCTCTTCACTCTTCCTCGTCT
>NC_013116_1051_1114
TTGGGAAAAAGAAGTTGCTGCAGC
>NC_013116_1051_1114
TCGCAGTATCTCTGAAGTTG
count.pl
#!/usr/bin/perl -w
#usage ./count.pl infile min_length max_length
#usage ./count.pl infile 18 34
my $min_len = $ARGV[1];
my $max_len = $ARGV[2];
my $read_len = 0;
my #lines = ("header1","sequence","header2","quality");
my #lray = ();
my $count = 0;
my $total = 0;
my $i = 0;
my #Aray = ();
my #Cray = ();
my #Gray = ();
my #Tray = ();
my$FN = "";
for ($i=$min_len; $i<=$max_len; $i++){
$lray[$i] = 0;
}
open (INFILE, "<$ARGV[0]") || die "couldn't open input file!";
while (<INFILE>) {
$lines[$count] = $_;
chomp($lines[$count]);
$count++;
if($count eq 4){
$read_len = length($lines[1]);
# print "$read_len $lines[1]\n";
$FN = substr $lines[1], 0, 1;
$lray[$read_len]++;
if ($FN eq "T") { $Tray[$read_len]++;}
else {
if ($FN eq "A"){ $Aray[$read_len]++;}
else {
if ($FN eq "C"){ $Cray[$read_len]++;}
else {
if ($FN eq "G"){ $Gray[$read_len]++;}
}
}
}
$count = 0;
}
}
print "length\tnumber\tA\tC\tG\tT\n";
for ($i=$min_len; $i<=$max_len; $i++){
print "$i\t$lray[$i]\t$Aray[$i]\t$Cray[$i]\t$Gray[$i]\t$Tray[$i]\n";
}
exit;
This is the type of result I get from a big file with many sequences.
length number A C G T
18 4473 542 710 471 2750
19 12647 990 1680 1103 8874
20 31194 3010 3354 2743 22087
21 61214 6288 7196 5784 41946
22 128642 14596 11902 12518 89626
23 65190 6859 6525 7773 44033
24 10012 611 1401 1112 6888
25 1406 231 192 435 548
26 661 169 91 105 296
27 407 126 81 65 135
28 602 391 49 68 94
29 520 54 30 370 66
30 175 26 93 18 38
31 156 35 28 29 64
32 106 22 16 24 44
33 97 45 17 16 19
34 0
I would really appreciate if you could help me correct this code. Thanks

Trying do not reinvent the wheel, so, using the FAST module, got:
use 5.014;
use warnings;
use FAST::Bio::SeqIO;
my $fasta = FAST::Bio::SeqIO->new(-file => "infile.fasta", -format => 'Fasta');
my $seqnum=0;
while ( my $seq = $fasta->next_seq() ) {
my $stats;
$stats->{len} = length($seq->seq);
$stats->{$_}++ for split //, $seq->seq;
say ++$seqnum, " #$stats{qw(len A C G T)}";
}
The above, for your demo infile.fasta prints:
1 20 1 5 5 9
2 19 6 4 6 3
3 22 4 2 7 9
4 21 3 5 4 9
5 23 5 7 5 6
6 22 6 8 3 5
7 22 7 7 3 5
8 22 10 8 3 1
9 21 9 5 2 5
10 22 10 8 3 1
11 21 1 9 2 9
12 24 8 3 8 5
13 20 4 4 5 7
or the
use 5.014;
use warnings;
use FAST::Bio::SeqIO;
my $fasta = FAST::Bio::SeqIO->new(-file => "file.fasta", -format => 'Fasta');
my $stats;
while ( my $seq = $fasta->next_seq() ) {
my $len = length($seq->seq);
$stats->{$len}{count}++;
$stats->{$len}{$_}++ for split //, $seq->seq;
}
say "Length $_ ($stats->{$_}->{count} times) Letters freq: #{$stats->{$_}}{qw(A C G T)}" for sort { $a <=> $b } keys %$stats;
produce:
Length 19 (1 times) Letters freq: 6 4 6 3
Length 20 (2 times) Letters freq: 5 9 10 16
Length 21 (3 times) Letters freq: 13 19 8 23
Length 22 (5 times) Letters freq: 37 33 19 21
Length 23 (1 times) Letters freq: 5 7 5 6
Length 24 (1 times) Letters freq: 8 3 8 5
and so on...

Related

About gsl_cdf_tdist_P from PDL::GSL::CDF (Perl)

Could you please let me know how to get the full p-value from gsl_cdf_tdist_P function when p-value is smaller than 1E-16? I am getting 0 instead.
Thanks,
Woody
print "t-test p-value = " . ttest(\#n,\#t) . "\n";
sub ttest{
my ($n,$t) = #_;
my #n = #$n;
my #t = #$t;
my $nn = pdl(#n);
my $tt = pdl(#t);
my ($tstats, $df) = t_test( $nn, $tt );
use PDL::GSL::CDF;
my $p_2tail = 2 * (1 - gsl_cdf_tdist_P( $tstats->abs, $df ));
return $p_2tail;
}
My input values as follows:
my #n = qw (1 2 4 2 3 1 2 4 2 1 2 4 2 3 1 2 4 2 1 2 4 2 3 1 2 4 2);
my #t = qw (11 12 13 12 13 11 14 11 12 13 12 13 11 14 11 12 13 12 13 11 14);
I found an easy solution for this problem. I used gsl_cdf_tdist_Q to get p-values. The p-values are not limited to 16 digits after decimal because they are not remapped p-values like (1-gsl_cdf_tdist_P).
Woody

Modifying Script to include the Count of a each time a name appears from a table

I have a script below that takes my FILE1 and parses out FILE2 only if the first column of FILE1 matches column number 10 of FILE2. So it will print out the rows I need. This part works great. The part I am having a tad bit of difficulty is inserting a sort of count for the output. The goal of the script is take column 10 at the end and produce an output. In my list there are 12 names and I want to get the count of each name. For the example below, I have used four names.
FILE1:
name1 15
name2 15
name2 30
name5 15
name4 10
name2 5
name2 5
FILE2:
23 15 5.4 1.3 5 55 128 21799 + 32 name2 1 77 0 1
23 20 5.4 1.3 5 55 128 7998 + 18 name4 1 77 0 1
23 20 5.4 1.3 6 55 128 9984 + 13 name4 1 77 1 1
23 20 5.4 1.3 7 55 128 7998 + 14 name5 1 77 2 1
23 20 5.4 1.3 6 55 128 994 + 14 name1 1 77 3
23 20 5.4 1.3 9 55 128 984 + 5 name7 1 77 4 1
23 20 5.4 1.3 5 55 128 99 + 5 name8 1 77 5 1
Expected Output
$VAR1 = {
'name1' => 1,
'name2' => 4,
'name4' => 1,
'name5' => 1,
};
5 55 128 21799 32 name2 77 0 1
5 55 128 7998 18 name4 77 0 1
6 55 128 9984 13 name4 77 1 1
7 55 128 7998 14 name5 77 2 1
6 55 128 994 14 name1 77 3 1
name1 1
name2 1
name4 2
name5 1
You can test the script it works. The part I am having difficulty with is inserting the count of each name based on the output. The print \%x is a way of checking if my original list was truly used as I am working with a much larger set of data. If someone could point me the right direction on how to modify my script without changing it drastically that would be great. I feel like this script fulfills the majority of my needs even if it is not the most efficient way of doing it.
use strict;
use Data::Dumper;
my %x;
open(FILE1, $ARGV[0]) or die "Cannot open the file: $!";
while (my $line = <FILE1>) {
my #array = split(" ", $line);
$x{$array[0]}++;
}
close FILE1;
print Dumper( \%x );
my %count;
open(FILE2, $ARGV[1]) or die "Cannot open the file: $!";
while (my $line = <FILE2>) {
my #name = split(" ", $line);
my $y = $name[9];
if ( $x{ $y } ) {
print join(" ", #name[4,5,6,7,9,11,12,13]), "\n";
$count{#name[9]}++;
}
}
print Dumper (\%count);
close FILE2;
exit;
Script now counts. Just need to debug.
the "minimal" change would be to set the elements of %x to 0 in the FILE1 loop, then check for exists $x{$y} in the FILE2 loop and do ++$x{$y} inside the condition body. Now at the end %x has the counts of all the occurrences.
The usual way (as mentioned in the comments of the question) would be to declare an additional %count and perform the same ++$count{$y} inside the if block as in the above method.
The first has the advantage and disadvantage (depending on your needs) of reporting the count even when the name has zero found occurrences.

Consolidation of intervals

I'm working with biological data (copy number variations) which is shown as intervals (tab separated file):
File 1
Columns: Chromosome, Start, End, Annotation
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
I intersected them in order to consolidate the overlapping events (50% of overlap is my condition), the result is this:
I used intersectBed from Bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html):
$ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File 2
File 2
Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation
1 1 10 A 1 1 10 A
1 1 10 A 1 3 12 B
1 3 12 B 1 1 10 A
1 3 12 B 1 3 12 B
1 3 12 B 1 7 15 C
1 7 15 C 1 3 12 B
1 7 15 C 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 35 45 E
1 35 45 E 1 37 45 F
1 37 45 F 1 35 45 E
1 37 45 F 1 37 45 F
1 50 60 G 1 50 60 G
1 50 60 G 1 50 65 H
1 50 65 H 1 50 60 G
1 50 65 H 1 50 65 H
Event A and the event C overlaps with the event B, event E and F overlaps with each other like G and H, finally the event D has no overlapping partners. Knowing this, the list of consolidated CNV should be:
File 3
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
I was trying to use the merge option of the HDCNV java software (http://daleylab.org/lab/?page_id=125) but the output is not what I needed. I was trying to write a perl code but I'm a beginner so this problem is, at the moment, out of my limits.
I would appreciate if you can help me with a nice perl or awk code which take File 2 as input and outputs File 3.
Thanks in advance
I'm assuming that the columns have the following meanings:
col 1: chromosome number
col 2: start position of genomic region
col 3: end position of genomic region
col 4: text identifier
This script looks for the areas of overlap between the named regions. It assumes that the input text is sorted by col 1 then col 2. I have put the input text in a string, but you will probably be reading it in from a file (and outputting your data to a file, too). I will leave you to work out how to do that--it is pretty easy, and there is lots of documentation on the perl website.
#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";
use Data::Dumper;
my $text = '1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
';
# we have tab-delimited data.
# split on line breaks, remove line ending, split on tabs
my #lines = map { chomp; [ split(/\t/, $_) ]; } split("\n", $text);
my $col_0 = 1;
my $min = 0;
my $max = 0;
my #range;
foreach (#lines) {
# the chromosome number has changed or
# minimum is greater than current maximum:
# start a new interval
if ($col_0 != $_->[0] || $_->[1] > $max) {
if (#range) {
# print out the range, and restart the stack
say join("\t",
$col_0,
( $min || $_->[1] ),
( $max || $_->[2] ),
join(", ", #range)
);
}
#range = ( $_->[3] );
# set the min and max
$col_0 = $_->[0];
$min = $_->[1];
$max = $_->[2];
}
else {
# the minimum is lower than our current maximum.
# check whether the max is greater than our current
# max and increase it if so. Add the letter to the
# current range.
if ($_->[2] > $max) {
$max = $_->[2];
}
push #range, $_->[3];
}
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", #range) );
Output:
1 1 15 A, B, C
1 20 30 D
1 35 45 E, F
1 50 65 G, H
2 1 15 I, J, K
2 20 30 L
2 35 45 M, N
2 50 65 O, P
I have just calculated simple overlap - this doesn't do 50% overlap. Using this script as a start, you can figure out how to do that. We're not doing your PhD for you! ;)
awk '
$2 > end && NR>1 {
print "1", start, end, pair;
start=end=pair=0
}
{
if (!start) { start = $2 };
end = $3;
pair = (pair ? pair "," $4 : $4)
}
END {
print "1", start, end, pair
}' file
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
Assuming ordered data, the following stub should handle merging the records.
Would just have to modify it to load and output to a file.
use strict;
use warnings;
use List::Util qw(min max);
my $last;
while (<DATA>) {
my #fields = split;
if ( !$last ) {
$last = \#fields;
} elsif ( $last->[0] == $fields[0] && $last->[2] > $fields[1] ) {
$last->[1] = min( $last->[1], $fields[1] );
$last->[2] = max( $last->[2], $fields[2] );
$last->[3] .= ",$fields[3]";
} else {
print join( "\t", #$last ), "\n";
$last = \#fields;
}
}
print join( "\t", #$last ), "\n";
__DATA__
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
Outputs:
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
2 1 15 I,J,K
2 20 30 L
2 35 45 M,N
2 50 65 O,P
My take:
awk -F "\t" -v OFS="\t" '
function emit() {print chrom, start, end, annot}
$1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
annot = annot "," $4
if ($2 < start) start = $2
if ($3 > end) end = $3
next
}
chrom {emit()}
{chrom=$1; start=$2; end=$3; annot=$4}
END {emit()}
' file1
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H

How to use rank index number to sort other set of data using loops?

I am quite new to Matlab, and am tasked to use Matlab to manage a finance/econs database.
To cut straight to the problem. Just imagine I have two sets of data, one is A and another one is B (see below). My objective is to rank the 3 columns according to value size, and then I would like to use the rank index for A (sorted_index) to position the values in B accordingly.
Below is the working but non-looping solution to obtain my answer:
A = [5 17 8; 11 2 9; 55 70 3; 11 8 33; 9 71 35; 9 2 3; 21 5 43; 5 2 9; 41 5 23; 61 72 91];
B = [1 2 3; 11 12 13; 21 22 23; 31 32 33; 1 2 3; 11 12 13; 21 22 23; 31 32 33; 41 42 43; 51 52 53];
[A_sorted sorted_index] = sort (A);
[B_sorted sorted_indexB] = sort (B);
B_sorted (:,1) = B(sorted_index(:,1),1);
B_sorted (:,2) = B(sorted_index(:,2),2);
B_sorted (:,3) = B(sorted_index(:,3),3);
The outcome of B (sorted according to the rank position of A):
1 12 23
31 12 13
1 32 3
11 22 13
11 42 33
31 32 43
21 2 33
41 22 3
21 2 23
51 52 53
The problem is, what if I have 2000 columns instead of just 3, how can i loop successfully?
I tried this
for ii= size(B,2); jj= size(B,2) ; kk= size(B,2);
temp = 0*B;
temp(:,ii) = B(sorted_index(:,jj),kk);
B_sortedTest= temp;
end
But it only turns out to give me the correct sorted result for the last column, the first two columns are overwritten (become all zeros). Can you help me to solve the problem?
Thank you very very much!
Here is my method without any loops:
A = [5 17 8; 11 2 9; 55 70 3; 11 8 33; 9 71 35; 9 2 3; 21 5 43; 5 2 9; 41 5 23; 61 72 91];
B = [1 2 3; 11 12 13; 21 22 23; 31 32 33; 1 2 3; 11 12 13; 21 22 23; 31 32 33; 41 42 43; 51 52 53];
[A_sorted sorted_index] = sort (A);
% converting sorted_index into a vectorized form and having linear indices instead of subscripts i.e.
% (row 2,column 3) in your sorted_index will be represented as 23=2*number of rows + 3=2*10+3.
linearSortedIndex=sub2ind(size(sorted_index),sorted_index(:),reshape(repmat((1:size(sorted_index,2),size(sorted_index,1),1).*ones(size(sorted_index)),[],1));
B_sorted1=reshape(B(linearSortedIndex),[],size(B,2));
%test that the result is correct
for i=1:size(B,2)
B_sorted2(:,i) = B(sorted_index(:,i),i);
end
isequal(B_sorted1,B_sorted2) %If it prints 1, then this method is correct.
Try this:
A = [5 17 8; 11 2 9; 55 70 3; 11 8 33; 9 71 35; 9 2 3; 21 5 43; 5 2 9; 41 5 23; 61 72 91];
B = [1 2 3; 11 12 13; 21 22 23; 31 32 33; 1 2 3; 11 12 13; 21 22 23; 31 32 33; 41 42 43; 51 52 53];
[A_sorted sorted_index] = sort (A);
[B_sorted sorted_indexB] = sort (B);
for i=1:size(B,2)
B_sorted (:,i) = B(sorted_index(:,i),i);
end

concat columns in perl

Each iteration in my perl code generates a vector of 5.
Output of first iteration is
out1
1
2
3
4
5
The second iterations generates same length of vector.
out2
10
20
30
40
50
and then it runs for nth time
out n
100
200
300
400
500
I want to merge these columns and have the final output in a tabular format or matrix format if you like:
out1 out2 ... outn
1 10 100
2 20 200
3 30 300
4 40 400
5 50 500
I tried splitting and then using the push but it prints "(101" and only do it once and not for all 20. I also have no idea where the "(101" comes from.
Any suggestions?
First, put all those output lists to a list. Second, iterate on that list: output every first element of each element-list in the first iteration, output every second element of each element-list in the second iteration, and so on.
For example
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #lists;
for my $i (1..10) {
my #list;
push #list, $_ * $i for (1..5);
push #lists, \#list;
}
$Data::Dumper::Indent = 0;
print Dumper(\#lists), "\n\n";
while (#{$lists[0]}) {
for my $list (#lists) {
print shift #$list, "\t";
}
print "\n";
}
Output:
$ perl t.pl
$VAR1 = [
[1,2,3,4,5],
[2,4,6,8,10],
[3,6,9,12,15],
[4,8,12,16,20],
[5,10,15,20,25],
[6,12,18,24,30],
[7,14,21,28,35],
[8,16,24,32,40],
[9,18,27,36,45],
[10,20,30,40,50]
];
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Note: The output of Data::Dumper has been edited to make it more compact.
Save your vector information to an array of array as you do your processing. Then you can output the rows using a simple join:
use strict;
use warnings;
my #rows;
for my $i (1..10) {
my #vector = map {$i * $_} (1..5);
push #{$rows[$_]}, $vector[$_] for (0..$#vector);
}
for my $row (#rows) {
print join(" ", map {sprintf "%-3s", $_} #$row), "\n";
}
Outputs:
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Note: It'd be a lot easier to advise if you provided code and actual data.