Perl to count current value based on next value - perl

Currently I'm learning Perl and gnuplot. I would like to know how to calculate a missing value based on the previous and next values. For example, I have a text file consisting of:
#ID(X) Y
1 1
3 9
5 11
The output should show the value of the unknown ID as well. So, the output should show:
#ID(X) Y
1 1
2 5
3 9
4 10
5 11
The Y of ID#2 is based on the following:
((2-3)/(1-3))*1 + ((2-1)/(3-1))*9, which is linear interpolation:
Y2 = ((X2-X3)/(X1-X3))*Y1 + ((X2-X1)/(X3-X1))*Y3
The same goes for ID#4.
Currently I have this code:
#! /usr/bin/perl -w
use strict;
my $prev_id = 0;
my $prev_val = 0;
my $next_id;
my $next_val;
while (<>)
{
my ($id, $val) = split;
for (my $i = $prev_id + 1; $i < $next_id; $i++)
{
$val = (($id - $next_id) / ($prev_id - $next_id)) * $prev_val + (($id - $prev_id) / ($next_id - $prev_id)) * $next_val;
printf ("%d %s\n", $i, $val);
}
printf ("%d %s\n", $id, $val);
($prev_val, $prev_id) = ($val, $id);
($next_val, $next_id) = ($prev_val, $prev_id);
}

Your formula seems more complicated than I would expect, given that you are always dealing with integer spacings of 1.
You did not say whether you want to fill gaps for multiple consecutive missing values, but let's assume you want to.
Read the first line, treat it as the current point, and output it. Then read the next line; if its ID is not the expected one, fill the gap with simple linear interpolation.
Pseudocode
(currID, currY) = readline()
outputvals( currID, currY )
while lines remain do
(nextID, nextY) = readline()
gap = nextID - currID
for i = 1 to gap
id = currID + i
y = currY + (nextY - currY) * i / gap
outputvals( id, y )
end
(currID, currY) = (nextID, nextY)
end
Sorry for the non-Perl code. It's just that I haven't been using Perl for ages, and can't remember half of the syntax. =) The concepts here are pretty easy to translate into code though.
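The pseudocode above could be translated to Perl roughly as follows. This is only a sketch: the function name fill_gaps and the list-of-pairs input are assumptions made for illustration, and it expects the points to be sorted by ID with integer spacing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fill integer gaps between known (id, y) points by linear interpolation.
# Takes a list of [id, y] pairs sorted by id; returns the filled list.
# (fill_gaps is a hypothetical helper name, not part of the original answer.)
sub fill_gaps {
    my @points = @_;
    return () unless @points;

    my ($curr_id, $curr_y) = @{ shift @points };
    my @filled = ([ $curr_id, $curr_y ]);

    for my $point (@points) {
        my ($next_id, $next_y) = @$point;
        my $gap = $next_id - $curr_id;
        # When $gap is 1 this loop runs once and simply emits the next point.
        for my $i (1 .. $gap) {
            my $y = $curr_y + ($next_y - $curr_y) * $i / $gap;
            push @filled, [ $curr_id + $i, $y ];
        }
        ($curr_id, $curr_y) = ($next_id, $next_y);
    }
    return @filled;
}

# Sample data from the question.
my @filled = fill_gaps([1, 1], [3, 9], [5, 11]);
print "#ID(X) Y\n";
print "$_->[0] $_->[1]\n" for @filled;
```

With the question's sample data this should reproduce the desired five rows, and it also copes with gaps of more than one missing ID.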

Using an array may be the way to go. This will also make your data available for further manipulation.
** Caveat: will not work for multiple consecutive missing values of y; see @paddy's answer.
#!/usr/bin/perl
use strict;
use warnings;
my @coordinates;
while (<DATA>) {
my ($x, $y) = split;
$coordinates[$x] = $y;
}
# note that the for loop starts on index 1 here ...
for my $x (1 .. $#coordinates) {
if (! $coordinates[$x]) {
$coordinates[$x] = (($x - ($x + 1)) / (($x - 1) - ($x + 1)))
* $coordinates[$x - 1]
+ (($x - ($x - 1)) / (($x + 1) - ($x - 1)))
* $coordinates[$x + 1];
}
print "$x - $coordinates[$x]\n";
}
__DATA__
1 1
3 9
5 11

You indicated your problem is getting the next value. The key isn't to look ahead, it's to look behind.
my $prev = get first value;
my ($prev_a, $prev_b) = parse($prev);
my $this = get second value;
my ($this_a, $this_b) = parse($this);
while ($next = get next value) {
my ($next_a, $next_b) = parse($next);
...
$prev = $this; $prev_a = $this_a; $prev_b = $this_b;
$this = $next; $this_a = $next_a; $this_b = $next_b;
}

#! /usr/bin/perl -w
use strict;
my @in = (1,9,11);
my @out;
for (my $i = 0; $i<$#in; $i++) {
my $j = $i*2;
my $X1 = $i;
my $X2 = $i+1;
my $X3 = $i+2;
my $Y1 = $in[$i];
my $Y3 = $in[$i+1];
my $Y2 = $Y1*(($X2-$X3)/($X1-$X3))
+ $Y3*(($X2-$X1)/($X3-$X1));
$out[$j] = $in[$i];
$out[$j+1] = $Y2;
}
$out[$#in*2] = $in[$#in];
print (join " ",@out);

Related

Obtain 15 characters from a string that contain less than 15 characters in Perl

I have a sequence and a number representing the location of a residue(character). I want to take 7 residues from each side of the residue. This is the code to do that:
my $seq = substr($sequence, $location-8, 14);
This grabs 7 from each side of the residue. However, there are some sequences where there are fewer than 7 residues on either side. When this occurs, I get an error saying:
substr outside of string at test9.pl line 52 (#1) (W substr)(F) You tried to reference a substr() that pointed outside of a string. That is, the absolute value of the offset was larger than the length of the string.
How can I fill the empty places with another letter (X, for example)?
For example, if there is a sequence
ABCDEFGH
and $location points to D, I need 7 on each side so the result would be:
XXXXABCDEFGHXXX
Expanding on my comment above. I would create a my_substr function that encapsulates the padding and location shift.
my $sequence = "ABCDEFGH";
my $location = 3;
sub my_substr {
my ($seq, $location, $pad_length) = @_;
my $pad = "X"x$pad_length;
return substr("$pad$seq$pad", $location, (2*$pad_length+1));
}
print my_substr($sequence, $location, 7) . "\n";
yields
XXXXABCDEFGHXXX
This is a very verbose answer, but it more or less gets you what you want:
use strict;
use warnings;
my $sequence = 'ABCDEFGH';
my $wings = 7;
my $location = index $sequence, 'D';
die "D not found" if $location == -1;
my $start = $location - $wings;
my $length = 1 + 2 * $wings;
my $leftpad = 0;
if ($start < 0) {
$leftpad = -1 * $start;
$start = 0;
}
my $seq = substr($sequence, $start, $length);
$seq = ('X' x $leftpad) . $seq if $leftpad;
my $rightpad = $length - length ($seq);
$seq .= 'X' x $rightpad if $rightpad > 0;
print $seq;
Or, to avoid all the extra work, you could just create a new variable containing the padded sequence:
my $sequence = 'ABCDEFGH';
my $wings = 7;
my $location = index $sequence, 'D';
die "D not found" if $location == -1;
my $paddedseq = ('X' x $wings) . $sequence . ('X' x $wings);
my $seq = substr($paddedseq, $location, 1 + 2 * $wings);
print $seq;

Using perl, given an array of any size, how do I randomly pick 1/4 of the list

For clarification: if I had a list of 8 elements, I would want to randomly pick 2. If I had a list of 20 elements, I would want to randomly pick 5. I would also like to ensure (though it is not required) that two picked elements don't touch, i.e. if possible not the 3rd and then the 4th element. Rather, the 3rd and 5th would be nicer.
The simplest solution:
1. Shuffle the list.
2. Select the first quarter.
Example implementation:
use feature 'say';
use List::Util qw/shuffle/;
my @nums = 1..20;
my @pick = (shuffle @nums)[0 .. 0.25 * $#nums];
say "@pick";
Example output: 10 2 18 3 19.
Your additional restriction “no neighboring numbers” actually makes this less random, and should be avoided if you want actual randomness. To keep two neighboring elements out of the output, I would iteratively splice unwanted elements out of the list:
my @nums = 1..20;
my $size = 0.25 * @nums;
my @pick;
while (@pick < $size) {
my $i = int rand @nums;
push @pick, my $num = $nums[$i];
# check and remove neighbours
my $len = 1;
$len++ if $i < $#nums and $num + 1 == $nums[$i + 1];
$len++, $i-- if 0 < $i and $num - 1 == $nums[$i - 1];
splice @nums, $i, $len;
}
say "@pick";
use strict;
use warnings;
sub randsel {
my ($fact, $i, @r) = (1.0, 0);
while (@r * 4 < @_) {
if (not grep { $_ == $i } @r) {
$fact = 1.0;
# make $fact = 0.0 if you really don't want
# consecutive elements
$fact = 0.1 if grep { abs($i - $_) == 1 } @r;
push(@r, $i) if (rand() < 0.25 * $fact);
}
$i = ($i + 1) % @_;
}
return map { $_[$_] } sort { $a <=> $b } @r;
}
my @l;
$l[$_] = $_ for (0..19);
print join(" ", randsel(@l)), "\n";

Help me finish the last part of my app? It solves any Countdown Numbers game on Channel 4 by brute forcing every possible equation

For those not familiar with the game: you're given 8 numbers and you have to reach the target by using +, -, / and *.
So if the target is 254 and your game numbers are 2, 50, 5, 2, 1, you would answer the question correctly by saying 5 * 50 = 250, then 2 + 2 = 4, and adding that on as well to get 254.
Some videos of the game are here:
Video 1
video 2
Basically, I brute-force the game by generating all permutations of all sizes for the numbers and all permutations of the symbols, and I use a basic infix calculator to calculate the solution.
However, it contains a flaw, because all the solutions are evaluated as follows: ((((1+1)*2)*3)*4). It doesn't permute the brackets, and it's causing me a headache.
Therefore I cannot solve every equation. For example, given
a target of 16 and the numbers 1,1,1,1,1,1,1,1, it fails when it should find (1+1+1+1)*(1+1+1+1)=16.
I'd love it if someone could help finish this... in any language.
This is what I've written so far:
#!/usr/bin/env perl
use strict;
use warnings;
use Algorithm::Permute;
# GAME PARAMETERS TO FILL IN
my $target = 751;
my @numbers = ( '2', '4', '7', '9', '1', '6', '50', '25' );
my $num_numbers = scalar(@numbers);
my @symbols = ();
foreach my $n (@numbers) {
push(@symbols, ('+', '-', '/', '*'));
}
my $num_symbols = scalar(@symbols);
print "Symbol table: " . join(", ", @symbols);
my $lst = [];
my $symb_lst = [];
my $perms = '';
my @perm = ();
my $symb_perms = '';
my @symb_perm;
my $print_mark = 0;
my $progress = 0;
my $total_perms = 0;
my @closest_numbers;
my @closest_symb;
my $distance = 999999;
sub calculate {
my @oprms = @{ $_[0] };
my @ooperators = @{ $_[1] };
my @prms = @oprms;
my @operators = @ooperators;
#print "PERMS: " . join(", ", @prms) . ", OPERATORS: " . join(", ", @operators);
my $total = pop(@prms);
foreach my $operator (@operators) {
my $x = pop(@prms);
if ($operator eq '+') {
$total += $x;
}
if ($operator eq '-') {
$total -= $x;
}
if ($operator eq '*') {
$total *= $x;
}
if ($operator eq '/') {
$total /= $x;
}
}
#print "Total: $total\n";
if ($total == $target) {
#print "ABLE TO ACCURATELY SOLVE WITH THIS ALGORITHM:\n";
#print "PERMS: " . join(", ", @oprms) . ", OPERATORS: " . join(", ", @ooperators) . ", TOTAL=$total\n";
sum_print(\@oprms, \@ooperators, $total, 0);
exit(0);
}
my $own_distance = ($target - $total);
if ($own_distance < 0) {
$own_distance *= -1;
}
if ($own_distance < $distance) {
#print "found a new solution - only $own_distance from target $target\n";
#print "PERMS: " . join(", ", @oprms) . ", OPERATORS: " . join(", ", @ooperators) . ", TOTAL=$total\n";
sum_print(\@oprms, \@ooperators, $total, $own_distance);
@closest_numbers = @oprms;
@closest_symb = @ooperators;
$distance = $own_distance;
}
$progress++;
if (($progress % $print_mark) == 0) {
print "Tested $progress permutations. " . (($progress / $total_perms) * 100) . "%\n";
}
}
sub factorial {
my $f = shift;
$f == 0 ? 1 : $f*factorial($f-1);
}
sub sum_print {
my @prms = @{ $_[0] };
my @operators = @{ $_[1] };
my $total = $_[2];
my $distance = $_[3];
my $tmp = '';
my $op_len = scalar(@operators);
print "BEST SOLUTION SO FAR: ";
for (my $x = 0; $x < $op_len; $x++) {
print "(";
}
$tmp = pop(@prms);
print "$tmp";
foreach my $operator (@operators) {
$tmp = pop(@prms);
print " $operator $tmp)";
}
if ($distance == 0) {
print " = $total\n";
}
else {
print " = $total (distance from target $target is $distance)\n";
}
}
# look for straight match
foreach my $number (@numbers) {
if ($number == $target) {
print "matched!\n";
}
}
for (my $x = 1; $x < (($num_numbers*2)-1); $x++) {
$total_perms += factorial($x);
}
print "Total number of permutations: $total_perms\n";
$print_mark = $total_perms / 100;
if ($print_mark == 0) {
$print_mark = $total_perms;
}
for (my $num_size=2; $num_size <= $num_numbers; $num_size++) {
$lst = \@numbers;
$perms = new Algorithm::Permute($lst, $num_size);
print "Perms of size: $num_size.\n";
# print matching symb permutations
$symb_lst = \@symbols;
$symb_perms = new Algorithm::Permute($symb_lst, $num_size-1);
while (@perm = $perms->next) {
while (@symb_perm = $symb_perms->next) {
calculate(\@perm, \@symb_perm);
}
$symb_perms = new Algorithm::Permute($symb_lst, $num_size-1);
}
}
print "exhausted solutions";
print "CLOSEST I CAN GET: $distance\n";
sum_print(\@closest_numbers, \@closest_symb, $target-$distance, $distance);
exit(0);
Here is the example output:
[15:53: /mnt/mydocuments/git_working_dir/countdown_solver$] perl countdown_solver.pl
Symbol table: +, -, /, *, +, -, /, *, +, -, /, *, +, -, /, *, +, -, /, *, +, -, /, *, +, -, /, *, +, -, /, *Total number of permutations: 93928268313
Perms of size: 2.
BEST SOLUTION SO FAR: (2 + 4) = 6 (distance from target 751 is 745)
BEST SOLUTION SO FAR: (2 * 4) = 8 (distance from target 751 is 743)
BEST SOLUTION SO FAR: (4 + 7) = 11 (distance from target 751 is 740)
BEST SOLUTION SO FAR: (4 * 7) = 28 (distance from target 751 is 723)
BEST SOLUTION SO FAR: (4 * 9) = 36 (distance from target 751 is 715)
BEST SOLUTION SO FAR: (7 * 9) = 63 (distance from target 751 is 688)
BEST SOLUTION SO FAR: (4 * 50) = 200 (distance from target 751 is 551)
BEST SOLUTION SO FAR: (7 * 50) = 350 (distance from target 751 is 401)
BEST SOLUTION SO FAR: (9 * 50) = 450 (distance from target 751 is 301)
Perms of size: 3.
BEST SOLUTION SO FAR: ((4 + 7) * 50) = 550 (distance from target 751 is 201)
BEST SOLUTION SO FAR: ((2 * 7) * 50) = 700 (distance from target 751 is 51)
BEST SOLUTION SO FAR: ((7 + 9) * 50) = 800 (distance from target 751 is 49)
BEST SOLUTION SO FAR: ((9 + 6) * 50) = 750 (distance from target 751 is 1)
Perms of size: 4.
BEST SOLUTION SO FAR: (((9 + 6) * 50) + 1) = 751
Here is a Java applet (source) and a JavaScript version.
The suggestion to use reverse polish notation is excellent.
If you have N=5 numbers, the template is
{num} {num} {ops} {num} {ops} {num} {ops} {num} {ops}
There can be zero to N ops in any spot, although the total number will be N-1. You just have to try different placements of numbers and ops.
The (((1+1)+1)+1)*(((1+1)+1)+1)=16 solution will be found when you try
1 1 + 1 + 1 + 1 1 + 1 + 1 + *
Update: Maybe not so good, since finding the above could take 433,701,273,600 tries. The number was obtained using the following:
use strict;
use warnings;
{
my %cache = ( 0 => 1 );
sub fact { my ($n) = @_; $cache{$n} ||= fact($n-1) * $n }
}
{
my %cache;
sub C {
my ($n,$r) = @_;
return $cache{"$n,$r"} ||= do {
my $i = $n;
my $j = $n-$r;
my $c = 1;
$c *= $i--/$j-- while $j;
$c
};
}
}
my @nums = (1,1,1,1,1,1,1,1);
my $Nn = 0+@nums; # Number of numbers.
my $No = $Nn-1; # Number of operators.
my $max_tries = do {
my $num_orderings = fact($Nn);
{
my %counts;
++$counts{$_} for @nums;
$num_orderings /= fact($_) for values(%counts);
}
my $op_orderings = 4 ** $No;
my $op_placements = 1;
$op_placements *= C($No, $_) for 1..$No-1;
$num_orderings * $op_orderings * $op_placements
};
printf "At most %.f tries needed\n", $max_tries;
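To make the reverse Polish notation idea concrete, here is a minimal sketch of a stack-based evaluator. The name eval_rpn is made up for this illustration, and it only covers the evaluation step, not the enumeration of number and operator placements that the search would need.

```perl
use strict;
use warnings;

# Evaluate a whitespace-separated reverse Polish notation expression.
# Returns undef on division by zero or on a malformed expression.
# (eval_rpn is a hypothetical helper name used only for this sketch.)
sub eval_rpn {
    my ($expr) = @_;
    my %ops = (
        '+' => sub { $_[0] + $_[1] },
        '-' => sub { $_[0] - $_[1] },
        '*' => sub { $_[0] * $_[1] },
        '/' => sub { $_[1] == 0 ? undef : $_[0] / $_[1] },
    );
    my @stack;
    for my $token (split ' ', $expr) {
        if (my $op = $ops{$token}) {
            # An operator consumes the two most recent values.
            return undef if @stack < 2;
            my $right = pop @stack;
            my $left  = pop @stack;
            my $value = $op->($left, $right);
            return undef unless defined $value;
            push @stack, $value;
        }
        else {
            # Anything else is treated as a number.
            push @stack, $token;
        }
    }
    return @stack == 1 ? $stack[0] : undef;
}

# (1+1+1+1)*(1+1+1+1), written as in the answer above.
print eval_rpn('1 1 + 1 + 1 + 1 1 + 1 + 1 + *'), "\n";
# ((9+6)*50)+1 from the question's sample run.
print eval_rpn('9 6 + 50 * 1 +'), "\n";
```

Because the stack holds intermediate results, bracketing never has to be generated explicitly: where each operator sits in the token sequence decides how the sub-expressions group.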

How do I change this to "idiomatic" Perl?

I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms?
Just an explanation of what it is doing: this routine is part of a module that aligns DNA or amino acid sequences (using Needleman-Wunsch, if you care about such things). It creates two 2D arrays, one to store a score for each position in the two sequences, and one to keep track of the path so the highest-scoring alignment can be recreated later. It works fine, but I know I am not doing things very concisely and clearly.
edit: This was for an assignment. I completed it, but want to clean up my code a bit. The details on implementing the algorithm can be found on the class website if any of you are interested.
sub create_matrix {
my $self = shift;
#empty array reference
my $matrix = $self->{score_matrix};
#empty array ref
my $path_matrix = $self->{path_matrix};
#$seq1 and $seq2 are strings set previously
my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;
#create the 2d array of scores
for (my $i = 0; $i < $num_of_rows; $i++) {
push(@$matrix, []);
push(@$path_matrix, []);
$$matrix[$i][0] = $i * $self->{gap_cost};
$$path_matrix[$i][0] = 1;
}
#fill out the first row
for (my $i = 0; $i < $num_of_columns; $i++) {
$$matrix[0][$i] = $i * $self->{gap_cost};
$$path_matrix[0][$i] = -1;
}
#flag to signal end of traceback
$$path_matrix[0][0] = 2;
#double for loop to fill out each row
for (my $row = 1; $row < $num_of_rows; $row++) {
for (my $column = 1; $column < $num_of_columns; $column++) {
my $seq1_gap = $$matrix[$row-1][$column] + $self->{gap_cost};
my $seq2_gap = $$matrix[$row][$column-1] + $self->{gap_cost};
my $match_mismatch = $$matrix[$row-1][$column-1] + $self->get_match_score(substr($self->{seq1}, $row-1, 1), substr($self->{seq2}, $column-1, 1));
$$matrix[$row][$column] = max($seq1_gap, $seq2_gap, $match_mismatch);
#set the path matrix
#if it was a gap in seq1, -1, if was a (mis)match 0 if was a gap in seq2 1
if ($$matrix[$row][$column] == $seq1_gap) {
$$path_matrix[$row][$column] = -1;
}
elsif ($$matrix[$row][$column] == $match_mismatch) {
$$path_matrix[$row][$column] = 0;
}
elsif ($$matrix[$row][$column] == $seq2_gap) {
$$path_matrix[$row][$column] = 1;
}
}
}
}
You're getting several suggestions regarding syntax, but I would also suggest a more modular approach, if for no other reason than code readability. It's much easier to come up to speed on code if you can perceive the big picture before worrying about low-level details.
Your primary method might look like this.
sub create_matrix {
my $self = shift;
$self->create_2d_array_of_scores;
$self->fill_out_first_row;
$self->fill_out_other_rows;
}
And you would also have several smaller methods like this:
n_of_rows
n_of_cols
create_2d_array_of_scores
fill_out_first_row
fill_out_other_rows
And you might take it even further by defining even smaller methods -- getters, setters, and so forth. At that point, your middle-level methods like create_2d_array_of_scores would not directly touch the underlying data structure at all.
sub matrix { shift->{score_matrix} }
sub gap_cost { shift->{gap_cost} }
sub set_matrix_value {
my ($self, $r, $c, $val) = @_;
$self->matrix->[$r][$c] = $val;
}
# Etc.
One simple change is to use for loops like this:
for my $i (0 .. $num_of_rows - 1) {
# Do stuff.
}
For more info, see the Perl documentation on foreach loops and the range operator.
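As a small illustration of how the two loop styles line up (the variable names here are only for the example): the range operator is inclusive at both ends, so a C-style loop with the condition $i < $n corresponds to the range 0 .. $n - 1.

```perl
use strict;
use warnings;

my $num_of_rows = 3;

# C-style loop: visits indices 0, 1, 2.
my @c_style;
for (my $i = 0; $i < $num_of_rows; $i++) {
    push @c_style, $i;
}

# Range-based loop over the same indices; note the "- 1",
# because the upper bound of a range is included.
my @ranged;
for my $i (0 .. $num_of_rows - 1) {
    push @ranged, $i;
}

print "@c_style\n";
print "@ranged\n";
```

Both loops should print the same three indices.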
I have some other comments as well, but here is the first observation:
my $num_of_rows = length($self->{seq1}) + 1;
my $num_of_columns = length($self->{seq2}) + 1;
So $self->{seq1} and $self->{seq2} are strings and you keep accessing individual elements using substr. I would prefer to store them as arrays of characters:
$self->{seq1} = [ split //, $seq1 ];
Here is how I would have written it:
sub create_matrix {
my $self = shift;
my $matrix = $self->{score_matrix};
my $path_matrix = $self->{path_matrix};
my $rows = @{ $self->{seq1} };
my $cols = @{ $self->{seq2} };
for my $row (0 .. $rows) {
$matrix->[$row]->[0] = $row * $self->{gap_cost};
$path_matrix->[$row]->[0] = 1;
}
my $gap_cost = $self->{gap_cost};
$matrix->[0] = [ map { $_ * $gap_cost } 0 .. $cols ];
$path_matrix->[0] = [ (-1) x ($cols + 1) ];
$path_matrix->[0]->[0] = 2;
for my $row (1 .. $rows) {
for my $col (1 .. $cols) {
my $gap1 = $matrix->[$row - 1]->[$col] + $gap_cost;
my $gap2 = $matrix->[$row]->[$col - 1] + $gap_cost;
my $match_mismatch =
$matrix->[$row - 1]->[$col - 1] +
$self->get_match_score(
$self->{seq1}->[$row - 1],
$self->{seq2}->[$col - 1]
);
my $max = $matrix->[$row]->[$col] =
max($gap1, $gap2, $match_mismatch);
$path_matrix->[$row]->[$col] = $max == $gap1
? -1
: $max == $gap2
? 1
: 0;
}
}
}
Instead of dereferencing your two-dimensional arrays like this:
$$path_matrix[0][0] = 2;
do this:
$path_matrix->[0][0] = 2;
Also, you're doing a lot of if/elsif statements to match against particular values; this could be written more compactly with a given statement (Perl 5.10's equivalent of C's switch, enabled with use feature 'switch' and marked experimental since Perl 5.18). Read about it in perldoc perlsyn:
given ($matrix->[$row][$column])
{
when ($seq1_gap) { $path_matrix->[$row][$column] = -1; }
when ($match_mismatch) { $path_matrix->[$row][$column] = 0; }
when ($seq2_gap) { $path_matrix->[$row][$column] = 1; }
}
The majority of your code is manipulating 2D arrays. I think the biggest improvement would be switching to using PDL if you want to do much stuff with arrays, particularly if efficiency is a concern. It's a Perl module which provides excellent array support. The underlying routines are implemented in C for efficiency so it's fast too.
I would always advise to look at CPAN for previous solutions or examples of how to do things in Perl. Have you looked at Algorithm::NeedlemanWunsch?
The documentation to this module includes an example for matching DNA sequences. Here is an example using the similarity matrix from wikipedia.
#!/usr/bin/perl -w
use strict;
use warnings;
use Inline::Files; #multiple virtual files inside code
use Algorithm::NeedlemanWunsch; # refer CPAN - good style guide
# Read DNA sequences
my @a = read_DNA_seq("DNA_SEQ_A");
my @b = read_DNA_seq("DNA_SEQ_B");
# Read Similarity Matrix (held as a Hash of Hashes)
my %SM = read_Sim_Matrix();
# Define scoring based on "Similarity Matrix" %SM
sub score_sub {
if ( !@_ ) {
return -3; # gap penalty (same as Wikipedia)
}
return $SM{ $_[0] }{ $_[1] }; # Similarity Value matrix
}
my $matcher = Algorithm::NeedlemanWunsch->new( \&score_sub, -3 );
my $score = $matcher->align( \@a, \@b, { align => \&check_align, } );
print "\nThe maximum score is $score\n";
sub check_align {
my ( $i, $j ) = @_; # @a[i], @b[j]
print "seqA pos: $i, seqB pos: $j\t base \'$a[$i]\'\n";
}
sub read_DNA_seq {
my $source = shift;
my @data;
while (<$source>) {
push @data, /[ACGT-]{1}/g;
}
return @data;
}
sub read_Sim_Matrix {
#Read DNA similarity matrix (scores per Wikipedia)
my ( @AoA, %HoH );
while (<SIMILARITY_MATRIX>) {
push @AoA, [/(\S+)+/g];
}
for ( my $row = 1 ; $row < 5 ; $row++ ) {
for ( my $col = 1 ; $col < 5 ; $col++ ) {
$HoH{ $AoA[0][$col] }{ $AoA[$row][0] } = $AoA[$row][$col];
}
}
return %HoH;
}
__DNA_SEQ_A__
A T G T A G T G T A T A G T
A C A T G C A
__DNA_SEQ_B__
A T G T A G T A C A T G C A
__SIMILARITY_MATRIX__
- A G C T
A 10 -1 -3 -4
G -1 7 -5 -3
C -3 -5 9 0
T -4 -3 0 8
And here is some sample output:
seqA pos: 7, seqB pos: 2 base 'G'
seqA pos: 6, seqB pos: 1 base 'T'
seqA pos: 4, seqB pos: 0 base 'A'
The maximum score is 100

Generating Synthetic DNA Sequence with Substitution Rate

Given these inputs:
my $init_seq = "AAAAAAAAAA"; # length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );
I want to generate:
One thousand length-10 tags
Substitution rate for each position in a tag is 0.003
Yielding output like:
AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags
Is there a compact way to do it in Perl?
I am stuck with the logic of this script as core:
#!/usr/bin/perl
my $init_seq = "AAAAAAAAAA"; # length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );
$i = 0;
while ($i < length($init_seq)) {
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
print $base;
}
continue {
$i++;
}
As a small optimisation, replace:
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
with
$base = $dna[int(rand 4)];
EDIT: Assuming substitution rate is in the range 0.001 to 1.000:
As well as $roll, generate another (pseudo)random number in the range [1..1000], if it is less than or equal to (1000 * $sub_rate) then perform the substitution, otherwise do nothing (i.e. output 'A').
Be aware that you may introduce subtle bias unless the properties of your random number generator are known.
Not exactly what you are looking for, but I suggest you take a look at BioPerl's Bio::SeqEvolution::DNAPoint module. It does not take a mutation rate as a parameter, though. Rather, it asks for the lower bound of sequence identity with the original that you want.
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;
my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');
my $evolve = Bio::SeqEvolution::Factory->new (
-rate => 2, # transition/transversion rate
-seq => $seq,
-identity => 50 # At least 50% identity with the original
);
my @mutated;
for (1..1000) { push @mutated, $evolve->next_seq }
All 1000 mutated sequences will be stored in the @mutated array; their sequences can be accessed via the seq method.
In the event of a substitution, you want to exclude the current base from the possibilities:
my @other_bases = grep { $_ ne substr($init_seq, $i, 1) } @dna;
$base = $other_bases[int(rand 3)];
Also please see Mitch Wheat's answer for how to implement the substitution rate.
I don't know if I understand correctly but I'd do something like this (pseudocode):
digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
# check if we have to mutate
mutate = 1+rand(MAX) <= rate*MAX
if mutate then
# find current A:0 T:1 C:2 G:3
current = digits.find(base[i])
# get a new position
# but ensure that it is not current
new = (current+1+rand(3)) mod 4
base[i] = digits[new]
end if
end for
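Putting the suggestions in this thread together, one possible Perl version is sketched below. The helper name mutate is an assumption for illustration, and it compares rand() directly against the rate rather than drawing an integer in 1..1000, which is a simplification of the approach described above rather than anyone's exact method.

```perl
use strict;
use warnings;

my @dna = qw( A C G T );

# Return a copy of $seq in which each position is independently replaced,
# with probability $rate, by one of the three other bases.
# (mutate is a hypothetical helper name used only for this sketch.)
sub mutate {
    my ($seq, $rate) = @_;
    my @bases = split //, $seq;
    for my $base (@bases) {              # $base aliases the array element
        next unless rand() < $rate;      # rand() is in [0, 1)
        my @others = grep { $_ ne $base } @dna;
        $base = $others[ int rand @others ];
    }
    return join '', @bases;
}

my $init_seq = 'AAAAAAAAAA';   # length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;

print mutate($init_seq, $sub_rate), "\n" for 1 .. $nof_tags;
```

At a rate of 0.003 per position, most of the 1000 tags should come out identical to the initial sequence; raising $sub_rate makes the substitutions easier to see.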