I am writing a Perl code, using substr to extract characters one by one but encountered a very strange problem.
I am trying to do the following
Scan character one by one, if it is # go to end of line, if it is ' or " then find the next matching one. Also added HTML color tag to highlight them. Everything else just print.
Here is the block of code
while ($char = (substr $src, $off_set, 1)) {
if ($char eq '#') {
$end_index = index $src, "\n", $off_set+ 1;
my $c = substr($src, $off_set, $end_index-$off_set+1);
print $comment_color.$c.$color_end;
} elsif (($char eq '"') || ($char eq "'")) {
$end_index = index ($src, $char, $off_set+1);
my $char_before = substr $src, $end_index-1, 1;
while ($end_index > 0 && $char_before eq '\\') {
$end_index = index $src, $char, $end_index + 1;
$char_before = substr $src, $end_index-1, 1;
}
my $s = substr($src, $off_set, $end_index-$off_set+1);
print $string_color.$s.$color_end;
} else {
print $char;
$end_index++;
}
$off_set = $end_index + 1;
}
When I use the following testing code, the script will just exit on first 0, if I remove all the 0 then it runs ok. If I remove first 0, it will exit on 2nd. I really have no idea why this happens.
# Comment 1
my $zero = 0;
my #array = (0xdead_beef, 0377, 0b011011);
# xor
sub sample2
{
print "true or false";
return 3 + 4 eq " 7"; # true or false
}
#now write input to STDOUT
print $time . "\n";
my $four = "4";
Check for defined in your while loop:
while (defined(my $char = substr $src, $off_set, 1)) {
The reason why your code was exiting early is because '0' is a false value, and therefore the while would end. Instead, this will check if any value is pulled from the substr call.
This is your loop condition:
while ($char = (substr $src, $off_set, 1)) {
...
So what happens when $char = "0"? As Perl considers that to be a false value, the loop will terminate. Instead, loop as long as characters are left:
while ($off_set < length $src) {
my $char = substr $src, $off_set, 1;
...
Anyway, your code is convoluted and hard to read. Consider using regular expressions instead:
use re '/xsm';
my $src = ...;
pos($src) = 0;
my $out = '';
while (pos($src) < length $src) {
if ($src =~ m/\G ([#][^\n]*)/gc) {
$out .= colored(comment => $1);
}
elsif ($src =~ m/\G (["] (?:[^"\\]++|[\\].)* ["])/gc) {
$out .= colored(string => $1);
}
elsif ($src =~ m/\G (['] (?:[^'\\]++|[\\]['\\])* ['])/gc) {
$out .= colored(string => $1);
}
elsif ($src =~ m/\G ([^"'#]+)/gc) {
$out .= $1;
}
else {
die "illegal state";
}
}
where colored is some helper function.
Finally, figured out it is the while loop. It exit the loop when it sees a 0.
Updated the while loop condition to
while (($char = (substr $src, $off_set, 1)) || ($off_set < (length $src))) {
and it is working now.
Related
Got a very technical coding problem with perl. Basically I am search for some specific DNA sequences.
my $seq = '...';
my #dna = split //, $seq;
my $amount = scalar #dna;
for my $index (0 .. $amount - 1){
if ($dna[$index] eq 'A' and $dna[$index+1] eq 'T' and
$dna[$index+2] eq 'G' and $dna[$index+3] eq 'C'
) {
do {
print $dna[$index++];
} until ($dna[$index] eq 'C' and $dna[$index+1] eq 'C' and $dna[$index+2] eq'G')
}
}
The problem is,is that the "until" portion is in an infinite loop. I am getting all my sequences, but for some reason the until loop wont stop looping. I have tried
until ($dna[$index] eq 'C' and $dna[$index+1] eq 'C' and $dna[$index+2] eq 'G' ){last;}
But that breaks out of the "for" loop.
Is there a way to break out of do-until loop without breaking the for loop?
Case 1
my $seq = 'xxxxxATGCyyyyyCCGzzzzz'; # Ouputs ATGCyyyyy
When every ATGC is followed by a CCG, your code works.
Case 2
my $seq = 'xxxxxATGCyyyyy'; # Infinite loop
When there's no CCG following a ATGC, you keep incrementing $index past the end of #dna, causing an infinite loop.
Case 3
my $seq = 'wwwwwATGCxxxxxATGCyyyyyCCGzzzzz'; # Ouputs ATGCxxxxxATGCyyyyy and ATGCyyyyy
You start looking for another match where the previous match started rather than where it ended, so you can end up with overlapping matches. This is partly due to a bad assumption the loop will continue from the value to which you changed $index.
Solution
my $seq = '...';
my #seq = split //, $seq;
my $seq_len = #seq;
for (my $i = 0; $i < $seq_len-7; ++$i) {
if ( $seq[$i+0] eq 'A'
&& $seq[$i+1] eq 'T'
&& $seq[$i+2] eq 'G'
&& $seq[$i+3] eq 'C'
) {
my $start = $i;
$i += 4;
for (; $i < $seq_len-3; ++$i) {
if ( $seq[$i+0] eq 'C'
&& $seq[$i+1] eq 'C'
&& $seq[$i+2] eq 'G'
) {
my $end = $i;
print(join('', #seq[$start .. $end-1]), "\n");
last;
}
}
}
}
substr simplifies things.
my $seq = '...';
my $seq_len = length($seq);
for (my $i = 0; $i < $seq_len-7; ++$i) {
if (substr($seq, $i, 4) eq 'ATGC') {
my $start = $i;
$i += 4;
for (; $i < $seq_len-3; ++$i) {
if (substr($seq, $i, 3) eq 'CCG') {
my $end = $i;
print(substr($seq, $start, $end-$start), "\n");
last;
}
}
}
}
But regular expressions simply things much further.
my $seq = '...';
while ($seq =~ / ( ATGC .*? ) (?= CCG ) /xsg) {
print("$1\n");
}
If you wanted to output the remainder of the string when there is no CCG, you can use the following:
my $seq = '...';
while ($seq =~ / ( ATGC (?:(?! CCG ).)* ) /xsg) {
print("$1\n");
}
As mentioned in perlsyn
For "last", you have to be more elaborate:
LOOP: {
do {
last if $x = $y**2;
# do something here
} while $x++ <= $z;
}
If the sequence doesn't contain CCG somewhere later than ATGC, the loop won't terminate. Add or $index == $#dna to the condition.
Although I definitely endorse #ikegami's regular expression-based approach, let's try to salvage what you have. Some specific issues:
Use the C-style for/foreach loop so that adjustments to $index carry over to the outer loop
You can't use last or next directly in a do {} until () or do {} while () loop -- these aren't while () or until () loops, they are do constructs and these keywords don't apply. See some of the other answers with respect to wrapping the do construct in a one-shot loop to make these keywords viable.
You need to deal with the possiblity of running out of data before finding the things you're looking for.
Here's my rework of your code to get it to run and clear #ikegami's test hurdles:
use constant { START => 'ATGC', STOP => 'CCG' };
use constant { START_LENGTH => length(START), STOP_LENGTH => length(STOP) };
my $sequence = 'wwwwwATGCxxxxxATGCyyyyyCCGzzzzz';
my #dna = split //, $sequence;
for (my $index = 0; $index < #dna - START_LENGTH; $index++) {
if (join('', #dna[$index .. $index + START_LENGTH - 1]) eq START) {
do {
print $dna[$index++];
} until ($index > $#dna or ($index < #dna - STOP_LENGTH and join('', #dna[$index .. $index + STOP_LENGTH - 1]) eq STOP));
print("\n");
}
}
(Yes, the $#dna messes with the SO formatter as it thinks it's introducing a comment. I'm sure someone knows how to fix this and will edit this answer accordingly.)
Consider:
string1 = "AAABBBBBCCCCCDDDDD"
string2 = "AEABBBBBCCECCDDDDD"
output. Where the mismatch (in this case E) will be replaced with HTML tags around E that color it.
A**E**ABBBBBCC**E**CCDDDDD
What I tried so far: XOR, diff and substr. First I need to find the indices then replace those indices with the pattern.
my #x = split '', "AAABBBBBCCCCCDDDDD";
my #y = split '', "AEABBBBBCCECCDDDDD";
my $result = join '',
map { $x[$_] eq $y[$_] ? $y[$_] : "**$y[$_]**" }
0 .. $#y;
Use:
use strict;
use warnings;
my $string1 = 'AAABBBBBCCCCCDDDDD';
my $string2 = 'AEABBBBBCCECCDDDDD';
my $result = '';
for(0 .. length($string1)) {
my $char = substr($string2, $_, 1);
if($char ne substr($string1, $_, 1)) {
$result .= "**$char**";
} else {
$result .= $char;
}
}
print $result;
It prints A**E**ABBBBBCC**E**CCDDDDD and was tested somewhat, but it may contain errors.
use warnings;
use strict;
my ($s1, $s2, $o1, $o2) = ("AAABBBBBCCCCCDDDDD", "AEABBBBBCCECCDDDDD");
my #s1 = split(//, $s1);
my #s2 = split(//, $s2);
my $eq_state = 1;
while (#s1 and #s2) {
if (($s1[0] eq $s2[0]) != $eq_state) {
$o1 .= (!$eq_state) ? "</b>" : "<b>";
$o2 .= (!$eq_state) ? "</b>" : "<b>";
}
$eq_state = $s1[0] eq $s2[0];
$o1.=shift #s1;
$o2.=shift #s2;
}
print "$o1\n$o2\n";
Output
A<b>A</b>ABBBBBCC<b>C</b>CCDDDDD
A<b>E</b>ABBBBBCC<b>E</b>CCDDDDD
A simpler one that only prints out the second string:
use warnings;
use strict;
my ($s1, $s2, $was_eq) = ("AAABBBBBCCCCCDDDDD", "AEABBBBBCCECCDDDDD", 1);
my #s1 = split(//, $s1);
my #s2 = split(//, $s2);
for my $idx (0 .. #s2 -1) {
my $is_eq = $s1[$idx] eq $s2[$idx];
print $is_eq ? "</b>" : "<b>" if ( $was_eq != $is_eq);
$was_eq = $is_eq;
print $s2[$idx];
}
Outout
</b>A<b>E</b>ABBBBBCC<b>E</b>CCDDDDD
This might be memory intensive, for large strings:
use strict;
use warnings;
my $a = "aabbcc";
my $b = "aabdcc";
my #a = split //, $a;
my #b = split //, $b;
my $new_b = '';
for(my $i = 0; $i < scalar(#a); $i++) {
$new_b .= $a[$i] eq $b[$i] ? $b[$i] : "**$b[$i]**";
}
Output
$ test.pl
new_b: aab**d**cc
There are several ways to accomplish this. Below is a possible way to solve this.
my $str1="ABCDEA";
my $str2="AECDEB";
my #old1=split("",$str1);
my #old2=split("",$str2);
my #new;
for my $i (0..$#old1) {
if ($old1[$i] eq $old2[$i] ) {
push (#new, $old2[$i]);
}
else
{
my $elem = "**".$old2[$i]."**";
push (#new , $elem);
}
}
print #new;
The output is:
A**E**CDE**B**
Aligning columns and using the bitwise string operator, "^":
my $a = "aabbccP";
my $b = "aabdccEE";
$_ = $a ^ $b;
s/./ord $& ? "^" : " "/ge;
print "$_\n" for $a, $b, $_;
gives:
aabbccP
aabdccEE
^ ^^
I'm using Perl to run through a tree, and then calculate the leaf nodes of the tree using the internal nodes as operators. I want to be able to print this in a postfix manner, and I managed to this this fairly easily with the basic operands (simply call the left and right nodes respectively before calling the parent) but I am having trouble producing the desired output for an average function. I don't have any trouble printing the actual result of the calculation, but I want to be able to print the operators and operands in postfix notation.
For example, 1 + average(3, 4, 5) will be shown as 1 ; 3 4 5 average +.
Here is my code:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 0;
$Data::Dumper::Terse = 1;
my $debug = 0;
# an arithmetic expression tree is a reference to a list, which can
# be of two kinds as follows:
# [ 'leaf', value ]
# [ 'internal', operation, leftarg, rightarg ]
# Evaluate($ex) takes an arithmetic expression tree and returns its
# evaluated value.
sub Evaluate {
my ($ex) = #_;
$debug and
print "evaluating: ", Dumper($ex), "\n";
# the kind of node is given in the first element of the array
my $node_type = $ex->[0];
# if a leaf, then the value is a number
if ( $node_type eq 'leaf' ) {
# base case
my $value = $ex->[1];
$debug and
print "returning leaf: $value\n";
return $value;
}
# if not a leaf, then is internal,
if ( $node_type ne 'internal' ) {
die "Eval: Strange node type '$node_type' when evaluating tree";
}
# should now have an operation and two arguments
my $operation = $ex->[1];
my $left_ex = $ex->[2];
my $right_ex = $ex->[3];
# evaluate the left and right arguments;
my $left_value = Evaluate($left_ex);
my $right_value = Evaluate($right_ex);
# if any arguments are undefined, our value is undefined.
return undef unless
defined($left_value) and defined($right_value);
my $result;
# or do it explicitly for the required operators ...
if ($operation eq 'average') {
$result = ($left_value + $right_value) / 2;
}
if ($operation eq '+') {
$result = $left_value + $right_value;
} elsif ($operation eq '-') {
$result = $left_value - $right_value;
} elsif ($operation eq '*') {
$result = $left_value * $right_value;
} elsif ($operation eq 'div') {
if ($right_value != 0 ) {
$result = int ($left_value / $right_value);
} else {
$result = undef;
}
} elsif ($operation eq 'mod') {
$result = $left_value % $right_value;
} elsif ($operation eq '/') {
if ( $right_value != 0 ) {
$result = $left_value / $right_value;
}
else {
$result = undef;
}
}
$debug and
print "returning '$operation' on $left_value and $right_value result: $result\n";
return $result;
}
# Display($ex, $style) takes an arithmetic expression tree and a style
# parameter ('infix' or 'postfix') and returns a string that represents
# printable form of the expression in the given style.
sub Display {
my ($ex, $style) = #_;
# the kind of node is given in the first element of the array
my $node_type = $ex->[0];
# if a leaf, then the value is a number
if ( $node_type eq 'leaf' ) {
# base case
my $value = $ex->[1];
return $value;
}
# if not a leaf, then is internal,
if ( $node_type ne 'internal' ) {
die "Display: Strange node type '$node_type' when evaluating tree";
}
# should now have an operation and two arguments
my $operation = $ex->[1];
my $left_ex = $ex->[2];
my $right_ex = $ex->[3];
# evaluate the left and right arguments;
my $left_value = Display($left_ex, $style);
my $right_value = Display($right_ex, $style);
my $result;
if ($operation ne 'average') {
$result = "($left_value $operation $right_value) \n $left_value $right_value $operation";
} else {
$result = "($left_value $operation $right_value) \n $left_value $right_value $operation";
}
return $result;
}
# module end;
1;
And here is a test:
use strict;
use warnings;
use Display;
use arith;
my $ex1 = [ 'leaf', 42];
my $ex2 = [ 'internal', '+', [ 'leaf', 42], [ 'leaf', 10 ] ];
my $ex3 = [ 'internal', 'average', $ex2, [ 'leaf', 1 ] ];
print "ex1 is ", Evaluate($ex1), "\n";
print "ex1: ", Display($ex1), "\n";
print "\n";
print "ex2 is ", Evaluate($ex2), "\n";
print "ex2: ", Display($ex2), "\n";
print "\n";
print "ex3 is ", Evaluate($ex3), "\n";
print "ex3: ", Display($ex3), "\n";
print "\n";
Display::Render(\$ex3);
In order to do this, I realize I will have to change the subroutine "Display", but I'm not sure how to get the output --> value value ; #to indicate values that aren't averaged# value value average operand etc.
Any ideas?
I am not 100% sure that I understand your problem, but here is a cleanup / improvement of your two functions:
my %ops = ( # dispatch table for operations
average => sub {my $acc; $acc += $_ for #_; $acc / #_},
'+' => sub {$_[0] + $_[1]},
'-' => sub {$_[0] - $_[1]},
'*' => sub {$_[0] * $_[1]},
'mod' => sub {$_[0] % $_[1]},
(map {$_ => sub {$_[1] ? $_[0] / $_[1] : undef}} qw (/ div)),
);
sub Evaluate {
my $ex = shift;
print "evaluating: ", Dumper($ex), "\n" if $debug;
my $node_type = $ex->[0];
if ( $node_type eq 'leaf' ) {
print "returning leaf: $$ex[1]\n" if $debug;
return $$ex[1];
}
elsif ( $node_type ne 'internal' ) {
die "Eval: Strange node type '$node_type' when evaluating tree";
}
my $operation = $ex->[1];
my #values = map {Evaluate($_)} #$ex[2 .. $#$ex];
defined or return for #values;
if (my $op = $ops{$operation}) {
return $op->(#values);
} else {
print "operation $operation not found\n";
return undef;
}
}
Here the large if/elsif block is replaced with a dispatch table. This allows you to separate the logic from the parser. I have also replaced the $left_value and $right_value variables with the #values array, allowing your code to scale to n-arity operations (like average).
The following Display function has also been updated to handle n-arity operations:
my %is_infix = map {$_ => 1} qw( * + / - );
sub Display {
my ($ex, $style) = #_;
my $node_type = $ex->[0];
# if a leaf, then the value is a number
if ( $node_type eq 'leaf' ) {
return $$ex[1];
}
# if not a leaf, then is internal,
if ( $node_type ne 'internal' ) {
die "Display: Strange node type '$node_type' when evaluating tree";
}
# should now have an operation and n arguments
my $operation = $ex->[1];
if ($style and $style eq 'infix') {
my #values = map {Display($_, $style)} #$ex[2 .. $#$ex];
if ($is_infix{$operation}) {
return "$values[0] $operation $values[1]"
} else {
local $" = ', '; # "
return "$operation( #values )"
}
} else { # postfix by default
my #out;
for (#$ex[2 .. $#$ex]) {
if (#out and $_->[0] eq 'internal') {
push #out, ';'
}
push #out, Display($_, $style)
}
return join ' ' => #out, $operation;
}
}
You can call Display as Display($tree) or Display($tree, 'postfix') for postfix notation. And Display($tree, 'infix') for the infix notation.
ex1 is 42
ex1: 42
ex1: 42
ex2 is 52
ex2: 42 10 +
ex2: 42 + 10
ex3 is 26.5
ex3: 42 10 + 1 average
ex3: average( 42 + 10, 1 )
Which I believe is what you are looking for.
Finally, using your first example 1 + average(3, 4, 5):
my $avg = ['internal', 'average', [leaf => 3], [leaf => 4], [leaf => 5] ];
my $ex4 = ['internal', '+', [leaf => 1], $avg ];
print "ex4 is ", Evaluate($ex4), "\n";
print "ex4: ", Display($ex4), "\n";
print "ex4: ", Display($ex4, 'infix'), "\n";
print "\n";
which prints:
ex4 is 5
ex4: 1 ; 3 4 5 average +
ex4: 1 + average( 3, 4, 5 )
Maybe try AlgebraicToRPN?
I plan to skip the block content which include the start line of "MaterializeU4()" with the subroutin() read_block below. But failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I print the value of index($line, "MaterializeU4"), it output 25.
Then I updated the code as below
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25)
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push #lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
First, using a regex instead of index is probably better since you can tune it to the exact format of status string if you may decide to be stricter than just "substring exists"
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (Materialize4U) block!!!
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status.*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my #all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (#all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift #all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line =~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.
The following is the script for finding consecutive substrings in strings.
use strict;
use warnings;
my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");
#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";
my $sequence;
while (my $line = <DAT>) {
if ($line=~ /(HDWFLSFKD)/g){
{
print "its found index location: ",
pos($line), "-", pos($line)+length($1), "\n";
}
if ($line=~ /(HD)/g){
print "motif found and its locations is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
if ($line=~ /(K)/g){
print "motif found and its location is: \n";
pos($line), "-",pos($line)+length($1), "\n\n";
}
if ($line=~ /(DD)/g){
print "motif found and its location is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
}else {
$sequence .= $line;
print "came in else\n";
}
}
It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.
Try this perl program
use strict;
use warnings;
use feature qw'say';
my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");
my #regex = qw(
HDWFLSFKD
HD
K
DD
);
my $sequence;
while( my $line = <$dat> ){
chomp $line;
say 'Line: ', $.;
# reset the position of variable $line
# pos is an lvalue subroutine
pos $line = 0;
for my $regex ( #regex ){
$regex = quotemeta $regex;
if( scalar $line =~ / \G (.*?) ($regex) /xg ){
say $regex, ' found at location (', $-[2], '-', $+[2], ')';
if( $1 ){
say " but skipped: \"$1\" at location ($-[1]-$+[1])";
}
}else{
say 'Unable to find ', $regex;
# end loop
last;
}
}
}
I'm not a perl expert but you can use $- and $+ to track index location for last regex match found.
Below is code built on top of your code that explains this.
use strict;
use warnings;
my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");
open (OUTPUTFILE, '>data.txt');
my $sequence;
my $someVar = 0;
my $sequenceNums = 1;
my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";
while (my $line = <DAT>)
{
$someVar = 0;
print "\nSequence $sequenceNums: $line\n";
print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
if ($line=~ /$motif1/g)
{
&printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
$someVar = 1;
}
if ($line=~ /$motif2/g and $someVar == 1)
{
&printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
$someVar = 2;
}
if ($line=~ /$motif3/g and $someVar == 2)
{
&printStuff($sequenceNums, "motif3", $motif4, "$-[0]-$+[0]");
$someVar = 3;
}
if ($line=~ /$motif4/g and $someVar == 3)
{
&printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
}
else
{
$sequence .= $line;
if ($someVar == 0)
{
&printWrongStuff($sequenceNums, "motif1", $motif1);
}
elsif ($someVar == 1)
{
&printWrongStuff($sequenceNums, "motif2", $motif2);
}
elsif ($someVar == 2)
{
&printWrongStuff($sequenceNums, "motif3", $motif3);
}
elsif ($someVar == 3)
{
&printWrongStuff($sequenceNums, "motif4", $motif4);
}
}
$sequenceNums++;
}
sub printStuff
{
print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] index location: $_[3]\n";
}
sub printWrongStuff
{
print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";
}
close (OUTPUTFILE);
close (DAT);
Sample input:
MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL
You really should read
perldoc perlre
perldoc perlreref
perldoc perlretut
You need the special variables #- and #+ if you need the positions. No need to try to compute them yourself.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw( each_array );
my $source = 'AAAA BBCCC DD E FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );
if ( $source =~ /$pattern/ ) {
my $it = each_array #-, #+;
$it->(); # discard overall match information;
while ( my ($start, $end) = $it->() ) {
printf "Start: %d - Length: %d\n", $start, $end - $start;
}
}
Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
The result of a construct like
$line=~ /(HD)/g
is a list. Use while to step through the hits.
To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):
The "\G" assertion can be used to
chain global matches (using "m//g"),
as described in "Regexp Quote-Like
Operators" in perlop. It is also
useful when writing "lex"-like
scanners, when you have several
patterns that you want to match
against consequent substrings of your
string, see the previous reference.
The actual location where "\G" will
match can also be influenced by using
"pos()" as an lvalue: see "pos" in
perlfunc. Note that the rule for
zero-length matches is modified
somewhat, in that contents to the left
of "\G" is not counted when
determining the length of the match.
Thus the following will not match
forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
print $&;
}