Obtaining overlapping matches using a regex

Obtaining overlapping matches using a regex - perl

See code first plz~
This is perl code.
my $st = 'aaaa';
while ( $st =~ /aa/g ) {
print $&, "\n";
}
I want to move one point of the string.
So I want the results of the three aa.
However, only two results are obtained.
I can derive three results do?

my $st = 'aaaa';
my $find = 'aa';
while($st =~ /$find/g){
print $&,"\n";
pos($st) -= (length($find)-1);
}
From perldoc pos
Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified)
Also pos() is lvalue subroutine and result from it can be changed like for variable.

Use a look ahead. It doesn't advance the position:
my $st = 'abcd';
while ($st =~ /(?=(..))/g) {
print "$1\n";
}
I used a different string to make the matching positions visible.

Your problem is that regular expressions do not normally allow overlapping matches.
You can explore this fact by outputting the Positional Information for your two current matches:
my $st = 'aaaa';
while ( $st =~ /aa/g ) {
print "pos $-[0] - $&\n";
}
Outputs:
pos 0 - aa
pos 2 - aa
To fix this, you simply need to use a Positive Lookahead Assertion and an explicit capture group:
while ( $st =~ /(?=(aa))/g ) {
print "pos $-[0] - $1\n";
}
Outputs:
pos 0 - aa
pos 1 - aa
pos 2 - aa

The following will do the trick:
while ($st =~ /(?=(aa))/g) {
print "$1\n";
}

Related

Passing strings as array to subroutine and return count of specific char

I was trying to think in the right way to tackle this:
-I would to pass say, n elements array as argument to a subroutine. And for each element match two char types S and T and print for each element, the count of these letters. So far I did this but I am locked and found some infinite loops in my code.
use strict;
use warnings;
sub main {
my #array = #_;
while (#array) {
my $s = ($_ = tr/S//);
my $t = ($_ = tr/T//);
print "ST are in total $s + $t\n";
}
}
my #bunchOfdata = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
main(#bunchOfdata);
I would like the output to be:
Element 1 Counts of ST = 5
Element 2 Counts of ST = 6
Element 3 Counts of ST = 4
Any clue how to solve this?

while (#array) will be an infinite loop since #array never gets smaller. You can't read into the default variable $_ this way. For this to work, use for (#array) which will read the array items into $_ one at a time until all have been read.
The tr transliteration operator is the right tool for your task.
The code needed to get your results could be:
#!/usr/bin/perl
use strict;
use warnings;
my #data = ("QQQRRRRSCCTTTS", "ZZZSTTKQSST", "ZBQLDKSSSS");
my $i = 1;
for (#data) {
my $count = tr/ST//;
print "Element $i Counts of ST = $count\n";
$i++;
}
Also, note that my $count = tr/ST//; doesn't require the binding of the transliteration operator with $_. Perl assumes this when $_ holds the value to be counted here. Your code tried my $s = ($_ = tr/S//); which will give the results but the shorter way I've shown is the preferred way.
(Just noticed you had = instead of =~ in your statement. That is an error. Has to be $s = ($_ =~ tr/S//);)
You can combine the 2 sought letters as in my code. Its not necessary to do them separately.
I got the output you want.
Element 1 Counts of ST = 5
Element 2 Counts of ST = 6
Element 3 Counts of ST = 4
Also, you can't perform math operations in a quoted string like you had.
print "ST are in total $s + $t\n";
Instead, you would need to do:
print "ST are in total ", $s + $t, "\n";
where the operation is performed outside of the string.

Don't use while to traverse an array - your array gets no smaller, so the condition is always true and you get an infinite loop. You should use for (or foreach) instead.
for (#array) {
my $s = tr/S//; # No need for =~ as tr/// works on $_ by default
my $t = tr/T//;
print "ST are in total $s + $t\n";
}

Why tr///??
sub main {
my #array = #_;
while (#array) {
my $s = split(/S/, $_, -1) - 1;
my $t = split(/T/, $_, -1) - 1;
print "ST are in total $s + $t\n";
}
}

How to rewind next-search start position by 1?

How can I rewind the start of the next search position by 1? For example, suppose I want to match all digits between #. The following will give me only odd numbers.
my $data="#1#2#3#4#";
while ( $data =~ /#(\d)#/g ) {
print $1, "\n";
}
But if I could rewind the start of the next position by 1, I would get both even and odd numbers.
This doesn't work: pos() = pos() - 1;
I know I can accomplish this using split. But this doesn't answer my question.
for (split /#/, $data) {
print $_, "\n";
}

One approach is to use a look-ahead assertion:
while ( $data =~ /#(\d)(?=#)/g ) {
print $1, "\n";
}
The characters in the look-ahead assertion are not part of the matched expression and do not update pos() past the \d part of the regular expression.
More demos:
say "#1#2#3#4#" =~ /#(\d)/g; # 1234
say "#1#2#3#4" =~ /#(\d)/g; # 1234
say "#1#2#3#4#" =~ /#(\d)(?=#)/g; # 1234
say "#1#2#3#4" =~ /#(\d)(?=#)/g; # 123

You're calling pos() on $_, instead of $data
From perldoc
Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified)
So,
pos($data) = pos($data) - 1;

Efficient way to read columns in a file using Perl

I have an input file like so, separated by newline characters.
AAA
BBB
BBA
What would be the most efficient way to count the columns (vertically), first with first, second with second etc etc.
Sample OUTPUT:
ABB
ABB
ABA
I have been using the following, but am unable to figure out how to remove the scalar context from it. Any hints are appreciated:
while (<#seq_prot>){
chomp;
my #sequence = map substr (#seq_prot, 1, 1), $start .. $end;
#sequence = split;
}
My idea was to use the substring to get the first letter of the input (A in this case), and it would cycle for all the other letters (The second A and B). Then I would increment the cycle number + 1 so as to get the next line, until I reached the end. Of course I can't seem to get the first part going, so any help is greatly appreciated, am stumped on this one.

Basically, you're trying to transpose an array.
This can be done easily using Array::Transpose
use warnings;
use strict;
use Array::Transpose;
die "Usage: $0 filename\n" if #ARGV != 1;
for (transpose([map {chomp; [split //]} <>])) {
print join("", map {$_ // " "} #$_), "\n"
}
For an input file:
ABCDEFGHIJKLMNOPQRS
12345678901234
abcdefghijklmnopq
ZYX
Will output:
A1aZ
B2bY
C3cX
D4d
E5e
F6f
G7g
H8h
I9i
J0j
K1k
L2l
M3m
N4n
O o
P p
Q q
R
S

You'll have to read in the file once for each column, or store the information and go through the data structure later.
I was originally thinking in terms of arrays of arrays, but I don't want to get into References.
I'm going to make the assumption that each line is the same length. Makes it simpler that way. We can use split to split your line into individual letters:
my = $line = "ABC"
my #split_line = split //, $line;
This will give us:
$split_line[0] = "A";
$split_line[1] = "B";
$split_line[2] = "C";
What if we now took each letter, and placed it into a #vertical_array.
my #vertical_array;
for my $index ( 0..##split_line ) {
$vertical_array[$index] .= "$split_line[$index];
}
Now let's do this with the next line:
$line = "123";
#split_line = split //, $line;
for my $index ( 0..##split_line ) {
$vertical_array[$index] .= "$split_line[$index];
}
This will give us:
$vertical_array[0] = "A1";
$vertical_array[1] = "B2";
$vertical_array[2] = "C3";
As you can see, I'm building the $vertical_array with each interation:
use strict;
use warnings;
use autodie;
use feature qw(say);
my #vertical_array;
while ( my $line = <DATA> ) {
chomp $line;
my #split_line = split //, $line;
for my $index ( 0..$#split_line ) {
$vertical_array[$index] .= $split_line[$index];
}
}
#
# Print out your vertical lines
#
for my $line ( #vertical_array ) {
say $line;
}
__DATA__
ABC
123
XYZ
BOY
FOO
BAR
This prints out:
A1XBFB
B2YOOA
C3ZYOR
If I had used references, I could probably have built an array of arrays and then flipped it. That's probably more efficient, but more complex. However, that may be better at handling lines of different lengths.

how to replace a div[ ] element with another div[ ] element?

for ($i=0; $i<10; $i++)
{
my $v1 = $sel->get_text("//body[\#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td/div/div");
my $v2 = $sel->get_text("//body[#\id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td[2]/div/div")
print ($v1 . $v2);
}
For every iteration, it has to find the 14th element starting from div[10] & replace it with the increased div[ ] element (Ex: if 14th element is div, replace it by div[2]. In the next iterartion find 14th element i.e., div[2] & replace it by div[3] & so on ).
By using PATTERN matching, it can't. Is there any method by using regex for finding that particular element & replacing it ? how can i do it ?
my $a = "//body[\#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td/div/div";
my #arr = split ('/' , $a);
print "#arr \n";
my $size1 = #arr;
print "$size1\n";
print $arr[16];
foreach my $a2 (#arr)
{
print "$a2 \n";
}
my $b = "//body[\#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/**div**/table/tbody/tr/td[2]/div/div";
Two variables as mentioned in the above question as v1 & v2 (edited as $a and $b), the modification has to apply for both of them. I think i'm almost near to what you've told. Can yoy please help me further

use 5.010;
my $xpath = q(//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div);
for my $i (0..10) {
my #nodes = split qr'/', $xpath;
$nodes[16] .= "[$i]" unless 0 == $i;
say join '/', #nodes;
}
Results:
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[1]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[2]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[3]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[4]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[5]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[6]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[7]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[8]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[9]/table/tbody/tr/td/div/div
//body[#id='ext-gen3']/div[10]/div[2]/div/div/div/div/div/div/div/div/div/div[2]/div/div[10]/table/tbody/tr/td/div/div

Ummm, all elements are separated by /, right? So you can use the native split method to split the portion of the text following div[10] based on /. Store it in an array $arr. Merge it to find the length of the string, say $len. Find the index of the div[10], say $orig_index. Then you find the 14th element, do a regex match to see which format it is in:
$arr[13] =~ /div([\d+])?/;
if ($1) {
$arr[13] =~ /div[$1]/div[($1+1)]/e;
}
else {
$arr[13] = div[2];
}
Now that the 14th element is changed, concatenate the array to get the new output string for the portion from the portion between div[10] and the 14th one:
{
local $" = '';
$newstring = "#arr";
}
splice($originalstring,$orig_index,$len,$newstring);
I think that will do.

How can I extract start and end codon from DNA sequences in Perl?

I have a code below that try to identify the position of start and end codon of the given DNA sequences.
We define start codon as a ATG sequence and end codon as TGA,TAA,TAG sequences.
The problem I have is that the code below works only for first two sequences (DM208659 and AF038953) but not the rest.
What's wrong with my approach below?
This code can be copy-pasted from here.
#!/usr/bin/perl -w
while (<DATA>) {
chomp;
print "$_\n";
my ($id,$rna_sq) = split(/\s+/,$_);
local $_ = $rna_sq;
while (/atg/g) {
my $start = pos() - 2;
if (/tga|taa|tag/g) {
my $stop = pos();
my $gene = substr( $_, $start - 1, $stop - $start + 1 ),$/;
my $genelen = length($gene);
my $ct = "$id $start $stop $gene $genelen";
print "\t$ct\n";
}
}
}
__DATA__
DM208659 gtgggcctcaaatgtggagcactattctgatgtccaagtggaaagtgctgcgacatttgagcgtcac
AF038953 gatcccagacctcggcttgcagtagtgttagactgaagataaagtaagtgctgtttgggctaacaggatctcctcttgcagtctgcagcccaggacgctgattccagcagcgccttaccgcgcagcccgaagattcactatggtgaaaatcgccttcaatacccctaccgccgtgcaaaaggaggaggcgcggcaagacgtggaggccctcctgagccgcacggtcagaactcagatactgaccggcaaggagctccgagttgccacccaggaaaaagagggctcctctgggagatgtatgcttactctcttaggcctttcattcatcttggcaggacttattgttggtggagcctgcatttacaagtacttcatgcccaagagcaccatttaccgtggagagatgtgcttttttgattctgaggatcctgcaaattcccttcgtggaggagagcctaacttcctgcctgtgactgaggaggctgacattcgtgaggatgacaacattgcaatcattgatgtgcctgtccccagtttctctgatagtgaccctgcagcaattattcatgactttgaaaagggaatgactgcttacctggacttgttgctggggaactgctatctgatgcccctcaatacttctattgttatgcctccaaaaaatctggtagagctctttggcaaactggcgagtggcagatatctgcctcaaacttatgtggttcgagaagacctagttgctgtggaggaaattcgtgatgttagtaaccttggcatctttatttaccaactttgcaataacagaaagtccttccgccttcgtcgcagagacctcttgctgggtttcaacaaacgtgccattgataaatgctggaagattagacacttccccaacgaatttattgttgagaccaagatctgtcaagagtaagaggcaacagatagagtgtccttggtaataagaagtcagagatttacaatatgactttaacattaaggtttatgggatactcaagatatttactcatgcatttactctattgcttatgccgtaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
BC021011 ggggagtccggggcggcgcctggaggcggagccgcccgctgggctaaatggggcagaggccgggaggggtgggggttccccgcgccgcagccatggagcagcttcgcgccgccgcccgtctgcagattgttctg
DM208660 gggatactcaaaatgggggcgctttcctttttgtctgtactgggaagtgcttcgattttggggtgtccc
AF038954 ggacccaagggggccttcgaggtgccttaggccgcttgccttgctctcagaatcgctgccgccatggctagtcagtctcaggggattcagcagctgctgcaggccgagaagcgggcagccgagaaggtgtccgaggcccgcaaaagaaagaaccggaggctgaagcaggccaaagaagaagctcaggctgaaattgaacagtaccgcctgcagagggagaaagaattcaaggccaaggaagctgcggcattgggatcccgtggcagttgcagcactgaagtggagaaggagacccaggagaagatgaccatcctccagacatacttccggcagaacagggatgaagtcttggacaacctcttggcttttgtctgtgacattcggccagaaatccatgaaaactaccgcataaatggatagaagagagaagcacctgtgctgtggagtggcattttagatgccctcacgaatatggaagcttagcacagctctagttacattcttaggagatggccattaaattatttccatatattataagagaggtccttccactttttggagagtagccaatctagctttttggtaacagacttagaaattagcaaagatgtccagctttttaccacagattcctgagggattttagatgggtaaatagagtcagactttgaccaggttttgggcaaagcacatgtatatcagtgtggacttttcctttcttagatctagtttaaaaaaaaaaaccccttaccattctttgaagaaaggaggggattaaataattttttcccctaacactttcttgaaggtcaggggctttatctatgaaaagttagtaaatagttctttgtaacctgtgtgaagcagcagccagccttaaagtagtccattcttgctaatggttagaacagtgaatactagtggaattgtttgggctgcttttagtttctcttaatcaaaattactagatgatagaattcaagaacttgttacatgtattacttggtgtatcgataatcatttaaaagtaaagactctgtcatgcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

I removed the use of $_ (I especially shuddered when you localized it -- you did so correctly, but why force yourself to worry if some other function is going to clobber $_, rather than use $rna_sq which is already available?
Additionally I corrected $start and $stop to be 0-based indexes into the string (which made the rest of the math more straight-forward), and calculated $genelen early so it could be used directly in the substr operation. (Alternatively, you could localize $[ to 1 to use 1-based array indexes, see perldoc perlvar.)
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
print "processing $line\n";
my ($id, $rna_sq) = split(/\s+/, $line);
while ($rna_sq =~ /atg/g) {
# $start and $stop are 0-based indexes
my $start = pos($rna_sq) - 3; # back up to include the start sequence
# discard remnant if no stop sequence can be found
last unless $rna_sq =~ /tga|taa|tag/g;
my $stop = pos($rna_sq);
my $genelen = $stop - $start;
my $gene = substr($rna_sq, $start, $genelen);
print "\t" . join(' ', $id, $start+1, $stop, $gene, $genelen) . "\n";
}
}

It's never breaking out of your inner loop when the if (/tga|taa|tag/g) fails to find an end codon. It keeps matching /atg/g repeatedly, never advancing any further. You could forcibly eject it from the inner loop:
if (/tga|taa|tag/g) {
...
}
else {
last;
}

It all depends on whether you want to generate sequences which could overlap. For example, sequence AF038954 contains atgaccatcctccagacatacttccggcagaacagggatga, the end of which overlaps with atgaagtcttggacaacctcttggcttttgtctgtga. Do you want to report them both?
If you don't want to report sequences which overlap, this is a very simple problem, which you can solve with a single regexp:
while (<DATA>) {
chomp;
print "processing $_\n";
my ($id, $rna_sq) = split;
while ($rna_sq =~ /(atg.*?(?:tga|taa|tag))/g) {
printf "\t%8s %4i %4i %s %i\n",
$id,
pos($rna_sq) - length($1) + 1,
pos($rna_sq),
$1,
length($1);
}
}
The regexp (atg.*?(?:tga|taa|tag)) matches your required start, then as little as possible of what comes next (that's the ? to stop the .* being "greedy") then your required end. Iterating over it with the while loop restarts after this match, which meets the requirement of not looking for overlaps.
If you do want overlapping sequences reported, you do need a two-stage process: find the start, find the end, and then find another start, picking up where you left off looking for the start the last time. But you can still do a simpler job using a second regexp:
while (<DATA>) {
chomp;
print "processing $_\n";
my ($id, $rna_sq) = split;
while ($rna_sq =~ /atg/g) {
if ($' =~ /(.*?(?:tga|taa|tag))/) {
my $match = "atg$1";
printf "\t%8s %4i %4i %s %i\n",
$id,
pos($rna_sq) - 2,
pos($rna_sq) - 3 + length($match),
$match,
length($match);
}
}
}
Here we use the (generally non-recommended) $' special variable, which contains the content after the match. We look in this to find the end of the sequence and output the details. Because our main global match against $rna_seq doesn't include the sequence (as it does above) we restart the search for a start where the previous search left off, that is just after the start we found. This way we do include overlapping sequences.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Obtaining overlapping matches using a regex - perl

See code first plz~ This is perl code. my $st = 'aaaa'; while ( $st =~ /aa/g ) { print $&, "\n"; } I want to move one point of the string. So I want the results of the three aa. However, only two results are obtained. I can derive three results do?

Use a look ahead. It doesn't advance the position: my $st = 'abcd'; while ($st =~ /(?=(..))/g) { print "$1\n"; } I used a different string to make the matching positions visible.

The following will do the trick: while ($st =~ /(?=(aa))/g) { print "$1\n"; }

Related

Passing strings as array to subroutine and return count of specific char

How to rewind next-search start position by 1?

Efficient way to read columns in a file using Perl

how to replace a div[ ] element with another div[ ] element?

How can I extract start and end codon from DNA sequences in Perl?

Categories

Resources