How to rewind next-search start position by 1? - perl

How can I rewind the start of the next search position by 1? For example, suppose I want to match all digits between #. The following will give me only odd numbers.
my $data="#1#2#3#4#";
while ( $data =~ /#(\d)#/g ) {
print $1, "\n";
}
But if I could rewind the start of the next position by 1, I would get both even and odd numbers.
This doesn't work: pos() = pos() - 1;
I know I can accomplish this using split. But this doesn't answer my question.
for (split /#/, $data) {
print $_, "\n";
}

One approach is to use a look-ahead assertion:
while ( $data =~ /#(\d)(?=#)/g ) {
print $1, "\n";
}
The characters in the look-ahead assertion are not part of the matched expression and do not update pos() past the \d part of the regular expression.
More demos:
say "#1#2#3#4#" =~ /#(\d)/g; # 1234
say "#1#2#3#4" =~ /#(\d)/g; # 1234
say "#1#2#3#4#" =~ /#(\d)(?=#)/g; # 1234
say "#1#2#3#4" =~ /#(\d)(?=#)/g; # 123

You're calling pos() on $_, instead of $data
From perldoc
Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified)
So,
pos($data) = pos($data) - 1;

Related

Need to extract value from a file and subtract from variable in Perl

FILE:
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
code:
use IPC::System::Simple qw( capture capturex );
use POSIX;
my $tb1_file = '/var/egridmanage_pl/daily_pl/egrid-csv/test.csv';
open my $fh1, '<', $tb1_file or die qq{Unable to open "$tb1_file" for input: $!};
my #t1_temp_12 = map {
chomp;
my #t1_ft_12 = split /,/;
sprintf "%.0f", $t1_ft_12[6] if $t1_ft_12[2] eq '12:00:00';
} <$fh1>;
print "TEMP #t1_temp_12\n";
my $result = #t1_temp_12 - 273.14;
print "$result should equal something closer to 24 ";
$result value prints out -265.14 making me think the #t1_temp_12 is hashed
So I tried to do awk
my $12temp = capture("awk -F"," '$3 == "12:00:00" {print $7 - 273-.15}' test.csv");
I've tried using ``, qx, open, system all having the same error result using the awk command
But this errors out. When executing awk at command line i get the favoured results.
This looks like there's some cargo cult programming going on here. It looks like all you're trying to do is find the line for 12:00:00 and print the temperature in degrees C rather than K.
Which can be done like this:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
my #fields = split /,/;
print $fields[6] - 273.15 if $fields[2] eq "12:00:00";
}
__DATA__
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
Prints:
25.72
You don't really need to do map sprintf etc. (Although you could do a printf on that output if you do want to format it).
Edit: From the comments, it seems one of the sources of confusion is extracting an element from an array. An array is zero or more scalar elements - you can't just assign one to the other, because .... well, what should happen if there isn't just one element (which is the usual case).
Given an array, we can:
pop #array will return the last element (and remove it from the array) so you could my $result = pop #array;
[0] is the first element of the array, so we can my $result = $array[0];
Or we can assign one array to another: my ( $result ) = #array; - because on the left hand side we have an array now, and it's a single element - the first element of #array goes into $result. (The rest isn't used in this scenario - but you could do my ( $result, #anything_else ) = #array;
So in your example - if what you're trying to do is retrieve a value matching a criteria - the normal tool for the job would be grep - which filters an array by applying a conditional test to each element.
So:
my #lines = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print "#lines";
print $lines[0];
Which we can reduce to:
my ( $firstresult ) = grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $firstresult;
But as we want to want to transform our array - map is the tool for the job.
my ( $result ) = map { (split /,/)[6] - 273.15 } grep { (split /,/)[2] eq "12:00:00" } <DATA>;
print $result;
First we:
use grep to extract the matching elements. (one in this case, but doesn't necessarily have to be!)
use map to transform the list, so that that we turn each element into just it's 6th field, and subtract 273.15
assign the whole lot to a list containing a single element - in effect just taking the first result, and throwing the rest away.
But personally, I think that's getting a bit complicated and may be hard to understand - and would suggest instead:
my $result;
while (<DATA>) {
my #fields = split /,/;
if ( $fields[2] eq "12:00:00" ) {
$result = $fields[6] - 273.15;
last;
}
}
print $result;
Iterate your data, split - and test - each line, and when you find one that matches the criteria - set $result and bail out of the loop.
#t1_temp_12 is an array. Why are you trying to subtract an single value from it?
my $result = "#t1_temp_12 - 273.14";
Did you want to do this instead?
#t1_temp_12 = map {$_ - 273.14} #t1_temp_12;
As a shell one-liner, you could write your entire script as:
perl -F, -lanE 'say $F[6]-273.14 if $F[2] eq "12:00:00"' <<DATA
1,2015-08-20,00:00:00,89,1007.48,295.551,296.66,
2,2015-08-20,03:00:00,85,1006.49,295.947,296.99,
3,2015-08-20,06:00:00,86,1006.05,295.05,296.02,
4,2015-08-20,09:00:00,85,1005.87,296.026,296.93,
5,2015-08-20,12:00:00,77,1004.96,298.034,298.87
DATA
25.73

Obtaining overlapping matches using a regex

See code first plz~
This is perl code.
my $st = 'aaaa';
while ( $st =~ /aa/g ) {
print $&, "\n";
}
I want to move one point of the string.
So I want the results of the three aa.
However, only two results are obtained.
I can derive three results do?
my $st = 'aaaa';
my $find = 'aa';
while($st =~ /$find/g){
print $&,"\n";
pos($st) -= (length($find)-1);
}
From perldoc pos
Returns the offset of where the last m//g search left off for the variable in question ($_ is used when the variable is not specified)
Also pos() is lvalue subroutine and result from it can be changed like for variable.
Use a look ahead. It doesn't advance the position:
my $st = 'abcd';
while ($st =~ /(?=(..))/g) {
print "$1\n";
}
I used a different string to make the matching positions visible.
Your problem is that regular expressions do not normally allow overlapping matches.
You can explore this fact by outputting the Positional Information for your two current matches:
my $st = 'aaaa';
while ( $st =~ /aa/g ) {
print "pos $-[0] - $&\n";
}
Outputs:
pos 0 - aa
pos 2 - aa
To fix this, you simply need to use a Positive Lookahead Assertion and an explicit capture group:
while ( $st =~ /(?=(aa))/g ) {
print "pos $-[0] - $1\n";
}
Outputs:
pos 0 - aa
pos 1 - aa
pos 2 - aa
The following will do the trick:
while ($st =~ /(?=(aa))/g) {
print "$1\n";
}

Matching in Perl

I am trying to get text in between two dots of a line, but my program returns the entire line.
For example: I have text which looks like:
My sampledata 1,2 for perl .version 1_1.
I used the following match statement
$x =~ m/(\.)(.*)(\.)/;
My output for $x should be version 1_1, but I am getting the entire line as my match.
In your code, the value of $x will not change after the match.
When $x is successfully matched with m/(.)(.*)(.)/, your three capture groups will contain '.', 'version 1_1' and '.' respectively (in the order given). $2 will give you 'version 1_1'.
Considering that you might probably only want the part 'version 1_1', you need not capture the two dots. This code will give you the same result:
$x =~ m/\.(.*)\./;
print $1;
Try this:
my $str = "My sampledata 1,2 for perl .version 1_1.";
$str =~ /\.\K[^.]+(?=\.)/;
print $&;
The period must be escaped out of a character class.
\K resets all that has been matched before (you can replace it by a lookbehind (?<=\.))
[^.] means any character except a period.
For several results, you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my #matches = ($str =~ /\.\K[^.]+(?=\.)/g);
print join("\n", #matches);
If you don't want to use twice a period you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my #matches = ($str =~ /\.([^.]+)\./g);
print join("\n", #matches)."\n";
It should be simple enough to do something like this:
#!/usr/bin/perl
use warnings;
use strict;
my #tests = (
"test one. get some stuff. extra",
"stuff with only one dot.",
"another test line.capture this. whatever",
"last test . some data you want.",
"stuff with only no dots",
);
for my $test (#tests) {
# For this example, I skip $test if the match fails,
# otherwise, I move on do stuff with $want
next if $test !~ /\.(.*)\./;
my $want = $1;
print "got: $want\n";
}
Output
$ ./test.pl
got: get some stuff
got: capture this
got: some data you want

Perl regular expressions and returned array of matched groups

i am new in Perl and i need to do some regexp.
I read, when array is used like integer value, it gives count of elements inside.
So i am doing for example
if (#result = $pattern =~ /(\d)\.(\d)/) {....}
and i was thinking it should return empty array, when pattern matching fails, but it gives me still array with 2 elements, but with uninitialized values.
So how i can put pattern matching inside if condition, is it possible?
EDIT:
foreach (keys #ARGV) {
if (my #result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
if (defined $params{$result[0]}) {
print STDERR "Cmd option error\n";
}
$params{$result[0]} = (defined $result[1] ? $result[1] : 1);
}
else {
print STDERR "Cmd option error\n";
exit ERROR_CMD;
}
}
It is regexp pattern for command line options, cmd options are in long format with two hyphens preceding and possible with argument, so
--CMD[=ARG]. I want elegant solution, so this is why i want put it to if condition without some prolog etc.
EDIT2:
oh sry, i was thinking groups in #result array are always counted from 0, but accesible are only groups from branch, where the pattern is success. So if in my code command is "input", it should be in $result[0], but actually it is in $result[1]. I thought if $result[0] is uninitialized, than pattern fails and it goes to the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my #result = $pattern =~ /(\d)\.(\d)/;
print #result, ' elements';
Output:
24 elements
Context tells Perl how to treat #result. There certainly aren't 24 elements! Perl has printed the array's elements which resulted from your regex's captures. However, if we do the following:
print 0 + #result, ' elements';
we get:
2 elements
In this latter case, Perl interprets a scalar context for #result, so adds the number of elements to 0. This can also be achieved through scalar #results.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my #result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
#results = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
#result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
push #result, $1;
push #result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.

How can I extract start and end codon from DNA sequences in Perl?

I have a code below that try to identify the position of start and end codon of the given DNA sequences.
We define start codon as a ATG sequence and end codon as TGA,TAA,TAG sequences.
The problem I have is that the code below works only for first two sequences (DM208659 and AF038953) but not the rest.
What's wrong with my approach below?
This code can be copy-pasted from here.
#!/usr/bin/perl -w
while (<DATA>) {
chomp;
print "$_\n";
my ($id,$rna_sq) = split(/\s+/,$_);
local $_ = $rna_sq;
while (/atg/g) {
my $start = pos() - 2;
if (/tga|taa|tag/g) {
my $stop = pos();
my $gene = substr( $_, $start - 1, $stop - $start + 1 ),$/;
my $genelen = length($gene);
my $ct = "$id $start $stop $gene $genelen";
print "\t$ct\n";
}
}
}
__DATA__
DM208659 gtgggcctcaaatgtggagcactattctgatgtccaagtggaaagtgctgcgacatttgagcgtcac
AF038953 gatcccagacctcggcttgcagtagtgttagactgaagataaagtaagtgctgtttgggctaacaggatctcctcttgcagtctgcagcccaggacgctgattccagcagcgccttaccgcgcagcccgaagattcactatggtgaaaatcgccttcaatacccctaccgccgtgcaaaaggaggaggcgcggcaagacgtggaggccctcctgagccgcacggtcagaactcagatactgaccggcaaggagctccgagttgccacccaggaaaaagagggctcctctgggagatgtatgcttactctcttaggcctttcattcatcttggcaggacttattgttggtggagcctgcatttacaagtacttcatgcccaagagcaccatttaccgtggagagatgtgcttttttgattctgaggatcctgcaaattcccttcgtggaggagagcctaacttcctgcctgtgactgaggaggctgacattcgtgaggatgacaacattgcaatcattgatgtgcctgtccccagtttctctgatagtgaccctgcagcaattattcatgactttgaaaagggaatgactgcttacctggacttgttgctggggaactgctatctgatgcccctcaatacttctattgttatgcctccaaaaaatctggtagagctctttggcaaactggcgagtggcagatatctgcctcaaacttatgtggttcgagaagacctagttgctgtggaggaaattcgtgatgttagtaaccttggcatctttatttaccaactttgcaataacagaaagtccttccgccttcgtcgcagagacctcttgctgggtttcaacaaacgtgccattgataaatgctggaagattagacacttccccaacgaatttattgttgagaccaagatctgtcaagagtaagaggcaacagatagagtgtccttggtaataagaagtcagagatttacaatatgactttaacattaaggtttatgggatactcaagatatttactcatgcatttactctattgcttatgccgtaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
BC021011 ggggagtccggggcggcgcctggaggcggagccgcccgctgggctaaatggggcagaggccgggaggggtgggggttccccgcgccgcagccatggagcagcttcgcgccgccgcccgtctgcagattgttctg
DM208660 gggatactcaaaatgggggcgctttcctttttgtctgtactgggaagtgcttcgattttggggtgtccc
AF038954 ggacccaagggggccttcgaggtgccttaggccgcttgccttgctctcagaatcgctgccgccatggctagtcagtctcaggggattcagcagctgctgcaggccgagaagcgggcagccgagaaggtgtccgaggcccgcaaaagaaagaaccggaggctgaagcaggccaaagaagaagctcaggctgaaattgaacagtaccgcctgcagagggagaaagaattcaaggccaaggaagctgcggcattgggatcccgtggcagttgcagcactgaagtggagaaggagacccaggagaagatgaccatcctccagacatacttccggcagaacagggatgaagtcttggacaacctcttggcttttgtctgtgacattcggccagaaatccatgaaaactaccgcataaatggatagaagagagaagcacctgtgctgtggagtggcattttagatgccctcacgaatatggaagcttagcacagctctagttacattcttaggagatggccattaaattatttccatatattataagagaggtccttccactttttggagagtagccaatctagctttttggtaacagacttagaaattagcaaagatgtccagctttttaccacagattcctgagggattttagatgggtaaatagagtcagactttgaccaggttttgggcaaagcacatgtatatcagtgtggacttttcctttcttagatctagtttaaaaaaaaaaaccccttaccattctttgaagaaaggaggggattaaataattttttcccctaacactttcttgaaggtcaggggctttatctatgaaaagttagtaaatagttctttgtaacctgtgtgaagcagcagccagccttaaagtagtccattcttgctaatggttagaacagtgaatactagtggaattgtttgggctgcttttagtttctcttaatcaaaattactagatgatagaattcaagaacttgttacatgtattacttggtgtatcgataatcatttaaaagtaaagactctgtcatgcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I removed the use of $_ (I especially shuddered when you localized it -- you did so correctly, but why force yourself to worry if some other function is going to clobber $_, rather than use $rna_sq which is already available?
Additionally I corrected $start and $stop to be 0-based indexes into the string (which made the rest of the math more straight-forward), and calculated $genelen early so it could be used directly in the substr operation. (Alternatively, you could localize $[ to 1 to use 1-based array indexes, see perldoc perlvar.)
use strict;
use warnings;
while (my $line = <DATA>) {
chomp $line;
print "processing $line\n";
my ($id, $rna_sq) = split(/\s+/, $line);
while ($rna_sq =~ /atg/g) {
# $start and $stop are 0-based indexes
my $start = pos($rna_sq) - 3; # back up to include the start sequence
# discard remnant if no stop sequence can be found
last unless $rna_sq =~ /tga|taa|tag/g;
my $stop = pos($rna_sq);
my $genelen = $stop - $start;
my $gene = substr($rna_sq, $start, $genelen);
print "\t" . join(' ', $id, $start+1, $stop, $gene, $genelen) . "\n";
}
}
It's never breaking out of your inner loop when the if (/tga|taa|tag/g) fails to find an end codon. It keeps matching /atg/g repeatedly, never advancing any further. You could forcibly eject it from the inner loop:
if (/tga|taa|tag/g) {
...
}
else {
last;
}
It all depends on whether you want to generate sequences which could overlap. For example, sequence AF038954 contains atgaccatcctccagacatacttccggcagaacagggatga, the end of which overlaps with atgaagtcttggacaacctcttggcttttgtctgtga. Do you want to report them both?
If you don't want to report sequences which overlap, this is a very simple problem, which you can solve with a single regexp:
while (<DATA>) {
chomp;
print "processing $_\n";
my ($id, $rna_sq) = split;
while ($rna_sq =~ /(atg.*?(?:tga|taa|tag))/g) {
printf "\t%8s %4i %4i %s %i\n",
$id,
pos($rna_sq) - length($1) + 1,
pos($rna_sq),
$1,
length($1);
}
}
The regexp (atg.*?(?:tga|taa|tag)) matches your required start, then as little as possible of what comes next (that's the ? to stop the .* being "greedy") then your required end. Iterating over it with the while loop restarts after this match, which meets the requirement of not looking for overlaps.
If you do want overlapping sequences reported, you do need a two-stage process: find the start, find the end, and then find another start, picking up where you left off looking for the start the last time. But you can still do a simpler job using a second regexp:
while (<DATA>) {
chomp;
print "processing $_\n";
my ($id, $rna_sq) = split;
while ($rna_sq =~ /atg/g) {
if ($' =~ /(.*?(?:tga|taa|tag))/) {
my $match = "atg$1";
printf "\t%8s %4i %4i %s %i\n",
$id,
pos($rna_sq) - 2,
pos($rna_sq) - 3 + length($match),
$match,
length($match);
}
}
}
Here we use the (generally non-recommended) $' special variable, which contains the content after the match. We look in this to find the end of the sequence and output the details. Because our main global match against $rna_seq doesn't include the sequence (as it does above) we restart the search for a start where the previous search left off, that is just after the start we found. This way we do include overlapping sequences.