Searching for recursive pattern within a string - perl

Using Perl, I want to search a string of nucleotides (AGCT) for a pattern of exactly three nucleotides that repeats consecutively at least seven times. I also need to save that combination so I can print it to a file, along with a total count.
The pattern of these three nucleotides will be unknown in the sense that while there are only 64 possible combinations, we will not know which one will be the repeating combination.
I have two lines of thought going in my head about how to go about this:
Create a list of the possible combinations and check against that, while producing a count. This doesn't seem feasible because every three nucleotides would produce a match. And it still wouldn't solve the problem of consecutive matching.
OR Check the first three nucleotides against the next three, if matching, check the next three. If no match, shift the reading frame to the second nucleotide in the string and try the search again.

This regex ought to do the trick:
/( ([ACGT]{3}) \2{6,} )/x
Match three characters from ACGT, then repeat that capture ($2) at least six more times, for seven or more copies in total. The whole run is in $1, so its length is three times the number of repeats: $n = length($1)/3.
Test:
my $regex = qr/( ([ACGT]{3}) \2{6,} )/x;
"TACGACGACGACGACGACGACGACGT" =~ $regex;
printf "Matched %s exactly %d times\n", $2, length($1)/3;
Output:
Matched ACG exactly 8 times
Looks good.
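To find every such run in a longer sequence and keep the motif and count for writing to a file, the same regex can be applied in a loop with /g. This is only a minimal sketch: it assumes the sequence is already one uppercase string, and the helper name find_triplet_repeats and its hash keys are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hash per run of a 3-letter motif
# that repeats at least $min times in a row (default 7).
sub find_triplet_repeats {
    my ($seq, $min) = @_;
    $min //= 7;
    my $extra = $min - 1;                     # copies required after the first one
    my $re    = qr/(([ACGT]{3})\2{$extra,})/;
    my @hits;
    while ($seq =~ /$re/g) {
        push @hits, {
            motif   => $2,
            repeats => length($1) / 3,
            offset  => $-[1],                 # 0-based start of the run
        };
    }
    return @hits;
}

# Made-up test sequence: 7 copies of ACG, then 8 copies of GGA.
my $seq  = 'TT' . ('ACG' x 7) . 'TT' . ('GGA' x 8) . 'C';
my @hits = find_triplet_repeats($seq);

printf "%s x %d at offset %d\n", @{$_}{qw(motif repeats offset)} for @hits;
printf "Total runs: %d\n", scalar @hits;
```

Matches found this way do not overlap, and a run of a single letter (for example a long stretch of A) is reported with the motif AAA, which may or may not be what you want.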

Related

Subtract a number of capital letters in a row from a string

I have a string which can contain letters and/or numbers, and I want to identify if it has 10 capital letters in a row and subtract them if they exist:
Example :
my $string = "MyString-MetadataDZEDDMWKQMsomeothertext";
I want to identify that this string contains 10 capital letters one after each other (DZEDDMWKQM) and subtract them
my $final_string = "MyString-Metadatasomeothertext"
I managed only to be able to subtract a fixed amount of characters, but it was not helpful for what I need.
You want a regular expression substitution.
my $string = 'fooABCDEFGHIJbar';
$string =~ s/[A-Z]{10}//;
print $string;
The regex pattern contains two parts:
[A-Z] is a character group containing all upper case letters, from A to Z
{10} is a quantifier, meaning repeat the previous thing exactly 10 times
You can learn more about regular expressions in Perl in perlre. The regex tag wiki is useful too.
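One thing to be aware of: [A-Z]{10} also matches the first ten letters of a longer run of capitals. If only a run of exactly ten should be removed (an assumption about the intent, not something the question states), lookarounds can rule out a capital on either side. A small sketch comparing the two:

```perl
use strict;
use warnings;

my $string = 'MyString-MetadataDZEDDMWKQMsomeothertext';

# Plain version from the answer: removes the first 10 capitals in a row.
(my $plain = $string) =~ s/[A-Z]{10}//;

# Stricter variant: the run must not be preceded or followed by a capital.
(my $strict = $string) =~ s/(?<![A-Z])[A-Z]{10}(?![A-Z])//;

print "$plain\n$strict\n";

# With a 12-letter run the two behave differently.
my $long = 'fooABCDEFGHIJKLbar';
(my $long_plain  = $long) =~ s/[A-Z]{10}//;
(my $long_strict = $long) =~ s/(?<![A-Z])[A-Z]{10}(?![A-Z])//;

print "$long_plain\n$long_strict\n";
```

On the asker's string both versions give the same result; on the 12-letter run the plain version leaves two capitals behind while the strict one leaves the string untouched.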

Confusion about file counting

I've seen numerous posts about using Perl to count specific files in a directory, one of which showed this code
@filesInDirectory = glob("$directory$fileNameRegex");
$numberOfFiles = @filesInDirectory;
Then if I follow this with:
print LOGFILE $#filesInDirectory;
print LOGFILE $numberOfFiles;
The log will contain 01. My confusion is this: why is $numberOfFiles 1 instead of 0 when the file I'm looking for doesn't exist?
$numberOfFiles contains the number of elements in @filesInDirectory. Since @filesInDirectory contains one element, $numberOfFiles contains 1.
$#filesInDirectory contains the index of the last element in @filesInDirectory. Since @filesInDirectory contains one element ($filesInDirectory[0]), $#filesInDirectory contains 0.
Unless you mess with $[ (don't!!), $#a will always be one less than @a in scalar context.
Filename globs are not regular expressions, so if $fileNameRegex is what its name implies, it's the wrong thing to use here.
Globs are the pattern-matching language you use to expand filenames into a shell command line: * matches any string of characters (equivalent to .* in a regular expression), ? matches a single character (equivalent to . in a regular expression), and [] works pretty much the same as in regular expressions.
If you need to test filenames against a regular expression, you'll have to do a grep on the list you get back from glob (or readdir).
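A rough sketch of that glob-then-grep combination follows. The file names and the regex are invented for the example, and it assumes the temporary directory path contains no spaces or glob metacharacters.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

# Scratch directory with a few made-up file names.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a1.log a22.log b1.log notes.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}

# The glob pattern narrows the list; the regex does the precise test.
my @logs    = glob("$dir/*.log");
my @matches = sort
              map  { basename($_) }
              grep { basename($_) =~ /^a\d+\.log$/ } @logs;

print "$_\n" for @matches;
```

Here the glob *.log picks up three files, and the regex then keeps only those named "a" followed by digits.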
If you're wondering why @filesInDirectory has something in it instead of being empty, then you already have a solution.
But if you're asking why $numberOfFiles is different from $#filesInDirectory, then it's because the first contains the number of elements in @filesInDirectory while the second contains the index of the last element of @filesInDirectory.
Perl indices start at zero, so if @array has exactly one element then $#array will be zero. If @array is empty then $#array will be -1.
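Both effects can be seen in a short experiment. The sketch below relies on the default behaviour of Perl's built-in glob, which returns a pattern unchanged when it contains no wildcard characters, whether or not the file exists; the file names are placeholders.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);    # empty scratch directory

# No wildcard characters: glob hands back the pattern itself.
my @literal  = glob("$dir/missing.txt");
# With a wildcard and nothing to match, the list is empty.
my @wildcard = glob("$dir/*.txt");

printf "literal:  count=%d last index=%d\n", scalar(@literal),  $#literal;
printf "wildcard: count=%d last index=%d\n", scalar(@wildcard), $#wildcard;

# To count only files that really exist, filter with a file test.
my $existing = grep { -e $_ } @literal;
print "existing: $existing\n";
```

So a count of 1 for a missing file most likely means the pattern had no wildcard in it; filtering with -e, or using a pattern with a wildcard, gives the expected 0.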

Need Regular expression - perl

I am looking for a regex for the expression below:
my $text = "1170 KB/s (244475 bytes in 2.204s)"; # I want to retrieve last ‘2.204’ from this String.
$text =~ m/\d+[^\d*](\d+)/; #Regexp
my $num = $1;
print " $num ";
Output:
204
But I need 2.204 as output, please correct me.
Can any one help me out?
The regex is doing exactly what you asked it to: It is matching digits \d+, followed by one non-digit or star [^\d*], followed by digits \d+. The only thing that matches that in your string is 204.
If you want a quick fix, you can just move the parentheses:
m/(\d+[^\d*]\d+)/
This would (with the above input) match what you want. A more exact way to put it would be:
m/(\d+\.\d+)/
Of course this will match any floating-point number, so if your string can contain more of those, that's not a good idea. You can shore it up by using an anchor, like so:
m/(\d+\.\d+)s\)/
Where s\) forces the match to occur at only that place. Further strictures:
m/\(\d+\D+(\d+\.\d+)s\)/
You might also want to account for the possibility of your target number not being a float:
m/\(\d+\D+(\d+\.?\d*)s\)/
By using ? and * we allow for those parts not to match at all. This is not recommended to do unless you are using anchors. You can also replace everything in the capture group with [\d.]+.
If you are not fond of matching the parentheses, you can match the text:
m/bytes in ([\d.]+)s/
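For comparison, here is a small sketch that runs the original attempt and two of the patterns above against the sample string. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = '1170 KB/s (244475 bytes in 2.204s)';

# Original attempt: only the digits after the separator are captured.
my ($partial) = $text =~ /\d+[^\d*](\d+)/;

# Anchor on the surrounding text and capture the whole number.
my ($seconds) = $text =~ /bytes in ([\d.]+)s/;

# Stricter variant that also accepts a whole number of seconds.
my ($strict) = $text =~ /\(\d+\D+(\d+\.?\d*)s\)/;

print "partial=$partial seconds=$seconds strict=$strict\n";
```

Matching in list context, as in my ($seconds) = ..., returns the captures directly and leaves the variable undefined when the match fails, which avoids reading a stale $1.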
I'd go with the second marker as indicator where you are in the string:
my ($num) = ($text =~ /(\d+\.\d+)s/);
with explanations:
/( # start of matching group
\d+ # first digits
\. # a literal '.', take \D if you want non-numbers
\d+ # second digits
)/x # close the matching group and the regex
You had the capture group in the wrong place. Also, [^\d] is a bit excessive: generally you can negate the backslashed character classes (\d, \h, \s and \w) with their respective uppercase letters.
Try this regex:
$text =~ m/\d+[^\d]*(\d+\.?\d*)s/;
That should match 1+ digits, a decimal point if there is one, 0 or more decimal places, and make sure it's followed by a "s".

In Perl, how can I split only a certain leading portion of a string?

I am parsing a file with long lines, whose tokens are white space delimited. Before handling most of the line, I want to check whether the n-th (for small n) token has some value. I'll skip most of the lines, so really there's no need to split most of the very long lines. Is there a quick way to do a lazy split in Perl or do I need to roll my own?
You can provide a limit argument to the split operator to make Perl stop splitting after a certain number of tokens have been generated.
@fields = split /\s+/, $expression, 4;
for example, will put everything after the 3rd whitespace-separated field in the 4th element of @fields. This is more efficient than doing a complete split when the expression has more than four fields.
If you do this lazy split and decide that you need to process the line further, you will need to split the line again. Depending on how long the lines are and how frequently you need to reprocess them, you could still come out ahead.
Another approach may be to split a portion of the line you are interested in. For example, if the line contains many fields but you want to filter on the 4th field AND you are sure that the 4th field always occurs before the 100th byte on the line, saying
@fields = split /\s+/, substr($expression, 0, 100);
if (matches_some_condition($fields[3])) {
    # process the whole line
    @fields = split /\s+/, $expression;
    ...
}
and occasionally splitting the expression twice may be more efficient than always splitting the full expression one time.
perldoc -f split:
If LIMIT is specified and positive, it represents the maximum number of fields the EXPR will be split into, though the actual number of fields returned depends on the number of times PATTERN matches within EXPR.
my $nth = (split ' ', $line, $n + 1)[$n - 1];
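The limit in that one-liner is $n + 1 rather than $n because with a limit of $n the n-th field would hold the entire rest of the line. A minimal sketch of using it as a filter, where nth_token and the sample lines are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n-th (1-based) whitespace-separated
# token, splitting no further than necessary.
sub nth_token {
    my ($line, $n) = @_;
    return (split ' ', $line, $n + 1)[$n - 1];
}

my @lines = (
    'alpha beta KEEP delta epsilon zeta',
    'alpha beta skip delta epsilon zeta',
);

# Only lines whose third token is KEEP get the full split.
my @kept;
for my $line (@lines) {
    my $third = nth_token($line, 3);
    next unless defined $third && $third eq 'KEEP';
    push @kept, [ split ' ', $line ];
}

printf "kept %d of %d lines\n", scalar @kept, scalar @lines;
```

Splitting on ' ' (a single space string) rather than /\s+/ also skips leading whitespace, so the first token is never an empty string.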

how to count the number of words in a line in perl?

I know I could write my own while loop with a regex to count the words in a line. But I am processing about 1000 lines and I don't want to run this loop every time. So I was wondering whether there is a built-in way to count the words in a line in Perl.
1000 times is not a significant number to a modern computer. In general, write the code that makes sense to you, and then, if there is a performance problem, worry about optimization.
To count words, first you need to decide what is a word. One approach is to match groups of consecutive word characters, but that counts "it's" as two words.
Another is to match groups of consecutive non-whitespace, but that counts "phrase - phrase" as three words. Once you have a regex that matches a word, you can count words like this (using consecutive word characters for this example):
scalar( () = $line =~ /\w+/g )
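The expression works because assigning the match list to an empty list and then evaluating that assignment in scalar context yields the number of matches. A small sketch comparing the two definitions of "word" discussed above, where count_matches is a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: count the matches of a "word" regex in a line.
sub count_matches {
    my ($line, $word_re) = @_;
    # List assignment in scalar context returns the number of matches.
    my $count = () = $line =~ /$word_re/g;
    return $count;
}

my $its_w    = count_matches("it's here",       qr/\w+/);  # it, s, here
my $its_S    = count_matches("it's here",       qr/\S+/);  # it's, here
my $phrase_w = count_matches("phrase - phrase", qr/\w+/);  # phrase, phrase
my $phrase_S = count_matches("phrase - phrase", qr/\S+/);  # phrase, -, phrase

print "$its_w $its_S $phrase_w $phrase_S\n";
```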
How about splitting the line on one or more non-word characters and counting the size of the resulting array?
$ echo "one, two, three" | perl -nE "say scalar split /\W+/"
3
As a sub that would be:
# say count_words 'foo bar' => 2
sub count_words { scalar split /\W+/, shift }
To get rid of the leading space problem spotted by ysth, you can filter out the empty segments:
$ echo " one, two, three" | perl -nE 'say scalar grep {length $_} split /\W+/'
3
…or shave the input string:
$ echo " one, two, three" | perl -nE 's/^\W+//; say scalar split /\W+/'
3
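The same filtering can be folded into the sub. This is just a sketch of the grep variant above; it splits in list context, which also sidesteps the scalar-context split behaviour of older Perl versions.

```perl
use strict;
use warnings;

# Split on runs of non-word characters and drop the empty leading
# field that appears when the line starts with such a character.
sub count_words {
    my ($line) = @_;
    return scalar grep { length $_ } split /\W+/, $line;
}

# Without the filter, the leading space produces an extra empty field.
my @raw = split /\W+/, ' one, two, three';

printf "raw fields: %d, words: %d\n", scalar @raw, count_words(' one, two, three');
```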