Pattern matching consecutive characters - perl

I have a list of strings that I would like to search through, ignoring any that contain A or G characters occurring 4 or more times consecutively. For instance, I would like to ignore strings such as TCAAAATC or GCTGGGGAA.
I've tried:
unless ($string =~ m/A{4,}?/g || m/G{4,}?/g)
{
Do something;
}
But I get an error message "Use of uninitialized value in pattern match (m//)".
Any suggestions would be appreciated.

By writing
|| m/G{4,}?/g
you are implicitly testing $_ against this regex. $_ is not initialized, so you get the warning.
Write
unless ($string =~ m/A{4}/ || $string =~ m/G{4}/)
instead (note the simplifications made to the regex), or, as a single expression,
unless ($string =~ m/A{4}|G{4}/)

You need to avoid the implicit comparison with $_, which you can do by writing:
unless ($string =~ m/A{4}/ || $string =~ m/G{4}/)
This looks for a run of 4 A's or 4 G's in the string; once 4 are found, it doesn't matter whether more follow.
You can reduce it to a single regular expression by using:
unless ($string =~ m/([AG])\1{3}/)
which looks for an A or G followed by 3 more of the same character.
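As a minimal sketch of how the backreference form could be used to filter a list, the following keeps only the strings without such a run. The helper name has_long_run and the last two sample strings are assumptions made for illustration, not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $seq contains a run of 4 or more A's or G's.
sub has_long_run {
    my ($seq) = @_;
    # [AG] captures one A or G; \1{3} requires 3 more of that same character.
    return $seq =~ /([AG])\1{3}/ ? 1 : 0;
}

# The first two strings come from the question; the others are made up.
my @strings = qw(TCAAAATC GCTGGGGAA TCAGAGTC AAAGGGTT);
my @kept    = grep { !has_long_run($_) } @strings;

print "$_\n" for @kept;    # prints TCAGAGTC and AAAGGGTT
```

Note that AAAGGGTT is kept: the backreference only matches a repeat of the same character, so three A's followed by three G's do not count as a run of four.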

Best way to parse string in perl

To achieve the task below I have written the C-like Perl program below (I am new to Perl), but I am not sure it is the best way to do it.
Can someone please guide me?
Note: I am not asking for a full program, only for where I can make improvements.
Thanks in advance.
Input :
$str = "mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>"
Expected Output :
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
Sample Program
my $str="mail1, \@local<mail1\@mail.local>, mail2\@mail.local, <mail3\@mail.local>, mail4, local<mail4\@mail.local>";
my ($count, $flag, $tempStr) = (0, 0, "");
my @array;
for my $c (split (//,$str)) {
    # skip spaces at the start of a record
    if( ($count == 0) and ($c eq ' ') ) {
        next;
    }
    if(length $c) {
        # a comma seen after @ or > ends the current record
        if( ($c eq ',') and ($flag == 1) ) {
            push @array, $tempStr;
            $count=0;
            $flag=0;
            $tempStr="";
            next;
        }
        if( ($c eq '>' ) or ( $c eq '@' ) ) {
            $flag=1;
        }
        $tempStr .= $c;
        $count++;
    }
}
if($count>0) {
    push @array, $tempStr;
}
foreach my $var (@array) {
    print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1@mail.local>
"mail4, local"<mail4@mail.local>
Sample Code:
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
my @addresses = split('\n',$str);
if(scalar @addresses) {
    foreach my $address (@addresses) {
        if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
            $address="\"$address";
            $address=~ s/</\"</g;
        }
    }
    $str = join(',',@addresses);
}
print "$str\n";
As I see it, you want to replace each:
comma and following spaces,
occurring after either @ or >,
with a newline.
To make such a replacement, you can use a regex instead of writing a parsing program.
The search part can be as follows:
([^@>]+[@>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^@>]+ - A non-empty sequence of chars other than @ or >.
[@>] - Either @ or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and an optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4, local<mail4@mail.local>';
print "Before:\n$str\n";
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
I used the g option to replace every such comma.
Note that I put the source string in single quotes; otherwise Perl would have complained about "Possible unintended interpolation of @mail".
Edit
Your modified requirements must be handled in a different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some fragments to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - An optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(the actual mail address), enclosed in angle brackets, e.g. <mail1@mail.local>.
Within each iteration of the loop you have access to the groups
captured in that particular match ($1, $2, ...).
So the body of the loop prints these captured groups,
with the required additional chars.
The code (again much shorter than yours) should look like this:
my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
    print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex:
use warnings;
use strict;
use feature 'say';
my $string =  # broken into two parts for readability
    q(mail1, local<mail1@mail.local>, mail2@mail.local, )
    . q(<mail3@mail.local>, mail4, local<mail4@mail.local>);
my @addresses = split /@.+?\K,\s*/, $string;
say for @addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so @.+?,
To match a pattern only when it is preceded by another brings to mind a lookbehind before the comma. But those can't be of variable length, which is precisely the case here.
We can instead match the pattern @.+? normally and then use \K (a form of lookbehind), which drops everything matched before it from the match so that it is not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, @... (which isn't consumed).
It prints
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once the addresses have been parsed out of the string as above. For example
my @addresses = split /@.+?\K,\s*/, $string;
s/(.+?,\s*.+?)</"$1"</ for @addresses;
say for @addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (it changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the loop variable (or $_) is an alias for the currently processed element, so changing it changes that element. This is a known source of bugs when it happens unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (@addresses) {
    $elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.
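The two passes described above (split on the comma after an address, then quote the description) can be sketched as one small routine. This is only an illustration under stated assumptions: the sub name parse_addresses is hypothetical, the addresses are written with a literal @ in single-quoted strings, and the @ is escaped in the pattern to rule out any array interpolation.

```perl
use strict;
use warnings;

# Hypothetical helper: split a comma-separated address list into records
# and quote any "name, name" description that precedes <...>.
sub parse_addresses {
    my ($list) = @_;
    # Split on a comma only when it follows an address; \K keeps the
    # address itself out of the delimiter so it stays in the record.
    my @records = split /\@.+?\K,\s*/, $list;
    # $_ aliases each element, so the substitution edits @records in place.
    s/(.+?,\s*.+?)</"$1"</ for @records;
    return @records;
}

my $input = 'mail1, local<mail1@mail.local>, mail2@mail.local, '
          . '<mail3@mail.local>, mail4, local<mail4@mail.local>';
print "$_\n" for parse_addresses($input);
```

Records without a description (mail2 and mail3 here) are left untouched, because the quoting pattern requires a comma somewhere before the opening angle bracket.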

How do we use the '/r' modifier in a string replacement?

I need to increment a numeric value in a string:
my $str = "tool_v01.zip";
(my $newstr = $str) =~ s/\_v(\d+)\.zip$/ ($1++);/eri;
#(my $newstr = $str) =~ s/\_v(\d+)\.zip$/ ($1+1);/eri;
#(my $newstr = $str) =~ s/\_v(\d+)\.zip$/ $1=~s{(\d+)}{$1+1}/r; /eri;
print $newstr;
Expected output is tool_v02.zip
Note: the version number 01 may contain any number of leading zeroes
I don't think this question has anything to do with the /r modifier, but rather how to properly format the output. For that, I'd suggest sprintf:
my $newstr = $str =~ s{ _v (\d+) \.zip$ }
{ sprintf("_v%0*d.zip", length($1), $1+1 ) }xeri;
Or, replacing just the number with zero-width Lookaround Assertions:
my $newstr = $str =~ s{ (?<= _v ) (\d+) (?= \.zip$ ) }
{ sprintf("%0*d", length($1), $1+1 ) }xeri;
Note: With either of these solutions, something like tool_v99.zip would be altered to tool_v100.zip because the new sequence number cannot be expressed in two characters. If that's not what you want then you need to specify what alternative behaviour you require.
The bit you're missing is sprintf which works the same way as printf except rather than outputting the formatted string to stdout or a file handle, it returns it as a string. Example:
sprintf("%02d",3)
generates a string 03
Putting this into your regex, you can do the following. Rather than using /r, you can use a zero-width lookahead ((?=...)) to match the file suffix and replace just the matched number with the new value:
s/(\d+)(?=\.zip$)/sprintf("%02d",$1+1)/ei
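For comparison, here is a small sketch that wraps the width-preserving sprintf form in a sub and uses /r to return the new name without touching the original. The sub name bump_version and the extra sample file names are assumptions for illustration; /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: bump the number in a name like tool_v01.zip,
# keeping the original zero-padded width where the result still fits.
sub bump_version {
    my ($name) = @_;
    # /e evaluates the replacement as code, /r returns the modified copy.
    return $name =~ s{ (?<= _v ) (\d+) (?= \.zip \z ) }
                     { sprintf '%0*d', length($1), $1 + 1 }xeir;
}

print bump_version('tool_v01.zip'),   "\n";   # tool_v02.zip
print bump_version('tool_v0009.zip'), "\n";   # tool_v0010.zip
print bump_version('tool_v99.zip'),   "\n";   # tool_v100.zip
```

The last call shows the overflow case: the width passed to %0*d is a minimum, so 100 is printed in three characters.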

In Perl, I want to mask/cut off X number of characters at end of string (X can be one of a set of character strings)

I have two strings, XXXXXXnumber and XXXXXXdate, and I want to strip the suffix from each string, keeping only the XXXXXX part. The actual number of characters represented by XXXXXX can vary. The suffixes 'number' and 'date' are constant. XXXXXXnumber and XXXXXXdate should both become XXXXXX.
my ($prefix) = ($string =~ /\A (.+?) (?:date|number) \z/x);
Alternatively:
$string =~ s/ (?:date|number) \z//x;
I would use a regular expression like $line =~ s/(number|date)$// for that task, where $line can be either line.
If your line has additional characters after number or date, they must be filtered out, too. An alternative approach would be using an expression like ($num) = ($line =~ /^(.*)(number|date).*$/);
Use regexes:
($newvar = $oldvar) =~ s/^(.*)(number|date)$/$1/;
If you have no more use for $oldvar's original value (including the Xes), this simplifies to
$oldvar =~ s/^(.*)(number|date)$/$1/;
A simple substitution takes care of it:
$str =~ s/(?:number|date)\z//;
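A minimal sketch of the suffix-stripping substitution applied to several strings. The sub name strip_suffix and the sample values are made up for illustration, and /r (Perl 5.14+) is used so the input is left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing "number" or "date", if present.
sub strip_suffix {
    my ($s) = @_;
    # \z anchors at the very end of the string; /r returns the new string.
    return $s =~ s/(?:number|date)\z//r;
}

# Sample values are illustrative only.
print strip_suffix($_), "\n" for qw(ABC123number ABC123date ABC123);
```

A string without either suffix comes back unchanged, so all three lines print ABC123.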

Perl: Why would eq work, when =~ doesn't?

Working code:
if ( $check1 eq $search_key ...
Previous 'buggy' code:
if ( $check1 =~ /$search_key/ ...
The words (in $check1 and $search_key) should be the same, but why doesn't the 2nd one return true all the time? What is different about these?
$check1 is acquired through a split. $search_key is either inputted before ("word") or at runtime: (<>), both are then passed to a subroutine.
A further question would be, can I convert the following without any hidden problems?
if ($category_id eq "subj") {
I want to be able to say: =~ /subj/ so that "subject" would still remain true.
Thanks in advance.
$check1 =~ /$search_key/ doesn't work because any special characters in $search_key will be interpreted as a part of the regular expression.
Moreover, this really tests whether $check1 contains the substring $search_key. You really wanted $check1 =~ /^$search_key$/, although it's still incorrect because of the reason mentioned above.
Better stick with eq for exact string comparisons.
As mentioned before, special characters in $search_key will be interpreted as regex syntax. To prevent this, use \Q: if ( $check1 =~ /\Q$search_key/), which takes the content of $search_key as a literal. You can use \E to end the quoting, for example if ( $check1 =~ /\b\Q$search_key\E\b/).
This information is in perlre.
Regarding your second question, if you just want plain substring matching, you can use the index function. Then replace
if ($category_id eq "subj") {
with
if (0 <= index $category_id, "subj") {
This is a case-sensitive match.
Addition for clarification: it will match asubj, subj, and even subjugate.
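To make the difference concrete, here is a small sketch contrasting an unquoted key, a key quoted with \Q...\E, and a plain index test. The helper names same_string and has_substring and the sample key 'a.b' are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration.
sub same_string   { my ($s, $key) = @_; return $s =~ /\A\Q$key\E\z/ ? 1 : 0 }
sub has_substring { my ($s, $key) = @_; return index($s, $key) >= 0 ? 1 : 0 }

my $search_key = 'a.b';    # the "." is a regex metacharacter

# Unquoted, "." matches any character, so "axb" is wrongly accepted.
print "unquoted: axb matches\n" if 'axb' =~ /\A$search_key\z/;

# Quoted with \Q...\E, only the literal text matches.
print "quoted: axb rejected\n" unless same_string('axb', $search_key);
print "quoted: a.b accepted\n" if same_string('a.b', $search_key);

# Plain substring test, no regex involved.
print "index: subject contains subj\n" if has_substring('subject', 'subj');
```

All four lines are printed. For a whole-string comparison, eq remains the simplest choice; the quoted regex is mainly useful when the key is only part of a larger pattern.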

What is wrong with this Perl code?

$value = $list[1] ~ s/\D//g;
syntax error at try1.pl line 53, near "] ~"
Execution of try1.pl aborted due to compilation errors.
I am trying to extract the digits from the second element of @list, and store it into $value.
You mean =~, not ~. ~ is a unary bitwise negation operator.
A couple of ways to do this:
($value) = $list[1] =~ /(\d+)/;
Both sets of parens are important; only if there are capturing parentheses does the match operation return actual content instead of just an indication of success, and then only in list context (provided by the list-assign operator ()=).
Or the common idiom of copy and then modify:
($value = $list[1]) =~ s/\D//g;
Maybe you wanted the =~ operator?
P.S. Note that with the original statement $value will not be assigned the resulting string (the string itself is changed in place); $value will be assigned the number of substitutions that were made.
You said in a comment that you are trying to get rid of non-digits. It looks like you are trying to preserve the old value and get the modified value in a new variable. The Perl idiom for that is:
( my $new = $old ) =~ s/\D//g;
You also wanted \d (digits), not \D (non-digits), and the s/// substitution operator is superfluous where a match makes more sense:
if ($list[1] =~ /(\d+)/) {
    $value = $1;
}
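The approaches above do not all return the same thing, which a short side-by-side sketch may make clearer. The contents of @list and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my @list = ('id', 'abc123def456');    # hypothetical data

# List-context match: captures only the first run of digits.
my ($first_run) = $list[1] =~ /(\d+)/;

# Copy-then-modify: strips every non-digit, leaving all the digits.
(my $all_digits = $list[1]) =~ s/\D//g;

# Assigning the result of s/// itself gives the substitution count.
my $copy  = $list[1];
my $count = $copy =~ s/\D//g;

print "$first_run $all_digits $count\n";    # 123 123456 6
```

Which form is appropriate depends on whether the first group of digits or every digit in the string is wanted.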