I have a little Perl script which includes a substring search as follows.
#!/usr/bin/perl
use strict;
use warnings;
my $line = "this && is || a test if && ||";
my $nb_if = findSymbols($line, "if ");
my $nb_and = findSymbols($line, "&&");
my $nb_or = findSymbols($line, "||");
print "\nThe result for this func is $nb_if=if , $nb_and=and, $nb_or=or\n";
sub findSymbols {
my $n = () = ($_[0] =~ m/$_[1]/g);
return $n;
}
It should return:
The result for this func is 1=if , 2=and, 2=or
but, instead it returns:
The result for this func is 1=if , 2=and, 30=or
I don't understand what's wrong with my code.
Use quotemeta to escape the regex metacharacters in || (and in any other string you pass to the function), so that they are matched literally:
sub findSymbols {
my $pat = quotemeta $_[1];
my $n = () = ($_[0] =~ m/$pat/g);
return $n;
}
The pipe character (|) has a special meaning in regular expressions. It means "or" (matching either the thing on its left or the thing on its right). So having a regex that consists of just two pipes is interpreted as meaning "match an empty string or an empty string or an empty string" - and that matches everywhere in your string (30 times!)
So you need to stop the pipe being interpreted as a special character and let it just represent an actual pipe character. Here are three ways to do that:
Escape the pipes with backslashes when you're creating the string that you pass to findSymbols().
# Note: I've also changed "..." to '...'
# to avoid having to double-escape
my $nb_or = findSymbols($line, '\|\|');
Use quotemeta() to automatically escape problematic characters in any string passed to findSymbols().
my $escaped_regex = quotemeta($_[1]);
my $n = () = ($_[0] =~ m/$escaped_regex/g);
Use \Q...\E to automatically escape any problematic characters used in your regex.
# Note: In this case, the \E isn't actually needed
# as it's at the end of the regex.
my $n = () = ($_[0] =~ m/\Q$_[1]\E/g);
For more detailed information on using regular expressions in Perl, see perlretut and perlre.
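As a rough side-by-side check of the options above, the sketch below counts the pipes in the question's string four ways. The helper names count_regex and count_literal are made up for this illustration and are not part of the original code.

```perl
use strict;
use warnings;

my $line = "this && is || a test if && ||";

# Hypothetical helper: count matches of a pattern supplied as a string.
sub count_regex {
    my ($text, $pattern) = @_;
    my $n = () = $text =~ /$pattern/g;
    return $n;
}

# Hypothetical helper: count occurrences of a literal string.
# \Q...\E switches off the special meaning of any metacharacters.
sub count_literal {
    my ($text, $literal) = @_;
    my $n = () = $text =~ /\Q$literal\E/g;
    return $n;
}

my $unescaped = count_regex($line, '||');            # empty alternatives match at every position
my $by_hand   = count_regex($line, '\|\|');          # caller escapes the pipes
my $by_qm     = count_regex($line, quotemeta '||');  # quotemeta does the escaping
my $by_q      = count_literal($line, '||');          # escaping happens inside the regex

print "unescaped=$unescaped by_hand=$by_hand quotemeta=$by_qm literal=$by_q\n";
```

Only the last variant keeps the escaping inside the function, which is why it (or quotemeta inside the sub) is usually the more convenient choice for callers.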
| is the alternation operator in the regular expression used by m//. You need to escape each | with a backslash to match literal |s.
my $nb_or = findSymbols($line, "\\|\\|"); # or '\|\|'
(But using quotemeta, as suggested by @toolic, is a much better idea, as it frees your caller from having to worry about details that should be part of the abstraction provided by findSymbols.)
To achieve below task I have written below C like perl program (As I am new to Perl), But I am not sure if this is the best way to achieve.
Can someone please guide?
Note: Not with the full program, But where I can make improvement.
Thanks in advance
Input :
$str = "mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4 local<mail4#mail.local>"
Expected Output :
mail1, local<mail1#mail.local>
mail2#mail.local
<mail3#mail.local>
mail4, local<mail4#mail.local>
Sample Program
my $str="mail1, \#local<mail1\#mail.local>, mail2\#mail.local, <mail3\#mail.local>, mail4, local<mail4\#mail.local>";
my ($count, $flag, $tempStr) = (0, 0, "");
my @array;
for my $c (split (//,$str)) {
if( ($count eq 0) and ($c eq ' ') ) {
next;
}
if($c) {
if( ($c eq ',') and ($flag eq 1) ) {
push @array, $tempStr;
$count=0;
$flag=0;
$tempStr="";
next;
}
if( ($c eq '>' ) or ( $c eq '#' ) ) {
$flag=1;
}
$tempStr="$tempStr$c";
$count++;
}
}
if($count>0) {
push @array, $tempStr;
}
foreach my $var (@array) {
print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1#mail.local>
"mail4, local"<mail4#mail.local>
Sample Code:
$str =~ s/([^#>]+[#>][^,]+),\s*/$1\n/g;
my @addresses = split('\n',$str);
if(scalar @addresses) {
foreach my $address (@addresses) {
if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
$address="\"$address";
$address=~ s/</\"</g;
}
}
$str = join(',',@addresses);
}
print "$str\n";
As I see, you want to replace each:
comma and following spaces,
occurring after either # or >,
with a newline.
To make such replacement, instead of writing a parsing program, you can use
a regex.
The search part can be as follows:
([^#>]+[#>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^#>]+ - A non-empty sequence of chars other than # or >.
[#>] - Either # or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4, local<mail4#mail.local>';
print "Before:\n$str\n";
$str =~ s/([^#>]+[#>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all needed commas I used g option.
Note that I put the source string in single quotes, otherwise Perl
would have complained about Possible unintended interpolation of #mail.
Edit
Your modified requirements must be handled in a different way.
An "ordinary" replacement is not an option, because now there are some
fragments to match and some fragments to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(actual mail address), enclosed in angle brackets, e.g. <mail1#mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the content of this loop is to print all these captured groups,
with required additional chars.
The code (again much shorter than yours) should look like below:
my $str = 'mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4 local<mail4#mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
print "\"$1, $2\"$3\n";
}
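If the quoted addresses are needed later rather than just printed, the same loop can collect them into an array. This is only a sketch of that variation; the array name @quoted is an assumption made for the example.

```perl
use strict;
use warnings;

my $str = 'mail1, local<mail1#mail.local>, mail2#mail.local, '
        . '<mail3#mail.local>, mail4 local<mail4#mail.local>';

# Collect each "name, name"<address> entry instead of printing it directly.
my @quoted;
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
    push @quoted, qq("$1, $2"$3);
}

print "$_\n" for @quoted;
```

Entries without a two-word description in front of the angle brackets (the second and third ones here) simply never match, so they are skipped.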
Here is an approach using split, which in this case also needs a careful regex
use warnings;
use strict;
use feature 'say';
my $string = # broken into two parts for readability
q(mail1, local<mail1#mail.local>, mail2#mail.local, )
. q(<mail3#mail.local>, mail4, local<mail4#mail.local>);
my @addresses = split /#.+?\K,\s*/, $string;
say for @addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so #.+?,
To match a pattern only when it is preceded by another brings to mind a positive lookbehind before the comma. But those can't be of arbitrary variable length, which is precisely the case here.
We can instead normally match the pattern #.+? and then use the \K form (of the lookbehind) which drops all previous matches so that they are not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, #... (what isn't consumed).
It prints
mail1, local<mail1#mail.local>
mail2#mail.local
<mail3#mail.local>
mail4, local<mail4#mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example
my @addresses = split /#.+?\K,\s*/, $string; #/ stop syntax highlight
s/(.+?,\s*.+?)</"$1"</ for @addresses;
say for @addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the index variable (or $_) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (@addresses) {
$elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.
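To make the aliasing behaviour described above concrete, here is a small sketch with made-up data. It contrasts map, which builds a new list, with a for modifier that assigns to $_ and so changes the array itself.

```perl
use strict;
use warnings;

# Made-up sample data for the illustration.
my @addresses = ('mail1, local<m1>', '<m3>');

# map returns a new list; @addresses is left alone.
my @upper = map { uc $_ } @addresses;

# Keep a copy so the "before" state can still be inspected afterwards.
my @snapshot = @addresses;

# $_ is an alias for each element, so assigning to it edits @addresses in place.
$_ = uc $_ for @addresses;

print "snapshot:  @snapshot\n";
print "addresses: @addresses\n";
```

After the last statement @addresses holds the same values as @upper, while @snapshot still shows the original elements.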
Trying to match a string with brackets.
For example:
my $foo = "debug_bus[0]";
my $bar = "debug_bus[0][12:0] = some_value;";
if ($bar =~ $foo)
{
print "Match\n";
}
else
{
print "No Match\n";
}
I would expect "Match" but I keep getting "No Match" which leads me to believe maybe brackets in '[0]' are causing an issue?
You need to properly escape ("quote") metacharacters in your regex using \Q...\E (inside) or quotemeta (outside)
Therefore, you want:
$bar =~ m/\Q$foo\E/;
Or just:
$bar =~ /\Q$foo/;
You can omit the m when the delimiters are //, and you don't really need \E in this case because there's nothing else in your pattern.
Replace
my $foo = "debug_bus[0]";
With
my $foo = quotemeta "debug_bus[0]";
From quotemeta documentation :
quotemeta EXPR
Returns the value of EXPR with all the ASCII non-"word" characters backslashed.
Without using quotemeta, [0] is interpreted as a bracketed character class, containing only 0, and then is equivalent to just 0.
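The character-class behaviour can be checked directly. The sketch below reuses $foo and $bar from the question; the extra test string "debug_bus0" is added here only to show what the unescaped pattern really matches.

```perl
use strict;
use warnings;

my $foo = "debug_bus[0]";
my $bar = "debug_bus[0][12:0] = some_value;";

# Unescaped, [0] is a character class, so the pattern means "debug_bus0".
my $as_regex   = $bar =~ /$foo/          ? "match" : "no match";
my $class_hit  = "debug_bus0" =~ /$foo/  ? "match" : "no match";

# With \Q...\E the brackets are matched literally.
my $as_literal = $bar =~ /\Q$foo\E/      ? "match" : "no match";

print "unescaped vs \$bar:       $as_regex\n";
print "unescaped vs debug_bus0: $class_hit\n";
print "quoted vs \$bar:          $as_literal\n";
```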
E.g:
$myVar="this###!~`%^&*()[]}{;'".,<>?/\";
I am not able to export this variable and use it as it is in my program.
Use q to store the characters, then use quotemeta to escape all of them:
my $myVar=q("this###!~`%^&*()[]}{;'".,<>?/\");
$myVar = quotemeta($myVar);
print $myVar;
Or else use a regex substitution to escape all the non-word characters:
my $myVar=q("this###!~`%^&*()[]}{;'".,<>?/\");
$myVar =~s/(\W)/\\$1/g;
print $myVar;
This is what quotemeta is for, if I understand your quest
Returns the value of EXPR with all non-"word" characters backslashed. (That is, all characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.) This is the internal function implementing the \Q escape in double-quoted strings.
Its use is very simple
my $myVar = q(this###!~`%^&*()[]}{;'".,<>?/\\);
print "$myVar\n";
my $quoted_var = quotemeta $myVar;
print "$quoted_var\n";
Note that we must manually escape the last backslash, to prevent it from escaping the closing delimiter. Or you can tack on an extra space at the end, and then strip it (by chop).
my $myVar = q(this###!~`%^&*()[]}{;'".,<>?/\ );
chop $myVar;
Now transform $myVar like above, using quotemeta.
I take the outside pair of " to merely indicate what you'd like in the variable. But if they are in fact meant to be in the variable then simply put it all inside q(), since then the last character is ". The only problem is a backslash immediately preceding the closing delimiter.
If you need this in a regex context then you use \Q to start and \E to end escaping.
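A quick sanity check of these ideas, using a made-up ASCII string rather than the one from the question: quotemeta and the s/(\W)/\\$1/g substitution should give the same result, and the quoted form should match the raw text when used as a pattern.

```perl
use strict;
use warnings;

# Hypothetical sample text containing several regex metacharacters.
my $raw = q{price: $5.00 (approx.) [a+b]};

my $quoted = quotemeta $raw;

# The manual substitution mentioned above, applied to a copy.
(my $by_hand = $raw) =~ s/(\W)/\\$1/g;

# \Q...\E inside a pattern, and the pre-quoted string used as a pattern.
my $matches_itself = $raw =~ /\A\Q$raw\E\z/ ? 1 : 0;
my $found_inside   = "total price: \$5.00 (approx.) [a+b] today" =~ /$quoted/ ? 1 : 0;

print "$quoted\n";
```

The two escaping methods agree here because the string is plain ASCII; for other input, quotemeta is the safer choice since it is what \Q uses internally.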
Giving Thanks to:
What's between \Q and \E is treated as normal characters, not regexp characters. For example,
'.' =~ /./; # match
'a' =~ /./; # match
'.' =~ /\Q.\E/; # match
'a' =~ /\Q.\E/; # no match
It doesn't stop variables from being interpolated.
$search = '.';
'.' =~ /$search/; # match
'a' =~ /$search/; # match
'.' =~ /\Q$search\E/; # match
'a' =~ /\Q$search\E/; # no match
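The \E only becomes necessary when more regex follows the quoted part. As a small sketch (the version-like strings and the variable $prefix are invented for this example), a literal prefix can be combined with an ordinary \d+:

```perl
use strict;
use warnings;

my $prefix = 'v1.';
my @candidates = qw(v1.42 v1x42 v1.x v1.7);

# Quoted prefix: the dot must be a literal dot, then one or more digits.
my @quoted_hits = grep { /\A\Q$prefix\E\d+\z/ } @candidates;

# Unquoted prefix: the dot matches any character, so v1x42 slips through.
my @plain_hits = grep { /\A$prefix\d+\z/ } @candidates;

print "quoted:   @quoted_hits\n";
print "unquoted: @plain_hits\n";
```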
I'm using Perl 5.16.2 to try to count the number of occurrences of a particular delimiter in the $_ string. The delimiter is passed to my Perl program via the @ARGV array. I verify that it is correct within the program. My instruction to count the number of delimiters in the string is:
$dlm_count = tr/$dlm//;
If I hardcode the delimiter, e.g. $dlm_count = tr/,//; the count comes out correctly. But when I use the variable $dlm, the count is wrong. I modified the instruction to say
$dlm_count = tr/$dlm/\t/;
and realized from how the tabs were inserted in the string that the operation was substituting every instance of any of the four characters "$", "d", "l", or "m" to \t — i.e. any of the four characters that made up my variable name $dlm.
Here is a sample program that illustrates the problem:
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm = ",";
my $dlm_count = tr/$dlm/\t/;
print "The count is $dlm_count\n";
print "The modified string is $_\n";
There are only two commas in the $_ string, but this program prints the following:
The count is 3
The modified string is abc efghij,k ,nopqrstuvwxyz
Why is the $dlm token being treated as a literal string of four characters instead of as a variable name?
You cannot use tr that way, it doesn't interpolate variables. It runs strictly character by character replacement. So this
$string =~ tr/a$v/123/
is going to replace every a with 1, every $ with 2, and every v with 3. It is not a regex but a transliteration. From perlop
Because the transliteration table is built at compile time, neither the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote interpolation. That means that if you want to use variables, you must use an eval():
eval "tr/$oldlist/$newlist/";
die $@ if $@;
eval "tr/$oldlist/$newlist/, 1" or die $@;
The above example from the docs hints at how to count. To count the occurrences of $dlm in $string:
$dlm_count = eval "\$string =~ tr/$dlm//";
The $string is escaped so that it is not interpolated before it gets to eval. In your case
$dlm_count = eval "tr/$dlm//";
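The eval approach can be wrapped in a small function. The following is only a sketch: the name count_chars is invented here, and the \Q...\E around the character list is an extra precaution (not part of the snippets above) so that characters such as / or - cannot break the generated tr expression.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many characters of $string appear in $chars.
sub count_chars {
    my ($string, $chars) = @_;
    # \$string stays a variable name inside the eval'd code;
    # $chars is interpolated now, with metacharacters backslashed.
    my $count = eval "\$string =~ tr/\Q$chars\E//";
    die $@ if $@;
    return $count;
}

my $commas  = count_chars("abcdefghij,klm,nopqrstuvwxyz", ",");
my $hyphens = count_chars("a-b-c", "-");

print "commas=$commas hyphens=$hyphens\n";
```

Because the code is compiled on every call, this is best kept out of tight loops, or replaced by one of the eval-free alternatives.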
You can also use tools other than tr (or regex). For example, with string being in $_
my $dlm_count = grep { /$dlm/ } split //;
When split breaks $_ by the pattern that is empty string (//) it returns the list of all characters in it. Then the grep block tests each against $dlm so returning the list of as many $dlm characters as there were in $_. Since this is assigned to a scalar, $dlm_count is set to the length of that list, which is the count of all $dlm.
In the section of the docs on perlop 'Quote Like Operators', it states:
Because the transliteration table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST are subjected to double quote
interpolation. That means that if you want to use variables, you must
use an eval():
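A short sketch of what that lack of interpolation looks like in practice (the sample strings are made up): the characters $ and v in the search list are taken literally, and an empty replacement list simply counts.

```perl
use strict;
use warnings;

my $text = 'a$v';

# tr maps position by position: a -> 1, $ -> 2, v -> 3. No variable is looked up.
(my $mapped = $text) =~ tr/a$v/123/;

# With an empty replacement list nothing changes; tr just returns the count.
my $csv    = 'x,y,z';
my $commas = ($csv =~ tr/,//);

print "mapped=$mapped commas=$commas csv=$csv\n";
```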
As documented and as you discovered, tr/// doesn't interpolate. The simple solution is to use s/// instead.
my $dlm = ",";
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm_count = s/\Q$dlm/\t/g;
If the transliteration is being performed in a loop, the following might speed things up noticeably:
my $dlm = ",";
my $tr = eval "sub { tr/\Q$dlm\E/\\t/ }";
for (...) {
my $dlm_count = $tr->();
...
}
Although several answers have hinted at the eval() idiom for tr///, none has the form that covers cases where the string contains characters with special meaning in tr syntax, e.g. - (hyphen):
$_ = "abcdefghij,klm,nopqrstuvwxyz";
my $dlm = ",";
my $dlm_count = eval sprintf "tr/%s/%s/", map quotemeta, $dlm, "\t";
But as others have noted, there are lots of ways to count characters in Perl that avoid eval(), here's another:
my $dlm_count = () = m/$dlm/go;
If I had:
$foo= "12."bar bar bar"|three";
how would I insert in the text ".." after the text 12. in the variable?
Perl allows you to choose your own quote delimiters. If you find you need to use a double quote inside of an interpolating string (i.e. "") or a single quote inside of a non-interpolating string (i.e. ''), you can use a quote operator to specify a different character to act as the delimiter for the string. Delimiters come in two forms: bracketed and unbracketed. Bracketed delimiters have different beginning and ending characters: [], {}, (), and <>. All other characters* are available as unbracketed delimiters.
So your example could be written as
$foo = qq(12."bar bar bar"|three);
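To illustrate the delimiter choices described above, here is a small sketch; the variable $name is introduced only so that the difference between interpolating (qq) and non-interpolating (q) forms is visible.

```perl
use strict;
use warnings;

my $name = 'bar';

# Interpolating forms: all four build the same string.
my $escaped   = "12.\"$name\"|three";   # backslash-escaped double quotes
my $qq_parens = qq(12."$name"|three);   # bracketed delimiter
my $qq_braces = qq{12."$name"|three};   # another bracketed delimiter
my $qq_bang   = qq!12."$name"|three!;   # unbracketed delimiter

# Non-interpolating form: $name stays as literal text.
my $literal   = q(12."$name"|three);

print "$qq_parens\n$literal\n";
```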
Inserting text after "12." can be done many ways (TIMTOWTDI). A common solution is to use a substitution to match the text you want to replace.
$foo =~ s/^(12[.])/$1../;
The ^ means match at the start of the string, the () means capture this text to the variable $1, the 12 just matches the string "12", and the [] means match any one of the characters inside the brackets. The brackets are being used because . has special meaning in regexes in general, but not inside a character class (the []). An alternative to the character class is to escape the special meaning of . with \, but many people find that to be uglier than the character class.
$foo =~ s/^(12\.)/$1../;
Another way to insert text into a string is to assign the value to a call to substr. This highlights one of Perl's fairly unique features: many of its functions can act as lvalues. That is they can be treated like variables.
substr($foo, 3, 0) = "..";
If you did not already know where "12." exists in the string you could use index to find where it starts, length to find out how long "12." is, and then use that information with substr.
Here is a fully functional Perl script that contains the code above.
#!/usr/bin/perl
use strict;
use warnings;
my $foo = my $bar = qq(12."bar bar bar"|three);
$foo =~ s/(12[.])/$1../;
my $i = index($bar, "12.") + length "12.";
substr($bar, $i, 0) = "..";
print "foo is $foo\nbar is $bar\n";
* That is, all characters except whitespace characters (space, tab, carriage return, line feed, vertical tab, and form feed).
If you want to use double quotes in a string in Perl you have two main options:
$foo = "12.\"bar bar bar\"|three";
or:
$foo = '12."bar bar bar"|three';
The first option escapes the quotes inside the string with backslash.
The second option uses single quotes. This means the double quotes are treated as part of the string. However, in single quotes everything is literal so $var or @array isn't treated as a variable. For example:
$myvar = 123;
$mystring = '"$myvar"';
print $mystring;
> "$myvar"
But:
$myvar = 123;
$mystring = "\"$myvar\"";
print $mystring;
> "123"
There are also a large number of other Quote-like Operators you could use instead.
$foo = "12.\"bar bar bar\"|three";
$foo =~s/12\./12\.\.\./;
print $foo; # prints 12..."bar bar bar"|three