What is wrong with this Perl code?

$value = $list[1] ~ s/\D//g;
syntax error at try1.pl line 53, near "] ~"
Execution of try1.pl aborted due to compilation errors.
I am trying to extract the digits from the second element of @list and store them in $value.

You mean =~, not ~. ~ is a unary bitwise negation operator.
A couple of ways to do this:
($value) = $list[1] =~ /(\d+)/;
Both sets of parens are important; only if there are capturing parentheses does the match operation return actual content instead of just an indication of success, and then only in list context (provided by the list-assign operator ()=).
Or the common idiom of copy and then modify:
($value = $list[1]) =~ s/\D//g;
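Both approaches can be compared side by side in a minimal sketch. The contents of @list and the variable names $digits, $matched and $stripped are made up for illustration; the substitution uses /g so that every non-digit is removed.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the asker's @list.
my @list = ('first', 'abc 123 def');

# List assignment puts the match in list context, so the capture is returned.
my ($digits) = $list[1] =~ /(\d+)/;

# Scalar assignment only reports whether the match succeeded.
my $matched = $list[1] =~ /(\d+)/;

# Copy first, then strip every non-digit from the copy.
(my $stripped = $list[1]) =~ s/\D//g;

print "$digits $matched $stripped\n";
```

In both cases $list[1] itself is left unchanged.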

Maybe you wanted the =~ operator?
P.S. Note that with $value = $list[1] =~ s/\D//g, $value is not assigned the resulting string: the substitution changes $list[1] in place, and $value is assigned the number of substitutions that were made.
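A short sketch of that return value, using a made-up input string and the hypothetical names $text and $count:

```perl
use strict;
use warnings;

my $text = 'a1b22c';                 # hypothetical input

# s///g returns how many substitutions it made and edits $text in place.
my $count = ($text =~ s/\D//g);

print "count=$count text=$text\n";
```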

You said in a comment that you are trying to get rid of non-digits. It looks like you are trying to preserve the old value and get the modified value in a new variable. The Perl idiom for that is:
( my $new = $old ) =~ s/\D//g;
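As an alternative sketch, assuming Perl 5.14 or later is available, the /r modifier makes the substitution return the modified copy and leave the original alone. The sample value of $old is made up.

```perl
use v5.14;
use warnings;

my $old = 'tel: 555-0100';           # hypothetical input

# /r returns the changed string instead of modifying $old.
my $new = $old =~ s/\D//gr;

print "$old -> $new\n";
```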

And wanted \digits not non-\Digits. And have a superfluous s/ubstitute operator where a match makes more sense.
if ($list[1] =~ /(\d+)/) {
$value = $1;
}

Related

Best way to parse string in perl

To achieve the task below I have written the C-like Perl program below (I am new to Perl), but I am not sure this is the best way to do it.
Can someone please guide me?
Note: I am not asking for a full program, only for pointers on where I can improve mine.
Thanks in advance.
Input :
$str = "mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>"
Expected Output :
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
Sample Program
my $str = "mail1, \@local<mail1\@mail.local>, mail2\@mail.local, <mail3\@mail.local>, mail4, local<mail4\@mail.local>";
my ($count, $flag, $tempStr) = (0, 0, "");
my @array;
for my $c (split(//, $str)) {
    # Skip spaces at the start of a record.
    if (($count == 0) and ($c eq ' ')) {
        next;
    }
    # A comma seen after '@' or '>' ends the current record.
    if (($c eq ',') and ($flag == 1)) {
        push @array, $tempStr;
        $count = 0;
        $flag = 0;
        $tempStr = "";
        next;
    }
    if (($c eq '>') or ($c eq '@')) {
        $flag = 1;
    }
    $tempStr = "$tempStr$c";
    $count++;
}
if ($count > 0) {
    push @array, $tempStr;
}
foreach my $var (@array) {
    print "$var\n";
}
Edit:
Input:
The input is the output of the code above.
Expected Output :
"mail1, local"<mail1@mail.local>
"mail4, local"<mail4@mail.local>
Sample Code:
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
my @addresses = split('\n',$str);
if(scalar @addresses) {
    foreach my $address (@addresses) {
        if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
            $address="\"$address";
            $address=~ s/</\"</g;
        }
    }
    $str = join(',',@addresses);
}
print "$str\n";
As I see it, you want to replace each:
comma and following spaces,
occurring after either @ or >,
with a newline.
To make such a replacement, you can use a regex instead of writing a parsing program.
The search part can be as follows:
([^@>]+[@>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^@>]+ - A non-empty sequence of chars other than @ or >.
[@>] - Either @ or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and an optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4, local<mail4@mail.local>';
print "Before:\n$str\n";
$str =~ s/([^@>]+[@>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all the needed commas I used the g option.
Note that I put the source string in single quotes; otherwise Perl
would have complained about "Possible unintended interpolation of @mail".
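If the records are wanted as a list rather than as one newline-separated string, the same pattern can be used in a global match in list context. This is only a sketch: the (?:,\s*|$) tail and the @records name are additions for illustration, not part of the answer above.

```perl
use strict;
use warnings;

my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, '
        . '<mail3@mail.local>, mail4, local<mail4@mail.local>';

# In list context a /g match returns every capture of group 1.
# The tail accepts either a separating comma or the end of the string.
my @records = $str =~ /([^@>]+[@>][^,]+)(?:,\s*|$)/g;

print "$_\n" for @records;
```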
Edit
Your modified requirements must be handled in a different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some fragments to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(the actual mail address), enclosed in angle brackets, e.g. <mail1@mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the body of this loop prints all these captured groups,
with the required additional chars.
The code (again much shorter than yours) should look like below:
my $str = 'mail1, local<mail1@mail.local>, mail2@mail.local, <mail3@mail.local>, mail4 local<mail4@mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
    print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex:
use warnings;
use strict;
use feature 'say';
my $string = # broken into two parts for readability
    q(mail1, local<mail1@mail.local>, mail2@mail.local, )
    . q(<mail3@mail.local>, mail4, local<mail4@mail.local>);
my @addresses = split /@.+?\K,\s*/, $string;
say for @addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so @.+?,
Matching a pattern only when it is preceded by another brings to mind a positive lookbehind before the comma. But lookbehinds can't be of arbitrary variable length, which is precisely the case here.
We can instead match the pattern @.+? normally and then use \K (a form of lookbehind), which drops everything matched so far from the match so that it is not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, @... (which isn't consumed).
It prints
mail1, local<mail1@mail.local>
mail2@mail.local
<mail3@mail.local>
mail4, local<mail4@mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once the addresses have been parsed out of the string as above. For example
my @addresses = split /@.+?\K,\s*/, $string; #/ stop syntax highlight
s/(.+?,\s*.+?)</"$1"</ for @addresses;
say for @addresses;
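The effect of \K in the split pattern may be easier to see on a smaller string. The input and the @parts name below are invented for illustration: only the comma and spaces after \K count as the delimiter, so the address text in front of it stays in the field, and a comma that does not follow an address is not a split point.

```perl
use strict;
use warnings;

# Hypothetical input: "b, c@y.org" contains a comma that must not split.
my $line = 'a@x.org, b, c@y.org, d@z.org';

# Everything matched before \K is kept; only ",\s*" is removed.
my @parts = split /@.+?\K,\s*/, $line;

print "[$_]\n" for @parts;
```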
The regex in a loop is one way to change elements of an array. I use it for its efficiency (it changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the loop variable (or $_) is an alias for the currently processed element, so changing it changes that element. This is a known source of bugs when it happens unknowingly, which was another reason to show it in the above form.
The statement also uses a statement modifier, and it is equivalent to
foreach my $elem (@addresses) {
    $elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it, but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.
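When the original array must survive, one option is to run the aliasing loop over a copy. A minimal sketch with made-up data (the names @original and @quoted are hypothetical):

```perl
use strict;
use warnings;

my @original = ('a, b<x>', 'c<y>');   # hypothetical records

# Copy first, so the in-place edits below touch only @quoted.
my @quoted = @original;

# $_ is an alias for each element of @quoted, so s/// changes the array.
s/(.+?,\s*.+?)</"$1"</ for @quoted;

print "$_\n" for @quoted;
```

The second record has no comma before <, so the pattern does not match and it is left as it was.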

Using binding operator in perl

I am working on a program in Perl and I am trying to combine more than one regex in a binding operator. I have tried using the syntax below but it's not working. I would like to know if there is any other way to go about this.
$in =~ (s/pattern/replacement/)||(s/pattern/replacement/)||...
You can often get a clue about what Perl makes of some code using B::Deparse.
$ perl -MO=Deparse -E'$in =~ (s/pattern1/replacement1/)||(s/pattern2/replacement2/)'
[ ... snip ... ]
s/pattern2/replacement2/u unless $in =~ s/pattern1/replacement1/u;
-e syntax OK
So it's attempting your first substitution on $in. And if that fails, it is then trying your second substitution. But it's not using $in for the second substitution, it's using $_ instead.
You're running up against precedence issues here. Perl interprets your code as:
($in =~ s/pattern1/replacement1/) or (s/pattern2/replacement2/)
Notice that the opening parenthesis has moved before $in.
As others have pointed out, it's best to use a loop approach here. But I thought it might be useful to explain why your version didn't work.
Update: To be clear, if you wanted to use syntax like this, then you would need:
($in =~ s/pattern1/replacement1/) or
($in =~ s/pattern2/replacement2/);
Note that I've included $in =~ in each expression. At this point, it becomes obvious (I hope) why the looping solution is better.
However, because or is a short-circuiting operator, this statement will stop after the first successful substitution. I assumed that's what you wanted from your use of it in your original code. If that's not what you want, then you need to either switch to using and or (better, in my opinion) break them out into separate statements.
$in =~ s/pattern1/replacement1/;
$in =~ s/pattern2/replacement2/;
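The difference between the two forms shows up on a string where both patterns would match. A small sketch with invented data ($first and $both are hypothetical names):

```perl
use strict;
use warnings;

# Chained with "or": the second substitution is skipped once the first succeeds.
my $first = 'one two';
($first =~ s/one/ONE/) or ($first =~ s/two/TWO/);

# Separate statements: both substitutions are always attempted.
my $both = 'one two';
$both =~ s/one/ONE/;
$both =~ s/two/TWO/;

print "$first | $both\n";
```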
The closest you could get with a syntax looking similar to that would be
s/one/ONE/ or
s/two/TWO/ or
...
s/ten/TEN/ for $str;
This will attempt each substitution in turn, once only, stopping after the first successful one.
Use for to "topicalize" (alias $_ to your variable).
for ($in) {
s/pattern/replacement/;
s/pattern/replacement/;
}
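A concrete run of that idiom, with a made-up string and patterns: inside the block every bare s/// acts on $_, which is aliased to $in, so $in itself is modified.

```perl
use strict;
use warnings;

my $in = 'cat and dog';              # hypothetical input

for ($in) {
    s/cat/CAT/;                      # acts on $_, i.e. on $in
    s/dog/DOG/;
}

print "$in\n";
```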
A simpler way might be to create an array of all such patterns and replacements, then simply iterate through your array applying the substitution one pattern at a time.
my $in = "some string you want to modify";
my @patterns = (
    ['pattern to match', 'replacement string'],
    # ...
);
$in = replace_many($in, \@patterns);
sub replace_many {
    my ($in, $replacements) = @_;
    foreach my $replacement ( @$replacements ) {
        my ($pattern, $replace_string) = @$replacement;
        $in =~ s/$pattern/$replace_string/;
    }
    return $in;
}
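A variant of the same idea with patterns precompiled by qr//, shown here as a sketch only. The rule table, the apply_rules name and the sample text are all hypothetical. Note that the replacement is inserted as a plain string, so a '$1' stored in it would not be expanded.

```perl
use strict;
use warnings;

# Hypothetical rule table: each entry is [compiled pattern, replacement text].
my @rules = (
    [ qr/colou?r/i, 'hue'  ],
    [ qr/\bgrey\b/, 'gray' ],
);

sub apply_rules {
    my ($text, $rules) = @_;
    for my $rule (@$rules) {
        my ($pattern, $replacement) = @$rule;
        $text =~ s/$pattern/$replacement/g;   # every rule is applied in turn
    }
    return $text;
}

my $result = apply_rules('Colour me grey', \@rules);
print "$result\n";
```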
It's not at all clear what you need, and it's not at all clear that you can accomplish what you appear to want by the means you suggest. The OR operator is a short circuit operator, and you may not want this behavior. Please give an example of the input you expect, and the output you desire, hopefully several examples of each. Meanwhile, here is a test script.
use warnings;
use strict;
my $in1 = 'George Walker Bush';
my $in2 = 'George Walker Bush';
my $in3 = 'George Walker Bush';
my $in4 = 'George Walker Bush';
(my $out1 = $in1) =~ s/e/*/g;
print "out1 = $out1 \n";
(my $out2 = $in2) =~ s/Bush/Obama/;
print "out2 = $out2 \n";
(my $out3 = $in3) =~ s/(George)|(Bush)/Obama/g;
print "out3 = $out3\n";
$in4 =~ /(George)|(Walker)|(Bush)/g;
print "$1 - $2 - $3\n";
exit(0);
You will notice in the last case that only the first alternative in the regular expression matches. If you wanted to replace 'George Walker Bush' with 'Barack Hussein Obama', you could do that easily enough, but you would also replace 'George Washington' with 'Barack Washington' - is this what you want? Here is the output of the script:
out1 = G*org* Walk*r Bush
out2 = George Walker Obama
out3 = Obama Walker Obama
Use of uninitialized value $2 in concatenation (.) or string at pq_151111a.plx line 19.
Use of uninitialized value $3 in concatenation (.) or string at pq_151111a.plx line 19.
George - -

conditional substitution using hashes

I'm trying for substitution in which a condition will allow or disallow substitution.
I have a string
$string = "There is <tag1>you can do for it. that dosen't mean <tag2>you are fool.There <tag3>you got it.";
Here are two hashes which are used to check condition.
my %tag = ('tag1' => '<you>', 'tag2'=>'<do>', 'tag3'=>'<no>');
my %data = ('you'=>'<you>');
Here is actual substitution in which substitution is allowed for hash tag values not matched.
$string =~ s{(?<=<(.*?)>)(you)}{
if($tag{"$1"} eq $data{"$2"}){next;}
"I"
}sixe;
In this code I want to substitute 'you' with something, on the condition that it is not equal to the hash value given in %tag.
Can I use next in a substitution?
The problem is that I can't use the /g modifier, and after using next I can't go on to the next substitution.
Also, I can't modify the expression while matching, and with next it doesn't go on to the second match; it stops there.
You can't use a variable length look behind assertion. The only one that is allowed is the special \K marker.
With that in mind, one way to perform this test is the following:
use strict;
use warnings;
while (my $string = <DATA>) {
$string =~ s{<([^>]*)>\K(?!\1)\w+}{I}s;
print $string;
}
__DATA__
There is <you>you can do for it. that dosen't mean <notyou>you are fool.
There is <you>you can do for it. that dosen't mean <do>you are fool.There <no>you got it.
Output:
There is <you>you can do for it. that dosen't mean <notyou>I are fool.
There is <you>you can do for it. that dosen't mean <do>I are fool.There <no>you got it.
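If the decision has to come from the two hashes in the question, one possible sketch is to let an /e replacement return the matched word unchanged instead of calling next. This is an assumption about the intent, not the method of the answer above; the shortened sample string is made up, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my %tag  = (tag1 => '<you>', tag2 => '<do>', tag3 => '<no>');
my %data = (you => '<you>');

# Hypothetical shortened input.
my $string = '<tag1>you stay, <tag2>you change, <tag3>you change';

# \K keeps the tag itself out of the replaced text; $1 is still available.
# Returning 'you' leaves the text as it was, which plays the role of "next".
$string =~ s{<([^>]*)>\Kyou}{
    ($tag{$1} // '') eq $data{you} ? 'you' : 'I'
}ge;

print "$string\n";
```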
It was simple, but it took me two days to think it through. I just wrote another substitution that ignores the tag previously skipped by next:
$string = "There is <tag1>you can do for it. that dosen't mean <tag2>you are fool.There <tag3>you got it.";
my %tag = ('tag1' => '<you>', 'tag2'=>'<do>', 'tag3'=>'<no>');
my %data = ('you'=>'<you>');
my $notag;
$string =~ s{(?<=<(.*?)>)(you)}{
$notag = $2;
if($tag{"$1"} eq $data{"$2"}){next;}
"I"
}sie;
$string =~ s{(?<=<(.*?)>)(?<!$notag)(you)}{
"I"
}sie;

grep keyword with if-condition

I have an input file with
words;
yadda yadda;
keyword 123;
yadda;
and I want to simply get the value 123 saved as a variable. I tried a solution from here:
my $var;
open(FILE,$data.dat) or die "error on opening $data: $!\n";
while (my $line = <FILE>) {
if (/^keyword/) {
$var = $1;
print $line;
last;
}
}
close(FILE);
This isn't working and gives me following error: Use of uninitialized value $_ in pattern match (m//) at ./script.pl line 91, <FILE> line 384. (this occurs for all lines of <FILE>)
I found another solution without the if-condition, which just states @string = sort grep /^keyword/,<FILE>; and works. Can you please explain to me what is happening here?
Edit:
Thanks for the answers and explanations! What do you think is the better/more elegant way: the grep or the if-condition?
You need the following change:
if ($line =~ m/^keyword\s+(\d+)/)
Explanation: You read each line into $line, so $_, which is the default target of a match, is never set and stays undefined.
In addition, you'd get another error with $1, because your pattern did not specify a capturing group.
$1 refers to the first capture group, but your regex doesn't contain any capture groups, so it's undefined. Try
if ($line =~ /^keyword\s+(-?(?:\d+|\d*\.\d*)(?:[Ee]-?(?:\d+|\d*\.\d*))?)/) {
Notice also that the regex is being applied to the variable containing the line you just read.
Edit: Updated to cope with numbers in scientific notation. This is a significant additional requirement which you should have specified explicitly in the first place.
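Putting the pieces together, here is a self-contained sketch of both styles asked about. It reads the sample data from an in-memory filehandle so that it runs without a file; the names $text, $fh, $var and $via_map are chosen for illustration, and the pattern is the simple integer one.

```perl
use strict;
use warnings;

# Sample input from the question, held in a string instead of a file.
my $text = "words;\nyadda yadda;\nkeyword 123;\nyadda;\n";

# Loop style: match against $line explicitly and capture the number.
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
my $var;
while (my $line = <$fh>) {
    if ($line =~ /^keyword\s+(\d+)/) {
        $var = $1;
        last;
    }
}
close $fh;

# List style: in list context a match returns its captures (or nothing),
# and inside map the pattern is applied to $_ for each line.
my ($via_map) = map { /^keyword\s+(\d+)/ } split /\n/, $text;

print "$var $via_map\n";
```

The loop stops at the first hit, while the list form looks at every line, so the loop is the cheaper choice on large files.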

Pattern matching consecutive characters

I have a list of strings that I would like to search through and ignore any that contain A or G characters that occur more than 4 times consecutively. For instance, I would like to ignore strings such as TCAAAATC or GCTGGGGAA.
I've tried:
unless ($string =~ m/A{4,}?/g || m/G{4,}?/g)
{
Do something;
}
But I get an error message "Use of uninitialized value in pattern match (m//)".
Any suggestions would be appreciated.
By writing
|| m/G{4,}?/g
you are implicitly testing $_ against this regex. But, $_ is not initialized, so you get an error.
Write
unless ($string =~ m/A{4}/ || $string =~ m/G{4}/)
instead (note the simplifications made to the regex), or, as a single expression,
unless ($string =~ m/A{4}|G{4}/)
You need to avoid the implicit comparison with $_, which you can do by writing:
unless ($string =~ m/A{4}/ || $string =~ m/G{4}/)
This looks for a run of 4 A's or a run of 4 G's anywhere in the string; once 4 in a row are found, it doesn't matter whether more follow.
You can reduce it to a single regular expression by using:
unless ($string =~ m/([AG])\1{3}/)
which looks for an A or G followed by 3 more of the same character.
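Applied to a whole list, the backreference pattern can be used as a grep filter. A minimal sketch; the last two sample strings and the @kept name are made up.

```perl
use strict;
use warnings;

my @strings = qw(TCAAAATC GCTGGGGAA ACGTACGT AAAGGGTT);

# Keep only the strings without a run of four identical A's or G's.
my @kept = grep { !/([AG])\1{3}/ } @strings;

print "$_\n" for @kept;
```

AAAGGGTT is kept because neither of its runs reaches four characters.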