Perl: Greedy nature refuses to work - perl

I am trying to replace a string with another string, but the greedy nature doesn't seem to be working for me. Below is my code where "PERFORM GET-APLCY" is identified and replaced properly, but string "PERFORM GET-APLCY-SOI-CVG-WVR" and many other such strings are being replaced by the the replacement string for "PERFORM GET-APLCY".
s/PERFORM $func[$i]\.*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;
where the full stop is optional during string match and replacement. I have also tried giving the pattern to be matched as $func[$i]\b
Please help me understand what the issue could be.
Thanks in advance,
Faez

Why GET-APLCY- should not match GET-APLCY., if the dot is optional?
Easy solution: sort your array by length in descending order.
#func = sort { length $b <=> length $a } #func
Testing script:
#!/usr/bin/perl
use warnings;
use strict;
use feature 'say';
my %hash = ('GET-APLCY' => 'REP1',
'GET-APLCY-SOI-CVG-WVR' => 'REP2',
'GET-APLCY-SOI-MNG-CVRW' => 'REP3',
);
my #func = sort { length $b <=> length $a } keys %hash;
while (<DATA>) {
chomp;
print;
print "\t -> \t";
for my $i (0 .. $#func) {
s/$func[$i]/$hash{$func[$i]}/;
}
say;
}
__DATA__
GET-APLCY param
GET-APLCY- param
GET-APLCY. param
GET-APLCY-SOI. param
GET-APLCY-SOI-CVG-WVR param
GET-APLCY-SOI-MNG-CVRW param

You appear to be looping over function names, and calling s/// for each one. An alternative is to use the e option, and do them all in one go (without a loop):
my %hash = (
'GET-APLCY' => 'replacement 1',
'GET-APLCY-SOI-CVG-WVR' => 'replacement 2',
);
s{
PERFORM \s+ # 'PERFORM' keyword
([A-Z-]+) # the original function name
\.? # an optional period
}{
"# PERFORM $1.\n" . $hash{$1};
}xmsge;
The e causes the replacement part to be evaluated as an expression. Basically, the first part finds all PERFORM calls (I'm assuming that the function names are all upper case with '-' between them – adjust otherwise). The second part replaces that line with the text you want to appear.
I've also used the x, m, and s options, which is what allows the comments in the regular expression, among other things. You can find more about these under perldoc perlop.
A plain version of the s-line should be:
s/PERFORM ([A-Z-]+)\.?/"# PERFORM $1.\n" . $hash{$1}/eg;

I guess that $func[$i] contains "GET-APLCY". If so, this is because the star only applies to the dot, an actual dot, not "any character". Try
s/PERFORM $func[$i].*/# PERFORM $func[$i]\.\n $hash{$func[$i]}/g;

I'm pretty sure you trying to do some kind of loop for $i. And in that case most likely
GET-APLCY is located in #func array before GET-APLCY-SOI-CVG-WVR. So I recommend to reverse sort #func before entering loop.

Related

Best way to parse string in perl

To achieve below task I have written below C like perl program (As I am new to Perl), But I am not sure if this is the best way to achieve.
Can someone please guide?
Note: Not with the full program, But where I can make improvement.
Thanks in advance
Input :
$str = "mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4 local<mail4#mail.local>"
Expected Output :
mail1, local<mail1#mail.local>
mail2#mail.local
<mail3#mail.local>
mail4, local<mail4#mail.local>
Sample Program
my $str="mail1, \#local<mail1\#mail.local>, mail2\#mail.local, <mail3\#mail.local>, mail4, local<mail4\#mail.local>";
my $count=0, #array, $flag=0, $tempStr="";
for my $c (split (//,$str)) {
if( ($count eq 0) and ($c eq ' ') ) {
next;
}
if($c) {
if( ($c eq ',') and ($flag eq 1) ) {
push #array, $tempStr;
$count=0;
$flag1=0;
$tempStr="";
next;
}
if( ($c eq '>' ) or ( $c eq '#' ) ) {
$flag=1;
}
$tempStr="$tempStr$c";
$count++;
}
}
if($count>0) {
push #array, $tempStr;
}
foreach my $var (#array) {
print "$var\n";
}
Edit:
Input:
Input is the output of above code.
Expected Output :
"mail1, local"<mail1#mail.local>
"mail4, local"<mail4#mail.local>
Sample Code:
$str =~ s/([^#>]+[#>][^,]+),\s*/$1\n/g;
my #addresses = split('\n',$str);
if(scalar #addresses) {
foreach my $address (#addresses) {
if (($address =~ /</) and ($address !~ /\"/) and ($address !~ /^</)){
$address="\"$address";
$address=~ s/</\"</g;
}
}
$str = join(',',#addresses);
}
print "$str\n";
As I see, you want to replace each:
comma and following spaces,
occurring after either # or >,
with a newline.
To make such replacement, instead of writing a parsing program, you can use
a regex.
The search part can be as follows:
([^#>]+[#>][^,]+),\s*
Details:
( - Start of the 1st capturing group.
[^#>]+ - A non-empty sequence of chars other than # or >.
[#>] - Either # or >.
[^,]+ - A non-empty sequence of chars other than a comma.
) - End of the 1st capturing group.
,\s* - A comma and optional sequence of spaces.
The replace part should be:
$1 - The 1st capturing group.
\n - A newline.
So the whole program, much shorter than yours, can be as follows:
my $str='mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4, local<mail4#mail.local>';
print "Before:\n$str\n";
$str =~ s/([^#>]+[#>][^,]+),\s*/$1\n/g;
print "After:\n$str\n";
To replace all needed commas I used g option.
Note that I put the source string in single quotes, otherwise Perl
would have complained about Possible unintended interpolation of #mail.
Edit
Your modified requirements must be handled different way.
"Ordinary" replacement is not an option, because now there are some
fragments to match and some framents to ignore.
So the basic idea is to write a while loop with a matching regex:
(\w+),?\s+(\w+)(<[^>]+>), meaning:
(\w+) - First capturing group - a sequence of word chars (e.g. mail1).
,?\s+ - Optional comma and a sequence of spaces.
(\w+) - Second capturing group - a sequence of word chars (e.g. local).
(<[^>]+>) - Third capturing group - a sequence of chars other than >
(actual mail address), enclosed in angle brackets, e.g. <mail1#mail.local>.
Within each execution of the loop you have access to the groups
captured in this particular match ($1, $2, ...).
So the content of this loop is to print all these captured groups,
with required additional chars.
The code (again much shorter than yours) should look like below:
my $str = 'mail1, local<mail1#mail.local>, mail2#mail.local, <mail3#mail.local>, mail4 local<mail4#mail.local>';
while ($str =~ /(\w+),?\s+(\w+)(<[^>]+>)/g) {
print "\"$1, $2\"$3\n";
}
Here is an approach using split, which in this case also needs a careful regex
use warnings;
use strict;
use feature 'say';
my $string = # broken into two parts for readabililty
q(mail1, local<mail1#mail.local>, mail2#mail.local, )
. q(<mail3#mail.local>, mail4, local<mail4#mail.local>);
my #addresses = split /#.+?\K,\s*/, $string;
say for #addresses;
The split takes a full regex in its delimiter specification. In this case I figure that each record is delimited by a comma which comes after the email address, so #.+?,
To match a pattern only when it is preceded by another brings to mind a negative lookbehind before the comma. But those can't be of variable length, which is precisely the case here.
We can instead normally match the pattern #.+? and then use the \K form (of the lookbehind) which drops all previous matches so that they are not taken out of the string. Thus the above splits on ,\s* when that is preceded by the email address, #... (what isn't consumed).
It prints
mail1, local<mail1#mail.local>
mail2#mail.local
<mail3#mail.local>
mail4, local<mail4#mail.local>
The edit asks about quoting the description preceding <...> when it's there. A simple way is to make another pass once addresses have been parsed out of the string as above. For example
my #addresses = split /#.+?\K,\s*/, $string; #/ stop syntax highlight
s/(.+?,\s*.+?)</"$1"</ for #addresses;
say for #addresses;
The regex in a loop is one way to change elements of an array. I use it for its efficiency (changes elements in place), conciseness, and as a demonstration of the following properties.
In a foreach loop the index variable (or $_) is an alias for the currently processed element – so changing it changes that element. This is a known source of bugs when allowed unknowingly, which was another reason to show it in the above form.
The statement also uses the statement modifier and it is equivalent to
foreach my $elem (#addresses) {
$elem =~ s/(.+?,\s*.+?)</"$1"</;
}
This is often considered a more proper way to write it but I find that the other form emphasizes more clearly that elements are being changed, when that is the sole purpose of the foreach.

Perl searching for string contained in array

I have an array with the following values:
push #fruitArray, "apple|0";
push #fruitArray, "apple|1";
push #fruitArray, "pear|0";
push #fruitArray, "pear|0";
I want to find out if the string "apple" exists in this array (ignoring the "|0" "|1")
I am using:
$fruit = 'apple';
if( $fruit ~~ #fruitArray ){ print "I found apple"; }
Which isn't working.
Don't use smart matching. It never worked properly for a number of reasons and it is now marked as experimental
In this case you can use grep instead, together with an appropriate regex pattern
This program tests every element of #fruitArray to see if it starts with the letters in $fruit followed by a pipe character |. grep returns the number of elements that matched the pattern, which is a true value if at least one matched
my #fruitArray = qw/ apple|0 apple|1 pear|0 pear|0 /;
my $fruit = 'apple';
print "I found $fruit\n" if grep /^$fruit\|/, #fruitArray;
output
I found apple
I - like #Borodin suggests, too - would simply use grep():
$fruit = 'apple';
if (grep(/^\Q$fruit\E\|/, #fruitArray)) { print "I found apple"; }
which outputs:
I found apple
\Q...\E converts your string into a regex pattern.
Looking for the | prevents finding a fruit whose name starts with the name of the fruit for which you are looking.
Simple and effective... :-)
Update: to remove elements from array:
$fruit = 'apple';
#fruitsArrayWithoutApples = grep ! /^\Q$fruit\E|/, #fruitArray;
If your Perl is not ancient, you can use the first subroutine from the List::Util module (which became a core module at Perl 5.8) to do the check efficiently:
use List::Util qw{ first };
my $first_fruit = first { /\Q$fruit\E/ } #fruitArray;
if ( defined $first_fruit ) { print "I found $fruit\n"; }
Don't use grep, that will loop the entire array, even if it finds what you are looking for in the first index, so it is inefficient.
this will return true if it finds the substring 'apple', then return and not finish iterating through the rest of the array
#takes a reference to the array as the first parameter
sub find_apple{
#array_input = #{$_[0]};
foreach $fruit (#array_input){
if (index($fruit, 'apple') != -1){
return 1;
}
}
}
You can get close to the smartmatch sun without melting your wings by using match::simple:
use match::simple;
my #fruits = qw/apple|0 apple|1 pear|0 pear|0/;
$fruit = qr/apple/ ;
say "found $fruit" if $fruit |M| \#fruits ;
There's also a match() function if the infix [M] doesn't read well.
I like the way match::simple does almost everything I expected from ~~ without any surprising complexity. If you're fluent in perl it probably isn't something you'd see as necessary, but - especially with match() - code can be made pleasantly readable ... at the cost of imposing the use of references, etc.

Split functions

I want to get the split characters. I tried the below coding, but I can able to get the splitted text only. However if the split characters are same then it should be returned as that single characters
For example if the string is "asa,agas,asa" then only , should be returned.
So in the below case I should get as "| : ;" (joined with space)
use strict;
use warnings;
my $str = "Welcome|a:g;v";
my #value = split /[,;:.%|]/, $str;
foreach my $final (#value) {
print $final, "\n";
}
split splits a string into elements when given what separates those elements, so split is not what you want. Instead, use:
my #punctuations = $str =~ /([,;:.%|])/g;
So you want to get the opposite of split
try:
my #value=split /[^,;:.%|]+/,$str;
It will split on anything but the delimiters you set.
Correction after commnets:
my #value=split /[^,;:.%|]+/,$str;
shift #value;
this works fine, and gives unique answers
#value = ();
foreach(split('',",;:.%|")) { push #value,$_ if $str=~/$_/; }
To extract all the separators only once, you need something more elaborate
my #punctuations = keys %{{ map { $_ => 1 } $str =~ /[,;:.%|]/g }};
Sounds like you call "split characters" what the rest of us call "delimiters" -- if so, the POSIX character class [:punct:] might prove valuable.
OTOH, if you have a defined list of delimiters, and all you want to do is list the ones present in the string, it's much more efficient to use m// rather than split.

Simple multi-dimensional array with loop in perl

I'm trying to use an array and a loop to print out the following (basically for each letter of the alphabet, print each letter of the alphabet after it and then move on to the next letter). I'm new to perl, anyone have any quick words of :
aa
ab
ac
ad
...
ba
bb
bc
bd
...
ca
cb
...
Currently I have this, but it only prints a single character alphabet...
#arr = ("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z");
$i = #arr;
while ($i)
{
print $arr[$i];
$i--;
}
Using the range operator and the ranges you want to target:
use strict;
use warnings;
my #elements = ("aa" .. "zz");
for my $combo (#elements)
{
print "$combo\n";
}
You can utilize the initial 2 letters till the ending 2 letters you want as ending and the for will take care of everything.
This really isn't multi-dimensional array work, if it were you'd be working with stuff like:
my #foo = (
[1,2,3],
[4,7,8,1,2,3],
[2,3],
);
This is really a very basic how do I make a nested loop that iterates over the same array. I'll bet this is homework.
So, I'll let you figure out the nesting bits, but give some help with Perl's loop operators.
!! for/foreach
for (the each is optional) is the real heavy hitter for looping in perl. Use it like so:
for my $var ( #array ) {
#do stuff with $var
}
Each element in #array will be aliased to the $var variable, and the block of code will be executed. The fact that we are aliasing, rather than copying means that if alter the value of $var, #array will be changed as well. The stuff between the parenthesis may be any expression. The expression will be evaluated in list context. So if you put a file handle in the parens, the entire file will be read into memory and processed.
You can also leave off naming the loop variable, and $_ will be used instead. In general, DO NOT DO THIS.
!! C-Style for
Every once in a while you need to keep track of indexes as you loop over an array. This is when a C style for loop comes in handy.
for( my $i=0; $i<#array; $i++ ) {
# do stuff with $array[$i]
}
!! While/Until
While and until operate with boolean loop conditions. That means that the loop will repeat as long as the appropriate boolean value if found for the condition ( TRUE for while, and FALSE for until). In addition to the obvious cases where you are looking for a particular condition, while is great for processing a file one line at a time.
while ( my $line = <$fh> ) {
# Do stuff with $line.
}
!! map
map is an amazingly useful bit of functional programming kung-fu. It is used to turn one list into another. You pass an anonymous code reference that is used to enact the transformation.
# Multiply all elements of #old by two and store them in #new.
my #new = map { $_ * 2 } #old;
So how do you solve your particular problem? There are many ways. Which is best depends on how you want to use the results. If you want to create a new array of the letter pairs, use map. If you are interested primarily in a side effect (say printing a variable) use for. If you need to work with really big lists that come from sort of interator (like lines from a filehandle) use while.
Here's a solution. I wouldn't turn it in to your professor until you understand how it works.
print map { my $letter=$_; map "$letter$_\n", "a".."z" } "a".."z";
Look at perldoc articles, perlsyn for info on the looping constructs, perlfunc for info on map and look at perlop for info on the range operator (..).
Good luck.
Use the range operator (..) for your initialization. The range operator basically grabs a range of values such as numbers or characters.
Then use a nested loop to go through the array one time per character for a total of 26^2 iterations.
Rather than a while loop I've used a foreach loop to go through each item in the array. You could also put 'a' .. 'z' instead of declared #arr as the argument to the foreach loop. The foreach loops below set $char or $char2 to each value in #arr in turn.
my #arr = ('a' .. 'z');
for my $char (#arr) {
for my $char2 (#arr) {
print "$char$char2\n";
}
}
If all you really want to do is print the 676 strings you describe, then:
#!/usr/bin/perl
use warnings;
use strict;
my $str = 'aa';
while (length $str < 3) {
print $str++, "\n";
}
But I smell an "XY problem"...

Simplest way to match array of strings to search in perl?

What I want to do is check an array of strings against my search string and get the corresponding key so I can store it. Is there a magical way of doing this with Perl, or am I doomed to using a loop? If so, what is the most efficient way to do this?
I'm relatively new to Perl (I've only written 2 other scripts), so I don't know a lot of the magic yet, just that Perl is magic =D
Reference Array: (1 = 'Canon', 2 = 'HP', 3 = 'Sony')
Search String: Sony's Cyber-shot DSC-S600
End Result: 3
UPDATE:
Based on the results of discussion in this question, depending on your intent/criteria of what constitutes "not using a loop", the map based solution below (see "Option #1) may be the most concise solution, provided that you don't consider map a loop (the short version of the answers is: it's a loop as far as implementation/performance, it's not a loop from language theoretical point of view).
Assuming you don't care whether you get "3" or "Sony" as the answer, you can do it without a loop in a simple case, by building a regular expression with "or" logic (|) from the array, like this:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
Result from my test run: Sony
The regular expression will (once the variable $combined_search is interpolated by Perl) take the form /(Canon|HP|Sony)/ which is what you want.
This will NOT work as-is if any of the strings contain regex special characters (such as | or ) ) - in that case you need to escape them
NOTE: I personally consider this somewhat cheating, because in order to implement join(), Perl itself must do a loop somewhere inside the interpeter. So this answer may not satisfy your desire to remain loop-less, depending on whether you wanted to avoid a loop for performance considerations, of to have cleaner or shorter code.
P.S. To get "3" instead of "Sony", you will have to use a loop - either in an obvious way, by doing 1 match in a loop underneath it all; or by using a library that saves you from writing the loop yourself but will have a loop underneath the call.
I will provide 3 alternative solutions.
#1 option: - my favorite. Uses "map", which I personally still consider a loop:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
die "Not found" unless #which_found;
my $strings_index = 0;
my %strings_indexes = map {$_ => $strings_index++} #strings;
my $index = 1 + $strings_indexes{ $which_found[0] };
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#2 option: Uses a loop hidden behind a nice CPAN library method:
use List::MoreUtils qw(firstidx);
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
die "Not Found!"; unless #which_found;
print "$which_found[0]\n";
my $index_of_found = 1 + firstidx { $_ eq $which_found[0] } #strings;
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#3 option: Here's the obvious loop way:
my $found_index = -1;
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
foreach my $index (0..$#strings) {
next if $search_in !~ /$strings[$index]/;
$found_index = $index;
last; # quit the loop early, which is why I didn't use "map" here
}
# Check $found_index against -1; and if you want "3" instead of "2" add 1.
Here is a solution that builds a regular expression with embedded code to increment the index as perl moves through the regex:
my #brands = qw( Canon HP Sony );
my $string = "Sony's Cyber-shot DSC-S600";
use re 'eval'; # needed to use the (?{ code }) construct
my $index = -1;
my $regex = join '|' => map "(?{ \$index++ })\Q$_" => #brands;
print "index: $index\n" if $string =~ $regex;
# prints 2 (since Perl's array indexing starts with 0)
The string that is prepended to each brand first increments the index, and then tries to match the brand (escaped with quotemeta (as \Q) to allow for regex special characters in the brand names).
When the match fails, the regex engine moves past the alternation | and then the pattern repeats.
If you have multiple strings to match against, be sure to reset $index before each. Or you can prepend (?{$index = -1}) to the regex string.
An easy way is just to use a hash and regex:
my $search = "your search string";
my %translation = (
'canon' => 1,
'hp' => 2,
'sony' => 3
);
for my $key ( keys %translation ) {
if ( $search =~ /$key/i ) {
return $translation{$key};
)
}
Naturally the return can just as easily be a print. You can also surround the entire thing in a while loop with:
while(my $search = <>) {
#your $search is declared = to <> and now gets its values from STDIN or strings piped to this script
}
Please also take a look at perl's regex features at perlre
and take a look at perl's data structures at perlref
EDIT
as was just pointed out to me you were trying to steer away from using a loop. Another method would be to use perl's map function. Take a look here.
You can also take a look at Regexp::Assemble, which will take a collection of sub-regexes and build a single super-regex from them that can then be used to test for all of them at once (and gives you the text which matched the regex, of course). I'm not sure that it's the best solution if you're only looking at three strings/regexes that you want to match, but it's definitely the way to go if you have a substantially larger target set - the project I initially used it on has a library of some 1500 terms that it's matching against and it performs very well.