Perl Map Function - perl

I'm new to the map and grep functions and I'm trying to make an existing script more concise.
I can "grep" the #tracknames successfully but I'm having a problem with "map". I want #trackartist to return true if two consecutive "--" are found in a line and take the value of $1, otherwise false, but it returns the whole line if the upper condition is not met.
What am I doing wrong?
my #tracknames = grep /^\d\d\..*?(\.(?:flac|wv))$/, <*.*>;
my #trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; } <*.*>;
Sample of files
01. some track artist 1 -- some track name 1.(flac or wv)
02. some track artist 2 -- some track name 2.(flac or wv)
03. some track artist 3 -- some track name 3.(flac or wv)
etc.

Remember that grep is for filtering a list and map is for transforming a list. Right now, your map statement returns $_ for every item in the list. If $_ matches the pattern in your substitution, it will be modified and replaced with the first match. Otherwise, it's not modified and the original $_ is returned.
It sounds like you want to filter out items that don't match the pattern. One way would be to combine a map and a grep:
my #trackartist = map { s/^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/$1/; $_; }
grep { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ } <*.*>;
Of course, this means you're doing the same pattern match twice. Another approach is to do a transform with map, but transform anything that doesn't match the pattern into an empty list.
my #trackartist = map { /^\d\d\.\s(.*?)\s--.*?\.(?:flac|wv)$/ ? $1 : ( ) } <*.*>
This uses the ternary conditional operator (?:) to check if the regex matches (returning a true value). If it does, $1 is returned from the map block, if not, an empty list ( ) is returned, which adds nothing to the list resulting from the map.
As a side note, you might want to look into using the glob function rather than <>, which has some disadvantages.

I like map and grep as much as the next guy, but your task seems more suited to a divide-and-conquer parsing approach. I say this because your comments suggest that your interest in map is leading you down a road where you'll end up with a data model consisting of parallel arrays -- #tracks, #artists, etc. -- which is often difficult to maintain in the long run. Here's a sketch of what I mean:
my #tracks;
while (my $file_name = <DATA>){ # You'll use glob() or <*.*>
# Filter out unwanted files.
my ($num, $artist_title, $ext) = $file_name =~ /
^ (\d\d) \. \s*
(.*)
\. (flac|wv) $
/x;
next unless $ext;
# Try to parse the artist and title. Adjust as needed.
my ($artist, $title) = split /\s+--\s+/, $artist_title, 2;
($artist, $title) = ('UNKNOWN', $artist) unless $title;
# Store all info as a hash ref. No need for parallel arrays.
push #tracks, {
file_name => $file_name,
ext => $ext,
artist => $artist,
title => $title,
};
}
__DATA__
01. Perl Jam -- Open or die.wv
02. Perl Jam -- Map to nowhere.flac
03. Perl Jam -- What the #$#!?.wv
04. Perl Jam -- Regex blues.wv
05. Perl Jam -- Use my package, baby.wv
06. Perl Jam -- No warnings.wv
07. Perl Jam -- Laziness ISA virtue.wv
08. Guido and the Pythons -- Home on the xrange.flac
09. Guido and the Pythons -- You gotta keep em generated.flac
10. StackOverflow medley.wv
foo.txt

Related

Perl searching for string contained in array

I have an array with the following values:
push #fruitArray, "apple|0";
push #fruitArray, "apple|1";
push #fruitArray, "pear|0";
push #fruitArray, "pear|0";
I want to find out if the string "apple" exists in this array (ignoring the "|0" "|1")
I am using:
$fruit = 'apple';
if( $fruit ~~ #fruitArray ){ print "I found apple"; }
Which isn't working.
Don't use smart matching. It never worked properly for a number of reasons and it is now marked as experimental
In this case you can use grep instead, together with an appropriate regex pattern
This program tests every element of #fruitArray to see if it starts with the letters in $fruit followed by a pipe character |. grep returns the number of elements that matched the pattern, which is a true value if at least one matched
my #fruitArray = qw/ apple|0 apple|1 pear|0 pear|0 /;
my $fruit = 'apple';
print "I found $fruit\n" if grep /^$fruit\|/, #fruitArray;
output
I found apple
I - like #Borodin suggests, too - would simply use grep():
$fruit = 'apple';
if (grep(/^\Q$fruit\E\|/, #fruitArray)) { print "I found apple"; }
which outputs:
I found apple
\Q...\E converts your string into a regex pattern.
Looking for the | prevents finding a fruit whose name starts with the name of the fruit for which you are looking.
Simple and effective... :-)
Update: to remove elements from array:
$fruit = 'apple';
#fruitsArrayWithoutApples = grep ! /^\Q$fruit\E|/, #fruitArray;
If your Perl is not ancient, you can use the first subroutine from the List::Util module (which became a core module at Perl 5.8) to do the check efficiently:
use List::Util qw{ first };
my $first_fruit = first { /\Q$fruit\E/ } #fruitArray;
if ( defined $first_fruit ) { print "I found $fruit\n"; }
Don't use grep, that will loop the entire array, even if it finds what you are looking for in the first index, so it is inefficient.
this will return true if it finds the substring 'apple', then return and not finish iterating through the rest of the array
#takes a reference to the array as the first parameter
sub find_apple{
#array_input = #{$_[0]};
foreach $fruit (#array_input){
if (index($fruit, 'apple') != -1){
return 1;
}
}
}
You can get close to the smartmatch sun without melting your wings by using match::simple:
use match::simple;
my #fruits = qw/apple|0 apple|1 pear|0 pear|0/;
$fruit = qr/apple/ ;
say "found $fruit" if $fruit |M| \#fruits ;
There's also a match() function if the infix [M] doesn't read well.
I like the way match::simple does almost everything I expected from ~~ without any surprising complexity. If you're fluent in perl it probably isn't something you'd see as necessary, but - especially with match() - code can be made pleasantly readable ... at the cost of imposing the use of references, etc.

Sorting a sub-section of a file

I'm in need of some Perl wisdom from those more experienced than myself.
So far, my answer to the below is to simply go through the file line-by line, and insert relevant elements into an array, sort the array and then append the contents. But that seems like a bit long-winded and not very efficient.
I have a file whose contents look something like this :
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.bravo.example.com # <----
noperiod.example.com # <---- This list is
.an.example.com # <---- not ordered
.some.example.com # <----
Is there a clever way in Perl (ideally a one-liner that could be piped) to sort the second list ? i.e. so you would get the following result :
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.an.example.com # <----
.bravo.example.com # <---- NOW this list
noperiod.example.com # <---- IS ordered ;-)
.some.example.com # <----
Four things to note :
The content to be sorted is always at the bottom of the file
The header (":127.200.200.100 etc.") is always the same
Names may or may not start with a period (i.e. .bravo.example.com vs noperiod.example.com)
There may be a large number of items, so needs to be reasonably efficient
Depends what you mean by 'efficient'. I mean, a one liner is rarely efficient and it's also rarely concise or clear as to what it's doing.
But in terms of efficiency? Well, it depends what you're doing already that's inefficient. I mean, pretty fundamentally, if you're sorting something you need to examine the whole data set. Otherwise how would you know that the last line in your file needs to be sorted to the top?
But for what you're doing, I'd approach it like this:
#!/usr/bin/perl
use strict;
use warnings;
sub sort_noperiods {
my $a_np = $a;
$a_np =~ s/\.//g;
my $b_np = $b;
$b_np =~ s/\.//g;
return $a_np cmp $b_np;
}
while ( <> ) {
print;
last if m/Something Else/;
}
print sort sort_noperiods <>;
Which for your sample input, prints:
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.an.example.com # <---- not ordered
.bravo.example.com # <----
noperiod.example.com # <---- This list is
.some.example.com # <----
I'm keying off the 'Something Else' line in your file, as I couldn't quite tell how you'd identify the last line of the 'header' chunk. Anything else gets read in and sorted according to the 'noperiods' sort mechanism. (There may be a small efficiency gain by caching the result of the regular expressions, but I'm not sure of this).
This can be 'one-linerified' by:
perl -e 'while ( <> ) { print; last if m/Something Else/ }; print sort { $a =~ s/\.//gr cmp $b =~ s/\.//gr } <>; '
You can sort that by the shell with a little help from Perl: Just prepend a line number to each line before the list, and for the list, use the number of its first line. Then sort numerically by the numbers, and secondary by the rest of the line:
perl -ne 'if (1 .. /^:127\.200\.200\.100:.*/) {
print "$.\t$_";
} else {
print $.--, "\t$_"
}' file.txt \
| sort -k1,1n -k2 | cut -f2-

what's the proper syntax to insert a multi-line substitution string and apply it toward a single array value?

Here's the scenario -- One step of the process involves fixing city names when the data is obviously misspelled, along with some basic conversions like "MTN" to "Mountain" and so forth. I've built a variable containing several substitution strings, and I'm trying to apply that set of subs on one of the input fields later down the line.
my $citysub = <<'EOF';
s/DEQUEEN/DE QUEEN/;
s/ELDORADO/EL DORADO/;
... # there are about 100 such substitution strings
EOF
...
while ($line <INFILE>)
{
...
#field = split(/","/,$line); # it's a comma-delimited file with quoted strings; this is spltting exactly like I intend; at the end, I'll piece it back together properly
...
# the 9th field and 12th field are city names, i.e., $field[8] and $field[12]
$field[8] =~ $citysub; # this is what I'm wanting to do, but it doesn't work!
# since that doesn't work, I'm using the following, but it's much slower, obviiously
$field[8] = `echo $field[8]|sed -e "$citysub"`; # external calls to system commands
So, what's the proper syntax to insert a multi-line substitution string and apply it toward a single array value?
my %citysub = ( "DEQUEEN" => "DE QUEEN", "ELDORADO" => "EL DORADO" );
for my $find ( keys %citysub ) {
my $replace = $citysub{ $find };
$field[8] =~ s/$find/$replace/g;
}
Explanation: Create a hash of "thing to match" => "thing to replace with". then loop over that hash and run s/// with the thing to match and the thing to replace with.

Skipping particular positions in a string using substitution operator in perl

Yesterday, I got stuck in a perl script. Let me simplify it, suppose there is a string (say ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD), first I've to break it at every position where "E" comes, and secondly, break it specifically where the user wants to be at. But, the condition is, program should not cut at those sites where E is followed by P. For example there are 6 Es in this sequence, so one should get 7 fragments, but as 2 Es are followed by P one will get 5 only fragments in the output.
I need help regarding the second case. Suppose user doesn't wants to cut this sequence at, say 5th and 10th positions of E in the sequence, then what should be the corresponding script to let program skip these two sites only? My script for first case is:
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
$otext=~ s/([E])/$1=/g; #Main cut rule.
$otext=~ s/=P/P/g;
#output = split( /\=/, $otext);
print "#output";
Please do help!
To split on "E" except where it's followed by "P", you should use Negative look-ahead assertions.
From perldoc perlre "Look-Around Assertions" section:
(?!pattern)
A zero-width negative look-ahead assertion.
For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar".
my $otext = 'ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD';
# E E EP E EP E
my #output=split(/E(?!P)/, $otext);
use Data::Dumper; print Data::Dumper->Dump([\#output]);"
$VAR1 = [
'ABCD',
'ABCD',
'ABCDEPABCD',
'ABCDEPABCD',
'ABCD'
];
Now, in order to NOT cut at occurences #2 and #4, you can do 2 things:
Concoct a really fancy regex that automatically fails to match on given occurence. I will leave that to someone else to attempt in an answer for completeness sake.
Simply stitch together the correct fragments.
I'm too brain-dead to come up with a good idiomatic way of doing it, but the simple and dirty way is either:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
my #output_final;
for(my $i=0; $i < #output; $i++) {
if ($no_cuts{$i}) {
$output_final[-1] .= $output[$i];
} else {
push #output_final, $output[$i];
}
}
print Data::Dumper->Dump([\#output_final];
$VAR1 = [
'ABCD',
'ABCDABCDEPABCD',
'ABCDEPABCDABCD'
];
Or, simpler:
my %no_cuts = map { ($_=>1) } (2,4); # Do not cut in positions 2,4
for(my $i=0; $i < #output; $i++) {
$output[$i-1] .= $output[$i];
$output[$i]=undef; # Make the slot empty
}
my #output_final = grep {$_} #output; # Skip empty slots
print Data::Dumper->Dump([\#output_final];
$VAR1 = [
'ABCD',
'ABCDABCDEPABCD',
'ABCDEPABCDABCD'
];
Here's a dirty trick that exploits two facts:
normal text strings never contain null bytes (if you don't know what a null byte is, you should as a programmer: http://en.wikipedia.org/wiki/Null_character, and nb. it is not the same thing as the number 0 or the character 0).
perl strings can contain null bytes if you put them there, but be careful, as this may screw up some perl internal functions.
The "be careful" is just a point to be aware of. Anyway, the idea is to substitute a null byte at the point where you don't want breaks:
my $s = "ABCDEABCDEABCDEPABCDEABCDEPABCDEABCD";
my #nobreak = (4,9);
foreach (#nobreak) {
substr($s, $_, 1) = "\0";
}
"\0" is an escape sequence representing a null byte like "\t" is a tab. Again: it is not the character 0. I used 4 and 9 because there were E's in those positions. If you print the string now it looks like:
ABCDABCDABCDEPABCDEABCDEPABCDEABCD
Because null bytes don't display, but they are there, and we are going to swap them back out later. First the split:
my #a = split(/E(?!P)/, $s);
Then swap the zero bytes back:
$_ =~ s/\0/E/g foreach (#a);
If you print #a now, you get:
ABCDEABCDEABCDEPABCD
ABCDEPABCD
ABCD
Which is exactly what you want. Note that split removes the delimiter (in this case, the E); if you intended to keep those you can tack them back on again afterward. If the delimiter is from a more dynamic regex it is slightly more complicated, see here:
http://perlmeme.org/howtos/perlfunc/split_function.html
"Example 9. Keeping the delimiter"
If there is some possibility that the #nobreak positions are not E's, then you must also keep track of those when you swap them out to make sure you replace with the correct character again.

Simplest way to match array of strings to search in perl?

What I want to do is check an array of strings against my search string and get the corresponding key so I can store it. Is there a magical way of doing this with Perl, or am I doomed to using a loop? If so, what is the most efficient way to do this?
I'm relatively new to Perl (I've only written 2 other scripts), so I don't know a lot of the magic yet, just that Perl is magic =D
Reference Array: (1 = 'Canon', 2 = 'HP', 3 = 'Sony')
Search String: Sony's Cyber-shot DSC-S600
End Result: 3
UPDATE:
Based on the results of discussion in this question, depending on your intent/criteria of what constitutes "not using a loop", the map based solution below (see "Option #1) may be the most concise solution, provided that you don't consider map a loop (the short version of the answers is: it's a loop as far as implementation/performance, it's not a loop from language theoretical point of view).
Assuming you don't care whether you get "3" or "Sony" as the answer, you can do it without a loop in a simple case, by building a regular expression with "or" logic (|) from the array, like this:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
Result from my test run: Sony
The regular expression will (once the variable $combined_search is interpolated by Perl) take the form /(Canon|HP|Sony)/ which is what you want.
This will NOT work as-is if any of the strings contain regex special characters (such as | or ) ) - in that case you need to escape them
NOTE: I personally consider this somewhat cheating, because in order to implement join(), Perl itself must do a loop somewhere inside the interpeter. So this answer may not satisfy your desire to remain loop-less, depending on whether you wanted to avoid a loop for performance considerations, of to have cleaner or shorter code.
P.S. To get "3" instead of "Sony", you will have to use a loop - either in an obvious way, by doing 1 match in a loop underneath it all; or by using a library that saves you from writing the loop yourself but will have a loop underneath the call.
I will provide 3 alternative solutions.
#1 option: - my favorite. Uses "map", which I personally still consider a loop:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
die "Not found" unless #which_found;
my $strings_index = 0;
my %strings_indexes = map {$_ => $strings_index++} #strings;
my $index = 1 + $strings_indexes{ $which_found[0] };
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#2 option: Uses a loop hidden behind a nice CPAN library method:
use List::MoreUtils qw(firstidx);
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
die "Not Found!"; unless #which_found;
print "$which_found[0]\n";
my $index_of_found = 1 + firstidx { $_ eq $which_found[0] } #strings;
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#3 option: Here's the obvious loop way:
my $found_index = -1;
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
foreach my $index (0..$#strings) {
next if $search_in !~ /$strings[$index]/;
$found_index = $index;
last; # quit the loop early, which is why I didn't use "map" here
}
# Check $found_index against -1; and if you want "3" instead of "2" add 1.
Here is a solution that builds a regular expression with embedded code to increment the index as perl moves through the regex:
my #brands = qw( Canon HP Sony );
my $string = "Sony's Cyber-shot DSC-S600";
use re 'eval'; # needed to use the (?{ code }) construct
my $index = -1;
my $regex = join '|' => map "(?{ \$index++ })\Q$_" => #brands;
print "index: $index\n" if $string =~ $regex;
# prints 2 (since Perl's array indexing starts with 0)
The string that is prepended to each brand first increments the index, and then tries to match the brand (escaped with quotemeta (as \Q) to allow for regex special characters in the brand names).
When the match fails, the regex engine moves past the alternation | and then the pattern repeats.
If you have multiple strings to match against, be sure to reset $index before each. Or you can prepend (?{$index = -1}) to the regex string.
An easy way is just to use a hash and regex:
my $search = "your search string";
my %translation = (
'canon' => 1,
'hp' => 2,
'sony' => 3
);
for my $key ( keys %translation ) {
if ( $search =~ /$key/i ) {
return $translation{$key};
)
}
Naturally the return can just as easily be a print. You can also surround the entire thing in a while loop with:
while(my $search = <>) {
#your $search is declared = to <> and now gets its values from STDIN or strings piped to this script
}
Please also take a look at perl's regex features at perlre
and take a look at perl's data structures at perlref
EDIT
as was just pointed out to me you were trying to steer away from using a loop. Another method would be to use perl's map function. Take a look here.
You can also take a look at Regexp::Assemble, which will take a collection of sub-regexes and build a single super-regex from them that can then be used to test for all of them at once (and gives you the text which matched the regex, of course). I'm not sure that it's the best solution if you're only looking at three strings/regexes that you want to match, but it's definitely the way to go if you have a substantially larger target set - the project I initially used it on has a library of some 1500 terms that it's matching against and it performs very well.