Perl: Is a single grep operation costs same as single for loop? - perl

Say I have a below existing code.
my $names = &loadNames(); # No duplicate names
my $u1;
my $u2;
for (my $i = 0 ; $i < #$names; $i++) {
if($$names[$i] eq $input_one){
$u1 = loadUserFromOneSource($input_one);
}
if($$names[$i] eq $input_two){
$u2 = loadUserFromSecondSource($input_two);
}
}
Now if I refactored the above code like below
my $names = &loadNames(); #Returns array reference
my $u1 = grep $_ eq $input_one, #$names;
my $u2 = grep $_ eq $input_two, #$names;
$u1 = loadUserFromOneSource($u1) if $u1;
$u2 = loadUserFromSecondSource($u2) if $u2;
Did I really improve anything? Or I made it even worse because I am running two greps on same list.
Clarification:
The sole purpose of asking the question is to understand the performance trade offs between grep and loop. In both cases I'm extracting out two names. But in first example in a single iteration. And second example it's done in two greps. Did I doubled the cost in second approach? Or grep is efficient enough to win over the single iteration? I will try benchmarking when I will back in work.

In general, inbuilts like grep will be faster than manual loops.
However in your particular case there are a couple of gotchas:
The two code examples don't do the same thing. In the first case, the functions may be called multiple times, if the condition is matched more than once. In the second example the functions can be called at most once.
Using an expression like /$foo/ inside a loop or grep, map, etc will result in the regex being compiled each time.
Since you did not a anchor the regex, partial matches may also occur.
I would use grep but change the condition to
my $u1 = grep $_ eq $input_one, #$names;

For one, you're no longer testing for equality in your new code, but for regex inclusion. That could've introduced a bug.
Another solution is just to translate the arrays to a hash so that you can test if a name exists. I.e. perldoc How can I tell whether a certain element is contained in a list or array?
my $names = loadNames(); #Returns array reference
my %hasName = map {$_ => 1} #$names;
my $u1 = $hasName{$input_one} ? loadUserFromOneSource($input_one) : '';
my $u2 = $hasName{$input_two} ? loadUserFromSecondSource($input_two) : '';

Related

Perl multi-order grep with $_

I'm trying to learn Perl's grep better.
I want to grep which keys of a hash are not in an array
my %args = ( fake => 1);
my #defined_args = ('color', 'colors', 'data', 'figheight', 'figwidth', 'filename', 'flip', 'grid', 'labelsize', 'logscale', 'minor_gridlines');
my #bad_args = grep { not grep {$_} #defined_args} keys %args;
where the list of bad args is in #bad_args The last line is obviously wrong.
I know that I can do the same thing with a hash, but I want to be able to do this with a multi-order grep, i.e. grep on grep.
How can I do this like the following?
my #bad_args = grep { not grep {$_ eq $_} #defined_args} keys %args;
I'm confused because there would be two $_, which I can't run an equality test on.
First the direct answer -- that block that grep takes, you can put any code in it. That's the point of the block, and an element passes/not based on the truthiness of the last statement that returns.
my #bad_args = grep {
my $key = $_;
#defined_args == grep { $key ne $_ } #defined_args
} keys %args;
Here we test whether a key is not-equal to array elements, and then test whether it was unequal to all of them, what decides. Another way would be to test whether it is equal to any one element,
not grep { $key eq $_ } #defined_args;
This is all a little convoluted, needing to work with negations.
But these are kinds of common things to do and there are libraries.
To directly improve on the above
use List::Util 1.33 qw(none); # before 1.33 it was in List::MoreUtils
my #bad_args = grep {
my $key = $_;
none { $key eq $_ } #defined_args
} keys %args;
Now the needed "negative" is absorbed in the library's function name, making this far easier to look at. Also, none will stop once it sees that it failed while grep always processes all elements so this is also more efficient.
These aren't terribly efficient in comparison with hash-based approaches (complexity O(NM-M2/2) or so) but that is completely irrelevant for small arrays. Use of hashes, mentioned in the question, for existence-related issues is a standard; see for example this post, or the source for methods used in all libraries discussed below (simplest example).
Finally, while the question is about (double) filtering it should be mentioned that we are looking for which elements of a list aren't in another; a "difference" between lists. Then other kinds of libraries come into play. Some examples
Using Set::Scalar
use Set::Scalar;
...
my $keys = Set::Scalar->new(keys %args);
my $good = Set::Scalar->new(#defined_args);
my $keys_not_in_good = $keys->difference($good);
say $keys_not_in_good;
Also note Set::Object in the same camp.
Then there are tools specifically for array comparison, like List::Compare
use List::Compare;
...
my $lc = List::Compare->new('-u', '-a', \#defined_args, [keys %args]);
my #only_in_second = $lc->get_complement();
say "#only_in_second";
Options -u and -a showcase some of modules capabilities, to speed things up; they are not necessary. This module has a lot, see docs.
On the other end is the simple Array::Utils.
There is more out there. See for example this page for plenty of ideas.
When you get into these sort of tangles, it's sometimes better to find a different way.
There are two things to think about. If you want a nested use of $_, you need to protect the outer one somehow. Since you want to use the outer and inner ones in the same expression, one of them needs a different name:
grep {
my $top = $_;
my $count = grep { $top eq $_ } ...;
...
} keys %args;
But, that inner grep is a bit weird. You want to check if something is (or isn't) in a list. That's the job for a hash and exists:
my %allowed_args = map { $_, 1 } #allowed_args;
my #found_bad_args = grep { ! exists $allowed_args{$_} } keys %args;

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program is Read input file in which every line contrain three strings seperated by tab. First is TableName, Second is ALL Column Name with commas in between and third contain the sql to be run. As this utlity is doing comparison of data, so there are two rows for every tablename. One for each DB. So data needs to be picked from each respective db's and then compared column by column.
SQL returns as ID in the result set and if the value is coming from db then it needs be translated to a string by reading from a array (that array contains 100K records with Key and value seperated by ||)
Now I ran this for one set of tables which contains 18K records in each db. There are 8 columns picked from db in each sql. So for every record out of 18K, and then for every field in that record i.e. 8, this script is taking a lot of time.
My question is if someone can look and see if it can be imporoved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
Metadata array contains something like this(there are two i.e. for each db)
111||Code 1
222||Code 2
Please suggest
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying loop while I can read two lines?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 eq $tablename2;
the rest of the loop;
and save a level of intendation. And better intendation equals better readability makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
...;
for my $colname (#fieldname) { ... }
}
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrive a single element, then assign that element to an array (a collection of multiple values).
Then you called defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
? do {
my ($filtered) = grep /\Q$val1/, #metadata;
substr $filtered, 2 + index $filtered, '||'
} : '';
But this would be even cheaper if you pre-split #metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unneccessary: The hashrefs are assigned to a new value at the next iteration of the loop. Also, it is unneccessary to assist Perls garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
chomp;
my ($key, $value) = split /\|\|/;
$metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside you inner for loop, there is:
#val1 = $hash_ref1->{$key1}->{$colname};
Did you mean #val1 = #{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined #val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined #val1) will always be true.
Then, you have my #filtered = grep /#val1/, #metadata; Where did #metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

how to remove items from an perl array based using grep and shift?

I need to remove the data record symbol and any blank lines in a dataRecord using perl.
for example,
$/ = "__Data__"
__Data__
riririririr
djkfkdfjkdjkf
dghghghghghghg
(blank line)
my #dataRecord = split(/\n/);
grep(/(__Data__|/,#dataRecord);
How do I remove the items I do not want in the array based on the grep filtering?
Not sure what's going on with the input record separator here, and the use of split is not valid unless the implicit $_ is being used.
To answer the question though, use the ! operator to negate the sense of the match:
#dataRecord = grep { ! /__Data__|^$/ } #dataRecord;
The ! can also be replaced with not for this case:
#dataRecord = grep { not /__Data__|^$/ } #dataRecord;
This should work:
my #filered_list = grep { length( $_ ) and $_ ne '__Data__' } #dataRecord;
Well, if you want the elements that match the criteria:
#dataRecord = grep(/expr/,#dataRecord);
(assuming that the elements that you don't want in the array are the ones that don't pass the matching regex).

Simplest way to match array of strings to search in perl?

What I want to do is check an array of strings against my search string and get the corresponding key so I can store it. Is there a magical way of doing this with Perl, or am I doomed to using a loop? If so, what is the most efficient way to do this?
I'm relatively new to Perl (I've only written 2 other scripts), so I don't know a lot of the magic yet, just that Perl is magic =D
Reference Array: (1 = 'Canon', 2 = 'HP', 3 = 'Sony')
Search String: Sony's Cyber-shot DSC-S600
End Result: 3
UPDATE:
Based on the results of discussion in this question, depending on your intent/criteria of what constitutes "not using a loop", the map based solution below (see "Option #1) may be the most concise solution, provided that you don't consider map a loop (the short version of the answers is: it's a loop as far as implementation/performance, it's not a loop from language theoretical point of view).
Assuming you don't care whether you get "3" or "Sony" as the answer, you can do it without a loop in a simple case, by building a regular expression with "or" logic (|) from the array, like this:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
Result from my test run: Sony
The regular expression will (once the variable $combined_search is interpolated by Perl) take the form /(Canon|HP|Sony)/ which is what you want.
This will NOT work as-is if any of the strings contain regex special characters (such as | or ) ) - in that case you need to escape them
NOTE: I personally consider this somewhat cheating, because in order to implement join(), Perl itself must do a loop somewhere inside the interpeter. So this answer may not satisfy your desire to remain loop-less, depending on whether you wanted to avoid a loop for performance considerations, of to have cleaner or shorter code.
P.S. To get "3" instead of "Sony", you will have to use a loop - either in an obvious way, by doing 1 match in a loop underneath it all; or by using a library that saves you from writing the loop yourself but will have a loop underneath the call.
I will provide 3 alternative solutions.
#1 option: - my favorite. Uses "map", which I personally still consider a loop:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
die "Not found" unless #which_found;
my $strings_index = 0;
my %strings_indexes = map {$_ => $strings_index++} #strings;
my $index = 1 + $strings_indexes{ $which_found[0] };
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#2 option: Uses a loop hidden behind a nice CPAN library method:
use List::MoreUtils qw(firstidx);
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
die "Not Found!"; unless #which_found;
print "$which_found[0]\n";
my $index_of_found = 1 + firstidx { $_ eq $which_found[0] } #strings;
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#3 option: Here's the obvious loop way:
my $found_index = -1;
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
foreach my $index (0..$#strings) {
next if $search_in !~ /$strings[$index]/;
$found_index = $index;
last; # quit the loop early, which is why I didn't use "map" here
}
# Check $found_index against -1; and if you want "3" instead of "2" add 1.
Here is a solution that builds a regular expression with embedded code to increment the index as perl moves through the regex:
my #brands = qw( Canon HP Sony );
my $string = "Sony's Cyber-shot DSC-S600";
use re 'eval'; # needed to use the (?{ code }) construct
my $index = -1;
my $regex = join '|' => map "(?{ \$index++ })\Q$_" => #brands;
print "index: $index\n" if $string =~ $regex;
# prints 2 (since Perl's array indexing starts with 0)
The string that is prepended to each brand first increments the index, and then tries to match the brand (escaped with quotemeta (as \Q) to allow for regex special characters in the brand names).
When the match fails, the regex engine moves past the alternation | and then the pattern repeats.
If you have multiple strings to match against, be sure to reset $index before each. Or you can prepend (?{$index = -1}) to the regex string.
An easy way is just to use a hash and regex:
my $search = "your search string";
my %translation = (
'canon' => 1,
'hp' => 2,
'sony' => 3
);
for my $key ( keys %translation ) {
if ( $search =~ /$key/i ) {
return $translation{$key};
)
}
Naturally the return can just as easily be a print. You can also surround the entire thing in a while loop with:
while(my $search = <>) {
#your $search is declared = to <> and now gets its values from STDIN or strings piped to this script
}
Please also take a look at perl's regex features at perlre
and take a look at perl's data structures at perlref
EDIT
as was just pointed out to me you were trying to steer away from using a loop. Another method would be to use perl's map function. Take a look here.
You can also take a look at Regexp::Assemble, which will take a collection of sub-regexes and build a single super-regex from them that can then be used to test for all of them at once (and gives you the text which matched the regex, of course). I'm not sure that it's the best solution if you're only looking at three strings/regexes that you want to match, but it's definitely the way to go if you have a substantially larger target set - the project I initially used it on has a library of some 1500 terms that it's matching against and it performs very well.

Why does Perl's shift complain 'Type of arg 1 to shift must be array (not grep iterator).'?

I've got a data structure that is a hash that contains an array of hashes. I'd like to reach in there and pull out the first hash that matches a value I'm looking for. I tried this:
my $result = shift grep {$_->{name} eq 'foo'} #{$hash_ref->{list}};
But that gives me this error: Type of arg 1 to shift must be array (not grep iterator). I've re-read the perldoc for grep and I think what I'm doing makes sense. grep returns a list, right? Is it in the wrong context?
I'll use a temporary variable for now, but I'd like to figure out why this doesn't work.
A list isn't an array.
my ($result) = grep {$_->{name} eq 'foo'} #{$hash_ref->{list}};
… should do the job though. Take the return from grep in list context, but don't assign any of the values other than the first.
I think a better way to write this would be this:
use List::Util qw/first/;
my $result = first { $_->{name} eq 'foo' } #{ $hash_ref->{list} };
Not only will it be more clear what you're trying to do, it will also be faster because it will stop grepping your array once it has found the matching element.
Another way to do it:
my $result = (grep {$_->{name} eq 'foo'} #{$hash_ref->{list}})[0];
Note that the curlies around the first argument to grep are redundant in this case, so you can avoid block setup and teardown costs with
my $result = (grep $_->{name} eq 'foo', #{$hash_ref->{list}})[0];
“List value constructors” in perldata documents subscripting of lists:
A list value may also be subscripted like a normal array. You must put the list in parentheses to avoid ambiguity. For example:
# Stat returns list value.
$time = (stat($file))[8];
# SYNTAX ERROR HERE.
$time = stat($file)[8]; # OOPS, FORGOT PARENTHESES
# Find a hex digit.
$hexdigit = ('a','b','c','d','e','f')[$digit-10];
# A "reverse comma operator".
return (pop(#foo),pop(#foo))[0];
As I recall, we got this feature when Randal Schwartz jokingly suggested it, and Chip Salzenberg—who was a patching machine in those days—implemented it that evening.
Update: A bit of searching shows the feature I had in mind was $coderef->(#args). The commit message even logs the conversation!