Look up a value in Perl based on a range - perl

I have two variables, id and date. There are millions of distinct ids, but just a few hundred distinct dates. The ids are sequential and the dates are increasing with id. Something like this:
id date
1 1/1/2000
2 1/1/2000
3 1/1/2000
4 1/2/2000
5 1/2/2000
In Perl, I need to create a function that will return a date given the id. My first thought was just to make a hash table. This will work, but given that I have millions of records, I thought it might make more sense to work with date ranges. So in the example above, rather than storing 5 records, I could store 2 records: one for each date with the earliest and latest date that correspond to the id:
date first_id last_id
1/1/2000 1 3
1/2/2000 4 5
(In my actual data, this will allow me to store just a few thousand records, rather than millions.)
My question is, given an id, what is the best way to look up the date given this structure? So given id=2, I want to return 1/1/2000 because 2 is between 1 and 3, and therefore corresponds to the first record.
Thanks for any advice.

Use a [semi] sparse array. Performance should be fine. You're looking at a few megabytes of memory usage per million records. If you convert the date to an integer epoch before storing it, even better.
use Time::Local;
my #date_by_id;
while (<FILE>) {
chomp;
my ($id, $date) = split /\s+/;
my ($mon, $mday, $year) = split /\//, $date;
$mon--;
$year -= 1900;
$date_by_id[$id] = timelocal 0, 0, 0,
$mday, $mon, $year;
}
Performance should be good enough that you won't need to wrap it in a function. Just use $date_by_id[<ID>] where needed, keeping in mind that it can be undef.

I would probably have put the data in an SQLite database, made the id field the primary key for the table. Use DBD::SQLite through DBI.
If you first prepare a query that contains a placeholder for id and repeatedly execute it for various values of id, performance should be adequate.

As others have stated, you might want to try a database. Another possibility: Use a more complex data structure.
For example, if your hash table is by dates, you could have each entry in the hash be a reference to an array of ids.
Using your example:
$hash{1/1/2000} = [ 1, 2, 3];
$hash{1/2/2000} = [ 4, 5 ];
That way, if you find a date, you can quickly find all IDs for that date. Sorting the keys will allow you to find a range of dates. This is especially true if you store the dates in a more sortable format. For example, in YYYYMMDD format or in standard Unix date/time format.
For example:
$hash{20000101} = [ 1, 2, 3];
$hash{20000102} = [ 4, 5];
You said there are a few hundred dates, so sorting your dates will be fairly quick.
Are you familiar with things like hashes of arrays? You can look at the Perl documentation for Mark's very short tutorial about references and perldsc which actually shows you how to setup a hashes of arrays.
Now, looking up a date via an id...
Imagine a more complex structure. The first level will have two elements DATES and IDS. Then, you can have the IDS part be a reference to a hash of IDs, and the DATES key be the same structure as mentioned above. You'll have to keep these two structures in sync, though...
$dataHash->{DATES}->{20020101}->[0] = 1;
$dataHash->{DATES}->{20020101}->[2] = 2;
$dataHash->{DATES}->{20020101}->[3] = 3;
$dateHash->{IDS}->{1} = 20020101;
$dateHash->{IDS}->{2} = 20020101;
$dateHash->{IDS}->{3} = 20020101;
Hmm... This is getting complex. Maybe you should look at the Perl tutorial on object oriented programming.
Writing the stuff off the top of my head without any testing:
package DataStruct;
sub new {
my $class = shift;
my $self = {};
bless $self, $class;
my $self->_Id;
my $self->_Date;
return $self;
}
sub _Id {
my $self = shift;
my $id = shift;
my $date = shift;
$self->{IDS} = {} if not exists $self->{IDS};
if (defined $id and defined $date) {
$self->{IDS}->{$id} = $date;
}
if (defined ($id) {
return $self->{IDS}->{$id};
else {
return keys %{self->{IDS}};
}
}
sub _Date {
my $self = shift;
my $date = shift;
my $id = shift;
$self->{DATES} = {} if not exists $self->{DATES};
if (defined $date and defined $id) {
$self->{DATES}->{$date} = [] if not defined $self->{DATES}->{$date};
push #{$self->{DATES}->{$date}}, $id;
};
if ($date) {
return #{$self->{DATES}->{$date}};
}
else {
return keys %{$self->{DATES};
}
}
sub Define {
my $self = shift;
my $id = shift;
my $date = shift;
$self->_Id($id, $date);
$self->_Date($date, $id);
return $self->_Date($date);
}
sub FetchId {
my $self = shift;
my $id = shift;
return $self->_Id($id);
}
sub FetchDate {
my $self = shift;
my $id = shift;
return $self->_Date;
}
In the above, you create your initial data structure with:
my $struct = DataStruct->new;
Now, to add a date and id, you'd call:
$struct->Define($id, $date);
This will in turn call $struct->_Id($id, $date); and $struct->_Date($date, $Id);. Since these begin with an underscore, they're private and can only be called by other DataStruct methods. You mainly use $struct-Set to put your data in.
To fetch a particular date (or an entire range of dates), you use the $dataStruct->FetchDate($date) method, and to fetch a particular Id you use the $dataStruct->FetchId($id);
Now, the DataStruct package will be used to keep both the IDs hash and the Dates hashes in sync with each other and keep the complexity out of the main part of your program.
There's everything you need! All you have to do is fix my numerous errors, and probably have some routine that will convert a M/D/Y style date to a YYYYMMDDstyle date or to and from the standard date/time internal storage structure. That way, you don't have to worry about fixing the date before calling these routines. Oh, and you'll probably want some sort of error handling too. What if I give you a bad date or Id number?
As others have stated, you're better off using a database structure even if you use a faux database structure like SQLite.
However, I wanted to let you know that Perl is actually quite capable of creating some very integrate data structures which can help in cases like this.
I'd assumed from the way you stated your question, you really weren't familiar with creating these complex data structures. If not, Perl has some excellent tutorials built into Perl itself. And, the command perldoc (which is installed with Perl) can pull up all the Perl documentation. Try perldoc perlreftut and see if it pulls up Mark's tutorial on references.
Once you start getting into more complex data structures, you will learn to use object oriented programming to simplify their handling. Again, there are some excellent tutorials built right into Perl on this (or you can go to the Perldoc webpage).
If you already knew all of this, I apologize. However, at least you have a basis for storing and working with your data.

If you are to go with an approach like this I would think it would make the most sense to do the querying at the database level. Then, with MySQL, for example, you could query using the BETWEEN function with something like SELECT date WHERE $id BETWEEN first_id AND last_id
Then you can create a function in Perl where you pass the id and use the query to retrieve the date.

An attempt to implement Frank's idea:
Given the
sub getDateForId {
use integer;
my ($id, $data) = #_;
my $lo = 0;
my $sz = scalar #$data;
my $hi = $sz - 1;
while ( $lo <= $hi ) {
my $mi = ($lo + $hi) / 2;
if ($data->[$mi]->[0] < $id) {
$lo = $mi + 1;
} elsif ($data->[$mi]->[0] > $id) {
$hi = $mi - 1;
} else {
return $data->[$mi]->[1];
}
}
# $lo > $hi: $id belongs to $hi range
if ($hi < 0) {
return sprintf "** id %d < first id %d **", $id, $data->[0]->[0];
} elsif ($lo >= $sz) {
return sprintf "** id %d > last id %d **", $id, $data->[$sz-1]->[0];
} else {
return sprintf "%s (<== lo %d hi %d)", $data->[$hi]->[1], $lo, $hi;
}
}
and the data
my #data = (
[2, '1/1/2000' ]
, [4, '1/2/2000' ]
, [5, '1/3/2000' ]
, [8, '1/4/2000' ]
);
, the test
for my $id (0..9) {
printf "%d => %s\n", $id, getDateForId( $id, \#data );
}
prints
0 => ** id 0 < first id 2 **
1 => ** id 1 < first id 2 **
2 => 1/1/2000
3 => 1/1/2000 (<== lo 1 hi 0)
4 => 1/2/2000
5 => 1/3/2000
6 => 1/3/2000 (<== lo 3 hi 2)
7 => 1/3/2000 (<== lo 3 hi 2)
8 => 1/4/2000
9 => ** id 9 > last id 8 **

Related

Find nearest option in a Perl Hash

I have a hashref that has data tied to days of the calendar year, for example:
my $calendarEntries = { '1' => 'Entry 1', '5' => 'Entry 2', '15' => 'Entry 3' };
I can obtain the day of the year using DateTime:
state $moduleDateTime = require DateTime;
my $dt = DateTime->now('time_zone' => 'America/Chicago');
my $dayOfTheYear = $dt->strftime('%j');
However, I'm trying to figure out the most efficient way to handle situations where the current day does not match any of the days in the hash. I'd like to always "round down" in those situations. E.g. today (which is the 7th day of the year), I'd like to load the entry with the key '5', since it is the most "recent" entry.
Is there a way to select a key in a hashref that is the closest candidate for being <= $dayOfTheYear? If I were using DBD, I could do a query like this:
'SELECT entry WHERE `key` <= ' . $dayOfTheYear . ' ORDER BY `key` DESC LIMIT 1'
But, I'd rather avoid needing to create a database and call it, if I can do something natively in Perl.
One way, expecting many searches
use List::MoreUtils qw(last_value);
my #entries = sort { $a <=> $b } keys %$calendarEntries;
my $nearest_le = last_value { $day >= $_ } #entries;
This returns the last element that is less or equal, for any input, so the key of interest.
The drawback of using simply a hash is that one needs an extra data structure to build. Any library that offers this sort of lookup must do that as well, of course, but those then come with other goodies and may be considerably better performing (depending on how often this is done).
If this 'rounding' need be done a lot for a given hash then it makes sense to build a lookup table for days, associating each with its nearest key in the hash.† ‡
If #entries is sorted descending ($b <=> $a) then the core List::Util::first does it.
† For example
my %nearest_le;
my #keys = sort { $a <=> $b } keys %$calendarEntries;
for my $day (1..366) {
for my $k (#keys) {
if ($k <= $day) {
$nearest_le{$day} = $k;
}
else { last }
}
};
This enumerates days of the year, as specified in the question.
‡ If this were needed for things other than the days (366 at most), where long lists may be expected, a better algorithmic behavior is afforded by binary searches on sorted lists (O(log n)).
The library used above, List::MoreUtils, also has lower_bound with O(log n)
Returns the index of the first element in LIST which does not compare less than val.
So this needs a few adjustments, for
use List::MoreUtils qw(lower_bound);
my #keys = sort { $a <=> $b } keys %$calendarEntries;
my $nearest_le = exists $calendarEntries->{$day}
? $day
: $keys[ -1 + lower_bound { $_ <=> $day } #keys ];
A nice simple solution.
use List::Util qw( max );
max grep { $_ <= $dayOfTheYear } keys %$calendarEntries
Notes:
Best to make sure $calendarEntries->{ $dayOfTheYear } doesn't exist first.
You'll need to handle the case where there is no matching key.
It's faster than sorting unless you perform many searches. But even then, we're only dealing with at most 365 keys, so simplicity is key here.
The simplest solution is to simply look up the value for your date, and if it is not found, go down until you find a value. In this sample, I included a rudimentary error handling.
use strict;
use warnings;
use feature 'say';
my $calendarEntries = { '1' => 'Entry 1', '5' => 'Entry 2', '15' => 'Entry 3' };
my $find = shift // 7; # for testing purposes
my $date = get_nearest_below($calendarEntries, $find);
if (defined $date) {
say "Nearest date below to '$find' is '$date'";
} else { # error handling
warn "Nearest date below not found for '$find'";
}
sub get_nearest_below {
my ($href, $n) = #_;
while ($n > 0) { # valid dates are > 0
return $n if defined $href->{$n}; # find a defined value
$n--; # or go to the next key below
}
return undef; # or return error if nothing is found before 0
}
Output:
$ foo.pl
Nearest date below to '7' is '5'
$ foo.pl 12
Nearest date below to '12' is '5'
$ foo.pl 123
Nearest date below to '123' is '15'
$ foo.pl 0
Nearest date below not found for '0' at foo.pl line 13.

Assign a variable to another variable by concatenating a string and an int

Holy cats man,
I'm maintaining some pretty awful legacy code and there is a part for adding some values to the database from a HTML form created by a loop and increments the names of the variables it submits to 14, I'm changing the loop to the amount of rows it selects before creating the HTML.
But the issue is with how it inserts the HTML form back into the database.
Here's an ad-hoc version of the way it handles the database inserts
while (my $count <= 14) {
if ($count == 1) {
$name = $name1;
$email = $email1;
}
# ...
if ($count == 14) {
$name = $name14;
$email = $email14;
}
my $sth = $dbh->prepare("INSERT INTO table SET name = ? AND email = ?");
$sth->execute($name, $email);
$count++;
}
While I'm probably just going to rewrite this entire section, I'm curious if you could add something like;
elsif ($count > 14) {
# Say count is 15 and we want to assign
# $name to $name15 using a string and the $count variable here.
$name = "name".$count;
$email = "email".$count;
}
Is that technically feasible?
What you're describing is called a "symbolic reference" in Perl lingo and it's generally frowned upon as a very, very, very... very bad practice because it only works with global variables (which are generally best avoided in their own right) and it's one of the easiest ways to create bugs which are nearly-impossible to find.
But it can be done. And, since you're asking for the sake of maintaining legacy code which is probably already just as bad, I'll show you how:
perl -e '$count14 = 42; $sref = "count14"; print $$sref . "\n"'
It's as simple as that.
But, really, don't do it if you can avoid it.
The general-case solution for avoiding symbolic references is to use a hash instead:
my %values = (name1 => 'Alice', name2 => 'Bob');
for my $count (1 .. 2) {
my $name = $values{'name' . $count};
print "$name\n";
}
In the specific case of a bunch of variables named foo1, foo2, etc., though, you probably want an array instead:
my #names = qw( . Alice Bob ); # '.' is a dummy to fill index 0 so the names start at 1
for my $count (1 .. 2) {
my $name = $names[$count];
print "$name\n";
}
I would strongly advise using one of these other techniques instead of symbolic references unless the existing code heavily depends on having $name1, $name2, etc. available. If you can afford the time to replace all of those with arrays and test that the array-based version still works, you'll have improved the quality of the code for future maintainers (which will probably include yourself).

How to Hash in Perl

I am finding uniques URL in a log file along with the response stamp which can be available using $line[7]. I am using Hash to get the unique URLs.
How can I get the count of Unique URL?
How can I get the average of response time along with the count of Unique URL?
With below code I am getting
url1
url2
url3
but I want it along with the average response time and count of each URL
URL Av.RT Count
url1 10.5 125
url2 9.3 356
url3 7.8 98
Code:
#!/usr/bin/perl
open(IN, "web1.txt") or die "can not open file";
# Hash to store final list of unique IPs
my %uniqueURLs = ();
my $z;
# Read log file line by line
while (<IN>) {
#line = split(" ",$_);
$uniqueURLs{$line[9]}=1;
}
# Go through the hash table and print the keys
# which are the unique IPs
for $url (keys %uniqueURLs) {
print $url . "\n";
}
store a listref in your hashing directory:
$uniqueURLs{$line[9]} = [ <avg response time>, <count> ];
adjust the elements accordingly, eg. the count:
if (defined($uniqueURLs{$line[9]})) {
# url known, increment count,
# update average response time with data from current log entry
$uniqueURLs{$line[9]}->[0] =
(($uniqueURLs{$line[9]}->[0] * $uniqueURLs{$line[9]}->[1]) + ($line[7] + 0.0))
/ ($uniqueURLs{$line[9]}->[1] + 1)
;
$uniqueURLs{$line[9]}->[1] += 1;
}
else {
# url not yet known,
# init count with 1 and average response time with actual response time from log entry
$uniqueURLs{$line[9]} = [ $line[7] + 0.0, 1 ];
}
to print results:
# Go through the hash table and print the keys
# which are the unique IPs
for $url (keys %uniqueURLs) {
printf ( "%s %f %d\n", $url, $uniqueURLs{$url}->[0], $uniqueURLs{$url}->[1]);
}
adding 0.0 will guarantee type coercion from string to float as a safeguard.
Read up on References. Also, read up on modern Perl practices which will help improve your programming skills.
Instead of just using the keys of your hash of unique URLs, you could store information in those hashes. Let's start with just a count of the unique URLs:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
WEB_FILE => "web1.txt",
};
open my $web_fh, "<", WEBFILE; #Autodie will catch this for you
my %unique_urls;
while ( my $line = <$web_fh> ) {
my $url = (split /\s+/, $line)[9];
if ( not exists $unique_urls{$url} ) { #Not really needed
$unique_urls{$url} = 0;
}
$unique_urls{$url} += 1;
}
close $web_fh;
Now, each key in your %unique_urls hash will contain the number of unique URLs you have.
This, by the way, is your code written in a bit more modern style. The use strict; and use warnings; pragmas will catch about 90% of the standard programming errors. The use autodie; will catch exceptions to things that you forget to check. In this case, the program will automatically die if the file doesn't exist.
The three parameter version of the open command is preferred, and so is using scalar variables for file handles. Using scalar variables for the file handle makes them easier to pass in subroutines, and the file will automatically close if the file handle falls out of scope.
However, we want to store in two items per hash. We want to store the unique count, and we want to store something that will help us find the average response time. This is where references come in.
In Perl, variables deal with single data items. A scalar variable (like $foo) deals with an individual data item. Arrays and Hashes (like #foo and %foo) deal with lists of individual data items. References help you get around this limitation.
Let's look at an array of people:
$person[0] = "Bob";
$person[1] = "Ted";
$person[2] = "Carol";
$person[3] = "Alice";
However, people are more than just first names. They have last names, phone numbers, addresses, etc. Let's take a look at a hash for Bob:
my %bob_hash;
$bob_hash{FIRST_NAME} = "Bob";
$bob_hash{LAST_NAME} = "Jones";
$bob_hash{PHONE} = "555-1234";
We can take a reference to this hash by putting a backslash in front of it. A reference is merely the memory address where this hash is stored:
$bob_reference = \%bob_hash;
print "$bob_reference\n": # Prints out something like HASH(0x7fbf79004140)
However, that memory address is a single item, and could be stored in our array of people!
$person[0] = $bob_reference;
If we want to get to the items in our reference, we dereference it by putting the right data type symbol in front. Since this is a hash, we will use %:
$bob_hash = %{ $person[0] };
Perl provides an easy way to dereference hashes with the -> syntax:
$person[0]->{FIRST_NAME} = "Bob";
$person[0]->{LAST_NAME} = "Jones";
$person[0]->{PHONE} = "555-1212";
We'll use the same technique in %unique_urls to store the number of times, and the total amount of response time. (Average will be total time / number of times).
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
WEB_FILE => "web1.txt",
};
open my $web_fh, "<", WEB_FILE; #Autodie will catch this for you
my %unique_urls;
while ( my $line ( <$web_fh> ) {
my $url = (split /\s+/, $line)[9];
my $response_time = (split /\s+/, $line)[10]; #Taking a guess
if ( not exists $unique_urls{$url} ) { #Not really needed
$unique_urls{$url}->{INSTANCES} = 0;
$unique_urls{$url}->{TOTAL_RESP_TIME} = 0;
}
$unique_urls{$url}->{INSTANCES} += 1;
$unique_urls{$url}->{TOTAL_RESP_TIME} += $response_time;
}
$close $web_fh;
Now we can print them out:
print "%20.20s %6s %8s\n", "URL", "INST", "AVE";
for my $url ( sort keys %unique_urls ) {
my $total_resp_time = $unique_urls{$url}->{TOTAL_RESP_TIME};
my $instances = $unique_urls{$url}->{INSTANCES};
my $average = $total_resp_time / $instances
printf "%-20.20s %-6d %-8.5f\n", $url, $instances, $average";
}
I like using printf for tables.
Instead of setting the value to 1 here:
$uniqueURLs{$line[9]}=1;
Store a data structure indicating the response time and the number of times this URL has been seen (so you can properly calculate the average). You can use an array ref, or hashref if you want. If the key doesn't exist yet, that means it hasn't been seen yet, and you can set some initial values.
# Initialize 3-element arrayref: [count, total, average]
$uniqueURLS{$line[9]} = [0, 0, 0] if not exists $uniqueURLS{$line[9]};
$uniqueURLs{$line[9]}->[0]++; # Count
$uniqueURLs{$line[9]}->[1] += $line[7]; # Total time
# Calculate average
$uniqueURLs{$line[9]}->[2] = $uniqueURLs{$line[9]}->[1] / $uniqueURLs{$line[9]}->[0];
One way you can get count of uniqueURLS is by counting the keys:
print scalar(keys %uniqueURLS); # Print number of unique url's
In your loop, you can print out the url and average time like this:
for $url (keys %uniqueURLs) {
print $url, ' - ', $uniqueURLs[$url]->[2], "seconds \n";
}

Perl - Data comparison taking huge time

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program is Read input file in which every line contrain three strings seperated by tab. First is TableName, Second is ALL Column Name with commas in between and third contain the sql to be run. As this utlity is doing comparison of data, so there are two rows for every tablename. One for each DB. So data needs to be picked from each respective db's and then compared column by column.
SQL returns as ID in the result set and if the value is coming from db then it needs be translated to a string by reading from a array (that array contains 100K records with Key and value seperated by ||)
Now I ran this for one set of tables which contains 18K records in each db. There are 8 columns picked from db in each sql. So for every record out of 18K, and then for every field in that record i.e. 8, this script is taking a lot of time.
My question is if someone can look and see if it can be imporoved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
Metadata array contains something like this(there are two i.e. for each db)
111||Code 1
222||Code 2
Please suggest
Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying loop while I can read two lines?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 eq $tablename2;
the rest of the loop;
and save a level of intendation. And better intendation equals better readability makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
...;
for my $colname (#fieldname) { ... }
}
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrive a single element, then assign that element to an array (a collection of multiple values).
Then you called defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
? do {
my ($filtered) = grep /\Q$val1/, #metadata;
substr $filtered, 2 + index $filtered, '||'
} : '';
But this would be even cheaper if you pre-split #metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unneccessary: The hashrefs are assigned to a new value at the next iteration of the loop. Also, it is unneccessary to assist Perls garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
chomp;
my ($key, $value) = split /\|\|/;
$metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)
This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside you inner for loop, there is:
#val1 = $hash_ref1->{$key1}->{$colname};
Did you mean #val1 = #{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined #val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined #val1) will always be true.
Then, you have my #filtered = grep /#val1/, #metadata; Where did #metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

Converting code to perl sub, but not sure I'm doing it right

I'm working from a question I posted earlier (here), and trying to convert the answer to a sub so I can use it multiple times. Not sure that it's done right though. Can anyone provide a better or cleaner sub?
I have a good deal of experience programming, but my primary language is PHP. It's frustrating to know how to execute in one language, but not be able to do it in another.
sub search_for_key
{
my ($args) = #_;
foreach $row(#{$args->{search_ary}}){
print "#$row[0] : #$row[1]\n";
}
my $thiskey = NULL;
my #result = map { $args->{search_ary}[$_][0] } # Get the 0th column...
grep { #$args->{search_in} =~ /$args->{search_ary}[$_][1]/ } # ... of rows where the
0 .. $#array; # first row matches
$thiskey = #result;
print "\nReturning: " . $thiskey . "\n";
return $thiskey;
}
search_for_key({
'search_ary' => $ref_cam_make,
'search_in' => 'Canon EOS Rebel XSi'
});
---Edit---
From the answers so far, I've cobbled together the function below. I'm new to Perl, so I don't really understand much of the syntax. All I know is that it throws an error (Not an ARRAY reference at line 26.) about that grep line.
Since I seem to not have given enough info, I will also mention that:
I am calling this function like this (which may or may not be correct):
search_for_key({
'search_ary' => $ref_cam_make,
'search_in' => 'Canon EOS Rebel XSi'
});
And $ref_cam_make is an array I collect from a database table like this:
$ref_cam_make = $sth->fetchall_arrayref;
And it is in the structure like this (if I understood how to make the associative fetch work properly, I would like to use it like that instead of by numeric keys):
Reference Array
Associative
row[1][cam_make_id]: 13, row[1][name]: Sony
Numeric
row[1][0]: 13, row[1][1]: Sony
row[0][0]: 19, row[0][1]: Canon
row[2][0]: 25, row[2][1]: HP
sub search_for_key
{
my ($args) = #_;
foreach my $row(#{$args->{search_ary}}){
print "#$row[0] : #$row[1]\n";
}
print grep { $args->{search_in} =~ #$args->{search_ary}[$_][1] } #$args->{search_ary};
}
You are moving in the direction of a 2D array, where the [0] element is some sort of ID number and the [1] element is the camera make. Although reasonable in a quick-and-dirty way, such approaches quickly lead to unreadable code. Your project will be easier to maintain and evolve if you work with richer, more declarative data structures.
The example below uses hash references to represent the camera brands. An even nicer approach is to use objects. When you're ready to take that step, look into Moose.
use strict;
use warnings;
demo_search_feature();
sub demo_search_feature {
my #camera_brands = (
{ make => 'Canon', id => 19 },
{ make => 'Sony', id => 13 },
{ make => 'HP', id => 25 },
);
my #test_searches = (
"Sony's Cyber-shot DSC-S600",
"Canon cameras",
"Sony HPX-32",
);
for my $ts (#test_searches){
print $ts, "\n";
my #hits = find_hits($ts, \#camera_brands);
print ' => ', cb_stringify($_), "\n" for #hits;
}
}
sub cb_stringify {
my $cb = shift;
return sprintf 'id=%d make=%s', $cb->{id}, $cb->{make};
}
sub find_hits {
my ($search, $camera_brands) = #_;
return grep { $search =~ $_->{make} } #$camera_brands;
}
This whole sub is really confusing, and I'm a fairly regular perl user. Here are some blanket suggestions.
Do not create your own undef ever -- use undef then return at the bottom return $var // 'NULL'.
Do not ever do this: foreach $row, because foreach my $row is less prone to create problems. Localizing variables is good.
Do not needlessly concatenate, for it offends the style god: not this, print "\nReturning: " . $thiskey . "\n";, but print "\nReturning: $thiskey\n";, or if you don't need the first \n: say "Returning: $thiskey;" (5.10 only)
greping over 0 .. $#array; is categorically lame, just grep over the array: grep {} #{$foo[0]}, and with that code being so complex you almost certainly don't want grep (though I don't understand what you're doing to be honest.). Check out perldoc -q first -- in short grep doesn't stop until the end.
Lastly, do not assign an array to a scalar: $thiskey = #result; is an implicit $thiskey = scalar #result; (see perldoc -q scalar) for more info. What you probably want is to return the array reference. Something like this (which eliminates $thiskey)
printf "\nReturning: %s\n", join ', ', #result;
#result ? \#result : 'NULL';
If you're intending to return whether a match is found, this code should work (inefficiently). If you're intending to return the key, though, it won't -- the scalar value of #result (which is what you're getting when you say $thiskey = #result;) is the number of items in the list, not the first entry.
$thiskey = #result; should probably be changed to $thiskey = $result[0];, if you want mostly-equivalent functionality to the code you based this off of. Note that it won't account for multiple matches anymore, though, unless you return #result in its entirety, which kinda makes more sense anyway.