I have a hashref that has data tied to days of the calendar year, for example:
my $calendarEntries = { '1' => 'Entry 1', '5' => 'Entry 2', '15' => 'Entry 3' };
I can obtain the day of the year using DateTime:
state $moduleDateTime = require DateTime;
my $dt = DateTime->now('time_zone' => 'America/Chicago');
my $dayOfTheYear = $dt->strftime('%j');
However, I'm trying to figure out the most efficient way to handle situations where the current day does not match any of the days in the hash. I'd like to always "round down" in those situations. E.g. today (which is the 7th day of the year), I'd like to load the entry with the key '5', since it is the most "recent" entry.
Is there a way to select a key in a hashref that is the closest candidate for being <= $dayOfTheYear? If I were using DBD, I could do a query like this:
'SELECT entry WHERE `key` <= ' . $dayOfTheYear . ' ORDER BY `key` DESC LIMIT 1'
But, I'd rather avoid needing to create a database and call it, if I can do something natively in Perl.
One way, expecting many searches
use List::MoreUtils qw(last_value);
my #entries = sort { $a <=> $b } keys %$calendarEntries;
my $nearest_le = last_value { $day >= $_ } #entries;
This returns the last element that is less or equal, for any input, so the key of interest.
The drawback of using simply a hash is that one needs an extra data structure to build. Any library that offers this sort of lookup must do that as well, of course, but those then come with other goodies and may be considerably better performing (depending on how often this is done).
If this 'rounding' need be done a lot for a given hash then it makes sense to build a lookup table for days, associating each with its nearest key in the hash.† ‡
If #entries is sorted descending ($b <=> $a) then the core List::Util::first does it.
† For example
my %nearest_le;
my #keys = sort { $a <=> $b } keys %$calendarEntries;
for my $day (1..366) {
for my $k (#keys) {
if ($k <= $day) {
$nearest_le{$day} = $k;
}
else { last }
}
};
This enumerates days of the year, as specified in the question.
‡ If this were needed for things other than the days (366 at most), where long lists may be expected, a better algorithmic behavior is afforded by binary searches on sorted lists (O(log n)).
The library used above, List::MoreUtils, also has lower_bound with O(log n)
Returns the index of the first element in LIST which does not compare less than val.
So this needs a few adjustments, for
use List::MoreUtils qw(lower_bound);
my #keys = sort { $a <=> $b } keys %$calendarEntries;
my $nearest_le = exists $calendarEntries->{$day}
? $day
: $keys[ -1 + lower_bound { $_ <=> $day } #keys ];
A nice simple solution.
use List::Util qw( max );
max grep { $_ <= $dayOfTheYear } keys %$calendarEntries
Notes:
Best to make sure $calendarEntries->{ $dayOfTheYear } doesn't exist first.
You'll need to handle the case where there is no matching key.
It's faster than sorting unless you perform many searches. But even then, we're only dealing with at most 365 keys, so simplicity is key here.
The simplest solution is to simply look up the value for your date, and if it is not found, go down until you find a value. In this sample, I included a rudimentary error handling.
use strict;
use warnings;
use feature 'say';
my $calendarEntries = { '1' => 'Entry 1', '5' => 'Entry 2', '15' => 'Entry 3' };
my $find = shift // 7; # for testing purposes
my $date = get_nearest_below($calendarEntries, $find);
if (defined $date) {
say "Nearest date below to '$find' is '$date'";
} else { # error handling
warn "Nearest date below not found for '$find'";
}
sub get_nearest_below {
my ($href, $n) = #_;
while ($n > 0) { # valid dates are > 0
return $n if defined $href->{$n}; # find a defined value
$n--; # or go to the next key below
}
return undef; # or return error if nothing is found before 0
}
Output:
$ foo.pl
Nearest date below to '7' is '5'
$ foo.pl 12
Nearest date below to '12' is '5'
$ foo.pl 123
Nearest date below to '123' is '15'
$ foo.pl 0
Nearest date below not found for '0' at foo.pl line 13.
Related
I was trying to sort datetimes in Perl. The datetime string is in a hash of arrays. After lots of googling I used Perl's sort function. To my surprise it worked. My code & input is below:
use strict;
use warnings;
my %hashofarrays;
$hashofarrays{'joe tribiani'} = ['16/3/28 13:42','XII','99.93%'];
$hashofarrays{'Ross'} = ['16/3/28 13:43','XII','76.93%'];
$hashofarrays{'sue grace'} = ['11/7/5 12:07','VI','77.58%'];
foreach my $key ( sort{$hashofarrays{$a}[0] cmp $hashofarrays{$b}[0]} keys %hashofarrays ) {
print "$key =>", join (", ", #{$hashofarrays{$key}}), "\n";
}
Am I doing this correctly? If I am, how is this working?
If it is wrong then what should I do to sort the date time string?
The result should list records sorted in ascending order by datetime.
The datetime format is 'YY/MM/DD' or 'YY/M/D'
My data:
joe tribiani, 16/3/28 13:42,XII,99.93%
Ross,16/3/28 13:43,XII,95.93%
sue grace,11/7/5 12:07,VI,77.58%
My excepted output:
sue grace =>11/7/5 12:07, VI, 77.58%
joe tribiani =>16/3/28 13:42, XII, 99.93%
Ross =>16/3/28 13:43, XII, 76.93%
What your code does
foreach my $key (
sort { $hashofarrays{$a}[0] cmp $hashofarrays{$b}[0] }
keys %hashofarrays
) {
print "$key =>", join (", ", #{$hashofarrays{$key}}), "\n";
}
This uses the keys of the hash to sort the elements of the hash on the first element of the array reference that's inside of that key in the hash. It uses cmp, which is an ascii-betical sorting rather than a numerical one. That means that 10 will come before 2 because 1 is a lower character than 2.
For date sorting, this makes sense. But because your date format is not consistently with two digits for months and days, it will not sort properly.
18/5/1 # sorted last
18/10/1 # sorted first
For those two dates, the later October date will be sorted first which is wrong.
What you need to do instead
If you cannot clean up the input data, you need to process it to figure out the value behind the date instead of relying on the presentation for sorting. Parsing dates is its own problem, and we'll use Time::Piece to do it, which has been in the Perl core for a while now.
We could do it like this pseudo-code:
sort { parse($a) <=> parse($b) } ...
That would work, but would be slow, especially for a larger number of records, because it parses the date with every comparison. If you don't know how sorting works internally, it compares values to each other, maybe changes position of the two values, and repeats.
There is a way of making this more efficient, but it's a bit more complex.
use strict;
use warnings;
use Time::Piece;
my %values = ( 'joe tribiani' => [ '16/3/28 13:42', 'XII', '99.93%' ],
'Ross' => [ '16/3/28 13:43', 'XII', '76.93%' ],
'sue grace' => [ '11/7/5 12:07', 'VI', '77.58%' ], );
my #sorted_keys = map { $_->[1] }
sort { $a->[0] <=> $b->[0] }
map {
[ Time::Piece->strptime( $values{$_}->[0], '%y/%m/%d %H:%M' )->epoch, $_ ]
}
keys %values;
use Data::Dumper;
print Dumper #sorted_keys;
This is called the Schwartzian transform. It essentially processes the values once, puts them into another array reference, together with the actual values that are being sorted. It then sorts on the pre-processed values, and then puts them back.
This is way more efficient, and because of the parsing it will work on all dates that might show up in your values without sorting them in the wrong way.
Note we need to use the numerical comparison operator <=> now, because we are sorting on epoch values, which are just numbers (lots of seconds from 1970-01-01T00:00:00).
I wrote up a Perl line for querying in MongoDB, but it doesn't seem to be working.
my $cursor = $collection->find({'genes.symbol'=>{'$in' => [#gene_list]}});
The elements (gene symbols) in the #gene_list array are separated by spaces (" "). I don't know if this is the issue, because in MongoDB shell, the elements should be separated by a comma. If the #gene_list has to be an array with the elements separated by comma, how should I do it?
This #gene_list array include 10 genes:
"RAD51C","FRAS1","GRIP1","FREM2","CHMP1A","WRAP53","VAX1","ACTG2","RNASEH2A","CTC1"
so, when I do
my $count = $myCursor->count;
print $count;
I assumed it would print '10' as the count, however, based on my Perl line, it always printed '0', which means the query was not successful.
It's a shame you're so reluctant to give us any more information about your problem. As it is, all I can do is to offer you an example program which inserts 13 documents into a collection and then uses find in the same way as your own code does to retrieve a subset
The only thing I can spot that may be wrong in your own code is that you are using genes.symbol as a field name, which is a little bit odd. Are you sure it the collection isn't genes while the field is symbol?
use strict;
use warnings 'all';
use feature 'say';
use MongoDB;
my $dbh = MongoDB->connect;
my $collection = $dbh->ns('test.test');
$collection->delete_many({}); # Empty the collection
for my $val ( 'A' .. 'M' ) {
$collection->insert_one({data => $val});
}
my #filter = qw/ A F N Z /;
my $curs = $collection->find({ data => { '$in' => [ #filter ] } });
my $n = 0;
while ( my $doc = $curs->next ) {
printf "%2d: %s\n", ++$n, $doc->{data};
}
output
1: A
2: F
I have two variables, id and date. There are millions of distinct ids, but just a few hundred distinct dates. The ids are sequential and the dates are increasing with id. Something like this:
id date
1 1/1/2000
2 1/1/2000
3 1/1/2000
4 1/2/2000
5 1/2/2000
In Perl, I need to create a function that will return a date given the id. My first thought was just to make a hash table. This will work, but given that I have millions of records, I thought it might make more sense to work with date ranges. So in the example above, rather than storing 5 records, I could store 2 records: one for each date with the earliest and latest date that correspond to the id:
date first_id last_id
1/1/2000 1 3
1/2/2000 4 5
(In my actual data, this will allow me to store just a few thousand records, rather than millions.)
My question is, given an id, what is the best way to look up the date given this structure? So given id=2, I want to return 1/1/2000 because 2 is between 1 and 3, and therefore corresponds to the first record.
Thanks for any advice.
Use a [semi] sparse array. Performance should be fine. You're looking at a few megabytes of memory usage per million records. If you convert the date to an integer epoch before storing it, even better.
use Time::Local;
my #date_by_id;
while (<FILE>) {
chomp;
my ($id, $date) = split /\s+/;
my ($mon, $mday, $year) = split /\//, $date;
$mon--;
$year -= 1900;
$date_by_id[$id] = timelocal 0, 0, 0,
$mday, $mon, $year;
}
Performance should be good enough that you won't need to wrap it in a function. Just use $date_by_id[<ID>] where needed, keeping in mind that it can be undef.
I would probably have put the data in an SQLite database, made the id field the primary key for the table. Use DBD::SQLite through DBI.
If you first prepare a query that contains a placeholder for id and repeatedly execute it for various values of id, performance should be adequate.
As others have stated, you might want to try a database. Another possibility: Use a more complex data structure.
For example, if your hash table is by dates, you could have each entry in the hash be a reference to an array of ids.
Using your example:
$hash{1/1/2000} = [ 1, 2, 3];
$hash{1/2/2000} = [ 4, 5 ];
That way, if you find a date, you can quickly find all IDs for that date. Sorting the keys will allow you to find a range of dates. This is especially true if you store the dates in a more sortable format. For example, in YYYYMMDD format or in standard Unix date/time format.
For example:
$hash{20000101} = [ 1, 2, 3];
$hash{20000102} = [ 4, 5];
You said there are a few hundred dates, so sorting your dates will be fairly quick.
Are you familiar with things like hashes of arrays? You can look at the Perl documentation for Mark's very short tutorial about references and perldsc which actually shows you how to setup a hashes of arrays.
Now, looking up a date via an id...
Imagine a more complex structure. The first level will have two elements DATES and IDS. Then, you can have the IDS part be a reference to a hash of IDs, and the DATES key be the same structure as mentioned above. You'll have to keep these two structures in sync, though...
$dataHash->{DATES}->{20020101}->[0] = 1;
$dataHash->{DATES}->{20020101}->[2] = 2;
$dataHash->{DATES}->{20020101}->[3] = 3;
$dateHash->{IDS}->{1} = 20020101;
$dateHash->{IDS}->{2} = 20020101;
$dateHash->{IDS}->{3} = 20020101;
Hmm... This is getting complex. Maybe you should look at the Perl tutorial on object oriented programming.
Writing the stuff off the top of my head without any testing:
package DataStruct;
sub new {
my $class = shift;
my $self = {};
bless $self, $class;
my $self->_Id;
my $self->_Date;
return $self;
}
sub _Id {
my $self = shift;
my $id = shift;
my $date = shift;
$self->{IDS} = {} if not exists $self->{IDS};
if (defined $id and defined $date) {
$self->{IDS}->{$id} = $date;
}
if (defined ($id) {
return $self->{IDS}->{$id};
else {
return keys %{self->{IDS}};
}
}
sub _Date {
my $self = shift;
my $date = shift;
my $id = shift;
$self->{DATES} = {} if not exists $self->{DATES};
if (defined $date and defined $id) {
$self->{DATES}->{$date} = [] if not defined $self->{DATES}->{$date};
push #{$self->{DATES}->{$date}}, $id;
};
if ($date) {
return #{$self->{DATES}->{$date}};
}
else {
return keys %{$self->{DATES};
}
}
sub Define {
my $self = shift;
my $id = shift;
my $date = shift;
$self->_Id($id, $date);
$self->_Date($date, $id);
return $self->_Date($date);
}
sub FetchId {
my $self = shift;
my $id = shift;
return $self->_Id($id);
}
sub FetchDate {
my $self = shift;
my $id = shift;
return $self->_Date;
}
In the above, you create your initial data structure with:
my $struct = DataStruct->new;
Now, to add a date and id, you'd call:
$struct->Define($id, $date);
This will in turn call $struct->_Id($id, $date); and $struct->_Date($date, $Id);. Since these begin with an underscore, they're private and can only be called by other DataStruct methods. You mainly use $struct-Set to put your data in.
To fetch a particular date (or an entire range of dates), you use the $dataStruct->FetchDate($date) method, and to fetch a particular Id you use the $dataStruct->FetchId($id);
Now, the DataStruct package will be used to keep both the IDs hash and the Dates hashes in sync with each other and keep the complexity out of the main part of your program.
There's everything you need! All you have to do is fix my numerous errors, and probably have some routine that will convert a M/D/Y style date to a YYYYMMDDstyle date or to and from the standard date/time internal storage structure. That way, you don't have to worry about fixing the date before calling these routines. Oh, and you'll probably want some sort of error handling too. What if I give you a bad date or Id number?
As others have stated, you're better off using a database structure even if you use a faux database structure like SQLite.
However, I wanted to let you know that Perl is actually quite capable of creating some very integrate data structures which can help in cases like this.
I'd assumed from the way you stated your question, you really weren't familiar with creating these complex data structures. If not, Perl has some excellent tutorials built into Perl itself. And, the command perldoc (which is installed with Perl) can pull up all the Perl documentation. Try perldoc perlreftut and see if it pulls up Mark's tutorial on references.
Once you start getting into more complex data structures, you will learn to use object oriented programming to simplify their handling. Again, there are some excellent tutorials built right into Perl on this (or you can go to the Perldoc webpage).
If you already knew all of this, I apologize. However, at least you have a basis for storing and working with your data.
If you are to go with an approach like this I would think it would make the most sense to do the querying at the database level. Then, with MySQL, for example, you could query using the BETWEEN function with something like SELECT date WHERE $id BETWEEN first_id AND last_id
Then you can create a function in Perl where you pass the id and use the query to retrieve the date.
An attempt to implement Frank's idea:
Given the
sub getDateForId {
use integer;
my ($id, $data) = #_;
my $lo = 0;
my $sz = scalar #$data;
my $hi = $sz - 1;
while ( $lo <= $hi ) {
my $mi = ($lo + $hi) / 2;
if ($data->[$mi]->[0] < $id) {
$lo = $mi + 1;
} elsif ($data->[$mi]->[0] > $id) {
$hi = $mi - 1;
} else {
return $data->[$mi]->[1];
}
}
# $lo > $hi: $id belongs to $hi range
if ($hi < 0) {
return sprintf "** id %d < first id %d **", $id, $data->[0]->[0];
} elsif ($lo >= $sz) {
return sprintf "** id %d > last id %d **", $id, $data->[$sz-1]->[0];
} else {
return sprintf "%s (<== lo %d hi %d)", $data->[$hi]->[1], $lo, $hi;
}
}
and the data
my #data = (
[2, '1/1/2000' ]
, [4, '1/2/2000' ]
, [5, '1/3/2000' ]
, [8, '1/4/2000' ]
);
, the test
for my $id (0..9) {
printf "%d => %s\n", $id, getDateForId( $id, \#data );
}
prints
0 => ** id 0 < first id 2 **
1 => ** id 1 < first id 2 **
2 => 1/1/2000
3 => 1/1/2000 (<== lo 1 hi 0)
4 => 1/2/2000
5 => 1/3/2000
6 => 1/3/2000 (<== lo 3 hi 2)
7 => 1/3/2000 (<== lo 3 hi 2)
8 => 1/4/2000
9 => ** id 9 > last id 8 **
Let's say I have this
#!/usr/bin/perl
%x = ('a' => 1, 'b' => 2, 'c' => 3);
and I would like to know if the value 2 is a hash value in %x.
How is that done?
Fundamentally, a hash is a data structure optimized for solving the converse question, knowing whether the key 2 is present. But it's hard to judge without knowing, so let's assume that won't change.
Possibilities presented here will depend on:
how often you need to do it
how dynamic the hash is
One-time op
grep $_==2, values %x (also spelled grep {$_==1} values %x) will return a list of as many 2s as are present in the hash, or, in scalar context, the number of matches. Evaluated as a boolean in a condition, it yields just what you want.
grep works on versions of Perl as old as I can remember.
use List::Util qw(first); first {$_==2} values %x returns only the first match, undef if none. That makes it faster, as it will short-circuit (stop examining elements) as soon as it succeeds. This isn't a problem for 2, but take care that the returned element doesn't necessarily evaluate to boolean true. Use defined in those cases.
List::Util is a part of the Perl core since 5.8.
use List::MoreUtils qw(any); any {$_==2} values %x returns exactly the information you requested as a boolean, and exhibits the short-circuiting behavior.
List::MoreUtils is available from CPAN.
2 ~~ [values %x] returns exactly the information you requested as a boolean, and exhibits the short-circuiting behavior.
Smart matching is available in Perl since 5.10.
Repeated op, static hash
Construct a hash that maps values to keys, and use that one as a natural hash to test key existence.
my %r = reverse %x;
if ( exists $r{2} ) { ... }
Repeated op, dynamic hash
Use a reverse lookup as above. You'll need to keep it up to date, which is left as an exercise to the reader/editor. (hint: value collisions are tricky)
Shorter answer using smart match (Perl version 5.10 or later):
print 2 ~~ [values %x];
my %reverse = reverse %x;
if( defined( $reverse{2} ) ) {
print "2 is a value in the hash!\n";
}
If you want to find out the keys for which the value is 2:
foreach my $key ( keys %x ) {
print "2 is the value for $key\n" if $x{$key} == 2;
}
Everyone's answer so far was not performance-driven. While the smart-match (~~) solution short circuits (e.g. stops searching when something is found), the grep ones do not.
Therefore, here's a solution which may have better performance for Perl before 5.10 that doesn't have smart match operator:
use List::MoreUtils qw(any);
if (any { $_ == 2 } values %x) {
print "Found!\n";
}
Please note that this is just a specific example of searching in a list (values %x) in this case and as such, if you care about performance, the standard performance analysis of searching in a list apply as discussed in detail in this answer
grep and values
my %x = ('a' => 1, 'b' => 2, 'c' => 3);
if (grep { $_ == 2 } values %x ) {
print "2 is in hash\n";
}
else {
print "2 is not in hash\n";
}
See also: perldoc -q hash
Where $count would be the result:
my $count = grep { $_ == 2 } values %x;
This will not only show you if it's a value in the hash, but how many times it occurs as a value. Alternatively you can do it like this as well:
my $count = grep {/2/} values %x;
I'm working from a question I posted earlier (here), and trying to convert the answer to a sub so I can use it multiple times. Not sure that it's done right though. Can anyone provide a better or cleaner sub?
I have a good deal of experience programming, but my primary language is PHP. It's frustrating to know how to execute in one language, but not be able to do it in another.
sub search_for_key
{
my ($args) = #_;
foreach $row(#{$args->{search_ary}}){
print "#$row[0] : #$row[1]\n";
}
my $thiskey = NULL;
my #result = map { $args->{search_ary}[$_][0] } # Get the 0th column...
grep { #$args->{search_in} =~ /$args->{search_ary}[$_][1]/ } # ... of rows where the
0 .. $#array; # first row matches
$thiskey = #result;
print "\nReturning: " . $thiskey . "\n";
return $thiskey;
}
search_for_key({
'search_ary' => $ref_cam_make,
'search_in' => 'Canon EOS Rebel XSi'
});
---Edit---
From the answers so far, I've cobbled together the function below. I'm new to Perl, so I don't really understand much of the syntax. All I know is that it throws an error (Not an ARRAY reference at line 26.) about that grep line.
Since I seem to not have given enough info, I will also mention that:
I am calling this function like this (which may or may not be correct):
search_for_key({
'search_ary' => $ref_cam_make,
'search_in' => 'Canon EOS Rebel XSi'
});
And $ref_cam_make is an array I collect from a database table like this:
$ref_cam_make = $sth->fetchall_arrayref;
And it is in the structure like this (if I understood how to make the associative fetch work properly, I would like to use it like that instead of by numeric keys):
Reference Array
Associative
row[1][cam_make_id]: 13, row[1][name]: Sony
Numeric
row[1][0]: 13, row[1][1]: Sony
row[0][0]: 19, row[0][1]: Canon
row[2][0]: 25, row[2][1]: HP
sub search_for_key
{
my ($args) = #_;
foreach my $row(#{$args->{search_ary}}){
print "#$row[0] : #$row[1]\n";
}
print grep { $args->{search_in} =~ #$args->{search_ary}[$_][1] } #$args->{search_ary};
}
You are moving in the direction of a 2D array, where the [0] element is some sort of ID number and the [1] element is the camera make. Although reasonable in a quick-and-dirty way, such approaches quickly lead to unreadable code. Your project will be easier to maintain and evolve if you work with richer, more declarative data structures.
The example below uses hash references to represent the camera brands. An even nicer approach is to use objects. When you're ready to take that step, look into Moose.
use strict;
use warnings;
demo_search_feature();
sub demo_search_feature {
my #camera_brands = (
{ make => 'Canon', id => 19 },
{ make => 'Sony', id => 13 },
{ make => 'HP', id => 25 },
);
my #test_searches = (
"Sony's Cyber-shot DSC-S600",
"Canon cameras",
"Sony HPX-32",
);
for my $ts (#test_searches){
print $ts, "\n";
my #hits = find_hits($ts, \#camera_brands);
print ' => ', cb_stringify($_), "\n" for #hits;
}
}
sub cb_stringify {
my $cb = shift;
return sprintf 'id=%d make=%s', $cb->{id}, $cb->{make};
}
sub find_hits {
my ($search, $camera_brands) = #_;
return grep { $search =~ $_->{make} } #$camera_brands;
}
This whole sub is really confusing, and I'm a fairly regular perl user. Here are some blanket suggestions.
Do not create your own undef ever -- use undef then return at the bottom return $var // 'NULL'.
Do not ever do this: foreach $row, because foreach my $row is less prone to create problems. Localizing variables is good.
Do not needlessly concatenate, for it offends the style god: not this, print "\nReturning: " . $thiskey . "\n";, but print "\nReturning: $thiskey\n";, or if you don't need the first \n: say "Returning: $thiskey;" (5.10 only)
greping over 0 .. $#array; is categorically lame, just grep over the array: grep {} #{$foo[0]}, and with that code being so complex you almost certainly don't want grep (though I don't understand what you're doing to be honest.). Check out perldoc -q first -- in short grep doesn't stop until the end.
Lastly, do not assign an array to a scalar: $thiskey = #result; is an implicit $thiskey = scalar #result; (see perldoc -q scalar) for more info. What you probably want is to return the array reference. Something like this (which eliminates $thiskey)
printf "\nReturning: %s\n", join ', ', #result;
#result ? \#result : 'NULL';
If you're intending to return whether a match is found, this code should work (inefficiently). If you're intending to return the key, though, it won't -- the scalar value of #result (which is what you're getting when you say $thiskey = #result;) is the number of items in the list, not the first entry.
$thiskey = #result; should probably be changed to $thiskey = $result[0];, if you want mostly-equivalent functionality to the code you based this off of. Note that it won't account for multiple matches anymore, though, unless you return #result in its entirety, which kinda makes more sense anyway.