I have a bunch of numeric timestamps that I want to check against a range to see whether they fall within a particular range of dates, basically like a BETWEEN ... AND ... match in SQL. The obvious data structure would be a B-tree, but while there are a number of B-tree implementations on CPAN, they only seem to implement exact matching. Berkeley DB has the same problem: there are B-tree indices, but no range matching.
What would be the simplest way to do this? I don't want to use an SQL database unless I have to.
Clarification: I have a lot of these, so I'm looking for an efficient method, not just grep over an array.
grep will be fast, even on a million of them.
# Get everything between 500 and 10,000:
my @items = 1 .. 1_000_000;

my $min = 500;
my $max = 10_000;

my @matches = grep {
    $_ >= $min && $_ <= $max
} @items;
Run under time, I get this:
time perl million.pl
real 0m0.316s
user 0m0.210s
sys 0m0.070s
Timestamps are numbers: why not use the common numerical comparison operators like > and <?
If you have many timestamps, the problem is no different as long as you only want to filter your set once: that's O(n), and every other method will take longer.
On the other hand, with a huge set from which you want to extract many different ranges, it can be more efficient to sort the items first. Call the number of searches m: the complexity of direct filtering is O(m·n), while sort followed by binary search is O(n·log(n) + m·log(n)), which is usually much better.
Any O(n·log(n)) sort method will do, including the built-in sort operator (or a B-tree like you suggested). The major difference between efficient sorting methods is whether your memory can hold the full set. If there is a memory bottleneck keeping both the data and the keys (timestamps) in memory, you can keep only the timestamps plus an index into the data in memory, and the real data elsewhere (a disk file, a database). But if your data set is really that big, the most efficient solution would probably be to put the whole thing in a database with an index on the timestamp (tying a hash to a database is easy in Perl).
Then you have your range: use a dichotomic (binary) search to find the index of the first element in the range and of the last; each lookup is O(log(n)). (If you do a linear search instead, the whole purpose of sorting is defeated.)
Below is an example of using sort and binary_search on an array of timestamps; extending it to a data structure with a timestamp plus content is left as an exercise.
use strict;
use warnings;
use Search::Binary;

my @array = sort { $a <=> $b } ((1, 2, 1, 1, 2, 3, 2, 2, 8, 3, 8, 3) x 100000);
my $nbelt = @array;    # element count (scalar context)

my $lasti = 0;
# Read function for Search::Binary: given (handle, value, position),
# return (comparison result, position actually read).
sub cmpfn
{
    my ($handle, $val, $i) = @_;
    $i = $lasti + 1 unless defined $i;   # undef position means "read the next record"
    my $record = $array[$i];
    $lasti = $i;
    return ($val <=> $record, $i);
}

my $pos = binary_search(0, $nbelt - 1, 2, \&cmpfn);
print "found at $pos\n";
I haven't used it, but I found this while searching CPAN, and it may provide what you want: you can use Tree::Binary to construct your data and subclass Tree::Binary::Visitor::Base to do your range queries.
Another easy way is to use SQLite.
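For example, here is a minimal sketch of the SQLite route with DBI and DBD::SQLite; the file, table, and column names are made up for illustration, and the index is what makes the BETWEEN range scan cheap:

use strict;
use warnings;
use DBI;

# Hypothetical database file and table; adjust to taste.
my $dbh = DBI->connect("dbi:SQLite:dbname=timestamps.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("CREATE TABLE IF NOT EXISTS events (ts INTEGER)");
$dbh->do("CREATE INDEX IF NOT EXISTS idx_events_ts ON events (ts)");

# BETWEEN can use the index, so the range scan is roughly O(log n + k).
my $rows = $dbh->selectcol_arrayref(
    "SELECT ts FROM events WHERE ts BETWEEN ? AND ?",
    undef, 500, 10_000,
);
print "$_\n" for @$rows;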
I'm messing around in Perl, and I want to build a script to print hockey goals and assists, along with the date. It's about 40 rows, and I want to associate each date with the two pieces of information. I'm stuck on how to implement this.
Something like:

10/24 2 goals 1 assist
As well as filtering the totals. I could do this in a simple DB and use SQL, but that seems like an overblown solution (plus I want to learn more Perl). I feel I may be overthinking this. Any help is much appreciated.
When you represent structured data in Perl (and many other languages), it's common to use a hash. Perl doesn't have structs, so developers typically use hashes with well-known keys (which is also what many object-oriented frameworks in Perl end up doing under the hood). Without knowing anything else about your other goals, I'd start with something like this:
my %data = (
    "2022-11-24" => { goals => 2, assists => 1 },
    "2022-11-23" => { goals => 3, assists => 0 },
    ...
);
From this data structure, you can:
# get today's number of assists
my $assists_today = $data{"2022-11-24"}{assists};
# sum up each day's number of goals
my $total_goals = 0;
$total_goals += $data{$_}{goals} foreach keys %data;
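Putting those together, here is a small sketch that prints one line per date in your 10/24 2 goals 1 assist format plus the totals (the date reformatting shown is just one way to do it):

use strict;
use warnings;

my %data = (
    "2022-11-24" => { goals => 2, assists => 1 },
    "2022-11-23" => { goals => 3, assists => 0 },
);

my ($total_goals, $total_assists) = (0, 0);
for my $date (sort keys %data) {
    # Turn "2022-11-24" into "11/24" for display.
    my ($y, $m, $d) = split /-/, $date;
    printf "%s/%s %d goals %d assist%s\n",
        $m, $d, $data{$date}{goals}, $data{$date}{assists},
        $data{$date}{assists} == 1 ? "" : "s";
    $total_goals   += $data{$date}{goals};
    $total_assists += $data{$date}{assists};
}
print "Total: $total_goals goals, $total_assists assists\n";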
I am writing a very small URL shortener with Dancer. It uses the REST plugin to store a posted URL in a database with a six-character string, which the user then uses to access the shortened URL.
Now I am a bit unsure about my random string generation method.
sub generate_random_string {
    my $length_of_randomstring = shift;  # the length of the random string to generate

    my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '_');
    my $random_string;
    for (1 .. $length_of_randomstring) {
        # rand @chars generates a random number between 0 and scalar @chars
        $random_string .= $chars[rand @chars];
    }

    # Start over if the string is already in the database
    # (note the return: without it the duplicate string would be returned anyway)
    return generate_random_string($length_of_randomstring)
        if database->quick_select('urls', { shortcut => $random_string });

    return $random_string;
}
This generates a six-character string and calls the function recursively if the generated string is already in the DB. I know there are 63^6 possible strings, but this will take some time once the database gathers more entries, and it might even turn into nearly infinite recursion, which I want to prevent.
Are there ways to generate unique random strings, which prevent recursion?
Thanks in advance
We don't really need to be hand-wavy about how many iterations (or recursions) of your function there will be. I believe that at every invocation, the number of attempts is geometrically distributed (i.e. the number of trials before the first success is governed by the geometric distribution), which has mean 1/p, where p is the probability of successfully finding an unused string. I believe that p is just 1 - n/63^6, where n is the number of currently stored strings. Therefore, I think you would need to have stored about 31 billion strings (~63^6/2) in your database before your function recurses on average more than 2 times per call (p = 0.5).
Furthermore, the variance of the geometric distribution is (1-p)/p^2, so even at 31 billion entries, one standard deviation is just sqrt(2). Therefore I expect that ~99% of the time the loop will take fewer than 2 + 2*sqrt(2) iterations, or ~5 iterations. In other words, I would just not worry too much about it.
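If you want to sanity-check those numbers, here is a throwaway sketch that evaluates the mean and standard deviation for a few values of n (pure arithmetic, no database involved; the sample sizes are arbitrary):

use strict;
use warnings;

my $space = 63 ** 6;                 # total possible strings
for my $n (1e6, 1e9, $space / 2) {
    my $p    = 1 - $n / $space;      # probability a fresh string is unused
    my $mean = 1 / $p;               # mean of the geometric distribution
    my $sd   = sqrt((1 - $p) / $p ** 2);
    printf "n=%.0f: expect %.2f tries (sd %.2f)\n", $n, $mean, $sd;
}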
From an academic standpoint this seems like an interesting problem to work on. But if you're on the clock and just need random, distinct strings, I'd go with the Data::GUID module.
use strict;
use warnings;
use Data::GUID qw( guid_string );
my $guid = guid_string();
Getting rid of recursion is easy: turn your recursive call into a do-while loop. For instance, split your function in two: the "main" one and a helper. The "main" one simply calls the helper and queries the database to ensure the result is unique. Assuming generate_random_string2 is the helper, here's a skeleton:
do {
    $string = generate_random_string2(6);
} while (database->quick_select(...));
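One way to flesh that skeleton out, as a sketch: the helper is just your existing generator minus the database check, and the wrapper reuses your quick_select call (so it assumes it runs inside your Dancer app, where database() is provided by Dancer::Plugin::Database):

sub generate_random_string2 {
    my $length = shift;
    my @chars = ('a'..'z', 'A'..'Z', '0'..'9', '_');
    return join '', map { $chars[rand @chars] } 1 .. $length;
}

sub generate_unique_string {
    my $string;
    do {
        $string = generate_random_string2(6);
    } while (database->quick_select('urls', { shortcut => $string }));
    return $string;
}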
As for limiting the number of iterations before getting a valid string, what about just saving the last generated string and always building your new string as a function of that?
For example, when you start off you have no strings, so let's just say your first string is 'a'. The next time you build a string, you take the last built string ('a') and apply a transformation, for instance incrementing the last character. This gives you 'b', and so on. Eventually you reach the highest character you care for (say 'z'), at which point you append an 'a' to get 'za', and repeat.
Now there is no database lookup, just one persistent value that you use to generate the next value. Of course, if you want truly random strings you will have to make the algorithm more sophisticated, but the basic principle is the same (see the sketch after this list):
Your current value is a function of the last stored value.
When you generate a new value, you store it.
Ensure your generation will produce a unique value (one that did not occur before).
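As it happens, Perl's magic string auto-increment implements exactly this kind of transformation, so a minimal sketch needs only one persistent value. The state file name here is made up for illustration:

use strict;
use warnings;

# Hypothetical state file holding the last issued code.
my $state_file = 'last_code.txt';

sub next_code {
    my $last;
    if (open my $in, '<', $state_file) {
        $last = <$in>;
        chomp $last if defined $last;
    }
    # Perl's magic string increment: 'a' -> 'b', 'z' -> 'aa', 'az' -> 'ba', ...
    my $code = defined $last && length $last ? ++$last : 'a';
    open my $out, '>', $state_file or die "Cannot write $state_file: $!";
    print $out $code;
    return $code;
}

print next_code(), "\n" for 1 .. 3;   # prints a, b, c on a fresh run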
I've got one more idea based on using MySQL.
create table string (
    string_id int(10) not null auto_increment,
    string varchar(6) not null default '',
    primary key (string_id)
);
insert into string set string='';
update string
set string = lpad( hex( last_insert_id() ), 6, uuid() )
where string_id = last_insert_id();
select string from string
where string_id = last_insert_id();
This gives you an incremental hex value, left-padded with junk taken from a UUID.
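Wrapped in Perl with DBI, the same trick might look like this (a sketch; the connection parameters are placeholders, and LAST_INSERT_ID() is per-connection, so all three statements must run on the same handle):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:mysql:database=shortener", "user", "password",
                       { RaiseError => 1 });

$dbh->do("INSERT INTO string SET string = ''");
$dbh->do("UPDATE string
          SET string = LPAD(HEX(LAST_INSERT_ID()), 6, UUID())
          WHERE string_id = LAST_INSERT_ID()");
my ($code) = $dbh->selectrow_array(
    "SELECT string FROM string WHERE string_id = LAST_INSERT_ID()");
print "$code\n";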
I know the _id (ObjectID) of some entry; is there any way to get its relative position from the collection start / the number of records before it, without writing any code?
(The stuff was required for debugging some application which had a messy 'no deletions' policy along with incremental record numbers and in-memory collections.)
UPD: still looking for a native way to do such things, but here are some Perl sweets:
#!/usr/bin/perl -w
use MongoDB;
use MongoDB::OID;
use strict;

my $ppl = MongoDB::Connection->new(username => "root", password => "toor")
          ->webapp->users->find();
my $c = 0;
while (my $user = $ppl->next) {
    $c++;
    print "$user->{_id} $c\n" if $user->{'_id'} =~ /4...6|4...5/;
}
This is not possible. There is no information in an ObjectID that you can reliably use to know how many older documents are in the same collection. The "inc" part of the ObjectId comes close but exact values depend on driver implementation (and can even be random) and would require all writes to come from the same machine to a mongod that's managing a single collection.
TL;DR: No.
I hope this question is still on topic, but recently I found a key-value store programmed in Perl. It was pretty simple and RAM-based, and I think it had just set and get, plus an 'expire' option for keys. I also think it came as both an XS and a pure-Perl version.
I have been searching for quite a while now and I'm not sure whether it is on CPAN or I saw it on GitHub. Maybe someone knows what I am talking about.
It might be helpful in narrowing things down if you could explain what exactly the module does that is special in that regard. If you're looking to implement something with caching in general, I'd point you towards CHI, which is basically a common API over multiple caching drivers.
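For instance, a minimal in-memory CHI cache with per-key expiry looks like this (the driver and keys are chosen just for illustration):

use strict;
use warnings;
use CHI;

# The Memory driver keeps everything in RAM, like the module you describe.
my $cache = CHI->new(driver => 'Memory', global => 1);
$cache->set('answer', 42, '10 minutes');   # key, value, expiry
my $value = $cache->get('answer');          # returns undef once expired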
Do you mean Cache? It can store key/value pairs in a number of places, including shared memory.
It sounds like you are describing Memcached. There is a Perl interface on CPAN.
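A minimal sketch with Cache::Memcached, assuming a memcached server running on the default local port:

use strict;
use warnings;
use Cache::Memcached;

my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });
$memd->set('session', { user => 'alice' }, 600);  # expires in 600 seconds
my $session = $memd->get('session');               # undef after expiry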
I've used Tie::Cache for this in the past with excellent results. It creates a tied hash variable that exhibits LRU behavior when it grows beyond a configured maximum key count.
my $cache_size = 1000;
use vars qw(%cache);
%cache = ();
tie %cache, 'Tie::Cache', $cache_size;
From here, you can store key/value pairs (of course, the value side can be a reference) in %cache, and should its size grow beyond 1000 keys, the least-recently-used keys will be deleted as more are added.
In my usage, I store the right-hand side as an arrayref holding the cached value along with a timestamp of when the entry was cached; my cache reference code checks the timestamp and deletes the key without using it if the entry isn't fresh enough:
sub getCacheMatch {
    my $check_value = shift;
    my $timeout = 600;    # 10 minutes

    # Check cache for a match.
    my $now = time();
    my $cache_entry = $cache{$check_value};
    if ($cache_entry) {
        my ($result, $time_cached) = @{$cache_entry};
        if ($now - $time_cached > $timeout) {
            delete $cache{$check_value};   # stale entry: discard it
            return undef;
        } else {
            return $result;
        }
    }
    return undef;   # no cache entry
}
And I update the cache elsewhere in the code like so:
$cache{$cache_checkstring} = [$value_to_cache, $now];
I'm making a code generation script for the UN/LOCODE system, and the database has unique 3-letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia; "AR TLL" can also exist (the country code and the 3-letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations; the user also has the possibility of entering the 3 letters/numbers him/herself (which will be checked against the database automatically before submission).
Finally, neither 0 nor 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1. I'd check from AAA through 999, but then each candidate code would require a new query (slow?).
2. I could store all ~40,000 possibilities in an array and subtract the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually; maybe 40,000 isn't such a big number).
3. Generate a random code, hope it doesn't exist yet, and check whether it does; if it does, start over. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I would go with number 2; it is simple, and 40,000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion is trivial because you have a total of 34 symbols (A-Z, 2-9).
I would go for option 1 (i.e. do a sequential search), adding a table that stores the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through the sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give A-Z the values 0..25 and 2..9 the values 26..33. Then XYZ is the number X*34^2 + Y*34 + Z = 23*1156 + 24*34 + 25 = 27429. This should be doable using standard MySQL functions, in particular CONV.
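A sketch of that code-to-number mapping, written in Perl for illustration since the rest of this page uses it (the alphabet below is the 34 allowed characters; the helper names are made up):

use strict;
use warnings;

my @alphabet = ('A'..'Z', '2'..'9');                 # 34 symbols, no 0/1
my %value    = map { $alphabet[$_] => $_ } 0 .. $#alphabet;

# Treat a 3-character code as a base-34 number.
sub code_to_number {
    my ($code) = @_;
    my $n = 0;
    $n = $n * 34 + $value{$_} for split //, $code;
    return $n;
}

# Inverse: turn a number back into a 3-character code.
sub number_to_code {
    my ($n) = @_;
    my $code = '';
    for (1 .. 3) {
        $code = $alphabet[$n % 34] . $code;
        $n = int($n / 34);
    }
    return $code;
}

print code_to_number('XYZ'), "\n";   # 27429, matching the arithmetic above
print number_to_code(27429), "\n";   # XYZ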
I went with the 2nd option. I was also able to make a script that tries to match the location name as closely as possible: for Tartu, for example, it will try T**, then TA*, and if possible TAR; if not, it will try TAT, as T is the next letter after R in Tartu.
The code is quite extensive; I'll just post the part that picks the first available code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);

// store all possibilities in a huge array
$codes = array();
for ($i = 0; $i < $length; $i++)
    for ($j = 0; $j < $length; $j++)
        for ($k = 0; $k < $length; $k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);

$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];

// array_diff() preserves keys, so re-index before taking the first element
$remaining = array_values(array_diff($codes, $used));
$code = $remaining[0];
Thanks for your opinion, this will be the key to transport codes all over the world :)