I'm looking for a method to generate a good seed so that processes that start at the same time each produce a different series of random numbers.
I would like to avoid using one of the math or crypto libraries because I'm picking random numbers very frequently and my CPU resources are very limited.
I found a few examples of setting seeds. I tested them using the following method:
Write a short program that picks 100 random numbers out of 5000 options, so each value has a 2% chance of being selected.
Run this program 100 times, so that in theory, in a truly random environment, all possible values should be picked at least once.
Count the number of values that were not selected at all.
This is the Perl code I used. In each test I enable only one method for generating the seed:
#!/usr/bin/perl
#$seed=5432;
#$seed=(time ^ $$);
#$seed=($$ ^ unpack "%L*", `ps axww | gzip -f`);
$seed=(time ^ $$ ^ unpack "%L*", `ps axww | gzip -f`);
srand ($seed);
for ($i=0 ; $i< 100; $i++) {
printf ("%03d \n", rand (5000)+1000);
}
I ran the program 100 times and counted the values NOT selected using:
# run the program 100 times
for i in `seq 0 99`; do /tmp/rand_test.pl ; done > /tmp/list.txt
# test 1000 values (out of 5000); that should be a good-enough representation.
for i in `seq 1000 1999`; do echo -n "$i "; grep -c $i /tmp/list.txt; done | grep " 0" | wc -l
The table shows the result of the tests (Lower value is better):
count Seed generation method
114 default - the line "srand ($seed);" is commented out
986 constant seed (5432)
122 time ^ $$
125 $$ ^ unpack "%L*", `ps axww | gzip -f`
163 time ^ $$ ^ unpack "%L*", `ps axww | gzip -f`
The constant seed method showed 986 of 1000 values not selected. In other words, only 1.4% of the possible values were selected. This is close enough to the 2% that was expected.
However, I expected that the last option, which was recommended in a few places, would be significantly better than the default.
Is there any better method to generate a seed for each of the processes?
I'm picking random numbers very frequently and my cpu resources are very limited.
You're worrying before you have even made a measurement.
Is there any better method to generate a seed for each of the processes?
Yes. You have to leave user space, which is prone to manipulation. Simply use Crypt::URandom.
It is safe for any purpose, including fetching a seed.
It will use the kernel CSPRNG for each operating system (see source code) and hence avoid the problems shown in the article above.
It does not suffer from the documented rand weakness.
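For instance, a minimal sketch of fetching a seed that way (urandom() is Crypt::URandom's function for reading bytes from the kernel CSPRNG):
use Crypt::URandom qw(urandom);
# Read 4 bytes from the kernel CSPRNG and unpack them as a 32-bit unsigned seed.
my $seed = unpack "L", urandom(4);
srand($seed);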
Don't generate a seed. Let Perl do it for you. Don't call srand (or call it without a parameter if you do).
Quoting srand:
If srand is not called explicitly, it is called implicitly without a parameter at the first use of the rand operator
and
When called with a parameter, srand uses that for the seed; otherwise it (semi-)randomly chooses a seed.
It doesn't simply use the time as the seed.
$ perl -M5.014 -E'say for srand, srand'
2665271449
1007037147
Your goal seems to be how to generate random numbers rather than how to generate seeds. In most cases, just use a cryptographic RNG (such as Crypt::URandom in Perl) to generate the random numbers you want, rather than generate seeds for another RNG. (In general, cryptographic RNGs take care of seeding and other issues for you.) You should not use a weaker RNG unless—
the random values you generate aren't involved in information security (e.g., the random values are neither passwords nor nonces nor encryption keys), and
either—
you care about repeatable "randomness" (which is not the case here), or
you have measured the performance of your application and find random number generation to be a performance bottleneck.
Since you will generate random names for the purpose of querying a database, which may be in a remote location, it will be highly unlikely that the random number generation itself will be the performance bottleneck.
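For illustration, a minimal sketch (assuming Crypt::URandom is available; the 1000..5999 range mirrors the test program above) of drawing a value directly from the kernel CSPRNG instead of seeding rand:
use Crypt::URandom qw(urandom);
# Pick one of 5000 values (1000..5999) straight from the kernel CSPRNG.
# The modulo introduces a tiny bias, negligible for this purpose.
my $pick = 1000 + unpack( "L", urandom(4) ) % 5000;
printf "%04d\n", $pick;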
Related
I am running the following expecting return strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here, is there a limit of some sort? or is there a way around this?
If it makes any difference, It returns the same result in both perl 5.26 and perl 5.28
The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
In your first example that's 26^5 strings (11_881_376), each five chars long. So a list of ~12 million strings, with a (naive) total in excess of 56 MB ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So on the order of 100 MB, at the very least, right there in one list.†
I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend writing an algorithm that generates and provides one list element at a time (an "iterator"), and working with that.
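For illustration, a sketch of such an iterator (the helper name is made up), built as a closure that counts in base 26 and returns one 5-character string per call:
sub make_string_iter {
    my ($len) = @_;
    my @digits = (0) x $len;   # base-26 "odometer", 0 means 'a'
    my $done   = 0;
    return sub {
        return undef if $done;
        my $str = join '', map { chr( ord('a') + $_ ) } @digits;
        my $i = $len - 1;                                        # increment the rightmost digit,
        while ( $i >= 0 && ++$digits[$i] == 26 ) { $digits[$i--] = 0 }   # carrying to the left
        $done = 1 if $i < 0;
        return $str;
    };
}

my $next = make_string_iter(5);
while ( defined( my $s = $next->() ) ) {
    print "$s\n";
}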
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)
† I'm getting 56 bytes for a size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1Gb, in one operation.
Update A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.
Everything has some limitation.
Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:
use v5.10;
use Set::CrossProduct;
my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );
while( my $item = $set->get ) {
say join '', @$item;
}
I am trying some examples from Rosettacode and encountered an issue with the provided Ackermann example. When running it "unmodified" (I replaced the UTF-8 variable names with Latin-1 ones), I get (similar output, but now copyable):
$ perl6 t/ackermann.p6
65533
19729 digits starting with 20035299304068464649790723515602557504478254755697...
Cannot unbox 65536 bit wide bigint into native integer
in sub A at t/ackermann.p6 line 3
in sub A at t/ackermann.p6 line 11
in sub A at t/ackermann.p6 line 3
in block <unit> at t/ackermann.p6 line 17
Removing the proto declaration in line 3 (by commenting out):
$ perl6 t/ackermann.p6
65533
19729 digits starting with 20035299304068464649790723515602557504478254755697...
Numeric overflow
in sub A at t/ackermann.p6 line 8
in sub A at t/ackermann.p6 line 11
in block <unit> at t/ackermann.p6 line 17
What went wrong? The program doesn't allocate much memory. Is the natural integer kind of limited?
In the code from the Ackermann function task I replaced the 𝑚 with m and the 𝑛 with n for better terminal interaction when copying errors, and I tried commenting out the proto declaration. I also asked Liz ;)
use v6;
proto A(Int \m, Int \n) { (state @)[m][n] //= {*} }
multi A(0, Int \n) { n + 1 }
multi A(1, Int \n) { n + 2 }
multi A(2, Int \n) { 3 + 2 * n }
multi A(3, Int \n) { 5 + 8 * (2 ** n - 1) }
multi A(Int \m, 0 ) { A(m - 1, 1) }
multi A(Int \m, Int \n) { A(m - 1, A(m, n - 1)) }
# Testing:
say A(4,1);
say .chars, " digits starting with ", .substr(0,50), "..." given A(4,2);
A(4, 3).say;
Please read JJ's answer first. It's breezy and led to this answer which is effectively an elaboration of it.
TL;DR A(4,3) is a very big number, one that cannot be computed in this universe. But raku(do) will try. As it does you will blow past reasonable limits related to memory allocation and indexing if you use the caching version and limits related to numeric calculations if you don't.
I try some examples from Rosettacode and encounter an issue with the provided Ackermann example
Quoting the task description with some added emphasis:
Arbitrary precision is preferred (since the function grows so quickly)
raku's standard integer type Int is arbitrary precision. The raku solution uses it to compute the most advanced answer possible. It only fails when you make it try to do the impossible.
When running it "unmodified" (I replaced the utf-8 variable names by latin-1 ones)
Replacing the variable names is not a significant change.
But adding the A(4,3) line shifted the code from being computable in reality to not being computable in reality.
The example you modified has just one explanatory comment:
Here's a caching version of that ... to make A(4,2) possible
Note that the A(4,2) solution is nearly 20,000 digits long.
If you look at the other solutions on that page most don't even try to reach A(4,2). There are comments like this one on the Phix version:
optimised. still no bignum library, so ack(4,2), which is power(2,65536)-3, which is apparently 19729 digits, and any above, are beyond (the CPU/FPU hardware) and this [code].
A solution for A(4,2) is the most advanced possible.
A(4,3) is not computable in practice
To quote Academic Kids: Ackermann function:
Even for small inputs (4,3, say) the values of the Ackermann function become so large that they cannot be feasibly computed, and in fact their decimal expansions cannot even be stored in the entire physical universe.
So computing A(4,3).say is impossible (in this universe).
It must inevitably lead to an overflow of even arbitrary precision integer arithmetic. It's just a matter of when and how.
Cannot unbox 65536 bit wide bigint into native integer
The first error message mentions this line of code:
proto A(Int \m, Int \n) { (state @)[m][n] //= {*} }
The state @ is an anonymous state array variable.
By default @ variables use the default concrete type for raku's abstract array type. This default array type provides a balance between implementation complexity and decent performance.
While computing A(4,2) the indexes (m and n) remain small enough that the computation completes without overflowing the default array's indexing limit.
This limit is a "native" integer (note: not a "natural" integer). A "native" integer is what raku calls the fixed width integers supported by the hardware it's running on, typically a long long which in turn is typically 64 bits.
A 64 bit wide index can handle indices up to 9,223,372,036,854,775,807.
But in trying to compute A(4,3) the algorithm generates a 65536 bit (8192 byte) wide integer index. Such an integer could be as big as 2^65536, a 19,729 decimal digit number. But the biggest index allowed is a 64 bit native integer. So unless you comment out the caching line that uses an array, then for A(4,3) the program ends up throwing the exception:
Cannot unbox 65536 bit wide bigint into native integer
Limits to allocations and indexing of the default array type
As already explained, there is no array that could be big enough to help fully compute A(4,3). In addition, a 64 bit integer is already a pretty big index (9,223,372,036,854,775,807).
That said, raku can accommodate other array implementations such as Array::Sparse so I'll discuss that briefly below because such possibilities might be of interest for other problems.
But before discussing bigger arrays, running the code below on tio.run shows the practical limits for the default array type on that platform:
my @array;
@array[2**29]++; # works
@array[2**30]++; # could not allocate 8589967360 bytes
@array[2**60]++; # Unable to allocate ... 1152921504606846977 elements
@array[2**63]++; # Cannot unbox 64 bit wide bigint into native integer
(Comment out error lines to see later/greater errors.)
The "could not allocate 8589967360 bytes" error is a MoarVM panic. It's a result of tio.run refusing a memory allocation request.
I think the "Unable to allocate ... elements" error is a raku level exception that's thrown as a result of exceeding some internal Rakudo implementation limit.
The last error message shows the indexing limit for the default array type even if vast amounts of memory were made available to programs.
What if someone wanted to do larger indexing?
It's possible to create/use other @ (does Positional) data types that support things like sparse arrays etc.
And, using this mechanism, it's possible that someone could write an array implementation that supports larger integer indexing than is supported by the default array type (presumably by layering logic on top of the underlying platform's instructions; perhaps the Array::Sparse I linked above does).
If such an alternative were called BigArray then the cache line could be replaced with:
my @array is BigArray;
proto A(Int \𝑚, Int \𝑛) { @array[𝑚][𝑛] //= {*} }
Again, this still wouldn't be enough to store interim results for fully computing A(4,3) but my point was to show use of custom array types.
Numeric overflow
When you comment out the caching you get:
Numeric overflow
Raku/Rakudo do arbitrary precision arithmetic. While this is sometimes called infinite precision it obviously isn't actually infinite but is instead, well, "arbitrary", which in this context also means "sane" for some definition of "sane".
This classically means running out of memory to store a number. But in Rakudo's case I think there's an attempt to keep things sane by switching from a truly vast Int to a Num (a floating point number) before completely running out of RAM. But then computing A(4,3) eventually overflows even a double float.
So while the caching blows up sooner, the code is bound to blow up later anyway, and then you'd get a numeric overflow that would either manifest as an out of memory error or a numeric overflow error as it is in this case.
Array subscripts use native ints; that's why you get the error in line 3, when you use the big ints as array subscripts. You might have to define a new BigArray that uses Ints as array subscripts.
The second problem arises in the ** operator: the result is a Real, and when the low-level operation returns a Num, it throws an exception.
https://github.com/rakudo/rakudo/blob/master/src/core/Int.pm6#L391-L401
So creating a BigArray might not be helpful anyway. You'll have to create your own ** too, that always works with Int, but you seem to have hit the (not so infinite) limit of the infinite precision Ints.
I have a SELECT statement which returns capacity as an exponential value, e.g.
Capacity=5.4835615662E+003
in Perl code
I am using a DB2 database, and if I run the query directly in the database it returns
5483.5615662
but when I use the capacity value in the condition of the next SELECT query, it doesn't match.
e.g. the pseudo code is as below:
my $capacity = 'SELECT capacity FROM table';
# it returns $capacity = 5.4835615662E+003
my $result = "SELECT MEASUREMENT FROM TABLE WHERE CAPACITY = $capacity";
Here $capacity is 5.4835615662E+003, so it does not match any row in the table. It should be 5483.5615662.
How to convert exponential value to float without rounding off?
You are interpolating the value of $capacity into a string. Instead, you should use placeholders as in:
my $sth = $dbh->prepare(q{SELECT MEASUREMENT FROM TABLE WHERE CAPACITY=?});
$sth->execute($capacity);
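For completeness, a minimal sketch of reading the result back from that statement handle (assuming $dbh is an ordinary DBI handle):
while ( my ($measurement) = $sth->fetchrow_array ) {
    print "$measurement\n";   # one MEASUREMENT value per matching row
}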
It is hard to say if there are any other problems because the code snippets you provide don't really do anything.
It is likely that the number stored in the database is not exactly 5483.5615662 and that is just the displayed string when you query it.
If possible, I would recommend taking @Сухой27's advice and letting the database do the work for you:
SELECT MEASUREMENT FROM TABLE
WHERE CAPACITY = (SELECT CAPACITY FROM TABLE where ..?)
Alternatively, decide ahead of time how many digits past the decimal point really matter and use ROUND or similar functionality:
my $sth = $dbh->prepare(q{
SELECT MEASUREMENT FROM TABLE
WHERE ROUND(CAPACITY, 6)=ROUND(?, 6)
});
$sth->execute($capacity);
Please take a look at Why doesn't this sql query return any results comparing floating point numbers?
I'm concerned about the 5.4835615662+003 that you show in your question. That isn't a valid representation of a number, and it means just 5.4835615662 + 3. You need an E or an e before the exponent to use it as it is
There is also an issue with comparing floating-point values, whereby two numbers that are essentially equal may have a slightly different binary representation, and so will not compare as equal. If your value has been converted to a string (and that seems highly likely, as Perl will not use an exponent to display 5483.5615662 unless told to do so) and back again to floating point, then it is extremely unlikely to result in exactly the same value. Your comparisons will always fail
In Perl, and most other languages, a numeric value has no specific format. For example, if I run this
perl -E 'say 5.4835615662E+003'
I get the output
5483.5615662
showing that the two string representations are equivalent
It would help to see exactly how you got the value of $capacity from the database, because if it were a simple number then it wouldn't use the scientific representation. You would have to use sprintf to get what you have shown
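For example, producing the string shown in the question from the plain number takes an explicit format (a sketch; the digit count and exponent padding are assumptions that vary by platform):
perl -E 'say sprintf "%.10E", 5483.5615662'   # prints something like 5.4835615662E+03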
SQL is the same and doesn't care about the format of the number as long as it's valid, so if you wrote
SELECT measurement FROM table WHERE capacity = 5.4835615662E+003
then you would get a result where capacity is exactly equal to that value. But since it has been trimmed to eleven significant digits, you are hugely unlikely to find the record that the value came from, unless it contains 5483.56156620000000000
Update
If I run
perl -MMath::Trig=pi -E 'for (0 .. 20) { $x = pi * 10**$_; say qq{$x}; }'
I get this result
3.14159265358979
31.4159265358979
314.159265358979
3141.59265358979
31415.9265358979
314159.265358979
3141592.65358979
31415926.5358979
314159265.358979
3141592653.58979
31415926535.8979
314159265358.979
3141592653589.79
31415926535897.9
314159265358979
3.14159265358979e+015
3.14159265358979e+016
3.14159265358979e+017
3.14159265358979e+018
3.14159265358979e+019
3.14159265358979e+020
So by default Perl won't resort to using scientific notation until the value reaches 10^15. It clearly doesn't apply to 5483.5615662. Something has coerced the floating-point value in the question to a much less precise string in scientific notation. Comparing that for equality doesn't stand a chance of succeeding
When I use the code:
(sub {
use strict;
use warnings;
print 0.49999999999999994;
})->();
Perl outputs "0.5".
And when I remove one "9" from the number:
(sub {
use strict;
use warnings;
print 0.4999999999999994;
})->();
It prints 0.499999999999999.
Only when I remove another 9 does it actually store the number precisely.
I know that floating point numbers are a can of worms nobody wants to deal with, but I am curious if there is a way in Perl to "trap" this implicit conversion and die, so that I can use eval to catch this die and let the user know that the number they are trying to pass is not supported by Perl in its native form (so the user can maybe pass a string or an object instead).
The reason why I need this is to avoid situations like passing 0.49999999999999994 to be rounded by my function, but the number gets converted to 0.5 and in turn gets rounded to 1 instead of 0. I am not sure how to "intercept" this conversion so that my function "knows" that it did not actually get 0.5 as input, but rather the user's original value.
Without knowing how to intercept this kind of conversion, I cannot trust "round", because I do not know whether it received my input as I sent it, or whether that input was modified (at compile time or runtime, I am not sure) before the function was called (and in turn, the function has no idea whether the input it is operating on is the input the user intended, and has no means to warn the user).
This is not a Perl unique problem, it happens in JavaScript:
(() => {
'use strict';
/* oops: 1 */
console.log(Math.round(0.49999999999999999))
})();
It happens in Ruby:
(Proc.new {
# oops: 1
print (0.49999999999999999.round)
}).call()
It happens in PHP:
<?php
(call_user_func(function() {
/* oops: 1 */
echo round(0.49999999999999999);
}));
?>
It even happens in C (which is okay, but my gcc does not warn me that the number has not been stored precisely; when specifying floating-point literals they had better be stored exactly, or the compiler should warn you that it decided to turn them into another form, e.g. "Your number x cannot be represented in 64-bit/32-bit floating-point form, so I converted it to y", so you can see whether that's okay or not; in this case it is NOT):
#include <math.h>
#include <stdio.h>
int main(int argc, char **argv)
{
/* oops: 1 */
printf("%f.\n", round(0.49999999999999999));
return 0;
}
Summary:
Is it possible to make Perl show an error or warning on implicit conversions of floating-point numbers, or is this something that Perl 5 (along with other languages) is incapable of doing at the moment (e.g. the compiler does not go out of its way to support such warnings or offer a flag to enable them)?
e.g.
warning: the number 0.49999999999999994 is not representable, it has been converted to 0.5. using bigint might solve this. Consider reducing precision of the number.
Perhaps use bignum:
$ perl -Mbignum -le 'print 0.49999999999999994'
0.49999999999999994
$ perl -Mbignum -le 'print 0.49999999999999994+0.1'
0.59999999999999994
$ perl -Mbignum -le 'print 0.49999999999999994-0.1'
0.39999999999999994
$ perl -Mbignum -le 'print 0.49999999999999994+10.1'
10.59999999999999994
It transparently extends the precision of Perl's floating-point numbers and integers to arbitrary precision.
Be aware that bignum is about 150 times slower than native arithmetic and other math solutions, and will typically NOT solve your problem (as soon as you need to store your numbers in JSON or databases or whatever, you're back at the same problem again).
Typically, sprintf takes care of prettying your output for you, so you do not have to see the ugly imprecision; however, it's still there.
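For example (a sketch; the exact output assumes a typical 64-bit double build), comparing Perl's default stringification with an explicit 17-significant-digit format makes the hidden imprecision visible:
perl -e 'print 0.49999999999999994, "\n";'         # 0.5 (default stringification uses ~15 significant digits)
perl -e 'printf "%.17g\n", 0.49999999999999994;'   # 0.49999999999999994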
Here is an example which works on my x64 platform which understands how to deal with that imprecision.
This correctly tells you if the 2 numbers you're interested in are the same:
sub safe_eq {
    my ( $var1, $var2 ) = @_;
    return 1 if $var1 == $var2;                   # exactly equal
    my $dust;
    if ( $var2 == 0 ) { $dust = abs($var1); }
    else              { $dust = abs( ( $var1 / $var2 ) - 1 ); }
    return 0 if $dust > 5.32907051820076e-15;     # differences above this are real
    return 1;
}
You can build on top of this to solve all your problems.
It works by understanding the magnitude of the imprecision in your native numbers, and accommodating it.
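For illustration, a quick usage sketch of that sub (the expected output assumes typical 64-bit doubles):
print safe_eq( 0.1 + 0.2, 0.3 ) ? "same\n" : "different\n";   # same: only floating-point dust apart
print safe_eq( 0.30001,   0.3 ) ? "same\n" : "different\n";   # different: a real difference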
As you said in the question, dealing with floating-point numbers in code is quite the can of worms, precisely because the standard floating-point representation, regardless of the precision employed, is incapable of accurately representing many decimal numbers. The only 100% reliable way around this is to not use floating-point numbers.
The easiest way to apply that is to instead use fixed-point numbers, although that limits precision to a fixed number of decimal places. e.g., Instead of storing 10.0050, define a convention that all numbers are stored to 4 decimal places and store 100050 instead.
But that doesn't seem likely to satisfy you, based on the minimal explanation you've given for what you're actually trying to accomplish (building a general-purpose math library). The next option, then, would be to store the number of decimal places as a scaling factor with each value. So 10.0050 would become an object containing the data { value => 100050, scale => 4 }.
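Here is a minimal sketch of that "value plus scale" idea; the helper names are hypothetical and it only parses and prints, with no arithmetic:
sub dec_parse {
    my ($str) = @_;
    my ( $int, $frac ) = split /\./, $str, 2;
    $frac //= '';
    # "10.0050" becomes { value => 100050, scale => 4 }
    return { value => 0 + ( $int . $frac ), scale => length $frac };
}

sub dec_to_string {
    my ($d) = @_;
    return "$d->{value}" if $d->{scale} == 0;
    my $s = sprintf "%0*d", $d->{scale} + 1, $d->{value};
    substr( $s, -$d->{scale}, 0, '.' );   # re-insert the decimal point
    return $s;
}

my $x = dec_parse("10.0050");
print dec_to_string($x), "\n";   # 10.0050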
This can then be extended into a more general "rational number" data type by effectively storing each number as a numerator and denominator, thus allowing you to precisely store numbers such as 1/3, which neither base 2 nor base 10 can represent exactly. This is, incidentally, the approach that I am told Perl 6 has taken. So, if switching to Perl 6 is an option, then you may find that it all Just Works for you once you do so.
Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for syntactic dependency structures similar to the given input, without the need to be proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project - before I was engaged in the project - the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well. We are talking about roughly 500 million words. Attempts have been made that in theory ought to improve speed by reducing the search space (see this paper), but this has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments: a directory that stores several files, each of which contains the XPaths I have selected to benchmark with, and a limit on the number of results to return (when set to zero, no limit is applied).
Because some parts of the dataset are so incredibly huge, we have divided them into several equally sized files as well, called treebankparts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use warnings;
$| = 1; # flush every print
# Directory where XPaths are stored
my $directory = shift(@ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);
# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);
# List all files in directory
@xpathfiles = <$directory/*.txt>;
# Read lines of treebank parts into variable
open( my $tfh, "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my @tlines = <$tfh> );
close $tfh;
# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
open( my $xfh, $xpathfile ) or die "cannot open file $xpathfile";
chomp( my @xlines = <$xfh> );
close $xfh;
print STDOUT "File = $xpathfile\n";
# Loop through lines from XPath file (= XPath query)
foreach my $xline (@xlines) {
# Loop through the lines of treebank file
foreach my $tline (@tlines) {
my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
QuerySonar( $xline, $treebank );
}
}
}
$session->close();
sub QuerySonar {
my ( $xpath, $db ) = @_;
print STDOUT "Querying $db for $xpath\n";
print STDOUT "Limit = $limit\n";
my $x_limit;
my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
. $xpath . ';';
my $x_open = '<results>';
my $x_totalcount = '<total>{count($results)}</total>';
my $x_loopinit = '{for $node at $limitresults in $results';
# Spaces are important!
if ( $limit > 0 ) {
$x_limit = ' where $limitresults <= ' . $limit . ' ';
}
# Comment needed to prevent `Incomplete FLWOR expression`
else { $x_limit = '(: No limit set :)'; }
my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/#id)
let $sentence := ($node/ancestor::alpino_ds/sentence)
let $begin := ($node//#begin)
let $idlist := ($node//#id)
let $beginlist := (distinct-values($begin))';
# Separate sentence info by tab
my $x_loopexit = 'return <match>{data($sentid)}
{string-join($idlist, "-")}
{string-join($beginlist, "-")}
{data($sentence)}</match>}';
my $x_close = '</results>';
# Concatenate all XQuery parts
my $x_concatquery =
$x_resultsofxp
. $x_open
. $x_totalcount
. $x_loopinit
. $x_limit
. $x_sentenceinfo
. $x_loopexit
. $x_close;
my $querysent = $session->query($x_concatquery);
my $basexoutput = $querysent->execute();
print $basexoutput. "\n\n";
$querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX, and returning the total hits and the results (possibly limited by an argument in the Perl script). So I got that going, but now the question is: how can I improve this script so that I can get some benchmarking results out of it?
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)
One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
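To actually collect numbers from the existing script, one minimal approach (a sketch using the core Time::HiRes module, wrapped around the QuerySonar call in the question's innermost loop) could be:
use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
QuerySonar( $xline, $treebank );
my $elapsed = tv_interval($t0);   # wall-clock seconds with microsecond resolution
# Tab-separated so the timings are easy to aggregate afterwards
print STDERR join( "\t", $xpathfile, $treebank, $elapsed ), "\n";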
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the dbms, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn than the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the dbms.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done produce results comparable to for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done? If they are comparable, then you have reason to believe it doesn't matter if you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
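As a rough sketch of the loop-in-XQuery variant (the part names are placeholders; $xline and $session are the variables from the question's script), counting hits per treebank part in a single round trip might look like:
# One round trip: iterate over the treebank parts inside a single XQuery.
my $x_parts = '("part1", "part2", "part3")';   # placeholder: real names come from treebankparts.lst
my $x_query =
    'for $db in ' . $x_parts . '
     let $results := db:open($db)/treebank/alpino_ds' . $xline . '
     return <part db="{$db}" hits="{count($results)}"/>';
my $q = $session->query($x_query);
print $q->execute(), "\n";
$q->close();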
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
There are several things we have to separate here. The first issue is that BaseX performance should not be confused with that of your Perl script, as your Perl script seems to simply construct an XQuery (and not an XPath, as you suggested in your question and tags). So for testing I would suggest using some already pre-defined XQueries suitable to your real-world scenarios, as the cost of constructing the XQuery should be negligible. How you pass your query to BaseX, whether via the Perl API or via any other means, should not be relevant. Even if your Perl performance is relevant, you should test that performance separately.
Hence, your original question of whether you should test both scenarios in the same script or not is no longer relevant. Instead, you simply execute the two separate XQueries for scenarios A and B by themselves, without the Perl script.
You are partly correct to worry about caching; however, it is the Java JIT compiler which will most likely be relevant here (as BaseX is written in Java and the JIT does the caching), not BaseX itself. You should therefore use the client/server infrastructure, have a long-running server, and warm it up before running performance measurements.
Regarding performance: the BaseX GUI and also the command line already have an included measurement (on the command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full Disclosure: I am with the BaseX team and you might get better help at the BaseX mailing list or might want to reference this questions as our head architect isn't watching SO as regularly as the ML.