Does Perl's Glob have a limitation? - perl

I am running the following expecting return strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here, is there a limit of some sort? or is there a way around this?
If it makes any difference, It returns the same result in both perl 5.26 and perl 5.28

The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
In your first example that's 265 strings (11_881_376), each five chars long. So a list of ~12 million strings, with (naive) total in excess of 56Mb ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So at the order of a 100Mb's, at the very least, right there in one list.†
I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend to write an algorithm that generates and provides one list element at a time (an "iterator"), and work with that.
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)
† I'm getting 56 bytes for a size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1Gb, in one operation.
Update A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.

Everything has some limitation.
Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:
use v5.10;
use Set::CrossProduct;
my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );
while( my $item = $set->get ) {
say join '', #$item
}

Related

Why does this line return sum of integers 1-10?

I'd like to understand how unpack is returning the sum in the given perl one-liner.
I've looked at pack man page and mostly understood that it is simply formatting the given array into a scalar of ten doubles.
However, I couldn't find proper documentation for unpack with %123. Looking for help here.
print unpack "%123d*" , pack( "d*", (1..10));
This line correctly outputs 55 which is 1+2+3+...+10.
From perldoc -f unpack:
In addition to fields allowed in pack(), you may prefix a field with a % to indicate that you want a <number>-bit checksum of the items instead of the items themselves.
Thus %123d* means to add all the input integers 1..10 and then take the first 123 bit of this result in order to construct the "<number>-bit checksum". Note that %8d* or just %d* (which is equivalent to %16d*) would suffice too given that the sum is small enough.

Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for similar syntactical dependency structures as the given input, without the need of being proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project - before I was engaged in the project - the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well. We are talking about +- 500 millions words. Attempts have been made that in theory ought to improve speed by reducing the search space (see this paper) but this has not been tested yet. The results of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments. A directory that stores different files. These files each individually store XPaths. Those XPaths are the ones that I have selected to benchmark with. A second argument is the limit of results to return. When set to zero, no limit is set.
Because some parts of the dataset are so incredibly huge, we have divided them in different, equally sized files as well, called treebankparts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use warnings;
$| = 1; # flush every print
# Directory where XPaths are stored
my $directory = shift(#ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(#ARGV);
# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);
# List all files in directory
#xpathfiles = <$directory/*.txt>;
# Read lines of treebank parts into variable
open( my $tfh, "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my #tlines = <$tfh> );
close $tfh;
# Loop through all XPaths in $directory
foreach my $xpathfile (#xpathfiles) {
open( my $xfh, $xpathfile ) or die "cannot open file $xpathfile";
chomp( my #xlines = <$xfh> );
close $xfh;
print STDOUT "File = $xpathfile\n";
# Loop through lines from XPath file (= XPath query)
foreach my $xline (#xlines) {
# Loop through the lines of treebank file
foreach my $tline (#tlines) {
my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
QuerySonar( $xline, $treebank );
}
}
}
$session->close();
sub QuerySonar {
my ( $xpath, $db ) = #_;
print STDOUT "Querying $db for $xpath\n";
print STDOUT "Limit = $limit\n";
my $x_limit;
my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
. $xpath . ';';
my $x_open = '<results>';
my $x_totalcount = '<total>{count($results)}</total>';
my $x_loopinit = '{for $node at $limitresults in $results';
# Spaces are important!
if ( $limit > 0 ) {
$x_limit = ' where $limitresults <= ' . $limit . ' ';
}
# Comment needed to prevent `Incomplete FLWOR expression`
else { $x_limit = '(: No limit set :)'; }
my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/#id)
let $sentence := ($node/ancestor::alpino_ds/sentence)
let $begin := ($node//#begin)
let $idlist := ($node//#id)
let $beginlist := (distinct-values($begin))';
# Separate sentence info by tab
my $x_loopexit = 'return <match>{data($sentid)}
{string-join($idlist, "-")}
{string-join($beginlist, "-")}
{data($sentence)}</match>}';
my $x_close = '</results>';
# Concatenate all XQuery parts
my $x_concatquery =
$x_resultsofxp
. $x_open
. $x_totalcount
. $x_loopinit
. $x_limit
. $x_sentenceinfo
. $x_loopexit
. $x_close;
my $querysent = $session->query($x_concatquery);
my $basexoutput = $querysent->execute();
print $basexoutput. "\n\n";
$querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX, and returning the total hits and the results (possibly limited by an argument in the Perl script). So I got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it.
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)
One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the dbms, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn than the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the dbms.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done produce results comparable to for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done? If they are comparable, then you have reason to believe it doesn't matter if you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
There are several things we have to separate here: The first issue is that BaseX performance should not be confused with your perl script as your perl script seems to simply construct an XQuery (and not XPath as you suggested in your question and tags). So for testing I would suggest to use some already pre-fined XQueries suitable to your real-world scenarios, as your XQuery construction should be negligible. How you pass your query to BaseX, so via the Perl API or via any other means should not be relevant. Even if your perl performance is relevant, you should test the performance separately.
Hence, your original question whether you should put test both scenarios in the same script or not is not relevant anymore. Instead you simply execute the two separate XQueries for the scenario A and B by themself without the perl script.
You are partly correct to worry about caching, however it is the Java JIT compiler which most likely will be relevant here (as BaseX is written in java, JIT and use caching, not BaseX itself. You should therefore use the Client/Server infrastructure and have a long-running server and warm it up before running performance measurements.
Regarding performance: The BaseX GUI and also the command line already have an included measurement (using command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full Disclosure: I am with the BaseX team and you might get better help at the BaseX mailing list or might want to reference this questions as our head architect isn't watching SO as regularly as the ML.

How does this Perl one-liner actually work?

So, I happened to notice that last.fm is hiring in my area, and since I've known a few people who worked there, I though of applying.
But I thought I'd better take a look at the current staff first.
Everyone on that page has a cute/clever/dumb strapline, like "Is life not a thousand times too short for us to bore ourselves?". In fact, it was quite amusing, until I got to this:
perl -e'print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34'
Which I couldn't resist pasting into my terminal (kind of a stupid thing to do, maybe), but it printed:
Just another Last.fm hacker,
I thought it would be relatively easy to figure out how that Perl one-liner works. But I couldn't really make sense of the documentation, and I don't know Perl, so I wasn't even sure I was reading the relevant documentation.
So I tried modifying the numbers, which got me nowhere. So I decided it was genuinely interesting and worth figuring out.
So, 'how does it work' being a bit vague, my question is mainly,
What are those numbers? Why are there negative numbers and positive numbers, and does the negativity or positivity matter?
What does the combination of operators +=$_ do?
What's pack+q,c*,, doing?
This is a variant on “Just another Perl hacker”, a Perl meme. As JAPHs go, this one is relatively tame.
The first thing you need to do is figure out how to parse the perl program. It lacks parentheses around function calls and uses the + and quote-like operators in interesting ways. The original program is this:
print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34
pack is a function, whereas print and map are list operators. Either way, a function or non-nullary operator name immediately followed by a plus sign can't be using + as a binary operator, so both + signs at the beginning are unary operators. This oddity is described in the manual.
If we add parentheses, use the block syntax for map, and add a bit of whitespace, we get:
print(+pack(+q,c*,,
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
The next tricky bit is that q here is the q quote-like operator. It's more commonly written with single quotes:
print(+pack(+'c*',
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
Remember that the unary plus is a no-op (apart from forcing a scalar context), so things should now be looking more familiar. This is a call to the pack function, with a format of c*, meaning “any number of characters, specified by their number in the current character set”. An alternate way to write this is
print(join("", map {chr($.+=$_)} (74, …, -34)))
The map function applies the supplied block to the elements of the argument list in order. For each element, $_ is set to the element value, and the result of the map call is the list of values returned by executing the block on the successive elements. A longer way to write this program would be
#list_accumulator = ();
for $n in (74, …, -34) {
$. += $n;
push #list_accumulator, chr($.)
}
print(join("", #list_accumulator))
The $. variable contains a running total of the numbers. The numbers are chosen so that the running total is the ASCII codes of the characters the author wants to print: 74=J, 74+43=117=u, 74+43-2=115=s, etc. They are negative or positive depending on whether each character is before or after the previous one in ASCII order.
For your next task, explain this JAPH (produced by EyesDrop).
''=~('(?{'.('-)#.)#_*([]#!#/)(#)#-#),#(##+#)'
^'][)#]`}`]()`#.#]#%[`}%[#`#!##%[').',"})')
Don't use any of this in production code.
The basic idea behind this is quite simple. You have an array containing the ASCII values of the characters. To make things a little bit more complicated you don't use absolute values, but relative ones except for the first one. So the idea is to add the specific value to the previous one, for example:
74 -> J
74 + 43 -> u
74 + 42 + (-2 ) -> s
Even though $. is a special variable in Perl it does not mean anything special in this case. It is just used to save the previous value and add the current element:
map($.+=$_, ARRAY)
Basically it means add the current list element ($_) to the variable $.. This will return a new array with the correct ASCII values for the new sentence.
The q function in Perl is used for single quoted, literal strings. E.g. you can use something like
q/Literal $1 String/
q!Another literal String!
q,Third literal string,
This means that pack+q,c*,, is basically pack 'c*', ARRAY. The c* modifier in pack interprets the value as characters. For example, it will use the value and interpret it as a character.
It basically boils down to this:
#!/usr/bin/perl
use strict;
use warnings;
my $prev_value = 0;
my #relative = (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34);
my #absolute = map($prev_value += $_, #relative);
print pack("c*", #absolute);

reading and storing numbers in perl without a loss of percision (Perl)

I have a few numbers in a file in a variety of formats: 8.3, 0.001, 9e-18. I'm looking for an easy way to read them in and store them without any loss of precision. This would be easy in AWK, but how's it done in Perl? I'm only open to using Perl. Thanks!
Also, I was wondering if there's an easy way to print them in an appropriate format. For example, 8.3 should be printed as "8.3" not "8.3e0"
If they're text strings, then reading them into Perl as strings and writing them back out as strings shouldn't result in any loss of precision. If you have to do arithmetic on them, then I suggest installing the CPAN module Math::BigFloat to ensure that you don't lose any precision to rounding.
As to your second question, Perl doesn't do any reformatting unless you ask it to:
$ perl -le 'print 8.3'
8.3
Am I missing something?
From http://perldoc.perl.org/perlnumber.html:
Perl can internally represent numbers in 3 different ways: as native
integers, as native floating point numbers, and as decimal strings.
Decimal strings may have an exponential notation part, as in
"12.34e-56" . Native here means "a format supported by the C compiler
which was used to build perl".
This means that printing the number out depends on how the number is stored internal to perl, which means, in turn, that you have to know how the number is represented on input.
By and large, Perl will just do the right thing, but you should know how what compiler was used, how it represents numbers internally, and how to print those numbers. For example:
$ perldoc -f int
int EXPR
int Returns the integer portion of EXPR. If EXPR is omitted, uses $_. You should
not use this function for rounding: one because it truncates towards 0, and two
because machine representations of floating-point numbers can sometimes produce
counterintuitive results. For example, "int(-6.725/0.025)" produces -268 rather than
the correct -269; that's because it's really more like -268.99999999999994315658
instead. Usually, the "sprintf", "printf", or the "POSIX::floor" and
"POSIX::ceil" functions will serve you better than will int().
I think that if you want to read a number in explicitly as a string, your best bet would be to use unpack() with the 'A*' format.

Limiting the amount of information printed by Perl debugger

One of my pet peeves with debugging Perl code (in command line debbugger, perl -d) is the fact that mistakenly printing (via x command) the contents of a huge datastructure is guaranteed to freeze up your terminal for forever and a half while 100s of pages of data are printed. Epecially if that happens across slowish network.
As such, I'd like to be able to limit the amount of data that x prints.
I see two approaches - I'd be willing to try either if someone knows how to do.
Limit the amount of data any single command in debugger prints.
Better yet, somehow replace the built-in x command with a custom Perl method (which would calculate the "size" of the data structure, and refuse to print its contents without confirmation).
I'm specifically asking "how to replace x with custom code" - building a Good Enough "is the data structure too big" Perl method is something I can likely do on my own without too much effort although I see enough pitfalls preventing the "perfect" one from being a fairly frustrating endeavour. Heck, merely doing Data::Dumper->Dump and taking the length of the string might do the trick :)
Please note that I'm perfectly well aware of how to manually avoid the issue by recursively examining layers of datastructure (e.g. print the ref, print the count of keys/array elements, etc...)... the whole point is I want to be able to avoid thoughtlessly typing x $huge_pile_of_data without thinking - or stumbling on a bug populating said huge pile of data into what should be a scalar.
The x command takes an optional argument for the maximum depth to display. That's not quite the same as limiting the amount of data to N pages, but it's definitely useful to prevent overload.
DB<1> %h = (a => { b => { c => 1 } } )
DB<2> x %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
'c' => 1
DB<3> x 2 %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
You can specify the default depth to print via the o command, e.g.
DB<1>o dumpDepth=1
Add that to your .perldb file to apply it to all debugger sessions.
Otherwise, it looks like the x command invokes DB::dumpit() which is just a wrapper for dumpval.pl (or, more specifically, the main::dumpValue() sub it defines). You could modify/replace that script as you see fit. I'm not sure how you'd make it interactive, though.
The | command in the debugger pipes another command's output to your pager, e.g.
DB<1> |x %huge_datastructure