Limiting the amount of information printed by the Perl debugger

One of my pet peeves with debugging Perl code (in the command-line debugger, perl -d) is that mistakenly printing (via the x command) the contents of a huge data structure is guaranteed to freeze up your terminal for forever and a half while hundreds of pages of data are printed. Especially if that happens across a slowish network.
As such, I'd like to be able to limit the amount of data that x prints.
I see two approaches - I'd be willing to try either if someone knows how to do it.
Limit the amount of data any single command in debugger prints.
Better yet, somehow replace the built-in x command with a custom Perl method (which would calculate the "size" of the data structure, and refuse to print its contents without confirmation).
I'm specifically asking "how to replace x with custom code" - building a Good Enough "is the data structure too big" Perl method is something I can likely do on my own without too much effort, although I see enough pitfalls to make building the "perfect" one a fairly frustrating endeavour. Heck, merely doing Data::Dumper->Dump and taking the length of the resulting string might do the trick :)
Please note that I'm perfectly well aware of how to manually avoid the issue by recursively examining layers of the data structure (e.g. print the ref, print the count of keys/array elements, etc.)... the whole point is that I want to be able to avoid thoughtlessly typing x $huge_pile_of_data - or stumbling on a bug that populated said huge pile of data into what should have been a scalar.
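For what it's worth, a minimal sketch of the kind of guard I have in mind (safe_dump and the 4096-character cutoff are made up; it uses the Data::Dumper length trick mentioned above):

use Data::Dumper;

# Hypothetical guard: stringify the structure and refuse to print it
# when the dump exceeds an arbitrary size threshold.
sub safe_dump {
    my ( $ref, $max_len ) = @_;
    $max_len //= 4096;    # arbitrary cutoff, tune to taste
    my $str = Data::Dumper->Dump( [$ref] );
    if ( length($str) > $max_len ) {
        print 'Structure too big (' . length($str) . " chars); refusing to print.\n";
        return;
    }
    print $str;
}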

The x command takes an optional argument for the maximum depth to display. That's not quite the same as limiting the amount of data to N pages, but it's definitely useful to prevent overload.
  DB<1> %h = (a => { b => { c => 1 } } )

  DB<2> x %h
0  'a'
1  HASH(0x1d5ff44)
   'b' => HASH(0x1d61424)
      'c' => 1

  DB<3> x 2 %h
0  'a'
1  HASH(0x1d5ff44)
   'b' => HASH(0x1d61424)
You can specify the default depth to print via the o command, e.g.
  DB<1> o dumpDepth=1
Add that to your .perldb file to apply it to all debugger sessions.
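For example, a minimal ~/.perldb (a sketch; parse_options is the hook the debugger exposes for setting options at startup):

# ~/.perldb -- apply a default dump depth to every debugger session
parse_options("dumpDepth=2");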
Otherwise, it looks like the x command invokes DB::dumpit(), which is just a wrapper for dumpvar.pl (or, more specifically, the main::dumpValue() sub it defines). You could modify/replace that script as you see fit. I'm not sure how you'd make it interactive, though.

The | command in the debugger pipes another command's output to your pager, e.g.
DB<1> |x %huge_datastructure


Does Perl's Glob have a limitation?

I am running the following, expecting returned strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
    print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
    print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here? Is there a limit of some sort, or is there a way around this?
If it makes any difference, it returns the same result in both perl 5.26 and perl 5.28.
The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
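For instance (a small demonstration of that stickiness; the pattern is arbitrary):

# scalar-context glob keeps its own iterator state per call site;
# leaving the loop early does not reset it, so a later call at this
# spot would resume mid-list rather than start over
while ( my $s = glob '{a,b,c}{a,b,c}' ) {
    print "$s\n";
    last if $s eq 'ab';    # bail out early: the iterator is not exhausted
}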
In your first example that's 26^5 strings (11_881_376), each five chars long. So a list of ~12 million strings, with a (naive) total in excess of 56 MB ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So on the order of 100 MB, at the very least, right there in one list.†
I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend to write an algorithm that generates and provides one list element at a time (an "iterator"), and work with that.
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops, recommended in a previous post on this matter (and in a comment); Algorithm::Combinatorics (same comment); and Set::CrossProduct from another answer here ...
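If you would rather not pull in a module, a hand-rolled iterator is short enough (a sketch; make_string_iter is a made-up name). It counts in base 26 and maps each "digit" to a letter, so only one string exists at a time:

sub make_string_iter {
    my ($len) = @_;
    my @idx  = (0) x $len;    # one base-26 "digit" per position
    my $done = 0;
    return sub {
        return if $done;
        my $str = join '', map { chr( ord('a') + $_ ) } @idx;
        my $i = $len - 1;
        while ( $i >= 0 && ++$idx[$i] == 26 ) { $idx[$i--] = 0 }    # carry
        $done = 1 if $i < 0;    # wrapped all the way around: finished
        return $str;
    };
}

my $next = make_string_iter(5);
while ( defined( my $s = $next->() ) ) {
    print "$s\n";
}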
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of the ~12 million names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ?, on some systems it returns a list of only the strings that actually match files, so you'd quietly get different results.)
† I'm getting 56 bytes for the size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one (11_881_376 strings × 56 bytes is roughly 665 MB). So the real thing may well be on the order of 1 GB, in one operation.
Update: A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for about 15 minutes on a server-class machine and took 725 MB of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.
Everything has some limitation.
Here's a pure-Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:

use v5.10;
use Set::CrossProduct;

my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );

while ( my $item = $set->get ) {
    say join '', @$item;
}

Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for syntactic dependency structures similar to the given input, without needing to be proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project - before I was engaged in it - the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well. We are talking about roughly 500 million words. Attempts have been made that in theory ought to improve speed by reducing the search space (see this paper), but this has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments: a directory that stores different files, each of which individually stores the XPaths I have selected to benchmark with, and the limit of results to return (when set to zero, no limit is set).
Because some parts of the dataset are so incredibly huge, we have divided them in different, equally sized files as well, called treebankparts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # flush every print

# Directory where XPaths are stored
my $directory = shift(@ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);

# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);

# List all files in directory
my @xpathfiles = <$directory/*.txt>;

# Read lines of treebank parts into variable
open( my $tfh, "<", "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my @tlines = <$tfh> );
close $tfh;

# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
    open( my $xfh, "<", $xpathfile ) or die "cannot open file $xpathfile";
    chomp( my @xlines = <$xfh> );
    close $xfh;

    print STDOUT "File = $xpathfile\n";

    # Loop through lines from XPath file (= XPath query)
    foreach my $xline (@xlines) {
        # Loop through the lines of treebank file
        foreach my $tline (@tlines) {
            my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
            QuerySonar( $xline, $treebank );
        }
    }
}
$session->close();

sub QuerySonar {
    my ( $xpath, $db ) = @_;

    print STDOUT "Querying $db for $xpath\n";
    print STDOUT "Limit = $limit\n";

    my $x_limit;
    my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
        . $xpath . ';';
    my $x_open       = '<results>';
    my $x_totalcount = '<total>{count($results)}</total>';
    my $x_loopinit   = '{for $node at $limitresults in $results';
    # Spaces are important!
    if ( $limit > 0 ) {
        $x_limit = ' where $limitresults <= ' . $limit . ' ';
    }
    # Comment needed to prevent `Incomplete FLWOR expression`
    else { $x_limit = '(: No limit set :)'; }

    my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
let $sentence := ($node/ancestor::alpino_ds/sentence)
let $begin := ($node//@begin)
let $idlist := ($node//@id)
let $beginlist := (distinct-values($begin))';

    # Separate sentence info by tab
    my $x_loopexit = 'return <match>{data($sentid)}
{string-join($idlist, "-")}
{string-join($beginlist, "-")}
{data($sentence)}</match>}';
    my $x_close = '</results>';

    # Concatenate all XQuery parts
    my $x_concatquery =
          $x_resultsofxp
        . $x_open
        . $x_totalcount
        . $x_loopinit
        . $x_limit
        . $x_sentenceinfo
        . $x_loopexit
        . $x_close;

    my $querysent   = $session->query($x_concatquery);
    my $basexoutput = $querysent->execute();
    print $basexoutput . "\n\n";
    $querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts, and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX and returning the total hits and the results (possibly limited by an argument in the Perl script). So I've got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it?
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark tag, I am confident that this - albeit elaborate - post is suited for SO.)
One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the dbms, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn that the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the dbms.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does

for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done

produce results comparable to

for runcount in 1 2 3 4 5; do perl A.pl; done
for runcount in 1 2 3 4 5; do perl B.pl; done

? If they are comparable, then you have reason to believe it doesn't matter whether you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
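If you want to take those measurements from the Perl side, here is a minimal timing harness (a sketch; time_query is a made-up helper) that wraps the session API from the question's script in Time::HiRes timers:

use Time::HiRes qw(gettimeofday tv_interval);

# Run one XQuery $repeats times and report mean and minimum wall time.
sub time_query {
    my ( $session, $xquery, $repeats ) = @_;
    my @times;
    for ( 1 .. $repeats ) {
        my $t0    = [gettimeofday];
        my $query = $session->query($xquery);
        $query->execute();
        $query->close();
        push @times, tv_interval($t0);
    }
    my ( $sum, $min ) = ( 0, $times[0] );
    for (@times) { $sum += $_; $min = $_ if $_ < $min }
    return ( mean => $sum / @times, min => $min );
}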
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)
There are several things we have to separate here: The first issue is that BaseX performance should not be confused with that of your Perl script, as your Perl script seems to simply construct an XQuery (and not an XPath, as you suggested in your question and tags). So for testing I would suggest using some predefined XQueries suited to your real-world scenarios, as your XQuery construction should be negligible. How you pass your query to BaseX - via the Perl API or any other means - should not be relevant. Even if your Perl performance is relevant, you should test it separately.
Hence, your original question - whether you should put both scenarios in the same script or not - is no longer relevant. Instead, you simply execute the two separate XQueries for scenarios A and B by themselves, without the Perl script.
You are partly correct to worry about caching; however, it is the Java JIT compiler which will most likely be relevant here (as BaseX is written in Java, and the JVM uses JIT compilation and caching), not BaseX itself. You should therefore use the client/server infrastructure, have a long-running server, and warm it up before running performance measurements.
Regarding performance: The BaseX GUI and also the command line already have included measurements (using the command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
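For example, something like this (a sketch; the query file name is made up) should report timings averaged over ten runs:

# run query.xq ten times, with verbose timing output
basex -V -r 10 query.xq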
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full disclosure: I am with the BaseX team, and you might get better help on the BaseX mailing list, or might want to reference this question there, as our head architect isn't watching SO as regularly as the ML.

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and then write to a TAB-delimited file the names of the genes and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' are... but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character. As a quick Google or Stack Overflow search would tell you...
Here is an approach using the awk utility, which can be used from the command line. The following program is executed by specifying its path and running awk -f <path> <sequence file>:
# NR>1 means: only look at lines after line 1, because the sequence
# starts on line 2 (line 1 is the ">label" header)
NR > 1 {
    # this for-loop goes through all bases in the line
    for (i = 1; i <= length; i++) {
        # for each position encountered, increase "total" by 1
        total++
        # if the character at position i (picked out by substr) is c or g,
        # upper- or lowercase (some bases are lowercase in some fasta
        # files), increment the corresponding counter
        if (substr($0, i, 1) == "c" || substr($0, i, 1) == "C")
            c++
        else if (substr($0, i, 1) == "g" || substr($0, i, 1) == "G")
            g++
    }
}
END {
    # this END block prints the gene name and C, G content in percent,
    # separated by tabs
    print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}
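Since the question mentions Perl, here is a minimal Perl sketch of the same computation (the file names and output layout are my own assumptions): it accumulates each possibly multi-line sequence under its header, then writes one tab-delimited row per gene.

use strict;
use warnings;

open my $in,  '<', 'genes.fasta'    or die "can't read genes.fasta: $!";
open my $out, '>', 'gc_content.tsv' or die "can't write gc_content.tsv: $!";

my ( $name, $seq ) = ( '', '' );

# Emit one tab-delimited row: gene name, G fraction, C fraction.
sub report {
    my ( $name, $seq ) = @_;
    return unless length $seq;
    my $g   = () = $seq =~ /g/gi;    # count G, case-insensitively
    my $c   = () = $seq =~ /c/gi;    # count C
    my $len = length $seq;
    printf {$out} "%s\t%.4f\t%.4f\n", $name, $g / $len, $c / $len;
}

while ( my $line = <$in> ) {
    chomp $line;
    if ( $line =~ /^>(.*)/ ) {       # a ">label" header starts a new gene
        report( $name, $seq );       # flush the previous record, if any
        ( $name, $seq ) = ( $1, '' );
    }
    else {
        $seq .= $line;               # sequences may span several lines
    }
}
report( $name, $seq );               # don't forget the last record

close $_ for $in, $out;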

Is "map" a loop?

While answering this question, I came to realize that I was not sure whether Perl's map can be considered a loop or not.
On one hand, it quacks/walks like a loop (does O(n) work, can be easily re-written by an equivalent loop, and sort of fits the common definition = "a sequence of instructions that is continually repeated").
On the other hand, map is not usually listed among Perl's control structures, of which loops are a subset. E.g. http://en.wikipedia.org/wiki/Perl_control_structures#Loops
So, what I'm looking for is a formal reason to be convinced of one side vs. the other. So far, the former (it is a loop) sounds a lot more convincing to me, but I'm bothered by the fact that I have never seen map mentioned in a list of Perl loops.
map is a higher level concept than loops, borrowed from functional programming. It doesn't say "call this function on each of these items, one by one, from beginning to end," it says "call this function on all of these items." It might be implemented as a loop, but that's not the point -- it also might be implemented asynchronously -- it would still be map.
Additionally, it's not really a control structure in itself -- what if every perl function that used a loop in its implementation were listed under "loops?" Just because something is implemented using a loop, doesn't mean it should be considered its own type of loop.
No, it is not a loop, from my perspective.
Characteristic of (perl) loops is that they can be broken out of (last) or resumed (next, redo). map cannot:
map { last } qw(stack overflow); # ERROR! Can't "last" outside a loop block
The error message suggests that perl itself doesn't consider the evaluated block a loop block.
From an academic standpoint, a case can be made either way, depending on how map is defined. If it always iterates in order, then a foreach loop could be emulated by map, making the two equivalent. Some other definitions of map may allow out-of-order execution of the list for performance (dividing the work among threads or even separate computers). The same could be done with the foreach construct.
But as far as Perl 5 is concerned, map always executes in order, making it equivalent to a loop. The internal structure of the expression map $_*2, 1, 2, 3 results in the following opcodes in execution order, which show that map is built internally as a while-like control structure:
OP enter
COP nextstate
OP pushmark
SVOP const IV 1
SVOP const IV 2
SVOP const IV 3
LISTOP mapstart
LOGOP (0x2f96150) mapwhile   <-- while still has items, shift one off into $_
    PADOP gvsv GV *_      }
    SVOP const IV 2       }  loop body
    BINOP multiply        }
goto LOGOP (0x2f96150)       <-- jump back to the top of the loop
LISTOP leave
The map function is not a loop in Perl. This can be clearly seen by the failure of next, redo, and last inside a map:
perl -le '@a = map { next if $_ % 2 } 1 .. 5; print for @a'
Can't "next" outside a loop block at -e line 1.
To achieve the desired effect in a map, you must return an empty list:
perl -le '@a = map { $_ % 2 ? () : $_ } 1 .. 5; print for @a'
2
4
I think transformation is a better name for constructs like map. It transforms one list into another. A similar function to map is List::Util's reduce, but instead of transforming a list into another list, it transforms a list into a scalar value. By using the word transformation, we can talk about the common aspects of these two higher-order functions.
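For instance (a small illustration):

use List::Util qw(reduce);

my @doubled = map { $_ * 2 } 1 .. 5;        # list -> list: (2, 4, 6, 8, 10)
my $sum     = reduce { $a + $b } 1 .. 5;    # list -> scalar: 15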
That said, it works by visiting every member of the list. This means it behaves much like a loop, and depending on what your definition of "a loop" is it might qualify. Note, my definition means that there is no loop in this code either:
#!/usr/bin/perl
use strict;
use warnings;
my $i = 0;
FOO:
print "hello world!\n";
goto FOO unless ++$i == 5;
Perl actually does define the word loop in its documentation:
loop
    A construct that performs something repeatedly, like a roller coaster.
By this definition, map is a loop because it performs its block repeatedly; however, the documentation also defines "loop control statement" and "loop label":
loop control statement
    Any statement within the body of a loop that can make a loop
    prematurely stop looping or skip an "iteration". Generally you
    shouldn't try this on roller coasters.

loop label
    A kind of key or name attached to a loop (or roller coaster) so
    that loop control statements can talk about which loop they want
    to control.
I believe it is imprecise to call map a loop because next and its kin are defined as loop control statements and they cannot control map.
This is all just playing with words though. Describing map as like-a-loop is a perfectly valid way of introducing someone to it. Even the documentation for map uses a foreach loop as part of its example:
%hash = map { get_a_key_for($_) => $_ } @array;
is just a funny way to write
%hash = ();
foreach (@array) {
    $hash{get_a_key_for($_)} = $_;
}
It all depends on the context though. It is useful to describe multiplication to someone as repeated addition when you are trying to get him or her to understand the concept, but you wouldn't want him or her to continue to think of it that way. You would want him or her to learn the rules of multiplication instead of always translating back to the rules of addition.
Your question turns on the issue of classification. At least under one interpretation, asking whether map is a loop is like asking whether map is a subset of "Loop". Framed in this way, I think the answer is no. Although map and Loop have many things in common, there are important differences.
Loop controls: Chas. Owens makes a strong case that Perl loops are subject to loop controls like next and last, while map is not.
Return values: the purpose of map is its return value; with loops, not so much.
We encounter relationships like this all the time in the real world -- things that have much in common with each other, but with neither being a perfect subset of the other.
 -----------------------------------------
| Things that iterate                     |
|                                         |
|    ------------------                   |
|   | map()            |                  |
|   |                  |                  |
|   |         ---------|---------         |
|   |        |         |         |        |
|   |        |         |         |        |
|    ------------------          |        |
|            |                   |        |
|            |              Loop |        |
|             -------------------         |
|                                         |
 -----------------------------------------
map is a higher-order function. The same applies to grep. The book Higher-Order Perl explains the idea in full detail.
It's sad to see that the discussion moved toward implementation details rather than the concept.
FM's and Dave Sherohman's answers are quite good, but let me add an additional way of looking at map.
map is a function which is guaranteed to look at every element of a structure exactly once. And it is not a control structure, as it (itself) is a pure function. In other words, the invariants that map preserves are very strong, much stronger than 'a loop'. So if you can use a map, that's great, because you then get all these invariants 'for free', while if you're using a (more general!) control structure, you'll have to establish all these invariants yourself if you want to be sure your code is right.
And that's really the beauty of a lot of these higher-order functions: you get many more invariants for free, so that you as a programmer can spend your valuable thinking time maintaining application-dependent invariants instead of worrying about low-level implementation-dependent issues.
map itself is generally implemented using a loop of some sort (to loop over iterators, typically), but since it is a higher-level structure, it's often not included in lists of lower-level control structures.
Here is a definition of map as a recurrence:
sub _map (&@) {
    my $f = shift;
    return unless @_;
    return $f->( local $_ = shift @_ ),
           &_map( $f, @_ );    # & bypasses the prototype on the recursive call
}

my @squares = _map { $_ ** 2 } 1..100;
"Loop" is more of a CS term rather than a language-specific one. You can be reasonably confident in calling something a loop if it exhibits these characteristics:
iterates over elements
does the same thing every time
is O(n)
map fits these pretty closely, but it's not a loop because it's a higher-level abstraction. It's okay to say it has the properties of a loop, even if it itself isn't a loop in the strictest, lowest-level sense.
I think map fits the definition of a Functor.
It all depends on how you look at it...
On the one hand, Perl's map can be considered a loop, if only because that's how it's implemented in (current versions of) Perl.
On the other, though, I view it as a functional map and choose to use it accordingly which, among other things, includes only making the assumption that all elements of the list will be visited, but not making any assumptions about the order in which they will be visited. Aside from the degree of functional purity this brings and giving map a reason to exist and be used instead of for, this also leaves me in good shape if some future version of Perl provides a parallelizable implementation of map. (Not that I have any expectation of that ever happening...)
I think of map as more akin to an operator, like multiplication. You could even think of integer multiplication as a loop of additions :). It's not a loop of course, even if it were stupidly implemented that way. I see map similarly.
A map in Perl is a higher order function that applies a given function to all elements of an array and returns the modified array.
Whether this is implemented using an iterative loop or by recursion or any other way is not relevant and unspecified.
So a map is not a loop, though it may be implemented using a loop.
Map only looks like a loop if you ignore the lvalue. You can't do this with a for loop:
print join ' ', map { $_ * $_ } (1 .. 5)
1 4 9 16 25

Perl: Programming Efficiency when computing correlation coefficients for a large set of data

EDIT: Link should work now, sorry for the trouble
I have a text file that looks like this:
Name, Test 1, Test 2, Test 3, Test 4, Test 5
Bob, 86, 83, 86, 80, 23
Alice, 38, 90, 100, 53, 32
Jill, 49, 53, 63, 43, 23
I am writing a program that, given this text file, will generate a Pearson correlation coefficient table that looks like this, where entry (x,y) is the correlation between person x and person y:
Name,Bob,Alice,Jill
Bob, 1, 0.567088412588577, 0.899798494392584
Alice, 0.567088412588577, 1, 0.812425393004088
Jill, 0.899798494392584, 0.812425393004088, 1
My program works, except that the data set I am feeding it has 82 columns and, more importantly, 54000 rows. When I run my program right now, it is incredibly slow and I get an out-of-memory error. Is there a way I can, first of all, remove any possibility of an out-of-memory error and maybe make the program run a little more efficiently? The code is here: code.
Thanks for your help,
Jack
Edit: In case anyone else is trying to do large scale computation, convert your data into hdf5 format. This is what I ended up doing to solve this issue.
You're going to have to do at least 54000^2*82 calculations and comparisons. Of course it's going to take a lot of time. Are you holding everything in memory? That's going to be pretty large too. It will be slower, but it might use less memory if you can keep the users in a database and calculate one user against all the others, then go on to the next and do it against all the others instead of one massive array or hash.
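To illustrate the one-row-against-all-others idea, here is a sketch of a pure-Perl Pearson function (pearson is a made-up helper) that only ever needs two rows in memory at a time:

use strict;
use warnings;

# Pearson's r via the computational formula:
# r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))
sub pearson {
    my ( $x, $y ) = @_;    # two array refs of equal length
    my $n = @$x;
    my ( $sx, $sy, $sxx, $syy, $sxy ) = (0) x 5;
    for my $i ( 0 .. $n - 1 ) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $syy += $y->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $den = sqrt( $n * $sxx - $sx**2 ) * sqrt( $n * $syy - $sy**2 );
    return $den ? ( $n * $sxy - $sx * $sy ) / $den : 0;
}

# e.g. fetch row $i from storage, then loop $j > $i over the other rows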
Have a look at Tie::File to deal with the high memory usage of having your input and output files stored in memory.
Have you searched CPAN? My own search yielded another method, gsl_stats_correlation, for computing Pearson's correlation. This one is in Math::GSL::Statistics. This module binds to the GNU Scientific Library.
gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:

r = \frac{\operatorname{cov}(x, y)}{\hat\sigma_x \hat\sigma_y}
  = \frac{\frac{1}{n-1} \sum_i (x_i - \hat{x})(y_i - \hat{y})}
         {\sqrt{\frac{1}{n-1} \sum_i (x_i - \hat{x})^2} \, \sqrt{\frac{1}{n-1} \sum_i (y_i - \hat{y})^2}}
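Going by the signature quoted above, usage would look something like this (a sketch; the numbers are the question's Bob and Alice rows):

use Math::GSL::Statistics qw(gsl_stats_correlation);

my @bob   = ( 86, 83, 86, 80, 23 );
my @alice = ( 38, 90, 100, 53, 32 );

# stride 1 means "use every element"; both arrays share length $n
my $r = gsl_stats_correlation( \@bob, 1, \@alice, 1, scalar @bob );
print "r = $r\n";    # should print 0.567088412588577 per the table above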
You may want to look at PDL:

    PDL ("Perl Data Language") gives standard Perl the ability to compactly
    store and speedily manipulate the large N-dimensional data arrays which
    are the bread and butter of scientific computing.
Essentially Paul Tomblin has given you the answer: It's a lot of calculation so it will take a long time. It's a lot of data, so it will take a lot of memory.
However, there may be one gotcha: If you use perl 5.10.0, your list assignments at the start of each method may be victims of a subtle performance bug in that version of perl (cf. perlmonks thread).
A couple of minor points:
The printout may actually slow down the program somewhat, depending on where it goes.
There is no need to reopen the output file for each line! Just do something like this:
open FILE, ">", "file.txt" or die $!;
print FILE "Name, ", join(", ", 0 .. $#{$correlations[0]} + 1), "\n";
my $rowno = 1;
foreach my $row (@correlations) {
    print FILE "$rowno, " . join(", ", @$row) . "\n";
    $rowno++;
}
close FILE;
Finally, while I do use Perl whenever I can, with a program and data set such as you describe, it might be the simplest route to simply use C++ with its iostreams (which make parsing easy enough) for this task.
Note that all of this is just minor optimization. There's no algorithmic gain.
I don't know enough about what you are trying to do to give good advice about implementation, but you might look at Statistics::LSNoHistory, it claims to have a method pearson_r that returns Pearson's r correlation coefficient.
Further to the comment above about PDL, here is the code to calculate the correlation table quite efficiently, even for very big datasets:
use PDL;
use PDL::Stats;    # this useful module can be downloaded from CPAN

my $data  = random(82, 5400);       # your data should replace this
my $table = $data->corr_table();    # that's all, really
You might need to set $PDL::BIGPDL = 1; in the header of your script and make sure you run this on a machine with A LOT of memory. The computation itself is reasonably fast; an 82 x 5400 table took only a few seconds on my laptop.