Perl: Programming Efficiency when computing correlation coefficients for a large set of data


EDIT: Link should work now, sorry for the trouble
I have a text file that looks like this:
Name, Test 1, Test 2, Test 3, Test 4, Test 5
Bob, 86, 83, 86, 80, 23
Alice, 38, 90, 100, 53, 32
Jill, 49, 53, 63, 43, 23
I am writing a program that, given this text file, generates a table of Pearson's correlation coefficients like the one below, where the entry (x,y) is the correlation between person x and person y:
Name,Bob,Alice,Jill
Bob, 1, 0.567088412588577, 0.899798494392584
Alice, 0.567088412588577, 1, 0.812425393004088
Jill, 0.899798494392584, 0.812425393004088, 1
My program works, except that the data set I am feeding it has 82 columns and, more importantly, 54000 rows. When I run my program right now, it is incredibly slow and I get an out of memory error. Is there a way I can first of all, remove any possibility of an out of memory error and maybe make the program run a little more efficiently? The code is here: code.
Thanks for your help,
Jack
Edit: In case anyone else is trying to do large scale computation, convert your data into hdf5 format. This is what I ended up doing to solve this issue.

You're going to have to do at least 54000^2*82 calculations and comparisons. Of course it's going to take a lot of time. Are you holding everything in memory? That's going to be pretty large too. It will be slower, but it might use less memory if you can keep the users in a database and calculate one user against all the others, then go on to the next and do it against all the others instead of one massive array or hash.
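The "one user against all the others" idea can be sketched as follows. This is a minimal illustration in Python rather than Perl, not the asker's code: the helper names normalize and correlation_rows are made up for the example, and the sample scores are the three rows from the question. Each row is centered and scaled once, so every correlation becomes a plain dot product, and the table is produced one row at a time instead of being held in memory as a whole.

```python
import math

def normalize(values):
    """Center a row on its mean and scale it to unit length."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    # A constant row has no defined correlation; the zero norm raises an error here.
    return [c / norm for c in centered]

def correlation_rows(rows):
    """Yield one row of the correlation table at a time.

    rows maps a name to its list of scores. Only the normalized input is
    kept in memory; the n x n table itself is never stored.
    """
    names = list(rows)
    unit = {name: normalize(rows[name]) for name in names}
    for x in names:
        yield x, [sum(a * b for a, b in zip(unit[x], unit[y])) for y in names]

scores = {
    "Bob": [86, 83, 86, 80, 23],
    "Alice": [38, 90, 100, 53, 32],
    "Jill": [49, 53, 63, 43, 23],
}
table = dict(correlation_rows(scores))
for name, row in table.items():
    print(name, ", ".join(f"{r:.6f}" for r in row))
```

In a real run each yielded row would be written straight to the output file rather than collected in a dict; the dict is used here only to keep the example small.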

Have a look at Tie::File to deal with the high memory usage of having your input and output files stored in memory.

Have you searched CPAN? My own search yielded another method, gsl_stats_correlation, for computing Pearson's correlation. This one is in Math::GSL::Statistics. This module binds to the GNU Scientific Library.
gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array reference $data1 and $data2 which must both be of the same length $n. r = cov(x, y) / (\hat\sigma_x \hat\sigma_y) = \frac{\frac{1}{n-1} \sum (x_i - \hat x)(y_i - \hat y)}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat x)^2} \sqrt{\frac{1}{n-1} \sum (y_i - \hat y)^2}}

You may want to look at PDL:
PDL ("Perl Data Language") gives standard Perl the ability to compactly store and speedily manipulate the large N-dimensional data arrays which are the bread and butter of scientific computing.

Essentially Paul Tomblin has given you the answer: It's a lot of calculation so it will take a long time. It's a lot of data, so it will take a lot of memory.
However, there may be one gotcha: If you use perl 5.10.0, your list assignments at the start of each method may be victims of a subtle performance bug in that version of perl (cf. perlmonks thread).
A couple of minor points:
The printout may actually slow down the program somewhat, depending on where it goes.
There is no need to reopen the output file for each line! Just do something like this:
# open the output file once, before the loop
open FILE, ">", "file.txt" or die $!;
# header: one numbered column per entry in a row
print FILE "Name, ", join(", ", 1 .. scalar @{ $correlations[0] }), "\n";
my $rowno = 1;
foreach my $row (@correlations) {
    print FILE "$rowno, " . join(", ", @$row) . "\n";
    $rowno++;
}
close FILE;
Finally, while I do use Perl whenever I can, with a program and data set such as you describe, it might be the simplest route to simply use C++ with its iostreams (which make parsing easy enough) for this task.
Note that all of this is just minor optimization. There's no algorithmic gain.

I don't know enough about what you are trying to do to give good advice about implementation, but you might look at Statistics::LSNoHistory, it claims to have a method pearson_r that returns Pearson's r correlation coefficient.

Further to the comment above about PDL, here is how to calculate the correlation table quite efficiently, even for very big datasets:
use PDL; # core PDL, provides random()
use PDL::Stats; # this useful module can be downloaded from CPAN
my $data = random(82, 5400); # your data should replace this
my $table = $data->corr_table(); # that's all, really
You might need to set $PDL::BIGPDL = 1; in the header of your script and make sure you run this on a machine with A LOT of memory. The computation itself is reasonably fast: an 82 x 5400 table took only a few seconds on my laptop.

Related

Does Perl's Glob have a limitation?

I am running the following expecting return strings of 5 characters:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
print "$_\n";
}
but it returns only 4 characters:
anbc
anbd
anbe
anbf
anbg
...
However, when I reduce the number of characters in the list:
while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
print "$_\n";
}
it returns correctly:
aamid
aamie
aamif
aamig
aamih
...
Can someone please tell me what I am missing here, is there a limit of some sort? or is there a way around this?
If it makes any difference, It returns the same result in both perl 5.26 and perl 5.28

The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.
In your first example that's 26^5 strings (11_881_376), each five chars long. So a list of ~12 million strings, with a (naive) total in excess of 56Mb ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So on the order of 100Mb, at the very least, right there in one list.†
I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.
As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.
However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend to write an algorithm that generates and provides one list element at a time (an "iterator"), and work with that.
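To make the "one element at a time" idea concrete, here is a minimal sketch, written in Python rather than Perl purely for illustration. The function name words is made up for the example. It uses the standard itertools.product, which builds each combination on demand instead of materializing all 26^5 strings first.

```python
from itertools import islice, product
from string import ascii_lowercase

def words(alphabet, length):
    """Lazily yield every string of the given length over alphabet."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)

# Only the items actually requested are ever built.
first_three = list(islice(words(ascii_lowercase, 5), 3))
print(first_three)
```

Memory use stays flat however many strings are consumed, and the loop can be abandoned at any point without having paid for the rest of the list.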
There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...
Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)
† I'm getting 56 bytes for a size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1Gb, in one operation.
Update A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.
It did produce the right number of actual 5-char long strings, seemingly correct, on this server.

Everything has some limitation.
Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:
use v5.10;
use Set::CrossProduct;
my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );
while( my $item = $set->get ) {
    say join '', @$item;
}

Benchmarking in BaseX: how to set up

Currently I am an intern at a research group that makes large sets of texts (corpora) searchable. Not only can one search for literal strings, but more importantly it is also possible to look for similar syntactical dependency structures as the given input, without the need of being proficient in any programming language or corpus annotation style. It may be clear that this tool is intended for linguists.
At the start of the project (before I joined it) the tool was limited to rather small corpora (up to 9 million words). The goal is to make large sets of texts searchable as well. We are talking about roughly 500 million words. Attempts have been made that in theory ought to improve speed by reducing the search space (see this paper), but this has not been tested yet. The result of this attempt is a new file structure. Let's call this structure B, compared to a non-processed structure A. We expect B to provide faster results when queried with BaseX.
My question is: what is the best way to test and compare both approaches with a Perl script? Below you find my current script to query BaseX locally. It takes two arguments. A directory that stores different files. These files each individually store XPaths. Those XPaths are the ones that I have selected to benchmark with. A second argument is the limit of results to return. When set to zero, no limit is set.
Because some parts of the dataset are so incredibly huge, we have divided them in different, equally sized files as well, called treebankparts. They are stored in <tb> tags inside treebankparts.lst.
#!/usr/bin/perl
use warnings;
$| = 1;    # flush every print
# Directory where XPaths are stored
my $directory = shift(@ARGV);
# Set limit. If set to zero all results will be returned
my $limit = shift(@ARGV);
# Create session, connect to BaseX
my $session = Session->new([INFORMATION WITHHELD]);
# List all files in directory
my @xpathfiles = <$directory/*.txt>;
# Read lines of treebank parts into variable
open( my $tfh, '<', "treebankparts.lst" ) or die "cannot open file treebankparts.lst";
chomp( my @tlines = <$tfh> );
close $tfh;
# Loop through all XPaths in $directory
foreach my $xpathfile (@xpathfiles) {
    open( my $xfh, '<', $xpathfile ) or die "cannot open file $xpathfile";
    chomp( my @xlines = <$xfh> );
    close $xfh;
    print STDOUT "File = $xpathfile\n";
    # Loop through lines from XPath file (= XPath query)
    foreach my $xline (@xlines) {
        # Loop through the lines of treebank file
        foreach my $tline (@tlines) {
            my ($treebank) = $tline =~ /<tb>(.+)<\/tb>/;
            QuerySonar( $xline, $treebank );
        }
    }
}
$session->close();
sub QuerySonar {
    my ( $xpath, $db ) = @_;
    print STDOUT "Querying $db for $xpath\n";
    print STDOUT "Limit = $limit\n";
    my $x_limit;
    my $x_resultsofxp = 'declare variable $results := db:open("' . $db . '")/treebank/alpino_ds'
        . $xpath . ';';
    my $x_open = '<results>';
    my $x_totalcount = '<total>{count($results)}</total>';
    my $x_loopinit = '{for $node at $limitresults in $results';
    # Spaces are important!
    if ( $limit > 0 ) {
        $x_limit = ' where $limitresults <= ' . $limit . ' ';
    }
    # Comment needed to prevent `Incomplete FLWOR expression`
    else { $x_limit = '(: No limit set :)'; }
    my $x_sentenceinfo = 'let $sentid := ($node/ancestor::alpino_ds/@id)
        let $sentence := ($node/ancestor::alpino_ds/sentence)
        let $begin := ($node//@begin)
        let $idlist := ($node//@id)
        let $beginlist := (distinct-values($begin))';
    # Separate sentence info by tab
    my $x_loopexit = 'return <match>{data($sentid)}
        {string-join($idlist, "-")}
        {string-join($beginlist, "-")}
        {data($sentence)}</match>}';
    my $x_close = '</results>';
    # Concatenate all XQuery parts
    my $x_concatquery =
        $x_resultsofxp
        . $x_open
        . $x_totalcount
        . $x_loopinit
        . $x_limit
        . $x_sentenceinfo
        . $x_loopexit
        . $x_close;
    my $querysent = $session->query($x_concatquery);
    my $basexoutput = $querysent->execute();
    print $basexoutput . "\n\n";
    $querysent->close();
}
(Note that this is a stripped down version and that it may not work as-is. This snippet does not use structure B!)
What happens is: loop through all XPath files, loop through each line in an XPath file, loop through all treebankparts and then execute the sub. The sub then queries BaseX. This comes down to sending a new XQuery to BaseX, and returning the total hits and the results (possibly limited by an argument in the Perl script). So I got that going, but now the question is: how can I improve this script so I can get some benchmarking results out of it.
First of all, I'd start with adding a profiler to this script. I guess that bit is obvious. However, I am not sure how I should start comparing structure A with B. Would I put both queries (to different databases) in separate scripts, then call a profiler on both, and run both scripts a number of times and get a mean value and compare? Or would I run each query by both databases in the same script, almost at the same time?
It is important to consider caching that is happening. Therefore I am not entirely sure what build-up for benchmarking of a database this huge is appropriate. First one script, then the other. Both at the same time. Alternating queries between the two. And so on. There are so many possibilities, but I wonder which would provide the best results. Also, I would repeat the process a couple of times. Would I repeat each query and then continue to the next, or finish all XPath files, and then repeat the whole process again?
(Reading the description of the benchmark-tag I am confident that this - albeit elaborate - post is suited for SO.)

One possible improvement: minimize the number of times you transfer control from Perl to the database -- just as you have minimized the number of database connections. (Or at least set yourself up to measure the cost of the transfer of control.) I suspect you will get significantly better results if you move your loop into XQuery rather than running the loop in Perl.
A single call to a database management system asking it to perform 1000 searches is likely to be somewhat faster than 1000 calls to the DBMS each requesting a single search. The first involves two context switches: one from your script or bash to the dbms, and one back; the second involves 2000. The last time I measured something like this carefully, each context switch cost something like 500 ms; it mounted up fast. (That said, this was a long time ago, with a different database. But it was surprising [and sobering] to learn that the difference between the two query formulations I was trying to compare was dwarfed by the difference between running the test loop in a script or inside the dbms.)
A second suggestion: From what you say, the size of the database and the result sets seem likely to ensure that caching between runs doesn't have a big effect on the results. But this seems to be a testable assertion, and one worth testing. So set up your A and B scripts, and then do a trial run: does for runcount in 1 2 3 4 5; do perl A.pl; perl B.pl; done produce results comparable to for runcount in 1 2 3 4 5; do perl A.pl; done; for runcount in 1 2 3 4 5; do perl B.pl; done? If they are comparable, then you have reason to believe it doesn't matter if you run A and B separately or in alternation. If they are not comparable, then you know it does matter, which would be very valuable information. Other things being equal, I would expect caching to produce lower times when running one query several times before moving on to the next query, and cache misses to produce higher times if running each query just once. Probably worth measuring.
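The alternating-versus-blocked comparison can also be driven from a single harness. The sketch below is in Python rather than Perl or shell, and everything in it is hypothetical: make_schedule and benchmark are names invented for the example, and the two lambdas merely stand in for "run the query set against structure A" and "against structure B". It only shows the shape of the experiment: build both run orders, time every run, and compare the per-structure means.

```python
import statistics
import time

def make_schedule(names, repeats, interleaved):
    """Return the run order: A B A B ... if interleaved, else A A ... B B ..."""
    if interleaved:
        return [name for _ in range(repeats) for name in names]
    return [name for name in names for _ in range(repeats)]

def benchmark(tasks, schedule):
    """Time each scheduled call and return the mean seconds per task name."""
    timings = {name: [] for name in tasks}
    for name in schedule:
        start = time.perf_counter()
        tasks[name]()
        timings[name].append(time.perf_counter() - start)
    return {name: statistics.mean(t) for name, t in timings.items() if t}

# Hypothetical stand-ins for "run the query set against structure A / B".
tasks = {
    "A": lambda: sum(range(200_000)),
    "B": lambda: sum(range(100_000)),
}
for interleaved in (True, False):
    order = make_schedule(list(tasks), 5, interleaved)
    print("interleaved" if interleaved else "blocked", benchmark(tasks, order))
```

If the interleaved and blocked means agree, run order (and therefore caching between runs) probably does not matter much for this workload; if they differ, that difference is itself a result worth reporting.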
In the same spirit, I would recommend that you run tests both with the loop in the Perl script and with the loop in an XQuery query.
A third suggestion: in practice, a query at the corpus query interface will involve several stages, each with measurable time: transmission of the query from the user's browser (assuming it's a Web interface) to the server, translation of the request into a form suitable for transmission to the back end dbms (here BaseX), context switch to BaseX, processing within BaseX, context switch back, handling by web server, transmission to user. It would be useful to have at least rough estimates of the times involved in each of these steps, or at least of the time taken for everything-but-BaseX.
So if it were me running the tests, I think I'd also prepare a set of vacuous XQuery tests, along the lines of
2 + 3
or just
42
to push the BaseX time as close to zero as possible; the measured time between user initiation of the query and display of response is the per-query overhead. (Interesting question: should one use many different trivial expressions to prevent caching of results, or should one use the same expression over and over, to encourage caching of the result? How can we try to ensure that BaseX will cache the result, but the Web server won't? ...)
A final suggestion: remember that other people who need to do benchmarking will often have the same questions as you do. This means that you can reformulate every question of the form "Should I do X or Y?" into the form "What measurable effect does the difference between X and Y have on the results of a benchmarking test?" Run some tests to try to measure that effect, and write them up. (I always find it makes it more interesting if I force myself to make a prediction after formulating the question but before measuring the difference.)

There are several things we have to separate here. The first is that BaseX performance should not be confused with the performance of your Perl script, as your script seems simply to construct an XQuery (and not an XPath, as your question and tags suggest). So for testing I would suggest using some predefined XQueries suited to your real-world scenarios, as the cost of constructing the XQuery should be negligible. How you pass your query to BaseX, whether via the Perl API or by any other means, should not be relevant. Even if your Perl performance is relevant, you should test it separately.
Hence, your original question of whether you should test both scenarios in the same script is no longer relevant. Instead, you simply execute the two separate XQueries for scenarios A and B by themselves, without the Perl script.
You are partly right to worry about caching. However, it is the Java JIT compiler that will most likely be relevant here (BaseX is written in Java, and it is the JIT that does the caching, not BaseX itself). You should therefore use the client/server infrastructure, keep a long-running server, and warm it up before running performance measurements.
Regarding performance: The BaseX GUI and also the command line already have an included measurement (using command line you can set -V to get run times for parsing, compiling, evaluating and printing). Also, using the -r parameter you can execute a query multiple times and it will give you the average execution times.
In general, if you want to improve the performance of your script you should take a look at the query plan and the optimized query and check whether the appropriate indexes are used. Also, our new Selective Indexing might be very useful to you. If the index isn't used, your query will definitely not perform well for 500 million words.
Full disclosure: I am with the BaseX team. You might get better help on the BaseX mailing list, or you might want to reference this question there, as our head architect isn't watching SO as regularly as the ML.

Find all differences between .mat files

I am looking for a way to list the differences between two .mat files, something that could be useful for many people.
Though I searched everywhere I could think of, I have not found anything that meets my requirements:
Pick 2 mat files
Find the differences
Save them properly
The closest I have come is visdiff. As long as I stay within matlab, it will allow me to browse the differences, but when I save the result it only shows me the top level.
Here is a simplified example of what my files typically look like:
a = 6;
b.c.d = 7;
b.c.e = 'x';
save f1
f = a;
clear a
b.c.e = 'y';
save f2
visdiff('f1.mat','f2.mat')
If I click here on b, I can find the difference. However if I run this and use 'file>save', I am not able to click on b. Thus I still don't know what has been changed.
Note: I don't have Simulink
Hence my question is:
How can I show all differences between 2 mat files to someone without Matlab
Here are the answers that I personally consider to be most suitable for different situations:
Answer for users with Simulink
General answer
Answer displaying all value differences

Find all differences between mat files without MATLAB?
You can find the differences between HDF5 based .mat files with the HDF5 Tools.
Example
Let me shorten your MATLAB example and assume you create two mat files with
clear ; a = 6 ; b.c = 'hello' ; save -v7.3 f1
clear ; a = 7 ; b.e = 'world' ; save -v7.3 f2
Outside MATLAB use
h5ls -v -r f1.mat
to get a listing about the kind of data included f1.mat:
Opened "f1.mat" with sec2 driver.
/ Group
Location: 1:96
Links: 1
/a Dataset {1/1, 1/1}
Attribute: MATLAB_class scalar
Type: 6-byte null-terminated ASCII string
Data: "double"
Location: 1:2576
Links: 1
Storage: 8 logical bytes, 8 allocated bytes, 100.00% utilization
Type: native double
/b Group
Attribute: MATLAB_class scalar
Type: 6-byte null-terminated ASCII string
Data: "struct"
Location: 1:800
Links: 1
/b/c Dataset {5/5, 1/1}
Attribute: H5PATH scalar
Type: 2-byte null-terminated ASCII string
Data: "/b"
Attribute: MATLAB_class scalar
Type: 4-byte null-terminated ASCII string
Data: "char"
Attribute: MATLAB_int_decode scalar
Type: native int
Data: 2
Location: 1:1832
Links: 1
Storage: 10 logical bytes, 10 allocated bytes, 100.00% utilization
Type: native unsigned short
Use of
h5ls -d -r f1.mat
returns the values of the stored data:
/ Group
/a Dataset {1, 1}
Data:
(0,0) 6
/b Group
/b/c Dataset {5, 1}
Data:
(0,0) 104, 101, 108, 108, 111
The data 104, 101, 108, 108, 111 represents the word hello, which can be seen with
h5ls -d -r f1.mat | tail -1 | tr -d ',' | awk '{printf("%c%c%c%c%c\n",$2,$3,$4,$5,$6)}'
You can get the same listing for f2.mat and compare the two outputs with the tool of your choice.
Comparison also works directly with HDF5 Tools. To compare the two numbers a from both files use
h5diff -r f1.mat f2.mat /a
which will show you the values and their difference
dataset: </a> and </a>
size: [1x1] [1x1]
position a a difference
------------------------------------------------------------
[ 0 0 ] 6 7 1
1 differences found
attribute: <MATLAB_class of </a>> and <MATLAB_class of </a>>
0 differences found
Remarks
There are a few more commands and options in the HDF5 Tools, which may help to get your real problem solved.
Binary distributions are available for Linux and Windows from The HDF Group. For OS X you can get them installed via MacPorts. If needed there is also a GUI: HDFView.

If you have simulink you can use Simulink.saveVars to generate an m-file that upon execution creates the same variables in work space:
a = 6;
b.c.d = 7;
b.c.e = 'x';
Simulink.saveVars('f1');
f = a;
clear a
b.c.e = 'y';
Simulink.saveVars('f2');
visdiff('f1.m','f2.m')
as illustrated in this screenshot
Note that by default it limits the number of elements in arrays to 1000 and you can increase it to 10000. Arrays larger than that limit will be saved in a separate mat-file.
UPDATE: From R2014a a new function similar to Simulink.saveVars has been added to MATLAB. see matlab.io.saveVariablesToScript

This is only part of the answer, but maybe it helps.
You could use gencode, a Matlab function that generates Matlab code from a variable such that running the code reproduces the variable. You do this for all of the variables in each mat-file (takes some programming, but should be doable) and put the results in different .m-files.
Then you use a standard text comparison tool (maybe even visdiff) to compare the .m-files.

There are several good tools for comparing XML files, so I would proceed this way:
Download struct2xml.m
Load both matfiles
Export each with struct2xml
compare, using XMLSpy or similar

Simple general answer, without displaying value differences
Due to the insight I gained from the answers of @BHF, @Daniel R and @Dennis Jaheruddin, I have managed to find a simple scalable solution:
[fs1, fs2, er] = comp_struct(load('f1.mat'),load('f2.mat'))
Note that it works for .mat files containing an arbitrary number of variables.
This uses the Compare Structures - File Exchange submission.

Answer for small files, displaying all value differences
Based on the suggestion by @A. Donda I have tried to use gencode to create a variable for everything.
Though it works for my toy example, it is quite slow, and for my real .mat files it tells me that I exceed the allowed number of variables.
Anyway, for those who are looking for something that works with small files, I will post this option:
wList=who;
for iLoop = 1:numel(wList)
    eval(['generated_' wList{iLoop} '= gencode(' wList{iLoop} ');'])
    for jLoop = 1:numel(eval(['generated_' wList{iLoop}]))
        eval(['generated_' wList{iLoop} '_' num2str(jLoop) '= generated_' wList{iLoop} '(' num2str(jLoop) ');' ])
    end
end
Though it may work, I don't feel like this is the best way to go.

General answer, without displaying value differences
Due to the insight I gained from the answers of @BHF and @Daniel R I have managed to find a reasonably scalable solution.
Step 1: Save all variables from each file as a single struct
This uses the Save workspace to struct - File Exchange submission.
Here are the steps to take assuming you want to compare f1.mat and f2.mat:
clear
load f1
myStruct1 = ws2struct;
save myStruct1 myStruct1
clear
load f2
myStruct2 = ws2struct;
save myStruct2 myStruct2
clear
load myStruct1
load myStruct2
Step 2: Compare the structs
This uses the Compare Structures - File Exchange submission
Given that you want to compare myStruct1 and myStruct2 you can simply call:
[fs1, fs2, er] = comp_struct(myStruct1,myStruct2)
I was positively surprised at how readable the list of differences in er is, here is the output for the example that was used in the question:
er =
's2 is missing field a'
's1(1).b(1).c(1).e and s2(1).b(1).c(1).e do not match'
Note that it will not show values, from a technical point of view it is probably not too hard to change the m file if value difference displays are desirable. However, especially if there are some big matrices I suppose this could result in problematic output.
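For illustration only, here is what a recursive comparison that does report values could look like. This is a Python sketch, not comp_struct and not MATLAB code: the function diff is made up for the example, and the two nested dicts are hand-written stand-ins for the variables saved in f1.mat and f2.mat in the question.

```python
def diff(a, b, path=""):
    """Recursively compare two nested dicts and list every difference."""
    found = []
    for key in sorted(set(a) | set(b)):
        where = f"{path}.{key}" if path else key
        if key not in a:
            found.append(f"{where}: only in second")
        elif key not in b:
            found.append(f"{where}: only in first")
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            # Both sides are structs: descend and collect nested differences.
            found.extend(diff(a[key], b[key], where))
        elif a[key] != b[key]:
            found.append(f"{where}: {a[key]!r} != {b[key]!r}")
    return found

# Stand-ins for the contents of f1.mat and f2.mat from the question.
f1 = {"a": 6, "b": {"c": {"d": 7, "e": "x"}}}
f2 = {"f": 6, "b": {"c": {"d": 7, "e": "y"}}}
for line in diff(f1, f2):
    print(line)
```

As noted above, printing values this way is only practical while the leaves are small; for large matrices a summary (size, or number of differing elements) would be a more sensible thing to report.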

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now, but I have to read each sequence, compute relative amounts of 'G' and 'C' within each sequence, and then I'm to write it to a TAB-delimited file the names of the genes, and their respective 'G' and 'C' content.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!

I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
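The steps above can be sketched as follows. This is a rough illustration in Python rather than Perl, under a few assumptions: the helper names read_fasta and gc_fractions are invented for the example, the sample data is made up, and "relative amount" is taken to mean the count of a base divided by the sequence length. A .fasta file is plain text, so it can be opened and read like any .txt file.

```python
import io

def read_fasta(handle):
    """Yield (label, sequence) pairs from a FASTA-formatted file handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new label closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)

def gc_fractions(sequence):
    """Return the relative amounts of G and C in a sequence."""
    seq = sequence.upper()
    return seq.count("G") / len(seq), seq.count("C") / len(seq)

# Made-up sample data standing in for the real .fasta file.
sample = io.StringIO(">gene1\nGGCC\nAATT\n>gene2\nGATTACA\n")
rows = [(label, *gc_fractions(seq)) for label, seq in read_fasta(sample)]
for label, g, c in rows:
    print(f"{label}\t{g:.3f}\t{c:.3f}")
```

Replacing the io.StringIO object with open("genes.fasta") (a hypothetical file name) and writing the printed lines to a file gives the TAB-delimited output the question asks for.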

Here is an approach using 'awk' utility which can be used from the command line. The following program is executed by specifying its path and using awk -f <path> <sequence file>
#NR>1 means only look at lines after line 1, because you said the sequence starts on line 2
NR>1{
    #this for-loop goes through all bases in the line and then performs operations below:
    for (i=1;i<=length;i++)
        #for each position encountered, the variable "total" is increased by 1 for total bases
        total++
}
#the label line is skipped here too, so letters in the gene name are not counted as bases
NR>1{
    for (i=1;i<=length;i++)
        #if the "substring" i.e. position in a line == c or g upper or lower (some bases are
        #lowercase in some fasta files), it will carry out the following instructions:
        if(substr($0,i,1)=="c" || substr($0,i,1)=="C")
            #this increments the c count by one for every c or C encountered, the next if statement does
            #the same thing for g and G:
            c++; else
        if(substr($0,i,1)=="g" || substr($0,i,1)=="G")
            g++
}
END{
    #this "END-block" prints the gene name and C, G content in percentage, separated by tabs
    print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}

Limiting the amount of information printed by Perl debugger

One of my pet peeves with debugging Perl code (in the command-line debugger, perl -d) is the fact that mistakenly printing (via the x command) the contents of a huge data structure is guaranteed to freeze up your terminal for forever and a half while 100s of pages of data are printed. Especially if that happens across a slowish network.
As such, I'd like to be able to limit the amount of data that x prints.
I see two approaches - I'd be willing to try either if someone knows how to do it.
Limit the amount of data any single command in debugger prints.
Better yet, somehow replace the built-in x command with a custom Perl method (which would calculate the "size" of the data structure, and refuse to print its contents without confirmation).
I'm specifically asking "how to replace x with custom code" - building a Good Enough "is the data structure too big" Perl method is something I can likely do on my own without too much effort although I see enough pitfalls preventing the "perfect" one from being a fairly frustrating endeavour. Heck, merely doing Data::Dumper->Dump and taking the length of the string might do the trick :)
Please note that I'm perfectly well aware of how to manually avoid the issue by recursively examining layers of datastructure (e.g. print the ref, print the count of keys/array elements, etc...)... the whole point is I want to be able to avoid thoughtlessly typing x $huge_pile_of_data without thinking - or stumbling on a bug populating said huge pile of data into what should be a scalar.

The x command takes an optional argument for the maximum depth to display. That's not quite the same as limiting the amount of data to N pages, but it's definitely useful to prevent overload.
DB<1> %h = (a => { b => { c => 1 } } )
DB<2> x %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
'c' => 1
DB<3> x 2 %h
0 'a'
1 HASH(0x1d5ff44)
'b' => HASH(0x1d61424)
You can specify the default depth to print via the o command, e.g.
DB<1> o dumpDepth=1
Add that to your .perldb file to apply it to all debugger sessions.
Otherwise, it looks like the x command invokes DB::dumpit() which is just a wrapper for dumpval.pl (or, more specifically, the main::dumpValue() sub it defines). You could modify/replace that script as you see fit. I'm not sure how you'd make it interactive, though.

The | command in the debugger pipes another command's output to your pager, e.g.
DB<1> |x %huge_datastructure
