I am very new to Perl. Recently I wrote some code to calculate the correlation coefficient between the atoms of two structures. This is a brief summary of my program.
for($i=1;$i<=2500;$i++)
{
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
print $Cij;
}
}
This program prints all the correlations serially in a single column. But I need to print the correlations in the form of a matrix, something like..
       Atom1  Atom2  Atom3  Atom4
Atom1    0.5   -0.1    0.6    0.8
Atom2    0.1    0.2    0.3   -0.5
Atom3   -0.8    0.9    1.0    0.0
Atom4    0.3    1.0    0.8   -0.8
I don't know how this can be done. Please help me with a solution or suggest how to do it!
Simple issue you're having. You need to print a newline (NL) after you finish printing a row. However, while I have your attention, I'll prattle on.
You should store your data in a matrix using references. This way, the way you store your data matches the concept of your data:
my @atoms; # Storing the data in here
my $i = 300;
my $j = 400;
my $value = ...; # Calculating what the value should be at row 300, column 400.
# Any one of these will work. Pick one:
$atoms[$i][$j] = $value; # Looks just like a matrix!
$atoms[$i]->[$j] = $value; # Reminds you this isn't really a matrix.
${$atoms[$i]}[$j] = $value; # Now this just looks ridiculous, but is technically correct.
My preference is the second way. It's just a light reminder that this isn't actually a matrix. Instead it's an array of my rows, and each row points to another array that holds the column data for that particular row. The syntax is still pretty clean although not quite as clean as the first way.
Now, let's get back to your problem:
my @atoms; # I'll store the calculated values here
....
$atoms[$i]->[$j] = ... # calculated value for row $i column $j
....
# And now to print out my matrix
for my $i (0..$#atoms) {
    for my $j (0..$#{ $atoms[$i] } ) {
        printf "%4.2f ", $atoms[$i]->[$j]; # Notice no "\n".
    }
    print "\n"; # Print the NL once you finish a row
}
Notice I use for my $i (0..$#atoms). This syntax is cleaner than the C-style three-part for loop, which is discouraged. (Python doesn't have it, and I don't know whether it will be supported in Perl 6.) This is very easy to understand: I'm iterating over the indexes of my array. I also use $#atoms, which is the last index of my @atoms array, that is, one less than the number of rows in my matrix. This way, as my matrix size changes, I don't have to edit my program.
The column index [$j] is a bit trickier. $atoms[$i] is a reference to an array that contains my column data for row $i, and doesn't really represent a row of data directly. (This is why I like $atoms[$i]->[$j] instead of $atoms[$i][$j]. It gives me this subtle reminder.) To get the actual array that contains my column data for row $i, I need to dereference it. Thus, the actual column values for row $i are stored in the array @{$atoms[$i]}.
To get the last index of an array, you replace the @ sigil with $#, so the last index in my array is $#{ $atoms[$i] }.
Oh, another thing, because this isn't a true matrix: each row could have a different number of entries. You can't have that with a real matrix. This makes using an Array of Arrays in Perl a bit more powerful, and a bit more dangerous. If you need a consistent number of columns, you have to manually check for that. A true matrix would automatically create the required columns based upon the largest $j value.
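Putting the pieces above together, here is a minimal sketch that prints a small array of arrays as a labelled matrix. The 3x3 values and the AtomN labels are made up for illustration, and format_matrix is a hypothetical helper name; in the real program each cell would hold the calculated correlation.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the calculated correlations.
my @atoms = (
    [  0.5, -0.1,  0.6 ],
    [  0.1,  0.2,  0.3 ],
    [ -0.8,  0.9,  1.0 ],
);

# Build the whole table as a string: a header line, then one line per row.
sub format_matrix {
    my ($matrix) = @_;
    my $cols = @{ $matrix->[0] };    # number of columns in the first row

    # Header: an empty corner cell, then Atom1 .. AtomN.
    my $out = sprintf '%-6s', '';
    $out .= sprintf '%7s', "Atom$_" for 1 .. $cols;
    $out .= "\n";

    for my $i (0 .. $#{$matrix}) {
        $out .= sprintf '%-6s', 'Atom' . ($i + 1);    # row label
        $out .= sprintf '%7.2f', $_ for @{ $matrix->[$i] };
        $out .= "\n";                                 # newline only after a full row
    }
    return $out;
}

print format_matrix(\@atoms);
```

The header width is taken from the first row only, so this sketch assumes every row has the same number of columns.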
Disclaimer: Pseudo Code, you might have to take care of special cases and especially the headers yourself.
for($i=1;$i<=2500;$i++)
{
print "\n"; # linebreak here.
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
printf "\t%4f",$Cij; # print a tab followed by your float giving it 4
# spaces of room. But no linebreak here.
}
}
This is of course a very crude, quick-and-dirty solution. But if you save the output into a .csv file, most CSV-capable spreadsheet programs (OpenOffice, for example) should easily be able to read it into a proper table. If the spreadsheet viewer of your choice cannot understand tabs as a delimiter, you could easily put ; or / or whatever it can use into the printf string instead.
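As a variation on the loop above, here is a small sketch that builds each row with join, which avoids both the leading tab on every line and the blank first line. The helper name tsv_line and the 2x2 values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one row of numbers into a tab-separated line.
sub tsv_line {
    my @values = @_;
    return join("\t", map { sprintf '%.4f', $_ } @values) . "\n";
}

# Made-up 2x2 matrix standing in for the computed correlations.
my @rows = ( [ 0.5, -0.1 ], [ 0.1, 0.2 ] );
print tsv_line(@$_) for @rows;
```

Changing the "\t" passed to join is enough to switch to another delimiter.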
Related
I am using the Perl Algorithm::NaiveBayes module to classify text.
I am adding the length of the text as an attribute. I have about 1,000 texts, most of which have different lengths. When I try to predict new text I get a NaN result. If I limit the training to 10, the NaN issue goes away.
Is it better in this case to normalize the length to different brackets? Something like
text length 0-10 return the same value for the attribute
text length 11-20 return different value of the attribute
Could someone also help me to understand why I get the NaN? I am using code like this:
my $result = $nb->predict( attributes => { bar => 3, blurp => 2 } );
print $result->{CAT1}, "\n";
print $result->{CAT2}, "\n";
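The bracket idea described in the question could be sketched roughly like this. The helper length_bucket, the len_N attribute names, and the bracket width of 10 are all assumptions for illustration; none of them come from Algorithm::NaiveBayes, and this sketch does not by itself explain the NaN.

```perl
use strict;
use warnings;

# Hypothetical helper: map a text length to a bracket label, so that
# lengths 0-10 share one attribute name, 11-20 the next, and so on.
sub length_bucket {
    my ($len, $width) = @_;
    $width //= 10;                        # bracket width is an assumption
    return 'len_0' if $len <= $width;
    my $n = int( ($len - 1) / $width );   # 11-20 -> 1, 21-30 -> 2, ...
    return "len_$n";
}

my $text = 'a short example text';        # 20 characters
my %attributes = ( length_bucket(length $text) => 1 );
print join(', ', keys %attributes), "\n";
```

With brackets, texts of similar length share one attribute name instead of each carrying its own raw number.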
I have a loop for example :
for my $something ( @place[1..$#thing] ) {
}
I don't get this statement 1..$#thing
I know that # is for comments but my IDE doesn't color #thing as comment. Or is it really just a comment for someone to know that what is in "$" is "thing" ? And if it's a comment why was the rest of the line not commented out like ] ) { ?
If it has other meanings, I would like to know. Sorry if my question sounds odd; I am new to Perl and perplexed by such an expression.
The $# is the syntax for getting the highest index of the array in question, so $#thing is the highest index of the array @thing. This is documented in perldoc perldata.
.. is the range operator, and 1 .. $#thing means a list of numbers, from 1 to whatever the highest index of @thing is.
Using this list inside array brackets with the @ sigil denotes that this is an array slice, which is to say, a selected number of elements in the @place array.
So assuming the following:
my @thing = qw(foo bar baz);
my @place = qw(home work restaurant gym);
then @place[1 .. $#thing] (or 1 .. 2) would expand into the list work, restaurant.
It is correct that # is used for comments, but not in this case.
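A short runnable check of the three pieces described above ($#, the range, and the slice), using the same two arrays:

```perl
use strict;
use warnings;

my @thing = qw(foo bar baz);
my @place = qw(home work restaurant gym);

my $last  = $#thing;                 # highest index of @thing: 2
my @range = (1 .. $#thing);          # the list (1, 2)
my @slice = @place[1 .. $#thing];    # elements 1 and 2 of @place

print "last index: $last\n";         # last index: 2
print "range: @range\n";             # range: 1 2
print "slice: @slice\n";             # slice: work restaurant
```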
It's how you define a range, from a starting value to some other value.
for my $something ( @place[1..3] ) {
# Takes the second through fourth elements (indexes 1 to 3)
}
Binary ".." is the range operator, which is really two different
operators depending on the context. In list context, it returns a list
of values counting (up by ones) from the left value to the right
value. If the left value is greater than the right value then it
returns the empty list. The range operator is useful for writing
foreach (1..10) loops and for doing slice operations on arrays. In the
current implementation, no temporary array is created when the range
operator is used as the expression in foreach loops, but older
versions of Perl might burn a lot of memory when you write something
like this:
http://perldoc.perl.org/perlop.html#Range-Operators
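The list-context behaviour quoted above, including the empty list when the left value is greater than the right, can be seen in a few lines. The variable names are only for illustration.

```perl
use strict;
use warnings;

my @up    = (1 .. 5);           # counts up by ones
my @empty = (5 .. 1);           # left value greater than right: empty list
my @down  = reverse(1 .. 5);    # to count down, reverse an ascending range

print "up: @up\n";                                   # up: 1 2 3 4 5
print "empty has ", scalar(@empty), " elements\n";   # empty has 0 elements
print "down: @down\n";                               # down: 5 4 3 2 1
```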
I have a one-dimensional PDL that I'd like to perform calculations on each half of; i.e. split it, then do calculations on the first half, and the same calculations on the second half.
Is there an easier/nicer/elegant way to simply split the PDL in half than getting the number of elements (with nelem), dividing that in two, then doing two lots of slices?
Thanks
Yes, in so far as you don't need to directly invoke slice to get what you want. You could chain splitdim and dog with something like this:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
That, of course, is easily extended to more than two divisions. However, if you want to extend it to higher-dimensional piddles (i.e. a collection of time series all stored in one piddle), you would need to be a little more subtle. If you want to split along the first dimension (which has index 0), you would say this instead:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->dim(0)/2)->mv(1, -1)->dog;
The splitdim operation splits the 0th dimension into two dimensions, the 0th being dim(0)/2 in length, the 1st being 2 in length (because we divided it into two pieces). Since dog operates on the last dimension, we move the 1st dimension to the end before invoking dog.
However, even with the single-dimensional solution, there's a caveat. Due to the way that $data->splitdim works, it will truncate the last piece of data if you have an odd number of elements. Try that operation on a piddle with 21 elements and you'll see what I mean:
use PDL;
use feature 'say';
my $data = sequence(20);
say "data is $data"; # lists 0-19
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
say "left is $left and right is $right"; # lists 0-9, then 10-19
$data = sequence(21);
say "data is $data"; # lists 0-20, i.e. 21 elements
($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
say "left is $left and right is $right"; # lists 0-9, then 10-19!!
If you want to avoid that, you can produce your own method that splits the first dimension in half without truncation. It would probably look something like this:
sub PDL::split_in_half {
my $self = shift;
# the int() isn't strictly necessary, but should make things a
# tad faster
my $left = $self->slice(':' . int($self->dim(0)/2-1) );
my $right = $self->slice(int($self->dim(0)/2) . ':');
return ($left, $right);
}
Here I have also used the int built-in to make sure we don't have the .5 if dim(0) is odd. It's a little more complicated, but we're burying this complexity into a method precisely so we don't have to think about the complexity, so we may as well buy ourselves a few clock cycles while we're at it.
Then you could easily invoke the method thus:
my ($left, $right) = $data->split_in_half;
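The index arithmetic inside split_in_half can be checked in plain Perl, without PDL. The helper half_slice_specs below is hypothetical and only builds the two slice strings the method would pass to slice.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the arithmetic in split_in_half:
# returns the two slice strings for a dimension of size $n.
sub half_slice_specs {
    my ($n) = @_;
    my $left  = ':' . int($n / 2 - 1);   # first half ends at this index
    my $right = int($n / 2) . ':';       # second half starts at this index
    return ($left, $right);
}

for my $n (20, 21) {
    my ($left, $right) = half_slice_specs($n);
    print "n=$n: left '$left', right '$right'\n";
}
```

For 21 elements the strings are the same as for 20, so the left half holds indexes 0 to 9 and the right half holds indexes 10 to 20, which means the extra element ends up on the right.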
I think I've got the gist of creating a table using Perl's PDF::Report and PDF::Report::Table, but am having difficulty seeing what the 2-dimensional array @data would look like.
The documentation says it's a 2-dimensional array, but the example on CPAN just shows an array of arrays test1, test2, and so on, rather than the example showing data and formatting like $padding $bgcolor_odd, and so on.
Here's what I've done so far:
$main_rpt_path = "/home/ics/work/rpts/interim/mtr_prebill.rpt";
$main_rpt_pdf =
new PDF::Report('PageSize' => 'letter', 'PageOrientation' => 'Landscape',);
$main_rpt_tbl_wrt =
PDF::Report::Table->new($main_rpt_pdf);
Obviously, I can't pass a one dimensional array, but I have searched for examples and can only find the one in CPAN search.
Edit:
Here is how I am trying to call addTable:
$main_rpt_tbl_wrt->addTable(build_table_writer_array($pt_column_headers_ref, undef));
.
.
.
sub build_table_writer_array
# $data -- an array ref of data
# $format -- an array ref of formatting
#
# returns an array ref of a 2d array.
#
{
my ($data, $format) = @_;
my $out_data_table = undef;
my @format_array = (10, 10, 0xFFFFFF, 0xFFFFCC);
$out_data_table = [[@$data],];
return $out_data_table;
}
and here is the error I'm getting.
Use of uninitialized value in subtraction (-) at /usr/local/share/perl5/PDF/Report/Table.pm line 88.
at /usr/local/share/perl5/PDF/Report/Table.pm line 88
I cannot figure out what addTable wants for data. That is I am wondering where the formatting is supposed to go.
Edit:
It appears the addTable call should look like
$main_rpt_tbl_wrt->addTable(build_table_writer_array($pt_column_headers_ref), 10, 10, 0xFFFFFF, 0xFFFFCC);
not the way I've indicated.
This looks like a bug in the module. I tried running the example code in the SYNOPSIS, and I got the same error you get. The module has no real tests, so it is no surprise that there would be bugs. You can report it on CPAN.
The POD has bugs, too.
You increase your chances of getting it fixed if you look at the source code and fix it yourself with a patch.
Thank you in advance for indulging an amateur Perl question. I'm extracting some data from a large, unformatted text file, and am having trouble combining the use of a 'while' loop and regular expression matching over multiple lines.
First, a sample of the data:
01-034575 18/12/2007 258,750.00 11,559.00 36 -2 0 6 -3 2 -2 0 2 1 -1 3 0 5 15
-13 -44 -74 -104 -134 -165 -196 -226 -257 -287 -318 -349 -377 -408 -438
-469 -510 -541 -572 -602 -633 -663
Atraso Promedio ---> 0.94
The first sequence, XX-XXXXXX is a loan ID number. The date and the following two numbers aren't important. '36' is the number of payments. The following sequence of positive and negative numbers represent how late/early this client was for this loan at each of the 36 payment periods. The '0.94' following 'Atraso Promedio' is the bank's calculation for average delay. The problem is it's wrong, since they substitute all negative (i.e. early) payments in the series with zeros, effectively over-stating how risky a client is. I need to write a program that extracts ID and number of payments, and then dynamically calculates a multi-line average delay.
Here's what I have so far:
#Create an output file
open(OUT, ">out.csv");
print OUT "Loan_ID,Atraso_promedio,Atraso_alt,N_payments,\n";
open(MYINPUTFILE, "<DATA.txt");
while(<MYINPUTFILE>){
chomp($_);
if($ID_select != 1 && m/(\d{2}\-\d{6})/){$Loan_ID = $1, $ID_select = 1}
if($ID_select == 1 && m/\d{1,2},\d{1,3}\.00\s+\d{1,2},\d{1,3}\.00\s+(\d{1,2})/) {$N_payments = $1, $Payment_find = 1};
if($Payment_find == 1 && $ID_select == 1){
while(m/\s{2,}(\-?\d{1,3})/g){
$N++;
$SUM = $SUM + $1;
print OUT "$Loan_ID,$1\n"; #THIS SHOWS ME WHAT NUMBERS THE CODE IS GRABBING. ACTUAL OUTPUT WILL BE WRITTEN BELOW
print $Loan_ID,"\n";
}
if(m/---> *(\d*.\d*)/){$Atraso = $1, $Atraso_select = 1}
if($ID_select == 1 && $Payment_find == 1 && $Atraso_select == 1){
...
There's more, but the while loop is where the program is breaking down. The problem is with the pattern modifier, 'g,' which performs a global search of the string. This makes the program grab numbers that I don't want, such as the '1' in loan ID and the '36' for the number of payments. I need the while loop to start from wherever the previous line in the code left off, which should be right after it has identified the number of loans. I've tried every pattern modifier that I've been able to look up, and only 'g' keeps me out of an infinite loop. I need the while loop to go to the end of the line, then start on the next one without combing over the parts of the string already fed through the program.
Thoughts? Does this make sense? Would be immensely grateful for any help you can offer. This work is pro-bono, unpaid: just trying to help out some friends in a micro-lending institution conduct a risk analysis.
Cheers,
Aaron
The problem is probably easier using split, for instance something like this:
use strict;
use warnings;
open DATA, "<DATA.txt" or die "$!";
my @payments;
my $numberOfPayments;
my $loanNumber;
while(<DATA>)
{
    if(/\b\d{2}-\d{6}\b/)
    {
        ($loanNumber, undef, undef, undef, $numberOfPayments, @payments) = split;
    }
    elsif(/Atraso Promedio/)
    {
        my (undef, undef, undef, $atrasoPromedio) = split;
        # Calculate average of payments and print results
    }
    else
    {
        push(@payments, split);
    }
}
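To make the "calculate average" step concrete, here is a self-contained sketch of the same split-based approach run against the sample record from the question. It reads from an in-memory string rather than DATA.txt, and the helper parse_loans and its hash keys are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Sample record copied from the question.
my $text = <<'END';
01-034575 18/12/2007 258,750.00 11,559.00 36 -2 0 6 -3 2 -2 0 2 1 -1 3 0 5 15
-13 -44 -74 -104 -134 -165 -196 -226 -257 -287 -318 -349 -377 -408 -438
-469 -510 -541 -572 -602 -633 -663
Atraso Promedio ---> 0.94
END

# Hypothetical helper: parse the report text and return one hash per loan.
sub parse_loans {
    my ($input) = @_;
    open my $fh, '<', \$input or die "Cannot read in-memory data: $!";
    my (@loans, $loan_id, $n_payments, @payments);
    while (my $line = <$fh>) {
        if ($line =~ /^\s*\d{2}-\d{6}\b/) {
            # Header line: ID, date, two amounts, payment count, first delays.
            ($loan_id, undef, undef, undef, $n_payments, @payments) = split ' ', $line;
        }
        elsif ($line =~ /Atraso Promedio/) {
            my $bank_avg = (split ' ', $line)[3];
            my $sum = 0;
            $sum += $_ for @payments;
            push @loans, {
                id       => $loan_id,
                n        => $n_payments,
                found    => scalar @payments,    # delays actually collected
                bank_avg => $bank_avg,
                alt_avg  => @payments ? $sum / @payments : undef,
            };
        }
        else {
            # Continuation line: more delays for the current loan.
            push @payments, split ' ', $line;
        }
    }
    return @loans;
}

for my $loan (parse_loans($text)) {
    printf "%s,%s,%.2f,%d\n", @{$loan}{qw(id bank_avg alt_avg n)};
}
```

Comparing the found count with the stated number of payments is a cheap sanity check that no continuation line was missed.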
If the data's clean enough, I might approach it by using split instead of regular expressions. The first line is identifiable if $field[0] matches the form of a loan number and $field[1] matches the format of a date; the payment delays are then the array slice @field[5..$#field]. Similarly, testing the first field of each line tells you where you are in the data.
Peter van her Heijden's answer is a nice simplification for a solution.
To answer the OP's question about getting the regexp to continue where it left off, see Perl operators - regexp-quote-like operators, specifically the section "Matching in list context" and the "\G assertion" section just after that.
Essentially, you can use m//gc along with the \G assertion to make a regexp resume matching where the previous match left off.
The example in the "\G assertion" section about lex-like scanners would seem to apply to this question.
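As a rough sketch of that technique applied to the first line of the sample data (shortened here), the header fields are matched once and the loop then continues from that position. The header pattern is an assumption based on the sample record, not a tested parser for the full file.

```perl
use strict;
use warnings;

my $line = '01-034575 18/12/2007 258,750.00 11,559.00 36 -2 0 6 -3';

my @delays;
# First match the fixed header fields; /g records where the match ended
# (see pos()) and /c keeps that position if a later match fails.
if ($line =~ /\G\s*(\d{2}-\d{6})\s+\S+\s+\S+\s+\S+\s+(\d+)/gc) {
    my ($loan_id, $n_payments) = ($1, $2);
    # \G anchors each new match at the end of the previous one, so the
    # loop never rescans the ID, date, amounts, or payment count.
    while ($line =~ /\G\s+(-?\d+)/gc) {
        push @delays, $1;
    }
    print "$loan_id ($n_payments payments): @delays\n";
}
```

Because the position is stored per string, the continuation lines of a record would each start matching from their own beginning.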