I am using the Perl Algorithm::NaiveBayes module to classify text.
I am adding the length of the text as an attribute. I have about 1,000 texts, most of which have different lengths. When I try to predict the category of a new text, I get a NaN result. If I limit the training to 10, the NaN issue goes away.
Is it better in this case to normalize the length into brackets? Something like:
text length 0-10 returns one value for the attribute
text length 11-20 returns a different value for the attribute
Could someone also help me to understand why I get the NaN? I am using code like this:
my $result = $nb->predict( attributes => { bar => 3, blurp => 2 } );
print $result->{CAT1}, "\n";
print $result->{CAT2}, "\n";
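The bracket idea from the question can be sketched as follows. This is a Python illustration only, not Algorithm::NaiveBayes code; the function name length_bracket, the label format, and the bracket width of 10 are all assumptions made for the example, and the sketch does not by itself explain where the NaN comes from.

```python
def length_bracket(text, width=10):
    """Map a text length to a coarse bracket label (hypothetical naming)."""
    n = len(text)
    if n <= width:
        # Lengths 0..width share the first bracket, as in "0-10" above.
        return f"len_0_{width}"
    # Later brackets start at width+1, 2*width+1, ... (11-20, 21-30, ...).
    low = ((n - 1) // width) * width + 1
    return f"len_{low}_{low + width - 1}"


# Every text in the same bracket contributes the same attribute name,
# so the classifier sees a handful of length features, not ~1,000.
text = "a short example text"
attributes = {"bar": 3, "blurp": 2, length_bracket(text): 1}
```

With this kind of binning, a length of 20 and a length of 11 produce the same attribute, while a length of 21 produces a different one.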
I have a string and I need two characters to be returned.
I tried with strsplit but the delimiter must be a string and I don't have any delimiters in my string. Instead, I always want to get the second number in my string. The number is always 2 digits.
Example: 001a02.jpg I use the fileparts function to delete the extension of the image (jpg), so I get this string: 001a02
The expected return value is 02
Another example: 001A43a. Return value: 43
Another one: 002A12. Return value: 12
All the filenames are in a matrix 1002x1. Maybe I can use textscan but in the second example, it gives "43a" as a result.
(Just so this question doesn't remain unanswered, here's a possible approach: )
One way to go about this uses splitting with regular expressions (MATLAB's strsplit which you mentioned):
str = '001a02.jpg';
C = strsplit(str,'[a-zA-Z.]','DelimiterType','RegularExpression');
Results in:
C =
'001' '02' ''
In older versions of MATLAB, before strsplit was introduced, similar functionality was achieved using regexp(...,'split').
If you want to learn more about regular expressions (abbreviated as "regex" or "regexp"), there are many online resources (JGI..)
In your case, if you only need to take the 5th and 6th characters from the string you could use:
D = str(5:6);
... and if you want to convert those into numbers you could use:
E = str2double(str(5:6));
If your number is always at a certain position in the string, you can simply index this position.
In the examples you gave, the number is always the 5th and 6th characters in the string.
filename = '002A12';
num = str2num(filename(5:6));
Otherwise, if the formatting is more complex, you may want to use a regular expression. There is a similar question, "matlab - extracting numbers from (odd) string". Modifying the code found there, you can do the following:
all_num = regexp(filename, '\d+', 'match'); %Find all numbers in the filename
num = str2num(all_num{2}) %Convert second number from str
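For comparison, the same "find every run of digits, then take the second one" idea can be written as a short Python sketch. The function name second_number is made up for this illustration, and returning None when there are fewer than two numbers is an assumption about what the caller would want.

```python
import re


def second_number(name):
    """Return the second run of digits in name as a string, or None."""
    runs = re.findall(r"\d+", name)  # e.g. ['001', '02'] for '001a02.jpg'
    return runs[1] if len(runs) > 1 else None


# Applied to a list of filenames, like the 1002x1 matrix in the question.
filenames = ["001a02.jpg", "001A43a", "002A12"]
numbers = [second_number(f) for f in filenames]
```

Because the match is on digit runs, a trailing letter such as the "a" in 001A43a is never included in the result.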
I am very new to Perl. Recently I wrote a code to calculate the coefficient of correlation between the atoms between two structures. This is a brief summary of my program.
for($i=1;$i<=2500;$i++)
{
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
print $Cij;
}
}
This program prints all the correlations serially in a single column. But I need to print the correlations in the form of a matrix, something like..
Atom1 Atom2 Atom3 Atom4
Atom1 0.5 -0.1 0.6 0.8
Atom2 0.1 0.2 0.3 -0.5
Atom3 -0.8 0.9 1.0 0.0
Atom4 0.3 1.0 0.8 -0.8
I don't know how this can be done. Please help me with a solution or suggest how to do it!
It's a simple issue you're having: you need to print a newline (NL) after you finish printing a row. However, while I have your attention, I'll prattle on.
You should store your data in a matrix using references. This way, the way you store your data matches the concept of your data:
my @atoms; # Storing the data in here
my $i = 300;
my $j = 400;
my $value = ...; # Calculating what the value should be at row 300, column 400.
# Any one of these will work. Pick one:
$atoms[$i][$j] = $value; # Looks just like a matrix!
$atoms[$i]->[$j] = $value; # Reminds you this isn't really a matrix.
${$atoms[$i]}[$j] = $value; # Now this just looks ridiculous, but is technically correct.
My preference is the second way. It's just a light reminder that this isn't actually a matrix. Instead it's an array of my rows, and each row points to another array that holds the column data for that particular row. The syntax is still pretty clean although not quite as clean as the first way.
Now, let's get back to your problem:
my @atoms; # I'll store the calculated values here
....
$atoms[$i]->[$j] = ... # calculated value for row $i column $j
....
# And now to print out my matrix
for my $i (0..$#atoms) {
for my $j (0..$#{ $atoms[$i] } ) {
printf "%4.2f ", $atoms[$i]->[$j]; # Notice no "\n".
}
print "\n"; # Print the NL once you finish a row
}
Notice I use for my $i (0..$#atoms). This syntax is cleaner than the C-style three-part for, which is being discouraged. (Python doesn't have it, and I don't know if it will be supported in Perl 6.) This is very easy to understand: I'm incrementing through my array. I also use $#atoms, which is the last index of my @atoms array, that is, one less than the number of rows in my matrix. This way, as my matrix size changes, I don't have to edit my program.
The column index [$j] is a bit trickier. $atoms[$i] is a reference to an array that contains my column data for row $i, and doesn't really represent a row of data directly. (This is why I like $atoms[$i]->[$j] instead of $atoms[$i][$j]. It gives me this subtle reminder.) To get the actual array that contains my column data for row $i, I need to dereference it. Thus, the actual column values for row $i are stored in the array @{ $atoms[$i] }.
To get the last index of an array, you replace the @ sigil with $#, so the last index in my array is $#{ $atoms[$i] }.
Oh, another thing because this isn't a true matrix: each row could have a different number of entries. You can't have that with a real matrix. This makes using an Array of Arrays in Perl a bit more powerful, and a bit more dangerous. If you need a consistent number of columns, you have to manually check for that. A true matrix would automatically create the required columns based upon the largest $j value.
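The row-by-row printing described above, including the Atom1, Atom2, ... headers the question asked for, can be sketched in Python with a list of lists standing in for the Perl array of arrays. The function name format_matrix and the column widths are assumptions chosen for this example, not part of the original program.

```python
def format_matrix(rows, labels):
    """Return the matrix as text: one header line, then one line per row."""
    # Header: blank corner cell, then one right-aligned label per column.
    lines = [" " * 6 + "".join(f"{label:>7}" for label in labels)]
    for label, row in zip(labels, rows):
        # All values of a row go on one line; the newline comes only
        # when the rows are joined, i.e. once per finished row.
        cells = "".join(f"{value:7.2f}" for value in row)
        lines.append(f"{label:<6}{cells}")
    return "\n".join(lines)


correlations = [[0.5, -0.1], [0.1, 0.2]]
table = format_matrix(correlations, ["Atom1", "Atom2"])
print(table)
```

As with the Perl version, the inner loop follows the actual length of each row, so a ragged row simply prints fewer cells.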
Disclaimer: Pseudo Code, you might have to take care of special cases and especially the headers yourself.
for($i=1;$i<=2500;$i++)
{
print "\n"; # linebreak here.
for($j=1;$j<=2500;$j++)
{
calculate the correlation (Cij);
printf "\t%4f",$Cij; # print a tab followed by your float giving it 4
# spaces of room. But no linebreak here.
}
}
This is of course a very crude, quick-and-dirty solution. But if you save the output into a .csv file, most CSV-capable spreadsheet programs (OpenOffice) should easily be able to read it into a proper table. If the spreadsheet viewer of your choice cannot understand tabs as a delimiter, you could easily add ; or / or whatever it can use into the printf string.
I have a one-dimensional PDL that I'd like to perform calculations on each half of; i.e. split it, then do calculations on the first half, and the same calculations on the second half.
Is there an easier/nicer/elegant way to simply split the PDL in half than getting the number of elements (with nelem), dividing that in two, then doing two lots of slices?
Thanks
Yes, in so far as you don't need to directly invoke slice to get what you want. You could chain splitdim and dog with something like this:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
That, of course, is easily extended to more than two divisions. However, if you want to extend it to higher-dimensional piddles (i.e. a collection of time series all stored in one piddle), you would need to be a little more subtle. If you want to split along the first dimension (which has index 0), you would say this instead:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->dim(0)/2)->mv(1, -1)->dog;
The splitdim operation splits the 0th dimension into two dimensions, the 0th being dim(0)/2 in length, the 1st being 2 in length (because we divided it into two pieces). Since dog operates on the last dimension, we move the 1st dimension to the end before invoking dog.
However, even with the single-dimensional solution, there's a caveat. Due to the way that $data->splitdim works, it will truncate the last piece of data if you have an odd number of elements. Try that operation on a piddle with 21 elements and you'll see what I mean:
my $data = sequence(20);
say "data is $data"; # lists 0-19
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
say "left is $left and right is $right"; # lists 0-9, then 10-19
$data = sequence(21);
say "data is $data"; # lists 0-20, i.e. 21 elements
($left, $right) = $data->splitdim(0, $data->nelem/2)->dog; # reuse the variables declared above
say "left is $left and right is $right"; # lists 0-9, then 10-19!!
If you want to avoid that, you can produce your own method that splits the first dimension in half without truncation. It would probably look something like this:
sub PDL::split_in_half {
my $self = shift;
# the int() isn't strictly necessary, but should make things a
# tad faster
my $left = $self->slice(':' . int($self->dim(0)/2-1) );
my $right = $self->slice(int($self->dim(0)/2) . ':');
return ($left, $right);
}
Here I have also used the int built-in to make sure we don't have the .5 if dim(0) is odd. It's a little more complicated, but we're burying this complexity into a method precisely so we don't have to think about the complexity, so we may as well buy ourselves a few clock cycles while we're at it.
Then you could easily invoke the method thus:
my ($left, $right) = $data->split_in_half;
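The index arithmetic behind the non-truncating split can be checked with a small Python sketch that uses a plain list in place of a piddle. The name split_in_half mirrors the Perl method above, but this is only an illustration of the slicing logic, not PDL code.

```python
def split_in_half(values):
    """Split a sequence at its midpoint without dropping any element."""
    mid = len(values) // 2  # integer division, like int(dim(0)/2)
    # With an odd length, the extra element ends up in the right half.
    return values[:mid], values[mid:]


left, right = split_in_half(list(range(21)))
```

For 21 elements this gives 10 on the left (0 to 9) and 11 on the right (10 to 20), which is the case the splitdim approach truncates.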
I have a text file that I want to make into a list. I have asked two questions recently about this topic. The problem I keep coming across is that I want to parse the text file but the sections are of different lengths. So I cannot use
textscan(fid,'%s %s %s')
because the length of each gene varies. I have also had trouble using fields, because the code I use to set up the fields only allows for one line in each field. For the "note" field in the first gene below, I would like to be able to store multiple lines in one field and read them in. Currently I am getting errors saying that the index exceeds matrix dimensions.
fieldname = regexp(line{1},'/(.+)=','tokens','once');
value = regexp(line{1},'="?([^"]+)"?$','tokens','once');
Another possible way I see this working is using some sort of isLineEmpty check to divide up the genes by the empty line that is between them.
Is there a way to have multiple lines in my field entry so I can get all the information associated with "note"? Or a way to use isLineEmpty and skip using fields?
gene 218705..219367
/locus_tag="Rv0187"
/db_xref="GeneID:886779"
CDS 218705..219367
/locus_tag="Rv0187"
/EC_number="2.1.1.-"
/function="THOUGHT TO BE INVOLVED IN TRANSFER OF METHYL
GROUP."
/note="Rv0187, (MTCI28.26), len: 220 aa. Probable
O-methyltransferase (EC 2.1.1.-), similar to many e.g.
AB93458.1|AL357591 putative O-methyltransferase from
Streptomyces coelicolor (223 aa); MDMC_STRMY|Q00719
O-methyltransferase from Streptomyces mycarofaciens (221
aa), FASTA scores: opt: 327, E(): 2.4e-17, (35.9% identity
in 192 aa overlap). Also similar to Rv1703c, Rv1220c from
Mycobacterium tuberculosis."
/codon_start=1
/transl_table=11
/product="O-methyltransferase"
/protein_id="NP_214701.1"
/db_xref="GI:15607328"
/db_xref="GeneID:886779"
gene 219486..219917
/locus_tag="Rv0188"
/db_xref="GeneID:886776"
CDS 219486..219917
/locus_tag="Rv0188"
/function="UNKNOWN"
/experiment="experimental evidence, no additional details
recorded"
/codon_start=1
/transl_table=11
/product="transmembrane protein"
/protein_id="NP_214702.1"
/db_xref="GI:15607329"
I would probably consider using some sort of simple wrapper function to collapse the multi-line fields into a single line. Something like:
function l = readlongline( fh )
quotesSeen = 0;
done = false;
l = '';
while ~done
tline = fgetl( fh );
if ~ischar( tline )
% Hit EOF
l = tline;
return
end
quotesSeen = quotesSeen + length( strfind( tline, '"' ) );
% Break if we've seen 0 or 2 quotes
done = any( quotesSeen == [0 2] );
l = [l, tline];
end
end
This is intended to be a replacement for fgetl.
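The quote-counting idea can also be shown as a loose Python analogue. The name read_long_line is made up for this sketch, and it differs from the MATLAB function in one respect that is an assumption on my part: at end of file it returns whatever partial text was collected, or None if nothing was read, instead of the fgetl end-of-file value.

```python
import io


def read_long_line(handle):
    """Read one logical line, joining physical lines until quotes balance."""
    quotes_seen = 0
    parts = []
    while True:
        line = handle.readline()
        if line == "":
            # End of file: return what was collected, or None if nothing was.
            return "".join(parts) if parts else None
        line = line.rstrip("\n")
        quotes_seen += line.count('"')
        parts.append(line)
        # A complete field has no quotes, or an opening and a closing one.
        if quotes_seen in (0, 2):
            return "".join(parts)


# Small in-memory stand-in for the feature file.
sample = io.StringIO('/function="A B\n C."\n/codon_start=1\n')
first = read_long_line(sample)
second = read_long_line(sample)
```

Like the original, this assumes each field value contains at most one opening and one closing double quote, with no quotes inside the value.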
A=rand(10)
B=find(A>98)
How do you display text saying "There were 2 elements found", where the 2 is not hard-coded text, so that if I changed the code to B=find(A>90) the number would update automatically?
some_number = 2;
text_to_display = sprintf('There were %d elements found',some_number);
disp(text_to_display);
Also, if you wanted to count the number of elements greater than 98 in A, you should use one of the following:
numel(find(A>98));
Or
sum(A>98);
sprintf is a very elegant way to display such data and it's quite easy for a person with a C/C++ background to start using it. If you're not comfortable with the format-specifier syntax (check out the link) then you can use:
text_to_display = ['There were ' num2str(some_number) ' elements found'];
But I would recommend sprintf :)