How to get the array of all ngrams in Perl Text::Ngrams - perl

As you know the module Text::Ngrams in Perl can give Ngrams analysis. There is the following function to retrieve the array of Ngrams and frequencies.
get_ngrams(orderby=>'ngram|frequency|none',onlyfirst=>NUMBER,out=>filename|handle,normalize=>1)
But it gives only the last Ngrams.
For example the following code does not give both Uni-Gram and Bi-Gram:
my $ng3 = Text::Ngrams->new( windowsize => 2, type=>'byte');
my $text = "test teXT TESTtexT";
$text =~ s/ +/ /g; # replace multiple spaces to single
$text = uc $text; # uppercase all
$ng3->process_text($text);
my #ngramsarray = $ng3->get_ngrams(orderby=>'frequency', onlyfirst=>10, normalize=>0 );
foreach(#ngramsarray)
{
print "$_\n";
}
output:
T E
4
E X
2
_ T
2
E S
2
S T
2
X T
2
T _
2
T T
1
However by using the function
to_string(orderby=>'ngram|frequency|none',onlyfirst=>NUMBER,out=>filename|handle,normalize=>1,spartan=>1)
it shows both of Ngrams. But only it displays the result. I need the result in an array.
How to get all Ngrams (Unigram and Bigram) at the same time by this array?

You can't get all the different sizes of n-grams at the same time, but you can get them all using multiple calls to get_ngrams. There is an undocumented parameter n to get_ngrams that says the size of the n-grams you want listed.
In your code, if you say
my #ngramsarray = $ng3->get_ngrams(
n => 1,
orderby = >'frequency',
onlyfirst => 10,
normalize => 0);
you get this list
('T', 8, 'E', 4, 'X', 2, '_', 2, 'S', 2)

Related

Regular Expression Matching Perl for first case of pattern

I have multiple variables that have strings in the following format:
some_text_here__what__i__want_here__andthen_someĀ 
I want to be able to assign to a variable the what__i__want_here portion of the first variable. In other words, everything after the FIRST double underscore. There may be double underscores in the rest of the string but I only want to take the text after the FIRST pair of underscores.
Ex.
If I have $var = "some_text_here__what__i__want_here__andthen_some", I would like to assign to a new variable only the second part like $var2 = "what__i__want_here__andthen_some"
I'm not very good at matching so I'm not quite sure how to do it so it just takes everything after the first double underscore.
my $text = 'some_text_here__what__i__want_here';
# .*? # Match a minimal number of characters - see "man perlre"
# /s # Make . match also newline - see "man perlre"
my ($var) = $text =~ /^.*?__(.*)$/s;
# $var is not defined when there is no __ in the string
print "var=${var}\n" if defined($var);
You might consider this an example of where split's third parameter is useful. The third parameter to split constrains how many elements to return. Here is an example:
my #examples = (
'some_text_here__what__i_want_here',
'__keep_this__part',
'nothing_found_here',
'nothing_after__',
);
foreach my $string (#examples) {
my $want = (split /__/, $string, 2)[1];
print "$string => ", (defined $want ? $want : ''), "\n";
}
The output will look like this:
some_text_here__what__i_want_here => what__i_want_here
__keep_this__part => keep_this__part
nothing_found_here =>
nothing_after__ =>
This line is a little dense:
my $want = (split /__/, $string, 2)[1];
Let's break that down:
my ($prefix, $want) = split /__/, $string, 2;
The 2 parameter tells split that no matter how many times the pattern /__/ could match, we only want to split one time, the first time it's found. So as another example:
my (#parts) = split /#/, "foo#bar#baz#buzz", 3;
The #parts array will receive these elements: 'foo', 'bar', 'baz#buzz', because we told it to stop splitting after the second split, so that we get a total maximum of three elements in our result.
Back to your case, we set 2 as the maximum number of elements. We then go one step further by eliminating the need for my ($throwaway, $want) = .... We can tell Perl we only care about the second element in the list of things returned by split, by providing an index.
my $want = ('a', 'b', 'c', 'd')[2]; # c, the element at offset 2 in the list.
my $want = (split /__/, $string, 2)[1]; # The element at offset 1 in the list
# of two elements returned by split.
You use brackets to capature then reorder the string, the first set of brackets () is $1 in the next part of the substitution, etc ...
my $string = "some_text_here__what__i__want_here";
(my $newstring = $string) =~ s/(some_text_here)(__)(what__i__want_here)/$3$2$1/;
print $newstring;
OUTPUT
what__i__want_here__some_text_here

How to print specific key in an array (Perl) [duplicate]

This question already has answers here:
Simple hash search by value
(5 answers)
Closed 5 years ago.
I recently started learning Perl, so I'm not too familiar with the functions and syntax.
If I have a Perl array and some variables,
#!/usr/bin/perl
use strict;
use warnings;
my #numbers = (a =>1, b=> 2, c => 3, d =>4, e => 5);
my $x;
my $range = 5;
$x = int(rand($range));
print "$x";
to generate a random number between 1-5, how can I get the program to print the actual key (a, b, c, etc.) instead of just the number (1, 2, 3, 4, 5)?
It seems that you want to do a reverse lookup, key-by-value, opposite to what we get from a hash. Since a hash is a list you can reverse it and use the resulting hash to look up by number.
A couple of corrections: you need a hash variable (not an array), and you need to add 1 to your rand integer generator so to have the desired 1..5 range
use warnings;
use strict;
use feature 'say';
my %numbers = (a => 1, b => 2, c => 3, d => 4, e => 5);
my %lookup_by_number = reverse %numbers; # values need be unique
my $range = 5;
my $x = int(rand $range) + 1;
say $lookup_by_number{$x};
Without reversing the hash you'd need to iterate the hash %numbers over values, testing each against $x so to find its key.
If there are same values for various keys in your original hash then you have to do it by hand since reverse-ing would attempt to create a hash with duplicate keys, in which case only the last one assigned remains. So you'd lose some values. One way
my #at_num = grep { $x == $numbers{$_} } keys %numbers;
as in the post that this was marked as duplicate of.
But then you should build a data structure for reverse lookup so to not search through the list every time information is needed. This can be a hash where keys are the list of unique numbers while their values are then array references (arrayrefs) with corresponding keys from the original hash
use warnings;
use strict;
my %num = (a => 1, b => 2, c => 1, d => 3, e => 2); # with duplicate values
my %lookup_by_num;
foreach my $key (keys %num) {
push #{ $lookup_by_num{$num{$key}} }, $key;
}
say "$_ => [ #{$lookup_by_num{$_}} ]" for keys %lookup_by_num;
This prints
1 => [ c a ]
3 => [ d ]
2 => [ e b ]
A nice way to display complex data structures is via Data::Dumper, or Data::Dump (or others).
The expression #{ $lookup_by_num{ $num{$key} } } extracts the value of %lookup_by_num for the key $num{$key}and dereferences it #{ ... }, so that it can then push the $key to it. The critical part of this is that the first time it encounters $num{$key} it autovivifies the arrayref and its corresponding key. See this post with its references for details.
There's many ways to do it. For example, declare "numbers" as a hash rather than an array. Note that the keys come first in each key-value pair, and here you want to use your random int as the key:
my %numbers = ( 0 => 'a', 1 => 'b', 2 => 'c', 3 => 'd', 4 => 'e' );
Then you can look up the "key" as you call it using:
my $key = $numbers{$x};
Note that rand( $x ); returns a number greater than or equal to zero and less than $x. So if you want integers in the range 1-5, you must add 1 in your code: at the moment you'll get 0-4, not 1-5.
Firstly, arrays don't have keys (well, they kind of do, but they're integers and not the values you want). So I think you want a hash, not an array.
my %numbers = (a =>1, b=> 2, c => 3, d =>4, e => 5);
And if you want to get the letter, given the integer then you need the reverse of this hash:
my %rev_numbers = %numbers;
Note that reversing a hash like this only works if the values in your original hash are unique (because reversing a hash makes the values into keys and hash keys are always unique).
Then, you can just look up an integer in your %rev_hash to get its associated letter.
my $integer = 3;
say $rev_numbers{$integer}; # prints 'c'

Remove row and create new array

I am reading a CSV file
A B C D
A 1, 2, 3, 4
B 5, 1, 7, 8
C 9, 4, 1, 2
D 2, 7, 8, 1
The idea is to compute matrix correlation
How can I remove a row and create a new array?
I tried this
my #new_row = split(/\s+/, $header_line);
This is my first Perl program
my #row = /#desired_row/ && $_;
current output
A 1,2,3,4
What I am trying now
my #newarray = ( );
#newarray = grep ($_ > 2, #row);
print "#newarray\n";
Result I am trying to get
A 3, 4
Your question is confusing. My interpretation is that you are trying to read a correlation matrix into a Perl data structure. The fact the matrix has unit diagonal seems to support that interpretation. On the other hand, the matrix is not symmetric, so that's even more confusing.
I am assuming you are confused about how to remove the variable labels, and just getting the numbers into a matrix. In Perl, you can represent a matrix either as an array of references to anonymous arrays (dense matrix) or as a hash whose keys are row or column indexes and values are references to anonymous arrays (sparse matrix). The choice of whether to store column vectors or row vectors can affect how well you can work with the data, but, once again, there is not enough information in your question to deduce what would be most appropriate in your situation.
The code below shows the most basic way of reading that matrix into a Perl data structure. In your code, you would open the data file and assign a filehandle, e.g. $fh, and read from that rather than the __DATA__ section of your script.
#!/usr/bin/env perl
use strict;
use warnings;
my #vars = split ' ', <DATA>;
my #correl;
while (my $line = <DATA>) {
push #correl, [ $line =~ /([0-9]+)/g ];
}
print join("\t", #vars), "\n";
print join("\t", #$_), "\n" for #correl;
__DATA__
A B C D
A 1, 2, 3, 4
B 5, 1, 7, 8
C 9, 4, 1, 2
D 2, 7, 8, 1
Output:
A B C D
1 2 3 4
5 1 7 8
9 4 1 2
2 7 8 1

How to subtract values in 2 different arrays in perl?

Hi I have two arrays containing 4 columns and I want to subtract the value in column1 of array2 from value in column1 of array1 and value of column2 of array2 from column2 of array1 so on..example:
my #array1=(4.3,0.2,7,2.2,0.2,2.4)
my #array2=(2.2,0.6,5,2.1,1.3,3.2)
so the required output is
2.1 -0.4 2 # [4.3-2.2] [0.2-0.6] [7-5]
0.1 -1.1 -0.8 # [2.2-2.1] [0.2-1.3] [2.4-3.2]
For this the code I used is
my #diff = map {$array1[$_] - $array2[$_]} (0..2);
print OUT join(' ', #diff), "\n";
and the output I am getting now is
2.1 -0.4 2
2.2 -1.1 3.8
Again the first row is used from array one and not the second row, how can I overcome this problem?
I need output in rows of 3 columns like the way i have shown above so just i had filled in my array in row of 3 values.
This will produce the requested output. However, I suspect (based on your comments), that we could produce a better solution if you simply showed your raw input.
use strict;
use warnings;
my #a1 = (4.3,0.2,7,2.2,0.2,2.4);
my #a2 = (2.2,0.6,5,2.1,1.3,3.2);
my #out = map { $a1[$_] - $a2[$_] } 0 .. $#a1;
print "#out[0..2]\n";
print "#out[3..$#a1]\n";
First of all, your code doesn't even compile. Perl arrays aren't space separated - you need a qw() to turn those into arrays. Not sure how you got your results.
Perl doesn't have 2D arrays. 2.2 is NOT a column1 of row 1 of #array1 - it's element with index 3 of #array1. As far as Perl is concerned, your newline is just another whitespace separator, NOT something that magically turns a 1-d array into a table as you seem to think.
To get the result you want (process those 6 elements as 2 3-element arrays), you can either store them in an array of arrayrefs (Perl's implementation of C multidimentional arrays):
my #array1=( [ 4.3, 0.2, 7 ],
[ 2.2, 0.2, 2.4] );
for(my $idx=0; $idx1 < 2; $idx1++) {
for(my $idx2=0; $idx2 < 3; $idx2++) {
print $array1[$idx1]->[$idx2] - $array2[$idx1]->[$idx2];
print " ";
}
print "\n";
}
or, you can simply fake it by using offsets, the same way pointer arithmetic works in C's multidimentional arrays:
my #array1=( 4.3, 0.2, 7, # index 0..2
2.2, 0.2, 2.4); # index 3..5
for(my $idx=0; $idx1 < 2; $idx1++) {
for(my $idx2=0; $idx2 < 3; $idx2++) {
print $array1[$idx1 * 3 + $idx2] - $array2[$idx1 * 3 + $idx2];
print " ";
}
print "\n";
}

How do I calculate the difference between each element in two arrays?

I have a text file with numbers which I have grouped as follows, seperated by blank line:
42.034 41.630 40.158 26.823 26.366 25.289 23.949
34.712 35.133 35.185 35.577 28.463 28.412 30.831
33.490 33.839 32.059 32.072 33.425 33.349 34.709
12.596 13.332 12.810 13.329 13.329 13.569 11.418
Note: the groups are always of equal length and can be arranged in more than one line long, if the group is large, say 500 numbers long.
I was thinking of putting the groups in arrays and iterate along the length of the file.
My first question is: how should I subtract the first element of array 2 from array 1, array 3 from array 2, similarly for the second element and so on till the end of the group?
i.e.:
34.712-42.034,35.133-41.630,35.185-40.158 ...till the end of each group
33.490-34.712,33.839-35.133 ..................
and then save the differences of the first element in one group (second question: how ?) till the end
i.e.:
34.712-42.034 ; 33.490-34.712 ; and so on in one group
35.133-41.630 ; 33.839-35.133 ; ........
I am a beginner so any suggestions would be helpful.
Assuming you have your file opened, the following is a quick sketch
use List::MoreUtils qw<pairwise>;
...
my #list1 = split ' ', <$file_handle>;
my #list2 = split ' ', <$file_handle>;
my #diff = pairwise { $a - $b } #list1, #list2;
pairwise is the simplest way.
Otherwise there is the old standby:
# construct a list using each index in #list1 ( 0..$#list1 )
# consisting of the difference at each slot.
my #diff = map { $list1[$_] - $list2[$_] } 0..$#list1;
Here's the rest of the infrastructure to make Axeman's code work:
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw<pairwise>;
my (#prev_line, #this_line, #diff);
while (<>) {
next if /^\s+$/; # skip leading blank lines, if any
#prev_line = split;
last;
}
# get the rest of the lines, calculating and printing the difference,
# then saving this line's values in the previous line's values for the next
# set of differences
while (<>) {
next if /^\s+$/; # skip embedded blank lines
#this_line = split;
#diff = pairwise { $a - $b } #this_line, #prev_line;
print join(" ; ", #diff), "\n\n";
#prev_line = #this_line;
}
So given the input:
1 1 1
3 2 1
2 2 2
You'll get:
2 ; 1 ; 0
-1 ; 0 ; 1