Specifically Iterating through a file in perl - perl

So, I have this script :
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Math::Vector::Real;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ OUT4 OUTABA12 /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
open my $out_fh, '>', $out or die qq{Unable to open "$out" for output: $!\n};
my #data;
push #data, V(split) while <$in_fh>;
my #aoa;
for my $i ( 0 .. $#data ) {
for my $j ( 0 .. $#data ) {
my $val1 = $data[$i];
my $val2 = $data[$j];
if ($val1 != $val2) {
my $math = sqrt(($val1->[0] - $val2->[0])**2 +
($val1->[1] - $val2->[1])**2 +
($val1->[2] - $val2->[2])**2);
if ($math < 2.2) {
print "#$val1 #$val2 $math\n";
push #aoa, [#$val1, #$val2, $math];
}
}
}
}
for my $k ( 0 .. $#aoa-1 ) {
my $aoadata1 = $aoa[$k];
my $aoadata2 = $aoa[$k+1];
my $vect1 = [ #{ $aoa[$k] }[0..2] ];
my $vect2 = [ #{ $aoa[$k+1] }[0..2] ];
my $vect3 = [ #{ $aoa[$k] }[3..5] ];
my $vect4 = [ #{ $aoa[$k+1] }[3..5] ];
my $math1 = [ #{ $aoa[$k] }[6] ];
my $math2 = [ #{ $aoa[$k+1] }[6] ];
my #matha = #$math1;
my #mathb = #$math2;
my #vecta = #$vect1;
my #vectb = #$vect2;
my #vectc = #$vect3;
my #vectd = #$vect4;
if ( #vecta != #vectb ) {
print "1\n";
}
}
Which runs on a test file like so:
18.474525 20.161419 20.33903
21.999333 20.220667 19.786734
18.333228 21.649157 21.125111
20.371077 19.675844 19.77649
17.04323 19.3106 20.148842
22.941106 19.105412 19.069893
and it calculates the distance between each point and every other point, and if it's below a threshold push it to an array for later. (For testing purposes, I also have it printing it. )
What I've been stuck on is the bottom half - I'm trying to eventually get to the point where the bottom half of the script script will iterate between rows like so:
If the first triple set of values on row 1 is not identical to the first set of triple values on row two, print 180, but only if this is the only instance of this line. If there is ever a point where row one's values were equal to row two's, do not print 180 whatsoever.
I cannot, for the life of me, get it to work. Any help is appreciated.

I think this is correct
I've used the facilities of the Math::Vector::Real module as I described in a comment
I leave the vectors as objects and avoid accessing their their contents directly
I calculate the distance between $vec1 and $vec2 as abs($vec1 - $vec2
I use the class's stringify ability to display it instead of extracting the individual values in my code
I have also changed the intermediate data format. The distance is no longer kept because it's not necessary, and the array #groups now contains an array for each group of vector pairs that have a common first vector. Each group is of the form
[ $vec1, $vec2, $vec2, $vec2, ... ]
and I use the first function from List::Util to find the group that each new vector pair belongs in. If an existing group is found with a matching first value then the second vector is just pushed onto the end of the group; otherwise a new group is created that looks like [ $vec1, $vec2 ]
Once the #groups array is built, it is processed again to generate the output
If there are only two values in the group then they are $vec1 and $vec2 of a unique point. $vec1 is printed with 180
If there are more than two elements then a line of output is generated for every pair of $vec2 values, each containing the value $vec1 together with the angle between the two $vec2 vectors in the pair
use strict;
use warnings;
use Math::Vector::Real qw/ V /;
use List::Util qw / first /;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ OUT4 OUTABA12 /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
my #data = map V(split), <$in_fh>;
my #groups;
for my $vec1 ( #data ) {
for my $vec2 ( #data ) {
next if abs($vec1 - $vec2) > 2.2 or $vec1 == $vec2;
my $group = first { $_->[0] == $vec1 } #groups;
if ( $group ) {
push #$group, $vec2;
}
else {
push #groups, [ $vec1, $vec2 ];
}
}
}
open my $out_fh, '>', $out or die qq{Unable to open "$out" for output: $!\n};
select $out_fh;
for my $group ( #groups ) {
my ($vec1, #vec2) = #$group;
if ( #vec2 == 1 ) {
print "$vec1 180\n";
next;
}
for my $i ( 0 .. $#vec2-1 ) {
for my $j ( $i+1 .. $#vec2 ) {
my ($vec2a, $vec2b) = #vec2[$i, $j];
my $angle = atan2( $vec2a, $vec2b ) * DEG_PER_RAD;
print "$vec1 $angle\n";
}
}
}
output
{18.474525, 20.161419, 20.33903} 4.96831567625921
{18.474525, 20.161419, 20.33903} 1.65052300046634
{18.474525, 20.161419, 20.33903} 4.80888617553369
{21.999333, 20.220667, 19.786734} 4.22499387836286
{18.333228, 21.649157, 21.125111} 180
{20.371077, 19.675844, 19.77649} 5.09280172684743
{17.04323, 19.3106, 20.148842} 180
{22.941106, 19.105412, 19.069893} 180

Related

Multi-column file comparison and range extraction

Pardon me for asking a question without any coding effort. But it seems too much difficult to me.
I have a data file with tab separated three data columns (and some repetitive header lines) as:
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
From the second data column, for those value(s) which are more than 0.5, I want to extract the corresponding first column number (or range).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" are from header descriptor (after \Programs). (Note that, the actual value for "NP_416485.4" for example, should be, "NP_416485.4: 0, 2-4", but I increases all of them with +1 as I don't want to start with 0).
Thanks for your consideration. I would appreciate any help. Thank you
Here is one approach. In case you would have a DOS data file on a Unix machine, I used \r?\n to match a new line, so it will work for all cases:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my #chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( #chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my #lines = split /\n/, $data;
my #vals;
for my $line ( #lines ) {
my #fields = split " ", $line;
push ( #vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \#vals;
}
for my $id ( keys %ids ) {
my #tmp = sort { $a <=> $b } #{ $ids{$id} };
my ( $first, $last );
my #rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push #rr, "$first-$last";
$last = undef;
}
else {
push #rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", #rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4
You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d\b/) {
my #data = split;
if ($data[1] > 0.5) {
push #{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr

To calculate the averages in csv file using perl as move up file collecting numbers

I have a CSV file with three columns in order called Mb_size, tax_id, and parent_id. There is a relationship between tax_id and parent_id, for example, in the csv file at the end where you have 22.2220658537 for the mb size, 5820 is the tax id and 5819 is the parent id. As move up the file 5819 the parent id will be seen in the tax id column. The parent id can be repeated but tax id is uniqie in its column.
Starting at the end which has values in Mb_size, I need to work up to the top calculating the average everytime the parent_id becomes the tax_id. Then move up by when this happens the parent Id that is next to that tax Id become new start point to move up.
Below is the sample input :
Mb_size,tax_id,parent_id
,1,1
,131567,1
,2759,131567
,5819,2759
,147429,2759
22.2220658537,5820,5819
184.801317,4557,147429
748.66869,4575,147429
555.55,1234,5819
Below is the sample output:
Mb_size,tax_id,parent_id
377.810518214,1,1
377.810518214,131567,1
377.810518214,2759,131567
288.886032927,5819,2759
466.7350035,147429,2759
22.2220658537,5820,5819
184.801317,4557,147429
748.66869,4575,147429
555.55,1234,5819,
The code so far
use strict;
use warnings;
no warnings 'numeric';
open taxa_fh, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open match_fh, ">$ARGV[0]_sized.csv" or die qq{Failed to open for output: $!\n};
my %data;
while ( my $line = <taxa_fh> ) {
chomp( $line );
my #fields = split( /,/, $line );
my $Mb_size = $fields[0];
my $tax_id = $fields[1];
my $parent_id = $fields[2];
$data{$parent_id}{sum} += $Mb_size;
$data{$parent_id}{count}++;
}
for my $parent_id ( sort keys %data ) {
my $avg = $data{$parent_id}{sum} / $data{$parent_id}{count};
print match_fh "$parent_id, $avg \n";
}
close taxa_fh;
close match_fh;
The code I have so far, is from a poster of help earlier. I edited the question to help make it better/clearer. I cant get it to continue the calculation up and include in the printing the original lines from below.
I tried a foreach(tax_id) but didn't work. Any suggestions to include to accomplish this . It does move up but doesn’t do calculation.
You need build a data-structure carefully from down to up first. I am using hashes for that.
Here for every parent_id as key I am building a hash in which I am saving averages,tax_id,sum and count associated with that.
As there could be multiple tax_id associated with single parent_id we need to store averages separately for them.
Now when It becomes a tree like structure then It becomes trivial to print it out according to our requirements.
As they are hashes, orders are not conserved. To maintain order you can use arrays instead of hashes.
One way to do it will be like below:
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', 'tax' or die "unable to open file:$!\n";
my %data;
my #lines;
chomp(my $header=<$fh>); #slurp header
while(<$fh>){
chomp;
my #fields=split(/,/);
if($fields[0]){
##actually field0 is avg so storing it as avg here
$data{$fields[2]}{$fields[1]}{avg}=$fields[0];
$data{$fields[2]}{sum}+=$fields[0];
$data{$fields[2]}{count}++;
}
else{
push(#lines,[split(/,/)]);
}
}
close($fh);
#lines=reverse #lines;
foreach my $lines(#lines){
if(exists $data{$lines->[1]}){
$data{$lines->[2]}{$lines->[1]}{avg}=($data{($lines->[1])}{sum})/($data{($lines->[1])}{count});
$data{$lines->[2]}{sum}+=$data{$lines->[2]}{$lines->[1]}{avg};
$data{$lines->[2]}{count}++;
}
else{
print "Sorry No Such Entry ",$lines->[2]," present\n";
}
}
print "$header\n";
foreach my $tax_id(keys %data){
foreach my $parent_id(keys $data{$tax_id} ){
if(ref ($data{$tax_id}{$parent_id}) eq 'HASH'){
print $data{$tax_id}{$parent_id}->{'avg'}.",".$tax_id.",".$parent_id."\n";
}
}
}
Here is another similar solution, based on your work:
use strict;
use warnings;
open taxa_fh, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open match_fh, ">$ARGV[0]_sized.csv" or die qq{Failed to open for output: $!\n};
my %node_data;
my %parent;
my #node_order;
my $header;
while ( my $line = <taxa_fh> ) {
chomp( $line );
if (1 == $.) {
$header = $line;
next; # Skip header
}
my #fields = split( /,/, $line );
my $Mb_size = $fields[0] || 0; # To avoid uninitialized warning
my $tax_id = $fields[1];
my $parent_id = $fields[2];
$parent{$tax_id} = $parent_id;
push #node_order, $tax_id;
$node_data{$tax_id} = $Mb_size;
}
# Add the node value for all parents in the tree
my %totals;
for my $tax_id ( sort keys %parent ) {
my $parent = $parent{$tax_id};
my $done = 0;
while( ! $done ) {
if ($node_data{$tax_id} > 0) {
$totals{$parent}->{sum} += $node_data{$tax_id};
$totals{$parent}->{count}++;
}
$done++ if ($parent{$parent} == $parent);
$parent = $parent{$parent};
}
}
print match_fh "$header\n";
for my $id ( #node_order ) {
my $avg;
if ( exists $totals{$id} ) {
# Parent Node
$avg = $totals{$id}->{sum} / $totals{$id}->{count};
} else {
# Leaf Node
$avg = $node_data{$id};
}
print match_fh "$avg, $id, " . $parent{$id} . "\n";
}
close taxa_fh;
close match_fh;
Output:
Mb_size,tax_id,parent_id
377.810518213425, 1, 1
377.810518213425, 131567, 1
377.810518213425, 2759, 131567
288.88603292685, 5819, 2759
466.7350035, 147429, 2759
22.2220658537, 5820, 5819
184.801317, 4557, 147429
748.66869, 4575, 147429
555.55, 1234, 5819

How can I use an array slice to access several elements of an array simultaneously?

I'm trying to modify this script:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Math::Vector::Real;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ OUT4 OUTABA12 /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
open my $out_fh, '>', $out or die qq{Unable to open "$out" for output: $!\n};
my #data;
push #data, V(split) while <$in_fh>;
my #aoa;
for my $i ( 0 .. $#data ) {
for my $j ( 0 .. $#data ) {
my $val1 = $data[$i];
my $val2 = $data[$j];
if ($val1 != $val2) {
my $math = sqrt(($val1->[0] - $val2->[0])**2 +
($val1->[1] - $val2->[1])**2 +
($val1->[2] - $val2->[2])**2);
if ($math < 2.2) {
push #aoa, "#$val1 #$val2";
}
}
}
}
for my $k ( 0 .. $#aoa ) {
for my $m ( 0 .. $#aoa ) {
my $aoadata1 = $aoa[$k];
my $aoadata2 = $aoa[$m];
my $vect1 = $aoadata1[0..2];
my $vect2 = $aoadata2[0..2];
print "$vect1 $vect2\n";
}
}
.
At the end of the script, I want to be able to do things with the variables $aoadata1 and $aoadata2 in fields 0-2. However, I cannot get them to stop throwing up errors regarding things not referenced right (I think). Can anyone tell me why this is happening/how to fix it?
Thanks.
If you want to use multiple subscripts in an array, you have to change the sigil:
#array[ 0 .. 2 ];
#{ $arra_ref }[ 0 .. 2 ];
It makes no sense to assign the result to a scalar, though. Use an anonymous array:
my $aoadata1 = $aoa[$k];
my $vect1 = [ #{ $aoadata1 }[ 0 .. 2 ] ];
or, without the temp var:
my $vect1 = [ #{ $aoa[$k] }[ 0 .. 2 ] ];
It might still not work, as I noticed you used
push #aoa, "#$val1 #$val2";
Did you mean
push #aoa, [ #$val1, #$val2 ];
or something similar?

Looping variables

I'm working with perl to make a script that will work with Dot products/assorted vector math. I've got a working script ( Still very much in progress/needs refinement ) that will do what I ask.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Math::Vector::Real;
use 5.010;
use Math::Trig;
my $source = "./IN";
my $out = "./OUT";
open(IN, '<', $source) or die "Couldn't open $source: $!\n";
open(OUT, '>', $out) or die "Couldn't open $out: $!\n";
my(#data);
while (<IN>) {
push #data, [ split ];
}
my $v;
my $u;
$v = V(0, 0, 0);
$u = V(0, 0, 0);
my $i = 0;
sub maths {
my $dot = $v * $u;
my $mag1 = (sqrt ((#$v[0])**2 + (#$v[1])**2 + (#$v[2])**2 ));
my $mag2 = (sqrt ((#$u[0])**2 + (#$u[1])**2 + (#$u[2])**2 ));
my $prefinal = acos( ( $dot ) / ( $mag1 * $mag2 ) );
my $degrees = ( ( 180 / 3.14159 ) * ( $prefinal ) );
return ($degrees);
}
my $ref1 = $data[$i][0];
my $ref2 = $data[$i][1];
my $ref3 = $data[$i][2];
my $ref4 = $data[$i+1][0];
my $ref5 = $data[$i+1][1];
my $ref6 = $data[$i+1][2];
$v->[0] = $ref1;
$v->[1] = $ref2;
$v->[2] = $ref3;
$u->[0] = $ref4;
$u->[1] = $ref5;
$u->[2] = $ref6;
my $result = maths;
print "$result\n";
A lot of stuff in the script is vestigial and for ease to follow (For me).
What I desire it to do, is to have the script rotate through each line of the input file and perform the calculations on it.
Something akin to having :
foreach $i (#data) {
my $ref1 = $data[$i][0];
my $ref2 = $data[$i][1];
my $ref3 = $data[$i][2];
my $ref4 = $data[$i+1][0];
my $ref5 = $data[$i+1][1];
my $ref6 = $data[$i+1][2];
$v->[0] = $ref1;
$v->[1] = $ref2;
$v->[2] = $ref3;
$u->[0] = $ref4;
$u->[1] = $ref5;
$u->[2] = $ref6;
my $result = maths;
print "$result\n";
}
Anything is appreciated.
As per my comment, most of the functionality of your program is provided by the Math::Vector::Real module that you're already using
It looks like you want the angle in degrees between successive pairs of 3D vectors in your file. This code creates vectors from each line in the file until there are two vectors in #pair, and then uses atan2 to calculate the angle between them. DEG_PER_RAD is precalculated as a constant from the value of atan2(1,1) which is π ÷ 4 by definition
#!/usr/bin/perl
use strict;
use warnings;
use Math::Vector::Real;
use constant DEG_PER_RAD => 45 / atan2(1, 1);
my ( $source, $out ) = qw/ IN OUT /;
open my $in_fh, '<', $source or die qq{Unable to open "$source" for input: $!\n};
open my $out_fh, '<', $out or die qq{Unable to open "$out" for output: $!\n};
select $out_fh;
my #pair;
while ( <$in_fh> ) {
push #pair, V(split);
if ( #pair == 2 ) {
my $degrees = atan2(#pair) * DEG_PER_RAD;
print "$degrees\n";
shift #pair;
}
}
Update
To do the calculation for every possible pair of vectors, it is simplest to read the whole of the file into an array of vectors and process them from there with two nested loops
This code demonstrates. Insert it after the select call in the original program. It labels each line of output with the two array indices that gave rise to the angle
my #data;
push #data, V(split) while <$in_fh>;
for my $i ( 0 .. $#data-1 ) {
for my $j ( $i+1 .. $#data ) {
my $degrees = atan2(#data[$i,$j]) * DEG_PER_RAD;
print "[$i,$j] $degrees\n";
}
}
You could use the following:
my $v = V( #{ $data[$i+0] } );
my $u = V( #{ $data[$i+1] } );
Cleaned up:
#!/usr/bin/perl
use strict;
use warnings;
use Math::Trig qw( acos );
use Math::Vector::Real qw( V );
sub maths {
my ($v, $u) = #_;
my $dot = $v * $u;
my $mag1 = sqrt( $v->[0]**2 + $v->[1]**2 + $v->[2]**2 );
my $mag2 = sqrt( $u->[0]**2 + $u->[1]**2 + $u->[2]**2 );
my $prefinal = acos( $dot / ( $mag1 * $mag2 ) );
my $degrees = ( 180 / 3.14159 ) * $prefinal;
return $degrees;
}
{
my #data = map { V( split ) } <>;
for my $i (0..$#data-1) {
for my $j ($i+1..$#data) {
my $maths = maths(#data[$i, $j]);
print("$i,$j: $maths\n");
}
}
}
Or with Borodin's alternate method:
#!/usr/bin/perl
use strict;
use warnings;
use Math::Trig qw( acos atan2 );
use Math::Vector::Real qw( V );
use constant DEG_PER_RAD => 45 / atan2(1, 1);
{
my #data = map { V( split ) } <>;
for my $i (0..$#data-1) {
for my $j ($i+1..$#data) {
my $maths = atan2(#data[$i, $j]) * DEG_PER_RAD;
print("$i,$j: $maths\n");
}
}
}
Usage:
perl script.pl ./IN >./OUT

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my #letters = ();
my #pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
$current_sum = $letter_weigth;
#letters = ( $letter );
next;
}
push #letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
print "Pages : " . join( " , ", #pages ) . "\n";
Problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push #{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my #sorted_letters = sort #{ $freq{$_} };
print "$_: #sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort values %count)[-1];
my $curr_set_count = 0;
my #curr_set = ();
foreach ('A' .. 'Z') {
push #curr_set, $_;
$curr_set_count += $count{$_};
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
#curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
count += frequency[letter];
if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
if (inrange): print rangeStart+'-'
print letter+' '
inrange=false
count=0
else:
if (not inrange) rangeStart=letter
inrange=true
use strict;
use warnings;
use List::Util qw(sum);
my #letters = ('A' .. 'Z');
my #raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push #{$data{ substr $_, 0, 1 }}, $_ for #raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar #$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my #pages = (['']);
for my $letter (#letters){
my #d = exists $data{$letter} ? #{$data{$letter}} : ();
if (#{$pages[-1]} - 1 < $MAX_SIZE or #d == 0){
push #{$pages[-1]}, #d;
$pages[-1][0] .= $letter;
}
else {
push #pages, [ $letter, #d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for #pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my #index;
my #group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( #group ){
if( #group == 1 ){
push #index, #group;
}else{
# push #index, [#group]; # copy #group
push #index, "$group[0]-$group[-1]";
}
#group = ();
}
push #index, $letter;
}elsif( sum( #frequency{#group,$letter} ) >= $threshold ){
if( #group == 1 ){
push #index, #group;
}else{
#push #index, [#group];
push #index, "$group[0]-$group[-1]"
}
#group = ($letter);
}else{
push #group, $letter;
}
}
#push #index, [#group] if #group;
push #index, "$group[0]-$group[-1]" if #group;
print join( ', ', #index ), "\n";