Multi-column file comparison and range extraction

Multi-column file comparison and range extraction - perl

Pardon me for asking a question without any coding effort. But it seems too much difficult to me.
I have a data file with tab separated three data columns (and some repetitive header lines) as:
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
From the second data column, for those value(s) which are more than 0.5, I want to extract the corresponding first column number (or range).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" are from header descriptor (after \Programs). (Note that, the actual value for "NP_416485.4" for example, should be, "NP_416485.4: 0, 2-4", but I increases all of them with +1 as I don't want to start with 0).
Thanks for your consideration. I would appreciate any help. Thank you

Here is one approach. In case you would have a DOS data file on a Unix machine, I used \r?\n to match a new line, so it will work for all cases:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my #chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( #chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my #lines = split /\n/, $data;
my #vals;
for my $line ( #lines ) {
my #fields = split " ", $line;
push ( #vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \#vals;
}
for my $id ( keys %ids ) {
my #tmp = sort { $a <=> $b } #{ $ids{$id} };
my ( $first, $last );
my #rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push #rr, "$first-$last";
$last = undef;
}
else {
push #rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", #rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4

You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d\b/) {
my #data = split;
if ($data[1] > 0.5) {
push #{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr

Related

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both both files, and only if they are equal, column B of FIlE 2 must be printed
So far I have the following code but all I get is an infinite loop with # from column B
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my #firstcol;
my #secondcol;
my #thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my #cols = split;
push #firstcol, $cols[0];
push #secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my #firstcol2;
my #secondcol2;
my #thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my #cols2 = split(/\t/, $_);
push #firstcol2, $cols2[0];
push #secondcol2, $cols2[4]."\n";
}
my $size = #firstcol;
my $size2 = #firstcol2;
for (my $i = 0; $i <= #firstcol ; $i++) {
for (my $j = 0; $j <= #firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}

my (#first, #second);
while(<first>){
chomp;
my $foo = split / /, $_;
push #first , $foo;
}
while(<second>){
chomp;
my $bar = split / / , $_;
push #second, $bar;
}
my %first = #first;
my %second = #second;
Build a hash of the first file as %first and second file as %second with first column as key and second column as value.
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile. hope that gives you an idea.

I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have nested loop that you don't need. Simplify your last while as such:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.

If your files are called file1.txt and file2.txt the next:
use Modern::Perl;
use Path::Class;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
equals in column1 only the lines A and 5
without the CPAN modules - produces the same result
use strict;
use warnings;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(#ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

if value lies between two values then add another value to corresponding line

This is a description of my problem: I have two text files (here $variants and $annotation). I want to check if the value from column 2 in $variants lies between the values from column 2 and 3 in $annotation. If this is true then the value from column 1 in $annotation should be added to a new column in $variants.
This is how my sample input files look like
$annotationrepresents a tab-delimited text file
These values can be overlapping and cannot be perfectly sorted, since I'm working with a circular genome
C0 C1 C2
gene1 0 100
gene2 500 1000
gene3 980 1200
gene4 1500 5
$variants represents a tab-delimited text file
C0 C1
... 5
... 10
... 100
... 540
... 990
The output should look like this ($variants with two other columns added)
C0 C1 C2 C3
... 5 gene1 gene4
... 10 gene1
... 100 gene1
... 540 gene2
... 990 gene2 gene3
This is how my script looks like for the moment
my %hash1=();
while(<$annotation>){
my #column = split(/\t/); #split on tabs
my $keyfield = $column[1] && $column[2]; # I need to remember values from two columns here. How do I do that?
}
while(<$variants>){
my #column=split(/\t/); # split on tabs
my $keyfield = $column[1];
if ($hash1{$keyfield} >= # so the value in column[1] should be between the values from column[1] & [2] in $annotation
push # if true then add values from column[0] in $annotation to new column in $variants
}
So my biggest problems are how to remember two values in a file using hashes and how to put a value from one file to a column in another file. Could someone help me with this?

If the input files are not large and the positions are not too high, you can use arrays to represent all positions:
#!/usr/bin/perl
use warnings;
use strict;
sub skip_header {
my $FH = shift;
<$FH>;
}
open my $ANN, '<', 'annotation' or die $!;
my $max = 0;
while (<$ANN>) {
$_ > $max and $max = $_ for (split)[1, 2];
}
seek $ANN, 0, 0; # Rewind the file back.
my $circular;
my #genes;
while (<$ANN>) {
my ($gene, $from, $to) = split;
if ($from <= $to) {
$genes[$_] .= "$gene " for $from .. $to;
} else {
$circular = 1;
$genes[$_] .= "$gene " for 0 .. $to, $from .. $max + 1;
}
}
chop #genes;
open my $VAR, '<', 'variants' or die $!;
skip_header($VAR);
while (<$VAR>) {
next if /^\s*#/;
chomp;
my ($str, $pos) = split;
$pos = $#genes if $circular and $pos > $#genes;
print "$_ ", $genes[$pos] // q(), "\n";
}

No hashing needed at all. This example expects the annotations to be sorted and not overlapping, it also works only if all the values from variants should be printed.
#!/usr/bin/perl
use warnings;
use strict;
open my $VAR, '<', 'variants' or die $!;
<$VAR>; # skip header
my ($str, $pos) = split ' ', <$VAR>;
open my $ANN, '<', 'annotation' or die $!;
<$ANN>; # skip header
while (<$ANN>) {
my ($gene, $from, $to) = split;
while ($from <= $pos and $pos <= $to) {
print "$str $pos $gene\n";
($str, $pos) = split ' ', <$VAR> or last;
}
}

parse a tab delimited data using perl

I have a tab delimited data. I want to process that data using perl. I am a newbie to perl and could not figure out how to solve .
This is sample table: My original file is almost a GB
gi|306963568|gb|GL429799.1|_1316857_1453052 13 1
gi|306963568|gb|GL429799.1|_1316857_1453052 14 1
gi|306963568|gb|GL429799.1|_1316857_1453052 15 1
gi|306963568|gb|GL429799.1|_1316857_1453052 16 1
gi|306963568|gb|GL429799.1|_1316857_1453052 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 1
gi|306963568|gb|GL429799.1|_1316857_1453052 361 1
gi|306963568|gb|GL429799.1|_1316857_1453052 362 1
gi|306963568|gb|GL429799.1|_1316857_1453052 363 1
gi|306963568|gb|GL429799.1|_1316857_1453052 364 1
gi|306963568|gb|GL429799.1|_1316857_1453052 365 1
gi|306963568|gb|GL429799.1|_1316857_1453052 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38641 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38642 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38643 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38644 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38645 1
I would like to get the output as
Name, start value, end value, average
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1
it will be great if someone could share their wisdom.

The general pattern is
use strict;
use warnings;
open my $fh, '<', 'myfile' or die $!;
while (<$fh>) {
chomp;
my #fields = split /\t/;
...
}
Within the loop the fields can be accessed as $fields[0] through $fields[2].
Update
I have understood your question better, and I think this solution will work for you. Note that it assumes the input data is sorted, as you have shown in your question.
It accumulates the start and end values, the total and the count in hash %data, and keeps a list of all the names encountered in #names so that the data can be displayed in the order it was read.
The program expects the input file name as a parameter on the command line.
You need to consider the formatting of the average because it is a floating point value. As it stands it will display the value to sixteen significant figures, and you may want to curtail that using sprintf.
use strict;
use warnings;
my ($filename) = #ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
my #names;
my %data;
my $current_name = '';
my $last_index;
while (<$fh>) {
chomp;
my ($name, $index, $value) = split /\t/;
if ( $current_name ne $name or $index > $last_index + 1 ) {
push #names, $name unless $data{$name};
push #{ $data{$name} }, {
start => $index,
count => 0,
total => 0,
};
$current_name = $name;
}
my $entry = $data{$name}[-1];
$entry->{end} = $index;
$entry->{count} += 1;
$entry->{total} += $value;
$last_index = $index;
}
for my $name (#names) {
for my $entry (#{ $data{$name} }) {
my ($start, $end, $total, $count) = #{$entry}{qw/ start end total count /};
print join("\t", $name, $start, $end, $total / $count), "\n";
}
}
output
gi|306963568|gb|GL429799.1|_1316857_1453052 13 17 1
gi|306963568|gb|GL429799.1|_1316857_1453052 360 366 1
gi|306963580|gb|GL429787.1|_4276355_4500645 38640 38645 1

This will produce the same output for the sample in your question:
#!/usr/bin/env perl -n
#
my ($name, $i, $value) = split(/\t/);
sub print_stats {
print join("\t", $prev_name, $start, $prev_i, $sum / ($prev_i - $start + 1)), "\n";
}
if ($prev_name eq $name && $i == $prev_i + 1) {
$sum += $value;
$prev_i = $i;
}
else {
if ($prev_name) {
&print_stats();
}
$start = $i;
$prev_name = $name;
$sum = $value;
$prev_i = $i;
}
END {
&print_stats();
}
Use it as:
./parser.pl < sample.txt
UPDATE: answers to the questions in comments:
To print output to a file, run like this: ./parser.pl < sample.txt > output.txt
$prev_name and $prev_i are NOT initialized, so they are undef at first (= NULL)

You could do something like this....
open (FILE, 'data.txt');
while (<FILE>) {
chomp;
($name, $start_value, $end_value, $average) = split("\t");
print "Name: $name\n";
print "Start Value: $start_value\n";
print "End Value: $End_Value\n";
print "Average: %average
print "---------\n";
}
close (FILE);
exit;
Those look like GenBank files...so I'm unsure where you are getting the start, end values, average.

Here's an example using Text::CSV:
use Text::CSV; # This will implicitly use Text::CSV_XS if it's installed
my $parser = Text::CSV->new( { sep_char => '|' } );
open my $fh, '<', 'myfile' or die $!;
while (my $row = $parser->getline($fh)) {
# $row references an array of field values from the line just read
}
Also, as a minor side detail, your sample data is delimited by pipe characters, not tabs, although that may just be to avoid copy/paste errors for those answering your question. If the actual data is tab-delimited, set sep_char to "\t" instead of '|'.

How can I normalize my results in perl with foreach control structure?

I have this output:
10dvex2_miRNA_ce.out.data|6361
10dvex2_miRNA_ce.out.data|6361
10dvex2_misc_RNA_ce.out.data|0
10dvex2_rRNA_ce.out.data|239
with this script in Perl:
#!/usr/bin/perl
use warnings;
use strict;
open(MYINPUTFILE, $ARGV[0]); # open for input
my #lines = <MYINPUTFILE>; # read file into list
my $count = 0;
print "Frag"."\t"."ncRNA"."\t"."Amount"."\n";
foreach my $lines (#lines){
my $pattern = $lines;
$pattern =~ s/(.*)dvex\d_(.*)_(.*).(out.data)\|(.*)/$1 $2 $3 $5/g;
$count += $5;
print $1."\t".$2.$3."\t".$5."\n";
}
close(MYINPUTFILE);
exit;
I extract this information:
Frag ncRNA Amount
10 miRNAce 6361
10 misc_RNAce 0
10 rRNAce 239
but in the Amount column I want to report those numbers divided by the total result (6600). In this case I want this output:
Frag ncRNA Amount
10 miRNAce 0.964
10 misc_RNAce 0
10 rRNAce 0.036
My problem is extract the TOTAL result in the loop...to normalize this data. Some ideas?

Perhaps the following will be helpful:
use strict;
use warnings;
my ( %hash, $total, %seen, #array );
while (<>) {
next if $seen{$_}++;
/(\d+).+?_([^.]+).+\|(\d+)$/;
$hash{$1}{$2} = $3;
$total += $3;
}
print "Frag\tncRNA\tAmount\n";
while ( my ( $key1, $val1 ) = each %hash ) {
while ( my ( $key2, $val2 ) = each %$val1 ) {
my $frac = $val2 / $total == 0 ? 0 : sprintf( '%.3f', $val2 / $total );
push #array, "$key1\t$key2\t$frac\n";
}
}
print map { $_->[0] }
sort { $b->[1] <=> $a->[1] }
map { [ $_, (split)[2] ] }
#array;
Output from your data set:
Frag ncRNA Amount
10 miRNA_ce 0.964
10 rRNA_ce 0.036
10 misc_RNA_ce 0
Identical lines are skipped, and then the required elements are captured from each line. A running total is kept for the subsequent calculation. Your desired output showed sorting from high to low, which is why each record is pushed onto #array. However, if sorting isn't necessary, you can just print that line and omit the Schwartzian transform on #array.
Hope this helps!

To do this you will need two passes over the data.
#! /usr/bin/env perl
use warnings;
use strict;
print join("\t",qw'Frag ncRNA Amount'),"\n";
my #data;
my $total = 0;
# parse the lines
while( <> ){
my #elem = /(.+?)(?>dvex)\d_(.+)_([^._]+)[.]out[.]data[|](d+)/;
next unless #elem;
# running total
$total += $elem[-1];
# combine $2 and $3
splice #elem, 1, 2, $2.$3; # $elem[1].$elem[2];
push #data, \#elem;
}
# print them
for( #data ){
my #copy = #$_;
$copy[-1] = $copy[-1] / $total;
$copy[-1] = sprintf('%.3f', $copy[-1]) if $copy[-1];
print join("\t",#copy),"\n";
}

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!

Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my #letters = ();
my #pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
$current_sum = $letter_weigth;
#letters = ( $letter );
next;
}
push #letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
print "Pages : " . join( " , ", #pages ) . "\n";
Problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.

Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push #{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my #sorted_letters = sort #{ $freq{$_} };
print "$_: #sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort values %count)[-1];
my $curr_set_count = 0;
my #curr_set = ();
foreach ('A' .. 'Z') {
push #curr_set, $_;
$curr_set_count += $count{$_};
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
#curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";

Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
count += frequency[letter];
if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
if (inrange): print rangeStart+'-'
print letter+' '
inrange=false
count=0
else:
if (not inrange) rangeStart=letter
inrange=true

use strict;
use warnings;
use List::Util qw(sum);
my #letters = ('A' .. 'Z');
my #raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push #{$data{ substr $_, 0, 1 }}, $_ for #raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar #$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my #pages = (['']);
for my $letter (#letters){
my #d = exists $data{$letter} ? #{$data{$letter}} : ();
if (#{$pages[-1]} - 1 < $MAX_SIZE or #d == 0){
push #{$pages[-1]}, #d;
$pages[-1][0] .= $letter;
}
else {
push #pages, [ $letter, #d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for #pages; # Convert letters to range.

This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my #index;
my #group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( #group ){
if( #group == 1 ){
push #index, #group;
}else{
# push #index, [#group]; # copy #group
push #index, "$group[0]-$group[-1]";
}
#group = ();
}
push #index, $letter;
}elsif( sum( #frequency{#group,$letter} ) >= $threshold ){
if( #group == 1 ){
push #index, #group;
}else{
#push #index, [#group];
push #index, "$group[0]-$group[-1]"
}
#group = ($letter);
}else{
push #group, $letter;
}
}
#push #index, [#group] if #group;
push #index, "$group[0]-$group[-1]" if #group;
print join( ', ', #index ), "\n";