How can I normalize my results in perl with foreach control structure? - perl

I have this output:
10dvex2_miRNA_ce.out.data|6361
10dvex2_miRNA_ce.out.data|6361
10dvex2_misc_RNA_ce.out.data|0
10dvex2_rRNA_ce.out.data|239
with this script in Perl:
#!/usr/bin/perl
use warnings;
use strict;
open(MYINPUTFILE, $ARGV[0]); # open for input
my @lines = <MYINPUTFILE>; # read file into list
my $count = 0;
print "Frag"."\t"."ncRNA"."\t"."Amount"."\n";
foreach my $lines (@lines){
my $pattern = $lines;
$pattern =~ s/(.*)dvex\d_(.*)_(.*).(out.data)\|(.*)/$1 $2 $3 $5/g;
$count += $5;
print $1."\t".$2.$3."\t".$5."\n";
}
close(MYINPUTFILE);
exit;
I extract this information:
Frag ncRNA Amount
10 miRNAce 6361
10 misc_RNAce 0
10 rRNAce 239
but in the Amount column I want to report those numbers divided by the total result (6600). In this case I want this output:
Frag ncRNA Amount
10 miRNAce 0.964
10 misc_RNAce 0
10 rRNAce 0.036
My problem is getting the TOTAL inside the loop so that I can normalize the data. Any ideas?

Perhaps the following will be helpful:
use strict;
use warnings;
my ( %hash, $total, %seen, @array );
while (<>) {
next if $seen{$_}++;
/(\d+).+?_([^.]+).+\|(\d+)$/ or next; # skip lines that do not match
$hash{$1}{$2} = $3;
$total += $3;
}
print "Frag\tncRNA\tAmount\n";
while ( my ( $key1, $val1 ) = each %hash ) {
while ( my ( $key2, $val2 ) = each %$val1 ) {
my $frac = $val2 / $total == 0 ? 0 : sprintf( '%.3f', $val2 / $total );
push @array, "$key1\t$key2\t$frac\n";
}
}
print map { $_->[0] }
sort { $b->[1] <=> $a->[1] }
map { [ $_, (split)[2] ] }
@array;
Output from your data set:
Frag ncRNA Amount
10 miRNA_ce 0.964
10 rRNA_ce 0.036
10 misc_RNA_ce 0
Identical lines are skipped, and then the required elements are captured from each line. A running total is kept for the subsequent calculation. Your desired output showed sorting from high to low, which is why each record is pushed onto @array. However, if sorting isn't necessary, you can just print that line and omit the Schwartzian transform on @array.
Hope this helps!
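If the sort is not needed, the same two-step idea (collect and total first, divide afterwards) can be sketched as below. This is only a minimal sketch: the sample lines are inlined so it runs on its own, and the regex and the names @records and @report are my own assumptions, not part of the code above.

```perl
use strict;
use warnings;

# Sample input from the question, inlined so the sketch runs on its own.
my @lines = (
    '10dvex2_miRNA_ce.out.data|6361',
    '10dvex2_miRNA_ce.out.data|6361',
    '10dvex2_misc_RNA_ce.out.data|0',
    '10dvex2_rRNA_ce.out.data|239',
);

# First pass: skip duplicate lines, keep the parsed records, sum the counts.
my ( %seen, @records );
my $total = 0;
for my $line (@lines) {
    next if $seen{$line}++;
    my ( $frag, $name, $amount ) =
        $line =~ /^(\d+)dvex\d_(.+)\.out\.data\|(\d+)$/
        or next;    # ignore lines that do not match (assumed line format)
    push @records, [ $frag, $name, $amount ];
    $total += $amount;
}

# Second pass: the total is known now, so each amount can be normalized.
my @report;
for my $rec (@records) {
    my ( $frag, $name, $amount ) = @$rec;
    my $frac = ( $total && $amount ) ? sprintf( '%.3f', $amount / $total ) : 0;
    push @report, "$frag\t$name\t$frac";
}

print "Frag\tncRNA\tAmount\n";
print "$_\n" for @report;
```

The records come out in input order here; the guard on $total avoids a division by zero when every count is 0.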

To do this you will need two passes over the data.
#! /usr/bin/env perl
use warnings;
use strict;
print join("\t",qw'Frag ncRNA Amount'),"\n";
my @data;
my $total = 0;
# parse the lines
while( <> ){
my @elem = /(.+?)(?>dvex)\d_(.+)_([^._]+)[.]out[.]data[|](\d+)/;
next unless @elem;
# running total
$total += $elem[-1];
# combine $2 and $3
splice @elem, 1, 2, $2.$3; # $elem[1].$elem[2];
push @data, \@elem;
}
# print them
for( @data ){
my @copy = @$_;
$copy[-1] = $copy[-1] / $total;
$copy[-1] = sprintf('%.3f', $copy[-1]) if $copy[-1];
print join("\t",@copy),"\n";
}

Related

Conditionally print elements in a text file

I have a text file of the following format:
1 4730 1031782 init
4 0 6 events
2190 450 0 top
21413 5928 1 sshd
22355 1970 2009 find
I want to print rows of this file only if the second column of data meets the requirement >= 2000 - how can I do this?
Currently I am reading the file and printing it like so:
sub read_file{
my $data_failed = 1;
my $file = 'task_file';
if(open (my $file, "task_file" || die "$!\n")){
my @COLUMNS = qw( memory cpu program );
my %sort_strings = ( program => sub { $a cmp $b } );
my (%process_details, %sort);
while (<$file>) {
$data_failed = 0;
my ($process_id, $memory_size, $cpu_time, $program_name) = split;
$process_details{$process_id} = { memory => $memory_size,
cpu => $cpu_time,
program => $program_name };
undef $sort{memory}{$memory_size}{$process_id};
undef $sort{cpu}{$cpu_time}{$process_id};
undef $sort{program}{$program_name}{$process_id};
}
if($option_a == 1){
if (-z $file){print "No tasks found\n";}
for my $column ($COLUMNS[2]) {
my $cmp = $sort_strings{$column} || sub { $a <=> $b };
for my $value (sort $cmp keys %{ $sort{$column} }
) {
my @pids = keys %{ $sort{$column}{$value} };
say join ' ', $_, @{ $process_details{$_} }{@COLUMNS}
for @pids;
}
}
}
} else { print "No tasks found\n"}
}
The if($option_a == 1) bit is just reading values from another function that parses command line options.
my ($process_id, $memory_size, $cpu_time, $program_name) = split;
At this point, you can complete the loop, or you can continue to the next line. Just add the line:
next if $memory_size < 2000;
right after the split, and you'll eliminate all the records in memory that fail to meet your requirements.
Filtering a list is easily done with grep:
#!/usr/bin/perl
use strict;
use feature qw{ say };
use warnings;
my @COLUMNS = qw( memory cpu program );
my (%process_details, %sort);
while (<DATA>) {
my ($process_id, $memory_size, $cpu_time, $program_name) = split;
$process_details{$process_id} = { memory => $memory_size,
cpu => $cpu_time,
program => $program_name };
undef $sort{memory}{$memory_size}{$process_id};
undef $sort{cpu}{$cpu_time}{$process_id};
undef $sort{program}{$program_name}{$process_id};
}
for my $value (sort { $a cmp $b } keys %{ $sort{program} }) {
my @pids = grep $process_details{$_}{memory} >= 2000,
keys %{ $sort{program}{$value} };
say join ' ', $_, @{ $process_details{$_} }{@COLUMNS}
for @pids;
}
__DATA__
...
Something like this perhaps:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
print if (split)[1] >= 2000;
}
__DATA__
1 4730 1031782 init
4 0 6 events
2190 450 0 top
21413 5928 1 sshd
22355 1970 2009 find
With no arguments, split() splits $_ on whitespace (which is what we want). We can then use a list slice to look at the second element of the result and print the line if that value is at least 2000.
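To make the list slice explicit, here is a small sketch that collects the matching rows instead of printing them straight away. The rows are the sample data from the question; the names @rows and @kept are just illustrative.

```perl
use strict;
use warnings;

my @rows = (
    '1 4730 1031782 init',
    '4 0 6 events',
    '2190 450 0 top',
    '21413 5928 1 sshd',
    '22355 1970 2009 find',
);

my @kept;
for my $row (@rows) {
    # ( split ' ', $row )[1] is a list slice: split the row on whitespace
    # and keep only the second field.
    my $second = ( split ' ', $row )[1];
    push @kept, $row if $second >= 2000;
}

print "$_\n" for @kept;
```

Only the init and sshd rows have a second column of 2000 or more, so only those two are kept.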

Multi-column file comparison and range extraction

Pardon me for asking without showing any code of my own, but this seems too difficult for me.
I have a data file with three tab-separated data columns (and some repeated header lines), like this:
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
For every value in the second data column that is greater than 0.5, I want to extract the corresponding number from the first column (or a range, when the numbers are consecutive).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" come from the header line (the part after \Programs). (Note that the actual values for "NP_416485.4", for example, would be "0, 2-4", but I increase all of them by 1 because I don't want the numbering to start at 0.)
Thanks for your consideration. I would appreciate any help. Thank you
Here is one approach. In case you have a DOS-format data file on a Unix machine, I used \r?\n to match a newline, so it works with either kind of line ending:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my @chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( @chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my @lines = split /\n/, $data;
my @vals;
for my $line ( @lines ) {
my @fields = split " ", $line;
push ( @vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \@vals;
}
for my $id ( keys %ids ) {
my @tmp = sort { $a <=> $b } @{ $ids{$id} };
my ( $first, $last );
my @rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push @rr, "$first-$last";
$last = undef;
}
else {
push @rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", @rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4
You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d+\b/) {
my @data = split;
if ($data[1] > 0.5) {
push @{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
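The remaining step, walking %results and collapsing consecutive positions into ranges, could be sketched roughly as follows. The helper name to_ranges is made up for this illustration, and the %results hash here is a hand-written stand-in holding the values the parser above should produce for the sample data.

```perl
use strict;
use warnings;

# Collapse a list of integers into strings such as "1" or "3-5".
sub to_ranges {
    my @nums = sort { $a <=> $b } @_;
    my @out;
    while (@nums) {
        my $first = shift @nums;
        my $last  = $first;
        # Extend the run while the next number is exactly one higher.
        $last = shift @nums while @nums && $nums[0] == $last + 1;
        push @out, $first == $last ? $first : "$first-$last";
    }
    return @out;
}

# Hypothetical stand-in for the %results hash built by the parser.
my %results = (
    'NP_416485.4' => [ 1, 3, 4, 5 ],
    'YP_986467.7' => [ 2, 3, 4 ],
);

for my $id ( sort keys %results ) {
    print "$id: ", join( ', ', to_ranges( @{ $results{$id} } ) ), "\n";
}
```

Sorting inside to_ranges means the caller does not have to keep the positions in order.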

Perl hash does not print value if it begins with 2 or 22 under certain conditions

This is really frustrating me. The script I'm writing is indexing coordinates in a hash and then using those index numbers to pull out values from an array.
The weird thing is that if the value begins with 2 or 22 it will not print. Any other number works. I'll show you two variations and output of the script.
First variation. This is what I want the script to do. Print chromosome, position, value.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Scalar::Util qw(looks_like_number);
open IN, "/home/big/scratch/affy_map.txt" or die "Cannot open reference\n";
my %ref;
my $head = <IN>;
my $index = 0;
while(<IN>){
chomp $_;
my @row = split /\t/, $_;
my $value = join "\t", $row[1],$row[2];
if($row[1] == 2 && $row[2] <= 50000 && $row[2] <= 51113178) { $ref{$index}=$value; print $index."\t".$value."\n";}
if($row[1] == 22 && $row[2] <= 16300001 && $row[2] <= 20500000) { $ref{$index}=$value; print $index."\t".$value."\n"; }
$index++;
}
close(IN);
my @files;
my $masterDirect = "/nfs/archive02/big/Norm/norm_gcc/";
find(\&file_names, $masterDirect);
sub file_names {
if( -f && $File::Find::name=~/\.nzd$/)
{
push @files, $File::Find::name;
}
}
my $count=0;
foreach(@files){
$count++;
if($count % 100 == 0 ){ print "\n","-" x 10, " $count ", "-" x 10,"\n";}
undef my @probes;
open IN, $_;
#file name handling
my @inDir = split "\/", $_;
my $id = pop(@inDir);
$id =~ s/\.gcc.nzd$//;
#header test
$head =<IN>;
if(looks_like_number($head)) { push @probes, $head; }
#open output
open OUT, ">/home/big/scratch/phase1_affy/".$id."_select_probeset.txt";
#load probe array
@probes = <IN>;
close(IN);
foreach my $key (sort keys %ref){
#intended function
print OUT $ref{$key}."\t".$probes[$key];
#testing
my @temp = split "\t", $ref{$key};
foreach(@temp){if($temp[0] == 2){print $key."\t".$ref{$key}."\t".$probes[$key];}}
}
close(OUT);
}
Here's the output for the test. The printing from the reference file is flawless. The first number is the $key or index number. The second is from $probes[$key]. Why is $ref{$key} missing?
146529 0.777314368326637
146529 0.777314368326637
146530 0.116241153901913
146530 0.116241153901913
146531 0.940593233609167
146531 0.940593233609167
Variation 2.
...
foreach my $key (sort keys %ref){
print OUT $ref{$key}."\t".$probes[$key];
my @temp = split "\t", $ref{$key};
foreach(@temp){if($temp[0] == 2){print $key."\t".$ref{$key}."\n";}}
}
And its output. See now it's printing correctly. $key and $ref{$key}
146542 2 31852
146542 2 31852
146543 2 37693
146543 2 37693
146544 2 40415
146544 2 40415
146545 2 40814
I thought it might be a DOS->UNIX file problem but I performed perl -pi -e 's/\R/\n/g' input_files.txt for all the input the script sees. It prints the same value twice because there are two elements in the @temp array. I'm really at a loss right now.
Here is a hint about a possible issue. In the first part of the script:
if($row[1] == 2 && $row[2] <= 50000 && $row[2] <= 51113178) { $ref{$index}=$value; print $index."\t".$value."\n";}
Note that both comparisons on $row[2] use "<=", which looks peculiar: the second test is redundant whenever the first one is true. The next line has the same problem. Please double-check these conditions first; otherwise you may have filtered the rows you wanted out in the first place.
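If the two numbers were meant as the lower and upper bound of a region (that is only my assumption, the question does not say so), the check could be written as a small helper. The name in_region and its argument order are made up for this sketch.

```perl
use strict;
use warnings;

# True when $pos lies within [$lo, $hi] on chromosome $want_chr.
# Assumption: the first number in the original condition is a lower bound.
sub in_region {
    my ( $chr, $pos, $want_chr, $lo, $hi ) = @_;
    return $chr == $want_chr && $pos >= $lo && $pos <= $hi;
}

for my $pos ( 31852, 60000 ) {
    my $verdict = in_region( 2, $pos, 2, 50000, 51113178 ) ? 'kept' : 'skipped';
    print "$pos: $verdict\n";
}
```

Under that assumption a position such as 31852 would no longer be selected, so it is worth confirming which rows are really wanted before changing the condition.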

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both files and print column B of FILE 2 only where the values are equal.
So far I have the following code, but all I get is an infinite loop printing # from column B:
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my @firstcol;
my @secondcol;
my @thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my @cols = split;
push @firstcol, $cols[0];
push @secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my @firstcol2;
my @secondcol2;
my @thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my @cols2 = split(/\t/, $_);
push @firstcol2, $cols2[0];
push @secondcol2, $cols2[4]."\n";
}
my $size = @firstcol;
my $size2 = @firstcol2;
for (my $i = 0; $i <= @firstcol ; $i++) {
for (my $j = 0; $j <= @firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}
my (@first, @second);
while(<first>){
chomp;
my @foo = split ' ', $_; # list context gives the fields, not a count
push @first, @foo[0, 1];
}
while(<second>){
chomp;
my @bar = split ' ', $_;
push @second, @bar[0, 1];
}
my %first = @first;
my %second = @second;
Build a hash of the first file as %first and of the second file as %second, with the first column as key and the second column as value.
for(keys %first)
{
print "$second{$_}\n" if exists $second{$_};
}
I couldn't check it as I am on mobile. hope that gives you an idea.
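For what it's worth, here is a self-contained sketch of the same hash idea that can be run without any files. It assumes that "equal" means the key appears anywhere in column A of file 1, not on the same line number; the sample rows are inlined without their header lines, and the names %in_first and @matches are my own.

```perl
use strict;
use warnings;

# Inline copies of the two sample files (header rows omitted).
my @file1 = ( '1 #', '2 #', '3 !', '4 %', '5 %' );
my @file2 = ( '3 #', '4 !', '2 &', '1 %', '5 ^' );

# Index column A of file 1.
my %in_first;
for my $row (@file1) {
    my ($key) = split ' ', $row;
    $in_first{$key} = 1;
}

# Collect column B of file 2 for every key that also occurs in file 1.
my @matches;
for my $row (@file2) {
    my ( $key, $value ) = split ' ', $row;
    push @matches, $value if exists $in_first{$key};
}

print "$_\n" for @matches;
```

Looping over file 2 rather than over the hash keys keeps the output in file 2's order, and each file is read only once, so there is no nested loop.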
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have a nested loop that you don't need. Simplify your last loop as follows:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.
If your files are called file1.txt and file2.txt, the following:
use Modern::Perl;
use Path::Class;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
Column 1 is equal only on the lines A and 5.
Without the CPAN modules, this produces the same result:
use strict;
use warnings;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(@ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely necessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my @letters = ();
my @pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
push @pages, $title;
}
$current_sum = $letter_weigth;
@letters = ( $letter );
next;
}
push @letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
push @pages, $title;
}
print "Pages : " . join( " , ", @pages ) . "\n";
The problem with it is that, for your data, it outputs:
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually a good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push @{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort { $a <=> $b } keys %freq) # counts are numbers, so sort numerically
{
my @sorted_letters = sort @{ $freq{$_} };
print "$_: @sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort { $a <=> $b } values %count)[-1];
my $curr_set_count = 0;
my @curr_set = ();
foreach ('A' .. 'Z') {
push @curr_set, $_;
$curr_set_count += $count{$_} || 0; # letters with no entries count as 0
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (@curr_set > 1) {
print "$curr_set[0] - $curr_set[-1]\n";
} else {
print "$_\n";
}
@curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (@curr_set > 1) {
print "$curr_set[0] - $curr_set[-1]\n";
} elsif (@curr_set) {
print "$curr_set[0]\n";
}
Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
    count += frequency[letter]
    if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
        if (inrange): print rangeStart+'-'
        print letter+' '
        inrange=false
        count=0
    else:
        if (not inrange): rangeStart=letter
        inrange=true
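Since the question asks for Perl, the pseudocode above might translate to something like the sketch below. The function name letter_ranges is made up, the frequency table is the one from the question, and the defined-or operator (//) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Group the letters A..Z into index ranges, given a hash of first-letter
# counts and the low/high thresholds from the pseudocode.
sub letter_ranges {
    my ( $freq, $low, $high ) = @_;
    my @letters = ( 'A' .. 'Z' );
    my @ranges;
    my ( $count, $start ) = ( 0, undef );
    for my $i ( 0 .. $#letters ) {
        my $letter = $letters[$i];
        $count += $freq->{$letter} // 0;
        # Sentinel: after 'Z', pretend the next letter would overflow,
        # which forces the last range to be closed.
        my $next = $i < $#letters ? ( $freq->{ $letters[ $i + 1 ] } // 0 ) : $high + 1;
        if ( $count >= $low or $count + $next > $high ) {
            push @ranges, defined $start ? "$start-$letter" : $letter;
            ( $count, $start ) = ( 0, undef );
        }
        else {
            $start //= $letter;    # remember where the open range began
        }
    }
    return @ranges;
}

my %freq = (
    A => 2, B => 2, C => 3, E => 1, G => 2, K => 1,
    M => 1, N => 1, P => 4, R => 2, S => 1,
);
print join( ' ', letter_ranges( \%freq, 3, 6 ) ), "\n";
```

With the fruit data and thresholds 3 and 6 this prints A-B C D-G H-N O-P Q-S T-Z, so every letter of the alphabet is covered by some range.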
use strict;
use warnings;
use List::Util qw(sum);
my @letters = ('A' .. 'Z');
my @raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push @{$data{ substr $_, 0, 1 }}, $_ for @raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar @$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my @pages = (['']);
for my $letter (@letters){
my @d = exists $data{$letter} ? @{$data{$letter}} : ();
if (@{$pages[-1]} - 1 < $MAX_SIZE or @d == 0){
push @{$pages[-1]}, @d;
$pages[-1][0] .= $letter;
}
else {
push @pages, [ $letter, @d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for @pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my @index;
my @group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( @group ){
if( @group == 1 ){
push @index, @group;
}else{
# push @index, [@group]; # copy @group
push @index, "$group[0]-$group[-1]";
}
@group = ();
}
push @index, $letter;
}elsif( sum( @frequency{@group,$letter} ) >= $threshold ){
if( @group == 1 ){
push @index, @group;
}else{
#push @index, [@group];
push @index, "$group[0]-$group[-1]"
}
@group = ($letter);
}else{
push @group, $letter;
}
}
#push @index, [@group] if @group;
push @index, "$group[0]-$group[-1]" if @group;
print join( ', ', @index ), "\n";