loop is not working after satisfying the condition - perl

I have two files myresult and annotation. details of these files are as follows.
myresult:
288..639 1.13075739182609-6.20035408429888i
300..651 1.90372125344918-6.09008858828515i
312..663 1.6908117147722-5.67058877579329i
324..675 0.644484787809351-5.54571698740166i
336..687 1.21850904281332-5.47700589647424i
annotation:
272..1042
1649..2629
For loop is running only once after satisfying the If condition. It is not entering again in loop even after the condition is satisfied. for example, in file myresult 2nd line satisfies the condition i.e numbers 300..651 lies in the range of 1st line of file annotation, therefore it prints all values from 300 to 651. But when it need to go for next round(3rd line) i.e. 312..663 it is not entering this loop since this range also lies in 1st line of file annotation.
so the output should be numbers from 300 to 663, but it is printing from 300 to 651 only.
Code:
#!/usr/bin/perl
use Math::Complex;
open( $inp0, "<myresult" ) or die "not found";
open( $inp2, "<annotation" ) or die "not found";
my #arr2 = <$inp0>;
my #arr4 = <$inp2>;
my #result;
foreach my $line1 (#arr2) {
my ( $col1, $col2 ) = split( /\s/, $line1 );
if ( $col2 > 1.60 ) {
my ( $from1, $to1 ) = split( /\.\./, $col1 );
foreach my $line2 (#arr4) {
my ( $from2, $to2 ) = split( /\.\./, $line2 );
for ( my $i = $from1; $i <= $to1; $i++ ) {
for ( my $j = $from2; $j <= $to2; $j++ ) {
$res = grep( /$i/, #result );
if ( $i == $j && $res == 0 ) {
print "$i \n";
push( #result, $i );
}
}
}
}
}
}

Second columns of the file "myresult" contains complex numbers like "1.13075739182609-6.20035408429888i" for example.
Two complex numbers can not be compared. (Complex number is a vector of complex plane with real axis and imaginary axis. Two vectors could not be compared like integer.)
'($col2>1.60)' will be false for all of your data in the file "myresult". This is why for loop is not executed.

As Fumu said two complex numbers can not be compared.
If you need help with complex numbers in Perl then check out Math::Complex module.

Related

Hash incorrectly tracking counts, runtime long

I am working on a program in Perl and my output is wrong and taking forever to process. The code is meant to take in a large DNA sequence file, read through it in 15 letter increments (kmers), stepping forward 1 position at a time. I'm supposed to enter the kmer sequences into a hash, with their value being the number of incidences of that kmer- meaning each key should be unique and when a duplicate is found, it should increase the count for that particular kmer. I know from my Prof. expected output file, that I have too many lines, so it is allowing duplicates and not counting correctly. It's also running 5+ minutes, so I have to Ctrl+C to escape. When I go look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurences of each k-mer. Each
# unique k-mer and its' count will be written to file
# kmers.txt
#Create an empty hash
my %kMersHash = ();
#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
die $!;
}
#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");
my $kMer = 15; #Set the size of the sliding window
my $stepSize = 1; #Set the step size
for (
#The sliding window's start position is 0
my $windowStart = 0;
#Prevent going past end of the file
$windowStart <= ( length($$sequenceRef) - $kMer );
#Advance the window by the step size
$windowStart += $stepSize
)
{
#Get the substring from $windowStart for length $kMer
my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
#Call the subroutine to iterate through the kMers
processKMers($kMerSeq);
}
sub processKMers {
my ($kMerSeq) = #_;
#Initialize $kCount with at least 1 occurrence
my $kCount = 1;
#If the key already exists, the count is
#increased and changed in the hash
if ( not exists $kMersHash{$kMerSeq} ) {
#The hash key=>value is loaded: kMer=>count
$kMersHash{$kMerSeq} = $kCount;
}
else {
#Increment the count
$kCount ++;
#The hash is updated
$kMersHash{$kMerSeq} = $kCount;
}
#Print out the hash to filehandle KMERS
for (keys %kMersHash) {
print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
}
sub loadSequence {
#Get my sequence file name from the parameter array
my ($sequenceFile) = #_;
#Initialize my sequence to the empty string
my $sequence = "";
#Open the sequence file
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
#Loop through the file line-by-line
while (<FASTA>) {
#Assign the line, which is in the default
#variable to a named variable for readability.
my $line = $_;
#Chomp to get rid of end-of-line characters
chomp($line);
#Check to see if this is a FASTA header line
if ( $line !~ /^>/ ) {
#If it's not a header line append it
#to my sequence
$sequence .= $line;
}
}
#Return a reference to the sequence
return \$sequence;
}
Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed that. I've also altered the identifiers to be match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence so I've changed it to return the string itself
use strict;
use warnings 'all';
use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE => 15;
use constant STEP_SIZE => 1;
my $sequence = load_sequence( FASTA_FILE );
my %kmers;
for (my $offset = 0;
$offset + KMER_SIZE <= length $sequence;
$offset += STEP_SIZE ) {
my $kmer_seq = substr $sequence, $start, KMER_SIZE;
++$kmers{$kmer_seq};
}
open my $out_fh, '>', 'kmers.txt' or die $!;
for ( keys %kmers ) {
printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}
sub load_sequence {
my ( $sequence_file ) = #_;
my $sequence = "";
open my $fh, '<', $sequence_file or die $!;
while ( <$fh> ) {
next if /^>/;
chomp;
$sequence .= $_;
}
return $sequence;
}
Here's a neater way to increment a hash element without using ++ on the hash directly
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
$n = $kMersHash{$kMerSeq};
}
else {
$n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;
Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else statement, $kCount will always be 2
You are printing every time you call processKMers, which is what is slowing you down. Printing frequently slows down your process significantly, you should wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
my ($kMerSeq) = #_;
if ( not exists $kMersHash{$kMerSeq} ) {
$kMersHash{$kMerSeq} = 1;
}
else {
$kMersHash{$kMerSeq}++;
}
}
Then you want to move your print logic to immediately after your for-loop.

Perl program to look for k-mer with specific sequence

I am trying to enhance a perl program I have previously written so that it recognizes top 1000 length 23 k-mers that ends with GG and print out the k-mers that only appears once in the sequence. However, no matter where I add the reg exp, I am unable to get the expected result.
The code I have:
#!/usr/bin/perl
use strict;
use warnings;
my $k = 23;
my $input = 'Fasta.fasta';
my $output = 'Fasta2.fasta';
my $match_count = 0;
#Open File
unless ( open( FASTA, "<", $input ) ) {
die "Unable to open fasta file", $!;
}
#Unwraps the FASTA format file
$/ = ">";
#Separate header and sequence
#Remove spaces
unless ( open( OUTPUT, ">", $output ) ) {
die "Unable to open file", $!;
}
<FASTA>; # discard 'first' 'empty' record
my %seen;
while ( my $line = <FASTA> ) {
chomp $line;
my ( $header, #seq ) = split( /\n/, $line );
my $sequence = join '', #seq;
for ( length($sequence) >= $k ) {
$sequence =~ m/([ACTG]{21}[G]{2})/g;
for my $i ( 0 .. length($sequence) - $k ) {
my $kmer = substr( $sequence, $i, $k );
##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
$match_count = $match_count + 1;
print OUTPUT ">crispr_$match_count", "\n", "$kmer", "\n" unless $seen{$kmer}++;
}
}
}
The input fasta file looks like this:
> >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.54; species=Dmel;
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT
GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA
TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA
ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG
CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG
GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA
TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG
GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT
and so on...
The expected outcome (print out the 23k-mers with GG ending that only appear once in the sequence) I am hoping to get:
>crispr_1
GGGTGGAGCTCCCGAAATGCAGG
>crispr_2
TTAATAAATATTGACACAGCGGG
>crispr_3
ATCGTGGGGCGTTTTGTGAAAGG
>crispr_4
AGTTTTTCACATAATCAGACAGG
>crispr_5
GTGTTGGATGAGTGTCCTCTGGG
>crispr_6
ATAGGTTGGTTGTTTTAAAAGGG
>crispr_7
AAATTTTTGTTGCCACTGAATGG
>crispr_8
AAGTTTCGAACTACGATGGTTGG
>crispr_9
CATGCTTTGTGGAAATAAGTCGG
>crispr_10
CACAGTGGGTGTTTGCACCTCGG
.... and so on
The current code I did create a fasta file with following:
>crispr_1
CGACAATGCACGACAGAGGAAGC
>crispr_2
GACAATGCACGACAGAGGAAGCA
>crispr_3
ACAATGCACGACAGAGGAAGCAG
>crispr_4
CAATGCACGACAGAGGAAGCAGA
>crispr_5
AATGCACGACAGAGGAAGCAGAA
>crispr_6
ATGCACGACAGAGGAAGCAGAAC
>crispr_7
TGCACGACAGAGGAAGCAGAACA
>crispr_8
GCACGACAGAGGAAGCAGAACAG
>crispr_9
CACGACAGAGGAAGCAGAACAGA
>crispr_10
ACGACAGAGGAAGCAGAACAGAT
.... and so on
while if I remove the
for (length($sequence) >=$k){
$sequence =~m/([ACTG]{21}[G]{2})/g;
and add the ##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
I am getting fasta file (with results which is not numbered correctly and unable to distinguish between duplicated and unique sequences):
>crispr_1
CATTTTCTCTCCCATATTATAGG
>crispr_2
ATTTTCTCTCCCATATTATAGGG
>crispr_3
TATTGTGCTCTTTGATTTTTTGG
>crispr_4
GATTTTTTGGCAACCCAAAATGG
>crispr_5
TTTTTGGCAACCCAAAATGGTGG
>crispr_6
TTGGCAACCCAAAATGGTGGCGG
>crispr_7
ACGACAGAGAGAGAGAGCAGCGG
>crispr_8
AAATCGACAATGCACGACAGAGG
>crispr_91
TATTGTGATCTTCGATTTTTTGG
>crispr_93
TTTTTGGCAACCCAAAATGGAGG
.... and so on
I have attempted to move the regex around my code, but none of them generated the expected result. I do not know what I did wrong over here. I have not add the exit the program when count reaches 1000 into the code yet.
Thanks in advance!
I'm not sure I understand your question completely, but could the following be what you need.
<FASTA>; # discard 'first' 'empty' record
my %data;
while (my $line = <FASTA>){
chomp $line;
my($header, #seq) = split(/\n/, $line);
my $sequence = join '', #seq;
for my $i (0 .. length($sequence) - $k) {
my $kmer = substr($sequence, $i, $k);
$data{$kmer}++ if $kmer =~ /GG$/;
}
}
my $i = 0;
for my $kmer (sort {$data{$b} <=> $data{$a}} keys %data) {
printf "crispr_%d\n%s appears %d times\n", ++$i, $kmer, $data{$kmer};
last if $i == 1000;
}
Some output on a file I have is:
crispr_1
ggttttccggcacccgggcctgg appears 4 times
crispr_2
ccgagctgggcgagaagtagggg appears 4 times
crispr_3
gccgagctgggcgagaagtaggg appears 4 times
crispr_4
gcacccgggcctgggtggcaggg appears 4 times
crispr_5
agcagcgggatcgggttttccgg appears 4 times
crispr_6
gctgggcgagaagtaggggaggg appears 4 times
crispr_7
cccttctgcttcagtgtgaaagg appears 4 times
crispr_8
gtggcagggaagaatgtgccggg appears 4 times
crispr_9
gatcgggttttccggcacccggg appears 4 times
crispr_10
tgagggaaagtgctgctgctggg appears 4 times
crispr_11
agctgggcgagaagtaggggagg appears 4 times
. . . .
ggcacccgggcctgggtggcagg appears 4 times
crispr_50
gaatctctttactgcctggctgg appears 4 times
crispr_51
accacaacattgacagttggtgg appears 2 times
crispr_52
caacattgacagttggtggaggg appears 2 times
crispr_53
catgctcatcgtatctgtgttgg appears 2 times
crispr_54
gattaatgaagtggttattttgg appears 2 times
crispr_55
gaaaccacaacattgacagttgg appears 2 times
crispr_56
aacattgacagttggtggagggg appears 2 times
crispr_57
gacttgatcgattaatgaagtgg appears 2 times
crispr_58
acaacattgacagttggtggagg appears 2 times
crispr_59
gaaccatatattgttatcactgg appears 2 times
crispr_60
ccacagcgcccacttcaaggtgg appears 1 times
crispr_61
ctgctcctgggtgtgagcagagg appears 1 times
crispr_62
ccatatattatctgtggtttcgg appears 1 times
. . . .
Update
To get the results you mentioned in your comment (below), replace the output code with:
my $i = 1;
while (my ($kmer, $count) = each %data) {
next unless $count == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
To answer my own comment to get first 1000.
<FASTA>; # discard 'first' 'empty' record
my %seen;
my #kmers;
while (my $line = <FASTA>){
chomp $line;
my($header, #seq) = split(/\n/, $line);
my $sequence = join '', #seq;
for my $i (0 .. length($sequence) - $k) {
my $kmer = substr($sequence, $i, $k);
if ($kmer =~ /GG$/) {
push #kmers, $kmer unless $seen{$kmer}++;
}
}
}
my $i = 1;
for my $kmer (#kmers) {
next unless $seen{$kmer} == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
Answer To check for uniqueness of final 12 chars ending in GG, the code below achieves that.
if ($kmer =~ /(.{10}GG)$/) {
my $substr = $1;
push #kmers, $kmer unless $seen{$substr}++;
}
my $i = 1;
for my $kmer (#kmers) {
my $substr = substr $kmer, -12;
next unless $seen{$substr} == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
Actually, this code line
$sequence =~m/([ACTG]{21}[G]{2})/g;
this line is just for the regex match, if you try to print this $sequence, it will surely print out the original result.
please add the code segement like this
if($sequence =~/([ACTG]{21}[G]{2}$)/g)
{
}#please remember to match the end with $.
BTW,It looks like the multiple for loop to parse this data is not very reasonable, the parse speed is without the best-efficiency.

Nested Loop running very slowly

I'm trying to run a program to check each line of one file against each line of a second file to see if some of the elements match. Each file is around 200k lines.
What I've got so far looks like this;
#!/usr/bin/perl
#gffgenefind.pl
use strict;
use warnings;
die "SNP gff\n" unless #ARGV == 4;
open( my $snp, "<", $ARGV[0] ) or die "Can't open $:";
open( my $gff, "<", $ARGV[1] ) or die "can't open $:";
open( my $outg, ">", $ARGV[2] );
open( my $outs, ">", $ARGV[3] );
my $scaffold;
my $site;
my #snplines = <$snp>;
my #gfflines = <$gff>;
foreach my $snpline (#snplines) {
my #arr = split( /\t/, $snpline );
$scaffold = $arr[0];
$site = $arr[1];
foreach my $line (#gfflines) {
my #arr1 = split( /\t/, $line );
if ( $arr1[3] <= $site and $site <= $arr1[4] and $arr1[0] eq $scaffold ) {
print $outg "$line";
print $outs "$snpline";
}
}
}
File 1 (snp) looks like this scaffold_100 10689 A C A 0 0 0 0 0 0
File 2 (gff) looks like this scaffold_1 phytozomev10 gene 750912 765975 . - . ID=Carubv10008059m.g.v1.0;Name=Carubv10008059m.g
Essentially, I'm looking to see if the first values match and if the second value from snp is within the range defined on the second file (in this case 750912 to 765975)
I've seen that nested loops are to be avoided, and was wondering if there's an alternative way for me to look through this data.
Thanks!
Firstly - lose the foreach loop. That reads your whole file into memory, when you probably don't need to.
Try instead:
while ( my $snpline = <$snp> ) {
because it reads line by line.
Generally - mixing array indicies and named variables is also bad style.
The core problem will most likely be though because each line of your first file, you're cycling all of the second file.
Edit: Note - because 'scaffold' isn't unique, amended accordingly
This seems like a good place to use a hash. E.g.
my %sites;
while ( <$snp> ) {
my ( $scaffold, $site ) = split ( /\t/ );
$sites{$site}{$scaffold}++
}
while ( <$gff> ) {
my ( $name, $tmp1, $tmp2, $range_start, $range_end ) = split ( /\t/ );
if ( $sites{$name} ) {
foreach my $scaffold ( keys %{ $sites{$name} ) {
if ( $scaffold > $range_start
and $scaffold < $range_end ) {
#do stuff with it;
print;
}
}
}
}
Hopefully you get the gist, even if it isn't specifically what you're after?
Try this Python snippet:
#!/usr/bin/env python
import sys
import contextlib
if len(sys.argv) !=5:
raise Exception('SNP gff')
snp, gff, outg, outs = sys.argv[1:]
gff_dict = {}
with open(gff) as gff_handler:
for line in gff_handler:
fields=line.split()
try:
gff_dict[fields[0]].append(fields[1:])
except KeyError:
gff_dict[fields[0]] = [fields[1:]]
with contextlib.nested(open(snp),
open(outs, 'w'),
open(outg, 'w')) as (snp_handler,
outs_handler,
outg_handler):
for line_snp in snp_handler:
fields=line_snp.split()
key = fields[0]
if key in gff_dict:
for ele in gff_dict[key]:
if ele[2] <= fields[1] <= ele[3]:
outs_handler.write(line_snp)
outg_handler.write("{0}\t{1}\n".format(key,"\t".join(ele)))

How can I skip some block content while reading in Perl

I plan to skip the block content which include the start line of "MaterializeU4()" with the subroutin() read_block below. But failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I print the value of index($line, "MaterializeU4"), it output 25.
Then I updated the code as below
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25)
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push #lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
First, using a regex instead of index is probably better since you can tune it to the exact format of status string if you may decide to be stricter than just "substring exists"
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (Materialize4U) block!!!
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status.*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my #all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (#all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift #all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line =~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my #letters = ();
my #pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
$current_sum = $letter_weigth;
#letters = ( $letter );
next;
}
push #letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
print "Pages : " . join( " , ", #pages ) . "\n";
Problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push #{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my #sorted_letters = sort #{ $freq{$_} };
print "$_: #sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort values %count)[-1];
my $curr_set_count = 0;
my #curr_set = ();
foreach ('A' .. 'Z') {
push #curr_set, $_;
$curr_set_count += $count{$_};
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
#curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
count += frequency[letter];
if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
if (inrange): print rangeStart+'-'
print letter+' '
inrange=false
count=0
else:
if (not inrange) rangeStart=letter
inrange=true
use strict;
use warnings;
use List::Util qw(sum);
my #letters = ('A' .. 'Z');
my #raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push #{$data{ substr $_, 0, 1 }}, $_ for #raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar #$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my #pages = (['']);
for my $letter (#letters){
my #d = exists $data{$letter} ? #{$data{$letter}} : ();
if (#{$pages[-1]} - 1 < $MAX_SIZE or #d == 0){
push #{$pages[-1]}, #d;
$pages[-1][0] .= $letter;
}
else {
push #pages, [ $letter, #d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for #pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my #index;
my #group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( #group ){
if( #group == 1 ){
push #index, #group;
}else{
# push #index, [#group]; # copy #group
push #index, "$group[0]-$group[-1]";
}
#group = ();
}
push #index, $letter;
}elsif( sum( #frequency{#group,$letter} ) >= $threshold ){
if( #group == 1 ){
push #index, #group;
}else{
#push #index, [#group];
push #index, "$group[0]-$group[-1]"
}
#group = ($letter);
}else{
push #group, $letter;
}
}
#push #index, [#group] if #group;
push #index, "$group[0]-$group[-1]" if #group;
print join( ', ', #index ), "\n";