Perl: Final Output Line in Foreach Loop Prints Twice

I'm trying to write a very simple script that takes two words from STDIN and outputs TRUE if they're anagrams and FALSE if not. My main issue is that if the two words aren't anagrams (this is the final "else" statement in the script), the output looks like:
Sorry, that's not an anagram pair
Sorry, that's not an anagram pair
where I just want:
Sorry, that's not an anagram pair
Other, more minor issues for the especially generous:
I know what the FALSE values are for Perl, but I can't get the script to print FALSE by, for example, setting a variable to '' or 0, etc. or saying "return ''". Ideally, I wouldn't have to put "print TRUE/FALSE" in the script at all.
I put in the last elsif statement in the script to see if it would affect the printing twice problem. It didn't, and now I'm curious why my m// expression doesn't work. It's supposed to find pairs that are identical except that one has more whitespace than the other.
Here's the script! I'm sorry it's so long - again, the problem is at the very end with the final "else" statement. Many thanks!!!
#To Run: Type start.pl on the command line.
#The script should prompt you to enter a word or phrase.
#Once you've done that, it'll prompt you for another one.
#Then you will be told if the two terms are anagrams or not.
#!/usr/bin/perl -w
use strict;
#I have to use this to make STDIN work. IDK why.
$|=1;
#variables
my $aWord;
my $bWord;
my $word;
my $sortWord;
my $sortWords;
my @words;
my %anaHash;
print "\nReady to play the anagram game? Excellent.\n\nType your first word or phrase, then hit Enter.\n\n";
$aWord = <STDIN>;
chomp $aWord;
print "\n\nThanks! Now type your second word or phrase and hit Enter.\n\n";
$bWord = <STDIN>;
chomp $bWord;
#This foreach loop performs the following tasks:
#1. Pushes the two words from STDIN into an array (unsure if this is really necessary)
#2. lowercases everything and removes all characters except for letters & spaces
#3. splits both words into characters, sorts them alphabetically, then joins the sorted letters into a single "word"
#4.pushes the array into a hash
@words = ($bWord, $aWord);
foreach $word (@words) {
    $word =~ tr/A-Z/a-z/;
    $word =~ s/[^a-z ]//ig;
    $sortWord = join '', sort(split(//, $word));
    push @{$anaHash{$sortWord}}, $word;
}
#This foreach loop tries to determine if the word pairs are anagrams or not.
foreach $sortWords (values %anaHash) {
    #"if you see the same word twice AND the input was two identical words:"
    if (1 < @$sortWords &&
        @$sortWords[0] eq @$sortWords[1]) {
        print "\n\nFALSE: Your phrases are identical!\n\n";
    }
    #"if you see the same word twice AND the input was two different words (i.e. a real anagram):"
    elsif (1 < @$sortWords &&
        @$sortWords[0] ne @$sortWords[1]) {
        print "\n\nTRUE: @$sortWords[0] and @$sortWords[1] are anagrams!\n\n";
    }
    #this is a failed attempt to identify pairs that are identical except one has extra spaces. Right now, this fails and falls into the "else" category below.
    elsif (@$sortWords[0] =~ m/ +@$sortWords[-1]/ ||
        @$sortWords[-1] =~ m/ +@$sortWords[0]/) {
        print "\n\nFALSE: @$sortWords[0] and @$sortWords[-1] are NOT anagrams. Spaces are characters, too!\n\n";
    }
    #This is supposed to identify anything that's not an anagram. But the output prints twice! It's maddening!!!!
    else {
        print "Sorry, that's not an anagram pair\n";
    }
}

It's useful to print out the contents of %anaHash after you've finished building it, but before you start examining it. Using the words "foo" and "bar", I get this result using Data::Dumper.
$VAR1 = {
'abr' => [
'bar'
],
'foo' => [
'foo'
]
};
So the hash has two keys. And as you loop round all of the keys in the hash, you'll get the message twice (once for each key).
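A minimal sketch of that debugging step, assuming the same clean-up rules as the question's script. The helper name build_ana_hash is made up for illustration and is not part of the original code:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: group words by their sorted-letter signature,
# mirroring what the question's first loop does.
sub build_ana_hash {
    my @words = @_;
    my %ana_hash;
    for my $word (@words) {
        my $clean = lc $word;
        $clean =~ s/[^a-z ]//g;
        my $key = join '', sort split //, $clean;
        push @{ $ana_hash{$key} }, $clean;
    }
    return \%ana_hash;
}

my $ana = build_ana_hash('foo', 'bar');
$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper($ana);
print scalar(keys %$ana), " key(s), so the loop body runs that many times\n";
```

Two words that are not anagrams land under two different keys, which is why the "else" branch fires once per key.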
I'm not really sure what the hash is for here. I don't think it's necessary. I think that you need to:
Read in the two words
Convert the words to a canonical format
Check if the two strings are the same
Simplified, your code would look like this:
print 'Give me a word: ';
chomp(my $word1 = <STDIN>);
print 'Give me another word: ';
chomp(my $word2 = <STDIN>);
# convert to lower case
$word1 = lc $word1;
$word2 = lc $word2;
# remove non-letters
$word1 =~ s/[^a-z]//g;
$word2 =~ s/[^a-z]//g;
# sort letters
$word1 = join '', sort split //, $word1;
$word2 = join '', sort split //, $word2;
if ($word1 eq $word2) {
    # you have an anagram
} else {
    # you don't
}
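The same three steps can also be wrapped in a pair of subroutines. This is only a sketch: the names canonical and is_anagram are invented here, and spaces are dropped along with other non-letters, as in the code above:

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a phrase to its sorted lowercase letters.
sub canonical {
    my ($text) = @_;
    my $letters = lc $text;
    $letters =~ s/[^a-z]//g;
    return join '', sort split //, $letters;
}

# Returns 1 when both phrases use exactly the same letters, '' otherwise.
sub is_anagram {
    my ($first, $second) = @_;
    return canonical($first) eq canonical($second) ? 1 : '';
}

print is_anagram('Listen', 'Silent') ? "TRUE\n" : "FALSE\n";
print is_anagram('foo', 'bar')       ? "TRUE\n" : "FALSE\n";
```

Perl's false value here is the empty string, which prints as nothing. That is why the sketch maps the result to the words TRUE and FALSE explicitly instead of printing the return value.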

My final answer, thanks so much to Dave and @zdim! I'm so happy I could die.
#!/usr/bin/perl -w
use strict;
use feature qw(say);
#I have to use this to make STDIN work. IDK why.
$|=1;
#declare variables below
my ($aWord, $bWord, $aSortWord, $bSortWord);
print "First word?\n";
$aWord = <STDIN>;
chomp $aWord;
print "Second word?\n";
$bWord = <STDIN>;
chomp $bWord;
#clean up input
$aWord =~ tr/A-Z/a-z/;
$bWord =~ tr/A-Z/a-z/;
$aWord =~ s/[^a-z ]//ig;
$bWord =~ s/[^a-z ]//ig;
#if the two inputs are identical, print FALSE and exit
if ($aWord eq $bWord) {
    say "\n\nFALSE: Your phrases are identical!\n";
    exit;
}
#split each word by character, sort characters alphabetically, join characters
$aSortWord = join '', sort(split(//, $aWord));
$bSortWord = join '', sort(split(//, $bWord));
#if the sorted characters match, you have an anagram
#if not, you don't
if ($aSortWord eq $bSortWord) {
    say "\n\nTRUE: Your two terms are anagrams!";
}
else {
    say "\n\nFALSE: Your two terms are not anagrams.";
}


Perl regular expressions and returned array of matched groups

I am new to Perl and I need to do some regexp work.
I read that when an array is used as an integer value, it gives the count of elements inside.
So I am doing, for example:
if (@result = $pattern =~ /(\d)\.(\d)/) {....}
and I thought it would return an empty array when the pattern match fails, but it still gives me an array with 2 elements, both uninitialized.
So how can I put pattern matching inside an if condition? Is it possible?
EDIT:
foreach (keys @ARGV) {
    if (my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
        if (defined $params{$result[0]}) {
            print STDERR "Cmd option error\n";
        }
        $params{$result[0]} = (defined $result[1] ? $result[1] : 1);
    }
    else {
        print STDERR "Cmd option error\n";
        exit ERROR_CMD;
    }
}
It is a regexp pattern for command-line options. The options are in long format, preceded by two hyphens and possibly followed by an argument, so
--CMD[=ARG]. I want an elegant solution, which is why I want to put it in the if condition without any prologue, etc.
EDIT2:
Oh sorry, I thought the groups in the @result array were always counted from 0, but only the groups from the branch where the pattern succeeded are populated. So if the command in my code is "input", I expected it in $result[0], but it is actually in $result[1]. I thought that because $result[0] was uninitialized, the pattern had failed and it still went into the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my @result = $pattern =~ /(\d)\.(\d)/;
print @result, ' elements';
Output:
24 elements
Context tells Perl how to treat @result. There certainly aren't 24 elements! Perl has printed the array's elements, which are your regex's captures. However, if we do the following:
print 0 + @result, ' elements';
we get:
2 elements
In this latter case, Perl evaluates @result in scalar context, so it adds the number of elements to 0. This can also be achieved with scalar @result.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
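A sketch of how this behaves with the pattern from the question. The helper name parse_option and the sample arguments are assumptions made for illustration. A failed match returns an empty list, so the list assignment counts zero elements and the condition is false. A successful match returns one element per capture group, with undef for groups in the branch that did not match:

```perl
use strict;
use warnings;

# Hypothetical helper: parse one long option of the form --NAME or --NAME=VALUE.
# Returns (name, value) on success and an empty list on failure.
sub parse_option {
    my ($arg) = @_;
    if (my @groups = $arg =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
        # Groups from the branch that did not match are undef, so pick
        # whichever name group is defined.
        my $name  = defined $groups[0] ? $groups[0] : $groups[1];
        my $value = defined $groups[2] ? $groups[2] : 1;
        return ($name, $value);
    }
    return;
}

for my $arg ('--help', '--input=data.txt', '--bogus') {
    if (my ($name, $value) = parse_option($arg)) {
        print "$arg: $name => $value\n";
    }
    else {
        print "$arg: not a valid option\n";
    }
}
```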
@results = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
@result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
    push @result, $1;
    push @result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.

Selectively joining elements of an array into fewer elements of a new array

I'm having some trouble manipulating an array of DNA sequence data that is in .fasta format. What I would specifically like to do is take a file that has a few thousand sequences and adjoin sequence data for each sequence in the file onto a single line in the file. [Fasta format is as such: A sequence ID starts with > after which everything on that line is a description. On the next line(s) the sequence corresponding to this ID is present. And this can continue indefinitely until the next line that begins with >, which is the id of the next sequence in the file] So, in my particular file most of my sequences are on multiple lines, so what I would like to do is essentially remove the newlines, but only the new lines between sequence data, not between sequence data and sequence ID lines (that start with >).
I'm doing this because I want to be able to attain sequence lengths of each sequence (through length, I believe is the easiest way), and then get an average sequence length of all the sequences in the whole file.
Here's my script so far, which doesn't seem to want to work:
#!/usr/bin/perl -w
##Subroutine
sub get_file_data1 {
    my($filename) = $_[0];
    my @filedata = ();
    unless( open(GET_FILE_DATA, $filename)) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}
##Opening files
my $fsafile = $ARGV[0];
my @filedata = &get_file_data1($fsafile);
##Procedure
my @count;
my @ids;
my $seq;
foreach $seq (@filedata){
    if ($seq =~ /^>/) {
        push @ids, $seq;
        push @count, "\n";
    }
    else {
        push @count, $seq;
    }
}
foreach my $line (@count) {
    if ($line =~ /^[AGTCagtc]/){
        $line =~ s/^([AGTCagtc]*)\n/$1/;
    }
}
##Make a text file to have a look
open FILE3, "> unbrokenseq.txt" or die "Cannot open output.txt: $!";
foreach (@count)
{
    print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
__END__
##Creating array of lengths
my $number;
my @numberarray;
foreach $number (@count) {
    push @numberarray, length($number);
}
print @numberarray;
__END__
use List::Util qw(sum);
sub mean {
    return sum(@numberarray)/@numberarray;
}
There's something wrong with the second foreach loop of the Procedure section and I can't seem to figure out what it is. Note that I haven't even tried the code after the __END__ lines yet, because I can't get the code in the Procedure step to do what I want. Any idea how I can get a nice array whose elements are unbroken sequences (I've chosen to just remove the sequence ID lines from the new array)? Then I can get an array of lengths, which I can then average.
Finally, I should unfortunately admit that I cannot get Bio::Perl working on my computer. I have tried for hours, but the errors are beyond my skill to fix. I'll be talking to someone who can hopefully help me with my Bio::Perl issues, but for now I'm just going to have to press on without it.
Thanks! Sorry for the length of this post, I appreciate the help.
Andrew
Your second loop does modify @count, because in a Perl foreach loop $line is an alias for each element rather than a copy. The sequences still come out broken because each line remains a separate element of @count, and your output loop prints "\n" after every element.
If all you want to do in the second loop is remove the newline character at the end of each line, use the chomp function; with it you wouldn't need your second loop. (It would also be faster than using the regex.)
# remove newlines for all array elements before doing anything else with it
chomp @filedata;
# .. or you can do it in your first loop
foreach $seq (@filedata){
    chomp $seq;
    if ($seq =~ /^>/) {
        ...
    }
An additional tip: Using get_file_data1 to read the entire file into an array might be slow if your files are large. In that case it would be better to iterate through the file as you go:
open my $FILE_DATA, $filename or die "Cannot open file \"$filename\"\n";
while (my $line = <$FILE_DATA>) {
    chomp $line;
    # process the record as in your Procedure section
    ...
}
close $FILE_DATA;
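Putting the pieces together, here is a rough sketch of computing per-sequence lengths and their mean without building an array of joined lines. The helper name sequence_lengths and the sample records are made up for illustration. In real use the lines would come from the file handle rather than a literal list:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: given FASTA lines, return a hash ref mapping each
# sequence ID line to the total length of its sequence lines.
sub sequence_lengths {
    my @lines = @_;
    my (%length_of, $current_id);
    for my $line (@lines) {
        chomp(my $text = $line);    # work on a copy, so the caller's data is untouched
        if ($text =~ /^>/) {
            $current_id = $text;
            $length_of{$current_id} = 0;
        }
        elsif (defined $current_id) {
            $length_of{$current_id} += length $text;
        }
    }
    return \%length_of;
}

# Made-up sample data: two records, each split over two sequence lines.
my @fasta = (">seq1 demo\n", "ACGT\n", "AC\n", ">seq2 demo\n", "GGGTTT\n", "AA\n");
my $lengths = sequence_lengths(@fasta);
my @values  = values %$lengths;
printf "mean length: %.1f\n", sum(@values) / @values;
```

Because the lengths are summed as the lines are read, the sequence lines never need to be joined just to be measured.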
Your regex captures specifically to $1 but you are printing $_ to the file. The result is most likely not what you intended.
Be careful with the '*' quantifier on your character class in s///. You usually want '+' instead, because '*' will also match lines containing none of your characters.
A substitution with the 'g' modifier can also count characters, since it returns the number of replacements it made (note that it removes them from the string as well). Like this:
$perl -e '$a="aggaacaat"; $b = $a =~ s/[a]//g; print $b; '
5
Pretty cool huh! Alternately, in your code, you could just call length() against $1.
I was taken aback to see the escaped '\n' in your regex. While it works fine, the common end-of-line anchor is '$'. This is more portable and doesn't mess up your character counts.

Calculate Character Frequency in Message using Perl

I am writing a Perl Script to find out the frequency of occurrence of characters in a message. Here is the logic I am following:
Read one char at a time from the message using getc() and store it into an array.
Run a for loop starting from index 0 to the length of this array.
This loop will read each char of the array and assign it to a temp variable.
Run another for loop nested in the above, which will run from the index of the character being tested till the length of the array.
Using a string comparison between this character and the current array indexed char, a counter is incremented if they are equal.
After completion of inner For Loop, I am printing the frequency of the char for debug purposes.
Question: I don't want the program to recompute the frequency of a character if it's already been calculated. For instance, if character "a" occurs 3 times, for the first run, it calculates the correct frequency. However, at the next occurrence of "a", since the loop runs from that index till the end, the frequency is (actual freq -1). Similarly, for the third occurrence, the frequency is (actual freq -2).
To solve this. I used another temp array to which I would push the char whose frequency is already evaluated.
And then at the next run of for loop, before entering the inner for loop, I compare the current char with the array of evaluated chars and set a flag. Based on that flag, the inner for loop runs.
This is not working for me. Still the same results.
Here's the code I have written to accomplish the above:
#!/usr/bin/perl
use strict;
use warnings;
my $input=$ARGV[0];
my ($c,$ch,$count,$flag,$s,@arr,@temp);
open(INPUT,"<$input");
while(defined($c = getc(INPUT)))
{
    push(@arr,$c);
}
close(INPUT);
my $length=$#arr+1;
for(my $i=0;$i<$length;$i++)
{
    $count=0;
    $flag=0;
    $ch=$arr[$i];
    foreach $s (@temp)
    {
        if($ch eq $s)
        {
            $flag = 1;
        }
    }
    if($flag == 0)
    {
        for(my $k=$i;$k<$length;$k++)
        {
            if($ch eq $arr[$k])
            {
                $count = $count+1;
            }
        }
        push(@temp,$ch);
        print "The character \"".$ch."\" appears ".$count." number of times in the message"."\n";
    }
}
You're making your life much harder than it needs to be. Use a hash:
my %freq;
while(defined($c = getc(INPUT)))
{
    $freq{$c}++;
}
print $_, " ", $freq{$_}, "\n" for sort keys %freq;
$freq{$c}++ increments the value stored in $freq{$c}. (If it was unset or zero, it becomes one.)
The print line is equivalent to:
foreach my $key (sort keys %freq) {
    print $key, " ", $freq{$key}, "\n";
}
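For experimenting without an input file, the same hash technique can be tried on a plain string. This is only a sketch: the helper name char_frequency and the sample word are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each character occurs in a string.
sub char_frequency {
    my ($message) = @_;
    my %freq;
    $freq{$_}++ for split //, $message;
    return \%freq;
}

my $freq = char_frequency('banana');
print "$_ $freq->{$_}\n" for sort keys %$freq;
```

Each character is visited exactly once, so there is no need to track which characters were already counted.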
If you want to count a single character across the whole file, use any of the methods suggested by the others. If you want a count of all the occurrences of each character in a file, then I propose:
#!/usr/bin/perl
use strict;
use warnings;
# read in the contents of the file
my $contents;
open(TMP, "<$ARGV[0]") or die ("Failed to open $ARGV[0]: $!");
{
    local($/) = undef;
    $contents = <TMP>;
}
close(TMP);
# split the contents around each character
my @bits = split(//, $contents);
# build the hash of each character with its respective count
my %counts = map {
    # use lc($_) to make the search case-insensitive
    my $foo = $_;
    # filter out newlines
    $_ ne "\n" ?
        ($foo => scalar grep {$_ eq $foo} @bits) :
        () } @bits;
# reverse sort (highest first) the hash values and print
foreach(reverse sort {$counts{$a} <=> $counts{$b}} keys %counts) {
    print "$_: $counts{$_}\n";
}
I don't understand the problem you are trying to solve, so I propose a simpler way to count the characters in a string:
$string = "fooooooobar";
$char = 'o';
$count = grep {$_ eq $char} split //, $string;
print $count, "\n";
This prints the number of $char occurrences in $string (7).
Hope this helps you write more compact code.
Faster solution:
@result = $subject =~ m/a/g; # $subject holds the contents of your file
print "Found : ", scalar @result, " a characters in file!\n";
Of course you can put a variable in place of 'a' or, even better, run this line for each character whose occurrences you want to count.
As a one-liner:
perl -F"" -anE '$h{$_}++ for @F; END { say "$_ : $h{$_}" for keys %h }' foo.txt

counting letters for each word in a text with Perl

I am trying to write a program with Perl which should return the frequency of all words in the file and the length of each word in the file (not the sum of all characters!) to produce a Zipf curve from a Spanish text (it's not a big deal if you don't know what a Zipf curve is). Now my problem is: I can do the first part and get the frequency of every word, but I don't know how to get the length of each word! :( I know the statement
$word_length = length($words) but after trying to change the code I really don't know where I should include it and how to count the length of each word.
This is what my code looks like until now:
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { #read from file or STDIN
    foreach my $word (split /\s+/gi, $line){
        $count_of{$word}++;
    }
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
    print "$word: $count_of{$word}\n";
}
__END__
I hope somebody has some suggestions!
You can use hash of hashes if you want to store the length of the word.
while (my $line = <>) {
    foreach my $word (split /\s+/, $line) {
        $count_of{$word}{word_count}++;
        $count_of{$word}{word_length} = length($word);
    }
}
print "All words and their counts and length: \n";
for my $word (sort keys %count_of) {
    print "$word: $count_of{$word}{word_count} ";
    print "Length of the word:$count_of{$word}{word_length}\n";
}
This will print the length right next to the count:
print "$word: $count_of{$word} ", length($word), "\n";
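A self-contained sketch of that approach, where the length is computed at print time instead of being stored. The helper name count_words and the sample lines are made up for illustration. It uses split ' ' rather than the question's /\s+/, which is a deliberate swap so that leading whitespace does not produce an empty first word:

```perl
use strict;
use warnings;

# Hypothetical helper: count each whitespace-separated word in the given lines.
sub count_words {
    my @lines = @_;
    my %count_of;
    for my $line (@lines) {
        # split ' ' skips leading whitespace, so no empty first field appears
        $count_of{$_}++ for split ' ', $line;
    }
    return \%count_of;
}

my $count_of = count_words("el gato y el perro\n", "  el sol\n");
for my $word (sort keys %$count_of) {
    printf "%s: count %d, length %d\n", $word, $count_of->{$word}, length $word;
}
```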
Just for your information, an alternative to
length($word)
might be:
$word =~ s/(\w)/$1/g
It is not as clear a solution as toolic's, but it can give you another view of this issue (TIMTOWTDI :))
A little explanation:
\w with the g modifier matches every word character in $word
$1 prevents s/// from overwriting the original $word
s/// returns the number of substitutions, i.e. the number of characters matched by \w in $word

Split on comma, but only when not in parenthesis

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my @array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?
use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my @params = ();
while ($string) {
    if ($string =~ /^([^(]*?),/) {
        push @params, $1;
        $string =~ s/^\Q$1\E\s*,?\s*//;
    } else {
        my ($ext, $pre);
        ($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
        push @params, "$pre$ext";
        $string =~ s/^\s*,\s*//;
    }
}
This one supports:
nested parentheses;
empty fields;
strings of any length.
Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
There is a bug in this code, coming from the depths of Regexp::Common. Be warned that this will (unfortunately) fail to match empty fields, i.e. when there is nothing between two consecutive commas (,,).
Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here by Google as I did, here's what I finally got. It's a very short answer using only built-in Perl regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my @array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.
Limit the number of elements it can be split into:
split(',', $string, 4)
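A short sketch of the limit approach, using the string from the question. Note the assumption it relies on: the number of fields is known in advance and the parenthesized field comes last.

```perl
use strict;
use warnings;

my $string = 'ab,12,20100401,xyz(A,B)';

# With a limit of 4, split stops after producing four fields, so the
# remainder (including its comma) stays in the last one.
my @array = split /,/, $string, 4;
print "$_\n" for @array;
```

If a parenthesized field can appear earlier in the string, or the field count varies, one of the other answers is a better fit.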
Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(
    [^,]*\([^)]*\) # comma inside parens is part of the word
    |
    [^,]*) # split on comma outside parens
    (?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)
Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
    my ($string) = @_;
    my @fields;
    my @comma_separated = split(/,/, $string);
    my @to_be_joined;
    my $depth = 0;
    foreach my $field (@comma_separated) {
        my @brackets = $field =~ /(\(|\))/g;
        foreach (@brackets) {
            $depth++ if /\(/;
            $depth-- if /\)/;
        }
        if ($depth == 0) {
            push @fields, join(",", @to_be_joined, $field);
            @to_be_joined = ();
        } else {
            push @to_be_joined, $field;
        }
    }
    return @fields;
}