Calculate Character Frequency in Message using Perl - perl

I am writing a Perl Script to find out the frequency of occurrence of characters in a message. Here is the logic I am following:
Read one char at a time from the message using getc() and store it into an array.
Run a for loop starting from index 0 to the length of this array.
This loop will read each char of the array and assign it to a temp variable.
Run another for loop nested in the above, which will run from the index of the character being tested till the length of the array.
Using a string comparison between this character and the current array indexed char, a counter is incremented if they are equal.
After completion of inner For Loop, I am printing the frequency of the char for debug purposes.
Question: I don't want the program to recompute the frequency of a character if it's already been calculated. For instance, if character "a" occurs 3 times, the first run calculates the correct frequency. However, at the next occurrence of "a", since the loop runs from that index to the end, the frequency is (actual freq -1). Similarly, for the third occurrence, the frequency is (actual freq -2).
To solve this, I used another temp array to which I push each char whose frequency has already been evaluated.
And then at the next run of for loop, before entering the inner for loop, I compare the current char with the array of evaluated chars and set a flag. Based on that flag, the inner for loop runs.
This is not working for me. Still the same results.
Here's the code I have written to accomplish the above:
#!/usr/bin/perl
use strict;
use warnings;
my $input=$ARGV[0];
my ($c,$ch,$flag,$s,$count,@arr,@temp);
open(INPUT,"<$input");
while(defined($c = getc(INPUT)))
{
push(@arr,$c);
}
close(INPUT);
my $length=$#arr+1;
for(my $i=0;$i<$length;$i++)
{
$count=0;
$flag=0;
$ch=$arr[$i];
foreach $s (@temp)
{
if($ch eq $s)
{
$flag = 1;
}
}
if($flag == 0)
{
for(my $k=$i;$k<$length;$k++)
{
if($ch eq $arr[$k])
{
$count = $count+1;
}
}
push(@temp,$ch);
print "The character \"".$ch."\" appears ".$count." number of times in the message"."\n";
}
}

You're making your life much harder than it needs to be. Use a hash:
my %freq;
while(defined($c = getc(INPUT)))
{
$freq{$c}++;
}
print $_, " ", $freq{$_}, "\n" for sort keys %freq;
$freq{$c}++ increments the value stored in $freq{$c}. (If it was unset or zero, it becomes one.)
The print line is equivalent to:
foreach my $key (sort keys %freq) {
print $key, " ", $freq{$key}, "\n";
}
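Putting the pieces together, a minimal self-contained sketch of the hash approach might look like the following. The sub name char_frequency and the sample string are made up for illustration, and it counts the characters of a string instead of reading from the INPUT filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference mapping each character to its count.
sub char_frequency {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for split //, $text;    # a single pass, no nested loops
    return \%freq;
}

my $freq = char_frequency("banana");
print "$_ $freq->{$_}\n" for sort keys %$freq;
```

For "banana" this prints a 3, b 1 and n 2, one pair per line. Each character is counted exactly once, so there is nothing to skip on later occurrences.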

If you want to do a single character count for the whole file then use any of the suggested methods posted by the others. If you want a count of all the occurrences of each character in a file then I propose:
#!/usr/bin/perl
use strict;
use warnings;
# read in the contents of the file
my $contents;
open(TMP, "<$ARGV[0]") or die ("Failed to open $ARGV[0]: $!");
{
local($/) = undef;
$contents = <TMP>;
}
close(TMP);
# split the contents around each character
my @bits = split(//, $contents);
# build the hash of each character with its respective count
my %counts = map {
# use lc($_) to make the search case-insensitive
my $foo = $_;
# filter out newlines
$_ ne "\n" ?
($foo => scalar grep {$_ eq $foo} @bits) :
() } @bits;
# reverse sort (highest first) the hash values and print
foreach(reverse sort {$counts{$a} <=> $counts{$b}} keys %counts) {
print "$_: $counts{$_}\n";
}

I don't understand the problem you are trying to solve, so I propose a simpler way to count the characters in a string:
$string = "fooooooobar";
$char = 'o';
$count = grep {$_ eq $char} split //, $string;
print $count, "\n";
This prints the number of $char occurrences in $string (7).
I hope this helps you write more compact code.

Faster solution:
@result = $subject =~ m/a/g; # $subject holds the contents of your file
print "Found : ", scalar @result, " a characters in file!\n";
Of course, you can put a variable in place of 'a' or, even better, run this line for each character whose occurrences you want to count.

As a one-liner:
perl -F"" -anE '$h{$_}++ for @F; END { say "$_ : $h{$_}" for keys %h }' foo.txt

Related

Data value of array not printing properly

I have written a script which collects students' marks and prints those who scored above 50.
Script is below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
print Dumper(\@array);
my $class = "3";
foreach my $each_value (@array) {
print "EACH: $each_value\n";
my ($name, $score ) = split (/,/, $each_value);
if ($score lt 50) {
next;
} else {
print "$name, \"GOOD SCORE\", $score, $class";
}
}
Here I wanted to print data of STUDENT1, since his score is greater than 50.
So output should be:
STUDENT1, "GOOD SCORE", 90, 3
But it's printing output like this:
STUDENT1, "GOOD SCORE", 90
STUDENT2, 3
Something goes wrong between 90 and STUDENT2: the two records are not being separated.
I know I am not splitting the data on the newline character, since there is a single element in the array @array.
How can I split that element on newlines, so that inside the for loop I can split again on the comma (,) to get the values into $name and $score?
Actually, @array comes in as an argument to this script, so I have to modify the script in order to parse the right values.
As you already know your "array" only has one "element" with a string with the actual records in it, so it essentially is more a scalar than an array.
And as you suspect, you can split this scalar just as you already did with the newline as a separator instead of a comma. You can then put a foreach around the result of split() to iterate over the records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $records = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
my $class = "3";
foreach my $record (split("\n", $records)) {
my ($name, $score) = split(',', $record);
if ($score >= 50) {
print("$name, \"GOOD SCORE\", $score, $class\n");
}
}
As a small note, lt is a string comparison operator. The numeric comparisons use symbols, such as <.
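To see why the choice of operator matters, here is a small sketch with scores chosen for illustration. lt compares character by character, so it disagrees with < whenever the two numbers have a different number of digits.

```perl
use strict;
use warnings;

# Scores chosen for illustration; 100 and 9 are where the two operators disagree.
for my $score (90, 40, 100, 9) {
    my $by_string = $score lt 50 ? 'below 50' : 'not below 50';
    my $by_number = $score <  50 ? 'below 50' : 'not below 50';
    print "$score: lt says $by_string; < says $by_number\n";
}
```

As strings, "100" sorts before "50" because "1" comes before "5", and "9" sorts after "50" because "9" comes after "5".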
Although you have an array, you only have a single string value in it:
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
That's not a big deal. Dave Cross has already shown you how you can break that up into multiple values, but there's another way I like to handle multi-line strings. You can open a filehandle on a reference to the string, then read lines from the string as you would a file:
my $string = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
One of the things to consider while programming is how many times you are duplicating the data. If you have it in a big string then split it into an array, you've now stored the data twice. That might be fine and it's usually expedient. You can't always avoid it, but you should have some tools in your toolbox that let you avoid it.
And, here's a chance to use indented here docs:
use v5.26;
my $string = <<~"HERE";
STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
For your particular problem, I think you have a single string where the lines are separated by the '|' character. You don't show how you call this program or get the data, though.
You can choose any line ending you like by setting the value for the input record separator, $/. Set it to a pipe and this works:
use v5.10;
my $string = 'STUDENT1,90|STUDENT2,40|STUDENT3,30|STUDENT4,30';
{
local $/ = '|'; # input record separator
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
say "Got $_";
}
}
Now the structure of your program isn't too far away from taking the data from standard input or a file. That gives you a lot of flexibility.
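As a sketch of how the string filehandle could be combined with the score check from the question, consider the following. The helper name good_scores is hypothetical, and the records are assumed to be newline-separated as in the question's array element.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one formatted line per student scoring 50 or more.
sub good_scores {
    my ($string, $class) = @_;
    my @lines;
    open my $string_fh, '<', \$string
        or die "Cannot open in-memory filehandle: $!";
    while (my $line = <$string_fh>) {
        chomp $line;
        my ($name, $score) = split /,/, $line;
        next if $score < 50;    # numeric comparison, not lt
        push @lines, qq{$name, "GOOD SCORE", $score, $class};
    }
    close $string_fh;
    return @lines;
}

my $records = "STUDENT1,90\nSTUDENT2,40\nSTUDENT3,30\nSTUDENT4,30\n";
print "$_\n" for good_scores($records, "3");
```

With the sample records this prints the single line STUDENT1, "GOOD SCORE", 90, 3. Swapping the in-memory handle for STDIN or a real file would leave the loop body unchanged.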
The @array contains only one element. The for loop itself works correctly; you can fix the script without changing anything in the for block, just by replacing the array with this:
my @array = (
'STUDENT1,90',
'STUDENT2,40',
'STUDENT3,30',
'STUDENT4,30');
Otherwise, you can iterate over the records by splitting the string on the newline character \n.

Perl: Final Output Line in Foreach Loop Prints Twice

I'm trying to write a very simple script that takes two words from STDIN and outputs TRUE if they're anagrams and FALSE if not. My main issue is that if the two words aren't anagrams (this is the final "else" statement in the script), the output looks like:
Sorry, that's not an anagram pair
Sorry, that's not an anagram pair
where I just want:
Sorry, that's not an anagram pair
Other, more minor issues for the especially generous:
I know what the FALSE values are for Perl, but I can't get the script to print FALSE by, for example, setting a variable to '' or 0, etc. or saying "return ''". Ideally, I wouldn't have to put "print TRUE/FALSE" in the script at all.
I put in the last elsif statement in the script to see if it would affect the printing twice problem. It didn't, and now I'm curious why my m// expression doesn't work. It's supposed to find pairs that are identical except that one has more whitespace than the other.
Here's the script! I'm sorry it's so long - again, the problem is at the very end with the final "else" statement. Many thanks!!!
#To Run: Type start.pl on the command line.
#The script should prompt you to enter a word or phrase.
#Once you've done that, it'll prompt you for another one.
#Then you will be told if the two terms are anagrams or not.
#!/usr/bin/perl -w
use strict;
#I have to use this to make STDIN work. IDK why.
$|=1;
#variables
my $aWord;
my $bWord;
my $word;
my $sortWord;
my $sortWords;
my @words;
my %anaHash;
print "\nReady to play the anagram game? Excellent.\n\nType your first word or phrase, then hit Enter.\n\n";
$aWord = <STDIN>;
chomp $aWord;
print "\n\nThanks! Now type your second word or phrase and hit Enter.\n\n";
$bWord = <STDIN>;
chomp $bWord;
#This foreach loop performs the following tasks:
#1. Pushes the two words from STDIN into an array (unsure if this is really necessary)
#2. lowercases everything and removes all characters except for letters & spaces
#3. splits both words into characters, sorts them alphabetically, then joins the sorted letters into a single "word"
#4.pushes the array into a hash
@words = ($bWord, $aWord);
foreach $word (@words) {
$word =~ tr/A-Z/a-z/;
$word =~ s/[^a-z ]//ig;
$sortWord = join '', sort(split(//, $word));
push @{$anaHash{$sortWord}}, $word;
}
#This foreach loop tries to determine if the word pairs are anagrams or not.
foreach $sortWords (values %anaHash) {
#"if you see the same word twice AND the input was two identical words:"
if (1 < @$sortWords &&
@$sortWords[0] eq @$sortWords[1]) {
print "\n\nFALSE: Your phrases are identical!\n\n";
}
#"if you see the same word twice AND the input was two different words (i.e. a real anagram):"
elsif (1 < @$sortWords &&
@$sortWords[0] ne @$sortWords[1]) {
print "\n\nTRUE: @$sortWords[0] and @$sortWords[1] are anagrams!\n\n";
}
#this is a failed attempt to identify pairs that are identical except one has extra spaces. Right now, this fails and falls into the "else" category below.
elsif (@$sortWords[0] =~ m/ +@$sortWords[-1]/ ||
@$sortWords[-1] =~ m/ +@$sortWords[0]/) {
print "\n\FALSE: @$sortWords[0] and @$sortWords[-1] are NOT anagrams. Spaces are characters, too!\n\n";
}
}
#This is supposed to identify anything that's not an acronym. But the output prints twice! It's maddening!!!!
else {
print "Sorry, that's not an anagram pair\n";
}
}
It's useful to print out the contents of %anaHash after you've finished building it, but before you start examining it. Using the words "foo" and "bar", I get this result using Data::Dumper.
$VAR1 = {
'abr' => [
'bar'
],
'foo' => [
'foo'
]
};
So the hash has two keys. And as you loop round all of the keys in the hash, you'll get the message twice (once for each key).
I'm not really sure what the hash is for here. I don't think it's necessary. I think that you need to:
Read in the two words
Convert the words to a canonical format
Check if the two strings are the same
Simplified, your code would look like this:
print 'Give me a word: ';
chomp(my $word1 = <STDIN>);
print 'Give me another word: ';
chomp(my $word2 = <STDIN>);
# convert to lower case
$word1 = lc $word1;
$word2 = lc $word2;
# remove non-letters
$word1 =~ s/[^a-z]//g;
$word2 =~ s/[^a-z]//g;
# sort letters
$word1 = join '', sort split //, $word1;
$word2 = join '', sort split //, $word2;
if ($word1 eq $word2) {
# you have an anagram
} else {
# you don't
}
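The three steps can also be wrapped in a pair of small subs. This is only a sketch: the names canonical and is_anagram are made up for illustration. A true comparison in Perl yields 1 and a false one yields the empty string, so the words TRUE and FALSE have to be printed explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case the word, drop non-letters, sort the letters.
sub canonical {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/[^a-z]//g;
    return join '', sort { $a cmp $b } split //, $word;
}

# Two words are anagrams when their canonical forms are equal.
sub is_anagram {
    my ($first, $second) = @_;
    return canonical($first) eq canonical($second);
}

print is_anagram('Listen', 'Silent') ? "TRUE\n" : "FALSE\n";
print is_anagram('foo', 'bar')       ? "TRUE\n" : "FALSE\n";
```

This prints TRUE for the first pair and FALSE for the second. Because there is no loop over a hash, each pair produces exactly one line of output.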
My final answer, thanks so much to Dave and @zdim! I'm so happy I could die.
#!/usr/bin/perl -w
use strict;
use feature qw(say);
#I have to use this to make STDIN work. IDK why.
$|=1;
#declare variables below
print "First word?\n";
$aWord = <STDIN>;
chomp $aWord;
print "Second word?\n";
$bWord = <STDIN>;
chomp $bWord;
#clean up input
$aWord =~ tr/A-Z/a-z/;
$bWord =~ tr/A-Z/a-z/;
$aWord =~ s/[^a-z ]//ig;
$bWord =~ s/[^a-z ]//ig;
#if the two inputs are identical, print FALSE and exit
if ($aWord eq $bWord) {
say "\n\nFALSE: Your phrases are identical!\n";
exit;
}
#split each word by character, sort characters alphabetically, join characters
$aSortWord = join '', sort(split(//, $aWord));
$bSortWord = join '', sort(split(//, $bWord));
#if the sorted characters match, you have an anagram
#if not, you don't
if ($aSortWord eq $bSortWord) {
say "\n\nTRUE: Your two terms are anagrams!";
}
else {
say "\n\nFALSE: Your two terms are not acronyms.";
}

Perl split function basics reading each word from an input file

I'm having trouble understanding why this code will not output anything:
#!/usr/bin/perl -w
use strict;
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
while (<>) {
print "In the loop 1";
chomp;
print "Got here";
my @words = split(/\W+/,$_);
}
foreach my $val (my @words) {
print "$val\n";
}
And I run it from the terminal using the command:
perl wordfinder.pl < exampletext.txt
I would expect the code above to output each word from the input file, but it does not output anything other than "In the loop 1" and "Got here". I'm trying to separate the input file word by word, using the split parameter I specified.
Update 1: Here, I have declared the variables within their proper scope, which was my main issue. Now I am getting all of the words from the input file to output on the terminal:
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
my @words = ();
my $val;
while (<>) {
print "Inputting words into an array! \n";
chomp;
@words = split(/\W+/,$_);
}
print("Words have been input successfully, performing analysis: \n");
foreach $val (@words) {
print "$val\n";
}
UPDATE 2: Progress has been made. Now, we put all words from any input files into a hash, and then print each unique key (i.e. each unique word found across all input files) from the hash.
#!/usr/bin/perl -w
use strict;
# Description: We want to take ALL text files from the command line input and calculate
# the frequencies of the words contained therein.
# Step 1: Loop over all words in all input files, and put each new unique word in a
# hash (check to see if contained in hash, if not, put the word in; if the word already
# exists in the hash, then increase its "total" by 1). Also, keep a running total of
# all words.
print("Welcome to word frequency finder. \n");
my $running_total = 0;
my %words;
my $val;
while (<>) {
chomp;
foreach my $str (split(/\W+/,$_)) {
$words{$str}++;
$running_total++;
}
}
print("Words have been input successfully, performing analysis: \n");
# Step 2: Loop over all entries in the hash and look for the word (key) with the
# maximum amount, and then remove this from the hash and put in a separate list.
# Do this until the size of the separate list is 10, since we want the top 10 words.
foreach $val (keys %words) {
print "$val\n";
}
Since you've already completed step 1, you're left with getting your top ten most common words. Rather than looping through the hash and finding the most frequent entry, let's let Perl do the work for us by sorting the hash by its values.
To sort the %words hash by its keys, we can use the expression sort keys %words. To sort the hash by its values while still being able to access its keys, we need a more complex expression:
sort { $words{$a} <=> $words{$b} } keys %words
Breaking it down, to sort numerically, we use the expression
sort { $a <=> $b } @array
(see perldoc -f sort for more on the special variables $a and $b used in sorting)
sort { $a <=> $b } keys %words
would sort on the hash keys, so to sort on the values, we do
sort { $words{$a} <=> $words{$b} } keys %words
Note that the output is still the keys of the hash %words.
We actually want to sort from high to low, so swap $a and $b over to reverse the sort direction:
sort { $words{$b} <=> $words{$a} } keys %words
Since we're compiling a top ten list, we only want the first ten from our hash. It's possible to do this by taking a slice of the hash, but the easiest way is just to use an accumulator to keep count of how many entries we have in the top ten:
my %top_ten;
my $i = 0;
for (sort { $words{$b} <=> $words{$a} } keys %words) {
# $_ is the current hash key
$top_ten{$_} = $words{$_};
$i++;
last if $i == 10;
}
And we're done!
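For comparison, here is a sketch of the slice approach. It takes the slice from the sorted list of keys, which keeps the ranking order; a hash such as %top_ten does not remember the order in which its keys were added. The sub name top_n, the alphabetical tie-break and the sample counts are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $n keys of %$counts, highest count first.
# Ties are broken alphabetically so that the result is deterministic.
sub top_n {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
    my $last = $#sorted < $n - 1 ? $#sorted : $n - 1;    # never slice past the end
    return @sorted[0 .. $last];
}

# Sample counts, invented for illustration.
my %words = (the => 12, perl => 7, hash => 7, sort => 3, a => 9);
print "$_: $words{$_}\n" for top_n(\%words, 3);
```

With the sample data this prints the: 12, a: 9 and hash: 7, in that order. Passing 10 instead of 3 gives the top ten.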

Selecting highest count of element except when...

I have been working on a Perl script that will analyze and count the same letters in different line spaces. I have implemented the count with a hash, but I am having trouble excluding the "-" character from the output of this hash. I tried using the delete command and next if, but I am not getting rid of the - count in the output.
So with this input:
@extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(@extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, @keys) = keys %$counter;
my ($big, @vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie between elements, say G = M, a way to display both would be great but is not necessary. Any tips on how to delete or remove the '-' are much appreciated. I am slowly learning the Perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my @chars = split '', $str;
# Count characters
my %count;
$count{$_}++ for @chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my @keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (@keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
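The question also asks about ties. One possible sketch, with a hypothetical sub name most_common and made-up sample strings, is to find the highest count first and then keep every character that reaches it:

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: returns every character that shares the highest count,
# ignoring '-'. Returns an empty list if nothing but '-' is present.
sub most_common {
    my ($str) = @_;
    my %count;
    $count{$_}++ for grep { $_ ne '-' } split //, $str;
    return () unless %count;
    my $max = max values %count;
    return sort { $a cmp $b } grep { $count{$_} == $max } keys %count;
}

print join(', ', most_common('---MGG---')), "\n";
print join(', ', most_common('--MMGG--')), "\n";
```

The first call prints G and the second prints G, M.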
foreach my $extract(@extract) {
#next if $_ =~ /\-/
The loop variable here is $extract, so the loop never sets $_.
(In this case, $_ keeps whatever value it had before the loop, e.g. from an enclosing loop or earlier code.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;

Perl regular expressions and returned array of matched groups

I am new to Perl and I need to do some regexp matching.
I read that when an array is used as an integer value, it gives the count of elements inside.
So I am doing, for example:
if (@result = $pattern =~ /(\d)\.(\d)/) {....}
and I was thinking it should return an empty array when pattern matching fails, but it still gives me an array with 2 elements, with uninitialized values.
So how can I put pattern matching inside an if condition? Is it possible?
EDIT:
foreach (keys @ARGV) {
if (my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
if (defined $params{$result[0]}) {
print STDERR "Cmd option error\n";
}
$params{$result[0]} = (defined $result[1] ? $result[1] : 1);
}
else {
print STDERR "Cmd option error\n";
exit ERROR_CMD;
}
}
This is a regexp pattern for command-line options. The options are in long format, preceded by two hyphens and possibly followed by an argument, so --CMD[=ARG]. I want an elegant solution, which is why I want to put it in the if condition without a prologue.
EDIT2:
Oh sorry, I was thinking the groups in the @result array are always counted from 0, but only the groups from the branch where the pattern succeeds are accessible. So if the command in my code is "input", it should be in $result[0], but it is actually in $result[1]. I thought that if $result[0] is uninitialized, then the pattern fails and it goes to the if statement.
Consider the following:
use strict;
use warnings;
my $pattern = 42.42;
my @result = $pattern =~ /(\d)\.(\d)/;
print @result, ' elements';
Output:
24 elements
Context tells Perl how to treat @result. There certainly aren't 24 elements! Perl has printed the array's elements which resulted from your regex's captures. However, if we do the following:
print 0 + @result, ' elements';
we get:
2 elements
In this latter case, Perl interprets a scalar context for @result, so adds the number of elements to 0. This can also be achieved through scalar @result.
Edit to accommodate revised posting: Thus, the conditional in your code:
if(my @result = $ARGV[$_] =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) { ...
evaluates to true if and only if the match was successful.
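A small sketch may make the behaviour from EDIT2 concrete. The sub name parse_option and the sample arguments are made up for illustration. On success the match returns one value per capture group, with undef for the groups in the branch that did not take part, so the array is non-empty and the condition is true. On failure it returns an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (option, value) on success, an empty list on failure.
sub parse_option {
    my ($arg) = @_;
    if (my @result = $arg =~ /^--(?:(help|br)|(?:(input|output|format)=(.+)))$/) {
        # @result always has three slots here; unused groups are undef.
        my ($flag, $name, $value) = @result;
        return (defined $flag ? $flag : $name, defined $value ? $value : 1);
    }
    return;    # the match failed
}

for my $arg ('--help', '--input=data.txt', 'bogus') {
    my @parsed = parse_option($arg);
    print @parsed ? "$arg: option=$parsed[0] value=$parsed[1]\n" : "$arg: no match\n";
}
```

For --input=data.txt the option name arrives in the second slot of @result, not the first, which is why the sketch checks which of the two is defined.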
@results = $pattern =~ /(\d)\.(\d)/ ? ($1,$2) : ();
Try this:
@result = ();
if ($pattern =~ /(\d)\.(\d)/)
{
push @result, $1;
push @result, $2;
}
=~ is not an equal sign. It's doing a regexp comparison.
So my code above is initializing the array to empty, then assigning values only if the regexp matches.