Perl split function basics reading each word from an input file

Perl split function basics reading each word from an input file - perl

I'm having trouble understanding why this code will not output anything:
#!/usr/bin/perl -w
use strict;
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
while (<>) {
print "In the loop 1";
chomp;
print "Got here";
my #words = split(/\W+/,$_);
}
foreach my $val (my #words) {
print "$val\n";
}
And I run it from the terminal using the command:
perl wordfinder.pl < exampletext.txt
I would expect the code above to output each word from the input file, but it does not output anything other than "In the loop 1" and "Got here". I'm trying to separate the input file word by word, using the split parameter I specified.
Update 1: Here, I have declared the variables within their proper scope, which was my main issue. Now I am getting all of the words from the input file to output on the terminal:
my %allwords = (); #Create an empty hash list.
my $running_total = 0;
my #words = ();
my $val;
while (<>) {
print "Inputting words into an array! \n";
chomp;
#words = split(/\W+/,$_);
}
print("Words have been input successfully, performing analysis: \n");
foreach $val (#words) {
print "$val\n";
}
UPDATE 2: Progress has been made. Now, we put all words from any input files into a hash, and then print each unique key (i.e. each unique word found across all input files) from the hash.
#!/usr/bin/perl -w
use strict;
# Description: We want to take ALL text files from the command line input and calculate
# the frequencies of the words contained therein.
# Step 1: Loop over all words in all input files, and put each new unique word in a
# hash (check to see if contained in hash, if not, put the word in; if the word already
# exists in the hash, then increase its "total" by 1). Also, keep a running total of
# all words.
print("Welcome to word frequency finder. \n");
my $running_total = 0;
my %words;
my $val;
while (<>) {
chomp;
foreach my $str (split(/\W+/,$_)) {
$words{$str}++;
$running_total++;
}
}
print("Words have been input successfully, performing analysis: \n");
# Step 2: Loop over all entries in the hash and look for the word (key) with the
# maximum amount, and then remove this from the hash and put in a separate list.
# Do this until the size of the separate list is 10, since we want the top 10 words.
foreach $val (keys %words) {
print "$val\n";
}

Since you've already completed step 1, you're left with getting your top ten most common words. Rather than looping through the hash and finding the most frequent entry, let's let Perl do the work for us by sorting the hash by its values.
To sort the %words hash by its keys, we can use the expression sort keys %words; to sort a hash by its values, but be able to access its keys, we need a more complex expression:
sort { $words{$a} <=> $words{$a} } keys %words
Breaking it down, to sort numerically, we use the expression
sort { $a <=> $b } #array
(see [perl sort][1] for more on the special variables $a and $b used in sorting)
sort { $a <=> $b } keys %words
would sort on the hash keys, so to sort on the values, we do
sort { $words{$a} <=> $words{$b} } keys %words
Note that the output is still the keys of the hash %words.
We actually want to sort from high to low, so swap $a and $b over to reverse the sort direction:
sort { $words{$b} <=> $words{$a} } keys %words
Since we're compiling a top ten list, we only want the first ten from our hash. It's possible to do this by taking a slice of the hash, but the easiest way is just to use an accumulator to keep count of how many entries we have in the top ten:
my %top_ten;
my $i = 0;
for (sort { $words{$b} <=> $words{$a} } keys %words) {
# $_ is the current hash key
$top_ten{$_} = $words{$_};
$i++;
last if $i == 10;
}
And we're done!

Related

Printing specific number of key-value pairs in a hash

I have a hash which stores the count of key-value pairs, from an array of strings taken from an input document then sorts them and prints them.
%count = ();
foreach $string (#strings) {
$count{$string}++;
}
foreach my $key (sort {$count{$b} <=> $count{$a} } keys %count) {
print $key, ": ", $count{$key} ;
}
so I am wondering is there a way to only print a certain number of key-value pairs in the hash instead of all of them ? i.e print top 5 based the value?
edit: would a for loop solve this?

%count = ();
foreach $string (#strings) {
$count{$string}++;
}
my $n=0; # variable to keep count of processed keys
foreach my $key (sort {$count{$b} <=> $count{$a} } keys %count) {
# count processed keys (++$n)
# and terminate the loop after processing 5 keys
last if ++$n>5;
print $key, ": ", $count{$key} ;
}

Can take a slice of the list returned by sort
use strict;
use warnings;
use feature 'say';
....
my %count;
foreach my $string (#strings) {
++$count{$string}
}
say "$_: $count{$_}"
for ( sort { $count{$b} <=> $count{$a} } keys %count )[0..4];
(This expects that the hash indeed has five keys; if it can happen that that is not the case you'd get hit by warnings so add a test in that case, for instance $_ and say "..." for ...)
The code in the question is clearly not using strict; I recommend to always use it.
The %count = () makes sense if the hash has been populated before and now need be emptied. If you are creating it then just declare (and without = (), which does nothing).
Note, thanks to Grinnz: very recent List::Util 1.50 adds head (and tail) functions

Selecting highest count of element except when...

So i have been working on this perl script that will analyze and count the same letters in different line spaces. I have implemented the count to a hash but am having trouble excluding a " - " character from the output results of this hash. I tried using delete command or next if, but am not getting rid of the - count in the output.
So with this input:
#extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(#extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, #keys) = keys %$counter;
my ($big, #vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie in the elements, say G = M, if there is a way to display both in that would be great but not necessary. Any tips on how to delete or remove the '-' is much appreciated. I am slowly learning perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.

Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my #chars = split '', $str;
#Count Characteres
my %count;
$count{$_}++ for #chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my #keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (#keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1

foreach my $extract(#extract) {
#next if $_ =~ /\-/
$_ setting is suppressed by $extract here.
(In this case, $_ keeps value from above, e.g. routine argument list, previous match, etc.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;

three questions on a Perl function

I am trying to use an existing Perl program, which includes the following function of GetItems. The way to call this function is listed in the following.
I have several questions for this program:
what does foreach my $ref (#_) aim to do? I think #_ should be related to the parameters passed, but not quite sure.
In my #items = sort { $a <=> $b } keys %items; the "items" on the left side should be different from the "items" on the right side? Why do they use the same name?
What does $items{$items[$i]} = $i + 1; aim to do? Looks like it just sets up the value for the hash $items sequentially.
$items = GetItems($classes, $pVectors, $nVectors, $uVectors);
######################################
sub GetItems
######################################
{
my $classes = shift;
my %items = ();
foreach my $ref (#_)
{
foreach my $id (keys %$ref)
{
foreach my $cui (keys %{$ref->{$id}}) { $items{$cui} = 1 }
}
}
my #items = sort { $a <=> $b } keys %items;
open(VAL, "> $classes.items");
for my $i (0 .. $#items)
{
print VAL "$items[$i]\n";
$items{$items[$i]} = $i + 1;
}
close VAL;
return \%items;
}

When you enter a function, #_ starts out as an array of (aliases to) all the parameters passed into the function; but the my $classes = shift removes the first element of #_ and stores it in the variable $classes, so the foreach my $ref (#_) iterates over all the remaining parameters, storing (aliases to) them one at a time in $ref.
Scalars, hashes, and arrays are all distinguished by the syntax, so they're allowed to have the same name. You can have a $foo, a #foo, and a %foo all at the same time, and they don't have to have any relationship to each other. (This, together with the fact that $foo[0] refers to #foo and $foo{'a'} refers to %foo, causes a lot of confusion for newcomers to the language; you're not alone.)
Exactly. It sets each element of %items to a distinct integer ranging from one to the number of elements, proceeding in numeric (!) order by key.

foreach my $ref (#_) loops through each hash reference passed as a parameter to GetItems. If the call looks like this:
$items = GetItems($classes, $pVectors, $nVectors, $uVectors);
then the loop processes the hash refs in $pVector, $nVectors, and $uVectors.
#items and %items are COMPLETELY DIFFERENT VARIABLES!! #items is an array variable and %items is a hash variable.
$items{$items[$i]} = $i + 1 does exactly as you say. It sets the value of the %items hash whose key is $items[$i] to $i+1.

Here is an (nearly) line by line description of what is happening in the subroutine
Define a sub named GetItems.
sub GetItems {
Store the first value in the default array #_, and remove it from the array.
my $classes = shift;
Create a new hash named %items.
my %items;
Loop over the remaining values given to the subroutine, setting $ref to the value on each iteration.
for my $ref (#_){
This code assumes that the previous line set $ref to a hash ref. It loops over the unsorted keys of the hash referenced by $ref, storing the key in $id.
for my $id (keys %$ref){
Using the key ($id) given by the previous line, loop over the keys of the hash ref at that position in $ref. While also setting the value of $cui.
for my $cui (keys %{$ref->{$id}}) {
Set the value of %item at position $cui, to 1.
$items{$cui} = 1;
End of the loops on the previous lines.
}
}
}
Store a sorted list of the keys of %items in #items according to numeric value.
my #items = sort { $a <=> $b } keys %items;
Open the file named by $classes with .items appended to it. This uses the old-style two arg form of open. It also ignores the return value of open, so it continues on to the next line even on error. It stores the file handle in the global *VAL{IO}.
open(VAL, "> $classes.items");
Loop over a list of indexes of #items.
for my $i (0 .. $#items){
Print the value at that index on it's own line to *VAL{IO}.
print VAL "$items[$i]\n";
Using that same value as an index into %items (which it is a key of) to the index plus one.
$items{$items[$i]} = $i + 1;
End of loop.
}
Close the file handle *VAL{IO}.
close VAL;
Return a reference to the hash %items.
return \%items;
End of subroutine.
}

I have several questions for this program:
What does foreach my $ref (#_) aim to do? I think #_ should be related to the parameters passed, but not quite sure.
Yes, you are correct. When you pass parameters into a subroutine, they automatically are placed in the #_ array. (Called a list in Perl). The foreach my $ref (#_) begins a loop. This loop will be repeated for each item in the #_ array, and each time, the value of $ref will be assigned the next item in the array. See Perldoc's Perlsyn (Perl Syntax) section about for loops and foreach loops. Also look at Perldoc's Perlvar (Perl Variables) section of General variables for information about special variables like #_.
Now, the line my $classes = shift; is removing the first item in the #_ list and putting it into the variable $classes. Thus, the foreach loop will be repeated three times. Each time, $ref will be first set to the value of $pVectors, $nVectors, and finally $uVectors.
By the way, these aren't really scalar values. In Perl, you can have what is called a reference. This is the memory location of the data structure you're referencing. For example, I have five students, and each student has a series of tests they've taken. I want to store all the values of each test in a hash keyed by the student's ID.
Normally, each entry in the hash can only contain a single item. However, what if this item refers to a list that contains the student's grades?
Here's the list of student #100's grade:
#grades = (100, 93, 89, 95, 74);
And here's how I set Student 100's entry in my hash:
$student{100} = \#grades;
Now, I can talk about the first grade of the year for Student #100 as $student{100}[0]. See the Perldoc's Mark's very short tutorial about references.
In my #items = sort { $a <=> $b } keys %items; the "items" on the left side should be different from the "items" on the right side? Why do they use the same name?
In Perl, you have three major types of variables: Lists (what some people call Arrays), Hashes (what some people call Keyed Arrays), and Scalars. In Perl, it is perfectly legal to have different variable types have the same name. Thus, you can have $var, %var, and #var in your program, and they'll be treated as completely separate variables1.
This is usually a bad thing to do and is highly discouraged. It gets worse when you think of the individual values: $var refers to the scalar while $var[3] refers to the list, and $var{3} refers to the hash. Yes, it can be very, very confusing.
In this particular case, he has a hash (a keyed array) called %item, and he's converting the keys in this hash into a list sorted by the keys. This syntax could be simplified from:
my #items = sort { $a <=> $b } keys %items;
to just:
my #items = sort keys %items;
See the Perldocs on the sort function and the keys function.
What does $items{$items[$i]} = $i + 1; aim to do? Looks like it just sets up the value for the hash $items sequentially.
Let's look at the entire loop:
foreach my $i (0 .. $#items)
{
print VAL "$items[$i]\n";
$items{$items[$i]} = $i + 1;
}
The subroutine is going to loop through this loop once for each item in the #items list. This is the sorted list of keys to the old %items hash. The $#items means the largest index in the item list. For example, if #items = ("foo", "bar", and "foobar"), then $#item would be 2 because the last item in this list is $item[2] which equals foobar.
This way, he's hitting the index of each entry in #items. (REMEMBER: This is different from %item!).
The next line is a bit tricky:
$items{$items[$i]} = $i + 1;
Remember that $item{} refers to the old %items hash! He's creating a new %items hash. This is being keyed by each item in the #items list. And, the value is being set to the index of that item plus 1. Let's assume that:
#items = ("foo", "bar", "foobar")
In the end, he's doing this:
$item{foo} = 1;
$item{bar} = 2;
$item{foobar} = 3;
1 Well, this isn't 100% true. Perl stores each variable in a kind of hash structure. In memory, $var, #var, and %var will be stored in the same hash entry in memory, but in positions related to each variable type. 99.9999% of the time, this matters not one bit. As far as you are concerned, these are three completely different variables.
However, there are a few rare occasions where a programmer will take advantage of this when they futz directly with memory in Perl.

I want to show you how I would write that subroutine.
Bur first, I want to show you some of the steps of how, and why, I changed the code.
Reduce the number of for loops:
First off this loop doesn't need to set the value of $items{$cui} to anything in particular. It also doesn't have to be a loop at all.
foreach my $cui (keys %{$ref->{$id}}) { $items{$cui} = 1 }
This does practically the same thing. The only real difference is it sets them all to undef instead.
#items{ keys %{$ref->{$id}} } = ();
If you really needed to set the values to 1. Note that (1)x#keys returns a list of 1's with the same number of elements in #keys.
my #keys = keys %{$ref->{$id}};
#items{ #keys } = (1) x #keys;
If you are going to have to loop over a very large number of elements then a for loop may be a good idea, but only if you have to set the value to something other than undef. Since we are only using the loop variable once, to do something simple; I would use this code:
$items{$_} = 1 for keys %{$ref->{$id}};
Swap keys with values:
On the line before that we see:
foreach my $id (keys %$ref){
In case you didn't notice $id was used only once, and that was for getting the associated value.
That means we can use values and get rid of the %{$ref->{$id}} syntax.
for my $hash (values %$ref){
#items{ keys %$hash } = ();
}
( $hash isn't a good name, but I don't know what it represents. )
3 arg open:
It isn't recommended to use the two argument form of open, or to blindly use the bareword style of filehandles.
open(VAL, "> $classes.items");
As an aside, did you know there is also a one argument form of open. I don't really recommend it though, it's mostly there for backward compatibility.
our $VAL = "> $classes.items";
open(VAL);
The recommend way to do it, is with 3 arguments.
open my $val, '>', "$classes.items";
There may be some rare edge cases where you need/want to use the two argument version though.
Put it all together:
sub GetItems {
# this will cause open and close to die on error (in this subroutine only)
use autodie;
my $classes = shift;
my %items;
for my $vector_hash (#_){
# use values so that we don't have to use $ref->{$id}
for my $hash (values %$ref){
# create the keys in %items
#items{keys %$hash} = ();
}
}
# This assumes that the keys of %items are numbers
my #items = sort { $a <=> $b } keys %items;
# using 3 arg open
open my $output, '>', "$classes.items";
my $index; # = 0;
for $item (#items){
print {$output} $item, "\n";
$items{$item} = ++$index; # 1...
}
close $output;
return \%items;
}
Another option for that last for loop.
for my $index ( 1..#items ){
my $item = $items[$index-1];
print {$output} $item, "\n";
$items{$item} = $index;
}
If your version of Perl is 5.12 or newer, you could write that last for loop like this:
while( my($index,$item) = each #items ){
print {$output} $item, "\n";
$items{$item} = $index + 1;
}

sorting an array on the first number found in each element

I'm looking for help sorting an array where each element is made up of "a number, then a string, then a number". I would like to sort on the first number part of the array elements, descending (so that I list the higher numbers first), while also listing the text etc.
am still a beginner so alternatives to the below are also welcome
use strict;
use warnings;
my #arr = map {int( rand(49) + 1) } ( 1..100 ); # build an array of 100 random numbers between 1 and 49
my #count2;
foreach my $i (1..49) {
my #count = join(',', #arr) =~ m/$i,/g; # maybe try to make a string only once then search trough it... ???
my $count1 = scalar(#count); # I want this $count1 to be the number of times each of the numbers($i) was found within the string/array.
push(#count2, $count1 ." times for ". $i); # pushing a "number then text and a number / scalar, string, scalar" to an array.
}
#for (#count2) {print "$_\n";}
# try to add up all numbers in the first coloum to make sure they == 100
#sort #count2 and print the top 7
#count2 = sort {$b <=> $a} #count2; # try to stop printout of this, or sort on =~ m/^anumber/ ??? or just on the first one or two \d
foreach my $i (0..6) {
print $count2[$i] ."\n"; # seems to be sorted right anyway
}

First, store your data in an array, not in a string:
# inside the first loop, replace your line with the push() with this one:
push(#count2, [$count1, $i];
Then you can easily sort by the first element of each subarray:
my #sorted = sort { $b->[0] <=> $a->[0] } #count2;
And when you print it, construct the string:
printf "%d times for %d\n", $sorted[$i][0], $sorted[$i][1];
See also: http://perldoc.perl.org/perlreftut.html, perlfaq4

Taking your requirements as is. You're probably better off not embedding count information in a string. However, I'll take it as a learning exercise.
Note, I am trading memory for brevity and likely speed by using a hash to do the counting.
However, the sort could be optimized by using a Schwartzian Transform.
EDIT: Create results array using only numbers that were drawn
#!/usr/bin/perl
use strict; use warnings;
my #arr = map {int( rand(49) + 1) } ( 1..100 );
my %counts;
++$counts{$_} for #arr;
my #result = map sprintf('%d times for %d', $counts{$_}, $_),
sort {$counts{$a} <=> $counts{$b}} keys %counts;
print "$_\n" for #result;
However, I'd probably have done something like this:
#!/usr/bin/perl
use strict; use warnings;
use YAML;
my #arr;
$#arr = 99; #initialize #arr capacity to 100 elements
my %counts;
for my $i (0 .. 99) {
my $n = int(rand(49) + 1); # pick a number
$arr[ $i ] = $n; # store it
++$counts{ $n }; # update count
}
# sort keys according to counts, keys of %counts has only the numbers drawn
# for each number drawn, create an anonymous array ref where the first element
# is the number drawn, and the second element is the number of times it was drawn
# and put it in the #result array
my #result = map [$_, $counts{$_}],
sort {$counts{$a} <=> $counts{$b} }
keys %counts;
print Dump \#result;

Calculate Character Frequency in Message using Perl

I am writing a Perl Script to find out the frequency of occurrence of characters in a message. Here is the logic I am following:
Read one char at a time from the message using getc() and store it into an array.
Run a for loop starting from index 0 to the length of this array.
This loop will read each char of the array and assign it to a temp variable.
Run another for loop nested in the above, which will run from the index of the character being tested till the length of the array.
Using a string comparison between this character and the current array indexed char, a counter is incremented if they are equal.
After completion of inner For Loop, I am printing the frequency of the char for debug purposes.
Question: I don't want the program to recompute the frequency of a character if it's already been calculated. For instance, if character "a" occurs 3 times, for the first run, it calculates the correct frequency. However, at the next occurrence of "a", since loop runs from that index till the end, the frequency is (actual freq -1). Similary for the third occurrence, frequency is (actual freq -2).
To solve this. I used another temp array to which I would push the char whose frequency is already evaluated.
And then at the next run of for loop, before entering the inner for loop, I compare the current char with the array of evaluated chars and set a flag. Based on that flag, the inner for loop runs.
This is not working for me. Still the same results.
Here's the code I have written to accomplish the above:
#!/usr/bin/perl
use strict;
use warnings;
my $input=$ARGV[0];
my ($c,$ch,$flag,$s,#arr,#temp);
open(INPUT,"<$input");
while(defined($c = getc(INPUT)))
{
push(#arr,$c);
}
close(INPUT);
my $length=$#arr+1;
for(my $i=0;$i<$length;$i++)
{
$count=0;
$flag=0;
$ch=$arr[$i];
foreach $s (#temp)
{
if($ch eq $s)
{
$flag = 1;
}
}
if($flag == 0)
{
for(my $k=$i;$k<$length;$k++)
{
if($ch eq $arr[$k])
{
$count = $count+1;
}
}
push(#temp,$ch);
print "The character \"".$ch."\" appears ".$count." number of times in the message"."\n";
}
}

You're making your life much harder than it needs to be. Use a hash:
my %freq;
while(defined($c = getc(INPUT)))
{
$freq{$c}++;
}
print $_, " ", $freq{$_}, "\n" for sort keys %freq;
$freq{$c}++ increments the value stored in $freq{$c}. (If it was unset or zero, it becomes one.)
The print line is equivalent to:
foreach my $key (sort keys %freq) {
print $key, " ", $freq{$key}, "\n";
}

If you want to do a single character count for the whole file then use any of the suggested methods posted by the others. If you want a count of all the occurances
of each character in a file then I propose:
#!/usr/bin/perl
use strict;
use warnings;
# read in the contents of the file
my $contents;
open(TMP, "<$ARGV[0]") or die ("Failed to open $ARGV[0]: $!");
{
local($/) = undef;
$contents = <TMP>;
}
close(TMP);
# split the contents around each character
my #bits = split(//, $contents);
# build the hash of each character with it's respective count
my %counts = map {
# use lc($_) to make the search case-insensitive
my $foo = $_;
# filter out newlines
$_ ne "\n" ?
($foo => scalar grep {$_ eq $foo} #bits) :
() } #bits;
# reverse sort (highest first) the hash values and print
foreach(reverse sort {$counts{$a} <=> $counts{$b}} keys %counts) {
print "$_: $counts{$_}\n";
}

I don´t understand the problem you are trying to solve, so I propose a more simple way to count the characters in a string:
$string = "fooooooobar";
$char = 'o';
$count = grep {$_ eq $char} split //, $string;
print $count, "\n";
This prints the number of $char occurrences in $string (7).
Hope this helps to write a more compact code

Faster solution :
#result = $subject =~ m/a/g; #subject is your file
print "Found : ", scalar #result, " a characters in file!\n";
Of course you can put a variable in the place of 'a' or even better execute this line for whatever characters you want to count the occurrences.

As a one-liner:
perl -F"" -anE '$h{$_}++ for #F; END { say "$_ : $h{$_}" for keys %h }' foo.txt

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl split function basics reading each word from an input file - perl

Related

Printing specific number of key-value pairs in a hash

Selecting highest count of element except when...

three questions on a Perl function

sorting an array on the first number found in each element

Calculate Character Frequency in Message using Perl

Categories

Resources