Counting letters for each word in a text with Perl

I am trying to write a Perl program that returns the frequency of every word in a file and the length of each word (not the sum of all characters!) in order to produce a Zipf curve from a Spanish text (it's not a big deal if you don't know what a Zipf curve is). My problem is this: I can do the first part and get the frequency of every word, but I don't know how to get the length of each word! :( I know the statement
$word_length = length($words) but after trying to change the code I really don't know where I should include it or how to count the length of each word.
This is what my code looks like so far:
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { #read from file or STDIN
foreach my $word (split /\s+/gi, $line){
$count_of{$word}++;
}
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
print "$word: $count_of{$word}\n";
}
__END__
I hope somebody has some suggestions!

You can use hash of hashes if you want to store the length of the word.
while (my $line = <>) {
foreach my $word (split /\s+/, $line) {
$count_of{$word}{word_count}++;
$count_of{$word}{word_length} = length($word);
}
}
print "All words and their counts and length: \n";
for my $word (sort keys %count_of) {
print "$word: $count_of{$word}{word_count} ";
print "Length of the word:$count_of{$word}{word_length}\n";
}
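A minimal sketch of the same hash-of-hashes idea wrapped in a reusable sub. The name word_stats and the sample Spanish lines are made up for illustration; split ' ' is used here instead of split /\s+/ so that leading whitespace does not produce an empty first word.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference of the form
# { word => { word_count => N, word_length => M } }
sub word_stats {
    my @lines = @_;
    my %stats;
    for my $line (@lines) {
        # split ' ' splits on runs of whitespace and skips leading whitespace
        for my $word (split ' ', $line) {
            $stats{$word}{word_count}++;
            $stats{$word}{word_length} = length $word;
        }
    }
    return \%stats;
}

# Sample input (assumed, not from the original question)
my $stats = word_stats("el gato y el perro\n", "el sol\n");
for my $word (sort keys %$stats) {
    printf "%s: count=%d length=%d\n",
        $word, $stats->{$word}{word_count}, $stats->{$word}{word_length};
}
```

Since the length of a word never changes, it could also be computed at print time with length($word) instead of being stored.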

This will print the length right next to the count:
print "$word: $count_of{$word} ", length($word), "\n";

Just for your information, another way to get the length, besides
length($word)
might be:
$word =~ s/(\w)/$1/g
It is not as clear a solution as toolic's, but it can give you another view of the issue (TIMTOWTDI :))
A little explanation:
\w with the g modifier matches every word character (letter, digit or underscore) in your $word
$1 puts each matched character back, so s/// does not change the original $word
s///g returns the number of substitutions made, i.e. the number of \w characters in $word
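To make the return value of s///g concrete, here is a small sketch. The sub name count_word_chars is hypothetical. Note that this counts only \w characters, so it differs from length() for words containing apostrophes or hyphens.

```perl
use strict;
use warnings;

# Hypothetical helper: count the \w characters in a word by using
# the return value of s///g (the number of substitutions made)
sub count_word_chars {
    my ($word) = @_;                     # $word is a copy of the argument
    my $n = ($word =~ s/(\w)/$1/g);      # each match is replaced by itself
    return $n || 0;                      # s/// returns an empty string on no match
}

print count_word_chars("hello"), "\n";
print count_word_chars("don't"), "\n";   # the apostrophe is not a \w character
```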

Related

Perl: Final Output Line in Foreach Loop Prints Twice

I'm trying to write a very simple script that takes two words from STDIN and outputs TRUE if they're anagrams and FALSE if not. My main issue is that if the two words aren't anagrams (this is the final "else" statement in the script), the output looks like:
Sorry, that's not an anagram pair
Sorry, that's not an anagram pair
where I just want:
Sorry, that's not an anagram pair
Other, more minor issues for the especially generous:
I know what the FALSE values are for Perl, but I can't get the script to print FALSE by, for example, setting a variable to '' or 0, etc. or saying "return ''". Ideally, I wouldn't have to put "print TRUE/FALSE" in the script at all.
I put in the last elsif statement in the script to see if it would affect the printing twice problem. It didn't, and now I'm curious why my m// expression doesn't work. It's supposed to find pairs that are identical except that one has more whitespace than the other.
Here's the script! I'm sorry it's so long - again, the problem is at the very end with the final "else" statement. Many thanks!!!
#To Run: Type start.pl on the command line.
#The script should prompt you to enter a word or phrase.
#Once you've done that, it'll prompt you for another one.
#Then you will be told if the two terms are anagrams or not.
#!/usr/bin/perl -w
use strict;
#I have to use this to make STDIN work. IDK why.
$|=1;
#variables
my $aWord;
my $bWord;
my $word;
my $sortWord;
my $sortWords;
my @words;
my %anaHash;
print "\nReady to play the anagram game? Excellent.\n\nType your first word or phrase, then hit Enter.\n\n";
$aWord = <STDIN>;
chomp $aWord;
print "\n\nThanks! Now type your second word or phrase and hit Enter.\n\n";
$bWord = <STDIN>;
chomp $bWord;
#This foreach loop performs the following tasks:
#1. Pushes the two words from STDIN into an array (unsure if this is really necessary)
#2. lowercases everything and removes all characters except for letters & spaces
#3. splits both words into characters, sorts them alphabetically, then joins the sorted letters into a single "word"
#4.pushes the array into a hash
@words = ($bWord, $aWord);
foreach $word (@words) {
$word =~ tr/A-Z/a-z/;
$word =~ s/[^a-z ]//ig;
$sortWord = join '', sort(split(//, $word));
push @{$anaHash{$sortWord}}, $word;
}
#This foreach loop tries to determine if the word pairs are anagrams or not.
foreach $sortWords (values %anaHash) {
#"if you see the same word twice AND the input was two identical words:"
if (1 < @$sortWords &&
@$sortWords[0] eq @$sortWords[1]) {
print "\n\nFALSE: Your phrases are identical!\n\n";
}
#"if you see the same word twice AND the input was two different words (i.e. a real anagram):"
elsif (1 < @$sortWords &&
@$sortWords[0] ne @$sortWords[1]) {
print "\n\nTRUE: @$sortWords[0] and @$sortWords[1] are anagrams!\n\n";
}
#this is a failed attempt to identify pairs that are identical except one has extra spaces. Right now, this fails and falls into the "else" category below.
elsif (@$sortWords[0] =~ m/ +@$sortWords[-1]/ ||
@$sortWords[-1] =~ m/ +@$sortWords[0]/) {
print "\n\nFALSE: @$sortWords[0] and @$sortWords[-1] are NOT anagrams. Spaces are characters, too!\n\n";
}
#This is supposed to identify anything that's not an anagram. But the output prints twice! It's maddening!!!!
else {
print "Sorry, that's not an anagram pair\n";
}
}
It's useful to print out the contents of %anaHash after you've finished building it, but before you start examining it. Using the words "foo" and "bar", I get this result using Data::Dumper.
$VAR1 = {
'abr' => [
'bar'
],
'foo' => [
'foo'
]
};
So the hash has two keys. And as you loop round all of the keys in the hash, you'll get the message twice (once for each key).
I'm not really sure what the hash is for here. I don't think it's necessary. I think that you need to:
Read in the two words
Convert the words to a canonical format
Check if the two strings are the same
Simplified, your code would look like this:
print 'Give me a word: ';
chomp(my $word1 = <STDIN>);
print 'Give me another word: ';
chomp(my $word2 = <STDIN>);
# convert to lower case
$word1 = lc $word1;
$word2 = lc $word2;
# remove non-letters
$word1 =~ s/[^a-z]//g;
$word2 =~ s/[^a-z]//g;
# sort letters
$word1 = join '', sort split //, $word1;
$word2 = join '', sort split //, $word2;
if ($word1 eq $word2) {
# you have an anagram
} else {
# you don't
}
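The same three steps can be sketched as a sub that returns a true or false value, which leaves the decision about what to print to the caller. The names canonical and is_anagram are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a phrase to its sorted lowercase letters
sub canonical {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/[^a-z]//g;                 # drop everything except a-z
    return join '', sort(split(//, $text));
}

# Hypothetical helper: 1 if the two phrases use the same letters, else 0
sub is_anagram {
    my ($first, $second) = @_;
    return canonical($first) eq canonical($second) ? 1 : 0;
}

print is_anagram('Listen', 'Silent') ? "TRUE\n" : "FALSE\n";
print is_anagram('foo', 'bar')       ? "TRUE\n" : "FALSE\n";
```

Under this definition two identical phrases also count as anagrams; a caller that wants to reject them would compare the cleaned inputs first.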
My final answer, thanks so much to Dave and @zdim! I'm so happy I could die.
#!/usr/bin/perl -w
use strict;
use feature qw(say);
#I have to use this to make STDIN work. IDK why.
$|=1;
#declare variables below
my ($aWord, $bWord, $aSortWord, $bSortWord);
print "First word?\n";
$aWord = <STDIN>;
chomp $aWord;
print "Second word?\n";
$bWord = <STDIN>;
chomp $bWord;
#clean up input
$aWord =~ tr/A-Z/a-z/;
$bWord =~ tr/A-Z/a-z/;
$aWord =~ s/[^a-z ]//ig;
$bWord =~ s/[^a-z ]//ig;
#if the two inputs are identical, print FALSE and exit
if ($aWord eq $bWord) {
say "\n\nFALSE: Your phrases are identical!\n";
exit;
}
#split each word by character, sort characters alphabetically, join characters
$aSortWord = join '', sort(split(//, $aWord));
$bSortWord = join '', sort(split(//, $bWord));
#if the sorted characters match, you have an anagram
#if not, you don't
if ($aSortWord eq $bSortWord) {
say "\n\nTRUE: Your two terms are anagrams!";
}
else {
say "\n\nFALSE: Your two terms are not anagrams.";
}

find duplicate filenames and append them to hash of arrays

Perl question: I have a colon separated file containing paths that I'm using. I just split using a regex, like this:
my %unique_import_hash;
while (my $line = <$log_fh>) {
my ($log_type, $log_import_filename, $log_object_filename)
= split /:/, $line;
$log_type =~ s/^\s+|\s+$//g; # trim whitespace
$log_import_filename =~ s/^\s+|\s+$//g; # trim whitespace
$log_object_filename =~ s/^\s+|\s+$//g; # trim whitespace
}
The exact file format is:
type : source-filename : import-filename
What I want is an index file that contains the last pushed $log_object_filename for each unique key $log_import_filename, so, what I'm going to do in English/Perl pseudo-code is push the $log_object_filename onto an array indexed by the hash %unique_import_hash. Then, I want to iterate over the keys and pop the array referred by %unique_import_hash and store it in an array of scalars.
My specific question is: what is the syntax for appending to an array that is the value of a hash?
You can use push, but you have to dereference the array referenced by the hash value:
push @{ $hash{$key} }, $filename;
See perlref for details.
If you only care about the last value for each key, you're over-thinking the problem. No need to fool around with arrays when a simple assignment will overwrite the previous value:
while (my $line = <$log_fh>) {
# ...
$unique_import_hash{$log_import_filename} = $log_object_filename;
}
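Both variants can be compared in a short sketch. The sub name index_imports and the sample log lines are invented for illustration, and the field order (type, import filename, object filename) follows the variable names in the question. Pushing onto @{ $hash{$key} } creates the array on first use, and index [-1] then gives the last value pushed.

```perl
use strict;
use warnings;

# Hypothetical helper: group object filenames by import filename
sub index_imports {
    my @lines = @_;
    my %objects_for;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my @fields = split /:/, $copy;
        s/^\s+|\s+$//g for @fields;           # trim whitespace in place
        my ($type, $import, $object) = @fields;
        next unless defined $object;          # skip malformed lines
        push @{ $objects_for{$import} }, $object;
    }
    return \%objects_for;
}

# Sample data (assumed format: type : import-filename : object-filename)
my $index = index_imports(
    "obj : a.imp : one.o\n",
    "obj : b.imp : two.o\n",
    "obj : a.imp : three.o\n",
);
for my $import (sort keys %$index) {
    print "$import => last pushed: $index->{$import}[-1]\n";
}
```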
use strict;
use warnings;
my %unique_import_hash;
my $log_filename = "file.log";
open(my $log_fh, "<", $log_filename) or die "Cannot open $log_filename: $!";
while (my $line = <$log_fh>) {
$line =~ s/ *: */:/g;
(my $log_type, my $log_import_filename, my $log_object_filename) = split /:/, $line;
push (@{$unique_import_hash{$log_import_filename}}, $log_object_filename);
}
Seek the wisdom of the Perl monks.

Split string from file in Perl

So, I have a file to read like this:
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
What I want is if $x matches with Some.Text for example, how can I get a variable with "Some big text with spaces and numbers and something" or if it matches with "Some.Text2" to get "Again some big test, etc".
open FILE, "<cats.txt" or die $!;
while (<FILE>) {
chomp;
my @values = split('~~~', $_);
foreach my $val (@values) {
print "$val\n" if ($val eq $x)
}
exit 0;
}
close FILE;
And from now on I don't know what to do. I just managed to print "Some.text" if it matches with my variable.
splice can be used to remove elements from @values in pairs:
while(my ($matcher, $printer) = splice(@values, 0, 2)) {
print $printer if $matcher eq $x;
}
Alternatively, if you need to leave @values intact you can use a C-style loop:
for (my $i=0; $i<@values; $i+=2) {
print $values[$i+1] if $values[$i] eq $x;
}
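A sketch of the C-style loop wrapped in a sub, assuming the line alternates key, value, key, value. The name value_after is hypothetical. The loop condition $i + 1 < @values guards against a trailing key with no value, such as "And so on" in the sample line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that follows $key, or undef if absent
sub value_after {
    my ($line, $key) = @_;
    my @values = split /~~~/, $line;
    for (my $i = 0; $i + 1 < @values; $i += 2) {
        return $values[$i + 1] if $values[$i] eq $key;
    }
    return;   # no match
}

my $line = 'Some.Text~~~Some big text with spaces and numbers and something'
         . '~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on';
my $found = value_after($line, 'Some.Text2');
print defined $found ? "$found\n" : "no match\n";
```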
Your best option is perhaps not to split, but to use a regex, like this:
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
while (/Some.Text2?~~~(.+?)~~~/g) {
say $1;
}
}
__DATA__
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
Output:
Some big text with spaces and numbers and something
Again some big test, etc

In a file/array, search for hash key, and replace it with the hash value, do this for all hash keys/values

I've searched around the site and, surprisingly, I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers would address this problem.
I have a spreadsheet-like text file (many lines with tab-delimited columns) that I would like to search through for certain labels (e.g. scaffold1253.1_size81005.6.32799_7496) and replace them with simpler labels (e.g. scaffold1253.1a). These labels are only in the first column of the text file. I've already written the script such that I have a hash with the old labels as keys corresponding to the new labels as their respective values. This hash has about 26000 entries. So essentially I'd like to take the hash keys one by one, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server available, so if it's too complicated to make it first-column specific to speed up the process, then that's OK.
This is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
@gtfarray = <FASTAFILE2>;
#print @gtfarray;
my %hash;
while (<>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
foreach (@gtfarray){
$_ =~ s/$find/$replace/g;
push @newgtf, $_;
}
}
print @newgtf;
This code doesn't seem to work as it doesn't complete. I'm pretty sure it's a problem with the foreach loop structure. Sorry I don't know of any other way to do this. Does anyone have a better way to run through this file and conduct the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
@DVK
Here is the full script with your mods that runs into syntax errors with your while loop, any idea why it's not accepting it? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.
You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.
There are two potential solutions:
Open the file for reading during each iteration of your while loop (expensive)
Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)
example:
REPLACE:
for my $line (@gtfarray) {
while(my ($find, $replace) = each %hash) {
if($line =~ s/$find/$replace/g) {
push @newgtf, $line;
next REPLACE; # skip to next iteration
}
}
# if there was no replacement, push the old line
push @newgtf, $line
}
How big is the file that you are replacing the first column in?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through hash file once, and store that hash in memory
Iterate through main file once, and for every line, for every column, find that value in the memorized hash, replace with hash value if found, and write.
In other words, remove the first @gtfarray = <FASTAFILE2>; and replace your last while loop with:
while (my $line = <FASTAFILE2>) {
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n";
}
NOTE: I'm making an assumption that the fields contain FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution aside from running O(N*M) regexes: if your scaffold strings are all of a certain well defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of data file, run a single regex finding that pattern, with the entire pattern inside a capture group parenthesis:
@matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g );
Then, look up every value of the @matches array in the hash. If found, run ONLY the matches as a s/// regex.
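One way to sketch the find-then-look-up idea is a single s///ge pass, where the replacement side is Perl code that consults the hash. This is a variation on the two-step description above, not the answerer's exact code. The sub name relabel and the sample mapping are hypothetical, and the pattern assumes every label has the scaffoldN.N_sizeN.N.N_N shape.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every scaffold label that has an entry
# in the hash referenced by $new_label_for; leave unknown labels alone
sub relabel {
    my ($line, $new_label_for) = @_;
    $line =~ s{(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)}
              {exists $new_label_for->{$1} ? $new_label_for->{$1} : $1}ge;
    return $line;
}

# Sample mapping (assumed; the real one would be read from the header file)
my %new_label_for = (
    'scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a',
);
print relabel("scaffold1253.1_size81005.6.32799_7496\tAUGUSTUS\tgene\n", \%new_label_for);
```

Each line is scanned once regardless of how many keys the hash holds, which avoids running one regex per key per line.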
Looking at your previous post, wouldn't it be more simple to create the shortened 'id' while reading the file. Then you would have no need of the other file where you get your hash?
Here is the (untested) code below. (would need to direct the print statements to an output file on the command line or open a file for writing in your script).
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
chomp;
my ($id, $val) = split /\t/, $_, 2;
# copy $id to $prefix and
# remove everything after '.1' in $prefix
(my $prefix = $id) =~ s/\.1\K.*//;
if ($seen{$id}) {
++$seen{$id};
}
else {
$seen{$id} = 'a';
}
print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";
Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie @lines, 'Tie::File', $file or die ;
for (@lines) {
s/Oldlabel/NewLable/g; # Change this to fit
}
untie @lines ;
Tie::File does a bunch of tricks to keep the "in place " changes to the file memory efficient.

Calculate Character Frequency in Message using Perl

I am writing a Perl Script to find out the frequency of occurrence of characters in a message. Here is the logic I am following:
Read one char at a time from the message using getc() and store it into an array.
Run a for loop starting from index 0 to the length of this array.
This loop will read each char of the array and assign it to a temp variable.
Run another for loop nested in the above, which will run from the index of the character being tested till the length of the array.
Using a string comparison between this character and the current array indexed char, a counter is incremented if they are equal.
After completion of inner For Loop, I am printing the frequency of the char for debug purposes.
Question: I don't want the program to recompute the frequency of a character if it's already been calculated. For instance, if character "a" occurs 3 times, for the first run, it calculates the correct frequency. However, at the next occurrence of "a", since loop runs from that index till the end, the frequency is (actual freq -1). Similary for the third occurrence, frequency is (actual freq -2).
To solve this. I used another temp array to which I would push the char whose frequency is already evaluated.
And then at the next run of for loop, before entering the inner for loop, I compare the current char with the array of evaluated chars and set a flag. Based on that flag, the inner for loop runs.
This is not working for me. Still the same results.
Here's the code I have written to accomplish the above:
#!/usr/bin/perl
use strict;
use warnings;
my $input=$ARGV[0];
my ($c,$ch,$flag,$s,$count,@arr,@temp);
open(INPUT,"<$input");
while(defined($c = getc(INPUT)))
{
push(@arr,$c);
}
close(INPUT);
my $length=$#arr+1;
for(my $i=0;$i<$length;$i++)
{
$count=0;
$flag=0;
$ch=$arr[$i];
foreach $s (@temp)
{
if($ch eq $s)
{
$flag = 1;
}
}
if($flag == 0)
{
for(my $k=$i;$k<$length;$k++)
{
if($ch eq $arr[$k])
{
$count = $count+1;
}
}
push(@temp,$ch);
print "The character \"".$ch."\" appears ".$count." number of times in the message"."\n";
}
}
You're making your life much harder than it needs to be. Use a hash:
my %freq;
while(defined($c = getc(INPUT)))
{
$freq{$c}++;
}
print $_, " ", $freq{$_}, "\n" for sort keys %freq;
$freq{$c}++ increments the value stored in $freq{$c}. (If it was unset or zero, it becomes one.)
The print line is equivalent to:
foreach my $key (sort keys %freq) {
print $key, " ", $freq{$key}, "\n";
}
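For a string that is already in memory, the same hash idiom can be sketched without getc. The sub name char_freq and the sample word are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash reference mapping each character
# in $text to the number of times it occurs
sub char_freq {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for split //, $text;   # split // yields one element per character
    return \%freq;
}

my $freq = char_freq("banana");
print "$_ $freq->{$_}\n" for sort keys %$freq;
```

Each character is visited once, so no character's count is ever recomputed, which is the problem the nested loops in the question ran into.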
If you want to do a single character count for the whole file then use any of the suggested methods posted by the others. If you want a count of all the occurrences of each character in a file then I propose:
#!/usr/bin/perl
use strict;
use warnings;
# read in the contents of the file
my $contents;
open(TMP, "<$ARGV[0]") or die ("Failed to open $ARGV[0]: $!");
{
local($/) = undef;
$contents = <TMP>;
}
close(TMP);
# split the contents around each character
my @bits = split(//, $contents);
# build the hash of each character with its respective count
my %counts = map {
# use lc($_) to make the search case-insensitive
my $foo = $_;
# filter out newlines
$_ ne "\n" ?
($foo => scalar grep {$_ eq $foo} @bits) :
() } @bits;
# reverse sort (highest first) the hash values and print
foreach(reverse sort {$counts{$a} <=> $counts{$b}} keys %counts) {
print "$_: $counts{$_}\n";
}
I don't understand the problem you are trying to solve, so I propose a simpler way to count the characters in a string:
$string = "fooooooobar";
$char = 'o';
$count = grep {$_ eq $char} split //, $string;
print $count, "\n";
This prints the number of $char occurrences in $string (7).
Hope this helps you write more compact code.
Faster solution:
@result = $subject =~ m/a/g; # $subject holds the contents of your file
print "Found : ", scalar @result, " a characters in file!\n";
Of course you can put a variable in the place of 'a' or even better execute this line for whatever characters you want to count the occurrences.
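A sketch of the variable-character version, assuming the whole text is already in a scalar. The sub name count_char is hypothetical. The empty-list assignment () = ... forces the match into list context and then yields the number of matches, so no temporary array is needed, and \Q...\E keeps characters such as "." from being treated as regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: count occurrences of the literal string $char in $subject
sub count_char {
    my ($subject, $char) = @_;
    my $count = () = $subject =~ /\Q$char\E/g;
    return $count;
}

print count_char("a.b.c", "."), "\n";
```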
As a one-liner:
perl -F"" -anE '$h{$_}++ for @F; END { say "$_ : $h{$_}" for keys %h }' foo.txt