How to print the frequency of words in Perl?

open INP,"<C:\\Users\\hanadi\\Documents\\cs.txt";
while ($line=<INP>)
{
chomp($line);
@list=split/\s+/,$line;
foreach $w (@list)
{
$wordfreq{$w}++;
}
}
foreach $w2(keys %wordfreq)
{
print "$w2==>$wordfreq{$w}";
}
I want to print each word and its frequency. Now I want Perl code to jump ahead and print the same information for the next
ranks (>100), but not for every line: print
only one line for every 1000 words (otherwise there will
be too many lines to print), in decreasing order of frequency and
decreasing alphabetical order among words with the same frequency.

The first issue in this problem is to define the term "word." I'm assuming, based on one of your comments, that punctuation is not part of a "word," since you were asking how to "...delete punctuations from the text..."
One solution to this is to use a regex to capture only "word" characters, i.e., alphanumeric and underscore, by matching the text against \w in the regex.
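For example, matching repeatedly with /(\w+)/g in list context pulls out every such run of word characters (a quick illustration):
my $text = "Don't forget the under_score!";
my @words = $text =~ /(\w+)/g;
# @words is now ('Don', 't', 'forget', 'the', 'under_score')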
Building a hash, where the keys are the words and the associated values are the counts, is the way to go. However, when doing this, you need to ensure that the keys are all the same case, i.e., either all UPPER or all lower.
After you've built this hash, you can sort the output in descending order by the hash values (frequency) and use a counter to print just the top 100 words. There will be words with the same frequency count, especially words having only one occurrence. How do you want these printed? It can make a difference to whether they appear in the list of the top 100 at all. I'd suggest ordering such cases alphabetically.
Given the above, consider the following solution, which uses the text above the code below as the corpus:
use strict;
use warnings;
my %hash;
open my $fh, '<', 'words.txt' or die $!;
while (<$fh>) {
$hash{ lc $1 }++ while /(\w+)/g;
}
close $fh;
my $i = 1;
for my $word ( sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash ) {
print "$i. $word: ($hash{$word})\n" if $i++ < 100 or !( $i % 1000 );
}
Partial output:
1. the: (22)
2. to: (8)
3. a: (5)
4. you: (5)
5. is: (4)
6. of: (4)
7. this: (4)
8. word: (4)
9. all: (3)
10. and: (3)
...
96. punctuation: (1)
97. punctuations: (1)
98. since: (1)
99. sort: (1)
100. suggest: (1)
Limitations:
One issue that results from capturing word characters can be seen in cases of some possessive forms of words, e.g., word's. In this case, both word and s would be captured as words. If you want to retain such punctuation, and split on whitespace, you can just use the following instead of the regex line:
$hash{ lc $_ }++ for split ' ';
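To see the difference between the two approaches on such a case (a quick illustration):
my $text = "the word's meaning";
my @by_regex = $text =~ /(\w+)/g;   # ('the', 'word', 's', 'meaning')
my @by_split = split ' ', $text;    # ('the', "word's", 'meaning')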
Hope this helps!

Please pay attention to all of amon's reply, and always RTQ (read the question). (Good spot, amon.)
(I have determined) your problem is that once you have constructed your word-count hash, you need to invert the hash so you can sort the values into some kind of order. The problem with this is that more than one word may have the same count, and a later word would overwrite an earlier stored word.
To do this you need to store an array in a hash value, and this is done by using a reference to an array. Hash values may only be scalars, an array is not a scalar, but a reference to an array is.
In this re-write of your problem, I have updated the open and close calls to use scalar filehandles, with appropriate error handling (or die), and converted your foreach statements into 'maps'. These can take a bit of time to grasp, so do not copy and paste them. Rather, focus on the inverting of the hash and how the array is accessed. This is possibly quite complex for you, so I have left those parts in foreach style.
The 'each' keyword takes a key/value pair from the hash, and is often used this way to process hashes in while statements.
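For example, a typical each loop over the word counts might look like this (just an illustration, separate from the solution below):
while (my ($word, $count) = each %wordfreq) {
    print "$word occurs $count times\n";
}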
You will still need to work on converting the counts into frequencies as per amon's suggestion and retrieving the top 100. There is a clue to the counting in the '$c' variable.
#!/usr/bin/perl
# word count #wct.pl
use warnings;
use strict;
my (%wordfreq);
open my $input, '<', 'wc.txt'
or die "cannot open wc txt file $!";
map { $wordfreq{ $_ }++; } (split /\s+/, $_) while <$input> ;
close $input
or die "cannot close wc txt file $!";
# print (
# map {"$_ has ". $wordfreq{$_} . "\n" } (keys %wordfreq)
# );
foreach (keys %wordfreq){
# print "$_ has ". $wordfreq{$_} . "\n"
}
my %invertedhash;
while (my ($key,$value) = each %wordfreq){
push @{$invertedhash{$value}}, $key;
}
my $c;
foreach (sort { $b <=> $a } keys %invertedhash){
last if $c++ == 2;
print "words with a count of $_ are #{$invertedhash{$_}} \n";
}
exit 0;
sample
one two two
three three three four
four four four five
five five five
produces
words with a count of 4 are four five
words with a count of 3 are three
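To give you a start on that next step, something like this could turn the counts into frequencies and report only the top 100 (an untested sketch that reuses the %wordfreq hash built above):
use List::Util qw(sum);

my $total = sum values %wordfreq;    # total number of words seen

my $rank = 0;
for my $word (sort { $wordfreq{$b} <=> $wordfreq{$a} || $a cmp $b } keys %wordfreq) {
    last if ++$rank > 100;
    printf "%3d. %s: %.4f\n", $rank, $word, $wordfreq{$word} / $total;
}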
Hope this helps.

Related

How to use perl to parse a text file

I am new to Perl.
I have a text file that contains 2 columns:
lib1 cell1
lib1 cell2
lib2 cell3
lib2 cell1
I would like to use Perl to find duplicated names in column 2 and then print the corresponding names from column 1.
In this text, cell1 is repeated 2 times.
I would like to have a report something like:
cell1 found in lib1 lib2
I use the code below to open and read the file:
#!/usr/bin/env perl
use strict;
use warnings;
for my $file ( @ARGV ){
open my $in_fh, '<', $file or die "could not open $file: $!\n";
while( my $line = <$in_fh> ){
chomp( $line );
print "$line\n"
}
}
But I don't know how to find the duplicated names in the second column and print the first column.
There are a few things here that Perl can do for you.
First, Perl will handle opening and reading the files you specify on the command line through the empty diamond operator (and here I'm using the safer double-diamond version, introduced in v5.22):
use v5.22;
while( <<>> ) {
...
}
Then, you can track what you've seen with a hash. Extract the columns, and use the interesting column as the key in the hash. Here I post-increment its value. On the first go around, the post-increment returns 0 (then increases the value by 1), so the conditional is false the first time. The next time it sees that same key, the value is true, so it warns:
use v5.22;
my %Seen;
while( <<>> ) {
chomp;
my( $first, $second ) = split;
if( $Seen{$second}++ ) {
warn "Duplicated second column! Line $.\n";
}
}
The hash is a great way to track things that are strings instead of positions.
Now, you want to know which values in the first column appear with each value in the second. You could get a bit more fancy with that hash and make another level in the hash to store the first column. Perl automatically takes care of the details for you (and we have extended examples of this in Intermediate Perl).
First, accumulate the data in the hash:
use v5.22;
my %Seen;
while( <<>> ) {
chomp;
my( $first, $second ) = split;
$Seen{$second}{$first}++;
}
Once you have the hash, you move on to the second step of reporting the data. All the values of the second column are the top level keys for the hash. With that key, get the second level of the hash, and get those keys, which are the first column:
foreach my $second ( keys %Seen ) {
my @firsts = keys %{ $Seen{$second} };
say "$second found in @firsts";
}
With v5.24's postfix dereferencing, that's slightly cleaner since the dereference reads left to right rather than inside out:
use v5.24;
foreach my $second ( keys %Seen ) {
my @firsts = keys $Seen{$second}->%*;
say "$second found in @firsts";
}
And, since the hash keys in the second level only appear once per value, you don't have duplicates.
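Putting those pieces together, a complete program might look like this (a sketch, under the assumption that you only want to report names that occur in more than one lib):
#!/usr/bin/env perl
use v5.24;
use warnings;

my %Seen;
while( <<>> ) {
    chomp;
    my( $first, $second ) = split;
    $Seen{$second}{$first}++;
}

foreach my $second ( sort keys %Seen ) {
    my @firsts = sort keys $Seen{$second}->%*;
    # report only the names that appear under more than one lib
    say "$second found in @firsts" if @firsts > 1;
}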

Sort mixed text lines (alphanum) in Perl

I have a txt file where every line has a structure like this:
P[containerVrsn:U(0)recordVrsn:U(0)size:U(212)ownGid:G[mdp:U(1090171666)**seqNo:U(81920)**]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1064620)size:U(184)updateVersion:U(3)checksum:U(748981000)
I have to sort the file's lines based on seqNo (min to max). The sequence number can be virtually any number starting from zero. Any idea how this can be done in an efficient way?
The Schwartzian Transform suggested in Toto's answer is probably the fastest way to sort your lines here. But you said you're a Perl newbie, and I'd like to show how the lines can be sorted traditionally.
Perl has a sort function that sorts a list simply by alphabet. But you can supply a custom comparison function and let sort use your function to compare the elements. During its operation, sort must repeatedly compare two elements (= lines) of your list and decide which one is greater or lesser, or whether they are equal.
If you supply a comparison function, sort will call it with two such elements as the parameters $a and $b. You do not need to (and must not) declare $a and $b; they are magic and just there. Your comparison function could look like this:
sub by_seqNo
{
# extract the sequence number from $a and $b
my ($seqA) = ($a =~ /seqNo:U\((\d+)/);
my ($seqB) = ($b =~ /seqNo:U\((\d+)/);
# numerically compare the sequence numbers (returns -1/0/+1)
$seqA <=> $seqB;
}
The first two lines extract the numbers after seqNo:U( and store them as $seqA and $seqB. The third line compares these sequence numbers as integers and returns that result. Combined with the sort function this gives:
my @sorted = sort by_seqNo @lines;
The reason the Schwartzian Transform (ST) is faster than this solution is that the ST performs the (expensive) operation of extracting the seqNo from your lines exactly once per line. The "traditional" approach, on the other hand, extracts the seqNo twice for every comparison.
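If you want the single-extraction benefit without the full transform, you can also cache the extracted numbers in a hash keyed by line and sort with lookups (a sketch; it assumes every line actually contains a seqNo:U(...) field):
my %seq_of;
for my $line (@lines) {
    # extract the sequence number once per line and cache it
    ($seq_of{$line}) = $line =~ /seqNo:U\((\d+)/;
}
my @sorted = sort { $seq_of{$a} <=> $seq_of{$b} } @lines;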
You could use a Schwartzian Transform.
Here is a small script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
open my $fhi, '<', 'path/to/input/file' or die "Unable to open input file: $!";
my @lines = <$fhi>;
my @sorted = map  { $_->[0] }
             sort { $a->[1] <=> $b->[1] }
             map  { my ($x) = $_ =~ /SeqNo:U\((\d+)/i; [$_, $x]; }
             @lines;
open my $fho, '>', 'path/to/output/file' or die "Unable to open output file: $!";
print $fho $_ for @sorted;

Selectively joining elements of an array into fewer elements of a new array

I'm having some trouble manipulating an array of DNA sequence data that is in .fasta format. Specifically, I would like to take a file that has a few thousand sequences and join the sequence data for each sequence in the file onto a single line. [Fasta format is as such: a sequence ID starts with >, after which everything on that line is a description. On the next line(s), the sequence corresponding to this ID is present, and this can continue indefinitely until the next line that begins with >, which is the ID of the next sequence in the file.] In my particular file most of the sequences span multiple lines, so what I would like to do is essentially remove the newlines, but only the newlines between sequence data, not those between sequence data and sequence ID lines (which start with >).
I'm doing this because I want to obtain the length of each sequence (length is, I believe, the easiest way), and then compute the average sequence length over all the sequences in the whole file.
Here's my script so far, which doesn't seem to want to work:
#!/usr/bin/perl -w
##Subroutine
sub get_file_data1 {
my($filename) = $_[0];
my @filedata = ();
unless( open(GET_FILE_DATA, $filename)) {
print STDERR "Cannot open file \"$filename\"\n\n";
exit;
}
@filedata = <GET_FILE_DATA>;
close GET_FILE_DATA;
return @filedata;
}
##Opening files
my $fsafile = $ARGV[0];
my #filedata = &get_file_data1($fsafile);
##Procedure
my @count;
my @ids;
my $seq;
foreach $seq (@filedata){
if ($seq =~ /^>/) {push @ids, $seq;
push @count, "\n";
}
else {push @count, $seq;
}
}
foreach my $line (@count) {
if ($line =~ /^[AGTCagtc]/){
$line =~ s/^([AGTCagtc]*)\n/$1/;
}
}
##Make a text file to have a look
open FILE3, "> unbrokenseq.txt" or die "Cannot open output.txt: $!";
foreach (@count)
{
print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
__END__
##Creating array of lengths
my $number;
my #numberarray;
foreach $number (@count) {
push @numberarray, length($number);
}
print @numberarray;
__END__
use List::Util qw(sum);
sub mean {
return sum(@numberarray)/@numberarray;
}
There's something wrong with the second foreach line of the Procedure section, and I can't seem to figure out what it is. Note that I haven't even tried the code after the __END__ lines yet, because I can't seem to get the code in the Procedure step to do what I want. Any idea how I can get a nice array whose elements are unbroken sequences (I've chosen to just remove the sequence ID lines from the new array)? I can then get an array of lengths, which I can then average.
Finally, I should unfortunately admit that I cannot get Bio::Perl working on my computer; I have tried for hours, but the errors are beyond my skill to fix. I'll be talking to someone who can hopefully help me with my Bio::Perl issues, but for now I'm just going to have to press on without it.
Thanks! Sorry for the length of this post, I appreciate the help.
Andrew
The problem with your second loop is not quite where you think: in a foreach loop, $line is an alias to the elements of @count, so the substitution does change the array. But you later print each element followed by "\n", which puts the newlines straight back.
But if all you want to do in the second loop is remove the newline character at the end, use the chomp function; with it you wouldn't need your second loop at all. (It would also be faster than using the regex.)
# remove newlines for all array elements before doing anything else with it
chomp @filedata;
# .. or you can do it in your first loop
foreach $seq (@filedata){
chomp $seq;
if ($seq =~ /^>/) {
...
}
}
An additional tip: Using get_file_data1 to read the entire file into an array might be slow if your files are large. In that case it would be better to iterate through the file as you go:
open my $FILE_DATA, $filename or die "Cannot open file \"$filename\"\n";
while (my $line = <$FILE_DATA>) {
chomp $line;
# process the record as in your Procedure section
...
}
close $FILE_DATA;
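To finish the job from there, here is one way the whole thing could be put together, accumulating the length of each sequence and averaging at the end (an untested sketch that takes the filename from the command line):
#!/usr/bin/perl
use warnings;
use List::Util qw(sum);

my $filename = shift @ARGV;
open my $FILE_DATA, $filename or die "Cannot open file \"$filename\"\n";

my (@lengths, $current);
while (my $line = <$FILE_DATA>) {
    chomp $line;
    if ($line =~ /^>/) {
        # a new ID line: bank the length of the sequence just finished
        push @lengths, length $current if defined $current;
        $current = '';
    }
    else {
        $current .= $line;    # splice wrapped sequence lines back together
    }
}
push @lengths, length $current if defined $current;    # the last record
close $FILE_DATA;

my $average = @lengths ? sum(@lengths) / @lengths : 0;
printf "Average of %d sequences: %.2f\n", scalar @lengths, $average;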
Your regex captures specifically to $1, but you are printing $_ to the file, so the result is most likely not what you intended.
Be careful with the '*' (zero or more) quantifier on the character groups in your s///. You usually want '+' (one or more) instead; '*' will also match lines containing none of your characters.
A search expression with a 'g' modifier can also count characters, like this:
$perl -e '$a="aggaacaat"; $b = $a =~ s/[a]//g; print $b; '
5
Pretty cool, huh? Alternatively, in your code, you could just call length() against $1.
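For instance, adapting the loop from the question (a sketch, untested):
my @numberarray;
foreach my $line (@count) {
    # '+' rather than '*', and take the length of the capture directly
    push @numberarray, length($1) if $line =~ /^([AGTCagtc]+)/;
}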
I was taken aback to see the escaped '\n' in your regex. While it works fine, the common 'end-of-line' search term is '$'. This is more portable and doesn't mess up your character counts.

How to compare two text files by column and output the number of times the columns match

I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (each of which occupies a single line) are present in each.
Here is an example of input file format:
HWI-AT555:86:D0:6:2208:13551:55125 122 chr1 77028 255 94M555N7M * 0 0 GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&& NM:i:3 XS:A:+ NH:i:1
The important part is the sequence read ID, which is the first column (i.e., HWI-....55125). This is what I want to use to compare the two files so that I can count the number of duplicates/copies.
Here is what I have so far:
unless (@ARGV == 2) {
print "Use as follows: perl program.pl in1.file in2.file\n";
die;
}
my $in1 = $ARGV[0];
my $in2 = $ARGV[1];
open ONE, $in1;
open TWO, $in2;
my %hash1;
my @hit;
while (<ONE>){
chomp;
my @hit = split(/\t/, $_);
$hash1{$hit[0]}=1;
}
close ONE;
my @col;
while (<TWO>){
chomp;
my @col = split(/\t/, $_);
if ($col[0] =~ /^H/){ #only valid sequence read lines start with "H"
print "$col[0]\n" if defined($hash1{$_});
}
}
close TWO;
So far it looks for a match in hash1 while going through the second file line by line, and prints out any matches. What I would like it to do is count how many times it finds a match, and then print out that count for each sequence ID along with a total number of matches.
I am new to programming, and I am quite stuck on how to keep a count of matches while going through a loop. Any help would be appreciated. Let me know if I didn't make something clear enough.
Initialize your %hash1 with zeros instead of ones:
while (<ONE>){
chomp;
my @hit = split(/\t/, $_);
# Start them as "0" for "no duplicates".
$hash1{$hit[0]} = 0;
}
Then, in your second loop, you can increment $hash1{$col[0]}:
while (<TWO>){
chomp;
my @col = split(/\t/, $_);
# Increment the counter if %hash1 has what we're looking for.
++$hash1{$col[0]} if(exists($hash1{$col[0]}));
}
There's no need to check $col[0] =~ /^H/, since %hash1 will only have entries for valid sequences, so you can just do an exists check on the hash. And you want to look at $hash1{$col[0]} rather than $hash1{$_}: since you're only storing the first field of each line in your first loop, $_ will have the whole line and will never match. Furthermore, if you're just grabbing the first field of each line, you don't need the chomp calls, but they do no harm, so you can keep them if you want.
This leaves you with the all the repeated entries in %hash1 as entries with non-zero values and you can grep those out:
my @dups = grep { $hash1{$_} > 0 } keys %hash1;
And then display them with their counts:
for my $k (sort @dups) {
print "$k\t$hash1{$k}\n";
}
You could also check the counts while displaying the matches:
for my $k (sort keys %hash1) {
print "$k\t$hash1{$k}\n" if($hash1{$k} > 0);
}
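And for the grand total of matches the question asked about, the counts can simply be summed (a small sketch using List::Util, which ships with Perl):
use List::Util qw(sum0);

# sum0 is like sum but returns 0 rather than undef for an empty list
my $total = sum0 values %hash1;
print "Total matches: $total\n";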

Why is first value of captured expression getting stored in fourth element in Perl?

I am storing information captured by a regex into an array, but for some reason the first value is getting stored at the 4th element of the array. Any suggestion on what's going wrong and how to store the first value in the first element of the array?
The following is the script:
#!/usr/bin/perl
use strict;
my @value;
my $find= qr/^\s+([0-9]+)\s([A-Z])/;
open (FILE, "</usr/test")|| die "cant open file";
my @body=<FILE>;
foreach my $line (@body){
chomp $line;
push @value, join('', $line =~ /$find/);
}
print "$value[0]\n"; #does not print anything
print "$value[4]\n"; #prints first value i.e 1389E
exit;
DATA
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes
Your second line has more than one space between the number group and the letter, so you probably want \s+ both times rather than \s the second time.
You won't necessarily know how many items you have in the @value array at the end, so you might want to put the printing into a for loop rather than assume you have a fifth item. (Maybe you know you want the first and fifth, however?) Follow-up: based on your edit, you have more than two entries after all. The version that I give below, using split and \s+, captures the number and letter for all the lines. I'll adjust the print part of the script to show you what I mean.
A few other things:
You should always enable warnings.
There's no reason to read the whole file into an array and then process through it line by line. Skip the @body array and just do what you need to in the while loop.
Use the more modern form of open with lexical filehandles and three arguments.
split seems more straightforward here to me, rather than a regular expression with captures. Since you want to capture two specific parts of the line, you can use split with an array slice to grab those two items and feed them to join.
@value is not an especially helpful variable name, but I think you should at least make it plural. It's a good habit to get into, I think, insofar as the array stores your plural records. (That's not a hard and fast rule, but it bugged me here. This point is pretty minor.)
Here's how all this might look:
#!/usr/bin/env perl
use warnings;
use strict;
my @values;
# open my $filehandle, '<', '/usr/test'
# or die "Can't open /usr/test: $!";
while (my $line = <DATA>) {
chomp $line;
push @values, join('', (split /\s+/, $line)[1..2]);
}
for my $record (@values) {
print $record, "\n";
}
__DATA__
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes
I think you should be writing
print "$value[0]\n";
print "$value[4]\n";
to access elements of an array.
You should use lexical filehandles and the three-argument form of open, as well as avoid slurping files unnecessarily.
In any case, the most likely reason for your problem is a single character missing from your pattern. Compare the one below with the one you have above.
#!/usr/bin/perl
use strict;
use warnings;
my @value;
my $find= qr/^\s+([0-9]+)\s+([A-Z])/;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
push @value, join '', $line =~ $find;
}
use Data::Dumper;
print Dumper \@value;
__DATA__
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes
Do you have leading whitespace lines, or other leading lines in your data that don't match your regexp? Since you're unconditionally push()-ing onto your output array, regardless of whether your regexp matches, you'll get blank array elements for every non-matching line in your input.
Observe:
#!/usr/bin/perl
use strict;
use warnings;
my @lines;
while (<DATA>) {
push @lines , ( join( '' , /^\s*(\d+)/ ));
}
foreach ( 0 .. $#lines ) {
print "$_ -> $lines[$_]\n";
}
__DATA__
FOO
Bar
Baz
1234
456
bargle
Output:
0 ->
1 ->
2 ->
3 -> 1234
4 -> 456
5 ->
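If you want elements only for lines that actually match, make the push conditional on the match (a minimal change to the loop above):
while (<DATA>) {
    # non-matching lines no longer leave an empty placeholder behind
    push @lines, $1 if /^\s*(\d+)/;
}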