I am trying to write a small program that takes from command line file(s) and prints out the number of occurrence of a word from all files and in which file it occurs. The first part, finding the number of occurrence of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
while (<>) {
$count{$word}{$ARGV}++ for map +lc, /\w+/g;
}
foreach my $word ( keys %count ) {
my #files = keys %$word; # All files containing lc $word
print "Total word count for '$word': ", sum( #{ $count{$word} }{#files} ), "\n";
for my $file ( #files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}
Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push #{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}: #{ $in_file{$str} }\n";
}
Related
Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (#ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my #a = ();
my $x = length $a;
#a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( #a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here is 2) sequences:
seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When using: perl molecular_weight.pl protein.fasta > output.txt
in terminal, it will generate the correct results, however it also presents an error of "Use of unitialized value in addition (+) at molecular_weight.pl line36", which is just localized in line of "$b += $data{$i};" how to fix this bug ? Thanks in advance !
You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)
I would like to sort files numerically using Perl script.
My files looks like below:
1:file1:filed2
3:filed1:field2
10:filed1:field2
4:field1:field2
7:field1:field2
I would like to display it as:
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
The way sort works in perl, is that it works through your list, setting each element to $a and $b - then testing those. By default, it uses cmp which is an alphanumeric sort.
You've also got <=> which is a numeric sort, and the kind you're looking for. (Alpha sorts 10 ahead of 2 ).
So - all we need do is extract the numeric value of your key. There's a number of ways you could do this - the obvious being to take a subroutine that temporarily copies the variables:
#!/usr/bin/env perl
use strict;
use warnings;
sub compare_first_num {
my ( $a1 ) = split ( /:/, $a );
my ( $b1 ) = split ( /:/, $b );
return $a1 <=> $b1;
}
print sort compare_first_num <>;
This uses <> - the magic filehandle - to read STDIN or files specified on command line.
Or alternatively, in newer perls (5.16+):
print sort { $a =~ s/:.*//r <=> $b =~ s/:.*//r } <>;
We use the 'substitute-and-return' operation to compare just the substrings we're interested in. (Numerically).
Split on : and store in a hash of arrays. Then you can sort and print out the hash keys:
my %data;
while(<DATA>){
my #field = split(/:/);
$data{$field[0]} = [#field[1..2]];
}
print join (':', $_, #{$data{$_}}) for sort { $a <=> $b } keys %data;
print "\n";
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
For simple and fast solution, use Sort::Key::Natural (fast natural sorting) module:
use warnings;
use strict;
use Sort::Key::Natural qw( natsort );
open my $fh, "<", "file.txt" or die $!;
my #files = natsort <$fh>;
close $fh;
print #files;
Output:
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
I need to sort hash key using perl also i need to allow duplicate in key. So that i planned to check exists method in perl if it is exists then i increment a last digit then i will store into hash.
I tried the following code:
use strict;
use warnings;
use iPerl::Basic qw(_save_file _open_file);
my $xml = $ARGV[0];
my ($xmlcnt,$backcnt,$refcnt,$name,$year) = "";
my %sort = ();
if(($#ARGV != 0) or(not -f "$xml") or($xml!~ m{\.xml$}i)){
print_exit("\t\tSYSTAX ERROR: <EXE> <xml File>\n\n")
};
$xmlcnt=_open_file($xml);
$xmlcnt =~ s{<back(?: [^>]+)?>(?:(?!</?back[ >]).)*</back>}{
$backcnt = $&;
while($backcnt =~ m{<ref(?: [^>]+)?>(?:(?!<ref[ >]).)*</ref>}igs){
$refcnt = $&;
$name = $1 if($refcnt =~ m{<person-group(?: [^>]+)?>((?:(?!</?person-group[ >]).)*)</person-group>}is);
$year = $1 if($refcnt =~ m{<year>((?:(?!</?year[ >]).)*)</year>}is);
$name =~ s{</?(?:string-name|surname|given-names)>}{}ig;
my $count = 1;
my $keys="$name $year\E$count";
if(exists ($sort{$keys})){
$keys =~ s{(\d)$}{my $icr=$1;$icr++;qq($icr)}e;
#print"$keys\n";
$sort{$keys}="$refcnt";
}
else
{
$sort{$keys}="$refcnt";
}
print join("\n",keys %sort);
}
qq($backcnt)
}igse;
my #keys = sort {
$sort{$a} <=> $sort{$b}
# or
# "\L$a" cmp "\L$b"
} keys %sort;
# print join("\n",#keys);
sub print_exit {
my $msg = shift;
#print "\n$msg";
exit;
}
Please can anyone tell me what went wrong here?
input:
thieooieroh
apple
apple
highefhfe
bufghifeh
output:
apple
apple
bufghifeh
highefhfe
thieooieroh
Thanks in advance.
From a very brief look at your code, it appears that you want to store refcounts as the values in your hash, with the ability to have multiple counts for a single key. This is easily doable by using a hash of arrays (commonly abbreviated to HoA). Each key must, by definition, be unique, but the associated value can be a reference, allowing you to store multiple items under that key, or to build even more complex data structures.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
my %hash;
while (my $line = <DATA>) {
chomp $line;
my ($key, $count) = split ',', $line;
push #{$hash{$key}}, $count;
}
for my $key (sort keys %hash) {
my $values = $hash{$key};
for (#$values) {
say "$key ($_)";
}
}
__DATA__
thieooieroh,1
apple,2
apple,3
highefhfe,4
bufghifeh,5
Output:
apple (2)
apple (3)
bufghifeh (5)
highefhfe (4)
thieooieroh (1)
If you're not actually concerned with storing multiple data items with each key, but only with the number of times each key appears, it's even simpler. Change the two loops in the above code to:
while (my $line = <DATA>) {
chomp $line;
$hash{$line}++;
}
for my $key (sort keys %hash) {
say $key for 1 .. $hash{$key};
}
and you get the output
apple
apple
bufghifeh
highefhfe
thieooieroh
As for the rest of your posted code, don't try to parse XML with regexes. Arbitrary XML cannot be parsed beyond a very crude first approximation by regular expressions because XML is not structurally "regular". There are many fine XML-parsing modules on CPAN which will parse your XML correctly for you, while also requiring far less effort from you than trying to write your own parser. Use one of them. Not regexes.
So, i have a file to read like this
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
What I want is if $x matches with Some.Text for example, how can I get a variable with "Some big text with spaces and numbers and something" or if it matches with "Some.Text2" to get "Again some big test, etc".
open FILE, "<cats.txt" or die $!;
while (<FILE>) {
chomp;
my #values = split('~~~', $_);
foreach my $val (#values) {
print "$val\n" if ($val eq $x)
}
exit 0;
}
close FILE;
And from now on I don't know what to do. I just managed to print "Some.text" if it matches with my variable.
splice can be used to remove elements from #values in pairs:
while(my ($matcher, $printer) = splice(#values, 0, 2)) {
print $printer if $matcher eq $x;
}
Alternatively, if you need to leave #values intact you can use a c style loop:
for (my $i=0; $i<#values; $i+=2) {
print $values[$i+1] if $values[$i] eq $x;
}
Your best option is perhaps not to split, but to use a regex, like this:
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
while (/Some.Text2?~~~(.+?)~~~/g) {
say $1;
}
}
__DATA__
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
Output:
Some big text with spaces and numbers and something
Again some big test, etc
I'd like to get the last entry of a duplicate line from a file.
The basis for duplicate checking would be the first element from a csv.
The duplicates may or may not be adjacent.
Input file:
971~11
972~12
973~11
974~11
972~11
Expected output:
971~11
973~11
974~11
972~11
I'm not looking for a perl one-liner as I intend to write this as
a subroutine.
Thanks!
PS:
I have modified this code from somewhere, but this just removes the duplicates
#!/usr/bin/perl -w
while (<STDIN>) { push (#lines, $_); }
print "-\n";
foreach my $i (#lines)
{
#newline = split(/\||~/, $i);
if (scalar(grep{ /$newline[0]/ } #lines) == 1)
{
print $i;
}
}
If the output order doesn't matter, the easiest way to do this is to use a hash to do the duplicate removal. Something like the following:
#!/usr/bin/perl -w
use strict;
sub printlast(#) {
my %dedup;
foreach my $line (#_) {
my $a = (split(/\||~/, $line))[0];
$dedup{$a} = $line;
}
print $dedup{$_} for keys %dedup; # or sort keys %dedup for prettier output
}
my #lines;
while (<STDIN>) { push (#lines, $_); }
print "-\n";
printlast(#lines);
When looking to dedup, it's almost always best to use a hash.
Here's something similar to the accepted answer (since #Mat beat me to it)
#!/usr/bin/env perl -lw
use Data::Dumper; $Data::Dumper::Indent = 1;
my %seen;
while (<DATA>) {
chomp;
my #fields = split('~');
$seen{$fields[0]} = $fields[1];
}
my #output;
while (my ($k,$v) = each %seen) {
push #output, join('~', $k, $v);
}
print Dumper \#output;
__DATA__
971~11
972~12
973~11
974~11
972~11