Perl Frequency sorting and other things - perl

Edit:
So I got the script working great with all your help, so thanks a lot.
I'm also trying to figure out how I can keep the standard input choice but still be able to use a command-line "start" aswell,
I want both to be able to start it by for ex. "perl wfreq.pl" and it then asks the user what file, but I also want to be able to start it by saying "perl wfreq.pl example.txt" and then it shouldnt ask for the user input.
Is this possible?
#! /usr/bin/perl
use utf8;
use warnings;
print "Please enter the name of the file: \n" ;
$file = <STDIN>;
chop $file;
open(my $DATA, "<:utf8", $file) or die "Oops!!: $!";
binmode STDOUT, ":utf8";
while(<$DATA>) {
tr/A-Za-z//cs;
s/[;:()".,!?]/ /gio;
foreach $word (split(' ', lc $_)) {
$freq{$word}++;
}
}
foreach $word (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
#fr = (#fr, $freq{$word});
#ord = (#ord, $word);
}
for ($v =0; $v < 10; $v++){
print " $fr[$v] | $ord[$v]\n";
}

As MVS wrote, you need to use the "spaceship" <=> operator and sort not keys, but values.
Here's example that should work:
Change
foreach $word (sort keys %freq) {
print "$word $freq{$word}\n";
}
To
foreach $word (sort { $freq{$a} <=> $freq{$b} } keys %freq) {
print "$word $freq{$word}\n";
}
perldoc -f sort contains just the same example at lines 23-27 of code snippet.
Talking about encoding: utf8 pragma documentation says to use it for
enable/disable UTF-8 (or UTF-EBCDIC) in source code
To enable UTF-8 in file input you need to open file in specific mode (using specific layer) and apply 'utf8' layer to STDOUT:
open(my $DATA, "<:utf8", $file) or die "Oops!!: $!";
binmode STDOUT, ":utf8";
For more information about :utf8 and other IO Layers you can read
:utf8 layer description
perldoc -f open
perldoc -f binmode

sort keys will sort by the keys of the hash, which are the words themselves. Instead, you'll want to sort by the values instead. Have a look at the documentation to help you (in particular, the parts about the "spaceship operator" <=>).
To put the numbers before the words, just switch $word and $freq{$word} in the print statement.
As for returning only the top 10 results, try a counter inside your foreach loop (and the break statement).
For UTF-8 characters, your use utf8 should be good enough to enable them, but if you're encountering problems, this might help.

You can use bunch of neat things in Perl:
perl -CS -F\\PL+ -alne'$f{lc$_}++for#F}{print"$f{$_} $_"for sort{$f{$b}<=>$f{$a}}keys%f'
See perlrun(1) for more details.

Related

Perl print to seperate files

I have a text file which lists a service, device and a filter, here I list 3 examples only:
service1 device04 filter9
service2 device01 filter2
service2 device10 filter11
I have written a perl script that iterates through the file and should then print device=device filter=filter to a file named according to the service it belongs to, but if a string contains a duplicate filter, it should add the devices to the same file, seperated by semicolons. Looking at the above example, I then need a result of:
service1.txt
device=device04 filter=filter9
service2.txt
device=device01 filter=filter2 ; device=device10 filter=filter11
Here is my code:
use strict;
use warnings qw(all);
open INPUT, "<", "file.txt" or die $!;
my #Input = <INPUT>;
foreach my $item(#Input) {
my ($serv, $device, $filter) = split(/ /, $item);
chomp ($serv, $device, $filter);
push my #arr, "device==$device & filter==$filter";
open OUTPUT, ">>", "$serv.txt" or die $!;
print OUTPUT join(" ; ", #arr);
close OUTPUT;
}
The problem I am having is that both service1.txt and service2.txt are created, but my results are all wrong, see my current result:
service1.txt
device==device04 filter==filter9
service2.txt
device==device04 filter==filter9 ; device==device01 filter==filter2device==device04 filter==filter9 ; device==device01 filter==filter2 ; device==device10 filter==filter11
I apologise, I know this is something stupid, but it has been a really long night and my brain cannot function properly I believe.
For each service to have its own file where data for it accumulates you need to distinguish for each line what file to print it to.
Then open a new service-file when a service without one is encountered, feasible since there aren't so many as clarified in a comment. This can be organized by a hash service => filehandle.
use warnings;
use strict;
use feature 'say';
my $file = shift #ARGV || 'data.txt';
my %handle;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
my ($serv, $device, $filter) = split;
if (exists $handle{$serv}) {
print { $handle{$serv} } " ; device==$device & filter==$filter";
}
else {
open my $fh_out, '>', "$serv.txt" or do {
warn "Can't open $serv.txt: $!";
next;
};
print $fh_out "device==$device & filter==$filter";
$handle{$serv} = $fh_out;
}
}
say $_ '' for values %handle; # terminate the line in each file
close $_ for values %handle;
For clarity the code prints almost the same in both cases, what surely can be made cleaner. This was tested only with the provided sample data and produces the desired output.
Note that when a filehandle need be evaluated we need { }. See this post, for example.
Comments on the original code (addressed in the code above)
Use lexical filehandles (my $fh) instead of typeglobs (FH)
Don't read the whole file at once unless there is a specific reason for that
split has nice defaults, split ' ', $_, where ' ' splits on whitespace and discards leading and trailing space as well. (And then there is no need to chomp in this case.)
Another option is to first collect data for each service, just as OP attempts, but again use a hash (service => arrayref/string with data) and print at the end. But I don't see a reason to not print as you go, since you'd need the same logic to decide when ; need be added.
Your code looks pretty perl4-ish, but that's not a problem. As MrTux has pointed out, you are confusing collection and fanning out of your data. I have refactored this to use a hash as intermediate container with the service name as keys. Please note that this will not accumulate results across mutliple calls (as it uses ">" and not ">>").
use strict;
use warnings qw(all);
use File::Slurp qw/read_file/;
my #Input = read_file('file.txt', chomp => 1);
my %store = (); # Global container
# Capture
foreach my $item(#Input) {
my ($serv, $device, $filter) = split(/ /, $item);
push #{$store{$serv}}, "device==$device & filter==$filter";
}
# Write out for each service file
foreach my $k(keys %store) {
open(my $OUTPUT, ">", "$k.txt") or die $!;
print $OUTPUT join(" ; ", #{$store{$k}});
close( $OUTPUT );
}

How to check for a string using Perl given

I'm trying to replace a particular line in a file. I can get my program to run, but it doesn't actually do the replacing that I want it to.
Here is my sample file:
test line 1
test line 2
line to be overwritten
test line 3
Here is the code that I have:
my $origFile = $file_path . "junk\.file";
my $newFile = $file_path . "junk\.file\.backup";
# system command to make a backup of the file
system "mv $origFile $newFile";
#opens the files
open( my $INFILE, $newFile ) || die "Unable to read $newFile\n";
open( my $OUTFILE, '>' . $origFile ) || die "Unable to create $origFile\n";
# While loop to read in the file line by line
while ( <$INFILE> ) {
given ($_) {
when ("line to be overwritten") {
print $OUTFILE "line has been overwritten\n";
}
default {
print $OUTFILE $_;
}
}
}
close($INFILE);
close($OUTFILE);
I've tried to change the when statements several different ways to no avail:
when ($_ eq "line to be overwritten")
when ($_ == "line to be overwritten")
when ($_ cmp "line to be overwritten")
But those only generate errors. Anyone know what I'm doing wrong here?
As highlighted in a comment on the original question, given/when is an experimental feature of perl. I would personally recommend using if/else in a loop, and then either use string equality or a regex to match the line(s) you want to replace. A quick example:
use strict;
use warnings;
while(my $line = <DATA>) {
if ( $line =~ /line to be overwritten/ ) {
print "Overwritten\n";
} else {
print $line;
}
}
__DATA__
test line 1
test line 2
line to be overwritten
test line 3
This will give the output:
test line 1
test line 2
Overwritten
test line 3
You could also use the string equality if you aren't confident in your regex, or the string is guaranteed to be the same:
...
if ($line eq 'line to be overwritten') {
...
Sidenotes
open
On your initial open, it is recommended to use the 3 argument version of open to save from unexpected issues:
open(my $INFILE, '<', $newFile) || die "Unable to read $newFile\n";
(for more info on this, see here: http://modernperlbooks.com/mt/2010/04/three-arg-open-migrating-to-modern-perl.html)
strict & warnings
Also, it is recommended to use strict and warnings in your code file, as seen in my example above - this will save you from accidental mistakes like trying to use a variable which has not been declared, and syntax errors which may give you head-scratching results!
Experimental Features
Experimental features in perl are where there is no guarantee made for backwards compatibility to be maintained when a new release of perl comes out. Obviously if you are using the same version of perl everywhere it should be compatible, but things may break if you update to another major version of perl. answered here as I dont have the reputation to answer in the comments...
You seem to be making it way more complicated than it needs to be - a simple regex to check each line and act accordingly should do the job.
while(<$INFILE>)
{
chomp($_);
if /^line to be overwritten$/ )
{
print $OUTFILE "line has been overwritten\n";
}
else
{
print $OUTFILE "$_\n";
}
}
One way to do it is to use Tie::File module. It allows to replace data right in the file. You can make the backup same way you are currently doing, before changing the original file.
use strict;
use warnings;
use Tie::File;
my $file = 'test.txt';
tie my #textFile, 'Tie::File', $file, recsep => "\n" or die $!;
s/line to be overwritten/line has been overwritten/ for #textFile;
untie #textFile;

Perl unicode support from console (#ARGV) /Windows/

I'm trying to get unicode characters as arguments in perl script:
C:\>perl test.pl ö
#----
# test.pl
#----
#!/usr/bin/perl
use warnings;
use strict;
my ($name, $number) = #ARGV;
if (not defined $name) {
die "Need name\n";
}
if (defined $number) {
print "Save '$name' and '$number'\n";
# save name/number in database
exit;
}
if ($name eq 'ö') {
print "Fetch umlaut 'oe'\n";
} elsif ($name eq 'o') {
print "Fetch simple 'o'\n";
} else {
print "Fetch other '$name'\n";
}
print "ü";
and I get the output:
Fetch simple 'o'
ü
I've tested the code (algorithm) in python 3 and it works, so I get "ö".
But obviously in perl there is something more that I must add or set.
It doesn't matter if it is Strawberry Perl or ActiveState Perl. I get the same result.
Thanks in advance!
#!/usr/bin/perl
use strict;
use warnings;
my $encoding_in;
my $encoding_out;
my $encoding_sys;
BEGIN {
require Win32;
$encoding_in = 'cp' . Win32::GetConsoleCP();
$encoding_out = 'cp' . Win32::GetConsoleOutputCP();
$encoding_sys = 'cp' . Win32::GetACP();
binmode(STDIN, ":encoding($encoding_in)");
binmode(STDOUT, ":encoding($encoding_out)");
binmode(STDERR, ":encoding($encoding_out)");
}
use Encode qw( decode );
{
my ($name, $number) = map { decode($encoding_sys, $_) } #ARGV;
if (not defined $name) {
die "Need name\n";
}
if (defined $number) {
print "Save '$name' and '$number'\n";
# save name/number in database
exit;
}
if ($name eq 'ö') {
print "Fetch umlaut 'oe'\n";
} elsif ($name eq 'o') {
print "Fetch simple 'o'\n";
} else {
print "Fetch other '$name'\n";
}
print "ü";
}
Also, you should add use feature qw( unicode_strings ); and/or encode your file using UTF-8 and add use utf8;.
In addition to ikagami's fine answer, I'm a fan of the Encode::Locale module that automatically creates aliases for the current console's code pages. It works well with Win32, OS X & other flavors of *nix.
#!/usr/bin/perl
use strict;
use warnings;
# These two lines make life better when you leave the world of ASCII
# Just remember to *save* the file as UTF8....
use utf8;
use feature 'unicode_strings';
use Encode::Locale 'decode_argv'; # We'll use the console_in & console_out aliases as well as decode_argv().
use Encode;
binmode(STDIN, ":encoding(console_in)");
binmode(STDOUT, ":encoding(console_out)");
binmode(STDERR, ":encoding(console_out)");
decode_argv( ); # Decode ARGV in place
my ($name, $number) = #ARGV;
if (not defined $name) {
die "Need name\n";
}
if (defined $number) {
print "Save '$name' and '$number'\n";
# save name/number in database
exit;
}
if ($name eq 'ö') {
print "Fetch umlaut 'oe'\n";
} elsif ($name eq 'o') {
print "Fetch simple 'o'\n";
} else {
print "Fetch other '$name'\n";
}
print "ü";
Perhaps it's only syntactic sugar, but it makes easy reading and promotes cross-platform compatibility.
I think that the code answers to this question are well pointed but not complete:
that way , it is very complicated to construct a script with all the code page + source codification in mind, and moreover, it would be harder to make it portable: ö may be known to latin alphabet users, but の or 렌 also exist...
they may run ok with chars in a particular code page, but with chars outside it, they will fail (which is probably the case with some users in the comments). Note that Windows' Code Pages are previous to Unicode.
The fundamental problem is that Perl 5 for Windows is not compiled with Unicode support as Windows understands it: it is just a port of the linux code, and so, almost all Unicode chars are mangled before they even reach the Perl code.
A longer technical explanation (and a C patch!) is provided by A. Sinan Unur's page Fixing Perl's Unicode problems on the command line on Windows: A trilogy in N parts (under Artistic License 2.0).
So (but not for the faint of spirit): a recompilation of perl.exe is possible and almost fully Unicode compliant in Windows. Hopefully they'll be integrated some day in the source code... Until them I've resumed some detailed instructions to patch perl.exe here.
Note also that a proper command console with full Unicode support is needed. A quick solution is to use ConEmu, but Windows' cmd.exe could also work after some heavy tweaks.
I don't know if this is the solution for very scenario, but I could get away by using the parameter "-CAS" when calling my script.
Example:
Script_1:
use strict;
use utf8;
$|++; # Prevent buffering issues
my ($arg) = #ARGV;
save_to_file('test.txt', $arg);
sub save_to_file{
my ($filename, $content) = #_;
open(my $fh, '>:encoding(UTF-8)', $filename) or die "Can't open < $filename: $!";;
print $fh $content;
close $fh;
return;
}
Script_2 calling 1:
use strict;
use utf8;
execute_command();
sub execute_command {
my $command = "perl -CAS simple_utf_string.pl äääöööü";
# Execute command
print "The command to run is: $command\n";
open my $command_pipe, "-|:encoding(UTF-8)", $command or die "Pipe from $command failed: $!";
while (<$command_pipe>) {
print $_;
}
}
Result in: text.txt:
äääöööü

Parsing the large files in Perl

I need to compare the big file(2GB) contains 22 million lines with the another file. its taking more time to process it while using Tie::File.so i have done it through 'while' but problem remains. see my code below...
use strict;
use Tie::File;
# use warnings;
my #arr;
# tie #arr, 'Tie::File', 'title_Nov19.txt';
# open(IT,"<title_Nov19.txt");
# my #arr=<IT>;
# close(IT);
open(RE,">>res.txt");
open(IN,"<input.txt");
while(my $data=<IN>){
chomp($data);
print"$data\n";
my $occ=0;
open(IT,"<title_Nov19.txt");
while(my $line2=<IT>){
my $line=$line2;
chomp($line);
if($line=~m/\b$data\b/is){
$occ++;
}
}
print RE"$data\t$occ\n";
}
close(IT);
close(IN);
close(RE);
so help me to reduce it...
Lots of things wrong with this.
Asides from the usual (lack of use strict, use warnings, use of 2-argument open(), not checking open() result, use of global filehandles), the specific problem in your case is that you are opening/reading/closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; and then you can open the first file, input.txt and walk along that once, comparing to things in the array so you don't have to reopen that second file all the time.
Futher I suggest you read some basic articles on style/etc.. as your question is likely to gain more attention if it's actually written in vaguely modern standards.
I tried to build a small example script with a better structure but I have to say, man, your problem description is really very unclear. It's important to not read the whole comparison file each time as #LeoNerd explained in his answer. Then I use a hash to keep track of the match count:
#!/usr/bin/env perl
use strict;
use warnings;
# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my #comparison = <$comp_file>);
close $comp_file;
# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();
# compare each line
while (my $title = <$input>) {
chomp $title;
# iterate comparison strings
foreach my $comp (#comparison) {
$count{$comp}++ if $title =~ /\b$comp\b/i;
}
}
# done
close $input;
# output (sorted by count)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (#comparison) {
print $output "$comp\t$count{$comp}\n";
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
title_Nov19.txt
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
input.txt
foo
bar
And the result of the program was written to res.txt:
foo 3
bar 1
Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;
my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';
for ( read_file 'title_Nov19.txt' ) {
my %seen;
!$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}
write_file 'res.txt', map "$_\t$count{$_}\n",
sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes meta characters (\Q$_\E) is built and used, so only one pass against the large file's lines is needed. The hash %seen is used to insure that the input words are only counted once per line.
Hope this helps!
Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt

Writing to a file in perl

I want to write the key and value pair that i have populated in the hash.I am using
open(OUTFILE,">>output_file.txt");
{
foreach my $name(keys %HoH) {
my $values = $HoH{$name};
print "$name: $values\n";
}
}
close(OUTFILE);
Somehow it creates the output_file.txt but it does not write the data to it.What could be the reason?
Use:
print OUTFILE "$name: $values\n";
Without specifying the filehandle in the print statement, you are printing to STDOUT, which is by default the console.
open my $outfile, '>>', "output_file.txt";
print $outfile map { "$_: $HOH{$_}\n" } keys %HoH;
close($outfile);
I cleaned up for code, using the map function here would be more concise. Also I used my variables for the file handles, always good practice. There are still more ways to do this, you should check out Perl Cook book, here
When you open OUTFILE you have a couple of choices for how to write to it. One, you can specify the filehandle in your print statements, or two, you can select the filehandle and then print normally (without specifying a filehandle). You're doing neither. I'll demonstrate:
use strict;
use warnings;
use autodie;
my $filename = 'somefile.txt';
open my( $filehandle ), '>>', $filename;
foreach my $name ( keys %HoH ) {
print $filehandle "$name: $HoH{$name}\n";
}
close $filehandle;
If you were to use select, you could do it this way:
use strict;
use warnings;
use autodie;
my $filename = 'somefile.txt';
open my( $filehandle ), '>>', $filename;
my $oldout = select $filehandle;
foreach my $name( keys %HoH ) {
print "$name: $HoH{$name}\n";
}
close $filehandle;
select $oldout;
Each method has its uses, but more often than not, in the interest of writing clear and easy to read/maintain code, you use the first approach unless you have a real good reason.
Just remember, whenever you're printing to a file, specify the filehandle in your print statement.
sergio's answer of specifying the filehandle is the best one.
Nonetheless there is another way: use select to change the default output filehandle. And in another alternate way to do things, using while ( each ) rather than foreach ( keys ) can be better in some cases (particularly, when the hash is tied to a file somehow and it would take a lot of memory to get all the keys at once).
open(OUTFILE,">>output_file.txt");
select OUTFILE;
while (my ($name, $value) = each %HoH) {
print "$name: $value\n";
}
close(OUTFILE);