Sorting by second word in perl - perl

Hey guys,
I have this file called phonebook
Steve Blenheim:239-923-7366:238-934-7865:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:245-876-7656:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:385-333-8976:3567 Populus Place, Caldwell, NJ 23875:6/18/68:23400
Norma Corder:397-857-2735:397-857-7651:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
And I am trying to sort the text in reverse alphabetical order from the second word (the last name) and have not been able to find out how to do it. I am reading from the file by doing this
open (FILE, phonebook);
#line = <FILE>;
close(FILE);
any ideas? I can sort the first field in alphabetical order and reverse, but can't seem to get the second one to sort properly.
Thanks in advance

I share tadmc's concern that the second field, by whitespace isn't always going to be the surname, but answering the question as it pertains to the second field, you can get it using split, and you can sort it like this:
The simple but horribly slow version (easy to read, but it re-splits every field every single time it compares two lines, which is inefficient).
#lines = sort { # Compare second fields
(split " ", $a)[1]
cmp
(split " ", $b)[1]
} #lines;
The Schwartzian transform version (does the exact same thing as the previous one, only much faster):
#lines = map { # Get original line back
$_->[0]
} sort { # Compare second fields
$a->[1] cmp $b->[1]
} map { # Turn each line into [original line, second field]
[ $_, (split " ", $_)[1] ]
} #lines;

If you don't mind using the shell, sort -r -k2 will sort your file in reverse order.

Based on Miguel Prz solution I replaced the 'cmd' to '<=>'.
It is important for numbers. If the CMP is used, then sorting will work as a string (digits) - first character is most important, then second and so on. If you have the numbers: 607, 8 and 35 then CMP will sort it as: 8, 607, 35. To sort it as numbers we use the "<=>' method and the result will be: 607, 35, 8
use strict;
open my $FILE, '<', 'phonebook';
my #lines = <$FILE>;
my #sorted = sort {
my #a = split(/\s+/,$a);
my #b = split(/\s+/,$b);
$b[1] <=> $a[1] } #lines;
foreach my $item(#sorted) {
print "$item\n";
}
close $FILE;

You'll need to read the file line by line to do that. Something like this:
my %list;
open(FILE, phonebook);
while(<FILE>){
my #vals = split(/:/, $_);
(my $key = $vals[0]) =~ s/(\S+)\s+(.+)/$2 $1/; # split first field, reverse word order
$list{$key} = $_; #save row keyed on $key
}
foreach my $key(sort {$b cmp $a} keys(%list)){
print $list{$key};
}

I think it's interesting to write in a Modern Perl way (the solution is the same), and this is the complete script:
use strict;
open my $FILE, '<', 'phonebook';
my #lines = <$FILE>;
my #sorted = sort {
my #a = split(/\s+/,$a);
my #b = split(/\s+/,$b);
$b[1] cmp $a[1] } #lines;
foreach my $item(#sorted) {
print "$item\n";
}
close $FILE;

I am surprised nobody has mentioned this, but if we are sorting a phonebook, we probably don't really want a pure ASCII sort.
Does Bob DeCarlo really belong before Ralph Dearborn? If you sort by using cmp Mr. DeCarlo winds up first in the results.
Even if you normalize for case, you've still got issues. There are a host of complications with sorting and filing things. Different organizations have rules for handling these issues.
Since sort is an expensive operation, you'll want to make each comparison work as quickly as possible. The way to do this is to use the simplest code possible for all your comparisons. Since cmp won't give us the desired result by itself, we need to generate and cache a normalized sort term for each item in the phone book.
So, assuming you've already got your phone book data in an array:
sub extract_and_normalize {
# Do stuff here to embody your alphabetization rules.
return [ $normed, $line ];
}
# Generate your sort terms
my #processed = map extract_and_normalize($_), #lines;
# Sort by the normalized values
my #sorted = sort {$a->[0] cmp $b->[0]}, #processed;
# Extract the lines from the sorted set.
#lines = map $_->[1], #sorted;
Or use the Schwartzian Transform, as hobbs suggests, to avoid making all the intermediate variables:
#lines = map $_->[1],
sort { $a->[0] cmp $b->[0] }
map extract_and_normalize($_), #lines;

Related

Sorting 5th column in descending order error message

The text file I am trying to sort:
MYNETAPP01-NY
700000123456
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
I am trying to sort this text file by its 5th column (the capacity field) in descending order.
When I first started this there was a percentage symbol mixed with the numbers. I solved this by substituting the the value like so: s/%/ %/g for #data;. This made it easier to sort the numbers alone. Afterwards I will change it back to the way it was with s/ %/%/g.
After running the script, I received this error:
#ACI-CM-L-53:~$ ./netapp.pl
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, line 24 (#1)
(F) You've told Perl to dereference a string, something which
use strict blocks to prevent it happening accidentally. See
"Symbolic references" in perlref. This can be triggered by an # or $
in a double-quoted string immediately before interpolating a variable,
for example in "user #$twitter_id", which says to treat the contents
of $twitter_id as an array reference; use a \ to have a literal #
symbol followed by the contents of $twitter_id: "user \#$twitter_id".
Uncaught exception from user code:
Can't use string ("/vol/vfiler_PROD1_SF_isci_15K01/"...) as an ARRAY ref while "strict refs" in use at ./netapp.pl line 20, <$DATA> line 24.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $DATA, "<raw_info.txt") or die "$!";
my $systemName = <$DATA>;
my $systemSN = <$DATA>;
my $header = <$DATA>;
my #data;
while ( <$DATA> ) {
#data = (<$DATA>);
}
s/%/ %/g for #data;
s/---/000/ for #data;
print #data;
my #sorted = sort { $b->[5] <=> $a->[5] } #data;
print #sorted;
close($DATA);
Here is an approach using Text::Table which will nicely align your output into neat columns.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Table;
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
chomp(my $hdr = <$DATA>); # header
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
print $tbl;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
The output generated is:
Filesystem total used avail capacity Mounted on
/vol/vfiler_PROD1_SF_isci_15K01/ 2048GB 1653GB 394GB 81% /vol/vfiler_PROD1_SF_isci_15K01/
/vol/vfiler_PROD1_SF_NFS15K01/ 1638GB 735GB 903GB 45% /vol/vfiler_PROD1_SF_NFS15K01/
/vol/vfiler_PROD1_SF_NFS15K01/.snapshot 409GB 105GB 303GB 26% /vol/vfiler_PROD1_SF_NFS15K01/.snapshot
snap reserve 0TB 0TB 0TB ---% /vol/vfiler_PROD1_SF_isci_15K01/..
Update
To explain some of the advanced parts of the program.
my $tbl = Text::Table->new( split ' ', $hdr, 6 );
This creates the Text::Table object with the header split into 6 columns. Without the limit of 6 columns, it would have created 7 columns (because the last field, 'mounted on', also contains a space. It would have been incorrectly split into 2 columns for a total of 7).
$tbl->load( map [split /\s{2,}/], sort by_percent <$DATA> );
The statement above 'loads' the data into the table. The map applies a transformation to each line from <$DATA>. Each line is split into an anonymous array, (created by [....]). The split is on 2 or more spaces, \s{2,}. If that wasn't specified, then the data `snap reserve' with 1 space would have been incorrectly split.
I hope this makes whats going on more clear.
And a simpler example that doesn't align the columns like Text::Table, but leaves them in the form they originally were read might be:
open my $DATA, '<', 'file1' or die $!;
<$DATA> for 1 .. 2; # throw away first two lines
my $hdr = <$DATA>; # header
print $hdr;
print sort by_percent <$DATA>;
sub by_percent {
my $keya = $a =~ /(\d+)%/ ? $1 : '0';
my $keyb = $b =~ /(\d+)%/ ? $1 : '0';
$keyb <=> $keya
}
In addition to skipping the fourth line of the file, this line is wrong
my #sorted = sort { $b->[5] <=> $a->[5] } #data
But presumably you knew that as the error message says
at ./netapp.pl line 20
$a and $b are lines of text from the array #data, but you're treating them as array references. It looks like you need to extract the fifth "field" from both variables before you compare them, but no one can tell you how to do that
You code is quite far from what you want. Trying to change it as little as possible, this works:
#!/usr/bin/perl
use strict;
use warnings;
open (my $fh, "<", "raw_info.txt") or die "$!";
my $systemName = <$fh>;
my $systemSN = <$fh>;
my $header = <$fh>;
my #data;
while( my $d = <$fh> ) {
chomp $d;
my #fields = split '\s{2,}', $d;
if( scalar #fields > 4 ) {
$fields[4] = $fields[4] =~ /(\d+)/ ? $1 : 0;
push #data, [ #fields ];
}
}
foreach my $i ( #data ) {
print join("\t", #$i), "\n";
}
my #sorted = sort { $b->[4] <=> $a->[4] } #data;
foreach my $i ( #sorted ) {
$i->[4] .= '%';
print join("\t", #$i), "\n";
}
close($fh);
Let´s make a few things clear:
If using the $ notation, it is customary to define file variables in lower case as $fd. It is also typical to name the file descriptor as "fd".
You define but not use the first three variables. If you don´t apply chomp to them, the final CR will be added to them. I have not done it as they are not used.
You are defining a list with a line in each element. But then you need a list ref inside to separate the fields.
The separation is done using split.
Empty lines are skipped by counting the number of fields.
I use something more compact to get rid of the % and transform the --- into a 0.
Lines are added to list #data using push and turning the list to add into a list ref with [ #list ].
A list of list refs needs two loops to get printed. One traverses the list (foreach), another (implicit in join) the columns.
Now you can sort the list and print it out in the same way. By the way, Perl lists (or arrays) start at index 0, so the 5th column is 4.
This is not the way I would have coded it, but I hope it is clear to you as it is close to your original code.

Sort mixed text lines (alphanum) in Perl

I have txt file with every line structure like this:
P[containerVrsn:U(0)recordVrsn:U(0)size:U(212)ownGid:G[mdp:U(1090171666)**seqNo:U(81920)**]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1064620)size:U(184)updateVersion:U(3)checksum:U(748981000)
And have to sort file lines based on seqNo (min to max). Sequence number can be virtually any number starting from zero. Any idea how can it be done in efficient way?
The Schwartzian Transform as suggested in Toto's answer is probably the fastest way to sort your lines here. But you said you're a Perl newbie, and I like to show how the lines can be sorted traditionally.
Perl has a sort function that sorts a list simply by alphabet. But you can supply a custom comparison function and let sort use your function to compare the elements. During its operation sort must continuously compare two elements (=lines) of your list and decide which one is greater or lesser or whether they are equal.
If you supply a comparison function, sort will call it with two such elements as the parameters $a and $b. You do not need to must not declare $a and $b, they are magic and just there. Your comparison function could look like this:
sub by_seqNo
{
# extract the sequence number from $a and $b
my ($seqA) = ($a =~ /seqNo:U\((\d+)/);
my ($seqB) = ($b =~ /seqNo:U\((\d+)/);
# numerically compare the sequence numbers (returns -1/0/+1)
$seqA <=> $seqB;
}
The first two lines extract the numbers after seqNo:U( and store them as $seqA and $seqB. The third line compares these sequence numbers as integers and returns that result. Combined with the sort function this gives:
my #sorted = sort by_seqNo #lines;
The reason why the Schwartzian Transform (ST) is faster than this solution is because the ST does the (expensive) operation of extracting the seqNo from your lines exactly once for each line. The "traditional" approach on the other hand extracts the seqNo twice for each comparison.
You could use Schwartzian Transform.
Here is a small script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
open my $fhi, '<', 'path/to/input/file' or die "Unable to open input file: $!";
my #lines = <$fhi>;
my #sorted = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { my ($x) = $_ =~ /SeqNo:U\((\d+)/i; [$_, $x]; }
#lines;
open my $fho, '>', 'path/to/output/file' or die "Unable to open output file: $!";
print $fho $_ for #sorted;

Selecting highest count of element except when...

So i have been working on this perl script that will analyze and count the same letters in different line spaces. I have implemented the count to a hash but am having trouble excluding a " - " character from the output results of this hash. I tried using delete command or next if, but am not getting rid of the - count in the output.
So with this input:
#extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(#extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, #keys) = keys %$counter;
my ($big, #vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie in the elements, say G = M, if there is a way to display both in that would be great but not necessary. Any tips on how to delete or remove the '-' is much appreciated. I am slowly learning perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my #chars = split '', $str;
#Count Characteres
my %count;
$count{$_}++ for #chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my #keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (#keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
foreach my $extract(#extract) {
#next if $_ =~ /\-/
$_ setting is suppressed by $extract here.
(In this case, $_ keeps value from above, e.g. routine argument list, previous match, etc.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;

How to print the frequency of words in perl?

open INP,"<C:\\Users\\hanadi\\Documents\\cs.txt";
while ($line=<INP>)
{
chomp($line);
#list=split/\s+/,$line;
foreach $w (#list)
{
$wordfreq{$w}++;
}
}
foreach $w2(keys>wordfreq)
{
print "$w2==>$wordfreq{$w}";
}
I want to print each word and its frequency.now i want code in Perl to jump and Print the above information for the next
ranks (>100), but do not print
every line, print only one line for every 1000 words (otherwise there will
be too many lines to print) in decreasing order of frequency and
decreasing alphabetical order among words with the same frequency.
The first issue of this problem is to define the word "word." Am assuming, by one of your comments, that punctuation is not part of a "word," since you were asking how to "...delete punctuations from the text..."
One solution to this is to use a regex to capture only "word" characters, i.e., alphanumeric and underscore, by matching the text against \w in the regex.
Building a hash, where the keys are the words and the associated values are the counts, is the way to go. However, when doing this, you need to insure that the keys are all the same case, i.e., either all UPPER or all lower.
After you've built this hash, you can sort the output in descending order by the has values (frequency) and use a counter to print just the top 100 words. There will be words with the same frequency count--especially having only one occurrence. How do you want these printed, as it can make a difference whether they appear in the list of top 100. I'd suggest ordering these cases alphabetically.
Give the above, consider the following solution, which uses the text above the code below as the corpus:
use strict;
use warnings;
my %hash;
open my $fh, '<', 'words.txt' or die $!;
while (<$fh>) {
$hash{ lc $1 }++ while /(\w+)/g;
}
close $fh;
my $i = 1;
for my $word ( sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash ) {
print "$i. $word: ($hash{$word})\n" if $i++ < 100 or !( $i % 1000 );
}
Partial output:
1. the: (22)
2. to: (8)
3. a: (5)
4. you: (5)
5. is: (4)
6. of: (4)
7. this: (4)
8. word: (4)
9. all: (3)
10. and: (3)
...
96. punctuation: (1)
97. punctuations: (1)
98. since: (1)
99. sort: (1)
100. suggest: (1)
Limitations:
One issue that results from capturing word characters can be seen in cases of some possessive forms of words, e.g., word's. In this case, both word and s would be captured as words. If you want to retain such punctuation, and split on whitespace, you can just use the following instead of the regex line:
$hash{ lc $_ }++ for split ' ';
Hope this helps!
please pay attention to all of amons reply, and always rtq. (good spot amon).
(i have determined) your problem is that once you have constructed your wordcount hash, you now need to invert the hash so you can sort the values into some kind of order. The problem with this is that more than one word may have the same count and would overwrite earlier stored words.
To do this you need to store an array in a hash value, and this is done by using a reference to an array. Hash values may only be scalars, an array is not a scalar, but a reference to an array is.
In this re-write of your problem, I have updated the open and close calls to use scalar filehandles, with appropriate error handling (or die), and converted your foreach statements into 'maps'. These can take a bit of time to grasp so do not copy and paste them. Rather, focus on the inverting of the hash and how the array is accessed. This is possibly quite complex for you so I have left this parts in foreach style.
The 'each' keyword takes a key/value pair from the hash, and is often used this way to process hashes in while statements.
You will still need to work on converting the counts into frequencies as per amons suggestion and retrieving the top 100. There is a clue to the counting in the '$c' variable.
#!/usr/bin/perl
# word count #wct.pl
use warnings;
use strict;
my (%wordfreq);
open my $input, '<', 'wc.txt'
or die "cannot open wc txt file $!";
map { $wordfreq{ $_ }++; } (split /\s+/, $_) while <$input> ;
close $input
or die "cannot close wc txt file $!";
# print (
# map {"$_ has ". $wordfreq{$_} . "\n" } (keys %wordfreq)
# );
foreach (keys %wordfreq){
# print "$_ has ". $wordfreq{$_} . "\n"
}
my %invertedhash;
while (my ($key,$value) = each %wordfreq){
push #{$invertedhash{$value}}, $key;
}
my $c;
foreach (reverse sort keys %invertedhash){
last if $c++ == 2;
print "words with a count of $_ are #{$invertedhash{$_}} \n";
}
exit 0;
sample
one two two
three three three four
four four four five
five five five
produces
words with a count of 4 are four five
words with a count of 3 are three
Hope this helps.

How to get the top keys from a hash by value

I have a hash that I sorted by values greatest to least. How would I go about getting the top 5? There was a post on here that talked about getting only one value.
What is the easiest way to get a key with the highest value from a hash in Perl?
I understand that so would lets say getting those values add them to an array and delete the element in the hash and then do the process again?
Seems like there should be an easier way to do this then that though.
My hash is called %words.
Edited Took out code as the question answered without really needing it.
Your question is how to get the five highest values from your hash. You have this code:
my #keys = sort {
$words{$b} <=> $words{$a}
or
"\L$a" cmp "\L$b"
} keys %words;
Where you have your sorted hash keys. Take the five top keys from there?
my #highest = splice #keys, 0, 5; # also deletes the keys from the array
my #highest = #keys[0..4]; # non-destructive solution
Also some comments on your code:
open( my $filehandle0, '<', $file0 ) || die "Could not open $file0\n";
It is a good idea to include the error message $! in your die statement to get valuable information for why the open failed.
for (#words) {
s/[\,|\.|\!|\?|\:|\;|\"]//g;
}
Like I said in the comment, you do not need to escape characters or use alternations in a character class bracket. Use either:
s/[,.!?:;"]//g for #words; #or
tr/,.!?:;"//d for #words;
This next part is a bit odd.
my #stopwords;
while ( my $line = <$filehandle1> ) {
chomp $line;
my #linearray = split( " ", $line );
push( #stopwords, #linearray );
}
for my $w ( my #stopwords ) {
s/\b\Q$w\E\B//ig;
}
You read in the stopwords from a file... and then you delete the stopwords from $_? Are you even using $_ at this point? Moreover, you are redeclaring the #stopwords array in the loop header, which will effectively mean your new array will be empty, and your loop will never run. This error is silent, it seems, so you might never notice.
my %words = %words_count;
Here you make a copy of %words_count, which seems to be redundant, since you never use it again. If you have a big hash, this can decrease performance.
my $key_count = 0;
$key_count = keys %words;
This can be done in one line: my $key_count = keys %words. More readable, in my opinion.
$value_count = $words{$key} + $value_count;
Can also be abbreviated with the += operator: $value_cont += $words{$key}
It is very good that you use strict and warnings.
If performance isn't a big deal
(sort {$words{$a} <=> $words{$b}} keys %words)[0..4])
if you absolutely need killer speed, a selection sort which terminates after 5 iterations is probably the best thing for you.
my #results;
for (0..4) {
my $maxkey;
my $max = 0;
for my $key (keys %words){
if ($max < $words{$key}){
$maxkey = $key;
$max = $words{$key};
}
}
push #results, $maxkey;
delete $words{$maxkey};
}
say join(","=>#results);
There's CPAN module for that, Sort::Key::Top.
It has a straight-forward interface and an efficient XS implementation:
use Sort::Key::Top qw(rnkeytop);
my #results = rnkeytop { $words{$_} } 5 => keys %words;