Sorting hashes on value length whilst preserving order - perl

I'm currently writing a Perl script to sort lines from stdin and print the lines in order of line length whilst preserving order for the ones that are equal.
My sorting code consists of the following:
while (my $curr_line = <STDIN>) {
    chomp($curr_line);
    $lines{$curr_line} = length $curr_line;
}

for my $line (sort { $lines{$a} <=> $lines{$b} } keys %lines) {
    print $line, "\n";
}
For example, my stdin consists of the following:
tiny line
medium line
big line
huge line
rand line
megahugegigantic line
I'd get the following output:
big line
rand line
tiny line
huge line
medium line
megahugegigantic line
Is there any way I can preserve the order for lines of equal length, such that tiny would come before huge, which comes before rand? Also, the order seems to change every time I run the script.
Thanks in advance

One possible solution
You can save the position of the line in the input file handle as well as the length. The $. magic variable (input line number) provides this. You can then sort on both values.
use strict;
use warnings;

my %lines;

while ( my $curr_line = <DATA> ) {
    chomp($curr_line);
    $lines{$curr_line} = [ length $curr_line, $. ];
}

for my $line (
    sort {
        $lines{$a}->[0] <=> $lines{$b}->[0]
            || $lines{$a}->[1] <=> $lines{$b}->[1]
    } keys %lines
) {
    print $line, "\n";
}
__DATA__
tiny lin1
medium line
big line
huge lin2
rand lin3
megahugegigantic line
This will always output
big line
tiny lin1
huge lin2
rand lin3
medium line
megahugegigantic line
You can of course use a hash to make the code more readable, too.
$lines{$curr_line} = {
    length   => length $curr_line,
    position => $.,
};
Explanation of your implementation
Your results changed their order every time because of random hash ordering. The way keys returns the list of keys is random, because of the way Perl implements hashes. This is by design, and a security feature. Since there are several keys that have the same value, the sort will sometimes return different results, based on which of the equal value keys came out first.
You could mitigate this by sticking another sort in front of your keys call. That would sort the keys by name, at least making the order of the undesired result be consistent.
#                                                vvvv
for my $line (sort { $lines{$a} <=> $lines{$b} } sort keys %lines) { ... }
Note that you don't have to chomp the input if you put the \n back when you print; the trailing newline adds the same length to every line anyway, so it doesn't affect the sort. If you do chomp, you should print a $/, which is the input record separator that chomp removed, or you falsify your data.
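As a small illustration of that second point (a sketch; the surrounding processing is elided):

while (my $curr_line = <STDIN>) {
    chomp $curr_line;
    # ... use the chomped line ...
    print $curr_line, $/;   # put back exactly what chomp removed
}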

Your problem is not with sort. Since v5.8, Perl's sort has used a merge sort, which is a stable sort: inputs that compare equal under the sort key keep the same relative order on output as on input. (Earlier versions used quicksort, which is not stable.)
Your problem is that you are storing the lines in a hash. A hash is an unordered collection of key-value pairs, so adding the lines to the hash and then printing them out again without the sort will give you the lines in a random order.
You need to read all the lines into an array and then sort them on length, the quickest way being to use a Schwartzian Transform, see below.
my @lines = <STDIN>;
chomp(@lines);

my @sorted =                          # This is the clever bit and needs to be read from the last map up
    map  { $_->[0] }                  # Get the lines
    sort { $a->[1] <=> $b->[1] }      # Sort on length
    map  { [ $_, length $_ ] }        # Create a list of array refs containing
                                      # the line and the length of the line
    @lines;

print join "\n", @sorted;             # Print out the sorted lines

Nowhere do you store the original order, so you can't possibly sort by it. The easiest fix is to store the lines in an array, and ensure that Perl is using a stable sort.
use sort 'stable';
use feature 'say';

my @lines = <>;
chomp(@lines);

for my $line ( sort { length($a) <=> length($b) } @lines ) {
    say $line;
}
[ ST is overkill for this. It's such overkill that it probably even slows things down! ]

As has been explained, the randomness comes from your use of hash keys to store the strings. There is no need for this, or for anything more elaborate like a Schwartzian Transform, to make this work.
All Perl versions since v5.8 have used a stable sort, which keeps values that sort equally in their original order. But you can insist that the sort operator you get is a stable one by using the sort pragma with
use sort 'stable';
Here's how I would write your program. It stops reading input at end of file, or when it sees a blank line, in case you want to enter the data from the keyboard.
use strict;
use warnings 'all';
use feature 'say';
use sort 'stable';

my @list;

while ( <> ) {
    last unless /\S/;
    chomp;
    push @list, $_;
}

say for sort { length $a <=> length $b } @list;
Using the same input as you use in the question, this produces
output
big line
tiny line
huge line
rand line
medium line
megahugegigantic line

Related

Appending values to Hash if key is same in Perl

The problem is to read a file with a value on every line. The content of the file looks like this:
3ssdwyeim3,3ssdwyeic9,2017-03-16,09:10:35.372,0.476,EndInbound
3ssdwyeim3,3ssdwyfyyn,2017-03-16,09:10:35.369,0.421,EndOutbound
3ssdwyfxc0,3ssdwyfxfi,2017-03-16,09:10:35.456,0.509,EndInbound
3ssdwyfxc0,3ssdwyhg0v,2017-03-16,09:10:35.453,0.436,EndOutbound
The string before the first comma is the key, and the string between the last and second-last commas is the value;
i.e. for the first line, 3ssdwyeim3 becomes the key and 0.476 the value.
Now, as we loop over each line, if the key already exists we have to concatenate the values, separated by a comma.
Hence for the next line, as the key already exists, the key remains 3ssdwyeim3 but the value is updated to 0.476,0.421.
Finally we have to print the keys and values to a file.
I have written a code to achieve the same, which is as follows.
sub findbreakdown {
    my ( $out ) = @_;
    my %timeLogger;

    open READ,  "out.txt"              or die "Cannot open out.txt for read :$!";
    open OUTBD, ">$out\_breakdown.csv" or die "Cannot open $out\_breakdown.csv for write :$!";

    while ( <READ> ) {
        if ( /(.*),.*,.*,.*,(.*),.*/ ) {
            $btxnId = $1;
            $time   = $2;
            if ( !$timeLogger{$btxnId} ) {
                $timeLogger{$btxnId} = $time;
            }
            else {
                $previousValue       = $timeLogger{$btxnId};
                $newValue            = join ",", $previousValue, $time;
                $timeLogger{$btxnId} = $newValue;
            }
        }
        foreach ( sort keys %timeLogger ) {
            print OUTBD "$_ ,$timeLogger{$_}\n";
        }
    }
    close OUTBD;
    close READ;
}
However, something is going wrong and it prints this:
3ssdwyeim3,0.476
3ssdwyeim3,0.476,0.421
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436
Whereas expected is:
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436
Your program is behaving correctly, but you are printing the current state of the entire hash after you process each line.
Therefore you are printing hash keys before they have the complete set of values, and you have many duplicated lines.
If you move the foreach loop that prints to the end of your program (or simply use the debugger to inspect the variables) you will find that the final state of the hash is exactly what you expect.
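For illustration, a minimal sketch of that fix, with the printing loop moved so it runs once after the while loop has consumed all the input (the elided lines are unchanged from the original sub):

    while ( <READ> ) {
        if ( /(.*),.*,.*,.*,(.*),.*/ ) {
            ...;   # accumulate into %timeLogger exactly as before
        }
    }
    # only now, after ALL lines have been read, print the final state once
    foreach ( sort keys %timeLogger ) {
        print OUTBD "$_ ,$timeLogger{$_}\n";
    }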
Edit: I previously thought the problem was the below, but it's because I misread the sample data in your question.
This regular expression is not ideal:
if (/(.*),.*,.*,.*,(.*),.*/) {
The .* is greedy and will match as much as possible (including some content with commas). So if any line contains more than six comma-separated items, more than one item will be included in the first matching group. This may not be a problem in your actual data, but it's not an ideal way to write the code. The expression is more ambiguous than necessary.
It would be better written like this:
if (/^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),[^,]*$/) {
Which would only match lines with exactly six items.
Or consider using split on the input line, which would be a cleaner solution.
This is much simpler than you have made it. You can just split each line into fields and use push to add the value to the list corresponding to the key.
I trust you can modify this to read from an external file instead of the DATA file handle?
use strict;
use warnings 'all';

my %data;

while ( <DATA> ) {
    my @fields = split /,/;
    push @{ $data{$fields[0]} }, $fields[-2];
}

for my $key ( sort keys %data ) {
    print join(',', $key, @{ $data{$key} }), "\n";
}
__DATA__
3ssdwyeim3,3ssdwyeic9,2017-03-16,09:10:35.372,0.476,EndInbound
3ssdwyeim3,3ssdwyfyyn,2017-03-16,09:10:35.369,0.421,EndOutbound
3ssdwyfxc0,3ssdwyfxfi,2017-03-16,09:10:35.456,0.509,EndInbound
3ssdwyfxc0,3ssdwyhg0v,2017-03-16,09:10:35.453,0.436,EndOutbound
output
3ssdwyeim3,0.476,0.421
3ssdwyfxc0,0.509,0.436

Sorting a sub-section of a file

I'm in need of some Perl wisdom from those more experienced than myself.
So far, my answer to the below is simply to go through the file line by line, insert the relevant elements into an array, sort the array, and then append the contents. But that seems a bit long-winded and not very efficient.
I have a file whose contents look something like this:
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.bravo.example.com # <----
noperiod.example.com # <---- This list is
.an.example.com # <---- not ordered
.some.example.com # <----
Is there a clever way in Perl (ideally a one-liner that could be piped) to sort the second list? I.e. so you would get the following result:
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.an.example.com # <----
.bravo.example.com # <---- NOW this list
noperiod.example.com # <---- IS ordered ;-)
.some.example.com # <----
Four things to note:
The content to be sorted is always at the bottom of the file
The header (":127.200.200.100 etc.") is always the same
Names may or may not start with a period (i.e. .bravo.example.com vs noperiod.example.com)
There may be a large number of items, so needs to be reasonably efficient
Depends what you mean by 'efficient'. I mean, a one-liner is rarely efficient, and it's also rarely concise or clear about what it's doing.
But in terms of efficiency? Well, it depends what you're doing already that's inefficient. I mean, pretty fundamentally, if you're sorting something you need to examine the whole data set. Otherwise how would you know that the last line in your file needs to be sorted to the top?
But for what you're doing, I'd approach it like this:
#!/usr/bin/perl
use strict;
use warnings;

sub sort_noperiods {
    my $a_np = $a;
    $a_np =~ s/\.//g;
    my $b_np = $b;
    $b_np =~ s/\.//g;
    return $a_np cmp $b_np;
}

while ( <> ) {
    print;
    last if m/Something Else/;
}
print sort sort_noperiods <>;
Which, for your sample input, prints:
# A Comment
# Another comment
:127.100.100.255:Something
.789
.123
.456
:127.200.200.100:Something Else
.an.example.com # <---- not ordered
.bravo.example.com # <----
noperiod.example.com # <---- This list is
.some.example.com # <----
I'm keying off the 'Something Else' line in your file, as I couldn't quite tell how you'd identify the last line of the 'header' chunk. Anything else gets read in and sorted according to the 'noperiods' sort mechanism. (There may be a small efficiency gain by caching the result of the regular expressions, but I'm not sure of this).
This can be 'one-linerified' by:
perl -e 'while ( <> ) { print; last if m/Something Else/ }; print sort { $a =~ s/\.//gr cmp $b =~ s/\.//gr } <>; '
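And for the caching idea mentioned above, here is a sketch using a Schwartzian Transform (this assumes Perl 5.14+ for the /r substitution modifier):

while ( <> ) {
    print;
    last if m/Something Else/;
}
# compute each line's dot-stripped sort key once, instead of twice per comparison
print map  { $_->[1] }
      sort { $a->[0] cmp $b->[0] }
      map  { [ s/\.//gr, $_ ] } <>;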
You can sort that by the shell with a little help from Perl: Just prepend a line number to each line before the list, and for the list, use the number of its first line. Then sort numerically by the numbers, and secondary by the rest of the line:
perl -ne 'if (1 .. /^:127\.200\.200\.100:.*/) {
print "$.\t$_";
} else {
print $.--, "\t$_"
}' file.txt \
| sort -k1,1n -k2 | cut -f2-

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match, as they contain the same data in a different order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparison for your fields, but after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;

my $cache; # A hash of hashes. Will be autovivified later.

while (<DATA>) {
    chomp;
    my @fields = split;

    # create the normalized representation by lowercasing and sorting the fields
    my @normalized_fields = sort map lc, @fields;

    # find or create the path in the lookup
    my $pointer = \$cache;
    $pointer = \${$pointer}->{$_} for @normalized_fields;

    # if this is an unknown value, make it known, and output the line
    unless (defined $$pointer) {
        $$pointer = 1;   # set some defined value
        print "$_\n";    # emit the unique line
    }
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as the value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-separated values here, but you could replace the split with a call to Text::CSV or something.
This hash-of-hashes approach has sublinear space complexity and, in the worst case, linear space complexity. The lookup time depends only on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for @normalized_fields;
unless (defined $$pointer->[1]) {
    $$pointer->[1] = 1; ...
}
or introduce a default value for nonexistent fields (e.g. the separator of the original file). Here is an example with the NUL character:
my $fields = 3;
...;
die "record too long" if @fields > $fields;
...;   # make normalized fields
push @normalized_fields, ("\x00") x ($fields - @normalized_fields);
...;   # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;

my %data;

while (<DATA>) {
    chomp;
    my $key = join ',', sort map lc, split /,/;
    push @{ $data{$key} }, $.;
}

foreach my $list (values %data) {
    next unless @$list > 1;
    print "Lines ", join(', ', @$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
1. Make two hash tables, A and B.
2. Stream through your input one line at a time.
3. For the first line pair x and y, use each as key and the other as value for both hash tables (e.g., $A->{x} = y; $B->{y} = x;).
4. For the second and subsequent line pairs, test whether the second field's value exists as a key in either A or B. If it does, you have a reverse match; if not, repeat the addition process from step 3 to add the pair to the hash tables (sketched below).
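For illustration, a minimal sketch of this approach. It assumes exactly two comma-separated fields per line, as in the sample data, and adds an equality check so only truly reversed pairs are reported:

use strict;
use warnings;

my (%A, %B);
while (my $line = <DATA>) {
    chomp $line;
    my ($x, $y) = split /,/, $line;
    if ((defined $A{$y} && $A{$y} eq $x) || (defined $B{$y} && $B{$y} eq $x)) {
        print "Reverse match at line $.: $line\n";
    }
    else {
        $A{$x} = $y;   # remember the pair in both directions
        $B{$y} = $x;
    }
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3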
To do a version of amon's answer without a hash table, if your data are numerical, you could:
1. Stream through the input line by line, sorting fields one and two into numerical order.
2. Pipe the result to UNIX sort on the first and second fields.
3. Stream through the sorted output line by line, checking whether the current line matches the previous line (reporting a reverse match if it does; see the sketch below).
This has the advantage of using less memory than hash tables, but may take more time to process.
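A sketch of that pipeline, assuming two numeric comma-separated fields per line (the file name input.csv is hypothetical):

perl -F, -lane 'print join ",", sort { $a <=> $b } @F' input.csv \
    | sort -t, -k1,1n -k2,2n \
    | perl -ne 'print "Duplicate pair: $_" if defined $prev && $_ eq $prev; $prev = $_'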
amon already provided the answer I would've provided, so please enjoy this bad answer:
#!/usr/bin/perl
use common::sense;

my $re = qr/(?!)/;   # always fails

while (<DATA>) {
    warn "Found duplicate: $_" if $_ =~ $re;
    next unless /^(.*),(.*)$/;
    die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
    $re = qr/^\Q$2,$1\E$|$re/;
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

How to count duplicates in a hash in perl

Please bear with me.
Write a program that reads a series of words (with one word per line)[1] until end-of-input, then print a summary of how many times each word was seen. (Hint: remember that when an undefined value is used as if it were a number, Perl automatically converts it to 0. It may help to look back at the earlier exercise that kept a running total.) If the input words were fred, barney, fred, dino, wilma, fred (all on separate lines), the output should tell us that fred was seen 3 times. For extra credit, sort the summary words in ASCII order in the output.
[1] It has to be one word per line because we still haven't shown you how to extract individual words from a line of input.
This one should use a hash, and I can't figure out how. I can only think of using an array and two loops to compare duplicates. I guess I didn't understand the problem right. Here's my code using an array.
#!/usr/bin/perl
use warnings;
use strict;

chomp(my @input = <STDIN>);

foreach my $name (@input) {
    my $count;
    foreach my $compare_name (@input) {
        if ($name eq $compare_name) {
            $count += 1;
        }
    }
    print "$name seen $count times\n";
}
but if my input is, for example:
myname
myname
myname
it prints:
myname seen 3 times
myname seen 3 times
myname seen 3 times
Can somebody guide me on how to use a hash for this one? Thanks
You are trying to find out how many times you've seen a word, which is to say you're trying to access the count associated with a word.
If only you could use the word as the index into an array and store the count in that array element...
Well, that's exactly what a hash is. It's no surprise it was recommended by the exercise.
$counts{$word} # Retrieves the count associated with a word.
++$counts{$word}; # Increment the number of times you've seen a word.
keys(%counts) # Returns a list of the words you have encountered.
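Putting those pieces together, a minimal sketch of the whole exercise:

use strict;
use warnings;

my %counts;
while (my $word = <STDIN>) {
    chomp $word;
    ++$counts{$word};   # an undef count is treated as 0 on the first increment
}
for my $word (sort keys %counts) {   # ASCII order, for the extra credit
    print "$word was seen $counts{$word} times\n";
}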
It prints it three times because you have three iterations of the outer loop, each yielding the same result (3 times).
A much simpler way is:
my %occurs;
while (<STDIN>) {
    chomp($_);
    $occurs{$_}++;
}
foreach my $occur (sort keys %occurs) {
    print "$occur seen $occurs{$occur} times\n";
}
Here is one way to do it:
#!/usr/bin/perl -w
use v5.14;
use strict;

my %map;
chomp(my @words = <STDIN>);

foreach my $word (@words) {
    $map{$word} += 1;
}

foreach my $key (sort keys %map) {
    say "$key occurs $map{$key} times.";
}
It's simple, readable and easy to maintain.

PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read

I'm working on a program and I have a couple of questions, hope you can help:
First, I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and stored in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven't been able to make it work. Eventually I found a way that works:
my @atoms  = ();
my $natoms = 0;

foreach my $atomi (keys %{$atome}) {
    push(@atoms, $atomi);
    $natoms++;
}

@atoms = sort {$b cmp $a} @atoms;
and then I use it as a regex this way:
while (<IN_LIG>) {
    if (!$natoms) { last; }
    ......
    if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/) {
        $natoms--;
        .....
    }
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that the input files are not always identical, and they may contain whitespace before the index, so when I create an array from $_ I get column0 = " " instead of column0 = 8.
I don't understand why this "empty column" is not eliminated by the split command, and I'm having some trouble removing it. This is what I have done:
@info = split (/[\s]+/, $_);
if ($info[0] eq " ") { splice(@info, 0, 1); }    # also tried $info[0] =~ m/\s+/
and when I print the array @info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche
There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my #tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', #tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc split)
Also, the first part of your program might be simpler as:
my @atoms  = sort {$b cmp $a} keys %$atome;
my $natoms = @atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my @atoms = keys %$atome;

while (<IN_LIG>) {
    # The atom ID on this line
    my ($atom_id) = split ' ';

    # Is this atom ID in the array of atom IDs that we are looking for?
    if (grep { /$atom_id/ } @atoms) {
        # This line of the file has an atom that was in the array: $atom_id
    }
}
Let's warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my @atoms  = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar @atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly from the count of values in @atoms.
while (<IN_LIG>) {
    last unless $natoms;
    my $key = (split)[0];   # split splits $_ on whitespace by default
    $natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, or whether it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are metacharacters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.
For your second question, you could use this line:
@info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read all the lines of the file and use them in a hash, with $info[0] as the key and [@info[1..8]] as the value. Then you can look up the entries by your index.
my %details;
while (<IN_LIG>) {
    my @info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
    $details{ $info[0] } = [ @info[1..$#info] ];
}
Later you can look up details for the indices you are interested in and process them as needed. This assumes the index is unique (has the property of keys).
Thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. Thanks!
As for the regex, I found something that could make all the keys part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, though I liked the join/quotemeta solution better.
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong, because it wouldn't take all the keys.
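For reference, a working version of that join/quotemeta idea might look like this (a sketch; it assumes %$hash maps atom indexes to the numeric values used for sorting, as described above):

my $strings = join "|",
              map { quotemeta }
              sort { $hash->{$b} <=> $hash->{$a} } keys %{$hash};

my $re = qr/^\s*(?:$strings)\s+/;   # matches any of the atom indexes in column 0

while (<IN_LIG>) {
    if ($_ =~ $re) {
        # process this line of the molecule file
    }
}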