How can I do alpha numeric sort in Perl? - perl

I have a file which looks like this:
80,1p21
81,19q13
82,6p12.3
83,Xp11.22
84,3pter-q21
86,3q26.33
87,14q24.1-q24.2|14q24|14q22-q24
88,1q42-q43
89,11q13.1
90,2q23-q24
91,12q13
92,2q22.3
93,3p22
94,12q11-q14
95,3p21.1
97,14q24.3
98,2p16.2
And I want to sort them based on the second column. And the first column should change accordingly too. When you use the 'sort' command in Perl, it doesn't do it because it says it's not numeric. Is there a way to sort things alpha numerically in Perl?

If you read the documentation for sort, you'll see that you don't need to do a numeric sort in Perl. You can do string comparisons too.
#sorted = sort { $a cmp $b } #unsorted;
But that still leaves you with a problem as, for example, 19q will sort before 6p. So you can write your own sort function which can make whatever transformations you want before doing the comparison.
#sorted = sort my_complex_sort #unsorted;
sub my_complex_sort {
# code that compares $a and $b and returns -1, 0 or 1 as appropriate
# It's probably best in most cases to do the actual comparison using cmp or <=>
# Extract the digits following the first comma
my ($number_a) = $a =~ /,(\d+)/;
my ($number_b) = $b =~ /,(\d+)/;
# Extract the letter following those digits
my ($letter_a) = $a =~ /,\d+(a-z)/;
my ($letter_b) = $b =~ /,\d+(a-z)/;
# Compare and return
return $number_a <=> $number_b or $letter_a cmp $letter_b;
}

#!/usr/bin/env perl
use strict;
use warnings;
my #datas = map { /^(\d+),(\d*)(.*)$/; [$1, $2, $3]; } <DATA>;
my #res = sort {$a->[1] <=> $b->[1] or $a->[2] cmp $b->[2]} #datas;
foreach my $data (#res) {
my ($x, $y, $z) = #{$data};
print "$x,$y$z\n";
}
__DATA__
80,1p21
81,19q13
82,6p12.3
83,Xp11.22
84,3pter-q21
86,3q26.33
87,14q24.1-q24.2|14q24|14q22-q24
88,1q42-q43
89,11q13.1
90,2q23-q24
91,12q13
92,2q22.3
93,3p22
94,12q11-q14
95,3p21.1
97,14q24.3
98,2p16.2

I actually found the answer to this. The code looks a bit complicated though.
#!/usr/bin/env perl
use strict;
use warnings;
sub main {
my $file;
if (#ARGV != 1) {
die "Usage: perl hashofhash_sort.pl <filename>\n";
}
else {
$file = $ARGV[0];
}
open(IN, $file) or die "Error!! Cannot open the $file file: $!\n";
my #file = <IN>;
chomp #file;
my ($entrez_gene, $loci, $chr, $band, $pq, $band_num);
my (%chromosome, %loci_entrez);
foreach my $line (#file) {
if ($line =~ /(\d+),(.+)/) {
# Entrez genes
$entrez_gene = $1;
# Locus like 12p23.4
$loci = $2;
if ($loci =~ /^(\d+)(.+)?/) {
# chromosome number alone (only numericals)
$chr = $1;
if ($2) {
# locus minus chromosome number. If 12p23.4, then $band is p23.4
$band = "$2";
if ($band =~ /^([pq])(.+)/) {
# either p or q
$pq = $1;
# stores the numericals. for p23.4, stores 23.4
$band_num = $2;
}
if (exists $chromosome{$chr}) {
if (exists $chromosome{$chr}{$pq}) {
push (#{$chromosome{$chr}{$pq}}, $band_num);
}
else {
$chromosome{$chr}{$pq} = [$band_num];
}
}
else {
$chromosome{$chr}{$pq} = [$band_num];
}
}
}
}
} # End of foreach loop
foreach my $key (sort {$a <=> $b} keys %chromosome) {
my %seen = ();
foreach my $key2 (sort {$a cmp $b } keys %{$chromosome{$key}}) {
my #unique = grep { ! $seen{$_}++ } #{$chromosome{$key}{$key2}};
my #sorted = sort #unique;
foreach my $element (#sorted) {
my $sorted_locus = "$key$key2$element";
if (exists $loci_entrez{$sorted_locus}) {
foreach my $element2 (#{$loci_entrez{$sorted_locus}}) {
print "$element2,$sorted_locus\n";
}
}
}
}
}
} # End of main
main();

In the very general case, the question is ambiguous on what to do with integers that are equal but written differently, because of the possibility of leading zeros. The following comparison function (for sort) allows one to consider the lexicographic order as soon as one doesn't have different integers. This is the same as zsh's numeric sort.
sub alphanumcmp ($$)
{
my (#u,#v);
if ((#u = $_[0] =~ /^(\d+)/) &&
(#v = $_[1] =~ /^(\d+)/))
{
my $c = $u[0] <=> $v[0];
return $c if $c;
}
if ((#u = $_[0] =~ /^(.)(.*)/) &&
(#v = $_[1] =~ /^(.)(.*)/))
{
return $u[0] cmp $v[0] || &alphanumcmp($u[1],$v[1]);
}
return $_[0] cmp $_[1];
}
For instance, one would get the following sorted elements:
a0. a00. a000b a00b a0b a001b a01. a01b a1. a1b a010b a10b a011b a11b
Note 1: The use of <=> assumes that the numbers are not too large.
Note 2: In the question, the user wants to do an alphanumeric sort on the second column (instead of the whole string). So, in this particular case, the comparison function could just be adapted to ignore the first column or a Schwartzian transform could be used.

Related

Handling Nested Delimiters in perl

use strict;
use warnings;
my %result_hash = ();
my %final_hash = ();
Compare_results();
foreach my $key (sort keys %result_hash ){
print "$key \n";
print "$result_hash{$key} \n";
}
sub Compare_results
{
while ( <DATA> )
{
my($instance,$values) = split /\:/, $_;
$result_hash{$instance} = $values;
}
}
__DATA__
1:7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\ d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2:7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Output
1
7802315095\d\d,7802315098\d\d;7802025001\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
2
7802315095\d\d,7802025002\d\d,7802025003\d\d,7802025004\d\d,7802025005\d\d,7802025006\d\d,7802025007\d\d
Iam trying to fetch value of each key and again trying to split the comma seperated value from result hash , if i find a semicolon in any value i would want to store the left and right values in separate hash keys.
Something like below
1.#split the value of result_hash{$key} again by , and see whether any chunk is seperated by ;
2. #every chunk without ; and value on left with ; should be stored in
#{$final_hash{"eto"}} = ['7802315095\d\d','7802315098\d\d','7802025002\d\d','7802025003\d\d','7802025004\d\d','7802025005\d\d','7802025006\d\d','7802025007\d\d'] ;
3.#Anything found on the right side of ; has to be stored in
#{$final_hash{"pro"}} = ['7802025001\d\d'] ;
Is there a way that i can handle everything in the subroutine? Can i make the code more simpler
Update :
I tried splitting the string in a single shot, but its just picking the values with semicolon and ignoring everything
foreach my $key (sort keys %result_hash ){
# print "$key \n";
# print "$result_hash{$key} \n";
my ($o,$t) = split(/,|;/, $result_hash{$key});
print "Left : $o \n";
print "Left : $t \n";
#push #{$final_hash{"eto"}}, $o;
#push #{$final_hash{"pro"}} ,$t;
}
}
My updated code after help
sub Compare_results
{
open my $fh, '<', 'Data_File.txt' or die $!;
# split by colon and further split by , and ; if any (done in insert_array)
my %result_hash = map { chomp; split ':', $_ } <$fh> ;
foreach ( sort { $a <=> $b } (keys %result_hash) )
{
($_ < 21)
? insert_array($result_hash{$_}, "west")
: insert_array($result_hash{$_}, "east");
}
}
sub insert_array()
{
my ($val,$key) = #_;
foreach my $field (split ',', $val)
{
$field =~ s/^\s+|\s+$//g; # / turn off editor coloring
if ($field !~ /;/) {
push #{ $file_data{"pto"}{$key} }, $field ;
}
else {
my ($left, $right) = split ';', $field;
push #{$file_data{"pto"}{$key}}, $left if($left ne '') ;
push #{$file_data{"ero"}{$key}}, $right if($right ne '') ;
}
}
}
Thanks
Update Added a two-pass regex, at the end
Just proceed systematically, analyze the string step by step. The fact that you need consecutive splits and a particular separation rule makes it unwieldy to do in one shot. Better have a clear method than a monster statement.
use warnings 'all';
use strict;
use feature 'say';
my (%result_hash, %final_hash);
Compare_results();
say "$_ => $result_hash{$_}" for sort keys %result_hash;
say '---';
say "$_ => [ #{$final_hash{$_}} ]" for sort keys %final_hash;
sub Compare_results
{
%result_hash = map { chomp; split ':', $_ } <DATA>;
my (#eto, #pro);
foreach my $val (values %result_hash)
{
foreach my $field (split ',', $val)
{
if ($field !~ /;/) { push #eto, $field }
else {
my ($left, $right) = split ';', $field;
push #eto, $left;
push #pro, $right;
}
}
}
$final_hash{eto} = \#eto;
$final_hash{pro} = \#pro;
return 1; # but add checks above
}
There are some inefficiencies here, and no error checking, but the method is straightforward. If your input is anything but smallish please change the above to process line by line, what you clearly know how to do. It prints
1 => ... (what you have in the question)
---
eto => [ 7802315095\d\d 7802315098\d\d 7802025002\d\d 7802025003\d\ d ...
pro => [ 7802025001\d\d ]
Note that your data does have one loose \d\ d.
We don't need to build the whole hash %result_hash for this but only need to pick the part of the line after :. I left the hash in since it is declared global so you may want to have it around. If it in fact isn't needed on its own this simplifies
sub Compare_results {
my (#eto, #pro);
while (<DATA>) {
my ($val) = /:(.*)/;
foreach my $field (split ',', $val)
# ... same
}
# assign to %final_hash, return from sub
}
Thanks to ikegami for comments.
Just for the curiosity's sake, here it is in two passes with regex
sub compare_rx {
my #data = map { (split ':', $_)[1] } <DATA>;
$final_hash{eto} = [ map { /([^,;]+)/g } #data ];
$final_hash{pro} = [ map { /;([^,;]+)/g } #data ];
return 1;
}
This picks all characters which are not , or ;, using the negated character class, [^,;]. So that is up to the first either of them, left to right. It does this globally, /g, so it keeps going through the string, collecting all fields that are "left of" , or ;. Then it cheats a bit, picking all [^,;] that are right of ;. The map is used to do this for all lines of data.
If %result_hash is needed build it instead of #data and then pull the values from it with my #values = values %hash_result and feed the map with #values.
Or, broken line by line (again, you can build %result_hash instead of taking $data directly)
my (#eto, #pro);
while (<DATA>) {
my ($data) = /:(.*)/;
push #eto, $data =~ /([^,;]+)/g;
push #pro, $data =~ /;([^,;]+)/g;
}

A simple variable count inside array

After working with this code, I am stuck at what I think is a simple error, yet I need outside eyes to see what is wrong.
I used unpack function to divide an array into the following.
#extract =
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Apparently, after unpacking into the array, when I try to go into the while loop, #extract shows up completely empty. Any idea as to why this is happening?
print #extract; #<-----------Prints input
my $sum = 0;
my %counter = ();
while (my $column = #extract) {
print #extract; #<------- This extract is completely empty. Should be input
for (my $aa = (split ('', $column))){
$counter{$aa}++;
delete $counter{'-'}; # Don't count -
}
# Sort keys by count descending
my #keys = (sort {$counter{$b} <=> $counter{$a}} keys %counter) [0]; #gives highest letter
for my $key (#keys) {
$sum += $counter{$key};
print OUTPUT "$key $counter{$key} ";
Each line is an array element correct? I don't see in your code where you are checking the individual characters.
Assuming the input that you have shown is a 3 element array containing the line as a string:
#!/usr/bin/perl
use strict;
use warnings;
my #entries;
while(my $line = shift(#extract)){
my %hash;
for my $char(split('', $line)){
if($char =~ /[a-zA-Z]/) { $hash{$char}++ }
}
my $high;
for my $key (keys %hash) {
if(!defined($high)){ $high = $key }
elsif($hash{$high} < $hash{$key}){
$high = $key
}
}
push #entries, {$high => $hash{$high}};
}
Note this empties #extract, if you don't want to do that you'd have to use a for loop like below
for my $i (0 .. $#extract){
#my %hash etc...
}
EDIT:
Changed it so that only the highest number is actually kept
An approach using reduce from List::Util.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'reduce';
my #extract = qw/
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
/;
for (#extract) {
my %count;
tr/a-zA-Z//cd;
for (split //) {
$count{$_}++;
}
my $max = reduce { $count{$a} > $count{$b} ? $a : $b } keys %count;
print "$max $count{$max}\n";
}

Perl Hash of Hash Output

I'm reading a file. I want a hash that gives me the first number of a line as a key to a hash of all the numbers of the rest of the line to 1.
I believe I'm adding the hash correctly, because Dumper prints correctly.
However, print "$first $secondID\n" is not giving me any output.
while (<FILE>) {
chomp $_;
if (/(\d+)\t(.+)/) {
$firstNum = $1;
#seconds = split(/\,/,$2);
foreach $following (#seconds) {
$Pairs->{$firstNum}{$following} = 1;
}
foreach $first (sort {$a <=> $b} keys %Pairs) {
print "$first\n";
%second = {$Pairs{$first}};
foreach $secondID (sort {$a <=> $b} keys %second) {
print "$first $secondID\n";
}
}
print Dumper($Pairs);
}
else {
print "ERROR\n";
}
}
Later on, given a pair of numbers I would like to look up to see whether $Pairs{$num1}{$num2} is defined. would I write
if(defined $Pairs{$num1}{$num2})
Or should I check the first key first. Then check the second key
if (defined $Pairs{$num1}) {
$temp = $Pairs{$num1};
if (defined $temp{$num2}) {
print "true\n;
}
}
You have a couple of errors. Firstly you seem to be unsure whether you are using %Pairs or $Pairs to store your hash, and secondly you have %second = {$Pairs{$first}}, which tries to assign a hash reference to the hash %second. Presumably you want my %second = %{ $Pairs{$first} }.
You should always use strict and use warnings at the start of all your Perl programs, and declare all variables at the point of first use using my. This will alert you to simple mistakes you could otherwise easily overlook, and would have shown up your use of both %Pairs and $Pairs in this program, as well as your attempt to assign a single value (a hash reference) to a hash.
Rather than copying the entire hash, you should save a reference to it in $seconds. Then you can dereference it in the following for loop.
Experienced Perl programmers would also thank you for using lower-case plus underscore for local (my) variables, and reserving capitals for package and class names.
This program works as you intended, and expects the file name as a command-line parameter:
use strict;
use warnings;
my %pairs;
while (<>) {
unless ( /(\d+)\s+(.+)/ ) {
print "ERROR\n";
next;
}
my $first_num = $1;
my #seconds = split /,/, $2;
foreach my $following (#seconds) {
$pairs{$first_num}{$following} = 1;
}
foreach my $first (sort { $a <=> $b } keys %pairs) {
print "$first\n";
my $second = $pairs{$first};
foreach my $second_id (sort { $a <=> $b } keys %$second) {
print "$first $second_id\n";
}
}
}
my %hash;
while ( <> ) {
my #numbers = split /\D+/;
my $key = shift #numbers;
#{$hash{$key}}{ #numbers } = ( 1 ) x #numbers;
}
# test it this way...
if ( $hash{ $num1 }{ $num2 } ) {
}
Use:
%second = %{$Pairs->{$first}};

How can I sort a list of strings by numbers in them?

I have a list of filenames which are like so:
fw_d.log.1.gz
through
fw_d.log.300.gz
When I use this code block below, it almost sorts it the way I want, but not quite:
#!/usr/bin/perl -w
my $basedir = "/var/log";
my #verdir = qw(fw_d);
my $fulldir;
my $configs;
my $combidir;
foreach $combidir (#verdir) {
$fulldir = "$basedir/$combidir";
opendir (DIR, $fulldir);
my #files = grep { $_ ne '.' && $_ ne '..' && $_ ne 'CVS' readdir DIR;
closedir (DIR);
#files1 = sort {$a cmp $b}(#files);
foreach my $configs (#files1) {
print "Checking $configs\n";
system("less $basedir/$combidir/$configs | grep \'.* Group = , Username = .* autheauthenticated.\' >> output.log" );
}
}
Here is a snippet output:
Checking fw_d.log
Checking fw_d.log.1.gz
Checking fw_d.log.10.gz
Checking fw_d.log.100.gz
Checking fw_d.log.101.gz
Checking fw_d.log.102.gz
As you can see, it almost sorts it how I was hoping... Does anyone have any suggestions, on either reading, or a code snippet I can use?
You could use Schartzian-transform :
my #sorted = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [$_, $_=~/(\d+)/] }
#files;
print Dumper \#sorted;
Added benchmark for comparison between Schwartzian-Transform and subroutine
use Benchmark qw(:all);
# build list of files
my #files = map {'fw_d.log.'.int(rand()*1000).'.log' } 0 ..300;
my $count = -3;
my $r = cmpthese($count, {
'subname' => sub {
sub expand {
my $file=shift;
$file=~s{(\d+)}{sprintf "%04d", $1}eg;
return $file;
}
my #sorted = sort { expand($a) cmp expand($b) } #files;
},
'schwartzian' => sub {
my #sorted = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [$_, $_=~/(\d+)/] }
#files;
}
});
Result:
Rate subname schwartzian
subname 21.2/s -- -92%
schwartzian 279/s 1215% --
Schwartzian-transform is about 13 times more efficient for sorting 300 files.
the problem is that the code does what you tell it to do: sort the file names in alphabetical order.
You should replace sort { $a cmp $b } by sort { expand($a) cmp expand($b) }
with expand:
sub expand
{ my $file=shift;
$file=~s{(\d+)}{sprintf "%04d", $1}eg; # expand all numbers to 4 digits
return $file;
}
What you can try is using a custom sort function:
sub sort_by_number {
$a =~ /(\d+)/;
$numa = $1;
$b =~ /(\d+)/;
$numb = $1;
return $numa <=> $numb;
}
and then sort like this:
#files1 = sort sort_by_number #files;
This will sort the strings in #files by the value of the first number in each string.
Older question, but there's an answer as yet unmentioned.
Sort::Naturally does this for you:
Sort lexically, but sort numeral parts numerically
#!/usr/bin/env perl
use strict;
use warnings;
use Sort::Naturally;
print nsort <DATA>;
__DATA__
fw_d.log
fw_d.log.101.gz
fw_d.log.1.gz
fw_d.log.10.gz
fw_d.log.100.gz
fw_d.log.2.gz
fw_d.log.102.gz
fw_d.log.12.gz
This orders as:
fw_d.log
fw_d.log.1.gz
fw_d.log.2.gz
fw_d.log.10.gz
fw_d.log.12.gz
fw_d.log.100.gz
fw_d.log.101.gz
fw_d.log.102.gz

Sorting a directory in perl, taking numbers into account

I think I need some sort of Schwartzian Transform to get this working, but I'm having trouble figuring it out, as perl isn't my strongest language.
I have a directory with contents as such:
album1.htm
album2.htm
album3.htm
....
album99.htm
album100.htm
I'm trying to get the album with the highest number from this directory (in this case, album100.htm). Note that timestamps on the files are not a reliable means of determining things, as people are adding old "missing" albums after the fact.
The previous developer simply used the code snippet below, but this clearly breaks down once there are more than 9 albums in a directory.
opendir(DIR, PATH) || print $!;
#files = readdir(DIR);
foreach $file ( sort(#files) ) {
if ( $file =~ /album/ ) {
$last_file = $file;
}
}
If you just need to find the album with the highest number, you don't really need to sort the list, just run through it and keep track of the maximum.
#!/usr/bin/perl
use strict;
use warnings;
my $max = 0;
while ( <DATA> ) {
my ($album) = $_ =~ m/album(\d+)/;
$max = $album if $album > $max;
}
print "album$max.htm";
__DATA__
album1.htm
album100.htm
album2.htm
album3.htm
album99.htm
To find the highest number, try a custom sort...
sub sort_files {
(my $num_a = $a) =~ s/^album(\d+)\.htm$/$1/;
(my $num_b = $b) =~ s/^album(\d+)\.htm$/$1/;
return $num_a <=> $num_b;
}
my #sorted = sort \&sort_files #files;
my $last = pop #sorted;
Also, take a look at the File::Next module. It will let you pick out just the files that begin with the word "album". I find it a little easier than readdir.
The reason why you're encountering difficulties is the operator, <=> is the numeric comparison, cmp is the default and it is string comparison.
$ perl -E'say for sort qw/01 1 02 200/';
01
02
1
200
With a slight modification we get something much closer to correct:
$ perl -E'say for sort { $a <=> $b } qw/01 1 02 200/';
01
1
02
200
However, in your case you need to remove the non digits.
$ perl -E'say for sort { my $s1 = $a =~ m/(\d+)/; my $s2 = $b =~ /(\d+)/; $s1 <=> $s2 } qw/01 1 02 200/';
01
1
02
200
Here is it more pretty:
sort {
my $s1 = $a =~ m/(\d+)/;
my $s2 = $b =~ /(\d+)/;
$s1 <=> $s2
}
This isn't flawless, but it should give you a good idea of your issue with sort.
Oh, and as a follow up, the Shcwartzian Transform solves a different problem: it stops you from having to run a complex task (unlike the one you're needing -- a regex) multiple times in the search algorithm. It comes at a memory cost of having to cache the results (not to be unexpected). Essentially, what you do is map the input of the problem, to the output (typically in an array) [$input, $output] then you sort on the outputs $a->[1] <=> $b->[1]. With your stuff now sorted you map back over to get your original inputs $_->[0].
map $_->[0],
sort { $a->[1] <=> $b->[1] }
map [ $_, fn($_) ]
, qw/input list here/
;
It is cool because it is so compact while being so efficient.
Here you go, using Schwartzian Transform:
my #files = <DATA>;
print join '',
map { $_->[1] }
sort { $a->[0] <=> $b->[0] }
map { [ m/album(\d+)/, $_ ] }
#files;
__DATA__
album12.htm
album1.htm
album2.htm
album10.htm
Here's an alternative solution using reduce:
use strict;
use warnings;
use List::Util 'reduce';
my $max = reduce {
my ($aval, $bval) = ($a =~ m/album(\d+)/, $b =~ m/album(\d+)/);
$aval > $bval ? $a : $b
} <DATA>;
print "max album is $max\n";
__DATA__
album1.htm
album100.htm
album2.htm
album3.htm
album99.htm
Here's a generic solution:
my #sorted_list
= map { $_->[0] } # we stored it at the head of the list, so we can pull it out
sort {
# first test a normalized version
my $v = $a->[1] cmp $b->[1];
return $v if $v;
my $lim = #$a > #$b ? #$a : #$b;
# we alternate between ascii sections and numeric
for ( my $i = 2; $i < $lim; $i++ ) {
$v = ( $a->[$i] || '' ) cmp ( $b->[$i] || '' );
return $v if $v;
$i++;
$v = ( $a->[$i] || 0 ) <=> ( $b->[$i] || 0 );
return $v if $v;
}
return 0;
}
map {
# split on digits and retain captures in place.
my #parts = split /(\d+)/;
my $nstr = join( '', map { m/\D/ ? $_ : '0' x length() } #parts );
[ $_, $nstr, #parts ];
} #directory_names
;