Recursive sorting in Perl - perl

I have a hash that contains keys that correspond to database subscripts, but the database can have multidimensional records so the key could be a single subscript, or a list of subscripts.
I need to find a way to sort these records so I can print them in a logical order.
Example:
my $data = {
'1,1,1' => 'data1',
'1,2' => 'data2',
'1,1,3' => 'stuff',
'2,1,1' => 'data3',
'2,1,2' => 'data4',
'2,1,3' => 'data blah',
'2,2,2' => 'datawk2n',
'3,1,2' => 'more',
};
# Should print the keys in the properly sorted order
print join "\n", sort some_function keys %$data;
sub some_function {
# Do some sorting magikz
}
I want it to sort by the leftmost subscript first. If the leftmost value is identical I want to move to the next value and compare those. If those are identical I want to continue to the next one ... and so on ... until all possibilities are exhausted.
This will most likely involve some recursion, but I can't figure out how to make recursion work with those fancy $a and $b variables.
What can I put in some_function to get the following output?
1,1,1
1,1,3
1,2
2,1,1
2,1,2
2,1,3
2,2,2
3,1,2

The following is the fastest solution (by far!):
my #sorted_keys =
map { join ',', unpack 'N*', $_ }
sort
map { pack 'N*', split /,/, $_ }
keys(%$data);
If you want something simpler, and still quite fast, you could use a "natural sort".
Sort::Key::Natural
use Sort::Key::Natural qw( natsort );
my #sorted_keys = natsort(keys(%$data));
Sort::Naturally
use Sort::Naturally qw( nsort );
my #sorted_keys = nsort(keys(%$data));
Benchmarks:
Rate SN SKN grt
SN 3769/s -- -40% -88%
SKN 6300/s 67% -- -79%
grt 30362/s 705% 382% --
Benchmark code:
use strict;
use warnings;
use Benchmark qw( cmpthese );
use List::Util qw( shuffle );
use Sort::Key::Natural qw( );
use Sort::Naturally qw( );
my #keys =
shuffle
split ' ',
'1 1,0 1,1 1,1,1 1,1,3 1,2 2,1,1 2,1,2 2,1,3 2,2,2 3,1,2 10,1,1';
sub grt {
my #sorted_keys =
map { join ',', unpack 'N*', $_ }
sort
map { pack 'N*', split /,/, $_ }
#keys;
}
sub SKN { my #sorted_keys = Sort::Key::Natural::natsort(#keys); }
sub SN { my #sorted_keys = Sort::Naturally::nsort(#keys); }
cmpthese(-3, {
grt => \&grt,
SKN => \&SKN,
SN => \&SN,
});

I thought the Sort::Naturally module would help you here, but it seems not
I must have had a bug in my test. This works fine
use Sort::Naturally 'nsort';
say for nsort keys %$data;
I recommend either this or the Sort::Key::Naturally solution as they are the clearest
It is bad practice to chase speed of execution, especially at the expense of readability, before there is evidence that a given solution is too slow. Even then it is foolish to randomly optimise chunks of your code in the hope of making a difference, and your solution should be run through a profiler to discover where it would be most fruitful to make enhancements
There is no need for recursion. This program shows a sort subroutine by_elements which simply compares each item in the list until it finds either a mismatch or the end of one of the lists
In the former case the result is just the comparison of the two differ elements, and in the latter it is a comparison of the number of elements in the two lists
use strict;
use warnings 'all';
use feature 'say';
my $data = {
'1,1,1' => 'data1',
'1,2' => 'data2',
'1,1,3' => 'stuff',
'2,1,1' => 'data3',
'2,1,2' => 'data4',
'2,1,3' => 'data blah',
'2,2,2' => 'datawk2n',
'3,1,2' => 'more',
'10,1,1' => 'odd',
'1,1' => 'simple',
'1,0' => 'simple0',
'1' => 'simpler',
};
say for sort by_elements keys %$data;
sub by_elements {
my ( $aa, $bb ) = map [/\d+/g], $a, $b;
for ( my $i = 0; $i < #$aa and $i < #$bb; ++$i ) {
my $cmp = $aa->[$i] <=> $bb->[$i];
return $cmp if $cmp;
}
return #$aa <=> #$bb;
}
output
1
1,0
1,1
1,1,1
1,1,3
1,2
2,1,1
2,1,2
2,1,3
2,2,2
3,1,2
10,1,1

Use natsort of Sort::Key::Natural:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use Sort::Key::Natural qw(natsort);
my %data = (
'1,1,1' => 'data1',
'1,2' => 'data2',
'1,1,3' => 'stuff',
'2,1,1' => 'data3',
'2,1,2' => 'data4',
'2,1,3' => 'data blah',
'2,2,2' => 'datawk2n',
'10,1,2' => 'more',
);
say for natsort keys %data;
Outputs:
1,1,1
1,1,3
1,2
2,1,1
2,1,2
2,1,3
2,2,2
10,1,2

No need for recursion, just a loop that you can break out of.
sub some_function {
my #aa = split /,/, $a;
my #bb = split /,/, $b;
my $cmp = 0;
for (my $i=0; $i<#aa || $i<#bb; $i++) {
$cmp = $aa[$i] <=> $bb[$i];
last if $cmp;
}
$cmp;
}
But if your heart is set on a recursive solution, there's
sub aref_sort_recurse {
my ($c,$d) = #_;
#$c ? #$d ? shift #$c <=> shift #$d || aref_sort_recurse($c,$d) ? 1 : -#$d
}
sub some_function {
aref_sort_recurse( [split /,/, $a], [split /,/, $b] )
}

Related

How can I remove the title name of each unique number in Data::Dumper in perl?

I use Data::Dumper to catch uniqe number in each element.
#!perl
use warnings;
use strict;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
my #names = qw(A A A A B B B C D);
my %counts;
$counts{$_}++ for #names;
print Dumper (\%counts);
exit;
This is output.
$VAR1 = {
'A' => 4,
'B' => 3,
'C' => 1,
'D' => 1
};
How can I remove the title name of each unique number to get output like this format?
$VAR1 = { 4 ,3 ,1 ,1 }
Presuming you want the counts in descending order, you could use the following:
printf "\$VAR1 = { %s};\n",
join ',',
map "$_ ",
sort { $b <=> $a }
values(%counts);
If instead you want the counts sorted by key,
printf "\$VAR1 = { %s};\n",
join ',',
map "$counts{$_} ",
sort
keys(%counts);
Either way, that's a really weird format. Square brackets would make more sense than curly ones.
One of many ways to get desired result
use strict;
use warnings;
use feature 'say';
my #names = qw( A A A A B B B C D );
my %counts;
$counts{$_}++ for #names;
my #values = map { $counts{$_} } sort keys %counts;
say join(',', #values);
output
4,3,1,1

Remove duplicates from a 2D array in perl

I have a 2D array in perl whose data is coming as rows in html format from a DB like the data shown below:
<tr><td>Rafa</td><td>Nadal</td><td>Data1</td></tr>,
<tr><td>Goran</td><td>Ivan</td><td>Data2</td></tr>,
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>,
<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>
i want to remove the duplicate rows from the array.
"<tr><td>Leander</td><td>Paes</td><td>Data2</td></tr>" should be removed in above case.
I tried the below piece of code, but it's not working out.
sub unique {
my %seen;
grep ! $seen{ join $;, #$_ }++, #_
}
First: you really should try not to use outdated Perl syntax and side effects.
Second: the answer depends on the data structure you generate from the input. Here are two example implementations:
#!/usr/bin/perl
use strict;
use warnings;
# 2D Array: list of array references
my #data = (
['Rafa', 'Nadal', 'Data1'],
['Goran', 'Ivan', 'Data2'],
['Leander', 'Paes', 'Data2'],
['Leander', 'Paes', 'Data2'],
);
my %seen;
foreach my $unique (
grep {
not $seen{
join('', #{ $_ })
}++
} #data
) {
print join(',', #{ $unique }), "\n";
}
print "\n";
# List of "objects", keys are table column names
#data = (
{ first => 'Rafa', last => 'Nadal', data => 'Data1' },
{ first => 'Goran', last => 'Ivan', data => 'Data2' },
{ first => 'Leander', last => 'Paes', data => 'Data2' },
{ first => 'Leander', last => 'Paes', data => 'Data2' },
);
%seen = ();
my #key_order = qw(first last data);
foreach my $unique (
grep {
not $seen{
join('', #{ $_ }{ #key_order } )
}++
} #data
) {
print join(',', #{ $unique }{ #key_order }), "\n";
}
Output:
$ perl dummy.pl
Rafa,Nadal,Data1
Goran,Ivan,Data2
Leander,Paes,Data2
Rafa,Nadal,Data1
Goran,Ivan,Data2
Leander,Paes,Data2
The shown sub is good for the job, with an array which for elements has array references. That is indeed a basic way to organize 2D data, where your rows are arrayrefs.
There are modules that can be leveraged for this, but this good old method works fine as well
use warnings;
use strict;
use Data::Dump qw(dd);
sub uniq_arys {
my %seen;
grep { not $seen{join $;, #$_}++ } #_;
}
my #data = (
[ qw(one two three) ],
[ qw(ten eleven twelve) ],
[ qw(10 11 12) ],
[ qw(ten eleven twelve) ],
);
my #data_uniq = uniq_arys(#data);
dd \#data_uniq;
Prints as expected (last row is gone), using Data::Dump to show data.
The sub works by joining each array into a string, and those are then checked for duplicates using a hash. The $; is a subscript separator, and an empty string '' is just fine instead.
This approach creates a lot of ancillary data -- in principle doubles the data -- and if performance becomes a problem it may be better to simply compare element-wise (at the cost of complexity). This can be an issue only with rather large data sets.
A module example: use uniq_by from List::UtilsBy
use List::UtilsBy qw(uniq_by);
my #no_dupes = uniq_by { join '', #$_ } #data;
This does, more or less, the same as the sub above.

Perl to sort words by user-defined alphabet sequence

I have an array of "words" (strings), which consist of letters from an "alphabet" with user-defined sequence. E.g my "alphabet" starts with "ʔ ʕ b g d", so a list of "words" (bʔd ʔbg ʕʔb bʕd) after sort by_my_alphabet should be ʔbd ʕʔb bʔd bʕd.
sort by_my_alphabet (bʔd ʔbg ʕʔb bʕd) # gives ʔbd ʕʔb bʔd bʕd
Is there a way to make a simple subroutine by_my_alphabet with $a and $b to solve this problem?
Simple, and very fast because it doesn't use a compare callback, but it needs to scan the entire string:
use utf8;
my #my_chr = split //, "ʔʕbgd";
my %my_ord = map { $my_chr[$_] => $_ } 0..$#my_chr;
my #sorted =
map { join '', #my_chr[ unpack 'W*', $_ ] } # "\x00\x01\x02\x03\x04" ⇒ "ʔʕbgd"
sort
map { pack 'W*', #my_ord{ split //, $_ } } # "ʔʕbgd" ⇒ "\x00\x01\x02\x03\x04"
#unsorted;
Optimized for long strings since it only scans a string up until a difference is found:
use utf8;
use List::Util qw( min );
my #my_chr = split //, "ʔʕbgd";
my %my_ord = map { $my_chr[$_] => $_ } 0..$#my_chr;
sub my_cmp($$) {
for ( 0 .. ( min map length($_), #_ ) - 1 ) {
my $cmp = $my_ord{substr($_[0], $_, 1)} <=> $my_ord{substr($_[1], $_, 1)};
return $cmp if $cmp;
}
return length($_[0]) <=> length($_[1]);
}
my #sorted = sort my_cmp #unsorted;
Both should be faster than Sobrique's. Theirs uses a compare callback, and it scans the entire strings being compared.
Yes.
sort can take any function that returns a relative sort position. All you need is a function that correctly looks up the 'sort value' of a string for comparing.
So all you need to do here is define a 'relative weight' of your extra letters, and then compare the two.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #sort_order = qw ( B C A D );
my #array_to_sort = qw ( A B C D A B C D AB BB CCC ABC );
my $count = 0;
my %position_of;
$position_of{$_} = $count++ for #sort_order;
print Dumper \%position_of;
sub sort_by_pos {
my #a = split //, $a;
my #b = split //, $b;
#iterate one letter at a time, using 'shift' to take it off the front
#of the array.
while ( #a and #b ) {
my $result = $position_of{shift #a} <=> $position_of{shift #b};
#result is 'true' if it's "-1" or "1" which indicates relative position.
# 0 is false, and that'll cause the next loop iteration to test the next
#letter-pair
return $result if $result;
}
#return a value based on remaining length - longest 'string' will sort last;
#That's so "AAA" comparing with "AA" comparison actually work,
return scalar #a <=> scalar #b;
}
my #new = sort { sort_by_pos } #array_to_sort;
print Dumper \#new;
Bit of a simple case, but it sorts our array into:
$VAR1 = [
'B',
'B',
'BB',
'C',
'C',
'CCC',
'A',
'A',
'AB',
'ABC',
'D',
'D'
];

Pairs as hash keys

Does anyone know how to make a hash with pairs of strings serving as keys in perl?
Something like...
{
($key1, $key2) => $value1;
($key1, $key3) => $value2;
($key2, $key3) => $value3;
etc....
You can't have a pair of scalars as a hash key, but you can make a multilevel hash:
my %hash;
$hash{$key1}{$key2} = $value1;
$hash{$key1}{$key3} = $value2;
$hash{$key2}{$key3} = $value3;
If you want to define it all at once:
my %hash = ( $key1 => { $key2 => $value1, $key3 => $value2 },
$key2 => { $key3 => $value3 } );
Alternatively, if it works for your situation, you could just concatenate your keys together
$hash{$key1 . $key2} = $value1; # etc
Or add a delimiter to separate the keys:
$hash{"$key1:$key2"} = $value1; # etc
You could use an invisible separator to join the coordinates:
Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted such as in a two-dimensional index like i⁣j.
#!/usr/bin/env perl
use utf8;
use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :utf8);
use charnames qw(:full :short);
use YAML;
my %sparse_matrix = (
mk_key(34,56) => -1,
mk_key(1200,11) => 1,
);
print Dump \%sparse_matrix;
sub mk_key { join("\N{INVISIBLE SEPARATOR}", #_) }
sub mk_vec { map [split "\N{INVISIBLE SEPARATOR}"], #_ }
~/tmp> perl mm.pl |xxd
0000000: 2d2d 2d0a 3132 3030 e281 a331 313a 2031 ---.1200...11: 1
0000010: 0a33 34e2 81a3 3536 3a20 2d31 0a .34...56: -1.
Usage: Multiple keys of a single value in a hash can be used for implementing a 2D matrix or N-dimensional matrix!
#!/usr/bin/perl -w
use warnings;
use strict;
use Data::Dumper;
my %hash = ();
my ($a, $b, $c) = (2,3,4);
$hash{"$a, $b ,$c"} = 1;
$hash{"$b, $c ,$a"} = 1;
foreach(keys(%hash) )
{
my #a = split(/,/, $_);
print Dumper(#a);
}
I do this:
{ "$key1\x1F$key2" => $value, ... }
Usually with a helper method:
sub getKey() {
return join( "\x1F", #_ );
}
{ getKey( $key1, $key2 ) => $value, ... }
----- EDIT -----
Updated the code above to use the ASCII Unit Separator per the recommendation from #chepner above
Use $; implicitly (or explicitly) in your hash keys, used for multidimensional emulation, like so:
my %hash;
$hash{$key1, $key2} = $value; # or %hash = ( $key1.$;.$key2 => $value );
print $hash{$key1, $key2} # returns $value
You can even set $; to \x1F if needed (the default is \034, from SUBSEP in awk):
local $; = "\x1F";

How to convert an array into a hash, with variable names mapped as keys in Perl?

I find myself doing this pattern a lot in perl
sub fun {
my $line = $_[0];
my ( $this, $that, $the_other_thing ) = split /\t/, $line;
return { 'this' => $this, 'that' => $that, 'the_other_thing' => $the_other_thing};
}
Obviously I can simplify this pattern by returning the output of a function which transforms a given array of variables into a map, where the keys are the same names as the variables eg
sub fun {
my $line = $_[0];
my ( $this, $that, $the_other_thing ) = split /\t/, $line;
return &to_hash( $this, $that, $the_other_thing );
}
It helps as the quantity of elements get larger. How do I do this? It looks like I could combine PadWalker & closures, but I would like a way to do this using only the core language.
EDIT: thb provided a clever solution to this problem, but I've not checked it because it bypasses a lot of the hard parts(tm). How would you do it if you wanted to rely on the core language's destructuring semantics and drive your reflection off the actual variables?
EDIT2: Here's the solution I hinted at using PadWalker & closures:
use PadWalker qw( var_name );
# Given two arrays, we build a hash by treating the first set as keys and
# the second as values
sub to_hash {
my $keys = $_[0];
my $vals = $_[1];
my %hash;
#hash{#$keys} = #$vals;
return \%hash;
}
# Given a list of variables, and a callback function, retrieves the
# symbols for the variables in the list. It calls the function with
# the generated syms, followed by the original variables, and returns
# that output.
# Input is: Function, var1, var2, var3, etc....
sub with_syms {
my $fun = shift #_;
my #syms = map substr( var_name(1, \$_), 1 ), #_;
$fun->(\#syms, \#_);
}
sub fun {
my $line = $_[0];
my ( $this, $that, $other) = split /\t/, $line;
return &with_syms(\&to_hash, $this, $that, $other);
}
You could use PadWalker to try to get the name of the variables, but that's really not something you should do. It's fragile and/or limiting.
Instead, you could use a hash slice:
sub fun {
my ($line) = #_;
my %hash;
#hash{qw( this that the_other_thing )} = split /\t/, $line;
return \%hash;
}
You can hide the slice in a function to_hash if that's what you desire.
sub to_hash {
my $var_names = shift;
return { map { $_ => shift } #$var_names };
}
sub fun_long {
my ($line) = #_;
my #fields = split /\t/, $line;
return to_hash [qw( this that the_other_thing )] #fields;
}
sub fun_short {
my ($line) = #_;
return to_hash [qw( this that the_other_thing )], split /\t/, $line;
}
But if you insist, here's the PadWalker version:
use Carp qw( croak );
use PadWalker qw( var_name );
sub to_hash {
my %hash;
for (0..$#_) {
my $var_name = var_name(1, \$_[$_])
or croak("Can't determine name of \$_[$_]");
$hash{ substr($var_name, 1) } = $_[$_];
}
return \%hash;
}
sub fun {
my ($line) = #_;
my ($this, $that, $the_other_thing) = split /\t/, $line;
return to_hash($this, $that, $the_other_thing);
}
This does it:
my #part_label = qw( part1 part2 part3 );
sub fun {
my $line = $_[0];
my #part = split /\t/, $line;
my $no_part = $#part_label <= $#part ? $#part_label : $#part;
return map { $part_label[$_] => $part[$_] } (0 .. $no_part);
}
Of course, your code must name the parts somewhere. The code above does it by qw(), but you can have your code autogenerate the names if you like.
[If you anticipate a very large list of *part_labels,* then you should probably avoid the *(0 .. $no_part)* idiom, but for lists of moderate size it works fine.]
Update in response to OP's comment below: You pose an interesting challenge. I like it. How close does the following get to what you want?
sub to_hash ($$) {
my #var_name = #{shift()};
my #value = #{shift()};
$#var_name == $#value or die "$0: wrong number of elements in to_hash()\n";
return map { $var_name[$_] => $value[$_] } (0 .. $#var_name);
}
sub fun {
my $line = $_[0];
return to_hash [qw( this that the_other_thing )], [split /\t/, $line];
}
If I understand you properly you want to build a hash by assigning a given sequence of keys to values split from a data record.
This code seems to do the trick. Please explain if I have misunderstood you.
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Terse++;
my $line = "1111 2222 3333 4444 5555 6666 7777 8888 9999\n";
print Dumper to_hash($line, qw/ class division grade group kind level rank section tier /);
sub to_hash {
my #fields = split ' ', shift;
my %fields = map {$_ => shift #fields} #_;
return \%fields;
}
output
{
'division' => '2222',
'grade' => '3333',
'section' => '8888',
'tier' => '9999',
'group' => '4444',
'kind' => '5555',
'level' => '6666',
'class' => '1111',
'rank' => '7777'
}
For a more general solution which will build a hash from any two lists, I suggest the zip_by function from List::UtilsBy
use strict;
use warnings;
use List::UtilsBy qw/zip_by/;
use Data::Dumper;
$Data::Dumper::Terse++;
my $line = "1111 2222 3333 4444 5555 6666 7777 8888 9999\n";
my %fields = zip_by { $_[0] => $_[1] }
[qw/ class division grade group kind level rank section tier /],
[split ' ', $line];
print Dumper \%fields;
The output is identical to that of my initial solution.
See also the pairwise function from List::MoreUtils which takes a pair of arrays instead of a list of array references.
Aside from parsing the Perl code yourself, a to_hash function isn't feasible using just the core language. The function being called doesn't know whether those args are variables, return values from other functions, string literals, or what have you...much less what their names are. And it doesn't, and shouldn't, care.