Perl sort genomic positions - perl

I have a list of genomic positions in the format
chromosome:start-end
for example
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
I want to sort this by chromosome number and numerical start position to get this:
chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-200
chrX:100-200
What is a good and efficient way to do this in perl?

Just use the module Sort::Keys::Natural:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
print natsort <DATA>;
__DATA__
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
chrY:100-200
chrX:1-100
chr10:100-150
Outputs:
chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-150
chr10:100-200
chrX:1-100
chrX:100-200
chrY:100-200

You can sort this by providing a custom comparator. It appears that you want a two level value as the sorting key, so your custom comparator would derive the key for a row and then compare that:
# You want karyotypical sorting on the first element,
# so set up this hash with an appropriate normalized value
# per available input:
my %karyotypical_sort = (
1 => 1,
...
X => 100,
);
sub row_to_sortable {
my $row = shift;
$row =~ /chr(.+):(\d+)-/; # assuming match here! Be careful
return [$karyotypical_sort{$1}, $2];
}
sub sortable_compare {
my ($one, $two) = #_;
return $one->[0] <=> $two->[0] || $one->[1] <=> $two->[1];
# If first comparison returns 0 then try the second
}
#lines = ...
print join "\n", sort {
sortable_compare(row_to_sortable($a), row_to_sortable($b))
} #lines;
Since the calculation would be slightly onerous (string manipulation is not free) and since you are probably dealing with a lot of data (genomes!) it is likely you will notice improved performance if you perform a Schwartzian Transform. This is performed by precalculating the sort key for the row and then sorting using that and finally removing the additional data:
#st_lines = map { [ row_to_sortable($_), $_ ] } #lines;
#sorted_st_lines = sort { sortable_compare($a->[0], $b->[0]) } #st_lines;
#sorted_lines = map { $_->[1] } #sorted_st_lines;
Or combined:
print join "\n",
map { $_->[1] }
sort { sortable_compare($a->[0], $b->[0]) }
map { [ row_to_sortable($_), $_ ] } #lines;

It looks to me like you want to sort in order of the following:
By Chromosome Number
Then by the Start Position
Then (maybe) by the End Position.
So, perhaps a custom sort like this:
use strict;
use warnings;
print sort {
my #a = split /chr|:|-/, $a;
my #b = split /chr|:|-/, $b;
"$a[1]$b[1]" !~ /\D/ ? $a[1] <=> $b[1] : $a[1] cmp $b[1]
or $a[2] <=> $b[2]
or $a[3] <=> $b[3]
} <DATA>;
__DATA__
chr1:100-110
chr1:1000-1100
chr1:200-300
chr10:100-200
chr2:100-200
chrX:100-200
chrY:100-200
chrX:1-100
chr10:100-150
Outputs:
chr1:100-110
chr1:200-300
chr1:1000-1100
chr2:100-200
chr10:100-150
chr10:100-200
chrX:1-100
chrX:100-200
chrY:100-200

You could do something like this the following script, which takes a text file given your above input. The sorting on the chromosome number would need to change a bit because it's not purely lexical or numerical. But i'm sure you could tweak what I have below:
use strict;
my %chromosomes;
while(<>){
if ($_ =~ /^chr(\w+):(\d+)-\d+$/)
{
my $chr_num = $1;
my $chr_start = $2;
$chromosomes{$1}{$2} = $_;
}
}
my #chr_nums = sort(keys(%chromosomes));
foreach my $chr_num (#chr_nums) {
my #chr_starts = sort { $a <=> $b }(keys(%{$chromosomes{$chr_num}}));
foreach my $chr_start (#chr_starts) {
print "$chromosomes{$chr_num}{$chr_start}";
}
}
1;

There is a similar question asked and answered here:
How to do alpha numeric sort perl?
What you are likely looking for is a general numeric sort, like using sort -g.

Related

Perl sort array by pattern match

I would like to sort this array based on the value after the comma
my #coords;
$coords[0] = "33.7645539, -84.3585973";
$coords[1] = "33.7683870, -84.3559850";
$coords[2] = "33.7687753, -84.3541355";
foreach my $coord (#sorted_coords) {
print "$coord\n";
}
Output:
33.7687753, -84.3541355
33.7683870, -84.3559850
33.7645539, -84.3585973
I've thought about using map, grep, and capture groups as the list input for sort, but I haven't gotten very far:
my #sorted_coords = sort { $a <=> $b } map {$_ =~ /, (-*\d+\.\d+)/} #unique_coords;
It is easy to submit to the temptation to use a fancy implementation instead of something straightforward and clear. Unless the data set is huge, the speed advantage of using a transform is negligible, and comes at the cost of much reduced legibility
A standard sort block is all that's necessary here
use strict;
use warnings;
my #coords = (
"33.7645539, -84.3585973",
"33.7683870, -84.3559850",
"33.7687753, -84.3541355",
);
my #sorted_coords = sort {
my ($aa, $bb) = map { (split)[1] } $a, $b;
$bb <=> $aa;
} #coords;
print "$_\n" for #sorted_coords;
output
33.7687753, -84.3541355
33.7683870, -84.3559850
33.7645539, -84.3585973
Update
If you prefer, the second field may be extracted from the input records using a regex instead. Replacing the map statement with something like this
my ($aa, $bb) = map /.*(\S+)/, $a, $b;
will work fine
Looks like you could use a Schwartzian transform. You had the right idea:
my #coords;
$coords[1] = "33.7683870, -84.3559850";
$coords[2] = "33.7687753, -84.3541355";
$coords[0] = "33.7645539, -84.3585973";
my #sorted_coords = map { $_->[0] } # 3. extract the first element
sort { $b->[1] <=> $a->[1] } # 2. sort on the second
# element, descending
map { [ $_, /,\s*(\S+)$/ ] } # 1. create list of array refs
#coords;
foreach my $coord (#sorted_coords) {
print "$coord\n";
}
Edit: Adding Joshua's suggestion:
my #sorted_coords = map { join ', ', #$_ }
sort { $b->[1] <=> $a->[1] }
map { [ split /, / ] }
#coords;
It seems easier to look at and more descriptive than my original example.

A simple variable count inside array

After working with this code, I am stuck at what I think is a simple error, yet I need outside eyes to see what is wrong.
I used unpack function to divide an array into the following.
#extract =
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
Apparently, after unpacking into the array, when I try to go into the while loop, #extract shows up completely empty. Any idea as to why this is happening?
print #extract; #<-----------Prints input
my $sum = 0;
my %counter = ();
while (my $column = #extract) {
print #extract; #<------- This extract is completely empty. Should be input
for (my $aa = (split ('', $column))){
$counter{$aa}++;
delete $counter{'-'}; # Don't count -
}
# Sort keys by count descending
my #keys = (sort {$counter{$b} <=> $counter{$a}} keys %counter) [0]; #gives highest letter
for my $key (#keys) {
$sum += $counter{$key};
print OUTPUT "$key $counter{$key} ";
Each line is an array element correct? I don't see in your code where you are checking the individual characters.
Assuming the input that you have shown is a 3 element array containing the line as a string:
#!/usr/bin/perl
use strict;
use warnings;
my #entries;
while(my $line = shift(#extract)){
my %hash;
for my $char(split('', $line)){
if($char =~ /[a-zA-Z]/) { $hash{$char}++ }
}
my $high;
for my $key (keys %hash) {
if(!defined($high)){ $high = $key }
elsif($hash{$high} < $hash{$key}){
$high = $key
}
}
push #entries, {$high => $hash{$high}};
}
Note this empties #extract, if you don't want to do that you'd have to use a for loop like below
for my $i (0 .. $#extract){
#my %hash etc...
}
EDIT:
Changed it so that only the highest number is actually kept
An approach using reduce from List::Util.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'reduce';
my #extract = qw/
------MMMMMMMMMMMMMMMMMMMMMMMMMM-M-MMMMMMMM
------SSSSSSSSSSSSSSSSSSSSSSSSSS-S-SSSSSDTA
------TIIIIIIIIIIIIITIIIVVIIIIII-I-IIIIITTT
/;
for (#extract) {
my %count;
tr/a-zA-Z//cd;
for (split //) {
$count{$_}++;
}
my $max = reduce { $count{$a} > $count{$b} ? $a : $b } keys %count;
print "$max $count{$max}\n";
}

How can I do alpha numeric sort in Perl?

I have a file which looks like this:
80,1p21
81,19q13
82,6p12.3
83,Xp11.22
84,3pter-q21
86,3q26.33
87,14q24.1-q24.2|14q24|14q22-q24
88,1q42-q43
89,11q13.1
90,2q23-q24
91,12q13
92,2q22.3
93,3p22
94,12q11-q14
95,3p21.1
97,14q24.3
98,2p16.2
And I want to sort them based on the second column. And the first column should change accordingly too. When you use the 'sort' command in Perl, it doesn't do it because it says it's not numeric. Is there a way to sort things alpha numerically in Perl?
If you read the documentation for sort, you'll see that you don't need to do a numeric sort in Perl. You can do string comparisons too.
#sorted = sort { $a cmp $b } #unsorted;
But that still leaves you with a problem as, for example, 19q will sort before 6p. So you can write your own sort function which can make whatever transformations you want before doing the comparison.
#sorted = sort my_complex_sort #unsorted;
sub my_complex_sort {
# code that compares $a and $b and returns -1, 0 or 1 as appropriate
# It's probably best in most cases to do the actual comparison using cmp or <=>
# Extract the digits following the first comma
my ($number_a) = $a =~ /,(\d+)/;
my ($number_b) = $b =~ /,(\d+)/;
# Extract the letter following those digits
my ($letter_a) = $a =~ /,\d+(a-z)/;
my ($letter_b) = $b =~ /,\d+(a-z)/;
# Compare and return
return $number_a <=> $number_b or $letter_a cmp $letter_b;
}
#!/usr/bin/env perl
use strict;
use warnings;
my #datas = map { /^(\d+),(\d*)(.*)$/; [$1, $2, $3]; } <DATA>;
my #res = sort {$a->[1] <=> $b->[1] or $a->[2] cmp $b->[2]} #datas;
foreach my $data (#res) {
my ($x, $y, $z) = #{$data};
print "$x,$y$z\n";
}
__DATA__
80,1p21
81,19q13
82,6p12.3
83,Xp11.22
84,3pter-q21
86,3q26.33
87,14q24.1-q24.2|14q24|14q22-q24
88,1q42-q43
89,11q13.1
90,2q23-q24
91,12q13
92,2q22.3
93,3p22
94,12q11-q14
95,3p21.1
97,14q24.3
98,2p16.2
I actually found the answer to this. The code looks a bit complicated though.
#!/usr/bin/env perl
use strict;
use warnings;
sub main {
my $file;
if (#ARGV != 1) {
die "Usage: perl hashofhash_sort.pl <filename>\n";
}
else {
$file = $ARGV[0];
}
open(IN, $file) or die "Error!! Cannot open the $file file: $!\n";
my #file = <IN>;
chomp #file;
my ($entrez_gene, $loci, $chr, $band, $pq, $band_num);
my (%chromosome, %loci_entrez);
foreach my $line (#file) {
if ($line =~ /(\d+),(.+)/) {
# Entrez genes
$entrez_gene = $1;
# Locus like 12p23.4
$loci = $2;
if ($loci =~ /^(\d+)(.+)?/) {
# chromosome number alone (only numericals)
$chr = $1;
if ($2) {
# locus minus chromosome number. If 12p23.4, then $band is p23.4
$band = "$2";
if ($band =~ /^([pq])(.+)/) {
# either p or q
$pq = $1;
# stores the numericals. for p23.4, stores 23.4
$band_num = $2;
}
if (exists $chromosome{$chr}) {
if (exists $chromosome{$chr}{$pq}) {
push (#{$chromosome{$chr}{$pq}}, $band_num);
}
else {
$chromosome{$chr}{$pq} = [$band_num];
}
}
else {
$chromosome{$chr}{$pq} = [$band_num];
}
}
}
}
} # End of foreach loop
foreach my $key (sort {$a <=> $b} keys %chromosome) {
my %seen = ();
foreach my $key2 (sort {$a cmp $b } keys %{$chromosome{$key}}) {
my #unique = grep { ! $seen{$_}++ } #{$chromosome{$key}{$key2}};
my #sorted = sort #unique;
foreach my $element (#sorted) {
my $sorted_locus = "$key$key2$element";
if (exists $loci_entrez{$sorted_locus}) {
foreach my $element2 (#{$loci_entrez{$sorted_locus}}) {
print "$element2,$sorted_locus\n";
}
}
}
}
}
} # End of main
main();
In the very general case, the question is ambiguous on what to do with integers that are equal but written differently, because of the possibility of leading zeros. The following comparison function (for sort) allows one to consider the lexicographic order as soon as one doesn't have different integers. This is the same as zsh's numeric sort.
sub alphanumcmp ($$)
{
my (#u,#v);
if ((#u = $_[0] =~ /^(\d+)/) &&
(#v = $_[1] =~ /^(\d+)/))
{
my $c = $u[0] <=> $v[0];
return $c if $c;
}
if ((#u = $_[0] =~ /^(.)(.*)/) &&
(#v = $_[1] =~ /^(.)(.*)/))
{
return $u[0] cmp $v[0] || &alphanumcmp($u[1],$v[1]);
}
return $_[0] cmp $_[1];
}
For instance, one would get the following sorted elements:
a0. a00. a000b a00b a0b a001b a01. a01b a1. a1b a010b a10b a011b a11b
Note 1: The use of <=> assumes that the numbers are not too large.
Note 2: In the question, the user wants to do an alphanumeric sort on the second column (instead of the whole string). So, in this particular case, the comparison function could just be adapted to ignore the first column or a Schwartzian transform could be used.

Sorting a directory in perl, taking numbers into account

I think I need some sort of Schwartzian Transform to get this working, but I'm having trouble figuring it out, as perl isn't my strongest language.
I have a directory with contents as such:
album1.htm
album2.htm
album3.htm
....
album99.htm
album100.htm
I'm trying to get the album with the highest number from this directory (in this case, album100.htm). Note that timestamps on the files are not a reliable means of determining things, as people are adding old "missing" albums after the fact.
The previous developer simply used the code snippet below, but this clearly breaks down once there are more than 9 albums in a directory.
opendir(DIR, PATH) || print $!;
#files = readdir(DIR);
foreach $file ( sort(#files) ) {
if ( $file =~ /album/ ) {
$last_file = $file;
}
}
If you just need to find the album with the highest number, you don't really need to sort the list, just run through it and keep track of the maximum.
#!/usr/bin/perl
use strict;
use warnings;
my $max = 0;
while ( <DATA> ) {
my ($album) = $_ =~ m/album(\d+)/;
$max = $album if $album > $max;
}
print "album$max.htm";
__DATA__
album1.htm
album100.htm
album2.htm
album3.htm
album99.htm
To find the highest number, try a custom sort...
sub sort_files {
(my $num_a = $a) =~ s/^album(\d+)\.htm$/$1/;
(my $num_b = $b) =~ s/^album(\d+)\.htm$/$1/;
return $num_a <=> $num_b;
}
my #sorted = sort \&sort_files #files;
my $last = pop #sorted;
Also, take a look at the File::Next module. It will let you pick out just the files that begin with the word "album". I find it a little easier than readdir.
The reason why you're encountering difficulties is the operator, <=> is the numeric comparison, cmp is the default and it is string comparison.
$ perl -E'say for sort qw/01 1 02 200/';
01
02
1
200
With a slight modification we get something much closer to correct:
$ perl -E'say for sort { $a <=> $b } qw/01 1 02 200/';
01
1
02
200
However, in your case you need to remove the non digits.
$ perl -E'say for sort { my $s1 = $a =~ m/(\d+)/; my $s2 = $b =~ /(\d+)/; $s1 <=> $s2 } qw/01 1 02 200/';
01
1
02
200
Here is it more pretty:
sort {
my $s1 = $a =~ m/(\d+)/;
my $s2 = $b =~ /(\d+)/;
$s1 <=> $s2
}
This isn't flawless, but it should give you a good idea of your issue with sort.
Oh, and as a follow up, the Shcwartzian Transform solves a different problem: it stops you from having to run a complex task (unlike the one you're needing -- a regex) multiple times in the search algorithm. It comes at a memory cost of having to cache the results (not to be unexpected). Essentially, what you do is map the input of the problem, to the output (typically in an array) [$input, $output] then you sort on the outputs $a->[1] <=> $b->[1]. With your stuff now sorted you map back over to get your original inputs $_->[0].
map $_->[0],
sort { $a->[1] <=> $b->[1] }
map [ $_, fn($_) ]
, qw/input list here/
;
It is cool because it is so compact while being so efficient.
Here you go, using Schwartzian Transform:
my #files = <DATA>;
print join '',
map { $_->[1] }
sort { $a->[0] <=> $b->[0] }
map { [ m/album(\d+)/, $_ ] }
#files;
__DATA__
album12.htm
album1.htm
album2.htm
album10.htm
Here's an alternative solution using reduce:
use strict;
use warnings;
use List::Util 'reduce';
my $max = reduce {
my ($aval, $bval) = ($a =~ m/album(\d+)/, $b =~ m/album(\d+)/);
$aval > $bval ? $a : $b
} <DATA>;
print "max album is $max\n";
__DATA__
album1.htm
album100.htm
album2.htm
album3.htm
album99.htm
Here's a generic solution:
my #sorted_list
= map { $_->[0] } # we stored it at the head of the list, so we can pull it out
sort {
# first test a normalized version
my $v = $a->[1] cmp $b->[1];
return $v if $v;
my $lim = #$a > #$b ? #$a : #$b;
# we alternate between ascii sections and numeric
for ( my $i = 2; $i < $lim; $i++ ) {
$v = ( $a->[$i] || '' ) cmp ( $b->[$i] || '' );
return $v if $v;
$i++;
$v = ( $a->[$i] || 0 ) <=> ( $b->[$i] || 0 );
return $v if $v;
}
return 0;
}
map {
# split on digits and retain captures in place.
my #parts = split /(\d+)/;
my $nstr = join( '', map { m/\D/ ? $_ : '0' x length() } #parts );
[ $_, $nstr, #parts ];
} #directory_names
;

How can add values in each row and column and print at the end in Perl?

Below is the sample csv file
date,type1,type2,.....
2009-07-01,n1,n2,.....
2009-07-02,n21,n22,....
and so on...
I want to add the values in each row and each column and print at the end and bottom of each line. i.e.
date,type1,type2
2009-07-01,n1,n2,.....row_total1
2009-07-02,n21,n22,....row_total2
Total,col_total1,col_total1,......total
Please suggest.
Less elegant and shorter:
$ perl -plaF, -e '$r=0;$r+=$F[$_],$c[$_]+=$F[$_]for 1..$#F;$_.=",$r";END{$c[0]="Total";print join",",#c}'
Quick and dirty, but should do the trick in basic cases. For anything more complex, use Text::CSV and an actual script.
An expanded version as it's getting a little hairy:
#! perl -plaF,
$r=0;
$r+=$F[$_], $c[$_]+=$F[$_] for 1..$#F;
$_.=",$r";
END { $c[0]="Total"; print join ",", #c }'
Here is a straightforward way which you can easily build upon depending on your requirements:
use strict;
use warnings;
use 5.010;
use List::Util qw(sum);
use List::MoreUtils qw(pairwise);
use Text::ParseWords;
our ($a, $b);
my #header = parse_csv( scalar <DATA> );
my #total = (0) x #header;
output_csv( #header, 'row_total' );
for my $line (<DATA>) {
my #cols = parse_csv( $line );
my $label = shift #cols;
push #cols, sum #cols;
output_csv( $label, #cols );
#total = pairwise { $a + $b } #total, #cols;
}
output_csv( 'Total', #total );
sub parse_csv {
chomp( my $data = shift );
quotewords ',', 0, $data;
}
sub output_csv { say join ',' => #_ }
__DATA__
date,type1,type2
2009-07-01,1,2
2009-07-02,21,22
Outputs the expected:
date,type1,type2,row_total
2009-07-01,1,2,3
2009-07-02,21,22,43
Total,22,24,46
Some things to take away from above is the use of List::Util and List::MoreUtils:
# using List::Util::sum
my $sum_of_all_values_in_list = sum #list;
# using List::MoreUtils::pairwise
my #two_arrays_added_together = pairwise { $a + $b } #array1, #array2;
Also while I've used Text::ParseWords in my example you should really look into using Text::CSV. This modules covers more bizarre CSV edge cases and also provides correct CSV composition (my output_csv() sub is pretty naive!).
/I3az/
Like JB's perlgolf candidate, except prints the end line totals and labels.
#!/usr/bin/perl -alnF,
use List::Util qw(sum);
chomp;
push #F, $. == 1 ? "total" : sum(#F[1..$#F]);
print "$_,$F[-1]";
for (my $i=1;$i<#F;$i++) {
$totals[$i] += $F[$i];
}
END {
$totals[0] = "Total";
print join(",",#totals);
};
Is this something that needs to be done for sure in a Perl script? There is no "quick and dirty" method to do this in Perl. You will need to read the file in, accumulate your totals, and write the file back out (processing input and output line by line would be the cleanest).
If this is a one-time report, or you are working with a competent user base, the data you want can most easily be produced with a spreadsheet program like Excel.
Whenever I work with CSV, I use the AnyData module. It may add a bit of overhead, but it keeps me from making mistakes ("Oh crap, that date column is quoted and has commas in it!?").
The process for you would look something like this:
use AnyData;
my #columns = qw/date type1 type2 type3/; ## Define your input columns.
my $input = adTie( 'CSV', 'input_file.csv', 'r', {col_names => join(',', #columns)} );
push #columns, 'total'; ## Add the total columns.
my $output = adTie( 'CSV', 'output_file.csv', 'o', {col_names => join(',', #columns)} );
my %totals;
while ( my $row = each %$input ) {
next if ($. == 1); ## Skip the header row. AnyData will add it to the output.
my $sum = 0;
foreach my $col (#columns[1..3]) {
$totals{$col} += $row->{$col};
$sum += $row->{$col};
}
$totals{total} += $sum;
$row->{total} = $sum;
$output->{$row->{date}} = $row;
}
$output->{Total} = \%totals;
print adDump( $output ); ## Prints a little table to see the data. Not required.
undef $input; ## Close the file.
undef $output;
Input:
date,type1,type2,type3
2009-07-01,1,2,3
2009-07-03,31,32,33
2009-07-06,61,62,63
"Dec 31, 1969",81,82,83
Output:
date,type1,type2,type3,total
2009-07-01,1,2,3,6
2009-07-03,31,32,33,96
2009-07-06,61,62,63,186
"Dec 31, 1969",81,82,83,246
Total,174,178,182,534
The following in Perl does what you want, its not elegant but it works :-)
Call the script with the inputfile as argument, results in stdout.
chop($_ = <>);
print "$_,Total\n";
while (<>) {
chop;
split(/,/);
shift(#_);
$sum = 0;
for ($n = 0; 0 < scalar(#_); $n++) {
$c = shift(#_);
$sum += $c;
$sums[$n] += $c;
}
$total += $sum;
print "$_,$sum\n";
}
print "Total";
for ($n = 0; $n <= $#sums; $n++) {
print "," . $sums[$n];
}
print ",$total\n";
Edit: fixed for 0 values.
The output is like this:
date,type1,type2,type3,Total
2009-07-01,1, 2, 3,6
2009-07-02,4, 5, 6,15
Total,5,7,9,21