Custom sorting of file names - perl

I have a list of files:
TC-00001-(null)-20141027-204159FN.tif
TC-00020-(null)-20141027-203422FN.tif
etc
I need to process these files in an ordered way, but not the default way that sort {$a cmp $b}, gives me. Currently, it is sorting according to the TC-0000X numeration.
I would in fact like to sort according to the last number (204159 and 203422) before the FN characters. This is read as a sort of simplified time stamp 20:14:59 and 20:34:22.
The sampling frequency of these files is one second, so this time-stamp uniquely identifies the file.
How can I sort in perl with this number?

Sort allows you to define custom sorts that'll sort by any algorithm you like.
All the sub needs to is return positive, zero or negative - much like cmp or <=> do.
$a and $b are special variables used for this purpose.
So in your case:
#!/usr/bin/perl
use strict;
use warnings;
sub compare_last {
#first we extract the values we're interested in...
my ( $a_last ) = ( $a =~ m/(\d+)FN\./ );
my ( $b_last ) = ( $b =~ m/(\d+)FN\./ );
# print "GOT: $a_last, $b_last, \n";
#then we return the comparison. <=> is numeric, but you could use cmp.
#or manually set your own return codes - sort doesn't care, just bear in mind that
#each element is compared so you can end up with some pretty fruity results if you
#return a random number or something.
return ( $a_last <=> $b_last );
}
print sort compare_last <DATA> ;
## some dummy data
__DATA__
TC-00001-(null)-20141027-204159FN.tif
TC-00020-(null)-20141027-203422FN.tif
TC-00001-(null)-20141027-204159FN.tif
TC-00020-(null)-20141027-123456FN.tif
TC-00001-(null)-20141027-332FN.tif
TC-00020-(null)-20141027-018234FN.tif

This can be done with a one-liner using List::UtilsBy (which, like the other List::* modules, has an XS version for efficiency).
perl -MList::UtilsBy::XS=nsort_by -wle 'print nsort_by { /(\d+)FN[.]/ ? $1 : -1 } <>' filelist.txt

Related

Sort mixed text lines (alphanum) in Perl

I have txt file with every line structure like this:
P[containerVrsn:U(0)recordVrsn:U(0)size:U(212)ownGid:G[mdp:U(1090171666)**seqNo:U(81920)**]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1064620)size:U(184)updateVersion:U(3)checksum:U(748981000)
And have to sort file lines based on seqNo (min to max). Sequence number can be virtually any number starting from zero. Any idea how can it be done in efficient way?
The Schwartzian Transform as suggested in Toto's answer is probably the fastest way to sort your lines here. But you said you're a Perl newbie, and I like to show how the lines can be sorted traditionally.
Perl has a sort function that sorts a list simply by alphabet. But you can supply a custom comparison function and let sort use your function to compare the elements. During its operation sort must continuously compare two elements (=lines) of your list and decide which one is greater or lesser or whether they are equal.
If you supply a comparison function, sort will call it with two such elements as the parameters $a and $b. You do not need to must not declare $a and $b, they are magic and just there. Your comparison function could look like this:
sub by_seqNo
{
# extract the sequence number from $a and $b
my ($seqA) = ($a =~ /seqNo:U\((\d+)/);
my ($seqB) = ($b =~ /seqNo:U\((\d+)/);
# numerically compare the sequence numbers (returns -1/0/+1)
$seqA <=> $seqB;
}
The first two lines extract the numbers after seqNo:U( and store them as $seqA and $seqB. The third line compares these sequence numbers as integers and returns that result. Combined with the sort function this gives:
my #sorted = sort by_seqNo #lines;
The reason why the Schwartzian Transform (ST) is faster than this solution is because the ST does the (expensive) operation of extracting the seqNo from your lines exactly once for each line. The "traditional" approach on the other hand extracts the seqNo twice for each comparison.
You could use Schwartzian Transform.
Here is a small script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
open my $fhi, '<', 'path/to/input/file' or die "Unable to open input file: $!";
my #lines = <$fhi>;
my #sorted = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { my ($x) = $_ =~ /SeqNo:U\((\d+)/i; [$_, $x]; }
#lines;
open my $fho, '>', 'path/to/output/file' or die "Unable to open output file: $!";
print $fho $_ for #sorted;

Write file name in sequence of generation in perl

I have some 1000 files in a directory. Naming convention of the file is like below.
TC_01_abcd_16_07_2014_14_06.txt
TC_02_abcd_16_07_2014_14_06.txt
TC_03_abcd_16_07_2014_14_07.txt
.
.
.
.
TC_100_abcd_16_07_2014_15_16.txt
.
.
.
TC_999_abcd_16_07_2014_17_06.txt
I have written some code like this
my #dir="/var/tmp";
foreach my $inputfile (glob("$dir/*abcd*.txt")) {
print $inputfile."\n";
}
While running this it is not printing in sequence.
it it printing till 09 file then it is printing 1000th file name then
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
TC_11_abcd_16_07_2014_11_56.txt
Please guide me how to print in sequence
The files are sorted according to the rules of shell glob expansion, which is a simple alpha sort. You will need to sort them according to a numeric sort of the first numeric field.
Here is one way to do that:
# Declare a sort comparison sub, which extracts the part of the filename
# which we want to sort on and compares them numerically.
# This sub will be called by the sort function with the variables $a and $b
# set to the list items to be compared
sub compareFilenames {
my ($na) = ($a =~ /TC_(\d+)/);
my ($nb) = ($b =~ /TC_(\d+)/);
return $na <=> $nb;
}
# Now use glob to get the list of filenames, but sort them
# using this comparison
foreach my $file (sort compareFilenames glob("$dir/*abcd*.txt")) {
print "$file\n";
}
See: perldoc for sort
That's printing the files in order -- ASCII order that is.
In ASCII, the underscore (_) is after the digits when sorting. If you want to sort your files in the correct order, you'll have to sort them yourself. Without sort, there's no guarantee that they'll print in any order. Even worse for you, you don't really want to print the files in either numeric sorted order (because the file names aren't numeric) or ASCII order (because you want TC_10 to print before TC_100.
Therefore, you need to write your own sorting routine. Perl gives you the sort command. By default, it will sort in ASCII order. However, you can define your own subroutine to sort in the order you want. sort will pass two values to your in your sort routine $a and $b. What you can do is parse these two values to get the sort keys you want, then use the <=> or cmp operators to return the values in the correct sort order:
#! /usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw(say);
opendir my $dir, 'temp'; # Opens a directory for reading
my #dir_list = readdir $dir;
closedir $dir;
#dir_list = sort { # My sort routine embedded inside the sort command
my $a_val;
my $b_val;
if ( $a =~ /^TC_(\d+)_/ ) {
$a_val = $1;
}
else {
$a_val = 0;
}
if ( $b =~ /^TC_(\d+)_/ ) {
$b_val = $1;
}
else {
$b_val = 0;
}
return $a_val <=> $b_val;
} #dir_list;
for my $file (#dir_list) {
next if $file =~ /^\./;
say "$file";
}
In my sort subroutine am going to take $a and $b and pull out the number you want to sort them by and put that value into $a_val and $b_val. I also have to watch what happens if the files don't have the name I think they may have. Here I simply decide to set the sort value to 0 and hope for the best.
I am using opendir and readdir instead of globbing. This will end up including . and .. in my list, and it will include any file that starts with .. No problem, I'll remove these when I print out the list.
In my test, this prints out:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_11_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
Files are sorted numerically by the first set of digits after TC_.
Here you go:
#!/usr/bin/perl
use warnings;
use strict;
sub by_substring{
$a=~ /(\d+)/;
my $x=$1;
$b=~ /(\d+)/;
my $y=$1;
return $x <=> $y;
}
my #files=<*.txt>;
#files = sort by_substring #files;
for my $inputfile (#files){
print $inputfile."\n";
}
It will not matter if your filenames start with "TC" or "BD" or "President Carter", this will just use the first set of adjacent digits for the sorting.
the sort in the directory will be alphanumeric, hence your effect. i do not know how to sort glob by creation date, here is a workaround:
my #dir="/var/tmp";
my #files = glob("$dir/*abcd*.txt");
my #sorted_files;
for my $filename (#files) {
my ($number) = $filename =~ m/TC_(\d+)_abcd/;
$sorted_files[$number] = $filename;
}
print join "\n", #sorted_filenames;

Selecting highest count of element except when...

So i have been working on this perl script that will analyze and count the same letters in different line spaces. I have implemented the count to a hash but am having trouble excluding a " - " character from the output results of this hash. I tried using delete command or next if, but am not getting rid of the - count in the output.
So with this input:
#extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(#extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, #keys) = keys %$counter;
my ($big, #vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie in the elements, say G = M, if there is a way to display both in that would be great but not necessary. Any tips on how to delete or remove the '-' is much appreciated. I am slowly learning perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my #chars = split '', $str;
#Count Characteres
my %count;
$count{$_}++ for #chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my #keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (#keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
foreach my $extract(#extract) {
#next if $_ =~ /\-/
$_ setting is suppressed by $extract here.
(In this case, $_ keeps value from above, e.g. routine argument list, previous match, etc.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;

How to print the frequency of words in perl?

open INP,"<C:\\Users\\hanadi\\Documents\\cs.txt";
while ($line=<INP>)
{
chomp($line);
#list=split/\s+/,$line;
foreach $w (#list)
{
$wordfreq{$w}++;
}
}
foreach $w2(keys>wordfreq)
{
print "$w2==>$wordfreq{$w}";
}
I want to print each word and its frequency.now i want code in Perl to jump and Print the above information for the next
ranks (>100), but do not print
every line, print only one line for every 1000 words (otherwise there will
be too many lines to print) in decreasing order of frequency and
decreasing alphabetical order among words with the same frequency.
The first issue of this problem is to define the word "word." Am assuming, by one of your comments, that punctuation is not part of a "word," since you were asking how to "...delete punctuations from the text..."
One solution to this is to use a regex to capture only "word" characters, i.e., alphanumeric and underscore, by matching the text against \w in the regex.
Building a hash, where the keys are the words and the associated values are the counts, is the way to go. However, when doing this, you need to insure that the keys are all the same case, i.e., either all UPPER or all lower.
After you've built this hash, you can sort the output in descending order by the has values (frequency) and use a counter to print just the top 100 words. There will be words with the same frequency count--especially having only one occurrence. How do you want these printed, as it can make a difference whether they appear in the list of top 100. I'd suggest ordering these cases alphabetically.
Give the above, consider the following solution, which uses the text above the code below as the corpus:
use strict;
use warnings;
my %hash;
open my $fh, '<', 'words.txt' or die $!;
while (<$fh>) {
$hash{ lc $1 }++ while /(\w+)/g;
}
close $fh;
my $i = 1;
for my $word ( sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash ) {
print "$i. $word: ($hash{$word})\n" if $i++ < 100 or !( $i % 1000 );
}
Partial output:
1. the: (22)
2. to: (8)
3. a: (5)
4. you: (5)
5. is: (4)
6. of: (4)
7. this: (4)
8. word: (4)
9. all: (3)
10. and: (3)
...
96. punctuation: (1)
97. punctuations: (1)
98. since: (1)
99. sort: (1)
100. suggest: (1)
Limitations:
One issue that results from capturing word characters can be seen in cases of some possessive forms of words, e.g., word's. In this case, both word and s would be captured as words. If you want to retain such punctuation, and split on whitespace, you can just use the following instead of the regex line:
$hash{ lc $_ }++ for split ' ';
Hope this helps!
please pay attention to all of amons reply, and always rtq. (good spot amon).
(i have determined) your problem is that once you have constructed your wordcount hash, you now need to invert the hash so you can sort the values into some kind of order. The problem with this is that more than one word may have the same count and would overwrite earlier stored words.
To do this you need to store an array in a hash value, and this is done by using a reference to an array. Hash values may only be scalars, an array is not a scalar, but a reference to an array is.
In this re-write of your problem, I have updated the open and close calls to use scalar filehandles, with appropriate error handling (or die), and converted your foreach statements into 'maps'. These can take a bit of time to grasp so do not copy and paste them. Rather, focus on the inverting of the hash and how the array is accessed. This is possibly quite complex for you so I have left this parts in foreach style.
The 'each' keyword takes a key/value pair from the hash, and is often used this way to process hashes in while statements.
You will still need to work on converting the counts into frequencies as per amons suggestion and retrieving the top 100. There is a clue to the counting in the '$c' variable.
#!/usr/bin/perl
# word count #wct.pl
use warnings;
use strict;
my (%wordfreq);
open my $input, '<', 'wc.txt'
or die "cannot open wc txt file $!";
map { $wordfreq{ $_ }++; } (split /\s+/, $_) while <$input> ;
close $input
or die "cannot close wc txt file $!";
# print (
# map {"$_ has ". $wordfreq{$_} . "\n" } (keys %wordfreq)
# );
foreach (keys %wordfreq){
# print "$_ has ". $wordfreq{$_} . "\n"
}
my %invertedhash;
while (my ($key,$value) = each %wordfreq){
push #{$invertedhash{$value}}, $key;
}
my $c;
foreach (reverse sort keys %invertedhash){
last if $c++ == 2;
print "words with a count of $_ are #{$invertedhash{$_}} \n";
}
exit 0;
sample
one two two
three three three four
four four four five
five five five
produces
words with a count of 4 are four five
words with a count of 3 are three
Hope this helps.

Sort CSV based on a certain column?

I'm sure I've done this in the past and there is something small I'm forgetting, but how can I sort a CSV file on a certain column? I'm interested in answers with and without 3rd party Perl modules. Mainly methods without, since I don't always have access to install additional modules.
Example data:
name,25,female
name,24,male
name,27,female
name,21,male
desired end result after sorting on the 2nd numeric column:
name,21,male
name,24,male
name,25,female
name,27,female
As CSV is a pretty complex format, it is better to use a module that does the work for us.
Following is an example using the Text::CSV module:
#!/usr/bin/env perl
use strict;
use warnings;
use constant AGE => 1;
use Text::CSV;
my $csv = Text::CSV->new();
my #rows;
while ( my $row_ref = $csv->getline( \*DATA ) ) {
push #rows, $row_ref;
}
#rows = sort { $a->[AGE] <=> $b->[AGE] } #rows;
for my $row_ref (#rows) {
$csv->combine(#$row_ref);
print $csv->string(), "\n";
}
__DATA__
name,25,female
name,24,male
name,27,female
name,21,male
In the spirit of there always being another way to do it, bear in mind that plain old GNU sort might be enough.
$ sort -t, -k2 -n unsorted.txt
name,21,male
name,24,male
name,25,female
name,27,female
Where the command line args are:
-t, # use comma as the record separator
-k2 # sort on the second key (record) in the line
-n # sort using numerical comparison (like using <=> instead of cmp in perl)
If you want a Perl solution, wrap it in qx() ;-)
There is also DBD::CSV:
#!/usr/bin/perl
use strict; use warnings;
use DBI;
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
RaiseError => 1,
f_ext => '.csv',
csv_tables => { test => { col_names => [qw' name age sex '] } },
});
my $sth = $dbh->prepare(q{
SELECT name, age, sex FROM test ORDER BY age
});
$sth->execute;
while ( my #row = $sth->fetchrow_array ) {
print join(',' => #row), "\n";
}
$sth->finish;
$dbh->disconnect;
Output:
name,21,male
name,24,male
name,25,female
name,27,female
The original poster asked for no third-party modules (which I take to mean nothing from CPAN). Whilst this is restriction that will horribly limit your ability to write good modern Perl code, in this instance it's possible using the (core) Text::ParseWords module in place of the (non-core) Text::CSV. So, borrowing heavily from Alan's example, we get:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
my #rows;
while (<DATA>) {
push #rows, [ parse_line(',', 0, $_) ];
}
#rows = sort { $a->[1] <=> $b->[1] } #rows;
foreach (#rows) {
print join ',', #$_;
}
__DATA__
name,25,female
name,24,male
name,27,female
name,21,male
When you provide your own comparison code, you can sort on anything. Just extract the desired element with a regex, or probably a split in this case, and then compare on that. If you have a lot of elements, I would parse the data into a list of lists and then the comparison code can access it without parsing. That would eliminate parsing the same row over and over as it's compared with other rows.
using Raku (née Perl6)
This is a fairly quick-and-dirty solution, mainly intended for "hand-rolled" CSV. The code works as long as there's only one (1) age-per-row: Read lines $a, comb for 1-to-3 <digit> surrounded by commas and assign to #b, derive sorting index $c, use $c to reorder lines $a:
~$ raku -e 'my $a=lines(); my #b=$a.comb(/ \, <(\d**1..3)> \, /).pairs; my $c=#b.sort(*.values)>>.keys.flat; $a[$c.flat]>>.put;' sort_age.txt
name,21,male
name,24,male
name,25,female
name,27,female
I prepended a few dummy lines to the OP's input file see how the code above reacts with 1). a blank age field, 2). a blank "" string for age, 3). a bogus "9999" for age, and 4). a bogus "NA" for age. The code above fails catastrophically. To fix this you have to write a ternary that inserts a numeric placeholder value (e.g. zero) whenever the regex fails to match a line.
Below is a longer but more robust solution. Note--I use a placeholder value of 999 to move lines with blank/invalid ages to the bottom:
~$ raku -e 'my #a=lines(); my #b = do for #a {if $_ ~~ m/ \, <(\d**1..3)> \, / -> { +$/ } else { 999 }; }; my $c=#b.pairs.sort(*.values)>>.keys.flat; #a[$c.flat]>>.put;' sort_age.txt
name,21,male
name,24,male
name,25,female
name,27,female
name,,male
name,"",female
name,9999,male
name,NA,male
To sort in reverse, add .reverse to the end of the method chain that creates $c. Again, change the else placeholder argument to move lines absent a valid age to the top or to the bottom. Also, creation of #b above can be written using the ternary operator: my #b = do for #a {(m/ \, <(\d**1..3)> \, /) ?? +$/ !! 999 };, as an alternative.
Here's the unsorted input file for posterity:
$ cat sort_age.txt
name,,male
name,"",female
name,9999,male
name,NA,male
name,25,female
name,24,male
name,27,female
name,21,male
HTH.
https://raku.org/
I would do something like this:
#!/usr/bin/perl
use warnings;
use strict;
my #rows = map { chomp; [split /[,\s]+/, $_] } <DATA>; #read each row into an array
my #sorted = sort { $a->[1] <=> $b->[1] } #rows; # sort the rows (numerically) by second column
for (#sorted) {
print join(', ', #$_) . "\n"; # print them out as CSV
}
__DATA__
name,25,female
name,24,male
name,27,female
name,21,male