I have a txt file where every line has a structure like this:
P[containerVrsn:U(0)recordVrsn:U(0)size:U(212)ownGid:G[mdp:U(1090171666)**seqNo:U(81920)**]logicalDbNo:U(1)classVrsn:U(1)timeStamp:U(0)dbRecord:T[classNo:U(1064620)size:U(184)updateVersion:U(3)checksum:U(748981000)
And I have to sort the file's lines based on seqNo (min to max). The sequence number can be virtually any number starting from zero. Any idea how this can be done in an efficient way?
The Schwartzian Transform suggested in Toto's answer is probably the fastest way to sort your lines here. But you said you're a Perl newbie, and I'd like to show how the lines can be sorted traditionally.
Perl has a sort function that, by default, sorts a list alphabetically. But you can supply a custom comparison function and let sort use it to compare the elements. During its operation, sort must repeatedly compare two elements (= lines) of your list and decide which one is greater or lesser, or whether they are equal.
If you supply a comparison function, sort will call it with two such elements as the parameters $a and $b. You do not need to (and must not) declare $a and $b; they are magic and just there. Your comparison function could look like this:
sub by_seqNo
{
# extract the sequence number from $a and $b
my ($seqA) = ($a =~ /seqNo:U\((\d+)/);
my ($seqB) = ($b =~ /seqNo:U\((\d+)/);
# numerically compare the sequence numbers (returns -1/0/+1)
$seqA <=> $seqB;
}
The first two lines extract the numbers after seqNo:U( and store them as $seqA and $seqB. The third line compares these sequence numbers as integers and returns that result. Combined with the sort function this gives:
my @sorted = sort by_seqNo @lines;
The reason why the Schwartzian Transform (ST) is faster than this solution is because the ST does the (expensive) operation of extracting the seqNo from your lines exactly once for each line. The "traditional" approach on the other hand extracts the seqNo twice for each comparison.
You could use Schwartzian Transform.
Here is a small script that does the job:
#!/usr/bin/perl
use strict;
use warnings;
open my $fhi, '<', 'path/to/input/file' or die "Unable to open input file: $!";
my @lines = <$fhi>;
my @sorted = map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { my ($x) = $_ =~ /SeqNo:U\((\d+)/i; [$_, $x]; }
@lines;
open my $fho, '>', 'path/to/output/file' or die "Unable to open output file: $!";
print $fho $_ for @sorted;
I have 1500 files in one directory and I need to get some information out of every one and write it into a new, single file. The file names consist of a word and a number (Temp1, Temp2, Temp3 and so on) and it is important that the files are read in the correct order according to the numbers.
I did this using
my @files = <Temp*.csv>;
for my $file (@files)
{
# this part appends the required data to a separate file and works fine
}
My problem now is that the files are not opened in the correct order: after file 1, file 100 gets opened.
Can anybody please give me a hint on how I can make it read the files in the right order?
Thank you,
Ca
Sort the files naturally with Sort::Key::Natural's natsort.
The following will automatically sort the files naturally, separating out alpha and numerical portions of the name for the appropriate sort logic.
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
for my $file ( natsort <Temp*.csv> ) {
# this part appends the required data to a separate file and works fine
}
The following fake data should demonstrate this module in action:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
print natsort <DATA>;
__DATA__
Temp100.csv
Temp8.csv
Temp20.csv
Temp1.csv
Temp7.csv
Outputs:
Temp1.csv
Temp7.csv
Temp8.csv
Temp20.csv
Temp100.csv
You can use a Schwartzian transform to read and sort the files in one step,
my @files =
map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\d+)/ ] } <Temp*.csv>;
or use a less efficient but more straightforward sort,
my @files = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } <Temp*.csv>;
If the numbers are really important, you might want to read them specifically after file name, with error reporting about missing files:
my @nums = 1 .. 1500; # or whatever the highest is
for my $num (@nums) {
my $file = "Temp$num.csv";
unless (-e $file) {
warn "Missing file: $file";
next;
}
...
# proceed as normal
}
If you need a file count, you can simply use your old glob:
my @files = <Temp*.csv>;
my $count = @files; # get the size of the array
my @nums = 1 .. $count;
On the other hand, if you control the process that prints the files, you might select a format that will automatically sort itself, such as:
temp00001.csv
temp00002.csv
temp00003.csv
temp00004.csv
...
temp00101.csv
I have a list of files:
TC-00001-(null)-20141027-204159FN.tif
TC-00020-(null)-20141027-203422FN.tif
etc
I need to process these files in an ordered way, but not in the default order that sort { $a cmp $b } gives me. Currently, it is sorting according to the TC-0000X numeration.
I would in fact like to sort according to the last number (204159 and 203422) before the FN characters. This is read as a sort of simplified time stamp 20:14:59 and 20:34:22.
The sampling frequency of these files is one second, so this time-stamp uniquely identifies the file.
How can I sort in perl with this number?
sort allows you to define custom sorts that'll order by any algorithm you like.
All the sub needs to do is return a positive, zero, or negative value, much like cmp or <=> do.
$a and $b are special variables used for this purpose.
So in your case:
#!/usr/bin/perl
use strict;
use warnings;
sub compare_last {
#first we extract the values we're interested in...
my ( $a_last ) = ( $a =~ m/(\d+)FN\./ );
my ( $b_last ) = ( $b =~ m/(\d+)FN\./ );
# print "GOT: $a_last, $b_last, \n";
#then we return the comparison. <=> is numeric, but you could use cmp.
#or manually set your own return codes - sort doesn't care, just bear in mind that
#each element is compared so you can end up with some pretty fruity results if you
#return a random number or something.
return ( $a_last <=> $b_last );
}
print sort compare_last <DATA> ;
## some dummy data
__DATA__
TC-00001-(null)-20141027-204159FN.tif
TC-00020-(null)-20141027-203422FN.tif
TC-00001-(null)-20141027-204159FN.tif
TC-00020-(null)-20141027-123456FN.tif
TC-00001-(null)-20141027-332FN.tif
TC-00020-(null)-20141027-018234FN.tif
This can be done with a one-liner using List::UtilsBy (which, like the other List::* modules, has an XS version for efficiency).
perl -MList::UtilsBy::XS=nsort_by -wle 'print nsort_by { /(\d+)FN[.]/ ? $1 : -1 } <>' filelist.txt
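In script form rather than a one-liner, the same nsort_by call can be used directly. Here is a minimal sketch, using two made-up file names from the question rather than a real file list:

```perl
use strict;
use warnings;
use List::UtilsBy qw(nsort_by);

my @files = (
    "TC-00001-(null)-20141027-204159FN.tif",
    "TC-00020-(null)-20141027-203422FN.tif",
);

# Sort numerically by the digits immediately before "FN.";
# names without a match sort first via the -1 fallback.
my @sorted = nsort_by { /(\d+)FN[.]/ ? $1 : -1 } @files;
print "$_\n" for @sorted;
```

Since 203422 is less than 204159, the TC-00020 file ends up first.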
I have some 1000 files in a directory. Naming convention of the file is like below.
TC_01_abcd_16_07_2014_14_06.txt
TC_02_abcd_16_07_2014_14_06.txt
TC_03_abcd_16_07_2014_14_07.txt
.
.
.
.
TC_100_abcd_16_07_2014_15_16.txt
.
.
.
TC_999_abcd_16_07_2014_17_06.txt
I have written some code like this
my $dir = "/var/tmp";
foreach my $inputfile (glob("$dir/*abcd*.txt")) {
print $inputfile."\n";
}
While running this, it is not printing in sequence.
It prints up to the 09 file, then the 100-series file names, like this:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
TC_11_abcd_16_07_2014_11_56.txt
Please guide me on how to print in sequence.
The files are sorted according to the rules of shell glob expansion, which is a simple alpha sort. You will need to sort them according to a numeric sort of the first numeric field.
Here is one way to do that:
# Declare a sort comparison sub, which extracts the part of the filename
# which we want to sort on and compares them numerically.
# This sub will be called by the sort function with the variables $a and $b
# set to the list items to be compared
sub compareFilenames {
my ($na) = ($a =~ /TC_(\d+)/);
my ($nb) = ($b =~ /TC_(\d+)/);
return $na <=> $nb;
}
# Now use glob to get the list of filenames, but sort them
# using this comparison
foreach my $file (sort compareFilenames glob("$dir/*abcd*.txt")) {
print "$file\n";
}
See: perldoc for sort
That's printing the files in order -- ASCII order, that is.
In ASCII, the underscore (_) sorts after the digits. If you want your files in the correct order, you'll have to sort them yourself; without sort, there's no guarantee that they'll print in any order at all. Even worse for you, you don't really want to print the files in either numeric sorted order (because the file names aren't numeric) or ASCII order (because you want TC_10 to print before TC_100).
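A quick way to see this ASCII effect, with made-up file names:

```perl
use strict;
use warnings;

# '_' is code point 0x5F, which sorts after the digits 0x30..0x39,
# and string comparison proceeds left to right, character by character.
my @names = sort ('TC_9_x.txt', 'TC_10_x.txt', 'TC_100_x.txt');
print "$_\n" for @names;
# TC_100_x.txt comes first and TC_9_x.txt last, because '1' < '9'
# and '0' < '_' at the first differing positions
```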
Therefore, you need to write your own sorting routine. Perl gives you the sort command. By default, it will sort in ASCII order. However, you can define your own subroutine to sort in the order you want. sort will pass two values into your sort routine, $a and $b. What you can do is parse these two values to get the sort keys you want, then use the <=> or cmp operators to return them in the correct sort order:
#! /usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw(say);
opendir my $dir, 'temp'; # Opens a directory for reading
my @dir_list = readdir $dir;
closedir $dir;
@dir_list = sort { # My sort routine embedded inside the sort command
my $a_val;
my $b_val;
if ( $a =~ /^TC_(\d+)_/ ) {
$a_val = $1;
}
else {
$a_val = 0;
}
if ( $b =~ /^TC_(\d+)_/ ) {
$b_val = $1;
}
else {
$b_val = 0;
}
return $a_val <=> $b_val;
} @dir_list;
for my $file (@dir_list) {
next if $file =~ /^\./;
say "$file";
}
In my sort subroutine, I take $a and $b, pull out the number you want to sort by, and put that value into $a_val and $b_val. I also have to watch what happens if the files don't have the name I think they should have. Here I simply set the sort value to 0 and hope for the best.
I am using opendir and readdir instead of globbing. This will end up including . and .. in my list, and it will include any file that starts with .. No problem, I'll remove these when I print out the list.
In my test, this prints out:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_11_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
Files are sorted numerically by the first set of digits after TC_.
Here you go:
#!/usr/bin/perl
use warnings;
use strict;
sub by_substring{
$a=~ /(\d+)/;
my $x=$1;
$b=~ /(\d+)/;
my $y=$1;
return $x <=> $y;
}
my @files = <*.txt>;
@files = sort by_substring @files;
for my $inputfile (@files){
print $inputfile."\n";
}
It will not matter if your filenames start with "TC" or "BD" or "President Carter", this will just use the first set of adjacent digits for the sorting.
The sort in the directory listing will be alphanumeric, hence the effect you see. I do not know how to sort glob results by creation date, but here is a workaround:
my $dir = "/var/tmp";
my @files = glob("$dir/*abcd*.txt");
my @sorted_files;
for my $filename (@files) {
my ($number) = $filename =~ m/TC_(\d+)_abcd/;
$sorted_files[$number] = $filename;
}
# slots with no matching file (e.g. index 0) are undef, so skip them
print join "\n", grep { defined } @sorted_files;
open INP, "<", "C:\\Users\\hanadi\\Documents\\cs.txt";
while ($line = <INP>)
{
chomp($line);
@list = split /\s+/, $line;
foreach $w (@list)
{
$wordfreq{$w}++;
}
}
foreach $w2 (keys %wordfreq)
{
print "$w2==>$wordfreq{$w2}";
}
I want to print each word and its frequency. Now I want Perl code to jump ahead and print the same information for the next
ranks (>100), but not to print
every line: print only one line for every 1000 words (otherwise there will
be too many lines to print), in decreasing order of frequency and
decreasing alphabetical order among words with the same frequency.
The first issue of this problem is to define the word "word." I am assuming, from one of your comments, that punctuation is not part of a "word," since you were asking how to "...delete punctuations from the text..."
One solution to this is to use a regex to capture only "word" characters, i.e., alphanumeric and underscore, by matching the text against \w in the regex.
Building a hash, where the keys are the words and the associated values are the counts, is the way to go. However, when doing this, you need to insure that the keys are all the same case, i.e., either all UPPER or all lower.
After you've built this hash, you can sort the output in descending order by the hash values (frequency) and use a counter to print just the top 100 words. There will be words with the same frequency count, especially those having only one occurrence, and how you order these can make a difference to whether they appear in the list of top 100. I'd suggest ordering those cases alphabetically.
Given the above, consider the following solution, which uses the text above the code below as the corpus:
use strict;
use warnings;
my %hash;
open my $fh, '<', 'words.txt' or die $!;
while (<$fh>) {
$hash{ lc $1 }++ while /(\w+)/g;
}
close $fh;
my $i = 1;
for my $word ( sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash ) {
print "$i. $word: ($hash{$word})\n" if $i++ < 100 or !( $i % 1000 );
}
Partial output:
1. the: (22)
2. to: (8)
3. a: (5)
4. you: (5)
5. is: (4)
6. of: (4)
7. this: (4)
8. word: (4)
9. all: (3)
10. and: (3)
...
96. punctuation: (1)
97. punctuations: (1)
98. since: (1)
99. sort: (1)
100. suggest: (1)
Limitations:
One issue that results from capturing word characters can be seen in cases of some possessive forms of words, e.g., word's. In this case, both word and s would be captured as words. If you want to retain such punctuation, and split on whitespace, you can just use the following instead of the regex line:
$hash{ lc $_ }++ for split ' ';
Hope this helps!
Please pay attention to all of amon's reply, and always read the question carefully (good spot, amon).
Your problem is that once you have constructed your word count hash, you need to invert the hash so you can sort the values into some kind of order. The catch is that more than one word may have the same count and would overwrite earlier stored words.
To do this you need to store an array in a hash value, and this is done by using a reference to an array. Hash values may only be scalars, an array is not a scalar, but a reference to an array is.
In this re-write of your problem, I have updated the open and close calls to use scalar filehandles, with appropriate error handling (or die), and converted your foreach statements into maps. These can take a bit of time to grasp, so do not copy and paste them; rather, focus on the inverting of the hash and how the array is accessed. This is possibly quite complex for you, so I have left those parts in foreach style.
The 'each' keyword takes a key/value pair from the hash, and is often used this way to process hashes in while statements.
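A minimal illustration of each on a small hand-built hash:

```perl
use strict;
use warnings;

my %wordfreq = ( apple => 3, pear => 1 );

# each returns one key/value pair per call and an empty list once
# the hash is exhausted, which ends the while loop
while ( my ($word, $count) = each %wordfreq ) {
    print "$word occurs $count time(s)\n";
}
```

Note that the order in which each hands back pairs is the hash's internal order, which is effectively random; that is exactly why the inverted hash still needs sorting afterwards.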
You will still need to work on converting the counts into frequencies as per amon's suggestion and retrieving the top 100. There is a clue to the counting in the $c variable.
#!/usr/bin/perl
# word count #wct.pl
use warnings;
use strict;
my (%wordfreq);
open my $input, '<', 'wc.txt'
or die "cannot open wc txt file $!";
map { $wordfreq{ $_ }++; } (split /\s+/, $_) while <$input> ;
close $input
or die "cannot close wc txt file $!";
# print (
# map {"$_ has ". $wordfreq{$_} . "\n" } (keys %wordfreq)
# );
foreach (keys %wordfreq){
# print "$_ has ". $wordfreq{$_} . "\n"
}
my %invertedhash;
while (my ($key,$value) = each %wordfreq){
push @{$invertedhash{$value}}, $key;
}
my $c;
foreach (sort { $b <=> $a } keys %invertedhash){
last if $c++ == 2;
print "words with a count of $_ are @{$invertedhash{$_}} \n";
}
exit 0;
sample
one two two
three three three four
four four four five
five five five
produces
words with a count of 4 are four five
words with a count of 3 are three
Hope this helps.
Hey guys,
I have this file called phonebook
Steve Blenheim:239-923-7366:238-934-7865:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:245-876-7656:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:385-333-8976:3567 Populus Place, Caldwell, NJ 23875:6/18/68:23400
Norma Corder:397-857-2735:397-857-7651:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
And I am trying to sort the text in reverse alphabetical order by the second word (the last name), and have not been able to find out how to do it. I am reading from the file by doing this
open (FILE, "phonebook");
@lines = <FILE>;
close(FILE);
any ideas? I can sort the first field in alphabetical order and reverse, but can't seem to get the second one to sort properly.
Thanks in advance
I share tadmc's concern that the second field, by whitespace, isn't always going to be the surname, but answering the question as it pertains to the second field: you can get it using split, and you can sort on it like this.
The simple but horribly slow version (easy to read, but it re-splits every field every single time it compares two lines, which is inefficient):
@lines = sort { # Compare second fields
(split " ", $a)[1]
cmp
(split " ", $b)[1]
} @lines;
The Schwartzian transform version (does the exact same thing as the previous one, only much faster):
@lines = map { # Get original line back
$_->[0]
} sort { # Compare second fields
$a->[1] cmp $b->[1]
} map { # Turn each line into [original line, second field]
[ $_, (split " ", $_)[1] ]
} @lines;
If you don't mind using the shell, sort -r -k2 will sort your file in reverse order.
Based on Miguel Prz's solution, I replaced the cmp with <=>.
This is important for numbers. If cmp is used, the values are sorted as strings: the first character is most important, then the second, and so on. If you have the numbers 607, 8 and 35, then cmp (in descending order) will sort them as 8, 607, 35. To sort them as numbers we use <=>, and the result will be 607, 35, 8.
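A sketch of the difference on exactly those numbers:

```perl
use strict;
use warnings;

my @nums = (607, 8, 35);

# Descending string comparison: '8' > '6' > '3' at the first character
my @as_strings = sort { $b cmp $a } @nums;   # 8, 607, 35

# Descending numeric comparison
my @as_numbers = sort { $b <=> $a } @nums;   # 607, 35, 8

print "strings: @as_strings\n";
print "numbers: @as_numbers\n";
```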
use strict;
open my $FILE, '<', 'phonebook';
my @lines = <$FILE>;
my @sorted = sort {
my @a = split(/\s+/, $a);
my @b = split(/\s+/, $b);
$b[1] <=> $a[1] } @lines;
foreach my $item (@sorted) {
print "$item\n";
}
close $FILE;
You'll need to read the file line by line to do that. Something like this:
my %list;
open(FILE, "phonebook");
while(<FILE>){
my @vals = split(/:/, $_);
(my $key = $vals[0]) =~ s/(\S+)\s+(.+)/$2 $1/; # split first field, reverse word order
$list{$key} = $_; #save row keyed on $key
}
foreach my $key(sort {$b cmp $a} keys(%list)){
print $list{$key};
}
I think it's interesting to write it in a Modern Perl way (the solution is the same), and this is the complete script:
use strict;
open my $FILE, '<', 'phonebook';
my @lines = <$FILE>;
my @sorted = sort {
my @a = split(/\s+/, $a);
my @b = split(/\s+/, $b);
$b[1] cmp $a[1] } @lines;
foreach my $item (@sorted) {
print "$item\n";
}
close $FILE;
I am surprised nobody has mentioned this, but if we are sorting a phonebook, we probably don't really want a pure ASCII sort.
Does Bob DeCarlo really belong before Ralph Dearborn? If you sort using cmp, Mr. DeCarlo winds up first in the results.
Even if you normalize for case, you've still got issues. There are a host of complications with sorting and filing things. Different organizations have rules for handling these issues.
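For example, one way to approximate phone-book ordering is the core Unicode::Collate module, whose default collation treats case as only a minor (tertiary) difference, so primary letter order wins. This is a sketch using the surnames from the example above; real filing rules still vary by organization:

```perl
use strict;
use warnings;
use Unicode::Collate;

my @surnames = ('DeCarlo', 'Dearborn');

# Plain cmp is raw code-point order: 'C' (0x43) < 'a' (0x61),
# so DeCarlo sorts first
my @ascii = sort { $a cmp $b } @surnames;

# Unicode::Collate compares letters case-insensitively at the
# primary level: 'a' < 'c', so Dearborn sorts first
my $collator = Unicode::Collate->new();
my @collated = $collator->sort(@surnames);

print "ascii:    @ascii\n";
print "collated: @collated\n";
```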
Since sort is an expensive operation, you'll want to make each comparison work as quickly as possible. The way to do this is to use the simplest code possible for all your comparisons. Since cmp won't give us the desired result by itself, we need to generate and cache a normalized sort term for each item in the phone book.
So, assuming you've already got your phone book data in an array:
sub extract_and_normalize {
my ($line) = @_;
# Do stuff here to embody your alphabetization rules,
# e.g. lowercase the second whitespace field:
my $normed = lc( (split ' ', $line)[1] );
return [ $normed, $line ];
}
# Generate your sort terms
my @processed = map extract_and_normalize($_), @lines;
# Sort by the normalized values
my @sorted = sort { $a->[0] cmp $b->[0] } @processed;
# Extract the lines from the sorted set.
@lines = map $_->[1], @sorted;
Or use the Schwartzian Transform, as hobbs suggests, to avoid making all the intermediate variables:
@lines = map $_->[1],
sort { $a->[0] cmp $b->[0] }
map extract_and_normalize($_), @lines;