Sorting names of files in Perl?

I'm writing a script in Perl that I want to run on all the .csv files in a given directory. The names of the files are of the type CCCC0.csv, CCCC1.csv, ..., CCCC198.csv. However, I want Perl to first run the script on file CCCC0.csv, then on CCCC1.csv, and so on, i.e. in increasing order of the number at the end of the file name.
If I write:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $file;
my @files = <*.csv>;
my @orderedfiles = sort @files;
for $file (@orderedfiles) {
    # ... do stuff
}
it first runs on CCCC100.csv rather than CCCC11.csv, while if I write
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $file;
my @files = <*.csv>;
my @orderedfiles = sort { substr($a, 4) <=> substr($b, 4) } @files;
for $file (@orderedfiles) {
    # ... do stuff
}
it gives me an error telling me that the arguments aren't numeric (I assume it doesn't understand that what follows the first 4 characters is a number rather than more characters).
I have looked at the countless questions on Stack Overflow and PerlMonks that deal with sorting, but I haven't been able to find an answer to my question.
EDIT: I'm using a Windows machine.

You were almost there... the '.csv' is still there. You'd be better served using a regex to read just the numeric characters.
my @sorted = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } @files;
There is an idiom called the Schwartzian Transform that can also do this, though it takes a CS major to understand :D
my @sorted = map  { $_->[0] }              # return the sorted file names
             sort { $a->[1] <=> $b->[1] }  # sort on the numeric portion
             map  { [$_, /(\d+)/] }        # wrap the file names in a temporary
             @files;                       # array with their numeric portions
             # ^^ read from bottom to top ^^

You could give Sort::Key::Natural a spin. From the synopsis:
use Sort::Key::Natural qw(natsort);
my @data = qw(foo1 foo23 foo6 bar12 bar1
              foo bar2 bar-45 foomatic b-a-r-45);
my @sorted = natsort @data;
print "@sorted\n";
# prints:
# b-a-r-45 bar1 bar2 bar12 bar-45 foo foo1 foo6 foo23 foomatic
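Applied to the file names in the question, a minimal sketch (assuming the module is installed) could look like this:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);

# natural sort: CCCC2.csv comes before CCCC10.csv and CCCC100.csv
my @orderedfiles = natsort glob('*.csv');

for my $file (@orderedfiles) {
    # ... do stuff with $file
}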

I believe that the substr($a, 4) is returning "100.csv" in your example, so you still need to trim the .csv suffix off it.
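For completeness, a minimal sketch of that fix, assuming the prefix is always exactly four characters and the extension is always .csv (a negative length argument to substr leaves that many characters off the end):
my @orderedfiles = sort {
    substr($a, 4, -4) <=> substr($b, 4, -4)    # "CCCC100.csv" -> "100"
} @files;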

Related

Why my sorting fails for double digits using perl? [duplicate]

I have 1500 files in one directory and I need to get some information out of every one and write it into a new, single file. The file names consist of a word and a number (Temp1, Temp2, Temp3 and so on) and it is important that the files are read in the correct order according to the numbers.
I did this using
my @files = <Temp*.csv>;
for my $file (@files)
{
    # this part appends the required data to a separate file and works fine
}
My problem now is that the files are not opened in the correct order; after file 1, file 100 gets opened.
Can anybody please give me a hint on how I can make it read the files in the right order?
Thank you,
Ca
Sort the files naturally with Sort::Key::Natural natsort.
The following will automatically sort the files naturally, separating out alpha and numerical portions of the name for the appropriate sort logic.
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);

for my $file ( natsort <Temp*.csv> ) {
    # this part appends the required data to a separate file and works fine
}
The following fake data should demonstrate this module in action:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
print natsort <DATA>;
__DATA__
Temp100.csv
Temp8.csv
Temp20.csv
Temp1.csv
Temp7.csv
Outputs:
Temp1.csv
Temp7.csv
Temp8.csv
Temp20.csv
Temp100.csv
You can use a Schwartzian transform to read and sort the files in one step,
my @files =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { [ $_, /(\d+)/ ] } <Temp*.csv>;
or using a less efficient but more straightforward sort,
my @files = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } <Temp*.csv>;
If the numbers are really important, you might want to iterate over them explicitly and construct each file name, with error reporting for missing files:
my @nums = 1 .. 1500; # or whatever the highest is
for my $num (@nums) {
    my $file = "Temp$num.csv";
    unless (-e $file) {
        warn "Missing file: $file";
        next;
    }
    ...
    # proceed as normal
}
If you need a file count, you can simply use your old glob:
my @files = <Temp*.csv>;
my $count = @files; # get the size of the array
my @nums = 1 .. $count;
On the other hand, if you control the process that creates the files, you might select a naming format that will automatically sort itself, such as:
temp00001.csv
temp00002.csv
temp00003.csv
temp00004.csv
...
temp00101.csv
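If you do control the naming, a small sketch of generating such zero-padded names with sprintf (the five-digit width here is just an example):
for my $n (1 .. 1500) {
    my $file = sprintf 'temp%05d.csv', $n;    # temp00001.csv, temp00002.csv, ...
    # ... create/write $file here
}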

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes file(s) from the command line and prints out the number of occurrences of a word across all files, and in which file it occurs. The first part, finding the number of occurrences of a word, seems to work well.
However, I am struggling with the second part, namely finding in which file (i.e. which file name) the word occurs. I am thinking of using an array that stores the word, but I don't know if this is the best way, or what the best way is.
This is the code I have so far; it seems to work well for the part that counts the number of times a word occurs in the given file(s):
use strict;
use warnings;

my %count;
while (<>) {
    my $casefoldstr = lc $_;
    foreach my $str ($casefoldstr =~ /\w+/g) {
        $count{$str}++;
    }
}
foreach my $str (sort keys %count) {
    printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';

my %count;

while (<>) {
    $count{$_}{$ARGV}++ for map +lc, /\w+/g;
}

foreach my $word ( keys %count ) {
    my @files = keys %{ $count{$word} };    # all files containing lc $word
    print "Total word count for '$word': ", sum( @{ $count{$word} }{@files} ), "\n";
    for my $file ( @files ) {
        print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
    }
}
Using an array seems reasonable if you don't visit any file more than once: then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;

my %count;
my %in_file;
while (<>) {
    my $casefoldstr = lc;
    for my $str ($casefoldstr =~ /\w+/g) {
        ++$count{$str};
        push @{ $in_file{$str} }, $ARGV
            unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
    }
}
foreach my $str (sort keys %count) {
    printf "$str $count{$str}: @{ $in_file{$str} }\n";
}

Write file name in sequence of generation in perl

I have some 1000 files in a directory. The naming convention of the files is like below.
TC_01_abcd_16_07_2014_14_06.txt
TC_02_abcd_16_07_2014_14_06.txt
TC_03_abcd_16_07_2014_14_07.txt
.
.
.
.
TC_100_abcd_16_07_2014_15_16.txt
.
.
.
TC_999_abcd_16_07_2014_17_06.txt
I have written some code like this:
my $dir = "/var/tmp";
foreach my $inputfile (glob("$dir/*abcd*.txt")) {
    print $inputfile."\n";
}
While running this, it is not printing in sequence. It prints up to the 09 file, then it prints file 100's name, and so on:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
TC_11_abcd_16_07_2014_11_56.txt
Please guide me on how to print them in sequence.
The files are sorted according to the rules of shell glob expansion, which is a simple alpha sort. You will need to sort them according to a numeric sort of the first numeric field.
Here is one way to do that:
# Declare a sort comparison sub, which extracts the part of the filename
# which we want to sort on and compares them numerically.
# This sub will be called by the sort function with the variables $a and $b
# set to the list items to be compared
sub compareFilenames {
my ($na) = ($a =~ /TC_(\d+)/);
my ($nb) = ($b =~ /TC_(\d+)/);
return $na <=> $nb;
}
# Now use glob to get the list of filenames, but sort them
# using this comparison
foreach my $file (sort compareFilenames glob("$dir/*abcd*.txt")) {
print "$file\n";
}
See: perldoc for sort
That's printing the files in order -- ASCII order that is.
In ASCII, the underscore (_) sorts after the digits. If you want your files in the correct order, you'll have to sort them yourself. Without sort, there's no guarantee that they'll print in any particular order. Even worse for you, you don't really want to print the files in either numeric sorted order (because the file names aren't numeric) or ASCII order (because you want TC_10 to print before TC_100).
Therefore, you need to write your own sorting routine. Perl gives you the sort command. By default, it sorts in ASCII order. However, you can define your own subroutine to sort in the order you want. sort will pass two values into your sort routine, $a and $b. What you can do is parse these two values to get the sort keys you want, then use the <=> or cmp operators to return the values in the correct sort order:
#! /usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw(say);

opendir my $dir, 'temp';    # Opens a directory for reading
my @dir_list = readdir $dir;
closedir $dir;

@dir_list = sort {    # My sort routine embedded inside the sort command
    my $a_val;
    my $b_val;
    if ( $a =~ /^TC_(\d+)_/ ) {
        $a_val = $1;
    }
    else {
        $a_val = 0;
    }
    if ( $b =~ /^TC_(\d+)_/ ) {
        $b_val = $1;
    }
    else {
        $b_val = 0;
    }
    return $a_val <=> $b_val;
} @dir_list;

for my $file (@dir_list) {
    next if $file =~ /^\./;
    say "$file";
}
In my sort subroutine, I take $a and $b, pull out the number you want to sort them by, and put those values into $a_val and $b_val. I also have to watch what happens if the files don't have the name I think they may have; here I simply decide to set the sort value to 0 and hope for the best.
I am using opendir and readdir instead of globbing. This will end up including . and .. in my list, and it will include any file that starts with a dot. No problem, I'll remove these when I print out the list.
In my test, this prints out:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_11_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
Files are sorted numerically by the first set of digits after TC_.
Here you go:
#!/usr/bin/perl
use warnings;
use strict;

sub by_substring {
    $a =~ /(\d+)/;
    my $x = $1;
    $b =~ /(\d+)/;
    my $y = $1;
    return $x <=> $y;
}

my @files = <*.txt>;
@files = sort by_substring @files;
for my $inputfile (@files) {
    print $inputfile."\n";
}
It will not matter if your filenames start with "TC" or "BD" or "President Carter"; this will just use the first set of adjacent digits for the sorting.
The sort in the directory listing will be alphanumeric, hence your effect. I do not know how to sort glob by creation date; here is a workaround:
my $dir = "/var/tmp";
my @files = glob("$dir/*abcd*.txt");
my @sorted_files;
for my $filename (@files) {
    my ($number) = $filename =~ m/TC_(\d+)_abcd/;
    $sorted_files[$number] = $filename;
}
print join "\n", @sorted_files;
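Note that this builds a sparse array indexed by the extracted number, so any gaps in the numbering (or names that don't match the pattern) leave undef entries; as a small follow-up sketch, you could filter those out before printing:
print join "\n", grep { defined } @sorted_files;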

Parsing the large files in Perl

I need to compare a big file (2GB, containing 22 million lines) with another file. It was taking too much time to process while using Tie::File, so I have done it with a 'while' loop instead, but the problem remains. See my code below...
use strict;
use Tie::File;
# use warnings;

my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';
# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);

open(RE,">>res.txt");
open(IN,"<input.txt");
while(my $data=<IN>){
    chomp($data);
    print "$data\n";
    my $occ=0;
    open(IT,"<title_Nov19.txt");
    while(my $line2=<IT>){
        my $line=$line2;
        chomp($line);
        if($line=~m/\b$data\b/is){
            $occ++;
        }
    }
    print RE "$data\t$occ\n";
}
close(IT);
close(IN);
close(RE);
So please help me reduce it...
Lots of things wrong with this.
Aside from the usual (lack of use strict, use warnings, use of 2-argument open(), not checking the open() result, use of global filehandles), the specific problem in your case is that you are opening/reading/closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; then you can open the first file, input.txt, and walk along that once, comparing against what you stored in memory so you don't have to reopen the second file all the time.
Further, I suggest you read some basic articles on style etc., as your question is likely to gain more attention if it's written to vaguely modern standards.
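As a rough sketch of the restructuring suggested above (file and variable names follow the original; \Q...\E is added here so the search word is treated literally):
use strict;
use warnings;

# read the comparison file into memory once
open my $titles_fh, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!";
chomp( my @titles = <$titles_fh> );
close $titles_fh;

open my $in_fh,  '<',  'input.txt' or die "input.txt: $!";
open my $out_fh, '>>', 'res.txt'   or die "res.txt: $!";

# walk input.txt once, counting matching title lines for each word
while ( my $data = <$in_fh> ) {
    chomp $data;
    my $occ = grep { /\b\Q$data\E\b/i } @titles;
    print $out_fh "$data\t$occ\n";
}

close $in_fh;
close $out_fh;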
I tried to build a small example script with a better structure, but I have to say, man, your problem description is really very unclear. It's important not to read the whole comparison file each time, as @LeoNerd explained in his answer. I then use a hash to keep track of the match counts:
#!/usr/bin/env perl
use strict;
use warnings;

# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;

# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();

# compare each line
while (my $title = <$input>) {
    chomp $title;

    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}

# done
close $input;

# output
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    print $output "$comp\t", $count{$comp} // 0, "\n";    # 0 if a word never matched
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
title_Nov19.txt
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
input.txt
foo
bar
And the result of the program was written to res.txt:
foo 3
bar 1
Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;

my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';

for ( read_file 'title_Nov19.txt' ) {
    my %seen;
    !$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}

write_file 'res.txt', map "$_\t$count{$_}\n",
    sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes metacharacters (\Q$_\E) is built and used, so only one pass against the large file's lines is needed. The hash %seen is used to ensure that the input words are only counted once per line.
Hope this helps!
Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt
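(For reference: -i matches case-insensitively, -w matches whole words only, -f input.txt reads the search patterns from input.txt, and -c prints a count of matching lines. Note that -c gives one count of matching lines per file rather than a separate count per word, so the output format differs from the Perl solutions above.)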

Sort CSV based on a certain column?

I'm sure I've done this in the past and there is something small I'm forgetting, but how can I sort a CSV file on a certain column? I'm interested in answers with and without 3rd party Perl modules. Mainly methods without, since I don't always have access to install additional modules.
Example data:
name,25,female
name,24,male
name,27,female
name,21,male
desired end result after sorting on the 2nd numeric column:
name,21,male
name,24,male
name,25,female
name,27,female
As CSV is a pretty complex format, it is better to use a module that does the work for us.
Following is an example using the Text::CSV module:
#!/usr/bin/env perl
use strict;
use warnings;

use constant AGE => 1;

use Text::CSV;

my $csv = Text::CSV->new();

my @rows;
while ( my $row_ref = $csv->getline( \*DATA ) ) {
    push @rows, $row_ref;
}

@rows = sort { $a->[AGE] <=> $b->[AGE] } @rows;

for my $row_ref (@rows) {
    $csv->combine(@$row_ref);
    print $csv->string(), "\n";
}

__DATA__
name,25,female
name,24,male
name,27,female
name,21,male
In the spirit of there always being another way to do it, bear in mind that plain old GNU sort might be enough.
$ sort -t, -k2 -n unsorted.txt
name,21,male
name,24,male
name,25,female
name,27,female
Where the command line args are:
-t,  # use comma as the field separator
-k2  # sort on the second field of each line
-n   # sort using numerical comparison (like using <=> instead of cmp in Perl)
If you want a Perl solution, wrap it in qx() ;-)
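For example, a minimal sketch of that wrapping (assuming a Unix-like environment with sort on the PATH and the data in unsorted.txt):
my @sorted = qx(sort -t, -k2 -n unsorted.txt);    # one element per output line
print @sorted;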
There is also DBD::CSV:
#!/usr/bin/perl
use strict; use warnings;
use DBI;

my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    RaiseError => 1,
    f_ext      => '.csv',
    csv_tables => { test => { col_names => [qw' name age sex '] } },
});

my $sth = $dbh->prepare(q{
    SELECT name, age, sex FROM test ORDER BY age
});

$sth->execute;

while ( my @row = $sth->fetchrow_array ) {
    print join(',' => @row), "\n";
}

$sth->finish;
$dbh->disconnect;
Output:
name,21,male
name,24,male
name,25,female
name,27,female
The original poster asked for no third-party modules (which I take to mean nothing from CPAN). Whilst this is a restriction that will horribly limit your ability to write good modern Perl code, in this instance it's possible using the (core) Text::ParseWords module in place of the (non-core) Text::CSV. So, borrowing heavily from Alan's example, we get:
#!/usr/bin/env perl
use strict;
use warnings;

use Text::ParseWords;

my @rows;
while (<DATA>) {
    push @rows, [ parse_line(',', 0, $_) ];
}

@rows = sort { $a->[1] <=> $b->[1] } @rows;

foreach (@rows) {
    print join ',', @$_;
}

__DATA__
name,25,female
name,24,male
name,27,female
name,21,male
When you provide your own comparison code, you can sort on anything. Just extract the desired element with a regex, or probably a split in this case, and then compare on that. If you have a lot of elements, I would parse the data into a list of lists and then the comparison code can access it without parsing. That would eliminate parsing the same row over and over as it's compared with other rows.
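As a rough sketch of that parse-once approach with split (this assumes simple comma-separated data with no quoted fields, which is why the module-based answers above are more robust):
use strict;
use warnings;

# parse each line once into [original line, age] pairs
my @rows = map { chomp; [ $_, (split /,/)[1] ] } <DATA>;

# the comparison only looks at the pre-extracted age, so no re-parsing happens
my @sorted = sort { $a->[1] <=> $b->[1] } @rows;

print "$_->[0]\n" for @sorted;

__DATA__
name,25,female
name,24,male
name,27,female
name,21,male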
using Raku (née Perl6)
This is a fairly quick-and-dirty solution, mainly intended for "hand-rolled" CSV. The code works as long as there's only one (1) age-per-row: Read lines $a, comb for 1-to-3 <digit> surrounded by commas and assign to @b, derive sorting index $c, use $c to reorder lines $a:
~$ raku -e 'my $a=lines(); my @b=$a.comb(/ \, <(\d**1..3)> \, /).pairs; my $c=@b.sort(*.values)>>.keys.flat; $a[$c.flat]>>.put;' sort_age.txt
name,21,male
name,24,male
name,25,female
name,27,female
I prepended a few dummy lines to the OP's input file to see how the code above reacts with 1). a blank age field, 2). a blank "" string for age, 3). a bogus "9999" for age, and 4). a bogus "NA" for age. The code above fails catastrophically. To fix this you have to write a ternary that inserts a numeric placeholder value (e.g. zero) whenever the regex fails to match a line.
Below is a longer but more robust solution. Note--I use a placeholder value of 999 to move lines with blank/invalid ages to the bottom:
~$ raku -e 'my @a=lines(); my @b = do for @a {if $_ ~~ m/ \, <(\d**1..3)> \, / -> { +$/ } else { 999 }; }; my $c=@b.pairs.sort(*.values)>>.keys.flat; @a[$c.flat]>>.put;' sort_age.txt
name,21,male
name,24,male
name,25,female
name,27,female
name,,male
name,"",female
name,9999,male
name,NA,male
To sort in reverse, add .reverse to the end of the method chain that creates $c. Again, change the else placeholder argument to move lines absent a valid age to the top or to the bottom. Also, creation of @b above can be written using the ternary operator: my @b = do for @a {(m/ \, <(\d**1..3)> \, /) ?? +$/ !! 999 };, as an alternative.
Here's the unsorted input file for posterity:
$ cat sort_age.txt
name,,male
name,"",female
name,9999,male
name,NA,male
name,25,female
name,24,male
name,27,female
name,21,male
HTH.
https://raku.org/
I would do something like this:
#!/usr/bin/perl
use warnings;
use strict;

my @rows = map { chomp; [split /[,\s]+/, $_] } <DATA>;    # read each row into an array
my @sorted = sort { $a->[1] <=> $b->[1] } @rows;          # sort the rows (numerically) by second column

for (@sorted) {
    print join(', ', @$_) . "\n";    # print them out as CSV
}

__DATA__
name,25,female
name,24,male
name,27,female
name,21,male