Write file name in sequence of generation in perl - perl

I have some 1000 files in a directory. Naming convention of the file is like below.
TC_01_abcd_16_07_2014_14_06.txt
TC_02_abcd_16_07_2014_14_06.txt
TC_03_abcd_16_07_2014_14_07.txt
.
.
.
.
TC_100_abcd_16_07_2014_15_16.txt
.
.
.
TC_999_abcd_16_07_2014_17_06.txt
I have written some code like this
my #dir="/var/tmp";
foreach my $inputfile (glob("$dir/*abcd*.txt")) {
print $inputfile."\n";
}
While running this it is not printing in sequence.
it it printing till 09 file then it is printing 1000th file name then
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
TC_11_abcd_16_07_2014_11_56.txt
Please guide me how to print in sequence

The files are sorted according to the rules of shell glob expansion, which is a simple alpha sort. You will need to sort them according to a numeric sort of the first numeric field.
Here is one way to do that:
# Declare a sort comparison sub, which extracts the part of the filename
# which we want to sort on and compares them numerically.
# This sub will be called by the sort function with the variables $a and $b
# set to the list items to be compared
sub compareFilenames {
my ($na) = ($a =~ /TC_(\d+)/);
my ($nb) = ($b =~ /TC_(\d+)/);
return $na <=> $nb;
}
# Now use glob to get the list of filenames, but sort them
# using this comparison
foreach my $file (sort compareFilenames glob("$dir/*abcd*.txt")) {
print "$file\n";
}
See: perldoc for sort

That's printing the files in order -- ASCII order that is.
In ASCII, the underscore (_) is after the digits when sorting. If you want to sort your files in the correct order, you'll have to sort them yourself. Without sort, there's no guarantee that they'll print in any order. Even worse for you, you don't really want to print the files in either numeric sorted order (because the file names aren't numeric) or ASCII order (because you want TC_10 to print before TC_100.
Therefore, you need to write your own sorting routine. Perl gives you the sort command. By default, it will sort in ASCII order. However, you can define your own subroutine to sort in the order you want. sort will pass two values to your in your sort routine $a and $b. What you can do is parse these two values to get the sort keys you want, then use the <=> or cmp operators to return the values in the correct sort order:
#! /usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw(say);
opendir my $dir, 'temp'; # Opens a directory for reading
my #dir_list = readdir $dir;
closedir $dir;
#dir_list = sort { # My sort routine embedded inside the sort command
my $a_val;
my $b_val;
if ( $a =~ /^TC_(\d+)_/ ) {
$a_val = $1;
}
else {
$a_val = 0;
}
if ( $b =~ /^TC_(\d+)_/ ) {
$b_val = $1;
}
else {
$b_val = 0;
}
return $a_val <=> $b_val;
} #dir_list;
for my $file (#dir_list) {
next if $file =~ /^\./;
say "$file";
}
In my sort subroutine am going to take $a and $b and pull out the number you want to sort them by and put that value into $a_val and $b_val. I also have to watch what happens if the files don't have the name I think they may have. Here I simply decide to set the sort value to 0 and hope for the best.
I am using opendir and readdir instead of globbing. This will end up including . and .. in my list, and it will include any file that starts with .. No problem, I'll remove these when I print out the list.
In my test, this prints out:
TC_01_abcd_16_07_2014_11_55.txt
TC_02_abcd_16_07_2014_11_55.txt
TC_03_abcd_16_07_2014_11_55.txt
TC_04_abcd_16_07_2014_11_55.txt
TC_05_abcd_16_07_2014_11_56.txt
TC_06_abcd_16_07_2014_11_56.txt
TC_07_abcd_16_07_2014_11_56.txt
TC_08_abcd_16_07_2014_11_56.txt
TC_09_abcd_16_07_2014_11_56.txt
TC_10_abcd_16_07_2014_11_56.txt
TC_11_abcd_16_07_2014_11_56.txt
TC_100_abcd_16_07_2014_12_04.txt
TC_101_abcd_16_07_2014_12_04.txt
TC_102_abcd_16_07_2014_12_04.txt
TC_103_abcd_16_07_2014_12_04.txt
TC_104_abcd_16_07_2014_12_04.txt
TC_105_abcd_16_07_2014_12_04.txt
TC_106_abcd_16_07_2014_12_04.txt
TC_107_abcd_16_07_2014_12_04.txt
TC_108_abcd_16_07_2014_12_05.txt
TC_109_abcd_16_07_2014_12_05.txt
TC_110_abcd_16_07_2014_12_05.txt
TC_111_abcd_16_07_2014_12_05.txt
TC_112_abcd_16_07_2014_12_05.txt
TC_113_abcd_16_07_2014_12_05.txt
TC_114_abcd_16_07_2014_12_05.txt
TC_115_abcd_16_07_2014_12_05.txt
TC_116_abcd_16_07_2014_12_05.txt
TC_117_abcd_16_07_2014_12_05.txt
TC_118_abcd_16_07_2014_12_05.txt
TC_119_abcd_16_07_2014_12_06.txt
Files are sorted numerically by the first set of digits after TC_.

Here you go:
#!/usr/bin/perl
use warnings;
use strict;
sub by_substring{
$a=~ /(\d+)/;
my $x=$1;
$b=~ /(\d+)/;
my $y=$1;
return $x <=> $y;
}
my #files=<*.txt>;
#files = sort by_substring #files;
for my $inputfile (#files){
print $inputfile."\n";
}
It will not matter if your filenames start with "TC" or "BD" or "President Carter", this will just use the first set of adjacent digits for the sorting.

the sort in the directory will be alphanumeric, hence your effect. i do not know how to sort glob by creation date, here is a workaround:
my #dir="/var/tmp";
my #files = glob("$dir/*abcd*.txt");
my #sorted_files;
for my $filename (#files) {
my ($number) = $filename =~ m/TC_(\d+)_abcd/;
$sorted_files[$number] = $filename;
}
print join "\n", #sorted_filenames;

Related

Perl - Could not open and read files

I've created a script for validating xml files after given input folder. It should grep xml files from the input directory then sort out the xml files and check the condition. But it throws a command that not Open at line , <STDIN> line 1.
But it creates an empty log file.
Since i faced numeric error while sorting, comment that.
so i need to be given input location, the script should check the xml files and throw errors in a mentioned log file.
Anyone can help this?
Script
#!/usr/bin/perl
# use strict;
use warnings;
use Cwd;
use File::Basename;
use File::Path;
use File::Copy;
use File::Find;
print "Enter the path: ";
my $filepath = <STDIN>;
chomp $filepath;
die "\n\tpleas give input folder \n" if(!defined $filepath or !-d $filepath);
my $Toolpath = dirname($0);
my $base = basename($filepath);
my $base_path = dirname($filepath);
my ($xmlF, #xmlF);
my #errors=();
my #warnings=();
my #checkings=();
my $ecount=0;
my $wcount=0;
my $ccount=0;
my ($x, $y);
my $z="0";
opendir(DIR,"$filepath");
my #xmlFiles = grep{/\.xml$/} readdir(DIR);
closedir(DIR);
my $logfile = "$base_path\\$base"."_Err.log";
# #xmlF=sort{$a <=> $b}#xmlFiles;
#xmlF=sort{$a cmp $b}#xmlFiles;
open(OUT, ">$logfile") || die ("\nLog file couldnt write $logfile :$!");
my $line;
my $flcnt = scalar (#xmlF);
for ($x=0; $x < $flcnt; $x++)
{
open IN, "$xmlF[$x]" or die "not Open";
print OUT "\n".$xmlF[$x]."\n==================\n";
print "\nProcessing File $xmlF[$x] .....\n";
local $/;
while ($line=<IN>)
{
while ($line=~m#(<res(?: [^>]+)? type="weblink"[^>]*>)((?:(?!</res>).)*)</res>#igs)
{
my $tmp1 = $1; my $tmp2 = $&; my $pre1 = $`;
if($tmp1 =~ m{ subgroup="Weblink"}i){
my $pre = $pre1.$`;
if($tmp2 !~ m{<tooltip><\!\[CDATA\[Weblink\]\]><\/tooltip>}ms){
my $pre = $pre1.$`;
push(#errors,lineno($pre),"\t<tooltip><\!\[CDATA\[Weblink\]\]></tooltip> is missing\n");
}
}
}
foreach my $warnings(#warnings)
{
$wcount = $wcount+1;
}
foreach my $checkings(#checkings)
{
$ccount = $ccount+1;
}
foreach my $errors(#errors)
{
$ecount = $ecount+1;
}
my $count_err = $ecount/2;
print OUT "".$count_err." Error(s) Found:-\n------------------------\n ";
print OUT "#errors\n";
$ecount = 0;
my $count_war = $wcount/2;
print OUT "$count_war Warning(s) Found:-\n-------------------------\n ";
print OUT "#warnings\n";
$wcount = 0;
my $count_check = $ccount/2;
print OUT "$count_check Checking(s) Found:-\n-------------------------\n ";
print OUT "#checkings\n";
$wcount = 0;
undef #errors;
undef #warnings;
undef #checkings;
close IN;
}
}
The readdir returns bare file names, without the path.
So when you go ahead to open those files you need to prepend the names returned by readdir with the name of the directory the readdir read them from, here $filepath. Or build the full path names right away
use warnings;
use strict;
use feature 'say';
use File::Spec;
print "Enter the path: ";
my $filepath = <STDIN>;
chomp $filepath;
die "\nPlease give input folder\n" if !defined $filepath or !-d $filepath;
opendir(my $fh_dir, $filepath) or die "Can't opendir $filepath: $!";
my #xml_files =
map { File::Spec->catfile($filepath, $_) }
grep { /\.xml$/ }
readdir $fh_dir;
closedir $fh_dir;
say for #xml_files;
where I used File::Spec to portably piece together the file name.
The map can be made to also do grep's job so to make only one pass over the file list
my #xml_files =
map { /\.xml$/ ? File::Spec->catfile($filepath, $_) : () }
readdir $fh_dir;
The empty list () gets flattened in the returned list, effectively disappearing altogether.
Here are some comments on the code. Note that this is normally done at Code Review but I feel that it is needed here.
First: a long list of variables is declared upfront. It is in fact important to declare in as small a scope as possible. It turns out that most of those variables can indeed be declared where they are used, as seen in comments below.
The location of the executable is best found using
use FindBin qw($RealBin);
where $RealBin also resolves links (as opposed to $Bin, also available)
Assigning () to an array at declaration doesn't do anything; it is exactly the same as normal my #errors;. They can also go together, my (#errors, #warnings, #checks);. If the array has something then = () clears it, what is a good way to empty an array
Assigning a "0" makes the variable a string. While Perl normally converts between strings and numbers as needed, if a number is needed then use a number, my $z = 0;
Lexical filehandles (open my $fh, ...) are better than globs (open FH, ...)
I don't understand the comment about "numeric error" in sorting. The cmp operator sorts lexicographically, for numeric sort use <=>
When array is used in scalar context – when assigned to a scalar for example – the number of elements is returned. So no need for scalar but do my flcnt = #xmlF;
For iteration over array indices use $#ary, the index of the last element of #ary, for
foreach my $i (0..$#xmlF) { ... }
But if there aren't any uses of the index (I don't see any) then loop over elements
foreach my $file (#xmlF) { ... }
When you check the file open print the error $!, open ... or die "... : $!";. This is done elsewhere in the code, and it should be done always.
The local $/; unsets the input record separator, what makes the following read take the whole file. If that is intended then $line is not a good name. Also note that a variable can be declared inside the condition, while (my $line = <$fh>) { }
I can't comment on the regex as I don't know what it's supposed to accomplish, but it is complex; any chance to simplify all that?
The series of foreach loops only works out the number of elements of those arrays; there is no need for loops then, just my $ecount = #errors; (etc). This also allows you to keep the declaration of those counter variables in minimal scope.
The undef #errors; (etc) aren't needed since those arrays count for each file and so you can declare them inside the loops, anew at each iteration (and at smallest scope). When you wish to empty an array it is better to do #ary = (); than to undef it; that way it's not allocated all over again on the next use

Why my sorting fails for double digits using perl? [duplicate]

I have 1500 files in one directory and I need to get some information out of every one and write it into a new, single file. The file names consist of a word and a number (Temp1, Temp2, Temp3 and so on) and it is important that the files are read in the correct order according to the numbers.
I did this using
my #files = <Temp*.csv>;
for my $file (#files)
{
this part appends the required data to a seperate file and works fine
}
my problem now is that the files are not opened in the correct order but after file 1 the file 100 gets opened.
Can anybody please give me a hint how I can make it read the files in the right order?
Thank you,
Ca
Sort the files naturally with Sort::Key::Natural natsort.
The following will automatically sort the files naturally, separating out alpha and numerical portions of the name for the appropriate sort logic.
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
for my $file ( natsort <Temp*.csv> ) {
# this part appends the required data to a seperate file and works fine
}
The following fake data should demonstrate this module in action:
use strict;
use warnings;
use Sort::Key::Natural qw(natsort);
print natsort <DATA>;
__DATA__
Temp100.csv
Temp8.csv
Temp20.csv
Temp1.csv
Temp7.csv
Outputs:
Temp1.csv
Temp7.csv
Temp8.csv
Temp20.csv
Temp100.csv
You can use Schwartzian transform to read and sort files in one step,
my #files =
map { $_->[0] }
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\d+)/ ] } <Temp*.csv>;
or using less efficient, and more straightforward sort,
my #files = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } <Temp*.csv>;
If the numbers are really important, you might want to read them specifically after file name, with error reporting about missing files:
my #nums = 1 .. 1500; # or whatever the highest is
for my $num (#nums) {
my $file = "Temp$num.csv";
unless (-e $file) {
warn "Missing file: $file";
next;
}
...
# proceed as normal
}
If you need a file count, you can simply use your old glob:
my #files = <Temp*.csv>;
my $count = #files; # get the size of the array
my #nums = 1 .. $count;
On the other hand, if you control the process that prints the files, you might select a format that will automatically sort itself, such as:
temp00001.csv
temp00002.csv
temp00003.csv
temp00004.csv
...
temp00101.csv

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes from command line file(s) and prints out the number of occurrence of a word from all files and in which file it occurs. The first part, finding the number of occurrence of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
while (<>) {
$count{$word}{$ARGV}++ for map +lc, /\w+/g;
}
foreach my $word ( keys %count ) {
my #files = keys %$word; # All files containing lc $word
print "Total word count for '$word': ", sum( #{ $count{$word} }{#files} ), "\n";
for my $file ( #files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}
Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push #{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}: #{ $in_file{$str} }\n";
}

Sorting files numerically in Perl

I would like to sort files numerically using Perl script.
My files looks like below:
1:file1:filed2
3:filed1:field2
10:filed1:field2
4:field1:field2
7:field1:field2
I would like to display it as:
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
The way sort works in perl, is that it works through your list, setting each element to $a and $b - then testing those. By default, it uses cmp which is an alphanumeric sort.
You've also got <=> which is a numeric sort, and the kind you're looking for. (Alpha sorts 10 ahead of 2 ).
So - all we need do is extract the numeric value of your key. There's a number of ways you could do this - the obvious being to take a subroutine that temporarily copies the variables:
#!/usr/bin/env perl
use strict;
use warnings;
sub compare_first_num {
my ( $a1 ) = split ( /:/, $a );
my ( $b1 ) = split ( /:/, $b );
return $a1 <=> $b1;
}
print sort compare_first_num <>;
This uses <> - the magic filehandle - to read STDIN or files specified on command line.
Or alternatively, in newer perls (5.16+):
print sort { $a =~ s/:.*//r <=> $b =~ s/:.*//r } <>;
We use the 'substitute-and-return' operation to compare just the substrings we're interested in. (Numerically).
Split on : and store in a hash of arrays. Then you can sort and print out the hash keys:
my %data;
while(<DATA>){
my #field = split(/:/);
$data{$field[0]} = [#field[1..2]];
}
print join (':', $_, #{$data{$_}}) for sort { $a <=> $b } keys %data;
print "\n";
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2
For simple and fast solution, use Sort::Key::Natural (fast natural sorting) module:
use warnings;
use strict;
use Sort::Key::Natural qw( natsort );
open my $fh, "<", "file.txt" or die $!;
my #files = natsort <$fh>;
close $fh;
print #files;
Output:
1:file1:filed2
3:filed1:field2
4:field1:field2
7:field1:field2
10:filed1:field2

Sorting names of files in Perl?

I'm writing a script in Perl that I want to run on all the .csv files in a given directory. The names of the files are of the type: CCCC0.csv, CCCC1.csv, ..., CCCC198.csv. However, I want Perl to first run the script on file CCCC0.csv, than on CCCC1.csv etc...So, basically, according to the increasing value of the number at the end of the file name.
If I write:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $file;
my #files = <*.csv>;
my #orderedfiles = sort #files;
for $file (#orderedfiles) {
... do stuff
}
it first runs on CCCC100.csv rather than CCCC11.csv while if i write
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $file;
my #files = <*.csv>;
my #orderedfiles = sort { substr($a, 4) <=> substr($b, 4) } #files;
for $file (#orderedfiles) {
... do stuff
}
it gives me an error telling me that I'm not ordering a numeric (I assume that he doesn't understand that it's a number after the 4 characters rather than another character.)
I have looked at the countless questions on Stackoverflow or perlmonks that deal with sorting but i haven't been able to find an answer to my question.
EDIT: I'm using a windows machine.
You were almost there... the '.CSV' is still there. You'd be better served using regex to read just numeric characters.
my #sorted = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } #files;
There is an idiom called the Schwartzian Transform that can also do this, though it takes a CS major to understand :D
my #sorted = map { $_->[0] } # return the sorted file names
#
sort { $a->[1] <=> $b->[1] } # sort on the numeric portion
#
map { [$_, /(\d+)/] } # wrap the file names in a temporary
#files; # array with their numeric portions.
# ^^ read from bottom to top ^^
You could give Sort::Key::Natural a spin. From the synopsis:
use Sort::Key::Natural qw(natsort);
my #data = qw(foo1 foo23 foo6 bar12 bar1
foo bar2 bar-45 foomatic b-a-r-45);
my #sorted = natsort #data;
print "#sorted\n";
# prints:
# b-a-r-45 bar1 bar2 bar12 bar-45 foo foo1 foo6 foo23 foomatic
I believe that the substr($a, 4) is returning "100.csv" in your example, so that you need to trim the .csv suffix off it still.