What is the most efficient way to open/act upon all of the files in a directory? - perl

I need to run my script (a search) on every file in a directory. Here are the methods that work; I am just asking which is best. (The file names have the form: parsedchpt31_4.txt)
Glob:
my $parse_corpus; #(for all options)
##glob (only if all files in same directory as script?):
my @files = glob("parsed"."*.txt");
foreach my $file (@files) {
    open($parse_corpus, '<', $file) or die $!;
    ... all my code ...
}
Readdir with while and conditions:
##readdir:
my $dir = '.';
opendir(DIR, $dir) or die $!;
while (my $file = readdir(DIR)) {
    next unless (-f "$dir/$file");              ## Ensure it's a file
    next unless ($file =~ m/^parsed.*\.txt/);   ## Ensure it's a parsed file
    open($parse_corpus, '<', "$dir/$file") or die "Couldn't open $dir/$file: $!";
    ... all my code ...
}
Readdir with foreach and grep:
##readdir+grep:
my $dir = '.';
opendir(DIR, $dir) or die $!;
foreach my $file (grep { /^parsed.*\.txt/ } readdir(DIR)) {
    next unless (-f "$dir/$file");   ## Ensure it's a file
    open($parse_corpus, '<', "$dir/$file") or die "Couldn't open $dir/$file: $!";
    ... all my code ...
}
File::Find:
##File::Find
use File::Find;   ## core module
my $dir = ".";    ## current directory: could be (include quotes): '/Users/jon/Desktop/...'
my @files;
find(\&open_file, $dir);   ## File::Find's exported find function
sub open_file {
    push @files, $File::Find::name if (/^parsed.*\.txt/);
}
foreach my $file (@files) {
    open($parse_corpus, '<', $file) or die $!;
    ... all my code ...
}
Is there another way? Is it good practice to enclose my entire script in these loops? Is it okay that I don't use closedir? I'm passing this off to others, and I'm not sure where their files will be (so I may not be able to use glob).
Thanks a lot, hopefully this is the right place to ask this.

The best or most efficient approach depends on your purposes and the larger context. Do you mean best in terms of raw speed, simplicity of the code, or something else? I'm skeptical that memory considerations should drive this choice. How many files are in the directory?
For sheer practicality, the glob approach works fairly well. Before resorting to anything more involved, I'd ask whether there is a problem.
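If raw speed really is the concern, measure before choosing; a minimal sketch using the core Benchmark module (assuming the parsed*.txt files are in the current directory):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare the two candidate directory scans for ~2 CPU seconds each.
cmpthese(-2, {
    glob    => sub { my @f = glob 'parsed*.txt' },
    readdir => sub {
        opendir my $dh, '.' or die $!;
        my @f = grep { /^parsed.*\.txt$/ } readdir $dh;
        closedir $dh;
    },
});

In practice the scan is usually dwarfed by the per-file searching, which supports the "is there a problem?" question above.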
If you're able to use other modules, another approach is to let someone else worry about the grubby details:
use File::Util qw();
my $fu = File::Util->new;
my @files = $fu->list_dir($dir, qw(--with-paths --files-only));
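File::Util can also do the name filtering for you. The option style has changed between File::Util versions, so treat this as an unverified sketch and check list_dir in the docs of the version you have:

my @parsed = $fu->list_dir($dir,
    qw(--with-paths --files-only),
    '--pattern=^parsed.*\.txt$',   # assumed option name; verify against your File::Util docs
);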
Note that File::Find performs a recursive search descending into all subdirectories. Many times you don't want or need that.
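If you do reach for File::Find but want only the top level, you can stop it from descending; a minimal sketch using $File::Find::prune with the same parsed*.txt filter:

use File::Find;

my $dir = '.';
my @files;
find(sub {
    # Don't descend into anything below the starting directory.
    if (-d && $File::Find::name ne $dir) {
        $File::Find::prune = 1;
        return;
    }
    push @files, $File::Find::name if -f && /^parsed.*\.txt$/;
}, $dir);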
I would also add that I dislike your two readdir examples because they commingle different pieces of functionality: (1) getting file names, and (2) processing individual files. I would keep those jobs separate.
my $dir = '.';
opendir(my $dh, $dir) or die $!; # Use a lexical directory handle.
my @files =
    grep { -f }
    map  { "$dir/$_" }
    grep { /^parsed.*\.txt$/ }
    readdir($dh);

for my $file (@files) {
    ...
}

I think using a while loop is the safer answer. Why? Because loading all the file names into an array could mean large memory usage, and processing the directory entry by entry avoids that problem.
I prefer readdir to glob, but that's probably more a matter of taste.
If performance is an issue, one could say that the -f check is unnecessary for any file with the .txt extension.
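Concretely, the streaming version looks like this (a sketch; each name is filtered as it is read, so the full list never sits in memory):

opendir my $dh, $dir or die "Can't open $dir: $!";
while (my $name = readdir $dh) {
    next unless $name =~ /^parsed.*\.txt$/;
    open my $fh, '<', "$dir/$name" or die "Can't open $dir/$name: $!";
    while (my $line = <$fh>) {
        # ... search each line here ...
    }
    close $fh;
}
closedir $dh;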

I find that a recursive directory walking function using the perfect partners opendir/readdir and File::chdir (my fav CPAN module, great for cross-platform) allows one to easily and clearly manipulate anything in a directory including subdirectories if desired (if not, omit the recursion).
Example (a simple deep ls):
#!/usr/bin/env perl
use strict;
use warnings;
use File::chdir; #Provides special variable $CWD
# assign $CWD sets working directory
# can be local to a block
# evaluates/stringifies to absolute path
# other great features
walk_dir(shift);
sub do_something {
print shift . "\n";
}
sub walk_dir {
my $dir = shift;
local $CWD = $dir;
opendir my $dh, $CWD; # lexical opendir, so no closedir needed
print "In: $CWD\n";
while (my $entry = readdir $dh) {
next if ($entry =~ /^\.+$/);
# other exclusion tests
if (-d $entry) {
walk_dir($entry);
} elsif (-f $entry) {
do_something($entry);
}
}
}

Related

Unable to open files returned by readdir in Perl [duplicate]

This question already has answers at "Why can't I open files returned by Perl's readdir?".
I have a problem with a Perl script, as follows.
I must open and analyze all the *.txt files in a directory, but I cannot.
I can read file names that are saved in the @files array and printed, but I cannot open those files for reading.
This is my code:
my $dir= "../Scrivania/programmi" ;
opendir my ($dh), $dir;
my #files = grep { -f and /\.txt/i } readdir $dir;
closedir $dh;
for my $file ( #files ) {
$file = catfile($dir, $file);
print qq{Opening "$file"\n};
open my $fh, '<', $file;
# Do stuff with the data from $fh
print "sono nel foreach\n";
print " in : "."$fh\n";
#open(CANALI,$fh);
##righe=<CANALI>;
#close(CANALI);
#print "canali:"."#righe\n";
#foreach $canali (#righe)
#{
# $canali =~ /\d\d:\d\d (-) (.*)/;
# $ora= $1;
#
# if($hhSplit[0] == $ora)
# {
# push(#output, "$canali");
#
# }
#}
}
The main problem you have is that the file names returned by readdir have no path, so you're trying to open, say, x.txt when you should be opening ../Sc/direct/x.txt. The file doesn't exist in the current working directory, so your open call fails.
You also have a strange mixture of stuff in glob("$dir/(.*).txt/") which looks a little like a regex pattern, which glob doesn't understand. The value of $dir is a directory handle left open from the opendir on the first line. What you should be using is glob '../Sc/direct/*.txt', but then there's no need for the readdir.
There are two ways to find the contents of a directory. You can use opendir and readdir to read everything in the directory, or you can use glob.
The first method returns only the bare name of each entry, which means you must concatenate each name with the path to the containing directory, preferably using catfile from File::Spec::Functions. It also includes the pseudo-directories . and .., so you must filter those out before you can use the list of names.
glob has neither of these disadvantages. All the strings it returns are real directory entries, and they will include a path if you provided one in the pattern you passed as a parameter.
You seem to have become rather muddled over the two, so I have written this program which differentiates between the two approaches. I hope it makes things clearer
use strict;
use warnings;
use v5.10.1;
use autodie;

use File::Spec::Functions qw/ catfile /;

my $dir = '../Sc/direct';

### Using glob

for my $file ( glob catfile($dir, '*.txt') ) {
    print qq{Opening "$file"\n};
    open my $fh, '<', $file;
    # Do stuff with the data from $fh
}

### Using opendir / readdir

opendir my ($dh), $dir;
my @files = grep { /\.txt$/i and -f catfile($dir, $_) } readdir $dh;
closedir $dh;

for my $file ( @files ) {
    $file = catfile($dir, $file);
    print qq{Opening "$file"\n};
    open my $fh, '<', $file;
    # Do stuff with the data from $fh
}
Using $dir in the glob is incorrect: $dir is a GLOB type, not a string value. Rather, you should be looping over the @files array and looking for names that match what you want. Maybe something like so:
foreach my $fp (@files) {
    if ($fp =~ /(.*)\.txt$/) {
        print "$fp is a .txt\n";
        open(my $in, "<", $fp) or die "can't open $fp: $!";
        while (<$in>) {
            # ... process each line ...
        }
    }
}

In Perl, how can I filter all the log files in a directory and extract interesting lines?

I'm trying to select only the .log files in my directory and then search in those files for the word "unbound" and print the entire line into a new output file with the same name as the log file (number###.log) but with a .txt extension. This is what I have so far:
#!/usr/bin/perl
use strict;
use warnings;
my $path = $ARGV[0];
my $outpath = $ARGV[1];
my @files;
my $files;
opendir(DIR, $path) or die "$!";
@files = grep { /\.log$/ } readdir(DIR);
my @out;
my $out;
opendir(OUT, $outpath) or die "$!";
my $line;
foreach $files (@files) {
    open (FILE, "$files");
    my @line = <FILE>;
    my $regex = Unbound;
    open (OUT, ">>$out");
    print grep { $line =~ /$regex/ } <>;
}
close OUT;
close FILE;
closedir(DIR);
closedir (OUT);
I'm a beginner, and I don't really know how to create a new text file with the acquired output.
A few things I'd suggest to improve this code:
declare your loop iterators within the loop: foreach my $file ( @files ) {
use 3-arg open: open ( my $input_fh, "<", $filename );
use glob rather than opendir then grep: foreach my $file ( <$path/*.log> ) {
grep is good for extracting things into arrays. Your grep reads the whole file to print it, which isn't necessary. It doesn't matter much if the file is short, though.
perltidy is great for reformatting code.
you're opening 'OUT' on a directory path (I think?), which isn't going to work: $outpath is a directory, not a file. You need to do something different to output to different files; opendir isn't really valid for output.
because you're using opendir, you're actually getting file names, not full paths, so you might be in the wrong place to actually open the files. Prepending the path name or doing a chdir are possible solutions, but that's one of the reasons I like glob: it returns a path as well.
So with that in mind - how about:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
#Extract paths
my $input_path  = $ARGV[0];
my $output_path = $ARGV[1];

#Error if paths are invalid.
unless (defined $input_path
    and -d $input_path
    and defined $output_path
    and -d $output_path)
{
    die "Usage: $0 <input_path> <output_path>\n";
}

foreach my $filename (<$input_path/*.log>) {

    # extract the 'name' bit of the filename.
    # be slightly careful with this - it's based
    # on an assumption which isn't always true.
    # File::Spec is a more powerful way of accomplishing this,
    # but this should grab 'number####' from /path/to/file/number####.log
    my $output_file = basename($filename, '.log');

    #open input and output filehandles.
    open(my $input_fh, "<", $filename) or die $!;
    open(my $output_fh, ">", "$output_path/$output_file.txt") or die $!;
    print "Processing $filename -> $output_path/$output_file.txt\n";

    #iterate input, extracting into $line
    while (my $line = <$input_fh>) {
        #check if $line matches your RE.
        if ($line =~ m/Unbound/) {
            #write it to output.
            print {$output_fh} $line;
        }
    }

    #tidy up our filehandles. Although technically, they'll
    #close automatically because they leave scope
    close($output_fh);
    close($input_fh);
}
Here is a script that takes advantage of Path::Tiny. Now, at this stage of your learning process, you are probably better off understanding @Sobrique's solution, but using modules such as Path::Tiny or Path::Class will make it easier to write these one-off scripts quickly and correctly.
Also, I didn't really test this script, so watch out for bugs.
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;
run(\@ARGV);

sub run {
    my $argv = shift;
    unless (@$argv == 2) {
        die "Need source and destination paths\n";
    }

    my $it = path($argv->[0])->realpath->iterator({
        recurse         => 0,
        follow_symlinks => 0,
    });
    my $outdir = path($argv->[1])->realpath;

    while (my $path = $it->()) {
        next unless -f $path;
        next unless $path =~ /[.]log\z/;
        my $logfh = $path->openr;
        my $outfile = $outdir->child($path->basename('.log') . '.txt');
        my $outfh;
        while (my $line = <$logfh>) {
            next unless $line =~ /Unbound/;
            unless ($outfh) {
                $outfh = $outfile->openw;
            }
            print $outfh $line;
        }
        # Guard the close: $outfh is only opened if a line matched.
        if ($outfh) {
            close $outfh
                or die "Cannot close output '$outfile': $!";
        }
    }
}
Notes
realpath will croak if the path provided does not exist.
Similarly for openr and openw.
I am reading input files line-by-line to keep the memory footprint of the program independent of the sizes of input files.
I do not open the output file until I know I have a match to print to.
When matching a file extension using a regular expression pattern, keep in mind that \n is a valid character in Unix file names, and the $ anchor will match it.
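A tiny demonstration of that last point:

my $name = "trick.log\n";                      # a legal, if nasty, Unix file name
print "matched \$\n"  if $name =~ /[.]log$/;   # matches: $ allows a trailing newline
print "matched \\z\n" if $name =~ /[.]log\z/;  # no match: \z anchors at the true end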

Why does readdir() list the filenames in the wrong order?

I'm using the following code to read filenames from a directory and push them onto an array:
#!/usr/bin/perl
use strict;
use warnings;
my $directory="/var/www/out-original";
my $filterstring=".csv";
my @files;
# Open the folder
opendir(DIR, $directory) or die "couldn't open $directory: $!\n";
foreach my $filename (readdir(DIR)) {
    if ($filename =~ m/$filterstring/) {
        # print $filename;
        # print "\n";
        push (@files, $filename);
    }
}
closedir DIR;
foreach my $file (@files) {
    print $file . "\n";
}
The output I get from running this code is:
Report_10_2014.csv
Report_04_2014.csv
Report_07_2014.csv
Report_05_2014.csv
Report_02_2014.csv
Report_06_2014.csv
Report_03_2014.csv
Report_01_2014.csv
Report_08_2014.csv
Report.csv
Report_09_2014.csv
Why is this code pushing the file names into the array in this order, and not from 01 to 10?
Unix directories are not stored in sorted order. Tools like ls (and the shell, when it expands globs) sort directory listings for you, but Perl's readdir function does not; it returns entries in the same order the kernel does, which is based on the order they're stored in. If you want the results to be sorted, you'll need to do that yourself:
for my $filename (sort readdir(DIR)) {
(Btw: bareword handles like DIR are global variables; it's considered good practice to use lexical handles instead, like:
opendir my $dir, $directory or die "Couldn't open $directory: $!\n";
for my $filename (sort readdir($dir)) {
as a safety measure.)
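And if plain lexical order isn't what you want for names like Report_04_2014.csv, you can sort on the embedded month number instead; a sketch that assumes that exact naming pattern (a bare Report.csv sorts first):

my @sorted =
    map  { $_->[1] }                       # keep just the name
    sort { $a->[0] <=> $b->[0] }           # numeric sort on the month
    map  { [ /_(\d+)_/ ? $1 : 0, $_ ] }    # pair each name with its month (0 if none)
    @files;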

Recursive Perl detail need help

I think this is a simple problem, but I've been stuck on it for some time now! I need a fresh pair of eyes on this.
The thing is, I have this code in Perl:
#!c:/Perl/bin/perl
use CGI qw/param/;
use URI::Escape;

print "Content-type: text/html\n\n";

my $directory = param('directory');
$directory = uri_unescape($directory);
my @contents;
readDir($directory);

foreach (@contents) {
    print "$_\n";
}

#------------------------------------------------------------------------
sub readDir(){
    my $dir = shift;
    opendir(DIR, $dir) or die $!;
    while (my $file = readdir(DIR)) {
        next if ($file =~ m/^\./);
        if (-d $dir.$file) {
            #print $dir.$file. " ----- DIR\n";
            readDir($dir.$file);
        }
        push @contents, ($dir . $file);
    }
    closedir(DIR);
}
I've tried to make it recursive. I need to get all the files in all the directories and subdirectories, with the full path, so that I can open the files later.
But my output only returns the files in the current directory and the files in the first subdirectory that it finds. If I have 3 folders inside the directory, it only shows the first one.
Ex. of cmd call:
"perl readDir.pl directory=C:/PerlTest/"
Thanks
Avoid wheel reinvention, use CPAN.
use Path::Class::Iterator;

my $it = Path::Class::Iterator->new(
    root          => $dir,
    breadth_first => 0,
);

until ($it->done) {
    my $f = $it->next;
    push @contents, $f;
}
Make sure that you don't let people set $dir to something that will let them look somewhere you don't want them to look.
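For example, a minimal sketch of such a check (the base path and the rules here are illustrative, not a complete defense):

my $base = 'C:/PerlTest/';                    # hypothetical allowed root
die "Bad directory parameter\n"
    unless defined $directory
        && index($directory, $base) == 0      # must stay under $base
        && $directory !~ /\.\./;              # reject parent-directory escapes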
Your problem is the scope of the directory handle DIR. DIR has global scope, so each recursive call to readDir is using the same DIR; when you closedir(DIR) and return to the caller, the caller does a readdir on a closed directory handle and everything stops. The solution is to use a lexical directory handle:
sub readDir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die $!;
    while (my $file = readdir($dh)) {
        next if ($file eq '.' || $file eq '..');
        my $path = $dir . '/' . $file;
        if (-d $path) {
            readDir($path);
        }
        push(@contents, $path);
    }
    closedir($dh);
}
Also notice that you would be missing a directory separator if (a) it wasn't at the end of $directory or (b) on every recursive call. AFAIK, slashes will be internally converted to backslashes on Windows but you might want to use a path mangling module from CPAN anyway (I only care about Unix systems so I don't have any recommendations).
I'd also recommend that you pass a reference to @contents to readDir rather than leaving it as a global variable; fewer errors and less confusion that way. And don't use parentheses on sub definitions unless you know exactly what they do and what they're for. Some sanity checking and scrubbing on $directory would be a good idea as well.
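A sketch combining these suggestions, passing the array reference explicitly and joining paths with catfile from the core File::Spec (the sub name read_dir_into is illustrative):

use File::Spec::Functions qw(catfile);

sub read_dir_into {
    my ($dir, $contents) = @_;
    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    while (my $file = readdir($dh)) {
        next if $file eq '.' || $file eq '..';
        my $path = catfile($dir, $file);
        read_dir_into($path, $contents) if -d $path;
        push @$contents, $path;
    }
    closedir($dh);
}

my @contents;
read_dir_into($directory, \@contents);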
There are many modules that are available for recursively listing files in a directory.
My favourite is File::Find::Rule
use strict;
use Data::Dumper;
use File::Find::Rule;

my $dir = shift;   # get directory from command line
my @files = File::Find::Rule->in($dir);
print Dumper(\@files);
This sends the list of files into an array (which is what your program was doing).
$VAR1 = [
          'testdir',
          'testdir/file1.txt',
          'testdir/file2.txt',
          'testdir/subdir',
          'testdir/subdir/file3.txt'
        ];
There are loads of other options, like only listing files with particular names. Or you can set it up as an iterator, as described in "How can I use File::Find in Perl?".
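The iterator form looks roughly like this (following the File::Find::Rule documentation; the name pattern is just an example):

my $rule = File::Find::Rule->file->name('*.txt')->start($dir);
while (defined(my $file = $rule->match)) {
    # process one $file at a time instead of building the whole list
}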
If you want to stick to modules that come with Perl Core, have a look at File::Find.

How can I list all files in a directory using Perl?

I usually use something like
my $dir="/path/to/dir";
opendir(DIR, $dir) or die "can't open $dir: $!";
my #files = readdir DIR;
closedir DIR;
or sometimes I use glob, but anyway, I always need to add a line or two to filter out . and .., which is quite annoying.
How do you usually go about this common task?
my @files = grep {!/^\./} readdir DIR;
This will exclude all the dotfiles as well, but that's usually What You Want.
I often use File::Slurp. Benefits include: (1) it dies automatically if the directory does not exist; (2) it excludes . and .. by default. Its behavior is like readdir in that it does not return the full paths.
use File::Slurp qw(read_dir);
my $dir = '/path/to/dir';
my @contents = read_dir($dir);
Another useful module is File::Util, which provides many options when reading a directory. For example:
use File::Util;
my $dir = '/path/to/dir';
my $fu = File::Util->new;
my @contents = $fu->list_dir( $dir, '--with-paths', '--no-fsdots' );
I will normally use the glob method:
for my $file (glob "$dir/*") {
    # do stuff with $file
}
This works fine unless the directory has lots of files in it. In those cases you have to switch back to readdir in a while loop (putting readdir in list context is just as bad as the glob):
opendir my $dh, $dir
    or die "could not open $dir: $!";
while (my $file = readdir $dh) {
    next if $file =~ /^[.]/;
    # do stuff with $file
}
Often though, if I am reading a bunch of files in a directory, I want to read them in a recursive manner. In those cases I use File::Find:
use File::Find;
find sub {
    return if /^[.]/;
    # do stuff with $_ or $File::Find::name
}, $dir;
If some of the dotfiles are important,
my @files = grep !/^\.\.?$/, readdir DIR;
will only exclude . and ..
When I just want the files (as opposed to directories), I use grep with a -f test:
my @files = grep { -f } readdir $dir;
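One caveat: -f tests against the current working directory, and readdir returns bare names. When the handle refers to some other directory, prefix the path (a sketch, with $dir as the directory's path string):

opendir my $dh, $dir or die "can't open $dir: $!";
my @files = grep { -f "$dir/$_" } readdir $dh;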
Thanks Chris and Ether for your recommendations. I used the following to read a listing of all the files (excluding directories) from a directory handle referencing a directory other than my current directory into an array. The array was always missing one file when I didn't use the absolute path in the grep statement.
use File::Slurp;
print "\nWhich folder do you want to replace text? " ;
chomp (my $input = <>);
if ($input eq "") {
print "\nNo folder entered exiting program!!!\n";
exit 0;
}
opendir(my $dh, $input) or die "\nUnable to access directory $input!!!\n";
my #dir = grep { -f "$input\\$_" } readdir $dh;