Count number of files in a folder with Perl

I would like to count the number of files inside a folder with Perl. With the following code I can list them, but how can I count them in Perl?
$dir = "/home/Enric/gfs-0.5.2016061400";
opendir(DIR, "$dir");
@FILES = grep { /gfs./ } readdir(DIR);
foreach $file (@FILES) {
    print $file, "\n";
}
closedir(DIR);

If you just want to count them: once you have a directory open for reading, you can manipulate context so that readdir returns the list of all entries, and then assign that to a scalar. This gives you the length of the list, i.e. the number of elements:
opendir my $dh, $dir or die "Can't open $dir: $!";
my $num_entries = () = readdir($dh);
The construct = () = imposes list context on readdir and assigns (that expression†) to a scalar, which thus gets the number of elements in that list.‡ § See it in perlsecret.
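For instance, a minimal self-contained sketch of the idiom (the directory name is just an example):
use strict;
use warnings;
my $dir = '.';   # example directory
opendir my $dh, $dir or die "Can't open $dir: $!";
# The inner list assignment forces list context on readdir;
# the outer scalar assignment then receives the number of elements
my $num_entries = () = readdir $dh;
closedir $dh;
print "$dir has $num_entries entries (including . and ..)\n";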
There are clearer ways, of course, as below.
If you want to count only certain kinds of files, pass the file list through grep first, like you do. Since grep imposes list context on its input, readdir returns the list of all files; after filtering, grep itself returns a list. When you assign that to a scalar you get the length of that list (the number of elements), i.e. your count. For example, for all regular files and for /gfs./ files:
use warnings;
use strict;
my $dir = '/home/Enric/gfs-0.5.2016061400';
opendir my $dh, $dir or die "Can't open $dir: $!";
my $num_files = grep { -f "$dir/$_" } readdir($dh);
rewinddir($dh); # so that it can read the dir again
my $num_gfs = grep { /gfs./ } readdir($dh);
(This is only an example, with rewinddir so that it works as it stands. To really get two kinds of files from a directory, better to iterate over the entries one at a time and sort them out in the process, or to read all names into an array and then process that.)
Note that readdir returns the bare filename, without any path. So for most of what is normally done with files we need to prepend it with the path (unless you first chdir to that directory). This is what is done in the grep block above, so that the -f file test (-X) gets the correct filename.
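To see this pitfall in isolation, here is a small self-contained sketch (the directory is just an example); without the path, -f tests the name against the current working directory:
use strict;
use warnings;
my $dir = '/etc';   # example directory
opendir my $dh, $dir or die "Can't open $dir: $!";
my $wrong = grep { -f $_ } readdir($dh);          # tests against cwd, usually misses
rewinddir($dh);
my $right = grep { -f "$dir/$_" } readdir($dh);   # tests the actual files
closedir $dh;
print "without path: $wrong, with path: $right\n";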
If you need to use the file list itself, get that into an array and then assign it to a scalar
# Get the file list, then its length
my @files_gfs = map { "$dir/$_" } grep { /gfs./ } readdir($dh);
my $num_gfs = @files_gfs;
Here map builds the full path for each file. If you don't need the path, drop the map { }. Note that there is normally no need for the formal use of scalar on the array to get the count, as in
my $num_gfs = scalar @files_gfs; # no need for "scalar" here!
Instead, simply assign the array to a scalar; it's an idiom (to say the least).
If you are processing files as you read, count as you go
my $cnt_gfs = 0;
while (my $filename = readdir($dh)) {
    $cnt_gfs++ if $filename =~ /gfs./;
    # Process $dir/$filename as needed
}
Here readdir is in scalar context (since its output is assigned to a scalar), so it iterates through the directory entries, returning one at a time.
A few notes
In all code above I use the example from the question, /gfs./ -- but if that is in fact meant to signify a literal period then it should be replaced by /gfs\./
All this talk about how readdir returns the bare filename (no path) would not be needed with glob (or, better, File::Glob), which does return the full path
use File::Glob ':bsd_glob'; # (better with this)
my @files = glob "$dir/*";
This returns the list of files with the path, as $dir/filename.
Not that there is anything wrong with opendir+readdir. Just don't forget the path.
Yet another option is to use libraries, like Path::Tiny with its children method.
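For example, a sketch with Path::Tiny (assuming the module is installed from CPAN); children can take a regex, which is matched against the base name:
use Path::Tiny;
my @entries = path("/home/Enric/gfs-0.5.2016061400")->children(qr/gfs\./);
print scalar(@entries), " matching entries\n";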
† The assignment () = readdir $dh itself returns a value as well, and in this case that whole expression (the assignment) is placed in the scalar context.
‡ The problem is that many facilities in Perl depend on context for their operation and return value, so one cannot always merely assign what would be a list to a scalar and expect to get the length of the list. readdir is a good example, returning a list of all entries in list context but a single entry in scalar context.
§ Here is another trick for it
my $num_entries = @{ [ readdir $dh ] };
Here it is the constructor for an anonymous array (reference), [ ], which imposes the list context on readdir, while the dereferencing @{ } doesn't care about context and simply returns the list of elements of that arrayref. So we can assign that to a scalar, and such a scalar assignment returns the number of elements in that list.

You have the list of files in @FILES. So your question becomes "how do I get the length of an array?" And that's simple: you evaluate the array in scalar context.
my $number_of_files = @FILES;
print $number_of_files;
Or you can eliminate the unnecessary scalar variable by using the scalar() function.
print scalar @FILES;

Try this code for starters (this is on Windows and will include ., .., and folders; those can be filtered out if you want only files):
#!/usr/bin/perl -w
my $dirname = "C:/Perl_Code";
my $filecnt = 0;
opendir (DIR, $dirname) || die "Error while opening dir $dirname: $!\n";
while (my $filename = readdir(DIR)) {
    print "$filename\n";
    $filecnt++;
}
closedir(DIR);
print "Files in $dirname : $filecnt\n";
exit;

I know this isn't in Perl, but if you ever need a quick way, just type this into bash command line:
ls -1 | wc -l
ls -1 gives you a list of the files in the directory, and wc -l gives you the line count. Combined, they'll give you the number of files in your directory.
Alternatively, you can call bash from Perl (although you probably shouldn't), using
system("ls -1 | wc -l");

Related

reading from multiple files using wildcards in perl

Suppose I have 5 files: tmp1.txt, tmp2.txt, tmp3.txt, temp1.txt, temp2.txt.
Now is there any way to open multiple files and read from them using wildcards?
Example,
If I write "t*.txt" then data from each file should be read.
If I write "tm*.txt" then only data from 3 files should be read.
Yes, you can use a glob, assuming these files exist in a local directory, and no other files with similar names are in that directory.
print "Read which files? ";
chomp(my $glob = <STDIN>);
my @files_to_read = glob $glob;
Of course, you can assure that you get no other files by filtering them
my %valid = map { $_ => 1 } qw(tmp1.txt tmp2.txt tmp3.txt temp1.txt temp2.txt);
my @files = grep $valid{$_}, glob $glob;
The first statement creates a hash where the valid file names are keys with a true value; the second runs this check on the elements of the glob list.
You can use glob to find the list of files, and read through them sequentially by assigning the list to @ARGV, which emulates them being passed on the command line.
our @ARGV = glob '/path/to/tm*.txt';
while (<ARGV>) {
    print;
}

How to make recursive calls using Perl, awk or sed?

If a .cpp or .h file has #includes (e.g. #include "ready.h"), I need to make a text file that has these filenames on it. Since ready.h may have its own #includes, the calls have to be made recursively. Not sure how to do this.
The solution of @OneSolitaryNoob will likely work all right, but has an issue: for each recursion, it starts another process, which is quite wasteful. We can use subroutines to do that more efficiently. Assuming that all header files are in the working directory:
sub collect_recursive_includes {
    # Unpack parameters from the subroutine call
    my ($filename, $seen) = @_;
    # Open the file to a lexically scoped filehandle.
    # In your script, you'll probably have to transform $filename to the correct path.
    open my $fh, "<", $filename or do {
        # On failure: print a warning and return, i.e. go on with the next include
        warn "Can't open $filename: $!";
        return;
    };
    # Loop through each line, recursing as needed
    LINE: while (<$fh>) {
        if (/^\s*#include\s+"([^"]+)"/) {
            my $include = $1;
            # you should probably normalize $include before testing if you've seen it
            next LINE if $seen->{$include};   # skip seen includes
            $seen->{$include} = 1;
            collect_recursive_includes($include, $seen);
        }
    }
}
This subroutine remembers which files it has already seen and avoids recursing into them again: each file is visited one time only.
At the top level, you need to provide a hashref as second argument, that will hold all filenames as keys after the sub has run:
my %seen = ( $start_filename => 1 );
collect_recursive_includes($start_filename, \%seen);
my @files = sort keys %seen;
# output @files, e.g.: print "$_\n" for @files;
I hinted in the code comments that you'll probably have to normalize the filenames. E.g., consider a starting filename ./foo/bar/baz.h, which points to qux.h. Then the actual filename we want to recurse to is ./foo/bar/qux.h, not ./qux.h. The Cwd module can help you find your current location, and transform relative to absolute paths. The File::Spec module is a lot more complex, but has good support for platform-independent filename and path manipulation.
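As a sketch of that normalization idea (the helper name resolve_include is illustrative, not part of the code above): resolve each include relative to the directory of the file that mentions it:
use File::Basename qw(dirname);
use File::Spec;
# "./foo/bar/baz.h" + "qux.h" -> "./foo/bar/qux.h"
sub resolve_include {
    my ($including_file, $include) = @_;
    return File::Spec->catfile(dirname($including_file), $include);
}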
In Perl, recursion is straightforward:
sub factorial {
    my $n = shift;
    if ($n <= 1) { return 1; }
    else         { return $n * factorial($n - 1); }
}
print factorial 7; # prints 7 * 6 * 5 * 4 * 3 * 2 * 1, i.e. 5040
Offhand, I can think of only two things that require care:
In Perl, variables are global by default, and therefore static by default. Since you don't want one function-call's variables to trample another's, you need to be sure to localize your variables, e.g. by using my.
There are some limitations with prototypes and recursion. If you want to use prototypes (e.g. sub factorial($) instead of just sub factorial), then you need to provide the prototype before the function definition, so that it can be used within the function body. (Alternatively, you can use & when you call the function recursively; that will prevent the prototype from being applied.)
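A minimal sketch of that second point, predeclaring the prototype so the recursive call inside the body honors it:
sub factorial($);            # predeclare the prototype first...
sub factorial($) {           # ...so the recursive call below can use it
    my $n = shift;
    return $n <= 1 ? 1 : $n * factorial($n - 1);
}
print factorial(7), "\n";    # 5040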
Not totally clear what you want the display to look like, but the basic approach would be a script called follow_includes.pl:
#!/usr/bin/perl -w
while (<>) {
    if (/#include "(\S+)"/) {
        print STDOUT $1 . "\n";
        system("./follow_includes.pl $1");
    }
}
Run it like:
% follow_includes.pl somefile.cpp
And if you want to hide any duplicate includes, run it like:
% follow_includes.pl somefile.cpp | sort -u
Usually you'd want this in some sort of tree-print.
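A sketch of such a tree-print, done in-process with a recursive sub instead of re-running the script (indentation shows depth; a %seen hash guards against include cycles; include names are assumed to resolve from the working directory):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
sub follow {
    my ($file, $depth) = @_;
    open my $fh, '<', $file or return;   # silently skip unreadable files
    while (<$fh>) {
        if (/^\s*#include\s+"([^"]+)"/) {
            print '    ' x $depth, $1, "\n";
            follow($1, $depth + 1) unless $seen{$1}++;
        }
    }
}
follow($ARGV[0], 0);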

How to find a file which exists in different directories under a given path in Perl

I'm looking for a method to look for a file which resides in a few directories in a given path. In other words, those directories will have files with the same filename across them. My script seems to have a hierarchy problem in looking into the correct path to grep the filename for processing. I have a fixed path as input, and the script needs to look into the path and find files from there, but my script seems stuck two tiers up and processes from there, rather than looking into the last directories in the tier (in my case here, it processes on "ln" and "nn" and starts processing the subroutine).
The fixed input path is:
/nfs/disks/version_2.0/
The files that I want to do post-processing on via a subroutine exist under several directories, as below. Basically I wanted to check if file1.abc exists in all the directories temp1, temp2 & temp3 under the ln directory, and likewise whether file2.abc exists in temp1, temp2, temp3 under the nn directory.
The files that I wanted to check, with full paths, will be like this:
/nfs/disks/version_2.0/dir_a/ln/temp1/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp2/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp3/file1.abc
/nfs/disks/version_2.0/dir_a/nn/temp1/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp2/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp3/file2.abc
My script is as below:
#! /usr/bin/perl -w
my $dir = '/nfs/fm/disks/version_2.0/';
opendir(TEMP, $dir) || die $!;
foreach my $file (readdir(TEMP)) {
    next if ($file eq "." || $file eq "..");
    if (-d "$dir/$file") {
        my $d = "$dir/$file";
        print "Directory:- $d\n";
        &getFile($d);
        &compare($file);
    }
}
Note that I put the print "Directory:- $d\n"; there for debugging purposes, and it printed this:
/nfs/disks/version_2.0/dir_a/
/nfs/disks/version_2.0/dir_b/
So I knew it went into the wrong path for processing in the following subroutine.
Can somebody help to point me where is the error in my script? Thanks!
To be clear: the script is supposed to recurse through a directory and look for files with a particular filename? In this case, I think the following code is the problem:
if (-d "$dir/$file") {
    my $d = "$dir/$file";
    print "Directory:- $d\n";
    &getFile($d);
    &compare($file);
}
I'm assuming the &getFile($d) is meant to step into a directory (i.e., the recursive step). This is fine. However, it looks like the &compare($file) is the action that you want to take when the object that you're looking at isn't a directory. Therefore, that code block should look something like this:
if (-d "$dir/$file") {
    &getFile("$dir/$file");   # the recursive step, for directories inside of this one
} elsif (-f "$dir/$file") {
    &compare("$dir/$file");   # the action on files inside of the current directory
}
The general pseudo-code should look like this:
sub myFind {
    my $dir = shift;
    opendir my $dh, $dir or die "Can't open $dir: $!";
    foreach my $file (readdir $dh) {
        next if $file eq "." || $file eq "..";
        my $obj = "$dir/$file";
        if (-d $obj) {
            myFind($obj);
        } elsif (-f $obj) {
            doSomethingWithFile($obj);
        }
    }
    closedir $dh;
}
myFind("/nfs/fm/disks/version_2.0");
As a side note: this script is reinventing the wheel. You only need to write a script that does the processing on an individual file. You could do the rest entirely from the shell:
find /nfs/fm/disks/version_2.0 -type f -name "the-filename-you-want" -exec your_script.pl {} \;
Wow, it's like reliving the 1990s! Perl code has evolved somewhat, and you really need to learn the new stuff. It looks like you learned Perl in version 3.0 or 4.0. Here's some pointers:
Use use warnings; instead of -w on the command line.
Use use strict;. This will require you to predeclare variables using my which will scope them to the local block or the file if they're not in a local block. This helps catch a lot of errors.
Don't put & in front of subroutine names.
Use and, or, and not instead of &&, ||, and !.
Learn about Perl Modules which can save you a lot of time and effort.
When someone says detect duplicates, I immediately think of hashes. If you use a hash based upon your file's name, you can easily see if there are duplicate files.
Of course a hash can only have a single value for each key. Fortunately, in Perl 5.x, that value can be a reference to another data structure.
So, I recommend you use a hash that contains a reference to a list (array in old parlance). You can push each instance of the file to that list.
Using your example, you'd have a data structure that looks like this:
%file_hash = (
    "file1.abc" => [
        "/nfs/disks/version_2.0/dir_a/ln/temp1",
        "/nfs/disks/version_2.0/dir_a/ln/temp2",
        "/nfs/disks/version_2.0/dir_a/ln/temp3",
    ],
    "file2.abc" => [
        "/nfs/disks/version_2.0/dir_a/nn/temp1",
        "/nfs/disks/version_2.0/dir_a/nn/temp2",
        "/nfs/disks/version_2.0/dir_a/nn/temp3",
    ],
);
And, here's a program to do it:
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);   # `say` is like `print`, with a newline appended
use File::Basename;    # imports `dirname` and `basename` commands
use File::Find;        # implements the Unix `find` command
use constant DIR => "/nfs/disks/version_2.0";
# Find all duplicates
my %file_hash;
find(\&wanted, DIR);
# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
    if (scalar(@{$file_hash{$file_name}}) > 1) {
        say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            say "    $dir_name";
        }
    }
}
sub wanted {
    return if not -f $_;
    if (not exists $file_hash{$_}) {
        $file_hash{$_} = [];
    }
    push @{$file_hash{$_}}, $File::Find::dir;
}
Here are a few things about File::Find:
The work takes place in the subroutine wanted.
The $_ is the name of the file, and I can use this to see if this is a file or directory.
$File::Find::name is the full name of the file, including the path.
$File::Find::dir is the name of the directory.
If the array reference doesn't exist, I create it with $file_hash{$_} = [];. This isn't necessary, but I find it comforting, and it can prevent errors. To use $file_hash{$_} as an array, I have to dereference it. I do that by putting @ in front of it, as @{$file_hash{$_}}.
Once all the files are found, I can print out the entire structure. The only thing I do is check to make sure there is more than one member in each array. If there's only a single member, then there are no duplicates.
Response to Grace
Hi David W., thank you very much for your explanation and sample script. Sorry, maybe I'm not really clear in defining my problem statement. I think I can't use a hash in my path finding for the data structure, since there are a few hundred file*.abc files and the number is undetermined, and each file*.abc may have the same filename yet actually differ in content across the directory structures.
For example, the file1.abc residing under "/nfs/disks/version_2.0/dir_a/ln/temp1" does not have the same content as the file1.abc residing under "/nfs/disks/version_2.0/dir_a/ln/temp2" and "/nfs/disks/version_2.0/dir_a/ln/temp3". My intention is to grep the list of files*.abc in each of the directory structures (temp1, temp2 and temp3) and compare the filename list with a master list. Could you help to shed some light on how to solve this? Thanks. – Grace
I'm just printing the file in my sample code, but instead of printing the file, you could open them and process them. After all, you now have the file name and the directory. Here's the heart of my program again. This time, I'm opening the file and looking at the content:
foreach my $file_name (sort keys %file_hash) {
    if (scalar(@{$file_hash{$file_name}}) > 1) {
        #say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            #say "    $dir_name";
            open(my $fh, "<", "$dir_name/$file_name")
                or die qq(Can't open file "$dir_name/$file_name" for reading);
            # Process your file here...
            close $fh;
        }
    }
}
If you are only looking for certain files, you could modify the wanted function to skip over files you don't want. For example, here I am only looking for files which match the file*.txt pattern. Note I use a regular expression of /^file.*\.txt$/ to match the name of the file. As you can see, it's the same as the previous wanted subroutine. The only difference is my test: I'm looking for something that is a file (-f) and has the correct name (file*.txt):
sub wanted {
    return unless -f $_ and /^file.*\.txt$/;
    if (not exists $file_hash{$_}) {
        $file_hash{$_} = [];
    }
    push @{$file_hash{$_}}, $File::Find::dir;
}
If you are looking at the file contents, you can use an MD5 hash to determine whether the file contents match. This reduces a file to a short fixed-length string (32 hexadecimal characters, the way it is used below) which could even be used as a hash key instead of the file name. This way, files that have matching MD5 hashes (and thus matching contents) would be in the same hash list.
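For instance, comparing two files by content might look like this sketch (the paths are placeholders):
use Digest::file qw(digest_file_hex);
my $md5_a = digest_file_hex("/path/to/first.abc",  "MD5");
my $md5_b = digest_file_hex("/path/to/second.abc", "MD5");
print $md5_a eq $md5_b ? "same contents\n" : "different contents\n";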
You talk about a "master list" of files and it seems you have the idea that this master list needs to match the content of the file you're looking for. So, I'm making a slight mod in my program. I am first taking that master list you talked about, and generating MD5 sums for each file. Then I'll look at all the files in that directory, but only take the ones with the matching MD5 hash...
By the way, this has not been tested.
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);   # `say` is like `print`, with a newline appended
use File::Find;        # implements the Unix `find` command
use Digest::file qw(digest_file_hex);
use constant DIR             => "/nfs/disks/version_2.0";
use constant MASTER_LIST_DIR => "/some/directory";
# First, I'm going through the MASTER_LIST_DIR directory
# and finding all of the master list files. I'm going to take
# the MD5 hash of those files, and store them in a Perl hash
# that's keyed by the name of the file. Thus, when I find a
# file with a matching name, I can compare the MD5 of that file
# and the master file. If they match, the files are the same. If
# not, they're different.
# In this example, I'm inlining the function I use to find the files
# instead of making it a separate function.
my %master_hash;
find(
    sub {
        $master_hash{$_} = digest_file_hex($_, "MD5") if -f;
    },
    MASTER_LIST_DIR
);
# Now that I have the MD5 of all the master files, I'm going to search my
# DIR directory for the files that have the same MD5 hash as the
# master list files did. If they do have the same MD5 hash, I'll
# print out their names as before.
my %file_hash;
find(\&wanted, DIR);
# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
    if (scalar(@{$file_hash{$file_name}}) > 1) {
        say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            say "    $dir_name";
        }
    }
}
# The wanted function has been modified since the last example.
# Here, I'm only going to put files in the %file_hash if their
# MD5 hash matches that of the master file with the same name.
sub wanted {
    if (-f $_ and exists $master_hash{$_}
            and $master_hash{$_} eq digest_file_hex($_, "MD5")) {
        $file_hash{$_} //= [];   # Using TLP's syntax hint
        push @{$file_hash{$_}}, $File::Find::dir;
    }
}

Why do I get a number instead of a list of filenames from readdir?

I have a piece of Perl code for searching a directory and displaying the contents of that directory, if a match is found. The code is given below:
$test_case_directory = "/home/sait11/Desktop/SaLT/Data_Base/Test_Case";
$xml_file_name = "sample.xml";
$file_search_return = file_search($xml_file_name);
print "file search return::$file_search_return\n";
sub file_search
{
    opendir (DIR, $test_case_directory) or die "\n\tFailed to open directory that contains the test case xml files\n\n";
    print "xml_file_name in sub routines:: $xml_file_name\n";
    $dirs_found = grep { /^$xml_file_name/i } readdir DIR;
    print "Files in the directory are dirs_found :: $dirs_found\n";
    closedir (DIR);
    return $dirs_found;
}
Output is,
xml_file_name in sub routines:: sample.xml
Files in the directory are dirs_found :: 1
file search return::1
It is not returning the file name found. Instead it returns the number 1 always.
I don't know why it is not returning the file name called sample.xml present in the directory.
perldoc grep says:
In scalar context, returns the number of times the expression was true.
And that's exactly what you are doing. So you found 1 file and that result is assigned to $dirs_found variable.
($dirs_found) = grep { /^$xml_file_name/i } readdir DIR; #capture it
The problem was that you were evaluating grep in scalar context; changing it to list context will give you the desired result.
In scalar context, grep returns the number of times the expression was true.
In list context, it returns the elements for which the expression was true.
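A tiny self-contained demonstration of the difference:
my @words = qw(apple banana avocado);
my $count   = grep { /^a/ } @words;   # scalar context: 2
my @matches = grep { /^a/ } @words;   # list context: (apple, avocado)
print "$count\n@matches\n";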
Why are you opening a directory and looking for a particular filename? If you want to see if the file is there, just test for it directly:
use File::Spec::Functions;
my $file = catfile( $test_case_directory, $xml_file_name );
if( -e $file ) { ... }
When you run into these sorts of problems, though, check the result at each step to see what you are getting. Your first step would be to decompose the problem statement:
my @files = readdir DIR;
print "Files are [@files]\n";
my $filtered = grep { ... } @files;
print "Files are [$filtered]\n";
Once you do that you see that the problem is grep. Once you know that the problem is grep, you read its documentation, notice that you are using it wrong, and you're done sooner than it would take to post a question on StackOverflow. :)
you should say @dirs_found, not $dirs_found

How can I loop through files in a directory in Perl? [duplicate]

Possible Duplicate:
How can I list all of the files in a directory with Perl?
I want to loop through a few hundred files that are all contained in the same directory. How would I do this in Perl?
#!/usr/bin/perl -w
my @files = <*>;
foreach my $file (@files) {
    print $file . "\n";
}
Where
@files = <*>;
can be
@files = </var/www/htdocs/*>;
@files = </var/www/htdocs/*.html>;
etc.
Enjoy.
opendir(DH, "directory");
my @files = readdir(DH);
closedir(DH);
foreach my $file (@files)
{
    # skip . and ..
    next if ($file =~ /^\.$/);
    next if ($file =~ /^\.\.$/);
    # $file is the file used on this iteration of the loop
}
You can use readdir or glob.
Or, you can use a module such as Path::Class:
Ordinarily children() will not include the self and parent entries . and .. (or their equivalents on non-Unix systems), because that's like I'm-my-own-grandpa business. If you do want all directory entries including these special ones, pass a true value for the all parameter:
@c = $dir->children();          # Just the children
@c = $dir->children(all => 1);  # All entries
In addition, there's a no_hidden parameter that will exclude all normally "hidden" entries - on Unix this means excluding all entries that begin with a dot (.):
@c = $dir->children(no_hidden => 1); # Just normally-visible entries
Or, Path::Tiny:
@paths = path("/tmp")->children;
@paths = path("/tmp")->children( qr/\.txt$/ );
Returns a list of Path::Tiny objects for all files and directories within a directory. Excludes "." and ".." automatically.
If an optional qr// argument is provided, it only returns objects for child names that match the given regular expression. Only the base name is used for matching:
@paths = path("/tmp")->children( qr/^foo/ );
# matches children like the glob foo*
Getting the list of directory entries into an array wastes some memory (as opposed to getting one file name at a time), but with only a few hundred files, this is unlikely to be an issue.
Path::Class is portable to operating systems other than *nix and Windows. On the other hand, AFAIK, its instances use more memory than do Path::Tiny instances.
If memory is an issue, you are better off using readdir in a while loop.
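A sketch of that memory-friendly pattern, reading one entry at a time (the directory is just an example):
opendir my $dh, "/tmp" or die "Can't open /tmp: $!";
while (defined(my $entry = readdir $dh)) {
    next if $entry eq '.' or $entry eq '..';
    print "$entry\n";
}
closedir $dh;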