How do I skip .svn folders with File::Find? - perl

I am writing a Perl script that is iterating over file names in a directory and its sub-directories, using the following method:
find(\&getFile, $mainDir);
sub getFile {
my $file_dir = $File::Find::name;
return unless -f $file_dir; # return if its a folder
}
The file structure looks like this:
main/classes/pages/filename.php
However because of version control each folder and subfolder has a hidden .svn directory that has duplicates of every file inside with a .svn-base suffix:
main/.svn/classes/pages/filename.php.svn-base
I was wondering if there is a return statement like the one I had previously using:
return if ($file_dir eq "something here");
to skip all the .svn folders so that filenames with the .svn-base suffix are not found. I have been fiddling around with regex and searching for hours without much luck. I have only been using Perl for a couple of days.

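You can use
return if ($file_dir =~ /\.svn/);
The =~ operator tests a string against a pattern, so this returns early for any path that contains .svn. (!~ is the negation: $file_dir !~ /\.svn/ is equivalent to !($file_dir =~ /\.svn/), which would skip everything except the .svn entries.)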
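The regex test above still makes find() walk every .svn directory and reject its entries one by one. A minimal sketch of an alternative, assuming you would rather not descend into those directories at all, sets $File::Find::prune inside the callback. The temporary tree and the name get_file are made up for illustration; they only mimic the layout in the question.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Build a small throwaway tree that mimics the layout in the question.
my $main_dir = tempdir(CLEANUP => 1);
make_path("$main_dir/classes/pages", "$main_dir/.svn/classes/pages");
for my $path ("classes/pages/filename.php",
              ".svn/classes/pages/filename.php.svn-base") {
    open my $fh, '>', "$main_dir/$path" or die "Cannot create $path: $!";
    close $fh;
}

my @found;

sub get_file {
    # $_ holds the base name of the current entry. Setting prune on a
    # directory tells find() not to descend into it.
    if ($_ eq '.svn' and -d $_) {
        $File::Find::prune = 1;
        return;
    }
    return unless -f $_;    # skip directories and other non-files
    push @found, $File::Find::name;
}

find(\&get_file, $main_dir);
print "$_\n" for @found;
```

Pruning avoids visiting the duplicated files at all, which matters more as the working copy grows.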

Related

How do you delete a file that does not have a file extension?

I'm using Strawberry Perl 5.32.1.1 64 bit version. Here is what I have:
unlink glob "$dir/*.*";
I've also tried the following:
my @file_list;
opendir(my $dh, $dir) || die "can't opendir $dir: $!";
while (readdir $dh){
next unless -f;
push @file_list, $_;
}
closedir $dh;
unlink @file_list;
The result of this is that all files with an extension are deleted,
but those files without an extension remain undeleted.
You don't really explain what the problem is, so this is just a guess.
You're using readdir() to get a list of files to delete and then passing that list to unlink(). The problem here is that the filenames that you get back from readdir() do not include the directory name that you originally passed to opendir(). Unless $dir is the current directory, both the -f test and unlink() end up looking in the wrong place. So you need to test and populate your array with the full path:
next unless -f "$dir/$_";
push @file_list, "$dir/$_";
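Put together, the readdir() approach might look like the sketch below. The temporary directory and the file names in it are assumptions made only so the example can run on its own; the loop variable $entry replaces the implicit $_ for readability.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory: one file with an extension, one without, one subdirectory.
my $dir = tempdir(CLEANUP => 1);
for my $name ('notes.txt', 'README') {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    close $fh;
}
mkdir "$dir/subdir" or die "Cannot create subdir: $!";

my @file_list;
opendir(my $dh, $dir) or die "can't opendir $dir: $!";
while (my $entry = readdir $dh) {
    my $path = "$dir/$entry";   # readdir returns bare names, so add the directory
    next unless -f $path;       # skips ., .. and subdirectories
    push @file_list, $path;
}
closedir $dh;

# unlink returns the number of files it removed.
my $deleted = unlink @file_list;
print "Deleted $deleted file(s)\n";
```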
In a comment you say:
I'm testing unlink glob "$dir/*.*"; but the files without file extensions are not deleted.
Well, if you think about it, you're only asking for files with an extension. If the pattern you pass to glob() contains *.*, then it will only return files that match that pattern (i.e. files with a dot and more text after the dot).
The solution would seem to be to simplify the pattern that you are passing to glob() so it's just *.
unlink glob "$dir/*";
That will, of course, try to delete directories as well, so you might want this instead:
unlink grep { -f } glob "$dir/*";
You expect glob to perform DOS-like globbing, but the glob function provides csh-like globbing.
glob("*.*") matches all files that contain a ., ignoring files with a leading dot.
glob("*") matches all files, ignoring files with a leading dot.
glob("* .*") matches all files.
Note that every kind of file is matched. This includes directories. In particular, note that . and .. are matched by .*.
If you want DOS-style globs, you can use File::DosGlob.
Yes, you can use glob, but you need to do a little more work than that. grep can help select the files you are looking for.
* grabs all entries (which do not begin with a dot), -f selects only files, !/\./ removes files with a dot in the name:
unlink grep {!/\./} grep {-f} glob "$dir/*";

How to get the list of files that are not in another directory in perl

I have to fix a Perl script, which does the following:
# Get the list of files in the staging directory; skip all beginning with '.'
opendir ERR_STAGING_DIR, "$ERR_STAGING" or die "$PID: Cannot open directory $ERR_STAGING";
@allfiles = grep !/^$ERR_STAGING\/\./, map "$ERR_STAGING/$_", readdir(ERR_STAGING_DIR);
closedir(ERR_STAGING_DIR);
I have two directories: STAGING and ERROR (ERR_STAGING_DIR). STAGING contains files like ABC_201608100000.fin, and ERR_STAGING_DIR contains files like ABC_201608100000.fin.bc_lerr.xml. The Perl script runs as a daemon that constantly watches the ERR_STAGING_DIR directory and processes the error files.
However, my requirement is not to process an error file if its counterpart (e.g. ABC_201608100000.fin) still exists in STAGING.
Question:
Is there a way to filter the @allfiles array and select only the files that don't exist in the STAGING directory?
WHAT I HAVE TRIED:
I tried to skip the files that exist in the STAGING dir programmatically, but it is not working.
# Move file from the staging directory to the processing directory.
@splitf = split(/.bc_lerr.xml/,basename($file));
my $finFile = $STAGING . "/" . $splitf[0];
print LOG "$PID: Staging File $finFile \n";
foreach $file(@sorted_allfiles) {
if ( -e $finFile )
{
print LOG "$PID: Staging File still exist.. moving to next $finFile \n";
next;
}
# DO THE PROCESSING.
Questions of timing aside, I assume that a snapshot of files may be processed without worrying about new files showing up. I take it that @allfiles has all file names from the ERROR directory.
Remove a file name from the front of the array at each iteration. Check for the corresponding file in STAGING and if it's not there process away, otherwise push it on the back of the array and skip.
while (@allfiles)
{
my $errfile = shift @allfiles;
my ($file) = $errfile =~ /(.*)\.bc_lerr\.xml$/;
if (-e "$STAGING/$file")
{
push @allfiles, $errfile;
sleep 1; # more time for existing files to clear
next;
}
# process the error file
}
If the processing is faster than what it takes for existing files in STAGING to go away, we would exhaust all processable files and then continuously run file tests. There is no reason for such abuse of resources, thus the sleep, to give STAGING files some more time to go away. Note that if just one file in STAGING fails to go away this loop will keep checking it and you want to add some guard against that.
Another way would be to process the error files with a foreach, and add those that should be skipped to a separate array. That can then be attempted separately, perhaps with a suitable wait.
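The foreach variant could be sketched as follows. The two temporary directories, the helper touch, and the arrays @processed and @deferred are hypothetical stand-ins; the real processing step is replaced by a push.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway stand-ins for the STAGING and error directories.
my $STAGING     = tempdir(CLEANUP => 1);
my $ERR_STAGING = tempdir(CLEANUP => 1);

# Hypothetical helper: create an empty file.
sub touch {
    my ($path) = @_;
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

my @allfiles = map { "$ERR_STAGING/$_.fin.bc_lerr.xml" }
               qw(ABC_201608100000 ABD_201608100000);
touch($_) for @allfiles;
touch("$STAGING/ABC_201608100000.fin");   # only ABC still has a staged counterpart

my (@processed, @deferred);
foreach my $errfile (@allfiles) {
    # Strip the directory and the error suffix to get the staged file name.
    my ($base) = $errfile =~ m{([^/]+)\.bc_lerr\.xml$};
    if (-e "$STAGING/$base") {
        push @deferred, $errfile;   # counterpart still staged, retry later
        next;
    }
    push @processed, $errfile;      # stand-in for the real processing
}

print "processed: @processed\n";
print "deferred:  @deferred\n";
```

The @deferred list can then be retried in a later pass, with whatever wait suits the process.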
How suitable this is depends on details of the whole process. For how long do STAGING files hang around, and is this typical or exceptional? How often do new files show up? How many files are there typically?
If you only wish to filter out the error files that have their counterparts in STAGING:
my @errfiles_nostaging = grep {
my ($file) = $_ =~ /(.*)\.bc_lerr\.xml$/;
not -e "$STAGING/$file";
} @allfiles;
The output array contains the files from @allfiles which have no corresponding file in $STAGING and can be readily processed. This would be suitable if the error files are processed very fast in comparison to how long the $STAGING files stay around.
The filter can be written in one statement as well. For example
grep { not -e "$STAGING/" . s/\.bc_lerr\.xml$//r } # or
grep { not -e "$STAGING/" . (split /\.bc_lerr\.xml$/, $_)[0] }
The first example uses the non-destructive /r modifier, available since 5.14. It changes the substitution to return the changed string and not change the original one. See it in perlrequick and in perlop.
This is an extremely brute-force example, but if you have the contents of the staging directory in an array, you can check against that array when you read the contents of the error directory.
I've made some GIGANTIC assumptions about the relationship of the filenames -- basically that the stage directory contains the file truncated, specifically the way you listed in your example. If that's universally the case, then a substring would work even faster, but this example is a little more scalable, in the event your example was simplified to illustrate the issue.
use strict;
my @error = qw(
ABC_201608100000.fin.bc_lerr.xml
ABD_201608100000.fin.bc_lerr.xml
ABE_201608100000.fin.bc_lerr.xml
ABF_201608100000.fin.bc_lerr.xml
);
my @staging = qw(
ABC_201608100000.fin
ABD_201608100000.fin
);
foreach my $error (@error) {
my $stage = $error;
$stage =~ s/\.bc_lerr\.xml$//;
unless (grep { /\Q$stage\E/ } @staging) { # \Q...\E keeps the dots literal
## process the file here
}
}
The grep in this example is O(n) for every error file, so if either list is really large you would want to load the staging names into a hash first; each lookup is then O(1).
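A minimal sketch of the hash-based variant, reusing the sample data above. It assumes an exact name match is what you want rather than a substring match; the names %in_staging and @to_process are made up for illustration.

```perl
use strict;
use warnings;

my @error = qw(
    ABC_201608100000.fin.bc_lerr.xml
    ABD_201608100000.fin.bc_lerr.xml
    ABE_201608100000.fin.bc_lerr.xml
    ABF_201608100000.fin.bc_lerr.xml
);
my @staging = qw(
    ABC_201608100000.fin
    ABD_201608100000.fin
);

# Build the lookup table once; each later check is a single hash probe.
my %in_staging = map { $_ => 1 } @staging;

my @to_process;
foreach my $error (@error) {
    (my $stage = $error) =~ s/\.bc_lerr\.xml$//;
    next if $in_staging{$stage};    # counterpart still in staging, skip
    push @to_process, $error;
}

print "$_\n" for @to_process;
```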

Finding matching files with Perl module File::Find::Rule

I'm trying to use the File::Find::Rule module to find a specific file (output.txt) in a subdirectory, and if it is not there then search in the root directory to see if it exists. The issue is that multiple output.txt files exist, so we should only be looking for others if the original is not found.
Basically the directory structure looks like this
top
    level-1-a
        level-2-a
            output.txt
        level-2-b
            output.txt
    level-1-b
        level-2-a
            output.txt
        level-2-b
            output.txt
Right now I have:
@files = File::Find::Rule->file()->name($output)->in($sub_dir);
if ( ! @files ) {
@files = File::Find::Rule->file()->name($output)->in($root_dir);
}
Where the behavior is, we look for output.txt in \top\level-1-a first, where it finds the matches in level-2-a and level-2-b. If there are no matching files under level-1-a, we will then make the same call on \top to find the matches show up in the level-1-b directories. Is there a cleaner way to check with that "if-else" idea?
I would check the subdirectories one at a time. Here's an example. The last breaks out of the for loop as soon as a subdirectory has been found that contains the required file:
my @files;
for my $subdir ( 'level-1-a', 'level-1-b' ) {
last if @files = File::Find::Rule->file()->name($output)->in("/top/$subdir");
}

Perl regex search of file name and extensions against a predefined array

I want to filter out some files from a directory. I am able to grab the files and their extensions recursively, but now what I want to do is to match the file extension and file name with a predefined array of extensions and file names using wildcard search as we use to do in sql.
my @ignore_exts = qw( .vmdk .iso .7z .bundle .wim .hd .vhd .evtx .manifest .lib .mst );
I want to filter out the files that have extensions like the ones above.
E.g. the file name is abc.1.149_1041.mst, and since the extension .mst is present in @ignore_exts, I want this file filtered out. The extension I am getting is '.1.149_1041.mst'. In SQL I would do something like select * from <some-table> where extension like '%.mst'. I want to do the same thing in Perl.
This is what I am using for grabbing the extension.
my $ext = (fileparse($filepath, '\..*?')) [2];
In order to pull a file extension off a filename this should work:
$filepath =~ /^(.*)\.([^.]+)$/;
$fileName = $1;
$extension = $2;
This might do the trick for you.
Input: a.b.c.text
$1 will be a.b.c
$2 will be text
Basically this will take everything from the start of the line until the last period and group that in the 1st group, and then everything from the last period to the end of the line as group 2
You can see a sample here: http://regex101.com/r/vX3dK1
As for checking whether the extension exists in the array read here: (How can I check if a Perl array contains a particular value?)
if (grep (/^\Q$extension\E$/, @array)) {
print "Extension Found\n";
}
Just turn your list of extensions into a regular expression, and then test against the $filepath.
my @ignore_exts = qw( .vmdk .iso .7z .bundle .wim .hd .vhd .evtx .manifest .lib .mst );
my $ignore_exts_re = '(' . join('|', map quotemeta, @ignore_exts) . ')$';
And then later to compare
if ($filepath =~ $ignore_exts_re) {
print "Ignore $filepath because it ends in $1\n";
next;
}
Reading the contents of directories in Perl

I am trying to accomplish two things with a Perl script. I have a directory whose first level contains a directory per user, and each user directory contains some folders with text files in them. I am trying to write a Perl script that:
1. Lists the directories for each user
2. Gets the total number of .txt files
For the second objective I have this code
my @emails = glob "$dir/*.txt";
for (0..$#emails){
$emails[$_] =~ s/\.txt$//;
}
$emails=@emails;
but $emails is returning 0. Any insight?
Typically, using glob is not a very good idea when it comes to processing files in directories and possible subdirectories, because a pattern like "$dir/*.txt" only looks at the top level of $dir.
A much better way is to use the File::Find module, like this:
use File::Find;
my @emails;
File::Find::find(
{
# this will be called for each file under $dir:
wanted => sub {
my $file = $File::Find::name;
return unless -f $file and $file =~ /\.txt$/;
# this file looks like email, remember it:
push @emails, $file;
}
},
$dir
);
print "Found " . scalar @emails . " .txt files (emails) in '$dir'\n";