Ignoring Hidden Files with File::Find - perl

I am using file::find to walk a directory structure and print it out, but I am having trouble excluding hidden files. Here is what I have so far:
find(\&todo, $start_dir);
sub todo
{
if ($_ =~ /^./)
{
print "hidden file $_\n";
}
else
{
if (-f $_) #check for file
{
file;
}
elsif (-d $_) #check for directory
{
directory($File::Find::dir);
}
else
{
print "ERROR: $_\n";
}
}
}
If I remove the if ($_ =~ /^./) check, the files and directories work fine, but adding this prints everything as a hidden file. As you can see, I only need this to work on unix.
Can anyone point me in the right direction?
EDIT: I forgot a backslash in front of the . - should be if ($_ =~ /^./), but does find() have a default way of ignoring hidden files/directories?
Thanks!

. in a regex matches any character; use \. to match a literal .. And you probably should learn about regexes.

See geekosaur for an explanation of your problem. In a simple case like this, substr might be a better call than a regular expression:
if(substr($_, 0, 1) eq '.') {
Regular expressions are a great tool but they shouldn't be the only thing in your toolbox.

The /^./ is a regular expression. The period means any single character, so what you're saying is match any string that starts with any character, and that pretty much matches all file names.
You need to put a backslash before the period, or use the \Q and '\E'. The \Q disables matching on metacharacters which means it basically removes all magic and makes everything a plain ol' string. In this circumstance, the backslash would be better, but you can imagine trying to match something a bit more complex, and the \Q and \E would work better:
Either of these will work:
if ($_ =~ /^\./)
{
print "hidden file $_\n";
}
if ($_ =~ /^\Q.\E/)
{
print "hidden file $_\n";
}

If you are on Windows the perl module Win32::File will tell you whether a file is hidden or not. Win32::File is installed by default in ActivePerl.
Unfortunately Win32::File doesn't come with any examples, (I wish cpan added a comment feature to every page like the PHP site. Sure there's Annocpan, but that extra click means almost no one ever contributes to it), but this thread will help http://www.perlmonks.org/?node_id=194011
There's also the nicer object oriented Win32::File::Object but it only works with files unless you apply this patch :( https://rt.cpan.org/Public/Bug/Display.html?id=60735
Results seem weird though as lots of files you don't except to be hidden or system turn out to be both. I gave up in the end and just hacked this to ignore certain folders:
if( $_ =~ /(RECYCLER)|(System Volume Information)/ ) {
$File::Find::prune = 1;
}

Related

what does print for mean in Perl?

I need to edit some Perl script and I'm new to this language.
I encountered the following statement:
print for (#$result);
I know that $result is a reference to an array and #$result returns the whole array.
But what does print for mean?
Thank you in advance.
In Perl, there's such a thing as an implicit variable. You may have seen it already as $_. There's a lot of built in functions in perl that will work on $_ by default.
$_ is set in a variety of places, such as loops. So you can do:
while ( <$filehandle> ) {
chomp;
tr/A-Z/a-z/;
s/oldword/newword/;
print;
}
Each of these lines is using $_ and modifying it as it goes. Your for loop is doing the same - each iteration of the loop sets $_ to the current value and print is then doing that by default.
I would point out though - whilst useful and clever, it's also a really good way to make confusing and inscrutable code. In nested loops, for example, it can be quite unclear what's actually going on with $_.
So I'd typically:
avoid writing it explicitly - if you need to do that, you should consider actually naming your variable properly.
only use it in places where it makes it clearer what's going on. As a rule of thumb - if you use it more than twice, you should probably use a named variable instead.
I find it particularly useful if iterating on a file handle. E.g.:
while ( <$filehandle> ) {
next unless m/keyword/; #skips any line without 'keyword' in it.
my ( $wiggle, $wobble, $fronk ) = split ( /:/ ); #split $_ into 3 variables on ':'
print $wobble, "\n";
}
It would be redundant to assign a variable name to capture a line from <$filehandle>, only to immediately discard it - thus instead we use split which by default uses $_ to extract 3 values.
If it's hard to figure out what's going on, then one of the more useful ways is to use perl -MO=Deparse which'll re-print the 'parsed' version of the script. So in the example you give:
foreach $_ (#$result) {
print $_;
}
It is equivalent to for (#$result) { print; }, which is equivalent to for (#$result) { print $_; }. $_ refers to the current element.

Perl script to rename files with spaces in name, pushd/popd equivalent?

My Linux system mounts some Samba shares, and some files are deposited by Windows users. The names of these files sometimes contain spaces and other undesirable characters. Changing these characters to hyphens - seems like a reasonable solution. Nothing else needs to be changed to handle these cleaned file names.
A couple of questions,
What other characters besides spaces, parenthesis should be translated?
What other file attributes (besides file type (file/dir) and permissions) should be checked?
Does Perl offer a pushd/popd equivalent, or is chdir a reasonable solution to traversing the directory tree?
This is my Perl program
#!/bin/env perl
use strict;
use warnings;
use File::Copy;
#rename files, map characters (not allowed) to allowed characters
#map [\s\(\)] to "-"
my $verbose = 2;
my $pat = "[\\s\\(\\)]";
sub clean {
my ($name) = #_;
my $name2 = $name;
$name2 =~ s/$pat/\-/g;
#skip when unchanged, collision
return $name if (($name eq $name2) || -e $name2); #skip collisions
print "r: $name\n" if ($verbose > 2);
rename($name, $name2);
$name2;
}
sub pDir {
my ($obj) = #_;
return if (!-d $obj);
return if (!opendir(DIR, $obj));
print "p: $obj/\n" if ($verbose > 2);
chdir($obj);
foreach my $dir (readdir DIR) {
next if ($dir =~ /^\.\.?$/); #skip ./, ../
pDir(clean($dir));
}
close(DIR);
chdir("..");
}
sub main {
foreach my $argv (#ARGV) {
print "$argv/\n" if ($verbose > 3);
$argv = clean($argv);
if (-d $argv) { pDir($argv); }
}
}
&main();
These posts are related, but don't really address my questions,
Use quotes: How to handle filenames with spaces? (using other scripts, prefer removing need for quotes)
File::Find perl script to recursively list all filename in directory (yes, but I have other reasons)
URL escaping: Modifying a Perl script which has an error handling spaces in files (not urls)
Quotemeta: How can I safely pass a filename with spaces to an external command in Perl? (not urls)
Here's a different way to think about the problem:
Perl has a built-in rename function. You should use it.
Create a data structure mapping old names to new names. Having this data will allow various sanity checks: for example, you don't want cleaned names stomping over existing files.
Since you aren't processing the directories recursively, you can use glob to good advantage. No need to go through the hassles of opening directories, reading them, filtering out dot-dirs, etc.
Don't invoke subroutines with a leading ampersand (search this issue for more details).
Many Unix-like systems include a Perl-based rename command for quick-and-dirty renaming jobs. It's good to know about even if you don't use it for your current project.
Here's a rough outline:
use strict;
use warnings;
sub main {
# Map the input arguments to oldname-newname pairs.
my #renamings =
map { [$_, cleaned($_)] }
map { -f $_ ? $_ : glob("$_/*") }
#_;
# Sanity checks first.
# - New names should be unique.
# - New should not already exist.
# - ...
# Then rename.
for my $rnm (#renamings){
my ($old, $new) = #$rnm;
rename($old, $new) unless $new eq $old;
}
}
sub cleaned {
# Allowed characters: word characters, hyphens, periods, slashes.
# Adjust as needed.
my $f = shift;
$f =~ s/[^\w\-\.\/]/-/g;
return $f;
}
main(#ARGV);
Don't blame Windows for your problems. Linux is much more lax, and the only character it prohibits from its file names is NUL.
It isn't clear exactly what you are asking. Did you publish your code for a critique, or are you having problems with it?
As for the specific questions you asked,
What other characters besides spaces, parenthesis should be translated?
Windows allows any character in its filenames except for control characters from 0x00 to 0x1F and any of < > \ / * ? |
DEL at 0x7F is fine.
Within the ASCII set, that leaves ! # $ % & ' ( ) + , - . : ; = # [ ] ^ _ ` { } ~
The set of characters you need to translate depends on your reason for doing this. You may want to start by excluding non-ASCII characters, so your code should read something like
$name2 =~ tr/\x21-\x7E/-/c
which will change all non-ASCII characters, spaces and DEL to hyphens. Then you need to go ahead and fix all the ASCII characters that you consider undersirable.
What other file attributes (besides file type (file/dir) and permissions) should be checked?
The answer to this has to be according to your purpose. If you are referring only to whether renaming a file or directory as required is possible, then I suggest that you just let rename itself tell you whether it succeeded. It will return a false value if the operation failed, and the reason will be in $!.
Does Perl offer a pushd/popd equivalent, or is chdir a reasonable solution to traversing the directory tree?
If you want to work with that idiom, then you should take a look at File::pushd, which allows you to temporarily chdir to a new location. A popd is done implicitly at the end of the enclosing block.
I hope this helps. If you have any other specific questions then please make them known by editing your original post.

perl regular expressions with date

Trying to validate the date of the format (YYYY_MM_DD). With the test variable set as 2012_4_123 it's printing "valid format" after script is run. It should give an "invalid error" message because in the regular expression the day part is checked to be atleast 1 digit and not more than 2 digits. Not sure how it's printing "valid format" as the output message.
my $test="2012_4_123";
if ($test !~ m/^(\d{4})_(\d{1,2})_(\d{1,2})/)
{
print "invalid format\n";
}
else
{
print "valid format\n";
}
-if ($test !~ m/^(\d{4})_(\d{1,2})_(\d{1,2})/)
+if ($test !~ m/^(\d{4})_(\d{1,2})_(\d{1,2})$/)
you're missing a $ at the end. it's matching the string "2012_4_12" because you didn't tell it to match the end of the string too. Your regex should be this.
$test !~ m/^(\d{4})_(\d{1,2})_(\d{1,2})$/
Simply adding $ solves the initial problem of allowing for more than two digits for the day, but introduces a more subtle bug: dates will now validate despite having a newline at the end. This may not matter depending on your application, but it can be avoided by using the regex in the following example:
use strict;
use warnings;
my #tests = (
'2012_4_123',
'2012_11_22',
"2012_11_22\n",
);
use Data::Dumper;
print Dumper \#tests;
foreach my $test (#tests) {
if ( $test !~ m/\A(\d{4})_(\d{1,2})_(\d{1,2})\z/smx )
{
print "invalid format\n";
}
else
{
print "valid format\n";
}
}
Note: /smx is recommended by Perl Best Practices and I write my regexes with it unless there's a specific need not to have it, but it may trip you up if you're not used to it.
/s and /m will allow you to process multiline strings more easily; /s because . will then match newlines and /m to allow you to use ^ and $ to match the start and end of a line respectively, and \A and \z will then match the start and end of the entire string.
/x is simply to allow whitespace and comments within a regex, though you'll need to escape whitespace if you're actually trying to match it.
In this case, it's using \z instead of $ that makes the difference irrespective of the use of /smx.
Also, it mightn't be a bad idea to look at a module to perform date validation rather than just date format validation (again, depending on what you're using this for). See this discussion on perlmonks.

Can I skip a whole file with the <> operator?

The following Perl code has an obvious inefficiency;
while (<>)
{
if ($ARGV =~ /\d+\.\d+\.\d+/) {next;}
... or do something useful
}
The code will step through every line of the file we don't want.
On the size of files this particular script is running on this is unlikely to make a noticeable difference, but for the sake of learning; How can I junk the whole file <> is working and move to the next one?
The purpose of this is because the sever this script runs on stores old versions of apps with the version number in the file name, I'm only interested in the current version.
Paul Roub's solution is best if you can filter #ARGV before you start reading any files.
If you have to skip a file after you've begun iterating it,
while (<>) {
if (/# Skip the rest of this file/) {
close ARGV;
next;
}
print "$ARGV: $_";
}
grep ARGV first.
#ARGV = grep { $_ !~ /\d+\.\d+\.\d+/ } #ARGV;
while (<>)
{
# do something with the other files
}
Paul Roub's answer works for more information, see The IO operators section of perlop man page. The pattern of using grep is mentioned as well as a few other things related to <>.
Take note of the mention of ARGV::readonly regarding things like:
perl dangerous.pl 'rm -rfv *|'

Should I manually set Perl's #ARGV so I can use <> to open, scan, and close files?

I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern, my implementation looks something like this -
sub searchDir($) {
my $dirN = shift;
my #dirList = glob("$dirN/*");
for(#dirList) {
push #fileList, $_ if -f $_;
}
#ARGV = #fileList;
while(<>) {
## Search for pattern
}
}
My question is - is it alright to manually load the #ARGV array as has been done above and use the <> operator to scan in individual lines or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating #ARGV - that's definitely working code, Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (#files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my #lines = <FILE>;
close FILE or die $!;
foreach my $line (#file) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Putting subroutines in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability is much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate #ARGV:
#!/usr/bin/perl -w
#ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from #ARGV; here, you are adding items, but it still changes the original value of #ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
The big issue with tweaking #ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes. (reading each file in #ARGV or processing STDIN if #ARGV is empty, testing for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie-hack-jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just #ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, ok, check perlvar and find $. gives the linenumber in the last handle accessed--great. But there is an issue lurking here--$. is not reset between #ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my #fileList = grep { -f $_ } glob("$dirN/*");
return unless #fileList; # Don't want to process STDIN.
local #ARGV;
#ARGV = #fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it so it, may have typos, and probably won't work without a bit of tweaking
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}