I have Perl code as below. I'm trying to pick files that match a pattern.
opendir ERR_STAGING_DIR, "$ERR_STAGING" or die "$PID: Cannot open directory $ERR_STAGING";
@allfiles = grep !/^$ERR_STAGING\/\./, map "$ERR_STAGING/$_", readdir(ERR_STAGING_DIR);
closedir(ERR_STAGING_DIR);
$ERR_FILETYPE = basename ($ERR_FILETYPE);
$ERR_FILETYPE =~ s/\./\\\./g;
$ERR_FILETYPE =~ s/\*/\*/g;
@file_type = grep /^$ERR_STAGING/./$ERR_FILETYPE$/, @allfiles;
$numelements = @file_type;
if ($numelements <= 0) {
print LOG "$PID: No files match specified pattern, exiting.\n";
&HandlerDie($NO_FILE_TYPE, $current_poid);
}
Here is what I'm doing above: list all files in the ERR_STAGING directory, then grep for files matching the pattern 'BVN*.fin.bc_lerr.xml.bc', e.g. BVN_201608250000.fin.bc_lerr.xml.bc, and do something with each file. However, the above code also returns files which don't match the pattern, and it picks up some temp directories too.
Correct two lines:
$ERR_FILETYPE =~ s/\*/.\*/g; # add a dot, so the glob * becomes the regex .*
@file_type = grep /\/$ERR_FILETYPE$/, @allfiles; # such a filter is sufficient
I'm going to suggest an alternative - you mention 'BVN*.fin.bc_lerr.xml.bc' as a pattern.
But that's not a regular expression (well, OK, it is, but I'm pretty sure you don't want 'zero or more N'; you want 'anything after BVN'). And you appear to be trying to convert it into a regex.
That means you're actually looking for a shell glob, not a regex. They're similar, but not the same.
So can I suggest that, instead of readdir and grep, what you want is glob.
Then you can:
my @files = glob ( '/path/to/BVN*.fin.bc_lerr.xml.bc' );
... and that's it. It'll expand your pattern using shell logic, not regex logic - and then read /path/to to find files matching that.
So in your example:
my @files = glob ( "$ERR_STAGING/$ERR_FILETYPE" );
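To see the glob approach end to end, here is a minimal sketch that builds a scratch directory with File::Temp and expands the pattern from the question. The file and directory names are made up for illustration, and the sketch assumes the temporary path contains no whitespace, since glob splits its argument on whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Basename qw(basename);

# Scratch directory with hypothetical entries (assumptions for this demo).
my $staging = tempdir(CLEANUP => 1);
for my $name ('BVN_201608250000.fin.bc_lerr.xml.bc', 'BVN_tmp', 'other.xml.bc') {
    open my $fh, '>', "$staging/$name" or die "Cannot create $name: $!";
    close $fh;
}
mkdir "$staging/BVN_tmpdir" or die "Cannot create directory: $!";

# glob applies shell wildcard rules: * means "any run of characters".
# The -f test additionally drops anything that is not a plain file.
my $pattern = 'BVN*.fin.bc_lerr.xml.bc';
my @matches = grep { -f $_ } glob("$staging/$pattern");

print basename($_), "\n" for @matches;
```

Only the one file whose name fits the whole pattern is printed; the temp directory and the partial matches are skipped.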
I am an absolute beginner in Perl and I am trying to extract lines of text between two strings on different lines, but without success. It looks like I'm missing something in my code. The code should print out the file name and the found strings. Do you have any idea where the problem could be? Many thanks for your help or advice. Here is the example:
*****************
example:
START
new line 1
new line 2
new line 3
END
*****************
and my script:
use strict;
use warnings;
my $command0 = "";
opendir (DIR, "C:/Users/input/") or die "$!";
my @files = readdir DIR;
close DIR;
splice (@files,0,2);
open(MYOUTFILE, ">>output/output.txt");
foreach my $file (@files) {
open (CHECKBOOK, "input/$file")|| die "$!";
while ($record = <CHECKBOOK>) {
if (/\bstart\..\/bend\b/) {
print MYOUTFILE "$file;$_\n";
}
}
close(CHECKBOOK);
$command0 = "";
}
close(MYOUTFILE);
I suppose that you are trying to use a flip-flop here, which might work well for your input, but you've written it wrong:
if (/\bstart\..\/bend\b/) {
A flip-flop (the range operator in scalar context) takes two expressions, separated by either .. or .... What you want is two regexes joined with ..:
if (/\bSTART\b/ .. /\bEND\b/)
Of course, you also want to match the case (upper), or use the /i modifier to ignore case. You might even want to use beginning of line anchor ^ to only match at the beginning of a line, e.g.:
if (/^START\b/ .. /^END\b/)
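As a minimal sketch of how the flip-flop behaves, the following runs it over an in-memory list instead of a file. The sample lines mirror the example in the question; the surrounding "header" and "footer" lines are assumptions added to show what gets excluded.

```perl
use strict;
use warnings;

# Sample input: the block from the question, wrapped in two extra lines.
my @lines = ("header\n", "START\n", "new line 1\n", "new line 2\n", "END\n", "footer\n");

my @block;
for (@lines) {
    # True from the line matching START through the line matching END, inclusive.
    push @block, $_ if /^START\b/ .. /^END\b/;
}

print @block;
```

The START and END lines themselves are part of the result; add a second test inside the loop if they should be dropped.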
You should also know that your entire program can be replaced with a one-liner, such as
perl -ne 'print if /^START\b/ .. /^END\b/' input/*
Alas, this only works where the shell expands the wildcard for you, as on Linux. The cmd shell in Windows does not glob, so you must do that manually:
perl -ne "BEGIN { @ARGV = map glob, @ARGV }; print if /^START\b/ .. /^END\b/" input/*
If you are having troubles with the whole file printing no matter what you do, I think the problem lies with your input file. So take a moment to study it and make sure it is what you think it is, for example:
perl -MData::Dumper -ne"$Data::Dumper::Useqq = 1; print Dumper $_;" file.txt
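If you're matching a multi-line string, you might need to tell the regexp about it. This only helps when the whole text is held in one string; a while (<FH>) loop hands the regex a single line at a time.
if (/\bstart\..\/bend\b/s) {
Note the s after the regex.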
Perldoc says:
s
Treat string as single line. That is, change "." to match any
character whatsoever, even a newline, which normally it would not
match.
My requirement is to print the files having 'xyz' text in their file names using perl.
I tried below and got the following error
Quantifier follows nothing in regex marked by <-- HERE in m/* <-- HERE xyz.xlsx$/;
use strict;
use warnings;
my @files = qw(file_xyz.xlsx,file.xlsx);
my @my_files = grep { /*xyz.xlsx$/ } @files;
for my $file (@my_files) {
print "The output $file \n";
}
The problem appears when I add * to the grep regular expression.
How can I possibly achieve this?
The * is a meta character, called a quantifier. It means "repeat the previous character or character class zero or more times". In your case, it follows nothing, and is therefore a syntax error. What you probably are trying is to match anything, which is .*: Wildcard, followed by a quantifier. However, this is already the default behaviour of a regex match unless it is anchored. So all you need is:
my @my_files = grep { /xyz/ } @files;
You could keep your end-of-string anchor xlsx$, but since you have a limited list of file names, that hardly seems necessary. Note, though, that you have used qw() wrong: its items are separated by whitespace, not commas:
my @files = qw(file_xyz.xlsx file.xlsx);
However, if you should have a larger set of file names, such as one read from a directory, you can place a wildcard string in the middle:
my @my_files = grep { /xyz.*\.xlsx$/i } @files;
Note the use of the /i modifier to match case insensitively. Also note that you must escape . because it is another meta character.
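A small sketch of that last filter applied to a longer list may help. The file names below are hypothetical and chosen to include a case variant and a near miss.

```perl
use strict;
use warnings;

# Hypothetical file names for illustration.
my @files = qw(file_xyz.xlsx file.xlsx XYZ_report.XLSX xyz_notes.txt);

# "xyz", then anything, then a literal ".xlsx" at the end; /i ignores case.
my @my_files = grep { /xyz.*\.xlsx$/i } @files;

print "The output $_\n" for @my_files;
```

Here file.xlsx fails because it lacks "xyz", and xyz_notes.txt fails because of the anchored extension.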
Suppose I have 5 files: tmp1.txt, tmp2.txt, tmp3.txt, temp1.txt, temp2.txt.
Now, is there any way to open multiple files and read from them using wildcards?
Example,
If I write "t*.txt" then data from each file should be read.
If I write "tm*.txt" then only data from 3 files should be read.
Yes, you can use a glob, assuming these files exist in a local directory, and no other files with similar names are in that directory.
print "Read which files? ";
chomp(my $glob = <STDIN>);
my @files_to_read = glob $glob;
Of course, you can ensure that you get no other files by filtering them:
my %valid = map { $_ => 1 } qw(tmp1.txt tmp2.txt tmp3.txt temp1.txt temp2.txt);
my @files = grep $valid{$_}, glob $glob;
The first statement creates a hash where the valid file names are keys with a true value; the second statement runs this check on each element of the glob list.
You can use glob to find the list of files, and read through them sequentially by assigning the list to #ARGV, which emulates them being passed on the command line.
our @ARGV = glob '/path/to/tm*.txt';
while (<ARGV>) {
print;
}
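The following is a self-contained sketch of the same idea, using a scratch directory so it can be run anywhere. The five file names come from the question; their contents are invented, and the sketch assumes the temporary path contains no whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create the five files from the question in a scratch directory.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(tmp1.txt tmp2.txt tmp3.txt temp1.txt temp2.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create $name: $!";
    print {$fh} "data from $name\n";
    close $fh;
}

# "tm*.txt" expands to the three tmpN.txt files only.
@ARGV = glob "$dir/tm*.txt";

# Reading from ARGV walks through every file listed in @ARGV in turn.
my @seen;
while (my $line = <ARGV>) {
    push @seen, $line;
}
print @seen;
```

Changing the pattern to "t*.txt" would read all five files instead.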
I'm new with perl.
I would like to say that a variable could take 2 values, then I call it from another function.
I tried:
my(@file) = <${dirname}/*.txt || ${dirname}/*.xml> ;
but this seems not working for the second value, any suggestions?
When using the <*> operator as a fileglob operator, you can use any common glob pattern. Available patterns are
* (any number of any characters),
? (any single character),
{a,b,c} (any of the a, b or c patterns).
So you could do
my @file = glob "$dirname/*.{txt,xml}";
or
my @file = (glob("$dirname/*.txt"), glob("$dirname/*.xml"));
or
my @file = glob "$dirname/*.txt $dirname/*.xml";
as the glob pattern is split at whitespace into subpatterns.
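As a quick check that the brace form and the whitespace-separated form select the same files, here is a minimal sketch against a scratch directory. The three file names are assumptions, and the temporary path is assumed to contain no whitespace.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Basename qw(basename);

# Scratch directory with one file per extension (hypothetical names).
my $dirname = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.xml c.log)) {
    open my $fh, '>', "$dirname/$name" or die "Cannot create $name: $!";
    close $fh;
}

# Brace alternation inside one pattern.
my @brace = map { basename($_) } glob "$dirname/*.{txt,xml}";

# Two patterns in one string, separated by whitespace.
my @split = map { basename($_) } glob "$dirname/*.txt $dirname/*.xml";

print "brace: @brace\n";
print "split: @split\n";
```

Both lists contain a.txt and b.xml; c.log is left out by either form.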
If I understood correctly, you want @files to fall back on the second option (*.xml) if no *.txt files are found.
If so, your syntax is close, but || evaluates its left operand in scalar context, where glob returns only one file name per call. Do the fallback in two steps instead:
my @files = glob( "$dirname/*.txt" );
@files = glob( "$dirname/*.xml" ) unless @files;
Also, it's a good idea to check @files to make sure it's populated (what if you don't have any *.txt or *.xml?)
warn 'No files found' unless @files;
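A two-step version of the fallback can be sketched as follows. It keeps each glob call in list context, so every matching file is returned. The scratch directory and its file names are assumptions for the demo; it deliberately contains no .txt files so that the fallback fires.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory holding only .xml files (hypothetical names).
my $dirname = tempdir(CLEANUP => 1);
for my $name (qw(one.xml two.xml)) {
    open my $fh, '>', "$dirname/$name" or die "Cannot create $name: $!";
    close $fh;
}

# First choice: .txt files. Fall back to .xml only if none were found.
my @files = glob "$dirname/*.txt";
@files = glob "$dirname/*.xml" unless @files;

warn "No files found\n" unless @files;
print scalar(@files), " file(s) selected\n";
```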
my (@file) = (<${dirname}/*.txt>, <${dirname}/*.xml>);
<> expands each pattern into a list of file names, so you are essentially doing my @file = (@list1, @list2). The parentheses around the right-hand side matter, because = binds more tightly than the comma. @file holds the .txt files first and then the .xml files.
This works
my $file = $val1 || $val2;
What it means is: set $file to $val1, but if $val1 is false (0, the empty string, or undef), set $file to $val2.
In essence, surrounding a variable with < > means either
1) treat it as a filehandle (for example $read=<$filehandle>)
2) use it as a shell glob (for example @files=<*.xml>)
It looks to me like you wish to interpolate the value of $dirname and add either .txt or .xml on the end. The < > will not achieve this.
If you wish to send two values to a function, then this might be what you want:
my @file=("$dirname.txt","$dirname.xml");
Then call the function with @file, i.e. myfunction(@file).
In the function:
sub myfunction {
my $file1=shift;
my $file2=shift;
...
}
All this is covered in the perldoc pages perlsub and perldata.
Have fun
I'm looking for a way to find a file that resides in several directories under a given path. In other words, those directories all contain files with the same name. My script seems to have a hierarchy problem: it does not descend to the correct path to look for the file name. I have a fixed path as input, and the script needs to look under that path and find the files, but it seems to get stuck two tiers up and process from there, rather than descending into the last directories in the tier (in my case, the "ln" and "nn" levels, where the subroutine should start processing).
The fixed input path is:
/nfs/disks/version_2.0/
The files that I want to post-process in the subroutine exist under several directories, as below. Basically, I want to check whether file1.abc exists in all of the directories temp1, temp2 and temp3 under the ln directory, and likewise whether file2.abc exists in temp1, temp2 and temp3 under the nn directory.
The full paths of the files that I want to check look like this:
/nfs/disks/version_2.0/dir_a/ln/temp1/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp2/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp3/file1.abc
/nfs/disks/version_2.0/dir_a/nn/temp1/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp2/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp3/file2.abc
My script as below:-
#! /usr/bin/perl -w
my $dir = '/nfs/fm/disks/version_2.0/' ;
opendir(TEMP, $dir) || die $! ;
foreach my $file (readdir(TEMP)) {
next if ($file eq "." || $file eq "..") ;
if (-d "$dir/$file") {
my $d = "$dir/$file";
print "Directory:- $d\n" ;
&getFile($d);
&compare($file) ;
}
}
Note that I put the print "Directory:- $d\n" ; there for debugging purposes, and it printed this:
/nfs/disks/version_2.0/dir_a/
/nfs/disks/version_2.0/dir_b/
So I know it passes the wrong path to the subroutines that follow.
Can somebody help point out where the error in my script is? Thanks!
To be clear: the script is supposed to recurse through a directory and look for files with a particular filename? In this case, I think the following code is the problem:
if (-d "$dir/$file") {
my $d = "$dir/$file";
print "Directory:- $d\n" ;
&getFile($d);
&compare($file) ;
}
I'm assuming the &getFile($d) is meant to step into a directory (i.e., the recursive step). This is fine. However, it looks like the &compare($file) is the action that you want to take when the object that you're looking at isn't a directory. Therefore, that code block should look something like this:
if (-d "$dir/$file") {
&getFile("$dir/$file"); # the recursive step, for directories inside of this one
} elsif( -f "$dir/$file" ){
&compare("$dir/$file"); # the action on files inside of the current directory
}
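The general pattern should look like this:
sub myFind {
my $dir = shift;
opendir(my $dh, $dir) or die "Cannot open $dir: $!";
my @entries = readdir $dh; # read everything before recursing
closedir $dh;
foreach my $file (@entries) {
next if $file eq "." || $file eq "..";
my $obj = "$dir/$file";
if( -d $obj ){
myFind( $obj ); # recurse into subdirectories
} elsif( -f $obj ){
doSomethingWithFile( $obj ); # placeholder for your per-file action
}
}
}
myFind( "/nfs/fm/disks/version_2.0" );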
As a side note: this script is reinventing the wheel. You only need to write a script that does the processing on an individual file. You could do the rest entirely from the shell:
find /nfs/fm/disks/version_2.0 -type f -name "the-filename-you-want" -exec your_script.pl {} \;
Wow, it's like reliving the 1990s! Perl code has evolved somewhat, and you really need to learn the new stuff. It looks like you learned Perl in version 3.0 or 4.0. Here's some pointers:
Use use warnings; instead of -w on the command line.
Use use strict;. This will require you to predeclare variables using my which will scope them to the local block or the file if they're not in a local block. This helps catch a lot of errors.
Don't put & in front of subroutine names.
Use and, or, and not instead of &&, ||, and !.
Learn about Perl Modules which can save you a lot of time and effort.
When someone says detect duplicates, I immediately think of hashes. If you use a hash based upon your file's name, you can easily see if there are duplicate files.
Of course a hash can only have a single value for each key. Fortunately, in Perl 5.x, that value can be a reference to another data structure.
So, I recommend you use a hash that contains a reference to a list (array in old parlance). You can push each instance of the file to that list.
Using your example, you'd have a data structure that looks like this:
%file_hash = (
"file1.abc" => [
"/nfs/disks/version_2.0/dir_a/ln/temp1",
"/nfs/disks/version_2.0/dir_a/ln/temp2",
"/nfs/disks/version_2.0/dir_a/ln/temp3",
],
"file2.abc" => [
"/nfs/disks/version_2.0/dir_a/nn/temp1",
"/nfs/disks/version_2.0/dir_a/nn/temp2",
"/nfs/disks/version_2.0/dir_a/nn/temp3",
],
);
And, here's a program to do it:
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say); #Can use `say` which is like `print "\n"`;
use File::Basename; #imports `dirname` and `basename` commands
use File::Find; #Implements Unix `find` command.
use constant DIR => "/nfs/disks/version_2.0";
# Find all duplicates
my %file_hash;
find (\&wanted, DIR);
# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
if (scalar (@{$file_hash{$file_name}}) > 1) {
say qq(Duplicate File: "$file_name");
foreach my $dir_name (@{$file_hash{$file_name}}) {
say " $dir_name";
}
}
}
sub wanted {
return if not -f $_;
if (not exists $file_hash{$_}) {
$file_hash{$_} = [];
}
push @{$file_hash{$_}}, $File::Find::dir;
}
Here's a few things about File::Find:
The work takes place in the subroutine wanted.
The $_ is the name of the file, and I can use this to see if this is a file or directory
$File::Find::name is the full name of the file including the path.
$File::Find::dir is the name of the directory.
If the array reference doesn't exist, I create it with $file_hash{$_} = [];. This isn't necessary, because Perl creates it automatically on the first push, but I find it comforting, and it can prevent errors. To use $file_hash{$_} as an array, I have to dereference it. I do that by wrapping it in @{ }, as in @{$file_hash{$_}}. The braces are required here: @$file_hash{$_} would be parsed as a slice of a hash reference named $file_hash.
Once all the files are found, I can print out the entire structure. The only thing I do is check to make sure there is more than one member in each array. If there's only a single member, then there are no duplicates.
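The hash-of-lists bookkeeping can be tried without touching the file system. This is a minimal sketch in which a hard-coded list of (directory, file name) pairs stands in for what File::Find would report; the paths are shortened, hypothetical versions of the ones in the question.

```perl
use strict;
use warnings;

# Hypothetical (directory, file name) pairs standing in for File::Find results.
my @found = (
    ['/data/ln/temp1', 'file1.abc'],
    ['/data/ln/temp2', 'file1.abc'],
    ['/data/nn/temp1', 'file2.abc'],
);

my %file_hash;
for my $pair (@found) {
    my ($dir, $name) = @$pair;
    # The array reference springs into existence on the first push.
    push @{ $file_hash{$name} }, $dir;
}

# A name is a duplicate when its list holds more than one directory.
my @duplicates = grep { @{ $file_hash{$_} } > 1 } sort keys %file_hash;

print "$_: @{ $file_hash{$_} }\n" for @duplicates;
```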
Response to Grace
Hi David W., thank you very much for your explanation and sample script. Sorry, maybe I wasn't really clear in defining my problem statement. I think I can't use a hash as the data structure in my path finding, since there are a few hundred file*.abc files, the number is undetermined, and files with the same name actually differ in content between the directory structures.
For example, the file1.abc that resides under "/nfs/disks/version_2.0/dir_a/ln/temp1" does not have the same content as the file1.abc under "/nfs/disks/version_2.0/dir_a/ln/temp2" or "/nfs/disks/version_2.0/dir_a/ln/temp3". My intention is to grep the list of file*.abc files in each of the directory structures (temp1, temp2 and temp3) and compare the file name list with a master list. Could you help shed some light on how to solve this? Thanks. – Grace
I'm just printing the file in my sample code, but instead of printing the file, you could open them and process them. After all, you now have the file name and the directory. Here's the heart of my program again. This time, I'm opening the file and looking at the content:
foreach my $file_name (sort keys %file_hash) {
if (scalar (@{$file_hash{$file_name}}) > 1) {
#say qq(Duplicate File: "$file_name");
foreach my $dir_name (@{$file_hash{$file_name}}) {
#say " $dir_name";
open (my $fh, "<", "$dir_name/$file_name")
or die qq(Can't open file "$dir_name/$file_name" for reading);
# Process your file here...
close $fh;
}
}
}
If you are only looking for certain files, you could modify the wanted function to skip over files you don't want. For example, here I am only looking for files which match the file*.txt pattern. Note I use a regular expression of /^file.*\.txt$/ to match the name of the file. As you can see, it's the same as the previous wanted subroutine. The only difference is my test: I'm looking for something that is a file (-f) and has the correct name (file*.txt):
sub wanted {
return unless -f $_ and /^file.*\.txt$/;
if (not exists $file_hash{$_}) {
$file_hash{$_} = [];
}
push @{$file_hash{$_}}, $File::Find::dir;
}
If you are looking at the file contents, you can use the MD5 hash to determine if the file contents match or don't match. This reduces a file to a short, fixed-length string (32 characters in hexadecimal form) which could even be used as a hash key instead of the file name. This way, files that have matching MD5 hashes (and thus, barring collisions, matching contents) would be in the same hash list.
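To illustrate keying on the digest instead of the name, here is a minimal sketch that uses Digest::MD5 on in-memory strings rather than real files. The paths and contents are invented; with real files, digest_file_hex from Digest::file would take the place of md5_hex.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical path => content pairs; two of the three share their content.
my %content = (
    'temp1/file1.abc' => "alpha\n",
    'temp2/file1.abc' => "alpha\n",
    'temp3/file1.abc' => "beta\n",
);

# Group the paths by the MD5 digest of their content.
my %by_digest;
for my $path (sort keys %content) {
    push @{ $by_digest{ md5_hex($content{$path}) } }, $path;
}

# Any group with more than one path holds files with identical content.
my @groups = grep { @$_ > 1 } values %by_digest;

print "Same content: @$_\n" for @groups;
```

Files with the same name but different content, such as the temp3 copy here, end up under a different key.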
You talk about a "master list" of files and it seems you have the idea that this master list needs to match the content of the file you're looking for. So, I'm making a slight mod in my program. I am first taking that master list you talked about, and generating MD5 sums for each file. Then I'll look at all the files in that directory, but only take the ones with the matching MD5 hash...
By the way, this has not been tested.
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say); #Can use `say` which is like `print "\n"`;
use File::Find; #Implements Unix `find` command.
use Digest::file qw(digest_file_hex);
use constant DIR => "/nfs/disks/version_2.0";
use constant MASTER_LIST_DIR => "/some/directory";
# First, I'm going through the MASTER_LIST_DIR directory
# and finding all of the master list files. I'm going to take
# the MD5 hash of those files, and store them in a Perl hash
# that's keyed by the name of the file. Thus, when I find a
# file with a matching name, I can compare the MD5 of that file
# and the master file. If they match, the files are the same. If
# not, they're different.
# In this example, I'm inlining the function I use to find the files
# instead of making it a separate function.
my %master_hash;
find (
sub {
$master_hash{$_} = digest_file_hex($_, "MD5") if -f;
},
MASTER_LIST_DIR
);
# Now I have the MD5 of all the master files, I'm going to search my
# DIR directory for the files that have the same MD5 hash as the
# master list files did. If they do have the same MD5 hash, I'll
# print out their names as before.
my %file_hash;
find (\&wanted, DIR);
# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
if (scalar (@{$file_hash{$file_name}}) > 1) {
say qq(Duplicate File: "$file_name");
foreach my $dir_name (@{$file_hash{$file_name}}) {
say " $dir_name";
}
}
}
# The wanted function has been modified since the last example.
# Here, I'm only going to put files in the %file_hash if they
# have the same MD5 hash as the master file of the same name.
sub wanted {
if (-f $_ and exists $master_hash{$_}
and $master_hash{$_} eq digest_file_hex($_, "MD5")) {
$file_hash{$_} //= []; #Using TLP's syntax hint
push @{$file_hash{$_}}, $File::Find::dir;
}
}