How can I Parser file without giving file name in Perl? - perl

When I run the programme at that time I only give a directory name I want all files in the directory to be parsed? Here is my code
my #indexFiles= "www/I.html";
my #rdata = readFile("#indexFiles");
sub readFile{
my $somefile = $_[0];
my #links = ($somefile);
my $p = HTML::TokeParser->new($somefile) || die "Can't open: $!";
while (my $token = $p->get_tag("img","a")){
my $currentlink = $token->[1]{href} || $token->[1]{src};
my $finalLink= $directory."/".$currentlink ;
if($currentlink =~ /\.html$/){
my #data = readFile($finalLink);
push #links,#data;
} else{
push #links,$finalLink;
}
}
return #links;
}
In www folder I have 3 HTML file, 2 folder.
my #indexFiles= "www/I.html" In this line I pass specific path name and file. I don't want to pass that name. Instead of this, it will select automatically.
For example: When I run my programme perl c.pl www. It should be Parse all the file.
I give the specific file name I.html then after it will be going to find img and a tag.

Your question doesn't appear to have anything to do with parsing or HTML::TokeParser; it appears to be about determining whether a path references a directory or not, and getting the list of files in the directory if it references a directory.
stat and -d (in conjunction or independently) can be used to test if a path references a directory.
At the lowest level, opendir+readdir+closedir is used to read a directory. The glob builtin and numerous modules provide alternatives ways of doing this.
Recursive search:
sub process {
for my $qfn (#_) {
stat($qfn)
or die("Can't stat \"$qfn\": $!\n");
if (-d _) {
process(glob("\Q$qfn\E/*"));
}
elsif ($qfn =~ /\.html\z/) {
process_html_file($qfn);
}
}
}
process('www');
Non-recursive search:
sub process {
for my $qfn (#_) {
stat($qfn)
or die("Can't stat \"$qfn\": $!\n");
if (-d _) {
process_html_file(glob("\Q$qfn\E/*.html"));
}
elsif ($qfn =~ /\.html\z/) {
process_html_file($qfn);
}
}
}
process('www');
Alternatively, you could use File::Find::Rule.
Recursive search:
use File::Find::Rule qw( );
process_html_file($_)
for File::Find::Rule->name('*.html')->file->in('www');
Non-recursive search:
use File::Find::Rule qw( );
process_html_file($_)
for File::Find::Rule->maxdepth(1)->name('*.html')->file->in('www');

Related

Perl code fails to match recursively (nested subs)

The code below loops through folders in “/data/results” directory and matches each .vcf file name, located in a sub-folder (two levels down) to the content of a matrix_key file.
This seem to work only for the first folder. I printed the content of each #matrix_key and it’s correct. The code always fails to match for the second folder. Here is where it fails to match:: if ( my $aref = first { index($sample_id, $_->[1]) != -1 } #matrix_key ) {
I’ve tried to run one folder at a time and it work great. I don’t understand why it fails when I put multiple folders in /data/results/? Could someone please suggest how to correct this issue? Thank you.
Here is an example of directory structure:
/data/results/
TestFolder1/
subfolder1/Variants/MD-14-11856_RNA_v2.vcf
subfoder2/Variants/SU-16-16117_RNA_v2.vcf
matrix.txt
matrixkey.txt
TestFolder2/
subfolder1/Variants/SU-15-2542_v2.vcf
subfolder2/Variants/SU-16-16117_v2.vcf
matrix.txt
matrixkey.txt
Example of #matrix_key:
Barcode SampleName
barcode_003 SU-15-2542
barcode-005 MD-14-11856
barcode-002 SU-16-16117
The code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;
use File::Basename;
use File::Spec::Functions 'splitdir';
my $current_directory = "/data/results";
my #dirs = grep { -d } glob '/data/results/*';
if (grep -d, glob("$current_directory/*")) {
print "$current_directory has subfolder(s)\n";
}
else {
print "there are no folders\n";
die;
}
my %files;
my #matrix_key = ();
for my $dir ( #dirs ) {
print "the directory is $dir\n";
my $run_folder = (split '/', $dir)[3];
print "the folder is $run_folder\n";
my $key2 = $run_folder;
# checks if barcode matrix and barcode summary files exist
#shortens the folder names and unzips them.
#check if each sample is present in the matrix file for each folder.
my $location = "/data/results/".$run_folder;
my $matrix_key_file = "/data/results/".$run_folder."/matrixkey.txt";
open my $key, '<', $matrix_key_file or die $!; # key file
<$key>; # throw away header line in key file (first line)
#matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
map [ split ], <$key>;
close $key or die $!;
print Dumper(#matrix_key) . "===\n\n";
find({ wanted => \&find_vcf, no_chdir=>1}, $location);
#find({ wanted => find_vcf, no_chdir=>1}, $location);
}
my $find_vcf = sub {
#sub find_vcf {
my $F = $File::Find::name;
if ($F =~ /vcf$/ ) {
print "$F\n";
$F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
my $sample_id = $1; print "the short vcf name is: $sample_id\n";
if ( my $aref = first { index($sample_id, $_->[1]) != -1 } #matrix_key ) {
#the code fails to match sample_id to matrix_key
#even though it's printed out correctly
print "$sample_id \t MATCHES $aref->[1]\n";
print "\t$aref->[1]_$aref->[0]\n\n";
} else {
# handle all other possible exceptions
#print "folder name is $run_folder\n";
die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
}
}
}
The posted code appears to be a bit complicated for the job.
Here is one way to do what I understand from the question. It uses File::Find::Rule
use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';
my $base_dir = '/data/results';
my #dirs = File::Find::Rule->maxdepth(1)->directory->in($base_dir);
foreach my $dir (#dirs)
{
# Find all .vcx files anywhere in this dir or below
my #vcx_files = File::Find::Rule->file->name('*.vcx')->in($dir);
# Remove the path and .vcx extension
my #names = map { m|.*/(.+)\.vcx$| } #vcx_files;
# Find all text files to search, right in this folder
my #files = File::Find::Rule ->
maxdepth(1)->file->name('*.txt')->in($dir);
foreach my $file (#files)
{
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>; # drop the header line
# Get the second field on each line (with SampleName)
my #samples = map { (split)[1] } <$fh>;
# ... search #samples for #names ...
}
}
It is fine to use glob for non-recursive searches above, but given its treatment of spaces better use core File::Glob replacement for it.
There are other ways to organize traversal of directories and file searches, and there are many ways to compare two lists. Please clarify the overall objective so that I can add suitable code to search .vcx names vs. file content.
Please add checks, fix variable names, implement your policies for when things fail, etc.

Perl , How to read subfolder Output

I am writing a script to read the content of multiple sub folder in a directory.
And recently i need to read the content of folder inside multiple sub-folder.
Want to ask how can i write the code to read those folder inside multiple sub-folder.
This is the new conditions
Multiple Sub-folder -> Local folder -> fileAAA.csv
how do i read this fileAAA in Local folder of Multiple Sub-folder?
Currently the code i am writing was in this condition and it works well.
Multiple Sub-folder -> fileAAA.csv
Able to read fileAAA from multiple Sub-folder
Below is the code i use to read
Multiple Sub-folder -> fileAAA.csv
my ( $par_dir, $sub_dir );
opendir( $par_dir, "$parent" );
while ( my $sub_folders = readdir($par_dir) ) {
next if ( $sub_folders =~ /^..?$/ ); # skip . and ..
my $path = $parent . '/' . $sub_folders;
next unless ( -d $path ); # skip anything that isn't a directory
opendir( $sub_dir, $path );
while ( my $file = readdir($sub_dir) ) {
next unless $file =~ /\.csv?$/i;
my $full_path = $path . '/' . $file;
print_file_names($full_path);
}
closedir($sub_dir);
$flag = 0;
}
closedir($par_dir);
......
Updated
You should look at the File::Find module which has everything already in place to do searches like this, and has taken account of all corner cases for you
I wrote that on my tablet and at the time I couldn't offer sample code to support it. I believe this will do what you're asking for, which is simply to find all CSV files at any level beneath a parent directory
use strict;
use warnings;
use File::Find qw/ find /;
STDOUT->autoflush;
my $parent = '/Multiple Sub-folder';
find(sub {
return unless -f and /\.csv$/i;
print_file_names($File::Find::name);
}, $parent);
sub print_file_names {
my ($fn) = #_;
print $fn, "\n";
}
Without using moudle try this
Instead of opendir can you try glob for subdirectory search.
In below script i make a subroutine for continuous search.
When elsif condition is satisfied the path of the directory is will go to the find subroutine then it'll seach and so on.
my $v = "/Multiple Sub-folder";
find($v);
sub find{
my ($s) = #_;
foreach my $ma (glob "$s/*")
{
if(-f $ma)
{
if($ma =~m/.csv$/) # Here search for csv files.
{
print "$ma\n";
}
}
elsif(-d $ma)
{
find("$ma")
}
}
}
But can you use File::Find module for search the files in the directory as the answer of Borodin Which is the best approach.

How to traverse all the files in a directory and subdirectories as well

I am working on an script which will read the file and based on the file content i have to rename the file.
Problem which i am facing :
1st I am not able to read file which are in subdirectory.
2nd I am trying to rename the file but error are occuring.
Here is My code:
Folder Structur :
Logs
/ / / \
a b c d
/ \ \
e file file
Where a,b,c,d,e are dir
File,File are txt file
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy::Recursive;
use Cwd;
my $file;
my (#FILES,#File_01);
my ($dir_01,$dir_02);
my $new_file="Kernel.txt";
chdir('C:\\abd\\efg');
$dir_01 = getcwd;
opendir(DIR_01, $dir_01);
#FILES=readdir(DIR_01);
close(DIR_01);
foreach $dir_02 (#FILES)
{
opendir (DIR_02, "$dir_01"."/"."$dir_02") ;
#File_01=readdir(DIR_02);
close(DIR_02);
foreach $file (#File_01)
{
open(FH,"<","$dir_01/$dir_02/$file") or die "can not open the file $!\n";
while(my $lines = <FH>)
{
if($lines=~ m/Linux kernel Version/i || $lines=~ m/USB_STATE=DISCONNECTED/gi)
{
#print "\t$lines\n\n";
rename($file,$new_file) or die "can not change the name";
}
}
}
}
rename($file,$new_file)
will not work as you did not specify the path
rename("$dir_01/$dir_02/$file", $new_file)
But also like that any file that matches will be renamed Kernel.txt in the current directory. Is that the expected behavior?
Try module File::Next
example:
my $iter = File::Next::files($some_dir);
while(my $fn = $iter->()) {
my $rename = 0;
open(my $FILE,'<',$fn);
while(my $line = <$FILE>) {
if($lines=~ m/Linux kernel Version/i || $lines=~ m/USB_STATE=DISCONNECTED/gi)
$rename = 1;
last;
}
}
close($FILE);
if($rename) {
rename($fn,$new_file) or die "can not change the name";
}
}
maybe you can't rename because renaming open file
try renaming after close open file

What's the best strategy to delete a very huge folder using Perl?

I need to delete all content (files and folders) under a given folder. The problems is the folder has millions of files and folders inside it. So I don't want to load all the file names in one go.
Logic should be like this:
iterate a folder without load everything
get a file or folder
delete it
(verbose that the file or folder "X" was deleted)
go to the next one
I'm trying something like this:
sub main(){
my ($rc, $help, $debug, $root) = ();
$rc = GetOptions ( "HELP" => \$help,
"DEBUG" => \$debug,
"ROOT=s" => \$root);
die "Bad command line options\n$usage\n" unless ($rc);
if ($help) { print $usage; exit (0); }
if ($debug) {
warn "\nProceeding to execution with following parameters: \n";
warn "===============================================================\n";
warn "ROOT = $root\n";
} # write debug information to STDERR
print "\n Starting to delete...\n";
die "usage: $0 dir ..\n" unless $root;
*name = *File::Find::name;
find \&verbose, #ARGV;
}
sub verbose {
if (!-l && -d _) {
print "rmdir $name\n";
} else {
print "unlink $name\n";
}
}
main();
It's working fine, but whenever "find" reads the huge folder, the application gets stuck and I can see the system memory for Perl increasing until timeout. Why? Is it trying to load all the files in one go?
Thanks for your help.
The remove_tree function from File::Path can portably and verbosely remove a directory hierarchy, keeping the top directory, if desired.
use strict;
use warnings;
use File::Path qw(remove_tree);
my $dir = '/tmp/dir';
remove_tree($dir, {verbose => 1, keep_root => 1});
Pre-5.10, use the rmtree function from File::Path. If you still want the top directory, you could just mkdir it again.
use File::Path;
my $dir = '/tmp/dir';
rmtree($dir, 1); # 1 means verbose
mkdir $dir;
The perlfaq points out that File::Find does the hard work of traversing a directory, but the work isn't that hard (assuming your directory tree is free of named pipes, block devices, etc.):
sub traverse_directory {
my $dir = shift;
opendir my $dh, $dir;
while (my $file = readdir($dh)) {
next if $file eq "." || $file eq "..";
if (-d "$dir/$file") {
&traverse_directory("$dir/$file");
} elsif (-f "$dir/$file") {
# $dir/$file is a regular file
# Do something with it, for example:
print "Removing $dir/$file\n";
unlink "$dir/$file" or warn "unlink $dir/$file failed: $!\n";
} else {
warn "$dir/$file is not a directory or regular file. Ignoring ...\n";
}
}
closedir $dh;
# $dir might be empty at this point. If you want to delete it:
if (rmdir $dir) {
print "Removed $dir/\n";
} else {
warn "rmdir $dir failed: $!\n";
}
}
Substitute your own code for doing something with a file or (possibly) empty directory, and call this function once on the root of the tree that you want to process. Lookup the meanings of opendir/closedir, readdir, -d, and -f if you haven't encountered them before.
What's wrong with:
`rm -rf $folder`; // ??
You can use File::Find to systematically traverse the directory and delete the files and directories under it.
OK, I gave in and used Perl builtins but you should use File::Path::rmtree which I had totally forgotten about:
#!/usr/bin/perl
use strict; use warnings;
use Cwd;
use File::Find;
my ($clean) = #ARGV;
die "specify directory to clean\n" unless defined $clean;
my $current_dir = getcwd;
chdir $clean
or die "Cannot chdir to '$clean': $!\n";
finddepth(\&wanted => '.');
chdir $current_dir
or die "Cannot chdir back to '$current_dir':$!\n";
sub wanted {
return if /^[.][.]?\z/;
warn "$File::Find::name\n";
if ( -f ) {
unlink or die "Cannot delete '$File::Find::name': $!\n";
}
elsif ( -d _ ) {
rmdir or die "Cannot remove directory '$File::Find::name': $!\n";
}
return;
}
Download the unix tools for windows and then you can do rm -rv or whatever.
Perl is a great tool for a lot of purposes, but this one seems better done by a specialised tool.
Here's a cheap "cross-platform" method:
use Carp qw<carp croak>;
use English qw<$OS_NAME>;
use File::Spec;
my %deltree_op = ( nix => 'rm -rf %s', win => 'rmdir /S %s' );
my %group_for
= ( ( map { $_ => 'nix' } qw<linux UNIX SunOS> )
, ( map { $_ => 'win' } qw<MSWin32 WinNT> )
);
my $group_name = $group_for{$OS_NAME};
sub chop_tree {
my $full_path = shift;
carp( "No directory $full_path exists! We're done." ) unless -e $full_path;
croak( "No implementation for $OS_NAME!" ) unless $group_name;
my $format = $deltree_op{$group_name};
croak( "Could not find command format for group $group_name" ) unless $format;
my $command = sprintf( $format, File::Spec->canonpath( $full_path ));
qx{$command};
}

What's the best way to tell if a file exists in a directory?

I'm trying to move a file but I want to ensure that it exists before I do so. What's the simplest way to do this in Perl?
My code is like this. I looked up the open command, but I am not sure it is the simplest way or not.
if #Parser.exe exist in directory of Debug
{
move ("bin/Debug/Parser.exe","Parser.exe");
}
elsif #Parser.exe exist in directory of Release
{
move ("bin/Release/Parser.exe","Parser.exe");
}
else
{
die "Can't find the Parser.exe.";
}
Thank you.
What you need is a file test operator to check if the file exists. Specifically, you need the -e operator which checks if a file exists.
if (-e "bin/Debug/Parser.exe")
{
move ("bin/Debug/Parser.exe","Parser.exe");
}
elsif (-e "bin/Release/Parser.exe")
move ("bin/Release/Parser.exe","Parser.exe");
else
{
die "Can't find the Parser.exe."
}
You can make use of -e file test to check for file existence:
use File::Copy;
if(-e "bin/Debug/parser.exe") {
copy("bin/Debug/parser.exe","Parser.exe") or die "Copy failed: $!";
} elsif(-e "bin/Release/Parser.exe") {
copy("bin/Release/parser.exe","Parser.exe") or die "Copy failed: $!";
} else {
die "Can't find the Parser.exe.";
}
Personally I don't like the duplication of the file/ path name in these solutions - speaking for myself I suspect I might change accidently it to
if(-e "pathone....")... { copy("pathtwo...","Parser.exe")
I would do something like
copy("bin/Debug/parser.exe","Parser.exe") or
copy("bin/Release/parser.exe","Parser.exe") or
die "Can't find the Parser.exe.";
Or if that is a bit risque
copy_parser("bin/Debug") or
copy_parser("bin/Release") or
die "Can't find the Parser.exe.";
sub copy_parser {
my $path = shift ;
my $source = File::Spec-> catfile ( $path, 'Parser.exe' ) ;
if ( -e $source ) {
copy( $source, "Parser.exe") or die "Copy or $source failed: $!";
return 1 ;
}
return 0 ;
}
justintime is on the right track when he notes the repetition and seeks to eliminate it. I took the minimization a step farther than he did.
Rather than encapsulate only the copy/move portion of the code, though, it makes sense to remove as all the repetition by encapsulating the list iteration.
I put the subroutine in a module so it can be reused later as needed. This also reduces repeated code.
use SearchMove;
my $found = search_and_move(
src => 'Parser.exe',
dest => 'Parser.exe',
dirs => [
"bin/Debug",
"bin/Release",
],
);
die "Can't find the Parser.exe\n"
unless defined $found;
print "Found Parser.exe in $found";
In SearchMove.pm
package SearchMove;
use strict;
use warnings;
use Exporter 'import';
our #EXPORT_OK = qw( search_and_move );
our #EXPORT = #EXPORT_OK;
sub search_and_move {
my %arg = #_;
croak "No source file" unless exists $args{src};
croak "No dest file" unless exists $args{dest};
croak "No search paths" unless exists $args{dirs};
my $got_file;
for my $dir ( #{$arg{dirs}} ) {
my $source = "$dir/$arg{src}";
if( -e $source ) {
move( $source, $arg{dest} );
$got_file = $dir;
last;
}
}
return $got_file;
}
1;
Now you can use search_and_move in many different projects.