Perl: How to read files inside subfolders

I am writing a script to read the contents of multiple sub-folders in a directory. Recently I have also needed to read the contents of a folder inside each of those sub-folders. How can I write code to read the folders inside multiple sub-folders?
This is the new layout:
Multiple Sub-folder -> Local folder -> fileAAA.csv
How do I read fileAAA in the Local folder of each Sub-folder?
The code I am currently using works well for this layout:
Multiple Sub-folder -> fileAAA.csv
It is able to read fileAAA from each Sub-folder. Below is the code I use to read
Multiple Sub-folder -> fileAAA.csv
my ( $par_dir, $sub_dir );
opendir( $par_dir, $parent ) or die "Can't open $parent: $!";
while ( my $sub_folders = readdir($par_dir) ) {
    next if ( $sub_folders =~ /^\.\.?$/ );    # skip . and .. (the dots must be escaped)
    my $path = $parent . '/' . $sub_folders;
    next unless ( -d $path );                 # skip anything that isn't a directory

    opendir( $sub_dir, $path ) or die "Can't open $path: $!";
    while ( my $file = readdir($sub_dir) ) {
        next unless $file =~ /\.csv$/i;       # only .csv files (the 'v' must not be optional)
        my $full_path = $path . '/' . $file;
        print_file_names($full_path);
    }
    closedir($sub_dir);
    $flag = 0;
}
closedir($par_dir);
......
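For the nested layout above (Sub-folder -> Local folder -> fileAAA.csv), one minimal extension of this same opendir approach is a second level of directory reads. This is only a sketch, assuming each local folder sits directly under its sub-folder and that print_file_names is defined as elsewhere in this question:

opendir( my $par_dir, $parent ) or die "Can't open $parent: $!";
while ( my $sub_folder = readdir($par_dir) ) {
    next if $sub_folder =~ /^\.\.?$/;
    my $sub_path = "$parent/$sub_folder";
    next unless -d $sub_path;

    opendir( my $sub_dir, $sub_path ) or die "Can't open $sub_path: $!";
    while ( my $local_folder = readdir($sub_dir) ) {
        next if $local_folder =~ /^\.\.?$/;
        my $local_path = "$sub_path/$local_folder";
        next unless -d $local_path;    # the "Local folder" level

        opendir( my $local_dir, $local_path ) or die "Can't open $local_path: $!";
        while ( my $file = readdir($local_dir) ) {
            next unless $file =~ /\.csv$/i;
            print_file_names("$local_path/$file");
        }
        closedir($local_dir);
    }
    closedir($sub_dir);
}
closedir($par_dir);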

Updated
You should look at the File::Find module, which has everything already in place to do searches like this and has taken account of all the corner cases for you.
I wrote that on my tablet, and at the time I couldn't offer sample code to support it. I believe this will do what you're asking for, which is simply to find all CSV files at any level beneath a parent directory.
use strict;
use warnings;

use File::Find qw/ find /;

STDOUT->autoflush;

my $parent = '/Multiple Sub-folder';

find(sub {
    return unless -f and /\.csv$/i;
    print_file_names($File::Find::name);
}, $parent);

sub print_file_names {
    my ($fn) = @_;
    print $fn, "\n";
}
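Note that find chdirs into each directory by default, so inside the callback $_ holds the bare file name (which is what the -f and regex tests above operate on), while $File::Find::name holds the path starting from $parent. If you would rather work with full paths throughout, File::Find also accepts a no_chdir option; a small variant of the same call:

find( { wanted => sub {
            # with no_chdir, $_ is the full path, same as $File::Find::name
            return unless -f and /\.csv$/i;
            print_file_names($File::Find::name);
        },
        no_chdir => 1,
      }, $parent );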

Without using a module, try this:
Instead of opendir you can try glob for the subdirectory search.
In the script below I make a subroutine for the recursive search.
When the elsif condition is satisfied, the path of the directory is passed back into the find subroutine, which searches it in turn, and so on.
my $v = "/Multiple Sub-folder";
find($v);

sub find {
    my ($s) = @_;
    foreach my $ma ( glob "$s/*" ) {
        if ( -f $ma ) {
            if ( $ma =~ m/\.csv$/ ) {    # here search for csv files (dot escaped)
                print "$ma\n";
            }
        }
        elsif ( -d $ma ) {
            find($ma);
        }
    }
}
That said, you can use the File::Find module to search for files in the directory, as in Borodin's answer, which is the best approach.

Related

How can I parse files without giving a file name in Perl?

When I run the program I want to give only a directory name, and have all files in that directory parsed. Here is my code:
my @indexFiles = "www/I.html";
my @rdata = readFile("@indexFiles");

sub readFile {
    my $somefile = $_[0];
    my @links    = ($somefile);
    my $p = HTML::TokeParser->new($somefile) || die "Can't open: $!";
    while ( my $token = $p->get_tag( "img", "a" ) ) {
        my $currentlink = $token->[1]{href} || $token->[1]{src};
        my $finalLink   = $directory . "/" . $currentlink;
        if ( $currentlink =~ /\.html$/ ) {
            my @data = readFile($finalLink);
            push @links, @data;
        }
        else {
            push @links, $finalLink;
        }
    }
    return @links;
}
In the www folder I have 3 HTML files and 2 folders.
my @indexFiles = "www/I.html": in this line I pass a specific path and file name. I don't want to pass that name; it should be picked up automatically.
For example, when I run my program as perl c.pl www, it should parse all the files.
At the moment I give the specific file name I.html, and it then goes on to find the img and a tags.
Your question doesn't appear to have anything to do with parsing or HTML::TokeParser; it appears to be about determining whether a path references a directory or not, and getting the list of files in the directory if it references a directory.
stat and -d (in conjunction or independently) can be used to test if a path references a directory.
At the lowest level, opendir+readdir+closedir is used to read a directory. The glob builtin and numerous modules provide alternative ways of doing this.
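A minimal sketch of that lowest-level interface, assuming $dir holds the directory to list:

opendir( my $dh, $dir ) or die "Can't open $dir: $!";
while ( my $entry = readdir($dh) ) {
    next if $entry =~ /^\.\.?$/;    # skip . and ..
    my $path  = "$dir/$entry";
    my $label = -d $path ? 'directory' : 'file';
    print "$label: $path\n";
}
closedir($dh);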
Recursive search:
sub process {
    for my $qfn (@_) {
        stat($qfn)
            or die("Can't stat \"$qfn\": $!\n");

        if (-d _) {
            process( glob("\Q$qfn\E/*") );
        }
        elsif ( $qfn =~ /\.html\z/ ) {
            process_html_file($qfn);
        }
    }
}

process('www');
Non-recursive search:
sub process {
    for my $qfn (@_) {
        stat($qfn)
            or die("Can't stat \"$qfn\": $!\n");

        if (-d _) {
            process_html_file( glob("\Q$qfn\E/*.html") );
        }
        elsif ( $qfn =~ /\.html\z/ ) {
            process_html_file($qfn);
        }
    }
}

process('www');
Alternatively, you could use File::Find::Rule.
Recursive search:
use File::Find::Rule qw( );
process_html_file($_)
for File::Find::Rule->name('*.html')->file->in('www');
Non-recursive search:
use File::Find::Rule qw( );
process_html_file($_)
for File::Find::Rule->maxdepth(1)->name('*.html')->file->in('www');

Perl code fails to match recursively (nested subs)

The code below loops through the folders in the /data/results directory and matches each .vcf file name, located in a sub-folder (two levels down), against the content of a matrixkey file.
This seems to work only for the first folder. I printed the content of each @matrix_key and it's correct. The code always fails to match for the second folder. Here is where it fails to match: if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
I've tried to run one folder at a time and it works great. I don't understand why it fails when I put multiple folders in /data/results/. Could someone please suggest how to correct this issue? Thank you.
Here is an example of directory structure:
/data/results/
    TestFolder1/
        subfolder1/Variants/MD-14-11856_RNA_v2.vcf
        subfolder2/Variants/SU-16-16117_RNA_v2.vcf
        matrix.txt
        matrixkey.txt
    TestFolder2/
        subfolder1/Variants/SU-15-2542_v2.vcf
        subfolder2/Variants/SU-16-16117_v2.vcf
        matrix.txt
        matrixkey.txt
Example of @matrix_key:
Barcode SampleName
barcode_003 SU-15-2542
barcode-005 MD-14-11856
barcode-002 SU-16-16117
The code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;
use File::Basename;
use File::Spec::Functions 'splitdir';

my $current_directory = "/data/results";
my @dirs = grep { -d } glob '/data/results/*';

if ( grep -d, glob("$current_directory/*") ) {
    print "$current_directory has subfolder(s)\n";
}
else {
    print "there are no folders\n";
    die;
}

my %files;
my @matrix_key = ();

for my $dir (@dirs) {
    print "the directory is $dir\n";
    my $run_folder = ( split '/', $dir )[3];
    print "the folder is $run_folder\n";
    my $key2 = $run_folder;

    # checks if barcode matrix and barcode summary files exist
    # shortens the folder names and unzips them
    # checks if each sample is present in the matrix file for each folder

    my $location        = "/data/results/" . $run_folder;
    my $matrix_key_file = "/data/results/" . $run_folder . "/matrixkey.txt";

    open my $key, '<', $matrix_key_file or die $!;    # key file
    <$key>;    # throw away header line in key file (first line)
    @matrix_key = sort { length( $b->[1] ) <=> length( $a->[1] ) }
                  map [ split ], <$key>;
    close $key or die $!;
    print Dumper(@matrix_key) . "===\n\n";

    find( { wanted => \&find_vcf, no_chdir => 1 }, $location );
}

sub find_vcf {
    # my $find_vcf = sub { ... };    # the anonymous-sub form that was also tried
    my $F = $File::Find::name;
    if ( $F =~ /vcf$/ ) {
        print "$F\n";
        $F =~ m|([^/]+)\.vcf$| or die "Can't extract Sample ID";
        my $sample_id = $1;
        print "the short vcf name is: $sample_id\n";
        if ( my $aref = first { index( $sample_id, $_->[1] ) != -1 } @matrix_key ) {
            # the code fails to match sample_id to matrix_key
            # even though it's printed out correctly
            print "$sample_id \t MATCHES $aref->[1]\n";
            print "\t$aref->[1]_$aref->[0]\n\n";
        }
        else {
            # handle all other possible exceptions
            # print "folder name is $run_folder\n";
            die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
        }
    }
}
The posted code appears to be a bit complicated for the job.
Here is one way to do what I understand from the question, using File::Find::Rule:
use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';

my $base_dir = '/data/results';

my @dirs = File::Find::Rule->maxdepth(1)->directory->in($base_dir);

foreach my $dir (@dirs) {
    # Find all .vcf files anywhere in this dir or below
    my @vcf_files = File::Find::Rule->file->name('*.vcf')->in($dir);

    # Remove the path and .vcf extension
    my @names = map { m|.*/(.+)\.vcf$| } @vcf_files;

    # Find all text files to search, right in this folder
    my @files = File::Find::Rule
        ->maxdepth(1)->file->name('*.txt')->in($dir);

    foreach my $file (@files) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        <$fh>;    # drop the header line

        # Get the second field on each line (with SampleName)
        my @samples = map { (split)[1] } <$fh>;

        # ... search @samples for @names ...
    }
}
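In place of the search comment in the inner loop, one possible completion (hypothetical, since the overall objective was not fully specified) uses the any that is already imported:

# Hypothetical completion: check each .vcf-derived name against the samples
for my $name (@names) {
    if ( any { index($name, $_) != -1 } @samples ) {
        print "$name matches a SampleName in $file\n";
    }
    else {
        warn "no match for $name in $file\n";
    }
}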
It is fine to use glob for non-recursive searches as above, but given its treatment of spaces in the pattern it is better to use bsd_glob from the core File::Glob module, which does not split the pattern on whitespace.
There are other ways to organize the traversal of directories and the file searches, and there are many ways to compare two lists. Please clarify the overall objective so that I can add suitable code to compare the .vcf names against the file content.
Please add checks, fix variable names, implement your policies for when things fail, etc.

HTML::TreeBuilder inside a loop

I'm trying to delete all table elements from several HTML files.
The following code runs perfectly on a single file, but when trying to automate the process it returns the error
Can't call method "look_down" on an undefined value
Do you have any solution please?
Here is the code:
use strict;
use warnings;
use Path::Class;
use HTML::TreeBuilder;
opendir( DH, "C:/myfiles" );
my #files = readdir(DH);
closedir(DH);
foreach my $file ( #files ) {
print("Analyzing file $file\n");
my $tree = HTML::TreeBuilder->new->parse_file("C:/myfiles/$file");
foreach my $e ( $tree->look_down( _tag => "table" ) ) {
$e->delete();
}
use HTML::FormatText;
my $formatter = HTML::FormatText->new;
my $parsed = $formatter->format($tree);
print $parsed;
}
The problem is that you're feeding HTML::TreeBuilder all sorts of junk in addition to the HTML files that you intend. Besides the files in the opened directory, readdir returns the names of all subdirectories, including the pseudo-directories . and .. You should have seen this in the output from your print statement
print("Analyzing file $file\n");
One way to fix this is to check that each value in the loop is a file before processing it. Something like this
for my $file (@files) {
    my $path = "C:/myfiles/$file";
    next unless -f $path;

    print("Analyzing file $file\n");
    my $tree = HTML::TreeBuilder->new->parse_file($path);
    for my $table ( $tree->look_down( _tag => 'table' ) ) {
        $table->delete();
    }
    ...;
}
But it would be much cleaner to use a call to glob. That way you will only get the files that you want, and there is also no need to build the full path to each file
That would look something like this. You would have to adjust the glob pattern if your files don't all end with .html
for my $path ( glob "C:/myfiles/*.html" ) {
    print("Analyzing file $path\n");
    my $tree = HTML::TreeBuilder->new->parse_file($path);
    for my $table ( $tree->look_down( _tag => 'table' ) ) {
        $table->delete();
    }
    ...;
}
Strictly speaking, a directory name may also match *.html, and if you don't trust your file structure you should also test that each result of glob is a file before processing it. But in normal situations, where you know what's in the directory you're processing, that isn't necessary.
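One further point: HTML::TreeBuilder trees are self-referential data structures, and the module's documentation recommends explicitly destroying each tree once you are finished with it, which matters in a loop like this. A sketch of the same loop with that cleanup added:

for my $path ( glob "C:/myfiles/*.html" ) {
    my $tree = HTML::TreeBuilder->new->parse_file($path);
    ...;    # look_down and delete the tables, format, print
    $tree->delete;    # free the self-referential tree before the next file
}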

How do I use $File::Find::prune?

I need to edit .cue files in the top-level directory without recursing into its subdirectories.
find( \&read_cue, $dir_source );

sub read_cue {
    /\.cue$/ or return;
    my $fd = $File::Find::dir;
    my $fn = $File::Find::name;
    tie my @lines, 'Tie::File', $fn
        or die "could not tie file: $!";
    foreach (@lines) {
        s/some substitution//;
    }
    untie @lines;
}
I've tried variations of
$File::Find::prune = 1;
return;
but with no success. Where should I place and define $File::Find::prune?
Thanks
If you don't want to recurse, you probably want to use glob:
for ( glob("*.cue") ) {
    read_cue($_);
}
If you want to filter the subdirectories recursed into by File::Find, you should use the preprocess function (not the $File::Find::prune variable) as this gives you much more control. The idea is to provide a function which is called once per directory, and is passed a list of files and subdirectories; the return value is the filtered list to pass to the wanted function, and (for subdirectories) to recurse into.
As msw and Brian have commented, your example would probably be better served by a glob, but if you wanted to use File::Find, you might do something like the following. Here, the preprocess function calls -f on every file or directory it's given, returning a list of files. Then the wanted function is called only for those files, and File::Find does not recurse into any of the subdirectories:
use strict;
use warnings;
use File::Find;

# Function is called once per directory, with a list of files and
# subdirectories; the return value is the filtered list to pass to
# the wanted function.
sub preprocess { return grep { -f } @_; }

# Function is called once per file or subdirectory.
sub wanted { print "$File::Find::name\n" if /\.cue$/; }

# Find files in or below the current directory.
find( { preprocess => \&preprocess, wanted => \&wanted }, '.' );
This can be used to create much more sophisticated file finders. For example, I wanted to find all files in a Java project directory, without recursing into subdirectories starting with ".", such as ".idea" and ".svn", created by IntelliJ and Subversion. You can do this by modifying the preprocess function:
# Function is called once per directory, with a list of files and
# subdirectories; return value is the filtered list to pass to the
# wanted function.
sub preprocess { return grep { -f or (-d and /^[^.]/) } @_; }
If you only want the files in a directory without searching subdirectories, you don't want to use File::Find. A simple glob probably does the trick:
my @files = glob( "$dir_source/*.cue" );
You don't need that subroutine. In general, when you're doing a lot of work for a task that you think should be simple, you're probably doing it wrong. :)
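Combining that glob with the Tie::File edit from the question might look like this (a sketch only, reusing the question's placeholder substitution):

use Tie::File;

for my $fn ( glob "$dir_source/*.cue" ) {
    tie my @lines, 'Tie::File', $fn
        or die "could not tie $fn: $!";
    s/some substitution// for @lines;    # each aliased edit is written back to the file
    untie @lines;
}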
Say you have a directory subtree with
/tmp/foo/file.cue
/tmp/foo/bar/file.cue
/tmp/foo/bar/baz/file.cue
Running
#! /usr/bin/perl

use warnings;
use strict;
use File::Find;

sub read_cue {
    if (-f && /\.cue$/) {
        print "found $File::Find::name\n";
    }
}

@ARGV = (".") unless @ARGV;
find \&read_cue => @ARGV;
outputs
found /tmp/foo/file.cue
found /tmp/foo/bar/file.cue
found /tmp/foo/bar/baz/file.cue
But if you remember the directories in which you found cue files
#! /usr/bin/perl

use warnings;
use strict;
use File::Find;

my %seen_cue;

sub read_cue {
    if (-f && /\.cue$/) {
        print "found $File::Find::name\n";
        ++$seen_cue{$File::Find::dir};
    }
    elsif (-d && $seen_cue{$File::Find::dir}) {
        $File::Find::prune = 1;
    }
}

@ARGV = (".") unless @ARGV;
find \&read_cue => @ARGV;
you get only the toplevel cue file:
found /tmp/foo/file.cue
That's because $File::Find::prune emulates the -prune option of find that affects directory processing:
-prune
True; if the file is a directory, do not descend into it.

How can I handle template dependencies in Template Toolkit?

My static web pages are built from a huge bunch of templates which are inter-included using Template Toolkit's "import" and "include", so page.html looks like this:
[% INCLUDE top %]
[% IMPORT middle %]
Then top might have even more files included.
I have very many of these files, and they have to be processed to create the web pages in various languages (English, French, etc., not computer languages). This is a very complicated process, and when one file is updated I would like to be able to automatically remake only the necessary files, using a makefile or something similar.
Are there any tools, like makedepend for C files, which can parse Template Toolkit templates and create a dependency list for use in a makefile?
Or are there better ways to automate this process?
Template Toolkit does come with its own command line script called ttree for building TT websites à la make.
Here is a ttree.cfg file I often use on TT website projects here on my Mac:
# directories
src = ./src
lib = ./lib
lib = ./content
dest = ./html
# pre-process this site file
pre_process = site.tt
# copy these files
copy = \.(png|gif|jpg)$
# ignore following
ignore = \b(CVS|RCS)\b
ignore = ^#
ignore = ^\.DS_Store$
ignore = ^._
# other options
verbose
recurse
Just running ttree -f ttree.cfg will rebuild the site in dest, updating only what has changed at source (in src) or in my libraries (in lib).
For more fine-grained dependencies have a look at Template Dependencies.
Update: And here is my stab at getting a dependency list by subclassing Template::Provider:
{
    package MyProvider;
    use base 'Template::Provider';
    use feature 'say';    # needed for say below

    # see _dump_cache in Template::Provider
    sub _dump_deps {
        my $self = shift;

        if ( my $node = $self->{HEAD} ) {
            while ($node) {
                my ( $prev, $name, $data, $load, $next ) = @$node;
                say {*STDERR} "$name called from " . $data->{caller}
                    if exists $data->{caller};
                $node = $node->[4];
            }
        }
    }
}

use Template;

my $provider = MyProvider->new;

my $tt = Template->new(
    {
        LOAD_TEMPLATES => $provider,
    }
);

$tt->process( 'root.tt', {} ) or die $tt->error;

$provider->_dump_deps;
The code above displays all dependencies called (via INCLUDE, INSERT, PROCESS and WRAPPER) and where they are called from, across the whole root.tt tree. From this you could build a ttree dependency file.
/I3az/
In case all you care about is finding the file names mentioned in directives such as INCLUDE, PROCESS, WRAPPER etc., one might even imagine using sed or perl from the command line to generate the dependencies.
However, if there are subtler dependencies (e.g. you reference an image using <img> in your HTML document whose size is calculated using the Image plugin), the problem can become much less tractable.
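For instance, something along these lines could list the names mentioned in INCLUDE, PROCESS, and WRAPPER directives (a rough sketch only, not a robust template parser):

perl -ne 'print "$ARGV: $1\n" while /\[%-?\s*(?:INCLUDE|PROCESS|WRAPPER)\s+(\S+)/g' *.html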
I haven't really tested it but something like the following might work:
#!/usr/bin/perl
use strict; use warnings;

use File::Find;
use File::Slurp;
use Regex::PreSuf;

my ($top) = @ARGV;

my $directive_re = presuf qw( INCLUDE IMPORT PROCESS );

my $re = qr{
    \[%-? \s+ $directive_re \s+ (\S.+) \s+ -?%\]
}x;

find( \&wanted => $top );

sub wanted {
    return unless /\.html\z/i;
    my $doc = read_file $File::Find::name;
    printf "%s : %s\n", $_, join( " \\\n", $doc =~ /$re/g );
}
After reading the ttree documentation, I decided to create something myself. I'm posting it here in case it's useful to the next person who comes along. However, this is not a general solution, but one which works only for a few limited cases. It worked for this project since all the files are in the same directory and there are no duplicate includes. I've documented the deficiencies as comments before each of the routines.
If there is a simple way to do what this does with ttree which I missed, please let me know.
my @dependencies = make_depend("first_file.html.tmpl");

# Bugs:
# Insists files end with .tmpl (mine all do)
# Does not check the final list for duplicates.
sub make_depend
{
    my ($start_file) = @_;
    die unless $start_file && $start_file =~ /\.tmpl/ && -f $start_file;
    my $dir = $start_file;
    $dir =~ s:/[^/]*$::;
    $start_file =~ s:\Q$dir/::;
    my @found_files;
    find_files([$start_file], \@found_files, $dir);
    return @found_files;
}

# Bugs:
# Doesn't check for including the same file twice.
# Doesn't allow for a list of directories or subdirectories to find the files.
# Warning about files which aren't found switched off, due to
# [% INCLUDE $file %]
sub find_files
{
    my ($files_ref, $foundfiles_ref, $dir) = @_;
    for my $file (@$files_ref) {
        my $full_name = "$dir/$file";
        if (-f $full_name) {
            push @$foundfiles_ref, $full_name;
            my @includes = get_includes($full_name);
            if (@includes) {
                find_files(\@includes, $foundfiles_ref, $dir);
            }
        } else {
            # warn "$full_name not found";
        }
    }
}

# Only recognizes two includes, [% INCLUDE abc.tmpl %] and [% INCLUDE "abc.tmpl" %]
sub get_includes
{
    my ($start_file) = @_;
    my @includes;
    open my $input, "<", $start_file or die "Can't open $start_file: $!";
    while (<$input>) {
        while (/\[\%-?\s+INCLUDE\s+(?:"([^"]+)"|(.*))\s+-?\%\]/g) {
            my $filename = $1 ? $1 : $2;
            push @includes, $filename;
        }
    }
    close $input or die $!;
    return @includes;
}