Perl code fails to match recursively (nested subs) - perl

The code below loops through the folders in the “/data/results” directory and matches the name of each .vcf file, located in a sub-folder two levels down, against the content of a matrix key file.
This seems to work only for the first folder. I printed the content of @matrix_key for each folder and it is correct, but the code always fails to match for the second folder. Here is where it fails to match: if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
I’ve tried running one folder at a time and it works great. I don’t understand why it fails when I put multiple folders in /data/results/. Could someone please suggest how to correct this issue? Thank you.
Here is an example of directory structure:
/data/results/
TestFolder1/
subfolder1/Variants/MD-14-11856_RNA_v2.vcf
subfolder2/Variants/SU-16-16117_RNA_v2.vcf
matrix.txt
matrixkey.txt
TestFolder2/
subfolder1/Variants/SU-15-2542_v2.vcf
subfolder2/Variants/SU-16-16117_v2.vcf
matrix.txt
matrixkey.txt
Example of @matrix_key:
Barcode SampleName
barcode_003 SU-15-2542
barcode-005 MD-14-11856
barcode-002 SU-16-16117
The code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;
use File::Basename;
use File::Spec::Functions 'splitdir';
my $current_directory = "/data/results";
my @dirs = grep { -d } glob '/data/results/*';
if (grep -d, glob("$current_directory/*")) {
print "$current_directory has subfolder(s)\n";
}
else {
print "there are no folders\n";
die;
}
my %files;
my @matrix_key = ();
for my $dir ( @dirs ) {
print "the directory is $dir\n";
my $run_folder = (split '/', $dir)[3];
print "the folder is $run_folder\n";
my $key2 = $run_folder;
# checks if barcode matrix and barcode summary files exist
#shortens the folder names and unzips them.
#check if each sample is present in the matrix file for each folder.
my $location = "/data/results/".$run_folder;
my $matrix_key_file = "/data/results/".$run_folder."/matrixkey.txt";
open my $key, '<', $matrix_key_file or die $!; # key file
<$key>; # throw away header line in key file (first line)
@matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
map [ split ], <$key>;
close $key or die $!;
print Dumper(@matrix_key) . "===\n\n";
find({ wanted => \&find_vcf, no_chdir=>1}, $location);
#find({ wanted => find_vcf, no_chdir=>1}, $location);
}
my $find_vcf = sub {
#sub find_vcf {
my $F = $File::Find::name;
if ($F =~ /vcf$/ ) {
print "$F\n";
$F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
my $sample_id = $1; print "the short vcf name is: $sample_id\n";
if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
#the code fails to match sample_id to matrix_key
#even though it's printed out correctly
print "$sample_id \t MATCHES $aref->[1]\n";
print "\t$aref->[1]_$aref->[0]\n\n";
} else {
# handle all other possible exceptions
#print "folder name is $run_folder\n";
die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
}
}
}

The posted code appears to be a bit complicated for the job.
Here is one way to do what I understand from the question. It uses File::Find::Rule.
use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';
my $base_dir = '/data/results';
my @dirs = File::Find::Rule->maxdepth(1)->directory->in($base_dir);
foreach my $dir (@dirs)
{
    # Find all .vcf files anywhere in this dir or below
    my @vcf_files = File::Find::Rule->file->name('*.vcf')->in($dir);
    # Remove the path and .vcf extension
    my @names = map { m|.*/(.+)\.vcf$| } @vcf_files;
    # Find all text files to search, right in this folder
    my @files = File::Find::Rule->maxdepth(1)->file->name('*.txt')->in($dir);
    foreach my $file (@files)
    {
        open my $fh, '<', $file or die "Can't open $file: $!";
        <$fh>;  # drop the header line
        # Get the second field on each line (with SampleName)
        my @samples = map { (split)[1] } <$fh>;
        # ... search @samples for @names ...
    }
}
It is fine to use glob for the non-recursive searches above, but given how glob treats spaces in its pattern, it is better to use bsd_glob from the core File::Glob module instead.
There are other ways to organize the traversal of directories and the file searches, and there are many ways to compare two lists. Please clarify the overall objective so that I can add suitable code to search the .vcf names against the file content.
Please add checks, fix variable names, implement your policies for when things fail, etc.
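The search step left open in the code above (comparing the file base names with the sample names) could be sketched as follows. This is only one possible approach. It assumes, as the question's own code does, that a file matches a sample when its base name contains the SampleName. The helper name match_samples is hypothetical, and the data is copied from the question's examples plus one made-up name that matches nothing.

```perl
use strict;
use warnings;
use List::Util 'first';

# Hypothetical helper: for each file base name, find the longest sample
# name contained in it (assumption: "contains" is the matching policy).
sub match_samples {
    my ($names, $samples) = @_;
    # Try longer sample names first, as the question's sort does
    my @by_length = sort { length($b) <=> length($a) } @$samples;
    my %match;
    for my $name (@$names) {
        $match{$name} = first { index($name, $_) != -1 } @by_length;
    }
    return \%match;    # a value is undef when nothing matched
}

my @names   = ('MD-14-11856_RNA_v2', 'SU-16-16117_RNA_v2', 'XX-00-0000_v2');
my @samples = ('SU-15-2542', 'MD-14-11856', 'SU-16-16117');

my $match = match_samples(\@names, \@samples);
for my $name (@names) {
    printf "%s => %s\n", $name, $match->{$name} // 'NO MATCH';
}
```

Because the helper takes its lists as arguments and returns a fresh hash, it can be called once per folder without relying on variables shared between subs.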

Related

How can I parse files without giving a file name in Perl?

When I run the programme I want to give only a directory name and have all the files in that directory parsed. Here is my code:
my @indexFiles= "www/I.html";
my @rdata = readFile("@indexFiles");
sub readFile{
    my $somefile = $_[0];
    my @links = ($somefile);
    my $p = HTML::TokeParser->new($somefile) || die "Can't open: $!";
    while (my $token = $p->get_tag("img","a")){
        my $currentlink = $token->[1]{href} || $token->[1]{src};
        my $finalLink= $directory."/".$currentlink ;
        if($currentlink =~ /\.html$/){
            my @data = readFile($finalLink);
            push @links,@data;
        } else{
            push @links,$finalLink;
        }
    }
    return @links;
}
In the www folder I have 3 HTML files and 2 folders.
In the line my @indexFiles= "www/I.html" I pass a specific path and file name. I don't want to pass that name; instead, the files should be selected automatically.
For example, when I run my programme as perl c.pl www, it should parse all the files.
At the moment I give the specific file name I.html, and the code then goes on to find the img and a tags.
Your question doesn't appear to have anything to do with parsing or HTML::TokeParser; it appears to be about determining whether a path references a directory or not, and getting the list of files in the directory if it references a directory.
stat and -d (in conjunction or independently) can be used to test if a path references a directory.
At the lowest level, opendir+readdir+closedir is used to read a directory. The glob builtin and numerous modules provide alternative ways of doing this.
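As a rough illustration of the low-level route, the sketch below tests a path with -d and then lists the .html files directly inside it using opendir, readdir and closedir. The helper name html_files_in is made up for this example, and 'www' is simply the directory from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the .html files directly inside $dir
# (non-recursive). Dies if $dir is not a directory.
sub html_files_in {
    my ($dir) = @_;
    -d $dir or die "Not a directory: $dir\n";

    opendir(my $dh, $dir) or die "Can't open directory \"$dir\": $!\n";
    my @entries = readdir($dh);    # bare names, including "." and ".."
    closedir($dh);

    # Keep plain files ending in .html and prefix them with the directory
    return map  { "$dir/$_" }
           grep { /\.html\z/ && -f "$dir/$_" }
           sort @entries;
}

my $dir = @ARGV ? $ARGV[0] : 'www';
if (-d $dir) {
    print "$_\n" for html_files_in($dir);
}
else {
    warn "\"$dir\" is not a directory\n";
}
```

Note that readdir returns bare entry names, so each one has to be joined with the directory before it can be tested or opened.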
Recursive search:
sub process {
for my $qfn (@_) {
stat($qfn)
or die("Can't stat \"$qfn\": $!\n");
if (-d _) {
process(glob("\Q$qfn\E/*"));
}
elsif ($qfn =~ /\.html\z/) {
process_html_file($qfn);
}
}
}
process('www');
Non-recursive search:
sub process {
for my $qfn (@_) {
stat($qfn)
or die("Can't stat \"$qfn\": $!\n");
if (-d _) {
process_html_file(glob("\Q$qfn\E/*.html"));
}
elsif ($qfn =~ /\.html\z/) {
process_html_file($qfn);
}
}
}
process('www');
Alternatively, you could use File::Find::Rule.
Recursive search:
use File::Find::Rule qw( );
process_html_file($_)
for File::Find::Rule->name('*.html')->file->in('www');
Non-recursive search:
use File::Find::Rule qw( );
process_html_file($_)
for File::Find::Rule->maxdepth(1)->name('*.html')->file->in('www');

File::Find is failing in subdirectories

I have a sub that finds .vcf files in the sub-directories of a main directory, using $File::Find::name. It was working great in one environment but does not work on another machine (both run Red Hat Linux). It still finds .vcf files if they are in the main directory, but fails to find them in a sub-directory.
Could someone please help to troubleshoot?
Here is an example of a file it fails to find (broken over lines for readability):
/home/yeliiley/mdl3/results/SN1-376-OFA_TL127445_CHIP1_052318_BSN/
MD-18-6297_BG_v1_ac9023be-8db4-440b-9095/Variants/
MD-18-6297_BG_v1_MD-18-6297_BG_RNA_v1/
MD-18-6297_BG_v1_MD-18-6297_BG_RNA_v1_Non-Filtered_2018-05-24_040909.vcf
However, if the file is directly in $main_dir, it is found.
#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
my $main_dir = "/home/yeliiley/mdl3/results/SN1-376-OFA_TL127445_CHIP1_052318_BSN";
my $location=$main_dir;
sub find_vcf {
my $F = $File::Find::name;
if ($F =~ /vcf$/ ) {
print "here is the vcf.$F\n";
$F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
my $sample_id = $1; print "the short vcf name is: $sample_id\n";
}else {
print "Did not find any vcf files $F\n";
}
}
find({ wanted => \&find_vcf, no_chdir=>1}, $location);
Try adding "follow => 1" to your find() call, i.e.
find({ wanted => \&find_vcf , no_chdir => 1, follow => 1}, $location);
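Some background on why this can help: by default File::Find does not descend into directories that are reached through symbolic links, whereas follow => 1 makes it follow them. Whether symlinks are really the cause on the second machine is an assumption, since the question does not say. The sketch below builds a throwaway tree in a temporary directory (all names are made up) and counts the .vcf files reported with and without the option.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp 'tempdir';

# Hypothetical helper: count the .vcf files that find() reports under $top.
# Extra options such as follow => 1 are passed straight through to find().
sub count_vcf {
    my ($top, %opts) = @_;
    my $count = 0;
    find({ wanted => sub { $count++ if /\.vcf\z/ }, no_chdir => 1, %opts }, $top);
    return $count;
}

# Throwaway layout: the real data lives outside $main_dir and is only
# reachable through a symbolic link inside it.
my $tmp      = tempdir(CLEANUP => 1);
my $main_dir = "$tmp/results";
my $real_dir = "$tmp/storage/Variants";
mkdir $_ or die "mkdir $_: $!" for $main_dir, "$tmp/storage", $real_dir;
open my $fh, '>', "$real_dir/sample.vcf" or die "open: $!";
close $fh;
symlink $real_dir, "$main_dir/Variants" or die "symlink: $!";

printf "without follow: %d\n", count_vcf($main_dir);
printf "with follow:    %d\n", count_vcf($main_dir, follow => 1);
```

If the two counts differ on the real directory tree as well, symbolic links are the likely explanation; if they are the same, the cause lies elsewhere (permissions, for instance).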

perl script to count files in windows directory tree

I am new to Perl scripting. I am trying to get the count of directories and subdirectories.
I have searched all the available help on scripting.
However, I am unable to get the count of subdirectories. Below is the script I used.
use strict;
use warnings;
use File::Slurp;
my @dirs = ('.');
my $directory_count = 0;
my $file_count = 0;
my $outfile = 'log.txt';
open my $fh, '>', $outfile or die "can't create logfile; $!";
for my $dir (@dirs) {
for my $file (read_dir ($dir)) {
if ( -d "$dir/$file" ) {
$directory_count++;
}
else {
$file_count++;
}
}
print $fh "Directories: $directory_count\n";
print $fh "Files: $file_count\n";
}
close $fh;
I am unable to work out where to change the script so that it behaves like the dir command with /s, i.e. recurses into subdirectories.
Please help; it will save a lot of manual work.
Ravi
Never EVER write your own directory traversal. There are too many pitfalls, gotchas and edge cases. Things like path delimiters, files with spaces, alternate data streams, soft links, hard links, DFS paths... just don't do it.
Use File::Find or if you prefer File::Find::Rule.
As I prefer the former, I'll give an example:
use strict;
use warnings;
use File::Find;
my $dir_count  = 0;
my $file_count = 0;
# find runs this for every file in its traversal.
# $_ is the 'current file'. $File::Find::name is the full path to the file.
sub count_stuff {
if ( -d ) { $dir_count++ };
if ( -f ) { $file_count++ };
}
find ( \&count_stuff, "." );
print "Dirs: $dir_count\n";
print "Files: $file_count\n";
Here is a script that does it: 1) without global variables; and 2) without adding another sub to the namespace.
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
run(\@ARGV);
sub run {
my $argv = shift;
for my $dir ( @$argv ) {
my $ret = count_files_and_directories( $dir );
printf(
"%s: %d files and %d directories\n",
$dir,
$ret->{files},
$ret->{directories}
);
}
return;
}
sub count_files_and_directories {
my $top = shift;
my %ret = (directories => 0, files => 0);
find(
{
wanted => sub {
-d and $ret{directories} += 1;
-f and $ret{files} += 1;
},
no_chdir => 1,
},
$top,
);
\%ret;
}
It seems simpler to use File::Find::Rule. For example:
use warnings;
use strict;
use File::Find::Rule;
my @files = File::Find::Rule->new->file->in('.');
my @dirs = File::Find::Rule->new->directory->in('.');

Perl How to merge two or more excel files in one (multiple worksheets)?

I need to merge a few Excel files into one file with multiple sheets.
I do not care too much about the sheet names in the new file.
I do not have Excel on the computer I plan to run this on, so I cannot use Win32::OLE.
I attempted to run this code https://sites.google.com/site/mergingxlsfiles/ but it is not working; I get a new, empty Excel file.
I attempted to run http://www.perlmonks.org/?node_id=743574 but I only obtained one of the files in the new Excel file.
My input Excel files have some French characters (é, for example); I believe these are cp1252.
Code used :
#!/usr/bin/perl -w
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;
use File::Glob qw(bsd_glob);
use Getopt::Long;
use POSIX qw(strftime);
GetOptions(
'output|o=s' => \my $outfile,
'strftime|t' => \my $do_strftime,
) or die;
if ($do_strftime) {
$outfile = strftime $outfile, localtime;
};
my $output = Spreadsheet::WriteExcel->new($outfile)
or die "Couldn't create '$outfile': $!";
for (@ARGV) {
my ($filename,$sheetname,$targetname);
my @files;
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
warn $filename;
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
@files = glob $filename;
} else {
($filename,$sheetname,$targetname) = ($_,qr(.*),undef);
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
push @files, glob $filename;
};
for my $f (@files) {
my $excel = Spreadsheet::ParseExcel::Workbook->Parse($f);
foreach my $sheet (@{$excel->{Worksheet}}) {
if ($sheet->{Name} !~ /$sheetname/) {
warn "Skipping '" . $sheet->{Name} . "' (/$sheetname/)";
next;
};
$targetname ||= $sheet->{Name};
#warn sprintf "Copying %s to %s\n", $sheet->{Name}, $targetname;
my $s = $output->add_worksheet($targetname);
$sheet->{MaxRow} ||= $sheet->{MinRow};
foreach my $row ($sheet->{MinRow} .. $sheet->{MaxRow}) {
my @rowdata = map {
$sheet->{Cells}->[$row]->[$_]->{Val};
} $sheet->{MinCol} .. $sheet->{MaxCol};
$s->write($row,0,\@rowdata);
}
}
};
};
$output->close;
I have 2 Excel files: 2.xls (with only 1 sheet, named 2) and 3.xls (with only 1 sheet, named 3).
I launched the script like this:
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls:2 3.xls:3
Result: results-20121024.xls is empty, with nothing in it.
Then I tried
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls 3.xls
And it worked.
I am not sure why it fails when the sheet name is added.
It appears that there is a bug in this line of the script:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
It looks to me like the goal of that line is to allow arguments either in the form
spreadsheet.xls:source_worksheet
or in another form allowing the name of the target sheet to be specified:
spreadsheet.xls:source_worksheet:target_worksheet
The last grouping appears intended to capture that last, optional argument: (?::([\w ]+)). The only problem is, this grouping was not made optional. Thus, when you only specify the source sheet and not the target, the regex fails to match and it falls to the backup behavior, which is to treat the whole argument as the filename. But this fails, too, because you don't have a file called 2.xls:2.
The solution would be to introduce the ? modifier after the last group in the regex to make it optional:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))?$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
Of course, that may not be the only problem. If the script was posted with an error, there could be other errors, too. I don't have Perl available to test it at the moment.
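For anyone who wants to check the corrected pattern in isolation, here is a small sketch that applies it to the argument forms discussed above. The helper name parse_arg and the target name "Merged" are made up for illustration; only the regex itself comes from the script.

```perl
use strict;
use warnings;

# Hypothetical helper: split one command-line argument the way the
# corrected regex does. Returns an empty list when the argument has no
# ":sheet" part (the script then treats it as a plain file name).
sub parse_arg {
    my ($arg) = @_;
    return $arg =~ m!^(.*\.xls):(.*?)(?::([\w ]+))?$!
        ? ($1, $2, $3)
        : ();
}

for my $arg ('2.xls:2', '2.xls:2:Merged', '2.xls') {
    my ($file, $sheet, $target) = parse_arg($arg);
    if (defined $file) {
        printf "%-15s file=%s sheet=%s target=%s\n",
            $arg, $file, $sheet, $target // '(same as source)';
    }
    else {
        printf "%-15s no sheet given, whole argument is the file name\n", $arg;
    }
}
```

With the trailing group made optional, the target name is simply undefined when only the source sheet is given, which is the case the script's $targetname ||= $sheet->{Name} line already handles.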

How to get a list of leaf subdirectories in a root folder in Perl

I am very new to Perl (and scripting languages in general) and I was wondering how to use Perl to get a listing of all the leaf directories under a root. For example, let's say my root directory is C:
C: -> I have folder "A" and "B" and files a.txt and b.txt
Folder "A" -> I have folder "D" and file c.html
Folder "B" -> I have folder "E" and "F" and file d.html
Folder "D", "E" and "F" -> bunch of text files
How do I get a bunch of directory paths as output for this scenario of:
C:\A\D\
C:\B\E\
C:\B\F\
As you can see, I just want a list of all the leaf directories. I don't want C:\A\ and C:\B\ to show up. After doing some research myself, I have noticed that I may be able to use the File::Find module in Perl, but I am not 100% sure how to go ahead with that either.
Thanks for any help you may be able to provide :)
Another approach:
use strict;
use warnings;
use feature qw( say );
use File::Find::Rule qw( );
use Path::Class qw( dir );
my $root = dir('.')->absolute();
my @dirs = File::Find::Rule->directory->in($root);
shift(@dirs);
my @leaf_dirs;
if (@dirs) {
    # trailing "/" so that "A" is not mistaken for a parent of "AB"
    my $last = shift(@dirs) . "/";
    for (@dirs) {
        push @leaf_dirs, $last if !/^\Q$last/;
        $last = $_ . "/";
    }
    push @leaf_dirs, $last;
}
say for @leaf_dirs;
Or using find's preprocess option:
use strict;
use warnings;
use File::Find;
find({ wanted =>sub{1}, # required--in version 5.8.4 at least
       preprocess=>sub{ # @_ is files in current directory
           @_ = grep { -d && !/^\.{1,2}$/ } @_;
           print "$File::Find::dir\n" unless @_;
           return @_;
       }
     }, ".");
From an answer by liverpole to the question "How to Get the Last Subdirectories" on PerlMonks.
It prints all leaf directories under the current directory (see "./"):
use strict;
use warnings;
my $h_dirs = terminal_subdirs("./");
my @dirs = sort keys %$h_dirs;
print "Terminal Directories:\n", join("\n", @dirs);
sub terminal_subdirs {
my ($top, $h_results) = @_;
$h_results ||= { };
opendir(my $dh, $top) or die "Arrggghhhh -- can't open '$top' ($!)\n";
my @files = readdir($dh);
closedir $dh;
my $nsubdirs = 0;
foreach my $fn (@files) {
next if ($fn eq '.' or $fn eq '..');
my $full = "$top/$fn";
if (!-l $full and -d $full) {
++$nsubdirs;
terminal_subdirs($full, $h_results);
}
}
$nsubdirs or $h_results->{$top} = 1;
return $h_results;
}