HTML::TreeBuilder inside a loop - perl

I'm trying to delete all table elements from several HTML files.
The following code runs perfectly on a single file, but when trying to automate the process it returns the error
can't call method "look_down" on an undefined value
Do you have any solution please?
Here is the code:
use strict;
use warnings;
use Path::Class;
use HTML::TreeBuilder;
opendir( DH, "C:/myfiles" );
my @files = readdir(DH);
closedir(DH);
foreach my $file ( @files ) {
print("Analyzing file $file\n");
my $tree = HTML::TreeBuilder->new->parse_file("C:/myfiles/$file");
foreach my $e ( $tree->look_down( _tag => "table" ) ) {
$e->delete();
}
use HTML::FormatText;
my $formatter = HTML::FormatText->new;
my $parsed = $formatter->format($tree);
print $parsed;
}

The problem is that you're feeding HTML::TreeBuilder all sorts of junk in addition to the HTML files that you intend. As well as any files in the opened directory, readdir returns the names of all subdirectories, as well as the pseudo-directories "." and "..". parse_file returns undef when it cannot open the name it is given as a file, so $tree is undefined for those entries and the look_down call dies. You should have seen this in the output from your print statement
print("Analyzing file $file\n");
One way to fix this is to check that each value in the loop is a file before processing it. Something like this
for my $file ( @files ) {
my $path = "C:/myfiles/$file";
next unless -f $path;
print("Analyzing file $file\n");
my $tree = HTML::TreeBuilder->new->parse_file($path);
for my $table ( $tree->look_down( _tag => 'table' ) ) {
$table->delete();
}
...;
}
But it would be much cleaner to use a call to glob. That way you will only get the files that you want, and there is also no need to build the full path to each file
That would look something like this. You would have to adjust the glob pattern if your files don't all end with .html
for my $path ( glob "C:/myfiles/*.html" ) {
print("Analyzing file $path\n");
my $tree = HTML::TreeBuilder->new->parse_file($path);
for my $table ( $tree->look_down( _tag => 'table' ) ) {
$table->delete();
}
...;
}
Strictly speaking, a directory name may also look like *.html, and if you don't trust your file structure you should also test that each result of glob is a file before processing it. But in normal situations where you know what's in the directory you're processing that isn't necessary
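If you do want that extra safety, the glob call and the -f test combine naturally. The following is only a minimal sketch: the helper name html_files is made up for illustration, and it assumes the directory path contains no whitespace, because glob splits its pattern on whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: return the plain files in $dir whose names end in .html.
# Assumes $dir contains no whitespace, because glob splits its pattern on it.
sub html_files {
    my ($dir) = @_;
    return grep { -f $_ } glob "$dir/*.html";
}

# Usage: pass a directory on the command line, or default to the current one.
my $dir = @ARGV ? $ARGV[0] : '.';
print "Would analyze $_\n" for html_files($dir);
```

Each path it returns can then be handed to parse_file exactly as in the loop above.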

Related

Perl code fails to match recursively (nested subs)

The code below loops through folders in “/data/results” directory and matches each .vcf file name, located in a sub-folder (two levels down) to the content of a matrix_key file.
This seems to work only for the first folder. I printed the content of each @matrix_key and it's correct. The code always fails to match for the second folder. Here is where it fails to match: if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
I've tried running one folder at a time and it works great. I don't understand why it fails when I put multiple folders in /data/results/. Could someone please suggest how to correct this issue? Thank you.
Here is an example of directory structure:
/data/results/
TestFolder1/
subfolder1/Variants/MD-14-11856_RNA_v2.vcf
subfoder2/Variants/SU-16-16117_RNA_v2.vcf
matrix.txt
matrixkey.txt
TestFolder2/
subfolder1/Variants/SU-15-2542_v2.vcf
subfolder2/Variants/SU-16-16117_v2.vcf
matrix.txt
matrixkey.txt
Example of @matrix_key:
Barcode SampleName
barcode_003 SU-15-2542
barcode-005 MD-14-11856
barcode-002 SU-16-16117
The code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);
use List::Util 'first';
use File::Find;
use File::Spec;
use Data::Dumper;
use File::Basename;
use File::Spec::Functions 'splitdir';
my $current_directory = "/data/results";
my @dirs = grep { -d } glob '/data/results/*';
if (grep -d, glob("$current_directory/*")) {
print "$current_directory has subfolder(s)\n";
}
else {
print "there are no folders\n";
die;
}
my %files;
my @matrix_key = ();
for my $dir ( @dirs ) {
print "the directory is $dir\n";
my $run_folder = (split '/', $dir)[3];
print "the folder is $run_folder\n";
my $key2 = $run_folder;
# checks if barcode matrix and barcode summary files exist
#shortens the folder names and unzips them.
#check if each sample is present in the matrix file for each folder.
my $location = "/data/results/".$run_folder;
my $matrix_key_file = "/data/results/".$run_folder."/matrixkey.txt";
open my $key, '<', $matrix_key_file or die $!; # key file
<$key>; # throw away header line in key file (first line)
@matrix_key = sort { length($b->[1]) <=> length($a->[1]) }
map [ split ], <$key>;
close $key or die $!;
print Dumper(@matrix_key) . "===\n\n";
find({ wanted => \&find_vcf, no_chdir=>1}, $location);
#find({ wanted => find_vcf, no_chdir=>1}, $location);
}
my $find_vcf = sub {
#sub find_vcf {
my $F = $File::Find::name;
if ($F =~ /vcf$/ ) {
print "$F\n";
$F =~ m|([^/]+).vcf$| or die "Can't extract Sample ID";
my $sample_id = $1; print "the short vcf name is: $sample_id\n";
if ( my $aref = first { index($sample_id, $_->[1]) != -1 } @matrix_key ) {
#the code fails to match sample_id to matrix_key
#even though it's printed out correctly
print "$sample_id \t MATCHES $aref->[1]\n";
print "\t$aref->[1]_$aref->[0]\n\n";
} else {
# handle all other possible exceptions
#print "folder name is $run_folder\n";
die("The VCF file doesn't match the Summary Barcode file: $sample_id\n");
}
}
}
The posted code appears to be a bit complicated for the job.
Here is one way to do what I understand from the question. It uses File::Find::Rule
use warnings;
use strict;
use File::Find::Rule;
use List::Util 'any';
my $base_dir = '/data/results';
my @dirs = File::Find::Rule->maxdepth(1)->directory->in($base_dir);
foreach my $dir (@dirs)
{
# Find all .vcx files anywhere in this dir or below
my @vcx_files = File::Find::Rule->file->name('*.vcx')->in($dir);
# Remove the path and .vcx extension
my @names = map { m|.*/(.+)\.vcx$| } @vcx_files;
# Find all text files to search, right in this folder
my @files = File::Find::Rule ->
maxdepth(1)->file->name('*.txt')->in($dir);
foreach my $file (@files)
{
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>; # drop the header line
# Get the second field on each line (with SampleName)
my @samples = map { (split)[1] } <$fh>;
# ... search @samples for @names ...
}
}
It is fine to use glob for the non-recursive searches above, but glob splits its pattern on whitespace, so if a path may contain spaces it is better to use bsd_glob from the core File::Glob module instead.
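As a rough illustration of the File::Glob alternative, here is a small sketch using bsd_glob, which treats the whole string as a single pattern. The helper name txt_files_in and the directory '/data/results/Test Folder 1' are invented for the example and are not part of the question's setup.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical helper: list the .txt files directly inside $dir,
# even when $dir contains spaces. bsd_glob treats the whole string
# as one pattern, whereas the built-in glob would split it on whitespace.
sub txt_files_in {
    my ($dir) = @_;
    return grep { -f $_ } bsd_glob("$dir/*.txt");
}

# The directory name here is made up; nothing is printed if it does not exist.
print "$_\n" for txt_files_in('/data/results/Test Folder 1');
```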
There are other ways to organize traversal of directories and file searches, and there are many ways to compare two lists. Please clarify the overall objective so that I can add suitable code to search .vcx names vs. file content.
Please add checks, fix variable names, implement your policies for when things fail, etc.
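For the search step itself, one possible approach is sketched below. It is a guess at what is wanted rather than a finished solution: the helper name match_samples is invented, and it assumes that a file "matches" a sample when the sample name appears somewhere inside the file name, as in the index() test from the question.

```perl
use strict;
use warnings;
use List::Util 'first';

# Hypothetical helper: for each file name, find the sample name it contains.
# Longer sample names are tried first so that a short name cannot shadow
# a longer one that happens to start with it.
sub match_samples {
    my ( $names, $samples ) = @_;
    my @by_length = sort { length($b) <=> length($a) } @$samples;
    my %match;
    for my $name (@$names) {
        $match{$name} = first { index( $name, $_ ) != -1 } @by_length;
    }
    return \%match;    # names without a match map to undef
}

# Example data taken from the question's directory listing and key file.
my $match = match_samples(
    [ 'MD-14-11856_RNA_v2', 'SU-16-16117_RNA_v2' ],
    [ 'SU-15-2542', 'MD-14-11856', 'SU-16-16117' ],
);
for my $name ( sort keys %$match ) {
    printf "%s => %s\n", $name, $match->{$name} // 'NO MATCH';
}
```

Because the helper takes both lists as arguments, each folder's key file is matched only against that folder's own file names.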

Perl , How to read subfolder Output

I am writing a script to read the contents of multiple sub-folders in a directory.
Recently I also needed to read the contents of a folder inside each of those sub-folders.
How can I write the code to read those nested folders?
This is the new layout:
Multiple Sub-folder -> Local folder -> fileAAA.csv
How do I read fileAAA.csv in the Local folder of each sub-folder?
The code I have now was written for the following layout, and it works well:
Multiple Sub-folder -> fileAAA.csv
It is able to read fileAAA.csv from each sub-folder.
Below is the code I use to read
Multiple Sub-folder -> fileAAA.csv
my ( $par_dir, $sub_dir );
opendir( $par_dir, "$parent" );
while ( my $sub_folders = readdir($par_dir) ) {
next if ( $sub_folders =~ /^\.\.?$/ ); # skip . and ..
my $path = $parent . '/' . $sub_folders;
next unless ( -d $path ); # skip anything that isn't a directory
opendir( $sub_dir, $path );
while ( my $file = readdir($sub_dir) ) {
next unless $file =~ /\.csv?$/i;
my $full_path = $path . '/' . $file;
print_file_names($full_path);
}
closedir($sub_dir);
$flag = 0;
}
closedir($par_dir);
......
Updated
You should look at the File::Find module which has everything already in place to do searches like this, and has taken account of all corner cases for you
I wrote that on my tablet and at the time I couldn't offer sample code to support it. I believe this will do what you're asking for, which is simply to find all CSV files at any level beneath a parent directory
use strict;
use warnings;
use File::Find qw/ find /;
STDOUT->autoflush;
my $parent = '/Multiple Sub-folder';
find(sub {
return unless -f and /\.csv$/i;
print_file_names($File::Find::name);
}, $parent);
sub print_file_names {
my ($fn) = @_;
print $fn, "\n";
}
Without using a module, try this.
Instead of opendir, you can use glob to search the subdirectories.
In the script below I wrote a subroutine that searches recursively.
When the elsif condition is satisfied, the directory's path is passed back to the find subroutine, which searches it in turn, and so on.
my $v = "/Multiple Sub-folder";
find($v);
sub find{
my ($s) = @_;
foreach my $ma (glob "$s/*")
{
if(-f $ma)
{
if($ma =~ m/\.csv$/) # Here search for csv files (the dot must be escaped).
{
print "$ma\n";
}
}
elsif(-d $ma)
{
find("$ma")
}
}
}
But you can also use the File::Find module to search for the files in the directory, as in Borodin's answer, which is the best approach.

Archive::Zip membersMatching can't locate method in Perl?

I have a script that uses Archive::Zip, and I want to use the method membersMatching, but I can't figure out what I'm missing.
I called the module at the beginning of the script:
use Archive::Zip qw( :ERROR_CODES :CONSTANTS :MISC_CONSTANTS );
and this is the block of code where the module is used:
while (my $file = readdir(TRIMMED_CELL_DIR)) {
#Only if file ends in _1.fastqc.zip (only 1 instance per "trimmed" subdirectory.)
if($file =~ /.*\_1\_fastqc\.zip/){
#Extract the file summary.txt and assign it to filehandle SUMMARY_R1.
$file = "${trimmedDirectory}/${file}";
print "Loading ZIP file: $file. \n";
my $zip = Archive::Zip->new($file);
my @txtFileMembers = $zip->membersMatching( '.*\.txt' );
foreach my $txtFile (@txtFileMembers){
extractMember($txtFile);
open(SUMMARY_R1,"< $txtFile");
}
}
I keep getting the error Can't locate object method "membersMatching". ... and I know it has something to do with this membersMatching method not being exported, but I don't know how to call it in the script. The CPAN page for Archive::Zip doesn't say anything except to use it like so:
membersMatching( $regex )
membersMatching( { regex => $regex } )
Return array of members whose filenames match given regular expression in list context. Returns number of matching members in
scalar context.
my @textFileMembers = $zip->membersMatching( '.*\.txt' );
# or
my $numberOfTextFiles = $zip->membersMatching( '.*\.txt' );
The ZIP file loading with the Archive::Zip->new($file) function works, so the module is being exported, just not the method membersMatching...
Check the path of your zip file ($file). I think it's failing there. Update your code to the below:
my $zip = Archive::Zip->new();
unless ( $zip->read( 'someZip.zip' ) == AZ_OK ) {
die 'read error';
}
print "zip contains the following files:\n";
print "$_\n" for $zip->memberNames();

Perl How to merge two or more excel files in one (multiple worksheets)?

I need to merge a few Excel files into one file with multiple sheets.
I do not care too much about the sheet names in the new file.
I do not have Excel on the computer I plan to run this on, so I cannot use Win32 OLE.
I attempted to run this code https://sites.google.com/site/mergingxlsfiles/ but it is not working: I get a new, empty Excel file.
I attempted to run http://www.perlmonks.org/?node_id=743574 but I only obtained one of the files in the new Excel file.
My input Excel files have some French characters (é, for example); I believe these are cp1252.
Code used :
#!/usr/bin/perl -w
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;
use File::Glob qw(bsd_glob);
use Getopt::Long;
use POSIX qw(strftime);
GetOptions(
'output|o=s' => \my $outfile,
'strftime|t' => \my $do_strftime,
) or die;
if ($do_strftime) {
$outfile = strftime $outfile, localtime;
};
my $output = Spreadsheet::WriteExcel->new($outfile)
or die "Couldn't create '$outfile': $!";
for (@ARGV) {
my ($filename,$sheetname,$targetname);
my @files;
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
warn $filename;
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
@files = glob $filename;
} else {
($filename,$sheetname,$targetname) = ($_,qr(.*),undef);
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
push @files, glob $filename;
};
for my $f (@files) {
my $excel = Spreadsheet::ParseExcel::Workbook->Parse($f);
foreach my $sheet (@{$excel->{Worksheet}}) {
if ($sheet->{Name} !~ /$sheetname/) {
warn "Skipping '" . $sheet->{Name} . "' (/$sheetname/)";
next;
};
$targetname ||= $sheet->{Name};
#warn sprintf "Copying %s to %s\n", $sheet->{Name}, $targetname;
my $s = $output->add_worksheet($targetname);
$sheet->{MaxRow} ||= $sheet->{MinRow};
foreach my $row ($sheet->{MinRow} .. $sheet->{MaxRow}) {
my @rowdata = map {
$sheet->{Cells}->[$row]->[$_]->{Val};
} $sheet->{MinCol} .. $sheet->{MaxCol};
$s->write($row,0,\@rowdata);
}
}
};
};
$output->close;
I have 2 excel files named: 2.xls (only 1 sheet named 2 in it), 3.xls (only 1 sheet named 3)
I launched the script as this:
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls:2 3.xls:3
Results: results-20121024.xls empty nothing in it.
Then I tried
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls 3.xls
And it worked.
I am not sure why it fails when I add the sheet name.
It appears that there is a bug in this line of the script:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
It looks to me like the goal of that line is to allow arguments either in the form
spreadsheet.xls:source_worksheet
or in another form allowing the name of the target sheet to be specified:
spreadsheet.xls:source_worksheet:target_worksheet
The last grouping appears intended to capture that last, optional argument: (?::([\w ]+)). The only problem is, this grouping was not made optional. Thus, when you only specify the source sheet and not the target, the regex fails to match and it falls to the backup behavior, which is to treat the whole argument as the filename. But this fails, too, because you don't have a file called 2.xls:2.
The solution would be to introduce the ? modifier after the last group in the regex to make it optional:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))?$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
Of course, that may not be the only problem. If the script was posted with an error, there could be other errors, too. I don't have Perl available to test it at the moment.
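To see the effect of the optional group in isolation, here is a small sketch that applies the corrected pattern to a few sample arguments. The helper name parse_sheet_arg and the sample argument '2.xls:2:Merged' are invented for illustration, and the pattern is written with m{...} delimiters instead of m!...!, which does not change what it matches.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the corrected pattern: split an argument of the
# form file.xls:sheet or file.xls:sheet:target into its parts.
# Returns an empty list when the argument has no :sheet suffix.
sub parse_sheet_arg {
    my ($arg) = @_;
    return $arg =~ m{^(.*\.xls):(.*?)(?::([\w ]+))?$}
        ? ( $1, $2, $3 )
        : ();
}

for my $arg ( '2.xls:2', '2.xls:2:Merged', '3.xls' ) {
    my ( $file, $sheet, $target ) = parse_sheet_arg($arg);
    if ( defined $file ) {
        printf "%s: file=%s sheet=%s target=%s\n",
            $arg, $file, $sheet, $target // '(same as source)';
    }
    else {
        print "$arg: no sheet given, treat the whole argument as a filename\n";
    }
}
```

With the group made optional, the two-part form matches and leaves the target undefined, which is the case the original pattern rejected.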

How to add one more node information to xml file

I wrote a script that creates one XML file from multiple files. The script looks like this:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
use Carp;
use File::Find;
use File::Spec::Functions qw( canonpath );
use XML::LibXML::Reader;
use Digest::MD5 'md5';
if ( @ARGV == 0 ) {
push #ARGV, "c:/main/work";
warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
}
open( my $allxml, '>', "all_xml_contents.combined.xml" )
or die "can't open output xml file for writing: $!\n";
print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
"\n<Shiporder xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n";
my %shipto_md5;
find(
sub {
return unless ( /(_stc\.xml)$/ and -f );
extract_information();
return;
},
@ARGV
);
print $allxml "</Shiporder>\n";
sub extract_information {
my $path = $_;
if ( my $reader = XML::LibXML::Reader->new( location => $path )) {
while ( $reader->nextElement( 'data' )) {
my $elem = $reader->readOuterXml();
my $md5 = md5( $elem );
print $allxml $reader->readOuterXml() unless ( $shipto_md5{$md5}++ );
}
}
return;
}
With the above script I extract the data node information from all the XML files and store it in a new XML file. But there is one more node, named "details", whose information I also need to extract and add to the file. I tried this:
$reader->nextElement( 'details' );
my $information = $reader->readOuterXml();
I added this inside the while loop, but how can I assign or print this data into the same file ($allxml)? Please help me with this problem.
After your suggestion I tried the following, but it gives an error:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
use Carp;
use File::Find;
use File::Spec::Functions qw( canonpath );
use XML::LibXML::Reader;
if ( @ARGV == 0 ) {
push #ARGV, "V:/main/work";
warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
}
my $libXML = new XML::LibXML;
my $outputDom = $libXML->parse_string('<?xml version="1.0" encoding="UTF-8"?
>','<Shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">');
my $shiporder = $outputDom->documentElement;
find(
sub {
return unless ( /(_stc\.xml)$/ and -f );
extract_information();
return;
},
@ARGV
);
sub extract_information {
my $path = $_;
if(my @inputDom = XML::LibXML->load_xml(location => $path)){
$inputDom->findnodes('//data || //deatils');
foreach (@$inputDom) {
$shiporder->appendChild($_->parentNode->cloneNode(1));
}
$outputDom->toFile("allfiles.xml");
}
}
but it gives an error like " '\n\n:1: Parser error: Start tag expected, '<' not found ". Can you help me with the script? I am very new to Perl.
You would do a lot better if you used what XML::LibXML and its related modules give you. It is a very large and comprehensive module that lets you do a lot in a few lines.
You can use the parser to start a new dom document using parse_string, storing the root node using documentElement. From there, use parse_file to load up each of your input files, then findnodes on the input files to extract the nodes you want to clone. Then append a clone of your input nodes to the output document, and finally use the toFile method to write out your output.
Something like:
my $libXML = XML::LibXML->new;
# parse_string takes a single string, and the root element must be closed
my $outputDom = $libXML->parse_string(
'<?xml version="1.0" encoding="UTF-8"?>' . "\n"
. '<Shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>' );
my $shiporder = $outputDom->documentElement;
...
my $inputDom = $libXML->parse_file($some_file_name);
# the XPath union operator is a single |; use a more suitable xpath
foreach my $node ( $inputDom->findnodes('//data | //details') ) {
$shiporder->appendChild( $node->parentNode->cloneNode(1) ); # if you want parent too...
}
...
$outputDom->toFile($some_output_file);
You will have to allow for namespaces and whatnot, but this gives one approach to start with.