Using a perl script to loop through files in directory - perl

I have a directory filled with several thousand .txt files. I need to run the same Perl script on each .txt file and, once the script has finished with a file, save that file's result under a unique name. Please forgive the basic question; I am learning Perl with this script for the first time.
I have seen other posts addressing this issue: How can I loop through files in a directory in Perl? and running loops through terminal: Take all files in dir and for each file do the same perl procedure.
A bit about my data: These are blastx subject sequence ID results
$ head file1.txt
GCN2_SCHPO
GCN2_YEAST
GCN20_YEAST
GCNK_GLUOX
$ head file2.txt
PDXA_RUEST
PDXA_SULSY
PDXA_SYNFM
PDXA_SYNY3
My Perl script uses UniProt's Retrieve/ID mapping service programmatically, instead of submitting thousands of requests manually (Retrieve/ID mapping):
use warnings;
use LWP::UserAgent;
@files = <*.txt>; # File containing list of UniProt IDs.
my $base = 'http://www.uniprot.org';
my $tool = 'uploadlists';
my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';
foreach $file (@files) {
my $response = $agent->post("$base/$tool/",
[ 'file' => [@files],
'format' => 'tab',
'from' => 'ACC+ID',
'to' => 'ACC',
'columns' => 'id,database(ko)',
],
'Content_Type' => 'form-data');
while (my $wait = $response->header('Retry-After')) {
print STDERR "Waiting ($wait)...\n";
sleep $wait;
$response = $agent->get($response->base);
}
$response->is_success ?
print $response->content :
die 'Failed, got ' . $response->status_line .
' for ' . $response->request->uri . "\n";
print $file . "\n";
}
Instead of looping through each .txt file, this script only grabs the first .txt file in my directory and performs this function over and over again on that one file. However, at the end, it prints the correct file name. Here is an example output:
Entry Cross-reference (ko) yourlist:M20170501A isomap:M201705
Q9HGN1 K16196; GCN2_SCHPO
P15442 K16196; GCN2_YEAST
P43535 K06158; GCN20_YEAST
Q5FQ97 K00851; GCNK_GLUOX
file1.txt
Entry Cross-reference (ko) yourlist:M20170501A isomap:M201705
Q9HGN1 K16196; GCN2_SCHPO
P15442 K16196; GCN2_YEAST
P43535 K06158; GCN20_YEAST
Q5FQ97 K00851; GCNK_GLUOX
file2.txt
I have tried to do this via terminal with the following loop as well:
for i in *; do perl script.pl $i $i.txt; done
and I get the same results.
I am missing something very simple and am asking for your wisdom on understanding why this loop is being loopy. Secondly, is there a way to code this (in the script or via terminal) to name each result of each .txt file differently?
Thank-you!

Your for loop foreach $file (@files) { ... } executes the block repeatedly, setting $file to each file name in turn. But inside the loop you try to pass all of the files at once, using the parameter 'file' => [@files].
LWP treats that list as a file path, a file name, and a number of header names and values, so the data uploaded always comes from the first file in @files.
The quick solution is to replace that line with file => [ $file ], and then it should work, but there are a few other issues with your code, so I've written this refactoring.
I'm not in a position to test this at present, but it does compile.
use strict;
use warnings 'all';
use LWP::UserAgent;
my @files = glob '*.txt'; # Files containing list of UniProt IDs.
my $base = 'http://www.uniprot.org';
my $tool = 'uploadlists';
my $contact = ''; # Please set your email address here
# to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
push @{$agent->requests_redirectable}, 'POST';
for my $file ( @files ) {
my $response = $agent->post(
"$base/$tool/",
Content_Type => 'form-data',
Content => [
file => [ $file ],
format => 'tab',
from => 'ACC+ID',
to => 'ACC',
columns => 'id,database(ko)',
],
);
while ( my $wait = $response->header('Retry-After') ) {
print STDERR "Waiting ($wait)...\n";
sleep $wait;
$response = $agent->get($response->base);
}
if ( $response->is_success ) {
print $response->content;
}
else {
die sprintf "Failed. Got %s for %s\n",
$response->status_line,
$response->request->uri;
}
}
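The refactoring above prints every response to STDOUT, so the second part of the question (a separate result for each .txt file) is still open. Here is a minimal sketch of one way to do it, assuming each result should be written next to its input with a .tab suffix. The helper names output_name_for and save_result, and the suffix itself, are illustrative choices and not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: derive a per-input output name,
# e.g. file1.txt -> file1.tab (the suffix is an assumption).
sub output_name_for {
    my ($file) = @_;
    ( my $out = $file ) =~ s/\.txt\z/.tab/
        or die "Unexpected file name: $file\n";
    return $out;
}

# Hypothetical helper: write one response body to its own file.
sub save_result {
    my ( $file, $content ) = @_;
    my $out = output_name_for($file);
    open my $fh, '>', $out or die "Cannot write $out: $!\n";
    print {$fh} $content;
    close $fh or die "Cannot close $out: $!\n";
    return $out;
}
```

Inside the loop, print $response->content; would then become save_result( $file, $response->content );.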

Related

HTML::TreeBuilder inside a loop

I'm trying to delete all table elements from several HTML files.
The following code runs perfectly on a single file, but when trying to automate the process it returns the error
can't call method "look_down" on an undefined value
Do you have any solution please?
Here is the code:
use strict;
use warnings;
use Path::Class;
use HTML::TreeBuilder;
opendir( DH, "C:/myfiles" );
my @files = readdir(DH);
closedir(DH);
foreach my $file ( @files ) {
print("Analyzing file $file\n");
my $tree = HTML::TreeBuilder->new->parse_file("C:/myfiles/$file");
foreach my $e ( $tree->look_down( _tag => "table" ) ) {
$e->delete();
}
use HTML::FormatText;
my $formatter = HTML::FormatText->new;
my $parsed = $formatter->format($tree);
print $parsed;
}
The problem is that you're feeding HTML::TreeBuilder all sorts of junk in addition to the HTML files that you intend. As well as any files in the opened directory, readdir returns the names of all subdirectories, including the pseudo-directories . and .. (the current and parent directories). You should have seen this in the output from your print statement:
print("Analyzing file $file\n");
One way to fix this is to check that each value in the loop is a file before processing it. Something like this
for my $file ( @files ) {
my $path = "C:/myfiles/$file";
next unless -f $path;
print("Analyzing file $file\n");
my $tree = HTML::TreeBuilder->new->parse_file($path);
for my $table ( $tree->look_down( _tag => 'table' ) ) {
$table->delete();
}
...;
}
But it would be much cleaner to use a call to glob. That way you will only get the files that you want, and there is also no need to build the full path to each file.
That would look something like this. You would have to adjust the glob pattern if your files don't all end with .html.
for my $path ( glob "C:/myfiles/*.html" ) {
print("Analyzing file $path\n");
my $tree = HTML::TreeBuilder->new->parse_file($path);
for my $table ( $tree->look_down( _tag => 'table' ) ) {
$table->delete();
}
...;
}
Strictly speaking, a directory name may also match *.html, and if you don't trust your file structure you should also test that each result of glob is a file before processing it. But in normal situations, where you know what's in the directory you're processing, that isn't necessary.
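If the directory may contain subdirectories whose names also match the pattern, the two ideas above can be combined. This is a minimal sketch, assuming a helper named html_files (hypothetical) that takes the directory as an argument. It uses bsd_glob from the core File::Glob module so that a directory name containing spaces is not split into several patterns:

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Hypothetical helper: return only plain files matching *.html in $dir.
# bsd_glob does the name matching; -f drops directories and other non-files.
sub html_files {
    my ($dir) = @_;
    return grep { -f $_ } bsd_glob("$dir/*.html");
}
```

The loop would then read for my $path ( html_files("C:/myfiles") ) { ... }.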

Archive::Zip membersMatching can't locate method in Perl?

I have a script that uses Archive::Zip, and I want to use the method membersMatching, but I can't figure out what I'm missing.
I called the module at the beginning of the script:
use Archive::Zip qw( :ERROR_CODES :CONSTANTS :MISC_CONSTANTS );
and this is the block of code where the module is used:
while (my $file = readdir(TRIMMED_CELL_DIR)) {
#Only if file ends in _1.fastqc.zip (only 1 instance per "trimmed" subdirectory.)
if($file =~ /.*\_1\_fastqc\.zip/){
#Extract the file summary.txt and assign it to filehandle SUMMARY_R1.
$file = "${trimmedDirectory}/${file}";
print "Loading ZIP file: $file. \n";
my $zip = Archive::Zip->new($file);
my @txtFileMembers = $zip->membersMatching( '.*\.txt' );
foreach my $txtFile (@txtFileMembers){
extractMember($txtFile);
open(SUMMARY_R1,"< $txtFile");
}
}
I keep getting the error Can't locate object method "membersMatching". ... and I know it has something to do with this membersMatching method not being exported, but I don't know how to call it in the script. The CPAN page for Archive::Zip doesn't say anything except to use it like so:
membersMatching( $regex )
membersMatching( { regex => $regex } )
Return array of members whose filenames match given regular expression in list context. Returns number of matching members in scalar context.
my @textFileMembers = $zip->membersMatching( '.*\.txt' );
# or
my $numberOfTextFiles = $zip->membersMatching( '.*\.txt' );
The ZIP file loading with the Archive::Zip->new($file) function works, so the module is being exported, just not the method membersMatching...
Check the path of your zip file ($file). I think it's failing there. Update your code to the below:
my $zip = Archive::Zip->new();
unless ( $zip->read( 'someZip.zip' ) == AZ_OK ) {
die 'read error';
}
print "zip contains the following files:\n";
print "$_\n" for $zip->memberNames();

Adding a .zip file to the body of an LWP::UserAgent POST request

I believe I have a general Perl problem, rather than an LWP::UserAgent problem... however, it's somewhat complex.
The task is to write a test-script that does a SWORD deposit.
I create tests by first writing code to prove the thing works, then add in the Test::More wrappers to make it a test.
BACKGROUND
A SWORD deposit is simply an HTTP POST request with a bunch of defined headers, and the content of the body being the thing to be ingested. This all works fine: I can perform the actions through curl, and I've written scripts to do this... but within a larger application environment (that'll be EPrints).
CODE
My problem, I believe, comes when I try to attach the contents of the file on the disk.
#!/home/cpan/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
##use WWW::Mechanize;
use File::Slurp;
use MIME::Base64;
my $auth = 'username:password';
my $domain = 'devel.example.com';
my $ua = LWP::UserAgent->new();
my $basedir = "./test_files";
my $package = 'http://opendepot.org/europePMC/2.0';
my $filename = "$basedir/PMC165035.zip";
my $mime = 'application/zip';
print "filename: $filename\n";
my $deposit_url = $domain . '/sword-app/deposit/archive';
my $file = read_file( $filename, { binmode => ':raw' } );
# Set up the SWORD deposit
my $autho = "Basic " . MIME::Base64::encode( $auth, '' );
my %headers = (
'X-Packaging' => $package,
'X-No-Op' => 'false',
'X-Verbose' => 'true',
'Content-Disposition' => "filename=$filename",
'Content-Type' => $mime,
'User-Agent' => 'Broker Test Harness',
'Authorization' => $autho,
);
my $r = $ua->post( $deposit_url, %headers, Content => $file );
# DEBUG TEST
write_file('foo.zip', $file);
my $ret = $r->decoded_content;
print "Content: $ret\n";
if ( $r->is_success ) { print "Deposited $package successfully" }
WHAT WORKS, WHAT DOESN'T
This code is lifted pretty much directly from working code I have - the only difference is that the working code gets the content for $file via an object-call within EPrints.
I know the file exists on the disk: if I do an ls -l on the filename printed, I can see the file, and it's readable.
In the code above, there is a line write_file('foo.zip', $file); - that writes a file which unzip -l foo.zip happily tells me has 3 files in it.
The line print "Content: $ret\n"; should print an atom response - for me, it prints nothing....
The Access log reports an error 500, but there's diddly-squat in the error-log.
The help
What I need to know is how I get the actual contents of the .zip file into the content part of the LWP::UserAgent post request...
(I'm going to spend much time now trying to dig into EPrints, to track where the error 500 is coming from and why nothing appears in the log file... but that's probably going to be down to an issue with what's been posted.)
The solution lies in realizing what LWP POST is doing.
my $filename = "$basedir/PMC165035.zip";
my $file = read_file( $filename, { binmode => ':raw' } );
my %headers = (
'X-Packaging' => $package,
'X-No-Op' => 'false',
'X-Verbose' => 'true',
'Content-Disposition' => "filename=$filename",
'Content-Type' => $mime,
'User-Agent' => 'Broker Test Harness',
'Authorization' => $autho,
);
All of this works by setting $filename to something like /home/services/foo/testing/test_files/PMC165035.zip and passing this full filename to the server example.com.
The problem is that the server expects a bare filename, not a filename with a path. The service dumps the content into its temporary upload location and then looks for ~~~temp_location/home/services/foo/testing/test_files/PMC165035.zip, which it can't find!
The solution is to read in the file as before, but to make sure the filename given in the headers is just the filename, without a path.
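A minimal sketch of that last step, using basename from the core File::Basename module. The helper name content_disposition_for is hypothetical; the header format simply mirrors the one used in the question:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: build the Content-Disposition value from a path,
# keeping only the final path component.
sub content_disposition_for {
    my ($path) = @_;
    return 'filename=' . basename($path);
}
```

In the %headers hash, the entry would then become 'Content-Disposition' => content_disposition_for($filename), while read_file still receives the full path.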

Perl reading zip files with IO::Uncompress::AnyUncompress

We are moving from our current build system (which is a mess) to one that uses Ant with Ivy. I'm cleaning up all the build files, and finding the jar dependencies. I thought it might be easier if I could automate it a bit, by going through the jars that are checked into the project, finding what classes they contain, then matching those classes with the various import statements in the Java code.
I have used Archive::Tar before, but Archive::Zip isn't a standard Perl module. (My concern is that someone is going to try my script, call me in the middle of the night and tell me it isn't working.)
I noticed that IO::Uncompress::AnyUncompress is a standard module, so I thought I could try IO::Uncompress::AnyUncompress, or at least IO::Uncompress::Unzip, which is also a standard module.
Unfortunately, the documentation for these modules give no examples (According to the documentation, examples are a todo).
I'm able to successfully open my jar and create an object:
my $zip_obj = IO::Uncompress::AnyUncompress->new ( $zip_file );
Now, I want to see the contents. According to the documentation:
getHeaderInfo
Usage is
$hdr = $z->getHeaderInfo();
@hdrs = $z->getHeaderInfo();
This method returns either a hash reference (in scalar context) or a list of hash references (in array context) that contains information about each of the header fields in the compressed data stream(s).
Okay, this isn't an object like Archive::Tar or Archive::Zip returns, and there are no methods or subroutines mentioned to parse the data. I'll use Data::Dumper and see what hash keys are contained in the reference.
Here's a simple test program:
#! /usr/bin/env perl
use 5.12.0;
use warnings;
use IO::Uncompress::AnyUncompress;
use Data::Dumper;
my $obj = IO::Uncompress::AnyUncompress->new("testng.jar")
or die qq(You're an utter failure);
say qq(Dump of \$obj = ) . Dumper $obj;
my @headers = $obj->getHeaderInfo;
say qq(Dump of \$header = ) . Dumper $headers[0];
And here's my results:
Dump of $obj = $VAR1 = bless( \*Symbol::GEN0, 'IO::Uncompress::Unzip' );
Dump of $header = $VAR1 = {
'UncompressedLength' => 0,
'Zip64' => 0,
'MethodName' => 'Stored',
'Stream' => 0,
'Time' => 1181224440,
'MethodID' => 0,
'CRC32' => 0,
'HeaderLength' => 43,
'ExtraFieldRaw' => '¦- ',
'ExtraField' => [
[
'¦-',
''
]
],
'FingerprintLength' => 4,
'Type' => 'zip',
'TrailerLength' => 0,
'CompressedLength' => 0,
'Name' => 'META-INF/',
'Header' => 'PK
+N¦6 META-INF/¦- '
};
Some of that looks sort of useful. However, all of my entries return 'Name' => 'META-INF/', so it doesn't look like a file name.
Is it possible to use IO::Uncompress::AnyUncompress (or even IO::Uncompress::Unzip) to read through the archive and see what files it contains? And, if so, how do I parse that header?
Otherwise, I'll have to go with Archive::Zip and let people know they have to download and install it from CPAN on their systems.
The files in the archive are compressed in different data streams, so you need to iterate through the streams to get the individual files.
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);
my $zipfile = 'zipfile.zip';
my $u = IO::Uncompress::Unzip->new($zipfile)
or die "Cannot open $zipfile: $UnzipError";
die "Zipfile has no members"
if ! defined $u->getHeaderInfo;
for (my $status = 1; $status > 0; $status = $u->nextStream) {
my $name = $u->getHeaderInfo->{Name};
warn "Processing member $name\n" ;
if ($name =~ /\/$/) {
mkdir $name;
}
else {
unzip $zipfile => $name, Name => $name
or die "unzip failed: $UnzipError\n";
}
}
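For the original goal (finding out which classes a jar contains) the members do not need to be extracted at all. The following is a minimal sketch that only collects the names, following the same nextStream loop; the helper name zip_member_names is hypothetical:

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Hypothetical helper: return the member names of a zip/jar archive
# without writing anything to disk.
sub zip_member_names {
    my ($zipfile) = @_;
    my $u = IO::Uncompress::Unzip->new($zipfile)
        or die "Cannot open $zipfile: $UnzipError\n";
    return () unless defined $u->getHeaderInfo;

    my @names;
    my $status;
    for ( $status = 1; $status > 0; $status = $u->nextStream ) {
        push @names, $u->getHeaderInfo->{Name};

        # Read and discard the member's data before moving on,
        # as in the module's documented iteration idiom.
        my $buff;
        while ( ( $status = $u->read($buff) ) > 0 ) { }
        last if $status < 0;
    }
    die "Error reading $zipfile: $UnzipError\n" if $status < 0;
    return @names;
}
```

Calling zip_member_names("testng.jar") should then return entries such as META-INF/ followed by the class files, which can be filtered with grep { /\.class\z/ }.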

Perl How to merge two or more excel files in one (multiple worksheets)?

I need to merge a few Excel files into one file with multiple sheets.
I do not care too much about the sheet names in the new file.
I do not have Excel on the computer I plan to run this on, so I cannot use Win32::OLE.
I tried the code at https://sites.google.com/site/mergingxlsfiles/ but it does not work: I get a new, empty Excel file.
I also tried http://www.perlmonks.org/?node_id=743574 but only one of the files ended up in the new Excel file.
My input Excel files contain some French characters (é, for example); I believe they are encoded as cp1252.
Code used :
#!/usr/bin/perl -w
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::WriteExcel;
use File::Glob qw(bsd_glob);
use Getopt::Long;
use POSIX qw(strftime);
GetOptions(
'output|o=s' => \my $outfile,
'strftime|t' => \my $do_strftime,
) or die;
if ($do_strftime) {
$outfile = strftime $outfile, localtime;
};
my $output = Spreadsheet::WriteExcel->new($outfile)
or die "Couldn't create '$outfile': $!";
for (@ARGV) {
my ($filename,$sheetname,$targetname);
my @files;
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
warn $filename;
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
@files = glob $filename;
} else {
($filename,$sheetname,$targetname) = ($_,qr(.*),undef);
if ($do_strftime) {
$filename = strftime $filename, localtime;
};
push @files, glob $filename;
};
for my $f (@files) {
my $excel = Spreadsheet::ParseExcel::Workbook->Parse($f);
foreach my $sheet (@{$excel->{Worksheet}}) {
if ($sheet->{Name} !~ /$sheetname/) {
warn "Skipping '" . $sheet->{Name} . "' (/$sheetname/)";
next;
};
$targetname ||= $sheet->{Name};
#warn sprintf "Copying %s to %s\n", $sheet->{Name}, $targetname;
my $s = $output->add_worksheet($targetname);
$sheet->{MaxRow} ||= $sheet->{MinRow};
foreach my $row ($sheet->{MinRow} .. $sheet->{MaxRow}) {
my @rowdata = map {
$sheet->{Cells}->[$row]->[$_]->{Val};
} $sheet->{MinCol} .. $sheet->{MaxCol};
$s->write($row,0,\@rowdata);
}
}
};
};
$output->close;
I have two Excel files: 2.xls (with a single sheet named 2) and 3.xls (with a single sheet named 3).
I launched the script like this:
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls:2 3.xls:3
Result: results-20121024.xls is empty.
Then I tried
xlsmerge.pl -s -o results-%Y%m%d.xls 2.xls 3.xls
and it worked.
I am not sure why it fails when I add the sheet name.
It appears that there is a bug in this line of the script:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
It looks to me like the goal of that line is to allow arguments either in the form
spreadsheet.xls:source_worksheet
or in another form allowing the name of the target sheet to be specified:
spreadsheet.xls:source_worksheet:target_worksheet
The last grouping appears intended to capture that last, optional argument: (?::([\w ]+)). The only problem is, this grouping was not made optional. Thus, when you only specify the source sheet and not the target, the regex fails to match and it falls to the backup behavior, which is to treat the whole argument as the filename. But this fails, too, because you don't have a file called 2.xls:2.
The solution would be to add the ? quantifier after the last group in the regex to make it optional:
if (m!^(.*\.xls):(.*?)(?::([\w ]+))?$!) {
($filename,$sheetname,$targetname) = ($1,qr($2),$3);
...
Of course, that may not be the only problem. If the script was posted with an error, there could be other errors, too. I don't have Perl available to test it at the moment.
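The effect of the optional group can be checked in isolation, without any spreadsheet modules. This is a small sketch that wraps the corrected pattern in a helper (the name parse_sheet_arg and the sample target name Merged are hypothetical) and prints what each argument form captures:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the corrected pattern: returns
# (filename, source sheet, target sheet), or an empty list if no match.
sub parse_sheet_arg {
    my ($arg) = @_;
    return $arg =~ m!^(.*\.xls):(.*?)(?::([\w ]+))?$!
        ? ( $1, $2, $3 )
        : ();
}

for my $arg ( '2.xls:2', '2.xls:2:Merged', '2.xls' ) {
    my ( $file, $source, $target ) = parse_sheet_arg($arg);
    printf "%-15s file=%s source=%s target=%s\n", $arg,
        map { defined $_ ? $_ : '(none)' } $file, $source, $target;
}
```

With the original pattern (no trailing ?), the first argument form would not match and would fall through to the branch that treats the whole string as a file name.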