I am trying to get an XML file from a database using WWW::Mechanize. I know that the file is quite big (bigger than my memory), and the script constantly crashes whether I try to view the file in the browser or store it in a file using get(). I am planning to use XML::Twig in the future, but I cannot even store the result in a file.
Does anyone know how to split the Mechanize response into small chunks, fetch them one after another, and append each to a file without running out of memory?
Here is the query API: ArrayExpress Programmatic Access.
Thank you.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $base = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments';
#Parameters
my $query ='?species="homo sapiens"' ;
my $url = $base . $query;
# Create a new mechanize object
my $mech = WWW::Mechanize->new(stack_depth=>0);
# Associate the mechanize object with a URL
$mech->get($url);
#store xml content
my $content = $mech->content;
#open output file for writing
unlink("ArrayExpress_Human_Final.txt");
open( my $fh, '>>:encoding(UTF-8)','ArrayExpress_Human_Final.txt') || die "Can't open file: $!\n";
print $fh $content;
close $fh;
Sounds like what you want to do is save the file directly to disk, rather than loading it into memory.
From the Mech FAQ question "How do I save an image? How do I save a large tarball?"
You can also save any content directly to disk using the :content_file flag to get(), which is part of LWP::UserAgent.
$mech->get( 'http://www.cpan.org/src/stable.tar.gz',
':content_file' => 'stable.tar.gz' );
Also note that if all you're doing is downloading the file, it may not even make sense to use WWW::Mechanize; you can use the underlying LWP::UserAgent directly.
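As a rough sketch of that suggestion, the following uses LWP::UserAgent directly and streams the response body to disk. The helper name fetch_to_file and the output file name are made up for illustration; the URL in the usage comment is the one from the question.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: stream the body of $url straight into $path.
# With ':content_file' the body is written to disk as it arrives,
# so it is never held in memory as one big string.
sub fetch_to_file {
    my ($url, $path) = @_;
    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get($url, ':content_file' => $path);
    die "Download failed: ", $res->status_line, "\n" unless $res->is_success;
    # If writing was interrupted, LWP reports it in the X-Died header.
    if (my $err = $res->header('X-Died')) {
        die "Download aborted: $err\n";
    }
    return $res;
}

# Usage (output file name is an assumption):
# fetch_to_file(
#     'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments?species="homo sapiens"',
#     'ArrayExpress_Human_Final.xml',
# );
```

Once the file is on disk, a streaming parser such as XML::Twig can work through it without loading everything at once.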
I've started Perl recently and mixed quite a bit of things to get what I want.
My script gets the content of a webpage, writes it to a file.
Then I open a filehandle on report.html (sorry, I'm not English, I don't know how to say it better) and parse it.
I write every line I encounter to a new file, except lines containing a specific color.
It works, but I'd like to try another way which doesn't require me to create a "report.html" temporary file.
Furthermore, I'd like to print my result directly in a file, I don't want to have to use a system redirection '>'. That'd mean my script has to be called by another .sh script, and I don't want that.
use strict;
use warnings;
use LWP::Simple;
my $report = "report.html";
my $status = getstore('http://test/report.php', $report);
die "Unable to get page (HTTP status $status)\n" unless is_success($status);
open my $fh2, "<$report" or die("could not open report file : $!\n");
while (<$fh2>)
{
print if (!(/<td style="background-color:#71B53A;"/ .. //));
}
close($fh2);
Thanks for your help
If you have got the html content into a variable, you can use a open call on this variable. Like:
my $var = "your html content\ncomes here\nstored into this variable";
open my $fh, '<', \$var;
# .. just do the things you like to $fh
You can try the get function in the LWP::Simple module ;)
To your second question, open the output file for writing, e.g. open my $fh, '>', $filepath, and print to $fh. See perldoc -f open for more info.
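Putting the two answers together, here is a minimal sketch that reads the page from a string through an in-memory filehandle and writes the filtered lines straight to an output file. It assumes, going by the description in the question, that every line containing the colour should be dropped; the helper name filter_to_file, the sample HTML and the output file name are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $html to $out_path line by line,
# skipping every line that contains $marker.
sub filter_to_file {
    my ($html, $marker, $out_path) = @_;

    # Read from the string as if it were a file.
    open my $in, '<', \$html or die "Cannot open in-memory handle: $!\n";
    # Write the result directly, so no shell redirection is needed.
    open my $out, '>', $out_path or die "Cannot open $out_path: $!\n";

    while (my $line = <$in>) {
        next if index($line, $marker) >= 0;   # drop lines with the marker
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $out_path: $!\n";
    return;
}

# In the real script $html would come from LWP::Simple's get($url).
my $html = join "\n",
    '<tr><td>kept</td></tr>',
    '<tr><td style="background-color:#71B53A;">dropped</td></tr>',
    '<tr><td>also kept</td></tr>',
    '';
filter_to_file($html, 'background-color:#71B53A;', 'filtered_report.html');
```

No temporary report.html is created, and the script does not depend on a calling shell script for its output.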
Firstly, I'd like to apologize: I'm new to Perl, and my question is so basic that I am almost sure it has been asked before, but sadly I couldn't find it.
I'd like to parse an Internet page like I parse a text file with the open my $file, "<", "..". That is, I'd like to use a loop: while (my $line = <$file>). Sadly I couldn't find a way to do that; only using LWP::UserAgent with some get's and content, but that gives me the whole Internet page. I could make an array out of it by splitting it with respect to \n, but I really want to use the convenience of <$file>.
What can I do?
Thank you very much, and sorry again if it has been asked before.
Here is one way to do this:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://search.cpan.org/');
# assume a successful response
open(my $fh, "<", \$response->decoded_content);
while (<$fh>) {
print "line $. has ", length($_), " characters\n";
}
# $fh will close when it goes out of scope.
My web app runs on Apache mod_perl using CGI::Application. I want to provide a download of a generated file. In the past (before we were using mod_perl and CGI::App) I just spooled out a csv file to STDOUT as it was generated. Now I'm shooting for a little more refinement - creating an Excel spreadsheet using Spreadsheet::WriteExcel - and I can't seem to get it to print out from the file handle.
sub export_list {
my $self = shift;
binmode(STDOUT);
my $str;
open my $fh, '>', \$str;
my $workbook = Spreadsheet::WriteExcel->new($fh);
my $worksheet = $workbook->add_worksheet();
$worksheet->write_col(0,0, ['some','data','here']);
warn $str;
return $str;
}
The output is just a blank response, and the warn is blank as well.
The method I'm using to write the spreadsheet to a filehandle is pretty much straight out of the documentation, so I assume the problem is due to some CGI::App noobery on my part. The documentation's suggested methods for filehandles and mod_perl proved pretty fruitless as well.
I guess I should mention I'm running on Windows, and that my current workaround is to create a file and provide the user with a link to it. That poses more problems, however, in regards to clearing out the directory and when to do so, and also authentication for access to the generated files.
Suggestions? Scathing criticism?
You shouldn't need to mess with STDOUT; CGI-App should handle that properly for you under the hood. You may also need to close the filehandle before you try to send the data.
It looks like you're not setting a proper content type for the Excel data, though. For anything other than text/html, you'll need to set it manually. Try something like this:
sub export_list {
my $self = shift;
my $str;
open my $fh, '>', \$str or die "Can't open to var: $!";
my $workbook = Spreadsheet::WriteExcel->new($fh);
my $worksheet = $workbook->add_worksheet();
$worksheet->write_col(0,0, ['some','data','here']);
$workbook->close;
close $fh;
warn $str;
$self->header_add( -type => 'application/vnd.ms-excel' );
return $str;
}
You may also be interested in CGI::Application::Plugin::Stream
Instead of creating the whole spreadsheet in memory, you should either write it out to a file and then stream it when finished, or print it as you create it. For the first option CGI::Application::Plugin::Stream helps, but you'd still need to clean up the file afterwards (really, every web app should have a temp directory that gets cleaned up periodically). The second option means making the filehandle STDOUT instead, which might be trickier under mod_perl, or maybe not.
And then remember to close your workbook when it's done.
You want to close the workbook. Also close the filehandle:
warn "length 1=".length($str);
$workbook->close();
close($fh) or die "error on close: $!";
warn "length 2=".length($str);
length 1=0 at wx.pl line 16.
length 2=5632 at wx.pl line 19.
Please note - I am not looking for the "right" way to open/read a file, or the way I should open/read a file every single time. I am just interested to find out what way most people use, and maybe learn a few new methods at the same time :)
A very common block of code in my Perl programs is opening a file and reading or writing to it. I have seen so many ways of doing this, and my style on performing this task has changed over the years a few times. I'm just wondering what the best (if there is a best way) method is to do this?
I used to open a file like this:
my $input_file = "/path/to/my/file";
open INPUT_FILE, "<$input_file" || die "Can't open $input_file: $!\n";
But I think that has problems with error trapping.
Adding parentheses seems to fix the error trapping:
open (INPUT_FILE, "<$input_file") || die "Can't open $input_file: $!\n";
I know you can also assign a filehandle to a variable, so instead of using "INPUT_FILE" like I did above, I could have used $input_filehandle - is that way better?
For reading a file, if it is small, is there anything wrong with globbing, like this?
my @array = <INPUT_FILE>;
or
my $file_contents = join( "\n", <INPUT_FILE> );
or should you always loop through, like this:
my @array;
while (<INPUT_FILE>) {
push(@array, $_);
}
I know there are so many ways to accomplish things in perl, I'm just wondering if there are preferred/standard methods of opening and reading in a file?
There are no universal standards, but there are reasons to prefer one or another. My preferred form is this:
open( my $input_fh, "<", $input_file ) || die "Can't open $input_file: $!";
The reasons are:
You report errors immediately. (Replace "die" with "warn" if that's what you want.)
Your filehandle is now reference-counted, so once you're not using it it will be automatically closed. If you use the global name INPUT_FILEHANDLE, then you have to close the file manually or it will stay open until the program exits.
The read-mode indicator "<" is separated from the $input_file, increasing readability.
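The second reason can be seen in a small sketch: the block below writes through a lexical filehandle and never calls close, yet the data is on disk once the block ends. The temporary file and the variable names are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file so the sketch does not touch real data.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

{
    open(my $out_fh, '>', $path) || die "Can't open $path: $!";
    print {$out_fh} "first line\n";
    # No explicit close: $out_fh goes out of scope at the closing brace,
    # which flushes and closes the file.
}

open(my $in_fh, '<', $path) || die "Can't open $path: $!";
my $first = <$in_fh>;
close $in_fh;

print "Read back: $first";
```

An implicit close cannot report errors, so an explicit, checked close is still worth having when a failed write matters.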
The following is great if the file is small and you know you want all lines:
my @lines = <$input_fh>;
You can even do this, if you need to process all lines as a single string:
my $text = join('', <$input_fh>);
For long files you will want to iterate over lines with while, or use read.
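A minimal sketch of both options, using a small temporary file as a stand-in for a long one. The 4096-byte block size and the counters are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file; a real input would be much larger.
my ($tmp_fh, $input_file) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 5;
close $tmp_fh or die "Can't close $input_file: $!";

# Variant 1: one line at a time with while.
open(my $input_fh, '<', $input_file) || die "Can't open $input_file: $!";
my $line_count = 0;
while (my $line = <$input_fh>) {
    $line_count++;          # process $line here
}
close $input_fh;

# Variant 2: fixed-size blocks with read (block size is arbitrary).
open($input_fh, '<', $input_file) || die "Can't open $input_file: $!";
binmode $input_fh;
my ($byte_count, $buffer) = (0, '');
while (my $got = read($input_fh, $buffer, 4096)) {
    $byte_count += $got;    # process $buffer here
}
close $input_fh;

print "$line_count lines, $byte_count bytes\n";
```

Either way, only one line or one block is held in memory at a time.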
If you want the entire file as a single string, there's no need to iterate through it.
use strict;
use warnings;
use Carp;
use English qw( -no_match_vars );
my $data = q{};
{
local $RS = undef; # With no record separator, <$fh> reads the whole file at once.
my $fh;
croak "Can't open $input_file: $!\n" if not open $fh, '<', $input_file;
$data = <$fh>;
croak 'Some Error During Close :/ ' if not close $fh;
}
The above satisfies perlcritic --brutal, which is a good way to test for 'best practices' :). $input_file is still undefined here, but the rest is kosher.
Having to write 'or die' everywhere drives me nuts. My preferred way to open a file looks like this:
use autodie;
open(my $image_fh, '<', $filename);
While that's very little typing, there are a lot of important things to note which are going on:
We're using the autodie pragma, which means that all of Perl's built-ins will throw an exception if something goes wrong. It eliminates the need for writing or die ... in your code, it produces friendly, human-readable error messages, and has lexical scope. It's available from the CPAN.
We're using the three-argument version of open. It means that even if we have a funny filename containing characters such as <, > or |, Perl will still do the right thing. In my Perl Security tutorial at OSCON I showed a number of ways to get 2-argument open to misbehave. The notes for this tutorial are available for free download from Perl Training Australia.
We're using a scalar file handle. This means that we're not going to be coincidently closing someone else's file handle of the same name, which can happen if we use package file handles. It also means strict can spot typos, and that our file handle will be cleaned up automatically if it goes out of scope.
We're using a meaningful file handle. In this case it looks like we're going to read from an image.
The file handle ends with _fh. If we see it being used like a regular scalar, we know that it's probably a mistake.
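To illustrate the first point, here is a small sketch of what a failed open looks like under autodie and how the exception can be inspected. The path is deliberately nonexistent and the variable names are assumptions; isa, matches and errno are the autodie::exception methods used.

```perl
use strict;
use warnings;
use autodie;

my $filename = '/no/such/dir/image.png';   # deliberately missing

# With autodie, a failing open throws instead of returning false.
my $ok = eval {
    open(my $image_fh, '<', $filename);
    close($image_fh);
    1;
};

my $message = '';
if (!$ok) {
    my $err = $@;
    # autodie exceptions are objects; matches('open') tests which built-in failed.
    if (ref $err && $err->isa('autodie::exception') && $err->matches('open')) {
        $message = "open failed: " . $err->errno;
    }
    else {
        die $err;   # not an open failure, re-throw
    }
}
print "$message\n";
```

Without the eval, the program would simply stop with autodie's own error message, which is often all a short script needs.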
If your files are small enough that reading the whole thing into memory is feasible, use File::Slurp. It reads and writes full files with a very simple API, plus it does all the error checking so you don't have to.
There is no best way to open and read a file. It's the wrong question to ask. What's in the file? How much data do you need at any point? Do you need all of the data at once? What do you need to do with the data? You need to figure those out before you think about how you need to open and read the file.
Is anything that you are doing now causing you problems? If not, don't you have better problems to solve? :)
Most of your question is merely syntax, and that's all answered in the Perl documentation (especially perlopentut). You might also like to pick up Learning Perl, which answers most of the problems you have in your question.
Good luck, :)
It's true that there are as many best ways to open a file in Perl as there are
$files_in_the_known_universe * $perl_programmers
...but it's still interesting to see who usually does it which way. My preferred form of slurping (reading the whole file at once) is:
use strict;
use warnings;
use IO::File;
my $file = shift @ARGV or die "what file?";
my $fh = IO::File->new( $file, '<' ) or die "$file: $!";
my $data = do { local $/; <$fh> };
$fh->close();
# If you didn't just run out of memory, you have:
printf "%d characters (possibly bytes)\n", length($data);
And when going line-by-line:
my $fh = IO::File->new( $file, '<' ) or die "$file: $!";
while ( my $line = <$fh> ) {
print "Better than cat: $line";
}
$fh->close();
Caveat lector of course: these are just the approaches I've committed to muscle memory for everyday work, and they may be radically unsuited to the problem you're trying to solve.
I once used the
open (FILEIN, "<", $inputfile) or die "...";
my @FileContents = <FILEIN>;
close FILEIN;
boilerplate regularly. Nowadays, I use File::Slurp for small files that I want to hold completely in memory, and Tie::File for big files that I want to scalably address and/or files that I want to change in place.
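For the Tie::File part, a minimal sketch of changing a file in place through a tied array. The temporary file and its three sample lines are invented for the example.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "alpha\nbeta\ngamma\n";
close $tmp_fh or die "Can't close $path: $!";

# Each array element is one line of the file (without its line ending).
tie my @lines, 'Tie::File', $path or die "Can't tie $path: $!";

my $count = scalar @lines;
$lines[1] = uc $lines[1];   # rewrites the second line on disk
push @lines, 'delta';       # appends a new line to the file

untie @lines;

# Read the file back to show the changes were made in place.
open(my $fh, '<', $path) or die "Can't open $path: $!";
my $text = do { local $/; <$fh> };
close $fh;
print $text;
```

Tie::File does not load the whole file into memory, which is why it stays usable for big files.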
For OO, I like:
use FileHandle;
...
my $handle = FileHandle->new( "< $file_to_read" );
croak( "Could not open '$file_to_read'" ) unless $handle;
...
my $line1 = <$handle>;
my $line2 = $handle->getline;
my @lines = $handle->getlines;
$handle->close;
Read the entire file $file into variable $text with a single line
$text = do {local(@ARGV, $/) = $file ; <>};
or as a function
$text = load_file($file);
sub load_file {local(@ARGV, $/) = @_; <>}
If these programs are just for your productivity, whatever works! Build in as much error handling as you think you need.
Reading in a whole file if it's large may not be the best way long-term to do things, so you may want to process lines as they come in rather than load them up in an array.
One tip I got from one of the chapters in The Pragmatic Programmer (Hunt & Thomas) is that you might want to have the script save a backup of the file for you before it goes to work slicing and dicing.
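One way to follow that tip, sketched with the core File::Copy module. The helper name backup_file and the .bak suffix are assumptions, not something taken from the book.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Hypothetical helper: copy $path to "$path.bak" before it is modified.
sub backup_file {
    my ($path) = @_;
    my $backup = "$path.bak";
    copy($path, $backup) or die "Can't back up $path to $backup: $!\n";
    return $backup;
}

# Demonstration on a throwaway file.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/data.txt";
open(my $out_fh, '>', $file) or die "Can't open $file: $!";
print {$out_fh} "original contents\n";
close $out_fh or die "Can't close $file: $!";

my $backup = backup_file($file);
print "Backup written to $backup\n";
```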
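The || operator has higher precedence than the open list operator, so "<$input_file" || die ... is evaluated first and only its result is passed to open. Since the filename string is always true, the die never runs. In the code you've mentioned, use the "or" operator instead, and you won't have that problem.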
open INPUT_FILE, "<$input_file"
or die "Can't open $input_file: $!\n";
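The difference can be checked with a small sketch. It uses three-argument open with a lexical handle rather than the bareword form above, and a deliberately nonexistent path; both are choices made for the demonstration.

```perl
use strict;
use warnings;

my $missing = '/no/such/dir/no-such-file.txt';

# '||' binds to the last argument: open(..., ($missing || die ...)).
# $missing is a true string, so die never runs and the failure goes unnoticed.
my $died_with_pipes = !eval {
    open my $fh, '<', $missing || die "Can't open $missing\n";
    1;
};

# 'or' has lower precedence than the whole open call, so die sees open's result.
my $died_with_or = !eval {
    open my $fh, '<', $missing or die "Can't open $missing: $!\n";
    1;
};

printf "with ||: %s\n", $died_with_pipes ? 'died' : 'failure went unnoticed';
printf "with or: %s\n", $died_with_or    ? 'died' : 'failure went unnoticed';
```

Wrapping open's arguments in parentheses, as in the question's second attempt, makes || work too, because the parentheses end the argument list before || is reached.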
Damian Conway does it this way:
$data = readline!open(!((*{!$_},$/)=\$_)) for "filename";
But I don't recommend that to you.