Perl - Mechanize object size too big

I am trying to fetch an XML file from a database using WWW::Mechanize. I know the file is quite big (bigger than my memory), and it constantly crashes, whether I try to view it in the browser or store it in a file using get(). I plan to use XML::Twig later, but I cannot even store the result in a file.
Does anyone know how to fetch the response in small chunks, one after another, and append them to a file, so that I never run out of memory?
Here is the query API: ArrayExpress Programmatic Access.
Thank you.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $base = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments';
#Parameters
my $query ='?species="homo sapiens"' ;
my $url = $base . $query;
# Create a new mechanize object
my $mech = WWW::Mechanize->new(stack_depth=>0);
# Associate the mechanize object with a URL
$mech->get($url);
#store xml content
my $content = $mech->content;
#open output file for writing
unlink("ArrayExpress_Human_Final.txt");
open( my $fh, '>>:encoding(UTF-8)', 'ArrayExpress_Human_Final.txt' ) or die "Can't open file: $!\n";
print $fh $content;
close $fh;

Sounds like what you want to do is save the file directly to disk, rather than loading it into memory.
From the Mech FAQ question "How do I save an image? How do I save a large tarball?"
You can also save any content directly to disk using the :content_file flag to get(), which is part of LWP::UserAgent.
$mech->get( 'http://www.cpan.org/src/stable.tar.gz',
':content_file' => 'stable.tar.gz' );
Also note that if all you're doing is downloading the file, it may not even make sense to use WWW::Mechanize; you could use the underlying LWP::UserAgent directly.
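Applied to the original script, a minimal sketch might look like the following; the URL and output filename are kept from the question, and :content_file streams the response body straight to disk instead of holding it in memory:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $base  = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments';
my $query = '?species="homo sapiens"';
my $url   = $base . $query;

my $mech = WWW::Mechanize->new( stack_depth => 0 );

# Stream the response body directly to the output file
$mech->get( $url, ':content_file' => 'ArrayExpress_Human_Final.txt' );

die 'Download failed: ', $mech->status(), "\n" unless $mech->success();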


Perl: Open a file from a URL

I wanted to know how to open a file from a URL rather than a local file, and I found the following answer on another thread:
use LWP::Simple;
use IO::String;
my $handle = IO::String->new(get("google.com"));
my @lines = <$handle>;
close $handle;
This works perfectly... on my PC...
But when I transferred the code to my hosted server, it complains that it can't find the IO::String module. So is there another way to open a file from a URL that doesn't require any external modules (or uses one that is pretty much installed on every server)?
You can install PerlIO::http, which gives you an input layer for opening a filehandle from a URL via open. It is not included in the Perl core, but it works with Perls as old as 5.8.9.
Once you've installed it, all you need to do is open with the :http layer in the mode argument. There is nothing to use here; the layer is loaded automatically.
open my $fh, '<:http', 'https://metacpan.org/recent';
You can then read from $fh like a regular file. Under the hood it will take care of getting the data over the wire.
while (my $line = <$fh>) { ... }
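Putting the two snippets together, a minimal self-contained sketch (reusing the example URL from above) might look like this:
use strict;
use warnings;

# PerlIO::http only needs to be installed; the :http layer is loaded automatically
open my $fh, '<:http', 'https://metacpan.org/recent'
    or die "Could not open URL: $!";

while (my $line = <$fh>) {
    print $line;
}

close $fh;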
There is no way to "open a file from a URL" as you ask. Well, I suppose you could throw something together using the progress() callback from LWP::UserAgent, but even then I don't think it would work how you want it to.
But you can make something that looks like it's doing what you want pretty easily. Actually, what we're really doing is pulling all the data back from the URL and then opening a filehandle on a string that contains that data.
use LWP::Simple;
my $data = get('https://google.com');
open my $url_fh, '<', \$data or die $!;
# Now $url_fh is a filehandle wrapped around your data.
# Treat it like any other filehandle.
while (<$url_fh>) {
    print;
}
Your problem was that IO::String wasn't installed. But there's no need to install it, as it's simple enough to do what it does with standard Perl features (simply open a filehandle on a reference to a string).
Update: IO::String is completely unnecessary here. Not only can you do what it does very simply, by just opening a filehandle on a reference to your string, but all you actually want is to read a file from a web site into an array. In that case, your code is simply:
use LWP::Simple;
my $url = 'something';
my @records = split /\n/, get($url);
You might even consider adding some error handling.
use LWP::Simple;
my $url = 'something';
my $data = get($url);
die "No data found\n" unless defined $data;
my @array = split /\n/, $data;

Get the whole content from a web site using a Perl script

I am a new hire at my company and this is the first time I am working with Perl.
I was given a task to find the IP reputation from this link: https://www.talosintelligence.com/reputation_center/lookup?search=27.34.246.62
But in Perl, when I use:
#!/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
my $url="https://www.talosintelligence.com/reputation_center/lookup?search=27.34.246.62";
$mech->get($url);
print $mech->status();
my $content = $mech->content();
open FILE1, ">./Reports/Reputation.txt" or die "Cannot open Reputation.txt!";
print FILE1 ($content);
close FILE1;
print "\nIP Reputation Report Generated \n";
I don't get the whole content. What can I do to get it?
The content is loaded by JavaScript, so you can't crawl it with simple HTTP requests.
There are two options for this kind of situation.
1) Some API holds the original data and JavaScript only loads and formats it in the front end. If you want to parse JavaScript-loaded content, try using
WWW::Mechanize::Firefox
2) Try to figure out where the data is loaded from. For your IP, the following link has the corresponding data in JSON format, so parse it with the JSON module; that is much simpler than using regexes. A sketch follows after the link.
https://www.talosintelligence.com/sb_api/query_lookup?query=%2Fapi%2Fv2%2Frelated_ips%2Fip%2F&query_entry=27.34.246.62
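A minimal sketch of option 2, assuming the endpoint above is reachable directly and returns plain JSON (the exact response structure is not shown here, so the decoded data is simply dumped for inspection):
use strict;
use warnings;
use LWP::UserAgent;
use JSON;
use Data::Dumper;

my $url = 'https://www.talosintelligence.com/sb_api/query_lookup'
        . '?query=%2Fapi%2Fv2%2Frelated_ips%2Fip%2F&query_entry=27.34.246.62';

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url);
die 'Request failed: ', $response->status_line, "\n"
    unless $response->is_success;

# Decode the JSON body (raw bytes, charset decoding left to the JSON module)
my $data = decode_json( $response->decoded_content( charset => 'none' ) );

# Inspect the structure before writing report code around it
print Dumper($data);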

Getstore to Buffer, not using temporary files

I started Perl recently and mixed quite a few things together to get what I want.
My script gets the content of a web page and writes it to a file.
Then I open a filehandle, read the report.html file back in, and parse it.
I write every line I encounter to a new file, except lines matching a specific color.
It works, but I'd like to try another way that doesn't require creating a temporary report.html file.
Furthermore, I'd like to print my result directly to a file; I don't want to use a shell redirection '>', because that would mean my script has to be called by another .sh script, and I don't want that.
use strict;
use warnings;
use LWP::Simple;
my $report = "report.html";
getstore('http://test/report.php', 'report.html')
    or die "Unable to get page\n";
open my $fh2, "<$report" or die("could not open report file : $!\n");
while (<$fh2>)
{
print if (!(/<td style="background-color:#71B53A;"/ .. //));
}
close($fh2);
Thanks for your help
If you have the HTML content in a variable, you can call open on a reference to that variable. Like:
my $var = "your html content\ncomes here\nstored into this variable";
open my $fh, '<', \$var or die $!;
# ... then read from $fh like any other filehandle
You can use the get function from the LWP::Simple module to fetch the page into a variable instead of using getstore ;)
For your second question, use the three-argument form of open: open my $fh, '<', $filepath. You can use perldoc -f open to see more info.
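Putting those pieces together, a minimal sketch of the whole script without the temporary report.html and without shell redirection; the URL and the filter condition are taken from the question, and the output filename filtered_report.html is made up for illustration:
use strict;
use warnings;
use LWP::Simple;

# Fetch the page into memory instead of storing it with getstore()
my $html = get('http://test/report.php')
    or die "Unable to get page\n";

# Open a read handle on the in-memory content ...
open my $in, '<', \$html or die "Could not open in-memory handle: $!\n";

# ... and write the filtered result directly to an output file (no shell '>')
open my $out, '>', 'filtered_report.html'
    or die "Could not open output file: $!\n";

while (<$in>) {
    # Same filter condition as in the question
    print {$out} $_ unless /<td style="background-color:#71B53A;"/ .. //;
}

close $in;
close $out;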

How can I download link targets from a web site using Perl?

I just made a script to grab links from a website, which in turn saves them into a text file.
Now I'm working on my regexes so it will grab, from the text file, links which contain php?dl= in the URL:
E.g.: www.example.com/site/admin/a_files.php?dl=33931
It's pretty much the address you get when you hover over the dl button on the site, from which you can click to download or right-click and save.
I'm just wondering how to achieve this: downloading the content of each such address (which downloads a *.txt file), all from the script of course.
Make WWW::Mechanize your new best friend.
Here's why:
It can identify links on a webpage that match a specific regex (/php\?dl=/ in this case)
It can follow those links through the follow_link method
It can get the targets of those links and save them to file
All this without needing to save your wanted links in an intermediate file! Life's sweet when you have the right tool for the job...
Example
use strict;
use warnings;
use WWW::Mechanize;
my $url = 'http://www.example.com/';
my $mech = WWW::Mechanize->new();
$mech->get ( $url );
my @linksOfInterest = $mech->find_all_links ( text_regex => qr/php\?dl=/ );
my $fileNumber = 1;
foreach my $link (@linksOfInterest) {
    $mech->get ( $link, ':content_file' => "file".($fileNumber++).".txt" );
    $mech->back();
}
You can download the file with LWP::UserAgent:
my $ua = LWP::UserAgent->new();
my $response = $ua->get($url, ':content_file' => 'file.txt');
Or, if you fetch the content into memory (without :content_file) and need a filehandle on it:
open my $fh, '<', $response->content_ref or die $!;
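Either way, it is worth checking the response before trusting the saved file. A minimal sketch, using the hypothetical php?dl= URL from the question:
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.example.com/site/admin/a_files.php?dl=33931';

my $ua = LWP::UserAgent->new();
my $response = $ua->get( $url, ':content_file' => 'file.txt' );

# get() still returns an HTTP::Response object; check it before using file.txt
die 'Download failed: ', $response->status_line, "\n"
    unless $response->is_success;

print "Saved file.txt\n";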
Old question, but when I'm writing quick scripts, I often just pipe the output of "wget" or "curl". This isn't portable across systems, perhaps, but if I know my system has one or the other of these commands, it's generally fine.
For example:
#! /usr/bin/env perl
use strict;
use warnings;

open my $fp, "curl http://www.example.com/ |" or die "Cannot run curl: $!";
while (<$fp>) {
    print;
}
close $fp;
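If you go the external-command route, a slightly safer variant is the list form of the pipe open, which avoids the shell entirely (curl's -s flag just silences the progress output):
#! /usr/bin/env perl
use strict;
use warnings;

# List form: arguments are passed directly to curl, no shell involved
open my $fp, '-|', 'curl', '-s', 'http://www.example.com/'
    or die "Could not run curl: $!";

while (<$fp>) {
    print;
}

close $fp or warn "curl exited with an error\n";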

Why do my images get clipped when served by this Perl CGI script?

When I try to print an image to STDOUT in a Perl CGI script, the image gets clipped when viewed in the browser.
Here is the code:
if ($path =~ m/\.jpe?g$/i)
{
my $length = (stat($path))[7];
$| = 1;
print "Content-type: image/jpg\r\n";
print "Content-length: $length\r\n\r\n";
open(IMAGE,"<$path");
binmode(IMAGE);
binmode(STDOUT);
my ($image, $buff);
read IMAGE, $buff, $length;
syswrite STDOUT, $buff, $length;
close IMAGE;
}
If you really want to read the entire file into memory before serving, use File::Slurp:
#!/usr/bin/perl
use strict; use warnings;
use CGI::Simple;
use File::Slurp;
use File::stat;
local $| = 1;
my $cgi = CGI::Simple->new;

# $path is assumed to hold the path to the image being served, as in the question
my $path = 'image.jpg';

my $st = stat($path) or die "Cannot stat '$path'";
print $cgi->header(
-type => 'image/jpeg',
-length => $st->size,
);
write_file(\*STDOUT, {binmode => ':raw'},
\ read_file( $path, binmode => ':raw' )
);
However, reading the entire file will consume large amounts of memory for large images. Therefore, see How can I serve an image with a Perl CGI script?.
EDIT: As the stat doesn't seem to be the problem, some more ideas:
Try unbuffered reading instead of buffered reading, i.e. sysread instead of read. Or go the other way round and use both buffered read and buffered write (print instead of syswrite); a sketch of that follows after this answer. Also try commenting out the $| = 1. See "Suffering from Buffering?" for details on Perl's buffered I/O, and see "How can I serve an image with a Perl CGI script?" here on SO for an apparently working solution. EDIT END
You are using the wrong stat field: (stat($path))[10] is ctime, the inode change time in seconds since the epoch. It should be (stat($path))[7], the total size of the file in bytes.
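For reference, here is a minimal sketch of the buffered-read-plus-buffered-write suggestion above, reading the image in fixed-size chunks so the whole file never has to sit in memory (the 8 KB chunk size is arbitrary, and $path is assumed to hold the image path as in the question):
# Headers are assumed to have been printed already, as in the question
open my $image, '<', $path or die "Cannot open '$path': $!";
binmode $image;
binmode STDOUT;

my $buff;
while ( read $image, $buff, 8192 ) {
    print STDOUT $buff;    # buffered write to match the buffered read
}
close $image;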
FYI: I have come to the conclusion that the images are in fact corrupt, though they are fully viewable in Windows File Explorer.
The Firefox browser shows the images clipped (no matter how they are accessed, so I guess this is no longer a Perl problem), but the Safari browser displays them completely.
The images were resampled using Java's ImageIO in "jpg" mode. I just changed the mode to "png", and now the newly generated images show perfectly in all browsers. So this was actually a Java ImageIO issue.
It is solved.
Thank you everyone for your responses.