I have a list of URLs of PDF files, from different sites, that I want to download.
In Firefox, I have chosen the option to save PDF files directly to a particular folder.
My plan was to use WWW::Mechanize::Firefox in Perl to download each file in the list, one by one, using Firefox, and to rename the file after it downloads.
I used the following code to do it:
use WWW::Mechanize::Firefox;
use File::Copy;

# @list contains the list of links to PDF files
foreach $x (@list) {
    my $mech = WWW::Mechanize::Firefox->new(autoclose => 1);
    $mech->get($x); # This downloads the file using Firefox into the desired folder
    opendir(DIR, "output/download");
    @FILES = readdir(DIR);
    my $old = "output/download/$FILES[2]";
    move($old, $new); # $new is the new filename
}
When I run the script, it opens the first link in Firefox, and Firefox downloads the file to the desired directory. But after that the new tab is not closed, the file does not get renamed, and the code keeps running (as if it has hit an endless loop); no further files get downloaded.
What is going on here? Why isn't the code working? How do I close the tab and make the code process all the files in the list? Is there an alternative way to download them?
Solved the problem.
The function,
$mech->get()
waits for the 'DOMContentLoaded' event to be fired by Firefox upon page load. As I had set Firefox to download the files automatically, no page was being loaded, so the 'DOMContentLoaded' event was never fired. This caused my code to hang.
I set the function not to wait for the page to load by passing the following option:
$mech->get($x, synchronize => 0);
After this, I added a 60-second delay to give Firefox time to download the file before the code progresses:
sleep 60;
Thus, my final code looks like this:
use WWW::Mechanize::Firefox;
use File::Copy;

# @list contains the list of links to PDF files
foreach $x (@list) {
    my $mech = WWW::Mechanize::Firefox->new(autoclose => 1);
    $mech->get($x, synchronize => 0);
    sleep 60;
    opendir(DIR, "output/download");
    @FILES = readdir(DIR);
    my $old = "output/download/$FILES[2]";
    move($old, $new); # $new is the new filename
}
If I understood you correctly, you have the links to the actual PDF files.
In that case, WWW::Mechanize is most likely easier than WWW::Mechanize::Firefox. In fact, I think that is almost always the case. Then again, watching the browser work is certainly cooler.
use strict;
use warnings;
use WWW::Mechanize;

# your code here

# loop
my $mech = WWW::Mechanize->new();    # Could (should?) be outside of the loop.
$mech->agent_alias("Linux Mozilla"); # Optionally pretend to be whatever you want.
$mech->get($link);
$mech->save_content($new);
# end of the loop
If that is absolutely not what you wanted, my cover story will be that I did not want to break my 666 rep!
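For completeness, here is a minimal, self-contained sketch of that loop. The contents of @pdf_list, the output directory, and the file-naming scheme are assumptions for illustration, not something taken from the question:

use strict;
use warnings;
use WWW::Mechanize;
use File::Basename qw(basename);
use URI;

# Example list of direct PDF links (placeholders).
my @pdf_list = (
    'http://www.example.com/papers/one.pdf',
    'http://www.example.com/papers/two.pdf',
);

my $mech = WWW::Mechanize->new();
$mech->agent_alias("Linux Mozilla");

my $count = 0;
for my $link (@pdf_list) {
    # Build a local filename from the last part of the URL path,
    # prefixed with a counter so repeated names cannot collide.
    my $name = basename( URI->new($link)->path );
    $name = sprintf('%03d_%s', ++$count, $name || 'download.pdf');

    $mech->get($link);
    $mech->save_content("output/download/$name");
}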
Related
I am trying to get an XML file from a database using WWW::Mechanize. I know that the file is quite big (bigger than my memory), and it constantly crashes, whether I try to view it in the browser or try to store it in a file using get(). I am planning to use XML::Twig in the future, but I cannot even store the result in a file.
Does anyone know how to split the mechanized object into little chunks, get them one after another, and store them in a file, one after another, without running out of memory?
Here is the query API: ArrayExpress Programmatic Access.
Thank you.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $base = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments';

# Parameters
my $query = '?species="homo sapiens"';
my $url   = $base . $query;

# Create a new mechanize object
my $mech = WWW::Mechanize->new(stack_depth => 0);

# Associate the mechanize object with a URL
$mech->get($url);

# Store the XML content
my $content = $mech->content;

# Open the output file for writing
unlink("ArrayExpress_Human_Final.txt");
open( my $fh, '>>:encoding(UTF-8)', 'ArrayExpress_Human_Final.txt' ) or die "Can't open file: $!\n";
print $fh $content;
close $fh;
Sounds like what you want to do is save the file directly to disk, rather than loading it into memory.
From the Mech FAQ question "How do I save an image? How do I save a large tarball?"
You can also save any content directly to disk using the :content_file flag to get(), which is part of LWP::UserAgent.
$mech->get( 'http://www.cpan.org/src/stable.tar.gz',
':content_file' => 'stable.tar.gz' );
Also note that if all you're doing is downloading the file, it may not even make sense to use WWW::Mechanize; you can use the underlying LWP::UserAgent directly.
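For the large-XML case above, a minimal sketch of that direct LWP::UserAgent approach might look like this; the output filename is carried over from the question's code, and the unescaped query string is reused as-is:

use strict;
use warnings;
use LWP::UserAgent;

my $base  = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments';
my $query = '?species="homo sapiens"';
my $url   = $base . $query;

my $ua = LWP::UserAgent->new();

# ':content_file' streams the response body straight to disk,
# so the whole XML document never has to fit in memory.
my $response = $ua->get($url, ':content_file' => 'ArrayExpress_Human_Final.txt');
die "Download failed: ", $response->status_line, "\n" unless $response->is_success;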
I use Perl to send GIFs to a server. I want to move some GIFs to a folder called precipitation and other GIFs to a folder called wind. With the following code (well, it is part of the code I use) I send them, but there is something wrong in the code, because I find all the GIFs in the same folder, the first one, precipitation. Any idea?
use BSD::Resource;
use File::Copy;
use Net::FTP;

# ACTIONS: send GIFs to the precipitation folder
$ftp->cwd("html/pen/precipitation");
foreach my $file ($ftp->ls("Pl*.gif")) {
    $ftp->delete($file) or die "Error in delete\n";
}
my @arxius = glob("/home/gif/Pen/Pl*.gif");
foreach my $File (@arxius) {
    $ftp->binary();
    $ftp->put($File);
}

# ACTIONS: send GIFs to the wind folder
$ftp->cwd("html/pen/wind");
foreach my $file2 ($ftp->ls("vent*.gif")) {
    $ftp->delete($file2) or die "Error in delete\n";
}
my @arxius2 = glob("/home/gif/Pen/vent*.gif");
foreach my $File (@arxius2) {
    $ftp->binary();
    $ftp->put($File);
}
The behavior indicates that the second call to cwd() failed.
Most likely this is because you are using a relative rather than an absolute path: the second cwd() call is relative to the location set by the first one. It tries to go to html/pen/precipitation/html/pen/wind, which doesn't appear to be what you want.
Use an absolute path or ../wind in the second cwd() call.
Also, you should check whether the cwd() calls succeed and stop if you didn't change to the expected directory. Otherwise, you are performing potentially destructive actions (like deleting files) in the wrong place! cwd() returns true if it worked and false otherwise; see the Net::FTP documentation.
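As a sketch, the second block could look something like this, with the relative-path fix and error checks added; $ftp is assumed to be the connected Net::FTP object from the question:

# We are currently in html/pen/precipitation, so ../wind lands in html/pen/wind.
$ftp->cwd("../wind")
    or die "Cannot cwd to ../wind: " . $ftp->message;

foreach my $file ($ftp->ls("vent*.gif")) {
    $ftp->delete($file) or die "Error deleting $file: " . $ftp->message;
}

$ftp->binary();
foreach my $File (glob("/home/gif/Pen/vent*.gif")) {
    $ftp->put($File) or die "Error uploading $File: " . $ftp->message;
}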
I have a Perl script that I wrote that gets some image URLs, puts the URLs into an input file, and proceeds to run wget with the --input-file option. This works perfectly... or at least it did, as long as the image filenames were unique.
I have a new company sending me data, and they use a very TROUBLESOME naming scheme: all files have the same name, 0.jpg, in different folders.
For example:
cdn.blah.com/folder/folder/202793000/202793123/0.jpg
cdn.blah.com/folder/folder/198478000/198478725/0.jpg
cdn.blah.com/folder/folder/198594000/198594080/0.jpg
When I run my script with this, wget works fine and downloads all the images, but they are named 0.jpg.1, 0.jpg.2, 0.jpg.3, etc. I can't just count them and rename them, because files can be broken, unavailable, whatever.
I tried running wget once for each file with -O, but it's embarrassingly slow: starting the program, connecting to the site, downloading, and ending the program. Thousands of times. It's an hour versus minutes.
So I'm trying to find a method to change the output filenames from wget without it taking so long. The original approach works so well that I don't want to change it too much unless necessary, but I am open to suggestions.
Additional:
LWP::Simple is too simple for this. Yes, it works, but very slowly. It has the same problem as running individual wget commands: each get() or getstore() call makes the system reconnect to the server. Since the files are so small (60 kB on average) and there are so many to process (1851 in this one test file alone), the connection time is considerable.
The filename I will be using can be found with /\/(\d+)\/(\d+.jpg)/i, where the filename will simply be $1$2, giving 2027931230.jpg. Not really important for this question.
I'm now looking at LWP::UserAgent with LWP::ConnCache, but it times out and/or hangs on my PC. I will need to adjust the timeout and retry values. The inaugural run of the code downloaded 693 images (43 MB) in just a couple of minutes before it hung. Using LWP::Simple, I only got 200 images in 5 minutes.
use LWP::UserAgent;
use LWP::ConnCache;

chomp(my @filelist = <INPUTFILE>);

my $browser = LWP::UserAgent->new;
$browser->conn_cache(LWP::ConnCache->new());

foreach (@filelist) {
    /\/(\d+)\/(\d+.jpg)/i;
    my $newfilename = $1 . $2;
    my $response = $browser->mirror($_, $folder . $newfilename);
    die 'response failure' if $response->is_error();
}
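For reference, a hedged sketch of the kind of timeout/retry tuning mentioned above; the 30-second timeout and the three attempts are arbitrary placeholder values, not recommendations:

use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new(timeout => 30);   # seconds; the default is 180
$browser->conn_cache(LWP::ConnCache->new());

# Retry a mirror() call a few times before giving up.
sub mirror_with_retries {
    my ($url, $path, $tries) = @_;
    for (1 .. $tries) {
        my $response = $browser->mirror($url, $path);
        # 304 means the local copy is already up to date.
        return $response if $response->is_success or $response->code == 304;
    }
    return;
}

my $response = mirror_with_retries($_, $folder . $newfilename, 3);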
LWP::Simple's getstore function allows you to specify a URL to fetch from and the filename to store the data in. It's an excellent module for many of the same use cases as wget, but with the benefit of being a Perl module (i.e. no need to shell out or spawn child processes).
use LWP::Simple;

# Grab the filename from the end of the URL
my $filename = (split '/', $url)[-1];

# If the file exists, increment its name
while (-e $filename) {
    $filename =~ s{ (\d+)[.]jpg }{ $1+1 . '.jpg' }ex
        or die "Unexpected filename encountered";
}

getstore($url, $filename);
The question doesn't specify exactly what kind of renaming scheme you need, but this will work for the examples given by simply incrementing the filename until the current directory doesn't contain that filename.
I am looking for a way to use Perl to open a PDF file in Internet Explorer and then save it.
(I want the user to be able to interact with the script and decide whether downloading occurs, which is why I want the PDF to be displayed in IE, so I cannot use something like LWP::Simple.)
As an example, this code loads (displays) a PDF, but I can't figure out how to get Perl to tell IE to save the file.
use Win32::OLE;
my $ie = Win32::OLE->new("InternetExplorer.Application");
$ie->{Visible} = 1;
Win32::OLE->WithEvents($ie);
$ie->Navigate('http://www.aeaweb.org/Annual_Meeting/pdfs/2014_Registration.pdf');
I think I might need to use the OLE method ExecWB, but I haven't been able to figure it out.
What you want to do is automate the Internet Explorer UI. There are many libraries out there that will do this. You tell the library to find your window of interest, and then you can send keystrokes or commands to that window (Ctrl+S in your case).
A good overview on how to do this in Perl is located here.
Example syntax:
my @keys = ( "%{F}", "{RIGHT}", "E", );
for my $key (@keys) {
    SendKeys( $key, $pause_between_keypress );
}
The code starts with an array containing the keypresses. Note the format of the first three elements. The keypresses are: Alt+F, right arrow, and E. With the application open, this navigates the menu in order to open the editor.
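A hedged sketch of that approach for the IE case, using Win32::GuiTest; the window-title string, the one-second pause, and the 100 ms keystroke delay are assumptions:

use strict;
use warnings;
use Win32::GuiTest qw(FindWindowLike SetForegroundWindow SendKeys);

# Find the IE window by (partial) title, bring it to the front, and send Ctrl+S.
my ($window) = FindWindowLike(0, "Internet Explorer");
die "IE window not found\n" unless $window;

SetForegroundWindow($window);
sleep 1;             # give the window time to take focus
SendKeys("^s", 100); # Ctrl+S, with a 100 ms pause between keystrokes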
Another option is to use LWP:
use LWP::Simple;
my $url = 'http://www.aeaweb.org/Annual_Meeting/pdfs/2014_Registration.pdf';
my $file = '2014_Registration.pdf';
getstore($url, $file);
For ExecWB, here is a good thread; however, it is not solved: http://www.perlmonks.org/?node_id=477361
$IE->ExecWB($OLECMDID_SAVEAS, $OLECMDEXECOPT_DONTPROMPTUSER, $Target);
Why don't you display the PDF in IE, then close IE and save the file using LWP?
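A hedged sketch of that suggestion, reusing the URL from the question; the console prompt, the output filename, and quitting IE afterwards are assumptions:

use strict;
use warnings;
use Win32::OLE;
use LWP::Simple qw(getstore is_success);

my $url = 'http://www.aeaweb.org/Annual_Meeting/pdfs/2014_Registration.pdf';

# Show the PDF in IE so the user can look at it first.
my $ie = Win32::OLE->new("InternetExplorer.Application");
$ie->{Visible} = 1;
$ie->Navigate($url);

print "Save the PDF? (y/n) ";
chomp(my $answer = <STDIN>);
$ie->Quit();   # close IE

if (lc $answer eq 'y') {
    my $status = getstore($url, '2014_Registration.pdf');
    warn "Download failed with status $status\n" unless is_success($status);
}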
You could use Selenium and the Perl remote driver to manage IE:
http://search.cpan.org/~aivaturi/Selenium-Remote-Driver-0.15/lib/Selenium/Remote/Driver.pm
http://docs.seleniumhq.org/projects/webdriver/
You will also need to download the IE Selenium driver; Firefox is supported as standard:
https://code.google.com/p/selenium/wiki/InternetExplorerDriver
use Selenium::Remote::Driver;

my $driver = Selenium::Remote::Driver->new;
$driver->get('http://www.google.com');
print $driver->get_title();
$driver->quit();
I just made a script that grabs links from a website and saves them into a text file.
Now I'm working on my regexes so it will grab links which contain php?dl= in the URL from the text file:
E.g.: www.example.com/site/admin/a_files.php?dl=33931
It's pretty much the address you get when you hover over the dl button on the site, from which you can click to download or right-click and save.
I'm just wondering how to achieve this: downloading the content of each such address, which yields a *.txt file. All from the script, of course.
Make WWW::Mechanize your new best friend.
Here's why:
It can identify links on a webpage that match a specific regex (/php\?dl=/ in this case)
It can follow those links through the follow_link method
It can get the targets of those links and save them to file
All this without needing to save your wanted links in an intermediate file! Life's sweet when you have the right tool for the job...
Example
use strict;
use warnings;
use WWW::Mechanize;

my $url  = 'http://www.example.com/';
my $mech = WWW::Mechanize->new();
$mech->get($url);

my @linksOfInterest = $mech->find_all_links( text_regex => qr/php\?dl=/ );

my $fileNumber = 1;

foreach my $link (@linksOfInterest) {
    $mech->get( $link, ':content_file' => 'file' . $fileNumber++ . '.txt' );
    $mech->back();
}
You can download the file with LWP::UserAgent:
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url, ':content_file' => 'file.txt');
Or if you need a filehandle:
open my $fh, '<', $response->content_ref or die $!;
Old question, but when I'm doing quick scripts, I often use wget or curl and a pipe. This isn't portable across systems, perhaps, but if I know my system has one or the other of these commands, it's generally good.
For example:
#! /usr/bin/env perl
use strict;
use warnings;

open my $fp, '-|', 'curl', 'http://www.example.com/' or die "Cannot run curl: $!";
while (<$fp>) {
    print;
}
close $fp;