Download Images from a website using Perl

I am trying to write a program in Perl to download images from a website. The problem is that I would like to retain the same directory structure as the website the images are downloaded from.
For example, if the image comes from the URL below, the program should create the directories "folder" and "download" and save the image inside the innermost one.
http://www.example.com/folder/download/images.jpg
I am using LWP to download the images.
use LWP::Simple;
getstore($fileURL,$filename);

Look at wget or pavuk. You can also call them from within Perl; that's what I usually do.
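If you would rather stay in Perl, here is a minimal sketch of recreating the URL's path locally before saving. It uses URI, File::Path and File::Basename alongside LWP::Simple, and the URL is the example one from the question:
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);
use URI;
use File::Basename qw(dirname);
use File::Path qw(make_path);

my $url = 'http://www.example.com/folder/download/images.jpg';

# Take the path part of the URL ("/folder/download/images.jpg")
# and strip the leading slash so it becomes a relative local path.
my $local = URI->new($url)->path;
$local =~ s{^/}{};

# Create the matching directory tree, then save the image into it.
make_path(dirname($local));
my $status = getstore($url, $local);
print is_success($status) ? "saved $local\n" : "failed: $status\n";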

Related

Access downloaded pdf file path in HTML5 file system and display it in webview

In my Chrome app, I am using the HTML5 filesystem to save PDF files to the sandbox. Downloading works fine, but how do I access the downloaded file's path? I want to give that path as the webview source.
The best way, if it works, would be to use a filesystem URL. To get this use FileEntry.toURL
These don't work on external files (i.e. files that come from chrome.fileSystem.chooseEntry and are outside the app's sandbox) but should work for files in the app's sandbox.
Note, I am referring to filesystem:// URLs, not file:// URLs, which won't work, as Marc Rochkind has pointed out in his answer.
Disclaimer: I haven't tested this, but I believe it should work.
You need to get the contents of the PDF into a data URL. See my answer to this question:
Download external pdf files to chrome packaged app's file system

Identify the upload status

I am uploading a folder from my local machine to an FTP server using the Perl Net::FTP::Recursive module. I have written the sample code below, but I need to know the status of the upload, i.e. whether it has completed or not.
use strict;
use Net::FTP::Recursive;
my $ftp_con= Net::FTP::Recursive->new('host.com',Debug=>0);
$ftp_con->login('username','password');
$ftp_con->rput('d:\my_test','/root/my_test');
$ftp_con->quit;
With the above code I am unable to determine the status of the upload. Can anyone suggest how to check whether the folder has been uploaded successfully?
Thanks...
Subclass Net::FTP::Recursive to override _rput. Add a callback hook at the end of the foreach block and pass in the current file $file and the list of files @files as arguments.
In the main part of the code, count up each time the callback is called and calculate the progress from the counter and the number of elements in @files.
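If a coarse success/failure check is enough, rather than per-file progress, a minimal sketch along these lines may do. The host, credentials and paths are placeholders, and it assumes rput() works on the current local and remote directories:
use strict;
use warnings;
use Net::FTP::Recursive;

my $ftp = Net::FTP::Recursive->new('host.com', Debug => 0)
    or die "Cannot connect: $@";

$ftp->login('username', 'password')
    or die "Login failed: ", $ftp->message;

# Make sure the remote target exists, then change into it.
$ftp->mkdir('/root/my_test', 1);
$ftp->cwd('/root/my_test')
    or die "Cannot cwd: ", $ftp->message;

# rput() uploads the current local directory tree, so chdir locally first.
chdir 'd:/my_test' or die "Cannot chdir: $!";
$ftp->rput();

# Crude verification: list what actually arrived on the server.
print "$_\n" for $ftp->ls;

$ftp->quit;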
First, make sure you remember the name of the folder you transferred via FTP. If the transfer is too fast to monitor and you are unable to tell whether it is already on the server, you can verify that it loaded successfully another way:
1. Log in to the cPanel of your website via your hosting provider.
2. Locate the legacy File Manager folder and click it.
3. Choose the document root, click Go, then look for the folder name you transferred via FTP.

Perl Mechanize module for scraping pdfs

I have a website to which many PDFs are uploaded, and I want to download all of the PDFs present on it. To do so, I first need to provide a username and password to the website. After searching for some time I found the WWW::Mechanize package, which does this work. Now the problem is that I want to make a recursive search of the website, meaning that if a link does not point to a PDF, I should not simply discard it but should follow it and check whether the new page has links that contain PDFs. In this way I should exhaustively search the entire website to download all the uploaded PDFs. Any suggestion on how to do this?
I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting PDF URLs from actually downloading them. Crawling is already a lengthy process, and it might be advantageous to be able to hand off the downloading tasks to separate worker processes.
You are right about using the WWW::Mechanize module. This module has a method, find_all_links(), in which you can specify a regex to match the kind of pages you want to grab or follow.
For example:
my $obj = WWW::Mechanize->new;
.......
.......
my @pdf_links = $obj->find_all_links( url_regex => qr/^.+?\.pdf/ );
This gives you all the links pointing to PDF files. Now iterate through these links and issue a get() call on each of them.
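A rough sketch of the whole flow might look like the following; the start URL, the login form fields and the same-host restriction are assumptions you would adapt to the actual site:
use strict;
use warnings;
use WWW::Mechanize;
use URI;
use File::Basename qw(basename);

my $start = 'http://www.example.com/';   # assumed start page
my $host  = URI->new($start)->host;
my $mech  = WWW::Mechanize->new(autocheck => 0);

# Log in first; the form field names are placeholders for the real ones.
$mech->get($start);
$mech->submit_form(with_fields => { username => 'user', password => 'pass' });

my %seen;
my @queue = ($start);

while (my $url = shift @queue) {
    next if $seen{$url}++;
    $mech->get($url);
    next unless $mech->success and $mech->is_html;

    for my $link ($mech->find_all_links) {
        my $abs = $link->url_abs->as_string;
        my $link_host = eval { URI->new($abs)->host } or next;
        next unless $link_host eq $host;   # stay on this site
        next if $seen{$abs};               # already fetched or queued

        if ($abs =~ /\.pdf$/i) {
            $seen{$abs}++;
            # Save the PDF straight to disk under its own name.
            $mech->get($abs, ':content_file' => basename(URI->new($abs)->path));
        }
        else {
            push @queue, $abs;             # follow the page and keep crawling
        }
    }
}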
I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/

How do I view .asp Images?

I am trying to download images from a site using Perl, saving them with LWP::Simple's getstore.
Here's an example of the URL
http://www.aavinvc.com/_includes/blob.asp?Table=user&I=28&Width=100!&Height=100!
It turns out the files I am getting with LWP are completely empty. I even tried cURL and got the same thing: completely empty files. Would there be another way to get these?
If the file really contains ASP, then you have to run it through an ASP engine.
If things worked properly, then the URL would return an image file with an appropriate content type. You've just saved it with a .asp extension.
The fix is simple: rename the file, preferably by looking at the Content-Type header returned (trivial with LWP, though I think you'll have to move beyond getstore) and doing it in Perl.
Regarding the update:
I just tried:
#!/usr/bin/perl
use Modern::Perl;
use LWP::Simple;
LWP::Simple::getstore(q{http://www.aavinvc.com/_includes/blob.asp?Table=user&I=28&Width=100!&Height=100}, 'foo.jpeg');
… and it just worked. The file opened without a hitch in my default image viewer.
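A small sketch of the rename-by-Content-Type idea mentioned above, using LWP::UserAgent instead of getstore; the extension map and output filename are just one way to do it:
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.aavinvc.com/_includes/blob.asp?Table=user&I=28&Width=100!&Height=100!';

my $ua  = LWP::UserAgent->new;
my $res = $ua->get($url);
die "Fetch failed: ", $res->status_line unless $res->is_success;

# Pick an extension from the response's media type instead of the .asp URL.
my %ext  = ('image/jpeg' => 'jpg', 'image/png' => 'png', 'image/gif' => 'gif');
my $type = $res->content_type;                  # e.g. "image/jpeg"
my $file = 'image.' . ($ext{$type} // 'bin');

open my $fh, '>:raw', $file or die "Cannot write $file: $!";
print {$fh} $res->content;
close $fh;
print "saved $file ($type)\n";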
.asp is not an image format.
Here are two explanations:
1. The images are simply JPEGs generated by .asp files, so just use them as if they were .jpegs: rename them.
2. You are actually downloading a page that says "LOL I trol U - we don't allow images to be downloaded with Simple.getstore."

How can I get all HTML pages from a website subfolder with Perl?

Can you point me towards an idea of how to get all the HTML files in a subfolder of a website, and in all the folders inside it?
For example:
www.K.com/goo
I want all the HTML files that are in: www.K.com/goo/1.html, ......n.html
Also, if there are subfolders, I want to get those as well: www.K.com/goo/foo/1.html...n.html
Assuming you don't have access to the server's filesystem, then unless each directory has an index of the files it contains, you can't be guaranteed to achieve this.
The normal way would be to use a web crawler, and hope that all the files you want are linked to from pages you find.
Look at lwp-mirror and follow its lead.
I would suggest using the wget program to download the website rather than Perl; it's not that well suited to the problem.
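For example, something along these lines (adjust the accepted suffixes and URL to your site):
wget -r --no-parent -A html,htm http://www.K.com/goo/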
There are also a number of useful modules on CPAN which will be named things like "Spider" or "Crawler". But ishnid is right. They will only find files which are linked from somewhere on the site. They won't find every file that's on the file system.
You can also use curl to get all the files from a website folder. Look at its man page and go to the -o/--output section, which gives you a good idea of how to do that. I have used this a couple of times.
Read perldoc File::Find, then use File::Find.
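If you do have filesystem access on the server, a minimal File::Find sketch might look like this; the directory path is a placeholder:
use strict;
use warnings;
use File::Find;

# Collect every .html/.htm file under the document root's /goo directory.
my @html;
find(sub { push @html, $File::Find::name if -f && /\.html?$/i }, '/var/www/goo');
print "$_\n" for @html;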