Perl Mechanize module for scraping PDFs

I have a website into which many PDFs are uploaded, and I want to download all of them. To do so I first need to provide a username and password to the website. After searching for some time I found the WWW::Mechanize package, which does this job. The problem now is that I want to search the website recursively: if a link does not point to a PDF, I should not simply discard it but should follow it and check whether the new page has links that contain PDFs. In this way I should exhaustively search the entire website to download all the uploaded PDFs. Any suggestion on how to do this?

I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting PDF URLs from actually downloading them. Crawling is already a lengthy process, and it can be advantageous to hand off the downloading to separate worker processes.
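For example, the downloading half might look roughly like the following sketch. It assumes Parallel::ForkManager is installed, that the collected URLs were written one per line to a file (pdf_urls.txt is just a placeholder name), and that the PDFs themselves can be fetched without the authenticated session; if not, you would pass the login cookies along.
#!/usr/bin/perl
# Sketch: read previously collected PDF URLs from a file and download them
# in parallel worker processes.
use strict;
use warnings;
use LWP::Simple qw(getstore);
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);    # up to 4 concurrent downloads

open my $fh, '<', 'pdf_urls.txt' or die "Cannot open pdf_urls.txt: $!";
while ( my $url = <$fh> ) {
    chomp $url;
    next unless length $url;
    $pm->start and next;                   # fork a worker for this URL
    ( my $name = $url ) =~ s{.*/}{};       # crude file name taken from the URL
    $name = 'download.pdf' unless length $name;
    getstore( $url, $name );
    $pm->finish;
}
close $fh;
$pm->wait_all_children;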

You are right about using the WWW::Mechanize module. It has a method, find_all_links(), to which you can pass a regex matching the kind of links you want to grab or follow.
For example:
my $obj = WWW::Mechanize->new;
# ... log in and fetch the starting page here ...
my @pdf_links = $obj->find_all_links( url_regex => qr/\.pdf$/i );
This gives you all the links pointing to PDF files. Now iterate through these links and issue a get() call on each of them.
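For the recursive part, something along these lines could serve as a starting point. This is only a sketch: it assumes the login step has already been handled (for example with submit_form() or credentials()), that PDFs can be recognised by their URL suffix (checking the Content-Type header would be more robust), and that the crawl should stay on the starting host; the start URL is a placeholder.
#!/usr/bin/perl
# Sketch of a breadth-first crawl with WWW::Mechanize: save every PDF,
# queue every other same-host link for further crawling.
use strict;
use warnings;
use WWW::Mechanize;
use URI;

my $start = 'http://www.example.com/';     # placeholder start page
my $host  = URI->new($start)->host;
my $mech  = WWW::Mechanize->new( autocheck => 0 );
# ... log in here first, e.g. $mech->get($login_url); $mech->submit_form(...);

my %seen;
my @queue = ($start);
while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    $mech->get($url);
    next unless $mech->success;

    if ( $url =~ /\.pdf$/i ) {             # crude: match on the URL suffix
        ( my $file = $url ) =~ s{.*/}{};
        $mech->save_content($file);        # write the PDF to disk
        next;
    }
    next unless $mech->is_html;            # only parse HTML pages for more links

    for my $link ( $mech->links ) {
        my $uri = $link->url_abs;
        next unless $uri->scheme =~ /^https?$/;   # skip mailto:, javascript:, etc.
        push @queue, $uri->as_string if $uri->host eq $host;
    }
}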

I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/

Related

How can I "drill down" into a website using Perl's WWW::Mechanize

I have used the WWW::Mechanize Perl module on a number of projects and it's helped me out a lot.
I am trying to use it on a different site and I can't "drill down" into the content of the site.
The site is https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services
I have tried to figure out what URL would return the content shown in the resulting HTML, such as bb.d570283b87c834518ba9.css, bb.d570283b87c834518ba9.js and version.js.
I tried to copy the resulting HTML into this posting, but whatever combination of quoting and code formatting I used, it wouldn't display properly.
Does anyone have any idea how I "navigate" this site using this Perl module please?
WWW::Mechanize is a web client with some HTML parsing capabilities. But as you clearly noticed, the information you want is not in the HTML document you requested. Either download the correct document (whatever that might be), or do what the browser does and execute the JavaScript. This would require a JavaScript engine. The simplest way to achieve that is to remote-control a web browser (e.g. using Selenium::Chrome).
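For instance, here is a rough sketch of the remote-controlled-browser route. It assumes Selenium::Chrome (from the Selenium::Remote::Driver distribution) and a chromedriver binary are available; the fixed sleep is a crude stand-in for waiting on a specific element.
#!/usr/bin/perl
# Sketch: let a real Chrome instance execute the site's JavaScript, then
# work with the rendered HTML (e.g. feed it to an HTML parser).
use strict;
use warnings;
use Selenium::Chrome;

my $driver = Selenium::Chrome->new;
$driver->get('https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services');
sleep 5;                                   # crude: give the JavaScript time to build the page
my $html = $driver->get_page_source;       # the DOM after script execution
$driver->shutdown_binary;

print $html;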

Wget to download HTML

I have been trying to download the HTML from http://osu.ppy.sh/u/2330158 to get the Historical data,
but it doesn't download that part, nor the General, Top Ranks etc. sections.
Is there a way to make wget download it?
That part of the page is loaded dynamically, so wget won't see it: wget doesn't execute JavaScript. However, if you open the web developer tools in your browser of choice and then load the main page, you can find the URL you're really after. For this page, it's: http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0
Luckily, it's another simple, parameterised URL so you can feed that to wget:
wget "http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0"
That'll get you an HTML document containing just the historical data you're looking for.

Using wget to download all the hulkshare/mediafire linked files on a page

So I've been trying to set up wget to download all the mp3s from www.goodmusicallday.com. Unfortunately, rather than the mp3s being hosted by the site, the site puts them up on www.hulkshare.com and then links to the download pages. Is there a way to use the recursive and filtering abilities of wget to make it go to each hulkshare page and download the linked mp3?
Any help is much appreciated
So, a friend of mine actually figured out an awesome way to do this, just enter the code below in Terminal:
IFS="";function r { echo $1|sed "s/.*$2=\([^\'\"\&;]*\).*/\1/";};for l in `wget goodmusicallday.com -O-|grep soundFile`;do wget -c `r $l soundFile` -O "`r $l titles`";done
I guess not!
I have tried on several occasions to do scripted downloads from mediafire, but in vain,
and that's the reason they don't have a simple download link and have a timer attached instead.
If you look carefully, you will see that the actual file-hosting server is not www.mediafire.com but something like download666.com.
So I don't think it is possible with wget.
wget can only save the day if the download links are simple HTML links, i.e. a tags.

How do I view .asp Images?

I am trying to download and save images from a site with Perl, using LWP::Simple::getstore.
Here's an example of the URL
http://www.aavinvc.com/_includes/blob.asp?Table=user&I=28&Width=100!&Height=100!
It turns out the files I am getting with LWP are completely empty. I even tried cURL and got the same thing: completely empty. Would there be another way to get these?
If the file really contains ASP, then you have to run it through an ASP engine.
If things worked properly, then the URL would return an image file with an appropriate content type. You've just saved it with a .asp extension.
The fix is simple: rename the file, preferably by looking at the Content-Type header that comes back (trivial with LWP, though I think you'll have to move beyond getstore) and doing the renaming in Perl.
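As an illustration only, here is a small sketch of that approach using LWP::UserAgent; the extension map and the output file name are arbitrary choices, and it assumes the server sends a usable Content-Type.
#!/usr/bin/perl
# Sketch: fetch the blob.asp URL, then pick a file extension from the
# Content-Type header instead of trusting the .asp suffix.
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.aavinvc.com/_includes/blob.asp?Table=user&I=28&Width=100!&Height=100!';
my $ua  = LWP::UserAgent->new;
my $res = $ua->get($url);
die 'Download failed: ', $res->status_line unless $res->is_success;

my %ext  = ( 'image/jpeg' => 'jpg', 'image/png' => 'png', 'image/gif' => 'gif' );
my $type = $res->content_type;             # e.g. "image/jpeg"
my $file = 'image.' . ( $ext{$type} // 'bin' );

open my $out, '>:raw', $file or die "Cannot write $file: $!";
print {$out} $res->decoded_content( charset => 'none' );   # raw bytes
close $out;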
Regarding the update:
I just tried:
#!/usr/bin/perl
use Modern::Perl;
use LWP::Simple;
# Fetch the blob.asp URL and save the raw response body.
LWP::Simple::getstore(q{http://www.aavinvc.com/_includes/blob.asp?Table=user&I=28&Width=100!&Height=100!}, 'foo.jpeg');
… and it just worked. The file opened without a hitch in my default image viewer.
.asp is not an image format.
Here are two explanations:
The images are simple JPEGs generated by .asp files, so just use them as if they were JPEGs: rename them.
You are actually downloading a page that says "LOL I trol U - we don't allow images to be downloaded with LWP::Simple::getstore."

How can I get all HTML pages from a website subfolder with Perl?

Can you give me an idea of how to get all the HTML files in a subfolder of a website, and in all the folders inside it?
For example:
www.K.com/goo
I want all the HTML files that are in: www.K.com/goo/1.html, ......n.html
Also, if there are subfolders, I want to get them too: www.K.com/goo/foo/1.html...n.html
Assuming you don't have access to the server's filesystem, then unless each directory has an index of the files it contains, you can't be guaranteed to achieve this.
The normal way would be to use a web crawler, and hope that all the files you want are linked to from pages you find.
Look at lwp-mirror and follow its lead.
I would suggest using the wget program to download the website rather than Perl; Perl isn't that well suited to the problem.
There are also a number of useful modules on CPAN which will be named things like "Spider" or "Crawler". But ishnid is right. They will only find files which are linked from somewhere on the site. They won't find every file that's on the file system.
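If you do want to stay in Perl, a crawl restricted to the subfolder could look roughly like the sketch below. It assumes WWW::Mechanize is installed and that every page of interest is linked from somewhere under the prefix; the host and path are the placeholders from the question.
#!/usr/bin/perl
# Sketch: follow only links under a given path prefix and save each HTML page.
use strict;
use warnings;
use WWW::Mechanize;

my $prefix = 'http://www.K.com/goo/';      # placeholder from the question
my $mech   = WWW::Mechanize->new( autocheck => 0 );

my %seen;
my @queue = ($prefix);
while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    $mech->get($url);
    next unless $mech->success && $mech->is_html;

    ( my $file = $url ) =~ s{^\Q$prefix\E}{};   # path relative to the prefix
    $file =~ s{[^\w.-]+}{_}g;                   # flatten it into a safe file name
    $mech->save_content( $file || 'index.html' );

    push @queue, grep { index( $_, $prefix ) == 0 }
                 map  { $_->url_abs->as_string } $mech->links;
}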
You can also use curl to get all the files from a website folder.
Look at the curl man page and go to the -o/--output section, which gives you a good idea of how to do that.
I have used this a couple of times.
Read perldoc File::Find, then use File::Find.
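If you do have filesystem access to the server, that could look like the following; /var/www/goo is a made-up document-root path.
#!/usr/bin/perl
# Sketch: list every .html file under a directory tree on the local filesystem.
use strict;
use warnings;
use File::Find;

find(
    sub {
        print "$File::Find::name\n" if -f && /\.html?$/i;
    },
    '/var/www/goo'
);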