How can I get all HTML pages from a website subfolder with Perl?

Can you point me toward an approach for getting all the HTML files in a subfolder of a website, including all the folders below it?
For example:
www.K.com/goo
I want all the HTML files that are in: www.K.com/goo/1.html, ......n.html
Also, if there are subfolders, I want the files in them as well: www.K.com/goo/foo/1.html...n.html

Assuming you don't have access to the server's filesystem, then unless each directory has an index of the files it contains, you can't be guaranteed to achieve this.
The normal way would be to use a web crawler, and hope that all the files you want are linked to from pages you find.
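For illustration, a minimal sketch of such a crawler in Perl with WWW::Mechanize might look like this (the www.K.com/goo URL is the example from the question; note that it will only ever see pages that are linked from somewhere it visits):
use strict;
use warnings;
use WWW::Mechanize;

my $start = 'http://www.K.com/goo/';
my $mech  = WWW::Mechanize->new( autocheck => 0 );
my ( %seen, @queue );
push @queue, $start;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    next unless $url =~ /^\Q$start\E/;          # stay inside the /goo/ subfolder
    $mech->get($url);
    next unless $mech->success;

    if ( $url =~ /\.html$/ ) {                  # save the page locally
        ( my $name = $url ) =~ s{^\Q$start\E}{};
        $name =~ s{/}{_}g;                      # flatten subfolders into the file name
        open my $fh, '>', $name or die "$name: $!";
        print {$fh} $mech->content;
        close $fh;
    }

    # enqueue every link found on HTML pages
    push @queue, map { $_->url_abs->as_string } $mech->links
        if $mech->is_html;
}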

Look at lwp-mirror and follow its lead.

I would suggest using the wget program to download the website rather than Perl; Perl isn't that well suited to this problem.
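For example, something along these lines (adjust the flags to taste; --no-parent keeps wget from climbing out of the /goo/ folder):
wget -r --no-parent -l inf -A '*.html' http://www.K.com/goo/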

There are also a number of useful modules on CPAN which will be named things like "Spider" or "Crawler". But ishnid is right. They will only find files which are linked from somewhere on the site. They won't find every file that's on the file system.

You can also use curl to get all the files from a website folder.
Look at the curl man page, in particular the -o/--output section, which gives you a good idea of how to do that.
I have used this a couple of times.

Read perldoc File::Find, then use File::Find.
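That applies when you do have access to the filesystem the site is served from; in that case a minimal sketch (the /var/www/goo document-root path is an assumption) looks like this:
use strict;
use warnings;
use File::Find;

# Recursively collect every .html file below the folder
my @html;
find(
    sub { push @html, $File::Find::name if -f && /\.html$/ },
    '/var/www/goo',
);
print "$_\n" for @html;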

Related

Get names of files contained in Dropbox directory

I need to read a large number of files from a permanent Dropbox webpage.
The reading part is fine, but I'm having trouble finding a way to list all the files contained within.
To be more precise, I would need something like
files = dir([url_to_Dropbox_directory,'*.file_extension']);
returning the names of all the files.
I've found an example in php, but nothing for MATLAB. Using dir was just an example, I'm looking for any solution to this problem.
How can I get the file list from a permanent Dropbox webpage?
You should use the Dropbox API, where you can access that data via an HTTP request. files/list_folder is the specific endpoint you are looking for.
Here is the documentation for it:
https://www.dropbox.com/developers/documentation/http/documentation#files-list_folder
In addition, you could use the SDK for PHP (or other programming languages). I have used the SDK for JS and it's easy to use and works well.
PHP SDK:
https://dropbox.github.io/dropbox-sdk-php/api-docs/v1.1.x/
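Not MATLAB, but as a rough sketch of the raw HTTP call that endpoint expects (written here in Perl with HTTP::Tiny; the access token and folder path are placeholders, and the same request can be made from any HTTP client):
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json decode_json);

my $token = 'YOUR_ACCESS_TOKEN';    # placeholder

my $res = HTTP::Tiny->new->request(
    'POST',
    'https://api.dropboxapi.com/2/files/list_folder',
    {
        headers => {
            'Authorization' => "Bearer $token",
            'Content-Type'  => 'application/json',
        },
        content => encode_json( { path => '/your/folder' } ),
    },
);
die "request failed: $res->{status}\n" unless $res->{success};

# Each entry carries the file name in its "name" field
my $data = decode_json( $res->{content} );
print $_->{name}, "\n" for @{ $data->{entries} };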

Show license agreement before download

I have to solve the following task for our university homepage:
Whenever a PDF is requested, the user has to accept a license, which pops up.
On Agree, the download starts; otherwise, no download is possible.
I searched through the extensions but did not find any extension doing the job. Maybe you know one...
So I tried to implement my own extension. Taking the strengths of securelinks (Allows access control to files from a configurable directory ... presents a license acceptation prior to download) and naw_securedl ("Secure Download": Apply TYPO3 access rights to ALL file assets (PDFs, TGZs or JPGs etc. - configurable) - protect them from direct access.) I wanted to combine both extensions to have one that:
whenever a pdf file is requested (naw_securedl)
a license is shown and in case of ACCEPT a redirect to the file happens (securelinks).
This task sounds very easy, since I only have to combine both tasks. Anyway, I failed.
How do you solve this problem?
Do you know some extension doing the job?
Is anyone interested in a cooperation in which we try to create an extension that does the job?
Thanks for your help in advance!
Assuming that all downloads are stored in one folder, I'd recommend writing your own little extension that replaces every link with a link to an intermediate site, like this:
www.mydomain.com/acceptlicense.html?downloadfile=myhighqualitycontent.pdf.
On the accept license page, users need to check the accept license checkbox, then click a submit button, which leads them to the download page, still carrying the GET parameter:
www.mydomain.com/download.html?downloadfile=myhighqualitycontent.pdf.
If not all files are in the same folder, you can replace slashes in the file path with other characters (they need to work in the URL). Or you might need a database table that indexes the files, so you can use IDs for the download files:
www.mydomain.com/acceptlicense.html?downloadfileID=99
If you don't know at all how to write TYPO3 extensions, consider using individual php/html files out of the TYPO3 context.
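As a rough, TYPO3-independent illustration of that flow (sketched as a plain Perl CGI script; the parameter names, the license wording and the /protected/ path are all assumptions):
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $q    = CGI->new;
my $file = $q->param('downloadfile') // '';

# Only accept plain file names, so nobody can request arbitrary paths
die "invalid file name\n" unless $file =~ /\A[\w.-]+\.pdf\z/;

if ( $q->param('accept') ) {
    # License accepted: send the user on to the real download
    print $q->redirect("/protected/$file");
}
else {
    # Show the license with an accept checkbox
    print $q->header('text/html'),
        "<form method='get'>",
        "<p>License text goes here ...</p>",
        "<input type='hidden' name='downloadfile' value='$file'>",
        "<label><input type='checkbox' name='accept' value='1'> I accept the license</label> ",
        "<input type='submit' value='Download'>",
        "</form>";
}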

Using wget to download all the hulkshare/mediafire linked files on a page

So I've been trying to set up wget to download all the mp3s from www.goodmusicallday.com. Unfortunately, rather than the mp3s being hosted by the site, the site puts them up on www.hulkshare.com and then links to the download pages. Is there a way to use the recursive and filtering abilities of wget to make it go to each hulkshare page and download the linked mp3?
Any help is much appreciated
So, a friend of mine actually figured out an awesome way to do this; just enter the code below in Terminal:
IFS="";function r { echo $1|sed "s/.*$2=\([^\'\"\&;]*\).*/\1/";};for l in `wget goodmusicallday.com -O-|grep soundFile`;do wget -c `r $l soundFile` -O "`r $l titles`";done
I guess not!
I have tried on several occasions to do scripted downloads from mediafire, but in vain,
and that's the reason why they don't have a simple download link, but instead have a timer attached!
If you look carefully, you will see that the download links point elsewhere (I mean the actual file-hosting server is not www.mediafire.com, but rather something like download666.com).
So I don't think it is possible with wget.
wget can only save the day if the download links are simple HTML links, i.e. a tags.
Regards,

Perl Mechanize module for scraping pdfs

I have a website into which many PDFs are uploaded. What I want to do is download all the PDFs present on the website. To do so, I first need to provide a username and password to the website. After searching for some time, I found the WWW::Mechanize package that does this work. Now the problem arises that I want to make a recursive search of the website, meaning that if a link does not contain a PDF, I should not simply discard the link but should navigate it and check whether the new page has links that contain PDFs. In this way I should exhaustively search the entire website to download all the uploaded PDFs. Any suggestion on how to do this?
I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting PDF URLs from actually downloading them. Crawling is already lengthy processing, and it might be advantageous to be able to hand off downloading tasks to separate worker processes.
You are right about using the WWW::Mechanize module. This module has a method, find_all_links(), in which you can give a regex to match the kind of links you want to grab or follow.
For example:
my $obj = WWW::Mechanize->new;
.......
.......
my @pdf_links = $obj->find_all_links( url_regex => qr/^.+?\.pdf/ );
This gives you all the links pointing to PDF files. Now iterate through these links and issue a get call on each of them.
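For example, continuing with the $obj and @pdf_links from above (the file-name handling is deliberately crude):
for my $link (@pdf_links) {
    my $url = $link->url_abs->as_string;          # absolute URL of the PDF
    ( my $name = $url ) =~ s{.*/}{};              # last path segment as the local file name
    $obj->get( $url, ':content_file' => $name );  # fetch the PDF and save it to disk
}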
I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/

Website Address Bar graphic

How do you get the graphic that appears next to the address bar to show up?
I have a very simple site, and I want to make a custom image, and have it show up when a user is on my site.
Thanks
Place a 16x16 favicon.ico file in your website's root directory.
You can produce one in any number of graphic editing programs (including paint).
http://www.photoshopsupport.com/tutorials/jennifer/favicon.html
I found this on Wikipedia, which seems to describe well what you are asking:
http://en.wikipedia.org/wiki/Favicon
What you're after is a favicon.ico file; check out the Wikipedia article for more details.
There are also a number of online generators out there. See Google for more =)
favicon.ico files are also used by some browsers when a user bookmarks your site.
Make a square image and edit it the way you want. Make sure that you don't add any small details, since the icon is tiny. Next, go to http://www.favicon.cc/ . This site will convert your image to the supported .ico format. Rename the file favicon.ico and put it in the root directory of your site. You should now have a favicon in your browser.
This icon is a favicon, and you have to upload a file to your site if you want it. But if you want your favicon to work properly in all browsers, you will have to add more than 10 files in the correct sizes and formats.
My friend and I have created an app just for this! You can find it at faviconit.com.
We did this so people don't have to create all these images and the correct tags by hand; creating all of them used to annoy me a lot!
The generated file comes with a small explanation of what to do with the files. We will make it better, but it's a start.
Hope it helps!