Wget to download HTML

I have been trying to download the HTML from http://osu.ppy.sh/u/2330158 to get the Historical data,
but it doesn't download that part. Nor does it download the General, Top Ranks, etc. sections.
Is there a way to make wget download them?

That part of the page is loaded dynamically, so wget won't see it, as wget doesn't support JavaScript. However, if you open the web developer tools in your browser of choice and then load the main page, you can find the URL you're really after. For this page, it's: http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0
Luckily, it's another simple, parameterised URL so you can feed that to wget:
wget "http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0"
That'll get you an HTML document containing just the historical data you're looking for.

Related

wget not returning same result as clicking on link in chrome browser

I am trying to use wget to download an audio file from a link (which has no file extension). Clicking the link in a browser automatically starts a .wav download, but running wget on the same link returns a file without an extension. Passing -O file.wav does not help, as the downloaded file is not valid audio.
I have tried
wget -O test.wav "[DOWNLOAD LINK]"
The above downloads a file into my directory, but it is not audio.
My problem can be replicated by going to https://captcha.com/demos/features/captcha-demo.aspx and clicking on the href associated with the element of class BDC_SoundLink.
Questions:
Is there a way to get wget to return the same result as clicking the link?
Is there a way to convert the non-audio file into the intended audio file after wget has done whatever it does?
Any help would be much appreciated!
The thing is that when you use wget here, you're actually downloading a text document, because the MIME type of the response is text.
When you browse the website in your web browser, the browser first obtains the current captcha code from the server, and only then can it download the sound file that matches that code; you can see that request in the browser's dev tools.
The sound file is tied to the captcha itself, and each time you reload the captcha image, the backend C# code of the ASP.NET page generates a new captcha code.
That's why you can't download the sound file that way.

How can I "drill down" into a website using Perl's WWW::Mechanize

I have used the WWW::Mechanize Perl module on a number of projects and it's helped me out a lot.
I am trying to use it on a different site and I can't "drill down" into the content of the site.
The site is https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services
I have tried to figure out what the URLs would be to get the content shown in the resulting HTML, such as bb.d570283b87c834518ba9.css, bb.d570283b87c834518ba9.js and version.js.
I tried to copy the resulting HTML into this posting, but whatever combination of quoting and code-sample formatting I used, it wouldn't display properly.
Does anyone have any idea how I can "navigate" this site using this Perl module, please?
WWW::Mechanize is a web client with some HTML parsing capabilities. But as you clearly noticed, the information you want is not in the HTML document you requested. Either download the correct document (whatever that might be), or do what the browser does and execute the JavaScript. This would require a JavaScript engine. The simplest way to achieve that is to remote-control a web browser (e.g. using Selenium::Chrome).
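For example, here's a minimal sketch of that approach, assuming the Selenium::Remote::Driver distribution (which provides Selenium::Chrome) and a matching chromedriver binary are installed; the fixed sleep is only a crude stand-in for a proper wait:
#!/usr/bin/perl
use strict;
use warnings;
use Selenium::Chrome;

# Start a Chrome instance via chromedriver and load the page so its JavaScript runs.
my $driver = Selenium::Chrome->new;
$driver->get('https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services');
sleep 5;    # crude wait for the JavaScript to populate the page

# Grab the rendered DOM, which is what you actually want to parse.
my $html = $driver->get_page_source;
$driver->shutdown_binary;

print length($html), " bytes of rendered HTML\n";
The rendered markup in $html can then be parsed with HTML::TreeBuilder or Mojo::DOM, or handed back to WWW::Mechanize via its update_html method.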

using wget to download all data from a webpage

I need to be able to download just the data from the page into a text file, to parse later with a different program. I've used this syntax with other sites and it works perfectly, but I've run into a problem with one website.
Here's the site and the syntax I'm using:
WGET.EXE http://quotes.morningstar.com/fund/AAAAX/f?t=AAAAX -O AAAAX.TXT --no-check-certificate -owebdata/logfile.txt
This downloads the page, but key data I need is not there. For example, the Expenses, Turnover and status data are missing.
I know the page uses a sub-program (a script) to produce that data, but I thought wget was capable of downloading the resulting output to a file; I'm just unclear which flag or option to set to make it do that.
The Expenses, Turnover and other status data are filled in by JavaScript on the page. As far as I know, you cannot fetch that with wget, because it is generated on the client side when the JavaScript runs in the browser. As with the first question above, your best bet is to open the browser's developer tools, find the request that actually returns the data, and fetch that URL directly.

Using wget to download all the hulkshare/mediafire linked files on a page

So I've been trying to set up wget to download all the mp3s from www.goodmusicallday.com. Unfortunately, rather than the mp3s being hosted by the site, the site puts them up on www.hulkshare.com and then links to the download pages. Is there a way to use the recursive and filtering abilities of wget to make it go to each hulkshare page and download the linked mp3?
Any help is much appreciated
So, a friend of mine actually figured out an awesome way to do this; just enter the code below in a terminal:
IFS=""
function r { echo $1|sed "s/.*$2=\([^\'\"\&;]*\).*/\1/"; }
for l in `wget goodmusicallday.com -O-|grep soundFile`; do
    wget -c `r $l soundFile` -O "`r $l titles`"
done
I guess not. I have tried on several occasions to do scripted downloads from MediaFire, but in vain.
That's the reason they don't offer a simple download link and instead attach a timer to it.
If you look carefully, you will see that the actual file-hosting server is not www.mediafire.com, but rather something like download666.com.
So I don't think it is possible with wget.
wget can only save the day when the download links are plain HTML links, i.e. ordinary a tags.
Regards,

Perl Mechanize module for scraping pdfs

I have a website into which many PDFs are uploaded. What I want to do is download all the PDFs present on that website. To do so I first need to provide a username and password to the site. After searching for some time I found the WWW::Mechanize package, which does this job. The problem now is that I want to search the website recursively: if a link does not point to a PDF, I should not simply discard it, but should follow it and check whether the new page contains links to PDFs. In this way I should exhaustively search the entire website and download all the uploaded PDFs. Any suggestion on how to do this?
I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting the PDF URLs from actually downloading them. Crawling is already lengthy processing, and it might be advantageous to be able to hand off the downloading to separate worker processes.
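As a rough sketch of that split (the start URL, the credentials and the same-host check below are placeholders), the following crawls pages on one site with WWW::Mechanize, writes every PDF URL it finds to pdf-urls.txt, and leaves the downloading to a separate step:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $start = 'http://www.example.com/';              # placeholder start URL
my $mech  = WWW::Mechanize->new( autocheck => 0 );
$mech->credentials( 'USERNAME', 'PASSWORD' );       # placeholder credentials

my %seen;
my @queue = ($start);
open my $out, '>', 'pdf-urls.txt' or die $!;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    $mech->get($url);
    next unless $mech->success && $mech->is_html;
    for my $link ( $mech->find_all_links ) {
        my $abs = $link->url_abs->as_string;
        if ( $abs =~ /\.pdf$/i ) {
            print {$out} "$abs\n" unless $seen{$abs}++;   # collect now, download later
        }
        elsif ( $abs =~ /^\Q$start\E/ ) {                 # stay within the same site
            push @queue, $abs;
        }
    }
}
close $out;
The list in pdf-urls.txt can then be handed to something like wget -i pdf-urls.txt --user=LOGIN --password=PASSWORD, or split across several worker processes.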
You are right about using the WWW::Mechanize module. This module has a find_all_links() method, to which you can pass a regex matching the kind of links you want to grab or follow.
For example:
my $obj = WWW::Mechanize->new;
.......
.......
my @pdf_links = $obj->find_all_links( url_regex => qr/^.+?\.pdf/ );
This gives you all the links pointing to PDF files. Now iterate through these links and issue a get call on each of them.
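A minimal sketch of that last step, assuming $obj is the WWW::Mechanize object from above and that each URL ends with the PDF's filename:
for my $link (@pdf_links) {
    my ($filename) = $link->url_abs =~ m{([^/]+\.pdf)$}i;   # last path segment as the local filename
    $filename ||= 'download.pdf';                           # fallback name
    $obj->get( $link->url_abs );
    $obj->save_content($filename);                          # write the fetched PDF to disk
}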
I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/