using wget to download all data from a webpage

I need to be able to download just the data from the page into a text file to parse later with a different program. I've used this syntax with other sites and it works perfectly, but I've run into a problem with one web site.
Here's the site and the syntax I'm using:
WGET.EXE http://quotes.morningstar.com/fund/AAAAX/f?t=AAAAX -O AAAAX.TXT --no-check-certificate -owebdata/logfile.txt
This downloads the page, but key data I need to see is not there. For example:
Expenses, Turnover, and status data are not there.
I know the page is using a sub-program to produce the data, but I know WGET is capable of just downloading the output to a file; I'm just unclear what flag or option to set to make it do that.

The expenses, turnover, and other status data are set using JavaScript on the page. As far as I know, you cannot wget that, as it is generated on the client side when the JavaScript runs in the browser.
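As a rough illustration (not part of the original answer), the usual workaround is to open the browser's developer tools, watch the network requests the page's JavaScript makes, and fetch that underlying data URL directly with wget. The endpoint below is purely hypothetical:
# Hypothetical: fetch the JSON/XHR endpoint that the page's JavaScript calls,
# as found in the dev tools Network tab (this exact URL is made up)
wget --no-check-certificate -O AAAAX-data.json "https://example.com/api/fund/AAAAX/keystats"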

Related

wget not returning the same result as clicking on a link in the Chrome browser

I am trying to use wget to download an audio file from a link (which has no file extension). The issue is that clicking this link automatically starts a .wav file download, but wget on the same link returns a file without a file extension. Passing -O file.wav does not help, as the file itself is not valid audio.
I have tried
wget -O test.wav "[DOWNLOAD LINK]"
The above downloads a file in my directory which is not audio.
My problem can be replicated by going to https://captcha.com/demos/features/captcha-demo.aspx and clicking on the href associated with the element with class BDC_SoundLink.
Questions:
Is there a way to get wget to return the same result as clicking the link?
Is there a way to convert the non-audio file into an audio file after wget does whatever it does?
Any help would be much appreciated!
The thing is that when you use wget, you're actually downloading a text file, because the MIME type is text.
When you browse the website through your web browser, it actually gets the right captcha code from the server, and then you're able to download the file with the right captcha code. You can see in the dev tools that the captcha code is part of the request.
This sound file is linked to the captcha itself, and each time you reload the captcha picture, the back-end C# code of the ASP.NET page generates a new captcha code.
That's why you can't download the sound file that way.
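If you still want to experiment, here is a rough sketch of the idea of reusing the browser's session with wget: save the cookies the demo page sets, then replay the exact sound URL (including its captcha-code parameter) captured from the dev tools Network tab. The query string below is an assumption, and as explained above, the code may already be stale by the time you request it:
# Assumption: keep whatever session cookies the demo page sets
wget --keep-session-cookies --save-cookies cookies.txt -O demo.html "https://captcha.com/demos/features/captcha-demo.aspx"
# Assumption: replay the sound URL copied from the browser's Network tab, sending the same
# cookies; the parameters shown here are made up
wget --load-cookies cookies.txt -O test.wav "https://captcha.com/demos/features/captcha-demo.aspx?get=sound&c=exampleCaptcha&t=MADE_UP_CODE"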

Download message from Google group

I need to download an archived google group.
The following link is one of the messages of that group, as an example.
https://groups.google.com/forum/#!topic/sci.aeronautics/ViFtpXfVm7M
The problem is, what I see in the browser does not appear in the downloaded web page.
With my very limited knowledge, it seems to me that the reason is that this content is dynamically created by JavaScript. Or else, are these downloaded files in the so-called 'mbox' format, which is encrypted?
What I've tried so far
First tries
Simple download
wget https://groups.google.com/d/topic/sci.aeronautics/ViFtpXfVm7M
With mirror
wget --mirror https://groups.google.com/d/topic/sci.aeronautics/ViFtpXfVm7M
Assuming it's encrypted
With cookies.
wget --load-cookies=cookies.txt https://groups.google.com/d/topic/sci.aeronautics/ViFtpXfVm7M
Got Thunderbird set up with my Gmail and tried opening it there; it did not open correctly.
Assuming the content was JavaScript-generated
Downloaded using phantomJS
https://askubuntu.com/questions/411540/how-to-get-wget-to-download-exact-same-web-page-html-as-browser
Downloaded using phantomJS with a different script
https://gist.github.com/giocomai/247d54e097b5083e2451
Used scripts available from GitHub
https://github.com/henryk/gggd
https://github.com/icy/google-group-crawler
But none of these have worked so far.
Can anyone please shed some light on how to download this page with its messages as a readable HTML or text file?
Cheers
AyyoSalli
You could use https://groups.google.com/forum/feed/sci.aeronautics/msgs/atom.xml?num=100 to get some of the posts - but it only gets roughly half the posts in this case.
And it has all the messages from all topics together.
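For example, the feed can be fetched with wget directly (the output filename here is just a suggestion):
wget -O sci.aeronautics-feed.atom.xml "https://groups.google.com/forum/feed/sci.aeronautics/msgs/atom.xml?num=100"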
View it in Firefox or Classic Opera to see it directly in a more human-readable form.
But since you say you already got a file in standard mbox format, what exactly is wrong with it? Did you attempt to import it into a locally installed email or news client (like Thunderbird)?

Wget to download html

I have been trying to download the HTML from http://osu.ppy.sh/u/2330158 to get the Historical data,
but it doesn't download that part. Nor does it download General, Top Ranks, etc.
Is there a way to make wget download it?
That part of the page is loaded dynamically, so wget won't see it, as it doesn't support JavaScript. However, if you open the web developer tools in your browser of choice and then load the main page, you can get the URL you're really after. For this page, it's: http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0
Luckily, it's another simple, parameterised URL so you can feed that to wget:
wget "http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0"
That'll get you an html document containing just the historic data you're looking for.

Getting Data from website

So the website constantly changes the data it displays, and I want to get that data every several seconds and log it in a spreadsheet. The problem is that in order to get to the page, I have to have a cookie, which I get when I log in. Unfortunately, I only know how to program in MATLAB. MATLAB has a function for this, urlread, but it doesn't deal with cookies. What can I do to get to that page? Can anyone help me with this? Point me in a direction where a programming noob like me can succeed, please.
You could use wget to download content while using HTTP cookies. I will be using StackOverflow.com as an example target. Here are the steps to follow:
1) Obtain the wget command-line tool. For Mac or Linux, I think it is already available. On Windows, you can get it from the GnuWin32 project or from one of the many other ports (Cygwin, MinGW/MSYS, etc.).
2) Next we need to obtain an authenticated cookie by logging into the website in question. You can use your preferred browser for this.
In Internet Explorer, you can produce it using "File menu > Import and Export > Export Cookies". In Firefox, I used the Cookie Exporter extension to export cookies to a text file. For Chrome, there should be similar extensions.
Obviously you only need to do this step once, as long as the cookies have not yet expired!
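For reference, wget expects the exported file in the Netscape cookies.txt format: one tab-separated line per cookie with the fields domain, include-subdomains flag, path, secure flag, expiry (Unix timestamp), name, and value. The cookie values below are made up:
# Netscape HTTP Cookie File
.stackoverflow.com	TRUE	/	FALSE	1735689600	usr	123456%26EXAMPLE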
3) Once you locate the exported cookie file, we can use wget to fetch the web page and provide it with this cookie. This can, of course, be performed from inside MATLAB using the SYSTEM function:
%# fetch page and save it to disk (wget names the output index.html by default;
%# cookies are enabled by default in wget, so --load-cookies is all we need)
url = 'http://stackoverflow.com/';
cmd = ['wget --load-cookies=./cookies.txt ' url];
system(cmd, '-echo');
%# process page: here I am simply viewing it in the embedded browser
web( ['file:///' strrep(fullfile(pwd,'index.html'),'\','/')] )
Parsing the web page is a whole other topic that I will not go into. Once you get the data you seek, you can interact with Excel spreadsheets using the XLSREAD and XLSWRITE functions.
4) Finally, you can wrap this in a function and make it execute at regular intervals using the TIMER function.
Try using the java.net.* classes.
You should be able to use them directly in the MATLAB workspace, as described here: http://www.mathworks.co.uk/help/techdoc/matlab_external/f4863.html
MATLAB has built-in functions for web downloading. For HTTP sites, there are webread.m and websave.m. For FTP, there is mget.m.

viewing autocomplete.do files

I was trying to reverse engineer a website ("www.asklaila.com") to find out how their Yahoo UI AutoComplete widget works. Upon viewing its source, I saw it referring to a file called "/autocomplete.do". I wanted to know what this autocomplete.do file means and whether I can download and open it locally on my machine.
I hope my request is legitimate and ethical.
As explained by FileInfo.com, the .do extension represents a server-side Java code file that runs on the server and outputs HTML to the response.
Therefore, you cannot download it and view its contents. Any requests to the file will either return the same HTML or an HTTP error if it requires parameters/form fields.