Saving an HTML page from the MATLAB web browser

Following this question, I get a message on the retrieved page that "Your browser does not support JavaScript so some functionality may be missing!"
If I open this page with web(url) in the MATLAB web browser and accept the certificate (once per session), the page opens properly.
How can I save the page source from the browser with a script? Or from the system browser? Or maybe there is a way to get that page even without a browser?
url='https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';

From what I could tell, the page source gets downloaded just fine; just make sure to let JavaScript run when you open the saved page locally.
[...]
<script type='text/javascript' src='../js/hgTracks.js'></script>
<noscript><b>Your browser does not support JavaScript so some functionality may be missing!</b></noscript>
[...]
Note that the solution you are using only downloads the web page itself without any of the attached files (images, .css, .js, etc.).
What you can do is call wget to get the page with all of its files:
url = 'https://cgwb.nci.nih.gov/cgi-bin/hgTracks?position=chr7:55054218-55242525';
command = ['wget --no-check-certificate --page-requisites ' url];
system( command );
If you are on a Windows machine, you can always get wget from the GnuWin32 project or from one of the many other implementations.

Will saving cookies be sufficient to solve your problem? wget can do that with --keep-session-cookies and --save-cookies filename; you then use --load-cookies filename to get your cookies back on subsequent requests. Something like the following (note that I have not tested this from MATLAB, so the quoting etc. might not be exactly right, but I use a similar shell construction in other contexts):
command_init = ['wget --no-check-certificate --page-requisites ' ...
                '--keep-session-cookies --save-cookies cookie_file.txt ' ...
                '--post-data ''user=X&pass=Y&whatever=TRUE'' ' init_url];
command_get = ['wget --no-check-certificate --page-requisites ' ...
               '--load-cookies cookie_file.txt ' url];
If you don't have any POST data, but rather subsequent GETs will update the cookies, you can simply use --keep-session-cookies and --save-cookies on each successive GET request.
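For example, a minimal untested sketch of that cookie-refreshing pattern at the shell (the two URLs are placeholders):
# First request: create the cookie jar.
wget --no-check-certificate --keep-session-cookies --save-cookies cookie_file.txt "$FIRST_URL"
# Later requests: load the jar and keep writing updates back to it.
wget --no-check-certificate --keep-session-cookies --load-cookies cookie_file.txt --save-cookies cookie_file.txt "$NEXT_URL"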

Related

Do not create directory when using wget with `--page-requisites` option

I'm using wget with the --page-requisites option. I'd like to combine this option with --directory-prefix. So, for example, calling wget --page-requisites --directory-prefix=/tmp/1 https://google.com would download the Google page to the /tmp/1/ directory without creating its own folder (like google.com).
I'd expect the google homepage to end up at /tmp/1/index.html
Is there a way to do this without creating some kind of script that would move the files where I want them to be?
OK, using the --no-directories option seems to do the trick.
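For example, combining the options from the question with --no-directories (untested sketch):
# All files, including index.html and its requisites, land directly in /tmp/1/.
wget --page-requisites --no-directories --directory-prefix=/tmp/1 https://google.com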

wget download a section of a website

I want to download a particular section of a website. I am following this wget - Download a sub directory. But the problem is that the section of the website does not have any particular URL, i.e. the URLs go like http://grephysics.net/ans/0177/* where * is a number from 1-100, and I can't use http://grephysics.net/ans/0177 in wget. How do I download these 100 webpages with links to each other (i.e. the Previous and Next buttons should link to the local copies)?
I think this is what you need:
wget -p -k http://grephysics.net/ans/0177/{1..100}
Explanation:
-k : rewrites links to point to local assets
-p : get all images, js, css, etc. needed to display the page
{1..100} : this specifies a range of urls to download, in your case we have pages labelled 1 to 100.
Why didn't recursive downloading work?
The link you posted was a good first resource, probably what most people would want. But the way wget downloads recursively is by getting the first page specified (i.e. the root) and then following links to child pages. The way grephysics is set up, however, http://grephysics.net/ans/0177 leads to a 404, so there are no links for wget to follow to download the child pages.
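For reference, the recursive attempt would have looked something like this, and it fails because the root URL itself returns a 404, leaving wget nothing to recurse from:
# Does not work here: the starting page 404s, so no child links are discovered.
wget -r -np -p -k http://grephysics.net/ans/0177/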
If your wget doesn't support {}
You can still have the same results by using the following command:
for i in {1..100}; do echo $i; done | wget -p -k -B http://grephysics.net/ans/0177/ -i -
Explanation
for i in {1..100};... : This prints the values 1 to 100.
| : For anyone who hasn't seen this, we are piping the output of the previous command into the input of the following command
-p : get all images, js, css, etc. needed to display the page
-k : rewrite the links to point to the local copies
-B : specifies the base URL to use with the -i option
-i : reads a list of URLs to fetch from a file. Since we specified the 'file' as -, it reads from stdin.
So, we read in the values 1 to 100, append them to our base URL
http://grephysics.net/ans/0177/, fetch all of those URLs and all the assets that go with them, and then rewrite the links so we can browse offline.

How to download a file from box using wget?

I've created a direct link to a file in Box. The previous link is to the browser web interface, so I've then shared it with a direct link.
However, if I download the file with wget I receive garbage.
How can I download the file with wget?
I was able to download the file by making the link public, then replacing /s/ in the URL with /shared/static.
So my final command was:
curl -L https://MYUNI.box.com/shared/static/EXAMPLEtzwosac6pz --output myfile.zip
This can probably be modified for wget.
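For instance, a possible wget equivalent (untested; the /shared/static link is the same example placeholder as above):
# wget follows HTTP redirects by default; -O names the output file.
wget -O myfile.zip "https://MYUNI.box.com/shared/static/EXAMPLEtzwosac6pz"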
I might be a bit late to the party, but FWIW:
I tried to do the same thing in order to download a folder.
I went to the Box UI and opened the browser's network tab in the developer tools.
Then I clicked on download and copied the first generated request as cURL; it was something like this (many headers and options removed for readability):
curl 'https://app.box.com/index.php?folder_id=122215143745&rm=box_v2_zip_folder'
The response to this request is a JSON object containing a link for downloading the folder:
{
    "use_zpdl": "true",
    "result": "success",
    "download_url": <some long url>,
    "progress_reporting_url": <some other url>
}
I then executed wget -L <download_url> and was able to download the file using wget
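Putting those two steps together, a rough untested sketch (it assumes jq is installed and that, in practice, the request would also need the cookies and headers copied from the browser, which are omitted here):
# Ask Box for the folder's zip descriptor, then fetch the reported download_url.
download_url=$(curl -s 'https://app.box.com/index.php?folder_id=122215143745&rm=box_v2_zip_folder' | jq -r '.download_url')
wget -O folder.zip "$download_url"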
The solution was to add the -L option to follow the HTTP redirect:
wget -v -O myfile.tgz -L https://ibm.box.com/shared/static/xxxxx.tgz
What you can do in 2022 is something like this:
wget "https://your_university.app.box.com/index.php?rm=box_download_shared_file&vanity_name=your_private_name&file_id=f_your_file_id"
You can find this link in the POST request shown in Google Chrome's network tab (in an incognito window). Note that the double quotes are needed so the shell does not interpret the special characters in the URL.

Creating a static copy of a MoinMoin site

I have a MoinMoin site which I've inherited from a previous system
administrator. I'd like to shut it down but keep a static copy of the
content as an archive, ideally with the same URLs. At the moment I'm
trying to accomplish this using wget with the following parameters (assembled into a full command below):
--mirror
--convert-links
--page-requisites
--no-parent
-w 1
-e robots=off
--user-agent="Mozilla/5.0"
-4
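Put together, the invocation looks something like this (the wiki URL is a placeholder):
wget --mirror --convert-links --page-requisites --no-parent -w 1 -e robots=off --user-agent="Mozilla/5.0" -4 https://wiki.example.org/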
This seems to work for getting the HTML and CSS, but it fails to
download any of the attachments. Is there an argument I can add to wget
which will get round this problem?
Alternatively, is there a way I can tell MoinMoin to link directly to
files in the HTML it produces? If I could do that then I think wget
would "just work" and download all the attachments. I'm not bothered
about the attachment URLs changing as they won't have been linked to
directly in other places (e.g. email archives).
The site is running MoinMoin 1.9.x.
My version of wget:
$ wget --version
GNU Wget 1.16.1 built on linux-gnu.
+digest +https +ipv6 +iri +large-file +nls +ntlm +opie -psl +ssl/openssl
The solution in the end was to use MoinMoin's export dump functionality:
https://moinmo.in/FeatureRequests/MoinExportDump
It doesn't preserve the file paths in the way that wget does, but has the major advantage of including all the files and the attachments.
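For reference, a very rough sketch of how that dump might be invoked from the command line; I have not verified these option names against MoinMoin 1.9, so treat them as assumptions and consult the built-in help of the moin script:
# Assumed flags; --config-dir, --wiki-url and --target-dir may be named differently in your version.
moin --config-dir=/path/to/wikiconfig --wiki-url=http://wiki.example.org/ export dump --target-dir=/path/to/static-copy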

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the URL will have a query string, and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing the output; the name that appears after Saving to: is the one I need.
However, this is wrapped in a strange character (â); rather than just removing it with a hardcoded fix, I'd like to know where it comes from.
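A rough, untested sketch of that output-parsing idea; note that wget wraps the name after Saving to: in typographic quotes (‘ ’), which is most likely what shows up as the strange â character when the output is read with the wrong encoding:
# Capture wget's output, pull the first reported filename, strip the quote characters, rename.
out=$(wget -p -k "http://www.onlinetechvision.com/?p=566" 2>&1)
saved=$(printf '%s\n' "$out" | grep -m1 'Saving to:' | sed 's/^.*Saving to: //; s/^[‘"]//; s/[’"]$//')
[ -n "$saved" ] && mv "$saved" index.html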
If you try the --adjust-extension parameter:
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you get closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is then simple to rename that file to index.html, even with a script.
If you are on a Microsoft OS, make sure you have a recent version of wget - it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files downloaded)
loop over each directory to fetch its main entry file (index.html)
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read line; do
  wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domains=<domain> --strict-comments http://${line}/
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull in the content from another page, with a script on the server side (it may be client side if you look at the JavaScript).
Have you tried using --no-cookies? The site could be storing this information via a cookie and pulling it in when you hit the page. This could also be caused by URL rewrite logic, which you will have little control over from the client side.
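For example, the --no-cookies suggestion applied to the command from the question (untested):
wget -p -k --no-cookies "http://www.onlinetechvision.com/?p=566"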
Use the -O or --output-document option; see http://www.electrictoolbox.com/wget-save-different-filename/
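For example (untested; note that -O writes everything wget downloads into the one named file, so it is best combined with a plain single-page fetch rather than -p):
wget -O index.html "http://www.onlinetechvision.com/?p=566"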