Download and save image from a website using wget - command-line

How do I download and save the particular image from the following web page using wget.
http://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?SceneView&ImageID=509617654
I tried this
"C:\Program Files (x86)\GnuWin32\bin\wget" -r -P "C:\temp\" -A jpeg,jpg,bmp,gif,png "http://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx?SceneView&ImageID=509617654"
But the image did not download and save. I am using Windows 7. I guess I am not getting the image since the web page is not a proper html page (no html or asp etc extension). Am I correct?

Not exactly. A file extension is not required for URLs containing HTML (e.g. http://google.com/).
By inspecting the HTML source (ignoring that the page has invalid HTML (<script> tag in between <head> and <body>)), we can see it's using JavaScript to alter the image's src attribute on page load (why, who knows...) to /GetBinary.aspx?Scene&ImageID=509617654&CaseID=&Version= (relative to the HTML page).
As wget can't execute JS, this will never work (like this).
However the actual image URL does return a JPEG image, but you'll have to rename it, as, also, the web server (IIS) is misconfigured, as for that URL it returns a header:
Content-Type: E:\Sites\NASS\CDS\/img/jpg
which is invalid, and causes file association problems when downloading in most browsers / clients.
To prove it's there, you can try downloading it directly with wget:
wget "http://www-nass.nhtsa.dot.gov/nass/cds/GetBinary.aspx/GetBinary.aspx?Scene&ImageID=509617654&CaseID=&Version=" -O image.jpg

Related

How to get wget to go down and up host hierarchy

wget recurses to the second-bottom level and goes no further. If I specify the bottom level HTML file as the source, it parses it and goes further. I think this may be caused by the PDF files linked off the HTML document being in an different root file path on the server. I need it to retrieve all the PDF files off the leaves of this hierarchy since I am going to promote them together as part of a campaign for depression awareness.
I am using GNU Wget 1.19.4 built on linux-gnu.
I have tried, --exclude, --exclude-directory, -l2, -l10, --continue and many other switches. I need to use the --include commands or wget grabs the entire site. If I use -np it won't go "up" into /docs
This code gets me the HTML files but does not follow links in the "bottom most"
HTML files.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
This code, when I manually specify the HTML file, gets the PDF files I want in it.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
I want it to visit all the HTML files in this branch, get out all the PDF links in them, and retrieve all the PDF files from /docs
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
Here is one of the PDFs. The /docs directory does not have a listing.
https://www.beyondblue.org.au/docs/default-source/research-project-files/online-forums-2015-report.pdf?sfvrsn=3d00adea_2
The best I can get wget to do is walk the site and get HTML files down to this level:
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
https://www.beyondblue.org.au/about-us/research-projects/research-projects/networks-of-advocacy-and-influence-peer-mentors-in-beyond-blue-s-mental-health-forums
...
150 of them
It seems like a depth-limiting setting or a path traversal limitation or something. I suspect it's an easy one to spot.
Thanks again!
Alright it looks like wget might be breadth first. This means gets everything in the directory before recursing into pages. I'm not sure of this but I let the below run and it seemed to get all the leaf HTML files, but then recurse into them after it had got all of them.
wget -r --verbose --include /docs/default-source/research-project-files/,/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
Certainly running this and stopping it when it seemed to halt at the bottom HTML layer and not get the PDFs was stopping it too early.

How to download a file from box using wget?

I've created a direct link to a file in box:
The previous link is to the browser web interface, so I've then shared with a direct link:
However, if I download the file with a wget I receive garbage.
How can I download the file with wget?
I was able to download the file by making the link public, then replacing /s/ in the url with /shared/static
So my final command was:
curl -L https://MYUNI.box.com/shared/static/EXAMPLEtzwosac6pz --output myfile.zip
This can probably be modified for wget.
I might be a bit late to the party, but FWIW:
I tried to do the same things in order to download a folder.
I went to the box UI and opened the browser's network tab on the developer tools.
Then I clicked on download and copied as cURL the first link generated, it was something like (removed many headers and options for readability)
curl 'https://app.box.com/index.php?folder_id=122215143745&rm=box_v2_zip_folder'
The response of this request is a json object containing a link for downloading the folder:
{
"use_zpdl": "true",
"result": "success",
"download_url": <somg long url>,
"progress_reporting_url": <some other url>
}
I then executed wget -L <download_url> and was able to download the file using wget
The solution was to add the -L option to follow the HTTP redirect:
wget -v -O myfile.tgz -L https://ibm.box.com/shared/static/xxxxx.tgz
What you can do in 2022 is something like this:
wget "https://your_university.app.box.com/index.php?rm=box_download_shared_file&vanity_name=your_private_name&file_id=f_your_file_id"
You can find this link in the POST method in an incognito under Google Chrome's network tab. Note that the double quotes escape characters.

Creating a static copy of a MoinMoin site

I have a MoinMoin site which I've inherited from a previous system
administrator. I'd like to shut it down but keep a static copy of the
content as an archive, ideally with the same URLs. At the moment I'm
trying to accomplish this using wget with the following parameters:
--mirror
--convert-links
--page-requisites
--no-parent
-w 1
-e robots=off
-user-agent="Mozilla/5.0"
-4
This seems to work for getting the HTML and CSS, but it fails to
download any of the attachments. Is there an argument I can add to wget
which will get round this problem?
Alternatively, is there a way I can tell MoinMoin to link directly to
files in the HTML it produces? If I could do that then I think wget
would "just work" and download all the attachments. I'm not bothered
about the attachment URLs changing as they won't have been linked to
directly in other places (e.g. email archives).
The site is running MoinMoin 1.9.x.
My version of wget:
$ wget --version
GNU Wget 1.16.1 built on linux-gnu.
+digest +https +ipv6 +iri +large-file +nls +ntlm +opie -psl +ssl/openssl
The solution in the end was to use MoinMoin's export dump functionality:
https://moinmo.in/FeatureRequests/MoinExportDump
It doesn't preserve the file paths in the way that wget does, but has the major advantage of including all the files and the attachments.

How can i download a report using wget

I am trying to use WGET to automate download of a file which is generated by a report server (PDF format). However, the problem i am having is that the file name is never known (generated randomly by the server) and the URL accepts parameters that will change eg. Date=. Name= ID= etc.
For Example if i were to pass http://url.com/date=&name=&id= in internet explorer, i will get a download dialog prompting me to download a file with file xyz123.pdf
Is it possible that i can use WGET to pass these parameters to the report server and automatically download the generated PDF file
Just put the full url in quotes - It should go and fetch the file:
wget "http://url.com/date=foo&name=baa&id=baz"
Thanks,
//P

wget download name

I'd like to write a function that, given a URL, returns the name of the file downloaded by wget URL.
I don't understand the behavior of wget very well. If I do wget on python.org, www.python.org, http://www.python.org, or http://www.python.org/, the name of the file downloaded is index.html.
However, if I do www.python.org/about, the name of the file downloaded is about, instead of index.html.
The reason your wget fetches index.html in the first cases is because that's the default "home page" that the server points to. python.org, www.python.org, http://www.phython.org, and http://www.python.org/ aren't files, so the server points wget to index.html. It points your browser there, too, though you don't usually see it. www.python.org/about is a different page, so it makes sense that the file it downloads has a different name.
Might I recommend the man page for wget if you want to know how it works? If it's the name of the downloaded file that concerns you, you have the option to change it via the -O option.