Wget - downloading all files in a webpage?

I'm using this wget command to download all the .fits files from that URL:
wget -r -np -nd -l inf -A fits https://archive.stsci.edu/missions/tess/ete-6/tid/00/000/000/057/
This is adapted from this answer.
All I'm getting is a directory structure that mirrors the URL on the website all the way down to /057/, but there are no files.
If I add -nd, then I only get a robots.txt file, which isn't very instructive, but still no files.
What am I not getting about how to use wget for this?
EDIT: based on Turgbek's answer below, I do see that the robots.txt file from that website actually has /missions/ under "Disallow"... maybe this is what's preventing the wget command from working? Is that the source of the problem? How can I get around it?

In robots.txt there is a statement:
Disallow: /missions/
Your requested files fall under that path. Since the URL builds up as /missions/tess/ete-6/tid/00/000/000/057/, I believe robots.txt is blocking you.
I saved two of the files from that URL on my Raspberry Pi and ran a local test without robots.txt, using this command:
wget -r -np -nd -l inf -A fits 192.168.1.250/test/
It worked as intended and I received both files.
--2018-05-03 23:46:51-- http://192.168.1.250/test/tess2019128220341-0000000005712108-0016-s_lc.fits
Reusing existing connection to 192.168.1.250:80.
HTTP request sent, awaiting response... 200 OK
Length: 2090880 (2.0M)
Saving to: `192.168.1.250/test/tess2019128220341-0000000005712108-0016-s_lc.fits'
100%[==============================================================================>] 2,090,880 3.77M/s in 0.5s
2018-05-03 23:46:51 (3.77 MB/s) - `192.168.1.250/test/tess2019128220341-0000000005712108-0016-s_lc.fits' saved [2090880/2090880]
--2018-05-03 23:46:51-- http://192.168.1.250/test/tess2019128220341-0000000005715814-0016-s_lc.fits
Reusing existing connection to 192.168.1.250:80.
HTTP request sent, awaiting response... 200 OK
Length: 2090880 (2.0M)
Saving to: `192.168.1.250/test/tess2019128220341-0000000005715814-0016-s_lc.fits'
100%[==============================================================================>] 2,090,880 4.61M/s in 0.4s
2018-05-03 23:46:52 (4.61 MB/s) - `192.168.1.250/test/tess2019128220341-0000000005715814-0016-s_lc.fits' saved [2090880/2090880]
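If robots.txt is indeed what's blocking you (see the asker's EDIT), wget can be told to ignore it. A minimal sketch, using the original command plus -e robots=off; note that this deliberately bypasses the site's stated crawl policy, so use it considerately:
wget -r -np -nd -l inf -A fits -e robots=off https://archive.stsci.edu/missions/tess/ete-6/tid/00/000/000/057/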

Related

Wget works for some websites but not for others?

I am on CentOS 8 and using the wget command to download some files. I am able to do so on certain websites but not on others. Here is an example that works for me:
wget https://forums.centos.org/index.php
--2021-07-26 20:50:40-- https://forums.centos.org/index.php
Resolving forums.centos.org (forums.centos.org)... 35.178.235.168
Connecting to forums.centos.org (forums.centos.org)|35.178.235.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.php’
index.php [ <=> ] 49.80K 319KB/s in 0.2s
2021-07-26 20:50:41 (319 KB/s) - ‘index.php’ saved [50997]
And here is an example that doesn't:
wget -d https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html
DEBUG output created by Wget 1.19.5 on linux-gnu.
Reading HSTS entries from /home/tuser1/.wget-hsts
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2021-07-26 20:27:16-- https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html
Certificates loaded: 147
Resolving www.bioinformatics.babraham.ac.uk (www.bioinformatics.babraham.ac.uk)... 149.155.133.4
Caching www.bioinformatics.babraham.ac.uk => 149.155.133.4
Connecting to www.bioinformatics.babraham.ac.uk (www.bioinformatics.babraham.ac.uk)|149.155.133.4|:443... Closed fd 3
failed: Connection timed out.
EDIT: I have tried pinging as well; I can ping yahoo.com but cannot ping google.com. As with wget, some websites work and some do not.
I have disabled the firewall (firewalld) and also tried downloading with curl -O, but have not found a solution. Please let me know if there is any way to fix this.
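Since ping fails for the same hosts, the problem is likely below HTTP. A few generic diagnostics that may narrow it down, assuming a Linux host with iputils and traceroute installed (google.com stands in for any failing host):
# Check that DNS resolves at all
getent hosts google.com
# See where along the path packets stop
traceroute google.com
# Probe for a path-MTU problem: send a full-size packet with the
# "don't fragment" flag; if small pings work but this fails,
# an MTU mismatch is a likely culprit
ping -c 3 -M do -s 1472 google.com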

Why can't `wget` follow the redirect for a certain website?

wget hangs while accessing the following website, but when I use a browser the site redirects to https://nyulangone.org. Does anybody know why wget doesn't get redirected in this case? Thanks.
$ wget http://nyumc.org
--2018-02-20 20:27:05-- http://nyumc.org/
Resolving nyumc.org (nyumc.org)... 216.165.125.106
Connecting to nyumc.org (nyumc.org)|216.165.125.106|:80...
When I used wget on the site you mentioned, this is what I got:
--2018-02-21 21:16:38-- http://www.nyumc.org/
Resolving www.nyumc.org (www.nyumc.org)... 216.165.125.112
Connecting to www.nyumc.org (www.nyumc.org)|216.165.125.112|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 179 [text/html]
Saving to: ‘index.html’
index.html 100%[==================================>] 179 --.-KB/s in 0s
2018-02-21 21:16:38 (8.16 MB/s) - ‘index.html’ saved [179/179]
In the index.html file, which bears the logo of NYU Langone Medical Center, it says: "The following URL has been rejected for security concerns. If you believe you have received this message in error, please submit an incident with our helpdesk at 212-263-6868..." So it may not redirect because the website can detect that you are a bot and not a browser. You could attempt to change the user agent string and other HTTP headers to avoid detection, but I'm not sure why you wouldn't just point wget at https://nyulangone.org directly.
Judging from information on archive.org, nyumc.org has been redirecting to other sites for at least the last 5 years: it redirected to http://www.med.nyu.edu until 2016, at which point it started redirecting to https://www.nyulangone.org.
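For reference, a minimal sketch of overriding the user agent; the header value here is just an example browser string, not anything specific to this site:
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" http://nyumc.org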
I hope that helps.

wget can't download webmin - 404 error

I have this error:
--2018-02-14 13:45:42-- http://www.webmin.com/jcameron-key.asc
Resolving www.webmin.com (www.webmin.com)... 216.105.38.10
Connecting to www.webmin.com (www.webmin.com)|216.105.38.10|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://sourceforge.net/error-404.html [following]
--2018-02-14 13:45:43-- https://sourceforge.net/error-404.html
Resolving sourceforge.net (sourceforge.net)... 216.105.38.10
Connecting to sourceforge.net (sourceforge.net)|216.105.38.10|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-02-14 16:45:44 ERROR 404: Not Found.
when I do this:
root@server:/tmp$ wget http://www.webmin.com/jcameron-key.asc
Any solution, please?
404 Not Found means that the link you provided no longer resolves to anything; the resource doesn't exist anymore. It's server-side, so you have no control over it.
I have the solution:
Install via shell command
If you'd like to install and update Webmin via APT, do this:
:~$ sudo nano /etc/apt/sources.list
Add this at the bottom of the file, last line.
deb http://download.webmin.com/download/repository sarge contrib
deb http://webmin.mirror.somersettechsolutions.co.uk/repository sarge contrib
Install Webmin
:~$ sudo -i
:~$ wget https://www.techandme.se/wp-content/uploads/2015/01/jcameron-key.asc
:~$ apt-key add jcameron-key.asc
:~$ apt-get update && apt-get install webmin --force-yes -y && rm jcameron-key.asc
Login
https://your-ip-address:10000
This happened because the Webmin servers were being moved to another location. They will be back up shortly.
Sorry about that.
I was trying to download Tomcat using the wget command and got the error below.
Error:
sudheer@sudheer:~$ wget https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.11/bin/apache-tomcat-9.0.11.tar.gz -P /tmp
--2018-12-02 00:49:10-- https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.11/bin/apache-tomcat-9.0.11.tar.gz
Resolving www-eu.apache.org (www-eu.apache.org)... 2a01:4f9:2a:185f::2, 95.216.24.32
Connecting to www-eu.apache.org (www-eu.apache.org)|2a01:4f9:2a:185f::2|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-12-02 00:49:12 ERROR 404: Not Found.
Solution:
Check whether the URL "https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.11/bin/apache-tomcat-9.0.11.tar.gz" is correct.
I found that this URL doesn't exist, so I corrected it, and it's working fine. The correct URL should be
"https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.13/bin/apache-tomcat-9.0.13-deployer.tar.gz"
Output:
sudheer@sudheer:~$ wget https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.13/bin/apache-tomcat-9.0.13-deployer.tar.gz -P /tmp
--2018-12-02 00:53:18-- https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.13/bin/apache-tomcat-9.0.13-deployer.tar.gz
Resolving www-eu.apache.org (www-eu.apache.org)... 2a01:4f9:2a:185f::2, 95.216.24.32
Connecting to www-eu.apache.org (www-eu.apache.org)|2a01:4f9:2a:185f::2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2636635 (2.5M) [application/x-gzip]
Saving to: ‘/tmp/apache-tomcat-9.0.13-deployer.tar.gz’
apache-tomcat-9.0.13-deployer.tar.gz 100%[=======================================================================>] 2.51M 370KB/s in 7.0s
2018-12-02 00:53:27 (370 KB/s) - ‘/tmp/apache-tomcat-9.0.13-deployer.tar.gz’ saved [2636635/2636635]
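As a general check before retrying a download, wget's --spider option requests a URL without saving anything, which quickly tells you whether the link exists:
wget --spider https://www-eu.apache.org/dist/tomcat/tomcat-9/v9.0.13/bin/apache-tomcat-9.0.13-deployer.tar.gz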

WGET a Redmine CSV file

I am trying to get a CSV file from Redmine in a shell script. wget is failing with a 406 Not Acceptable error. Any ideas what the magical incantation is, or how to find it?
$ wget --no-check-certificate --accept csv "https://username:password@company.com/redmine/issues.csv?utf8=%E2%9C%93&columns=all&description=1"
Resolving company.com (company.com)... 192.168.1.45
Connecting to company.com (company.com)|192.168.1.45|:443... connected.
WARNING: The certificate of ‘company.com’ is not trusted.
WARNING: The certificate of ‘company.com’ hasn't got a known issuer.
HTTP request sent, awaiting response... 406 Not Acceptable
2017-04-04 10:14:20 ERROR 406: Not Acceptable.
You can try to replace --accept csv with --accept "*.csv". See the wget manual: https://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options
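Note that --accept/-A only filters file names during recursive retrieval, while a 406 response is about HTTP content negotiation. If the filename-based suggestion doesn't help, a hedged alternative is to send an explicit Accept header (and quote the URL so the shell doesn't split it at the & characters):
wget --no-check-certificate --header="Accept: text/csv" "https://username:password@company.com/redmine/issues.csv?utf8=%E2%9C%93&columns=all&description=1"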

Trying to recursively download zip files with wget fails with 401 error

I am trying to download all the zip files from the website http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/Paginas/ElectricasMensuales.aspx
Note that the listed .zip files have paths (found by right-clicking a link and selecting "Copy link address") of the form, for example for the first one, http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/2015/Abril_2015.zip
I have tried all the combinations listed in How to download multiple .zip files?, but none of them works. They either download an empty directory structure, or fail with an error.
For example, this command
wget -r -l1 -nd -H -t1 -N -np -A.zip -erobots=off -U mozilla --random-wait http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/2015
gives me the following error message:
Resolving www.minetad.gob.es (www.minetad.gob.es)... 193.146.1.81, 2001:720:438:400::81
Connecting to www.minetad.gob.es (www.minetad.gob.es)|193.146.1.81|:80... connected.
HTTP request sent, awaiting response... 302 Redirect
Location: http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/2015/Forms/AllItems.aspx [following]
--2017-02-05 18:52:40-- http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/2015/Forms/AllItems.aspx
Reusing existing connection to www.minetad.gob.es:80.
HTTP request sent, awaiting response... 401 Unauthorized
Username/Password Authentication Failed.
Note that I can download every file individually by using its complete URL, like this:
wget http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/2015/Abril_2015.zip
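Since the complete URLs work, one hedged workaround is to skip recursion entirely: generate the list of monthly URLs and feed it to wget with -i. The Spanish month names and the year range below are assumptions extrapolated from the single Abril_2015.zip link above; adjust them to match the files actually listed:
for year in 2015 2016; do
  for month in Enero Febrero Marzo Abril Mayo Junio Julio Agosto Septiembre Octubre Noviembre Diciembre; do
    echo "http://www.minetad.gob.es/energia/balances/Publicaciones/ElectricasMensuales/${year}/${month}_${year}.zip"
  done
done > urls.txt
wget -N -i urls.txt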