Recursive file download with wget not working

From what I can tell by studying the wget manual, the following should work:
wget -r -l1 -np -nd -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/
However, instead of getting all of the files in the blat folder without the apparently auto-generated index.html file, I get a 404 not found error on this and several dozen variations that I've tried.
I can easily download any of the 4 files but trying to do it recursively fails.
Any pointers would be greatly appreciated.

Try replacing -r -l1 with -r -l 1. You need a space between the l and the 1. Also, try adding -k with your options. This will convert the links to point to the corresponding files on your computer.

Related

How to download binary files from a public github repository through the command line?

I'm trying to work through this docker-zfs plugin: https://github.com/TrilliumIT/docker-zfs-plugin. I'm stuck at this line: Download the latest binary from github releases and place in /usr/local/bin/ .
How does one do such a thing? I've gone through the whole page, and I don't see any mention of binary files or a link to a release. I've looked at other pages about downloading from GitHub repositories, but I don't have any authentication, so they didn't seem applicable. I looked at this and tried to make it work, https://geraldonit.com/2019/01/15/how-to-download-the-latest-github-repo-release-via-command-line/ , but something about the link formatting didn't seem to work. This must be really obvious, but I don't see what I am missing.
This is what I tried:
LOCATION=$(curl -s https://github.com/TrilliumIT/docker-zfs-plugin/releases/latest
| grep "tag_name"
| awk '{print "https://github.com/TrilliumIT/docker-zfs-plugin/releases/latest" substr($2, 2, length($2)-3) ".zip"}')
; curl -L -o . /usr/local/bin/
(But I'm not sure this is what I need, and the link doesn't exist either. There must be a better way of doing this?)

OK, so I actually figured this out; it was simpler than what I was doing:
wget https://github.com/TrilliumIT/docker-zfs-plugin/releases/download/v1.0.5/docker-zfs-plugin
sudo mv docker-zfs-plugin /usr/local/bin/
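The URL above follows the releases/download/TAG/ASSET pattern, so it can be assembled from its parts. The following is a minimal sketch, not an official recipe: the function name release_url is made up for illustration, and the tag and asset name must match what the project actually publishes on its releases page.

```shell
# Hypothetical helper: print the download URL for a release asset.
# Usage: release_url OWNER/REPO TAG ASSET
release_url() {
    printf 'https://github.com/%s/releases/download/%s/%s\n' "$1" "$2" "$3"
}

# Example (needs network access, so it is left commented out):
# wget "$(release_url TrilliumIT/docker-zfs-plugin v1.0.5 docker-zfs-plugin)"
# chmod +x docker-zfs-plugin        # a freshly downloaded file is not executable
# sudo mv docker-zfs-plugin /usr/local/bin/
```

Keeping the tag as a parameter makes it easy to switch versions later without retyping the whole URL.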

How to force wget to overwrite an existing file ignoring timestamp?

I tried '-N' and '--no-clobber', but the only result I get is a new copy of the existing example.exe with a number appended, using the syntax 'example.exe.1'. This is not what I want. I just need to download example.exe and overwrite the copy I already saved in the same folder, without wget checking whether my copy is older or newer than the example.exe on the server. Is this possible, or do I need to write a script that deletes example.exe first, or maybe something that changes its modification date?

If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Running it multiple times will keep overwriting index.html.

wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!

wget --backups=1 google.com
renames original file with .1 suffix and writes new file to the intended filename.
Not exactly what was requested, but could be handy in some cases.

-c or --continue
From the manual:
If you use ‘-c’ on a non-empty file, and the server does not support
continued downloading, Wget will restart the download from scratch and
overwrite the existing file entirely.

I like the -c option. I started with the man page and then the web, and I've had to search for this several times. It is useful, for example, if you're relaying a webcam image that always needs to be named image.jpg. It seems like this should be clearer in the man page.
I've been using this for a couple of years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file:
while true
do
    wget -c -i url.txt && break
    echo "Restarting wget"
    sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
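If pressing ctrl-c to stop the loop is not an option (for example in a cron job), the same idea can be written with a cap on the number of attempts. This is only a sketch: the function name retry and its argument order are assumptions made for illustration, not part of wget.

```shell
# Hypothetical helper: run a command until it succeeds, at most MAX times.
# Usage: retry MAX DELAY_SECONDS command [args...]
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "Giving up after $attempt attempts" >&2
            return 1
        fi
        echo "Restarting (attempt $attempt failed)" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (needs network access, so it is left commented out):
# retry 20 2 wget -c -i url.txt
```

With a cap in place, a download that keeps failing for the same reason ends with a non-zero exit status instead of looping forever.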

My use case involves two different URLs, sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem of using wget -O is that, when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fail quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
    # A different version exists, so download and overwrite the first
    wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
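A different technique from the --spider check is to download into a temporary file and move it over the target only when the download succeeds, which also avoids requesting the URL twice. The sketch below assumes mktemp is available; the function names download and replace_if_available are hypothetical.

```shell
# The actual download step, kept separate so it can be swapped out.
download() {
    wget -nv -O "$2" "$1"
}

# Hypothetical helper: replace TARGET only if URL downloads successfully
# and the result is not empty.
# Usage: replace_if_available URL TARGET
replace_if_available() {
    tmp=$(mktemp) || return 1
    if download "$1" "$tmp" && [ -s "$tmp" ]; then
        mv -- "$tmp" "$2"
    else
        rm -f -- "$tmp"
        return 1
    fi
}

# Example (needs network access, so it is left commented out):
# wget -nv https://example.com/files/file01.png -O file01.png
# replace_if_available https://example.com/custom-files/file01.png file01.png
```

Note that mktemp creates the file with restrictive permissions, so a chmod on the target may be needed afterwards.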

Here is an easy way to get it done with parameter trimming
url=https://example.com/example.exe ; wget -nv "$url" -O "${url##*/}"
Or you can use basename
url=https://example.com/example.exe ; wget -nv "$url" -O "$( basename "$url" )"
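One caveat with both variants: if the URL carries a query string or fragment, that text ends up in the file name. A small sketch that strips it first follows; the function name url_filename and the index.html fallback for URLs ending in a slash are assumptions chosen for illustration.

```shell
# Hypothetical helper: print the last path component of a URL,
# ignoring any query string or fragment.
url_filename() {
    name=${1%%[?#]*}   # drop "?query" and "#fragment"
    name=${name##*/}   # drop everything up to the last slash
    printf '%s\n' "${name:-index.html}"
}

# Example (needs network access, so it is left commented out):
# url='https://example.com/example.exe?mirror=1'
# wget -nv "$url" -O "$(url_filename "$url")"
```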

For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"

The first command downloads from the source with wget.
The second command removes the older file. (Double quotes are needed so that the shell expands $file.)
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"; \
rm -f -- "$file.1"

Using wget (for windows) to download all MIDI files

I've been trying to use wget to download all midi files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites which all suggest the same thing, but it doesn't download anything. I've tried various variations on the theme, such as different values for '-l' etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.

I don't know much about all the parameters you are using, like H, -t1, -N, etc., though they can be looked up online. But I also had to download files from a URL matching a wildcard. This is the command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
After -P you specify the path where you want to save the files, and after -A you provide the wildcard pattern. In your case that would be "*.mid".
-A means accept: it lists the files to accept from the provided URL. Similarly, -R gives the reject list.

You may have better luck (at least, you'll get more MIDI files), if you try the actual Cyber Hymnal™, which moved over 10 years ago. The current URL is now http://www.hymntime.com/tch/.

Download all pdf files using wget

I have the following site http://www.asd.com.tr. I want to download all PDF files into one directory. I've tried a couple of commands but am not having much luck.
$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/
With this command, only four PDF files were downloaded. Check this link; there are several thousand PDFs available:
http://www.asd.com.tr/Default.aspx
For instance, hundreds of files are in the following folder:
http://www.asd.com.tr/Folders/asd/…
But I can't figure out how to access them correctly to see and download them all. There are a number of folders in this subdirectory, http://www.asd.com.tr/Folders/, and thousands of PDFs in these folders.
I've tried to mirror the site using the -m option, but that failed too.
Any more suggestions?

First, verify that the TOS of the web site permit crawling it. Then, one solution is:
mech-dump --links 'http://domain.com' |
grep 'pdf$' |
sed -E 's/[[:space:]]+/%20/g' |
xargs -I% wget http://domain.com/%
The mech-dump command comes with Perl's WWW::Mechanize module (the libwww-mechanize-perl package on Debian and Debian-like distros).
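The filtering stage of that pipeline can be tried on its own, without mech-dump or network access. The sketch below is an assumption-laden variant, not the answer's exact pipeline: the function name pdf_links is hypothetical, the match is case-insensitive and requires a literal ".pdf" suffix, and each space is encoded individually.

```shell
# Hypothetical helper: read links on stdin (one per line), keep those ending
# in .pdf (case-insensitive), and percent-encode spaces.
pdf_links() {
    grep -i '\.pdf$' | sed 's/ /%20/g'
}

# Quick local check with made-up link names:
printf '%s\n' 'a.pdf' 'page.html' 'My File.PDF' | pdf_links
```

Testing the filter on a handful of sample links like this is a cheap way to confirm the pattern before pointing wget at a real site.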

wget downloads only one index.html file instead of other some 500 html files

With wget I normally receive only one file, index.html. I enter the following command:
wget -e robots=off -r http://www.korpora.org/kant/aa03
which, alas, gives back only an index.html file.
The directory aa03 corresponds to volume 3 of Kant's works; there must be some 560 files (pages) in it. These pages are readable online, but they are not downloaded. Any remedy? Thanks.

Following that link brings us to:
http://korpora.zim.uni-duisburg-essen.de/kant/aa03/
wget won't follow links that point to domains not specified by the user. Since korpora.zim.uni-duisburg-essen.de is not equal to korpora.org, wget will not follow the links on the index page.
To remedy this, use --span-hosts or -H. -rH is a VERY dangerous combination - combined, you can accidentally crawl the entire Internet - so you'll want to keep its scope very tightly focused. This command will do what you intended to do:
wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html
(-np, or --no-parent, will limit the crawl to aa03/. -D will limit it to only those two domains. -l inf will crawl infinitely deep, constrained by -D and -np).
