Multiple simultaneous downloads using Wget? - command-line

I'm using wget to download website content, but wget downloads the files one by one.
How can I make wget download using 4 simultaneous connections?

Use aria2:
aria2c -x 16 [url]
#          |
#          |
#          |
#          ----> the number of connections
http://aria2.sourceforge.net

Wget does not support multiple socket connections to speed up downloading of files.
I think we can do a bit better than gmarian's answer.
The correct way is to use aria2:
aria2c -x 16 -s 16 [url]
#          |     |
#          |     |
#          |     |
#          -----------> the number of connections here
Official documentation:
-x, --max-connection-per-server=NUM: The maximum number of connections to one server for each download. Possible Values: 1-16 Default: 1
-s, --split=N: Download a file using N connections. If more than N URIs are given, first N URIs are used and remaining URLs are used for backup. If less than N URIs are given, those URLs are used more than once so that N connections total are made simultaneously. The number of connections to the same host is restricted by the --max-connection-per-server option. See also the --min-split-size option. Possible Values: 1-* Default: 5
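As a concrete sketch (with [url] as a placeholder), you can combine these with -k/--min-split-size so that even moderately sized files are actually split:
# up to 16 connections to the server, split into 16 pieces,
# allowing pieces as small as 1 MiB (the default minimum is 20 MiB)
aria2c -x 16 -s 16 -k 1M [url]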

Since GNU parallel was not mentioned yet, let me give another way:
cat url.list | parallel -j 8 wget -O {#}.html {}
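Here {} expands to each URL and {#} to the job number, so the pages land in 1.html, 2.html, and so on. For illustration, url.list is assumed to contain one URL per line:
https://example.com/page-a.html
https://example.com/page-b.html
https://example.com/page-c.html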

I found (probably) a solution. In the process of downloading a few thousand log files from one server to the next, I suddenly had the need to do some serious multithreaded downloading in BSD, preferably with Wget, as that was the simplest way I could think of handling this. A little looking around led me to this little nugget:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url]
Just repeat the wget -r -np -N [url] for as many threads as you need...
Now, granted, this isn't pretty and there are surely better ways to do this, but if you want something quick and dirty it should do the trick...
Note: the -N option makes wget download only "newer" files, which means it won't overwrite or re-download files unless their timestamp changes on the server.
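The same trick written as a loop, a minimal sketch (replace [url] with the real URL and adjust the count as needed):
for i in 1 2 3 4; do
    wget -r -np -N [url] &
done
wait  # block until all background wgets have finished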

Another program that can do this is axel.
axel -n <NUMBER_OF_CONNECTIONS> URL
For basic HTTP auth:
axel -n <NUMBER_OF_CONNECTIONS> "user:password@https://domain.tld/path/file.ext"
Ubuntu man page.

A new (but not yet released) tool is Mget.
It already has many options known from Wget and comes with a library that allows you to easily embed (recursive) downloading into your own application.
To answer your question:
mget --num-threads=4 [url]
UPDATE
Mget is now developed as Wget2 with many bugs fixed and more features (e.g. HTTP/2 support).
--num-threads is now --max-threads.
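So the equivalent Wget2 invocation of the command above would presumably be:
wget2 --max-threads=4 [url]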

I strongly suggest using httrack.
ex: httrack -v -w http://example.com/
It will make a mirror with 8 simultaneous connections by default. HTTrack has tons of options to play with. Have a look.

As other posters have mentioned, I'd suggest you have a look at aria2. From the Ubuntu man page for version 1.16.1:
aria2 is a utility for downloading files. The supported protocols are HTTP(S), FTP, BitTorrent, and Metalink. aria2 can download a file from multiple sources/protocols and tries to utilize your maximum download bandwidth. It supports downloading a file from HTTP(S)/FTP and BitTorrent at the same time, while the data downloaded from HTTP(S)/FTP is uploaded to the BitTorrent swarm. Using Metalink's chunk checksums, aria2 automatically validates chunks of data while downloading a file like BitTorrent.
You can use the -x flag to specify the maximum number of connections per server (default: 1):
aria2c -x 16 [url]
If the same file is available from multiple locations, you can choose to download from all of them. Use the -j flag to specify the maximum number of parallel downloads for every static URI (default: 5).
aria2c -j 5 [url] [url2]
Have a look at http://aria2.sourceforge.net/ for more information. For usage information, the man page is really descriptive and has a section on the bottom with usage examples. An online version can be found at http://aria2.sourceforge.net/manual/en/html/README.html.
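If, for instance, the same ISO image is mirrored in two places, a sketch of a multi-source download (the mirror URLs are placeholders) would be:
aria2c -x 8 -s 16 http://mirror1.example.com/file.iso http://mirror2.example.com/file.iso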

wget can't download over multiple connections; instead, you can try another program like aria2.

Use
aria2c -x 10 -i websites.txt >/dev/null 2>/dev/null &
In websites.txt, put one URL per line, for example:
https://www.example.com/1.mp4
https://www.example.com/2.mp4
https://www.example.com/3.mp4
https://www.example.com/4.mp4
https://www.example.com/5.mp4

Try pcurl:
http://sourceforge.net/projects/pcurl/
It uses curl instead of wget and downloads in 10 parallel segments.

They always say it depends, but when it comes to mirroring a website, the best tool is httrack. It is super fast and easy to work with. The only downside is its so-called support forum, but you can find your way using the official documentation. It has both GUI and CLI interfaces, and it supports cookies; just read the docs. This is the best. (Be careful with this tool: you can download the whole web onto your hard drive.)
httrack -c8 [url]
By default, the maximum number of simultaneous connections is limited to 8 to avoid server overload.

Use xargs to make wget work on multiple files in parallel:
#!/bin/bash
mywget()
{
    wget "$1"
}
export -f mywget
# run wget in parallel using 8 threads/connections
xargs -P 8 -n 1 -I {} bash -c "mywget '{}'" < list_urls.txt
Aria2 options, the right way to work with files smaller than 20 MB:
aria2c -k 2M -x 10 -s 10 [url]
-k 2M splits the file into 2 MB chunks.
-k or --min-split-size has a default value of 20 MB; if you don't set this option for a file under 20 MB, it will run in only a single connection no matter what value you give -x or -s.

You can use xargs.
-P is the number of processes. For example, if you set -P 4, four links will be downloaded at the same time; if you set -P 0, xargs will launch as many processes as possible, and all of the links will be downloaded at once.
cat links.txt | xargs -P 4 -I{} wget {}

I'm using GNU parallel:
cat listoflinks.txt | parallel --bar -j ${MAX_PARALLEL:-$(nproc)} wget -nv {}
cat will pipe a list of line-separated URLs to parallel.
The --bar flag will show a progress bar for the parallel execution.
The MAX_PARALLEL env var sets the maximum number of parallel downloads; use it carefully. The default here is the current number of CPUs.
Tip: use --dry-run to see what will happen if you execute the command:
cat listoflinks.txt | parallel --dry-run --bar -j ${MAX_PARALLEL} wget -nv {}

make can be parallelised easily (e.g., make -j 4). For example, here's a simple Makefile I'm using to download files in parallel using wget:
BASE=http://www.somewhere.com/path/to
FILES=$(shell awk '{printf "%s.ext\n", $$1}' filelist.txt)
LOG=download.log

all: $(FILES)
	echo $(FILES)

%.ext:
	wget -N -a $(LOG) $(BASE)/$@

.PHONY: all
default: all
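With filelist.txt containing one base name per line, make's own parallelism provides the simultaneous downloads:
make -j 4  # run up to four wget jobs at once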

Consider using regular expressions or FTP globbing. That way you could start wget multiple times with different groups of filename starting characters, depending on their frequency of occurrence.
This is for example how I sync a folder between two NAS:
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.10 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[0-9a-hA-H]*" --directory-prefix=/volume1/foo &
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.11 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[!0-9a-hA-H]*" --directory-prefix=/volume1/foo &
The first wget syncs all files/folders starting with 0, 1, 2... F, G, H and the second thread syncs everything else.
This was the easiest way to sync between a NAS with one 10G Ethernet port (10.0.0.100) and a NAS with two 1G Ethernet ports (10.0.0.10 and 10.0.0.11). I bound the two wget threads through --bind-address to the different Ethernet ports and ran them in parallel by putting & at the end of each line. That way I was able to copy huge files at 2x 100 MB/s = 200 MB/s in total.

Call Wget for each link and set it to run in the background.
I tried this Python code (the ! prefix runs a shell command, so this works in IPython/Jupyter):
with open('links.txt', 'r') as f1:  # Open links.txt in read mode
    list_1 = f1.read().splitlines()  # Get every line in links.txt
for i in list_1:  # Iterate over each link
    !wget "$i" -bq  # Call wget in background mode
Parameters:
b - run in background
q - quiet mode (no output)

If you are doing recursive downloads, where you don't know all of the URLs yet, wget is perfect.
If you already have a list of each URL you want to download, then skip down to cURL below.
Multiple Simultaneous Downloads Using Wget Recursively (unknown list of URLs)
# Multiple simultaneous downloads
URL=ftp://ftp.example.com
for i in {1..10}; do
wget --no-clobber --recursive "${URL}" &
done
The above loop will start 10 wget processes, each recursively downloading from the same website; however, they will not overlap or download the same file twice.
Using --no-clobber prevents each of the 10 wget processes from downloading the same file twice (including the full relative URL path).
& forks each wget to the background, allowing you to run multiple simultaneous downloads from the same website using wget.
Multiple Simultaneous Downloads Using curl from a list of URLs
If you already have a list of URLs you want to download, curl -Z is parallelised curl, with a default of 50 downloads running at once.
However, for curl, the list has to be in this format:
url = https://example.com/1.html
-O
url = https://example.com/2.html
-O
So if you already have a list of URLs to download, simply format the list and then run cURL:
cat url_list.txt
#https://example.com/1.html
#https://example.com/2.html
touch url_list_formatted.txt
while read -r URL; do
echo "url = ${URL}" >> url_list_formatted.txt
echo "-O" >> url_list_formatted.txt
done < url_list.txt
Download in parallel using curl from list of URLs:
curl -Z --parallel-max 100 -K url_list_formatted.txt
For example,
$ curl -Z --parallel-max 100 -K url_list_formatted.txt
DL% UL% Dled Uled Xfers Live Qd Total Current Left Speed
100 -- 2512 0 2 0 0 0:00:01 0:00:01 --:--:-- 1973
$ ls
1.html 2.html url_list_formatted.txt url_list.txt

Related

How to wget a single page with NO assets ONLY the HTML page

Can someone give me the wget command to download a single HTML page without any assets? Literally only the HTML page, and nothing else. Example:
wget --no-check-certificate --level=0 https://blablabla.com/get?id=1111
result: 1111.html
The wget man page states:
--level=depth
Set the maximum number of subdirectories that Wget will
recurse into to depth. In order to prevent one from
accidentally downloading very large websites when using
recursion this is limited to a depth of 5 by default, i.e.,
it will traverse at most 5 directories deep starting from the
provided URL. Set -l 0 or -l inf for infinite recursion
depth.
wget -r -l 0 http://<site>/1.html
Ideally, one would expect this to download just 1.html. but
unfortunately this is not the case, because -l 0 is
equivalent to -l inf---that is, infinite recursion. To
download a single HTML page (or a handful of them), specify
them all on the command line and leave away -r and -l.(...)
So --level=0 does mean infinite recursion; you should not provide --level at all if you want to download a single file. Taking this into account, your command should be altered to:
wget --no-check-certificate https://blablabla.com/get?id=1111

Using wget (for windows) to download all MIDI files

I've been trying to use wget to download all midi files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites which all suggest the same thing, but it doesn't download anything. I've tried various variations on the theme, such as different values for '-l' etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.
I don't know much about all the parameters you are using, like H, -t1, -N, etc., though we can look them up online. But I also had to download files from a URL matching a wildcard, and this is the command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
After -P you specify the path where you want to save the files, and after -A you provide the wildcard token. In your case that would be "*.mid".
-A means accept: here we list the files to accept from the provided URL. Similarly, -R is for the reject list.
You may have better luck (at least, you'll get more MIDI files) if you try the actual Cyber Hymnal™, which moved over 10 years ago. The current URL is http://www.hymntime.com/tch/.
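Putting both hints together, a sketch of the original command against the new domain might look like this (assuming the bare H in the question was meant to be -H):
wget -r -l1 -H -t1 -nd -N -np -A "*.mid" -e robots=off http://www.hymntime.com/tch/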

how to print the progress of the files being copied in bash [duplicate]

I suppose I could compare the number of files in the source directory to the number of files in the target directory as cp progresses, or perhaps do it with folder size instead? I tried to find examples, but all bash progress bars seem to be written for copying single files. I want to copy a bunch of files (or a directory, if the former is not possible).
You can also use rsync instead of cp like this:
rsync -Pa source destination
This will give you a progress bar and an estimated time of completion. Very handy.
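If your rsync is version 3.1 or newer, you can also ask for a single overall progress line instead of per-file output:
rsync -a --info=progress2 source destination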
To show a progress bar while doing a recursive copy of files & folders & subfolders (including links and file attributes), you can use gcp (easily installed in Ubuntu and Debian by running "sudo apt-get install gcp"):
gcp -rf SRC DEST
Here is the typical output while copying a large folder of files:
Copying 1.33 GiB 73% |##################### | 230.19 M/s ETA: 00:00:07
Notice that it shows just one progress bar for the whole operation, whereas if you want a single progress bar per file, you can use rsync:
rsync -ah --progress SRC DEST
You may have a look at the tool vcp. That's a simple copy tool with two progress bars: one for the current file, and one for the overall operation.
EDIT
Here is the link to the sources: http://members.iinet.net.au/~lynx/vcp/
Manpage can be found here: http://linux.die.net/man/1/vcp
Most distributions have a package for it.
Here's another solution: use the tool bar.
You could invoke it like this:
#!/bin/bash
filesize=$(du -sb "${1}" | awk '{ print $1 }')
tar -cf - -C "${1}" ./ | bar --size "${filesize}" | tar -xf - -C "${2}"
You have to go by way of tar, and it will be inaccurate for small files. Also, you must make sure that the target directory exists. But it is a way.
My preferred option is Advanced Copy, as it uses the original cp source files.
$ wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.32.tar.xz
$ tar xvJf coreutils-8.32.tar.xz
$ cd coreutils-8.32/
$ wget --no-check-certificate https://raw.githubusercontent.com/jarun/advcpmv/master/advcpmv-0.8-8.32.patch
$ patch -p1 -i advcpmv-0.8-8.32.patch
$ ./configure
$ make
The new programs are now located in src/cp and src/mv. You may choose to replace your existing commands:
$ sudo cp src/cp /usr/local/bin/cp
$ sudo cp src/mv /usr/local/bin/mv
Then you can use cp as usual, or specify -g to show the progress bar:
$ cp -g src dest
A simple Unix way is to go to the destination directory and run watch -n 5 du -s . Perhaps make it prettier by rendering it as a bar. This can help in environments where you have just the standard Unix utilities and no scope for installing additional tools. du -s is the key; watch just re-runs it every 5 seconds.
Pros: works on any Unix system. Cons: no progress bar.
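For example, while the copy runs in another terminal (the destination path is a placeholder):
cd /path/to/destination
watch -n 5 du -s .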
To add another option, you can use cpv. It uses pv to imitate the usage of cp.
It works like pv, but you can use it to copy directories recursively.
You can get it here
There's a tool, pv, that does this exact thing: http://www.ivarch.com/programs/pv.shtml
There's an Ubuntu version of it in apt.
How about something like
find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /DEST/$(dirname {})
It finds all the files in the current directory and pipes that list through pv, giving pv an estimated size so the progress meter works, and then pipes each name to a cp command with the --parents flag so the DEST path matches the SRC path.
One problem I have yet to overcome is that if you issue this command
find /home/user/test -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /www/test/$(dirname {})
the destination path becomes /www/test/home/user/test/....FILES... and I am unsure how to tell the command to get rid of the '/home/user/test' part. That's why I have to run it from inside the SRC directory.
Check the source code for progress_bar in my git repository below:
https://github.com/Kiran-Bose/supreme
Also try my custom bash script package supreme to see how the progress bar works with the cp and mv commands.
Functionality overview
(1)Open Apps
----Firefox
----Calculator
----Settings
(2)Manage Files
----Search
----Navigate
----Quick access
|----Select File(s)
|----Inverse Selection
|----Make directory
|----Make file
|----Open
|----Copy
|----Move
|----Delete
|----Rename
|----Send to Device
|----Properties
(3)Manage Phone
----Move/Copy from phone
----Move/Copy to phone
----Sync folders
(4)Manage USB
----Move/Copy from USB
----Move/Copy to USB
There is the command progress, https://github.com/Xfennec/progress, a Coreutils Progress Viewer.
Just run progress in another terminal to see the copy/move progress. For continuous monitoring, use the -M flag.
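For example (the copy and the monitor run in two separate terminals):
# terminal 1: a long-running copy
cp -r /big/source /backup/
# terminal 2: watch all running coreutils commands, updating continuously
progress -M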

wget downloads only one index.html file instead of other some 500 html files

With Wget I normally receive only one file, index.html. I enter the following string:
wget -e robots=off -r http://www.korpora.org/kant/aa03
which gives back an index.html file, alas, only.
The directory aa03 holds Kant's book, volume 3; there must be some 560 files (pages) or so in it. These pages are readable online, but they will not be downloaded. Any remedy? Thanks!
Following that link brings us to:
http://korpora.zim.uni-duisburg-essen.de/kant/aa03/
wget won't follow links that point to domains not specified by the user. Since korpora.zim.uni-duisburg-essen.de is not equal to korpora.org, wget will not follow the links on the index page.
To remedy this, use --span-hosts or -H. -rH is a VERY dangerous combination - combined, you can accidentally crawl the entire Internet - so you'll want to keep its scope very tightly focused. This command will do what you intended to do:
wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html
(-np, or --no-parent, will limit the crawl to aa03/. -D will limit it to only those two domains. -l inf will crawl infinitely deep, constrained by -D and -np).

recursive wget with hotlinked requisites

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let's look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let's pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.
You can't specify to span hosts for page-reqs only; -H is all or nothing. Since -r and -H will pull down the entire Internet, you'll want to split the crawls that use them. To grab hotlinked page-reqs, you'll have to run wget twice: once to recurse through the site's structure, and once to grab hotlinked reqs. I've had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) build a list of all HTML files in the site structure (find . | grep html) and pipe to file
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site's structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
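A minimal sketch of those three steps, assuming www.example.com is the site and the mirror is created under the current directory:
# 1) recurse through the site's structure (no -H, no -p)
wget -e robots=off -r -l inf http://www.example.com
# 2) turn the mirrored HTML files back into URLs
#    (wget -r saves under ./www.example.com/..., so strip the leading ./)
find . -name '*.html' | sed 's|^\./|http://|' > pages.txt
# 3) grab the page requisites for those pages, this time spanning hosts
wget -e robots=off -p -H -i pages.txt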
I've managed to do this by using regular expressions. Something like this to mirror http://www.example.com/docs
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
http://www.example.com/docs
You'll probably have to tune the regexs for each specific site. For example some sites like to use parameters on css files (e.g. style.css?key=value), which this example will exclude.
The files you want to include from other hosts will probably include at least
Images: png jpg jpeg gif
Fonts: ttf otf woff woff2 eot
Others: js css svg
Anybody know any others?
So the actual regex you want will probably look more like this (as one string with no linebreaks):
^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$