How to wget a single page with NO assets ONLY the HTML page - wget

Can someone give me the wget command to download a single HTML page without any assets? Literally only the HTML page, and nothing else. Example:
wget --no-check-certificate --level=0 https://blablabla.com/get?id=1111
result: 1111.html

wget man page states that
--level=depth
Set the maximum number of subdirectories that Wget will
recurse into to depth. In order to prevent one from
accidentally downloading very large websites when using
recursion this is limited to a depth of 5 by default, i.e.,
it will traverse at most 5 directories deep starting from the
provided URL. Set -l 0 or -l inf for infinite recursion
depth.
wget -r -l 0 http://<site>/1.html
Ideally, one would expect this to download just 1.html. but
unfortunately this is not the case, because -l 0 is
equivalent to -l inf---that is, infinite recursion. To
download a single HTML page (or a handful of them), specify
them all on the command line and leave away -r and -l.(...)
So --level=0 does mean infinite recursion; you should not provide --level at all if you want to download a single file. Taking this into account, your command should be altered to
wget --no-check-certificate https://blablabla.com/get?id=1111
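If you also want the 1111.html filename from the example (by default wget would save this page under the last URL component, here get?id=1111), you can name the output file yourself with -O; the URL and filename below are just the ones from the question:
wget --no-check-certificate -O 1111.html "https://blablabla.com/get?id=1111"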

Related

How to force wget to overwrite an existing file ignoring timestamp?

I tried '-N' and '--no-clobber', but the only result I get is a new copy of the existing example.exe with a number appended, using the syntax 'example.exe.1'. This is not what I'd like to get. I just need to download and overwrite the example.exe file in the same folder where I already saved a copy of it, without wget checking whether my copy is older or newer than the example.exe file already present in my download folder. Do you think this is possible, or do I need to create a script that deletes the example.exe file first, or maybe something that changes its modification date, etc.?
If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Running it multiple times will keep overwriting index.html.
wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!
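If you like the curl approach but not the risk of a server-chosen filename, a safer variant is to pick the output name yourself with lowercase -o, which also overwrites an existing file of that name; the URL and filename below are just the ones from the example above:
curl -o fr.utf-8.spl http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl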
wget --backups=1 google.com
renames original file with .1 suffix and writes new file to the intended filename.
Not exactly what was requested, but could be handy in some cases.
-c or --continue
From the manual:
If you use ‘-c’ on a non-empty file, and the server does not support
continued downloading, Wget will restart the download from scratch and
overwrite the existing file entirely.
I like the -c option. I started with the man page and then the web, and I've searched for this several times. For example, if you're relaying a webcam, the image always needs to be named image.jpg. It seems like this should be clearer in the man page.
I've been using this for a couple of years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file (a sample wgetrc line is shown at the end of this answer):
while true
do
wget -c -i url.txt && break
echo "Restarting wget"
sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
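For reference, the wgetrc setting mentioned at the top of this answer is just a line in ~/.wgetrc; the 200k rate here is only an example value:
# ~/.wgetrc
limit_rate = 200k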
My use case involves two different URLs; sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem of using wget -O is that, when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fails quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
# A different version exists, so download and overwrite the first
wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
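The same logic can be written a little more compactly by testing wget's exit status directly in the if condition, which avoids the separate $? check (same example URLs as above):
if wget --quiet --spider https://example.com/custom-files/file01.png; then
    # A different version exists, so download and overwrite the first
    wget -nv https://example.com/custom-files/file01.png -O file01.png
fi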
Here is an easy way to get it done with shell parameter expansion, trimming everything up to the last /:
url=https://example.com/example.exe ; wget -nv "$url" -O "${url##*/}"
Or you can use basename:
url=https://example.com/example.exe ; wget -nv "$url" -O "$(basename "$url")"
For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
The first command downloads from the source with wget; the second command removes the older backup file:
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
rm -f "$file.1"

Mirroring a website and maintaining URL structure

The goal
I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.
The command
This is almost perfect for my needs (...but not quite):
wget --mirror -nH -np -p -k -E -e robots=off http://mysite
What this does do
--mirror : Recursively download the entire site
-p : Download all necessary page requisites
-k : Convert the URLs to relative paths so I can host them anywhere
What this doesn't do
Prevent duplicate downloads
Maintain (exactly) the same URL structure
The problem
Some things are being downloaded more than once, which results in myfile.html and myfile.1.html. This wouldn't be bad, except that when wget rewrites the hyperlinks, it points them at the myfile.1.html version, which changes the URLs and therefore has SEO implications (Google will index ugly-looking URLs).
The -nc option would prevent this, but as of wget-v1.13, I cannot use -k and -nc at the same time. Details for this are here.
Help?!
I was hoping to use wget, but I am now considering looking into using another tool, like httrack, but I don't have any experience with that yet.
Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!
httrack got me most of the way; the only URL mangling it did was making the links point to /folder/index.html instead of /folder/.
Using either httrack or wget didn't seem to result in a perfect URL structure, so we ended up writing a little bash script that runs the crawler, followed by sed to clean up some of the URLs (crop the index.html from links, replace bla.1.html with bla.html, etc.), e.g. as sketched below.
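A rough sketch of that kind of sed cleanup pass (the ./mirror directory name is hypothetical, and -i as used here assumes GNU sed; the original script wasn't posted):
# rewrite every downloaded HTML file in place
find ./mirror -name '*.html' -print0 | xargs -0 sed -i \
    -e 's/\.1\.html/.html/g' \
    -e 's|/index\.html|/|g'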
wget description and help
According to this (and a quick experiment of my own), you should have no problems using the -nc and -k options together to gather the pages you are after.
What will cause an issue is using -N with -nc (they do not work together at all; they are incompatible), so you won't be able to compare files by timestamp and still no-clobber them; and the --mirror option inherently includes -N.
Rather than using --mirror, try replacing it with "-r -l inf", which enables recursive downloading to infinite depth but still allows your other options to work.
An example, based on your original:
wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite
Notes: I would suggest using -w 5 --random-wait --limit-rate=200k in order to avoid DOSing the server and be a little less rude, but that is obviously up to you; the full command with these added is shown below.
Generally speaking I try to avoid option groupings like --mirror, because conflicts like this are harder to trace.
I know this is an answer to a very old question, but I think it should be addressed. wget is a new command for me, but so far it is proving to be invaluable, and I hope others feel the same.
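Putting the suggested replacement for --mirror and the politeness flags together, the full command would look something like this (same example host as above):
wget -r -l inf -k -nc -nH -p -E -e robots=off -w 5 --random-wait --limit-rate=200k http://yoursite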

wget downloads only one index.html file instead of other some 500 html files

With wget I normally receive only one file: index.html. I enter the following command:
wget -e robots=off -r http://www.korpora.org/kant/aa03
which, alas, gives back only an index.html file.
The directory aa03 corresponds to volume 3 of Kant's works; there must be some 560 files (pages) or so in it. These pages are readable online, but will not be downloaded. Any remedy? Thanks!
Following that link brings us to:
http://korpora.zim.uni-duisburg-essen.de/kant/aa03/
wget won't follow links that point to domains not specified by the user. Since korpora.zim.uni-duisburg-essen.de is not equal to korpora.org, wget will not follow the links on the index page.
To remedy this, use --span-hosts or -H. -rH is a VERY dangerous combination - combined, you can accidentally crawl the entire Internet - so you'll want to keep its scope very tightly focused. This command will do what you intended to do:
wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html
(-np, or --no-parent, will limit the crawl to aa03/. -D will limit it to only those two domains. -l inf will crawl infinitely deep, constrained by -D and -np).

recursive wget with hotlinked requisites

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let's look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let's pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.
You can't specify to span hosts for page-reqs only; -H is all or nothing. Since -r and -H will pull down the entire Internet, you'll want to split the crawls that use them. To grab hotlinked page-reqs, you'll have to run wget twice: once to recurse through the site's structure, and once to grab hotlinked reqs. I've had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) build a list of all HTML files in the site structure (find . | grep html) and pipe to file
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site's structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
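A minimal sketch of that three-step method as a script; the example.com host is a placeholder, and it assumes the default directory layout from step 1 (no -nH, no --cut-dirs), so the paths that find produces double as fetchable URLs:
# 1) recurse through the site structure (add your other non-H, non-p switches)
wget -e robots=off -r -l inf http://www.example.com
# 2) list every HTML page the crawl produced
find www.example.com -type f -name '*.html' > pages.txt
# 3) re-fetch those pages together with their requisites, this time spanning hosts
wget -e robots=off -p -H -i pages.txt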
I've managed to do this by using regular expressions. Something like this to mirror http://www.example.com/docs
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
http://www.example.com/docs
You'll probably have to tune the regexes for each specific site. For example, some sites like to use parameters on css files (e.g. style.css?key=value), which this example will exclude.
The files you want to include from other hosts will probably include at least
Images: png jpg jpeg gif
Fonts: ttf otf woff woff2 eot
Others: js css svg
Anybody know any others?
So the actual regex you want will probably look more like this (as one string with no line breaks):
^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$
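If your wget build was compiled with PCRE support (not every build is), an alternative way to get the case-insensitivity is --regex-type pcre with an inline (?i) flag, instead of spelling out every character class; the example.com host and the extension list are just the ones from this answer:
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--regex-type pcre \
--accept-regex '^http://www\.example\.com/docs|(?i)\.(js|css|png|jpe?g|svg|gif|ttf|otf|woff2?|eot)(\?.*)?$' \
http://www.example.com/docs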

Multiple simultaneous downloads using Wget?

I'm using wget to download website content, but wget downloads the files one by one.
How can I make wget download using 4 simultaneous connections?
Use aria2:
aria2c -x 16 [url]
# -x 16 sets the number of connections
http://aria2.sourceforge.net
Wget does not support multiple socket connections in order to speed up download of files.
I think we can do a bit better than gmarian's answer.
The correct way is to use aria2.
aria2c -x 16 -s 16 [url]
# -x and -s both set the number of connections here
Official documentation:
-x, --max-connection-per-server=NUM: The maximum number of connections to one server for each download. Possible Values: 1-16 Default: 1
-s, --split=N: Download a file using N connections. If more than N URIs are given, first N URIs are used and remaining URLs are used for backup. If less than N URIs are given, those URLs are used more than once so that N connections total are made simultaneously. The number of connections to the same host is restricted by the --max-connection-per-server option. See also the --min-split-size option. Possible Values: 1-* Default: 5
Since GNU parallel was not mentioned yet, let me give another way:
cat url.list | parallel -j 8 wget -O {#}.html {}
I found (probably) a solution:
In the process of downloading a few thousand log files from one server
to the next I suddenly had the need to do some serious multithreaded
downloading in BSD, preferably with Wget as that was the simplest way
I could think of handling this. A little looking around led me to
this little nugget:
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url] &
wget -r -np -N [url]
Just repeat the wget -r -np -N [url] for as many threads as you need...
Now given this isn’t pretty and there are surely better ways to do
this but if you want something quick and dirty it should do the trick...
Note: the option -N makes wget download only "newer" files, which means it won't overwrite or re-download files unless their timestamp changes on the server.
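If you put this in a script, it may be worth collecting the background jobs with wait so the script doesn't exit before the downloads finish; the job count below is arbitrary and [url] is a placeholder as above:
for i in 1 2 3 4; do
    wget -r -np -N [url] &
done
wait   # block until all background wget jobs have finished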
Another program that can do this is axel.
axel -n <NUMBER_OF_CONNECTIONS> URL
For basic HTTP auth,
axel -n <NUMBER_OF_CONNECTIONS> "https://user:password@domain.tld/path/file.ext"
Ubuntu man page.
A new (but not yet released) tool is Mget.
It already has many of the options known from Wget and comes with a library that allows you to easily embed (recursive) downloading into your own application.
To answer your question:
mget --num-threads=4 [url]
UPDATE
Mget is now developed as Wget2 with many bugs fixed and more features (e.g. HTTP/2 support).
--num-threads is now --max-threads.
I strongly suggest using httrack.
ex: httrack -v -w http://example.com/
It will do a mirror with 8 simultaneous connections by default. HTTrack has a ton of options to play with. Have a look.
As other posters have mentioned, I'd suggest you have a look at aria2. From the Ubuntu man page for version 1.16.1:
aria2 is a utility for downloading files. The supported protocols are HTTP(S), FTP, BitTorrent, and Metalink. aria2 can download a file from multiple sources/protocols and tries to utilize your maximum download bandwidth. It supports downloading a file from HTTP(S)/FTP and BitTorrent at the same time, while the data downloaded from HTTP(S)/FTP is uploaded to the BitTorrent swarm. Using Metalink's chunk checksums, aria2 automatically validates chunks of data while downloading a file like BitTorrent.
You can use the -x flag to specify the maximum number of connections per server (default: 1):
aria2c -x 16 [url]
If the same file is available from multiple locations, you can choose to download from all of them. Use the -j flag to specify the maximum number of parallel downloads for every static URI (default: 5).
aria2c -j 5 [url] [url2]
Have a look at http://aria2.sourceforge.net/ for more information. For usage information, the man page is really descriptive and has a section on the bottom with usage examples. An online version can be found at http://aria2.sourceforge.net/manual/en/html/README.html.
wget can't download over multiple connections; instead you can try another program like aria2.
use
aria2c -x 10 -i websites.txt >/dev/null 2>/dev/null &
In websites.txt, put one URL per line, for example:
https://www.example.com/1.mp4
https://www.example.com/2.mp4
https://www.example.com/3.mp4
https://www.example.com/4.mp4
https://www.example.com/5.mp4
try pcurl
http://sourceforge.net/projects/pcurl/
It uses curl instead of wget, and downloads in 10 segments in parallel.
They always say it depends, but when it comes to mirroring a website, the best tool out there is httrack. It is super fast and easy to work with. The only downside is its so-called support forum, but you can find your way around using the official documentation. It has both a GUI and a CLI interface, and it supports cookies; just read the docs. This is the best. (Be careful with this tool: you can download the whole web onto your hard drive.)
httrack -c8 [url]
By default, the maximum number of simultaneous connections is limited to 8 to avoid server overload.
Use xargs to make wget work on multiple files in parallel:
#!/bin/bash
mywget()
{
wget "$1"
}
export -f mywget
# run wget in parallel using 8 thread/connection
xargs -P 8 -n 1 -I {} bash -c "mywget '{}'" < list_urls.txt
aria2 options, the right way to work with files smaller than 20 MB:
aria2c -k 2M -x 10 -s 10 [url]
-k 2M splits the file into 2 MB chunks
-k or --min-split-size has a default value of 20 MB; if you do not set this option and the file is under 20 MB, it will only run on a single connection no matter what value -x or -s has.
You can use xargs
-P is the number of processes: for example, with -P 4, four links will be downloaded at the same time; with -P 0, xargs will launch as many processes as possible and all of the links will be downloaded.
cat links.txt | xargs -P 4 -I{} wget {}
I'm using gnu parallel
cat listoflinks.txt | parallel --bar -j ${MAX_PARALLEL:-$(nproc)} wget -nv {}
cat pipes a list of newline-separated URLs to parallel
the --bar flag shows the parallel execution progress bar
the MAX_PARALLEL env var sets the maximum number of parallel downloads; use it carefully, the default here is the current number of CPUs
Tip: use --dry-run to see what will happen if you execute the command.
cat listoflinks.txt | parallel --dry-run --bar -j ${MAX_PARALLEL} wget -nv {}
make can be parallelised easily (e.g., make -j 4). For example, here's a simple Makefile I'm using to download files in parallel using wget:
BASE=http://www.somewhere.com/path/to
FILES=$(shell awk '{printf "%s.ext\n", $$1}' filelist.txt)
LOG=download.log

all: $(FILES)
	echo $(FILES)

%.ext:
	wget -N -a $(LOG) $(BASE)/$@

.PHONY: all
default: all
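Usage is then just make with the desired parallelism (remember that the recipe lines above must be indented with a tab, as in any Makefile):
# download up to 4 files at a time
make -j 4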
Consider using regular expressions or FTP globbing. That way you can start wget multiple times with different groups of filename starting characters, depending on their frequency of occurrence.
This is for example how I sync a folder between two NAS:
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.10 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[0-9a-hA-H]*" --directory-prefix=/volume1/foo &
wget --recursive --level 0 --no-host-directories --cut-dirs=2 --no-verbose --timestamping --backups=0 --bind-address=10.0.0.11 --user=<ftp_user> --password=<ftp_password> "ftp://10.0.0.100/foo/bar/[!0-9a-hA-H]*" --directory-prefix=/volume1/foo &
The first wget syncs all files/folders starting with 0, 1, 2... F, G, H and the second thread syncs everything else.
This was the easiest way to sync between a NAS with one 10G ethernet port (10.0.0.100) and a NAS with two 1G ethernet ports (10.0.0.10 and 10.0.0.11). I bound the two wget threads through --bind-address to the different ethernet ports and ran them in parallel by putting & at the end of each line. That way I was able to copy huge files at 2 x 100 MB/s = 200 MB/s in total.
Call wget for each link and set it to run in the background.
I tried this Python code
with open('links.txt', 'r') as f1:    # open links.txt in read mode
    list_1 = f1.read().splitlines()   # get every line in links.txt
    for i in list_1:                  # iterate over each link
        !wget "$i" -bq                # call wget in background mode (IPython/Jupyter-only syntax)
Parameters :
b - Run in Background
q - Quiet mode (No Output)
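Note that the !wget line only works inside IPython/Jupyter. From a plain shell, the same idea is simply the following, reading the same links.txt as above:
while read -r url; do
    wget -bq "$url"   # -b backgrounds each wget, -q suppresses output
done < links.txt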
If you are doing recursive downloads, where you don't know all of the URLs yet, wget is perfect.
If you already have a list of each URL you want to download, then skip down to cURL below.
Multiple Simultaneous Downloads Using Wget Recursively (unknown list of URLs)
# Multiple simultaneous downloads
URL=ftp://ftp.example.com
for i in {1..10}; do
wget --no-clobber --recursive "${URL}" &
done
The above loop will start 10 wget processes, each recursively downloading from the same website; however, they will not overlap or download the same file twice.
Using --no-clobber prevents each of the 10 wget processes from downloading the same file twice (including full relative URL path).
& forks each wget to the background, allowing you to run multiple simultaneous downloads from the same website using wget.
Multiple Simultaneous Downloads Using curl from a list of URLs
If you already have a list of URLs you want to download, curl -Z is parallelised curl, with a default of 50 downloads running at once.
However, for curl, the list has to be in this format:
url = https://example.com/1.html
-O
url = https://example.com/2.html
-O
So if you already have a list of URLs to download, simply format the list, and then run cURL
cat url_list.txt
#https://example.com/1.html
#https://example.com/2.html
touch url_list_formatted.txt
while read -r URL; do
echo "url = ${URL}" >> url_list_formatted.txt
echo "-O" >> url_list_formatted.txt
done < url_list.txt
Download in parallel using curl from list of URLs:
curl -Z --parallel-max 100 -K url_list_formatted.txt
For example,
$ curl -Z --parallel-max 100 -K url_list_formatted.txt
DL% UL% Dled Uled Xfers Live Qd Total Current Left Speed
100 -- 2512 0 2 0 0 0:00:01 0:00:01 --:--:-- 1973
$ ls
1.html 2.html url_list_formatted.txt url_list.txt