wget converted some of the links - how to convert links after download? - wget

I used
wget --mirror --convert-links http://example.com/ 2>&1 | tee -a wget.log
to download a website. It turns out that only some of the links were converted. How can I have all of the links converted, even after the download? I do not want to download all of the contents again.

Firstly, please be aware that --convert-links does its job only after everything has been downloaded, so if you inspect a downloaded file before wget has finished working, you might see unconverted links.
"I do not want to download all of the contents again."
Then you should use --no-clobber. According to the man page, --mirror is equivalent to -r -N -l inf --no-remove-listing, and --no-clobber and -N are mutually exclusive. You therefore cannot use --mirror itself, only its component options without -N. Taking this into account, your command should look like this:
wget -r --no-clobber -l inf --no-remove-listing --convert-links http://example.com/
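Once wget has finished, one way to see which files still contain absolute links is to search the mirror for the site's URL. The sketch below assumes the mirror was saved under a directory such as ./example.com; list_unconverted is a hypothetical helper name, and --include is a GNU/BSD grep option. Keep in mind that, according to the wget manual, --convert-links deliberately rewrites links to files that were not downloaded as absolute URLs, so a hit is not necessarily a failed conversion.

```shell
# list_unconverted DIR URL
# Print the HTML files under DIR that still mention URL as a fixed string.
list_unconverted() {
    mirror_dir=$1   # e.g. ./example.com (assumed location of the mirror)
    site_url=$2     # e.g. http://example.com/
    grep -rlF --include='*.html' --include='*.htm' -- "$site_url" "$mirror_dir"
}
```

Usage: list_unconverted ./example.com http://example.com/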

Related

Follow only certain links with Wget but download every host from those links

So, let's say I want to mirror a site with Wget. I want wget to follow and download all the links from http://www.example.com/example/ or http://example.example.com/. How can I do this? I tried this command but it doesn't seem to be working the way I want it to work.
wget -r --mirror -I '/example' -H -D'example.example.com' 'https://www.example.com/'
So you want to start at 'https://www.example.com/' and save files from 'http://www.example.com/example/' and 'http://example.example.com/'?
Then leave out -H. -I is ambiguous here: does it apply to both domains or just the first? By the way, -r is already included in --mirror.
Check out --accept-regex and --reject-regex for finer-grained control, e.g. --accept-regex="(www.example.com/example/|example.example.com/)".
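Before starting a long crawl, it can help to try a candidate pattern against a few sample URLs. The sketch below uses grep -E as a stand-in for wget's regex matching, which is an assumption: the two may differ in details. The dots are escaped here so they match literally; url_accepted and the sample URLs are made up for illustration.

```shell
# Candidate pattern with the dots escaped so they match literally.
pattern='(www\.example\.com/example/|example\.example\.com/)'

# url_accepted URL: succeed if URL matches the pattern (grep -E, extended regex).
url_accepted() {
    printf '%s\n' "$1" | grep -Eq -- "$pattern"
}

# Report how each sample URL fares against the pattern.
for url in \
    'http://www.example.com/example/page.html' \
    'http://example.example.com/index.html' \
    'http://www.example.com/other/page.html'
do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```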

wget not following links with mirror

I'm trying to semi mirror a site. What I want is to download all of the MP3s and make sure I'm not redownloading those that I already have (hence the "mirror" part). I've typed in the following:
wget -m -nd -e robots=off --random-wait -A "*.mp3" -P FOLDER http://www.example.com/
It downloads all the MP3s on the current page, but it never follows the links to the "Next Page" or the like. I've replaced the -m with -N -c -r without success. What other options can I use?
Try:
wget --execute robots=off --recursive --accept mp3,MP3 --random-wait --no-parent --continue --no-clobber //site.com/

how to use wget on a site with many folders and subfolders

I try to download this site, with this code:
wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off tenshi.spb.ru/anime-ost/
But I only get the index, and wget enters only the first folder, not the subfolders. Can you help me?
I use this command to download sites including their subfolders:
wget --mirror -p --convert-links -P . [site address]
A little explanation:
--mirror is a shortcut for -N -r -l inf --no-remove-listing.
--convert-links makes links in downloaded HTML or CSS point to local files.
-p gets all the images, etc. needed to display HTML pages.
-P specifies that the next argument is the directory the files will be saved to.
I found the command at:
http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/
You use -l 1, also known as --level=1, which limits recursion to one level. Set it to a higher value to download more pages. By the way, I prefer long options like --level because it's easier to see what you are doing without going back to the man pages.
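As an illustration, here is the asker's command rewritten with long options and the recursion depth turned into a parameter. fetch_mp3 is a hypothetical wrapper name, and the depth of 3 in the usage line is only an example value, not a recommendation for this particular site.

```shell
# fetch_mp3 DEPTH URL
# Long-option form of: wget -r -l<DEPTH> -H -t1 -nd -N -np -A.mp3 -erobots=off URL
fetch_mp3() {
    depth=$1   # maximum recursion depth, e.g. 3
    url=$2     # start URL
    wget --recursive --level="$depth" --span-hosts --tries=1 \
         --no-directories --timestamping --no-parent \
         --accept=.mp3 --execute robots=off "$url"
}
```

Usage: fetch_mp3 3 tenshi.spb.ru/anime-ost/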

Can't resume "wget --mirror" with --no-clobber (-c -F -B unhelpful)

I started a wget mirror with "wget --mirror [sitename]", and it was
working fine, but accidentally interrupted the process.
I now want to resume the mirror with the following caveats:
If wget has already downloaded a file, I don't want it downloaded
again. I don't even want wget to check the timestamp: I know the
version I have is "recent enough".
I do want wget to read the files it's already downloaded and
follow links inside those files.
I can use "-nc" for the first point above, but I can't seem to coerce
wget to read through files it's already downloaded.
Things I've tried:
The obvious "wget -c -m" doesn't work, because it wants
to compare timestamps, which requires making at least a HEAD request
to the remote server.
"wget -nc -m" doesn't work, since -m implies -N, and -nc is
incompatible with -N.
"wget -F -nc -r -l inf" is the best I could come up with, but it
still fails. I was hoping "-F" would coerce wget into reading local,
already-downloaded files as HTML, and thus follow links, but this
doesn't appear to happen.
I tried a few other options (like "-c" and "-B [sitename]"), but
nothing works.
How do I get wget to resume this mirror?
Apparently this works:
Solved: Wget error “Can’t timestamp and not clobber old files at the
same time.” Posted on February 4, 2012 While trying to resume a
site-mirror operation I was running through Wget, I ran into the error
“Can’t timestamp and not clobber old files at the same time”. It turns
out that running Wget with the -N and -nc flags set at the same time
can’t happen, so if you want to resume a recursive download with
noclobber you have to disable -N. The -m attribute (for mirroring)
intrinsically sets the -N attribute, so you’ll have to switch from -m
to -r in order to use noclobber as well.
From: http://www.marathon-studios.com/blog/solved-wget-error-cant-timestamp-and-not-clobber-old-files-at-the-same-time/
-m, according to the wget manual is equivalent to this longer series of settings: -r -N -l inf --no-remove-listing. Just use those settings instead of -m, and without -N (timestamping).
I'm not sure whether there is a way to get wget to download URLs from existing HTML files. There probably is: I know it can take HTML files as input and scrape all the links in them. Perhaps you could use a bash command to concatenate all the HTML files together into one big file.
I solved this problem by just deleting all the HTML files, because I didn't mind redownloading only those. But this might not work for everyone's use case.
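The idea of feeding already-downloaded HTML back to wget can be sketched with --force-html, --input-file and --base, which are documented wget options. Rather than concatenating everything, the loop below passes each file separately so that relative links resolve against that page's own URL. Assumptions: the mirror lives in a directory named after the host (as wget -r creates by default), the function is run from the directory containing it, and every page was fetched over the same scheme. refeed_local_html is a hypothetical helper, and this only fetches links found directly in the local pages; it has not been verified as a full replacement for resuming the recursion.

```shell
# refeed_local_html MIRROR_DIR [SCHEME]
# Hand every local HTML file back to wget as a list of links to fetch.
refeed_local_html() {
    mirror_dir=$1          # directory named after the host, e.g. example.com
    scheme=${2:-http}      # URL scheme the mirror was fetched with
    find "$mirror_dir" -type f \( -name '*.html' -o -name '*.htm' \) |
    while IFS= read -r page; do
        # --force-html parses the input file as HTML; --base resolves its
        # relative links against the URL the page originally came from.
        # --force-directories keeps the host/path layout so that
        # --no-clobber can recognise files that already exist.
        wget --no-clobber --force-directories \
             --force-html --input-file="$page" --base="$scheme://$page"
    done
}
```

Usage: refeed_local_html example.com http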

Why does wget only download the index.html for some websites?

I'm trying to use wget command:
wget -p http://www.example.com
to fetch all the files on the main page. For some websites it works, but in most cases it only downloads index.html. I've tried the wget -r command but it doesn't work. Does anyone know how to fetch all the files on a page, or just get a list of the files and corresponding URLs on the page?
Wget is also able to download an entire website. But because this can put a heavy load upon the server, wget will obey the robots.txt file.
wget -r -p http://www.example.com
The -p parameter tells wget to include all files, including images. This means that all of the HTML files will look as they should.
So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command like this:
wget -r -p -e robots=off http://www.example.com
Many sites will not let you download the entire site and will check your browser's identity. To get around this, use -U mozilla to send a browser-like user-agent string.
wget -r -p -e robots=off -U mozilla http://www.example.com
Many website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds after every download, which wget does when you include --wait=X (where X is the number of seconds).
You can also use the --random-wait parameter to let wget choose a random number of seconds to wait. To include this in the command:
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
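The two waiting options can also be combined; per the wget manual, --random-wait then varies each pause between 0.5 and 1.5 times the --wait value. A minimal sketch, where polite_fetch is a hypothetical wrapper name and the 2-second base wait in the usage line is an arbitrary example:

```shell
# polite_fetch SECONDS URL
# Recursive download that pauses roughly SECONDS between requests.
polite_fetch() {
    wait_seconds=$1   # base wait, randomised by --random-wait
    url=$2            # start URL
    wget --wait="$wait_seconds" --random-wait --recursive --page-requisites \
         --execute robots=off --user-agent=mozilla "$url"
}
```

Usage: polite_fetch 2 http://www.example.com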
Firstly, to clarify the question, the aim is to download index.html plus all the requisite parts of that page (images, etc). The -p option is equivalent to --page-requisites.
The reason the page requisites are not always downloaded is that they are often hosted on a different domain from the original page (a CDN, for example). By default, wget refuses to visit other hosts, so you need to enable host spanning with the --span-hosts option.
wget --page-requisites --span-hosts 'http://www.amazon.com/'
If you need to be able to load index.html and have all the page requisites load from the local version, you'll need to add the --convert-links option, so that URLs in img src attributes (for example) are rewritten to relative URLs pointing to the local versions.
Optionally, you might also want to save all the files under a single "host" directory by adding the --no-host-directories option, or save all the files in a single, flat directory by adding the --no-directories option.
Using --no-directories will result in lots of files being downloaded to the current directory, so you probably want to specify a folder name for the output files, using --directory-prefix.
wget --page-requisites --span-hosts --convert-links --no-directories --directory-prefix=output 'http://www.amazon.com/'
The link you have provided is the homepage, i.e. /index.html, so it is not surprising that you get only an index.html page. To download a specific file, for example "test.zip", you need to add the exact file name at the end of the URL. For example, use the following command to download test.zip:
wget -p domainname.com/test.zip
Download a Full Website Using wget --mirror
The following command downloads a full website and makes it available for local viewing.
wget --mirror -p --convert-links -P ./LOCAL-DIR http://www.example.com
--mirror: turn on options suitable for mirroring.
-p: download all files that are necessary to properly display a given HTML page.
--convert-links: after the download, convert the links in the documents for local viewing.
-P ./LOCAL-DIR: save all the files and directories to the specified directory
Download Only Certain File Types Using wget -r -A
You can use this in the following situations:
Download all images from a website
Download all videos from a website
Download all PDF files from a website
wget -r -A.pdf http://example.com/test.pdf
Another problem might be that the site you're mirroring uses links without www. So if you specify
wget -p -r http://www.example.com
it won't download any linked (internal) pages because they are from a "different" domain. If this is the case, then use
wget -p -r http://example.com
instead (without www).
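If the site mixes both forms of the host name, another option may be to let wget span hosts but restrict it to the one domain. This sketch assumes that --domains=example.com also covers www.example.com, which is worth confirming on a small test run; mirror_both_hosts is a hypothetical wrapper name.

```shell
# mirror_both_hosts DOMAIN URL
# Follow links across hosts, but only within DOMAIN.
mirror_both_hosts() {
    domain=$1      # e.g. example.com (assumed to cover www.example.com too)
    start_url=$2   # e.g. http://www.example.com
    wget --recursive --page-requisites --span-hosts --domains="$domain" "$start_url"
}
```

Usage: mirror_both_hosts example.com http://www.example.com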
I had the same problem downloading files of the CFSv2 model. I solved it by combining the above answers and adding the parameter --no-check-certificate
wget -nH --cut-dirs=2 -p -e robots=off --random-wait -c -r -l 1 -A "flxf*.grb2" -U Mozilla --no-check-certificate https://nomads.ncdc.noaa.gov/modeldata/cfsv2_forecast_6-hourly_9mon_flxf/2018/201801/20180101/2018010100/
Here is a brief explanation of every parameter used; for further explanation, see the GNU wget 1.2 Manual
-nH equivalent to --no-host-directories: Disable generation of host-prefixed directories. In this case, it avoids creating the directory ./nomads.ncdc.noaa.gov/
--cut-dirs=<number>: Ignore directory components. In this case, avoid the generation of the directories ./modeldata/cfsv2_forecast_6-hourly_9mon_flxf/
-p equivalent to --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-e robots=off: avoid downloading the robots.txt file
--random-wait: Causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the --wait option.
-c equivalent to --continue: continue getting a partially-downloaded file.
-r equivalent to --recursive: Turn on recursive retrieving. The default maximum depth is 5.
-l <depth> equivalent to --level <depth>: Specify recursion maximum depth level
-A <acclist> equivalent to --accept <acclist>: specify a comma-separated list of the name suffixes or patterns to accept.
-U <agent-string> equivalent to --user-agent=<agent-string>: The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as ‘Wget/version’, the version being the current version number of Wget.
--no-check-certificate: Don't check the server certificate against the available certificate authorities.
I know that this thread is old, but try what is mentioned by Ritesh with:
--no-cookies
It worked for me!
If you search for index.html in the wget manual, you will find the option --default-page=name, which is index.html by default. You can change it to index.php, for example.
--default-page=index.php
If you only get the index.html and that file looks like it only contains binary data (i.e. no readable text, only control characters), then the site is probably sending the data using gzip compression.
You can confirm this by running cat index.html | gunzip to see if it outputs readable HTML.
If this is the case, then wget's recursive feature (-r) won't work. There is a patch for wget to work with gzip compressed data, but it doesn't seem to be in the standard release yet.
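As a workaround for a file that has already been downloaded, the check above can be scripted: gzip data starts with the two bytes 1f 8b, so the file can be tested and then decompressed in place. is_gzip and gunzip_in_place are hypothetical helper names, and head -c is assumed to be available (it is on GNU and BSD systems). Newer wget releases also document a --compression option; whether your build has it is something to check with wget --help.

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# gunzip_in_place FILE: replace FILE with its decompressed contents,
# but only if it is actually gzip data.
gunzip_in_place() {
    if is_gzip "$1"; then
        gzip -dc < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    fi
}
```

Usage: gunzip_in_place index.html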