Does wget -w option not work with -p? - wget

When I run wget64.exe -p -w 10 http://www.example.com on the Windows command line for a site with many images, I expect, based on the documentation for -w, that this will space out all of its image downloads by 10 seconds each. But it does the whole thing with no waits. Is this because -w isn't meant to work with -p? Does grabbing the images linked in a page somehow "not count" as making additional requests to the server? Or am I using incorrect syntax?

wget64.exe -r -l 1 --wait=10 http://www.example.com should do what you want. It splits the page dependencies into separate retrievals and applies the wait time between them, instead of fetching them all as part of a single page request.
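If you also want the page requisites of each page that gets linked, -p can be combined with -r. A rough sketch using the same example URL (untested; all flags are standard wget options):
wget64.exe -r -l 1 -p --wait=10 --random-wait http://www.example.com
Here --random-wait additionally varies the delay between 0.5 and 1.5 times the --wait value.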

Related

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or a directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
<a href="http://localhost:8080/files">Lots o' files!</a>
If I were to request this with wget -r, wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that the link ends with a trailing slash...
<a href="http://localhost:8080/files/">Lots o' files!</a>
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output of adding --debug to your command line would clarify things in that case.
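For reference, a sketch of your original command with --debug added (the URL stays a placeholder, and the log filename is just an example; wget writes its debug output to stderr):
wget -r -np -nd -e robots=off -l 0 -A tbt,zip --debug URL 2> wget-debug.log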
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match both the linked file and the downloaded file, so my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This accepts all the .ashx links in the webpage; then, when wget does the second check after downloading, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.
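Putting those pieces together, a sketch of what such a command might look like (the URL and link pattern here are only illustrative):
wget -r -l 1 -A 'ashx,mp3' --trust-server-names http://example.com/downloads/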

wget download a section of a website

I want to download a particular section of a website. I am following this: wget - Download a sub directory. But the problem is that the section of the website does not have any particular url, i.e. the urls go like this: http://grephysics.net/ans/0177/* where * is a number from 1-100, and I can't use http://grephysics.net/ans/0177 in wget. How do I download these 100 webpages with links to each other (i.e. the Previous and Next buttons should link to local copies)?
I think this is what you need:
wget -p -k http://grephysics.net/ans/0177/{1..100}
Explanation:
-k : rewrites links to point to local assets
-p : get all images, js, css, etc. needed to display the page
{1..100} : this specifies a range of urls to download; in your case, pages labelled 1 to 100.
Why didn't recursive downloading work?
The link you posted was a good first resource, and probably what most people would want. But the way wget downloads recursively is by getting the first page specified (i.e. the root), then following links to child pages. The way grephysics is set up, however, http://grephysics.net/ans/0177 leads to a 404, so there are no links for wget to follow to child pages.
If your wget doesn't support {}
You can still have the same results by using the following command:
for i in {1..100}; do echo $i; done | wget -p -k -B http://grephysics.net/ans/0177/ -i -
Explanation
for i in {1..100};... : This prints the values 1 to 100.
| : For anyone who hasn't seen this, we are piping the output of the previous command into the input of the following command
-p : get all images, js, css, etc. needed to display the page
-k : rewrite the links to point to the local copies
-B : specifies the base URL to use with the -i option
-i : reads a list of urls to fetch from a file. Since we specified the 'file' as -, it reads from stdin.
So, we read in the values 1 to 100 and append them to our base url
http://grephysics.net/ans/0177/ and fetch all of those urls and all the assets that go with them, then rewrite links so we can browse offline.
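If brace expansion isn't available in your shell at all (so the for loop above won't help either), seq produces the same list of numbers; a sketch with the same flags:
seq 1 100 | wget -p -k -B http://grephysics.net/ans/0177/ -i -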

Delays between requests in wget

I want to download web using wget, but to make it more like a real user I would like to make small random delays between requests.
I'm executing wget via cmd.
You can add the option below to your command line, which adds a ten second wait between server requests.
-w 10
And you can also include
--random-wait
in your command line with the -w option, which will vary the wait between 0.5 and 1.5 times the value you provide.
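Putting the two together, a sketch for a recursive download (the URL is just an example):
wget -r -w 10 --random-wait http://www.example.com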
Perfect. Adding "-w 3" to the front of a recursive download prevented the server from becoming overloaded.
as in
wget -w 3 -m -np -c -R "index.html*" "http://example.com.whatever/public/files/"
-w 3 : wait 3 seconds between retrievals
-m : mirror, i.e. recurse to all folder depths and use source timestamps
-np : no parent, i.e. no upward traversal
-c : continue partially downloaded files
-R "index.html*" : reject any files named index.html
and finally the target host URL with the desired recursive files and folders
hope this helps someone else

Compare file sizes and download if they're different via wget

I'm downloading some .mp3 files (all legal) via wget:
wget -r -nc files.myserver.com
I have to stop the download sometimes, and at those times the file is left partially downloaded. For example, a 10-minute record.mp3 becomes a 4-minute record.mp3. It plays correctly but is incomplete.
If I use the same command again, wget skips record.mp3 because the file already exists on my local computer, even though it isn't complete.
I wonder if there is a way to check the file sizes and, if the sizes on the remote server and the local computer aren't the same, re-download the file. (I've learned that --spider gives the file size, but is there a way to check the sizes and decide whether to download automatically?)
I would go with wget's -N option for timestamping, but note that wget will only compare the file sizes if you also specify the --no-if-modified-since option. Without it, incomplete files are indeed skipped on the next run because they receive a timestamp of the current time, which is newer than that on the server.
The reason is probably that with only -N, a GET request is sent for the file with the If-Modified-Since field set. The server responds with either 200 or 304, but the 304 doesn't contain the file size so wget can't check it.
With --no-if-modified-since wget sends a HEAD request instead to get the timestamp and file size, and checks both.
What I use for recursive download of a folder:
wget -T 300 -nv -t 1 -r -nd -np -l 1 -N --no-if-modified-since -P $my_folder $my_url
With:
-T 300: Set the network timeout to 300 seconds
-nv: Turn off verbose without being completely quiet
-t 1: Set number of tries to 1
-r: Turn on recursive retrieving
-nd: Do not create a hierarchy of directories when retrieving recursively
-np: Do not ever ascend to the parent directory when retrieving recursively
-l 1: Specify recursion maximum depth 1
-N: Turn on time-stamping
--no-if-modified-since: Do not send If-Modified-Since header in ‘-N’ mode, send preliminary HEAD request instead
You may try the -c option to continue the download of partially downloaded files; however, the manual gives an explicit warning:
You need to be especially careful of this when using -c in conjunction
with -r, since every file will be considered as an "incomplete
download" candidate.
While there is no perfect solution to this problem, you could try the -N option to turn on timestamping. This might prevent errors when the file has changed on the server, but only if the server supports timestamping and partial downloads. Try it and see how it goes.
wget -r -N -c files.myserver.com
If you need to check whether a file was only partially downloaded (and therefore has a different size) or was updated on the remote server (by timestamp), and in either case should be re-downloaded locally, you need the -N option.
Here is some additional info about the -N (--timestamping) option from the Wget docs:
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the time-stamps say.
From: https://www.gnu.org/software/wget/manual/wget.html (Chapter 5, Time-Stamping)

wget appends query string to resulting file

I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the url will have a query string and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence about taking a different approach. I found out I can take the first filename that wget saves by parsing the output, so the name that appears after "Saving to:" is the one I need.
However, this name is wrapped in a strange character, â. Rather than just stripping it out with a hard-coded fix, where does it come from?
If you try with the --adjust-extension parameter
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is now simple to rename that file to index.html, even with a script.
If you are on a Microsoft OS, make sure you have the latest version of wget; it is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do the recursive crawling outside wget:
get the directory structure with wget (no files)
loop to get the main entry file (index.html) from each dir
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read line; do
  wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domain=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull in the content for each page, likely with a script on the server side (it may be client side if you look at the JavaScript).
Have you tried using --no-cookies? The site could be storing this information via a cookie and pulling it in when you hit the page. This could also be caused by URL rewrite logic, which you will have little control over from the client side.
Use the -O or --output-document option; see http://www.electrictoolbox.com/wget-save-different-filename/
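For a single page, a minimal sketch of that approach (the output filename here is just an example; note that -O writes everything to one file, so it only makes sense without -r or -p):
wget -O index.html "http://www.onlinetechvision.com/?p=566"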