I'm trying to download a directory and all its subdirectories from a website, using wget.
Reading all other SO questions I arrived at this:
wget -nH --recursive --no-parent --cut-dirs=5 --reject "index.html*" --directory-prefix="c:\temp" http://blahblah.com/directory/
However, no matter how I write the c:\temp, wget always creates "#5Ctemp" in the current directory and downloads into that directory. I checked the documentation, but to no avail.
Preferably I would also be able to use an environment variable as --directory-prefix, e.g.
--directory-prefix=%PREFIX%
Looks like the version of wget you're using (1.8.2) is either buggy or too old. It definitely works with newer versions, get one here:
wget 1.11.4 from gnuwin32
wget 1.14 from osspack32
wget 1.15 from eternallybored.org
For completeness, here's a link to the wget wiki download section.
I'm new to wget and trying to download a full website while ignoring any images or videos. What is the parameter? I tried going through wget --help but couldn't really find anything useful. Thanks in advance.
The -R or --reject option is what you need here. See the wget specs on gnu.org.
Say your URL is http://somesite.org/, then to download the root index and everything linked therefrom, ignoring media files, just do:
wget -r -R jpg,jpeg,gif,mpg,mkv http://somesite.org
where you add in whatever extra extensions the site is using for media files separated by commas.
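If the list of extensions gets long, it can be easier to keep it as separate words and join them with commas just before calling wget. A minimal sketch, assuming a POSIX shell; the extra extensions (png, mp4) are only illustrative guesses at what a site might use, and the command is printed rather than run so it can be checked first:

```shell
#!/bin/sh
# Hypothetical set of media extensions to skip; adjust for the site at hand.
set -- jpg jpeg gif png mpg mkv mp4

# "$*" joins the words using the first character of IFS, here a comma,
# which is the separator -R expects.
reject_list=$(IFS=,; echo "$*")

# Print the command for review instead of running it.
echo wget -r -R "$reject_list" http://somesite.org
# prints: wget -r -R jpg,jpeg,gif,png,mpg,mkv,mp4 http://somesite.org
```

Dropping the echo runs the download for real.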
I am trying to use wget to download a file under a different local name and only download if the file on the server is newer.
What I thought I could do was use the -O option of wget so as to be able to choose the name of the downloaded file, as in:
wget http://example.com/weird-name -O local-name
and combine that with the -N option that doesn't download anything except if the timestamp is newer on the server. For reasons explained in the comments below, wget refuses to combine both flags:
WARNING: timestamping does nothing in combination with -O. See the manual
for details.
Any ideas for a succinct workaround?
Download it, then create a link
wget -N example.com/weird-name
ln weird-name local-name
After that you can run wget -N and it will work as expected:
- Only download if newer
- If a new file is downloaded it will be accessible from either name, without costing you extra drive space
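The reason this works can be tried locally without any network. The sketch below uses plain shell redirection as a stand-in for the download, on the assumption that wget rewrites the existing file in place rather than deleting and recreating it (if the file were unlinked first, for example with wget's --unlink option, the second name would keep the old content):

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the first download.
echo "version 1" > weird-name

# A hard link gives the same file a second name.
ln weird-name local-name

# Stand-in for a later download that rewrites the file in place.
echo "version 2" > weird-name

# The second name sees the new content.
cat local-name
# prints: version 2
```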
If using another tool is an option in your case, I recommend the free, open source lwp-mirror:
lwp-mirror [-options] <url> <file>
It works just as you wish, with no workarounds.
This command is provided by the libwww-perl package on Ubuntu and Debian among other places.
Note that lwp-mirror doesn't support all of wget's other features. For example, it doesn't allow you to set a User Agent for the request like wget does.
I downloaded the source code of wget using apt-get source wget. I want to modify it a little, then use this wget rather than the one I'm using in /usr/bin/wget. How can I do that?
apt-get source wget is retrieving your distribution's source code of wget.
You may want to work on the genuine upstream wget source, which you can get (with some wget or some browser) by following links from http://www.gnu.org/software/wget/
Then you configure, build and install, usually with ./configure; make; sudo make install, but the details may vary from package to package. Look at the files named README and INSTALL.
You might also be interested in libcurl.
Note that the GPL more or less requires you to publish your patch (in source form) if you redistribute a binary of your patched wget.
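To run the modified build instead of /usr/bin/wget without overwriting the system copy, one option is to install it somewhere private (with an autoconf build, something like ./configure --prefix="$HOME/wget-custom" should do that) and put that directory first on PATH. The sketch below only shows the PATH part: the temporary directory and the stub script are stand-ins for the real install directory and the patched binary.

```shell
#!/bin/sh
# Throwaway directory standing in for an install location such as $HOME/wget-custom/bin.
bindir=$(mktemp -d)

# A stub named wget stands in for the patched binary.
printf '#!/bin/sh\necho "patched wget"\n' > "$bindir/wget"
chmod +x "$bindir/wget"

# Put that directory first on PATH, for this shell session only.
PATH="$bindir:$PATH"
export PATH

# The shell now resolves wget to the stub rather than /usr/bin/wget.
command -v wget
wget
```

Adding the same PATH line to a shell startup file would make the choice permanent; calling the binary by its full path works as well.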
I am trying to download all the folder structure and files under a folder in a website using wget.
Say there is a website like:
http://test/root. Under root it is like
/A
/A1/file1.java
/B
/B1/file2.html
My wget cmd is:
wget -r http://test/root/
I got all the folders and the html files, but no java files. Why is that?
UPDATE1:
I can access the file in the browser using:
http://test/root/A/A1/file1.java
I can also download this individual file using:
wget http://test/root/A/A1/file1.java
wget can only follow links.
If there is no link to the files in the subdirectories, then wget will not find those files. wget will not guess any file-names, it will not test exhaustively for filenames and wget does not practice black magic.
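A quick way to see the problem is to list the links a page actually contains. The sketch below builds a made-up index page (its contents are an assumption, not the real http://test/root page) and pulls out the href targets with grep and sed. This is only a rough approximation of link extraction, not how wget parses HTML, but it makes the point: a file that no page links to never shows up.

```shell
#!/bin/sh
# Hypothetical index page; note that nothing links to file1.java.
workdir=$(mktemp -d)
cat > "$workdir/index.html" <<'EOF'
<html><body>
<a href="A/">A</a>
<a href="B/B1/file2.html">file2</a>
</body></html>
EOF

# Extract every href target on the page, one per line.
links=$(grep -o 'href="[^"]*"' "$workdir/index.html" | sed 's/^href="//; s/"$//')
printf '%s\n' "$links"
```

Here the output lists A/ and B/B1/file2.html only, so a crawler starting from this page has no way to learn that file1.java exists.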
Just because you can access the files in a browser does not mean that wget can necessarily retrieve them. Your browser has code able to recognize the directory structure; wget only knows what you tell it.
You can try adding the java file to an accept list first, perhaps that's all it needs:
wget -r -A "*.java" http://test/root
But it sounds like you're trying to get a complete offline mirror of the site. Let's start, as with any command we're trying to figure out, with man wget:
Wget can follow links in HTML, XHTML, and CSS pages, to create local
versions of remote web sites, fully recreating the directory structure
of the original site. This is sometimes referred to as "recursive
downloading." While doing that, Wget respects the Robot Exclusion
Standard (/robots.txt). Wget can be instructed to convert the links in
downloaded files to point at the local files, for offline viewing.
What We Need
1. Proper links to the file to be downloaded.
In your index.html file, you must provide a link to the Java file, otherwise wget will not recognize it as needing to be downloaded. For your current directory structure, ensure file2.html contains a link to the Java file. Format it to link to a directory above the current one:
JavaFile
However, if the file1.java is not sensitive and you routinely do this, it's cleaner and less code to put an index.html file in your root directory and link to:
JavaFile
If you only want the Java files and want to ignore HTML, you can use --reject like so:
wget -r -nH --reject="file2.html" http://test/root/
### Or to reject ALL html files ###
wget -r -nH --reject="*.html" http://test/root/
This will recursively (-r) go through all directories starting at the point we specify.
2. Respect robots.txt
If the site has a /robots.txt file, ensure it does not prevent crawling. If it does, you need to instruct wget to ignore it by adding the following option to your wget command:
wget ... -e robots=off http://test/root
3. Convert remote links to local files.
Additionally, wget must be instructed to convert links to point at the downloaded files. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is using the mirror option.
Try this:
wget -mpEk http://test/root/
# If robots.txt is present:
wget -mpEk -e robots=off http://test/root/
Using -m instead of -r is preferred because it removes the maximum recursion depth (plain -r stops after five levels) and turns on timestamping. Mirror is pretty good at determining the full depth of a site, and wget stays on the starting host unless told otherwise, so external links should not pull in other sites. The other flags fill in the rest: -p fetches the prerequisite files each page needs (images, stylesheets), -E saves HTML pages with an .html extension, and -k converts links to point at the local files. The output should be a preserved directory structure that can be browsed offline.
Since you should have a link set up, you should get file1.java inside the ../A1/ directory. This command should also work as is, without a specific link to the Java file in index.html or file2.html, but adding one doesn't hurt, as it preserves the rest of your directory. Mirror mode also works with a directory structure served over ftp://.
General rule of thumb:
Depending on the size of the site you are mirroring, you may be sending many requests to the server. To avoid being blacklisted or cut off, use the wait option to rate-limit your downloads. For a site the size of the one you posted you shouldn't have to, but for any large site you'll want to use it:
wget -mpEk --no-parent -e robots=off --wait=1 --random-wait http://test/root/
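For larger sites it may help to keep the politeness settings in variables and spell the options out in long form, so the command is easier to read and tune. A sketch, with the delay and bandwidth cap being assumed values rather than recommendations; it prints the command for review instead of running it:

```shell
#!/bin/sh
# Assumed values; tune them for the server being mirrored.
wait_seconds=2        # base delay between requests, varied by --random-wait
rate_limit=200k       # cap on download bandwidth
url=http://test/root/

# Long-form equivalents of -m -p -E -k, plus the rate-limiting options.
cmd="wget --mirror --page-requisites --adjust-extension --convert-links"
cmd="$cmd --no-parent -e robots=off"
cmd="$cmd --wait=$wait_seconds --random-wait --limit-rate=$rate_limit $url"

# Print the command rather than executing it.
echo "$cmd"
```

--random-wait varies the delay around the value given to --wait, so the two are meant to be used together.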
Perhaps I'm missing something simple here, but is there any place to download the GWT documentation for use offline?
If you download the SDK here you will have it in
yourGWTFolder\doc\javadoc\index.html
Where 'yourGWTFolder' is the folder you unzipped the file you downloaded to.
You can try something like:
wget --no-check-certificate -k -r -np -p https://developers.google.com/web-toolkit/doc/latest/
I do not know if wget is available on Windows. If not, you can use Cygwin or a Linux VM.