Why doesn't wget get java files recursively? - wget

I am trying to download all the folder structure and files under a folder in a website using wget.
Say there is a website like:
http://test/root. Under root it is like
/A
/A1/file1.java
/B
/B1/file2.html
My wget cmd is:
wget -r http://test/root/
I got all the folders and the html files, but no java files. Why is that?
UPDATE1:
I can access the file in the browser using:
http://test/root/A/A1/file1.java
I can also download this individual file using:
wget http://test/root/A/A1/file1.java

wget can just follow links.
If there is no link to the files in the subdirectories, then wget will not find those files. wget will not guess any file-names, it will not test exhaustively for filenames and wget does not practice black magic.

Just because you can access the files in a browser does not mean that wget can necessarily retrieve them. Your browser has code able to recognize the directory structure; wget only knows what you tell it.
You can try adding the java files to an accept list first; perhaps that's all it needs:
wget -r -A "*.java" http://test/root
(Note that even with an accept list, wget still fetches HTML pages temporarily so it can follow their links; pages that don't match the list are deleted after parsing.)
But it sounds like you're trying to get a complete offline mirror of the site. Let's start, as with any command we're trying to figure out, with man wget:
Wget can follow links in HTML, XHTML, and CSS pages, to create local
versions of remote web sites, fully recreating the directory structure
of the original site. This is sometimes referred to as "recursive
downloading." While doing that, Wget respects the Robot Exclusion
Standard (/robots.txt). Wget can be instructed to convert the links in
downloaded files to point at the local files, for offline viewing.
What We Need
1. Proper links to the files to be downloaded.
In your index.html file, you must provide a link to the Java file; otherwise wget will not recognize it as needing to be downloaded. For your current directory structure, ensure file2.html contains a link to the java file, formatted to point up and over into the other directory:
<a href="../../A/A1/file1.java">JavaFile</a>
However, if file1.java is not sensitive and you routinely do this, it's cleaner and less code to put an index.html file in your root directory and link to:
<a href="A/A1/file1.java">JavaFile</a>
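As a sketch, a minimal index.html served from the root (the filenames come from the directory listing above; the page layout itself is assumed) that lets wget discover both files could look like:

```html
<!-- hypothetical index.html at http://test/root/index.html -->
<html>
  <body>
    <a href="A/A1/file1.java">file1.java</a>
    <a href="B/B1/file2.html">file2.html</a>
  </body>
</html>
```

With links like these in place, wget -r has something to follow and will pick up the java file.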
If you only want the Java files and want to ignore HTML, you can use --reject like so:
wget -r -nH --reject="file2.html" http://test/root/
### Or to reject ALL html files ###
wget -r -nH --reject="*.html" http://test/root/
This will recursively (-r) go through all directories starting at the point we specify.
2. Respect robots.txt
Ensure that if you have a /robots.txt file in your /root/ directory, it does not prevent crawling. If it does, you need to instruct wget to ignore it by adding -e robots=off to your wget command:
wget ... -e robots=off http://test/root
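For reference, a robots.txt like the following (a hypothetical example) would stop wget's recursion entirely, since wget honors the Robot Exclusion Standard by default:

```
User-agent: *
Disallow: /
```

If you control the server, loosening this file is usually better than overriding it client-side.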
3. Convert remote links to local files.
Additionally, wget must be instructed to convert the links in downloaded files so they point at local copies. If you've done everything above correctly, you should be fine here. The easiest way I've found to get all files, provided nothing is hidden behind a non-public directory, is to use the mirror option (-m).
Try this:
wget -mpEk http://test/root/
# If robots.txt is present:
wget -mpEk -e robots=off http://test/root/
Using -m instead of -r is preferred: it is equivalent to -r -N -l inf --no-remove-listing, so there is no maximum recursion depth and downloads are timestamped. Mirror mode is pretty good at determining the full depth of a site; however, if you have many external links you could end up downloading more than just your site, which is why we add -p -E -k: -p fetches all the prerequisite files needed to render each page, -E adds an .html extension where needed, and -k converts the links to point at the local files. The output should be a working local copy with a preserved directory structure.
Since you should now have a link set up, you should get file1.java inside the ../A1/ directory. However, this command should work as is even without a specific link to the java file in your index.html or file2.html; having one doesn't hurt, as it preserves the rest of your directory. Mirror mode also works with a directory structure served over ftp://.
General rule of thumb:
Depending on the size of the site you are mirroring, you're sending many calls to the server. To avoid being blacklisted or cut off, use the wait options to rate-limit your downloads. For a site the size of the one you posted you shouldn't have to, but for any large site you're mirroring you'll want to:
wget -mpEk --no-parent -e robots=off --random-wait http://test/root/

Related

Why does wget add html extensions to every file?

I'm using following command to download all files from a server
wget -R "index.*" -m -np -e robots=off http://robotics.ethz.ch/~asl-datasets/ijrr_euroc_mav_dataset/
All files are recognized correctly, but wget adds .html to all files. For example: ijrr_euroc_mav_dataset/calibration_datasets/cam_april/cam_april.bag becomes ijrr_euroc_mav_dataset/calibration_datasets/cam_april/cam_april.bag.html
Why is that?
Also, wget creates the folder ~asl-datasets which I didn't ask for. I just wanted to download all files below ijrr_euroc_mav_dataset.
These are two separate questions, but both are easy to answer. (I already solved this in the comments, but am answering here since that was apparently a spot-on observation.)
The first is: why is Wget adding a .html suffix to your files? The reason is most likely that you have adjust_extension enabled in your ~/.wgetrc file. This option is disabled by default for obvious reasons, but is useful in many cases. Try modifying the ~/.wgetrc file, or use --no-config (or --config=/dev/null if using a >5 year old version of Wget).
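If that is the cause, the offending line in ~/.wgetrc would look something like this (adjust_extension is the wgetrc spelling of the -E command-line flag):

```
# ~/.wgetrc
adjust_extension = on
```

Removing the line or setting it to off stops the .html suffixes from being appended.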
The second question is why Wget is creating a directory. Well, the answer to that is simple: you asked it to mirror a website which has that directory. You can use the --cut-dirs option to fine-tune which directories you want Wget to create on disk. (In your case, I think --cut-dirs=2 --no-host-directories might be appropriate, since you don't care about preserving the directory structure. However, remember that this means files in different directories with the same name will likely overwrite each other.)
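To illustrate what --cut-dirs=2 --no-host-directories would do here, this sketch mimics the path rewriting with cut (the URL path is taken from the question; the transformation shown is what wget performs internally, not a call to wget itself):

```shell
# Remote path below the host, as in the question's URL
url_path="~asl-datasets/ijrr_euroc_mav_dataset/calibration_datasets/cam_april/cam_april.bag"

# --no-host-directories drops the host directory; --cut-dirs=2 then strips
# the first two path components, so wget would save the file under:
local_path="$(printf '%s\n' "$url_path" | cut -d/ -f3-)"
echo "$local_path"   # calibration_datasets/cam_april/cam_april.bag
```

So the unwanted ~asl-datasets (and the dataset directory below it) disappear from the local tree, while the deeper structure is kept.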

Using wget to download all files/folders from URL not working?

So I'm trying to download everything from this directory http://pds-atmospheres.nmsu.edu/PDS/data/mslrem_1001/DATA/ using wget, but I can't seem to get it working. It only ever downloads the index.html (which it then removes) and a robots.txt.
I've been using this command I've seen around the internet:
wget -r -np -nH --cut-dirs=3 -R index.html http://pds-atmospheres.nmsu.edu/PDS/data/mslrem_1001/DATA/
but I only get that in response, when I want all the files in DATA, just like copying a folder in Explorer :\

using wget to download website ignoring image and videos

I'm new to wget and trying to download a full website while ignoring any images and videos. What is the parameter? I tried going through wget --help but couldn't really find anything useful. Thanks in advance.
The -R or --reject option is what you need here. See the wget specs on gnu.org.
Say your URL is http://somesite.org/, then to download the root index and everything linked therefrom, ignoring media files, just do:
wget -r -R jpg,jpeg,gif,mpg,mkv http://somesite.org
where you add in whatever extra extensions the site is using for media files separated by commas.

wget :: rename downloaded files and only download if newer

I am trying to use wget to download a file under a different local name and only download if the file on the server is newer.
What I thought I could do was use the -O option of wget so as to be able to choose the name of the downloaded file, as in:
wget http://example.com/weird-name -O local-name
and combine that with the -N option that doesn't download anything except if the timestamp is newer on the server. For reasons explained in the comments below, wget refuses to combine both flags:
WARNING: timestamping does nothing in combination with -O. See the manual
for details.
Any ideas on succinct work-arounds ?
Download it, then create a link
wget -N example.com/weird-name
ln weird-name local-name
After that you can run wget -N and it will work as expected:
- Only download if newer
- If a new file is downloaded it will be accessible from either name, without costing you extra drive space
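To see why the hard link costs no extra space, here is a quick sketch (GNU stat assumed; on BSD/macOS use stat -f %i instead of stat -c %i):

```shell
# Create a file standing in for the download, then hard-link it
echo "downloaded content" > weird-name
ln weird-name local-name

# Both names point at the same inode, i.e. the same bytes on disk
stat -c %i weird-name
stat -c %i local-name   # same number as above

# A newer download through one name is visible through the other
echo "newer content" > weird-name
cat local-name          # prints: newer content
```

Since both names share one inode, wget -N can keep updating weird-name and local-name stays current for free.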
If using another tool is possible in your case, I recommend the free, open-source lwp-mirror:
lwp-mirror [-options] <url> <file>
It works just as you wish, with no workarounds.
This command is provided by the libwww-perl package on Ubuntu and Debian among other places.
Note that lwp-mirror doesn't support all of wget's other features. For example, it doesn't allow you to set a User Agent for the request like wget does.

How do I use wget for the targets that are linked behind php or cgi etc scripts?

Oftentimes I need to use wget from a remote non-GUI login, but I see that the links presented on webpages do not point directly to the file but rather to a script that leads to the download. This means it is impossible to use wget naively to fetch the file. Instead, I have to do a browser download followed by scp to the remote login.
Is there a way I can use wget to really target the intended file somehow?