recursive wget with hotlinked requisites - wget

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let's look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let's pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.

You can't specify to span hosts for page-reqs only; -H is all or nothing. Since -r and -H will pull down the entire Internet, you'll want to split the crawls that use them. To grab hotlinked page-reqs, you'll have to run wget twice: once to recurse through the site's structure, and once to grab hotlinked reqs. I've had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) build a list of all HTML files in the site structure (find . | grep html) and pipe to file
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site's structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.

I've managed to do this by using regular expressions. Something like this to mirror http://www.example.com/docs
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
http://www.example.com/docs
You'll probably have to tune the regexs for each specific site. For example some sites like to use parameters on css files (e.g. style.css?key=value), which this example will exclude.
The files you want to include from other hosts will probably include at least
Images: png jpg jpeg gif
Fonts: ttf otf woff woff2 eot
Others: js css svg
Anybody know any others?
So the actual regex you want will probably look more like this (as one string with no linebreaks):
^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj]
[Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww]
[Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$

Related

How to wget a single page with NO assets ONLY the HTML page

Can someone give me the wget command to download a single html page without any assets. Literally only the html page, and nothing else. Example:
wget --no-check-certificate --level=0 https://blablabla.com/get?id=1111
result: 1111.html
wget man page states that
--level=depth
Set the maximum number of subdirectories that Wget will
recurse into to depth. In order to prevent one from
accidentally downloading very large websites when using
recursion this is limited to a depth of 5 by default, i.e.,
it will traverse at most 5 directories deep starting from the
provided URL. Set -l 0 or -l inf for infinite recursion
depth.
wget -r -l 0 http://<site>/1.html
Ideally, one would expect this to download just 1.html. but
unfortunately this is not the case, because -l 0 is
equivalent to -l inf---that is, infinite recursion. To
download a single HTML page (or a handful of them), specify
them all on the command line and leave away -r and -l.(...)
So --level=0 does mean infinite recursion, you should not provide any --level if you want to download single file. Taking this into account your command should be altered to
wget --no-check-certificate https://blablabla.com/get?id=1111

Using wget (for windows) to download all MIDI files

I've been trying to use wget to download all midi files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites which all suggest the same thing, but it doesn't download anything. I've tried various variations on the theme, such as different values for '-l' etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.
I don't know much about all the parameters you are using like H, -t1, -N etc though we can find it online. But I also had to download files from a url matching a wildcard. So command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
after -P you specify the path where you wanna save the files to and after -A you provide the wild card token. Like in your case that would be "*.mid".
-A means Accept. So here we provide the files to accept from the provided URL. Similarly -R for reject list.
You may have better luck (at least, you'll get more MIDI files), if you try the actual Cyber Hymnal™, which moved over 10 years ago. The current URL is now http://www.hymntime.com/tch/.

Download all pdf files using wget

I have the following site http://www.asd.com.tr. I want to download all PDF files into one directory. I've tried a couple of commands but am not having much luck.
$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/
With this code only four PDF files were downloaded. Check this link, there are over several thousand PDFs available:
http://www.asd.com.tr/Default.aspx
For instance, hundreds of files are in the following folder:
http://www.asd.com.tr/Folders/asd/…
But I can't figure out how to access them correctly to see and download them all, there are some of folders in this subdirectory, http://www.asd.com.tr/Folders/, and thousands of PDFs in these folders.
I've tried to mirror site using -m command but it failed too.
Any more suggestions?
First, verify that the TOS of the web site permit to crawl it. Then, one solution is :
mech-dump --links 'http://domain.com' |
grep pdf$ |
sed 's/\s+/%20/g' |
xargs -I% wget http://domain.com/%
The mech-dump command comes with Perl's module WWW::Mechanize (libwww-mechanize-perl package on debian & debian likes distros)

Mirroring a website and maintaining URL structure

The goal
I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.
The command
This is almost perfect for my needs (...but not quite):
wget --mirror -nH -np -p -k -E -e robots=off http://mysite
What this does do
--mirror : Recursively download the entire site
-p : Download all necessary page requisites
-k : Convert the URL's to relative paths so I can host them anywhere
What this doesn't do
Prevent duplicate downloads
Maintain (exactly) the same URL structure
The problem
Some things are being downloaded more than once, which results in myfile.html and myfile.1.html. This wouldn't be bad, except that when wget rewrites the hyperlinks, it is writing it with the myfile.1.html version, which is changing the URLs and therefore has SEO considerations (Google will index ugly looking URL's).
The -nc option would prevent this, but as of wget-v1.13, I cannot use -k and -nc at the same time. Details for this are here.
Help?!
I was hoping to use wget, but I am now considering looking into using another tool, like httrack, but I don't have any experience with that yet.
Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!
httrack got me most of the way, the only URL mangling it did was make the links to point to /folder/index.html instead of /folder/.
Using either httrack or wget didn't seem to result in perfect URL structure, so we ended up writing a little bash script that runs the crawler, followed by sed to clean up some of the URLS (crop the index.html from links, replace bla.1.html with bla.html, etc.)
wget description and help
According to this (and a quick experiment of my own) you should have no problems using -nc and -k options together to gather the pages you are after.
What will cause an issue is using -N with -nc (Does not work at all, incompatible) so you won't be able to compare files by timestamp and still no-clobber them, and with the --mirror option you are including -N inherently.
Rather than use --mirror try instead replacing it with "-r -l inf" which will enable recursive downloading to an infinite level but still allow your other options to work.
An example, based on your original:
wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite
Notes: I would suggest using -w 5 --random-wait --limit-rate=200k in order to avoid DOSing the server and be a little less rude, but obviously up to you.
Generally speaking I try to avoid using option groupings like --mirror because of conflicts like this being harder to trace.
I know this is an answer to a very old question but I think it should be addressed - wget is a new command for me but so far proving to be invaluable and I would hope others would feel the same.

wget downloads only one index.html file instead of other some 500 html files

with Wget I normally receive only one -- index.html file. I enter the following string:
wget -e robots=off -r http://www.korpora.org/kant/aa03
which gives back an index.html file, alas, only.
The directory aa03 implies Kant's book, volume 3, there must be some 560 files (pages) or so in it. These pages are readable online, but will not be downloaded. Any remedy?! THX
Following that link brings us to:
http://korpora.zim.uni-duisburg-essen.de/kant/aa03/
wget won't follow links that point to domains not specified by the user. Since korpora.zim.uni-duisburg-essen.de is not equal to korpora.org, wget will not follow the links on the index page.
To remedy this, use --span-hosts or -H. -rH is a VERY dangerous combination - combined, you can accidentally crawl the entire Internet - so you'll want to keep its scope very tightly focused. This command will do what you intended to do:
wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html
(-np, or --no-parent, will limit the crawl to aa03/. -D will limit it to only those two domains. -l inf will crawl infinitely deep, constrained by -D and -np).