Mirroring a website and maintaining URL structure - wget

The goal
I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.
The command
This is almost perfect for my needs (...but not quite):
wget --mirror -nH -np -p -k -E -e robots=off http://mysite
What this does do
--mirror : Recursively download the entire site
-p : Download all necessary page requisites
-k : Convert the URLs to relative paths so I can host them anywhere
What this doesn't do
Prevent duplicate downloads
Maintain (exactly) the same URL structure
The problem
Some things are being downloaded more than once, which results in myfile.html and myfile.1.html. This wouldn't be bad, except that when wget rewrites the hyperlinks, it writes them against the myfile.1.html version, which changes the URLs and therefore has SEO implications (Google will index ugly-looking URLs).
The -nc option would prevent this, but as of wget v1.13, -k and -nc cannot be used at the same time. Details are here.
Help?!
I was hoping to use wget, but I am now considering another tool, such as httrack, though I don't have any experience with it yet.
Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!

httrack got me most of the way; the only URL mangling it did was make links point to /folder/index.html instead of /folder/.
Neither httrack nor wget produced a perfect URL structure, so we ended up writing a little bash script that runs the crawler, followed by sed to clean up some of the URLs (crop the index.html from links, replace bla.1.html with bla.html, etc.), as sketched below.
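For reference, the cleanup step looked roughly like this (a sketch from memory, not the exact script; the httrack invocation, the ./mirror output directory, and the sed patterns are assumptions you would tune per site):
httrack "http://mysite" -O ./mirror   # or the wget command above
find ./mirror -name '*.html' | while read -r f; do
  # Crop trailing index.html from links: href="/folder/index.html" -> href="/folder/"
  sed -i 's|\(href="[^"]*/\)index\.html"|\1"|g' "$f"
  # Point links back at the original names: bla.1.html -> bla.html
  sed -i 's|\.1\.html"|.html"|g' "$f"
done
Renaming the duplicate bla.1.html files on disk is a separate mv step.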

wget description and help
According to this (and a quick experiment of my own) you should have no problems using -nc and -k options together to gather the pages you are after.
What will cause an issue is using -N with -nc (the two are incompatible and do not work together at all), so you won't be able to compare files by timestamp and still no-clobber them; and the --mirror option inherently includes -N.
Rather than use --mirror, try replacing it with "-r -l inf", which enables recursive downloading to an infinite depth but still allows your other options to work.
An example, based on your original:
wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite
Notes: I would suggest adding -w 5 --random-wait --limit-rate=200k in order to avoid DoSing the server and be a little less rude, but that is obviously up to you.
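For example, the same command with those politeness switches added (the values are just suggestions):
wget -r -l inf -k -nc -nH -p -E -e robots=off -w 5 --random-wait --limit-rate=200k http://yoursite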
Generally speaking, I try to avoid option groupings like --mirror, because conflicts like this are harder to trace.
I know this is an answer to a very old question, but I think it should be addressed. wget is a new command for me, but so far it is proving to be invaluable, and I would hope others feel the same.

How to download binary files from a public github repository through the command line?

I'm trying to work through this docker-zfs plugin: https://github.com/TrilliumIT/docker-zfs-plugin. I'm stuck at this line: "Download the latest binary from github releases and place in /usr/local/bin/".
How does one do such a thing? I've gone through the whole page, and I don't see any mention of binary files or a link for a release. I've looked at other pages about downloading from GitHub repositories, but they relied on authentication, which I don't have, so they didn't seem applicable. I looked at this and tried to make it work, https://geraldonit.com/2019/01/15/how-to-download-the-latest-github-repo-release-via-command-line/ , but something about the link formatting didn't seem to work. This must be really obvious, but I don't see what I am missing.
This is what I tried:
LOCATION=$(curl -s https://github.com/TrilliumIT/docker-zfs-plugin/releases/latest \
  | grep "tag_name" \
  | awk '{print "https://github.com/TrilliumIT/docker-zfs-plugin/releases/latest" substr($2, 2, length($2)-3) ".zip"}')
curl -L -o . /usr/local/bin/
(But I'm not sure this is what I need, and the link doesn't exist either. There must be a better way of doing this?)
OK, so I actually figured this out; it was simpler than what I was doing:
wget https://github.com/TrilliumIT/docker-zfs-plugin/releases/download/v1.0.5/docker-zfs-plugin
sudo mv docker-zfs-plugin /usr/local/bin/
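If you'd rather not hard-code the version, something along these lines should also work (a sketch using the GitHub releases API; it assumes the release asset keeps the name docker-zfs-plugin):
REPO=TrilliumIT/docker-zfs-plugin
# Ask the API for the latest release and pull the tag name out of the JSON
TAG=$(curl -s "https://api.github.com/repos/$REPO/releases/latest" \
  | grep '"tag_name"' \
  | sed -E 's/.*"tag_name": *"([^"]+)".*/\1/')
# Download that release's binary and install it
wget "https://github.com/$REPO/releases/download/$TAG/docker-zfs-plugin"
sudo mv docker-zfs-plugin /usr/local/bin/
sudo chmod +x /usr/local/bin/docker-zfs-plugin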

Using wget (for Windows) to download all MIDI files

I've been trying to use wget to download all MIDI files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites that all suggest the same thing, but it doesn't download anything. I've tried several variations on the theme, such as different values for '-l', etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.
I don't know much about all the parameters you are using, like H, -t1, -N, etc., though they can be looked up online. But I also had to download files from a URL matching a wildcard, and this is the command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
After -P you specify the path where you want to save the files, and after -A you provide the wildcard token; in your case that would be "*.mid".
-A means Accept, so here we list the files to accept from the provided URL. Similarly, -R provides a reject list.
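Adapted to your case, something like this should do it (a sketch, untested against that site; ./midi is just a hypothetical output directory):
wget -r -l1 -nH -nd -np -A "*.mid" -e robots=off -P ./midi http://cyberhymnal.org/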
You may have better luck (at least, you'll get more MIDI files) if you try the actual Cyber Hymnal™, which moved over 10 years ago. The current URL is http://www.hymntime.com/tch/.

wget downloads only one index.html file instead of some 500 other HTML files

With wget I receive only one file -- index.html. I enter the following command:
wget -e robots=off -r http://www.korpora.org/kant/aa03
which, alas, gives back only an index.html file.
The directory aa03 corresponds to volume 3 of Kant's collected works; there must be some 560 files (pages) in it. These pages are readable online, but they will not be downloaded. Any remedy? THX
Following that link brings us to:
http://korpora.zim.uni-duisburg-essen.de/kant/aa03/
wget won't follow links that point to domains not specified by the user. Since korpora.zim.uni-duisburg-essen.de is not equal to korpora.org, wget will not follow the links on the index page.
To remedy this, use --span-hosts or -H. -rH is a VERY dangerous combination - combined, you can accidentally crawl the entire Internet - so you'll want to keep its scope very tightly focused. This command will do what you intended to do:
wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html
(-np, or --no-parent, will limit the crawl to aa03/. -D will limit it to only those two domains. -l inf will crawl infinitely deep, constrained by -D and -np).

recursive wget with hotlinked requisites

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let's look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let's pretend that this is a large site that I would like to completely mirror, including all page requisites, even those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.
You can't specify to span hosts for page-reqs only; -H is all or nothing. Since -r and -H will pull down the entire Internet, you'll want to split the crawls that use them. To grab hotlinked page-reqs, you'll have to run wget twice: once to recurse through the site's structure, and once to grab hotlinked reqs. I've had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) build a list of all HTML files in the site structure (find . | grep html) and pipe to file
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site's structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
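As a rough sketch of the whole procedure (example.com, and the fact that wget saves everything under a www.example.com directory by default, are assumptions here; pages saved without an .html suffix would need -E or a looser find pattern):
# Pass 1: recurse through the site itself (no -H, no -p)
wget -e robots=off -r -l inf -np http://www.example.com/
# Pass 2: turn the saved HTML pages back into URLs...
find www.example.com -name '*.html' | sed 's|^|http://|' > html-pages.txt
# ...and fetch their page requisites, spanning hosts but not recursing
wget -e robots=off -p -H -i html-pages.txt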
I've managed to do this by using regular expressions. Something like this mirrors http://www.example.com/docs:
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
http://www.example.com/docs
You'll probably have to tune the regexes for each specific site. For example, some sites like to use parameters on CSS files (e.g. style.css?key=value), which this example would exclude.
The files you want to include from other hosts will probably include at least
Images: png jpg jpeg gif
Fonts: ttf otf woff woff2 eot
Others: js css svg
Anybody know any others?
So the actual regex you want will probably look more like this (as one string with no linebreaks):
^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$
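Plugged into the earlier command, it would look something like this (a sketch; swap in your own scheme and host):
wget --mirror --convert-links --adjust-extension \
  --page-requisites --span-hosts \
  --accept-regex '^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$' \
  http://www.example.org/docs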

How do I get a list of commit comments from CVS since last tagged version?

I have made a bunch of changes to a number of files in a project. Every commit (usually at the file level) was accompanied by a comment describing what was changed.
Is there a way to get a list from CVS of these comments on changes since the last tagged version?
Bonus if I can do this via the eclipse CVS plugin.
UPDATE: I'd love to accept an answer here, but unfortunately none of the answers are what I am looking for. Frankly, I don't think it is actually possible, which is a pity, as this could be a great way to create a change list between versions (assuming all commits are made at a sensible granularity and contain meaningful comments).
I think
cvs -q log -SN -rtag1:::tag2
or
cvs -q log -SN -d "fromdate<todate"
will do what you want. This lists all the versions and comments for all changes made between the two tags or dates, only for files that have changed. In the tag case, the three colons exclude the comments for the first tag. See cvs -H log for more information.
The options for the cvs log command are available here. Specifically, to get all the commits since a specific tag (let's call it VERSION_1_0):
cvs log -rVERSION_1_0:
If your goal is to have a command that works without having to know the name of the last tag, I believe you will need to write a script that grabs the log for the current branch, parses through it to find the tag, then issues the log command against that tag (a rough sketch follows), but I migrated everything off of CVS quite a while ago, so my memory might be a bit rusty.
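Something like this might serve as a starting point (a rough sketch only: it assumes the most recently added tag is listed first under "symbolic names:", which is how RCS usually stores them, and that the file you point it at actually carries the release tags):
# Guess the latest tag from one representative file, then log from that tag to HEAD
FILE=Makefile   # any file that gets tagged at release time
LAST_TAG=$(cvs log -h "$FILE" \
  | sed -n '/^symbolic names:/,/^keyword substitution:/p' \
  | sed -n '2s/^[[:space:]]*\([^:]*\):.*/\1/p')
echo "Using tag: $LAST_TAG" >&2
cvs -q log -SN -r"${LAST_TAG}::HEAD"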
If you want a quick result on a single file, the cvs log command is good. If you want something more comprehensive, the best tool I've found is a Perl script called cvs2cl.pl, which can generate a change list in several different formats. It has many options, but I've used the tag-to-tag options like this:
cvs2cl.pl --delta dev_release_1_2_3:dev_release_1_6_8
or
cvs2cl.pl --delta dev_release_1_2_3:HEAD
I have also done comparisons using dates with the same tool.
I know you have already "solved" your problem, but I had the same problem and here is how I quickly got all of the comments out of cvs from a given revision until the latest:
$ mkdir ~/repo
$ cd ~/repo
$ mkdir cvs
$ cd cvs
$ scp -pr geek@avoid.cvs.org:/cvs/CVSROOT .
$ mkdir -p my/favorite
$ cd my/favorite
$ scp -pr geek@avoid.cvs.org:/cvs/my/favorite/project .
$ cd ~/repo
$ mkdir -p ~/repo/svn/my/favorite/project
$ cvs2svn -s ~/repo/svn/my/favorite/project/src ~/repo/cvs/my/favorite/project/src
$ mkdir ~/work
$ cd ~/work
$ svn checkout file:///home/geek/repo/svn/my/favorite/project/src/trunk ./src
$ cd src
$ # get the comments made from revision 5 until today
$ svn log -r 5:HEAD
$ # get the comments made from 2010-07-03 until today
$ svn log -r {2010-07-03}:HEAD
The basic idea is to just use svn or git instead of cvs :-)
And that can be done by converting the CVS repo to SVN or Git using cvs2svn or cvs2git, which we should be doing anyway. It got me my answer within about three minutes because I had a small repository.
Hope that helps.
Something like this
cvs -q log -NS -rVERSION_3_0::HEAD
You'll probably want to pipe the output into egrep to filter out the stuff you don't want to see. I've used this:
cvs -q log -NS -rVERSION_3_0::HEAD | egrep -v "RCS file: |revision |date:|Working file:|head:|branch:|locks:|access list:|keyword substitution:|total revisions: |============|-------------"