How to skip selected URLs while mirroring a site with wget

I have the following problem. I need to mirror a password-protected site. It sounds like a simple task:
wget -m -k -K -E --cookies=on --keep-session-cookies --load-cookies=myCookies.txt http://mysite.com
In myCookies.txt I keep the proper session cookie. This works until wget comes across the logout page; then the session is invalidated and further mirroring is effectively useless.
I tried to add the --reject option, but it works only with file types: I can block only HTML file downloads or SWF file downloads, but I can't say
--reject http://mysite.com/*.php?type=Logout*
Any ideas how to skip certain URLs in wget? Maybe there is another tool that can do the job (it must work on MS Windows).

What if you first download (or even just touch) the logout page, and then
wget --no-clobber --your-original-arguments
This should skip the logout page, as it has already been downloaded
(Disclaimer: I didn't try this myself)

I also ran into this problem and later solved it with "--reject-regex logout". More: wget-devTips
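A minimal sketch of that approach, assuming wget 1.14 or newer (the release that added --reject-regex) and the logout URL shape from the question. The would_skip helper and the REJECT_PATTERN variable are hypothetical names, used only to dry-run the pattern locally, and grep -E only approximates wget's own matching:

```shell
#!/bin/sh
# Pattern handed to --reject-regex; wget matches it against the complete URL.
REJECT_PATTERN='type=Logout'

# Local dry run: succeeds if the given URL would be skipped by the pattern.
# grep -E is only an approximation of wget's default POSIX regex matching.
would_skip() {
    printf '%s\n' "$1" | grep -Eq "$REJECT_PATTERN"
}

# Mirror the site while skipping every URL that matches the pattern.
# mysite.com and myCookies.txt are taken from the question above.
mirror_site() {
    wget -m -k -K -E --keep-session-cookies --load-cookies=myCookies.txt \
         --reject-regex "$REJECT_PATTERN" http://mysite.com
}
```

Calling mirror_site starts the actual download. Checking a few known URLs with would_skip first is a cheap way to confirm the pattern catches the logout link and nothing else.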

Related

wget - selective recursive download + page requisites?

I'm trying to scrape a forum site, to build a read-only archive.
I understand how to use -A and -R to limit the pages I retrieve, but is there a way to also retrieve page requisites (e.g., icons and such)?
Thanks!

wget recursive fails on wiki pages

I'm trying to recursively fetch all pages linked from a Moin wiki page. I've tried many different wget recursive options, which all have the same result: only the html file from the given URL gets downloaded, not any of the pages linked from that html page.
If I use the --convert-links option, wget correctly translates the unfetched links to the right web links. It just doesn't recursively download those linked pages.
wget --verbose -r https://wiki.gnome.org/Outreachy
--2017-03-02 10:34:03-- https://wiki.gnome.org/Outreachy
Resolving wiki.gnome.org (wiki.gnome.org)... 209.132.180.180, 209.132.180.168
Connecting to wiki.gnome.org (wiki.gnome.org)|209.132.180.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wiki.gnome.org/Outreachy’
wiki.gnome.org/Outreachy [ <=> ] 52.80K 170KB/s in 0.3s
2017-03-02 10:34:05 (170 KB/s) - ‘wiki.gnome.org/Outreachy’ saved [54064]
FINISHED --2017-03-02 10:34:05--
Total wall clock time: 1.4s
Downloaded: 1 files, 53K in 0.3s (170 KB/s)
I'm not sure if it's failing because the wiki's html links don't end with .html. I've tried using various combinations of --accept='[a-zA-Z0-9]+', --page-requisites, and --accept-regex='[a-zA-Z0-9]+' to work around that, no luck.
I'm not sure if it's failing because the wiki has HTML pages like https://wiki.gnome.org/Outreachy that link to page URLs like https://wiki.gnome.org/Outreachy/Admin and https://wiki.gnome.org/Outreachy/Admin/GettingStarted. Maybe wget is confused because there would need to be an HTML page and a directory with the same name? I also tried using -nd, but no luck.
The linked HTML pages are all relative to the base wiki URL (e.g. the Outreachy history page). I've also tried adding --base="https://wiki.gnome.org/", with no luck.
At this point, I've tried a whole lot of different wget options and read several Stack Overflow and unix.stackexchange.com questions, and nothing I've tried has worked. I'm hoping there's a wget expert who can look at this particular wiki page and figure out why wget is failing to recursively fetch linked pages. The same options work fine on other domains.
I've also tried httrack, with the same result. I'm running Linux, so please don't suggest Windows or proprietary tools.
This seems to be caused by the following tag in the wiki:
<meta name="robots" content="index,nofollow">
If you are sure you want to ignore the tag, you can make wget ignore it using -e robots=off:
wget -e robots=off --verbose -r https://wiki.gnome.org/Outreachy

Are you able to create clean URLs with Wget?

I'm attempting to create a mirror of a WordPress site with clean URLs (i.e. http://example.org/foo not http://example.org/foo.php). When Wget mirrors the site, it gives all pages and links a ".html" extension (i.e. http://example.org/foo.html).
Is it possible to set options for Wget to create a clean URL structure, so that the mirrored file corresponding to the page "http://example.org/foo" would be "/foo/index.html" and the link to that page would be "http://example.org/foo"? If so, how?
If I understand your question correctly, you're asking for what is already the default behaviour of Wget.
Wget only adds the extension to the local copy if the --adjust-extension option has been passed to it. Quoting the man page for Wget:
--adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://example.com/article.cgi?25 will be saved as article.cgi?25.html.
However, what you seem to be asking for, namely that Wget saves example.org/foo as /foo/index.html, is actually the default behaviour. If you're seeing some other output, you should post the complete output of Wget with the --debug switch.

Getting Newsticker.el to use rss feeds with login credentials

I'm trying to use Newsticker.el with some internal RSS feeds that require login credentials. Since all Newsticker does is get wget to fetch the feeds, I thought it would be possible to simply define the user name and password in the wget configuration part of newsticker.el.
So I configured the following in my init.el
'(newsticker-url-list (quote (("RSS FEED" "https://to.feed.com/timeline?format=rss"
  nil nil ("--user=<username>" "--password=<password>" "-q" "-O" "-")))))
Passing the --user and --password options directly to wget works fine, but not within the newsticker.el setup. Has anyone tried something similar before?
Newsticker can use wget, indeed, but it can also use Emacs's URL library instead. So you might want to check newsticker-retrieval-method to see which one is used.

wget can't download - 404 error

I tried to download an image using wget but got an error like the following.
--2011-10-01 16:45:42-- http://www.icerts.com/images/logo.jpg
Resolving www.icerts.com... 97.74.86.3
Connecting to www.icerts.com|97.74.86.3|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2011-10-01 16:45:43 ERROR 404: Not Found.
My browser has no problem loading the image.
What's the problem?
curl can't download either.
Thanks.
Sam
You need to add the Referer field to the headers of the HTTP request. With wget, you just need the --header argument:
wget http://www.icerts.com/images/logo.jpg --header "Referer: www.icerts.com"
And the result:
--2011-10-02 02:00:18-- http://www.icerts.com/images/logo.jpg
Resolving www.icerts.com (www.icerts.com)... 97.74.86.3
Connecting to www.icerts.com (www.icerts.com)|97.74.86.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6102 (6.0K) [image/jpeg]
Saving to: ‘logo.jpg’
I had the same problem with a Google Docs URL. Enclosing the URL in quotes did the trick for me:
wget "https://docs.google.com/spreadsheets/export?format=tsv&id=1sSi9f6m-zKteoXA4r4Yq-zfdmL4rjlZRt38mejpdhC23" -O sheet.tsv
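The likely reason quoting helps is that an unquoted & is a shell control operator: it ends the command and runs it in the background, so wget never sees the rest of the query string. A small sketch of the effect, using printf in place of wget and a hypothetical example.com URL so that nothing is downloaded:

```shell
#!/bin/sh
# Unquoted: the shell cuts the command at '&', runs printf in the background
# with the truncated URL, and treats 'id=123' as a variable assignment.
unquoted=$(sh -c 'printf "%s\n" https://example.com/export?format=tsv&id=123; wait')

# Quoted: the whole URL reaches the command as a single argument.
quoted=$(sh -c 'printf "%s\n" "https://example.com/export?format=tsv&id=123"')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

Single quotes are the safer default for URLs, because they also keep the shell from expanding $ and other special characters.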
You will also get a 404 error if you are using IPv6 and the server only accepts IPv4.
To force IPv4, add -4 to the request:
wget -4 http://www.php.net/get/php-5.4.13.tar.gz/from/this/mirror
I had the same problem.
I solved it by using single quotes, like this:
$ wget 'http://www.icerts.com/images/logo.jpg'
wget version in use:
$ wget --version
GNU Wget 1.11.4 Red Hat modified
A wget 404 error also always happens if you try to download the pages of a WordPress website by typing
wget -r http://somewebsite.com
If the website is built with WordPress, you'll get an error like this:
ERROR 404: Not Found.
There's no way to mirror a WordPress website, because the website content is stored in the database and wget is not able to grab .php files. That's why you get the wget 404 error.
I know it's not this question's case, because Sam only wants to download a single picture, but it can be helpful for others.
I don't know exactly what the reason is, but I have faced this kind of problem.
If you have the domain's IP address (e.g. 208.113.139.4), use the IP address instead of the domain (in this case www.icerts.com):
wget 192.243.111.11/images/logo.jpg
You can find the IP from the URL at https://ipinfo.info/html/ip_checker.php
I want to add something to @blotus's answer.
If adding the Referer header does not solve the issue, you may be using the wrong referrer (sometimes the referrer is different from the URL's domain name).
Paste the URL into a web browser and find the referrer in the developer tools (Network -> Request Headers).
I ran into exactly the same problem while setting up GitHub Actions with Cygwin. Only after I ran wget --debug <url> did I realize that the URL had a 0xd byte appended to it, which is \r (carriage return).
For this kind of problem there is a solution described in the docs:
you can also use igncr in the SHELLOPTS environment variable
So I added the following lines to my YAML script to make wget work properly, as well as other shell commands in my GHA workflow:
env:
  SHELLOPTS: igncr
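Outside Cygwin, where igncr is not available, a similar effect can be had by stripping the carriage return from the value before it reaches wget. This is a sketch of that alternative, not the igncr mechanism itself; the example.com URL and the variable names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical URL as it might arrive from a file with CRLF line endings.
url=$(printf 'http://example.com/file.txt\r')

# Remove every carriage return before handing the value to wget.
clean_url=$(printf '%s' "$url" | tr -d '\r')

# Compare lengths: the raw value carries one extra, invisible character.
echo "raw length:   ${#url}"
echo "clean length: ${#clean_url}"

# wget "$clean_url"   # left commented out so the sketch does no network I/O
```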