download/mirror a website on cloudflare for archiving - wget

I'm trying to back up (download / mirror) a website for archival purposes. This site is apparently behind Cloudflare. My usual tool for this would be wget, but it fails on me (even when sending a __cfduid cookie header). Example of a non-working wget command:
wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --header="Accept: text/html" --header="Cookie: __cfduid=someverylongcfduid" --mirror --convert-links --adjust-extension --page-requisites --no-parent -w 1m www.domain.tld
So I thought I'd return to my trusty friend httrack, but it too fails (even when using exported cookies). Example of a non-working httrack command:
httrack -F "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --mirror -b1 -s0 -%c1 -c1 --referer "https://www.domain.tld/" "https://www.domain.tld/"
I do not want to pound the website, so limiting connections and waiting is quite OK. I'd rather have it run longer/slower and be a good netizen along the way.
Currently I'm confronted with either 301 (Moved Permanently) or 403 (Forbidden) errors, and I'm assuming this is due to Cloudflare. The site is heavy on JavaScript :-(
Does anyone have any tips/advice/solution to get such a website archived?

I think you should try using Selenium.
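A minimal sketch of that approach, assuming Firefox with geckodriver and the Python selenium package (www.domain.tld is the placeholder from the question); it only renders and saves a single page, so the link-walking and the politeness delay are still up to you:
# Render one JavaScript-heavy page in a headless browser and save the resulting DOM.
import time
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # no visible browser window
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.domain.tld/")
    time.sleep(5)  # crude wait for scripts to finish; tune as needed
    with open("index.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)  # the DOM after JavaScript has run
finally:
    driver.quit()
Since a real browser executes the site's JavaScript (including any Cloudflare checks it can pass), the saved HTML is the rendered page; the remaining work is to walk the site's links in a loop with a delay between requests, as you already planned to do with wget.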

Related

In Minikube Buildroot OS wget: not an http or ftp url

I have set up Minikube on my machine using Hyper-V on Windows 10. All is working fine, but when I tried to set up the Flannel network I executed the following command:
wget http://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Output:
Connecting to raw.githubusercontent.com (151.101.192.133:80)
wget: not an http or ftp url: https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
I tried some solutions, like reinstalling wget and running hash -r, but they did not work.
Any idea or suggestion to solve this?
Thank you,
Note that the actual URL your wget is trying to connect to is not an http but an https URL, and the output you attached says:
wget: not an http or ftp url
which is true, as https is neither an http nor an ftp URL. It looks like your wget version supports only those two protocols.
You can easily check this by issuing the following commands:
wget -V | grep https
and
wget -V | grep ssl
I tried to reproduce it on a system possibly similar to the one you're using. For this purpose I created a buildroot Pod from the advancedclimatesystems/docker-buildroot image:
kubectl run --generator=run-pod/v1 buildroot --image=advancedclimatesystems/docker-buildroot --command sleep 3600
and attached to it with:
kubectl exec -ti buildroot /bin/sh
Once there, I tested your wget command and it was successful. Its output on my system looks like this (note the 301 redirect to the https URL):
root@buildroot:~/buildroot# wget http://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
--2019-12-31 16:04:27-- http://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml [following]
--2019-12-31 16:04:27-- https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14416 (14K) [text/plain]
Saving to: 'kube-flannel.yml'
kube-flannel.yml 100%[=====================================================================================================================>] 14.08K --.-KB/s in 0s
2019-12-31 16:04:27 (53.3 MB/s) - 'kube-flannel.yml' saved [14416/14416]
As you can see it has built-in ssl and https support:
root@buildroot:~/buildroot# wget -V | grep ssl
+ntlm +opie +psl +ssl/openssl
-Wl,-z,relro -Wl,-z,now -lpcre -luuid -lidn2 -lssl -lcrypto -lpsl
ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a
root@buildroot:~/buildroot# wget -V | grep https
-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls
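Not part of the original answer, but if your own wget does turn out to be a BusyBox build without TLS support, one simple workaround is to skip wget inside the VM entirely and let kubectl fetch the manifest over https from your workstation:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml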

Trying to get user credentials to pass through a ProxyPass

We have a site using Windows authentication, sitting behind a firewall, that we are accessing through ProxyPass. We then need to access an API application on the same server, but we receive a 401 Unauthorized error when we try to access it via the rewrite/proxy rules. How can we pass the user's credentials through for authentication?
To perform the initial redirect from the secure server to the internal application server:
In the https.conf file:
ProxyPass /blastdev/ http://10.0.212.198/blastdev/
This seems to be working correctly and loads the page content until we reach the API calls.
In the .htaccess file:
RewriteCond %{REQUEST_URI} ^/blastdev/blast(.*)
RewriteHeader X-Remote-User: .* %{REMOTE_USER}
RewriteHeader X-Logon-User: .* %{LOGON_USER}
RewriteHeader AUTH_TYPE: .* %{AUTH_TYPE}
RewriteProxy ^/blast/(.*)$ http://10.0.212.198/blast/$1 [NC, A, CR]
These rules are simply an attempt to expose any user information. All of the fields are coming through blank, though.
Here are the headers we are currently sending:
Headers:
'Cache-Control'='no-cache'
'Pragma'='no-cache'
'Expires'='Sat, 01 Jan 2000 00:00:00 GMT'
'Accept'='application/json, text/plain, */*'
'Accept-Encoding'='gzip, deflate'
'Accept-Language'='en-US,en;q=0.9'
'Referer'='http://dev.*******.com/blastdev/'
'User-Agent'='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
'X-REQUEST-URI'='/blastdev/blast/api/usermanager/'
'X-Rewrite-Url'='/blastdev/blast/api/usermanager/'
'X-Original-Url'='/blastdev/blast/api/usermanager/'
'X-logio_http_input_size'='0'
'X-logio_request_headers_size'='746'
'X-Remote-User'=''
'X-Logon-User'=''
'AUTH_TYPE'=''
'Max-Forwards'='10'
'X-Forwarded-Host'='dev.*******.com'
'X-Forwarded-For'='10.1.13.42'
'X-Forwarded-Server'='10.0.90.54'
We need to be able to access the current user if they are AD authenticated and see that they are anonymous if not.
Any additional assistance in tests we can run for further troubleshooting would also be appreciated.

Running sonarqube in Docker keeps redirecting me back to login page

I've had SonarQube running for a good while but haven't used it very much; in general, though, things seemed to be working. I'm running it inside Docker.
I just updated it to LTS (6.7) and after that it seems to have gone into some limbo state. I'm able to log in and browse the website but as soon as I try to perform some operation (seems to not matter what that operation is), I get redirected to the login page. If I log in again, everything repeats. So I'm unable to actually perform any action it seems.
At first I thought this had to have something to do with old data conflicting with the new setup. So I cleaned everything out and set it up from scratch. The problem remains, I'm unable to do anything and get redirected to the login page every time.
For example, after the clean setup, I log in with admin/admin and I get the "first time tutorial" where I'm offered to create a token. I tried to do that but get directed to the login page. I log in again and this time I try to skip the tutorial but then I get redirected to the login page. Below is a part of the access.log for when I try to skip the tutorial:
10.3.1.119 - - [16/Nov/2017:00:12:48 +0000] "POST /gor-sq/api/users/skip_onboarding_tutorial HTTP/1.0" 401 - "https://build.acme.com/gor-sq/projects" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" "AV/CJhNZndR3RsZuAAA4"
10.3.1.119 - - [16/Nov/2017:00:12:48 +0000] "GET /gor-sq/api/users/identity_providers HTTP/1.0" 200 24 "https://build.acme.com/gor-sq/sessions/new?return_to=%2Fgor-sq%2Fprojects" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" "AV/CJhNZndR3RsZuAAA5"
10.3.1.119 - - [16/Nov/2017:00:12:48 +0000] "GET /gor-sq/api/navigation/global HTTP/1.0" 200 573 "https://build.acme.com/gor-sq/sessions/new?return_to=%2Fgor-sq%2Fprojects" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" "AV/CJhNZndR3RsZuAAA6"
The first line indicates that the POST is getting a 401 response. Without being absolutely certain, it does look like it's the POST operations that are getting 401 responses while GET works.
This setup does sit behind a reverse proxy, but as I said before, it has been working fine and no changes have been made to the reverse proxy setup.
Hope I am not too late. I had the same issue. What worked for me was deleting the cookies from the browser; after that everything worked like a charm.
I had the same issue.
https://myserver.com/sonar/api/users/skip_onboarding_tutorial
On that request I got a 401 and was redirected to the login page. I looked at the source code and request.ts was erroring out at line 108:
submit(): Promise<Response> {
  const { url, options } = this.getSubmitData({ ...getCSRFToken() });
  return window.fetch((window as any).baseUrl + url, options);
}
It looked like an issue with the CSRF token. Since I have SonarQube running behind an Nginx reverse proxy, it might have had something to do with the way I was handling the cookies.
When I looked around a bit, I found the solution here:
https://stackoverflow.com/a/47909810/3221249
Basically, they changed the way secure cookies are handled after v6.0. Since I was setting the cookies to Secure and HttpOnly (not letting the client browser's JS interact with them), I was running into the above issue. I was doing this even before my non-SSL traffic hit Nginx: I have another proxy server running HAProxy which was handling this, so I commented out that part of the configuration.
#rspirep ^(Set-cookie:.*) \1;\ Secure if ! secure
#rspirep ^(Set-cookie:.*) \1;\ httponly
I hope this helps you.
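As an aside (not from the original answer): one quick way to see whether a proxy layer in front of SonarQube is adding Secure or HttpOnly flags to the cookies is to inspect the login page's response headers, using the https://myserver.com/sonar/ URL from above:
curl -sD - -o /dev/null https://myserver.com/sonar/sessions/new | grep -i '^set-cookie'
If the session/CSRF cookies already carry those flags before the browser ever sees them over plain HTTP, that matches the 401-on-POST behaviour described above.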

Using wget to fake browser?

I'd like to crawl a web site to build its sitemap.
Problem is, the site uses an htaccess file to block spiders, so the following command only downloads the homepage (index.html) and stops, even though it does contain links to other pages:
wget -mkEpnp -e robots=off -U Mozilla http://www.acme.com
Since I have no problem accessing the rest of the site with a browser, I assume the "-e robots=off -U Mozilla" options aren't enough to have wget pretend it's a browser.
Are there other options I should know about? Does wget handle cookies by itself?
Thank you.
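As an aside (not part of the original post): wget does keep cookies within a single run, but to reuse cookies exported from a browser, or to persist session cookies across runs, you have to ask for it explicitly; cookies.txt below is a hypothetical Netscape-format export:
wget --load-cookies cookies.txt --keep-session-cookies --save-cookies cookies.txt -mkEpnp -e robots=off -U Mozilla http://www.acme.com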
--
Edit:
I added those to wget.ini, to no avail:
hsts=0
robots = off
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
--
Edit: Found it.
The pages linked to from the homepage were on a remote server, so wget would ignore them. Just add "--span-hosts" to tell wget to go there, and "-D www.remote.site.com" if you want to restrict spidering to that domain.
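For reference, the combined command would look something like this (www.acme.com is the placeholder from the question and www.remote.site.com the remote host; list both domains in -D if you also want to keep crawling the original site):
wget -mkEpnp -e robots=off -U Mozilla --span-hosts -D www.acme.com,www.remote.site.com http://www.acme.com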
You might want to set the User-Agent to something more than just Mozilla, something like:
wget --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"

using wget against protected site with NTLM

I'm trying to mirror a local intranet site and have found previous questions about using 'wget'. It works great with sites that allow anonymous access, but I have not been able to use it against a site that expects a username\password (IIS with Integrated Windows Authentication).
Here is what I pass in:
wget -c --http-user='domain\user' --http-password=pwd http://local/site -dv
Here is the debug output (note I replaced some with dummy values obviously):
Setting --verbose (verbose) to 1
DEBUG output created by Wget 1.11.4 on Windows-MSVC.
--2009-07-14 09:39:04-- http://local/site
Host `local' has not issued a general basic challenge.
Resolving local... seconds 0.00, x.x.x.x
Caching local => x.x.x.x
Connecting to local|x.x.x.x|:80... seconds 0.00, connected.
Created socket 1896.
Releasing 0x003e32b0 (new refcount 1).
---request begin---
GET /site/ HTTP/1.0
User-Agent: Wget/1.11.4
Accept: */*
Host: local
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 401 Access Denied
Server: Microsoft-IIS/5.1
Date: Tue, 14 Jul 2009 13:39:04 GMT
WWW-Authenticate: Negotiate
WWW-Authenticate: NTLM
Content-Length: 4431
Content-Type: text/html
---response end---
401 Access Denied
Closed fd 1896
Unknown authentication scheme.
Authorization failed.
NTLM authentication is broken in wget 1.11, use 1.10 instead.
Curl is actually probably a better tool for fetching content from NTLM-authenticated web servers. You can get equivalent functionality to your proposed wget command line by using:
curl --anyauth --user username:password http://someserver/site
I've seen references to being able to use the NTLM Authorization Proxy Server to get around these types of problems.
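If you go that route, wget talks plain HTTP to the local proxy and the proxy performs the NTLM handshake on your behalf; the invocation would look roughly like this (127.0.0.1:5865 is only an assumed address/port for the proxy):
wget -c -e use_proxy=on -e http_proxy=http://127.0.0.1:5865/ http://local/site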
Use the --auth-no-challenge option (wget 1.11+); note that it is now considered unsafe.
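With that option the original command would become the following; note that --auth-no-challenge makes wget send the credentials preemptively as Basic auth, so it only helps if the server accepts Basic in addition to NTLM/Negotiate:
wget -c --auth-no-challenge --http-user='domain\user' --http-password=pwd http://local/site -dv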
I found a solution.
It is a work-around for Basic auth on IIS 7.
When authentication is successful, the browser sends the following HTTP header:
'Authorization: <type> <credentials>'
So we are able to authenticate in the browser and copy the header parameters from the browser (Firebug addon), or generate them ourselves:
$ echo -n 'username:password' | base64
dXNlcm5hbWU6cGFzc3dvcmQ=
$ echo 'dXNlcm5hbWU6cGFzc3dvcmQ=' | base64 -d
username:password
Example:
$ wget --header="Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=" http://example.com/
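An equivalent way to get the same Basic header without hand-encoding the credentials (assuming wget 1.11+, as mentioned above):
$ wget --auth-no-challenge --http-user=username --http-password=password http://example.com/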