Using wget to fake a browser? - wget

I'd like to crawl a web site to build its sitemap.
Problem is, the site uses an htaccess file to block spiders, so the following command only downloads the homepage (index.html) and stops, even though the page contains links to other pages:
wget -mkEpnp -e robots=off -U Mozilla http://www.acme.com
Since I have no problem accessing the rest of the site with a browser, I assume the "-e robots=off -U Mozilla" options aren't enough to have wget pretend it's a browser.
Are there other options I should know about? Does wget handle cookies by itself?
Thank you.
--
Edit:
I added these to wget.ini, to no avail:
hsts=0
robots = off
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
referer = /
--
Edit: Found it.
The pages linked from the homepage live on a different host, so wget was ignoring them. Just add "--span-hosts" to tell wget to follow links to other hosts, and "-D www.remote.site.com" if you want to restrict the spidering to that domain.
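For reference, a working invocation with those two options added would look something like this (hostnames as in the example above; note that -D takes a comma-separated list, so naming both hosts keeps the crawl on them and nowhere else):
wget -mkEpnp -e robots=off -U Mozilla --span-hosts -D www.acme.com,www.remote.site.com http://www.acme.com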

You might want to set the User-Agent to something more complete than just "Mozilla", for example:
wget --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"
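Combined with the mirroring flags from the question, the full invocation would be something like:
wget -mkEpnp -e robots=off --user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0" http://www.acme.com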

Related

download/mirror a website on cloudflare for archiving

Trying to back up (download/mirror) a website for archival purposes. This site is apparently behind Cloudflare. My usual tool for this would be wget, but it fails on me (even when sending a __cfduid cookie header). Example of a non-working wget command:
wget -U "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --header="Accept: text/html" --header="Cookie: __cfduid=someverylongcfduid" --mirror --convert-links --adjust-extension --page-requisites --no-parent -w 1m www.domain.tld
So I thought I'd return to my trusty friend httrack, but it too fails (even with exported cookies). Example of a non-working httrack command:
httrack -F "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0" --mirror -b1 -s0 -%c1 -c1 --referer "https://www.domain.tld/" "https://www.domain.tld/"
I do not want to pound the website, so limiting connections and waiting is quite OK. I'd rather have it run longer/slower and be a good netizen along the way.
Currently I'm confronted with either 301 (Moved Permanently) or 403 (Forbidden) errors, and I'm assuming this is due to Cloudflare. The site is heavy on JavaScript :-(
Does anyone have any tips/advice/solution to get such a website archived?
I think you should try using Selenium: it drives a real browser, so the site's JavaScript runs as it would for a normal visitor.

Trying to get user credentials to pass through a ProxyPass

We have a site using Windows authentication sitting behind a firewall that we are accessing through ProxyPass. We then need to access an API application on the same server, but we receive a 401 Unauthorized error from the RewriteProxy rules when we try to access it. How can we pass the credentials along for authentication?
To perform the initial redirect from the secure server to the internal application server:
In the httpd.conf file:
ProxyPass /blastdev/ http://10.0.212.198/blastdev/
This seems to be working correctly and loads the page content until we reach the API calls.
In the .htaccess file:
RewriteCond %{REQUEST_URI} ^/blastdev/blast(.*)
RewriteHeader X-Remote-User: .* %{REMOTE_USER}
RewriteHeader X-Logon-User: .* %{LOGON_USER}
RewriteHeader AUTH_TYPE: .* %{AUTH_TYPE}
RewriteProxy ^/blast/(.*)$ http://10.0.212.198/blast/$1 [NC, A, CR]
These directives are simply there to try to surface any user information, but all of the fields are coming through blank.
Here are the headers we are currently sending:
Headers:
'Cache-Control'='no-cache'
'Pragma'='no-cache'
'Expires'='Sat, 01 Jan 2000 00:00:00 GMT'
'Accept'='application/json, text/plain, */*'
'Accept-Encoding'='gzip, deflate'
'Accept-Language'='en-US,en;q=0.9'
'Referer'='http://dev.*******.com/blastdev/'
'User-Agent'='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
'X-REQUEST-URI'='/blastdev/blast/api/usermanager/'
'X-Rewrite-Url'='/blastdev/blast/api/usermanager/'
'X-Original-Url'='/blastdev/blast/api/usermanager/'
'X-logio_http_input_size'='0'
'X-logio_request_headers_size'='746'
'X-Remote-User'=''
'X-Logon-User'=''
'AUTH_TYPE'=''
'Max-Forwards'='10'
'X-Forwarded-Host'='dev.*******.com'
'X-Forwarded-For'='10.1.13.42'
'X-Forwarded-Server'='10.0.90.54'
We need to be able to access the current user if they are AD authenticated and see that they are anonymous if not.
Any additional assistance in tests we can run for further troubleshooting would also be appreciated.
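One thing worth testing, assuming the front end is stock Apache (RewriteHeader and RewriteProxy are Helicon Ape directives, so treat this as a sketch rather than a drop-in fix): copy REMOTE_USER into an environment variable with mod_rewrite, then attach it as a request header with mod_headers before proxying:
# httpd.conf sketch (requires mod_rewrite, mod_headers, mod_proxy)
RewriteEngine On
RewriteRule .* - [E=RU:%{REMOTE_USER}]
RequestHeader set X-Remote-User "%{RU}e"
ProxyPass /blastdev/ http://10.0.212.198/blastdev/
Note that REMOTE_USER is only populated once the proxy host itself has authenticated the request; if the Windows authentication happens on the back-end server instead, the variable is empty at the proxy, which would explain the blank headers above.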

Downloading files from box.com using content api, GZIP

I have a 6 MB txt file on the Box.com site.
Now I would like to download the file using the API. As it takes time to download, I would like to download it as a gzipped file.
As described at https://developers.box.com/docs/, we have to add an Accept-Encoding header with the value "gzip, deflate". I have added this header, but the file is not downloaded gzipped: it is still 6 MB, whereas the compressed file should be under 1 MB.
The following are the headers passed in the REST request.
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
Authorization: Bearer ACCESSTOKEN
Accept: */*
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,te;q=0.6
The following are the response headers.
Server: nginx
Date: Thu, 24 Jul 2014 16:24:56 GMT
Content-Type: application/octet-stream
Content-Length: 6685772
Connection: keep-alive
Cache-control: private
Accept-Ranges: bytes
Content-Disposition: attachment;filename="abc.log";filename*=UTF-8''
X-Content-Type-Options: nosniff
Is there anything that I missed here?
I have run into the same problem. However, I have also realized why this may be happening. Quoting the Box SDK documentation:
If the file is available to be downloaded, the response will be a 302 Found to a URL at dl.boxcloud.com. The dl.boxcloud.com URL is not persistent. Clients will need to follow the redirect in order to actually download the file. The raw data of the file is returned unless the file ID is invalid or the user does not have access to it.
The important point to note is that there is a client redirect to a non-persistent URL in order to download the file. When we check the sequence of headers passed as part of the redirect request, we can see that the Accept-Encoding: gzip, deflate header is missing. I used Fiddler to observe this; any other HTTP proxy or interceptor will do.
This should be the reason why the files are not getting downloaded using the gzip encoding.
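A quick way to verify is to replay the download with curl, which re-sends the custom headers on the redirect hop (--location-trusted also allows the Authorization header to follow the redirect to dl.boxcloud.com; FILE_ID is a placeholder, and whether the server actually compresses the stream is still up to it; -v shows the Content-Encoding it returns):
curl -v --location-trusted \
  -H "Authorization: Bearer ACCESSTOKEN" \
  -H "Accept-Encoding: gzip,deflate" \
  -o abc.log \
  "https://api.box.com/2.0/files/FILE_ID/content"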
Hope this helps.

The domain is inserted right but FB login still gives a "the given url is not allowed by the application configuration"

I'm making a Facebook login for a client, but I keep getting this error:
the given url is not allowed by the application configuration
This is not new to me; I know what the error means and how to correct it. But this time I'm really puzzled, as I have tried every possible solution and still get no result.
I'm using the newest Facebook PHP SDK (v3.2.2) and integrated it into a CodeIgniter (v2.0) project I received from my client. I have tried every possible website URL: with and without http, with and without www, and with and without trailing slashes.
Please write if you need more information.
DOCUMENTATION:
The Facebook settings right now:
App name: myDomain.com
App Namespace : myDomain.com
Sandbox Mode : off
Allowed domains : myDomain.com (Also tried www.myDomain.com just to test)
Website with Facebook Login : http://www.myDomain.com/
THE HTTP REQUEST (I highlighted the most important parts):
GET /dialog/oauth?client_id=416398375038503&redirect_uri=http%3A%2F%2Fwww.myDomain.com%2F&state=c7fcaa638bb00e28177b2551ab285199&scope=email HTTP/1.1
Host: www.facebook.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36
Referer: **http://www.myDomain.com/**
Accept-Encoding: gzip,deflate,sdch
Accept-Language: da-DK,da;q=0.8,en-US;q=0.6,en;q=0.4
Cookie: locale=da_DK; datr=c352UWXk3ZNnDRb9vF_flCKf; lu=ThIiiPnfHJO6URgttq6n991g; p=30; presence=EM371808349EuserFA21015531322A2EstateFDsb2F1371646911100Et2F_5b_5dElm2FnullEuct2F1371756338BEtrFnullEtwF2809378943EatF1371808247422EwmlFDfolderFA2inboxA2Ethread_5fidFA2user_3a1275290355A2CG371808349901CEchFDp_5f1015531322F9CC; sub=8192; act=1371810189642%2F3; c_user=1015531322; csm=2; fr=0Q2dUDn4VqDw2NwPO.AWUeSLjGpJH-4uKuONHiGbL-jYE.BRdn55.S8.AWVTnqQG; s=Aa7pqsZ0XxblFusE.BRwH3a; xs=1%3A7IUHTcAXzyAQdg%3A2%3A1371569626; wd=1363x712
Query String Parameters:
client_id:416398375038503
redirect_uri: **http://www.myDomain.com/**
state:c7fcaa638bb00e28177b2551ab285199
scope:email
Try removing the Allowed domains setting and testing again.

How to access url that requires http authenticate in c/c++/command line?

http://admin:123456@192.168.1.178/videostream.cgi
To access a URL that doesn't require HTTP authentication, it's quite easy:
telnet 192.168.1.178 80
GET /videostream.cgi HTTP/1.1
Host: 192.168.1.178
Accept: text/html,text/plain
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.13) Gecko/20100914 Firefox/3.5.13
Connection: close
But how to specify admin:123456?
See the HTTP Basic authentication RFC or the Wikipedia article on basic access authentication.
It can be educational to use Wireshark, or some other LAN sniffer, to watch what a browser and server do when you access a URL with embedded credentials such as your http://admin:123456@192.168.1.178/videostream.cgi
For basic authentication, you specify the username and password as username:password, then Base64-encode that string and use it as the argument to the Authorization header:
Authorization: Basic YXNkZjoxMjM0
YXNkZjoxMjM0 decodes to asdf:1234; I used curl -u asdf:1234 (specifying the username "asdf" and password "1234") to produce this result.
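Applied to the URL in the question, it would look something like this (the Base64 string below is just "admin:123456" encoded; echo -n keeps the trailing newline out of the encoding):
echo -n 'admin:123456' | base64
YWRtaW46MTIzNDU2
telnet 192.168.1.178 80
GET /videostream.cgi HTTP/1.1
Host: 192.168.1.178
Authorization: Basic YWRtaW46MTIzNDU2
Connection: close
Or let curl do the encoding for you:
curl -u admin:123456 http://192.168.1.178/videostream.cgi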