Mirroring websites - 403 Forbidden with user agent strings - wget

I'm working on an application to mirror US university academic catalogs. To do this, I have a cluster of Celery workers that use wget or httrack to mirror the content, styles and scripts, then upload to our S3 bucket.
For a small number of university sites, I've been encountering a 403 Forbidden error using wget/httrack with a Windows Chrome user agent string. However, I'm able to load the same pages in the browser.
I originally thought the user agent and referer were the issue, so I set them to a Chrome 50 user agent string and google.com, respectively, but I'm still encountering the error. Strangely, if I use the Python requests library with these same URLs, I get HTTP 200 responses.
I've ensured cookies are used, so I'm at a loss. Is there any reason why requests would succeed where wget/httrack fail?
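For reference, here's roughly the requests call that succeeds; the catalog URL is a placeholder, and the headers match what I pass to wget/httrack:

import requests

# Placeholder URL; the real targets are various university catalog sites.
URL = "https://catalog.example.edu/"

headers = {
    # The same Chrome 50 user agent string passed to wget/httrack
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"),
    "Referer": "https://www.google.com/",
}

# A Session keeps cookies across requests, much like wget's cookie jar.
with requests.Session() as session:
    resp = session.get(URL, headers=headers)
    print(resp.status_code)      # 200 here, while wget gets 403 on the same URL
    print(resp.request.headers)  # inspect exactly which headers were sent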

Related

Is Chrome failing when sending a large HTTP header?

I have been troubleshooting an error on my web app and have concluded that Chrome is failing to handle my HTTP request when my headers get too large.
Why are my headers large?
I am using a JWT authorization scheme that includes permissions in the JWT token. With my admin account, that token grows because I have permissions for every tenant; the JWT is currently around 5,200 characters.
Why do I blame Chrome?
I have tested the identical request in several environments:
Swagger: fails with TypeError: Failed to fetch
Postman Chrome Extension: fails with Could not get any response
Postman Native App: Succeeds
Python script using requests: Succeeds
curl: Succeeds
For each test I used the same headers, URL, and body (none, because it is a GET).
Notes
While researching this, I came across this SO question, which suggests that Chrome is limited to 250 KB of headers. Mine are under 6 KB.
If I use a smaller Authorization header, then Swagger and the Postman Chrome Extension both succeed.
Bottom line:
Can we confirm my conclusion that Chrome is having trouble with the larger header?
What can I do about that?
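To make the repro concrete, here is a minimal sketch of the test in Python; the endpoint is a placeholder, and the padded dummy token just matches the real JWT's size:

import requests

# Placeholder endpoint; substitute the real API route.
URL = "https://api.example.com/v1/resource"

# Pad a dummy bearer token to roughly the size of the real JWT (~5,200 chars).
fake_jwt = "x" * 5200

resp = requests.get(URL, headers={"Authorization": "Bearer " + fake_jwt})
print(resp.status_code)  # succeeds, mirroring the curl and native-Postman results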

400 status on login request for asp.net core 2.0

I have the following issue.
After upgrading an application to ASP.NET Core 2.0, I get a 400 (Bad Request) response whenever I try to authenticate in production.
The error does not reproduce locally, not even when running the production container locally.
The only difference that exists between production and local is that there is a reverse proxy in production that implements SSL for all requests.
I've tried moving the authentication code from middleware (where it was initially implemented) into a controller, and I've changed the path of the route used for authentication. I still get the error.
All other requests work fine (provided a JWT token is attached to them).
I should also mention that the CORS headers aren't set on the 400 response.
Any ideas?
This issue was caused by an upstream reverse proxy that was stripping some headers from the requests. Requests with the POST and PUT verbs were affected.
Set the log level of your application to Information to see what Kestrel is actually complaining about.
In our case we had to switch hosting providers because of the issue.
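If you want to confirm the proxy is at fault, a quick sketch along these lines helps; the hostnames are placeholders, and it assumes you can reach the backend directly:

import requests

# Compare the same login POST sent through the reverse proxy vs. straight
# to Kestrel. Both hostnames below are placeholders.
PROXY_URL = "https://app.example.com/account/login"        # via the reverse proxy
DIRECT_URL = "http://backend.internal:5000/account/login"  # straight to Kestrel

payload = {"username": "test", "password": "test"}
for url in (PROXY_URL, DIRECT_URL):
    resp = requests.post(url, json=payload)
    # A 400 only on the proxied path points at headers being stripped upstream.
    print(url, resp.status_code)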

IBM Weather REST API - 401 and CORS issues on access

I am getting a 401 and some cross-domain issues when trying to access the IBM Weather REST API from either the client (browser) or the server.
If I generate a URL and access it directly from a browser (e.g. paste it into the address bar), it works fine and the JSON weather report is returned.
When I run the JavaScript HTTP request from either the browser or the server, it seems to be allowed only from an ibm.com domain.
Failed to load https://twcservice.au-syd.mybluemix.net/api/weather/v1/geocode/-33.00/151.00/forecast/daily/7day.json?units=m&language=en-US: The 'Access-Control-Allow-Origin' header contains multiple values 'https://*.ibm.com, https://*.ibmcloud.com', but only one is allowed. Origin 'http://localhost:3000' is therefore not allowed access.
I am using the free service on Bluemix. Is this restricted to run only via a Bluemix server, or are there options I can pass when I create the service on Bluemix?
Note that when I make the request I am using the credentials supplied via the Bluemix console. Again, this works from the browser URL bar, but not from code.
Update/more info: if I paste the URL above into the browser (with credentials) it works as described; then, if I hit it via the web app in the same session, it also works.
Hmmm. So the IBM server is sending the following response header:
Access-Control-Allow-Origin: https://*.ibm.com, https://*.ibmcloud.com
That's an invalid response from IBM. Unfortunately, I think your only option is to complain to IBM and convince them to:
Return a valid Access-Control-Allow-Origin response header (with only one value)
Allow people outside of IBM to access it
Without that, I fear you're out of luck.
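In the meantime, CORS is enforced only by browsers, so the server-side 401 is a separate problem worth isolating. Here is a sketch of a direct server-to-server call, assuming (unverified) that the service accepts HTTP Basic auth with the username/password from the Bluemix console:

import requests

# Direct server-side call; CORS does not apply outside the browser.
URL = ("https://twcservice.au-syd.mybluemix.net/api/weather/v1/geocode/"
       "-33.00/151.00/forecast/daily/7day.json")

resp = requests.get(
    URL,
    params={"units": "m", "language": "en-US"},
    auth=("SERVICE_USERNAME", "SERVICE_PASSWORD"),  # placeholder credentials
)
print(resp.status_code)
print(resp.json() if resp.ok else resp.text)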

Issue Testing after IdentityServer3 Deploy

After going through the walkthroughs I had a test MVC app, a test Web API, and IdentityServer3 all working perfectly on my machine. I deployed IdentityServer3 to our servers in AWS behind a load balancer, following all the instructions in the deployment wiki. After deployment, I can hit the .well-known configuration fine from a browser on my machine.
I changed the authority URL for the MVC and API test apps to point to the AWS deployment. Clients, scopes, users, etc. are all configured identically, since the apps hit the same database as when everything ran on my local machine.
I can get an access token using RequestResourceOwnerPasswordAsync just fine, so I think IdentityServer is installed correctly.
However, both the API and the MVC app fail when simply trying to use the implicit flow. For instance, when I try to hit an MVC controller action marked with [Authorize], I get an error stating "An invalid request URI was provided. The request URI must either be an absolute URI or BaseAddress must be set".
If I try to hit the Web API from the MVC app (both running locally on my machine) after a successful RequestResourceOwnerPasswordAsync call, I get the error "Response status code does not indicate success: 401 (Unauthorized)." after what seems like a timeout.
Any help would be greatly appreciated.
Figured out the problem. When specifying PublicOrigin, it has to be a full URL and not just the domain; I had left off the https:// prefix.
The Web API issue was related to connectivity to the identity server: there were some incorrect proxy settings for the app.
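A quick way to catch this class of misconfiguration is to fetch the discovery document through the load balancer and check that every endpoint is an absolute URL; the authority host below is a placeholder:

import requests

# The authority must include the scheme (https://), not a bare domain.
AUTHORITY = "https://ids.example.com"  # placeholder for the AWS deployment

doc = requests.get(AUTHORITY + "/.well-known/openid-configuration").json()
for key in ("issuer", "authorization_endpoint", "token_endpoint"):
    # If PublicOrigin lacked the scheme, these values come back malformed,
    # which triggers the "absolute URI" error in the client middleware.
    print(key, doc.get(key))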

Authorizing localhost with gdata and AuthSub?

While testing, I started walking through authorizing my test machine (192.168.15.6, a local IP) with YouTube, and that seemed successful: the IP is listed under my authorized sites. However, any actual request says I'm not authenticated. I'm guessing it isn't going to work because the requests appear to come from my public IP, right?
The documentation is split between the API reference, the gdata guide, and the Python client guide, and the examples seem limited. What I didn't get from the Python guide is that the session token is a new token, rather than an upgrade of the existing one-use token.
# yt_service is your existing gdata.youtube.service.YouTubeService instance
yt_service.SetAuthSubToken(token)   # attach the single-use AuthSub token
yt_service.UpgradeToSessionToken()  # exchange it for a long-lived session token
session_token = yt_service.current_token.get_token_string()  # the new token
This gives you the new session token after the upgrade.
Everything has worked for me when developing locally except getting a secure token; just leave secure set to False.