I am trying to scrape the following site:
https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/
import requests
from bs4 import BeautifulSoup
site = 'https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/'
soup = BeautifulSoup(requests.get(site).content, 'html.parser')
I get:
raise TooManyRedirects('Exceeded {} redirects.'.format(self.max_redirects), response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I would like to understand what is going on. I suspect some loop generated somehow by the special characters being interpreted but I am at a loss rn.
You asked why this happens. It is due to requests using urllib3, which changes percent-encoded bytes to uppercase (https://github.com/urllib3/urllib3/issues/1677), following the RFC 3986 recommendation to uppercase percent-encoded bytes during normalization. In normal circumstances that would be fine, but this server seems to want its URLs lowercase. This can be seen with:
import requests
url = 'https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/'
resp = requests.get(url, allow_redirects=False)
print(resp.status_code)
print(resp.headers['Location'])
print(resp.url)
Outputs:
301
https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/
https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%E2%99%A5%E2%80%8A-ny-logo-dead-at-91/
This shows, in order, the HTTP 301 redirect status, the URL the server is redirecting to, and the URL the request was actually made to.
You can test this by opening Firefox or Chrome, right-clicking on a page, selecting Inspect, then Network, ticking Disable cache, then pasting the last URL and hitting return. You will see the 301 redirect.
I expect there is a directive on the server to make all URLs lowercase by forcing a redirect. So it goes into a loop of requesting with uppercase percent-encoded bytes and being redirected to a URL with lowercase percent-encoded bytes to which it makes a request with uppercase percent-encoded bytes etc.
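The loop described above can be reproduced offline with a small simulation. This is only a sketch: `uppercase_escapes` imitates the kind of normalization urllib3 applies (it is not urllib3's code), and `lowercase_escapes` stands in for the server-side rule that I am assuming exists. Both function names are made up for illustration.

```python
import re

PERCENT_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")

def uppercase_escapes(url: str) -> str:
    """Client-side normalization: uppercase the hex digits of each %XX escape."""
    return PERCENT_ESCAPE.sub(lambda m: m.group(0).upper(), url)

def lowercase_escapes(url: str) -> str:
    """Assumed server rule: canonical URLs use lowercase %xx escapes."""
    return PERCENT_ESCAPE.sub(lambda m: m.group(0).lower(), url)

def simulate(url: str, max_redirects: int = 30) -> str:
    """Follow the simulated redirects and report how the exchange ends."""
    for hop in range(max_redirects + 1):
        requested = uppercase_escapes(url)        # what the client really sends
        canonical = lowercase_escapes(requested)  # what the server wants
        if requested == canonical:
            return f"200 OK after {hop} redirects"
        url = canonical  # the 301 Location value; the client re-normalizes it
    return f"Exceeded {max_redirects} redirects."

print(simulate("https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/"))
print(simulate("https://example.com/no-escapes-here/"))
```

A URL whose escapes contain only digits (for example %26) never differs between the two forms, so it would not trigger this particular loop.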
There is a way round it but it could lead to unexpected side-effects and I would only use it as a last resort and then only if you were certain all your URLs were formatted as the server expects them. But it explains the problem.
import requests.packages.urllib3.util.url as _url
import requests
def my_encode_invalid_chars(component, allowed_chars):
    # Return the component untouched so percent-escapes keep their original case
    return component
_url._encode_invalid_chars = my_encode_invalid_chars
url = 'https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/'
resp = requests.get(url)
print(resp.status_code)
print(resp.headers)
print(resp.url)
print(resp.text)
Note the output is:
200
{'Server': 'nginx', ...
https://nypost.com/2020/06/27/milton-glaser-designer-of-i-%e2%99%a5%e2%80%8a-ny-logo-dead-at-91/
The response is HTTP 200 OK.
There is no Location header (I truncated the output).
The URL that was requested is lowercase.
Then it prints the page source.
I'm making requests to Twitter, using the OAuth1.0 signing process to set the Authorization header. They explain it step-by-step here, which I've followed. It all works, most of the time.
Authorization fails whenever special characters are sent without percent encoding in the query component of the request. For example, ?status=hello%20world! fails, but ?status=hello%20world%21 succeeds. But the change from ! to the percent encoded form %21 is only made in the URL, after the signature is generated.
So I'm confused as to why this fails, because AFAIK that's a legally encoded query string. Only the raw strings ("status", "hello world!") are used for signature generation, and I'd assume the server would remove any percent encoding from the query params and generate its own signature for comparison.
When it comes to building the URL, I let URLComponents do the work, so I don't add percent encoding manually, ex.
var urlComps = URLComponents()
urlComps.scheme = "https"
urlComps.host = host
urlComps.path = path
urlComps.queryItems = [URLQueryItem(name: "status", value: "hello world!")]
urlComps.percentEncodedQuery // "status=hello%20world!"
I wanted to see how Postman handled the same request. I selected OAuth1.0 as the Auth type and plugged in the same credentials. The request succeeded. I checked the Postman console and saw ?status=hello%20world%21; it was percent encoding the !. I updated Postman, because a nice little prompt asked me to. Then I tried the same request; now it was getting an authorization failure, and I saw ?status=hello%20world! in the console; the ! was no longer being percent encoded.
I'm wondering who is at fault here. Perhaps Postman and I are making the same mistake. Perhaps it's with Twitter. Or perhaps there's some proxy along the way that idk, double encodes my !.
The OAuth1.0 spec says this, which I believe is in the context of both client (taking a request that's ready to go and signing it before it's sent), and server (for generating another signature to compare against the one received):
The parameters from the following sources are collected into a
single list of name/value pairs:
The query component of the HTTP request URI as defined by
[RFC3986], Section 3.4. The query component is parsed into a list
of name/value pairs by treating it as an
"application/x-www-form-urlencoded" string, separating the names
and values and decoding them as defined by
[W3C.REC-html40-19980424], Section 17.13.4.
That last reference, here, outlines the encoding for application/x-www-form-urlencoded, and says that space characters should be replaced with +, non-alphanumeric characters should be percent encoded, name separated from value by =, and pairs separated by &.
So, the OAuth1.0 spec says that the query string of the URL needs to be decoded as defined by application/x-www-form-urlencoded. Does that mean that our query string needs to be encoded this way too?
It seems to me, if a request is to be signed using OAuth1.0, the query component of the URL that gets sent must be encoded in a way that is different to what it would normally be encoded in? That's a pretty significant detail if you ask me. And I haven't seen it explicitly mentioned, even in Twitter's documentation. And evidently the folks at Postman overlooked it too? Unless I'm not supposed to be using URLComponents to build a URL, but that's what it's for, no? Have I understood this correctly?
Note: ?status=hello+world%21 succeeds; it tweets "hello world!"
I ran into a similar issue.
Put the status in the POST body, not the query string.
Percent-encoding:
private encode(str: string) {
    // encodeURIComponent() escapes all characters except: A-Z a-z 0-9 - _ . ! ~ * ' ( )
    // RFC 3986 section 2.3 Unreserved Characters (January 2005): A-Z a-z 0-9 - _ . ~
    // so the remaining ! ' ( ) * are escaped by hand below
    return encodeURIComponent(str)
        .replace(/[!'()*]/g, c => "%" + c.charCodeAt(0).toString(16).toUpperCase());
}
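For comparison, here is a rough Python equivalent of the same idea: escape everything except the RFC 3986 unreserved characters, so that ! becomes %21. The helper name `oauth_percent_encode` is made up for this sketch; `urllib.parse.quote` is the standard-library call doing the work.

```python
from urllib.parse import quote

def oauth_percent_encode(value: str) -> str:
    """Escape everything except RFC 3986 unreserved characters (A-Z a-z 0-9 - . _ ~)."""
    # safe="~" replaces quote()'s default safe="/" so that "/" is escaped too,
    # and keeps "~" unescaped on Python versions that would otherwise encode it.
    return quote(value, safe="~")

query = "status=" + oauth_percent_encode("hello world!")
print(query)
```

The resulting query string matches the form that succeeded in the question (?status=hello%20world%21).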
I created my own “404 Page not found” error page on a TYPO3 website and implemented it via the /typo3conf/LocalConfiguration.php as follows, using the page’s Speaking URL path:
return [
    ...
    'FE' => [
        ...
        'pageNotFound_handling' => '/page-not-found/',
    ]
]
Now when I call a non-existing page, the error page gets displayed but there is a 4-digit alphanumeric number (hexadecimal as far as I’ve seen by now) BEFORE the HTML source code and a “0” AFTER it. Example (the number in the beginning is different after most of the reloads):
37b3
<!DOCTYPE html>
...
</html>
0
When calling the error page URL itself the page is returned correctly without those numbers.
Having the RealURL extension activated or deactivated does not make a difference.
Thanks a lot in advance!
I added the full description from the install tool and I guess we might find the solution there.
How TYPO3 should handle requests for non-existing/accessible pages.
empty (default)
The next visible page upwards in the page tree is shown.
'true' or '1'
An error message is shown.
String
Static HTML file to show (reads content and outputs with correct headers), e.g. notfound.html or http://www.example.org/errors/notfound.html.
Prefix "REDIRECT:"
If prefixed with "REDIRECT:" it will redirect to the URL/script after the prefix.
Prefix "READFILE:"
If prefixed with "READFILE" then it will expect the remaining string to be a HTML file which will be read and outputted directly after having the marker "###CURRENT_URL###" substituted with REQUEST_URI and ###REASON### with reason text, for example: READFILE:fileadmin/notfound.html.
Prefix "USER_FUNCTION:"
If prefixed with "USER_FUNCTION:" a user function is called, e.g. USER_FUNCTION:fileadmin/class.user_notfound.php:user_notFound->pageNotFound where the file must contain a class user_notFound with a method pageNotFound() inside with two parameters $param and $ref.
What you configured:
You're passing a plain string, so TYPO3 expects to find a file, which you don't have, because what you passed is more like a URL.
Given what you are trying to achieve, I'd go with REDIRECT:/page-not-found/.
Thanks for pointing this one out btw, I will remove the string configuration from the core since it does not make sense to have more people trip into this pitfall.
In short: change the following line in the FE section of your LocalConfiguration.php:
'pageNotFound_handling' => '/your404page.html',
to
'pageNotFound_handling' => 'REDIRECT:/your404page.html',
Cause
The actual cause is a combination of chunked Transfer-Encoding and TYPO3 not being able to decode that in some cases. In your case the page not found handler eventually uses GeneralUtility::getUrl() to retrieve the error page.
If you have [SYS][curlUse] enabled it will use cUrl to retrieve the page and there is no problem.
If you don't have [SYS][curlUse] enabled it will open a socket, read the headers and then read the rest of the body. If the webserver uses "chunked" Transfer-Encoding, the body will contain blocks of data, and each block starts with a line giving its length in hexadecimal format. The content ends with an empty block (which of course has a line with the length "0").
cUrl apparently knows how to decode chunked data.
getUrl() itself does not know how to handle chunked data and uses the content as is as the page content.
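To make the chunk format concrete, here is a minimal sketch in Python of what a decoder has to do. It is not TYPO3's or cURL's code, the function name `decode_chunked` is made up, and it ignores trailers; it only shows where the leading hexadecimal number and the trailing "0" come from.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked message body (trailers are ignored)."""
    decoded = bytearray()
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line may carry ";extension" parameters after the hex length.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # the empty block marks the end of the content
        start = line_end + 2
        decoded += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that terminates the chunk data
    return bytes(decoded)

# A tiny chunked body: a 0x10-byte block, a 7-byte block, then the empty block.
raw = b"10\r\n<!DOCTYPE html>\n\r\n7\r\n</html>\r\n0\r\n\r\n"
print(decode_chunked(raw))
```

Printing `raw` undecoded would show the length lines around the HTML, which is the same symptom as on the 404 page.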
In TYPO3 8 LTS the guzzle library is used to handle HTTP requests. In the guzzle code I can't find anything about handling chunked data. Guzzle will check if the cUrl PHP extension is present and use that as preferred transport. In most installations cUrl is present and since this decodes chunked data automagically no problem is visible. I have to test guzzle with PHP that has cUrl disabled to see if the issue is also present in v8/master.
Workaround/solution
If the PHP extension cUrl is enabled in your installation you can simply set [SYS][curlUse] in the Install Tool. The numbers around the 404 page content will disappear.
Upon executing an HTTP Get request, I receive the following error:
2015/08/30 16:42:09 Get https://en.wikipedia.org/wiki/List_of_S%26P_500_companies:
stopped after 10 redirects
In the following code:
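package main
import (
	"log"
	"net/http"
)
func main() {
	response, err := http.Get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
	if err != nil {
		log.Fatal(err)
	}
	// Use the response so the program compiles, and release the connection
	defer response.Body.Close()
	log.Println(response.Status)
}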
I know that according to the documentation,
// Get issues a GET to the specified URL. If the response is one of
// the following redirect codes, Get follows the redirect, up to a
// maximum of 10 redirects:
//
// 301 (Moved Permanently)
// 302 (Found)
// 303 (See Other)
// 307 (Temporary Redirect)
//
// An error is returned if there were too many redirects or if there
// was an HTTP protocol error. A non-2xx response doesn't cause an
// error.
I was hoping that somebody knows what the solution would be in this case. It seems rather odd that this simple url results in more than ten redirects. Makes me think that there may be more going on behind the scenes.
Thank you.
As others have pointed out, you should first give thought to why you are encountering so many HTTP redirects. Go's default policy of stopping at 10 redirects is reasonable. More than 10 redirects could mean you are in a redirect loop. That could be caused outside your code. It could be induced by something about your network configuration, proxy servers between you and the website, etc.
That said, if you really do need to change the default policy, you do not need to resort to editing the net/http source as someone suggested.
To change the default handling of redirects you will need to create a Client and set CheckRedirect.
For your reference:
http://golang.org/pkg/net/http/#Client
// If CheckRedirect is nil, the Client uses its default policy,
// which is to stop after 10 consecutive requests.
CheckRedirect func(req *Request, via []*Request) error
I had this issue with Wikipedia URLs containing %26 because they redirect to a version of the URL with & which Go then encodes to %26 which Wikipedia redirects to & and ...
Oddly, removing gcc-go (v1.4) from my Arch box and replacing it with go (v1.5) has fixed the problem.
I'm guessing this can be put down to the changes in net/http between v1.4 and v1.5 then.
The following code should post a form to an endpoint (which returns 302) and, after following the redirect, parse the url of the page and return some information from there.
val start = System.currentTimeMillis()
val requestHolder = WS.url(conf("login.url"))
  .withRequestTimeout(loginRequestTimeOut)
  .withFollowRedirects(true) //This appears to have no effect...
requestHolder.post(getMap(username, password))
  .map(resp => {
    Logger.debug(resp.status.toString)
    val loginResponse = getResponse(resp)
    val end = System.currentTimeMillis()
    Logger.debug("Login for the user: " + username + ", request took: " + (end - start) + " milliseconds.")
    loginResponse
  })
The problem is that .withFollowRedirects(true) appears to have no effect on the query. The status of the response is 302 and the request does not follow the redirect.
I've gone through the process manually using httpie and following the redirects does lead to the correct page.
Any help or insight would be much appreciated.
POST redirection isn't as well supported as GET redirection. W3 specification says:
If the 301 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued.
Some browsers do not honor this requirement and simply ignore it. Have a look also at the 307 status:
307 Temporary Redirect (since HTTP/1.1)
In this case, the request should be repeated with another URI; however, future requests should still use the original URI. In contrast to how 302 was historically implemented, the request method is not allowed to be changed when reissuing the original request. For instance, a POST request should be repeated using another POST request.
There is also a discussion about this on Programmer Stack Exchange.
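The method-switching rules quoted above can be summarized in a few lines. This is a sketch of the behavior most browsers and several HTTP clients follow in practice, written in Python for brevity; whether Play's WS client applies the same rules is not something this thread confirms, and `redirected_method` is a made-up name.

```python
def redirected_method(status: int, method: str) -> str:
    """Return the HTTP method a typical client uses when following a redirect."""
    method = method.upper()
    if status == 303 and method != "HEAD":
        return "GET"      # See Other: always fetch the target with GET
    if status in (301, 302) and method == "POST":
        return "GET"      # historical browser behavior: POST is downgraded
    return method         # 307/308 (and everything else) keep the method

for status in (301, 302, 303, 307):
    print(status, "POST ->", redirected_method(status, "POST"))
```

Under these rules, a client that repeats the POST after a 302 and a client that switches to GET will hit the target differently, which may explain why one approach works against a given server and the other does not.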
I've had a lot of trouble with withFollowRedirects and POST.
At some point, while fighting to make things work, I had .withFollowRedirects(false) in my code, then removed it during cleanups & things broke. My current guess is that if this option is not explicitly made false, the default behavior is to follow redirects (302 in my case) with some faulty mechanism. Perhaps the default mechanism uses POST again with same arguments. But in my case, interacting with Google App Script (GAS), one needs to use GET to retrieve JSON output of a POST.
Whatever the mechanism was doing, I was getting 400 with no further diagnostics.
After wasting hours, I realized that .withFollowRedirects(false) was in fact truly needed: it disabled Play's messing with redirects, I was able to see the 302 response & handle the following GET manually with success.
I am currently trying to modify a script to use the requests library instead of the urllib2 library. I haven't really used it before and I am looking to do the equivalent of urlopen("http://www.example.org").read(), so I tried the requests.get("http://www.example.org").text function.
This works fine with normal everyday html, however when I fetch from this url (https://gtfsrt.api.translink.com.au/Feed/SEQ) it doesn't seem to work.
So I wrote the below code to print out the responses from the same url using both the requests and urllib2 libraries.
import urllib2
import requests
#urllib2 request
request = urllib2.Request("https://gtfsrt.api.translink.com.au/Feed/SEQ")
result = urllib2.urlopen(request)
#requests request
result2 = requests.get("https://gtfsrt.api.translink.com.au/Feed/SEQ")
print result2.encoding
#urllib2 write to text
open("Output.txt", 'w').close()
text_file = open("Output.txt", "w")
text_file.write(result.read())
text_file.close()
open("Output2.txt", 'w').close()
text_file = open("Output2.txt", "w")
text_file.write(result2.text)
text_file.close()
The urlopen().read() call works fine, but requests.get().text doesn't work for this URL. I suspect it has something to do with encoding, but I don't know what. Any thoughts?
Note: The supplied URL is a feed in the Google protocol buffer format; once I receive the message I give the feed to a Google library that interprets it.
Your issue is that you're making the requests module interpret binary content in a response as text.
A response from the requests library has two main ways to access the body of the response:
Response.content - will return the response body as a bytestring
Response.text - will decode the response body as text and return unicode
Since protocol buffers are a binary format, you should use result2.content in your code instead of result2.text.
Response.content will return the body of the response as-is, in bytes. For binary content this is exactly what you want. For text content that contains non-ASCII characters, this means the content must have been encoded by the server into a bytestring using a particular encoding, indicated by either an HTTP header or a <meta charset="..." /> tag. To make sense of those bytes, they need to be decoded with that charset after they are received.
Response.text is a convenience property that does exactly this for you. It assumes the response body is text, looks at the response headers to find the encoding, and decodes the body for you, returning unicode.
But if your response doesn't contain text, this is the wrong method to use. Binary content doesn't contain characters, because it's not text, so the whole concept of character encoding does not make any sense for binary content - it's only applicable to text composed of characters. (That's also why you're seeing response.encoding == None - it's just bytes, there is no character encoding involved).
See Response Content and Binary Response Content in the requests documentation for more details.
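The damage done by treating binary data as text can be shown without any network access. The sketch below (Python 3, standard library only) decodes a few protocol buffer bytes the way a text accessor would and then encodes them again; as far as I know requests decodes with errors="replace", which is what is assumed here. The helper name `as_text_then_bytes` is made up for illustration.

```python
def as_text_then_bytes(body: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes as text (replacing invalid sequences), then encode them back."""
    return body.decode(encoding, errors="replace").encode(encoding)

# Protocol buffer encoding of a message with field 1 set to 150.
payload = bytes([0x08, 0x96, 0x01])

mangled = as_text_then_bytes(payload)
print(payload)  # the original bytes, what Response.content would give you
print(mangled)  # 0x96 is not valid UTF-8 on its own and has been replaced
print(mangled == payload)
```

Once the invalid byte has been swapped for the replacement character, the original value cannot be recovered, so a protobuf parser fed the text form will fail or misread the message.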