Scrapy redirect is always 200

I am experiencing strange behavior in Scrapy. I collect status codes by calling response.status, but not all of them are present (the 3xx codes seem to be missing). I see the following in the log:
downloader/response_status_count/200: 8150
downloader/response_status_count/301: 226
downloader/response_status_count/302: 67
downloader/response_status_count/303: 1
downloader/response_status_count/307: 48
downloader/response_status_count/400: 7
downloader/response_status_count/403: 44
downloader/response_status_count/404: 238
downloader/response_status_count/405: 8
downloader/response_status_count/406: 26
downloader/response_status_count/410: 7
downloader/response_status_count/500: 12
downloader/response_status_count/502: 6
downloader/response_status_count/503: 3
whereas my csv file has only 200, 404, 403, 406, 502, 400, 405, 410, 500 and 503. I set HTTPERROR_ALLOW_ALL = True in settings.py. Can I force Scrapy to provide information about redirects? Right now I am taking it from response.meta['redirect_times'] and response.meta['redirect_urls'], but the status code is still 200 instead of 3xx.
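For reference, pulling that redirect metadata in a callback looks roughly like this (a minimal sketch; the yielded fields are just illustrative):

def parse(self, response):
    # response.status is the final, post-redirect status (hence the 200s)
    yield {
        'url': response.url,
        'status': response.status,
        'redirect_times': response.meta.get('redirect_times', 0),
        'redirect_urls': response.meta.get('redirect_urls', []),
    }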

3xx responses will never reach your callback (parse method) because they are handled by the redirect middleware before that.
However, all of the response statuses are already stored in the Scrapy stats, as you pointed out yourself, which means you can easily pull them in your crawler at any point:
In your callback:
def parse(self, response):
    stats = self.crawler.stats.get_stats()
    status_stats = {
        k: v for k, v in stats.items()
        if 'status_count' in k
    }
    # {'downloader/response_status_count/200': 1}
In your pipeline (see docs for how to use pipelines):
import json

class SaveStatsPipeline:
    """Save response status stats in a stats.json file"""

    def close_spider(self, spider):
        """When the spider closes, save all status stats in a stats.json file"""
        stats = spider.crawler.stats.get_stats()
        status_stats = {
            k: v for k, v in stats.items()
            if 'status_count' in k
        }
        with open('stats.json', 'w') as f:
            f.write(json.dumps(status_stats))
Anywhere you have access to the crawler object, really!
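For example, a minimal sketch of dumping the same stats from the spider's closed() hook (the spider name and log message are illustrative):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def closed(self, reason):
        # called when the spider finishes; the crawler is available as self.crawler
        stats = self.crawler.stats.get_stats()
        status_stats = {k: v for k, v in stats.items() if 'status_count' in k}
        self.logger.info('Response status counts: %s', status_stats)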

Related

Redirection using Scrapy Spider Middleware (Unhandled error in Deferred)

I've made a spider using Scrapy that first solves a CAPTCHA at a redirected address before accessing the main website I intend to scrape. It says I have an HTTP error causing an infinite loop, but I can't find which part of the script is responsible.
In the middleware:
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class ProtectRedirectMiddleware(RedirectMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        self.source = urllib.request.urlopen('http://sampleurlname.com/')
        soup = BeautifulSoup(source, 'lxml')

    def _redirect(self, redirected, request, spider, reason):
        # act normally if this isn't a CAPTCHA redirect
        if not self.is_protected(redirected.url):
            return super()._redirect(redirected, request, spider, reason)
        # if this is a CAPTCHA redirect
        logger.debug(f'The protect URL is triggered for {request.url}')
        request.cookies = self.bypass_protection(redirected.url)
        request.dont_filter = True
        return request

    def is_protected(self, url):
        return 'sampleurlname.com/protect' in url

    def bypass_protection(self, url=None):
        # only navigate if any explicit url is provided
        if url:
            url = url or self.source.geturl(url)
            img = soup.find_all('img')[0]
            imgurl = img['src']
            urllib.request.urlretrieve(imgurl, "captcha.png")
            return self.solve_captcha(imgurl)
        # wait for the redirect and try again
        self.wait_for_redirect()
        return self.bypass_protection()

    def wait_for_redirect(self, url=None, wait=0.1, timeout=10):
        url = self.url
        for i in range(int(timeout//wait)):
            time.sleep(wait)
            if self.response.url() != url:
                return self.response.url()
        logger.error(f'Maybe {self.response.url()} isn\'t a redirect URL')
        raise Exception('Timed out')

    def solve_captcha(self, img, width=150, height=50):
        # open image
        self.img = 'captcha.png'
        img = Image.open("captcha.png")
        # image manipulation - simplified
        # input the captcha text - simplified
        # click the submit button - simplified
        # save the URL
        url = self.response.url()
        # try again if wrong
        if self.is_protected(self.wait_for_redirect(url)):
            return self.bypass_protection()
        # return the cookies as a dict
        cookies = {}
        for cookie_string in self.response.css.cookies():
            if 'domain=sampleurlname.com' in cookie_string:
                key, value = cookie_string.split(';')[0].split('=')
                cookies[key] = value
        return cookies
Then, this is the error I get when I run the scrapy crawl of my spider:
Unhandled error in Deferred:
2018-08-06 16:34:33 [twisted] CRITICAL: Unhandled error in Deferred:
2018-08-06 16:34:33 [twisted] CRITICAL:
Traceback (most recent call last):
File "/username/anaconda/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/username/anaconda/lib/python3.6/site-packages/scrapy/crawler.py", line 80, in crawl
self.engine = self._create_engine()
File "/username/anaconda/lib/python3.6/site-packages/scrapy/crawler.py", line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/username/anaconda/lib/python3.6/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/username/anaconda/lib/python3.6/site-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/username/anaconda/lib/python3.6/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/username/anaconda/lib/python3.6/site-packages/scrapy/middleware.py", line 36, in from_settings
mw = mwcls.from_crawler(crawler)
File "/username/anaconda/lib/python3.6/site-packages/scrapy/downloadermiddlewares/redirect.py", line 26, in from_crawler
return cls(crawler.settings)
File "/username/...../scraper/myscraper/myscraper/middlewares.py", line 27, in __init__
self.source = urllib.request.urlopen('http://sampleurlname.com/')
File "/username/anaconda/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/username/anaconda/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/username/anaconda/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/username/anaconda/lib/python3.6/urllib/request.py", line 564, in error
result = self._call_chain(*args)
File "/username/anaconda/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/username/anaconda/lib/python3.6/urllib/request.py", line 756, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/username/anaconda/lib/python3.6/urllib/request.py", line 532, in open
It basically repeats the bottom part of these over and over: open, http_response, error, _call_chain, and http_error_302, until these show at the end:
File "/username/anaconda/lib/python3.6/urllib/request.py", line 746, in http_error_302
self.inf_msg + msg, headers, fp)
urllib.error.HTTPError: HTTP Error 307: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Temporary Redirect
In settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'myscrape.middlewares.ProtectRedirectMiddleware': 600,
}
Your issue has nothing to do with Scrapy itself: you are making a blocking request in your middleware's initialization.
That request gets stuck in a redirect loop. This usually happens when a website misbehaves and requires cookies before it lets you through:
First you connect and get a 30x redirect response along with some Set-Cookie headers.
You then follow the redirect, but without sending those cookies back, so the server just redirects you again instead of letting you through.
Python's urllib doesn't handle cookies out of the box, so try this:
import logging
import urllib.error
import urllib.request
from http.cookiejar import CookieJar

from scrapy.selector import Selector

def __init__(self, settings):
    super().__init__(settings)
    url = 'http://sampleurlname.com/'
    try:
        req = urllib.request.Request(url)
        cj = CookieJar()
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
        response = opener.open(req)
        source = response.read().decode('utf8', errors='ignore')
        response.close()
    except urllib.error.HTTPError as e:
        logging.error(f"couldn't initiate middleware: {e}")
        return
    # you should use Scrapy selectors instead of BeautifulSoup here
    # soup = BeautifulSoup(source, 'lxml')
    selector = Selector(text=source)
Alternatively, you could use the requests package, which handles cookies by itself.
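If you go that route, a rough sketch of the same initialization using requests (assuming the package is installed; the URL is still the hypothetical sampleurlname.com from the question) could look like this:

import logging

import requests
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.selector import Selector

class ProtectRedirectMiddleware(RedirectMiddleware):
    def __init__(self, settings):
        super().__init__(settings)
        try:
            # a Session carries cookies across the whole redirect chain automatically
            session = requests.Session()
            response = session.get('http://sampleurlname.com/', timeout=10)
            response.raise_for_status()
        except requests.RequestException as e:
            logging.error(f"couldn't initiate middleware: {e}")
            return
        self.selector = Selector(text=response.text)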

Healthcheck for endpoints - quick and dirty version

I have a few REST endpoints and a few [asmx/svc] endpoints.
Some of them are GET and the others are POST operations.
I am trying to put together a quick and dirty, repeatable healthcheck sequence to find out whether all the endpoints are responsive or any of them are down.
Essentially either get a 200 or 201 and report error if otherwise.
What is the easiest way to do this?
SoapUI internally uses Apache HttpClient 4.1.1, so you can use it inside a Groovy script test step to perform your checks.
Add a Groovy script test step to your test case and put the code below inside it. It tries a GET against a list of URLs; if a URL returns HTTP status 200 or 201 it is considered working, if it returns 405 (Method Not Allowed) it retries with POST and applies the same status-code check, and otherwise the endpoint is considered down.
Note that a service can be running and still return, for example, 400 (Bad Request) if the request itself is incorrect, so think about whether you need to rethink the way you perform the check or add other status codes that count as the server running correctly.
import org.apache.http.HttpEntity
import org.apache.http.HttpResponse
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.DefaultHttpClient

// urls to check
def urls = ['http://google.es', 'http://stackoverflow.com']

// apache http-client to use in the closure
DefaultHttpClient httpclient = new DefaultHttpClient()

// util function to get the response status
def getStatus = { httpMethod ->
    HttpResponse response = httpclient.execute(httpMethod)
    // consume the entity to avoid errors with http-client
    if (response.getEntity() != null) {
        response.getEntity().consumeContent();
    }
    return response.getStatusLine().getStatusCode()
}

HttpGet httpget;
HttpPost httppost;

// findAll urls that are working
def urlsWorking = urls.findAll { url ->
    log.info "try GET for $url"
    httpget = new HttpGet(url)
    def status = getStatus(httpget)
    // if status is 200 or 201 it's correct
    if (status in [200, 201]) {
        log.info "$url is OK"
        return true
    // if GET is not allowed try with POST
    } else if (status == 405) {
        log.info "try POST for $url"
        httppost = new HttpPost(url)
        status = getStatus(httppost)
        // if status is 200 or 201 it's correct
        if (status in [200, 201]) {
            log.info "$url is OK"
            return true
        }
        log.info "$url is NOT working, status code: $status"
        return false
    } else {
        log.info "$url is NOT working, status code: $status"
        return false
    }
}

// close connection to release resources
httpclient.getConnectionManager().shutdown()

log.info "URLS WORKING:" + urlsWorking
This script logs:
Tue Nov 03 22:37:59 CET 2015:INFO:try GET for http://google.es
Tue Nov 03 22:38:00 CET 2015:INFO:http://google.es is OK
Tue Nov 03 22:38:00 CET 2015:INFO:try GET for http://stackoverflow.com
Tue Nov 03 22:38:03 CET 2015:INFO:http://stackoverflow.com is OK
Tue Nov 03 22:38:03 CET 2015:INFO:URLS WORKING:[http://google.es, http://stackoverflow.com]
Hope it helps,
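If you want the same quick check outside SoapUI, a rough Python equivalent (just a sketch assuming the requests package; the URLs are the same placeholders used above) would be:

import requests

URLS = ['http://google.es', 'http://stackoverflow.com']

def is_up(url):
    """Consider an endpoint up if GET (or POST, when GET gives 405) returns 200/201."""
    try:
        status = requests.get(url, timeout=10).status_code
        if status == 405:  # Method Not Allowed: retry with POST
            status = requests.post(url, timeout=10).status_code
        return status in (200, 201)
    except requests.RequestException:
        return False

if __name__ == '__main__':
    working = [url for url in URLS if is_up(url)]
    print('URLS WORKING:', working)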

500 internal server error on certain page after a few hours

I am getting a 500 Internal Server Error on a certain page of my site after a few hours of being up. I restart the uWSGI instance with uwsgi --ini /home/metheuser/webapps/ers_portal/ers_portal_uwsgi.ini and it works again for a few hours.
The rest of the site seems to be working. When I navigate to my_table, I am directed to the login page. But I get the 500 error on my table page on login. I followed the instructions here to set up my nginx and uwsgi configs.
That is, I have ers_portal_nginx.conf located in my app folder and symlinked to /etc/nginx/conf.d/. I start my uWSGI "instance" (not sure what exactly to call it) in a Screen session as mentioned above, with the .ini file located in my app folder.
My ers_portal_nginx.conf:
server {
    listen 80;
    server_name www.mydomain.com;

    location / { try_files $uri @app; }
    location @app {
        include uwsgi_params;
        uwsgi_pass unix:/home/metheuser/webapps/ers_portal/run_web_uwsgi.sock;
    }
}
My ers_portal_uwsgi.ini:
[uwsgi]
#user info
uid = metheuser
gid = ers_group
#application's base folder
base = /home/metheuser/webapps/ers_portal
#python module to import
app = run_web
module = %(app)
home = %(base)/ers_portal_venv
pythonpath = %(base)
#socket file's location
socket = /home/metheuser/webapps/ers_portal/%n.sock
#permissions for the socket file
chmod-socket = 666
#uWSGI variable only, does not relate to your Flask application
callable = app
#location of log files
logto = /home/metheuser/webapps/ers_portal/logs/%n.log
Relevant parts of my views.py
data_modification_time = None
data = None

def reload_data():
    global data_modification_time, data, sites, column_names
    filename = '/home/metheuser/webapps/ers_portal/app/static/' + ec.dd_filename
    mtime = os.stat(filename).st_mtime
    if data_modification_time != mtime:
        data_modification_time = mtime
        with open(filename) as f:
            data = pickle.load(f)
    return data

# a bunch of authentication stuff...

@app.route('/')
@app.route('/index')
def index():
    return render_template("index.html",
                           title = 'Main',)

@app.route('/login', methods = ['GET', 'POST'])
def login():
    # login stuff...

@app.route('/my_table')
@login_required
def my_table():
    print 'trying to access data table...'
    data = reload_data()
    return render_template("my_table.html",
                           title = "Rundata Viewer",
                           sts = sites,
                           cn = column_names,
                           data = data)  # dictionary of data
I installed nginx via yum as described here (yesterday)
I am using uWSGI installed in my venv via pip
I am on CentOS 6
My uwsgi log shows:
Wed Jun 11 17:20:01 2014 - uwsgi_response_writev_headers_and_body_do(): Broken pipe [core/writer.c line 287] during GET /whm-server-status (127.0.0.1)
IOError: write error
[pid: 9586|app: 0|req: 135/135] 127.0.0.1 () {24 vars in 292 bytes} [Wed Jun 11 17:20:01 2014] GET /whm-server-status => generated 0 bytes in 3 msecs (HTTP/1.0 404) 2 headers in 0 bytes (0 switches on core 0)
When it's working, the print statement in the "my_table" view prints to the log file, but it doesn't once the page stops working.
Any ideas?

AttributeError when using lettuce 'world'

I have two files:
steps.py:
from lettuce import *
from splinter.browser import Browser

@before.harvest
def set_browser():
    world.browser = Browser('webdriver.chrome')

@step(u'Given I visit "([^"]*)"')
def given_i_visit(step, url):
    world.browser.visit(url)
test.feature:
Feature: Do some basic tests

    Scenario: Check whether the website is accessable
        Given I visit "/"
Running lettuce against them returns this:
Feature: Do some basic tests # features/test.feature:1
Scenario: Check whether the website is accessable # features/test.feature:2
Given I visit "/" # features/steps.py:8
Traceback (most recent call last):
File "/..../site-packages/lettuce/core.py", line 125, in __call__
ret = self.function(self.step, *args, **kw)
File "/..../test/features/steps.py", line 9, in given_i_visit
world.browser.visit(url)
AttributeError: 'thread._local' object has no attribute 'browser'
1 feature (0 passed)
1 scenario (0 passed)
1 step (1 failed, 0 passed)
Any ideas on what could be going wrong?
Although it's not in the documentation: place a terrain.py file in the same directory as your steps and feature files, initialize the world attribute there with any value, and you should be OK.
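For reference, a minimal terrain.py along those lines might look like this (a sketch, reusing the same splinter Browser call as in the question):

# terrain.py: lives next to steps.py and test.feature
from lettuce import before, world
from splinter.browser import Browser

@before.all
def set_browser():
    world.browser = Browser('webdriver.chrome')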
The problem is that the before.harvest hook takes an argument, so the correct code is the following:
from lettuce import *
from splinter import Browser

@before.harvest
def set_browser(data):
    world.browser = Browser('webdriver.chrome')

@step(u'Given I visit "([^"]*)"')
def given_i_visit(step, url):
    world.browser.visit(url)
hope it helps!

Why does calling error or done in a BodyParser's Iteratee make the request hang in Play Framework 2.0?

I am trying to understand the reactive I/O concepts of Play 2.0 framework. In order to get a better understanding from the start I decided to skip the framework's helpers to construct iteratees of different kinds and to write a custom Iteratee from scratch to be used by a BodyParser to parse a request body.
Starting with the information available in Iteratees and ScalaBodyParser docs and two presentations about play reactive I/O this is what I came up with:
import play.api.mvc._
import play.api.mvc.Results._
import play.api.libs.iteratee.{Iteratee, Input}
import play.api.libs.concurrent.Promise
import play.api.libs.iteratee.Input.{El, EOF, Empty}

01 object Upload extends Controller {
02   def send = Action(BodyParser(rh => new SomeIteratee)) { request =>
03     Ok("Done")
04   }
05 }
06
07 case class SomeIteratee(state: Symbol = 'Cont, input: Input[Array[Byte]] = Empty, received: Int = 0) extends Iteratee[Array[Byte], Either[Result, Int]] {
08   println(state + " " + input + " " + received)
09
10   def fold[B](
11     done: (Either[Result, Int], Input[Array[Byte]]) => Promise[B],
12     cont: (Input[Array[Byte]] => Iteratee[Array[Byte], Either[Result, Int]]) => Promise[B],
13     error: (String, Input[Array[Byte]]) => Promise[B]
14   ): Promise[B] = state match {
15     case 'Done => { println("Done"); done(Right(received), Input.Empty) }
16     case 'Cont => cont(in => in match {
17       case in: El[Array[Byte]] => copy(input = in, received = received + in.e.length)
18       case Empty => copy(input = in)
19       case EOF => copy(state = 'Done, input = in)
20       case _ => copy(state = 'Error, input = in)
21     })
22     case _ => { println("Error"); error("Some error.", input) }
23   }
24 }
(Remark: All these things are new to me, so please forgive if something about this is total crap.)
The Iteratee is pretty dumb, it just reads all chunks, sums up the number of received bytes and prints out some messages. Everything works as expected when I call the controller action with some data: I can observe that all chunks are received by the Iteratee, and when all the data has been read it switches to the done state and the request ends.
Now I started to play around with the code because I wanted to see the behaviour for these two cases:
Switching into state error before all input is read.
Switching into state done before all input is read and returning a Result instead of the Int.
My understanding of the documentation mentioned above is that both should be possible but actually I am not able to understand the observed behaviour. To test the first case I changed line 17 of the above code to:
17 case in: El[Array[Byte]] => copy(state = if(received + in.e.length > 10000) 'Error else 'Cont, input = in, received = received + in.e.length)
So I just added a condition to switch into the error state if more than 10000 bytes were received. The output I get is this:
'Cont Empty 0
'Cont El([B#38ecece6) 8192
'Error El([B#4ab50d3c) 16384
Error
Error
Error
Error
Error
Error
Error
Error
Error
Error
Error
Then the request hangs forever and never ends. My expectation from the docs mentioned above was that when I call the error function inside an Iteratee's fold, the processing should stop. What happens here is that the Iteratee's fold method is called several times after error has been called, and then the request hangs.
When I switch into the done state before reading all input, the behaviour is quite similar. Changing line 15 to:
15 case 'Done => { println("Done with " + input); done(if (input == EOF) Right(received) else Left(BadRequest), Input.Empty) }
and line 17 to:
17 case in: El[Array[Byte]] => copy(state = if(received + in.e.length > 10000) 'Done else 'Cont, input = in, received = received + in.e.length)
produces the following output:
'Cont Empty 0
'Cont El([B#16ce00a8) 8192
'Done El([B#2e8d214a) 16384
Done with El([B#2e8d214a)
Done with El([B#2e8d214a)
Done with El([B#2e8d214a)
Done with El([B#2e8d214a)
and again the request hangs forever.
My main question is why the request is hanging in the above mentioned cases. If anybody could shed light on this I would greatly appreciate it!
Your understanding is perfectly right, and I have just pushed a fix to master:
https://github.com/playframework/Play20/commit/ef70e641d9114ff8225332bf18b4dd995bd39bcc
Fixed both cases plus exceptions in the Iteratees.
Nice use of copy on a case class for implementing an Iteratee, by the way.
Things must have changed with Play 2.1 - Promise is no longer parametric, and this example no longer compiles.