Whenever I run my spider scrapy crawl test -O test.json in my Visual Studio Code terminal I get output like this:
2023-01-31 14:31:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.example.com/product/1
{'price': 100,
'newprice': 90
}
2023-01-31 14:31:50 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-31 14:31:50 [scrapy.extensions.feedexport] INFO: Stored json feed (251 items) in: test.json
2023-01-31 14:31:50 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://localhost:61169/session/996866d968ab791730e4f6d87ce2a1ea {}
2023-01-31 14:31:50 [urllib3.connectionpool] DEBUG: http://localhost:61169 "DELETE /session/996866d968ab791730e4f6d87ce2a1ea HTTP/1.1" 200 14
2023-01-31 14:31:50 [selenium.webdriver.remote.remote_connection] DEBUG: Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2023-01-31 14:31:50 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2023-01-31 14:31:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 91321,
'downloader/request_count': 267,
'downloader/request_method_count/GET': 267,
'downloader/response_bytes': 2730055,
'downloader/response_count': 267,
'downloader/response_status_count/200': 267,
'dupefilter/filtered': 121,
'elapsed_time_seconds': 11.580893,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 31, 13, 31, 50, 495392),
'httpcompression/response_bytes': 9718676,
'httpcompression/response_count': 267,
'item_scraped_count': 251,
'log_count/DEBUG': 537,
'log_count/INFO': 11,
'request_depth_max': 2,
'response_received_count': 267,
'scheduler/dequeued': 267,
'scheduler/dequeued/memory': 267,
'scheduler/enqueued': 267,
'scheduler/enqueued/memory': 267,
'start_time': datetime.datetime(2023, 1, 31, 13, 31, 38, 914499)}
2023-01-31 14:31:52 [scrapy.core.engine] INFO: Spider closed (finished)
I want to log all this, including the print('hi') lines in my spiders, but I DON'T want the scraped item output logged, in this case {'price': 100, 'newprice': 90}.
Inspecting the above, I think I need to disable only downloader/response_bytes.
I've been reading https://docs.scrapy.org/en/latest/topics/logging.html, but I'm not sure where or how to configure my exact use case. I have hundreds of spiders and I don't want to have to add a configuration to each one, but rather apply the logging config to all spiders. Do I need to add a separate config file, or add to an existing one like scrapy.cfg?
UPDATE 1
So here's my folder structure where I created settings.py:
Scrapy\
    tt_spiders\
        myspiders\
            spider1.py
            spider2.py
            settings.py
        middlewares.py
        pipelines.py
        settings.py
    scrapy.cfg
    settings.py

settings.py:
if __name__ == "__main__":
disable_list = ['scrapy.core.engine', 'scrapy.core.scraper', 'scrapy.spiders']
for element in disable_list:
logger = logging.getLogger(element)
logger.disabled = True
spider = 'example_spider'
settings = get_project_settings()
settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
process = CrawlerProcess(settings)
process.crawl(spider)
process.start()
This throws 3 errors, which makes sense as I have not defined these:
"logging" is not defined
"get_project_settings" is not defined
"CrawlerProcess" is not defined
But more importantly, what I don't understand is that this code contains spider = 'example_spider', whereas I want this logic to apply to ALL spiders.
So I reduced it to:
if __name__ == "__main__":
disable_list = ['scrapy.core.scraper']
But still the output is logged. What am I missing?
Let's assume that we have this spider:
spider.py:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item
And its output is:
...
[scrapy.middleware] INFO: Enabled item pipelines:
[]
[scrapy.core.engine] INFO: Spider opened
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapingclub.com/exercise/detail_basic/> (referer: None)
[scrapy.core.scraper] DEBUG: Scraped from <200 https://scrapingclub.com/exercise/detail_basic/>
{'title': 'Long-sleeved Jersey Top', 'price': '$12.99'}
[scrapy.core.engine] INFO: Closing spider (finished)
[scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 329,
'downloader/request_count': 1,
...
If you want to disable logging for a specific line, just copy the text inside the square brackets (the logger name) and disable that logger.
e.g.: [scrapy.core.scraper] DEBUG: Scraped from <200 https://scrapingclub.com/exercise/detail_basic/>.
main.py:
if __name__ == "__main__":
disable_list = ['scrapy.core.engine', 'scrapy.core.scraper', 'scrapy.spiders']
for element in disable_list:
logger = logging.getLogger(element)
logger.disabled = True
spider = 'example_spider'
settings = get_project_settings()
settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
process = CrawlerProcess(settings)
process.crawl(spider)
process.start()
If you want to disable some of the extensions you can set them to None in settings.py:
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'scrapy.extensions.logstats.LogStats': None,
    'scrapy.extensions.corestats.CoreStats': None,
}
Update 1:
Add just this to settings.py:
import logging

disable_list = ['scrapy.core.engine', 'scrapy.core.scraper', 'scrapy.spiders']
for element in disable_list:
    logger = logging.getLogger(element)
    logger.disabled = True
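Note that disabling the scrapy.core.scraper logger suppresses the whole "Scraped from <200 ...>" DEBUG entry, item dict included. If you would rather keep the crawl DEBUG lines and silence only the scraped-item message project-wide, another option is a custom log formatter referenced from settings.py. This is only a minimal sketch, assuming a recent Scrapy version where a LogFormatter method may return None to skip that log entry; the module path tt_spiders.logformatter is a placeholder for wherever you put the class:
logformatter.py:
from scrapy import logformatter


class QuietItemLogFormatter(logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        # Skip the "Scraped from ..." entry entirely (assumes the installed
        # Scrapy version treats a None return value as "do not log").
        return None
settings.py:
LOG_FORMATTER = 'tt_spiders.logformatter.QuietItemLogFormatter'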
Related
I am writing a REST plugin for Foswiki using Perl and I am facing a reliability issue when using File::Find. I have tried my best to write a minimal reproducible example. The plugin uses File::Find to traverse directories and print the filenames in the HTTP response. The REST request works properly 4 times, but stops working the 5th time. The HTTP status remains "HTTP/1.1 200 OK", but no files are reported by File::Find anymore.
The web server is nginx and is configured to use FastCGI. It appears to run 4 worker processes managed by foswiki-fcgi-pm:
> ps aux
www-data 16957 0.0 7.7 83412 78332 ? Ss 16:52 0:00 foswiki-fcgi-pm
www-data 16960 0.0 7.5 83960 76740 ? S 16:52 0:00 foswiki-fcgi
www-data 16961 0.0 7.6 84004 76828 ? S 16:52 0:00 foswiki-fcgi
www-data 16962 0.0 7.6 83956 76844 ? S 16:52 0:00 foswiki-fcgi
www-data 16963 0.0 7.5 83960 76740 ? S 16:52 0:00 foswiki-fcgi
Firstly, the plugin initialization simply registers the REST handler:
sub initPlugin {
    my ( $topic, $web, $user, $installWeb ) = @_;

    # check for Plugins.pm versions
    if ( $Foswiki::Plugins::VERSION < 2.3 ) {
        Foswiki::Func::writeWarning( 'Version mismatch between ',
            __PACKAGE__, ' and Plugins.pm' );
        return 0;
    }

    Foswiki::Func::registerRESTHandler(
        'restbug', \&RestBug,
        authenticate => 0,          # Set to 0 if handler should be usable by WikiGuest
        validate     => 0,          # Set to 0 to disable StrikeOne CSRF protection
        http_allow   => 'GET,POST', # Set to 'GET,POST' to allow use of HTTP GET and POST
        description  => 'Debug'
    );

    # Plugin correctly initialized
    return 1;
}
Secondly, the REST handler is implemented as follows, printing all the files it can find:
sub RestBug {
    my ($session, $subject, $verb, $response) = @_;
    my @Directories = ("/var/www/foswiki/tools");

    sub findfilestest {
        $response->print("FILE $_\n");
    }

    find({ wanted => \&findfilestest }, @Directories );
}
When I test the REST service with an HTTP request, the first 4 times I get the following HTTP response, which is what I expect:
HTTP/1.1 200 OK
Server: nginx/1.14.2
Date: Tue, 22 Nov 2022 09:23:10 GMT
Content-Length: 541
Connection: keep-alive
Set-Cookie: SFOSWIKISID=385db599c5d66bb19591e1eef7f1a854; path=/; secure; HttpOnly
FILE .
FILE foswiki.freebsd.init-script
FILE bulk_copy.pl
FILE dependencies
FILE mod_perl_startup.pl
FILE geturl.pl
FILE extender.pl
FILE extension_installer
FILE configure
FILE lighttpd.pl
FILE foswiki.freebsd.etc-defaults
FILE save-pending-checkins
FILE babelify
FILE upgrade_emails.pl
FILE tick_foswiki.pl
FILE foswiki.defaults
FILE rewriteshebang.pl
FILE fix_file_permissions.sh
FILE foswiki.init-script
FILE convertTopicSettings.pl
FILE mailnotify
FILE html2tml.pl
FILE tml2html.pl
FILE systemd
FILE foswiki.service
The following attempts give this unexpected response:
HTTP/1.1 200 OK
Server: nginx/1.14.2
Date: Tue, 22 Nov 2022 09:24:56 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: SFOSWIKISID=724b2c4b1ddfbebd25d0dc2a0f182142; path=/; secure; HttpOnly
Note that if I restart Foswiki with the command systemctl restart foswiki, the REST service works again 4 more times.
How can I make this REST service work more than 4 times in a row?
I'm getting started with Gatling. I'm getting a 411 status and don't understand why.
Response DefaultHttpResponse(decodeResult: success, version: HTTP/1.1)
HTTP/1.1 411 Length Required
Connection: close
Date: Tue, 13 Feb 2018 16:07:51 GMT
Server: Kestrel
Content-Length: 0
19:07:53.083 [gatling-http-thread-1-2] DEBUG org.asynchttpclient.netty.channel.ChannelManager - Closing Channel [id: 0x5f14313e, L:/10.8.1.89:52767 - R:blabla.com:5000]
19:07:53.107 [gatling-http-thread-1-2] INFO io.gatling.commons.validation.package$ - Boon failed to parse into a valid AST: -1
java.lang.ArrayIndexOutOfBoundsException: -1
...
19:07:53.111 [gatling-http-thread-1-2] WARN io.gatling.http.ahc.ResponseProcessor - Request 'HTTP Request createCompany' failed: status.find.is(200), but actually found 411
19:07:53.116 [gatling-http-thread-1-2] DEBUG io.gatling.http.ahc.ResponseProcessor -
My code:
package load

import io.gatling.core.scenario.Simulation
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class LoadScript extends Simulation {

  val httpConf = http
    .baseURL("http://blabla.com:5000")
    .authorizationHeader("Bearer 35dfd7a3c46f3f0bc7a2f06929399756029f47b9cc6d193ed638aeca1306d")
    .acceptHeader("application/json, text/plain,")
    .acceptEncodingHeader("gzip, deflate, br")
    .acceptLanguageHeader("ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7")
    .userAgentHeader("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36")

  val basicLoad = scenario("BASIC_LOAD").exec(BasicLoad.start)

  setUp(
    basicLoad.inject(rampUsers(1) over (1 minutes))
      .protocols(httpConf))
}

object BasicLoad {
  val start =
    exec(
      http("HTTP Request createCompany")
        .post("/Companies/CreateCompanyAndStartTransaction")
        .queryParam("inn", "7733897761")
        .queryParam("ogrn", "5147746205041")
        .check(status is 200, jsonPath("$.id").saveAs("idCompany"))
    )
}
When you are not sending a message body, you need to add .header("Content-Length", "0") as a workaround.
I had a similar issue. I'm running my tests on two environments and the difference is in the application infrastructure.
The tests pass on Amazon AWS but get HTTP 411 on Azure, so it looks like the issue is not in Gatling itself.
This issue has also been well answered by the Gatling team at the end of this thread:
https://groups.google.com/forum/#!topic/gatling/mAGzjzoMr1I
I've just upgraded Gatling from 2.3 to 3.0.2. They wrote their own HTTP client and it now sends content-length: 0, except in one case described in this bug:
https://github.com/gatling/gatling/issues/3648
So if you avoid using httpRequest() with the method type passed as a string, e.g.:
exec(http("empty POST test").httpRequest("POST","https://gatling.io/"))
and use post() as you do:
exec(
  http("HTTP Request createCompany")
    .post("/Companies/CreateCompanyAndStartTransaction")...
or
exec(
  http("HTTP Request createCompany")
    .httpRequest(HttpMethod.POST, "/Companies/CreateCompanyAndStartTransaction")
then upgrading to Gatling 3.0.2 is enough. Otherwise you need to wait for Gatling 3.0.3.
When opening a Jupyter notebook, I get the following errors. Any idea how to fix this? I'm running on Ubuntu 16.04 with Anaconda 2. I tried uninstalling and reinstalling IPython, through both pip and conda. That didn't help.
[E 13:52:02.191 NotebookApp] Unhandled error in API request
Traceback (most recent call last):
File "/home/user/anaconda2/lib/python2.7/site-packages/notebook/base/handlers.py", line 457, in wrapper
result = yield gen.maybe_future(method(self, *args, **kwargs))
File "/home/user/anaconda2/lib/python2.7/site-packages/notebook/services/kernelspecs/handlers.py", line 56, in get
for kernel_name in ksm.find_kernel_specs():
File "/home/user/anaconda2/lib/python2.7/site-packages/nb_conda_kernels/manager.py", line 192, in find_kernel_specs
kspecs = super(CondaKernelSpecManager, self).find_kernel_specs()
File "/home/user/anaconda2/lib/python2.7/site-packages/jupyter_client/kernelspec.py", line 128, in find_kernel_specs
for kernel_dir in self.kernel_dirs:
File "/home/user/anaconda2/lib/python2.7/site-packages/traitlets/traitlets.py", line 554, in __get__
return self.get(obj, cls)
File "/home/user/anaconda2/lib/python2.7/site-packages/traitlets/traitlets.py", line 533, in get
value = self._validate(obj, dynamic_default())
File "/home/user/anaconda2/lib/python2.7/site-packages/jupyter_client/kernelspec.py", line 114, in _kernel_dirs_default
from IPython.paths import get_ipython_dir
File "/home/user/anaconda2/lib/python2.7/site-packages/IPython/__init__.py", line 48, in <module>
from .core.application import Application
File "/home/user/anaconda2/lib/python2.7/site-packages/IPython/core/application.py", line 25, in <module>
from IPython.core import release, crashhandler
File "/home/user/anaconda2/lib/python2.7/site-packages/IPython/core/crashhandler.py", line 28, in <module>
from IPython.core import ultratb
File "/home/user/anaconda2/lib/python2.7/site-packages/IPython/core/ultratb.py", line 131, in <module>
import IPython.utils.colorable as colorable
AttributeError: 'module' object has no attribute 'utils'
[E 13:52:02.192 NotebookApp] {
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate",
"Host": "localhost:8888",
"Accept": "*/*",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0",
"Connection": "keep-alive",
"X-Requested-With": "XMLHttpRequest",
"Referer": "http://localhost:8888/tree"
}
[E 13:52:02.192 NotebookApp] 500 GET /api/kernelspecs (127.0.0.1) 3.10ms referer=http://localhost:8888/tree
Uninstalling Anaconda2 and installing Anaconda3 instead fixed this issue.
I am just crawling a website, but it redirects to another page. In my spider I added
handle_httpstatus_list = [302,301]
and overrode the start_requests method, but the problem is:
AttributeError: 'Response' object has no attribute 'xpath'
spider code:
# -*- coding=utf-8 -*-
from __future__ import absolute_import
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule, Spider
from car.items import Car58Item
import scrapy
import time


class Car51Spider(CrawlSpider):
    name = 'car51'
    allowed_domains = ['51auto.com']
    start_urls = ['http://www.51auto.com/quanguo/pabmdcigf?searchtype=searcarlist&curentPage=1&isNewsCar=0&isSaleCar=0&isQa=0&orderValue=record_time']
    rules = [Rule(LinkExtractor(allow=('/pabmdcigf?searchtype=searcarlist&curentPage=\d+\&isNewsCar\=0\&isSaleCar\=0\&isQa\=0\&orderValue\=record_time')), callback='parse_item', follow=True)]  # page crawling rules
    handle_httpstatus_list = [302, 301]
    items = {}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_item)

    def parse_item(self, response):
        trs = response.xpath("//div[@class='view-grid-overflow']/a").extract()
        for tr in trs:
            sales_1 = u''
            item = Car58Item()
            urls = tr.xpath("a/@href").extract_first()
            item['url'] = tr.xpath("a/@href").extract_first()
            item['tip'] = tr.xpath("a/ul/li[@class='title']/text()").extract_first()
            item['name'] = tr.xpath("a/ul/li[@class='title']/text()").extract_first()
            sales_times = tr.xpath("a/ul/li[@class='info']/span/text()").extract()
            for x in sales_times:
                sales_1 = sales_1 + x
            item['sales_time'] = sales_1
            item['region'] = tr.xpath("a/ul/li[@class='info']/span[@class='font-color-red']/text()").extract_first()
            item['amt'] = tr.xpath("a/ul/li[@class='price']/div[1]/text()").extract_first()
            yield scrapy.Request(url=urls, callback=self.parse_netsted_item, meta={'item': item})

    def parse_netsted_item(self, response):
        dh = u''
        dha = u''
        mode = response.xpath("//body")
        item = Car58Item(response.meta['item'])
        dhs = mode.xpath("//div[@id='contact-tel1']/p/text()").extract()
        for x in dhs:
            dh = dh + x
        item['lianxiren_dh'] = dh
        lianxiren = mode.xpath("//div[@class='section-contact']/text()").extract()
        item['lianxiren'] = lianxiren[1]
        item['lianxiren_dz'] = lianxiren[2]
        item['details'] = mode.xpath("//div[@id='car-dangan']").extract()
        desc = mode.xpath("//div[@class='car-detail-container']/p/text()").extract()
        for d in desc:
            dha = dha + d
        item['description'] = dha
        item['image_urls'] = mode.xpath("//div[@class='car-pic']/img/@src").extract()
        item['collection_dt'] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
        return item
settings.py:
# -*- coding: utf-8 -*-
# Scrapy settings for car project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'car'
SPIDER_MODULES = ['car.spiders.car51']
#NEWSPIDER_MODULE = 'car.spiders.zhaoming'
DEFAULT_ITEM_CLASS = 'car.items.Car58Item'
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 1,
    'car.pipelines.MongoDBPipeline': 300,
    'car.pipelines.Car58ImagesPipeline': 301,
}
MONGODB_SERVER ="localhost"
MONGODB_PORT=27017
MONGODB_DB="car"
MONGODB_COLLECTION_CAR="car"
MONGODB_COLLECTION_ZHAOMING="zhaoming"
IMAGES_STORE = "img/"
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
IMAGES_EXPIRES = 90
DOWNLOAD_TIMEOUT=10
LOG_ENABLED=True
LOG_ENCODING='utf-8'
LOG_LEVEL="DEBUG"
LOGSTATS_INTERVAL=5
# LOG_FILE='/tmp/scrapy.log'
CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16
scrapy log:
$scrapy crawl car51
2016-06-14 14:18:38 [scrapy] INFO: Scrapy 1.1.0 started (bot: car)
2016-06-14 14:18:38 [scrapy] INFO: Overridden settings: {'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'SPIDER_MODULES': ['car.spiders.car51'], 'BOT_NAME': 'car', 'DOWNLOAD_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 5, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; rv:35.0) Gecko/20100101 Firefox/35.0', 'DEFAULT_ITEM_CLASS': 'car.items.Car58Item', 'DOWNLOAD_DELAY': 0.25}
2016-06-14 14:18:38 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-14 14:18:38 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-14 14:18:38 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-14 14:18:38 [py.warnings] WARNING: /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/deprecate.py:156: ScrapyDeprecationWarning: `scrapy.contrib.pipeline.images.ImagesPipeline` class is deprecated, use `scrapy.pipelines.images.ImagesPipeline` instead
ScrapyDeprecationWarning)
2016-06-14 14:18:38 [py.warnings] WARNING: /Users/mayuping/PycharmProjects/car/car/pipelines.py:13: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2016-06-14 14:18:38 [scrapy] INFO: Enabled item pipelines:
['scrapy.pipelines.images.ImagesPipeline',
'car.pipelines.MongoDBPipeline',
'car.pipelines.Car58ImagesPipeline']
2016-06-14 14:18:38 [scrapy] INFO: Spider opened
2016-06-14 14:18:38 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-14 14:18:38 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-14 14:18:38 [scrapy] DEBUG: Crawled (302) <GET http://www.51auto.com/quanguo/pabmdcigf?searchtype=searcarlist&curentPage=1&isNewsCar=0&isSaleCar=0&isQa=0&orderValue=record_time> (referer: None)
2016-06-14 14:18:39 [scrapy] ERROR: Spider error processing <GET http://www.51auto.com/quanguo/pabmdcigf?searchtype=searcarlist&curentPage=1&isNewsCar=0&isSaleCar=0&isQa=0&orderValue=record_time> (referer: None)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/mayuping/PycharmProjects/car/car/spiders/car51.py", line 22, in parse_item
trs = response.xpath("//div[@class='view-grid-overflow']/a").extract()
AttributeError: 'Response' object has no attribute 'xpath'
2016-06-14 14:18:39 [scrapy] INFO: Closing spider (finished)
2016-06-14 14:18:39 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 351,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 420,
'downloader/response_count': 1,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 14, 6, 18, 39, 56461),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 2,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2016, 6, 14, 6, 18, 38, 437336)}
2016-06-14 14:18:39 [scrapy] INFO: Spider closed (finished)
When you add handle_httpstatus_list = [302,301] you're telling Scrapy to call your callback even for HTTP redirection, instead of letting the framework handle the redirection transparently for you (which is the default).
Some HTTP redirection responses do NOT have a body or content headers, so in those cases Scrapy hands your callback the response as-is, i.e. a plain Response object rather than an HtmlResponse with the .xpath() and .css() shortcuts.
Either you really need to handle HTTP 301 and 302 responses yourself, in which case write your callback so it checks the status code (response.status) and extracts data only in the non-3xx cases (see the sketch below),
Or you let Scrapy handle HTTP redirections for you, in which case remove handle_httpstatus_list from your spider.
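For the first option, here is a minimal sketch of such a callback (it reuses the XPath from the spider above; the manual redirect-following branch is an illustrative assumption, not tested code):
def parse_item(self, response):
    if 300 <= response.status < 400:
        # Redirect kept because of handle_httpstatus_list: the body is empty,
        # so there is nothing to extract. Optionally follow the redirect manually.
        location = response.headers.get('Location')
        if location:
            yield scrapy.Request(response.urljoin(location.decode()),
                                 callback=self.parse_item, dont_filter=True)
        return
    # A normal 2xx response is an HtmlResponse, so .xpath() works again.
    for tr in response.xpath("//div[@class='view-grid-overflow']/a"):
        ...  # existing extraction code goes here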
I have a Perl script that uses WWW::Mechanize to open an SSL connection to our single sign-on page to test connectivity for Nagios. Yesterday, that script stopped working, and I have no idea why. Here is a snippet of the debug output from that script:
main::(./check_profiles.pl:13): $auth_url = "https://account.example.com/SSO/index.html";
main::(./check_profiles.pl:14): $profile_url = "https://example.com/prof";
main::(./check_profiles.pl:16): $user = "nagios\@heyyou.com";
main::(./check_profiles.pl:17): $pass = "testing123";
main::(./check_profiles.pl:18): $expected_string = "Bunnies";
main::(./check_profiles.pl:21): $digest_username = "user";
main::(./check_profiles.pl:22): $digest_password = "pass";
main::(./check_profiles.pl:25): $mech = WWW::Mechanize->new( agent=>"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4" );
main::(./check_profiles.pl:28): $mech->credentials("example.com/prof:443","Nokia", $digest_username=>$digest_password);
main::(./check_profiles.pl:31): $mech->add_handler("request_send", sub { shift->dump; return });
main::(./check_profiles.pl:32): $mech->add_handler("response_done", sub { shift->dump; return });
main::(./check_profiles.pl:35): $mech->get( $auth_url );
GET https://account.example.com/SSO/index.html
Accept-Encoding: gzip
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
(no content)
500 Can't connect to account.example.com:443 (connect: Network is unreachable)
Content-Type: text/plain
Client-Date: Wed, 15 May 2013 16:16:25 GMT
Client-Warning: Internal response
500 Can't connect to account.example.com:443 (connect: Network is unreachable)\n
Error GETing https://account.example.com/SSO/index.html: Can't connect to account.example.com:443 (connect: Network is unreachable) at ./check_profiles.pl line 35
at /usr/share/perl5/vendor_perl/WWW/Mechanize.pm line 2747
WWW::Mechanize::_die('Error ', 'GET', 'ing ', 'URI::https=SCALAR(0x3291d40)', ': ', 'Can\'t connect to account.example.com:443 (connect: Network is ...') called at /usr/share/perl5/vendor_perl/WWW/Mechanize.pm line 2734
WWW::Mechanize::die('WWW::Mechanize=HASH(0x31ce490)', 'Error ', 'GET', 'ing ', 'URI::https=SCALAR(0x3291d40)', ': ', 'Can\'t connect to account.example.com:443 (connect: Network is ...') called at /usr/share/perl5/vendor_perl/WWW/Mechanize.pm line 2383
WWW::Mechanize::_update_page('WWW::Mechanize=HASH(0x31ce490)', 'HTTP::Request=HASH(0x330f490)', 'HTTP::Response=HASH(0x3489c28)') called at /usr/share/perl5/vendor_perl/WWW/Mechanize.pm line 2213
WWW::Mechanize::request('WWW::Mechanize=HASH(0x31ce490)', 'HTTP::Request=HASH(0x330f490)') called at /usr/share/perl5/LWP/UserAgent.pm line 387
LWP::UserAgent::get('WWW::Mechanize=HASH(0x31ce490)', 'https://account.example.com/SSO/index.html') called at /usr/share/perl5/vendor_perl/WWW/Mechanize.pm line 407
WWW::Mechanize::get('WWW::Mechanize=HASH(0x31ce490)', 'https://account.example.com/SSO/index.html') called at ./check_profiles.pl line 35
As you can see, WWW::Mechanize believes there is no access to account.example.com on port 443; however, this is not true. If it were, my entire app would break, and it is working fine. Here is further proof:
> telnet account.example.com 443
Trying 2.2.2.2...
Connected to account.example.com.
Escape character is '^]'.
I have no idea why this is happening. Can anyone determine the problem based on the debug info, or offer any further help? Thanks!
WWW::Mechanize was at 1.66-1.
Upgraded to 1.72-1, problem went away.
Most likely they kick you because of the 'haxor' User-Agent string. Try something like this before your get:
my $initial_user_agent = 'Mozilla/5.0 (Linux; U; Android 2.2; de-de; HTC Desire HD 1.18.161.2 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1';
my $mech = WWW::Mechanize->new( agent => $initial_user_agent );