How can my web crawler (Python, Scrapy, scrapy-splash) crawl faster? - mongodb

Development environment:
CentOS 7
pip 18.1
Docker version 18.09.3, build 774a1f4
anaconda Command line client (version 1.7.2)
Python 3.7
Scrapy 1.6.0
scrapy-splash
MongoDB (db version v4.0.6)
PyCharm
Server specs:
CPU ->
processor: 22,
vendor_id: GenuineIntel,
cpu family: 6,
model: 45,
model name: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
RAM -> Mem: 31960
64 bit
Hello.
I'm a PHP developer, and this is my first Python project. I'm trying Python because I heard it has many benefits for web crawling.
I'm crawling one dynamic website, and I need to crawl around 3,500 pages every 5-15 seconds. For now, mine is too slow: it crawls only about 200 pages per minute (for scale, 3,500 pages every 15 seconds would be about 14,000 pages per minute, roughly 70 times my current rate).
My source code looks like this:
main.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.bot1 import Bot1Spider
from spiders.bot2 import Bot2Spider
from spiders.bot3 import Bot3Spider
from spiders.bot4 import Bot4Spider
from pprint import pprint
process = CrawlerProcess(get_project_settings())
process.crawl(Bot1Spider)
process.crawl(Bot2Spider)
process.crawl(Bot3Spider)
process.crawl(Bot4Spider)
process.start()
bot1.py
import scrapy
import datetime
import math
from scrapy_splash import SplashRequest
from pymongo import MongoClient
from pprint import pprint
class Bot1Spider(scrapy.Spider):
    name = 'bot1'
    client = MongoClient('localhost', 27017)
    db = client.db

    def start_requests(self):
        count = int(self.db.games.find().count())
        num = math.floor(count * 0.25)
        start_urls = self.db.games.find().limit(num - 1)
        for url in start_urls:
            # `domain` is assumed to be defined elsewhere in the project
            full_url = domain + list(url.values())[5]
            yield SplashRequest(full_url, self.parse, args={'wait': 0.1}, meta={'oid': list(url.values())[0]})

    def parse(self, response):
        pass
settings.py
BOT_NAME = 'crawler'
SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'
# Scrapy Configuration
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'my-project-name (www.my.domain)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
When executing this code, I'm using this command: python main.py
After looking at my code, please help me. I'll happily listen to any suggestions.
1. How can my spiders run faster? I've tried using threading, but it doesn't seem to work right.
2. What is the best-performing setup for web crawling?
3. Is it possible to crawl 3,500 dynamic pages every 5-15 seconds?
Thank you.

Related

No more replicas available for broadcast_0_python

I am attempting to run the following code, in a Dataproc cluster (you can find the software versions I am using here):
# IMPORTANT: THIS CODE WAS RUN IN A SINGLE JUPYTER NOTEBOOK CELL
print("IMPORTING LIBRARIES...")
import pandas as pd
import numpy as np
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, pandas_udf
# https://spark.apache.org/docs/3.1.3/api/python/_modules/pyspark/sql/types.html
from pyspark.sql.types import ArrayType, StringType
print("STARTING SPARK SESSION...")
spark = SparkSession.builder.appName('SpacyOverPySpark') \
    .getOrCreate()
print("FUNCTION DEFINTION...")
def load_spacy_model():
    import spacy
    print("\tLoading spacy model...")
    return spacy.load("./spacy_model")  # This model exists locally

@pandas_udf(ArrayType(StringType()))
def get_entities(list_of_text: pd.Series) -> pd.Series:
    # retrieving the shared nlp object
    nlp = broadcasted_nlp.value
    # batch processing our list of text
    docs = nlp.pipe(list_of_text)
    # entity extraction (`ents` is a list[list[str]])
    ents = [
        [ent.text for ent in doc.ents]
        for doc in docs
    ]
    return pd.Series(ents)
# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
print("DATA READING (OR MANUAL DATA GENERATION)...")
# # Manually-generated data (DISABLED BY DEFAULT, USE FOR "TESTING")
# # IMPORTANT: Code works well for this case !!!
# pdf = pd.DataFrame(
#     [
#         "Python and Pandas are very important for Automation",
#         "Tony Stark is an Electrical Engineer",
#         "Pipe welding is a very dangerous task in Oil mining",
#         "Nursing is often underwhelmed, but it's very interesting",
#         "Software Engineering now opens a lot of doors for you",
#         "Civil Engineering can get exiting, as you travel very often",
#         "I am a Java Programmer, and I think I'm quite good at what I do",
#         "Diane is never bored of doing the same thing all day",
#         "My father is a Doctor, and he supports people in condition of poverty",
#         "A janitor is required as soon as possible"
#     ],
#     columns=['posting']
# )
# sdf=spark.createDataFrame(pdf)
# Reading data from CSV stored in GCS (ENABLED BY DEFAULT, USE FOR "PRODUCTION")
sdf = spark.read.csv("gs://onementor-ml-data/1M_indeed_eng_clean.csv", header=True) # ~1M rows, 1 column 'posting', ~1GB in size
print("\tDataFrame shape: ", (sdf.count(), len(sdf.columns)))
print("NAMED ENTITY RECOGNITION USING SPACY OVER PYSPARK...")
t1 = time.time()
# df_dummy2.withColumn("entities", get_entities(col("posting"))).show(5, truncate=10)
sdf_new = sdf.withColumn('skills',get_entities('posting'))
sdf_new.show(5, truncate=10)
print("\tData mined in {:.2f} seconds (Dataframe shape: ({}, {}))".format(
time.time()-t1,
sdf_new.count(),
len(sdf_new.columns))
)
BTW, some basic specs of my cluster (this info can be updated, please request it in the comment section):
Master node
Standard (1 master, N workers)
Machine type: n1-highmem-4 (originally n1-standard-4, still with errors)
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500GB
Local SSDs: 0
Worker nodes
(Qty.:) 10 (originally 2, still with errors)
Machine type: n1-standard-4
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500GB
Local SSDs: 0
Secondary worker nodes: 0
When running the previous script with the "manually-generated data", the entity extraction works OK (if you need details about how I created my cluster, hit that link too); however, when importing the .csv data from Cloud Storage, the following error appears (both the VM and cluster names have been changed for safety):
ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 11 on my-vm-w-9.us-central1-a.c.my-project.internal: Container marked as failed: container_1661960727108_0002_01_000013 on host: my-vm-w-9.us-central1-a.c.my-project.internal. Exit status: -100. Diagnostics: Container released on a *lost* node.
I have also read in the logs the following warning:
WARN org.apache.spark.storage.BlockManagerMasterEndpoint: No more replicas available for broadcast_0_python !
I have done some quick research, but I was astonished at the considerable number of very different possible causes of that error (none of them, however, seemingly valid for PySpark on Dataproc), so I am not quite sure whether there is a better troubleshooting approach for this case than blindly trying case after case I find on the web.
What could be happening here?
Thank you

Is there a way to start mitmproxy v.7.0.2 programmatically in the background?

Is there a way to start mitmproxy v.7.0.2 programmatically in the background?
ProxyConfig and ProxyServer have been removed since version 7.0.0, and the code below isn't working.
from mitmproxy.options import Options
from mitmproxy.proxy.config import ProxyConfig
from mitmproxy.proxy.server import ProxyServer
from mitmproxy.tools.dump import DumpMaster
import threading
import asyncio
import time
class Addon(object):
    def __init__(self):
        self.num = 1

    def request(self, flow):
        flow.request.headers["count"] = str(self.num)

    def response(self, flow):
        self.num = self.num + 1
        flow.response.headers["count"] = str(self.num)
        print(self.num)

# see source mitmproxy/master.py for details
def loop_in_thread(loop, m):
    asyncio.set_event_loop(loop)  # This is the key.
    m.run_loop(loop.run_forever)

if __name__ == "__main__":
    options = Options(listen_host='0.0.0.0', listen_port=8080, http2=True)
    m = DumpMaster(options, with_termlog=False, with_dumper=False)
    config = ProxyConfig(options)
    m.server = ProxyServer(config)
    m.addons.add(Addon())

    # run mitmproxy in the background, especially when integrated with another server
    loop = asyncio.get_event_loop()
    t = threading.Thread(target=loop_in_thread, args=(loop, m))
    t.start()

    # Other servers, such as a web server, might be started then.
    time.sleep(20)
    print('going to shutdown mitmproxy')
    m.shutdown()
from BigSully's gist
You can put your Addon class into your_script.py and then run mitmdump -s your_script.py. mitmdump comes without the console interface and can run in the background.
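For illustration, a minimal sketch of what your_script.py could contain (the Addon class is adapted from the question; the module-level addons list is the standard hook that mitmdump picks up):

# your_script.py -- run with: mitmdump -s your_script.py
class Addon:
    def __init__(self):
        self.num = 1

    def request(self, flow):
        flow.request.headers["count"] = str(self.num)

    def response(self, flow):
        self.num += 1
        flow.response.headers["count"] = str(self.num)

# mitmdump loads every addon listed here
addons = [Addon()]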
We (mitmproxy devs) officially don't support manual instantiation from Python anymore because that creates a massive amount of support burden for us. If you have some Python experience you can probably find your way around.
What if my addon has additional dependencies?
Approach 1: pip install mitmproxy is still perfectly supported and gets you the same functionality as the standalone binaries. Bonus tip: You can run venv/bin/mitmproxy or venv/Scripts/mitmproxy.exe to invoke mitmproxy in your virtualenv without having your virtualenv activated.
Approach 2: You can install mitmproxy with pipx and then run pipx inject mitmproxy <your dependency name>. See https://docs.mitmproxy.org/stable/overview-installation/#installation-from-the-python-package-index-pypi for details.
How can I debug mitmproxy itself?
If you are debugging from the command line (be it print statements or pdb), the easiest approach is to run mitmdump instead of mitmproxy, which provides the same functionality minus the console interface. Alternatively, you can use PyCharm's remote debug functionality, which also works while the console interface is active (https://github.com/mitmproxy/mitmproxy/blob/main/examples/contrib/remote-debug.py).
The example below should work fine with mitmproxy v7:
from mitmproxy.tools import main
from mitmproxy.tools.dump import DumpMaster
options = main.options.Options(listen_host='0.0.0.0', listen_port=8080)
m = DumpMaster(options=options)
# the rest is the same as in previous versions
from mitmproxy.addons.proxyserver import Proxyserver
from mitmproxy.options import Options
from mitmproxy.tools.dump import DumpMaster

options = Options(listen_host='127.0.0.1', listen_port=8080, http2=True)
m = DumpMaster(options, with_termlog=True, with_dumper=False)
m.server = Proxyserver()
m.addons.add(
    # addons here
)
m.run()
Hi, I think that should do it

How to take screenshots in AWS Device Farm for a run using Appium Python?

Even after a successful execution of my tests in Device Farm, I get an empty screenshots report. I have kept my code as simple as possible, as below:
from appium import webdriver
import time
import unittest
import os
class MyAndroidTest(unittest.TestCase):
    def setUp(self):
        caps = {}
        self.driver = webdriver.Remote("http://127.0.0.1:4723/wd/hub", caps)

    def test1(self):
        self.driver.get('http://docs.aws.amazon.com/devicefarm/latest/developerguide/welcome.html')
        time.sleep(5)
        screenshot_folder = os.getenv('SCREENSHOT_PATH', '/tmp')
        self.driver.save_screenshot(screenshot_folder + 'screen1.png')
        time.sleep(5)

    def tearDown(self):
        self.driver.quit()

if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(MyAndroidTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
I tested on a single device pool.
How can I make this work?
TIA.
You're missing a slash (/) before the filename (i.e., screen1.png). The save_screenshot line should be as below:
self.driver.save_screenshot(screenshot_folder + '/screen1.png')
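As a side note, building the path with os.path.join avoids this kind of bug entirely; a minimal sketch, reusing the SCREENSHOT_PATH fallback from the question:

# inside test1(), assuming the same imports and self.driver as in the question
screenshot_folder = os.getenv('SCREENSHOT_PATH', '/tmp')
# os.path.join inserts the separator, so a missing trailing slash can't break the path
self.driver.save_screenshot(os.path.join(screenshot_folder, 'screen1.png'))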
Though I'm not sure exactly how to write this to a file in Device Farm, here are the Appium docs for the screenshot endpoint, along with a Python example.
https://github.com/appium/appium/blob/master/docs/en/commands/session/screenshot.md
It returns a base64-encoded string, which we would then just need to save somewhere, such as the Appium screenshot dir the other answers mentioned. Otherwise, we could also save it in the /tmp dir and then export it using the custom artifacts feature.
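As a rough sketch of that approach (assuming the same self.driver and /tmp fallback as in the question; get_screenshot_as_base64() is the Python client method behind that endpoint):

import base64

# grab the screenshot as a base64 string and decode it to a PNG file
screenshot_b64 = self.driver.get_screenshot_as_base64()
target_dir = os.getenv('SCREENSHOT_PATH', '/tmp')
with open(os.path.join(target_dir, 'screen1.png'), 'wb') as f:
    f.write(base64.b64decode(screenshot_b64))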
Let me know if that link helps.
James

gsutil multiprocessing and multithreading do not sustain CPU usage & copy rate on GCP instance

I am running a script to copy millions (2.4 million to be exact) of images from several GCS buckets into one central bucket, with all buckets in the same region. I was originally working from one CSV file but broke it into 64 smaller ones so each process can iterate through its own file and not wait for the others. When the script launches on a 64 vCPU, 240 GB memory instance on GCP, it runs fine for about an hour and a half. In 75 minutes, 155 thousand files copied over. The CPU usage registered a sustained 99%. After this, the CPU usage drastically declines to 2% and the transfer rate falls significantly. I am really unsure why this happens. I am keeping track of files that fail by creating blank files in an errors directory, so there is no write lock when writing to a central error file. Code is below. It is not a spacing or syntax error; some spacing got messed up when I copied it into the post. Any help is greatly appreciated.
Thanks,
Zach
import os
import subprocess
import csv
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import Pool as ProcessPool
import multiprocessing
gcs_destination = 'gs://dest-bucket/'
source_1 = 'gs://source-1/'
source_2 = 'gs://source-2/'
source_3 = 'gs://source-3/'
source_4 = 'gs://source-4/'
def copy(img):
    try:
        imgID = img[0]  # extract name
        imgLocation = img[9]  # extract its location on gcs
        print imgID + " " + imgLocation
        source = ""
        if imgLocation == '1':
            source = source_1
        elif imgLocation == '2':
            source = source_2
        elif imgLocation == '3':
            source = source_3
        elif imgLocation == '4':
            source = source_4
        print str(os.getpid())
        # gcs_destination assumed as the copy target (the line was truncated in the post)
        command = "gsutil -o GSUtil:state_dir=.{} cp {}{}.tar.gz {}".format(os.getpid(), source, imgID, gcs_destination)
        prog = subprocess.call(command, shell=True)
        if prog != 0:
            command = "touch errors/{}_{}".format(imgID, imgLocation)
            os.system(command)
    except:
        print "Doing nothing with the error"

def split_into_threads(csv_file):
    with open(csv_file) as f:
        csv_f = csv.reader(f)
        pool = ThreadPool(15)
        pool.map(copy, csv_f)

if __name__ == "__main__":
    file_names = [None] * 64
    # Read in CSV file of all records
    for i in range(0, 64):
        file_names[i] = 'split_origin/origin_{}.csv'.format(i)
    process_pool = ProcessPool(multiprocessing.cpu_count())
    process_pool.map(split_into_threads, file_names)
For gsutil, I agree strongly with the multithreading suggestion by adding -m. Further, composite uploads, -o, may be unnecessary and undesirable as the images are not GB each in size and need not be split into shards. They're likely in the X-XXMB range.
Within your Python function, you are calling gsutil commands, which in turn call further Python functions. It should be cleaner and more performant to leverage the Google-made client library for Python, available below. gsutil is built for interactive CLI use rather than for being called programmatically.
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
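For reference, a minimal sketch of a bucket-to-bucket copy with that client library (the bucket and object names below are placeholders, not the asker's real ones):

from google.cloud import storage

client = storage.Client()
source_bucket = client.bucket("source-1")          # placeholder source bucket
dest_bucket = client.bucket("dest-bucket")         # placeholder destination bucket

blob = source_bucket.blob("some-image-id.tar.gz")  # placeholder object name
# copy_blob performs a server-side copy, so the data never passes through the VM
source_bucket.copy_blob(blob, dest_bucket, "some-image-id.tar.gz")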
Also, for gsutil, see your ~/.boto file and look at the multiprocessing and multithreading values. Beefier machines can handle more threads and processes. For reference, I work from my MacBook Pro with 1 process and 24 threads. I use an Ethernet adapter and hardwire into my office connection, and I get incredible performance off the internal SSD (>450 Mbps). That's megabits, not bytes. The transfer rates are impressive, nonetheless.
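The corresponding knobs in ~/.boto look roughly like this (the values shown are just the 1-process/24-thread setup mentioned above; tune them for your machine):

[GSUtil]
parallel_process_count = 1
parallel_thread_count = 24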
I strongly recommend you use the "-m" flag on gsutil to enable multi-threaded copying.
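Applied to the subprocess call from the question, that is roughly (variable names are the ones from the question's copy function):

# hedged sketch: the question's gsutil invocation with -m added to enable parallelism
command = "gsutil -m -o GSUtil:state_dir=.{} cp {}{}.tar.gz {}".format(
    os.getpid(), source, imgID, gcs_destination)
prog = subprocess.call(command, shell=True)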
Also, as an alternative, you can use the Storage Transfer Service [1] to move data between buckets.
[1] https://cloud.google.com/storage/transfer/

Import error: Django running in terminal works but not in Eclipse IDE

from django.http import HttpResponse
import urllib2
import simplejson
import memcache
def epghome(request):
    mc = memcache.Client(['localhost:11211'])
    if mc.get("jsondata"):
        myval = mc.get("jsondata")
        return HttpResponse("<h1>This Data is coming from Cache :</h1><br> " + str(myval))
    else:
        responseFromIDubba = ""
        responseFromIDubba = urllib2.urlopen("http://www.idubba.com/apps/guide.aspx?key=4e2b0ca2328e03acce0014101d302ac7&type=1&f1=1&f2=0&channel=&gener=&page=0&summary=1").read()
        parseResponseString = simplejson.loads(responseFromIDubba)
        print(parseResponseString)
        body = ""
        for data in parseResponseString["data"]:
            # for keys in data:
            #     body = body + str(data.items()) + ".." + str(data[str(keys)]) + "<br>"
            body = body + str(data.items()) + "<br><br>"
            # body = body + "<br><br>"
        mc.set("epgdata3", body)
        mc.set("jsondata", responseFromIDubba)
        returndata = str(mc.get("jsondata"))
        return HttpResponse("<h1>No Cache Set</h1>" + responseFromIDubba)
Hi, this code works without any error if I run it in the terminal with Python.
But in the Eclipse IDE it always gives an ImportError, even though I downloaded and installed those modules. I also restarted both Eclipse and my system, but it doesn't help.
Can you please suggest a site where I can get all the libraries at once?