No more replicas available for broadcast_0_python - pyspark

I am attempting to run the following code in a Dataproc cluster (you can find the software versions I am using here):
# IMPORTANT: THIS CODE WAS RUN IN A SINGLE JUPYTER NOTEBOOK CELL
print("IMPORTING LIBRARIES...")
import pandas as pd
import numpy as np
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, pandas_udf
# https://spark.apache.org/docs/3.1.3/api/python/_modules/pyspark/sql/types.html
from pyspark.sql.types import ArrayType, StringType

print("STARTING SPARK SESSION...")
spark = SparkSession.builder.appName('SpacyOverPySpark') \
    .getOrCreate()

print("FUNCTION DEFINITION...")

def load_spacy_model():
    import spacy
    print("\tLoading spacy model...")
    return spacy.load("./spacy_model")  # This model exists locally

@pandas_udf(ArrayType(StringType()))
def get_entities(list_of_text: pd.Series) -> pd.Series:
    # retrieving the shared nlp object
    nlp = broadcasted_nlp.value
    # batch processing our list of text
    docs = nlp.pipe(list_of_text)
    # entity extraction (`ents` is a list[list[str]])
    ents = [
        [ent.text for ent in doc.ents]
        for doc in docs
    ]
    return pd.Series(ents)

# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())

print("DATA READING (OR MANUAL DATA GENERATION)...")
# # Manually-generated data (DISABLED BY DEFAULT, USE FOR "TESTING")
# # IMPORTANT: Code works well for this case !!!
# pdf = pd.DataFrame(
#     [
#         "Python and Pandas are very important for Automation",
#         "Tony Stark is an Electrical Engineer",
#         "Pipe welding is a very dangerous task in Oil mining",
#         "Nursing is often underwhelmed, but it's very interesting",
#         "Software Engineering now opens a lot of doors for you",
#         "Civil Engineering can get exciting, as you travel very often",
#         "I am a Java Programmer, and I think I'm quite good at what I do",
#         "Diane is never bored of doing the same thing all day",
#         "My father is a Doctor, and he supports people in condition of poverty",
#         "A janitor is required as soon as possible"
#     ],
#     columns=['posting']
# )
# sdf = spark.createDataFrame(pdf)

# Reading data from CSV stored in GCS (ENABLED BY DEFAULT, USE FOR "PRODUCTION")
sdf = spark.read.csv("gs://onementor-ml-data/1M_indeed_eng_clean.csv", header=True)  # ~1M rows, 1 column 'posting', ~1GB in size
print("\tDataFrame shape: ", (sdf.count(), len(sdf.columns)))

print("NAMED ENTITY RECOGNITION USING SPACY OVER PYSPARK...")
t1 = time.time()
# df_dummy2.withColumn("entities", get_entities(col("posting"))).show(5, truncate=10)
sdf_new = sdf.withColumn('skills', get_entities('posting'))
sdf_new.show(5, truncate=10)
print("\tData mined in {:.2f} seconds (Dataframe shape: ({}, {}))".format(
    time.time() - t1,
    sdf_new.count(),
    len(sdf_new.columns))
)
BTW, some basic specs of my cluster (this info can be updated, please request it in the comment section):
Master node
Standard (1 master, N workers)
Machine type: n1-highmem-4 (originally n1-standard-4, still with errors)
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500GB
Local SSDs: 0
Worker nodes
Quantity: 10 (originally 2, still with errors)
Machine type: n1-standard-4
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500GB
Local SSDs: 0
Secondary worker nodes: 0
When running the previous script with the "manually-generated data", the entity extraction works OK (if you need details about how I created my cluster, hit that link too); however, when importing the .csv data from Cloud Storage, the following error appears (both the VM and cluster names have been changed for safety):
ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 11 on my-vm-w-9.us-central1-a.c.my-project.internal: Container marked as failed: container_1661960727108_0002_01_000013 on host: my-vm-w-9.us-central1-a.c.my-project.internal. Exit status: -100. Diagnostics: Container released on a *lost* node.
I have also read in the logs the following warning:
WARN org.apache.spark.storage.BlockManagerMasterEndpoint: No more replicas available for broadcast_0_python !
I have done some quick research, but I was astonished by the considerable number of very different possible causes of that error (none of them, however, valid for PySpark over Dataproc), so I am not sure if there is a better troubleshooting approach for this case than blindly trying, one by one, the cases I find on the web.
What could be happening here?
Thank you

Related

No module named 'spacy' in PySpark

I am attempting to perform some entity extraction using a custom NER spaCy model. The extraction will be done over a Spark DataFrame, and everything is being orchestrated in a Dataproc cluster (using a Jupyter Notebook, available in the "Workbench"). The code I am using looks as follows:
# IMPORTANT: NOTICE THIS CODE WAS RUN FROM A JUPYTER NOTEBOOK (!)
import pandas as pd
import numpy as np
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName('SpacyOverPySpark') \
    .getOrCreate()

# FUNCTIONS DEFINITION

def load_spacy_model():
    import spacy
    print("Loading spacy model...")
    return spacy.load("./spacy_model")  # This model exists locally

@pandas_udf(ArrayType(StringType()))
def entities(list_of_text: pd.Series) -> pd.Series:
    # retrieving the shared nlp object
    nlp = broadcasted_nlp.value
    # batch processing our list of text
    docs = nlp.pipe(list_of_text)
    # entity extraction (`ents` is a list[list[str]])
    ents = [
        [ent.text for ent in doc.ents]
        for doc in docs
    ]
    return pd.Series(ents)

# DUMMY DATA FOR THIS TEST
pdf = pd.DataFrame(
    [
        "Python and Pandas are very important for Automation",
        "Tony Stark is an Electrical Engineer",
        "Pipe welding is a very dangerous task in Oil mining",
        "Nursing is often underwhelmed, but it's very interesting",
        "Software Engineering now opens a lot of doors for you",
        "Civil Engineering can get exciting, as you travel very often",
        "I am a Java Programmer, and I think I'm quite good at what I do",
        "Diane is never bored of doing the same thing all day",
        "My father is a Doctor, and he supports people in condition of poverty",
        "A janitor is required as soon as possible"
    ],
    columns=['postings']
)
sdf = spark.createDataFrame(pdf)

# MAIN CODE
# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
# Extracting entities
df_new = sdf.withColumn('skills', entities('postings'))
# Displaying results
df_new.show(10, truncate=20)
The error code I am getting looks similar to this, but the answer does not apply to my case, because it deals with "executing a Pyspark job in Yarn", which is different (or so I think, feel free to correct me). Plus, I have also found this, but the answer is rather vague (I gotta be honest here: the only thing I have done to "restart the spark session" is to run spark.stop() in the last cell of my Jupyter Notebook, and then run the cells above again; feel free to correct me here too).
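For concreteness, that "restart" amounts to something like the following in the notebook (a minimal sketch of what I described above, nothing more):
# stop the session that is currently running...
spark.stop()
# ...then build a fresh one and re-run the remaining cells from the top
spark = SparkSession.builder.appName('SpacyOverPySpark').getOrCreate()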
The code used was heavily inspired by "Answer 2 of 2" in this forum, which makes me wonder if some missing setting is still eluding me (BTW, "Answer 1 of 2" was already tested but did not work). And regarding my specific software versions, they can be found here.
Thank you.
CLARIFICATIONS:
Because some queries or hints generated in the comment section can be lengthy, I have decided to include them here:
No. 1: "Which command did you use to create your cluster?": I used this method, so the command was not visible "in plain sight"; I have just realized, however, that when you are about to create the cluster, there is an "EQUIVALENT COMMAND LINE" button that grants access to that command:
In my case, the Dataproc cluster creation code (automatically generated by GCP) is:
gcloud dataproc clusters create my-cluster \
--enable-component-gateway \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 2.0-debian10 \
--optional-components JUPYTER \
--metadata PIP_PACKAGES=spacy==3.2.1 \
--project hidden-project-name
Notice how spaCy is requested via the cluster metadata (following these recommendations); however, running the pip freeze | grep spacy command right after the Dataproc cluster creation does not display any result (i.e., spaCy does NOT get installed successfully). To enable it, the official method is used afterwards (see the sketch right below).
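For reference, a cluster creation command using that official method would look roughly like the following. This is only a sketch: the initialization-action path assumes the standard regional pip-install action bucket, and it is not the exact command I ran:
gcloud dataproc clusters create my-cluster \
--region us-central1 \
--image-version 2.0-debian10 \
--optional-components JUPYTER \
--enable-component-gateway \
--initialization-actions gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
--metadata PIP_PACKAGES=spacy==3.2.1 \
--project hidden-project-name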
No. 2: "Wrong path as possible cause": Not my case; it actually looks similar to this case (even though I can't say the root cause is the same for both):
Running which python shows /opt/conda/miniconda3/bin/python as result.
Running which spacy (read "Clarification No. 1") shows /opt/conda/miniconda3/bin/spacy as result.
While replicating your case in Jupyter Notebook, I encountered the same error.
I used this pip command and it worked:
pip install -U spacy
But after installing, I got a JAVA_HOME is not set error, so I used these commands:
conda install openjdk
conda install -c conda-forge findspark
!python3 -m spacy download en_core_web_sm
I just included it in case you might also encounter it.
The output (shown as a screenshot in the original answer, not reproduced here) confirmed that the code worked.
Note: I used spacy.load("en_core_web_sm").
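For context, findspark is typically used in the notebook before the SparkSession is built, roughly like this (a minimal sketch, not the exact cell from the answer):
import findspark
findspark.init()  # make the pyspark package importable from the notebook's Python

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SpacyOverPySpark').getOrCreate()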
I managed to solve this issue by combining 2 pieces of information:
"Configure Dataproc Python environment", "Dataproc image version 2.0" (as that is the version I am using): available here (special thanks to @Dagang in the comment section).
"Create a (Dataproc) cluster": available here.
Specifically, during the Dataproc cluster setup via the Google Console, I "installed" spaCy as part of the cluster creation (the original answer included a screenshot of that step, not reproduced here).
When the cluster was created, I ran the code mentioned in my original post (with NO modifications), and the entity extraction worked (that result was also shown as a screenshot, not reproduced here).
That solves my original question. I am planning to apply my solution to a larger dataset, but I think whatever happens there is the subject of a different thread.

How to "restart" Cloud MongoDB Atlas Database

Today I was stress testing multiprocessing in Python against our cloud MongoDB (Atlas).
It's currently running at 100%, and I'd like to do something like a "restart".
I have found a "Shutdown" command, but I can't find a command to start it up after it has shutdown, so I'm afraid to run just the "Shutdown".
I have tried killing processes one at a time in the lower-right section of the screen shown below (screenshot not reproduced here), but after refreshing the page, the same process numbers are there, and I think there are more at the bottom of the list. I think they are all backed up.
An insert of a large document does not return control to the Python program within 5 minutes. I need to get that working again (it should complete in 10-15 seconds, as it has in the past).
I am able to open a command window and connect to that server. Just unclear what commands to run.
Here is an example of how I tried to kill some of the processes (screenshot not reproduced here):
Also note that the "Performance Advisor" page is not recommending any new indexes.
Update 1:
Alternatively, can I kill all running, hung, or locked processes?
I was reading about killOp here (https://docs.mongodb.com/manual/tutorial/terminate-running-operations/), but found it confusing given the version differences and the fact that I'm using Atlas.
I'm not sure if there is an easier way, but this is what I did.
First, I ran a Python program to extract all the desired operation IDs based on my database and collection name. You have to look at the file created to understand the if statements in the code below. NOTE: it says that db.current_op is deprecated, and I haven't found out how to do this without that command (from PyMongo); a possible alternative using the $currentOp aggregation stage is sketched at the end of this answer.
Note the doc page warns against killing certain types of operations, so I was careful to pick ones that were doing inserts on one specific collection. (Do not attempt to kill all processes in the JSON returned).
import requests
import os
import sys
import traceback
import pprint
import json
from datetime import datetime as datetime1, timedelta, timezone
import datetime
from time import time
import time as time2
import configHandler
import pymongo
from pymongo import MongoClient
from uuid import UUID

def uuid_convert(o):
    if isinstance(o, UUID):
        return o.hex

# This gets all my config from a config.json file, not including that code here.
config_dict = configHandler.getConfigVariables()
cluster = MongoClient(config_dict['MONGODB_CONNECTION_STRING_ADMIN'])
db = cluster[config_dict['MONGODB_CLUSTER']]

current_ops = db.current_op(True)
count_ops = 0
for op in current_ops["inprog"]:
    count_ops += 1
    # db.kill - no such command
    if op["type"] == "op":
        if "op" in op:
            if op["op"] == "insert" and op["command"]["insert"] == "TestCollectionName":
                # print(op["opid"], op["command"]["insert"])
                print('db.adminCommand({"killOp": 1, "op": ' + str(op["opid"]) + '})')

print("\n\ncount_ops=", count_ops)
currDateTime = datetime.datetime.now()
print("type(current_ops) = ", type(current_ops))
# this dictionary has nested fields
# current_ops_str = json.dumps(current_ops, indent=4)
# https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
filename = "./data/currents_ops" + currDateTime.strftime("%y_%m_%d__%H_%M_%S") + ".json"
with open(filename, "w") as file1:
    # file1.write(current_ops_str)
    json.dump(current_ops, file1, indent=4, default=uuid_convert)
print("Wrote to filename=", filename)
It writes the full ops file to disk, but I did a copy/paste from the command window to a file. Then, from the command line, I ran something like this:
mongo "mongodb+srv://mycluster0.otwxp.mongodb.net/mydbame" --username myuser --password abc1234 <kill_opid_script.js
The kill_opid_script.js looked like this. I added the print(db) calls because the first time I ran it, it didn't seem to do anything.
print(db)
db.adminCommand({"killOp": 1, "op": 648685})
db.adminCommand({"killOp": 1, "op": 667396})
db.adminCommand({"killOp": 1, "op": 557439})
etc... for 400+ times...
print(db)
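As a follow-up to the deprecation note above, the same listing and killing could presumably be done entirely from PyMongo, using the $currentOp aggregation stage on the admin database instead of current_op, plus killOp commands. This is only a rough sketch (it assumes PyMongo 3.9+ and an Atlas user with privileges to run these commands; the connection string and collection name are placeholders):
from pymongo import MongoClient

client = MongoClient("mongodb+srv://...")  # placeholder connection string

# List in-progress insert operations on one collection via the $currentOp stage
ops = client.admin.aggregate([
    {"$currentOp": {"allUsers": True, "idleConnections": False}},
    {"$match": {"op": "insert", "command.insert": "TestCollectionName"}},
])

for op in ops:
    print("Killing opid:", op["opid"])
    # Equivalent of db.adminCommand({"killOp": 1, "op": <opid>}) in the mongo shell
    client.admin.command({"killOp": 1, "op": op["opid"]})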

How to deploy a Google dataflow worker with a file loaded into memory?

I am trying to deploy Google Dataflow streaming for use in my machine learning streaming pipeline, but I cannot seem to deploy the worker with a file already loaded into memory. Currently, I have set up the job to pull a pickle file from a GCS bucket, load it into memory, and use it for model prediction. But this is executed on every cycle of the job, i.e. it pulls from GCS every time a new object enters the Dataflow pipeline, meaning that the current execution of the pipeline is much slower than it needs to be.
What I really need is a way to allocate a variable within the worker nodes on setup of each worker, and then use that variable within the pipeline without having to re-load it on every execution of the pipeline.
Is there a way to do this step before the job is deployed, something like
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
But within my setup.py file?
##### based on - https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py
"""Setup.py module for the workflow's worker utilities.

All the workflow related code is gathered in a package that will be built as a
source distribution, staged in the staging area for the workflow being run and
then installed in the workers when they start running.

This behavior is triggered by specifying the --setup_file command line option
when running the workflow for remote execution.
"""
# pytype: skip-file

from __future__ import absolute_import
from __future__ import print_function

import subprocess
from distutils.command.build import build as _build  # type: ignore

import setuptools


# This class handles the pip install mechanism.
class build(_build):  # pylint: disable=invalid-name
    """A build command class that will be invoked during package install.

    The package built using the current setup.py will be staged and later
    installed in the worker using `pip install package'. This class will be
    instantiated during install for this specific scenario and will trigger
    running the custom commands specified.
    """
    sub_commands = _build.sub_commands + [('CustomCommands', None)]


CUSTOM_COMMANDS = [
    ['pip', 'install', 'scikit-learn==0.23.1'],
    ['pip', 'install', 'google-cloud-storage'],
    ['pip', 'install', 'mlxtend'],
]


class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def RunCustomCommand(self, command_list):
        print('Running command: %s' % command_list)
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        print('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)


REQUIRED_PACKAGES = [
    'google-cloud-storage',
    'mlxtend',
    'scikit-learn==0.23.1',
]

setuptools.setup(
    name='ML pipeline',
    version='0.0.1',
    description='ML set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
    })
Snippet of current ML load mechanism:
class MlModel(beam.DoFn):
    def __init__(self):
        self._model = None
        from google.cloud import storage
        import pandas as pd
        import pickle as pkl
        self._storage = storage
        self._pkl = pkl
        self._pd = pd

    def process(self, element):
        if self._model is None:
            bucket = self._storage.Client().get_bucket(myBucket)
            blob = bucket.get_blob(myBlob)
            self._model = self._pkl.loads(blob.download_as_string())

        new_df = self._pd.read_json(element, orient='records').iloc[:, 3:-1]

        predict = self._model.predict(new_df)
        df = self._pd.DataFrame(data=predict, columns=["A", "B"])
        A = df.iloc[0]['A']
        B = df.iloc[0]['B']
        d = {'A': A, 'B': B}
        return [d]
You can use the setup method in your MlModel DoFn class, where you can load your model and then use it in your process method. The setup method is called once when the worker initializes the DoFn instance.
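A rough sketch of what that could look like for the DoFn in the question (the bucket and blob names are placeholders, and the column handling is copied unchanged from the snippet above):
import pickle as pkl

import apache_beam as beam
import pandas as pd
from google.cloud import storage


class MlModel(beam.DoFn):
    def setup(self):
        # Runs once when the worker initializes this DoFn instance,
        # so the model is only downloaded from GCS a single time.
        bucket = storage.Client().get_bucket("my-bucket")  # placeholder bucket name
        blob = bucket.get_blob("model.pkl")                # placeholder blob name
        self._model = pkl.loads(blob.download_as_string())

    def process(self, element):
        # The already-loaded model is reused for every element.
        new_df = pd.read_json(element, orient='records').iloc[:, 3:-1]
        predict = self._model.predict(new_df)
        df = pd.DataFrame(data=predict, columns=["A", "B"])
        yield {'A': df.iloc[0]['A'], 'B': df.iloc[0]['B']}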
I had written a similar answer here
HTH

How can my web crawler (Python, Scrapy, Scrapy-Splash) crawl faster?

Develop Environment:
CentOS7
pip 18.1
Docker version 18.09.3, build 774a1f4
anaconda Command line client (version 1.7.2)
Python3.7
Scrapy 1.6.0
scrapy-splash
MongoDB(db version v4.0.6)
PyCharm
Server Specs:
CPU ->
processor: 22,
vendor_id: GenuineIntel,
cpu family: 6,
model: 45,
model name: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
RAM -> Mem: 31960
64 bit
Hello.
I'm a PHP developer, and this is my first Python project. I'm trying to use Python because I heard that it has many benefits for web crawling.
I'm crawling one dynamic web site, and I need to crawl around 3,500 pages every 5-15 seconds. For now, my crawler is too slow: it crawls only about 200 pages per minute.
My source code is like this:
main.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.bot1 import Bot1Spider
from spiders.bot2 import Bot2Spider
from spiders.bot3 import Bot3Spider
from spiders.bot4 import Bot4Spider
from pprint import pprint

process = CrawlerProcess(get_project_settings())
process.crawl(Bot1Spider)
process.crawl(Bot2Spider)
process.crawl(Bot3Spider)
process.crawl(Bot4Spider)
process.start()
bot1.py
import scrapy
import datetime
import math

from scrapy_splash import SplashRequest
from pymongo import MongoClient
from pprint import pprint


class Bot1Spider(scrapy.Spider):
    name = 'bot1'
    client = MongoClient('localhost', 27017)
    db = client.db

    def start_requests(self):
        count = int(self.db.games.find().count())
        num = math.floor(count * 0.25)
        start_urls = self.db.games.find().limit(num - 1)
        for url in start_urls:
            # `domain` is assumed to be defined elsewhere in the project (site base URL)
            full_url = domain + list(url.values())[5]
            yield SplashRequest(full_url, self.parse, args={'wait': 0.1}, meta={'oid': list(url.values())[0]})

    def parse(self, response):
        pass
settings.py
BOT_NAME = 'crawler'
SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'
# Scrapy Configuration
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'my-project-name (www.my.domain)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 64
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 16
To execute this code, I'm using this command: python main.py
After seeing my code, please help me. I'll happily listen to any suggestions.
1. How can my spider be faster? I've tried to use threading, but it doesn't seem to work right.
2. What is the best-performing setup for web crawling?
3. Is it possible to crawl 3,500 dynamic pages every 5-15 seconds?
Thank you.

gsutil multiprocessing and multithreading does not sustain CPU usage & copy rate on GCP instance

I am running a script to copy millions of images (2.4 million to be exact) from several GCS buckets into one central bucket, with all buckets in the same region. I was originally working from one CSV file, but broke it into 64 smaller ones so each process can iterate through its own file and not wait for the others. When the script launches on a 64 vCPU, 240 GB memory instance on GCP, it runs fine for about an hour and a half: in 75 minutes, 155 thousand files copied over, and the CPU usage registered a sustained 99%. After this, the CPU usage drastically declines to 2% and the transfer rate falls significantly, and I am really unsure why. I am keeping track of files that fail by creating blank files in an errors directory, so there is no write lock when writing to a central error file. Code is below. It is not a spacing or syntax error; some spacing got messed up when I copied it into the post. Any help is greatly appreciated.
Thanks,
Zach
import os
import subprocess
import csv
from multiprocessing.dummy import Pool as ThreadPool
from multiprocessing import Pool as ProcessPool
import multiprocessing

gcs_destination = 'gs://dest-bucket/'
source_1 = 'gs://source-1/'
source_2 = 'gs://source-2/'
source_3 = 'gs://source-3/'
source_4 = 'gs://source-4/'

def copy(img):
    try:
        imgID = img[0]        # extract name
        imgLocation = img[9]  # extract its location on gcs
        print imgID + " " + imgLocation

        source = ""
        if imgLocation == '1':
            source = source_1
        elif imgLocation == '2':
            source = source_2
        elif imgLocation == '3':
            source = source_3
        elif imgLocation == '4':
            source = source_4

        print str(os.getpid())
        command = "gsutil -o GSUtil:state_dir=.{} cp {}{}.tar.gz {}".format(os.getpid(), source, imgID, gcs_destination)

        prog = subprocess.call(command, shell="True")
        if prog != 0:
            command = "touch errors/{}_{}".format(imgID, imgLocation)
            os.system(command)
    except:
        print "Doing nothing with the error"

def split_into_threads(csv_file):
    with open(csv_file) as f:
        csv_f = csv.reader(f)
        pool = ThreadPool(15)
        pool.map(copy, csv_f)

if __name__ == "__main__":
    file_names = [None] * 64
    # Read in CSV file of all records
    for i in range(0, 64):
        file_names[i] = 'split_origin/origin_{}.csv'.format(i)
    process_pool = ProcessPool(multiprocessing.cpu_count())
    process_pool.map(split_into_threads, file_names)
For gsutil, I agree strongly with the multithreading suggestion by adding -m. Further, composite uploads, -o, may be unnecessary and undesirable as the images are not GB each in size and need not be split into shards. They're likely in the X-XXMB range.
Within your Python function, you are calling gsutil commands, which are in turn calling further Python functions. It should be cleaner and more performant to leverage the Google-made client library for Python, available below. gsutil is built for interactive CLI use rather than for being called programmatically.
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python
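A minimal sketch of what that server-side copy could look like with the google-cloud-storage client (the bucket and object names here are placeholders, not taken from the question):
from google.cloud import storage

client = storage.Client()
source_bucket = client.bucket("source-1")          # placeholder source bucket
destination_bucket = client.bucket("dest-bucket")  # placeholder destination bucket

# Server-side copy between buckets; no local download/upload involved
blob = source_bucket.blob("some-image.tar.gz")     # placeholder object name
source_bucket.copy_blob(blob, destination_bucket, new_name=blob.name)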
Also, for gsutil, see your ~/.boto file and look at the multiprocessing and multithreading values. Beefier machines can handle more processes and threads. For reference, I work from my MacBook Pro with 1 process and 24 threads. I use an ethernet adapter and hardwire into my office connection, and I get incredible performance off the internal SSD (>450 Mbps). That's megabits, not bytes. The transfer rates are impressive, nonetheless.
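For illustration, those values live in the [GSUtil] section of ~/.boto and look something like the following (the numbers are simply the ones mentioned above, not a recommendation):
[GSUtil]
parallel_process_count = 1
parallel_thread_count = 24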
I strongly recommend you use the "-m" flag on gsutil to enable multi-threaded copying.
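For example, something along these lines (the wildcard path is illustrative only, not taken from the question):
gsutil -m cp gs://source-1/*.tar.gz gs://dest-bucket/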
Also as an alternative you can use the Storage Transfer Service [1] to move data between buckets.
[1] https://cloud.google.com/storage/transfer/