No module named 'spacy' in PySpark - pyspark

I am attempting to perform some entity extraction, using a custom NER spaCy model. The extraction will be done over a Spark Dataframe, and everything is being orchestrated in a Dataproc cluster (using a Jupyter Notebook, available in the "Workbench"). The code I am using, looks like follows:
# IMPORTANT: NOTICE THIS CODE WAS RUN FROM A JUPYTER NOTEBOOK (!)
import pandas as pd
import numpy as np
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import ArrayType, StringType
spark = SparkSession.builder.appName('SpacyOverPySpark') \
.getOrCreate()
# FUNCTIONS DEFINITION
def load_spacy_model():
import spacy
print("Loading spacy model...")
return spacy.load("./spacy_model") # This model exists locally
#pandas_udf(ArrayType(StringType()))
def entities(list_of_text: pd.Series) -> pd.Series:
# retrieving the shared nlp object
nlp = broadcasted_nlp.value
# batch processing our list of text
docs = nlp.pipe(list_of_text)
# entity extraction (`ents` is a list[list[str]])
ents=[
[ent.text for ent in doc.ents]
for doc in docs
]
return pd.Series(ents)
# DUMMY DATA FOR THIS TEST
pdf = pd.DataFrame(
[
"Pyhton and Pandas are very important for Automation",
"Tony Stark is a Electrical Engineer",
"Pipe welding is a very dangerous task in Oil mining",
"Nursing is often underwhelmed, but it's very interesting",
"Software Engineering now opens a lot of doors for you",
"Civil Engineering can get exiting, as you travel very often",
"I am a Java Programmer, and I think I'm quite good at what I do",
"Diane is never bored of doing the same thing all day",
"My father is a Doctor, and he supports people in condition of poverty",
"A janitor is required as soon as possible"
],
columns=['postings']
)
sdf=spark.createDataFrame(pdf)
# MAIN CODE
# loading spaCy model and broadcasting it
broadcasted_nlp = spark.sparkContext.broadcast(load_spacy_model())
# Extracting entities
df_new = sdf.withColumn('skills',entities('postings'))
# Displaying results
df_new.show(10, truncate=20)
The error code I am getting, looks similar to this, but the answer does not apply for my case, because it deals with "executing a Pyspark job in Yarn" which is different (or so I think, feel free to correct me). Plus, I have also found this, but the answer is rather vague (I gotta be honest here: the only thing I have done to "restart the spark session" is to run spark.stop() in the last cell of my Jupyter Notebook, and then run the cells above again, feel free to correct me here too).
The code used was heavily inspired by "Answer 2 of 2" in this forum, which makes me wonder if some missing setting is still eluding me (BTW, "Answer 1 of 2" was already tested but did not work). And regarding my specific software versions, they can be found here.
Thank you.
CLARIFICATIONS:
Because some queries or hints generated in the comment section can be lengthy, I have decided to include them here:
No. 1: "Which command did you use to create your cluster?" : I used this method, so the command was not visible "at plain sight"; I have just realized however that, when you are about to create the cluster, you have an "EQUIVALENT COMMAND LINE" button, that grants access to such command:
In my case, the Dataproc cluster creation code (automatically generated by GCP) is:
gcloud dataproc clusters create my-cluster \
--enable-component-gateway \
--region us-central1 \
--zone us-central1-c \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 2.0-debian10 \
--optional-components JUPYTER \
--metadata PIP_PACKAGES=spacy==3.2.1 \
--project hidden-project-name
Notice how spaCy is installed in the metadata (following these recommendations); however running pip freeze | grep spacy command, right after the Dataproc cluster creation, does not display any result (i.e., spaCy does NOT get installed successfully). To enable it, the official method is used afterwards.
No. 2: "Wrong path as possible cause" : Not my case, it actually looks similar to this case (even when I can't say the root case is the same for both):
Running which python shows /opt/conda/miniconda3/bin/python as result.
Running which spacy (read "Clarification No. 1") shows /opt/conda/miniconda3/bin/spacy as result.

While replicating your case in Jupyter Notebook, I encountered the same error.
I used this pip command and it worked:
pip install -U spacy
But after installing, I got a JAVA_HOME is not set error, so I used these commands:
conda install openjdk conda install -c
conda-forge findspark
!python3 -m spacy download en_core_web_sm
I just included it in case you might also encounter it.
Here is the output:
Note: I used spacy.load("en_core_web_sm").

I managed to solve this issue, by combining 2 pieces of information:
"Configure Dataproc Python environment", "Dataproc image version 2.0" (as that is the version I am using): available here (special thanks to #Dagang in the comment section).
"Create a (Dataproc) cluster": available here.
In specific, during the Dataproc cluster setup via Google Console, I "installed" spaCy by doing:
And when the cluster was already created, I ran the code mentioned in my original post (NO modifications) with the following result:
That solves my original question. I am planning to apply my solution on a larger dataset, but I think whatever happen there, is subject of a different thread.

Related

Google Cloud AI Platform: I can't create a model version using the "--accelerator" parameter

In order to get online predictions, I'm creating a model version on the ai-platform. It works fine unless I want to use the --accelerator parameter.
Here is the command that works:
gcloud alpha ai-platform versions create [...] --model [...] --origin=[...] --python-version=3.5 --runtime-version=1.14 --package-uris=[...] --machine-type=mls1-c4-m4 --prediction-class=[...]
Here is the parameter that makes it not work:
--accelerator=^:^count=1:type=nvidia-tesla-k80
This is the error message I get:
ERROR: (gcloud.alpha.ai-platform.versions.create) INVALID_ARGUMENT: Request contains an invalid argument.
I expect it to work, since 1) the parameter exists and uses these two keys (count and type), 2) I use the correct syntax for the parameter, any other syntaxes would return a syntax error, and 3) the "nvidia-tesla-k80" value exists (it is described in --help) and is available in the region in which the model is deployed.
Make sure you are using a recent version of the Google Cloud SDK.
Then you can use the following command:
gcloud beta ai-platform versions create $VERSION_NAME \
--model $MODEL_NAME \
--origin gs://$MODEL_DIRECTORY_URI \
--runtime-version 1.15 \
--python-version 3.7 \
--framework tensorflow \
--machine-type n1-standard-4 \
--accelerator count=1,type=nvidia-tesla-t4
For reference you can enable logging during model creation:
gcloud beta ai-platform models create {MODEL_NAME} \
--regions {REGION}
--enable-logging \
--enable-console-logging
The format for the --accelerator parameter as you can check in the official documentation is:
--accelerator=count=1,type=nvidia-tesla-k80
I think that might cause your issue, let me know.

Is there a spark-defaults.conf when installed with pip install pyspark

I installed pyspark with pip.
I code in jupyter notebooks. Everything works fine but not I got a java heap space error when exporting a large .csv file.
Here someone suggested editing the spark-defaults.config. Also in the spark documentation, it says
"Note: In client mode, this config must not be set through the
SparkConf directly in your application, because the driver JVM has
already started at that point. Instead, please set this through the
--driver-memory command line option or in your default properties file."
But I'm afraid there is no such file when installing pyspark with pip.
I'm I right? How do I solve this?
Thanks!
I recently ran into this as well. If you look at the Spark UI under the Classpath Entries, the first path is probably the configuration directory, something like /.../lib/python3.7/site-packages/pyspark/conf/. When I looked for that directory, it didn't exist; presumably it's not part of the pip installation. However, you can easily create it and add your own configuration files. For example,
mkdir /.../lib/python3.7/site-packages/pyspark/conf
vi /.../lib/python3.7/site-packages/pyspark/conf/spark-defaults.conf
The spark-defaults.conf file should be located in:
$SPARK_HOME/conf
If no file is present, create one (a template should be available in the same directory).
How to find the default configuration folder
Check contents of the folder in Python:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark*"))
# ['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template',
# '/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
When no spark-defaults.conf file is available, built-in values are used
To my surprise, no spark-defaults.conf but just a template file was present!
Still I could look at Spark properties, either in the “Environment” tab of the Web UI http://<driver>:4040 or using getConf().getAll() on the Spark context:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.getOrCreate()
spark.sparkContext.getConf().getAll()
# [('spark.driver.port', '55128'),
# ('spark.app.name', 'myApp'),
# ('spark.rdd.compress', 'True'),
# ('spark.sql.warehouse.dir', 'file:/path/spark-warehouse'),
# ('spark.serializer.objectStreamReset', '100'),
# ('spark.master', 'local[*]'),
# ('spark.submit.pyFiles', ''),
# ('spark.app.startTime', '1645484409629'),
# ('spark.executor.id', 'driver'),
# ('spark.submit.deployMode', 'client'),
# ('spark.app.id', 'local-1645484410352'),
# ('spark.ui.showConsoleProgress', 'true'),
# ('spark.driver.host', 'xxx.xxx.xxx.xxx')]
Note that not all properties are listed but:
only values explicitly specified through spark-defaults.conf, SparkConf, or the command line. For all other configuration properties, you can assume the default value is used.
For instance, consider the default parallelism is in my case:
spark._sc.defaultParallelism
8
This is the default for local mode, namely the number of cores on the local machine--see https://spark.apache.org/docs/latest/configuration.html. In my case 8=2x4cores because of hyper-threading.
If passed the property spark.default.parallelism when launching the app
spark = SparkSession \
.builder \
.appName("Set parallelism") \
.config("spark.default.parallelism", 4) \
.getOrCreate()
then the property is shown in the Web UI and in the list
spark.sparkContext.getConf().getAll()
Precedence of configuration settings
Spark will consider given properties in this order (spark-defaults.conf comes last):
SparkConf
flags passed to spark-submit
spark-defaults.conf
From https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Note
Some pyspark Jupyter kernels contain flags for spark-submit in the environment variable $PYSPARK_SUBMIT_ARGS, so one might want to check that too.
Related question: Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark
The spark-defaults.config file is needed when we have to change any of the default configs for spark.
As #niuer suggested, it should be present in the $SPARK_HOME/conf/ directory. But that might not be the case with you. By default, a template config file will be present there. You can just add a new spark-defaults.conf file in $SPARK_HOME/conf/.
Check your spark path. There are configuration files under:
$SPARK_HOME/conf/, e.g.
spark-defaults.conf.

Convert CloudFormation template (YAML) to Troposphere code

I have a large sized CloudFormation template written in Yaml, I want to start using Troposphere instead. Is there any easy way to convert the CF template to Troposphere code?
I have noticed this script here https://github.com/cloudtools/troposphere/blob/master/troposphere/template_generator.py
This creates a Troposphere python object, but I am not sure if its possible to output it Troposphere code.
You can do it by converting the CF YAML to JSON and running the https://github.com/cloudtools/troposphere/blob/master/scripts/cfn2py passing the JSON file in as an argument.
Adding to the good tip from #OllieB
Install dependencies using pip or poetry:
https://python-poetry.org/docs/#installation
https://github.com/cloudtools/troposphere
https://github.com/awslabs/aws-cfn-template-flip
pip install 'troposphere[policy]'
pip install cfn-flip
poetry add -D 'troposphere[policy]'
poetry add -D cfn-flip
The command line conversion is something like:
cfn-flip -c -j template-vpc.yaml template-vpc.json
cfn2py template-vpc.json > template_vpc.py
WARNING: it appears that the cfn2py script might not be fully unit tested or something, because it can generate some code that does not pass troposphere validations. I recommend adding a simple round-trip test to the end of the output python script, e.g.
if __name__ == "__main__":
template_py = json.loads(t.to_json())
with open("template-vpc.json", "r") as json_fd:
template_cfn = json.load(json_fd)
assert template_py == template_cfn
See also https://github.com/cloudtools/troposphere/issues/1879 for an example of auto-generation of pydantic models from CFN json schemas.
from troposphere.template_generator import TemplateGenerator
import yaml
with open("aws/cloudformation/template.yaml") as f:
source_content = yaml.load(f, Loader=yaml.BaseLoader)
template = TemplateGenerator(source_content)
This snippet shall give you a template object from Troposphere library. You can then make modifications using Troposphere api.

Travis CI failed because installing dependencies timed out

Travis CI for my Github repo keeps failing and I didn't know why, until I read through the Job log and figured out that many dependencies couldn't be downloaded on their machines.
For example, the xcolor package failed to install:
[20/23, 03:54/04:02] install: xcolor [17k]
Downloading
ftp://tug.org/historic/systems/texlive/2015/tlnet-final/archive/xcolor.tar.xz
did not succeed, please retry.
TLPDB::_install_package: couldn't unpack ftp://tug.org/historic/systems/texlive/2015/tlnet-final/archive/xcolor.tar.xz to /home/travis/texmf
which results in the following error:
Latexmk: Missing input file: 'xcolor.sty' from line
'! LaTeX Error: File `xcolor.sty' not found.'
Latexmk: Log file says no output from latex
Latexmk: For rule 'pdflatex', no output was made
The problem is I'm working on a LaTeX project, and LaTeX environments tend to be huge with lots of supporting packages. Here is the relevant part of my setup.sh, which is already the minimal requirement:
sudo tlmgr install \
xkeyval ifthen amsmath bm \
longtable ctex tabu array \
colortbl berasans graphicx longtable \
etoolbox lastpage amssymb mathrsfs \
multirow xeCJK environ after \
booktabs hyperref epstopdf tabu \
fancyhdr listings amsfonts latexsym \
hhline CJK longtable pifont \
geometry ifpdf bmpsize hologo \
fancybox appendix amsbsy paralist \
tabularx xCJK2uni hologo calc \
fontenc ifxetex xcolor palatino
Can I make sure that all required packages are successfully downloaded & installed before the building phase, say, by letting the server to retry several times? If so, how?
From the Travis docs:
If you are getting network timeouts when trying to download dependencies, either use the built in retry feature of your dependency manager or wrap your install commands in the travis_retry function.
For example, in your .travis.yml:
install: travis_retry pip install myawesomepackage
travis_retry will attempt up three times if the return code is non-zero.

Submit a PySpark job to a cluster with the '--py-files' argument

I was trying to submit a job with the the GCS uri of the zip of the python files to use (via the --py-files argument) and the python file name as the PY_FILE argument value.
This did not seem to work. Do I need to provide some relative path for the PY_FILE value? The PY_FILE is also included in the zip.
e.g. in
gcloud beta dataproc jobs submit pyspark --cluster clustername --py-files gcsuriofzip PY_FILE
what should the value of PY_FILE be?
This is a good question. To answer this question, I am going to use the PySpark wordcount example.
In this case, I created two files, one called test.py which is the file I want to execute and another called wordcount.py.zip which is a zip containing a modified wordcount.py file designed to mimic a module I want to call.
My test.py file looks like this:
import wordcount
import sys
if __name__ == "__main__":
wordcount.wctest(sys.argv[1])
I modified the wordcount.py file to eliminate the main method and to add a named method:
...
from pyspark import SparkContext
...
def wctest(path):
sc = SparkContext(appName="PythonWordCount")
...
I can call the whole thing on Dataproc by using the following gcloud command:
gcloud beta dataproc jobs submit pyspark --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \
gs://<bucket>/input/input.txt
In this example <bucket> is the name (or path) to my bucket and <cluster-name> is the name of my Dataproc cluster.