How to install a Python module on Dataproc cluster executors - PySpark

How can we make the xmltodict module available on all worker nodes of the cluster? It currently fails with a 'No module named xmltodict' error when applying xmltodict to a PySpark DataFrame.
The xmltodict Python module needs to be installed on every executor of the Dataproc cluster.
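One option, since xmltodict is a pure-Python package, is to ship it with the job itself instead of installing it on every node: upload a zip containing the module and register it with SparkContext.addPyFile (or the spark.submit.pyFiles property). The sketch below is only an illustration and assumes a hypothetical GCS path gs://my-bucket/deps/xmltodict.zip with xmltodict.py at the top level of the zip, and that the GCS connector shipped with Dataproc resolves the gs:// path. The alternative is a Dataproc initialization action that runs pip install xmltodict on every node at cluster creation time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xmltodict-on-executors").getOrCreate()
sc = spark.sparkContext

# Ship the pure-Python module to the driver and every executor for this job.
# gs://my-bucket/deps/xmltodict.zip is a hypothetical path; the zip must have
# xmltodict.py at its top level so it lands on the executors' sys.path.
sc.addPyFile("gs://my-bucket/deps/xmltodict.zip")

def parse_xml(xml_string):
    # Import inside the function so the lookup happens on the executor,
    # after addPyFile has made the module available there.
    import xmltodict
    return xmltodict.parse(xml_string)

rdd = sc.parallelize(["<note><to>executor</to></note>"])
print(rdd.map(parse_xml).collect())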

Related

Change Python version Livy uses in an EMR cluster

I am aware of Change Apache Livy's Python Version and How do i setup Pyspark in Python 3 with spark-env.sh.template, and I have also seen the Livy documentation.
However, none of that works: Livy keeps using Python 2.7 no matter what.
This is running Livy 0.6.0 on an EMR cluster.
I have changed the PYSPARK_PYTHON environment variable to /usr/bin/python3 for the hadoop user, my user, root, and ec2-user. Logging into the EMR master node via ssh and running pyspark starts python3 as expected, but Livy keeps using python2.7.
I added export PYSPARK_PYTHON=/usr/bin/python3 to the /etc/spark/conf/spark-env.sh file. Livy keeps using python2.7.
I added "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/bin/python3" and "spark.executorEnv.PYSPARK_PYTHON": "/usr/bin/python3" to each of the items listed below; in every case, Livy keeps using python2.7.
the sparkmagic config.json and config_other_settings.json files, before starting a PySpark kernel in Jupyter
the Session Properties in the sparkmagic %manage_spark Jupyter widget
a %%spark config cell-magic before the line-magic %spark add --session test --url http://X.X.X.X:8998 --auth None --language python
Note: this works without any issues on another EMR cluster running Livy 0.7.0. I have gone over all of the settings on that cluster and cannot find what is different; I did not have to do any of this there, Livy just used python3 by default.
How exactly do I get Livy to use python3 instead of python2?
I finally found an answer just after posting.
I ran the following in a Jupyter PySpark-kernel cell, before any other code, to start the PySpark session on the remote EMR cluster via Livy.
%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3"
  }
}
Simply adding "spark.pyspark.python": "python3" to the sparkmagic config.json or config_other_settings.json also worked.
It is confusing that this does not match the official Livy documentation.
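One quick way to confirm which interpreter the Livy-backed session actually got (a sketch, assuming the sc variable that sparkmagic creates in the PySpark kernel) is to print the Python version on both the driver and the executors from a notebook cell:
import platform
import sys

# Runs on the Livy-managed driver.
print("driver python :", sys.version.split()[0], "at", sys.executable)

# Runs on the executors; driver and executors can end up with different
# interpreters, so check both.
print("executor python:",
      sc.parallelize(range(2)).map(lambda _: platform.python_version()).distinct().collect())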

Pyspark kernel error while running jupyter notebook on Amazon EMR cluster

I'm new to AWS EMR. I would like to run a Jupyter notebook on my cluster from the command line with the PySpark kernel.
To create the cluster I ran the following command:
aws emr create-cluster --release-label emr-5.32.0 --name 'spark_jupyter_2' \
  --applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway Name=Hive \
  --ec2-attributes KeyName=…………..,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --region eu-central-1 --log-uri s3://………….../logs/ --no-termination-protected
Then I installed Jupyter with sudo pip3 install jupyter.
Then I signed in again with ssh -i "…………...pem" -L 8888:localhost:8888 hadoop@ec2-…………..compute.amazonaws.com, ran jupyter notebook, and opened the web page from the link on the screen.
Up to this step everything went smoothly, but when I tried to run a notebook with the PySpark kernel I got a kernel error.
I don't completely understand why that is. When I run pyspark from the command line there's no error.
What do I have to do to run Jupyter with the PySpark kernel without any errors?
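A kernel error at startup usually means that the kernel spec Jupyter launches points at an interpreter or package missing from the environment the notebook server runs in. One quick check, sketched below under the assumption that the PySpark kernel is the one provided by sparkmagic, is to list the kernel specs the locally installed Jupyter actually knows about and where each one resolves to:
from jupyter_client.kernelspec import KernelSpecManager

# List every kernel Jupyter can launch and the directory its kernel.json
# lives in; a PySpark entry pointing at a missing sparkmagic install (or at
# a different Python than the one Jupyter runs under) shows up here.
specs = KernelSpecManager().find_kernel_specs()
for name, path in specs.items():
    print(name, "->", path)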

Storing a pyspark dataframe to local file system using pandas

I have a PySpark DataFrame that I'm converting to pandas in order to store it as a CSV on my local file system, but pandas does not recognize my local file path:
pandas_df = df.toPandas()
pandas_df.to_csv('/home/dir/my.csv', index=False, encoding='utf-8', sep='|')
I'm getting this error: FileNotFoundError: [Errno 2] No such file or directory
Here is how I'm submitting the job:
/usr/bin/spark2-submit --master yarn --deploy-mode cluster <pyspark-file>.py
If you run the job with --deploy-mode cluster, the driver runs on whichever machine YARN assigns it to, so if to_csv is given a local file path, the output is written on whatever machine the driver happens to land on.
Check that the file path exists on all the machines in the cluster.
Check that appropriate permissions are granted on that file path.
Otherwise, try running the job with --deploy-mode client so the driver runs on the client machine; the two checks above still apply to the client machine.
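A sketch of both routes, assuming df is the PySpark DataFrame from the question and a hypothetical HDFS output path: either let Spark itself write the CSV to a distributed filesystem (no pandas, no local-path issue), or keep toPandas() but submit in client mode so the driver runs on the machine that actually has /home/dir.
# Option A: write through Spark to a distributed filesystem path
# (hdfs:///tmp/my_csv_output is a hypothetical location).
# coalesce(1) forces a single part file; only do this for small outputs.
df.coalesce(1).write.mode("overwrite") \
    .option("header", True) \
    .option("sep", "|") \
    .csv("hdfs:///tmp/my_csv_output")

# Option B: keep toPandas(), but run with --deploy-mode client so that
# the driver (and therefore to_csv) executes on the machine that owns
# /home/dir/my.csv.
pandas_df = df.toPandas()
pandas_df.to_csv("/home/dir/my.csv", index=False, encoding="utf-8", sep="|")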

Can sparkmagic be used outside of IPython?

I'm using a Jupyter notebook with the sparkmagic extension, but I can only access the Spark cluster by creating a PySpark kernel. The conflict is that I can't use the Python 3 environment (with its installed packages) in the PySpark kernel, and I can't use the Spark context in the Python 3 kernel.
I don't know how to bring packages into sparkmagic. Can I use the PySpark functionality that sparkmagic implements from Python 3, or are there other options?
Both kernels, PySpark and the default IPython one, can use a python3 interpreter for PySpark. It can be specified in ~/.sparkmagic/config.json; this is standard Spark configuration and is simply passed by sparkmagic to the Livy server running on the Spark master node.
"session_configs": {
  "conf": {
    "spark.pyspark.python": "python3"
  }
}
spark.pyspark.python: the Python binary executable to use for PySpark in both driver and executors.
In this case python3 must be available as a command on the PATH of each node in the Spark cluster. You can also install it into a custom directory on each node and specify the full path, e.g. "spark.pyspark.python": "/Users/hadoop/python3.8/bin/python".
All Spark conf options can be passed like that.
There are two ways to make tensorflow importable:
install it on all Spark machines (master and workers) via python3 -m pip install tensorflow
zip it, upload it, and pass the remote path through sparkmagic via the spark.submit.pyFiles setting; this accepts a path on S3, HDFS, or the master node file system (not a path on your local machine) - see the sketch below
See the answer about --py-files.
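A minimal sketch of the second route, using a hypothetical package name mypkg and a hypothetical S3 path: once "spark.submit.pyFiles": "s3://my-bucket/deps/mypkg.zip" is added under session_configs → conf in ~/.sparkmagic/config.json (next to spark.pyspark.python above), the zip is shipped to the driver and executors and the package can be imported like any other module in the notebook's PySpark session.
def apply_model(x):
    # Resolved from the shipped zip on the executor; mypkg and its
    # predict() function are hypothetical stand-ins for the real package.
    import mypkg
    return mypkg.predict(x)

# sc is the SparkContext that sparkmagic creates in the PySpark kernel.
results = sc.parallelize(range(4)).map(apply_model).collect()
print(results)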

Run specific virtualenv on dataproc cluster at spark-submit like in vanilla Spark

When I'm running on a vanilla Spark cluster and want to run a PySpark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and, so long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python"
inside script.py, the code runs against the virtual environment and Spark handles provisioning the virtualenv to all of the nodes.
When I do this on Dataproc, first, the Hadoop-style hash aliasing doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python" produces:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly finding my Python executable and trying to run it, but there appears to be some sort of parsing error. What gives? Is there any way to tell Dataproc which Python executable to use, the way you can on a vanilla Spark cluster?
It turns out I was distributing Python binaries built on one OS to nodes running another, and was boneheaded enough not to notice I was doing so; the incompatibility was causing the crash.
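A cheap guard against repeating that mistake, sketched here as something to run once from the job (using its own SparkContext sc) before relying on the packed environment: record the platform string of the machine where the virtualenv was built and compare it with what the driver and executors report. BUILD_PLATFORM below is a hypothetical value written down when venv.zip was created.
import platform

# Hypothetical string captured with platform.platform() on the machine
# where venv.zip was built.
BUILD_PLATFORM = "Linux-x86_64-with-glibc"

driver_platform = platform.platform()
executor_platforms = (
    sc.parallelize(range(sc.defaultParallelism))
      .map(lambda _: platform.platform())
      .distinct()
      .collect()
)

# The venv's compiled binaries (python itself plus any C extensions) must
# match the OS/architecture reported here, otherwise the workers fail with
# the kind of "Syntax error: word unexpected" crash shown above.
print("built on  :", BUILD_PLATFORM)
print("driver    :", driver_platform)
print("executors :", executor_platforms)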