Change Python version Livy uses in an EMR cluster

I am aware of "Change Apache Livy's Python Version" and "How do i setup Pyspark in Python 3 with spark-env.sh.template".
I have also read the Livy documentation.
However, none of that works. Livy keeps using Python 2.7 no matter what.
This is running Livy 0.6.0 on an EMR cluster.
I have set the PYSPARK_PYTHON environment variable to /usr/bin/python3 for the hadoop user, my own user, root, and ec2-user. Logging into the EMR master node via ssh and running pyspark starts python3 as expected. But Livy keeps using python2.7.
I added export PYSPARK_PYTHON=/usr/bin/python3 to the /etc/spark/conf/spark-env.sh file. Livy keeps using python2.7.
I added "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/bin/python3" and "spark.executorEnv.PYSPARK_PYTHON":"/usr/bin/python3" to each of the places listed below. In every case, Livy keeps using python2.7.
The sparkmagic config.json and config_other_settings.json files, before starting a PySpark kernel in Jupyter
The Session Properties in the sparkmagic %manage_spark Jupyter widget
A %%spark config cell-magic, before the line-magic %spark add --session test --url http://X.X.X.X:8998 --auth None --language python
Note: This works without any issues in another EMR cluster running Livy 0.7.0. I have gone over all of the settings on the other cluster and cannot find what is different. I did not have to do any of this on the other cluster; Livy just used python3 by default.
How exactly do I get Livy to use python3 instead of python2?

Finally just found an answer after posting.
I ran the following in a PySpark kernel Jupyter session cell before running any code to start the PySpark session on the remote EMR cluster via Livy.
%%configure -f
{
    "conf": {
        "spark.pyspark.python": "python3"
    }
}
Simply adding "spark.pyspark.python": "python3" to the .sparkmagic config.json or config_other_settings.json also worked.
It is confusing that this does not match the official Livy documentation.
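For a client that talks to Livy directly rather than through sparkmagic, the same setting can probably be passed in the body of the session-creation request. The sketch below only builds that JSON body and sends nothing; it assumes Livy's POST /sessions endpoint with its `kind` and `conf` fields, and the function name is hypothetical.

```python
import json


def livy_session_payload(python_executable="python3", kind="pyspark"):
    """Build the JSON body for a Livy POST /sessions request.

    Hypothetical helper: it only assembles the payload, it does not
    contact a Livy server.
    """
    return {
        "kind": kind,
        # Same Spark property used in the %%configure cell above.
        "conf": {"spark.pyspark.python": python_executable},
    }


body = json.dumps(livy_session_payload(), indent=2)
print(body)
```

The printed body is what would be POSTed to the Livy URL (port 8998 in the question) with a Content-Type of application/json.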

Related

Connection to remote Hadoop Cluster (CDP) through Linux server

I'm new to PySpark and I want to connect to a remote Hadoop cluster (CDP) from a Linux server using the spark-submit command.
I need the spark-submit command that connects to the remote CDP cluster.
Any help would be appreciated.

You can use Apache Livy to submit remote jobs to a CDP cluster. Here is how to install Livy and use it to submit jobs:
After downloading and unzipping Livy, add the following lines to the livy.conf file, then start the Livy service.
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
You can find examples of how to create a Spark submit script at the following links:
https://community.cloudera.com/t5/Community-Articles/Submit-a-Spark-Job-to-CDP-Data-Hub-using-the-Livy-REST-API/ta-p/322481
https://livy.apache.org/examples/
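As a rough illustration of what such a submit script does, the sketch below prepares a batch submission for Livy's POST /batches endpoint. The host name, file path, and arguments are placeholders, the helper name is hypothetical, and authentication (which a CDP cluster will normally require) is left out. The request is built but not sent.

```python
import json
import urllib.request


def build_batch_request(livy_url, app_file, args=None, conf=None):
    """Prepare (but do not send) a Livy POST /batches request.

    Hypothetical helper; `app_file` must be a path the cluster can read,
    such as an HDFS path.
    """
    payload = {"file": app_file}
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and paths, for illustration only.
request = build_batch_request(
    "http://livy.example.com:8998",
    "hdfs:///apps/script.py",
    args=["--date", "2024-01-01"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Passing `request` to urllib.request.urlopen() would submit the job.
```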

How to set the Jupyter default user for Pyspark in GCP Dataproc

In a Jupyter notebook connected to a GCP Spark cluster, the cell !pip3 install pyLDAvis==3.2.1 works, but gives a warning:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager.
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
The warning is not unique to pyLDAvis; other packages, even numpy, give the same warning.
Running the notebook as root shouldn't be the default. How can the default user in the notebook be set to singhj rather than root? I have searched through IPython Configuration and customization for any hints.
Configuration: Fresh cluster in GCP Dataproc, default Jupyter notebook, nothing customized.

The Jupyter server in a Dataproc cluster is run by the systemd service defined in the file /usr/lib/systemd/system/jupyter.service.
If you want to change the user it runs as, then you can modify that file and replace the line saying User=root with one saying the name of the user you want (e.g. User=singhj in your example).
Then, once the file has been updated, restart the systemd service by running the following commands as root:
systemctl daemon-reload
systemctl restart jupyter
If you want to automate that, you can write an initialization action to make the change at cluster creation time.
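If the edit is automated, the text substitution itself could look something like the following sketch. It works on the unit file's contents as a string; the sample unit text is invented for the demonstration and is not the real Dataproc file, and the function name is hypothetical. Writing the result back and running the systemctl commands above would still need root.

```python
import re


def set_service_user(unit_text, user):
    """Return the unit file text with its User= directive set to `user`."""
    # (?m) makes ^ and $ match at each line, so only the User= line changes.
    new_text, count = re.subn(r"(?m)^User=.*$", lambda _match: "User=" + user, unit_text)
    if count == 0:
        raise ValueError("no User= line found in unit file")
    return new_text


# Invented sample contents, not the actual jupyter.service file.
sample_unit = (
    "[Unit]\n"
    "Description=Jupyter Notebook\n"
    "\n"
    "[Service]\n"
    "User=root\n"
    "ExecStart=/usr/bin/jupyter notebook\n"
)
print(set_service_user(sample_unit, "singhj"))
```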

Can sparkmagic be used outside IPython?

I'm using a Jupyter notebook with the sparkmagic extension, but I can only access the Spark cluster by creating a PySpark kernel. The conflict is that I can't use my py3 environment (with some installed Python packages) in the PySpark kernel, and I can't use the Spark context in the python3 kernel.
I don't know how to bring packages into sparkmagic, so can I use the PySpark that sparkmagic provides from py3? Or are there any other options?

Both kernels, PySpark and the default IPython, can be used with a python3 interpreter on the PySpark side. It can be specified in ~/.sparkmagic/config.json. This is standard Spark configuration, and sparkmagic simply passes it to the Livy server running on the Spark master node.
"session_configs": {
    "conf": {
        "spark.pyspark.python": "python3"
    }
}
spark.pyspark.python: Python binary executable to use for PySpark in both driver and executors.
In this case, python3 is available as a command on the PATH of each node in the Spark cluster. You can also install it into a custom directory on each node and specify the full path, for example "spark.pyspark.python":"/Users/hadoop/python3.8/bin/python".
All Spark conf options can be passed like that.
There are two ways to make tensorflow importable:
Install it on all Spark machines (master and workers) via python3 -m pip install tensorflow.
Zip it, upload it, and pass the remote path through sparkmagic via the spark.submit.pyFiles setting. This accepts a path on S3, HDFS, or the master node's file system (not a path on your machine).
See answer about --py-files
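To add these settings without disturbing whatever else is already in ~/.sparkmagic/config.json, the merge could be done along these lines. This is only a sketch: the helper name is hypothetical, the starting config is a made-up sample, and the S3 path is a placeholder.

```python
import copy
import json


def merge_session_conf(config, spark_conf):
    """Return a copy of a sparkmagic config with extra Spark conf entries.

    The entries go under session_configs -> conf, as in the snippet above.
    The input dict is left unchanged.
    """
    merged = copy.deepcopy(config)
    conf = merged.setdefault("session_configs", {}).setdefault("conf", {})
    conf.update(spark_conf)
    return merged


# Made-up starting config and placeholder S3 path, for illustration only.
existing = {"session_configs": {"driverMemory": "1000M"}}
updated = merge_session_conf(
    existing,
    {
        "spark.pyspark.python": "python3",
        "spark.submit.pyFiles": "s3://my-bucket/deps/packages.zip",
    },
)
print(json.dumps(updated, indent=2))
```

In practice the dict would be loaded from and written back to the config file with json.load and json.dump.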

Pyspark / pyspark kernels not working in jupyter notebook

Here are installed kernels:
$ jupyter-kernelspec list
Available kernels:
apache_toree_scala /usr/local/share/jupyter/kernels/apache_toree_scala
apache_toree_sql /usr/local/share/jupyter/kernels/apache_toree_sql
pyspark3kernel /usr/local/share/jupyter/kernels/pyspark3kernel
pysparkkernel /usr/local/share/jupyter/kernels/pysparkkernel
python3 /usr/local/share/jupyter/kernels/python3
sparkkernel /usr/local/share/jupyter/kernels/sparkkernel
sparkrkernel /usr/local/share/jupyter/kernels/sparkrkernel
A new notebook was created but fails with
The code failed because of a fatal error:
Error sending http request and maximum retry encountered..
There is no [error] message in the Jupyter console.

If you use sparkmagic to connect your Jupyter notebook to Spark, you also need to start Livy, the REST API service that sparkmagic uses to talk to your Spark cluster.
Download Livy from the Apache Livy site and unzip it.
Check that the SPARK_HOME environment variable is set; if not, set it to your Spark installation directory.
Start the Livy server by running <livy_home>/bin/livy-server in a shell.
Now go back to your notebook; you should be able to run Spark code in a cell.
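Before going back to the notebook, it may help to confirm that Livy is actually answering. The sketch below assumes Livy's default port 8998 and its GET /sessions endpoint; the function name is hypothetical, and any connection failure is simply reported as "not reachable".

```python
import urllib.error
import urllib.request


def livy_is_reachable(livy_url, timeout=3.0):
    """Return True if GET <livy_url>/sessions answers with HTTP 200."""
    try:
        with urllib.request.urlopen(livy_url.rstrip("/") + "/sessions", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


# Assumes Livy runs on this machine with the default port.
print(livy_is_reachable("http://localhost:8998"))
```

If this prints False while the notebook reports "Error sending http request", the Livy server is the likely culprit rather than the kernel.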

Run specific virtualenv on dataproc cluster at spark-submit like in vanilla Spark

When I'm running on a vanilla spark cluster, and wanting to run a pyspark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, let's say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and then, so long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python" inside of script.py, the code will run against the virtual environment, and Spark will handle provisioning the virtual environment to all of my clusters.
When I do this on dataproc, first, the hadoop-style hash aliasing doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python" will produce:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly seeing my python executable, and trying to run against it, but there really appears to be some sort of parsing error. What gives? Is there any way to pass the live python executable to use to dataproc the way that you can against a vanilla spark cluster?

Turns out I was distributing python binaries across OSes, and was boneheaded enough to not notice that I was doing so, and the incompatibility was causing the crash.
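A quick way to catch this kind of mismatch before shipping an archive is to look at the first bytes of the interpreter inside it. The sketch below is a rough heuristic with a hypothetical function name, not a full file-type detector; it only recognizes a few common magic numbers.

```python
import sys


def executable_format(path):
    """Guess an executable's format from its first four bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic == b"\x7fELF":
        return "ELF (Linux)"
    # Thin Mach-O (64-bit, 32-bit) and universal binaries; note that
    # the universal magic is shared with Java class files.
    if magic in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe", b"\xca\xfe\xba\xbe"):
        return "Mach-O (macOS)"
    if magic[:2] == b"MZ":
        return "PE (Windows)"
    if magic[:2] == b"#!":
        return "script"
    return "unknown"


# Inspect the interpreter running this script; for the case above, the
# path to check would be bin/python inside the unpacked environment.
print(executable_format(sys.executable))
```

Anything other than ELF in an archive destined for Linux cluster nodes would point to the same cross-OS problem.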
