I'm new to AWS EMR. I would like to run a Jupyter notebook with a PySpark kernel on my cluster, starting it from the command line.
To create the cluster I ran the following command:
aws emr create-cluster --release-label emr-5.32.0 --name 'spark_jupyter_2' \
  --applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway Name=Hive \
  --ec2-attributes KeyName=…………..,InstanceProfile=EMR_EC2_DefaultRole --service-role EMR_DefaultRole \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --region eu-central-1 --log-uri s3://………….../logs/ --no-termination-protected
Then I installed Jupyter with sudo pip3 install jupyter
Then I signed in again with ssh -i "…………...pem" -L 8888:localhost:8888 hadoop@ec2-…………..compute.amazonaws.com, ran the command jupyter notebook, and opened the link printed on the screen.
Up to this step everything went smoothly, but when I tried to run a notebook with the PySpark kernel I got "kernel error".
I don't understand why. When I run pyspark from the command line there is no error.
What do I have to do to run Jupyter with the PySpark kernel without errors?
In a Jupyter notebook connected to a GCP Spark cluster, the cell !pip3 install pyLDAvis==3.2.1 works, but gives a warning:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager.
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
The warning is not unique to pyLDAvis; other packages, even numpy, give the same warning.
Running the notebook as root shouldn't be the default. How can the default user in the notebook be set to singhj rather than root? I have searched through IPython Configuration and customization for any hints.
Configuration: Fresh cluster in GCP Dataproc, default Jupyter notebook, nothing customized.
The Jupyter server in a Dataproc cluster is run by the systemd service defined in the file /usr/lib/systemd/system/jupyter.service.
To change the user it runs as, edit that file and replace the line User=root with one naming the user you want (e.g. User=singhj in your example).
Then, once the file has been updated, restart the systemd service by running the following commands as root:
systemctl daemon-reload
systemctl restart jupyter
If you want to automate that, you can write an initialization action to make the change at cluster creation time.
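As a rough illustration of the edit described above, here is a minimal Python sketch that rewrites the User= directive in the text of a unit file. The function name set_service_user is hypothetical, and the sketch assumes the file has a [Service] section. It only transforms text; reading and writing /usr/lib/systemd/system/jupyter.service still has to be done as root.

```python
import re

# Assumption: the unit file has a [Service] section, as systemd service units do.
_USER_LINE = re.compile(r"^User=.*$", flags=re.MULTILINE)
_SERVICE_HEADER = re.compile(r"^\[Service\][ \t]*$", flags=re.MULTILINE)


def set_service_user(unit_text: str, user: str) -> str:
    """Return the unit file text with its User= directive set to `user`.

    Hypothetical helper for illustration; not part of Dataproc or systemd.
    """
    if _USER_LINE.search(unit_text):
        # Replace the first existing User= line (for example User=root).
        return _USER_LINE.sub(lambda _: f"User={user}", unit_text, count=1)
    # No User= line yet: add one directly under the [Service] header.
    return _SERVICE_HEADER.sub(
        lambda _: f"[Service]\nUser={user}", unit_text, count=1
    )
```

After writing the returned text back to the unit file, the systemctl daemon-reload and systemctl restart jupyter commands above are still needed for the change to take effect.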
I am aware of "Change Apache Livy's Python Version" and "How do i setup Pyspark in Python 3 with spark-env.sh.template".
I have also seen the Livy documentation.
However, none of that works. Livy keeps using Python 2.7 no matter what.
This is running Livy 0.6.0 on an EMR cluster.
I have changed the PYSPARK_PYTHON environment variable to /usr/bin/python3 for the hadoop user, my user, root, and ec2-user. Logging into the EMR master node via ssh and running pyspark starts python3 as expected. But Livy keeps using python2.7.
I added export PYSPARK_PYTHON=/usr/bin/python3 to the /etc/spark/conf/spark-env.sh file. Livy keeps using python2.7.
I added "spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/bin/python3" and "spark.executorEnv.PYSPARK_PYTHON":"/usr/bin/python3" to each of the items listed below. In every case, Livy keeps using python2.7.
The sparkmagic config.json and config_other_settings.json files, before starting a PySpark kernel in Jupyter
The Session Properties in the sparkmagic %manage_spark Jupyter widget
A %%spark config cell magic before the line magic %spark add --session test --url http://X.X.X.X:8998 --auth None --language python
Note: This works without any issues on another EMR cluster running Livy 0.7.0. I have gone over all of the settings on the other cluster and cannot find what is different. I did not have to do any of this on the other cluster; Livy just used python3 by default.
How exactly do I get Livy to use python3 instead of python2?
I finally found an answer just after posting.
Before running any other code, I ran the following in a cell of a Jupyter session using the PySpark kernel. It configures the PySpark session that Livy starts on the remote EMR cluster.
%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3"
  }
}
Simply adding "spark.pyspark.python": "python3" to the .sparkmagic config.json or config_other_settings.json also worked.
It is confusing that this does not match the official Livy documentation.
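For anyone who prefers to script the config.json change rather than edit the file by hand, here is a minimal Python sketch. It assumes the setting belongs under a "session_configs" → "conf" key in the sparkmagic config file, and the helper names with_pyspark_python and update_config_file are hypothetical. Treat it as an illustration, not as sparkmagic's own tooling.

```python
import json
from pathlib import Path


def with_pyspark_python(config: dict, python: str = "python3") -> dict:
    """Return a copy of a sparkmagic config dict with spark.pyspark.python set.

    Assumption: Spark conf entries live under session_configs -> conf.
    """
    updated = json.loads(json.dumps(config))  # deep copy of JSON-compatible data
    conf = updated.setdefault("session_configs", {}).setdefault("conf", {})
    conf["spark.pyspark.python"] = python
    return updated


def update_config_file(path: Path, python: str = "python3") -> None:
    """Apply the change to a config file, creating it if it does not exist."""
    config = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(with_pyspark_python(config, python), indent=2))
```

A typical call would be update_config_file(Path.home() / ".sparkmagic" / "config.json"), followed by restarting the kernel so the new session picks up the setting.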
I'm using a Jupyter notebook with the sparkmagic extension, but I can only access the Spark cluster by creating a PySpark kernel. The conflict is that I can't use the py3 environment (with some installed Python packages) in the PySpark kernel, and I can't use the Spark context in the python3 kernel.
I don't know how to introduce packages in sparkmagic, so can I use the PySpark kernel, which is actually implemented by sparkmagic, with py3? Or are there any other options?
Both kernels, PySpark and the default IPython, can be used with a python3 interpreter on the PySpark side. The interpreter can be specified in ~/.sparkmagic/config.json. This is a standard Spark configuration option, and sparkmagic simply passes it on to the Livy server running on the Spark master node.
"session_configs": {
  "conf": {
    "spark.pyspark.python": "python3"
  }
}
spark.pyspark.python: Python binary executable to use for PySpark in both driver and executors.
In this case python3 is available as a command on the PATH of each node in the Spark cluster. You can also install it into a custom directory on each node and specify the full path, for example "spark.pyspark.python":"/Users/hadoop/python3.8/bin/python"
All Spark conf options can be passed like that.
There are 2 ways to import tensorflow:
Install it on all Spark machines (master and workers) via python3 -m pip install tensorflow
Zip it, upload it, and pass the remote path through sparkmagic via the spark.submit.pyFiles setting. This accepts a path on s3, hdfs or the master node file system (not a path on your machine).
See answer about --py-files
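To make the second option more concrete, here is a minimal Python sketch of the zip step and of the matching session config. The helper names zip_package and py_files_conf are hypothetical, and uploading the archive to s3 or hdfs is left out. One caveat: Python cannot import compiled extension modules from a zip archive, so this route generally suits pure-Python packages. A package like tensorflow, which ships native libraries, will usually need the first option.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir: Path, archive_path: Path) -> Path:
    """Zip a pure-Python package directory for use with spark.submit.pyFiles."""
    package_dir = package_dir.resolve()
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(package_dir.rglob("*.py")):
            # Store paths relative to the package's parent directory so the
            # top-level folder name (the import name) is kept in the archive.
            arcname = file.relative_to(package_dir.parent).as_posix()
            archive.write(file, arcname)
    return archive_path


def py_files_conf(remote_path: str) -> dict:
    """Build the session config that points Spark at the uploaded archive."""
    return {"conf": {"spark.submit.pyFiles": remote_path}}
```

The dict returned by py_files_conf can be merged into "session_configs" in ~/.sparkmagic/config.json or passed in a %%configure -f cell, once the archive has been uploaded to a location the cluster can read.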
I have a Spark HDInsight cluster with 1 worker node. I need to use the scikit-neuralnetwork and vaderSentiment modules in a PySpark Jupyter notebook.
I installed the libraries using the commands below:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Next I open the pyspark terminal, and I am able to import the modules successfully.
Now I open a Jupyter PySpark notebook, where the same import fails.
Just to add, I am able to import pre-installed modules from Jupyter, for example "import pandas".
The installation goes to:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "vaderSentiment"
/usr/bin/anaconda/lib/python2.7/site-packages/vaderSentiment
/usr/local/lib/python2.7/dist-packages/vaderSentiment
For pre-installed modules:
admin123@hn0-linuxh:/usr/bin/anaconda/bin$ sudo find / -name "pandas"
/usr/bin/anaconda/pkgs/pandas-0.17.1-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/pandas-0.16.2-np19py27_0/lib/python2.7/site-packages/pandas
/usr/bin/anaconda/pkgs/bokeh-0.9.0-np19py27_0/Examples/bokeh/compat/pandas
/usr/bin/anaconda/Examples/bokeh/compat/pandas
/usr/bin/anaconda/lib/python2.7/site-packages/pandas
The sys.executable path is the same in both Jupyter and the terminal.
import sys
print(sys.executable)
/usr/bin/anaconda/bin/python
Any help would be greatly appreciated.
The issue is that while you are installing it on the headnode (one of the VMs), you are not installing it on all the other VMs (worker nodes). When the Pyspark app for Jupyter gets created, it gets run in YARN cluster mode, and so the application master starts in a random worker node.
One way of installing the libraries in all worker nodes would be to create a script action that runs against worker nodes and installs the necessary libraries:
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-customize-cluster-linux/
Do note that there are two Python installations in the cluster, and you have to refer to the Anaconda installation explicitly. Installing scikit-neuralnetwork would look something like this:
sudo /usr/bin/anaconda/bin/pip install scikit-neuralnetwork
The second way of doing this is to simply ssh into the worker nodes from the headnode. First, ssh into the headnode, then find the worker node IPs by going to Ambari at: https://YOURCLUSTER.azurehdinsight.net/#/main/hosts. Then, ssh 10.0.0.# and run the installation commands yourself on each worker node.
I did this for scikit-neuralnetwork, and while it does import correctly, it throws an error saying it cannot create a file in ~/.theano. Because YARN runs PySpark sessions as the nobody user, Theano cannot create its config file. After a little digging, I see that there is a way to change where Theano writes and looks for its config file. Please take care of that as well while doing the installation: http://deeplearning.net/software/theano/library/config.html#envvar-THEANORC
Forgot to mention: to modify an env var, you need to set the variable when creating the PySpark session. Execute this in the Jupyter notebook:
%%configure -f
{
  "conf": {
    "spark.executorEnv.THEANORC": "{YOURPATH}",
    "spark.yarn.appMasterEnv.THEANORC": "{YOURPATH}"
  }
}
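The same pattern works for any environment variable: the value is set once for the executors and once for the YARN application master. As a small sketch, the hypothetical helpers below (spark_env_conf and configure_cell are not part of sparkmagic) build that payload and the text of the cell for a given variable name.

```python
import json


def spark_env_conf(name: str, value: str) -> dict:
    """Build a session config that sets one env var for executors and the YARN AM."""
    return {
        "conf": {
            f"spark.executorEnv.{name}": value,
            f"spark.yarn.appMasterEnv.{name}": value,
        }
    }


def configure_cell(payload: dict) -> str:
    """Render the payload as the text of a %%configure -f notebook cell."""
    return "%%configure -f\n" + json.dumps(payload, indent=2)
```

For example, print(configure_cell(spark_env_conf("THEANORC", "{YOURPATH}"))) produces a cell equivalent to the one above, which can then be pasted into the notebook.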
Thanks!
An easy way to resolve this was:
Create a bash script:
cd /usr/bin/anaconda/bin/
export PATH=/usr/bin/anaconda/bin:$PATH
conda update matplotlib
conda install Theano
pip install scikit-neuralnetwork
pip install vaderSentiment
Copy the bash script created above to any container in an Azure storage account.
While creating the HDInsight Spark cluster, use a script action and give the script's path as the URL. For example: https://sa-account-name.blob.core.windows.net/containername/path-of-installation-file.sh
Install it on both head nodes and worker nodes.
Now, open Jupyter and you should be able to import the modules.
I haven't yet managed to get Spark, Scala, and Jupyter to co-operate. Does anyone have a simple recipe? Which version of each component did you use?
Apache Toree is compatible with Dataproc's 1.0 image, which currently includes Spark 1.6.1. I tried unsuccessfully to use it with the preview image, which includes the Spark 2.0 preview. To install Toree on the Dataproc master you can run:
sudo apt install python3-pip
pip3 install --user jupyter
export SPARK_HOME=/usr/lib/spark
pip3 install --pre --user toree
export PATH=$HOME/.local/bin:$PATH
jupyter toree install --user --spark_home=$SPARK_HOME
Spark is included as standard on Dataproc clusters.
Here is a gcloud command you can use to create a Dataproc cluster (named "dplab") that includes Jupyter listening on port 8124:
$ gcloud dataproc clusters create dplab \
    --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --metadata "JUPYTER_PORT=8124" \
    --zone=us-central1-c
Then run this command to port-forward from your host to the cluster master:
$ gcloud compute ssh dplab-m \
    --ssh-flag="-Llocalhost:8124:localhost:8124" --zone=us-central1-c
Open localhost:8124 in your browser and you should see the Jupyter page.
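If the page does not load, it can help to check whether anything is listening on the forwarded port before debugging Jupyter itself. Here is a minimal sketch using only the Python standard library. The function name port_is_open is hypothetical, and the default of 8124 simply mirrors the port used in the commands above.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 8124, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Running port_is_open() on your own machine while the gcloud compute ssh session is active should return True. False suggests that the tunnel is not up or that nothing is listening on that local port.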