I'm taking a machine learning course and am trying to install pyspark to complete some of the class assignments. I downloaded pyspark from this link, unzipped it and put it in my home directory, and added the following lines to my .bash_profile.
export SPARK_PATH=~/spark-3.3.0-bin-hadoop2.6
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
However, when I try to run the command:
pyspark
to start a session, I get the error:
-bash: pyspark: command not found
Can someone tell me what I need to do to get pyspark working on my local machine? Thank you.
You are probably missing the PATH entry. Here are the environment variable changes I made to get pyspark working on my Mac:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.6.jdk/Contents/Home/
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'
Also make sure you have Java SE 8+ and Python 3.7+ installed (Spark 3.3.0 no longer supports older Python versions).
Start the master by running /opt/spark-3.3.0-bin-hadoop3/sbin/start-master.sh.
Then run pyspark and copy and paste the URL displayed on screen into your web browser.
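If you want to confirm the PATH diagnosis before editing your profile, a small Python sketch like this one may help. It checks whether a pyspark launcher is reachable on a given PATH and, if not, whether it at least exists under the Spark directory. The name find_pyspark and its arguments are my own and not part of Spark; the sketch assumes the usual layout where the launcher lives in bin/ under the Spark directory.

```python
import os
import shutil


def find_pyspark(spark_home, path=None):
    """Return the pyspark launcher found on PATH, or None if there is none.

    spark_home: directory where Spark was unpacked (hypothetical argument).
    path: PATH string to search; defaults to the current process PATH.
    """
    search_path = os.environ.get("PATH", "") if path is None else path
    found = shutil.which("pyspark", path=search_path)
    if found is not None:
        return found

    # Not on PATH: report whether the launcher exists where we expect it.
    candidate = os.path.join(os.path.expanduser(spark_home), "bin", "pyspark")
    if os.path.isfile(candidate):
        print("pyspark exists at %s but its directory is not on PATH" % candidate)
    else:
        print("no pyspark launcher found under %s" % spark_home)
    return None


if __name__ == "__main__":
    # Path taken from the question; adjust to wherever Spark was unpacked.
    print(find_pyspark("~/spark-3.3.0-bin-hadoop2.6"))
```

If this prints None together with the "not on PATH" message, adding $SPARK_HOME/bin to PATH as shown above should be enough.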
I installed pyspark with pipenv install pyspark, and I run pyspark after activating the environment with pipenv shell.
I am able to open the pyspark shell and run some Spark code.
However, I am trying to figure out how to enable Hive. Where do I need to place hive-site.xml (with the MySQL metastore properties)? I cannot find any spark/config folder to put it in.
Unfortunately, the existing application relies heavily on the Pipfile, so I have to stick with pipenv install pyspark.
When I install PySpark for Jupyter notebook, I use this command:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But I get this message:
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what the problem is. I have set up Toree's Scala and SQL interpreters successfully. Thanks.
Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means, and I am still looking for a Jupyter PySpark kernel.
As also mentioned in Lee's answer, Toree version 0.3.0 removed support for PySpark and SparkR. Their release notes say to "use specific kernels". For PySpark, this means manually installing pyspark for use with Jupyter.
The steps are simple:
Install pyspark, either with pip install pyspark or by downloading the Apache Spark binary package and decompressing it into a folder of your choice.
Add the following 3 environment variables. How to do this depends on your OS. For example, on macOS, I added the following lines to the file ~/.bash_profile:
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter Notebook, simply run "pyspark" from your command line and choose the "Python" kernel.
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter
or
https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.
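If the notebook does not come up after these steps, it can help to confirm that the three variables actually reached your shell. Here is a rough Python sketch for that. The function name check_jupyter_pyspark_env is made up for illustration, and the rules it applies are only my reading of the steps above (for example, it assumes SPARK_HOME should point to an existing directory, which may not apply if you installed with pip).

```python
import os


def check_jupyter_pyspark_env(env=None):
    """Return a list of problems with the three variables from the steps above.

    env: mapping of environment variables; defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(os.path.expanduser(spark_home)):
        problems.append("SPARK_HOME is not a directory: %s" % spark_home)

    if env.get("PYSPARK_DRIVER_PYTHON") != "jupyter":
        problems.append("PYSPARK_DRIVER_PYTHON should be 'jupyter'")

    # The options may carry extra flags, but should start with 'notebook'.
    opts = env.get("PYSPARK_DRIVER_PYTHON_OPTS", "").split()
    if not opts or opts[0] != "notebook":
        problems.append("PYSPARK_DRIVER_PYTHON_OPTS should start with 'notebook'")

    return problems


if __name__ == "__main__":
    for problem in check_jupyter_pyspark_env():
        print(problem)
```

Run it from a new terminal so that it sees the values from ~/.bash_profile; no output means all three checks passed.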
I have configured my .bash_profile as shown below. Please let me know if I am missing anything. I'm getting:
No module named pyspark
# added by Anaconda3 5.2.0 installer
export PATH=/Users/pkumar5/anaconda3/bin:$PATH
export JAVA_HOME=/Library/Java/Home
# spark configuration
export SPARK_PATH=~/spark-2.3.2-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias snotebook='$SPARK_PATH/bin/pyspark --master "local[2]"'
I'm trying to use pyspark in a Jupyter notebook, but I get the error "No module named pyspark". Please help me resolve it.
You might have to define the correct $PYTHONPATH (this is where Python looks for modules).
Also something you might want to check: if you have installed pyspark correctly, it might be that you installed it for Python 3 while your Jupyter notebook kernel is running Python 2, so switching kernel would solve the issue.
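To illustrate the $PYTHONPATH point, here is a minimal sketch of what "making pyspark importable" could look like from inside a notebook when Spark was unpacked from the binary download. It assumes the usual layout of that download, with the Python sources under python/ and a py4j-*-src.zip under python/lib/. The function names are mine, not part of Spark.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the entries Python needs in order to import pyspark."""
    python_dir = os.path.join(os.path.expanduser(spark_home), "python")
    # py4j ships as a zip next to the pyspark sources; its version varies.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend the Spark Python entries to sys.path if they are missing."""
    for entry in reversed(spark_python_paths(spark_home)):
        if entry not in sys.path:
            sys.path.insert(0, entry)


if __name__ == "__main__":
    # Path taken from the question; adjust to your own Spark directory.
    add_spark_to_sys_path("~/spark-2.3.2-bin-hadoop2.7")
    print(sys.path[:2])
```

Calling add_spark_to_sys_path in the first notebook cell has the same effect as exporting those entries in $PYTHONPATH, but only for that kernel session.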
What do I need to do beyond setting "zeppelin.pyspark.python" to make a Zeppelin interpreter use a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use Python's subprocess module to locate the Python executable, it shows that the code is being run in the system's Python, not in the virtualenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
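To verify that both properties took effect, you can compare the interpreter on the driver with the one the executors use. Below is a minimal sketch. report_python is a name I made up; the commented-out line assumes it runs in a %pyspark paragraph where Zeppelin has already provided the SparkContext as sc.

```python
def report_python(_):
    """Return the path of the interpreter that runs this function."""
    import sys  # imported here so the function is self-contained on executors
    return sys.executable


# Driver side: this is what "zeppelin.pyspark.python" should control.
print("driver: %s" % report_python(None))

# Executor side: this is what "spark.executorEnv.PYSPARK_PYTHON" should control.
# Uncomment inside a %pyspark paragraph, where `sc` is assumed to exist:
# print(sc.parallelize(range(8), 4).map(report_python).distinct().collect())
```

If the executor line still shows the system Python while the driver line shows the virtualenv, only the first of the two properties is being applied.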
When you are in your activated venv in python:
(my_venv)$ python
>>> import sys
>>> sys.executable
# http://localhost:8080/#/interpreters
# search for 'python'
# set `zeppelin.python` to output of `sys.executable`
Fix: sorry, all is fine. The error occurred because a module, jinja2, was not installed in the new environment.
This is my first time using virtualenvwrapper, so I am a little confused.
Setup went fine and I read the docs, but I still don't understand a few things.
In my .bashrc file I've set:
# virtualenvwrapper
export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Snakepit
source /usr/bin/virtualenvwrapper.sh
I already have my project files, so I thought I should do the following:
Go into ~/Snakepit/ directory, run mkvirtualenv -p /usr/bin/python2 [ envname ]
(I need this specific version for my project), and I saw it created in
~/.virtualenvs/ dir.
My command prompt changes, showing me that my new environment is [ envname ].
When I do now: python -V, it shows that I am using version 2.7 of python, so
all is well!
But when I now move my project files into the Snakepit directory and try
running my program with python myprogram.py, it shows me errors because it
still tries to run my program with Python 3.
How is that possible when python -V shows version 2.7?
The error was not about which Python version was being run, but about a module missing from the newly created environment. I will leave this here for future reference.
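For anyone who runs into the same confusion, a quick way to tell the two cases apart is to print which interpreter is running and test the imports directly. This is only a sketch; missing_modules is a hypothetical helper, and jinja2 is simply the module that was missing in my case. It is written to run on both Python 2.7 and Python 3.

```python
import importlib
import sys


def missing_modules(names):
    """Return the names from `names` that cannot be imported here."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("interpreter: %s" % sys.executable)
    print("version: %s" % sys.version.split()[0])
    print("missing: %s" % missing_modules(["jinja2"]))
```

If the interpreter path points into ~/.virtualenvs/ and the module is listed as missing, the environment is fine and the package just needs to be installed into it.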