I have configured my .bash_profile as shown below. Please let me know if I'm missing anything here. I'm getting
No module named pyspark
# added by Anaconda3 5.2.0 installer
export PATH=/Users/pkumar5/anaconda3/bin:$PATH
export JAVA_HOME=/Library/Java/Home
# spark configuration
export SPARK_PATH=~/spark-2.3.2-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias snotebook='$SPARK_PATH/bin/pyspark --master "local[2]"'
I'm trying to use PySpark in a Jupyter notebook, and I'm getting the error "No module named pyspark". Please help me resolve this.
You might have to define the correct $PYTHONPATH (this is where Python looks for modules).
Also something worth checking: if you have installed pyspark correctly, it might be that you installed it for Python 3 while your Jupyter notebook kernel is running Python 2, in which case switching kernels would solve the issue.
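If Spark was installed by unpacking the binary package (as in the question), a common fix is to point PYTHONPATH at the python/ directory and the bundled py4j archive inside the Spark install. Here is a minimal sketch for ~/.bash_profile, reusing the path from the question; the py4j file name is an assumption, so check $SPARK_HOME/python/lib for the exact name:
# sketch: make the unpacked Spark's Python bindings importable
export SPARK_HOME=~/spark-2.3.2-bin-hadoop2.7
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
# quick check that this interpreter can now see pyspark
python -c "import pyspark; print(pyspark.__version__)"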
I'm taking a machine learning course and am trying to install pyspark to complete some of the class assignments. I downloaded pyspark from this link, unzipped it and put it in my home directory, and added the following lines to my .bash_profile.
export SPARK_PATH=~/spark-3.3.0-bin-hadoop2.6
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
However, when I try to run the command:
pyspark
to start a session, I get the error:
-bash: pyspark: command not found
Can someone tell me what I need to do to get pyspark working on my local machine? Thank you.
You are probably missing the PATH entry. Here are the environment variable changes I made to get pyspark working on my Mac:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.6.jdk/Contents/Home/
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3
export PATH=$JAVA_HOME/bin:$SPARK_HOME:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8889'
Also ensure that you have Java SE 8+ and Python 3.5+ installed.
Start the server with /opt/spark-3.3.0-bin-hadoop3/sbin/start-master.sh.
Then run pyspark and copy and paste the URL displayed on screen into a web browser.
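After reloading the profile, a couple of quick checks (a sketch; it only assumes the exports above) will confirm the PATH entry took effect:
# reload the profile, then verify the Spark binaries are on PATH
source ~/.bash_profile
which pyspark          # should resolve to $SPARK_HOME/bin/pyspark
spark-submit --version # should print the Spark version banner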
When installing PySpark for Jupyter Notebook, I use this command:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But I get the following message:
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what the problem is. I have set up Toree's Scala and SQL interpreters successfully. Thanks.
Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means and continue to look for a Jupyter PySpark kernel.
As also mentioned in Lee's answer, Toree version 0.3.0 removed support for PySpark and SparkR. As per their release notes, they asked to "use specific kernels". For PySpark, this means manually installing pyspark to be used with Jupyter.
The steps are simple:
Install pyspark, either with pip install pyspark, or by downloading the Apache Spark binary package and decompressing it into a specific folder.
Add the following three environment variables. How to do this depends on your OS; for example, on my macOS machine I added the following lines to ~/.bash_profile:
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter Notebook, simply run "pyspark" from your command line and choose the "Python" kernel.
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter or https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.
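For reference, here is a minimal sketch of the pip route end to end (the package name is the real PyPI name; no SPARK_HOME is needed in this case):
# sketch: install pyspark from PyPI and launch it through Jupyter Notebook
pip install pyspark
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark   # opens Jupyter Notebook; pyspark is importable from any new notebook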
I'm currently working in Zeppelin with Spark and Scala. I want to import the library that contains: import com.databricks.spark.xml.
I tried, but I still get the same error in Zeppelin: <console>:25: error: object databricks is not a member of package com.
What have I done so far? I created a note in Zeppelin with this code: %dep
z.load("com.databricks:spark-xml_2.11:jar:0.5.0"). Even with that, the interpreter doesn't work; it seems it doesn't manage to load the library.
Do you have any idea why it doesn't work?
Thanks for your help and have a nice day!
Your problem is very common and not intuitive to solve. I resolved a similar issue (I wanted to load the Postgres JDBC connector in AWS EMR from a Linux terminal). Your issue can be resolved by checking whether you can:
load the jar file manually into the environment that is hosting Zeppelin.
add the path of the jar file to your CLASSPATH environment variable. I don't know where you're hosting the files that manage your CLASSPATH env, but in EMR my file, viewed from the Zeppelin root directory, was here: /usr/lib/zeppelin/conf/zeppelin-env.sh
download the Zeppelin interpreter with:
$ sudo ./bin/install-interpreter.sh --name "" --artifact
add the interpreter in Zeppelin by going to the Zeppelin interpreter GUI and adding it to the interpreter group.
Reboot Zeppelin with:
$ sudo stop zeppelin
$ sudo start zeppelin
It's very likely that your configurations may vary slightly, but I hope this helps provide some structure and relevance.
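For this specific library, it may also be worth re-trying the %dep route with a plain group:artifact:version coordinate (no :jar: qualifier), run before the Spark interpreter has started. A hedged alternative is to let Spark pull the package from Maven itself via Zeppelin's environment file (its location varies by install):
# sketch: add to conf/zeppelin-env.sh, then restart Zeppelin
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-xml_2.11:0.5.0"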
What do I need to do beyond setting "zeppelin.pyspark.python" to make a Zeppelin interpreter use a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use a Python subprocess to locate the Python executable, it shows that the code is being run in the system's Python, not in the virtualenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
When you are inside your activated venv, in Python:
(my_venv)$ python
>>> import sys
>>> sys.executable
# http://localhost:8080/#/interpreters
# search for 'python'
# set `zeppelin.python` to the output of `sys.executable`
I am getting the following error while trying to run a SymPy file in order to contribute to SymPy. It is:
ImportError: No module named sympy
I installed the sympy module through pip for both Python 2.7 and Python 3.
Also, isympy is working.
Strangely, when I try to import sympy in Python's interactive console in the main sympy directory, no import errors are shown, but in some other directory it shows import errors.
Please help me install the sympy module in a way that will let me run the code.
Thanks.
Screenshot: importing the module in the Python console from the main directory.
Screenshot: importing the module in some other directory.
A likely cause here is that you are using two different Pythons. If you have Python installed multiple times (like Python 2 and Python 3), each has its own separate packages. You can check what Python you are using by printing sys.executable.
I should point out that for the purposes of contributing to SymPy, you generally want to run against the development version. That is, running Python from the SymPy directory and importing the development version from there, without actually installing it.
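A quick way to confirm which interpreter you are running and to install into exactly that one (the clone path below is a placeholder):
# print the interpreter actually in use, then install sympy into it
python -c "import sys; print(sys.executable)"
python -m pip install sympy
# for contributing, run Python from your SymPy clone, or use an editable install
python -m pip install -e /path/to/sympy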
Thanks for the reply.
But I solved the problem. I realised that I hadn't installed sympy in the current conda environment. When I tried installing it with the command:
conda install sympy
it worked, and no error is shown any more.
Thanks.
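For anyone who lands here with the same problem, a quick sanity check that the active conda environment now provides SymPy (a sketch):
# confirm sympy is visible in the active conda environment
conda list sympy
python -c "import sympy; print(sympy.__version__)"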