I had installed pyspark in a python virtualenv. I have also installed jupyterlab which was newly released http://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html in the virtualenv. I was unable to fire pyspark within a jupyter-notebook in such a way that I have the SparkContext variable available.
First fire the virtualenv
source venv/bin/activate
export SPARK_HOME={path_to_venv}/lib/python2.7/site-packages/pyspark
export PYSPARK_DRIVER_PYTHON=jupyter-lab
Before this I hope you have done:pip install pyspark and pip install jupyterlab inside your virtualenv
To check, once your jupyterlab is open, type sc in a box in the jupyterlab and you should have the SparkContext object available and the output should be this:
SparkContext
Spark UI
Version
v2.2.1
Master
local[*]
AppName
PySparkShell
You need to export your $PYSPARK_PYTHON with your virtualenv
export PYSPARK_PYTHON={path/to/your/virtualenv}/bin/python
That solved my case.
In my case, working with windows, python 3.7.4 and spark 3.1.1., the problem was that pyspark was looking for python3.exe that did not exist. I made a copy of venv/Scripts/python.exe and renamed venv/Scripts/python3.exe
Related
I have one EMR cluster up and running. In it, I have one Jupyter Notebook with pyspark kernel.
For the master node, I am able to SSH into it. I am able to install Python packages in the master node easily, such as :
pip install pandas
which I can then verify successful with pip freeze
However, when I go to the pyspark notebook, using sc.list_packages(), I see a different list of packages in there. Some package has different version compared to in the master node. Some package (such as pandas) does not appear altogether.
Here is the list of pip freeze in master node SSH.
aws-cfn-bootstrap==2.0
beautifulsoup4==4.9.1
boto==2.49.0
click==7.1.2
Cython==0.29.30
docutils==0.14
jmespath==0.10.0
joblib==0.15.1
lockfile==0.11.0
lxml==4.5.1
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.6
pandas==1.3.5
py-dateutil==2.2
py4j==0.10.9.5
pybind11==2.9.2
pyspark==3.3.0
pystache==0.5.4
python-daemon==2.2.3
python-dateutil==2.8.2
python37-sagemaker-pyspark==1.3.0
pytz==2020.1
PyYAML==5.3.1
regex==2020.6.8
scipy==1.7.3
simplejson==3.2.0
six==1.13.0
soupsieve==1.9.5
tqdm==4.46.1
windmill==1.6
And here is the package list in the PySpark notebook using sc.list_packages():
aws-cfn-bootstrap (2.0)
beautifulsoup4 (4.9.1)
boto (2.49.0)
click (7.1.2)
docutils (0.14)
jmespath (0.10.0)
joblib (0.15.1)
lockfile (0.11.0)
lxml (4.5.1)
mysqlclient (1.4.2)
nltk (3.5)
nose (1.3.4)
numpy (1.16.5)
pip (9.0.1)
py-dateutil (2.2)
pystache (0.5.4)
python-daemon (2.2.3)
python37-sagemaker-pyspark (1.3.0)
pytz (2020.1)
PyYAML (5.3.1)
regex (2020.6.8)
setuptools (28.8.0)
simplejson (3.2.0)
six (1.13.0)
soupsieve (1.9.5)
tqdm (4.46.1)
UNKNOWN (1.3.5)
wheel (0.29.0)
windmill (1.6)
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
You are using pip version 9.0.1, however version 22.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Note that pandas, scipy and pip are different. Why are they different? How do I upgrade or update the list in the PySpark notebook?
Log into the master node and run sudo docker ps -a. You should see a container named something like emr/jupyter-notebook:6.0.3 and that's where your Jupyter Notebook is running; it is not running in the master node.
If you decide to install any packages in the master node, the Jupyter Notebook will not see them. This is the reason why your packages do not match. To install packages in the Jupyter Notebook I use a requirements file, which contains the packages I want to install, and invoke a bootstrap action script that installs those packages. An important detail is to make sure that if you do specify a package version then it must be supported by the Python version running in the container. To find out just run a step in the Jupyter Notebook:
import sys
print(sys.version)
To find the latest packages that go with a specific version of Python, I highly recommend using Anaconda. For example
conda create --name requests python=3.7.9 matplotlib
will tell me the latest version of matplotlib that works with Python 3.7.9
I am using a Jupyter Notebook which is provided by an AWS managed service called EMR Studio. My understanding of how these notebooks work is that they are hosted on EC2 instances that I provision as part of my EMR cluster. Specifically with the PySpark kernel using the task nodes.
Currently when I run the command sc.list_packages() I see that pip is at version 9.0.1 whereas if I SSH onto the master node and run pip list I see that pip is at version 20.2.2. I have issues running the command sc.install_pypi_package() due to the lowered pip version in the Notebook.
In the notebook cell if I run import pip then pip I see that the module is located at
<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'>
I am assuming this is most likely within a virtualenv of some sort running as an application on the task node? I am unsure of this and I have no concrete evidence of how the virtualenv is provisioned if there is one.
If I run sc.uninstall_package('pip') then sc.list_packages() I see pip at a version of 20.2.2 which is what I am looking to initially start off with. The module path is the same as previously mentioned.
How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?
If I import a package like numpy I see that the module is located at a different location from where pip is. Any reason for this?
<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>
As for pip 9.0.1 the only reference I can find at the moment is in /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory outside of this I see a file called virtualenv-15.1.0-py2.7.egg-info which if I cat the file states that it upgrades to pip 9.0.1. I have tried to remove the pip 9.0.1 wheel file and replaced it with a pip 20.2.2 wheel which caused issues with the PySpark kernel being able to provision properly. There is also a virtualenv.py file which does reference a __version__ = "15.1.0".
I was able to find a solution on updating pip, setuptools, and wheel in the virtualenv that PySpark uses.
I initially had to determine how pip 9 is being sourced. By SSH'ing to my EMR Master node I changed directories into the root cd / and then ran the command sudo find . -name "pip*" to recursively search for where pip files may be located at.
In my scenario there is a pip 9 wheel located at:
./usr/lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl
By searching around a bit more in /usr/lib/python2.7/site-packages there is a virtualenv.py that is being invoked to create the virtualenv and is explained a bit more below.
Within the PySpark notebook session using %%info shows that the virtualenv is created from this file path (thanks Parag):
'spark.pyspark.virtualenv.bin.path': '/usr/bin/virtualenv'
Running cat /usr/bin/virtualenv shows that the virtualenv is being invoked from the following commands:
#!/usr/bin/python
import virtualenv
virtualenv.main()
This version of python in /usr/bin is python2.7. At the terminal I ran the following commands in sequence:
/usr/bin/python
import virtualenv
virtualenv
This outputs:
<module 'virtualenv' from '/usr/lib/python2.7/site-packages/virtualenv.py'>
I have sometimes seen a virtualenv.pyc file being used here which is located in /usr/lib/python2.7/site-packages/ but I have seen other users suggest that .pyc files can be deleted.
On the EMR master node I ran the command /usr/bin/virtualenv which shows some flags that can be used. First I used /usr/bin/virtualenv --verbose ./myVE which shows that pip 9.0.1 is packaged into the virtualenv I created. If I run /usr/bin/virtualenv --verbose --download ./myVE2 this shows output that an updated version of pip, setuptools, and wheel are being downloaded from Artifactory (our private PyPi mirror) into the virtualenv. There is a /etc/pip.conf that we use to setup the index-url and trusted host for Artifactory to be used instead of PyPi.
At this point it seems that the EMR cluster's virtualenv.py file as a default does not download updated wheels from Artifactory/PyPi and instead uses the wheel files located in /usr/lib/python2.7/site-packages/virtualenv_support/*.whl
Running cat /usr/lib/python2.7/site-packages/virtualenv.py shows that this version of virtualenv is 15.1.0 which is very outdated (2016 release).
Reading more into virtualenv.py shows that the main() function has a block of code as follows:
parser.add_option(
"--download",
dest="download",
action="store_true",
help="Download preinstalled packages from PyPI.",
)
I compared this virtualenv.py file on my EMR master to the official release of virtualenv==15.1.0 from PyPi (https://pypi.org/project/virtualenv/15.1.0/). I downloaded the tar.gz file and unzipped it on my local machine. There is a virtualenv.py file in the unzipped folder. When comparing contents using diff of the official virtualenv.py file to the EMR cluster's virtualenv.py file there are only a couple of lines that are not the same. The main difference is that parser.add_option from the code block above has default=True, in the official virtualenv.py file. The EMR cluster's virtualenv.py file does not have this.
parser.add_option(
"--download",
dest="download",
action="store_true",
default=True,
help="Download preinstalled packages from PyPI.",
)
What I did from here was I copied the EMR cluster's virtualenv.py and updated the line of code to set default=True,. I then used this updated virtualenv.py as part of an EMR bootstrap script so that this file is updated on all node types (master/core/task).
The bootstrap script does the following:
sudo rm /usr/lib/python2.7/site-packages/virtualenv.pyc
sudo rm /usr/lib/python2.7/site-packages/virtualenv.py
sudo aws s3 cp <UPDATED_VIRTUALENV_S3_PATH> /usr/lib/python2.7/site-packages/
Ensure that the copied file from S3 is just called virtualenv.py in the event that this causes any issues due to filenames not being kept the same.
Now when I start up a PySpark kernel the spark.pyspark.virtualenv.bin.path invokes the updated virtualenv.py file and I am able to confirm that pip is at a much higher version number (20+) which is what I was looking to achieve.
I am using an EMR notebook attached to my cluster for some experimentation purposes. I needed to install some python modules for testing, specifically spacy and it's data module en_core_web_sm.
I ssh'ed into the master and core nodes and downloaded the modules individually. However I am not able to import from the my EMR notebook. I get the following error :
An error was encountered:
No module named 'spacy'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'spacy'
I know there is a way to install them just for the scope of EMR notebook, but this wouldn't suffice in a production scenario, so please avoid answers which suggest notebook installing as mentioned in this guide : https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
Please let me know if I am missing some setup steps. Appreciate your response.
You can use bootstraps to install additional modules while creating your EMR
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
I was able to solve this by changing the bootstrap script to use sudo instead of --user. (You could also manually change run the scripts below)
Before I was running
pip3 install spacy --user
python3 -m spacy download en --user
I changed that script to
sudo pip3 install spacy
sudo python3 -m spacy download en
To verify this solution quickly issue the following commands from your EMR notebook (to compare before and after)
sc.list_packages()
You should see an output similar to this
SparkSession available as 'spark'.
Package Version
-------------------------- ----------
beautifulsoup4 4.9.0
blis 0.4.1
boto 2.49.0
catalogue 1.0.0
certifi 2020.4.5.2
chardet 3.0.4
cymem 2.0.3
en-core-web-sm 2.3.0
idna 2.9
importlib-metadata 1.6.1
jmespath 0.9.5
lxml 4.5.0
murmurhash 1.0.2
mysqlclient 1.4.2
nltk 3.4.5
nose 1.3.4
numpy 1.16.5
pip 9.0.1
plac 1.1.3
preshed 3.0.2
py-dateutil 2.2
python37-sagemaker-pyspark 1.3.0
pytz 2019.3
PyYAML 5.3.1
requests 2.24.0
setuptools 28.8.0
six 1.13.0
soupsieve 1.9.5
spacy 2.3.0
srsly 1.0.2
thinc 7.4.1
tqdm 4.46.1
urllib3 1.25.9
wasabi 0.6.0
wheel 0.29.0
windmill 1.6
zipp 3.1.0
This is not the best possible solution IMO, since the first warning that gets displayed after using sudo is
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
If anyone has a better solution please free to post.
I've a local function to transform my data on a local.
For example, when I tried to import:
import com.company.area.project.area.functions
I've received the error:
console: 49: error: object company is not a member of package com
import com.company.area.project.area.functions
I'm using Anaconda and the pip. Reading trough the web I've found this question: Is there a pip / easy_install for Scala?
So, I noticed that there is no a equivalent to "Pip install ." we have in python.
So, the question is. How can I import a custom function in my Jupyter Notebook?
First you have to ensure that you have Scala environment enabled for your jupyter notebook.
pixiedust can help you with installing scala packages in your jupyter notebook.
A specific example in your case will be to deploy your project as a jar file to a repository or an address where you can then pass to the pixiedust package
Code example should look like this:
import pixiedust
pixiedust.installPackage("https://github.com/companyrepo/jars/raw/master/dist/functions-assembly-1.0.jar")
Here's a few links that I went to and did exactly what they said. I don't know what I'm doing wrong.
https://github.com/alexarchambault/jupyter-scala
https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
https://github.com/apache/incubator-toree
http://jcrudy.github.io/blog/html/2013/12/08/introduction_to_iscala.html
None of this is working. It may be some way that my node is configured. I just don't know. Please help.
I tried the following with Jupyterhub notebook and it works seamlessly:
# Step 1: Install spylon kernel
pip install spylon-kernel
# Step 2: create a kernel spec
python -m spylon_kernel install
# Step 3: start jupyter notebook
jupyter notebook
PS: to list all installed kernels, you can run the following command:
jupyter kernelspec list
You can use the information given here.
Ensure you have IPython 3 installed. ipython --version should return a
value >= 3.0. If it's not the case, a quick way of setting it up
consists in installing the Anaconda Python distribution, and then
running
$ pip install --upgrade "ipython[all]"
ipython --version should then return a value >= 3.0.
Download the Jupyter Scala binaries for Scala 2.10 (txz or zip) or
Scala 2.11 (txz or zip), and unpack them in a safe place. Then run
once the jupyter-scala program (or jupyter-scala.bat on Windows) it
contains. That will set-up the Jupyter Scala kernel for the current
user.
Check that Jupyter/IPython knows about Jupyter Scala by running
$ jupyter kernelspec list
This should print, among others, a line like
scala211
(or scala210 dependending on the Scala version you chose).
Then run either IPython console with
$ ipython console --kernel scala211
and start using the Jupyter Scala kernel straightaway, or run Jupyter
Notebook with
$ jupyter notebook
and create Scala 2.11 notebooks by choosing Scala 2.11 in the dropdown
in the upper right of the Jupyter Notebook start page.
Note: Since IPython has now been replaced by Jupyter, we replaced ipython in the above commands with jupyter.
I've just run:
conda create --name base2 --clone base to create an env just like base.
conda activate base2 to move to the new env.
conda install -c conda-forge spylon-kernel.
python -m spylon_kernel install --user. create a kernel spec for Jupyter notebook
jupyter-notebook
...and works just fine.
I'm using:
Anaconda 4.7.12
Jupyter-notebook 6.0.1
Ubuntu 18.04
ipykernel 5.1.3
ipython 7.9.0
ipython_genutils 0.2.0
jupyter_client 5.3.4
jupyter_core 4.6.0
traitlets 4.3.3
from def suma(a: Int) = a + 3
I can't add a comment to Heapify's answer, but his solution worked for JupyterLab on Windows without problems.
I cut and pasted his code into an Anaconda Powershell prompt
pip install spylon-kernel
python -m spylon_kernel install
jupyter notebook
And refreshed my anacopnda launcher and the spylon project option was available.
The answer for Linux can be found here.
Install Scala. Add these lines to ~/.bashrc
export SCALA_HOME=/usr/local/share/scala export
PATH=$PATH:$SCALA_HOME/bin:$PATH
Follow these instructions from the
GitHub site:
Download and unpack pre-packaged binaries Scala 2.11. Unpack each
downloaded archive(s), and, from a console, go to the bin
sub-directory of the directory it contains. Then run the following to
set-up the corresponding Scala kernel:
./jove-scala --kernel-spec
Make sure spark is installed in local along with SPARK_HOME is added or exported in .profile/environment file.
If not, you might get stuck with the following message:
"Intitializing Scala interpreter ..."
without any result.
For mac, I needed only to 3 commands to add Scala and run it with Spark (I had it already installed) on my Jupyter notebook
pip install spylon-kernel
python -m spylon_kernel install
ipython notebook
Once you run them on your terminal, you'll have spylon-kernel in your notebook, which can be used as your a Scala notebook.
spylon-kernel hasn't seen an update in years. These days its much better to use almond.