Unable to install seaborn on EMR cluster - pyspark

I want to install pandas and seaborn on an EMR cluster with PySpark. I tried:
sc.install_pypi_package("pandas==1.0.3")
sc.install_pypi_package("seaborn==0.12.1")
Pandas installed, but seaborn did not. It asked me to install Cython, which I did.
However, it still did not work. It seems I need to install many more dependencies. See output here (16 pages). Is there an easier way to install it? I thought that sc.install_pypi_package would install the dependencies.
Also, I tried %pip install seaborn as suggested by the documentation, but it did not work either.

Related

Install or import in Python?

I am a beginner in Python, still trying to learn the basics. I am mostly interested in using it for data analysis and visualization, with packages such as matplotlib.
Most of the examples I see use the code
"import matplotlib"
or something similar.
But there are also cases where people suggest running pip install before using the package.
So, as a rule of thumb, when should one use import and when should one install through the terminal?
Let's say you want to use a library named ABC, and ABC has a function called function1.
If you write
import ABC
ABC.function1()
you will get an error, because Python can't find a library called ABC in your virtual environment. You must install it first by running pip install ABC in your terminal. After that, the same code will work.
You must install a library before you can use it.
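The difference between "installed" and "imported" can be seen in a small sketch. The helper try_import and the package name abc_library_that_is_not_installed are made up for illustration; the second name is assumed not to exist on your machine.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The module is missing from this environment, so it has to be
        # installed first (for example with pip) before import can succeed.
        print(f"{module_name} is not installed in this environment")
        return None


# json ships with Python, so the import succeeds without installing anything.
json_module = try_import("json")

# Hypothetical package name, assumed not to be installed on this machine.
missing_module = try_import("abc_library_that_is_not_installed")
```

Note that the name you pass to pip is not always the same as the name you import, so check the package's documentation if the two differ.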
There is no rule of thumb for choosing an installation method; any method will do. The aim is to have the library installed and available when you run the code, otherwise you will get an error.
On Windows, to install a package or library, run the following command at the Command Prompt:
python3 -m pip install matplotlib
To upgrade it, run:
python3 -m pip install --upgrade matplotlib
You can install and upgrade packages from within Jupyter too.
Once a library is installed, place import <library_name> at the top of the code in which you want to use it.
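For the Jupyter case, one possible approach is to call pip through the interpreter that is running the notebook, so the package lands in the same environment you import from. This is only a sketch: pip_command is a hypothetical helper, and the actual install call is left commented out because it downloads and installs software.

```python
import sys


def pip_command(package, upgrade=False):
    """Build a pip command that targets the interpreter running this code."""
    command = [sys.executable, "-m", "pip", "install"]
    if upgrade:
        command.append("--upgrade")
    command.append(package)
    return command


install_cmd = pip_command("matplotlib")
upgrade_cmd = pip_command("matplotlib", upgrade=True)
print(" ".join(install_cmd))

# To actually run it (this downloads and installs the package):
# import subprocess
# subprocess.check_call(install_cmd)
```

Using sys.executable instead of a bare "pip" avoids installing into a different Python than the one you are running.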

ToreeInstall ERROR | Unknown interpreter PySpark. toree can not install PySpark

To install PySpark for Jupyter Notebook, I am using this command:
jupyter toree install --kernel_name=tanveer --interpreters=PySpark --python="/usr/lib/python3.6"
But I get this message:
[ToreeInstall] ERROR | Unknown interpreter PySpark. Skipping installation of PySpark interpreter
So I don't know what the problem is. I have set up Toree's Scala and SQL interpreters successfully. Thanks.
Toree version 0.3.0 removed support for PySpark and SparkR:
Removed support for PySpark and Spark R in Toree (use specific kernels)
Release notes here: incubator-toree release notes
I am not sure what "use specific kernels" means, and I am still looking for a Jupyter PySpark kernel.
As Lee's answer also mentions, Toree version 0.3.0 removed support for PySpark and SparkR. The release notes say to "use specific kernels". For PySpark, this means installing pyspark manually for use with Jupyter.
The steps are as follows:
Install pyspark, either with pip install pyspark or by downloading the Apache Spark binary package and extracting it into a folder of your choice.
Add the following three environment variables. How to do this depends on your OS. For example, on macOS, I added the following lines to the file ~/.bash_profile:
export SPARK_HOME=<path_to_your_installed_spark_files>
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
That's it. To start your PySpark Jupyter Notebook, simply run "pyspark" from your command line and choose the "Python" kernel.
Refer to https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781788835367/1/ch01lvl1sec17/installing-jupyter
or
https://opensource.com/article/18/11/pyspark-jupyter-notebook for more detailed instructions.
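Before launching pyspark, it can help to confirm that the three variables are visible to the process. The sketch below assumes only the variable names given above; missing_pyspark_vars is a hypothetical helper, and the /opt/spark path is a placeholder rather than a real install location.

```python
import os

REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")


def missing_pyspark_vars(env=None):
    """Return the names of the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a made-up environment; the path is a placeholder.
example_env = {
    "SPARK_HOME": "/opt/spark",
    "PYSPARK_DRIVER_PYTHON": "jupyter",
}
print(missing_pyspark_vars(example_env))  # ['PYSPARK_DRIVER_PYTHON_OPTS']
```

Calling missing_pyspark_vars() with no argument checks the real environment of the current process. If it reports a missing variable after you edited ~/.bash_profile, the shell probably has not reloaded that file yet.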

Jupyter can't run shapely.geometry

Hey so I've managed to get shapely.geometry to run just fine on PyCharm.
But the difficulty here is in getting the import to run on Jupyter notebook.
I have done:
import geopandas as gpd
This returns an error saying shapely.geometry doesn't exist.
I think I know how to fix this by downloading the file
"Shapely-1.6.4.post1-cp37-cp37m-win_amd64.whl" and doing conda install (that)... but it returned that the channel didn't exist...
So I did:
conda install --add channels https://www.lfd.uci.edu/~gohlke/pythonlibs/
(which is where I got the file from) which worked just fine so I then again did "conda install Shapely-1.6.4.post1-cp37-cp37m-win_amd64.whl" but it returned:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://www.lfd.uci.edu/~gohlke/pythonlibs/win-64/repodata.json>
A simple retry will get you on your way...
Tried that, didn't work. Someone please help. Reminder that I successfully installed shapely with all of its modules working through "pip install Shapely-1.6.4.post1-cp37-cp37m-win_amd64.whl" WITHIN PyCharm itself.
EDIT 1
I'm following the textbook "Mastering Geospatial Analysis with Python". It got me to download the packages:
gdal
geos
shapely
fiona
pyshp
pyproj
rasterio
geopandas
EDIT 2
I don't know what I did but somehow I fixed it... but the thing is, I literally did nothing except take out a shapely file with a long name and keep the one just called "shapely".
If I have files like this
gdal-2.2.2-py36hcebd033_1
instead of this
gdal
Is that the problem? Because if it is, then I don't know how to get files like that; they just either appear or they don't.
Shapely is a wrapper around a C++ library called GEOS, which is not installed with the wheel. You should go to the page and install that library.
Or perhaps you have PyCharm set up for Python 2 and Jupyter for Python 3 (or vice versa).
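One way to test the interpreter-mismatch idea is to compare the wheel's filename with the Python that Jupyter is running. A wheel filename ends in a Python tag, an ABI tag, and a platform tag, so cp37 means the file was built for CPython 3.7. The sketch below handles only plain CPython tags; wheel_tags and matches_running_python are hypothetical helpers, not part of pip or conda.

```python
import sys


def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def matches_running_python(python_tag):
    """Check a simple CPython tag such as 'cp37' against this interpreter."""
    digits = python_tag[2:]
    if not python_tag.startswith("cp") or not digits.isdigit() or len(digits) < 2:
        return False  # only plain CPython tags are handled in this sketch
    if sys.implementation.name != "cpython":
        return False
    # The first digit is the major version, the remaining digits the minor.
    wheel_version = (int(digits[0]), int(digits[1:]))
    return wheel_version == tuple(sys.version_info[:2])


tags = wheel_tags("Shapely-1.6.4.post1-cp37-cp37m-win_amd64.whl")
print(tags)  # ('cp37', 'cp37m', 'win_amd64')
print("Built for this interpreter:", matches_running_python(tags[0]))
```

Run it inside a Jupyter cell and again in PyCharm's console. If the two give different answers, the two tools are using different Python installations.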
Running conda install -c conda-forge geos=3.7.1 worked for me.

ImportError: No module named sympy

I am getting the following error while trying to run a sympy file in order to contribute to sympy:
ImportError: No module named sympy
I installed the sympy module through pip for both python2.7 and python 3.
Also, isympy is working.
Strangely, when I try to import sympy in Python's interactive console in the main sympy directory, no import errors are shown, but in some other directory, it shows import errors.
Please help me to download the sympy module in a way that I will be able to run the code.
Thanks.
Importing module in python console of main directory.
Importing module in some other directory.
A likely cause here is that you are using two different Pythons. If you have Python installed multiple times (like Python 2 and Python 3), each has its own separate packages. You can check what Python you are using by printing sys.executable.
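A quick check along those lines is to print the interpreter path and see whether that interpreter can find the module at all. This is a minimal sketch; locate_module is a hypothetical helper built on the standard library's importlib.util.find_spec.

```python
import importlib.util
import sys


def locate_module(module_name):
    """Return the file a top-level module would load from, or None if absent."""
    spec = importlib.util.find_spec(module_name)
    # Note: origin can also be None for namespace packages.
    return None if spec is None else spec.origin


print("Interpreter:", sys.executable)
print("sympy found at:", locate_module("sympy"))
```

Run the same two lines from each directory and with each Python you have. A different interpreter path, or None for sympy, points to the installation that is missing the package.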
I should point out that for the purposes of contributing to SymPy, you generally want to run against the development version. That is, running Python from the SymPy directory and importing the development version from there, without actually installing it.
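The reason the import works from inside the SymPy directory is that Python searches the directories on sys.path, and an interactive session started in a directory searches that directory first. The sketch below imitates this with a throwaway directory; demo_dev_package is a made-up package name and import_from_directory is a hypothetical helper.

```python
import importlib
import os
import sys
import tempfile


def import_from_directory(directory, module_name):
    """Import module_name with directory temporarily first on sys.path."""
    sys.path.insert(0, directory)
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
    finally:
        sys.path.remove(directory)


# Build a throwaway "checkout" containing a hypothetical package.
with tempfile.TemporaryDirectory() as checkout:
    package_dir = os.path.join(checkout, "demo_dev_package")
    os.mkdir(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
        handle.write("__version__ = 'dev'\n")

    module = import_from_directory(checkout, "demo_dev_package")
    print(module.__version__)  # dev
    print(module.__file__)
```

The package is importable only because its parent directory is on sys.path, which is the same thing that happens when you start Python inside a source checkout.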
Thanks for the reply.
But I solved the problem. I realised that I didn't install sympy in the current conda environment. When I tried it using the command:
conda install sympy
It worked and no error is being shown.
Thanks.

Installing SciPy without Anaconda on Windows: how to fix "no lapack/blas" error?

Is there an option to install SciPy on Windows without installing Anaconda as well? I could not do it via pip and everywhere it says to use Anaconda.
More details:
I want the SciPy package without any additional programs like Python(x, y) or Canopy.
The error with pip is: numpy.distutils.system_info.NotFoundError: no lapack/blas resources found. From research I found that I need to use additional packages but it sounds strange to me. I couldn't install LAPACK or BLAS.
There are unofficial builds: http://www.lfd.uci.edu/~gohlke/pythonlibs. Here's a link to scipy: http://www.lfd.uci.edu/~gohlke/pythonlibs#scipy
You can proceed with installing numpy or scipy using pip after you install lapack and blas, which are system libraries. It shouldn't be very hard, but it depends on your OS.
For RedHat/CentOS/Fedora this could be done with:
yum install lapack lapack-devel blas blas-devel
The packages can be found e.g. in CentOS base repository.
However, the scikit-learn website says as follows:
We don’t recommend installing scipy or numpy using pip on linux, as this will involve a lengthy build-process with many dependencies. Without careful configuration, building numpy yourself can lead to an installation that is much slower than it should be. If you are using Linux, consider using your package manager to install scikit-learn. It is usually the easiest way, but might not provide the newest version. If you haven’t already installed numpy and scipy and can’t install them via your operation system, it is recommended to use a third party distribution.
Package managers are usually yum or apt-get and again on RedHat/CentOS/Fedora you can skip using pip and install this way:
yum install scipy
The third-party distributions mentioned above are things like Anaconda or Python(x,y).
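Whichever route you take, you can check afterwards which BLAS/LAPACK libraries NumPy was built against by calling numpy.show_config(). The sketch below assumes nothing about your setup and simply reports when NumPy is not installed; show_numpy_build_config is a hypothetical wrapper.

```python
import importlib.util


def show_numpy_build_config():
    """Print NumPy's build configuration; return False if NumPy is absent."""
    if importlib.util.find_spec("numpy") is None:
        print("NumPy is not installed in this environment")
        return False
    import numpy

    numpy.show_config()  # lists the BLAS/LAPACK libraries NumPy was built with
    return True


numpy_available = show_numpy_build_config()
```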