I am trying to install PySpark on Google Colab using the code given below, but I am getting the following error:
tar: spark-2.3.2-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
This code ran successfully once, but it started throwing this error after the notebook restarted. I have even tried running it from a different Google account, but I get the same error.
(Also, is there any way to avoid having to install PySpark every time the notebook restarts?)
code:
--------------------------------------------------------------------------------------------------------------------------------
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
The following line seems to cause the problem, as it cannot find the downloaded file.
!tar xvf spark-2.3.2-bin-hadoop2.7.tgz
I have also tried the following two lines (instead of the two above), suggested in a Medium blog post, but it made no difference.
!wget -q http://mirror.its.dal.ca/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xvf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
-------------------------------------------------------------------------------------------------------------------------------
Any ideas how to get out of this error and install PySpark on Colab?
I am running pyspark on colab by just using
!pip install pyspark
and it works fine.
Date: 6-09-2020
Step 1 : Install pyspark on google colab
!pip install pyspark
Step 2 : Dealing with pandas and spark Dataframe inside spark session
!pip install pyarrow
PyArrow facilitates data exchange between many components, for example reading a Parquet file with Python (pandas) and converting it to a Spark DataFrame, or working with Falcon Data Visualization or Cassandra, without worrying about conversions.
Step 3 : Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').getOrCreate()
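To illustrate what the pyarrow step enables, here is a minimal sketch (the column names and values are made up for illustration) of converting between pandas and Spark DataFrames. On Spark 2.x the config key is spark.sql.execution.arrow.enabled rather than the one shown.
# Minimal sketch: pandas <-> Spark conversion backed by Arrow (illustrative data)
import pandas as pd
# Spark 3.x key; Spark 2.x uses "spark.sql.execution.arrow.enabled" instead
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
sdf = spark.createDataFrame(pdf)   # pandas -> Spark
pdf_back = sdf.toPandas()          # Spark -> pandas
sdf.show()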
Done ⭐
You are getting this error because spark-2.3.2-bin-hadoop2.7 has been replaced with the latest version on the official site and the mirror sites.
Go to either of these paths and get the latest version:
http://apache.osuosl.org/spark/
https://www-us.apache.org/dist/spark/
Replace the Spark build version in the commands below and you are done; everything will work smoothly.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf /content/spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark
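After downloading and extracting, you still need to point findspark at the installation. A minimal sketch, assuming the archive above was extracted into /content (the same pattern is used in the fuller setup answer further down):
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()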
I had tried to install it in the same way, but even after checking the proper Spark versions I was getting the same error.
Running the code below worked for me!
!pip install pyspark
!pip install pyarrow
!pip install -q findspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName('HelloWorld').getOrCreate()
I have used the below setup to run PySpark on Google Colab.
# Installing spark
!apt-get install openjdk-8-jre
!apt-get install scala
!pip install py4j
!wget -q https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark
# Setting up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.8-bin-hadoop2.7"
# Importing and initiating spark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Test Setup").getOrCreate()
sc = spark.sparkContext
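As a quick smoke test (the sample rows below are made up), you can confirm the session works:
# Build a tiny DataFrame and print the Spark version to verify the setup
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "name"])
df.show()
print(spark.version)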
Related
I am trying to run the following script in a Databricks Python notebook:
pip install presidio-image-redactor
pip install pytesseract
python -m spacy download en_core_web_lg
from PIL import Image
from presidio_image_redactor import ImageRedactorEngine
import pytesseract
image = Image.open("images/ImageData.PNG")
engine = ImageRedactorEngine()
redacted_image = engine.redact(image, (255, 192, 203))
Upon running the last line, I'm getting the error below:
TesseractNotFoundError: tesseract is not installed or it's not in your PATH.
Am I missing anything?
You can use %sh in a separate cell to execute the shell commands on the driver node. To install tesseract, you can do:
%sh apt-get -f -y install tesseract-ocr
If you need to install it on all nodes of the cluster, you need to use a cluster init script with the same command (without %sh).
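For the init-script route, a minimal sketch (the DBFS path and script name are assumptions, not from the original answer): write the script with dbutils.fs.put, then reference it under the cluster's Init Scripts settings so every node installs tesseract at startup.
# Sketch only: writes an init script to DBFS; attach it via the cluster's
# "Init Scripts" configuration (path and file name are assumed here)
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-tesseract.sh",
    "#!/bin/bash\napt-get -f -y install tesseract-ocr\n",
    overwrite=True,
)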
I am using an EMR notebook attached to my cluster for some experimentation purposes. I needed to install some Python modules for testing, specifically spacy and its data module en_core_web_sm.
I SSH'ed into the master and core nodes and installed the modules individually. However, I am not able to import them from my EMR notebook. I get the following error:
An error was encountered:
No module named 'spacy'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'spacy'
I know there is a way to install them just for the scope of the EMR notebook, but this wouldn't suffice in a production scenario, so please avoid answers that suggest notebook-scoped installs as described in this guide: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
Please let me know if I am missing some setup steps. I appreciate your responses.
You can use bootstrap actions to install additional modules while creating your EMR cluster:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
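If you create the cluster programmatically, a minimal sketch with boto3 (the bucket name, script path, instance types, and release label are all assumptions): the referenced S3 script would contain the install commands shown in the next answer.
# Sketch only: registers a bootstrap action at cluster creation time
import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="notebook-cluster",
    ReleaseLabel="emr-5.30.0",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "install-spacy",
            # assumed S3 location; the script holds the pip/spacy commands
            "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/install-spacy.sh"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)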
I was able to solve this by changing the bootstrap script to use sudo instead of --user. (You could also run the scripts below manually.)
Before I was running
pip3 install spacy --user
python3 -m spacy download en --user
I changed that script to
sudo pip3 install spacy
sudo python3 -m spacy download en
To verify this solution quickly, issue the following command from your EMR notebook (to compare before and after):
sc.list_packages()
You should see an output similar to this
SparkSession available as 'spark'.
Package Version
-------------------------- ----------
beautifulsoup4 4.9.0
blis 0.4.1
boto 2.49.0
catalogue 1.0.0
certifi 2020.4.5.2
chardet 3.0.4
cymem 2.0.3
en-core-web-sm 2.3.0
idna 2.9
importlib-metadata 1.6.1
jmespath 0.9.5
lxml 4.5.0
murmurhash 1.0.2
mysqlclient 1.4.2
nltk 3.4.5
nose 1.3.4
numpy 1.16.5
pip 9.0.1
plac 1.1.3
preshed 3.0.2
py-dateutil 2.2
python37-sagemaker-pyspark 1.3.0
pytz 2019.3
PyYAML 5.3.1
requests 2.24.0
setuptools 28.8.0
six 1.13.0
soupsieve 1.9.5
spacy 2.3.0
srsly 1.0.2
thinc 7.4.1
tqdm 4.46.1
urllib3 1.25.9
wasabi 0.6.0
wheel 0.29.0
windmill 1.6
zipp 3.1.0
This is not the best possible solution IMO, since the first warning that gets displayed after using sudo is
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
If anyone has a better solution, please feel free to post it.
My Python version is 2.7.13.
I run the following in Jupyter Notebook.
Firstly, I installed the packages
%%bash
pip uninstall -y google-cloud-dataflow
pip install --upgrade --force tensorflow_transform==0.15.0 apache-beam[gcp]
Then,
%%bash
pip freeze | grep -e 'flow\|beam'
I can see that the package tensorflow-transform is installed.
apache-beam==2.19.0
tensorflow==2.1.0
tensorflow-datasets==1.2.0
tensorflow-estimator==2.1.0
tensorflow-hub==0.6.0
tensorflow-io==0.8.1
tensorflow-metadata==0.15.2
tensorflow-probability==0.8.0
tensorflow-serving-api==2.1.0
tensorflow-transform==0.15.0
However, when I tried to import it, I got the following warning and error.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/api/_v1/estimator/__init__.py:12: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.
ImportErrorTraceback (most recent call last)
<ipython-input-3-26a4792d0a76> in <module>()
1 import tensorflow as tf
----> 2 import tensorflow_transform as tft
3 import shutil
4 print(tf.__version__)
ImportError: No module named tensorflow_transform
After some investigation, I think I have an idea of what the problem is.
I run this:
%%bash
pip show tensorflow_transform| grep Location
This is the output
Location: /home/jupyter/.local/lib/python3.5/site-packages
I tried to modify the $PATH by adding /home/jupyter/.local/lib/python3.5/site-packages to the beginning of $PATH. However, I still failed to import tensorflow_transform.
Based on the above and the following information, I think that when I ran the import command, it executed Python 2.7, not Python 3.5:
import sys
print('\n'.join(sys.path))
/usr/lib/python2.7
/usr/lib/python2.7/plat-x86_64-linux-gnu
/usr/lib/python2.7/lib-tk
/usr/lib/python2.7/lib-old
/usr/lib/python2.7/lib-dynload
/usr/local/lib/python2.7/dist-packages
/usr/lib/python2.7/dist-packages
/usr/local/lib/python2.7/dist-packages/IPython/extensions
/home/jupyter/.ipython
Also,
import sys
sys.executable
'/usr/bin/python2'
I think the problem is that the tensorflow_transform package was installed in /home/jupyter/.local/lib/python3.5/site-packages, but when I run the import, Python searches /usr/local/lib/python2.7/dist-packages for the package rather than /home/jupyter/.local/lib/python3.5/site-packages, so even updating $PATH does not help. Am I right?
I tried to upgrade my Python, but:
%%bash
pip install upgrade python
Defaulting to user installation because normal site-packages is not writeable
Then I added --user, but it seems that Python was not really upgraded.
%%bash
pip install --user upgrade python
%%bash
python -V
Python 2.7.13
Any solution?
It seems to me that your Jupyter notebook is not using the right Python environment.
Perhaps you installed the package under Python 3.5, but the notebook uses the other interpreter, so it cannot find the library.
You can pick the other interpreter by clicking on Python (your version) in the bottom left.
VS-Code - Select Python Environment 1
However you can do this also via:
CTRL+SHIFT+P > Select Python Interpreter to start Jupyter Server
If that does not work, make sure that the package you are trying to import is installed under the correct Python environment.
If not, open up a terminal, activate the environment, and install it using:
pip install packagename
For example, I did the same thing here (note: I'm using Anaconda):
installing tensorflow_transform
After installation, you can import it in your code directly like this:
importing tensorflow_transform
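A quick way to check which interpreter the notebook kernel is actually using, and to install into that same interpreter (this check is an addition, not part of the original answer):
import sys
print(sys.executable)          # shows the interpreter backing this kernel

# installing with the kernel's own interpreter avoids the Python 2.7 / 3.5 mix-up
!{sys.executable} -m pip install tensorflow-transform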
I had installed pyspark in a Python virtualenv. I also installed JupyterLab, which had been newly released (http://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html), in the virtualenv. I was unable to start pyspark within a Jupyter notebook in such a way that the SparkContext variable was available.
First, activate the virtualenv:
source venv/bin/activate
export SPARK_HOME={path_to_venv}/lib/python2.7/site-packages/pyspark
export PYSPARK_DRIVER_PYTHON=jupyter-lab
Before this, I hope you have already run pip install pyspark and pip install jupyterlab inside your virtualenv.
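With those variables exported in the activated shell, running the pyspark command from that shell should then launch JupyterLab as the driver with sc predefined (this launch step is implied rather than spelled out above).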
To check, once your JupyterLab is open, type sc in a cell and you should have the SparkContext object available; the output should be this:
SparkContext
Spark UI
Version
v2.2.1
Master
local[*]
AppName
PySparkShell
You need to export $PYSPARK_PYTHON so that it points at your virtualenv's Python:
export PYSPARK_PYTHON={path/to/your/virtualenv}/bin/python
That solved my case.
In my case, working on Windows with Python 3.7.4 and Spark 3.1.1, the problem was that pyspark was looking for python3.exe, which did not exist. I made a copy of venv/Scripts/python.exe and renamed it venv/Scripts/python3.exe.
Does anybody know how to install h5py on Datalab? pip install h5py doesn't work. apt-get install python-h5py works in the shell, but the package is not recognized in the Datalab notebook!
Thanks
It is true that !pip install h5py will allow you to install the library, but unfortunately, even after a successful installation, the import will fail:
The issue is rooted in an ongoing python-future issue ("surrogateescape handler broken when encountering UnicodeEncodeError") that shows up in Datalab because the underlying OS uses an 'ANSI_X3.4-1968' file system encoding.
As a hacky workaround, you may remove line 60 from h5py's __init__.py by running the following command from within a notebook cell:
!sed -i.bak '/run_tests/d' /usr/local/lib/python2.7/dist-packages/h5py/__init__.py
Just make sure you have first installed the library using bash syntax (!pip install h5py) in any notebook cell.