Jupyter ImportError: No module named py4j.protocol even though py4j is installed - pyspark

I have read several posts about the error I get when importing pyspark. Some suggest installing py4j, which I have already done, yet I still see the error.
I am using a conda environment. Here are the steps:
1. Create a yml file that includes the needed packages (including py4j)
2. Create an env based on the yml
3. Create a kernel pointing to the env
4. Start the kernel in Jupyter
5. Running `import pyspark` throws the error: ImportError: No module named py4j.protocol

The issue was resolved by adding an "env" section to kernel.json and explicitly specifying the following variables:
"env": {
  "HADOOP_CONF_DIR": "/etc/spark2/conf/yarn-conf",
  "PYSPARK_PYTHON": "/opt/cloudera/parcels/Anaconda/bin/python",
  "SPARK_HOME": "/opt/cloudera/parcels/SPARK2",
  "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.7-src.zip:/opt/cloudera/parcels/SPARK2/lib/spark2/python/",
  "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": " --master yarn --deploy-mode client pyspark-shell"
}
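The same change can be scripted. Below is a minimal sketch that merges variables into the "env" section of an existing kernel.json. The helper name add_kernel_env and the example path in the comment are assumptions for illustration and not part of Jupyter itself; where the file lives depends on where the kernel spec was installed.

```python
import json


def add_kernel_env(kernel_json_path, env):
    """Merge `env` into the "env" section of a kernel.json file.

    Existing keys in the kernel spec are kept; keys in `env` win on conflict.
    Returns the updated kernel spec as a dict.
    """
    with open(kernel_json_path) as f:
        spec = json.load(f)
    # Create the "env" section if the kernel spec does not have one yet.
    spec.setdefault("env", {}).update(env)
    with open(kernel_json_path, "w") as f:
        json.dump(spec, f, indent=2)
    return spec


# Example call (hypothetical kernel.json location):
# add_kernel_env("/path/to/kernels/myenv/kernel.json",
#                {"SPARK_HOME": "/opt/cloudera/parcels/SPARK2"})
```

The kernel has to be restarted afterwards, since the variables are only read when the kernel process starts.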

Related

ModuleNotFoundError: No module named 'pyspark.dbutils' while running multiple.py file/notebook on job clusters in databricks

I am working in a TravisCI, MLflow and Databricks environment where .travis.yml sits on the git master branch and detects any change to a .py file. Whenever the file is updated, it runs an mlflow command to run the .py file in the Databricks environment.
My MLproject file looks as follows:
name: mercury_cltv_lib
conda_env: conda-env.yml
entry_points:
  main:
    command: "python3 run-multiple-notebooks.py"
The workflow is as follows:
TravisCI detects a change in the master branch --> triggers a build that runs the MLflow command, which spins up a job cluster in Databricks to run the .py file from the repo.
It worked fine with one .py file, but when I tried to run multiple notebooks using dbutils, it throws:
  File "run-multiple-notebooks.py", line 3, in <module>
    from pyspark.dbutils import DBUtils
ModuleNotFoundError: No module named 'pyspark.dbutils'
Below is the relevant code section from run-multiple-notebooks.py:
def get_spark_session():
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()
def get_dbutils(spark=None):
    try:
        # Fall back to the active session when no session is passed in
        if spark is None:
            spark = get_spark_session()
        from pyspark.dbutils import DBUtils  # error line
        dbutils = DBUtils(spark)  # error line
    except ImportError:
        # Inside a notebook, dbutils is already defined in the user namespace
        import IPython
        dbutils = IPython.get_ipython().user_ns["dbutils"]
    return dbutils
def submitNotebook(notebook):
    print("Running notebook %s" % notebook.path)
    spark = get_spark_session()
    dbutils = get_dbutils(spark)
I tried all the options, including the suggestions in
https://stackoverflow.com/questions/61546680/modulenotfounderror-no-module-named-pyspark-dbutils
but it is still not working.
Can someone please suggest a fix for the above error when running a .py file on a job cluster? My code works fine inside a local Databricks notebook, but running it from outside via TravisCI and MLflow does not work, which is a hard requirement for pipeline automation.

How to stop 'import psycopg2' from causing an Exception when starting an Azure Container?

I am trying to deploy a Django REST API using Azure App Service on Linux. I am using a PostgreSQL database and deploy via a pipeline. Azure has PostgreSQL 9.6. After running my pipeline, the website shows a Server Error (500).
The AppLogs show that the container couldn't be started due to a failed import of psycopg2.
[ERROR] Exception in worker process
Traceback (most recent call last):
  File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/django/db/backends/postgresql/base.py", line 25, in <module>
    import psycopg2 as Database
  File "/home/site/wwwroot/antenv/lib/python3.7/site-packages/psycopg2/__init__.py", line 50, in <module>
    from psycopg2._psycopg import ( # noqa
ImportError: /home/site/wwwroot/antenv/lib/python3.7/site-packages/psycopg2/_psycopg.cpython-37m-x86_64-linux-gnu.so: undefined symbol: PQencryptPasswordConn
In the Build-stage of my pipeline, I set up my environment (python3.7) like this:
- script: |
    python -m venv antenv
    source antenv/bin/activate
    python -m pip install --upgrade pip
    pip install setup
    pip install -r requirements.txt
Where requirements.txt looks like this:
Django==3.0.2
djangorestframework==3.11.0
psycopg2-binary==2.8.4
pandas==0.25.3
pytest==5.3.5
pytest-django==3.8.0
pytest-mock==2.0.0
python-dateutil==2.8.1
sqlparse==0.3.0
whitenoise==5.0.1
The BuildJob and DeploymentJob seem to run flawlessly. The build logs indicate that psycopg2_binary-2.8.4-cp37-cp37m-manylinux1_x86_64.whl was correctly downloaded and installed.
Also the App runs fine on my machine when using the database on azure by configuring in the settings.py:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'databasename',
        'USER': 'user#postgresqlserver',
        'PASSWORD': 'Password',
        'HOST': 'postgresqlserver.postgres.database.azure.com',
        'PORT': '',
        'OPTIONS': {'sslmode': 'require'}
    }
}  # Of course the info is actually saved in environment variables
This gives me the feeling that something is wrong with the psycopg2 installation. For others, psycopg2-binary seemed to do the trick, but unfortunately not for me.
Am I right to assume that on Azure I can neither install PostgreSQL 10 as suggested here https://github.com/psycopg/psycopg2/issues/983 nor install from source as suggested here https://github.com/psycopg/psycopg2/issues/1018?
There must be something I am missing, I would be grateful for any advice!
EDIT:
Taking a look at the library (as suggested here https://stackoverflow.com/a/59652816/13183775) I found that I don't have a PQencryptPasswordConn function but only a PQencryptPassword function. I have the feeling that this is expected for Postgresql9.6 (https://github.com/psycopg/psycopg2/blob/cb3353be1f10590cdc2a894ada42c3b4c171feb7/psycopg/psycopgmodule.c#L466).
To check whether there are multiple versions of libpq:
/>find . -name "libpq*"
./var/lib/dpkg/info/libpq5:amd64.symbols
./var/lib/dpkg/info/libpq5:amd64.shlibs
./var/lib/dpkg/info/libpq5:amd64.list
./var/lib/dpkg/info/libpq5:amd64.triggers
./var/lib/dpkg/info/libpq-dev.list
./var/lib/dpkg/info/libpq5:amd64.md5sums
./var/lib/dpkg/info/libpq-dev.md5sums
./usr/share/doc/libpq5
./usr/share/doc/libpq-dev
./usr/share/locale/ko/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/it/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/pl/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/zh_TW/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/tr/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/cs/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/de/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/ru/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/sv/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/pt_BR/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/fr/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/es/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/zh_CN/LC_MESSAGES/libpq5-9.6.mo
./usr/share/locale/ja/LC_MESSAGES/libpq5-9.6.mo
./usr/lib/x86_64-linux-gnu/pkgconfig/libpq.pc
./usr/lib/x86_64-linux-gnu/libpq.so.5
./usr/lib/x86_64-linux-gnu/libpq.so
./usr/lib/x86_64-linux-gnu/libpq.so.5.9
./usr/lib/x86_64-linux-gnu/libpq.a
./usr/include/postgresql/libpq-events.h
./usr/include/postgresql/libpq-fe.h
./usr/include/postgresql/libpq
./usr/include/postgresql/libpq/libpq-fs.h
./usr/include/postgresql/internal/libpq
./usr/include/postgresql/internal/libpq-int.h
Sadly, I'm not able to tell from this whether there are multiple libpq versions...
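One way to check a specific libpq file for the missing function is to load it with ctypes and look the symbol up. This is only a sketch: the helper name has_symbol is made up for illustration, and the library path in the usage comment is taken from the find output above.

```python
import ctypes


def has_symbol(library, symbol):
    """Return True if the shared library at `library` exports `symbol`.

    `library` is a path or soname understood by the dynamic loader.
    """
    lib = ctypes.CDLL(library)
    # ctypes resolves the symbol on attribute access and raises
    # AttributeError when the library does not export it.
    return hasattr(lib, symbol)


# Example call, using a path from the listing above:
# has_symbol("/usr/lib/x86_64-linux-gnu/libpq.so.5", "PQencryptPasswordConn")
```

Running this against each libpq.so* file found on the system shows which of them, if any, provide PQencryptPasswordConn.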

Importing PySpark packages

I have downloaded the graphframes package (from here) and saved it on my local disk. Now, I would like to use it. So, I use the following command:
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5
All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:
/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar, however these files don't exist. Moreover, the /tmp/spark-1eXXX/userFiles-9XXX/ directory is empty.
What am I missing?
In my case:
1. cd /home/zh/.ivy2/jars
2. jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
3. Add /home/zh/.ivy2/jars to PYTHONPATH in spark-env.sh, like this:
export PYTHONPATH=$PYTHONPATH:/home/zh/.ivy2/jars:.
This might be an issue in Spark packages with Python in general. Someone else was asking about it too earlier on the Spark user discussion alias.
My workaround is to unpack the jar to find the embedded Python code, and then move that code into a subdirectory called graphframes.
For instance, I run pyspark from my home directory
~$ ls -lart
drwxr-xr-x 2 user user 4096 Feb 24 19:55 graphframes
~$ ls graphframes/
__init__.pyc examples.pyc graphframe.pyc tests.pyc
You would not need the py-files or jars parameters, though, something like
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5
and having the python code in the graphframes directory should work.
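The unpacking step can also be sketched in Python, since a jar is a zip archive. This assumes the jar keeps the Python package in a top-level directory (for example graphframes/); the helper name extract_python_package is hypothetical.

```python
import zipfile


def extract_python_package(jar_path, package, dest_dir):
    """Extract only the `package/` directory from a jar into `dest_dir`.

    Returns the list of archive members that were extracted.
    """
    prefix = package + "/"
    with zipfile.ZipFile(jar_path) as jar:
        # Keep the Python package and skip the compiled JVM classes.
        members = [name for name in jar.namelist() if name.startswith(prefix)]
        jar.extractall(dest_dir, members=members)
    return members


# Example call (hypothetical paths):
# extract_python_package("graphframes-0.1.0-spark1.5.jar", "graphframes", ".")
```

If dest_dir is the directory pyspark is started from, or is added to PYTHONPATH, `import graphframes` should then find the extracted package.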
Add these lines to your $SPARK_HOME/conf/spark-defaults.conf :
spark.executor.extraClassPath file_path/jar1:file_path/jar2
spark.driver.extraClassPath file_path/jar1:file_path/jar2
In the more general case of importing 'orphan' python file (outside of current folder, not part of properly installed package) - use addPyFile, e.g.:
sc.addPyFile('somefolder/graphframe.zip')
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

How to start a Spark Shell using pyspark in Windows?

I am a beginner in Spark and trying to follow instructions from here on how to initialize Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run in cmd the following:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd ="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive ""python" is not recognized as internal or external command, operable program or batch file".
You need to have Python available in the system path. You can add it with setx:
setx path "%path%;C:\Python27"
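To confirm that the change took effect, open a new cmd window and check whether `python` resolves from PATH. A small sketch using the standard library (the wrapper name find_on_path is just for illustration):

```python
import shutil


def find_on_path(program, path=None):
    """Return the full path of `program` as resolved via PATH, or None.

    `path` optionally overrides the PATH string to search.
    """
    return shutil.which(program, path=path)


# Example call:
# print(find_on_path("python"))
```

A result of None means the shell will not find the program either, which matches the "not recognized as internal or external command" message.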
I'm a fairly new Spark user (as of today, really). I am using spark 1.6.0 on Windows 10 and 7 machines. The following worked for me:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# The py4j zip ships with Spark; its version number depends on the Spark release
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
# execfile exists only in Python 2
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Using the code above, I was able to launch Spark in an IPython notebook and in my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if your environment variables are set correctly for Python and Spark (pyspark).
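Since the py4j file name changes between Spark releases, the path entries can be discovered instead of hard-coded. A minimal sketch, written for Python 3 and assuming the usual layout of a Spark download (python/ and python/lib/py4j-*-src.zip under SPARK_HOME); the function name is made up for illustration:

```python
import glob
import os


def pyspark_sys_path_entries(spark_home):
    """Return the sys.path entries needed to import pyspark from `spark_home`.

    Looks for the bundled py4j source zip rather than assuming its version.
    """
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not py4j_zips:
        raise FileNotFoundError("no py4j-*-src.zip found under %s" % lib_dir)
    return [python_dir] + py4j_zips


# Example use:
# import sys
# sys.path[:0] = pyspark_sys_path_entries(os.environ["SPARK_HOME"])
```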
I run this set of path settings whenever I start pyspark in IPython:
import os
import sys
# For R: Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
# Restart Spark using: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
os.environ['SPARK_HOME']="G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext
##sc.stop() # IF you wish to stop the context
sc = SparkContext("local", "Simple App")
With the help of the user "maxymoo" I was also able to find a way to set a PERMANENT path in Windows 7. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
Simply set path in System -> Environment Variables -> Path
R Path in my system C:\Program Files\R\R-3.2.3\bin
Python Path in my system c:\python27
Spark Path in my system c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.

How to import modules in IPython Clusters

I am trying to import some of my personal modules into my IPython clusters. I am using Anaconda on Windows Vista 64-bit.
from IPython.parallel import Client
rc = Client()
dview = rc[:]
with dview.sync_imports():
    import lib.rf
It is giving me this error:
No module named 'lib.rf'
I can import the module in the rest of my IPython notebook, as I have this .bat file to start ipython notebook:
cd C:\Users\Jon\workspace\bf
set PYTHONPATH=%PYTHONPATH%;C:\Users\Jon\workspace\bf
C:\Anaconda\envs\p33\scripts\ipython notebook
I am using this similar code to start my ip clusters:
cd C:\Users\Jon\workspace\bf
set PYTHONPATH=%PYTHONPATH%;C:\Users\Jon\workspace\bf
C:\Anaconda\envs\p33\Scripts\ipcluster start --n=7
Why is this not working?
More info:
If I print out sys.path, I get a list that contains C:\Users\Jon\workspace\bf
If I print out the paths of my clusters, I get the same list:
%px sys.path
['',
'',
'',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\distribute-0.6.28-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\pykalman-0.9.5-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\patsy-0.2.1-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\joblib-0.8.3_r1-py3.3.egg',
'C:\\Users\\Jon\\workspace\\bf',
'C:\\Users\\Jon\\workspace\\bf\\my_numba',
'C:\\Anaconda\\envs\\p33\\python33.zip',
'C:\\Anaconda\\envs\\p33\\DLLs',
'C:\\Anaconda\\envs\\p33\\lib',
'C:\\Anaconda\\envs\\p33',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\Sphinx-1.2.3-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32\\lib',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\Pythonwin',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\runipy-0.1.1-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\setuptools-7.0-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\IPython\\extensions']
Further analysis:
%px lib.__path__
Out[0:11]: _NamespacePath(['C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32\\lib'])
lib.__path__
Out[57]: ['.\\lib']
Looks like the ipcluster and notebook are looking at lib in different places. I have tried renaming lib to mylib. It has not helped.
It seems that `with dview.sync_imports()` is being run somewhere other than your IPython Notebook environment and is therefore relying on a different PYTHONPATH. It is definitely not being run on one of the cluster engines, so I wouldn't expect it to pick up the PYTHONPATH from your cluster settings.
I'm thinking you'll need to have that directory in your PYTHONPATH (not your PATH) for the calling Python environment, because that is the location from which you are importing the modules.
The effect of setting PYTHONPATH in the DOS shell from which you invoke ipcluster isn't clear to me. I can see that one might expect this to let the engines know about your directory, but I'm wondering whether that PYTHONPATH instead gets initialized from the environment in which you call IPython.parallel.Client.
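To compare what a top-level name such as lib resolves to in each environment, a small helper can be run both in the notebook and on the engines (for example via %px). This is a sketch; the function name where_is is made up, and it only handles top-level module names.

```python
import importlib.util


def where_is(module_name):
    """Return the locations a top-level module name resolves to, or None.

    For packages (including namespace packages) this is the list of
    directories searched for submodules; for plain modules it is the file.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    if spec.submodule_search_locations is not None:
        return list(spec.submodule_search_locations)
    return [spec.origin]


# Example call:
# print(where_is("lib"))
```

If the notebook and the engines print different locations for the same name, they are importing different packages, which is what the lib.__path__ output in the question suggests.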