Spark-submit not picking up modules and sub-modules of the project structure - pyspark

Folder structure of the PySpark project in PyCharm:
TEST
    TEST (marked as sources root)
        com
            earl
                test
                    pyspark
                        utils
                            utilities.py
                        test_main.py
test_main.py has:
from _ast import arg
__author__ = "earl"
from pyspark.sql.functions import to_json, struct, lit
from com.earl.test.pyspark.utils.utilities import *
import sys
utilities.py has:
__author__ = "earl"
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
import sys
In PyCharm, I execute the code by running test_main.py and it works absolutely fine: it calls functions from utilities.py and executes perfectly. I set Run -> Edit Configurations -> Parameters in PyCharm to D:\Users\input\test.json localhost:9092 and read them with sys.argv[1] and sys.argv[2], which also works.
Spark submit command:
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py --files D:\Users\earl\com\test\pyspark\utils\utilities.py D:\Users\input\test.json localhost:9092
Error:
Traceback (most recent call last):
File "D:\Users\earl\com\earl\test\pyspark\test_main.py", line 5, in <module>
from com.earl.test.pyspark.utils.utilities import *
ModuleNotFoundError: No module named 'com'

Fixed it by setting the property below before running spark-submit.
PYTHONPATH was previously set to %PY_HOME%\Lib;%PY_HOME%\DLLs;%PY_HOME%\Lib\lib-tk
set PYTHONPATH=%PYTHONPATH%;D:\Users\earl\TEST\ (path of the project home directory)
And updated the spark-submit command as follows (only the main script needs to be mentioned):
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py D:\Users\input\test.json localhost:9092
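An alternative that avoids editing PYTHONPATH on every machine is to ship the package with the job via --py-files, which accepts .zip, .egg or .py files and adds them to the Python path of the driver and executors. A rough sketch, assuming the com package is zipped from the project home D:\Users\earl\TEST\ into an archive (the name deps.zip is just a placeholder):
spark-submit --master local --name test --py-files D:\Users\earl\TEST\deps.zip D:\Users\earl\com\earl\test\pyspark\test_main.py D:\Users\input\test.json localhost:9092
Note that in the original command the --files flag appears after the main script, so spark-submit treats it as an application argument; and in any case --files only ships files to the working directory, it does not make them importable as modules.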

Related

Import from the root of the repository when running a jupyter notebook

I have a repository with the following setup:
.
├── foo_lib
│   └── bar.py
└── notebooks
    └── my_notebook.ipynb
So basically I have some common Python code in foo_lib and a notebook in notebooks.
In my_notebook I want to use the code from foo_lib. So I do:
from foo_lib import bar
But that doesn't work because the root of the repo isn't in my python path when the notebook is executed.
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-e2c421feccf4> in <module>
----> 1 from foo_lib import bar
ModuleNotFoundError: No module named 'foo_lib'
The hack I've been using is to put %cd .. in the first cell. Then the working directory is the root of the repo and I can import fine. But it's not idempotent, so if I run the cell more than once, imports break again.
I found an idempotent solution. I can use globals()["_dh"][0] which points to the directory containing the notebook, when running in jupyter:
import os
os.chdir(os.path.join(globals()["_dh"][0], ".."))
Unfortunately, this doesn't work when I run my notebook programmatically using nbconvert:
import json
import nbconvert
import nbformat

def run_notebook():
    ep = nbconvert.preprocessors.ExecutePreprocessor()
    with open("notebooks/my_notebook.ipynb") as fp:
        nb = nbformat.read(fp, as_version=4)
    nb, resources = ep.preprocess(nb)
    print(json.dumps(nb, indent=2))

if __name__ == "__main__":
    run_notebook()
When I run this script from the root of the repository, globals()["_dh"][0] points to the root of the repository instead of the notebooks directory, so the os.chdir(..., "..") trick moves one level too far up.
So I'm looking for a solution to this import problem that:
- is idempotent,
- works when executing from the browser/Jupyter,
- works when executing using nbconvert,
- is short: I would have to copy-paste the code into every notebook (since before that code runs, I can't do imports).
Is there a better way to do this?
I've figured out that the local repository code can be added to site-packages by calling:
pip install -e .
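For pip install -e . to work, the repository root needs a minimal packaging file. A sketch using setup.py (the name and version values are placeholders; foo_lib also needs an __init__.py so find_packages() picks it up):
from setuptools import setup, find_packages

setup(
    name="foo_lib",            # placeholder project name
    version="0.1",             # placeholder version
    packages=find_packages(),  # discovers foo_lib/ via its __init__.py
)
After running pip install -e . in the environment the notebook kernel uses, from foo_lib import bar resolves no matter which directory the notebook runs from, which satisfies all four requirements above.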

I have pytorch installed in an environment but import torch produces an error

I have an Anaconda Python 3 kernel with pytorch and numpy installed in the environment. In a Jupyter notebook, the first line 'import torch' produces an error.
I am using Anaconda Navigator to launch Jupyter Notebook and enter my environment, and I can see that pytorch is installed, but it is not being imported. I tried various variants such as 'from torch... import *' but got more errors.
import torch
ImportError Traceback (most recent call last)
<ipython-input-1-20507c95d9af> in <module>
1
----> 2 import torch
3
4
~/anaconda3/envs/udacity1/lib/python3.6/site-packages/torch/__init__.py in <module>
100 pass
101
--> 102 from torch._C import *
103
104 __all__ += [name for name in dir(_C)
ImportError: /home/frida/anaconda3/envs/udacity1/lib/python3.6/site-packages/torch/lib/libtorch.so.1: undefined symbol: nvrtcGetProgramLogSize
I was able to add this to the end of my path to establish a 'backend' for my notebook kernel. Thanks to Kris Stern!
-m ipykernel install --user
First, get your kernel path with
which python3
then connect it using
sudo (your path)/anaconda3/bin/python3 -m ipykernel install --user
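As a quick sanity check (a suggestion, not part of the original answer), you can confirm from inside the notebook that the kernel really runs the interpreter of the environment where pytorch is installed, e.g. the udacity1 env above:
import sys
print(sys.executable)  # should end in .../anaconda3/envs/udacity1/bin/python
If it points at a different interpreter, the kernel registration step above (python3 -m ipykernel install --user) is the piece that fixes it.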

findspark.init() IndexError: list index out of range: PySpark on Google Colab

I am trying to install PySpark on Colab.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz
!pip install -q findspark
After installing the above, I set the environment variables as follows:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
After that, I tried to initialize pyspark as follows and ended up with an error.
import findspark
findspark.init()
Error:
IndexError Traceback (most recent call last)
<ipython-input-24-4e91d34768ac> in <module>()
1 import findspark
----> 2 findspark.init()
/usr/local/lib/python3.6/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
133 # add pyspark to sys.path
134 spark_python = os.path.join(spark_home, 'python')
--> 135 py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
136 sys.path[:0] = [spark_python, py4j]
137
IndexError: list index out of range
Can you try setting
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
to the same Spark version as your install above? In your case it would be 2.4.1, not 2.2.1.
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"
Make sure that your Java and Spark paths (including version) are correct:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
Then try to check whether the path is correct by printing the directory contents:
print(os.listdir('./sample_data'))
If you get a list of the sample files, the code will initialize without any 'index out of range' errors.
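Putting the pieces together, a minimal sketch of the whole cell, assuming the 2.4.1 download above is the archive that was actually unpacked into /content:
import os
import findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"  # must match the unpacked folder name

findspark.init()  # adds the pyspark and py4j folders under SPARK_HOME to sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
The IndexError comes from findspark failing to find a py4j-*.zip under the non-existent 2.2.1 directory, so making SPARK_HOME match the real folder is the whole fix.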

unregistered task type import errors in celery

I'm having headaches getting Celery to work with my folder structure. Note that I am using virtualenv, but that should not matter.
cive/
    celery_app.py
    __init__.py
    venv
    framework/
        tasks.py
        __init__.py
    civeAPI/
        files tasks.py need
cive is my root project folder.
celery_app.py:
from __future__ import absolute_import
from celery import Celery

app = Celery('cive',
             broker='amqp://',
             backend='amqp://',
             include=['cive.framework.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
tasks.py (simplified):
from __future__ import absolute_import
# import other things
# append syspaths
from cive.celery_app import app

@app.task(ignore_result=False)
def start(X):
    # do things

def output(X):
    # output files

def main():
    for d in Ds:
        m = []
        m.append(start.delay(X))
        output([n.get() for n in m])

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
I then start the workers via (outside the root cive dir):
celery -A cive worker --app=cive.celery_app:app -l info
which seems to work fine, loading the workers and showing
[tasks]
. cive.framework.tasks.start_sessions
But when I try to run my tasks.py via another terminal:
python tasks.py
I get the error:
Traceback (most recent call last):
File "tasks.py", line 29, in <module>
from cive.celery_app import app
ImportError: No module named cive.celery_app
If I rename the import to:
from celery_app import app #without the cive.celery_app
I can eventually start the script but celery returns error:
Received unregistered task of type 'cive.start_sessions'
I think there's something wrong with my imports or config but I can't say what.
So this was a Python packaging problem, not really a Celery issue. I found the solution by looking at How to fix "Attempted relative import in non-package" even with __init__.py.
I'd never even thought about this before, but I wasn't running Python in package mode. The solution is to cd out of your root project directory and then run the module as part of the package (note there is no .py after tasks):
python -m cive.framework.tasks
Now when I run the celery task everything works.
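To spell out why this matters (my summary, not part of the original answer): running python tasks.py puts the framework directory itself on sys.path, so the interpreter has no idea a package named cive exists; running with -m from the directory that contains cive/ puts that parent directory on sys.path instead, so both the import and Celery's fully-qualified task names line up with what the worker registered. A quick check:
# run from the directory that contains cive/, not from inside it
python -c "import cive.framework.tasks"   # should import cleanly
python -m cive.framework.tasks            # runs tasks.py in package mode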

Fail to import IPython parallel in Jupyter

I recently updated IPython to 4.0.0 and installed Jupyter 4.0.6.
I wanted to use IPython parallel, and after starting the engines in the notebook, I imported:
from IPython import parallel
And it fails:
~/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.
warn("IPython.utils.traitlets has moved to a top-level traitlets package.")
~/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/IPython/utils/pickleutil.py:3: UserWarning: IPython.utils.pickleutil has moved to ipykernel.pickleutil
warn("IPython.utils.pickleutil has moved to ipykernel.pickleutil")
~/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/IPython/utils/jsonutil.py:3: UserWarning: IPython.utils.jsonutil has moved to jupyter_client.jsonutil
warn("IPython.utils.jsonutil has moved to jupyter_client.jsonutil")
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-5652e9e33a4d> in <module>()
----> 1 from IPython import parallel
~/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/IPython/parallel/__init__.py in <module>()
31
32 from .client.asyncresult import *
---> 33 from .client.client import Client
34 from .client.remotefunction import *
35 from .client.view import *
~/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/IPython/parallel/client/client.py in <module>()
38 from IPython.utils.capture import RichOutput
39 from IPython.utils.coloransi import TermColors
---> 40 from IPython.utils.jsonutil import rekey, extract_dates, parse_date
41 from IPython.utils.localinterfaces import localhost, is_local_ip
42 from IPython.utils.path import get_ipython_dir
ImportError: cannot import name rekey
So I tried:
pip install rekey
But no distribution was found.
Note that it fails the same way in the notebook, whether opened with ipython notebook or jupyter notebook, and in the console.
Also note that there is a warning:
UserWarning: IPython.utils.jsonutil has moved to jupyter_client.jsonutil
But rekey does not exist in the module jupyter_client.jsonutil.
Question: How can I get IPython parallel to work within Jupyter? What am I missing?
I found the problem, I think (at least it works now):
First, I had to import ipyparallel instead of IPython.parallel.
See here: http://jupyter.readthedocs.org/en/latest/migrating.html#imports
EDIT: the fix below for the OSError was apparently unnecessary, and it works without it. I still don't get why I had that error, though.
Then, I had another error when starting the client:
OSError: Connection file '~/.ipython/profile_default/security/ipcontroller-client.json' not found.
You have attempted to connect to an IPython Cluster but no Controller could be found.
Please double-check your configuration and ensure that a cluster is running.
So I just copied the directory ~/.ipython/profile_default to ~/.jupyter/profile_default
And it works!
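For reference, a minimal hedged sketch of the updated import and client setup with the standalone ipyparallel package (assuming a controller and engines are already running, e.g. via ipcluster start):
import os
import ipyparallel as ipp

rc = ipp.Client()                  # reads the ipcontroller-client.json connection file
view = rc[:]                       # a DirectView on all running engines
print(rc.ids)                      # connected engine ids
print(view.apply_sync(os.getpid))  # one pid per engine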