PySpark program throwing "name 'spark' is not defined"

The program below throws the error name 'spark' is not defined:
Traceback (most recent call last):
File "pgm_latest.py", line 232, in <module>
sconf =SparkConf().set(spark.dynamicAllocation.enabled,true)
.set(spark.dynamicAllocation.maxExecutors,300)
.set(spark.shuffle.service.enabled,true)
.set(spark.shuffle.spill.compress,true)
NameError: name 'spark' is not defined
Spark-submit command:
spark-submit --driver-memory 12g --master yarn-cluster --executor-memory 6g --executor-cores 3 pgm_latest.py
Code:
#!/usr/bin/python
import sys
import os
from datetime import *
from time import *
from pyspark.sql import *
from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext()
sqlCtx = HiveContext(sc)
sqlCtx.sql('SET spark.sql.autoBroadcastJoinThreshold=104857600')
sqlCtx.sql('SET Tungsten=true')
sqlCtx.sql('SET spark.sql.shuffle.partitions=500')
sqlCtx.sql('SET spark.sql.inMemoryColumnarStorage.compressed=true')
sqlCtx.sql('SET spark.sql.inMemoryColumnarStorage.batchSize=12000')
sqlCtx.sql('SET spark.sql.parquet.cacheMetadata=true')
sqlCtx.sql('SET spark.sql.parquet.filterPushdown=true')
sqlCtx.sql('SET spark.sql.hive.convertMetastoreParquet=true')
sqlCtx.sql('SET spark.sql.parquet.binaryAsString=true')
sqlCtx.sql('SET spark.sql.parquet.compression.codec=snappy')
sqlCtx.sql('SET spark.sql.hive.convertMetastoreParquet=true')

## Main functionality
def main(sc):
    ...

if __name__ == '__main__':
    # Configure OPTIONS
    sconf = SparkConf() \
        .set("spark.dynamicAllocation.enabled", "true") \
        .set("spark.dynamicAllocation.maxExecutors", 300) \
        .set("spark.shuffle.service.enabled", "true") \
        .set("spark.shuffle.spill.compress", "true")
    sc = SparkContext(conf=sconf)
    # Execute Main functionality
    main(sc)
    sc.stop()

I think you are using a Spark version older than 2.x.
Instead of this:
spark.createDataFrame(..)
use this:
df = sqlContext.createDataFrame(...)
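For reference, a minimal sketch contrasting the two entry points (this only runs on Spark 2.x, where both APIs still exist; the sample rows and column names are placeholders, not taken from the question):

from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

sc = SparkContext.getOrCreate()
rows = [(1, "a"), (2, "b")]  # placeholder data

# Spark 1.x style: DataFrames are built through SQLContext / HiveContext
sqlContext = SQLContext(sc)
df_old = sqlContext.createDataFrame(rows, ["id", "label"])

# Spark 2.x+ style: the SparkSession (conventionally named `spark`) is the entry point
spark = SparkSession.builder.getOrCreate()
df_new = spark.createDataFrame(rows, ["id", "label"])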

For example, if you know where Spark is installed, e.g.:
/home/user/spark/spark-2.4.0-bin-hadoop2.7/
├── LICENSE
├── NOTICE
├── R
├── README.md
├── RELEASE
├── bin
├── conf
├── data
├── examples
├── jars
├── kubernetes
├── licenses
├── python
├── sbin
└── yarn
you can explicitly specify the path to the Spark installation in findspark's .init method:
findspark.init("/home/user/spark/spark-2.4.0-bin-hadoop2.7/")

The findspark module will come in handy here.
Install the module with the following:
python -m pip install findspark
Make sure SPARK_HOME environment variable is set.
Usage:
import findspark
findspark.init()
import pyspark # Call this only after findspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
print(spark)

Related

Are setup() attributes supposed to be available?

I want to set the attribute my_package.version for users to use, but I can't make it work.
# file structure
my_package
├── src
│   └── my_package
├── VERSION
└── setup.py
# in VERSION file
0.0.1
# in setup.py file
import re
from setuptools import setup

VERSION_RE = re.compile(r"""([0-9dev.]+)""")

def get_version():
    with open("VERSION", "r") as fh:
        init = fh.read().strip()
    return VERSION_RE.search(init).group(1)

setup(
    author="ABC",
    version=get_version(),
    ...
)
but
>>> import my_package
>>> my_package.version # expect '0.0.1'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'my_package' has no attribute 'version'
Can anyone tell me whether my expectation is wrong (that my_package should have attributes like author, version, ...)?
Or, if the expectation is right, what is wrong here, and what should I fix so that my_package.version is available to users?
Thanks
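For what it's worth, setup() metadata is packaging metadata and does not automatically become attributes of the imported module. The common convention is for the package itself to define the version; a minimal sketch assuming the src layout above (the dunder names are the usual convention, not something setuptools creates):

# src/my_package/__init__.py  (sketch)
__version__ = "0.0.1"
__author__ = "ABC"

setup.py can then read that string out of __init__.py (with a regex, much like you already do for VERSION) so the two never drift, and users can access my_package.__version__ at runtime.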

Import from the root of the repository when running a Jupyter notebook

I have a repository with the following setup:
│
└───foo_lib
│ │ bar.py
│
└───notebooks
│ my_notebook.ipynb
So basically I have some common Python code in foo_lib and some notebooks in notebooks.
In my_notebook I want to use the code from foo_lib, so I do:
from foo_lib import bar
But that doesn't work because the root of the repo isn't on my Python path when the notebook is executed:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-e2c421feccf4> in <module>
----> 1 from foo_lib import bar
ModuleNotFoundError: No module named 'foo_lib'
The hack I've been using is to put %cd .. in the first cell. Then the working directory is the root of the repo and I can import fine. But it's not idempotent, so if I run the cell more than once, imports break again.
I found an idempotent solution. I can use globals()["_dh"][0], which points to the directory containing the notebook when running in Jupyter:
import os
os.chdir(os.path.join(globals()["_dh"][0], ".."))
Unfortunately, this doesn't work when I run my notebook programmatically using nbconvert:
import json
import nbconvert
import nbformat

def run_notebook():
    ep = nbconvert.preprocessors.ExecutePreprocessor()
    with open("notebooks/my_notebook.ipynb") as fp:
        nb = nbformat.read(fp, as_version=4)
    nb, resources = ep.preprocess(nb)
    print(json.dumps(nb, indent=2))

if __name__ == "__main__":
    run_notebook()
When I run this script from the root of the repository, globals()["_dh"][0] points to the root of the repository rather than to the notebooks directory, so the chdir trick breaks.
So I'm looking for a solution to this import problem that:
is idempotent
works when executing from the browser/jupyter
works when executing using nbconvert
is short: I would have to copy-paste the code into every notebook (since I can't import anything before that code runs).
Is there a better way to do this?
I've figured out that the local repository code can be added to site-packages by calling:
pip install -e .
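For pip install -e . to work, the repository root needs a minimal packaging file. One possible sketch, assuming foo_lib is the only package to install (the name and version below are placeholders):

# setup.py at the repository root (sketch)
from setuptools import setup, find_packages

setup(
    name="foo_lib",
    version="0.1.0",
    packages=find_packages(include=["foo_lib", "foo_lib.*"]),
)

After pip install -e . in the environment that backs the notebook kernel, from foo_lib import bar resolves regardless of the working directory, which covers all four requirements above: it is idempotent, needs no per-notebook boilerplate, and works both in Jupyter and under nbconvert.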

Spark-submit not picking modules and sub-modules of project structure

Folder structure of the PySpark project in PyCharm:
TEST
    TEST (marked as sources root)
        com
            earl
                test
                    pyspark
                        utils
                            utilities.py
                        test_main.py
test_main.py has:
from _ast import arg
__author__ = "earl"
from pyspark.sql.functions import to_json, struct, lit
from com.earl.test.pyspark.utils.utilities import *
import sys
utilities.py has:
__author__ = "earl"
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
import sys
In PyCharm, I execute the code by running test_main.py and it works absolutely fine: it calls functions from utilities.py and executes perfectly. I set Run -> Edit Configurations -> Parameters in PyCharm to D:\Users\input\test.json localhost:9092, use sys.argv[1] and sys.argv[2], and that works as well.
Spark submit command:
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py --files D:\Users\earl\com\test\pyspark\utils\utilities.py D:\Users\input\test.json localhost:9092
Error:
Traceback (most recent call last):
File "D:\Users\earl\com\earl\test\pyspark\test_main.py", line 5, in <module>
from com.earl.test.pyspark.utils.utilities import *
ModuleNotFoundError: No module named 'com'
Fixed it by setting the property below before running spark-submit.
PYTHONPATH was previously set to %PY_HOME%\Lib;%PY_HOME%\DLLs;%PY_HOME%\Lib\lib-tk
set PYTHONPATH=%PYTHONPATH%;D:\Users\earl\TEST\ (path of the project home structure)
And updated spark-submit as follows (only the main script needs to be mentioned):
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py D:\Users\input\test.json localhost:9092
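An alternative to editing PYTHONPATH is to ship the package with the job via spark-submit's --py-files option. A sketch, assuming every folder on the com.earl.test.pyspark path contains an __init__.py and that the archive is built from the sources root so com\ sits at its top level (deps.zip is a placeholder name):

rem from the sources root (the folder containing com\), bundle the package
python -m zipfile -c deps.zip com
rem the archive is added to the Python search path on the driver and executors
spark-submit --master local --name test --py-files deps.zip D:\Users\earl\com\earl\test\pyspark\test_main.py D:\Users\input\test.json localhost:9092

This keeps utilities.py packaged with the job instead of depending on a machine-level PYTHONPATH setting.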

findspark.init() IndexError: list index out of range: PySpark on Google Colab

I am trying to install PySpark on Colab.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz
!pip install -q findspark
After installing the above, I set the environment as follows:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
After that, I tried to initialize PySpark as follows and ended up with an error.
import findspark
findspark.init()
Error:
IndexError Traceback (most recent call last)
<ipython-input-24-4e91d34768ac> in <module>()
1 import findspark
----> 2 findspark.init()
/usr/local/lib/python3.6/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
133 # add pyspark to sys.path
134 spark_python = os.path.join(spark_home, 'python')
--> 135 py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
136 sys.path[:0] = [spark_python, py4j]
137
IndexError: list index out of range
Can you try setting
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
to the same Spark version as the one you installed above? In your case it would be 2.4.1, not 2.2.1.
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"
Make sure that your Java and Spark paths (including version) are correct:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
Then try to check whether the path is correct by printing its contents:
print(os.listdir('./sample_data'))
If you get a list of sample files, the code will initialize without any 'index out of range' errors.
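Putting it together, a sketch of the corrected Colab setup for the 2.4.1 download above (only the SPARK_HOME version string changes from the original question):

import os
import findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"  # must match the folder you untarred

findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)  # expect 2.4.1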

Unregistered task type and import errors in Celery

I'm having headaches getting Celery to work with my folder structure. Note that I am using virtualenv, but it should not matter.
cive /
    celery_app.py
    __init__.py
    venv
    framework /
        tasks.py
        __init__.py
        civeAPI /
            files tasks.py needs
cive is my root project folder.
celery_app.py:
from __future__ import absolute_import
from celery import Celery

app = Celery('cive',
             broker='amqp://',
             backend='amqp://',
             include=['cive.framework.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
tasks.py (simplified)
from __future__ import absolute_import
# import other things
# append syspaths
from cive.celery_app import app

@app.task(ignore_result=False)
def start(X):
    # do things

def output(X):
    # output files

def main():
    for d in Ds:
        m = []
        m.append(start.delay(X))
        output([n.get() for n in m])

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
I then start workers via (outside the root cive dir):
celery -A cive worker --app=cive.celery_app:app -l info
which seems to work fine, loading the workers and showing:
[tasks]
. cive.framework.tasks.start_sessions
But when I try to run my tasks.py via another terminal:
python tasks.py
I get the error:
Traceback (most recent call last):
File "tasks.py", line 29, in <module>
from cive.celery_app import app
ImportError: No module named cive.celery_app
If I change the import to:
from celery_app import app #without the cive.celery_app
I can eventually start the script, but Celery returns the error:
Received unregistered task of type 'cive.start_sessions'
I think there's something wrong with my imports or config but I can't say what.
So this was a Python package problem, not really a Celery issue. I found the solution by looking at How to fix "Attempted relative import in non-package" even with __init__.py.
I'd never even thought about this before, but I wasn't running Python in package mode. The solution is cd'ing out of your root project directory, then running Python with the module as part of a package (note there is no .py after tasks):
python -m cive.framework.tasks
Now when I run the celery task everything works.
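The difference comes down to where Python roots the import search; a quick sketch of the two invocations, run from the directory that contains cive/:

# run as a plain script: sys.path[0] is cive/framework, so the absolute
# import 'from cive.celery_app import app' cannot be resolved
python cive/framework/tasks.py

# run as a module: sys.path[0] is the current directory (the parent of cive/),
# so the import resolves and the task keeps its fully qualified name
# 'cive.framework.tasks.<task>', matching what the worker registered
python -m cive.framework.tasks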