The program below throws the error name 'spark' is not defined.
Traceback (most recent call last):
  File "pgm_latest.py", line 232, in <module>
    sconf = SparkConf().set(spark.dynamicAllocation.enabled, true)
        .set(spark.dynamicAllocation.maxExecutors, 300)
        .set(spark.shuffle.service.enabled, true)
        .set(spark.shuffle.spill.compress, true)
NameError: name 'spark' is not defined
spark-submit --driver-memory 12g --master yarn-cluster --executor-memory 6g --executor-cores 3 pgm_latest.py
Code
#!/usr/bin/python
import sys
import os
from datetime import *
from time import *
from pyspark.sql import *
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext()
sqlCtx= HiveContext(sc)
sqlCtx.sql('SET spark.sql.autoBroadcastJoinThreshold=104857600')
sqlCtx.sql('SET Tungsten=true')
sqlCtx.sql('SET spark.sql.shuffle.partitions=500')
sqlCtx.sql('SET spark.sql.inMemoryColumnarStorage.compressed=true')
sqlCtx.sql('SET spark.sql.inMemoryColumnarStorage.batchSize=12000')
sqlCtx.sql('SET spark.sql.parquet.cacheMetadata=true')
sqlCtx.sql('SET spark.sql.parquet.filterPushdown=true')
sqlCtx.sql('SET spark.sql.hive.convertMetastoreParquet=true')
sqlCtx.sql('SET spark.sql.parquet.binaryAsString=true')
sqlCtx.sql('SET spark.sql.parquet.compression.codec=snappy')
sqlCtx.sql('SET spark.sql.hive.convertMetastoreParquet=true')
## Main functionality
def main(sc):

if __name__ == '__main__':
    # Configure OPTIONS
    sconf = SparkConf() \
        .set("spark.dynamicAllocation.enabled", "true") \
        .set("spark.dynamicAllocation.maxExecutors", 300) \
        .set("spark.shuffle.service.enabled", "true") \
        .set("spark.shuffle.spill.compress", "true")
    sc = SparkContext(conf=sconf)
    # Execute Main functionality
    main(sc)
    sc.stop()
I think you are using a Spark version older than 2.x.
Instead of this:
spark.createDataFrame(..)
use the one below:
df = sqlContext.createDataFrame(...)
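Here is a minimal sketch of what that looks like on Spark 1.x (the sample rows and column names are illustrative, not taken from the question):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)  # or HiveContext(sc), as in the code above

# In Spark 1.x, createDataFrame lives on the SQLContext/HiveContext, not on a `spark` session
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()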
For example, if you know where Spark is installed, e.g.:
/home/user/spark/spark-2.4.0-bin-hadoop2.7/
├── LICENSE
├── NOTICE
├── R
├── README.md
├── RELEASE
├── bin
├── conf
├── data
├── examples
├── jars
├── kubernetes
├── licenses
├── python
├── sbin
└── yarn
You can explicitly specify the path to the Spark installation inside the .init method:
#pyspark
findspark.init("/home/user/spark/spark-2.4.0-bin-hadoop2.7/")
The findspark module will come in handy here.
Install the module with the following:
python -m pip install findspark
Make sure the SPARK_HOME environment variable is set.
Usage:
import findspark
findspark.init()
import pyspark # Call this only after findspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
print(spark)
Related
I want to set the attribute my_package.version for users to use, but I can't make it work.
# file structure
my_package
├── src
│   └── my_package
├── VERSION
└── setup.py
# in VERSION file
0.0.1
# in setup.py file
import re
from setuptools import setup

VERSION_RE = re.compile(r"""([0-9dev.]+)""")

def get_version():
    with open("VERSION", "r") as fh:
        init = fh.read().strip()
    return VERSION_RE.search(init).group(1)

setup(
    author="ABC",
    version=get_version(),
    ...
)
but
>>> import my_package
>>> my_package.version # expect '0.0.1'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'my_package' has no attribute 'version'
Can anyone tell me whether my expectation is wrong (that my_package should have attributes like author, version, ...)?
If the expectation is right, then what is wrong here, and what should I fix so that my_package.version is available to users?
Thanks
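For what it's worth, the keywords passed to setup() (author, version, ...) are distribution metadata, not attributes of the imported module, which is why my_package.version does not exist. A common pattern, sketched below as one possible approach rather than a fix taken from this question, is to expose __version__ from the package's own __init__.py (Python 3.8+):
# src/my_package/__init__.py -- a minimal sketch
from importlib.metadata import PackageNotFoundError, version

try:
    # Reads the version that setup.py recorded at install time
    __version__ = version("my_package")
except PackageNotFoundError:
    # Not installed (e.g. running from a source checkout)
    __version__ = "0.0.0"
Users would then read my_package.__version__, which is the conventional attribute name.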
I have a repository with the following setup:
│
└───foo_lib
│ │ bar.py
│
└───notebooks
│ my_notebook.ipynb
So basically I have some common Python code in foo_lib and some notebooks in notebooks.
In my_notebook I want to use the code from foo_lib. So I do:
from foo_lib import bar
But that doesn't work because the root of the repo isn't in my python path when the notebook is executed.
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-e2c421feccf4> in <module>
----> 1 from foo_lib import bar
ModuleNotFoundError: No module named 'foo_lib'
The hack I've been using is to put %cd .. in the first cell. Then the working directory is the root of the repo and I can import fine. But it's not idempotent, so if I run the cell more than once, imports break again.
I found an idempotent solution. I can use globals()["_dh"][0] which points to the directory containing the notebook, when running in jupyter:
import os
os.chdir(os.path.join(globals()["_dh"][0], ".."))
Unfortunately, this doesn't work when I run my notebook programmatically using nbconvert:
import json
import nbconvert
import nbformat

def run_notebook():
    ep = nbconvert.preprocessors.ExecutePreprocessor()
    with open("notebooks/my_notebook.ipynb") as fp:
        nb = nbformat.read(fp, as_version=4)
    nb, resources = ep.preprocess(nb)
    print(json.dumps(nb, indent=2))

if __name__ == "__main__":
    run_notebook()
When I run this script from the root of the repository, globals()["_dh"][0] points to the root of the repository, so the os.chdir call ends up one directory above the repo.
So I'm looking for a solution to this import problem that:
is idempotent
works when executing from the browser/jupyter
works when executing using nbconvert
is short: I would have to copy-paste the code into every notebook (since I can't do imports before that code runs).
Is there a better way to do this?
I've figured out that the local repository code can be added to site-packages by calling:
pip install -e .
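For pip install -e . to pick up foo_lib, the repository root needs a minimal packaging file; a sketch could look like this (the distribution name and version are placeholders, and it assumes foo_lib/ contains an __init__.py):
# setup.py at the repository root -- a minimal sketch for `pip install -e .`
from setuptools import find_packages, setup

setup(
    name="foo-lib",      # placeholder distribution name
    version="0.1.0",     # placeholder version
    packages=find_packages(include=["foo_lib", "foo_lib.*"]),
)
After the editable install, from foo_lib import bar resolves regardless of the working directory, so the notebook import is idempotent and also works under nbconvert.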
Folder structure of the pyspark project in PyCharm:
TEST
    TEST (marked as sources root)
        com
            earl
                test
                    pyspark
                        utils
                            utilities.py
                        test_main.py
test_main.py has:
from _ast import arg
__author__ = "earl"
from pyspark.sql.functions import to_json, struct, lit
from com.earl.test.pyspark.utils.utilities import *
import sys
utilities.py has:
__author__ = "earl"
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
import sys
In PyCharm, I execute the code by running test_main.py and it works absolutely fine: it calls functions from utilities.py and executes perfectly. I set Run -> Edit Configurations -> Parameters in PyCharm to D:\Users\input\test.json localhost:9092, read them with sys.argv[1] and sys.argv[2], and that works OK as well.
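(For reference, a minimal sketch of how those two parameters are consumed in test_main.py; the variable names below are my own, not from the project:)
import sys

input_json = sys.argv[1]    # e.g. D:\Users\input\test.json
kafka_broker = sys.argv[2]  # e.g. localhost:9092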
Spark submit command:
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py --files D:\Users\earl\com\test\pyspark\utils\utilities.py D:\Users\input\test.json localhost:9092
Error:
Traceback (most recent call last):
File "D:\Users\earl\com\earl\test\pyspark\test_main.py", line 5, in <module>
from com.earl.test.pyspark.utils.utilities import *
ModuleNotFoundError: No module named 'com'
I fixed it by setting the property below before running spark-submit.
PYTHONPATH was earlier set to %PY_HOME%\Lib;%PY_HOME%\DLLs;%PY_HOME%\Lib\lib-tk.
set PYTHONPATH=%PYTHONPATH%;D:\Users\earl\TEST\ (the path of the project root)
And I updated the spark-submit command as follows (only the main script needs to be mentioned):
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py D:\Users\input\test.json localhost:9092
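An alternative sketch, not what the answer above does, is to put the project root on sys.path from inside test_main.py so that no PYTHONPATH export is needed for local runs (the path below is the project root mentioned above):
# Top of test_main.py -- a sketch for local runs only
import sys

PROJECT_ROOT = r"D:\Users\earl\TEST"
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from com.earl.test.pyspark.utils.utilities import *  # resolvable once the root is on sys.path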
I am trying to install PySpark on Colab.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz
!pip install -q findspark
After installing the above, I set the environment as follows:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
After that, I tried to initialize pyspark as follows and ended up with an error.
import findspark
findspark.init()
Error:
IndexError Traceback (most recent call last)
<ipython-input-24-4e91d34768ac> in <module>()
1 import findspark
----> 2 findspark.init()
/usr/local/lib/python3.6/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
133 # add pyspark to sys.path
134 spark_python = os.path.join(spark_home, 'python')
--> 135 py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
136 sys.path[:0] = [spark_python, py4j]
137
IndexError: list index out of range
Can you try setting
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
to the same Spark version as the one you installed above? In your case that would be 2.4.1, not 2.2.1.
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"
Make sure that your Java and Spark paths (including version) are correct:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
Then try to check whether the path is correct by printing it:
print(os.listdir('./sample_data'))
If you get a list of sample files, the code will initialize without any 'index out of range' errors.
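Putting the pieces together, a consolidated sketch for Colab might look like this, assuming Spark 2.4.1 was downloaded and extracted to /content/spark-2.4.1-bin-hadoop2.7 as in the question:
import os
import findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"  # must match the extracted folder

findspark.init()  # adds pyspark and py4j from SPARK_HOME to sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("colab-test").getOrCreate()
print(spark.version)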
I'm having headaches getting celery to work with my folder structure. Note that I am using virtualenv, but it should not matter.
cive/
    celery_app.py
    __init__.py
    venv
    framework/
        tasks.py
        __init__.py
        civeAPI/
            files tasks.py needs
cive is my root project folder.
celery_app.py:
from __future__ import absolute_import
from celery import Celery

app = Celery('cive',
             broker='amqp://',
             backend='amqp://',
             include=['cive.framework.tasks'])

# Optional configuration, see the application user guide.
app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
tasks.py (simplified)
from __future__ import absolute_import
# import other things
# append syspaths
from cive.celery_app import app

@app.task(ignore_result=False)
def start(X):
    # do things
    ...

def output(X):
    # output files
    ...

def main():
    for d in Ds:
        m = []
        m.append(start.delay(X))
        output([n.get() for n in m])

if __name__ == '__main__':
    sys.exit(main(sys.argv[1:]))
I then start the workers via (outside the root cive dir):
celery -A cive worker --app=cive.celery_app:app -l info
which seems to work fine, loading the workers and showing
[tasks]
. cive.framework.tasks.start_sessions
But when I try to run my tasks.py via another terminal:
python tasks.py
I get the error:
Traceback (most recent call last):
File "tasks.py", line 29, in <module>
from cive.celery_app import app
ImportError: No module named cive.celery_app
If I change the import to:
from celery_app import app  # without the cive.celery_app
I can eventually start the script, but celery returns the error:
Received unregistered task of type 'cive.start_sessions'
I think there's something wrong with my imports or config but I can't say what.
So this was a Python package problem, not really a celery issue. I found the solution by looking at How to fix "Attempted relative import in non-package" even with __init__.py.
I had never even thought about this before, but I wasn't running Python in package mode. The solution is to cd out of your root project directory and then run Python in package mode (note there is no .py after tasks):
python -m cive.framework.tasks
Now when I run the celery task everything works.
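A quick way to see why the -m invocation fixes the import (a sketch, not part of the original answer): python tasks.py puts .../cive/framework on sys.path, so the top-level package cive is not importable, while python -m cive.framework.tasks run from the directory containing cive/ puts that directory on the path instead.
# Print the first sys.path entry under both invocations to compare
import sys
print(sys.path[0])
# python tasks.py                 -> .../cive/framework  (cive not importable)
# python -m cive.framework.tasks  -> the current working directory (the parent of cive/)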