Pyspark: SPARK_HOME may not be configured correctly

I'm trying to run pyspark from a notebook in a conda environment.
Inside the environment 'env', $ which python returns:
/Users/<username>/anaconda2/envs/env/bin/python
and outside the environment:
/Users/<username>/anaconda2/bin/python
My .bashrc file has:
export PATH="/Users/<username>/anaconda2/bin:$PATH"
export JAVA_HOME=`/usr/libexec/java_home`
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.1.2
export PYTHONPATH=$SPARK_HOME/libexec/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
But still, when I run:
import findspark
findspark.init()
I'm getting the error:
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
Any ideas?
Full traceback
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
142 try:
--> 143 py4j = glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
144 except IndexError:
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
/var/folders/dx/dfb8h2h925l7vmm7y971clpw0000gn/T/ipykernel_72686/1796740182.py in <module>
1 import findspark
2
----> 3 findspark.init()
~/anaconda2/envs/ai/lib/python3.7/site-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
144 except IndexError:
145 raise Exception(
--> 146 "Unable to find py4j, your SPARK_HOME may not be configured correctly"
147 )
148 sys.path[:0] = [spark_python, py4j]
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
EDIT:
If I run the following in the notebook:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
I get the error:
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: Permission denied
/usr/local/Cellar/apache-spark/3.1.2/bin/load-spark-env.sh: line 2: exec: /usr/local/Cellar/apache-spark/3.1.2/libexec/bin/load-spark-env.sh: cannot execute: Undefined error: 0
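EDIT 2: Based on the PYTHONPATH lines above and the libexec paths in that error, I suspect the Homebrew prefix is not the real Spark home. This is a minimal sketch of what I plan to try next, assuming the actual Spark distribution lives under the libexec subdirectory:
import os
import findspark

# Assumption: with a Homebrew install, the unpacked Spark distribution
# (python/, bin/, jars/, ...) sits under the libexec subdirectory.
spark_home = "/usr/local/Cellar/apache-spark/3.1.2/libexec"
os.environ["SPARK_HOME"] = spark_home

# findspark.init() also accepts the path explicitly instead of reading SPARK_HOME.
findspark.init(spark_home)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(spark.version)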

Related

Loading MongoDB dump (*.bson.gz) file using Pyspark

I am getting a snapshot of MongoDB (with a .bson.gz extension) in an S3 location.
I am trying to read and load the file using pyspark, but it is giving me an error. I am
using a Jupyter notebook for this activity. Has anyone handled anything like this? Please suggest.
from pyspark import SparkContext, SparkConf
sc.install_pypi_package("pymongo==3.2.2")
import pymongo_spark
pymongo_spark.activate()
conf = SparkConf().setAppName("pyspark-bson")
file_path = "s3://location/users.bson.gz"
bsonFileRdd = sc.BSONFileRDD(file_path)
bsonFileRdd.take(5)
Error
An error was encountered:
No module named 'pymongo_spark'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'pymongo_spark'
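One direction that sidesteps the deprecated pymongo_spark module entirely (just a sketch, not a tested answer; the bucket and key below are placeholders): decompress the dump yourself and decode it with the bson package that ships with pymongo, then hand the documents to Spark.
import gzip
import boto3
import bson  # ships with pymongo
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-bson").getOrCreate()

# Placeholder bucket/key; adjust to the real S3 location.
obj = boto3.client("s3").get_object(Bucket="location", Key="users.bson.gz")

# Decompress the gzip stream and decode the BSON documents into plain dicts.
with gzip.GzipFile(fileobj=obj["Body"]) as fh:
    docs = list(bson.decode_file_iter(fh))

# ObjectId values are not Spark-friendly, so stringify _id before building the DataFrame.
for d in docs:
    if "_id" in d:
        d["_id"] = str(d["_id"])

df = spark.createDataFrame(docs)
df.show(5)
Note that this decodes the whole dump on the driver, so it only makes sense for snapshots that fit in memory; larger dumps would need to be split first or read through the official MongoDB Spark connector.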

Running the sample code for the Google Assistant SDK gives SyntaxError: invalid syntax

I'm trying to run the following command to start my Google Assistant on the Raspberry Pi:
googlesamples-assistant-pushtotalk --project-id my-dev-project --device-model-id my-model
But I get the following error:
Traceback (most recent call last):
File "/home/pi/env/bin/googlesamples-assistant-pushtotalk", line 5, in <module>
from googlesamples.assistant.grpc.pushtotalk import main
File "/home/pi/env/lib/python3.9/site-packages/googlesamples/assistant/grpc/pushtotalk.py", line 29, in <module>
from tenacity import retry, stop_after_attempt, retry_if_exception
File "/home/pi/env/lib/python3.9/site-packages/tenacity/__init__.py", line 292
from tenacity.async import AsyncRetrying
^
SyntaxError: invalid syntax
Just use:
pip install -U tenacity
That solved it for me. (async became a reserved keyword in Python 3.7, so older tenacity releases that still ship a tenacity/async module fail to import on newer interpreters; upgrading tenacity fixes it.)

SyntaxError: invalid syntax in EngineerIO.py, line 133, running simulation on IBM Cloud

Works locally on a Mac. Gives the syntax error below when running on IBM Cloud with python-3.5.6.
Requirements.txt
Flask==0.10.1
requests==2.12.4
flask-restful-swagger==0.19
flask-restplus==0.9.2
watson-developer-cloud>=1.3.5
python-dotenv==0.8.2
cloudant==2.12.0
UliEngineering==0.3.3
scipy>=0.5
toolz==0.10.0
numpy>=1.5
Error
Traceback (most recent call last):
File "skill.py", line 16, in <module>
from UliEngineering.SignalProcessing.Simulation import sine_wave
File "/home/vcap/deps/0/python/lib/python3.5/site-packages/UliEngineering/SignalProcessing/Simulation.py", line 8, in <module>
from UliEngineering.EngineerIO import normalize_numeric
File "/home/vcap/deps/0/python/lib/python3.5/site-packages/UliEngineering/EngineerIO.py", line 133
self.unit_prefix_re = re.compile(f'({__unitprefix_set})+$') # $: Matched at end of numeric part
^
SyntaxError: invalid syntax
Any ideas how I might be able to fix this?
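For reference, the line that fails is an f-string, and f-strings (PEP 498) only exist from Python 3.6 onward, so Python 3.5.6 cannot even parse that module. A tiny illustration (the variable names here are made up):
prefix_set = "kMGT"

# f-string: parses on Python 3.6+, SyntaxError on 3.5
pattern = f'({prefix_set})+$'

# Equivalent that also parses on Python 3.5
pattern = '({})+$'.format(prefix_set)
So either run the app on a Python 3.6+ runtime or pin a UliEngineering release that predates the f-string usage.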

findspark.init() IndexError: list index out of range: PySpark on Google Colab

I am trying to install PySpark on Colab.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.1/spark-2.4.1-bin-hadoop2.7.tgz
!tar xf spark-2.4.1-bin-hadoop2.7.tgz
!pip install -q findspark
After installing the above, I set the environment variables as follows:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
After that, I tried to initialize pyspark as follows and ended up with an error.
import findspark
findspark.init()
Error:
IndexError Traceback (most recent call last)
<ipython-input-24-4e91d34768ac> in <module>()
1 import findspark
----> 2 findspark.init()
/usr/local/lib/python3.6/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
133 # add pyspark to sys.path
134 spark_python = os.path.join(spark_home, 'python')
--> 135 py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
136 sys.path[:0] = [spark_python, py4j]
137
IndexError: list index out of range
Can you try setting the
os.environ["SPARK_HOME"] = "/content/spark-2.2.1-bin-hadoop2.7"
to the same Spark version as the one you installed above? In your case that would be 2.4.1, not 2.2.1.
os.environ["SPARK_HOME"] = "/content/spark-2.4.1-bin-hadoop2.7"
Make sure that your Java and Spark paths (including version) are correct:
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
Then check whether the path is correct by listing it:
print(os.listdir('./sample_data'))
If you get a list of sample files back, the code will initialize without any 'index out of range' errors.
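A quick way to sanity-check SPARK_HOME before calling findspark.init() is to repeat the glob that findspark itself runs (visible in the traceback above); if it prints an empty list, the path or version is wrong:
import os
from glob import glob

spark_home = "/content/spark-2.4.1-bin-hadoop2.7"  # must match the version actually downloaded
spark_python = os.path.join(spark_home, "python")

# findspark fails with "list index out of range" exactly when this glob is empty.
print(glob(os.path.join(spark_python, "lib", "py4j-*.zip")))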

AWS CloudWatch Logs agent throws an error

I'm setting up the awslogs agent on an EC2 instance. When I run the awslogs Python setup script, I get the message below.
Downloading the latest CloudWatch Logs agent bits ... ERROR: Failed to create virtualenv. Try manually installing with pip and adding it to the sudo user's PATH before running this script.
And awslogs-agent-setup.log shows the error below.
Environment: CentOS 6.10 and Python 2.6
Traceback (most recent call last):
File "/usr/bin/pip", line 7, in <module>
from pip._internal import main
File "/usr/lib/python2.6/site-packages/pip-19.0.3-py2.6.egg/pip/_internal/__init__.py", line 19, in <module>
from pip._vendor.urllib3.exceptions import DependencyWarning
File "/usr/lib/python2.6/site-packages/pip-19.0.3-py2.6.egg/pip/_vendor/urllib3/__init__.py", line 8, in <module>
from .connectionpool import (
File "/usr/lib/python2.6/site-packages/pip-19.0.3-py2.6.egg/pip/_vendor/urllib3/connectionpool.py", line 92
_blocking_errnos = {errno.EAGAIN, errno.EWOULDBLOCK}
^
SyntaxError: invalid syntax
/usr/bin/virtualenv
Traceback (most recent call last):
File "/usr/bin/virtualenv", line 7, in <module>
from virtualenv import main
File "/usr/lib/python2.6/site-packages/virtualenv.py", line 51, in <module>
print("ERROR: {}".format(sys.exc_info()[1]))
ValueError: zero length field name in format
Basically, this error is due to your Python version (2.6). The pip and virtualenv code in the traceback relies on syntax and string-formatting features that only exist from Python 2.7 onward, so please update your Python from 2.6 to 2.7 or a newer 3.x release.
This should help.
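Both tracebacks point at features that only exist from Python 2.7 onward; a minimal illustration of the two failing constructs:
import errno

# Set literals were added in Python 2.7; on 2.6 this line (the same construct
# that fails in urllib3's connectionpool.py) is a SyntaxError.
blocking_errnos = {errno.EAGAIN, errno.EWOULDBLOCK}

# Auto-numbered "{}" format fields were also added in 2.7; on 2.6 this call
# (the pattern virtualenv uses) raises "ValueError: zero length field name in format".
print("ERROR: {}".format("example"))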