PySpark on Linux with PyCharm - first exception error - pyspark

I am trying to run my first PySpark script on a Linux VM I configured. The error message I have is KeyError: SPARK_HOME when I run the following:
from os import environ
from pyspark import SparkContext
I momentarily made this error go away by running export SPARK_HOME=~/spark-2.4.3-bin-hadoop2.7. I then ran into a new error: error=2, No such file or directory. Searching took me to this page: https://community.cloudera.com/t5/Community-Articles/Tutorial-Install-Configure-iPython-and-create-run-PySpark/ta-p/246400. I then ran export PYSPARK_PYTHON=~/python3*. This brings me back to the KeyError: SPARK_HOME error.
Honestly, I'm stumbling through this, because it's my first time configuring Spark and using PySpark, and I still don't quite understand the ins and outs of PyCharm either.
I expect to be able to run the basic sample script from this page with no issues: https://medium.com/parrot-prediction/integrating-apache-spark-2-0-with-pycharm-ce-522a6784886f

There is a package called findspark here.
Or you can use the code below to set the path if it is not found in the environment:
import os
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'full_path_to_spark_root'
[code continues]
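Putting those pieces together for the original PyCharm script, a minimal end-to-end sketch could look like the following. The paths are placeholders for your own Spark and Python 3 locations, not the asker's actual ones, and it assumes the pyspark package itself is importable:
import os

# Placeholder paths - point these at your actual Spark and Python 3 installs
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/user/spark-2.4.3-bin-hadoop2.7'
if 'PYSPARK_PYTHON' not in os.environ:
    os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

from pyspark import SparkContext

sc = SparkContext(appName='first-pyspark-script')
print(sc.parallelize(range(10)).sum())  # quick sanity check: should print 45
sc.stop()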

Related

Visual Studio Code using pytest for PySpark getting stuck at SparkSession creation

I am trying to run a PySpark unit test in Visual Studio Code on my local Windows machine. When I debug the test, it gets stuck at the line where I create a SparkSession. It doesn't show any error or failure, but the status bar just shows "Running Tests". Once it works, I can refactor my test to create the SparkSession as part of a test fixture, but presently my test is getting stuck at SparkSession creation.
Do I have to install/configure anything on my local machine for the SparkSession to work?
I tried a simple test with assert 'a' == 'b' and I can debug and run the test successfully, so I assume my pytest configuration is correct. The issue I am facing is with creating the SparkSession.
# test code
from pyspark.sql import SparkSession, Row, DataFrame
import pytest
def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test
    spark_session.createDataFrame(data, schema)  # data and schema not shown here
Thanks
What I did to make it work was:
Create a .env file in the root of the project
Add the following content to the created file:
SPARK_LOCAL_IP=127.0.0.1
JAVA_HOME=<java_path>/jdk/zulu#1.8.192/Contents/Home
SPARK_HOME=<spark_path>/spark-3.0.1-bin-hadoop2.7
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Go to the .vscode folder in the root, expand it and open settings.json. Add the following line (replace <workspace_path> with your actual workspace path):
"python.envFile": "<workspace_path>/.env"
After refreshing the Testing section in Visual Studio Code, the setup should succeed.
Note: I use pyenv to setup my python version, so I had to make sure that VS Code was using the correct python version with all the expected dependencies installed.
Solution inspired by py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM and https://github.com/microsoft/vscode-python/issues/6594
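As a side note, once the environment variables resolve, the SparkSession the question mentions can be moved into a pytest fixture. A minimal sketch, assuming nothing beyond pytest and pyspark; the fixture name and scope are my own choices, not from the original post:
# conftest.py - minimal sketch of a session-scoped SparkSession fixture
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark_session():
    spark = SparkSession.builder.master('local[2]').appName('pytest-spark').getOrCreate()
    yield spark
    spark.stop()

# test_poc.py - the original test rewritten to use the fixture
def test_poc(spark_session):
    df = spark_session.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'label'])
    assert df.count() == 2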

Error when running pyspark

I tried to run pyspark via the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, this error comes up in the terminal:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to add one of the lines below to your environment configuration (e.g. your .bash_profile):
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it will help.
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library with pip.
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()
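If the notebook starts but you are unsure whether the executors actually picked up python3, a small follow-up check along the same lines (a sketch, assuming only that findspark is installed) is to ask a worker for its interpreter version:
import sys

import findspark
findspark.init()
from pyspark import SparkContext

sc = SparkContext()
print('driver Python:', sys.version.split()[0])
# ask a single worker task which interpreter it runs; PYSPARK_PYTHON controls this
worker_version = sc.parallelize([0], 1).map(lambda _: __import__('sys').version.split()[0]).first()
print('worker Python:', worker_version)
sc.stop()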

Running Apache SystemML

I am trying to get Apache SystemML set up and running (on Ubuntu) in a standalone mode.
I am relying on the github documentation to set this up.
I would like to run this with pyspark, and I am following the instructions from this beginner's guide.
After successfully installing systemml and launching the pyspark shell, I tried the following code from the tutorial:
import systemml as sml
import numpy as np
m1 = sml.matrix(np.ones((3,3)) + 2)
The import statements work fine, however I encounter the following error with the 3rd line:
ImportError: Unable to load systemML.jar into the current pyspark session. Hint: Provide
the following argument to pyspark: --driver-class-path /usr/local...
As per the hint provided, I launched pyspark again, appending "--driver-class-path ..." at the end, but I encountered the same error.
While googling for this, I found the error highlighted in the Apache SystemML documentation. However, I wasn't really able to address the issue.
Any help will be greatly appreciated!
Can you please confirm that "/usr/local..." in your comment is the path to systemml-*-incubating-SNAPSHOT.jar and that the file exists?
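If it helps, you can locate the jar that pip installed from Python before launching the shell. This is a sketch: the exact directory layout inside the systemml package is an assumption, which is why it searches recursively rather than hard-coding a subfolder:
import os

import systemml

# Search the installed systemml Python package for the bundled jar(s)
pkg_dir = os.path.dirname(systemml.__file__)
jars = [os.path.join(root, name)
        for root, _, names in os.walk(pkg_dir)
        for name in names if name.endswith('.jar')]
print(jars)  # pass the SystemML jar path to: pyspark --driver-class-path <jar>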

Virtualenv PyDev Undefined variable from import error

First of all, I'm aware of the question here, but I couldn't find a satisfying answer there. I don't want to ignore errors or use comments - I want to have the right settings in Eclipse/PyDev. My problem is pretty similar to this one.
I'm using Ubuntu 12.04 and installed a virtualenv for Python 2.7 in my home directory. After installing several Python packages (numpy, scipy, matplotlib, etc.) using pip, I installed Eclipse 4.3 with PyDev.
If I use the Python system interpreter at /usr/bin/python everything works fine (except that I didn't want to use it). However, if I try to set up a Python interpreter using the virtualenv, I first get the warning described here. After clicking "proceed anyway", it seems to work. So far so good.
However, e.g. import numpy as np gives, for each np.* call, the Eclipse/PyDev error Undefined variable from import, and the code completion doesn't work properly either. It seems to work, e.g. for datetime, but not for numpy, scipy and matplotlib.
Has anybody figured out how to configure Eclipse correctly?
I already tried to add the numpy path manually to the virtualenv interpreter, but then I get this weird error:
import matplotlib.dates as mpl_dates
  File "/home/pydev/myenv-py27/local/lib/python2.7/site-packages/matplotlib/__init__.py", line 149, in <module>
    import sys, os, tempfile
  File "/usr/lib/python2.7/tempfile.py", line 34, in <module>
    from random import Random as _Random
ImportError: cannot import name Random

Need help to resolve errors in using py2exe - missing modules

The following modules appear to be missing
email.Generator
email.Iterators
email.Utils
win32api
win32con
win32pipe
wx
My setup file looks like this:
from distutils.core import setup
import py2exe
setup(console=['fwsm_migration.py'])
I'm using Python 2.5.4 and py2exe 0.6.8.
I looked here and elsewhere for a solution but have not found one.
I read about using "options", but being new to Python itself I don't know where to set them.
Please help!
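For reference, the "options" mentioned above go into the setup() call. The sketch below shows the general shape; splitting the names reported missing between 'includes' and 'packages' is my own guess, not something from the original post:
# setup.py - a sketch of passing py2exe options for the modules reported missing
from distutils.core import setup
import py2exe

setup(
    console=['fwsm_migration.py'],
    options={
        'py2exe': {
            # individual extension modules reported missing above
            'includes': ['win32api', 'win32con', 'win32pipe'],
            # whole packages reported missing above
            'packages': ['email', 'wx'],
        }
    },
)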
Try cx_Freeze. After having a hell of a time with py2exe, cx_Freeze compiled my script without any configuration. In the same environment, py2exe claimed I'd missed nine packages.
For simple scripts you only need to do:
cxfreeze hello.py --target-dir dist
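If you later want a build script instead of the one-liner, a minimal cx_Freeze setup.py is sketched below; the script name simply reuses fwsm_migration.py from the question:
# setup.py - minimal cx_Freeze equivalent of the py2exe setup above (a sketch)
from cx_Freeze import setup, Executable

setup(
    name='fwsm_migration',
    version='0.1',
    description='Frozen console script',
    executables=[Executable('fwsm_migration.py')],
)
Run it with: python setup.py build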