How to start a Spark Shell using pyspark in Windows? - pyspark

I am a beginner in Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run in cmd the following:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive "'python' is not recognized as an internal or external command, operable program or batch file".

You need to have Python available on the system path; you can add it with setx:
setx path "%path%;C:\Python27"
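After opening a new cmd window (setx does not affect already-open ones), a quick way to confirm that the interpreter now resolves, using only the Python standard library (my own addition), is:
from distutils.spawn import find_executable  # works on Python 2 and 3
print(find_executable("python"))  # should print something like C:\Python27\python.exe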

I'm a fairly new Spark user (as of today, really). I am using spark 1.6.0 on Windows 10 and 7 machines. The following worked for me:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# the py4j version must match the one shipped in your Spark distribution
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Using the code above, I was able to launch Spark in an IPython notebook and my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if you have your environment variables set correctly for Python and Spark (pyspark).
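Note that execfile only exists on Python 2. A Python 3 equivalent of the snippet above (my own adaptation, with the same SPARK_HOME assumptions) would be:
import os
import sys

spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0, os.path.join(spark_home, 'python'))
# adjust the py4j zip name to whatever sits in your Spark's python/lib folder
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
shell_py = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(shell_py).read(), shell_py, 'exec'))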

I run this set of path settings whenever I start pyspark in IPython:
import os
import sys
# Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') for R
### restart spark using: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
os.environ['SPARK_HOME']="G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
##sc.stop() # IF you wish to stop the context
sc = SparkContext("local", "Simple App")
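A quick sanity check (my own addition) that the context created above actually works:
# should print 4950 if the SparkContext is healthy
print(sc.parallelize(range(100)).sum())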

With the reference and help of the user "maxymoo" I was able to find a way to set a PERMANENT path in Windows 7 as well. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx

Simply set the path in System -> Environment Variables -> Path
R path on my system: C:\Program Files\R\R-3.2.3\bin
Python path on my system: c:\python27
Spark path on my system: c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.
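To double-check from Python that the entries actually made it into the PATH (my own addition):
import os
# os.pathsep is ";" on Windows
for entry in os.environ["PATH"].split(os.pathsep):
    print(entry)  # look for the R, Python and Spark directories listed above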

Related

Is there a spark-defaults.conf when installed with pip install pyspark

I installed pyspark with pip.
I code in jupyter notebooks. Everything works fine, but now I get a java heap space error when exporting a large .csv file.
Here someone suggested editing the spark-defaults.conf. Also, in the Spark documentation, it says
"Note: In client mode, this config must not be set through the
SparkConf directly in your application, because the driver JVM has
already started at that point. Instead, please set this through the
--driver-memory command line option or in your default properties file."
But I'm afraid there is no such file when installing pyspark with pip.
Am I right? How do I solve this?
Thanks!
I recently ran into this as well. If you look at the Spark UI under the Classpath Entries, the first path is probably the configuration directory, something like /.../lib/python3.7/site-packages/pyspark/conf/. When I looked for that directory, it didn't exist; presumably it's not part of the pip installation. However, you can easily create it and add your own configuration files. For example,
mkdir /.../lib/python3.7/site-packages/pyspark/conf
vi /.../lib/python3.7/site-packages/pyspark/conf/spark-defaults.conf
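For the heap-space error in the question, the new spark-defaults.conf might then contain something like the following (the values are illustrative, not taken from the original answer):
spark.driver.memory          4g
spark.driver.maxResultSize   2g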
The spark-defaults.conf file should be located in:
$SPARK_HOME/conf
If no file is present, create one (a template should be available in the same directory).
How to find the default configuration folder
Check contents of the folder in Python:
import glob, os
glob.glob(os.path.join(os.environ["SPARK_HOME"], "conf", "spark*"))
# ['/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-env.sh.template',
# '/usr/local/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf.template']
When no spark-defaults.conf file is available, built-in values are used
To my surprise, no spark-defaults.conf was present, just a template file!
Still, I could look at the Spark properties, either in the "Environment" tab of the Web UI at http://<driver>:4040 or using getConf().getAll() on the Spark context:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.getOrCreate()
spark.sparkContext.getConf().getAll()
# [('spark.driver.port', '55128'),
# ('spark.app.name', 'myApp'),
# ('spark.rdd.compress', 'True'),
# ('spark.sql.warehouse.dir', 'file:/path/spark-warehouse'),
# ('spark.serializer.objectStreamReset', '100'),
# ('spark.master', 'local[*]'),
# ('spark.submit.pyFiles', ''),
# ('spark.app.startTime', '1645484409629'),
# ('spark.executor.id', 'driver'),
# ('spark.submit.deployMode', 'client'),
# ('spark.app.id', 'local-1645484410352'),
# ('spark.ui.showConsoleProgress', 'true'),
# ('spark.driver.host', 'xxx.xxx.xxx.xxx')]
Note that not all properties are listed, only the values explicitly specified through spark-defaults.conf, SparkConf, or the command line. For all other configuration properties, you can assume the default value is used.
For instance, consider the default parallelism, which in my case is:
spark._sc.defaultParallelism
8
This is the default for local mode, namely the number of cores on the local machine; see https://spark.apache.org/docs/latest/configuration.html. In my case 8 = 2 x 4 cores because of hyper-threading.
If the property spark.default.parallelism is passed when launching the app
spark = SparkSession \
.builder \
.appName("Set parallelism") \
.config("spark.default.parallelism", 4) \
.getOrCreate()
then the property is shown in the Web UI and in the list
spark.sparkContext.getConf().getAll()
Precedence of configuration settings
Spark will consider given properties in this order (spark-defaults.conf comes last):
SparkConf
flags passed to spark-submit
spark-defaults.conf
From https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties:
Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
Note
Some pyspark Jupyter kernels contain flags for spark-submit in the environment variable $PYSPARK_SUBMIT_ARGS, so one might want to check that too.
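A quick way to inspect that variable from the running kernel (my own addition):
import os
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))  # None if the kernel does not set it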
Related question: Where to modify spark-defaults.conf if I installed pyspark via pip install pyspark
The spark-defaults.conf file is needed when we have to change any of the default configs for Spark.
As @niuer suggested, it should be present in the $SPARK_HOME/conf/ directory. But that might not be the case for you. By default, only a template config file will be present there. You can just add a new spark-defaults.conf file in $SPARK_HOME/conf/.
Check your Spark path. There are configuration files under $SPARK_HOME/conf/, e.g. spark-defaults.conf.

Spark reading file from local gives InvalidInputException

I'm using Spark 2.2.0 installed via Homebrew on OSX High Sierra. I went into spark-shell and tried to read a local file like so:
val lines = sc.textFile("file:///Users/username/Documents/Notes/sampleFile")
val llist = lines.collect()
This gives me:
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/Users/bsj625/Documents/Notes/sampleFile
I've tried a bunch of variations, file:/ and file://. I also tried running spark-shell in local mode like so:
spark-shell --master local
But I'm still getting the same error. Are there any environment variables I need to set? Any help appreciated.
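One simple check that is not from the original thread: confirm that the exact path is visible and readable to the user running the shell, for example from Python on the same machine:
import os
path = "/Users/username/Documents/Notes/sampleFile"  # adjust to the path passed to textFile
print(os.path.exists(path), os.access(path, os.R_OK))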

zeppelin-0.7.3 Interpreter pyspark not found

I get the below error when I use pyspark via Zeppelin.
The python & spark interpreters work and all environment variables are set correctly.
print os.environ['PYTHONPATH']
/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/pyspark.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/pyspark:/x01/spark_u/zeppelin/interpreter/python/py4j-0.9.2/src:/x01/spark_u/zeppelin/interpreter/lib/python
zeppelin-env.sh is set with the below vars
export PYSPARK_PYTHON=/usr/local/bin/python2
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}
export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"
See the below log file
INFO [2017-11-01 12:30:42,972] ({pool-2-thread-4}
RemoteInterpreter.java[init]:221) - Create remote interpreter
org.apache.zeppelin.spark.PySparkInterpreter
org.apache.zeppelin.interpreter.InterpreterException:
paragraph_1509038605940_-1717438251's Interpreter pyspark not
found
Thank you in advance
I found a workaround for the above issue. The "interpreter not found" issue does not happen when I create a note inside a directory; it only happens when I use notes at the top level. Additionally, I found out that this issue does not happen in version 0.7.2.

Importing PySpark packages

I have downloaded the graphframes package (from here) and saved it on my local disk. Now, I would like to use it. So, I use the following command:
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5
All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:
/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar, however these files don't exist. Moreover, the /tmp/spark-1eXXX/userFiles-9XXX/ directory is empty.
What am I missing?
In my case:
1. cd /home/zh/.ivy2/jars
2. jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar
3. add /home/zh/.ivy2/jars to PYTHONPATH in spark-env.sh, like the line below:
export PYTHONPATH=$PYTHONPATH:/home/zh/.ivy2/jars:.
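To verify the change took effect (my own addition), one can append the same directory to sys.path in a Python session and confirm the package resolves:
import sys
sys.path.append("/home/zh/.ivy2/jars")  # directory the jar was extracted into
import graphframes
print(graphframes.__file__)  # should point inside /home/zh/.ivy2/jars/graphframes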
This might be an issue in Spark packages with Python in general. Someone else was asking about it too earlier on the Spark user discussion alias.
My workaround is to unpackage the jar to find the python code embedded, and then move the python code into a subdirectory called graphframes.
For instance, I run pyspark from my home directory
~$ ls -lart
drwxr-xr-x 2 user user 4096 Feb 24 19:55 graphframes
~$ ls graphframes/
__init__.pyc examples.pyc graphframe.pyc tests.pyc
You would not need the --py-files or --jars parameters then; something like
IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5
and having the python code in the graphframes directory should work.
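Once the import resolves, a minimal smoke test (my own sketch, assuming sc is the SparkContext provided by the pyspark shell) would be:
from pyspark.sql import SQLContext
from graphframes import GraphFrame

sqlContext = SQLContext(sc)
v = sqlContext.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = sqlContext.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.inDegrees.show()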
Add these lines to your $SPARK_HOME/conf/spark-defaults.conf :
spark.executor.extraClassPath file_path/jar1:file_path/jar2
spark.driver.extraClassPath file_path/jar1:file_path/jar2
In the more general case of importing an 'orphan' Python file (outside of the current folder, not part of a properly installed package), use addPyFile, e.g.:
sc.addPyFile('somefolder/graphframe.zip')
addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

How to import modules in IPython Clusters

I am trying to import some of my personal modules into my IPython Clusters. I am using Anaconda on Windows Vista 64-bit.
from IPython.parallel import Client
rc = Client()
dview = rc[:]
with dview.sync_imports():
    import lib.rf
It is giving me this error:
No module named 'lib.rf'
I can import the module in the rest of my IPython notebook, as I have this .bat file to start ipython notebook:
cd C:\Users\Jon\workspace\bf
set PYTHONPATH=%PYTHONPATH%;C:\Users\Jon\workspace\bf
C:\Anaconda\envs\p33\scripts\ipython notebook
I am using this similar code to start my ip clusters:
cd C:\Users\Jon\workspace\bf
set PYTHONPATH=%PYTHONPATH%;C:\Users\Jon\workspace\bf
C:\Anaconda\envs\p33\Scripts\ipcluster start --n=7
Why is this not working?
More info:
If I print out sys.path, I get a list that contains C:\Users\Jon\workspace\bf
If I print out the paths of my clusters, I get the same list:
%px sys.path
['',
'',
'',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\distribute-0.6.28-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\pykalman-0.9.5-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\patsy-0.2.1-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\joblib-0.8.3_r1-py3.3.egg',
'C:\\Users\\Jon\\workspace\\bf',
'C:\\Users\\Jon\\workspace\\bf\\my_numba',
'C:\\Anaconda\\envs\\p33\\python33.zip',
'C:\\Anaconda\\envs\\p33\\DLLs',
'C:\\Anaconda\\envs\\p33\\lib',
'C:\\Anaconda\\envs\\p33',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\Sphinx-1.2.3-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32\\lib',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\Pythonwin',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\runipy-0.1.1-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\setuptools-7.0-py3.3.egg',
'C:\\Anaconda\\envs\\p33\\lib\\site-packages\\IPython\\extensions']
In [45]:
Further analysis:
%px lib.__path__
Out[0:11]: _NamespacePath(['C:\\Anaconda\\envs\\p33\\lib\\site-packages\\win32\\lib'])
lib.__path__
Out[57]: ['.\\lib']
Looks like the ipcluster and notebook are looking at lib in different places. I have tried renaming lib to mylib. It has not helped.
It seems that with dview.sync_imports() is being run someplace other than your IPython Notebook environment and is therefore relying on a different PYTHONPATH. It is definitely not being run on one of the cluster engines, so I wouldn't expect it to leverage your cluster settings of PYTHONPATH.
I'm thinking you'll need to have that directory in your PYTHONPATH (not your PATH) for the calling python environment because that is the location from which you are importing the modules.
The impact of the bit you have about setting the PYTHONPATH in the DOS shell from which you invoke ipcluster isn't clear to me. I can see that one might expect this to let the engines know about your directory, but I'm wondering if that PYTHONPATH gets initialized to the environment from which you call IPython.parallel.Client.
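One workaround I have seen (an assumption on my part, not something spelled out in this answer) is to push the directory onto sys.path on every engine before calling sync_imports:
from IPython.parallel import Client

rc = Client()
dview = rc[:]
# make each engine's interpreter aware of the project directory before importing from it
dview.execute(r"import sys; sys.path.insert(0, r'C:\Users\Jon\workspace\bf')", block=True)
with dview.sync_imports():
    import lib.rf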