How to ignore SPARK_HOME when unit testing with pyspark

I am trying to write unit tests with pyspark. Tests pass with the following configuration when SPARK_HOME is NOT set. There are multiple installations of Spark on our machines, and if SPARK_HOME is set to one of them, the tests fail on that machine.
@pytest.fixture(scope="session")
def spark_session(request):
    session = SparkSession\
        .builder\
        .master("local[2]")\
        .appName("pytest-pyspark-local-testing")\
        .getOrCreate()
    request.addfinalizer(lambda: session.stop())
    quiet_py4j()
    return session
I have tried os.environ["SPARK_HOME"] = "", which fails with FileNotFoundError: [Errno 2] No such file or directory: './bin/spark-submit'.
I have also tried os.unsetenv('SPARK_HOME'), which fails with Exception: Java gateway process exited before sending its port number. When I don't try to unset the env var at all, I get the same error.
How can I make sure that my tests will work on any machine, simply ignoring any environment variables?
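One approach worth trying (a sketch, not a verified fix) is to drop SPARK_HOME from os.environ inside the fixture, before the session is built, so that pyspark falls back to the Spark distribution bundled with the pip package. Note that os.unsetenv on its own does not update the os.environ mapping that pyspark reads; deleting the key from os.environ is what matters here:
import os
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session(request):
    # Assumption: pyspark was installed via pip, so it ships its own jars
    # and launcher scripts and does not need an externally set SPARK_HOME.
    os.environ.pop("SPARK_HOME", None)           # remove the key if present
    os.environ.pop("PYSPARK_SUBMIT_ARGS", None)  # optionally drop stale submit args too
    session = (SparkSession.builder
               .master("local[2]")
               .appName("pytest-pyspark-local-testing")
               .getOrCreate())
    request.addfinalizer(lambda: session.stop())
    return session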

Related

Unable to hit breakpoints while debugging Spark with IntelliJ CE

I am running a Spark JAR file using the command spark-submit testmysparkfile.jar, after having set export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005.
The code is written in Scala.
Below is the Spark session I am creating:
val spark = SparkSession
  .builder()
  .appName("testmysparkfile")
  .config("spark.serializer", classOf[KryoSerializer].getName)
  .master("local[*]")
  .getOrCreate()
When I run the JAR, the application starts and listens on port 5005. When I go back to IntelliJ and run 'debug', the debugger attaches fine, and the sample output is received in the terminal window that was listening on port 5005; however, the breakpoint I have set is not hit.
Debugger settings:
Debugger mode: Attach to remote JVM
Host: localhost
Port: 5005
Command line args for JVM: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
Use module classpath:
Output from the IntelliJ terminal:
Connected to the target VM, address: 'localhost:5005', transport: 'socket'
Disconnected from the target VM, address: 'localhost:5005', transport: 'socket'
I have followed examples that show how to debug Spark with a remote debugger, like this one: http://www.bigendiandata.com/2016-08-26-How-to-debug-remote-spark-jobs-with-IntelliJ/
However, this does not seem to work as I am unable to hit any breakpoints.

How to detect if your code is running under pyspark

For staging and production, my code will be running on PySpark. However, in my local development environment, I will not be running my code on PySpark.
This presents a problem from the standpoint of logging: because PySpark uses the Java library Log4J via Py4J, Log4J will not be used in local development.
Thankfully, the APIs for Log4J and the core Python logging module are essentially the same: once you get a logger object, with either module you simply call debug(), info(), etc.
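For illustration (the function name and its arguments below are made up), the overlapping call surface looks like this no matter which backend supplies the logger object:
def run_job(records, log):
    # `log` may be a Log4J logger obtained through py4j, a stdlib logging
    # logger, or loguru's logger: info/debug/error are spelled the same way.
    log.info("starting job with {} records".format(len(records)))
    try:
        return [r.upper() for r in records]
    except Exception:
        log.error("job failed")
        raise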
Thus, I wish to detect whether my code is being imported/run in a PySpark or a non-PySpark environment, similar to:
class App:
    def our_logger(self):
        if self.running_under_spark():
            sc = SparkContext(conf=conf)
            log4jLogger = sc._jvm.org.apache.log4j
            log = log4jLogger.LogManager.getLogger(__name__)
            log.warn("Hello World!")
            return log
        else:
            from loguru import logger
            return logger
How might I implement running_under_spark()?
Simply trying to import pyspark and seeing if it works is not a foolproof way of doing this, because I have pyspark installed in my dev environment to silence my IDE's warnings about unresolved imports.
Maybe you can set some environment variable in your Spark environment that you check for at runtime (in $SPARK_HOME/conf/spark-env.sh):
export SPARKY=spark
Then you check if SPARKY exists to determine if you're in your spark environment.
from os import environ

class App:
    def our_logger(self):
        if environ.get('SPARKY') is not None:
            sc = SparkContext(conf=conf)
            log4jLogger = sc._jvm.org.apache.log4j
            log = log4jLogger.LogManager.getLogger(__name__)
            log.warn("Hello World!")
            return log
        else:
            from loguru import logger
            return logger
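A quick usage sketch of that factory (SPARKY is the custom variable exported above; conf is assumed to be defined elsewhere in the application). Keeping the branch inside a single method means the calling code never needs to know which backend it received:
app = App()
log = app.our_logger()
log.info("pipeline started")  # goes to Log4J under Spark, to loguru locally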

Scala Spark : (org.apache.spark.repl.ExecutorClassLoader) Failed to check existence of class org on REPL class server at path

Running a basic df.show() after installing spark-notebook
I am getting the following error when running Scala Spark code in spark-notebook. Any idea why this occurs and how to avoid it?
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org.apache.spark.sql.catalyst.expressions.Object on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class org on REPL class server at spark://192.168.10.194:50935/classes
[org.apache.spark.util.Utils] Aborting task
[org.apache.spark.repl.ExecutorClassLoader] Failed to check existence of class
I installed Spark locally, and the following code gave me the same error.
spark.read.format("json").load("Downloads/test.json")
I think the issue was that it was trying to find a master node and picking up some random or default IP. I specified local mode and set the driver IP to 127.0.0.1, and that resolved my issue.
Solution
Run Spark using a local master:
/usr/local/bin/spark-shell --master "local[4]" --conf spark.driver.host=127.0.0.1
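If the session is being created programmatically rather than through spark-shell, the same two settings can be passed to the builder. The sketch below uses pyspark purely for illustration and assumes the root cause is the driver binding to a non-loopback address:
from pyspark.sql import SparkSession

# Pin the master to local mode and the driver host to the loopback address,
# mirroring the spark-shell flags above.
spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.driver.host", "127.0.0.1")
         .getOrCreate())

spark.read.format("json").load("Downloads/test.json").show()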

Pyspark not running the sqlContext in Pycharm

I hope someone can help with this problem I am having. I had previously set up a VM on Windows using CentOS, with Hadoop and Spark (all single-node), and it was working perfectly.
I am now running a multi-node setup with another computer, both running CentOS standalone. I have installed Hadoop successfully and it is running on both machines. I then installed Spark with the following setup:
Version: Spark 2.2.1-bin-hadoop2.7, with the .bashrc file as follows:
export SPARK_HOME=/opt/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH="/home/hadoop/anaconda2/bin:$PATH"
I am using Anaconda (Python 2.7) to install the pyspark packages. I then have the files under $SPARK_HOME/conf set up as follows:
The slaves file:
datanode1
(the hostname of the node I use to do the processing on)
and the spark-env.sh file:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
export HADOOP_CONF_DIR=/opt/hadoop/hadoop-2.8.3/etc/hadoop
export SPARK_WORKER_CORES=6
The idea is that I then connect Spark to the PyCharm IDE to do my work in. In PyCharm I have set the environment variables (under Run -> Edit Configurations) as:
PYTHONPATH: /opt/spark/spark-2.2.1-bin-hadoop2.7/python/lib
SPARK_HOME: /opt/spark/spark-2.2.1-bin-hadoop2.7
I have also set my Python interpreter to point to the Anaconda Python directory.
With all this set up, I get multiple errors when I call either a Spark SQLContext or SparkSession.builder, for example:
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
sql_sc = SQLContext(sc)
or
spark = SparkSession.builder \
    .master("local") \
    .appName("PythonTutPrac") \
    .config("spark.executor.memory", "2gb") \
    .getOrCreate()
The ERROR:
File "/home/hadoop/Desktop/PythonPrac/CollaborativeFiltering.py", line
72, in
.config("spark.executor.memory", "2gb") \ File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/session.py",
line 183, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value) File
"/home/hadoop/anaconda2/lib/python2.7/site-packages/py4j/java_gateway.py",
line 1160, in call
answer, self.gateway_client, self.target_id, self.name) File "/opt/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py",
line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)1, stackTrace) pyspark.sql.utils.IllegalArgumentException: u"Error while
instantiating 'org.apache.spark.sql.internal.SessionStateBuilder':"
Unhandled exception in thread started by > Process finished with exit code 1
I do not know why this error message is showing; when I was running this in my single-node VM, it worked fine. In my multi-node setup I then removed datanode1 and ran it again as a single-node setup on my main computer (hostname: master), but I am still getting the same errors.
I hope someone can help, as I have followed other guides to set up PyCharm with pyspark but could not figure out what is going wrong. Thanks!
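One thing worth trying (a sketch, not a confirmed fix for this particular stack trace) is to let the findspark package set SPARK_HOME and the pyspark/py4j paths from inside the script, so the PyCharm run configuration does not have to carry them. The install path below is taken from the question:
import findspark
findspark.init("/opt/spark/spark-2.2.1-bin-hadoop2.7")  # must run before any pyspark import

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("PythonTutPrac")
         .config("spark.executor.memory", "2g")
         .getOrCreate())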

Kerberos Exception launching Spark locally

I am trying to set up a Spark TestNG unit test:
@Test
def testStuff(): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  ...
}
The code fails with: IllegalArgumentException: Can't get Kerberos realm
What am I missing?
The error suggests that your JVM is unable to locate the Kerberos config (the krb5.conf file).
Depending on your company's environment/infrastructure, you have a few options:
Check if your company has a standard library for setting up Kerberos authentication.
Alternatively, try one of the following:
Set the JVM property -Djava.security.krb5.conf=/file-path/for/krb5.conf
Put the krb5.conf file into the <jdk-home>/jre/lib/security folder