Closing PySpark notebook properly - ipython

I'm using Jupyter notebook with PySpark, which uses Spark as a kernel.
The problem is that I'm not sure how to properly close it and I have an impression that something keeps hanging, as the memory on the driver on which the notebook is running gets full and crashes (I get GC overhead exception).
I'm closing the whole thing by simply killing the notebook using the process id which I save to .pid file. But I have a feeling that the following state is note good:
What is the actual problem and how to solve it, that is, how the close the whole thing (on driver and on the yarn) properly?

You should use "File" -> "Close and halt" inside Jupyter. This will close spark context and kill yarn containers from the session.

Related

Manually Stop Jupyter Kernel and Prevent from Restarting

Background
I have created a Jupyter kernel A from which I launch another kernel B. I am doing this in order to audit the execution of kernel B. So when a user selects kernel A from the interface, kernel B is launched in the background which then executes the notebook code. strace is being used to audit the execution. After the audit phase, code, data, and provenance etc. of the program execution are recorded and stored for analysis later on.
Problem
After the notebook program ends, I intend to stop tracing the execution of kernel B. This does not happen unless I stop the execution of kernel B launched internally by kernel A. The only way I have been able to do this is using the kill command as such:
os.kill(os.getpid(), 9)
This does the job but with a side-effect: Jupyter restarts the kernel automatically which means kernel A and B are launched and start auditing the execution again. This causes certain race conditions and overwrites of some files which I want to avoid.
Possible Solution
To my mind, there are two things I can do to resolve this issue:
Exit the kernel B program gracefully so the auditing of the notebook code gets completed and stored. This does not happen with the kill command so would need some other solution
Avoid automatic restart of the kernel, with or without the kill command.
I have looked into different ways to achieve the above two but have not been successful yet. Any advice on achieving either of the above two solutions would be appreciated, or perhaps another way of solving the problem.
have you tried terminating kernel B instead of killing it
using 15 instead of 9
os.kill(os.getpid(), signal.SIGTERM)
#or
os.kill(os.getpid(), 15)
other values for kill are signal.SIGSTOP=23 , signal.SIGHUP=1
Other option could be to insert following code snippet on top of the code
import signal
import sys
def onSigExit():
sys.exit(0)
#os._exit(0)
signal.signal(signal.SIGINT, onSigExit)
signal.signal(signal.SIGTERM, onSigExit)
now you should be able to send
os.kill(os.getpid(),15)
and it should exit gracefully and not restart

Use pyspark on yarn cluster without creating the context

I'll try to do my best to explain my self. I'm using JupyterHub to connect to the cluster of my university and write some code. Basically i'm using pyspark but, since i'va always used "yarn kernel" (i'm not sure of what i'm sying) i've never defined the spark context or the spark session. Now, for some reason, it doesn't work anymore and when i try to use spark this error appears:
Code ->
df = spark.read.csv('file:///%s/.....
Error ->
name 'spark' is not defined
It already happend to me but i just solved by installing another version of pyspark. Now i don't know what to do.

Session isn't active Pyspark in an AWS EMR cluster

I have opened an AWS EMR cluster and in pyspark3 jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
I had the same issue and the reason for the timeout is the driver running out of memory. Since you run collect() all the data gets sent to the driver. By default the driver memory is 1000M when creating a spark application through JupyterHub even if you set a higher value through config.json. You can see that by executing the code from within a jupyter notebook
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory just do
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From This stack overflow question's answer which worked for me
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even despite the Spark app succeeds your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. edit the /etc/livy/conf/livy.conf file (in the cluster's master node)
2. set the livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. restart Livy to update the setting: sudo restart livy-server in the cluster's master
4. test your code again
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
Just a restart helped solve this problem for me. On your Jupyter Notebook, go to -->Kernel-->>Restart
Once done, if you run the cell with "spark" command you will see that a new spark session gets established.
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
Solution might be to increase spark.executor.heartbeatInterval. Default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
Insufficient reputation to comment.
I tried increasing heartbeat Interval to a much higher (100 seconds), still the same result. FWIW, the error shows up in < 9s.
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.

How can I execute a S3-dist-cp command within a spark-submit application

I have a jar file that is being provided to spark-submit.With in the method in a jar. I’m trying to do a
Import sys.process._
s3-dist-cp —src hdfs:///tasks/ —dest s3://<destination-bucket>
I also installed s3-dist-cp on all salves along with master.
The application starts and succeeded without error but does not move the data to S3.
This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead and it sucessfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path) though (when accounting in the time taken by the additional write to hdfs that is required in order to use hadoop distcp). I'm also very interested in the answer to your question though; I think s3-dist-cp might be faster given the aditional optimizations done by Amazon.
s3-dist-cp is now a default thing on the Master node of the EMR cluster.
I was able to do an s3-dist-cp from with in the spark-submit successfully if the spark application is submitted in "client" mode.

xmonad --restart does not seem to be working

As of recent changes to xmonad.hs (importing and using MouseResizeableTile layout and FindEmptyWorkspace action) xmonad --recompile's fine and if I log out and in again all is well, but if I issue an xmonad --restart nothing seems to happen. Certainly, my starthook is not run. As this behaviour is so totally unexpected I'm not sure where to begin to look. I have rolled back changes to the last time it worked, but to no avail.
darcs version 0.12 on ubuntu 14.04
This is a known problem when the workspace stack is too complicated to pass on to the newly spawned instance of xmonad - i.e. the arg list to the new instance exceeds the limits of the kernel. The failed instance falls back safely to the current running session.
Next question, how to increase the maximum argument length...