"Session isn't active" error in PySpark on an AWS EMR cluster - pyspark

I have launched an AWS EMR cluster, and in a pyspark3 Jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?

I had the same issue, and the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see this by executing the following code from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory just do
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
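If you only need to inspect a sample rather than pull the whole column back to the driver, a smaller action avoids the memory pressure altogether. A minimal sketch, reusing sparkDF and textColName from the question:
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
print(textRdd.take(20))   # brings only 20 elements back to the driver
print(textRdd.count())    # aggregations run on the executors, not the driver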

From this Stack Overflow question's answer, which worked for me:
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. Edit the /etc/livy/conf/livy.conf file (on the cluster's master node)
2. Set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. Restart Livy to pick up the setting: sudo restart livy-server on the cluster's master node
4. Test your code again
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap

Just a restart helped solve this problem for me. In your Jupyter notebook, go to Kernel -> Restart.
Once done, if you run a cell containing the spark command, you will see that a new Spark session gets established.
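For example, after the restart, a cell containing nothing but the session variable is enough to make a fresh session get created and displayed:
spark   # evaluating the SparkSession starts a new Livy session if none is active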

You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
The solution might be to increase spark.executor.heartbeatInterval. The default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
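For illustration, the spark-defaults classification mentioned above might look roughly like this (a sketch only; 60s is an arbitrary example value, and note that spark.executor.heartbeatInterval should stay well below spark.network.timeout, which defaults to 120s):
[
  {
    "Classification": "spark-defaults",
    "Properties": {"spark.executor.heartbeatInterval": "60s"}
  }
]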

Insufficient reputation to comment.
I tried increasing the heartbeat interval to a much higher value (100 seconds); still the same result. FWIW, the error shows up in < 9s.

What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.

Related

ADF Dataflow stuck in progress and fails with the errors below

The ADF pipeline's Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using an IR with managed Virtual Network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev environment:
Error code: 4508, Spark cluster not found
Error in Prod environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration as below
Tried DF retry and retry interval
Also tried running the ForEach loop one batch at a time instead of 4 batches in parallel. None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; suddenly they started failing consistently in the last 3 days. The Data Flow always gets stuck in progress randomly for a different entity and eventually times out, throwing the errors above.
Error code 4508, Spark cluster not found.
This error can occur for two reasons.
The first is that the debug session is getting closed before the data flow finishes its transformation; in this case, the recommendation is to restart the debug session.
The second reason is a resource problem, or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" is a temporary error. A Spark cluster is used at the back end of the data flow, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket here.

Command confluent local services start gives an error: Starting ZooKeeper Error: ZooKeeper failed to start

I'm trying to run this command: confluent local services start
I don't know why, but each time it gives me an error before passing to the next step, so I have to run it again over and over until it passes all the steps.
What is the reason for the error, and how can I solve the problem?
You need to open the log files to inspect any errors that might be happening.
But it's possible the services are hitting a race condition: Schema Registry requires Kafka; REST Proxy and Connect require the Schema Registry... Maybe they are not waiting for the previous components to start.
Or maybe your machine does not have enough resources to start all services. E.g. I believe at least 6 GB of RAM is necessary. If you have 8 GB on the machine, and Chrome and lots of other services are running, for example, then you wouldn't have 6 GB readily available.

Azure Synapse - How to stop an Apache Spark application / notebook?

When I run a Spark notebook (in debug mode) in Azure Synapse Analytics, it doesn't seem to shut down as expected.
In the last cell I call: mssparkutils.notebook.exit("exiting notebook")
But then when I fire off another notebook (again in debug mode, same pool), I get this error:
AVAILABLE_COMPUTE_CAPACITY_EXCEEDED: Livy session has failed. Session state: Error. Error code: AVAILABLE_COMPUTE_CAPACITY_EXCEEDED. Your job requested 12 vcores. However, the pool only has 0 vcores available out of quota of 12 vcores. Try ending the running job(s) in the pool, reducing the numbers of vcores requested, increasing the pool maximum size or using another pool. Source: User.
So I go to Monitor => Apache Spark applications, and I see the first notebook I ran still in a "Running" status, and I can manually stop it.
How do I automatically stop the Notebook / Apache Spark application? I thought that was the notebook.exit() call but apparently not...
In debug mode, the cluster's vcores are supplied to the notebook for the entire duration of the debug session (that is, until one hour of inactivity, or until you manually terminate it).
Thus, you have two options:
Work on one notebook at a time, closing the debug session before starting another
OR
Configure the session to reduce the number of executors, so that the Spark cluster can provision all three debug sessions at the same time (you might need to increase the size of the cluster); a sketch of such a session configuration follows.
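As a sketch of that second option: Synapse notebooks accept a Livy-style %%configure magic in the first cell, so something along these lines would shrink the session's footprint (the numbers are illustrative and would need to fit inside the pool's 12-vcore quota):
%%configure -f
{
    "driverMemory": "4g",
    "driverCores": 2,
    "executorMemory": "4g",
    "executorCores": 2,
    "numExecutors": 2
}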

Spark application gets killed abruptly in EMR after 1 hour and the Livy session expires. What is the cause & solution?

I am using JupyterHub on an AWS EMR cluster, EMR version 5.16.
I submitted a Spark application using a pyspark3 notebook.
My application is trying to write 1 TB of data to S3.
I am using the autoscaling feature of EMR to scale up the task nodes.
Hardware configuration:
1. Master node: 32 GB RAM with 16 cores
2. Core node: 32 GB RAM with 16 cores
3. Task nodes: 16 GB RAM with 8 cores each (task nodes scale up to 15)
I have observed that Spark application gets killed after running for 50 to 60 minutes.
I tried debugging:
1. My cluster still had scope for scaling up, so it is not an issue of a shortage of resources.
2. The Livy session also gets killed.
3. In the job log, I saw the error message RECVD TERM SIGNAL "Shutdown hook received".
Please note:
1. I have set spark.dynamicAllocation.enabled=true
2. I am using the YARN fair scheduler with user impersonation in JupyterHub
Can you please help me understand the problem and its solution?
I think that I faced the same problem and I found the solution thanks to this answer.
The issue comes from the Livy configuration parameter livy.server.session.timeout, which sets the timeout for a session by default to 1 hour.
You should set it by adding the following line into the configurations of the EMR cluster.
[{"Classification": "livy-conf", "Properties": {"livy.server.session.timeout": "5h"}}]
This solved the issue for me.
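If you create the cluster programmatically, the same classification can be supplied at launch time. A minimal, hedged sketch with boto3 (the cluster name, instance types, and roles below are placeholders, not values from the question):
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="my-emr-cluster",                                    # placeholder name
    ReleaseLabel="emr-5.16.0",
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}, {"Name": "JupyterHub"}],
    Configurations=[
        {
            "Classification": "livy-conf",
            "Properties": {"livy.server.session.timeout": "5h"},
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",                         # default EMR roles; adjust to yours
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])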

Failed to run Zeppelin notebook demo Spark Streaming - Hortonworks sandbox

I tried to run the notebook demo available in Zeppelin on the Hortonworks sandbox 2.4 (a notebook named "twitter") to learn Spark Streaming. Following the instructions at the top of the notebook (/* BEFORE START....), I logged on to Ambari to modify the configuration of the YARN service.
CPU => Container: Minimum Container Size (VCores): 4; Maximum Container Size (VCores): 8
Memory => Node: 2250 MB; Container: Minimum Container Size: 768 MB; Maximum Container Size: 2250 MB
All services were restarted after the changes, but when I came back to Zeppelin to run the notebook, the second paragraph (/* UPDATE YOUR TWITTER CREDENTIALS */....) stayed in the "running" state and never reached "finished". All Twitter credentials are already updated.
P.S.: Without modifying the YARN configuration, I could run the second paragraph, but when running the 3rd it was always "running" and never "finished".
Thanks for any suggestions.
If the error is
Error: YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register
then change the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to 99 and the value of yarn.scheduler.capacity.maximum-am-resource-percent to 100.