Failed to run Zeppelin notebook demo Spark Streaming - Hortonworks sandbox - scala

I tried to run the demo notebook available in Zeppelin on the Hortonworks sandbox 2.4 (the notebook named "twitter") to learn Spark Streaming. Following the instructions at the top of the notebook (/* BEFORE START....), I logged on to Ambari to modify the configuration of the YARN service:
CPU => Container: Minimum Container Size (VCores): 4; Maximum Container Size (VCores): 8
Memory => Node: 2250 MB; Container: Minimum Container Size: 768 MB; Maximum Container Size: 2250 MB
I restarted all services after making these changes, but when I came back to Zeppelin to run the notebook, the second paragraph (/* UPDATE YOUR TWITTER CREDENTIALS */....) stayed in the "running" state and never reached "finished". All Twitter credentials were already updated.
P.S.: without modifying the YARN configuration I could run the second paragraph, but when running the third one, it stayed "running" and never "finished".
Thanks for any suggestions

If the error is
Error: YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register
change the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to 99 and the value of yarn.scheduler.capacity.maximum-am-resource-percent to 100%.
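For reference, a sketch of where those two properties live if you edit them outside Ambari (file locations can vary by HDP version; note that maximum-am-resource-percent is expressed as a fraction in capacity-scheduler.xml, so 100% corresponds to 1.0):

# yarn-site.xml
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage = 99

# capacity-scheduler.xml (1.0 = allow up to 100% of resources for ApplicationMasters)
yarn.scheduler.capacity.maximum-am-resource-percent = 1.0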

Related

Azure Synapse - How to stop an Apache Spark application / notebook?

When I run (in debug mode) a Spark notebook in Azure Synapse Analytics, it doesn't seem to shutdown as expected.
In the last cell I call: mssparkutils.notebook.exit("exiting notebook")
But then when I fire off another notebook (again in debug mode, same pool), I get this error:
AVAILABLE_COMPUTE_CAPACITY_EXCEEDED: Livy session has failed. Session state: Error. Error code: AVAILABLE_COMPUTE_CAPACITY_EXCEEDED. Your job requested 12 vcores. However, the pool only has 0 vcores available out of quota of 12 vcores. Try ending the running job(s) in the pool, reducing the numbers of vcores requested, increasing the pool maximum size or using another pool. Source: User.
So I go to Monitor => Apache Spark applications, and I see the first notebook I ran still in a "Running" status, and I can manually stop it.
How do I automatically stop the Notebook / Apache Spark application? I thought that was the notebook.exit() call but apparently not...
In debug mode, the pool's vCores are allocated to the notebook for the entire duration of the debug session (that is, until one hour of inactivity has passed or until you manually terminate it).
Thus, you have two options:
1. Work on one notebook at a time, closing the debug session before starting another; or
2. Configure the session to request fewer executors, so that the Spark pool can provision several debug sessions at the same time (you might need to increase the size of the pool); see the sketch below.
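If you take the second route, a minimal sketch using the %%configure cell magic at the top of the notebook (the numbers are placeholders, so size them to your pool; -f force-restarts the Livy session with the new settings):

%%configure -f
{
    "numExecutors": 1,
    "executorCores": 2,
    "executorMemory": "4g"
}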

Session isn't active Pyspark in an AWS EMR cluster

I have spun up an AWS EMR cluster, and in a PySpark3 Jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
I had the same issue, and the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see this by executing the following code from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory, just run:
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From this Stack Overflow question's answer, which worked for me.
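As a side note, if you don't actually need every row on the driver, you can avoid the full collect() entirely; a minimal PySpark sketch (sparkDF and textColName are the names from the question):

# Bring back only a sample instead of the whole RDD
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
print(textRdd.take(10))   # only 10 elements reach the driver
print(textRdd.count())    # counting happens on the executors

# Or stay in the DataFrame API, which also avoids collecting everything
sparkDF.select(textColName).show(10)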
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. Edit the /etc/livy/conf/livy.conf file (on the cluster's master node)
2. Set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app)
3. Restart Livy to pick up the setting: sudo restart livy-server on the cluster's master node
4. Test your code again
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
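For that alternative route, the same setting can also be applied through an EMR configuration classification at cluster creation; a sketch, assuming your EMR release bundles Livy and accepts the livy-conf classification:

{"Classification": "livy-conf", "Properties": {"livy.server.session.timeout": "8h"}}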
Just a restart helped solve this problem for me. In your Jupyter notebook, go to Kernel -> Restart.
Once done, if you run the cell with the "spark" command you will see that a new Spark session gets established.
You might get some insights from this similar Stack Overflow thread: Timeout error: Error with 400 StatusCode: "requirement failed: Session isn't active."
Solution might be to increase spark.executor.heartbeatInterval. Default is 10 seconds.
See EMR's official documentation on how to change Spark defaults:
You change the defaults in spark-defaults.conf using the spark-defaults configuration classification or the maximizeResourceAllocation setting in the spark configuration classification.
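For example, a sketch of that classification applied to the heartbeat setting (the value is illustrative):

{"Classification": "spark-defaults", "Properties": {"spark.executor.heartbeatInterval": "60s"}}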
Insufficient reputation to comment.
I tried increasing the heartbeat interval to a much higher value (100 seconds); still the same result. FWIW, the error shows up in under 9 seconds.
What worked for me is adding {"Classification": "spark-defaults", "Properties": {"spark.driver.memory": "20G"}} to the EMR configuration.

Monitor widget missing from Jupyter

What code, configuration, or steps will restore the monitoring widget to an EMR Jupyter notebook?
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-spark-monitor.html
Found this:
https://forums.aws.amazon.com/thread.jspa?threadID=308132
(It is dated August 15, 2019)
sc
Starting Spark application
ID | YARN Application ID                | Kind    | State | Spark UI | Driver log | Current session?
36 | application_blahblahblahsomenumber | pyspark | idle  | Link     | Link       | ✔
SparkSession available as 'spark'.
But running a normal program doesn't show the expected monitoring (number of partitions, seconds elapsed, etc.). It just runs silently with no clue except for the asterisk next to the code cell, In [*]. What gives?
See the graphic in the section under "The following is an example of the Spark job monitoring widget" on this page:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-spark-monitor.html
This issue has been fixed in the latest EMR Notebooks update.

How to update cluster status in Dataproc

I changed my initialization script after creating a cluster with 2 worker nodes for Spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes, it no longer works, and I get the following message:
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes were still added, but at first they didn't seem to be detected by the running Spark application, since no more executors were added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors were added. So everything seems to be working fine again, except that the cluster's status is still ERROR, and I can no longer increase or decrease the number of worker nodes.
How can I update the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to re-configure the cluster has failed, and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed, so further updates are disallowed. You can, however, still submit jobs.
Your best bet is to delete it and start over.
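A sketch of the delete-and-recreate cycle with gcloud (the region, worker count, and script path are placeholders; cluster-1 is the name from the question):

gcloud dataproc clusters delete cluster-1 --region=us-central1
gcloud dataproc clusters create cluster-1 --region=us-central1 --num-workers=2 \
    --initialization-actions=gs://your-bucket/your-init-script.sh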

Apache Spark in cluster mode: where do the jobs run, on the master or on the worker nodes?

I have installed Spark in cluster mode, with 1 master and 2 workers. When I start spark-shell on the master node, it runs continuously without ever reaching the Scala shell.
But when I run spark-shell on a worker node, I get the Scala shell and I am able to run jobs:
val file = sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
file.count()
And for this I got the output.
So my doubt is: where should the Spark jobs actually run?
Is it on the worker nodes?
Based on the documentation, you need to connect your spark-shell to the master node with the following command: spark-shell --master spark://IP:PORT. This URL can be retrieved from the master's UI or log file.
You should be able to launch the spark-shell on the master node (machine); make sure to check the UI to see whether the spark-shell is effectively running and the prompt is shown (you might need to press Enter on your keyboard after issuing spark-shell).
Please note that when you use spark-submit in cluster mode, the driver will be submitted directly from one of the worker nodes, contrary to client mode where it will run as a client process. Refer to the documentation for more details; a sketch follows.
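To make the client/cluster distinction concrete, a hedged spark-submit sketch for a standalone cluster (the class name and jar are placeholders; 7077 is the standalone master's default port):

# Driver is launched on one of the worker nodes (cluster mode)
spark-submit --master spark://192.168.1.20:7077 --deploy-mode cluster \
    --class com.example.Main my-app.jar

# Driver runs in the local client process (client mode, the default)
spark-submit --master spark://192.168.1.20:7077 --deploy-mode client \
    --class com.example.Main my-app.jar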