Monitor widget missing from Jupyter - pyspark

What code, configuration, or steps are needed to restore the monitoring widget in an EMR Jupyter Notebook?
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-spark-monitor.html
Found this:
https://forums.aws.amazon.com/thread.jspa?threadID=308132
(It is dated August 15, 2019)
sc
Starting Spark application
ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session?
36 | application_blahblahblahsomenumber | pyspark | idle | Link | Link | ✔
SparkSession available as 'spark'.
But running a normal program doesn't show the expected monitoring output,
like the number of partitions, seconds elapsed, etc.
It just runs silently, with no clue other than the asterisk next to the cell:
In [*]
What gives?
See the graphic under "The following is an example of the Spark job monitoring widget." on this page:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-spark-monitor.html

This issue has been fixed in the latest EMR Notebooks update.

Related

How to track the current execution of my applications in Apache Spark

I have an Apache Spark service instance on IBM Cloud (light plan). After I submit a Spark job I want to see its progress, and it would be perfect to see it the Spark way: the Spark progress UI with the number of partitions and everything. I would also like to get a connection to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id> but I would like to get something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but I fail to understand where exactly to find this console/history view.
As a side note, is there any detailed technical documentation for this service, e.g. the number of concurrent jobs that can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information.
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Bottom line: open http://{driver-node}:4040 (replace driver-node with the node where the Spark job was invoked) and you should be good to go.
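If you want to confirm exactly which host and port the driver UI bound to, you can also ask the running context itself. A minimal PySpark sketch, assuming a notebook-style session where the SparkContext already exists as sc:

    # Ask the live SparkContext where its web UI actually is. uiWebUrl reflects
    # the bound address, so it accounts for fallback to 4041, 4042, ... when 4040 is taken.
    print(sc.uiWebUrl)                                  # e.g. http://driver-host:4040
    print(sc.getConf().get("spark.ui.port", "4040"))    # configured (not necessarily bound) port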

Spark 2.3.1 on YARN: how to monitor stage progress programmatically?

I have a setup with Spark running on YARN, and my goal is to programmatically get updates of the progress of a Spark job by its application id.
My first idea was to parse the HTML output of the YARN GUI. The problem with that GUI, however, is that the progress bar associated with a Spark job doesn't get updated regularly and mostly doesn't change at all: when the job starts, the percentage is something like 10%, and it stays at that value until the job finishes. So the YARN progress bar is simply irrelevant for Spark jobs.
When I click the Application Master link corresponding to a Spark job, I'm redirected to the Spark GUI that is temporarily bound while the job runs. The stages page is very relevant to the progress of the Spark job, but it is plain HTML, so it is a pain to parse. The Spark documentation mentions a JSON API, but it seems I can't access it because I'm on YARN and reach the Spark GUI through the YARN proxy pages.
Maybe a solution, in order to have access to more, would be to reach the real Spark GUI at its own ip:port rather than the YARN-proxied one, but I don't know if I can get that source URL easily...
All of that sounds complicated just to get Spark job progress... As of 2018, what are the preferred methods to get relevant stage progress for a Spark job running on YARN?
From within the application itself, you can get information on stage progress by using spark.sparkContext.statusTracker. You can look at how, for example, Zeppelin Notebook implemented a progress bar for Spark 2.3: https://github.com/apache/zeppelin/blob/master/spark/spark-scala-parent/src/main/scala/org/apache/zeppelin/spark/JobProgressUtil.scala
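The Zeppelin code above is Scala, but the same tracker is exposed to PySpark as sc.statusTracker(). A rough sketch of polling it, assuming an existing SparkContext sc; the sample job, thread setup, and one-second polling interval are just placeholders:

    import threading
    import time

    # Run some work in a background thread so the main thread is free to poll progress.
    job = threading.Thread(
        target=lambda: sc.parallelize(range(5000000), 100).map(lambda x: x * x).count()
    )
    job.start()

    tracker = sc.statusTracker()
    while job.is_alive():
        for sid in tracker.getActiveStageIds():
            info = tracker.getStageInfo(sid)
            if info and info.numTasks:
                print("stage %d: %d/%d tasks done" % (sid, info.numCompletedTasks, info.numTasks))
        time.sleep(1)
    job.join()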
You can retrieve the YARN application state and other details for your submitted Spark-on-YARN job via the REST API.
Refer to the below links:
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html#Example_usage
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_API
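For illustration, a hedged sketch of calling the ResourceManager REST endpoint from Python; the host, port, and application id below are placeholders to replace with your own values:

    import requests

    RM = "http://resourcemanager-host:8088"            # assumption: default RM web/REST port
    app_id = "application_1234567890123_0001"          # placeholder application id

    # /ws/v1/cluster/apps/{appid} returns JSON with state, finalStatus, progress, etc.
    app = requests.get("%s/ws/v1/cluster/apps/%s" % (RM, app_id)).json()["app"]
    print(app["state"], app["finalStatus"], app.get("progress"))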
There is no way of knowing the progress as a percentage, since you can have any number of Spark stages. However, there is a REST API for the Spark History Server (Monitoring and Instrumentation) with which you can ask for stage/task/job info. Assuming your app has a predefined number of stages, you can calculate the progress.
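A hedged sketch of that stage-counting idea against Spark's REST API; the base URL and application id are placeholders (point it at the driver UI, port 4040, for a running application, or at the History Server, typically port 18080, for a finished one):

    import requests

    BASE = "http://spark-driver-or-history-host:4040/api/v1"   # placeholder base URL
    app_id = "app-20180101120000-0000"                          # placeholder application id

    # Each stage object carries a status such as ACTIVE, PENDING, COMPLETE, FAILED, SKIPPED.
    stages = requests.get("%s/applications/%s/stages" % (BASE, app_id)).json()
    done = sum(1 for s in stages if s["status"] == "COMPLETE")
    total = max(len(stages), 1)
    print("%d/%d stages complete (%.0f%%)" % (done, total, 100.0 * done / total))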

How can I monitor the tasks started with pyspark

I am using pyspark to run some tasks on a cluster.
I want to see the status of the tasks.
I think that the UI should be started by default,
as mentioned here,
but I am unable to reach it (http://localhost:4040 or so).

Failed to run Zeppelin notebook demo Spark Streaming - Hortonworks sandbox

I tried to run the demo notebook available in Zeppelin on the Hortonworks sandbox 2.4 (a notebook named Twitter) to learn Spark Streaming. Following the instructions at the top of the notebook (/* BEFORE START....), I logged on to Ambari to modify the configuration of the YARN service.
CPU => Container: Minimum Container Size (VCores): 4; Maximum Container Size (VCores): 8
Memory => Node: 2250 MB; Container: Minimum Container Size: 768 MB; Maximum Container Size: 2250 MB
All services were restarted after the modification, but when I came back to Zeppelin to run the notebook, the second paragraph (/* UPDATE YOUR TWITTER CREDENTIALS */....) always stayed "running" and never reached "finished". All Twitter credentials were already updated.
P/S: without modifying the YARN configuration, I could run the second paragraph, but when running the third, it was always "running" but never "finished".
Thanks for any suggestions
If the error is
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register
change the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to 99 and the value of the YARN maximum-am-resource-percent setting to 100.

Web UI (http://localhost:8088) is not showing Spark applications

I have a pseudo-distributed Hadoop 2.2.0 environment set up on my laptop. I can run MapReduce applications (including Pig and Hive jobs), and the status of the applications can be seen from the Web UI http://localhost:8088.
I have downloaded the Spark library and just used the file system (HDFS) for the Spark applications. When I launch a Spark application, it gets launched and the execution also completes successfully as expected.
But the Web UI http://localhost:8088 does not list the Spark applications, launched or completed.
Please suggest whether any additional configuration is required to see Spark applications in this Web UI.
(Note: the Web UI at http://localhost:50070 shows the files correctly when writing files to HDFS via Spark applications.)
You might have figured it out already, but for others who are starting with Spark: you can see all the Spark jobs on
http://localhost:4040
after your Spark context is initiated (the port can be different, e.g. 4041). With a standalone installation you can see the master and slave status on
http://localhost:8080
(for slaves the port is usually 8081 onward). You need to spark-submit jobs to yarn-cluster or yarn-client mode to see them in the Hadoop web services.
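As a hedged illustration of that last point for newer PySpark versions: the application only shows up in the YARN UI on port 8088 if Spark is actually told to run on YARN. The sketch below assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at your pseudo-distributed cluster's configuration; the app name is a placeholder:

    from pyspark.sql import SparkSession

    # Run against YARN instead of local[*] so the application registers with the
    # ResourceManager and becomes visible at http://localhost:8088.
    spark = (SparkSession.builder
             .master("yarn")
             .appName("yarn-visibility-check")   # placeholder name
             .getOrCreate())

    spark.range(1000000).count()                 # any small job will do
    spark.stop()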