I am using pyspark to run some tasks on a cluster.
I want to see the status of the tasks.
I think the UI should be started by default, as mentioned here.
But I am unable to reach the UI (at http://localhost:4040 or similar).
Related
I want to get the cluster link (or the cluster ID to manually compose the link) inside a running Spark job.
This will be used to print the link in an alerting message, making it easier for engineers to reach the logs.
Is it possible to achieve that in a Spark job running in Databricks?
When a Databricks cluster starts, a number of Spark configuration properties are added. Most of them have names starting with spark.databricks. - you can find all of them in the Environment tab of the Spark UI.
The cluster ID is available as the spark.databricks.clusterUsageTags.clusterId property, and you can get it as:
spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
You can get the workspace host name via the dbutils.notebook.getContext().apiUrl.get call (for Scala), or dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() (for Python).
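Putting the two together, a minimal Python sketch that composes a cluster link for an alert message (spark and dbutils are predefined in a Databricks notebook; the #setting/clusters/<cluster-id>/configuration URL layout is an assumption based on the usual Databricks cluster page format - verify it against your workspace):

cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
workspace_url = (
    dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get()
)
# Assumed cluster-page URL layout; adjust if your workspace differs.
cluster_link = f"{workspace_url}/#setting/clusters/{cluster_id}/configuration"
print(f"Alert: job failed, cluster logs at {cluster_link}")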
I have a reporting application that uses Celery to process thousands of jobs per day. There is a Python module for each report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. That means if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And for scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible since Airflow supports dynamically created DAGs; however, I am not sure whether this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows can do essentially the same thing, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler (I have never seen it fail in 5 years of using Celery) - Celery Beat.
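For illustration, a minimal Celery Beat sketch that fans out a daily refresh per customer (the broker URL, refresh_report, and get_customer_ids are hypothetical placeholders for your broker and report modules):

from celery import Celery
from celery.schedules import crontab

app = Celery("reports", broker="redis://localhost:6379/0")  # broker is an assumption

@app.task
def refresh_report(customer_id):
    ...  # run the customer-specific report steps here

@app.task(name="reports.refresh_all")
def refresh_all():
    # get_customer_ids is a hypothetical helper that enumerates customers
    for customer_id in get_customer_ids():
        refresh_report.delay(customer_id)

app.conf.beat_schedule = {
    "refresh-reports-daily": {
        "task": "reports.refresh_all",
        "schedule": crontab(hour=2, minute=0),  # every day at 02:00
    },
}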
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, though I am not sure whether this will work at a scale of 1000s of DAGs. There are some good examples on astronomer.io on Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration with different schedules and configurations, and it all works without any issues.
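As an illustration of the pattern, a minimal sketch that generates one DAG per customer (Airflow 2-style imports; the customer list and run_report callable are hypothetical placeholders - in practice they would come from your YAML or a database):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

CUSTOMERS = ["acme", "globex"]  # hypothetical; load from YAML or a DB in practice

def run_report(customer_id):
    print(f"refreshing report for {customer_id}")

for customer_id in CUSTOMERS:
    dag = DAG(
        dag_id=f"report_refresh_{customer_id}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )
    PythonOperator(
        task_id="refresh",
        python_callable=run_report,
        op_kwargs={"customer_id": customer_id},
        dag=dag,
    )
    # Airflow discovers DAGs by scanning module-level globals
    globals()[dag.dag_id] = dag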
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part - you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
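For example, a minimal sketch of triggering a run through the experimental endpoint (the host and DAG id are assumptions; Airflow 2.0 later replaced this with the stable /api/v1 API):

import requests

AIRFLOW_HOST = "http://localhost:8080"  # assumption

resp = requests.post(
    f"{AIRFLOW_HOST}/api/experimental/dags/report_refresh_acme/dag_runs",
    json={"conf": {"customer_id": "acme"}},
)
resp.raise_for_status()
print(resp.json())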
I have a simple Flink job that reads from an ActiveMQ source and writes to a database sink and a print sink. I deploy the job on Kubernetes with 2 TaskManagers, each having 10 task slots (taskmanager.numberOfTaskSlots: 10). I configured a parallelism no higher than the task slots of a single TaskManager (i.e., 10 in this case).
When I look at the Flink Dashboard, I see this job runs in only one of the TaskManagers, while the other TaskManager has no jobs. I verified this by checking where every operator is scheduled; also, on the Task Managers page of the UI, one of the managers has all slots free. I attach images below for reference.
Did I configure anything wrong? Where is the gap in my understanding? Can someone explain?
The first task manager has enough slots (10) to fully satisfy the requirements of your job.
The scheduler's default behavior is to fully utilize one task manager's slots before using slots from another task manager. If instead you would prefer that Flink spread out the workload across all available task managers, set cluster.evenly-spread-out-slots: true in flink-conf.yaml. (This option was added in Flink 1.10 to recreate a scheduling behavior similar to what was the default before Flink 1.5.)
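For reference, a minimal flink-conf.yaml sketch combining the slot count from the question with the spread-out option:

taskmanager.numberOfTaskSlots: 10
# spread tasks evenly across all registered TaskManagers (Flink 1.10+)
cluster.evenly-spread-out-slots: true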
I have an Apache Spark service instance on IBM Cloud (light plan). After I submit a Spark job, I want to see its progress; it would be perfect to see it the Spark way - the Spark progress UI with the number of partitions and everything. I would also like a connection to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id>, but I would like to get something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but I fail to understand where exactly to find this console/history page.
As a side note, is there any detailed technical documentation of this service, e.g. the number of concurrent jobs that can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Bottom line: open http://{driver-node}:4040 (replace driver-node with the node where the Spark job was invoked) and you should be good to go.
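If you are unsure which host or port the driver actually bound to, a minimal PySpark sketch can print it from inside the running application (uiWebUrl is a SparkContext property since Spark 2.1):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ui-url-demo").getOrCreate()
# uiWebUrl reports the actual host:port the driver UI bound to,
# accounting for port fallback (4041, 4042, ...) on busy hosts
print(spark.sparkContext.uiWebUrl)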
I have a setup with Spark running on YARN, and my goal is to programmatically get updates on the progress of a Spark job by its application id.
My first idea was to parse the HTML output of the YARN GUI. The problem with that GUI, however, is that the progress bar associated with a Spark job doesn't get updated regularly and mostly doesn't change at all: when the job starts, the percentage is something like 10%, and it stays stuck at that value until the job finishes. So the YARN progress bar is simply irrelevant for Spark jobs.
When I click the Application Master link corresponding to a Spark job, I'm redirected to the Spark GUI that is temporarily bound during the job run. The stages page is very relevant to the progress of the Spark job, but it is plain HTML, so it is a pain to parse. The Spark documentation mentions a JSON API, but it seems I can't access it because I'm on YARN and reach the Spark GUI through YARN proxy pages.
Maybe a solution, in order to have access to more things, would be to reach the real Spark GUI ip:port rather than the YARN-proxied one, but I don't know if I can get that source URL easily...
All of that sounds complicated just to get Spark job progress... As of 2018, what are the preferred methods to get relevant stage progress of a Spark job running on YARN?
From within the application itself, you can get information on stage progress by using spark.sparkContext.statusTracker. You can look at how, e.g., the Zeppelin notebook implemented a progress bar for Spark 2.3: https://github.com/apache/zeppelin/blob/master/spark/spark-scala-parent/src/main/scala/org/apache/zeppelin/spark/JobProgressUtil.scala
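In PySpark, a minimal sketch of the same idea looks like this (getActiveStageIds and getStageInfo are part of the public StatusTracker API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tracker = spark.sparkContext.statusTracker()

# poll the tracker from a monitoring thread while jobs run
for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info is not None:
        print(f"stage {stage_id}: {info.numCompletedTasks}/{info.numTasks} tasks complete")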
You can retrieve the YARN application state and other details for your submitted Spark-on-YARN job via the REST API.
Refer to the links below:
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html#Example_usage
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_API
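For instance, a minimal Python sketch against the ResourceManager's cluster application API (the host/port and application id are assumptions; add authentication if your cluster is secured):

import requests

RM = "http://resourcemanager-host:8088"  # assumption
app_id = "application_1518000000000_0001"  # your application id

resp = requests.get(f"{RM}/ws/v1/cluster/apps/{app_id}")
resp.raise_for_status()
app = resp.json()["app"]
# state, finalStatus, and progress are fields of the app resource
print(app["state"], app["finalStatus"], app.get("progress"))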
There is no way of knowing the progress as a percentage, since you can have any number of Spark stages. However, there is a REST API for the Spark History Server - Monitoring and Instrumentation - with which you can ask for stage/task/job info. Assuming your app has a predefined number of stages, you can calculate the progress.
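A minimal sketch of that calculation against the monitoring REST API (the History Server host is an assumption; the same /api/v1 paths are also served by a live application UI on port 4040):

import requests

BASE = "http://history-server-host:18080/api/v1"  # assumption
app_id = "application_1518000000000_0001"

stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()
complete = sum(1 for s in stages if s["status"] == "COMPLETE")
print(f"{complete}/{len(stages)} stages complete")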