Dataflow job with ProfileOptions profile_cpu=True not writing profile files - apache-beam

I am trying to profile the CPU usage of a Dataflow pipeline job, run on the Apache Beam Python 3.7 SDK 2.27.0. I triggered the job with the --profile_cpu and --profile_location args set, and can see that they are set in the Dataflow console:
Dataflow Pipeline Options showing that profile_cpu and profile_location are set.
However, after the job completed there were no files written to the profile_location GCS bucket.
When looking at the Dataflow logs with jsonPayload.logger:"apache_beam.utils.profiler:profiler.py" I can see the logs that say "Start profiling" and "Stop profiling":
Logs showing the "Start profiling" and "Stop profiling" messages from the Profiler.
but there are no logs corresponding to the "Copying profiler data to:" step, even though profile_location is set in the ProfilingOptions and should therefore be passed to the Profiler. Any advice on what could be going wrong, or knowledge of whether this functionality is currently supported, would be very helpful.

This was resolved by using the --experiments=use_runner_v2 flag. It looks like this is only supported on Dataflow Runner v2, which has not yet been rolled out as the default runner.
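For reference, here is a minimal sketch of how the relevant flags can be passed through PipelineOptions in Python; the project, region, and bucket names are placeholders, and the final experiments flag is the one that made the profiles appear:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/region/bucket values; substitute your own.
    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-gcp-project',
        '--region=us-central1',
        '--temp_location=gs://my-bucket/temp',
        '--profile_cpu',                               # ProfilingOptions: enable CPU profiling
        '--profile_location=gs://my-bucket/profiles',  # ProfilingOptions: where profile files are copied
        '--experiments=use_runner_v2',                 # profiles were only written once Runner v2 was enabled
    ])

    # Trivial stand-in pipeline; any pipeline body works the same way.
    with beam.Pipeline(options=options) as p:
        (p | beam.Create(['a', 'b', 'c'])
           | beam.Map(str.upper))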

Related

Azure Databricks error - The output of the notebook is too large. Cause: rpc response

Error message: the job failed with the error "The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes"
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster, and it is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job keeps failing with the same error.
We have followed all the steps mentioned in this document: https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Is there any option to resolve this issue, or to find out exactly which command's output is pushing it above the 20 MB limit?
See the docs regarding structured streaming in prod.
I would recommend migrating to workflows based on jar jobs because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.
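For what it's worth, the same structure can be sketched in Python as a plain script task rather than a notebook, so the driver never returns notebook output at all; the rate source, noop sink, and file names below are stand-ins, not your actual job:

    # streaming_main.py - intended to run as a non-notebook task (e.g. a spark_python_task),
    # so nothing is rendered as notebook output that could count against the 20 MB limit.
    from pyspark.sql import SparkSession

    def main():
        spark = SparkSession.builder.appName("streaming-job").getOrCreate()

        # Stand-in source; replace with Kafka, Auto Loader, Delta, etc.
        stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

        query = (stream.writeStream
                 .format("noop")  # stand-in sink (Spark 3.x); writes nothing back to the driver
                 .option("checkpointLocation", "/tmp/checkpoints/streaming-job")
                 .start())

        # Block on the query; no display()/show()/print() anywhere in the driver.
        query.awaitTermination()

    if __name__ == "__main__":
        main()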

How to track the current execution of my applications in Apache Spark

I have an Apache Spark service instance on IBM Cloud (light plan). After I submit a Spark job I want to see its progress; it would be perfect to see it the Spark way, i.e. to get the Spark progress UI with the number of partitions and everything. I would also like to get a connection to the history server.
I saw that I can run ./spark-submit.sh ... --status <app id> but I would like to get something more informative.
I saw the comment
You can track the current execution of your running application and see the details of previously run jobs on the Spark job history UI by clicking Job History on the Analytics for Apache Spark service console.
here, but I fail to understand where exactly to find this console / job history page.
As a side note, is there any detailed technical documentation of this service, e.g. the number of concurrent jobs that can run, the technology stack, etc.?
As per the Spark documentation:
Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
A list of scheduler stages and tasks
A summary of RDD sizes and memory usage
Environmental information
Information about the running executors
You can access this interface by simply opening http://{driver-node}:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
Bottom line: open http://{driver-node}:4040 (replace driver-node with the node where the Spark job was invoked) and you should be good to go.
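If you prefer to poll it programmatically rather than in a browser, the same driver UI also serves a JSON REST API under /api/v1 (a hedged sketch; driver-node is a placeholder and port 4040 must be reachable from wherever this runs):

    import requests

    # Placeholder host; the web UI on port 4040 also exposes a REST API.
    base = "http://driver-node:4040/api/v1"

    for app in requests.get(f"{base}/applications", timeout=10).json():
        app_id = app["id"]
        for job in requests.get(f"{base}/applications/{app_id}/jobs", timeout=10).json():
            print(app_id, job["jobId"], job["status"],
                  f'{job["numCompletedTasks"]}/{job["numTasks"]} tasks')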

Spark 2.3.1 on YARN: how to monitor stage progress programmatically?

I have a setup with Spark running on YARN, and my goal is to programmatically get updates of the progress of a Spark job by its application id.
My first idea was to parse the HTML output of the YARN GUI. However, the problem with that GUI is that the progress bar associated with a Spark job doesn't get updated regularly and often doesn't change at all: when the job starts, the percentage is something like 10%, and it stays stuck at that value until the job finishes. So the YARN progress bar is simply irrelevant for Spark jobs.
When I click the Application Master link corresponding to a Spark job, I'm redirected to the Spark GUI that is temporarily bound during the job run. The stages page is very informative about the progress of the Spark job, but it is plain HTML, so it is a pain to parse. The Spark documentation mentions a JSON API, but it seems I can't access it, since I'm on YARN and reach the Spark GUI through the YARN proxy pages.
Maybe a solution, in order to have access to more, would be to reach the real Spark GUI ip:port instead of the YARN-proxied one, but I don't know whether I can obtain that source URL easily...
All of that sounds complicated just to get Spark job progress... As of 2018, what are the preferred methods to get meaningful stage progress for a Spark job running on YARN?
From within the application itself, you can get information on stage progress by using spark.sparkContext.statusTracker; you can look at how, for example, Zeppelin Notebook implemented a progress bar for Spark 2.3: https://github.com/apache/zeppelin/blob/master/spark/spark-scala-parent/src/main/scala/org/apache/zeppelin/spark/JobProgressUtil.scala
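In PySpark the equivalent looks roughly like this (the polling is illustrative; you would normally run it from a separate thread while the job is executing):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("progress-demo").getOrCreate()
    tracker = spark.sparkContext.statusTracker()

    # Inspect whatever jobs are active at the moment this runs.
    for job_id in tracker.getActiveJobsIds():
        job = tracker.getJobInfo(job_id)
        if job is None:
            continue
        for stage_id in job.stageIds:
            stage = tracker.getStageInfo(stage_id)
            if stage is not None:
                print(f"stage {stage_id}: {stage.numCompletedTasks}/{stage.numTasks} tasks done, "
                      f"{stage.numActiveTasks} active, {stage.numFailedTasks} failed")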
You can retrieve the YARN application state and other details for your submitted Spark-on-YARN job via the YARN REST API.
Refer to the below links:
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html#Example_usage
https://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_API
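For example (the ResourceManager address and application id are placeholders; YARN exposes a coarse per-application progress field):

    import requests

    # Placeholders: ResourceManager web address and the YARN application id.
    rm = "http://resourcemanager-host:8088"
    app_id = "application_1234567890123_0001"

    app = requests.get(f"{rm}/ws/v1/cluster/apps/{app_id}", timeout=10).json()["app"]
    print(app["state"], app["finalStatus"], f'{app["progress"]}%')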
There is no way of knowing the progress as a percentage, since a job can have any number of Spark stages. However, there is a REST API for the Spark History Server (Monitoring and Instrumentation) with which you can ask for stage/task/job info. Assuming your app has a predefined number of stages, you can calculate the progress.
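A rough sketch of that calculation against the History Server REST API (host, port, and application id are placeholders):

    import requests

    # Placeholders: Spark History Server address and the application to inspect.
    history = "http://history-server-host:18080/api/v1"
    app_id = "application_1234567890123_0001"

    stages = requests.get(f"{history}/applications/{app_id}/stages", timeout=10).json()
    completed = sum(1 for s in stages if s["status"] == "COMPLETE")
    # Only meaningful as a percentage if you know the total number of stages up front.
    print(f"{completed}/{len(stages)} stages complete")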

How to monitor a job from another job in Talend Open Studio 5.3.1

Hi, I am a beginner with Talend Open Studio 5.3.1.
I am currently facing an issue in a project: I need to schedule a job to run every 10 seconds that monitors another job and outputs that job's status, i.e. whether it is running or idle.
Is this possible with Talend Open Studio 5.3.1?
Please explain how to schedule a job every 10 seconds and output the status of another job.
Can anyone suggest how to solve this?
We should think a bit outside the box here. I'd solve this by using project-level logging: https://help.talend.com/display/TalendOpenStudioforBigDataUserGuide520EN/2.6+Customizing+project+settings
You'll have the job statuses stored in a database table; you just have to check whether the last execution of the job is still running or not (self-join the stats table, as sketched below).
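As a rough sketch of that check, assuming the default Stats & Logs layout where each run writes a 'begin' row and, once finished, a matching 'end' row sharing the same pid; the table and column names may differ in your project settings, and sqlite3 is just a stand-in for your actual log database driver:

    import sqlite3  # stand-in; use the driver for whatever database your Stats & Logs write to

    # A 'begin' row with no matching 'end' row means that run is still in progress.
    QUERY = """
    SELECT b.job, b.pid, b.moment AS started_at
    FROM stats b
    LEFT JOIN stats e
      ON e.pid = b.pid AND e.message_type = 'end'
    WHERE b.message_type = 'begin'
      AND b.job = ?
      AND e.pid IS NULL
    ORDER BY b.moment DESC
    LIMIT 1
    """

    def is_job_running(conn, job_name):
        return conn.execute(QUERY, (job_name,)).fetchone() is not None

    conn = sqlite3.connect("talend_logs.db")  # placeholder connection
    print("running" if is_job_running(conn, "my_monitored_job") else "idle")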
Monitoring jobs is not supported in Talend Open Studio, but there are some workarounds:
Use a master job that launches the job to be monitored via the tRunJob component; the master job will then have an idea of what is going on.
Use empty files to synchronize your jobs: the monitored jobs create empty marker files with distinctive names, and the master job checks for them to determine the other jobs' states.
A much easier option is to use Quartz.

How to turn on console logging for TwitterSource

Based on the Flink Scala quickstart, I've created a sample job that uses org.apache.flink.streaming.connectors.twitter.TwitterSource.
I'm using a local streaming job manager (start-local-streaming.sh), and I start the job with ~/flink-0.10.0/bin/flink run target/quickstart-0.1.jar
In the TwitterSource source code, I notice that there are logging statements at the INFO level. How can I get those logged to the console?
The Flink daemons are started in the background; that's why Flink logs to files in the log/ directory.
To monitor these files, use tail -f log/*. This will print the log entries on the console.
Another approach would be to change Flink's start scripts to run it in the foreground and to change the conf/log4j.properties file to use a ConsoleAppender, for example:
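A minimal sketch of such a conf/log4j.properties (log4j 1.x, which Flink 0.10 ships with) that routes everything to the console:

    log4j.rootLogger=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n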