Dataproc PySpark streaming job failing when connecting to ResourceManager - pyspark

My PySpark streaming job on a Google Cloud Dataproc cluster fails at the initial stage: it reports that it is connecting to the ResourceManager and then fails.
The same code runs successfully on my local machine.
Error message:
INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to
ResourceManager at clustername/ Job output is complete
In this job, the checkpoint files and jars sit on the master node to enable one-time recovery from the checkpoint directory, as suggested by Dennis here.
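For reference, the usual checkpoint-recovery pattern looks roughly like the sketch below (written in Scala rather than PySpark; the checkpoint path, app name, and batch interval are placeholder assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStreamingJob {
  // Hypothetical checkpoint directory; an HDFS or GCS path works the same way.
  val checkpointDir = "/var/checkpoints/streaming-job"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-streaming-job")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // ... define input streams and transformations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuilds the context from the checkpoint if one exists,
    // otherwise creates a fresh one via createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}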

Related

EMR cluster automatically terminating after a few days

I have an AWS EMR cluster executing a Spark streaming job. It takes streaming data from a Kinesis stream and processes it. It works fine for a few days, but after 12-15 days the cluster terminates automatically. I checked the events tab, and it shows
cluster has terminated with errors with a reason of STEP_FAILURE.
Does anyone have any idea why a step failure can occur when the step ran successfully for a few days?
Go to the EMR console and check the step options. If the step is configured as follows:
Action on failure: Terminate cluster
then the cluster will be terminated when the step fails.

How do I make Scala code run on an EMR cluster using the SDK?

I wrote Scala code to launch a cluster in EMR. I also have a Spark application written in Scala, and I want to run this Spark application on the EMR cluster. Is it possible to do this from the first script (the one that launches the EMR cluster)? I want to do all of it with the SDK, not through the console or CLI. It has to be automated, not a one-off manual job (or at least with manual work minimized).
Basically:
Launch EMR Cluster -> Run Spark Job on EMR -> Terminate after job finished
How do I do it if possible?
Thanks.
// Assumes the EMR client and the spark-submit arguments are defined elsewhere, for example:
// AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
// List<String> params = Arrays.asList("spark-submit", "--class", "com.example.Main", "s3://bucket/app.jar");
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
    .withJar("command-runner.jar")    // command-runner.jar lets the step invoke spark-submit
    .withArgs(params);
final StepConfig sparkStep = new StepConfig()
    .withName("Spark Step")
    .withActionOnFailure("CONTINUE")  // keep the cluster alive if the step fails
    .withHadoopJarStep(sparkStepConf);
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
    .withSteps(new ArrayList<StepConfig>(){{ add(sparkStep); }});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
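If you also want the full launch -> run -> terminate flow from a single script, a rough Scala sketch using the same AWS SDK for Java is shown below; the release label, instance types, roles, jar path, and main class are placeholder assumptions you would replace with your own:

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model._

object LaunchEmrAndRunSpark {
  def main(args: Array[String]): Unit = {
    val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

    // spark-submit step pointing at a (hypothetical) application jar in S3.
    val sparkStepConf = new HadoopJarStepConfig()
      .withJar("command-runner.jar")
      .withArgs("spark-submit", "--deploy-mode", "cluster",
        "--class", "com.example.Main", "s3://my-bucket/my-spark-app.jar")

    val sparkStep = new StepConfig()
      .withName("Spark Step")
      .withActionOnFailure("TERMINATE_CLUSTER") // shut down even if the step fails
      .withHadoopJarStep(sparkStepConf)

    val request = new RunJobFlowRequest()
      .withName("spark-job-cluster")
      .withReleaseLabel("emr-5.20.0")
      .withApplications(new Application().withName("Spark"))
      .withServiceRole("EMR_DefaultRole")
      .withJobFlowRole("EMR_EC2_DefaultRole")
      .withSteps(sparkStep)
      .withInstances(new JobFlowInstancesConfig()
        .withInstanceCount(3)
        .withMasterInstanceType("m4.large")
        .withSlaveInstanceType("m4.large")
        .withKeepJobFlowAliveWhenNoSteps(false)) // terminate once all steps finish

    val result = emr.runJobFlow(request)
    println(s"Started cluster ${result.getJobFlowId}")
  }
}

Setting withKeepJobFlowAliveWhenNoSteps(false) is what makes the cluster shut down on its own once the step finishes.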
If you are just looking for automation, you should read about pipeline orchestration:
EMR is the AWS service that allows you to run distributed applications.
AWS Data Pipeline is an orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2).
If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it with one step: running the Scala Spark jar on the master node using a "ShellCommandActivity". Another benefit is that the jar you are running can be stored in AWS S3 (the object storage service); you just provide the S3 path to your Data Pipeline and it will pick up that jar, log onto the EMR cluster it has brought up (with the configurations you've provided), copy that jar onto the master node, run the jar with the configuration provided in the "ShellCommandActivity", and once the job exits (successfully or with an error) it will kill the EMR cluster so you aren't paying for it, and log the output.
Please read more about it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
If you'd like, you can trigger this pipeline via the AWS SDK, or even set the pipeline to run on a schedule.
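As a rough illustration of the SDK trigger, something along these lines should activate an existing pipeline (this assumes the AWS SDK for Java v1 Data Pipeline client, and the pipeline id is a placeholder):

import com.amazonaws.services.datapipeline.DataPipelineClientBuilder
import com.amazonaws.services.datapipeline.model.ActivatePipelineRequest

object TriggerPipeline {
  def main(args: Array[String]): Unit = {
    val client = DataPipelineClientBuilder.defaultClient()
    // "df-0123456789ABCDEF" is a placeholder; use your own pipeline id.
    client.activatePipeline(
      new ActivatePipelineRequest().withPipelineId("df-0123456789ABCDEF"))
  }
}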

Can't submit job with Flink 1.5 cluster

Trying to move from Flink 1.3.2 to 1.5. We have a cluster deployed with Kubernetes. Everything works fine with 1.3.2, but I cannot submit a job with 1.5. When I try to do that, I just see the spinner spin around infinitely; the same happens via the REST API. I can't even submit the WordCount example job.
It seems my taskmanagers cannot connect to the jobmanager. I can see them in the Flink UI, but in the logs I see:
level=WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with
org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException:
connection timed out:
flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123
level=WARN akka.remote.ReliableDeliverySupervisor - Association with remote system
[akka.tcp://flink@flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123]
has failed, address is now gated for [50] ms. Reason: [Association
failed with
[akka.tcp://flink@flink-jobmanager-nonprod-2.rpds.svc.cluster.local:6123]]
Caused by: [No response from remote for outbound association.
Associate timed out after [20000 ms].]
level=WARN akka.remote.transport.netty.NettyTransport - Remote
connection to [null] failed with
org.apache.flink.shaded.akka.org.jboss.netty.channel.ConnectTimeoutException:
connection timed out:
flink-jobmanager-nonprod-2.rpds.svc.cluster.local/25.0.84.226:6123
But I can telnet from the taskmanager to the jobmanager.
Moreover, everything works on my local machine if I start Flink in cluster mode (jobmanager + taskmanager).
In the 1.5 documentation I found the mode option, which switches between flip6 and legacy (the default is flip6), but if I set mode: legacy I don't see my taskmanagers registered at all.
Is there something specific to a k8s deployment with 1.5 that I need to do? I checked the 1.5 k8s config and it looks pretty much the same as ours, but we are using a customized Docker image for Flink (security, HA, checkpointing).
Thank you.
The issue is with jobmanager connectivity: the jobmanager Docker image cannot connect to the "flink-jobmanager" (${JOB_MANAGER_RPC_ADDRESS}) address.
Just use the afilichkin/flink-k8s Docker image instead of flink:latest.
I fixed it by adding a new host to the jobmanager Docker image. You can see it in my GitHub project:
https://github.com/Aleksandr-Filichkin/flink-k8s/tree/master

spark-shell on multinode Spark cluster fails to spawn executor on remote worker node

I installed a Spark cluster in standalone mode with 2 nodes: the Spark master runs on the first node and a Spark worker on the other. When I run spark-shell on the worker node with word-count code it runs fine, but when I run spark-shell on the master node it gives the following output:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The executor is not triggered to run the job. Even though there is a worker available to the Spark master, it still gives this problem. Any help is appreciated, thanks.
You are using client deploy mode, so the best bet is that the executor nodes cannot connect to the driver port on the local machine. It could be a firewall issue or a problem with the advertised IP / hostname. Please make sure that:
spark.driver.bindAddress
spark.driver.host
spark.driver.port
use the expected values. Please refer to the networking section of the Spark documentation.
Less likely, it is a lack of resources. Please check that you are not requesting more resources than the workers provide.
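For illustration, a standalone Scala driver might pin these properties roughly as below (the master URL, addresses, and port are placeholder assumptions; with spark-shell the same properties can be passed via --conf flags instead):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DriverNetworkCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wordcount")
      .setMaster("spark://master-host:7077")        // placeholder standalone master URL
      .set("spark.driver.host", "10.0.0.5")         // address the executors connect back to
      .set("spark.driver.bindAddress", "0.0.0.0")   // local interface the driver binds to
      .set("spark.driver.port", "35000")            // fixed port, easier to open in a firewall

    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext
      .parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
    spark.stop()
  }
}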

Where are the Spark logs on EMR?

I'm not able to locate error logs or messages from println calls in Scala while running jobs on Spark in EMR.
Where can I access these?
I'm submitting the Spark job, written in Scala, to EMR using script-runner.jar, with --deploy-mode set to cluster and --master set to yarn. It runs the job fine.
However, I do not see my println statements in the Amazon EMR UI where it lists stderr, stdout, etc. Furthermore, if my job errors I don't see why it had an error. All I see is this in the stderr:
15/05/27 20:24:44 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1432754139536_0002
appId: 2
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-10-185-87-217.ec2.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1432758272973
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://10.150.67.62:9046/proxy/application_1432754139536_0002/A
appUser: hadoop
With the cluster deploy mode on YARN, the Spark driver, and hence the user code executed, runs within the Application Master container. It sounds like you had EMR debugging enabled on the cluster, so logs should also have been pushed to S3. In the S3 location, look at task-attempts/<applicationid>/<firstcontainer>/*.
If you SSH into the master node of your cluster then you should be able to find the stdout, stderr, syslog and controller logs under:
/mnt/var/log/hadoop/steps/<stepname>
I also spent a lot of time figuring this out. I found the logs in the following location:
EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.
The event logs, the ones required for the spark-history-server, can be found at:
hdfs:///var/log/spark/apps
If you submit your job with emr-bootstrap, you can specify the log directory as an S3 bucket with --log-uri.