How to run two Spark jobs in one EMR cluster? - pyspark

I have a realtime Spark job that runs in one EMR cluster and a batch job that runs in another EMR cluster and is triggered at a specific time.
How can I run both of these jobs in a single EMR cluster?
Any suggestions?

If the steps of the two jobs are not dependent on each other, you can use the EMR feature called step concurrency to solve your use case. It simply means that you can run more than one step in parallel at a time.
This feature is available from EMR release 5.28.0 onwards; if you are using an older version you cannot use it.
When launching the cluster from the AWS console, the feature appears as 'Concurrency' in the UI, and you can choose any value between 1 and 256.
When launching the cluster from the AWS CLI or the API, the same setting is called 'StepConcurrencyLevel'.
You can read more about this in the AWS announcement of multiple parallel steps in EMR and in the AWS CLI documentation.
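For example, if you are driving EMR from Python with boto3, a minimal sketch of raising the concurrency on an existing cluster could look like this (region and cluster id are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Allow two steps to run at the same time (EMR release 5.28.0 or later).
# Any value from 1 to 256 is accepted; 1 restores the classic serial behaviour.
emr.modify_cluster(
    ClusterId="j-XXXXXXXXXXXXX",   # placeholder cluster id
    StepConcurrencyLevel=2,
)

The same setting can also be passed up front to run_job_flow(..., StepConcurrencyLevel=2, ...) when the cluster is created.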

Related

How Data Flow computing differs from Databricks

Knowing that ADF Data Flow transformations run on a Databricks cluster in the background, how different (in terms of cost and performance) would it be to run the same transformations in a Databricks notebook in the same pipeline?
I guess it will depend on how the Databricks cluster is configured, but my question is also about how this cluster runs in the background. Would it be a dedicated cluster or a shared one within the platform?
Each activity in ADF is executed by an Integration Runtime (VM). If you are synchronously monitoring a Databricks job, you will be charged for the Integration Runtime that monitors your job.
Notebook execution in Databricks is charged as a job cluster. Create an instance pool and use that pool in ADF; in Databricks you will see the history of ADF-created clusters in the pool overview.
When creating the pool, be careful with the settings, as you can be charged for idle time: the minimum idle instances could be 0 and the auto-termination time set to a low value. If you have a pipeline that executes notebooks step by step, reusing the same pool can be quicker and cheaper, because Databricks will not deploy a new machine but reuse an existing one from the pool (if it hasn't been auto-terminated already).
(Screenshot: ADF-created jobs in the pool and the min idle settings.)
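As a rough illustration of that setup, the pool could be created through the Databricks Instance Pools REST API from Python; the workspace URL, token and node type below are placeholders:

import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                 # placeholder access token

# A pool with no idle minimum and a short auto-termination window,
# so you only pay while ADF-triggered notebooks are actually running.
resp = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "instance_pool_name": "adf-notebook-pool",
        "node_type_id": "Standard_DS3_v2",                     # placeholder node type
        "min_idle_instances": 0,
        "idle_instance_autotermination_minutes": 10,
    },
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])  # reference this pool id in the ADF linked service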

How do I make Scala code run on an EMR cluster using the SDK?

I wrote Scala code to launch a cluster in EMR, and I also have a Spark application written in Scala that I want to run on that EMR cluster. Is it possible to do this from the first script (the one that launches the EMR cluster)? I want to do all of it with the SDK, not through the console or the CLI. It has to be automated, with no (or minimal) manual work.
Basically:
Launch EMR cluster -> run Spark job on EMR -> terminate after the job finishes
How do I do this, if it is possible?
Thanks.
import com.amazonaws.services.elasticmapreduce.model.*;
import java.util.ArrayList;

// 'params' holds the spark-submit arguments, 'clusterId' identifies the running cluster,
// and 'emr' is an AmazonElasticMapReduce client created elsewhere.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs(params);

// Wrap the spark-submit invocation in an EMR step.
final StepConfig sparkStep = new StepConfig()
        .withName("Spark Step")
        .withActionOnFailure("CONTINUE")
        .withHadoopJarStep(sparkStepConf);

// Submit the step to the running cluster and return its id so it can be monitored later.
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
        .withSteps(new ArrayList<StepConfig>() {{ add(sparkStep); }});
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
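If you would rather keep the whole launch-run-terminate flow in a single Python script instead of the Java SDK, a rough boto3 sketch of the same idea could look like the following; the release label, instance types, roles, main class and S3 path are all placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")    # region is an assumption

# Launch a cluster that runs one spark-submit step and shuts itself down afterwards.
response = emr.run_job_flow(
    Name="scala-spark-job",                            # placeholder name
    ReleaseLabel="emr-6.10.0",                         # placeholder release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,          # terminate once all steps finish
    },
    Steps=[{
        "Name": "Spark Step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "--class", "com.example.Main",    # placeholder main class
                     "s3://my-bucket/jars/app.jar"],   # placeholder jar location
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])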
If you are just looking for automation, you should read about pipeline orchestration:
EMR is the AWS service that allows you to run distributed applications.
AWS Data Pipeline is an orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2).
If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it with one step: running the Scala Spark jar on the master node using a ShellCommandActivity. Another benefit is that the jar you are running can be stored in AWS S3 (the object storage service); you just provide the S3 path to your Data Pipeline and it will pick up that jar, log onto the EMR cluster it has brought up (with the configuration you've provided), copy the jar onto the master node, and run it with the configuration provided in the ShellCommandActivity. Once the job exits (successfully or with an error), it terminates the EMR cluster so you aren't paying for it, and logs the output.
Please read more about it here: https://aws.amazon.com/datapipeline/ and https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
And if you'd like, you can trigger this pipeline via the AWS SDK or even set the pipeline to run on a schedule.
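For example, triggering an already-defined pipeline from Python is tiny with boto3 (the pipeline id and region are placeholders):

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")   # region is an assumption

# Activate the pipeline; it provisions the EMR cluster, runs the jar and tears everything down.
dp.activate_pipeline(pipelineId="df-XXXXXXXXXXXXXXXXXXXX")    # placeholder pipeline id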

Can a pure Python script (not pyspark) run in parallel on a cluster in Azure Databricks?

I want to migrate my Python scripts from running locally to running in the cloud, specifically on a cluster created in Azure Databricks.
Can a pure Python script run in parallel (using multiple nodes of a cluster at the same time) without having to be converted into pyspark?
Is it possible to check whether the job is running in parallel?
No
If you have submitted a Spark job, you can check the logs, which will show you the number of executors it is using. That's how you'll know the job is running in parallel.
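To see the difference, compare a plain loop, which runs only on the driver node, with the same work handed to Spark; a minimal sketch (process_file and the paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def process_file(path):            # hypothetical pure-Python function
    return len(path)

paths = ["/mnt/data/a.csv", "/mnt/data/b.csv", "/mnt/data/c.csv"]   # placeholder paths

# Pure Python: runs sequentially on the driver only, no matter how big the cluster is.
driver_results = [process_file(p) for p in paths]

# Minimal pyspark wrapping: Spark ships process_file to the executors,
# and the logs / Spark UI will show several executors doing the work.
parallel_results = sc.parallelize(paths, len(paths)).map(process_file).collect()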

How to set up the dependencies of a Python Spark job on an AWS EMR cluster

I've written a Spark program that needs to be executed on an EMR cluster, but it uses some dependent files and modules. Is there any way to set up these dependencies on a running cluster?
Can we mount an S3 bucket on the cluster nodes and put all the dependencies on S3? Is this a good idea, and how can we mount S3 buckets on EMR using Python?
(During cluster creation): You can use Amazon EMR bootstrap actions, which can execute a custom bash script at the time the cluster is created. You can install all the dependencies with this script; the bootstrap action runs on every node of the cluster.
(On a running cluster): You can use the Amazon EMR step option to create an s3-dist-cp command-runner step that copies files from S3.
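As a sketch of the second option, such a step can be added to the running cluster from Python with boto3 (cluster id and S3 paths are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")        # region is an assumption

# Add an s3-dist-cp step that copies the dependency files from S3 onto the cluster's HDFS.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                           # placeholder cluster id
    Steps=[{
        "Name": "copy python deps",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "s3://my-bucket/deps/",      # placeholder source
                     "--dest", "hdfs:///deps/"],           # placeholder destination
        },
    }],
)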

Spark fails with too many open files on HDInsight YARN cluster

I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into those machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can ssh into all nodes from the head node (the Ambari UI shows the FQDNs of all nodes).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom script action that alters the settings on the necessary nodes if you want to automate this.
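If you do want to script it yourself, a rough sketch run from the head node might look like this; the worker hostnames are hypothetical and it assumes passwordless ssh and sudo rights for sshuser:

import subprocess

# Hypothetical worker hostnames, taken from the Ambari UI.
workers = ["wn0-mycluster", "wn1-mycluster"]

# Raise the open-file limit on every worker node.
for host in workers:
    subprocess.run(
        ["ssh", host,
         "echo '* soft nofile 65535' | sudo tee -a /etc/security/limits.conf && "
         "echo '* hard nofile 65535' | sudo tee -a /etc/security/limits.conf"],
        check=True,
    )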