Can a pure Python script (not PySpark) run in parallel on a cluster in Azure Databricks?

I want to migrate my Python scripts from my local machine to the cloud, specifically to a cluster created on Azure Databricks.
Can a pure Python script run in parallel (using multiple nodes in a cluster at the same time) without having to be converted to PySpark?
Is it possible to check whether the job is running in parallel?

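No. A pure Python script runs only on the driver node; it is not distributed across the worker nodes of the cluster.
If you have submitted a Spark job, you can check the logs, which show the number of executors it is using. That's how you'll know the job is running in parallel.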
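As a rough illustration of what converting a simple script to PySpark can look like, the sketch below hands a plain Python function to Spark so that it runs on the executors. It assumes an existing SparkContext is passed in as `sc` (Databricks notebooks predefine one); `process_record`, `run_distributed`, and the default of 8 partitions are hypothetical names and values chosen for illustration only.

```python
def process_record(x):
    """Plain Python logic; Spark calls this once per element on the executors."""
    return x * x


def run_distributed(sc, records, num_slices=8):
    """Distribute `records` over the cluster and apply process_record to each.

    `sc` is assumed to be an existing pyspark SparkContext.
    Returns the results and the number of partitions the RDD was split into.
    """
    # Split the input into partitions that Spark can schedule on different executors.
    rdd = sc.parallelize(records, num_slices)
    # map() runs on the executors; collect() brings the results back to the driver.
    results = rdd.map(process_record).collect()
    return results, rdd.getNumPartitions()
```

In a notebook this could be called as `results, parts = run_distributed(sc, range(1000))`. The partition count is an upper bound on how many tasks can run at once; the executors actually used are listed in the Spark UI and logs.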

Related

Is it possible in Airflow to run a single task on multiple worker nodes, i.e. to run a task in a distributed way?

I am using Spring Batch to create a workflow of batch jobs. A single batch job takes 2 hours to complete (about 1 million records to process), so I decided to run it in a distributed way, where one task is distributed across multiple worker nodes so that it finishes sooner. The other jobs in the workflow (all of which also run in a distributed manner) need to run sequentially, one after another. The jobs are multi-node distributed jobs (master/slave architecture).
Now I am considering deploying the workflow on Airflow. While exploring it, I could not find any way to run a single task that is distributed across multiple machines. Is it possible in Airflow?
Yes, you can create the task using the Spark framework. Spark allows you to process the data on multiple nodes in a distributed fashion.
You can then use SparkSubmitOperator to include the task in your DAG.
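A minimal sketch of such a DAG is shown here, assuming Airflow with the Apache Spark provider package installed and a Spark connection named `spark_default`. The DAG id, task ids, and jar paths are hypothetical placeholders, and depending on how the jar is built, the operator's `java_class` argument may also be needed.

```python
from datetime import datetime


def build_dag():
    """Build a DAG whose Spark jobs run one after another.

    Airflow is imported inside the function so this file can be read without
    Airflow installed; apache-airflow and its Apache Spark provider are
    assumed to be available wherever the DAG is actually parsed.
    """
    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import (
        SparkSubmitOperator,
    )

    with DAG(
        dag_id="distributed_batch_workflow",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        first_job = SparkSubmitOperator(
            task_id="first_job",
            application="/path/to/first_job.jar",  # hypothetical path
            conn_id="spark_default",
        )
        second_job = SparkSubmitOperator(
            task_id="second_job",
            application="/path/to/second_job.jar",  # hypothetical path
            conn_id="spark_default",
        )
        # Each task is distributed by Spark; Airflow only enforces the order.
        first_job >> second_job
    return dag
```

In a real DAG file the result would be assigned at module level, for example `dag = build_dag()`, so that the scheduler can discover it. Airflow itself still runs each task on one worker; the distribution across machines happens inside the Spark cluster that the operator submits to.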

What are the best tools to schedule Snowflake tasks or Python scripts on EC2 to load data into Snowflake?

Please share your experiences with orchestrating jobs that run through various tools and programmatic interfaces to load data into Snowflake:
Python scripts on EC2 instances, currently scheduled using crontab
Tasks in Snowflake
Alteryx workflows
Are there any tools with a sophisticated UI for creating job workflows with dependencies?
The workflow can have:
a Python script followed by a task
an Alteryx workflow followed by a Python script and then a task
If any job fails, it should send emails to the team.
Thanks
We have used both CONTROL-M and Apache Airflow to schedule and orchestrate data loads into Snowflake.

How to run two Spark jobs in one EMR cluster?

I have a real-time Spark job that runs in one EMR cluster and a batch job that runs in another EMR cluster and is triggered at a specific time.
How can I run both of these jobs in one EMR cluster?
Any suggestions?
If the steps in the two EMR clusters do not depend on each other, you can use the EMR feature called Concurrency to solve your use case. It lets you run more than one step in parallel at a time.
This feature is available from EMR version 5.28.0. If you are using an older version, you cannot use it.
When launching the EMR cluster from the AWS console, this feature is labelled 'Concurrency' in the UI. You can choose any number from 1 to 256.
If you are launching the EMR cluster from the AWS CLI, the option is '--step-concurrency-level'; in the API and SDKs the parameter is called 'StepConcurrencyLevel'.
You can read more about this in the EMR documentation on running multiple steps in parallel and in the AWS CLI details.
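For a cluster that is already running, the setting can also be changed programmatically. The sketch below uses the boto3 EMR client's `modify_cluster` call with the `StepConcurrencyLevel` parameter. It assumes `emr_client` was created with `boto3.client("emr")`; the helper name `set_step_concurrency` and the range check are illustrative additions, not part of the AWS API.

```python
def set_step_concurrency(emr_client, cluster_id, level):
    """Allow up to `level` steps to run in parallel on a running EMR cluster.

    `emr_client` is assumed to be a boto3 EMR client, e.g. boto3.client("emr").
    """
    # The answer above gives 1 to 256 as the allowed range; fail early otherwise.
    if not 1 <= level <= 256:
        raise ValueError("StepConcurrencyLevel must be between 1 and 256")
    return emr_client.modify_cluster(
        ClusterId=cluster_id,
        StepConcurrencyLevel=level,
    )
```

With a level of 2 or more, the real-time job and the batch job can be submitted as two separate steps on the same cluster and run side by side, provided the cluster has enough resources for both.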

How do I run Scala code on an EMR cluster using the SDK?

I wrote Scala code that launches an EMR cluster. I also have a Spark application written in Scala, and I want to run it on that EMR cluster. Is it possible to do this from the first script (the one that launches the EMR cluster)? I want to do all of it with the SDK, not through the console or CLI. It has to be automated, not a one-off manual job (or at least with minimal manual work).
Basically:
Launch EMR cluster -> Run Spark job on EMR -> Terminate after the job has finished
How do I do this, if it is possible?
Thanks.
// Assumes `emr` (an AmazonElasticMapReduce client), `clusterId`, and `params`
// (the arguments for command-runner.jar, e.g. a spark-submit invocation)
// are defined elsewhere.
HadoopJarStepConfig sparkStepConf = new HadoopJarStepConfig()
    .withJar("command-runner.jar")
    .withArgs(params);
StepConfig sparkStep = new StepConfig()
    .withName("Spark Step")
    .withActionOnFailure("CONTINUE")
    .withHadoopJarStep(sparkStepConf);
// Submit the step to the running cluster and return the new step's ID.
AddJobFlowStepsRequest request = new AddJobFlowStepsRequest(clusterId)
    .withSteps(sparkStep);
AddJobFlowStepsResult result = emr.addJobFlowSteps(request);
return result.getStepIds().get(0);
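The launch -> run -> terminate flow from the question can also be expressed as a single cluster-creation request. The sketch below is written in Python with boto3 rather than Scala, purely as an illustration. It builds the arguments for `run_job_flow` with the Spark step included and `KeepJobFlowAliveWhenNoSteps` set to false, so the cluster shuts down once the step has finished. The cluster name, release label, instance types and count, and the default role names are all assumptions to be replaced with your own values; `build_transient_cluster_request` is a hypothetical helper name.

```python
def build_transient_cluster_request(jar_s3_path, main_class, log_uri):
    """Build run_job_flow arguments for a cluster that runs one Spark step
    and then terminates itself."""
    return {
        "Name": "transient-spark-cluster",   # assumed name
        "ReleaseLabel": "emr-5.28.0",        # assumed release
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 3,                 # assumed cluster size
            # Shut the cluster down once there are no more steps to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "Spark Step",
                # If the step fails, terminate instead of leaving the cluster idle.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "--class", main_class,
                        jar_s3_path,
                    ],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }
```

With a boto3 client created as `emr = boto3.client("emr")`, the request would be sent with `emr.run_job_flow(**request)`, and the cluster ID is returned under the `JobFlowId` key of the response. The AWS SDK for Java, which can be called from Scala, takes the same fields on its RunJobFlowRequest.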
If you are just looking for automation, you should read about pipeline orchestration:
EMR is the AWS service that allows you to run distributed applications.
AWS Data Pipeline is an orchestration tool that allows you to run jobs (activities) on resources (EMR or even EC2).
If you'd just like to run a Spark job consistently, I would suggest creating a data pipeline and configuring it with a single step: running the Scala Spark jar on the master node using a "ShellCommandActivity". Another benefit is that the jar can be stored in AWS S3 (object storage service); you just provide the S3 path to your Data Pipeline. The pipeline brings up the EMR cluster with the configuration you've provided, copies the jar to the master node, and runs it with the configuration given in the "ShellCommandActivity". Once the job exits (successfully or with an error), it terminates the EMR cluster so you aren't paying for it, and logs the output.
Please read more about it: https://aws.amazon.com/datapipeline/ & https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
If you'd like, you can also trigger this pipeline via the AWS SDK or set it to run on a schedule.

Spark fails with too many open files on HDInsight YARN cluster

I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
However, I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into the machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by much either, or my job will become much slower.
You can SSH into all nodes from the head node (the Ambari UI shows the FQDN of every node).
ssh sshuser@nameofthecluster-ssh.azurehdinsight.net
You can then write a custom script action that alters the settings on the necessary nodes if you want to automate this.