PySpark best practices

We are newly implementing PySpark at our company; what would be the best practice for handling the code? Our architecture is as follows:
S3 --> EMR (PySpark) --> Snowflake. Data flow is orchestrated through Airflow.
Infra orchestration is through Terraform.
Currently, I store the Terraform code on GitHub and deploy it through CI/CD, so when data is pushed for PySpark in the dev S3 bucket it gets copied to the prod AWS account.
My question is: do I need to store the PySpark code on GitHub or not? What is the best practice for this?
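For reference, here is a minimal sketch of how the flow described above could be wired in Airflow, assuming the PySpark script is versioned in a repository and synced to S3 by CI/CD. The bucket name, EMR cluster ID, script path, and connection ID are placeholder assumptions, not part of the original setup.

```python
# Hedged sketch: an Airflow DAG that submits a PySpark step to an existing EMR
# cluster. Bucket names, the cluster ID and the script path are assumptions;
# the script itself would be pushed to S3 from the Git repo by CI/CD.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEP = [
    {
        "Name": "run_pyspark_etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-artifacts-bucket/jobs/etl_to_snowflake.py",  # synced from Git by CI/CD
            ],
        },
    }
]

with DAG(
    dag_id="s3_emr_snowflake_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
        steps=SPARK_STEP,
        aws_conn_id="aws_default",
    )

    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id="j-XXXXXXXXXXXXX",
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
        aws_conn_id="aws_default",
    )

    add_step >> wait_for_step
```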

Related

Creating IaC with AWS CDK to just get the CloudFormation equivalent code

I'm working with a customer that wants all the IaC in CloudFormation, and I'm not good at CloudFormation. I wonder if it is common practice to create the IaC using AWS CDK, generate the CloudFormation equivalent, and use that.
Is this something that is commonly done?

Azure Synapse SQL CICD with multiple environments

This might be a stupid question, but it seems very difficult to find information about Synapse with multiple environments.
We have a dev/test/prod environment setup and need to create partially automated CI/CD pipelines between those. The only problem now is that we cannot build dynamic SQL scripts that query the respective storage accounts, so that the scripts could be identical regardless of the environment - i.e., dev Synapse uses data from dev storage, and so on. A dedicated SQL pool can benefit from stored procedures, and I could pass parameters there if that works. But what about the serverless pool? What is the correct way?
I've tried options such as OPENROWSET with the DATA_SOURCE argument as well as the EXTERNAL DATA SOURCE expression, without any luck. Also, no one seems to offer any information about this, so I'm beginning to wonder whether this whole perspective is wrong.
This kind of "external" file reading is new to me; I may have been trying to fit it into a SQL Server context in my head.
Thank you for your time!
Okay, the serverless pool does support both stored procedures and dynamic SQL, yet you currently cannot call them straight from Synapse Pipelines.
You have to either trigger those procedures via Spark notebooks, or create a separate Synapse Analytics Linked Service for each of your databases in the Synapse serverless pool and work from there.
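As a rough illustration of the Spark-notebook route (not official guidance): a sketch that calls an environment-specific stored procedure on the serverless SQL endpoint via pyodbc, passing the storage account as a parameter. The endpoint, database, procedure, and parameter names are assumptions.

```python
# Hedged sketch: from a Synapse Spark notebook, call a stored procedure on the
# serverless SQL endpoint and pass the environment-specific storage account as
# a parameter. Endpoint, database, procedure and parameter names are assumptions.
import pyodbc

ENV = "dev"  # could come from a pipeline parameter or Spark config
SERVERLESS_ENDPOINT = f"myworkspace-{ENV}-ondemand.sql.azuresynapse.net"  # assumption
STORAGE_ACCOUNT = f"mydatalake{ENV}"  # assumption: separate dev/test/prod storage accounts

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    f"Server={SERVERLESS_ENDPOINT};"
    "Database=reporting;"                  # assumed serverless database
    "Authentication=ActiveDirectoryMsi;",  # workspace managed identity
    autocommit=True,
)

# The procedure itself would build dynamic SQL (e.g. OPENROWSET) against the
# passed-in storage account, so the script stays identical across environments.
conn.execute(
    "EXEC dbo.load_sales_external ?",
    f"https://{STORAGE_ACCOUNT}.dfs.core.windows.net/raw/",
)
conn.close()
```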

Azure : Ansible role for deploying ML model integrated over databricks

I have developed an ML predictive model on historical data in Azure Databricks using a Python notebook.
That means data extraction, preparation, feature engineering, and model training were all done in Databricks using a Python notebook.
I have almost completed the development part of it; now we want to deploy the ML model into production using Ansible roles.
To deploy to Azure ML you need to build an image from the MLflow model; this is done with MLflow's mlflow.azureml.build_image function. After that you can deploy it to Azure Container Instances (ACI) or Azure Kubernetes Service using MLflow's client.create_deployment function (see the Azure docs). There is also the mlflow.azureml.deploy function, which does everything in one step.
There is a blog post and an example notebook that show the code for the full process of training/testing/deploying the model with MLflow and Azure ML.
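A rough sketch of the one-step mlflow.azureml.deploy route mentioned above (MLflow 1.x style; API details vary by version, and the workspace details, registered model name, and service name are assumptions):

```python
# Hedged sketch: deploy an MLflow model logged from the Databricks notebook to
# ACI via mlflow.azureml.deploy. Workspace details and the model URI are assumptions.
import mlflow.azureml
from azureml.core import Workspace
from azureml.core.webservice import AciWebservice

# Assumed Azure ML workspace details
ws = Workspace.get(
    name="my-aml-workspace",
    subscription_id="<subscription-id>",
    resource_group="my-resource-group",
)

# Deployment configuration for Azure Container Instances (ACI)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

# Assumed registered model, e.g. logged earlier with mlflow.sklearn.log_model(...)
model_uri = "models:/churn_model/Production"

webservice, azure_model = mlflow.azureml.deploy(
    model_uri=model_uri,
    workspace=ws,
    deployment_config=aci_config,
    service_name="churn-model-service",
)
print(webservice.scoring_uri)
```

An Ansible role could then wrap a script like this as a task that runs against the target environment.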

GCP - spark on GKE vs Dataproc

Our organisation has recently moved its infrastructure from AWS to Google Cloud, and I figured Dataproc clusters are a good solution for running our existing Spark jobs. But when it comes to comparing the pricing, I also realised that I could just fire up a Google Kubernetes Engine cluster and install Spark on it to run Spark applications.
Now my question is: how do running Spark on GKE and using Dataproc compare? Which one would be the better option in terms of autoscaling, pricing, and infrastructure? I've read Google's documentation on GKE and Dataproc, but there isn't enough there to be sure of the advantages and disadvantages of using one over the other.
Any expert opinion will be extremely helpful.
Thanks in advance.
Spark on Dataproc is proven and in use at many organizations. Although it's not fully managed, you can automate cluster creation and teardown, job submission, etc. through the GCP API (a sketch follows below), but it's still another stack you have to manage.
Spark on GKE is something new: Spark started adding Kubernetes support from 2.4 onwards, and Google moved its Kubernetes support into preview just a couple of days ago (Link).
I would go with Dataproc if I had to run jobs in a prod environment today; otherwise you could experiment with Docker yourself and see how it fares, but I think it needs a little more time to become stable. From a purely cost perspective it would be cheaper with Docker, since you can share resources with your other services.
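To make the "automate cluster creation and job submission through the GCP API" point concrete, here is a minimal sketch using the google-cloud-dataproc Python client; the project, region, bucket, and job script are placeholders.

```python
# Hedged sketch: create a Dataproc cluster, submit a PySpark job, then delete
# the cluster (an "ephemeral cluster" pattern). Project, region, bucket and
# file names are placeholders.
from google.cloud import dataproc_v1

PROJECT, REGION, CLUSTER = "my-project", "us-central1", "ephemeral-spark"
endpoint = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create a small cluster
cluster = {
    "project_id": PROJECT,
    "cluster_name": CLUSTER,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()

# 2. Submit the existing PySpark job
job = {
    "placement": {"cluster_name": CLUSTER},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/my_spark_job.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": PROJECT, "region": REGION, "job": job}
).result()

# 3. Tear the cluster down again
cluster_client.delete_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
).result()
```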
Adding my two cents to the answer above.
I would favor Dataproc because it is managed and supports Spark out of the box - no hassles. More importantly, it is cost optimized: you may not need clusters all the time, and with Dataproc you can use ephemeral clusters.
With GKE, I would need to explicitly tear down the cluster and recreate it when necessary, so additional care has to be taken.
On the other hand, I could not find any direct GCP service offering for data lineage. For that, I would probably use Apache Atlas with the Spark-Atlas-Connector on a Spark installation I manage myself. In that case, running Spark on GKE, with all the control in my own hands, would make a compelling choice.

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Cloud Storage bucket, in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud Node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
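In later versions of the Python SDK, fileio.MatchContinuously offers a similar capability. A rough sketch (the bucket, table, and parsing logic are assumptions):

```python
# Hedged sketch: a streaming Beam pipeline that continuously watches a GCS
# prefix for new files and writes rows to BigQuery. Bucket, table and the
# parsing logic are assumptions.
import json

import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus project/runner/temp_location, etc.

with beam.Pipeline(options=options) as p:
    (
        p
        | "WatchForFiles" >> fileio.MatchContinuously(
            "gs://my-bucket/incoming/*.json", interval=60  # poll every 60 seconds
        )
        | "ReadFiles" >> fileio.ReadMatches()
        | "ParseLines" >> beam.FlatMap(
            lambda f: (json.loads(line) for line in f.read_utf8().splitlines())
        )
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.my_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```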
Maybe this post on how to trigger Dataflow pipelines from App Engine or Cloud Functions would help:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
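The blog post uses Node.js; the same idea in a Python Cloud Function triggered by a new object in the bucket, launching a pre-built Dataflow template, might look roughly like this (the project, template path, and parameter names are assumptions):

```python
# Hedged sketch: a background Cloud Function (Python runtime) triggered by
# google.storage.object.finalize that launches a Dataflow template. Project,
# template path and template parameter names are assumptions.
from googleapiclient.discovery import build


def trigger_dataflow(event, context):
    """Entry point for the GCS object.finalize trigger."""
    project = "my-project"
    gcs_path = "gs://my-bucket/templates/gcs_to_bigquery"  # pre-built Dataflow template
    new_file = f"gs://{event['bucket']}/{event['name']}"

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=gcs_path,
        body={
            "jobName": "gcs-to-bq-" + context.event_id,
            "parameters": {"inputFile": new_file},  # assumed template parameter
        },
    )
    response = request.execute()
    print("Launched Dataflow job:", response)
```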