SageMaker Pre-processing/Training Jobs vs ECS

We are considering using SageMaker jobs or ECS as the compute resource for a few of our ML jobs. Our jobs are based on a custom Dockerfile (no Spark, just basic ML Python libraries), so all that is required is compute for the container.
Is there any specific advantage of using SageMaker over ECS here? Also, since our use case only requires a resource for running a Docker image, would a Processing Job or a Training Job serve the same purpose? Thanks!

Yes, you could use either a Training Job or a Processing Job (assuming the ML jobs are transient training and/or processing workloads).
The benefit of using SageMaker over ECS is that SageMaker manages the infrastructure. The jobs are also transient, so the underlying instances are torn down after training/processing, while your artifacts are automatically saved to S3.
With SageMaker Training or Processing Jobs, all you need to do is bring your container (sitting in ECR) and kick off the job with a single API call (CreateTrainingJob, CreateProcessingJob). A minimal sketch is below.
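Here is a minimal sketch of kicking off a custom-container Training Job with boto3's CreateTrainingJob; the job name, image URI, role ARN and bucket are placeholders for your own values.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: job name, ECR image URI, IAM role and S3 bucket are examples only.
sm.create_training_job(
    TrainingJobName="my-custom-ml-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-ml-image:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```

Anything your container writes to /opt/ml/model is uploaded to the S3OutputPath when the job ends, and the instance is released automatically.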

Related

Can anyone please tell me how efficient Azure Databricks Jobs are?

For example, if I schedule a Databricks notebook to run via Jobs versus Azure Data Factory, which one would be more efficient and why?
There are a few cases where Databricks Workflows (formerly Jobs) are more efficient to use than ADF:
ADF still uses the Jobs API 2.0 for submission of ephemeral jobs, which doesn't support setting default access control lists.
If you have a job consisting of several tasks, Databricks offers cluster reuse, which lets multiple tasks run on the same cluster(s) instead of waiting for new clusters to be created, as happens when the tasks are scheduled from ADF (see the sketch after this list).
You can more efficiently share context between tasks when using Databricks Workflows.
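As a rough illustration of the cluster-reuse point, here is a sketch of a multi-task job definition submitted through the Jobs API 2.1, where both tasks reference the same job cluster. The workspace URL, token and notebook paths are placeholders.

```python
import requests

# Placeholders: substitute your own workspace URL, token and notebook paths.
workspace = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "multi-task-with-shared-cluster",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "extract",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Repos/etl/extract"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "extract"}],
            "job_cluster_key": "shared_cluster",  # same cluster, no new provisioning
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
        },
    ],
}

resp = requests.post(
    f"{workspace}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # returns the job_id on success
```

When ADF schedules the notebooks as separate activities, each one gets its own ephemeral cluster instead of sharing one like this.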

How to integrate spark.ml pipeline fitting and hyperparameter optimisation in AWS Sagemaker?

Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as the compute backend, all in SageMaker using their Training Job API.
To clarify:
I have to use LightGBM; there is no option here.
The reason I need a Spark compute backend is that training with the current dataset no longer fits in memory.
I want to use the SageMaker Training Job setting so I could use an SM Hyperparameter Optimisation Job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, then pull it from ECR and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case SM handles resource provisioning for you, creates an instance and so on. What I am confused about is that, essentially, to use a Spark compute backend I would need to have an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
There is also the SageMaker PySpark SDK. However, the SageMakerEstimator API from that package does not support on-the-fly cluster configuration either.
Does anyone know a way to run a SageMaker training job that uses an EMR cluster, so that later the same job could be used for hyperparameter tuning?
One way I see is to run an EMR cluster in the background and then create a regular SM Estimator job that connects to the EMR cluster and does the training, essentially running the Spark driver program in the SM Estimator job.
Has anyone done anything similar in the past?
Thanks
Thanks for your questions. Here are answers:
The SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a non-Spark (or Spark) SageMaker job from a Spark environment. Not sure that's what you need here.
Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to find and use a third-party, external parameter search library; for example, Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom docker-based jobs on one or multiple machines. If you can fit your Spark code within the SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container.
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and it will indeed allow you to use SM Model Tuning. I'd recommend (see the sketch after these points):
having each SM job create a new transient cluster (auto-terminating after the step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster.
using the cheapest possible instance type for the SM estimator, because it needs to stay up for the entire duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...).
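A minimal sketch of the SageMaker side of this pattern, using the SageMaker Python SDK: the image URI, role, metric name and hyperparameter names are placeholders, and the custom container is assumed to launch a transient EMR cluster, wait for it, and print the final metric so the regex can pick it up.

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Placeholders: ECR image, IAM role, S3 path and metric regex are examples only.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/lgbm-spark-driver:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=1,
    instance_type="ml.t3.medium",  # cheap instance; the heavy lifting happens on EMR
    output_path="s3://my-bucket/tuning-output/",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    metric_definitions=[{"Name": "validation:auc", "Regex": "validation-auc=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "num_leaves": IntegerParameter(16, 256),
        "learning_rate": ContinuousParameter(0.01, 0.3),
    },
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit()  # each training job acts as the client driving the Spark/EMR work
```

The hyperparameters are passed into each training job, where your driver code would forward them to the LightGBM-on-Spark training step.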
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs, for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.

How Data Flow computing differs from Databricks

Knowing that ADF Data Flow transformations run on a Databricks cluster in the background, how different (in terms of cost and performance) would it be to run the same transformations in a Databricks notebook in the same pipeline?
I guess it will depend on how we configure the Databricks cluster, but my question is also about how this cluster runs in the background. Would it be a dedicated cluster or a shared one in the platform?
Each activity in ADF is executed by an Integration Runtime (VM). If you are synchronously monitoring a Databricks job, you will be charged for the Integration Runtime that is monitoring your job.
Notebook execution in Databricks will be charged as a job cluster. Create a pool and use that pool in ADF; in Databricks you will see the history of ADF-created clusters in the pool overview.
When creating the pool, be careful with the settings, as you can be charged for idle time: minimum idle instances can be 0 and the auto-termination time set to a low value. If your data flow executes notebooks step by step, reusing the same pool can be quicker and cheaper, because Databricks will not deploy a new machine but will use an existing one from the pool (if it wasn't auto-terminated already). A sketch of creating such a pool is below.
(Screenshot: ADF-created job clusters in the pool and the min idle settings.)
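A minimal sketch of creating such a pool through the Instance Pools API, with zero warm instances and a short idle auto-termination so you are not billed for idle capacity; the workspace URL, token and node type are placeholders.

```python
import requests

# Placeholders: substitute your own workspace URL, token and node type.
workspace = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<personal-access-token>"

pool_spec = {
    "instance_pool_name": "adf-notebook-pool",
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 0,                      # no always-on machines
    "idle_instance_autotermination_minutes": 10,  # release idle machines quickly
}

resp = requests.post(
    f"{workspace}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {token}"},
    json=pool_spec,
)
print(resp.json())  # returns the instance_pool_id to reference from ADF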

Kubeflow Pipeline in serving model

I'm beginning to dig into Kubeflow Pipelines for a project and have a beginner's question. It seems like Kubeflow Pipelines work well for training, but what about serving in production?
I have a fairly intensive pre-processing pipeline for training and must apply that same pipeline for production predictions. Can I use something like Seldon Serving to create an endpoint that kicks off the pre-processing pipeline, applies the model, and then returns the prediction? Or is the better approach to just put everything in one Docker container?
Yes, you can definitely use Seldon for serving. In fact, the Kubeflow team offers an easy way to link training and serving: Fairing.
Fairing provides a programmatic way of deploying your prediction endpoint. You could also take a look at this example of how to deploy a Seldon endpoint with your training result.
KF Pipelines is designed for pipelines that run from start to finish. A serving process does not have an end, so, although possible, serving itself should be handled outside of a pipeline.
What the pipeline should do at the end is push the trained model to the long-lived serving service.
The serving can be performed by CMLE serving, Kubeflow's TFServe, Seldon, etc.
Can I use something like Seldon Serving to create an endpoint to kickoff the pre processing pipeline, apply the model, then to return the prediction?
Due to container startup overhead, Kubeflow Pipelines usually handle batch jobs. Of course you can run a pipeline for a single prediction, but the latency might not be acceptable. For serving, it is better to have a dedicated long-lived container/service that accepts requests, transforms the data and makes predictions. A minimal sketch of the train-then-deploy pattern follows.
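As a rough sketch of that pattern with the KFP v2 SDK: the last pipeline step hands the trained model artifact to the serving layer, which then runs outside the pipeline. The component bodies, the endpoint name and the deployment call are placeholders.

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def train_model(model: dsl.Output[dsl.Model]):
    # Placeholder training step: write a dummy model artifact.
    with open(model.path, "w") as f:
        f.write("trained-model-bytes")

@dsl.component(base_image="python:3.10")
def push_to_serving(model: dsl.Input[dsl.Model], endpoint_name: str):
    # Placeholder deploy step: in a real setup this would call your serving
    # platform's API (e.g. create/update a Seldon deployment) with the artifact.
    print(f"Deploying {model.path} to endpoint {endpoint_name}")

@dsl.pipeline(name="train-then-deploy")
def train_then_deploy(endpoint_name: str = "my-endpoint"):
    trained = train_model()
    push_to_serving(model=trained.outputs["model"], endpoint_name=endpoint_name)

if __name__ == "__main__":
    compiler.Compiler().compile(train_then_deploy, "pipeline.yaml")
```

The pipeline ends after the deploy step; request handling, data transformation and prediction then live in the long-running serving service.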

How do I set up a Tensorflow cluster using Google Compute Engine Instances to train a model?

I understand I can use Docker images, but do I need Kubernetes to create a cluster? There are instructions available for model serving, but what about model training on Kubernetes?
You can use Kubernetes Jobs to run batch compute tasks. But currently (circa v1.6) it's not easy to set up data pipelines in Kubernetes.
You might want to look at Pachyderm, which is a data processing framework built on top of Kubernetes. It adds some nice data packaging/versioning tools.
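For the Kubernetes Jobs route mentioned above, here is a minimal sketch using the official Kubernetes Python client to run a one-off training container as a batch Job; the image name and command are placeholders.

```python
from kubernetes import client, config

# Placeholders: image and command are examples only.
config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="tf-train"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a couple of times on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="gcr.io/my-project/tf-train:latest",
                        command=["python", "train.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

The Job runs the container to completion and cleans up according to your cluster's policies; data movement in and out of the container is still up to you, which is where tools like Pachyderm help.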