How to integrate spark.ml pipeline fitting and hyperparameter optimisation in AWS SageMaker? (pyspark)

Here is a high-level picture of what I am trying to achieve: I want to train a LightGBM model with Spark as the compute backend, all in SageMaker using their Training Job API.
To clarify:
I have to use LightGBM in general, there is no option here.
The reason I need a Spark compute backend is that training on the current dataset no longer fits in memory.
I want to use the SageMaker Training Job setting so I could use an SM Hyperparameter optimisation job to find the best hyperparameters for LightGBM. While the LightGBM Spark interface itself does offer some hyperparameter tuning capabilities, it does not offer Bayesian HP tuning.
Now, I know the general approach to running custom training in SM: build a container in a certain way, push it to ECR, and kick off a training job/hyperparameter tuning job through the sagemaker.Estimator API. In this case SM handles resource provisioning for you: it creates the instance and so on. What I am confused about is that, to use a Spark compute backend, I would essentially need an EMR cluster running, so the SDK would have to handle that as well. However, I do not see how this is possible with the API above.
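For concreteness, by the standard flow I mean roughly this (the image URI and role below are placeholders):

```python
from sagemaker.estimator import Estimator

# Placeholder ECR image and IAM role -- substitute your own.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/lightgbm-train:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    hyperparameters={"num_leaves": 31, "learning_rate": 0.1},
)

# SageMaker provisions the instance, pulls the image, and runs the job.
estimator.fit({"train": "s3://my-bucket/lightgbm/train/"})
```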
Now, there is also the SageMaker PySpark SDK. However, the SageMakerEstimator API provided by that package does not support on-the-fly cluster configuration either.
Does anyone know a way to run a SageMaker training job that would use an EMR cluster, so that the same job could later be used for hyperparameter tuning activities?
One way I see is to run an EMR cluster in the background, and then just create a regular SM Estimator job that would connect to the EMR cluster and do the training, essentially running the Spark driver program inside the SM Estimator job.
Has anyone done anything similar in the past?
Thanks

Thanks for your questions. Here are answers:
The SageMaker PySpark SDK (https://sagemaker-pyspark.readthedocs.io/en/latest/) does the opposite of what you want: it lets you call a (non-Spark or Spark) SageMaker job from a Spark environment. Not sure that's what you need here.
Running Spark in SageMaker jobs. While you can use SageMaker Notebooks to connect to a remote EMR cluster for interactive coding, you do not need EMR to run Spark in SageMaker jobs (Training and Processing). You have 2 options:
SageMaker Processing has a built-in Spark container, which is easy to use but unfortunately not connected to SageMaker Model Tuning (which works with Training only). If you use this, you will have to find and use a third-party, external parameter search library; for example Syne Tune from AWS itself (which supports Bayesian optimization).
SageMaker Training can run custom docker-based jobs, on one or multiple machines. If you can fit your Spark code within the SageMaker Training spec, then you will be able to use SageMaker Model Tuning to tune your Spark code. However, there is no framework container for Spark on SageMaker Training, so you would have to build your own, and I am not aware of any examples. Maybe you could get inspiration from the Processing container code here to build a custom Training container; a rough sketch of the tuning side follows below.
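To make that second option concrete, here is a hedged sketch of the Training + Tuning path, assuming you already have a custom Spark/LightGBM training image in ECR and that your container prints the objective metric to stdout (the image URI, role, and metric regex are placeholders):

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

# Placeholder image and role; the container runs your Spark + LightGBM code.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/spark-lightgbm:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.01, 0.3),
        "num_leaves": IntegerParameter(15, 255),
    },
    # Model Tuning scrapes the metric from the job's stdout with this regex.
    metric_definitions=[{"Name": "validation:auc",
                         "Regex": "validation-auc=([0-9\\.]+)"}],
    strategy="Bayesian",  # the default, shown here for emphasis
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://my-bucket/lightgbm/train/"})
```

The only contract your container has to honour is printing a line that matches the metric regex.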
Your idea of using the Training job as a client to launch an EMR cluster is good and should work (if SM has the right permissions), and will indeed allow you to use SM Model Tuning. I'd recommend:
each SM job creating a new transient cluster (auto-terminating after the step) to keep costs low and to avoid tuning results being polluted by inter-job contention, which could arise if everything ran on the same cluster (see the launcher sketch after this list);
using the cheapest possible instance type for the SM estimator, because it will need to stay up for the full duration of your EMR experiment to collect and print your final metric (accuracy, duration, cost...).
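A minimal sketch of what the driver script inside the SM Training job could look like, assuming the job's IAM role is allowed to call EMR and that your Spark step writes its metric to S3 (cluster config, script path, bucket, and metric name are all placeholders):

```python
import boto3

emr = boto3.client("emr")

# Launch a transient cluster that runs one Spark step, then terminates itself.
response = emr.run_job_flow(
    Name="lightgbm-tuning-trial",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step
    },
    Steps=[{
        "Name": "train-lightgbm",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/code/train_lightgbm.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]

# The SM job (on a cheap instance) just waits for the cluster to finish.
emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)

# Read the metric the Spark step wrote to S3, then print it so SageMaker
# Model Tuning can scrape it from the job logs.
body = boto3.client("s3").get_object(
    Bucket="my-bucket", Key="results/auc.txt")["Body"].read()
print(f"validation-auc={float(body.decode('utf-8'))}")
```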
In the same spirit, I once used SageMaker Training myself to launch Batch Transform jobs, for the sole purpose of leveraging the Bayesian search API to find an inference configuration that minimizes cost.

Related

Sagemaker Pre-processing/Training Jobs vs ECS

We are considering using SageMaker Jobs/ECS as a resource for a few of our ML jobs. Our jobs are based on a custom Dockerfile (no Spark, just basic ML Python libraries), and thus all that is required is compute for the container.
Wanted to know: is there any specific advantage of using SageMaker vs ECS here? Also, as our use case only requires a resource for running a Docker image, would a Processing Job / Training Job serve the same purpose? Thanks!
Yeah, you could use either a Training Job or a Processing Job (assuming the ML jobs are for transient training and/or processing).
The benefit of using SageMaker over ECS is that SageMaker manages the infrastructure. The Jobs are also transient, and as such will be torn down after training/processing, while your artifacts are automatically saved to S3.
With SageMaker Training or Processing Jobs, all you need to do is bring your container (sitting in ECR) and kick off the Job with a single API call (CreateTrainingJob, CreateProcessingJob).
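For example, a bare-bones sketch with boto3 (image URI, role, and bucket are placeholders):

```python
import boto3

sm = boto3.client("sagemaker")

# One call: SageMaker provisions the instance, pulls the image from ECR,
# runs it, saves artifacts from /opt/ml/model to S3, and tears everything down.
sm.create_training_job(
    TrainingJobName="my-custom-job-001",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-ml-job:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/artifacts/"},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```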

Should I use AWS Glue or Spark on EMR for processing binary data to parquet format

I have a work requirement of reading binary data from sensors and producing Parquet output for analytics.
For storage I have chosen S3 and DynamoDB.
For the processing engine I am unsure how to choose between AWS EMR and AWS Glue.
The data processing code base will be maintained in Python coupled with Spark.
Please post your suggestions on choosing between AWS EMR and AWS Glue.
Choosing between Glue and EMR depends on your use case.
EMR is a managed cluster of servers and costs less than Glue, but it also requires more maintenance and set-up overhead. You can run not only Spark but also other frameworks on EMR, such as Flink.
Glue is serverless Spark / Python and really easy to use. It does not run on the latest Spark version and abstracts a lot of Spark away, which is good in some ways and bad in others, in that you cannot set specific configurations very easily.
It's an opinion-based question; also, you now have AWS EMR Serverless as an option.
AWS Glue is 1) more managed and thus comes with restrictions, 2) has (imho) issues with crawling for schema changes that you have to consider, 3) has its own interpretation of DataFrames (DynamicFrames), 4) allows less run-time configuration, and 5) offers fewer options for serverless scalability. There also seem to be a few bugs that keep popping up.
AWS EMR is 1) an AWS platform that is easy enough to configure, 2) comes with the AWS flavour of what they think is the best way of running Spark, 3) has some limitations around scaling resources back down after dynamic scale-out, 4) uses standard Spark, so there is a bigger pool of people to hire, and 5) allows bootstrapping software that is not supplied by default, as well as selecting standard software such as, say, HBase.
So, comparable to an extent, and divergent in other ways: AWS Glue is ETL/ELT; AWS EMR is that plus more capabilities.
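Whichever engine you pick, the core PySpark code is largely the same on both. A minimal sketch of the binary-to-Parquet job, assuming Spark 3.0+ for the binaryFile source (paths are placeholders and the sensor-specific decoding is left as a stub):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensors-to-parquet").getOrCreate()

# binaryFile (Spark 3.0+) yields one row per file: path, modificationTime,
# length, and the raw bytes in `content`.
raw = spark.read.format("binaryFile").load("s3://my-bucket/sensor-raw/")

# Your sensor-specific decoder would go here, e.g. a UDF over `content`.
parsed = raw.withColumn("ingest_date", F.to_date("modificationTime"))

(parsed
    .drop("content")  # a real job would write the decoded fields instead
    .write.mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("s3://my-bucket/sensor-parquet/"))
```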

How do I set up a Tensorflow cluster using Google Compute Engine Instances to train a model?

I understand I can use Docker images, but do I need Kubernetes to create a cluster? There are instructions available for model serving, but what about model training on Kubernetes?
You can use Kubernetes Jobs to run batch compute tasks. But currently (circa v1.6) it's not easy to set up data pipelines in Kubernetes.
You might want to look at Pachyderm, which is a data processing framework built on top of Kubernetes. It adds some nice data packing/versioning tools.
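However you schedule the pods (plain Kubernetes Jobs, Pachyderm, etc.), distributed TensorFlow training itself is wired together through the TF_CONFIG environment variable on each worker. A minimal sketch (the cluster spec here is a hypothetical single-worker example; a real Job spec would inject a TF_CONFIG listing every worker pod's address):

```python
import json
import os

import tensorflow as tf

# Each worker pod receives a TF_CONFIG describing the whole cluster and its
# own index. We fake a single-worker config here for illustration.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["localhost:2222"]},
    "task": {"type": "worker", "index": 0},
}))

# The strategy reads TF_CONFIG at construction time and wires up the workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then runs synchronous data-parallel training across workers.
```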

How to set up a fully functional (including cluster) Spark learning development environment on one machine?

I want to start learning Spark 2.0, so I am trying to set up my dev environment (Scala v2.11).
Spark normally works in a distributed environment: one cluster across multiple separate machines, one node per machine. However, I do not have many machines for testing purposes; I only have one machine with CentOS 7 on it.
I am not after performance; I need something that would simulate a working cluster so that I could learn Spark.
How can I set up a development environment to learn and develop Spark applications without access to multiple machines, while still being able to learn and write code for a fully functional Spark-based environment?
Start with local mode.
Spark will do everything as usual: spawn executors, distribute tasks, etc. The only step that will be omitted is the transfer of data across the network, and that happens completely under the hood in production anyway, so you don't need to take this omission into account while coding.
You will be able to specify the number of executors (threads only, in this mode), and test, for example, the fact that Spark Streaming needs at least 2 of them.
Referring to your comments:
Or it does not make much sense to make a cluster to learn spark because it is all done under the hood and the programming is all the same on local and say standalone/YARN/mesos mode
Yes, there are some conventions, but they are exactly the same in local mode and the other modes.
Does the local mode mean that I will be able to start an exemplary cluster with, say, 3 nodes?
local[3] should do the trick.
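For example, in PySpark (the Scala shell equivalent is simply spark-shell --master "local[3]"):

```python
from pyspark.sql import SparkSession

# local[3] runs the full Spark machinery in a single JVM with 3 worker
# threads, which is enough to simulate a small cluster for learning.
spark = (SparkSession.builder
         .master("local[3]")
         .appName("learning-spark")
         .getOrCreate())

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())  # work is still split into tasks -> 3

spark.stop()
```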

Timing MLlib algorithms when executed in Scala interactive shell

I am using Spark's Scala (interactive) shell to experiment with MLlib algorithms, e.g. singular value decomposition (SVD), by applying them to datasets of different sizes. Is there a way to find out how long the execution of an algorithm takes when it is run in the shell?
Spark shell job status is available in the web UI, on port 4040 of the driver node (the machine where you launched the shell).
Open a browser and go to http://DRIVER_HOSTNAME:4040
The 4040 UI lets you track each job's progress and various metrics about its execution, including task durations and cache statistics.
Other useful ports (in standalone mode) include:
8080, the master web UI, with information about the entire Spark cluster
7077, the master's service port, i.e. the spark://HOST:7077 URL that workers and applications connect to (not a web UI)
Reference: Data exploration using spark.
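If you just want a single wall-clock number from inside the shell, you can also time the action directly. A minimal sketch in PySpark (the same pattern with System.nanoTime works in the Scala shell); note that transformations are lazy, so you must time an action:

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

t0 = time.perf_counter()
# Time an action: transformations are lazy and return immediately,
# so wrapping only them would measure nothing.
spark.range(10_000_000).selectExpr("sum(id) AS total").collect()
print(f"elapsed: {time.perf_counter() - t0:.3f}s")

spark.stop()
```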