How do I set up a TensorFlow cluster using Google Compute Engine instances to train a model? - kubernetes

I understand I can use Docker images, but do I need Kubernetes to create a cluster? There are instructions available for model serving, but what about model training on Kubernetes?

You can use Kubernetes Jobs to run batch compute tasks. But currently (circa v1.6) it's not easy to set up data pipelines in Kubernetes.
You might want to look at Pachyderm, which is a data processing framework built on top of Kubernetes. It adds some nice data packing/versioning tools.
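As a rough illustration of the Kubernetes Jobs suggestion above, here is a minimal sketch that builds a `batch/v1` Job manifest as a plain dictionary and prints it as JSON. The job name, image name and training command are hypothetical placeholders, not anything from the question, and the helper `build_training_job` is made up for this example.

```python
import json


def build_training_job(name, image, command, parallelism=1):
    """Return a Kubernetes batch/v1 Job manifest as a plain dict.

    All argument values are supplied by the caller; nothing here talks
    to a cluster, it only assembles the manifest.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Run `parallelism` pods at once until `completions` succeed.
            "parallelism": parallelism,
            "completions": parallelism,
            "template": {
                "spec": {
                    # Jobs require Never or OnFailure, not Always.
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "trainer", "image": image, "command": command}
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image and command, for illustration only.
    manifest = build_training_job(
        name="tf-train",
        image="gcr.io/my-project/tf-trainer:latest",
        command=["python", "train.py"],
        parallelism=2,
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.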

Related

SageMaker Pre-processing/Training Jobs vs ECS

We are considering using SageMaker jobs or ECS as a resource for a few of our ML jobs. Our jobs are based on a custom Dockerfile (no Spark, just basic ML Python libraries), so all we need is compute resources for the container.
Is there any specific advantage to using SageMaker over ECS here? Also, since our use case only requires a resource for running a Docker image, would a Processing Job and a Training Job serve the same purpose? Thanks!
Yes, you could use either a Training Job or a Processing Job (assuming the ML jobs are for transient training and/or processing).
The benefit of using SageMaker over ECS is that SageMaker manages the infrastructure. The jobs are also transient: the instances are shut down after training/processing, and your artifacts are automatically saved to S3.
With SageMaker Training or Processing Jobs, all you need to do is bring your container (stored in ECR) and kick off the job with a single API call (CreateTrainingJob or CreateProcessingJob).
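To make the "single API call" concrete, here is a minimal sketch of a CreateTrainingJob request built as a dictionary and passed to boto3. The job name, ECR image URI, role ARN, S3 path, instance type and volume size are all placeholder assumptions, and the helpers `build_training_job_request` and `submit` are made up for this example. Only the field names follow the CreateTrainingJob API.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.m5.xlarge",
                               max_runtime_s=3600):
    """Assemble the keyword arguments for SageMaker CreateTrainingJob.

    Every value is supplied by the caller; the defaults are illustrative
    assumptions, not recommendations.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # custom container stored in ECR
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }


def submit(request):
    """Send the request to SageMaker (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so building a request needs no AWS setup

    return boto3.client("sagemaker").create_training_job(**request)
```

Calling `submit(build_training_job_request(...))` with real values would start the job. A Processing Job follows the same pattern through `create_processing_job`, with its own request fields.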

Orchestration of an NLP model via airflow and kubernetes

This is more of an architecture question. I have a data engineering background and have been using Airflow to orchestrate ETL tasks for a while. I have limited knowledge of containerization and Kubernetes. My task is to come up with a good-practice framework for productionizing our data science models using an orchestration engine, namely Airflow.
Our data science team creates many NLP models to process text documents from various sources. Previously the model was created by an external team, which required us to create an Anaconda environment, install libraries in it, and run the model. Running the model was very manual: a data engineer would spin up an EC2 instance, set up the model, download the files to the EC2 instance, process the files using the model, and take the output for further processing.
We are trying to move away from this to an automated pipeline where an Airflow DAG orchestrates all of this. The point where I am struggling is the part that runs the model.
These are the logical steps I am thinking of. Please let me know if you think this would be feasible. All of these will be done in Airflow. Steps 2, 3 and 4 are the ones I am totally unsure how to achieve.
1. Download files from FTP to S3
2. Dynamically spin up a Kubernetes cluster and create parallel pods based on the number of files to be processed
3. Split files between those pods so each pod can only process its subset of files
4. Collate output of model from each pod into S3 location
5. Do post-processing on them
I am unsure how I can spin up a Kubernetes cluster from Airflow at runtime, and especially how to split files between pods so that each pod processes only its own chunk of files and pushes its output to a shared location.
The model runs in two modes: Daily and Complete. Daily processes the delta of files added since the last run, whereas Complete is a historical reprocessing of the whole document catalogue that we run every 6 months. As you can imagine, the back catalogue requires a lot of pods running in parallel to process that number of documents.
I know this is a very generic post, but my lack of Kubernetes knowledge is the issue, and any help pointing me in the right direction would be appreciated.
Normally people schedule containers or pods as needed on top of an existing K8s cluster; I am not sure how frequently you need to create the K8s cluster.
K8s cluster setup:
You can create the K8s cluster in different ways, depending on the cloud provider and the options it offers (SDK, CLI, etc.).
Here is one example you can use with Airflow to create AWS EKS clusters: https://leftasexercise.com/2019/04/01/python-up-an-eks-cluster-part-i/
Most cloud providers support a CLI, so you may be able to create the K8s cluster using just the CLI.
If you want to use GCP GKE, you can also check the operators for creating a cluster: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html
Split files between those pods so each pod can only process its subset of files
This depends more on the file structure. You can mount the S3 bucket directly into all pods, or you can keep the files on NFS and mount that into the pods, but in either case you have to manage the directory structure accordingly.
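One way to handle the split without managing directories by hand is to give every pod the same file listing and let each pod pick a deterministic slice. The sketch below assumes the pods are run as a Kubernetes Indexed Job, which exposes each pod's index through the `JOB_COMPLETION_INDEX` environment variable. That is an assumption about the setup, not something stated in the question, and the helper names are made up for this example.

```python
import os


def files_for_pod(all_files, pod_index, pod_count):
    """Return the files this pod should process.

    The listing is sorted first so every pod sees the same order, then
    each pod takes every pod_count-th file starting at its own index.
    """
    if pod_count < 1 or not 0 <= pod_index < pod_count:
        raise ValueError("pod_index must satisfy 0 <= pod_index < pod_count")
    return sorted(all_files)[pod_index::pod_count]


def current_pod_index(environ=os.environ):
    """Read the pod's index; defaults to 0 when run outside an Indexed Job."""
    return int(environ.get("JOB_COMPLETION_INDEX", "0"))


if __name__ == "__main__":
    # Hypothetical listing; in practice this could come from an S3 prefix.
    listing = ["doc3.txt", "doc1.txt", "doc2.txt", "doc5.txt", "doc4.txt"]
    print(files_for_pod(listing, current_pod_index(), pod_count=2))
```

Because the slices are disjoint and together cover the whole listing, no coordination between pods is needed as long as they all see the same set of files.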
Collate output of model from each pod into s3 location
You can use boto3 to upload files to S3, or mount the S3 bucket directly into the pod.
It now depends on your structure: how big the generated files are and how they are stored.
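For the boto3 route, a minimal sketch might look like the following. The bucket name, key prefix and local paths are hypothetical, and the helper `upload_outputs` is made up for this example. The client is passed in as an argument so the function can be exercised without AWS access. In a real pod it would be `boto3.client("s3")`, whose `upload_file(Filename, Bucket, Key)` method is what gets called here.

```python
import os


def upload_outputs(s3_client, local_paths, bucket, prefix):
    """Upload each local output file under `prefix` and return the S3 keys.

    Using a per-pod prefix (for example "results/pod-0") keeps pods from
    overwriting each other's output when file names collide.
    """
    uploaded = []
    for path in local_paths:
        key = f"{prefix.rstrip('/')}/{os.path.basename(path)}"
        s3_client.upload_file(path, bucket, key)
        uploaded.append(key)
    return uploaded


# Usage inside a pod (requires boto3 and AWS credentials):
#   import boto3
#   upload_outputs(boto3.client("s3"), ["/tmp/out/a.json"],
#                  "my-bucket", "results/pod-0")
```

A post-processing task can then list everything under the shared `results/` prefix to collate the output from all pods.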

kubernetes - How do I get one job to work with multiple nodes?

Currently, when I create and run a deployment, it only runs on one node.
I want several nodes to work on one task at the same time using Kubernetes.
I want all nodes to work like one computer.
Kubernetes is about managing containers and scheduling them to run across a cluster, not about “jobs” per se. Have a look at MapReduce and Apache Spark.
First you need to understand more about Kubernetes and why your current understanding of the concept might be a bit misleading. Kubernetes is a container orchestration tool that automates many of the manual processes involved in deploying, managing, and scaling containerized applications.
In other words, you can cluster together groups of hosts running Linux containers, and K8s helps you manage those clusters. To process some kind of job or data, you will need software that runs on Kubernetes.
The next step you might want to look into is the concept of distributed computing and a distributed computing model called MapReduce.
MapReduce was introduced by Google to meet the demand of the large set of users of its applications. It is used to write scalable applications that process large amounts of data in parallel on a large cluster of commodity hardware servers. Hadoop is software that has adopted MapReduce and is capable of running its programs in various languages (Python, Ruby, C++).
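To give a feel for the model, here is a minimal single-process sketch of MapReduce applied to word counting. It only illustrates the map, shuffle and reduce phases. A real framework such as Hadoop would run the map and reduce steps on different machines, and the function names here are made up for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = []
    for document in documents:
        # In a real cluster, each document could be mapped on a different node.
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(word_count(["the cat sat", "the dog sat down"]))
```

The map calls are independent of each other, which is what lets a framework spread them across many nodes and have the cluster behave like one computer for that job.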
Take a look at this Medium article about a distributed computing system based on MapReduce and Kubernetes.

Best way to setup ELK in production

Is it good practice to set up Elasticsearch, Logstash and Kibana on 3 different servers, each with 8GB of RAM?
Or
set up ELK on a single machine with more memory (16GB)?
The machine needs to be highly available.
Can anyone suggest or share inputs?
It depends on your task and situation. Normally it is good practice to set up Elasticsearch, Logstash and Kibana on 3 different servers. If you have more data, you will have to build an Elasticsearch cluster, and you may need more than one Logstash server.
Filebeat will run on all the data (log) servers.
Here is an example of handling 25,000 logs per second:
https://engineering.viki.com/blog/2015/log-processing-at-scale-elk-cluster-at-25k-events-per-second/
It's slightly more complicated than explained here.
Any distributed component tries to offer its features in a sharded or partitioned way. Likewise, Elasticsearch in ELK is based on a master/slave model and keeps its data on ES data nodes. This means you need to set up a cluster of nodes for Elasticsearch itself, for its various components such as ES master, ES data and ES client.
The next level, if the system needs to be production grade, is a multi-master setup with a minimum of 3 master nodes.
This would be the beginning of ELK.
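The minimum of 3 master nodes follows from majority-based master election: a cluster stays operational only while more than half of the master-eligible nodes can see each other. The short sketch below works through that arithmetic. The helper names are made up for illustration and are not part of any Elasticsearch API.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while keeping a majority."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


if __name__ == "__main__":
    for nodes in (1, 2, 3, 5):
        print(nodes, "masters: quorum", quorum(nodes),
              "tolerates", tolerated_failures(nodes), "failure(s)")
```

With 1 or 2 master-eligible nodes no failure can be tolerated, so 3 is the smallest count that survives the loss of one master.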
If you need to run such a complex system on limited resources, then containerizing the ELK components and running them in a container orchestration framework is the recommended option. Kubernetes and Docker Swarm are options for running an ELK cluster from Dockerized instances of ELK. These orchestration frameworks also require a multi-master setup, but that is a fair cost, since a cloud environment has many more components and all of them can be controlled by these frameworks.

Use OpenStack HEAT to install and setup MongoDB cluster

We need to be able to install a MongoDB cluster on OpenStack, declaring the number of shards and the size of the replica set as parameters.
Is there a way to achieve this using a HEAT template, passing in <shardCount> and <replicaSetSize> as parameters?
Can this be done by HEAT or does it require complicated scripting and on-the-fly template generation?