What is the suggested workflow when working on a Kubernetes cluster using Dask?

I have set up a Kubernetes cluster using Kubernetes Engine on GCP to work on some data preprocessing and modelling using Dask. I installed Dask using Helm following these instructions.
Right now, I see that there are two folders, work and examples. I was able to execute the contents of the notebooks in the examples folder, confirming that everything is working as expected.
My questions are as follows:
What is the suggested workflow to follow when working on a cluster? Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment? Would you just manually move them to a bucket every time you upgrade (which seems tedious)? Or would you create a simple VM instance, prototype there, then move everything to the cluster when running on the full dataset?
I'm new to working with data in a distributed environment in the cloud so any suggestions are welcome.

What is the suggested workflow to follow when working on a cluster?
There are many workflows that work well for different groups. There is no single blessed workflow.
Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
Sure, that would be fine.
How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment?
You might save your data to some more permanent store, like cloud storage, or a git repository hosted elsewhere.
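For example, here is a minimal sketch of copying the notebook folder out of the cluster and into a GCS bucket before an upgrade. The release name, namespace, label selector, and bucket are assumptions to adjust for your setup, and /home/jovyan/work is where the chart's Jupyter image typically keeps the work folder.

    # All names here are hypothetical; adjust them to your release and project.
    RELEASE=my-dask
    NAMESPACE=default
    BUCKET=gs://my-dask-notebooks          # a GCS bucket you own

    # Find the Jupyter pod created by the Dask Helm chart
    # (check `kubectl get pods --show-labels` if this selector doesn't match).
    JUPYTER_POD=$(kubectl get pods -n "$NAMESPACE" \
      -l app=dask,component=jupyter \
      -o jsonpath='{.items[0].metadata.name}')

    # Copy the notebooks out of the pod, then push them to Cloud Storage.
    kubectl cp "$NAMESPACE/$JUPYTER_POD:/home/jovyan/work" ./work-backup
    gsutil -m cp -r ./work-backup "$BUCKET/"

    # With the work safely stored, the Helm upgrade can proceed.
    helm upgrade "$RELEASE" dask/dask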
Would you just manually move them to a bucket every time you upgrade (which seems tedious)?
Yes, that would work (and yes, it is)
Or would you create a simple VM instance, prototype there, then move everything to the cluster when running on the full dataset?
Yes, that would also work.
In Summary
The Helm chart includes a Jupyter notebook server for convenience and easy testing, but it is no substitute for a full-fledged, long-term persistent productivity suite. For that you might consider a project like JupyterHub (which handles the problems you list above) or one of the many enterprise-targeted variants on the market today. It would be easy to use Dask alongside any of those.

Related

How to modify Kubeflow source code before deploying it with Kubernetes?

I encountered the same issue as in https://github.com/kubeflow/kubeflow/issues/6014 with my Kubeflow app. The fix is very simple (just a type cast), so I would like to fix it myself and redeploy Kubeflow.
The problem is that I am running a k3s cluster on my local machine, where I have installed the Kubeflow bundle via Juju, so I cannot change the source code.
How to modify Kubeflow source code before deploying it with Kubernetes?
Should I use the manifest installation (https://github.com/kubeflow/manifests#installation), or a totally different method?
Thank you.
The bug was fixed in the latest version of the manifests, so I ended up installing Kubeflow directly from the manifests.
I am still in touch with a Kubeflow developer, and I will post the right way to modify and redeploy here if anyone is interested.
You need to check out their GitHub repo, make your changes, and use kustomize to install as explained in their wiki. If you check the example folder, you can see that it points to all the other component folders.
https://github.com/kubeflow/manifests#install-with-a-single-command
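Roughly, that workflow looks like the sketch below; the retry loop mirrors the single-command install from the manifests README, since some CRDs need a moment to register before dependent resources apply cleanly.

    git clone https://github.com/kubeflow/manifests.git
    cd manifests
    # ...edit the component manifests or images you need to change...

    # Build everything referenced from the example folder and apply it,
    # retrying until all resources go through.
    while ! kustomize build example | kubectl apply -f -; do
      echo "Retrying to apply resources"
      sleep 10
    done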
Another hack could be to look for the controllers in Kubernetes (e.g., the deployments created for Kubeflow) and modify them directly; this works only if your changes are limited to Kubernetes resource definitions. I suggest going with the first option above for a clean development experience, and that way you can contribute back to the Kubeflow project as well if your changes will benefit others.
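For that quick hack, something like the following; the kubeflow namespace is the usual default but depends on how you installed it.

    # List the deployments Kubeflow created, then edit the one you need in place.
    kubectl get deployments -n kubeflow
    kubectl edit deployment <deployment-name> -n kubeflow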

What's the anatomy of a Bluemix/Cloud Foundry node red project?

There's lots of documentation and a kludgy console to set up continuous deployment in Cloud Foundry, but I haven't found any documentation on what the artifacts inside a repository need to be.
I don't want to cut-and-paste flows from the Node-RED editor. If that's the only way, then IBM is not ready for prime time. I'm also aware that most everything about my flows is in the Cloudant nodered db.
A node red application is more than the flows though. What about my _design docs for my dbs?
I need device info and other stuff from the Watson console, Cloudant info and my flows packaged up into something deployable.
Has anyone scripted this?
What I mean by this is that I can clone a Docker project, an npm project, and all sorts of projects that implement a build->test->push mechanism. They employ a configuration script of some sort (e.g. package.json) and contain a bunch of source files for the actual application, test scripts, db scripts, whatever is necessary to deploy the application and its environment onto a host. I see lots of documentation on the toolchain and its features, but I'm not clear on whether it's possible to make use of it for my hosted Node-RED application, or whether I have to write the scripting mechanisms myself to offload flow info from the nodered db and query all my other dbs for their respective _design docs and all the other configuration information required to set up an IoT Node-RED application.
I forgot to mention that the copy/paste method loses information; you get no tab-level metadata. The only way to get all the flow stuff is to pull it from the nodered flow record.
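For reference, dumping everything straight out of Cloudant is just a standard CouchDB API call; the credential placeholders and the nodered database name below are taken from the question and may need adjusting.

    # Host and credentials come from your Cloudant service credentials in Bluemix.
    CLOUDANT_URL="https://<username>:<password>@<cloudant-host>"

    # Dump every document in the nodered db (flow record, credentials, settings),
    # so nothing, including tab-level metadata, is lost.
    curl -s "$CLOUDANT_URL/nodered/_all_docs?include_docs=true" > nodered-export.json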
Node-RED will release a new version in a couple of days that will introduce projects, so you'll be able to use GitHub and all the usual tools to handle your app: https://twitter.com/NodeRED/status/956934949784956931 and https://nodered.org/docs/user-guide/projects/
While it doesn't address your short-term needs, I think it's the best long-term solution. Hopefully that helps.

Learning to use Kubernetes on a single computer

I need to learn how to use Kubernetes. I've read the first sentences of a couple of introductory tutorials and have never found one that explains, step by step, how to build a simulated real-world example on a single computer.
Is Kubernetes by nature so distributed that even the 101-level tutorials can only be performed on clusters?
Or can I learn (and execute examples of) all the important stuff just by using my laptop, without needing a stack of Raspberry Pis, AWS, or GCP?
The easiest might be minikube.
Minikube is a tool that makes it easy to run Kubernetes locally. Minikube runs a single-node Kubernetes cluster inside a VM on your laptop for users looking to try out Kubernetes or develop with it day-to-day.
For a resource that explains how to use this, try this getting started guide. It runs through an entire example application using a local development environment.
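A minimal first session looks roughly like this; the commands are a sketch, and the image is the one used in the official hello-minikube example.

    # Start a local single-node cluster.
    minikube start

    # Deploy and expose a small test application.
    kubectl create deployment hello --image=k8s.gcr.io/echoserver:1.4
    kubectl expose deployment hello --type=NodePort --port=8080
    kubectl get pods            # watch the pod come up
    minikube service hello      # open the exposed service in a browser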
If you are okay with using Google Cloud Platform (I think one gets free credits initially), there is hello-node.
If you want to run the latest and greatest (not necessarily stable) and you're using Linux, it is also possible to spin up a local cluster from a cloned copy of the Kubernetes sources, using hack/local-up-cluster.sh.
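A sketch of that route; it roughly requires Docker, etcd, and a Go toolchain installed locally.

    git clone https://github.com/kubernetes/kubernetes.git
    cd kubernetes
    # Builds the binaries and starts a local cluster in the foreground.
    ./hack/local-up-cluster.sh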

Docker deployment options

I'm wondering which options there are for Docker container deployment in production, given that I have separate APP and DB server containers plus data-only containers, one holding deployables and the other holding database files.
I just have one server for now, which I would like to "docker enable", but what is the best way to deploy there (remotely would be the best option)?
I just want to hit a button and some tool will take care of stopping, starting, exchanging all needed docker containers.
There is a myriad of tools (Fleet, Flocker, Docker Compose, etc.), and I'm overwhelmed by the choices.
The only thing I'm clear on is that I don't want to build images with code from a git repo; I would like to have Docker images as wrappers for my releases. Have I grasped the Docker ideas from the wrong end?
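To make that concrete, this is roughly what I do by hand today and what I would like a tool to automate; all image and container names are made up.

    # Made-up names; the release image is built and pushed elsewhere.
    docker pull registry.example.com/myapp:1.2.3

    # Exchange the running app container for the new release, reattaching
    # the data-only container that holds the deployables.
    docker stop myapp && docker rm myapp
    docker run -d --name myapp \
      --volumes-from myapp-data \
      -p 80:8080 \
      registry.example.com/myapp:1.2.3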
My team recently built a Docker continuous deployment system and I thought I'd share it here since you seem to have the same questions we had. It pretty much does what you asked:
"hit a button and some tool will take care of stopping, starting, exchanging all needed docker containers"
We had the challenge that our Docker deployment scripts were getting too complex. Our containers depend on each other in various ways to make the full system so when we deployed, we'd often have dependency issues crop up.
We built a system called "Skopos" to resolve these issues. Skopos detects the current state of your running system, identifies any changes being made, and then automatically plans and deploys the update into production. It creates deployment plans dynamically for each deployment based on a comparison of current state and desired state.
It can help you continuously deploy your application or service to production using tags in your repository to automatically roll out the right version to the right platform while removing the need for manual procedures or scripts.
It's free, check it out: http://datagridsys.com/getstarted/
You can import your system in 3 ways:
1. If you have a Docker Compose file, we can suck that in and start working with it.
2. If your app is running, we can scan it and then start working with it.
3. If you have neither, you can create a quick descriptor file in YAML and then we can understand your current state.
I think most people start their container journey using tools from Docker Toolbox. Those tools provide a good start and work as promised, but you'll end up wanting more. With these tools you are missing, for example, integrated overlay networking, DNS, load balancing, aggregated logging, VPN access, and a private image repository, which are crucial for most container workloads.
To solve these problems we started to develop Kontena, a Docker container orchestration platform. While Kontena works great for all types of businesses and may be used to run containerized workloads at any scale, it's best suited for start-ups and small to medium sized businesses that require a worry-free and simple-to-use platform to run containerized workloads.
Kontena is an open source project and you can view it on GitHub.

Amazon EC2 Auto Scaling in production

I have realized that I have to make an image from the EBS volume every time I change my code and then update the Auto Scaling configuration every time (this is really bad).
I have heard that some people load their newest code from GitHub or do something similar, so that the server gets the newest code automatically without making a new image every single time.
I already have a private GitHub repository.
Is that the only way to solve Auto Scaling code management?
If so, how can I configure this to work?
Use user-data scripts, which work on a lot of public images including Amazon's. You could have the script download Puppet manifests/templates/files and run them directly. Search for "masterless Puppet".
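A hedged sketch of such a user-data script; the repository URL, paths, and bootstrap commands are placeholders, and a deploy key or token for the private repo must already be available on the instance.

    #!/bin/bash
    # Runs at first boot via cloud-init, before the instance goes into service.
    set -euo pipefail

    # Pull the latest release from the private repository.
    git clone --depth 1 https://github.com/example/myapp.git /opt/myapp

    # Masterless Puppet variant: apply the manifests shipped with the repo.
    # puppet apply /opt/myapp/puppet/site.pp

    # Or run the application's own bootstrap script.
    /opt/myapp/bin/bootstrap.sh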
Yes, you can configure your AMI so that the instance loads the latest software and configuration on first boot before it is put into service in the auto scaling group.
How to set up a startup script may depend on the specific OS and version you are running.