I'm new to Nextflow. We would like to build our workflow using Nextflow and have Nextflow deploy the workflow to a large multi-institution Kubernetes cluster that we use.
In this cluster we don't have admin permissions; we have a namespace we work in. Also, pods in our cluster have limited resources, but jobs have unlimited resources.
Looking at the documentation for Nextflow + Kubernetes, it says that the workflow runs under a Kubernetes pod, which raises red flags for me because of the limitations on pods in our cluster.
Is there a way to execute Nextflow workflows as Kubernetes Jobs instead of Pods? What are my options in this area?
There might be a new feature as of Nextflow 22.04. Quoting Ben Sherman's Nextflow k8s best practices page:
In Nextflow v22.04 and later, the k8s executor can be configured to use Jobs instead of Pods directly.
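If your Nextflow version supports it, a minimal nextflow.config sketch might look like this (the computeResourceType setting comes from the Nextflow docs; the namespace and service account values are placeholders for your own):

    // nextflow.config -- sketch, assumes Nextflow >= 22.04 with Job support
    process {
        executor = 'k8s'
    }

    k8s {
        namespace           = 'my-namespace'   // placeholder: your team's namespace
        serviceAccount      = 'default'        // placeholder service account
        computeResourceType = 'Job'            // submit each task as a Job, not a bare Pod
    }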
Based on a conversation on https://gitter.im/nextflow-io/nextflow, Nextflow cannot run a Job; it only supports spawning Pods. One Pod is spawned for workflow control, and that Pod spawns Pods for the individual tasks.
Related
I am new to Kubernetes and I am facing some issues.
The project codebase is in Bitbucket, and on each commit a Bitbucket pipeline builds a pod in the Kubernetes cluster. The pods do some tasks and terminate once the task is completed. When there are many commits, the cluster fails due to the large number of pods. So I am trying to find a way to queue pods in the Kubernetes cluster, so that running pods use all the resources of the cluster and, after they terminate, the next pods in the queue run, and so on. Any help?
You can set pod resource requirements so that when you create a pod whose requirements cannot be satisfied, the scheduler will put it in the Pending state and schedule it as soon as resources become available.
The only downside of this solution is that it does not guarantee that pods will be created in the same order as the API requests that created them.
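As an illustrative sketch, a pod spec with requests and limits looks like this (name, image, and numbers are made up):

    # illustrative pod with resource requests/limits
    apiVersion: v1
    kind: Pod
    metadata:
      name: build-task              # hypothetical name
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-build-image:latest   # hypothetical image
        resources:
          requests:                 # scheduler won't place the pod until these fit on a node
            cpu: "500m"
            memory: "512Mi"
          limits:                   # hard caps enforced at runtime
            cpu: "1"
            memory: "1Gi"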
I am trying to set up a job scheduler (Airflow) on an EKS cluster to replace a scheduler (Jenkins) we're running directly on an EC2 instance. This job scheduler should be able to deploy pods to the EKS cluster it's running on.
However, whenever I try to deploy the pod (with a pod manifest), I get the following error message:
Error from server (Forbidden): error when creating "deployment.yaml": pods "simple-pod" is forbidden: pod does not have "kubernetes.io/config.mirror" annotation, node "ip-xx.ec2.internal" can only create mirror pods
I believe the restriction has to do with the NodeRestriction plugin on the kube-apiserver running on the EKS Control Plane.
I have looked through the documentation to see if I can turn this plugin off; however, it does not appear to be possible through kubectl, only by modifying the kube-apiserver configuration on the control plane itself.
Is it possible to turn off this plugin? Or is it possible to label a node or pod to mark that it is not subject to this plugin? More broadly, is running a job scheduler on EKS that assigns jobs to the same cluster a bad design choice?
If we wanted to containerize and deploy our job scheduler, would we need to instantiate a separate EKS cluster or other service to run it on?
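For what it's worth, that error message usually means the request was authenticated with the node's own (kubelet) credentials, which the NodeRestriction plugin limits to creating mirror pods; the usual alternative is to run the scheduler with a dedicated ServiceAccount that is allowed to create pods. A minimal RBAC sketch, with all names hypothetical:

    # hypothetical ServiceAccount + RBAC letting an in-cluster scheduler create pods
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: scheduler-sa
      namespace: airflow
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: pod-launcher
      namespace: airflow
    rules:
    - apiGroups: [""]
      resources: ["pods", "pods/log"]
      verbs: ["create", "get", "list", "watch", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: scheduler-pod-launcher
      namespace: airflow
    subjects:
    - kind: ServiceAccount
      name: scheduler-sa
      namespace: airflow
    roleRef:
      kind: Role
      name: pod-launcher
      apiGroup: rbac.authorization.k8s.io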
I am implementing continuous integration and continuous deployment using Ansible, Docker, Jenkins and Kubernetes. I have already created a Kubernetes cluster with 1 master and 2 worker nodes using Ansible and a Kubespray deployment. And I have 30-40 microservice applications, so I need to create that many services and deployments.
My Confusion
When I am using the Kubernetes package manager Helm with charts, do I need to initialize my chart on the master node or on the base machine from which I deployed my Kubernetes cluster?
If I initialize it on the master, can I then use kubectl over SSH to deploy to the remote worker nodes?
If I initialize it outside the Kubernetes cluster nodes, can I use the kubectl command to deploy into the Kubernetes cluster?
Your confusion seems to lie in the configuration and interactions of the Helm components. This explanation provides a good graphic to represent the relationships.
If you are using the traditional Helm/Tiller configuration, Helm will be installed locally on your machine and, assuming you have the correct kubectl configuration, you can "initialize" your cluster by running helm init to install Tiller into your cluster. Tiller will run as a deployment in kube-system, and has the RBAC privileges to create/modify/delete/view the chart resources. Helm will automatically manage all the API objects for you, and the kube-scheduler will schedule the pods to all your nodes accordingly. You should not be directly interacting with your master and nodes via your console.
In either configuration, you would always be making the Helm deployment from your local machine with kubectl access to your cluster.
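As a sketch of that flow with Helm 2 (Tiller-era) commands, run from your local machine against whatever cluster your kubeconfig points at (release, chart, and namespace names are just examples):

    # Helm 2 (Tiller) era, run from your local machine:
    helm init    # installs Tiller as a deployment in kube-system

    # Deploy a chart; Helm/Tiller create the API objects, kube-scheduler places the pods
    helm install stable/nginx-ingress --name my-release --namespace my-namespace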
Hope this helps!
If you are looking for a way to run the Helm client inside your Kubernetes cluster, check out the concept of the Helm Operator.
I would also recommend looking into the term "GitOps": a set of practices which combines Git with Kubernetes and makes Git the source of truth for your declarative infrastructure and applications.
There are two great OSS projects out there that implement GitOps best practices:
flux (uses Helm-Operator)
Jenkins-x (uses helm as a part of release pipeline, check out this session on YT to see it in action)
What we want to achieve:
We would like to use Airflow to manage our machine learning and data pipelines while using Kubernetes to manage the resources and schedule the jobs. What we would like to achieve is for Airflow to orchestrate the workflow (e.g. various task dependencies, re-running jobs upon failure) and for Kubernetes to orchestrate the infrastructure (e.g. cluster autoscaling and the assignment of individual jobs to nodes). In other words, Airflow will tell the Kubernetes cluster what to do and Kubernetes decides how to distribute the work. At the same time we would also want Airflow to be able to monitor the status of the individual tasks. For example, if we have 10 tasks spread across a cluster of 5 nodes, Airflow should be able to communicate with the cluster and report something like: 3 "small tasks" are done, 1 "small task" has failed and will be scheduled to re-run, and the remaining 6 "big tasks" are still running.
Questions:
Our understanding is that Airflow has no Kubernetes operator; see the open issue at https://issues.apache.org/jira/browse/AIRFLOW-1314. That being said, we don't want Airflow to manage resources (service accounts, environment variables, creating clusters, etc.), but simply to send tasks to an existing Kubernetes cluster and let Airflow know when a job is done. An alternative would be to use Apache Mesos, but it looks less flexible and less straightforward compared to Kubernetes.
I guess we could use Airflow's bash_operator to run kubectl, but this does not seem like the most elegant solution.
Any thoughts? How do you deal with that?
Airflow has both a Kubernetes Executor and a Kubernetes Operator.
You can use the Kubernetes Operator to send tasks (in the form of Docker images) from Airflow to Kubernetes via whichever AirflowExecutor you prefer.
Based on your description though, I believe you are looking for the KubernetesExecutor to schedule all your tasks against your Kubernetes cluster. As you can see from the source code it has a much tighter integration with Kubernetes.
This will also allow you to not have to worry about creating the Docker images ahead of time, as is required with the Kubernetes Operator.
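For illustration, a minimal DAG using the Kubernetes (Pod) Operator might look like the sketch below (import paths and DAG arguments vary across Airflow versions; the namespace and image here are placeholders):

    # Sketch only: one Airflow task that runs as a pod on the cluster.
    # Assumes the apache-airflow-providers-cncf-kubernetes package; on older
    # versions the operator lives at ...operators.kubernetes_pod instead.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    with DAG(dag_id="k8s_demo", start_date=datetime(2023, 1, 1), schedule=None) as dag:
        small_task = KubernetesPodOperator(
            task_id="small_task",
            name="small-task",                 # pod name (placeholder)
            namespace="airflow",               # placeholder namespace
            image="python:3.11-slim",          # any image containing your task code
            cmds=["python", "-c"],
            arguments=["print('hello from the cluster')"],
            get_logs=True,                     # stream pod logs back into Airflow
        )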
Kubernetes is an orchestration tool for the management of containers.
Instead of managing containers directly, Kubernetes creates pods, which contain containers.
I read this about pods
I'm working with OpenShift V3, which uses pods. But in my apps, all the demos, and all the examples I see:
One pod contains one container (it's possible to contain more, and that could be an advantage of using pods). But in an OpenShift environment I don't see the advantage of these pods.
Can someone explain to me why OpenShift V3 uses Kubernetes with pods and containers instead of an orchestration tool that works with containers directly (without pods)?
There are many cases where our users want to run pods with multiple containers within OpenShift. A common use-case for running multiple containers is where a pod has a 'primary' container that does some job, and a 'side-car' container that does something like write logs to a logging agent.
The motivation for pods is twofold -- to make it easier to share resources between containers, and to enable deploying and replicating groups of containers that share resources. You can read more about them in the user-guide.
The reason we still use a Pod when there is only a single container is that containers do not have all the notions that are attached to pods. For example, pods have IP addresses; containers do not, as they share the IP address associated with the pod's network namespace.
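To make the sharing concrete, here is an illustrative pod with a 'primary' container and a logging side-car; the two containers share the pod's IP address and, in this sketch, an emptyDir volume (all names and images are made up):

    # illustrative pod: primary app container plus a log-shipping side-car
    apiVersion: v1
    kind: Pod
    metadata:
      name: app-with-sidecar
    spec:
      volumes:
      - name: logs                  # shared between the two containers
        emptyDir: {}
      containers:
      - name: app                   # 'primary' container doing the real work
        image: my-app:latest        # hypothetical image
        volumeMounts:
        - name: logs
          mountPath: /var/log/app
      - name: log-shipper           # side-car reading what the app writes
        image: busybox
        command: ["sh", "-c", "touch /var/log/app/app.log && tail -f /var/log/app/app.log"]
        volumeMounts:
        - name: logs
          mountPath: /var/log/app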
Hope that helps. Let me know if you'd like more clarification, or we can discuss on slack.