I'm trying to create a Kafka cluster automatically instead of manually, using the stable chart: https://github.com/helm/charts/tree/master/stable/kafka-manager
In the templates folder there are two .yaml files, configmap.yaml and job.yaml. What are these files, and what role does each one play?
A ConfigMap is just a way to store non-confidential data as key-value pairs; you can also consume this data as environment variables from the pods. (It doesn't provide secrecy or encryption!)
job.yaml defines a Job, which is a supervisor for pods carrying out batch processes, that is, processes that run for a certain time to completion, for example a calculation or a backup operation.
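To make the two object kinds concrete, here is a minimal sketch in Python that builds a ConfigMap and a Job that reads it as environment variables, then prints them as JSON (which kubectl also accepts). This is not what the chart's templates contain; the names `demo-config`, `demo-job`, `ZK_HOSTS` and the `busybox` image are made up for illustration.

```python
import json


def build_configmap(name, data):
    """Return a ConfigMap manifest holding plain (non-secret) key-value pairs."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": dict(data),
    }


def build_job(name, image, command, configmap_name):
    """Return a Job manifest whose container gets the ConfigMap keys as env vars."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # A Job pod runs to completion, so it must not restart forever.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "command": list(command),
                            "envFrom": [{"configMapRef": {"name": configmap_name}}],
                        }
                    ],
                }
            }
        },
    }


def render(manifests):
    """Wrap several objects in a v1 List and serialize them as JSON."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": list(manifests)}, indent=2
    )


if __name__ == "__main__":
    # Hypothetical names and values, for illustration only.
    cm = build_configmap("demo-config", {"ZK_HOSTS": "zookeeper:2181"})
    job = build_job("demo-job", "busybox", ["sh", "-c", "echo $ZK_HOSTS"], "demo-config")
    print(render([cm, job]))
```

Saving the output to a file and running kubectl apply -f on it should create both objects; the Job's pod then runs once and exits.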
hope it answers your question, let me know if you need anything else. :)
Architecture Overview:
We have Grafana running on Azure Kubernetes Service and Persistent Storage has been enabled.
We store the Grafana Helm value file in git and use ArgoCD for CD.
Now we need to decide whether or not to store all dashboards and data source configs in the Helm values file.
Advantage:
In case something is wrong with the persistent storage (e.g. disaster, someone deletes all Grafana resources), we can easily re-create Grafana resources.
Disadvantages:
If a user would like to modify a data source, he/she has to notify the Grafana admin to update the Helm value file, then have a normal code review process to apply the change. So the users have to wait for a certain amount of time for a tiny little change.
Later we may have more Grafana users, who may create new data sources or dashboards as they wish. We simply cannot source control everything.
I was wondering whether there is a common best practice for this situation, or does it really depend on our own circumstances?
Thanks!
This is more of an architecture question. I have a data engineering background and have been using Airflow to orchestrate ETL tasks for a while. I have limited knowledge of containerization and Kubernetes. My task is to come up with a good-practice framework for productionizing our data science models using an orchestration engine, namely Airflow.
Our data science team creates many NLP models to process text documents from various sources. Previously the model was created by an external team, which required us to create an Anaconda environment, install libraries into it, and run the model. Running the model was very manual: a data engineer would spin up an EC2 instance, set up the model, download the files to the instance, process them with the model, and take the output for further processing.
We are trying to move away from this to an automated pipeline where an Airflow DAG orchestrates all of it. The point where I am struggling is running the model.
These are the logical steps I am thinking of. Please let me know if you think this would be feasible. All of these will be done in Airflow. Steps 2, 3 and 4 are the ones I am totally unsure how to achieve.
Download files from ftp to s3
**Dynamically spin up a Kubernetes cluster and create parallel pods based on the number of files to be processed.
Split files between those pods so each pod can only process its subset of files
Collate output of model from each pod into s3 location**
Do post processing on them
I am unsure how I can spin up a Kubernetes cluster from Airflow at runtime, and especially how to split files between pods so that each pod only processes its own chunk of files and pushes its output to a shared location.
The model runs in two modes, Daily and Complete. Daily is a delta of the files that have been added since the last run, whereas Complete is a historical reprocessing of the whole document catalogue that we run every 6 months. As you can imagine, the back catalogue requires a lot of pods running in parallel to process that number of documents.
I know this is a very generic post, but my lack of Kubernetes knowledge is the issue, and any help pointing me in the right direction would be appreciated.
Normally people schedule containers or pods as needed on top of an existing k8s cluster; I am not sure how frequently you would need to create the k8s cluster itself.
K8s cluster setup :
You can create the K8s cluster in different ways, depending mostly on the cloud provider and the options it offers, such as an SDK or a CLI.
Here is one example you can use from Airflow to create AWS EKS clusters: https://leftasexercise.com/2019/04/01/python-up-an-eks-cluster-part-i/
Most cloud providers support a CLI option, so you may be able to create the K8s cluster with the CLI alone.
If you want to use GCP GKE, you can also look at the operators for creating a cluster: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html
Split files between those pods so each pod can only process its subset of files
This depends more on the file structure. You can mount the S3 bucket directly into all pods, or keep the files on NFS and mount that into the pods, but in either case you have to manage the directory structure accordingly so that each pod reads only its own subset.
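If you would rather split by file list than by directory, one possible sketch is to give every pod the same sorted list of keys and let it pick its own slice from its index. The function names are made up, and how the pod learns its index is an assumption: here it is read from the JOB_COMPLETION_INDEX environment variable that Kubernetes sets for Jobs running in Indexed completion mode, and NUM_WORKERS is a hypothetical variable you would set yourself.

```python
import os


def split_round_robin(keys, num_workers):
    """Split keys into num_workers lists; every key lands in exactly one list."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Sort first so every pod computes the same split from the same listing.
    ordered = sorted(keys)
    return [ordered[i::num_workers] for i in range(num_workers)]


def keys_for_worker(keys, worker_index, num_workers):
    """Return only the keys that the pod with this index should process."""
    return split_round_robin(keys, num_workers)[worker_index]


if __name__ == "__main__":
    # Hypothetical file names; in practice this would be the S3 or NFS listing.
    all_keys = ["doc-%03d.txt" % n for n in range(10)]
    # Assumption: the pod runs in an Indexed Job, which sets JOB_COMPLETION_INDEX.
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    # Assumption: NUM_WORKERS is injected by whatever creates the Job.
    workers = int(os.environ.get("NUM_WORKERS", "3"))
    print(keys_for_worker(all_keys, index, workers))
```

Round-robin keeps the chunks within one file of each other in size; if file sizes vary a lot, you may prefer to balance by bytes instead.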
Collate output of model from each pod into s3 location
You can use boto3 to upload the files to S3, or you can mount the S3 bucket directly into the pod.
Which one fits better now depends on your structure: how big the generated files are and how they are stored.
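For the boto3 route, a minimal sketch could look like the following. The helper name, bucket and prefix are hypothetical; the only boto3 call it relies on is the S3 client's upload_file(Filename, Bucket, Key). The client is passed in as an argument so the function can be tried without AWS credentials.

```python
import os


def upload_outputs(s3_client, local_dir, bucket, prefix):
    """Upload every file under local_dir to s3://bucket/prefix/, keeping subfolders.

    s3_client is expected to behave like boto3.client("s3"), i.e. to offer
    upload_file(Filename, Bucket, Key). Returns the sorted list of keys written.
    """
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # Build the key from the path relative to local_dir, using "/" separators.
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = prefix.rstrip("/") + "/" + rel
            s3_client.upload_file(path, bucket, key)
            uploaded.append(key)
    return sorted(uploaded)


if __name__ == "__main__":
    import boto3  # needs boto3 installed and AWS credentials configured

    # Hypothetical bucket and prefix; a per-pod prefix avoids name clashes.
    keys = upload_outputs(boto3.client("s3"), "./output", "my-bucket", "model-output/pod-0")
    print(keys)
```

Giving each pod its own prefix means the post-processing step can collate results simply by listing everything under the shared parent prefix.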
Just a quick question: do you HAVE to remove or move the default kube-scheduler.yaml from the folder? Can't I just make a new YAML (with the custom scheduler) and run that in the pod?
Kubernetes isn't file-based. It doesn't care about the file location. You use the files only to apply the configuration to the cluster via kubectl, kubeadm, or similar CLI tools or their libraries. The YAML is only the content you manually put into it.
You need to know or decide what your folder structure and your execution/configuration flow are.
Also, you can simply use a temporary file; the naming doesn't matter either, and it's all right to replace the content of a YAML file. Preferably, though, keep some kind of history, such as a manual note, a comment, or source control such as git, so you know what was changed and why.
So yes, you can change the scheduler YAML, or you can create a new file and reorganize things however you like, but you will need to adjust your flow to match (change paths, etc.).
I'm trying to deploy Node.js code to a Kubernetes cluster, and I see that in my reference (provided by the maintainer of the cluster) the YAML files are all prefixed with numbers:
00-service.yaml
10-deployment.yaml
etc.
I don't think that this file format is specified by kubectl, but I found another example of it online: https://imti.co/kibana-kubernetes/ (but the numbering scheme isn't the same).
Is this a Kubernetes thing? A file naming convention? Is it to keep files ordered in a folder?
This is to handle the resource creation order. There's an open issue in Kubernetes:
https://github.com/kubernetes/kubernetes/issues/16448#issue-113878195
tl;dr kubectl apply -f k8s/* should handle the order but it does not.
However, apart from the namespace, I cannot imagine where the order would matter. Every relation except the namespace is handled by label selectors, so it fixes itself once all resources are deployed. You can just have 00-namespace.yaml and everything else without prefixes. Or skip prefixes altogether unless you really hit the issue (I never have).
When you execute kubectl apply *, the files are applied in alphabetical order. Prefixing the files with an increasing number lets you control that order. But in nearly all cases the order shouldn't matter.
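A quick way to see what the prefixes do is to sort the file names the way a plain alphabetical listing would. The sketch below uses Python's default string sort as a stand-in for that ordering (an approximation; the exact collation can depend on the shell's locale), and the file names other than the two from the question are made up.

```python
def apply_order(filenames):
    """Return the file names in plain alphabetical (string) order."""
    return sorted(filenames)


if __name__ == "__main__":
    # The prefixes push the service ahead of the deployment.
    print(apply_order(["10-deployment.yaml", "00-service.yaml"]))
    # Without prefixes, "deployment" would sort before "service".
    print(apply_order(["service.yaml", "deployment.yaml"]))
    # Prefixes are compared as text, so "10" sorts before "2" unless you zero-pad.
    print(apply_order(["2-b.yaml", "10-a.yaml"]))
```

The last case is why such schemes usually use fixed-width prefixes like 00, 10, 20: the comparison is textual, not numeric.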
Sequencing helps readability, user-friendliness and, not least, maintainability. Looking at the resources, one can tell in which order the deployment needs to be performed. For example, the pods of a Deployment that references a ConfigMap cannot start until that ConfigMap has been created.
I have installed Deis Workflow v2.11 in a GKE cluster, and some of our applications share common values, like a proxy URL and credentials. I can use these values by putting them into environment variables, or even into a .env file.
However, for every new application I need to create a .env file with the shared values and then call
deis config:push
If one of those shared values changes, I need to adjust the configuration of every app and restart them. I would like to modify the value in a ConfigMap once and have Deis restart the applications after the change.
Does anyone know whether it is possible to read values from a Kubernetes ConfigMap and put them into Deis environment variables? If so, how do I do it?
I believe what you're looking for is a way to set environment variables globally across all applications. That is currently not implemented. However, please feel free to hack up a PR and we'd likely accept it!
https://github.com/deis/controller/issues/383
https://github.com/deis/controller/issues/1219
Currently there is no support for ConfigMaps in Deis Workflow v2.18.0. We would appreciate a PR into Hephy Workflow (the open source fork of Deis Workflow): https://github.com/teamhephy/controller
There is no functionality right now for the containers' init scripts to capture a ConfigMap.
You could update the configMap, but each of the applications would need to run kubectl replace -f path/accessible/for/everyone/configmap.yaml to get the variables updated.
So, I would say yes, at Kubernetes level you can do it. Just figure out the best way for your apps to update the configMap. I don't have details of your use case, so I can't tell you specific ways.
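As one possible workaround outside of Deis itself, you could treat the ConfigMap as the single source of truth and regenerate each app's .env file from it before pushing. The sketch below only does the conversion step; it assumes the input is the JSON printed by kubectl get configmap <name> -o json, the function name is made up, and it does not handle values that contain newlines.

```python
import json


def configmap_to_dotenv(configmap_json):
    """Turn the JSON of a ConfigMap into KEY=value lines for a .env file."""
    data = json.loads(configmap_json).get("data") or {}
    # Sort the keys so the generated file is stable between runs.
    return "".join("%s=%s\n" % (key, value) for key, value in sorted(data.items()))


if __name__ == "__main__":
    # Hypothetical ConfigMap content, shaped like kubectl's JSON output.
    sample = json.dumps(
        {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {"name": "shared-settings"},
            "data": {"PROXY_URL": "http://proxy.example:3128", "PROXY_USER": "svc"},
        }
    )
    print(configmap_to_dotenv(sample), end="")
```

Writing the result to .env and running deis config:push for each app would still be a per-app step, so this only removes the manual editing, not the fan-out. Also keep in mind that a ConfigMap is not meant for credentials.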