Make RabbitMQ durable/persistent queues survive Kubernetes pod restart

Our application uses RabbitMQ with only a single node. It is run in a single Kubernetes pod.
We use durable/persistent queues, but any time that our cloud instance is brought down and back up, and the RabbitMQ pod is restarted, our existing durable/persistent queues are gone.
At first, I thought that it was an issue with the volume that the queues were stored on not being persistent, but that turned out not to be the case.
It appears that the queue data is stored in /var/lib/rabbitmq/mnesia/<user#hostname>. Since the pod's hostname changes each time, it creates a new set of data for the new hostname and loses access to the previously persisted queue. I have many sets of files built up in the mnesia folder, all from previous restarts.
How can I prevent this behavior?
The closest answer that I could find is in this question, but if I'm reading it correctly, it would only work if you have multiple nodes in a cluster simultaneously sharing queue data. I'm not sure it would work with a single node. Or would it?

What helped in our case was to set hostname: <static-host-value> in the pod template spec:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1
  ...
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      ...
      containers:
      - name: rabbitmq
        image: rabbitmq:3-management
        ...
      hostname: rmq-host

How can I prevent this behavior?
By using a StatefulSet as is intended for the case where Pods have persistent data that is associated with their "identity." The Helm chart is a fine place to start reading, even if you don't end up using it.
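For reference, a minimal StatefulSet sketch (names, image, mount path, and storage size are placeholders; serviceName must refer to a headless Service you create separately):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: rabbitmq        # headless Service providing the stable network identity
  replicas: 1
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: rabbitmq:3-management
        volumeMounts:
        - name: data
          mountPath: /var/lib/rabbitmq   # keeps the mnesia data on the claimed volume
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi   # placeholder size
The pod is then always named rabbitmq-0, so the mnesia directory name no longer changes across restarts, and the PersistentVolumeClaim survives pod deletion.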

I ran into this issue myself, and the quickest way I found was to specify an environment variable RABBITMQ_NODENAME (e.g. "yourapplicationsqueuename") and make sure I only had 1 replica for my pod.
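In the container spec that would look roughly like this (the value is a placeholder; RabbitMQ node names conventionally take the form name@host):
containers:
- name: rabbitmq
  image: rabbitmq:3-management
  env:
  - name: RABBITMQ_NODENAME
    value: "rabbit@yourapplicationsqueuename"   # fixed node name; placeholder value
With a fixed node name, the mnesia directory name no longer depends on the pod's hostname.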

Related

GKE - How to expose pod for cross-cluster communication with MCS

In Google Cloud Platform, my goal is to have one cluster with a message queue and a pod in another cluster that consumes from it with MCS (Multi Cluster Service). When trying this out with only one cluster, it went fairly smoothly; I used the container name with the port number as the endpoint to connect to the redpanda message queue.
Now I want to do this between two clusters, but I'm having trouble configuring things correctly.
I followed this guide to set the clusters up which seemed to work (hard to tell, but no errors), and the redpanda application inside the pod is configured to be on localhost:9092. Unfortunately, I'm getting a Connection Error when running the consumer on my-service-export.my-ns.svc.clusterset.local:9092.
Is it correct to expose the pod with the message queue on its localhost?
Are there ways I can debug or test the connection between pods easier?
OK, got it working. I had evidently misread the setup at one point and had to redo some things.
Also, the my-service-export object should probably have the same name as the service you want to export (in my case, redpanda).
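If it helps, the export object would look something like this (a sketch assuming GKE's net.gke.io/v1 ServiceExport API):
apiVersion: net.gke.io/v1   # assumes the GKE MCS ServiceExport API
kind: ServiceExport
metadata:
  namespace: my-ns
  name: redpanda   # same name as the Service being exported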
A helpful tool for checking the connection without spinning up a consumer is the simple dnsutils image. Use this pod manifest and change the namespace to my-ns:
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: my-ns # <----
spec:
  containers:
  - name: dnsutils
    image: k8s.gcr.io/e2e-test-images/jessie-dnsutils:1.3
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
Then spin it up with kubectl apply, exec into it, and run host my-service-export.my-ns.svc.clusterset.local. If you get an IP back, you are probably good.
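Roughly, assuming the manifest above is saved as dnsutils.yaml:
kubectl apply -f dnsutils.yaml   # file name assumed from the manifest above
kubectl exec -it dnsutils -n my-ns -- host my-service-export.my-ns.svc.clusterset.local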

Kubernetes: having a different env for one of the replicas in a service

A use case where one of the services must be scaled to 10 pods.
BUT, one of the pods must have different env variables (it performs certain actions like DB operations and trigger handling; we don't want 10 triggers handled instead of 1 DB change). For example, 9 pods have the env variable CHANGE=0 but one pod has CHANGE=1.
Also, I am resolving by service name, so changing the service name is not what I am looking for.
It sounds like you're trying to solve an issue with your app using Kubernetes.
The reason I say that is because the whole concept of "replicas" is to have identical instances. What you're actually saying is: "I have 10 identical pods but I want 1 of them to be different," and that's not how Kubernetes works.
So you need to re-think the reason you need this environment variable to be different and what you use it for. If you want to share the details, maybe I can help you find an idiomatic way of doing this in Kubernetes.
The easiest way to do what you describe is to have two separate Services. One attaches to any "web" pod:
apiVersion: v1
kind: Service
metadata:
  name: myapp-web
spec:
  selector:
    app: myapp
    tier: web
The second attaches to only the master pod(s):
apiVersion: v1
kind: Service
metadata:
  name: myapp-master
spec:
  selector:
    app: myapp
    tier: web
    role: master
Then have two separate Deployments. One runs the single master pod with replicas: 1; the other runs the nine server pods. Your administrative requests go to myapp-master, while general requests go to myapp-web.
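A sketch of the master Deployment (names and image are placeholders; the web Deployment would be identical except it drops the role: master label and sets replicas: 9):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      tier: web
      role: master
  template:
    metadata:
      labels:
        app: myapp
        tier: web
        role: master   # extra label that only the myapp-master Service selects
    spec:
      containers:
      - name: myapp
        image: myapp:latest   # placeholder image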
As @omricoco suggests, you can come up with a couple of ways to restructure this. A job queue like RabbitMQ has the property that each job is done once (with retries if a job fails), so one setup is to run such a queue, allow any server to accept administrative requests, but have their behavior just be to write a job into the queue. Then you can run a worker process (or several) to service it.

Run containers which intentionally exit periodically

How can I have Kubernetes automatically restart a container that purposefully exits, in order to pick up new data from environment variables?
I have a container running on a Kubernetes cluster which operates in the following fashion:
- The container starts and polls for work.
- If it receives a task, it does some work.
- It polls for work again, until...
- ...the container has been running for over a certain period of time, after which it exits instead of polling for more work.
It needs to be continually restarted, as it uses environment variables which are populated from Kubernetes secrets that are periodically refreshed by another process.
I've tried a Deployment, but it doesn't seem like the right fit as I get CrashLoopBackOff status, which means the worker is scheduled less and less often.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-fonky-worker
  labels:
    app: my-fonky-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-fonky-worker
  template:
    metadata:
      labels:
        app: my-fonky-worker
    spec:
      containers:
      - name: my-fonky-worker-container
        image: my-fonky-worker:latest
        env:
        - name: NOTSOSECRETSTUFF
          value: cats_are_great
        - name: SECRETSTUFF
          valueFrom:
            secretKeyRef:
              name: secret-name
              key: secret-key
I've also tried a CronJob, but that seems a bit hacky as it could mean that the container is left in the stopped state for several seconds.
As @Josh said, you need to exit with exit 0, or else it will be treated as a failed container. Here is the reference. According to the first example there, "Pod is running and has one Container. Container exits with success." If your restartPolicy is set to Always (which is the default, by the way), the container will be restarted; the Pod status continues to show Running, but if you inspect the pod you can see the container's restart count increase.
It needs to be continually restarted, as it uses environment variables which are populated from Kubernetes secrets that are periodically refreshed by another process.
I would take a different approach to this: mount the data as a ConfigMap, as explained here; mounted ConfigMap data is refreshed automatically (ref). Note: mind the "kubelet sync period (1 minute by default) + TTL of the ConfigMaps cache (1 minute by default) in the kubelet" when reasoning about the refresh rate of ConfigMap data in the Pod.
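A minimal sketch of such a mount (the ConfigMap name and mount path are placeholders; note there is no subPath, since subPath mounts do not receive updates):
spec:
  containers:
  - name: my-fonky-worker-container
    image: my-fonky-worker:latest
    volumeMounts:
    - name: config
      mountPath: /etc/config   # files here are refreshed when the ConfigMap changes
  volumes:
  - name: config
    configMap:
      name: worker-config      # hypothetical ConfigMap name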
What I see as a solution for this would be to run your container as a CronJob (but don't use startingDeadlineSeconds as your container killer):
- It runs on its schedule.
- In your container, have it poll for work N times.
- After N times, it exits 0.
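A minimal sketch of that shape, assuming a cluster recent enough for batch/v1 CronJobs (schedule and names are placeholders):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-fonky-worker
spec:
  schedule: "*/5 * * * *"     # placeholder: start a fresh run every 5 minutes
  concurrencyPolicy: Forbid   # don't start a new run while one is still polling
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: my-fonky-worker-container
            image: my-fonky-worker:latest   # polls for work N times, then exits 0
Each run starts a fresh pod, so it also picks up the latest secret values, and a clean exit 0 is not treated as a crash.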
If I understood correctly, in your example there are 2 problems:
1. Restarting the container
2. Updating secret values
In order to keep your secrets up to date, you should consider using secrets as described in Amit Kumar Gupta's comment and mounting them as a volume instead of exposing them as environment variables; here is an example.
As for the problem of restarting the container, it depends on the exit code, as described by garlicFrancium.
From another point of view, you can use an init container that waits for new tasks and a main container that processes those tasks according to your requirements, or create a job scheduler:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 1
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        - "wa=$(shuf -i 15-30 -n 1) && echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        command: ['bash', '-c', 'sleep 30']
      restartPolicy: Always
Please note:
When a secret being already consumed in a volume is updated, projected keys are eventually updated as well. Kubelet is checking whether the mounted secret is fresh on every periodic sync. However, it is using its local cache for getting the current value of the Secret.
The type of the cache is configurable using the ConfigMapAndSecretChangeDetectionStrategy field in the KubeletConfiguration struct. It can be either propagated via watch (default), ttl-based, or simply redirecting all requests directly to the kube-apiserver. As a result, the total delay from the moment when the Secret is updated to the moment when new keys are projected to the Pod can be as long as the kubelet sync period + cache propagation delay, where the cache propagation delay depends on the chosen cache type (it equals the watch propagation delay, the TTL of the cache, or zero, correspondingly).
A container using a Secret as a subPath volume mount will not receive Secret updates.
Please refer also to:
Fine Parallel Processing Using a Work Queue

Make Kubernetes wait for Pod termination before removing from Service endpoints

According to Termination of Pods, step 7 occurs simultaneously with 3. Is there any way I can prevent this from happening and have 7 occur only after the Pod's graceful termination (or expiration of the grace period)?
The reason why I need this is that my Pod's termination routine requires my-service-X.my-namespace.svc.cluster.local to resolve to the Pod's IP during the whole process, but the corresponding Endpoint gets removed as soon as I run kubectl delete on the Pod / Deployment.
Note: In case it helps making this clear, I'm running a bunch of clustered VerneMQ (Erlang) nodes which, on termination, dump their contents to other nodes on the cluster — hence the need for the nodenames to resolve correctly during the whole termination process. Only then should the corresponding Endpoints be removed.
Unfortunately, Kubernetes was designed to remove the Pod from the Endpoints at the same time as the preStop hook is started (see the link in the question to the Kubernetes docs):
At the same time as the kubelet is starting graceful shutdown, the control plane removes that shutting-down Pod from Endpoints.
This Google Kubernetes documentation says it even more clearly:
Pod is set to the "Terminating" State and removed from the endpoints list of all Services.
There was also a feature request for that, which was not picked up.
Solution for helm users
But if you are using Helm, you can use hooks (e.g. pre-delete, pre-upgrade, pre-rollback). Unfortunately, such a Helm hook runs as an extra pod, which cannot access all of the original pod's resources.
This is an example for a hook:
apiVersion: batch/v1
kind: Job
metadata:
  name: graceful-shutdown-hook
  annotations:
    "helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
  labels:
    app.kubernetes.io/name: graceful-shutdown-hook
spec:
  template:
    spec:
      containers:
      - name: graceful-shutdown
        image: busybox:1.28.2
        command: ['sh', '-cx', '/bin/sleep 15']
      restartPolicy: Never
  backoffLimit: 0
Maybe you should consider using a headless Service instead of a ClusterIP one. That way your apps will discover each other using the actual endpoint IPs, and removal from the endpoint list will not break availability during shutdown, but will remove the pod from discovery (or from, e.g., ingress controller backends in nginx contrib).
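A headless Service is just a normal Service with clusterIP: None; a minimal sketch (the name, selector, and port are placeholders):
apiVersion: v1
kind: Service
metadata:
  name: my-service   # placeholder name
spec:
  clusterIP: None    # headless: DNS returns the pod IPs directly instead of a virtual IP
  selector:
    app: vernemq     # placeholder selector
  ports:
  - port: 4369       # placeholder port
Clients resolving my-service then get A records for each pod rather than a single virtual IP.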

How are minions chosen for given pod creation request

How does kubernetes choose the minion among many available for a given pod creation command? Is it something that can be controlled/tweaked ?
If replicated pods are submitted for deployment, is Kubernetes intelligent enough to place them on different minions if they expose the same container/host port pair? Or does it always place different replicas on different minions?
What about corner cases like what if two different pods (not necessarily replicas) that expose same host/container port pair are submitted? Will they carefully be placed on different minions ?
If a pod requires specific compute/memory resources, can it be placed on a minion/host that has sufficient resources left to meet those requirements?
In summary, is there detailed documentation on kubernetes pod placement strategy?
Pods are scheduled to hosts (minions) using the algorithm in generic_scheduler.go.
There are rules that prevent host-port conflicts, and also rules that make sure there is sufficient memory and CPU to meet the pod's requirements; see predicates.go.
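For the resource-requirements part of the question, you express those as resource requests on the container, and the scheduler only places the pod on a node with that much unreserved capacity. A minimal sketch (name and values are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: big-app   # placeholder name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"   # scheduler requires a node with this much free memory
        cpu: "500m"       # half a CPU core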
One way to choose the minion for a pod is to use a nodeSelector. Inside the pod's YAML file, specify the label of the minion on which you want the pod to run.
apiVersion: v1
kind: Pod
metadata:
  name: nginx1
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    key: value
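For the nodeSelector to match anything, the minion has to carry that label first, e.g.:
kubectl label nodes <node-name> key=value   # <node-name> is a placeholder for the target minion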