Kubernetes: RabbitMQ pod is spammed with connections from kube-system

I'm currently learning Kubernetes and all its quirks.
I'm using a rabbitMQ Deployment, Service and pod in my cluster to exchange messages between apps in the cluster. However, I noticed an abnormal number of restarts of the rabbitMQ pod.
After installing Prometheus and Grafana to investigate, I saw that the rabbitMQ pod would consume more and more memory and CPU until it got killed by the OOM killer every two hours or so. The graph looks like this:
Graph of CPU consumption in my cluster (rabbitmq in red)
After that I looked into the rabbitMQ pod UI and saw that an app in my cluster (IP 10.224.0.5) was constantly creating new connections. This IP corresponds to my kube-system pods and my Prometheus instance, as shown by the following output:
k get all -A -o wide | grep 10.224.0.5
E1223 12:13:48.231908 23198 memcache.go:255] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
E1223 12:13:48.311831 23198 memcache.go:255] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1
kube-system pod/azure-ip-masq-agent-xh9jk 1/1 Running 0 25d 10.224.0.5 aks-agentpool-37892177-vmss000001 <none> <none>
kube-system pod/cloud-node-manager-h5ff5 1/1 Running 0 25d 10.224.0.5 aks-agentpool-37892177-vmss000001 <none> <none>
kube-system pod/csi-azuredisk-node-sf8sn 3/3 Running 0 3d15h 10.224.0.5 aks-agentpool-37892177-vmss000001 <none> <none>
kube-system pod/csi-azurefile-node-97nbt 3/3 Running 0 19d 10.224.0.5 aks-agentpool-37892177-vmss000001 <none> <none>
kube-system pod/kube-proxy-2s5tn 1/1 Running 0 3d15h 10.224.0.5 aks-agentpool-37892177-vmss000001 <none> <none>
monitoring pod/prometheus-prometheus-node-exporter-dztwx 1/1 Running 0 20h 10.224.0.5 aks-agentpool-37892177-vmss000001 <none> <none>
Also, I noticed that these connections seem to be blocked by rabbitMQ, as the field connection.blocked in the client properties is set to true, as shown in the following image:
Print screen of a connection details from rabbitMQ pod's UI
I saw in the documentation that rabbitMQ starts to block connections when it runs low on resources, but I set the CPU and memory limits to 1 CPU and 1 GiB of RAM, and the connections are blocked from the start anyway.
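One thing I'm considering on top of that (a sketch; the exact value below is my own guess, not from the docs): pinning rabbitMQ's memory alarm below the pod limit, since inside a container rabbitMQ may compute its watermark from the node's total memory rather than from the 1Gi limit:
# rabbitmq.conf (mounted into the container, e.g. via a ConfigMap)
# An absolute watermark below the 1Gi pod limit, instead of the default
# relative watermark (0.4 of whatever total memory the node detects).
vm_memory_high_watermark.absolute = 768MiB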
On the cluster I'm also using Keda, which uses the rabbitmq pod and polls it every second to see if there are any messages in a queue (I set pollingInterval to 1 in the yaml). But as I said earlier, it's not Keda that's creating all the connections, it's kube-system. Unless Keda uses one of the components listed in the output above to poll rabbitmq, and its polling interval does not correspond to seconds (which is highly unlikely, as the docs say the polling interval is given in seconds), I have no idea what's going on with all these connections.
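To figure out exactly which client keeps reconnecting, something like this might help (a sketch; the pod name is whatever kubectl get pods shows for the rabbitmq pod):
# The client_properties column usually reveals the connecting library
# (for example KEDA's Go AMQP client vs. one of my own apps).
kubectl exec -it <rabbitmq-pod> -- rabbitmqctl list_connections user peer_host peer_port state client_properties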
The following section contains the yamls of all the components that might be involved in this problem (keda and rabbitmq):
rabbitMQ Replica Count.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  labels:
    component: rabbitmq
  name: rabbitmq-controller
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: taskQueue
        component: rabbitmq
    spec:
      containers:
      - image: rabbitmq:3.11.5-management
        name: rabbitmq
        ports:
        - containerPort: 5672
          name: amqp
        - containerPort: 15672
          name: http
        resources:
          limits:
            cpu: 1
            memory: 1Gi
rabbitMQ Service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    component: rabbitmq
  name: rabbitmq-service
spec:
  type: LoadBalancer
  ports:
  - port: 5672
    targetPort: 5672
    name: amqp
  - port: 15672
    targetPort: 15672
    name: http
  selector:
    app: taskQueue
    component: rabbitmq
keda ScaledJob, Secret and TriggerAuthentication (sample data is just a replacement for fields that I do not want to reveal :) ):
apiVersion: v1
kind: Secret
metadata:
  name: keda-rabbitmq-secret
data:
  host: sample-host # base64-encoded value of format amqp://guest:password@localhost:5672/vhost
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-rabbitmq-conn
  namespace: default
spec:
  secretTargetRef:
  - parameter: host
    name: keda-rabbitmq-secret
    key: host
---
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: builder-job-scaler
  namespace: default
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 5
    template:
      spec:
        volumes:
        - name: shared-storage
          emptyDir: {}
        initContainers:
        - name: sourcesfetcher
          image: sample image
          volumeMounts:
          - name: shared-storage
            mountPath: /mnt/shared
          env:
          - name: SHARED_STORAGE_MOUNT_POINT
            value: /mnt/shared
          - name: RABBITMQ_ENDPOINT
            value: sample host
          - name: RABBITMQ_QUEUE_NAME
            value: buildOrders
        containers:
        - name: builder
          image: sample image
          volumeMounts:
          - name: shared-storage
            mountPath: /mnt/shared
          env:
          - name: SHARED_STORAGE_MOUNT_POINT
            value: /mnt/shared
          - name: MINIO_ENDPOINT
            value: sample endpoint
          - name: MINIO_PORT
            value: sample port
          - name: MINIO_USESSL
            value: "false"
          - name: MINIO_ROOT_USER
            value: sample user
          - name: MINIO_ROOT_PASSWORD
            value: sample password
          - name: BUCKET_NAME
            value: "hex"
          - name: SERVER_NAME
            value: sample url
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: 500m
              memory: 512Mi
        restartPolicy: OnFailure
  pollingInterval: 1
  maxReplicaCount: 2
  minReplicaCount: 0
  rollout:
    strategy: gradual
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: buildOrders
      mode: QueueLength
      value: "1"
    authenticationRef:
      name: keda-trigger-auth-rabbitmq-conn
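For completeness, a variant of the trigger I'm considering but have not verified: KEDA's rabbitmq scaler can also poll over the management HTTP API instead of AMQP, which avoids opening AMQP connections for every check (the host secret would then hold an http:// management URL on port 15672 instead of the amqp:// one):
triggers:
- type: rabbitmq
  metadata:
    protocol: http        # poll the management API instead of opening AMQP connections
    queueName: buildOrders
    mode: QueueLength
    value: "1"
  authenticationRef:
    name: keda-trigger-auth-rabbitmq-conn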
Any help would be very much appreciated!

Related

"Must specify limits.cpu" error during pod deployment even though cpu limit is specified

I am trying to run a test pod with OpenShift CLI:
$oc run nginx --image=nginx --limits=cpu=2,memory=4Gi
deploymentconfig.apps.openshift.io/nginx created
$oc describe deploymentconfig.apps.openshift.io/nginx
Name: nginx
Namespace: myproject
Created: 12 seconds ago
Labels: run=nginx
Annotations: <none>
Latest Version: 1
Selector: run=nginx
Replicas: 1
Triggers: Config
Strategy: Rolling
Template:
Pod Template:
Labels: run=nginx
Containers:
nginx:
Image: nginx
Port: <none>
Host Port: <none>
Limits:
cpu: 2
memory: 4Gi
Environment: <none>
Mounts: <none>
Volumes: <none>
Deployment #1 (latest):
Name: nginx-1
Created: 12 seconds ago
Status: New
Replicas: 0 current / 0 desired
Selector: deployment=nginx-1,deploymentconfig=nginx,run=nginx
Labels: openshift.io/deployment-config.name=nginx,run=nginx
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DeploymentCreated 12s deploymentconfig-controller Created new replication controller "nginx-1" for version 1
Warning FailedCreate 1s (x12 over 12s) deployer-controller Error creating deployer pod: pods "nginx-1-deploy" is forbidden: failed quota: quota-svc-myproject: must specify limits.cpu,limits.memory
I get "must specify limits.cpu,limits.memory" error, despite both limits being present in the same describe output.
What might be the problem and how do I fix it?
I found a solution!
Part of the error message was "Error creating deployer pod". It means that the problem is not with my pod, but with the deployer pod that performs my pod's deployment.
It seems the quota in my project affects deployer pods as well.
I couldn't find a way to set deployer pod limits with the CLI, so I made a DeploymentConfig.
kind: "DeploymentConfig"
apiVersion: "v1"
metadata:
name: "test-app"
spec:
template:
metadata:
labels:
name: "test-app"
spec:
containers:
- name: "test-app"
image: "nginxinc/nginx-unprivileged"
resources:
limits:
cpu: "2000m"
memory: "20Gi"
ports:
- containerPort: 8080
protocol: "TCP"
replicas: 1
selector:
name: "test-app"
triggers:
- type: "ConfigChange"
- type: "ImageChange"
imageChangeParams:
automatic: true
containerNames:
- "test-app"
from:
kind: "ImageStreamTag"
name: "nginx-unprivileged:latest"
strategy:
type: "Rolling"
resources:
limits:
cpu: "2000m"
memory: "20Gi"
As you can see, two sets of limits are specified here: one for the container and one for the deployment strategy.
With this configuration it worked fine!
Looks like you have a resource quota specified, and the values you specified for limits seem to be larger than it allows. Describe the resource quota (oc describe quota quota-svc-myproject) and adjust your configs accordingly.
A good reference could be https://docs.openshift.com/container-platform/3.11/dev_guide/compute_resources.html
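If the quota is what's blocking the deployer pod, a LimitRange that gives every container in the project default limits might also help, so auxiliary pods satisfy the quota without being configured individually (a sketch; the name and values are placeholders):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits       # hypothetical name
  namespace: myproject
spec:
  limits:
  - type: Container
    default:                 # limits applied to containers that specify none
      cpu: "500m"
      memory: 512Mi
    defaultRequest:          # requests applied to containers that specify none
      cpu: "100m"
      memory: 256Mi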

Why my GKE node pool does not auto-scale down?

I've got a preemptible node pool which is clearly under-utilized:
The node pool hosts a deployment with HPA with the following setup:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      initContainers:
      - name: wait-for-database
        image: ### IMAGE ###
        command: ['bash', 'init.sh']
      containers:
      - name: backend
        image: ### IMAGE ###
        command: ["bash", "entrypoint.sh"]
        imagePullPolicy: Always
        resources:
          requests:
            memory: "200M"
            cpu: "50m"
        ports:
        - name: probe-port
          containerPort: 8080
          hostPort: 8080
        volumeMounts:
        - name: static-shared-data
          mountPath: /static
        readinessProbe:
          httpGet:
            path: /readiness/
            port: probe-port
          failureThreshold: 5
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
      - name: nginx
        image: nginx:alpine
        resources:
          requests:
            memory: "400M"
            cpu: "20m"
        ports:
        - containerPort: 80
        volumeMounts:
        - name: nginx-proxy-config
          mountPath: /etc/nginx/conf.d/default.conf
          subPath: app.conf
        - name: static-shared-data
          mountPath: /static
      volumes:
      - name: nginx-proxy-config
        configMap:
          name: backend-nginx
      - name: static-shared-data
        emptyDir: {}
      nodeSelector:
        cloud.google.com/gke-nodepool: app-dev
      tolerations:
      - effect: NoSchedule
        key: workload
        operator: Equal
        value: dev
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: backend
  namespace: default
spec:
  maxReplicas: 12
  minReplicas: 8
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: backend
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 50
    type: Resource
---
The node pool also has the toleration label.
The HPA utilization shows this:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
backend-develop Deployment/backend-develop 10%/50% 8 12 8 38d
But the node pool does not scale down for about a day. No heavy load on this deployment:
NAME STATUS ROLES AGE VERSION
gke-dev-app-dev-fee1a901-fvw9 Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-gls7 Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-lf3f Ready <none> 24h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-lgw9 Ready <none> 3d10h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-qxkz Ready <none> 3h35m v1.14.10-gke.36
gke-dev-app-dev-fee1a901-s10l Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-sj4d Ready <none> 22h v1.14.10-gke.36
gke-dev-app-dev-fee1a901-vdnw Ready <none> 27h v1.14.10-gke.36
There are no affinity settings for this deployment or node pool. Some of the nodes easily pack several identical pods, but others hold just one pod for hours; no scale-down happens.
What could be wrong?
The issue was:
hostPort: 8080
This led to FailedScheduling events ("didn't have free ports"): a pod with a hostPort needs that port free on its node, so at most one replica fits per node and the autoscaler cannot pack the pods onto fewer nodes.
That's why the nodes were kept online.
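A minimal sketch of the fix, assuming nothing external depended on the host port: drop hostPort so replicas can share a node and the autoscaler can pack them.
ports:
- name: probe-port
  containerPort: 8080   # no hostPort: the readiness probe still works via the pod IP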

RabbitMQ - Error while waiting for Mnesia tables

I have installed rabbitmq using a helm chart on a kubernetes cluster. The rabbitmq pod keeps restarting. On inspecting the pod logs I get the error below:
2020-02-26 04:42:31.582 [warning] <0.314.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-02-26 04:42:31.582 [info] <0.314.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
When I do kubectl describe pod I get this output:
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-rabbitmq-0
ReadOnly: false
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rabbitmq-config
Optional: false
healthchecks:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rabbitmq-healthchecks
Optional: false
rabbitmq-token-w74kb:
Type: Secret (a volume populated by a Secret)
SecretName: rabbitmq-token-w74kb
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/arch=amd64
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 3m27s (x878 over 7h21m) kubelet, gke-analytics-default-pool-918f5943-w0t0 Readiness probe failed: Timeout: 70 seconds ...
Checking health of node rabbit@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local ...
Status of node rabbit@rabbitmq-0.rabbitmq-headless.default.svc.cluster.local ...
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
I have provisioned the above on a kubernetes cluster on Google Cloud. I am not sure in what specific situation it started failing; I had to restart the pod, and since then it has been failing.
What is the issue here?
TLDR
helm upgrade rabbitmq bitnami/rabbitmq --set clustering.forceBoot=true
Problem
The problem happens for the following reason:
All RMQ pods are terminated at the same time due to some reason (maybe because you explicitly set the StatefulSet replicas to 0, or something else)
One of them is the last one to stop (maybe just a tiny bit after the others). It stores this condition ("I'm standalone now") in its filesystem, which in k8s is the PersistentVolume(Claim). Let's say this pod is rabbitmq-1.
When you spin the StatefulSet back up, the pod rabbitmq-0 is always the first to start (see here).
During startup, pod rabbitmq-0 first checks whether it's supposed to run standalone. But as far as it can see on its own filesystem, it's part of a cluster. So it checks for its peers and doesn't find any. This results in a startup failure by default.
rabbitmq-0 thus never becomes ready.
rabbitmq-1 is never starting because that's how StatefulSets are deployed - one after another. If it were to start, it would start successfully because it sees that it can run standalone as well.
So in the end, it's a bit of a mismatch between how RabbitMQ and StatefulSets work. RMQ says: "if everything goes down, just start everything at the same time; one will be able to start, and as soon as that one is up, the others can rejoin the cluster." k8s StatefulSets say: "starting everything all at once is not possible; we'll start with number 0".
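(If you manage the StatefulSet yourself rather than through a chart, one way to get the "start everything at once" behaviour RMQ expects is the Parallel pod management policy; a sketch, not necessarily exposed by your chart:)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  podManagementPolicy: Parallel   # launch and terminate all pods at once instead of ordinally
  # ... rest of the spec unchanged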
Solution
To fix this, there is a force_boot command for rabbitmqctl which basically tells an instance to start standalone if it doesn't find any peers. How you can use this from Kubernetes depends on the Helm chart and container you're using. In the Bitnami Chart, which uses the Bitnami Docker image, there is a value clustering.forceBoot = true, which translates to an env variable RABBITMQ_FORCE_BOOT = yes in the container, which will then issue the above command for you.
But looking at the problem, you can also see why deleting PVCs will work (other answer). The pods will just all "forget" that they were part of a RMQ cluster the last time around, and happily start. I would prefer the above solution though, as no data is being lost.
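For reference, the same thing as a values file (assumes the Bitnami chart):
# values.yaml
clustering:
  forceBoot: true   # becomes RABBITMQ_FORCE_BOOT=yes in the container
Then: helm upgrade rabbitmq bitnami/rabbitmq -f values.yaml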
I just deleted the existing persistent volume claim, reinstalled rabbitmq, and it started working.
So every time after installing rabbitmq on a kubernetes cluster, if I scale the pods down to 0 and scale them up again later, I get the same error. I also tried deleting the Persistent Volume Claim without uninstalling the rabbitmq helm chart, but still the same error.
So it seems that each time I scale the cluster down to 0, I need to uninstall the rabbitmq helm chart, delete the corresponding Persistent Volume Claims and reinstall the rabbitmq helm chart to make it work.
If you are in the same scenario as me and you don't know who deployed the helm chart or how it was deployed... you can edit the statefulset directly to avoid messing things up further.
I was able to make it work without deleting the helm chart:
kubectl -n rabbitmq edit statefulsets.apps rabbitmq
under the spec section I added the env variable RABBITMQ_FORCE_BOOT = yes as follows:
spec:
  containers:
  - env:
    - name: RABBITMQ_FORCE_BOOT # New Line 1 Added
      value: "yes"              # New Line 2 Added
And that should fix the issue too... but please first try to do it the proper way, as explained above by Ulli.
In my case the solution was simple:
Step 1: Downscale the statefulset. This will not delete the PVC.
kubectl scale statefulsets rabbitmq-1-rabbitmq --namespace teps-rabbitmq --replicas=1
Step 2: Access the RabbitMQ pod:
kubectl exec -it rabbitmq-1-rabbitmq-0 -n teps-rabbitmq -- bash
Step 3: Reset the cluster:
rabbitmqctl stop_app
rabbitmqctl force_boot
Step 4: Rescale the statefulset:
kubectl scale statefulsets rabbitmq-1-rabbitmq --namespace teps-rabbitmq --replicas=4
I also got a similar kind of error, shown below.
2020-06-05 03:45:37.153 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-06-05 03:46:07.154 [warning] <0.234.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-06-05 03:46:07.154 [info] <0.234.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
In my case, a slave node (server) of the RabbitMQ cluster was down. Once I started the slave node, the master node started without an error.
Try this deployment:
kind: Service
apiVersion: v1
metadata:
  namespace: rabbitmq-namespace
  name: rabbitmq
  labels:
    app: rabbitmq
    type: LoadBalancer
spec:
  type: NodePort
  ports:
  - name: http
    protocol: TCP
    port: 15672
    targetPort: 15672
    nodePort: 31672
  - name: amqp
    protocol: TCP
    port: 5672
    targetPort: 5672
    nodePort: 30672
  - name: stomp
    protocol: TCP
    port: 61613
    targetPort: 61613
  selector:
    app: rabbitmq
---
kind: Service
apiVersion: v1
metadata:
  namespace: rabbitmq-namespace
  name: rabbitmq-lb
  labels:
    app: rabbitmq
spec:
  # Headless service to give the StatefulSet a DNS which is known in the cluster (hostname-#.app.namespace.svc.cluster.local, )
  # in our case - rabbitmq-#.rabbitmq.rabbitmq-namespace.svc.cluster.local
  clusterIP: None
  ports:
  - name: http
    protocol: TCP
    port: 15672
    targetPort: 15672
  - name: amqp
    protocol: TCP
    port: 5672
    targetPort: 5672
  - name: stomp
    port: 61613
  selector:
    app: rabbitmq
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
  namespace: rabbitmq-namespace
data:
  enabled_plugins: |
    [rabbitmq_management,rabbitmq_peer_discovery_k8s,rabbitmq_stomp].
  rabbitmq.conf: |
    ## Cluster formation. See http://www.rabbitmq.com/cluster-formation.html to learn more.
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
    cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
    ## Should RabbitMQ node name be computed from the pod's hostname or IP address?
    ## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
    ## Set to "hostname" to use pod hostnames.
    ## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
    ## environment variable.
    cluster_formation.k8s.address_type = hostname
    ## Important - this is the suffix of the hostname, as each node gets "rabbitmq-#", we need to tell what's the suffix
    ## it will give each new node that enters the way to contact the other peer node and join the cluster (if using hostname)
    cluster_formation.k8s.hostname_suffix = .rabbitmq.rabbitmq-namespace.svc.cluster.local
    ## How often should node cleanup checks run?
    cluster_formation.node_cleanup.interval = 30
    ## Set to false if automatic removal of unknown/absent nodes
    ## is desired. This can be dangerous, see
    ## * http://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
    ## * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
    cluster_formation.node_cleanup.only_log_warning = true
    cluster_partition_handling = autoheal
    ## See http://www.rabbitmq.com/ha.html#master-migration-data-locality
    queue_master_locator=min-masters
    ## See http://www.rabbitmq.com/access-control.html#loopback-users
    loopback_users.guest = false
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: rabbitmq-namespace
spec:
  serviceName: rabbitmq
  replicas: 3
  selector:
    matchLabels:
      name: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
        name: rabbitmq
        state: rabbitmq
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      containers:
      - name: rabbitmq-k8s
        image: rabbitmq:3.8.3
        volumeMounts:
        - name: config-volume
          mountPath: /etc/rabbitmq
        - name: data
          mountPath: /var/lib/rabbitmq/mnesia
        ports:
        - name: http
          protocol: TCP
          containerPort: 15672
        - name: amqp
          protocol: TCP
          containerPort: 5672
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 60
          periodSeconds: 60
          timeoutSeconds: 10
        resources:
          requests:
            memory: "0"
            cpu: "0"
          limits:
            memory: "2048Mi"
            cpu: "1000m"
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10
        imagePullPolicy: Always
        env:
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        # See a note on cluster_formation.k8s.address_type in the config file section
        - name: RABBITMQ_NODENAME
          value: "rabbit@$(HOSTNAME).rabbitmq.$(NAMESPACE).svc.cluster.local"
        - name: K8S_SERVICE_NAME
          value: "rabbitmq"
        - name: RABBITMQ_ERLANG_COOKIE
          value: "mycookie"
      volumes:
      - name: config-volume
        configMap:
          name: rabbitmq-config
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes:
      - "ReadWriteOnce"
      storageClassName: "default"
      resources:
        requests:
          storage: 3Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rabbitmq
  namespace: rabbitmq-namespace
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: rabbitmq-namespace
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: endpoint-reader
  namespace: rabbitmq-namespace
subjects:
- kind: ServiceAccount
  name: rabbitmq
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: endpoint-reader

Exposing cassandra cluster on minikube to access externally

I'm trying to deploy a multinode Cassandra cluster in minikube. I followed the tutorial Example: Deploying Cassandra with Stateful Sets and made some modifications. The cluster is up and running, and with kubectl I can connect via cqlsh, but I want to connect externally. I exposed the service via NodePort and tested the connection with DataStax Studio (192.168.99.100:32554), but with no success. Later I also want to connect from Spring Boot; I suppose I have to use the svc name or the node IP.
All host(s) tried for query failed (tried: /192.168.99.100:32554 (com.datastax.driver.core.exceptions.TransportException: [/192.168.99.100:32554] Cannot connect))
[cassandra-0] /etc/cassandra/cassandra.yaml
rpc_port: 9160
broadcast_rpc_address: 172.17.0.5
listen_address: 172.17.0.5
# listen_interface: eth0
start_rpc: true
rpc_address: 0.0.0.0
# rpc_interface: eth1
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "cassandra-0.cassandra.default.svc.cluster.local"
Here is the minikube output for the svc and pods:
$ kubectl cluster-info
Kubernetes master is running at https://192.168.99.100:8443
KubeDNS is running at https://192.168.99.100:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cassandra NodePort 10.102.236.158 <none> 9042:32554/TCP 20m
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 22h
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cassandra-0 1/1 Running 0 20m 172.17.0.4 minikube <none> <none>
cassandra-1 1/1 Running 0 19m 172.17.0.5 minikube <none> <none>
cassandra-2 1/1 Running 1 19m 172.17.0.6 minikube <none> <none>
$ kubectl describe service cassandra
Name: cassandra
Namespace: default
Labels: app=cassandra
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"cassandra"},"name":"cassandra","namespace":"default"},"s...
Selector: app=cassandra
Type: NodePort
IP: 10.102.236.158
Port: <unset> 9042/TCP
TargetPort: 9042/TCP
NodePort: <unset> 32554/TCP
Endpoints: 172.17.0.4:9042,172.17.0.5:9042,172.17.0.6:9042
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
$ kubectl exec -it cassandra-0 -- nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.5 104.72 KiB 256 68.1% 680bfcb9-b374-40a6-ba1d-4bf7ee80a57b rack1
UN 172.17.0.4 69.9 KiB 256 66.5% 022009f8-112c-46c9-844b-ef062bac35aa rack1
UN 172.17.0.6 125.31 KiB 256 65.4% 48ae76fe-b37c-45c7-84f9-3e6207da4818 rack1
$ kubectl exec -it cassandra-0 -- cqlsh
Connected to K8Demo at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.4 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
cassandra-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  type: NodePort
  ports:
  - port: 9042
  selector:
    app: cassandra
cassandra-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: cassandra:3.11
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - nodetool drain
        env:
        - name: MAX_HEAP_SIZE
          value: 512M
        - name: HEAP_NEWSIZE
          value: 100M
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.default.svc.cluster.local"
        - name: CASSANDRA_CLUSTER_NAME
          value: "K8Demo"
        - name: CASSANDRA_DC
          value: "DC1-K8Demo"
        - name: CASSANDRA_RACK
          value: "Rack1-K8Demo"
        - name: CASSANDRA_START_RPC
          value: "true"
        - name: CASSANDRA_RPC_ADDRESS
          value: "0.0.0.0"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        # These volume mounts are persistent. They are like inline claims,
        # but not exactly because the names need to match exactly one of
        # the stateful pod volumes.
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
  # These are converted to volume claims by the controller
  # and mounted at the paths mentioned above.
  # do not use these in production until ssd GCEPersistentDisk or other ssd pd
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 1Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: k8s.io/minikube-hostpath
parameters:
  type: pd-standard
Just for anyone with this problem:
After reading the docs on DataStax I realized that DataStax Studio is meant for use with DataStax Enterprise. For local development and the community edition of Cassandra I'm using DataStax DevCenter, and it works.
For Spring Boot (Cassandra cluster running on minikube):
spring.data.cassandra.keyspace-name=mykeyspacename
spring.data.cassandra.contact-points=cassandra-0.cassandra.default.svc.cluster.local
spring.data.cassandra.port=9042
spring.data.cassandra.schema-action=create_if_not_exists
For DataStax DevCenter (Cassandra cluster running on minikube):
ContactHost = 192.168.99.100
NativeProtocolPort: 30042
Updated cassandra-service
# ------------------- Cassandra Service ------------------- #
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  type: NodePort
  ports:
  - port: 9042
    nodePort: 30042
  selector:
    app: cassandra
If you just want to connect via cqlsh, the following command is all you need:
kubectl exec -it cassandra-0 -- cqlsh
On the other hand, if you want to connect from an external point, this command can be used to get the cassandra URL (I use DBeaver to connect to the cassandra cluster):
minikube service cassandra --url
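With the updated service above, connecting from the host machine should then look something like this (assuming cqlsh is installed locally; the IP and port are taken from the outputs above):
cqlsh 192.168.99.100 30042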

Two kubernetes deployments in the same namespace are not able to communicate

I'm deploying the ELK stack (oss) to a kubernetes cluster. The Elasticsearch deployment and service start correctly and the API is reachable. The Kibana deployment starts but can't access elasticsearch:
From Kibana container logs:
{"type":"log","#timestamp":"2019-05-08T22:49:26Z","tags":["error","elasticsearch","admin"],"pid":1,"message":"Request error, retrying\nHEAD http://elasticsearch:9200/ => getaddrinfo ENOTFOUND elasticsearch elasticsearch:9200"}
{"type":"log","#timestamp":"2019-05-08T22:50:44Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"Unable to revive connection: http://elasticsearch:9200/"}
{"type":"log","#timestamp":"2019-05-08T22:50:44Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"No living connections"}
Both deployments are in the same namespace "observability". I also tried to reference the elasticsearch container as elasticsearch.observability.svc.cluster.local, but it's not working either.
What am I doing wrong? How do I reference the elasticsearch container from the kibana container?
More info:
kubectl --context=19team-observability-admin-context -n observability get pods
NAME READY STATUS RESTARTS AGE
elasticsearch-9d495b84f-j2297 1/1 Running 0 15s
kibana-65bc7f9c4-s9cv4 1/1 Running 0 15s
kubectl --context=19team-observability-admin-context -n observability get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
elasticsearch NodePort 10.104.250.175 <none> 9200:30083/TCP,9300:30059/TCP 1m
kibana NodePort 10.102.124.171 <none> 5601:30124/TCP 1m
I start my containers with this command:
kubectl --context=19team-observability-admin-context -n observability apply -f .\elasticsearch.yaml -f .\kibana.yaml
elasticsearch.yaml
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: observability
spec:
  type: NodePort
  ports:
  - name: "9200"
    port: 9200
    targetPort: 9200
  - name: "9300"
    port: 9300
    targetPort: 9300
  selector:
    app: elasticsearch
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: elasticsearch
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
      - name: set-vm-max-map-count
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ['sysctl', '-w', 'vm.max_map_count=262144']
        securityContext:
          privileged: true
        resources:
          requests:
            memory: "512Mi"
            cpu: "1"
          limits:
            memory: "724Mi"
            cpu: "1"
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.7.1
        ports:
        - containerPort: 9200
        - containerPort: 9300
        resources:
          requests:
            memory: "3Gi"
            cpu: "1"
          limits:
            memory: "3Gi"
            cpu: "1"
kibana.yaml
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: observability
spec:
  type: NodePort
  ports:
  - name: "5601"
    port: 5601
    targetPort: 5601
  selector:
    app: observability_platform_kibana
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: observability_platform_kibana
  name: kibana
  namespace: observability
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: observability_platform_kibana
    spec:
      containers:
      - env:
        # THIS IS WHERE WE SET CONNECTION BETWEEN KIBANA AND ELASTIC
        - name: ELASTICSEARCH_HOSTS
          value: http://elasticsearch:9200
        - name: SERVER_NAME
          value: kibana
        image: docker.elastic.co/kibana/kibana-oss:6.7.1
        name: kibana
        ports:
        - containerPort: 5601
        resources:
          requests:
            memory: "512Mi"
            cpu: "1"
          limits:
            memory: "724Mi"
            cpu: "1"
      restartPolicy: Always
UPDATE 1
As gonzalesraul proposed, I've created a second service for elastic with the ClusterIP type:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: elasticsearch
  name: elasticsearch-local
  namespace: observability
spec:
  type: ClusterIP
  ports:
  - port: 9200
    protocol: TCP
    targetPort: 9200
  selector:
    app: elasticsearch
Service is created:
kubectl --context=19team-observability-admin-context -n observability get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
elasticsearch NodePort 10.106.5.94 <none> 9200:31598/TCP,9300:32018/TCP 26s
elasticsearch-local ClusterIP 10.101.178.13 <none> 9200/TCP 26s
kibana NodePort 10.99.73.118 <none> 5601:30004/TCP 26s
And referenced elastic as "http://elasticsearch-local:9200".
Unfortunately it does not work; in the kibana container:
{"type":"log","#timestamp":"2019-05-09T10:13:54Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"Unable to revive connection: http://elasticsearch-local:9200/"}
Do not use a NodePort service; instead use a ClusterIP. If you need to expose your service as a NodePort, create a second service besides it, for instance:
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: elasticsearch
  name: elasticsearch-local
  namespace: observability
spec:
  type: ClusterIP
  ports:
  - port: 9200
    protocol: TCP
    targetPort: 9200
  selector:
    app: elasticsearch
Then update the kibana manifest to point to the ClusterIP service:
# ...
# THIS IS WHERE WE SET CONNECTION BETWEEN KIBANA AND ELASTIC
- name: ELASTICSEARCH_HOSTS
  value: http://elasticsearch-local:9200
# ...
NodePort services do not create a 'dns entry' (e.g. elasticsearch.observability.svc.cluster.local) on kubernetes.
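To verify which names actually resolve, a quick check from inside the cluster might look like this (a sketch; the pod name is arbitrary, and busybox:1.28 is used because nslookup is broken in some newer busybox images):
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup elasticsearch-local.observability.svc.cluster.local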
Edit the SERVER_NAME value in kibana.yaml and set it to kibana:5601.
I think if you don't do this, by default it tries to go to port 80.
This is what kibana.yaml looks like now:
...
spec:
  containers:
  - env:
    - name: ELASTICSEARCH_HOSTS
      value: http://elasticsearch:9200
    - name: SERVER_NAME
      value: kibana:5601
    image: docker.elastic.co/kibana/kibana-oss:6.7.1
    imagePullPolicy: IfNotPresent
    name: kibana
...
And this is the output now:
{"type":"log","#timestamp":"2019-05-09T10:37:16Z","tags":["status","plugin:console#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","#timestamp":"2019-05-09T10:37:16Z","tags":["status","plugin:interpreter#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","#timestamp":"2019-05-09T10:37:16Z","tags":["status","plugin:metrics#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","#timestamp":"2019-05-09T10:37:16Z","tags":["status","plugin:tile_map#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","#timestamp":"2019-05-09T10:37:16Z","tags":["status","plugin:timelion#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","#timestamp":"2019-05-09T10:37:16Z","tags":["status","plugin:elasticsearch#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from yellow to green - Ready","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","#timestamp":"2019-05-09T10:37:17Z","tags":["listening","info"],"pid":1,"message":"Server running at http://0:5601"}
UPDATE
I just tested it on a bare-metal cluster (bootstrapped through kubeadm), and it worked again.
This is the output:
{"type":"log","#timestamp":"2019-05-09T11:09:59Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"No living connections"}
{"type":"log","#timestamp":"2019-05-09T11:10:01Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"Unable to revive connection: http://elasticsearch:9200/"}
{"type":"log","#timestamp":"2019-05-09T11:10:01Z","tags":["warning","elasticsearch","admin"],"pid":1,"message":"No living connections"}
{"type":"log","#timestamp":"2019-05-09T11:10:04Z","tags":["status","plugin:elasticsearch#6.7.1","info"],"pid":1,"state":"green","message":"Status changed from red to green - Ready","prevState":"red","prevMsg":"Unable to connect to Elasticsearch."}
{"type":"log","#timestamp":"2019-05-09T11:10:04Z","tags":["info","migrations"],"pid":1,"message":"Creating index .kibana_1."}
{"type":"log","#timestamp":"2019-05-09T11:10:06Z","tags":["info","migrations"],"pid":1,"message":"Pointing alias .kibana to .kibana_1."}
{"type":"log","#timestamp":"2019-05-09T11:10:06Z","tags":["info","migrations"],"pid":1,"message":"Finished in 2417ms."}
{"type":"log","#timestamp":"2019-05-09T11:10:06Z","tags":["listening","info"],"pid":1,"message":"Server running at http://0:5601"}
Note that it passed from "No living connections" to "Running". I am running the nodes on GCP, and I had to open the firewalls for it to work. What's your environment?