Why clustering on k8s through redis-ha doesn't work? - deployment

I'm trying to create Redis cluster along with Node.JS (ioredis/cluster) but that doesn't seem to work.
It's v1.11.8-gke.6 on GKE.
I'm doing exactly what been told in ha-redis docs:
~  helm install --set replicas=3 --name redis-test stable/redis-ha
NAME: redis-test
LAST DEPLOYED: Fri Apr 26 00:13:31 2019
NAMESPACE: yt
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME DATA AGE
redis-test-redis-ha-configmap 3 0s
redis-test-redis-ha-probes 2 0s
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
redis-test-redis-ha-server-0 0/2 Init:0/1 0 0s
==> v1/Role
NAME AGE
redis-test-redis-ha 0s
==> v1/RoleBinding
NAME AGE
redis-test-redis-ha 0s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-test-redis-ha ClusterIP None <none> 6379/TCP,26379/TCP 0s
redis-test-redis-ha-announce-0 ClusterIP 10.7.244.34 <none> 6379/TCP,26379/TCP 0s
redis-test-redis-ha-announce-1 ClusterIP 10.7.251.35 <none> 6379/TCP,26379/TCP 0s
redis-test-redis-ha-announce-2 ClusterIP 10.7.252.94 <none> 6379/TCP,26379/TCP 0s
==> v1/ServiceAccount
NAME SECRETS AGE
redis-test-redis-ha 1 0s
==> v1/StatefulSet
NAME READY AGE
redis-test-redis-ha-server 0/3 0s
NOTES:
Redis can be accessed via port 6379 and Sentinel can be accessed via port 26379 on the following DNS name from within your cluster:
redis-test-redis-ha.yt.svc.cluster.local
To connect to your Redis server:
1. Run a Redis pod that you can use as a client:
kubectl exec -it redis-test-redis-ha-server-0 sh -n yt
2. Connect using the Redis CLI:
redis-cli -h redis-test-redis-ha.yt.svc.cluster.local
~  k get pods | grep redis-test
redis-test-redis-ha-server-0 2/2 Running 0 1m
redis-test-redis-ha-server-1 2/2 Running 0 1m
redis-test-redis-ha-server-2 2/2 Running 0 54s
~  kubectl exec -it redis-test-redis-ha-server-0 sh -n yt
Defaulting container name to redis.
Use 'kubectl describe pod/redis-test-redis-ha-server-0 -n yt' to see all of the containers in this pod.
/data $ redis-cli -h redis-test-redis-ha.yt.svc.cluster.local
redis-test-redis-ha.yt.svc.cluster.local:6379> set test key
(error) READONLY You can't write against a read only replica.
But in the end only one random pod I connect to is writable. I ran logs on a few containers and everything seem to be fine there. I tried to run cluster info in redis-cli but I get ERR This instance has cluster support disabled everywhere.
Logs:
~  k logs pod/redis-test-redis-ha-server-0 redis
1:C 25 Apr 2019 20:13:43.604 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Apr 2019 20:13:43.604 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Apr 2019 20:13:43.604 # Configuration loaded
1:M 25 Apr 2019 20:13:43.606 * Running mode=standalone, port=6379.
1:M 25 Apr 2019 20:13:43.606 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 25 Apr 2019 20:13:43.606 # Server initialized
1:M 25 Apr 2019 20:13:43.606 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 25 Apr 2019 20:13:43.627 * DB loaded from disk: 0.021 seconds
1:M 25 Apr 2019 20:13:43.627 * Ready to accept connections
1:M 25 Apr 2019 20:14:11.801 * Replica 10.7.251.35:6379 asks for synchronization
1:M 25 Apr 2019 20:14:11.801 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'c2827ffe011d774db005a44165bac67a7e7f7d85', my replication IDs are '8311a1ca896e97d5487c07f2adfd7d4ef924f36b' and '0000000000000000000000000000000000000000')
1:M 25 Apr 2019 20:14:11.802 * Delay next BGSAVE for diskless SYNC
1:M 25 Apr 2019 20:14:17.825 * Starting BGSAVE for SYNC with target: replicas sockets
1:M 25 Apr 2019 20:14:17.825 * Background RDB transfer started by pid 55
55:C 25 Apr 2019 20:14:17.826 * RDB: 0 MB of memory used by copy-on-write
1:M 25 Apr 2019 20:14:17.926 * Background RDB transfer terminated with success
1:M 25 Apr 2019 20:14:17.926 # Slave 10.7.251.35:6379 correctly received the streamed RDB file.
1:M 25 Apr 2019 20:14:17.926 * Streamed RDB transfer with replica 10.7.251.35:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
1:M 25 Apr 2019 20:14:18.828 * Synchronization with replica 10.7.251.35:6379 succeeded
1:M 25 Apr 2019 20:14:42.711 * Replica 10.7.252.94:6379 asks for synchronization
1:M 25 Apr 2019 20:14:42.711 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'c2827ffe011d774db005a44165bac67a7e7f7d85', my replication IDs are 'af453adde824b2280ba66adb40cc765bf390e237' and '0000000000000000000000000000000000000000')
1:M 25 Apr 2019 20:14:42.711 * Delay next BGSAVE for diskless SYNC
1:M 25 Apr 2019 20:14:48.976 * Starting BGSAVE for SYNC with target: replicas sockets
1:M 25 Apr 2019 20:14:48.977 * Background RDB transfer started by pid 125
125:C 25 Apr 2019 20:14:48.978 * RDB: 0 MB of memory used by copy-on-write
1:M 25 Apr 2019 20:14:49.077 * Background RDB transfer terminated with success
1:M 25 Apr 2019 20:14:49.077 # Slave 10.7.252.94:6379 correctly received the streamed RDB file.
1:M 25 Apr 2019 20:14:49.077 * Streamed RDB transfer with replica 10.7.252.94:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
1:M 25 Apr 2019 20:14:49.761 * Synchronization with replica 10.7.252.94:6379 succeeded
~  k logs pod/redis-test-redis-ha-server-1 redis
1:C 25 Apr 2019 20:14:11.780 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Apr 2019 20:14:11.781 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Apr 2019 20:14:11.781 # Configuration loaded
1:S 25 Apr 2019 20:14:11.786 * Running mode=standalone, port=6379.
1:S 25 Apr 2019 20:14:11.791 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:S 25 Apr 2019 20:14:11.791 # Server initialized
1:S 25 Apr 2019 20:14:11.791 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:S 25 Apr 2019 20:14:11.792 * DB loaded from disk: 0.001 seconds
1:S 25 Apr 2019 20:14:11.792 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 25 Apr 2019 20:14:11.792 * Ready to accept connections
1:S 25 Apr 2019 20:14:11.792 * Connecting to MASTER 10.7.244.34:6379
1:S 25 Apr 2019 20:14:11.792 * MASTER <-> REPLICA sync started
1:S 25 Apr 2019 20:14:11.792 * Non blocking connect for SYNC fired the event.
1:S 25 Apr 2019 20:14:11.793 * Master replied to PING, replication can continue...
1:S 25 Apr 2019 20:14:11.799 * Trying a partial resynchronization (request c2827ffe011d774db005a44165bac67a7e7f7d85:6006176).
1:S 25 Apr 2019 20:14:17.824 * Full resync from master: af453adde824b2280ba66adb40cc765bf390e237:722
1:S 25 Apr 2019 20:14:17.824 * Discarding previously cached master state.
1:S 25 Apr 2019 20:14:17.852 * MASTER <-> REPLICA sync: receiving streamed RDB from master
1:S 25 Apr 2019 20:14:17.853 * MASTER <-> REPLICA sync: Flushing old data
1:S 25 Apr 2019 20:14:17.853 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 25 Apr 2019 20:14:17.853 * MASTER <-> REPLICA sync: Finished with success
What am I missing or is there a better way to do clustering?

Not the best solution, but I figured I can just use Sentinel instead of finding another way (or maybe there is no another way). It has support on most languages so it shouldn't be very hard (except redis-cli, can't figure how to query Sentinel server).
This is how I got this done on ioredis (node.js, sorry if you not familiar with ES6 syntax):
import * as IORedis from 'ioredis';
import Redis from 'ioredis';
import { redisHost, redisPassword, redisPort } from './config';
export function getRedisConfig(): IORedis.RedisOptions {
// I'm not sure how to set this properly
// ioredis/cluster automatically resolves all pods by hostname, but not this.
// So I have to implicitly specify all pods.
// Or resolve them all by hostname
return {
sentinels: process.env.REDIS_CLUSTER.split(',').map(d => {
const [host, port = 26379] = d.split(':');
return { host, port: Number(port) };
}),
name: process.env.REDIS_MASTER_NAME || 'mymaster',
...(redisPassword ? { password: redisPassword } : {}),
};
}
export async function initializeRedis() {
if (process.env.REDIS_CLUSTER) {
const cluster = new Redis(getRedisConfig());
return cluster;
}
// For dev environment
const client = new Redis(redisPort, redisHost);
if (redisPassword) {
await client.auth(redisPassword);
}
return client;
}
In env:
env:
- name: REDIS_CLUSTER
value: redis-redis-ha-server-1.redis-redis-ha.yt.svc.cluster.local:26379,redis-redis-ha-server-0.redis-redis-ha.yt.svc.cluster.local:23679,redis-redis-ha-server-2.redis-redis-ha.yt.svc.cluster.local:23679
You may wanna protect it using password.

Related

Kubernetes cronjob never scheduling, no errors

I have a kubernetes cluster on 1.18:
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.4", GitCommit:"c96aede7b5205121079932896c4ad89bb93260af", GitTreeState:"clean", BuildDate:"2020-06-17T11:33:59Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
I am following the documentation for 1.18 cronjobs. I have the following yaml saved in hello_world.yaml:
kind: CronJob
metadata:
name: hello
spec:
schedule: "*/1 * * * *"
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox
args:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure
I created the cronjob with
kubectl create -f hello_world.yaml
cronjob.batch/hello created
However the jobs are never scheduled, despite the cronjob being created:
kubectl get cronjobs
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
hello */1 * * * * False 0 <none> 5m48s
kubectl get jobs
NAME COMPLETIONS DURATION AGE
not-my-job-1624413720 1/1 43m 7d11h
not-my-job-1624500120 1/1 42m 6d11h
not-my-job-1624586520 1/1 43m 5d11h
I notice that the last job to run did so 5 days ago, when our certificates expired resulting in developers getting the following error:
"Unable to connect to the server: x509: certificate has expired or is not yet valid"
We regenerated the certs using the following procedure from IBM, which seemed to work at the time. These are the main commands, we did some backups of configuration files etc. also as per the linked doc:
kubeadm alpha certs renew all
systemctl daemon-reload&&systemctl restart kubelet
I am sure the certificate expiration and renewal has caused some issue, but I see no smoking gun.
kubectl describe cronjob hello
Name: hello
Namespace: default
Labels: <none>
Annotations: <none>
Schedule: */1 * * * *
Concurrency Policy: Allow
Suspend: False
Successful Job History Limit: 3
Failed Job History Limit: 1
Starting Deadline Seconds: <unset>
Selector: <unset>
Parallelism: <unset>
Completions: <unset>
Pod Template:
Labels: <none>
Containers:
hello:
Image: busybox
Port: <none>
Host Port: <none>
Args:
/bin/sh
-c
date; echo Hello from the Kubernetes cluster
Environment: <none>
Mounts: <none>
Volumes: <none>
Last Schedule Time: <unset>
Active Jobs: <none>
Events: <none>
Any help would be greatly appreciated! Thanks.
EDIT: provide some more info:
sudo kubeadm alpha certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Jun 30, 2022 13:31 UTC 364d no
apiserver Jun 30, 2022 13:31 UTC 364d ca no
apiserver-etcd-client Jun 30, 2022 13:31 UTC 364d etcd-ca no
apiserver-kubelet-client Jun 30, 2022 13:31 UTC 364d ca no
controller-manager.conf Jun 30, 2022 13:31 UTC 364d no
etcd-healthcheck-client Jun 30, 2022 13:31 UTC 364d etcd-ca no
etcd-peer Jun 30, 2022 13:31 UTC 364d etcd-ca no
etcd-server Jun 30, 2022 13:31 UTC 364d etcd-ca no
front-proxy-client Jun 30, 2022 13:31 UTC 364d front-proxy-ca no
scheduler.conf Jun 30, 2022 13:31 UTC 364d no
CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Jun 23, 2030 13:21 UTC 8y no
etcd-ca Jun 23, 2030 13:21 UTC 8y no
front-proxy-ca Jun 23, 2030 13:21 UTC 8y no
ls -alt /etc/kubernetes/pki/
total 68
-rw-r--r-- 1 root root 1058 Jun 30 13:31 front-proxy-client.crt
-rw------- 1 root root 1679 Jun 30 13:31 front-proxy-client.key
-rw-r--r-- 1 root root 1099 Jun 30 13:31 apiserver-kubelet-client.crt
-rw------- 1 root root 1675 Jun 30 13:31 apiserver-kubelet-client.key
-rw-r--r-- 1 root root 1090 Jun 30 13:31 apiserver-etcd-client.crt
-rw------- 1 root root 1675 Jun 30 13:31 apiserver-etcd-client.key
-rw-r--r-- 1 root root 1229 Jun 30 13:31 apiserver.crt
-rw------- 1 root root 1679 Jun 30 13:31 apiserver.key
drwxr-xr-x 4 root root 4096 Sep 9 2020 ..
drwxr-xr-x 3 root root 4096 Jun 25 2020 .
-rw------- 1 root root 1675 Jun 25 2020 sa.key
-rw------- 1 root root 451 Jun 25 2020 sa.pub
drwxr-xr-x 2 root root 4096 Jun 25 2020 etcd
-rw-r--r-- 1 root root 1038 Jun 25 2020 front-proxy-ca.crt
-rw------- 1 root root 1675 Jun 25 2020 front-proxy-ca.key
-rw-r--r-- 1 root root 1025 Jun 25 2020 ca.crt
-rw------- 1 root root 1679 Jun 25 2020 ca.key
Found a solution to this one after trying a lot of different stuff, forgot to update at the time. The certs were renewed after they had already expired, I guess this stopped a synchronisation of the certs across the different components in the cluster and nothing could talk to the API.
This is a three node cluster. I cordoned the worker nodes, stopped the kubelet service on them, stopped docker containers + service, started new docker containers, started the kubelet, uncordoned the nodes and carried out the same procedure on the master node. This forced the synchronisation of certs and keys across the different components.
We encountered this just now; updated certificate and cronjobs stopped scheduling.
We only needed to restart the master node, saved us a lot of hassle so try that first.

Why Redis in K8S restarts all the time?

Redis pod restarts like crazy.
How can I find out the reason for this behavior?
I figured out, that the resources quota should be upgraded, but I have no clue what would be the best cpu/ram ratio. And why there are no crash events or logs?
Here are the pods:
> kubectl get pods
redis-master-5d9cfb54f8-8pbgq 1/1 Running 33 3d16h
Here are the logs:
> kubectl logs --follow redis-master-5d9cfb54f8-8pbgq
[1] 08 Sep 07:02:12.152 # Server started, Redis version 2.8.19
[1] 08 Sep 07:02:12.153 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
[1] 08 Sep 07:02:12.153 * The server is now ready to accept connections on port 6379
[1] 08 Sep 07:03:13.085 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 07:03:13.085 * Background saving started by pid 8
[8] 08 Sep 07:03:13.101 * DB saved on disk
[8] 08 Sep 07:03:13.101 * RDB: 0 MB of memory used by copy-on-write
[1] 08 Sep 07:03:13.185 * Background saving terminated with success
[1] 08 Sep 07:04:14.018 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 07:04:14.018 * Background saving started by pid 9
...
[93] 08 Sep 08:38:30.160 * DB saved on disk
[93] 08 Sep 08:38:30.164 * RDB: 2 MB of memory used by copy-on-write
[1] 08 Sep 08:38:30.259 * Background saving terminated with success
[1] 08 Sep 08:39:31.072 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 08:39:31.074 * Background saving started by pid 94
Here is previous logs of the same pod.
> kubectl logs --previous --follow redis-master-5d9cfb54f8-8pbgq
[1] 08 Sep 09:41:46.057 * Background saving terminated with success
[1] 08 Sep 09:42:47.073 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 09:42:47.076 * Background saving started by pid 140
[140] 08 Sep 09:43:14.398 * DB saved on disk
[140] 08 Sep 09:43:14.457 * RDB: 1 MB of memory used by copy-on-write
[1] 08 Sep 09:43:14.556 * Background saving terminated with success
[1] 08 Sep 09:44:15.073 * 10000 changes in 60 seconds. Saving...
[1] 08 Sep 09:44:15.077 * Background saving started by pid 141
[1 | signal handler] (1599558267) Received SIGTERM scheduling shutdown...
[1] 08 Sep 09:44:28.052 # User requested shutdown...
[1] 08 Sep 09:44:28.052 # There is a child saving an .rdb. Killing it!
[1] 08 Sep 09:44:28.052 * Saving the final RDB snapshot before exiting.
[1] 08 Sep 09:44:49.592 * DB saved on disk
[1] 08 Sep 09:44:49.592 # Redis is now ready to exit, bye bye...
Here is the description of the pod. As you can see the limit is 100Mi, but I can't see the threshold, after which the pod restarts.
> kubectl describe pod redis-master-5d9cfb54f8-8pbgq
Name: redis-master-5d9cfb54f8-8pbgq
Namespace: cryptoman
Priority: 0
Node: gke-my-cluster-default-pool-818613a8-smmc/10.172.0.28
Start Time: Fri, 04 Sep 2020 18:52:17 +0300
Labels: app=redis
pod-template-hash=5d9cfb54f8
role=master
tier=backend
Annotations: <none>
Status: Running
IP: 10.36.2.124
IPs: <none>
Controlled By: ReplicaSet/redis-master-5d9cfb54f8
Containers:
master:
Container ID: docker://3479276666a41df502f1f9eb9bb2ff9cfa592f08a33e656e44179042b6233c6f
Image: k8s.gcr.io/redis:e2e
Image ID: docker-pullable://k8s.gcr.io/redis#sha256:f066bcf26497fbc55b9bf0769cb13a35c0afa2aa42e737cc46b7fb04b23a2f25
Port: 6379/TCP
Host Port: 0/TCP
State: Running
Started: Wed, 09 Sep 2020 10:27:56 +0300
Last State: Terminated
Reason: OOMKilled
Exit Code: 0
Started: Wed, 09 Sep 2020 07:34:18 +0300
Finished: Wed, 09 Sep 2020 10:27:55 +0300
Ready: True
Restart Count: 42
Limits:
cpu: 100m
memory: 250Mi
Requests:
cpu: 100m
memory: 250Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5tds9 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
default-token-5tds9:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5tds9
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SandboxChanged 52m (x42 over 4d13h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Pod sandbox changed, it will be killed and re-created.
Normal Killing 52m (x42 over 4d13h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Stopping container master
Normal Created 52m (x43 over 4d16h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Created container master
Normal Started 52m (x43 over 4d16h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Started container master
Normal Pulled 52m (x42 over 4d13h) kubelet, gke-my-cluster-default-pool-818613a8-smmc Container image "k8s.gcr.io/redis:e2e" already present on machine
This is the limit after which it restarts. CPU is just throttled, memory is OOM'ed.
Limits:
cpu: 100m
memory: 250Mi
Reason: OOMKilled
Remove requests & limits
Run the pod, make sure it doesn't restart
If you already have prometheus, run VPA Recommender to check how much resources it needs. Or just use any monitoring stack: GKE Prometheus, prometheus-operator, DataDog etc to check actual resource consumption and adjust limits accordingly.
Max's answer is very complete. But if you don't have Prometheus installed or don't want to, there is another way simple to check actual resource consumption installing the metrics server project in your cluster. After installing it you can check the CPU and memory usage with kubectl top node to check consumption on the node, and kubectl top pod to check consumption on pods. I use it and is very useful.
Or you can just increase the CPU and memory limits, but you will not be able to ensure how much resources the container will need. Basically will be a waste of resources.
The main problem is that you didn't limit the redis application. So redis just increases the memory, and when it reaches Pod limits.memory 250Mb, it is killed by an OOM, and restart it.
Then, if you remove the limits.memory, the redis will continue eating memory until the Node has not enough memory to run other processes and K8s kills it and marks it as "Evicted".
Therefore, configure the memory in redis application to limit the memory used by redis inside redis.conf file and depending on your needs set a LRU or LFU policy to remove some keys (https://redis.io/topics/lru-cache):
maxmemory 256mb
maxmemory-policy allkeys-lfu
And limit the memory of the Pod at around the double of redis maxmemory to give the rest of the processes and objects saved in redis some margin:
resources:
limits:
cpu: 100m
memory: 512Mi
Now the pods are getting evicted.
Can I find out the reason?
NAME READY STATUS RESTARTS AGE
redis-master-7d97765bbb-7kjwn 0/1 Evicted 0 38h
redis-master-7d97765bbb-kmc9g 1/1 Running 0 30m
redis-master-7d97765bbb-sf2ss 0/1 Evicted 0 30m

CephFS mount. Can't read superblock

Any pointers for this issue? Tried tons of things already to no avail.
This command fails with the error Can't read superblock
sudo mount -t ceph worker2:6789:/ /mnt/mycephfs -o name=admin,secret=AQAYjCpcAAAAABAAxs1mrh6nnx+0+1VUqW2p9A==
Some more info that may be helpful
uname -a Linux cephfs-test-admin-1 4.14.84-coreos #1 SMP Sat Dec 15 22:39:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Ceph status and ceph osd status all show no issues at all
dmesg | tail
[228343.304863] libceph: resolve 'worker2' (ret=0): 10.1.96.4:0
[228343.322279] libceph: mon0 10.1.96.4:6789 session established
[228343.323622] libceph: client107238 fsid 762e6263-a95c-40da-9813-9df4fef12f53
ceph -s
cluster:
id: 762e6263-a95c-40da-9813-9df4fef12f53
health: HEALTH_WARN
too few PGs per OSD (16 < min 30)
services:
mon: 3 daemons, quorum worker2,worker0,worker1
mgr: worker1(active)
mds: cephfs-1/1/1 up {0=mds-ceph-mds-85b4fbb478-c6jzv=up:active}
osd: 3 osds: 3 up, 3 in
data:
pools: 2 pools, 16 pgs
objects: 21 objects, 2246 bytes
usage: 342 MB used, 76417 MB / 76759 MB avail
pgs: 16 active+clean
ceph osd status
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | worker2 | 114M | 24.8G | 0 | 0 | 0 | 0 | exists,up |
| 1 | worker0 | 114M | 24.8G | 0 | 0 | 0 | 0 | exists,up |
| 2 | worker1 | 114M | 24.8G | 0 | 0 | 0 | 0 | exists,up |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
ceph -v
ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)
Some of the syslog output:
Jan 04 21:24:04 worker2 kernel: libceph: resolve 'worker2' (ret=0): 10.1.96.4:0
Jan 04 21:24:04 worker2 kernel: libceph: mon0 10.1.96.4:6789 session established
Jan 04 21:24:04 worker2 kernel: libceph: client159594 fsid 762e6263-a95c-40da-9813-9df4fef12f53
Jan 04 21:24:10 worker2 systemd[1]: Started OpenSSH per-connection server daemon (58.242.83.28:36729).
Jan 04 21:24:11 worker2 sshd[12315]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=58.242.83.28 us>
Jan 04 21:24:14 worker2 sshd[12315]: Failed password for root from 58.242.83.28 port 36729 ssh2
Jan 04 21:24:15 worker2 sshd[12315]: Failed password for root from 58.242.83.28 port 36729 ssh2
Jan 04 21:24:18 worker2 sshd[12315]: Failed password for root from 58.242.83.28 port 36729 ssh2
Jan 04 21:24:18 worker2 sshd[12315]: Received disconnect from 58.242.83.28 port 36729:11: [preauth]
Jan 04 21:24:18 worker2 sshd[12315]: Disconnected from authenticating user root 58.242.83.28 port 36729 [preauth]
Jan 04 21:24:18 worker2 sshd[12315]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=58.242.83.28 user=root
Jan 04 21:24:56 worker2 systemd[1]: Started OpenSSH per-connection server daemon (24.114.79.151:58123).
Jan 04 21:24:56 worker2 sshd[12501]: Accepted publickey for core from 24.114.79.151 port 58123 ssh2: RSA SHA256:t4t9yXeR2yC7s9c37mdS/F7koUs2x>
Jan 04 21:24:56 worker2 sshd[12501]: pam_unix(sshd:session): session opened for user core by (uid=0)
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
So after digging the problem was due to XFS partitioning issues ...
Do not know how I missed it at first.
In short:
Trying to create a partion using xfs was failing.
i.e. running mkfs.xfs /dev/vdb1 would simply hang. The OS would still create and mark partitions properly but they'd be corrupt - the fact you'd only find out when trying to mount by getting that Can't read superblock error.
So ceph does this:
1. Run deploy
2. Create XFS partitions mkfs.xfs ...
3. OS would create those faulty partitions
4. Since you can still read the status of OSDs just fine all status report and logs will report no problems (mkfs.xfs did not report errors it just hang)
5. When you try to mount cephFS or use block storage the whole thing bombs due to corrupt partions.
The root cause: still unknown. But I suspect something was not done right on the SSD disk level when provisioning/attaching them from my cloud provider. It now works fine

How to use K8S node_problem_detector?

Question
node-problem-detector is mentioned in Monitor Node Health documentation if K8S. How do we use it if it is not in GCE? Does it feed information to Dashboard or provide API metrics?
"This tool aims to make various node problems visible to the upstream layers in cluster management stack. It is a daemon which runs on each node, detects node problems and reports them to apiserver."
Err Ok but... What does that actually mean? How can I tell if it went to the api server?
What does the before and after look like? Knowing that would help me understand what it's doing.
Before installing Node Problem Detector I see:
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
After installing Node Problem Detector I see:
Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml
Bash# kubectl rollout status daemonset npd-node-problem-detector #(wait for up)
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
DockerDaemon False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 DockerDaemonHealthy Docker daemon is healthy
EBSHealth False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 NoVolumeErrors Volumes are attaching successfully
KernelDeadlock False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
Note I asked for help coming up with a way to see this for all nodes, Kenna Ofoegbu came up with this super useful and readable gem:
zsh# nodes=$(kubectl get nodes | sed '1d' | awk '{print $1}') && for node in $nodes; do; kubectl describe node | sed -n '/Conditions/,/Ready/p' ; done
Bash# (same command, gives errors)
Ok so now I know what Node Problem Detector does but... what good is adding a condition to the node, how do I use the condition to do something useful?
Question: How to use Kubernetes Node Problem Detector?
Use Case #1: Auto heal borked nodes
Step 1.) Install Node Problem Detector, so it can attach new condition metadata to nodes.
Step 2.) Leverage Planetlabs/draino to cordon and drain nodes with bad conditions.
Step 3.) Leverage https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler to auto heal. (When the node is cordon and drained it'll be marked unscheduleable, this will trigger a new node to be provisioned, and then the bad node's resource utilization will be super low which will cause the bad node to get deprovisioned)
Source: https://github.com/kubernetes/node-problem-detector#remedy-systems
Use Case #2: Surface the unhealthy node event so that it can be detected by Kubernetes, and then injested into your monitoring stack so you have an auditable historic record that the event occurred and when.
These unhealthy node events are logged somewhere on the host node, but usually, the host node is generating so much noisy/useless log data that these events aren't usually collected by default.
Node Problem Detector knows where to look for these events on the host node and filters out the noise when it sees the signal of a negative outcome it'll post it to its pod log, which isn't noisy.
The pod log is likely getting ingested into an ELK and Prometheus Operator stack, where it can be detected, alerted on, stored, and graphed.
Also, note that nothing is stopping you from implementing both use cases.
Update, added a snippet of node-problem-detector.helm-values.yaml file per request in comment:
log_monitors:
#https://github.com/kubernetes/node-problem-detector/tree/master/config contains the full list, you can exec into the pod and ls /config/ to see these as well.
- /config/abrt-adaptor.json #Adds ABRT Node Events (ABRT: automatic bug reporting tool), exceptions will show up under "kubectl describe node $NODENAME | grep Events -A 20"
- /config/kernel-monitor.json #Adds 2 new Node Health Condition Checks "KernelDeadlock" and "ReadonlyFilesystem"
- /config/docker-monitor.json #Adds new Node Health Condition Check "DockerDaemon" (Checks if Docker is unhealthy as a result of corrupt image)
# - /config/docker-monitor-filelog.json #Error: "/var/log/docker.log: no such file or directory", doesn't exist on pod, I think you'd have to mount node hostpath to get it to work, gain doesn't sound worth effort.
# - /config/kernel-monitor-filelog.json #Should add to existing Node Health Check "KernelDeadlock", more thorough detection, but silently fails in NPD pod logs for me.
custom_plugin_monitors: #[]
# Someone said all *-counter plugins are custom plugins, if you put them under log_monitors, you'll get #Error: "Failed to unmarshal configuration file "/config/kernel-monitor-counter.json""
- /config/kernel-monitor-counter.json #Adds new Node Health Condition Check "FrequentUnregisteredNetDevice"
- /config/docker-monitor-counter.json #Adds new Node Health Condition Check "CorruptDockerOverlay2"
- /config/systemd-monitor-counter.json #Adds 3 new Node Health Condition Checks "FrequentKubeletRestart", "FrequentDockerRestart", and "FrequentContainerdRestart"
Considering node-problem-detector is a Kubernetes addon, you would need to install that addon on your own Kubernetes server.
A Kubernetes CLuster has an addon-manager that will use it.
Do you mean:how to install it?
kubectl create -f https://github.com/kubernetes/node-problem-detector.yaml

Kubernetes pod on Google Container Engine continually restarts, is never ready

I'm trying to get a ghost blog deployed on GKE, working off of the persistent disks with WordPress tutorial. I have a working container that runs fine manually on a GKE node:
docker run -d --name my-ghost-blog -p 2368:2368 -d us.gcr.io/my_project_id/my-ghost-blog
I can also correctly create a pod using the following method from another tutorial:
kubectl run ghost --image=us.gcr.io/my_project_id/my-ghost-blog --port=2368
When I do that I can curl the blog on the internal IP from within the cluster, and get the following output from kubectl get pod:
Name: ghosty-nqgt0
Namespace: default
Image(s): us.gcr.io/my_project_id/my-ghost-blog
Node: very-long-node-name/10.240.51.18
Labels: run=ghost
Status: Running
Reason:
Message:
IP: 10.216.0.9
Replication Controllers: ghost (1/1 replicas created)
Containers:
ghosty:
Image: us.gcr.io/my_project_id/my-ghost-blog
Limits:
cpu: 100m
State: Running
Started: Fri, 04 Sep 2015 12:18:44 -0400
Ready: True
Restart Count: 0
Conditions:
Type Status
Ready True
Events:
...
The problem arises when I instead try to create the pod from a yaml file, per the Wordpress tutorial. Here's the yaml:
metadata:
name: ghost
labels:
name: ghost
spec:
containers:
- image: us.gcr.io/my_project_id/my-ghost-blog
name: ghost
env:
- name: NODE_ENV
value: production
- name: VIRTUAL_HOST
value: myghostblog.com
ports:
- containerPort: 2368
When I run kubectl create -f ghost.yaml, the pod is created, but is never ready:
> kubectl get pod ghost
NAME READY STATUS RESTARTS AGE
ghost 0/1 Running 11 3m
The pod continuously restarts, as confirmed by the output of kubectl describe pod ghost:
Name: ghost
Namespace: default
Image(s): us.gcr.io/my_project_id/my-ghost-blog
Node: very-long-node-name/10.240.51.18
Labels: name=ghost
Status: Running
Reason:
Message:
IP: 10.216.0.12
Replication Controllers: <none>
Containers:
ghost:
Image: us.gcr.io/my_project_id/my-ghost-blog
Limits:
cpu: 100m
State: Running
Started: Fri, 04 Sep 2015 14:08:20 -0400
Ready: False
Restart Count: 10
Conditions:
Type Status
Ready False
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
Fri, 04 Sep 2015 14:03:20 -0400 Fri, 04 Sep 2015 14:03:20 -0400 1 {scheduler } scheduled Successfully assigned ghost to very-long-node-name
Fri, 04 Sep 2015 14:03:27 -0400 Fri, 04 Sep 2015 14:03:27 -0400 1 {kubelet very-long-node-name} implicitly required container POD created Created with docker id dbbc27b4d280
Fri, 04 Sep 2015 14:03:27 -0400 Fri, 04 Sep 2015 14:03:27 -0400 1 {kubelet very-long-node-name} implicitly required container POD started Started with docker id dbbc27b4d280
Fri, 04 Sep 2015 14:03:27 -0400 Fri, 04 Sep 2015 14:03:27 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} created Created with docker id ceb14ba72929
Fri, 04 Sep 2015 14:03:27 -0400 Fri, 04 Sep 2015 14:03:27 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} started Started with docker id ceb14ba72929
Fri, 04 Sep 2015 14:03:27 -0400 Fri, 04 Sep 2015 14:03:27 -0400 1 {kubelet very-long-node-name} implicitly required container POD pulled Pod container image "gcr.io/google_containers/pause:0.8.0" already present on machine
Fri, 04 Sep 2015 14:03:30 -0400 Fri, 04 Sep 2015 14:03:30 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} started Started with docker id 0b8957fe9b61
Fri, 04 Sep 2015 14:03:30 -0400 Fri, 04 Sep 2015 14:03:30 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} created Created with docker id 0b8957fe9b61
Fri, 04 Sep 2015 14:03:40 -0400 Fri, 04 Sep 2015 14:03:40 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} created Created with docker id edaf0df38c01
Fri, 04 Sep 2015 14:03:40 -0400 Fri, 04 Sep 2015 14:03:40 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} started Started with docker id edaf0df38c01
Fri, 04 Sep 2015 14:03:50 -0400 Fri, 04 Sep 2015 14:03:50 -0400 1 {kubelet very-long-node-name} spec.containers{ghost} started Started with docker id d33f5e5a9637
...
This cycle of created/started goes on forever, if I don't kill the pod. The only difference from the successful pod is the lack of a replication controller. I don't expect this is the problem because the tutorial mentions nothing about rc.
Why is this happening? How can I create a successful pod from config file? And where would I find more verbose logs about what is going on?
If the same docker image is working via kubectl run but not working in a pod, then something is wrong with the pod spec. Compare the full output of the pod as created from spec and as created by rc to see what differs by running kubectl get pods <name> -o yaml for both. Shot in the dark: is it possible the env vars specified in the pod spec are causing it to crash on startup?
Maybe you could use different restart Policy in the yaml file?
What you have I believe is equivalent to
- restartPolicy: Never
no replication controller. You may try to add this line to yaml and set it to Always (and this will provide you with RC), or to OnFailure.
https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/pod-states.md#restartpolicy
Container logs may be useful, with kubectl logs
Usage:
kubectl logs [-p] POD [-c CONTAINER]
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_logs.html