CephFS mount. Can't read superblock - kubernetes

Any pointers on this issue? I have tried tons of things already, to no avail.
This command fails with the error Can't read superblock:
sudo mount -t ceph worker2:6789:/ /mnt/mycephfs -o name=admin,secret=AQAYjCpcAAAAABAAxs1mrh6nnx+0+1VUqW2p9A==
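(Side note: with mount.ceph the key can also be supplied from a file instead of on the command line, which keeps it out of the shell history; assuming the admin key is stored at /etc/ceph/admin.secret, that would look like:
sudo mount -t ceph worker2:6789:/ /mnt/mycephfs -o name=admin,secretfile=/etc/ceph/admin.secret
This does not change the error, though.)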
Some more info that may be helpful
uname -a Linux cephfs-test-admin-1 4.14.84-coreos #1 SMP Sat Dec 15 22:39:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
ceph status and ceph osd status show no real issues at all, apart from a too-few-PGs-per-OSD warning.
dmesg | tail
[228343.304863] libceph: resolve 'worker2' (ret=0): 10.1.96.4:0
[228343.322279] libceph: mon0 10.1.96.4:6789 session established
[228343.323622] libceph: client107238 fsid 762e6263-a95c-40da-9813-9df4fef12f53
ceph -s
  cluster:
    id:     762e6263-a95c-40da-9813-9df4fef12f53
    health: HEALTH_WARN
            too few PGs per OSD (16 < min 30)
  services:
    mon: 3 daemons, quorum worker2,worker0,worker1
    mgr: worker1(active)
    mds: cephfs-1/1/1 up {0=mds-ceph-mds-85b4fbb478-c6jzv=up:active}
    osd: 3 osds: 3 up, 3 in
  data:
    pools:   2 pools, 16 pgs
    objects: 21 objects, 2246 bytes
    usage:   342 MB used, 76417 MB / 76759 MB avail
    pgs:     16 active+clean
ceph osd status
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
| 0 | worker2 | 114M | 24.8G | 0 | 0 | 0 | 0 | exists,up |
| 1 | worker0 | 114M | 24.8G | 0 | 0 | 0 | 0 | exists,up |
| 2 | worker1 | 114M | 24.8G | 0 | 0 | 0 | 0 | exists,up |
+----+---------+-------+-------+--------+---------+--------+---------+-----------+
ceph -v
ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)
Some of the syslog output:
Jan 04 21:24:04 worker2 kernel: libceph: resolve 'worker2' (ret=0): 10.1.96.4:0
Jan 04 21:24:04 worker2 kernel: libceph: mon0 10.1.96.4:6789 session established
Jan 04 21:24:04 worker2 kernel: libceph: client159594 fsid 762e6263-a95c-40da-9813-9df4fef12f53
Jan 04 21:24:10 worker2 systemd[1]: Started OpenSSH per-connection server daemon (58.242.83.28:36729).
Jan 04 21:24:11 worker2 sshd[12315]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=58.242.83.28 us>
Jan 04 21:24:14 worker2 sshd[12315]: Failed password for root from 58.242.83.28 port 36729 ssh2
Jan 04 21:24:15 worker2 sshd[12315]: Failed password for root from 58.242.83.28 port 36729 ssh2
Jan 04 21:24:18 worker2 sshd[12315]: Failed password for root from 58.242.83.28 port 36729 ssh2
Jan 04 21:24:18 worker2 sshd[12315]: Received disconnect from 58.242.83.28 port 36729:11: [preauth]
Jan 04 21:24:18 worker2 sshd[12315]: Disconnected from authenticating user root 58.242.83.28 port 36729 [preauth]
Jan 04 21:24:18 worker2 sshd[12315]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=58.242.83.28 user=root
Jan 04 21:24:56 worker2 systemd[1]: Started OpenSSH per-connection server daemon (24.114.79.151:58123).
Jan 04 21:24:56 worker2 sshd[12501]: Accepted publickey for core from 24.114.79.151 port 58123 ssh2: RSA SHA256:t4t9yXeR2yC7s9c37mdS/F7koUs2x>
Jan 04 21:24:56 worker2 sshd[12501]: pam_unix(sshd:session): session opened for user core by (uid=0)
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument
Jan 04 21:24:56 worker2 systemd[1]: Failed to set up mount unit: Invalid argument

So after digging, the problem turned out to be XFS partitioning issues ...
I do not know how I missed it at first.
In short:
Creating an XFS filesystem on the partition was failing.
i.e. running mkfs.xfs /dev/vdb1 would simply hang. The OS would still create and mark the partitions properly, but the filesystems on them were corrupt, and you'd only find out when trying to mount, at which point you'd get that Can't read superblock error.
So ceph does this:
1. Run deploy
2. Create XFS partitions with mkfs.xfs ...
3. The OS creates those faulty partitions
4. Since you can still read the status of the OSDs just fine, all status reports and logs show no problems (mkfs.xfs did not report errors, it just hung)
5. When you try to mount CephFS or use block storage, the whole thing bombs due to the corrupt partitions.
The root cause is still unknown, but I suspect something was not done right at the SSD level when the disks were provisioned/attached by my cloud provider. It now works fine.
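For anyone hitting the same thing, a minimal sketch for sanity-checking a disk before handing it to Ceph (the device name /dev/vdb1 is from my case; adjust as needed):
# Bound the format step so a hang is caught instead of waiting forever
sudo timeout 120 mkfs.xfs -f /dev/vdb1 || echo "mkfs.xfs hung or failed"
# Verify the filesystem signature actually landed on the device
sudo blkid /dev/vdb1
# Read-only consistency check of the fresh filesystem
sudo xfs_repair -n /dev/vdb1
# Trial mount/unmount before letting Ceph near it
sudo mkdir -p /mnt/xfs-test && sudo mount /dev/vdb1 /mnt/xfs-test && sudo umount /mnt/xfs-test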

Related

Cannot Launch Prometheus App for RPi, Error code 2

I am trying to run Prometheus' standalone app on an RPI4 8GB. I am following the instructions laid out here: https://pimylifeup.com/raspberry-pi-prometheus/
My prometheus.service file is this:
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network-online.target
[Service]
User=pi
Restart=on-failure
ExecStart=/home/pi/prometheus/prometheus \
    --config.file=/home/pi/prometheus/prometheus.yml \
    --storage.tsdb.path=/home/pi/prometheus/data
[Install]
WantedBy=multi-user.target
But when I try to run the service I get the following error.
● prometheus.service - Prometheus Server
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2022-11-24 18:42:51 GMT; 2s ago
Docs: https://prometheus.io/docs/introduction/overview/
Process: 485265 ExecStart=/home/pi/prometheus/prometheus --config.file=/home/pi/prometheus/prometheus.yml --storage.tsdb.path=/home/pi/prometheus/data (code=exited, status=2)
Main PID: 485265 (code=exited, status=2)
CPU: 160ms
Nov 24 18:42:51 master2 systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 5.
Nov 24 18:42:51 master2 systemd[1]: Stopped Prometheus Server.
Nov 24 18:42:51 master2 systemd[1]: prometheus.service: Start request repeated too quickly.
Nov 24 18:42:51 master2 systemd[1]: prometheus.service: Failed with result 'exit-code'.
Nov 24 18:42:51 master2 systemd[1]: Failed to start Prometheus Server.
What does Error Status 2 mean in this context? Is it a permission problem, or something else?
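Note that status=2 here is the exit code of the prometheus binary itself (systemd only relays it), so the real message is in the service's own log output, which the status snippet above does not show. Assuming the paths from the unit file, you can surface it like this:
# Run exactly what the unit runs, as the same user, and read the error printed to stderr
sudo -u pi /home/pi/prometheus/prometheus \
    --config.file=/home/pi/prometheus/prometheus.yml \
    --storage.tsdb.path=/home/pi/prometheus/data
# Or pull the service's recent log lines from the journal
journalctl -u prometheus.service -n 50 --no-pager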

kubelet.service: Unit entered failed state / node NotReady error from Kubernetes cluster

I am trying to deploy Spring Boot microservices in a Kubernetes cluster with 1 master and 2 worker nodes. When I get the node state using the command sudo kubectl get nodes, one of my worker nodes shows NotReady in its status.
To troubleshoot, I ran the following command:
sudo journalctl -u kubelet
I get responses like kubelet.service: Unit entered failed state, and the kubelet service stops. The following is the output of sudo journalctl -u kubelet.
-- Logs begin at Fri 2020-01-03 04:56:18 EST, end at Fri 2020-01-03 05:32:47 EST. --
Jan 03 04:56:25 MILDEVKUB050 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 03 04:56:31 MILDEVKUB050 kubelet[970]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --confi
Jan 03 04:56:31 MILDEVKUB050 kubelet[970]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --confi
Jan 03 04:56:32 MILDEVKUB050 kubelet[970]: I0103 04:56:32.053962 970 server.go:416] Version: v1.17.0
Jan 03 04:56:32 MILDEVKUB050 kubelet[970]: I0103 04:56:32.084061 970 plugins.go:100] No cloud provider specified.
Jan 03 04:56:32 MILDEVKUB050 kubelet[970]: I0103 04:56:32.235928 970 server.go:821] Client rotation is on, will bootstrap in background
Jan 03 04:56:32 MILDEVKUB050 kubelet[970]: I0103 04:56:32.280173 970 certificate_store.go:129] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-curre
Jan 03 04:56:38 MILDEVKUB050 kubelet[970]: I0103 04:56:38.107966 970 server.go:641] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
Jan 03 04:56:38 MILDEVKUB050 kubelet[970]: F0103 04:56:38.109401 970 server.go:273] failed to run Kubelet: running with swap on is not supported, please disable swa
Jan 03 04:56:38 MILDEVKUB050 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jan 03 04:56:38 MILDEVKUB050 systemd[1]: kubelet.service: Unit entered failed state.
Jan 03 04:56:38 MILDEVKUB050 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jan 03 04:56:48 MILDEVKUB050 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 03 04:56:48 MILDEVKUB050 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 03 04:56:48 MILDEVKUB050 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 03 04:56:48 MILDEVKUB050 kubelet[1433]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --conf
Jan 03 04:56:48 MILDEVKUB050 kubelet[1433]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --conf
Jan 03 04:56:48 MILDEVKUB050 kubelet[1433]: I0103 04:56:48.901632 1433 server.go:416] Version: v1.17.0
Jan 03 04:56:48 MILDEVKUB050 kubelet[1433]: I0103 04:56:48.907654 1433 plugins.go:100] No cloud provider specified.
Jan 03 04:56:48 MILDEVKUB050 kubelet[1433]: I0103 04:56:48.907806 1433 server.go:821] Client rotation is on, will bootstrap in background
Jan 03 04:56:48 MILDEVKUB050 kubelet[1433]: I0103 04:56:48.947107 1433 certificate_store.go:129] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-curr
Jan 03 04:56:49 MILDEVKUB050 kubelet[1433]: I0103 04:56:49.263777 1433 server.go:641] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to
Jan 03 04:56:49 MILDEVKUB050 kubelet[1433]: F0103 04:56:49.264219 1433 server.go:273] failed to run Kubelet: running with swap on is not supported, please disable sw
Jan 03 04:56:49 MILDEVKUB050 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jan 03 04:56:49 MILDEVKUB050 systemd[1]: kubelet.service: Unit entered failed state.
Jan 03 04:56:49 MILDEVKUB050 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jan 03 04:56:59 MILDEVKUB050 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 03 04:56:59 MILDEVKUB050 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 03 04:56:59 MILDEVKUB050 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --conf
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --conf
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: I0103 04:56:59.712729 1500 server.go:416] Version: v1.17.0
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: I0103 04:56:59.714927 1500 plugins.go:100] No cloud provider specified.
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: I0103 04:56:59.715248 1500 server.go:821] Client rotation is on, will bootstrap in background
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: I0103 04:56:59.763508 1500 certificate_store.go:129] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-curr
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: I0103 04:56:59.956706 1500 server.go:641] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to
Jan 03 04:56:59 MILDEVKUB050 kubelet[1500]: F0103 04:56:59.957078 1500 server.go:273] failed to run Kubelet: running with swap on is not supported, please disable sw
Jan 03 04:56:59 MILDEVKUB050 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jan 03 04:56:59 MILDEVKUB050 systemd[1]: kubelet.service: Unit entered failed state.
Jan 03 04:56:59 MILDEVKUB050 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jan 03 04:57:10 MILDEVKUB050 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 03 04:57:10 MILDEVKUB050 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Jan 03 04:57:10 MILDEVKUB050 systemd[1]: Started kubelet: The Kubernetes Node Agent.
I tried restarting the kubelet, but there is no change in the node state; it stays NotReady.
Updates
When I run the command systemctl list-units --type=swap --state=active, I get the following response:
docker#MILDEVKUB040:~$ systemctl list-units --type=swap --state=active
UNIT LOAD ACTIVE SUB DESCRIPTION
dev-mapper-MILDEVDCR01\x2d\x2dvg\x2dswap_1.swap loaded active active /dev/mapper/MILDEVDCR01--vg-swap_1
Important
Each time I hit this node-NotReady issue, I need to disable the swap and reload the daemon and the kubelet. After that the node becomes Ready, but later I need to repeat the same steps again.
How can I find a permanent solution for this?
failed to run Kubelet: running with swap on is not supported, please disable swap
You need to disable swap on the system for kubelet to work. You can disable swap with sudo swapoff -a
On systemd-based systems there is another way swap partitions get enabled: swap units, which are activated again whenever systemd reloads, even if you have turned swap off using swapoff -a
https://www.freedesktop.org/software/systemd/man/systemd.swap.html
Check if you have any swap units using systemctl list-units --type=swap --state=active
You can permanently disable any active swap unit with systemctl mask <unit name>.
Note: Do not use systemctl disable <unit name> for the swap unit, as it will be activated again when systemd reloads. Use systemctl mask <unit name> only.
To make sure swap doesn't get re-enabled when your system reboots due to power cycle or any other reason, remove or comment out the swap entries in /etc/fstab
Summarizing (a combined command sketch follows this list):
Run sudo swapoff -a
Check if you have swap units with command systemctl list-units --type=swap --state=active. If there are any active swap units, mask them using systemctl mask <unit name>
Remove swap entries in /etc/fstab
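Here is that sketch (the swap unit name is copied from the systemctl output in the question; yours will differ):
sudo swapoff -a
systemctl list-units --type=swap --state=active
# mask, not disable, so systemd cannot re-activate the unit on reload
sudo systemctl mask 'dev-mapper-MILDEVKUB050\x2d\x2dvg\x2dswap_1.swap'
# comment out swap lines in /etc/fstab (keeps a backup at /etc/fstab.bak)
sudo sed -i.bak '/\bswap\b/ s/^[^#]/#&/' /etc/fstab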
The root cause is the swap space. To disable it completely, follow these steps:
run swapoff -a: this will immediately disable swap, but swap will be activated again on restart
remove any swap entry from /etc/fstab
reboot the system.
If the swap is gone, good. If, for some reason, it is still there, you
have to remove the swap partition. Repeat steps 1 and 2 and, after
that, use fdisk or parted to remove the (now unused) swap partition.
Use great care here: removing the wrong partition will have disastrous
effects!
reboot
This should resolve your issue.
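Since the swap in the question lives on LVM (/dev/mapper/MILDEVDCR01--vg-swap_1), the removal step would look roughly like this; treat it as a sketch and verify the device names with lsblk first:
lsblk -f                              # confirm which device actually backs the swap
sudo swapoff -a
sudo lvremove MILDEVDCR01-vg/swap_1   # LVM-backed swap volume, names taken from the question
# for swap on a plain partition, something like: sudo parted /dev/sda rm 5  (hypothetical disk/partition number)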
Removing /etc/fstab itself will break the VM; I think we should find another way to solve this issue. I tried removing the fstab file, and every command (install, ping, and other commands) errored. Note that the advice above is to remove only the swap entries, not the whole file.

Kubeadm failed | service is down

I am trying to join a worker node to a k8s cluster.
sudo kubeadm join 10.2.67.201:6443 --token x --discovery-token-ca-cert-hash sha256:x
But I get an error at this stage:
curl -sSL http://localhost:10248/healthz
failed with error: Get http://localhost:10248/healthz: dial tcp
Error:
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
I see that kubelet service is down:
journalctl -xeu kubelet
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet.service has finished shutting down.
Nov 22 15:49:00 s001as-ceph-node-03 systemd[1]: Started kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet.service has finished starting up.
--
-- The start-up result is done.
Nov 22 15:49:00 s001as-ceph-node-03 kubelet[286703]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Nov 22 15:49:00 s001as-ceph-node-03 kubelet[286703]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Nov 22 15:49:00 s001as-ceph-node-03 kubelet[286703]: F1122 15:49:00.224350 286703 server.go:251] unable to load client CA file /etc/kubernetes/ssl/ca.crt: open /etc/kubernetes/ssl/ca.cr
Nov 22 15:49:00 s001as-ceph-node-03 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Nov 22 15:49:00 s001as-ceph-node-03 systemd[1]: Unit kubelet.service entered failed state.
Nov 22 15:49:00 s001as-ceph-node-03 systemd[1]: kubelet.service failed.
Nov 22 15:49:10 s001as-ceph-node-03 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Nov 22 15:49:10 s001as-ceph-node-03 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet.service has finished shutting down.
Nov 22 15:49:10 s001as-ceph-node-03 systemd[1]: Started kubelet: The Kubernetes Node Agent.
-- Subject: Unit kubelet.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit kubelet.service has finished starting up.
--
-- The start-up result is done.
Nov 22 15:49:10 s001as-ceph-node-03 kubelet[286717]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Nov 22 15:49:10 s001as-ceph-node-03 kubelet[286717]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Nov 22 15:49:10 s001as-ceph-node-03 kubelet[286717]: F1122 15:49:10.476478 286717 server.go:251] unable to load client CA file /etc/kubernetes/ssl/ca.crt: open /etc/kubernetes/ssl/ca.cr
Nov 22 15:49:10 s001as-ceph-node-03 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Nov 22 15:49:10 s001as-ceph-node-03 systemd[1]: Unit kubelet.service entered failed state.
Nov 22 15:49:10 s001as-ceph-node-03 systemd[1]: kubelet.service failed.
I fixed it.
Just copy /etc/kubernetes/pki/ca.crt into /etc/kubernetes/ssl/ca.crt
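For reference, the same fix as commands (paths taken from the error in the log above):
sudo mkdir -p /etc/kubernetes/ssl
sudo cp /etc/kubernetes/pki/ca.crt /etc/kubernetes/ssl/ca.crt
sudo systemctl restart kubelet
# a symlink also works and stays in sync with the pki copy:
# sudo ln -s /etc/kubernetes/pki/ca.crt /etc/kubernetes/ssl/ca.crt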

Why doesn't clustering on k8s through redis-ha work?

I'm trying to create a Redis cluster used from Node.js (ioredis/cluster), but it doesn't seem to work.
It's v1.11.8-gke.6 on GKE.
I'm doing exactly what the redis-ha docs say:
~  helm install --set replicas=3 --name redis-test stable/redis-ha
NAME: redis-test
LAST DEPLOYED: Fri Apr 26 00:13:31 2019
NAMESPACE: yt
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME DATA AGE
redis-test-redis-ha-configmap 3 0s
redis-test-redis-ha-probes 2 0s
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
redis-test-redis-ha-server-0 0/2 Init:0/1 0 0s
==> v1/Role
NAME AGE
redis-test-redis-ha 0s
==> v1/RoleBinding
NAME AGE
redis-test-redis-ha 0s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
redis-test-redis-ha ClusterIP None <none> 6379/TCP,26379/TCP 0s
redis-test-redis-ha-announce-0 ClusterIP 10.7.244.34 <none> 6379/TCP,26379/TCP 0s
redis-test-redis-ha-announce-1 ClusterIP 10.7.251.35 <none> 6379/TCP,26379/TCP 0s
redis-test-redis-ha-announce-2 ClusterIP 10.7.252.94 <none> 6379/TCP,26379/TCP 0s
==> v1/ServiceAccount
NAME SECRETS AGE
redis-test-redis-ha 1 0s
==> v1/StatefulSet
NAME READY AGE
redis-test-redis-ha-server 0/3 0s
NOTES:
Redis can be accessed via port 6379 and Sentinel can be accessed via port 26379 on the following DNS name from within your cluster:
redis-test-redis-ha.yt.svc.cluster.local
To connect to your Redis server:
1. Run a Redis pod that you can use as a client:
kubectl exec -it redis-test-redis-ha-server-0 sh -n yt
2. Connect using the Redis CLI:
redis-cli -h redis-test-redis-ha.yt.svc.cluster.local
~  k get pods | grep redis-test
redis-test-redis-ha-server-0 2/2 Running 0 1m
redis-test-redis-ha-server-1 2/2 Running 0 1m
redis-test-redis-ha-server-2 2/2 Running 0 54s
~  kubectl exec -it redis-test-redis-ha-server-0 sh -n yt
Defaulting container name to redis.
Use 'kubectl describe pod/redis-test-redis-ha-server-0 -n yt' to see all of the containers in this pod.
/data $ redis-cli -h redis-test-redis-ha.yt.svc.cluster.local
redis-test-redis-ha.yt.svc.cluster.local:6379> set test key
(error) READONLY You can't write against a read only replica.
But in the end only one random pod, whichever I happen to connect to, is writable. I checked the logs on a few containers and everything seems to be fine there. I tried running cluster info in redis-cli, but I get ERR This instance has cluster support disabled everywhere.
Logs:
~  k logs pod/redis-test-redis-ha-server-0 redis
1:C 25 Apr 2019 20:13:43.604 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Apr 2019 20:13:43.604 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Apr 2019 20:13:43.604 # Configuration loaded
1:M 25 Apr 2019 20:13:43.606 * Running mode=standalone, port=6379.
1:M 25 Apr 2019 20:13:43.606 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 25 Apr 2019 20:13:43.606 # Server initialized
1:M 25 Apr 2019 20:13:43.606 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 25 Apr 2019 20:13:43.627 * DB loaded from disk: 0.021 seconds
1:M 25 Apr 2019 20:13:43.627 * Ready to accept connections
1:M 25 Apr 2019 20:14:11.801 * Replica 10.7.251.35:6379 asks for synchronization
1:M 25 Apr 2019 20:14:11.801 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'c2827ffe011d774db005a44165bac67a7e7f7d85', my replication IDs are '8311a1ca896e97d5487c07f2adfd7d4ef924f36b' and '0000000000000000000000000000000000000000')
1:M 25 Apr 2019 20:14:11.802 * Delay next BGSAVE for diskless SYNC
1:M 25 Apr 2019 20:14:17.825 * Starting BGSAVE for SYNC with target: replicas sockets
1:M 25 Apr 2019 20:14:17.825 * Background RDB transfer started by pid 55
55:C 25 Apr 2019 20:14:17.826 * RDB: 0 MB of memory used by copy-on-write
1:M 25 Apr 2019 20:14:17.926 * Background RDB transfer terminated with success
1:M 25 Apr 2019 20:14:17.926 # Slave 10.7.251.35:6379 correctly received the streamed RDB file.
1:M 25 Apr 2019 20:14:17.926 * Streamed RDB transfer with replica 10.7.251.35:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
1:M 25 Apr 2019 20:14:18.828 * Synchronization with replica 10.7.251.35:6379 succeeded
1:M 25 Apr 2019 20:14:42.711 * Replica 10.7.252.94:6379 asks for synchronization
1:M 25 Apr 2019 20:14:42.711 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'c2827ffe011d774db005a44165bac67a7e7f7d85', my replication IDs are 'af453adde824b2280ba66adb40cc765bf390e237' and '0000000000000000000000000000000000000000')
1:M 25 Apr 2019 20:14:42.711 * Delay next BGSAVE for diskless SYNC
1:M 25 Apr 2019 20:14:48.976 * Starting BGSAVE for SYNC with target: replicas sockets
1:M 25 Apr 2019 20:14:48.977 * Background RDB transfer started by pid 125
125:C 25 Apr 2019 20:14:48.978 * RDB: 0 MB of memory used by copy-on-write
1:M 25 Apr 2019 20:14:49.077 * Background RDB transfer terminated with success
1:M 25 Apr 2019 20:14:49.077 # Slave 10.7.252.94:6379 correctly received the streamed RDB file.
1:M 25 Apr 2019 20:14:49.077 * Streamed RDB transfer with replica 10.7.252.94:6379 succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming
1:M 25 Apr 2019 20:14:49.761 * Synchronization with replica 10.7.252.94:6379 succeeded
~  k logs pod/redis-test-redis-ha-server-1 redis
1:C 25 Apr 2019 20:14:11.780 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 25 Apr 2019 20:14:11.781 # Redis version=5.0.3, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 25 Apr 2019 20:14:11.781 # Configuration loaded
1:S 25 Apr 2019 20:14:11.786 * Running mode=standalone, port=6379.
1:S 25 Apr 2019 20:14:11.791 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:S 25 Apr 2019 20:14:11.791 # Server initialized
1:S 25 Apr 2019 20:14:11.791 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:S 25 Apr 2019 20:14:11.792 * DB loaded from disk: 0.001 seconds
1:S 25 Apr 2019 20:14:11.792 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 25 Apr 2019 20:14:11.792 * Ready to accept connections
1:S 25 Apr 2019 20:14:11.792 * Connecting to MASTER 10.7.244.34:6379
1:S 25 Apr 2019 20:14:11.792 * MASTER <-> REPLICA sync started
1:S 25 Apr 2019 20:14:11.792 * Non blocking connect for SYNC fired the event.
1:S 25 Apr 2019 20:14:11.793 * Master replied to PING, replication can continue...
1:S 25 Apr 2019 20:14:11.799 * Trying a partial resynchronization (request c2827ffe011d774db005a44165bac67a7e7f7d85:6006176).
1:S 25 Apr 2019 20:14:17.824 * Full resync from master: af453adde824b2280ba66adb40cc765bf390e237:722
1:S 25 Apr 2019 20:14:17.824 * Discarding previously cached master state.
1:S 25 Apr 2019 20:14:17.852 * MASTER <-> REPLICA sync: receiving streamed RDB from master
1:S 25 Apr 2019 20:14:17.853 * MASTER <-> REPLICA sync: Flushing old data
1:S 25 Apr 2019 20:14:17.853 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 25 Apr 2019 20:14:17.853 * MASTER <-> REPLICA sync: Finished with success
What am I missing or is there a better way to do clustering?
Not the best solution, but I figured I can just use Sentinel instead of finding another way (or maybe there is no other way). In fact that is what this chart provides: stable/redis-ha deploys Redis in master/replica mode with Sentinel for failover, not Redis Cluster, which is why CLUSTER commands are disabled and only the current master accepts writes. Sentinel is supported by most client languages, so it shouldn't be very hard (except redis-cli; see the note just below for how to query the Sentinel server).
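For the record, you can query Sentinel with plain redis-cli by connecting to the Sentinel port (26379 in this chart) and issuing SENTINEL commands, e.g.:
redis-cli -h redis-test-redis-ha.yt.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
redis-cli -h redis-test-redis-ha.yt.svc.cluster.local -p 26379 sentinel masters
(mymaster is the master name, matching the default used in the code below.)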
This is how I got it done with ioredis (Node.js; sorry if you're not familiar with ES6 syntax):
import * as IORedis from 'ioredis';
import Redis from 'ioredis';
import { redisHost, redisPassword, redisPort } from './config';

export function getRedisConfig(): IORedis.RedisOptions {
  // I'm not sure how to set this properly.
  // ioredis/cluster automatically resolves all pods by hostname, but not this.
  // So I have to explicitly specify all pods,
  // or resolve them all by hostname.
  return {
    sentinels: process.env.REDIS_CLUSTER.split(',').map(d => {
      const [host, port = 26379] = d.split(':');
      return { host, port: Number(port) };
    }),
    name: process.env.REDIS_MASTER_NAME || 'mymaster',
    ...(redisPassword ? { password: redisPassword } : {}),
  };
}

export async function initializeRedis() {
  if (process.env.REDIS_CLUSTER) {
    const cluster = new Redis(getRedisConfig());
    return cluster;
  }
  // For dev environment
  const client = new Redis(redisPort, redisHost);
  if (redisPassword) {
    await client.auth(redisPassword);
  }
  return client;
}
In env:
env:
  - name: REDIS_CLUSTER
    value: redis-redis-ha-server-1.redis-redis-ha.yt.svc.cluster.local:26379,redis-redis-ha-server-0.redis-redis-ha.yt.svc.cluster.local:23679,redis-redis-ha-server-2.redis-redis-ha.yt.svc.cluster.local:23679
You may want to protect it with a password.
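If you do, the password is better injected from a Kubernetes Secret than hardcoded; a sketch, where the Secret name and key are hypothetical:
env:
  - name: REDIS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: redis-auth   # hypothetical Secret holding the Redis password
        key: password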

HAProxy process killed unexpectedly when the HAProxy service is reloaded by a sysv init script on CentOS 7.2

I installed HAProxy (1.5.14-3.el7) from the CentOS 7.2 repository.
When I reload the HAProxy service with a wrong haproxy.cfg,
the return code of the reload is incorrect.
For HAProxy, OS, and systemd information, please see below:
[root@unknown ~]# rpm -qa | egrep haproxy
haproxy-1.5.14-3.el7.x86_64
[root@unknown ~]#
[root@unknown ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
[root@unknown ~]#
[root@unknown ~]# rpm -qa | egrep systemd
systemd-libs-219-19.el7.x86_64
systemd-219-19.el7.x86_64
systemd-sysv-219-19.el7.x86_64
[root@unknown ~]#
The return code of reload is incorrect.
[root@unknown ~]# service haproxy status
Redirecting to /bin/systemctl status haproxy.service
● haproxy.service - HAProxy Load Balancer
Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2016-06-07 11:24:41 UTC; 4s ago
[root@unknown ~]#
[root@unknown ~]#
[root@unknown ~]# more /etc/haproxy/haproxy.cfg
XXXX **--> I added an incorrect keyword into haproxy.cfg**
Global
....
[root@unknown ~]#
[root@unknown ~]#
[root@unknown ~]# service haproxy reload
Redirecting to /bin/systemctl reload haproxy.service
[root@unknown ~]#
[root@unknown ~]# echo $?
0 **--> It was successful !!!**
[root@unknown ~]#
[root@unknown ~]# service haproxy status
Redirecting to /bin/systemctl status haproxy.service
● haproxy.service - HAProxy Load Balancer
Loaded: loaded (/usr/lib/systemd/system/haproxy.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2016-06-07 11:24:41 UTC; 21s ago
Process: 16507 ExecReload=/bin/kill -USR2 $MAINPID (code=exited, status=0/SUCCESS)
Main PID: 16464 (haproxy-systemd)
CGroup: /system.slice/haproxy.service
tq16464 /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
tq16465 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
mq16466 /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
Jun 07 11:24:41 unknown systemd[1]: Started HAProxy Load Balancer.
Jun 07 11:24:41 unknown systemd[1]: Starting HAProxy Load Balancer...
Jun 07 11:24:41 unknown haproxy-systemd-wrapper[16464]: haproxy-systemd-wrapper: executing /usr/sbin/haproxy -f /etc/haproxy/haproxy...id -Ds
Jun 07 11:24:57 unknown systemd[1]: Reloaded HAProxy Load Balancer.
Jun 07 11:24:57 unknown haproxy-systemd-wrapper[16464]: haproxy-systemd-wrapper: re-executing
Jun 07 11:24:57 unknown haproxy-systemd-wrapper[16464]: haproxy-systemd-wrapper: executing /usr/sbin/haproxy -f /etc/haproxy/haproxy... 16466
Jun 07 11:24:57 unknown haproxy-systemd-wrapper[16464]: [ALERT] 158/112457 (16508) : parsing [/etc/haproxy/haproxy.cfg:9]: unknown k...ction. **--> In fact, it was wrong**
Jun 07 11:24:57 unknown haproxy-systemd-wrapper[16464]: [ALERT] 158/112457 (16508) : Error(s) found in configuration file : /etc/hap...xy.cfg
Jun 07 11:24:57 unknown haproxy-systemd-wrapper[16464]: [ALERT] 158/112457 (16508) : Fatal errors found in configuration.
Since ExecReload is just /bin/kill -USR2 $MAINPID, systemd only reports whether the signal was delivered, not whether the new configuration parsed, which explains the 0 above. So I decided to use a sysv init.d script to start/reload/stop the HAProxy service.
sysv init.d script:
[root@unknown ~]# cat /etc/init.d/haproxy
#!/bin/sh
#
# chkconfig: - 85 15
# description: HA-Proxy is a TCP/HTTP reverse proxy which is particularly suited \
#              for high availability environments.
# processname: haproxy
# config: /etc/haproxy/haproxy.cfg
# pidfile: /var/run/haproxy.pid
# Script Author: Simon Matter <simon.matter@invoca.ch>
# Version: 2004060600
### BEGIN INIT INFO
# Provides: HA-Proxy
# Required-Start: $network $syslog sshd
# Required-Stop:
# Default-Start: 3 4 5
# Default-Stop: 0 1 2 6
# Short-Description: HAProxy
### END INIT INFO

# Source function library.
if [ -f /etc/init.d/functions ]; then
  . /etc/init.d/functions
elif [ -f /etc/rc.d/init.d/functions ] ; then
  . /etc/rc.d/init.d/functions
else
  exit 0
fi

# Source networking configuration.
. /etc/sysconfig/network

# Check that networking is up.
#[ ${NETWORKING} = "no" ] && exit 0

# This is our service name
BASENAME=`basename $0`
if [ -L $0 ]; then
  BASENAME=`find $0 -name $BASENAME -printf %l`
  BASENAME=`basename $BASENAME`
fi

[ -f /etc/$BASENAME/$BASENAME.cfg ] || exit 1

RETVAL=0

start() {
  /usr/sbin/$BASENAME -c -q -f /etc/$BASENAME/$BASENAME.cfg
  if [ $? -ne 0 ]; then
    echo "Errors found in configuration file, check it with '$BASENAME check'."
    return 1
  fi
  echo -n "Starting $BASENAME: "
  daemon /usr/sbin/$BASENAME -D -f /etc/$BASENAME/$BASENAME.cfg -p /var/run/$BASENAME.pid
  RETVAL=$?
  echo
  [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$BASENAME
  return $RETVAL
}

stop() {
  killproc $BASENAME -USR1
  RETVAL=$?
  echo
  [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$BASENAME
  [ $RETVAL -eq 0 ] && rm -f /var/run/$BASENAME.pid
  return $RETVAL
}

restart() {
  /usr/sbin/$BASENAME -c -q -f /etc/$BASENAME/$BASENAME.cfg
  if [ $? -ne 0 ]; then
    echo "Errors found in configuration file, check it with '$BASENAME check'."
    return 1
  fi
  stop
  start
}

reload() {
  /usr/sbin/$BASENAME -c -q -f /etc/$BASENAME/$BASENAME.cfg
  if [ $? -ne 0 ]; then
    echo "Errors found in configuration file, check it with '$BASENAME check'."
    return 1
  fi
  /usr/sbin/$BASENAME -D -f /etc/$BASENAME/$BASENAME.cfg -p /var/run/$BASENAME.pid -sf $(cat /var/run/$BASENAME.pid)
}

check() {
  /usr/sbin/$BASENAME -c -q -V -f /etc/$BASENAME/$BASENAME.cfg
}

rhstatus() {
  status $BASENAME
}

condrestart() {
  [ -e /var/lock/subsys/$BASENAME ] && restart || :
}

# See how we were called.
case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  restart)
    restart
    ;;
  reload)
    reload
    ;;
  condrestart)
    condrestart
    ;;
  status)
    rhstatus
    ;;
  check)
    check
    ;;
  *)
    echo $"Usage: $BASENAME {start|stop|restart|reload|condrestart|status|check}"
    exit 1
esac

exit $?
When I reloaded the HAProxy service with a correct haproxy.cfg,
the command (service haproxy reload) returned 0,
but HAProxy's status became failed.
[root@unknown ~]# service haproxy status
● haproxy.service - LSB: HAProxy
Loaded: loaded (/etc/rc.d/init.d/haproxy)
Active: active (running) since Tue 2016-06-07 11:33:22 UTC; 1h 14min ago
Docs: man:systemd-sysv-generator(8)
Process: 16636 ExecStart=/etc/rc.d/init.d/haproxy start (code=exited, status=0/SUCCESS)
Main PID: 16641 (haproxy)
CGroup: /system.slice/haproxy.service
mq16641 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid
Jun 07 11:33:22 unknown systemd[1]: Starting LSB: HAProxy...
Jun 07 11:33:22 unknown haproxy[16636]: Starting haproxy: [ OK ]
Jun 07 11:33:22 unknown systemd[1]: Started LSB: HAProxy.
[root@unknown ~]#
[root@unknown ~]# service haproxy reload
Reloading haproxy configuration (via systemctl): [ OK ]
[root@unknown ~]# echo $?
0 **--> It was successful !!!**
[root@unknown ~]#
[root@unknown ~]# service haproxy status
● haproxy.service - LSB: HAProxy
Loaded: loaded (/etc/rc.d/init.d/haproxy)
Active: failed (Result: signal) since Tue 2016-06-07 12:48:01 UTC; 1s ago
Docs: man:systemd-sysv-generator(8)
Process: 16869 ExecStop=/etc/rc.d/init.d/haproxy stop (code=exited, status=0/SUCCESS)
Process: 16863 ExecReload=/etc/rc.d/init.d/haproxy reload (code=exited, status=0/SUCCESS)
Process: 16636 ExecStart=/etc/rc.d/init.d/haproxy start (code=exited, status=0/SUCCESS)
Main PID: 16868 (code=killed, signal=KILL)
Jun 07 11:33:22 unknown systemd[1]: Starting LSB: HAProxy...
Jun 07 11:33:22 unknown haproxy[16636]: Starting haproxy: [ OK ]
Jun 07 11:33:22 unknown systemd[1]: Started LSB: HAProxy.
Jun 07 12:48:00 unknown systemd[1]: Reloaded LSB: HAProxy.
Jun 07 12:48:00 unknown systemd[1]: haproxy.service: main process exited, code=killed, status=9/KILL **--> It was killed, but I don't know which process killed it. Cgroup?**
Jun 07 12:48:01 unknown haproxy[16869]: [FAILED]
Jun 07 12:48:01 unknown systemd[1]: Unit haproxy.service entered failed state.
Jun 07 12:48:01 unknown systemd[1]: haproxy.service failed.
[root@unknown ~]#
I used a newer systemd to get more detailed logs:
Jun 07 13:02:59 elb systemd[1]: Starting LSB: HAProxy...
Jun 07 13:02:59 elb systemd[7010]: Executing: /etc/rc.d/init.d/haproxy start
Jun 07 13:02:59 elb haproxy[7010]: Starting haproxy: [ OK ]
Jun 07 13:02:59 elb systemd[1]: Child 7010 belongs to haproxy.service
Jun 07 13:02:59 elb systemd[1]: haproxy.service: control process exited, code=exited status=0
Jun 07 13:02:59 elb systemd[1]: haproxy.service got final SIGCHLD for state start
Jun 07 13:02:59 elb systemd[1]: Main PID loaded: 7015
Jun 07 13:02:59 elb systemd[1]: haproxy.service changed start -> running
Jun 07 13:02:59 elb systemd[1]: Job haproxy.service/start finished, result=done
Jun 07 13:02:59 elb systemd[1]: Started LSB: HAProxy. **--> start HAproxy successfully **
Jun 07 13:03:27 elb systemd[1]: Trying to enqueue job haproxy.service/reload/replace
Jun 07 13:03:27 elb systemd[1]: Installed new job haproxy.service/reload as 9504
Jun 07 13:03:27 elb systemd[1]: Enqueued job haproxy.service/reload as 9504
Jun 07 13:03:27 elb systemd[1]: About to execute: /etc/rc.d/init.d/haproxy reload
Jun 07 13:03:27 elb systemd[1]: Forked /etc/rc.d/init.d/haproxy as 7060
Jun 07 13:03:27 elb systemd[1]: haproxy.service changed running -> reload
Jun 07 13:03:27 elb systemd[7060]: Executing: /etc/rc.d/init.d/haproxy reload
Jun 07 13:03:27 elb systemd[1]: Child 7015 belongs to haproxy.service
Jun 07 13:03:27 elb systemd[1]: Main PID changing: 7015 -> 7065
Jun 07 13:03:27 elb systemd[1]: Child 7060 belongs to haproxy.service
Jun 07 13:03:27 elb systemd[1]: haproxy.service: control process exited, code=exited status=0
Jun 07 13:03:27 elb systemd[1]: haproxy.service got final SIGCHLD for state reload
Jun 07 13:03:27 elb systemd[1]: haproxy.service changed reload -> running
Jun 07 13:03:27 elb systemd[1]: Job haproxy.service/reload finished, result=done
Jun 07 13:03:27 elb systemd[1]: Reloaded LSB: HAProxy. **--> successful to reload HAproxy**
Jun 07 13:03:27 elb systemd[1]: Child 7065 belongs to haproxy.service
Jun 07 13:03:27 elb systemd[1]: haproxy.service: main process exited, code=killed, status=9/KILL **--> process 7065 has been killed unexpectedly**
Jun 07 13:03:27 elb systemd[1]: haproxy.service changed running -> failed
Jun 07 13:03:27 elb systemd[1]: Unit haproxy.service entered failed state.
Jun 07 13:03:27 elb systemd[1]: haproxy.service failed.
Jun 07 13:03:27 elb systemd[1]: haproxy.service: cgroup is empty **--> Did the cgroup kill process 7065? Is this a bug in systemd?**
On CentOS 7.1 I used the same sysv init script (see above) to reload haproxy, and the 'service haproxy reload' command returned the correct result.
I don't know what is wrong in CentOS 7.2. I just want reloading HAProxy to behave as follows:
When the haproxy.cfg file is incorrect, 'service haproxy reload' returns 1
When the haproxy.cfg file is correct, 'service haproxy reload' returns 0
Can anyone help me? Thanks
I would guess this is an SELinux issue. You can try changing it as below:
vi /etc/selinux/config
# change
SELINUX=enforcing
# to
SELINUX=disabled
# (SELINUXTYPE=targeted can stay as it is)
:wq! # save and quit; a reboot is needed for the change to take effect
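Independently of SELinux, a workaround for the original return-code requirement is to validate the configuration explicitly and only reload when it passes, e.g.:
# haproxy -c exits non-zero and prints the parse errors when haproxy.cfg is bad
haproxy -c -f /etc/haproxy/haproxy.cfg && service haproxy reload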