Pgpool fails to start on Kubernetes as a pod - postgresql

I am running Pgpool in a container; below is the relevant container config from the Kubernetes deployment.
Mount paths:
- name: cgroup
  mountPath: /sys/fs/cgroup:ro
- name: var-run
  mountPath: /run
And the volumes backing those mount paths for the cgroups are defined as below:
- name: cgroup
  hostPath:
    path: /sys/fs/cgroup
    type: Directory
- name: var-run
  emptyDir:
    medium: Memory
Also, in the Kubernetes deployment I have passed:
securityContext:
  privileged: true
But when I exec into the pod to check the Pgpool status, I see the issue below:
[root@app-pg-6448dfb58d-vzk67 /]# journalctl -xeu pgpool
-- Logs begin at Sat 2020-07-04 16:28:41 UTC, end at Sat 2020-07-04 16:29:13 UTC. --
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: Started Pgpool-II.
-- Subject: Unit pgpool.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit pgpool.service has finished starting up.
--
-- The start-up result is done.
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: [1-1] 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "statement_level_load_balance"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "statement_level_load_balance"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "auto_failback"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "auto_failback_interval"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "enable_consensus_with_half_votes"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "enable_shared_relcache"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "relcache_query_target"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: FATAL: could not open pid file as /var/run/pgpool-II-11/pgpool.pid. reason: No such file or directory
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: pgpool.service: main process exited, code=exited, status=3/NOTIMPLEMENTED
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: Unit pgpool.service entered failed state.
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: pgpool.service failed.
systemctl status pgpool inside the pod container:
➜ app-app kubectl exec -it app-pg-6448dfb58d-vzk67 -- bash
[root@app-pg-6448dfb58d-vzk67 /]# systemctl status pgpool
● pgpool.service - Pgpool-II
Loaded: loaded (/usr/lib/systemd/system/pgpool.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sat 2020-07-04 16:28:41 UTC; 1h 39min ago
Process: 34 ExecStart=/usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf $OPTS (code=exited, status=3)
Main PID: 34 (code=exited, status=3)
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "stat...lance"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "auto...lback"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "auto...erval"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "enab...votes"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "enab...cache"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: INFO: unrecognized configuration parameter "relc...arget"
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 pgpool[34]: 2020-07-04 16:28:41: pid 34: FATAL: could not open pid file as /var/run/pgpoo...ectory
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: pgpool.service: main process exited, code=exited, status=3/NOTIMPLEMENTED
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: Unit pgpool.service entered failed state.
Jul 04 16:28:41 app-pg-6448dfb58d-vzk67 systemd[1]: pgpool.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
If required, this is the whole deployment sample:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-pg
  labels:
    helm.sh/chart: app-pgpool-1.0.0
    app.kubernetes.io/name: app-pgpool
    app.kubernetes.io/instance: app-service
    app.kubernetes.io/version: "1.0.3"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: app-pgpool
      app.kubernetes.io/instance: app-service
  template:
    metadata:
      labels:
        app.kubernetes.io/name: app-pgpool
        app.kubernetes.io/instance: app-service
    spec:
      volumes:
        - name: "pgpool-config"
          persistentVolumeClaim:
            claimName: "pgpool-pvc"
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
            type: Directory
        - name: var-run
          emptyDir:
            # Tmpfs needed for systemd.
            medium: Memory
      # volumes:
      #   - name: pgpool-config
      #     configMap:
      #       name: pgpool-config
      #   - name: pgpool-config
      #     azureFile:
      #       secretName: azure-fileshare-secret
      #       shareName: pgpool
      #       readOnly: false
      imagePullSecrets:
        - name: app-secret
      serviceAccountName: app-pg
      securityContext:
        {}
      containers:
        - name: app-pgpool
          securityContext:
            {}
          image: "appacr.azurecr.io/pgpool:1.0.3"
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          stdin: true
          tty: true
          ports:
            - name: http
              containerPort: 9999
              protocol: TCP
          # livenessProbe:
          #   httpGet:
          #     path: /
          #     port: http
          # readinessProbe:
          #   httpGet:
          #     path: /
          #     port: http
          resources:
            {}
          volumeMounts:
            - name: "pgpool-config"
              mountPath: /etc/pgpool-II
            - name: cgroup
              mountPath: /sys/fs/cgroup:ro
            - name: var-run
              mountPath: /run
UPDATE -
Running this same setup with docker-compose works perfectly, no issues at all:
version: '2'
services:
  pgpool:
    container_name: pgpool
    image: appacr.azurecr.io/pgpool:1.0.3
    logging:
      options:
        max-size: 100m
    ports:
      - "9999:9999"
    networks:
      vpcbr:
        ipv4_address: 10.5.0.2
    restart: unless-stopped
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
      - $HOME/Documents/app/docker-compose/pgpool.conf:/etc/pgpool-II/pgpool.conf
      - $HOME/Documents/app/docker-compose/pool_passwd:/etc/pgpool-II/pool_passwd
    privileged: true
    stdin_open: true
    tty: true
I don't know what I am doing wrong; I am not able to start Pgpool and cannot pinpoint the issue. What permission are we missing here, or are cgroups the culprit?
Some direction would be appreciated.
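For reference, the FATAL line above points at a missing /var/run/pgpool-II-11 directory, while /run in the deployment is an empty tmpfs volume; a quick check of whether that directory (and any tmpfiles entry for it) exists inside the pod, a sketch using the pod name from above, would be:
kubectl exec -it app-pg-6448dfb58d-vzk67 -- sh -c 'ls -ld /var/run/pgpool-II-11; grep -r pgpool /usr/lib/tmpfiles.d /etc/tmpfiles.d'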

While this might not be a direct answer to your question, I have seen some very cryptic errors when trying to run any PostgreSQL product from raw manifests. My recommendation would be to try leveraging the chart from Bitnami; they have put a lot of effort into ensuring that all of the security / permission culprits are taken care of properly.
https://github.com/bitnami/charts/tree/master/bitnami/postgresql-ha
Hopefully this helps.
Also, if you do not want to use Helm, you can run the helm template command
https://helm.sh/docs/helm/helm_template/
which will generate manifests out of the chart's template files based on the provided values.yaml.
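For example, a sketch of that flow (the release name, values file, and output directory are arbitrary):
helm repo add bitnami https://charts.bitnami.com/bitnami
helm template my-release bitnami/postgresql-ha -f values.yaml --output-dir ./manifests
kubectl apply -f ./manifests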

Related

Ceph OSD (authenticate timed out) after node restart

A couple of our nodes restarted unexpectedly and since then the OSDs on those nodes will no longer authenticate with the MONs.
I have tested that the node still has access to all the MON nodes using nc to see if the ports are open.
We cannot find anything in the mon logs about authentication errors.
At the moment 50% of the cluster is down due to 2/4 nodes offline.
Feb 06 21:04:07 ceph1 systemd[1]: Starting Ceph osd.7 for d5126e5a-882e-11ec-954e-90e2baec3d2c...
Feb 06 21:04:08 ceph1 podman[520029]: 2023-02-06 21:04:08.056452052 +0100 CET m=+0.123533698 container create 0b396efc0543af48d593d1e4c72ed74d>
Feb 06 21:04:08 ceph1 podman[520029]: 2023-02-06 21:04:08.334525479 +0100 CET m=+0.401607145 container init 0b396efc0543af48d593d1e4c72ed74d30>
Feb 06 21:04:08 ceph1 podman[520029]: 2023-02-06 21:04:08.346028585 +0100 CET m=+0.413110241 container start 0b396efc0543af48d593d1e4c72ed74d3>
Feb 06 21:04:08 ceph1 podman[520029]: 2023-02-06 21:04:08.346109677 +0100 CET m=+0.413191333 container attach 0b396efc0543af48d593d1e4c72ed74d>
Feb 06 21:04:08 ceph1 bash[520029]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-7
Feb 06 21:04:08 ceph1 bash[520029]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-03539866-06e2-4>
Feb 06 21:04:08 ceph1 bash[520029]: Running command: /usr/bin/ln -snf /dev/ceph-03539866-06e2-4ba6-8809-6a491becb4fe/osd-block-1dd63d2a-9803-4>
Feb 06 21:04:08 ceph1 bash[520029]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-7/block
Feb 06 21:04:08 ceph1 bash[520029]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Feb 06 21:04:08 ceph1 bash[520029]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-7
Feb 06 21:04:08 ceph1 bash[520029]: --> ceph-volume lvm activate successful for osd ID: 7
Feb 06 21:04:08 ceph1 podman[520029]: 2023-02-06 21:04:08.635416784 +0100 CET m=+0.702498460 container died 0b396efc0543af48d593d1e4c72ed74d30>
Feb 06 21:04:09 ceph1 podman[520029]: 2023-02-06 21:04:09.036165374 +0100 CET m=+1.103247040 container remove 0b396efc0543af48d593d1e4c72ed74d>
Feb 06 21:04:09 ceph1 podman[520260]: 2023-02-06 21:04:09.299438115 +0100 CET m=+0.070335845 container create d25c3024614dfb0a01c70bd56cf0758e>
Feb 06 21:04:09 ceph1 podman[520260]: 2023-02-06 21:04:09.384256486 +0100 CET m=+0.155154236 container init d25c3024614dfb0a01c70bd56cf0758ef1>
Feb 06 21:04:09 ceph1 podman[520260]: 2023-02-06 21:04:09.393054076 +0100 CET m=+0.163951816 container start d25c3024614dfb0a01c70bd56cf0758ef>
Feb 06 21:04:09 ceph1 bash[520260]: d25c3024614dfb0a01c70bd56cf0758ef16aa67f511ee4add8a85586c67beb0b
Feb 06 21:04:09 ceph1 systemd[1]: Started Ceph osd.7 for d5126e5a-882e-11ec-954e-90e2baec3d2c.
Feb 06 21:09:09 ceph1 conmon[520298]: debug 2023-02-06T20:09:09.394+0000 7f6c10705080 0 monclient(hunting): authenticate timed out after 300
Feb 06 21:14:09 ceph1 conmon[520298]: debug 2023-02-06T20:14:09.395+0000 7f6c10705080 0 monclient(hunting): authenticate timed out after 300
Feb 06 21:19:09 ceph1 conmon[520298]: debug 2023-02-06T20:19:09.397+0000 7f6c10705080 0 monclient(hunting): authenticate timed out after 300
Feb 06 21:24:09 ceph1 conmon[520298]: debug 2023-02-06T20:24:09.398+0000 7f6c10705080 0 monclient(hunting): authenticate timed out after 300
Feb 06 21:29:09 ceph1 conmon[520298]: debug 2023-02-06T20:29:09.399+0000 7f6c10705080 0 monclient(hunting): authenticate timed out after 300
We have restarted the OSD nodes and this did not resolve the issue.
Confirmed that nodes have access to all mon servers.
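For reference, the reachability check was done with nc against the mon ports; a sketch of that kind of check (the mon address is a placeholder), plus a cephx key lookup from a node with an admin keyring, would be:
# check both the v2 (3300) and v1 (6789) messenger ports on each mon
nc -zv 10.123.0.21 3300
nc -zv 10.123.0.21 6789
# from a node with an admin keyring, confirm the OSD's cephx key still exists
cephadm shell -- ceph auth get osd.7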
I have looked in /var/run/ceph and the admin sockets are not there.
Here is the output as it's starting the OSD.
[2023-02-07 10:38:58,167][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
[2023-02-07 10:38:58,168][ceph_volume.process][INFO ] Running command: /usr/sbin/lvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -S -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2023-02-07 10:38:58,213][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/ceph-03539866-06e2-4ba6-8809-6a491becb4fe/osd-block-1dd63d2a-9803-452c-a102-3b826e6ef448,ceph.block_uuid=VjbtJW-iiCA-PMvC-TCnV-9xgJ-a8UU-IDo0Pv,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=d5126e5a-882e-11ec-954e-90e2baec3d2c,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=1dd63d2a-9803-452c-a102-3b826e6ef448,ceph.osd_id=7,ceph.osdspec_affinity=all-available-devices,ceph.type=block,ceph.vdo=0";"/dev/ceph-03539866-06e2-4ba6-8809-6a491becb4fe/osd-block-1dd63d2a-9803-452c-a102-3b826e6ef448";"osd-block-1dd63d2a-9803-452c-a102-3b826e6ef448";"ceph-03539866-06e2-4ba6-8809-6a491becb4fe";"VjbtJW-iiCA-PMvC-TCnV-9xgJ-a8UU-IDo0Pv";"16000896466944
[2023-02-07 10:38:58,213][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/ceph-1ce58676-9409-4e19-ac66-f63b5025dfb0/osd-block-9949a437-7e8a-489b-ba10-ded82c775c43,ceph.block_uuid=KLNJDx-J1iC-V5GJ-0nw3-YuEA-Q41D-HNIXv8,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=d5126e5a-882e-11ec-954e-90e2baec3d2c,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=9949a437-7e8a-489b-ba10-ded82c775c43,ceph.osd_id=3,ceph.osdspec_affinity=all-available-devices,ceph.type=block,ceph.vdo=0";"/dev/ceph-1ce58676-9409-4e19-ac66-f63b5025dfb0/osd-block-9949a437-7e8a-489b-ba10-ded82c775c43";"osd-block-9949a437-7e8a-489b-ba10-ded82c775c43";"ceph-1ce58676-9409-4e19-ac66-f63b5025dfb0";"KLNJDx-J1iC-V5GJ-0nw3-YuEA-Q41D-HNIXv8";"16000896466944
[2023-02-07 10:38:58,213][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/ceph-7053d77a-5d1c-450b-a932-d1590411ea2b/osd-block-29ac0ada-d23c-45c1-ae5d-c8aba5a60195,ceph.block_uuid=NTTkze-YV08-lOir-SJ6W-39un-oUc7-ZvOBra,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=d5126e5a-882e-11ec-954e-90e2baec3d2c,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=29ac0ada-d23c-45c1-ae5d-c8aba5a60195,ceph.osd_id=14,ceph.osdspec_affinity=all-available-devices,ceph.type=block,ceph.vdo=0";"/dev/ceph-7053d77a-5d1c-450b-a932-d1590411ea2b/osd-block-29ac0ada-d23c-45c1-ae5d-c8aba5a60195";"osd-block-29ac0ada-d23c-45c1-ae5d-c8aba5a60195";"ceph-7053d77a-5d1c-450b-a932-d1590411ea2b";"NTTkze-YV08-lOir-SJ6W-39un-oUc7-ZvOBra";"16000896466944
[2023-02-07 10:38:58,213][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/ceph-e0a1e940-dec3-4369-a533-1e88bea5fa5e/osd-block-2d002c14-7751-4037-a070-7538e1264d88,ceph.block_uuid=1Gts1p-KwPO-LnIb-XlP2-zCGQ-92fb-Kvv53H,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=d5126e5a-882e-11ec-954e-90e2baec3d2c,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=2d002c14-7751-4037-a070-7538e1264d88,ceph.osd_id=11,ceph.osdspec_affinity=all-available-devices,ceph.type=block,ceph.vdo=0";"/dev/ceph-e0a1e940-dec3-4369-a533-1e88bea5fa5e/osd-block-2d002c14-7751-4037-a070-7538e1264d88";"osd-block-2d002c14-7751-4037-a070-7538e1264d88";"ceph-e0a1e940-dec3-4369-a533-1e88bea5fa5e";"1Gts1p-KwPO-LnIb-XlP2-zCGQ-92fb-Kvv53H";"16000896466944
[2023-02-07 10:38:58,214][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --noheadings --readonly --separator=";" -S lv_uuid=VjbtJW-iiCA-PMvC-TCnV-9xgJ-a8UU-IDo0Pv -o pv_name,pv_tags,pv_uuid,vg_name,lv_uuid
[2023-02-07 10:38:58,269][ceph_volume.process][INFO ] stdout /dev/sdb";"";"a6T0sC-DeMp-by25-wUjP-wL3R-u6d1-nPXfji";"ceph-03539866-06e2-4ba6-8809-6a491becb4fe";"VjbtJW-iiCA-PMvC-TCnV-9xgJ-a8UU-IDo0Pv
[2023-02-07 10:38:58,269][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --noheadings --readonly --separator=";" -S lv_uuid=KLNJDx-J1iC-V5GJ-0nw3-YuEA-Q41D-HNIXv8 -o pv_name,pv_tags,pv_uuid,vg_name,lv_uuid
[2023-02-07 10:38:58,333][ceph_volume.process][INFO ] stdout /dev/sda";"";"63b0j0-o1S7-FHqG-lwOk-0OYj-I9pH-g58TzB";"ceph-1ce58676-9409-4e19-ac66-f63b5025dfb0";"KLNJDx-J1iC-V5GJ-0nw3-YuEA-Q41D-HNIXv8
[2023-02-07 10:38:58,333][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --noheadings --readonly --separator=";" -S lv_uuid=NTTkze-YV08-lOir-SJ6W-39un-oUc7-ZvOBra -o pv_name,pv_tags,pv_uuid,vg_name,lv_uuid
[2023-02-07 10:38:58,397][ceph_volume.process][INFO ] stdout /dev/sde";"";"qDEqYa-cgXd-Tc2h-64wQ-zT63-vIBZ-ZfGGO0";"ceph-7053d77a-5d1c-450b-a932-d1590411ea2b";"NTTkze-YV08-lOir-SJ6W-39un-oUc7-ZvOBra
[2023-02-07 10:38:58,398][ceph_volume.process][INFO ] Running command: /usr/sbin/pvs --noheadings --readonly --separator=";" -S lv_uuid=1Gts1p-KwPO-LnIb-XlP2-zCGQ-92fb-Kvv53H -o pv_name,pv_tags,pv_uuid,vg_name,lv_uuid
[2023-02-07 10:38:58,457][ceph_volume.process][INFO ] stdout /dev/sdd";"";"aqhedj-aUlM-0cl4-P98k-XZRL-1mPG-0OgKLV";"ceph-e0a1e940-dec3-4369-a533-1e88bea5fa5e";"1Gts1p-KwPO-LnIb-XlP2-zCGQ-92fb-Kvv53H
config dump
WHO MASK LEVEL OPTION VALUE RO
global advanced cluster_network 10.125.0.0/24 *
global basic container_image quay.io/ceph/ceph@sha256:a39107f8d3daab4d756eabd6ee1630d1bc7f31eaa76fff41a77fa32d0b903061 *
mon advanced auth_allow_insecure_global_id_reclaim false
mon advanced public_network 10.123.0.0/24 *
mgr advanced mgr/cephadm/container_init True *
mgr advanced mgr/cephadm/migration_current 3 *
mgr advanced mgr/dashboard/ALERTMANAGER_API_HOST http://10.123.0.21:9093 *
mgr advanced mgr/dashboard/GRAFANA_API_SSL_VERIFY false *
mgr advanced mgr/dashboard/GRAFANA_API_URL https://10.123.0.21:3000 *
mgr advanced mgr/dashboard/PROMETHEUS_API_HOST http://10.123.0.21:9095 *
mgr advanced mgr/dashboard/ssl_server_port 8443 *
mgr advanced mgr/orchestrator/orchestrator cephadm
mgr advanced mgr/pg_autoscaler/autoscale_profile scale-up
mds advanced mds_max_caps_per_client 65536
mds.cephfs basic mds_join_fs cephfs
####
ceph status
  cluster:
    id:     d5126e5a-882e-11ec-954e-90e2baec3d2c
    health: HEALTH_WARN
            8 failed cephadm daemon(s)
            2 stray daemon(s) not managed by cephadm
            nodown,noout flag(s) set
            4 osds down
            1 host (4 osds) down
            Degraded data redundancy: 195662646/392133183 objects degraded (49.897%), 160 pgs degraded, 160 pgs undersized
            6 pgs not deep-scrubbed in time
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum ceph5,ceph7,ceph6 (age 2d)
    mgr: ceph2.tofizp(active, since 9M), standbys: ceph1.vnkagp
    mds: 3/3 daemons up
    osd: 19 osds: 15 up (since 11h), 19 in (since 11h); 151 remapped pgs
         flags nodown,noout

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 257 pgs
    objects: 102.97M objects, 67 TiB
    usage:   69 TiB used, 107 TiB / 176 TiB avail
    pgs:     195662646/392133183 objects degraded (49.897%)
             2620377/392133183 objects misplaced (0.668%)
             150 active+undersized+degraded+remapped+backfill_wait
             97  active+clean
             9   active+undersized+degraded
             1   active+undersized+degraded+remapped+backfilling

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 9.7 MiB/s, 9 objects/s

'nfs-server' state in 'service_facts' differ from 'ansible.builtin.systemd'

Running a CentOS 8 Stream server with nfs-utils version 2.3.3-57.el8 and using ansible-playbook core version 2.11.12 with a test playbook
- hosts: server-1
  tasks:
    - name: Collect status
      service_facts:
      register: services_state

    - name: Print service_facts
      debug:
        var: services_state

    - name: Collect systemd status
      ansible.builtin.systemd:
        name: "nfs-server"
      register: sysd_service_state

    - name: Print systemd state
      debug:
        var: sysd_service_state
will render the following results
service_facts
...
    "nfs-server.service": {
        "name": "nfs-server.service",
        "source": "systemd",
        "state": "stopped",
        "status": "disabled"
    },
...
ansible.builtin.systemd
...
    "name": "nfs-server",
    "status": {
        "ActiveEnterTimestamp": "Tue 2022-10-04 10:03:17 UTC",
        "ActiveEnterTimestampMonotonic": "7550614760",
        "ActiveExitTimestamp": "Tue 2022-10-04 09:05:43 UTC",
        "ActiveExitTimestampMonotonic": "4096596618",
        "ActiveState": "active",
...
The NFS Server is very much running/active but the service_facts fails to report it as such.
Other services, such as httpd reports correct state in service_facts.
Have I misunderstood or done something wrong here? Or have I run into an anomaly?
Running RHEL 7.9, nfs-utils 1.3.0-0.68.el7, ansible 2.9.27, I was able to observe the same behavior and to reproduce the issue you are observing.
It seems to be caused by "two" redundant service files (or the symlink).
ll /usr/lib/systemd/system/nfs*
...
-rw-r--r--. 1 root root 1044 Oct 1 2022 /usr/lib/systemd/system/nfs-server.service
lrwxrwxrwx. 1 root root 18 Oct 1 2022 /usr/lib/systemd/system/nfs.service -> nfs-server.service
...
diff /usr/lib/systemd/system/nfs.service /usr/lib/systemd/system/nfs-server.service; echo $?
0
Obviously a status request call to systemd will produce the expected result.
systemctl status nfs
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; vendor preset: disabled)
Drop-In: /run/systemd/generator/nfs-server.service.d
└─order-with-mounts.conf
Active: active (exited) since Sat 2022-10-01 22:00:00 CEST; 4 days ago
Process: 1080 ExecStartPost=/bin/sh -c if systemctl -q is-active gssproxy; then systemctl reload gssproxy ; fi (code=exited, status=0/SUCCESS)
Process: 1070 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=0/SUCCESS)
Process: 1065 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
Main PID: 1070 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/nfs-server.service
systemctl status nfs-server
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; vendor preset: disabled)
Drop-In: /run/systemd/generator/nfs-server.service.d
└─order-with-mounts.conf
Active: active (exited) since Sat 2022-10-01 22:00:00 CEST; 4 days ago
Process: 1080 ExecStartPost=/bin/sh -c if systemctl -q is-active gssproxy; then systemctl reload gssproxy ; fi (code=exited, status=0/SUCCESS)
Process: 1070 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=0/SUCCESS)
Process: 1065 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
Main PID: 1070 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/nfs-server.service
However, a test playbook
---
- hosts: nfs_server
  become: true
  gather_facts: false
  tasks:

    - name: Gather Service Facts
      service_facts:

    - name: Show Facts
      debug:
        var: ansible_facts
called via
sshpass -p ${PASSWORD} ansible-playbook --user ${ACCOUNT} --ask-pass service_facts.yml | grep -A4 nfs
will result in an output of
PLAY [nfs_server] ***************
TASK [Gather Service Facts] *****
ok: [test.example.com]
--
...
nfs-server.service:
    name: nfs-server.service
    source: systemd
    state: stopped
    status: enabled
...
nfs.service:
    name: nfs.service
    source: systemd
    state: active
    status: enabled
and reports the state correctly only for the first service file found(?), nfs.service.
Workaround
You could just check for ansible_facts.nfs.service, the alias name.
systemctl show nfs-server.service -p Names
Names=nfs-server.service nfs.service
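A minimal task sketch of that workaround (assuming service_facts has already run, which exposes the collected units under ansible_facts.services):
- name: Check nfs-server state via its alias unit
  debug:
    var: ansible_facts.services['nfs.service'].state
In the output above, nfs.service is the entry that carries state: active, so checking the alias sidesteps the stale nfs-server.service entry.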
Further Investigation
Ansible Issue #73215 "ansible_facts.service returns incorrect state"
Ansible Issue #67262 "service_facts does not return correct state for services"
It might be that this is somehow caused by the behavior described in "What does status "active (exited)" mean for a systemd service?" and by def _list_rh(self, services), even though RemainAfterExit=yes is already set within the .service files.
systemctl list-units --no-pager --type service --all | grep nfs
nfs-config.service loaded inactive dead Preprocess NFS configuration
nfs-idmapd.service loaded active running NFSv4 ID-name mapping service
nfs-mountd.service loaded active running NFS Mount Daemon
● nfs-secure-server.service not-found inactive dead nfs-secure-server.service
nfs-server.service loaded active exited NFS server and services
nfs-utils.service loaded inactive dead NFS server and client services
For further tests
systemctl list-units --no-pager --type service --state=running
# versus
systemctl list-units --no-pager --type service --state=exited
one may also read about NFS server active (exited) and Service Active but (exited).
... please take note that I haven't done further investigation on the issue yet. Currently it is also still not fully clear to me which part of the code in ansible/modules/service_facts.py might be causing this.

Redis replication - Failed to resolve hostname

I am trying to setup a redis replication cluster with 3 redis servers (1 primary and 2 replicas) and 3 redis sentinels.
One server and database pair will exist on a machine and there will be 3 machines, each machine has docker installed.
The issue I am having is that the redis server instances are unable to connect to MASTER and resolve the primary name.
There seem to be no issues with the network or port openings, since I am able to specify each machine's IPv4 address in the configuration below and the communication works as expected.
Redis version available in logs further down
docker-compose version 1.29.2, build 5becea4c
Docker version 20.10.14, build a224086
Ubuntu 20.04.4 LTS
Below are the docker-compose.yml files:
I did not include redis_3 config and output since it is almost identical to redis_2
primary:
version: '3.8'
services:
  redis_1:
    image: bitnami/redis:6.2
    restart: always
    command: ["redis-server", "--protected-mode", "no", "--dir", "/data"]
    environment:
      - REDIS_REPLICA_IP=redis_1
      - REDIS_REPLICATION_MODE=master
      - REDIS_MASTER_PASSWORD=very-good-password
      - REDIS_PASSWORD=very-good-password
    ports:
      - "6379:6379"
    volumes:
      - "/opt/knowit/docker/data:/data"

  sentinel_1:
    image: bitnami/redis-sentinel:6.2
    restart: always
    environment:
      - REDIS_MASTER_HOST=redis_1
      - REDIS_MASTER_PASSWORD=very-good-password
      - REDIS_SENTINEL_ANNOUNCE_IP=redis_1
      - REDIS_SENTINEL_QUORUM=2
      - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
      - REDIS_SENTINEL_FAILOVER_TIMEOUT=60000
      - REDIS_SENTINEL_PASSWORD=other-good-password
      - REDIS_SENTINEL_ANNOUNCE_HOSTNAMES=yes
      - REDIS_SENTINEL_RESOLVE_HOSTNAMES=yes
    ports:
      - "26379:26379"
replica:
version: '3.8'
services:
  redis_2:
    image: bitnami/redis:6.2
    restart: always
    command: ["redis-server", "--protected-mode", "no", "--replicaof", "redis_1", "6379", "--dir", "/data"]
    environment:
      - REDIS_REPLICA_IP=redis_2
      - REDIS_REPLICATION_MODE=replica
      - REDIS_MASTER_PASSWORD=very-good-password
      - REDIS_PASSWORD=very-good-password
    ports:
      - "6379:6379"
    volumes:
      - "/opt/knowit/docker/data:/data"

  sentinel_2:
    image: bitnami/redis-sentinel:6.2
    restart: always
    environment:
      - REDIS_MASTER_HOST=redis_1
      - REDIS_MASTER_PASSWORD=very-good-password
      - REDIS_SENTINEL_ANNOUNCE_IP=redis_2
      - REDIS_SENTINEL_QUORUM=2
      - REDIS_SENTINEL_DOWN_AFTER_MILLISECONDS=5000
      - REDIS_SENTINEL_FAILOVER_TIMEOUT=60000
      - REDIS_SENTINEL_PASSWORD=other-good-password
      - REDIS_SENTINEL_ANNOUNCE_HOSTNAMES=yes
      - REDIS_SENTINEL_RESOLVE_HOSTNAMES=yes
    ports:
      - "26379:26379"
The docker logs look like this:
primary:
$ sudo docker-compose up
Starting docker_redis_1 ... done
Starting docker_sentinel_1 ... done
Attaching to docker_redis_1, docker_sentinel_1
redis_1 | redis 12:15:41.03
redis_1 | redis 12:15:41.04 Welcome to the Bitnami redis container
redis_1 | redis 12:15:41.04 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-redis
redis_1 | redis 12:15:41.04 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-redis/issues
redis_1 |
redis_1 | redis 12:15:41.04
redis_1 | 1:C 29 Apr 2022 12:15:41.068 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis_1 | 1:C 29 Apr 2022 12:15:41.069 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
redis_1 | 1:C 29 Apr 2022 12:15:41.069 # Configuration loaded
redis_1 | 1:M 29 Apr 2022 12:15:41.070 * monotonic clock: POSIX clock_gettime
redis_1 | 1:M 29 Apr 2022 12:15:41.072 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
redis_1 | 1:M 29 Apr 2022 12:15:41.072 * Running mode=standalone, port=6379.
redis_1 | 1:M 29 Apr 2022 12:15:41.072 # Server initialized
redis_1 | 1:M 29 Apr 2022 12:15:41.073 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis_1 | 1:M 29 Apr 2022 12:15:41.073 * Ready to accept connections
sentinel_1 | redis-sentinel 12:15:41.08
sentinel_1 | redis-sentinel 12:15:41.08 Welcome to the Bitnami redis-sentinel container
sentinel_1 | redis-sentinel 12:15:41.08 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-redis-sentinel
sentinel_1 | redis-sentinel 12:15:41.08 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-redis-sentinel/issues
sentinel_1 | redis-sentinel 12:15:41.08
sentinel_1 | redis-sentinel 12:15:41.08 INFO ==> ** Starting Redis sentinel setup **
sentinel_1 | redis-sentinel 12:15:41.11 INFO ==> Initializing Redis Sentinel...
sentinel_1 | redis-sentinel 12:15:41.11 INFO ==> Persisted files detected, restoring...
sentinel_1 | redis-sentinel 12:15:41.12 INFO ==> ** Redis sentinel setup finished! **
sentinel_1 |
sentinel_1 | redis-sentinel 12:15:41.13 INFO ==> ** Starting Redis Sentinel **
sentinel_1 | 1:X 29 Apr 2022 12:15:41.143 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
sentinel_1 | 1:X 29 Apr 2022 12:15:41.144 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
sentinel_1 | 1:X 29 Apr 2022 12:15:41.144 # Configuration loaded
sentinel_1 | 1:X 29 Apr 2022 12:15:41.145 * monotonic clock: POSIX clock_gettime
sentinel_1 | 1:X 29 Apr 2022 12:15:41.146 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
sentinel_1 | 1:X 29 Apr 2022 12:15:41.147 * Running mode=sentinel, port=26379.
sentinel_1 | 1:X 29 Apr 2022 12:15:41.148 # Sentinel ID is 232f6b838b76c348f123597f2852091a77bdae03
sentinel_1 | 1:X 29 Apr 2022 12:15:41.148 # +monitor master mymaster redis_1 6379 quorum 2
replica:
$ sudo docker-compose up
Starting docker_redis_2 ... done
Starting docker_sentinel_2 ... done
Attaching to docker_redis_2, docker_sentinel_2
redis_2 | redis 11:53:24.61
redis_2 | redis 11:53:24.62 Welcome to the Bitnami redis container
redis_2 | redis 11:53:24.63 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-redis
redis_2 | redis 11:53:24.63 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-redis/issues
redis_2 | redis 11:53:24.63
redis_2 |
redis_2 | 1:C 29 Apr 2022 11:53:24.649 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis_2 | 1:C 29 Apr 2022 11:53:24.651 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
redis_2 | 1:C 29 Apr 2022 11:53:24.651 # Configuration loaded
redis_2 | 1:S 29 Apr 2022 11:53:24.653 * monotonic clock: POSIX clock_gettime
redis_2 | 1:S 29 Apr 2022 11:53:24.656 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
redis_2 | 1:S 29 Apr 2022 11:53:24.657 * Running mode=standalone, port=6379.
redis_2 | 1:S 29 Apr 2022 11:53:24.657 # Server initialized
redis_2 | 1:S 29 Apr 2022 11:53:24.657 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis_2 | 1:S 29 Apr 2022 11:53:24.659 * Ready to accept connections
redis_2 | 1:S 29 Apr 2022 11:53:24.659 * Connecting to MASTER redis_1:6379
sentinel_2 | redis-sentinel 11:53:24.70
sentinel_2 | redis-sentinel 11:53:24.71 Welcome to the Bitnami redis-sentinel container
sentinel_2 | redis-sentinel 11:53:24.71 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-redis-sentinel
sentinel_2 | redis-sentinel 11:53:24.71 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-redis-sentinel/issues
sentinel_2 | redis-sentinel 11:53:24.71
sentinel_2 | redis-sentinel 11:53:24.71 INFO ==> ** Starting Redis sentinel setup **
redis_2 | 1:S 29 Apr 2022 11:53:34.673 # Unable to connect to MASTER: Resource temporarily unavailable
sentinel_2 | redis-sentinel 11:53:34.75 WARN ==> Hostname redis_1 could not be resolved, this could lead to connection issues
sentinel_2 | redis-sentinel 11:53:34.76 INFO ==> Initializing Redis Sentinel...
sentinel_2 | redis-sentinel 11:53:34.76 INFO ==> Persisted files detected, restoring...
sentinel_2 | redis-sentinel 11:53:34.77 INFO ==> ** Redis sentinel setup finished! **
sentinel_2 |
sentinel_2 | redis-sentinel 11:53:34.79 INFO ==> ** Starting Redis Sentinel **
redis_2 | 1:S 29 Apr 2022 11:53:35.675 * Connecting to MASTER redis_1:6379
sentinel_2 | 1:X 29 Apr 2022 11:53:44.813 # Failed to resolve hostname 'redis_1'
sentinel_2 | 1:X 29 Apr 2022 11:53:44.813 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
sentinel_2 | 1:X 29 Apr 2022 11:53:44.813 # Redis version=6.2.7, bits=64, commit=00000000, modified=0, pid=1, just started
sentinel_2 | 1:X 29 Apr 2022 11:53:44.813 # Configuration loaded
sentinel_2 | 1:X 29 Apr 2022 11:53:44.814 * monotonic clock: POSIX clock_gettime
sentinel_2 | 1:X 29 Apr 2022 11:53:44.815 # A key '__redis__compare_helper' was added to Lua globals which is not on the globals allow list nor listed on the deny list.
sentinel_2 | 1:X 29 Apr 2022 11:53:44.815 * Running mode=sentinel, port=26379.
sentinel_2 | 1:X 29 Apr 2022 11:53:44.816 # Sentinel ID is bfec501e81d8da33def75f23911b606aa395078d
sentinel_2 | 1:X 29 Apr 2022 11:53:44.816 # +monitor master mymaster redis_1 6379 quorum 2
sentinel_2 | 1:X 29 Apr 2022 11:53:44.817 # +tilt #tilt mode entered
redis_2 | 1:S 29 Apr 2022 11:53:45.687 # Unable to connect to MASTER: Resource temporarily unavailable
redis_2 | 1:S 29 Apr 2022 11:53:46.689 * Connecting to MASTER redis_1:6379
sentinel_2 | 1:X 29 Apr 2022 11:53:54.831 # Failed to resolve hostname 'redis_1'
sentinel_2 | 1:X 29 Apr 2022 11:53:54.914 # +tilt #tilt mode entered
redis_2 | 1:S 29 Apr 2022 11:53:56.701 # Unable to connect to MASTER: Resource temporarily unavailable
Is there some issue with resolving hostnames when the different Redis instances are located on separate machines, or have I just missed something basic?
I would assume the latter, since I have been able to get this up and running by specifying the IP addresses, and the replica also receives the hostname of the primary.
Any help would be much appreciated! Let me know if additional information is required.
Only Sentinel versions 6.2 and above can resolve hostnames, and even then it is not enabled by default. Adding sentinel resolve-hostnames yes to sentinel.conf will help.
If your Sentinel is an older version, the hostname redis_node should be replaced by an IP.
For more details, check out "IP Addresses and DNS names" in the Redis Sentinel documentation.
Reference - answer by Tsonglew
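For illustration, a minimal sentinel.conf sketch with those directives (the master name mymaster and hostname redis_1 are taken from the logs above):
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster redis_1 6379 2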

Hazelcast failed to start in Kubernetes

I have an HA Kubernetes cluster that was initialized with custom certificates. I want to run Hazelcast on it, but there is an error when discovering Hazelcast members using the Kubernetes API.
This is my deploy file:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hazelcast
  labels:
    app: hazelcast
spec:
  replicas: 3
  serviceName: hazelcast-service
  selector:
    matchLabels:
      app: hazelcast
  template:
    metadata:
      labels:
        app: hazelcast
    spec:
      imagePullSecrets:
        - name: nexuspullsecret
      containers:
        - name: hazelcast
          image: 192.168.161.187:9050/hazelcast-custom:4.0.2
          imagePullPolicy: "Always"
          ports:
            - name: hazelcast
              containerPort: 5701
          livenessProbe:
            httpGet:
              path: /hazelcast/health/node-state
              port: 5701
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /hazelcast/health/node-state
              port: 5701
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 1
            successThreshold: 1
            failureThreshold: 1
          resources:
            requests:
              memory: "0"
              cpu: "0"
            limits:
              memory: "2048Mi"
              cpu: "500m"
          volumeMounts:
            - name: hazelcast-storage
              mountPath: /data/hazelcast
          env:
            - name: JAVA_OPTS
              value: "-Dhazelcast.rest.enabled=true -Dhazelcast.config=/data/hazelcast/hazelcast.xml"
      volumes:
        - name: hazelcast-storage
          configMap:
            name: hazelcast-configuration
---
apiVersion: v1
kind: Service
metadata:
  name: hazelcast-service
spec:
  type: ClusterIP
  selector:
    app: hazelcast
  ports:
    - protocol: TCP
      port: 5701
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hazelcast-cluster-role
rules:
  - apiGroups: [""]
    resources: ["endpoints", "pods", "nodes"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hazelcast-cluster-role-binding
subjects:
  - kind: ServiceAccount
    name: default
    namespace: default
roleRef:
  kind: ClusterRole
  name: hazelcast-cluster-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hazelcast
  namespace: default
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: hazelcast
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: hazelcast-configuration
data:
  hazelcast.xml: |-
    <?xml version="1.0" encoding="UTF-8"?>
    <hazelcast xmlns="http://www.hazelcast.com/schema/config"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.hazelcast.com/schema/config
               http://www.hazelcast.com/schema/config/hazelcast-config-4.0.xsd">
      <network>
        <rest-api enabled="true"></rest-api>
        <join>
          <!-- deactivate normal discovery -->
          <multicast enabled="false"/>
          <tcp-ip enabled="false" />
          <!-- activate the Kubernetes plugin -->
          <kubernetes enabled="true">
            <service-name>hazelcast-service</service-name>
            <namespace>default</namespace>
            <kubernetes-api-retries>20</kubernetes-api-retries>
          </kubernetes>
        </join>
      </network>
      <user-code-deployment enabled="true">
        <class-cache-mode>ETERNAL</class-cache-mode>
        <provider-mode>LOCAL_AND_CACHED_CLASSES</provider-mode>
      </user-code-deployment>
      <reliable-topic name="ConfirmationTimeout">
        <read-batch-size>10</read-batch-size>
        <topic-overload-policy>DISCARD_OLDEST</topic-overload-policy>
        <statistics-enabled>true</statistics-enabled>
      </reliable-topic>
      <ringbuffer name="ConfirmationTimeout">
        <capacity>10000</capacity>
        <backup-count>1</backup-count>
        <async-backup-count>0</async-backup-count>
        <time-to-live-seconds>0</time-to-live-seconds>
        <in-memory-format>BINARY</in-memory-format>
        <merge-policy batch-size="100">com.hazelcast.spi.merge.PutIfAbsentMergePolicy</merge-policy>
      </ringbuffer>
      <scheduled-executor-service name="ConfirmationTimeout">
        <capacity>100</capacity>
        <capacity-policy>PER_NODE</capacity-policy>
        <pool-size>32</pool-size>
        <durability>3</durability>
        <merge-policy batch-size="100">com.hazelcast.spi.merge.PutIfAbsentMergePolicy</merge-policy>
      </scheduled-executor-service>
      <cp-subsystem>
        <cp-member-count>3</cp-member-count>
        <group-size>3</group-size>
        <session-time-to-live-seconds>300</session-time-to-live-seconds>
        <session-heartbeat-interval-seconds>5</session-heartbeat-interval-seconds>
        <missing-cp-member-auto-removal-seconds>14400</missing-cp-member-auto-removal-seconds>
        <fail-on-indeterminate-operation-state>false</fail-on-indeterminate-operation-state>
        <raft-algorithm>
          <leader-election-timeout-in-millis>15000</leader-election-timeout-in-millis>
          <leader-heartbeat-period-in-millis>5000</leader-heartbeat-period-in-millis>
          <max-missed-leader-heartbeat-count>10</max-missed-leader-heartbeat-count>
          <append-request-max-entry-count>100</append-request-max-entry-count>
          <commit-index-advance-count-to-snapshot>10000</commit-index-advance-count-to-snapshot>
          <uncommitted-entry-count-to-reject-new-appends>100</uncommitted-entry-count-to-reject-new-appends>
          <append-request-backoff-timeout-in-millis>100</append-request-backoff-timeout-in-millis>
        </raft-algorithm>
        <locks>
          <fenced-lock>
            <name>TimeoutLock</name>
            <lock-acquire-limit>1</lock-acquire-limit>
          </fenced-lock>
        </locks>
      </cp-subsystem>
      <metrics enabled="true">
        <management-center>
          <retention-seconds>30</retention-seconds>
        </management-center>
        <jmx enabled="false"/>
        <collection-frequency-seconds>10</collection-frequency-seconds>
      </metrics>
    </hazelcast>
I have tested this deploy file on an HA Kubernetes cluster that does not use custom SSL certificates, and there it works without any problems.
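For reference, the Kubernetes discovery plugin essentially calls the API server the way the sketch below does; a manual check from inside a pod (assuming the default service-account token/CA paths, and that curl is available in the image, which the Dockerfile further down installs) would be:
kubectl exec -it hazelcast-0 -- sh -c '
  TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
  curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
       -H "Authorization: Bearer $TOKEN" \
       "https://kubernetes.default.svc/api/v1/namespaces/default/endpoints/hazelcast-service"
'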
These are the log files:
########################################
# JAVA_OPTS=-Djava.net.preferIPv4Stack=true -Djava.util.logging.config.file=/opt/hazelcast/logging.properties -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dhazelcast.rest.enabled=true -Dhazelcast.config=/data/hazelcast/hazelcast.xml
# CLASSPATH=/opt/hazelcast/*:/opt/hazelcast/lib/*:/opt/hazelcast/user-lib/*
# CLASSPATH_DEFAULT=/opt/hazelcast/*:/opt/hazelcast/lib/*:/opt/hazelcast/user-lib/*
# starting now....
########################################
+ exec java -server -Djava.net.preferIPv4Stack=true -Djava.util.logging.config.file=/opt/hazelcast/logging.properties -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED -Dhazelcast.rest.enabled=true -Dhazelcast.config=/data/hazelcast/hazelcast.xml com.hazelcast.core.server.HazelcastMemberStarter
Sep 08, 2020 7:08:36 AM com.hazelcast.internal.config.AbstractConfigLocator
INFO: Loading configuration '/data/hazelcast/hazelcast.xml' from System property 'hazelcast.config'
Sep 08, 2020 7:08:37 AM com.hazelcast.internal.config.AbstractConfigLocator
INFO: Using configuration file at /data/hazelcast/hazelcast.xml
Sep 08, 2020 7:08:40 AM com.hazelcast.instance.AddressPicker
INFO: [LOCAL] [dev] [4.0.2] Prefer IPv4 stack is true, prefer IPv6 addresses is false
Sep 08, 2020 7:08:40 AM com.hazelcast.instance.AddressPicker
INFO: [LOCAL] [dev] [4.0.2] Picked [10.42.128.11]:5701, using socket ServerSocket[addr=/0.0.0.0,localport=5701], bind any local is true
Sep 08, 2020 7:08:40 AM com.hazelcast.system
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Hazelcast 4.0.2 (20200702 - 2de3027) starting at [10.42.128.11]:5701
Sep 08, 2020 7:08:40 AM com.hazelcast.system
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Copyright (c) 2008-2020, Hazelcast, Inc. All Rights Reserved.
Sep 08, 2020 7:08:42 AM com.hazelcast.spi.impl.operationservice.impl.BackpressureRegulator
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Backpressure is disabled
Sep 08, 2020 7:08:43 AM com.hazelcast.spi.discovery.integration.DiscoveryService
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Kubernetes Discovery properties: { service-dns: null, service-dns-timeout: 5, service-name: hazelcast-service, service-port: 0, service-label: null, service-label-value: true, namespace: default, pod-label: null, pod-label-value: null, resolve-not-ready-addresses: true, use-node-name-as-external-address: false, kubernetes-api-retries: 20, kubernetes-master: https://kubernetes.default.svc}
Sep 08, 2020 7:08:43 AM com.hazelcast.spi.discovery.integration.DiscoveryService
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Kubernetes Discovery activated with mode: KUBERNETES_API
Sep 08, 2020 7:08:43 AM com.hazelcast.instance.impl.Node
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Activating Discovery SPI Joiner
Sep 08, 2020 7:08:43 AM com.hazelcast.cp.CPSubsystem
INFO: [10.42.128.11]:5701 [dev] [4.0.2] CP Subsystem is enabled with 3 members.
Sep 08, 2020 7:08:44 AM com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Starting 2 partition threads and 3 generic threads (1 dedicated for priority tasks)
Sep 08, 2020 7:08:44 AM com.hazelcast.internal.diagnostics.Diagnostics
INFO: [10.42.128.11]:5701 [dev] [4.0.2] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
Sep 08, 2020 7:08:45 AM com.hazelcast.core.LifecycleService
INFO: [10.42.128.11]:5701 [dev] [4.0.2] [10.42.128.11]:5701 is STARTING
Sep 08, 2020 7:08:47 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [1] retrying in 1 seconds...
Sep 08, 2020 7:08:49 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [2] retrying in 2 seconds...
Sep 08, 2020 7:08:51 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [3] retrying in 3 seconds...
Sep 08, 2020 7:08:54 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [4] retrying in 5 seconds...
Sep 08, 2020 7:09:00 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [5] retrying in 7 seconds...
Sep 08, 2020 7:09:07 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [6] retrying in 11 seconds...
Sep 08, 2020 7:09:12 AM com.hazelcast.internal.ascii.rest.HttpPostCommandProcessor
WARNING: [10.42.128.11]:5701 [dev] [4.0.2] An error occurred while handling request HttpCommand [HTTP_GET]{uri='/hazelcast/health/node-state'}AbstractTextCommand[HTTP_GET]{requestId=0}
java.lang.NullPointerException
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handleHealthcheck(HttpGetCommandProcessor.java:137)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:79)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:47)
at com.hazelcast.internal.ascii.TextCommandServiceImpl$CommandExecutor.run(TextCommandServiceImpl.java:396)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Sep 08, 2020 7:09:14 AM com.hazelcast.internal.ascii.rest.HttpPostCommandProcessor
WARNING: [10.42.128.11]:5701 [dev] [4.0.2] An error occurred while handling request HttpCommand [HTTP_GET]{uri='/hazelcast/health/node-state'}AbstractTextCommand[HTTP_GET]{requestId=0}
java.lang.NullPointerException
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handleHealthcheck(HttpGetCommandProcessor.java:137)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:79)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:47)
at com.hazelcast.internal.ascii.TextCommandServiceImpl$CommandExecutor.run(TextCommandServiceImpl.java:396)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Sep 08, 2020 7:09:19 AM com.hazelcast.kubernetes.RetryUtils
WARNING: Couldn't discover Hazelcast members using Kubernetes API, [7] retrying in 17 seconds...
Sep 08, 2020 7:09:22 AM com.hazelcast.internal.ascii.rest.HttpPostCommandProcessor
WARNING: [10.42.128.11]:5701 [dev] [4.0.2] An error occurred while handling request HttpCommand [HTTP_GET]{uri='/hazelcast/health/node-state'}AbstractTextCommand[HTTP_GET]{requestId=0}
java.lang.NullPointerException
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handleHealthcheck(HttpGetCommandProcessor.java:137)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:79)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:47)
at com.hazelcast.internal.ascii.TextCommandServiceImpl$CommandExecutor.run(TextCommandServiceImpl.java:396)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Sep 08, 2020 7:09:24 AM com.hazelcast.internal.ascii.rest.HttpPostCommandProcessor
WARNING: [10.42.128.11]:5701 [dev] [4.0.2] An error occurred while handling request HttpCommand [HTTP_GET]{uri='/hazelcast/health/node-state'}AbstractTextCommand[HTTP_GET]{requestId=0}
java.lang.NullPointerException
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handleHealthcheck(HttpGetCommandProcessor.java:137)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:79)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:47)
at com.hazelcast.internal.ascii.TextCommandServiceImpl$CommandExecutor.run(TextCommandServiceImpl.java:396)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Sep 08, 2020 7:09:32 AM com.hazelcast.internal.ascii.rest.HttpPostCommandProcessor
WARNING: [10.42.128.11]:5701 [dev] [4.0.2] An error occurred while handling request HttpCommand [HTTP_GET]{uri='/hazelcast/health/node-state'}AbstractTextCommand[HTTP_GET]{requestId=0}
java.lang.NullPointerException
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handleHealthcheck(HttpGetCommandProcessor.java:137)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:79)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:47)
at com.hazelcast.internal.ascii.TextCommandServiceImpl$CommandExecutor.run(TextCommandServiceImpl.java:396)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Sep 08, 2020 7:09:34 AM com.hazelcast.internal.ascii.rest.HttpPostCommandProcessor
WARNING: [10.42.128.11]:5701 [dev] [4.0.2] An error occurred while handling request HttpCommand [HTTP_GET]{uri='/hazelcast/health/node-state'}AbstractTextCommand[HTTP_GET]{requestId=0}
java.lang.NullPointerException
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handleHealthcheck(HttpGetCommandProcessor.java:137)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:79)
at com.hazelcast.internal.ascii.rest.HttpGetCommandProcessor.handle(HttpGetCommandProcessor.java:47)
at com.hazelcast.internal.ascii.TextCommandServiceImpl$CommandExecutor.run(TextCommandServiceImpl.java:396)
at com.hazelcast.internal.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:217)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
This is our custom Dockerfile for Hazelcast (we needed some changes in it):
FROM alpine:3.11
# Versions of Hazelcast and Hazelcast plugins
ARG HZ_VERSION=4.0.2
ARG CACHE_API_VERSION=1.1.1
ARG JMX_PROMETHEUS_AGENT_VERSION=0.13.0
ARG BUCKET4J_VERSION=4.10.0
# Build constants
ARG HZ_HOME="/opt/hazelcast"
# JARs to download
# for lib directory:
ARG HAZELCAST_ALL_URL="https://repo1.maven.org/maven2/com/hazelcast/hazelcast-all/${HZ_VERSION}/hazelcast-all-${HZ_VERSION}.jar"
# for user-lib directory:
ARG JCACHE_API_URL="https://repo1.maven.org/maven2/javax/cache/cache-api/${CACHE_API_VERSION}/cache-api-${CACHE_API_VERSION}.jar"
ARG PROMETHEUS_AGENT_URL="https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${JMX_PROMETHEUS_AGENT_VERSION}/jmx_prometheus_javaagent-${JMX_PROMETHEUS_AGENT_VERSION}.jar"
ARG BUCKET4J_CORE_URL="https://repo1.maven.org/maven2/com/github/vladimir-bukhtoyarov/bucket4j-core/${BUCKET4J_VERSION}/bucket4j-core-${BUCKET4J_VERSION}.jar"
ARG BUCKET4J_HAZELCAST_URL="https://repo1.maven.org/maven2/com/github/vladimir-bukhtoyarov/bucket4j-hazelcast/${BUCKET4J_VERSION}/bucket4j-hazelcast-${BUCKET4J_VERSION}.jar"
ARG BUCKET4J_JCACHE_URL="https://repo1.maven.org/maven2/com/github/vladimir-bukhtoyarov/bucket4j-jcache/${BUCKET4J_VERSION}/bucket4j-jcache-${BUCKET4J_VERSION}.jar"
# Runtime constants / variables
ENV HZ_HOME="${HZ_HOME}" \
CLASSPATH_DEFAULT="${HZ_HOME}/*:${HZ_HOME}/lib/*:${HZ_HOME}/user-lib/*" \
JAVA_OPTS_DEFAULT="-Djava.net.preferIPv4Stack=true -Djava.util.logging.config.file=${HZ_HOME}/logging.properties -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC --add-modules java.se --add-exports java.base/jdk.internal.ref=ALL-UNNAMED --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/sun.nio.ch=ALL-UNNAMED --add-opens java.management/sun.management=ALL-UNNAMED --add-opens jdk.management/com.sun.management.internal=ALL-UNNAMED" \
PROMETHEUS_PORT="" \
PROMETHEUS_CONFIG="${HZ_HOME}/jmx_agent_config.yaml" \
LOGGING_LEVEL="" \
CLASSPATH="" \
JAVA_OPTS=""
# Expose port
EXPOSE 5701
COPY *.sh *.yaml *.jar *.properties ${HZ_HOME}/
RUN echo "Updating Alpine system" \
&& apk upgrade --update-cache --available \
&& echo "Installing new APK packages" \
&& apk add openjdk11-jre bash curl procps nss
RUN mkdir "${HZ_HOME}/user-lib"\
&& cd "${HZ_HOME}/user-lib" \
&& for USER_JAR_URL in ${JCACHE_API_URL} ${PROMETHEUS_AGENT_URL} ${BUCKET4J_CORE_URL} ${BUCKET4J_HAZELCAST_URL} ${BUCKET4J_JCACHE_URL}; do curl -sf -O -L ${USER_JAR_URL}; done
# Install
RUN echo "Downloading Hazelcast and related JARs" \
&& mkdir "${HZ_HOME}/lib" \
&& cd "${HZ_HOME}/lib" \
&& for JAR_URL in ${HAZELCAST_ALL_URL}; do curl -sf -O -L ${JAR_URL}; done \
&& echo "Granting read permission to ${HZ_HOME}" \
&& chmod 755 -R ${HZ_HOME} \
&& echo "Setting Pardot ID to 'docker'" \
&& echo 'hazelcastDownloadId=docker' > "${HZ_HOME}/hazelcast-download.properties" \
&& echo "Cleaning APK packages" \
&& rm -rf /var/cache/apk/*
WORKDIR ${HZ_HOME}
# Start Hazelcast server
CMD ["/opt/hazelcast/start-hazelcast.sh"]
The Hazelcast Kubernetes discovery plugin does not allow you to specify a custom location for the certificates. They are always read from the default location: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt.
The parameter ca-certificate lets you inline the certificate as presented here, but not specify the path to the certificate.
If you think such a feature would be useful, feel free to create a GitHub issue at https://github.com/hazelcast/hazelcast-kubernetes (you can also send a PR with the change).
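For illustration, a sketch of what inlining the certificate with that ca-certificate property could look like in the join section of hazelcast.xml (the certificate body is a placeholder):
<kubernetes enabled="true">
  <service-name>hazelcast-service</service-name>
  <namespace>default</namespace>
  <kubernetes-api-retries>20</kubernetes-api-retries>
  <!-- placeholder: paste the cluster's custom CA certificate here -->
  <ca-certificate>
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
  </ca-certificate>
</kubernetes>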

redis cluster in Kubernetes doesn't write nodes.conf file

I'm trying to set up a Redis cluster and I followed this guide here: https://rancher.com/blog/2019/deploying-redis-cluster/
Basically I'm creating a StatefulSet with replicas: 6, so that I can have 3 master nodes and 3 slave nodes.
After all the nodes are up, I create the cluster, and it all works fine... but if I look into the file "nodes.conf" (where the configuration of all the nodes should be saved) of each Redis node, I can see it's empty.
This is a problem, because whenever a Redis node gets restarted, it searches that file for its own configuration in order to update its IP address and MEET the other nodes, but it finds nothing, so it basically starts a new cluster on its own, with a new ID.
My storage is an NFS-connected shared folder. The YAML responsible for the storage access is this one:
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: nfs-provisioner-raid5
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nfs-provisioner-raid5
    spec:
      serviceAccountName: nfs-provisioner-raid5
      containers:
        - name: nfs-provisioner-raid5
          image: quay.io/external_storage/nfs-client-provisioner:latest
          volumeMounts:
            - name: nfs-raid5-root
              mountPath: /persistentvolumes
          env:
            - name: PROVISIONER_NAME
              value: 'nfs.raid5'
            - name: NFS_SERVER
              value: 10.29.10.100
            - name: NFS_PATH
              value: /raid5
      volumes:
        - name: nfs-raid5-root
          nfs:
            server: 10.29.10.100
            path: /raid5
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfs-provisioner-raid5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs.raid5
provisioner: nfs.raid5
parameters:
  archiveOnDelete: "false"
This is the YAML of the redis cluster StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
  labels:
    app: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
        - name: redis
          image: redis:5-alpine
          ports:
            - containerPort: 6379
              name: client
            - containerPort: 16379
              name: gossip
          command: ["/conf/fix-ip.sh", "redis-server", "/conf/redis.conf"]
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - "redis-cli -h $(hostname) ping"
            initialDelaySeconds: 15
            timeoutSeconds: 5
          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - "redis-cli -h $(hostname) ping"
            initialDelaySeconds: 20
            periodSeconds: 3
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          volumeMounts:
            - name: conf
              mountPath: /conf
              readOnly: false
            - name: data
              mountPath: /data
              readOnly: false
      volumes:
        - name: conf
          configMap:
            name: redis-cluster
            defaultMode: 0755
  volumeClaimTemplates:
    - metadata:
        name: data
        labels:
          name: redis-cluster
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: nfs.raid5
        resources:
          requests:
            storage: 1Gi
This is the configMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-cluster
  labels:
    app: redis-cluster
data:
  fix-ip.sh: |
    #!/bin/sh
    CLUSTER_CONFIG="/data/nodes.conf"
    echo "creating nodes"
    if [ -f ${CLUSTER_CONFIG} ]; then
      if [ -z "${POD_IP}" ]; then
        echo "Unable to determine Pod IP address!"
        exit 1
      fi
      echo "Updating my IP to ${POD_IP} in ${CLUSTER_CONFIG}"
      sed -i.bak -e "/myself/ s/[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}/${POD_IP}/" ${CLUSTER_CONFIG}
      echo "done"
    fi
    exec "$@"
  redis.conf: |+
    cluster-enabled yes
    cluster-require-full-coverage no
    cluster-node-timeout 15000
    cluster-config-file /data/nodes.conf
    cluster-migration-barrier 1
    appendonly yes
    protected-mode no
and I created the cluster using the command:
kubectl exec -it redis-cluster-0 -- redis-cli --cluster create --cluster-replicas 1 $(kubectl get pods -l app=redis-cluster -o jsonpath='{range.items[*]}{.status.podIP}:6379 ')
What am I doing wrong?
This is what I see in the /data folder: the nodes.conf file is there, but it shows 0 bytes.
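A check along these lines (my addition, not part of the original post) shows the same thing from inside the pod:
kubectl exec -it redis-cluster-0 -- ls -l /data
kubectl exec -it redis-cluster-0 -- wc -c /data/nodes.conf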
Lastly, this is the log from the redis-cluster-0 pod:
creating nodes
1:C 07 Nov 2019 13:01:31.166 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 07 Nov 2019 13:01:31.166 # Redis version=5.0.4, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 07 Nov 2019 13:01:31.166 # Configuration loaded
1:M 07 Nov 2019 13:01:31.179 * No cluster configuration found, I'm e55801f9b5d52f4e599fe9dba5a0a1e8dde2cdcb
1:M 07 Nov 2019 13:01:31.182 * Running mode=cluster, port=6379.
1:M 07 Nov 2019 13:01:31.182 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 07 Nov 2019 13:01:31.182 # Server initialized
1:M 07 Nov 2019 13:01:31.182 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 07 Nov 2019 13:01:31.185 * Ready to accept connections
1:M 07 Nov 2019 13:08:04.264 # configEpoch set to 1 via CLUSTER SET-CONFIG-EPOCH
1:M 07 Nov 2019 13:08:04.306 # IP address for this node updated to 10.40.0.27
1:M 07 Nov 2019 13:08:09.216 # Cluster state changed: ok
1:M 07 Nov 2019 13:08:10.144 * Replica 10.44.0.14:6379 asks for synchronization
1:M 07 Nov 2019 13:08:10.144 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '27972faeb07fe922f1ab581cac0fe467c85c3efd', my replication IDs are '31944091ef93e3f7c004908e3ff3114fd733ea6a' and '0000000000000000000000000000000000000000')
1:M 07 Nov 2019 13:08:10.144 * Starting BGSAVE for SYNC with target: disk
1:M 07 Nov 2019 13:08:10.144 * Background saving started by pid 1041
1041:C 07 Nov 2019 13:08:10.161 * DB saved on disk
1041:C 07 Nov 2019 13:08:10.161 * RDB: 0 MB of memory used by copy-on-write
1:M 07 Nov 2019 13:08:10.233 * Background saving terminated with success
1:M 07 Nov 2019 13:08:10.243 * Synchronization with replica 10.44.0.14:6379 succeeded
Thank you for the help.
This looks to be an issue with the shell script that is mounted from the ConfigMap. Can you update it as below? The change creates ${CLUSTER_CONFIG} with touch when the file does not already exist:
fix-ip.sh: |
  #!/bin/sh
  CLUSTER_CONFIG="/data/nodes.conf"
  echo "creating nodes"
  if [ -f ${CLUSTER_CONFIG} ]; then
    echo "[ INFO ]File:${CLUSTER_CONFIG} is Found"
  else
    touch $CLUSTER_CONFIG
  fi
  if [ -z "${POD_IP}" ]; then
    echo "Unable to determine Pod IP address!"
    exit 1
  fi
  echo "Updating my IP to ${POD_IP} in ${CLUSTER_CONFIG}"
  sed -i.bak -e "/myself/ s/[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}/${POD_IP}/" ${CLUSTER_CONFIG}
  echo "done"
  exec "$@"
I just deployed with the updated script and it worked. See the output below:
master $ kubectl get po
NAME READY STATUS RESTARTS AGE
redis-cluster-0 1/1 Running 0 83s
redis-cluster-1 1/1 Running 0 54s
redis-cluster-2 1/1 Running 0 45s
redis-cluster-3 1/1 Running 0 38s
redis-cluster-4 1/1 Running 0 31s
redis-cluster-5 1/1 Running 0 25s
master $ kubectl exec -it redis-cluster-0 -- redis-cli --cluster create --cluster-replicas 1 $(kubectl get pods -l app=redis-cluster -o jsonpath='{range.items[*]}{.status.podIP}:6379 ')
>>> Performing hash slots allocation on 6 nodes...
Master[0] -> Slots 0 - 5460
Master[1] -> Slots 5461 - 10922
Master[2] -> Slots 10923 - 16383
Adding replica 10.40.0.4:6379 to 10.40.0.1:6379
Adding replica 10.40.0.5:6379 to 10.40.0.2:6379
Adding replica 10.40.0.6:6379 to 10.40.0.3:6379
M: 9984141f922bed94bfa3532ea5cce43682fa524c 10.40.0.1:6379
slots:[0-5460] (5461 slots) master
M: 76ebee0dd19692c2b6d95f0a492d002cef1c6c17 10.40.0.2:6379
slots:[5461-10922] (5462 slots) master
M: 045b27c73069bff9ca9a4a1a3a2454e9ff640d1a 10.40.0.3:6379
slots:[10923-16383] (5461 slots) master
S: 1bc8d1b8e2d05b870b902ccdf597c3eece7705df 10.40.0.4:6379
replicates 9984141f922bed94bfa3532ea5cce43682fa524c
S: 5b2b019ba8401d3a8c93a8133db0766b99aac850 10.40.0.5:6379
replicates 76ebee0dd19692c2b6d95f0a492d002cef1c6c17
S: d4b91700b2bb1a3f7327395c58b32bb4d3521887 10.40.0.6:6379
replicates 045b27c73069bff9ca9a4a1a3a2454e9ff640d1a
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join
....
>>> Performing Cluster Check (using node 10.40.0.1:6379)
M: 9984141f922bed94bfa3532ea5cce43682fa524c 10.40.0.1:6379
slots:[0-5460] (5461 slots) master
1 additional replica(s)
M: 045b27c73069bff9ca9a4a1a3a2454e9ff640d1a 10.40.0.3:6379
slots:[10923-16383] (5461 slots) master
1 additional replica(s)
S: 1bc8d1b8e2d05b870b902ccdf597c3eece7705df 10.40.0.4:6379
slots: (0 slots) slave
replicates 9984141f922bed94bfa3532ea5cce43682fa524c
S: d4b91700b2bb1a3f7327395c58b32bb4d3521887 10.40.0.6:6379
slots: (0 slots) slave
replicates 045b27c73069bff9ca9a4a1a3a2454e9ff640d1a
M: 76ebee0dd19692c2b6d95f0a492d002cef1c6c17 10.40.0.2:6379
slots:[5461-10922] (5462 slots) master
1 additional replica(s)
S: 5b2b019ba8401d3a8c93a8133db0766b99aac850 10.40.0.5:6379
slots: (0 slots) slave
replicates 76ebee0dd19692c2b6d95f0a492d002cef1c6c17
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
master $ kubectl exec -it redis-cluster-0 -- redis-cli cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:61
cluster_stats_messages_pong_sent:76
cluster_stats_messages_sent:137
cluster_stats_messages_ping_received:71
cluster_stats_messages_pong_received:61
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:137
master $ for x in $(seq 0 5); do echo "redis-cluster-$x"; kubectl exec redis-cluster-$x -- redis-cli role;echo; done
redis-cluster-0
master
588
10.40.0.4
6379
588
redis-cluster-1
master
602
10.40.0.5
6379
602
redis-cluster-2
master
588
10.40.0.6
6379
588
redis-cluster-3
slave
10.40.0.1
6379
connected
602
redis-cluster-4
slave
10.40.0.2
6379
connected
602
redis-cluster-5
slave
10.40.0.3
6379
connected
588
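As a follow-up check (my addition, not part of the original answer), one way to confirm that the original problem is gone is to delete one pod and verify that the recreated pod comes back with a non-empty nodes.conf and rejoins the existing cluster instead of starting a new one:
kubectl exec redis-cluster-0 -- cat /data/nodes.conf        # should now list all six nodes
kubectl delete pod redis-cluster-0
# wait for the StatefulSet to recreate the pod, then:
kubectl exec redis-cluster-0 -- redis-cli cluster info      # expect cluster_state:ok and cluster_known_nodes:6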