RabbitMQ fails to start after Kubernetes cluster restart

I'm running RabbitMQ on Kubernetes. This is my StatefulSet YAML file:
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 15672
    name: http
  selector:
    app: rabbitmq
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 5672
    name: amqp
  - port: 4369
    name: epmd
  - port: 25672
    name: rabbitmq-dist
  - port: 61613
    name: stomp
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: "rabbitmq"
  replicas: 3
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: rabbitmq:management-alpine
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - >
                rabbitmq-plugins enable rabbitmq_stomp;
                if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                cat /etc/resolv.conf.new > /etc/resolv.conf;
                rm /etc/resolv.conf.new;
                fi;
                until rabbitmqctl node_health_check; do sleep 1; done;
                if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                rabbitmqctl stop_app;
                rabbitmqctl join_cluster rabbit@rabbitmq-0;
                rabbitmqctl start_app;
                fi;
                rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
        env:
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-config
              key: erlang-cookie
        ports:
        - containerPort: 5672
          name: amqp
        - containerPort: 61613
          name: stomp
        volumeMounts:
        - name: rabbitmq
          mountPath: /var/lib/rabbitmq
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq
      annotations:
        volume.alpha.kubernetes.io/storage-class: do-block-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
I created the cookie with this command:
kubectl create secret generic rabbitmq-config --from-literal=erlang-cookie=c-is-for-cookie-thats-good-enough-for-me
All of my Kubernetes cluster nodes are Ready:
kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION
kubernetes-master    Ready    master   14d   v1.17.3
kubernetes-slave-1   Ready    <none>   14d   v1.17.3
kubernetes-slave-2   Ready    <none>   14d   v1.17.3
But after restarting the cluster, RabbitMQ didn't start. I tried scaling the StatefulSet down and up again, but the problem persists. The output of kubectl describe pod rabbitmq-0:
kubectl describe pod rabbitmq-0
Name: rabbitmq-0
Namespace: default
Priority: 0
Node: kubernetes-slave-1/192.168.0.179
Start Time: Tue, 24 Mar 2020 22:31:04 +0000
Labels: app=rabbitmq
controller-revision-hash=rabbitmq-6748869f4b
statefulset.kubernetes.io/pod-name=rabbitmq-0
Annotations: <none>
Status: Running
IP: 10.244.1.163
IPs:
IP: 10.244.1.163
Controlled By: StatefulSet/rabbitmq
Containers:
rabbitmq:
Container ID: docker://d5108f818525030b4fdb548eb40f0dc000dd2cec473ebf8cead315116e3efbd3
Image: rabbitmq:management-alpine
Image ID: docker-pullable://rabbitmq#sha256:6f7c8d01d55147713379f5ca26e3f20eca63eb3618c263b12440b31c697ee5a5
Ports: 5672/TCP, 61613/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: PostStartHookError: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-575-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
node_health_check
Usage
rabbitmqctl [--node <node>] [--longnames] [--quiet] node_health_check [--timeout <timeout>]
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-10397-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 24 Mar 2020 22:46:09 +0000
Finished: Tue, 24 Mar 2020 22:58:28 +0000
Ready: False
Restart Count: 1
Environment:
RABBITMQ_ERLANG_COOKIE: <set to the key 'erlang-cookie' in secret 'rabbitmq-config'> Optional: false
Mounts:
/var/lib/rabbitmq from rabbitmq (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bbl9c (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
rabbitmq:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: rabbitmq-rabbitmq-0
ReadOnly: false
default-token-bbl9c:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bbl9c
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 31m default-scheduler Successfully assigned default/rabbitmq-0 to kubernetes-slave-1
Normal Pulled 31m kubelet, kubernetes-slave-1 Container image "rabbitmq:management-alpine" already present on machine
Normal Created 31m kubelet, kubernetes-slave-1 Created container rabbitmq
Normal Started 31m kubelet, kubernetes-slave-1 Started container rabbitmq
Normal SandboxChanged 16m (x9 over 17m) kubelet, kubernetes-slave-1 Pod sandbox changed, it will be killed and re-created.
Normal Pulled 3m58s (x2 over 16m) kubelet, kubernetes-slave-1 Container image "rabbitmq:management-alpine" already present on machine
Warning FailedPostStartHook 3m58s kubelet, kubernetes-slave-1 Exec lifecycle hook ([/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(2e561153-a830-4d30-ab1e-71c80d10c9e9)" failed - error: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
other nodes on rabbitmq-0: [rabbitmqprelaunch1]
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-433-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-575-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
node_health_check
, message: "Enabling plugins on node rabbit#rabbitmq-0:\nrabbitmq_stomp\nThe following plugins have been configured:\n rabbitmq_management\n rabbitmq_management_agent\n rabbitmq_stomp\n rabbitmq_web_dispatch\nApplying plugin configuration to rabbit#rabbitmq-0...\nThe following plugins have been enabled:\n rabbitmq_stomp\n\nset 4 plugins.\nOffline change; changes will take effect at broker restart.\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\
Error:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.\n\nMost common reasons for this are:\n\n * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)\n * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)\n * Target node is not running\n\nIn addition to the diagnostics info below:\n\n * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more\n * Consult server logs on node rabbit#rabbitmq-0\n * If target node is configured to use long node names, don't forget to use --longnames with CLI tools\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit#rabbitmq-0']\n\nrabbit#rabbitmq-0:\n * connected to epmd (port 4369) on rabbitmq-0\n * epmd reports: node 'rabbit' not running at all\n no other nodes on rabbitmq-0\n * suggestion: start the node\n\nCurrent node details:\n * node name: 'rabbitmqcli-10397-rabbit#rabbitmq-0'\n * effective user's home directory: /var/lib/rabbitmq\n * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\n"
Normal Killing 3m58s kubelet, kubernetes-slave-1 FailedPostStartHook
Normal Created 3m57s (x2 over 16m) kubelet, kubernetes-slave-1 Created container rabbitmq
Normal Started 3m57s (x2 over 16m) kubelet, kubernetes-slave-1 Started container rabbitmq
The output of kubectl get sts:
kubectl get sts
NAME        READY   AGE
consul      3/3     15d
hazelcast   2/3     15d
kafka       2/3     15d
rabbitmq    0/3     13d
zk          3/3     15d
And this is the pod log that I copied from the Kubernetes dashboard:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: list of feature flags found:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] drop_unroutable_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] empty_basic_get_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] implicit_default_bindings
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] quorum_queue
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] virtual_host_metadata
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2020-03-24 22:58:43.979 [info] <0.319.0> ra: meta data store initialised. 0 record(s) recovered
2020-03-24 22:58:43.980 [info] <0.324.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit#rabbitmq-0/quorum/rabbit#rabbitmq-0/00000262.wal"]
2020-03-24 22:58:43.982 [info] <0.328.0>
Starting RabbitMQ 3.8.2 on Erlang 22.2.8
Copyright (c) 2007-2019 Pivotal Software, Inc.
Licensed under the MPL 1.1. Website: https://rabbitmq.com
## ## RabbitMQ 3.8.2
## ##
########## Copyright (c) 2007-2019 Pivotal Software, Inc.
###### ##
########## Licensed under the MPL 1.1. Website: https://rabbitmq.com
Doc guides: https://rabbitmq.com/documentation.html
Support: https://rabbitmq.com/contact.html
Tutorials: https://rabbitmq.com/getstarted.html
Monitoring: https://rabbitmq.com/monitoring.html
Logs: <stdout>
Config file(s): /etc/rabbitmq/rabbitmq.conf
Starting broker...2020-03-24 22:58:43.983 [info] <0.328.0>
node : rabbit#rabbitmq-0
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : P1XNOe5pN3Ug2FCRFzH7Xg==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit#rabbitmq-0
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step pre_boot defined by app rabbit
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step rabbit_core_metrics defined by app rabbit
2020-03-24 22:58:43.998 [info] <0.328.0> Running boot step rabbit_alarm defined by app rabbit
2020-03-24 22:58:44.002 [info] <0.334.0> Memory high watermark set to 1200 MiB (1258889216 bytes) of 3001 MiB (3147223040 bytes) total
2020-03-24 22:58:44.014 [info] <0.336.0> Enabling free disk space monitoring
2020-03-24 22:58:44.014 [info] <0.336.0> Disk free limit set to 50MB
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step code_server_cache defined by app rabbit
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step file_handle_cache defined by app rabbit
2020-03-24 22:58:44.019 [info] <0.339.0> Limiting to approx 1048479 file handles (943629 sockets)
2020-03-24 22:58:44.019 [info] <0.340.0> FHC read buffering: OFF
2020-03-24 22:58:44.019 [info] <0.340.0> FHC write buffering: ON
2020-03-24 22:58:44.020 [info] <0.328.0> Running boot step worker_pool defined by app rabbit
2020-03-24 22:58:44.021 [info] <0.329.0> Will use 2 processes for default worker pool
2020-03-24 22:58:44.021 [info] <0.329.0> Starting worker pool 'worker_pool' with 2 processes in it
2020-03-24 22:58:44.021 [info] <0.328.0> Running boot step database defined by app rabbit
2020-03-24 22:58:44.041 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 22:59:14.042 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:14.042 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 22:59:44.043 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:44.043 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:00:14.044 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:14.044 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:00:44.045 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:44.045 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:01:14.046 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:14.046 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:01:44.047 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:44.047 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:02:14.048 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:14.048 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:02:44.049 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:44.049 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:03:14.050 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:03:14.050 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:03:44.051 [error] <0.328.0> Feature flag `quorum_queue`: migration function crashed: {error,{timeout_waiting_for_tables,[rabbit_durable_queue]}}
[{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,117}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,60}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1486}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-2-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2128}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2126}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,1947}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,631}]}]
2020-03-24 23:03:44.051 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 23:04:14.052 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:14.052 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 23:04:44.053 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:44.053 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:05:14.055 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:14.055 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:05:44.056 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:44.056 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:06:14.057 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:14.057 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:06:44.058 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:44.058 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:07:14.059 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:14.059 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:07:44.060 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:44.060 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:08:14.061 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:08:14.061 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:08:44.062 [error] <0.327.0> CRASH REPORT Process <0.327.0> with 0 neighbours exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2020-03-24 23:08:44.063 [info] <0.43.0> Application rabbit exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_r
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Take a look at:
https://www.rabbitmq.com/clustering.html#restarting
You should be able to stop the app and then force boot:
rabbitmqctl stop_app
rabbitmqctl force_boot
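For example, here is a minimal sketch of running those commands from outside the pod, assuming the default namespace and that the rabbitmq-0 container stays up long enough to exec into:
kubectl exec rabbitmq-0 -- rabbitmqctl stop_app
# mark the node so its next boot does not wait for the other cluster members
kubectl exec rabbitmq-0 -- rabbitmqctl force_boot
kubectl exec rabbitmq-0 -- rabbitmqctl start_app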

To complete @Vincent Gerris' answer, I strongly recommend using the Bitnami RabbitMQ Docker image.
They include an environment variable called RABBITMQ_FORCE_BOOT:
https://github.com/bitnami/bitnami-docker-rabbitmq/blob/2c38682053dd9b3e88ab1fb305355d2ce88c2ccb/3.9/debian-10/rootfs/opt/bitnami/scripts/librabbitmq.sh#L760
if is_boolean_yes "$RABBITMQ_FORCE_BOOT" && ! is_dir_empty "${RABBITMQ_DATA_DIR}/${RABBITMQ_NODE_NAME}"; then
    # ref: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot
    warn "Forcing node to start..."
    debug_execute "${RABBITMQ_BIN_DIR}/rabbitmqctl" force_boot
fi
This will force-boot the node at the entrypoint.
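For example, a minimal sketch of wiring that into the StatefulSet container when switching to the Bitnami image (image tag and surrounding structure are illustrative, not taken from the question above):
containers:
- name: rabbitmq
  image: bitnami/rabbitmq:3.9
  env:
  - name: RABBITMQ_FORCE_BOOT
    value: "yes"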

Related

Readiness fails in the Eclipse Hono pods of the Cloud2Edge package

I am a bit desperate and I hope someone can help me. A few months ago I installed the Eclipse Cloud2Edge package on a Kubernetes cluster by following the installation instructions, creating a PersistentVolume and running the helm install command with these options:
helm install -n $NS --wait --timeout 15m $RELEASE eclipse-iot/cloud2edge --set hono.prometheus.createInstance=false --set hono.grafana.enabled=false --dependency-update --debug
The YAML of the PersistentVolume is the following, and I create it in the same namespace in which I install the package:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-device-registry
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Mi
  hostPath:
    path: /mnt/
    type: Directory
Everything worked perfectly, with all pods ready and running, until the other day, when the cluster crashed and some pods stopped working.
The kubectl get pods -n $NS output is as follows:
NAME READY STATUS RESTARTS AGE
ditto-mongodb-7b78b468fb-8kshj 1/1 Running 0 50m
dt-adapter-amqp-vertx-6699ccf495-fc8nx 0/1 Running 0 50m
dt-adapter-http-vertx-545564ff9f-gx5fp 0/1 Running 0 50m
dt-adapter-mqtt-vertx-58c8975678-k5n49 0/1 Running 0 50m
dt-artemis-6759fb6cb8-5rq8p 1/1 Running 1 50m
dt-dispatch-router-5bc7586f76-57dwb 1/1 Running 0 50m
dt-ditto-concierge-f6d5f6f9c-pfmcw 1/1 Running 0 50m
dt-ditto-connectivity-f556db698-q89bw 1/1 Running 0 50m
dt-ditto-gateway-589d8f5596-59c5b 1/1 Running 0 50m
dt-ditto-nginx-897b5bc76-cx2dr 1/1 Running 0 50m
dt-ditto-policies-75cb5c6557-j5zdg 1/1 Running 0 50m
dt-ditto-swaggerui-6f6f989ccd-jkhsk 1/1 Running 0 50m
dt-ditto-things-79ff869bc9-l9lct 1/1 Running 0 50m
dt-ditto-thingssearch-58c5578bb9-pwd9k 1/1 Running 0 50m
dt-service-auth-698d4cdfff-ch5wp 1/1 Running 0 50m
dt-service-command-router-59d6556b5f-4nfcj 0/1 Running 0 50m
dt-service-device-registry-7cf75d794f-pk9ct 0/1 Running 0 50m
The pods that fail all have the same error when running kubectl describe pod POD_NAME -n $NS.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 53m default-scheduler Successfully assigned digitaltwins/dt-service-command-router-59d6556b5f-4nfcj to node1
Normal Pulled 53m kubelet Container image "index.docker.io/eclipse/hono-service-command-router:1.8.0" already present on machine
Normal Created 53m kubelet Created container service-command-router
Normal Started 53m kubelet Started container service-command-router
Warning Unhealthy 52m kubelet Readiness probe failed: Get "https://10.244.1.89:8088/readiness": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 2m58s (x295 over 51m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
According to this, the readinessProbe fails. In the YAML definition of the affected deployments, the readinessProbe is defined as:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /readiness
    port: health
    scheme: HTTPS
  initialDelaySeconds: 45
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
I have tried increasing these values, raising the delay to 600 and the timeout to 10. I have also tried uninstalling the package and installing it again, but nothing changes: the installation fails because the pods are never ready and the timeout pops up. I have also exposed port 8088 (health) and called /readiness with wget, and the result is still 503. On the other hand, I have tested that the livenessProbe works fine. I have also tried resetting the cluster. First I manually deleted everything in it and then used the following commands:
sudo kubeadm reset
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
sudo systemctl stop kubelet
sudo systemctl stop docker
sudo rm -rf /var/lib/cni/
sudo rm -rf /var/lib/kubelet/*
sudo rm -rf /etc/cni/
sudo ifconfig cni0 down
sudo ifconfig flannel.1 down
sudo ifconfig docker0 down
sudo ip link set cni0 down
sudo brctl delbr cni0
sudo systemctl start docker
sudo kubeadm init --apiserver-advertise-address=192.168.44.11 --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl --kubeconfig $HOME/.kube/config apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
The cluster seems to work fine, because the Eclipse Ditto part has no problem; it's just the Eclipse Hono part. I'll add a little more information in case it is useful.
The kubectl logs dt-service-command-router-b654c8dcb-s2g6t -n $NS output:
12:30:06.340 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.101:44142 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:06.756 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.100:46550 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:07.876 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.102:40706 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]
12:30:08.339 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]: Failed to create SSL connection
12:30:08.339 [vert.x-eventloop-thread-1] WARN o.e.h.client.impl.HonoConnectionImpl - attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration] failed
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection
The kubectl logs dt-adapter-amqp-vertx-74d69cbc44-7kmdq -n $NS output:
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]
12:19:36.711 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]: Failed to create SSL connection
12:19:36.712 [vert.x-eventloop-thread-0] WARN o.e.h.client.impl.HonoConnectionImpl - attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials] failed
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection
The kubectl version output is as follows:
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:20:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Thanks in advance!
Based on the iconic Failed to create SSL connection output in the logs, I assume that you have run into the dreaded problem of the demo certificates included in the Hono chart having expired.
The Cloud2Edge package chart is currently being updated (https://github.com/eclipse/packages/pull/337) with the most recent versions of the Ditto and Hono charts (which include fresh certificates that are valid for two more years). As soon as that PR is merged and the Eclipse Packages chart repository has been rebuilt, you should be able to do a helm repo update and then (hopefully) successfully install the c2e package.
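Once the rebuilt chart is available, a minimal sketch of the reinstall, reusing the release name, namespace and options from the question, would be:
helm repo update
helm uninstall -n $NS $RELEASE
helm install -n $NS --wait --timeout 15m $RELEASE eclipse-iot/cloud2edge --set hono.prometheus.createInstance=false --set hono.grafana.enabled=false --dependency-update --debug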

Greenplum on kubernetes

I've deployed a small Greenplum cluster on Kubernetes.
Everything seems to be up and running:
$ kubectl get pods:
NAME READY STATUS RESTARTS AGE
greenplum-operator-588d8fcfd8-nmgjp 1/1 Running 0 40m
svclb-greenplum-krdtd 1/1 Running 0 39m
svclb-greenplum-k28bv 1/1 Running 0 39m
svclb-greenplum-25n7b 1/1 Running 0 39m
segment-a-0 1/1 Running 0 39m
master-0 1/1 Running 0 39m
Nevertheless, something seems to be wrong, since the cluster state is Pending:
$ kubectl describe greenplumclusters.greenplum.pivotal.io my-greenplum
Name: my-greenplum
Namespace: default
Labels: <none>
Annotations: <none>
API Version: greenplum.pivotal.io/v1
Kind: GreenplumCluster
Metadata:
Creation Timestamp: 2020-09-23T08:31:04Z
Finalizers:
stopcluster.greenplumcluster.pivotal.io
Generation: 2
Managed Fields:
API Version: greenplum.pivotal.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:masterAndStandby:
.:
f:antiAffinity:
f:cpu:
f:hostBasedAuthentication:
f:memory:
f:standby:
f:storage:
f:storageClassName:
f:workerSelector:
f:segments:
.:
f:antiAffinity:
f:cpu:
f:memory:
f:mirrors:
f:primarySegmentCount:
f:storage:
f:storageClassName:
f:workerSelector:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2020-09-23T08:31:04Z
API Version: greenplum.pivotal.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:finalizers:
f:status:
.:
f:instanceImage:
f:operatorVersion:
f:phase:
Manager: greenplum-operator
Operation: Update
Time: 2020-09-23T08:31:11Z
Resource Version: 590
Self Link: /apis/greenplum.pivotal.io/v1/namespaces/default/greenplumclusters/my-greenplum
UID: 72ed72a8-4dd9-48fb-8a48-de2229d88a24
Spec:
Master And Standby:
Anti Affinity: no
Cpu: 0.5
Host Based Authentication: # host all gpadmin 0.0.0.0/0 trust
Memory: 800Mi
Standby: no
Storage: 1G
Storage Class Name: local-path
Worker Selector:
Segments:
Anti Affinity: no
Cpu: 0.5
Memory: 800Mi
Mirrors: no
Primary Segment Count: 1
Storage: 2G
Storage Class Name: local-path
Worker Selector:
Status:
Instance Image: registry.localhost:5000/greenplum-for-kubernetes:v2.2.0
Operator Version: registry.localhost:5000/greenplum-operator:v2.2.0
Phase: Pending
Events: <none>
As you can see:
Phase: Pending
I took a look at the operator logs:
{"level":"DEBUG","ts":"2020-09-23T09:12:18.494Z","logger":"PodExec","msg":"master-0 is not active master","namespace":"default","error":"command terminated with exit code 2"}
{"level":"DEBUG","ts":"2020-09-23T09:12:18.497Z","logger":"PodExec","msg":"master-1 is not active master","namespace":"default","error":"pods \"master-1\" not found"}
{"level":"DEBUG","ts":"2020-09-23T09:12:18.497Z","logger":"controllers.GreenplumCluster","msg":"current active master","greenplumcluster":"default/my-greenplum","activeMaster":""}
I can't quite figure out what they mean.
It seems to be looking for two masters: master-0 and master-1. As you can see below, I'm only deploying a single master with one segment.
The GreenplumCluster manifest is:
apiVersion: "greenplum.pivotal.io/v1"
kind: "GreenplumCluster"
metadata:
name: my-greenplum
spec:
masterAndStandby:
hostBasedAuthentication: |
# host all gpadmin 0.0.0.0/0 trust
memory: "800Mi"
cpu: "0.5"
storageClassName: local-path
storage: 1G
workerSelector: {}
segments:
primarySegmentCount: 1
memory: "800Mi"
cpu: "0.5"
storageClassName: local-path
storage: 2G
workerSelector: {}
Master is logging this:
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Starting Master instance master-0 directory /greenplum/data-1
20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Command pg_ctl reports Master master-0 instance active
20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Connecting to dbname='template1' connect_timeout=15
20200923:11:29:27:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 1/4
20200923:11:29:42:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 2/4
20200923:11:29:57:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 3/4
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 4/4
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Failed to connect to template1
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-No standby master configured. skipping...
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-Check status of database with gpstate utility
20200923:11:30:12:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Completed restart of Greenplum instance in production mode
In short:
Timeout expired connecting to template1
Complete master-0 logs:
*******************************
Initializing Greenplum for Kubernetes Cluster
*******************************
*******************************
Generating gpinitsystem_config
*******************************
{"level":"INFO","ts":"2020-09-23T11:28:58.394Z","logger":"startGreenplumContainer","msg":"initializing Greenplum Cluster"}
Sub Domain for the cluster is: agent.greenplum-1.svc.cluster.local
*******************************
Running gpinitsystem
*******************************
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking configuration parameters, please wait...
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Locale has not been set in , will set to default value
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Locale set to en_US.utf8
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-ARRAY_NAME variable not set, will provide default value
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-Master hostname master-0.agent.greenplum-1.svc.cluster.local does not match hostname output
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking to see if master-0.agent.greenplum-1.svc.cluster.local can be resolved on this host
Warning: Permanently added the RSA host key for IP address '10.42.2.5' to the list of known hosts.
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Can resolve master-0.agent.greenplum-1.svc.cluster.local to this host
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-No DATABASE_NAME set, will exit following template1 updates
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-CHECK_POINT_SEGMENTS variable not set, will set to default value
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-ENCODING variable not set, will set to default UTF-8
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-MASTER_MAX_CONNECT not set, will set to default value 250
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Detected a single host GPDB array build, reducing value of BATCH_DEFAULT from 60 to 4
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking configuration parameters, Completed
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking Master host
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking new segment hosts, please wait...
Warning: Permanently added the RSA host key for IP address '10.42.1.5' to the list of known hosts.
{"level":"DEBUG","ts":"2020-09-23T11:28:59.038Z","logger":"DNS resolver","msg":"resolved DNS entry","host":"segment-a-0"}
{"level":"INFO","ts":"2020-09-23T11:28:59.038Z","logger":"keyscanner","msg":"starting keyscan","host":"segment-a-0"}
20200923:11:28:59:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking new segment hosts, Completed
{"level":"INFO","ts":"2020-09-23T11:28:59.064Z","logger":"keyscanner","msg":"keyscan successful","host":"segment-a-0"}
20200923:11:28:59:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Building the Master instance database, please wait...
20200923:11:29:02:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Found more than 1 instance of shared_preload_libraries in /greenplum/data-1/postgresql.conf, will append
20200923:11:29:02:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Starting the Master in admin mode
20200923:11:29:03:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Commencing parallel build of primary segment instances
20200923:11:29:03:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Spawning parallel processes batch [1], please wait...
.
20200923:11:29:03:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Waiting for parallel processes batch [1], please wait...
......
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:------------------------------------------------
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Parallel process exit status
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:------------------------------------------------
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Total processes marked as completed = 1
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Total processes marked as killed = 0
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Total processes marked as failed = 0
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:------------------------------------------------
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Deleting distributed backout files
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Removing back out file
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-No errors generated from parallel processes
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Restarting the Greenplum instance in production mode
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Starting gpstop with args: -a -l /home/gpadmin/gpAdminLogs -m -d /greenplum/data-1
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Gathering information and validating the environment...
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Obtaining Segment details from master...
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Greenplum Version: 'postgres (Greenplum Database) 6.10.1 build commit:efba04ce26ebb29b535a255a5e95d1f5ebfde94e'
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Commencing Master instance shutdown with mode='smart'
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Master segment instance directory=/greenplum/data-1
20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Stopping master segment and waiting for user connections to finish ...
server shutting down
20200923:11:29:10:001357 gpstop:master-0:gpadmin-[INFO]:-Attempting forceful termination of any leftover master process
20200923:11:29:10:001357 gpstop:master-0:gpadmin-[INFO]:-Terminating processes for segment /greenplum/data-1
20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Starting gpstart with args: -a -l /home/gpadmin/gpAdminLogs -d /greenplum/data-1
20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Gathering information and validating the environment...
20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 6.10.1 build commit:efba04ce26ebb29b535a255a5e95d1f5ebfde94e'
20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Greenplum Catalog Version: '301908232'
20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Starting Master instance in admin mode
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Obtaining Segment details from master...
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Setting new master era
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Master Started...
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Shutting down master
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait...
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Process results...
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-----------------------------------------------------
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:- Successful segment starts = 1
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:- Failed segment starts = 0
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-----------------------------------------------------
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Successfully started 1 of 1 segment instances
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-----------------------------------------------------
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Starting Master instance master-0 directory /greenplum/data-1
20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Command pg_ctl reports Master master-0 instance active
20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Connecting to dbname='template1' connect_timeout=15
20200923:11:29:27:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 1/4
20200923:11:29:42:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 2/4
20200923:11:29:57:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 3/4
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 4/4
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Failed to connect to template1
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-No standby master configured. skipping...
20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-Check status of database with gpstate utility
20200923:11:30:12:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Completed restart of Greenplum instance in production mode
Any ideas?
I deployed Greenplum on Kubernetes recently as well.
My problem was permissions on the cgroup directory. When I looked into the files under /greenplum/data1/pg_log/ in the pod, I found errors like 'can't access directory '/sys/fs/cgroup/memory/gpdb/'', because the pod used a hostPath.
My advice is to look at the errors printed in the files under /greenplum/data1/pg_log/.
The pod's log does not tell the whole story.
BTW, I ended up using v0.8.0. I chose v2.3.0 first, but the master was killed shortly after it became ready, maybe by Docker. The log was like 'received fast shutdown request.
ic-proxy-server: received signal 15'
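As a minimal sketch of that check, assuming the default namespace and the /greenplum/data1/pg_log path from above (the log file name is illustrative):
# list the newest Greenplum server log files inside the master pod
kubectl exec -it master-0 -- ls -lt /greenplum/data1/pg_log/
# tail one of them to see the real startup error
kubectl exec -it master-0 -- tail -n 100 /greenplum/data1/pg_log/gpdb-2020-09-23_000000.csv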

Kubernetes 1.18.4, iSCSI

I have problems connecting a volume via iSCSI from Kubernetes. When I try with iscsiadm from the worker node, it works. This is what I get from kubectl describe pod:
Normal Scheduled <unknown> default-scheduler Successfully assigned default/iscsipd to k8s-worker-2
Normal SuccessfulAttachVolume 4m2s attachdetach-controller AttachVolume.Attach succeeded for volume "iscsipd-rw"
Warning FailedMount 119s kubelet, k8s-worker-2 Unable to attach or mount volumes: unmounted volumes=[iscsipd-rw], unattached volumes=[iscsipd-rw default-token-d5glz]: timed out waiting for the condition
Warning FailedMount 105s (x9 over 3m54s) kubelet, k8s-worker-2 MountVolume.WaitForAttach failed for volume "iscsipd-rw" : failed to get any path for iscsi disk, last err seen:iscsi: failed to attach disk: Error: iscsiadm: No records found(exit status 21)
I'm just using the iscsi.yaml file from kubernetes.io:
---
apiVersion: v1
kind: Pod
metadata:
  name: iscsipd
spec:
  containers:
  - name: iscsipd-rw
    image: kubernetes/pause
    volumeMounts:
    - mountPath: "/mnt/iscsipd"
      name: iscsipd-rw
  volumes:
  - name: iscsipd-rw
    iscsi:
      targetPortal: 192.168.34.32:3260
      iqn: iqn.2020-07.int.example:sql
      lun: 0
      fsType: ext4
      readOnly: true
Open-iscsi is installed on all worker nodes (there are just two of them).
● iscsid.service - iSCSI initiator daemon (iscsid)
Loaded: loaded (/lib/systemd/system/iscsid.service; enabled; vendor preset: e
Active: active (running) since Fri 2020-07-03 10:24:26 UTC; 4 days ago
Docs: man:iscsid(8)
Process: 20507 ExecStart=/sbin/iscsid (code=exited, status=0/SUCCESS)
Process: 20497 ExecStartPre=/lib/open-iscsi/startup-checks.sh (code=exited, st
Main PID: 20514 (iscsid)
Tasks: 2 (limit: 4660)
CGroup: /system.slice/iscsid.service
├─20509 /sbin/iscsid
└─20514 /sbin/iscsid
The iSCSI target is created on the IBM Storwize V7000, without CHAP.
I tried to connect with iscsiadm from the worker node and it works:
sudo iscsiadm -m discovery -t sendtargets -p 192.168.34.32
192.168.34.32:3260,1 iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1
192.168.34.34:3260,1 iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1
sudo iscsiadm -m node --login
Logging in to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.32,3260] (multiple)
Logging in to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.34,3260] (multiple)
Login to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.32,3260] successful.
Login to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.34,3260] successful.
Disk /dev/sdb: 100 GiB, 107374182400 bytes, 209715200 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 32768 bytes / 32768 bytes
Disklabel type: dos
Disk identifier: 0x5b3d0a3a
Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 209715199 209713152 100G 83 Linux
Is anyone facing the same problem?
Remember to not use a hostname for the target. Use the IP. For some reason, if the target is a hostname, it barfs with the error about requesting a duplicate session. If the target is an IP, it works fine. I now have multiple iSCSI targets mounted in various pods, and I am absolutely ecstatic.
You may also have an authentication issue with your iSCSI target.
If you don't use CHAP authentication yet, you still have to disable authentication on the target. For example, if you use targetcli, you can run the commands below to disable it:
$ sudo targetcli
/> /iscsi/iqn.2003-01.org.xxxx/tpg1 set attribute authentication=0 # will disable auth
/> /iscsi/iqn.2003-01.org.xxxx/tpg1 set attribute generate_node_acls=1 # will force to use tpg1 auth mode by default
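If your target does use CHAP, the pod spec has to carry the credentials as well. A minimal sketch, mirroring the volume from the question and assuming a Secret named chap-secret that holds the iSCSI CHAP username/password keys (the secret name is illustrative):
volumes:
- name: iscsipd-rw
  iscsi:
    targetPortal: 192.168.34.32:3260
    iqn: iqn.2020-07.int.example:sql
    lun: 0
    fsType: ext4
    chapAuthDiscovery: true
    chapAuthSession: true
    secretRef:
      name: chap-secret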
If this doesn't help, please share your iSCSI target configuration or the guide that you followed.
What is important: check whether all of your nodes have the open-iscsi package installed.
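If a node is missing it, a minimal sketch for Debian/Ubuntu-based workers (package and service names differ on other distributions):
sudo apt-get install -y open-iscsi
sudo systemctl enable --now iscsid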
Take a look: kubernetes-iSCSI, volume-failed-iscsi-disk, iscsi-into-container-fails.

K3s cluster not starting (anymore)

My local k3s playground suddenly decided to stop working. I have a gut feeling that something is wrong with the HTTPS certs.
I start the cluster from Docker Compose using:
version: '3.2'
services:
  server:
    image: rancher/k3s:latest
    command: server --disable-agent --tls-san 192.168.2.110
    environment:
    - K3S_CLUSTER_SECRET=somethingtotallyrandom
    - K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml
    - K3S_KUBECONFIG_MODE=666
    volumes:
    - k3s-server:/var/lib/rancher/k3s
    # get the kubeconfig file
    - .:/output
    - ./registries.yaml:/etc/rancher/k3s/registries.yaml
    ports:
    - 192.168.2.110:6443:6443
  node:
    image: rancher/k3s:latest
    volumes:
    - ./registries.yaml:/etc/rancher/k3s/registries.yaml
    tmpfs:
    - /run
    - /var/run
    privileged: true
    environment:
    - K3S_URL=https://server:6443
    - K3S_CLUSTER_SECRET=somethingtotallyrandom
    ports:
    - 31000-32000:31000-32000
volumes:
  k3s-server: {}
Nothing special. The registries.yaml can be uncommented without making a difference. Its contents are:
mirrors:
  "192.168.2.110:5055":
    endpoint:
    - "http://192.168.2.110:5055"
However, I now get a bunch of weird failures:
server_1 | E0516 22:58:03.264451 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
server_1 | E0516 22:58:08.265272 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1 | I0516 22:58:12.695365 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1 | I0516 22:58:15.006306 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1 | I0516 22:58:15.006537 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:15.006757 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | E0516 22:58:22.345501 1 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
node_1 | I0516 22:58:27.695296 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:27.695989 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | I0516 22:58:30.328999 1 request.go:621] Throttling request took 1.047650754s, request: GET:https://127.0.0.1:6444/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
server_1 | W0516 22:58:31.081020 1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
server_1 | E0516 22:58:36.442904 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1 | I0516 22:58:40.695404 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:40.696176 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | E0516 22:58:41.443295 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Also, it seems my node is not really connecting to the server anymore:
user@ipc:~/dev/test_mk3s_docker$ docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl cluster-info
Kubernetes master is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
user@ipc:~/dev/test_mk3s_docker$ docker exec -it $(docker ps |grep "k3s agent"|awk -F\ '{print $1}') kubectl cluster-info
error: Missing or incomplete configuration info. Please point to an existing, complete config file:
1. Via the command-line flag --kubeconfig
2. Via the KUBECONFIG environment variable
3. In your home directory as ~/.kube/config
To view or setup config directly use the 'config' command.
If I run `kubectl get apiservice` I get the following lines:
v1beta1.storage.k8s.io Local True 20m
v1beta1.scheduling.k8s.io Local True 20m
v1.storage.k8s.io Local True 20m
v1.k3s.cattle.io Local True 20m
v1.helm.cattle.io Local True 20m
v1beta1.metrics.k8s.io kube-system/metrics-server False (FailedDiscoveryCheck) 20m
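The FailedDiscoveryCheck on v1beta1.metrics.k8s.io means the aggregation layer cannot reach a healthy metrics-server endpoint, so it can help to look at the backing pod before touching the APIService (a sketch; the k8s-app=metrics-server label of the bundled manifest is an assumption here):
docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl -n kube-system get pods,endpoints -l k8s-app=metrics-server
docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl -n kube-system logs deploy/metrics-server --tail=50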
Also, downgrading k3s to k3s:v1.0.1 only changes the error message:
server_1 | E0516 23:46:02.951073 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: no kind "CSINode" is registered for version "storage.k8s.io/v1" in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
server_1 | E0516 23:46:03.444519 1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"storage.k8s.io", Version:"v1", Kind:"CSINode"}, target:runtime.GroupVersioner(nil), t:reflect.Type(nil)}
After executing
docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io
I only get:
node_1 | W0517 07:03:06.346944 1 info.go:51] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1 | I0517 07:03:21.504932 1 log.go:172] http: TLS handshake error from 10.42.1.15:53888: remote error: tls: bad certificate
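Since the last message points at the serving certificate, the certificate presented on 6443 can be inspected from the host to see whether it has expired or lacks the expected SANs (a sketch; assumes openssl is installed on the host and uses the 192.168.2.110:6443 mapping from the compose file):
$ echo | openssl s_client -connect 192.168.2.110:6443 2>/dev/null | openssl x509 -noout -dates -subject
If the certificate really is broken, removing the k3s-server named volume (for example with docker-compose down -v) would force k3s to regenerate its certificates on the next start, but note that this also wipes all cluster state.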

Ansible AWX RabbitMQ container in Kubernetes Failed to get nodes from k8s with nxdomain

I am trying to get Ansible AWX installed on my Kubernetes cluster, but the RabbitMQ container is throwing a "Failed to get nodes from k8s" error.
Below are the versions of the platforms I am using:
[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes is deployed via the Kubespray playbook v2.5.0 and all the services and pods (CoreDNS, Weave, iptables) are up and running.
I am deploying AWX via the 1.0.6 release using the 1.0.6 images for awx_web and awx_task.
I am using an external PostgreSQL database at v10.4 and have verified the tables are being created by awx in the db.
Troubleshooting steps I have tried:
I tried to deploy AWX 1.0.5 with the etcd pod to the same cluster and it worked as expected.
I have deployed a standalone RabbitMQ cluster in the same k8s cluster, trying to mimic the AWX rabbit deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend.
I have tried tweaking some of the rabbitmq.conf settings for AWX 1.0.6 with no luck; it just keeps throwing the same error.
I have verified the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry.
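Since the busybox test further down resolves the name fine, it can also help to compare the DNS configuration the AWX pod itself received (a sketch; the pod name is taken from the output below and the awx-web container name is an assumption):
[node1 ~]# kubectl -n awx get pod awx-654f7fc84c-9ppqb -o jsonpath='{.spec.dnsPolicy}{"\n"}'
[node1 ~]# kubectl -n awx exec awx-654f7fc84c-9ppqb -c awx-web -- cat /etc/resolv.conf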
Cluster Info
[node1 ~]# kubectl get all -n awx
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/awx 1 1 1 0 38m
NAME DESIRED CURRENT READY AGE
rs/awx-654f7fc84c 1 1 0 38m
NAME READY STATUS RESTARTS AGE
po/awx-654f7fc84c-9ppqb 3/4 CrashLoopBackOff 11 38m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/awx-rmq-mgmt ClusterIP 10.233.10.146 <none> 15672/TCP 1d
svc/awx-web-svc NodePort 10.233.3.75 <none> 80:31700/TCP 1d
svc/rabbitmq NodePort 10.233.37.33 <none> 15672:30434/TCP,5672:31962/TCP 1d
AWX RabbitMQ error log
[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
Starting RabbitMQ 3.7.4 on Erlang 20.1.7
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>
Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
node : rabbit#10.233.120.5
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : at619UOZzsenF44tSK3ulA==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit#10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering: OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit#10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Kubernetes API service
[node1 ~]# kubectl describe service kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.233.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 10.237.34.19:6443,10.237.34.21:6443
Session Affinity: ClientIP
Events: <none>
nslookup from a busybox pod in the same Kubernetes cluster
[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup kubernetes.default.svc.cluster.local
Server: 10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local
Name: kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local
Please let me know if I am missing anything that could help with troubleshooting.
I believe the solution is to omit the explicit Kubernetes host. I can't think of any good reason one would need to specify the Kubernetes API host from inside the cluster.
If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming the SSL certificate for the master has its Service IP in the SANs list).
As for why it is doing such a silly thing, the only good reason I can think of is that the RMQ PodSpec has somehow gotten a dnsPolicy of something other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, you can provide an explicit command: to run some debugging bash commands first, in order to interrogate the state of the container at launch, and then exec /launch.sh to resume booting up RMQ (as they do).
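As a rough illustration of that last suggestion, the container spec fragment could look something like this (a sketch only; the awx-rabbit container name, the image tag and the /launch.sh entrypoint are assumptions based on the AWX 1.0.6 installer, and bash/getent must actually exist in the image):
- name: awx-rabbit
  image: ansible/awx_rabbitmq:3.7.4
  command:
    - /bin/bash
    - -c
    - |
      # dump the DNS state the container actually sees before starting RabbitMQ
      echo "--- /etc/resolv.conf ---"; cat /etc/resolv.conf
      echo "--- kubernetes.default.svc.cluster.local ---"
      getent hosts kubernetes.default.svc.cluster.local || true
      # then hand off to the normal RabbitMQ startup script
      exec /launch.sh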