RabbitMQ fails to start after restarting the Kubernetes cluster
I'm running RabbitMQ on Kubernetes. This is my sts YAML file:
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 15672
    name: http
  selector:
    app: rabbitmq
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 5672
    name: amqp
  - port: 4369
    name: epmd
  - port: 25672
    name: rabbitmq-dist
  - port: 61613
    name: stomp
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: "rabbitmq"
  replicas: 3
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: rabbitmq:management-alpine
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - >
                rabbitmq-plugins enable rabbitmq_stomp;
                if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                  cat /etc/resolv.conf.new > /etc/resolv.conf;
                  rm /etc/resolv.conf.new;
                fi;
                until rabbitmqctl node_health_check; do sleep 1; done;
                if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                  rabbitmqctl stop_app;
                  rabbitmqctl join_cluster rabbit@rabbitmq-0;
                  rabbitmqctl start_app;
                fi;
                rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
        env:
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-config
              key: erlang-cookie
        ports:
        - containerPort: 5672
          name: amqp
        - containerPort: 61613
          name: stomp
        volumeMounts:
        - name: rabbitmq
          mountPath: /var/lib/rabbitmq
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq
      annotations:
        volume.alpha.kubernetes.io/storage-class: do-block-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
and I created the cookie with this command:
kubectl create secret generic rabbitmq-config --from-literal=erlang-cookie=c-is-for-cookie-thats-good-enough-for-me
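To double-check what actually ended up in that secret (the node prints its Erlang cookie hash in the diagnostics below), the cookie can be decoded back with plain kubectl and base64; this is only a verification sketch, reusing the names from the command above:
kubectl get secret rabbitmq-config -o jsonpath='{.data.erlang-cookie}' | base64 -d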
All of my Kubernetes cluster nodes are Ready:
kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION
kubernetes-master    Ready    master   14d   v1.17.3
kubernetes-slave-1   Ready    <none>   14d   v1.17.3
kubernetes-slave-2   Ready    <none>   14d   v1.17.3
But after restarting the cluster, RabbitMQ did not come back up. I tried scaling the StatefulSet down and up again, but the problem persists. This is the output of kubectl describe pod rabbitmq-0:
kubectl describe pod rabbitmq-0
Name: rabbitmq-0
Namespace: default
Priority: 0
Node: kubernetes-slave-1/192.168.0.179
Start Time: Tue, 24 Mar 2020 22:31:04 +0000
Labels: app=rabbitmq
controller-revision-hash=rabbitmq-6748869f4b
statefulset.kubernetes.io/pod-name=rabbitmq-0
Annotations: <none>
Status: Running
IP: 10.244.1.163
IPs:
IP: 10.244.1.163
Controlled By: StatefulSet/rabbitmq
Containers:
rabbitmq:
Container ID: docker://d5108f818525030b4fdb548eb40f0dc000dd2cec473ebf8cead315116e3efbd3
Image: rabbitmq:management-alpine
Image ID: docker-pullable://rabbitmq#sha256:6f7c8d01d55147713379f5ca26e3f20eca63eb3618c263b12440b31c697ee5a5
Ports: 5672/TCP, 61613/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: PostStartHookError: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-575-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
node_health_check
Usage
rabbitmqctl [--node <node>] [--longnames] [--quiet] node_health_check [--timeout <timeout>]
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-10397-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 24 Mar 2020 22:46:09 +0000
Finished: Tue, 24 Mar 2020 22:58:28 +0000
Ready: False
Restart Count: 1
Environment:
RABBITMQ_ERLANG_COOKIE: <set to the key 'erlang-cookie' in secret 'rabbitmq-config'> Optional: false
Mounts:
/var/lib/rabbitmq from rabbitmq (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bbl9c (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
rabbitmq:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: rabbitmq-rabbitmq-0
ReadOnly: false
default-token-bbl9c:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bbl9c
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 31m default-scheduler Successfully assigned default/rabbitmq-0 to kubernetes-slave-1
Normal Pulled 31m kubelet, kubernetes-slave-1 Container image "rabbitmq:management-alpine" already present on machine
Normal Created 31m kubelet, kubernetes-slave-1 Created container rabbitmq
Normal Started 31m kubelet, kubernetes-slave-1 Started container rabbitmq
Normal SandboxChanged 16m (x9 over 17m) kubelet, kubernetes-slave-1 Pod sandbox changed, it will be killed and re-created.
Normal Pulled 3m58s (x2 over 16m) kubelet, kubernetes-slave-1 Container image "rabbitmq:management-alpine" already present on machine
Warning FailedPostStartHook 3m58s kubelet, kubernetes-slave-1 Exec lifecycle hook ([/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(2e561153-a830-4d30-ab1e-71c80d10c9e9)" failed - error: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
other nodes on rabbitmq-0: [rabbitmqprelaunch1]
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-433-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-575-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
node_health_check
, message: "Enabling plugins on node rabbit#rabbitmq-0:\nrabbitmq_stomp\nThe following plugins have been configured:\n rabbitmq_management\n rabbitmq_management_agent\n rabbitmq_stomp\n rabbitmq_web_dispatch\nApplying plugin configuration to rabbit#rabbitmq-0...\nThe following plugins have been enabled:\n rabbitmq_stomp\n\nset 4 plugins.\nOffline change; changes will take effect at broker restart.\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\
Error:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.\n\nMost common reasons for this are:\n\n * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)\n * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)\n * Target node is not running\n\nIn addition to the diagnostics info below:\n\n * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more\n * Consult server logs on node rabbit#rabbitmq-0\n * If target node is configured to use long node names, don't forget to use --longnames with CLI tools\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit#rabbitmq-0']\n\nrabbit#rabbitmq-0:\n * connected to epmd (port 4369) on rabbitmq-0\n * epmd reports: node 'rabbit' not running at all\n no other nodes on rabbitmq-0\n * suggestion: start the node\n\nCurrent node details:\n * node name: 'rabbitmqcli-10397-rabbit#rabbitmq-0'\n * effective user's home directory: /var/lib/rabbitmq\n * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\n"
Normal Killing 3m58s kubelet, kubernetes-slave-1 FailedPostStartHook
Normal Created 3m57s (x2 over 16m) kubelet, kubernetes-slave-1 Created container rabbitmq
Normal Started 3m57s (x2 over 16m) kubelet, kubernetes-slave-1 Started container rabbitmq
The output of kubectl get sts:
kubectl get sts
NAME        READY   AGE
consul      3/3     15d
hazelcast   2/3     15d
kafka       2/3     15d
rabbitmq    0/3     13d
zk          3/3     15d
And this is the pod log that I copied from the Kubernetes dashboard:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: list of feature flags found:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] drop_unroutable_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] empty_basic_get_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] implicit_default_bindings
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] quorum_queue
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] virtual_host_metadata
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2020-03-24 22:58:43.979 [info] <0.319.0> ra: meta data store initialised. 0 record(s) recovered
2020-03-24 22:58:43.980 [info] <0.324.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit#rabbitmq-0/quorum/rabbit#rabbitmq-0/00000262.wal"]
2020-03-24 22:58:43.982 [info] <0.328.0>
Starting RabbitMQ 3.8.2 on Erlang 22.2.8
Copyright (c) 2007-2019 Pivotal Software, Inc.
Licensed under the MPL 1.1. Website: https://rabbitmq.com
  ##  ##      RabbitMQ 3.8.2
  ##  ##
  ##########  Copyright (c) 2007-2019 Pivotal Software, Inc.
  ######  ##
  ##########  Licensed under the MPL 1.1. Website: https://rabbitmq.com
Doc guides: https://rabbitmq.com/documentation.html
Support: https://rabbitmq.com/contact.html
Tutorials: https://rabbitmq.com/getstarted.html
Monitoring: https://rabbitmq.com/monitoring.html
Logs: <stdout>
Config file(s): /etc/rabbitmq/rabbitmq.conf
Starting broker...2020-03-24 22:58:43.983 [info] <0.328.0>
node : rabbit#rabbitmq-0
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : P1XNOe5pN3Ug2FCRFzH7Xg==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit#rabbitmq-0
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step pre_boot defined by app rabbit
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step rabbit_core_metrics defined by app rabbit
2020-03-24 22:58:43.998 [info] <0.328.0> Running boot step rabbit_alarm defined by app rabbit
2020-03-24 22:58:44.002 [info] <0.334.0> Memory high watermark set to 1200 MiB (1258889216 bytes) of 3001 MiB (3147223040 bytes) total
2020-03-24 22:58:44.014 [info] <0.336.0> Enabling free disk space monitoring
2020-03-24 22:58:44.014 [info] <0.336.0> Disk free limit set to 50MB
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step code_server_cache defined by app rabbit
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step file_handle_cache defined by app rabbit
2020-03-24 22:58:44.019 [info] <0.339.0> Limiting to approx 1048479 file handles (943629 sockets)
2020-03-24 22:58:44.019 [info] <0.340.0> FHC read buffering: OFF
2020-03-24 22:58:44.019 [info] <0.340.0> FHC write buffering: ON
2020-03-24 22:58:44.020 [info] <0.328.0> Running boot step worker_pool defined by app rabbit
2020-03-24 22:58:44.021 [info] <0.329.0> Will use 2 processes for default worker pool
2020-03-24 22:58:44.021 [info] <0.329.0> Starting worker pool 'worker_pool' with 2 processes in it
2020-03-24 22:58:44.021 [info] <0.328.0> Running boot step database defined by app rabbit
2020-03-24 22:58:44.041 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 22:59:14.042 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:14.042 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 22:59:44.043 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:44.043 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:00:14.044 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:14.044 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:00:44.045 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:44.045 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:01:14.046 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:14.046 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:01:44.047 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:44.047 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:02:14.048 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:14.048 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:02:44.049 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:44.049 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:03:14.050 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:03:14.050 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:03:44.051 [error] <0.328.0> Feature flag `quorum_queue`: migration function crashed: {error,{timeout_waiting_for_tables,[rabbit_durable_queue]}}
[{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,117}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,60}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1486}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-2-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2128}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2126}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,1947}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,631}]}]
2020-03-24 23:03:44.051 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 23:04:14.052 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:14.052 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 23:04:44.053 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:44.053 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:05:14.055 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:14.055 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:05:44.056 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:44.056 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:06:14.057 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:14.057 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:06:44.058 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:44.058 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:07:14.059 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:14.059 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:07:44.060 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:44.060 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:08:14.061 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:08:14.061 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:08:44.062 [error] <0.327.0> CRASH REPORT Process <0.327.0> with 0 neighbours exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2020-03-24 23:08:44.063 [info] <0.43.0> Application rabbit exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_r
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Take a look at https://www.rabbitmq.com/clustering.html#restarting. When a whole cluster goes down, the last node to stop must be the first node to start again; otherwise the booting node waits for its peers' Mnesia tables and eventually gives up, which is exactly the timeout_waiting_for_tables error in your log.
You should be able to stop the app and then force boot:
rabbitmqctl stop_app
rabbitmqctl force_boot
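Since the data directory lives on the PVC, one way to run this from outside the pod is kubectl exec. This is only a sketch, and it assumes the rabbitmq-0 container stays up long enough while it is retrying (as in the log above):
kubectl exec -it rabbitmq-0 -- rabbitmqctl stop_app
kubectl exec -it rabbitmq-0 -- rabbitmqctl force_boot
kubectl exec -it rabbitmq-0 -- rabbitmqctl start_app
force_boot tells the node to boot even though it was not the last one to shut down, instead of waiting for the other Mnesia nodes.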
To complete @Vincent Gerris's answer, I strongly recommend you use the Bitnami RabbitMQ Docker image.
It includes an environment variable called RABBITMQ_FORCE_BOOT:
https://github.com/bitnami/bitnami-docker-rabbitmq/blob/2c38682053dd9b3e88ab1fb305355d2ce88c2ccb/3.9/debian-10/rootfs/opt/bitnami/scripts/librabbitmq.sh#L760
if is_boolean_yes "$RABBITMQ_FORCE_BOOT" && ! is_dir_empty "${RABBITMQ_DATA_DIR}/${RABBITMQ_NODE_NAME}"; then
    # ref: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot
    warn "Forcing node to start..."
    debug_execute "${RABBITMQ_BIN_DIR}/rabbitmqctl" force_boot
fi
This will force-boot the node at the entrypoint.
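For a StatefulSet like the one in the question, that means switching the image and adding an env entry along these lines. This is only a sketch: RABBITMQ_FORCE_BOOT is specific to the Bitnami image, and the rest of the container spec stays as it is:
        env:
        - name: RABBITMQ_FORCE_BOOT
          value: "yes"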
Related
Readiness fails in the Eclipse Hono pods of the Cloud2Edge package
I am a bit desperate and I hope someone can help me. A few months ago I installed the eclipse cloud2edge package on a kubernetes cluster by following the installation instructions, creating a persistentVolume and running the helm install command with these options. helm install -n $NS --wait --timeout 15m $RELEASE eclipse-iot/cloud2edge --set hono.prometheus.createInstance=false --set hono.grafana.enabled=false --dependency-update --debug The yaml of the persistentVolume is the following and I create it in the same namespace that I install the package. apiVersion: v1 kind: PersistentVolume metadata: name: pv-device-registry spec: accessModes: - ReadWriteOnce capacity: storage: 1Mi hostPath: path: /mnt/ type: Directory Everything works perfectly, all pods were ready and running, until the other day when the cluster crashed and some pods stopped working. The kubectl get pods -n $NS output is as follows: NAME READY STATUS RESTARTS AGE ditto-mongodb-7b78b468fb-8kshj 1/1 Running 0 50m dt-adapter-amqp-vertx-6699ccf495-fc8nx 0/1 Running 0 50m dt-adapter-http-vertx-545564ff9f-gx5fp 0/1 Running 0 50m dt-adapter-mqtt-vertx-58c8975678-k5n49 0/1 Running 0 50m dt-artemis-6759fb6cb8-5rq8p 1/1 Running 1 50m dt-dispatch-router-5bc7586f76-57dwb 1/1 Running 0 50m dt-ditto-concierge-f6d5f6f9c-pfmcw 1/1 Running 0 50m dt-ditto-connectivity-f556db698-q89bw 1/1 Running 0 50m dt-ditto-gateway-589d8f5596-59c5b 1/1 Running 0 50m dt-ditto-nginx-897b5bc76-cx2dr 1/1 Running 0 50m dt-ditto-policies-75cb5c6557-j5zdg 1/1 Running 0 50m dt-ditto-swaggerui-6f6f989ccd-jkhsk 1/1 Running 0 50m dt-ditto-things-79ff869bc9-l9lct 1/1 Running 0 50m dt-ditto-thingssearch-58c5578bb9-pwd9k 1/1 Running 0 50m dt-service-auth-698d4cdfff-ch5wp 1/1 Running 0 50m dt-service-command-router-59d6556b5f-4nfcj 0/1 Running 0 50m dt-service-device-registry-7cf75d794f-pk9ct 0/1 Running 0 50m The pods that fail all have the same error when running kubectl describe pod POD_NAME -n $NS. Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 53m default-scheduler Successfully assigned digitaltwins/dt-service-command-router-59d6556b5f-4nfcj to node1 Normal Pulled 53m kubelet Container image "index.docker.io/eclipse/hono-service-command-router:1.8.0" already present on machine Normal Created 53m kubelet Created container service-command-router Normal Started 53m kubelet Started container service-command-router Warning Unhealthy 52m kubelet Readiness probe failed: Get "https://10.244.1.89:8088/readiness": net/http: request canceled (Client.Timeout exceeded while awaiting headers) Warning Unhealthy 2m58s (x295 over 51m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503 According to this, the readinessProbe fails. In the yalm definition of the affected deployments, the readinessProbe is defined: readinessProbe: failureThreshold: 3 httpGet: path: /readiness port: health scheme: HTTPS initialDelaySeconds: 45 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 1 I have tried increasing these values, increasing the delay to 600 and the timeout to 10. Also i have tried uninstalling the package and installing it again, but nothing changes: the installation fails because the pods are never ready and the timeout pops up. I have also exposed port 8088 (health) and called /readiness with wget and the result is still 503. On the other hand, I have tested if livenessProbe works and it works fine. I have also tried resetting the cluster. 
First I manually deleted everything in it and then used the following commands: sudo kubeadm reset sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X sudo systemctl stop kubelet sudo systemctl stop docker sudo rm -rf /var/lib/cni/ sudo rm -rf /var/lib/kubelet/* sudo rm -rf /etc/cni/ sudo ifconfig cni0 down sudo ifconfig flannel.1 down sudo ifconfig docker0 down sudo ip link set cni0 down sudo brctl delbr cni0 sudo systemctl start docker sudo kubeadm init --apiserver-advertise-address=192.168.44.11 --pod-network-cidr=10.244.0.0/16 mkdir -p $HOME/.kube sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config sudo chown $(id -u):$(id -g) $HOME/.kube/config kubectl --kubeconfig $HOME/.kube/config apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml The cluster seems to work fine because the Eclipse Ditto part has no problem, it's just the Eclipse Hono part. I add a little more information in case it may be useful. The kubectl logs dt-service-command-router-b654c8dcb-s2g6t -n $NS output: 12:30:06.340 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.101:44142 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown 12:30:06.756 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.100:46550 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown 12:30:07.876 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.102:40706 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown 12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration] 12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false] 12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine 12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3] 12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2] 12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration] 12:30:08.339 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]: Failed to create SSL connection 12:30:08.339 [vert.x-eventloop-thread-1] WARN o.e.h.client.impl.HonoConnectionImpl - attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration] failed javax.net.ssl.SSLHandshakeException: Failed to create SSL connection The kubectl logs dt-adapter-amqp-vertx-74d69cbc44-7kmdq -n $NS output: 12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials] 12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false] 12:19:36.686 
[vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine 12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3] 12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2] 12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials] 12:19:36.711 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]: Failed to create SSL connection 12:19:36.712 [vert.x-eventloop-thread-0] WARN o.e.h.client.impl.HonoConnectionImpl - attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials] failed javax.net.ssl.SSLHandshakeException: Failed to create SSL connection The kubectl version output is as follows: Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:20:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"} Thanks in advance!
based on the iconic Failed to create SSL Connection output in the logs, I assume that you have run into the dreaded The demo certificates included in the Hono chart have expired problem. The Cloud2Edge package chart is being updated currently (https://github.com/eclipse/packages/pull/337) with the most recent version of the Ditto and Hono charts (which includes fresh certificates that are valid for two more years to come). As soon as that PR is merged and the Eclipse Packages chart repository has been rebuilt, you should be able to do a helm repo update and then (hopefully) succesfully install the c2e package.
Greenplum on kubernetes
I've deployed a slightly greenplum cluster on kubernetes. Everything seems to be up and running: $ kubectl get pods: NAME READY STATUS RESTARTS AGE greenplum-operator-588d8fcfd8-nmgjp 1/1 Running 0 40m svclb-greenplum-krdtd 1/1 Running 0 39m svclb-greenplum-k28bv 1/1 Running 0 39m svclb-greenplum-25n7b 1/1 Running 0 39m segment-a-0 1/1 Running 0 39m master-0 1/1 Running 0 39m Nevertheless, something seems to be wrong since cluster state is Pending: $ kubectl describe greenplumclusters.greenplum.pivotal.io my-greenplum Name: my-greenplum Namespace: default Labels: <none> Annotations: <none> API Version: greenplum.pivotal.io/v1 Kind: GreenplumCluster Metadata: Creation Timestamp: 2020-09-23T08:31:04Z Finalizers: stopcluster.greenplumcluster.pivotal.io Generation: 2 Managed Fields: API Version: greenplum.pivotal.io/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:annotations: .: f:kubectl.kubernetes.io/last-applied-configuration: f:spec: .: f:masterAndStandby: .: f:antiAffinity: f:cpu: f:hostBasedAuthentication: f:memory: f:standby: f:storage: f:storageClassName: f:workerSelector: f:segments: .: f:antiAffinity: f:cpu: f:memory: f:mirrors: f:primarySegmentCount: f:storage: f:storageClassName: f:workerSelector: Manager: kubectl-client-side-apply Operation: Update Time: 2020-09-23T08:31:04Z API Version: greenplum.pivotal.io/v1 Fields Type: FieldsV1 fieldsV1: f:metadata: f:finalizers: f:status: .: f:instanceImage: f:operatorVersion: f:phase: Manager: greenplum-operator Operation: Update Time: 2020-09-23T08:31:11Z Resource Version: 590 Self Link: /apis/greenplum.pivotal.io/v1/namespaces/default/greenplumclusters/my-greenplum UID: 72ed72a8-4dd9-48fb-8a48-de2229d88a24 Spec: Master And Standby: Anti Affinity: no Cpu: 0.5 Host Based Authentication: # host all gpadmin 0.0.0.0/0 trust Memory: 800Mi Standby: no Storage: 1G Storage Class Name: local-path Worker Selector: Segments: Anti Affinity: no Cpu: 0.5 Memory: 800Mi Mirrors: no Primary Segment Count: 1 Storage: 2G Storage Class Name: local-path Worker Selector: Status: Instance Image: registry.localhost:5000/greenplum-for-kubernetes:v2.2.0 Operator Version: registry.localhost:5000/greenplum-operator:v2.2.0 Phase: Pending Events: <none> As you can see: Phase: Pending I've took a look on operator logs: {"level":"DEBUG","ts":"2020-09-23T09:12:18.494Z","logger":"PodExec","msg":"master-0 is not active master","namespace":"default","error":"command terminated with exit code 2"} {"level":"DEBUG","ts":"2020-09-23T09:12:18.497Z","logger":"PodExec","msg":"master-1 is not active master","namespace":"default","error":"pods \"master-1\" not found"} {"level":"DEBUG","ts":"2020-09-23T09:12:18.497Z","logger":"controllers.GreenplumCluster","msg":"current active master","greenplumcluster":"default/my-greenplum","activeMaster":""} I don't quite figure out what they mean... I mean, It seems it's looking for a two masters: master-0 and master-1. As you can see bellow, I've only deploying a single master with one segment. 
greenplum cluster manifest is: apiVersion: "greenplum.pivotal.io/v1" kind: "GreenplumCluster" metadata: name: my-greenplum spec: masterAndStandby: hostBasedAuthentication: | # host all gpadmin 0.0.0.0/0 trust memory: "800Mi" cpu: "0.5" storageClassName: local-path storage: 1G workerSelector: {} segments: primarySegmentCount: 1 memory: "800Mi" cpu: "0.5" storageClassName: local-path storage: 2G workerSelector: {} Master is logging this: 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Starting Master instance master-0 directory /greenplum/data-1 20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Command pg_ctl reports Master master-0 instance active 20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Connecting to dbname='template1' connect_timeout=15 20200923:11:29:27:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 1/4 20200923:11:29:42:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 2/4 20200923:11:29:57:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 3/4 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 4/4 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Failed to connect to template1 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-No standby master configured. skipping... 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-Check status of database with gpstate utility 20200923:11:30:12:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Completed restart of Greenplum instance in production mode In short: Timeout expired connecting to template1 Complete master-0 logs: ******************************* Initializing Greenplum for Kubernetes Cluster ******************************* ******************************* Generating gpinitsystem_config ******************************* {"level":"INFO","ts":"2020-09-23T11:28:58.394Z","logger":"startGreenplumContainer","msg":"initializing Greenplum Cluster"} Sub Domain for the cluster is: agent.greenplum-1.svc.cluster.local ******************************* Running gpinitsystem ******************************* 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking configuration parameters, please wait... 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Locale has not been set in , will set to default value 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Locale set to en_US.utf8 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-ARRAY_NAME variable not set, will provide default value 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-Master hostname master-0.agent.greenplum-1.svc.cluster.local does not match hostname output 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking to see if master-0.agent.greenplum-1.svc.cluster.local can be resolved on this host Warning: Permanently added the RSA host key for IP address '10.42.2.5' to the list of known hosts. 
20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Can resolve master-0.agent.greenplum-1.svc.cluster.local to this host 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-No DATABASE_NAME set, will exit following template1 updates 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-CHECK_POINT_SEGMENTS variable not set, will set to default value 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[WARN]:-ENCODING variable not set, will set to default UTF-8 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-MASTER_MAX_CONNECT not set, will set to default value 250 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Detected a single host GPDB array build, reducing value of BATCH_DEFAULT from 60 to 4 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking configuration parameters, Completed 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking Master host 20200923:11:28:58:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking new segment hosts, please wait... Warning: Permanently added the RSA host key for IP address '10.42.1.5' to the list of known hosts. {"level":"DEBUG","ts":"2020-09-23T11:28:59.038Z","logger":"DNS resolver","msg":"resolved DNS entry","host":"segment-a-0"} {"level":"INFO","ts":"2020-09-23T11:28:59.038Z","logger":"keyscanner","msg":"starting keyscan","host":"segment-a-0"} 20200923:11:28:59:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Checking new segment hosts, Completed {"level":"INFO","ts":"2020-09-23T11:28:59.064Z","logger":"keyscanner","msg":"keyscan successful","host":"segment-a-0"} 20200923:11:28:59:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Building the Master instance database, please wait... 20200923:11:29:02:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Found more than 1 instance of shared_preload_libraries in /greenplum/data-1/postgresql.conf, will append 20200923:11:29:02:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Starting the Master in admin mode 20200923:11:29:03:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Commencing parallel build of primary segment instances 20200923:11:29:03:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Spawning parallel processes batch [1], please wait... . 20200923:11:29:03:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Waiting for parallel processes batch [1], please wait... ...... 
20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:------------------------------------------------ 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Parallel process exit status 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:------------------------------------------------ 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Total processes marked as completed = 1 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Total processes marked as killed = 0 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Total processes marked as failed = 0 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:------------------------------------------------ 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Deleting distributed backout files 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Removing back out file 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-No errors generated from parallel processes 20200923:11:29:09:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Restarting the Greenplum instance in production mode 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Starting gpstop with args: -a -l /home/gpadmin/gpAdminLogs -m -d /greenplum/data-1 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Gathering information and validating the environment... 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Obtaining Segment details from master... 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Greenplum Version: 'postgres (Greenplum Database) 6.10.1 build commit:efba04ce26ebb29b535a255a5e95d1f5ebfde94e' 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Commencing Master instance shutdown with mode='smart' 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Master segment instance directory=/greenplum/data-1 20200923:11:29:09:001357 gpstop:master-0:gpadmin-[INFO]:-Stopping master segment and waiting for user connections to finish ... server shutting down 20200923:11:29:10:001357 gpstop:master-0:gpadmin-[INFO]:-Attempting forceful termination of any leftover master process 20200923:11:29:10:001357 gpstop:master-0:gpadmin-[INFO]:-Terminating processes for segment /greenplum/data-1 20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Starting gpstart with args: -a -l /home/gpadmin/gpAdminLogs -d /greenplum/data-1 20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Gathering information and validating the environment... 20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 6.10.1 build commit:efba04ce26ebb29b535a255a5e95d1f5ebfde94e' 20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Greenplum Catalog Version: '301908232' 20200923:11:29:10:001380 gpstart:master-0:gpadmin-[INFO]:-Starting Master instance in admin mode 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Obtaining Segment details from master... 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Setting new master era 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Master Started... 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Shutting down master 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Commencing parallel segment instance startup, please wait... 
20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Process results... 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:----------------------------------------------------- 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:- Successful segment starts = 1 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:- Failed segment starts = 0 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:- Skipped segment starts (segments are marked down in configuration) = 0 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:----------------------------------------------------- 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Successfully started 1 of 1 segment instances 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:----------------------------------------------------- 20200923:11:29:11:001380 gpstart:master-0:gpadmin-[INFO]:-Starting Master instance master-0 directory /greenplum/data-1 20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Command pg_ctl reports Master master-0 instance active 20200923:11:29:12:001380 gpstart:master-0:gpadmin-[INFO]:-Connecting to dbname='template1' connect_timeout=15 20200923:11:29:27:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 1/4 20200923:11:29:42:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 2/4 20200923:11:29:57:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 3/4 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Timeout expired connecting to template1, attempt 4/4 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[WARNING]:-Failed to connect to template1 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-No standby master configured. skipping... 20200923:11:30:12:001380 gpstart:master-0:gpadmin-[INFO]:-Check status of database with gpstate utility 20200923:11:30:12:000095 gpinitsystem:master-0:gpadmin-[INFO]:-Completed restart of Greenplum instance in production mode Any ideas?
I deployed greenplum on kubernetes these days. My problem is the permission on cgroup directory. When I look into the files under /greenplum/data1/pg_log/ in the Pod, I found it print errors like 'can't access directory '/sys/fs/cgroup/memory/gpdb/'. Because the Pod used hostPath. My advice is to get the error printed in the files under /greenplum/data1/pg_log/. The Pod's log is not the whole fact. BTW, I used v0.8.0 at last. I choice v2.3.0 first, but the master is killed quickly when it is ready, maybe by Docker. The log is like 'received fast shutdown request. ic-proxy-server: received signal 15'
Kubernetes 1.18.4, iSCSI
I have problems with connecting volume per iSCSI from Kubernetes. When I try with iscisiadm from worker node, it works. This is what I get from kubectl description pod. Normal Scheduled <unknown> default-scheduler Successfully assigned default/iscsipd to k8s-worker-2 Normal SuccessfulAttachVolume 4m2s attachdetach-controller AttachVolume.Attach succeeded for volume "iscsipd-rw" Warning FailedMount 119s kubelet, k8s-worker-2 Unable to attach or mount volumes: unmounted volumes=[iscsipd-rw], unattached volumes=[iscsipd-rw default-token-d5glz]: timed out waiting for the condition Warning FailedMount 105s (x9 over 3m54s) kubelet, k8s-worker-2 MountVolume.WaitForAttach failed for volume "iscsipd-rw" : failed to get any path for iscsi disk, last err seen:iscsi: failed to attach disk: Error: iscsiadm: No records found(exit status 21) I'm just using iscsi.yaml file from kubernetes.io! --- apiVersion: v1 kind: Pod metadata: name: iscsipd spec: containers: - name: iscsipd-rw image: kubernetes/pause volumeMounts: - mountPath: "/mnt/iscsipd" name: iscsipd-rw volumes: - name: iscsipd-rw iscsi: targetPortal: 192.168.34.32:3260 iqn: iqn.2020-07.int.example:sql lun: 0 fsType: ext4 readOnly: true Open-iscsi is installed on all worker nodes(just two of them). ● iscsid.service - iSCSI initiator daemon (iscsid) Loaded: loaded (/lib/systemd/system/iscsid.service; enabled; vendor preset: e Active: active (running) since Fri 2020-07-03 10:24:26 UTC; 4 days ago Docs: man:iscsid(8) Process: 20507 ExecStart=/sbin/iscsid (code=exited, status=0/SUCCESS) Process: 20497 ExecStartPre=/lib/open-iscsi/startup-checks.sh (code=exited, st Main PID: 20514 (iscsid) Tasks: 2 (limit: 4660) CGroup: /system.slice/iscsid.service ├─20509 /sbin/iscsid └─20514 /sbin/iscsid ISCSI Target is created on the IBM Storwize V7000. Without CHAP. I tried to connect with iscsiadm from worker node and it works. sudo iscsiadm -m discovery -t sendtargets -p 192.168.34.32 192.168.34.32:3260,1 iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1 192.168.34.34:3260,1 iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1 sudo iscsiadm -m node --login Logging in to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.32,3260] (multiple) Logging in to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.34,3260] (multiple) Login to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.32,3260] successful. Login to [iface: default, target: iqn.1986-03.com.ibm:2145.hq-v7000.hq-v7000-rz1-c1, portal: 192.168.34.34,3260] successful. Disk /dev/sdb: 100 GiB, 107374182400 bytes, 209715200 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 32768 bytes / 32768 bytes Disklabel type: dos Disk identifier: 0x5b3d0a3a Device Boot Start End Sectors Size Id Type /dev/sdb1 2048 209715199 209713152 100G 83 Linux Is anyone facing the same problem?
Remember to not use a hostname for the target. Use the IP. For some reason, if the target is a hostname, it barfs with the error about requesting a duplicate session. If the target is an IP, it works fine. I now have multiple iSCSI targets mounted in various pods, and I am absolutely ecstatic. You may also have authentication issue to your iscsi target. If you don't use CHAP authentication yet, you still have to disable authentication. For example, if you use targetcli, you can run below commands to disable it. $ sudo targetcli /> /iscsi/iqn.2003-01.org.xxxx/tpg1 set attribute authentication=0 # will disable auth /> /iscsi/iqn.2003-01.org.xxxx/tpg1 set attribute generate_node_acls=1 # will force to use tpg1 auth mode by default If this doesn't help you, please share your iscsi target configuration, or guide that you followed. What is important check if all of your nodes have the open-iscsi-package installed. Take a look: kubernetes-iSCSI, volume-failed-iscsi-disk, iscsi-into-container-fails.
K3s cluster not starting (anymore)
My local k3s playground decided to suddenly stop working. I have the gut feeling something is wrong with the https certs I start the cluster from docker compose using version: '3.2' services: server: image: rancher/k3s:latest command: server --disable-agent --tls-san 192.168.2.110 environment: - K3S_CLUSTER_SECRET=somethingtotallyrandom - K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml - K3S_KUBECONFIG_MODE=666 volumes: - k3s-server:/var/lib/rancher/k3s # get the kubeconfig file - .:/output - ./registries.yaml:/etc/rancher/k3s/registries.yaml ports: - 192.168.2.110:6443:6443 node: image: rancher/k3s:latest volumes: - ./registries.yaml:/etc/rancher/k3s/registries.yaml tmpfs: - /run - /var/run privileged: true environment: - K3S_URL=https://server:6443 - K3S_CLUSTER_SECRET=somethingtotallyrandom ports: - 31000-32000:31000-32000 volumes: k3s-server: {} Nothing special. the registries.yaml can be uncommented without making a difference. contents is mirrors: "192.168.2.110:5055": endpoint: - "http://192.168.2.110:5055" However I get now a bunch of weird failures server_1 | E0516 22:58:03.264451 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) server_1 | E0516 22:58:08.265272 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) node_1 | I0516 22:58:12.695365 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2 node_1 | I0516 22:58:15.006306 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2 node_1 | I0516 22:58:15.006537 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c node_1 | E0516 22:58:15.006757 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)" server_1 | E0516 22:58:22.345501 1 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request node_1 | I0516 22:58:27.695296 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c node_1 | E0516 22:58:27.695989 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)" server_1 | I0516 22:58:30.328999 1 request.go:621] Throttling request took 1.047650754s, request: 
However, I now get a bunch of weird failures:

server_1 | E0516 22:58:03.264451 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
server_1 | E0516 22:58:08.265272 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1 | I0516 22:58:12.695365 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1 | I0516 22:58:15.006306 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: bb7ee4b14724692f4497e99716b68c4dc4fe77333b03801909092d42c00ef5a2
node_1 | I0516 22:58:15.006537 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:15.006757 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | E0516 22:58:22.345501 1 resource_quota_controller.go:408] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
node_1 | I0516 22:58:27.695296 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:27.695989 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | I0516 22:58:30.328999 1 request.go:621] Throttling request took 1.047650754s, request: GET:https://127.0.0.1:6444/apis/admissionregistration.k8s.io/v1beta1?timeout=32s
server_1 | W0516 22:58:31.081020 1 garbagecollector.go:644] failed to discover some groups: map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
server_1 | E0516 22:58:36.442904 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
node_1 | I0516 22:58:40.695404 1 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: fc2e51300f2ec06949abf5242690cb36077adc409f0d7f131a9d4f911063b63c
node_1 | E0516 22:58:40.696176 1 pod_workers.go:191] Error syncing pod e127dc88-e252-4e2e-bbd5-2e93ce5e32ff ("helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"), skipping: failed to "StartContainer" for "helm" with CrashLoopBackOff: "back-off 1m20s restarting failed container=helm pod=helm-install-traefik-jfrjk_kube-system(e127dc88-e252-4e2e-bbd5-2e93ce5e32ff)"
server_1 | E0516 22:58:41.443295 1 available_controller.go:420] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: Get https://10.43.6.218:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Also, it seems my node is not really connecting to the server anymore:

user@ipc:~/dev/test_mk3s_docker$ docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl cluster-info
Kubernetes master is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

user@ipc:~/dev/test_mk3s_docker$ docker exec -it $(docker ps |grep "k3s agent"|awk -F\ '{print $1}') kubectl cluster-info
error: Missing or incomplete configuration info.  Please point to an existing, complete config file:
  1. Via the command-line flag --kubeconfig
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config
To view or setup config directly use the 'config' command.
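Side note: as far as I can tell the agent container never gets a kubeconfig of its own, so the second command failing like that is probably expected. To check whether the node has actually registered, I ask the server container instead (same container-lookup pattern as above):

docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl get nodes -o wide
docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl get pods -A -o wide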
If I run kubectl get apiservice I get the following lines:

v1beta1.storage.k8s.io        Local                        True                           20m
v1beta1.scheduling.k8s.io     Local                        True                           20m
v1.storage.k8s.io             Local                        True                           20m
v1.k3s.cattle.io              Local                        True                           20m
v1.helm.cattle.io             Local                        True                           20m
v1beta1.metrics.k8s.io        kube-system/metrics-server   False (FailedDiscoveryCheck)   20m

Also, downgrading k3s to k3s:v1.0.1 only changes the error message:

server_1 | E0516 23:46:02.951073 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: no kind "CSINode" is registered for version "storage.k8s.io/v1" in scheme "k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30"
server_1 | E0516 23:46:03.444519 1 status.go:71] apiserver received an error that is not an metav1.Status: &runtime.notRegisteredErr{schemeName:"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30", gvk:schema.GroupVersionKind{Group:"storage.k8s.io", Version:"v1", Kind:"CSINode"}, target:runtime.GroupVersioner(nil), t:reflect.Type(nil)}

After executing

docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io

I only get:

node_1 | W0517 07:03:06.346944 1 info.go:51] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
node_1 | I0517 07:03:21.504932 1 log.go:172] http: TLS handshake error from 10.42.1.15:53888: remote error: tls: bad certificate
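For completeness, this is how I would look at the metrics-server side of that FailedDiscoveryCheck (assuming the default k3s deployment name metrics-server in kube-system):

docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl -n kube-system get deploy metrics-server
docker exec -it $(docker ps |grep "k3s server"|awk -F\ '{print $1}') kubectl -n kube-system logs deploy/metrics-server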
Ansible AWX RabbitMQ container in Kubernetes: "Failed to get nodes from k8s" with nxdomain
I am trying to get Ansible AWX installed on my Kubernetes cluster, but the RabbitMQ container is throwing a "Failed to get nodes from k8s" error. Below are the versions of the platforms I am using:

[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes is deployed via the kubespray playbook v2.5.0 and all the services and pods are up and running (CoreDNS, Weave, IPtables). I am deploying AWX via the 1.0.6 release using the 1.0.6 images for awx_web and awx_task. I am using an external PostgreSQL database at v10.4 and have verified that the tables are being created by awx in the db.

Troubleshooting steps I have tried:
I deployed AWX 1.0.5 with the etcd pod to the same cluster and it worked as expected.
I deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX rabbit deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend.
I tried tweaking some of the rabbitmq.conf for AWX 1.0.6 with no luck; it just keeps throwing the same error.
I verified the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry.

Cluster Info

[node1 ~]# kubectl get all -n awx
NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx                1         1         1            0           38m

NAME                      DESIRED   CURRENT   READY     AGE
rs/awx-654f7fc84c         1         1         0         38m

NAME                         READY     STATUS             RESTARTS   AGE
po/awx-654f7fc84c-9ppqb      3/4       CrashLoopBackOff   11         38m

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
svc/awx-rmq-mgmt   ClusterIP   10.233.10.146   <none>        15672/TCP                        1d
svc/awx-web-svc    NodePort    10.233.3.75     <none>        80:31700/TCP                     1d
svc/rabbitmq       NodePort    10.233.37.33    <none>        15672:30434/TCP,5672:31962/TCP   1d

AWX RabbitMQ error log

[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
 Starting RabbitMQ 3.7.4 on Erlang 20.1.7
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
 node           : rabbit@10.233.120.5
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : at619UOZzsenF44tSK3ulA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering:  OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}}, {inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Kubernetes API service

[node1 ~]# kubectl describe service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.237.34.19:6443,10.237.34.21:6443
Session Affinity:  ClientIP
Events:            <none>

nslookup from a busybox in the same Kubernetes cluster:

[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

Please let me know if I am missing anything that could help troubleshooting.
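For reference, the peer discovery settings in play here normally live in /etc/rabbitmq/rabbitmq.conf and look roughly like the sketch below (key names are from the rabbitmq_peer_discovery_k8s plugin; the exact values the AWX 1.0.6 installer generates may differ):

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# host of the Kubernetes API to query for peer pods
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
# return pod IPs ("ip") or pod hostnames ("hostname") from the endpoints lookup
cluster_formation.k8s.address_type = ip
# name of the Service whose endpoints list the RabbitMQ pods
cluster_formation.k8s.service_name = rabbitmq

The nxdomain above is the resolver failing on whatever cluster_formation.k8s.host ends up being inside the container.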
I believe the solution is to omit the explicit Kubernetes host. I can't think of any good reason one would need to specify the Kubernetes API host from inside the cluster. If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming the SSL cert for your master has its Service IP in the SANs list).

As for why it is doing such a silly thing, the only good reason I can think of is that the RMQ PodSpec has somehow gotten a dnsPolicy of something other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, you can provide an explicit command: to run some debugging bash commands first, in order to interrogate the state of the container at launch, and then exec /launch.sh to resume booting up RMQ (as they do).
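A minimal sketch of what that debugging override could look like in the RabbitMQ container spec (the awx-rabbit container name comes from the question and /launch.sh from the suggestion above; everything else is illustrative):

spec:
  template:
    spec:
      dnsPolicy: ClusterFirst
      containers:
      - name: awx-rabbit
        # image left exactly as the AWX deployment defines it
        command:
        - /bin/sh
        - -c
        - |
          # dump the DNS config the container actually sees
          cat /etc/resolv.conf
          # try the lookup that peer discovery is failing on (nslookup may not exist in every image)
          nslookup kubernetes.default.svc.cluster.local || true
          # hand control back to the normal RabbitMQ launcher
          exec /launch.sh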