AWS ECS Container stuck in restart loop - amazon-ecs
Trying to bring up my first ECS cluster. The container comes up successfully, but it keeps restarting every minute or so. The cluster's service sits behind an Application Load Balancer.
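In case it helps narrow things down, the service events and load-balancer target health can be inspected with something like the following (the cluster, service, and target-group values are placeholders, not my real names):

# Recent service events usually state why tasks are being stopped
# (e.g. failed ELB health checks); cluster/service names are placeholders.
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:10]'

# Health state of the targets registered behind the ALB; the ARN is a placeholder.
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/my-tg/...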
Task definition:
{
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 80,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "command": [
        "bash",
        "deploy.sh"
      ],
      "cpu": 512,
      "memory": 1961,
      "memoryReservation": 256,
      "image": "1",
      "essential": true
    }
  ],
  "placementConstraints": [],
  "memory": null,
  "taskRoleArn": null,
  "compatibilities": [
    "EC2"
  ],
  "taskDefinitionArn": "arn:aws:ecs:...",
  "family": "service",
  "networkMode": "awsvpc",
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
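If it's useful, I can also pull the stopped-task records to get the recorded stop reason and per-container exit codes, roughly like this (the cluster name and task ARN are placeholders):

# List recently stopped tasks for the cluster (placeholder name).
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED

# Show the stop reason and container exit codes for one of them (placeholder ARN).
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:...:task/... \
  --query 'tasks[0].{stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'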
ECS agent log (specifics removed and renamed); it goes through the following loop every minute or so:
00:05Z [INFO] Handling ENI attachment
00:05Z [INFO] Starting ENI ack timer
00:05Z [INFO] Adding task eni attachment
00:05Z [INFO] Emitting task ENI attached event for: ENI Attachment:
00:05Z [INFO] TaskHandler: Adding event: TaskChange: arn:... -> NONE, ENI Attachment: sent: false
00:05Z [INFO] TaskHandler: Sending task attachment change: TaskChange: arn:... -> NONE, ENI Attachment: sent: false
00:11Z [INFO] Managed task arn:... unable to create state change event for container create container state change event api: status not recognized by ECS: NONE
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: NONE
00:11Z [INFO] Managed task arn:... unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
00:11Z [INFO] Managed task arn:... waiting for any previous stops to complete. Sequence number: 110
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Managed task arn:... no longer waiting
00:11Z [INFO] Managed task arn:... wait over; ready to move towards status: RUNNING
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [PULLED]
00:11Z [INFO] Managed task arn:... handling container change [{PULLED { <nil> [] <nil> [] map[] {UNKNOWN <nil> 0 } <nil>} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: PULLED
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... creating container: ~internal~ecs~pause
00:11Z [INFO] Task engine arn:... created container name mapping for task: ~internal~ecs~pause ->
00:11Z [INFO] Creating cgroup
00:11Z [INFO] Managed task arn:... transitioned resource [cgroup] to [CREATED]
00:11Z [INFO] Managed task arn:... got resource [cgroup] event: [CREATED]
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... created docker container for task: ~internal~ecs~pause -> ...
00:11Z [INFO] Task engine arn:... created docker container for task: ~internal~ecs~pause -> ..., took 103.387592ms
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [CREATED]
00:11Z [INFO] Managed task arn:... handling container change [{CREATED {... <nil> [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc00011ab00} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: CREATED
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... starting container: ~internal~ecs~pause
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [RUNNING]
00:11Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc00011b600} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: RUNNING
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... setting up container resources for container [~internal~ecs~pause]
00:11Z [INFO] Task engine arn:... started docker container for task: ~internal~ecs~pause -> ..., took 221.148618ms
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [RUNNING]
00:11Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc0007aed00} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... redundant container state change. ~internal~ecs~pause to RUNNING, but already RUNNING
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... associated with ip address '1.2.3.4'
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [RESOURCES_PROVISIONED]
00:11Z [INFO] Managed task arn:... handling container change [{RESOURCES_PROVISIONED {... <nil> [] <nil> [] map[] {UNKNOWN <nil> 0 } <nil>} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: internal container: ~internal~ecs~pause
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... pulling image for container concurrently
00:11Z [INFO] Task engine arn:... recording timestamp for starting image pulltime:
00:11Z [INFO] Updating container reference in Image State -
00:11Z [INFO] Task engine arn:... finished pulling image for container in 99.366882ms
00:11Z [INFO] Managed task arn:... got container event: [PULLED]
00:11Z [INFO] Managed task arn:... handling container change [{PULLED { <nil> [] <nil> [] map[] {UNKNOWN <nil> 0 } <nil>} ContainerStatusChangeEvent}] for container
00:11Z [INFO] Managed task arn:... unable to create state change event for container create container state change event api: status not recognized by ECS: PULLED
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... creating container:
00:11Z [INFO] Task engine arn:... created container name mapping for task: ->
00:12Z [INFO] Task engine arn:... created docker container for task: -> ...
00:12Z [INFO] Task engine arn:... created docker container for task: -> ..., took 81.682728ms
00:12Z [INFO] Managed task arn:... got container event: [CREATED]
00:12Z [INFO] Managed task arn:... handling container change [{CREATED {... <nil> [] <nil> [] com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } container: 0xc00062eb00} ContainerStatusChangeEvent}] for container
00:12Z [INFO] Managed task arn:... unable to create state change event for container create container state change event api: status not recognized by ECS: CREATED
00:12Z [INFO] api/task: Updating task's known status to: CREATED, task: arn:aws:ecs:... TaskStatus: (NONE->RUNNING) Containers: (CREATED->RUNNING),~internal~ecs~pause (RESOURCES_PROVISIONED->RESOURCES_PROVISIONED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]
00:12Z [INFO] Managed task arn:... container change also resulted in task change [RUNNING]
00:12Z [INFO] Managed task arn:... unable to create task state change event []: create task state change event api: status not recognized by ECS: CREATED
00:12Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:12Z [INFO] Managed task arn:... waiting for event for task
00:12Z [INFO] Task engine arn:... starting container:
00:12Z [INFO] Managed task arn:... got container event: [RUNNING]
00:12Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } container: 0xc0007aeb00} ContainerStatusChangeEvent}] for container
00:12Z [INFO] Managed task arn:... sending container change event arn:aws:ecs:... -> RUNNING, Known Sent: NONE
00:12Z [INFO] Managed task arn:... sent container change event arn:aws:ecs:... -> RUNNING, Known Sent: NONE
00:12Z [INFO] api/task: Updating task's known status to: RUNNING, task: arn:aws:ecs:... TaskStatus: (CREATED->RUNNING) Containers: (RUNNING->RUNNING),~internal~ecs~pause (RESOURCES_PROVISIONED->RESOURCES_PROVISIONED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]
00:12Z [INFO] Managed task arn:... container change also resulted in task change [RUNNING]
00:12Z [INFO] Managed task arn:... sending task change event arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: ]
00:12Z [INFO] TaskHandler: batching container event -> RUNNING, Known Sent: NONE
00:12Z [INFO] TaskHandler: Adding event: TaskChange: arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: , arn:aws:ecs:... -> RUNNING, Known Sent: NONE] sent: false
00:12Z [INFO] TaskHandler: Sending task change: TaskChange: arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: , arn:aws:ecs:... -> RUNNING, Known Sent: NONE] sent: false
00:12Z [INFO] Managed task arn:... sent task change event arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: ]
00:12Z [INFO] Managed task arn:... task at steady state: RUNNING
00:12Z [INFO] Managed task arn:... waiting for event for task
00:12Z [INFO] Task engine arn:... started docker container for task: -> ..., took 204.620728ms
00:12Z [INFO] Managed task arn:... got container event: [RUNNING]
00:12Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } container: 0xc00062f600} ContainerStatusChangeEvent}] for container
00:12Z [INFO] Managed task arn:... redundant container state change. to RUNNING, but already RUNNING
00:12Z [INFO] Managed task arn:... task at steady state: RUNNING
00:12Z [INFO] Managed task arn:... waiting for event for task
00:14Z [INFO] Managed task arn:... got container event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc00 {UNKNOWN <nil> 0 } container: 0xc00062f800} ContainerStatusChangeEvent}] for container
00:14Z [INFO] Task arn:... recording execution stopped time. Essential container stopped at:
00:14Z [INFO] Managed task arn:... sending container change event arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING
00:14Z [INFO] Managed task arn:... sent container change event arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (RUNNING->RUNNING) Containers: (STOPPED->RUNNING),~internal~ecs~pause (RESOURCES_PROVISIONED->RESOURCES_PROVISIONED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [INFO] Managed task arn:... got resource [cgroup] event: [REMOVED]
00:14Z [INFO] Task engine arn:... stopping container [~internal~ecs~pause]
00:14Z [INFO] Task engine arn:... cleaning up the network namespace
00:14Z [INFO] TaskHandler: batching container event -> STOPPED, Exit 0, , Known Sent: RUNNING
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (RUNNING->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (RESOURCES_PROVISIONED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: [~internal~ecs~pause]; task: arn:aws:ecs:... TaskStatus: (RUNNING->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (RESOURCES_PROVISIONED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [WARN] DockerGoClient: Unable to decode stats for container ...: context canceled
00:14Z [INFO] Container ... is terminal, stopping stats collection
00:14Z [INFO] Task engine arn:... cleaned pause container network namespace
00:14Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc00 {UNKNOWN <nil> 0 } none 0xc00062fb00} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:14Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: internal container: ~internal~ecs~pause
00:14Z [INFO] api/task: Updating task's known status to: STOPPED, task: arn:aws:ecs:... TaskStatus: (RUNNING->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: [~internal~ecs~pause]; task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... container change also resulted in task change [~internal~ecs~pause]: [STOPPED]
00:14Z [INFO] Managed task arn:... sending task change event arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt:
00:14Z [INFO] Managed task arn:... sent task change event arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt:
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: [~internal~ecs~pause]; task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... task has reached stopped. Waiting for container cleanup
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [WARN] DockerGoClient: Unable to decode stats for container ...: context canceled
00:14Z [INFO] Container ... is terminal, stopping stats collection
00:14Z [INFO] TaskHandler: Adding event: TaskChange: arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING] sent: false
00:14Z [INFO] TaskHandler: Sending task change: TaskChange: arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING] sent: false
00:14Z [INFO] Managed task arn:... IPAM releasing ip for task eni
00:14Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc000489ee0 [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc000505000} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:14Z [INFO] Managed task arn:... redundant container state change. ~internal~ecs~pause to STOPPED, but already STOPPED
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc000072500 [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc000505400} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:14Z [INFO] Managed task arn:... redundant container state change. ~internal~ecs~pause to STOPPED, but already STOPPED
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [WARN] Udev watcher event-handler: unable to send state change: udev watcher send ENI state change: eni status already sent: ENI Attachment:
00:52Z [INFO] Handling ENI attachment
00:52Z [INFO] Starting ENI ack timer
00:52Z [INFO] Adding task eni attachment
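The agent log above shows the application container stopping with exit code 0 shortly after reaching RUNNING, so I can also inspect the exited container directly on the container instance, for example:

# On the EC2 container instance: list recently exited containers,
# then dump the logs and exit details of the relevant one.
# <container-id> is a placeholder for the ID shown by `docker ps -a`.
docker ps -a --filter "status=exited" --format "{{.ID}} {{.Names}} {{.Status}}"
docker logs <container-id>
docker inspect <container-id> --format '{{.State.ExitCode}} {{.State.Error}}'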
Can anyone point to a potential cause based on this?
Would any other information be helpful?
Take a look at: https://www.rabbitmq.com/clustering.html#restarting

You should be able to stop the app and then force boot:

rabbitmqctl stop_app
rabbitmqctl force_boot
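For context, force_boot tells a clustered node to boot without waiting for the peers it last saw, which is exactly what the timeout_waiting_for_tables loop above is stuck on. A minimal sketch of running those commands when the broker is a Kubernetes pod (the pod name rabbitmq-0 is an assumption based on the node name in the log; adjust the namespace and name to your deployment):

# assumed pod name; run from a machine with kubectl access to the cluster
kubectl exec -it rabbitmq-0 -- rabbitmqctl stop_app
kubectl exec -it rabbitmq-0 -- rabbitmqctl force_boot
kubectl exec -it rabbitmq-0 -- rabbitmqctl start_app

Keep in mind that the forced node's copy of the schema becomes authoritative, so this is best run on the node that was stopped last.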
To complement Vincent Gerris' answer, I strongly recommend using the Bitnami RabbitMQ Docker image. It includes an environment variable called RABBITMQ_FORCE_BOOT: https://github.com/bitnami/bitnami-docker-rabbitmq/blob/2c38682053dd9b3e88ab1fb305355d2ce88c2ccb/3.9/debian-10/rootfs/opt/bitnami/scripts/librabbitmq.sh#L760

if is_boolean_yes "$RABBITMQ_FORCE_BOOT" && ! is_dir_empty "${RABBITMQ_DATA_DIR}/${RABBITMQ_NODE_NAME}"; then
    # ref: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot
    warn "Forcing node to start..."
    debug_execute "${RABBITMQ_BIN_DIR}/rabbitmqctl" force_boot
fi

This will force-boot the node at the entrypoint.
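As a rough sketch of how that variable might be passed (only RABBITMQ_FORCE_BOOT itself comes from the linked script; the image tag and volume path here are assumptions to double-check against the Bitnami README):

# force-boot a node whose data directory already contains cluster state
docker run -d --name rabbitmq \
  -e RABBITMQ_FORCE_BOOT=yes \
  -v rabbitmq_data:/bitnami/rabbitmq/mnesia \
  bitnami/rabbitmq:3.9

In Kubernetes the same variable goes into the container's env: block, so a node that restarts with a non-empty data directory is force-booted automatically by the entrypoint.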
Ansible AWX RabbitMQ container in Kubernetes: Failed to get nodes from k8s with nxdomain
I am trying to get Ansible AWX installed on my Kubernetes cluster, but the RabbitMQ container is throwing a "Failed to get nodes from k8s" error.

Below are the versions of the platforms I am using:

[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes is deployed via the kubespray playbook v2.5.0 and all the services and pods are up and running (CoreDNS, Weave, IPtables). I am deploying AWX via the 1.0.6 release using the 1.0.6 images for awx_web and awx_task. I am using an external PostgreSQL database at v10.4 and have verified the tables are being created by awx in the db.

Troubleshooting steps I have tried:

- I tried to deploy AWX 1.0.5 with the etcd pod to the same cluster and it worked as expected.
- I have deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX rabbit deployment as much as possible, and it works with the rabbit_peer_discovery_k8s backend.
- I have tried tweaking some of the rabbitmq.conf for AWX 1.0.6 with no luck; it just keeps throwing the same error.
- I have verified the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry.

Cluster Info

[node1 ~]# kubectl get all -n awx
NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx                1         1         1            0           38m

NAME                      DESIRED   CURRENT   READY   AGE
rs/awx-654f7fc84c         1         1         0       38m

NAME                      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx                1         1         1            0           38m

NAME                      DESIRED   CURRENT   READY   AGE
rs/awx-654f7fc84c         1         1         0       38m

NAME                      READY   STATUS             RESTARTS   AGE
po/awx-654f7fc84c-9ppqb   3/4     CrashLoopBackOff   11         38m

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
svc/awx-rmq-mgmt   ClusterIP   10.233.10.146   <none>        15672/TCP                        1d
svc/awx-web-svc    NodePort    10.233.3.75     <none>        80:31700/TCP                     1d
svc/rabbitmq       NodePort    10.233.37.33    <none>        15672:30434/TCP,5672:31962/TCP   1d

AWX RabbitMQ error log

[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0> Starting RabbitMQ 3.7.4 on Erlang 20.1.7
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
 node           : rabbit@10.233.120.5
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : at619UOZzsenF44tSK3ulA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering: OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}}, {inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Kubernetes API service

[node1 ~]# kubectl describe service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.237.34.19:6443,10.237.34.21:6443
Session Affinity:  ClientIP
Events:            <none>

nslookup from a busybox in the same kubernetes cluster

[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

Please let me know if I am missing anything that could help troubleshooting.
I believe the solution is to omit the explicit kubernetes host. I can't think of any good reason one would need to specify the Kubernetes API host from inside the cluster. If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming your SSL cert for the master has its Service IP in the SANs list).

As for why it is doing such a silly thing, the only good reason I can think of is that the RMQ PodSpec has somehow gotten a dnsPolicy other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, you can provide an explicit command: to run some debugging bash commands first, in order to interrogate the state of the container at launch, and then exec /launch.sh to resume booting up RMQ (as they do).
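A minimal sketch of that debugging approach, assuming nothing beyond what is shown above (the pod and namespace names come from the question's kubectl output; /launch.sh is the launch script mentioned in this answer):

# check which dnsPolicy the AWX pod actually ended up with
kubectl -n awx get pod awx-654f7fc84c-9ppqb -o jsonpath='{.spec.dnsPolicy}'

# in the rabbitmq container spec, temporarily override the command to inspect DNS before booting, e.g.:
#   command: ["sh", "-c", "cat /etc/resolv.conf; nslookup kubernetes.default.svc.cluster.local; exec /launch.sh"]

If the lookup inside the RMQ container fails while the busybox test in the question succeeds, that points at the pod's DNS configuration rather than at CoreDNS itself.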