AWS ECS Container stuck in restart loop - amazon-ecs

I'm trying to bring up my first ECS cluster.
The container starts successfully, but keeps restarting every minute or so.
The cluster's service is using an Application Load Balancer.
Task definition:
{
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 80,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "command": [
        "bash",
        "deploy.sh"
      ],
      "cpu": 512,
      "memory": 1961,
      "memoryReservation": 256,
      "image": "1",
      "essential": true
    }
  ],
  "placementConstraints": [],
  "memory": null,
  "taskRoleArn": null,
  "compatibilities": [
    "EC2"
  ],
  "taskDefinitionArn": "arn:aws:ecs:...",
  "family": "service",
  "networkMode": "awsvpc",
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}
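One thing worth noting about this definition: the container's command is bash deploy.sh and the container is essential, so as soon as deploy.sh returns, ECS marks the whole task STOPPED and the service immediately schedules a replacement, which from the outside looks exactly like a restart loop (the Exit 0 further down in the agent log fits that). Purely as a hypothetical sketch (I don't know what the real deploy.sh contains), the script would need to end by exec-ing a long-lived foreground process rather than returning:

#!/usr/bin/env bash
# deploy.sh -- hypothetical sketch, not the actual script
set -euo pipefail

./one-time-setup.sh              # placeholder for whatever setup work the script does
exec nginx -g 'daemon off;'      # hand PID 1 to a foreground server; if the last command
                                 # exits, the essential container stops and ECS stops the task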
The ECS agent log (specifics removed and renamed) goes through the following loop every minute or so:
00:05Z [INFO] Handling ENI attachment
00:05Z [INFO] Starting ENI ack timer
00:05Z [INFO] Adding task eni attachment
00:05Z [INFO] Emitting task ENI attached event for: ENI Attachment:
00:05Z [INFO] TaskHandler: Adding event: TaskChange: arn:... -> NONE, ENI Attachment: sent: false
00:05Z [INFO] TaskHandler: Sending task attachment change: TaskChange: arn:... -> NONE, ENI Attachment: sent: false
00:11Z [INFO] Managed task arn:... unable to create state change event for container create container state change event api: status not recognized by ECS: NONE
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: NONE
00:11Z [INFO] Managed task arn:... unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
00:11Z [INFO] Managed task arn:... waiting for any previous stops to complete. Sequence number: 110
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Managed task arn:... no longer waiting
00:11Z [INFO] Managed task arn:... wait over; ready to move towards status: RUNNING
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [PULLED]
00:11Z [INFO] Managed task arn:... handling container change [{PULLED { <nil> [] <nil> [] map[] {UNKNOWN <nil> 0 } <nil>} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: PULLED
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... creating container: ~internal~ecs~pause
00:11Z [INFO] Task engine arn:... created container name mapping for task: ~internal~ecs~pause ->
00:11Z [INFO] Creating cgroup
00:11Z [INFO] Managed task arn:... transitioned resource [cgroup] to [CREATED]
00:11Z [INFO] Managed task arn:... got resource [cgroup] event: [CREATED]
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... created docker container for task: ~internal~ecs~pause -> ...
00:11Z [INFO] Task engine arn:... created docker container for task: ~internal~ecs~pause -> ..., took 103.387592ms
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [CREATED]
00:11Z [INFO] Managed task arn:... handling container change [{CREATED {... <nil> [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc00011ab00} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: CREATED
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... starting container: ~internal~ecs~pause
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [RUNNING]
00:11Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc00011b600} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: status not recognized by ECS: RUNNING
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... setting up container resources for container [~internal~ecs~pause]
00:11Z [INFO] Task engine arn:... started docker container for task: ~internal~ecs~pause -> ..., took 221.148618ms
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [RUNNING]
00:11Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc0007aed00} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... redundant container state change. ~internal~ecs~pause to RUNNING, but already RUNNING
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... associated with ip address '1.2.3.4'
00:11Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [RESOURCES_PROVISIONED]
00:11Z [INFO] Managed task arn:... handling container change [{RESOURCES_PROVISIONED {... <nil> [] <nil> [] map[] {UNKNOWN <nil> 0 } <nil>} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:11Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: internal container: ~internal~ecs~pause
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... pulling image for container concurrently
00:11Z [INFO] Task engine arn:... recording timestamp for starting image pulltime:
00:11Z [INFO] Updating container reference in Image State -
00:11Z [INFO] Task engine arn:... finished pulling image for container in 99.366882ms
00:11Z [INFO] Managed task arn:... got container event: [PULLED]
00:11Z [INFO] Managed task arn:... handling container change [{PULLED { <nil> [] <nil> [] map[] {UNKNOWN <nil> 0 } <nil>} ContainerStatusChangeEvent}] for container
00:11Z [INFO] Managed task arn:... unable to create state change event for container create container state change event api: status not recognized by ECS: PULLED
00:11Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:11Z [INFO] Managed task arn:... waiting for event for task
00:11Z [INFO] Task engine arn:... creating container:
00:11Z [INFO] Task engine arn:... created container name mapping for task: ->
00:12Z [INFO] Task engine arn:... created docker container for task: -> ...
00:12Z [INFO] Task engine arn:... created docker container for task: -> ..., took 81.682728ms
00:12Z [INFO] Managed task arn:... got container event: [CREATED]
00:12Z [INFO] Managed task arn:... handling container change [{CREATED {... <nil> [] <nil> [] com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } container: 0xc00062eb00} ContainerStatusChangeEvent}] for container
00:12Z [INFO] Managed task arn:... unable to create state change event for container create container state change event api: status not recognized by ECS: CREATED
00:12Z [INFO] api/task: Updating task's known status to: CREATED, task: arn:aws:ecs:... TaskStatus: (NONE->RUNNING) Containers: (CREATED->RUNNING),~internal~ecs~pause (RESOURCES_PROVISIONED->RESOURCES_PROVISIONED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]
00:12Z [INFO] Managed task arn:... container change also resulted in task change [RUNNING]
00:12Z [INFO] Managed task arn:... unable to create task state change event []: create task state change event api: status not recognized by ECS: CREATED
00:12Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:12Z [INFO] Managed task arn:... waiting for event for task
00:12Z [INFO] Task engine arn:... starting container:
00:12Z [INFO] Managed task arn:... got container event: [RUNNING]
00:12Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } container: 0xc0007aeb00} ContainerStatusChangeEvent}] for container
00:12Z [INFO] Managed task arn:... sending container change event arn:aws:ecs:... -> RUNNING, Known Sent: NONE
00:12Z [INFO] Managed task arn:... sent container change event arn:aws:ecs:... -> RUNNING, Known Sent: NONE
00:12Z [INFO] api/task: Updating task's known status to: RUNNING, task: arn:aws:ecs:... TaskStatus: (CREATED->RUNNING) Containers: (RUNNING->RUNNING),~internal~ecs~pause (RESOURCES_PROVISIONED->RESOURCES_PROVISIONED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]
00:12Z [INFO] Managed task arn:... container change also resulted in task change [RUNNING]
00:12Z [INFO] Managed task arn:... sending task change event arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: ]
00:12Z [INFO] TaskHandler: batching container event -> RUNNING, Known Sent: NONE
00:12Z [INFO] TaskHandler: Adding event: TaskChange: arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: , arn:aws:ecs:... -> RUNNING, Known Sent: NONE] sent: false
00:12Z [INFO] TaskHandler: Sending task change: TaskChange: arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: , arn:aws:ecs:... -> RUNNING, Known Sent: NONE] sent: false
00:12Z [INFO] Managed task arn:... sent task change event arn:... -> RUNNING, Known Sent: NONE, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: ]
00:12Z [INFO] Managed task arn:... task at steady state: RUNNING
00:12Z [INFO] Managed task arn:... waiting for event for task
00:12Z [INFO] Task engine arn:... started docker container for task: -> ..., took 204.620728ms
00:12Z [INFO] Managed task arn:... got container event: [RUNNING]
00:12Z [INFO] Managed task arn:... handling container change [{RUNNING {... <nil> [] <nil> [] com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } container: 0xc00062f600} ContainerStatusChangeEvent}] for container
00:12Z [INFO] Managed task arn:... redundant container state change. to RUNNING, but already RUNNING
00:12Z [INFO] Managed task arn:... task at steady state: RUNNING
00:12Z [INFO] Managed task arn:... waiting for event for task
00:14Z [INFO] Managed task arn:... got container event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc00 {UNKNOWN <nil> 0 } container: 0xc00062f800} ContainerStatusChangeEvent}] for container
00:14Z [INFO] Task arn:... recording execution stopped time. Essential container stopped at:
00:14Z [INFO] Managed task arn:... sending container change event arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING
00:14Z [INFO] Managed task arn:... sent container change event arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (RUNNING->RUNNING) Containers: (STOPPED->RUNNING),~internal~ecs~pause (RESOURCES_PROVISIONED->RESOURCES_PROVISIONED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [INFO] Managed task arn:... got resource [cgroup] event: [REMOVED]
00:14Z [INFO] Task engine arn:... stopping container [~internal~ecs~pause]
00:14Z [INFO] Task engine arn:... cleaning up the network namespace
00:14Z [INFO] TaskHandler: batching container event -> STOPPED, Exit 0, , Known Sent: RUNNING
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (RUNNING->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (RESOURCES_PROVISIONED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: [~internal~ecs~pause]; task: arn:aws:ecs:... TaskStatus: (RUNNING->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (RESOURCES_PROVISIONED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... task not steady state or terminal; progressing it
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [WARN] DockerGoClient: Unable to decode stats for container ...: context canceled
00:14Z [INFO] Container ... is terminal, stopping stats collection
00:14Z [INFO] Task engine arn:... cleaned pause container network namespace
00:14Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc00 {UNKNOWN <nil> 0 } none 0xc00062fb00} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:14Z [INFO] Managed task arn:... unable to create state change event for container [~internal~ecs~pause]: create container state change event api: internal container: ~internal~ecs~pause
00:14Z [INFO] api/task: Updating task's known status to: STOPPED, task: arn:aws:ecs:... TaskStatus: (RUNNING->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: [~internal~ecs~pause]; task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... container change also resulted in task change [~internal~ecs~pause]: [STOPPED]
00:14Z [INFO] Managed task arn:... sending task change event arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt:
00:14Z [INFO] Managed task arn:... sent task change event arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt:
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] api/task: Updating task desired status to stopped because of container: [~internal~ecs~pause]; task: arn:aws:ecs:... TaskStatus: (STOPPED->STOPPED) Containers: (STOPPED->STOPPED),~internal~ecs~pause (STOPPED->STOPPED),] ENIs: [eni id:eni-1 mac: aaa hostname: hostname ipv4addresses: [1.2.3.4], ipv6addresses: [], dns: [], dns search: [], gateway ipv4: [1.2.3.4/20][ ,VLan ID: [], TrunkInterfaceMacAddress: []],]]
00:14Z [INFO] Managed task arn:... task has reached stopped. Waiting for container cleanup
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [WARN] DockerGoClient: Unable to decode stats for container ...: context canceled
00:14Z [INFO] Container ... is terminal, stopping stats collection
00:14Z [INFO] TaskHandler: Adding event: TaskChange: arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING] sent: false
00:14Z [INFO] TaskHandler: Sending task change: TaskChange: arn:... -> STOPPED, Known Sent: RUNNING, PullStartedAt: PullStoppedAt: ExecutionStoppedAt: arn:aws:ecs:... -> STOPPED, Exit 0, , Known Sent: RUNNING] sent: false
00:14Z [INFO] Managed task arn:... IPAM releasing ip for task eni
00:14Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc000489ee0 [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc000505000} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:14Z [INFO] Managed task arn:... redundant container state change. ~internal~ecs~pause to STOPPED, but already STOPPED
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [INFO] Managed task arn:... got container [~internal~ecs~pause] event: [STOPPED]
00:14Z [INFO] Managed task arn:... handling container change [{STOPPED {... 0xc000072500 [] <nil> [] com.amazonaws.ecs.container-name:~internal~ecs~pause com.amazonaws.ecs.task-arn:arn:aws:ecs:... com.amazonaws.ecs.task-definition-version:15] {UNKNOWN <nil> 0 } none 0xc000505400} ContainerStatusChangeEvent}] for container [~internal~ecs~pause]
00:14Z [INFO] Managed task arn:... redundant container state change. ~internal~ecs~pause to STOPPED, but already STOPPED
00:14Z [INFO] Managed task arn:... waiting for event for task
00:14Z [WARN] Udev watcher event-handler: unable to send state change: udev watcher send ENI state change: eni status already sent: ENI Attachment:
00:52Z [INFO] Handling ENI attachment
00:52Z [INFO] Starting ENI ack timer
00:52Z [INFO] Adding task eni attachment
Can anyone point to a potential cause based on this?
Would anything else be helpful?
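ECS also records a stoppedReason for every stopped task, which is usually the quickest way to see whether the task is being killed by the load balancer health check or is simply exiting. A sketch with the AWS CLI (my-cluster, my-service and the task ARN are placeholders):

# Find the tasks the service has stopped, then ask ECS why it stopped them
aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED

aws ecs describe-tasks --cluster my-cluster --tasks <stopped-task-arn> \
  --query 'tasks[].{stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'

A reason mentioning failed ELB health checks points at the target group configuration; an "Essential container in task exited" reason with exit code 0 would match the Exit 0 visible in the agent log above.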

Related

Datree dashboard login giving error in verifying token

I tried to log in to the Datree dashboard at https://app.datree.io/webhook-instructions, but I am unable to because I get an error whenever I run this command:
helm install -n datree datree-webhook datree-webhook/datree-admission-webhook \
--create-namespace \
--set datree.token=5fb***2ac0 \
--debug
This is the output I am getting:
install.go:178: [debug] Original chart version: ""
install.go:199: [debug] CHART PATH: /home/nikhil/.cache/helm/repository/datree-admission-webhook-0.3.14.tgz
client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 20 resource(s)
client.go:128: [debug] creating 1 resource(s)
client.go:528: [debug] Watching for changes to Job datree-label-namespaces-hook-post-install with timeout of 5m0s
client.go:556: [debug] Add/Modify event for datree-label-namespaces-hook-post-install: ADDED
client.go:595: [debug] datree-label-namespaces-hook-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-label-namespaces-hook-post-install: MODIFIED
client.go:595: [debug] datree-label-namespaces-hook-post-install: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-label-namespaces-hook-post-install: MODIFIED
client.go:128: [debug] creating 1 resource(s)
client.go:528: [debug] Watching for changes to Job datree-wait-server-ready-hook-post-install with timeout of 5m0s
client.go:556: [debug] Add/Modify event for datree-wait-server-ready-hook-post-install: ADDED
client.go:595: [debug] datree-wait-server-ready-hook-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-wait-server-ready-hook-post-install: MODIFIED
client.go:595: [debug] datree-wait-server-ready-hook-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-wait-server-ready-hook-post-install: MODIFIED
client.go:595: [debug] datree-wait-server-ready-hook-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-wait-server-ready-hook-post-install: MODIFIED
client.go:595: [debug] datree-wait-server-ready-hook-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-wait-server-ready-hook-post-install: MODIFIED
client.go:595: [debug] datree-wait-server-ready-hook-post-install: Jobs active: 1, jobs failed: 1, jobs succeeded: 0
client.go:556: [debug] Add/Modify event for datree-wait-server-ready-hook-post-install: MODIFIED
client.go:595: [debug] datree-wait-server-ready-hook-post-install: Jobs active: 1, jobs failed: 1, jobs succeeded: 0
client.go:299: [debug] Starting delete for "datree-wait-server-ready-hook-post-install" Job
Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition
helm.go:88: [debug] failed post-install: timed out waiting for the condition
INSTALLATION FAILED
main.newInstallCmd.func2
helm.sh/helm/v3/cmd/helm/install.go:127
github.com/spf13/cobra.(*Command).execute
github.com/spf13/cobra#v1.2.1/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
github.com/spf13/cobra#v1.2.1/command.go:974
github.com/spf13/cobra.(*Command).Execute
github.com/spf13/cobra#v1.2.1/command.go:902
main.main
helm.sh/helm/v3/cmd/helm/helm.go:87
runtime.main
runtime/proc.go:225
runtime.goexit
runtime/asm_amd64.s:1371
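The install is failing because the datree-wait-server-ready-hook-post-install Job never succeeds within its 5m timeout, so that Job's pod log is usually the quickest clue. Helm deletes the hook Job after the failed install, so grab the logs while the install is still waiting (the datree namespace and the standard job-name label are assumed here):

# In a second terminal while helm install is running
kubectl get jobs,pods -n datree
kubectl logs -n datree job/datree-wait-server-ready-hook-post-install
kubectl describe pod -n datree -l job-name=datree-wait-server-ready-hook-post-install

If the hook is waiting for the webhook server to become ready, describing the server pods in the same namespace should show why they are not (image pull, crash loop, pending, etc.).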
Expected behavior
Login to datree dashboard
Desktop (please complete the following information):
OS: Ubuntu
Datree version (run datree version):
Version: 1.8.1
Environment: Minikube
kubernetes version: Client Version: v1.25.4
Kustomize Version: v4.5.7
Server Version: v1.25.3
Client ID (cat ~/.datree/config.yaml):
client_id: XXaKCoqJSBKNMb3AounaUg
Please help me resolve this issue.
Thank you

Rancher failed to launch with the error k3s exited with: exit status 1

I am running the latest Rancher Docker image on a Mac M1 laptop, but the container fails to start.
The command I am using is sudo docker run -d -p 80:80 -p 443:443 --privileged rancher/rancher.
Below are the versions for my environment:
$ docker --version
Docker version 20.10.13, build a224086
$ uname -a
Darwin Joeys-MBP 21.3.0 Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000 arm64
$ docker images|grep rancher
rancher/rancher latest f09cdb8a8fba 3 weeks ago 1.39GB
Below are the logs from the container.
$ docker logs -f 8d21d7d19b21
2022/04/28 03:34:00 [INFO] Rancher version v2.6.4 (4b4e29678) is starting
2022/04/28 03:34:00 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:true Embedded:false BindHost: HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features: ClusterRegistry:}
2022/04/28 03:34:00 [INFO] Listening on /tmp/log.sock
2022/04/28 03:34:00 [INFO] Waiting for k3s to start
2022/04/28 03:34:01 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/04/28 03:34:03 [INFO] Waiting for server to become available: an error on the server ("apiserver not ready") has prevented the request from succeeding
2022/04/28 03:34:05 [INFO] Running in single server mode, will not peer connections
2022/04/28 03:34:05 [INFO] Applying CRD features.management.cattle.io
2022/04/28 03:34:05 [INFO] Waiting for CRD features.management.cattle.io to become available
2022/04/28 03:34:05 [INFO] Done waiting for CRD features.management.cattle.io to become available
2022/04/28 03:34:08 [INFO] Applying CRD navlinks.ui.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD clusters.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD apiservices.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD clusterregistrationtokens.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD settings.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD preferences.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD features.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD clusterrepos.catalog.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD operations.catalog.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD apps.catalog.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD fleetworkspaces.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD bundles.fleet.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD clusters.fleet.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD managedcharts.management.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD clusters.provisioning.cattle.io
2022/04/28 03:34:08 [INFO] Applying CRD clusters.provisioning.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD rkeclusters.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD rkecontrolplanes.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD rkebootstraps.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD rkebootstraptemplates.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD rkecontrolplanes.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD custommachines.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD etcdsnapshots.rke.cattle.io
2022/04/28 03:34:09 [INFO] Applying CRD clusters.cluster.x-k8s.io
2022/04/28 03:34:09 [INFO] Applying CRD machinedeployments.cluster.x-k8s.io
2022/04/28 03:34:09 [INFO] Applying CRD machinehealthchecks.cluster.x-k8s.io
2022/04/28 03:34:09 [INFO] Applying CRD machines.cluster.x-k8s.io
2022/04/28 03:34:09 [INFO] Applying CRD machinesets.cluster.x-k8s.io
2022/04/28 03:34:09 [INFO] Waiting for CRD machinesets.cluster.x-k8s.io to become available
2022/04/28 03:34:09 [INFO] Done waiting for CRD machinesets.cluster.x-k8s.io to become available
2022/04/28 03:34:09 [INFO] Creating CRD authconfigs.management.cattle.io
2022/04/28 03:34:09 [INFO] Creating CRD groupmembers.management.cattle.io
2022/04/28 03:34:09 [INFO] Creating CRD groups.management.cattle.io
2022/04/28 03:34:09 [INFO] Creating CRD tokens.management.cattle.io
2022/04/28 03:34:09 [INFO] Creating CRD userattributes.management.cattle.io
2022/04/28 03:34:09 [INFO] Creating CRD users.management.cattle.io
2022/04/28 03:34:09 [INFO] Waiting for CRD tokens.management.cattle.io to become available
2022/04/28 03:34:10 [INFO] Done waiting for CRD tokens.management.cattle.io to become available
2022/04/28 03:34:10 [INFO] Waiting for CRD userattributes.management.cattle.io to become available
2022/04/28 03:34:10 [INFO] Done waiting for CRD userattributes.management.cattle.io to become available
2022/04/28 03:34:10 [INFO] Waiting for CRD users.management.cattle.io to become available
2022/04/28 03:34:11 [INFO] Done waiting for CRD users.management.cattle.io to become available
2022/04/28 03:34:11 [INFO] Creating CRD clusterroletemplatebindings.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD apps.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD catalogs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD apprevisions.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD dynamicschemas.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD catalogtemplates.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD pipelineexecutions.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD etcdbackups.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD pipelinesettings.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD globalrolebindings.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD pipelines.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD catalogtemplateversions.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD globalroles.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD sourcecodecredentials.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clusteralerts.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clusteralertgroups.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD sourcecodeproviderconfigs.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD kontainerdrivers.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD nodedrivers.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clustercatalogs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD sourcecoderepositories.project.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clusterloggings.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD nodepools.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD nodetemplates.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clusteralertrules.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clustermonitorgraphs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clusterscans.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD nodes.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD podsecuritypolicytemplateprojectbindings.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD composeconfigs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD podsecuritypolicytemplates.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD multiclusterapps.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectnetworkpolicies.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD multiclusterapprevisions.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectroletemplatebindings.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD monitormetrics.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projects.management.cattle.io
2022/04/28 03:34:11 [INFO] Waiting for CRD sourcecodecredentials.project.cattle.io to become available
2022/04/28 03:34:11 [INFO] Creating CRD rkek8ssystemimages.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD notifiers.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD rkek8sserviceoptions.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectalerts.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD rkeaddons.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectalertgroups.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD roletemplates.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectcatalogs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectloggings.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD samltokens.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectalertrules.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clustertemplates.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD projectmonitorgraphs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD clustertemplaterevisions.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD cisconfigs.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD cisbenchmarkversions.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD templates.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD templateversions.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD templatecontents.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD globaldnses.management.cattle.io
2022/04/28 03:34:11 [INFO] Creating CRD globaldnsproviders.management.cattle.io
2022/04/28 03:34:11 [INFO] Waiting for CRD nodetemplates.management.cattle.io to become available
2022/04/28 03:34:12 [INFO] Waiting for CRD projectalertgroups.management.cattle.io to become available
2022/04/28 03:34:12 [FATAL] k3s exited with: exit status 1
I would recommend trying to run it with a specific tag, e.g. rancher/rancher:v2.6.6.
Some other things that may interfere: what size setup are you running on?
The minimum requirements are currently 2 CPUs and 4 GB of RAM.
Also, you can try their Docker install scripts and check out the other documentation here: https://rancher.com/docs/rancher/v2.6/en/installation/requirements/installing-docker/
Edit: I noticed you're running on ARM. There is additional documentation for running Rancher on ARM here: https://rancher.com/docs/rancher/v2.5/en/installation/resources/advanced/arm64-platform/
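Building on that, a sketch of the pinned-tag run plus a quick check of the resources Docker Desktop's VM actually has (v2.6.6 is just the example tag from above; check the ARM64 page for tags that are published for arm64):

# Pin a specific Rancher release instead of :latest and restart the container on failure
sudo docker run -d --restart=unless-stopped --privileged \
  -p 80:80 -p 443:443 \
  rancher/rancher:v2.6.6

# Confirm the Docker Desktop VM meets the 2 CPU / 4 GB minimum
docker info --format '{{.NCPU}} CPUs, {{.MemTotal}} bytes of memory'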

CoreDNS logs report unauthorized

We deployed a new Kubernetes cluster, and it has 2 pods for CoreDNS.
$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME READY STATUS RESTARTS AGE
coredns-74ff55c5b-7v9bd 0/1 Running 0 7h22m
coredns-74ff55c5b-tfpqb 0/1 Running 0 7h23m
There are supposed to be 2 replicas, but 0 are READY.
When I check the logs to find out why they are not ready, I see many Unauthorized errors.
$ for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --tail 20 --namespace=kube-system $p; done
E0323 00:58:04.393710 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:58:34.184217 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:58:51.873269 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:59:00.966217 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:59:23.151006 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:59:47.362409 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Unauthorized
E0323 00:59:48.563791 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Unauthorized
E0323 00:59:56.278764 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:58:07.504557 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:58:24.948534 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:58:33.605013 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:58:56.471477 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:59:20.436808 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Unauthorized
E0323 00:59:21.200346 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
E0323 00:59:29.597663 1 reflector.go:178] pkg/mod/k8s.io/client-go#v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Unauthorized
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
While searching for help online, I found out that CoreDNS uses the coredns service account, so I checked all the related roles and bindings.
SERVICE ACCOUNT
$ kubectl get sa coredns -n kube-system -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2021-03-03T15:17:38Z"
  name: coredns
  namespace: kube-system
  resourceVersion: "297"
  uid: 13633498-2e6b-4ac4-bb34-f2d5c9e4d262
secrets:
- name: coredns-token-sg7p9
TOKEN SECRET
$ kubectl get secret coredns-token-sg7p9 -n kube-system
NAME TYPE DATA AGE
coredns-token-sg7p9 kubernetes.io/service-account-token 3 19d
CLUSTER ROLE
$ kubectl get clusterrole system:coredns -n kube-system -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2021-03-03T15:17:38Z"
  managedFields:
  - apiVersion: rbac.authorization.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:rules: {}
    manager: kubeadm
    operation: Update
    time: "2021-03-03T15:17:38Z"
  name: system:coredns
  resourceVersion: "292"
  uid: 35adc9a3-7415-4498-81b2-a4eab50882b1
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  - services
  - pods
  - namespaces
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
CLUSTER ROLE BINDINGS
$ kubectl get clusterrolebindings system:coredns -n kube-system -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: "2021-03-03T15:17:38Z"
  managedFields:
  - apiVersion: rbac.authorization.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:roleRef:
        f:apiGroup: {}
        f:kind: {}
        f:name: {}
      f:subjects: {}
    manager: kubeadm
    operation: Update
    time: "2021-03-03T15:17:38Z"
  name: system:coredns
  resourceVersion: "293"
  uid: 2d47c2cb-6641-4a62-b867-8a598ac3923a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:coredns
subjects:
- kind: ServiceAccount
  name: coredns
  namespace: kube-system
From the Unauthorized errors, my guess is that it is token-related, e.g. the token expired and was not renewed. I tried to find out how to renew the token for CoreDNS, but didn't find anything.
I might be doing something wrong, but I can't see what.
There is plenty of help available when a pod is not in the Running state, but not for Unauthorized errors in a pod that is already running.
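If the mounted token really has gone stale (for example because the cluster's service-account signing keys were rotated), one possible fix on a cluster that still stores tokens in Secrets, as this one does, is to delete the token Secret so the controller regenerates it, then restart CoreDNS so the pods mount the fresh token. The names below are taken from the output above; this is a hedged suggestion, not a confirmed root cause:

# The token controller recreates a deleted service-account token secret
kubectl -n kube-system delete secret coredns-token-sg7p9

# Restart the CoreDNS pods so they mount the regenerated token
kubectl -n kube-system rollout restart deployment coredns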

RabbitMQ fails to start after restart Kubernetes cluster

I'm running RabbitMQ on Kubernetes. This is my sts YAML file:
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-management
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 15672
    name: http
  selector:
    app: rabbitmq
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
  labels:
    app: rabbitmq
spec:
  ports:
  - port: 5672
    name: amqp
  - port: 4369
    name: epmd
  - port: 25672
    name: rabbitmq-dist
  - port: 61613
    name: stomp
  clusterIP: None
  selector:
    app: rabbitmq
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
spec:
  serviceName: "rabbitmq"
  replicas: 3
  selector:
    matchLabels:
      app: rabbitmq
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      containers:
      - name: rabbitmq
        image: rabbitmq:management-alpine
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - >
                rabbitmq-plugins enable rabbitmq_stomp;
                if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
                  sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
                  cat /etc/resolv.conf.new > /etc/resolv.conf;
                  rm /etc/resolv.conf.new;
                fi;
                until rabbitmqctl node_health_check; do sleep 1; done;
                if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
                  rabbitmqctl stop_app;
                  rabbitmqctl join_cluster rabbit@rabbitmq-0;
                  rabbitmqctl start_app;
                fi;
                rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
        env:
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-config
              key: erlang-cookie
        ports:
        - containerPort: 5672
          name: amqp
        - containerPort: 61613
          name: stomp
        volumeMounts:
        - name: rabbitmq
          mountPath: /var/lib/rabbitmq
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq
      annotations:
        volume.alpha.kubernetes.io/storage-class: do-block-storage
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
and I created the cookie with this command:
kubectl create secret generic rabbitmq-config --from-literal=erlang-cookie=c-is-for-cookie-thats-good-enough-for-me
all of my Kubernetes cluster nodes are ready:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubernetes-master Ready master 14d v1.17.3
kubernetes-slave-1 Ready <none> 14d v1.17.3
kubernetes-slave-2 Ready <none> 14d v1.17.3
but after restarting the cluster, RabbitMQ didn't start. I tried scaling the sts down and up, but the problem persists. The output of kubectl describe pod rabbitmq-0:
kubectl describe pod rabbitmq-0
Name: rabbitmq-0
Namespace: default
Priority: 0
Node: kubernetes-slave-1/192.168.0.179
Start Time: Tue, 24 Mar 2020 22:31:04 +0000
Labels: app=rabbitmq
controller-revision-hash=rabbitmq-6748869f4b
statefulset.kubernetes.io/pod-name=rabbitmq-0
Annotations: <none>
Status: Running
IP: 10.244.1.163
IPs:
IP: 10.244.1.163
Controlled By: StatefulSet/rabbitmq
Containers:
rabbitmq:
Container ID: docker://d5108f818525030b4fdb548eb40f0dc000dd2cec473ebf8cead315116e3efbd3
Image: rabbitmq:management-alpine
Image ID: docker-pullable://rabbitmq#sha256:6f7c8d01d55147713379f5ca26e3f20eca63eb3618c263b12440b31c697ee5a5
Ports: 5672/TCP, 61613/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: PostStartHookError: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-575-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
node_health_check
Usage
rabbitmqctl [--node <node>] [--longnames] [--quiet] node_health_check [--timeout <timeout>]
Error:
{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :"$1", :_, :_}, [], [:"$1"]}]]}}
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-10397-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Tue, 24 Mar 2020 22:46:09 +0000
Finished: Tue, 24 Mar 2020 22:58:28 +0000
Ready: False
Restart Count: 1
Environment:
RABBITMQ_ERLANG_COOKIE: <set to the key 'erlang-cookie' in secret 'rabbitmq-config'> Optional: false
Mounts:
/var/lib/rabbitmq from rabbitmq (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bbl9c (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
rabbitmq:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: rabbitmq-rabbitmq-0
ReadOnly: false
default-token-bbl9c:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bbl9c
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 31m default-scheduler Successfully assigned default/rabbitmq-0 to kubernetes-slave-1
Normal Pulled 31m kubelet, kubernetes-slave-1 Container image "rabbitmq:management-alpine" already present on machine
Normal Created 31m kubelet, kubernetes-slave-1 Created container rabbitmq
Normal Started 31m kubelet, kubernetes-slave-1 Started container rabbitmq
Normal SandboxChanged 16m (x9 over 17m) kubelet, kubernetes-slave-1 Pod sandbox changed, it will be killed and re-created.
Normal Pulled 3m58s (x2 over 16m) kubelet, kubernetes-slave-1 Container image "rabbitmq:management-alpine" already present on machine
Warning FailedPostStartHook 3m58s kubelet, kubernetes-slave-1 Exec lifecycle hook ([/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_default(2e561153-a830-4d30-ab1e-71c80d10c9e9)" failed - error: command '/bin/sh -c rabbitmq-plugins enable rabbitmq_stomp; if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi; until rabbitmqctl node_health_check; do sleep 1; done; if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit#rabbitmq-0;
rabbitmqctl start_app;
fi; rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
other nodes on rabbitmq-0: [rabbitmqprelaunch1]
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-433-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit#rabbitmq-0
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit#rabbitmq-0']
rabbit#rabbitmq-0:
* connected to epmd (port 4369) on rabbitmq-0
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-575-rabbit#rabbitmq-0'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==
Error: this command requires the 'rabbit' app to be running on the target node. Start it with 'rabbitmqctl start_app'.
Arguments given:
node_health_check
, message: "Enabling plugins on node rabbit#rabbitmq-0:\nrabbitmq_stomp\nThe following plugins have been configured:\n rabbitmq_management\n rabbitmq_management_agent\n rabbitmq_stomp\n rabbitmq_web_dispatch\nApplying plugin configuration to rabbit#rabbitmq-0...\nThe following plugins have been enabled:\n rabbitmq_stomp\n\nset 4 plugins.\nOffline change; changes will take effect at broker restart.\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\nChecking health of node rabbit#rabbitmq-0 ...\nTimeout: 70 seconds ...\
Error:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError:\n{:aborted, {:no_exists, [:rabbit_vhost, [{{:vhost, :\"$1\", :_, :_}, [], [:\"$1\"]}]]}}\nError: unable to perform an operation on node 'rabbit#rabbitmq-0'. Please see diagnostics information and suggestions below.\n\nMost common reasons for this are:\n\n * Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)\n * CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)\n * Target node is not running\n\nIn addition to the diagnostics info below:\n\n * See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more\n * Consult server logs on node rabbit#rabbitmq-0\n * If target node is configured to use long node names, don't forget to use --longnames with CLI tools\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit#rabbitmq-0']\n\nrabbit#rabbitmq-0:\n * connected to epmd (port 4369) on rabbitmq-0\n * epmd reports: node 'rabbit' not running at all\n no other nodes on rabbitmq-0\n * suggestion: start the node\n\nCurrent node details:\n * node name: 'rabbitmqcli-10397-rabbit#rabbitmq-0'\n * effective user's home directory: /var/lib/rabbitmq\n * Erlang cookie hash: P1XNOe5pN3Ug2FCRFzH7Xg==\n\n"
Normal Killing 3m58s kubelet, kubernetes-slave-1 FailedPostStartHook
Normal Created 3m57s (x2 over 16m) kubelet, kubernetes-slave-1 Created container rabbitmq
Normal Started 3m57s (x2 over 16m) kubelet, kubernetes-slave-1 Started container rabbitmq
The output of kubectl get sts:
kubectl get sts
NAME READY AGE
consul 3/3 15d
hazelcast 2/3 15d
kafka 2/3 15d
rabbitmq 0/3 13d
zk 3/3 15d
and this is the pod log that I copied from the Kubernetes dashboard:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: list of feature flags found:
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] drop_unroutable_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] empty_basic_get_metric
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] implicit_default_bindings
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] quorum_queue
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: [x] virtual_host_metadata
2020-03-24 22:58:41.402 [info] <0.8.0> Feature flags: feature flag states written to disk: yes
2020-03-24 22:58:43.979 [info] <0.319.0> ra: meta data store initialised. 0 record(s) recovered
2020-03-24 22:58:43.980 [info] <0.324.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit#rabbitmq-0/quorum/rabbit#rabbitmq-0/00000262.wal"]
2020-03-24 22:58:43.982 [info] <0.328.0>
Starting RabbitMQ 3.8.2 on Erlang 22.2.8
Copyright (c) 2007-2019 Pivotal Software, Inc.
Licensed under the MPL 1.1. Website: https://rabbitmq.com
## ## RabbitMQ 3.8.2
## ##
########## Copyright (c) 2007-2019 Pivotal Software, Inc.
###### ##
########## Licensed under the MPL 1.1. Website: https://rabbitmq.com
Doc guides: https://rabbitmq.com/documentation.html
Support: https://rabbitmq.com/contact.html
Tutorials: https://rabbitmq.com/getstarted.html
Monitoring: https://rabbitmq.com/monitoring.html
Logs: <stdout>
Config file(s): /etc/rabbitmq/rabbitmq.conf
Starting broker...2020-03-24 22:58:43.983 [info] <0.328.0>
node : rabbit#rabbitmq-0
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : P1XNOe5pN3Ug2FCRFzH7Xg==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit#rabbitmq-0
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step pre_boot defined by app rabbit
2020-03-24 22:58:43.997 [info] <0.328.0> Running boot step rabbit_core_metrics defined by app rabbit
2020-03-24 22:58:43.998 [info] <0.328.0> Running boot step rabbit_alarm defined by app rabbit
2020-03-24 22:58:44.002 [info] <0.334.0> Memory high watermark set to 1200 MiB (1258889216 bytes) of 3001 MiB (3147223040 bytes) total
2020-03-24 22:58:44.014 [info] <0.336.0> Enabling free disk space monitoring
2020-03-24 22:58:44.014 [info] <0.336.0> Disk free limit set to 50MB
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step code_server_cache defined by app rabbit
2020-03-24 22:58:44.018 [info] <0.328.0> Running boot step file_handle_cache defined by app rabbit
2020-03-24 22:58:44.019 [info] <0.339.0> Limiting to approx 1048479 file handles (943629 sockets)
2020-03-24 22:58:44.019 [info] <0.340.0> FHC read buffering: OFF
2020-03-24 22:58:44.019 [info] <0.340.0> FHC write buffering: ON
2020-03-24 22:58:44.020 [info] <0.328.0> Running boot step worker_pool defined by app rabbit
2020-03-24 22:58:44.021 [info] <0.329.0> Will use 2 processes for default worker pool
2020-03-24 22:58:44.021 [info] <0.329.0> Starting worker pool 'worker_pool' with 2 processes in it
2020-03-24 22:58:44.021 [info] <0.328.0> Running boot step database defined by app rabbit
2020-03-24 22:58:44.041 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 22:59:14.042 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:14.042 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 22:59:44.043 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 22:59:44.043 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:00:14.044 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:14.044 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:00:44.045 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:00:44.045 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:01:14.046 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:14.046 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:01:44.047 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:01:44.047 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:02:14.048 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:14.048 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:02:44.049 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:02:44.049 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:03:14.050 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}
2020-03-24 23:03:14.050 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:03:44.051 [error] <0.328.0> Feature flag `quorum_queue`: migration function crashed: {error,{timeout_waiting_for_tables,[rabbit_durable_queue]}}
[{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,117}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,60}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1486}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-2-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2128}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2126}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,1947}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,631}]}]
2020-03-24 23:03:44.051 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2020-03-24 23:04:14.052 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:14.052 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2020-03-24 23:04:44.053 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:04:44.053 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2020-03-24 23:05:14.055 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:14.055 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2020-03-24 23:05:44.056 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:05:44.056 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2020-03-24 23:06:14.057 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:14.057 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
2020-03-24 23:06:44.058 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:06:44.058 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
2020-03-24 23:07:14.059 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:14.059 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
2020-03-24 23:07:44.060 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:07:44.060 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
2020-03-24 23:08:14.061 [warning] <0.328.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-03-24 23:08:14.061 [info] <0.328.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-03-24 23:08:44.062 [error] <0.327.0> CRASH REPORT Process <0.327.0> with 0 neighbours exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}} in application_master:init/4 line 138
2020-03-24 23:08:44.063 [info] <0.43.0> Application rabbit exited with reason: {{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]},{rabbit,start,[normal,[]]}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_r
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Take a look at:
https://www.rabbitmq.com/clustering.html#restarting
The node is timing out while waiting for Mnesia tables owned by its cluster peers; force_boot tells it to start without waiting for them. You should be able to stop the app and then force boot:
rabbitmqctl stop_app
rabbitmqctl force_boot
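With a StatefulSet like the one above, you would run these inside the pod. A minimal sketch, assuming the pod is named rabbitmq-0 (add -n <namespace> if it is not in the default namespace) and stays up long enough during its roughly five-minute Mnesia wait to exec into it:
kubectl exec -it rabbitmq-0 -- rabbitmqctl stop_app
kubectl exec -it rabbitmq-0 -- rabbitmqctl force_boot
kubectl exec -it rabbitmq-0 -- rabbitmqctl start_app
If the container restarts too quickly to exec into, the environment-variable approach in the next answer avoids that race.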
To complete @Vincent Gerris's answer, I strongly recommend using the Bitnami RabbitMQ Docker image.
It includes an environment variable called RABBITMQ_FORCE_BOOT:
https://github.com/bitnami/bitnami-docker-rabbitmq/blob/2c38682053dd9b3e88ab1fb305355d2ce88c2ccb/3.9/debian-10/rootfs/opt/bitnami/scripts/librabbitmq.sh#L760
if is_boolean_yes "$RABBITMQ_FORCE_BOOT" && ! is_dir_empty "${RABBITMQ_DATA_DIR}/${RABBITMQ_NODE_NAME}"; then
    # ref: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot
    warn "Forcing node to start..."
    debug_execute "${RABBITMQ_BIN_DIR}/rabbitmqctl" force_boot
fi
This will force-boot the node at the entrypoint.
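If you are already running (or switch to) the Bitnami image, setting the variable on the StatefulSet is enough to trigger this path. A minimal sketch, assuming the StatefulSet is named rabbitmq (adjust the name and namespace to match yours); updating the pod template triggers a rolling restart:
kubectl set env statefulset/rabbitmq RABBITMQ_FORCE_BOOT=yes
Consider removing the variable again once the cluster has re-formed, since force-booting on every start bypasses RabbitMQ's usual check for which node shut down last.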

Ansible AWX RabbitMQ container in Kubernetes Failed to get nodes from k8s with nxdomain

I am trying to get Ansible AWX installed on my Kubernetes cluster, but the RabbitMQ container is throwing a "Failed to get nodes from k8s" error.
Below are the versions of the platforms I am using:
[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5",
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean",
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc",
Platform:"linux/amd64"}
Kubernetes is deployed via the kubespray playbook v2.5.0 and all the services and pods are up and running. (CoreDNS, Weave, IPtables)
I am deploying AWX via the 1.0.6 release using the 1.0.6 images for awx_web and awx_task.
I am using an external PostgreSQL database at v10.4 and have verified the tables are being created by awx in the db.
Troubleshooting steps I have tried:
I tried deploying AWX 1.0.5 with the etcd pod to the same cluster, and it worked as expected.
I have deployed a standalone RabbitMQ cluster in the same k8s cluster, mimicking the AWX rabbit deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend.
I have tried tweaking some of the rabbitmq.conf for AWX 1.0.6 with no luck; it just keeps throwing the same error.
I have verified the /etc/resolv.conf file has the kubernetes.default.svc.cluster.local entry
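A further check along the same lines would be to run the lookup from inside the awx-rabbit container itself rather than from a separate pod, to rule out a per-pod resolver difference. A sketch, using the pod and container names from the output below (whether getent or nslookup is available depends on the image):
kubectl exec -n awx awx-654f7fc84c-9ppqb -c awx-rabbit -- cat /etc/resolv.conf
kubectl exec -n awx awx-654f7fc84c-9ppqb -c awx-rabbit -- getent hosts kubernetes.default.svc.cluster.local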
Cluster Info
[node1 ~]# kubectl get all -n awx
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/awx 1 1 1 0 38m
NAME DESIRED CURRENT READY AGE
rs/awx-654f7fc84c 1 1 0 38m
NAME READY STATUS RESTARTS AGE
po/awx-654f7fc84c-9ppqb 3/4 CrashLoopBackOff 11 38m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/awx-rmq-mgmt ClusterIP 10.233.10.146 <none> 15672/TCP 1d
svc/awx-web-svc NodePort 10.233.3.75 <none> 80:31700/TCP 1d
svc/rabbitmq NodePort 10.233.37.33 <none> 15672:30434/TCP,5672:31962/TCP 1d
AWX RabbitMQ error log
[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit#10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
Starting RabbitMQ 3.7.4 on Erlang 20.1.7
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>
Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
node : rabbit#10.233.120.5
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : at619UOZzsenF44tSK3ulA==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit#10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering: OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit#10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
Kubernetes API service
[node1 ~]# kubectl describe service kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.233.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 10.237.34.19:6443,10.237.34.21:6443
Session Affinity: ClientIP
Events: <none>
nslookup from a busybox pod in the same Kubernetes cluster:
[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup kubernetes.default.svc.cluster.local
Server: 10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local
Name: kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local
Please let me know if I am missing anything that could help with troubleshooting.
I believe the solution is to omit the explicit Kubernetes host. I can't think of any good reason one would need to specify the Kubernetes API host from inside the cluster.
If for some terrible reason the RMQ plugin requires it, then try swapping in the Service IP (assuming your SSL cert for the master has its Service IP in the SANs list); see the sketch below.
As for why it is doing such a silly thing, the only good reason I can think of is that the RMQ PodSpec has somehow ended up with a dnsPolicy other than ClusterFirst. If you truly wish to troubleshoot the RMQ Pod, you can provide an explicit command: that runs some debugging bash commands first, to interrogate the state of the container at launch, and then execs /launch.sh to resume booting up RMQ (as they do).
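A minimal sketch of what that looks like for the rabbit_peer_discovery_k8s settings in rabbitmq.conf; the key names come from the plugin, and the values here are illustrative, to be adapted to the AWX config:
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# either leave cluster_formation.k8s.host unset, or point it at the
# Service IP shown by kubectl describe service kubernetes above:
# cluster_formation.k8s.host = 10.233.0.1
And to check which dnsPolicy the pod actually got:
kubectl get pod -n awx awx-654f7fc84c-9ppqb -o jsonpath='{.spec.dnsPolicy}'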