I'm trying to run K8ssandra but the Cassandra container keeps failing with the following message (repeating over and over):
WARN [epollEventLoopGroup-374-2] 2021-12-30 23:54:23,711 AbstractBootstrap.java:452 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x7cf79bf5]'
WARN [epollEventLoopGroup-374-2] 2021-12-30 23:54:23,712 Loggers.java:39 - [s369] Error connecting to Node(endPoint=/tmp/cassandra.sock, hostId=null, hashCode=7ec5e39e), trying next node (FileNotFoundException: null)
INFO [nioEventLoopGroup-2-1] 2021-12-30 23:54:23,713 Cli.java:617 - address=/100.97.28.180:53816 url=/api/v0/metadata/endpoints status=500 Internal Server Error
and from the server-system-logger container:
tail: cannot open '/var/log/cassandra/system.log' for reading: No such file or directory
and finally, in the cass-operator pod:
2021-12-30T23:56:22.580Z INFO controllers.CassandraDatacenter incorrect status code when calling Node Management Endpoint {"cassandradatacenter": "default/dc1", "requestNamespace": "default", "requestName": "dc1", "loopID": "d1f81abc-6b68-4e63-9e95-1c2b5f6d4e9d", "namespace": "default", "datacenterName": "dc1", "clusterName": "mydomaincom", "statusCode": 500, "pod": "100.122.58.236"}
2021-12-30T23:56:22.580Z ERROR controllers.CassandraDatacenter Could not get endpoints data {"cassandradatacenter": "default/dc1", "requestNamespace": "default", "requestName": "dc1", "loopID": "d1f81abc-6b68-4e63-9e95-1c2b5f6d4e9d", "namespace": "default", "datacenterName": "dc1", "clusterName": "mydomaincom", "error": "incorrect status code of 500 when calling endpoint"}
Not really sure what's happening here. It works fine with the same config on a local minikube cluster, but I can't seem to get it to work on my AWS cluster (running Kubernetes v1.20.10).
All other pods are running fine.
NAME READY STATUS RESTARTS AGE
mydomaincom-dc1-rac1-sts-0 2/3 Running 0 17m
k8ssandra-cass-operator-8675f58b89-qt2dx 1/1 Running 0 29m
k8ssandra-medusa-operator-589995d979-rnjhr 1/1 Running 0 29m
k8ssandra-reaper-operator-5d9d5d975d-c6nhv 1/1 Running 0 29m
the pod events show this:
Warning Unhealthy 109s (x88 over 16m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
My values.yaml (deployed with Helm3):
cassandra:
enabled: true
version: "4.0.1"
versionImageMap:
3.11.7: k8ssandra/cass-management-api:3.11.7-v0.1.33
3.11.8: k8ssandra/cass-management-api:3.11.8-v0.1.33
3.11.9: k8ssandra/cass-management-api:3.11.9-v0.1.27
3.11.10: k8ssandra/cass-management-api:3.11.10-v0.1.27
3.11.11: k8ssandra/cass-management-api:3.11.11-v0.1.33
4.0.0: k8ssandra/cass-management-api:4.0.0-v0.1.33
4.0.1: k8ssandra/cass-management-api:4.0.1-v0.1.33
clusterName: "mydomain.com"
auth:
enabled: true
superuser:
secret: ""
username: ""
cassandraLibDirVolume:
storageClass: default
size: 100Gi
encryption:
keystoreSecret:
keystoreMountPath:
truststoreSecret:
truststoreMountPath:
additionalSeeds: []
heap: {}
resources:
requests:
memory: 4Gi
cpu: 500m
limits:
memory: 4Gi
cpu: 1000m
datacenters:
-
name: dc1
size: 1
racks:
- name: rac1
heap: {}
ingress:
enabled: false
stargate:
enabled: false
reaper:
autoschedule: true
enabled: true
cassandraUser:
secret: ""
username: ""
jmx:
secret: ""
username: ""
medusa:
enabled: true
image:
registry: docker.io
repository: k8ssandra/medusa
tag: 0.11.3
cassandraUser:
secret: ""
username: ""
storage_properties:
region: us-east-1
bucketName: my-bucket-name
storageSecret: medusa-bucket-key
reaper-operator:
enabled: true
monitoring:
grafana:
provision_dashboards: false
prometheus:
provision_service_monitors: false
kube-prometheus-stack:
enabled: false
prometheusOperator:
enabled: false
serviceMonitor:
selfMonitor: false
prometheus:
enabled: false
grafana:
enabled: false
I was able to fix this by increasing the memory to 12Gi.
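For reference, a minimal sketch of the change in my values.yaml (assuming both the request and the limit are raised; the rest of the cassandra block stays as above):

cassandra:
  resources:
    requests:
      memory: 12Gi
      cpu: 500m
    limits:
      memory: 12Gi
      cpu: 1000m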
I deployed HashiCorp Vault with 3 replicas. Pod vault-0 is running, but the other two pods are stuck in Pending status.
This is my override YAML:
# Vault Helm Chart Value Overrides
global:
enabled: true
tlsDisable: true
injector:
enabled: true
# Use the Vault K8s Image https://github.com/hashicorp/vault-k8s/
image:
repository: "hashicorp/vault-k8s"
tag: "0.9.0"
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 256Mi
cpu: 250m
affinity: ""
server:
auditStorage:
enabled: true
standalone:
enabled: false
image:
repository: "hashicorp/vault"
tag: "1.6.3"
resources:
requests:
memory: 4Gi
cpu: 1000m
limits:
memory: 8Gi
cpu: 1000m
ha:
enabled: true
replicas: 3
raft:
enabled: true
setNodeId: true
config: |
ui = true
listener "tcp" {
tls_disable = true
address = "[::]:8200"
cluster_address = "[::]:8201"
}
storage "raft" {
path = "/vault/data"
}
service_registration "kubernetes" {}
config: |
ui = true
listener "tcp" {
tls_disable = true
address = "[::]:8200"
cluster_address = "[::]:8201"
}
service_registration "kubernetes" {}
# Vault UI
ui:
enabled: true
serviceType: "ClusterIP"
externalPort: 8200
I did a kubectl describe on the pending pods and can see the following status message. I am not sure I am adding the correct affinity settings in the override file, and I'm not sure what I am doing wrong. I am using the Vault Helm charts to deploy to a Docker Desktop local cluster. Appreciate any help.
(screenshot: kubectl describe output for the pending pods)
There are a few problems in your values.yaml file.
1. You set
server:
auditStorage:
enabled: true
but you didn't specify how the PVC should be created or what the storage class is. The chart expects you to do that when you enable audit storage. Look at: https://github.com/hashicorp/vault-helm/blob/v0.9.0/values.yaml#L443
Turn it off if you're just testing on your local machine, or specify the storage config.
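For example, a sketch of what specifying the storage could look like (the size is just a placeholder; storageClass: null falls back to the cluster's default StorageClass):

server:
  auditStorage:
    enabled: true
    size: 10Gi
    storageClass: null
    accessMode: ReadWriteOnce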
2. You set an empty affinity variable for the injector but not for the server. Set
affinity: ""
for the server too. Look at: https://github.com/hashicorp/vault-helm/blob/v0.9.0/values.yaml#L338
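That is, nested under the server block, the same way it appears in the full values further down:

server:
  affinity: ""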
3. An uninitialised and sealed Vault cluster is not really usable. You need to initialize and unseal Vault before it becomes ready, so the readiness probe path has to treat the sealed and uninitialized states as healthy (the sealedcode/uninitcode parameters below make those return 204 instead of an error). Something like this:
server:
readinessProbe:
path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
4. The last one is somewhat optional. Those memory requests and limits:
resources:
requests:
memory: 4Gi
cpu: 1000m
limits:
memory: 8Gi
cpu: 1000m
are a bit on the higher side. Setting up an HA cluster of 3 replicas, each requesting 4Gi of memory, can easily result in Insufficient memory scheduling errors, and that is most likely to happen when deploying on a local cluster.
But then again, your local machine might have 32 gigs of memory, I wouldn't know ;) If it doesn't, trim those down to fit on your machine.
So the following values work for me:
# Vault Helm Chart Value Overrides
global:
enabled: true
tlsDisable: true
injector:
enabled: true
# Use the Vault K8s Image https://github.com/hashicorp/vault-k8s/
image:
repository: "hashicorp/vault-k8s"
tag: "0.9.0"
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 256Mi
cpu: 250m
affinity: ""
server:
auditStorage:
enabled: false
standalone:
enabled: false
image:
repository: "hashicorp/vault"
tag: "1.6.3"
resources:
requests:
memory: 256Mi
cpu: 200m
limits:
memory: 512Mi
cpu: 400m
affinity: ""
readinessProbe:
enabled: true
path: "/v1/sys/health?standbyok=true&sealedcode=204&uninitcode=204"
ha:
enabled: true
replicas: 3
raft:
enabled: true
setNodeId: true
config: |
ui = true
listener "tcp" {
tls_disable = true
address = "[::]:8200"
cluster_address = "[::]:8201"
}
storage "raft" {
path = "/vault/data"
}
service_registration "kubernetes" {}
config: |
ui = true
listener "tcp" {
tls_disable = true
address = "[::]:8200"
cluster_address = "[::]:8201"
}
service_registration "kubernetes" {}
# Vault UI
ui:
enabled: true
serviceType: "ClusterIP"
externalPort: 8200
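And to apply them, something along these lines (release name and namespace are just examples):

helm repo add hashicorp https://helm.releases.hashicorp.com
helm upgrade --install vault hashicorp/vault --version 0.9.0 -f values.yaml -n vault --create-namespace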
I keep getting an error when deploying Vault into k8s (error output below).
How can I get more info about what is happening in the pod and container?
Here is my Helm values file:
global:
enabled: true
tlsDisable: false
extraEnvironmentVars:
VAULT_CACERT: /vault/userconfig/vault-tls/vault.ca
server:
extraVolumes:
- type: secret
name: vault-tls
extraSecretEnvironmentVars:
- envName: AWS_ACCESS_KEY_ID
secretName: eks-creds
secretKey: AWS_ACCESS_KEY_ID
- envName: AWS_SECRET_ACCESS_KEY
secretName: eks-creds
secretKey: AWS_SECRET_ACCESS_KEY
ha:
enabled: true
replicas: 3
raft:
enabled: true
setNodeId: false
config: |
ui = true
serviceType: "LoadBalancer"
serviceNodePort: null
externalPort: 8200
listener "tcp" {
address = "0.0.0.0:8200"
cluster_address = "0.0.0.0:8201"
tls_cert_file = "/vault/userconfig/vault-tls/vault.crt"
tls_key_file = "/vault/userconfig/vault-tls/vault.key"
tls_client_ca_file = "/vault/userconfig/vault-tls/vault.ca"
}
storage "raft" {
path = "/vault/data"
}
seal "awskms" {
region = "us-east-1"
kms_key_id = "xxxxxxxxxxxx"
}
service_registration "kubernetes" {}
Running:
kubectl -n vault-perso logs -p vault-0
I'm getting:
error loading configuration from /tmp/storageconfig.hcl: At 3:12: illegal char
$ kubectl describe pod vault-0 -n vault-xxx
Name: vault-0
Namespace: vault-xxx
Priority: 0
Node: ip-10-xxx-0-xxx.ec2.internal/10.xxx.0.98
Start Time: Mon, 01 Feb 2021 16:48:47 +0200
Labels: app.kubernetes.io/instance=vault
app.kubernetes.io/name=vault
component=server
controller-revision-hash=vault-785bc949ff
helm.sh/chart=vault-0.9.0
statefulset.kubernetes.io/pod-name=vault-0
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 1.1.1.1
IPs:
IP: 1.1.1.1
Controlled By: StatefulSet/vault
Containers:
vault:
Container ID: docker://57ef1439640967f6824031xxxxfa6b64cb95efae72
Image: vault:1.6.1
Image ID: docker-pullable://vault@sha256:efe6036315xxxx2643666a4aab1ad4
Ports: 8200/TCP, 8201/TCP, 8202/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
/bin/sh
-ec
Args:
cp /vault/config/extraconfig-from-values.hcl /tmp/storageconfig.hcl;
[ -n "${HOST_IP}" ] && sed -Ei "s|HOST_IP|${HOST_IP?}|g" /tmp/storageconfig.hcl;
[ -n "${POD_IP}" ] && sed -Ei "s|POD_IP|${POD_IP?}|g" /tmp/storageconfig.hcl;
[ -n "${HOSTNAME}" ] && sed -Ei "s|HOSTNAME|${HOSTNAME?}|g" /tmp/storageconfig.hcl;
[ -n "${API_ADDR}" ] && sed -Ei "s|API_ADDR|${API_ADDR?}|g" /tmp/storageconfig.hcl;
[ -n "${TRANSIT_ADDR}" ] && sed -Ei "s|TRANSIT_ADDR|${TRANSIT_ADDR?}|g" /tmp/storageconfig.hcl;
[ -n "${RAFT_ADDR}" ] && sed -Ei "s|RAFT_ADDR|${RAFT_ADDR?}|g" /tmp/storageconfig.hcl;
/usr/local/bin/docker-entrypoint.sh vault server -config=/tmp/storageconfig.hcl
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 01 Feb 2021 16:54:46 +0200
Finished: Mon, 01 Feb 2021 16:54:46 +0200
Ready: False
Restart Count: 6
Readiness: exec [/bin/sh -ec vault status -tls-skip-verify] delay=5s timeout=3s period=5s #success=1 #failure=2
Environment:
HOST_IP: (v1:status.hostIP)
POD_IP: (v1:status.podIP)
VAULT_K8S_POD_NAME: vault-0 (v1:metadata.name)
VAULT_K8S_NAMESPACE: vault-xxx (v1:metadata.namespace)
VAULT_ADDR: https://127.0.0.1:8200
VAULT_API_ADDR: https://$(POD_IP):8200
SKIP_CHOWN: true
SKIP_SETCAP: true
HOSTNAME: vault-0 (v1:metadata.name)
VAULT_CLUSTER_ADDR: https://$(HOSTNAME).vault-internal:8201
HOME: /home/vault
AWS_ACCESS_KEY_ID: <set to the key 'AWS_ACCESS_KEY_ID' in secret 'eks-creds'> Optional: false
AWS_SECRET_ACCESS_KEY: <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'eks-creds'> Optional: false
Mounts:
/home/vault from home (rw)
/var/run/secrets/kubernetes.io/serviceaccount from vault-token-xls5s (ro)
/vault/config from config (rw)
/vault/data from data (rw)
/vault/userconfig/vault-tls from userconfig-vault-tls (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-vault-0
ReadOnly: false
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: vault-config
Optional: false
userconfig-vault-tls:
Type: Secret (a volume populated by a Secret)
SecretName: vault-tls
Optional: false
home:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
vault-token-xls5s:
Type: Secret (a volume populated by a Secret)
SecretName: vault-token-xls5s
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m9s default-scheduler Successfully assigned vault-xxx/vault-0 to ip-10-101-0-98.ec2.internal
Normal SuccessfulAttachVolume 8m7s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-626895easssscec00cb845"
Normal Pulled 6m23s (x5 over 8m4s) kubelet Container image "vault:1.6.1" already present on machine
Normal Created 6m23s (x5 over 8m4s) kubelet Created container vault
Normal Started 6m23s (x5 over 8m4s) kubelet Started container vault
Warning BackOff 3m3s (x26 over 8m2s) kubelet Back-off restarting failed container
Your config is wrong. You have the following:
config: |
ui = true
serviceType: "LoadBalancer"
serviceNodePort: null
externalPort: 8200
listener "tcp" {
The serviceType, serviceNodePort and externalPort lines look like they were copy/pasted from somewhere else. They are Helm chart values, not Vault HCL, so inside the config block they make the rendered /tmp/storageconfig.hcl fail to parse, which is where the "illegal char" error comes from.
See the Vault Helm docs; right at the end they show a config snippet with ui = true followed directly by the listener "tcp" block.
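A sketch of how that part of the values could be laid out instead (the UI service settings move under the chart's ui block, and the config block contains only HCL):

server:
  ha:
    enabled: true
    replicas: 3
    raft:
      enabled: true
      setNodeId: false
      config: |
        ui = true
        listener "tcp" {
          address = "0.0.0.0:8200"
          cluster_address = "0.0.0.0:8201"
          tls_cert_file = "/vault/userconfig/vault-tls/vault.crt"
          tls_key_file = "/vault/userconfig/vault-tls/vault.key"
          tls_client_ca_file = "/vault/userconfig/vault-tls/vault.ca"
        }
        storage "raft" {
          path = "/vault/data"
        }
        seal "awskms" {
          region = "us-east-1"
          kms_key_id = "xxxxxxxxxxxx"
        }
        service_registration "kubernetes" {}
ui:
  enabled: true
  serviceType: "LoadBalancer"
  serviceNodePort: null
  externalPort: 8200

To see exactly what HCL the pod is loading, you can also dump the rendered ConfigMap:

kubectl -n vault-perso get configmap vault-config -o yaml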
I have Vault deployed from the official Helm chart and it's running in HA mode with auto-unseal, TLS enabled and Raft as the backend, on a 1.17 EKS cluster. I have all of the Raft followers joined to the vault-0 pod as the leader. I followed this tutorial to a T and I always end up with a TLS bad certificate error. The exact error is: http: TLS handshake error from 123.45.6.789:52936: remote error: tls: bad certificate
I did find one issue with following the tutorial exactly: the part where they pipe the Kubernetes CA to base64. For me the output was multi-line and the deploy failed, so I piped that output through tr -d '\n'. But this is where I get the error. I've tried the part where you launch a container and test it with curl, and it fails; tailing the agent injector logs, I get that bad cert error.
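Concretely, the pipeline I mean looks like this (the CA file name matches what's referenced in the values below):

cat vault-injector.ca | base64 | tr -d '\n'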
Here is my values.yaml if it helps.
global:
tlsDisable: false
injector:
metrics:
enabled: true
certs:
secretName: vault-tls
caBundle: "(output of cat vault-injector.ca | base64 | tr -d '\n')"
certName: vault.crt
keyName: vault.key
server:
extraEnvironmentVars:
VAULT_CACERT: "/vault/userconfig/vault-tls/vault.ca"
extraSecretEnvironmentVars:
- envName: AWS_ACCESS_KEY_ID
secretName: eks-creds
secretKey: AWS_ACCESS_KEY_ID
- envName: AWS_SECRET_ACCESS_KEY
secretName: eks-creds
secretKey: AWS_SECRET_ACCESS_KEY
- envName: VAULT_UNSEAL_KMS_KEY_ID
secretName: vault-kms-id
secretKey: VAULT_UNSEAL_KMS_KEY_ID
extraVolumes:
- type: secret
name: vault-tls
- type: secret
name: eks-creds
- type: secret
name: vault-kms-id
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 512Mi
cpu: 500m
auditStorage:
enabled: true
storageClass: gp2
standalone:
enabled: false
ha:
enabled: true
raft:
enabled: true
config: |
ui = true
api_addr = "[::]:8200"
cluster_addr = "[::]:8201"
listener "tcp" {
tls_disable = 0
tls_cert_file = "/vault/userconfig/vault-tls/vault.crt"
tls_key_file = "/vault/userconfig/vault-tls/vault.key"
tls_client_ca_file = "/vault/userconfig/vault-tls/vault.ca"
tls_min_version = "tls12"
address = "[::]:8200"
cluster_address = "[::]:8201"
}
storage "raft" {
path = "/vault/data"
}
disable_mlock = true
service_registration "kubernetes" {}
seal "awskms" {
region = "us-east-1"
kms_key_id = "VAULT_UNSEAL_KMS_KEY_ID"
}
ui:
enabled: true
I've exec'd into the agent-injector and poked around. I can see the certs in /etc/webhook/certs/ are there and they look correct.
Here is my vault-agent-injector pod:
kubectl describe pod vault-agent-injector-6bbf84484c-q8flv
Name: vault-agent-injector-6bbf84484c-q8flv
Namespace: default
Priority: 0
Node: ip-172-16-3-151.ec2.internal/172.16.3.151
Start Time: Sat, 19 Dec 2020 16:27:14 -0800
Labels: app.kubernetes.io/instance=vault
app.kubernetes.io/name=vault-agent-injector
component=webhook
pod-template-hash=6bbf84484c
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 172.16.3.154
IPs:
IP: 172.16.3.154
Controlled By: ReplicaSet/vault-agent-injector-6bbf84484c
Containers:
sidecar-injector:
Container ID: docker://2201b12c9bd72b6b85d855de6917548c9410e2b982fb5651a0acd8472c3554fa
Image: hashicorp/vault-k8s:0.6.0
Image ID: docker-pullable://hashicorp/vault-k8s@sha256:5697b85bc69aa07b593fb2a8a0cd38daefb5c3e4a4b98c139acffc9cfe5041c7
Port: <none>
Host Port: <none>
Args:
agent-inject
2>&1
State: Running
Started: Sat, 19 Dec 2020 16:27:15 -0800
Ready: True
Restart Count: 0
Liveness: http-get https://:8080/health/ready delay=1s timeout=5s period=2s #success=1 #failure=2
Readiness: http-get https://:8080/health/ready delay=2s timeout=5s period=2s #success=1 #failure=2
Environment:
AGENT_INJECT_LISTEN: :8080
AGENT_INJECT_LOG_LEVEL: info
AGENT_INJECT_VAULT_ADDR: https://vault.default.svc:8200
AGENT_INJECT_VAULT_AUTH_PATH: auth/kubernetes
AGENT_INJECT_VAULT_IMAGE: vault:1.5.4
AGENT_INJECT_TLS_CERT_FILE: /etc/webhook/certs/vault.crt
AGENT_INJECT_TLS_KEY_FILE: /etc/webhook/certs/vault.key
AGENT_INJECT_LOG_FORMAT: standard
AGENT_INJECT_REVOKE_ON_SHUTDOWN: false
AGENT_INJECT_TELEMETRY_PATH: /metrics
Mounts:
/etc/webhook/certs from webhook-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from vault-agent-injector-token-k8ltm (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
webhook-certs:
Type: Secret (a volume populated by a Secret)
SecretName: vault-tls
Optional: false
vault-agent-injector-token-k8ltm:
Type: Secret (a volume populated by a Secret)
SecretName: vault-agent-injector-token-k8ltm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 40m default-scheduler Successfully assigned default/vault-agent-injector-6bbf84484c-q8flv to ip-172-16-3-151.ec2.internal
Normal Pulled 40m kubelet, ip-172-16-3-151.ec2.internal Container image "hashicorp/vault-k8s:0.6.0" already present on machine
Normal Created 40m kubelet, ip-172-16-3-151.ec2.internal Created container sidecar-injector
Normal Started 40m kubelet, ip-172-16-3-151.ec2.internal Started container sidecar-injector
My vault deployment
kubectl describe deployment vault
Name: vault-agent-injector
Namespace: default
CreationTimestamp: Sat, 19 Dec 2020 16:27:14 -0800
Labels: app.kubernetes.io/instance=vault
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=vault-agent-injector
component=webhook
Annotations: deployment.kubernetes.io/revision: 1
Selector: app.kubernetes.io/instance=vault,app.kubernetes.io/name=vault-agent-injector,component=webhook
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app.kubernetes.io/instance=vault
app.kubernetes.io/name=vault-agent-injector
component=webhook
Service Account: vault-agent-injector
Containers:
sidecar-injector:
Image: hashicorp/vault-k8s:0.6.0
Port: <none>
Host Port: <none>
Args:
agent-inject
2>&1
Liveness: http-get https://:8080/health/ready delay=1s timeout=5s period=2s #success=1 #failure=2
Readiness: http-get https://:8080/health/ready delay=2s timeout=5s period=2s #success=1 #failure=2
Environment:
AGENT_INJECT_LISTEN: :8080
AGENT_INJECT_LOG_LEVEL: info
AGENT_INJECT_VAULT_ADDR: https://vault.default.svc:8200
AGENT_INJECT_VAULT_AUTH_PATH: auth/kubernetes
AGENT_INJECT_VAULT_IMAGE: vault:1.5.4
AGENT_INJECT_TLS_CERT_FILE: /etc/webhook/certs/vault.crt
AGENT_INJECT_TLS_KEY_FILE: /etc/webhook/certs/vault.key
AGENT_INJECT_LOG_FORMAT: standard
AGENT_INJECT_REVOKE_ON_SHUTDOWN: false
AGENT_INJECT_TELEMETRY_PATH: /metrics
Mounts:
/etc/webhook/certs from webhook-certs (ro)
Volumes:
webhook-certs:
Type: Secret (a volume populated by a Secret)
SecretName: vault-tls
Optional: false
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: vault-agent-injector-6bbf84484c (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 46m deployment-controller Scaled up replica set vault-agent-injector-6bbf84484c to 1
What else can I check and verify or troubleshoot in order to figure out why the agent injector is causing this error?
Currently having a problem where the readiness probe is failing when deploying the Vault Helm chart. Vault is working, but whenever I describe the pods I get this error. How do I get the probe to use HTTPS instead of HTTP? If anyone knows how to solve this that would be great, as I'm slowly losing my mind.
Kubectl Describe pod
Name: vault-0
Namespace: default
Priority: 0
Node: ip-192-168-221-250.eu-west-2.compute.internal/192.168.221.250
Start Time: Mon, 24 Aug 2020 16:41:59 +0100
Labels: app.kubernetes.io/instance=vault
app.kubernetes.io/name=vault
component=server
controller-revision-hash=vault-768cd675b9
helm.sh/chart=vault-0.6.0
statefulset.kubernetes.io/pod-name=vault-0
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.221.251
IPs:
IP: 192.168.221.251
Controlled By: StatefulSet/vault
Containers:
vault:
Container ID: docker://445d7cdc34cd01ef1d3a46f2d235cb20a94e48279db3fcdd84014d607af2fe1c
Image: vault:1.4.2
Image ID: docker-pullable://vault@sha256:12587718b79dc5aff542c410d0bcb97e7fa08a6b4a8d142c74464a9df0c76d4f
Ports: 8200/TCP, 8201/TCP, 8202/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Command:
/bin/sh
-ec
Args:
sed -E "s/HOST_IP/${HOST_IP?}/g" /vault/config/extraconfig-from-values.hcl > /tmp/storageconfig.hcl;
sed -Ei "s/POD_IP/${POD_IP?}/g" /tmp/storageconfig.hcl;
/usr/local/bin/docker-entrypoint.sh vault server -config=/tmp/storageconfig.hcl
State: Running
Started: Mon, 24 Aug 2020 16:42:00 +0100
Ready: False
Restart Count: 0
Readiness: exec [/bin/sh -ec vault status -tls-skip-verify] delay=5s timeout=5s period=3s #success=1 #failure=2
Environment:
HOST_IP: (v1:status.hostIP)
POD_IP: (v1:status.podIP)
VAULT_K8S_POD_NAME: vault-0 (v1:metadata.name)
VAULT_K8S_NAMESPACE: default (v1:metadata.namespace)
VAULT_ADDR: http://127.0.0.1:8200
VAULT_API_ADDR: http://$(POD_IP):8200
SKIP_CHOWN: true
SKIP_SETCAP: true
HOSTNAME: vault-0 (v1:metadata.name)
VAULT_CLUSTER_ADDR: https://$(HOSTNAME).vault-internal:8201
HOME: /home/vault
VAULT_CACERT: /vault/userconfig/vault-server-tls/vault.ca
Mounts:
/home/vault from home (rw)
/var/run/secrets/kubernetes.io/serviceaccount from vault-token-cv9vx (ro)
/vault/config from config (rw)
/vault/userconfig/vault-server-tls from userconfig-vault-server-tls (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: vault-config
Optional: false
userconfig-vault-server-tls:
Type: Secret (a volume populated by a Secret)
SecretName: vault-server-tls
Optional: false
home:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
vault-token-cv9vx:
Type: Secret (a volume populated by a Secret)
SecretName: vault-token-cv9vx
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7s default-scheduler Successfully assigned default/vault-0 to ip-192-168-221-250.eu-west-2.compute.internal
Normal Pulled 6s kubelet, ip-192-168-221-250.eu-west-2.compute.internal Container image "vault:1.4.2" already present on machine
Normal Created 6s kubelet, ip-192-168-221-250.eu-west-2.compute.internal Created container vault
Normal Started 6s kubelet, ip-192-168-221-250.eu-west-2.compute.internal Started container vault
Warning Unhealthy 0s kubelet, ip-192-168-221-250.eu-west-2.compute.internal Readiness probe failed: Error checking seal status: Error making API request.
URL: GET http://127.0.0.1:8200/v1/sys/seal-status
Code: 400. Raw Message:
Client sent an HTTP request to an HTTPS server.
My Vault Helm values file:
# global:
# tlsDisable: false
injector:
enabled: false
server:
extraEnvironmentVars:
VAULT_CACERT: /vault/userconfig/vault-server-tls/vault.ca
extraVolumes:
- type: secret
name: vault-server-tls # Matches the ${SECRET_NAME} from above
affinity: ""
readinessProbe:
enabled: true
path: /v1/sys/health
# # livelinessProbe:
# # enabled: true
# # path: /v1/sys/health?standbyok=true
# # initialDelaySeconds: 60
ha:
enabled: true
config: |
ui = true
api_addr = "https://127.0.0.1:8200" # Unsure if this is correct
storage "dynamodb" {
ha_enabled = "true"
region = "eu-west-2"
table = "global-vault-data"
access_key = "KEY"
secret_key = "SECRET"
}
# listener "tcp" {
# address = "0.0.0.0:8200"
# tls_disable = "true"
# }
listener "tcp" {
address = "0.0.0.0:8200"
cluster_address = "0.0.0.0:8201"
tls_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
tls_key_file = "/vault/userconfig/vault-server-tls/vault.key"
tls_client_ca_file = "/vault/userconfig/vault-server-tls/vault.ca"
}
seal "awskms" {
region = "eu-west-2"
access_key = "KEY"
secret_key = "SECRET"
kms_key_id = "ID"
}
ui:
enabled: true
serviceType: LoadBalancer
In your environment variable definitions you have:
VAULT_ADDR: http://127.0.0.1:8200
And non-TLS is disabled in your Vault config (TLS is enabled):
listener "tcp" {
address = "0.0.0.0:8200"
cluster_address = "0.0.0.0:8201"
tls_cert_file = "/vault/userconfig/vault-server-tls/vault.crt"
tls_key_file = "/vault/userconfig/vault-server-tls/vault.key"
tls_client_ca_file = "/vault/userconfig/vault-server-tls/vault.ca"
}
And your readiness probe is executing this in the pod:
vault status -tls-skip-verify
So the probe is talking to http://127.0.0.1:8200 while the listener only speaks HTTPS. You can try changing the environment variable to use HTTPS: VAULT_ADDR=https://127.0.0.1:8200
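As a quick sanity check before changing the chart values, you could run the same command the probe uses, but with an HTTPS address (pod name as in your output):

kubectl exec -it vault-0 -- sh -c 'VAULT_ADDR=https://127.0.0.1:8200 vault status -tls-skip-verify'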
You may have another (different) issue with your configs and env variable not matching:
K8s manifest:
VAULT_API_ADDR: http://$(POD_IP):8200
Vault configs:
api_addr = "https://127.0.0.1:8200"
✌️
If you are on a Mac, add the Vault URL to your .zshrc or .bash_profile file.
On the terminal, open either the .zshrc or .bash_profile file:
$ open .zshrc
Copy and paste this into it: export VAULT_ADDR='http://127.0.0.1:8200'
Save the file, then reload it in the terminal:
$ source .zshrc
You can also set the tlsDisable to false in the global settings like this:
global:
tlsDisable: false
As the documentation for the helm chart says here:
The http/https scheme is controlled by the tlsDisable value.
I deployed RKE in an air-gapped environment with the below specification:
Nodes:
3 controller with etcd
2 workers
RKE version:
v1.0.0
Docker version:
Client:
Debug Mode: false
Server:
Containers: 24
Running: 7
Paused: 0
Stopped: 17
Images: 4
Server Version: 19.03.1-ol
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: **************
runc version: ******
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.35-1902.8.4.el7uek.x86_64
Operating System: Oracle Linux Server 7.7
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.409GiB
Name: rke01.kuberlocal.co
ID:*******************************
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
registry.console:5000
127.0.0.0/8
Live Restore Enabled: false
Registries:
Operating system and kernel: (Oracle linux 7)
Red Hat Enterprise Linux Server release 7.7
4.14.35-1902.8.4.el7uek.x86_64
Type/provider of hosts: VirtualBox (test environment)
cluster.yml file:
# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
- address: rke01
  port: "22"
  internal_address: 192.168.40.11
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke02
  port: "22"
  internal_address: 192.168.40.17
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke03
  port: "22"
  internal_address: 192.168.40.13
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke04
  port: "22"
  internal_address: 192.168.40.14
  role:
  - worker
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: rke05
  port: "22"
  internal_address: 192.168.40.15
  role:
  - worker
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
etcd:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
external_urls: []
ca_cert: ""
cert: ""
key: ""
path: ""
uid: 0
gid: 0
snapshot: null
retention: ""
creation: ""
backup_config: null
kube-api:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
service_cluster_ip_range: 10.43.0.0/16
service_node_port_range: ""
pod_security_policy: false
always_pull_images: false
secrets_encryption_config: null
audit_log: null
admission_configuration: null
event_rate_limit: null
kube-controller:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
cluster_cidr: 10.42.0.0/16
service_cluster_ip_range: 10.43.0.0/16
scheduler:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
kubelet:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
cluster_domain: bmi.rke.cluster.local
infra_container_image: ""
cluster_dns_server: 10.43.0.10
fail_swap_on: false
generate_serving_certificate: false
kubeproxy:
image: ""
extra_args: {}
extra_binds: []
extra_env: []
network:
plugin: weave
weave_network_provider:
password: "********"
options: {}
node_selector: {}
authentication:
strategy: x509
sans: []
webhook: null
addons: ""
addons_include: []
system_images:
etcd: registry.console:5000/rancher/coreos-etcd:v3.3.15-rancher1
alpine: registry.console:5000/rancher/rke-tools:v0.1.51
nginx_proxy: registry.console:5000/rancher/rke-tools:v0.1.51
cert_downloader: registry.console:5000/rancher/rke-tools:v0.1.51
kubernetes_services_sidecar: registry.console:5000/rancher/rke-tools:v0.1.51
kubedns: registry.console:5000/rancher/k8s-dns-kube-dns:1.15.0
dnsmasq: registry.console:5000/rancher/k8s-dns-dnsmasq-nanny:1.15.0
kubedns_sidecar: registry.console:5000/rancher/k8s-dns-sidecar:1.15.0
kubedns_autoscaler: registry.console:5000/rancher/cluster-proportional-autoscaler:1.7.1
coredns: registry.console:5000/rancher/coredns-coredns:1.6.2
coredns_autoscaler: registry.console:5000/rancher/cluster-proportional-autoscaler:1.7.1
kubernetes: registry.console:5000/rancher/hyperkube:v1.16.3-rancher1
flannel: registry.console:5000/rancher/coreos-flannel:v0.11.0-rancher1
flannel_cni: registry.console:5000/rancher/flannel-cni:v0.3.0-rancher5
calico_node: registry.console:5000/rancher/calico-node:v3.8.1
calico_cni: registry.console:5000/rancher/calico-cni:v3.8.1
calico_controllers: registry.console:5000/rancher/calico-kube-controllers:v3.8.1
calico_ctl: ""
calico_flexvol: registry.console:5000/rancher/calico-pod2daemon-flexvol:v3.8.1
canal_node: registry.console:5000/rancher/calico-node:v3.8.1
canal_cni: registry.console:5000/rancher/calico-cni:v3.8.1
canal_flannel: registry.console:5000/rancher/coreos-flannel:v0.11.0
canal_flexvol: registry.console:5000/rancher/calico-pod2daemon-flexvol:v3.8.1
weave_node: registry.console:5000/weaveworks/weave-kube:2.5.2
weave_cni: registry.console:5000/weaveworks/weave-npc:2.5.2
pod_infra_container: registry.console:5000/rancher/pause:3.1
ingress: registry.console:5000/rancher/nginx-ingress-controller:nginx-0.25.1-rancher1
ingress_backend: registry.console:5000/rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
metrics_server: registry.console:5000/rancher/metrics-server:v0.3.4
windows_pod_infra_container: rancher/kubelet-pause:v0.1.3
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
mode: rbac
options: {}
#ignore_docker_version: false
ignore_docker_version: true
kubernetes_version: ""
private_registries:
- url: registry.console:5000
  user: registry_user
  password: ***********
  is_default: true
ingress:
provider: ""
options: {}
node_selector: {}
extra_args: {}
dns_policy: ""
extra_envs: []
extra_volumes: []
extra_volume_mounts: []
cluster_name: ""
cloud_provider:
name: ""
prefix_path: "/opt/rke/"
addon_job_timeout: 30
bastion_host:
address: ""
port: ""
user: ""
ssh_key: ""
ssh_key_path: ""
ssh_cert: ""
ssh_cert_path: ""
monitoring:
provider: ""
options: {}
node_selector: {}
restore:
restore: false
snapshot_name: ""
dns:
provider: coredns
Steps to Reproduce:
rke -d up --config cluster.yml
Results:
INFO[0129] [sync] Successfully synced nodes Labels and Taints
DEBU[0129] Host: rke01 has role: controlplane
DEBU[0129] Host: rke01 has role: etcd
DEBU[0129] Host: rke03 has role: controlplane
DEBU[0129] Host: rke03 has role: etcd
DEBU[0129] Host: rke04 has role: worker
DEBU[0129] Host: rke05 has role: worker
INFO[0129] [network] Setting up network plugin: weave
INFO[0129] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0129] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0129] [addons] Executing deploy job rke-network-plugin
DEBU[0129] [k8s] waiting for job rke-network-plugin-deploy-job to complete..
FATA[0159] Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system rke-network-plugin-deploy-job-4jgcq 0/1 Error 0 4m6s
kube-system rke-network-plugin-deploy-job-57jr8 0/1 Error 0 3m50s
kube-system rke-network-plugin-deploy-job-h2gr8 0/1 Error 0 90s
kube-system rke-network-plugin-deploy-job-p92br 0/1 Error 0 2m50s
kube-system rke-network-plugin-deploy-job-xrgpl 0/1 Error 0 4m1s
kube-system rke-network-plugin-deploy-job-zqhmk 0/1 Error 0 3m30s
kubectl describe pod rke-network-plugin-deploy-job-zqhmk -n kube-system
Name: rke-network-plugin-deploy-job-zqhmk
Namespace: kube-system
Priority: 0
Node: rke01/192.168.40.11
Start Time: Sun, 12 Jan 2020 09:40:00 +0330
Labels: controller-uid=*******************
job-name=rke-network-plugin-deploy-job
Annotations:
Status: Failed
IP: 192.168.40.11
IPs:
IP: 192.168.40.11
Controlled By: Job/rke-network-plugin-deploy-job
Containers:
rke-network-plugin-pod:
Container ID: docker://7658aecff174e4ac53caaf088782dab50654911065371cd0d8dcdd50b8fbef3b
Image: registry.console:5000/rancher/hyperkube:v1.16.3-rancher1
Image ID: docker-pullable://registry.console:5000/rancher/hyperkube@sha256:0a55590eb8453bcc46a4bdb8217a48cf56a7c7f7c52d72a267632ffa35b3b8c8
Port:
Host Port:
Command:
kubectl
apply
-f
/etc/config/rke-network-plugin.yaml
State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 12 Jan 2020 09:40:00 +0330
Finished: Sun, 12 Jan 2020 09:40:01 +0330
Ready: False
Restart Count: 0
Environment:
Mounts:
/etc/config from config-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from rke-job-deployer-token-9dt6n (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: rke-network-plugin
Optional: false
rke-job-deployer-token-9dt6n:
Type: Secret (a volume populated by a Secret)
SecretName: rke-job-deployer-token-9dt6n
Optional: false
QoS Class: BestEffort
Node-Selectors:
Tolerations:
Events:
Type Reason Age From Message
Normal Pulled 4m10s kubelet, rke01 Container image "registry.console:5000/rancher/hyperkube:v1.16.3-rancher1" already present on machine
Normal Created 4m10s kubelet, rke01 Created container rke-network-plugin-pod
Normal Started 4m10s kubelet, rke01 Started container rke-network-plugin-pod
container logs:
docker logs -f 267a894bb999
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
unable to recognize "/etc/config/rke-network-plugin.yaml": Get https://10.43.0.1:443/api?timeout=32s: dial tcp 10.43.0.1:443: connect: network is unreachable
network interfaces
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether *********** brd ff:ff:ff:ff:ff:ff
inet 192.168.40.11/24 brd 192.168.40.255 scope global dynamic enp0s8
valid_lft 847sec preferred_lft 847sec
inet6 ************* scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether *************** brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 ************* scope link
valid_lft forever preferred_lft forever
docker network status
docker network ls
NETWORK ID NAME DRIVER SCOPE
c6063ba5a4d0 bridge bridge local
822441eae3cf host host local
314798c82599 none null local
Is the issue related to the network interfaces? If yes, how can I create the missing one?
That was resolved by the below command; I created a network interface:
docker network create --driver=bridge --subnet=10.43.0.0/16 br0_rke
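To confirm the network exists afterwards:

docker network ls
docker network inspect br0_rke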
I had the same issue, and these two steps solved my problem:
1. Increase addon_job_timeout (see the snippet below).
2. Check the nodes' free disk space (at least 15%).
In my case, one of the nodes was in a DiskPressure state.
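A sketch of the first change in cluster.yml (the timeout value here is just an example; it's the same addon_job_timeout field already shown in the question's config):

addon_job_timeout: 120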
I was creating Ubuntu VMs on my local machine and had this problem. I got it working by increasing the disk and memory capacity at VM creation: multipass launch --name node1 -m 2G -d 8G.