Unknown host when looking up a pod by name, resolved by pod restart - Kubernetes

I have an installer that spins up two pods in my CI flow; let's call them web and activemq. When the web pod starts, it tries to communicate with the activemq pod using the Kubernetes-assigned pod name amq-deployment-0.activemq.
Randomly, the web pod gets an unknown host exception when trying to access amq-deployment-0.activemq. If I restart the web pod in this situation, it has no problem communicating with the activemq pod.
I've logged into the web pod when this happens, and its /etc/resolv.conf and /etc/hosts files look fine. The host machine's /etc/resolv.conf and /etc/hosts are sparse, with nothing that looks questionable.
Information:
There is only 1 worker node.
kubectl --version
Kubernetes v1.8.3+icp+ee
Any ideas on how to go about debugging this issue? I can't think of a good reason for it to happen randomly, nor for it to resolve itself on a pod restart.
If other useful information is needed, I can get it. Thanks in advance.
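As a starting point for debugging (an editor's sketch, not from the original post): when the error occurs, compare what the pod's own resolver returns for the service and pod names against the state of the cluster DNS. This assumes nslookup or getent is available in the web image and that the cluster DNS pods carry the usual k8s-app=kube-dns label.

# Inside the failing web pod, while the "unknown host" error is happening:
nslookup activemq                      # headless service name
nslookup amq-deployment-0.activemq     # StatefulSet pod DNS entry
getent hosts amq-deployment-0.activemq # same NSS lookup path most apps use
cat /etc/resolv.conf                   # confirm the cluster DNS server is listed

# From a machine with kubectl access: check the cluster DNS pods for restarts
# or errors around the time of the failure, and confirm the headless service
# has endpoints.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl describe svc activemq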
For ActiveMQ we do have this Service file:
apiVersion: v1
kind: Service
metadata:
  name: activemq
  labels:
    app: myapp
    env: dev
spec:
  ports:
    - port: 8161
      protocol: TCP
      targetPort: 8161
      name: http
    - port: 61616
      protocol: TCP
      targetPort: 61616
      name: amq
  selector:
    component: analytics-amq
    app: myapp
    environment: dev
    type: fa-core
  clusterIP: None
And this ActiveMQ StatefulSet (this is the template):
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: pa-amq-deployment
spec:
  replicas: {{ activemqs }}
  updateStrategy:
    type: RollingUpdate
  serviceName: "activemq"
  template:
    metadata:
      labels:
        component: analytics-amq
        app: myapp
        environment: dev
        type: fa-core
    spec:
      containers:
        - name: pa-amq
          image: default/myco/activemq:latest
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 150m
              memory: 1Gi
          livenessProbe:
            exec:
              command:
                - /etc/init.d/activemq
                - status
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 16
          ports:
            - containerPort: 8161
              protocol: TCP
              name: http
            - containerPort: 61616
              protocol: TCP
              name: amq
          envFrom:
            - configMapRef:
                name: pa-activemq-conf-all
            - secretRef:
                name: pa-activemq-secret
          volumeMounts:
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/UTC
The web StatefulSet:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: pa-web-deployment
spec:
  replicas: 1
  updateStrategy:
    type: RollingUpdate
  serviceName: "pa-web"
  template:
    metadata:
      labels:
        component: analytics-web
        app: myapp
        environment: dev
        type: fa-core
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: component
                      operator: In
                      values:
                        - analytics-web
                topologyKey: kubernetes.io/hostname
      containers:
        - name: pa-web
          image: default/myco/web:latest
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 1
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /versions
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 76
          livenessProbe:
            httpGet:
              path: /versions
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 80
          securityContext:
            privileged: true
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          envFrom:
            - configMapRef:
                name: pa-web-conf-all
            - secretRef:
                name: pa-web-secret
          volumeMounts:
            - name: shared-volume
              mountPath: /MySharedPath
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - nfs:
            server: 10.100.10.23
            path: /MySharedPath
          name: shared-volume
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/UTC
The web pod also has a similar "unknown host" problem finding an external database we have configured, and it is likewise resolved by restarting the pod. Here is the configuration of that external Service; maybe it is easier to tackle the problem from this angle? ActiveMQ has no problem using the database service name to find the DB and start up.
apiVersion: v1
kind: Service
metadata:
  name: dbhost
  labels:
    app: myapp
    env: dev
spec:
  type: ExternalName
  externalName: mydb.host.com
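One way to narrow this down (an editor's sketch, not from the original post): an ExternalName Service is just a DNS CNAME to mydb.host.com, so both halves of the chain can be checked from inside the web pod, assuming nslookup is present in the image.

# Inside the web pod:
nslookup dbhost        # should come back as a CNAME to mydb.host.com
nslookup mydb.host.com # should return the database's actual address
getent hosts dbhost    # same resolver path the application itself uses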

Is it possible that it is a question of which pod (and the app in its container) starts up first and which second?
In any case, connecting via a Service rather than the pod name is recommended, since the name Kubernetes assigns to a pod changes between restarts.
A way to test connectivity is to use telnet (or curl, for the protocols it supports), if available in the image:
telnet <host/pod/Service> <port>
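For example, against the services from the question (assuming telnet and curl exist in the web image):

# From inside the web pod: the headless ActiveMQ Service and a specific StatefulSet pod
telnet activemq 61616
telnet amq-deployment-0.activemq 61616
# The ActiveMQ web console port can also be checked with curl
curl -v http://activemq:8161/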

Unable to find a solution, I created a workaround. I set up the entrypoint.sh in my image to look up the domain I need to access and write the result to the log, exiting on error:
#!/bin/bash
# disable command echo and exit-on-error
set +ex
#####################################
# verify that the db service can be found or exit the container
#####################################
# we do not want to install nslookup just to check that db_host_name is a valid name;
# we have ping available, though
# 0 - success, 1 - ping failed but the lookup worked (Services cannot be pinged), 2 - unknown/unreachable host
ping -W 2 -c 1 ${db_host_name} &> /dev/null
if [ $? -le 1 ]
then
  echo "service ${db_host_name} is known"
else
  echo "${db_host_name} service is NOT recognized. Exiting container..."
  exit 1
fi
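A small aside (editor's note, not part of the original workaround): if the image has getent, it gives a DNS-only check that avoids the "Services cannot be pinged" special case entirely. A minimal sketch:

# Exit the container if the db host name does not resolve.
# getent uses the same resolver libraries the application will use.
if getent hosts "${db_host_name}" > /dev/null
then
  echo "service ${db_host_name} is known"
else
  echo "${db_host_name} service is NOT recognized. Exiting container..."
  exit 1
fi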
Next, since only a pod restart fixes the issue, my Ansible deploy does a rollout check, querying the log to see whether I need to restart the pod. For example:
rollout-check.yml
- name: "Rollout status for {{rollout_item.statefulset}}"
shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
ignore_errors: yes
# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
register: log_line
# the entrypoint will write dbhost service is NOT recognized. Exiting container... to the log
# if there is a problem getting to the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
when: log_line.stdout.find('service is NOT recognized') > 0
I repeat this rollout check 6 times, as sometimes the service still cannot be found even after a pod restart. The additional checks are instant once the pod is successfully up.
- name: "Web rollout"
include_tasks: rollout-check.yml
loop:
- { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
loop_control:
loop_var: rollout_item

Related

Kubernetes pod never ready and service endpoint empty

I want to deploy an ASP.NET server with Kubernetes:
Deployment with my docker image (ASP.NET server)
Service to expose the pods
Nginx ingress controller
Ingress to access my pods from "outside"
For this issue, we can ignore all the Ingress YAML.
When I deploy my Deployment and my Service, I have a problem: my pod is never ready and my Service's endpoints field is empty.
NAME READY STATUS RESTARTS AGE
pod/server-deployment-bd4977bf5-n7gmx 0/1 Running 36 (41s ago) 147m
When I run "kubectl logs pod/server-deployment-bd4977bf5-n7gmx" there are no logs related to this issue
Microsoft.Hosting.Lifetime[14]
Now listening on: http://[::]:80
Microsoft.Hosting.Lifetime[14]
Now listening on: https://[::]:403
Microsoft.Hosting.Lifetime[0]
Application started. Press Ctrl+C to shut down.
Microsoft.Hosting.Lifetime[0]
Hosting environment: Production
Microsoft.Hosting.Lifetime[0]
Content root path: /app/
Microsoft.Hosting.Lifetime[0]
Application is shutting down...
When I run "kubectl describe service/server-svc" I see that the "Endpoints:" field is empty.
After some research on Stack Overflow and other sites, I didn't find any solution or explanation for my problem. From what I read, I know that a Service's endpoints field shouldn't be empty, and that my pods might have a problem with the readinessProbe.
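A few checks that usually confirm this line of reasoning (an editor's sketch; the pod and Service names are taken from the question):

# Endpoints stay empty until at least one pod matching the selector is Ready.
kubectl get endpoints server-svc
# The Events section shows readiness/liveness probe failures and restart reasons.
kubectl describe pod server-deployment-bd4977bf5-n7gmx
# With CrashLoopBackOff, the interesting output is often in the previous container's log.
kubectl logs server-deployment-bd4977bf5-n7gmx --previous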
Below are the .yaml files of my Deployment and Service.
Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server-app
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: server-app
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
        - name: server-container
          image: server:0.0.2
          imagePullPolicy: Always
          command: ["dotnet", "server.dll"]
          envFrom:
            - configMapRef:
                name: server-configmap
                optional: false
            - secretRef:
                name: server-secret
                optional: false
          ports:
            - name: http
              containerPort: 443
              hostPort: 443
          livenessProbe:
            httpGet:
              path: /api/health/live
              port: http
            initialDelaySeconds: 10
            periodSeconds: 20
            timeoutSeconds: 1
            failureThreshold: 6
            successThreshold: 1
          readinessProbe:
            httpGet:
              path: /api/health/ready
              port: http
            initialDelaySeconds: 10
            periodSeconds: 20
            timeoutSeconds: 1
            failureThreshold: 6
            successThreshold: 1
          volumeMounts:
            - name: server-pfx-volume
              mountPath: "/https"
              readOnly: true
      volumes:
        - name: server-pfx-volume
          secret:
            secretName: server-pfx
Service:
apiVersion: v1
kind: Service
metadata:
  name: server-svc
spec:
  type: ClusterIP
  selector:
    app: server-app
  ports:
    - name: http
      protocol: TCP
      port: 443
      targetPort: 443
When I run "kubectl get pods --show-labels" I got the pod with the correct label
NAME READY STATUS RESTARTS AGE LABELS
server-deployment-bd4977bf5-n7gmx 0/1 CrashLoopBackOff 38 (74s ago) 158m app=server-app,pod-template-hash=bd4977bf5
So I'm looking for help to figure out why my pod is never ready and why my Service's endpoints field is empty.

Why doesn't the pod's hostname resolve?

I have a problem with pods in Kubernetes.
I have an app pod (invoice) with an init container that checks whether the MySQL pod (invoice-mysql) is running.
invoice-mysql is running and ready, but the init container in the invoice pod does not see it.
Here are the logs of the init container:
DB is not yet reachable;sleep for 10s before retry
DB is not yet reachable;sleep for 10s before retry
nc: bad address 'invoice-mysql'
nc: bad address 'invoice-mysql'
DB is not yet reachable;sleep for 10s before retry
nc: bad address 'invoice-mysql'
DB is not yet reachable;sleep for 10s before retry
nc: bad address 'invoice-mysql'
Here is the YAML of invoice:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invoice
  namespace: jhipster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: invoice
      version: 'v1'
  template:
    metadata:
      labels:
        app: invoice
        version: 'v1'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - invoice
                topologyKey: kubernetes.io/hostname
              weight: 100
      initContainers:
        - name: init-ds
          image: busybox:latest
          command:
            - '/bin/sh'
            - '-c'
            - |
              while true
              do
                rt=$(nc -z -w 1 invoice-mysql 3306)
                if [ $? -eq 0 ]; then
                  echo "DB is UP"
                  break
                fi
                echo "DB is not yet reachable;sleep for 10s before retry"
                sleep 10
              done
      containers:
        - name: invoice-app
          image: docker.pkg.github.com/morsi84/kubernetes/invoice
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: prod
            - name: SPRING_CLOUD_CONFIG_URI
              value: http://admin:${jhipster.registry.password}#jhipster-registry.Jhipster.svc.cluster.local:8761/config
            - name: JHIPSTER_REGISTRY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: registry-secret
                  key: registry-admin-password
            - name: EUREKA_CLIENT_SERVICE_URL_DEFAULTZONE
              value: http://admin:${jhipster.registry.password}#jhipster-registry.Jhipster.svc.cluster.local:8761/eureka/
            - name: SPRING_DATASOURCE_URL
              value: jdbc:mysql://invoice-mysql.Jhipster.svc.cluster.local:3306/invoice?useUnicode=true&characterEncoding=utf8&useSSL=false&useLegacyDatetimeCode=false&serverTimezone=UTC&createDatabaseIfNotExist=true
            - name: SPRING_SLEUTH_PROPAGATION_KEYS
              value: 'x-request-id,x-ot-span-context'
            - name: JAVA_OPTS
              value: ' -Xmx256m -Xms256m'
          resources:
            requests:
              memory: '512Mi'
              cpu: '500m'
            limits:
              memory: '1Gi'
              cpu: '1'
          ports:
            - name: http
              containerPort: 8081
          readinessProbe:
            httpGet:
              path: /management/health
              port: http
            initialDelaySeconds: 20
            periodSeconds: 15
            failureThreshold: 6
          livenessProbe:
            httpGet:
              path: /management/health
              port: http
            initialDelaySeconds: 120
      imagePullSecrets:
        - name: regcred
Here is the YAML of invoice-mysql:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invoice-mysql
  namespace: jhipster
spec:
  replicas: 1
  selector:
    matchLabels:
      app: invoice-mysql
  template:
    metadata:
      labels:
        app: invoice-mysql
    spec:
      volumes:
        - name: data
          emptyDir: {}
      containers:
        - name: mysql
          image: mysql:8.0.20
          env:
            - name: MYSQL_USER
              value: root
            - name: MYSQL_ALLOW_EMPTY_PASSWORD
              value: 'yes'
            - name: MYSQL_DATABASE
              value: invoice
          args:
            - --lower_case_table_names=1
            - --skip-ssl
            - --character_set_server=utf8mb4
            - --explicit_defaults_for_timestamp
          ports:
            - containerPort: 3306
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql/
          resources:
            requests:
              memory: '512Mi'
              cpu: '500m'
            limits:
              memory: '1Gi'
              cpu: '1'
---
apiVersion: v1
kind: Service
metadata:
  name: invoice-mysql
  namespace: jhipster
spec:
  selector:
    app: invoice-mysql
  ports:
    - port: 3306
Here is a description of the invoice-mysql Service:
Name: invoice-mysql
Namespace: jhipster
Labels: <none>
Annotations: <none>
Selector: app=invoice-mysql
Type: ClusterIP
IP: 10.98.220.110
Port: <unset> 3306/TCP
TargetPort: 3306/TCP
Endpoints: 10.244.1.66:3306
Session Affinity: None
Events: <none>
(Screenshots of the working and the faulty environment omitted.)
Can you help me, please?
The problem was in fact due to the firewall on my CentOS hosts; I disabled the firewall and the names started resolving.
I think you should use a full service domain instead of invoice-mysql. Can you try using invoice-mysql.Jhipster.svc.cluster.local instead?
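A quick way to compare the two forms from a throwaway pod (an editor's sketch; it reuses the busybox image from the init container and the jhipster namespace as declared in the manifests):

# Start a one-off busybox pod in the jhipster namespace
kubectl run dns-test --rm -it --restart=Never --image=busybox:latest -n jhipster -- sh
# Then, inside that pod:
nslookup invoice-mysql                              # short name, relies on the pod's DNS search domains
nslookup invoice-mysql.jhipster.svc.cluster.local   # fully qualified service name
nc -z -w 1 invoice-mysql.jhipster.svc.cluster.local 3306 && echo "DB reachable"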

Is it possible to use a bash script to do the liveness test in a pod?

I'm currently setting up a Kubernetes cluster with 3 nodes on 3 different VMs, and each node runs 1 pod with the following Docker image: ethereum/client-go:stable.
The problem is that I want to do the health check using a bash script (because I have to test a lot of things), but I don't understand how I can get this file into each container deployed with my YAML deployment file.
I've tried adding a wget command in the YAML file to download my health check script from my GitHub repo, but that didn't seem very clean to me; maybe there is another way?
My current deployment file:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: goerli
  name: goerli-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: goerli
  template:
    metadata:
      labels:
        app: goerli
    spec:
      containers:
        - image: ethereum/client-go:stable
          name: goerli-geth
          args: ["--goerli", "--datadir", "/test2"]
          env:
            - name: LASTBLOCK
              value: "0"
            - name: FAILCOUNTER
              value: "0"
          ports:
            - containerPort: 30303
              name: geth
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - /test/health.sh
            initialDelaySeconds: 60
            periodSeconds: 100
          volumeMounts:
            - name: test
              mountPath: /test
      restartPolicy: Always
      volumes:
        - name: test
          hostPath:
            path: /test
I expect to put the health check script in /test/health.sh.
Any ideas?
This could be a perfect use case for an init container. Since the init container and the application container can use different images, they have different filesystems inside the pod, so we need to use an emptyDir volume to share state between them.
For further detail, follow the link: init-containers.
Thanks to Suresh Vishnoi:
A way to resolve my problem is to use an init container this way:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: goerli
  name: goerli-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: goerli
  template:
    metadata:
      labels:
        app: goerli
    spec:
      containers:
        - image: ethereum/client-go:stable
          name: goerli-geth
          args: ["--goerli", "--datadir", "/test2"]
          env:
            - name: LASTBLOCK
              value: "0"
            - name: FAILCOUNTER
              value: "0"
          ports:
            - containerPort: 30303
              name: geth
          livenessProbe:
            exec:
              command:
                - /bin/sh
                - /test/health.sh
            initialDelaySeconds: 60
            periodSeconds: 100
          volumeMounts:
            - name: test
              mountPath: /test
      initContainers:
        - name: healthcheck
          image: ethereum/client-go:stable
          command: ["wget", "-O", "/test/health.sh", "https://My-script-bash"]
          volumeMounts:
            - name: test
              mountPath: "/test"
      restartPolicy: Always
      volumes:
        - name: test
          emptyDir: {}
The downloaded file will be visible in /test/health.sh
If you're using Helm, look at chart tests: https://github.com/helm/helm/blob/master/docs/chart_tests.md. This covers the readinessProbe, though, not liveness.
For an advanced liveness probe, I'd run some kind of health-check sidecar which does all the advanced tests continuously via localhost and exposes a single /healthcheck endpoint. Then use that endpoint in the liveness probe.

How to use a custom scheduler for Kubernetes (on Google Cloud), written in bash?

At http://blog.kubernetes.io/2017/03/advanced-scheduling-in-kubernetes.html an example of a custom scheduler for Kubernetes is given, written in bash.
My question is: how can such a custom scheduler be used for a pod?
It says "Note that you need to run this along with kubectl proxy for it to work", which is not clear to me.
I would appreciate any help.
Thanks.
You would need to deploy the scheduler, then associate that scheduler with your pod.
This is a great write-up: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/
Here is an example deployment of my-scheduler:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  labels:
    component: scheduler
    tier: control-plane
  name: my-scheduler
  namespace: kube-system
spec:
  replicas: 1
  template:
    metadata:
      labels:
        component: scheduler
        tier: control-plane
        version: second
    spec:
      containers:
        - command:
            - /usr/local/bin/kube-scheduler
            - --address=0.0.0.0
            - --leader-elect=false
            - --scheduler-name=my-scheduler
          image: gcr.io/my-gcp-project/my-kube-scheduler:1.0
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10251
            initialDelaySeconds: 15
          name: kube-second-scheduler
          readinessProbe:
            httpGet:
              path: /healthz
              port: 10251
          resources:
            requests:
              cpu: '0.1'
          securityContext:
            privileged: false
          volumeMounts: []
      hostNetwork: false
      hostPID: false
      volumes: []
Here is how to connect a pod to your scheduler:
apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: my-scheduler
  containers:
    - name: pod-with-second-annotation-container
      image: gcr.io/google_containers/pause:2.0
The key part in the above is spec.schedulerName.
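Regarding the "run this along with kubectl proxy" note from the blog post: the bash scheduler talks to the API server over plain local HTTP, and kubectl proxy is what provides that authenticated local endpoint. Roughly like this (an editor's sketch of the idea, not the blog's exact script; it assumes jq is installed and binds pods to a hard-coded node-1 you would replace):

# Terminal 1: expose the API server on localhost, with kubectl handling auth.
kubectl proxy --port=8001 &

# Terminal 2: bind any pending pod that requested schedulerName=my-scheduler.
SERVER=localhost:8001
while true; do
  for POD in $(kubectl get pods -o json \
      | jq -r '.items[] | select(.spec.schedulerName == "my-scheduler" and .spec.nodeName == null) | .metadata.name'); do
    curl -s -H "Content-Type: application/json" -X POST \
      -d '{"apiVersion":"v1","kind":"Binding","metadata":{"name":"'"$POD"'"},"target":{"apiVersion":"v1","kind":"Node","name":"node-1"}}' \
      "http://$SERVER/api/v1/namespaces/default/pods/$POD/binding"
    echo "bound $POD to node-1"
  done
  sleep 2
done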

Kubernetes rollout gives 503 errors when switching web pods

I'm running this command:
kubectl set image deployment/www-deployment VERSION_www=newImage
It works fine, but there's a 10-second window where the website returns 503, and I'm a perfectionist.
How can I configure Kubernetes to wait for the image to be available before switching the ingress?
I'm using the nginx ingress controller from here:
gcr.io/google_containers/nginx-ingress-controller:0.8.3
And this yaml for the web server:
# Service and Deployment
apiVersion: v1
kind: Service
metadata:
  name: www-service
spec:
  ports:
    - name: http-port
      port: 80
      protocol: TCP
      targetPort: http-port
  selector:
    app: www
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: www-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: www
    spec:
      containers:
        - image: myapp/www
          imagePullPolicy: Always
          livenessProbe:
            httpGet:
              path: /healthz
              port: http-port
          name: www
          ports:
            - containerPort: 80
              name: http-port
              protocol: TCP
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/env-volume
              name: config
              readOnly: true
      imagePullSecrets:
        - name: cloud.docker.com-pull
      volumes:
        - name: config
          secret:
            defaultMode: 420
            items:
              - key: www.sh
                mode: 256
                path: env.sh
            secretName: env-secret
The Docker image is based on a Node.js server image.
/healthz is a path on the web server which returns ok. I thought the liveness probe would make sure the server was up and ready before switching to the new version.
Thanks in advance!
Within the Pod lifecycle it's defined that:
The default state of Liveness before the initial delay is Success.
To make sure you don't run into issues, you should also configure a readinessProbe for your Pods, and consider setting .spec.minReadySeconds on your Deployment.
You'll find details in the Deployment documentation.
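One way to verify the effect of those two settings during a rollout (an editor's sketch; replace https://example.com/ with whatever your ingress serves): watch the rollout in one terminal while polling the site in another and reporting any non-200 response.

# Terminal 1: trigger the rollout and wait for it to complete.
kubectl set image deployment/www-deployment VERSION_www=newImage
kubectl rollout status deployment/www-deployment

# Terminal 2: poll during the rollout and report any non-200 status code.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://example.com/)
  [ "$code" != "200" ] && echo "$(date +%T) got $code"
  sleep 0.5
done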