Kubernetes pod never ready and service endpoint empty - kubernetes

I want to deploy an ASP.NET server with Kubernetes:
Deployment with my docker image (ASP.NET server)
Service to expose the pods
Nginx ingress controller
Ingress to access my pods from "outside"
For my issue, we can forget all the ingress yaml.
When I deploy my Deployment and my Service I have a problem: my pod is never ready and my service endpoints field is empty.
NAME READY STATUS RESTARTS AGE
pod/server-deployment-bd4977bf5-n7gmx 0/1 Running 36 (41s ago) 147m
When I run "kubectl logs pod/server-deployment-bd4977bf5-n7gmx" there are no logs related to this issue
Microsoft.Hosting.Lifetime[14]
Now listening on: http://[::]:80
Microsoft.Hosting.Lifetime[14]
Now listening on: https://[::]:403
Microsoft.Hosting.Lifetime[0]
Application started. Press Ctrl+C to shut down.
Microsoft.Hosting.Lifetime[0]
Hosting environment: Production
Microsoft.Hosting.Lifetime[0]
Content root path: /app/
Microsoft.Hosting.Lifetime[0]
Application is shutting down...
When I run "kubectl describe service/server-svc" I see that the "Endpoints:" field is empty.
After some research on stackoverflow & others sites, I didn't find any solution or explanation about my problem. From what I read, I know that service's endpoints field shouldn't be empty and my pods might have a problem with the readinessProbe.
Below is the .yaml of my Deployment and Service
Deployment :
apiVersion: apps/v1
kind: Deployment
metadata:
name: server-deployment
spec:
replicas: 1
selector:
matchLabels:
app: server-app
strategy:
type: RollingUpdate
template:
metadata:
labels:
app: server-app
spec:
imagePullSecrets:
- name: regcred
containers:
- name: server-container
image: server:0.0.2
imagePullPolicy: Always
command: ["dotnet", "server.dll"]
envFrom:
- configMapRef:
name: server-configmap
optional: false
- secretRef:
name: server-secret
optional: false
ports:
- name: http
containerPort: 443
hostPort: 443
livenessProbe:
httpGet:
path: /api/health/live
port: http
initialDelaySeconds: 10
periodSeconds: 20
timeoutSeconds: 1
failureThreshold: 6
successThreshold: 1
readinessProbe:
httpGet:
path: /api/health/ready
port: http
initialDelaySeconds: 10
periodSeconds: 20
timeoutSeconds: 1
failureThreshold: 6
successThreshold: 1
volumeMounts:
- name: server-pfx-volume
mountPath: "/https"
readOnly: true
volumes:
- name: server-pfx-volume
secret:
secretName: server-pfx
Service:
apiVersion: v1
kind: Service
metadata:
name: server-svc
spec:
type: ClusterIP
selector:
app: server-app
ports:
- name: http
protocol: TCP
port: 443
targetPort: 443
When I run "kubectl get pods --show-labels" I got the pod with the correct label
NAME READY STATUS RESTARTS AGE LABELS
server-deployment-bd4977bf5-n7gmx 0/1 CrashLoopBackOff 38 (74s ago) 158m app=server-app,pod-template-hash=bd4977bf5
So I'm here to looking for help to figure out why my pod is never ready and why my service endpoint field is empty.

Related

Readiness probe based on service

I have 2 pods and my application is based on a cluster i.e. application synchronizes with another pod to bring it up. Let us say in my example I am using appod1 and appod2 and the synchronization port is 8080.
I want the service for DNS to be resolved for these pod hostnames but I want to block the traffic from outside the apppod1 and appod2.
I can use a readiness probe but then the service doesn't have endpoints and I can't resolve the IP of the 2nd pod. If I can't resolve the IP of the 2nd pod from pod1 then I can't complete the configuration of these pods.
E.g.
App Statefulset definition
app1_sts.yaml
===
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
cluster: appcluster
name: app1
namespace: app
spec:
selector:
matchLabels:
cluster: appcluster
serviceName: app1cluster
template:
metadata:
labels:
cluster: appcluster
spec:
containers:
- name: app1-0
image: localhost/linux:8
imagePullPolicy: Always
securityContext:
privileged: false
command: [/usr/sbin/init]
ports:
- containerPort: 8080
name: appport
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 120
periodSeconds: 30
failureThreshold: 20
env:
- name: container
value: "true"
- name: applist
value: "app2-0"
app2_sts.yaml
====
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
cluster: appcluster
name: app2
namespace: app
spec:
selector:
matchLabels:
cluster: appcluster
serviceName: app2cluster
template:
metadata:
labels:
cluster: appcluster
spec:
containers:
- name: app2-0
image: localhost/linux:8
imagePullPolicy: Always
securityContext:
privileged: false
command: [/usr/sbin/init]
ports:
- containerPort: 8080
name: appport
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 120
periodSeconds: 30
failureThreshold: 20
env:
- name: container
value: "true"
- name: applist
value: "app1-0"
Create Statefulsets and check name resolution
[root#oper01 onprem]# kubectl get all -n app
NAME READY STATUS RESTARTS AGE
pod/app1-0 0/1 Running 0 8s
pod/app2-0 0/1 Running 0 22s
NAME READY AGE
statefulset.apps/app1 0/1 49s
statefulset.apps/app2 0/1 22s
kubectl exec -i -t app1-0 /bin/bash -n app
[root#app1-0 ~]# nslookup app2-0
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find app2-0: NXDOMAIN
[root#app1-0 ~]# nslookup app1-0
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find app1-0: NXDOMAIN
[root#app1-0 ~]#
I understand the behavior of the readiness probe and I am using it as it helps me to make sure service should not resolve to app pods if port 8080 is down. However, I am unable to make out how can I complete the configuration as app pods need to resolve each other and they need their hostname and IPs to configure. DNS resolution can only happen once the service has end points. Is there a better way to handle this situation?

unknown host when lookup pod by name, resolved with pod restart

I have an installer that spins up two pods in my CI flow, let's call them web and activemq. When the web pod starts it tries to communicate with the activemq pod using the k8s assigned amq-deployment-0.activemq pod name.
Randomly, the web will get an unknown host exception when trying to access amq-deployment1.activemq. If I restart the web pod in this situation the web pod will have no problem communicating with the activemq pod.
I've logged into the web pod when this happens and the /etc/resolv.conf and /etc/hosts files look fine. The host machines /etc/resolve.conf and /etc/hosts are sparse with nothing that looks questionable.
Information:
There is only 1 worker node.
kubectl --version
Kubernetes v1.8.3+icp+ee
Any ideas on how to go about debugging this issue. I can't think of a good reason for it to happen randomly nor resolve itself on a pod restart.
If there is other useful information needed, I can get it. Thank in advance
For activeMQ we do have this service file
apiVersion: v1 kind: Service
metadata:
name: activemq
labels:
app: myapp
env: dev
spec:
ports:
- port: 8161
protocol: TCP
targetPort: 8161
name: http
- port: 61616
protocol: TCP
targetPort: 61616
name: amq
selector:
component: analytics-amq
app: myapp
environment: dev
type: fa-core
clusterIP: None
And this ActiveMQ stateful set (this is the template)
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
name: pa-amq-deployment
spec:
replicas: {{ activemqs }}
updateStrategy:
type: RollingUpdate
serviceName: "activemq"
template:
metadata:
labels:
component: analytics-amq
app: myapp
environment: dev
type: fa-core
spec:
containers:
- name: pa-amq
image: default/myco/activemq:latest
imagePullPolicy: Always
resources:
limits:
cpu: 150m
memory: 1Gi
livenessProbe:
exec:
command:
- /etc/init.d/activemq
- status
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 16
ports:
- containerPort: 8161
protocol: TCP
name: http
- containerPort: 61616
protocol: TCP
name: amq
envFrom:
- configMapRef:
name: pa-activemq-conf-all
- secretRef:
name: pa-activemq-secret
volumeMounts:
- name: timezone
mountPath: /etc/localtime
volumes:
- name: timezone
hostPath:
path: /usr/share/zoneinfo/UTC
The Web stateful set:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: pa-web-deployment
spec:
replicas: 1
updateStrategy:
type: RollingUpdate
serviceName: "pa-web"
template:
metadata:
labels:
component: analytics-web
app: myapp
environment: dev
type: fa-core
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: component
operator: In
values:
- analytics-web
topologyKey: kubernetes.io/hostname
containers:
- name: pa-web
image: default/myco/web:latest
imagePullPolicy: Always
resources:
limits:
cpu: 1
memory: 2Gi
readinessProbe:
httpGet:
path: /versions
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 76
livenessProbe:
httpGet:
path: /versions
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 80
securityContext:
privileged: true
ports:
- containerPort: 8080
name: http
protocol: TCP
envFrom:
- configMapRef:
name: pa-web-conf-all
- secretRef:
name: pa-web-secret
volumeMounts:
- name: shared-volume
mountPath: /MySharedPath
- name: timezone
mountPath: /etc/localtime
volumes:
- nfs:
server: 10.100.10.23
path: /MySharedPath
name: shared-volume
- name: timezone
hostPath:
path: /usr/share/zoneinfo/UTC
This web pod also has a similar "unknown host" problem finding an external database we have configured. The issue being resolved similarly by restarting the pod. Here is the configuration of that external service. Maybe it is easier to tackle the problem from this angle? ActiveMQ has no problem using the database service name to find the DB and startup.
apiVersion: v1
kind: Service
metadata:
name: dbhost
labels:
app: myapp
env: dev
spec:
type: ExternalName
externalName: mydb.host.com
Is it possible that it is a question of which pod, and the app in its container, is started up first and which second?
In any case, connecting using a Service and not the pod name would be recommended as the pod's name assigned by Kubernetes changes between pod restarts.
A way to test connectivity, is to use telnet (or curl for the protocols it supports), if found in the image:
telnet <host/pod/Service> <port>
Not able to find a solution, I created a workaround. I set up the entrypoint.sh in my image to lookup the domain I need to access and write to the log, exiting on error:
#!/bin/bash
#disable echo and exit on error
set +ex
#####################################
# verfiy that the db service can be found or exit container
#####################################
# we do not want to install nslookup to determine if the db_host_name is valid name
# we have ping available though
# 0-success, 1-error pinging but lookup worked (services can not be pinged), 2-unreachable host
ping -W 2 -c 1 ${db_host_name} &> /dev/null
if [ $? -le 1 ]
then
echo "service ${db_host_name} is known"
else
echo "${db_host_name} service is NOT recognized. Exiting container..."
exit 1
fi
Next since only a pod restart fixed the issue. In my ansible deploy, I do a rollout check, querying the log to see if I need to do a pod restart. For example:
rollout-check.yml
- name: "Rollout status for {{rollout_item.statefulset}}"
shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
ignore_errors: yes
# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
register: log_line
# the entrypoint will write dbhost service is NOT recognized. Exiting container... to the log
# if there is a problem getting to the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
when: log_line.stdout.find('service is NOT recognized') > 0
I repeat this rollout check 6 times as sometimes even after a pod restart the service cannot be found. The additional checks are instant once the pod is successfully up.
- name: "Web rollout"
include_tasks: rollout-check.yml
loop:
- { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
loop_control:
loop_var: rollout_item

Kubernetes - Best strategy for pods with same port?

I have a Kubernetes cluster with 2 Slaves. I have 4 docker containers which all use a tomcat image and expose port 8080 and 8443. When I now put each container into a separate pod I get an issue with the ports since I only have 2 worker nodes.
What would be the best strategy for my scenario?
Current error message is: 1 PodToleratesNodeTaints, 2 PodFitsHostPorts.
Put all containers into one pod? This is my current setup (times 4)
kind: Deployment
apiVersion: apps/v1beta2
metadata:
name: myApp1
namespace: appNS
labels:
app: myApp1
spec:
replicas: 1
selector:
matchLabels:
app: myApp1
template:
metadata:
labels:
app: myApp1
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
containers:
- image: myregistry:5000/myApp1:v1
name: myApp1
ports:
- name: http-port
containerPort: 8080
- name: https-port
containerPort: 8443
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 6
---
kind: Service
apiVersion: v1
metadata:
name: myApp1-srv
namespace: appNS
labels:
version: "v1"
app: "myApp1"
spec:
type: NodePort
selector:
app: "myApp1"
ports:
- protocol: TCP
name: http-port
port: 8080
- protocol: TCP
name: https-port
port: 8443
You should not use hostNetwork unless absolutely necessary. Without host network you can have multiple pods listening on the same port number as each will have its own, dedicated network namespace.

Kubernetes endpoints empty , can I restart the pods?

I have a situation where I have zero endpoints available for one service. To test this, I specially crafted a yaml descriptor that uses a simple node server to set and retrieve the ready/live status for a pod:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nodejs-deployment
labels:
app: nodejs
spec:
replicas: 3
selector:
matchLabels:
app: nodejs
template:
metadata:
labels:
app: nodejs
spec:
containers:
- name: nodejs
image: nodejs_server
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /is_alive
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 3
periodSeconds: 10
readinessProbe:
httpGet:
path: /is_ready
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 3
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: nodejs-service
labels:
app: nodejs
spec:
ports:
- port: 80
protocol: TCP
targetPort: 8080
selector:
app: nodejs
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: nodejs-ingress
spec:
backend:
serviceName: nodejs-service
servicePort: 80
The node server has methods to set and retrieve the liveness and readiness.
When the app start I can see that 3 replicas are created and the status of them is ready. OK then now I trigger manually the status of their readiness to set to false [from outside the ingress]. One pod is correctly removed from the endpoint so no traffic is routed to it[that's OK as this is the expected behavior]. When I set all the ready-statuses to false for all pods the endpoints list is empty [still the expected behavior].
At that point I cannot set ready=true from outside the ingress as the traffic is not routed to any pod. Is there a way here for example of triggering a restart of the pod when the ready is not achieved after n-timer or n-seconds? Or when the endpoints list is empty?
Well, that is perfectly normal and expected behaviour. What you can do, on the side, is to forward traffic from localhost to a particular pod with kubectl port-forward. That way you can access the pod directly, without ingresses etc. and set it's readiness back to ok. If you want to restart when host it not ready for to long, just use the same endpoint for liveness probe, but trigger it after more tries.

Kubernetes rollout give 503 error when switching web pods

I'm running this command:
kubectl set image deployment/www-deployment VERSION_www=newImage
Works fine. But there's a 10 second window where the website is 503, and I'm a perfectionist.
How can I configure kubernetes to wait for the image to be available before switching the ingress?
I'm using the nginx ingress controller from here:
gcr.io/google_containers/nginx-ingress-controller:0.8.3
And this yaml for the web server:
# Service and Deployment
apiVersion: v1
kind: Service
metadata:
name: www-service
spec:
ports:
- name: http-port
port: 80
protocol: TCP
targetPort: http-port
selector:
app: www
sessionAffinity: None
type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: www-deployment
spec:
replicas: 1
template:
metadata:
labels:
app: www
spec:
containers:
- image: myapp/www
imagePullPolicy: Always
livenessProbe:
httpGet:
path: /healthz
port: http-port
name: www
ports:
- containerPort: 80
name: http-port
protocol: TCP
resources:
requests:
cpu: 100m
memory: 100Mi
volumeMounts:
- mountPath: /etc/env-volume
name: config
readOnly: true
imagePullSecrets:
- name: cloud.docker.com-pull
volumes:
- name: config
secret:
defaultMode: 420
items:
- key: www.sh
mode: 256
path: env.sh
secretName: env-secret
The Docker image is based on a node.js server image.
/healthz is a file in the webserver which returns ok I thought that liveness probe would make sure the server was up and ready before switching to the new version.
Thanks in advance!
within the Pod lifecycle it's defined that:
The default state of Liveness before the initial delay is Success.
To make sure you don't run into issues better configure the ReadinessProbe for your Pods too and consider to configure .spec.minReadySeconds for your Deployment.
You'll find details in the Deployment documentation