Hostname of pods in the same StatefulSet cannot be resolved - kubernetes

I am configuring a StatefulSet deploying 2 Jira Data Center nodes. The StatefulSet results in 2 pods. Everything seems fine until the 2 pods try to connect to each other, which they do using their short hostnames jira-0 and jira-1.
The jira-1 pod reports an UnknownHostException when connecting to jira-0. The hostname cannot be resolved.
I read about adding a headless Service, which I didn't have yet. After adding one I can resolve the FQDN, but still no luck with the short name.
Then I read this page: DNS for Services and Pods and added:
dnsConfig:
  searches:
    - jira.default.svc.cluster.local
That solves my issue but I think it shouldn't be necessary to add this?
Some extra info:
Cluster on AKS with CoreDNS
Kubernetes v1.19.9
Network plugin: Kubenet
Network policy: none
My full yaml file:
apiVersion: v1
kind: Service
metadata:
  name: jira
  labels:
    app: jira
spec:
  clusterIP: None
  selector:
    app: jira
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: jira
spec:
  serviceName: jira
  replicas: 2
  selector:
    matchLabels:
      app: jira
  template:
    metadata:
      labels:
        app: jira
    spec:
      containers:
        - name: jira
          image: atlassian/jira-software:8.12.2-jdk11
          readinessProbe:
            httpGet:
              path: /jira/status
              port: 8080
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /jira/
              port: 8080
            initialDelaySeconds: 600
            periodSeconds: 10
          envFrom:
            - configMapRef:
                name: jira-config
          ports:
            - containerPort: 8080
      dnsConfig:
        searches:
          - jira.default.svc.cluster.local

From the StatefulSet documentation:
StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for creating this Service.
The example above will create three Pods named web-0,web-1,web-2. A StatefulSet can use a Headless Service to control the domain of its Pods.
The pod identity will be a subdomain of the governing Service, e.g. in your case:
jira-0.jira.default.svc.cluster.local
jira-1.jira.default.svc.cluster.local
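The reason the short name does not resolve out of the box is the pod's DNS search path: it typically contains default.svc.cluster.local, svc.cluster.local and cluster.local (plus any node-level domains), so jira-0.jira expands to the FQDN and resolves, while a bare jira-0 does not. Adding jira.default.svc.cluster.local as an extra search domain is what makes the short name work. A quick sketch of how to verify this from inside a pod (assuming getent or nslookup is available in the image):
$ kubectl exec jira-1 -- cat /etc/resolv.conf
# default search list: default.svc.cluster.local svc.cluster.local cluster.local
# with the dnsConfig above, jira.default.svc.cluster.local is added to that list
$ kubectl exec jira-1 -- getent hosts jira-0.jira   # resolves even without the extra search domain
$ kubectl exec jira-1 -- getent hosts jira-0        # resolves only with the extra search domain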

Related

Health checks for service returning 301 after updating deployment

We recently updated the deployment of a dropwizard service deployed using Docker and Kubernetes.
It was working correctly before: the readiness probe's health check pings to the internal cluster IP were getting 200s. Since we updated, the health check pings are resulting in a 301 and the service is considered down.
I've noticed that the health check is now "Default kubernetes L7 Loadbalancing health check for NEG." (the port is set to 80), whereas it was previously "Default kubernetes L7 Loadbalancing health check.", where the port was configurable.
The kube file is deployed via CircleCI but the readiness probe is:
kind: Deployment
metadata:
  name: pes-${CIRCLE_BRANCH}
  namespace: ${GKE_NAMESPACE_NAME}
  annotations:
    reloader.stakater.com/auto: 'true'
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ***
  template:
    metadata:
      labels:
        app: ***
    spec:
      containers:
        - name: ***
          image: ***
          envFrom:
            - configMapRef:
                name: ***
            - secretRef:
                name: ***
          command: ['./gradlew', 'run']
          resources: {}
          ports:
            - name: pes
              containerPort: 5000
          readinessProbe:
            httpGet:
              path: /api/healthcheck
              port: pes
            initialDelaySeconds: 15
            timeoutSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: ***
  namespace: ${GKE_NAMESPACE_NAME}
spec:
  ports:
    - name: pes
      port: 5000
      targetPort: pes
      protocol: TCP
  selector:
    app: ***
  type: LoadBalancer
Any ideas on how this needs to be configured in GCP?
I have a feeling that the new deployment has changed from the legacy health check to the non-legacy one, but I have no idea what else needs to be set up for it to work. Does the kube file handle creating firewall rules, or does that need to be done manually?
Reading the docs at https://cloud.google.com/load-balancing/docs/health-check-concepts?hl=en
EDIT:
The issue is now resolved. After the GKE version was updated, it now creates a NEG health check by default. We disabled this by adding the annotation below to the Service manifest.
metadata:
  annotations:
    cloud.google.com/neg: '{"ingress":false}'
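For context, here is a sketch of where that annotation sits in the Service from the question (names and ports copied from the manifest above; treat it as illustrative rather than the exact file):
apiVersion: v1
kind: Service
metadata:
  name: ***
  namespace: ${GKE_NAMESPACE_NAME}
  annotations:
    # tell GKE not to create a NEG (and its NEG-style health check) for this Service
    cloud.google.com/neg: '{"ingress":false}'
spec:
  type: LoadBalancer
  ports:
    - name: pes
      port: 5000
      targetPort: pes
      protocol: TCP
  selector:
    app: ***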

GCP load balancer backend status unknown

I'm flabbergasted.
I have a staging and production environment. Both environments have the same deployments, services, ingress, firewall rules, and both serve a 200 on /.
However, after turning on the staging environment and provisioning the same ingress, the staging service fails with Some backend services are in UNKNOWN state. Production is still live.
Both the frontend and backend pods are ready on GKE. I've manually tested the health checks and they pass when I visit /.
I see nothing in the logs or gcp docs pointing in the right direction. What could I have possibly broken?
ingress.yaml:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: fanout-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: "STATIC-IP"
spec:
  backend:
    serviceName: frontend
    servicePort: 8080
  tls:
    - hosts:
        - <DOMAIN>
      secretName: staging-tls
  rules:
    - host: <DOMAIN>
      http:
        paths:
          - path: /*
            backend:
              serviceName: frontend
              servicePort: 8080
          - path: /backend/*
            backend:
              serviceName: backend
              servicePort: 8080
frontend.yaml:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: frontend
  name: frontend
  namespace: default
spec:
  ports:
    - nodePort: 30664
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: frontend
  type: NodePort
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  generation: 15
  labels:
    app: frontend
  name: frontend
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  minReadySeconds: 5
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - image: <our-image>
          name: frontend
          ports:
            - containerPort: 8080
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 3
          livenessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 3
Yesterday even this guide https://cloud.google.com/kubernetes-engine/docs/tutorials/http-balancer didn't work. I don't know what happened, but even after waiting 30+ minutes the ingress was reporting UNKNOWN state for the backends.
After 24 hours, things seem to be much better: the L7 HTTP ingress works, but with a big delay in reporting healthy backends.
I think this is a bug. I created a new cluster and couldn't reproduce. If anyone hits this again, I would suggest trying a new cluster.
If it started happening after altering the scalability settings of your cluster:
Deleting and re-creating the Ingress resource might help - in my case it fixed it almost immediately.
Steps I followed:
kubectl delete ingress <faulty_ingress>
kubectl apply -f <my_ingress.yaml>
What worked for me was deleting and recreating the BackendConfig.
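A sketch of that, assuming the BackendConfig CRD is installed and the resource lives in your application's namespace (names are placeholders):
$ kubectl get backendconfig -n <namespace>
$ kubectl delete backendconfig <backend-config-name> -n <namespace>
$ kubectl apply -f backendconfig.yaml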
Are you still experiencing this issue?
I tried to reproduce, following the Google public documentation on Setting up HTTP Load Balancing with Ingress, to deploy:
A web app using the sample web application container image, which listens on an HTTP server on port 8080.
However, it seems to be working now. So if you are still having this issue, please consider filing a public issue against kubernetes/ingress-gce using the Google issue-tracking tool. Include as many details as possible, including steps to reproduce, so that this issue can get better visibility as well as more sampling.
Please note:
The Issue Tracker User Content and Conduct Policy details the types of information that are inappropriate for submitting to Issue Tracker which includes things like sensitive personal information and spam. Please do not submit inappropriate content in Issue Tracker.
Repro output of $ kubectl describe ing:
sunny#test-dev:~$ kubectl describe ing basic-ingress
Name: basic-ingress
Namespace: default
Address: xx.xxx.xxx.228
Default backend: web:8080 (10.8.2.6:8080)
Rules:
Host Path Backends
---- ---- --------
* * web:8080 (10.8.2.6:8080)
Annotations:
target-proxy: k8s-tp-default-basic-ingress--f5636f071d87exxx
url-map: k8s-um-default-basic-ingress--f5636f071d87exxx
backends: {"k8s-be-31544--f5636f071d87exxx":"HEALTHY"}
forwarding-rule: k8s-fw-default-basic-ingress--f5636f071d87exxx
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Service 7m (x376 over 2d) loadbalancer-controller default backend set to web:31544

GCE Ingress not picking up health check from readiness probe

When I create a GCE ingress, Google Load Balancer does not set the health check from the readiness probe. According to the docs (Ingress GCE health checks) it should pick it up.
Expose an arbitrary URL as a readiness probe on the pods backing the Service.
Any ideas why?
Deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: frontend-prod
  labels:
    app: frontend-prod
spec:
  selector:
    matchLabels:
      app: frontend-prod
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: frontend-prod
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
        - image: app:latest
          readinessProbe:
            httpGet:
              path: /healthcheck
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 5
          name: frontend-prod-app
        - env:
            - name: PASSWORD_PROTECT
              value: "1"
          image: nginx:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
          name: frontend-prod-nginx
Service:
apiVersion: v1
kind: Service
metadata:
  name: frontend-prod
  labels:
    app: frontend-prod
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  selector:
    app: frontend-prod
Ingress:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: frontend-prod-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: frontend-prod-ip
spec:
  tls:
    - secretName: testsecret
  backend:
    serviceName: frontend-prod
    servicePort: 80
So apparently, you need to include the container port on the PodSpec.
Does not seem to be documented anywhere.
e.g.
spec:
  containers:
    - name: nginx
      image: nginx:1.7.9
      ports:
        - containerPort: 80
Thanks, Brian! https://github.com/kubernetes/ingress-gce/issues/241
This is now possible in the latest GKE (I am on 1.14.10-gke.27, not sure if that matters)
Define a readinessProbe on your container in your Deployment.
Recreate your Ingress.
The health check will point to the path in readinessProbe.httpGet.path of the Deployment yaml config.
Update by Jonathan Lin (see the answer above): This has been fixed very recently. Define a readinessProbe on the Deployment. Recreate your Ingress. It will pick up the health check path from the readinessProbe.
GKE Ingress health check path is currently not configurable. You can go to http://console.cloud.google.com (UI) and visit Load Balancers list to see the health check it uses.
Currently the health check for an Ingress is GET / on each backend: specified on the Ingress. So all your apps behind a GKE Ingress must return HTTP 200 OK to GET / requests.
That said, the health checks you specified on your Pods are still being used by the kubelet to make sure your Pod is actually functioning and healthy.
Google has recently added support for a CRD that can configure your Backend Services along with health checks:
apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: backend-config
  namespace: prod
spec:
  healthCheck:
    checkIntervalSec: 30
    port: 8080
    type: HTTP # case-sensitive
    requestPath: /healthcheck
See here.
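Note that the BackendConfig only takes effect once it is referenced from the Service. A sketch of the wiring, assuming the cloud.google.com/backend-config annotation (older GKE versions use beta.cloud.google.com/backend-config):
apiVersion: v1
kind: Service
metadata:
  name: frontend-prod
  annotations:
    # attach the BackendConfig above to this Service's backend
    cloud.google.com/backend-config: '{"default": "backend-config"}'
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  selector:
    app: frontend-prod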
Another reason why the Google Cloud Load Balancer does not pick up the GCE health check configuration from the Kubernetes Pod readiness probe could be that the service is configured as "selectorless" (the selector attribute is empty and you manage endpoints directly).
This is the case with e.g. kube-lego: see https://github.com/jetstack/kube-lego/issues/68#issuecomment-303748457 and https://github.com/jetstack/kube-lego/issues/68#issuecomment-327457982.
The original question does have a selector specified in the Service, so this hint doesn't apply there. It serves visitors that have the same problem with a different cause.
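To illustrate what "selectorless" means, here is a minimal sketch (names and addresses are made up): a Service with no spec.selector, paired with a manually managed Endpoints object. Because Kubernetes is not tracking pods for this Service, there is no readiness probe for the ingress controller to derive a health check from.
apiVersion: v1
kind: Service
metadata:
  name: external-app        # hypothetical name
spec:
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Endpoints
metadata:
  name: external-app        # must match the Service name
subsets:
  - addresses:
      - ip: 10.0.0.42       # maintained by hand or by a controller such as kube-lego
    ports:
      - port: 8080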

Kubernetes endpoints empty , can I restart the pods?

I have a situation where I have zero endpoints available for one service. To test this, I specially crafted a yaml descriptor that uses a simple node server to set and retrieve the ready/live status for a pod:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nodejs-deployment
  labels:
    app: nodejs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nodejs
  template:
    metadata:
      labels:
        app: nodejs
    spec:
      containers:
        - name: nodejs
          image: nodejs_server
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /is_alive
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 3
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /is_ready
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 3
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: nodejs-service
  labels:
    app: nodejs
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 8080
  selector:
    app: nodejs
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nodejs-ingress
spec:
  backend:
    serviceName: nodejs-service
    servicePort: 80
The node server has methods to set and retrieve the liveness and readiness.
When the app starts I can see that 3 replicas are created and their status is ready. Then I manually set their readiness status to false [from outside, through the ingress]. One pod is correctly removed from the endpoints, so no traffic is routed to it [that's OK, as this is the expected behavior]. When I set the ready status to false for all pods, the endpoints list is empty [still the expected behavior].
At that point I cannot set ready=true through the ingress, as the traffic is not routed to any pod. Is there a way here, for example, of triggering a restart of the pod when readiness is not achieved after n tries or n seconds? Or when the endpoints list is empty?
Well, that is perfectly normal and expected behaviour. What you can do, on the side, is to forward traffic from localhost to a particular pod with kubectl port-forward. That way you can access the pod directly, without ingresses etc., and set its readiness back to ok. If you want to restart when the pod is not ready for too long, just use the same endpoint for the liveness probe, but trigger it after more tries.
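A sketch of both suggestions, using the manifests above (the /set_ready path is hypothetical, standing in for whatever endpoint your node server exposes to flip readiness):
# bypass the Service/Ingress and talk to one pod directly
$ kubectl port-forward <nodejs-pod-name> 8080:8080
# in another shell, flip readiness back via the hypothetical endpoint
$ curl "http://localhost:8080/set_ready?value=true"

# liveness probe reusing the readiness endpoint, but with a higher failure threshold,
# so the kubelet restarts the container if it stays not-ready for ~2 minutes
livenessProbe:
  httpGet:
    path: /is_ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 12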

Kubernetes Ingress (GCE) keeps returning 502 error

I am trying to set up an Ingress in GCE Kubernetes. But when I visit the IP address and path combination defined in the Ingress, I keep getting the following 502 error:
Here is what I get when I run: kubectl describe ing --namespace dpl-staging
Name: dpl-identity
Namespace: dpl-staging
Address: 35.186.221.153
Default backend: default-http-backend:80 (10.0.8.5:8080)
TLS:
dpl-identity terminates
Rules:
Host Path Backends
---- ---- --------
*
/api/identity/* dpl-identity:4000 (<none>)
Annotations:
https-forwarding-rule: k8s-fws-dpl-staging-dpl-identity--5fc40252fadea594
https-target-proxy: k8s-tps-dpl-staging-dpl-identity--5fc40252fadea594
url-map: k8s-um-dpl-staging-dpl-identity--5fc40252fadea594
backends: {"k8s-be-31962--5fc40252fadea594":"HEALTHY","k8s-be-32396--5fc40252fadea594":"UNHEALTHY"}
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
15m 15m 1 {loadbalancer-controller } Normal ADD dpl-staging/dpl-identity
15m 15m 1 {loadbalancer-controller } Normal CREATE ip: 35.186.221.153
15m 6m 4 {loadbalancer-controller } Normal Service no user specified default backend, using system default
I think the problem is dpl-identity:4000 (<none>). Shouldn't I see the IP address of the dpl-identity service instead of <none>?
Here is my service description: kubectl describe svc --namespace dpl-staging
Name: dpl-identity
Namespace: dpl-staging
Labels: app=dpl-identity
Selector: app=dpl-identity
Type: NodePort
IP: 10.3.254.194
Port: http 4000/TCP
NodePort: http 32396/TCP
Endpoints: 10.0.2.29:8000,10.0.2.30:8000
Session Affinity: None
No events.
Also, here is the result of executing: kubectl describe ep -n dpl-staging dpl-identity
Name: dpl-identity
Namespace: dpl-staging
Labels: app=dpl-identity
Subsets:
Addresses: 10.0.2.29,10.0.2.30
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
http 8000 TCP
No events.
Here is my deployment.yaml:
apiVersion: v1
kind: Secret
metadata:
  namespace: dpl-staging
  name: dpl-identity
type: Opaque
data:
  tls.key: <base64 key>
  tls.crt: <base64 crt>
---
apiVersion: v1
kind: Service
metadata:
  namespace: dpl-staging
  name: dpl-identity
  labels:
    app: dpl-identity
spec:
  type: NodePort
  ports:
    - port: 4000
      targetPort: 8000
      protocol: TCP
      name: http
  selector:
    app: dpl-identity
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: dpl-staging
  name: dpl-identity
  labels:
    app: dpl-identity
  annotations:
    kubernetes.io/ingress.allow-http: "false"
spec:
  tls:
    - secretName: dpl-identity
  rules:
    - http:
        paths:
          - path: /api/identity/*
            backend:
              serviceName: dpl-identity
              servicePort: 4000
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: dpl-staging
  name: dpl-identity
  labels:
    app: dpl-identity
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: dpl-identity
    spec:
      containers:
        - image: gcr.io/munpat-container-engine/dpl/identity:0.4.9
          name: dpl-identity
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: dpl-identity
              mountPath: /data
      volumes:
        - name: dpl-identity
          secret:
            secretName: dpl-identity
Your backend k8s-be-32396--5fc40252fadea594 is showing as "UNHEALTHY".
Ingress will not forward traffic if the backend is UNHEALTHY, this will result in the 502 error you are seeing.
It is being marked as UNHEALTHY because it is not passing its health check. You can check the health check settings for k8s-be-32396--5fc40252fadea594 to see if they are appropriate for your pod; it may be polling a URI or port that is not returning a 200 response. You can find these settings under Compute Engine > Health Checks.
If they are correct, then there are many steps between your browser and the container that could be passing traffic incorrectly. You could try kubectl exec -it PODID -- bash (or ash if you are using Alpine) and then try curl-ing localhost to see if the container is responding as expected. If it is, and the health checks are also configured correctly, then this narrows the issue down to likely being with your service; you could then try changing the service from a NodePort type to a LoadBalancer and see if hitting the service IP directly from your browser works.
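A sketch of that check, using the container port from the manifests above (assuming curl is available in the image):
$ kubectl exec -it <pod-name> -n dpl-staging -- bash   # or ash on Alpine
# then, inside the pod:
$ curl -i http://localhost:8000/                        # expect a 200 if the container itself is healthy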
I was having the same issue. It turns out I had to wait a few minutes for the ingress to validate the service health. If someone is going through the same and has done all the steps like readinessProbe and livenessProbe, just ensure your ingress is pointing to a Service of type NodePort, and wait a few minutes until the yellow warning icon turns into a green one. Also, check the logs on Stackdriver to get a better idea of what's going on. My readinessProbe and livenessProbe are on /login, for the gce class. So I don't think it has to be on /healthz.
The issue is indeed the health check, and it seemed "random" for my apps where I used name-based virtual hosts to reverse proxy requests from the ingress via domains to two separate backend services. Both were secured using Let's Encrypt and kube-lego. My solution was to standardize the path for health checks for all services sharing an ingress, and declare the readinessProbe and livenessProbe configs in my deployment.yml file.
I faced this issue with Google Cloud cluster node version 1.7.8 and found this issue that closely resembled what I experienced:
* https://github.com/jetstack/kube-lego/issues/27
I'm using gce and kube-lego, and my backend service health checks were on / while kube-lego's is on /healthz. It appears differing paths for health checks with gce ingress might be the cause, so it may be worth updating backend services to match the /healthz pattern so all use the same path (or, as one commenter in the GitHub issue stated, update kube-lego to pass on /).
I had the same problem, and it persisted after I enabled the livenessProbe as well as the readinessProbe.
It turned out this was to do with basic auth. I had added basic auth to the livenessProbe and the readinessProbe, but it turns out the GCE HTTP(S) load balancer doesn't have a configuration option for that.
There seem to be a few other kinds of issues too, e.g. setting the container port to 8080 and the service port to 80 didn't work with the GKE ingress controller (though I couldn't clearly identify what the problem was). Broadly, it looks to me like there is very little visibility, and running your own ingress controller is a better option with respect to visibility.
I picked Traefik for my project; it worked out of the box, and I'd like to enable the Let's Encrypt integration. The only change I had to make to the Traefik manifests was tweaking the service object to disable access to the UI from outside the cluster and to expose my app through an external load balancer (GCE TCP LB). Also, Traefik is more native to Kubernetes. I tried Heptio Contour, but something didn't work out of the box (will give it a go next time when the new version comes out).
I had the same issue. It turned out that the pod itself was running OK, which I tested via port-forwarding and accessing the health-check URL.
Port forwarding can be started from the console as follows:
$ kubectl port-forward <pod-name> <local-port>:<pod-port>
So if the pod is running OK and the ingress still shows an unhealthy state, there might be an issue with your service configuration. In my case my app selector was incorrect, so it selected a non-existent pod. Interestingly, this isn't shown as an error or alert in the Google console.
Definition of the pods:
#pod-definition.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <pod-name>
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: <pod-name>   # this label must match the service selector
  template:
    metadata:
      labels:
        app: <pod-name>
    spec:
      #spec definition follows

#service.yaml
apiVersion: v1
kind: Service
metadata:
  name: <name-of-service-here>
  namespace: <namespace>
spec:
  type: NodePort
  selector:
    app: <pod-name>   # must match the pod labels above
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
      name: <port-name-here>
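A quick way to confirm whether the selector actually matches any pods is to compare the Service's Endpoints with the pods carrying the expected label; if the selector is wrong, the Endpoints object stays empty and the backend can never become healthy (names are the placeholders from above):
$ kubectl get endpoints <name-of-service-here> -n <namespace>
$ kubectl get pods -n <namespace> -l app=<pod-name>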
The "Limitations" section of the kubernetes documentation states that:
All Kubernetes services must serve a 200 page on '/', or whatever custom value you've specified through GLBC's --health-check-path argument.
https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/cluster-loadbalancing/glbc#limitations
I solved the problem by:
Removing the service from the ingress definition
Deploying the ingress: kubectl apply -f ingress.yaml
Adding the service back to the ingress definition
Deploying the ingress again
Essentially, I followed Roy's advice and tried to turn it off and on again.
Logs can be read from Stackdriver Logging; in my case, it was a backend_timeout error. After increasing the default timeout (30s) via BackendConfig, it stopped returning 502 even under load.
More on:
https://cloud.google.com/kubernetes-engine/docs/how-to/configure-backend-service#creating_a_backendconfig
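A sketch of such a BackendConfig (the name is a placeholder; it still has to be attached to the Service via the cloud.google.com/backend-config annotation as described in the linked doc):
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: app-backend-config   # placeholder name
spec:
  timeoutSec: 120            # raise the backend service timeout from the 30s default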
I fixed this issue after adding the following readiness and liveness probes with successThreshold: 1 and failureThreshold: 3. I also kept initialDelaySeconds at 70 because sometimes an application responds a bit late; it may vary per application.
NOTE: Also ensure that the path in httpGet exists in your application (in my case /api/books), otherwise GCP pings the /healthz path, which is not guaranteed to return 200 OK.
readinessProbe:
  httpGet:
    path: /api/books
    port: 80
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 3
  initialDelaySeconds: 70
  timeoutSeconds: 60
livenessProbe:
  httpGet:
    path: /api/books
    port: 80
  initialDelaySeconds: 70
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 3
  timeoutSeconds: 60
I was able to sort it out after struggling a lot and trying many things.
Keep learning & sharing.
I had the same issue when I was using a wrong image and the request couldn't be satisfied because the configurations were different.