Connection refused error in outbound request in k8s app container. Istio? - kubernetes

Updated
I have some script that initializes our service.
The script fails when it runs in the container because of connection refused error in the first outbound request (to external service) in the script.
We tried to add a loop that makes curl and if it fails, re-try, if not - continuous the script.
Sometimes it succeeds for the first time, sometimes it fails 10-15 times in a row.
We recently started using istio
What may be a reason??

It is a famous istio bug https://github.com/istio/istio/issues/11130 ( App container unable to connect to network before Istio's sidecar is fully running) it seems the Istio proxy will not start in parallel , it is waiting for the app container to be ready. a sequential startup sequence as one blogger mentioned (https://medium.com/#marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74) quote: most Kubernetes users assume that after a pod’s init containers have finished, the pod’s regular containers are started in parallel. It turns out that’s not the case.
containers will start in order defined by the Deployment spec YAML.
so the biggest question is will the Istio proxy envoy will start while the first container is stuck in a curl-loop . (egg and chicken problem) .
App container script performs:
until curl --head localhost:15000 ; do echo "Waiting for Istio Proxy to start" ; sleep 3 ; done
as far as I saw: that script doesn't help a bit. proxy is up but connection to external hostname return "connection refused"

With istio 1.7 comes a new feature that configures the pod to start the sidecar first and hold every other container untill the sidecar is started.
Just set values.proxy.holdApplicationUntilProxyStarts to true.
Please note that the feature is still experimental.

Related

k8s container initialization and load balancing

I have a deployment with one pod with my custom image. After executing kubectl create -f deployment.yaml, this pod becomes running. I see that everything is fine and it has "running" state in kubectl's output. But, i have one initialization script to start Apache Tomcat, it takes around 40-45 seconds to execute it and up server inside.
I also have load balancer deployment with nginx. Nginx redirects incoming requests to Apache Tomcat via proxy_pass. When i scale my deployment for 2 replicas and shut down one of them, sometimes application becomes stuck and freezing.
I feel that load balancing by k8s works not correctly, k8s is trying to use pod, which is initializing by script right now.
How can i tell k8s that pod in deployment hasn't been initialized and not to use it until it becomes totally up?
If I understand correctly mostly your problem is related to the application not being ready to accept requests because your initialization script hasn’t finished.
For that situation, you can easily setup different types of probes, such as liveliness and readiness. Such a solution would be useful, as your application wouldn’t be considered ready to accept requests unless the whole pod would start up and signal that it is alive.
Here you can read more about it: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

K8s does strange networking stuff that breaks application designs

I discovered a strange behavior with K8s networking that can break some applications designs completely.
I have two pods and one Service
Pod 1 is a stupid Reverse Proxy (I don't know the implementation)
Pod 2 is a Webserver
The mentioned Service belongs to pod 2, the webserver
After the initial start of my stack I discovered that Pod 1 - the Reverse Proxy is not able to reach the webserver on the first attempt for some reason, ping is working fine and curl also.
Now I tried wget mywebserver inside of Pod 1 - Reverse Proxy and got back the following:
wget mywebserver
--2020-11-16 20:07:37-- http://mywebserver/
Resolving mywebserver (mywebserver)... 10.244.0.34, 10.244.0.152, 10.244.1.125, ...
Connecting to mywebserver (mywebserver)|10.244.0.34|:80... failed: Connection refused.
Connecting to mywebserver (mywebserver)|10.244.0.152|:80... failed: Connection refused.
Connecting to mywebserver (mywebserver)|10.244.1.125|:80... failed: Connection refused.
Connecting to mywebserver (mywebserver)|10.244.2.177|:80... connected.
Where 10.244.2.177 is the Pod IP of the Webserver.
The problem to me it seems is that the Reverse-Proxy does not try to trigger the attempt to forward the package twice, instead it only tries once where it fails like in the wget cmd above and the request gets dropped as the backed is not reachable due to fancy K8s IPtables stuff it seems...
If I configure the reverse-proxy not to use the Service DNS-name for load-off and instead use the Pod IP (10.244.2.177) everything is working fine and as expected.
I already tried this with a variety of CNI Providers like: Flannel, Calico, Canal, Weave and also Cilium as Kube-Proxy is not used with Cilium but all of them failed and all of them doing fancy routing nobody clearly understands out-of-the-box. So my question is how can I make K8s routing work immediately at this point? I already have reimplemented my whole stack to docker-swarm just to see if it works, and it does, flawlessly! So this issue has to do something with K8s routing scheme it seems.
Just to exclude misconfiguration from my side I also tried this with different ready-to-use K8s solutions like managed K8s from Digital-Ocean and or self-hosted RKE. All have the same behavior.
Does somebody maybe have a Idea what the problem might be and how to fix this behavior of K8s?
I might also be very useful to know what actually happens at the wget request, as this remains a mystery to me.
Many thanks in advance!
It turned out that I had several misconfigurations at my K8s Deployment.
I first removed ClusterIP: None as this leads to the behavior wget shows above at my question. Beside I've set app: and tier: wrong at my deployment. Anyways now everything is working fine and wget has a proper connection.
Thanks again

Starting a container/pod after running the istio-proxy

I am trying to implement a service mesh to a service with Kubernetes using Istio and Envoy. I was able to set up the service and istio-proxy but I am not able to control the order in which the container and istio-proxy are started.
My container is the first started and tries to access an external resource via TCP but at that time, istio-proxy has not completely loaded and so does the ServiceEntry for the external resource
I tried adding a panic in my service and also tried with a sleep of 5 seconds before accessing the external resource.
Is there a way that I can control the order of these?
On istio version 1.7.X and above you can add configuration option values.global.proxy.holdApplicationUntilProxyStarts, which causes the sidecar injector to inject the sidecar at the start of the pod’s container list and configures it to block the start of all other containers until the proxy is ready. This option is disabled by default.
According to https://istio.io/latest/news/releases/1.7.x/announcing-1.7/change-notes/
I don't think you can control the order other than listing the containers in a particular order in your pod spec. So, I recommend you configure a Readiness Probe so that you are pod is not ready until your service can send some traffic to the outside.
Github issue here:
Support startup dependencies between containers on the same Pod
We're currently recommending that developers solve this problem
themselves by running a startup script on their application container
which delays application startup until Envoy has received its initial
configuration. However, this is a bit of a hack and requires changes
to every one of the developer's containers.

Application container unable to access network before sidecar ready

I was trying fortio server/client application on istio. I used istoctl for injecting istio dependency and my serer pod was came up fine. But client pod was giving connection refused error due to proxy sidecar is not yet ready to handle connection request of client. Please help me addressing this issue. For reference attaching my yaml files.
This is by design and there is no way around it.
The part responsible for configuration of the iptables for capturing the traffic is run as an init container, which ensures that the required rules are in place before any of the normal pod containers start up. If you use istio for all the traffic, then until it's container is ready, no network traffic will reach in/out of the container.
You should make sure your application handles this right. Apps should be able to withstand unavailability of it's dependencies for a time, both on startup and during operation. In worst case you can introduce your own handling in form of ie. custom entrypoint that awaits for communication to be up.

First request to a new ReplicaSet times out

I have a Kubernetes cluster on AWS, set up with kops.
I set up a Deployment that runs an Apache container and a Service for the Deployment (type: LoadBalancer).
When I update the deployment by running kubectl set image ..., as soon as the first pod of the new ReplicaSet becomes ready, the first couple of requests to the service time out.
Things I have tried:
I set up a readinessProbe on the pod, works.
I ran curl localhost on a pod, works.
I performed a DNS lookup for the service, works.
If I curl the IP returned by that DNS lookup inside a pod, the first request will timeout. This tells me it's not an ELB issue.
It's really frustrating since otherwise our Kubernetes stack is working great, but every time we deploy our application we run the risk of a user timing out on a request.
After a lot of debugging, I think I've solved this issue.
TL;DR; Apache has to exit gracefully.
I found a couple of related issues:
https://github.com/kubernetes/kubernetes/issues/47725
https://github.com/kubernetes/ingress-nginx/issues/69
504 Gateway Timeout - Two EC2 instances with load balancer
Some more things I tried:
Increase the KeepAliveTimeout on Apache, didn't help.
Ran curl on the pod IP and node IPs, worked normally.
Set up an externalName selector-less service for a couple of external dependencies, thinking it might have something to do with DNS lookups, didn't help.
The solution:
I set up a preStop lifecycle hook on the pod to gracefully terminate Apache to run apachectl -k graceful-stop
The issue (at least from what I can tell), is that when pods are taken down on a deployment, they receive a TERM signal, which causes apache to immediately kill all of its children. This might cause a race condition where kube-proxy still sends some traffic to pods that have received a TERM signal but not terminated completely.
Also got some help from this blog post on how to set up the hook.
I also recommend increasing the terminationGracePeriodSeconds in the PodSpec so apache has enough time to exit gracefully.