Bluemix Scalable Container Group auto recovery option - ibm-cloud

How does the auto-recovery option on a container in a scalable group work?
I have enabled it (by using --auto and it says Autorecovery: On in the web UI) but it did not try to restart the container when it crashed this morning.
The container in the group died at 2015-09-29T05:51:27.187Z and was manually restarted over one hour later at 2015-09-29T07:35:33.561Z
Restarting the container "solves" the runtime problem (a bug that is being fixed) until a user tries to do the same thing again in the app, crashing it.
According to the docs:
To start a new container when one of the containers in the group crashes or becomes unavailable, select the Enable autorecovery option. If you do not select this option, a new instance is not started automatically.
Listed in known problems:
Auto-recovery is not immediate
Auto-recovery for container groups might take more than 15 minutes for new systems to come online. Wait for auto-recovery to become available, which can take more than 15 minutes.

For every container in the group, the service will run a curl request against the port that you specified when you created the group.
If a container does not respond for whatever reason, the service assumes the container needs to be replaced. So it will destroy that container and create a new one in its place.
The fine print
1. The containers need to be running a service that responds to http requests on a particular port.
2. The port that you expose when you create the container group must be the same as the port in #1.
3. The port in #1/#2 must respond to http requests, not https requests. The route for the group (e.g. https://example.mybluemix.net) is secure, and traffic internally from the route to containers is also encrypted, so the containers in the group do not need to listen on https.
4. The service checks every container in the group once every 2 minutes or so.
5. If the service has to replace every instance in the group more than 3 times within roughly a 10-minute period, it will stop tearing down and recovering instances in the group from that point forward. On the Bluemix site, you might see the Autorecovery label switch from On to Off. This is to prevent a never-ending loop of teardowns and replacements of containers that are either always crashing or consistently non-responsive.

In the IBM Containers service, auto-recovery works by the service doing an http curl against the port that you specify when you launch the container group. If that port does not respond to the http curl, the service assumes the container needs to be recovered, so it will destroy that container and recreate it.

Related

Kubernetes image update breaks pods and have to kill deployments

So my setup on kubernetes is basically an external nginx load balancer that sends traffic for virtual hosts across the nodes.
Everything runs in Docker containers:
10 instances of a front-end pod, which is a compiled Angular app
10 instances of a pod with two containers: a "built" image of a Symfony app plus a dedicated php-fpm container for each pod
an external MySQL server on the local network, which runs a basic Docker container
10 CDN pods that simply run an nginx server to serve static content requests
10 pods that run a socket chat application via nginx
a dedicated network of OpenVidu servers
php-fpm pods for multiple cron jobs
All works fine and dandy until I, say, update the front-end image and roll an update out to the cluster. The pods all update with no problem, but I end up with a strange issue: pages not loading, or partially loading, or requests meant for the backend pods failing to load or somehow being served by the frontend pods. It's really quite random.
The only way to get it back up again is to destroy every deployment and fire them up again, and I've no idea what is causing it. Once it's all restarted, it all works again.
Just looking for ideas on what it could be. Has anyone experienced this behaviour before?

Connection refused error in outbound request in k8s app container. Istio?

Updated
I have some script that initializes our service.
The script fails when it runs in the container because of connection refused error in the first outbound request (to external service) in the script.
We tried adding a loop that makes a curl request; if it fails, it retries, and if it succeeds, the script continues.
Sometimes it succeeds on the first try, sometimes it fails 10-15 times in a row.
We recently started using Istio.
What could be the reason?
This is a well-known Istio issue: https://github.com/istio/istio/issues/11130 (App container unable to connect to network before Istio's sidecar is fully running). It seems the Istio proxy does not start in parallel; it waits for the app container. As one blogger explains (https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74): "most Kubernetes users assume that after a pod's init containers have finished, the pod's regular containers are started in parallel. It turns out that's not the case."
Containers are started in the order they are defined in the Deployment spec YAML.
So the biggest question is whether the Istio Envoy proxy will start while the first container is stuck in a curl loop (a chicken-and-egg problem).
App container script performs:
until curl --head localhost:15000 ; do echo "Waiting for Istio Proxy to start" ; sleep 3 ; done
As far as I saw, that script doesn't help a bit: the proxy is up, but the connection to the external hostname still returns "connection refused".
With Istio 1.7 comes a new feature that configures the pod to start the sidecar first and hold every other container until the sidecar is started.
Just set values.proxy.holdApplicationUntilProxyStarts to true.
Please note that the feature is still experimental.
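For reference, here is a minimal sketch of requesting this behaviour for a single workload via the pod-level proxy config annotation; the Deployment name and image are placeholders, and the annotation and field are taken from the Istio 1.7 documentation, so check them against the version you run.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                         # hypothetical name
spec:
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
      annotations:
        # Ask the injected sidecar to start first and hold the app container
        # until the proxy is ready (Istio 1.7+, still experimental).
        proxy.istio.io/config: |
          holdApplicationUntilProxyStarts: true
    spec:
      containers:
      - name: app
        image: example/my-service:latest   # placeholder image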

AWS ECS service running SSH behind Network Load Balancer + Target Group slow to deploy with CodeDeploy

I have an ECS service that serves an SSH process. I am deploying updates to this service through CodeDeploy. I noticed that this service is much slower to deploy than other services with identical images deployed at the same time using CodePipeline. The difference with this service is that it's behind an NLB (the others have no LB or are behind an ALB).
The service is set to 1 container, deploying at 200%/100%, so the service brings up 1 new container, ensures it's healthy, then removes the old one. What I see happen is:
New Container started in Initial state
3+ minutes later, New Container becomes Healthy. Old Container enters Draining
2+ minutes later, Old Container finishes Draining and stops
Deploying thus takes 5-7 minutes, mostly waiting for health checks or draining. However, I'm pretty sure SSH starts up very quickly, and I have the following settings on the target group which should make things relatively quick:
TCP health check on the correct port
Healthy/Unhealthy threshold: 2
Interval: 10s
Deregistration Delay: 10s
ECS Docker stop custom timeout: 65s
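For reference, here is a rough sketch of how those settings map onto CloudFormation; the logical names, VPC ID, port, and image are placeholders, and the stop timeout could instead be set instance-wide via the ECS agent's ECS_CONTAINER_STOP_TIMEOUT option.

Resources:
  SshTargetGroup:                            # hypothetical logical name
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Protocol: TCP                          # NLB target group
      Port: 22                               # placeholder traffic port
      TargetType: ip
      VpcId: vpc-0123456789abcdef0           # placeholder
      HealthCheckProtocol: TCP               # TCP health check on the traffic port
      HealthCheckIntervalSeconds: 10
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 2
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: "10"

  SshTaskDefinition:                         # hypothetical logical name; roles and
    Type: AWS::ECS::TaskDefinition           # networking settings omitted for brevity
    Properties:
      Family: ssh-service
      ContainerDefinitions:
        - Name: ssh
          Image: example/ssh-service:latest  # placeholder image
          Memory: 128                        # placeholder reservation
          StopTimeout: 65                    # seconds between SIGTERM and SIGKILL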
So the minimum time from SSH being up to the old container being terminated would be:
2*10=20s for TCP health check to turn to Healthy
10s for the deregistration delay before Docker stop
65s for the Docker stop timeout
This is 95 seconds, which is a lot less than the observed 5-7 minutes. Other services take 1-3 minutes, and the LB/Target Group timings are not nearly as aggressive there.
Any ideas why my service behind an NLB seems slow to cycle through these lifecycle transitions?
You are not doing anything wrong here; this simply appears to be a (current) limitation of this product.
I recently noticed similar delays in registration/availability time with ECS services behind an NLB and decided to explore. I created a simple Javascript TCP echo server and set it up as an ECS service behind an NLB (ECS service count of 1). Like you, I used a TCP healthcheck with a healthy/unhealthy threshold of 2 and interval/deregistration delay of 10 seconds.
After the initial deploy was successful and the service reachable via the NLB, I wanted to see how long it would take for service to be restored in the event of a complete failure of the underlying instance. To simulate, I killed the service via the ECS console. After several iterations of this test, I consistently observed a timeline similar to the following (times are in seconds):
0s: killed service
5s: ECS reports old service draining; Target Group shows service draining; ECS reports new service instance is started
15s: ECS reports new task is registered; Target Group shows new instance with status of 'initial'
135s: TCP healthcheck traffic from the load balancer starts arriving for the service (as measured by tcpdump on the EC2 host running the container)
225s: Target Group finally marks the service as 'healthy'; ECS reports service has reached a steady state
I performed the same tests with a simple express app behind an ALB, and the gap between ECS starting the service and the ALB reporting it healthy was 10-15 seconds. The best result we achieved testing the NLB was 3.5 minutes from service stop to full availability.
I shared these findings with AWS via support case, asking specifically for clarification on why there was a consistent 120 second gap before the NLB started healthchecking the service and why we consistently saw 90-120 seconds between the beginning of healthchecks and service availability. They confirmed that this behavior is known but did not offer a time for resolution or a strategy to decrease latency in service availability.
Unfortunately, this will not do much to help resolve your issue, but at least you can know that you're not doing anything wrong.

Postgres connection refused (via CloudSQL proxy) when doing rolling update Kubernetes

When I do a rolling update, I get exceptions from Sentry saying:
DatabaseError('server closed the connection unexpectedly. This probably means the server terminated abnormally before or while processing the request.',...)
I have two containers running inside each Pod, my app container and a cloudsql-proxy container, which the app container uses to communicate to Cloud SQL.
Is there a way to make sure that my app container goes down first during the 30 seconds of grace period (terminationGracePeriodSeconds)?
In other words, I want to drain the connections and have all the current requests finish before the cloudsql-proxy is taken out.
It would be ideal if I could specify that the app container be taken down first during the 30 seconds of grace period, and then the cloudsql-proxy.
This discussion suggests setting terminationGracePeriodSeconds or a preStop hook in the manifest.
Another idea that could work is running the two containers in different pods to allow granular control over the rolling update. You might also want to consider using Init Containers on your deployment to allow the proxy to be ready before your app container.
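As a rough sketch of the preStop idea: the container names, images, and the sleep duration below are illustrative, and it assumes the proxy image provides a shell. The sleep simply keeps the proxy alive while the app container finishes in-flight requests during the grace period.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                               # hypothetical name
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 30      # the grace period mentioned above
      containers:
      - name: app
        image: example/my-app:latest         # placeholder image
      - name: cloudsql-proxy
        image: example/cloudsql-proxy:latest # placeholder; assumes a shell is available
        lifecycle:
          preStop:
            exec:
              # Delay the proxy's shutdown so the app container can drain its
              # requests before losing its database path.
              command: ["sh", "-c", "sleep 25"]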

First request to a new ReplicaSet times out

I have a Kubernetes cluster on AWS, set up with kops.
I set up a Deployment that runs an Apache container and a Service for the Deployment (type: LoadBalancer).
When I update the deployment by running kubectl set image ..., as soon as the first pod of the new ReplicaSet becomes ready, the first couple of requests to the service time out.
Things I have tried:
I set up a readinessProbe on the pod, works.
I ran curl localhost on a pod, works.
I performed a DNS lookup for the service, works.
If I curl the IP returned by that DNS lookup inside a pod, the first request will time out. This tells me it's not an ELB issue.
It's really frustrating since otherwise our Kubernetes stack is working great, but every time we deploy our application we run the risk of a user timing out on a request.
After a lot of debugging, I think I've solved this issue.
TL;DR: Apache has to exit gracefully.
I found a couple of related issues:
https://github.com/kubernetes/kubernetes/issues/47725
https://github.com/kubernetes/ingress-nginx/issues/69
504 Gateway Timeout - Two EC2 instances with load balancer
Some more things I tried:
Increase the KeepAliveTimeout on Apache, didn't help.
Ran curl on the pod IP and node IPs, worked normally.
Set up an externalName selector-less service for a couple of external dependencies, thinking it might have something to do with DNS lookups, didn't help.
The solution:
I set up a preStop lifecycle hook on the pod that gracefully terminates Apache by running apachectl -k graceful-stop.
The issue (at least from what I can tell) is that when pods are taken down during a deployment, they receive a TERM signal, which causes Apache to immediately kill all of its children. This can cause a race condition where kube-proxy still sends some traffic to pods that have received a TERM signal but have not terminated completely.
I also got some help from this blog post on how to set up the hook.
I recommend increasing the terminationGracePeriodSeconds in the PodSpec as well, so Apache has enough time to exit gracefully.
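For illustration, a minimal sketch of the resulting Deployment spec; the container name, image, and grace period value are illustrative, and only the relevant fields are shown.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache                               # hypothetical name
spec:
  selector:
    matchLabels:
      app: apache
  template:
    metadata:
      labels:
        app: apache
    spec:
      terminationGracePeriodSeconds: 60      # give Apache time to finish in-flight requests
      containers:
      - name: apache
        image: httpd:2.4                     # illustrative image
        lifecycle:
          preStop:
            exec:
              # Stop accepting new connections and let existing requests finish,
              # instead of Apache killing its children on SIGTERM.
              command: ["apachectl", "-k", "graceful-stop"]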