How to create a circuit breaker for Cloud Run services? - kubernetes

I am trying to understand how we can create circuit breakers for Cloud Run services. On GKE we use a service mesh such as Istio for this; how do we implement the same thing on Cloud Run?

On GKE you'd set up a circuit breaker to prevent a surge in requests from overloading your legacy backend systems.
To accomplish the same on Cloud Run or Cloud Functions, you can set a maximum number of instances. From the documentation:
Specifying maximum instances in Cloud Run allows you to limit the scaling of your service in response to incoming requests, although this maximum setting can be exceeded for a brief period due to circumstances such as traffic spikes. Use this setting as a way to control your costs or to limit the number of connections to a backing service, such as to a database.
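For illustration, here is a minimal sketch of what that limit looks like in a Cloud Run service definition (the service name, image, and limit are placeholders); the same setting is available as the --max-instances flag on gcloud run deploy and gcloud run services update:

```yaml
# Sketch: cap a Cloud Run service at 10 container instances.
# Apply with: gcloud run services replace service.yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: legacy-proxy                       # placeholder service name
spec:
  template:
    metadata:
      annotations:
        # Upper bound on the number of container instances Cloud Run may create.
        autoscaling.knative.dev/maxScale: "10"
    spec:
      containers:
      - image: gcr.io/my-project/legacy-proxy   # placeholder image
```

Capping instances bounds the number of concurrent connections that can reach the backing system, which is the closest Cloud Run analogue to a mesh-level circuit breaker's connection limits.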

Related

What could cause GCP Cloud Run to respond 10-30% slower than the same service running in Kubernetes (GKE)?

I have a web service running in GKE behind a GCP global HTTPS load balancer. The same code is also deployed to Cloud Run. The load balancer splits traffic 50-50 between GKE and Cloud Run. According to the load balancer, the total latency from Cloud Run is about 10 to 30% higher than it is from GKE.
[Chart: https/total_latencies; top line is Cloud Run average latency, bottom line is GKE average latency]
The same code is deployed to both environments in the same region. The service does not have any dependencies on databases. It does make HTTP requests over the public internet, which makes it unlikely that the requests from GKE are getting routed differently than the ones in Cloud Run. There's a clear difference in latency between Cloud Run and GKE regardless of HTTP status code class, whether it's 2xx, 3xx, 4xx, or 5xx. There's also consistently a gap, indicating the issue is not due to cold starts.
To try to eliminate the possibility that the Cloud Run instances are overburdened, they are given more vCPU cores and memory than the GKE pods. The target number of concurrent requests is also set much lower than the target request rate used by the GKE service's HPA. In short, the Cloud Run instances are given many more resources than the GKE pods.
Based on the symptoms, it appears there could be a systemic source of latency from Cloud Run. The only other metric that shows a difference between Cloud Run and GKE is the load balancer's https/backend_request_bytes_count metric, which is 2-3x higher for Cloud Run (about 1.2 KB for GKE and 3.1 KB for Cloud Run). That is difficult to explain, because the service receives only GET requests, so there is unlikely to be a difference in request size from clients unless the load balancer is adding 1.9 KB worth of headers when connecting to Cloud Run.
[Chart: https/backend_request_bytes_count; top line is Cloud Run, bottom line is GKE]
The only overhead that comes to mind is TLS connection setup, and that would only matter if the Google Cloud load balancer weren't reusing HTTP connections.
In summary: are there systemic reasons that Google Cloud Run's infrastructure would be slower than GKE's, and why would the request size between a Google Cloud load balancer and Cloud Run be 2-3x larger than requests between the same load balancer and GKE?

What is the recommended way to autoscale a service based on open HTTP connections?

I have an EKS cluster in AWS where a specific service will work as a live feed to clients. I expect it to use very few resources, but it will use server-sent events, which require long-lived HTTP connections. Given that I will be serving thousands of users, I believe the best metric for autoscaling would be open connections.
My understanding is that k8s only has CPU and memory usage as out-of-the-box metrics for scaling; I may be wrong here. I looked into custom metrics, but the k8s documentation on that is extremely shallow.
Any suggestions or guidance on this are very welcome.
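Custom metrics do work with the HorizontalPodAutoscaler, but you need a metrics adapter (the Prometheus Adapter is a common choice on EKS) that exposes your metric through the custom metrics API. As a rough sketch, assuming each pod exports a gauge named open_connections that the adapter surfaces, the HPA could look like this (names and the target value are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: live-feed                    # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: live-feed                  # placeholder deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: open_connections       # hypothetical per-pod gauge exposed by the app
      target:
        type: AverageValue
        averageValue: "500"          # add replicas when pods average ~500 open SSE connections
```

The HPA itself is the easy part; the work is in exposing the connection count from the service (for example as a Prometheus gauge) and configuring the adapter to serve it under custom.metrics.k8s.io.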

How to get the latency of an application deployed in Kubernetes?

I have a simple Java-based application deployed in Kubernetes. I want to get the average latency of requests sent to the application (GET and POST).
The Stackdriver Monitoring API has the latency details of the load balancer, but those are only available after 210 seconds, which is not sufficient in my case. How can I configure Kubernetes to get the latency details every 30 seconds (or 1 minute)?
I would like the solution to be independent of Java so that I can use it for any application I deploy.
On GKE, you can use Stackdriver Trace, which is GCP-specific. I am currently fighting with the Python client library; hopefully the Java one is more mature.
Or you can use Jaeger, which is a CNCF project.
Use a Service Mesh
A Service Mesh will let you observe things like latency between your services without extra code for this in each applications. Istio is such an implementation that is available on Google Kubernetes Engine.
Get uniform metrics and traces from any running applications without requiring developers to manually instrument their applications.
Istio’s monitoring capabilities let you understand how service performance impacts things upstream and downstream
See Istio on GCP
use a service mesh: software that helps you orchestrate, secure, and collect telemetry across distributed applications. A service mesh transparently oversees and monitors all traffic for your application, typically through a set of network proxies that sit alongside each microservice.
Welcome to the service mesh era
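As a concrete illustration of the "no extra code" point: with Istio installed, labelling a namespace is enough for the injected sidecar proxies to record per-request latency for every pod in it. A minimal sketch (the namespace name is a placeholder):

```yaml
# Enable automatic Envoy sidecar injection for all pods in this namespace.
# The sidecars record request duration histograms (e.g. istio_request_duration_milliseconds),
# which Prometheus can scrape at whatever interval you configure (15-30s is typical),
# independently of the application language.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                # placeholder namespace
  labels:
    istio-injection: enabled
```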

Triggering a Kubernetes-based application from AppEngine

I'm currently looking into triggering some 3D rendering from an AppEngine-based service.
The idea is that input data is submitted by an API client to this web service, which then invokes an internal Kubernetes GPU enabled application ("rendering backend") to do the hard work.
GPU-enabled clusters are relatively expensive ($$$), so I really want the cluster to be up and running only on demand. I am trying to achieve that by setting the autoscaling minimum to 0 for the rendering backend.
The only pretty way of "triggering" a rendering task on such a cluster I could think of is via Pub/Sub push. Basically, I need something like Cloud Tasks, but those seem to be aimed at long-running tasks executed in App Engine, not Kubernetes. Plus, I like the way Pub/Sub decouples the web service from the rendering backend.
Google's Pub/Sub only allows pushing via HTTPS and only to a validated domain. It appears that Google is forcing me to completely "expose" my internal rendering backend by assigning a domain name to it, which feels ridiculous. I cannot just tell Pub/Sub to invoke http://loadbalancer.IP.address/handle_push.
This is making me doubt my architecture.
How would you go about building something like this on GCP?
From the GKE perspective:
You can have a cluster with a dedicated GPU-based node pool and schedule your pods there using taints and tolerations. Additionally, you can control the number of nodes in that node pool with autoscaling, so that the GPU nodes are only used when your pods need to be scheduled.
Note that this requires an additional, default non-GPU node pool where the system pods run.
For triggering, as long as your default pool is running, you'd be able to deploy your application and autoscaling should kick in automatically. For deploying from an App Engine application, you might want to consider talking to the Kubernetes API directly through a client library.
Finally, and considering the nature of your current goal (3D rendering), it might be best to use Kubernetes Jobs. With these, you can run a sporadic computational workload and let the node pool scale down once it is finished.
Wrapping up, you can have a minimal cluster with a zero-sized GPU node pool that scales up when a job tolerating its taint is scheduled there, and once the workload is finished, it scales back down automatically. These actions can be triggered from GAE using one of the Kubernetes client libraries; a sketch of such a job follows.
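As a rough sketch of that Job approach, assuming a GKE GPU node pool carrying the usual nvidia.com/gpu taint and a placeholder renderer image (the App Engine service would create an object like this through a Kubernetes client library):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: render-job-001                 # placeholder; generate a unique name per request
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4   # match your GPU node pool
      tolerations:
      - key: nvidia.com/gpu            # GKE taints GPU nodes with this key by default
        operator: Exists
        effect: NoSchedule
      containers:
      - name: renderer
        image: gcr.io/my-project/renderer:latest            # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1          # GPU request; drives scale-up of the zero-sized pool
```

Requesting nvidia.com/gpu causes the cluster autoscaler to add a GPU node if none is available, and once the Job completes and the node sits idle, the pool can scale back down to zero.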

Is it possible to control the action that Service Fabric takes when a service is 'unhealthy'?

I can see that there are configuration options that allow me to configure the policy Service Fabric uses to determine whether the cluster is considered healthy (based on thresholds for error health reports), but is it possible to get Service Fabric to take some positive action when it detects error health reports, such as restarting the application?
Not at the moment. You will need to write a service that takes actions based on health.
We are writing a system service that is able to analyze the cluster (including health) and take mitigation steps for problems identified. However, we don't have an estimated date at this time.