What could cause GCP Cloud Run to respond 10-30% slower than the same service running in Kubernetes (GKE)?

I have a web service running in GKE behind a GCP global HTTPS load balancer. The same code is also deployed to Cloud Run. The load balancer splits traffic 50-50 between GKE and Cloud Run. According to the load balancer, the total latency from Cloud Run is about 10 to 30% higher than it is from GKE.
[Chart: https/total_latencies; top line is Cloud Run average latency, bottom line is GKE average latency]
The same code is deployed to both environments in the same region. The service has no database dependencies. It does make HTTP requests over the public internet, but it's unlikely that those requests are routed differently from GKE than from Cloud Run. There's a clear latency difference between Cloud Run and GKE regardless of HTTP status code class (2xx, 3xx, 4xx, or 5xx). The gap is also consistent over time, which indicates the issue is not caused by cold starts.
To rule out the possibility that the Cloud Run instances are overburdened, they are given more vCPU cores and memory than the GKE pods. The target number of concurrent requests per Cloud Run instance is also set much lower than the target request rate used by the GKE service's HPA. In short, the Cloud Run instances are given many more resources than the GKE pods.
Based on the symptoms, there appears to be a systemic source of latency in Cloud Run. The only other metric that shows a difference between Cloud Run and GKE is the load balancer's https/backend_request_bytes_count metric, which is 2-3x higher for Cloud Run (about 1.2 KB for GKE versus 3.1 KB for Cloud Run). This is difficult to explain: the service receives only GET requests, so there's unlikely to be a difference in request size from clients unless the load balancer is adding 1.9 KB worth of headers when connecting to Cloud Run.
[Chart: https/backend_request_bytes_count; top line is Cloud Run, bottom line is GKE]
The only overhead that comes to mind is TLS connection setup, and that would only matter if the Google Cloud load balancer weren't reusing HTTP connections to the backend.
In summary: are there systemic reasons that Google Cloud Run's infrastructure would be slower than GKE's, and why would the request size between a Google Cloud load balancer and Cloud Run be 2-3x larger than between the same load balancer and GKE?
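For reference, a minimal sketch of the kind of probe that could test the connection-reuse hypothesis; the hostnames are hypothetical stand-ins for the Cloud Run service's direct URL and a direct route to the GKE service (bypassing the shared load balancer), and it times TCP connect, TLS handshake, and time to first byte separately:

```python
# Minimal probe sketch: time TCP connect, TLS handshake, and time to first byte
# separately for each backend. Hostnames below are hypothetical placeholders.
import socket
import ssl
import time

def probe(host: str, path: str = "/") -> dict:
    t0 = time.perf_counter()
    sock = socket.create_connection((host, 443), timeout=10)
    t_connect = time.perf_counter()
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(sock, server_hostname=host)
    t_handshake = time.perf_counter()
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    tls.sendall(request.encode())
    tls.recv(1)  # block until the first byte of the response arrives
    t_first_byte = time.perf_counter()
    tls.close()
    return {
        "tcp_connect_ms": round((t_connect - t0) * 1000, 1),
        "tls_handshake_ms": round((t_handshake - t_connect) * 1000, 1),
        "ttfb_ms": round((t_first_byte - t_handshake) * 1000, 1),
    }

# Hypothetical direct URLs for each backend, bypassing the shared load balancer.
for host in ("myservice-abc123-uc.a.run.app", "gke.myservice.example.com"):
    print(host, probe(host))
```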

Related

How to setup MetricBalancingThresholds and MetricActivityThresholds on Azure Service Fabric cluster?

I have a Service Fabric cluster with 7 nodes.
I am using Microsoft Azure Service Fabric version 8.2.1571.9590.
The problem is that the cluster is not balanced by CPU or memory usage.
It is balanced by PrimaryCount, ReplicaCount, and similar metrics, but not by CPU or memory usage.
The result is that some of our services are heavy consumers of CPU/RAM (the "noisy neighbour" issue): they consume more resources and starve other services in the process.
I know I can set MetricBalancingThresholds and MetricActivityThresholds for our cluster, but I don't know the metric names.
Based on this article: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-resource-manager-balancing
I figured that I can set up MetricBalancingThresholds and MetricActivityThresholds
(https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings).
I know that we can set those values in Azure portal / Service fabric resource / Custom fabric settings.
The problem is that I don't know which parameter names or metric names to use to set thresholds on CPU and memory.
The documentation says "PropertyGroup", but I don't know what the possible values are here:
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-fabric-settings#metricactivitythresholds
Earlier, CpuPercentageNodeCapacity and MemoryPercentageNodeCapacity were used in the PlacementAndLoadBalancing section; those were straightforward, but it seems they are deprecated.
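From the fabric-settings documentation, my understanding is that each of those sections takes a PropertyGroup of name/value pairs, where the name is a metric name and the value is the threshold. Expressed as a Python dict mirroring the ARM template's fabricSettings fragment (the metric name below is only a placeholder, since the real CPU/memory metric names are exactly what I'm asking for):

```python
# Sketch of the fabricSettings fragment shape only. "MyCustomMetric" is a
# placeholder; the actual metric names for CPU/memory balancing are unknown to me.
fabric_settings = [
    {
        "name": "MetricActivityThresholds",
        "parameters": [
            # Parameter name = metric name, value = activity threshold.
            {"name": "MyCustomMetric", "value": "500"},
        ],
    },
    {
        "name": "MetricBalancingThresholds",
        "parameters": [
            # Parameter name = metric name, value = balancing threshold.
            {"name": "MyCustomMetric", "value": "1.2"},
        ],
    },
]
```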

How to create circuit breaker for cloud run services?

I am trying to understand how we can create circuit breakers for Cloud Run services. In GKE we use an Istio-style service mesh; how do we implement the same thing on Cloud Run?
On GKE you'd set up a circuit breaker to prevent overloading your legacy backend systems from a surge in requests.
To accomplish the same on Cloud Run or Cloud Functions, you can set a maximum number of instances. From that documentation:
Specifying maximum instances in Cloud Run allows you to limit the scaling of your service in response to incoming requests, although this maximum setting can be exceeded for a brief period due to circumstances such as traffic spikes. Use this setting as a way to control your costs or to limit the number of connections to a backing service, such as to a database.
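For illustration, a minimal sketch of setting that cap programmatically with the Cloud Run Admin API's Python client (google-cloud-run); the project, region, and service names are hypothetical, and the same limit can also be set in the console or with gcloud:

```python
# Sketch: cap Cloud Run scaling as a crude circuit breaker by limiting max instances.
# Assumes the google-cloud-run client library; resource names below are hypothetical.
from google.cloud import run_v2

client = run_v2.ServicesClient()
name = "projects/my-project/locations/us-central1/services/my-service"

service = client.get_service(name=name)
service.template.scaling.max_instance_count = 10  # hard cap on instances
operation = client.update_service(service=service)
operation.result()  # block until the new revision has rolled out
```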

How to deploy workload with K8s on-demand (GKE)?

I need to deploy a GPU-intensive task on GCP. I want to use a Node.js Docker image and, within that container, run a Node.js server that listens for HTTP requests and runs a Python image processing script on demand (every time a new HTTP request containing the images to be processed is received). My understanding is that I need to deploy a load balancer with a static public IP address in front of the K8s cluster, which then builds/launches a container every time a new HTTP request comes in and destroys the container once processing is completed. Is container re-use not a concern? I have never worked with K8s before and I want to understand how it works; after reading the GKE documentation, this is how I imagine the architecture. What am I missing here?
runs a Python image processing script on-demand (every time that a new HTTP request is received containing the images to be processed)
This can be solved on Kubernetes, but it is not a very common kind of workload.
The project that supports your use case best is Knative, with its per-request auto-scaler. Google Cloud Run is the easiest way to use it, but if you want to run this within your own GKE cluster, you can enable it there.
That said, you can also design your Node.js service to integrate with the Kubernetes API server to create Jobs (a rough sketch of that approach follows below), but it is generally not good design to have ordinary workloads talk to the API server. It is better to use Knative or Google Cloud Run.
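For completeness, here is that Jobs sketch, shown with the official Python Kubernetes client rather than Node.js purely for illustration; the image name, namespace, and arguments are hypothetical:

```python
# Sketch: create a one-off Job per incoming request from inside the cluster.
# Assumes the official `kubernetes` Python client; names below are hypothetical.
from kubernetes import client, config

config.load_incluster_config()  # use config.load_kube_config() when running locally
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name="image-process-"),
    spec=client.V1JobSpec(
        ttl_seconds_after_finished=300,  # let Kubernetes clean up finished Jobs
        backoff_limit=1,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="processor",
                        image="gcr.io/my-project/image-processor:latest",
                        args=["--input", "gs://my-bucket/incoming/image.jpg"],
                        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
                    )
                ],
            )
        ),
    ),
)
batch.create_namespaced_job(namespace="default", body=job)
```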

How to get the latency of an application deployed in Kubernetes?

I have a simple Java-based application deployed in Kubernetes. I want to get the average latency of requests (GET and POST) sent to the application.
The Stackdriver Monitoring API has latency details for the load balancer, but those are only available after about 210 seconds, which is not sufficient in my case. How can I configure Kubernetes to get latency details every 30 seconds (or 1 minute)?
I would like the solution to be independent of Java so that I can use it for any application I deploy.
On GKE, you can use Stackdriver Trace, which is GCP-specific. I am currently fighting with the Python client library; hopefully the Java one is more mature.
Or you can use Jaeger, which is a CNCF project.
Use a Service Mesh
A service mesh will let you observe things like latency between your services without adding extra code to each application. Istio is one such implementation, and it is available on Google Kubernetes Engine.
Get uniform metrics and traces from any running applications without requiring developers to manually instrument their applications.
Istio’s monitoring capabilities let you understand how service performance impacts things upstream and downstream
See Istio on GCP
use a service mesh: software that helps you orchestrate, secure, and collect telemetry across distributed applications. A service mesh transparently oversees and monitors all traffic for your application, typically through a set of network proxies that sit alongside each microservice.
Welcome to the service mesh era
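As a rough illustration of what the mesh gives you, a minimal sketch that pulls p95 request latency for one workload from the Prometheus instance commonly deployed alongside Istio; the destination service name is hypothetical, Prometheus is assumed to be port-forwarded to localhost:9090, and the metric name is the one used by recent Istio releases:

```python
# Sketch: query Istio's standard request-duration histogram via the Prometheus HTTP API.
# Assumes Prometheus is reachable on localhost:9090; the service name is hypothetical.
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"
query = (
    "histogram_quantile(0.95, sum(rate("
    'istio_request_duration_milliseconds_bucket{'
    'destination_service="myapp.default.svc.cluster.local"}[1m])) by (le))'
)
resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print("p95 latency (ms):", result["value"][1])
```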

Kubernetes network utilization metrics

I'm running a Kubernetes cluster (GKE) on Google Compute Engine (GCE).
Through Heapster I am able to get different network metrics such as sent or received bytes, or error rate.
However, in order to better understand my application's (Pods') bottlenecks, it would be essential to understand how utilized the node's network is. Is it possible to query network utilization? If not, what metrics indicate my network health?
There is not a direct way on your side to monitor the VPC network, but there are some tools that might help you check that it is behaving as expected.
The only documented limits on VPC networks are the egress throughput caps, which depend on the number of cores the nodes have.
You can see the graphs for "Network Bytes" and "Network Packets" in your Google Cloud Console. They can be retrieved by going to:
Cloud Console -> Instance Groups -> Managed_Intance_Group_Name
or
Cloud Console-> VM Instances -> Node_Name
Network graphs for the pools and the nodes can also be found in the Stackdriver Account
(https://app.google.stackdriver.com) -> Resources -> Container Engine -> Cluster Name
Analyzing those graphs might help you to determine if your traffic is being throttled.
To obtain additional visibility you could also use cAdvisor or other tools mentioned here
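If the console graphs aren't enough, a minimal sketch of pulling per-instance egress byte counts from the Cloud Monitoring API, which you could compare against the documented per-core egress caps (assumes the google-cloud-monitoring Python client; the project ID is hypothetical):

```python
# Sketch: list per-VM egress byte counts from Cloud Monitoring.
# Assumes the google-cloud-monitoring client; the project ID is hypothetical.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)
series_iter = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "compute.googleapis.com/instance/network/sent_bytes_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in series_iter:
    instance = series.resource.labels["instance_id"]
    latest_point = series.points[0]  # points are returned newest first
    print(instance, latest_point.value.int64_value, "bytes sent (last sample)")
```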