Disable Istio default retry strategy (at least on POST requests) - kubernetes

I have an application (microservices-based) running on kubernets with Istio 1.7.4
The microservices has its own mechanisms of transaction compensation on integration failures.
But Istio is retrying requests, when some integrations has 503 status code responses. I need to disabled it (at least on POST, which is non-idenpontent).
And let the application take care of it.
But I've tried so many ways without success. Can someone help me?

Documentation
From Istio Retries documentation: Default retry is hardcoded and it's value equal to 2.
The interval between retries (25ms+) is variable and determined
automatically by Istio, preventing the called service from being
overwhelmed with requests. The default retry behavior for HTTP
requests is to retry twice before returning the error.
Btw, it was initially 10, but decreased to 2 in Enable retries for specific status codes and reduce num retries to 2 commit.
workaround is to use virtual services
you can adjust your retry settings on a per-service basis in virtual
services without having to touch your service code. You can also
further refine your retry behavior by adding per-retry timeouts,
specifying the amount of time you want to wait for each retry attempt
to successfully connect to the service.
Examples
The following example configures a maximum of 3 retries to connect to this service subset after an initial call failure, each with a 2 second timeout.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: ratings
spec:
hosts:
- ratings
http:
- route:
- destination:
host: ratings
subset: v1
retries:
attempts: 3
perTryTimeout: 2s
Your case. Disabling retries. Taken from Disable globally the default retry policy:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: no-retries-for-one-service
spec:
hosts:
- one-service.default.svc.cluster.local
http:
- retries:
attempts: 0
route:
- destination:
host: one-service.default.svc.cluster.local

Related

HTTPRoute set a timeout

I am trying to set up a multi-cluster architecture. I have a Spring Boot API that I want to run on a second cluster (for isolation purposes). I have set that up using the gateway.networking.k8s.io API. I am using a Gateway that has an SSL certificate and matches an IP address that's registered to my domain in the DNS registry. I am then setting up an HTTPRoute for each service that I am running on the second cluster. That works fine and I can communicate between our clusters and everything works as intended but there is a problem:
There is a timeout of 30s by default and I cannot change it. I want to increase it as the application in the second cluster is a WebSocket and I obviously would like our WebSocket connections to stay open for more than 30s at a time. I can see that in the backend service that's created from our HTTPRoute there is a timeout specified as 30s. I found a command to increase it gcloud compute backend-services update gkemcg1-namespace-store-west-1-8080-o1v5o5p1285j --timeout=86400
When I run that command it would increase the timeout and the webSocket connection will be kept alive. But after a few minutes this change gets overridden (I suspect that it's because it's managed by the yaml file). This is the yaml file for my backend service
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1beta1
metadata:
name: public-store-route
namespace: namespace
labels:
gateway: external-http
spec:
hostnames:
- "my-website.example.org"
parentRefs:
- name: external-http
rules:
- matches:
- path:
type: PathPrefix
value: /west
backendRefs:
- group: net.gke.io
kind: ServiceImport
name: store-west-1
port: 8080
I have tried to add either a timeout, timeoutSec, or timeoutSeconds under every level with no success. I always get the following error:
error: error validating "public-store-route.yaml": error validating data: ValidationError(HTTPRoute.spec.rules[0].backendRefs[0]): unknown field "timeout" in io.k8s.networking.gateway.v1beta1.HTTPRoute.spec.rules.backendRefs; if you choose to ignore these errors, turn validation off with --validate=false
Surely there must be a way to configure this. But I wasn't able to find anything in the documentation referring to a timeout. Am I missing something here?
How do I configure the timeout?
Edit:
I have found this resource: https://cloud.google.com/kubernetes-engine/docs/how-to/configure-gateway-resources
I have been trying to set up a LBPolicy and attatch it it the Gateway, HTTPRoute, Service, or ServiceImport but nothing has made a difference. Am I doing something wrong or is this not working how it is supposed to? This is my yaml:
kind: LBPolicy
apiVersion: networking.gke.io/v1
metadata:
name: store-timeout-policy
namespace: sandstone-test
spec:
default:
timeoutSec: 50
targetRef:
name: public-store-route
group: gateway.networking.k8s.io
kind: HTTPRoute

Availability with Kubernetes

We run an internal a healthcheck of the service every 5 seconds. And we run Kubernetes liveness probes every 1 second. So in the worst scenario the Kubernetes loadbalancer has up-to-date information every 6 seconds.
My question is what happens when a client request hits a pod which is broken but not seen by the loadbalancer as unhealthy? Should the client implement a logic with retries? Or should we implement backend logic to handle the cases when a request hits a pod which is not yet seen as unhealthy by the loadbalancer?
Not sure how your architecture is however LoadBalancers are generally set with the ingress controller like Nginx and etc.
Load Balancer backed by the ingress controller forwards the traffic to the K8s service, and the K8s service mostly manages the request routing to PODs, not LB.
Based on the Readiness K8s service route the request to PODs, so if your POD is NotReady, the request won't reach there. Due to any delay if the request reaches to that POD there could be a chance you get internal error or so in return.
Retries
yes, you implement the retries at the client side but if you are on K8s, you can offload the retries part to the service mesh. it would be easy to maintain and integrate retries logic with the K8s and service mesh.
You can use the service mesh like Istio and implement the retries policy at virtual service level
retries:
attempts: 5
retryOn: 5xx
Virtual service
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: ratings
spec:
hosts:
- ratings
http:
- route:
- destination:
host: ratings
subset: v1
retries:
attempts: 3
perTryTimeout: 2s
Read more at : https://istio.io/latest/docs/concepts/traffic-management/#retries

Where to put Istio network retry

I'm very new to Istio and not a Kubernete's expert, though I have used the latter. I respectfully ask for your understanding and a bit more details than you might normally include.
For simplicity, say I have two services, both Java/SpringBoot. Service A listens to requests from the outside world, Service B listens to requests from Service A. Service B is scalable, and at points might return 503. I wish to have service A retry calls to service B in a configurable non-programmatic way. Here's a blog/link that I tried to follow that I think is very similar.
https://samirbehara.com/2019/06/05/retry-design-pattern-with-istio/
Two questions:
It may seem obvious, but if I wanted to define a virtual retriable service, do I add it to the existing application.yml file for the project or is there some other file that the networking.istio.io/v1alpha3 goes?
Would I define the retry configuration in the yaml/repo for Service A or Service B? I can think of reasons for architecting Istio either way.
Thanks,
Woodsman
If the scalable service is returning 503, it makes sense to add a virtual service just like the blog example for serviceB and make serviceA connect to virtualServiceB which will do the retries to ServiceB
Now, for this to work (from within the cluster):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: serviceB
spec:
hosts:
– serviceB
http:
– route:
– destination:
host: serviceB
retries:
attempts: 3
perTryTimeout: 2s
These lines:
hosts:
– serviceB
Will tell the default Istio Gateway (mesh) to route all the traffic not to serviceB, but to virtualServiceB first which will then route to ServiceB. Then you will have retries from virtualServiceB to serviceB.
Hope this helps

How to replicate some traffic on Kubernetes and divert it to the Service for investigation

I'm using istio and I know that I can define weights in Virtual Service and divert traffic to different services.
My question is: how to amplify some of the traffic and direct the amplified traffic to the validation service? This amplified traffic will not go back to the original source but will be closed within the cluster. In other words, it does not bother the user.
I'm not even sure if there is an ecosystem, feature or application that provides this kind of mechanism. I don't even know if there is an ecosystem or application that provides such a mechanism and I don't know what it is called, so I'm having trouble finding it.
Thanks.
OP found a soultion by themselves, in the comments, hence the CW.
In this scenario, the best solution is to use Istio Mirroring - also called shadowing.
When traffic gets mirrored, the requests are sent to the mirrored service with their Host/Authority headers appended with -shadow. For example, cluster-1 becomes cluster-1-shadow.
Also, it is important to note that these requests are mirrored as “fire and forget”, which means that the responses are discarded.
To create mirroring rule, you have to create VirtualService with mirror and mirrorPercentage fields.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: httpbin
spec:
hosts:
- httpbin
http:
- route:
- destination:
host: httpbin
subset: v1
weight: 100
mirror:
host: httpbin
subset: v2
mirrorPercentage:
value: 100.0
This route rule sends 100% of the traffic to v1. The last stanza specifies that you want to mirror (i.e., also send) 100% of the same traffic to the httpbin:v2 service.
[source]

Does Istio allow to configure a maximum response timeout for a circuit breaker to open? How?

I'm checking the documentation for the DestinationRule, where there are several examples of a circuit breaking configuration, e.g:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: bookinfo-app
spec:
host: bookinfoappsvc.prod.svc.cluster.local
trafficPolicy:
connectionPool:
tcp:
connectTimeout: 30ms
...
The connectionPool.tcp element offers a connectTimeout. However what I need to configure is a maximum response timeout. Imagine I want to open the circuit if the service takes longer than 5 seconds to answer. Is it possible to configure this in Istio? How?
Take a look at Tasks --> Traffic Management --> Setting Request Timeouts:
A timeout for http requests can be specified using the timeout field
of the route rule. By default, the timeout is 15 seconds [...]
So, you must set the http.timeout in the VirtualService configuration.
Take a look at this example from the Virtual Service / Destination official docs:
The following VirtualService sets a timeout of 5s for all calls to
productpage.prod.svc.cluster.local service in Kubernetes.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: my-productpage-rule
namespace: istio-system
spec:
hosts:
- productpage.prod.svc.cluster.local # ignores rule namespace
http:
- timeout: 5s
route:
- destination:
host: productpage.prod.svc.cluster.local
http.timeout: Timeout for HTTP requests.