zuul proxy slowness - RibbonLoadBalancingHttpClient - httpclient

Firstly, I have only basic knowledge of Java. I have some microservices and am currently using Zuul/Eureka to proxy them.
I noticed that calling a microservice directly gives roughly three times the throughput of calling it through Zuul, so I'm wondering whether my Zuul configuration is wrong.
ab output:
Calling microservice directly:
Concurrency Level: 10
Time taken for tests: 5.938 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 37750000 bytes
HTML transferred: 36190000 bytes
Requests per second: 1684.20 [#/sec] (mean)
Time per request: 5.938 [ms] (mean)
Time per request: 0.594 [ms] (mean, across all concurrent requests)
Transfer rate: 6208.84 [Kbytes/sec] received
Calling through zuul:
Concurrency Level: 10
Time taken for tests: 15.049 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 37990000 bytes
HTML transferred: 36190000 bytes
Requests per second: 664.52 [#/sec] (mean)
Time per request: 15.049 [ms] (mean)
Time per request: 1.505 [ms] (mean, across all concurrent requests)
Zuul config:
server:
  port: 7001

zuul:
  #Services will be mapped under the /api URI
  prefix: /api
  sslHostnameValidationEnabled: false
  host:
    maxTotalConnections: 800
    maxPerRouteConnections: 200

endpoints:
  restart:
    enabled: true
  shutdown:
    enabled: true
  health:
    sensitive: false

eureka:
  instance:
    hostname: localhost
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/

ribbon:
  eureka:
    enabled: true

spring:
  application:
    name: zuul-server
    id: zuul-server
Noticed that Zuul uses a lot of CPU compared to the microservice itself, so I took a thread dump. My suspicion is that RibbonLoadBalancingHttpClient is being instantiated over and over again.
Thread dump: https://1drv.ms/t/s!Atq1lsqOLA98mHjh0lSJHPJj5J_I

The zuul.host.* properties you specified apply only to Zuul routes where a "url" is given directly; they do not apply to serviceId routes resolved through Eureka. See here. You may want to increase the Ribbon-level total HTTP connections and connections per host and rerun your test. Here is an example config:
ribbon:
  ReadTimeout: 30000
  ConnectTimeout: 1000
  MaxTotalHttpConnections: 1600
  MaxConnectionsPerHost: 800
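If you only want to raise the limits for one route rather than globally, Ribbon properties can also be namespaced under the serviceId, the same way the Feign question further down uses organizationService.ribbon. A minimal sketch, assuming a hypothetical serviceId of my-service and that the pool-size keys above are honored per client as well:

my-service:
  ribbon:
    MaxTotalHttpConnections: 1600
    MaxConnectionsPerHost: 800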
In my tests with Zuul, I do remember that the maximum response times of some requests were much higher than for requests sent directly to the target (bypassing Zuul), but the 95th and 99th percentiles were always comparable, with roughly a 200ms difference from the direct requests to the target server.

Related

Default timeout value of istio

I have a service in which I have added a delay of 5 minutes, so a request to this service takes 5 minutes to return a response.
I have deployed this service in Kubernetes with Istio v1.5. When I call the service through the ingress gateway, I get a timeout after 3 minutes.
{"res_tx_duration":"-","route_name":"default","user_agent":"grpc-java-netty/1.29.0","response_code_details":"-","start_time":"****","request_id":"****","method":"POST","upstream_host":"127.0.0.1:6565","x_forwarded_for":"****","bytes_sent":"0","upstream_cluster":"****","downstream_remote_address":"****","req_duration":"0","path":"/****","authority":"****","protocol":"HTTP/2","upstream_service_time":"-","upstream_local_address":"-","duration":"180000","upstream_transport_failure_reason":"-","response_code":"0","res_duration":"-","response_flags":"DC","bytes_received":"5"}
I tried setting the timeout in the VirtualService to more than 3 minutes, but that is not applied; only timeouts of less than 3 minutes set in the VirtualService take effect.
route:
- destination:
    host: demo-service
    port:
      number: 8000
timeout: 240s
Is there any other configuration where we can set the timeout, other than VirtualService?
Is 3 minutes (180s) the maximum value we can set in the VirtualService?
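For context, the timeout field in the snippet above sits at the route level of the http entry (a sibling of route, not part of destination). A complete resource would look roughly like this; the metadata name is an assumption, and the hosts entry reuses demo-service from the question:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: demo-service
spec:
  hosts:
  - demo-service
  http:
  - route:
    - destination:
        host: demo-service
        port:
          number: 8000
    timeout: 240s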

Disable Istio default retry strategy (at least on POST requests)

I have an application (microservices-based) running on Kubernetes with Istio 1.7.4.
The microservices have their own mechanisms for transaction compensation on integration failures.
But Istio retries requests when some integrations respond with a 503 status code. I need to disable this (at least for POST, which is not idempotent)
and let the application take care of it.
I've tried many approaches without success. Can someone help me?
Documentation
From the Istio Retries documentation: the default retry policy is hardcoded and its value is equal to 2.
The interval between retries (25ms+) is variable and determined
automatically by Istio, preventing the called service from being
overwhelmed with requests. The default retry behavior for HTTP
requests is to retry twice before returning the error.
By the way, it was initially 10, but was decreased to 2 in the "Enable retries for specific status codes and reduce num retries to 2" commit.
The workaround is to use virtual services:
you can adjust your retry settings on a per-service basis in virtual
services without having to touch your service code. You can also
further refine your retry behavior by adding per-retry timeouts,
specifying the amount of time you want to wait for each retry attempt
to successfully connect to the service.
Examples
The following example configures a maximum of 3 retries to connect to this service subset after an initial call failure, each with a 2 second timeout.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - route:
    - destination:
        host: ratings
        subset: v1
    retries:
      attempts: 3
      perTryTimeout: 2s
Your case: disabling retries. Taken from "Disable globally the default retry policy":
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: no-retries-for-one-service
spec:
  hosts:
  - one-service.default.svc.cluster.local
  http:
  - retries:
      attempts: 0
    route:
    - destination:
        host: one-service.default.svc.cluster.local

How to debug failed requests with client_disconnected_before_any_response

We have an HTTP(S) Load Balancer created by a Kubernetes ingress, which points to a backend formed by a set of pods running nginx and Ruby on Rails.
Taking a look at the load balancer logs we have detected an increasing number of requests with a response code of 0 and statusDetails = client_disconnected_before_any_response.
We're trying to understand why this is happening, but we haven't found anything relevant. There is nothing in the nginx access or error logs.
This is happening for multiple kinds of requests, from GET to POST.
We also suspect that sometimes, despite the request being logged with that error, the request is actually passed to the backend. For instance, we're seeing PG::UniqueViolation errors due to identical sign-up requests being sent twice to the backend in our sign-up endpoint.
Any kind of help would be appreciated. Thanks!
 UPDATE 1
As requested here is the yaml file for the ingress resource:
 UPDATE 2
I've created a log-based Stackdriver metric, to count the number of requests that present this behavior. Here is the chart:
The big peaks approximately match the timestamp for these kubernetes events:
Full error: Readiness probe failed: Get http://10.48.1.28:80/health_check: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
So it seems sometimes the readiness probe for the pods behind the backend fails, but not always.
Here is the definition of the readinessProbe
readinessProbe:
  failureThreshold: 3
  httpGet:
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: [redacted]
    path: /health_check
    port: 80
    scheme: HTTP
  initialDelaySeconds: 1
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 5
A response code of 0 and statusDetails = client_disconnected_before_any_response means the client closed the connection before the Load Balancer was able to provide a response, as per this GCP documentation.
As for why it did not respond in time, one possible reason is a mismatch between the keepalive timeouts of nginx and the GCP Load Balancer, even though that would more typically produce a backend_connection_closed_before_data_sent_to_client error caused by a 502 Bad Gateway race condition.
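If the keepalive mismatch is the culprit: the GCP HTTP(S) Load Balancer keeps backend connections open for 600 seconds, and the usual recommendation is to give the backend web server a longer keepalive timeout so that the load balancer, not nginx, is the side that closes idle connections. A minimal sketch of the nginx side (620s is a commonly suggested value, not something taken from your setup):

http {
    # Keep idle connections open longer than the load balancer's 600s
    # backend keepalive, so nginx never closes a connection the LB is
    # about to reuse (the 502 race condition mentioned above).
    keepalive_timeout 620s;
}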
To make sure the backend responds to the request, and to see how long it takes, you can repeat this process a few times (since you still get some valid responses):
curl response time
$ curl -w "#curl.txt" -o /dev/null -s IP_HERE
curl.txt content(create and save this file first):
time_namelookup: %{time_namelookup}\n
time_connect: %{time_connect}\n
time_appconnect: %{time_appconnect}\n
time_pretransfer: %{time_pretransfer}\n
time_redirect: %{time_redirect}\n
time_starttransfer: %{time_starttransfer}\n
----------\n
time_total: %{time_total}\n
If this is the case, please review the sign-up endpoint code for any kind of loop that could explain the PG::UniqueViolation errors you mentioned.

Feign with Ribbon and Hystrix

Hystrix is not opening the circuit for the Feign client. I am testing against a server that is always slow, with the following configuration:
hystrix.circuitBreaker.requestVolumeThreshold: 4
hystrix.circuitBreaker.errorThresholdPercentage: 50
hystrix.circuitBreaker.sleepWindowInMilliseconds: 7000
hystrix.metrics.rollingStats.timeInMilliseconds: 15000
hystrix.metrics.rollingStats.numBuckets: 5
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds: 31000
feign:
  hystrix:
    enabled: true

organizationService:
  ribbon:
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 0
    OkToRetryOnAllOperations: true
    ConnectTimeout: 10000
    ReadTimeout: 1000
I can see 4 retries and a java.net.SocketTimeoutException: Read timed out error after that, but when I check Turbine it shows the circuit as Closed for the operation.
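One thing worth double-checking (an observation, not necessarily the cause of the closed circuit): Hystrix reads its circuit-breaker and metrics settings from the hystrix.command.&lt;commandKey&gt; namespace, the way the timeout property above already does. With the default command key as an assumption, the equivalents of the first five properties would look like this:

hystrix.command.default.circuitBreaker.requestVolumeThreshold: 4
hystrix.command.default.circuitBreaker.errorThresholdPercentage: 50
hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds: 7000
hystrix.command.default.metrics.rollingStats.timeInMilliseconds: 15000
hystrix.command.default.metrics.rollingStats.numBuckets: 5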

haproxy with 2 backends - when 1 backend queues, the other is also affected

I have an haproxy setup with 2 backends: be1 and be2.
I'm using ACL to route based on the path.
When be2 begins to develop a queue, the requests to be1 are negatively affected -- what normally takes 100ms takes 2-3 seconds (just like what happens to the requests going to be2).
Is there a way to allow be2 to queue up without affecting performance on be1?
At peak, I was serving about 2000 req/s.
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    #log loghost local0 info
    maxconn 2000
    #chroot /usr/share/haproxy
    user haproxy
    group haproxy
    daemon
    #debug
    #quiet
    ulimit-n 65535
    stats socket /var/run/haproxy.sock
    nopoll

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

frontend http_in *:80
    option httpclose
    option forwardfor
    acl vt path_beg /route/1
    use_backend be2 if vt
    default_backend be1

backend be1
    balance leastconn
    option httpchk HEAD /redirect/are_you_alive HTTP/1.0
    server 01-2C2P9HI x:80 check inter 3000 rise 2 fall 3 maxconn 500

backend be2
    balance leastconn
    option httpchk HEAD /redirect/are_you_alive HTTP/1.0
    server 01-3TPDP27 x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-3CR0FKC x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-3E9CVMP x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-211LQMA x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-3H974V3 x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-13UCFVO x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-0HPIGGT x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-2LFP88F x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-1TIQBDH x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-2GG2LBB x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-1H5231E x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-0KIOVID x:80 check inter 3000 rise 2 fall 3 maxconn 250

listen stats 0.0.0.0:7474 #Listen on all IP's on port 7474
    mode http
    balance
    timeout client 5000
    timeout connect 4000
    timeout server 30000
    #This is the virtual URL to access the stats page
    stats uri /haproxy_stats
    #Authentication realm. This can be set to anything. Escape space characters with a backslash.
    stats realm HAProxy\ Statistics
    #The user/pass you want to use. Change this password!
    stats auth ge:test123
    #This allows you to take down and bring up back end servers.
    #This will produce an error on older versions of HAProxy.
    stats admin if TRUE
Not sure how I didn't notice this yesterday, but seeing that maxconn is set to 2000... so that is likely one of my issues?
There are two different maxconn settings: one for frontends and another for backends. The frontend setting limits incoming connections, so even though your backend is available it will not get the request, because it is queued on the frontend side. Only once a request goes through the frontend does backend queuing take place. Frontends are affected by the maxconn setting in the "defaults" section, so I would increase that to 4000, for example, as the backends should be able to handle it.
Please note that maxconn does not limit requests per second but simultaneous connections. You might have some HTTP keep-alive requests active that could limit the available throughput a lot.
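For illustration, the change suggested above only touches the maxconn line in the defaults section (4000 is just an example figure; note that the maxconn in the global section also caps the whole process, so it may need raising as well):

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    # raised from 2000 so the frontend no longer caps concurrency
    # before requests reach the per-server backend queues
    maxconn 4000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000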