Feign with Ribbon and Hystrix - spring-cloud

Hystrix is not opening the circuit for my Feign client. I am testing against a server that is always slow, with the following configuration:
hystrix.circuitBreaker.requestVolumeThreshold: 4
hystrix.circuitBreaker.errorThresholdPercentage: 50
hystrix.circuitBreaker.sleepWindowInMilliseconds: 7000
hystrix.metrics.rollingStats.timeInMilliseconds: 15000
hystrix.metrics.rollingStats.numBuckets: 5
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds: 31000
feign:
  hystrix:
    enabled: true
organizationService:
  ribbon:
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 0
    OkToRetryOnAllOperations: true
    ConnectTimeout: 10000
    ReadTimeout: 1000
I can see 4 retries followed by a java.net.SocketTimeoutException: Read timed out error, but when I check Turbine it shows the circuit as Closed for that operation.
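One thing to check: Hystrix only reads circuit-breaker and metrics settings under the hystrix.command.<commandKey> prefix, and only the timeout property above carries the hystrix.command.default prefix. Without that prefix the first five properties are ignored and Hystrix falls back to its defaults (a requestVolumeThreshold of 20, for instance), which would keep the circuit closed in a 4-request test. A sketch of the same settings fully prefixed, assuming they are meant for the default command key:

hystrix:
  command:
    default:
      circuitBreaker:
        requestVolumeThreshold: 4
        errorThresholdPercentage: 50
        sleepWindowInMilliseconds: 7000
      metrics:
        rollingStats:
          timeInMilliseconds: 15000
          numBuckets: 5
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 31000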

Related

url_param routing algorithm in HAProxy

haproxy.cfg:
global
    chroot /var/lib/haproxy
    pidfile /var/run/haproxy.pid
    maxconn 5000
    user haproxy
    group haproxy
    daemon
    stats socket /var/lib/haproxy/stats

defaults
    mode http
    log global
    option httplog
    option http-server-close
    option forwardfor except 127.0.0.0/8
    option redispatch
    retries 3
    timeout http-request 10s
    timeout queue 1m
    timeout connect 10s
    timeout client 10m
    timeout server 10m
    timeout http-keep-alive 10s
    timeout check 10s
    maxconn 4000

listen stats
    bind :9000
    mode http
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /haproxy

frontend http-in
    bind *:80
    acl acl_is_host hdr(host) -i example.com
    acl acl_is_param url_param(id) -m str server1 server2 server3 server4
    use_backend worker_cluster if acl_is_host acl_is_param

backend worker_cluster
    cookie APPID insert secure httponly
    balance url_param id
    hash-type consistent
    server w1 localhost:8080 weight 40 maxconn 1000 check
    server w2 localhost:8081 weight 40 maxconn 1000 check
    server w3 localhost:8082 weight 10 maxconn 1000 check
    server w4 localhost:8083 weight 10 maxconn 1000 check
Previously, when I had only 2 servers, w1 and w2, in the backend without any weights, it worked perfectly fine. But after I added 2 more servers, w3 and w4, with weights (all server weights adding up to 100), I see issues such as requests with param id=server1 reaching server w2.
Is there any mistake in the config file?
Should the weights always add up to 100?
Haproxy version used: 1.5.18
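One thing to keep in mind: balance url_param does not map the value server1 to the server named w1; it hashes the parameter value and picks a server from the weighted pool, so adding servers or changing weights moves some values to other servers even with hash-type consistent, and weights are purely relative (they do not need to sum to 100). If specific id values must be pinned to specific servers, use-server rules inside the backend are one option; a sketch, to be verified against 1.5.18:

backend worker_cluster
    # pin each id value to a fixed server; anything else falls through
    use-server w1 if { url_param(id) -m str server1 }
    use-server w2 if { url_param(id) -m str server2 }
    use-server w3 if { url_param(id) -m str server3 }
    use-server w4 if { url_param(id) -m str server4 }
    balance url_param id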

Default timeout value of Istio

I have a service to which I have added a delay of 5 minutes, so a request to this service takes 5 minutes to return a response.
I have deployed this service in Kubernetes with Istio v1.5. When I call this service through the ingress gateway, I get a timeout after 3 minutes.
{"res_tx_duration":"-","route_name":"default","user_agent":"grpc-java-netty/1.29.0","response_code_details":"-","start_time":"****","request_id":"****","method":"POST","upstream_host":"127.0.0.1:6565","x_forwarded_for":"****","bytes_sent":"0","upstream_cluster":"****","downstream_remote_address":"****","req_duration":"0","path":"/****","authority":"****","protocol":"HTTP/2","upstream_service_time":"-","upstream_local_address":"-","duration":"180000","upstream_transport_failure_reason":"-","response_code":"0","res_duration":"-","response_flags":"DC","bytes_received":"5"}
I tried setting the timeout in the VirtualService to more than 3 minutes, but that does not work. Only timeouts of less than 3 minutes set in the VirtualService take effect.
route:
- destination:
    host: demo-service
    port:
      number: 8000
timeout: 240s
Is there any other configuration where we can set the timeout, other than the VirtualService?
Is 3 minutes (180s) the maximum value we can set in the VirtualService?
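For what it's worth, response_flags "DC" in the access log above is Envoy's code for downstream connection termination, i.e. the caller side closed the connection at 180000 ms, so the 3-minute cap may be imposed by the client (for example a deadline in the grpc-java caller) or an intermediate load balancer rather than by Istio itself. The VirtualService timeout belongs at the same level as route inside an http entry; a minimal sketch, with illustrative metadata and gateway names:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: demo-service
spec:
  hosts:
  - "*"
  gateways:
  - demo-gateway     # hypothetical gateway name
  http:
  - route:
    - destination:
        host: demo-service
        port:
          number: 8000
    timeout: 360s    # must exceed the 5-minute service delay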

Why am I losing my connection to my MongoDB after my GKE node gets preempted?

I am running a Mongo, Express, Node, React app in a GKE cluster that is set up with a preemptible VM (to save money). I am using Mongoose to connect to my MongoDB, which is hosted on MongoDB Atlas. Everything works fine when the pod first starts. However, when my node gets preempted, I lose the connection to my MongoDB instance. I then have to go in and manually scale the deployment down to 0 replicas and scale it back up, and the connection to MongoDB is restored. Below is the error I am getting and the code for my Mongo connection. Is this just an intended effect of using a preemptible instance? Is there any way to deal with it, such as automatically scaling the deployment after a preemption? I was running a GKE Autopilot cluster and had no problems, but that was a little expensive for my purposes. Thanks
mongoose
  .connect(process.env.MONGODB_URL, {
    useNewUrlParser: true,
    useUnifiedTopology: true,
    useFindAndModify: false,
  })
  .then(() => console.log('mongoDB connected...'));
(node: 24) UnhandledPromiseRejectionWarning: Error: querySrv ECONNREFUSED _mongodb._tcp.clusterx.xxxxx.azure.mongodb.net at QueryReqWrap.onresolve (dns.js:203)
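The UnhandledPromiseRejectionWarning itself appears because the connect promise has no .catch handler; a minimal sketch of one way to log the error and retry the initial connection (the helper name and the 5-second delay are illustrative):

const connectWithRetry = () => {
  mongoose
    .connect(process.env.MONGODB_URL, {
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useFindAndModify: false,
    })
    .then(() => console.log('mongoDB connected...'))
    .catch((err) => {
      // retry the initial connect instead of failing with an unhandled rejection
      console.error('mongoDB connection failed, retrying in 5s:', err.message);
      setTimeout(connectWithRetry, 5000);
    });
};
connectWithRetry();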
The VM preemption can be reproduced in Compute Engine -> Instance groups -> Restart/Replace VMs, choosing the option Replace. After the VM has been restarted, the containers are recreated too, but unfortunately with the network issues mentioned above.
My solution was to add liveness and readiness probes to the Kubernetes Pods/Deployment via a /health URL which checks whether MongoDB is available and returns status code 500 if not. Details on how to define liveness and readiness probes in Kubernetes are here. Kubernetes will restart pods that are not alive, and the pods created afterwards won't have the network issues.
The YAML spec block in my project looks like this:
spec:
  containers:
  - env:
    - name: MONGO_URL
      value: "$MONGO_URL"
    - name: NODE_ENV
      value: "$NODE_ENV"
    image: gcr.io/$GCP_PROJECT/$APP_NAME:$APP_VERSION
    imagePullPolicy: IfNotPresent
    name: my-container
    # the readiness probe details
    readinessProbe:
      httpGet:                 # make an HTTP request
        port: 3200             # port to use
        path: /health          # endpoint to hit
        scheme: HTTP           # or HTTPS
      initialDelaySeconds: 5   # how long to wait before checking
      periodSeconds: 5         # how long to wait between checks
      successThreshold: 1      # how many successes to hit before accepting
      failureThreshold: 1      # how many failures to accept before failing
      timeoutSeconds: 3        # how long to wait for a response
    # the livenessProbe probe details
    livenessProbe:
      httpGet:                 # make an HTTP request
        port: 3200             # port to use
        path: /health          # endpoint to hit
        scheme: HTTP           # or HTTPS
      initialDelaySeconds: 15  # how long to wait before checking
      periodSeconds: 5         # how long to wait between checks
      successThreshold: 1      # how many successes to hit before accepting
      failureThreshold: 2      # how many failures to accept before failing
      timeoutSeconds: 3        # how long to wait for a response
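For completeness, a minimal sketch of the /health handler those probes call, assuming an Express app (the exact availability check is illustrative):

const express = require('express');
const mongoose = require('mongoose');
const app = express();

// readyState 1 means Mongoose currently holds a live connection
app.get('/health', (req, res) => {
  if (mongoose.connection.readyState === 1) {
    res.status(200).send('OK');
  } else {
    res.status(500).send('MongoDB unavailable');
  }
});

app.listen(3200); // same port the probes target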

zuul proxy slowness - RibbonLoadBalancingHttpClient

Firstly, I have only basic knowledge of Java. I have some microservices and am currently using Zuul/Eureka to proxy the services.
I noticed that when calling the microservice directly, the throughput is 3 times higher than when calling it through Zuul. So I'm wondering if my Zuul configuration is wrong.
ab output:
Calling microservice directly:
Concurrency Level: 10
Time taken for tests: 5.938 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 37750000 bytes
HTML transferred: 36190000 bytes
Requests per second: 1684.20 [#/sec] (mean)
Time per request: 5.938 [ms] (mean)
Time per request: 0.594 [ms] (mean, across all concurrent requests)
Transfer rate: 6208.84 [Kbytes/sec] received
Calling through zuul:
Concurrency Level: 10
Time taken for tests: 15.049 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 37990000 bytes
HTML transferred: 36190000 bytes
Requests per second: 664.52 [#/sec] (mean)
Time per request: 15.049 [ms] (mean)
Time per request: 1.505 [ms] (mean, across all concurrent requests)
Zuul config:
server:
  port: 7001
zuul:
  #Services will be mapped under the /api URI
  prefix: /api
  sslHostnameValidationEnabled: false
  host:
    maxTotalConnections: 800
    maxPerRouteConnections: 200
endpoints:
  restart:
    enabled: true
  shutdown:
    enabled: true
  health:
    sensitive: false
eureka:
  instance:
    hostname: localhost
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
ribbon:
  eureka:
    enabled: true
spring:
  application:
    name: zuul-server
    id: zuul-server
I noticed that Zuul takes a lot of CPU compared to the microservice itself, so I took a thread dump. My suspicion is that RibbonLoadBalancingHttpClient seems to keep getting instantiated.
Thread dump: https://1drv.ms/t/s!Atq1lsqOLA98mHjh0lSJHPJj5J_I
The zuul.host.* properties you specified apply only to Zuul routes with a "url" specified directly; they do not apply to the serviceId routes fetched from Eureka. See here. You may want to increase the Ribbon-level total HTTP connections and connections per host and rerun your test. Here is an example config -
ribbon:
  ReadTimeout: 30000
  ConnectTimeout: 1000
  MaxTotalHttpConnections: 1600
  MaxConnectionsPerHost: 800
In my tests with Zuul, I do remember seeing that the max response times of some requests were much higher than for requests sent directly to the target (bypassing Zuul), but the 95th and 99th percentiles were always comparable, within approximately ~200ms of the direct requests to the target server.

haproxy with 2 backends - when 1 backend queues, the other is also affected

I have haproxy setup with 2 backends: be1 and be2
I'm using ACL to route based on the path.
When be2 begins to develop a queue, the requests to be1 are negatively affected -- what normally takes 100ms takes 2-3 seconds (just like what happens to the requests going to be2).
Is there a way to allow be2 to queue up without affecting performance on be1?
At peak, I was serving about 2000 req/s.
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    #log loghost local0 info
    maxconn 2000
    #chroot /usr/share/haproxy
    user haproxy
    group haproxy
    daemon
    #debug
    #quiet
    ulimit-n 65535
    stats socket /var/run/haproxy.sock
    nopoll

defaults
    log global
    mode http
    option httplog
    option dontlognull
    retries 3
    option redispatch
    maxconn 2000
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000

frontend http_in *:80
    option httpclose
    option forwardfor
    acl vt path_beg /route/1
    use_backend be2 if vt
    default_backend be1

backend be1
    balance leastconn
    option httpchk HEAD /redirect/are_you_alive HTTP/1.0
    server 01-2C2P9HI x:80 check inter 3000 rise 2 fall 3 maxconn 500

backend be2
    balance leastconn
    option httpchk HEAD /redirect/are_you_alive HTTP/1.0
    server 01-3TPDP27 x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-3CR0FKC x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-3E9CVMP x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-211LQMA x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-3H974V3 x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-13UCFVO x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-0HPIGGT x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-2LFP88F x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-1TIQBDH x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-2GG2LBB x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-1H5231E x:80 check inter 3000 rise 2 fall 3 maxconn 250
    server 01-0KIOVID x:80 check inter 3000 rise 2 fall 3 maxconn 250

listen stats 0.0.0.0:7474 #Listen on all IP's on port 9000
    mode http
    balance
    timeout client 5000
    timeout connect 4000
    timeout server 30000
    #This is the virtual URL to access the stats page
    stats uri /haproxy_stats
    #Authentication realm. This can be set to anything. Escape space characters with a backslash.
    stats realm HAProxy\ Statistics
    #The user/pass you want to use. Change this password!
    stats auth ge:test123
    #This allows you to take down and bring up back end servers.
    #This will produce an error on older versions of HAProxy.
    stats admin if TRUE
Not sure how I didn't notice this yesterday, but I see that maxconn is set to 2000... so that is likely one of my issues?
There are two different maxconn settings: one for frontends and one for backends. The frontend setting limits incoming connections, so even though your backend is available, it would not get the request, as the request is queued on the frontend side. Only once a request gets through the frontend does backend queuing take place. Frontends are affected by the maxconn setting in the "defaults" section, so I would increase that to 4000, for example, since the backends should be able to handle it.
Please note that maxconn does not limit requests per second but simultaneous connections. You might have some HTTP keep-alive connections active that could limit the available throughput a lot.
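A sketch of the suggested change, with illustrative values:

global
    maxconn 8000    # process-wide cap; must cover all frontends combined

defaults
    maxconn 4000    # applies to each frontend; raised from 2000 so that
                    # queued be2 traffic no longer starves be1 connections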