K8s inter-service communication timeout - kubernetes

We have a K8s cluster (3 masters, 2 workers), v1.17.
There are two microservices in this cluster; microservice A calls a service called Common.
Sometimes I hit this problem: a call from A to Common times out after 60s, although the request is processed very quickly and successfully in Common (< 10ms).
getErrorInfoFallback : feign.RetryableException: Read timed out executing GET http://common-service.dev.svc.cluster.local:8002/errormapping/v1.0?errorCode=abcxyz
I use FeignClient to call the other microservice, with a URL like http://common-service.dev.svc.cluster.local:8002.
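For context, failing after exactly 60s lines up with Feign's default read timeout (feign.Request.Options defaults to a 10s connect timeout and a 60s read timeout). The sketch below is not from the original post; the interface, return type, bean names and timeout values are made up, constructor signatures vary slightly between Feign versions, and it only illustrates how such a client and its timeouts are usually declared with Spring Cloud OpenFeign:

import feign.Request;
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;

// Hypothetical client for the call that appears in the error above.
@FeignClient(name = "common-service", url = "http://common-service.dev.svc.cluster.local:8002")
interface CommonClient {
    @GetMapping("/errormapping/v1.0")
    String getErrorInfo(@RequestParam("errorCode") String errorCode);
}

// Overrides Feign's defaults (10 s connect, 60 s read) so a hung call fails faster.
@Configuration
class CommonClientConfig {
    @Bean
    Request.Options feignRequestOptions() {
        return new Request.Options(5_000, 10_000); // connectTimeoutMillis, readTimeoutMillis
    }
}

Note that tuning the timeout only changes how quickly the failure surfaces; the timeline below (Common answers in ~16 ms, A still waits the full 60 s) suggests the response is being lost on the way back rather than the service being slow.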
Here is the timeline:
- 16:37:42.362 A sends the request
- 16:37:42.368 Common logs the request
- 16:37:42.378 Common logs the response being returned
- 16:38:42.424 A: timeout exception
Could anyone help me?

Related

How to optimize the Keycloak configuration

I have deployed Keycloak on Kubernetes and am having a performance issue with Keycloak as follows:
I run 6 Keycloak pods in standalone-HA mode using KUBE_PING in Kubernetes, with an HPA that scales out when CPU goes over 80%. When I load-test logins against Keycloak, the threshold is only 150 concurrent users (ccu); above that, errors occur. The Keycloak pod logs show timeouts like the following:
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command RemoveCommand on Cache 'authenticationSessions', writing keys [f85ac151-6196-48e9-977c-048fc8bcd975]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2312955 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command ReplaceCommand on Cache 'loginFailures', writing keys [LoginFailureKey [ realmId=etc-internal. userId=a76c3578-32fa-42cb-88d7-fcfdccc5f5c6 ]]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2201111 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p20-t1) ISPN000136: Error executing command PutKeyValueCommand on Cache 'sessions', writing keys [6a0e8cde-82b7-4824-8082-109ccfc984b4]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2296440 from keycloak-9b6486f7-9l9lq"
I can see that Keycloak's RAM and CPU usage stay well under 20%, so the HPA never scales it out. So I think the current Keycloak configuration is not optimized with respect to the number of CACHE_OWNERS, Access Token Lifespan, SSO Session Idle, SSO Session Max, etc.
I would like to know which settings to tune so that Keycloak can handle 500 ccu with a response time of ~3s. Please help if you know about this!
In the standalone-ha.xml config, I have only changed the datasource configuration, as shown in the image below.
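Not part of the original question: the caches named in the errors above (sessions, authenticationSessions, loginFailures) are declared in the Infinispan subsystem of standalone-ha.xml, and the owner count is usually raised there. A rough sketch, assuming the default Wildfly-based layout (the attribute values are illustrative only, not a recommendation):

<cache-container name="keycloak">
    <transport lock-timeout="60000"/>
    <distributed-cache name="sessions" owners="2"/>
    <distributed-cache name="authenticationSessions" owners="2"/>
    <distributed-cache name="loginFailures" owners="2"/>
</cache-container>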

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mechanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instances' IPs are correctly resolved from the Route53 private namespace, and the instances discover each other without any problem (x.x.x.138 starts first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I see a number of failure-detection tasks suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan side (cache), everything seems to occur correctly but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end up with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like it's a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any idea? Has anyone been successful at configuring this on ECS (Fargate)?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.
In the current Keycloak docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will only "partially" work: discovery works and the cluster forms, but the Infinispan caches cannot sync between instances, which produces erratic errors. There is probably a way to make it work "as-is", but I don't see anything blocked between the instances when checking the VPC Flow Logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
    <channels default="ee">
        <channel name="ee" stack="tcp" cluster="ejb"/> <!-- set stack to tcp -->
    </channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag and push the image to Amazon ECR, and make sure port 7600 is allowed between your Amazon ECS tasks in your security group.

How can one configure retry for IOExceptions in Spring Cloud Gateway?

I see that the Retry filter supports retries based on HTTP status codes. I would like to configure retries for IO exceptions such as connection resets. Is that possible with Spring Cloud Gateway 2?
I was using 2.0.0.RC1. It looks like the latest build snapshot has support for retries based on exceptions. Fingers crossed for the next release.
Here is an example that retries twice for 500 series errors or IOExceptions:
filters:
- name: Retry
  args:
    retries: 2
    series:
    - SERVER_ERROR
    exceptions:
    - java.io.IOException
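Not part of the original answer: roughly the same thing can also be expressed with the Java route DSL. The setter names below assume the RetryGatewayFilterFactory.RetryConfig API of later 2.x releases, and the route id, path and target URI are made up:

import java.io.IOException;

import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpStatus;

@Configuration
class RetryRouteConfig {
    @Bean
    RouteLocator retryRoutes(RouteLocatorBuilder builder) {
        return builder.routes()
            // Retry twice on 5xx responses or on IOExceptions (e.g. connection resets).
            .route("retry-example", r -> r.path("/api/**")
                .filters(f -> f.retry(c -> c
                    .setRetries(2)
                    .setSeries(HttpStatus.Series.SERVER_ERROR)
                    .setExceptions(IOException.class)))
                .uri("http://backend:8080"))
            .build();
    }
}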

orientdb 2.2.26 cluster setup - queries hang

Orientdb version - 2.2.26
Cluster - 3 node setup, readQuorum = 2, writeQuorum = "majority", ridBag.embeddedToSbtreeBonsaiThreshold = 2147483647
Nodes - CentOS 7.0, 24 cores and 96 GB RAM
Gremlin-scala/tinkerpop APIs are used for querying and inserting.
This code works fine on single node setup.
The code checks whether a vertex already exists in the graph. If the vertex does not exist, the insert operations are batched and sent to the DB within a transaction.
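Roughly what that flow looks like (this is not the poster's code, which uses gremlin-scala; the sketch below uses the plain TinkerPop Java API with made-up labels and keys, just to illustrate the check-then-insert-in-a-transaction pattern):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class DeviceUpsert {
    // Check whether the vertex already exists; if not, create it, then commit the batch.
    public static Vertex upsertDevice(Graph graph, String serial) {
        GraphTraversalSource g = graph.traversal();
        Vertex device = g.V().has("Device", "serial", serial).tryNext()
            .orElseGet(() -> g.addV("Device").property("serial", serial).next());
        graph.tx().commit(); // the batched inserts reach the DB inside one transaction
        return device;
    }
}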
I see the following warnings in the OrientDB log on all three nodes:
2017-09-15 16:37:31:025 WARNI [dev2] Timeout (852567ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.354 task=record_read(#65:22)) [ODistributedDatabaseImpl]
2017-09-15 16:52:18:239 WARNI [dev2] Timeout (1049042ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.568 task=record_read(#63:24)) [ODistributedDatabaseImpl]
2017-09-15 17:25:22:477 WARNI [dev2] Timeout (1984236ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.889 task=record_read(#63:24)) [ODistributedDatabaseImpl]
There is no problem with the network, and the firewall is disabled on all three nodes.
Are these logs related to the problem?
What else should I check to fix the problem?

Handling connection failures in apache-camel

I am writing an apache-camel RabbitMQ consumer. I would like to react somehow to connection problems (i.e. try to reconnect). Is it possible to configure apache-camel to automatically reconnect?
If not, how can I find out that a connection to the queue was interrupted? I've done the following test:
start the queue (and some producer)
start my consumer (it was getting messages as expected)
stop the queue (the messages stopped arriving, as expected, but no exception was thrown)
start the queue (no new messages were received)
I am using Camel from Scala (via akka-camel), but a Java solution would probably also be OK.
You can pass the flag automaticRecoveryEnabled=true in the endpoint URI; Camel will then reconnect if the connection is lost.
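A minimal sketch of what that looks like in a route (not from the original answer; the host, exchange and queue names are made up, and networkRecoveryInterval is an additional optional knob for the retry delay):

import org.apache.camel.builder.RouteBuilder;

public class RabbitRecoveryRoute extends RouteBuilder {
    @Override
    public void configure() {
        // automaticRecoveryEnabled=true tells the underlying RabbitMQ client to re-establish
        // the connection when it is lost; networkRecoveryInterval is the retry delay in ms.
        from("rabbitmq://localhost:5672/my.exchange"
                + "?queue=my.queue"
                + "&automaticRecoveryEnabled=true"
                + "&networkRecoveryInterval=5000")
            .log("Received: ${body}");
    }
}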
For automatic RabbitMQ resource recovery (connections/channels/consumers/queues/exchanges/bindings) when failures occur, check out Lyra (which I authored). Example usage:
import com.rabbitmq.client.Connection;
import net.jodah.lyra.ConnectionOptions;
import net.jodah.lyra.Connections;
import net.jodah.lyra.config.Config;
import net.jodah.lyra.config.RecoveryPolicy;
import net.jodah.lyra.util.Duration;
// Recover for up to 5 minutes, retrying once per second, at most 20 times.
Config config = new Config()
    .withRecoveryPolicy(new RecoveryPolicy()
        .withMaxAttempts(20)
        .withInterval(Duration.seconds(1))
        .withMaxDuration(Duration.minutes(5)));
ConnectionOptions options = new ConnectionOptions().withHost("localhost");
Connection connection = Connections.create(options, config);
The rest of the API is just the amqp-client API, except your resources are automatically recovered when failures occur.
I'm not sure about camel-rabbitmq specifically, but hopefully there's a way you can swap in your own resource creation via Lyra.
The current camel-rabbitmq just creates a connection and a channel when the consumer or producer is started, so it doesn't get a chance to catch the connection exception :(.