How to optimize config Keycloak - kubernetes

I have deployed Keycloak on Kubernetes and am having a performance issue with Keycloak as follows:
I run 6 pods Keycloak with mode standalone HA using KUBE_PING in Kubernetes and auto scale hpa if CPU over 80%. When I test the login load with Keycloak, the threshold is only 150ccu and if over this threshold an error will occur. I see log pods Keycloak occur timeout as below
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command RemoveCommand on Cache 'authenticationSessions', writing keys [f85ac151-6196-48e9-977c-048fc8bcd975]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2312955 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command ReplaceCommand on Cache 'loginFailures', writing keys [LoginFailureKey [ realmId=etc-internal. userId=a76c3578-32fa-42cb-88d7-fcfdccc5f5c6 ]]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2201111 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p20-t1) ISPN000136: Error executing command PutKeyValueCommand on Cache 'sessions', writing keys [6a0e8cde-82b7-4824-8082-109ccfc984b4]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2296440 from keycloak-9b6486f7-9l9lq"
I see the RAM, CPU of keycloak takes up very little under 20% so it does not auto scale hpa. So I think the current configuration of Keycloak is not optimize as about number of CACHE_OWNERS, Access Token Lifespan, SSO Session Idle, SSO Session Max, etc...
I want to know what configurations to configure accordingly and can load Keycloak to 500ccu with response time ~ 3s. Please support me if you know about this !
In standalone-ha.xml config, I only update config about datasource as image below

Related

OpenSearch error Rejecting request because cold storage is not enabled on the domain

I have an AWS OpenSearch domain with ultrawarm/cold storage disabled.
My application/error log is spammed (new entries every 2-5 minutes) with the following error
[2022-09-05T09:29:13,909][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [eaab83ab8ab8fd53a42b42e694708350] uncaught exception in thread [DefaultDispatcher-worker-1]
OpenSearchStatusException[Rejecting request because cold storage is not enabled on the domain. Enabling cold storage for the first time can take several hours. Please try again later.]
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:195)
at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:120)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:99)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
at org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:266)
at org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
at org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
at org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
at org.opensearch.indexmanagement.kraken.ColdIndexMetadataService$getMetadata$response$1.invoke(ColdIndexMetadataService.kt:31)
at org.opensearch.indexmanagement.kraken.ColdIndexMetadataService$getMetadata$response$1.invoke(ColdIndexMetadataService.kt:17)
at org.opensearch.indexmanagement.kraken.OpenSearchExtensionsKt.suspendUntil(OpenSearchExtensions.kt:30)
at org.opensearch.indexmanagement.kraken.ColdIndexMetadataService.getMetadata(ColdIndexMetadataService.kt:31)
at org.opensearch.indexmanagement.indexstatemanagement.IndexMetadataProvider.getISMIndexMetadataByType(IndexMetadataProvider.kt:46)
at org.opensearch.indexmanagement.indexstatemanagement.IndexMetadataProvider$getMultiTypeISMIndexMetadata$2$invokeSuspend$$inlined$forEach$lambda$1.invokeSuspend(IndexMetadataProvider.kt:66)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
I've tried stopping all applications that interacts with OpenSearch, tried to delete my ISM policy, but these errors keep filling the logs
any ideas?

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mecanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instances IP are correctly resolved from Route53 private namespace and they discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure discovery task suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan side (cache), everything seems to occur correctly but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end up with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like its a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any idea ? Anyone successfull at configuring this on ECS (fargate) ?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.
In current keycloak docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will "partially" work, meaning the discovery will work, the cluster will form, but the Infinispan cache won't be able to sync between instances, which will produce erratic errors. There is probably a way to make it work "as-is", but I dont see anything blocked between the instances when checking the VPC Flow logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <-- set stack to tcp
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag/Push the image to Amazon ECR and make sure the port 7600 is accepted in your security group between your Amazon ECS tasks.

Error instantiating chaincode in Hyperledger Fabric 1.4 over AKS kubernetes

I am trying to instantiate "sacc" chaincode (which comes with fabric samples) in an hyperledger fabric network deployed over kubernetes in AKS. After hours trying different adjustments, I've not been able to finish the task. I'm always getting the error:
Error: could not assemble transaction, err proposal response was not
successful, error code 500, msg timeout expired while starting
chaincode sacc:0.1 for transaction
Please note that there is no transaccion ID in the error (I've googled some similar cases, but in all of them, there was an ID for the transaction. Not my case, eventhough the error is the same)
The message in the orderer:
2019-07-23 12:40:13.649 UTC [orderer.common.broadcast] Handle -> WARN
047 Error reading from 10.1.0.45:52550: rpc error: code = Canceled
desc = context canceled 2019-07-23 12:40:13.649 UTC [comm.grpc.server]
1 -> INFO 048 streaming call completed {"grpc.start_time":
"2019-07-23T12:34:13.591Z", "grpc.service": "orderer.AtomicBroadcast",
"grpc.method": "Broadcast", "grpc.peer_address": "10.1.0.45:52550",
"error": "rpc error: code = Canceled desc = context canceled",
"grpc.code": "Canceled", "grpc.call_duration": "6m0.057953469s"}
I am calling for instantiation from inside a cli peer, defining variables CORE_PEER_LOCALMSPID, CORE_PEER_TLS_ROOTCERT_FILE, CORE_PEER_MSPCONFIGPATH, CORE_PEER_ADDRESS and ORDERER_CA with appropriate values before issuing the instantiation call:
peer chaincode instantiate -o -n sacc -v 0.1 -c
'{"Args":["init","hi","1"]}' -C mychannelname --tls 'true' --cafile
$ORDERER_CA
All the peers have declared a dockersocket volume pointing to
/run/docker.sock
All the peers have declared the variable CORE_VM_ENDPOINT to unix:///host/var/run/docker.sock
All the orgs were joined to the channel
All the peers have the chaincode installed
I can't see any further message/error in the cli, nor in the orderer, nor in the peers involved in the channel.
Any ideas on what can be going wrong? Or how could I continue troubleshooting the problem? Is it possible to see logs from the docker container that is being created by the peers? how?
Thanks

Failed to send instantiate transaction and get notifications within the timeout period. undefined[fabric1.0 k8s]

I am trying to deploy Hyperledger fabric 1.0.5 on k8s, and use the balance transfer to test it. Everything is right before instantiate-chaincode, and I get this:
[2019-01-02 23:23:14.392] [ERROR] instantiate-chaincode - Failed to send instantiate transaction and get notifications within the timeout period. undefined
[2019-01-02 23:23:14.393] [ERROR] instantiate-chaincode - Failed to order the transaction. Error code: undefined
and I use kubectl logs to get the peer0's log which is like this:
[ConnProducer] NewConnection -> ERRO 61a Failed connecting to orderer2.orderer1:7050 , error: context deadline exceeded
[ConnProducer] NewConnection -> ERRO 61b Failed connecting to orderer1.orderer1:7050 , error: context deadline exceeded
[ConnProducer] NewConnection -> ERRO 61c Failed connecting to orderer0.orderer1:7050 , error: context deadline exceeded
[deliveryClient] connect -> DEBU 61d Connected to
[deliveryClient] connect -> ERRO 61e Failed obtaining connection: Could not connect to any of the endpoints: [orderer2.orderer1:7050 orderer1.orderer1:7050 orderer0.orderer1:7050]
I checked the connectivity of orderer0:7050 and found no problem.
What should I do next?
Thank for help!
You didn't describe what runbook you followed to deploy Hyperledger Fabric but looks like your pods cannot find each other through DNS. If you are following Kubernetes standards your pods should be in the orderer1 namespace and hopefully, you have Kubernetes services for orderer0, orderer1, and orderer2.
You can read more about communication between the Fabric components here in the "Communication between Fabric components" section. Also, read on the "Work around the chaincode sandbox" where it shows you a workaround for --dns-search.
It looks like firewall problem.
In my case to run hlf on k8s, I disabled firewall service.

IBM BLUEMIX BLOCKCHAIN SDK-DEMO failing

I have been working with HFC SDK for Node.js and it used to work, but since last night I am having some problems.
When running helloblockchain.js only few times works, most time I get this error when it tries to enroll a new user:
E0113 11:56:05.983919636 5288 handshake.c:128] Security handshake failed: {"created":"#1484304965.983872199","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484304965.983866102","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
Error: Failed to register and enroll JohnDoe: Error
Other times, the enroll works and the failure appears deploying the chaincode:
Enrolled and registered JohnDoe successfully
Deploying chaincode ...
E0113 12:14:27.341527043 5455 handshake.c:128] Security handshake failed: {"created":"#1484306067.341430168","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306067.341421859","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
Failed to deploy chaincode: request={"fcn":"init","args":["a","100","b","200"],"chaincodePath":"chaincode","certificatePath":"/certs/peer/cert.pem"}, error={"error":{"code":14,"metadata":{"_internal_repr":{}}},"msg":"Error"}
Or:
Enrolled and registered JohnDoe successfully
Deploying chaincode ...
E0113 12:15:27.448867739 5483 handshake.c:128] Security handshake failed: {"created":"#1484306127.448692244","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306127.448668047","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
events.js:160
throw er; // Unhandled 'error' event
^
Error
at ClientDuplexStream._emitStatusIfDone (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:189:19)
at ClientDuplexStream._readsDone (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:158:8)
at readCallback (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:217:12)
E0113 12:15:27.563487641 5483 handshake.c:128] Security handshake failed: {"created":"#1484306127.563437122","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306127.563429661","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
This code worked yesterday, so I don't know what could be happening.
Does anybody know how can I fix it?
Thanks,
Javier.
ibm-bluemix
blockchain
These types of intermittent issues are usually related to GRPC. An initial suggestion is to ensure that you are using at least GRPC version 1.0.0.
If you are using a Mac, then the maximum number of open file descriptors should be checked (using ulimit -n). Sometimes this is initially set to a low value such as 256, so increasing the value could help.
There are a couple of GRPC issues with similar symptoms.
https://github.com/grpc/grpc/issues/8732
https://github.com/grpc/grpc/issues/8839
https://github.com/grpc/grpc/issues/8382
There is a grpc.initial_reconnect_backoff_ms property that is mentioned in some of these issues. Increasing the value past the 1000 ms level might help reduce the frequency of issues. Below are instructions for how the helloblockchain.js file can be modified to set this property to a higher value.
Open the helloblockchain.js file in the Hyperledger Fabric Client example and find the enrollAndRegisterUsers function.
Add “grpc.initial_reconnect_backoff_ms": 5000 to the setMemberServicesUrl call.
chain.setMemberServicesUrl(ca_url, {
pem: cert, "grpc.initial_reconnect_backoff_ms": 5000
});
Add “grpc.initial_reconnect_backoff_ms": 5000 to the addPeer call.
chain.addPeer("grpcs://" + peers[i].discovery_host + ":" + peers[i].discovery_port,
{pem: cert, "grpc.initial_reconnect_backoff_ms": 5000
});
Note that setting the grpc.initial_reconnect_backoff_ms property may reduce the frequency of issues, but it will not necessarily eliminate all issues.
The connection to the eventhub that is made in the helloblockchain.js file can also be a factor. There is an earlier version of the Hyperledger Fabric Client that does not utilize the eventhub. This earlier version could be tried to determine if this makes a difference. After running git clone https://github.com/IBM-Blockchain/SDK-Demo.git, run git checkout b7d5195 to use this prior level. Before running node helloblockchain.js from a Node.js command window, the git status command can be used to check the code level that is being used.