OpenSearch error "Rejecting request because cold storage is not enabled on the domain"

I have an AWS OpenSearch domain with UltraWarm/cold storage disabled.
My application/error log is spammed (new entries every 2-5 minutes) with the following error:
[2022-09-05T09:29:13,909][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [eaab83ab8ab8fd53a42b42e694708350] uncaught exception in thread [DefaultDispatcher-worker-1]
OpenSearchStatusException[Rejecting request because cold storage is not enabled on the domain. Enabling cold storage for the first time can take several hours. Please try again later.]
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:195)
at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:120)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:99)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
at org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:266)
at org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
at org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
at org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
at org.opensearch.indexmanagement.kraken.ColdIndexMetadataService$getMetadata$response$1.invoke(ColdIndexMetadataService.kt:31)
at org.opensearch.indexmanagement.kraken.ColdIndexMetadataService$getMetadata$response$1.invoke(ColdIndexMetadataService.kt:17)
at org.opensearch.indexmanagement.kraken.OpenSearchExtensionsKt.suspendUntil(OpenSearchExtensions.kt:30)
at org.opensearch.indexmanagement.kraken.ColdIndexMetadataService.getMetadata(ColdIndexMetadataService.kt:31)
at org.opensearch.indexmanagement.indexstatemanagement.IndexMetadataProvider.getISMIndexMetadataByType(IndexMetadataProvider.kt:46)
at org.opensearch.indexmanagement.indexstatemanagement.IndexMetadataProvider$getMultiTypeISMIndexMetadata$2$invokeSuspend$$inlined$forEach$lambda$1.invokeSuspend(IndexMetadataProvider.kt:66)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
I've tried stopping all applications that interact with OpenSearch and tried deleting my ISM policy, but these errors keep filling the logs.
Any ideas?

Related

How to optimize Keycloak configuration

I have deployed Keycloak on Kubernetes and am having a performance issue, as follows:
I run 6 Keycloak pods in standalone HA mode using KUBE_PING in Kubernetes, with an HPA that auto-scales if CPU goes over 80%. When I load-test logins against Keycloak, the threshold is only 150 concurrent users, and above that threshold errors occur. The Keycloak pod logs show timeouts like the ones below.
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command RemoveCommand on Cache 'authenticationSessions', writing keys [f85ac151-6196-48e9-977c-048fc8bcd975]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2312955 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p16-t1) ISPN000136: Error executing command ReplaceCommand on Cache 'loginFailures', writing keys [LoginFailureKey [ realmId=etc-internal. userId=a76c3578-32fa-42cb-88d7-fcfdccc5f5c6 ]]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2201111 from keycloak-9b6486f7-bgw8s"
ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p20-t1) ISPN000136: Error executing command PutKeyValueCommand on Cache 'sessions', writing keys [6a0e8cde-82b7-4824-8082-109ccfc984b4]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 2296440 from keycloak-9b6486f7-9l9lq"
I can see that Keycloak's RAM and CPU usage stays very low (under 20%), so the HPA never scales out. I therefore think the current Keycloak configuration is not optimal with respect to settings such as the number of CACHE_OWNERS, Access Token Lifespan, SSO Session Idle, SSO Session Max, etc.
I want to know which configurations to tune so that Keycloak can handle 500 concurrent users with a response time of about 3 s. Please help if you know about this!
In the standalone-ha.xml config, I have only updated the datasource configuration, as shown in the image below.

System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor

I've also asked this question on GitHub: https://github.com/Azure/service-fabric-issues/issues/379
I have (n) actors that are executing on a continuous reminder every second.
These actors have been running fine for the last 4 days when, out of nowhere, every instance receives the below exception on calling StateManager.GetStateAsync. Subsequently, I see all the actors are deactivated.
I cannot find any information relating to this exception being encountered by reliable actors.
Once this exception occurs and the actors are deactivated, they do not get re-activated.
What are the conditions for this error to occur and how can I further diagnose the problem?
"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()
Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:
Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.
Partition reconfiguration is taking longer than expected.
fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d
S/S IB _Node1_4 Up 131456742276273986
S/P RD _Node1_2 Up 131456742361691499
P/S RD _Node1_0 Down 131457861497316547
(Showing 3 out of 4 replicas. Total available replicas: 1.)
With a warning in the primary replica of that partition:
Unhealthy event: SourceId='System.RAP', Property='IReplicator.CatchupReplicaSetDuration', HealthState='Warning', ConsiderWarningAsError=false.
And a warning in the ActiveSecondary:
Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.CloseDuration', HealthState='Warning', ConsiderWarningAsError=false. Start Time (UTC): 2017-08-01 04:51:39.740 _Node1_0
3 out of 5 Nodes are showing the following error:
Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.
More Information:
My cluster setup consists of 5 nodes of D1 virtual machines.
Event viewer errors in Microsoft-Service Fabric application:
I see quite a lot of
Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)
and a heap of warnings like:
Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = 2017-08-01T04:51:39.789083900Z
And finally, I think I might have found the root of all this. The Event Viewer Application log has a whole ream of errors like:
Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.
OK, so that error is pointing to the D drive, which is temporary storage. It has 549 MB free of 50 GB.
Should Service Fabric really be persisting to temporary storage?
Re: the errors - yes, it looks like the full disk is causing the failures. Just to close the loop here: it looks like you found out that your state wasn't actually getting distributed in the cluster, and once you fixed that, you stopped seeing the disk fill up. Your capacity planning should hopefully make more sense now.
Regarding safety: TL;DR: Using the temporary drive is fine because you're using Service Fabric. If you weren't, then using that drive for real data would be a very bad idea.
Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single machine app to think that data written there is safe, since Azure may Service heal the VM in response to a bunch of different things.
In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that a lot of the management operations that would destroy that data are managed in the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on that node, we move your service somewhere else before allowing that to happen, and try to stall the update in the meantime. Some more info on that is here.

IBM Bluemix Blockchain SDK-Demo failing

I have been working with the HFC SDK for Node.js and it used to work, but since last night I have been having some problems.
When running helloblockchain.js, it only works a few times; most of the time I get this error when it tries to enroll a new user:
E0113 11:56:05.983919636 5288 handshake.c:128] Security handshake failed: {"created":"#1484304965.983872199","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484304965.983866102","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
Error: Failed to register and enroll JohnDoe: Error
Other times, the enroll works and the failure appears deploying the chaincode:
Enrolled and registered JohnDoe successfully
Deploying chaincode ...
E0113 12:14:27.341527043 5455 handshake.c:128] Security handshake failed: {"created":"#1484306067.341430168","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306067.341421859","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
Failed to deploy chaincode: request={"fcn":"init","args":["a","100","b","200"],"chaincodePath":"chaincode","certificatePath":"/certs/peer/cert.pem"}, error={"error":{"code":14,"metadata":{"_internal_repr":{}}},"msg":"Error"}
Or:
Enrolled and registered JohnDoe successfully
Deploying chaincode ...
E0113 12:15:27.448867739 5483 handshake.c:128] Security handshake failed: {"created":"#1484306127.448692244","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306127.448668047","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
events.js:160
throw er; // Unhandled 'error' event
^
Error
at ClientDuplexStream._emitStatusIfDone (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:189:19)
at ClientDuplexStream._readsDone (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:158:8)
at readCallback (/usr/lib/node_modules/hfc/node_modules/grpc/src/node/src/client.js:217:12)
E0113 12:15:27.563487641 5483 handshake.c:128] Security handshake failed: {"created":"#1484306127.563437122","description":"Handshake read failed","file":"../src/core/lib/security/transport/handshake.c","file_line":237,"referenced_errors":[{"created":"#1484306127.563429661","description":"FD shutdown","file":"../src/core/lib/iomgr/ev_epoll_linux.c","file_line":948}]}
This code worked yesterday, so I don't know what could be happening.
Does anybody know how I can fix it?
Thanks,
Javier.
These types of intermittent issues are usually related to GRPC. An initial suggestion is to ensure that you are using at least GRPC version 1.0.0.
If you are using a Mac, then the maximum number of open file descriptors should be checked (using ulimit -n). Sometimes this is initially set to a low value such as 256, so increasing the value could help.
There are a couple of GRPC issues with similar symptoms.
https://github.com/grpc/grpc/issues/8732
https://github.com/grpc/grpc/issues/8839
https://github.com/grpc/grpc/issues/8382
There is a grpc.initial_reconnect_backoff_ms property that is mentioned in some of these issues. Increasing the value past the 1000 ms level might help reduce the frequency of issues. Below are instructions for how the helloblockchain.js file can be modified to set this property to a higher value.
Open the helloblockchain.js file in the Hyperledger Fabric Client example and find the enrollAndRegisterUsers function.
Add "grpc.initial_reconnect_backoff_ms": 5000 to the setMemberServicesUrl call.
chain.setMemberServicesUrl(ca_url, {
    pem: cert, "grpc.initial_reconnect_backoff_ms": 5000
});
Add "grpc.initial_reconnect_backoff_ms": 5000 to the addPeer call.
chain.addPeer("grpcs://" + peers[i].discovery_host + ":" + peers[i].discovery_port,
{pem: cert, "grpc.initial_reconnect_backoff_ms": 5000
});
Note that setting the grpc.initial_reconnect_backoff_ms property may reduce the frequency of issues, but it will not necessarily eliminate all issues.
The connection to the eventhub that is made in the helloblockchain.js file can also be a factor. There is an earlier version of the Hyperledger Fabric Client that does not utilize the eventhub. This earlier version could be tried to determine if this makes a difference. After running git clone https://github.com/IBM-Blockchain/SDK-Demo.git, run git checkout b7d5195 to use this prior level. Before running node helloblockchain.js from a Node.js command window, the git status command can be used to check the code level that is being used.

Google Cloud SQL + Hikari CP + Communications link failure

I'm experiencing intermittent connectivity errors from a Spring Boot application communicating with a D1 Google Cloud SQL server, using the configuration settings described here: HikariCP MySQL settings.
I was wondering if anyone has encountered this before.
I've read the FAQ posted here (Hikari FAQ) and I'm wondering if my default idleTimeout and maxLifeTime (30 mins) settings might be at fault; wait_timeout and interactive_timeout on the server are both set to the default 28800s (8 hours).
The FAQ says that these two settings should be about a minute less than the server settings, but if I'm losing connections after 30 minutes I can't quite see how upping the maxLifeTime to 7 hrs 59 mins is going to improve the situation.
Does anyone have any recommendations?
Redacted stack trace(s):
I get these from time to time:
org.springframework.security.authentication.InternalAuthenticationServiceException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30018ms of waiting for a connection.
at org.springframework.security.authentication.dao.DaoAuthenticationProvider.retrieveUser(DaoAuthenticationProvider.java:110)
at org.springframework.security.authentication.dao.AbstractUserDetailsAuthenticationProvider.authenticate(AbstractUserDetailsAuthenticationProvider.java:132)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:156)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:177)
...
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
....
Caused by: java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:208)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:108)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
... 59 common frames omitted
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:630)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:695)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:727)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:737)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:787)
Hibernate search:
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000030: 31050 documents indexed in 1147865 ms
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000031: Indexing speed: 27.050219 documents/second; progress: 99.89%
2015-02-17 10:41:59.917 WARN 1 --- [ntifierloader-1] com.zaxxer.hikari.proxy.ConnectionProxy : Connection com.mysql.jdbc.JDBC4Connection#372f2018 (HikariPool-0) marked as broken because of SQLSTATE(08S01), ErrorCode(0).
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 1,611,087 milliseconds ago. The last packet sent successfully to the server was 927,899 milliseconds ago.
The indexing isn't particularly quick at the moment, I think because I'm not using projections. The process takes about 30 minutes to execute.
Thanks
It could be a couple of things here. First, the network infrastructure (firewalls, load-balancers, etc.) between the application tier and the database tier can impose its own connection timeouts, regardless of MySQL settings.
The indexing failure indicates that the connection was out of the pool for ~27 minutes with no SQL activity when that failure occurred.
Second, specifically regarding the "Could not get JDBC Connection" error, you may be running into Cloud SQL connection limits.
I recommend three things. One, make sure you are on the latest HikariCP (2.3.2) and the latest MySQL Connector/J driver (5.1.34). Two, enable DEBUG-level logging for the com.zaxxer.hikari package. HikariCP debug logging is not "chatty", but will log pool statistics every 30 seconds (and sometimes more detail in failure conditions). Lastly, try setting maxPoolSize to something smaller (unless it is already at the default), and setting maxLifeTime to 15 or 20 minutes (1200000ms).
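For reference, here is a minimal sketch of what those pool settings could look like when configuring HikariCP programmatically; the JDBC URL, database name, and credentials are placeholders, and the values are only illustrative starting points, not your actual configuration:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Minimal HikariCP configuration sketch; values are illustrative, not prescriptive.
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://<cloud-sql-host>:3306/mydb"); // placeholder host and database
config.setUsername("appUser");                                // placeholder credentials
config.setPassword("secret");
config.setMaximumPoolSize(5);      // keep the pool small to stay under Cloud SQL connection limits
config.setMaxLifetime(1200000);    // 20 minutes, well under the 8-hour server wait_timeout
config.setIdleTimeout(600000);     // retire idle connections after 10 minutes
HikariDataSource dataSource = new HikariDataSource(config);

In a Spring Boot application, DEBUG logging for the pool can typically be enabled with logging.level.com.zaxxer.hikari=DEBUG in application.properties.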
If the error occurs again, post updated logs containing the HikariCP debug logs around the time of failure. Also, feel free to open a tracking issue over on Github as larger logs etc. are easier there.

MSDTC (Distributed Transaction Coordinator) Stops working. Error code -1073737669

I cannot start the Distributed Transaction Coordinator service.
It stopped working a few days ago.
When I try to start the service:
Registry properties:
RPC (as a test, the values here were changed to the opposite and back, without any results):
Windows Logs\Application logs:
Event ID 53283:
A MS DTC component has encountered an internal error. The process is being terminated. Error Specifics: DtcSystemShutdown (d:\w7rtm\com\complus\dtc\dtc\msdtc\src\msdtc.cpp#2539): Shutting down with an error
Event ID 4111:
The MS DTC service is stopping.
Event ID 4102:
DTC Security Configuration values (OFF = 0 and ON = 1): Network Administration of Transactions = 1,
Network Clients = 1,
Inbound Distributed Transactions using Native MSDTC Protocol = 1,
Outbound Distributed Transactions using Native MSDTC Protocol = 1,
Transaction Internet Protocol (TIP) = 0,
XA Transactions = 1,
SNA LU 6.2 Transactions = 1
Could not initialize the MS DTC Transaction Manager.
Event ID 4356:
Failed to initialize the MS DTC Communication Manager. Error Specifics: hr = 0x80070057, d:\w7rtm\com\complus\dtc\dtc\cm\src\ccm.cpp:2117, CmdLine: C:\Windows\System32\msdtc.exe, Pid: 5332
Event ID 4358:
The MS DTC Connection Manager is unable to register with RPC to use one of LRPC, TCP/IP, or UDP/IP. Please ensure that RPC is configured properly. If "ServerTcpPort" registry key is configured(DWORD value under the HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC for local DTC instance or under cluster hive for clustered DTC instance), please verify if the configured port is valid and the port is not already in use by a different component. Error Specifics:hr = 0x80070057, d:\w7rtm\com\complus\dtc\dtc\cm\src\iomgrsrv.cpp:2523, CmdLine: C:\Windows\System32\msdtc.exe, Pid: 5332
Event ID 4156:
String message: RPC raised an exception with a return code RPC_S_INVALIDA_ARG..
I found that the -resetlog command can be used, but this does not resolve my problem.
The firewall is disabled.
I also tried deleting the HKLM\Software\Microsoft\Rpc\Internet key from the registry.
To get around this issue, I had to copy the log file (which I had accidentally deleted) back to the location specified by the Local DTC Log information setting.