System.Fabric.FabricNotPrimaryException on GetStateAsync inside Actor - azure-service-fabric

I've also asked this question on GitHub: https://github.com/Azure/service-fabric-issues/issues/379
I have (n) actors executing on a continuous reminder every second.
These actors have been running fine for the last 4 days when, out of nowhere, every instance received the exception below on calling StateManager.GetStateAsync. Subsequently, I see that all the actors are deactivated.
I cannot find any information relating to this exception being encountered by reliable actors.
Once this exception occurs and the actors are deactivated, they do not get re-activated.
What are the conditions for this error to occur and how can I further diagnose the problem?
"System.Fabric.FabricNotPrimaryException: Exception of type 'System.Fabric.FabricNotPrimaryException' was thrown. at Microsoft.ServiceFabric.Actors.Runtime.ActorStateProviderHelper.d__81.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__181.MoveNext() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ServiceFabric.Actors.Runtime.ActorStateManager.d__7`1.MoveNext()
Having a look at the cluster explorer, I can now see the following warnings on one of the partitions for that actor service:
Unhealthy event: SourceId='System.FM', Property='State', HealthState='Warning', ConsiderWarningAsError=false.
Partition reconfiguration is taking longer than expected.
fabric:/Ism.TvcRecognition.App/TvChannelMonitor 3 3 4dcca5ee-2297-44f9-b63e-76a60df3bc3d
S/S IB _Node1_4 Up 131456742276273986
S/P RD _Node1_2 Up 131456742361691499
P/S RD _Node1_0 Down 131457861497316547
(Showing 3 out of 4 replicas. Total available replicas: 1.)
With a warning in the primary replica of that partition:
Unhealthy event: SourceId='System.RAP', Property='IReplicator.CatchupReplicaSetDuration', HealthState='Warning', ConsiderWarningAsError=false.
And a warning in the ActiveSecondary:
Unhealthy event: SourceId='System.RAP', Property='IStatefulServiceReplica.CloseDuration', HealthState='Warning', ConsiderWarningAsError=false. Start Time (UTC): 2017-08-01 04:51:39.740 _Node1_0
3 out of 5 nodes are showing the following warning:
Unhealthy event: SourceId='FabricDCA', Property='DataCollectionAgent.DiskSpaceAvailable', HealthState='Warning', ConsiderWarningAsError=false. The Data Collection Agent (DCA) does not have enough disk space to operate. Diagnostics information will be left uncollected if this continues to happen.
More Information:
My cluster consists of 5 nodes of D1 virtual machines.
Event Viewer errors in the Microsoft-Service Fabric application log:
I see quite a lot of
Failed to read some or all of the events from ETL file D:\SvcFab\Log\QueryTraces\query_traces_5.6.231.9494_131460372168133038_1.etl.
System.ComponentModel.Win32Exception (0x80004005): The handle is invalid
at Tools.EtlReader.TraceFileEventReader.ReadEvents(DateTime startTime, DateTime endTime)
at System.Fabric.Dca.Utility.PerformWithRetries[T](Action`1 worker, T context, RetriableOperationExceptionHandler exceptionHandler, Int32 initialRetryIntervalMs, Int32 maxRetryCount, Int32 maxRetryIntervalMs)
at FabricDCA.EtlProcessor.ProcessActiveEtlFile(FileInfo etlFile, DateTime lastEndTime, DateTime& newEndTime, CancellationToken cancellationToken)
and a heap of warnings like:
Api IStatefulServiceReplica.Close() slow on partition {4dcca5ee-2297-44f9-b63e-76a60df3bc3d} replica 131457861497316547, StartTimeUTC = 2017-08-01T04:51:39.789083900Z
And finally, I think I might have found the root of all this. The Event Viewer Application log has a whole ream of errors like:
Ism.TvcRecognition.TvChannelMonitor (3688) (4dcca5ee-2297-44f9-b63e-76a60df3bc3d:131457861497316547): An attempt to write to the file "D:\SvcFab_App\Ism.TvcRecognition.AppType_App1\work\P_4dcca5ee-2297-44f9-b63e-76a60df3bc3d\R_131457861497316547\edbres00002.jrs" at offset 5242880 (0x0000000000500000) for 0 (0x00000000) bytes failed after 0.000 seconds with system error 112 (0x00000070): "There is not enough space on the disk. ". The write operation will fail with error -1808 (0xfffff8f0). If this error persists then the file may be damaged and may need to be restored from a previous backup.
OK, so that error points to the D: drive, which is temporary storage. It has 549 MB free of 50 GB.
Should Service Fabric really be persisting to temporary storage?

Re: the errors - yeah, it looks like the disk filling up is causing the failures. Just to close the loop here - it looks like you found out that your state wasn't actually getting distributed in the cluster, and once you fixed that you stopped seeing the disk fill up. Your capacity planning should hopefully make more sense now.
Regarding safety: TL;DR: using the temporary drive is fine because you're using Service Fabric. If you weren't, then using that drive for real data would be a very bad idea.
Those drives are "temporary" from Azure's perspective in the sense that they're the local drives on the machine. Azure doesn't know what you're doing with the drives, and it doesn't want any single-machine app to think that data written there is safe, since Azure may service-heal the VM in response to a bunch of different things.
In SF we replicate the data to multiple machines, so using the local disks is fine/safe. SF also integrates with Azure so that a lot of the management operations that would destroy that data are managed in the cluster to prevent exactly that from happening. When Azure announces that it's going to do an update that will destroy the data on that node, we move your service somewhere else before allowing that to happen, and try to stall the update in the meantime. Some more info on that is here.

Related

Unable to start Tomcat 9 with Flowable WAR - PUBLIC.ACT_DE_DATABASECHANGELOGLOCK error

I have downloaded Flowable from flowable.com/open-source and placed flowable-ui.war and flowable-rest.war in the Tomcat 9.0.52 webapps folder.
When I start the server, after some time I see the lines below repeating in the console, and then the server stops.
SELECT LOCKED FROM PUBLIC.ACT_DE_DATABASECHANGELOGLOCK WHERE ID=1
2021-08-13 20:45:05.818 INFO 8316 --- [ main] l.lockservice.StandardLockService : Waiting for changelog lock.
Why is this issue occurring? I have not made any changes.
The message
l.lockservice.StandardLockService : Waiting for changelog lock.
occurs when Flowable is waiting for the database changelog lock to be released.
If that doesn't happen, it means that some other node has picked up the lock and not released it properly. I would suggest manually deleting the lock from that table (ACT_DE_DATABASECHANGELOGLOCK), for example as shown below.
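A minimal sketch of that manual cleanup, using the table and schema shown in the log above (an assumption about your database layout, not an official Flowable command; back up first and run it only while the servers are stopped):
DELETE FROM PUBLIC.ACT_DE_DATABASECHANGELOGLOCK WHERE ID = 1;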
In addition to that, there is no need to run both flowable-ui.war and flowable-rest.war. flowable-rest.war is a subset of flowable-ui.war.

How to fill data up to a size on multiple disks?

I am creating 4 mount-point disks on Windows. I need to copy files up to a threshold value (say 50 GB).
I tried vdbench. It works fine, but it throws an exception at the end.
compratio=4
dedupratio=1
dedupunit=256k
* Host Definition section
hd=default,user=Administator,shell=vdbench,jvms=1
hd=localhost,system=localhost
********************************************************************************
* Storage Definition section
fsd=fsd1,anchor=C:\UnMapTest-Volume1\disk1\,depth=1,width=1,files=1,size=5g
fsd=fsd2,anchor=C:\UnMapTest-Volume2\disk2\,depth=1,width=1,files=1,size=5g
fwd=fwd1,fsd=fsd*,operation=write,xfersize=1m,fileio=sequential,fileselect=random,threads=10
rd=rd1,fwd=fwd1,fwdrate=max,format=yes,elapsed=1h,interval=1
Below is the exception from vdbench. Because of this, my calling script fails.
05:29:14.287 Message from slave localhost-0:
05:29:14.289 file=C:\UnMapTest-Volume1\disk1\\vdb.1_1.dir\vdb_f0001.file,busy=true
05:29:14.290 Thread: FwgThread write C:\UnMapTest-Volume1\disk1\ rd=rd1 For loops: None
05:29:14.291
05:29:14.292 last_ok_request: Thu Dec 28 05:28:57 PST 2017
05:29:14.292 Duration: 16.92 seconds
05:29:14.293 consecutive_blocks: 10001
05:29:14.294 last_block: FILE_BUSY File busy
05:29:14.294 operation: write
05:29:14.295
05:29:14.296 Do you maybe have more threads running than that you have
05:29:14.296 files and therefore some threads ultimately give up after 10000 tries?
05:29:14.300 *
05:29:14.301 ******************************************************
05:29:14.302 * Slave localhost-0 aborting: Too many thread blocks *
05:29:14.302 ******************************************************
05:29:14.303 *
05:29:21.235
05:29:21.235 Slave localhost-0 prematurely terminated.
05:29:21.235
05:29:21.235 Slave aborted. Abort message received:
05:29:21.235 Too many thread blocks
05:29:21.235
05:29:21.235 Look at file localhost-0.stdout.html for more information.
05:29:21.735
05:29:21.735 Slave localhost-0 prematurely terminated.
05:29:21.735
java.lang.RuntimeException: Slave localhost-0 prematurely terminated.
at Vdb.common.failure(common.java:335)
at Vdb.SlaveStarter.startSlave(SlaveStarter.java:198)
at Vdb.SlaveStarter.run(SlaveStarter.java:47)
I am using PowerShell on a Windows machine. If some other tool such as DiskSpd has a way to fill data up to a threshold, please let me know.
I found the answer myself.
I have done this using Diskspd.exe as below.
The following command fills 50 GB of data in the mentioned disk folder:
.\diskspd.exe -c50G -b4K -t2 C:\UnMapTest-Volume1\disk1\testfile1.dat
It is much simpler than vdbench for my requirement.
Caution: it does not write real data, so the consumed size on the array side does not reflect the file size.
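If you would rather not depend on an external tool at all, here is a rough Java sketch of the same idea (writing zero-filled blocks until a target size is reached); the paths are the ones from my setup, and the last block may overshoot the target by up to 1 MiB:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FillDisk {

    // Writes zero-filled 1 MiB blocks until the file reaches targetBytes.
    static void fill(Path file, long targetBytes) throws IOException {
        ByteBuffer block = ByteBuffer.allocate(1024 * 1024);
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long written = 0;
            while (written < targetBytes) {
                block.clear();
                written += channel.write(block);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long fiftyGB = 50L * 1024 * 1024 * 1024;
        fill(Paths.get("C:\\UnMapTest-Volume1\\disk1\\testfile1.dat"), fiftyGB);
        fill(Paths.get("C:\\UnMapTest-Volume2\\disk2\\testfile2.dat"), fiftyGB);
    }
}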

EclipseLink optimal cache settings suitable for cache coordination. UPDATE: deadlocks while using SEND_NEW_OBJECTS_WITH_CHANGES

My tech stack:
EclipseLink 2.6.3, JPA 2.0, IBM WebSphere 8.5.5.8, with JTA enabled.
Currently in my project I have enabled the shared cache and have the below cache settings on all my non-read-only entities, with CacheType SOFT_WEAK:
@Cacheable
@Cache(
    alwaysRefresh = true,
    refreshOnlyIfNewer = true,
    coordinationType = CacheCoordinationType.SEND_NEW_OBJECTS_WITH_CHANGES,
    expiry = 3600000 // changed according to entity
)
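For illustration, this is roughly how those settings sit on one of the entities, including the SOFT_WEAK type mentioned above (a simplified sketch; the class name is taken from the log below and the field mappings are omitted):
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import org.eclipse.persistence.annotations.Cache;
import org.eclipse.persistence.annotations.CacheCoordinationType;
import org.eclipse.persistence.annotations.CacheType;

@Entity
@Cacheable
@Cache(
    type = CacheType.SOFT_WEAK,
    alwaysRefresh = true,
    refreshOnlyIfNewer = true,
    coordinationType = CacheCoordinationType.SEND_NEW_OBJECTS_WITH_CHANGES,
    expiry = 3600000 // changed according to entity
)
public class StatusUpdateEvent {
    // id and other fields omitted
}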
I have also configured JMS cache coordination, as my app is hosted in a clustered environment.
My question is: are my cache settings in line with EclipseLink best practices when cache coordination is in place?
Should I change anything apart from what is mentioned above?
UPDATE:
Now I have an issue with this cache coordination setting (CacheCoordinationType.SEND_NEW_OBJECTS_WITH_CHANGES): it enters a deadlock, ends up locking threads, and my server crashes. I'm not sure what's happening here. I've debugged the source code and it shows that it acquires locks on the thread and on the object's cache entry to merge the changes, but that should happen in minimal time. Why is it taking 10 minutes for the locks to be released?
Any help would be appreciated.
Below is the error I'm facing in my QA/Prod environments.
[14/09/17 13:37:16:015 BST] 000003d4 SystemOut O [EL Finer]: 2017-09-14 13:37:16.014--ServerSession(772764452)--Thread(Thread[SIBJMSRAThreadPool : 7,5,main])--
Potential deadlock encountered while thread: SIBJMSRAThreadPool : 7 attempted to lock object of class: class com.jlr.vista.business.order.model.event.StatusUpdateEvent with id: 324,836,575, entering deadlock avoidance algorithm. This is a notice only.
[14/09/17 13:47:16:015 BST] 000003d4 SystemOut O [EL Severe]: 2017-09-14 13:47:16.015--ServerSession(772764452)--Thread(Thread[SIBJMSRAThreadPool : 7,5,main])--MAX TIME 600 seconds EXCEEDED FOR WRITELOCKMANAGER WAIT. Waiting on Entity type: com.business.model.event.UpdateEvent with pk: 324,836,575 currently locked by thread: SIBJMSRAThreadPool : 1 with the following trace:
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:237)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2187)
at com.ibm.ws.util.BoundedBuffer$GetQueueLock.await(BoundedBuffer.java:286)
at com.ibm.ws.util.BoundedBuffer.waitGet_(BoundedBuffer.java:425)
at com.ibm.ws.util.BoundedBuffer.take(BoundedBuffer.java:823)
at com.ibm.ws.util.ThreadPool.getTask(ThreadPool.java:1059)
at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1916)
NOTE: My application is very large and has 150+ entities that are writable.

Google Cloud SQL + Hikari CP + Communications link failure

I'm experiencing intermittent connectivity errors from a Spring Boot application communicating with a D1 Google Cloud SQL server, with the configuration settings described here: HikariCP MySQL settings
I was wondering if anyone has encountered this before.
I've read the FAQ posted here (Hikari FAQ) and I'm wondering if my default idleTimeout and maxLifetime (30 min) settings might be at fault; wait_timeout and interactive_timeout on the server are both set to the default of 28800 s (8 hours).
The FAQ says that these two settings should be about a minute less than the server settings, but if I'm losing connections after 30 minutes, I can't quite see how upping maxLifetime to 7 hrs 59 mins is going to improve the situation.
Does anyone have any recommendations?
Redacted stack trace(s):
I get these from time to time:
org.springframework.security.authentication.InternalAuthenticationServiceException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30018ms of waiting for a connection.
at org.springframework.security.authentication.dao.DaoAuthenticationProvider.retrieveUser(DaoAuthenticationProvider.java:110)
at org.springframework.security.authentication.dao.AbstractUserDetailsAuthenticationProvider.authenticate(AbstractUserDetailsAuthenticationProvider.java:132)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:156)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:177)
...
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
....
Caused by: java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:208)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:108)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
... 59 common frames omitted
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:630)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:695)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:727)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:737)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:787)
Hibernate search:
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000030: 31050 documents indexed in 1147865 ms
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000031: Indexing speed: 27.050219 documents/second; progress: 99.89%
2015-02-17 10:41:59.917 WARN 1 --- [ntifierloader-1] com.zaxxer.hikari.proxy.ConnectionProxy : Connection com.mysql.jdbc.JDBC4Connection#372f2018 (HikariPool-0) marked as broken because of SQLSTATE(08S01), ErrorCode(0).
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 1,611,087 milliseconds ago. The last packet sent successfully to the server was 927,899 milliseconds ago.
The indexing isn't particularly quick at the moment, I think because I'm not using projections. The process takes about 30 minutes to execute.
Thanks
It could be a couple of things here. First, the network infrastructure (firewalls, load balancers, etc.) between the application tier and the database tier can impose its own connection timeouts, regardless of MySQL settings.
The indexing failure indicates that the connection was out of the pool for ~27 minutes with no SQL activity when that failure occurred.
Second, specifically regarding the "Could not get JDBC Connection" error, you may be running into Cloud SQL connection limits.
I recommend three things. One, make sure you are on the latest HikariCP (2.3.2) and the latest MySQL Connector/J driver (5.1.34). Two, enable DEBUG-level logging for the com.zaxxer.hikari package. HikariCP debug logging is not "chatty", but it will log pool statistics every 30 seconds (and sometimes more detail in failure conditions). Lastly, try setting maximumPoolSize to something smaller (unless it's already at the default), and setting maxLifetime to 15 or 20 minutes (1200000 ms).
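For illustration, a minimal sketch of those pool settings configured directly on HikariCP (a sketch only, not your Spring Boot configuration; the JDBC URL, credentials, and pool size are placeholders):
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolConfigSketch {
    public static HikariDataSource newDataSource() {
        HikariConfig config = new HikariConfig();
        // Placeholder connection details - substitute your Cloud SQL host, database, and user
        config.setDriverClassName("com.mysql.jdbc.Driver");
        config.setJdbcUrl("jdbc:mysql://<cloud-sql-host>:3306/mydb");
        config.setUsername("appuser");
        config.setPassword("secret");
        // Keep the pool small and retire connections well before any
        // network or server-side timeout, per the recommendation above.
        config.setMaximumPoolSize(10);    // example value
        config.setMaxLifetime(1200000L);  // 20 minutes, in milliseconds
        config.setIdleTimeout(600000L);   // 10 minutes, in milliseconds
        return new HikariDataSource(config);
    }
}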
If the error occurs again, post updated logs containing the HikariCP debug output from around the time of the failure. Also, feel free to open a tracking issue over on GitHub, as larger logs etc. are easier to handle there.

ClickOnce: DeploymentDownloadException: The operation has timed out

Symptom: ClickOnce installation starts and stops after around 600 kB (out of 2 MB).
Progress bar always stops at the same value (tried ten times).
The error log says "The operation has timed out" (in the inner exception) and the install fails with "DeploymentDownloadException (Unknown subtype)".
Error log details (irrelevant information trimmed):
ERROR DETAILS
Following errors were detected during this operation.
System.Deployment.Application.DeploymentDownloadException (Unknown subtype)
- Downloading http://fullpath/name.dll.deploy did not succeed.
- Source: System.Deployment
- Stack trace:
   at System.Deployment.Application.SystemNetDownloader.DownloadSingleFile(DownloadQueueItem next)
   at System.Deployment.Application.SystemNetDownloader.DownloadAllFiles()
   at System.Deployment.Application.FileDownloader.Download(SubscriptionState subState)
--- Inner Exception ---
System.Net.WebException
- The operation has timed out.
- Source: System
- Stack trace:
   at System.Net.ConnectStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.Deployment.Application.SystemNetDownloader.DownloadSingleFile(DownloadQueueItem next)
This only happens for two customers. The install works fine for thousands of others. I have found numerous posts via Google with no answer, or with generic answers like "the firewall is the issue" or "the customer was using dial-up".
Has anyone solved this? Is this a ClickOnce bug?
Disabling firewall software on the machine did not help, because a hardware firewall installed on the network was the cause (a FortiGate 30B).
I doubt that it's a bug. However, it seems like it gets stuck at one file in the deployment path. Maybe it is a type of file that is blocked by a firewall.
I would just remove all files but one from the build and see if that one gets downloaded OK, then add the rest of the files one by one (or maybe type by type) and see at which file ClickOnce gets stuck downloading.
If that doesn't seem to do anything, I'd build a dummy app and deploy it with ClickOnce and see if it installs at all on the customer's box.