OptimisticConcurrencyException that is cleared by restarting IIS - entity-framework

I work on a web application using MVC and Entity Framework 6.1. We typically have around 100 concurrent users load balanced across two IIS servers.
Twice in the last two months we've had periods where numerous users report seeing errors that are caused by OptimisticConcurrencyException ("Store update, insert, or delete statement affected an unexpected number of rows (0). Entities may have been modified or deleted since entities were loaded. See http://go.microsoft.com/fwlink/?LinkId=472540 for information on understanding and handling optimistic concurrency exceptions.")
Based on the logs in Application Insights, we can see many instances of this exception during the period when users were reporting problems, up to 10 per minute. This affected many users, but only one of the load-balanced web servers. A range of update/delete operations were affected.
The first time this occurred we were able to resolve it by restarting IIS on the affected server. We then didn't see a single occurrence of the exception for about 6 weeks, until it suddenly started happening again. Again, restarting IIS was able to resolve it.
Most people seeing this exception appear to see it consistently across environments and can fix it by making a change to their EF models. What I've seen seems to be more a server-related problem. Possibly a cache that is not being cleared when appropriate.
What steps can I take to identify the cause of this issue? Is there additional logging I could configure?

Related

Issues with matching service in Cadence

Two days ago, we started experiencing some issues with our Cadence setup.
The first thing we noticed is the Open workflows were not disappearing from the list once they completed. For example this workflow appears as Open in the list:
But when you click on it, you will see that it’s actually completed:
At the same time, we noticed that several workflows took quite a long time to complete, and several of them would get stuck in the “Schedule” state and never go further. After checking the logs, the only error we saw was this:
{"level":"error","ts":"2021-03-06T19:12:04.865Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"store-operation":"create-task","error":"InternalServiceError{Message: CreateTasks operation failed. Error : Request on table cadence.tasks with ttl of 630720000 seconds exceeds maximum supported expiration date of 2038-01-19T03:14:06+00:00. In order to avoid this use a lower TTL, change the expiration date overflow policy or upgrade to a version where this limitation is fixed. See CASSANDRA-14092 for more details.}","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"number":6300094,"next-number":6300094,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}
Does somebody have an idea of why this is happening?
The first one happens because visibility sampling is enabled by default (to protect the default core DB). You can disable it by setting system.enableVisibilitySampling to false.
But if you do that, it's better to put the visibility store and the default store in different database clusters so that visibility doesn't bring down the default (core data model) DB.
See more in https://github.com/uber/cadence/issues/3884
The second is a bug fixed in 0.16.0
It should be resolved if you upgrade server.
See https://github.com/uber/cadence/pull/3627
and https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/recoveringTtlYear2038Problem.html
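To see why the write in the log is rejected, the arithmetic can be checked directly. The sketch below only uses values copied from the error message (the TTL, the maximum expiration date, and the log timestamp); the function names are made up for illustration and are not part of Cadence or Cassandra.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the log line above.
TTL_SECONDS = 630_720_000
MAX_EXPIRATION = datetime(2038, 1, 19, 3, 14, 6, tzinfo=timezone.utc)
WRITE_TIME = datetime(2021, 3, 6, 19, 12, 4, tzinfo=timezone.utc)


def expiration(write_time, ttl_seconds):
    """Return the moment a row written at write_time would expire."""
    return write_time + timedelta(seconds=ttl_seconds)


def exceeds_limit(write_time, ttl_seconds):
    """True if the expiration falls after the maximum date named in the error."""
    return expiration(write_time, ttl_seconds) > MAX_EXPIRATION


# The TTL is exactly 20 years of 365 days each.
ttl_days = TTL_SECONDS // 86_400

print(ttl_days, expiration(WRITE_TIME, TTL_SECONDS), exceeds_limit(WRITE_TIME, TTL_SECONDS))
```

With a 20-year TTL, any write made less than 20 years before the 2038 limit lands past it, which is why the request is refused until the TTL is lowered or the server is upgraded.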

Is there a way to automate the monitoring and termination of AWS ECS tasks that are silently progressing?

I've been using AWS Fargate for quite a while and have been a big fan of the service.
This week, I created a monitoring dashboard that details the latest runtimes of my containers, and the timestamp watermark of each of my tables (the MAX date updated value). I have SNS topics set up to email me whenever a container exits with code 1.
However, I encountered a tricky issue today that slipped past these safeguards because of what I suspect was a deadlock situation related to a Postgres RDS instance.
I have several tasks running at different points in the day on a scheduler (usually every X or Y hours). Most of these tasks will perform some business logic calculations and insert / update an RDS instance.
One of my tasks (as I found when checking the CloudWatch logs later) was stuck making an update to a table, and basically just hung there waiting. My guess is that a user (perhaps me) was manually running a small update statement against the same table, triggering some sort of lock.
Because I have my tasks set on a recurring basis, the same task had another container provisioned a few hours later, attempted to update the same table, and also hung.
I only noticed this issue because my monitoring dashboard showed that the date updated watermark was still a few days in the past, even though I hadn't gotten any alerts or notifications for errors during my container run time. By this time, I had 3 containers all running, each stuck on the same update to the same table.
After I logged into the ECS console, I saw that my cluster had 3 task instances running - all the same task, all stuck making the same insert.
So my questions are:
Is there a way to specify a maximum runtime for these tasks (i.e. if the task doesn't finish within 2 hours, terminate with an exit code of 1)?
What is the best way to prevent this type of "silent failure" in the future? I've added application logic that queries my RDS instance for blocked process IDs, and if it finds any blocked PIDs, it skips the update. But are there any more graceful ways of detecting and handling this issue?
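One way to get the "terminate with exit code 1 after N hours" behaviour from the first question is to enforce the deadline inside the container itself, by wrapping the real job in a small watchdog. The sketch below is a minimal illustration using only the Python standard library; `run_with_deadline` and the command passed to it are hypothetical names, not part of ECS or Fargate. Separately, Postgres has `lock_timeout` and `statement_timeout` settings that make a blocked statement fail instead of waiting indefinitely, which may suit the second question better.

```python
import subprocess
import sys


def run_with_deadline(cmd, max_seconds):
    """Run cmd and return its exit code, or 1 if it runs past max_seconds.

    subprocess.run kills the child and waits for it before raising
    TimeoutExpired, so no stuck process is left behind in the container.
    """
    try:
        completed = subprocess.run(cmd, timeout=max_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        print(f"job exceeded {max_seconds}s and was terminated", file=sys.stderr)
        return 1
```

If the container entrypoint calls `sys.exit(run_with_deadline([...], 2 * 60 * 60))`, a hung update turns into an ordinary exit code 1, which the existing SNS alerting would then pick up.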

Why is my mongodb collection deleted automatically?

I have a MongoDB client on three EC2 instances, and I have created a replica set. Previously I had a problem with a space constraint that stopped my mongod process, thereby halting the application. Then, a couple of days ago, some of my tables were gone from the database, so I enabled logging on the database to catch anything like that if it happened again. In a fresh incident this morning I was unable to log in to my system, and that's when I found out that the whole database was empty. I checked other SO questions like this, which point to a TTL having been set up, but I haven't done that at all.
Now how do I debug this situation and do a proper root cause analysis? I can't even find anything in my debug logs as well. The tables just vanished. How do I set up proper logging mechanism and how do I ensure that all my tables are never ever deleted again?
Today I got an email from Amazon saying that I was probably running an unsecured version of MongoDB and that this may have caused the issue. So whoever is facing this issue, please go through the Security Checklist provided by MongoDB. There are some points in there that are absolutely necessary.
1. Enable Access Control and Enforce Authentication
2. Encrypt Communication
3. Limit Network Exposure
These three are the core, and depending on how many people access your database, you can also configure Role-Based Access Control.
These are all the things I have done. Before this incident I had not taken security that seriously, but after being hit by it I made sure I have all the necessary precautions in place.
Hope this helps someone.

Distributed Recovery - can this be done without timeout?

We have a mail sender application, that receives a bunch of mails in one blob, and then puts all those mails into database. This can take up to ten minutes. During this process the state of the mailing is BUILDING.
When it is finished the state gets changed to READY.
When the server crashes (shouldn't happen of course) and restarts, it looks for all mailings with status BUILDING and marks them as ERROR. This happens, because we never want to send incomplete mailings.
Now we'd like to scale up using a second server. The recovery strategy above doesn't work here.
For example, server 1 is BUILDING a mailing, and server 2 crashes and restarts. Now server 2 will see the BUILDING mailing and won't know whether it has been aborted or is still running on another server.
So what's the best recovery strategy for distributed services?
(We thought about some timeout mechanism, where the BUILDING server updates a timestamp every few seconds, and when some server reboots it checks if there's a BUILDING mailing that hasn't been updated for x minutes. Then it's highly possible that this mailing has been aborted.)
EDIT:
What I'd like to achieve: If some server restarts (after a crash or just because we added a new mailing server to the cluster), it should not mark mailings as ERROR if this particular mailing is actually being built (by another server).
Nice to have: this should work without having to store server IDs, so that servers can easily be added and/or removed. Otherwise it would not be possible to completely remove a server, because there might be a BUILDING mailing with that particular server ID. If that server has been removed and will never be started again, the only server that could set the mailing to ERROR is gone.
Add two things to your state tracking: a timestamp and the server working on it.
If a server starts up and sees anything in a building state for itself it knows it failed. Conversely, if it starts up and sees something in a building state for another server, it now has information that it's going to need to look at later to see if there's a problem that needs to be addressed. You need to worry about multiple servers restarting at the same time, so you can't just have a server grab all old bundles for all servers at startup.
Or you can just use a clustering service for your OS.
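The timestamp idea from the question and the answer can be sketched as follows. This is only an in-memory illustration under stated assumptions: the `Mailing` class, the `last_heartbeat` field, and `fail_stale_builds` are hypothetical names, and the building server is assumed to refresh `last_heartbeat` every few seconds while it works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Mailing:
    id: int
    state: str                 # "BUILDING", "READY" or "ERROR"
    last_heartbeat: datetime   # refreshed periodically by the building server


def fail_stale_builds(mailings, now, timeout):
    """Mark BUILDING mailings whose heartbeat is older than timeout as ERROR.

    Returns the ids that were changed. Mailings with a recent heartbeat
    are left alone because another server is presumably still building them.
    """
    changed = []
    for mailing in mailings:
        if mailing.state == "BUILDING" and now - mailing.last_heartbeat > timeout:
            mailing.state = "ERROR"
            changed.append(mailing.id)
    return changed
```

This variant needs no server ID, so servers can be added or removed freely. Against a real database the check and the state change would probably be done as one conditional update, so that two servers restarting at the same time cannot both act on the same mailing.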

Managing Core Data iCloud Transaction Logs

I am using iCloud with Core Data, based on the SQLite "Library-style" application design as specified by Apple. While the basic functionality works very well, I am concerned about the transaction logs and how they are managed.
While the database for my app is not large, it is very active and the core data stack is saved many times while the app is in use. I have noticed that a new transaction log is created for every core data save. The end result is that I have a TON of transaction logs and they take up much more space than the actual database.
1) Do the transaction logs ever get automatically pruned / coalesced, or will they continue to grow indefinitely, eventually numbering in the thousands and taking up many megabytes? It seems that the only way to manually purge the transaction logs and recreate a .baseline archive would be to disable and then re-enable iCloud (removing the ubiquity container and starting fresh). But this is obviously not a good solution.
2) My current architecture saves the core data stack often, even for minor changes. In general, this makes sense as my database is small and saving often ensures that the database file is always up-to-date. However, given the above issues with transaction logs, I am thinking that I should perhaps minimize saves to the database. Maybe doing so on a timed basis and/or on app transition states.
3) Even if I minimize the number of transaction logs by reducing how often I save the database, there seems to be an issue here, as the logs will continue to grow in number over time. Eventually the benefit of the "transaction log" design will become a burden in terms of the amount of iCloud storage used and the initial iCloud sync as a new device is added.
As Apple has provided very sparse information on iCloud and almost nothing in the form of "best practices", I would appreciate any insight from the community.
I filed a radar on this issue and received the following reply. They noted that it should be working correctly in iOS 5.1, though I have not yet verified this myself.
A clarification for any who might misunderstand the following. The transaction logs will be cleaned up by the core data internals. This is NOT something that should be performed by the application itself.
Engineering has provided the following feedback regarding this issue:
The transaction logs are intended to be deleted after all the active
peers have had a chance to read them, and they exceed a threshold of
space consumed. There was a previous issue that prevented the devices
from correctly doing so.