Issues with matching service in Cadence - cadence-workflow

Two days ago, we started seeing some issues with our Cadence setup.
The first thing we noticed is that open workflows were not disappearing from the list once they completed. For example, a workflow still appears as Open in the list, but when you click on it, you can see that it has actually completed.
At the same time this started happening, we noticed that several workflows were taking quite a long time to complete; several of them would get stuck in “Schedule” states and never go further from there. After checking the logs, the only error we saw was this:
{"level":"error","ts":"2021-03-06T19:12:04.865Z","msg":"Persistent store operation failure","service":"cadence-matching","component":"matching-engine","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"store-operation":"create-task","error":"InternalServiceError{Message: CreateTasks operation failed. Error : Request on table cadence.tasks with ttl of 630720000 seconds exceeds maximum supported expiration date of 2038-01-19T03:14:06+00:00. In order to avoid this use a lower TTL, change the expiration date overflow policy or upgrade to a version where this limitation is fixed. See CASSANDRA-14092 for more details.}","wf-task-list-name":"cadence-sys-history-scanner-tasklist-0","wf-task-list-type":1,"number":6300094,"next-number":6300094,"logging-call-at":"taskWriter.go:176","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/matching.(*taskWriter).taskWriterLoop\n\t/cadence/service/matching/taskWriter.go:176"}
Does somebody have an idea of why this is happening?

The first one is because visibility sampling is enabled by default (to protect the default core DB). You can disable it by configuring system.enableVisibilitySampling to false.
But when you do that, it’s better to separate the visibility store and the default store into different database clusters, so that visibility load doesn’t bring down the default (core data model) DB.
See more in https://github.com/uber/cadence/issues/3884
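If you use the file-based dynamic config, the override could look roughly like the snippet below (the file name, location, and empty constraints are deployment-specific assumptions, not taken from this thread):

# dynamicconfig/development.yaml (path and file name depend on your deployment)
system.enableVisibilitySampling:
- value: false
  constraints: {}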
The second issue is a bug that was fixed in 0.16.0.
It should be resolved if you upgrade the server.
See https://github.com/uber/cadence/pull/3627
and https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/recoveringTtlYear2038Problem.html

Related

MongoDB Atlas dedicated cluster: Should I be concerned about 'Restarts in last hour' alerts?

We’re using a standard 3-node Atlas replica set in a dedicated cluster (M10, Mongo 6.0.3, AWS) and have configured an alert that fires if the ‘Restarts in last hour is’ rule exceeds 0 for any node.
https://www.mongodb.com/docs/atlas/reference/alert-conditions/#mongodb-alert-Restarts-in-Last-Hour-is
We’re seeing this alert fire every now and then, and we’re wondering what this means for a node in a dedicated cluster and whether it is something to be concerned about, since I don’t think we have any control over it. Should we disable this rule or increase the restart threshold?
Thanks in advance for any advice.
(Note I've asked this over at the Mongo community support site also, but haven't received any traction yet so asking here too)
I got an excellent response on my question at the Mongo community support site:
A node restarting is not necessarily a cause for concern. However, you should investigate the cause of the restart itself to better determine whether this is an issue or not. You should take a look at your Project Activity Feed to see if you can determine why the nodes are restarting. I understand you have noted this is an M10 cluster, so you should have access to the MongoDB logs; you can also check those to try to determine the cause of the node restart. If you do not have access to the logs, you can consider working with Atlas in-app chat support to diagnose the issue.
It’s always good to keep the alerts active, as they can indicate a potential problem as soon as they occur. You can consider increasing the restart threshold to reduce alert noise after concluding whether the restarts are expected or not.
In my case, having checked the activity feed, I was able to match up all the alerts we were seeing to Mongo version auto-updates on the nodes. We still wanted to keep the alert active, so we've increased the threshold to fire on >1 restart per hour rather than >0, assuming that auto-updates won't be applied multiple times in the same hour.

MarkLogic DLS Document Checkout Timeout

Does MarkLogic DLS offer a file versioning experience similar to Subversion's?
Under Subversion, once a file (document) has been locked, others cannot update it anymore until the file has been committed (checked in) or the lock has been released.
However, in MarkLogic Library Services (DLS), once a document has been checked out, others can still call dls:document-checkout-update-checkin to update it and release the lock. Does that mean it is the developer who should use those dls functions to implement the file lock and unlock mechanism?
I tried to use the timeout parameter in dls:document-checkout. However, it seems the document will remain in checked-out status forever. I do see that parameter when I call 'dls:document-checkout-status'.
Does that mean the developer should compare the current server timestamp against the initial checkout timestamp and the timeout duration to determine whether the file is still locked?
If so, I will need to write some XQuery code and set up a scheduled task in MarkLogic to clean up stale checkouts daily. Is my understanding correct?
Per https://docs.marklogic.com/guide/app-dev/dls#id_56448, I believe the timeout is not enforced automatically - i.e. there's no background process in MarkLogic that periodically inspects documents to see if they should be automatically checked back in or un-checked out. The timeout appears to be meant for a developer to apply their own logic - e.g. allowing a UI to state that "Jane checked this document out and only intended to keep it for 10 minutes, but that was 2 hours ago - would you like to break her checkout?"

G Suite Email Migration Does Not Complete, Stuck on 99%

I'm currently experiencing something rather weird: while migrating emails from a GoDaddy email server to a new G Suite setup for a number of users, I was able to successfully migrate a couple of the email accounts, as confirmed by Google's 'Complete' tick beside them. I was also able to observe the migration as it went on.
However, for one of the accounts, the number of emails read just seems to keep increasing, and it still hasn't displayed 'Complete'; it remains stuck on '99%'.
As of a screenshot I took just now, it says 'Successfully migrated 3230 emails' while stuck on 99%. Then I hit refresh, checked the status of that same account, and now it says '...3250 emails', while still stuck on 99%.
This isn't how it's supposed to behave; at least, that isn't the behaviour I experienced with the previous 4 accounts in that list. Ideally, it should say 'Migrating X out of fixed_amount emails'. In this case, that fixed_amount was about 2,000 emails. It has since passed that figure, but instead of showing 'Complete', it shows 'Successfully migrated new_amount', where new_amount keeps increasing.
This has been ongoing for almost 24 hours now. Honestly, I don't know if this is a bug or not. I really just need some helpful info to know whether I should be concerned, perhaps from someone else who has run into this. Anyone?
I stumbled onto Google's documentation: https://support.google.com/a/answer/7032598?hl=en
To quote the 'Why does my migration look like it's stuck at 99%?' section:
You’ll see 99% when all email is migrated. After everything is migrated, the data migration service applies any labels to the migrated email, which can take time. When the labels are applied, you should see that the migration is complete (100%).
You might also see this issue if the estimated number of emails to migrate exceeds the actual number of messages. The migration will report 99% until the migration completes. This process might take some time.
You shouldn't be concerned. I was migrating around 29,000 emails from a personal Gmail account to Google Workspace Gmail, and the migration took 4 days (migrating only one user), of which the last 1.5 days the migration was "stuck" at 99%. There is no need to restart the migration; eventually it does finish. I also got several error codes (e.g. 17009 - 'Generating an access token with the supplied credentials was unsuccessful...'), but none of them proved to be a real problem; I didn't act on them because, as in your case, I saw the number of migrated emails increasing.

MongoDB: Ensure data is delivered and delivered only once

I have a theoretical problem with MongoDB:
I have an API that reads data from a MongoDB database. We have to make sure that each item in the collection is delivered eventually, but only once after it was inserted or changed. So the client needs the most recent version of each item, but only once; we must never send an item again if there was no change to it.
We first thought of achieving this by using a date: the client sends the date of its last query and we will only deliver the items that were created or changed after that date. The problem I see is that we might miss items if part of the cluster is unavailable for some time and does not get synchronized with the rest. Those items will never be delivered (they were created after the last sync, did not sync with the rest in time, and the client now has a newer "last fetched" time).
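To make the date-based idea concrete, a minimal sketch in Python with pymongo might look like this (the connection string, collection, and the "updated_at" field are assumptions for illustration, not part of the original question):

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
items = client["mydb"]["items"]                    # assumed database/collection

def fetch_changed_since(last_fetch: datetime):
    # Return every item created or changed after the client's last query time.
    # As described above, an item that arrives late with an older timestamp can
    # be skipped forever once last_fetch has already moved past it.
    return list(items.find({"updated_at": {"$gt": last_fetch}}))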
As this will not work, I thought about solving it with some kind of ACK flag, which is false on creation and set to true after the item has been sent to the client. After a change, it would be set back to false. The problem I see here is that if someone changes the item while it is being sent to the client, that change might be lost afterwards (the change sets ACK=false, but the delivery then sets ACK=true).
This again does not seem to work as intended, so now I am thinking about some kind of optimistic locking where I store a version in the DB and only set ACK=true if the version did not change between reading and writing.
This should work, but does not seem optimal at all (what if the call crashes while writing the ACK?). As this seems to be a common problem, what would be the best solution? Or is MongoDB just not the right tool for the job? Is it even possible to solve this if you expect that you have to scale vertically at some point?
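For what it's worth, here is a minimal sketch of that version-based approach in Python with pymongo (the collection layout, field names, and the send() callback are my assumptions, not from the question): every write bumps a version and clears the ACK, and the reader only sets ACK=true if the version it delivered is still the current one.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
items = client["mydb"]["items"]                    # assumed database/collection

def change_item(item_id, fields):
    # Every change bumps the version and marks the item as undelivered again.
    items.update_one(
        {"_id": item_id},
        {"$set": {**fields, "acked": False}, "$inc": {"version": 1}},
    )

def deliver_pending(send):
    # Deliver everything that has not been acknowledged yet.
    for item in items.find({"acked": False}):
        send(item)  # hand the current version to the client
        # Only mark it delivered if nobody changed it in the meantime; if the
        # version moved on, the item stays pending and the newer content is
        # sent on the next pass.
        items.update_one(
            {"_id": item["_id"], "version": item["version"]},
            {"$set": {"acked": True}},
        )

Note that this sketch only gives at-least-once delivery from the database's point of view: if the process crashes between send() and the ACK update (the exact concern raised above), the item is sent again, so the client would still need to deduplicate by _id and version to get the "only once" guarantee.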

Salesforce: Trigger that fires off a Workflow rule has stopped working - any ideas?

So in one part of our customised Salesforce system, the following happens:
a trigger changes the value of a picklist on a custom object
a Workflow rule detects that change and fires off an email.
Since about the 4th of December though, it seems to have stopped working.
Edit: The Debug Logs show that the trigger is firing and changing the value of the picklist, but no Workflow Rules are evaluated.
The workflow rule is pretty simple, so I don't really understand what's preventing it. The details of the rule are:
Operates on a custom object.
Evaluation Criteria: When a record is created, or when a record is edited and did not previously meet the rule criteria
Rule Criteria: ISPICKVAL(Status__c, 'Not Started')
Active: Yes
Immediate Workflow Actions: an email alert
Edit: The Rule does fire if I manually update the object to set the appropriate status. But it isn't firing when a trigger changes the status.
Edit: Did something change on Salesforce around December 4th 2009? That seems to be when this stopped working ...
Any ideas?
If you had said "the trigger does not fire the workflow, even though a manual change via the UI does", I would have responded something like...
Absolutely. That's how it is designed. Salesforce does not allow anything automated to invoke anything automated (i.e. you cannot start a WF from a trigger or another WF).
Given that you say this stopped working earlier in the month, I am frankly astonished! We wanted to achieve something like this about 10 months ago, and Salesforce told us it could not be done; they like to keep tight control over processes that could potentially run away and consume a lot of CPU (because of the multi-tenanted nature of the offering), hence the stringent governor limits...
This may have changed recently, of course; we built workarounds to get around the restriction...
To answer my own question ... I eventually found out what this was.
The Salesforce Spring '09 Workflow Rule and Roll-Up Summary Field Evaluations update was rolled out to all orgs at the start of Dec '09, and changed certain Workflow behaviours.
The update improves the accuracy of your data and prevents the reevaluation of workflow rules in the event of a recursion.
Our particular problem was that we needed Workflow to be evaluated twice on a single object after the initial action - we had a series of changes to a status field that needed to kick off different things. After the Spring '09 update, Workflow is only evaluated once for an action on an object.
So, it did work, but then the platform changed, and it didn't work anymore. Time to write some code.