In an app on slingr.io there is a listener that gets executed when a webhook arrives. Inside that listener we have code like this:
// process webhook
// ...
record.field('status').val('active');
sys.data.save(record);
In the logs we are seeing that in many cases we are getting the following error:
» 2019-09-25 18:52:00.349 ERROR system#nbt.slingrs.io Optimistic locking exception saving record [Order T792-18]
This is not happening all the time, but only in some cases. What's the reason and how to prevent it from happening?
This is due to a concurrency issue: many webhooks are probably arriving at almost the same time, so multiple threads are trying to update the same record concurrently.
The most convenient way to avoid this problem when editing a record is to use the lock() method like this:
// process webhook
// ...
record.lock(function(record) {
  record.field('status').val('active');
  sys.data.save(record);
});
That puts a semaphore around the record, so other threads that try to update it at the same time will wait instead of failing with the optimistic locking error.
Related
In one of our services we had some connection issues and we are getting random timeouts (we think it is caused by the client library; the service is one of our caching services). We decided to handle it by putting the failed operation in a queue and retrying it on a separate worker until we solve the underlying issue.
However, there is a problematic case. Let's say we want to put the value "A" into the cache, but the write fails, so we put it in the queue to retry later. During this time a user fires a delete request to remove that data, and the delete succeeds without any timeout (no error, but also no record to delete). Then our retry strategy writes that data to the cache, even though it is supposed to be deleted and should not be there.
How should we handle this scenario? I first thought we could raise an error if the delete doesn't delete anything, but that has its own complications and could even end in an endless retry.
It appears the issue comes from performing the actual action on the main thread and only falling back to the queue and worker thread for retries when it fails.
If you perform the actual action through the queue and worker thread as well, the issue is resolved, because every operation on the cache goes through the same path.
A second solution is to track all the keys that are currently queued for retry. If a new action arrives for a key that is already in the queue, queue that action as well instead of executing it directly. For example, the delete for "A" should be queued because a retry for "A" is already queued (see the sketch below).
The second solution is a little less efficient.
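A minimal C# sketch of the second approach, assuming a hypothetical in-memory dispatcher in front of the cache client; CacheOperation, Dispatch and DrainOnce are illustrative names, not part of any library, and the races a production version would need to handle are glossed over:

using System;
using System.Collections.Concurrent;

// Illustrative type: what to do, and for which cache key.
public record CacheOperation(string Key, Action Execute);

public class KeySerializingDispatcher
{
    // How many operations per key are waiting for the retry worker.
    private readonly ConcurrentDictionary<string, int> _pendingPerKey = new ConcurrentDictionary<string, int>();
    // Operations the retry worker executes in arrival order.
    private readonly ConcurrentQueue<CacheOperation> _retryQueue = new ConcurrentQueue<CacheOperation>();

    public void Dispatch(CacheOperation op)
    {
        // If work for this key is already queued, queue this action too,
        // so it runs after the pending retry and per-key ordering is preserved.
        if (_pendingPerKey.ContainsKey(op.Key))
        {
            Enqueue(op);
            return;
        }

        try
        {
            op.Execute(); // normal path: run immediately
        }
        catch (TimeoutException)
        {
            Enqueue(op); // failed: hand it to the retry worker
        }
    }

    // Called in a loop by the worker thread.
    public void DrainOnce()
    {
        while (_retryQueue.TryDequeue(out var op))
        {
            op.Execute(); // a real worker would catch failures and re-enqueue with a delay
            if (_pendingPerKey.AddOrUpdate(op.Key, 0, (_, n) => n - 1) <= 0)
                _pendingPerKey.TryRemove(op.Key, out _); // nothing left pending for this key
        }
    }

    private void Enqueue(CacheOperation op)
    {
        _pendingPerKey.AddOrUpdate(op.Key, 1, (_, n) => n + 1);
        _retryQueue.Enqueue(op);
    }
}

With this in place, both the delete for "A" and the pending retry of "A" go through the same queue, so they can no longer be applied out of order.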
I have a function:
public async static Task Run([QueueTrigger("efs-api-call-last-datetime", Connection = "StorageConnectionString")]DateTime queueItem,
[Queue("efs-api-call-last-datetime", Connection = "StorageConnectionString")]CloudQueue inputQueue,
TraceWriter log)
{
Then I have a long-running process for handling the message from the queue. The problem is that the message is re-added to the queue after 30 seconds, while I am still processing it. I don't want this message to be added again and processed twice.
I would like to have code like:
try
{
    // long operation
}
catch (Exception ex)
{
    // something went wrong - re-add this message in 1 minute
    await inputQueue.AddMessageAsync(
        new CloudQueueMessage(JsonConvert.SerializeObject(queueItem)),
        timeToLive: null,
        initialVisibilityDelay: TimeSpan.FromMinutes(1),
        options: null,
        operationContext: null);
}
and prevent it from being re-added automatically. Is there any way to do this?
There are a couple of things here.
1) When there are multiple queue messages waiting, the queue trigger retrieves a batch of messages and invokes function instances concurrently to process them. By default, the batch size is 16, but this is configurable in host.json. You can set the batch size to 1 if you want to minimize parallel execution (a host.json sketch follows point 2). The Microsoft documentation explains this.
2) As it is a long-running process, it seems your messages are not completed before the function times out, so they become visible again. You should try to break your function down into smaller functions. Then you can use Durable Functions to chain the work you have to do.
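As a reference for point 1, a sketch of the host.json entry that sets the batch size, assuming the v1 Functions runtime implied by the TraceWriter/CloudQueue signature above (on the v2+ runtime the same settings sit under "extensions" > "queues"):

{
  "queues": {
    "batchSize": 1,
    "newBatchThreshold": 0
  }
}

With batchSize set to 1 and newBatchThreshold set to 0, each host instance processes only one queue message at a time.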
Yes, you can dequeue the same message twice.
Reasons:
1. Worker A dequeues Message B and the visibility timeout expires. Message B becomes visible again and Worker C dequeues Message B, invalidating Worker A's pop receipt. Worker A finishes its work, goes to delete Message B, and an error is thrown. This is the most common case.
2. The lock on the original message that triggers the first Azure Function to execute is likely expiring. This causes the queue to assume that processing the message failed, and it then uses that message to trigger the Function to execute again.
3. In certain conditions (very frequent queue polling) you can get the same message twice from a GetMessage call. This is a type of race condition that, while rare, does occur: Workers A and B poll very quickly, hit the queue simultaneously, and both get the same message. This used to be much more common (around the SDK 1.0 time frame) under high-polling scenarios, but it has become much rarer in later storage updates (I can't recall seeing it recently).
1 and 3 only happen when you have more than 1 worker.
Workaround:
Install version 1.0.11015.0 of azure-webjobs-sdk (visible on the 'Settings' page of the Functions portal). For more details, you could refer to fixing queue visibility renewals.
Last week we encountered, for the first time, a rate limit exceeded error (4003) in our nightly batch process. This batch process synchronises Smartsheet objects with our TimeTracking application 4TT.
Since 2016 this process has worked fine, but now this rate limit error occurs and the synchronisation stops. With the help of the API (and the blog post about rate limiting) I managed to change the code, putting in pauses when this error occurs. This has taken me quite a lot of time, as the error occurred in a different part of the synchronisation process every time.
Is there, or will there be, a way to let the API automatically pause when the rate limit is about to be exceeded, instead of changing the code every time? For those who don't want this behaviour, it could for example be an optional boolean argument 'AutomaticallyPauseWhenRateLimitExceeds' (default false) when making the connection to the Smartsheet API.
You'll need to include logic in your code to effectively handle the rate limiting error -- there's no mechanism by which the Smartsheet API can automatically handle this situation for you.
A simple approach would be for you to include logic in your code such that when a rate limiting error is thrown, your code pauses execution for 60 seconds before continuing. Alternatively, a more sophisticated approach would be to implement exponential backoff logic in your code (an error handling strategy whereby you periodically retry a failed request with progressively longer wait times between retries, until either the request succeeds or a certain number of retry attempts is reached).
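For illustration, a minimal C# sketch of such exponential backoff; the operation delegate and the rate-limit check are placeholders, not Smartsheet SDK calls:

using System;
using System.Threading.Tasks;

public static class RetryHelper
{
    // Retries 'operation' with exponentially growing waits: 1s, 2s, 4s, 8s, ...
    public static async Task<T> WithBackoff<T>(Func<Task<T>> operation, int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (IsRateLimitError(ex) && attempt < maxAttempts)
            {
                // Rate limited: wait 2^(attempt-1) seconds before the next try.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }

    // Placeholder: detect Smartsheet error code 4003 however your client surfaces it.
    private static bool IsRateLimitError(Exception ex) => ex.Message.Contains("4003");
}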
Implementing this type of error handling logic should not be difficult or tedious, provided that your code is structured in an efficient manner (i.e., error handling logic is encapsulated in a single location).
Additional note: The Smartsheet API Best Practices blog post (specifically the Be practical: Adhere to rate limiting guidelines section) contains info about this topic.
All our SDKs include error retry. So that's the easiest way to handle this situation. http://smartsheet-platform.github.io/api-docs/#sdks-and-sample-code
I found this and other interesting problems (in my lab) while updating sheets, including poor Internet connection/bandwidth issues.
If you are unable to adapt your code to process chunks of data, my suggestion is to use simple try/catch logic to pause the thread/task for 60 seconds and then try again.
using System.Threading;
...
... // all your code goes here
...
// keep trying until the save/update succeeds
var saved = false;
while (!saved)
{
    try
    {
        // your code to save/update the sheet goes here
        saved = true;
    }
    catch (Exception ex)
    {
        // log the error, wait 60 seconds, then try again
        Console.WriteLine(ex.Message);
        Thread.Sleep(60000);
    }
}
The next step is to work on notifications for when those errors happen.
I know that it is possible to consume an SQS queue using multiple threads. I would like to guarantee that each message will be consumed only once. I know that it is possible to change the visibility timeout of a message, e.g. to make it equal to my processing time. If my process spends more time than the visibility timeout (e.g. because of a slow connection), another thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce the probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Similar to the first example: there are a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost. Messages are stored on multiple servers, and this can result in duplication.
For the most part, by taking advantage of the SQS message visibility timeout, the chances of duplication from these sources are already pretty small - like a fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough - reducing chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window
The simplest is probably a Guava cache, but it would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
Some notes
Keep in mind that the chance of duplicates without all of that is already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps.
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least as long as 2x your SQS message visibility timeout; there is a reduced chance of duplication after that (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, "InProgress" will eventually expire, and another thread could process this message (either after the SQS visibility timeout also expires or because SQS had a duplicate in it).
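A minimal C# sketch of the "InProgress"/"Completed" bookkeeping described above, using an in-process MemoryCache purely for illustration (a distributed set of consumers would need a shared store such as a database, as noted above):

using System;
using Microsoft.Extensions.Caching.Memory;

public class MessageDeduplicator
{
    private readonly MemoryCache _cache = new MemoryCache(new MemoryCacheOptions());
    private readonly object _gate = new object();
    private readonly TimeSpan _inProgressTtl = TimeSpan.FromMinutes(5); // recovery window after a crash
    private readonly TimeSpan _completedTtl = TimeSpan.FromHours(1);    // deduplication window

    // Returns true if this thread won the right to process the messageId.
    public bool TryBeginProcessing(string messageId)
    {
        lock (_gate) // make the check-and-set atomic within this process
        {
            if (_cache.TryGetValue(messageId, out string _))
                return false; // already InProgress or Completed: it's a duplicate, stop

            _cache.Set(messageId, "InProgress", _inProgressTtl);
            return true;
        }
    }

    // Call once the message has been fully handled, before deleting it from SQS.
    public void MarkCompleted(string messageId)
    {
        _cache.Set(messageId, "Completed", _completedTtl);
    }
}

A consumer thread would call TryBeginProcessing(message.MessageId) before doing any work, and MarkCompleted(message.MessageId) just before deleting the message from the queue.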
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
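For illustration, a sketch of that check against SQL Server; the table, column names and connection handling are illustrative, and 2627/2601 are SQL Server's duplicate-key error numbers:

using System;
using System.Data.SqlClient;

public static class MessageRegistry
{
    // Assumes a table created once, for example:
    //   CREATE TABLE ProcessedMessages (MessageId NVARCHAR(128) PRIMARY KEY, ReceivedAt DATETIME2)
    // The PRIMARY KEY is the unique constraint that rejects duplicates.
    public static bool TryRecordMessage(string connectionString, string messageId)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(
            "INSERT INTO ProcessedMessages (MessageId, ReceivedAt) VALUES (@id, SYSUTCDATETIME())", conn);
        cmd.Parameters.AddWithValue("@id", messageId);
        try
        {
            cmd.ExecuteNonQuery();
            return true;   // first time we've seen this message id
        }
        catch (SqlException ex) when (ex.Number == 2627 || ex.Number == 2601)
        {
            return false;  // unique constraint violation: already received, skip it
        }
    }
}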
The AWS SQS API doesn't automatically "consume" a message when you read it; the developer needs to make the call to delete the message themselves.
SQS does have a feature called a "redrive policy" as part of the dead-letter queue settings. You just set the maximum receive count to 1. If the consuming process crashes, a subsequent read of the same message will move it to the dead-letter queue.
The SQS queue visibility timeout can be set to up to 12 hours. Unless you have a special need beyond that, you shouldn't have to implement a process that stores the message handle in a database for later inspection.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This could be done by using a ScheduledExecutorService and scheduling a runnable event after half of the initial visibility time. The code snippet below creates and executes the VisibilityTimeExtender after half of the visibilityTime, repeating with a period of half the visibility time. (The timing is chosen so the message stays invisible while it is being processed, as the visibility is extended by visibilityTime/2 each time.)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
// first run after visibilityTime/2 seconds, then repeat with a period of visibilityTime/2 seconds
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime/2, visibilityTime/2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.
Rx has a great function, Observable.Buffer, but there is a problem with it in real life.
Scenario: an application sends a stream of events to a database. Inserting events one by one is expensive, so we need to batch them. I want to use Observable.Buffer for this, but inserting into the DB has a small probability of failure (deadlocks, timeouts, downtime, etc.).
I can add some retry logic into the batching function itself, but that would go against the Rx idea of composability. Observable.Retry does not cut it, because it will re-subscribe to the "hot" source, which means the failed batch will be lost.
Are there functions I can compose to achieve the desired effect, or do I need to implement my own extension? I would like something like this:
_inputBuffer = new BufferBlock<int>();
_inputBuffer.AsObservable().
Buffer(TimeSpan.FromSeconds(10), 1000).
Do(batch => SqlSaveBatch(batch)).
{Retry???}.
Subscribe()
To make it perfect, I would like to be able to take control of the situation when OnComplete is called while the retry buffer still has incomplete batches, and to be able to perform some actions (send an error email, save the data to the local file system, etc.).
When a save to the database fails and needs to be retried, it's not really the stream or the events that are in error, it's an action taken against an event.
I would structure your code more like this:
IDisposable subscription =
    _inputBuffer.AsObservable().
    Buffer(TimeSpan.FromSeconds(10), 1000).
    Subscribe(
        batch => SqlSaveBatchWithRetryLogic(batch),
        () => YourOnCompleteAction());
You can provide the retry logic inside of SqlSaveBatchWithRetryLogic() (a sketch follows after these points)
Handle OnComplete of the events inside YourOnCompleteAction()
You can elect to dispose the subscription from within SqlSaveBatchWithRetryLogic() if you fail to save a batch.
This also removes the Do side effect.
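A minimal sketch of what SqlSaveBatchWithRetryLogic might look like, plugging into the snippet above (with subscription held as a field); SaveBatchToLocalFile and SendErrorEmail are hypothetical fallbacks for the incomplete-batch situation mentioned in the question:

// requires: using System; using System.Collections.Generic; using System.Threading;
private void SqlSaveBatchWithRetryLogic(IList<int> batch)
{
    const int maxAttempts = 3;
    for (int attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            SqlSaveBatch(batch);   // the actual DB insert from the question
            return;                // success: this batch is done
        }
        catch (Exception)
        {
            if (attempt == maxAttempts) break; // give up after the final attempt
            // transient failure (deadlock, timeout, downtime): wait, then retry
            Thread.Sleep(TimeSpan.FromSeconds(5 * attempt));
        }
    }

    // All attempts failed: keep the batch somewhere safe, alert, and optionally
    // dispose the subscription so no further batches are accepted.
    SaveBatchToLocalFile(batch);   // hypothetical fallback
    SendErrorEmail($"Failed to save a batch of {batch.Count} events");
    subscription?.Dispose();
}

Note that Thread.Sleep here blocks the Rx pipeline while retrying, which ties directly into the back-pressure caveat below.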
I would be careful about this approach though - you need to watch the retry logic. You have no back-pressure (way to slow down the input). So if you have any kind of back-off/retry you are risking the queue backing up and filling memory. If you start seeing batches consistently at the count limit, you are probably in trouble! You may want to implement a counter to monitor the outstanding items.