Why would a Service Broker Receive take longer than the specified timeout? - tsql

I'm writing a high load application that uses SQL Server Service Broker. I have got to a state where running the following script in Management Studio takes 1 minute 6 seconds, even after I have stopped the application. What could be causing it to take so long? I thought the TIMEOUT would make it stop after half a second?
WAITFOR (RECEIVE TOP(1) * FROM [targetqueuename]), TIMEOUT 500;
SELECT @@ERROR;
@@ERROR is returning 0. After the first run takes this long, subsequent runs return instantly.

WAITFOR(RECEIVE), TIMEOUT works by actually running the RECEIVE at least once. If the result set is empty, it continues to wait. Every time it believes that it can succeed (it gets notified internally that more messages are available) it runs the RECEIVE again. Repeat in a loop until either it returns rows or it times out.
But the timeout does not interrupt a RECEIVE already executing inside this loop. If the RECEIVE takes a long time to find messages in the queue (this can happen with large queues or with bad execution plans for RECEIVE), then the timeout cannot be honored. Note that this can be the case even if the RECEIVE does not find any message, since the queue may contain a large number of messages that are all locked (more precisely, all belonging to locked conversation groups). In this case the RECEIVE may take a long time to execute, searching for unlocked conversation groups, and in the end still come up empty-handed.
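To make the loop described above concrete, here is a toy model in Python. It is only a sketch of the behaviour as explained in this answer, not Service Broker's actual implementation: `waitfor_receive`, `slow_receive`, and `FakeClock` are made-up names, and the 66-second scan is an assumed figure chosen to mirror the question. The point is that the deadline is checked between RECEIVE attempts, never during one.

```python
class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def waitfor_receive(receive, clock, timeout, wait_step=0.1):
    """Toy model: run `receive` at least once, retry until rows or timeout.

    The deadline is only checked between attempts, never during one.
    """
    deadline = clock.now + timeout
    while True:
        rows = receive()            # an attempt in progress is never interrupted
        if rows:
            return rows
        if clock.now >= deadline:
            return []
        clock.advance(wait_step)    # stand-in for waiting on a notification


clock = FakeClock()


def slow_receive():
    # Assumption for illustration: scanning a large queue of locked
    # conversation groups takes 66 seconds and still finds nothing.
    clock.advance(66.0)
    return []


result = waitfor_receive(slow_receive, clock, timeout=0.5)
elapsed = clock.now
print(result, elapsed)
```

In this model the call asks for a half-second timeout but only returns once the single slow attempt has finished.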

Related

How are background workers usually implemented for polling a message queue?

Say you have a message queue that needs to be polled every x seconds. What are the usual ways to poll it and execute HTTP/Rest-based jobs? Do you simply create a cron service and call the worker script every x seconds?
Note: This is for a web application
I would write a windows service which constantly polls/waits for new messages.
Scheduling a program to run every x minutes has a number of problems:
If your interval is too small, the previous run will still be going when the next one is triggered.
If your interval is too big, the queue will fill up between runs.
Generally you expect a constant stream of messages, so there is no problem with just keeping the program running 24/7.
One common feature of the message queue systems I've worked with is that you don't poll but use a blocking read. If you have more than one waiting worker, the queue system will pick which one gets to process the message.
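As a rough illustration of a blocking read with several waiting workers, here is a small Python sketch using the standard library's in-process `queue.Queue`. It is a stand-in for a real message queue system, and `worker` and the doubling "job" are invented for the example.

```python
import queue
import threading


def worker(jobs, results):
    """Block on the queue instead of waking up on a timer."""
    while True:
        job = jobs.get()            # blocks until a message is available
        if job is None:             # sentinel: shut this worker down
            jobs.task_done()
            break
        results.append(job * 2)     # stand-in for the real HTTP/REST job
        jobs.task_done()


jobs = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(jobs, results)) for _ in range(2)]
for t in threads:
    t.start()

for n in range(5):
    jobs.put(n)                     # the queue hands each job to one worker only
for _ in threads:
    jobs.put(None)                  # one sentinel per worker

jobs.join()
for t in threads:
    t.join()
print(sorted(results))
```

No worker ever sleeps for a fixed interval here; each one simply waits inside `get()` until there is something to do.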

Azure Function and queue

I have a function:
public async static Task Run([QueueTrigger("efs-api-call-last-datetime", Connection = "StorageConnectionString")]DateTime queueItem,
[Queue("efs-api-call-last-datetime", Connection = "StorageConnectionString")]CloudQueue inputQueue,
TraceWriter log)
{
Then I have a long-running process that handles the message from the queue. The problem is that the message is re-added to the queue after 30 seconds, while I am still processing it. I don't want this message to be added back and processed twice.
I would like to have code like:
try
{
// long operation
}
catch(Exception ex)
{
// something wrong. Readd this message in 1 minute
await inputQueue.AddMessageAsync(new CloudQueueMessage(
JsonConvert.SerializeObject(queueItem)),
timeToLive: null,
initialVisibilityDelay: TimeSpan.FromMinutes(1),
options: null,
operationContext: null
);
}
and prevent it from being re-added automatically. Is there any way to do this?
There are a couple of things here.
1) When there are multiple queue messages waiting, the queue trigger retrieves a batch of messages and invokes function instances concurrently to process them. By default, the batch size is 16, but this is configurable in host.json. You can set the batch size to 1 if you want to minimize parallel execution. The Microsoft documentation explains this.
2) Because this is a long-running process, it seems your messages are not being completed in time: the function may time out, and the messages become visible again. Try breaking your function down into smaller functions. You can then use Durable Functions to chain the work you have to do.
Yes, you can dequeue the same message twice.
Reasons:
1. Worker A dequeues Message B and the invisibility timeout expires. Message B becomes visible again and Worker C dequeues it, invalidating Worker A's pop receipt. Worker A finishes its work, goes to delete Message B, and an error is thrown. This is the most common case.
2. The lock on the original message that triggered the first Azure Function execution is likely expiring. This causes the queue to assume that processing the message failed, and it then uses that message to trigger the function again.
3. In certain conditions (very frequent queue polling) you can get the same message twice on a GetMessage. This is a type of race condition that, while rare, does occur. Workers A and B are polling very quickly, hit the queue simultaneously, and both get the same message. This used to be much more common (SDK 1.0 time frame) under high polling scenarios, but it has become much rarer with later storage updates (I can't recall seeing it recently).
1 and 3 only happen when you have more than 1 worker.
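Reason 1 can be sketched with a tiny in-memory model. This is not the Azure Storage SDK: `ToyQueue`, its methods, and the integer pop receipts are hypothetical, and time is passed in by hand so the example runs instantly.

```python
import itertools


class ToyQueue:
    """In-memory model of one message with a visibility timeout (not the Azure SDK)."""

    def __init__(self, body, visibility_timeout):
        self.body = body
        self.visibility_timeout = visibility_timeout
        self.visible_at = 0.0
        self.pop_receipt = None
        self.deleted = False
        self._receipts = itertools.count(1)

    def get_message(self, now):
        """Return (body, pop_receipt), or None while the message is invisible."""
        if self.deleted or now < self.visible_at:
            return None
        self.visible_at = now + self.visibility_timeout
        self.pop_receipt = next(self._receipts)   # invalidates older receipts
        return self.body, self.pop_receipt

    def delete_message(self, pop_receipt):
        if pop_receipt != self.pop_receipt:
            raise ValueError("pop receipt no longer valid")
        self.deleted = True


q = ToyQueue("Message B", visibility_timeout=30)
_, receipt_a = q.get_message(now=0)       # Worker A dequeues
hidden = q.get_message(now=10)            # still invisible to other workers
_, receipt_c = q.get_message(now=31)      # timeout expired: Worker C gets it too

try:
    q.delete_message(receipt_a)           # Worker A finishes late
    a_delete_failed = False
except ValueError:
    a_delete_failed = True

q.delete_message(receipt_c)               # Worker C's receipt is still valid
```

Worker A's work was done, but its delete fails because the message was handed out again in the meantime.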
Workaround:
Install version 1.0.11015.0 of azure-webjobs-sdk (visible on the 'Settings' page of the Functions portal). For more details, refer to "fixing queue visibility renewals".

Kafka producer future metadata in callback

In my application, when I send messages I use the metadata in the callback to save the offset of the record for future use. However, sometimes metadata.offset() returns -1, which makes things hard later.
Why does this happen, and is there a way to get the offset without consuming the topic to find it?
Edit: I am currently on acks=0. When I switch to acks=1 I no longer get these errors, but my performance drops drastically: 100k messages go from taking 10 seconds to taking 1 minute.
acks=0 If set to zero then the producer will not wait for any
acknowledgment from the server at all. The record will be immediately
added to the socket buffer and considered sent. No guarantee can be
made that the server has received the record in this case, and the
retries configuration will not take effect (as the client won't
generally know of any failures). The offset given back for each
record will always be set to -1.
This is not exactly true, as out of 100k messages I got 95k with offsets, but I guess that's normal.
I will still need to find another way to get the offset with acks=0.

Using many consumers in SQS Queue

I know that it is possible to consume an SQS queue using multiple threads. I would like to guarantee that each message will be consumed once. I know that it is possible to change the visibility timeout of a message, e.g., to make it equal to my processing time. If my process spends more time than the visibility timeout (e.g. because of a slow connection), another thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Similar to the first example - there are a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost - messages are stored on multiple servers, and this can result in duplication.
For the most part, by taking advantage of SQS message visibility timeout, the chances of duplication from these sources are already pretty small - like fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough - reducing chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window
The simplest is probably a Guava cache, but would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
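The steps above can be sketched as follows. This is a minimal single-process illustration in Python, assuming an in-memory store guarded by a lock; `DedupStore`, `handle`, and the TTL values are invented for the example, and a distributed setup would need a database or shared cache in place of the dictionaries. Because both checks happen under one lock, the "InProgress" and "Completed" lookups act as a single atomic step.

```python
import threading
import time


class DedupStore:
    """Single-process stand-in for the atomic id store described above."""

    def __init__(self, in_progress_ttl, completed_ttl, clock=time.monotonic):
        self._lock = threading.Lock()
        self._in_progress = {}    # message_id -> expiry time
        self._completed = {}      # message_id -> expiry time
        self._in_progress_ttl = in_progress_ttl
        self._completed_ttl = completed_ttl
        self._clock = clock

    @staticmethod
    def _alive(table, message_id, now):
        expiry = table.get(message_id)
        if expiry is None:
            return False
        if expiry <= now:
            del table[message_id]     # lazily flush the expired entry
            return False
        return True

    def try_begin(self, message_id):
        """Claim the id; False means a duplicate is in progress or already done."""
        now = self._clock()
        with self._lock:
            if self._alive(self._in_progress, message_id, now):
                return False
            if self._alive(self._completed, message_id, now):
                return False
            self._in_progress[message_id] = now + self._in_progress_ttl
            return True

    def complete(self, message_id):
        now = self._clock()
        with self._lock:
            self._completed[message_id] = now + self._completed_ttl
            self._in_progress.pop(message_id, None)


def handle(store, message_id, process):
    if not store.try_begin(message_id):
        return False                  # duplicate: skip it
    process()                         # on failure the claim expires via its TTL
    store.complete(message_id)
    return True


fake_now = [0.0]                      # hand-driven clock for the demo
store = DedupStore(in_progress_ttl=30, completed_ttl=120, clock=lambda: fake_now[0])
processed = []
first = handle(store, "msg-1", lambda: processed.append("msg-1"))
second = handle(store, "msg-1", lambda: processed.append("msg-1"))
fake_now[0] = 121.0                   # past the deduplication window
third = handle(store, "msg-1", lambda: processed.append("msg-1"))
```

The third call goes through again because the "Completed" entry has expired, which is exactly the limit of the deduplication window.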
Some notes
Keep in mind that the chance of duplication without all of that is already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps.
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least 2x your SQS message visibility timeout; the chances of duplication are reduced after that (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, the "InProgress" entry will eventually expire, and another thread could process this message (either after the SQS visibility timeout also expires or because SQS had a duplicate in it).
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
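A minimal sketch of the unique-constraint idea, using Python's built-in `sqlite3` so it runs without a server. The table name `received_messages` and the helper `record_if_new` are assumptions made for the example; any database with unique constraints would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE received_messages ("
    " message_id TEXT NOT NULL UNIQUE,"
    " body TEXT)"
)


def record_if_new(conn, message_id, body):
    """Return True if the message is new, False if the id was already stored."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO received_messages (message_id, body) VALUES (?, ?)",
                (message_id, body),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected the insert: we have seen this id before.
        return False


first = record_if_new(conn, "abc-123", "hello")
duplicate = record_if_new(conn, "abc-123", "hello")
row_count = conn.execute("SELECT COUNT(*) FROM received_messages").fetchone()[0]
```

The database does the check and the insert in one step, so two consumers racing on the same id cannot both succeed.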
The AWS SQS API doesn't automatically "consume" the message when you read it. The developer needs to make the call to delete the message themselves.
SQS does have a feature called "redrive policy" as part of the "Dead Letter Queue Settings". You just set the maximum receive count to 1. If the consuming process crashes, a subsequent read of the same message will move it to the dead-letter queue.
The SQS visibility timeout can be set to as much as 12 hours. If you have a special need beyond that, you will need to implement a process that stores the message handle in a database so that it can be inspected later.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This could be done by using a ScheduledExecutorService and scheduling a runnable event after half the initial visibility time. The code snippet below creates a VisibilityTimeExtender and runs it after half of visibilityTime, then repeatedly with a period of half the visibility time. (Each run should extend the visibility by visibilityTime/2, so that the message stays invisible for as long as it takes to process.)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime/2, visibilityTime/2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.
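The same heartbeat idea can be sketched in Python with only the standard library. This is an illustration rather than AWS SDK code: the `VisibilityExtender` class here is a made-up Python analogue, and `extend_visibility` is a hypothetical stand-in for whatever call changes the message visibility (the ChangeMessageVisibility action in SQS). The timings are tiny so the example finishes quickly.

```python
import threading
import time


class VisibilityExtender:
    """Calls `extend` every `interval` seconds until the block exits."""

    def __init__(self, extend, interval):
        self._extend = extend
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop is requested.
        while not self._stop.wait(self._interval):
            self._extend()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()              # same role as futureEvent.cancel(true)
        self._thread.join()


extensions = []
visibility_time = 0.2                 # seconds; assumed value for the demo


def extend_visibility():
    # Hypothetical stand-in for a ChangeMessageVisibility request.
    extensions.append(time.monotonic())


with VisibilityExtender(extend_visibility, interval=visibility_time / 2):
    time.sleep(0.35)                  # stand-in for slow message processing
# At this point the message would be deleted from the queue.

count_after_stop = len(extensions)
time.sleep(0.25)                      # no further extensions should happen
```

Using a context manager means the heartbeat stops even if processing raises, which avoids extending the visibility of a message nobody is working on.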

How can I check that a PostgreSQL function is still running over a long period of time?

A program which I developed uses PostgreSQL. That program runs a plpgsql function that takes a very long time (hours or days). I want to be sure that the function is still running during that time.
How can I know that? I don't want to use "raise notice" in a loop in function because that will extend running time.
You can see if it's running by examining pg_stat_activity for the process. However, this won't tell you if the function is progressing.
You can check whether that backend is blocked on any locks by joining pg_stat_activity against pg_locks and looking for lock requests that have not been granted (granted = false) for that backend. Again, this won't tell you whether it's progressing, only that if it isn't, the cause is not a lock wait.
If you want to monitor a function's progress you will need to emit log messages or use one of the other hacks for monitoring progress. You can (ab)use NOTIFY with payload to LISTEN for progress messages. Alternately, you could create a sequence that you call nextval on each time you process an item in your procedure; you can then SELECT * FROM the_sequence_name; in another transaction to see the approximate progress.
In general I'd recommend setting client_min_messages to notice or above and then using RAISE LOG, so that you record messages that appear only in the logs without being sent to the client. To reduce overhead, keep a counter and log every 100 or 1000 (or whatever) iterations of your loop, so you only log occasionally. There's a cost to updating the counter, for sure, but it's pretty low compared to the cost of a big, slow PL/PgSQL procedure like this.
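The "log every N iterations" pattern looks roughly like this. It is shown in Python only to illustrate the counter logic; in the actual function the `logger.info` call would be a RAISE LOG statement inside the PL/pgSQL loop. `process_all`, `ListHandler`, and the squaring "work" are invented for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages in memory so the example can be checked."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def process_all(items, log_every, logger):
    processed = 0
    for item in items:
        _ = item * item                       # stand-in for the real per-row work
        processed += 1
        if processed % log_every == 0:        # cheap check on every iteration
            logger.info("processed %d rows", processed)   # plays the role of RAISE LOG
    return processed


logger = logging.getLogger("progress_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

total = process_all(range(2500), log_every=1000, logger=logger)
print(handler.messages)
```

Only two log lines are written for 2500 iterations, so the logging cost stays small next to the work itself.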