Service Fabric reliable queue long operation - azure-service-fabric

I'm trying to understand some best practices for Service Fabric.
If I have a queue that is added to by a web service or some other mechanism, and a back-end task to process that queue, what is the best approach to handle long-running operations in the background?
Use TryPeekAsync in one transaction, process the item, and then, if successful, use TryDequeueAsync to finally dequeue it.
Use TryDequeueAsync to remove an item, put it into a dictionary, and then remove it from the dictionary when processing is complete. On startup of the service, check the dictionary for anything pending before reading from the queue.
Both ways feel slightly wrong, but I can't work out if there is a better way.

One option is to process the queue in RunAsync, something along the lines of this:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var store = await StateManager.GetOrAddAsync<IReliableQueue<T>>("MyStore").ConfigureAwait(false);
    while (!cancellationToken.IsCancellationRequested)
    {
        using (var tx = StateManager.CreateTransaction())
        {
            var itemFromQueue = await store.TryDequeueAsync(tx).ConfigureAwait(false);
            if (!itemFromQueue.HasValue)
            {
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken).ConfigureAwait(false);
                continue;
            }
            // Process item here
            // Remember to clone the dequeued item if it is a custom type and you are going to mutate it.
            // If success, await tx.CommitAsync();
            // If failure to process, either let it run out of the using transaction scope, or call tx.Abort();
        }
    }
}
Regarding the comment about cloning the dequeued item if you are to mutate it, look under the "Recommendations" part here:
https://azure.microsoft.com/en-us/documentation/articles/service-fabric-reliable-services-reliable-collections/
One limitation with Reliable Collections (both Queue and Dictionary) is that you only have parallelism of 1 per partition. So for high-activity queues it might not be the best solution. This might be the issue you're running into.
What we've been doing is to use ReliableQueues for situations where the write volume is very low. For higher-throughput queues, where we need durability and scale, we're using ServiceBus Topics. That also gives us the advantage that if a service was stateful only because it had the ReliableQueue, it can now be made stateless. This does add a dependency on a 3rd-party service (in this case ServiceBus), and that might not be an option for you.
Another option would be to create a durable pub/sub implementation to act as the queue. I've done tests before using actors for this, and it seemed to be a viable option, though we didn't spend much time on it since we had no issues depending on ServiceBus. Here is another SO question about that: Pub/sub pattern in Azure Service Fabric

If processing is very slow, use 2 queues: a fast one where you store the work without interruptions, and a slow one from which you process it. RunAsync is used to move messages from the fast queue to the slow one.

Related

state manager parallel transactions in runasync

In service fabric stateful services, there is RunAsync(cancellationToken) with using() for state manager transaction.
The legacy code I want to refactor contains two queues with dequeue attempts inside the while(true) with 1 second delay. I would like to get rid of this unnecessary delay and instead use two distinct reactive queues (semaphores with reliable queues).
The problem is, now the two distinct workflows depending on these two queues need to be separated into two separate threads because if these two queues run in single thread one wait() will block the other code from running. (I know probably best practice would separate these two tasks into two microservices, next project.)
I came up with the code below as a solution:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    await Task.WhenAll(AsyncTask1(cancellationToken), AsyncTask2(cancellationToken)).ConfigureAwait(false);
}
And each task contains something like:
while (true)
{
    cancellationToken.ThrowIfCancellationRequested();
    using (var tx = this.StateManager.CreateTransaction())
    {
        var maybeMessage = await messageQueue.TryDequeueAsync(tx, cancellationToken).ConfigureAwait(false);
        if (maybeMessage.HasValue)
        {
            DoWork();
        }
        await tx.CommitAsync().ConfigureAwait(false);
    }
}
It seems to be working, but I just want to make sure that using (StateManager.CreateTransaction()) is OK to use in this parallel way.
According to the documentation:
Depending on the replica role, for a single-entry operation (like TryDequeueAsync) the ITransaction uses the Repeatable Read isolation level (when primary) or the Snapshot isolation level (when secondary).
Repeatable Read
Any Repeatable Read operation by default takes Shared locks.
Snapshot
Any read operation done using Snapshot isolation is lock free.
So if DoWork doesn't modify the reliable collection, multiple transactions can be executed in parallel with no problems.
In the case of multiple reads / updates, this can cause deadlocks and should be done with care.

Mongo DB reading 4 million documents

We have scheduled jobs that run daily. Each job looks for the documents matching that day, applies a minimal transform to each one, and sends it to a queue for downstream processing. Typically we have 4 million documents to process per day, and our aim is to complete the processing within one hour. I am looking for suggestions on best practices for reading 4 million documents from MongoDB quickly.
The MongoDB Async driver is the first stop for low overhead querying. There's a good example of using the SingleResultCallback on that page:
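// Imports for the 3.x async driver
import com.mongodb.Block;
import com.mongodb.async.SingleResultCallback;
import org.bson.Document;

// Called once for every document as soon as it is available
Block<Document> printDocumentBlock = new Block<Document>() {
    @Override
    public void apply(final Document document) {
        System.out.println(document.toJson());
    }
};
// Called once when the whole result set has been consumed (or on error, with t set)
SingleResultCallback<Void> callbackWhenFinished = new SingleResultCallback<Void>() {
    @Override
    public void onResult(final Void result, final Throwable t) {
        System.out.println("Operation Finished!");
    }
};
collection.find().forEach(printDocumentBlock, callbackWhenFinished);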
It is a common pattern in asynchronous database drivers to allow results to be passed on for processing as soon as they are available. The use of OS-level async I/O will help with low CPU overhead. Which brings up the next problem - how to get the data out.
Without seeing the specifics of your work, you probably want to place the results into an in memory queue to be picked up by another thread at this point so the reader thread can keep reading results. An ArrayBlockingQueue is probably appropriate. put is more appropriate than add because it will block the reader thread if the worker(s) isn't able to keep up (keeping things balanced). Ideally, you don't want it to back up which is where multiple threads will be necessary. If the order of the results is important, use a single worker thread, otherwise use a ThreadPoolExecutor with the queue passed into the constructor. Using the in-memory queue does open up the possibility for data-loss if the results are being somehow discarded as they are read (i.e. if you were immediately sending off another query to delete them), and the reader process crashed.
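As a rough illustration of that in-memory hand-off, here is a minimal sketch using only the JDK. The class name BoundedHandoff, the transform method, and the end-of-stream sentinel are made up for the example, and plain strings stand in for MongoDB documents. Note that it uses worker threads that pull from the ArrayBlockingQueue themselves, rather than handing the queue to a ThreadPoolExecutor constructor, so that put really does block the reader when the workers fall behind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedHandoff {
    // Hypothetical end-of-stream marker; a distinct instance so it can be compared by identity.
    private static final String POISON = new String("__END__");

    /** Pushes documents through a bounded queue to worker threads; returns how many were processed. */
    public static int run(List<String> documents, int capacity, int workers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            Callable<Integer> worker = () -> drain(queue);
            for (int i = 0; i < workers; i++) {
                results.add(pool.submit(worker));
            }
            // Reader side: put() blocks while the queue is full, which throttles the reader.
            for (String doc : documents) {
                queue.put(doc);
            }
            // One marker per worker so that every worker stops.
            for (int i = 0; i < workers; i++) {
                queue.put(POISON);
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    // Worker side: take() blocks until an item is available.
    private static int drain(BlockingQueue<String> queue) throws InterruptedException {
        int handled = 0;
        for (String doc = queue.take(); doc != POISON; doc = queue.take()) {
            transform(doc);
            handled++;
        }
        return handled;
    }

    // Placeholder for the "minimal transform" from the question.
    static String transform(String doc) {
        return doc.trim();
    }
}
```

In the real job the reader loop would be the driver's per-document callback, and the capacity would be tuned from measurements. With more than one worker the processing order is no longer guaranteed, which matches the ordering caveat above.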
At this point, either do the 'minimal transforms' on the worker thread(s), or serialize them in the workers and put them on a real queue (e.g. RabbitMQ, ZeroMQ). Putting them onto a real queue allows the work to be divided up amongst multiple machines trivially, and provides optional persistence allowing recovery of work, and those queues have great clustering options for scalability. Those machines can then put the results into the queue you mentioned in the question (assuming it has the capacity).
The bottleneck in a system like that is how quickly one machine can get through a single mongo query, and how many results the final queue can handle. All the other parts (MongoDB, queues, # of worker machines) are individually scalable. By doing as little work as possible on the querying machine and pushing that work onto other machines that impact can be greatly reduced. It sounds like your destination queue is out of your control.
When trying to work out where bottlenecks are, measurements are critical. Adding metrics to your application up front will let you know which areas need improvement when things aren't going well.
That set-up can build a pretty scalable system. I've built many similar systems before. Beyond that, you'll want to investigate getting your data into something like Apache Storm.

Can we prevent deadlocks and timeouts on ReliableQueue's in Service Fabric?

We have a stateful service in Service Fabric with both a RunAsync method and a couple of service calls.
One of the service calls enqueues a message in a ReliableQueue:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.EnqueueAsync(tx, message);
    queueLength = await queue.GetCountAsync(tx);
    await tx.CommitAsync();
}
The RunAsync on the other hand tries to dequeue things:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.TryDequeueAsync(tx);
    queueLength = await queue.GetCountAsync(tx);
    await tx.CommitAsync();
}
The GetCountAsync seems to cause deadlocks, because the two transactions block each other. Would it help if we would switch the order: so first counting and then the dequeue/enqueue?
This is likely due to the fact that the ReliableQueue today is strict FIFO and allows only one reader or writer at a time. You're probably not seeing deadlocks, you're seeing timeouts (please correct me if that is not the case). There's no real way to prevent the timeouts other than to:
Ensure that the transactions are not long lived - any longer than you need and you're blocking other work on the queue.
Increase the default transaction timeout (the default is 4 seconds, you can pass in a different value)
Reordering things shouldn't cause any change.
Having two transactions in two different places shouldn't cause deadlocks, as they act like mutexes. What will cause them though is creating transactions within transactions.
Perhaps that is what's happening? I've developed the habit lately of giving functions that create transactions a Transactional suffix, i.e. DoSomethingTransactionalAsync, and if it's a private helper I'll usually create two versions: one taking a tx and one creating a tx.
For example:
AddToProcessingQueueAsync(ITransaction tx, int num) and AddToProcessingQueueTransactionalAsync(int num).

How to retry hot observable?

Rx has great function Observable.Buffer. But there is a problem with it in real life.
Scenario: application sends a stream of events to a database. Inserting events one-by-one is expensive, so we need to batch it. I want to use Observable.Buffer for this. But inserting into DB has small probability of failure (deadlocks, timeouts, downtime, etc).
I can add some retry logic into the batching function itself, but it would be against the Rx idea of composability. Observable.Retry does not cut it, because it will re-subscribe to the "hot" source, which means that the failed batch will be lost.
Are there functions, which I can compose to achieve desired effect, or do I need to implement my own extension? I would like something like this:
_inputBuffer = new BufferBlock<int>();
_inputBuffer.AsObservable().
    Buffer(TimeSpan.FromSeconds(10), 1000).
    Do(batch => SqlSaveBatch(batch)).
    {Retry???}.
    Subscribe()
To make it perfect, I would like to be able to get control over situation when OnComplete is called, while retry buffer has incomplete batches, and be able to perform some actions (send error email, save data to local file system, etc.)
When a save to the database fails and needs to be retried, it's not really the stream or the events that are in error, it's an action taken against an event.
I would structure your code more like this:
IDisposable subscription =
    _inputBuffer.AsObservable().
        Buffer(TimeSpan.FromSeconds(10), 1000).
        Subscribe(
            batch => SqlSaveBatchWithRetryLogic(batch),
            () => YourOnCompleteAction());
You can provide the retry logic inside of SqlSaveBatchWithRetryLogic()
Handle OnComplete of the events inside YourOnCompleteAction()
You can elect to dispose the subscription from within SqlSaveBatchWithRetryLogic() if you fail to save a batch.
This also removes the Do side effect.
I would be careful about this approach though - you need to watch the retry logic. You have no back-pressure (way to slow down the input). So if you have any kind of back-off/retry you are risking the queue backing up and filling memory. If you start seeing batches consistently at the count limit, you are probably in trouble! You may want to implement a counter to monitor the outstanding items.
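To make the "retry inside the save function, plus a counter for outstanding items" idea concrete, here is a minimal sketch. It is written in Java rather than C#/Rx.NET purely as an illustration of the shape; BatchSaver, saveWithRetry, and the injected save callback (a stand-in for SqlSaveBatch) are all hypothetical names, and the attempt limit and back-off are arbitrary assumptions.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class BatchSaver {
    private final Consumer<List<Integer>> save;   // hypothetical stand-in for SqlSaveBatch
    private final int maxAttempts;
    private final AtomicInteger outstanding = new AtomicInteger();

    public BatchSaver(Consumer<List<Integer>> save, int maxAttempts) {
        this.save = save;
        this.maxAttempts = maxAttempts;
    }

    /** Tries to save the batch up to maxAttempts times; returns false if every attempt failed. */
    public boolean saveWithRetry(List<Integer> batch) {
        outstanding.addAndGet(batch.size());      // items currently waiting to be written
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    save.accept(batch);
                    return true;
                } catch (RuntimeException e) {
                    if (attempt == maxAttempts) {
                        break;                    // give up; the caller decides what to do with the batch
                    }
                    try {
                        Thread.sleep(10L * attempt);   // simple linear back-off (assumed values)
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
            }
            return false;
        } finally {
            outstanding.addAndGet(-batch.size());
        }
    }

    /** Number of items in batches that are currently being saved or retried. */
    public int outstanding() {
        return outstanding.get();
    }
}
```

A false return is the point where the subscriber could dispose the subscription, write the batch to a local file, or send an alert. Watching outstanding() from a monitoring thread gives an early signal that retries are causing input to pile up.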

MSMQ as a job queue

I am trying to implement a job queue with MSMQ to save myself the time of implementing it in SQL. After reading around, I realized MSMQ might not offer what I am after. Could you please advise me whether my plan is realistic using MSMQ, or recommend an alternative?
I have a number of processes picking up jobs from a queue (I might need to scale out in the future). Once a job is picked up, processing follows; during this time the job is locked to other processes by its status. If needed, the job is chucked back to the queue for further processing (its status changes again), but physically the job still sits in the queue until completed.
MSMQ doesn't let me keep the message in the queue while working on it, i.e. I can only peek or read. Read takes the message out of the queue, and peek doesn't allow changing the message (status).
Thank you
Using MSMQ as a datastore is probably bad as it's not designed for storage at all. Unless the queues are transactional the messages may not even get written to disk.
Certainly updating queue items in-situ is not supported for the reasons you state.
If you don't want a full blown relational DB you could use an in-memory cache of some kind, like memcached, or a cheap object db like raven.
Take a look at RabbitMQ, or many of the other messages queues. Most offer this functionality out of the box.
For example, RabbitMQ calls what you are describing Work Queues. Multiple consumers can pull from the same queue and not pull the same item. Furthermore, if you use acknowledgements and the processing fails, the item is not removed from the queue.
.net examples:
https://www.rabbitmq.com/tutorials/tutorial-two-dotnet.html
EDIT: After using MSMQ myself, it would probably work very well for what you are doing, as far as I can tell. The key is to use transactions and multiple queues. For example, each status should have its own queue. It's fairly safe to "move" messages from one queue to another since it occurs within a transaction. This moving of messages is essentially your change of status.
We also use the Message Extension byte array for storing message metadata, like status. This way we don't have to alter the actual message when moving it to another queue.
MSMQ, and queues in general, require a different set of patterns than what most programmers are used to. Keep that in mind.
Perhaps, if you can give more information on why you need to peek for messages that are currently in process, there would be a way to handle that scenario with MSMQ. You could always add a database for additional tracking.