Can we prevent deadlocks and timeouts on ReliableQueues in Service Fabric? - azure-service-fabric

We have a stateful service in Service Fabric with both a RunAsync method and a couple of service calls.
One service call allows enqueueing something in a ReliableQueue:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.EnqueueAsync(tx, message);
    queueLength = await queue.GetCountAsync(tx);
    await tx.CommitAsync();
}
The RunAsync on the other hand tries to dequeue things:
using (ITransaction tx = StateManager.CreateTransaction())
{
    await queue.TryDequeueAsync(tx);
    queueLength = await queue.GetCountAsync(tx);
    await tx.CommitAsync();
}
The GetCountAsync seems to cause deadlocks, because the two transactions block each other. Would it help if we switched the order, so that we count first and then dequeue/enqueue?

This is likely due to the fact that the ReliableQueue today is strictly FIFO and allows only one reader or writer at a time. You're probably not seeing deadlocks but timeouts (please correct me if that is not the case). There's no real way to prevent the timeouts other than to:
Ensure that the transactions are not long lived - any longer than you need and you're blocking other work on the queue.
Increase the default transaction timeout (the default is 4 seconds; you can pass in a different value - see the sketch after this list)
Reordering things shouldn't cause any change.
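For example, here's a minimal sketch of passing a longer timeout to the dequeue call (the queue variable, the 30-second value and CancellationToken.None are just illustrative placeholders based on the question's code):
using (ITransaction tx = StateManager.CreateTransaction())
{
    // Wait up to 30 seconds for the queue's lock instead of the 4-second default.
    var result = await queue.TryDequeueAsync(tx, TimeSpan.FromSeconds(30), CancellationToken.None);
    if (result.HasValue)
    {
        // process result.Value here
    }
    await tx.CommitAsync();
}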

Having two transactions in two different places shouldn't cause deadlocks, as they act like mutexes. What will cause them, though, is creating transactions within transactions.
Perhaps that is what's happening? I've developed the habit lately of naming functions that create transactions Transactional, i.e. DoSomethingTransactionalAsync, and if it's a private helper I'll usually create two versions, one taking a tx and one creating a tx.
For example:
AddToProcessingQueueAsync(ITransaction tx, int num) and AddToProcessingQueueTransactionalAsync(int num).
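A minimal sketch of that pattern (the processingQueue field and the int payload are illustrative):
// Overload that joins an existing transaction; it never commits, the caller does.
private async Task AddToProcessingQueueAsync(ITransaction tx, int num)
{
    await processingQueue.EnqueueAsync(tx, num);
}

// Overload that owns its transaction; safe to call when no transaction is already open.
private async Task AddToProcessingQueueTransactionalAsync(int num)
{
    using (ITransaction tx = StateManager.CreateTransaction())
    {
        await AddToProcessingQueueAsync(tx, num);
        await tx.CommitAsync();
    }
}
The naming makes it harder to accidentally open a second transaction while another one is already held.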

Related

When to use scala.concurrent.blocking?

I am asking myself the question: "When should you use scala.concurrent.blocking?"
If I understood correctly, blocking {} only makes sense in conjunction with the ForkJoinPool. In addition, docs.scala-lang.org highlights that blocking shouldn't be used for long-running executions:
Last but not least, you must remember that the ForkJoinPool is not designed for long-lasting blocking operations.
I assume a long-running execution is a database call or some kind of external IO. In this case a separate thread pool should be used, e.g. a CachedThreadPool. Most IO-related frameworks, like sttp, doobie, and cats, can make use of a provided IO thread pool.
So I am asking myself: which use cases still exist for the blocking statement? Is it only useful when working with locking and waiting operations, like semaphores?
Consider the problem of thread pool starvation. Say you have a fixed size thread pool of 10 available threads, something like so:
implicit val myFixedThreadPool =
ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
If for some reason all 10 threads are tied up, and a new request comes in which requires an 11th thread to do its work, then this 11th request will hang until one of the threads becomes available.
The Future { blocking { ... } } construct can be interpreted as saying: please do not consume a thread from myFixedThreadPool, but instead spin up a new thread outside it.
One practical use case for this is when your application can conceptually be considered to be in two parts: one part which, say in 90% of cases, talks to proper async APIs, and another part which in a few special cases has to talk to, say, a very slow external API that takes many seconds to respond and which we have no control over. Using the fixed thread pool for the true async part is relatively safe from thread pool starvation. However, also using the same fixed thread pool for the second part creates the danger that suddenly 10 requests are made to the slow external API, which now causes 90% of other requests to hang waiting for those slow requests to finish. Wrapping those slow requests in blocking helps minimise the chances of the other 90% of requests hanging.
Another way of achieving this kind of "swimlaning" of true async requests away from blocking requests is by offloading the blocking requests to a separate dedicated thread pool used just for the blocking calls, something like so:
implicit val myDefaultPool =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

val myPoolForBlockingRequests =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(20))

Future {
  callAsyncApi
} // consumes a thread from myDefaultPool

...

Future {
  callBlockingApi
}(myPoolForBlockingRequests) // consumes a thread from myPoolForBlockingRequests
I am asking myself the question: "When should you use scala.concurrent.blocking?"
Well, since that is mostly useful for Future, and Future should never be used for serious business logic, then never.
Now, "jokes" aside, when using Futures you should always use blocking when wrapping blocking operations, AND receive a custom ExecutionContext instead of hardcoding the global one. Note that this should always be the case, even for non-blocking operations, but IME most folks using Future don't do this... but that is another discussion.
Then, callers of those blocking operations may decide if they will use their compute EC or a blocking one.
When the docs mention long-lasting they don't mean anything specific, mostly because it is too hard to be specific about that; it is context / application specific. What you need to understand is that blocking by default (note the actual EC may do whatever it wants) will just create a new thread, and if you create a lot of threads and they take too long to be released you will saturate your memory and kill the program with an OOM error.
For those situations, the recommendation is to control the back pressure of your app to avoid creating too many threads. One way to do that is to create a fixed thread pool for the maximum number of blocking operations you will support and just enqueue all other pending tasks; such an EC should just ignore blocking calls. You may also have an unbounded number of threads but manage the back pressure manually in other parts of your code, e.g. with an explicit Queue; this was common advice before: https://gist.github.com/djspiewak/46b543800958cf61af6efa8e072bfd5c
However, having blocked threads always hurts the performance of your app, even if the compute EC is not blocked. The latest talks by Daniel explain this in detail: "The case for effect systems" & "Threads at scale".
So the ecosystem is pushing the state of the art hard to avoid that at all costs, but it is not a simple task. Still, runtimes like the ones provided by cats-effect or ZIO are optimized to handle blocking tasks as best they can as of today, and will probably improve over the coming years.

Azure Function creating too many connections to PostgreSQL

I have an Azure Durable Function that interacts with a PostgreSQL database, also hosted in Azure.
The PostgreSQL database has a connection limit of 50, and furthermore, my connection string limits the connection pool size to 40, leaving space for super user / admin connections.
Nonetheless, under some loads I get the error
53300: remaining connection slots are reserved for non-replication superuser connections
This documentation from Microsoft seemed relevant, but it doesn't seem like I can make a static client, and, as it mentions,
because you can still run out of connections, you should optimize connections to the database.
I have this method
private IDbConnection GetConnection()
{
    return new NpgsqlConnection(Environment.GetEnvironmentVariable("PostgresConnectionString"));
}
and when I want to interact with PostgreSQL I do it like this:
using (var connection = GetConnection())
{
    connection.Open();
    return await connection.QuerySingleAsync<int>(settings.Query().Insert, settings);
}
So I am creating (and disposing) lots of NpgsqlConnection objects, but according to this, that should be fine because connection pooling is handled behind the scenes. But there may be something about Azure Functions that invalidates this thinking.
I have noticed that I end up with a lot of idle connections (seen from pgAdmin).
Based on that I've tried fiddling with Npgsql connection parameters like Connection Idle Lifetime, Timeout, and Pooling, but the problem of too many connections seems to persist to one degree or another. Additionally I've tried limiting the number of concurrent orchestrator and activity functions (see this doc), but that seems to partially defeat the purpose of Azure Functions being scalable. It does help - I get fewer of the "too many connections" errors. Presumably if I keep testing with lower numbers I may even eliminate the error, but again, that seems to defeat the point, and there may be another solution.
How can I use PostgreSQL with Azure Functions without maxing out connections?
I don't have a good solution, but I think I have the explanation for why this happens.
Why is Azure Function App maxing out connections?
Even though you specify a limit of 40 for the pool size, it is only honored on one instance of the function app. Note that a function app can scale out based on load. It can process several requests concurrently in the same function app instance, plus it can also create new instances of the app. Concurrent requests in the same instance will honor the pool size setting. But in the case of multiple instances, each instance ends up using a pool size of 40.
Even the concurrency throttles in durable functions don't solve this issue, because they only throttle within a single instance, not across instances.
How can I use PostgreSQL with Azure Functions without maxing out connections?
Unfortunately, the function app doesn't provide a native way to do this. Note that the connection pool size is not managed by the Functions runtime but by Npgsql's library code. Library code running on different instances can't coordinate with each other.
Note that this is the classic problem of using shared resources. You have 50 of these resources in this case. The most effective way to support more consumers is to reduce the time each consumer holds the resource. Reducing the Connection Idle Lifetime substantially is probably the most effective way. Increasing Timeout does help reduce errors (and is a good choice), but it doesn't increase the throughput; it just smooths out the load. Reducing Maximum Pool Size is also good.
Think of it in terms of locks on a shared resource. You would want to take the lock for the minimal amount of time. When a connection is opened, it's a lock on one of the 50 total connections. In general, SQL libraries do pooling, and keep the connection open to save the initial setup time that is involved in each new connection. However, if this is limiting the concurrency, then it's best to kill idle connections asap. In a single instance of an app, the library does this automatically when max pool size is reached. But in multiple instances, it can't kill another instance's connections.
One thing to note is that reducing Maximum Pool Size doesn't necessarily limit the concurrency of your app. In most cases, it just decreases the number of idle connections - at the cost of - paying the initial setup time when a new connection will need to be established at a later time.
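As an illustration only (the host, credentials, and numbers are placeholders to be tuned for your own load), the knobs discussed above all live in the Npgsql connection string:
// Illustrative values; adjust them for your own load and server limits.
var connectionString =
    "Host=myserver.postgres.database.azure.com;Database=mydb;Username=myuser;Password=...;" +
    "Maximum Pool Size=8;" +         // per-instance cap on pooled connections
    "Connection Idle Lifetime=15;" + // close idle pooled connections after ~15 seconds
    "Timeout=30";                    // wait up to 30 seconds to obtain a connection

using var connection = new NpgsqlConnection(connectionString);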
Update
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT might be useful. You can set this to 5, and the pool size to 8, or similar. I would go this way if reducing Maximum Pool Size and Connection Idle Lifetime doesn't help.
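For example (illustrative numbers): with WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 5 and Maximum Pool Size = 8, the worst case is 5 × 8 = 40 pooled connections, which stays under the 50-connection server limit and leaves room for admin connections.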
This is where Dependency Injection can be really helpful. You can create a singleton client and it will do the job perfectly. If you want to know more about service lifetimes, you can read about them here in the docs.
First add the NuGet package Microsoft.Azure.Functions.Extensions.DependencyInjection.
Now add a new class like below and resolve your client.
[assembly: FunctionsStartup(typeof(MyFunction.Startup))]
namespace MyFunction
{
    public class Startup : FunctionsStartup
    {
        public override void Configure(IFunctionsHostBuilder builder)
        {
            ResolveDependencies(builder);
        }

        public void ResolveDependencies(IFunctionsHostBuilder builder)
        {
            var conStr = Environment.GetEnvironmentVariable("PostgresConnectionString");
            builder.Services.AddSingleton((s) =>
            {
                return new NpgsqlConnection(conStr);
            });
        }
    }
}
Now you can easily consume it from any of your functions:
public class FunctionA
{
    private readonly NpgsqlConnection _connection;

    public FunctionA(NpgsqlConnection conn)
    {
        _connection = conn;
    }

    public async Task<HttpResponseMessage> Run()
    {
        // do something with your _connection
    }
}
Here's an example of using a static HttpClient, something you should consider so that you don't need to explicitly manage connections and can instead let the client do it:
public static class PeriodicHealthCheckFunction
{
    private static readonly HttpClient _httpClient = new HttpClient();

    [FunctionName("PeriodicHealthCheckFunction")]
    public static async Task Run(
        [TimerTrigger("0 */5 * * * *")] TimerInfo healthCheckTimer,
        ILogger log)
    {
        string status = await _httpClient.GetStringAsync("https://localhost:5001/healthcheck");
        log.LogInformation($"Health check performed at: {DateTime.UtcNow} | Status: {status}");
    }
}

state manager parallel transactions in runasync

In Service Fabric stateful services, there is RunAsync(cancellationToken), with using() blocks for state manager transactions.
The legacy code I want to refactor contains two queues, with dequeue attempts inside a while(true) loop with a 1-second delay. I would like to get rid of this unnecessary delay and instead use two distinct reactive queues (semaphores with reliable queues).
The problem is that the two distinct workflows depending on these two queues now need to be separated into two separate threads, because if the two queues run on a single thread, one wait() will block the other code from running. (I know the best practice would probably be to separate these two tasks into two microservices; next project.)
I came up with the code below as a solution:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    await Task.WhenAll(AsyncTask1(cancellationToken), AsyncTask2(cancellationToken)).ConfigureAwait(false);
}
And each task contains something like:
while (true)
{
    cancellationToken.ThrowIfCancellationRequested();

    using (var tx = this.StateManager.CreateTransaction())
    {
        var maybeMessage = await messageQueue.TryDequeueAsync(tx, cancellationToken).ConfigureAwait(false);
        if (maybeMessage.HasValue)
        {
            DoWork();
        }

        await tx.CommitAsync().ConfigureAwait(false);
    }
}
It seems to work, but I just want to make sure that using(StateManager.CreateTransaction()) is OK to use in this parallel way.
According to the documentation:
Depending on the replica role, for a single-entry operation (like TryDequeueAsync) the ITransaction uses the Repeatable Read isolation level (when primary) or the Snapshot isolation level (when secondary).
Repeatable Read
Any Repeatable Read operation by default takes Shared locks.
Snapshot
Any read operation done using Snapshot isolation is lock free.
So if DoWork doesn't modify the reliable collection, then multiple transactions can be executed in parallel with no problems.
In the case of mixed reads and updates, this can cause deadlocks and should be done with care.
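If the parallel transactions do read-then-update against a reliable dictionary, one way to reduce the deadlock risk is to take the update lock up front instead of reading with a shared lock and upgrading later. A minimal sketch (stateDict, key and Process are placeholders):
using (var tx = this.StateManager.CreateTransaction())
{
    // LockMode.Update acquires the update lock on the read, avoiding the
    // shared-to-exclusive upgrade that two parallel transactions can deadlock on.
    var current = await stateDict.TryGetValueAsync(tx, key, LockMode.Update);
    if (current.HasValue)
    {
        await stateDict.SetAsync(tx, key, Process(current.Value));
    }
    await tx.CommitAsync();
}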

How do both sync and async work in a thread-safe way?

In Swift, we can leverage DispatchQueue to prevent race conditions. By using a serial queue, all things are performed in order. From https://developer.apple.com/library/content/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html:
Serial queues (also known as private dispatch queues) execute one task at a time in the order in which they are added to the queue. The currently executing task runs on a distinct thread (which can vary from task to task) that is managed by the dispatch queue. Serial queues are often used to synchronize access to a specific resource.
But we can easily create a deadlock (see How do I create a deadlock in Grand Central Dispatch?) by performing a sync inside an async:
let serialQueue = DispatchQueue(label: "Cache.Storage.SerialQueue")
serialQueue.async {
    serialQueue.sync {
        print("perform some job")
    }
    print("this can't be reached")
}
The only way to prevent deadlock is to use 2 serial queues, one for the sync and one for the async function versions. But this can cause a race condition when writeSync and writeAsync happen at the same time.
I see that the fs module supports both sync and async functions, like fs.writeFileSync(file, data[, options]) and fs.writeFile(file, data[, options], callback). By allowing both versions, does it mean users can use them in any order they want? So they can easily create a deadlock like the one above?
So maybe fs has a clever way that we can apply to Swift? How do we support both sync and async in a thread-safe manner?
serialQueue.async {
    serialQueue.sync {
        print("perform some job")
    }
}
This deadlocks because this code queues a second task on the same dispatch queue and then waits for that second task to finish. The second task can't even start, however, because it is a serial queue and the first task is still executing (albeit blocked on an internal semaphore).
The way to avoid this kind of deadlock is to never do that. It's especially stupid when you consider that you can achieve the same effect with the following:
serialQueue.async {
    print("perform some job")
}
There are some use cases for running synchronous tasks in a different queue to the one you are in, e.g.:
if the other queue is the main queue and you want to do some stuff in the UI before carrying on
as a means of synchronisation between tasks in different queues, for example if you want to make sure that all the current tasks in another queue have finished before carrying on.
However, there is never a reason to synchronously do something on the same queue; you might as well just do the thing directly. Or to put it another way, if you just write statements one after the other, they are already executing synchronously on the same queue.
I see that the fs module supports both sync and async functions, like fs.writeFileSync(file, data[, options]) and fs.writeFile(file, data[, options], callback). By allowing both versions, does it mean users can use them in any order they want? So they can easily create a deadlock like the one above?
That depends on how the two APIs are implemented. The synchronous version of the call might just do the call without messing about on other threads. If it does grab another thread and then wait around until that other thread is finished, then yes there is a potential for deadlock if the node.js server runs out of threads.

Service Fabric reliable queue long operation

I'm trying to understand some best practices for service fabric.
If I have a queue that is added to by a web service or some other mechanism, and a back-end task to process that queue, what is the best approach to handle long-running operations in the background?
Use TryPeekAsync in one transaction, process, and then if successful use TryDequeueAsync to finally dequeue.
Use TryDequeueAsync to remove an item, put it into a dictionary, and then remove it from the dictionary when complete. On startup of the service, check the dictionary for anything pending before the queue.
Both ways feel slightly wrong, but I can't work out if there is a better way.
One option is to process the queue in RunAsync, something along the lines of this:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var store = await StateManager.GetOrAddAsync<IReliableQueue<T>>("MyStore").ConfigureAwait(false);
    while (!cancellationToken.IsCancellationRequested)
    {
        using (var tx = StateManager.CreateTransaction())
        {
            var itemFromQueue = await store.TryDequeueAsync(tx).ConfigureAwait(false);
            if (!itemFromQueue.HasValue)
            {
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken).ConfigureAwait(false);
                continue;
            }

            // Process the item here.
            // Remember to clone the dequeued item if it is a custom type and you are going to mutate it.
            // If processing succeeds, await tx.CommitAsync();
            // If it fails, either let it fall out of the using transaction scope, or call tx.Abort();
        }
    }
}
Regarding the comment about cloning the dequeued item if you are to mutate it, look under the "Recommendations" part here:
https://azure.microsoft.com/en-us/documentation/articles/service-fabric-reliable-services-reliable-collections/
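A minimal sketch of what that cloning might look like inside the loop above, assuming a hypothetical MyItem type with a copy constructor (any deep-copy mechanism works):
var itemFromQueue = await store.TryDequeueAsync(tx).ConfigureAwait(false);
if (itemFromQueue.HasValue)
{
    // Never mutate the instance handed back by the reliable collection;
    // copy it first and work on the copy.
    var copy = new MyItem(itemFromQueue.Value);
    copy.Status = "Processing"; // hypothetical property, for illustration only
    // ... process copy ...
}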
One limitation with Reliable Collections (both Queue and Dictionary) is that you only have parallelism of 1 per partition. So for high-activity queues it might not be the best solution. This might be the issue you're running into.
What we've been doing is to use ReliableQueues for situations where the write volume is very low. For higher-throughput queues, where we need durability and scale, we're using ServiceBus Topics. That also gives us the advantage that if a service was stateful only due to having the ReliableQueue, it can now be made stateless. Though this adds a dependency on a 3rd-party service (in this case ServiceBus), and that might not be an option for you.
Another option would be to create a durable pub/sub implementation to act as the queue. I've done tests before using actors for this, and it seemed to be a viable option, but I didn't spend too much time on it since we didn't have any issues depending on ServiceBus. Here is another SO question about that: Pub/sub pattern in Azure Service Fabric
If processing is very slow, use two queues: a fast one where you store the work without interruption, and a slow one from which it is processed. RunAsync is used to move messages from the fast queue to the slow one.
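A minimal sketch of that pump, assuming two queues named FastQueue and SlowQueue and an illustrative MyItem type:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var fastQueue = await StateManager.GetOrAddAsync<IReliableQueue<MyItem>>("FastQueue");
    var slowQueue = await StateManager.GetOrAddAsync<IReliableQueue<MyItem>>("SlowQueue");

    while (!cancellationToken.IsCancellationRequested)
    {
        bool moved;
        using (var tx = StateManager.CreateTransaction())
        {
            // Move one message per transaction so the fast queue is only locked briefly.
            var item = await fastQueue.TryDequeueAsync(tx);
            moved = item.HasValue;
            if (moved)
            {
                await slowQueue.EnqueueAsync(tx, item.Value);
                await tx.CommitAsync();
            }
        }

        if (!moved)
        {
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
}
The slow processing loop then dequeues from SlowQueue in its own transactions, so a long-running work item never blocks producers writing to FastQueue.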