I currently have a couple of Grails applications using a single MongoDB replica set for some shared information. Recently, a race condition reproducibly occurred where:
A user performed an action on a front-end that started a back-end process
Part of the back-end process involved saving a new status on two domain objects
The back-end process generated a message to an outside server notifying of the event
The outside server generated a request for more details on the relevant domain objects after receiving the message
My app responded with stale data for one resource and accurate(up-to-date) data for the other.
On retry, the outside server generated an identical request and received the expected(accurate) data for both resources.
Neither save was called with flush:true, however the 1st resources save occurs two lines before the 2nd resources save, and the line in between is the change to the second resource. It may be the case that flush will resolve this issue, but I'd like to implement a solution that will guarantee the specific situation doesn't arise again.
My somewhat naive/1st thrust implementation of a solution involves get & release semaphore methods:
public synchronized boolean releaseSemaphore(Thing thing){
if(thing.semaphore==true){
thing.semaphore=false
if(thing.save(failOnError:true, flush:true)){
return true
}
else{
log.warn "releasing thing semaphore, thing: "+thing.id+", failed on thing.save()"
return false
}
}
else{
log.error "releaseThingSemaphore called with thing: "+thing.id+" when semaphore not held."
return false
}
}
public synchronized getThingWithSemaphore(String thingId){
def thing = Thing.get(thingId)
if(thing){
if(thing.semaphore==false){
thing.semaphore=true
if(thing.save(failOnError:true, flush:true)){
return [thing, true]
}
else{
log.debug "Failed to get lock on thing: "+thingId
return [null, false]
}
}
else{
log.warn "Thing: "+thingId+" was locked when getThingWithSemaphore was called"
return [null, false]
}
}
else{
return [null, false]
}
}
Would the GORM API's Thing.lock(thingId) {(http://grails.github.io/grails-doc/3.0.x/guide/GORM.html#locking) .lock()} accomplish the same guarantees I've tried to achieve in my above implementation here (when getting the semaphore prior to writing or reading for a significant action)?
http://www.anyware.co.uk/2005/2012/11/12/the-false-optimism-of-gorm-and-hibernate/ (while I realize its old) claims the .lock() method used to fail silently when outside of a transaction, not actually locking anything. Is this still the case?
(I also have concerns around MongoDB's general level ofsupport for transactions, and the impact this would have on a GORM .lock() call)
If I were to deploy multiple instances of the identical app in different data-centers requiring cross app-instance locking (or cross non-identical app sharing the same MongoDB replication set datasource), would I be able to leverage anything in the GORM API to enable cross-app document locking?
Related
I am building a bot for for Facebook Messenger using Microsoft Bot Framework. I am planning to use CosmosDB for State Management and also as my backend data store. (I am not stuck to CosmosBD and can use any other store if needed)
I need to send daily/weekly proactive messages(push notifications) to users based on their time preference. I will capturing their time preference when they first interact with the bot.
What is the best way to deliver these notifications?
As I will be storing these preferences in CosmosDB, I am thinking using ComosDB trigger of creating an Azure Function and schedule it based on the user time preference. This Azure function will make a call to my webhook which will deliver these messages. If requried, I will change Function schedule when a user changes his/her preference.
My questions are:
Is this a good approach?
Are there any other alternatives (Notifications Hub?)
I should be able to set specific times for notifications (like at the top of the hour or something like that), does it make sense to schedule an Azure Function to run at these hours rather than creating a function based on user preference (I can actually combine these two approaches too)
Thank you in advance.
First, I don't think there's any "right" answer to be given here; it's going to depend a lot on your domain's specific needs. Scale is going to play a major factor in the design of this. Will you have 100 users? 10000 users? 1mil users? I'm going to assume you want to design for maximum scale up front, but it could be overkill.
First, based on what you've described, I don't think a CosmosDB trigger is necessarily the solution to your problem because that's only going to fire when the preference data is created/updated. I assume that, from that point forward, your function needs to continuously fire at the time slot they've opted into, correct?
So let's pretend you let people choose from the 24hrs in the day. A naïve approach would be to simply use a scheduled trigger that fires up every hour, queries the CosmosDB for all the documents where the preference is set to that particular hour and then begins sending out notifications from there. The problem is how you scale from there and deal with issues of idempotency in the face of failures.
First off, a timer trigger only ever spins up one instance. If you were to just go query the CosmosDB documents and start processing them one by one in the scope of that single trigger, you'd hit a ceiling relatively quickly on how many notifications you can scale to. Instead what you'd want to do is use that timer trigger to fan out the notifications to as many "worker" function instances as possible. The timer trigger can act as the orchestrator in the sense that it can own the query against the CosmosDB and then turn each document result it finds for that particular notification time window into a message that it places on a queue to be processed by a separate function which will scale out on its own.
There are actually a couple ways you can accomplish this with Azure Functions, it really depends on how early an adopter of technology you are comfortable with being.
The first is what I would call the "manual" way which would be done by simply using the existing Azure Storage Queue extension by taking an IAsyncCollector<YourNotificationWorkerMessage> as a parameter to the timer function that's bound to the worker queue and then pumping out the messages through that. Then you write a second companion function which uses a QueueTrigger, bind it to that same queue, and it will take care of processing each message. This second function is where you get the scaling, enabling process all of the queued messages as quickly as possible based on whatever scaling parameters you choose to configure. This is the "simplest" approach
The second approach would be to adopt the newer Durable Functions extension. With that model, you don't have to directly think about creating a worker queue. You simply kick off a new instance of your orchestrator function from the timer function and the orchestrator fans out the work by invoking N "concurrent" calls to an action for each notification. Now, it happens to distribute those calls using queues under the covers, but that's an implementation detail that you need no longer maintain yourself. Additionally, if the work of delivering the notification requires more involved work and/or retry logic, you might actually consider using a sub-orchestration instead of a simple action. Finally, another added benefit of this approach, is that you can "fan back in" to your main orchestrator function once all the notifications are delivered to do some follow up work... even if that's simply some kind of event logging that the notification cycle has completed for this hour.
Now, the challenge with either of these approach is actually dealing with failure in initially fetching the candidates for notification from CosmosDB, paging through the results and making sure you actually fan all of them out in an idempotent manner. You need to deal with possible hiccups as you page and you need to deal with the fact that your whole function could be torn down and you might have to restart. Perhaps on the initial run of the 8AM notifications you got through page 273 out of 371 pages and then you got hit with a complete network connectivity fail or the VM your function was running on suffered a power failure. You could resume, but you'd need to know that you left off on page 273 and that you actually processed the 27th record out of that page and start from there. Otherwise, you risk sending double notifications to your users. Maybe that's something you can accept, maybe it's not. Maybe you're ok with the 27 notifications on that page being duplicated as long as the first 272 pages aren't. Again, this is something you need to decide for your domain, but if you want to avoid this issue your orchestrator function will need to track its progress to ensure that it doesn't send out dupes. Again I would say Durable Functions has a leg up here as it comes with the ability to configure retries. Maintaining the state of a particular run is left up to the author in either approach though.
I use pro-active dialog extensively with botframwork and messenger without any issue. During your facebook approval process you simply need to inform them you will be sending notifications trough messenger with your bot. Usually if you use it to inform your user and stay away from promotional content you should be fine.
I also use azure function to trigger the pro-active dialog from a custom controller endpoint.
Bellow sample code for azure function:
public static void Run(TimerInfo notificationTrigger, TraceWriter log)
{
try
{
//Serialize request object
string timerInfo = JsonConvert.SerializeObject(notificationTrigger);
//Create a request for bot service with security token
HttpRequestMessage hrm = new HttpRequestMessage()
{
Method = HttpMethod.Post,
RequestUri = new Uri(NotificationEndPointUrl),
Content = new StringContent(timerInfo, Encoding.UTF8, "application/json")
};
hrm.Headers.Add("Authorization", NotificationApiKey);
log.Info(JsonConvert.SerializeObject(hrm));
//Call service
using (var client = new HttpClient())
{
Task task = client.SendAsync(hrm).ContinueWith((taskResponse) =>
{
HttpResponseMessage result = taskResponse.Result;
var jsonString = result.Content.ReadAsStringAsync();
jsonString.Wait();
if (result.StatusCode != System.Net.HttpStatusCode.OK)
{
//Throw what ever problem as an exception with details
throw new Exception($"AzureFunction - ERRROR - HTTP {result.StatusCode}");
}
});
task.Wait();
}
}
catch (Exception ex)
{
//TODO log
}
}
Bellow sample code for starting the pro-active dialog:
public static async Task Resume<T, R>(string resumptionCookie) where T : IDialog<R>, new()
{
//Deserialize reference to conversation
ConversationReference conversationReference = JsonConvert.DeserializeObject<ConversationReference>(resumptionCookie);
//Generate message from bot to user
var message = conversationReference.GetPostToBotMessage();
var builder = new ContainerBuilder();
using (var scope = DialogModule.BeginLifetimeScope(Conversation.Container, message))
{
//From a cold start the service is not yet authenticated with dev bot azure services
//We thus must trust endpoint url.
if (!MicrosoftAppCredentials.IsTrustedServiceUrl(message.ServiceUrl))
{
MicrosoftAppCredentials.TrustServiceUrl(message.ServiceUrl, DateTime.MaxValue);
}
var botData = scope.Resolve<IBotData>();
await botData.LoadAsync(CancellationToken.None);
//This is our dialog stack
var task = scope.Resolve<IDialogTask>();
T dialog = scope.Resolve<T>(); //Resolve the dialog using autofac
try
{
task.Call(dialog.Void<R, IMessageActivity>(), null);
await task.PollAsync(CancellationToken.None);
}
catch (Exception ex)
{
//TODO log
}
finally
{
//flush dialog stack
await botData.FlushAsync(CancellationToken.None);
}
}
}
Your dialog needs to be registered in autofac.
Your resumptionCookie needs to be saved in your db.
You might want to check FB policy regarding proactive messages
There’s a 24h limit but it might not be totally screwed in your case
https://developers.facebook.com/docs/messenger-platform/policy/policy-overview#standard_messaging
I have used cache2k with read through in a web application to load blog posts on demand. However, I am concerned about blocking for the read through feature. For example, if multiple threads (requests) ask the cache for the same key, is it possible for the read through method to be called multiple times to load the same key/value into the cache?
I get the impression from the documentation that the read through feature does block concurrent requests for the same key until the load has completed, but may I have mis-read the documentation. I just want to check that this is the behaviour.
The method which initializes the cache looks like this:
private void initializeURItoPostCache()
{
final CacheLoader<String, PostImpl> postFileLoader = new CacheLoader<String, PostImpl>(){
#Override public PostImpl load(String uri)
{
// Fetch the data and create the post object
final PostImpl post = new PostImpl();
//.. code omitted
return post;
}
};
// Initialize the cache with a read-through loader
this.cacheUriToPost = new Cache2kBuilder<String, PostImpl>(){}
.name("cacheBlogPosts")
.eternal(true)
.loader(postFileLoader)
.build();
}
The following method is used to request a post from the cache:
public Post getPostByURI(final String uri)
{
// Check with the index service to ensure the URI is known (valid to the application)
if(this.indexService.isValidPostURI(uri))
{
// We have a post associated with the given URI, so
// request it from the cache
return this.cacheUriToPost.get(uri);
}
return EMPTY_POST;
}
Many thanks in advance, and a happy and prosperous New Year to all.
When multiple requests to the same key will provoke a cache loader call, cache2k will only invoke the loader once. Other threads wait until the load is finished. This behavior is called blocking read through. To cite from the Java Doc:
Blocking: If the loader is invoked by Cache.get(K) or other methods that allow transparent access concurrent requests on the same key will block until the loading is completed. For expired values blocking can be avoided by enabling Cache2kBuilder.refreshAhead(boolean). There is no guarantee that the loader is invoked only for one key at a time. For example, after Cache.clear() is called load operations for one key may overlap.
This behavior is very important for caches, since it protects against the Cache stampede. An example: A high traffic website receives 1000 requests per second. One resource takes quite long to generate, about 100 milliseconds. When the cache is not blocking out the multiple requests when there is a cache miss, there would be at least 100 requests hitting the loader for the same key. "at least" is an understatement, since your machine will probably not handle 100 requests at the same speed then one.
Keep in mind that there is no hard guarantee by the cache. The loader must still be able to perform correctly when called for the same key at the same time. For example blocking read through and Cache.clear() lead to competing requirements. The Cache.clear() should be fast, which means we don't want to wait for ongoing load operations to finish.
The question assumes the use of Event Sourcing.
When rebuilding current state by replaying events, event handlers should be idempotent. For example, when a user successfully updates their username, a UsernameUpdated event might be emitted, the event containing a newUsername string property. When rebuilding current state, the appropriate event handler receives the UsernameUpdated event and sets the username property on the User object to the newUsername property of the UsernameUpdated event object. In other words, the handling of the same message multiple times always yields the same result.
However, how does such an event handler work when integrating with external services? For example, if the user wants to reset their password, the User object might emit a PasswordResetRequested event, which is handled by a portion of code that issues a 3rd party with a command to send an SMS. Now when the application is rebuilt, we do NOT want to re-send this SMS. How is this situation best avoided?
There are two messages involved in the interaction: commands and events.
I do not regard the system messages in a messaging infrastructure the same as domain events. Command message handling should be idempotent. Event handlers typically would not need to be.
In your scenario I could tell the aggregate root 100 times to update the user name:
public UserNameChanged ChangeUserName(string username, IServiceBus serviceBus)
{
if (_username.Equals(username))
{
return null;
}
serviceBus.Send(new SendEMailCommand(*data*));
return On(new UserNameChanged{ Username = userName});
}
public UserNameChanged On(UserNameChanged #event)
{
_username = #event.UserName;
return #event;
}
The above code would result in a single event so reconstituting it would not produce any duplicate processing. Even if we had 100 UserNameChanged events the result would still be the same as the On method does not perform any processing. I guess the point to remember is that the command side does all the real work and the event side is used only to change the state of the object.
The above isn't necessarily how I would implement the messaging but it does demonstrate the concept.
I think you are mixing two separate concepts here. The first is reconstructing an object where the handlers are all internal methods of the entity itself. Sample code from Axon framework
public class MyAggregateRoot extends AbstractAnnotatedAggregateRoot {
#AggregateIdentifier
private String aggregateIdentifier;
private String someProperty;
public MyAggregateRoot(String id) {
apply(new MyAggregateCreatedEvent(id));
}
// constructor needed for reconstruction
protected MyAggregateRoot() {
}
#EventSourcingHandler
private void handleMyAggregateCreatedEvent(MyAggregateCreatedEvent event) {
// make sure identifier is always initialized properly
this.aggregateIdentifier = event.getMyAggregateIdentifier();
// do something with someProperty
}
}
Surely you wouldn't put code that talks to an external API inside an aggregate's method.
The second is replaying events on a bounded context which could cause the problem you are talking about and depending on your case you may need to divide your event handlers into clusters.
See Axon frameworks documentation for this point to get a better understanding of the problem and the solution they went with.
Replaying Events on a Cluster
TLDR; store the SMS identifier within the event itself.
A core principle of event sourcing is "idempotency". Events are idempotent, meaning that processing them multiple times will have the same result as if they were processed once. Commands are "non-idempotent", meaning that the re-execution of a command may have a different result for each execution.
The fact that aggregates are identified by UUID (with a very low percentage of duplication) means that the client can generate the UUIDs of newly created aggregates. Process managers (a.k.a., "Sagas") coordinate actions across multiple aggregates by listening to events in order to issue commands, so in this sense, the process manager is also a "client". Cecause the process manager issues commands, it cannot be considered "idempotent".
One solution I came up with is to include the UUID of the soon-to-be-created SMS in the PasswordResetRequested event. This allows the process manager to only create the SMS if it does not yet already exist, hence achieving idempotency.
Sample code below (C++ pseudo-code):
// The event indicating a password reset was successfully requested.
class PasswordResetRequested : public Event {
public:
PasswordResetRequested(const Uuid& userUuid, const Uuid& smsUuid, const std::string& passwordResetCode);
const Uuid userUuid;
const Uuid smsUuid;
const std::string passwordResetCode;
};
// The user aggregate root.
class User {
public:
PasswordResetRequested requestPasswordReset() {
// Realistically, the password reset functionality would have it's own class
// with functionality like checking request timestamps, generationg of the random
// code, etc.
Uuid smsUuid = Uuid::random();
passwordResetCode_ = generateRandomString();
return PasswordResetRequested(userUuid_, smsUuid, passwordResetCode_);
}
private:
Uuid userUuid_;
string passwordResetCode_;
};
// The process manager (aka, "saga") for handling password resets.
class PasswordResetProcessManager {
public:
void on(const PasswordResetRequested& event) {
if (!smsRepository_.hasSms(event.smsUuid)) {
smsRepository_.queueSms(event.smsUuid, "Your password reset code is: " + event.passwordResetCode);
}
}
};
There are a few things to note about the above solution:
Firstly, while there is a (very) low possibility that the SMS UUIDs can conflict, it can actually happen, which could cause several issues.
Communication with the external service is prevented. For example, if user "bob" requests a password reset that generates an SMS UUID of "1234", then (perhaps 2 years later) user "frank" requests a password reset that generates the same SMS UUID of "1234", the process manager will not queue the SMS because it thinks it already exists, so frank will never see it.
Incorrect reporting in the read model. Because there is a duplicate UUID, the read side may display the SMS sent to "bob" when "frank" is viewing the list of SMSes the system sent him. If the duplicate UUIDs were generated in quick succession, it is possible that "frank" would be able to reset "bob"s password.
Secondly, moving the SMS UUID generation into the event means you must make the User aggregate aware of the PasswordResetProcessManager's functionality (but not the PasswordResetManager itself), which increases coupling. However, the coupling here is loose, in that the User is unaware of how to queue an SMS, only that an SMS should be queued. If the User class were to send the SMS itself, you could run into the situation in which the SmsQueued event is stored while the PasswordResetRequested event is not, meaning that the user will receive an SMS but the generated password reset code was not saved on the user, and so entering the code will not reset the password.
Thirdly, if a PasswordResetRequested event is generated but the system crashes before the PasswordResetProcessManager can create the SMS, then the SMS will eventually be sent, but only when the PasswordResetRequested event is re-played (which might be a long time in the future). E.g., the "eventual" part of eventual consistency could be a long time away.
The above approach works (and I can see that it should also work in more complicated scenarious, like the OrderProcessManager described here: https://msdn.microsoft.com/en-us/library/jj591569.aspx). However, I am very keen to hear what other people think about this approach.
I have a Play! Framework 2.3 project hosted on Heroku with the Postgres Addon.
It handles requests from mobile applications (Post a message).
For different reasons, I have duplicate (twice) rows (messages) in database :
the app might send the request twice in a short time ( less than 10ms )
I have multiple dynos that handle requests in parallel
Event if before writing in DB, I check the message does not exist yet. So I guess the first has still not been wrote when the second comes.
I also tried to write a message footprint in the memcache before handling the request (after form validation). But I still got twice messages sometimes.
The solutions I found are :
have a unique constraint on some database field (like a message timestamp client-side generated ?)
regularly check to remove duplicates
As I do not have means to update mobile apps I will script a regular check of duplicates.
Any other idea ?
What are the best practices to handle such concurrent requests ?
Attachement : my pseudo code
public static Result submit() {
User user = MySecured.getCurrentUser(ctx());
final Form<Message> filledForm = form(Message.class).bindFromRequest();
.... Some database pre-verification
if (filledForm.hasErrors()) {
ObjectNode error = Json.newObject();
error.put("error", filledForm.errorsAsJson());
return ok(error);
} else {
if(Cache.get(KEY_LOCK_FLASH_WRITING+filledForm.data().get("mail"))!=null){
return internalServerError();
}
//Verify this flash hasnt already been handled (requests can come twice from client)
Message sameMessage = Message.findSame(filledForm.get().mail, filledForm.get().message);
if(sameMessage!=null){
Logger.info("[Submit] message already exists" + sameMessage.id);
ObjectNode jsonResult = Json.newObject();
.... Processing a result ... no matter this does not happen.
return ok(jsonResult);
}
final Message flash = filledForm.get();
Cache.set(KEY_LOCK_FLASH_WRITING+flash.mail, "");
... some fields initializations like flash.author = new Author();
... Then some promises
return ok();
}
}
I've started using the AspProviders code to store my session data in my table storage.
I'm sporadically getting the following error:
Description: Exception of type 'System.Web.HttpException' was thrown. INNER_EXCEPTION:Error accessing the data store! INNER_EXCEPTION:An error occurred while processing this request. INNER_EXCEPTION: ConditionNotMet The condition specified using HTTP conditional header(s) is not met. RequestId:0c4239cc-41fb-42c5-98c5-7e9cc22096af Time:2010-10-15T04:28:07.0726801Z
StackTrace:
System.Web.SessionState.SessionStateModule.EndAcquireState(IAsyncResult ar)
System.Web.HttpApplication.AsyncEventExecutionStep.OnAsyncEventCompletion(IAsyncResult ar) INNER_EXCEPTION:
Microsoft.Samples.ServiceHosting.AspProviders.TableStorageSessionStateProvider.ReleaseItemExclusive(HttpContext context, String id, Object lockId) in \Azure\AspProviders\TableStorageSessionStateProvider.cs:line 484
System.Web.SessionState.SessionStateModule.GetSessionStateItem()
System.Web.SessionState.SessionStateModule.PollLockedSessionCallback(Object state) INNER_EXCEPTION:
Microsoft.WindowsAzure.StorageClient.Tasks.Task1.get_Result()
Microsoft.WindowsAzure.StorageClient.Tasks.Task1.ExecuteAndWait()
Microsoft.WindowsAzure.StorageClient.TaskImplHelper.ExecuteImplWithRetry[T](Func`2 impl, RetryPolicy policy)
Microsoft.Samples.ServiceHosting.AspProviders.TableStorageSessionStateProvider.ReleaseItemExclusive(TableServiceContext svc, SessionRow session, Object lockId) in \Azure\AspProviders\TableStorageSessionStateProvider.cs:line 603
Microsoft.Samples.ServiceHosting.AspProviders.TableStorageSessionStateProvider.ReleaseItemExclusive(HttpContext context, String id, Object lockId) in \Azure\AspProviders\TableStorageSessionStateProvider.cs:line 480 INNER_EXCEPTION:
System.Data.Services.Client.DataServiceContext.SaveResult.d__1e.MoveNext()
Anyone run into this? The only useful information I've found is this, which I'm hesitant to do:
If you want to bypass the validation, you can open TableStorageSessionStateProvider.cs, find ReleaseItemExclusive, and modify the code from:
svc.UpdateObject(session);
to:
svc.Detach(session);
svc.AttachTo("Sessions", session, "*");
svc.UpdateObject(session);
from here
Thanks!
So I decided to change this:
svc.UpdateObject(session);
svc.SaveChangesWithRetries();
to this:
try
{
svc.UpdateObject(session);
svc.SaveChangesWithRetries();
}
catch
{
svc.Detach(session);
svc.AttachTo("Sessions", session, "*");
svc.UpdateObject(session);
svc.SaveChangesWithRetries();
}
So, I'll see how that works...
I've encountered this problem as well and after some investigation it seems to happen more often when you have more than one instance and you try to make calls in rapid succession in the same session. (e.g. if you had an auto complete box and making ajax calls on each key press)
This occurs because when you try to access the session data, first of all the web server takes out a lock on that session. When the request is complete, it releases the lock. With the table service provider, it updates this lock status by updating a field in the table. What I think is happening is that Instance1 loads the session row, then Instance2 loads the session row, Instance1 saves down the updated lock status and when Instance2 attempts to save the lock status it gets an error because the object isn't in the same state as when it loaded it (the ETag doesn't match any more).
This is why the fix that you found will stop the error from occurring, because by specifying the "*" in the AttachTo, when Instance2 attempts to save the lock it will turn off ETag checking (and over write the changes made by Instance1).
In our situation we have altered the provider so that we can turn off session for certain paths (the ajax call that was giving us our problems didn't need access to session data, neither did the loading of images) which may be an option for you depending on what is causing your problem.
Unfortunately the TableStorageSessionStateProvider is part of the sample projects and so isn't (as far as I'm aware, but I'll happily be told otherwise) officially supported by Microsoft. It does have other issues, like the fact that it doesn't clean up it's session data once a session expires, so you will end up with lots of junk in the session table and blob container that you'll have to clean up some other way.