Idiomatic Scala for handling concurrency in a web application - scala

I need to build a rudimentary RESTful session management service in Scala. A user will login and receive a session id in return. This session id will be validated on each service call. Users will be logged out after a period of inactivity.
The session management service will (could) be a simple in-memory singleton, with a map of session ids to expiry times. Where a user's session has expired it should be removed from the map. The map can be read and written by multiple threads simultaneously.
Idiomatic Scala would suggest this map be immutable but how would I handle updates? As I see the options:
Synchronize access to a mutable map
Make the map immutable but synchronize access to its reference
What is the idiomatic way of handling this kind of problem?
Note: Akka is not an option, but other libraries are.

As a developer, you have a set of techniques to deal with concurrency that you can pick up. If you decide to go for synchronisation, you should be aware of the price:
Performance decreases as lock contention increase
Lock contention is a function of how long locks are held
You need to fine tune locking (or you'll end up in the degenerate case where the lock is held forever -> single threading)
Using a singleton with synchronized access makes latency increase very quickly. Assuming each request keeps the lock for 30ms, and requests arrive every 25ms, you will have a growing latency and your users will be really upset.
If your application is a trivial exercise, go for locking. If your application has speed/latency requirement, the sooner you abandon synchronization techniques the better. And by the way, storing session in memory won't work if you need to deploy your application in a highly available cluster.

Related

Design of a simple REST API with Akka + Persistence

I'm building a simple REST API for generating some objects that must be created and sent periodically out of the API. The nature of the objects doesn't matter, neither the framework supporting the REST interface (Spray, Play Framework, whatever else). My question is, what would be a good scalable actor design for this system using Akka? Suppose the service crashes or it's migrated or whatever that causes to stop it. In order to recover the description of the tasks about what objects must be sent and when, is akka-persistence a good way to go here? or it's better to persist such things in a traditional DB?
Thanks.
NOTE: also I would like to know, supposing there's some actor which is not stateful himself, but creates many children actors, if it's a good practice to use akka-persistence in order to replay the messages which causes this actor to create his children again (the children being also non-stateful).
In a traditional DB you would most likely end up modeling this with timestamps and events, and with event sourcing this is already the native model.
Akka-persistence would be a natural fit for this scenario since it will persist every event about what objects must be created and sent periodically out. The snapshot support will also help with speed of recovery when the number of events gets very large.
In the case of crashes or migration, the recovery process will handle this just fine.
Regarding your note, if the actor is truly stateless then there is no need to persist the events that cause the children to be created since they can be recreated on demand. If the existence of the children does need to be recovered, then the actor is not stateless. In that case then it may indeed make sense to persist those events.

Managing non-serializable objects at the session level

My Wicket application integrates a couple of third party services. When a user authenticates to the app, one of the services instantiates a client object tied to that particular user.
Instantiating the client is quite expensive, so re-instantiating it with every request isn't quite an option. Were the client serializable, I'd keep a reference in the session, but since it isn't, I'm maintaining a map of clients at the application level, keyed by session. It works, but it's a little kludgy, particularly when a session expires or something else misbehave and the map is out of sync.
I'm wondering if there might be any other options to that problem. I was thinking along the line of intercepting the serialization of the session, and maintaining the client instances in memory instead.
Any suggestions?
DON'T DO THIS!
As per tetsuo's comment, this approach wouldn't work.
Original, non-working proposition
Besides an HttpSessionListener, you could also use a WeakHashMap in your application. You can then keep the key to your client objects in the Wicket Session. When the session is destroyed, the corresponding key-value entry in the map will be garbage-collected automatically.
See this explanation of weak references in Java.

Why should the event store be on the write side?

Event sourcing is pitched as a bonus for a number of things, e.g. event history / audit trail, complete and consistent view regeneration, etc. Sounds great. I am a fan. But those are read-side implementation details, and you could accomplish the same by moving the event store completely to the read side as another subscriber.. so why not?
Here's some thoughts:
The views/denormalizers themselves don't care about an event store. They just handle events from the domain.
Moving the event store to the read side still gives you event history / audit
You can still regenerate your views from the event store. Except now it need not be a write model leak. Give him read model citizenship!
Here seems to be one technical argument for keeping it on the write side. This from Greg Young at http://codebetter.com/gregyoung/2010/02/20/why-use-event-sourcing/:
There are however some issues that exist with using something that is storing a snapshot of current state. The largest issue revolves around the fact that you have introduced two models to your data. You have an event model and a model representing current state.
The thing I find interesting about this is the term "snapshot", which more recently has become a distinguished term in event sourcing as well. Introducing an event store on the write side adds some overhead to loading aggregates. You can debate just how much overhead, but it's apparently a perceived or anticipated problem, since there is now the concept of loading aggregates from a snapshot and all events since the snapshot. So now we have... two models again. And not only that, but the snapshotting suggestions I've seen are intended to be implemented as an infrastructure leak, with a background process going over your entire data store to keep things performant.
And after a snapshot is taken, events before the snapshot become 100% useless from the write perspective, except... to rebuild the read side! That seems wrong.
Another performance related topic: file storage. Sometimes we need to attach large binary files to entities. Conceptually, sometimes these are associated with entities, but sometimes they ARE the entities. Putting these in the event store means you have to physically load that data each and every time you load the entity. That's bad enough, but imagine several or hundreds of these in a large aggregate. Every answer I have seen to this is to basically bite the bullet and pass a uri to the file. That is a cop-out, and undermines the distributed system.
Then there's maintenance. Rebuilding views requires a process involving the event store. So now every view maintenance task you ever write further binds your write model into using the event store.. forever.
Isn't the whole point of CQRS that the use cases around the read model and write model are fundamentally incompatible? So why should we put read model stuff on the write side, sacrificing flexibility and performance, and coupling them back up again. Why spend the time?
So all in all, I am confused. In all respects from where I sit, the event store makes more sense as a read model detail. You still achieve the many benefits of keeping an event store, but you don't over-abstract write side persistence, possibly reducing flexibility and performance. And you don't couple your read/write side back up by leaky abstractions and maintenance tasks.
So could someone please explain to me one or more compelling reasons to keep it on the write side? Or alternatively, why it should NOT go on the read side as a maintenance/reporting concern? Again, I'm not questioning the usefulness of the store. Just where it should go :)
This is a long dead question that someone pointed me to. There are quite a few reasons why its better to store events on the write side.
From my understanding the architecture you are talking about is a very common one that I see ... fail. We will store our domain model in a relational database then put out events. You add the twist of them saving the events on the read side in an event store. This will likely lead to a mess.
The first issue you will run into is in the publishing of your events. What happens when I save to the database and publish to say MSMQ (I die in the middle). So DTC gets introduced between them. This is a huge thing to bring in, distributed transactions should be avoided like the plague. It is also quite inefficient as I am probably making the data durable twice (once to queue once to database). This will limit system throughput by a lot (DTC benchmarks of 200-300 messages/second are common, with events only 20-30k/second is common).
Some work around the need for DTC by putting a table in their database that has the events and operates as a queue. This will avoid the need for DTC however this will still run into the next issue.
What happens when you have a bug? I know you would never write buggy code but one of the Jrs/maintenance developers later working with the project. As an example what happens when the domain object change and the event raised do not match? Say you set State on your domain object to "LA" (hardcoded) but you properly set State on the event to cmd.State ("CT").
How will you detect such errors are occurring? The biggest problem with what is being discussed is that there are now two sources of "truth" there is the database on the write side and the event stream coming out. There is no way to prove that they are equivalent. This will cause all sorts of weird bugs down the line.
I think this is really an excellent question. Treating your aggregate as a sequence of events is useful in its own right on the write side, making command retries and the like easier. But I agree that it seems upsetting to work to create your events, then have to make yet another model of your object for persistence if you need this snapshotting performance improvement.
A system where your aggregates only stored snapshots, but sent events to the read-model for projection into read models would I think be called "CQRS", just not "Event Sourcing". If you kept the events around for re-projection, I guess you'd have a system that was very much both.
But then wouldn't you have three definitions? One for persisting your aggregates, one for communicating state changes, and any number more for answering queries?
In such a system it would be tempting to start answering queries by loading your aggregates and asking them questions directly. While this isn't forbidden by any means, it does tend to start causing those aggregates to accrete functionality they might not otherwise need, not to mention complicating threading and transactions.
One reason for having the event store on the write-side might be for resolving concurrency issues before events become "facts" and get distributed/dispatched, e.g. through optimistic locking on committing to event streams. That way, on the write side you can make sure that concurrent "commits" to the same event stream (aggregate) are resolved, one of them gets through, the other one has to resolve the conflicts in a smart way through comparing events or propagating the conflict to the client, thus rejecting the command.

Is a context singleton okay in a single-user desktop application?

A single-user desktop application is unique in that you know the in-memory data is current. So rather than going through the pain of creating a new context for intermittent database operations then reattaching objects, would using just one context for the entire application session carry any risks (other than a multi-user requirement arising later)?
The context is 'transaction' based (i.e. for the commit). Therefore i would not make it a singleton.
I like this article: Singleton datacontext where it states that:
A DataContext is lightweight and is not expensive to create
and
You are probably saving a few 10s of milliseconds. The word micro optimisation springs to mind - in which case you probably shouldn't be using Entity Framework.
The only risk of using a single DataContext is growing the change log too large, AFAIK, and exhausting the main memory or loosing lots of changes the user made in case of a crash. I'm not sure the transaction behaviour is configurable.
But you'll have to manage thread synchronization (as with any shared data in a multi-threaded application), so maybe you're better off using a DataContext per data operation - e.g. opening a Form to edit users in the app should open it's own DataContext and commit it on save or close.

How does Scala's Lift manage state?

I'm quite impressed by what Lift 2.0 brings to the table with Actors and StatefulSnippets, etc, but I'm a little worried about the memory overhead of these things. My question is twofold:
How does Lift determine when to garbage collect state objects?
What does the memory footprint of a page request look like?
If a web crawler dances across the footprint of the site, are they going to be opening up enough state objects to drown out a modest VPS (512M)? The question is very obviously application dependent, but I'm curious if anyone has any real world figures they can throw out at me.
Lift stores state information in a session, so once the session is destroyed the state associated with that session goes away.
Within the session, Lift tracks each page that state is allocated for (e.g., mapping between an ajax button in the browser and a function on the server) and have a heart-beat from the browser. Functions for pages that have not seen the heartbeat in 10 minutes are unreferenced so the JVM can garbage collection them. All of this is tunable, so you can change heart-beat frequency, function lifespan, etc., but in practice the defaults work quite well.
In terms of session explosion, yeah... that's a minor issue. Popular sites (including http://demo.liftweb.net/ ) experience it. The example code (see http://github.com/lift/lift/tree/master/examples/example/ ) detects sessions that were created by a single request and then abandoned and expires those early. I'm running demo.liftweb.net with 256MB of heap size (that'd fit in a 512MB VPS) and occasionally, the session count rises over 1,000, but that's quickly tamped down for search engine traffic.
I think the question about memory footprint was once answered somewhere on the mailing list, but I can’t find it at the moment.
Garbage collection is done after some idle time. There is, however, an example on the wiki which uses some better heuristics to kill off sessions spawned by web crawlers.
Of course, for your own project it makes sense to check memory consumption with something like VisualVM while spawning a couple of sessions yourself.