Immutability and shared references - how to reconcile? - scala

Consider this simplified application domain:
Criminal Investigative database
Person is anyone involved in an investigation
Report is a bit of info that is part of an investigation
A Report references a primary Person (the subject of an investigation)
A Report has accomplices who are secondarily related (and could certainly be primary in other investigations or reports).
These classes have ids that are used to store them in a database, since their info can change over time (e.g. we might find new aliases for a person, or add persons of interest to a report)
Domain http://yuml.me/13fc6da0
If these are stored in some sort of database and I wish to use immutable objects, there seems to be an issue regarding state and referencing.
Suppose I change some metadata about a Person. Since my Person objects are immutable, I might have some code like:
class Person(
    val id: UUID,
    val aliases: List[String],
    val reports: List[Report]) {
  def addAlias(name: String) = new Person(id, name :: aliases, reports)
}
So that my Person with a new alias becomes a new object, also immutable. If a Report refers to that person, but the alias was changed elsewhere in the system, my Report now refers to the "old" person, i.e. the person without the new alias.
Similarly, I might have:
class Report(val id: UUID, val content: String) {
  /** Adding more info to our report */
  def updateContent(newContent: String) = new Report(id, newContent)
}
Since these objects don't know who refers to them, it's not clear to me how to let all the "referrers" know that there is a new object available representing the most recent state.
This could be done by having all objects "refresh" from a central data store, with every operation that creates a new, updated object storing it back to that central store, but this feels like a cheesy reimplementation of the underlying language's referencing; i.e., it would be clearer to just make these "secondary storable objects" mutable. Then, if I add an alias to a Person, all referrers see the new value without doing anything.
How is this dealt with when we want to avoid mutability, or is this a case where immutability is not helpful?

If X refers to Y, both are immutable, and Y changes (i.e. you replace it with an updated copy), then you have no choice but to replace X also (because it has changed, since the new X points to the new Y, not the old one).
This rapidly becomes a headache to maintain in highly interconnected data structures. You have three general approaches.
Forget immutability in general. Make the links mutable. Fix them as needed. Be sure you really do fix them, or you might get a memory leak (X refers to old Y, which refers to old X, which refers to older Y, etc.).
Don't store direct links, but rather ID codes that you can look up (e.g. a key into a hash map). You then need to handle the lookup-failure case, but otherwise things are pretty robust. This is a little slower than the direct link, of course; a sketch of this approach follows at the end of this answer.
Change the entire world. If something is changed, everything that links to it must also be changed (and performing this operation simultaneously across a complex data set is tricky, but theoretically possible, or at least the mutable aspects of it can be hidden e.g. with lots of lazy vals).
Which is preferable depends on your rate of lookups and updates, I expect.
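A minimal sketch of the second approach, assuming a single in-memory map keyed by UUID (the PersonStore name, its methods, and the switch from direct Report references to reportIds are illustrative assumptions, not part of the question):

import java.util.UUID

case class Person(id: UUID, aliases: List[String], reportIds: List[UUID]) {
  def addAlias(name: String): Person = copy(aliases = name :: aliases)
}

case class Report(id: UUID, content: String, primaryId: UUID, accompliceIds: List[UUID])

// Referrers hold only UUIDs and resolve them on demand, so they always see the
// latest copy of a Person without having to be rewritten themselves.
class PersonStore {
  private var persons = Map.empty[UUID, Person]

  def put(p: Person): Unit = persons += (p.id -> p)

  // The lookup can fail, which is the extra case this approach forces you to handle.
  def get(id: UUID): Option[Person] = persons.get(id)
}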

I suggest you read how people deal with this problem in Clojure and Akka. Read about software transactional memory. And here are some of my thoughts...
Immutability does not exist for its own sake. Immutability is an abstraction; it does not "exist" in nature. The world is mutable and permanently changing, so it's quite natural for data structures to be mutable - they describe the state of a real or simulated object at a given moment in time. It looks like OOP rules here. At the conceptual level, the problem with this attitude is that the object in RAM != the real object - the data can be inaccurate, it arrives with a delay, etc.
So for the most trivial requirements you can go with everything mutable - persons, reports, etc. Practical problems will arise when:
data structures are modified from concurrent threads
users provide conflicting changes for the same objects
a user provides invalid data and it must be rolled back
With a naive mutable model you will quickly end up with inconsistent data and a crashing system. Mutability is error-prone, but full immutability is impossible. What you need is a transactional view of the world: within a transaction the program sees an immutable world, and the STM manages changes so they are applied in a consistent and thread-safe way.
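As a rough illustration of that transactional view in Scala, here is a minimal sketch using the ScalaSTM library (scala.concurrent.stm); the PersonStore wrapper and its methods are assumptions for illustration, not part of the question:

import java.util.UUID
import scala.concurrent.stm._

case class Person(id: UUID, aliases: List[String]) {
  def addAlias(name: String): Person = copy(aliases = name :: aliases)
}

class PersonStore {
  // A single transactional reference to the whole map of persons.
  private val persons = Ref(Map.empty[UUID, Person])

  // Each update runs in a transaction: conflicting concurrent writes are retried,
  // and readers always observe a consistent snapshot.
  def addAlias(id: UUID, alias: String): Unit = atomic { implicit txn =>
    persons().get(id).foreach { p =>
      persons() = persons() + (id -> p.addAlias(alias))
    }
  }

  def get(id: UUID): Option[Person] = persons.single().get(id)
}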

I think you are trying to square the circle. Person is immutable, the list of Reports on a Person is part of the Person, and the list of Reports can change.
Would it be possible for an immutable Person to have a reference to a mutable PersonRecord that keeps things like Reports and Aliases?
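A minimal sketch of that idea, with the PersonRecord name and its fields invented for illustration; the Person's identity stays immutable while its changeable state lives behind one shared, mutable record:

import java.util.UUID
import scala.collection.mutable.ListBuffer

class Report(val id: UUID, val content: String)

// Mutable holder for the parts of a person that change over time.
class PersonRecord {
  val aliases: ListBuffer[String] = ListBuffer.empty
  val reports: ListBuffer[Report] = ListBuffer.empty
}

// The Person itself never changes; everyone holding it sees updates made to its record.
class Person(val id: UUID, val record: PersonRecord)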


DDD can event handler construct value object for aggregate

Can I construct a value object in the event handler or should I pass the parameters to the aggregate to construct the value object itself? Seller is the aggregate and Offer is the value object. Would it be better to pass the value object in the event?
public async Task HandleAsync(OfferCreatedEvent domainEvent)
{
    var seller = await this.sellerRepository.GetByIdAsync(domainEvent.SellerId);
    var offer = new Offer(domainEvent.BuyerId, domainEvent.ProductId, seller.Id);
    seller.AddOffer(offer);
}
should I pass the parameters to the aggregate to construct the value object itself?
You should probably default to passing the assembled value object to the domain entity / root entity.
The supporting argument is that we want to avoid polluting our domain logic with plumbing concerns. Expressed another way, new is not a domain concept, so we'd like that expression to live "somewhere else".
Note that by passing the value to the domain logic, you protect that logic from changes to the construction of the values; for instance, how much code has to change if you later discover that there should be a fourth constructor argument?
That said, I'd consider this to be a guideline - in cases where you discover that violating the guideline offers significant benefits, you should violate the guideline without guilt.
Would it be better to pass the value object in the event?
Maybe? Let's try a little bit of refactoring....
// WARNING: untested code ahead
public async Task HandleAsync(OfferCreatedEvent domainEvent)
{
    var seller = await this.sellerRepository.GetByIdAsync(domainEvent.SellerId);
    Handle(domainEvent, seller);
}

static void Handle(OfferCreatedEvent domainEvent, Seller seller)
{
    var offer = new Offer(domainEvent.BuyerId, domainEvent.ProductId, seller.Id);
    seller.AddOffer(offer);
}
Note the shift - where HandleAsync needs to be aware of async/await constructs, Handle is just a single threaded procedure that manipulates two local memory references. What that procedure does is copy information from the OfferCreatedEvent to the Seller entity.
The fact that Handle here can be static, and has no dependencies on the async shell, suggests that it could be moved to another place; another hint being that the implementation of Handle requires a dependency (Offer) that is absent from HandleAsync.
Now, within Handle, what we are "really" doing is copying information from OfferCreatedEvent to Seller. We might reasonably choose:
seller.AddOffer(domainEvent);
seller.AddOffer(domainEvent.offer());
seller.AddOffer(new Offer(domainEvent));
seller.AddOffer(new Offer(domainEvent.BuyerId, domainEvent.ProductId, seller.Id));
seller.AddOffer(domainEvent.BuyerId, domainEvent.ProductId, seller.Id);
These are all "fine" in the sense that we can get the machine to do the right thing using any of them. The tradeoffs are largely related to where we want to work with the information in detail, and where we prefer to work with the information as an abstraction.
In the common case, I would expect that we'd use abstractions for our domain logic (therefore: Seller.AddOffer(Offer)) and keep the details of how the information is copied "somewhere else".
The OfferCreatedEvent -> Offer function can sensibly live in a number of different places, depending on which parts of the design we think are most stable, how much generality we can justify, and so on.
Sometimes, you have to do a bit of war gaming: which design is going to be easiest to adapt if the most likely requirements change happens?
I would also advocate for passing an already assembled value object to the aggregate in this situation. In addition to the reasons already mentioned by @VoiceOfUnreason, this also fits more naturally with the domain language. Also, when reading code and method APIs you can then focus on domain concepts (like an offer) without being distracted by details until you really need to know them.
This becomes even more important if you need to pass in more than one value object (or entity). Passing in all the values required for construction as individual parameters instead not only makes the API less resilient to refactoring but also burdens the reader with more details.
The seller is receiving an offer.
Assuming this is what is meant here, it fits better than something like the following:
The seller receives some buyer id, product id, etc.
This most probably would not be found in conversations using the ubiquitous language. In my opinion, code should be as readable as possible and express the behaviour and business logic as close to human language as possible: you compile code for machines to execute, but you write it for humans to understand.
Note: I would even consider using factory methods on value objects in certain cases to unburden the client code of knowing what else might be needed to assemble a valid value object, for instance, if there are different valid combinations and ways of constructing the same value object, where some values need reasonable defaults or are chosen by the value object itself. In more complex situations a separate factory might even make sense.
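As a rough, language-agnostic illustration of such a factory method (sketched in Scala here; the Offer fields, the OfferCreatedEvent shape, and the default currency are all invented):

case class OfferCreatedEvent(buyerId: String, productId: String, sellerId: String)

case class Offer(buyerId: String, productId: String, sellerId: String, currency: String)

object Offer {
  // The factory method hides which defaults a valid Offer needs,
  // so callers only deal with the domain event itself.
  def from(event: OfferCreatedEvent): Offer =
    Offer(event.buyerId, event.productId, event.sellerId, currency = "EUR")
}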

Scala [2.11.6] -- is there an elegant way to create cache-keys for objects based on an Int/Long and the full class name?

I want to use a cache to hold recently accessed objects that just came from a database read.
The database primary key, in my case, will be a Long.
In each case I'll have an Object (Case Class) that represents this data.
The combination of the Long plus the full class name will be a unique identifier for finding any specific object. (The namespace should never have any conflicts, as class names do not use numbers (as a rule?). In any case, for this use case I control the entire namespace, so it's not a huge concern.)
The objects will be relatively short-lived in the cache - I just see a few situations where I can save memory by sharing a single immutable Object instead of holding several different instances of it, which would be extremely difficult to avoid by "passing everything everywhere".
This also would help performance in situations where different eyeballs are checking out the same stuff but this is not the driver for this particular use case (just gravy).
My concern is that now, every time I need a given object, I'll need to recreate the cache key. This will involve a Long.toString and a string concatenation. The case classes in question have a val in their companion object so that they know their class name without any further reflection occurring.
I'm thinking of putting a "cache" together in the companion object for the main cache keys as I wish to avoid the (needless?) repeat ops per lookup as well as the resultant garbage collection etc. (The fastest code to run is the code that never gets written (or called) - right?)
Is there a more elegant way to handle this? Has someone else already solved this specific problem?
I thought of writing a key class, but even with a val (lazy or otherwise) for the hash and toString, I still take a hit for each and every object I ask for, as I now have to create the key object each time. (That could of course go back into the companion-object key cache, but if I go to the trouble of setting up that companion-object cache for keys, the key-object approach is redundant.)
As a secondary ask of this question - assuming I use a Long and a full class name (as a String) which is most likely to get the quickest pull for the cache?
Long.toString + fullClassName
or
fullClassName + Long.toString
The Long is a string in the key, so assuming the cache does a string "find", which would be faster to look up: the numeric portion first, or the class name first?
Numbers first means you wade through ALL the objects with matching numbers searching for the matching class, whereas class first means you find the block of a particular class first but then have to go to the very end of the string to find the exact match.
I suspect the former might be more easily optimized for a "fast find" (I know in MySQL terms it would be...)
Then again perhaps someone already has a dual-key lookup based cache? :)
I would keep it extremely simple until you have concrete performance metrics to the contrary. Something like:
trait Key {
  def id: Long
  lazy val key: String = s"${getClass.getName}-${id}"
}

case class MyRecordObject(id: Long, ...) extends Key
Use a simple existing caching solution like Guava Caching.
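A minimal sketch of how the Key trait above might be paired with a Guava cache; the maximum size, the 5-second expiry, and the getOrLoad helper are assumptions for illustration:

import java.util.concurrent.{Callable, TimeUnit}
import com.google.common.cache.CacheBuilder

object RecordCache {
  private val cache =
    CacheBuilder.newBuilder()
      .maximumSize(10000)
      .expireAfterWrite(5, TimeUnit.SECONDS)
      .build[String, AnyRef]()

  // Look the object up by its generated key, falling back to `load` (e.g. a database read) on a miss.
  def getOrLoad[A <: AnyRef](k: Key)(load: => A): A =
    cache.get(k.key, new Callable[AnyRef] { def call(): AnyRef = load }).asInstanceOf[A]
}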
To your secondary question, I would not worry about the performance of generating a key at all until you could actually prove key generation is a bottleneck (which I kind of doubt it ever would be).
import play.api.cache.Cache
It turns out that Cache.getOrElse[T](idAsString, seconds) actually does most of the heavy lifting!
[T] is of course a type in Scala and that is enough to keep things separated in the cache. Each [T] is a unique, separate and distinct bucket in the cache.
So Cache.getOrElse[AUser](10, 5) will get a completely different object from Cache.getOrElse[ALog](10, 5) (where the ID of 10 just happens to be the same for the purpose of illustration here).
I'm currently doing this with thousands of objects across hundreds of types so I know it works...
I say most of the work because the Long has to be .toString'ed before it can be used as a key. It's not a complete GC disaster, as I simply set up a Map to hold the most commonly/recently .toString'ed Long values.
For those of you who simply don't see the value of this, consider a simple log screen, which is very common in most web applications.
2015/10/22 10:22 - Johnny Rotten - deleted an important file
2015/10/22 10:22 - Johnny Rotten - deleted another important file
2015/10/22 10:22 - Johnny Rotten - looked up another user
2015/10/22 10:22 - Johnny Rotten - added a bogus file
2015/10/22 10:22 - Johnny Rotten - insulted his boss
Under Java (Tomcat) there would typically be a single Object that represented that user (Johnny Rotten), and that single Object would be referenced each and every time the name of that user appeared in the log display.
Now under Scala we tend to create a new instance (Case Class) for each and every line of the log entry simply because we have no (efficient/plumbing) way of getting to the last used instance of that Case Class. The Log itself tends to be a case class and it has a lazy val of the User Case Class.
So along comes user X, who looks up a log and sets the pagination to 500 lines, and lo and behold we now have 500 case classes being created simply to display a user's name (the "who" in each log entry).
And then a few seconds later we have yet another 500 User Case Classes when they hit refresh because they didn't think they clicked the mouse right the first time...
With a simple cache however that holds a recently accessed object for all of say 5 seconds, all we create for the entire 500 log entries is a single instance of a User Case Class for each unique name we display in the log.
In Scala case classes are immutable, so a single shared instance is perfectly acceptable here and the GC has no needless work to do...
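For illustration, a minimal sketch of that pattern against Play 2.4's (since-deprecated) static cache API; the User case class and the findUserById database read are hypothetical stand-ins:

import play.api.Play.current
import play.api.cache.Cache

case class User(id: Long, name: String)

object UserLookup {
  def findUserById(id: Long): User = ???   // stands in for the actual database read

  // Each unique user hits the database at most once per 5-second window,
  // no matter how many log lines reference them.
  def cachedUser(id: Long): User =
    Cache.getOrElse[User](id.toString, expiration = 5) {
      findUserById(id)
    }
}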

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some general categories of situations:
Your operation has only one database save (saving data into one document)
Your operation has two database saves (updates, inserts, or deletions), A and B
They are independent
B is required for A to be valid
They are interdependent (A is required for B to be valid, and B is required for A to be valid)
Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine and non-diesel gas and to have a non-diesel engine and diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that keeps the whole thing valid at all points: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
If you can get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure: find all documents that have "towing" where the document with the ID in that property doesn't have its "towedBy" set back to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
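As a rough illustration of that check, written here as a pure Scala function over plain case classes standing in for the MongoDB documents (the field and type names are invented):

case class CarDoc(id: String, towing: Option[String] = None, towedBy: Option[String] = None)

object TowLinkCheck {
  // Returns every car whose "towing" link is not mirrored by the other car's "towedBy".
  def findBrokenTowLinks(cars: Seq[CarDoc]): Seq[CarDoc] = {
    val byId = cars.map(c => c.id -> c).toMap
    cars.filter { car =>
      car.towing.exists { otherId =>
        !byId.get(otherId).exists(_.towedBy.contains(car.id))
      }
    }
  }
}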
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Database Design Dilemma

I am creating a simple DB application for reports. According to DB design theory, you should never store the same information twice. This makes sense for most DB applications, but I need something where you can select a generic topic and then either keep the new instance copy of that topic untouched or change its information. The generic topic should not be modified by modifying the instance copy, but the relationship between the original topic and the instance copy needs to be tracked.
Confusing, I know. Here is a diagram that may help:
I need the report to be immutable or mutable based on the situation.
A quick example: you select a customer, then you finish your report. A month later the customer's phone number changes, so you update the customer portion of the DB, but you do not want to pull up a finished report and have the new information appear in the already completed report.
What would be the most elegant solution to this scenario?
This may work:
But by utilizing this approach I would find myself using looping statements and if statements to identify the relationships between Generic, Checked Off and Report.
for (NSManagedObject *managedObject in checkedOffTaskObjects) {
    if ([[reportObject valueForKeyPath:@"tasks"] containsObject:managedObject]) {
        if ([[managedObject valueForKeyPath:@"tasks"] containsObject:genericTaskObjectAtIndexPath]) {
            cell.backgroundView = [[[UIImageView alloc] initWithImage:[UIImage imageNamed:@"cellbackground.png"]] autorelease];
        }
    }
}
I know a better solution exists, but I cannot see it.
Thank you for your time.
It's tricky to be very precise without knowing much about what exactly you're modelling, but here goes...
As you've noted, there are at least two strategies to get the "mutable instance copies of a prototype" functionality you want:
1) When creating an instance based on a prototype, completely copy the instance data from the prototype. No link between them thereafter.
PRO: faster access to the instance data with less logic involved.
CON 1: Any update to your prototype will not make it into the instances. e.g. if you have the address of a company wrong in the prototype.
CON 2: you're duplicating database data -- to a certain extent -- wasteful if you have huge records.
2) When creating an instance based on a prototype, store a reference to the 'parent' record, i.e. the prototype, and then only store updated fields in the actual instance.
PRO 1: Updates to prototype get reflected in all instances.
PRO 2: More efficient use of storage space (less duplication of data)
CON: more logic around pulling an instance from the database (a sketch of this strategy follows below).
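A minimal sketch of strategy 2 (in Scala, since the idea is platform-agnostic at this level; the Customer fields and the override map are invented for illustration):

// The prototype record, shared by every report that references it.
case class Customer(id: Long, name: String, phone: String)

// An instance copy stores only the fields it has overridden plus a link to its prototype,
// so prototype updates flow through unless the report has "frozen" a field.
case class CustomerInstance(prototype: Customer, overrides: Map[String, String] = Map.empty) {
  def name: String  = overrides.getOrElse("name", prototype.name)
  def phone: String = overrides.getOrElse("phone", prototype.phone)

  // Capture the current prototype phone so later prototype changes no longer affect this copy.
  def freezePhone: CustomerInstance = copy(overrides = overrides + ("phone" -> prototype.phone))
}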
In summary: there's not any magical solution I can think of that gets you the best of both of these worlds. They're both valid strategies, depending on your exact problem and constraints (runtime speed versus storage size, for example).
If you go for 2), I certainly don't think it's a disaster -- particularly if you model things well and find the most efficient way to structure things in Core Data.

What is the most practical Solution to Data Management using SQLite on the iPhone?

I'm developing an iPhone application and am new to Objective-C as well as SQLite. That being said, I have been struggling w/ designing a practical data management solution that is worthy of existing. Any help would be greatly appreciated.
Here's the deal:
The majority of the data my application interacts with is stored in five tables in the local SQLite database. Each table has a corresponding Class which handles initialization, hydration, dehydration, deletion, etc. for each object/row in the corresponding table. Whenever the application loads, it populates five NSMutableArrays (one for each type of object). In addition to a Primary Key, each object instance always has an ID attribute available, regardless of hydration state. In most cases it is a UUID which I can then easily reference.
Until a few days ago, I would simply access the objects via these arrays by tracking down their UUID. I would then proceed to hydrate/dehydrate them as I needed. However, some of the objects I have also maintain their own arrays which reference other objects' UUIDs. In the event that I must track down one of these "child" objects via its UUID, it becomes a bit more difficult.
In order to avoid having to enumerate through one of the previously mentioned arrays to find a "parent" object's UUID, and then proceed to find the "child's" UUID, I added a DataController w/ a singleton instance to simplify the process.
I had hoped that the DataController could provide a single access point to the local database and make things easier, but I'm not so certain that is the case. Basically, what I did is create multiple NSMutableDictionaries. Whenever the DataController is initialized, it enumerates through each of the previously mentioned NSMutableArrays maintained in the Application Delegate and creates a key/value pair in the corresponding dictionary, using the given object as the value and its UUID as the key.
The DataController then exposes procedures that allow a client to call in w/ a desired object's UUID to retrieve a reference to the actual object. Whenever there is a request for an object, the DataController automatically hydrates the object in question and then returns it. I did this because I wanted to take control of hydration out of the client's hands, to prevent dehydrating an object that is still being referenced elsewhere.
I realize that in most cases I could just make a mutable copy of the object and then, if necessary, replace the original object down the road, but I wanted to avoid that scenario if at all possible. I therefore added an additional dictionary to monitor which objects are hydrated at any given time, using the object's UUID as the key and a fluctuating count representing the number of hydrations w/out an offsetting dehydration. My goal w/ this approach was to have the DataController automatically dehydrate any object once its "hydration retainment count" hit zero, but this could easily lead to significant memory leaks, as it currently relies on the caller to later call a procedure that decreases the hydration retainment count of the object. There are obviously many cases where this is just not obvious or maybe not even easily accomplished, and if only one calling object fails to do so properly, I encounter the exact opposite scenario to the one I was trying to prevent in the first place. Ironic, huh?
Anyway, I'm thinking that if I proceed w/ this approach it will just end badly. I'm tempted to go back to the original plan, but doing so makes me want to cringe, and I'm sure there is a more elegant solution floating around out there. As I said before, any advice would be greatly appreciated. Thanks in advance.
I'd also be aware (as I'm sure you are) that CoreData is just around the corner, and make sure you make the right choice for the future.
Have you considered implementing this via the NSCoder interface? Not sure that it wouldn't be more trouble than it's worth, but if what you want is to extract all the data out into an in-memory object graph, and save it back later, that might be appropriate. If you're actually using SQL queries to limit the amount of in-memory data, then obviously, this wouldn't be the way to do it.
I decided to go w/ Core Data after all.