I am creating a simple DB application for reports. According to DB design theory, you should never store the same information twice, and that makes sense for most DB applications, but I need something different: you select a generic topic, which creates an instance copy of it. You can keep the instance copy untouched or change its information, but modifying the instance copy must never modify the generic topic, and the relationship between the original topic and the instance copy needs to be tracked.
Confusing, I know. In short: I need the report to be immutable or mutable depending on the situation.
A quick example: you select a customer, then you finish your report. A month later the customer's phone number changes, so you update the customer portion of the DB, but you do not want to pull up a finished report and have the new information appear in the already-completed report.
What would be the most elegant solution to this scenario?
I have a schema that may work, but with that approach I find myself using loops and if statements to identify the relationships between Generic, Checked Off, and Report:
for (NSManagedObject *managedObject in checkedOffTaskObjects) {
    if ([[reportObject valueForKeyPath:@"tasks"] containsObject:managedObject]) {
        if ([[managedObject valueForKeyPath:@"tasks"] containsObject:genericTaskObjectAtIndexPath]) {
            cell.backgroundView = [[[UIImageView alloc] initWithImage:[UIImage imageNamed:@"cellbackground.png"]] autorelease];
        }
    }
}
I know a better solution exists, but I cannot see it.
Thank you for your time.
It's tricky to be very precise without knowing much about what exactly you're modelling, but here goes...
As you've noted, there are at least two strategies to get the "mutable instance copies of a prototype" functionality you want:
1) When creating an instance based on a prototype, completely copy the instance data from the prototype. No link between them thereafter.
PRO: faster access to the instance data with less logic involved.
CON 1: Any update to your prototype will not make it into the instances. e.g. if you have the address of a company wrong in the prototype.
CON 2: you're duplicating database data, which is wasteful to a certain extent if you have huge records.
2) When creating an instance based on a prototype, store a reference to the 'parent' record, i.e. the prototype, and then store only updated fields in the actual instance (see the sketch after this list).
PRO 1: Updates to prototype get reflected in all instances.
PRO 2: More efficient use of storage space (less duplication of data)
CON: more logic around pulling an instance from the database.
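To make strategy 2 concrete, here is a minimal sketch (hypothetical types, with fields simplified to string maps): the instance stores only its prototype's ID plus the fields it has overridden, and merges the two at read time.

using System;
using System.Collections.Generic;

class Topic                           // the shared prototype (the "generic topic")
{
    public Guid Id = Guid.NewGuid();
    public Dictionary<string, string> Fields = new Dictionary<string, string>();
}

class TopicInstance                   // the per-report copy
{
    public Guid PrototypeId;          // the link back to the prototype
    public Dictionary<string, string> Overrides = new Dictionary<string, string>();

    // Read: an override wins; otherwise fall through to the prototype's value.
    public string Get(Topic prototype, string field)
    {
        string value;
        return Overrides.TryGetValue(field, out value) ? value : prototype.Fields[field];
    }

    // Write: touches only this instance, never the prototype.
    public void Set(string field, string value) { Overrides[field] = value; }
}

PRO 1 and the CON both fall out directly: a field that was never overridden always reflects the latest prototype value, at the cost of a merge on every read.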
In summary: there's no magical solution I can think of that gets you the best of both worlds. They're both valid strategies, depending on your exact problem and constraints (runtime speed versus storage size, for example).
If you go for 2), I certainly don't think it's a disaster, particularly if you model things well and work out the most efficient way to structure things in Core Data.
Related
Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data that belongs together in a single document, which improves read speed and data locality and helps ensure consistency. Such a data model is also easier to scale, because a single read hits only a single server instead of having to collect data from multiple sources.
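To make the invoice example concrete, here is a minimal sketch using the MongoDB C# driver (database, collection, and field names are invented). Because the invoice and its line items live in one document, the single insert below is atomic: there is no moment where items exist without their invoice.

using MongoDB.Bson;
using MongoDB.Driver;

var invoices = new MongoClient("mongodb://localhost:27017")
    .GetDatabase("shop")
    .GetCollection<BsonDocument>("invoices");

// One atomic write covers the whole aggregate, and one read returns it.
invoices.InsertOne(new BsonDocument
{
    { "invoiceNo", 1042 },
    { "customer", "ACME Corp." },
    { "items", new BsonArray
        {
            new BsonDocument { { "sku", "A-1" }, { "qty", 3 }, { "price", 9.99 } },
            new BsonDocument { { "sku", "B-7" }, { "qty", 1 }, { "price", 4.50 } }
        }
    }
});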
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits, or choose a completely different concurrency approach such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs; rather, it exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B:
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do A first then B, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents (let's ignore the possibility of putting the engine inside the car document for this example). It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas, so both have to change. But it may be valid to have no engine at all and have diesel gas. So you can add a step that keeps the whole thing valid at every point: first remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car. A sketch of that reordering follows.
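Here is what that reordering could look like with the C# driver (database, collection, and field names are invented, and carId/engineId stand in for real document IDs). Each step leaves the data in a state the application treats as valid:

using MongoDB.Bson;
using MongoDB.Driver;

var db = new MongoClient("mongodb://localhost:27017").GetDatabase("garage");
var cars = db.GetCollection<BsonDocument>("cars");
var engines = db.GetCollection<BsonDocument>("engines");

ObjectId carId = ObjectId.Empty;      // stand-in for the real car's _id
ObjectId engineId = ObjectId.Empty;   // stand-in for the real engine's _id

var car = Builders<BsonDocument>.Filter.Eq("_id", carId);
var engine = Builders<BsonDocument>.Filter.Eq("_id", engineId);

// 1. Detach the engine: "no engine + non-diesel gas" is a valid state.
cars.UpdateOne(car, Builders<BsonDocument>.Update.Unset("engineId"));

// 2. Switch the gas: "no engine + diesel gas" is also valid.
cars.UpdateOne(car, Builders<BsonDocument>.Update.Set("gasType", "diesel"));

// 3. Convert the engine document itself.
engines.UpdateOne(engine, Builders<BsonDocument>.Update.Set("type", "diesel"));

// 4. Re-attach it: the target "diesel engine + diesel gas" state.
cars.UpdateOne(car, Builders<BsonDocument>.Update.Set("engineId", engineId));

A crash between any two steps leaves the data incomplete but never contradictory, which is exactly the 2.2 property.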
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property but the other document doesn't have a corresponding "towedBy" property, so this is something to check after a catastrophic failure: find all documents that have "towing" where the document with the ID in that property doesn't have its "towedBy" set to the right ID. The choices then are to delete the "towing" property or to set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application. A sketch of such a sweep:
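using MongoDB.Bson;
using MongoDB.Driver;

var cars = new MongoClient("mongodb://localhost:27017")
    .GetDatabase("garage").GetCollection<BsonDocument>("cars"); // names invented

// Every car that claims to be towing something...
foreach (var tower in cars.Find(Builders<BsonDocument>.Filter.Exists("towing")).ToList())
{
    var towedId = tower["towing"];
    var towed = cars.Find(Builders<BsonDocument>.Filter.Eq("_id", towedId)).FirstOrDefault();

    // ...is corrupt if the other side doesn't point back at it.
    if (towed == null || towed.GetValue("towedBy", BsonNull.Value) != tower["_id"])
    {
        // One of the two equally valid repairs: drop "towing".
        // (The other would be to set the missing "towedBy" instead.)
        cars.UpdateOne(Builders<BsonDocument>.Filter.Eq("_id", tower["_id"]),
                       Builders<BsonDocument>.Update.Unset("towing"));
    }
}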
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.
I need to eager-load a hierarchy structure so that I can recursively iterate through it. The eager loading is necessary to prevent multiple DB queries while traversing the tree. It seems the consensus is that you can't eager-load infinite levels of the tree, so I did something like
var item = db.ItemHierarchies
             .Include("Children.Children.Children.Children.Children")
             .Where(x => x.condition == condition);
to load 5 levels of children. This seems to get the job done. I'm wondering: what is the drawback to doing this? If there is none, then theoretically could I add 50 levels of Includes here without slowing things down?
I recommend taking a look at the SQL that is generated as you add eager loading to your query.
var item = db.ItemHierarchies
             .Include("Children")
             .Include("Children.Children")
             .Include("Children.Children.Children")
             .Include("Children.Children.Children.Children")
             .Include("Children.Children.Children.Children.Children");

var sql = ((System.Data.Objects.ObjectQuery)item).ToTraceString();
// http://visualstudiomagazine.com/blogs/tool-tracker/2011/11/seeing-the-sql.aspx
You'll see that the SQL quickly gets very big and complicated and can potentially have serious performance implications. You'd do well to limit your eager loading to data that you are certain you will need and to consider using explicit loading for some of the related entities - especially if you're working with connected entities in which case you can explicitly load collection properties when they're needed.
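For example, with the DbContext explicit-loading API you only pay for the levels the traversal actually reaches (a sketch; the IsActive property in the filtered variant is invented for illustration):

// Load just the roots, with no Includes at all.
var roots = db.ItemHierarchies.Where(x => x.condition == condition).ToList();

foreach (var root in roots)
{
    // Pull in one level of children on demand.
    db.Entry(root).Collection(r => r.Children).Load();

    // Or filter the related rows on the server before loading them:
    db.Entry(root).Collection(r => r.Children)
      .Query()
      .Where(c => c.IsActive)
      .Load();
}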
Also note that you may not need multiple separate Includes. For example, the following needs two separate Includes because they address separate properties (Widgets and Spanners) of the root:
var item = db.ItemHierarchies
             .Include("Widgets")
             .Include("Spanners.Flanges");
But the following isn't necessary:
var item = db.ItemHierarchies
             .Include("Widgets")          // this Include isn't necessary...
             .Include("Widgets.Flanges"); // ...because this one loads both Widgets and Flanges.
Well, honestly... it's an extremely bad practice.
Let's assume you had 50 objects in your root, and 50 per level.
You may end up retrieving 312,500,000 "capsules" of information (the fifth level alone is 50^5 = 312,500,000 items).
Now, one might ask: "So what is wrong with that?!" I mean, if that is what is required, then why not do it?
Rule #1: we develop software that should be used by human beings.
And the fact is that no human is capable of taking in 312,500,000 items of information at a glance and learning or concluding anything beneficial from it (except, perhaps, that staring at it doesn't help).
Rule #2: UI should be based on what is needed and not what is possible.
And since we already established that showing 312,500,000 capsules of data is not needed, there is no reason to bring it all at once.
And now you might come forward and say: "But I don't care about the UI, really! All I need is to iterate over that data in order to process some information!"
In that case you would probably want to save your results somewhere for future reference, which means it's a batch job, so why not apply batch-job rules to it: process it item by item (or page by page), which may also give you the benefit of splitting the work across even more machines if needed. A sketch of that shape follows.
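Here is one way that batch shape could look against the question's model (ParentId, rootId, and Process are invented stand-ins): page through a level in fixed-size slices instead of materializing the whole tree.

const int pageSize = 500;
int page = 0;
List<ItemHierarchy> batch;

do
{
    // Fetch one fixed-size slice; memory use stays flat no matter
    // how big the hierarchy is.
    batch = db.ItemHierarchies
              .Where(x => x.ParentId == rootId)
              .OrderBy(x => x.Id)            // a stable order makes paging deterministic
              .Skip(page * pageSize)
              .Take(pageSize)
              .ToList();

    foreach (var item in batch)
        Process(item);                       // the per-item work of the batch job

    page++;
} while (batch.Count == pageSize);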
So you see, no matter which path you choose, there is no reason to do it. (Which is pretty much the definition of a bad practice.)
Update:
After reading interesting concerns in the comments, I would like to update this answer with more analysis:
Deciding what is a bad practice must always be done in reference to what is to be achieved, or to the role of each part in the system. In the current situation (after reading the comments), it has been stated or implied that the data storage is actually a persistence medium for objects, as opposed to a design where the data is the "heart" of the application.
We can define two kinds of data:
1) Data-center data, which is used in data-centric applications such as banks, CRM, ERP, websites, or other service-based solutions.
vs.
2) Data-persistence-medium data, which is saved for when the application is not active, for example: any simple app save file or game save file, etc.
The main difference is that a data-persistence medium is accessed by only a single instance of the app at a single point in time, meaning the data is not designed to be shared by many instances. If the data is to be shared, we are dealing with a data-center application.
If your app just needs a data-persistence medium, loading all the information cannot be considered a bad practice, but you still need to make sure you are not blowing up memory. In that frame of work, SQL Server might not be what you need or the best tool to use.
In the other case, that of a data-centric application, my original answer stands: it is a bad practice to bring all the information into each instance of the application.
In our multi-tier business application we have ObservableCollections of Self-Tracking Entities that are returned from service calls.
The idea is we want to be able to get entities, add, update and remove them from the collection client side, and then send these changes to the server side, where they will be persisted to the database.
Self-Tracking Entities, as their name might suggest, track their state themselves.
When a new STE is created it has the Added state; when you modify a property, it sets the Modified state. It can also have the Deleted state, but this state is not set when the entity is removed from an ObservableCollection (obviously). If you want this behavior, you need to code it yourself.
In my current implementation, when an entity is removed from the ObservableCollection, I keep it in a shadow collection, so that when the ObservableCollection is sent back to the server, I can send the deleted items along, so Entity Framework knows to delete them.
Something along the lines of:
protected IDictionary<int, IList> DeletedCollections = new Dictionary<int, IList>();
protected void SubscribeDeletionHandler<TEntity>(ObservableCollection<TEntity> collection)
{
var deletedEntities = new List<TEntity>();
DeletedCollections[collection.GetHashCode()] = deletedEntities;
collection.CollectionChanged += (o, a) =>
{
if (a.OldItems != null)
{
deletedEntities.AddRange(a.OldItems.Cast<TEntity>());
}
};
}
Now if the user decides to save his changes to the server, I can get the list of removed items and send them along:
ObservableCollection<Customer> customers = MyServiceProxy.GetCustomers();
customers.RemoveAt(0);
MyServiceProxy.UpdateCustomers(customers);
At this point the UpdateCustomers method will check my shadow collection to see whether any items were removed, and send them along to the server side.
This approach works fine, until you start to think about the life-cycle of these shadow collections. Basically, when the ObservableCollection is garbage collected, there is no way of knowing that we need to remove the shadow collection from our dictionary.
I came up with a complicated solution that basically does manual memory management in this case: I keep a WeakReference to the ObservableCollection, and every few seconds I check whether the reference is still alive; if not, I remove the shadow collection.
But this seems like a terrible solution... I hope the collective genius of StackOverflow can shed light on a better solution.
EDIT:
In the end I decided to go with subclassing the ObservableCollection. The service proxy code is generated so it was a relatively simple task to change it to return my derived type.
Thanks for all the help!
Instead of rolling your own "weak reference + poll: is it dead, is it alive?" logic, you could use HttpRuntime.Cache (available from all project types, not just web projects).
Add each shadow collection to the Cache, either with a generous timeout, or with a delegate that can check whether the original collection is still alive (or both).
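A sketch of what that could look like inside your SubscribeDeletionHandler, where collection and deletedEntities are already in scope (the key scheme mirrors the existing GetHashCode key, and the timeout is arbitrary):

using System;
using System.Web;
using System.Web.Caching;

// Cache the shadow collection instead of parking it in a dictionary.
// The sliding expiration evicts it once nothing has touched it for a
// while, and the removal callback is the hook for any extra cleanup.
HttpRuntime.Cache.Insert(
    "deleted-" + collection.GetHashCode(),
    deletedEntities,
    null,                                    // no cache dependency
    Cache.NoAbsoluteExpiration,
    TimeSpan.FromMinutes(30),                // a generous sliding timeout
    CacheItemPriority.Normal,
    (key, value, reason) => { /* e.g. log the eviction */ });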
It isn't dreadfully different from your own solution, but it does use tried and trusted .NET components.
Other than that, you're looking at extending ObservableCollection and using that new class instead (which I'd imagine is no small change), or changing/wrapping the UpdateCustomers method to remove the shadow collection from DeletedCollections.
Sorry I can't think of anything else, but hope this helps.
BW
If replacing ObservableCollection is a possibility (e.g. if you are using a common factory for all the collection instances) then you could subclass ObservableCollection and add a finalizer which cleans up the deleted items that belong to this collection.
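A minimal sketch of such a subclass (type and member names invented). This variant actually sidesteps the finalizer: because the deleted-items list is owned by the collection, it is reclaimed together with the collection, so there is no external dictionary left to clean up. This also matches the subclassing route the asker settled on in the EDIT above.

using System.Collections.Generic;
using System.Collections.ObjectModel;

public class TrackingCollection<T> : ObservableCollection<T>
{
    // Lives and dies with the collection itself.
    public List<T> DeletedItems { get; } = new List<T>();

    protected override void RemoveItem(int index)
    {
        DeletedItems.Add(this[index]);
        base.RemoveItem(index);
    }

    protected override void SetItem(int index, T item)
    {
        DeletedItems.Add(this[index]);   // replacing an item also removes the old one
        base.SetItem(index, item);
    }
}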
Another alternative is to change the way you compute which items are deleted. You could maintain the original collection and give the client a shallow copy. When the collection comes back, you can compare the two to see which items are no longer present. If the collections are sorted, the comparison can be done in time linear in the size of the collection. If they're not sorted, the modified collection's values can be put in a hash table and used to look up each value in the original collection. If the entities have a natural ID, then using that as the key is a safe way of determining which items are not present in the returned collection, that is, have been deleted. This also runs in linear time.
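The natural-ID variant of that comparison is only a couple of lines (Id stands for whatever natural key the entities expose):

using System.Collections.Generic;
using System.Linq;

// originals = the collection as handed to the client;
// returned  = the collection that came back.
// One pass builds the set, one pass scans it: linear time overall.
var returnedIds = new HashSet<int>(returned.Select(e => e.Id));
var deleted = originals.Where(e => !returnedIds.Contains(e.Id)).ToList();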
Otherwise, your original solution doesn't sound that bad. In Java, a WeakReference can register a callback that gets called when the reference is cleared. There is no similar feature in .NET, but polling is a close approximation. I don't think this approach is so bad, and if it's working, then why change it?
As an aside, aren't you concerned about GetHashCode() returning the same value for distinct collections? Using a weak reference to the collection as the key might be more appropriate; then there is no chance of a collision.
I think you're on a good path; I'd consider refactoring in this situation. My experience is that in 99% of cases the garbage collector makes memory management awesome - almost no real work needed.
But in the 1% of cases, it takes someone to realize that they've got to up the ante and go "old school" by firming up their caching/memory management in those areas. Hats off to you for realizing you're in that situation and for trying to avoid the IDisposable/WeakReference tricks. I think you'll really help the next guy who works in your code.
As for getting a solution, I think you've got a good grip on the situation:
- be clear when your objects need to be created
- be clear when your objects need to be destroyed
- be clear when your objects need to be pushed to the server
good luck! tell us how it goes :)
Consider this simplified application domain:
Criminal Investigative database
Person is anyone involved in an investigation
Report is a bit of info that is part of an investigation
A Report references a primary Person (the subject of an investigation)
A Report has accomplices who are secondarily related (and could certainly be primary in other investigations or reports)
These classes have ids that are used to store them in a database, since their info can change over time (e.g. we might find new aliases for a person, or add persons of interest to a report)
Domain http://yuml.me/13fc6da0
If these are stored in some sort of database and I wish to use immutable objects, there seems to be an issue regarding state and referencing.
Supposing that I change some metadata about a Person. Since my Person objects are immutable, I might have some code like:
class Person(
    val id: UUID,
    val aliases: List[String],
    val reports: List[Report]) {
  def addAlias(name: String) = new Person(id, name :: aliases, reports)
}
So that my Person with a new alias becomes a new object, also immutable. If a Report refers to that person, but the alias was changed elsewhere in the system, my Report now refers to the "old" person, i.e. the person without the new alias.
Similarly, I might have:
class Report(val id: UUID, val content: String) {
  /** Adding more info to our report */
  def updateContent(newContent: String) = new Report(id, newContent)
}
Since these objects don't know who refers to them, it's not clear to me how to let all the "referrers" know that there is a new object available representing the most recent state.
This could be done by having all objects "refresh" from a central data store, and having all operations that create new, updated objects store them to the central data store, but this feels like a cheesy reimplementation of the underlying language's referencing; i.e., it would be clearer to just make these "secondary storable objects" mutable. Then, if I add an alias to a Person, all referrers see the new value without doing anything.
How is this dealt with when we want to avoid mutability, or is this a case where immutability is not helpful?
If X refers to Y, both are immutable, and Y changes (i.e. you replace it with an updated copy), then you have no choice but to replace X also (because it has changed, since the new X points to the new Y, not the old one).
This rapidly becomes a headache to maintain in highly interconnected data structures. You have three general approaches:
1) Forget immutability in general. Make the links mutable. Fix them as needed. Be sure you really do fix them, or you might get a memory leak (X refers to old Y, which refers to old X, which refers to older Y, etc.).
2) Don't store direct links, but rather ID codes that you can look up (e.g. a key into a hash map). You then need to handle the lookup-failure case, but otherwise things are pretty robust. This is a little slower than the direct link, of course. (A sketch of this approach follows the list.)
3) Change the entire world. If something is changed, everything that links to it must also be changed (and performing this operation simultaneously across a complex data set is tricky, but theoretically possible, or at least the mutable aspects of it can be hidden e.g. with lots of lazy vals).
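A minimal sketch of approach 2, with invented names (shown in C#, though the question's code is Scala; the shape is the same): objects hold ID codes, and one mutable table maps each ID to the current immutable version.

using System;
using System.Collections.Generic;

sealed class PersonV                      // immutable; a new version replaces the old
{
    public readonly Guid Id;
    public readonly IReadOnlyList<string> Aliases;
    public PersonV(Guid id, IReadOnlyList<string> aliases) { Id = id; Aliases = aliases; }
}

sealed class ReportV
{
    public readonly Guid PrimaryPersonId; // an ID code, not a direct object link
    public ReportV(Guid primaryPersonId) { PrimaryPersonId = primaryPersonId; }
}

static class World
{
    // The only mutable state: ID -> current version.
    static readonly Dictionary<Guid, PersonV> People = new Dictionary<Guid, PersonV>();

    public static void Put(PersonV p) { People[p.Id] = p; }

    // Callers must handle the lookup-failure case mentioned above.
    public static bool TryResolve(ReportV r, out PersonV p)
    {
        return People.TryGetValue(r.PrimaryPersonId, out p);
    }
}

Adding an alias means calling Put with a new PersonV carrying the extended list; every report sees the new version at its next lookup without itself changing.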
Which is preferable depends on your rate of lookups and updates, I expect.
I suggest you read about how people deal with this problem in Clojure and Akka, and read about software transactional memory. Some of my own thoughts follow...
Immutability does not exist for its own sake. Immutability is an abstraction; it does not "exist" in nature. The world is mutable; the world is permanently changing. So it's quite natural for data structures to be mutable: they describe the state of a real or simulated object at a given moment in time. And it looks like OOP rulez here. At a conceptual level, the problem with this attitude is that the object in RAM != the real object: the data can be inaccurate, it arrives with a delay, etc.
So for the most trivial requirements you can go with everything mutable: persons, reports, etc. Practical problems will arise when:
data structures are modified from concurrent threads
users provide conflicting changes for the same objects
a user provides invalid data and it has to be rolled back
With a naive mutable model you will quickly end up with inconsistent data and a crashing system. Mutability is error-prone; full immutability is impossible. What you need is a transactional view of the world: within a transaction, the program sees an immutable world, and the STM manages changes so they are applied in a consistent and thread-safe way.
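To make "the program sees an immutable world" concrete, here is a toy compare-and-swap sketch (not a real STM, but essentially how Clojure's atom works): a transaction builds a new immutable world from a snapshot, and the result is published atomically, retrying if someone else committed in between.

using System;
using System.Threading;

class Store<T> where T : class
{
    T current;
    public Store(T initial) { current = initial; }

    public T Read() { return current; }   // a consistent, immutable snapshot

    public void Commit(Func<T, T> transaction)
    {
        while (true)
        {
            var snapshot = current;
            var updated = transaction(snapshot);   // pure: builds a new world
            // Publish only if nobody committed since we took the snapshot;
            // otherwise loop and re-run the transaction on the fresh state.
            if (Interlocked.CompareExchange(ref current, updated, snapshot) == snapshot)
                return;
        }
    }
}

A full STM generalizes this by coordinating several such references within one transaction.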
I think you are trying to square the circle. Person is immutable, the list of Reports on a Person is part of the Person, and the list of Reports can change.
Would it be possible for an immutable Person to have a reference to a mutable PersonRecord that keeps things like Reports and Aliases?
I'm developing an iPhone application and am new to Objective-C as well as SQLite. That being said, I have been struggling w/ designing a practical data management solution that is worthy of existing. Any help would be greatly appreciated.
Here's the deal:
The majority of the data my application interacts with is stored in five tables in the local SQLite database. Each table has a corresponding Class which handles initialization, hydration, dehydration, deletion, etc. for each object/row in the corresponding table. Whenever the application loads, it populates five NSMutableArrays (one for each type of object). In addition to a Primary Key, each object instance always has an ID attribute available, regardless of hydration state. In most cases it is a UUID which I can then easily reference.
Before a few days ago, I would simply access the objects via these arrays by tracking down their UUIDs. I would then proceed to hydrate/dehydrate them as needed. However, some of the objects also maintain their own arrays which reference other objects' UUIDs. In the event that I must track down one of these "child" objects via its UUID, it becomes a bit more difficult.
In order to avoid having to enumerate through one of the previously mentioned arrays to find a "parent" object's UUID, and then proceed to find the "child's" UUID, I added a DataController w/ a singleton instance to simplify the process.
I had hoped that the DataController could provide a single access point to the local database and make things easier, but I'm not so certain that is the case. Basically, what I did is create multiple NSMutableDictionaries. Whenever the DataController is initialized, it enumerates through each of the previously mentioned NSMutableArrays maintained in the Application Delegate and creates a key/value pair in the corresponding dictionary, using the given object as the value and its UUID as the key.
The DataController then exposes procedures that allow a client to call in w/ a desired object's UUID to retrieve a reference to the actual object. Whenever there is a request for an object, the DataController automatically hydrates the object in question and then returns it. I did this because I wanted to take control of hydration out of the client's hands, to prevent dehydrating an object that is being referenced multiple times.
I realize that in most cases I could just make a mutable copy of the object and then, if necessary, replace the original object down the road, but I wanted to avoid that scenario if at all possible. I therefore added an additional dictionary to monitor which objects are hydrated at any given time, using the object's UUID as the key and a fluctuating count representing the number of hydrations w/out an offsetting dehydration. My goal w/ this approach was to have the DataController automatically dehydrate any object once its "hydration retainment count" hit zero, but this could easily lead to significant memory leaks, as it currently relies on the caller to later call a procedure that decreases the hydration retainment count of the object. There are obviously many cases when this is just not obvious or maybe not even easily accomplished, and if only one calling object fails to do so properly, I encounter the exact opposite scenario I was trying to prevent in the first place. Ironic, huh?
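To make the scheme concrete, it boils down to manual reference counting, roughly like this (a sketch with invented names, not my actual Objective-C):

using System;
using System.Collections.Generic;

class DataController
{
    readonly Dictionary<Guid, int> hydrationCounts = new Dictionary<Guid, int>();

    public void Acquire(Guid uuid)
    {
        int count;
        hydrationCounts.TryGetValue(uuid, out count);
        if (count == 0) Hydrate(uuid);       // first reference: load the full row
        hydrationCounts[uuid] = count + 1;
    }

    public void Release(Guid uuid)
    {
        // Every Acquire must be paired with exactly one Release. If even one
        // caller forgets, the count never reaches zero and the object stays
        // hydrated forever: the leak described above.
        int count = hydrationCounts[uuid] - 1;
        hydrationCounts[uuid] = count;
        if (count == 0) Dehydrate(uuid);     // last reference gone: unload
    }

    void Hydrate(Guid uuid)   { /* fill the object's fields from SQLite */ }
    void Dehydrate(Guid uuid) { /* drop the heavy fields, keep the UUID stub */ }
}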
Anyway, I'm thinking that if I proceed w/ this approach, it will just end badly. I'm tempted to go back to the original plan, but doing so makes me want to cringe, and I'm sure there is a more elegant solution floating around out there. As I said before, any advice would be greatly appreciated. Thanks in advance.
I'd also be aware (as I'm sure you are) that Core Data is just around the corner, and make sure you make the right choice for the future.
Have you considered implementing this via the NSCoder interface? Not sure that it wouldn't be more trouble than it's worth, but if what you want is to extract all the data out into an in-memory object graph, and save it back later, that might be appropriate. If you're actually using SQL queries to limit the amount of in-memory data, then obviously, this wouldn't be the way to do it.
I decided to go w/ Core Data after all.