Use optional keys or a catch-all key in MongoMapper? - mongodb

Suppose I'm working on a MongoMapper class that looks like this:
class Animal
  include MongoMapper::Document
  key :type, String, :required => true
  key :color, String
  key :feet, Integer
end
Now I want to store a bird's wingspan. Would it be better to add this, even though it's irrelevant for many documents and feels a bit untidy:
key :wingspan, Float
Or this, even though it's a nondescript catch-all that feels like a hack:
key :metadata, Hash
It seems like the :metadata approach (for which there's precedent in the code I'm inheriting) is almost redundant to the Mongo document as a whole: they're both intended to be schemaless buckets of key-value pairs.
However, it also seems like adding animal-specific keys is a slippery slope to a pretty ugly model.
Any alternatives (create a Bird subclass)?

MongoMapper doesn't store keys that are nil, so if you did define key :wingspan only the documents that actually set that key would store it.
If you opt not to define the key, you can still set and get it with my_bird[:wingspan] = 23. (The [] call automatically defines a key for you. Similarly, if a doc comes back from MongoDB with a key that's not explicitly defined, a key will be defined for it and for all docs of that class. Defining it for the whole class is arguably a bug, but since nil keys aren't stored it's not much of a problem.)
If bird has its own behavior as well (it probably does), then a subclass makes sense. For birds and animals I would take this route, since every bird is an animal. MongoDB is much nicer than ActiveRecord for Single Table/Single Collection Inheritance, because you don't need a billion migrations and your code makes it clear which attributes go with which classes.
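To make the two points above concrete, here is a minimal plain-Ruby sketch. It does not use MongoMapper; the `keys` class method and the `to_document` helper are hypothetical stand-ins that only imitate the described behavior (nil keys are not stored, and a subclass carries its own attributes).

```ruby
# Plain-Ruby sketch, no MongoMapper required. `keys` and `to_document`
# are hypothetical helpers that imitate "nil keys are not stored".
class Animal
  attr_accessor :type, :color, :feet

  def initialize(attrs = {})
    attrs.each { |name, value| public_send("#{name}=", value) }
  end

  # Attribute names this class serializes; subclasses extend the list.
  def self.keys
    %i[type color feet]
  end

  # Build the hash that would be stored, skipping nil attributes.
  def to_document
    self.class.keys.each_with_object({}) do |name, doc|
      value = public_send(name)
      doc[name] = value unless value.nil?
    end
  end
end

# Bird-specific attributes live on the subclass, not on every Animal.
class Bird < Animal
  attr_accessor :wingspan

  def self.keys
    super + %i[wingspan]
  end
end

p Animal.new(type: "cat", feet: 4).to_document
p Bird.new(type: "bird", feet: 2, wingspan: 23.0).to_document
```

The cat's document contains no color or wingspan entry at all, which is why an occasionally used key costs nothing for the documents that never set it.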

It's hard to give a good answer without knowing how you intend to extend the database in the future and how you expect to use the information you store. If you were storing large numbers of birds and wanted to summarize on wingspan, then wingspan would be helpful even if it would be unused for other animals. If you plan to store random arbitrary information for every known animal, there are too many possibilities to try to track in a schema and the metadata approach would be more usable.

Related

Scala [2.11.6] -- is there an elegant way to create cache-keys for objects based on an Int/Long and the full class name?

I want to use a cache to hold recently accessed objects that just came from a database read.
The database primary key, in my case, will be a Long.
In each case I'll have an Object (Case Class) that represents this data.
The combination of the Long plus the full class name will be a unique identifier for finding any specific object. (The namespace should never have any conflicts, as class names do not use numbers (as a rule?). In any case, for this use case I control the entire namespace, so it is not a huge concern.)
The objects will be relatively short lived in the cache - I just see a few situations where I can save memory by holding the same immutable Object more than once as opposed to different instances of the same Object that would be extremely difficult to "pass everything everywhere" to avoid.
This also would help performance in situations where different eyeballs are checking out the same stuff but this is not the driver for this particular use case (just gravy).
My concern is now for every time I need a given object I'll need to recreate the cache key. This will involve a Long.toString and a String Concat. The case classes in question have a val in their companion object so that they know their class name without any further reflection occurring.
I'm thinking of putting a "cache" together in the companion object for the main cache keys as I wish to avoid the (needless?) repeat ops per lookup as well as the resultant garbage collection etc. (The fastest code to run is the code that never gets written (or called) - right?)
Is there a more elegant way to handle this? Has someone else already solved this specific problem?
I thought of writing a key class but even with a val (lazy or otherwise) for the hash and toString I still get a hit for each and every object I ask for as now I have to create the key object each time. (That could of course go back into the companion object key cache but if I go to the trouble of setting up that companion object cache for keys the key object approach is redundant.)
As a secondary ask of this question - assuming I use a Long and a full class name (as a String) which is most likely to get the quickest pull for the cache?
Long.toString + fullClassName
or
fullClassName + Long.toString
The Long IS a string in the key so assuming it is a string "find" on the cache which would be easier to index find? The numeric portion first or the string class name.
Numbers first means you wade through ALL the objects with matching numbers searching for the matching class whereas class first means you find the block of a particular class first but you have to go to the very end of the string to find the exact match.
I suspect the former might be more easily optimized for a "fast find" (I know in MySQL terms it would be...)
Then again perhaps someone already has a dual-key lookup based cache? :)
I would keep it extremely simple until you had concrete performance metrics to the contrary. Something like:
trait Key {
  def id: Long
  lazy val key: String = s"${getClass.getName}-${id}"
}
case class MyRecordObject(id: Long, ...) extends Key
Use a simple existing caching solution like Guava Caching.
To your secondary question, I would not worry about the performance of generating a key at all until you could actually prove key generation is a bottleneck (which I kind of doubt it ever would be).
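For readers who want to try the same idea outside Scala, here is a rough Ruby analog of the Key trait. The `CacheKey` module and the `OtherRecord` class are hypothetical names used only for illustration; the point is that the key string is built once per object and then reused.

```ruby
# Ruby analog of the Key trait: the key string is built on first use and
# memoized, so repeated lookups on the same object do not rebuild it.
module CacheKey
  def cache_key
    @cache_key ||= "#{self.class.name}-#{id}"
  end
end

class MyRecordObject
  include CacheKey
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

# Hypothetical second record type that happens to share the same id.
class OtherRecord
  include CacheKey
  attr_reader :id

  def initialize(id)
    @id = id
  end
end

puts MyRecordObject.new(10).cache_key
puts OtherRecord.new(10).cache_key
```

Because the class name is part of the key, two records with the same numeric id never share a cache slot.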
import play.api.cache.Cache
It turns out that Cache.getOrElse[T](idAsString, seconds) actually does most of the heavy lifting!
[T] is of course a type in Scala and that is enough to keep things separated in the cache. Each [T] is a unique, separate and distinct bucket in the cache.
So Cache.getOrElse[AUser](10, 5) will get a completely different object from Cache.getOrElse[ALog](10, 5) (where the ID of 10 just happens to be the same for the purpose of illustration here).
I'm currently doing this with thousands of objects across hundreds of types so I know it works...
I say most of the work as the Long has to be .toString'ed before it can be used as a key. Not a complete GC disaster as I simply set up a Map to hold the most commonly/recently .toString'ed Long values.
For those of you that simply don't get the value of this consider a simple log screen which is very common in most web applications.
2015/10/22 10:22 - Johnny Rotten - deleted an important file
2015/10/22 10:22 - Johnny Rotten - deleted another important file
2015/10/22 10:22 - Johnny Rotten - looked up another user
2015/10/22 10:22 - Johnny Rotten - added a bogus file
2015/10/22 10:22 - Johnny Rotten - insulted his boss
Under Java (Tomcat) there would typically be a single Object that represented that user (Johnny Rotten), and that single Object would be linked to each and every time the name of that user appeared in the log display.
Now under Scala we tend to create a new instance (Case Class) for each and every line of the log entry simply because we have no (efficient/plumbing) way of getting to the last used instance of that Case Class. The Log itself tends to be a case class and it has a lazy val of the User Case Class.
So, along comes user-x and they look up a log and they set the pagination to 500 lines and, lo and behold, we now have 500 case classes being created simply to display a user's name (the "who" in each log entry).
And then a few seconds later we have yet another 500 User Case Classes when they hit refresh because they didn't think they clicked the mouse right the first time...
With a simple cache however that holds a recently accessed object for all of say 5 seconds, all we create for the entire 500 log entries is a single instance of a User Case Class for each unique name we display in the log.
In Scala, Case Classes are immutable, so sharing the single instance is a perfectly acceptable use case here and the GC has no needless work to do...
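The short-lived cache described above can be sketched in a few lines. This is not Play's Cache API; `ShortLivedCache`, `get_or_else`, and the `User` struct are hypothetical names, and the composite [class, id] key is one assumed way to keep types separated without concatenating strings.

```ruby
# Minimal time-limited cache keyed by [class, id]. The clock is injectable
# so that expiry can be exercised without sleeping.
class ShortLivedCache
  Entry = Struct.new(:value, :expires_at)

  def initialize(ttl_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @entries = {}
  end

  # Return the cached object for (klass, id), or run the block to load it.
  def get_or_else(klass, id)
    key = [klass, id] # composite key, so AUser 10 and ALog 10 never collide
    now = @clock.call
    entry = @entries[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[key] = Entry.new(value, now + @ttl)
    value
  end
end

User = Struct.new(:id, :name) # hypothetical stand-in for the User case class

cache = ShortLivedCache.new(5)
loads = 0
rows = Array.new(500) do
  cache.get_or_else(User, 42) do
    loads += 1
    User.new(42, "Johnny Rotten").freeze
  end
end
puts "rows: #{rows.size}, loads: #{loads}, distinct objects: #{rows.uniq(&:object_id).size}"
```

Note that this sketch only replaces an expired entry when the same key is requested again; a real cache would also evict stale entries so the table does not grow without bound.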

Comparing the attribute contents of two managed objects?

I am setting up a managed object from data I am getting off the web. Before I add this new object to the managedObjectContext, I want to check whether it is already in the database. Is there a way to compare two managed objects in one hit, or do I have to compare each attribute individually to work out whether they are identical or one contains a difference?
Simple Example:
Entity:Pet (Created but not inserted into database)
Attribute, Name: Brian
Attribute, Type: Cat
Attribute, Age: 12
Entity:Pet (Currently in database)
Attribute, Name: Brian
Attribute, Type: Cat
Attribute, Age: 7
In this example can I compare [Brian, Cat, 12] with [Brian, Cat, 7] or do I need to go through each attribute one by one to ascertain a full match?
Unique identifiers are often used to search for objects by only having to match the one field. As you note, matching on multiple fields could be annoying and inefficient, but it's perhaps not as bad as you think: you can construct an NSPredicate to quite easily match all the required fields on objects in Core Data.
Use of NSPredicate aside: suppose you just want to match one field. If you don't have a suitable unique identifier in the data as provided, you could derive one. The obvious way is to construct a hash code for everything you store, based on each field you want to match on. Then when you wish to check if an 'incoming' object is already in core data, compute the hash code for the new object, then just look for an object in core data with that same hash code. (Note: if you find an object that already exists with the same hash code, you might want to then compare all the fields to check that it really does represent the same object -- there's a tiny chance it might be a 'different' object, A.K.A. a hash collision).
A very naive hash code implementation for an object X would be something like:
hashcode(X) = hashcode(X.name) + hashcode(X.type) + hashcode(X.age)
To see a more realistic example of writing a hashcode function, see the accepted answer here.
By the way, I'm assuming that you don't want to load all your objects from core data into memory at once. If however that is acceptable (suppose you have quite a limited amount of items), an alternative is to implement isEqual and hash on your class, and then use regular foundation class methods like NSArray indexOfObject: (or, even better, NSDictionary objectForKey:) to locate objects of interest.
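The hash-then-verify lookup described above can be sketched as follows. This is plain Ruby rather than Core Data: the `Pet` struct, `fingerprint`, `build_index`, and `already_stored?` are hypothetical names, SHA-256 is swapped in for the naive sum of hash codes, and the NUL separator assumes that character never appears in a field value.

```ruby
require "digest"

Pet = Struct.new(:name, :type, :age) # hypothetical stand-in for the Pet entity

# Derive one fingerprint from every field that must match.
# Assumption: the NUL separator never occurs inside a field value.
def fingerprint(pet)
  Digest::SHA256.hexdigest([pet.name, pet.type, pet.age].join("\u0000"))
end

# Index stored pets by fingerprint (stands in for an indexed attribute).
def build_index(pets)
  pets.group_by { |pet| fingerprint(pet) }
end

# Look up by fingerprint first, then compare field by field to rule out
# a hash collision.
def already_stored?(index, incoming)
  index.fetch(fingerprint(incoming), []).any? { |pet| pet == incoming }
end

index = build_index([Pet.new("Brian", "Cat", 7)])
puts already_stored?(index, Pet.new("Brian", "Cat", 12))
puts already_stored?(index, Pet.new("Brian", "Cat", 7))
```

In the question's example, [Brian, Cat, 12] produces a different fingerprint from [Brian, Cat, 7], so the incoming object is treated as new after a single lookup.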

MongoDB case insensitive key search

I am able to query values without regard to case, but I would like to query keys case-insensitively, so users can type them in all lower case.
This doesn't work, because it is not valid JSON:
{
/^lastName$/i: "Jones"
}
Is there a strategy I could use for this, besides just making a new collection of keys as values?
There is currently no way to do this.
MongoDB is "schema-free" but that should not be confused with "doesn't have a schema". There's an implicit assumption that your code has some control over the names of the keys that actually appear in the system.
Let's flip the question around.
Is there a good reason that users are inserting case-sensitive keys?
Can you just cast all keys to lower-case when they're inserted?
Again, MongoDB assumes that you have some knowledge of the available keys. Your question implies that you have no knowledge of the available keys. You'll need to close this gap.
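One way to close that gap is to lower-case every key on insert and apply the same normalization to each query. The sketch below uses an in-memory stand-in rather than a MongoDB driver; `normalize_keys` and `TinyCollection` are hypothetical names.

```ruby
# Normalize keys on the way in and again on every query, so lookups are
# effectively case-insensitive on key names. Nested hashes are handled too.
def normalize_keys(doc)
  doc.each_with_object({}) do |(key, value), out|
    out[key.to_s.downcase] = value.is_a?(Hash) ? normalize_keys(value) : value
  end
end

# In-memory stand-in for a collection; find returns the documents that
# match every key/value pair in the query.
class TinyCollection
  def initialize
    @docs = []
  end

  def insert(doc)
    @docs << normalize_keys(doc)
  end

  def find(query)
    wanted = normalize_keys(query)
    @docs.select { |doc| wanted.all? { |key, value| doc[key] == value } }
  end
end

people = TinyCollection.new
people.insert("lastName" => "Jones", "firstName" => "Tom")
p people.find("lastname" => "Jones")
p people.find("LASTNAME" => "Jones")
```

One caveat of this approach: two keys that differ only by case collapse into one, and the later value wins.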

Immutability and shared references - how to reconcile?

Consider this simplified application domain:
Criminal Investigative database
Person is anyone involved in an investigation
Report is a bit of info that is part of an investigation
A Report references a primary Person (the subject of an investigation)
A Report has accomplices who are secondarily related (and could certainly be primary in other investigations or reports)
These classes have ids that are used to store them in a database, since their info can change over time (e.g. we might find new aliases for a person, or add persons of interest to a report)
Domain http://yuml.me/13fc6da0
If these are stored in some sort of database and I wish to use immutable objects, there seems to be an issue regarding state and referencing.
Suppose that I change some metadata about a Person. Since my Person objects are immutable, I might have some code like:
class Person(
    val id: UUID,
    val aliases: List[String],
    val reports: List[Report]) {
  def addAlias(name: String) = new Person(id, name :: aliases, reports)
}
So that my Person with a new alias becomes a new object, also immutable. If a Report refers to that person, but the alias was changed elsewhere in the system, my Report now refers to the "old" person, i.e. the person without the new alias.
Similarly, I might have:
class Report(val id: UUID, val content: String) {
  /** Adding more info to our report */
  def updateContent(newContent: String) = new Report(id, newContent)
}
Since these objects don't know who refers to them, it's not clear to me how to let all the "referrers" know that there is a new object available representing the most recent state.
This could be done by having all objects "refresh" from a central data store and all operations that create new, updated, objects store to the central data store, but this feels like a cheesy reimplementation of the underlying language's referencing. i.e. it would be more clear to just make these "secondary storable objects" mutable. So, if I add an alias to a Person, all referrers see the new value without doing anything.
How is this dealt with when we want to avoid mutability, or is this a case where immutability is not helpful?
If X refers to Y, both are immutable, and Y changes (i.e. you replace it with an updated copy), then you have no choice but to replace X also (because it has changed, since the new X points to the new Y, not the old one).
This rapidly becomes a headache to maintain in highly interconnected data structures. You have three general approaches.
Forget immutability in general. Make the links mutable. Fix them as needed. Be sure you really do fix them, or you might get a memory leak (X refers to old Y, which refers to old X, which refers to older Y, etc.).
Don't store direct links, but rather ID codes that you can look up (e.g. a key into a hash map). You then need to handle the lookup failure case, but otherwise things are pretty robust. This is a little slower than the direct link, of course.
Change the entire world. If something is changed, everything that links to it must also be changed (and performing this operation simultaneously across a complex data set is tricky, but theoretically possible, or at least the mutable aspects of it can be hidden e.g. with lots of lazy vals).
Which is preferable depends on your rate of lookups and updates, I expect.
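The second approach (store ID codes and look them up) is probably the easiest to illustrate. In this sketch the `Registry` class is a hypothetical central store, integer ids stand in for the UUIDs in the question, and both record types are frozen to imitate immutability.

```ruby
# Reports hold a person's id rather than a direct reference to the Person.
Person = Struct.new(:id, :aliases) do
  # Returns a new frozen Person; the receiver is left untouched.
  def add_alias(name)
    Person.new(id, ([name] + aliases).freeze).freeze
  end
end
Report = Struct.new(:id, :content, :subject_id)

# Central store mapping id -> latest immutable version of a record.
class Registry
  def initialize
    @latest = {}
  end

  # Stores the record under its id and returns it.
  def save(record)
    @latest[record.id] = record
  end

  # Raises KeyError when the id is unknown (the lookup-failure case).
  def fetch(id)
    @latest.fetch(id)
  end
end

people = Registry.new
suspect = people.save(Person.new(1, ["Smith"].freeze).freeze)
report = Report.new(100, "Seen near the docks", suspect.id).freeze

people.save(suspect.add_alias("Smithy")) # new Person object, same id
p people.fetch(report.subject_id).aliases # latest version, via the id
p suspect.aliases                         # the old object is unchanged
```

The report never needs to be rebuilt when the person changes, at the cost of one hash lookup each time the link is followed.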
I suggest you read how people deal with this problem in Clojure and Akka. Read about software transactional memory (STM). And some of my thoughts...
Immutability does not exist for its own sake. Immutability is an abstraction. It does not "exist" in nature. The world is mutable; the world is permanently changing. So it's quite natural for data structures to be mutable: they describe the state of the real or simulated object at a given moment in time. And it looks like OOP rules here. At the conceptual level, the problem with this attitude is that an object in RAM != the real object: the data can be inaccurate, it arrives with a delay, etc.
So for the most trivial requirements you can go with everything mutable: persons, reports, etc. Practical problems will arise when:
data structures are modified from concurrent threads
users provide conflicting changes for the same objects
a user provides invalid data and it should be rolled back
With a naive mutable model you will quickly end up with inconsistent data and a crashing system. Mutability is error-prone; immutability is impossible. What you need is a transactional view of the world. Within a transaction the program sees an immutable world, and STM manages changes so that they are applied in a consistent and thread-safe way.
I think you are trying to square the circle. Person is immutable, the list of Reports on a Person is part of the Person, and the list of Reports can change.
Would it be possible for an immutable Person to have a reference to a mutable PersonRecord that keeps things like Reports and Aliases?

Hashes vs Numeric id's

When creating a web application that somehow displays a unique identifier for a recurring entity (videos on YouTube, or book sections on a site like mine), would it be better to use a uniform-length identifier like a hash, or the unique key of the item in the database (1, 2, 3, etc.)?
Besides revealing a little, what I think is immaterial, information about the internals of your app, why would using a hash be better than just using the unique id?
In short: Which is better to use as a publicly displayed unique identifier - a hash value, or a unique key from the database?
Edit: I'm opening up this question again because Dmitriy brought up the good point of not tying down the naming to db specific property. Will this sort of tie down prevent me from optimizing/normalizing the database in the future?
The platform uses php/python with ISAM /w MySQL.
Unless you're trying to hide the state of your internal object ID counter, hashes are needlessly slow (to generate and to compare), needlessly long, needlessly ugly, and needlessly capable of colliding. GUIDs are also long and ugly, making them just as unsuitable for human consumption as hashes are.
For inventory-like things, just use a sequential (or sharded) counter instead. If you migrate to a different database, you will just have to initialize the new counter to a value at least as large as your largest existing record ID. Pretty much every database server gives you a way to do this.
If you are trying to hide the state of your counter, perhaps because you're counting users and don't want competitors to know how many you have, I suggest avoiding the display of your internal IDs. If you insist on displaying them and don't want the drawbacks of a hash, you might consider using a maximal-period linear feedback shift register to generate IDs.
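As a rough illustration of the shift-register suggestion, here is a 16-bit Fibonacci LFSR using a commonly cited maximal-period tap set (bits 16, 14, 13 and 11). The function names and the seed are arbitrary choices for this sketch; a real system would pick a register width to match its ID space.

```ruby
# 16-bit Fibonacci LFSR with taps at bits 16, 14, 13 and 11
# (polynomial x^16 + x^14 + x^13 + x^11 + 1). The tap set is
# maximal-period: every nonzero 16-bit state is visited once per cycle.
def lfsr_next(state)
  bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
  (state >> 1) | (bit << 15)
end

# Produce the first `count` public IDs by stepping the register from a
# nonzero seed. The Nth record gets the Nth value in the sequence.
def public_ids(count, seed: 0xACE1)
  state = seed
  Array.new(count) { state = lfsr_next(state) }
end

p public_ids(5)
```

The IDs look scattered but never repeat within the period of 65535. Anyone who knows the taps can still step the sequence forward, so this hides the counter only from casual inspection.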
I typically use hashes if I don't want the user to be able to guess the next ID in the series. But for your book sections, I'd stick with numerical id's.
Using hashes is preferable in case you need to rebuild your database for some reason, for example, and the ordering changes. The ordinal numbers will move around -- but the hashes will stay the same.
Not relying on the order you put things into a box, but on properties of the things, just seems.. safer.
But watch out for collisions, obviously.
With hashes you
Are free to merge the database with a similar one (or a backup), if necessary
Are not doing something that could help some guessing attacks even a bit
Are not disclosing more private information about the user than necessary, e.g. if somebody sees a user number 2 in your current database log in, they're getting information that he is an oldie.
(Provided that you use a long hash or a GUID,) greatly helping yourself in case you're bought by YouTube and they decide to integrate your databases.
Helping yourself in case there appears a search engine that indexes by GUID.
Please let us know if the last 6 months brought you some clarity on this question...
Hashes aren't guaranteed to be unique, nor, I believe, consistent.
will your users have to remember/use the value? or are you looking at it from a security POV?
From a security perspective, it shouldn't matter - since you shouldn't just be relying on people not guessing a different but valid ID of something they shouldn't see in order to keep them out.
Yeah, I don't think you're looking for a hash - you're more likely looking for a Guid. If you're on the .Net platform, try System.Guid.
However, the most important reason not to use a Guid is for performance. Doing database joins and lookups on (long) strings is very suboptimal. Numbers are fast. So, unless you really need it, don't do it.
Hashes have the advantage that you can check if they are valid or not BEFORE performing any check to your database whether they exist or not. This can help you to fend off attacks with random hashes as you don't need to burden your database with fake lookups.
Therefore, if your hash has some kind of well-defined format, for example with a checksum at the end, you can check whether it's correct without needing to go to the database.
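A minimal sketch of such a self-checking identifier follows. The "hex id, dash, four hex digits" layout and the use of a truncated CRC32 are assumptions made for illustration, and `make_token` and `parse_token` are hypothetical names.

```ruby
require "zlib"

# Build a public token: the id in hex plus a short checksum of it.
# Assumed layout: "<hex id>-<4 hex digits of CRC32>".
def make_token(id)
  body = id.to_s(16)
  format("%s-%04x", body, Zlib.crc32(body) & 0xffff)
end

# Return the id if the token is well-formed and its checksum matches,
# otherwise nil. Malformed tokens are rejected without a database lookup.
def parse_token(token)
  match = /\A([0-9a-f]+)-([0-9a-f]{4})\z/.match(token)
  return nil unless match

  body, check = match.captures
  return nil unless format("%04x", Zlib.crc32(body) & 0xffff) == check

  body.to_i(16)
end

token = make_token(123_456)
p parse_token(token)
p parse_token("zzzz-0000")
```

An unkeyed checksum like this only filters out random or mistyped values; anyone who knows the scheme can produce valid-looking tokens, so it is not a substitute for access checks.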