Document databases: data model migrations - MongoDB

Like most of us, I'm coming from a relational database world,
and I'm currently looking into the possibilities of the document database world.
One of my concerns is the handling of changes in the data model over time (new properties are added, properties are renamed, relationships are added, ..).
In relational databases, this is typically handled as follows:
Write a database migration:
-> Modify the database schema
-> Fix data for existing rows (this typically contains some business logic)
Then modify the code (ORM updates, ..)
When using a document database, I have a feeling that changes to the data model
are much easier: there's no need to update a database schema; mostly it's just adding a property, .. and everything "just works".
I wonder how teams manage this kind of migration in real-life, enterprise projects with document databases:
Is there a strict policy for making changes to Types that are stored in the document db?
For instance, does every change to such a Type require a migration to update
existing documents?
As a consequence, is there a clear separation between the data model (types stored in the document db) and the business model?
Thanks for your time,
Koen

With RavenDB, you can do that with patching.
See: http://ayende.com/blog/157185/awesome-ravendb-feature-of-the-day-evil-patching
And: http://blog.hibernatingrhinos.com/12705/new-option-in-the-ravendb-studiondash-patching
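For a flavour of what such a patch looks like: RavenDB's scripted patching takes a JavaScript snippet that runs against each matched document, with this bound to the document. A hedged sketch with made-up property names (see the linked posts for the real examples):

    // Sketch of a RavenDB patch script; 'this' is the document being patched.
    this.FullName = this.FirstName + ' ' + this.LastName; // build the new property
    delete this.FirstName;                                // drop the old ones
    delete this.LastName;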

There are three general strategies that you can take for "schema" modifications in MongoDB. I've seen all three work well; which one you'll use depends a lot on your particular use case.
First: you can simply add a new field to new documents and write your code to handle the case where that field does not exist. For example, you could add an 'address' field to your "user" documents, and have the client code fall back to a default whenever it reads a document without one.
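For instance, a minimal mongo-shell sketch (userId is assumed to be in scope):

    // Tolerate "old-style" user documents that predate the address field.
    var user = db.users.findOne({ _id: userId });
    var address = (user.address !== undefined) ? user.address : null; // fallback for old docs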
Second: you can write your code to look at the existing document & update it when it sees an "old-style" document. For example, you could have code that checks to see if there is a "name" field in your "user" document. If it finds that field, it would split it up into "first_name" and "sur_name" fields, $unset the "name" field in that document, and $set the new "first_name" and "sur_name" fields to their calculated values.
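A mongo-shell sketch of that lazy upgrade, assuming a single space separates the name parts (a simplification; collection and field names are the ones from this example):

    var user = db.users.findOne({ _id: userId });
    if (user.name !== undefined) { // "old-style" document
        var parts = user.name.split(' ');
        db.users.update(
            { _id: user._id },
            {
                $set:   { first_name: parts[0], sur_name: parts.slice(1).join(' ') },
                $unset: { name: "" }
            }
        );
    }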
Third: you can batch-update all of the documents in the collection to use the new schema. You'd write the same code as above, but instead of lazily applying it when your application reads a document, you'd apply it to all the documents in the collection.
Note that this last strategy can have performance impact: if you page in a lot of documents that haven't been accessed for a while, you'll put an extra load on your MongoDB system.
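For example, a one-off mongo-shell migration script for the name split above:

    // Upgrade every remaining "old-style" document in one pass.
    db.users.find({ name: { $exists: true } }).forEach(function (user) {
        var parts = user.name.split(' ');
        db.users.update(
            { _id: user._id },
            {
                $set:   { first_name: parts[0], sur_name: parts.slice(1).join(' ') },
                $unset: { name: "" }
            }
        );
    });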

Related

Deciding between embedded documents and relational design in Cloudant

I'm currently learning NoSQL with Cloudant and trying to design a database. As I'm learning, Cloudant treats all records as documents and denormalises what would be tables, so I'm currently a bit confused about how to decide what needs to be in one document and what needs to be separated.
Below is my test case:
Let's say I'm designing the table structure for a book store; for simplicity I'll have these tables: BOOK, STORE, STORE_BRANCH
BOOK fields: _id, book_name, author
STORE fields: _id, store_name
STORE_BRANCH fields: _id, store_branch_name, address, store_id_fk
With the above case, I'm not able to decide where I should put the "price" field. In a normal RDBMS I would just create another table with the fields (book_id, store_branch_id, price), under the assumption that the price of a book differs per branch. So I'm wondering how to model this in Cloudant?
Any suggestion is appreciated.
Your doubts are pretty common for an RDBMS user.
In NoSQL you generally use the everything-in-one-document approach. In fact, in some cases, approximating JOINs in a document-oriented database like Cloudant is outright trivial. For example, if you want to model a one-to-n relationship, you can put all n related documents into the document they belong to. In your case you would put all the store_branch documents in the related store document (see the sketch after this list). This strategy is OK if:
The document does not get so big that it impairs performance. This can be mitigated somewhat by using database views or show functions.
The information in the inner document only appears there and does not need to be duplicated into other documents, or such duplication is acceptable for your application.
The document does not get updated concurrently. If it does, there will likely be unnecessary conflicts that will need to be resolved by the application.
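For illustration, such an embedded store document could look roughly like this (all values are made up):

    {
      "_id": "store_1",
      "store_name": "Acme Books",
      "branches": [
        { "store_branch_name": "Downtown", "address": "1 Main St" },
        { "store_branch_name": "Airport",  "address": "2 Terminal Rd" }
      ]
    }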
If the above strategy is not applicable, you can use an approach that more closely mimics how you would solve this problem in a relational database: create a document for each "relational table" row. In your case you would create price documents with the fields (book_id, store_branch_id, price).
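For example, a hypothetical price document, with a type marker so it can be told apart from books and stores in the same database (a common CouchDB/Cloudant convention):

    {
      "_id": "price_store1_branch7_book42",
      "type": "price",
      "book_id": "book_42",
      "store_branch_id": "branch_7",
      "price": 9.99
    }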
This Cloudant article explains these possibilities in depth: Cloudant - Join the fun, have fun with JOINs.

MongoDB: is indexing a pain?

Speaking in general, what are the best practices for querying (and therefore indexing) schemaless data structures (i.e. documents)?
Let's say I use MongoDB to store and query deterministic data structures in a collection. At this point all documents have the same structure, so I can easily create indexes for any queries in my app, since I know each document has the required field(s) for the index.
What happens after I change the structure and try to save new documents to the db? Let's say I merged the two fields FirstName and LastName into FullName. As a result the collection contains nondeterministic data. I see two problems here:
Old indexes cannot cover the new data, so new indexes are needed that handle both the old and the new fields
The app has to take care of dealing with two representations of the documents
This may result in a big problem when there are many changes in the db resulting in many versions of document structures.
I see two main approaches:
Lazy migration. This means that each document is migrated on demand (i.e. only after being loaded from the collection) to the final structure and then stored back to the collection. This approach does not actually solve the problems, because it concedes nondeterminism at any point in time.
Forced migration. This is the same approach as for RDBMS migrations. The migration is performed for all documents at one point in time, while the app is not running. The main con is the downtime of the app.
So the question: Is there any good way of solving the problem, especially without app downtime?
If you can't have downtime then the only choice is to do the migration "on the fly":
1. Change the application so that when documents are saved the new field is created, while reads still use the old fields.
2. Update your collection with a script/queries to add the new field to existing documents.
3. Create new indexes on that field.
4. Change the application so that it reads from the new fields.
5. Drop the unnecessary indexes and remove the old fields from the documents.
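A rough mongo-shell sketch of steps 2 and 3, reusing the FirstName/LastName to FullName example from the question (collection and field names are illustrative):

    // Step 2: backfill FullName on documents that don't have it yet.
    db.users.find({ FullName: { $exists: false } }).forEach(function (user) {
        db.users.update(
            { _id: user._id },
            { $set: { FullName: user.FirstName + ' ' + user.LastName } }
        );
    });
    // Step 3: index the new field; background builds avoid blocking on older servers.
    db.users.createIndex({ FullName: 1 }, { background: true });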
Changing the schema on a live database is never an easy process, no matter what database you use. It always requires some forward thinking and careful planning.
is indexing a pain?
Indexing is not a pain, but premature optimization is. You should always test and check that you actually need indexes before adding them and when you have them, check that they are being properly used.
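For example, explain() shows whether a query actually uses an index (an IXSCAN stage in the winning plan) or falls back to a full collection scan (COLLSCAN):

    db.users.find({ FullName: "Jane Doe" }).explain("executionStats");
    // In the output, winningPlan with IXSCAN means the index is used;
    // COLLSCAN means the whole collection was scanned.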
If you're worried about performance issues on a live system when creating indexes, then you should consider having replica sets and doing rolling maintenance (in short: taking secondaries down from replication, creating indexes on them, bringing them back into replication and then repeating the process for all the subsequent replica set members).
Edit
What I was describing is basically a process of migrating your schema to a new one while temporarily supporting both versions of the documents.
In step 1, you're basically adding support for multiple versions of documents. You're updating existing documents i.e. creating new fields, while you're reading data from the previous version fields. Step 2 is optional, because you can gradually update your documents as they are being saved.
In step 4 you're removing the support for the previous versions from your application code and migrating to a new version. Finally, in step 5 you're removing the previous version fields from your actual MongoDB documents.

Dynamic Data inside MongoDB

I'm building dynamic data storage; in a relational database it was modeled as:
form -> field
form -> form_data
field -> field_data
form_data -> field_data
field_data contains field_id, form_data_id, and value.
For scalability and performance I'm planning to move form_data and field_data to MongoDB, and my problem now is how to design the MongoDB collections. Should I use one collection for all form_data and move field_data into a map inside it, where the key is the field_id and the value is the value of that field_data? Or should I make a collection for each form record and store the data directly in form_data without a map, since all the data in that case will be consistent?
In document-oriented databases like MongoDB you should always favor aggregation over references, because they don't support JOIN-operations on the database to link multiple documents.
That means that a "A has many B" relationship between two entities should not be modeled as two tables A and B, but rather as one collection of A where each A has an embedded array of B objects.
In the context of MongoDB, there is however an additional limitation: MongoDB doesn't like objects which grow. When an object accumulates more and more data over its lifetime, it grows, which means the hard drive space MongoDB allocated for it will run out again and again and have to be reallocated. This costs performance and fragments the database. Also, MongoDB has an artificial size limit for documents, mainly to discourage developers from designing growing objects.
Therefore:
When the data exists at creation, embed it directly.
When more and more data is added after creation, put it into a different collection.
When a form has X fields, the number of fields will likely not change much over its lifetime. So you should embed the fields and their descriptions directly into the form object.
But the number of answers entered into these forms will grow over time, which means that these should be treated as separate objects in a separate collection.
So I would recommend you to have two collections, forms and form_data.
Each document in forms embeds a sub-object of fields with the static field properties.
Each document in form_data has a field with the _id of the corresponding form, and embeds a field_data sub-object that uses the same keys as the form's fields sub-object and stores the entries the user made in that form.
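For illustration, the two collections could look roughly like this (all names and values are made up):

    // A document in "forms": field definitions are embedded, they rarely change.
    {
        _id: "form_1",
        title: "Customer survey",
        fields: {
            f1: { label: "Name",  type: "text" },
            f2: { label: "Email", type: "email" }
        }
    }
    // A document in "form_data": one per submission, so the form never grows.
    {
        _id: "form_data_1",
        form_id: "form_1",          // reference to the form above
        field_data: {
            f1: "Jane Doe",         // same keys as forms.fields
            f2: "jane@example.com"
        }
    }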
When your use-case requires frequent access to aggregated data (like when you want to publish up-to-date statistics on a public website), you could also store that information in the fields of forms to avoid an expensive aggregation query over many form_data documents. In general, MongoDB recommends orienting your database schema around your performance requirements rather than the semantics of your data.
Regarding your remark "all data in this case will be consistent": keep in mind that MongoDB does not enforce referential integrity. When an application deletes or changes a document, it's the responsibility of the application to fix any outdated references to it in other documents.

What are the different approaches to versioning individual documents in MongoDB?

If I add a property to a domain object, this will ripple down to the MongoDB document. For example, adding a new "facebookId" property to my users, will add a new facebookId field to my MongoDB documents.
What are the different approaches to keep track of documents versions ?
I was thinking about adding a "_version" field to all my documents. Are there any other solutions ?
If you mean the document schema version, there are a few options:
You could add a version field to all documents, or just to those that are newer than the original schema (only add it when something changes from the original schema). Given that every byte counts in a BSON document, I'd keep it short and simple, like _v : 1. Depending on the consuming platform (Node.js, C#, Java, etc.), having a version field may not make deserialization simple, though.
Use a system that detects the version automatically from the presence of certain data fields (e.g., only version 2 has facebookId). For simple cases, this can be handled easily in a model class by changing object behavior based on the presence of the field. This is what I've done before, and it works well in many scenarios.
Update all documents in a collection to reflect new schema. This could be done all in one pass in a brief system downtime (depending on the number of documents).
As needed, update a document: when a document is read, rewrite it if it doesn't match the current schema. I've used this in non-MongoDB systems ... and it can work well (and it should work well in MongoDB too).
It makes things easier (especially #2 and #4) if the new field can be null/non-required.
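As a rough sketch of how #2 and #4 can combine (the _v field, collection name, and the name-split migration are all illustrative):

    var doc = db.users.findOne({ _id: userId });
    var version = doc._v || 1; // documents without _v are treated as version 1
    if (version < 2) {
        // v1 -> v2: hypothetical migration splitting "name" into two fields.
        var parts = (doc.name || '').split(' ');
        db.users.update(
            { _id: doc._id },
            {
                $set:   { _v: 2, first_name: parts[0], sur_name: parts.slice(1).join(' ') },
                $unset: { name: "" }
            }
        );
    }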

MongoDB - lookup/taxonomies

In my MongoDB database I have a number of document types that require look-up/taxonomy information. Typically I'm either holding just the Id of the look-up, or an Id plus the look-up info denormalised into the main document.
e.g.
task = {
    TaskDetail : "Some task",
    TaskPriority : { Id : xxxxx, Code : 'U', Description : 'Urgent' }
}
Moving from traditional relational databases (RDBMS), where I would just have a TaskPriority table, I was wondering what the best practice is when using documents within Mongo?
My initial thought was to have a taxonomy/look-up collection. Typically, look-ups and enums are short, so each could be a separate document in the collection? Or should I mirror what you'd typically do in an RDBMS and create a collection per look-up?
Can anybody point in me in the right direction?
Thanks in advance,
Sam
Referencing has advantages such as :
Better management of master data
De-duplication, hence fewer updates.
Disadvantages:
A look-up is a separate API call; there are no classic joins in MongoDB
DBRef or the $lookup aggregation stage can still be used, but it is cumbersome (see the sketch after this list); manual references are easier
Atomicity is at the document level, so look-ups can get out of sync for some time if the referenced look-up collection is updated
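For completeness, a $lookup done in the aggregation pipeline against a hypothetical task_priorities collection would look something like this; it works, but a manual reference plus a second query is often simpler:

    db.tasks.aggregate([
        { $lookup: {
            from:         "task_priorities", // the look-up collection
            localField:   "TaskPriorityId",  // manual reference stored on the task
            foreignField: "_id",
            as:           "priority"
        } },
        { $unwind: "$priority" } // one matching priority per task
    ]);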
Embedding:
On the other hand, embedding does not make sense for reference data, since whenever a look-up value gets updated you need to update all the documents that embed it. Keeping track of all the documents and keys that need reference-data updates is a huge exercise.
Secondly, embedded reference data cannot provide SCD (slowly changing dimension) history keeping. If referencing is used instead, you can achieve this with a version date in the reference collection.
Considering these points, I am inclined towards referencing. But my only doubt is: if I do not have the referenced value in my parent documents, how will search work? When a user searches with descriptive reference values, how will Mongo consult the reference data collection?