What does denormalizing mean? - mongodb

While reading the article http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 (section "Rules of Thumb: Your Guide Through the Rainbow") I came across the words "embedding" and "denormalizing":
One: favor embedding unless there is a compelling reason not to
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
I know embedding means nesting documents instead of writing separate tables/collections.
But I have no clue what denormalizing means.

Denormalization is the opposite of normalization, which is good practice when designing relational databases such as MySQL: you split one logical dataset into separate tables to reduce redundancy.
Because MongoDB does not join collections the way a relational database joins tables, you prefer to duplicate an attribute into two collections if the attribute is read often and updated rarely.
For example, say you want to store data about a person, including the person's gender:
In a SQL database you would create a table person and a table gender, store the gender's ID as a foreign key in the person table, and perform a join to get the full information. This way the values "male" and "female" each exist only once.
In MongoDB you would save the gender in every person document, so the values "male" and "female" can exist many times, but the lookup is faster because the data is kept together in one document of a collection.
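A minimal sketch of the gender example in Python, using plain dicts and lists in place of real tables and collections. The names genders, persons_normalized and persons_denormalized, and the sample people, are made up for illustration.

```python
# Normalized (relational style): the gender label is stored once and
# each person row holds only a foreign key to it.
genders = {1: "male", 2: "female"}
persons_normalized = [
    {"id": 10, "name": "Alice", "gender_id": 2},
    {"id": 11, "name": "Bob", "gender_id": 1},
]


def gender_of_normalized(person):
    # Second lookup, standing in for the SQL join.
    return genders[person["gender_id"]]


# Denormalized (document style): the label is copied into every person.
persons_denormalized = [
    {"_id": 10, "name": "Alice", "gender": "female"},
    {"_id": 11, "name": "Bob", "gender": "male"},
]


def gender_of_denormalized(person):
    # Everything needed is already in the document.
    return person["gender"]
```

Both functions return the same answer; the denormalized one just skips the extra lookup, at the cost of repeating the string in every document.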

Related

Storing primary key of Relational database in document of MongoDB

Suppose I have a table (customers) in Oracle with the columns customer_id (PK), customer_name, customer_email and customer_address, and a collection (products) in MongoDB that stores customer_id as one of its fields. Below is a sample document from the products collection. It stores the customer_id "customer123", which is a primary key in the customers table in the Oracle database.
{
    _id: "product124",
    customer_id: "customer123",
    product_name: "hairdryer"
}
My question is: is it a good idea to use different types of databases when one field, like customer_id here, is shared between them? Is it good practice in enterprise-level development?
Please ignore the use case, as I am just trying to give a simple example to provide better understanding of the problem.
I would say it is acceptable to use different databases in distributed systems and keep references between entities, but it really depends on the use case. If you plan to perform frequent and heavy joins between these two entities, then storing them in separate databases (especially of different types) might dramatically hurt your performance. However, if your use case does not require resolving these relations often, this approach could work. Bear in mind that you need to consider the future scale of your application and how this architectural decision would affect its potential growth.
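To make the cost concrete, here is a rough Python sketch of resolving the shared customer_id in application code. The two stores are plain in-memory stand-ins rather than real Oracle or MongoDB connections, and the customer name and email are invented sample data.

```python
# Stand-ins for the two stores; in a real system these would be an Oracle
# table and a MongoDB collection, each reached through its own driver.
oracle_customers = {
    "customer123": {
        "customer_name": "Jane Doe",
        "customer_email": "jane@example.com",
    },
}
mongo_products = [
    {"_id": "product124", "customer_id": "customer123", "product_name": "hairdryer"},
]


def product_with_customer(product_id):
    """Resolve the shared customer_id in the application, not in a database join."""
    product = next(p for p in mongo_products if p["_id"] == product_id)
    # Second round trip, to the other database.
    customer = oracle_customers.get(product["customer_id"])
    return {**product, "customer": customer}
```

Every resolved reference costs one extra round trip to the other system, which is why frequent, heavy joins across the two stores are the case to worry about.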

mongodb/mongoose - when to use subdocument and when to use new collection

I would like to know if there is a rule of thumb about when to use a new document and when to use a subdocument. In SQL databases I used to break all relations into separate tables following the rules of normalization and connect them with keys, but I can't find a good approach for MongoDB (I don't know how other NoSQL databases handle this).
Any help will be appreciated.
Kind regards.
Though there are no fixed rules, there are some general guidelines that are intuitive enough to follow when modeling data in NoSQL.
Nearly all 1-1 cases can be handled with sub-documents. For example: a user has an address. In all likelihood the address is unique to each user (in the context of your system, say a social website). So keeping the address in another collection would be a waste of space and queries. An address sub-document is the best choice.
Another example: hundreds of employees share the same building/address. In this case keeping it 1-1 is a poor use of space and will cost you a lot of updates whenever any of the addresses changes slightly, because the address is replicated across many employee documents as a sub-document. Therefore an address should have many employees, i.e. a 1-to-many relationship.
You must have noticed that in NoSQL there are multiple ways to represent a 1-to-many relationship:
1) Keep an array of references. Choose this if you're sure the array won't grow too big and, preferably, the document containing the array is not expected to be updated a lot.
2) Keep an array of sub-documents. A handy option if the sub-documents don't qualify for a separate collection and you don't run the risk of hitting the 16 MB document size limit. (thanks greyfairer for reminding!)
3) SQL-style foreign key. Use this if 1 and 2 are not good enough or you prefer this style over them.
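The three options can be sketched with the address/employee example from above. This is only an illustration with Python dicts; the field names (employee_ids, employees, address_id) and the sample values are assumptions, not a prescribed schema.

```python
# 1) Array of references: the address lists the ids of its employees.
address_with_refs = {
    "_id": "addr1",
    "street": "1 Main St",
    "employee_ids": ["e1", "e2"],
}

# 2) Array of sub-documents: the employees live inside the address document.
address_with_subdocs = {
    "_id": "addr1",
    "street": "1 Main St",
    "employees": [{"name": "Ann"}, {"name": "Raj"}],
}

# 3) SQL-style foreign key: each employee points back at its address.
employees = [
    {"_id": "e1", "name": "Ann", "address_id": "addr1"},
    {"_id": "e2", "name": "Raj", "address_id": "addr1"},
]


def employees_at(address_id):
    # With option 3 the "many" side is found by filtering on the key.
    return [e["name"] for e in employees if e["address_id"] == address_id]
```

With options 1 and 2 the address document grows with every employee; with option 3 it stays small and the work moves to the query.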
Modeling documents for retrieval, Document design considerations and Modeling Relationships from Couchbase (another noSql database) are really good reads and equally applicable to mongodb.

How can denormalization be attribute of NoSQL DB

When comparing NoSQL DBs with traditional RDBMSs, many articles say that in a NoSQL DB all related data is kept together, so joins are avoided and retrieving data is faster.
In short, the data is denormalized. Denormalization has downsides as well, e.g. redundancy, extra space, and the need to update data in multiple places.
But irrespective of the pros and cons of denormalization, it is a DB design attribute. How can it be attributed to a particular DB type?
If in a given case it is OK to denormalize data, then the same can be achieved in an RDBMS too.
So why is denormalization discussed as an attribute of NoSQL DBs?
You seem to be reading hype, instead of database design articles. You can denormalize any database. Yes, NoSQL is for cases where denormalized data is a good thing, for instance, in storing documents, where subdocuments are used instead of joins to another table. This works best when the subdocuments are not duplicated. Of course, if they are duplicated, then you have the usual problems of denormalized data.
Example: Person uses Car. In a relational database, you would have a Persons table and a Cars table and a junction table, perhaps "CarsUsedByPerson". In a NoSQL system, you might have a "car" document embedded within a "person" document.
Of course, if two people use the same car, then you have the same data in multiple places, and you'll need to update it in all such places, or it will be inconsistent.
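A small Python sketch of that duplication problem, with made-up people and a made-up car: the same car is embedded in two person documents, so a change has to be applied to every copy.

```python
people = [
    {"name": "Ann", "car": {"plate": "ABC-123", "color": "red"}},
    {"name": "Raj", "car": {"plate": "ABC-123", "color": "red"}},  # same car, second copy
]


def repaint(plate, new_color):
    """Update every embedded copy of the car; returns how many copies were touched."""
    touched = 0
    for person in people:
        if person["car"]["plate"] == plate:
            person["car"]["color"] = new_color
            touched += 1
    return touched
```

If the update misses one of the copies, the two person documents disagree about the car's color, which is exactly the inconsistency described above.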
NoSQL is for cases where you need the performance more than you need the consistency.
Seconding John Saunders that you can denormalize data in an RDBMS as well - denormalization is an attribute of most NoSQL databases ("most" meaning "excluding graph databases") because in many cases you MUST denormalize in order to get decent performance.
Continuing with his example, let's say that I've got a Person record, which has a foreign key to a Car record (one car per person in this example to simplify matters), which has a foreign key to a Manufacturer record. For a given person I want the record for that person, for their car, and for their car's manufacturer.
In an RDBMS I can normalize this data and retrieve it all in one query using a join, or I can denormalize this data - the denormalized read is going to be a bit cheaper than the normalized read because joins aren't free, but in this case the difference in read performance probably won't be significant.
My NoSQL database probably doesn't support joins so if I normalize this data then I'll have to do three separate lookups for it, e.g. using a key-value database I'd first retrieve the Person which contains a Car key, then I'd retrieve the Car which contains a Manufacturer key, then I'd retrieve the Manufacturer; if this data were denormalized then I'd only need one lookup, so the performance improvement would be significant. In the rare case that the NoSQL database does support joins then it is almost certainly location agnostic, so the Person, Car, and Manufacturer records might be on different servers or even in different data centers making for a very expensive join.
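The difference in lookups can be sketched with a toy key-value store in Python. CountingStore is a hypothetical helper written for this illustration, not the API of any real database, and the keys and records are invented.

```python
class CountingStore:
    """Tiny in-memory key-value store that counts how many get() calls were made."""

    def __init__(self, data):
        self._data = data
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self._data[key]


normalized = CountingStore({
    "person:1": {"name": "Ann", "car_key": "car:7"},
    "car:7": {"model": "Roadster", "manufacturer_key": "mfr:3"},
    "mfr:3": {"name": "Acme Motors"},
})
denormalized = CountingStore({
    "person:1": {
        "name": "Ann",
        "car": {"model": "Roadster", "manufacturer": {"name": "Acme Motors"}},
    },
})


def manufacturer_normalized(store, person_key):
    # Three separate lookups: person, then car, then manufacturer.
    person = store.get(person_key)
    car = store.get(person["car_key"])
    return store.get(car["manufacturer_key"])["name"]


def manufacturer_denormalized(store, person_key):
    # One lookup: the car and manufacturer are embedded in the person.
    return store.get(person_key)["car"]["manufacturer"]["name"]
```

Calling each function once leaves normalized.lookups at 3 and denormalized.lookups at 1; if each lookup is a network round trip, that gap is where the performance difference comes from.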
So an overly simplistic breakdown of your options is:
Traditional RDBMS, good with normalized data but difficult to scale out
NoSQL database, relatively easy to scale out but a bit crap with normalized data
Distributed OLAP database (e.g. Aster, Greenplum), relatively easy to scale out and good with normalized data but very expensive

Single Collection Inheritance or multiple collections?

Assume I have data on high school students across the country. The data of one high school is not related to that of any other and never needs to be (it is compartmentalized). Which of the following is recommended if I use MongoDB:
1) Create single collection inheritance with the following attributes:
high_school_id, student_id, name, address
2) Create multiple collections (possibly thousands) with the following attributes:
student_id, name, address
The name of collection will follow school_data_<X> format, where X is the high_school_id. So, to query, my program can dynamically construct the collection name.
I come from a MySQL/PostgreSQL background, where having thousands of tables is not common (so option (1) makes far more sense there). How is it in MongoDB?
I recommend you use the first option, because MongoDB has a limit on the number of collections. See the docs for more about this.
You may also want to consider a third option: create a collection of students in which each student's record includes the high school data. There is nothing wrong with duplicating data; you should not worry about that in MongoDB. Instead, think about the most convenient way of working with the data.
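For comparison, a rough Python sketch of how the first two options look from the program's side. The list stands in for a MongoDB collection and the sample students are invented; only the field names and the school_data_<X> naming come from the question.

```python
# Option 1: one collection, each document tagged with its school.
students = [
    {"high_school_id": 1, "student_id": "s1", "name": "Ann", "address": "1 Main St"},
    {"high_school_id": 2, "student_id": "s1", "name": "Raj", "address": "9 Oak Ave"},
]


def students_of(school_id):
    # In MongoDB this would be a query filtered on high_school_id,
    # ideally backed by an index on that field.
    return [s for s in students if s["high_school_id"] == school_id]


# Option 2: the program builds the collection name at query time.
def collection_name(school_id):
    return f"school_data_{school_id}"
```

Option 1 keeps the compartmentalization in a filter; option 2 moves it into the collection name, which is what runs into the limit mentioned above.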

Am I missing something about Document Databases?

I've been looking at the rise of the NoSql movement and the accompanying rise in popularity of document databases like mongodb, ravendb, and others. While there are quite a few things about these that I like, I feel like I'm not understanding something important.
Let's say that you are implementing a store application, and you want to store in the database products, all of which have a single, unique category. In Relational Databases, this would be accomplished by having two tables, a product and a category table, and the product table would have a field (called perhaps "category_id") which would reference the row in the category table holding the correct category entry. This has several benefits, including non-repetition of data.
It also means that if you misspelled the category name, for example, you could update the category table and then it's fixed, since that's the only place that value exists.
In document databases, though, this is not how it works. You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data, and errors are much more difficult to correct. Thinking about this more, doesn't it also mean that running queries like "give me all products with this category" can return results that lack integrity?
Of course the way around this is to re-implement the whole "category_id" thing in the document database, but when I get to that point in my thinking, I realize I should just stay with relational databases instead of re-implementing them.
This leads me to believe I'm missing some key point about document databases that sends me down this incorrect path. So I wanted to put it to Stack Overflow: what am I missing?
You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data [...]
True, denormalizing means storing additional data. It also means fewer collections (tables in SQL), and thus fewer relations between pieces of data. A single document can contain the information that would otherwise come from multiple SQL tables.
Now, if your database is distributed across multiple servers, it's more efficient to query a single server instead of multiple servers. With the denormalized structure of document databases, it's much more likely that you only need to query a single server to get all the data you need. With a SQL database, chances are that your related data is spread across multiple servers, making queries very inefficient.
[...] and errors are much more difficult to correct.
Also true. Most NoSQL solutions don't guarantee things such as referential integrity, which are common in SQL databases. As a result, your application is responsible for maintaining relations between data. However, as the number of relations in a document database is very small, it's not as hard as it may sound.
One of the advantages of a document database is that it is schema-less. You're completely free to define the contents of a document at all times; you're not tied to a predefined set of tables and columns as you are with a SQL database.
Real-world example
If you're building a CMS on top of a SQL database, you'll either have a separate table for each CMS content type, or a single table with generic columns in which you store all types of content. With separate tables, you'll have a lot of tables. Just think of all the join tables you'll need for things like tags and comments for each content type. With a single generic table, your application is responsible for correctly managing all of the data. Also, the raw data in your database is hard to update and quite meaningless outside of your CMS application.
With a document database, you can store each type of CMS content in a single collection, while maintaining a strongly defined structure within each document. You could also store all tags and comments within the document, making data retrieval very efficient. This efficiency and flexibility come at a price: your application carries more responsibility for the integrity of the data. On the other hand, scaling out a document database costs much less than scaling out a SQL database.
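As an illustration only, here is what such CMS documents might look like as Python dicts. The content types, field names and sample values are invented; the point is that tags and comments travel with the document.

```python
# Two content types share one collection; each keeps its own shape.
cms_content = [
    {
        "_id": "post-1",
        "type": "article",
        "title": "Launch day",
        "body": "We shipped.",
        "tags": ["news", "release"],
        "comments": [
            {"author": "Ann", "text": "Congrats!"},
            {"author": "Raj", "text": "Finally."},
        ],
    },
    {
        "_id": "gallery-1",
        "type": "gallery",
        "title": "Team photos",
        "images": ["a.jpg", "b.jpg"],
        "tags": ["team"],
        "comments": [],
    },
]


def summary(content_id):
    """One lookup returns the content together with its tags and comments."""
    doc = next(d for d in cms_content if d["_id"] == content_id)
    return {
        "title": doc["title"],
        "tags": doc["tags"],
        "comment_count": len(doc["comments"]),
    }
```

No join tables for tags or comments are needed, but nothing except the application stops a document from being stored without a title or with malformed comments.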
Advice
As you can see, both SQL and NoSQL solutions have advantages and disadvantages. As David already pointed out, each type has its uses. I recommend that you analyze your requirements and create two data models, one for a SQL solution and one for a document database. Then choose the solution that fits best, keeping scalability in mind.
I'd say that the number one thing you're overlooking (at least based on the content of the post) is that document databases are not meant to replace relational databases. The example you give does, in fact, work really well in a relational database. It should probably stay there. Document databases are just another tool to accomplish tasks in another way, they're not suited for every task.
Document databases were made to address the problem that (looking at it the other way around), relational databases aren't the best way to solve every problem. Both designs have their use, neither is inherently better than the other.
Take a look at the Use Cases on the MongoDB website: http://www.mongodb.org/display/DOCS/Use+Cases
A document db gives a feeling of freedom when you start. You no longer have to write create table and alter table scripts. You simply embed details in the master 'records'.
But after a while you realize that you are locked in a different way. It becomes less easy to combine or aggregate the data in a way that you didn't think was needed when you stored the data. Data mining/business intelligence (searching for the unknown) becomes harder.
That means that it is also harder to check if your app has stored the data in the db in a correct way.
For instance, say you have two collections with approximately 10000 'records' each. Now you want to know which ids are present in 'table' A but not in 'table' B.
Trivial with SQL, a lot harder with MongoDB.
But I like MongoDB !!
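One application-side way to answer the "in A but not in B" question is to pull the ids from both collections and compare them in code. The sketch below uses Python lists of dicts as stand-ins for the two collections and tiny made-up id ranges instead of 10000 records.

```python
collection_a = [{"_id": i} for i in range(1, 8)]    # ids 1..7
collection_b = [{"_id": i} for i in range(4, 11)]   # ids 4..10


def ids_only_in_a(a_docs, b_docs):
    """Return the ids that appear in a_docs but not in b_docs, sorted."""
    b_ids = {doc["_id"] for doc in b_docs}
    return sorted(doc["_id"] for doc in a_docs if doc["_id"] not in b_ids)
```

This works, but it moves both id sets into application memory, whereas a relational database would answer the same question with a single query.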
OrientDB, for example, supports schema-less, schema-full and mixed modes. In some contexts you need constraints, validation, etc., but you also want the flexibility to add fields without touching the schema. This is the schema-mixed mode.
Example:
{
'#rid': 10:3,
'#class': 'Customer',
'#ver': 3,
'name': 'Jay',
'surname': 'Miner',
'invented': [ 'Amiga' ]
}
In this example the fields "name" and "surname" are mandatory (they are defined in the schema), but the field "invented" has been created only for this document. Your app does not need to know about it, but you can still execute queries against it:
SELECT FROM Customer WHERE invented IS NOT NULL
It will return only the documents with the field "invented".
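The idea of a mixed mode can be imitated in a few lines of Python. This is not OrientDB's API: MANDATORY, validate and where_invented_is_not_null are hypothetical names, and the second customer is invented sample data. Mandatory fields are checked, anything extra is simply allowed.

```python
# Hypothetical schema: the fields every document of a class must carry.
MANDATORY = {"Customer": {"name", "surname"}}


def validate(doc):
    """Reject documents that lack a mandatory field; extra fields are allowed."""
    missing = MANDATORY[doc["class"]] - set(doc)
    if missing:
        raise ValueError(f"missing mandatory fields: {sorted(missing)}")
    return doc


customers = [
    validate({"class": "Customer", "name": "Jay", "surname": "Miner", "invented": ["Amiga"]}),
    validate({"class": "Customer", "name": "Ann", "surname": "Lee"}),
]


def where_invented_is_not_null(docs):
    # Rough equivalent of: SELECT FROM Customer WHERE invented IS NOT NULL
    return [d for d in docs if d.get("invented") is not None]
```

Only the first customer carries the extra field, so only that document comes back from the filter, while both passed the mandatory-field check.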