MongoDB beginner - to normalize or not to normalize?

I'm going to try to make this as straightforward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses.
Normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web app around separate collections with unique IDs, much like a relational database (normalized), or, to enjoy quick selections, to nest related objects and data within a single collection (denormalized).
Back to our real-estate houses list: the query used to populate it is quite expensive in a normal relational DB. For each house you need to query its images, reviews, owner and agencies, and each entity resides in a different table with its own fields. You'd probably use joins and have multiple queries joined into one. Expensive!
Enter MongoDB, where you don't need joins and you can store all the related data of a house in a house item in the houses collection. Selection was never faster; it's DB heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me. If I had to guess, each related entity exists in its own collection in addition to its data being embedded in the houses collection, and whenever one of these pieces of related data is added/updated/deleted you have to update it in its own collection as well as in the houses collection. Upon this update, do I need to query the other collections as well to make sure I'm updating the house record with all the updated related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar

Try this approach:
Work out which entity (or entities) are the hero(s)
By 'hero' I mean the entity (or entities) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter?
Your house collection will probably look like this:
house: {
    owner,
    agency,
    images[],  // recommend references to GridFS here
    reviews[]  // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad for the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in), so just consider that.
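To make the ownership question concrete, here is a minimal sketch using plain Python dictionaries instead of a real MongoDB driver. The field names and the helpers `add_review` and `rename_agency` are hypothetical, invented for illustration only. It contrasts data the house owns (reviews) with data that is shared and therefore duplicated across houses (the agency).

```python
# Hypothetical in-memory stand-in for a "houses" collection; plain dicts only.
houses = [
    {
        "_id": 1,
        "owner": {"name": "Alice"},
        "agency": {"agency_id": "a1", "name": "Acme Realty"},
        "images": ["img-001", "img-002"],  # references to files stored elsewhere
        "reviews": [],
    },
    {
        "_id": 2,
        "owner": {"name": "Bob"},
        "agency": {"agency_id": "a1", "name": "Acme Realty"},  # duplicated agency data
        "images": [],
        "reviews": [],
    },
]


def add_review(house, author, text):
    """Owned data: a review lives only inside its house, so one document changes."""
    house["reviews"].append({"author": author, "text": text})


def rename_agency(all_houses, agency_id, new_name):
    """Shared data: the agency is copied into many houses, so every copy must change."""
    touched = 0
    for house in all_houses:
        if house["agency"]["agency_id"] == agency_id:
            house["agency"]["name"] = new_name
            touched += 1
    return touched


add_review(houses[0], "Carol", "Lovely garden")
updated = rename_agency(houses, "a1", "Acme Real Estate")
print(updated)  # number of house documents that had to be rewritten
```

Adding a review touches one document, while renaming the agency touches every house that embeds it. Whether that second cost matters depends on how often agency details actually change.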

Sarah Mei wrote an informative article about the kinds of data-integrity issues that can arise in NoSQL databases. It covers the choice between duplicating data and using IDs with code-based joins, and the challenges of keeping data consistent either way. Her take is that any NoSQL database with code-based joins will lose data integrity at some point. IMHO the article's comments are as valuable as the article itself for understanding these issues and possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/

I would just like to give a normalization refresher from MongoDB's perspective.
What are the goals of normalization?
Free the database from modification anomalies - In MongoDB, embedding data is the most likely cause of these. We should therefore avoid embedding data in documents in ways that could create such anomalies. Occasionally we might need to duplicate data in documents for performance reasons. However, that's not the default approach. The default is to avoid it.
Minimize re-design when extending - MongoDB is flexible here because it allows keys to be added without re-designing all the documents.
Avoid bias toward any particular access pattern - This is something we're not going to worry about when describing a schema in MongoDB. One of the ideas behind MongoDB is to tune your database to the application you're trying to write and the problem you're trying to solve.

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case. That will give you a better picture.
Some decisions are required, as listed below.
Decide the number of documents you will have in the near future. If you expect 1m documents in a year, then test with at least 3m documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes and many indices, a single monolithic collection will be slower, as multiple indices need to be updated on each write.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance; it depends purely on the decisions above. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it once you find that queries are slow.
If you have many writes on the same document, the writes will be executed sequentially. That will slow down your reads as well.
Think about reclaiming disk space when a lot of data is cleared from the collections. Multiple collections do well here.
The point that pushes me to suggest a monolithic collection is that you need to query all the categories at the same time. You may need to add more categories, but combining all of them in a single response would not be better in terms of performance.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modelling point of view. I doubt you would have a join key.
If any of my points are incorrect, please let me know.
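As a rough illustration of the single-collection option, the sketch below uses a plain Python list in place of a real collection. The sample data and the helper names `find_in_category` and `search_all` are hypothetical. It only shows the shape of the two query patterns from the question and says nothing about real-world performance, which depends on indices and data volume.

```python
# Hypothetical single "products" collection; the category is just a field.
products = [
    {"_id": 1, "category": "phones", "name": "Pixel", "screen_inches": 6.1},
    {"_id": 2, "category": "books", "name": "Pixel Art Handbook", "pages": 240},
    {"_id": 3, "category": "phones", "name": "Galaxy", "screen_inches": 6.7},
]


def find_in_category(docs, category):
    """Query a single category by filtering on the category field."""
    return [doc for doc in docs if doc["category"] == category]


def search_all(docs, term):
    """Search every category at once: one pass over the single collection."""
    return [doc for doc in docs if term.lower() in doc["name"].lower()]


phones = find_in_category(products, "phones")
pixel_hits = search_all(products, "pixel")
```

Note that each document still carries its own category-specific fields (`screen_inches`, `pages`), so a single collection does not force every product into one shape.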
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from this article is that MongoDB is not a good fit for every type of data.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only OK if you have few updates or few documents; it does not hold up when both are numerous.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
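The second strategy (references plus a code-side join) can be sketched as follows. This is plain Python over in-memory dictionaries, not a real driver call. The collections, the field names and the helper `load_products_with_category` are assumptions made up for the example, and the dictionary lookup stands in for the extra database round trip.

```python
# Hypothetical in-memory collections keyed by id; stand-ins for two database calls.
categories = {
    "c1": {"name": "Laptops", "warranty_months": 24},
}
products = [
    {"_id": "p1", "title": "Ultrabook", "category_id": "c1"},
    {"_id": "p2", "title": "Gaming laptop", "category_id": "c1"},
]


def load_products_with_category(product_docs, category_docs):
    """Code-side join: resolve each product's category_id with a second lookup."""
    joined = []
    for product in product_docs:
        category = category_docs[product["category_id"]]  # the "second db call"
        joined.append({**product, "category": category})
    return joined


# Updating the shared field once is enough, because products only hold a reference.
categories["c1"]["warranty_months"] = 36
result = load_products_with_category(products, categories)
```

The trade-off is visible even at this scale: the common field is updated in one place, but every read pays for the extra lookup.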
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

MongoDB database design with addresses

I'm trying to figure out how to design my database in terms of addresses. For example, I have real estate agents who would be inputting property information. I'm using MongoDB. Is the best approach to separate the address out by city, state and zip, with IDs in each of these collections that point to the property ID? Or should I just include it in the property collection itself?
If I'm not mistaken, with MySQL you would separate it out, but I'm not sure of the best approach with non-relational databases.
Also, wouldn't updating be horrendous in the controller? I'm using a MEAN stack.
I've drawn out some possibilities, if anybody has any thoughts:
Picture one: adding the address to the property collection itself
Picture two: separating out the address, city, state, etc. as separate collections
You are right, SQL would be as you drew it, but in a non-relational DB the goal is to have almost all the info in one document. Some data replication is not an issue, as memory is cheap nowadays (processing power is expensive, more or less). So, if you keep everything in one place, you only need a single query to retrieve all the info about a property, plus many more benefits.
To answer your question: your approach is a big no in MongoDB. You will create a living nightmare for yourself, as joins in Mongo are "almost nonexistent".
The special case is if your addresses will be repeated a lot. Then you can make an Addresses collection and a Properties collection and store the address ID in the property.
You can find more info on modeling relations here:
https://docs.mongodb.com/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/
Enjoy your day!
EDIT: ahh I didn't see. Go for picture one =)

Suitability of MongoDB for certain use case?

I'm considering using MongoDB for a web application, but I have read that there are some situations where it is not recommended. I am wondering whether my project would be one of those situations.
Here's a brief outline of the entities in my system -
There are Users with standard user detail attributes
Each User can have many Receipts, a Receipt can have only one User
A Receipt contains many products
Products have standard product detail attributes
Users can have many Friends, and each Friend is a User themselves
Reviews can be given by Users for Products
There will be a Reputation system like the one here on stackoverflow where users earn Points and Badges
As you can see there are lots of entities that have various relationships with each other. And data integrity is important. Is this type of schema suitable for MongoDB?
Your data seems to be very relational, not document-oriented.
You are also pointing out data integrity as an important requirement (I assume you mean referential integrity between different entities). When entities are not stored in the same document, it is hard to guarantee integrity between them in MongoDB. Also, MongoDB can't enforce any constraints (except for uniqueness of field values).
You also seem to have a very relational mindset overall, which will make it hard for you to use MongoDB the way it was meant to be used.
For these reasons I would recommend you to stick to a relational database.
The reason why you considered using MongoDB in the first place is likely because you heard that it is fast and that it scales well. Yes, it is fast and it does scale well, but only because it doesn't enforce referential integrity and because it doesn't do table joins. When you need these features and thus have to find ugly workarounds to simulate them, MongoDB won't be that fast anymore.

Many-to-many relationship in NoSQL

I am trying to figure out how to best implement this for my system...and get my head out of the RDBMS space for now...
A part of my current DB has three tables: Show, ShowEntry, and Entry. Basically ShowEntry is a many-to-many joining table between Show and Entry. In my RDBMS thinking it's quite logical since any changes to Show details can be done in one place, and the same with Entry.
What's the best way to reflect this in a document-based storage? I'm sure there is no one way of doing this but I can't help but think if document-based storage is appropriate for this case at all.
FYI, I am currently considering implementing RavenDB. While discussions on general NoSQL design will be good, a more RavenDB-focused one will be fantastic!
Thanks,
D.
When modelling a many-to-many relationship in a document database, you usually store a collection of foreign keys in just one of the documents. The document you choose largely depends on the direction in which you intend to traverse the relationship. Traversing it one way is trivial; traversing it the other way requires an index.
Take the shopping basket example. It's more important to know exactly which items are in a particular basket than which baskets contain a particular item. Since we're usually following the relationship in the basket-to-item direction, it makes more sense to store item IDs in a basket than it does to store basket IDs in an item.
You can still traverse the relationship in the opposite direction (e.g. find baskets containing a particular item) by using an index, but the index will be updated in the background so it won't always be 100% accurate. (You can wait for the index to become accurate with WaitForNonStaleResults, but that delay will show in your UI.)
If you require immediate 100% accuracy in both directions, you can store foreign keys in both documents, but your application will have to update two documents whenever a relationship is created or destroyed.
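The basket example can be sketched in plain Python. This is only an illustration of the idea, not how RavenDB builds or maintains its indexes. The document shapes and the helpers `items_in_basket` and `build_item_index` are hypothetical. The last few lines mimic a stale index: the data changes, but the previously built index does not see it until it is rebuilt.

```python
from collections import defaultdict

# Hypothetical basket documents: each basket stores the ids of its items.
baskets = [
    {"_id": "b1", "item_ids": ["i1", "i2"]},
    {"_id": "b2", "item_ids": ["i2"]},
]


def items_in_basket(basket):
    """Forward direction: trivial, the ids are already in the document."""
    return basket["item_ids"]


def build_item_index(all_baskets):
    """Reverse direction: an index from item id to the baskets that contain it."""
    index = defaultdict(list)
    for basket in all_baskets:
        for item_id in basket["item_ids"]:
            index[item_id].append(basket["_id"])
    return index


index = build_item_index(baskets)
baskets_with_i2 = index["i2"]

# If a basket changes after the index was built, the index is stale until rebuilt.
baskets[1]["item_ids"].append("i1")
stale_answer = index["i1"]  # still reflects the old state
fresh_answer = build_item_index(baskets)["i1"]
```

Storing the keys on only one side keeps a single source of truth; the price is that the reverse lookup is only as current as the index.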
This went a long way towards solving my question!
Answer to the question
Many-to-many relationships in NoSQL are implemented via an array of references on one of the entities.
You've got two options:
Show has an array of Entry items;
Entry has an array of Shows.
Location of the array is determined by the most common direction of querying. To resolve records in the other direction - index the array (in RavenDB it works like a charm).
Usually, having two arrays on both entities pointing to each other brings more grief than joy. You're losing the single source of truth in an eventually consistent environment... it has potential for breaking data integrity.
Check out this article - Entity Relationships in NoSQL (one-to-many, many-to-many). It covers entity relationships from various angles, taking into account performance, operational costs, time/costs of development and maintenance... and provides examples for RavenDB.

Am I missing something about Document Databases?

I've been looking at the rise of the NoSql movement and the accompanying rise in popularity of document databases like mongodb, ravendb, and others. While there are quite a few things about these that I like, I feel like I'm not understanding something important.
Let's say that you are implementing a store application, and you want to store in the database products, all of which have a single, unique category. In Relational Databases, this would be accomplished by having two tables, a product and a category table, and the product table would have a field (called perhaps "category_id") which would reference the row in the category table holding the correct category entry. This has several benefits, including non-repetition of data.
It also means that if you misspelled the category name, for example, you could update the category table and then it's fixed, since that's the only place that value exists.
In document databases, though, this is not how it works. You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data, and errors are much more difficult to correct. Thinking about this more, doesn't it also mean that running queries like "give me all products with this category" can lead to results that do not have integrity?
Of course the way around this is to re-implement the whole "category_id" thing in the document database, but when I get to that point in my thinking, I realize I should just stay with relational databases instead of re-implementing them.
This leads me to believe I'm missing some key point about document databases that leads me down this incorrect path. So I wanted to put it to Stack Overflow: what am I missing?
"You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data [...]"
True, denormalizing means storing additional data. It also means fewer collections (tables in SQL), and therefore fewer relations between pieces of data. Each single document can contain the information that would otherwise come from multiple SQL tables.
Now, if your database is distributed across multiple servers, it's more efficient to query a single server instead of multiple servers. With the denormalized structure of document databases, it's much more likely that you only need to query a single server to get all the data you need. With a SQL database, chances are that your related data is spread across multiple servers, making queries very inefficient.
"[...] and errors are much more difficult to correct."
Also true. Most NoSQL solutions don't guarantee things such as referential integrity, which are common to SQL databases. As a result, your application is responsible for maintaining relations between data. However, as the number of relations in a document database is very small, it's not as hard as it may sound.
One of the advantages of a document database is that it is schema-less. You're completely free to define the contents of a document at all times; you're not tied to a predefined set of tables and columns as you are with a SQL database.
Real-world example
If you're building a CMS on top of a SQL database, you'll either have a separate table for each CMS content type, or a single table with generic columns in which you store all types of content. With separate tables, you'll have a lot of tables. Just think of all the join tables you'll need for things like tags and comments for each content type. With a single generic table, your application is responsible for correctly managing all of the data. Also, the raw data in your database is hard to update and quite meaningless outside of your CMS application.
With a document database, you can store each type of CMS content in a single collection, while maintaining a strongly defined structure within each document. You could also store all tags and comments within the document, making data retrieval very efficient. This efficiency and flexibility comes at a price: your application is more responsible for managing the integrity of the data. On the other hand, the price of scaling out with a document database is much less, compared to a SQL database.
Advice
As you can see, both SQL and NoSQL solutions have advantages and disadvantages. As David already pointed out, each type has its uses. I recommend that you analyze your requirements and create two data models, one for a SQL solution and one for a document database. Then choose the solution that fits best, keeping scalability in mind.
I'd say that the number one thing you're overlooking (at least based on the content of the post) is that document databases are not meant to replace relational databases. The example you give does, in fact, work really well in a relational database. It should probably stay there. Document databases are just another tool to accomplish tasks in another way, they're not suited for every task.
Document databases were made to address the problem that (looking at it the other way around), relational databases aren't the best way to solve every problem. Both designs have their use, neither is inherently better than the other.
Take a look at the Use Cases on the MongoDB website: http://www.mongodb.org/display/DOCS/Use+Cases
A document db gives a feeling of freedom when you start. You no longer have to write create table and alter table scripts. You simply embed details in the master 'records'.
But after a while you realize that you are locked in a different way. It becomes less easy to combine or aggregate the data in a way that you didn't think was needed when you stored the data. Data mining/business intelligence (searching for the unknown) becomes harder.
That means that it is also harder to check if your app has stored the data in the db in a correct way.
For instance, say you have two collections, each with approximately 10000 'records'. Now you want to know which ids are present in 'table' A that are not present in 'table' B.
Trivial with SQL, a lot harder with MongoDB.
But I like MongoDB !!
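When the check has to be done application-side, it amounts to a set difference over the ids of the two collections. The sketch below uses plain Python lists as hypothetical stand-ins for the two collections, and `ids_missing_from` is an invented helper name. It assumes both id lists fit comfortably in memory, which is reasonable for roughly 10000 documents each.

```python
# Hypothetical documents from two collections; only the ids matter here.
collection_a = [{"_id": 1}, {"_id": 2}, {"_id": 3}]
collection_b = [{"_id": 2}, {"_id": 3}, {"_id": 4}]


def ids_missing_from(source_docs, other_docs):
    """Return the ids present in source_docs that do not appear in other_docs."""
    other_ids = {doc["_id"] for doc in other_docs}
    return sorted(doc["_id"] for doc in source_docs if doc["_id"] not in other_ids)


only_in_a = ids_missing_from(collection_a, collection_b)
```

The point of the original complaint still stands: this logic lives in the application rather than in a single declarative query.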
OrientDB, for example, supports schema-less, schema-full or mixed mode. In some contexts you need constraints, validation, etc., but you also need the flexibility to add fields without touching the schema. This is the schema-mixed mode.
Example:
{
    '#rid': 10:3,
    '#class': 'Customer',
    '#ver': 3,
    'name': 'Jay',
    'surname': 'Miner',
    'invented': [ 'Amiga' ]
}
In this example the fields "name" and "surname" are mandatory (because they are defined in the schema), but the field "invented" has been created only for this document. Your app doesn't need to know about it, but you can still execute queries against it:
SELECT FROM Customer WHERE invented IS NOT NULL
It will return only the documents with the field "invented".