MongoDB: where is the limit between "few" and "many"?

I am coming from the relational database world (Rails / PostgreSQL) and transitioning to the NoSQL world (Meteor / MongoDB), so I am learning about denormalization, embedding and true links.
It seems that, in many cases, choosing between various database schemas comes down to the number of documents that will be "related" to each other.
In this video series, the author distinguishes:
one-to-many relationships from one-to-few relationships
many-to-many relationships from few-to-few relationships
So, I am wondering: where is the limit between few and many?
I guess there may not be a hard number, but are we in the dozens, the hundreds, the thousands or the millions?

It's all relative and is really kind of a dangerous question to make assumptions about when you're designing an architecture. It's worth investing time to make the right choices for your schema and your setup. I would advise a few steps:
Do the math. Multiply your relationships out based on what you expect your application to need to do. If you have a few nested arrays or embedded documents, a couple of "one-to-few" relationships can expand into many documents pretty easily once you start $unwinding them (see the sketch after these steps).
Write a prototype. Do some basic testing on your expected hardware/environment to see if it can easily handle that load when you do queries for all the data.
Based on your testing, create the limitations. This is where you need to draw the line on how many relations you can create per document, for each relationship type, before the system breaks down.
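To make the $unwind point concrete, here is a minimal mongo-shell sketch (the collection and field names are hypothetical):

// A hypothetical "orders" collection where each order embeds a few line items.
db.orders.insertOne({
  _id: 1,
  customer: "Alice",
  items: [
    { sku: "A-100", qty: 2 },
    { sku: "B-200", qty: 1 },
    { sku: "C-300", qty: 5 }
  ]
})

// $unwind emits one document per array element, so 10,000 orders with
// ~10 items each become ~100,000 documents flowing through the pipeline.
db.orders.aggregate([ { $unwind: "$items" } ])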
If it were me, I would say one-to-few is less than a dozen, and one-to-many is theoretically unlimited, but practically in the millions. Maybe there should be a middle ground of "one-to-some" to indicate possibly hundreds.

Taken from 6 rules of thumb for MongoDB schema design:
one-to-few - two up to a few hundred
one-to-many - a couple of hundred up to a few thousand
one-to-squillions - thousands and up
I totally agree with #womp about the need to choose the right schema for your use case. The article I posted above has some good guidelines and examples of which schema design to use.
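For illustration, here is a rough mongo-shell sketch of the three patterns, loosely following the article's examples (all names are hypothetical):

// One-to-few: embed the related data directly in the parent document.
db.persons.insertOne({
  name: "Kate Monster",
  addresses: [
    { street: "123 Sesame St", city: "Anytown" },
    { street: "123 Avenue Q", city: "New York" }
  ]
})

// One-to-many: the parent holds an array of references to child documents.
const partId = db.parts.insertOne({ partno: "123-aff-456" }).insertedId
db.products.insertOne({ name: "Widget", parts: [partId] })

// One-to-squillions: each child references its parent instead, so no array
// in the parent can grow without bound.
const hostId = db.hosts.insertOne({ name: "goofy.example.com" }).insertedId
db.logmsgs.insertOne({ host: hostId, time: new Date(), msg: "cpu is on fire!" })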

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search for a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be optimal in terms of performance and speed, multiple collections or a single collection + sharding?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case; then you will have a better picture.
Some decisions are required, as below:
Decide the number of documents you will have in the near future. If you expect 1M documents within a year, then test with at least 3M.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on these requirements:
If you have many writes and many indexes, then a single monolithic collection will be slower, as multiple indexes need to be updated on every write.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections (see the sketch after this list), but do check the performance; it depends entirely on the decisions above. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it once you find that queries are slow.
If you have many writes to the same document, the writes will be executed sequentially. That will slow down your reads as well.
Think about reclaiming disk space when a lot of data is deleted from the collections. Multiple collections do well here.
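To illustrate the $unionWith point above, a minimal sketch with hypothetical per-category collections (MongoDB 4.4+):

// Search one brand across two category collections and sort the combined results.
db.laptops.aggregate([
  { $match: { brand: "Acme" } },
  { $unionWith: { coll: "phones", pipeline: [ { $match: { brand: "Acme" } } ] } },
  { $sort: { price: 1 } }
])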
The point that pushes me toward a monolithic collection is your need to query all the categories at the same time. You may need to add more categories later, and combining all of them into a single response from multiple collections would not perform well.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modeling point of view. I doubt you have a join key.
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL, but some data is definitely a better fit for that model than other data.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read this through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only OK if updates are rare or the number of documents is small; it breaks down once you have many updates across many documents.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls; see the sketch below)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
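A minimal sketch of the reference-based strategy in the mongo shell, assuming hypothetical products and categories collections where each product stores a categoryId reference:

// Code-side join: two separate queries, stitched together by the application.
const product = db.products.findOne({ sku: "A-100" })
const category = db.categories.findOne({ _id: product.categoryId })

// Since MongoDB 3.2 there is also a server-side left outer join via $lookup:
db.products.aggregate([
  { $lookup: { from: "categories", localField: "categoryId", foreignField: "_id", as: "category" } }
])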
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

Mongodb ways of implementing many to many?

I started reading up on MongoDB (which got me very excited), and as I understand it, one of its flaws is the self-explanatory lack of relations, especially when it comes to many-to-many relationships that are large or keep growing on both sides.
As I read around, the best way to avoid ever-growing arrays inside a document is either to mitigate it by creating buckets of documents and then referencing the buckets (which does not guarantee total prevention of overgrowth), or to have both documents reference a third many-to-many document.
Since I could not find a definitive answer to this dilemma, or at least one that isn't a few years old, could someone explain whether this is a dead end (in case the project uses a few big, ever-growing many-to-many relationships) and I should switch to an RDBMS?
It depends on your use case.
The main question is: do you actually know why you want to use MongoDB in the first place? Hopefully the reason is not just the trend. RDBMSs are still relevant and have their own use cases. For some applications an RDBMS is the way to go; for some it isn't.
Now back to your original question about many-to-many relations. As you have already researched, there are ways to model those relationships in MongoDB, so that doesn't disqualify MongoDB as a database on its own. For example, do you need transactionality or referential integrity checks when you insert or delete records for those many-to-many relationships? If the answer is yes, then MongoDB may not be the perfect fit for your case.
When I first started working with MongoDB, this exact question crossed my mind, and while searching for the answer I read something very interesting (I wish I still had the link for you, but unfortunately I don't).
Think of a real-world problem where you have a many-to-many relation that just keeps on growing; there may be only very exceptional cases of that kind.
Let's say many students are registered for many courses. Now, a course may be registered for by 100 students, but a student almost certainly won't register for 100 courses, so in the student collection you can simply keep an array field of registered course IDs.
Let's dig deeper and say there is a bunch of super-brilliant students who actually registered for 100 courses. In such a scenario an array field may not be a viable solution. Then what? How about a collection that just has student_id and course_id? That exists in the RDBMS world too.
So the workarounds available should be enough to design an optimized solution for probably the most complex of scenarios; a sketch of both patterns follows.
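A rough mongo-shell sketch of both patterns, with hypothetical students and registrations collections and made-up IDs:

// Pattern 1: an array of course IDs on the student document
// (fine while the array stays small and bounded).
db.students.insertOne({ _id: "s1", name: "Ada", courseIds: ["c101", "c205"] })

// Pattern 2: a junction collection, much like a join table in an RDBMS,
// for relationships that can grow without bound on both sides.
db.registrations.insertOne({ studentId: "s1", courseId: "c101" })

// All courses for a student:
db.registrations.find({ studentId: "s1" })
// All students in a course:
db.registrations.find({ courseId: "c101" })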

Is there a graph-document database that supports geospatial queries?

Here's the rundown of what I need:
A graph database
Each node is a document; there will be hundreds of types of nodes; each of these several hundred types will have its own consistent schema.
Can scale to billions of nodes
Each node also has a (lat, lng) coordinate in addition to the edges between nodes
I want to use (lat,lng) as a shard key so this can be scaled to a large sharded, replicated cluster. Edge traversals will occur ~95% within nearby (lat,lng) locations.
I want to be able to issue geo+document queries. For example "Show me all the graph nodes/documents matching this query { ... } ordered by distance from (lat_0, lng_0)"
I want something that's well-documented, has an active developer community, is recommended for production use, and likely to be around for years.
Here are problems with existing databases:
MongoDB: no graph support, no joins
Neo4j: no sharding
OrientDB: no geospatial indexing
ArangoDB: can do WITHIN queries but cannot have additional query clauses (e.g. MongoDB's geoNear has a query parameter)
Is there anything that fits my use case?
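To illustrate the kind of geo+document query I mean, in MongoDB's aggregation syntax it would look roughly like this (hypothetical nodes collection with a 2dsphere index on location):

db.nodes.aggregate([
  { $geoNear: {
      near: { type: "Point", coordinates: [lng_0, lat_0] }, // the point from above
      distanceField: "distance",
      query: { type: "restaurant", rating: { $gte: 4 } },   // the document part of the query
      spherical: true
  } }
])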
Would you like a unicorn and a machine that prints an unlimited number of $100 bills to go along with that? Har har har....
OK but seriously, you've got a tall order there. You're going to need a custom system that blends a few of those things together. For one, as you observe, there's really no such thing as a "graph/document" database.
As a general area of systems research, many people are looking into hybrid systems. An example would be that you maintain your graph structure in neo4j, and that the IDs of nodes in neo4j point to identifiers for documents in MongoDB. In this way, you'd have a graph/document database, but it would really be two databases. Such hybrid systems are rife with tradeoffs. For one, writing a query across both systems will be extremely difficult. For two, you'll introduce data dependencies across them, such that it might not be easy to update your graph structure without changing your documents, or vice versa.
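As a deliberately simplified illustration of such a hybrid, here is a sketch using the official neo4j-driver and mongodb Node.js packages; the collection names and the mongoDocId property are hypothetical:

const neo4j = require("neo4j-driver")
const { MongoClient } = require("mongodb")

async function documentsForNeighbors(nodeId) {
  const driver = neo4j.driver("bolt://localhost:7687", neo4j.auth.basic("neo4j", "password"))
  const mongo = await MongoClient.connect("mongodb://localhost:27017")
  try {
    // 1. Traverse the graph in Neo4j; each node carries a mongoDocId
    //    property pointing at its document in MongoDB.
    const session = driver.session()
    const result = await session.run(
      "MATCH (n {id: $id})-[:LINKS_TO]->(m) RETURN m.mongoDocId AS docId",
      { id: nodeId }
    )
    const docIds = result.records.map(r => r.get("docId"))
    await session.close()

    // 2. Fetch the matching documents from MongoDB in a second round trip.
    return mongo.db("app").collection("documents").find({ _id: { $in: docIds } }).toArray()
  } finally {
    await driver.close()
    await mongo.close()
  }
}

Note how every query now spans two systems and two network round trips, which is exactly the tradeoff described above.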
For really intense performance requirements, hybrid systems are sometimes the only way to go. But just as a rule of thumb, for every 100 times you see someone say they need such a solution, probably 80 times they're better off with picking just one database and then living with the pros and the cons that it provides to them. Technology is ultimately about choices, pros, and cons, and learning to live with what you've picked. :)
To give you a succinct answer to the question you've asked, no there's nothing that does all of that. I'd recommend you work with an architect or consultant who can explore your requirements in depth, and make a recommendation on what architecture best suits most of your needs, balancing simplicity and cost. That's as much an art as a science.

NoSQL vs. Relational Databases vs. Possible Hybrid

I'm hearing more about NoSQL, but no one has yet given me a clear explanation of how it is to be used instead of a relational database.
I've read that it can't do left joins, so I was trying to figure out how you'd be able to use such a data store. From reading Preserve Joins by code in MongoDB, it seems like the suggestion is to just make one large table, as if you had already done the joins on it.
If the above statement is true, then I can see how it can be used. However, I'm curious how you'd handle repeated data, as the concept of normalizing helps you remove redundancy and ensure consistency in the data (e.g. slight modifications like capitalization, white space, etc.)...
Are we simply sacrificing the consistency of the data for scalable speed, or am I missing something?
Edit
I've been doing some more digging and found the answers to the following questions useful for clarifying my understanding:
Why Google's BigTable referred as a NoSQL database?
How do you track record relations in NoSQL?
My understanding of consistency seems to be correct based on those answers. And it looks like NoSQL is supposed to be used for specific problem types, and that if you need relations, you should use a relational database.
But this raises more questions like:
It makes me wonder about real life examples of when to use NoSQL versus when not to?
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
MongoDB has the ability to have documents which include arrays of other documents. This solves many cases where you would have relations in relational databases.
When an invoice has multiple positions, you wouldn't put these positions into a separate collection. You would embed them as an array.
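A minimal sketch of such an invoice document (field names are illustrative):

db.invoices.insertOne({
  invoiceNo: "2016-0042",
  customer: "ACME Corp",
  positions: [
    { description: "Widget", qty: 10, unitPrice: 2.50 },
    { description: "Gadget", qty: 1, unitPrice: 99.00 }
  ]
})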
It makes me wonder about real life examples of when to use NoSQL versus when not to?
There are many different NoSQL databases, each one designed with different use-cases in mind. But you tagged this question as MongoDB, so I assume that you mean MongoDB in particular.
MongoDB has two main advantages over relational databases.
First, it scales well.
When the database is too slow or too big, you can easily add more servers by creating a cluster or replica-set of multiple shards. This doesn't work nearly as well with most relational databases.
Second, it allows heterogeneous data.
Imagine, for example, the product database of a computer hardware store. What properties do products have? All products have a price and a vendor. But CPUs have a clock rate, hard drives and RAM chips have a capacity (and these capacities aren't comparable), monitors have a resolution and so on. How would you design this in a relational database? You would either create a very long productID-property-value table or you would create a very wide and sparse product table with every property you can imagine, but most of them being NULL for most products. Both solutions aren't really elegant. But MongoDB can solve this much better because it allows each document in a collection to have a different set of properties.
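For illustration, all three product types can live in one collection, each document carrying only the fields that apply to it (values are made up):

db.products.insertMany([
  { type: "cpu", vendor: "Acme", price: 199, clockRateGHz: 3.6 },
  { type: "hdd", vendor: "Acme", price: 59, capacityTB: 4 },
  { type: "monitor", vendor: "Foo", price: 249, resolution: "2560x1440" }
])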
What can't it do?
As a rather new technology, there isn't that much literature about it. The software ecosystem around it isn't as mature either. The tools you can get for relational databases are often much more polished.
There are also some use-cases MongoDB isn't well-suited for.
MongoDB doesn't do JOINs. When your data is very relational and denormalizing it would be counter-productive, it might be a poor choice for your product. But you might want to take a look at graph databases like Neo4j, which focus even more on relations than relational databases. Update 2016: MongoDB 3.2 now has rudimentary JOIN support with the $lookup aggregation stage, but it's still very limited in functionality compared to relational and graph databases.
MongoDB doesn't do transactions. At least not complex transactions. Certain actions which only affect a single document are guaranteed to be atomic, but as soon as you affect more than one document, you can't guarantee that no other query will happen in-between and find an inconsistent state.
MongoDB is bad for ad-hoc reporting. Its options for data mining are severely limited. The rather new aggregation functions help, and MapReduce can also solve some surprisingly complex problems once you learn to use it well, but SQL usually has the better tools for things like that.
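As a small example of the aggregation framework used for reporting (hypothetical orders collection): revenue and order count per month.

db.orders.aggregate([
  { $group: {
      _id: { $dateToString: { format: "%Y-%m", date: "$createdAt" } },
      revenue: { $sum: "$total" },
      orders: { $sum: 1 }
  } },
  { $sort: { _id: 1 } }
])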
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Relational databases have been around for about 40 years. Their theory is a well-researched topic in computer science. There are whole libraries of books written about the theory behind them. There is a by-the-book solution for every imaginable corner case by now.
But NoSQL databases, on the other hand, are a rather new technology. We are still figuring out the best practices. The most frequent advice is: "Use your own head. Think about which queries are performed most often, and optimize your data schema for them."
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
When possible I would advise against using two different database technologies in the same product:
Anyone who maintains and supports the product must be familiar with both technologies
Troubleshooting gets a lot harder
The sysadmins need to keep an additional database running and updated
You have an additional point of failure which can lead to downtime
I would only recommend mixing database technologies when fulfilling your requirements without doing so isn't just hard but practically impossible. Otherwise, make your pick and stay with it.

MongoDB beginner - to normalize or not to normalize?

I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses.
Normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web app with different collections with unique IDs, much like a relational database (normalized), and that, to enjoy quick selections, you can nest related objects and data within a collection (denormalized).
Back to our real-estate house list: the query used to populate it is quite expensive in a normal relational DB. For each house you need to query its images, reviews, owner and agencies; each entity resides in a different table with its own fields, so you'd probably use joins and combine multiple queries into one - expensive!
Enter MongoDB - where you don't need joins, and you can store all the related data of a house in a house item in the houses collection. Selection was never faster; it's database heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me, and if I have to guess, each related entity exists in its own collection in addition to its data embedded within the houses collection, and once one of these pieces of related data is added/updated/deleted, you have to update it in its own collection as well as in the houses collection. Upon such an update, do I also need to query the other collections to make sure I'm updating the house record with all the latest related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar
Try this approach:
Work out which entity (or entities) are the hero(s)
With 'hero', I mean the entity(s) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter?
Your house collection will probably look like this:
house: {
  owner,
  agency,
  images: [], // recommend references to GridFS here
  reviews: [] // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad for the house (since houses are typically advertised on a real-estate website, and that's probably what you're really interested in), so just bear that in mind
Sarah Mei wrote an informative article about the kinds of data-integrity issues that can arise in NoSQL DBs: the choice between duplicating data or using IDs, code-based joins, and the challenges of keeping data consistent. Her take is that any NoSQL DB with code-based joins will lose data integrity at some point. IMHO the article's comments are as valuable as the article itself in understanding these issues and their possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/
I would just like to give a normalization refresher from MongoDB's perspective -
What are the goals of normalization?
Frees the database from modification anomalies - for MongoDB, it looks like duplicating data through embedding is what would mostly cause these. In fact, we should try to avoid embedding data in MongoDB documents in ways that could create such anomalies. Occasionally we might need to duplicate data in documents for performance reasons; however, that's not the default approach. The default is to avoid it.
Should minimize re-design when extending - MongoDB is flexible here because it allows the addition of keys without re-designing all the documents.
Avoid bias toward any particular access pattern - this is something we're not going to worry about when designing a schema in MongoDB. One of the ideas behind MongoDB is to tune your database to the application you're trying to write and the problem you're trying to solve.