Denormalization vs Parent Referencing vs MapReduce - mongodb

I have a highly normalized data model. Currently I'm using manual referencing: I store the _id of a related document and run sequential queries to fetch details from the deepest collection.
The referencing is one-way and the chain is around 5-6 collections deep. For one particular use case, I have to query down to the deepest collection by looking up each subsequent "_id" from the higher-level collections. So technically I'm hitting the database every time I run a
db.collection_name.find({_id: ****}).
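Roughly, the read path currently looks like this (collection and field names here are just placeholders):

var a = db.collection_1.findOne({ _id: someId });
var b = db.collection_2.findOne({ _id: a.collection_2_id });
var c = db.collection_3.findOne({ _id: b.collection_3_id });
// ...and so on, one round trip per level, until the deepest collection is reached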
My primary goal is to optimize reads without hugely affecting the atomicity of the other collections. I have read about denormalization, but it doesn't appeal to me because I want to keep the option of changing the cardinality down the line, and hence want to maintain separate collections.
I was initially thinking of using MapReduce to aggregate from the back and maintain a collection built primarily for this particular use case, but even that doesn't sound like a good idea.
In a relational DB, I would break the query into sub-queries and perform a join to get the intersecting data sets from the initial results. Since MongoDB does not support joins, I'm having a tough time figuring anything out.
Please help if you have faced anything like this earlier or have any idea how to resolve it.

Denormalize your data.
MongoDB does not do JOINs - period.
There is no operation on the database which gets data from more than one collection: not find(), not aggregate(), and not MapReduce. When you need to puzzle your data together from more than one collection, there is no other way than doing it in the application layer. For that reason you should organize your data in such a way that any common, performance-relevant query can be resolved by querying just a single collection.
In order to do that you might have to create redundancies and transitive dependencies. This is normal in MongoDB.
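As a minimal sketch of what that can look like (collection and field names are purely illustrative), the document serving the hot read path simply carries copies of the fields it needs from the related collections:

// one find() instead of 5-6 chained queries; the copied fields are redundant by design
db.orders.insertOne({
  _id: 1,
  customerId: 42,
  customerName: "Ada",        // copied from the customers collection
  productId: 7,
  productTitle: "Widget",     // copied from the products collection
  total: 19.99
});
db.orders.find({ _id: 1 });   // everything the read needs lives in one document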
If this feels "dirty" to you, then you should either accept that your performance will be sub-optimal or use a different kind of database, such as a classic relational database or a graph database.

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be more optimal in terms of performance and speed, multiple collections or a single collection + sharding?
Any help would be appreciated.
As you mentioned, you need to experiment with your data and use case; then you will have a better picture.
Some decisions are required, as below:
Estimate the number of documents you will have in the near future. If you expect 1M documents within a year, then test with at least 3M documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on these requirements:
If you have many writes together with many indexes, then a single monolithic collection will be slower, as multiple indexes need to be updated on every write.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections (see the sketch after this list). But do check the performance; it depends heavily on the decisions above. Note the related open issue as well.
If you decide to go with a monolithic collection, defer sharding. Implement it only once you find that queries are getting slower.
If you have many writes to the same document, those writes will be executed sequentially. This will slow down your reads as well.
Think about reclaiming disk space when a lot of data is deleted from the collections. Multiple collections do better here.
The point that pushes me towards a monolithic collection is your requirement to query all the categories at the same time. You may need to add more categories later, and combining all of them into a single response from many collections would not be better in terms of performance.
As you don't really have a join use case like in an RDBMS (I doubt you even have a join key), you can go with a single monolithic collection from a modeling point of view.
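Here is a minimal sketch of the $unionWith option mentioned above, assuming one collection per category (collection and field names are illustrative):

// search the same term across two category collections in a single aggregation
db.books.aggregate([
  { $match: { name: /phone/i } },
  { $unionWith: {
      coll: "electronics",
      pipeline: [ { $match: { name: /phone/i } } ]
  } }
]);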
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL, but some data is definitely a better fit for that model than other data.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data (see the sketch after this list). This is generally only OK if you have few updates or few documents; if you have many of both, it becomes expensive.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
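As a minimal sketch of the first strategy (field names are illustrative): when a common field changes, every product document that embeds a copy of it has to be rewritten.

// the manufacturer's name is embedded in each product document;
// when it changes, all products carrying the old copy must be updated
db.products.updateMany(
  { "manufacturer.id": 42 },
  { $set: { "manufacturer.name": "Acme Corp (renamed)" } }
);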
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

Does having a lot of collections impact MongoDB performance?

Do many MongoDB collections have a big impact on performance, memory, and capacity? I am designing an API with the MVC pattern, and a collection is being created for each model. I'm questioning the way I'm doing it now.
MongoDB with the WiredTiger engine supports an unlimited number of collections, so you are not going to run into any hard technical limitations.
When you wonder if something should be in one collection or in multiple collections, these are some of the considerations you need to keep in mind:
More collections = more maintenance work. Sharding is configured at the collection level, so having a large number of collections will make shard configuration a lot more work. You also need to set up indexes for each collection separately, but this is quite easy to automate (see the sketch after these points), because calling createIndex for an index which already exists does nothing.
The MongoDB API is designed in a way that every database query operates on one collection at a time. That means when you need to search for a document in n different collections, you need to perform n queries. When you need to aggregate data stored in multiple collections, you run into even more problems. So any data which is queried together should be stored together in the same collection.
Creating one collection for each class in your model is usually a good rule of thumb, but it is not a golden-hammer solution. There are situations where you want to embed objects in their parent object's document instead of putting them into a separate collection. There are also cases where you want to put all objects with the same base class into the same collection to benefit from MongoDB's ability to handle heterogeneous collections. But that goes beyond the scope of this question.
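A minimal sketch of that index automation in the mongo shell (collection names are illustrative); because createIndex is idempotent, rerunning the script is safe:

// ensure the same index exists on every per-model collection;
// collections that already have it are left untouched
["users", "orders", "invoices"].forEach(function (name) {
  db.getCollection(name).createIndex({ createdAt: 1 });
});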
Why don't you use this and test your application?
https://docs.mongodb.com/manual/tutorial/evaluate-operation-performance/
By the way, your question is not completely clear... it is more of a "discussion" than a question, and you're asking others to evaluate your work instead of searching the web for the right approach.

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This makes it extremely fast and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time because $out deletes the whole results table before pushing the new query in.
I see three possible workarounds:
Run the aggregation, but manually push the results into the results collection
$out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
$out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after say 24 hours (like I do with specific query results in the main collection today).
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just immediately remove them because a user might want to have multiple tabs open with different reports. Even still, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!
Dropping a collection in MongoDB is a very efficient operation; in any case, much more efficient than deleting some documents from a larger collection.
The maximum number of collections is quite high: it is limited only by namespace size in MMAPv1, while no hard limit exists in the WiredTiger engine.
So I would favor your solution #3.
Some improvements/alternatives you can think about:
Consider creating the collections in a separate database (say, one per day); then you can drop the entire database in a single operation without having to drop individual collections.
Use an endpoint for the result set, cache the results, then drop the $out collection. Let the cache handle user requests and only rerun the aggregation if the cache has expired.
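A minimal sketch of option #3 from the question (collection and field names are illustrative): each report run writes into its own dynamically named collection, which a cleanup job drops later.

// each report run gets its own output collection, so concurrent users don't collide
var resultsName = "results_" + queryId;          // queryId is supplied by the client request
db.events.aggregate([
  { $match: { owner: userId } },
  { $group: { _id: "$type", total: { $sum: 1 } } },
  { $out: resultsName }                          // overwrites only this one temp collection
]);
// a scheduled cleanup job (e.g. after 24 hours) can then run:
db.getCollection(resultsName).drop();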
This kind of activity is done very easily in relational databases such as MySQL or PostgreSQL. You might consider synchronising your data to a separate relational database for the purposes of reporting.
There is a package, https://github.com/perak/mysql-shadow, which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just a one-way sync is more likely to succeed.
The other option is to use GraphQL over a Mongo/MySQL hybrid database, which can be done with the Apollo stack: http://www.apollodata.com/

MongoDB $lookup: Limitations & Usage

With the new aggregation pipeline stage $lookup we are now able to perform 'left outer joins'.
At first glance, I want to immediately replace one of our denormalised collections with two separate collections and use the $lookup to join them upon querying. This will solve the problem of having, when necessary, to update a huge number of documents. Now we can update just one document.
But surely this is too good to be true? This is a NoSQL, document database after all!
MongoDB's CTO also highlights his concerns:
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database. But instead of limiting its availability, we’re going to help developers know when its use is appropriate, and when it’s an anti-pattern. In the coming months, we will go beyond the existing documentation to provide clear, strong guidance in this area.
What are the limitations of $lookup? Can I use them in real-time, operational querying of our data or should they be left for reporting, offline situations?
I share your same enthusiasm for $lookup.
I think there are trade-offs. One of the major concerns with SQL databases (and one of the reasons for the genesis of NoSQL) is that, at large scale, joins can take a lot of time (relatively speaking).
$lookup definitely helps in giving you a declarative model for your data, but if you start to model your entire NoSQL database as though it were a database of rows and tables (just using refs everywhere, for example), then you begin modeling it as though it were simply a SQL database (to a degree). Even MongoDB mentioned this (as you quoted in your question):
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database.
You mentioned:
This will solve the problem of having, when necessary, to update a huge number of documents. Now we can update just one document.
I'm not sure what your collections look like exactly, but that definitely sounds like it could be a good use for $lookup.
Can I use them in real-time, operational querying
I would say, again, it depends on your use-case. You'll have to compare:
Desired semantics of your queries (declarative vs imperative)
Whether modeling your data as more relational (and thus using $lookup) in certain circumstances is worth the potential trade-off in computational time (that's assuming that querying across collections is even something to be concerned about, computationally speaking)
etc...
I'm sure in the coming months we'll see perf tests of the "left outer joins" and perhaps MongoDB will start writing some posts about when $lookup is an antipattern.
Hope this answer helps add to the discussion.
First of all, MongoDB is a document database and always will be. The $lookup aggregation pipeline stage, new in version 3.2, did not turn MongoDB into a relational database (RDBMS); as MongoDB's CTO mentioned:
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database.
The first limitation of $lookup as mentioned in the documentation is that it:
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing.
Which means that you can't use it with a sharded collection.
Also, the $lookup stage doesn't work directly with an array, as mentioned in this post, so you will need a preliminary $unwind stage to denormalize the localField if it is an array.
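A minimal sketch of that pattern (collection and field names are illustrative, targeting the 3.2 behaviour described above):

// orders.itemIds is an array, so unwind it before joining against items
db.orders.aggregate([
  { $unwind: "$itemIds" },
  { $lookup: {
      from: "items",              // must be unsharded and in the same database (as of 3.2)
      localField: "itemIds",
      foreignField: "_id",
      as: "item"
  } }
]);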
Now you said:
This will solve the problem of having, when necessary, to update a huge number of documents.
This is a good idea if your data is updated more often than it is read, as mentioned in 6 Rules of Thumb for MongoDB Schema Design: Part 3, especially if you have large hierarchical data sets:
Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
I believe that with careful schema design you probably will not need the $lookup operator.

Use publications or separate into collections (performance)?

I have a collection on which only two queries are ever run.
Ex. Cars.find({color: 'red'}); and Cars.find({color: 'blue'});
I was wondering if I should just create RedCars and BlueCars collections instead of using two publications on Cars.
Thinking of performance here, if the Cars collection were to get very large, would it be more performant to use two collections? Also, they are never called on the same template. Each has its own template.
Thanks
From a Mongo perspective, if a single field across the documents in a collection is queried the way you describe above, an index on that field will make those queries highly tuned. You can create and tune this index (and if you have a lot of data that falls into a scenario like the one you describe, you should) using standard Mongo indexing commands against the database. There is more to this performance story as well: for example, if the workload is high-read and low-write, Mongo will often keep some or all of the working set in memory for quick retrieval when it can.
As for whether it is better to split these into two collections: that's a tough one. From a performance standpoint it might be about the same either way if you tune your indexes properly and let Mongo do what it does best. From the Meteor standpoint, however, I would consider it much easier to keep them in a single collection for code maintainability and testability.
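For illustration, a minimal sketch of that index in the mongo shell (assuming the underlying collection is named cars):

db.cars.createIndex({ color: 1 });                          // one index serves both queries
db.cars.find({ color: 'red' }).explain('executionStats');   // should report an IXSCAN rather than a COLLSCAN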
In terms of performance, if the collection does get large, your application will end up receiving a lot more data than you expect whenever changes are made to either blue or red cars. A good solution, rather than creating two collections, is to use a parameterized subscription that filters on only the data set you are looking at.
e.g.
Meteor.publish('cars', function(c) {
  check(c, String);
  return Cars.find({color: c});
});
Then you can access the data by subscribing with Meteor.subscribe('cars', 'blue').
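For example, since each colour has its own template, each template can own its subscription (the template name here is illustrative):

// client side: this template only ever receives blue cars
Template.blueCars.onCreated(function () {
  this.subscribe('cars', 'blue');
});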