Use publications or separate into collections (performance)? - mongodb

I have a collection, in which only two queries are ever called on it.
Ex. Cars.find({color: 'red'}); and Cars.find({color: 'blue'});
I was wondering if I should just create RedCars and BlueCars collections instead of using two publications on Cars.
Thinking of performance here, if the Cars collection were to get very large, would it be more performant to use two collections? Also, they are never called on the same template. Each has its own template.

From a Mongo perspective, if a single field is what every query on the collection filters on (as you have described above), an index on that field will make those queries highly tuned. MongoDB only indexes _id automatically, so you should create this index yourself (and if you have a lot of data that falls into a scenario like the one you have described, you should tune it) using standard Mongo indexing parameters against the database. There is more to this performance as well. For example, if the workload is high read, low write, then Mongo will often keep portions or all of the working set in memory for quick retrieval if it can.
As for whether it is better to split these into two collections, that's a tough one. From a performance standpoint it might be about the same either way if you tune your indexes properly and allow Mongo to do what it does best. However, from the Meteor standpoint, I would consider it much easier to just keep them in a single collection, for code maintainability and testability.

In terms of performance, if the collection does get large, then your application will end up receiving a lot more data than you expected whenever changes are made to either blue or red cars. A good solution, rather than creating two collections, is to use a parameterized subscription that filters on only the data set you are looking at.
Meteor.publish('cars', function(c) {
  check(c, String);
  return Cars.find({color: c});
});
Then you can access the data by subscribing with Meteor.subscribe('cars', 'blue').
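Since the color argument comes from the client, it may be worth restricting it to known values before it reaches the query. A minimal sketch, assuming a hypothetical allow-list ALLOWED_COLORS and a hypothetical helper carSelector (neither is part of Meteor). It runs in plain Node without Meteor:

```javascript
// Hypothetical allow-list; adjust to the colors your app actually publishes.
const ALLOWED_COLORS = new Set(['red', 'blue']);

// Build the Mongo selector for a color, or return null for unknown input.
function carSelector(color) {
  if (typeof color !== 'string' || !ALLOWED_COLORS.has(color)) {
    return null;
  }
  return { color: color };
}

// Inside Meteor.publish this could be used as:
//   const selector = carSelector(c);
//   return selector ? Cars.find(selector) : this.ready();
console.log(carSelector('blue'));
```

Returning this.ready() for unknown input simply publishes nothing instead of throwing.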


Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be optimal in terms of performance and speed, multiple collections or a single collection + sharding?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case. You will then have a better picture.
Some decisions are required, as below.
Decide the number of documents you will have in the near future. If you will have 1m documents in a year, then test with at least 3m documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have more writes with more indices, then a single monolithic collection will be slower, as multiple indices need to be updated.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance; it depends purely on the above decisions. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it only once you find that queries are slow.
If you have more writes on the same document, the writes will be executed sequentially. That will slow down your reads also.
Think about reclaiming disk space when a lot of data is cleared from the collections. Multiple collections do well here.
The point which pushes me to suggest a monolithic collection is your requirement to "query all the categories at the same time". You may need to add more categories, and combining all of them in a single response would not be better in terms of performance.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modeling point of view. I doubt you could have a join key.
If any of my points are incorrect, please let me know.
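For the multiple-collections option, the cross-category search mentioned above could look roughly like this. It is only a sketch: the collection names and the search field are made up, buildCrossCategorySearch is a hypothetical helper, and $unionWith needs MongoDB 4.4 or later. The code only builds the pipeline, so it runs without a database:

```javascript
// Build an aggregation pipeline that applies the same $match to several
// collections and concatenates the results with $unionWith (MongoDB 4.4+).
// The first collection is the one aggregate() is called on.
function buildCrossCategorySearch(collections, match) {
  if (collections.length === 0) {
    throw new Error('at least one collection is required');
  }
  const [first, ...rest] = collections;
  const pipeline = [{ $match: match }];
  for (const coll of rest) {
    pipeline.push({ $unionWith: { coll: coll, pipeline: [{ $match: match }] } });
  }
  return { collection: first, pipeline: pipeline };
}

const search = buildCrossCategorySearch(
  ['phones', 'laptops', 'cameras'], // hypothetical category collections
  { name: 'X100' }                  // hypothetical search criterion
);
// With the Node.js driver this could be run as:
//   db.collection(search.collection).aggregate(search.pipeline).toArray()
console.log(JSON.stringify(search.pipeline));
```

With 12 categories the pipeline has 11 $unionWith stages, which is exactly the kind of query you should benchmark against the single-collection variant before deciding.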
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from this article is that MongoDB is not a good fit for every type of data.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

Can I create different sets of collections per tenant to avoid contention issues?

I am considering using Google Cloud Firestore for a multi tenant application.
I have come across this page which gives suggestions about scale:
There is also this page showing limitations:
I came up with this solution, which may improve the performance and resilience of the application at minimal or no additional cost.
The solution: I can use a separate set of collections per tenant, with the tenant id in the collection name, like: products_1, orders_1, products_2, orders_2.
I want to use it because:
1- It will have better performance since I will have smaller tables/indexes. Otherwise in the long term, it may contain too many documents.
2- It is doable because the code interacts with collections with their names and I don't have to explicitly create collections. It doesn't seem like it is a big issue when compared to doing it with a relational database / ORM combination.
3- There are no limitations on how many collections I can create with different names.
So my question:
Could any of my assumptions be incorrect, such that this won't bring any gain in performance, or such that it is not feasible to create an unbounded number of collections even though no limit is documented?
Finally, can this approach create maintenance trouble in the long term that I cannot foresee at the moment?
Thank you for your time.
Using separate collections to shard out clients is definitely a way to improve overall write throughput. But you don't need a naming convention for the collections.
Instead I'd consider creating a single top-level collection for all tenants/users and then have a subcollection for each tenant/user document. That way you have a separate subcollection for each tenant/user, but still have predictable collection names.
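To make the subcollection layout concrete, here is a small sketch of how the paths could be derived. The top-level collection name "tenants" and the helper tenantCollectionPath are assumptions for illustration, not anything Firestore prescribes:

```javascript
// Build the path of a tenant-scoped subcollection, e.g. tenants/1/products.
// "tenants" as the top-level collection name is an assumption.
function tenantCollectionPath(tenantId, name) {
  const id = String(tenantId);
  if (id === '' || id.includes('/')) {
    throw new Error('invalid tenant id');
  }
  return ['tenants', id, name].join('/');
}

console.log(tenantCollectionPath(1, 'products'));
console.log(tenantCollectionPath(2, 'orders'));
// With a Firestore client the same location could be reached as:
//   db.collection('tenants').doc('1').collection('products')
```

The collection names (products, orders) stay the same for every tenant; only the parent document changes.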

I wonder if there are a lot of collections

Do many mongodb collections have a big impact on mongodb performance, memory and capacity? I am designing an api with mvc pattern, and a collection is being created for each model. I question the way I am doing now.
MongoDB with the WiredTiger engine supports an unlimited number of collections. So you are not going to run into any hard technical limitations.
When you wonder if something should be in one collection or in multiple collections, these are some of the considerations you need to keep in mind:
More collections = more maintenance work. Sharding is configured on the collection level, so having a large number of collections will make shard configuration a lot more work. You also need to set up indexes for each collection separately, but this is quite easy to automate, because createIndex on an index which already exists does nothing.
The MongoDB API is designed in a way that every database query operates on one collection at a time. That means when you need to search for a document in n different collections, you need to perform n queries. When you need to aggregate data stored in multiple collections, you run into even more problems. So any data which is queried together should be stored together in the same collection.
Creating one collection for each class in your model is usually a good rule of thumb, but it is not a golden hammer solution. There are situations where you want to embed objects in their parent-object documents instead of putting them into a separate collection. There are also cases where you want to put all objects with the same base class in the same collection to benefit from MongoDB's ability to handle heterogeneous collections. But that goes beyond the scope of this question.
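The index automation mentioned above could be sketched like this. ensureIndexes and INDEX_SPECS are hypothetical names, and the index list is invented; the only driver call assumed is the Node.js driver's db.collection(name).createIndex(keys, options), which resolves to the index name:

```javascript
// Create every index in `specs` on its collection. Meant to be run on each
// deploy, relying on createIndex doing nothing when the index already exists.
// `db` is anything exposing db.collection(name).createIndex(keys, options).
async function ensureIndexes(db, specs) {
  const created = [];
  for (const { collection, keys, options } of specs) {
    const name = await db.collection(collection).createIndex(keys, options || {});
    created.push(name);
  }
  return created;
}

// Hypothetical index list for an application with two models.
const INDEX_SPECS = [
  { collection: 'users', keys: { email: 1 }, options: { unique: true } },
  { collection: 'orders', keys: { userId: 1, createdAt: -1 } },
];
```

Keeping the list in one place means adding a collection only requires adding one entry.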
Why don't you use this and test your application?
By the way, your question is not completely clear. It is more like a "discussion" than a question, and you're asking others to evaluate your work instead of searching the web for the right approach.

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
A single query with no filtering to open a cursor and get documents in batches of 5/6k
Iterate with pagination, using a logic of find().limit(5000).skip(currentIteration * 5000)
We're unsure what's the best practice and will yield the best results with minimum impact on performance.
I would go with 1. & 2. mixed if possible: Iterate over your huge dataset in pages but access those pages by querying instead of skipping over them as this may be costly as also pointed out by the docs.
The cursor.skip() method is often expensive because it requires the
server to walk from the beginning of the collection or index to get
the offset or skip position before beginning to return results. As the
offset (e.g. pageNumber above) increases, cursor.skip() will become
slower and more CPU intensive. With larger collections, cursor.skip()
may become IO bound.
So if possible build your pages on an indexed field and process those batches of data with an according query range.
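A rough sketch of that range-based paging, assuming you page on _id (any indexed, unique, sortable field would do). fakeFind is an in-memory stand-in for collection.find(filter).sort({_id: 1}).limit(n) so the example runs without a database; all function names are made up:

```javascript
// Filter for the next page: everything after the last _id already seen.
function nextPageFilter(lastId) {
  return lastId === null ? {} : { _id: { $gt: lastId } };
}

// In-memory stand-in for collection.find(filter).sort({_id: 1}).limit(n),
// supporting only the {} and {_id: {$gt: x}} filters used here.
function fakeFind(docs, filter, limit) {
  const after = filter._id ? filter._id.$gt : -Infinity;
  return docs
    .filter((d) => d._id > after)
    .sort((a, b) => a._id - b._id)
    .slice(0, limit);
}

// Walk the whole data set page by page without skip().
function readAll(docs, pageSize) {
  const seen = [];
  let lastId = null;
  for (;;) {
    const page = fakeFind(docs, nextPageFilter(lastId), pageSize);
    if (page.length === 0) break;
    seen.push(...page.map((d) => d._id));
    lastId = page[page.length - 1]._id;
  }
  return seen;
}

const docs = [5, 1, 4, 2, 3].map((n) => ({ _id: n }));
console.log(readAll(docs, 2));
```

Each page starts from an index lookup on the last seen _id, so the cost per page stays roughly constant instead of growing with the offset.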
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your language's equivalent of
var docs = db.yourcoll.find()
will actually just create a cursor initially, and will then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, but hold the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify and then store them in a new instance does under most circumstances not seem reasonable to me, as you basically reinvent the wheel.
Alternate approach
Generally speaking, there is no one-size-fits all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given the fact that in general each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you did your transformation, simply use the $out stage to save the results in a new collection.
Even collection spanning transformations can be achieved, using the $lookup stage.
After you did the extract and transform on the old instance, for loading the data to the new MongoDB instance, you have several possibilities:
Create a temporary replica set, consisting of the old instance, the new instance and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so your new instance becomes primary, and remove the old instance and the arbiter from the replica set. The advantage is that you leverage MongoDB's replication mechanics to get the data from your old instance to your new instance, without the need to worry about partially executed transfers and such. And you can use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, perform your transformations and then remove the "old" data.
Use db.cloneCollection(). The advantage here is that you only transfer the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
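To illustrate the aggregation-based extract and transform step described above, here is a sketch of what such a pipeline could look like. All collection and field names are invented, buildEtlPipeline is a hypothetical helper, and $lookup needs MongoDB 3.2 or later. The code only builds and prints the pipeline:

```javascript
// Build an ETL pipeline: filter, join a second collection, reshape, write out.
// Collection and field names are hypothetical.
function buildEtlPipeline(targetCollection) {
  return [
    { $match: { status: 'active' } },
    {
      $lookup: {
        from: 'customers',
        localField: 'customerId',
        foreignField: '_id',
        as: 'customer',
      },
    },
    { $unwind: '$customer' },
    { $project: { total: 1, customerName: '$customer.name' } },
    { $out: targetCollection }, // $out must be the last stage
  ];
}

const pipeline = buildEtlPipeline('orders_transformed');
// Mongo shell usage: db.orders.aggregate(pipeline, { allowDiskUse: true })
console.log(pipeline.map((stage) => Object.keys(stage)[0]).join(' -> '));
```

Because $out writes into a separate collection, the source collection is only read, never modified.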
MongoDB 3.4 supports Parallel Collection Scan. I have never tried this myself, but it looks interesting to me.
This will not work on sharded clusters. If you have parallel processing set up, this will speed up the scan for sure.
Please see the documentation here:

Denormalization vs Parent Referencing vs MapReduce

I have a highly normalized data model with me. Currently I'm using manual referencing by storing the _id and running sequential queries to fetch details from the deepest collection.
The referencing is one-way and the flow has around 5-6 collections. For one particular use case, I'm having to query down to the deepest collection by querying subsequent "_id" from the higher level collections. So technically I'm hitting the database every time I run a
db.collection_name.find({_id: ****}).
My prime goal is to optimize the read without hugely affecting the atomicity of the other collections. I have read about de-normalization and it does not make sense to me because I want to keep an option for changing the cardinality down the line and hence want to maintain a separate collection altogether.
I was initially thinking of using MapReduce to do an aggregation from the back and have a collection primarily for the particular use-case. But well even that does not sound that good.
In a relational db, I would be breaking the query in sub-queries and performing a join to get the data sets that intersect from the initial results. Since mongodb does not support joins, I'm having a tough time figuring anything out.
Please help if you have faced anything like this earlier or have any idea how to resolve it.
Denormalize your data.
MongoDB does not do JOINs, period.
There is no operation on the database which gets data from more than one collection. Not find(), not aggregate() and not MapReduce. When you need to puzzle your data together from more than one collection, there is no other way than doing it on the application layer. For that reason you should organize your data in a way that any common and performance-relevant query can be resolved by querying just a single collection.
In order to do that you might have to create redundancies and transitive dependencies. This is normal in MongoDB.
When this feels "dirty" to you, then you should either accept the fact that your performance will be sub-optimal or use a different kind of database, like a classic relational database or a graph database.
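As a sketch of what such a redundancy could look like for a chain of referenced collections: copy the fields the hot read path needs into the deepest document when it is written. The three-level chain, the field names and the helper denormalizeLeaf are all assumptions for illustration:

```javascript
// Copy the fields a hot read path needs from the parent chain into the leaf
// document, so one find() on the leaf collection answers the query.
function denormalizeLeaf(leaf, parent, grandparent) {
  return {
    _id: leaf._id,
    value: leaf.value,
    parentId: parent._id,              // reference kept for later updates
    parentName: parent.name,           // redundant copy
    grandparentId: grandparent._id,    // reference kept for later updates
    grandparentName: grandparent.name, // redundant copy
  };
}

const doc = denormalizeLeaf(
  { _id: 'l1', value: 42 },
  { _id: 'p1', name: 'Parent' },
  { _id: 'g1', name: 'Grandparent' }
);
console.log(doc);
```

The references are kept next to the copies, so when a parent's name changes you can still find and update every affected leaf document.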