Model a large MongoDB collection that rarely updates

Model a large MongoDB collection that rarely updates - mongodb

Context:
I'm currently modeling data which follow a deep tree pattern consisting of 4 layers (categories, subcategories, subsubcategories, subsubsubcategories... those two lasts are of course not the real words I'll be using)
This collection is meant to grow larger and larger over time, and each layer will contain a list of dozens of elements.
Problem:
Modeling a full embedded collection like that raises a big problem ; the 16MB document limit of MongoDB is not really ideal in this context because the document size will slowly approach the limit.
But at the same time, this data is not meant to be updated very often (at most a few times a day). Client-side, the API needs to return a fully-constructed big JSON file made of all those layers nested together. It can be easily made in such a way that every time a layer is updated, the full JSON result is updated too and stored in RAM, ready to be sent.
I was wondering if having a 4 layers tree like that split in different collections would be a better idea, because at the same time it would raises more queries, but it would be way more scalable and easy to understand. But I don't really know if it's the way MongoDB documents are meant do be modeled. I may be doing something wrong (first time using MongoDB) and I want to be sure that everything is already in this way of doing things

I'll suggest you to take a look at official MongoDB tree structures advices, and especially the solution with parent reference. It will allow you to keep your structure without struggling of the 16MB maximum size, and you can use $graphLookup aggregation stages to perform your further queries on tree subdocuments

Related

Timeseries storage in Mongodb

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don t do any cross sensor queries. The timeseries are unevenly spaced and I need to keep the time resolution so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
1 seems to intuitively be faster for querying. I am using mongoDb 3.4 which has no limit for the number of collections in a db.
2 seems cleaner but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I have a document containing 128 measurement,startDate,nextDate. It reduces the number of documents and thus the index size but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date,sensor) pair, that is why I thought 1 might speed up the reads. I currently have about 20,000 collections in my DB and when I query the list of all collections, it takes ages which makes me think that it is not a good idea to have so many collections.
What do you think?

I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collection across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you

Whilst MongoDB has no limit on collections I tried a similar approach to 2 but moved away from it to a single collection for all sensor values because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes, I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Various different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.

When I query the data I m only interested in getting data from a
given sensor on a given day. I don t do any cross sensor queries.
But that's what exactly what Cassandra is good for!
See this article and this one.
Really, in one of our my projects we were stuck with legacy MongoDB and the scenario, similar to yours, with the except of new data amount per day was even lower.
We tried to change data structure, granulate data over multiple MongoDB collections, changed replica set configurations, etc.
But we were still disappointed as data increases, but performance degrades
with the unpredictable load and reading data request affects writing response much.
With Cassandra we had fast writes and data retrieving performance effect was visible with the naked eye. If you need complex data analysis and aggregation, you could always use Spark (Map-reduce) job.
Moreover, thinking about future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy is good as long as it suits well, but if not, it's more effective to change the technology stack.

If I understand right, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I guess MongoDB is a wrong choice for this. If required in MongoDB there is no way you can query documents across collections, you will have to write complex mechanism to retrieve data. In my opinion, you should consider elasticsearch. Where you can create indices(Collections) like sensor-data-s1-3-14-2017. Here you could do a wildcard search across indices. (for eg: sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so you could get optimal performance and that does not degrade over the period.

Approach #1 is not cool, key to speed up is divide (shard) and rule. What-if number of singal itself reaches 100000.
So place one signal in one collection and shard signals over nodes to speed up read. Multiple collections or signals can be on same node.
How this Will Assist
Usually for signal processing time-span is used like process signal for 3 days, in that case you can parallel read 3 nodes for the signal and do parallel apache spark processing.
Cross-Signal processing: typically most of signal processing algorithms uses same period for 2 or more signals for analysis like cross correlation and as these (2 or more signals) are parallel fetch it'll also be fast and ore-processing of individual signal can be parallelized.

When should I create a new collections in MongoDB?

So just a quick best practice question here. How do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be store within one collection with relevant data in the same document. Please explain why you chose the approach you did. (I'm still very new to MongoDB. I'm used to MySql.)

The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB doesn't support joins across collections, and I won't get into all of them here. But the main reason why we don't need joins is because we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
Normalizing Data
Even though MongoDB doesn't support joins, we can still store related data across multiple collections and still get to it all, albeit in a round about way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce any of key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support transactions, though it does guarantee that a single document write is atomic. It'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur for how many ever number of scenarios we can imagine. This is no longer the case since v4.0, however it is still more typical to denormalize data to avoid the need for transactions.
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Uhg! What data is used together? What data is used mostly as read-only? What data is written to frequently? Let your applications data access patterns drive your schema, not the other way around.

The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike in SQL, MongoDB lets you avoid thinking in terms of "JOINs"--in fact MongoDB doesn't even support them natively.
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary to SQL and RDBMS.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once

Use publications or separate into collections (performance)?

I have a collection, in which only two queries are ever called on it.
Ex. Cars.find({color: 'red'}); and Cars.find({color: 'blue'});
I was wondering if I should just create RedCars and BlueCars collections instead of using two publications on Cars.
Thinking of performance here, if the Cars collection were to get very large, would it be more performant to use two collections? Also, they are never called on the same template. Each has its own template.
Thanks

From a Mongo perspective, if you have a scenario where a single field across documents within a collection begins to look like an index (as you have described above) it will actually start to index queries against that field and make the return highly tuned. You can update this index (and if you have a lot of data that falls into scenario like you have described, you should tune this index), using standard Mongo indexing parameters against the database. There is more to this performance as well. For example, if it is a high read, low write, then Mongo will often keep portions or all of the query in memory for quick retrieval if it can.
As for whether it is better to split these into two collections. That's a tough one. From a performance standpoint it might be about the same either way if you tune your indexes properly and allow Mongo to do what it does best. However, from the meteor standpoint, I would consider it much easier to just keep them in a single collection from a code maintainability and testability standpoint.

In terms of performance, if the collection does get large, then your application will end up receiving alot more data than you expected it to if changes are made on either blue or red cars. A good solution rather than creating two collection is to use a parameterized subscription that will filter only on the data set you are looking at.
e.g.
Meteor.publish('cars', function(c) {
check(c, String);
return Cars.find({color: c});
});
Then you can access the data by subscribing Meteor.subscribe('cars', 'blue')

Is it a good idea to generate per day collections in mongodb

Is it a good idea to create per day collections for data on a given day (we could start with per day and then move to per hour if there is too much data). Is there a limit on the number of collections we can create in mongodb, or does it result in performance loss (is it an overhead for mongodb to maintain so many collections). Does a large number of collections have any adverse effect on performance?
To give you more context, the data will be more like facebook feeds, and only the latest data (say last one week or month) is more important to us. Making per day collections keeps the number of documents low, and probably would result in fast access. Even if we need old data, we can fall back to older collections. Does this make sense, or am I heading in the wrong direction?

what you actually need is to archive the old data. I would suggest you to take a look at this thread at the mongodb mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
Last post there from Michael Dirolf (10gen)says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
so I guess you can stay with single collection and good indexes will do the work.
anyhow, if the collection goes too big you can always run manual archive process.

Yes, there is a limit to the number of collections you can make. From the Mongo documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even still, it would take something like 60 years to hit that limit.
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users have feeds updated in a week, you're in a bit of a tight spot. It's not easy/trivial to query across collections.
I would recommend instead making one collection to store the data and simply move data out periodically as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.

Creating a collection is not much overhead, but it the overhead is larger than creating a new document inside a collections.
There is a limitation on the no of collections that you can create: " http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces "
Making new collections to me, won't be having any performance difference because in RAM you cache only those data that you actually query. In your case it will be recent feeds etc.
But having per day/hour collection will help you in achieving old data very easily.

Is there any way to register a callback for deletions in a capped collection in Mongo?

I want to use a capped collection in Mongo, but I don't want my documents to die when the collection loops around. Instead, I want Mongo to notice that I'm running out of space and move the old documents into another, permanent collection for archival purposes.
Is there a way to have Mongo do this automatically, or can I register a callback that would perform this action?

You shouldn't be using a capped collection for this. I'm assuming you're doing so because you want to keep the amount of "hot" data relatively small and move stale data to a permanent collection. However, this is effectively what happens anyway when you use MongoDB. Data that's accessed often will be in memory and data that is used less often will not be. Same goes for your indexes if they remain right-balanced. I would think you're doing a bit of premature optimization or at least have a suboptimal schema or index strategy for your problem. If you post exactly what you're trying to achieve and where your performance takes a dive I can have a look.
To answer your actual question; MongoDB does not have callbacks or triggers. There are some open feature requests for them though.
EDIT (Small elaboration on technical implementation) : MongoDB is built on top of memory mapped files for it's storage engine. It basically means it's an LRU based cache of "hot" data where data in this case can be both actual data and index data. As a result data and associated index data you access often (in your case the data you'd typically have in your capped collection) will be in memory and thus very fast to query. In typical use cases the performance difference between having an "active" collection and an "archive" collection and just one big collection should be small. As you can imagine having more memory available to the mongod process means more data can stay in memory and as a result performance will improve. There are some nice presentations from 10gen available on mongodb.org that go into more detail and also provide detail on how to keep indexes right balanced etc.

At the moment, MongoDB does not support triggers at all. If you want to move documents away before they reach the end of the "cap" then you need to monitor the data usage yourself.
However, I don't see why you would want a capped collection and also still want to move your items away. If you clarify that in your question, I'll update the answer.