MongoDB - single huge collection of raw data. Split or not?

We collect and store instrumentation data from a large number of hosts.
Our storage is MongoDB - several shards with replicas. Everything is stored in a single large collection.
Each document we insert is a time-based observation with some attributes (measurements). The timestamp is the most important attribute, because every query is at least time-based. Documents are never updated, so it's a pure insert-then-query workload. Right now it works reasonably well with several billion documents.
Now, we want to grow a bit and hold up to 12 months of data, which may amount to a scary trillion+ observations (documents).
I was wondering whether dumping everything into a single monstrous collection is the best choice, or whether there is a more intelligent way to go about it.
By more intelligent I mean - use less hardware while still providing fast inserts and (importantly) fast queries.
So I thought about splitting the large collection into smaller pieces, hoping to save memory on indexes and to gain insert and query speed.
I looked into shards, but sharding by the timestamp sounds like a bad idea, because all writes would go to one node, cancelling the benefits of sharding.
The insert rates are pretty high, so we need sharding to work properly here.
I also thought about creating a new collection every month and then picking the relevant collection for each user query.
Collections older than 12 months would be either dropped or archived.
There is also the option of creating an entirely new database every month and doing a similar rotation.
Other options? Or is one large collection THE way to grow really big?
Please share your experience and considerations from similar applications.
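For illustration, this is roughly what I have in mind for the per-month variant; the collection naming (obs_YYYY_MM) and the helper functions below are just placeholders, not something we have built:

    # Rough sketch of the per-month idea (collection names and helpers are placeholders):
    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient()["metrics"]

    def collections_for_range(start: datetime, end: datetime):
        """Yield the per-month collections a time-range query has to touch."""
        year, month = start.year, start.month
        while (year, month) <= (end.year, end.month):
            yield db[f"obs_{year}_{month:02d}"]
            year, month = (year + 1, 1) if month == 12 else (year, month + 1)

    def query_range(start, end, extra_filter=None):
        for coll in collections_for_range(start, end):
            yield from coll.find({**(extra_filter or {}),
                                  "ts": {"$gte": start, "$lt": end}})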

It really depends on the use-case for your queries.
If the results can be aggregated, I would do that through a scheduled map/reduce job and store the smaller, aggregated data in separate collection(s).
If everything should stay in the same collection and all the data has to be queried together to generate the desired results, then you need to go with sharding. Depending on the data size of your queries, you could then go with an in-memory map/reduce or even do it at the application layer.
As you pointed out yourself, sharding based on time is a very bad idea: it makes all the writes go to one shard, so choose your shard key carefully. The MongoDB docs have a very good explanation of this.
If you can elaborate more on your specific query needs, it would be easier to suggest something.
Hope it helps.
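As a rough, untested sketch of what picking a shard key could look like with pymongo (the database, collection, and field names are assumptions about your schema):

    # Rough, untested sketch; "metrics", "observations", "hostId", "ts" are assumed names.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    client.admin.command("enableSharding", "metrics")
    # A compound key (source/host first, timestamp second) spreads writes across shards,
    # while a time-range query for one host still targets only a few chunks.
    client.admin.command("shardCollection", "metrics.observations",
                         key={"hostId": 1, "ts": 1})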

I think collections on a monthly basis will give you some boost, but I was wondering why you cannot use the hour part of your timestamp for sharding. You could add a field that holds the HOUR part of the timestamp, and if you shard on it the data will be spread nicely, since the hours repeat every day. I have not tested it, but I thought it might help you.
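Something along these lines is what I mean (untested, and all names are made up):

    # Untested sketch of the hour-based idea; field and collection names are made up.
    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    client.admin.command("enableSharding", "metrics")
    # "hour" alone has only 24 distinct values, so the timestamp is added to the key
    # here purely for cardinality.
    client.admin.command("shardCollection", "metrics.observations",
                         key={"hour": 1, "ts": 1})

    doc = {"ts": datetime.utcnow(), "value": 42.0}
    doc["hour"] = doc["ts"].hour          # 0-23, repeats every day
    client.metrics.observations.insert_one(doc)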

I would suggest going ahead with a single collection; as suggested by #Devesh, an hour-based shard key should be fine. Just take care to include the new 'hour' key when querying, to get better performance.
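For example, assuming the extra 'hour' field from the previous answer, a query would also constrain that key (sketch only):

    # Sketch only: include the assumed "hour" shard key in the filter so that
    # mongos can route the query to specific shards.
    from datetime import datetime
    from pymongo import MongoClient

    coll = MongoClient("mongodb://mongos-host:27017").metrics.observations
    start, end = datetime(2024, 3, 14, 6, 0), datetime(2024, 3, 14, 7, 0)
    cursor = coll.find({"hour": 6, "ts": {"$gte": start, "$lt": end}})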

Related

How to choose the best mechanism for deleting logs saved to MongoDB

I'm implementing a logger using MongoDB and I'm quite new to the concept.
The logger is supposed to log each request and its response.
I'm facing the question of whether to use Mongo's TTL index or just delete old documents with a query overnight.
I think the first method might bring some overhead from the background thread and probably from maintaining the index after each deletion, but it frees space as soon as the documents expire, which might be beneficial.
The second approach, on the other hand, does not have that kind of overhead, but it frees up space only at the end of each day.
It seems to me that the second approach suits my case better: my server is not on the edge of running out of disk space, but it is always the case that we need to reduce load on the server.
I'm wondering if there are aspects of the subject that I'm missing, and I'm also not sure about the typical applications of MongoDB's TTL.
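For reference, the two options I'm weighing look roughly like this in pymongo (collection and field names are just examples):

    # Option 1: a TTL index, where mongod's background task deletes expired documents.
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    logs = MongoClient().app.request_logs
    logs.create_index("createdAt", expireAfterSeconds=7 * 24 * 3600)

    # Option 2: an overnight batch delete run by a scheduler (cron or similar).
    def purge_old_logs():
        cutoff = datetime.utcnow() - timedelta(days=7)
        logs.delete_many({"createdAt": {"$lt": cutoff}})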
Just my opinion:
It seems best to store logs in monthly, daily, or hourly collections, depending on your application's write load, and at the end of the day just drop() the oldest collections with a custom script. From experience, TTL indexes do not work well when there is a heavy write load on your collection, since they add extra write load of their own based on the expiration time.
For example, imagine you insert log events at 06:00 at 100k/sec and your TTL lifetime is set to 3h; this means that 3 hours later, at 09:00, those 100k/sec deletes get applied to your collection, and they are also written to the oplog. The solution in such cases is to add more shards, but that becomes kind of expensive; it is far easier to just drop the expired collection.
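A rough sketch of such a rotation script (the logs_YYYYMMDD naming and the retention window are examples only):

    # Rough sketch: write to a per-day collection and drop anything past the retention window.
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    db = MongoClient()["logging"]

    def log_collection(day: datetime):
        return db[f"logs_{day:%Y%m%d}"]

    def drop_expired(days_to_keep: int = 7):
        cutoff = f"logs_{datetime.utcnow() - timedelta(days=days_to_keep):%Y%m%d}"
        for name in db.list_collection_names():
            if name.startswith("logs_") and name < cutoff:
                db.drop_collection(name)   # one cheap drop instead of millions of TTL deletes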
Moreover, depending on your project size, for bigger collections you can additionally shard and pre-split the collections based on a compound hashed index on the datetime field (every log contains a timestamp) together with another field you will often search on; this gives you scalable search across multiple distributed shards.
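A hedged sketch of that, assuming MongoDB 4.4+ for compound hashed shard keys; the namespace and field names are examples:

    # Hedged sketch; compound hashed shard keys need MongoDB 4.4+, and the
    # "logging.events" namespace and field names are examples only.
    from pymongo import MongoClient

    admin = MongoClient("mongodb://mongos-host:27017").admin
    admin.command("enableSharding", "logging")
    admin.command("shardCollection", "logging.events",
                  key={"service": 1, "datetime": "hashed"},
                  numInitialChunks=64)   # pre-split an empty collection so writes spread out immediately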
Also note that MongoDB is a general-purpose document database and its full-text search is more or less limited to expensive regex expressions, so if you need fast raw full-text search in your logs, an inverted-index search engine like Elasticsearch on top of your MongoDB backend may be a good solution to cover this functionality.

Distributing big data storage for non-relational data

The problem consists of a lot (approx. 500 million per day) of non-relational messages of relatively small size (approx. 1 KB). The messages are written once and never modified again. The messages have various structures, though there are patterns that each message must fit. This data then has to be searchable. The search may be done on any field of a message; the only always-present field is the date, so a search is always for a specific day.
The approach I have come up with so far is to use MongoDB. Each day I create a number of collections (approx. 2000) and distribute the day's messages to those collections according to their pattern. I find the patterns important because of indexing: the number of indexes per collection is limited to 64.
This strategy results in 500 GB of data + 150 GB of indexes = 650 GB per day. Of course, the question here is how to distribute that data. The obvious solution is to use Mongo sharding and spread the collections over the shards. However, I have not found any scenario close to my problem described in the Mongo manuals. Moreover, I am not even sure whether I can dynamically (rather than manually) add new collections to shards every day. Any knowledge/suggestions from experienced users? Should I change my design?
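To be concrete, this is the kind of thing I would like to be able to do from code (untested sketch; all names are placeholders):

    # What I mean by adding collections to shards dynamically (untested; names are placeholders):
    from datetime import date
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    client.admin.command("enableSharding", "messages")

    def create_daily_collection(pattern_id: int, day: date):
        name = f"msgs_{day:%Y%m%d}_p{pattern_id}"
        client.admin.command("shardCollection", f"messages.{name}",
                             key={"_id": "hashed"})
        return client.messages[name]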

Timeseries storage in Mongodb

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
1. each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
2. each sensor corresponds to a collection; I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring option 1, but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I now have a document containing 128 measurements, a startDate, and a nextDate. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think that having so many collections is not a good idea.
What do you think?
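For reference, this is roughly what one bucket document looks like now (the field names are mine):

    # Roughly what one bucket document looks like now (field names are mine):
    bucket = {
        "sensorId": "s42",
        "startDate": "2017-03-14T00:00:00Z",
        "nextDate": "2017-03-14T00:02:08Z",    # start of the following bucket
        "measurements": [                      # up to 128 (timestamp, value) pairs
            {"ts": "2017-03-14T00:00:00.125Z", "v": 21.4},
            {"ts": "2017-03-14T00:00:01.310Z", "v": 21.5},
            # ...
        ],
    }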
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data that lives in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
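A minimal sketch of that setup, with made-up database, collection, and field names:

    # Minimal sketch of approach 2; database, collection, and field names are assumptions.
    from datetime import datetime
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    client.admin.command("enableSharding", "telemetry")
    client.admin.command("shardCollection", "telemetry.readings",
                         key={"sensorId": 1, "ts": 1})

    # A typical read then targets a narrow shard-key range:
    day_start, day_end = datetime(2017, 3, 14), datetime(2017, 3, 15)
    day_data = client.telemetry.readings.find(
        {"sensorId": "s42", "ts": {"$gte": day_start, "$lt": day_end}})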
Whilst MongoDB has no limit on collections, I tried a similar approach to 2 but moved away from it to a single collection for all sensor values, because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
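A toy sketch of the change-only idea (not my production code; all names are invented):

    # Toy sketch of change-only storage: skip a reading when the value is unchanged.
    from pymongo import MongoClient

    coll = MongoClient().telemetry.readings
    _last = {}   # sensorId -> last stored value

    def store(sensor_id, ts, value):
        if _last.get(sensor_id) == value:
            return                      # same-value run: nothing to write
        _last[sensor_id] = value
        # Using the timestamp as _id only works if timestamps are globally unique.
        coll.insert_one({"_id": ts, "sensorId": sensor_id, "v": value})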
When I query the data I'm only interested in getting data from a
given sensor on a given day. I don't do any cross sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of our projects we were stuck with legacy MongoDB in a scenario similar to yours, except that the amount of new data per day was even lower.
We tried changing the data structure, splitting the data over multiple MongoDB collections, changing replica set configurations, etc.
But we were still disappointed: as the data grew, performance degraded under unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in data-retrieval performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always run a Spark (map-reduce) job.
Moreover, thinking about the future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy reasons is fine as long as it still suits you well, but if it doesn't, it's more effective to change the technology stack.
If I understand right, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I think MongoDB is the wrong choice for this: in MongoDB there is no way to query documents across collections, so you would have to write a complex mechanism to retrieve the data. In my opinion you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017 and then do a wildcard search across indices (e.g. sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collection. While sharding, consider your query pattern so that you get optimal performance and it does not degrade over time.
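As a hedged illustration of the wildcard search, assuming a 7.x elasticsearch-py client and an example index naming scheme:

    # Sketch with a 7.x elasticsearch-py client; the daily index naming is just an example.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    hits = es.search(
        index="sensor-data-s1-*",   # wildcard across the per-day indices
        body={"query": {"range": {"ts": {"gte": "2017-03-14", "lt": "2017-03-15"}}}},
    )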
Approach #1 is not great; the key to speed here is divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place one signal in one collection and shard the signals over nodes to speed up reads. Multiple collections (signals) can live on the same node.
How this will help
Signal processing usually works on a time span, e.g. processing a signal over 3 days; in that case you can read 3 nodes in parallel for that signal and do parallel Apache Spark processing.
Cross-signal processing: most signal-processing algorithms use the same period of 2 or more signals for analysis (e.g. cross-correlation), and since these 2+ signals are fetched in parallel this is also fast, and the pre-processing of individual signals can be parallelized as well.

Is it a good idea to generate per day collections in mongodb

Is it a good idea to create per-day collections for the data on a given day (we could start with per-day and then move to per-hour if there is too much data)? Is there a limit on the number of collections we can create in MongoDB, and is it an overhead for MongoDB to maintain so many collections? Does a large number of collections have any adverse effect on performance?
To give you more context, the data will be more like Facebook feeds, and only the latest data (say the last week or month) is really important to us. Making per-day collections keeps the number of documents low and would probably result in fast access. Even if we need old data, we can fall back to the older collections. Does this make sense, or am I heading in the wrong direction?
What you actually need is to archive the old data. I would suggest you take a look at this thread on the mongodb-user mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
The last post there, from Michael Dirolf (10gen), says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
So I guess you can stay with a single collection, and good indexes will do the work.
Anyhow, if the collection gets too big you can always run a manual archive process.
Yes, there is a limit to the number of collections you can make. From the Mongo documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even still, it would take something like 60 years to hit that limit.
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users have feeds updated in a week, you're in a bit of a tight spot. It's not easy/trivial to query across collections.
I would recommend instead making one collection to store the data and simply move data out periodically as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.
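For example, a rough sketch of such a job (collection and field names are assumptions):

    # Rough sketch of a periodic archive job: copy documents older than 30 days into an
    # archive collection and delete them from the hot one (names are assumptions).
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    db = MongoClient().social
    cutoff = datetime.utcnow() - timedelta(days=30)

    old_docs = list(db.feeds.find({"createdAt": {"$lt": cutoff}}))
    if old_docs:
        db.feeds_archive.insert_many(old_docs)
        db.feeds.delete_many({"createdAt": {"$lt": cutoff}})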
Creating a collection is not much overhead, but it is more overhead than creating a new document inside a collection.
There is a limit on the number of collections that you can create: "http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces"
Making new collections will not, in my view, make any performance difference, because in RAM you only cache the data that you actually query; in your case that will be the recent feeds, etc.
But having per-day/hour collections will make it very easy to get at the old data.

indexing multiple keys for random queries in different combinations of keys

I'm building an application that stores lots of data per user (possibly in gigabytes).
Something like a request log, so lets say you have the following fields for every record:
customer_id
date
hostname
environment
pid
ip
user_agent
account_id
user_id
module
action
id
response code
response time (range)
and possibly some more.
The good thing is that the usage will be mostly write-only, but when there are reads I'd like to be able to answer them quickly, in near real time.
Another prediction about the usage pattern is that most of the time people will be looking at the most recent data, and only infrequently query the past, aggregate, etc., so my guess is that the working set will be much smaller than the whole database, i.e. recent data for most users plus ranges of history for the users who are doing analytics right now.
For the latter case I suppose it's OK for the first query to be slower until it gets the range into memory.
But the problem is that I'm not quite sure how to index the data effectively.
The start of the index is clear: it's customer_id and date. But the rest can be used in any combination, and I can't predict the most common ones, at least not with any degree of certainty.
We are currently prototyping this with Mongo. Is there a way to do this effectively in Mongo (storage/cpu/cost)?
The only thing that comes to mind is to try to predict a couple of frequent queries and index them, and just massively shard the data, ensuring that each customer's data is spread evenly over the shards, to allow a fast scan over just the 'customer, date' index for the rest of the queries.
P.S. I'm also open to suggestions about db alternatives.
With this limited number of fields, you could potentially just have an index on each of them, or perhaps in combination with customer_id. MongoDB is clever enough to pick the fastest index for each case. If you can fit your whole data set in memory (a few GB is not a lot of data!), then none of this really matters.
You're saying you have a GB per user, but that still means you can have an index on each of the fields, as there are only about a dozen. And with that much data you will want sharding at some point soon anyway.
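A sketch of that approach, using hypothetical field names from the question's list:

    # Sketch of that: one index per field, each prefixed with customer_id
    # (the field names are taken from the list in the question).
    from pymongo import MongoClient

    logs = MongoClient().app.request_log
    for field in ("date", "hostname", "environment", "module", "action", "response_code"):
        logs.create_index([("customer_id", 1), (field, 1)])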
cheers,
Derick
I think your requirements don't really mix well together: you can't have lots of data and instantaneous ad-hoc queries.
If you use a lot of indexes, your writes will be slow and you'll need much more RAM.
May I suggest this:
Keep your index on customer_id and date to serve recent data to users, and relax your requirement for either real-timeliness or accuracy of the aggregate queries.
If you sacrifice accuracy, you will be firing map-reduce jobs every once in a while to precompute queries. Users may then see slightly stale data (or may not; it's historical, immutable data, after all).
If you sacrifice speed, then you'll run map-reduce on every query (right now that's the only sane way of calculating aggregates in a MongoDB cluster).
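For instance, a hedged sketch of such a precompute job, written with the aggregation pipeline as a stand-in for map-reduce (all names are examples):

    # Hedged sketch of a scheduled precompute job, using the aggregation pipeline
    # as a stand-in for map-reduce; collection and field names are examples.
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    logs = MongoClient().app.request_log
    logs.aggregate([
        {"$match": {"date": {"$gte": datetime.utcnow() - timedelta(days=1)}}},
        {"$group": {
            "_id": {"customer": "$customer_id", "day": "$date", "code": "$response_code"},
            "count": {"$sum": 1},
        }},
        {"$out": "request_log_daily_counts"},   # overwrites the precomputed collection
    ])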
Hope this helps :)
