I'm using MongoDB to handle time series data. This has worked fine so far because there isn't much data yet, but I now need to identify what is needed to scale to a much larger volume. Today more than 200k data points are received per day, with individual points arriving every couple of seconds; that is not huge, but it should increase soon.
The current data collection is far from efficient, as each piece of data (parentID, timestamp, value) creates its own document. I've seen several approaches that use one document to hold a whole hour of the time series (with, for instance, an inner array holding one entry per second). That looks really good, but since the data I have to handle is not received at regular intervals (it depends on the parentID), this approach might not be appropriate.
Among the data I receive:
- some are received every couple of seconds
- some are received every couple of minutes
For all of this data, the interval between two consecutive points is not necessarily constant.
Is there a better approach I could use to handle this data, for instance a different data model, that would help the database scale?
Today only one mongod process is running, and I'm wondering at what point sharding really becomes necessary. Any tips on this?
You may still be able to reap the benefit of having a preallocated document even if readings aren't uniformly distributed. You can't structure each document by the time of the readings, but you can structure each document to hold a fixed number of readings:
{
    "type" : "cookies consumed",
    "0" : { "number" : 1, "timestamp" : ISODate("2015-02-09T19:00:20.309Z") },
    "1" : { "number" : 4, "timestamp" : ISODate("2015-02-09T19:03:25.874Z") },
    ...
    "1000" : { "number" : 0, "timestamp" : ISODate("2015-01-01T00:00:00Z") }
}
Depending on your use case, this structure might work for you and give you the benefit of updating preallocated documents with new readings, only allocating a brand new document every N readings for some big N.
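For example, a minimal sketch of the update path, assuming the application tracks the current bucket document and its next free slot (the variable names and the slot counter are assumptions, not part of the structure above):

var bucketId = ObjectId("54d8e3f1a9f3c2b4d6e7f890");  // _id of the current preallocated bucket (hypothetical)
var slot = 42;                                        // next free slot, tracked by the application
var setOp = {};
setOp[String(slot)] = { number: 1, timestamp: new Date() };
// Each reading is an in-place update; a brand new bucket document is only
// inserted once every N readings.
db.readings.update({ _id: bucketId }, { $set: setOp })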
The solution to your problem is very well captured here:
http://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb
The basic idea, as already pointed out, is to capture a fixed number of events per document and to keep track of each document's start and end timestamps in another, "higher-level" collection.
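A rough sketch of that two-collection layout (the collection and field names here are my own assumptions for illustration, not taken from the article):

// One fixed-size bucket of events per document...
db.buckets.insert({
    _id: "sensor42-000123",
    events: [
        { t: ISODate("2015-01-21T10:00:05Z"), value: 3.2 }
        // ...up to N events per bucket
    ]
})
// ...plus a "higher-level" collection recording each bucket's time range,
// so the relevant buckets can be located for a given query interval.
db.bucket_index.insert({
    bucket: "sensor42-000123",
    start: ISODate("2015-01-21T10:00:05Z"),
    end: ISODate("2015-01-21T11:47:52Z")
})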
Related
I am storing some data in a MongoDB database and I'm not sure about the structure I should use... It's about IoT sensors that send a value (temperature, pressure, etc.) at specific times. I want to store in a collection (the collection name would be the sensor name) all the values from the sensor over time (I thought about an array), along with the sensor type (like temperature).
Here is an example:
{
    history : [
        { date : ISODate("2016-02-01T11:23:21.321Z"), value : 10.232216 },
        { date : ISODate("2016-02-01T11:26:41.314Z"), value : 10.164892 }
    ],
    type : "temperature"
}
But my problem is that I want to query the database to get the history as a "list" of documents, each one with the date and the value.
On the other hand, I want to append a new value to the history each time a new one arrives.
Thanks
Store every reading in a readings collection like:
{
    date : ISODate("2016-02-01T11:23:21.321Z"),
    value : 10.232216,
    type : "temperature",
    "sensor-name" : "sensor-1"
}
This way you can access readings by type, date, value AND sensor. There is no reason why you would need to create a collection for each sensor.
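For instance, a quick sketch of the kind of query this enables, with a compound index to back it (the index choice is my assumption; the field names come from the document above):

db.readings.createIndex({ "sensor-name": 1, type: 1, date: 1 })
// All temperature readings from one sensor since a given date, oldest first.
db.readings.find({
    "sensor-name": "sensor-1",
    type: "temperature",
    date: { $gte: ISODate("2016-02-01T00:00:00Z") }
}).sort({ date: 1 })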
Ting Sun's answer is absolutely appropriate: just store each measurement reading as a separate document in a collection. It's up to you whether you arrange a separate collection for each sensor, although putting them all into the same collection seems more natural.
In particular, you should not store items - in your case measurement readings - whose number grows without bound or could become "very large" in an embedded array of another MongoDB document. This is because:
- The size of an individual document is limited to 16 MB (as of MongoDB 3.2).
- Frequently modifying the parent document can be inefficient for the database engine's memory management.
- Queries for individual embedded items/measurements are inefficient and harder to implement, because you actually have to query for the entire parent document.
How you divide readings into collections is completely up to you, whether one collection or multiple. And there are likely good arguments to be had on both sides.
However, regarding arrays: just remember that sensor readings are unbounded. That is, they are potentially infinite in nature - just a continuous flow of readings. MongoDB documents are limited in size (currently 16 MB). With unbounded arrays you will eventually hit this limit, which will result in failed updates and require you to alter your storage architecture to accommodate your sensor readings.
So... you either need to devise a bucketing scheme that splits the array data across multiple documents (to stay under the document size limit), or avoid arrays and store readings as separate documents.
I am wondering what is the best way to expire only a subset of a collection.
In one collection I store conversion data and click data.
The click data I would like to keep for, let's say, a week,
and the conversion data for a year.
In my collection "customers" I store something like:
{ "_id" : ObjectId("53f5c0cfeXXXXXd"), "appid" : 2, "action" : "conversion", "uid" : "2_b2f5XXXXXX3ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
And
{ "_id" : ObjectId("53f5c0cfe4b0d9cd24847b7d"), "appid" : 2, "action" : "view", "uid" : "2_b2f58679e6f73ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
for the click data
So should I execute an ensureIndex, or use something like a cron job?
Thank you in advance
There are a couple of built in techniques you can use. The most obvious is a TTL collection which will automatically remove documents based on a date/time field. The caveat here is that for that convenience, you lose some control. You will be automatically doing deletes all the time that you have no control over, and deletes are not free - they require a write lock, they need to be flushed to disk etc. Basically you will want to test to see if your system can handle the level of deletes you will be doing and how it impacts your performance.
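A minimal sketch, assuming the expiry is keyed off the "t" timestamp field from the question's documents and that clicks are kept for 7 days:

// Documents are removed roughly 604800 seconds (7 days) after "t".
// Note the TTL applies to every document in the collection, which is part of
// the loss of control mentioned above.
db.customers.ensureIndex({ t: 1 }, { expireAfterSeconds: 604800 })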
Another option is a capped collection - capped collections are pre-allocated on disk and don't grow (except for indexes), they don't have the same overheads as TTL deletes do (though again, not free). If you have a consistent insert rate and document size, then you can work out how much space corresponds to the time frame you wish to keep data. Perhaps 20GiB is 5 days, so to be safe you allocate 30GiB and make sure to monitor from time to time to make sure your data size has not changed.
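A sketch of that pre-allocation, using the sizing example above (the collection name is hypothetical; size is in bytes):

// Once the 30 GiB are full, the oldest documents are overwritten automatically.
db.createCollection("clicks", { capped: true, size: 30 * 1024 * 1024 * 1024 })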
After that you are into more manual options. For example, you could simply have a field that marks a document as expired or not, perhaps a boolean - that would mean that expiring a document would be an in-place update and about as efficient as you can get in terms of a MongoDB operation. You could then do a batch delete of your expired documents at a quiet time for your system when the deletes and their effect on performance are less of a concern.
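A sketch of that pattern, using the field names from the question's documents (the "expired" flag and the 7-day cutoff are assumptions):

// Flagging is an in-place update and can happen at any time...
var cutoff = new Date(Date.now() - 7 * 24 * 3600 * 1000);
db.customers.update(
    { action: "view", t: { $lt: cutoff } },
    { $set: { expired: true } },
    { multi: true }
)
// ...while the actual deletes run as one batch during a quiet period.
db.customers.remove({ expired: true })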
Another alternative: you could start writing to a new database every X days in a predictable pattern so that your application knows what the name of the current database is and can determine the names of the previous 2. When you create your new database, you delete the one older than the previous two and essentially always just have 3 (sub in numbers as appropriate). This sounds like a lot of work, but the benefit is that the removal of the old data is just a drop database command, which just unlinks/deletes the data files at the OS level and is far more efficient from an IO perspective than randomly removing documents from within a series of large files. This model also allows for a very clean backup model - mongodump the old database, compress and archive, then drop etc.
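Sketched out with a hypothetical naming scheme:

// The application writes to the current database in the rotation...
db.getSiblingDB("clicks_2014_33").events.insert({ appid: 2, action: "view", t: new Date() })
// ...and whenever a new database is created, the oldest one is dropped,
// which just unlinks the data files at the OS level.
db.getSiblingDB("clicks_2014_30").dropDatabase()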
As you can see, there are a lot of trade offs here - you can go for convenience, IO efficiency, database efficiency, or something in between - it all depends on what your requirements are and what fits best for your particular use case and system.
I am considering MongoDB to hold our campaign log data:
{
    "domain" : "",
    "log_time" : "",
    "email" : "",
    "event_type" : "",
    "data" : {
        "campaign_id" : "",
        "campaign_name" : "",
        "message" : "",
        "subscriber_id" : ""
    }
}
The above is our event structure. Each event is associated with one domain, a domain can contain any number of events, and there is no relation between one domain and another.
Most of our queries are specific to a single domain at a time.
For quick query responses I'm planning to create one collection per domain, so that I can query that domain's collection instead of querying the whole data set containing every domain's events.
We will have at least 100k+ domains in the future, so I would need to create 100k+ collections.
We are expecting 1 million+ documents per collection.
Our main intention is to index only the required collections rather than the whole data set; that is why we are planning to have one collection per domain.
Which approach is better for my case:
1. Storing all domains' events in one collection, or
2. Storing each domain's events in a separate collection?
I have seen some questions about the maximum number of collections MongoDB can support, but I didn't get clarity on this topic. As far as I know, the default namespace limit of 24k can be extended, but if I create 100k+ collections, how will performance be affected?
Is this approach (using a very large number of collections) right for my case?
Please advise on my approach. Thanks in advance.
Without some hard numbers, this question would be probably just opinion based.
However, if you do some calculations with the numbers you provided, you will get to a solution.
So your total document count is:
100k collections × 1M documents each = 100 billion (100,000,000,000) documents.
From your document structure, I'm going to make a rough estimate and say that the average size of each document will be 240 bytes (it may be even higher).
Multiplying those two numbers gives roughly 24 TB (~21.8 TiB) of data. You can't store that amount of data on just one server, so you will have to split it across multiple servers.
With this amount of data, your problem isn't anymore one collection vs multiple collections, but rather, how do I store all of this data in MongoDB on multiple servers, so I can efficiently do my queries.
If you have 100K collections, you can probably do some manual work and store e.g. 10 K collections per MongoDB server. But there's a better way.
You can use sharding and let the MongoDB do the hard work of splitting your data across servers. With sharding, you will have one collection for all domains and then shard that collection across multiple servers.
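A minimal sketch of that setup; the shard key ({ domain: 1, log_time: 1 }) is an assumption based on your queries being domain-specific, and the database/collection names are hypothetical:

sh.enableSharding("campaigns")
// Queries that filter on "domain" are routed only to the shards holding that
// domain's chunks, giving you the per-domain isolation you were after.
sh.shardCollection("campaigns.events", { domain: 1, log_time: 1 })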
I would strongly recommend you to read all documentation regarding sharding, before trying to deploy a system of this size.
I have a system where tens of client machines are sending objects to a single server. The job of the server is to aggregate all the objects (removing duplicates, of which there are many) and produce, every hour, a file of the objects received during the previous hour.
I tried MongoDB for this task and it did a good job, but there is the overhead of going over all the records at the end of each hour to produce the file. I am now thinking about gradually building the file as data is received, closing it at the end of the hour, starting a new file, and so on.
I don't need to do any searching or querying of the data, just dropping duplicates based on a key and producing a file of all the data. Also, once I first receive a record, any duplicates of it arrive within a maximum of 3 minutes.
Which system should I use? Do you recommend a different approach?
Even though you state in your comments that you don't like the idea, I would recommend using indexes. You can put a unique index on these fields and use it to reject duplicates at insert time.
This does, as you rightly point out, require checking the index; however, whichever race-condition-free route you take (really the only way to guarantee there are no duplicates), you will need to hit the index, either by querying first or by letting the index insertion itself reject the duplicate.
Index insertion is probably the best route here; at the end of the day the performance difference is small enough that it doesn't really matter.
As for removing your old records, I would not use a TTL index. It would be much better to simply drop your collection when you're ready to receive a new batch: not only is this a lot faster, it also sends the collection's space to the $freelists instead of adding the TTL-deleted documents to a deleted bucket list, which can cause fragmentation and slow down your system.
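A rough sketch of that flow (the collection and field names are assumptions for illustration):

// The unique index is the duplicate check: a second insert with the same key errors out.
db.currentBatch.ensureIndex({ dedupKey: 1 }, { unique: true })
db.currentBatch.insert({ dedupKey: "abc123", payload: "..." })
// After the hourly file has been produced, drop the whole collection instead of
// deleting documents one by one.
db.currentBatch.drop()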
Consider this document:
{
"name" : "a",
"type" : "b",
"hourtag": 10,
"created": ISODate("2014-03-13T06:26:01.238Z")
}
Let's say we set up a unique index for name and type plus an hourtag property, whose value you add to the document to represent the hour of day it was inserted. Also add a created date if there isn't one already, and set another index on that:
db.collection.ensureIndex({ hourtag: 1, name: 1, type: 1 }, { unique: true })
db.collection.ensureIndex({ created: 1 }, { expireAfterSeconds: 7200 })
The second index is defined as a TTL index, and set the expireAfterSeconds value to be 2 hours.
So you insert your documents as you go, adding the property for the "current hour" that you are in, and the duplicate items will fail to insert.
At the end of the hour, get all the documents for the "last" hour value and process them.
Using the "TTL" index, the documents you no longer need get cleaned up after their expiry time.
That's the most simple implementation I can think of. Tweak the expiry time to your own needs.
Defining the hourtag first in the index order gives you a simple search, while maintaining your "duplicate" rules.
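Putting it together, a small sketch of the insert path and the end-of-hour query (computing the hour values this way is my assumption):

// Duplicates within the same hour fail on the unique index above.
db.collection.insert({
    name: "a",
    type: "b",
    hourtag: new Date().getHours(),
    created: new Date()
})
// At the end of the hour, pull the previous hour's batch for processing.
var lastHour = (new Date().getHours() + 23) % 24;
db.collection.find({ hourtag: lastHour })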
We have a hundred computers running, and each computer sends back a heartbeat every few minutes. We capture those heartbeats in our MongoDB database. Now we want to check the last time each node sent its heartbeat. One solution we have is to query for each node and get back its last heartbeat time, but that introduces as many queries to the database as we have nodes. We wonder if there is a simpler approach.
To be more specific, we store each heartbeat from a node in a separate document, something like the following:
{
"_id" : ObjectId("51d173adedfce2c67fe04c4a"),
"nodeId" : 260,
"heartBeat" : NumberLong(1374778030),
"status" : "DEPLOYED"
}
You can get the time from the ObjectId: query by node id, sort by ObjectId descending, and read the timestamp from the latest document's ObjectId. That is your last ping time.
See Here. and Here.
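As a sketch, assuming the heartbeats live in a collection called "heartbeats":

// Latest document for one node; the ObjectId embeds the insertion time.
var last = db.heartbeats.find({ nodeId: 260 }).sort({ _id: -1 }).limit(1).next()
var lastPing = last._id.getTimestamp()  // last time this node pinged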
Along the same lines as Dylan's comment, you should probably provide some more information to get an optimal answer. One thing that comes to mind is whether you do full scans every time you look for heartbeats. You could potentially group several nodes into one document as an array (or create new collections based on your access patterns) and handle the rest in the application layer.