MongoDB : Arrays or not? - mongodb

I am storing some data into a mongo database and I'm not sure about the structure I have to use... It's about IoT sensors that sends a value (temperature, pression, etc...) every specific time. I want to store into a collection (the collection name will be the sensor name) all the value from the sensor for a specific time (I thought about an array), the sensor type (like temperature).
Here is an example :
{
history : [ { date : "ISODate(2016-02-01T11:23:21.321Z)", value : 10.232216 }, { date : "ISODate(2016-02-01T11:26:41.314Z)", value : 10.164892 } ],
type : "temperature"
}
But my problem is that I want to query the database to get the history as a "list" of document. Each one with the date and the value.
On the other hand, I want to add a new value to the history each time there is a new one.
Thanks

Store every reading in a readings collection like:
{
date : "ISODate(2016-02-01T11:23:21.321Z)",
value : 10.232216,
type : "temperature",
sensor-name: "sensor-1"
}
This way you can access readings by type, date, value AND sensor. There is no reason why you would need to create a collection for each sensor.

Ting Suns answer is absolutely appropriate: Just store each measurement reading as a separate document in a collection. In doing so it's up to you if you want to arrange a separate collection for each sensor. Although putting them all into the same collection seems to be more obvious.
Especially you should not store items - in your case measurement readings - whose number is basically infinitely growing or could become "very large" into an embedded array of another MongoDB document. This is because:
The size of an individual document is limited to 16MB (MongoDB Version 3.2)
Often recurring modifications of the parent document could be inefficient for the memory management of the database engine.
Furthermore queries for individual embedded items/measurements are inefficient and more difficult to implement because you would actually have to query for the entire parent document.

How you divide readings into collections is completely up to you, whether one collection or multiple. And there are likely good arguments to be had on both sides.
However, regarding arrays: Just remember that sensor readings are unbounded. That is, they are possibly infinite in nature - just a flow of readings. MongoDB documents are limited in size (currently 16MB). With unbounded arrays, you will eventually hit this limit, which will result in failed updates, and requiring you to alter your storage architecture to accommodate your sensor readings.
So... you either need to devise a sharding solution to split array data across multiple documents (to avoid document-size-limit issues), or avoid arrays and store readings in separate documents.

Related

MongoDb large array in field vs. many small documents with one array item each

I have a question about how to store some simple data in mongoDb that happens to be large and vary in size regularly.
I want to store all of the threadIds from each of my users Inbox's.
The threadIds are strings that look like "168849c793fa996a". It may be common that a user has 10,000+ (~400KB) or even occasionally as many as 50,000+(~2MB) worth of threadIds.
My app assists in clearing out the Inbox (deleting and sorting messages)
I will be using the information to know what the current state of a users Inbox is, i.e. what messages have been removed and what new ones have arrived.
The array will therefore be updated semi-frequently and it's size may significantly change on each update.
This leaves me with two ideas on how to store the data
If I store documents like this:
{
_id: ObjectID,
userId: String,
threadIds: [String]
}
a. It will be easy to query the array of threadIds with db.collection.findOne().
b. It will be easy to update with db.collection.updateOne() (or perhaps db.collection.deleteOne() and db.collection.insertOne() to avoid the large fluctuations in document size).
However...
c. I have read that it is not good for the db to have docs that change in size radically on such a scale
d. my experience is that an array with 10,000+ Strings in it can make Compass hang for ~10-15 seconds if there are a few documents of that size on the page and hang for 20-30 seconds if there is an attempt to view the array within a document that contains it. (Although my app itself loads the data very fast when findOne() is called).
If I store documents like this :
{
_id: ObjectID,
userId: String,
threadId: String
}
a. It will be easy to get an array of all threadIds back by querying with db.collection.find().distinct({ userId: userId }) with an index on userId.
b. This seems to be more inline with the mongoDb way of many smaller documents and is much friendlier within the Compass user interface.
However...
c. It will be slightly harder to update the information because I will have to use db.collection.deleteMany() and db.collection.insertMany() on every update.
d. This would add a small to medium amount of extra complexity to my app.
Given the size of the data held in the array fluctuating between ~5KB and 2MB but also the fact that I will likely always need the entire array every time I query for it. What is the best (most "correct") way that I can represent, store, and update this data?

Mongo for non regular time-series

I'm using MongoDB to handle timeseries, this is working fine as until now there is not too many data but I now need to identify what is needed to scale to a larger number of data. Today, there are +200k data received per day, each data received every couple of seconds, that is not huge but this should increase soon.
The data collection used is far from beeing efficient as each piece of data (parentID, timestamp, value) creates a document. I've seen several approaches that uses a document that keeps the timeseries for a whole hour (with, for instance, an inner array that keeps data for each seconds), this is really great but as the data I have to handle are not received regularly (depending upon the parentID), this approach might not be appropriate.
Among the data I receive:
- some are received every couple of seconds
- some are received every couple of minutes
For all those data, the step between 2 consecutive ones is not necessarily the same.
Is there a better approach I could use to handle those data, for instance using another modelisation, that could help to scale the DB ?
Today only one mongod process is running, and I'm wondering at which level the sharding might really be needed, any tips for this ?
You may still be able to reap the benefit of having a preallocated document even if readings aren't uniformly distributed. You can't structure each document by the time of the readings, but you can structure each document to hold a fixed number of readings
{
"type" : "cookies consumed"
"0" : { "number" : 1, "timestamp" : ISODate("2015-02-09T19:00:20.309Z") },
"1" : { "number" : 4, "timestamp" : ISODate("2015-02-09T19:03:25.874Z") },
...
"1000" : { "number" : 0, "timestamp" : ISODate("2015-01-01T00:00:00Z") }
}
Depending on your use case, this structure might work for you and give you the benefit of updating preallocated documents with new readings, only allocating a brand new document every N readings for some big N.
The solution to your problem is very well captured here:
http://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb
Basic idea as already pointed out is: to have fixed number of events captured per document and keep a track record of the start and end time stamp of each document in another "higher-level" collection.

Which mongo document schema/structure is correct?

I have two document formats which I can't decide is the mongo way of doing things. Are the two examples equivalent? The idea is to search by userId and have userId be indexed. It seems to me the performance will be equal for either schemas.
multiple bookmarks as separate documents in a collection:
{
userId: 123,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
userId: 123,
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
},
{
userId: 456,
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
multiple bookmarks within one document per user.
{
userId: 123,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
},
{
bookmarkName: "yahoo",
bookmarkUrl: "www.yahoo.com"
}
]
},
{
userId: 456,
bookmarks:[
{
bookmarkName: "google",
bookmarkUrl: "www.google.com"
}
]
}
The problem with the second option is that it causes growing documents. Growing documents are bad for write performance, because the database will have to constantly move them around the database files.
To improve write performance, MongoDB always writes each document as a consecutive sequence to the database files with little padding between each document. When a document is changed and the change results in the document growing beyond the current padding, the document needs to be deleted and moved to the end of the current file. This is a quite slow operation.
Also, MongoDB has a hardcoded limit of 16MB per document (mostly to discourage growing documents). In your illustrated use-case this might not be a problem, but I assume that this is just a simplified example and your actual data will have a lot more fields per bookmark entry. When you store a lot of meta-data with each entry, that 16MB limit could become a problem.
So I would recommend you to pick the first option.
I would go with the option 2 - multiple bookmarks within one document per user because this schema would take advantage of MongoDB’s rich documents also known as “denormalized” models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations. Link
There are two tools that allow applications to represent these
relationships: references and embedded documents.
When designing data models, always consider the application usage of
the data (i.e. queries, updates, and processing of the data) as well
as the inherent structure of the data itself.
The Second type of structure represents an Embedded type.
Generally Embedded type structure should be chosen when our application needs:
a) better performance for read operations.
b) the ability to request and retrieve
related data in a single database operation.
c) Data Consistency, to update related data in a single atomic write operation.
In MongoDB, operations are atomic at the document level. No single
write operation can change more than one document. Operations that
modify more than a single document in a collection still operate on
one document at a time. Ensure that your application stores all fields
with atomic dependency requirements in the same document. If the
application can tolerate non-atomic updates for two pieces of data,
you can store these data in separate documents. A data model that
embeds related data in a single document facilitates these kinds of
atomic operations.
d) to issue fewer queries and updates to complete common operations.
When not to choose:
Embedding related data in documents may lead to situations where
documents grow after creation. Document growth can impact write
performance and lead to data fragmentation. (limit of 16MB per
document)
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation to be applied on all the documents.
minimum set of functions that would be required to get the aggregated results, $match,$group(with $push operator):
db.collection.aggregate([{$match:{"userId":123}},{$group:{"_id":"$userId","bookmarkNames":{$push:"$bookmarkName"},"bookMarkUrls:{$push:"$bookmarkUrl"}"}}])
or a find() which returns multiple documents to be iterated.
Wheras the Embedded type would allow us to fetch it using a $match in the find query.
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. We would view the first type as an unwinded form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection,
is normally used in case of logging. Where the log entries are huge and will have a TTL, time to live. The collections in that case, would be capped collections. Where documents would be automatically deleted after a particular period of time.
Bottomline, if your documents size would not grow beyond 16 MB at any particular time opt for the Embedded type. it would save developing effort as well.
See Also: MongoDB relationships: embed or reference?

MongoDB - How much data is too much data?

I'm building an application that uses MongoDB as a database. I have a lot of products, and I want to log what products a user looks at to the user's database entry. For instance, a user profile looks like this:
{
"email" : "foo#bar.com",
"name" : "John Snow",
"_id" : ObjectId("51ecbcc6896652a008000001"),
"productsViewed" : [
product1,
product2,
product3,
product4
]
}
I have two options here. I can log just the _id of each product, or I could log entire objects representing the product (name, price, ~100 word description, categories, that sort of thing). The difference in object size is 1 line of text per product vs about 30 lines per product.
I realise that this is probably a trivial amount of data to be concerned about, but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact? Logging more data is far more useful for my purposes but I'd like to avoid my database calls lagging if the user profile becomes quite large.
Question is: At what point (in character length, I guess?) is too much data to store with one MongoDB record?
16 Meg is the limitation for the entire document. This means that all strings etc have to fit within 16 meg. However, before that there are more limitation on your schema which you, yourself hint at:
but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact?
And the answer is yes. First off with the added data of the root user you will probably be over the 16 meg limit, however, further on from this the in-memory $pull, $push and other sub document operators might have a hard time keeping peformance up. You can sort of mitigate that problem by batching your subdocuments into groups of 100.
However, yet again, you have an even bigger problem: Fragmentation. Since MongoDB stores the record in a single contigeous space on the disk, hence it has settings like padding, you could see considerable fragmentation from odd sized record objects not being reused here.
I would personally say that you should factor off this relation to a separate collection.

Mongodb data storage performance - one doc with items in array vs multiple docs per item

I have statistical data in a Mongodb collection saved for each record per day.
For example my collection looks roughly like
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance, is this strategy superior than storing one document per record_id containing an array of statistical data just like the above dict:
{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}
On the pro side I will need one query to get the entire stat history of a record without resorting to the slower map-reduce method, and on the con side I'll have to sum up the stats for a given date range in my application code and if a record outgrows is current padding size-wise there's some disc reallocation that will go on.
I think this depends on the usage scenario. If the data set for a single aggregation is small like those 700 records and you want to do this in real-time, I think it's best to choose yet another option and query all individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain and it does not suffer from reallocation or size limits. Index use should be efficient and connection-wise, I doubt there's much of a difference: most drivers batch transfers anyway.
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per records would go down in the subdocument approach. It's also generally easier to work with db documents rather than subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.
I think you can reference to here, and also see foursquare how to solve this kind of problem here . They are both valuable.