I have a document that stores sensor data where the sensor readings are objects stored in an array. Example:
{
"readings": [
{
"timestamp": 1499475320,
"temperature": 121
},
{
"timestamp": 1499475326,
"temperature": 93
},
{
"timestamp": 1499475340,
"temperature": 142
}
]
}
I know how to push/add an item to the "readings" array. But what I need is when I add an item to the array, I also want to "clean" the array by removing items that have "timestamp" value older than a cutoff time.
Is this possible in mongodb?
The way I see this you basically have two options here that have varying approaches.
Restrict Arrays to Capped Size
The first option here is "not exactly" what you are asking for, but it is the option with the least implementation and execution overhead. The variance from your question is that instead of "removing past a certain age", we instead simply place a "limit/cap" on the total number of entries in the array.
This is actually done using the $slice modifier to $push:
Model.update(
    { "_id": docId },
    { "$push": {
        "readings": {
            "$each": [{ "timestamp": 1499478496679, "temperature": 100 }],
            "$slice": -10
        }
    }}
)
In this case the -10 argument restricts the array to only have the "last ten" entries from the end of the array since we are "appending" with $push. If you wanted instead the "latest" as the first entry then you would modify with $position and instead provide the "positive" value to $slice, which means "first ten" in contrast.
So it's not the same thing you asked for, but it is practical since the arrays do not have "unlimited growth" and you can simply "cap" them as each update is made and the "oldest" item will be removed once at the maximum length. This means the overall document never actually grows beyond a set size, and this is a very good thing for MongoDB.
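To make the capping behaviour concrete, here is a minimal local sketch of what $push with $each and $slice does to an array. It is plain JavaScript rather than MongoDB code, the function name pushWithSlice and the sample readings are made up for illustration, and it only covers the append-then-trim case described above.

```javascript
// Local simulation of $push with $each and $slice; this is NOT MongoDB code.
function pushWithSlice(array, each, slice) {
  const combined = array.concat(each); // $push appends the $each items
  if (slice === 0) return [];          // a $slice of zero empties the array
  return slice < 0
    ? combined.slice(slice)            // negative: keep the last |slice| items
    : combined.slice(0, slice);        // positive: keep the first `slice` items
}

// Twelve existing readings with increasing timestamps (made-up sample data).
const readings = Array.from({ length: 12 }, (_, i) => ({
  timestamp: 1499475320 + i,
  temperature: 100 + i
}));

const capped = pushWithSlice(
  readings,
  [{ timestamp: 1499475400, temperature: 100 }],
  -10
);

console.log(capped.length);
console.log(capped[0].timestamp);
console.log(capped[capped.length - 1].timestamp);
```

The three oldest entries fall off the front, and the newly pushed reading is the last element.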
Issue with Bulk Operations
The next case which actually does exactly what you ask uses "Bulk Operations" to issue "two" update operations in a "single" request to the server. The reason why it is "two" is because there is a rule that you cannot have different update operators "assigned to the same path" in a single update operation.
Therefore what you want actually involves a $push AND a $pull operation, and on the "same array path" we need to issue those as "separate" operations. This is where the Bulk API can help:
Model.collection.bulkWrite([
    // First remove the entries older than the cutoff
    { "updateOne": {
        "filter": { "_id": docId },
        "update": {
            "$pull": {
                "readings": { "timestamp": { "$lt": cutOff } }
            }
        }
    }},
    // Then append the new reading to the same array
    { "updateOne": {
        "filter": { "_id": docId },
        "update": {
            "$push": {
                "readings": { "timestamp": 1499478496679, "temperature": 100 }
            }
        }
    }}
])
This uses the .bulkWrite() method from the underlying driver which you access from the model via .collection as shown. This will actually return a BulkWriteOpResult within the callback or Promise which contains information about the actual operations performed within the "batch". In this case it will be the "matched" and "modified" numbers which will be appropriate to the operations that were actually performed.
Hence if the $pull did not actually "remove" anything since the timestamp values were actually newer than the given constraint, then the modified count would only reflect the $push operation. But most of the time this need not concern you, where instead you would just accept that the operations completed without error and did something according to what you actually asked.
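If you issue this pair of operations from several places, it may help to build them in one function. The following is only a sketch: buildPruneAndPushOps is a hypothetical helper name, the document id and timestamps are made-up values, and the function just constructs the operations array without talking to a server.

```javascript
// Hypothetical helper: builds the two ordered operations described above.
function buildPruneAndPushOps(docId, cutOff, reading) {
  return [
    // 1. Remove every reading older than the cutoff.
    {
      updateOne: {
        filter: { _id: docId },
        update: { $pull: { readings: { timestamp: { $lt: cutOff } } } }
      }
    },
    // 2. Append the new reading to the same array.
    {
      updateOne: {
        filter: { _id: docId },
        update: { $push: { readings: reading } }
      }
    }
  ];
}

const ops = buildPruneAndPushOps("doc-1", 1499475330, {
  timestamp: 1499475350,
  temperature: 100
});
console.log(JSON.stringify(ops, null, 2));
```

The resulting array is what you would hand to Model.collection.bulkWrite(ops). Since bulkWrite is ordered by default, the $pull runs before the $push.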
Conclusion
So the general case of "both" is that it's really all done in one request and one response. The differences come in that "under the hood" the second approach which matches your request actually does do "two" operations per request and therefore takes microseconds longer.
There is actually no reason why you could not "combine" the logic of "both", and remove past your "cutoff" as well as keeping a "cap" on the overall array size. But the general idea here is that the first implementation, though not exactly the same thing as asked, will actually do a "good enough" job of "housekeeping" with little to no additional overhead on the request, or indeed the implementation of the actual code.
Also, whilst you can always "read the data" -> "modify" -> "save", that is not really a great pattern. For best performance as well as "consistency" without conflict, you should be using the atomic operations to modify in just the same way as is outlined here.
Related
I have an index collection containing lots of terms, with a field items containing identifiers from another collection. Currently that field stores an array of documents, and docs are added by $addToSet, but I have some performance issues. It seems an $unset operation is executed faster, so I plan to change the array of documents to a document of embedded documents.
Am I right to think that $set/$unset on fields is faster than push/pull of embedded documents in arrays?
EDIT:
After small tests, we see the set/unset 4 times faster. On the other
hand, if I use object instead of array, it's a little harder to count
the number of properties (vs the length of the array), and we were
counting that a lot. But we can consider using $set every time and
adding a field with the number of items.
This is a document of the current index :
{
"_id": ObjectId("5594dea2b693fffd8e8b48d3"),
"term": "clock",
"nbItems": NumberLong("1"),
"items": [
{
"_id": ObjectId("55857b10b693ff18948ca216"),
"id": NumberLong("123")
},
{
"_id": ObjectId("55857b10b693ff18948ca217"),
"id": NumberLong("456")
}
]
}
Frequent update operations are :
* remove item : {$pull:{"items":{"id":123}}}
* add item : {$addToSet:{"items":{"_id":ObjectId("55857b10b693ff18948ca216"),"id":123,}}}
* I can change $addToSet to $push and check duplicates before if performances are better
And this is what I plan to do:
{
"_id": ObjectId("5594dea2b693fffd8e8b48d3"),
"term": "clock",
"nbItems": NumberLong("1"),
"items": {
"123":{
"_id": ObjectId("55857b10b693ff18948ca216")
},
"456":{
"_id": ObjectId("55857b10b693ff18948ca217")
}
}
}
* remove item : {$unset:{"items.123":true}}
* add item : {$set:{"items.123":{"_id":ObjectId("55857b10b693ff18948ca216"),"id":123,}}}
For information, these operations are made with pymongo (or can be done with php if there is a good reason to), but I don't think this is relevant.
As with any performance question, there are a number of factors which can come into play with an issue like this, such as indexes, need to hit disk, etc.
That being said, I suspect you are likely correct that adding a new field or removing an old field from a MongoDB document will be slightly faster than appending/removing from an array as the array types will be less easy to traverse when searching for duplicates.
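Since the question mentions keeping a separate count field, here is one way the $set/$unset updates might be paired with an $inc on nbItems. This is a sketch under assumptions that are not in the original: the builder names, the placeholder ids, and the choice to guard each update with an $exists filter so the counter only moves when an item is really added or removed. The functions only construct filter and update documents; they do not run against a database.

```javascript
// Hypothetical builders for the object-keyed layout; names are illustrative.
function addItemUpdate(termId, itemId, refId) {
  const path = "items." + itemId;
  return {
    // Only match when the key is absent, so nbItems is not incremented twice.
    filter: { _id: termId, [path]: { $exists: false } },
    update: { $set: { [path]: { _id: refId } }, $inc: { nbItems: 1 } }
  };
}

function removeItemUpdate(termId, itemId) {
  const path = "items." + itemId;
  return {
    // Only match when the key is present, so nbItems is not decremented twice.
    filter: { _id: termId, [path]: { $exists: true } },
    update: { $unset: { [path]: "" }, $inc: { nbItems: -1 } }
  };
}

const add = addItemUpdate("term-clock", 123, "ref-216");
const remove = removeItemUpdate("term-clock", 123);
console.log(JSON.stringify(add));
console.log(JSON.stringify(remove));
```

Because $set and $inc touch different paths here, they can sit in the same update document.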
I want to know the difference between these 2 queries:
myCollection.update ( {
a:1,
b:1,
$isolated:1 } );
myCollection.update ( {
$and:
[
{a:1},
{b:1},
{$isolated:1}
] } );
Basically I need to perform an .update() with $isolated for all the documents that have 'a=1 and b=1'. I'm confused about how to write the '$isolated' param and how to be sure that the query works fine.
I would basically question the "need to perform" of your statement, especially considering lack of { multi: true } where you intend to match and update a lot of documents.
The second consideration here is that your proposed statement(s) lack any kind of update operation at all. That might be a consequence of the question you are asking about the "difference", but given your present apparent understanding of MongoDB query operations with relation to the usage of $and, I seriously doubt you "need" this at all.
So "If" you really needed to write a statement this way, then it should look like this:
myCollection.update(
{ "a": 1, "b": 1, "$isolated": true },
{ "$inc": { "c": 1 } },
{ "multi": true }
)
But what you really "need" to understand is what that is doing.
Essentially this query is going to cause MongoDB to hold a "write lock", at least at the collection level, so that no other operations can be performed until the entire write is complete. This also ensures that all read attempts will only see the state of the documents before any changes were made until the operation is complete in full, after which subsequent reads see all the changes.
This may "sound tempting" as a good idea to some, but it really is not. Write locks affect the concurrency of updates and are generally a bad thing to be avoided. You might also be confusing this with a "transaction", but it is not one, and as such any failure during the execution will only halt the operation at the point where it failed. This does not "undo" changes made within the $isolated block, and they will remain committed.
Just about the only valid use case here, would be where you "absolutely need", all of the elements to be modified matching "a" and "b" to maintain a consistent state in the event that something was "aggregating" that combination at the exact same time as this operation was run. In that case, then exposing "partially" altered values of "c" may not be desirable. But the range of usage of this is pretty slim, and most general applications do not require such consistency.
Back to the usage of $and: all MongoDB query conditions are implicitly combined as an $and operation anyway, unless stated otherwise. The only general usage for $and is where you need multiple conditions on the same document key. And even then, that is generally better written on the "right side" of evaluation, such as with $gt and $lt:
{ "a": { "$gt": 1, "$lt": 3 } }
Being exactly the same as:
{
"$and": [
{ "a": { "$gt": 1 } },
{ "a": { "$lt": 3 } }
]
}
So it's really quite superfluous.
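The equivalence can be checked locally with a deliberately simplified matcher. This is an illustration only: the matches function below is a made-up stand-in that supports just equality, $gt, $lt and $and, and it says nothing about how MongoDB actually evaluates queries. Both conditions are placed on the same field "a".

```javascript
// Simplified matcher supporting only equality, $gt, $lt and $and.
// It is an illustration, not how MongoDB evaluates queries.
function matches(doc, query) {
  return Object.entries(query).every(([key, cond]) => {
    if (key === "$and") {
      return cond.every((sub) => matches(doc, sub));
    }
    const value = doc[key];
    if (cond !== null && typeof cond === "object") {
      return Object.entries(cond).every(([op, operand]) => {
        if (op === "$gt") return value > operand;
        if (op === "$lt") return value < operand;
        throw new Error("unsupported operator: " + op);
      });
    }
    return value === cond;
  });
}

// Made-up sample documents.
const docs = [{ a: 1 }, { a: 2 }, { a: 3 }, { a: 2, b: 5 }];

const implicit = { a: { $gt: 1, $lt: 3 } };
const explicit = { $and: [{ a: { $gt: 1 } }, { a: { $lt: 3 } }] };

const implicitHits = docs.filter((d) => matches(d, implicit));
const explicitHits = docs.filter((d) => matches(d, explicit));

console.log(JSON.stringify(implicitHits) === JSON.stringify(explicitHits));
```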
In the end, if all you really want to do is:
myCollection.update(
{ "a": 1, "b": 1 },
{ "$inc": { "c": 1 } }
)
Updating a single document, then there is no need for $isolated at all. Obtaining an explicit lock here just adds complexity that is not required to an otherwise simple operation. And even in bulk, you likely really do not need the consistency that is provided by obtaining the lock, and as such can simply do again:
myCollection.update(
{ "a": 1, "b": 1 },
{ "$inc": { "c": 1 } },
{ "multi": true }
)
Which will happily yield to allow writes on all selected documents and reads of the "latest" information. Generally speaking, "you want this" as "atomic" operators such as $inc are just going to modify the present value they see anyway.
So it does not matter if another process matched one of these documents before the "multi" write found that document in all the matches, since "both" $inc operations are going to execute anyway. All $isolated really does here is "ensure" that when this operation is started, then its write will be the "first" committed, and then anything attempted during the lock will happen "after", as opposed to just the general order of when each operation is able to grab that document and make the modification.
In 9/10 cases, the end result is the same. The exception being that the "write lock" obtained here, "will" slow down other operations.
I have a field in mongodb and it is of integer type. I want to clip the minimum value to zero (0, it shouldn't go negative) on update action like $inc.
Any idea how to achieve this constraint? Thanks
Since MongoDB itself is "schemaless" it cannot really enforce any "schema logic" by itself for such a constraint.
You can possibly look into Object Document Mapper (ODM) libraries available for your chosen language, but all of these generally require that the data be "loaded into the client" in order to enforce such constraints upon modifications.
Additionally, there is no current "operator syntax" that allows such a constraint, such as for $inc to not fall below (or go above) a certain value.
What you "can" do however is use something like the "Bulk" operations API to send "two" requests at once, respectively doing:
Increment/Decrement the field by the specified value
Check if the field fell out of constraints and set the value accordingly.
So for a simple example of a Min of "0" and a Max of "100" you would do:
var bulk = db.collection.initializeOrderedBulkOp();
// Increment document field
bulk.find({ "_id": idValue }).updateOne({
"$inc": { "counter": incValue }
});
// Set to Max if greater than Max
bulk.find({ "_id": idValue, "counter": { "$gt": 100 } }).updateOne({
"$set": { "counter": 100 }
});
// Set to Min if less than Min
bulk.find({ "_id": idValue, "counter": { "$lt": 0 } }).updateOne({
"$set": { "counter": 0 }
});
// Only sends and returns from server now
bulk.execute();
That means that though there are actually "three" operations here to enforce the constraint, there is actually only "one" request to the server and "one" response.
Whilst it still really is "three" operations, the likelihood of the "interim" value being picked up is very small. So this is generally preferred to loading the data from the server, then modifying and saving back the contents in all respects.
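The effect of the three ordered operations can be traced on an in-memory object. This is a plain JavaScript stand-in for illustration, not MongoDB code; the function name incrementClamped and the sample documents are assumptions made for the example.

```javascript
// In-memory stand-in for the three ordered updates; not MongoDB code.
function incrementClamped(doc, incValue, min, max) {
  doc.counter += incValue;                   // 1. $inc
  if (doc.counter > max) doc.counter = max;  // 2. $set when { $gt: max } matches
  if (doc.counter < min) doc.counter = min;  // 3. $set when { $lt: min } matches
  return doc;
}

const high = incrementClamped({ _id: 1, counter: 95 }, 10, 0, 100);
const low = incrementClamped({ _id: 2, counter: 5 }, -10, 0, 100);
const mid = incrementClamped({ _id: 3, counter: 50 }, 7, 0, 100);

console.log(high.counter, low.counter, mid.counter);
```

Only one of the two $set steps can match for a given document, and neither matches when the incremented value is already within range.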
This is not a very efficient option, but it may be one of the choices in your case:
http://docs.mongodb.org/manual/tutorial/store-javascript-function-on-server/
You can write a JavaScript function which will do the increment and validate the data. But I am sure concurrency may still be a problem that you need to consider.
When I perform a MapReduce operation over a MongoDB collection with a small number of documents everything goes ok.
But when I run it with a collection with about 140.000 documents, I get some strange results:
Map function:
function() { emit(this.featureType, this._id); }
Reduce function:
function(key, values) { return { count: values.length, ids: values }; }
As a result, I would expect something like (for each mapping key):
{
"_id": "FEATURE_TYPE_A",
"value": { "count": 140000,
"ids": [ "9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5", .......}
But instead I get this strange document structure:
{
"_id": "FEATURE_TYPE_A",
"value": {
"count": 706,
"ids": [
{
"count": 101,
"ids": [
{
"count": 100,
"ids": [
"9b2066c0-811b-47e3-ad4d-e8fb6a8a14e7",
"db364b3f-045f-4cb8-a52e-2267df40066c",
"d2152826-6777-4cc0-b701-3028a5ea4395",
"7ba366ae-264a-412e-b653-ce2fb7c10b52",
"513e37b8-94d4-4eb9-b414-6e45f6e39bb5".....}
Could someone explain to me whether this is the expected behavior, or am I doing something wrong?
Thanks in advance!
The case here is unusual and I'm not sure if this is what you really want given the large arrays being generated. But there is one point in the documentation that has been missed in the presumption of how mapReduce works.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
What that basically says here is that your current operation is only expecting that "reduce" function to be called once, but this is not the case. The input will in fact be "broken up" and passed in here as manageable sizes. The multiple calling of "reduce" now makes another point very important.
Because it is possible to invoke the reduce function more than once for the same key, the following properties need to be true:
the type of the return object must be identical to the type of the value emitted by the map function to ensure that the following operations is true:
Essentially this means that both your "mapper" and "reducer" have to take on a little more complexity in order to produce your desired result. Essentially making sure that the output for the "mapper" is sent in the same form as how it will appear in the "reducer" and the reduce process itself is mindful of this.
So first the mapper revised:
function () { emit(this.featureType, { count: 1, ids: [this._id] }); }
Which is now consistent with the final output form. This is important when considering the reducer which you now know will be invoked multiple times:
function (key, values) {
var ids = [];
var count = 0;
values.forEach(function(value) {
count += value.count;
value.ids.forEach(function(id) {
ids.push( id );
});
});
return { count: count, ids: ids };
}
What this means is that each invocation of the reduce function expects the same inputs as it is outputting, being a count field and an array of ids. This gets to the final result by essentially:
Reduce one chunk of results #chunk1
Reduce another chunk of results #chunk2
Combine the reduce on the reduced chunks, #chunk1 and #chunk2
That may not seem immediately apparent, but the behavior is by design where the reducer gets called many times in this way to process large sets of emitted data, so it gradually "aggregates" rather than in one big step.
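You can convince yourself of this by running the revised reducer locally in plain JavaScript. The sketch below makes a few assumptions for illustration: 250 made-up documents, an arbitrary chunk size of 100, and a mapValue function that returns the value the revised mapper would emit. MongoDB chooses its own chunking, so this only demonstrates that re-reducing is safe, not how the server splits the input.

```javascript
// The revised mapper's emitted value and the reducer from the answer,
// run locally in plain JavaScript (no MongoDB involved).
function mapValue(doc) {
  return { count: 1, ids: [doc._id] };
}

function reduceFn(key, values) {
  var ids = [];
  var count = 0;
  values.forEach(function (value) {
    count += value.count;
    value.ids.forEach(function (id) {
      ids.push(id);
    });
  });
  return { count: count, ids: ids };
}

// Made-up input: 250 documents of one feature type.
const emitted = [];
for (let i = 0; i < 250; i++) {
  emitted.push(mapValue({ _id: "id-" + i, featureType: "FEATURE_TYPE_A" }));
}

// Reduce in chunks of 100 (an arbitrary size), then reduce the partials.
const partials = [];
for (let i = 0; i < emitted.length; i += 100) {
  partials.push(reduceFn("FEATURE_TYPE_A", emitted.slice(i, i + 100)));
}
const chunked = reduceFn("FEATURE_TYPE_A", partials);

// Reduce everything in a single call for comparison.
const single = reduceFn("FEATURE_TYPE_A", emitted);

console.log(chunked.count, single.count);
console.log(JSON.stringify(chunked) === JSON.stringify(single));
```

Because the reducer's output has the same shape as the mapper's values, the chunked and single-pass results are identical.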
The aggregation framework makes this a lot more straightforward, where from MongoDB 2.6 and upwards the results can even be output to a collection, so if you had more than one result and the combined output was greater than 16MB then this would not be a problem.
db.collection.aggregate([
{ "$group": {
"_id": "$featureType",
"count": { "$sum": 1 },
"ids": { "$push": "$_id" }
}},
{ "$out": "outputCollection" }
])
So that will not break and will actually return as expected, with the complexity greatly reduced as the operation is indeed very straightforward.
But I have already said that your purpose for returning the array of "_id" values here seems unclear in your intent given the sheer size. So if all you really wanted was a count by the "featureType" then you would use basically the same approach rather than trying to force mapReduce to find the length of an array that is very large:
db.collection.aggregate([
{ "$group": {
"_id": "$featureType",
"count": { "$sum": 1 },
}}
])
In either form though, the results will be correct as well as running in a fraction of the time that the mapReduce operation as constructed will take.
I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type)
I'm interested in using Mongodb to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I need to also maintain a relational list of id's where this particular product is available (e.g., store location id's).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
name: "O'Reilly Media",
founded: 1980,
location: "CA",
books: [12346789, 234567890, ...]
}
I am currently importing the data with a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to do an ensure index on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the ID's to insert a new related ID to the books?
Your question could possibly be defined a little better, but let's consider the case that you have rows in a spreadsheet or other source that are all de-normalized in some way. So in a JSON representation the rows would be something like this:
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 12346789
},
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 234567890
}
So in order to get those sort of row results into the structure you wanted, one way to do this would be using the "upsert" functionality of the .update() method.
So assuming you have some way of looping the input values and they are identified with some structure, then an analog to this would be something like:
books.forEach(function(book) {
db.publishers.update(
{
"name": book.publisher
},
{
"$setOnInsert": {
"founded": book.founded,
"location": book.location,
},
"$addToSet": { "books": book.book }
},
{ "upsert": true }
);
})
This essentially simplifies the code so that MongoDB is doing all of the data collection work for you. So where the "name" of the publisher is considered to be unique, what the statement does is first search for a document in the collection that matches the query condition given, as the "name".
In the case where that document is not found, then a new document is inserted. So either the database or driver will take care of creating the new _id value for this document and your "condition" is also automatically inserted to the new document since it was an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
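The insert path and the update path described above can be traced with an in-memory stand-in. This is plain JavaScript for illustration, not MongoDB code: the upsertBook function and the publishers array are assumptions that mimic what the upsert with $setOnInsert and $addToSet does for each row, and the third sample row is a deliberate duplicate.

```javascript
// In-memory stand-in for an upsert with $setOnInsert and $addToSet.
// `publishers` is a plain array used in place of a collection.
function upsertBook(publishers, row) {
  let doc = publishers.find((p) => p.name === row.publisher);
  if (!doc) {
    // Insert path: the query condition and $setOnInsert fields are written.
    doc = {
      name: row.publisher,
      founded: row.founded,
      location: row.location,
      books: []
    };
    publishers.push(doc);
  }
  // $addToSet: append only when the value is not already present.
  if (!doc.books.includes(row.book)) {
    doc.books.push(row.book);
  }
  return doc;
}

const rows = [
  { publisher: "O'Reilly Media", founded: 1980, location: "CA", book: 12346789 },
  { publisher: "O'Reilly Media", founded: 1980, location: "CA", book: 234567890 },
  { publisher: "O'Reilly Media", founded: 1980, location: "CA", book: 12346789 }
];

const publishers = [];
rows.forEach((row) => upsertBook(publishers, row));

console.log(JSON.stringify(publishers, null, 2));
```

Three input rows end up as one publisher document with two distinct book ids.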
So that would be simplified logic compared to aggregating the new records in code before sending a new insert operation. However it is not very "batch" like as you are still performing some operation to the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
    batch.push({
        "q": { "name": book.publisher },
        "u": {
            "$setOnInsert": {
                "founded": book.founded,
                "location": book.location
            },
            "$addToSet": { "books": book.book }
        },
        "upsert": true
    });
    // Send a full batch of 500 statements and start a new one
    if ( ( batch.length % 500 ) == 0 ) {
        db.runCommand({ "update": "publishers", "updates": batch });
        batch = [];
    }
});
// Send whatever is left over, if anything
if ( batch.length > 0 ) {
    db.runCommand({ "update": "publishers", "updates": batch });
}
So what this is doing is setting up all of the constructed update statements to be sent in a single call to the server, with a sensible number of operations in each batch, in this case once every 500 items processed. The actual limit is the BSON document maximum of 16MB so this can be altered appropriate to your data.
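The batching step on its own can be sketched as a function that turns rows into command documents. Everything named here is an assumption for illustration: buildUpdateCommands is a hypothetical helper, "publishers" is the assumed collection name, and the 1200 sample rows are made up. The function only builds the documents and does not contact a server.

```javascript
// Hypothetical helper: turns rows into "update" command documents,
// each carrying at most `batchSize` update statements.
function buildUpdateCommands(rows, collectionName, batchSize) {
  const commands = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    const updates = rows.slice(i, i + batchSize).map((row) => ({
      q: { name: row.publisher },
      u: {
        $setOnInsert: { founded: row.founded, location: row.location },
        $addToSet: { books: row.book }
      },
      upsert: true
    }));
    commands.push({ update: collectionName, updates: updates });
  }
  return commands;
}

// Made-up input: 1200 rows for a single publisher.
const sampleRows = Array.from({ length: 1200 }, (_, i) => ({
  publisher: "O'Reilly Media",
  founded: 1980,
  location: "CA",
  book: 1000 + i
}));

const commands = buildUpdateCommands(sampleRows, "publishers", 500);
console.log(commands.map((c) => c.updates.length));
```

Each element of commands is shaped so that it could be passed to db.runCommand() in the shell. An empty input produces no commands, so nothing is sent for an empty batch.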
If your MongoDB version is lower than 2.6 then you either use the first form or do something similar to the second form using the existing batch insert functionality. But if you choose to insert then you need to do all the pre-aggregation work within your code.
All of the methods are of course supported with the PHP driver, so it is just a matter of adapting this to your actual code and which course you want to take.