MongoDB update in sharded environment using $or

I performed an update in a sharded environment like:
db.collection.update({$or:[{a:2,b:3},{a:3,b:2}]},{$set:{x:5}})
But I got this error message:
update { q: { $or: [ {a:2,b:3},{a:3,b:2} ] }, u: {$set:{x:5}}, multi: false, upsert: false } does not contain _id or shard key for pattern { a: 1.0, b: 1.0 }
How can I perform this kind of update with the $or predicate on the shard key?
Thanks

The main problem here is that your $or condition does not really make much sense without the "multi" parameter of the update statement. At least the sharding conditional logic thinks so, even if your intention was to match only a single document.
In the "mind" of the sharding manager that lives within the mongos router, the expectation is that you either target a single shard or range of keys, or you are asking to access a possible variety of shards.
Here is the actual code handling this for reference:
// Validate that single (non-multi) sharded updates are targeted by shard key or _id
if (!updateDoc.getMulti() && shardKey.isEmpty() && !isExactIdQuery(updateDoc.getQuery())) {
    return Status(ErrorCodes::ShardKeyNotFound,
                  stream() << "update " << updateDoc.toBSON()
                           << " does not contain _id or shard key for pattern "
                           << _manager->getShardKeyPattern().toString());
}
So as you should clearly see in the "if" condition, the expectation here is that there is either a definition of the "shard key" within the query, or at least an exact _id to facilitate an exact match.
Therefore your two options for making this a valid update over shards are to either:
Include a "range" over possible values of the shard key in the query criteria. I don't know your shard key so I cannot give an exact sample, but basically:
{
    "shardKey": { "$gt": minShardKey, "$lt": maxShardKey },
    "$or": [
        { "a": 2, "b": 3 },
        { "a": 3, "b": 2 }
    ]
}
As the query condition, where minShardKey and maxShardKey refer to the minimum and maximum possible values of that key ( on the also hypothetical "shardKey" field ), in order to make the manager consider that you really intend to search across all shards.
Include the "multi" option in the update like so:
db.collection.update(
    { "$or": [
        { "a": 2, "b": 3 },
        { "a": 3, "b": 2 }
    ]},
    { "$set": { "x": 5 } },
    { "multi": true }
)
Which makes the selection "possibly" match more than one, and is therefore valid for searching over shards without needing a targeted shard key.
In either case, the logic is fulfilled in that you at least "intend" to search the conditions across the shards to find something or "things" that match the conditions you have given.
As an additional note, also consider that "upsert" actions have similar restrictions, and that the general principle is that the shard key needs to be addressed; otherwise the operation is invalid, since there needs to be some indication of which shard the new data should be inserted into.
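For instance, here is a minimal sketch of an upsert against this collection, assuming ( as the error message pattern suggests ) that the shard key is { a: 1, b: 1 }, so that mongos can determine which shard would own a newly inserted document:
db.collection.update(
    { "a": 2, "b": 3 },          // full shard key present, so mongos can target the owning shard
    { "$set": { "x": 5 } },
    { "upsert": true }
)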

Related

MongoDB - Looking up documents based on criteria defined by the documents themselves

Overall, I am trying to find a system design to quickly look up stored objects whose metadata matches data bundled on incoming events. Which fields are required, however, is itself defined by the stored objects; these are not fields that I can hardcode into a lookup query.
My system has a policies collection stored in MongoDB with documents that look like this:
{
    id: 123,
    name: "Jason's Policy",
    requirements: {
        "var1": "aaa",
        "var2": "bbb"
        // could have any number more, and each policy can have different field/values under requirements
    }
}
My system receives events that look like this:
// Event 1 - matches all requirements under above policy
{
    id: 777,
    "var1": "aaa",
    "var2": "bbb"
}
// Event 2 - does not match all requirements from above policy since var1 is undefined
{
    id: 888,
    "var2": "bbb",
    "var3": "zzz"
}
As I receive events, how can I efficiently look up all the policies whose requirements are fully satisfied by the values received in the event?
As an example, in the sample data above, event 1 should return the policy (since var1 and var2 match the policy requirements), but event 2 should not return the policy (since var1 does not match/ is missing).
I can think of brute-force ways to do this on the application server itself (think nested for loops) but efficiency will be key as we receive hundreds of events per second.
I am open to recommendations for document schema changes that can satisfy the general problem (looking up documents based on criteria itself defined in our documents). I am also open to any overall design recommendations that address the problem, too (perhaps there is a better way to structure our system to trigger policy actions in response to events).
Thanks!
Not sure what the exact scenario is, but I can think of 2 options here:
You need an exact match. For that you can run the below query:
db.getCollection('test').find({'requirements':{'var1':'aaa','var2':'bbb'}})
For the above query to work, you need to save the requirements object with its keys (var1 and var2) sorted, since an exact object match is order-sensitive.
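As a rough sketch, one way to normalize the key order before saving ( the sortKeys helper below is hypothetical, not part of any driver ):
// Hypothetical helper: rebuild an object with its keys in sorted order,
// so that exact-object matches like the query above compare predictably.
function sortKeys(obj) {
    return Object.keys(obj).sort().reduce(function (acc, k) {
        acc[k] = obj[k];
        return acc;
    }, {});
}

db.getCollection('test').insert({
    id: 123,
    name: "Jason's Policy",
    requirements: sortKeys({ "var2": "bbb", "var1": "aaa" })  // stored as { var1: "aaa", var2: "bbb" }
});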
You need to match that all the given properties exist, and don't care if anything extra is in the policies collection. For that you need to change how policies are stored, as follows:
{
    "_id" : ObjectId("603250b0775428e32b9b303f"),
    "id" : 123,
    "name" : "Jason's Policy",
    "requirements" : {
        "var1" : "aaa",
        "var2" : "bbb"
    },
    "requirements_search" : [
        "var1aaa",
        "var2bbb",
        "var3ccc"
    ]
}
Then you can run the below query:
db.getCollection('test').find({'requirements_search':{'$all' : ['var1aaa','var2bbb']}})
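Since the question emphasizes throughput, it is presumably also worth indexing that array field so the $all match does not have to scan the whole collection ( an index on an array field is automatically a multikey index ):
// Index the search array so the $all lookup can use it
db.getCollection('test').createIndex({ "requirements_search": 1 })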
I found an answer to my question in another post: Find documents in MongoDB whose array field is a subset of a query array.
MongoDB offers a $setIsSubset operator that can check if a document's array values are a subset of the array values in a query. Translated to my use case: if a given policy's requirements are a subset of the event's metadata, then I know that the event data fully meets the requirements for that policy.
For completeness, below is the MongoDB aggregation that solved my problem. I still need to research if there is a more efficient overall system design to facilitate what I need, but at a minimum, this Mongo aggregation will fetch the results that I need.
// Requires us to flatten policy requirements into an array like the following
//
// {
//     "id" : 123,
//     "name" : "Jason's Policy",
//     "requirements" : [
//         "var1_aaa",
//         "var2_bbb"
//     ]
// }
//
// Event matches all policy requirements and has extra unrelated attributes
// {
//     id: 777,
//     "var1": "aaa",
//     "var2": "bbb",
//     "var3": "ccc"
// }
db.collection.aggregate([
    {$project: {
        doc: '$$ROOT',
        isSubset: {$setIsSubset: ['$requirements', ['var1_aaa', 'var2_bbb', 'var3_ccc']]}
    }},
    {$match: {isSubset: true}},
    {$project: {_id: 0, 'doc.name': 1}}
])
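For completeness, a small sketch of the event-side flattening that produces the array literal used above ( the flattenEvent helper is hypothetical; "id" is excluded since it is not a requirement field ):
// Hypothetical helper: turn an incoming event's metadata into the
// "key_value" strings that the $setIsSubset comparison expects.
function flattenEvent(event) {
    return Object.keys(event)
        .filter(function (k) { return k !== 'id'; })
        .map(function (k) { return k + '_' + event[k]; });
}

flattenEvent({ id: 777, "var1": "aaa", "var2": "bbb", "var3": "ccc" })
// => [ "var1_aaa", "var2_bbb", "var3_ccc" ]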

MongoDB update with $isolated

I want to know the difference between these 2 queries:
myCollection.update( {
    a: 1,
    b: 1,
    $isolated: 1
} );

myCollection.update( {
    $and: [
        { a: 1 },
        { b: 1 },
        { $isolated: 1 }
    ]
} );
Basically I need to perform an .update() with $isolated for all the documents that have 'a=1 and b=1'. I'm confused about how to write the '$isolated' param and how to be sure that the query works fine.
I would basically question the "need to perform" of your statement, especially considering the lack of { multi: true } where you intend to match and update a lot of documents.
The second consideration here is that your proposed statement(s) lack any kind of update operation at all. That might be a consequence of the question you are asking about the "difference", but given your present apparent understanding of MongoDB query operations in relation to the usage of $and, I seriously doubt you "need" this at all.
So "If" you really needed to write a statement this way, then it should look like this:
myCollection.update(
    { "a": 1, "b": 1, "$isolated": true },
    { "$inc": { "c": 1 } },
    { "multi": true }
)
But what you really "need" to understand is what that is doing.
Essentially this query is going to cause MongoDB to hold a "write lock", at least at the collection level, so that no other operations can be performed until the entire write is complete. This also ensures that until that time, all read attempts will only see the state of the documents before any changes were made; only once the operation is complete in full will subsequent reads see all the changes.
This may "sound tempting" to some, as if it were a good idea, but it really is not. Write locks affect the concurrency of updates and are generally a bad thing to be avoided. You might also be confusing this with a "transaction", but it is not one, and as such any failure during execution will only halt the operations at the point where it failed. This does not "undo" changes already made within the $isolated block, and they will remain committed.
Just about the only valid use case here would be where you "absolutely need" all of the elements matching "a" and "b" to be modified as a consistent unit, in the event that something was "aggregating" that combination at the exact same time as this operation was run. In that case, exposing "partially" altered values of "c" may not be desirable. But the range of usage for this is pretty slim, and most general applications do not require such consistency.
Back to the usage of $and: all MongoDB query conditions are implicitly an $and operation anyway, unless explicitly stated otherwise. The only general usage for $and is where you need multiple conditions on the same document key. And even then, that is generally better written on the "right side" of the evaluation, such as with $gt and $lt:
{ "a": { "$gt": 1, "$lt": 3 } }
Being exactly the same as:
{
    "$and": [
        { "a": { "$gt": 1 } },
        { "a": { "$lt": 3 } }
    ]
}
So it's really quite superfluous.
In the end, if all you really want to do is:
myCollection.update(
    { "a": 1, "b": 1 },
    { "$inc": { "c": 1 } }
)
Updating a single document, then there is no need for $isolated at all. Obtaining an explicit lock here just adds complexity to an otherwise simple operation that does not require it. And even in bulk, you likely do not really need the consistency that is provided by obtaining the lock, and as such can simply do again:
myCollection.update(
    { "a": 1, "b": 1 },
    { "$inc": { "c": 1 } },
    { "multi": true }
)
Which will happily yield to allow writes on all selected documents and reads of the "latest" information. Generally speaking, "you want this", as "atomic" operators such as $inc are just going to modify the present value they see anyway.
So it does not matter if another process matched one of these documents before the "multi" write found that document in all the matches, since "both" $inc operations are going to execute anyway. All $isolated really does here is "ensure" that when this operation is started, "its" write will be the "first" committed, and anything attempted during the lock will happen "after", as opposed to just the general order of when each operation is able to grab that document and make the modification.
In 9/10 cases, the end result is the same. The exception being that the "write lock" obtained here, "will" slow down other operations.

How often should we use where operator of MongoDB

I have this particular scenario where I have to update a certain value in MongoDB depending on different attributes present in the same document. So I am trying to use findOneAndUpdate with the $where operator, which will be passed a JavaScript function, and I will also be using one of the attributes as the find criteria. But it has been mentioned in the MongoDB documentation that one should not use the $where operator unless it cannot be avoided, because of performance issues.
Now let's say I have 3 attributes id, counter1, counter2 in my document and I am incrementing counter1 by 1 only when counter1 + counter2 = 2. So I will be writing something like:
db.mydb.findOneAndUpdate(
    { "_id": id, "$where": function() { return this.counter1 + this.counter2 == 2; } },
    { "$inc": { "counter1": 1 } }
)
Now my question is:
Will this particular approach create any performance issue, given that I am also using _id as a non-$where criterion to search for the document?
Or should I instead have another attribute in the mydb collection, say sumCounter, which stores the sum of counter1 and counter2?
So the main catch with $where evaluation is that the conditional logic cannot use an "index" in order to filter out matches. In addition, it is JavaScript logic after all, which needs to be compiled, and there also needs to be an "object translation" from the native forms into something that will work with the evaluation in the JavaScript engine.
So it should be used "very sparingly" and only when "absolutely" required, as in there is no other practical way. In your case this is an "update" operation, so if you need that logic then fine. If it were just a "query", then I would say to use $redact in the aggregation framework instead:
db.mydb.aggregate([
    { "$match": { "_id": id } },
    { "$redact": {
        "$cond": {
            "if": {
                "$eq": [
                    { "$add": [ "$counter1", "$counter2" ] },
                    2
                ]
            },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
])
As that is at least all in native operators and therefore going to work faster than JavaScript.
As for "performance", then it is all relative. But however in your case where _id is a "unique" lookup, then the actual performance "hit" should be negligible as the "exact match" was already done on the "index" for the primary key.
This is the general advice for $where conditions. In that you "use them" generally in conjuction with other native query operators that do the "bulk" of the filtering. Then if it takes a few more CPU cycles to apply the conditions in your JavaScript logic ( and it is absolutely needed since there is no other way ), then so be it.
But if however your JavaScript based condition needs to scan many documents without the assistance of other filtering, then that is bad indeed.
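As a minimal sketch of that pattern ( the "status" field here is hypothetical, standing in for whatever indexed native condition does the narrowing ):
db.mydb.find({
    "status": "active",      // native, indexable condition does the bulk of the filtering
    "$where": function() {
        // JavaScript only runs against the documents that passed the native filter
        return this.counter1 + this.counter2 == 2;
    }
})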

Delete a MongoDB subdocument by value

I have a collection containing documents that look like this:
{
    "user": "foo",
    "topics": {
        "Topic AB": {
            "score": 20,
            "frequency": 3,
            "last_seen": 40
        },
        "Topic BD": {
            "score": 10,
            "frequency": 2,
            "last_seen": 38
        },
        "Topic TF": {
            "score": 19,
            "frequency": 6,
            "last_seen": 20
        }
    }
}
I want to remove subdocuments whose last_seen value is less than 30.
I don't want to use arrays here since I'm using $inc to update the subdocuments in conjunction with upsert (which doesn't support the $ notation).
The real question here is how can I delete a key depending on its value. Using $unset simply drops a subdocument regardless of what it contains.
I'm afraid I don't think this is possible with your current design. Knowing the name of the key whose last_seen value you wish to test, for example Topic TF, you can do
> db.topics.update({"topics.Topic TF.last_seen" : { "$lt" : 30 }},
{ "$unset" : { "topics.Topic TF" : 1} })
However, with an embedded document structure, if you don't know the name of the key that you want to query against then you can't run the query. If the Topic XX keys are only known by what's in the document, you'd have to pull the whole document to find out what keys to test, and at that point you ought to just manipulate the document client-side and then update by _id.
The best option is to use arrays. The $ positional operator works with upserts, it just has a serious gotcha that, in the case of an insert, the $ will be interpreted as part of the field name instead of as an operator, so I understand your conclusion that it doesn't seem feasible. I'm not quite sure how you are using upsert such that arrays seem like they won't work, though. Could you give more detail there and I'll try to help come up with a reasonable workaround to use arrays and $ with your use case?
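For illustration, a minimal sketch of what the array-based layout and the value-based removal could look like ( assuming each topic moves into a subdocument with a hypothetical name field ):
// Document shape with "topics" as an array instead of an embedded object
{
    "user": "foo",
    "topics": [
        { "name": "Topic AB", "score": 20, "frequency": 3, "last_seen": 40 },
        { "name": "Topic TF", "score": 19, "frequency": 6, "last_seen": 20 }
    ]
}

// Remove every array entry whose last_seen is below 30, across all documents
db.topics.update(
    {},
    { "$pull": { "topics": { "last_seen": { "$lt": 30 } } } },
    { "multi": true }
)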

MongoDB "filtered" index: is it possible?

Is it possible to index some documents of the collection "only if" one of the fields to be indexed has a particular value?
Let me explain with an example:
The collection "posts" has millions of documents, ALL defined as follows:
{
    "network": "network_1",
    "blogname": "blogname_1",
    "post_id": 1234,
    "post_slug": "abcdefg"
}
Let's assume that the distribution of the posts is equally split between network_1 and network_2.
My application OFTEN selects the type of query based on the value of "network" (although sometimes I need the data from both networks):
For example:
www.test.it/network_1/blog_1/**postid**/1234/
-> db.posts.find({ network: "network_1", blogname: "blog_1", post_id: 1234 })
www.test.it/network_2/blog_4/**slug**/aaaa/
-> db.posts.find({ network: "network_2", blogname: "blog_4", post_slug: "aaaa" })
I could create two separate indexes (network / blogname / post_id and network / blogname / post_slug), but that would be a huge waste of RAM, since 50% of the data in each index will never be used.
Is there a way to create a "filtered" index?
Example:
(Note the WHERE parameter)
db.posts.ensureIndex({ network: 1, blogname: 1, post_id: 1 }, { where: { network: "network_1" } })
db.posts.ensureIndex({ network: 1, blogname: 1, post_slug: 1 }, { where: { network: "network_2" } })
Indeed it's possible in MongoDB 3.2+. They call it partialFilterExpression, where you can set a condition based on which the index will be created.
Example
db.users.createIndex(
    { "userId": 1, "project": 1 },
    { unique: true,
      partialFilterExpression: { userId: { $exists: true, $gt: { $type: 10 } } } }
)
Please see Partial Index documentation
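Applied to the posts collection from the question, a rough sketch ( assuming MongoDB 3.2+, so partialFilterExpression is available ) could look like this; note that a query must still include the network equality condition for the planner to consider the partial index:
// Only posts from network_1 are indexed by post_id ...
db.posts.createIndex(
    { network: 1, blogname: 1, post_id: 1 },
    { partialFilterExpression: { network: "network_1" } }
)

// ... and only posts from network_2 are indexed by post_slug
db.posts.createIndex(
    { network: 1, blogname: 1, post_slug: 1 },
    { partialFilterExpression: { network: "network_2" } }
)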
As of MongoDB v3.2, partial indexes are supported. Documentation: https://docs.mongodb.org/manual/core/index-partial/
It's possible, but it requires a workaround which creates redundancy in your documents, requires you to rewrite your find-queries and limits find-queries to exact matches.
MongoDB supports sparse indexes which only index the documents where the given field exists. You can use this feature to only index a part of the collection by adding this field only to those documents you want to index.
The bad news is that sparse indexes can only include a single field. But the good news is, that this field can also contain an object with multiple fields, so you can still store all the data you want to search for in this field.
To do this, add a new field to the included documents which includes an object with the fields you search for:
{
    "network": "network_1",
    "blogname": "blogname_1",
    "post_id": 1234,
    "post_slug": "abcdefg",
    "network_1_index_key": {
        "blogname": "blogname_1",
        "post_id": 1234
    }
}
Your ensureIndex command would index the field network_1_index_key:
db.posts.ensureIndex( { network_1_index_key: 1 }, { sparse: true } )
A find-query which is supposed to use this index, must now query for the exact object of the field network_1_index_key:
db.posts.find({
    network_1_index_key: {
        blogname: "blogname_1",
        post_id: 1234
    }
})
Doing this would likely only make sense when the documents you want to index are a very small part of the collection. When it's about half, I would just create a regular index and live with it, because the larger document size could mitigate the gains from the reduced index size.
You can try creating an index on all fields (network / blogname / post_id / post_slug).