Overall, I am trying to find a system design to quickly look up stored objects whose metadata matches data bundled on incoming events. The required fields, however, are themselves part of the stored objects, and are not fields that I can hardcode into a lookup query.
My system has a policies collection stored in MongoDB with documents that look like this:
{
    id: 123,
    name: "Jason's Policy",
    requirements: {
        "var1": "aaa",
        "var2": "bbb"
        // could have any number more, and each policy can have different fields/values under requirements
    }
}
My system receives events that look like this:
// Event 1 - matches all requirements under above policy
{
    id: 777,
    "var1": "aaa",
    "var2": "bbb"
}
// Event 2 - does not match all requirements from above policy since var1 is undefined
{
    id: 888,
    "var2": "bbb",
    "var3": "zzz"
}
As I receive events, how can I efficiently look up all the policies whose requirements are fully satisfied by the values received in the event?
As an example, in the sample data above, event 1 should return the policy (since var1 and var2 match the policy requirements), but event 2 should not (since var1 is missing).
I can think of brute-force ways to do this on the application server itself (think nested for loops), but efficiency will be key as we receive hundreds of events per second.
I am open to recommendations for document schema changes that can satisfy the general problem (looking up documents based on criteria itself defined in our documents). I am also open to any overall design recommendations that address the problem, too (perhaps there is a better way to structure our system to trigger policy actions in response to events).
Thanks!
I'm not sure of your exact scenario, but I can think of two here:
1. You need an exact match. For that you can run the query below:
db.getCollection('test').find({'requirements': {'var1': 'aaa', 'var2': 'bbb'}})
For the above query to work, you need to save the requirements object after sorting its keys (var1, var2), because an exact match on an embedded document in MongoDB is sensitive to field order.
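A minimal sketch of saving a policy with sorted keys; the sortKeys helper is hypothetical, not part of the original answer:

function sortKeys(obj) {
    var sorted = {};
    Object.keys(obj).sort().forEach(function (k) { sorted[k] = obj[k]; });
    return sorted;
}

db.getCollection('test').insert({
    id: 123,
    name: "Jason's Policy",
    // stored with keys in sorted order: var1, then var2
    requirements: sortKeys({ var2: 'bbb', var1: 'aaa' })
});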
2. You need to match that all the properties exist, and you don't care if the policies collection contains anything extra. For that you need to change how policies are stored:
{
    "_id" : ObjectId("603250b0775428e32b9b303f"),
    "id" : 123,
    "name" : "Jason's Policy",
    "requirements" : {
        "var1" : "aaa",
        "var2" : "bbb"
    },
    "requirements_search" : [
        "var1aaa",
        "var2bbb",
        "var3ccc"
    ]
}
Then you can run the query below:
db.getCollection('test').find({'requirements_search':{'$all' : ['var1aaa','var2bbb']}})
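A minimal sketch of how the requirements_search tokens could be built from an incoming event before querying; the toSearchTokens helper and the exclusion of the event's id field are assumptions, not part of the original answer:

function toSearchTokens(event) {
    // concatenate each key with its value, e.g. {var1: 'aaa'} -> 'var1aaa'
    return Object.keys(event)
        .filter(function (k) { return k !== 'id'; }) // skip the event's own id
        .map(function (k) { return k + event[k]; });
}

var event = { id: 777, var1: 'aaa', var2: 'bbb' };
db.getCollection('test').find({ requirements_search: { $all: toSearchTokens(event) } });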
I found an answer to my question in another post: Find Documents in MongoDB whose array field is a subset of a query array.
MongoDB offers a $setIsSubset operator that can check if a document's array values are a subset of the array values in a query. Translated to my use case: if a given policy's requirements are a subset of the event's metadata, then I know that the event data fully meets the requirements for that policy.
For completeness, below is the MongoDB aggregation that solved my problem. I still need to research if there is a more efficient overall system design to facilitate what I need, but at a minimum, this Mongo aggregation will fetch the results that I need.
// Requires us to flatten policy requirements into an array like the following
//
// {
//     "id" : 123,
//     "name" : "Jason's Policy",
//     "requirements" : [
//         "var1_aaa",
//         "var2_bbb"
//     ]
// }
//
// Event matches all policy requirements and has extra unrelated attributes
// {
//     id: 777,
//     "var1": "aaa",
//     "var2": "bbb",
//     "var3": "ccc"
// }
db.collection.aggregate([
    {$project: {
        doc: '$$ROOT',
        isSubset: {$setIsSubset: ['$requirements', ['var1_aaa', 'var2_bbb', 'var3_ccc']]}
    }},
    {$match: {isSubset: true}},
    {$project: {_id: 0, 'doc.name': 1}}
])
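For completeness, a sketch of how an incoming event could be flattened into the "key_value" strings the pipeline expects; the flattenEvent helper and the exclusion of the event's id field are my assumptions:

function flattenEvent(event) {
    // turn {var1: 'aaa'} into 'var1_aaa', skipping the event's own id
    return Object.keys(event)
        .filter(function (k) { return k !== 'id'; })
        .map(function (k) { return k + '_' + event[k]; });
}

var event = { id: 777, var1: 'aaa', var2: 'bbb', var3: 'ccc' };
db.collection.aggregate([
    {$project: {
        doc: '$$ROOT',
        isSubset: {$setIsSubset: ['$requirements', flattenEvent(event)]}
    }},
    {$match: {isSubset: true}},
    {$project: {_id: 0, 'doc.name': 1}}
])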
Related
I have no experience with NoSQL, so I think that if I just ask about the code, my question may be framed incorrectly. Instead, let me explain my problem.
Suppose I have an e-store. I have catalogs:
Catalogs = new Mongo.Collection('catalogs');
and products in those catalogs:
Products = new Mongo.Collection('products');
Then, people add their orders to a temporary collection:
Order = new Mongo.Collection(null); // null makes this a local, in-memory collection
Then, people submit their comments, phone, etc. and the order. I save it to the Operations collection:
Operations.insert({
    phone: "phone",
    comment: "comment",
    etc: "etc",
    savedOrder: Order // <- an Array, right? Or would an Object be better?
});
Nice, but then I want to get stats for every product: which Operations has a given product been used in? How can I search through my Operations and find every operation with that product?
Or is this approach bad? How do real pros do this in the real world?
If I understand it well, here are sample documents as stored in your Operations collection:
{
    clientRef: "john-001",
    phone: "12345678",
    other: "etc.",
    savedOrder: {
        "someMetadataAboutOrder": "...",
        "lines" : [
            { qty: 1, itemRef: "XYZ001", unitPriceInCts: 1050, desc: "USB Pen Drive 8G" },
            { qty: 1, itemRef: "ABC002", unitPriceInCts: 19995, desc: "Entry level motherboard" }
        ]
    }
},
{
    clientRef: "paul-002",
    phone: null,
    other: "etc.",
    savedOrder: {
        "someMetadataAboutOrder": "...",
        "lines" : [
            { qty: 3, itemRef: "XYZ001", unitPriceInCts: 950, desc: "USB Pen Drive 8G" }
        ]
    }
}
Given that, to find all operations having item reference XYZ001 you simply have to query:
> db.operations.find({"savedOrder.lines.itemRef":"XYZ001"})
This will return the whole document. If instead you are only interested in the client reference (and operation _id), you will use a projection as an extra argument to find:
> db.operations.find({"savedOrder.lines.itemRef":"XYZ001"}, {"clientRef": 1})
{ "_id" : ObjectId("556f07b5d5f2fb3f94b8c179"), "clientRef" : "john-001" }
{ "_id" : ObjectId("556f07b5d5f2fb3f94b8c17a"), "clientRef" : "paul-002" }
If you need to perform operations across multiple documents (including multiple embedded documents), you should take a look at the aggregation framework.
For example, to calculate the total of an order:
> db.operations.aggregate([
{$match: { "_id" : ObjectId("556f07b5d5f2fb3f94b8c179") }},
{$unwind: "$savedOrder.lines" },
{$group: { _id: "$_id",
total: {$sum: {$multiply: ["$savedOrder.lines.qty",
"$savedOrder.lines.unitPriceInCts"]}}
}}
])
{ "_id" : ObjectId("556f07b5d5f2fb3f94b8c179"), "total" : 21045 }
I'm an eternal newbie, but since no answer has been posted, I'll give it a try.
First, start by installing Robomongo or similar software; it will allow you to look at your collections directly in MongoDB (by the way, the default port is 3001).
The way I deal with your kind of problem is by using the _id field. It is automatically generated by MongoDB, and you can safely use it as an ID for any item in your collections.
Your catalogs collection should have a string-array field called products holding the _ids of the corresponding items in your products collection. Do the same for the operations: since an order is an array of product _ids, you can store that array of _ids in your savedOrder field. Feel free to add more fields to savedOrder if necessary, e.g. make it an array of product objects with additional fields such as discount.
Concerning your query code, I assume you will find all you need on the web as soon as you figure out what your structure is.
For example, if you have a products array in your savedOrder, you can pull it out like this:
Operations.find({_id: "your_operation_ID"}, {fields: {"savedOrder.products": 1}});
Basically, you ask for all the product _ids in a specific operation. If you have several savedOrders in a single operation, you can also specify the savedOrder _id, if you kept the one from your local collection:
Operations.find({_id: "your_operation_ID", "savedOrder._id": "your_savedOrder_ID"}, {fields: {"savedOrder.products": 1}});
PS: to the bad-ass coders here, if I'm doing it wrong, please tell me.
I found an answer :) Of course, this is no revelation for real professionals, but it is a big step for me. Maybe someone will find my experience useful. All the magic is in using the correct Mongo operators. Let's solve this problem in pseudocode.
We have a structure like this:
Operations:
1. Operation: {
    _id: <- Mongo creates this unique _id for us
    phone: "phone1",
    comment: "comment1",
    savedOrder: [
        {
            _id: <- and again
            productId: <- here we should save our product's _id from 'products'
            name: "Banana",
            quantity: 100
        },
        {
            _id: <- and again
            productId: <- another product's _id that we should save in the order
            name: "apple",
            quantity: 50
        }
    ]
}
And if we want to know in which Operation a user bought a "banana", we should use the MongoDB $elemMatch operator (see the Mongo docs):
db.getCollection('operations').find({"savedOrder.productId": "f5mhs8c2pLnNNiC5v"}, {savedOrder: {$elemMatch: {productId: "f5mhs8c2pLnNNiC5v"}}});
In short, we get the documents whose saved order contains a product with the ID we want to find. I don't know if it is the best way, but it works for me :) Thank you!
So, my schema design requires that I use an embedded document format. While I recognize that what I'm about to ask could be made easier by redesigning the schema, the current design meets all of the other requirements in place so I'm doing my best to make it work.
Consider the following rudimentary schema:
{
    "_id" : "01234ABCD",
    "type" : "thing",
    "resources" : {
        foo : [
            {
                "herp" : "derp"
            }
        ],
        bar : [
            {
                "herp" : "derp"
            },
            {
                "derp" : "herp"
            }
        ]
    }
}
Obviously, the value that corresponds to the "resources" key is an embedded document. I would like to be able to efficiently calculate the count of keys in that document, and derive results based upon tests of that count. It's important to note that the length and content of the embedded document are an unknown quantity - hence my reason for wanting to query this metadata. Being a complete js idiot, I've managed to cobble together the following query. For example, if I were to look for documents with more than 3 keys in the "resources" document:
db.coll.find({$where: function() {
    var total = 0;
    for (var key in this['resources']) {
        ++total;
        if (total > 3) {
            return true;
        }
    }
    return false;
}})
As I'm pretty new to Mongo and terrible at js, I feel like there may be a smarter way to do this. I'm also very curious to hear opinions on whether or not this goes against the Mongo ethos a bit by not pushing this processing to the client. Any feedback or criticism of this approach and implementation are most welcome.
Thanks for reading.
You can use an aggregate pipeline to assemble metadata about the docs and then filter on them.
db.coll.aggregate([
    {$project: {
        // Compute a total count of the keys in the resources docs
        keys: {$add: [{$size: '$resources.foo'}, {$size: '$resources.bar'}]},
        // Project the original doc
        doc: '$$ROOT'
    }},
    // Only include the docs that have more than 3 keys
    {$match: {keys: {$gt: 3}}}
])
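As a side note, if you are on MongoDB 3.4.4 or newer, the $objectToArray operator can count the keys of resources without hardcoding the field names; a sketch, not part of the original answer:

db.coll.aggregate([
    {$project: {
        doc: '$$ROOT',
        // $objectToArray turns {foo: [...], bar: [...]} into an array of
        // {k, v} pairs, so $size counts the top-level keys of resources
        keys: {$size: {$objectToArray: '$resources'}}
    }},
    {$match: {keys: {$gt: 3}}}
])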
I have a schema with subdocs.
// Schema
var company = {
    _id: ObjectId,
    publish: Boolean,
    divisions: {
        employees: [ObjectId]
    }
};
I need to find all the subdocs (divisions) that match my query. It appears that I have to use 2 matches - one to filter out initial docs and a second one to filter out the matching subdocs from the resulting $unwind operation. Is there a more efficient way?
// Query
this.aggregate({
    $match: {
        'publish': 1,
        'divisions.employees': new ObjectId(userid)
    }
}, {
    $unwind: '$divisions'
}, {
    $match: {
        'divisions.employees': new ObjectId(userid)
    }
});
I found this ticket but I am unsure this does what I need.
Doing both matches is the right thing here. You could eliminate the first match stage and just unwind, but having an initial $match allows you to narrow down the pipeline to exactly those documents that will produce at least one output document (i.e. those documents for which publish : true and some employees ObjectId matches the given ObjectId). You will be able to use indexes, like an index on { publish : 1, divisions.employees : 1 }, to perform the first match stage quickly.
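For example, that index could be created as follows (the companies collection name is an assumption, not from the original question):

db.companies.ensureIndex({ "publish" : 1, "divisions.employees" : 1 })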
However, you should ask yourself why you are using the aggregation pipeline here and if it is appropriate. Will you commonly be querying for a given employee that's part of a company with publish : 1? Is this one of the main queries for your use case? If it's infrequent or not critical then the aggregation is a good way to do it. Otherwise, you should reconsider the schema. You could make this query easy with a schema like
{
    "_id" : ObjectId,
    "publish" : Boolean,
    "company" : (unique identifier, possibly a String or ObjectId)
}
that models employees as documents and denormalizes company information into the employee document. Then your query is as easy as
db.employees.find({ "_id" : ObjectId(userid), "publish" : true })
which will be much quicker than the aggregation (not that the aggregation will be slow; this is just relatively quicker). I'm not telling you to do it this way - use your own knowledge of your use case to make the right call.
I'm considering bundling time-sequence data together in session documents. Inside each session, there would be an array of events. Each event would have a timestamp. I know that I can create a multikey index on the timestamp of those events, but I'm curious what mechanism MongoDB uses to prevent the same document from showing up twice in one query.
To clarify, imagine a collection of sessions with the following documents:
{
    _id: 'A',
    events: [
        {time: '10:00'},
        {time: '15:00'}
    ]
}
{
    _id: 'B',
    events: [
        {time: '12:00'}
    ]
}
If I add a multikey index with db.sessions.ensureIndex({'events.time' : 1}), I would expect the b-tree of that index to look like this:
'10:00' => 'A'
'12:00' => 'B'
'15:00' => 'A'
If I query the collection with {'events.time': {$gte: '10:00'}}, MongoDB scans the b-tree and returns:
{ "_id" : "A", "events" : [ { "time" : "10:00" }, { "time" : "15:00" } ] }
{ "_id" : "B", "events" : [ { "time" : "12:00" } ] }
How does Mongo prevent document A from showing up a second time as the third result in the cursor? For small index scans, it could just keep track of which documents have already been seen, but what happens if the index is enormous? Is there ever a case where the same document would show up more than once in a single cursor?
My assumption is that it would not. Mongo could look at the document it is scanning and detect that it already would have matched earlier in the scan by inspecting earlier entries in the indexed array. However, I cannot find any mention of this behavior in the MongoDB documentation, and it is important to actually know what to expect.
(NOTE: I do know that it is possible for a document to show up in a single query more than once if the document is modified while the cursor is being scanned. That shouldn't pose a problem for queries on time-sequence data where timestamps are never edited. Even if a new event is added to a session during a scan, if Mongo uses something like the detection mechanism I mentioned above, it should be able to omit the moved document from query results.)
"I cannot find any mention of this behavior in the MongoDB documentation, and it is important to actually know what to expect."
Internals of implementation are seldom mentioned in the documentation, and after all, what you describe is the expected behavior.
There is code to deduplicate a result set and there are tests to make sure that it's working correctly. After all, a multi-key index isn't the primary use case for such functionality - if you have an $or clause in your query, the results must be de-duplicated as well.
OK, there are a couple of things going on here. I have two collections: test and test1. The documents in both collections have an array field (tags and tags1, respectively) that contains some tags. I need to find the intersection of these tags and also fetch the whole document from collection test1 if even a single tag matches.
> db.test.find();
{
    "_id" : ObjectId("5166c19b32d001b79b32c72a"),
    "tags" : [ "a", "b", "c" ]
}
> db.test1.find();
{
    "_id" : ObjectId("5166c1c532d001b79b32c72b"),
    "tags1" : [ "a", "b", "x", "y" ]
}
> db.test.find().forEach(function(doc){db.test1.find({tags1:{$in:doc.tags}})});
Surprisingly, this doesn't return anything. However, when I try it with a single document, it works:
> var doc = db.test.findOne();
> db.test1.find({tags1:{$in:doc.tags}});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1" : [ "a", "b", "x", "y" ] }
But this is only part of what I need. I need the intersection as well. So I tried this:
> db.test1.find({tags1:{$in:doc.tags}},{"tags1.$":1});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1" : [ "a" ] }
But it returned just "a", whereas both "a" and "b" were in tags1. Does the positional operator return just the first match? Also, using $in won't exactly give me an intersection. How can I get an intersection (it should return "a" and "b") irrespective of which array is compared against the other?
Now say there were an operator that could do this:
> db.test1.find({tags1:{$intersection:doc.tags}},{"tags1.$":1});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1" : [ "a", "b" ] }
My requirement is, I need the entire tags1 array PLUS this intersection, in the same query, like this:
> db.test1.find({tags1:{$intersection:doc.tags}},{"tags1":1, "tags1.$":1});
{ "_id" : ObjectId("5166c1c532d001b79b32c72b"), "tags1": [ "a", "b", "x", "y" ],
"tags1" : [ "a", "b" ] }
But this is invalid JSON. Is renaming a key possible, or is this possible only through the aggregation framework (and across different collections)? I tried the above query with $in, but it behaved as if it totally ignored the "tags1": 1 projection.
PS: I am going to have at least 10k docs in test1 and very few (<10) in test. And this query runs in real time, so I want to avoid map-reduce :)
Thanks for any help!
In newer versions (MongoDB 2.6+) you can use aggregation to accomplish this.
db.test1.aggregate([
    {
        $match: {
            tags1: {
                $in: doc.tags
            }
        }
    },
    {
        $project: {
            tags1: 1,
            intersection: {
                $setIntersection: [doc.tags, "$tags1"]
            }
        }
    }
]);
As you can see, the match portion is exactly the same as your initial find() query. The project portion generates the result fields. In this case, it selects tags1 from the matching documents and also creates intersection from the input and the matching docs.
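Putting it together with the original find, a usage sketch based on the sample data above (the expected output is inferred from the pipeline, not from the original answer):

var doc = db.test.findOne(); // { tags: ["a", "b", "c"] }
db.test1.aggregate([
    {$match: {tags1: {$in: doc.tags}}},
    {$project: {tags1: 1, intersection: {$setIntersection: [doc.tags, "$tags1"]}}}
]);
// expected: { "_id" : ObjectId("..."), "tags1" : [ "a", "b", "x", "y" ],
//             "intersection" : [ "a", "b" ] }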
Mongo doesn't have any inherent ability to retrieve array intersections. If you really need ad-hoc querying, get the intersection on the client side.
On the other hand, consider using Map-Reduce and storing its output as a collection. You can augment the returned objects in the finalize section to add the intersecting tags. Cron the MR job to run every few seconds. You get the benefit of a permanent collection you can query from the client side.
If you want to have this in real time, you should consider moving away from server-side JavaScript, which runs in a single thread and can be quite slow (this is no longer true as of v2.4; see http://docs.mongodb.org/manual/core/server-side-javascript/).
The positional operator only returns the first matching/current value. Without knowing the internal implementation: from a performance point of view, it doesn't even make sense to look for further matching criteria once the document has already been evaluated as a match. So I doubt you can go this route.
I don't know if you need the cartesian product for your search, but I would consider joining the tags of your few test documents into one array and then running a single $in search for it on test1, returning all matching documents. On your local machine you could then have multiple threads that generate the intersection per document.
Depending on how frequently your test1 and test collections change and how often you perform this query, you might precalculate this information, which would allow you to easily query a field that contains the intersection information.
The document is invalid because you have two fields named tags1.