Complex MongoDB Aggregation

I have a situation where I need to perform a group-by operation based on an array value, summing occurrences of a field value. The counts are then filtered on, and the results are prepared so they can be displayed according to the condition. Essentially, the documents are transformed back to how they would be presented if you had simply used find. I am running into an issue where the temporary documents become too large because of the number of items collected in the matchedDocuments array. Any suggestions on how to improve this would be helpful.
db.collection1.aggregate([
    { '$unwind': '$arrayOfValues' },
    { '$group': {
        '_id': '$arrayOfValues',
        'x_count': {
            '$sum': { '$cond': [ { '$eq': [ '$field', 'x' ] }, 1, 0 ] }
        },
        'y_count': {
            '$sum': { '$cond': [ { '$eq': [ '$field', 'y' ] }, 1, 0 ] }
        },
        'matchedDocuments': { '$push': '$$CURRENT' }
    }},
    { '$match': { '$or': [ { 'x_count': { '$gte': 2 } }, { 'y_count': { '$gte': 1 } } ] } },
    { '$unwind': '$matchedDocuments' },
    { '$group': {
        '_id': '$matchedDocuments._id',
        'document': { '$last': '$$CURRENT.matchedDocuments' }
    }}
], {
    allowDiskUse: true
})
Below are some sample documents and the expected result based on the criteria above:
// Sample documents
{ "_id" : ObjectId("5407c76b7b1c276c74f90524"), "field" : "x", "arrayOfValues" : [ "a", "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90525"), "field" : "x", "arrayOfValues" : [ "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90526"), "field" : "z", "arrayOfValues" : [ "a" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90527"), "field" : "x", "arrayOfValues" : [ "a", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90528"), "field" : "z", "arrayOfValues" : [ "b" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90529"), "field" : "y", "arrayOfValues" : [ "k" ] }
// Expected Result
[
{ "_id" : ObjectId("5407c76b7b1c276c74f90524"), "field" : "x", "arrayOfValues" : [ "a", "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90525"), "field" : "x", "arrayOfValues" : [ "b", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90527"), "field" : "x", "arrayOfValues" : [ "a", "c" ] }
{ "_id" : ObjectId("5407c76b7b1c276c74f90529"), "field" : "y", "arrayOfValues" : [ "k" ] }
]

I think ultimately you are asking a little too much from a single query; clearly the biggest problem here is trying to store all of the original documents from which each array element came whilst trying to aggregate a total.
For me, I would just try to identify which conditions on the document would result in a match and then issue a separate query to get the actual documents back. You could adapt the aggregation below to try to return the documents, but I think it is very likely to fail when doing so, as it would be the reverse of what you should be using arrays for.
The process is also generally much more efficient in the way it goes about the matching, allowing you to firstly "select the elements you are interested in with a match condition" and secondly "use the natural grouping conditions rather than rely on conditional sums".
var cursor = db.collection.aggregate([
    { "$match": { "field": { "$in": ["x", "y"] } } },
    { "$unwind": "$arrayOfValues" },
    { "$group": {
        "_id": {
            "elem": "$arrayOfValues",
            "field": "$field"
        },
        "count": { "$sum": 1 }
    }},
    { "$match": {
        "$or": [
            { "_id.field": "x", "count": { "$gte": 2 } },
            { "_id.field": "y", "count": { "$gte": 1 } }
        ]
    }},
    { "$group": {
        "_id": "$_id.field",
        "values": { "$push": "$_id.elem" }
    }}
])

var query = { "$or": [] };
cursor.forEach(function(doc) {
    query["$or"].push({
        "field": doc._id,
        "arrayOfValues": { "$in": doc.values }
    });
});

db.collection.find(query)
For the record the query should come out like this, given the supplied data:
{
    "$or" : [
        {
            "field" : "x",
            "arrayOfValues" : { "$in" : [ "c", "b", "a" ] }
        },
        {
            "field" : "y",
            "arrayOfValues" : { "$in" : [ "k" ] }
        }
    ]
}
The basic logic is met by just looking for the values of "field" that you are interested in, at least eliminating all others from the possible results. Then you basically want to tally up the counts for each array element under each of those "field" values and test whether the required occurrences were met.
This may or may not work best the other way around, but the sample here shows the greatest variation by the "arrayOfValues" so that makes sense as the second level of grouping.
As stated earlier, I think it is too much to ask to basically "stuff" all of the parent document information into an array for each "arrayOfValues" element, as this works against the basic principles of a sensible schema, where that sort of relation would naturally be stored as separate documents. So the end principle here is just to find the "conditions" that match those documents, which is what the end result comes out as.
The transformed query is then issued against the collection, and it returns all documents that meet the conditions determined by the previous analysis. At the end of the day, this moves the responsibility of "fetching" the matching documents off to another query, rather than trying to store the matching documents in arrays.
This seems the most logical and scalable approach, but if you mostly tend to use your data in this type of result you should be looking at re-designing your schema to suit it better. There really is not enough specific information here to comment on that further.

Aggregation with condition in embedded documents in MongoDB

I'm stuck with aggregation in mongodb. The premise is I have to get data for particular ads within a time range.
So suppose I query for ads within a range of 22nd April to 24th April; here is what I should get: a summation of spend from source2, and of revenue, sessions, bounces, etc. from source1.
[{ "_id" : ObjectId("560bbd5dfabc614611000e95"),
"spend": 470,
"revenue": 440,
"sessions": 3
},....
]
Here is the query I was attempting, which gives me correct data but takes really long: 24 seconds for only 22k entries.
db.getCollection('tests').aggregate([
    { "$match": { "ad_account_id": 40 } },
    { "$unwind": "$source1" },
    { "$unwind": "$source2" },
    { "$group": {
        "_id": "$internal_id",
        "transactionrevenue": {
            "$sum": {
                "$cond": [
                    { "$and": [
                        { "$gte": [ "$source1.created_at", ISODate("2015-04-22T00:00:00.000Z") ] },
                        { "$lte": [ "$source1.created_at", ISODate("2015-04-25T00:00:00.000Z") ] }
                    ]},
                    "$source1.transactionrevenue",
                    0
                ]
            }
        },
        "sessions": {
            "$sum": {
                "$cond": [
                    { "$and": [
                        { "$gte": [ "$source1.created_at", ISODate("2015-04-22T00:00:00.000Z") ] },
                        { "$lte": [ "$source1.created_at", ISODate("2015-04-25T00:00:00.000Z") ] }
                    ]},
                    "$source1.sessions",
                    0
                ]
            }
        },
        "spend": {
            "$sum": {
                "$cond": [
                    { "$and": [
                        { "$gte": [ "$source2.created_at", ISODate("2015-04-22T00:00:00.000Z") ] },
                        { "$lte": [ "$source2.created_at", ISODate("2015-04-25T00:00:00.000Z") ] }
                    ]},
                    "$source2.spend",
                    0
                ]
            }
        }
    }}
]);
The problems are: how to unwind multiple times, and how to get the summation of multiple things in source1 without having to run the aggregation again and again? It takes 24 seconds for only 22k entries. Please suggest what I should index (I have none), and also whether an average document size of 4 MB suggests there is something wrong with the schema?
Would mapReduce be better, even though aggregation is usually considered faster in MongoDB?
If you think the document design is wrong, I'm all ears, as we're just working on the migration. Much better to correct things now, rather than later.
Here is a sample document
{
    "_id" : ObjectId("560bbd5dfabc614611000e95"),
    "internal_id": 1,
    "created_at" : ISODate("2015-04-21T00:50:02.593Z"),
    "updated_at" : ISODate("2015-09-15T12:20:39.154Z"),
    "name" : "LookalikeUSApr21_06h19m",
    "ad_account_id" : 40,
    "targeting" : {
        "age_max" : 44,
        "age_min" : 35,
        "genders" : [ 1 ],
        "page_types" : [ "desktopfeed" ]
    },
    "auto_optimization" : false,
    "source1" : [
        {
            "id" : 119560952,
            "created_at" : ISODate("2015-04-23T12:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "transactionrevenue" : 320,
            "sessions" : 1,
            "bounces" : 1
        },
        {
            "id" : 119560955,
            "created_at" : ISODate("2015-05-01T12:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "transactionrevenue" : 230,
            "sessions" : 10,
            "bounces" : 1
        },
        {
            "id" : 119560954,
            "created_at" : ISODate("2015-04-23T10:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "transactionrevenue" : 120,
            "sessions" : 2,
            "bounces" : 1
        },
        {
            "id" : 119560953,
            "created_at" : ISODate("2015-04-25T12:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "transactionrevenue" : 100,
            "sessions" : 3,
            "bounces" : 2
        }
    ],
    "source2" : [
        {
            "id" : 219560952,
            "created_at" : ISODate("2015-04-22T12:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "spend" : 300
        },
        {
            "id" : 219560955,
            "created_at" : ISODate("2015-04-23T12:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "spend" : 170
        },
        {
            "id" : 219560954,
            "created_at" : ISODate("2015-04-25T10:35:09.467Z"),
            "updated_at" : ISODate("2015-05-19T05:20:58.374Z"),
            "spend" : 450
        }
    ]
}
The very first thing you should be doing is adding an index to both the source1 and source2 arrays for their "created_at" field. You will likely reduce a lot of possible results and improve speed greatly by simply querying for these possible matches being present in the documents you select.
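For instance, multikey indexes on the embedded date fields would look like this in the shell (a minimal sketch using the 'tests' collection from the question; whether to also lead a compound index with "ad_account_id" depends on your actual workload):
// Multikey indexes on the array elements' "created_at" values
db.getCollection('tests').createIndex({ "source1.created_at": 1 })
db.getCollection('tests').createIndex({ "source2.created_at": 1 })
// Optionally, a compound index so the equality match on the account is served first (an assumption about your queries)
db.getCollection('tests').createIndex({ "ad_account_id": 1, "source1.created_at": 1 })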
The next main improvement is to combine the arrays and filter them as one, notably before you $unwind. This is going to save a lot of cycles and document expansion in the arrays.
Moreover, it's going to give you the correct totals. When you $unwind two arrays, one array's details get repeated by the number of items in the second array, which gives you incorrect results for the array content you "unwound" first. You could always process each array separately, but it's far better to merge them into one:
db.getCollection('tests').aggregate([
    { "$match": {
        "ad_account_id": 40,
        "$or": [
            { "source1": {
                "$elemMatch": {
                    "created_at": {
                        "$gte": new Date("2015-04-22"),
                        "$lte": new Date("2015-04-25")
                    }
                }
            }},
            { "source2": {
                "$elemMatch": {
                    "created_at": {
                        "$gte": new Date("2015-04-22"),
                        "$lte": new Date("2015-04-25")
                    }
                }
            }}
        ]
    }},
    { "$project": {
        "_id": 0,
        "internal_id": 1,
        "source": {
            "$setDifference": [
                { "$map": {
                    "input": { "$setUnion": [ "$source1", "$source2" ] },
                    "as": "source",
                    "in": {
                        "$cond": [
                            { "$and": [
                                { "$gte": [ "$$source.created_at", new Date("2015-04-22") ] },
                                { "$lte": [ "$$source.created_at", new Date("2015-04-25") ] }
                            ]},
                            "$$source",
                            false
                        ]
                    }
                }},
                [false]
            ]
        }
    }},
    { "$unwind": "$source" },
    { "$group": {
        "_id": "$internal_id",
        "transactionrevenue": { "$sum": { "$ifNull": [ "$source.transactionrevenue", 0 ] } },
        "sessions": { "$sum": { "$ifNull": [ "$source.sessions", 0 ] } },
        "spend": { "$sum": { "$ifNull": [ "$source.spend", 0 ] } }
    }}
])
Which is going to give the result on your sample:
{ "_id" : 1, "transactionrevenue" : 440, "sessions" : 3, "spend" : 470 }
So probably the big architecture hint in what is being done here is that it would be very wise to combine the arrays into a single array in your general application usage. You can always add another field for "type" if you need to discern between the two different kinds of items, but just about all processing should benefit from a single array.
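As a sketch of that hint (the "sources" array name and the "type" values here are hypothetical, not the poster's actual schema), a combined document might carry a discriminator per entry:
{
    "internal_id" : 1,
    "ad_account_id" : 40,
    "sources" : [
        { "type" : "source1", "created_at" : ISODate("2015-04-23T12:35:09.467Z"), "transactionrevenue" : 320, "sessions" : 1, "bounces" : 1 },
        { "type" : "source2", "created_at" : ISODate("2015-04-22T12:35:09.467Z"), "spend" : 300 }
    ]
}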
The main lesson for the query aside from that is that you always $match first to filter out as much content as possible. Whilst the initial $match stage cannot of course remove items from arrays that do not meet the conditions, what it can importantly do is "match the documents", because you do not want to process documents that don't have that information at all. That always adds time.
The second part, other than the combined array, is that basically you want to filter out any content before unwinding the array where possible, for much the same reason: you don't want to be processing items you don't need to.
Short lesson: filter first to reduce what you are processing. Conditional sums are fine, but they should really only be used for selection of content, not raw filtering. It's basically about getting rid of the undesired data first rather than just ignoring it. Process less and do it faster.
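For what it's worth, if you are on MongoDB 3.2 or later, the $setDifference/$map trick above can be written more directly with the $filter operator, which expresses the same "filter before $unwind" idea. A sketch of just that $project stage, under the same dates:
{ "$project": {
    "_id": 0,
    "internal_id": 1,
    "source": {
        "$filter": {
            "input": { "$setUnion": [ "$source1", "$source2" ] },
            "as": "source",
            "cond": {
                "$and": [
                    { "$gte": [ "$$source.created_at", new Date("2015-04-22") ] },
                    { "$lte": [ "$$source.created_at", new Date("2015-04-25") ] }
                ]
            }
        }
    }
}}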

How to assign weights to searched documents in MongoDb?

This might sound like a simple question, but I have spent over 3 hours trying to achieve it and got stuck midway.
Inputs:
List of keywords
List of tags
Problem Statement: I need to find all the documents from the database which satisfy the following conditions:
List documents that have 1 or many matching keywords. (achieved)
List documents that have 1 or many matching tags. (achieved)
Sort the found documents on the basis of weights: each keyword match carries 2 points and each tag match carries 1 point.
Query: How can I achieve requirement #3?
My Attempt: In my attempt I am only able to sort on the basis of keyword matches (and even that without multiplying the weight by 2).
tags is an array of documents. The structure of each tag is like:
{
    "id" : "ICC",
    "some Other Key" : "some Other value"
}
keywords is an array of strings:
["women", "cricket"]
Query:
var predicate = [
    { "$match": {
        "$or": [
            { "keywords": { "$in": ["cricket", "women"] } },
            { "tags.id": { "$in": ["ICC"] } }
        ]
    }},
    { "$project": {
        "title": 1,
        "_id": 0,
        "keywords": 1,
        "weight": {
            "$size": {
                "$setIntersection": [ "$keywords", ["cricket", "women"] ]
            }
        },
        "tags.id": 1
    }},
    { "$sort": { "weight": -1 } }
];
It seems that you were close in your attempt, but of course you need to implement something to "match your logic" in order to get the final "score" value you want.
It's just a matter of changing your projection logic a little, and assuming that both "keywords" and "tags" are arrays in your documents:
db.collection.aggregate([
    // Match your required documents
    { "$match": {
        "$or": [
            { "keywords": { "$in": ["cricket", "women"] } },
            { "tags.id": { "$in": ["ICC"] } }
        ]
    }},
    // Inspect elements and create a "weight"
    { "$project": {
        "title": 1,
        "keywords": 1,
        "tags": 1,
        "weight": {
            "$add": [
                { "$multiply": [
                    { "$size": {
                        "$setIntersection": [
                            "$keywords",
                            [ "cricket", "women" ]
                        ]
                    }},
                    2
                ]},
                { "$size": {
                    "$setIntersection": [
                        { "$map": {
                            "input": "$tags",
                            "as": "t",
                            "in": "$$t.id"
                        }},
                        ["ICC"]
                    ]
                }}
            ]
        }
    }},
    // Then sort by that "weight"
    { "$sort": { "weight": -1 } }
])
So it is basically the $map logic here that "transforms" the other array to just give the "id" values for comparison against the "set" operation that you want.
The $add operator then combines the keyword score and the tag score into the single "weight" value you sort your responses by.
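One caveat worth noting: if some documents lack either array, the pipeline above will error, since the set operators and $size require array inputs. A defensive sketch (assuming missing arrays are possible in your data, which the question does not say) wraps each field in $ifNull:
"weight": {
    "$add": [
        { "$multiply": [
            { "$size": {
                "$setIntersection": [
                    { "$ifNull": [ "$keywords", [] ] },
                    [ "cricket", "women" ]
                ]
            }},
            2
        ]},
        { "$size": {
            "$setIntersection": [
                { "$map": {
                    "input": { "$ifNull": [ "$tags", [] ] },
                    "as": "t",
                    "in": "$$t.id"
                }},
                ["ICC"]
            ]
        }}
    ]
}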

MongoDB: find documents with a given array of subdocuments

I want to find documents which contain given subdocuments, let's say I have the following documents in my commits collection:
// Document 1
{
"commit": 1,
"authors" : [
{"name" : "Joe", "lastname" : "Doe"},
{"name" : "Joe", "lastname" : "Doe"}
]
}
// Document 2
{
"commit": 2,
"authors" : [
{"name" : "Joe", "lastname" : "Doe"},
{"name" : "John", "lastname" : "Smith"}
]
}
// Document 3
{
"commit": 3,
"authors" : [
{"name" : "Joe", "lastname" : "Doe"}
]
}
All I want from the above collection is the 1st document, since I'm looking for a commit with 2 authors where both have the same name and lastname. So I came up with this query:
db.commits.find({
    $and: [
        { 'authors': { $elemMatch: { 'name': 'Joe', 'lastname': 'Doe' } } },
        { 'authors': { $elemMatch: { 'name': 'Joe', 'lastname': 'Doe' } } }
    ],
    'authors': { $size: 2 }
})
$size is used to filter out the 3rd document, but the query still returns the 2nd document, since both $elemMatch clauses return true.
I can't use an index on the subdocuments, since the order of authors used for the search is random. Is there a way to remove the 2nd document from the results without using Mongo's aggregate function?
What you are asking for here is a little different from a standard query. In fact you are asking for documents where the "name" and "lastname" combination is found in the array two times or more, in order to identify that document.
Standard query arguments do not match "how many times" an array element is matched within a result. But of course you can ask the server to "count" that for you using the aggregation framework:
db.collection.aggregate([
    // Match possible documents to reduce the pipeline
    { "$match": {
        "authors": { "$elemMatch": { "name": "Joe", "lastname": "Doe" } }
    }},
    // Unwind the array elements for processing
    { "$unwind": "$authors" },
    // Group back and "count" the matching elements
    { "$group": {
        "_id": "$_id",
        "commit": { "$first": "$commit" },
        "authors": { "$push": "$authors" },
        "count": {
            "$sum": {
                "$cond": [
                    { "$and": [
                        { "$eq": [ "$authors.name", "Joe" ] },
                        { "$eq": [ "$authors.lastname", "Doe" ] }
                    ]},
                    1,
                    0
                ]
            }
        }
    }},
    // Filter out anything that didn't match at least twice
    { "$match": { "count": { "$gte": 2 } } }
])
So essentially you put your conditions to match inside the $cond operator, which returns 1 where matched and 0 where not, and this is passed to $sum to get a total for the document.
Then you filter out any documents that did not match 2 or more times.
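Against the three sample documents, only the first commit survives the final $match; the result carries the rebuilt array and its tally (add a trailing $project if you want to drop the "count" field):
{
    "_id" : ObjectId("..."),
    "commit" : 1,
    "authors" : [
        { "name" : "Joe", "lastname" : "Doe" },
        { "name" : "Joe", "lastname" : "Doe" }
    ],
    "count" : 2
}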

Documents in MongoDB where last n sub-array elements contain a value

Consider this set of data in MongoDB...
{
    _id: 1,
    name: "Johnny",
    properties: [
        { type: "A", value: 257, date: "4/1/2014" },
        { type: "A", value: 200, date: "4/2/2014" },
        { type: "B", value: 301, date: "4/3/2014" },
        ...
    ]
}
What is the proper way to query for documents in which one (or more) of the last two "properties" elements has a value > x, or one (or more) of the last two "properties" elements of type "A" has a value > x?
If you can stomach modifying your insertion method, change your updates to push documents like this:
doc = { type : "A", "value" : 123, "date" : new Date() }
db.foo.update( {_id:1}, { "$push" : { "properties" : { "$each" : [ doc ], "$sort" : { date : -1} } } } )
This will give you an array of documents sorted in descending order by time, making the "most recent" document first.
You can now use the standard MongoDB dot notation to query against the 0, 1, etc elements of your properties array, which represent the most recent additions logically.
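For example, assuming the sorted-insert pattern above and a threshold of x = 200, the query over the two most recent elements is just dot notation on positions 0 and 1 (a sketch; adjust the fields to your own conditions):
// Positions 0 and 1 are the two most recent entries because of the descending $sort on push
db.foo.find({
    "$or": [
        { "properties.0.type": "A", "properties.0.value": { "$gt": 200 } },
        { "properties.1.type": "A", "properties.1.value": { "$gt": 200 } }
    ]
})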
As per the comments, the aggregation framework is for a lot more than simply "aggregating" values, so you can take advantage of the various pipeline operators to do very advanced things that cannot be achieved simply using .find()
db.collection.aggregate([
    // Match documents that "could" meet the conditions to narrow down
    { "$match": {
        "properties": {
            "$elemMatch": { "type": "A", "value": { "$gt": 200 } }
        }
    }},
    // Keep a copy of the document for later with an array copy
    { "$project": {
        "_id": {
            "_id": "$_id",
            "name": "$name",
            "properties": "$properties"
        },
        "properties": 1
    }},
    // Unwind the array to "de-normalize"
    { "$unwind": "$properties" },
    // Get the "last" element of the array and copy the existing one
    { "$group": {
        "_id": "$_id",
        "properties": { "$last": "$_id.properties" },
        "last": { "$last": "$properties" },
        "count": { "$sum": 1 }
    }},
    // Unwind the copy again
    { "$unwind": "$properties" },
    // Project to mark the element you already have
    { "$project": {
        "properties": 1,
        "last": 1,
        "count": 1,
        "seen": { "$eq": [ "$properties", "$last" ] }
    }},
    // Match again, being careful to keep any array with one element only
    // This gets rid of the element you already kept
    { "$match": {
        "$or": [
            { "seen": false },
            { "seen": true, "count": 1 }
        ]
    }},
    // Group to get the second last element as "next"
    { "$group": {
        "_id": "$_id",
        "last": { "$last": "$last" },
        "next": { "$last": "$properties" }
    }},
    // Then match to see if either of those elements fits
    { "$match": {
        "$or": [
            { "last.type": "A", "last.value": { "$gt": 200 } },
            { "next.type": "A", "next.value": { "$gt": 200 } }
        ]
    }},
    // Finally restore your matching documents
    { "$project": {
        "_id": "$_id._id",
        "name": "$_id.name",
        "properties": "$_id.properties"
    }}
])
Running through that in a bit more detail:
The first $match usage is to make sure you are only working on documents that can "possibly" match your extended conditions. Always a good idea to optimize like this.
The next stage is to $project since you likely want to keep the original document detail and you are at least going to need the array again in order to get the second last element.
The next stages make use of $unwind in order to break the array into individual documents which is then followed by $group which is used to find the last item on the document _id boundary. This is actually the last item in the array. Plus you keep a count of the array elements.
So then, after using $unwind again on the original array content, the usage of $project again adds a "seen" field to the document, indicating via the $eq operator whether or not the document from the original is actually the one that was previously kept as the "last" element.
After that stage you again issue a $match in order to filter that last document from the result, while also making sure in the condition that you are not removing anything that originally matched where the array length is actually 1.
From here you want to $group again in order to get the "second last" element from the array (or indeed the same "last" element where there was only one).
The final steps are simply to $match where either of those last two elements meets the conditions, and then finally to $project the document in its original form.
So while that is fairly involved, and of course increases in complexity with the number of items you want to test at the end of the array, it can be done, and it shows how aggregate is very suited to this problem.
Where possible it is the best approach, as invoking the JavaScript interpreter carries an overhead compared to the native code used by aggregate.
Using mapReduce would remove the code complexity of taking the last two (or more) possible elements, but it will invoke the JavaScript interpreter by nature and will therefore run much more slowly.
For the record, since the sample in the question would not be a match, here is some data that will match the last two documents, one of which only has one element in the array:
{
    "_id" : 1,
    "name" : "Johnny",
    "properties" : [
        { "type" : "A", "value" : 257, "date" : "4/1/2014" },
        { "type" : "A", "value" : 200, "date" : "4/2/2014" },
        { "type" : "B", "value" : 301, "date" : "4/3/2014" }
    ]
}
{
    "_id" : 2,
    "name" : "Ace",
    "properties" : [
        { "type" : "A", "value" : 257, "date" : "4/1/2014" },
        { "type" : "B", "value" : 200, "date" : "4/2/2014" },
        { "type" : "B", "value" : 301, "date" : "4/3/2014" }
    ]
}
{
    "_id" : 3,
    "name" : "Bo",
    "properties" : [
        { "type" : "A", "value" : 257, "date" : "4/1/2014" }
    ]
}
{
    "_id" : 4,
    "name" : "Sue",
    "properties" : [
        { "type" : "A", "value" : 257, "date" : "4/1/2014" },
        { "type" : "A", "value" : 240, "date" : "4/2/2014" },
        { "type" : "B", "value" : 301, "date" : "4/3/2014" }
    ]
}
Have you considered using a $where clause? It is not the most efficient approach, but I think it should get you what you want. For instance, if you wanted every document where either of the last two properties elements has a value field greater than 200, you could try:
db.collection.find({
    "properties": { "$exists": true },
    "$where": "(this.properties[this.properties.length-1].value > 200) || (this.properties[this.properties.length-2].value > 200)"
});
This needs some work for edge cases (arrays with fewer than 2 members, for example) and for more complex queries (by the "type" field too), but it should get you started.
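A sketch covering both of those gaps passes a function to $where (the shell accepts either a string or a function), so short arrays and the "type" condition are handled:
db.collection.find({
    "properties": { "$exists": true },
    "$where": function() {
        // Look at up to the last two elements; slice(-2) is safe on a one-element array
        return this.properties.slice(-2).some(function(p) {
            return p.type === "A" && p.value > 200;
        });
    }
});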

Selecting Distinct values from Array in MongoDB

I have a collection named Alpha_Num with the following structure. I am trying to find out which Alphabet-Numerals pair appears the maximum number of times.
If we just go by the data below, the pair abcd-123 appears twice, as does the pair efgh-10001, but the second is not a valid case for me as it appears within the same document.
{
    "_id" : 12345,
    "Alphabet" : "abcd",
    "Numerals" : [ "123", "456", "2345" ]
}
{
    "_id" : 123456,
    "Alphabet" : "efgh",
    "Numerals" : [ "10001", "10001", "1002" ]
}
{
    "_id" : 123456567,
    "Alphabet" : "abcd",
    "Numerals" : [ "123" ]
}
I tried to use the aggregation framework, something like below:
db.Alpha_Num.aggregate([
    { "$unwind": "$Numerals" },
    { "$group": {
        "_id": { "Alpha": "$Alphabet", "Num": "$Numerals" },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
])
The problem with this query is that it counts the pair efgh-10001 twice.
Question: How do I select only distinct values from the array "Numerals" under the above condition?
Problem solved.
db.Alpha_Num.aggregate([
    { "$unwind": "$Numerals" },
    { "$group": {
        "_id": { "_id": "$_id", "Alpha": "$Alphabet" },
        "Num": { "$addToSet": "$Numerals" }
    }},
    { "$unwind": "$Num" },
    { "$group": {
        "_id": { "Alpha": "$_id.Alpha", "Num": "$Num" },
        "count": { "$sum": 1 }
    }}
])
Grouping with $addToSet and unwinding again did the trick. I got the answer from one of the 10gen online courses.
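Since the original goal was the pair with the maximum number of appearances, it is worth re-appending the $sort from the first attempt (and a $limit, if only the top pair is needed) after the final $group:
// Appended after the final $group to surface the most frequent pair
{ "$sort": { "count": -1 } },
{ "$limit": 1 }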