I have a requirement where I need to aggregate two records that each have an array field with different values. When I aggregate these records, the result should contain one array with the unique values from both arrays. Here is an example:
First record
{ Host:"abc.com" ArtId:"123", tags:[ "tag1", "tag2" ] }
Second record
{ Host:"abc.com" ArtId:"123", tags:[ "tag2", "tag3" ] }
After aggregation on Host and ArtId I need a result like this:
{ Host: "abc.com", ArtId: "123", count: 2, tags: [ "tag1", "tag2", "tag3" ] }
I tried $addToSet in the $group stage, but it gives me nested arrays like this: tags: [ ["tag1","tag2"], ["tag2","tag3"] ]
Could you please help me with how I can achieve this in aggregation?
TL;DR
Modern releases should use $reduce with $setUnion after the initial $group, as shown:
db.collection.aggregate([
{ "$group": {
"_id": { "Host": "$Host", "ArtId": "$ArtId" },
"count": { "$sum": 1 },
"tags": { "$addToSet": "$tags" }
}},
{ "$addFields": {
"tags": {
"$reduce": {
"input": "$tags",
"initialValue": [],
"in": { "$setUnion": [ "$$value", "$$this" ] }
}
}
}}
])
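For the two sample documents above, this pipeline produces one document per grouping key. A sketch of the expected shape ($addToSet and $setUnion give no ordering guarantees, so the tags may come back in any order):

{ "_id": { "Host": "abc.com", "ArtId": "123" }, "count": 2, "tags": [ "tag1", "tag2", "tag3" ] }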
You were right in finding the $addToSet operator, but when working with content in an array you generally need to process with $unwind first. This "de-normalizes" the array entries, essentially making a "copy" of the parent document for each array entry, with that entry as a singular value in the field. That is what avoids the nested-array behavior you are seeing.
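To make that concrete, here is a sketch of what a single $unwind stage does to the first sample document:

// input: { Host: "abc.com", ArtId: "123", tags: [ "tag1", "tag2" ] }
db.collection.aggregate([ { "$unwind": "$tags" } ])
// output: { Host: "abc.com", ArtId: "123", tags: "tag1" }
//         { Host: "abc.com", ArtId: "123", tags: "tag2" }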
Your "count" poses an interesting problem though, but easily solved through the use of a "double unwind" after an initial $group operation:
db.collection.aggregate([
// Group on the compound key and get the occurrences first
{ "$group": {
"_id": { "Host": "$Host", "ArtId": "$ArtId" },
"tcount": { "$sum": 1 },
"ttags": { "$push": "$tags" }
}},
// Unwind twice because "ttags" is now an array of arrays
{ "$unwind": "$ttags" },
{ "$unwind": "$ttags" },
// Now use $addToSet to get the distinct values
{ "$group": {
"_id": "$_id",
"tcount": { "$first": "$tcount" },
"tags": { "$addToSet": "$ttags" }
}},
// Optionally $project to get the fields out of the _id key
{ "$project": {
"_id": 0,
"Host": "$_id.Host",
"ArtId": "$_id.ArtId",
"count": "$tcount",
"tags": "$ttags"
}}
])
That final bit with $project is also there because I used "temporary" names for each of the fields in other stages of the aggregation pipeline. This is because there is an optimization in $project that "copies" the fields from an existing stage in the order they already appeared "before" any "new" fields are added to the document.
Otherwise the output would look like:
{ "count":2 , "tags":[ "tag1", "tag2", "tag3" ], "Host": "abc.com", "ArtId": "123" }
Where the fields are not in the same order as you might think. Trivial really, but it matters to some people, so worth explaining why, and how to handle.
So $unwind does the work to keep the items separated and not in arrays, and doing the $group first allows you to get the "count" of the occurrences of the "grouping" key.
The $first operator used later "keeps" that "count" value, as it just got "duplicated" for every value present in the "tags" array. It's all the same value anyway so it does not matter. Just pick one.
Related
I want to fetch "all the documents" having the highest value for a specific field, and then group by another field.
Consider the data below:
_id:1, country:india, quantity:12, name:xyz
_id:2, country:USA, quantity:5, name:abc
_id:3, country:USA, quantity:6, name:xyz
_id:4, country:india, quantity:8, name:def
_id:5, country:USA, quantity:10, name:jkl
_id:6, country:india, quantity:12, name:jkl
The answer should be:
country:india max-quantity:12
name xyz
name jkl
country:USA max-quantity:10
name jkl
I have tried several queries, but I can only get the max value without the name, or I can group, but then it shows all the values.
db.coll.aggregate([{
$group:{
_id:"$country",
"maxQuantity":{$max:"$quantity"}
}
}])
For example, the above will give the max quantity for every country, but how do I combine it with the other field so that it shows all the documents having the max quantity?
If you want to keep document information, then you basically need to $push it into an array. But of course, once you have your $max values, you need to filter the contents of the array for just the elements that match:
db.coll.aggregate([
{ "$group":{
"_id": "$country",
"maxQuantity": { "$max": "$quantity" },
"docs": { "$push": {
"_id": "$_id",
"name": "$name",
"quantity": "$quantity"
}}
}},
{ "$project": {
"maxQuantity": 1,
"docs": {
"$setDifference": [
{ "$map": {
"input": "$docs",
"as": "doc",
"in": {
"$cond": [
{ "$eq": [ "$maxQuantity", "$$doc.quantity" ] },
"$$doc",
false
]
}
}},
[false]
]
}
}}
])
So you store everything in an array and then test each array member to see if its value matches the one that was recorded as the maximum, discarding any that do not.
I'd keep the _id values in the array documents since that is what makes them "unique" and won't be adversely affected by $setDifference when filtering out values. But of course if "name" is always unique then it won't be required.
You can also just return whatever fields you want from $map, but I'm just returning the whole document for example.
Keep in mind that this has the limitation of not exceeding the BSON size limit of 16MB, so it is okay for small data samples, but anything producing a potentially large list (since you cannot pre-filter the array content) would be better off processed with a separate query to find the "max" values, and another to fetch the matching documents.
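For completeness, a minimal sketch of that two-query alternative in the shell, assuming the same coll collection and field names as above:

var maxes = db.coll.aggregate([
    { "$group": { "_id": "$country", "maxQuantity": { "$max": "$quantity" } } }
]).toArray();

maxes.forEach(function(m) {
    // a plain query per country; this can use an index on { country: 1, quantity: 1 }
    printjson({
        "country": m._id,
        "maxQuantity": m.maxQuantity,
        "docs": db.coll.find({ "country": m._id, "quantity": m.maxQuantity }).toArray()
    });
});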
I know a simpler way to do a similar task, but only if you restrict the query to a specific set of countries:
[
{"$match":{"country":{"$in":["USA","india"]}}}, // stage one
{ "$sort": { "quantity": -1 }}, // stage two
{"$limit":2 } // stage three - limit equal to the length of ["USA","india"]
]
If you need all countries, try the following, but no guarantees from me:
[
{"$project": {
"country": "$country",
"quantity": "$quantity",
"document": "$$ROOT" // save all fields for future usage
}},
{ "$sort": { "quantity": -1 }},
{"$group":{"_id":{"country":"$country"},"original_doc":{"$first":"$document"} }}
]
Another way can be like:
db.coll.aggregate(
[
{
$sort:{ country: -1, "quantity": -1 }
},
{
"$group":
{
"_id":{ "country": "$country" },
"data":{ "$first": "$$ROOT" }
}
}
])
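Be aware that both of these "$first after $sort" variants return only one document per country, so ties are broken arbitrarily. A sketch of what the last pipeline returns for the sample data (india has two documents tied at quantity 12, and only one survives):

// india: whichever of _id 1 (xyz) or _id 6 (jkl) happens to sort first at quantity 12
{ "_id": { "country": "india" }, "data": { "_id": 1, "country": "india", "quantity": 12, "name": "xyz" } }
{ "_id": { "country": "USA" }, "data": { "_id": 5, "country": "USA", "quantity": 10, "name": "jkl" } }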
Another possibility, close to Blakes Seven's solution, is to simplify the $setDifference plus $map part by using a $filter on the array of documents.
db.coll.aggregate([
{ "$group":{
"_id": "$country",
"maxQuantity": { "$max": "$quantity" },
"docs": { "$push": {
"_id": "$_id",
"name": "$name",
"quantity": "$quantity"
}}
}},
{ "$project": {
"maxQuantity": 1,
"docs": {
"$filter": {
"input": "$docs",
"as": "doc",
"cond": { $eq: ["$$doc.quantity", "$maxQuantity"] }
}
}
}}
])
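For the sample data, either this $filter form or the $setDifference form above should produce something along these lines (the order inside "docs" is the $push insertion order):

{ "_id": "india", "maxQuantity": 12, "docs": [
{ "_id": 1, "name": "xyz", "quantity": 12 },
{ "_id": 6, "name": "jkl", "quantity": 12 }
]}
{ "_id": "USA", "maxQuantity": 10, "docs": [
{ "_id": 5, "name": "jkl", "quantity": 10 }
]}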
Document:
{
"_id" : ObjectId("560dcd15491a065d6ab1085c"),
"title" : "example title",
"views" : 1,
"messages" : [
{
"authorId" : ObjectId("560c24b853b558856ef193a3"),
"authorName" : "Karl Morrison",
"created" : ISODate("2015-10-02T00:17:25.119Z"),
"message" : "example message"
}
]
}
Project:
$project: {
_id: 1,
title: 1,
views: 1,
updated: '$messages[$messages.length-1].created' // <--- ReferenceError: $messages is not defined
}
I am trying to get the created value of the last element in the array inside the document. I was reading the documentation but came up short on this specific task.
I've learnt it has to do with dot notation; however, the documentation doesn't state how to get the last element.
You cannot extract properties or otherwise reshape the result of a basic .find() query beyond simple top-level field selection, as it simply is not supported. For more advanced manipulation you can use the aggregation framework.
However, without even touching .aggregate() the $slice projection operator gets you most of the way there:
db.collection.find({},{ "messages": { "$slice": -1 } })
You cannot alter the structure, but this gets you the last array element with little effort.
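Run against the sample document, that projection returns the whole document with "messages" trimmed to just its last entry, roughly:

{
"_id": ObjectId("560dcd15491a065d6ab1085c"),
"title": "example title",
"views": 1,
"messages": [
{
"authorId": ObjectId("560c24b853b558856ef193a3"),
"authorName": "Karl Morrison",
"created": ISODate("2015-10-02T00:17:25.119Z"),
"message": "example message"
}
]
}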
Until a new release of MongoDB (as of writing), the aggregation framework is still going to need to $unwind the array in order to get at the "last" element, which you can select with the $last grouping accumulator:
db.collection.aggregate([
{ "$unwind": "$messages" },
{ "$group": {
"_id": "$_id",
"title": { "$last": "$title" },
"views": { "$last": "$views" },
"created": { "$last": "$messages.created" }
}}
])
Future releases have $slice and $arrayElemAt in aggregation which can handle this directly. But you would also need to set a variable with $let to address the dot notated field:
[
{ "$project": {
"name": 1,
"views": 1,
"created": {
"$let": {
"vars": {
"message": {
"$arrayElemAt": [
{ "$slice": [ "$messages", -1 ] },
0
]
}
},
"in": "$$message.created"
}
}
}}
]
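As a side note, $arrayElemAt accepts a negative index, so the $slice wrapper can be dropped for a slightly shorter form; a sketch under the same "future release" assumption:

[
{ "$project": {
"title": 1,
"views": 1,
"created": {
"$let": {
"vars": { "message": { "$arrayElemAt": [ "$messages", -1 ] } },
"in": "$$message.created"
}
}
}}
]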
Let's say we have records of the following structure in the database.
{
"_id": 1234,
"tags" : [ "t1", "t2", "t3" ]
}
Now, I want to check if the database contains a record with any of the tags specified in the array tagsArray, which is [ "t3", "t4", "t5" ].
I know about the $in operator, but I not only want to know whether any record in the database has any of the tags specified in tagsArray, I also want to know which tag of the record matches any of the tags in tagsArray (i.e. t3 in the case of the record mentioned above).
That is, I want to compare two arrays (one from the record and the other given by me) and find the common elements.
I need this expression along with many other expressions in the query, so projection operators like $ and $elemMatch won't be of much use. (Or is there a way they can be used without having to iterate over all records?)
I think I could use the $where operator, but I don't think that is the best way to do this.
How can this problem be solved?
There are a few approaches to do what you want; it just depends on your version of MongoDB. I'm just submitting the shell responses here. The content is basically a JSON representation, which is not hard to translate into DBObject entities for Java, or into JavaScript to be executed on the server, so that really does not change anything.
The first and the fastest approach is with MongoDB 2.6 and greater where you get the new set operations:
var test = [ "t3", "t4", "t5" ];
db.collection.aggregate([
{ "$match": { "tags": {"$in": test } }},
{ "$project": {
"tagMatch": {
"$setIntersection": [
"$tags",
test
]
},
"sizeMatch": {
"$size": {
"$setIntersection": [
"$tags",
test
]
}
}
}},
{ "$match": { "sizeMatch": { "$gte": 1 } } },
{ "$project": { "tagMatch": 1 } }
])
The new operators there are $setIntersection that is doing the main work and also the $size operator which measures the array size and helps for the latter filtering. This ends up as a basic comparison of "sets" in order to find the items that intersect.
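Run against the sample record with the test array above, the pipeline returns the intersection for each matching document, i.e. something like:

{ "_id": 1234, "tagMatch": [ "t3" ] }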
If you have an earlier version of MongoDB then this is still possible, but you need a few more stages, and this might affect performance somewhat depending on whether you have large arrays:
var test = [ "t3", "t4", "t5" ];
db.collection.aggregate([
{ "$match": { "tags": {"$in": test } }},
{ "$project": {
"tags": 1,
"match": { "$const": test }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$match" },
{ "$project": {
"tags": 1,
"matched": { "$eq": [ "$tags", "$match" ] }
}},
{ "$match": { "matched": true }},
{ "$group": {
"_id": "$_id",
"tagMatch": { "$push": "$tags" },
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$gte": 1 } }},
{ "$project": { "tagMatch": 1 }}
])
Or if all of that seems too involved, or your arrays are large enough to make a performance difference, then there is always mapReduce:
var test = [ "t3", "t4", "t5" ];
db.collection.mapReduce(
function () {
var intersection = this.tags.filter(function(x){
return ( test.indexOf( x ) != -1 );
});
if ( intersection.length > 0 )
emit ( this._id, intersection );
},
function(){},
{
"query": { "tags": { "$in": test } },
"scope": { "test": test },
"output": { "inline": 1 }
}
)
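With "inline" output, the matching _id values and their intersections come back under a results key, roughly like this (timing and counts fields trimmed):

{ "results": [ { "_id": 1234, "value": [ "t3" ] } ], "ok": 1 }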
Note that in all cases the $in operator still helps you to reduce the results even though it is not the full match. The other common element is checking the "size" of the intersection result to reduce the response.
All pretty easy to code up, convince the boss to switch to MongoDB 2.6 or greater if you are not already there for the best results.
I have the following sample document in MongoDB.
{
"location" : {
"language" : null,
"country" : "null",
"city" : "null",
"state" : null,
"continent" : "null",
"latitude" : "null",
"longitude" : "null"
},
"request" : [
{
"referrer" : "direct",
"url" : "http://www.google.com/"
"title" : "index page"
"currentVisit" : "1401282897"
"visitedTime" : "1401282905"
},
{
"referrer" : "direct",
"url" : "http://www.stackoverflow.com/",
"title" : "index page"
"currentVisit" : "1401282900"
"visitedTime" : "1401282905"
},
......
],
"uuid" : "109eeee0-e66a-11e3"
}
Note:
The database contains more than 10845 documents.
Each document contains nearly 100 requests (100 objects in the request array).
Technology/Language - node.js
I set profiling to check the execution times:
First Query - 13899ms
Second Query - 9024ms
Third Query - 8310ms
Fourth Query - 6858ms
There is not much difference when using indexing.
Queries:
I have the following aggregation queries to execute to fetch the data.
var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};
For Example: var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};
The third and fourth queries use request.visitedTime instead of request.currentVisit.
First
[
{ "$project":{
"request.currentVisit":1,
"request.url":1
}},
{ "$match":{
"request.1": {$exists:true}
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
Second
[
{ "$project": {
"request.currentVisit":1,
"request.url":1
}},
{ "$match": {
"request":{ "$size": 1 }
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id":{
"url":"$request.url"
},
"count":{ "$sum": 1 }
}},
{ "$sort": { "count": -1} }
]
Third
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request.1": { "$exists": true }
}},
{ "$match": match },
{ "$group": {
"_id": "$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id": null,
"total": { "$sum":"$count" }}
}}
]
Forth
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request":{ "$size": 1 }
}},
{ "$match": match },
{ "$group": {
"_id":"$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id":null,
"total": { "$sum": "$count" }
}}
]
Problem:
It is taking more than 38091 ms to fetch the data.
Is there any way to optimize these queries?
Any suggestion would be appreciated.
Well there are a few problems and you definitely need indexes, but you cannot have compound ones. It is the "timestamp" values that you are querying within the array that you want to index. It would also be advised that you either convert these to numeric values rather than the current strings, or indeed to BSON Date types. The latter form is actually internally stored as a numeric timestamp value, so there is a general storage size reduction, which also reduces the index size as well as being more efficient to match on the numeric values.
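As a sketch, the indexes those initial matches could use would look something like this (field names taken from your queries; on older shells ensureIndex is the equivalent call):

db.collection.createIndex({ "request.currentVisit": 1 })
db.collection.createIndex({ "request.visitedTime": 1 })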
The big problem with each query is that you are always diving into the "array" contents only after processing an $unwind, and then "filtering" that with $match. While this is what you want to do for your final result, since you have not applied the same filter at an earlier stage, you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage. And here you cannot use an index.
Where you need this match is at the start of the pipeline stages. This narrows down the documents to the "possible" matches before the actual array is filtered.
So using the first as an example:
[
{ "$match":{
{ "request.currentVisit":{
"$gte":"1401282905", "$lte": "1401282935"
}
}},
{ "$unwind": "$request" },
{ "$match":{
{ "request.currentVisit":{
"$gte":"1401282905", "$lte": "1401282935"
}
}},
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
So a few changes. There is a $match at the head of the pipeline. This narrows down documents and is able to use an index. That is the most important performance consideration. Golden rule, always "match" first.
The $project you had in there was redundant, as you cannot project "just" the fields of an array that is not yet unwound. There is also a misconception that people believe they should $project first to reduce the pipeline. The effect is very minimal; if in fact there is a later $project or $group statement that actually limits the fields, then this will be "forward optimized" so things do get taken out of the pipeline processing for you. Still, the $match statement above does more to optimize.
You can also drop the stage that checks whether the array is actually there, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.
The rest remains unchanged, as you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.
The other alternative that you can do with MongoDB 2.6 and greater is to "filter" the array content before you even $unwind it. This would produce a listing like this:
[
{ "$match": {
"request.currentVisit": {
"$gte": "1401282905", "$lte": "1401282935"
}
}},
{ "$project": {
"request": {
"$setDifference": [
{ "$map": {
"input": "$request",
"as": "el",
"in": {
"$cond": [
{ "$and": [
{ "$gte": [ "$$el.currentVisit", "1401282905" ] },
{ "$lte": [ "$$el.currentVisit", "1401282935" ] }
]},
"$$el",
false
]
}
}},
[false]
]
}
}},
{ "$unwind": "$request" },
{ "$group": {
"_id": {
"url": "$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort": { "count": -1 } }
]
That may save you something by being able to "filter" the array before the $unwind, which is possibly better than doing the $match afterwards.
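On releases that also have $filter (used in an earlier answer above), the $map plus $setDifference construct collapses into a single, clearer expression. A sketch of just that $project stage, with the same field names and string timestamps:

{ "$project": {
"request": {
"$filter": {
"input": "$request",
"as": "el",
"cond": {
"$and": [
{ "$gte": [ "$$el.currentVisit", "1401282905" ] },
{ "$lte": [ "$$el.currentVisit", "1401282935" ] }
]
}
}
}
}},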
But this is the general rule for all of your statements. You need usable indexes and you need to $match first.
It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.
If you are still then trying to come to terms with how this could possibly be singular, then you can always ask another question.
The problem is that, given documents with two arrays each containing documents as their elements, I want to find the documents that essentially have:
"obj1.a" === "obj2.b"
So given the sample documents, but actually expecting much larger arrays, how do I do this?
{
"obj1": [
{ "a": "a", "b": "b" },
{ "a": "a", "b": "c" }
],
"obj2": [
{ "a": "c", "b": "b" },
{ "a": "c", "b": "c" }
]
},
{
"obj1": [
{ "a": "a", "b": "b" }
],
"obj2": [
{ "a": "a", "b": "a" }
]
}
One approach might be to compare these with JavaScript and the $where operator, but looping large arrays from within JavaScript doesn't sound very favorable.
Another approach is using the aggregation framework to do the comparison, but this involves unwinding two arrays on top of each other which can create a lot of documents to be processed in the pipeline:
db.objects.aggregate([
{ "$unwind": "$obj1" },
{ "$unwind": "$obj2" },
{ "$project": {
"match": { "$eq": [ "$obj1.a", "$obj2.b" ] }
}},
{ "$group": {
"_id": "$_id",
"match": { "$max": "$match" }
}},
{ "$match": { "match": true } }
])
Where performance is a concern, it is easy to see how the number of documents actually being processed through $project and $group can end up many times larger than the original documents in the collection.
So in order to do this there has to be some way of comparing the array elements without needing to perform an $unwind on those arrays and end up grouping the documents back together. How could this be done?
You can get this sort of result using the $map operator that was introduced in MongoDB 2.6. This operates by taking an input array and allowing an expression to be evaluated over each element producing a new array as the result:
db.objects.aggregate([
{ "$project": {
"match": {
"$size": {
"$setIntersection": [
{ "$map": {
"input": "$obj1",
"as": "el",
"in": { "$concat": ["$$el.a",""] }
}},
{ "$map": {
"input": "$obj2",
"as": "el",
"in": { "$concat": ["$$el.b",""] }
}}
]
}
}
}},
{ "$match": { "match": { "$gte": 1 } } }
])
Here this is used with the $setIntersection and $size operators. As the $map returns just the property values from the elements that you want to compare you end up with two arrays just containing those values.
The only catch is that the "in" option for $map currently requires an operator to be present within the Object {} notation of its argument. You cannot presently say:
"in": "$$el.a"
To get around this we are using $concat to join the string value with an empty string. Other operators can be used for different types, or even $ifNull, which would be fairly generic and gets around "type" problems:
"in": { "$ifNull": [ "$$el.a", false ] }
The $setIntersection that wraps these is used to determine which values of those "sets" are the same, and returns its result as another array containing only the matching values.
Finally the $size operator here is an aggregation method that returns the actual "size" of the array as an integer. So this can be used in the following $match to then filter out any results that did not return a "size" value of 1 or greater.
Essentially this does all the work that was done in four individual stages, where the first two are exponentially growing the number of documents to be processed, within two simple passes, all without increasing the number of documents that were received as input.
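For the two sample documents in the question, only the second survives the final $match: its obj1 "a" values [ "a" ] and its obj2 "b" values [ "a" ] intersect with size 1, while the first document's [ "a" ] and [ "b", "c" ] do not intersect at all. So the output is roughly:

{ "_id": <id of the second document>, "match": 1 }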