I am applying aggregation on a collection and I would like to group by more than one field. All the calculations in the pipeline are the same; I would just like to see the results grouped by different fields.
Possible values for the fields that I am using:
ageCategory -> 10, 20, 30, 40
sex -> Male, Female
type -> A, B, C, D, E
stage -> I, II, III, IV
This is how I am doing this now:
mongoose.connection.db.collection("collection").aggregate([
    { $match: { /* match conditions */ } },
    { $project: {
        ageCategory: 1,
        sex: 1,
        type: 1,
        stage: 1,
        // other fields
      }
    },
    { $match: { /* match conditions */ } },
    { $project: {
        ageCategory: 1,
        sex: 1,
        type: 1,
        stage: 1,
        // other fields
      }
    },
    {
      $group: {
        _id: "result",
        age10: { $sum: { $cond: [ /* condition for ageCategory 10 */, 1, 0] } },
        age20: { $sum: { $cond: [ /* condition for ageCategory 20 */, 1, 0] } },
        // other age categories
        male: { $sum: { $cond: [ /* condition for Male */, 1, 0] } },
        female: { $sum: { $cond: [ /* condition for Female */, 1, 0] } },
        typeA: { $sum: { $cond: [ /* condition for type A */, 1, 0] } },
        typeB: { $sum: { $cond: [ /* condition for type B */, 1, 0] } },
        // other conditions
      }
    }
]).toArray(function (err, result) {
    // final computations
});
Simplified representation of the data and the expected result (there are some calculations that happen in the match and project stages, which are ignored for simplicity):
[{
ageCategory: "10",
sex: "Male",
type: "A",
stage: "I",
sub:[
{}
],
//other sub documents that are used in the pipeline
},
{
ageCategory: "20",
sex: "Male",
type: "B",
stage: "I",
sub:[
{}
],
//other sub documents that are used in the pipeline
}]
Expected Result:
{
age10:1, //count of sub with ageCategory as 10
age20:1,
//other count by age. It is okay to ignore the ones with zero count.
male: 2,
typeA: 1,
typeB: 1,
stageI: 2
}
I am checking all the conditions in the group stage. I am not sure if this is the best way to do it. One option is to run this aggregation multiple times with the grouping applied to an individual field each time, but that causes performance issues and also repeats the same query.
I cannot use mapReduce because of performance reasons.
Is this the best way to do this, or are there any alternative approaches?
Based on the provided expected result, it's safe to say you want to compute totals. In that case you should group documents by null and not by "result", because we don't know what "result" might mean to Mongo in the future.
I think the problem with your question is that you use the term "group by", but what you actually mean is computing fields that hold the values of some accumulator expressions.
Well, the way you have done this seems OK to me (apart from the null/"result" thing).
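For illustration, here is a minimal sketch of what that single $group stage could look like, grouping by null and counting with $cond inside $sum (the field names and values are taken from the sample documents above; adjust the comparisons to your real data types):
{
    $group: {
        _id: null,
        age10:  { $sum: { $cond: [{ $eq: ["$ageCategory", "10"] }, 1, 0] } },
        age20:  { $sum: { $cond: [{ $eq: ["$ageCategory", "20"] }, 1, 0] } },
        // other age categories follow the same pattern
        male:   { $sum: { $cond: [{ $eq: ["$sex", "Male"] }, 1, 0] } },
        female: { $sum: { $cond: [{ $eq: ["$sex", "Female"] }, 1, 0] } },
        typeA:  { $sum: { $cond: [{ $eq: ["$type", "A"] }, 1, 0] } },
        stageI: { $sum: { $cond: [{ $eq: ["$stage", "I"] }, 1, 0] } }
        // ...and so on for the remaining values
    }
}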
{
id: 1,
name: "sree",
userId: "001",
paymentData: {
user_Id: "001",
amount: 200
}
},
{
id: 1,
name: "sree",
userId: "001",
paymentData: {
user_Id: "002",
amount: 200
}
}
I got this result after $unwind in my aggregation. Is there any way to check whether user_Id is equal to userId?
Are you looking to only retrieve the results when they are equal (meaning you want to filter out documents where the values are not the same) or are you looking to add a field indicating whether the two are equal?
In either case, you append subsequent stage(s) to the aggregation pipeline to achieve your desired result. If you want to filter the documents, the new stage may be:
{
$match: {
$expr: {
$eq: [
"$userId",
"$paymentData.user_Id"
]
}
}
}
See how it works in this playground example.
If instead you want to add a field that compares the two values, then this stage may be what you are looking for:
{
$addFields: {
isEqual: {
$eq: [
"$userId",
"$paymentData.user_Id"
]
}
}
}
See how it works in this playground example.
You could also combine the two as in:
{
$addFields: {
isEqual: {
$eq: [
"$userId",
"$paymentData.user_Id"
]
}
}
},
{
$match: {
isEqual: true
}
}
Playground demonstration here
I have an Order collection with records looking like this:
{
"_id": ObjectId,
"status": String Enum,
"products": [{
"sku": String UUID,
...
}, ...],
...
},
My goal is to find what products users buy together. Given an SKU, I would like to browse past orders and find, for orders that contain more than 1 product AND of course the product with the looked-up SKU, what other products were bought along with it.
So I created an aggregation pipeline that works:
[
// exclude cancelled orders
{
'$match': {
'status': {
'$nin': [
'CANCELLED', 'CHECK_OUT'
]
}
}
},
// add fields with the number of products and just the product skus
{
'$addFields': {
'size': {
'$size': '$products'
},
'skus': '$products.sku'
}
},
// limit to orders with 2 products or more including the looked up SKU
{
'$match': {
'size': {
'$gte': 2
},
'skus': {
'$elemMatch': {
'$eq': '3516215049767'
}
}
}
},
// group by skus
{
'$unwind': {
'path': '$skus'
}
}, {
'$group': {
'_id': '$skus',
'count': {
'$sum': 1
}
}
},
// sort by count, exclude the looked up sku, limit to 4 results
{
'$sort': {
'count': -1
}
}, {
'$match': {
'_id': {
'$ne': '3516215049767'
}
}
}, {
'$limit': 4
}
]
Although this works, this collection contains more than 10K docs and I have an alert on my MongoDB instance telling me that the ratio Scanned Objects / Returned has gone above 1000.
So my question is: how can my query be improved, and what indexes can I add to improve it?
db.Orders.stats();
{
size: 14329835,
count: 10571,
avgObjSize: 1355,
storageSize: 4952064,
freeStorageSize: 307200,
capped: false,
nindexes: 2,
indexBuilds: [],
totalIndexSize: 466944,
totalSize: 5419008,
indexSizes: { _id_: 299008, status_1__created_at_1: 167936 },
scaleFactor: 1,
ok: 1,
operationTime: Timestamp({ t: 1635415716, i: 1 })
}
Let's start with rewriting the query a little bit to make it more efficient.
Currently you're matching all the orders with a certain status, and only after that do you start the data manipulations; this means every single stage is doing work on a larger-than-needed data set.
What we can do is move all the query conditions into the first stage. This is made possible using Mongo's dot notation, like so:
{
'$match': {
'status': {
'$nin': [
'CANCELLED', 'CHECK_OUT',
],
},
'products.sku': '3516215049767', // mongo allows you to do this using the dot notation.
'products.1': { $exists: true }, // this requires the array to have at least two elements.
},
},
Now this achieves two things:
We start the pipeline with only the relevant results, so there is no need to calculate the $size of the array for lots of irrelevant documents anymore. This alone will boost your performance greatly.
Now we can create a compound index that will support this specific query. Before, we couldn't do that, as index usage is limited to the first stage, and that stage only included the status field. (As an aside, Mongo actually does optimize pipelines, but in this specific case no optimization was possible due to the usage of $addFields.)
The index that I recommend building is:
{ status: 1, "products.sku": 1 }
This will allow the best match to start off your pipeline.
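For reference, the index could be created from the shell like this (assuming the collection is called Orders, as in the stats output above):
db.Orders.createIndex({ status: 1, "products.sku": 1 })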
Suppose I want to sort the data based on the current city first and then the remaining country data. Is there any way I can achieve that in MongoDB?
Example
[
{ id: 2, name: 'sdf' },
{ id: 3, name: 'sfs' },
{ id: 3, name: 'aaa' },
{ id: 1, name: 'dsd' },
];
What I want as an outcome is the data with id 3 first and then the remaining ones, like:
[
{ id: 3, name: 'sfs' },
{ id: 3, name: 'aaa' },
{ id: 1, name: 'dsd' },
{ id: 2, name: 'sdf' },
];
It's just an example.
My actual requirement is to sort the data based on a certain category first and then the remaining ones.
There is no built-in conditional sort in a plain MongoDB find query, but you could first fetch the documents from the db and then sort them in JavaScript (or whatever other language you're using to present the data).
On a side note, having duplicate values in the "id" field is not good practice and defies the definition of an id itself.
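For example, a rough client-side sketch (assuming the fetched documents are in an array named docs, which is an assumption for illustration):
// Put documents with id 3 first, then sort the rest by id ascending.
docs.sort((a, b) => {
    const aFirst = a.id === 3 ? 0 : 1;
    const bFirst = b.id === 3 ? 0 : 1;
    return aFirst - bFirst || a.id - b.id;
});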
There is no straightforward way to sort conditionally in MongoDB, but as per your example you can try an aggregation query:
$facet to produce separate results for both types of documents:
first, to get the id: 3 documents
second, to get the documents where id is not 3, sorted by id in ascending order
$project and $concatArrays to concatenate both arrays in sequence
$unwind to deconstruct the combined array
$replaceRoot to promote each object to the root
db.collection.aggregate([
{
$facet: {
first: [
{ $match: { id: 3 } }
],
second: [
{ $match: { id: { $ne: 3 } } },
{ $sort: { id: 1 } }
]
}
},
{
$project: {
all: { $concatArrays: ["$first", "$second"] }
}
},
{ $unwind: "$all" },
{ $replaceRoot: { newRoot: "$all" } }
])
Playground
I have a document {id: 1, data: [{'name': 'Bob', 'counter': 1}, {'name': 'Jack', 'counter': 1}]}
What I'm expecting:
query:
db.inventory.update(
{ _id: 1 },
{ $addToSet: { data: { $each: [{'name': 'Bob'}, {'name':'Jack'}, {'name':'John'}] } } }
)
result:
{id:1, data: [{'name': 'Bob', 'counter':2}, {'name':'Jack', 'counter':2}, {'name':'John', 'counter':1}]}
You won't be able to do this in a single query. The $addToSet operator will only work if the element being added and the element that exists are an exact match. Instead, you'll need to do it in multiple parts:
// Insert element for Bob if it doesn't exist.
db.inventory.update(
{
_id: 1,
"data.name": {$ne: "Bob"}
},
{
$push: {
data: {
name: "Bob",
counter: 0 // Initialized to 0 so that the first increment results in the expected value of 1.
}
}
}
)
// Increment the counter for Bob.
db.inventory.update(
{
_id: 1,
"data.name": "Bob"
},
{
$inc: {
"data.$.counter": 1
}
}
)
// Repeat as necessary for each element you wish to insert.
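If you have to repeat that pair of updates for several names, one option (just a sketch, using the names from your example) is to batch them with bulkWrite, which executes its operations in order by default:
db.inventory.bulkWrite([
    // Insert the element for John if it doesn't exist yet, with counter initialized to 0.
    {
        updateOne: {
            filter: { _id: 1, "data.name": { $ne: "John" } },
            update: { $push: { data: { name: "John", counter: 0 } } }
        }
    },
    // Then increment the counter for John.
    {
        updateOne: {
            filter: { _id: 1, "data.name": "John" },
            update: { $inc: { "data.$.counter": 1 } }
        }
    }
    // ...repeat the same pair for each name.
])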
This is simply a limitation that you need to work around with your existing document structure. If you modify your document structure such that data is a nested sub-document with each name being a field within that sub-document, you could make this work with a single query:
// Version 1: flat value.
db.inventory.update(
{ _id: 1 },
{ $inc: {
"data.Bob": 1,
"data.Jack": 1
}}
)
/*
Document will look like this:
{
_id: 1,
data: {
Bob: 2,
Jack: 2,
John: 1
}
}
*/
// Version 2: nested sub-document.
db.inventory.update(
{ _id: 1 },
{ $inc: {
"data.Bob.counter": 1,
"data.Jack.counter": 1
}}
)
/*
Document will look like this:
{
_id: 1,
data: {
Bob: {
counter: 2
},
Jack: {
counter: 2
},
John: {
counter: 1
}
}
}
*/
Be warned, however, that you will not be able to index data effectively if you go with this solution, so querying efficiently on e.g. all documents containing elements with data.$.counter > 1 simply will not be possible.
The trade-offs are yours to consider. You can either have efficient updates or efficient lookups, but having both is unlikely to happen. I would personally recommend updating each element individually, but you will know your program's needs far better than I will.
I am implementing a small application using mongodb as a backend. In this application I have a data structure where the documents will contain a field that contains an array of subdocuments.
I use the following use case as a basis:
http://docs.mongodb.org/manual/use-cases/inventory-management/
As you can see from the example, each document has a field (items in the example below) that is an array of subdocuments.
{
_id: 42,
last_modified: ISODate("2012-03-09T20:55:36Z"),
status: 'active',
items: [
{ sku: '00e8da9b', qty: 1, item_details: {...} },
{ sku: '0ab42f88', qty: 4, item_details: {...} }
]
}
This fits me perfect, except for one problem:
I want to count each unique item (with "sku" as the unique identifier key) in the entire collection, where each document increments the count by 1 (multiple instances of the same "sku" in the same document still only count as 1). E.g. I would like this result:
{ sku: '00e8da9b', doc_count: 1 },
{ sku: '0ab42f88', doc_count: 9 }
After reading up on MongoDB, I am quite confused about how to do this (fast) when you have a complex schema as described above. If I have understood the otherwise excellent documentation correctly, such an operation may be achieved using either the aggregation framework or the map/reduce framework, but this is where I need some input:
Which framework would be better suited to achieve the result I am looking for, given the complexity of the structure?
What kind of indexes would be preferred in order to gain the best possible performance out of the chosen framework?
MapReduce is slow, but it can handle very large data sets. The Aggregation framework on the other hand is a little quicker, but will struggle with large data volumes.
The trouble with the structure shown is that you need to "$unwind" the arrays to crack open the data. This means creating a new document for every array item, and with the aggregation framework it needs to do this in memory. So if you have 1000 documents with 100 array elements, it will need to build a stream of 100,000 documents in order to group and count them.
You might want to consider whether there's a schema layout that will serve your queries better, but if you want to do it with the Aggregation framework, here's how you could do it (with some sample data so the whole script will drop into the shell):
db.so.remove();
db.so.ensureIndex({ "items.sku": 1}, {unique:false});
db.so.insert([
{
_id: 42,
last_modified: ISODate("2012-03-09T20:55:36Z"),
status: 'active',
items: [
{ sku: '00e8da9b', qty: 1, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
]
},
{
_id: 43,
last_modified: ISODate("2012-03-09T20:55:36Z"),
status: 'active',
items: [
{ sku: '00e8da9b', qty: 1, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
]
},
]);
db.so.runCommand("aggregate", {
pipeline: [
{ // optional filter to exclude inactive elements - can be removed
// you'll want an index on this if you use it too
$match: { status: "active" }
},
// unwind creates a doc for every array element
{ $unwind: "$items" },
{
$group: {
// group by unique SKU, but you only wanted to count a SKU once per doc id
_id: { _id: "$_id", sku: "$items.sku" },
}
},
{
$group: {
// group by unique SKU, and count them
_id: { sku:"$_id.sku" },
doc_count: { $sum: 1 },
}
}
]
//,explain:true
})
Note that I've $group'd twice, because you said that an SKU can only count once per document, so we need to first sort out the unique doc/sku pairs and then count them up.
If you want the output formatted a little differently (in other words, EXACTLY like in your sample), we can $project it.
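For example, appending a final stage along these lines (just a sketch) would drop the wrapping _id and match the sample output:
{
    $project: {
        _id: 0,
        sku: "$_id.sku",
        doc_count: 1
    }
}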
With the latest mongo build (it may be true for other builds too), I've found that a slightly different version of cirrus's answer performs faster and consumes less memory. I don't know the details of why; it seems that with this version Mongo somehow has more opportunity to optimize the pipeline.
db.so.runCommand("aggregate", {
pipeline: [
{ $unwind: "$items" },
{
$group: {
// create array of unique sku's (or set) per id
_id: { id: "$_id"},
sku: {$addToSet: "$items.sku"}
}
},
// unroll all sets
{ $unwind: "$sku" },
{
$group: {
// then count unique values per each Id
_id: { id: "$_id.id", sku:"$sku" },
count: { $sum: 1 },
}
}
]
})
To match exactly the same format as asked for in the question, the grouping by "_id" should be skipped in the final stage.
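In other words, the last $group stage would become something like this (a sketch of the adjusted stage):
{
    $group: {
        // count, for each sku, the number of documents that contained it at least once
        _id: { sku: "$sku" },
        count: { $sum: 1 }
    }
}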