MongoDB aggregate on subdocument in array - mongodb

I am implementing a small application using MongoDB as a backend. In this application I have a data structure where the documents contain a field holding an array of subdocuments.
I use the following use case as a basis:
http://docs.mongodb.org/manual/use-cases/inventory-management/
As you can see from the example, each document has a field (called carted in the use case, items below) which is an array of subdocuments.
{
    _id: 42,
    last_modified: ISODate("2012-03-09T20:55:36Z"),
    status: 'active',
    items: [
        { sku: '00e8da9b', qty: 1, item_details: {...} },
        { sku: '0ab42f88', qty: 4, item_details: {...} }
    ]
}
This fits me perfectly, except for one problem:
I want to count each unique item (with "sku" as the unique identifier key) across the entire collection, where each document containing the item increments the count by 1 (multiple instances of the same "sku" in the same document still count only once). E.g. I would like this result:
{ sku: '00e8da9b', doc_count: 1 },
{ sku: '0ab42f88', doc_count: 9 }
After reading up on MongoDB, I am quite confused about how to do this (fast) when you have a complex schema as described above. If I have understood the otherwise excellent documentation correctly, such an operation may be achieved using either the aggregation framework or the map/reduce framework, but this is where I need some input:
Which framework would be better suited to achieve the result I am looking for, given the complexity of the structure?
What kind of indexes would be preferred in order to gain the best possible performance out of the chosen framework?

MapReduce is slow, but it can handle very large data sets. The Aggregation framework on the other hand is a little quicker, but will struggle with large data volumes.
The trouble with the structure shown is that you need to "$unwind" the arrays to crack open the data. This means creating a new document for every array element, and with the aggregation framework it needs to do this in memory. So if you have 1,000 documents with 100 array elements each, it needs to build a stream of 100,000 documents in order to $group and count them.
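For illustration, after { $unwind: "$items" } the first sample document above becomes one document per array element, with the other fields carried along unchanged (last_modified omitted here for brevity):
{ _id: 42, status: 'active', items: { sku: '00e8da9b', qty: 1, item_details: {...} } }
{ _id: 42, status: 'active', items: { sku: '0ab42f88', qty: 4, item_details: {...} } }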
You might want to consider seeing if there's a schema layout that will serve your queries better, but if you want to do it with the Aggregation framework, here's how you could do it (with some sample data so the whole script will drop into the shell):
db.so.remove();
db.so.ensureIndex({ "items.sku": 1 }, { unique: false });
db.so.insert([
    {
        _id: 42,
        last_modified: ISODate("2012-03-09T20:55:36Z"),
        status: 'active',
        items: [
            { sku: '00e8da9b', qty: 1, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} }
        ]
    },
    {
        _id: 43,
        last_modified: ISODate("2012-03-09T20:55:36Z"),
        status: 'active',
        items: [
            { sku: '00e8da9b', qty: 1, item_details: {} },
            { sku: '0ab42f88', qty: 4, item_details: {} }
        ]
    }
]);
db.so.runCommand("aggregate", {
    pipeline: [
        {
            // optional filter to exclude inactive elements - can be removed
            // you'll want an index on this if you use it too
            $match: { status: "active" }
        },
        // unwind creates a doc for every array element
        { $unwind: "$items" },
        {
            $group: {
                // group by unique SKU, but you only wanted to count a SKU once per doc id
                _id: { _id: "$_id", sku: "$items.sku" }
            }
        },
        {
            $group: {
                // group by unique SKU, and count them
                _id: { sku: "$_id.sku" },
                doc_count: { $sum: 1 }
            }
        }
    ]
    //, explain: true
})
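A side note for anyone running this today: since MongoDB 3.6 the raw aggregate command requires a cursor argument, so the runCommand form above is rejected on modern servers. The equivalent shell helper handles this for you (likewise, current shells spell ensureIndex/insert as createIndex/insertMany):
db.so.aggregate([
    { $match: { status: "active" } },
    { $unwind: "$items" },
    { $group: { _id: { _id: "$_id", sku: "$items.sku" } } },
    { $group: { _id: { sku: "$_id.sku" }, doc_count: { $sum: 1 } } }
])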
Note that I've $group'd twice: because you said that a SKU can only count once per document, we first need to sort out the unique doc/SKU pairs and then count them up.
If you want the output a little different (in other words, EXACTLY like in your sample) we can $project them.
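For example, a final stage along these lines (a sketch, appended after the second $group above) flattens the group key into the requested shape:
{
    $project: {
        _id: 0,               // drop the composite group key
        sku: "$_id.sku",      // lift the SKU to a top-level field
        doc_count: 1          // keep the count as-is
    }
}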

With the latest mongo build (it may be true for other builds too), I've found that a slightly different version of cirrus's answer performs faster and consumes less memory. I don't know the details of why; it seems that with this version mongo has more opportunity to optimize the pipeline.
db.so.runCommand("aggregate", {
    pipeline: [
        { $unwind: "$items" },
        {
            $group: {
                // create an array (set) of unique SKUs per id
                _id: { id: "$_id" },
                sku: { $addToSet: "$items.sku" }
            }
        },
        // unroll all sets
        { $unwind: "$sku" },
        {
            $group: {
                // then count unique values per each id
                _id: { id: "$_id.id", sku: "$sku" },
                count: { $sum: 1 }
            }
        }
    ]
})
To match exactly the format asked for in the question, the grouping by "_id" should be skipped; the final $group should then use the SKU alone.
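A sketch of that adjusted final stage (replacing the last $group above):
{
    $group: {
        _id: "$sku",            // group on the SKU alone
        doc_count: { $sum: 1 }  // one per document, since $addToSet already deduplicated per doc
    }
}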

Related

MongoDB $group by id where id name varies (conditionals?)

I need to group all entries by id using $group. However, the id is under different objects, so it is $student.id for some and $teacher.id for others. I've tried $cond, $is, and any other conditional I could find, but haven't had any luck. What I'd want is something like this:
lessons.aggregate([
    // matches, lookups, etc.
    { $group: {
        "_id": {
            "id": (if student exists "student.id", else if teacher exists "teacher.id")
        },
        // other fields
    }}
]);
How can I do this? I've scoured the MongoDB docs for hours, yet nothing works. I'm new to this company and trying to debug something, so I'm not familiar with the tech yet; apologies if this is rudimentary stuff!
Update: providing some sample data to demo what I want, shortened from the real thing to fit the question. After all the matches, lookups, etc., and before using $group, the data looks like this. As the student._id of the first and second objects is the same, I want them to be grouped.
{
    student: {
        _id: new ObjectId("61dc0fce904d07184b461c03"),
        name: "Jess W"
    },
    duration: 30
},
{
    student: {
        _id: new ObjectId("61dc0fce904d07184b461c03"),
        name: "Jess W"
    },
    duration: 30
},
{
    teacher: {
        _id: new ObjectId("61dc0f6a904d07184b461be7"),
        name: "Michael S"
    },
    duration: 30
},
{
    teacher: {
        _id: new ObjectId("61dc1087904d07184b461c6a"),
        name: "Andrew J"
    },
    duration: 30
},
If the fields are mutually exclusive, then you can simply combine them:
{ $group: { _id: { student: "$student.id", teacher: "$teacher.id" } /* other fields */ } }
Concatenating them should also work:
{ $group: { _id: { $concat: [ "$student.id", "$teacher.id" ] } /* other fields */ } }
You can just use $ifNull and chain them, like so:
db.collection.aggregate([
    {
        $group: {
            _id: {
                "$ifNull": [
                    "$student._id",
                    {
                        $ifNull: [
                            "$teacher._id",
                            "$archer._id"
                        ]
                    }
                ]
            }
        }
    }
])
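As a side note, starting in MongoDB 4.4, $ifNull accepts more than two input expressions, so on a 4.4+ server the nested chain above can be flattened to:
_id: { $ifNull: [ "$student._id", "$teacher._id", "$archer._id" ] }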

Optimize indexes in MongoDB

I have an Order collection with records looking like this:
{
    "_id": ObjectId,
    "status": String Enum,
    "products": [{
        "sku": String UUID,
        ...
    }, ...],
    ...
},
My goal is to find what products users buy together. Given an SKU, I would like to browse past orders and find, for orders that contain more than one product AND (of course) the product with the looked-up SKU, what other products were bought along with it.
So I created an aggregation pipeline that works:
[
    // exclude cancelled orders
    {
        '$match': {
            'status': {
                '$nin': [
                    'CANCELLED', 'CHECK_OUT'
                ]
            }
        }
    },
    // add fields with the product count and just the products' SKUs
    {
        '$addFields': {
            'size': {
                '$size': '$products'
            },
            'skus': '$products.sku'
        }
    },
    // limit to orders with 2 products or more, including the looked-up SKU
    {
        '$match': {
            'size': {
                '$gte': 2
            },
            'skus': {
                '$elemMatch': {
                    '$eq': '3516215049767'
                }
            }
        }
    },
    // group by SKUs
    {
        '$unwind': {
            'path': '$skus'
        }
    }, {
        '$group': {
            '_id': '$skus',
            'count': {
                '$sum': 1
            }
        }
    },
    // sort by count, exclude the looked-up SKU, limit to 4 results
    {
        '$sort': {
            'count': -1
        }
    }, {
        '$match': {
            '_id': {
                '$ne': '3516215049767'
            }
        }
    }, {
        '$limit': 4
    }
]
Although this works, this collection contains more than 10K docs, and I have an alert on my MongoDB instance telling me that the ratio of Scanned Objects / Returned has gone above 1000.
So my question is: how can my query be improved, and what indexes can I add to improve it?
db.Orders.stats();
{
    size: 14329835,
    count: 10571,
    avgObjSize: 1355,
    storageSize: 4952064,
    freeStorageSize: 307200,
    capped: false,
    nindexes: 2,
    indexBuilds: [],
    totalIndexSize: 466944,
    totalSize: 5419008,
    indexSizes: { _id_: 299008, status_1__created_at_1: 167936 },
    scaleFactor: 1,
    ok: 1,
    operationTime: Timestamp({ t: 1635415716, i: 1 })
}
Let's start by rewriting the query a little bit to make it more efficient.
Currently you're matching all the orders with a certain status, and only after that do you start with data manipulations; this means every single stage is doing work on a larger-than-needed data set.
What we can do is move all the queries into the first stage, this is made possible using Mongo's dot notation, like so:
{
    '$match': {
        'status': {
            '$nin': [
                'CANCELLED', 'CHECK_OUT',
            ],
        },
        'products.sku': '3516215049767', // Mongo allows you to do this using the dot notation.
        'products.1': { $exists: true }, // this requires the array to have at least two elements.
    },
},
Now this achieves two things:
We start the pipeline with only relevant results; there is no need anymore to calculate the $size of the array for many irrelevant documents. This alone will boost your performance greatly.
Now we can create a compound index that will support this specific query. Before, we couldn't do that, as index usage is limited to the first stage, and that only included the status field. (Just as an anecdote: Mongo actually does optimize pipelines, but in this specific case no optimization was possible due to the usage of $addFields.)
The index that I recommend building is:
{ status: 1, "products.sku": 1 }
This will allow the best match to start off your pipeline.
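For concreteness, a minimal sketch of building that index and checking that the first $match uses it (the collection name is taken from the stats output above; explain output shape may vary by server version):
db.Orders.createIndex({ status: 1, "products.sku": 1 });
db.Orders.explain("executionStats").aggregate([ /* the rewritten pipeline above */ ]);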

Sort data based on given id first

Suppose I want to sort the data based on the current city first, followed by the remaining country data. Is there any way I can achieve that in MongoDB?
Example
[
    { id: 2, name: 'sdf' },
    { id: 3, name: 'sfs' },
    { id: 3, name: 'aaa' },
    { id: 1, name: 'dsd' },
];
What I want as an outcome is the documents with id 3 first, followed by the remaining ones, like:
[
    { id: 3, name: 'sfs' },
    { id: 3, name: 'aaa' },
    { id: 1, name: 'dsd' },
    { id: 2, name: 'sdf' },
];
It's just an example; my actual requirement is to sort the data based on a certain category first and then the remaining ones.
It's not possible within MongoDB, but you could first fetch the documents from the db and then sort them in JavaScript (or whatever other language you're using to present the data).
On a side note, having duplicate values in the "id" field is not good practice and defies the definition of an id itself.
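If you go this route, a minimal client-side sketch (assuming the Node.js driver, inside an async function; the collection name is hypothetical and the hard-coded 3 stands in for whatever id should sort first):
const docs = await db.collection("cities").find({}).toArray();
docs.sort((a, b) =>
    (a.id === 3 ? 0 : 1) - (b.id === 3 ? 0 : 1) // id 3 first...
    || a.id - b.id                              // ...then the rest ascending by id
);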
There is no straightforward way to sort conditionally in MongoDB, but as per your example you can try an aggregation query:
$facet to separate results for both types of documents:
first, to get the id: 3 documents
second, to get the documents whose id is not 3, sorted by id in ascending order
$project and $concatArrays to concatenate both arrays in sequence
$unwind to deconstruct the combined array
$replaceRoot to promote each object to the root
db.collection.aggregate([
    {
        $facet: {
            first: [
                { $match: { id: 3 } }
            ],
            second: [
                { $match: { id: { $ne: 3 } } },
                { $sort: { id: 1 } }
            ]
        }
    },
    {
        $project: {
            all: { $concatArrays: ["$first", "$second"] }
        }
    },
    { $unwind: "$all" },
    { $replaceRoot: { newRoot: "$all" } }
])
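An alternative worth noting, separate from the $facet answer above: compute a temporary sort key with $addFields and sort on it, which does the conditional ordering in a single pass:
db.collection.aggregate([
    // 0 for the preferred id, 1 for everything else
    { $addFields: { _first: { $cond: [{ $eq: ["$id", 3] }, 0, 1] } } },
    { $sort: { _first: 1, id: 1 } },
    { $project: { _first: 0 } } // drop the helper field again
])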

How to return a specific element of an array in a document?

I have a collection containing documents set up as follows:
{
    _id: 1,
    name: { first: 'John', last: 'Doe' },
    tools: [ 'Tool1', 'Tool2', 'Tool3' ],
    skills: [
        { type: 'carpentry', years: 3 },
        { type: 'plumbing', year: 5 },
        { type: 'electrical', year: 8 }
    ]
}
I need to write a script that can search each document in the collection and return the value of a specific skill, for example: find the number of years John Doe has in plumbing.
Since I don't need the full document, db.table.find({skills: {$elemMatch: {type:'plumbing'}}}) feels unnecessary and would still require me to search the document to find the value I'm looking for. Is there a way to return just the part of the document I'm looking for?
The desired output would be {type: 'plumbing', year: 5} so that I could then manipulate that data into another field in the document.
Try this:
db.collection.aggregate([
    {
        "$unwind": "$skills"
    },
    {
        "$match": {
            "skills.type": "plumbing"
        }
    },
    {
        "$project": {
            skills: 1
        }
    }
])
Or, if you only want the year, narrow the $project stage to just that field.
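A sketch of that variant (not the original linked playground; note that the question's sample stores the plumbing entry under "year", not "years"):
db.collection.aggregate([
    { "$unwind": "$skills" },
    { "$match": { "skills.type": "plumbing" } },
    { "$project": { "_id": 0, "year": "$skills.year" } }
])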

MongoDB Aggregation group by on more than one field

I am applying aggregation on a collection, and I would like to group by more than one field. All the calculations are the same in the pipeline; I would just like to see the results grouped by different fields.
Possible values for the fields that I am using:
ageCategory -> 10, 20, 30, 40
sex -> Male, Female
type -> A, B, C, D, E
stage -> I, II, III, IV
This is how I am doing this now:
mongoose.connection.db.collection("collection").aggregate([
    { $match: { /* match conditions */ } },
    {
        $project: {
            ageCategory: 1,
            sex: 1,
            type: 1,
            stage: 1,
            // other fields
        }
    },
    { $match: { /* match conditions */ } },
    {
        $project: {
            ageCategory: 1,
            sex: 1,
            type: 1,
            stage: 1,
            // other fields
        }
    },
    {
        $group: {
            _id: "result",
            age10: { $sum: { $cond: [ /* condition for ageCategory 10 */, 1, 0 ] } },
            age20: { $sum: { /* condition for ageCategory 20 */ } },
            // other age categories
            male: { $sum: { /* condition for male */ } },
            female: { $sum: { /* condition for female */ } },
            typeA: { $sum: { /* condition for type A */ } },
            typeB: { $sum: { /* condition for type B */ } },
            // other conditions
        }
    }
]).toArray(function (err, result) {
    // final computations
});
A simplified representation of the data and the expected result (some calculations happen in the match and project stages, which are ignored for simplicity):
[{
    ageCategory: "10",
    sex: "Male",
    type: "A",
    stage: "I",
    sub: [
        {}
    ],
    // other subdocuments that are used in the pipeline
},
{
    ageCategory: "20",
    sex: "Male",
    type: "B",
    stage: "I",
    sub: [
        {}
    ],
    // other subdocuments that are used in the pipeline
}]
Expected Result:
{
    age10: 1, // count of docs with ageCategory 10
    age20: 1,
    // other counts by age. It is okay to ignore the ones with zero count.
    male: 2,
    typeA: 1,
    typeB: 1,
    stageI: 2
}
I am checking all the conditions in the group stage. I am not sure if this is the best way to do it. One option is to run this aggregation multiple times with the group applied to an individual field each time, but that causes performance issues and also repetition of the same query.
I cannot use mapReduce for performance reasons.
Is this the best way to do this, or are there alternative approaches?
Based on the provided expected result, it's safe to say you want to get totals. In that case you should group documents by null and not "result", because we don't know what "result" might mean for Mongo in the future.
I think the problem with your question is that you use the term "group by", but in fact you mean computing fields that hold the values of some accumulator expressions.
Well, the way you have done this seems OK to me (apart from the null/"result" thing).
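To make that concrete, a minimal sketch of the corrected $group stage (only two accumulators shown; the rest follow the same $cond pattern):
{
    $group: {
        _id: null, // totals over the whole matched set
        age10: { $sum: { $cond: [{ $eq: ["$ageCategory", "10"] }, 1, 0] } },
        male: { $sum: { $cond: [{ $eq: ["$sex", "Male"] }, 1, 0] } }
        // ...other categories follow the same pattern
    }
}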