My use case is different. I am mapping it to users and orders for easier understanding.
For each user, I have to get the following:
For each department
For each order type
delivered count
unique orders
Unique order count means the user might have ordered the same product more than once, but it has to be counted only once (e.g., two rows with the same order id count as one unique order). I have the background logic for this and identify duplicates via duplicate order ids.
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {
        "_id": {"department": "$department", "order_type": "$order_type"},
        "del_count": {$sum: "$del_count"},
        "unique_order": {$addToSet: "$unique_order"}
    }},
    {$project: {
        "_id": 0,
        "department": "$_id.department",
        "order_type_name": "$_id.order_type",
        "unique_order_count": {$size: "$unique_order"},
        "del_count": "$del_count"
    }},
    {$group: {
        "_id": "$department",
        "order_types": {$addToSet: {
            "order_type_name": "$order_type_name",
            "unique_order_count": "$unique_order_count",
            "del_count": "$del_count"
        }}
    }}
])
Sorry for my query formatting.
This query is working absolutely fine. I added the second grouping stage to bring the results together for all order types of the same department.
Can I do the same with fewer pipeline stages, in a more efficient way?
The $project stage appears to be redundant, but removing it is more of a refactoring than a performance improvement. Your simplified pipeline can look like the below (the $match stage from your original query is kept so the aggregation still runs over a single user's orders):
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {
        "_id": {"department": "$department", "order_type": "$order_type"},
        "del_count": {$sum: "$del_count"},
        "unique_order": {$addToSet: "$unique_order"}
    }},
    {$group: {
        "_id": "$_id.department",
        "order_types": {$addToSet: {
            "order_type_name": "$_id.order_type",
            "unique_order_count": {$size: "$unique_order"},
            "del_count": "$del_count"
        }}
    }}
])
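For illustration, the result documents of this pipeline would be shaped roughly like the following (the department and order type values are invented, not taken from your data):
{ "_id": "electronics", "order_types": [
    { "order_type_name": "express", "unique_order_count": 4, "del_count": 7 },
    { "order_type_name": "standard", "unique_order_count": 2, "del_count": 3 }
] }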
I am performing an aggregate query that contains a match, group, project, and then a sort stage. I am wondering whether placing the text search block first vs. last within my match block makes any performance difference. I am currently going by the Robo 3T response time metric, am not noticing a difference, and wanted to confirm whether my observation holds true to the facts or not.
The query looks something like this:
db.COLLECTION.aggregate([
    {$match: {$text: {$search: 'xyz'}, ...}},
    {$group: {...}},
    {$project: {...}},
    {$sort: {...}}
])
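A more reliable check than Robo 3T wall-clock time is to compare the explain output for both orderings. A minimal sketch follows; the group and sort stages here are placeholders standing in for the elided parts of the pipeline, and older server versions may ignore the "executionStats" verbosity for aggregations:
db.COLLECTION.explain("executionStats").aggregate([
    {$match: {$text: {$search: 'xyz'}}},
    // placeholder stages; substitute your real group/sort stages
    {$group: {_id: "$category", total: {$sum: 1}}},
    {$sort: {total: -1}}
])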
Imagine you have a schema like:
[{
    name: "Bob",
    naps: [{
        time: ISODate("2019-05-01T15:35:00Z"),
        location: "sofa"
    }, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
    {$unwind: "$naps"},
    {$group: {_id: {$dayOfWeek: "$naps.time"}, napsOnDay: {$sum: 1}}}
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the time Date field could have been used. Why is this? How can I get mongo to use the index for a more optimal query?
Indexes store pointers to actual documents, and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match and $sort do not mutate the actual documents, and thus indexes can be used in these stages.
In contrast, $unwind, $group, and any other stages that change the actual document representation basically lose the connection between the index and the material documents.
Additionally, when those stages are processed without a preceding $match, you're basically saying that you want to process the whole collection, and there is no point in using an index if you want to process the whole collection.
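For example, putting an indexed $match ahead of the $unwind lets the planner use the naps.time index. A minimal sketch, assuming the collection is named people (the question doesn't name it) and that you only care about naps from May 2019:
db.people.aggregate([
    // This first $match can use the index on naps.time.
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    {$unwind: "$naps"},
    // Re-apply the filter after $unwind to drop array elements outside the range.
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    {$group: {_id: {$dayOfWeek: "$naps.time"}, napsOnDay: {$sum: 1}}}
])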
I need some help here.
I have two collections. The first collection isn't so big, with pretty small documents.
The second has many more items (thousands, could be much more) with medium-size documents.
There's a property in the first document that matches a property in the second document.
The relation here is that items in the first collection have many items in the second collection referencing them.
Say, for example, the first collection represents persons and the second one represents credit card transactions. A person may have many transactions.
PersonId is the id of documents in the persons collection and also appears on every transaction document in the transactions collection.
I want to write a query to count how many transactions each person has.
I've seen that it is recommended to use aggregate and $lookup.
But when I try that, I get a message that the document size exceeds the limit.
I'm guessing that this is because it aggregates a person with all of its transactions into one document... not sure; it's the first time I'm ever experimenting with MongoDB.
What would be the best approach to achieve that? Is the aggregate method the right choice?
Thanx!
Gili
You can use a simple grouping to get the transaction count for each person:
db.transactions.aggregate([
{ $group: {_id: "$personId", count: {$sum:1}}}
])
The output will contain person ids and the count of transactions for each person. You can then match people to their transaction stats in memory. You can also use $lookup to return transaction stats along with some person data. But keep in mind that you will not get an entry for people without transactions:
db.transactions.aggregate([
{ $group: {_id: "$personId", count: {$sum:1}}},
{ $lookup:
{
from: "people",
localField: "_id",
foreignField: "_id",
as: "person"
}
},
{$unwind: "$person"},
{$project:{name:"$person.name", count:"$count"}}
])
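If you also need entries for people with zero transactions, one option is to run the $lookup in the other direction, from people into transactions, and count the joined array. A sketch, assuming the collection and field names used above:
db.people.aggregate([
    {$lookup: {
        from: "transactions",
        localField: "_id",
        foreignField: "personId",
        as: "txns"
    }},
    // $size counts the joined transactions; people with none get 0.
    {$project: {name: 1, count: {$size: "$txns"}}}
])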
I'm trying to get data from one or more subdocuments but I don't know the name of the field that will hold the subdocument. Here are some examples of what the documents look like.
https://github.com/vz-risk/VCDB/blob/master/data/json/0C5DE044-B9B4-408D-9E65-D367EED12AB2.json
https://github.com/vz-risk/VCDB/blob/master/data/json/064F5887-C2DA-4139-B3AA-D55906F8C30A.json
I would like to get the action varieties for these incidents, so in the case of the first one I would like to get action.malware.variety and action.social.variety. In the second example it would be action.hacking.variety and action.malware.variety. So the problem is that I don't know what field is going to hold the subdocument. It could be one of hacking, malware, social, error, misuse, physical, and environmental.
So I would like to $unwind that field and do some stuff with the key name. Is this something that can be done with aggregation or do I need to switch over to mapReduce?
You seem to be talking about a case where you are not sure whether all of the hacking, social, or malware parts are there, right? I think you want to $project first, using the $ifNull operator, as in:
db.stuff.aggregate([
{$project:
{
hacking: {$ifNull: ["$action.hacking.variety",[null]]},
social: {$ifNull: ["$action.social.variety",[null]]},
malware: {$ifNull: ["$action.malware.variety",[null]]}
}
},
{$unwind: "$hacking"},
{$unwind: "$social"},
{$unwind: "$malware"}
])
That should give you documents with something in each of those fields. The same pattern applies to any of the other possible fields.
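If your MongoDB version is 3.4.4 or newer, the $objectToArray operator offers an alternative that handles the unknown key names directly instead of listing them all. A rough sketch, reusing the stuff collection name from the example above:
db.stuff.aggregate([
    // Turn the unknown keys under "action" into an array of {k, v} pairs.
    {$project: {actions: {$objectToArray: "$action"}}},
    {$unwind: "$actions"},
    // "$actions.k" is the field name (hacking, malware, social, ...) and
    // "$actions.v.variety" is that subdocument's variety array.
    {$project: {action_type: "$actions.k", variety: "$actions.v.variety"}}
])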
I have an interesting problem. I have a working M/R version of this, but it's not really a viable solution in a small-scale environment, since it's too slow and the query needs to be executed in real time.
I would like to iterate over each element in a collection, score it, sort descending, limit to the top 10, and return the results to the application.
Here is the function I'd like applied to each document, in pseudocode:
var score = 0;
document.Tags.forEach(function (tag) {
    score += someMap[tag];
});
return score;
Since your someMap is changing each time, I don't see any alternative other than to score all the documents and return the highest-scoring ones. Whatever method you adopt for this type of operation, you'll have to consider all the documents in the collection, which is going to be slow, and will become more and more costly as the collection you're scanning grows.
One issue with map reduce is that each mongod instance can only run one concurrent map reduce. This is a limitation of the javascript engine, which is single-threaded. Multiple map reduces will be interleaved, but they cannot run concurrently with one another. This means that if you're relying on map reduce for "real-time" uses, that is, if your web page has to run a map reduce to render, you'll eventually hit a limit where page load times become unacceptably slow.
You can work around this by querying all the documents into your application, and doing the scoring, sorting, and limiting in your application code. Queries in MongoDB can run concurrently, unlike map reduce, though of course this means that your application servers will have to do a lot of work.
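For concreteness, that client-side approach might look roughly like this in shell JavaScript (a sketch; the collection name foo is borrowed from the example below, and someMap is assumed to be available to the application):
var someMap = {"a": 5, "b": 2};  // assumed scoring map
var results = db.foo.find().toArray().map(function (doc) {
    var score = 0;
    (doc.tags || []).forEach(function (tag) {
        score += someMap[tag] || 0;
    });
    return {_id: doc._id, score: score};
});
// Sort descending by score and keep the top 10.
results.sort(function (a, b) { return b.score - a.score; });
var top10 = results.slice(0, 10);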
Finally, if you are willing to wait for MongoDB 2.2 to be released (which should be within a few months), you can use the new aggregation framework in place of map reduce. You'll have to massage the someMap to generate the correct pipeline steps. Here's an example of what this might look like if someMap were {"a": 5, "b": 2}:
db.runCommand({aggregate: "foo",
    pipeline: [
        {$unwind: "$tags"},
        {$project: {
            tag1score: {$cond: [{$eq: ["$tags", "a"]}, 5, 0]},
            tag2score: {$cond: [{$eq: ["$tags", "b"]}, 2, 0]}}
        },
        {$project: {score: {$add: ["$tag1score", "$tag2score"]}}},
        {$group: {_id: "$_id", score: {$sum: "$score"}}},
        {$sort: {score: -1}},
        {$limit: 10}
    ]})
This is a little complicated, and bears explaining:
First, we "unwind" the tags array, so that the following steps in the pipeline process documents where "tags" is a scalar -- the value of the tag from the array -- and all the other document fields (notably _id) are duplicated for each unwound element.
We use a projection operator to convert from tags to named score fields. The $cond/$eq expression for each roughly means (for the tag1score example) "if the value in the document's 'tags' field is equal to 'a', then return 5 and assign that value to the new field tag1score, else return 0 and assign that". This expression would be repeated for each tag/score combination in your someMap. At this point in the pipeline, each document will have N tagNscore fields, but at most one of them will have a non-zero value.
Next we use another projection operator to create a score field whose value is the sum of the tagNscore fields in the document.
Next we group the documents by their _id, and sum up the value of the score field from the previous step across all documents in each group.
We sort by score, descending (i.e. greatest scores first).
We limit to only the top 10 scores.
I'll leave it as an exercise to the reader how to convert someMap into the correct set of projections in step 2, and the correct set of fields to add in step 3.
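For what it's worth, a sketch of that exercise in shell JavaScript might look like the following; the buildPipeline helper and the tagNscore naming are illustrative assumptions, not part of the original answer:
// Build the aggregation pipeline from a someMap such as {"a": 5, "b": 2}.
function buildPipeline(someMap) {
    var projection = {};
    var scoreFields = [];
    var i = 1;
    for (var tag in someMap) {
        var field = "tag" + i + "score";
        // One $cond per tag/score pair, as in step 2 above.
        projection[field] = {$cond: [{$eq: ["$tags", tag]}, someMap[tag], 0]};
        scoreFields.push("$" + field);
        i++;
    }
    return [
        {$unwind: "$tags"},
        {$project: projection},
        // Step 3: sum the per-tag fields into a single score.
        {$project: {score: {$add: scoreFields}}},
        {$group: {_id: "$_id", score: {$sum: "$score"}}},
        {$sort: {score: -1}},
        {$limit: 10}
    ];
}
db.foo.aggregate(buildPipeline({"a": 5, "b": 2}))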
This is essentially the same set of steps that your application code or map reduce would go through, but it has two distinct advantages: unlike map reduce, the aggregation framework is fully implemented in C++, and is faster and more concurrent; and unlike querying all the documents into your application, the aggregation framework works with the data on the server side, saving network load. But like the other two approaches, it still has to consider every document, and can only limit the result set once the score has been calculated for all of them.