MongoDB aggregate function: $sort not working with $sample? - mongodb

I'm currently failing to build a MongoDB query that first sorts and then outputs one sample. Documents in the database look like this:
{
    "_id" : "cynthia",
    "crawled" : false,
    "priority" : 2.0
}
I'm trying to achieve the following: Get me one random element with the highest priority.
I tested it with the following query:
db.getCollection('profiles').aggregate([
    {$match: {crawled: false}},
    {$sort: {priority: -1}},
    {$sample: {size: 1}}
])
Unfortunately, this is not working. Mongo seems to completely ignore the $sort; I see no difference between running the pipeline with $sort and without it.
Does anybody with more MongoDB experience have an idea about this? If you have an idea for a better implementation of the "priority" feature, just tell me.
Every idea is highly appreciated.

$sample is not what you're looking for. According to the docs:
Randomly selects the specified number of documents from its input.
So you'll get one random document from your filtered set of documents.
$limit is what you need, since it passes along the first n documents from the previous stage. Your pipeline should look like this:
db.profiles.aggregate([
    {$match: {crawled: false}},
    {$sort: {priority: -1}},
    {$limit: 1}
])
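As a sanity check, the effect of this pipeline can be simulated in plain JavaScript on made-up in-memory documents (illustrative only, not shell code):

```javascript
// Hypothetical documents mirroring the shape in the question.
const profiles = [
  { _id: "cynthia", crawled: false, priority: 2.0 },
  { _id: "alex",    crawled: false, priority: 5.0 },
  { _id: "marta",   crawled: true,  priority: 9.0 }, // dropped by the match
];

const result = profiles
  .filter(doc => doc.crawled === false)     // $match: {crawled: false}
  .sort((a, b) => b.priority - a.priority)  // $sort: {priority: -1}
  .slice(0, 1);                             // $limit: 1
```

With these documents, the $sample version could return either of the two uncrawled profiles, while $limit after the descending sort always returns the highest-priority one.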

Related

Mongodb optimal query

My use case is different. I am mapping it to users and orders for easier understanding.
I have to get the following for a user
For each department:
    For each order type:
        - delivered count
        - unique orders
By unique order count I mean that a user might have ordered the same product multiple times, but it should be counted only once. I already have the background logic for this and identify duplicates via order ids.
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {"_id": {"department": "$department", "order_type": "$order_type"},
              "del_count": {$sum: "$del_count"},
              "unique_order": {$addToSet: {"unique_order": "$unique_order"}}}},
    {$project: {"_id": 0,
                "department": "$_id.department",
                "order_type_name": "$_id.order_type",
                "unique_order_count": {$size: "$unique_order"},
                "del_count": "$del_count"}},
    {$group: {"_id": "$department",
              "order_types": {$addToSet:
                  {"order_type_name": "$order_type_name",
                   "unique_order_count": "$unique_order_count",
                   "del_count": "$del_count"}}}}
])
Sorry for my query formatting.
This query is working absolutely fine. I added the second grouping to bring the responses together for all order types of the same department.
Can I do the same with fewer pipeline stages, in a more efficient way?
The $project stage appears to be redundant, but removing it is more of a refactoring than a performance improvement. Your simplified pipeline can look like this:
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {"_id": {"department": "$department", "order_type": "$order_type"},
              "del_count": {$sum: "$del_count"},
              "unique_order": {$addToSet: {"unique_order": "$unique_order"}}}},
    {$group: {"_id": "$_id.department",
              "order_types": {$addToSet:
                  {"order_type_name": "$_id.order_type",
                   "unique_order_count": {$size: "$unique_order"},
                   "del_count": "$del_count"}}}}
])
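The grouping logic itself can be sketched in plain JavaScript on made-up orders, which makes the role of each $group stage easier to follow (illustrative only):

```javascript
// Hypothetical order documents; "o1" appears twice to exercise $addToSet.
const orders = [
  { department: "food", order_type: "web",   del_count: 1, unique_order: "o1" },
  { department: "food", order_type: "web",   del_count: 2, unique_order: "o1" },
  { department: "food", order_type: "phone", del_count: 1, unique_order: "o2" },
];

// First $group: key = (department, order_type); sum del_count and
// collect unique_order values into a set ($addToSet deduplicates).
const byType = new Map();
for (const o of orders) {
  const key = `${o.department}|${o.order_type}`;
  const g = byType.get(key) || {
    department: o.department,
    order_type_name: o.order_type,
    del_count: 0,
    unique_order: new Set(),
  };
  g.del_count += o.del_count;
  g.unique_order.add(o.unique_order);
  byType.set(key, g);
}

// Second $group: key = department; collect one entry per order type,
// taking the set's size as the unique order count ($size).
const byDept = new Map();
for (const g of byType.values()) {
  const list = byDept.get(g.department) || [];
  list.push({
    order_type_name: g.order_type_name,
    unique_order_count: g.unique_order.size,
    del_count: g.del_count,
  });
  byDept.set(g.department, list);
}
```

Here the duplicate "o1" order contributes to del_count twice but is counted once as a unique order, which is exactly what $addToSet followed by $size gives you.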

Does order matter within a MongoDB $match block having $text search?

I am performing an aggregate query that contains $match, $group, $project, and then $sort stages. I am wondering whether placing the text search block first or last within my $match block makes any performance difference. I am currently judging by the Robo 3T response time metric and not noticing a difference, and I wanted to confirm whether my observation holds up.
The query looks something like this:
db.COLLECTION.aggregate([
    {$match: {$text: {$search: 'xyz'}, ...}},
    {$group: {...}},
    {$project: {...}},
    {$sort: {...}}
])

MongoDB aggregation $lookup to a field that is an indexed array

I am trying a fairly complex aggregate command on two collections involving a $lookup pipeline. This normally works just fine in a simple aggregation, as long as an index is set on the foreignField.
But my $lookup is more complex, as the indexed field is not a plain Int64 field but an array of Int64. When doing a simple find(), it is easy to verify using explain() that the index is being used. But explaining the aggregate pipeline does not reveal whether the index is being used inside the $lookup pipeline. All my timing tests seem to indicate that it is not. The MongoDB version is 3.6.2, and the db compatibility level is set to 3.6.
As I said earlier, I am not using simple foreignField lookup but the 3.6-specific pipeline + $match + $expr...
Could using a pipeline be a showstopper for the index? Does anyone have deep experience with the new $lookup pipeline syntax and/or indexes on array fields?
Examples
Either of the following works fine and, when explained, shows that the index on followers is being used.
db.col1.find({followers: {$eq : 823778}})
db.col1.find({followers: {$in : [823778]}})
But the following one does not seem to make use of the index on followers [there are more steps in the pipeline, stripped for readability].
db.col2.aggregate([
    {$match: {field: "123"}},
    {$lookup: {
        from: "col1",
        let: {follower: "$follower"},
        pipeline: [{
            $match: {
                $expr: {
                    $or: [
                        {$eq: ["$follower", "$$follower"]},
                        {$in: ["$$follower", "$followers"]}
                    ]
                }
            }
        }],
        as: "followers_all"
    }}
])
This is a missing feature which is going to be part of version 3.8.
Currently, equality ($eq) matches in the $lookup sub-pipeline are optimised to use indexes.
Refer to the Jira ticket, fixed in 3.7.1 (dev version).
This may also be relevant for non-multikey indexes.
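For intuition, the sub-pipeline's $expr predicate behaves like the following plain-JavaScript filter over made-up documents; without index support, this is roughly the per-document scan the server has to fall back to:

```javascript
// Hypothetical col1 documents: a scalar follower field plus a followers array.
const col1 = [
  { _id: 1, follower: 823778, followers: [] },
  { _id: 2, follower: 111,    followers: [823778, 555] },
  { _id: 3, follower: 222,    followers: [999] },
];
// One outer (col2) document whose follower value drives the lookup.
const col2doc = { field: "123", follower: 823778 };

// $or of the two conditions from the $expr above.
const followers_all = col1.filter(c =>
  c.follower === col2doc.follower ||              // $eq: ["$follower", "$$follower"]
  (c.followers || []).includes(col2doc.follower)  // $in: ["$$follower", "$followers"]
);
```

Each outer document re-runs this filter over all of col1, which is why the missing index optimisation hurts so much at scale.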

Limit no. of rows in mongodb input

How can I limit the number of rows retrieved in the MongoDB Input transformation used in Kettle?
I tried the queries below in the MongoDB Input query field, but none of them work:
{"$query" : {"$limit" : 10}}
or {"$limit" : 10}
Please let me know where I am going wrong.
Thanks,
Deepthi
There are several query modification operators you can use. Their names are not totally intuitive and don't match the names of functions you would use in the Mongo shell, but they do the same sorts of things.
In your case, you need the $maxScan operator. You could write your query as:
{"$query": {...}, "$maxScan": 10}
We have to use the aggregation method provided by MongoDB; here is the link: docs.mongodb.org/manual/applications/aggregation.
In my case, I use this query in the MongoDB Input step of Kettle, and we also have to check the "Query is aggregation pipeline" option.
{$match: {activity_type: " view "}},
{$sort: {activity_target: -1 } },
{$limit: 10}

Aggregate framework can't use indexes

I run this command:
db.ads_view.aggregate({$group: {_id : "$campaign", "action" : {$sum: 1} }});
ads_view : 500 000 documents.
This query takes 1.8s. Here is its profile: https://gist.github.com/afecec63a994f8f7fd8a
indexed : db.ads_view.ensureIndex({campaign: 1});
But MongoDB doesn't use the index. Does anyone know whether the aggregation framework can use indexes, and if so, how to index this query?
This is a late answer, but since $group in Mongo as of version 4.0 still won't make use of indexes, it may be helpful for others.
To speed up your aggregation significantly, perform a $sort before the $group.
So your query would become:
db.ads_view.aggregate([{$sort: {"campaign": 1}}, {$group: {_id: "$campaign", "action": {$sum: 1}}}]);
This assumes an index on campaign, which should have been created according to your question. In Mongo 4.0, create the index with db.ads_view.createIndex({campaign:1}).
I tested this on a collection containing 5.5+ million documents. Without $sort, the aggregation would not have finished even after several hours; with $sort preceding $group, the aggregation takes a couple of seconds.
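The effect can be illustrated in plain JavaScript on made-up data (this sketches the streaming idea, not Mongo's actual internals): once the input arrives ordered by campaign, each group can be finished in a single pass as soon as the key changes, with no big hash table to build.

```javascript
// Input already sorted by campaign, as $sort: {campaign: 1} would deliver it.
const sorted = [
  { campaign: "a" }, { campaign: "a" },
  { campaign: "b" },
  { campaign: "c" }, { campaign: "c" },
];

// Streaming group: compare each doc's key to the current group's key;
// bump the counter on a match, otherwise start a new group.
const groups = [];
for (const doc of sorted) {
  const last = groups[groups.length - 1];
  if (last && last._id === doc.campaign) {
    last.action += 1;
  } else {
    groups.push({ _id: doc.campaign, action: 1 });
  }
}
```

With unsorted input the same one-pass trick is impossible, because a campaign's documents can reappear arbitrarily late in the stream.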
The $group operator is not currently one of the operators that will use an index. The list of operators that do (as of 2.2) is:
$match
$sort
$limit
$skip
From here:
http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes
Based on the number of yields going on in the gist, I would assume you either have a very active instance or that a lot of this data is not in memory when you are doing the group (it will usually yield on page faults too), hence the 1.8s.
Note that even if $group could use an index, and your index covered everything being grouped, it would still involve a full scan of the index to do the group, and would likely not be terribly fast anyway.
$group doesn't use an index because it doesn't have to. When you $group your items you're essentially indexing all documents passing through the $group stage of the pipeline using your $group's _id. If you used an index that matched the $group's _id, you'd still have to pass through all the docs in the index so it's the same amount of work.