Getting distinct values using Eve aggregation is slower than PyMongo's collection.distinct() - mongodb

I have the following schema in MongoDB:
collection: records
{
"profile_id":int
"fields":[
{
"marc_ff":string
}
]
}
I'd like to get the distinct fields.marc_ff values!
When I use Eve's aggregation feature (development branch, v0.7):
getuniquefields={
'datasource': {
'source': 'records',
'aggregation': {
'pipeline': [
{"$match": {
"profile_id":"$profile_value"
}},
{ "$unwind": "$fields" },
{ "$group": {
"_id": {"field_subfield":"$fields.marc_ff"},
}}
],
'options': {'allowDiskUse':True}
}
}
}
the query takes almost 48 seconds (for several thousand records).
However, when I use PyMongo's distinct() function:
db.records.distinct('fields.marc_ff',{"profile_id":"$profile_value"})
it takes 30 seconds.
Why is it faster?
In the second case I'm initializing and authenticating a new MongoClient inside Eve's run.py file, and it is still 18 seconds faster.
Eve does not (yet) support the distinct command. Is there a more efficient way than the pipeline above to get the distinct values?
UPDATE
As @Asya pointed out in the comments, I have added the "profile_id" restriction to the PyMongo query, so the two queries are now equivalent, and (oddly) distinct is even faster (down from 33 seconds to 30 seconds).
I should mention that my collection currently only holds records for one profile_id, so both queries have to consider the same set of records (the whole collection).
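To make the comparison concrete, here is a minimal in-memory sketch of what the two approaches compute. It is not how MongoDB or Eve implement either command; the sample records and the helper names distinctViaUnwindGroup and distinctDirect are hypothetical. It only illustrates that the pipeline builds one intermediate document per array element before grouping, while a distinct-style scan collects the values directly.

```javascript
// Hypothetical sample data standing in for the "records" collection.
const records = [
  { profile_id: 1, fields: [{ marc_ff: "245a" }, { marc_ff: "100a" }] },
  { profile_id: 1, fields: [{ marc_ff: "245a" }, { marc_ff: "260c" }] },
  { profile_id: 2, fields: [{ marc_ff: "999z" }] },
];

// Mirrors $match -> $unwind -> $group on plain arrays.
function distinctViaUnwindGroup(docs, profileId) {
  // $match: keep only the requested profile.
  const matched = docs.filter((d) => d.profile_id === profileId);
  // $unwind: one copy of the document per element of "fields".
  const unwound = matched.flatMap((d) =>
    d.fields.map((f) => ({ ...d, fields: f }))
  );
  // $group: one group per distinct fields.marc_ff value.
  const groups = new Map();
  for (const d of unwound) groups.set(d.fields.marc_ff, true);
  return [...groups.keys()];
}

// Mirrors distinct('fields.marc_ff', { profile_id }): no intermediate documents.
function distinctDirect(docs, profileId) {
  const seen = new Set();
  for (const d of docs) {
    if (d.profile_id !== profileId) continue;
    for (const f of d.fields) seen.add(f.marc_ff);
  }
  return [...seen];
}

console.log(distinctViaUnwindGroup(records, 1));
console.log(distinctDirect(records, 1));
```

Both functions return the same set of values; the difference is only in how much intermediate data is created along the way.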

Related

Mongodb to fetch top 100 results for each category

I have a collection of transactions that has below schema:
{
_id,
client_id,
billings,
date,
total
}
What I want to achieve is to get the 10 latest transactions, based on the date, for each client in a list of client IDs. I don't think $slice works well here, as its use case is mostly embedded arrays.
Currently, I am iterating through the client_ids and using find with the limit but it is extremely slow.
UPDATE
Example
https://mongoplayground.net/p/urKH7HOxwqC
This shows two clients with 10 transactions each on different days; I want to write a query that returns the latest 5 transactions for each.
Any suggestions of how to query data to make it faster?
The most efficient way would be to just execute multiple queries, one for each client, like so:
const clients = await db.collection.distinct('client_id');
const results = await Promise.all(
clients.map((clientId) => db.collection.find({client_id: clientId}).sort({date: -1}).limit(5).toArray())
);
To improve performance, make sure you have an index on client_id and date. If for whatever reason you can't build these indexes, I'd recommend using the following pipeline (the $bottomN syntax is available starting with version 5.2):
db.collection.aggregate([
{
$group: {
_id: "$client_id",
latestTransactions: {
"$bottomN": {
"n": 5,
"sortBy": {
"date": 1
},
"output": "$$ROOT"
}
}
}
}
])
Mongo Playground
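For readers who want to check what the grouping stage is expected to return, here is a small in-memory sketch of the same idea: group by client_id and keep the n documents with the latest dates. The function name latestPerClient and the sample transactions are hypothetical, and the ordering inside each group (ascending by date) is a choice made in this sketch, not a statement about the server's output order.

```javascript
// Keep the n latest transactions per client (roughly what the $group + $bottomN stage computes).
function latestPerClient(transactions, n) {
  // Bucket the transactions by client_id.
  const byClient = new Map();
  for (const t of transactions) {
    if (!byClient.has(t.client_id)) byClient.set(t.client_id, []);
    byClient.get(t.client_id).push(t);
  }
  const result = [];
  for (const [clientId, docs] of byClient) {
    docs.sort((a, b) => a.date - b.date); // ascending by date
    // The last n entries are the latest ones; guard against n <= 0.
    const latest = n > 0 ? docs.slice(-n) : [];
    result.push({ _id: clientId, latestTransactions: latest });
  }
  return result;
}

// Hypothetical sample data.
const transactions = [
  { client_id: "a", date: new Date("2023-01-01"), total: 10 },
  { client_id: "a", date: new Date("2023-01-03"), total: 30 },
  { client_id: "a", date: new Date("2023-01-02"), total: 20 },
  { client_id: "b", date: new Date("2023-01-05"), total: 50 },
];

console.log(JSON.stringify(latestPerClient(transactions, 2), null, 2));
```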

what is the difference between MongoDB find and aggregate in below queries?

select records using aggregate:
db.getCollection('stock_records').aggregate(
[
{
"$project": {
"info.created_date": 1,
"info.store_id": 1,
"info.store_name": 1,
"_id": 1
}
},
{
"$match": {
"$and": [
{
"info.store_id": "563dcf3465512285781608802a"
},
{
"info.created_date": {
$gt: ISODate("2021-07-18T21:07:42.313+00:00")
}
}
]
}
}
])
select records using find:
db.getCollection('stock_records').find(
{
'info.store_id':'563dcf3465512285781608802a',
'info.created_date':{ $gt:ISODate('2021-07-18T21:07:42.313+00:00')}
})
What is difference between these queries and which is best for select by id and date condition?
I think your question should be rephrased to "what's the difference between find and aggregate".
Before I dive into that: both commands are similar and will generally perform the same at scale. The one specific difference here is that you did not add a projection to your find query, so it returns the full document.
As for which is better: unless you need a specific aggregation operator, it's generally best to use find, as it performs better.
Now, why is the aggregation framework's performance "worse"? Simply because it does more.
Any pipeline stage needs the aggregation framework to fetch the BSON for each document and convert it to internal objects for processing; at the end of the pipeline these are converted back to BSON and sent to the client.
Especially for large result sets, this adds significant overhead compared to a find, where the BSON is just sent back to the client.
Because of this, if you can express your aggregation as a find query, you should.
Aggregation is slower than find.
In your example, with aggregation:
In the first stage, you return all the documents with the projected fields.
For example, if your collection has 1000 documents, all 1000 pass through the first stage, each with the specified projection fields. This will impact the performance of your query.
In the second stage, you filter the documents that match the query filter.
For example, out of the 1000 documents from stage 1, you select only a few.
In your example, with find:
First, you filter the documents that match the query filter.
For example, if your collection has 1000 documents, you return only the documents that match the query condition.
Here you did not specify which fields to return, so the returned documents will have all fields.
You can use a projection in find instead of using aggregation:
db.getCollection('stock_records').find(
{
'info.store_id': '563dcf3465512285781608802a',
'info.created_date': {
$gt: ISODate('2021-07-18T21:07:42.313+00:00')
}
},
{
"info.created_date": 1,
"info.store_id": 1,
"info.store_name": 1,
"_id": 1
}
)
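The ordering argument above can be illustrated with a small in-memory sketch. It does not model the server: the real query planner may reorder stages (for example, moving a $match ahead of a $project), so treat this purely as an illustration of the reasoning. The sample documents and the helper names projectThenMatch and matchThenProject are hypothetical.

```javascript
// Hypothetical sample data standing in for stock_records.
const stockRecords = [
  { _id: 1, info: { store_id: "s1", store_name: "North", created_date: new Date("2021-07-20"), stock: 5 } },
  { _id: 2, info: { store_id: "s1", store_name: "North", created_date: new Date("2021-07-01"), stock: 7 } },
  { _id: 3, info: { store_id: "s2", store_name: "South", created_date: new Date("2021-07-25"), stock: 9 } },
];

// Keep only the fields listed in the projection.
const project = (d) => ({
  _id: d._id,
  info: {
    created_date: d.info.created_date,
    store_id: d.info.store_id,
    store_name: d.info.store_name,
  },
});

// The filter used by both queries.
const matches = (d, storeId, after) =>
  d.info.store_id === storeId && d.info.created_date > after;

// Project every document first, then filter (the order written in the pipeline).
function projectThenMatch(docs, storeId, after) {
  const projected = docs.map(project);
  const result = projected.filter((d) => matches(d, storeId, after));
  return { projectedCount: projected.length, result };
}

// Filter first, then project only the survivors (find with a projection).
function matchThenProject(docs, storeId, after) {
  const result = docs.filter((d) => matches(d, storeId, after)).map(project);
  return { projectedCount: result.length, result };
}

const after = new Date("2021-07-18T21:07:42.313Z");
console.log(projectThenMatch(stockRecords, "s1", after).projectedCount);
console.log(matchThenProject(stockRecords, "s1", after).projectedCount);
```

Both orders return the same documents; only the number of documents that had to be projected differs.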

MongoDB query over several collections with one sort stage

I have some data with identical layout divided over several collections, say we have collections named Jobs.Current, Jobs.Finished, Jobs.ByJames.
I have implemented a complex query using some aggregation stages on one of these collections, where the last stage is the sorting. It's something like this (in reality it's implemented in C# and additionally does a projection):
db.ArchivedJobs.aggregate([ { $match: { Name: { $gte: "A" } } }, { $addFields: { "UpdatedTime": { $max: "$Transitions.TimeStamp" } } }, { $sort: { "__tmp": 1 } } ])
My new requirement is to include all these collections into my query. I could do it by simply running the same query on all collections in sequence - but then I still need to sort the results together. As this sort isn't so trivial (using an additional field being created by a $max on a sub-array) and I'm using skip and limit options I hope it's possible to do it in a way like:
Doing the query I already implemented on all relevant collections by defining appropriate aggregation steps
Sorting the whole result afterwards inside the same aggregation request
I found something with a $lookup stage, but couldn't apply it to my request as it needs to do some field-oriented matching (?). I need to access the complete objects.
The data is something like
{
"_id":"4242",
"name":"Stream recording Test - BBC 60 secs switch",
"transitions":[
{
"_id":"123",
"timeStamp":"2020-02-13T14:59:40.449Z",
"currentProcState":"Waiting"
},
{
"_id":"124",
"timeStamp":"2020-02-13T14:59:40.55Z",
"currentProcState":"Running"
},
{
"_id":"125",
"timeStamp":"2020-02-13T15:00:23.216Z",
"currentProcState":"Error"
} ],
"currentState":"Error"
}
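One possible direction, offered as an assumption rather than something taken from the thread, is the $unionWith stage (available in MongoDB 4.4 and later), which appends the documents of another collection, optionally after running a sub-pipeline on them. The sketch below only builds the pipeline as a plain array; the helper name buildUnionPipeline is hypothetical, and the field names follow the pipeline in the question.

```javascript
// Build one pipeline that applies the same stages to several collections
// and then sorts/paginates the combined result.
function buildUnionPipeline(otherCollections, perCollectionStages, finalStages) {
  // One $unionWith per additional collection, each running the same sub-pipeline.
  const unions = otherCollections.map((coll) => ({
    $unionWith: { coll, pipeline: perCollectionStages },
  }));
  // The stages for the collection aggregate() is called on come first.
  return [...perCollectionStages, ...unions, ...finalStages];
}

// Stages taken from the question's query.
const perCollectionStages = [
  { $match: { Name: { $gte: "A" } } },
  { $addFields: { UpdatedTime: { $max: "$Transitions.TimeStamp" } } },
];

// Sorting and paging happen once, over the union (skip/limit values are placeholders).
const finalStages = [{ $sort: { UpdatedTime: 1 } }, { $skip: 0 }, { $limit: 20 }];

const unionPipeline = buildUnionPipeline(
  ["Jobs.Finished", "Jobs.ByJames"],
  perCollectionStages,
  finalStages
);
console.log(JSON.stringify(unionPipeline, null, 2));
```

Under these assumptions the pipeline would be run against the first collection, for example db.getCollection("Jobs.Current").aggregate(unionPipeline), so that the sort sees documents from all three collections.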

Meteor collection get last document of each selection

Currently I use the following find query to get the latest document of a certain ID
Conditions.find({
caveId: caveId
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
How can I do the same for multiple IDs, for example with $in?
I tried it with the following query. The problem is that it will limit the documents to 1 for all the found caveIds. But it should set the limit for each different caveId.
Conditions.find({
caveId: {$in: caveIds}
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
One solution I came up with is using the aggregate functionality.
var conditionIds = Conditions.aggregate(
[
{"$match": { caveId: {"$in": caveIds}}},
// Sort first: $last is only meaningful on ordered input.
{"$sort": { diveDate: 1 }},
{
$group:
{
_id: "$caveId",
conditionId: {$last: "$_id"},
diveDate: { $last: "$diveDate" }
}
}
]
).map(function(child) { return child.conditionId});
var conditions = Conditions.find({
_id: {$in: conditionIds}
},
{
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
You don't want to use $in here as noted. You could solve this problem by looping through the caveIds and running the query on each caveId individually.
You're basically looking at a join query here: you need all caveIds and then look up the last dive for each.
In my opinion (and this is only an opinion!), this is a problem of database schema/denormalization:
You could, as mentioned here, look up all caveIds and then run the single query for each, every single time you need the last dives.
However, I think you are much better off recording/updating the last dive inside your cave document, and then looking up all caveIds of interest, pulling only the lastDive field.
That will give you immediately what you need, rather than going through expensive search/sort queries. This is at the expense of maintaining that field in the document, but it sounds like it should be fairly trivial as you only need to update the one field when a new event occurs.
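The denormalization suggested above can be sketched in memory as follows. This is only an illustration under assumptions: the Map stands in for a caves collection, and the names recordCondition, lastDives and the shape of the lastDive field are hypothetical.

```javascript
// In-memory stand-in for the caves collection (hypothetical data).
const caves = new Map([
  ["c1", { _id: "c1", name: "Blue Hole" }],
  ["c2", { _id: "c2", name: "Deep Spring" }],
]);

// Called whenever a new condition is saved: keep the newest dive on the cave itself.
function recordCondition(caveStore, condition) {
  const cave = caveStore.get(condition.caveId);
  if (!cave) throw new Error("unknown cave " + condition.caveId);
  // Only overwrite when the new dive is more recent than the stored one.
  if (!cave.lastDive || condition.diveDate > cave.lastDive.diveDate) {
    cave.lastDive = {
      diveDate: condition.diveDate,
      visibility: condition.visibility,
    };
  }
}

// Reading the latest dive per cave is now a plain lookup, with no sort or group.
function lastDives(caveStore, caveIds) {
  return caveIds
    .filter((id) => caveStore.has(id))
    .map((id) => ({ caveId: id, lastDive: caveStore.get(id).lastDive || null }));
}

recordCondition(caves, { caveId: "c1", diveDate: new Date("2016-01-10"), visibility: 5 });
recordCondition(caves, { caveId: "c1", diveDate: new Date("2016-01-05"), visibility: 9 }); // older, ignored
recordCondition(caves, { caveId: "c2", diveDate: new Date("2016-01-07"), visibility: 3 });

console.log(lastDives(caves, ["c1", "c2"]));
```

The trade-off is the one named above: every write has to maintain the extra field, in exchange for cheap reads.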

MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit the result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate(
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } },
{ $limit : 5 }
);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using $push and $slice: https://stackoverflow.com/a/39784851/4752635 (@emaniacs mentions it here as well).
But I prefer using 2 queries. Solution with pushing $$ROOT and using $slice runs into document memory limitation of 16MB for large collections. Also, for large collections two queries together seem to run faster than the one with $$ROOT pushing. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one which sorts).
The first query filters and then groups by ID to get the number of filtered elements. Do not sort here; it is unnecessary.
Second query which filters, sorts and paginates.
I have settled with this solution using 2 queries and aggregation framework (note - I use node.js in this example):
var aggregation = [
{
// If you can match fields at the beginning, match as many as possible, as early as possible.
$match: {...}
},
{
// Projection.
$project: {...}
},
{
// Some things you can match only after projection or grouping, so do it now.
$match: {...}
}
];
// Copy the filtering stages of the pipeline; they are the same for both the count query and the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
{
$group: {
_id: null,
count: { $sum: 1 }
}
}
);
// Sort in pagination query.
aggregationPaginated.push(
{
$sort: sorting
}
);
// Paginate.
aggregationPaginated.push(
{
$limit: skip + length
},
{
$skip: skip
}
);
// I use mongoose.
// Get total count.
model.count(function(errCount, totalCount) {
// Count filtered.
model.aggregate(aggregation)
.allowDiskUse(true)
.exec(
function(errFind, documents) {
if (errFind) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_counting'
});
}
else {
// Number of filtered elements (0 if nothing matched, since $group then emits no document).
var numFiltered = documents.length ? documents[0].count : 0;
// Filter, sort and paginate.
model.aggregate(aggregationPaginated)
.allowDiskUse(true)
.exec(
function(errFindP, documentsP) {
if (errFindP) {
// Errors.
res.status(503);
return res.json({
'success': false,
'response': 'err_pagination'
});
}
else {
return res.json({
'success': true,
'recordsTotal': totalCount,
'recordsFiltered': numFiltered,
'response': documentsP
});
}
});
}
});
});
Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now it is best to run two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need is a count of documents, the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
If these aggregations turn out to be slow, there are a couple of strategies. First, keep in mind that you want to start the query with a $match, if applicable, to reduce the result set. $match stages can also be sped up by indexes. Second, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have them run periodically in the background and store the results in a collection of pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
The $facet aggregation stage can be used with MongoDB versions >= 3.4.
It lets you fork the pipeline at a particular stage into multiple sub-pipelines, in this case one sub-pipeline to count the documents and another for sorting, skipping and limiting.
This avoids running the same stages multiple times in multiple requests.
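A minimal sketch of such a forked pipeline is shown below. It only builds the pipeline as a plain array; the helper name buildPagedPipeline and the facet names total and page are hypothetical choices for this example.

```javascript
// Build a pipeline whose last stage forks into a count and a page of results.
function buildPagedPipeline(sharedStages, sort, skip, limit) {
  return [
    // Stages that both the count and the page depend on (filtering, grouping).
    ...sharedStages,
    {
      $facet: {
        // Sub-pipeline 1: count everything that reached this point.
        total: [{ $count: "count" }],
        // Sub-pipeline 2: sort and cut out the requested page.
        page: [{ $sort: sort }, { $skip: skip }, { $limit: limit }],
      },
    },
  ];
}

// The grouping stage from the question, followed by the forked stage.
const pagedPipeline = buildPagedPipeline(
  [{ $group: { _id: "$author", posts: { $sum: 1 } } }],
  { posts: -1 },
  0,
  5
);
console.log(JSON.stringify(pagedPipeline, null, 2));
```

The result of a $facet stage is a single document holding both arrays, so the page it contains is subject to the 16MB document size limit; keeping the page small is what makes this approach practical.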
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors), you can just remove $limit from the pipeline and, on the results, use:
var totalCount = results.length;
var page = results.slice(skip, skip + limit);
For example:
db.article.aggregate([
{ $group : {
_id : "$author",
posts : { $sum : 1 }
}},
{ $sort : { posts: -1 } }
// { $skip: yourSkip },   // remove this
// { $limit: yourLimit }, // remove this too
]).exec(function(err, results){
var totalCount = results.length; // get the total count here
var page = results.slice(yourSkip, yourSkip + yourLimit); // slice() returns a new array
});
I had the same problem, and solved it with $project, $slice and $$ROOT.
db.article.aggregate(
{ $group : {
_id : '$author',
posts : { $sum : 1 },
articles: {$push: '$$ROOT'},
}},
{ $sort : { posts: -1 } },
{ $project: {total: '$posts', articles: {$slice: ['$articles', from, to]}}}
).toArray(function(err, result){
var articles = result[0].articles;
var total = result[0].total;
});
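You need to declare the from and to variables. Note that in the three-argument form of $slice, the third argument is the number of elements to return, not an end index.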
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
In my case, we use the $out stage to dump the result set from the aggregation into a temp/cache collection, then count it. Since we also need to sort and paginate the results, we add an index on the temp collection and save its name in the session, removing the collection when the session closes or the cache times out.
I get total count with aggregate().toArray().length