Keep the result of a subset from a $match aggregation in cache in MongoDB

I am building a website to explore MongoDB data. My database stores GPS measurements captured from smartphones, and I use various queries to explore those measurements. One query groups the measurements by day and counts them; another counts the number of measurements for each kind of smartphone (iOS, Android, etc.).
All these queries share the same $match parameters in their aggregation pipelines. In this stage I filter the measurements to focus on an interval of time and a geographical area.
Is there a way to keep the subset obtained by the $match in a cache, so that the database does not need to apply this filter every time?
I want to optimize the response time of my queries.
Sample of one of the queries:
cursor = db.myCollection.aggregate(
    [
        {
            "$match":
            {
                "$and": [{"t": {"$gt": tMin, "$lt": tMax}, "location": {"$geoWithin": {"$geometry": square}}}]
            }
        },
        {
            "$group":
            {
                "_id": {"hourGroup": "$tHour"},
                "count": {"$sum": 1}
            }
        }
    ]
)
I want to keep the result of this stage in the cache:
"$match":
{
    "$and": [{"t": {"$gt": tMin, "$lt": tMax}, "location": {"$geoWithin": {"$geometry": square}}}]
}

One way you could do it is to create a new collection using the $out pipeline stage.
Then, as you run your batch of queries, the first query creates the matched output and the subsequent ones can read its results from that collection.
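For example, a minimal sketch in the mongo shell, assuming a scratch collection named cachedSubset (the name is arbitrary):
// Run the shared filter once and materialise the matched subset
db.myCollection.aggregate([
    { $match: {
        "t": { "$gt": tMin, "$lt": tMax },
        "location": { "$geoWithin": { "$geometry": square } }
    } },
    { $out: "cachedSubset" }
])
// Follow-up queries can skip the filter and read the cached subset directly
db.cachedSubset.aggregate([
    { $group: { "_id": { "hourGroup": "$tHour" }, "count": { "$sum": 1 } } }
])
Keep in mind that $out produces a snapshot: the first pipeline has to be rerun whenever tMin, tMax or the geographical square change, or when new measurements arrive.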
There is also a new pipeline stage in development called $facet, which will let us execute the match once and then use the result in multiple aggregation paths (the plan is to have it ready in MongoDB 3.4).
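Once $facet is available, a single command could run the shared $match and then fan out into several groupings; a rough sketch (the byPlatform facet assumes a field named platform holding the smartphone kind, which is not shown in the question):
db.myCollection.aggregate([
    // Shared filter, applied once
    { $match: {
        "t": { "$gt": tMin, "$lt": tMax },
        "location": { "$geoWithin": { "$geometry": square } }
    } },
    { $facet: {
        // Counts per hour, as in the original query
        "byHour": [
            { $group: { "_id": { "hourGroup": "$tHour" }, "count": { "$sum": 1 } } }
        ],
        // Counts per smartphone kind ("platform" is an assumed field name)
        "byPlatform": [
            { $group: { "_id": "$platform", "count": { "$sum": 1 } } }
        ]
    } }
])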
Any comments welcome!

Related

MongoDB keysExamined:return Ratio

In MongoDB, we use aggregate and $group when we want to get the count of some action. Suppose we have to get the total number of deposits users have made in the last 2 months; we use $group and then $sum to get the count. Now the MongoDB Atlas profiler shows this as a very time-consuming and intensive operation, because it scans the keys of 2 months' worth of data and returns only 1 document (the count). So is this a good way to get a count or not?
If your query is "get the total number of deposits for a set of users between two dates, where that set can be all users" and deposits is a collection with a field user or similar, then you do not need $group. Simply count() the filtered set:
db.deposits.find({$and: [
    {"user": {$in: [ list ]}},
    {"depositDate": {$gte: startDate}},
    {"depositDate": {$lt: endDate}}
]}).count();
or with the aggregation pipeline:
db.deposits.aggregate([
    {$match: {$and: [
        {"user": {$in: [ list ]}},
        {"depositDate": {$gte: startDate}},
        {"depositDate": {$lt: endDate}}
    ]}},
    {$count: "n"}
]);
Note you should be using a real datetime type for 'depositDate' -- not a string, not even ISO 8601 -- to facilitate more sophisticated queries involving dates.
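For illustration, a small sketch in the mongo shell (the document shape and values here are made up):
// Store depositDate as a BSON Date, not a string
db.deposits.insertOne({
    "user": "u123",
    "depositDate": ISODate("2021-06-01T10:15:00Z")
});
// Range filters then compare chronologically, and date operators such as
// $dateToString become available for later aggregation
db.deposits.find({
    "depositDate": { $gte: ISODate("2021-05-01"), $lt: ISODate("2021-07-01") }
}).count();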

Mongoose aggregate pipeline: sorting indexed date in MongoDB is slow

I've been wrestling with this error for some time in my app and was hoping someone could lend a hand finding the problem with this aggregation query.
I'm using a docker container running MongoDB shell version v4.2.8. The app uses an Express.js backend with Mongoose middleware to interface with the database.
I want to make an aggregation pipeline that first matches on an indexed field called 'platform_number'. We then sort by the indexed field 'date' (stored as an ISODate type). The remaining pipeline does not seem to influence the performance; it's just some projections and filtering.
{$sort: {date: -1}} bottlenecks the entire aggregate, even though there are only around 250 documents returned. I do have an unindexed key called 'cycle_number' that correlates directly with the 'date' field. Replacing {date: -1} with {cycle_number: -1} speeds up the query, but then I get an out-of-memory error: sorting has a 100MB RAM cap, and this sort fails with just 250 documents.
A possible solution would be to include the additional option { "allowDiskUse": true }. But before I do, I want to know why 'date' isn't sorting properly in the first place. Another option would be to index 'cycle_number' but again, why does 'date' throw up its hands?
The aggregation pipeline is provided below. It is first a match, followed by the sort and so on. I'm happy to explain what the other functions are doing, but they don't make much difference when I comment them out.
let agg = [ {$match: {platform_number: platform_number}} ] // indexed number
agg.push({$sort: {date: -1}}) // date is indexed in descending order
if (xaxis && yaxis) {
    agg.push(helper.drop_missing_bgc_keys([xaxis, yaxis]))
    agg.push(helper.reduce_bgc_meas([xaxis, yaxis]))
}
const query = Profile.aggregate(agg)
query.exec(function (err, profiles) {
    if (err) return next(err)
    if (profiles.length === 0) { res.send('platform not found') }
    else {
        res.json(profiles)
    }
})
Once again, I've been tiptoeing around this issue for some time. Solving the issue would be great, but understanding it better is also awesome. Thank you for your help!
The query executor is not able to use a different index for the second stage. MongoDB indexes map the key values to the location of documents in the data files.
Once the $match stage has completed, the documents are in the pipeline, so no further index use is possible.
However, if you create a compound index on {platform_number:1, date:-1} the query planner can combine the $match and $sort stages into a single stage that will not require a blocking sort, which should greatly improve the performance of this pipeline.
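A sketch of creating that compound index, assuming the Mongoose Profile model maps to a profiles collection (adjust the collection name to your deployment):
// Equality on platform_number first, then date for the sort
db.profiles.createIndex({ platform_number: 1, date: -1 })
With this index the $match walks the platform_number prefix and documents come back already ordered by date, so the $sort stage no longer needs a blocking in-memory sort.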

Can an index on a subfield cover queries on projections of that field?

Imagine you have a schema like:
[{
name: "Bob",
naps: [{
time: 2019-05-01T15:35:00,
location: "sofa"
}, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
    {$unwind: "$naps"},
    {$group: { _id: {$dayOfWeek: "$naps.time"}, napsOnDay: {"$sum": 1} }}
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the time Date field could have been. Why is this? How can I get mongo to use the index for the more optimal query?
Indexes store pointers to actual documents, and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match or $sort do not mutate the actual documents, and thus indexes can be used in these stages.
In contrast, $unwind, $group, or any other stage that changes the actual document representation basically loses the connection between the index and the material documents.
Additionally, when those stages are processed without $match, you're basically saying that you want to process the whole collection. There is no point in using the index if you want to process the whole collection.
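One way to confirm this, assuming the documents live in a people collection (the question does not name it), is to ask the aggregation for its query plan instead of its results:
db.people.aggregate(
    [
        { $unwind: "$naps" },
        { $group: { _id: { $dayOfWeek: "$naps.time" }, napsOnDay: { $sum: 1 } } }
    ],
    { explain: true }   // return the plan instead of documents; expect a COLLSCAN
)
Because there is no leading $match or $sort that could use the naps.time index, the plan falls back to scanning the whole collection.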

Mongo View Not Showing Same Indexing Speed Improvement As in Targeted Collection

I have created a mongo view that basically targets documents in an "accounts" collection, specifically those where the value of "transactions.amounts.balance" is greater than zero. So that looks like this:
{"transactions.amounts.balance": { $gt : 0 }}
Now, because results took a long time to return, I have added an index on this field in the collection this view works with. Subsequently, when I then run this query on the collection, the results now return much more quickly -- like less than a second instead of 9 seconds prior to adding the index.
However, that said, I don't seem to be noticing the same performance improvement in the mongo view I've created, which, again, among other things, recreates this same query on the same collection.
My understanding is that a view will inherit all of the indexes that have been created on the collection it targets. So, if that's the case, why am I not seeing any kind of performance improvement in the mongo view? Am I missing something?
By the way, when I check the input and output of each stage of my aggregation pipeline, sure enough, this is the one that takes about 9 seconds to return results:
{ "transactions.amounts.balance" : { "$gt" : 0.0 } }
Why is this query step so much slower in my view than when run directly on the collection it targets? Is there something else I can do to help speed up the execution of this query step?
Here are the first few steps of the aggregation pipeline in my mongo view:
db.accounts.aggregate(
    // Pipeline
    [
        // Stage 1
        {
            $unwind: {
                "path": "$transactions"
            }
        },
        // Stage 2
        {
            $match: {
                "transactions.amounts.balance": {
                    "$gt": 0.0
                }
            }
        },
        // Stage 3
        {
            $addFields: {
                "openBalance": "$transactions.amounts.balance"
            }
        }
According to the documentation, $match will only use an index if it is used with no other preceding stages.
If you place a $match at the very beginning of a pipeline, the query can take advantage of indexes like any other db.collection.find() or db.collection.findOne().
Since you unwind your documents first, $match won't use the index, which you should see from the explain() plan, too.
Depending on your data, specifically, if you have lots of documents that do not contain a matching entry in the transactions.amounts.balance array, it can be helpful in terms of performance to simply duplicate the $match filter and put one to the very beginning of your pipeline in order to eliminate some of the documents. In the best case (again, this depends on your data), the resulting number of documents will be low enough for the second $match stage to not hurt performance any longer.
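Applied to the view's pipeline from the question, that duplication could look roughly like this (the second $match is the original one, kept to filter the individual unwound transactions):
db.accounts.aggregate([
    // Duplicated filter, placed first so it can use the index on
    // "transactions.amounts.balance" and discard documents that have
    // no positive balance in any transaction
    { $match: { "transactions.amounts.balance": { "$gt": 0.0 } } },
    { $unwind: { "path": "$transactions" } },
    // Original filter, now applied per unwound transaction
    { $match: { "transactions.amounts.balance": { "$gt": 0.0 } } },
    { $addFields: { "openBalance": "$transactions.amounts.balance" } }
])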

Elasticsearch and subsequent Mongodb queries

I am implementing search functionality using Elasticsearch.
I receive "username" set returned by Elasticsearch after which I need to query a collection in MongoDB for latest comment of each user in the "username" set.
Question: Let's say I receive ~100 usernames every time I query Elasticsearch. What would be the fastest way to query MongoDB to get the latest comment of each user? Is querying MongoDB 100 times in a for loop using .findOne() the only option?
(Note: because the latest comment of a user changes very often, I don't want to store it in Elasticsearch, as that would trigger a retrieve-change-reindex process for the entire document far too frequently.)
This answer assumes the following schema for the documents stored in your comments collection in MongoDB:
{
    "_id": ObjectId("5788b71180036a1613ac0e34"),
    "username": "abc",
    "comment": "Best"
}
Assuming usernames is the list of users you get from Elasticsearch, you can perform the following aggregation:
a = [
    {$match: {"username": {'$in': usernames}}},
    {$sort: {_id: -1}},
    {
        $group:
        {
            _id: "$username",
            latestcomment: { $first: "$comment" }
        }
    }
]
db.comments.aggregate(a)
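If this lookup becomes a hot path, an index that covers both the filter and the sort may help; a sketch using the field names from the schema above:
// username for the $in match, _id descending for the sort
db.comments.createIndex({ "username": 1, "_id": -1 })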
You can try this:
db.foo.find().sort({_id: 1}).limit(100);
The 1 will sort ascending (old to new) and -1 will sort descending (new to old).