MongoDB Aggregate of 120M documents - mongodb

I have a system that records entries by action. There are more than 120M of them, and I want to group them by id_entry using aggregate. The structure is as follows:
entry
{
  id_entry: ObjectId(...),
  created_at: Date(...),
  action: {object}
}
When I try to aggregate by id_entry and group its actions, it takes more than 3h to finish:
db.entry.aggregate([
  { '$match': { 'created_at': { $gte: ISODate("2016-02-02"), $lt: ISODate("2016-02-03") } } },
  { '$group': {
      '_id': { 'id_entry': '$id_entry' },
      actions: { $push: '$action' }
  }}
])
But in that range of days there are only around 4M documents. (id_entry and created_at have indexes.)
What am I doing wrong in the aggregate? How can I group 3-4M documents by id_entry in less than 3h?
Thanks

To speed up your particular query, you need an index on the created_at field.
However, the overall performance of the aggregation will also depend on your hardware specification (among other things).
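If that index is missing, a minimal sketch of creating it in the mongo shell (collection name taken from the question):

// Single-field index so the $match on created_at can use an index range scan
// instead of a full collection scan.
db.entry.createIndex({ created_at: 1 })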
If you find the query's performance to be less than what you require, you can either:
Create a pre-aggregated report (essentially a document that contains the aggregated data you require, updated every time new data is inserted), or
Utilize sharding to spread your data to more servers.
If you need to run this aggregation query all the time, a pre-aggregated report allows you to have an extremely up-to-date aggregated report of your data that is accessible using a simple find() query.
The tradeoff is that for every insertion, you will also need to update the pre-aggregated document to reflect the current state of your data. However, this is a relatively small tradeoff compared to having to run a long/complex aggregation query that could interfere with your day-to-day operations.
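A rough sketch of that pattern, assuming one summary document per id_entry (the entry_summary collection name and the newEntry/someIdEntry variables are hypothetical placeholders):

// Hypothetical sketch: on every insert into `entry`, also push the action
// into the matching per-id_entry summary document (upsert creates it if missing).
db.entry_summary.updateOne(
  { _id: newEntry.id_entry },
  { $push: { actions: newEntry.action } },
  { upsert: true }
)

// Reading the pre-aggregated report is then a simple find():
db.entry_summary.find({ _id: someIdEntry })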
One caveat with the aggregation framework: once the pipeline encounters a $group or a $project stage, no index can be used. This is because MongoDB indexes are tied to how documents are stored physically. Grouping and projecting transform the documents into a state where they no longer have a physical representation on disk.

Related

Mongoose aggregate pipeline: sorting indexed date in MongoDB is slow

I've been working on this issue for some time in my app and was hoping someone could lend a hand finding the error in this aggregation query.
I'm using a docker container running MongoDB shell version v4.2.8. The app uses an Express.js backend with Mongoose middleware to interface with the database.
I want to make an aggregation pipeline that first matches by an indexed field called 'platform_number'. We then sort that by the indexed field 'date' (stored as an ISODate type). The remaining pipeline does not seem to influence the performance; it's just some projections and filtering.
{$sort: {date: -1}} bottlenecks the entire aggregate, even though there are only around 250 documents returned. I do have an unindexed key called 'cycle_number' that correlates directly with the 'date' field. Replacing {date: -1} with {cycle_number: -1} speeds up the query, but then I get an out-of-memory error. Sorting has a max 100MB cap on RAM, and this sort fails with 250 documents.
A possible solution would be to include the additional option { "allowDiskUse": true }. But before I do, I want to know why 'date' isn't sorting properly in the first place. Another option would be to index 'cycle_number' but again, why does 'date' throw up its hands?
The aggregation pipeline is provided below. It is first a match, followed by the sort and so on. I'm happy to explain what the other functions are doing, but they don't make much difference when I comment them out.
let agg = [{$match: {platform_number: platform_number}}] // indexed number
agg.push({$sort: {date: -1}}) // date is indexed in descending order
if (xaxis && yaxis) {
    agg.push(helper.drop_missing_bgc_keys([xaxis, yaxis]))
    agg.push(helper.reduce_bgc_meas([xaxis, yaxis]))
}
const query = Profile.aggregate(agg)
query.exec(function (err, profiles) {
    if (err) return next(err)
    if (profiles.length === 0) { res.send('platform not found') }
    else {
        res.json(profiles)
    }
})
Once again, I've been tiptoeing around this issue for some time. Solving the issue would be great, but understanding it better is also awesome. Thank you for your help!
The query executor is not able to use a different index for the second stage. MongoDB indexes map the key values to the location of documents in the data files.
Once the $match stage has completed, the documents are in the pipeline, so no further index use is possible.
However, if you create a compound index on {platform_number:1, date:-1} the query planner can combine the $match and $sort stages into a single stage that will not require a blocking sort, which should greatly improve the performance of this pipeline.
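A sketch of creating that index in the mongo shell (assuming the collection backing the Profile model is named profiles; adjust to your actual collection name):

// Compound index: equality match on platform_number first, then date descending,
// so the $sort can walk the index instead of doing an in-memory (blocking) sort.
db.profiles.createIndex({ platform_number: 1, date: -1 })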

How to improve the performance of this MongoDB query

I am trying to take an extract from a huge MongoDB collection.
In particular, the collection contains 2.65TB of data (unzipped), i.e., 600GB (zipped). Each document has a deep hierarchy and a couple of arrays, and I want to extract some parts of them.

In this collection we have multiple documents for each customer id. Since I want to export the most active document for each customer, I need to group and take the records with the maximum timestamp field and perform some further processing on them.

I need some help in forming the query for the export. I have tried to sort the documents per customer id, but this could not be achieved in an acceptable time when combined with a 'match' construct (this is needed since it is a huge collection and we try to create the export in parts). Currently the query looks like this:
db.getCollection('CEM').aggregate([
  {'$match': {'LiveFeed.customer.profile.id': 'TCAYT2RY2PF93R93JVSUGU7D3'}},
  {'$project': {'LiveFeed.customer.profile.id': 1,
                'LiveFeed.customer.profile.products.air.flights': 1,
                'LiveFeed.context.timestamp': 1}},
  {'$sort': {'LiveFeed.customer.profile.id': 1, 'LiveFeed.context.timestamp': 1}},
  {'$group': {'_id': '$LiveFeed.customer.profile.id',
              'products': {'$last': '$LiveFeed.customer.profile.products.air.flights'}}},
  {'$unwind': '$products'},
  {'$unwind': '$products.sources'},
  {'$project': {'_id': 0,
                'ceid': '$_id',
                'coupon_no': {'$ifNull': ['$products.couponId.couponNumber', '']},
                'ticket_no': {'$ifNull': ['$products.couponId.ticketId.number', '']},
                'pnr_id': '$products.sources.id',
                'departure_date': '$products.segment.departure.at',
                'departure_airport': '$products.segment.departure.code',
                'arrival_airport': '$products.segment.arrival.code',
                'created_date': '$products.createdAt'}}
])
Any ideas/suggestions on how to improve this query would be very helpful indeed - thanks in advance!
It is difficult to answer this without knowing the indexes on your collection. However, you can save some time by eliminating stage 3. The $sort is undone by the $group in stage 4. See $group does not preserve order
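On the index side, a hedged sketch of an index that would at least let the initial $match avoid a full collection scan (assuming one does not already exist):

// Hypothetical sketch: single-field index supporting the $match stage.
db.getCollection('CEM').createIndex({ 'LiveFeed.customer.profile.id': 1 })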

How to speed up aggregate queries in MongoDB

I am running examples of aggregate queries similar to this:
https://www.compose.com/articles/aggregations-in-mongodb-by-example/
db.mycollection.aggregate([
  { $match: { "nested.field": "1110" } },
  {
    $group: {
      _id: null,
      total: { $sum: "$nested.field" },
      average_transaction_amount: { $avg: "$nested.field" },
      min_transaction_amount: { $min: "$nested.field" },
      max_transaction_amount: { $max: "$nested.field" }
    }
  }
]);
One collection that I created has 5,000,000 big JSON documents (around 1,000 K->V pairs each, some nested).
Before adding an index on one nested field, it took around 5 min to count on that field.
After adding the index, the count takes less than a second (which is good).
Now I am trying to do SUM or AVG or the others like in the example above - it takes minutes (not seconds).
Is there a way to improve aggregate queries in MongoDB?
Thanks!
Unfortunately, $group currently does not use indexes in MongoDB. Only $sort and $match can take advantage of indexes. So the query as you wrote it is as optimized as it could be.
There are a couple of things you could do. For max and min, you could just query them instead of using the aggregation framework. You can then sort by $nested.field and take just one. You can put an index on $nested.field and then sort ascending or descending with the same index.
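A rough sketch of that approach (field and collection names taken from the question; a single-field index can be walked in either direction, so one index supports both):

// Index the field, then read max/min with an indexed sort + limit instead of $group.
db.mycollection.createIndex({ "nested.field": 1 })
db.mycollection.find({}, { "nested.field": 1 }).sort({ "nested.field": -1 }).limit(1)  // max
db.mycollection.find({}, { "nested.field": 1 }).sort({ "nested.field": 1 }).limit(1)   // min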
If you have any control over when the data is inserted, and the query is as simple as it looks, you could keep track of the data yourself. So you could have a collection in mongo keyed by the "Id" or whatever you are grouping on, with fields for "total" and "count". You could increment them on inserts, and then getting the totals and averages would be fast queries (see the sketch below). Not sure if that's an option for your situation, but it's the best you can do.
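A minimal sketch of that pattern (the running_totals collection, groupKey, and newDoc names are hypothetical placeholders):

// Hypothetical sketch: maintain a running sum and count per group key at insert
// time; the average is then total / count, read back with a cheap find().
db.running_totals.updateOne(
  { _id: groupKey },                                    // whatever you group on
  { $inc: { total: newDoc.nested.field, count: 1 } },   // running sum and count
  { upsert: true }
)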
Generally, mongo is super fast. In my opinion, the only place it's not quite as good as SQL is aggregation. The benefits heavily outweigh the struggles to me. I generally maintain separate reporting collections for this kind of situation, as I recommended.

How fast is MongoDB's aggregate query over 10 million entries of simple data?

Let's say you have a database of People.
[
#1 {qualities: ['brown', 'nice', 'happy', 'cool', 'cheery']}
#2 {qualities: ['blue', 'okay', 'happy', 'decent', 'cheery']}
#3 {qualities: ['green', 'alright', 'happy', 'cool', 'cheery']}
]
Here's the People schema and model:
var peopleSchema = mongoose.Schema({
  qualities: [],
});
var People = mongoose.model('People', peopleSchema);
If we want to rank the documents by how many of the given qualities they match, we use this aggregate query:
People.aggregate([
  {$unwind: "$qualities"},
  {$match: {qualities: {$in: ["brown", "happy", "cool"]}}},
  {$group: {_id: "$_id", count: {$sum: 1}}},
  {$sort: {count: -1}}
]).exec(function(err, persons) {
  console.log(persons)
});
It will return 1, 3, 2 because the first one matched 3 items, the third one matched 2 items, and the second one matched 1 item.
Question
This aggregate works fast for my database of 10,000 people - in fact, it completed in 273.199ms. However, how will it fare for a MongoDB of 10 million entries? If these rates are proportional [100k:2.7s, 1m:27s, 10m:4m30s], it could take 4 minutes and 30 seconds. Perhaps the rate is not proportional, I do not know. But are there any optimizations or suggestions for querying such a large database, if my time hypothesis happens to be true?
Okay. Since you have asked, I will ask you to look into how the aggregate query works.
The aggregate query works on the basis of pipeline stages.
Now, what's a pipeline stage? From your example:
{$unwind:"$qualities"},
{$match:{'qualities': {$in: ["brown", "happy", "cool"]}}},
{$group:{_id:"$_id",count:{$sum:1}}},
{$sort:{count:-1}}
Here, the unwind, match, group and sort are all pipeline stages.
$unwind works by creating a new document for each element of the array field (in your case qualities) for better nested searching.
But if you keep $unwind as the first stage, it creates a performance overhead by unwinding unnecessary documents.
A better approach would be to keep $match as the first stage in the aggregation pipeline.
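For illustration, one way to apply that here while keeping the same per-person counts is to filter first and then re-apply the $match after the $unwind (a sketch, not a drop-in replacement):

People.aggregate([
  {$match: {qualities: {$in: ["brown", "happy", "cool"]}}},  // filter people first (can use an index)
  {$unwind: "$qualities"},
  {$match: {qualities: {$in: ["brown", "happy", "cool"]}}},  // keep only the matching qualities
  {$group: {_id: "$_id", count: {$sum: 1}}},
  {$sort: {count: -1}}
])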
Now, how fast is the aggregation query?
The aggregation query's speed depends upon the amount of data stored in the embedded array. If you store a million entries in the embedded qualities array, it will create a performance overhead while unwinding those million entries.
So, it all comes down to how you create your database schema. Also, for faster querying you can look into multikey indexing and sharding approaches for MongoDB.
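As a sketch of the indexing part, you could declare the index on the schema above; Mongoose builds it as a multikey index because qualities is an array:

// Multikey index on the qualities array so the initial $match can use it
// instead of scanning every document.
peopleSchema.index({ qualities: 1 });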

How to build an index in MongoDB in this situation

I have a MongoDB database whose documents have the following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
which is a reverse index for a log of docs (about 120 million).
There are two kinds of queries in my system.
One of them is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches the word "ipad" on date 20140113 and sorts all the docs by score.
The other query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of queries, what indexes should I build?
Should I build two indexes like this?
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1}
but I think build the two index use two much hard disk space.
So do you have some good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed by .explain().
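For example, in the mongo shell:

// "executionStats" shows the winning plan (which index was used) and how many
// documents and index keys were examined.
db.index.find({"word": "ipad", "date": 20140113, "docid": 324}).explain("executionStats")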
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes, keep in mind that every collection must have a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with a uniqueness constraint and you have no use for the _id field, you can move that unique constraint to the _id field to save an index. Your documents would then look like this:
{
  _id: {
    "word": "ipad",
    "date": 20140113,
    "docid": 324
  },
  "score": 98
}