I recently started migrating data from Microsoft SQL Server to MongoDB for scalability. The migration itself went fine.
Each document has two important fields: customer and timestamphash (year/month/day).
We imported only 75 million documents into MongoDB, which we installed on an Azure Linux VM.
After adding a compound index on both fields, we are having the following problem:
On 3 million documents (after filtering), it takes 24 seconds to finish an aggregate that groups and counts by customerId. SQL Server returns the result in less than 1 second on the same data.
Do you think Cassandra would be a better solution? We need query performance on large volumes of data.
I tried tuning disk writes and giving the VM more RAM. Nothing helped.
Query:
aggregate([
{ "$match" : { "Customer" : 2 } },
{ "$match" : { "TimestampHash" : { "$gte" : 20160710 } } },
{ "$match" : { "TimestampHash" : { "$lte" : 20190909 } } },
{ "$group" : { "_id" : { "Device" : "$Device" }, "__agg0" : { "$sum" : 1 } } },
{ "$project" : { "Device" : "$_id.Device", "Count" : "$__agg0", "_id" : 0 } },
{ "$skip" : 0 },
{ "$limit" : 10 }])
Update:
I used 'allowDiskUse: true' and the problem was solved: the query time dropped to 4 seconds for the 3M filtered documents.
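For reference, a minimal sketch of how that option is passed from the shell, using the same pipeline as above ("db.collection" is a placeholder; the .NET driver exposes an equivalent AllowDiskUse setting on its aggregate options):

// Sketch: same stages as above, with allowDiskUse enabled so stages that
// exceed the 100 MB memory limit can write temporary files to disk.
db.collection.aggregate([
    { "$match" : { "Customer" : 2 } },
    { "$match" : { "TimestampHash" : { "$gte" : 20160710 } } },
    { "$match" : { "TimestampHash" : { "$lte" : 20190909 } } },
    { "$group" : { "_id" : { "Device" : "$Device" }, "__agg0" : { "$sum" : 1 } } },
    { "$project" : { "Device" : "$_id.Device", "Count" : "$__agg0", "_id" : 0 } },
    { "$skip" : 0 },
    { "$limit" : 10 }
], { allowDiskUse: true })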
I have encountered a similar problem before (see this question), and to be honest, Cassandra may well be better in your particular case, but the question here is about MongoDB aggregation query optimization, right?
Right now, one of my collections has more than 3M docs, and aggregation queries shouldn't take 24s if you build your indexes correctly.
First of all, check the index usage via MongoDB Compass. Is Mongo actually using it? If your app is hammering the DB with queries and your index shows 0 usage, then, as you already guessed, something is wrong with your index.
Second, use the explain method (this doc will help you out) to get more information about your query.
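A minimal sketch of checking the plan from the shell, assuming the Customer/TimestampHash fields from your question and a placeholder collection name:

// Sketch: ask the planner which index (if any) the $match stage uses.
// On older servers, use db.collection.aggregate(pipeline, { explain: true }) instead,
// which reports only the query plan without execution statistics.
db.collection.explain("executionStats").aggregate([
    { $match: { Customer: 2, TimestampHash: { $gte: 20160710, $lte: 20190909 } } },
    { $group: { _id: "$Device", count: { $sum: 1 } } }
])
// Look for an IXSCAN (rather than a COLLSCAN) in the winning plan, and compare
// the number of documents examined with the number of documents returned.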
Third: index field order matters. For example, if you have a $match stage with 3 fields and you request docs by those fields:
{ $match: {a_field:a, b_field:b, c_field:c} }
then you should build the compound index on the a, b, c fields in that exact order.
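A minimal sketch of that index, using the hypothetical a_field/b_field/c_field names from the example above:

// Sketch: the key order mirrors the order the query filters on.
db.collection.createIndex({ a_field: 1, b_field: 1, c_field: 1 })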
There is almost always some kind of DB architecture issue as well. I highly recommend that you not stockpile all data inside one collection. Use {timestamps: true} on insertion (it creates two fields, createdAt and updatedAt):
{
timestamps: true
}
in your schema, store old/outdated data in a separate collection, and use the $lookup aggregation stage for it when you really need to operate on that data.
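A minimal sketch of that idea, assuming hypothetical collection names "readings" (current data) and "readings_archive" (outdated data) that share a Customer field:

// Sketch: keep recent documents in "readings", move outdated ones to
// "readings_archive", and pull the archived ones in only when needed.
db.readings.aggregate([
    { $match: { Customer: 2 } },
    { $lookup: {
        from: "readings_archive",
        localField: "Customer",
        foreignField: "Customer",
        as: "archived"
    } }
])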
Hope you'll find something useful in my answer.
I've gone through almost 10 similar posts here on SO, and I'm still confused by the results I am getting: 5+ seconds for a sort on a foreign field on a single $lookup aggregation between a 42K document collection and a 19 record collection, i.e., a total cross product of 798K.
Unfortunately, denormalization is not a great option here as the documents in the 'to' collection are heavily shared and would require a massive amount of updates across the database when changes are made.
That being said, I can't seem to understand why the following would take this long regardless. I feel like I must be doing something wrong.
The context:
A 4 vCPU, 16 GB RAM VM is running Debian 10 / MongoDB 4.4 as a single-node replica set and nothing else. Fully updated .NET MongoDB driver (I just updated a moment ago and re-tested).
There is one lookup in the aggregation, with the 'from' collection having 42K documents and the 'to' collection having 19 documents.
All aggregations, indexes, and collections are using default collation.
The foreign field in the 'to' collection has an index. Yes, just for those 19 records in case it would make a difference.
One of the posts regarding slow $lookup performance mentioned that if $eq was not used within the nested pipeline of the $lookup stage, the index wouldn't be used. So I made sure the aggregation pipeline used an $eq operator.
Here's the pipeline:
[{ "$lookup" :
{ "from" : "4",
"let" : { "key" : "$1" },
"pipeline" :
[{ "$match" :
{ "$expr" :
{ "$eq" : ["$id", { "$substrCP" : ["$$key", 0, { "$indexOfCP" : ["$$key", "_"] }] }] } } },
{ "$unwind" : { "path" : "$RF2" } },
{ "$match" : { "$expr" : { "$eq" : ["$RF2.id", "$$key"] } } },
{ "$replaceRoot" : { "newRoot" : "$RF2" } }],
"as" : "L1" } },
{ "$sort" : { "L1.5" : 1 } },
{ "$project" : { "L1" : 0 } },
{ "$limit" : 100 }]
Taking out the nested $unwind/$match/$replaceRoot combo takes away about 30% of the run time, bringing it down to 3.5 seconds; however, those stages are necessary to look up the proper subdocument. Sorts on the 'from' collection with no lookup required are done within 0.5 seconds.
What am I doing wrong here? Thanks in advance!
Edit:
I've just tested the same thing with a larger set of records (38K records in the 'from' collection, 26K records in the 'to' collection, one-to-one relationship). It took over 7 minutes to complete the sort. I checked in Compass and saw that the index on "id" was actually being used (I kept refreshing during the 7 minutes and saw its usage count rise; I'm currently the only user of the database).
Here's the pipeline, which is simpler than the first:
[{ "$lookup" :
{ "from" : "1007",
"let" : { "key" : "$13" },
"pipeline" :
[{ "$match" :
{ "$expr" : { "$eq" : ["$id", "$$key"] } } }],
"as" : "L1" } },
{ "$sort" : { "L1.1" : -1 } },
{ "$project" : { "L1" : 0 } },
{ "$limit" : 100 }]
Does 7 minutes sound reasonable given the above info?
Edit 2:
shell code to create two 40k record collections (prod and prod2) with two fields (name: string, uid: integer):
var randomName = function() {
    return (Math.random() + 1).toString(36).substring(2);
};

for (var i = 1; i <= 40000; ++i) {
    db.test.insert({
        name: randomName(),
        uid: i
    });
}
I created an index on the 'uid' field on prod2, increased the sample document limit of Compass to 50k, then did just the following lookup, which took two full minutes to compute:
{ from: 'prod2',
localField: 'uid',
foreignField: 'uid',
as: 'test' }
Edit 3:
I've also just run the aggregation pipeline directly from the shell and got results within a few seconds instead of two minutes:
db.test1.aggregate([{ $lookup:
{ from: 'test2',
localField: 'uid',
foreignField: 'uid',
as: 'test' } }]).toArray()
What's causing the discrepancy between the shell and both Compass and the .NET driver?
For anyone stumbling upon this post, the following worked for me: using the localField/foreignField form of the $lookup stage.
When monitoring the index in Compass, both the let/pipeline and the localField/foreignField versions hit the appropriate index, but the let/pipeline version was orders of magnitude slower.
I restructured my query building logic to only use localField/foreignField and it has made a world of difference.
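For reference, a minimal sketch of the second pipeline from the question rewritten in the localField/foreignField form (the unnamed source collection is shown as "coll"; collection "1007" and fields "13"/"id" are the ones from the question):

// Sketch: same lookup as the second pipeline above, expressed with
// localField/foreignField instead of let/pipeline.
db.coll.aggregate([
    { $lookup: { from: "1007", localField: "13", foreignField: "id", as: "L1" } },
    { $sort: { "L1.1": -1 } },
    { $project: { L1: 0 } },
    { $limit: 100 }
])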
I am using mongodb 3.6.8.
I have a collection (called states) with an ObjectId field (sensor_id), a date, as well as a few other fields.
I created a compound index on the collection:
db.states.createIndex({ "sensor_id" : 1, "date" : 1 });
I am using the aggregation framework with a match stage, for example:
{
"$match" : {
"sensor_id" : { "$oid" : "5b8fd62c4f0cd13c05296df7"},
"date" : {
"$gte" : { "$date" : "2018-10-06T04:19:00.000Z"},
"$lt" : { "$date" : "2018-10-06T10:19:09.000Z"}
}
}
}
My problem: as the states collection gets bigger, the aggregation pipeline gets slower and slower, even though the documents being added fall outside the dates in the match filter. With this index, I really expected performance not to vary much as the collection grows.
Other info:
The states collection does not have very many documents (about 200,000), of which about 20,000 are matched by the above filter.
The indexes in the collection (and other collections) are just a few megabytes and easily fit in memory.
The aggregation pipeline does not insert any documents in the database.
Can anyone suggest what I should investigate to explain the pretty drastic fall in performance as the collection grows (with new documents outside the date range in $match)?
Thank you.
Solved. Another stage in the pipeline was processing ever more data as time went on. Nothing was wrong with the indexes.
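For anyone hitting a similar slowdown, a minimal sketch of how such a stage can be spotted (the $match is the one from the question; on 3.6 this returns only the query plan, while newer servers also support db.states.explain("executionStats").aggregate(...) with per-stage document counts):

// Sketch: inspect the plan and, on newer servers, how many documents each
// stage handles; a stage whose counts grow with the collection is the culprit.
db.states.aggregate([
    { $match: {
        sensor_id: ObjectId("5b8fd62c4f0cd13c05296df7"),
        date: { $gte: ISODate("2018-10-06T04:19:00.000Z"),
                $lt:  ISODate("2018-10-06T10:19:09.000Z") } } }
    // ...the remaining stages of the pipeline...
], { explain: true })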
Suppose I have an array of objects as below.
"array" : [
{
"id" : 1
},
{
"id" : 2
},
{
"id" : 2
},
{
"id" : 4
}
]
If I want to retrieve multiple objects ({id : 2}) from this array, the aggregation query goes like this.
db.coll.aggregate([
    { $match : { "_id" : ObjectId("5492690f72ae469b0e37b61c") } },
    { $unwind : "$array" },
    { $match : { "array.id" : 2 } },
    { $group : { _id : "$_id", array : { $push : { id : "$array.id" } } } }
])
The output of above aggregation is
{
"_id" : ObjectId("5492690f72ae469b0e37b61c"),
"array" : [
{
"id" : 2
},
{
"id" : 2
}
]
}
Now the question is:
1) Is retrieving multiple objects from an array possible using find() in MongoDB?
2) With respect to performance, is aggregation the correct way to do this (given that we need to use four pipeline operators)?
3) Can we use Java-side manipulation (looping over the array and keeping only the {id : 2} objects) after a find({"_id" : ObjectId("5492690f72ae469b0e37b61c")}) query? find retrieves the document once and keeps it in RAM, but if we use aggregation, four operations need to be performed in RAM to get the output.
The reason I ask question 3) is: if thousands of clients access the database at the same time, RAM will be overloaded. If the filtering is done in Java, there is less work for RAM.
4) For how long will the working set stay in RAM?
Is my understanding correct? Please correct me if I am wrong, and help me get the right insight on this.
1) No. You can project the first matching element with the positional $ operator, project all of them, or project none of them.
2) No-ish. If you have to work with this array, aggregation is what will allow you to extract multiple matching elements, but the correct solution, conceptually and for performance, is to design your document structure so this problem does not arise, or arises only for rare queries whose performance is not particularly important.
3) Yes.
4) We have no information that would allow us to give a reasonable answer to this question. This is also out of scope relative to the rest of the question and should be a separate question.
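To illustrate the first point, a minimal sketch of find() with the positional $ projection against the document from the question (collection name "coll" as above); only the first matching element comes back, which is why aggregation is needed to get all of them:

// Sketch: the positional $ projection returns only the FIRST element
// of "array" that matches the query condition.
db.coll.find(
    { "_id" : ObjectId("5492690f72ae469b0e37b61c"), "array.id" : 2 },
    { "array.$" : 1 }
)
// => one { "id" : 2 } element, even though two elements match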
I'm quite new to MongoDB and I'm trying to get my head around the grouping/counting functions. I want to retrieve a list of votes per track ID, ordered by vote count. Can you help me make sense of this?
List schema:
{
"title" : "Bit of everything",
"tracks" : [
ObjectId("5310af6518668c271d52aa8d"),
ObjectId("53122ffdc974dd3c48c4b74e"),
ObjectId("53123045c974dd3c48c4b74f"),
ObjectId("5312309cc974dd3c48c4b750")
],
"votes" : [
{
"track_id" : "5310af6518668c271d52aa8d",
"user_id" : "5312551c92d49d6119481c88"
},
{
"track_id" : "53122ffdc974dd3c48c4b74e",
"user_id" : "5310f488000e4aa02abcec8e"
},
{
"track_id" : "53123045c974dd3c48c4b74f",
"user_id" : "5312551c92d49d6119481c88"
}
]
}
I'm looking to generate the following result (ideally ordered by the number of votes, and also ideally including tracks with no entries in the votes array, using tracks as a reference):
[
{
track_id: 5310af6518668c271d52aa8d,
track_votes: 1
},
{
track_id: 5312309cc974dd3c48c4b750,
track_votes: 0
}
...
]
In MySQL, I would execute the following
SELECT COUNT(*) AS track_votes, track_id FROM list_votes GROUP BY track_id ORDER BY track_votes DESC
I've been looking into the documentation for the various aggregation/reduce functions, but I just can't seem to get anything working for my particular case.
Can anyone help me get started with this query? Once I know how these are structured I'll be able to apply that knowledge elsewhere!
I'm using MongoDB version 2.0.4 on Ubuntu 12, so I don't think I have access to the aggregation functions provided in later releases. I'd be willing to upgrade if that's the general consensus.
Many thanks in advance!
I recommend upgrading your MongoDB to 2.2 and using the Aggregation Framework as follows:
db.collection.aggregate(
{ $unwind:"$votes"},
{ $group : { _id : "$votes.track_id", number : { $sum : 1 } } }
)
MongoDB 2.2 introduced a new aggregation framework, modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.
The output will look like this:
{
"result":[
{
"_id" : "5310af6518668c271d52aa8d", <- ObjectId
"number" : 2
},
...
],
"ok" : 1
}
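Since you also want the results ordered by the number of votes, a $sort stage can be appended; a minimal sketch (descending order assumed):

db.collection.aggregate(
    { $unwind : "$votes" },
    { $group : { _id : "$votes.track_id", number : { $sum : 1 } } },
    { $sort : { number : -1 } }
)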
If upgrading is not possible, I recommend doing it programmatically instead.
I hate this kind of question, but maybe you can point me to something obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents that has a string field called username, on which I have an index. The index was non-unique, but I recently made it unique. Suddenly the following query gives me false alarms that I have duplicates.
db.users.aggregate(
{ $group : {_id : "$username", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} } );
which returns
{
"result" : [
{
"_id" : "davidbeges",
"total" : 2
},
{
"_id" : "jesusantonio",
"total" : 2
},
{
"_id" : "elesitasweet",
"total" : 2
},
{
"_id" : "theschoolofbmx",
"total" : 2
},
{
"_id" : "longflight",
"total" : 2
},
{
"_id" : "thenotoriouscma",
"total" : 2
}
],
"ok" : 1
}
I tested this query on a sample collection with a few documents and it works as expected.
Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.
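For reference, a minimal sketch of the adjusted pipeline with the extra $sort placed first (same fields as the query above):

db.users.aggregate(
    { $sort : { username : 1 } },
    { $group : { _id : "$username", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } }
)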
I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match $sort $limit $skip
And they will work if placed before:
$project $unwind $group
However, $group by itself will not use an index. When you do your find() test, I am betting you are using the index, possibly as a covered index (you can verify by looking at explain() for that query), rather than scanning the collection. Basically, my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but does not alter the results fed into $group, then you can avoid the issue.