I have a capped collection for storing server logs:
var schema = new mongoose.Schema({
level: { type: Number, required: true },
...
}, { capped: 64 * 1024 * 1024, versionKey: false });
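For reference, mongoose treats a bare number in the capped option as the cap size in bytes, so the schema above corresponds roughly to this shell-level definition (collection name assumed to be logs):

// 64 MB capped collection: documents are kept in insertion order and the oldest roll off.
db.createCollection("logs", { capped: true, size: 64 * 1024 * 1024 })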
I'm having trouble figuring out how to query logs by level range efficiently. Here's a sample query I want to run:
db.getCollection('logs').find({
level: { $gte: 2, $lte: 6 }
}).sort({ _id: -1 }).limit(500)
Indexing on { _id: 1, level: 1 } doesn't make any sense: _id is unique, so there is only a single level per _id, and in the worst case the whole collection will be scanned.
If I index on { level: 1, _id: -1 }, in the worst case Mongo pulls all logs for levels 2, 3, 4, 5 and 6, joins them and sorts them in memory, so performance is horrible. Sometimes it also decides to use the { _id: 1 } index, which is terrible too.
It could just walk these 5 index ranges in parallel and get the result while checking at most 504 documents. Or it could pull only the first 500 results from each level, so it would sort at most 2500 documents. But it won't; Mongo is just plain stupid when it comes to range queries.
The fastest solution I can think of is implementing the last-mentioned method on the client, i.e. running 5 queries and then merging them manually:
db.getCollection('logs').find({ level: 2 }).sort({ _id: -1 }).limit(500)
db.getCollection('logs').find({ level: 3 }).sort({ _id: -1 }).limit(500)
...
Merging can be done in O(n) on the client; there are only 7 log levels, so at most 7 queries will be executed and 3500 documents pulled from the database.
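A minimal sketch of that client-side merge (plain JavaScript; names are illustrative, and each query's results are assumed to have already been materialized into an array sorted by _id descending, e.g. with toArray()):

// resultSets: one array per level, each sorted by _id descending.
// Picks the globally newest document at each step until `limit` is reached.
// Comparing ObjectIds as strings preserves their order, since the string
// representations have equal length.
function mergeByIdDesc(resultSets, limit) {
  const positions = resultSets.map(() => 0);
  const merged = [];
  while (merged.length < limit) {
    let best = -1;
    for (let i = 0; i < resultSets.length; i++) {
      const doc = resultSets[i][positions[i]];
      if (doc === undefined) continue; // this set is exhausted
      if (best === -1 || String(doc._id) > String(resultSets[best][positions[best]]._id)) {
        best = i;
      }
    }
    if (best === -1) break; // all sets are exhausted
    merged.push(resultSets[best][positions[best]]);
    positions[best] += 1;
  }
  return merged;
}

// Usage: mergeByIdDesc([resultsLevel2, resultsLevel3, resultsLevel4, resultsLevel5, resultsLevel6], 500)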
Is there a better way?
Since you have only 7 levels, it may be worth considering the { level: 1, _id: -1 } index with an $or query:
db.logs.find({$or:[
{level: 2},
{level: 3},
{level: 4},
{level: 5},
{level: 6}
]}).sort({_id:-1}).limit(500)
Since these are equality conditions, it should make use of the index, but I have never tried it on a capped collection.
I would give it a try and run explain() to confirm it works, then probably enable the profiler and run a few other queries.
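A hedged way to check this in the shell: in the explain output, each $or branch should show up as an IXSCAN over { level: 1, _id: -1 }, the branches should be combined by a SORT_MERGE stage, and there should be no blocking SORT stage.

// Inspect the winning plan for the $or query above.
db.logs.find({ $or: [
  { level: 2 }, { level: 3 }, { level: 4 }, { level: 5 }, { level: 6 }
] }).sort({ _id: -1 }).limit(500).explain("executionStats")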
I use mongo 4.2.10
There is a collection sales with the following data:
{
"_id": ObjectId("501f1f77bcf86cd799439011"),
"status": "BILLED",
"sellerId": ObjectId("507f1f77bcf86cd799439011"),
"productType": "BIKE",
"saleDate": ISODate("2013-10-02T01:11:18.965Z"),
...
}
So imagine this collection contains about 1 billion records.
I have a requirement to select data by filtering by saleDate and sorting by status, sellerId, productType, saleDate with pagination. This request is fixed and never changes.
So I do like
db.sales.find({
saleDate: {$gte: ISODate("2020-01-01T00:00:00.000Z"),
$lte: ISODate("2020-02-01T00:00:00.000Z")}
}).sort({
status: 1,
sellerId: 1,
productType: 1,
saleDate: 1
}).skip(0).limit(1000);
The problem I see is related to the index.
The following index:
{
saleDate: 1,
status: 1,
sellerId: 1,
productType: 1
}
It would search by saleDate effectively, but wouldn't the sort be inefficient? Since saleDate has many different values, won't that make the index tree very inefficient and the index itself eat memory? I have no idea what would be better or how to create the most effective index here.
To build an efficient index for your query, you want to follow the ESR (Equality, Sort, Range) rule.
When creating compound indexes, the first fields should be the ones you do exact matches on, followed by the fields you sort on, and finally the fields you run range queries on.
$gte and $lte are considered range-based operators.
Therefore, the best index for this query will most likely be
{
status: 1,
sellerId: 1,
productType: 1,
saleDate: 1
}
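A minimal shell sketch, assuming the collection is named sales as in the question; the sort fields come first and the saleDate range field last, so the explain output should show the sort satisfied by the index order with no blocking SORT stage:

// ESR ordering: no equality fields here, then the sort fields, then the range field.
db.sales.createIndex({ status: 1, sellerId: 1, productType: 1, saleDate: 1 })

// Verify: the winning plan should be an IXSCAN on this index without a SORT stage.
db.sales.find({
  saleDate: { $gte: ISODate("2020-01-01T00:00:00.000Z"),
              $lte: ISODate("2020-02-01T00:00:00.000Z") }
}).sort({ status: 1, sellerId: 1, productType: 1, saleDate: 1 })
  .skip(0).limit(1000).explain("executionStats")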
I have the following data and I would like to know how many accounts have LogCounts >= 7:
Account | LogCounts
--------|----------
AAA     | 2
BBB     | 7
AAA     | 7
AAA     | 8
AAA     | 3
CCC     | 2
Here is my working MongoDB pipeline
[
{
'$match': {
'LogCounts': {
'$gt': 6
}
}
}, {
'$project': {
'Account': 1
}
}, {
'$group': {
'_id': '$Account'
}
}, {
'$count': 'FinalAccountCounts'
}
]
But it took about 5 minutes on a collection of ~800 million records. I'd like to know if there's a better, faster or more efficient way to solve this problem.
Thank you.
From a query perspective you are pretty much as efficient as you can be; this is under the assumption that you have an index on LogCounts.
What you can try is to split your query into different ranges of LogCounts values; this requires prior knowledge of your data distribution, but with this approach you can "map-reduce" the query results. It will not help much if the cardinality is extremely low, e.g. if LogCounts has a max value of 7.
Let's assume for a second that LogCounts can only be in the range of 7 to 21, i.e. in [7,8,9,10,...,20,21], with an equal distribution for each value.
In this case you could execute several queries, each of them covering only a certain range of the values; for example, 5 queries at once would look like:
// in nodejs, all queries are executed at once.
const results = await Promise.all([
  db.collection.aggregate([ { $match: { LogCounts: { $gte: 7, $lte: 10 } } }, ...restOfPipeline]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 10, $lte: 13 } } }, ...restOfPipeline]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 13, $lte: 16 } } }, ...restOfPipeline]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 16, $lte: 19 } } }, ...restOfPipeline]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 19, $lte: 21 } } }, ...restOfPipeline]).toArray(),
])
// now merge results in memory.
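For the in-memory reduce to be correct, each split query should return the distinct accounts themselves (i.e. restOfPipeline should end at the $group stage rather than at $count), because the same account can fall into more than one bucket; the client then unions the sets and counts. A minimal sketch under that assumption:

// Each element of `results` is an array of { _id: <Account> } docs from the $group stage.
// Union the accounts across buckets, then count the distinct ones.
const distinctAccounts = new Set();
for (const bucket of results) {
  for (const doc of bucket) {
    distinctAccounts.add(String(doc._id));
  }
}
const finalAccountCounts = distinctAccounts.size;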
I can tell you that from testing this approach on an 800M-document collection with string values and high cardinality, the single query ran in 3 min, while splitting it into 2 queries brought it down to 2.1 min, including the "reduce" part in memory.
Obviously optimizing this will require trial and error, as you have several parameters to consider: the number of buckets, the value cardinality (one query could cover 7 to 10 and another 10 to 21, depending on distribution), the number of results, etc.
If you do end up choosing my approach I'd be happy to get an update after some testing.
Consider the following aggregation pipeline, which returns the newest entry for each distinct "internal_id":
db.locations.aggregate({$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}})
This call takes up to 10 seconds, which is not acceptable. The collection is not so huge:
db.locations.count()
1513671
So I guess there's something wrong with the indexes. I tried to create many indexes and none of them made an improvement; currently I kept the two that were supposed to be enough imho: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.
MongoDB is NOT sharded and is running as a 3-host replica set on version 3.6.14.
The MongoDB log shows the following:
2020-05-30T12:21:18.598+0200 I COMMAND [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677ms
Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:
{$sort: {timestamp: -1}}: sort all documents by the timestamp.
{$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}: go through these timestamp-sorted documents and group them by the id. Because everything is sorted by timestamp at this point (rather than by id), it'll end up being a decent amount of work.
You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.
You want to force mongo to come up with a better query plan than that, one that uses the
{internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work -- it depends on how well it's able to calculate the query plan.
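A hedged sketch of passing that hint in the shell (aggregate accepts a hint option as of MongoDB 3.6; collection name as in the question):

// Ask the planner to use the { internal_id: 1, timestamp: -1 } index explicitly.
db.locations.aggregate(
  [
    { $sort: { timestamp: -1 } },
    { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } }
  ],
  { hint: { internal_id: 1, timestamp: -1 } }
)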
If providing that hint doesn't work, one alternative would be to break this query into 2 parts that each use an appropriate index.
Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id you may need to dedupe).
If you wanted to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.
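A hedged sketch of that combined form, using a pipeline-style $lookup (available since MongoDB 3.6), with the collection and field names from the question and assuming timestamps are unique per internal_id; note that depending on the server version the $expr match inside the $lookup may not be able to use the index:

db.locations.aggregate([
  // Step 1: latest timestamp per internal_id (can use the { internal_id: 1, timestamp: -1 } index).
  { $group: { _id: "$internal_id", timestamp: { $max: "$timestamp" } } },
  // Step 2: self-join back to locations to fetch the full document for each pair.
  { $lookup: {
      from: "locations",
      let: { id: "$_id", ts: "$timestamp" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$internal_id", "$$id"] },
            { $eq: ["$timestamp", "$$ts"] }
        ] } } },
        { $limit: 1 }
      ],
      as: "doc"
  } },
  { $unwind: "$doc" },
  { $replaceRoot: { newRoot: "$doc" } }
])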
So finally I've been able to do all the testing; here are all the versions I tried, thanks to willis' answer, and the results:
Original aggregate query
mongo_query = [
{"$match": group_filter},
{"$sort": {"timestamp": -1}},
{"$group": {"_id": "$internal_id", "doc": {"$first": "$$ROOT"}}},
]
res = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query)
res = await res.to_list(None)
9.61 seconds
Give MongoDB a hint to use the proper index (filter on internal_id first)
from bson.son import SON
cursor = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query, hint=SON([("internal_id", 1), ("timestamp", -1)]))
res = await cursor.to_list(None)
Not working; MongoDB replies with an exception saying the sort consumes too much memory.
Split the aggregation, to first find the latest timestamp for each internal_id
cursor = mongo.db[self.factory.config.mongo_collection].aggregate([{"$group": {"_id": "$internal_id", "timestamp": {"$max": "$timestamp"}}}])
res = await cursor.to_list(None)
or_query = []
for entry in res:
or_query.append({"internal_id": entry["_id"], "timestamp": entry["timestamp"]})
cursor = mongo.db[self.factory.config.mongo_collection].find({"$or": or_query})
fixed_res = await cursor.to_list(None)
1.88 seconds, a lot better but still not that fast
Parallel coroutines (and the winner is....)
In the meantime, as I already have the list of internal_ids and I'm using asynchronous Python, I went for parallel coroutines, getting the latest entry for each internal_id concurrently:
import asyncio
from typing import Awaitable, Dict, List

fixed_res: List[Dict] = []

async def get_one_result(db_filter: Dict) -> None:
    """ Coroutine getting one result for each known internal ID """
    cursor = mongo.db[self.factory.config.mongo_collection].find(db_filter).sort("timestamp", -1).limit(1)
    res = await cursor.to_list(1)
    if res:
        fixed_res.append(res[0])

coros: List[Awaitable] = []
for internal_id in self.list_of_internal_ids:
    coro = get_one_result({"internal_id": internal_id})
    coros.append(coro)

await asyncio.gather(*coros)
0.5 seconds, way better than the others
If you don't have a list of internal_id
There's an alternative I did not implement, but I confirmed the call is very fast: use the low-level distinct command against the {internal_id: 1} index to retrieve the list of individual IDs, then use the parallel calls above.
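For reference, the low-level call is just this in the shell (collection name as in the question); with an index whose prefix is { internal_id: 1 } it can be answered with a DISTINCT_SCAN over the index:

// Returns the list of distinct internal_id values to feed the parallel
// "latest entry per internal_id" queries shown above.
db.locations.distinct("internal_id")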
I have a collection "activities" with model:
{
_id: ObjectId,
action: String,
userId: ObjectId,
time: Date
}
I have a query:
db.activities.find({
action: {
$in: [
"1stAction",
"2ndAction",
...,
"20thAction",
]
},
userId: {
$in: [
ObjectId("1stUserId"),
ObjectId("2ndUserId"),
...,
ObjectId("500thUserId"),
]
},
time: {
$lt: new Date()
}
}).sort({
time: -1
}).limit(20)
Now I have a compound index {userId: 1, action: 1, time: -1}.
I guarantee that the index is not a multikey one (so it doesn't contain array values).
The problem starts when the count of (actions * userIds) is greater than 200.
When the count is below 200, MongoDB splits every unique userId+action pair into a separate index-covered query, merges the results with a SORT_MERGE stage, and finishes very fast.
When the count is greater than 200 (the default can be changed by setting the internalQueryMaxScansToExplode option in the MongoDB config file), MongoDB runs a query that fetches documents and does not use the index alone, so it works very slowly.
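For reference, a hedged way to see which of the two plans was chosen is to look at the explain output of the same query: the fast case shows many IXSCAN branches merged by a SORT_MERGE stage, while the slow case shows a single IXSCAN followed by FETCH and a blocking SORT.

// Placeholders as in the question; substitute real action strings and 24-hex ObjectIds.
db.activities.find({
  action: { $in: ["1stAction", "2ndAction" /* ... */] },
  userId: { $in: [ObjectId("1stUserId"), ObjectId("2ndUserId") /* ... */] },
  time: { $lt: new Date() }
}).sort({ time: -1 }).limit(20).explain("executionStats")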
I can't increase the internalQueryMaxScansToExplode setting, per the recommendation of the MongoDB team (SERVER-24314), because it causes a server crash.
I tried to build another index, {time: -1, userId: 1, action: 1}, but unfortunately there are so many records in the collection that the query, even using this index, takes at least 30 seconds and scans 30,000,000 index entries. That is not acceptable either.
I would like to fetch documents by 20 actions and 500 users in one query.
What is the best solution for this problem?
I am trying to filter a data collection by making a query and storing the results in a smaller collection. However, the number of records found using count() and the number in the new collection are very different (count() is much higher). Am I doing something wrong?
This returns about 110 million.
db.getCollection('ex').count({
'data.points': {$exists: true},
'data.points.points': {$exists: false},
}, {
'data.id': 1,
'data.author.id': 1
})
Then I execute this.
db.getCollection('ex').find({
'data.points': {$exists: true},
'data.points.points': {$exists: false},
}, {
'data.id': 1,
'data.author.id': 1
})
.forEach(function (doc) {
db.s_uid_wid.insert(doc)
})
But this only gives about 5 million records. They should be exactly the same. What is going on?
db.getCollection('s_uid_wid').count({})
Edit
Previously I was running this in the Robomongo GUI and it gave the impression that everything was fine. Now I tried it in the mongo shell and got this:
2016-02-04T00:39:21.735+0800 Error: getMore: cursor didn't exist on server, possible restart or timeout? at src/mongo/shell/query.js:116
The following fixes the problem. It takes about one day to complete the insertion.
db.getCollection('ex').find({
'data.points': {$exists: true},
'data.points.points': {$exists: false},
}, {
'data.id': 1,
'data.author.id': 1
}).addOption(DBQuery.Option.noTimeout)
.forEach(function (doc) {
db.s_uid_wid.insert(doc)
})