Efficient group distinct count in MongoDB aggregation

I have the following data and I would like to know how many accounts have LogCounts >= 7:
Account  LogCounts
AAA      2
BBB      7
AAA      7
AAA      8
AAA      3
CCC      2
Here is my working MongoDB pipeline
[
    {
        '$match': {
            'LogCounts': {
                '$gt': 6
            }
        }
    },
    {
        '$project': {
            'Account': 1
        }
    },
    {
        '$group': {
            '_id': '$Account'
        }
    },
    {
        '$count': 'FinalAccountCounts'
    }
]
But it took about 5 minutes on a collection of ~800 million records. I'd like to know if there's a better, faster, or more efficient way to solve this problem.
Thank you.

From a query perspective you are pretty much as efficient as you can be; this is under the assumption that you have an index on LogCounts.
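For reference, a minimal sketch of such an index; the compound form with Account is my own suggestion rather than part of the original setup, and depending on the projection it may let more of the pipeline be answered from the index alone:
db.collection.createIndex({ LogCounts: 1 })
// or, to also keep the Account field used by $project/$group in the index:
db.collection.createIndex({ LogCounts: 1, Account: 1 })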
What you can try is to split your query into different ranges of LogCounts values; this will require prior knowledge of your data distribution, but using this approach you can "map-reduce" the query results. This approach will not help much if the cardinality is extremely low, e.g. if LogCounts has a max value of 7.
Let's assume for a second that LogCounts can only be in the range of 7 to 21, i.e. in [7, 8, 9, 10, ..., 20, 21], with an equal distribution for each value.
In this case you could execute x queries, each of them querying only a certain range of the values.
For example, 5 queries executed at once would look like this:
// in Node.js, all queries are executed at once
const results = await Promise.all([
    db.collection.aggregate([{ $match: { LogCounts: { $gte: 7, $lte: 10 } } }, ...restOfPipeline]).toArray(),
    db.collection.aggregate([{ $match: { LogCounts: { $gt: 10, $lte: 13 } } }, ...restOfPipeline]).toArray(),
    db.collection.aggregate([{ $match: { LogCounts: { $gt: 13, $lte: 16 } } }, ...restOfPipeline]).toArray(),
    db.collection.aggregate([{ $match: { LogCounts: { $gt: 16, $lte: 19 } } }, ...restOfPipeline]).toArray(),
    db.collection.aggregate([{ $match: { LogCounts: { $gt: 19, $lte: 21 } } }, ...restOfPipeline]).toArray(),
])
// now merge the results in memory
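Note that the "reduce" step has to deduplicate accounts across buckets, since the same account can appear in more than one range. A minimal sketch, assuming each bucket's pipeline ends at the $group stage (so each returns one { _id: <Account> } document per distinct account in its range) rather than at $count:
const distinctAccounts = new Set();
for (const bucket of results) {
    for (const doc of bucket) {
        distinctAccounts.add(doc._id);   // doc._id is the Account value from $group
    }
}
const finalAccountCount = distinctAccounts.size;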
I can tell you that from testing this approach on an 800M-document collection with string values and high cardinality, my single query ran in 3 minutes; when I split it into 2 queries I managed to run it in 2.1 minutes, including the "reduce" part in memory.
Obviously, optimizing this will require trial and error, as you have several parameters to consider: the number of buckets, value cardinality (one query could cover 7 to 10 and another 10 to 21, depending on the distribution), the number of results, etc.
If you do end up choosing my approach I'd be happy to get an update after some testing.

Related

What is the best practice to find mongo documents count?

I wanted to know the performance difference between countDocuments and a find query.
I have to find the count of documents based on a certain filter; which approach is better and takes less time?
db.collection.countDocuments({ userId: 12 })
or
db.collection.find({ userId: 12 }) and then using the length of the resulting array.
You should definitely use db.collection.countDocuments() if you don't need the data. This method uses an aggregation pipeline with the filter you pass in and only returns the count, so you don't waste processing and time waiting for an array with all the results.
This:
db.collection.countDocuments({ userId: 12 })
Is equivalent to:
db.collection.aggregate([
    { $match: { userId: 12 } },
    { $group: { _id: null, n: { $sum: 1 } } }
])
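Either way, how fast the count itself runs depends on an index matching the filter. A minimal sketch, with the field name taken from the question:
db.collection.createIndex({ userId: 1 })
// the $match in the equivalent pipeline above can now use this index
// instead of scanning the whole collection
db.collection.countDocuments({ userId: 12 })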

Aggregation pipeline "latest for all distinct id" is very slow, need to create proper indexes?

Considering the following aggregation pipeline code to return newest entry for all distinct "internal_id":
db.locations.aggregate({$sort: {timestamp: -1}}, {$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}})
This call takes up to 10 seconds, which is not acceptable. The collection is not so huge:
db.locations.count()
1513671
So I guess there's something wrong with the indexes; however, I tried to create many indexes and none of them made an improvement. Currently I kept the two that were supposed to be enough imho: {timestamp: -1, internal_id: 1} and {internal_id: 1, timestamp: -1}.
MongoDB is NOT sharded, and is running as a 3-host replica set on version 3.6.14.
The MongoDB log shows the following:
2020-05-30T12:21:18.598+0200 I COMMAND [conn12652918] command mydb.locations appName: "MongoDB Shell" command: aggregate { aggregate: "locations", pipeline: [ { $sort: { timestamp: -1.0 } }, { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } } ], cursor: {}, lsid: { id: UUID("70fea740-9665-4068-a2b5-b7b0f10dcde9") }, $clusterTime: { clusterTime: Timestamp(1590834060, 34), signature: { hash: BinData(0, 9DFB6DBCEE52CFA3A5832DC209519A8E9D6F1204), keyId: 6783976096153993217 } }, $db: "mydb" } planSummary: IXSCAN { timestamp: -1, ms_id: 1 } cursorid:8337712045451536023 keysExamined:1513708 docsExamined:1513708 numYields:11838 nreturned:101 reslen:36699 locks:{ Global: { acquireCount: { r: 24560 } }, Database: { acquireCount: { r: 12280 } }, Collection: { acquireCount: { r: 12280 } } } protocol:op_msg 7677ms
Mongo aggregations are theoretically descriptive (in that you describe what you want to have happen, and the query optimizer figures out an efficient way of doing that calculation), but in practice many aggregations end up being procedural & not optimized. If you take a look at the procedural aggregation instructions:
{$sort: {timestamp: -1}}: sort all documents by the timestamp.
{$group: {_id: "$internal_id", doc: {$first: "$$ROOT"}}}: go through these timestamp-sorted documents and then group them by the id. Because everything is sorted by timestamp at this point (rather than id), it'll end up being a decent amount of work.
You can see that this is what mongo is actually doing by taking a look at that log line's query plan: planSummary IXSCAN { timestamp: -1, ms_id: 1 }.
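If you don't want to dig through the logs, you can also ask for the plan directly. A minimal sketch from the mongo shell:
db.locations.explain("executionStats").aggregate([
    { $sort: { timestamp: -1 } },
    { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } }
])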
You want to force mongo to come up with a better query plan than that, one that uses the {internal_id: 1, timestamp: -1} index. Giving it a hint to use this index might work -- it depends on how well it's able to calculate the query plan.
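For example, from the mongo shell (a sketch only; the hint option for aggregate is available from MongoDB 3.6, which matches the version in the question):
db.locations.aggregate(
    [
        { $sort: { timestamp: -1 } },
        { $group: { _id: "$internal_id", doc: { $first: "$$ROOT" } } }
    ],
    { hint: { internal_id: 1, timestamp: -1 } }
)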
If providing that hint doesn't work, one alternative would be to break this query into 2 parts that each use an appropriate index.
Find the maximum timestamp for each internal_id. db.my_collection.aggregate([{$group: {_id: "$internal_id", timestamp: {$max: "$timestamp"}}}]). This should use the {internal_id: 1, timestamp: -1} index.
Use those results to find the documents that you actually care about: db.my_collection.find({$or: [{internal_id, timestamp}, {other_internal_id, other_timestamp}, ....]}) (if there are duplicate timestamps for the same internal_id you may need to dedupe).
If you wanted to combine these 2 parts into 1, you can use a self-join on the original collection with a $lookup.
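A minimal sketch of that combined form; the let/pipeline form of $lookup needs MongoDB 3.6+, and note that the inner $expr match may not use the index as efficiently as running the two parts separately:
db.locations.aggregate([
    { $group: { _id: "$internal_id", timestamp: { $max: "$timestamp" } } },
    { $lookup: {
        from: "locations",
        let: { id: "$_id", ts: "$timestamp" },
        pipeline: [
            { $match: { $expr: { $and: [
                { $eq: ["$internal_id", "$$id"] },
                { $eq: ["$timestamp", "$$ts"] }
            ] } } },
            { $limit: 1 }
        ],
        as: "doc"
    } },
    { $replaceRoot: { newRoot: { $arrayElemAt: ["$doc", 0] } } }
])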
So finally I've been able to do all the testing. Here are all the versions I wrote, thanks to willis' answer, and the results:
Original aggregate query
mongo_query = [
{"$match": group_filter},
{"$sort": {"timestamp": -1}},
{"$group": {"_id": "$internal_id", "doc": {"$first": "$$ROOT"}}},
]
res = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query)
res = await res.to_list(None)
9.61 seconds
Give MongoDB a hint to use proper index (filter internal_id first)
from bson.son import SON
cursor = mongo.db[self.factory.config.mongo_collection].aggregate(mongo_query, hint=SON([("internal_id", 1), ("timestamp", -1)]))
res = await cursor.to_list(None)
Not working: MongoDB replies with an exception saying sorting consumes too much memory.
Split aggregation, to first find latest timestamp for each internal_id
cursor = mongo.db[self.factory.config.mongo_collection].aggregate([{"$group": {"_id": "$internal_id", "timestamp": {"$max": "$timestamp"}}}])
res = await cursor.to_list(None)
or_query = []
for entry in res:
    or_query.append({"internal_id": entry["_id"], "timestamp": entry["timestamp"]})
cursor = mongo.db[self.factory.config.mongo_collection].find({"$or": or_query})
fixed_res = await cursor.to_list(None)
1.88 seconds, a lot better but still not that fast
Parallel coroutines (and the winner is....)
In the meanwhile, as I already have the list of internal_ids, and I'm using asynchronous Python, I went for parallel coroutines, getting the latest entry for a single internal_id at a time:
fixed_res: List[Dict] = []

async def get_one_result(db_filter: Dict) -> None:
    """ Coroutine getting one result for each known internal ID """
    cursor = mongo.db[self.factory.config.mongo_collection].find(db_filter).sort("timestamp", -1).limit(1)
    res = await cursor.to_list(1)
    if res:
        fixed_res.append(res[0])

coros: List[Awaitable] = []
for internal_id in self.list_of_internal_ids:
    coro = get_one_result({"internal_id": internal_id})
    coros.append(coro)

await asyncio.gather(*coros)
0.5s, way better than others
If you don't have a list of internal_id
There's an alternative I did not implement, but I confirmed the call is very fast: use the low-level distinct command against the {internal_id: 1} index to retrieve the list of individual IDs, then use parallel calls.
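A minimal sketch of that alternative from the mongo shell, with the collection and field names from the question:
// distinct("internal_id") can be answered from the { internal_id: 1 } index alone
const ids = db.locations.distinct("internal_id")
// then fan out one find().sort({ timestamp: -1 }).limit(1) call per id in parallel, as above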

Mongodb relative frequency in grouping aggregation

I have data that looks like this
{"customer_id":1, "amount": 100, "item": "a"}
{"customer_id":1, "amount": 20, "item": "b"}
{"customer_id":2, "amount": 25, "item": "a"}
{"customer_id":3, "amount": 10, "item": "a"}
{"customer_id":4, "amount": 10, "item": "b"}
Using R I can get an overview of relative frequencies very easily by doing this
data %>%
group_by(customer_id,item) %>%
summarise(total=sum(amount)) %>%
mutate(per_customer_spend=total/sum(total))
Which returns:
  customer_id item  total per_customer_spend
        <dbl> <chr> <dbl>              <dbl>
1           1 a       100              0.833
2           1 b        20              0.167
3           2 a        25              1
4           3 a        10              1
5           4 b        10              1
I can't figure out how to do this in Mongo efficiently; the best solution I have so far involves multiple groups, pushes, and unwinds.
If you don't want to change the data structure, there's no way around grouping all the data, as we need to determine the total amount spent by each user. This requires just a single $group stage and a single $unwind stage; it would look something like this:
db.collection.aggregate([
    {
        $group: {
            _id: "$customer_id",
            total: {$sum: "$amount"},
            rootHolder: {$push: "$$ROOT"}
        }
    },
    {
        $unwind: "$rootHolder"
    },
    {
        $project: {
            newRoot: {
                $mergeObjects: [
                    "$rootHolder",
                    {total: "$total"}
                ]
            }
        }
    },
    {
        $replaceRoot: {
            newRoot: "$newRoot"
        }
    },
    {
        $project: {
            customer_id: 1,
            item: 1,
            total: "$amount",
            per_customer_spend: {$divide: ["$amount", "$total"]}
        }
    }
])
With that said, especially as scale increases, this pipeline becomes very expensive. Depending on how big the scale is and the number of unique customer_id x item pairs, I would advise the following:
Considering Mongo doesn't mind data duplication, and assuming a user does not "buy" new items too often, it might be worth actually saving the per-customer total as a field in the current collection (which requires updating all of the user's items on purchase). I know this sounds "weird" and costly, but again, depending on the frequency of purchases it might actually be worth it.
Assuming you decide not to do the above, I would instead create a new collection with customer_id and customer_total. Mind you, this field will still require upkeep, although it is much cheaper.
With this collection you can $lookup the total (which again can be expensive).
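A minimal sketch of the $lookup variant, assuming a hypothetical customer_totals collection with documents shaped like { customer_id: 1, total: 120 } that is kept up to date on every insert:
db.collection.aggregate([
    { $lookup: {
        from: "customer_totals",
        localField: "customer_id",
        foreignField: "customer_id",
        as: "ct"
    } },
    { $unwind: "$ct" },
    { $project: {
        customer_id: 1,
        item: 1,
        total: "$amount",
        per_customer_spend: { $divide: ["$amount", "$ct.total"] }
    } }
])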

MongoDB: aggregate and group by splitting the id

My schema implementation is influenced by this tutorial on the official mongo site:
{
    _id: String,
    data: [
        {
            point_1: Number,
            ts: Date
        }
    ]
}
This is basically a schema designed for time series data, and I store data for each hour per device in an array in a single document. I create the _id field by combining the id of the device that is sending the data and the time. For example, if a device with id xyz1234 sends data at 2018-09-11 12:30:00, then my _id field becomes xyz1234:2018091112.
I create a new doc if the document for that hour for that device doesn't exist; otherwise I just push my data to the data array.
client.db('iot')
.collection('iotdata')
.update({_id:id},{$push:{data:{point_1,ts:date}}},{upsert:true});
Now I am facing problem while doing aggregation. I am trying to get these types of values
Min point_1 value for many devices in last 24 hours by grouping on device id
Max point_1 value for many devices in last 24 hours by grouping on device id
Average point_1 for many devices in last 24 hours by grouping on device id
I thought this was a very simple aggregation, then I realized the device id is not a standalone field but is mixed with the time, so it's not straightforward to group data by device id. How can I split the _id and group based on device id? I tried my level best to write the question as clearly as possible, so please ask in the comments if any part of the question is not clear.
You can start with $unwind on data to get a single document per entry. Then you can get the deviceId using the $substr and $indexOfBytes operators. Finally you can apply your filtering condition (last 24 hours) and use $group to get the min, max and avg:
db.col.aggregate([
    {
        $unwind: "$data"
    },
    {
        $project: {
            point_1: "$data.point_1",
            deviceId: { $substr: [ "$_id", 0, { $indexOfBytes: [ "$_id", ":" ] } ] },
            dateTime: "$data.ts"
        }
    },
    {
        $match: {
            dateTime: { $gte: ISODate("2018-09-10T12:00:00Z") }
        }
    },
    {
        $group: {
            _id: "$deviceId",
            min: { $min: "$point_1" },
            max: { $max: "$point_1" },
            avg: { $avg: "$point_1" }
        }
    }
])
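If the collection grows large, you can also cut the work done before the $unwind by filtering whole documents first. A minimal sketch, assuming an index on data.ts exists (the question doesn't mention one):
db.col.aggregate([
    // keeps only documents that contain at least one entry in the time window
    { $match: { "data.ts": { $gte: ISODate("2018-09-10T12:00:00Z") } } },
    { $unwind: "$data" },
    { $project: {
        point_1: "$data.point_1",
        deviceId: { $substr: [ "$_id", 0, { $indexOfBytes: [ "$_id", ":" ] } ] },
        dateTime: "$data.ts"
    } },
    // still needed per entry, since a matching document can also hold older entries
    { $match: { dateTime: { $gte: ISODate("2018-09-10T12:00:00Z") } } },
    { $group: {
        _id: "$deviceId",
        min: { $min: "$point_1" },
        max: { $max: "$point_1" },
        avg: { $avg: "$point_1" }
    } }
])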
You can use the below query in 3.6.
db.colname.aggregate([
    {"$project": {
        "deviceandtime": {"$split": ["$_id", ":"]},
        "minpoint": {"$min": "$data.point_1"},
        "maxpoint": {"$max": "$data.point_1"},
        "sumpoint": {"$sum": "$data.point_1"},
        "count": {"$size": "$data.point_1"}
    }},
    {"$match": {"$expr": {"$gte": [{"$arrayElemAt": ["$deviceandtime", 1]}, "2018091000"]}}},
    {"$group": {
        "_id": {"$arrayElemAt": ["$deviceandtime", 0]},
        "minpoint": {"$min": "$minpoint"},
        "maxpoint": {"$max": "$maxpoint"},
        "sumpoint": {"$sum": "$sumpoint"},
        "countpoint": {"$sum": "$count"}
    }},
    {"$project": {
        "minpoint": 1,
        "maxpoint": 1,
        "avgpoint": {"$divide": ["$sumpoint", "$countpoint"]}
    }}
])

How mongo index intersection works

If I create two indexes on time, one descending and one ascending, and I query both old and new data, will the search use both indexes and give good performance in both cases?
Currently I have a descending index, so when I query old data it comes back really, really slowly. I am thinking of creating one more, ascending, and trying it out. Since I have a huge number of documents (32 million) I thought of asking here first.
This is my index and the query which causes me issues when the start/end time is a bit old. I have a TTL close to 100 days, which makes my collection keep 32 million documents.
index: {
    "source_type": 1.0,
    "source_id": 1.0,
    "key": 1.0,
    "start": -1.0,
    "end": -1.0
}
query: keys = diag_db.telemetry_series.aggregate([
    { '$match': {
        'source_type': 'SERVER',
        'start': { '$gte': start },
        'end': { '$lte': end },
        '$or': stream_id_query
    }},
    { '$project': {
        'source_id': 1,
        'key': 1
    }},
    { '$group':
        { '_id': { 'source': '$source_id', 'key': '$key' }
    }}
])['result']
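For reference, the ascending variant being considered would be created like this (a sketch only, with the field names taken from the index above; whether the planner prefers it depends on the query):
db.telemetry_series.createIndex({
    source_type: 1,
    source_id: 1,
    key: 1,
    start: 1,
    end: 1
})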