Slow working subreports (MongoDB)

The report I asked this question about has since grown by three additional subreports.
Each of them counts a number of documents.
Main report:
{
    runCommand : {
        aggregate : 'Collection',
        pipeline : [
            { $match : { time : { '$gte' : '$P!{sttime}', '$lt' : '$P!{edtime}' } } },
            { $match : { owner_id : $P{id} } },
            { $match : { status : 0 } },
            { $group : { _id : { StatusID : '$status', SID : '$sid', UserID : '$owner_id', GroupID : '$group_id' }, count : { $sum : 1 } } }
        ]
    }
}
The user must set a timeframe (two Timestamp date parameters) and select the user for whom the report will be filled.
The selected timeframe, owner_id, and sid are then passed to each subreport as parameters.
Subreport:
{
    runCommand : {
        aggregate : 'Collection',
        pipeline : [
            { $match : { update_time : { '$gte' : '$P!{sttime}', '$lt' : '$P!{edtime}' } } },
            { $match : { sid : $P{sid} } },
            { $match : { owner_id : $P{id} } },
            { $match : { status : 1 } },
            { $group : { _id : { StatusID : '$status' }, count : { $sum : 1 } } }
        ]
    }
}
The other subreports are the same as above, except that { $match : { status : 1 } } is replaced by
{ $match : { status : 2 } }
and
{ $match : { status : 3 } }
respectively.
I'm working with a large collection, where a 2-hour timeframe covers about 400,000 documents.
The maximum timeframe for which the report still shows results is 8 hours; anything longer than that hits the timeout.
Filling the 2-hour timeframe takes about 10 minutes.
I tried {explain : true} on each request individually. The queries were fastest in exactly the form in which they are written above.
"cursor" : {
    "cursor" : "BtreeCursor update_time_1_owner_id_1_status_1_group_id_1",
    "isMultiKey" : false,
    "n" : 379843,
    "nscannedObjects" : 379843,
    "nscanned" : 379843,
    "nscannedObjectsAllPlans" : 379843,
    "nscannedAllPlans" : 379843,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 13,
    "nChunkSkips" : 0,
    "millis" : 694,
This example is on the 2-hour timeframe.
Is there any way to speed up filling the report? Somehow combine the queries? Or something else?
The goal is to extend the reportable period to a month (if that is possible).

MongoDB 2.4 and earlier do not merge consecutive calls to $match, so your first $match will use an index and subsequent matches will be done in memory on the ~400k documents fetched for your aggregation pipeline. Given you have two or three subsequent $match operations, this processing isn't going to be overly efficient.
FYI, the issue of coalescing consecutive $match calls has been addressed in MongoDB 2.6 (see SERVER-11184), but for clarity I think you would still be best combining the $match statements before doing an explain.
Also, since you are doing a range query on update_time, you probably want to move that field later in your compound index so the equality checks on owner_id, status, and group_id can filter results before the range comparison. The blog post Optimizing MongoDB Compound Indexes has some relevant explanation of this approach.
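To illustrate the combination concretely, here is a plain-JavaScript sketch (the function name and the dummy filter values are mine, standing in for the $P{...} report parameters) of folding consecutive $match stages into a single $match joined with $and, which is roughly what 2.6's coalescing does and what you would write by hand on 2.4:

```javascript
// Fold runs of consecutive $match stages into a single $match whose
// conditions are joined with $and (the hand-written equivalent of the
// coalescing MongoDB 2.6 performs automatically).
function coalesceMatches(pipeline) {
  const out = [];
  for (const stage of pipeline) {
    const prev = out[out.length - 1];
    if (stage.$match && prev && prev.$match) {
      if (Array.isArray(prev.$match.$and)) {
        prev.$match.$and.push(stage.$match);
      } else {
        out[out.length - 1] = { $match: { $and: [prev.$match, stage.$match] } };
      }
    } else {
      out.push(stage);
    }
  }
  return out;
}

// The subreport pipeline from the question, with dummy values standing
// in for the $P{...} report parameters.
const combined = coalesceMatches([
  { $match: { update_time: { $gte: 0, $lt: 1 } } },
  { $match: { sid: 42 } },
  { $match: { owner_id: 7 } },
  { $match: { status: 1 } },
  { $group: { _id: { StatusID: '$status' }, count: { $sum: 1 } } },
]);
console.log(JSON.stringify(combined, null, 2)); // two stages: one $match, one $group
```

With the predicates in a single stage, the query planner sees all of them at once and the compound index can serve the whole filter.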

Related

mongodb aggregation, use $sum 2 individual fields and group

As I am new to Mongo, I have a slight problem getting something done the way I need it.
I am trying to group the collection by token_address, count the occurrences per token_address as totalTransfers, and also sum the values of the decimal property for each token_address.
The desired output would look like this:
{
    "token_address" : "0x2a746fb4d7338e4677c102f0ce46dae3971de1cc",
    "totalTransfers" : 4.0, // occurrences per token_address in collection
    "decimal" : 132.423 // the $sum of each decimal per token_address
}
This is how the documents look:
{
    "_id" : "BrBr1vuhNRmmaZliYopQocD2",
    "from_address" : "0x7ed77e237fa0a87fc12afb92d2999e6f90e1d43b",
    "log_index" : 442,
    "to_address" : "0x31d5e41636c2331d8be9ea9c4393a0ff4e597b6c",
    "transaction_hash" : "0x1a80b66839b021ef9c1a902f19d28b77d8e688b2e3ebb9bfc185443ae1830403",
    "_created_at" : ISODate("2022-03-21T14:09:49.894Z"),
    "_updated_at" : ISODate("2022-03-21T14:09:49.894Z"),
    "block_hash" : "0x80dfe8642f998ce7fb3e692ab574d9786efdd81ba6aeace060ae9cc919a8acbf",
    "block_number" : 14209975,
    "block_timestamp" : ISODate("2022-02-15T09:47:30.000Z"),
    "confirmed" : true,
    "decimal" : NumberDecimal("0.1206"),
    "historical" : true,
    "token_address" : "0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2",
    "transaction_index" : 278,
    "value" : "120600000000000000"
}
This is my aggregation, which produces the output below:
db.EthTokenTransfers.aggregate([
    { $project : {
        token_address : 1 // Inclusion mode
    } },
    { $group : { _id : '$token_address', totalTransfers : { $sum : 1 }, decimal : { $sum : "$decimal" } } }
])
{
    "_id" : "0x2a746fb4d7338e4677c102f0ce46dae3971de1cc",
    "totalTransfers" : 4.0,
    "decimal" : 0
}
Can someone point me towards the correct way of doing this? I've been trying for over an hour on such a simple task.
The problem is you are removing fields in the $project stage.
Check the example with only the first stage, where the output is only the field token_address.
So the incoming data for the next aggregation stage is only the token_address. That's why totalTransfers is correct (you group by the existing token_address field and $sum 1 for each document), but decimal is always 0, because the field no longer exists at that point.
One solution is to add decimal into the $project stage, as in this example.
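To make the failure mode concrete, here is a small plain-JavaScript simulation (illustrative data and names, not the real driver; plain numbers stand in for NumberDecimal) of how $group behaves over the two inputs: $sum contributes 0 for a field that is missing or non-numeric, so projecting decimal away guarantees a zero total:

```javascript
// Simulate $group { _id: "$token_address", totalTransfers: {$sum: 1},
// decimal: {$sum: "$decimal"} } over an array of documents.
function groupByToken(docs) {
  const groups = {};
  for (const d of docs) {
    const g = groups[d.token_address] || (groups[d.token_address] = { totalTransfers: 0, decimal: 0 });
    g.totalTransfers += 1;
    // $sum skips missing/non-numeric values, i.e. they contribute 0.
    g.decimal += typeof d.decimal === 'number' ? d.decimal : 0;
  }
  return groups;
}

const raw = [
  { token_address: '0x2a74', decimal: 1.5 },
  { token_address: '0x2a74', decimal: 2.5 },
];

// $project { token_address: 1 } drops "decimal" before grouping.
const projected = raw.map(d => ({ token_address: d.token_address }));
const bad = groupByToken(projected);   // decimal sums to 0
const good = groupByToken(raw);        // decimal sums correctly
console.log(bad['0x2a74'], good['0x2a74']);
```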

mongo sort before match in aggregation

Given a collection of a few million documents that look like:
{
    organization: ObjectId("6a55b2f1aae2fe0ddd525828"),
    updated_on: 2019-04-18 14:08:48.781Z
}
and two indexes, one on each key: {organization: 1} and {updated_on: 1}.
The following query takes ages to return:
db.getCollection('sessions').aggregate([
    {
        "$match" : {
            "organization" : ObjectId("5a55b2f1aae2fe0ddd525827")
        }
    },
    {
        "$sort" : {
            "updated_on" : 1
        }
    }
])
One thing to note is that the result is 0 matches. Upon further investigation, the planner in explain() actually returns the following:
{
    "stage" : "FETCH",
    "filter" : {
        "organization" : {
            "$eq" : ObjectId("5a55b2f1aae2fe0ddd525827")
        }
    },
    "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
            "updated_on" : 1.0
        },
        "indexName" : "updated_on_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
            "updated_on" : []
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
            "updated_on" : [
                "[MinKey, MaxKey]"
            ]
        }
    }
}
Why would Mongo combine these into one stage and decide to sort ALL documents BEFORE filtering?
How can I prevent that?

Why would Mongo combine these into one stage and decide to sort ALL
documents BEFORE filtering? How can I prevent that?
The sort does happen after the match stage. The query plan doesn't show the SORT stage - that is because there is an index on the sort key updated_on. If you remove the index on the updated_on field you will see a SORT stage in the query plan (and it will be an in-memory sort).
See Explain Results - sort stage.
Some Ideas:
(i) You can use a compound index instead of two single-field indexes:
{ organization: 1, updated_on: 1 }
It will work fine. See this topic on Sort and Non-prefix Subset of an Index.
(ii) Also, instead of an aggregation, a find() query can do the same job:
db.test.find( { organization : ObjectId("5a55b2f1aae2fe0ddd525827") } ).sort( { updated_on: 1 } )
NOTE: Do verify with explain() and see how they perform. Also, try using the executionStats mode.
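As a toy model of the two candidate plans (not MongoDB's actual executor; names and data are mine): plan A filters via an organization lookup and then sorts the matches in memory, while plan B, the one from the question's explain output, walks all documents in updated_on order and filters on the fly. Both return the same rows, but plan B examines every document even when nothing matches:

```javascript
// byOrgIndex stands in for an index on {organization: 1}; sorting the
// whole array stands in for an index on {updated_on: 1}.
function byOrgIndex(docs) {
  const idx = new Map();
  for (const d of docs) {
    if (!idx.has(d.organization)) idx.set(d.organization, []);
    idx.get(d.organization).push(d);
  }
  return idx;
}

// Plan A: fetch only matching docs via the organization "index",
// then sort them in memory (a SORT stage).
function planA(docs, org) {
  const matched = (byOrgIndex(docs).get(org) || []).slice();
  matched.sort((x, y) => x.updated_on - y.updated_on);
  return { rows: matched, examined: matched.length };
}

// Plan B (the plan in the question): walk every document in
// updated_on order and filter as we go; no SORT stage, but every
// document is examined.
function planB(docs, org) {
  const sorted = docs.slice().sort((x, y) => x.updated_on - y.updated_on);
  return { rows: sorted.filter(d => d.organization === org), examined: sorted.length };
}

const docs = [
  { organization: 'a', updated_on: 3 },
  { organization: 'b', updated_on: 1 },
  { organization: 'b', updated_on: 2 },
];
console.log(planA(docs, 'zzz').examined, planB(docs, 'zzz').examined); // 0 3
```

The compound index { organization: 1, updated_on: 1 } effectively gives you plan A without the in-memory sort: the index itself returns the matches already in updated_on order.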
MongoDB will use the index for the $sort because sorting is a heavy operation, even though matching first would limit the number of documents to be sorted.
You can either force using the index for $match:
db.collection.aggregate(pipeline, {hint: "index_name"})
Or create a better index that solves both problems (see more information here):
db.collection.createIndex({organization: 1, updated_on:1}, {background: true})

MongoDB index not used when sorting, although prefix matches

I'm trying to fetch a set of records in the most efficient way from MongoDB, but it goes wrong when I add a sorting stage to the pipeline. The server does not use my intended index. According to the documentation it should however match the prefix:
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/#sort-and-non-prefix-subset-of-an-index
I have an index which looks like this:
{
    "v" : 2,
    "key" : {
        "account_id" : 1,
        "cdr_block" : 1,
        "billing.total_billed" : 1,
        "times.created" : -1
    },
    "name" : "IDX_by_account_and_block_sorted"
}
So I would suppose that when I filter on account_id, cdr_block and billing.total_billed, followed by a sort on times.created, the index would be used.
However, that is not the case. When I check the query explanations in the MongoDB shell,
this one does NOT use that index; it uses an index composed of times.created only, so it takes a few minutes:
db.getCollection("cdr").aggregate(
    [
        {
            "$match" : {
                "account_id" : 160.0,
                "cdr_block" : ObjectId("5d11e0364f853f15824aff47"),
                "billing.total_billed" : {
                    "$gt" : 0.0
                }
            }
        },
        {
            "$sort" : {
                "times.created" : -1.0
            }
        }
    ],
    {
        "allowDiskUse" : true
    }
);
If I leave out the $sort stage, it does use my above-mentioned index.
I was thinking it was perhaps due to the fact that it's an aggregation, but this 'regular' query also doesn't use the index:
db.getCollection("cdr").find({
    "account_id" : 160.0,
    "cdr_block" : ObjectId("5d11e0364f853f15824aff47"),
    "billing.total_billed" : {
        "$gt" : 0.0
    }
}).sort({ "times.created" : -1 });
$sort Operator and Performance
$sort operator can take advantage of an index when placed at the beginning of the pipeline or placed before the $project, $unwind, and $group aggregation operators. If $project, $unwind, or $group occur prior to the $sort operation,
$sort cannot use any indexes.

I need to count how many child orgs are assigned to a parent org in MongoDB

I'm new to the MongoDB world. I'm trying to figure out how to count the number of children organizations assigned to a parent organization. I have documents that have this general structure:
{
    "_id" : "001",
    "parentOrganization" : {
        "organizationId" : "pOrg1"
    },
    "childOrganization" : {
        "organizationId" : "cOrg1"
    }
},
{
    "_id" : "002",
    "parentOrganization" : {
        "organizationId" : "pOrg1"
    },
    "childOrganization" : {
        "organizationId" : "cOrg2"
    }
},
{
    "_id" : "003",
    "parentOrganization" : {
        "organizationId" : "pOrg2"
    },
    "childOrganization" : {
        "organizationId" : "cOrg3"
    }
}
Each document has a parentOrganization with an associated childOrganization. There may be multiple documents with the same parentOrganization, but different childOrganizations. There may also be multiple documents with the same parent/child relationship. Additionally, there may even be a case where a child org may associate with multiple parent orgs.
I'm trying to group by parentOrganization and then count the number of unique childOrganizations associated with each parentOrganization, as well as display their unique ids.
I have tried using an aggregation framework with $match and $group, but I'm still not getting into the child organization parts to count them. Here is what I'm currently attempting:
var s1 = {$match: {"parentOrganization.organizationId": {$exists: true}}};
var s2 = {$group: {_id: "$parentOrganization.organizationId", count: {$sum: "$childOrganization.organizationId"}}};
db.collection.aggregate(s1, s2);
My results are returning the parentOrganization, but my $sum is not returning the number of associated childOrganizations:
/* 1 */
{
    "_id" : "pOrg1",
    "count" : 0
}

/* 2 */
{
    "_id" : "pOrg2",
    "count" : 0
}
I get the feeling it is a bit more complicated than my limited knowledge has access to at this time. What details am I missing in this query?
Your $sum is referencing the childOrganization.organizationId value, which is a string. When $sum references a string, it returns the value 0.
I was unsure of exactly what you were asking for, but I believe these aggregations can help you on your way.
This will return a count of documents grouped by parentOrganization.organizationId:
db.collection.aggregate({$group: {"_id":"$parentOrganization.organizationId", "count": {"$sum": 1}}})
Output:
{ "_id" : "pOrg2", "count" : 1 }
{ "_id" : "pOrg1", "count" : 2 }
This will return a count of unique parent/child organizations:
db.collection.aggregate(
    { $group : { "_id" : { "parentOrganization" : "$parentOrganization.organizationId", "childOrganization" : "$childOrganization.organizationId" }, "count" : { $sum : 1 } } }
)
Output:
{ "_id" : { "parentOrganization" : "pOrg2", "childOrganization" : "cOrg3" }, "count" : 1 }
{ "_id" : { "parentOrganization" : "pOrg1", "childOrganization" : "cOrg2" }, "count" : 1 }
{ "_id" : { "parentOrganization" : "pOrg1", "childOrganization" : "cOrg1" }, "count" : 1 }
This will return a count of unique child organizations and get the set of unique child organizations as well using $addToSet. One caveat of using $addToSet is that the MongoDB 16MB limit on document size still holds. This means that if your collection is large enough such that the size of the set will make one document greater than 16MB, the command will fail. The first $group will create a set of child organizations grouped by parent organization. The $project is used simply to add the total size of the set to the result.
db.collection.aggregate([
    { $group : { "_id" : "$parentOrganization.organizationId", "childOrgs" : { "$addToSet" : "$childOrganization.organizationId" } } },
    { $project : { "_id" : "$_id", "uniqueChildOrgsCount" : { "$size" : "$childOrgs" }, "uniqueChildOrgs" : "$childOrgs" } }
])
Output:
{ "_id" : "pOrg2", "uniqueChildOrgsCount" : 1, "uniqueChildOrgs" : [ "cOrg3" ]}
{ "_id" : "pOrg1", "uniqueChildOrgsCount" : 2, "uniqueChildOrgs" : [ "cOrg2", "cOrg1" ]}
During these aggregations, I left out the $match statement you included for simplicity, but you could add that back as well.
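As a sanity check of what the $addToSet/$size pipeline computes, here is the equivalent grouping written in plain JavaScript (illustrative data; a Set plays the role of $addToSet):

```javascript
// Plain-JS equivalent of the $group + $addToSet + $size pipeline:
// collect the set of child org ids per parent, then report its size.
function uniqueChildrenByParent(docs) {
  const byParent = new Map();
  for (const d of docs) {
    const p = d.parentOrganization.organizationId;
    if (!byParent.has(p)) byParent.set(p, new Set());
    byParent.get(p).add(d.childOrganization.organizationId);
  }
  const out = {};
  for (const [p, kids] of byParent) {
    out[p] = { uniqueChildOrgsCount: kids.size, uniqueChildOrgs: [...kids] };
  }
  return out;
}

const docs = [
  { parentOrganization: { organizationId: 'pOrg1' }, childOrganization: { organizationId: 'cOrg1' } },
  { parentOrganization: { organizationId: 'pOrg1' }, childOrganization: { organizationId: 'cOrg2' } },
  { parentOrganization: { organizationId: 'pOrg1' }, childOrganization: { organizationId: 'cOrg2' } }, // duplicate pair
  { parentOrganization: { organizationId: 'pOrg2' }, childOrganization: { organizationId: 'cOrg3' } },
];
const result = uniqueChildrenByParent(docs);
console.log(result); // pOrg1 has 2 unique children, pOrg2 has 1
```

Note how the duplicate pOrg1/cOrg2 pair is counted once, exactly as $addToSet would deduplicate it.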

Queries on arrays with timestamps

I have documents that look like this:
{
    "_id" : ObjectId("5191651568f1f6000282b81f"),
    "updated_at" : "2013-05-16T09:46:16.199660",
    "activities" : [
        {
            "worker_name" : "image",
            "completed_at" : "2013-05-13T21:34:59.293711"
        },
        {
            "worker_name" : "image",
            "completed_at" : "2013-05-16T07:33:22.550405"
        },
        {
            "worker_name" : "image",
            "completed_at" : "2013-05-16T07:41:47.845966"
        }
    ]
}
and I would like to find only those documents where the updated_at time is greater than the last activities.completed_at time (the array is in time order).
I currently have this, but it matches any activities[].completed_at:
{
    "activities.completed_at" : { "$gte" : "updated_at" }
}
thanks!
Update
Well, I have different workers, and each has its own completed_at.
I'll have to invert activities as follows:
activities : {
    image : {
        last : {
            completed_at : t3
        },
        items : [
            { completed_at : t0 },
            { completed_at : t1 },
            { completed_at : t2 },
            { completed_at : t3 }
        ]
    }
}
and use this query:
{
    "activities.image.last.completed_at" : { "$gte" : "updated_at" }
}
Assuming that you don't know how many activities you have (it would be easy if you always had 3 activities, for example, with an activities.3.completed_at dotted path) and since there's no $last positional operator, the short answer is that you cannot do this efficiently.
When the activities are inserted, I would update the record's updated_at value (or another field). Then it becomes a trivial problem.
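A sketch of that pattern in plain JavaScript (the field name last_completed_at and the helper function are mine, not from the question): whenever an activity is appended, copy its completed_at to a top-level field, after which the original question reduces to comparing two top-level fields:

```javascript
// Hypothetical helper: append an activity and denormalise its
// completed_at into a top-level last_completed_at field.
function appendActivity(doc, activity) {
  doc.activities.push(activity);
  doc.last_completed_at = activity.completed_at;
  return doc;
}

// With that field maintained, "updated_at is later than the last
// activity" is a comparison of two top-level fields. In MongoDB 3.6+
// the server-side form would be:
//   { $expr: { $gt: ["$updated_at", "$last_completed_at"] } }
function updatedAfterLastActivity(doc) {
  // ISO-8601 timestamp strings compare correctly as plain strings.
  return doc.updated_at > doc.last_completed_at;
}

const doc = appendActivity(
  { updated_at: '2013-05-16T09:46:16', activities: [] },
  { worker_name: 'image', completed_at: '2013-05-16T07:41:47' }
);
console.log(updatedAfterLastActivity(doc)); // true
```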