Optimizing MongoDB aggregate query on large Index objects - mongodb

I have 20 million objects in my MongoDB collection, currently running on an M30 MongoDB instance with 7.5 GB of RAM and 40 GB of disk.
Data is stored in the collection like this:
{
_id:xxxxx,
id : 1 (int),
from : xxxxxxxx (int),
to : xxxxxx (int),
status : xx (int)
.
.
.
.
},
{
_id:xxxxx,
id : 2 (int),
from : xxxxxxxx (int),
to : xxxxxx (int),
status : xx (int)
.
.
.
.
}
.
.
.
. and so on..
id has a unique index and from is indexed in this collection.
I am running a query that groups by 'to' and returns the max id per group, sorted by that max id, for a given 'from' condition:
$collection->aggregate([
    ['$project' => ['id' => 1, 'to' => 1, 'from' => 1]],
    ['$match' => [
        '$and' => [
            ['from' => xxxxxxxxxx],
            ['status' => xx],
        ]
    ]],
    ['$group' => [
        '_id' => '$to',
        'max_revision' => ['$max' => '$id'],
    ]],
    ['$sort' => ['max_revision' => -1]],
    ['$limit' => 20],
]);
The above query runs just fine (~2 sec) when the indexed 'from' value matches a small data set, e.g. 50-100k documents sharing the same 'from' value. But when, for example, 2M documents share the same 'from' value, it takes more than 10 seconds to return the result.
A quick example:
Case 1: the same query runs in under 2 sec when executed with from = 12345, as 12345 is present 50k times in the collection.
Case 2: the query takes over 10 sec when executed with from = 98765, as 98765 is present 2M times in the collection.
Edit: explain output of the query below -
{
"command": {
"aggregate": "mycollection",
"pipeline": [
{
"$project": {
"id": 1,
"to": 1,
"from": 1
}
},
{
"$match": {
"$and": [
{
"from": {
"$numberLong": "12345"
}
},
{
"status": 22
}
]
}
},
{
"$group": {
"_id": "$to",
"max_revision": {
"$max": "$id"
}
}
},
{
"$sort": {
"max_revision": -1
}
},
{
"$limit": 20
}
],
"allowDiskUse": false,
"cursor": {},
"$db": "mongo_jc",
"lsid": {
"id": {
"$binary": "8LktsSkpTjOzF3GIC+m1DA==",
"$type": "03"
}
},
"$clusterTime": {
"clusterTime": {
"$timestamp": {
"t": 1597230985,
"i": 1
}
},
"signature": {
"hash": {
"$binary": "PHh4eHh4eD4=",
"$type": "00"
},
"keyId": {
"$numberLong": "6859724943999893507"
}
}
}
},
"planSummary": [
{
"IXSCAN": {
"from": 1
}
}
],
"keysExamined": 1246529,
"docsExamined": 1246529,
"hasSortStage": 1,
"cursorExhausted": 1,
"numYields": 9747,
"nreturned": 0,
"queryHash": "29DAFB9E",
"planCacheKey": "F5EBA6AE",
"reslen": 231,
"locks": {
"ReplicationStateTransition": {
"acquireCount": {
"w": 9847
}
},
"Global": {
"acquireCount": {
"r": 9847
}
},
"Database": {
"acquireCount": {
"r": 9847
}
},
"Collection": {
"acquireCount": {
"r": 9847
}
},
"Mutex": {
"acquireCount": {
"r": 100
}
}
},
"storage": {
"data": {
"bytesRead": {
"$numberLong": "6011370213"
},
"timeReadingMicros": 4350129
},
"timeWaitingMicros": {
"cache": 2203
}
},
"protocol": "op_msg",
"millis": 8548
}

For this specific case the mongod query executor can use an index for the initial match (the plan summary above shows an IXSCAN on {from: 1} examining about 1.2M keys), but not for the sort, so the sort has to be done in memory after all matching documents have been fetched.
If you were to reorder and modify the stages a bit, it could use an index on {from:1, status:1, id:1} for both matching and sorting:
$collection->aggregate([
    ['$match' => [
        '$and' => [
            ['from' => xxxxxxxxxx],
            ['status' => xx],
        ]
    ]],
    ['$sort' => ['id' => -1]],
    ['$project' => ['id' => 1, 'to' => 1, 'from' => 1]],
    ['$group' => [
        '_id' => '$to',
        // documents arrive sorted by id descending, so $first is the max id per group
        'max_revision' => ['$first' => '$id'],
    ]],
    // re-sort the grouped results so $limit returns the top 20 by max_revision
    ['$sort' => ['max_revision' => -1]],
    ['$limit' => 20],
]);
This way it should be able to combine the $match and the initial $sort into a single index scan.
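For reference, here is a minimal sketch of creating that compound index in the mongo shell (assuming the collection is named mycollection, as in the explain output above; adjust the name for your deployment):
// compound index covering the $match on from/status and the initial $sort on id
db.mycollection.createIndex({ from: 1, status: 1, id: 1 })
With this index in place the reordered pipeline can be resolved with a single IXSCAN instead of a blocking in-memory sort.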

Related

get sum of integer from array of objects in mongodb

I want to filter my documents by the sum of a decimal field in an array of objects, but didn't find anything good enough. For example, I have documents like below:
[
{
"id": 1,
"limit": NumberDecimal("100000"),
"requests": [
{
"money": NumberDecimal("50000"),
"user": "user1"
}
]
},
{
"id": 2,
"limit": NumberDecimal("100000"),
"requests": [
{
"money": NumberDecimal("100000"),
"user": "user2"
}
]
},
{
"id": 1,
"limit": null,
"requests": [
{
"money": NumberDecimal("50000"),
"user": "user1"
},
{
"money": NumberDecimal("50000"),
"user": "user3"
}
]
},
]
Description of the document fields:
limit - the maximum amount of money that I have
requests - an array of objects, where money is how much money a user took from the limit (if user1 takes 50000, the remainder is 50000, i.e. limit - sum(requests.money))
I am querying MongoDB from a Scala project and need to:
get all documents where limit is equal to null
get all documents where I have x remainder money (x is an input value)
The first case is easier than the second. I know how to get the sum of requests.money; I am doing it with this query:
db.campaign.aggregate([
{$project: {
total: {$sum: ["$requests.money"]}
}}
])
Scala filter part:
Filters.or(
Filters.equal("limit", null),
Filters.expr(Document(s""" {$$project: {total: {$$sum: ["$$requests.money"]}}}"""))
)
But I don't want to store this sum and return it as a result; I want to filter by the condition limit >= sum(requests.money) + x, where x is the amount of money a user wants to take, and get back all documents matching that filter.
Example:
x = 50000
and output must be like this:
[
{
"id": 1,
"limit": NumberDecimal("100000"),
"requests": [
{
"money": NumberDecimal("50000"),
"user": "user1"
}
]
},
{
"id": 1,
"limit": null,
"requests": [
{
"money": NumberDecimal("50000"),
"user": "user1"
},
{
"money": NumberDecimal("50000"),
"user": "user3"
}
]
},
]
You have to use an aggregation pipeline like this:
db.campaign.aggregate([
{
$set: {
remainder: {
$subtract: [ "$limit", { $sum: "$requests.money" } ]
}
}
},
{
"$match": {
$or: [
{ limit: null },
{ remainder: { $gte: 0 } }
]
}
},
{ $unset: "remainder" }
])
Mongo Playground
This one is also possible, but more difficult to read:
db.campaign.aggregate([
{
"$match": {
$or: [
{ limit: null },
{
$expr: {
$gte: [
{ $subtract: [ "$limit", { $sum: "$requests.money" } ] },
0
]
}
}
]
}
}
])
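To plug in the asker's x (keep documents where limit >= sum(requests.money) + x), the 0 above just needs to be replaced by x. A minimal sketch in the mongo shell, assuming x = 50000 as in the example (the same stages can be built from Scala with the driver's Aggregates helpers):
var x = NumberDecimal("50000")  // the requested amount (input value)
db.campaign.aggregate([
  { $set: { remainder: { $subtract: [ "$limit", { $sum: "$requests.money" } ] } } },
  { $match: { $or: [
      { limit: null },                 // no limit at all
      { remainder: { $gte: x } }       // at least x money left
  ] } },
  { $unset: "remainder" }
])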

Mongo Query for Multiple conditions with limit and skip

The document format in all collections of the db is as follows:
{
"_id": {
"$oid": "5e0983863bcf0dab51f2872b"
},
"word": "never", // get the `word` value for each of below queries
"wordset_id": "a42b50e85e",
"meanings": [{
"id": "1f1bca9d9f",
"def": "not ever",
"speech_part": "adverb",
"synonyms": ["ne'er"]
}, {
"id": "d35f973ed0",
"def": "not at all",
"speech_part": "adverb"
}]
}
I am trying to query for words:
1) where the word length is 4, the speech_part is noun, and the word contains "ac" (like %ac% in SQL) - the result would include jack, ...
2) how to combine all three - starting with, containing, and ending with - in a single query (e.g. starting with j, containing ac, ending with k would give jack)
I have tried the following for 1):
pipeline = [
{
"$match": {
"meanings.speech_part": "noun",
"word": "/ac/",
"$expr": {"$eq": [{"$strLenCP": "$word"}, 4]}
}
}
]
query=db[collection].aggregate(pipeline)
But I got no results for this. Also, how do I add skip and limit to an aggregate - should I use $facet?
Referring to an SO answer, I found this:
db.Order.aggregate([
{ '$match' : { "company_id" : ObjectId("54c0...") } },
{ '$sort' : { 'order_number' : -1 } },
{ '$facet' : {
metadata: [ { $count: "total" }, { $addFields: { page: NumberInt(3) } } ],
data: [ { $skip: 20 }, { $limit: 10 } ] // As shown here------
} }
] )
The string "/ac/" is matched literally by PyMongo, so no document's word equals it; pass a compiled regular expression instead. For reference, the Pythonic pipeline would look like this:
pipeline = [
{
'$match': {
"$expr": {"$eq": [{"$strLenCP": "$word"}, 4]},
'word': re.compile('ac'),
'meanings.speech_part': "noun"
}
}
]
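For the skip and limit part, $skip and $limit stages (or a $facet, as in the quoted answer) can simply be appended to the pipeline. A minimal sketch in the mongo shell, assuming a collection named words (the question never names it) and a hypothetical page of 10 documents starting at offset 20:
db.words.aggregate([
  { $match: {
      "meanings.speech_part": "noun",
      word: /ac/,                                     // a real regex, not the string "/ac/"
      $expr: { $eq: [ { $strLenCP: "$word" }, 4 ] }
  } },
  { $skip: 20 },    // offset
  { $limit: 10 }    // page size
])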

MongoDB aggregate + $match + $group + Array

Here is my MongoDB query :
profiles.aggregate([
  { "$match": { "channels.sign_up": true } },
  { "$group": { "_id": "$channels.slug", "user_count": { "$sum": 1 } } },
  { "$sort": { "user_count": -1 } }
])
Here is my Code :
$profiles = Profile::raw()->aggregate([
[
'$match' => [
'channels.sign_up' => true
]
],
[
'$group' => [
'_id' => '$channels.slug',
'user_count' => ['$sum' => 1]
]
],
[
'$sort' => [
"user_count" => -1
]
]
]);
Here is my Mongo Collection :
"channels": [
{
"id": "5ae44c1c2b807b3d1c0038e5",
"slug": "swachhata-citizen-android",
"mac_address": "A3:72:5E:DC:0E:D1",
"sign_up": true,
"settings": {
"email_notifications_preferred": true,
"sms_notifications_preferred": true,
"push_notifications_preferred": true
},
"device_token": "ff949faeca60b0f0ff949faeca60b0f0"
},
{
"id": "5ae44c1c2b807b3d1c0038f3",
"slug": "website",
"mac_address": null,
"device_token": null,
"created_at": "2018-06-19 19:15:13",
"last_login_at": "2018-06-19 19:15:13",
"last_login_ip": "127.0.0.1",
"last_login_user_agent": "PostmanRuntime/7.1.5"
}
],
Here is my response :
{
"data": [
{
"_id": [
"swachhata-citizen-android"
],
"user_count": 1
},
{
"_id": [
"icmyc-portal"
],
"user_count": 1
},
{
"_id": [
"swachhata-citizen-android",
"website",
"icmyc-portal"
],
"user_count": 1
}
]
}
What I am expecting is:
{
"data": [
{
"_id": [
"swachhata-citizen-android"
],
"user_count": 1
},
{
"_id": [
"icmyc-portal"
],
"user_count": 1
},
{
"_id": [
"website",
],
"user_count": 1
}
]
}
As you can see, channels is an array and "sign_up" is true for only one element of the array - the channel through which the user registered. We have many apps, so we maintain more than one channel per user.
I want to count how many users registered through each channel, but in the response the _id contains all of a user's channels instead of only the channel where sign_up is true.
The count is also wrong, as I have two records where "slug": "swachhata-citizen-android" and "sign_up": true.
Need suggestions :)
Use $unwind to turn each document containing an array into a stream of documents, one per array element, so the nested fields can be matched and grouped individually. In your example:
profiles.aggregate([
{$unwind: '$channels'},
{$match: {'channels.sign_up': true}},
{$group: {_id: '$channels.slug', user_count: {$sum: 1}}},
{$sort: {user_count: -1}}
])
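If the collection is large, it can also help to filter before unwinding, so fewer documents are unwound and an index on channels.sign_up can be used; a minimal sketch, assuming the collection is named profiles as in the question:
db.profiles.aggregate([
  { $match: { 'channels.sign_up': true } },  // narrow the documents first (can use an index)
  { $unwind: '$channels' },
  { $match: { 'channels.sign_up': true } },  // keep only the array element(s) where sign_up is true
  { $group: { _id: '$channels.slug', user_count: { $sum: 1 } } },
  { $sort: { user_count: -1 } }
])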

Return object from a list if it's child object contains a certain value

I've been stuck on this issue for a while; I feel like I'm close but just can't figure out the solution.
I have a condensed schema that looks like this:
{
"_id": {
"$oid": "5a423f48d3983274668097f3"
},
"id": "59817",
"key": "DW-15450",
"changelog": {
"histories": [
{
"id": "449018",
"created": "2017-12-13T11:11:26.406+0000",
"items": [
{
"field": "status",
"toString": "Released"
}
]
},
{
"id": "448697",
"created": "2017-12-08T09:54:41.822+0000",
"items": [
{
"field": "resolution",
"toString": "Fixed"
},
{
"field": "status",
"toString": "Completed"
}
]
}
]
},
"fields": {
"issuetype": {
"id": "1",
"name": "Bug"
}
}
}
And I would like to grab all changelog.histories entries that have a changelog.histories.items.toString value of Completed.
Below is my pipeline:
"pipeline" => [
[
'$match' => [
'changelog.histories.items.toString' => 'Completed'
]
],
[
'$unwind' => '$changelog.histories'
],
[
'$project' => [
'changelog.histories' => [
'$filter' => [
'input' => '$changelog.histories.items',
'as' => 'item',
'cond' => [
'$eq' => [
'$$item.toString', 'Completed'
]
]
]
]
]
]
]
So ideally I would like the following returned
{
"id": "448697",
"created": "2017-12-08T09:54:41.822+0000",
"items": [
{
"field": "resolution",
"toString": "Fixed"
},
{
"field": "status",
"toString": "Completed"
}
]
}
You can try something like this.
db.changeLogs.aggregate([
{ $unwind: '$changelog.histories' },
{ $match: {'changelog.histories.items.toString': 'Completed'} },
{ $replaceRoot: { newRoot: "$changelog.histories" } }
]);
This solution performs a COLLSCAN, so it is expensive in case of a large collection. Should you have strict performance requirements, you can create an index as follows.
db.changeLogs.createIndex({'changelog.histories.items.toString': 1})
Then, in order to exploit the index, you have to change the query as follows.
db.changeLogs.aggregate([
{ $match: {'changelog.histories.items.toString': 'Completed'} },
{ $unwind: '$changelog.histories' },
{ $match: {'changelog.histories.items.toString': 'Completed'} },
{ $replaceRoot: { newRoot: "$changelog.histories" } }
]);
The first stage filters the changeLog documents having at least one history item in the Completed state; this stage uses the index. The second stage unwinds the histories array. The third stage filters the unwound documents again, keeping those whose history entry contains at least one item in the Completed state. Finally, the fourth stage replaces the root, returning the history entries as top-level documents.
Edit
Based on your comment, this is an alternative solution that preserves the id and key fields in the returned documents (while still using the index).
db.changeLogs.aggregate([
{ $match: {'changelog.histories.items.toString': 'Completed'} },
{ $unwind: '$changelog.histories' },
{ $match: {'changelog.histories.items.toString': 'Completed'} },
{ $project: { _id: 0, id: 1, key: 1, changelog: 1 }}
]);
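As a side note (not in the original answer): on MongoDB 4.2 or newer the $replaceRoot stage can be written with the shorter $replaceWith alias, so the first variant becomes:
db.changeLogs.aggregate([
  { $match: { 'changelog.histories.items.toString': 'Completed' } },
  { $unwind: '$changelog.histories' },
  { $match: { 'changelog.histories.items.toString': 'Completed' } },
  { $replaceWith: "$changelog.histories" }   // shorthand for { $replaceRoot: { newRoot: ... } }
]);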

How to compare and count each value of element with condition in mongoDB pipeline after unwinding?

This is the command I ran in tools -> command:
{
aggregate : "hashtags",
pipeline:
[
{$unwind:"$time"},
{$match:{"$time":{$gte:NumberInt(1450854385), $lte:NumberInt(1450854385)}}},
{$group:{"_id":"$word","count":{$sum:1}}}
]
}
which gave us this result
Response from server:
{
"result": [
{
"_id": "dear",
"count": NumberInt(1)
},
{
"_id": "ghost",
"count": NumberInt(1)
},
{
"_id": "rat",
"count": NumberInt(1)
},
{
"_id": "police",
"count": NumberInt(1)
},
{
"_id": "bugs",
"count": NumberInt(3)
},
{
"_id": "dog",
"count": NumberInt(2)
},
{
"_id": "batman",
"count": NumberInt(9)
},
{
"_id": "ear",
"count": NumberInt(1)
}
],
"ok": 1
}
The documents are in the collection 'hashtags'.
The inserted documents are shown below.
1.
{
"_id": ObjectId("567a483bf0058ed6755ab3de"),
"hash_count": NumberInt(1),
"msgids": [
"1583"
],
"time": [
NumberInt(1450854385)
],
"word": "ghost"
}
2.
{
"_id": ObjectId("5679485ff0058ed6755ab3dd"),
"hash_count": NumberInt(1),
"msgids": [
"1563"
],
"time": [
NumberInt(1450788886)
],
"word": "dear"
}
3.
{
"_id": ObjectId("567941aaf0058ed6755ab3dc"),
"hash_count": NumberInt(9),
"msgids": [
"1555",
"1556",
"1557",
"1558",
"1559",
"1561",
"1562",
"1584",
"1585"
],
"time": [
NumberInt(1450787170),
NumberInt(1450787292),
NumberInt(1450787307),
NumberInt(1450787333),
NumberInt(1450787354),
NumberInt(1450787526),
NumberInt(1450787615),
NumberInt(1450855148),
NumberInt(1450855155)
],
"word": "batman"
}
4.
{
"_id": ObjectId("567939cdf0058ed6755ab3d9"),
"hash_count": NumberInt(3),
"msgids": [
"1551",
"1552",
"1586"
],
"time": [
NumberInt(1450785157),
NumberInt(1450785194),
NumberInt(1450856188)
],
"word": "bugs"
}
So I want to count, for each word, the number of values in the 'time' array that fall between two limits, like this:
foreach word
{
foreach time
{
if((a<time)&&(time<b))
word[count]++
}
}
but my query just gives the total size of the 'time' array as output.
What is the correct query?
For example, if the lower bound is 1450787615 and the upper bound is 1450855155, there are 3 matching values in 'time' for the word 'batman'.
The answer should be:
{
"_id": "batman",
"count": NumberInt(3)
},
for batman. Thank you.
Use the following aggregation pipeline, which first filters the documents, then unwinds the time array, filters the individual time values again, and finally groups by word:
db.hashtags.aggregate([
{
"$match": {
"time": {
"$gte": 1450787615, "$lte": 1450855155
}
}
},
{ "$unwind": "$time" },
{
"$match": {
"time": {
"$gte": 1450787615, "$lte": 1450855155
}
}
},
{
"$group": {
"_id": "$word",
"count": {
"$sum": 1
}
}
}
])
For the given sample documents, this will yield:
/* 0 */
{
"result" : [
{
"_id" : "batman",
"count" : 3
},
{
"_id" : "dear",
"count" : 1
},
{
"_id" : "ghost",
"count" : 1
}
],
"ok" : 1
}
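Not part of the original answer, but if the hashtags collection is large, the first $match can be supported by a (multikey) index on the time array; a minimal sketch:
// multikey index on the time array, used by the initial $match on the time range
db.hashtags.createIndex({ time: 1 })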