Can I batch aggregations in MongoDB? - mongodb

I have some read-only aggregation pipelines that must run in parallel with only one connection available. Is that possible, or does Mongo only allow find and update operations in bulk, but not aggregate?

The MongoDB driver uses a connection pool and executes aggregation commands asynchronously. You don't need to do anything special, apart from ensuring your application doesn't wait for a response before issuing the next query.
Consider a test collection:
mgeneratejs '{"num": {"$integer": {"min": 1, "max": 20}}, "text": {"$paragraph": {sentences: 5}}}' -n 100000 | mongoimport -d so -c text
a single aggregation query
db.text.aggregate([
    {$match: {text: /ert.*duv/i}},
    {$group: {_id: null, cnt: {$sum: 1}, text: {$push: "$text"}}}
]);
takes about 400 ms.
Running 10 of these in parallel (JavaScript):
const started = new Date().getTime();
let client;
MongoClient.connect(url, {poolSize: 10})
    .then(cl => {
        client = cl;
        const db = cl.db('so');
        return Promise.all([/ert.*duv/i, /kkd.*aql/i, /zop/i, /bdgtter/i, /ppa.*mcm/i, /ert.*duv/i, /kkd.*aql/i, /zop/i, /bdgtter/i, /ppa.*mcm/i]
            .map(regex => ([{$match: {text: regex}}, {$group: {_id: null, cnt: {$sum: 1}, text: {$push: "$text"}}}]))
            .map(pipeline => db.collection('text').aggregate(pipeline).toArray()));
    })
    .then(() => {
        client.close();
        console.log("ended in " + (new Date().getTime() - started));
    });
takes 1,883 ms (JavaScript time), of which ~1,830 ms are spent on the db side:
db.getCollection('system.profile').find({ns:"so.text", "command.aggregate": "text"}, {ts:1, millis:1})
{
    "millis" : 442,
    "ts" : ISODate("2018-02-22T17:32:39.738Z")
},
{
    "millis" : 452,
    "ts" : ISODate("2018-02-22T17:32:39.747Z")
},
{
    "millis" : 445,
    "ts" : ISODate("2018-02-22T17:32:39.756Z")
},
{
    "millis" : 471,
    "ts" : ISODate("2018-02-22T17:32:39.762Z")
},
{
    "millis" : 448,
    "ts" : ISODate("2018-02-22T17:32:39.771Z")
},
{
    "millis" : 491,
    "ts" : ISODate("2018-02-22T17:32:39.792Z")
},
{
    "millis" : 566,
    "ts" : ISODate("2018-02-22T17:32:39.854Z")
},
{
    "millis" : 561,
    "ts" : ISODate("2018-02-22T17:32:39.856Z")
},
{
    "millis" : 1822,
    "ts" : ISODate("2018-02-22T17:32:41.118Z")
},
{
    "millis" : 1834,
    "ts" : ISODate("2018-02-22T17:32:41.124Z")
}
If you do the math, you can see all 10 started at about the same time, around 2018-02-22T17:32:39.300Z, and mongostat indeed shows 10 additional connections at the time of script execution.
Limiting poolSize to 5 doubles the time, as the requests will be executed in 2 batches of 5.
The driver uses about 1 MB of RAM per connection, so 100 connections per worker is not unrealistic.
To summarise: ensure your connection pool is configured properly, check the number of connections actually used at runtime, and make sure you handle requests asynchronously at the application level.
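To check the number of connections actually used at runtime, here is a minimal mongo shell sketch using the serverStatus command (a standard command; the field names are as reported in serverStatus output):
// "current" should jump by roughly poolSize while the parallel aggregations are running.
var conn = db.serverStatus().connections;
printjson({current: conn.current, available: conn.available, totalCreated: conn.totalCreated});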

Related

MongoDB dataSize command error "couldn't find valid index containing key pattern"

We have two MongoDB clusters that do not interact with each other. We used to run the dataSize command (https://docs.mongodb.com/manual/reference/command/dataSize/) to record the storage used for each specified ID, and both clusters were running smoothly. Recently, one cluster's secondary server failed and we restarted that cluster. Since then, the dataSize command has stopped working on it and responds with a "couldn't find valid index containing key pattern" error.
Example of the error returned:
rs0:PRIMARY> db.runCommand({ dataSize: "dudubots.channel_tdata", keyPattern: { "c_id_s": 1 }, min: { "c_id_s": 1 }, max: { "c_id_s": 4226 } });
{
    "estimate" : false,
    "ok" : 0,
    "errmsg" : "couldn't find valid index containing key pattern",
    "operationTime" : Timestamp(1553510158, 20),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1553510158, 20),
        "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
        }
    }
}
The other cluster is running smoothly and no error is given:
rs0:PRIMARY> db.runCommand({ dataSize: "dudubots.channel_tdata", keyPattern: { "c_id_s": 1 }, min: { "c_id_s": 3015 }, max: { "c_id_s": 3017 } })
{
    "estimate" : false,
    "size" : 6075684,
    "numObjects" : 3778,
    "millis" : 1315,
    "ok" : 1
}
The field c_id_s is indexed on both clusters. We don't understand why that one cluster fails to run the command.
We have found the problem: the index had actually been changed. The dataSize command requires an ascending index on the key pattern, but on the failing cluster the index had been changed to descending.
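For anyone hitting the same error, a minimal mongo shell sketch to check the index direction and rebuild it as ascending (collection and field names are taken from the question; dropping and recreating the index is an assumed way to apply the fix):
// List indexes and look at the key direction (1 = ascending, -1 = descending).
db.getSiblingDB("dudubots").channel_tdata.getIndexes();
// If the index on c_id_s is descending, rebuild it ascending so dataSize can use it.
db.getSiblingDB("dudubots").channel_tdata.dropIndex({ "c_id_s": -1 });
db.getSiblingDB("dudubots").channel_tdata.createIndex({ "c_id_s": 1 });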

How to measure the query run time in MongoDB

I am trying to measure the query run time in MongoDB.
Steps:
I enabled profiling in MongoDB and ran my query.
When I ran show profile, I got the output below.
db.blogpost.find({post:/.* NATO .*/i})
blogpost is the collection name; I searched for the "NATO" keyword in the query.
Output: the query pulled out 20 records, and after running it I got the profiler output below.
In the output I can see 3 time values. Which one corresponds to the query duration in MySQL?
query blogtrackernosql.blogpost **472ms** Wed Apr 11 2018 20:37:54
command: {
    "find" : "blogpost",
    "filter" : {
        "post" : /.* NATO .*/i
    },
    "$db" : "blogtrackernosql"
}
cursorid: 99983342073 keysExamined: 0 docsExamined: 1122 numYield: 19
locks: {
    "Global" : {
        "acquireCount" : {
            "r" : NumberLong(40)
        }
    },
    "Database" : {
        "acquireCount" : {
            "r" : NumberLong(20)
        }
    },
    "Collection" : {
        "acquireCount" : {
            "r" : NumberLong(20)
        }
    }
}
nreturned: 101 responseLength: 723471 protocol: op_msg planSummary: COLLSCAN
execStats: {
    **"stage"** : "COLLSCAN",
    "filter" : {
        "post" : {
            "$regex" : ".* NATO .*",
            "$options" : "i"
        }
    },
    "nReturned" : 101,
    **"executionTimeMillisEstimate" : 422**,
    "works" : 1123,
    "advanced" : 101,
    "needTime" : 1022,
    "needYield" : 0,
    "saveState" : 20,
    "restoreState" : 19,
    "isEOF" : 0,
    "invalidates" : 0,
    "direction" : "forward",
    "docsExamined" : 1122
}
client: 127.0.0.1 appName: MongoDB Shell allUsers: [ ] user:
This ...
"executionTimeMillisEstimate" : 422
... is MongoDB's estimation of how long that query will take to execute on the MongoDB server.
This ...
query blogtrackernosql.blogpost 472ms
... must be the end-to-end time including some client side piece (e.g. forming the query and sending it to the MongoDB server) plus the data transfer time from the MongoDB server back to your client.
So:
472 ms is the total start-to-finish time
422 ms is the time spent inside the MongoDB server
Note: the output also tells you that MongoDB had to scan the entire collection ("stage": "COLLSCAN") to perform this query. FWIW, the reason it has to scan the collection is that you are using a case-insensitive $regex. According to the docs:
Case insensitive regular expression queries generally cannot use indexes effectively.
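If you want per-query timing without enabling the profiler, here is a minimal sketch using explain with executionStats (standard mongo shell; executionTimeMillis is the server-side execution time for that run):
// Server-side execution time for this exact query, analogous to the 422 ms above.
var stats = db.blogpost.find({post: /.* NATO .*/i}).explain("executionStats").executionStats;
print("executionTimeMillis: " + stats.executionTimeMillis +
      ", docsExamined: " + stats.totalDocsExamined +
      ", nReturned: " + stats.nReturned);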

MongoDB - CPU utilisation goes beyond 70% on huge update

I have a MongoDB collection test. An update operation is run on this collection every 30 seconds. Sometimes the CPU utilisation goes above 80% and connections time out. Also, when I enabled profiling and checked /var/log/mongodb/mongod.log, the update operation takes more than 40 seconds to complete. The update query is very big and looks like this.
Update Query
test.update({_id: xxxxxxxx}, {$set: {
    c_id: "5803a892b6646ad17b2b7a67", s_id: "58f29ee1d6ee0610543f152e",
    c_id: "58f29a38c91379637619605c", m: "78a3512ad29c",
    date: new Date(1505520000000), "l.0.h.0.t": 0, "l.0.m.0.0.t": 0,
    "l.0.h.0.t": 0, "l.0.m.0.0.t": 0, "l.0.h.0.r": 0, "l.0.m.0.0.r": 0,
    "l.0.h.0.r": 0, "l.0.m.0.0.r": 0, "l.1.h.0.t": 0, "l.1.m.0.0.t": 0,
    "l.1.h.0.t": 0, "l.1.m.0.0.t": 0, "l.1.h.0.r": 0, "l.1.m.0.0.r": 0,
    "l.1.h.0.r": 0, "l.1.m.0.0.r": 0, "l.0.m.0.1.t": 0, "l.0.m.0.1.t": 0,
    "l.0.m.0.1.rxb": 0, "l.0.m.0.1.r": 0,
    "l.1.m.0.1.t": 0, "l.1.m.0.1.t": 0, "l.1.m.0.1.r": 0, "l.1.m.0.1.r": 0,
    "l.0.m.0.2.t": 0,
    .....................
    .....................
    .....................
    .....................
    .....................
}})
The query is very large (I have posted only part of it) and the schema is deeply nested. How can I improve the performance of this update? Should I optimize the schema to reduce the number of nested documents?
I would like to know what steps I can take to improve this update query.
.explain() output for Update Query
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "test.stats",
        "indexFilterSet" : false,
        "winningPlan" : {
            "stage" : "UPDATE",
            "inputStage" : {
                "stage" : "IDHACK"
            }
        },
        "rejectedPlans" : []
    },
    "serverInfo" : {
        "host" : "abcxyz",
        "port" : 27017,
        "version" : "3.4.4",
        "gitVersion" : "988390515874a9debd1b6c5d36559ca86b4babj"
    },
    "ok" : 1.0
}

Speeding up $or query in pymongo

I have a collection of 1.8 billion records stored in mongodb, where each record looks like this:
{
    "_id" : ObjectId("54c1a013715faf2cc0047c77"),
    "service_type" : "JE",
    "receiver_id" : NumberLong("865438083645"),
    "time" : ISODate("2012-12-05T23:07:36Z"),
    "duration" : 24,
    "service_description" : "NQ",
    "receiver_cell_id" : null,
    "location_id" : "658_55525",
    "caller_id" : NumberLong("475035504705")
}
I need to get all the records for 2 million specific users (I have the IDs of the users of interest in a text file) and process them before writing the results to a database. I have indexes on receiver_id and on caller_id (each is part of a single index).
The current procedure I have is the following:
for user in list_of_2million_users:
    user_records = collection.find({"$or": [{"caller_id": user}, {"receiver_id": user}]})
    for record in user_records:
        process(record)
However, it takes 15 seconds on average to consume the user_records cursor (the process function is very simple, with low running time). This will not be feasible for 2 million users. Any suggestions for speeding up the $or query? It seems to be the most time-consuming step.
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
    "clauses" : [
        {
            "cursor" : "BtreeCursor caller_id_1",
            "isMultiKey" : false,
            "n" : 401,
            "nscannedObjects" : 401,
            "nscanned" : 401,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nChunkSkips" : 0,
            "indexBounds" : {
                "caller_id" : [
                    [
                        125091840205,
                        125091840205
                    ]
                ]
            }
        },
        {
            "cursor" : "BtreeCursor receiver_id_1",
            "isMultiKey" : false,
            "n" : 383,
            "nscannedObjects" : 383,
            "nscanned" : 383,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nChunkSkips" : 0,
            "indexBounds" : {
                "receiver_id" : [
                    [
                        125091840205,
                        125091840205
                    ]
                ]
            }
        }
    ],
    "cursor" : "QueryOptimizerCursor",
    "n" : 784,
    "nscannedObjects" : 784,
    "nscanned" : 784,
    "nscannedObjectsAllPlans" : 784,
    "nscannedAllPlans" : 784,
    "scanAndOrder" : false,
    "nYields" : 753,
    "nChunkSkips" : 0,
    "millis" : 31057,
    "server" : "some_server:27017",
    "filterSet" : false
}
And this is the collection stats:
db.call_records.stats()
{
    "ns" : "stc_cdrs.call_records",
    "count" : 1825338618,
    "size" : 438081268320,
    "avgObjSize" : 240,
    "storageSize" : 468641284752,
    "numExtents" : 239,
    "nindexes" : 3,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 165290709024,
    "indexSizes" : {
        "_id_" : 73450862016,
        "caller_id_1" : 45919923504,
        "receiver_id_1" : 45919923504
    },
    "ok" : 1
}
I am running an Ubuntu server with 125 GB of RAM.
Note that I will run this analysis only once (it is not something I will do periodically).
If the indexes on caller_id and receiver_id are a single compound index, this query will do a collection scan instead of an index scan. Make sure each field is part of its own separate index, i.e.:
db.user_records.ensureIndex({caller_id:1})
db.user_records.ensureIndex({receiver_id:1})
You can confirm that your query is doing an index scan in the mongo shell:
db.user_records.find({'$or':[{caller_id:'example'},{receiver_id:'example'}]}).explain()
If the explain plan returns its cursor type as BTreeCursor, you're using an index scan. If it says BasicCursor, you're doing a collection scan which is not good.
It would also be interesting to know the size of each index. For the best query performance, both indexes should be completely loaded into RAM. If the indexes are so large that only one (or neither!) fits into RAM, you will have to page them in from disk to look up the results. If they're too big to fit in RAM, your options are not great: basically either split up your collection in some manner and re-index it, or get more RAM. You could always spin up an AWS RAM-heavy instance just for the purpose of this analysis, since this is a one-off thing.
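As a rough check against the stats shown in the question, a minimal mongo shell sketch (the 125 GB RAM figure also comes from the question):
// Sum the sizes of the two indexes the $or query uses and compare them to available RAM.
var sizes = db.call_records.stats().indexSizes;
var neededGB = (sizes.caller_id_1 + sizes.receiver_id_1) / Math.pow(1024, 3);
print("caller_id_1 + receiver_id_1 = " + neededGB.toFixed(1) + " GB");  // roughly 85.5 GB
// Together with the ~68 GB _id_ index, the total index size (~154 GB) exceeds the 125 GB of RAM.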
I am no expert in MongoDB, but I had a similar problem and the following suggestions helped me tackle it. I hope they help you too.
The query is using indexes and scanning exactly the matching documents, so there is no issue with your indexing. Still, I would suggest:
First of all, check the output of: mongostat --discover
Look at parameters such as page faults and index misses.
Have you tried warming up (i.e. the query's performance after executing it once)? What is the performance after warming up? If it is the same as before, there might be page faults.
If you are going to run this as a one-off analysis, I think warming up the database might help you.
I don't know why your approach is so slow.
But you might want to try these alternative approaches:
Use $in with many IDs at once. I'm not sure whether MongoDB handles millions of values well, but if it does not, sort the list of IDs and split it into batches (see the sketch below).
Do a collection scan in the application and check each entry against a hash set containing the interesting IDs. This should have acceptable performance for a one-off script, especially since you're interested in so many IDs.
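A minimal sketch of the $in batching idea, written for the Node.js driver rather than pymongo (collection and field names come from the question; the batch size of 1000 is an arbitrary assumption, and process() stands in for the question's per-record processing):
// Query caller_id and receiver_id with $in so each side can use its own index,
// working through the 2 million IDs in fixed-size batches instead of one query per user.
async function processInBatches(collection, userIds, batchSize = 1000) {
    for (let i = 0; i < userIds.length; i += batchSize) {
        const batch = userIds.slice(i, i + batchSize);
        const cursor = collection.find({
            $or: [{ caller_id: { $in: batch } }, { receiver_id: { $in: batch } }]
        });
        for await (const record of cursor) {
            process(record);  // same per-record processing as in the question
        }
    }
}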

MapReduce in MongoDB doesn't output

I was trying to use MongoDB 2.4.3 (I also tried 2.4.4) with mapReduce on a cluster of 2 shards, each with 3 replicas. The problem is that the results of the mapReduce job are not being reduced into the output collection. I tried an incremental map-reduce, and I also tried "merging" instead of reducing, but that didn't work either.
The map-reduce command run on mongos (coll isn't sharded):
db.coll.mapReduce(map, reduce, {out: {reduce: "events", "sharded": true}})
Which yields the following output:
{
    "result" : "events",
    "counts" : {
        "input" : NumberLong(2),
        "emit" : NumberLong(2),
        "reduce" : NumberLong(0),
        "output" : NumberLong(28304112)
    },
    "timeMillis" : 418,
    "timing" : {
        "shardProcessing" : 11,
        "postProcessing" : 407
    },
    "shardCounts" : {
        "stats2/192.168.…:27017,192.168.…" : {
            "input" : 2,
            "emit" : 2,
            "reduce" : 0,
            "output" : 2
        }
    },
    "postProcessCounts" : {
        "stats1/192.168.…:27017,…" : {
            "input" : NumberLong(0),
            "reduce" : NumberLong(0),
            "output" : NumberLong(14151042)
        },
        "stats2/192.168.…:27017,…" : {
            "input" : NumberLong(0),
            "reduce" : NumberLong(0),
            "output" : NumberLong(14153070)
        }
    },
    "ok" : 1
}
So I see that mapReduce is run over 2 records, which results in 2 records output. However, in the postProcessCounts for both shards the input count stays 0. Also, trying to find the record with a search on _id yields no result. In the MongoDB log file I wasn't able to find any related error messages.
I then tried to reproduce this with a newly created output collection, also sharded on hashed _id and with the same indexes, but I wasn't able to reproduce it. When outputting the same input to a different collection
db.coll.mapReduce(map, reduce, {out: {reduce: "events_test2", "sharded": true}})
the result is stored in the output collection and I got the following output:
{
    "result" : "events_test2",
    "counts" : {
        "input" : NumberLong(2),
        "emit" : NumberLong(2),
        "reduce" : NumberLong(0),
        "output" : NumberLong(4)
    },
    "timeMillis" : 321,
    "timing" : {
        "shardProcessing" : 68,
        "postProcessing" : 253
    },
    "shardCounts" : {
        "stats2/192.168.…:27017,…" : {
            "input" : 2,
            "emit" : 2,
            "reduce" : 0,
            "output" : 2
        }
    },
    "postProcessCounts" : {
        "stats1/192.168.…:27017,…" : {
            "input" : NumberLong(2),
            "reduce" : NumberLong(0),
            "output" : NumberLong(2)
        },
        "stats2/192.168.…:27017,…" : {
            "input" : NumberLong(2),
            "reduce" : NumberLong(0),
            "output" : NumberLong(2)
        }
    },
    "ok" : 1
}
When running the script again with the same input, outputting again into the second collection, it shows that it is reducing in postProcessCounts. So the map and reduce functions do their job fine. Why doesn't it work on the larger first collection? Am I doing something wrong here? Are there any special limitations on collections that can be used as output for map-reduce?
mapReduce is run over 2 records, which results in 2 records outputted. However in the postProcessCounts for both shards the input count stays 0.
Map is run over 2 records. If those two records have different keys, then map will emit 2 keys with one value each, which is normal.
But something I noticed in an older version of MongoDB (not sure if it applies in your case) is that if the "values array" for the reduce phase has a length of 1, the reduce step is skipped, which would explain the reduce count of 0.
Is the output collection empty in the first case?
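To illustrate that behaviour, here is a hypothetical map/reduce pair (the question doesn't show the original functions, so the key and value shapes below are assumptions):
// If every emitted key occurs only once, MongoDB can skip reduce entirely,
// which matches the "emit": 2 / "reduce": 0 counts in the outputs above.
var map = function () {
    emit(this._id, { count: 1 });          // one value per unique key
};

var reduce = function (key, values) {
    var result = { count: 0 };
    values.forEach(function (v) { result.count += v.count; });
    return result;
};

db.coll.mapReduce(map, reduce, { out: { reduce: "events_test2", sharded: true } });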