How do I put {allowDiskUse: true} on pymongo? - mongodb

I have this Python code:
import pymongo
import time
start_time = time.time()
connection_string = 'mongodb://localhost'
connection = pymongo.MongoClient(connection_string)
database = connection.solutions
pipe = [
    {
        '$project': {
            "_id": 0
        }
    },
    {
        '$group': {
            "_id": {
                "vehicleid": "$vehicleid",
                "date": "$metrictimestamp"
            },
            'count': {'$sum': 1}
        }
    }
]
query = list(database.solution1.aggregate(pipe))
print("--- %s seconds ---" % (time.time() - start_time))
And I'm getting this error message: pymongo.errors.OperationFailure: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.
How can I pass allowDiskUse: true?

query = list(database.solution1.aggregate(pipe, allowDiskUse=True))
Ref: pymongo Collection level operations
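For context, here is the question's script with the option applied (a minimal sketch, assuming the same local connection and collection names as above):
import time
import pymongo

start_time = time.time()

connection = pymongo.MongoClient('mongodb://localhost')
database = connection.solutions

pipe = [
    {'$project': {'_id': 0}},
    {'$group': {
        '_id': {'vehicleid': '$vehicleid', 'date': '$metrictimestamp'},
        'count': {'$sum': 1}
    }}
]

# allowDiskUse=True lets the $group stage spill to temporary files on disk
# instead of failing once it exceeds its in-memory limit.
query = list(database.solution1.aggregate(pipe, allowDiskUse=True))

print("--- %s seconds ---" % (time.time() - start_time))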

Related

Delete record based on condition from mongo database using jmeter

I am trying to delete records from a Mongo database using a JMeter JSR223 Groovy sampler, but I am getting an error.
Request:
-- delete the records which are retrieved between the insert dates
MongoCollection Collection = database.getCollection("TEST_COMM");
Collection.deleteMany{
    $and: [
        {
            "insertDate": {
                $gte: ISODate("2019-10-15T15:16:45.328+0000")
            }
        },
        {
            "insertDate": {
                $lte: ISODate("2019-10-15T17:02:44.017+0000")
            }
        }
    ]
}
Response:
Script1.groovy: 50: expecting '}', found ':' # line 50, column 25.
"insertDate": {
^
Can someone help in resolving the issue?
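Not a Groovy fix, but for reference, the same date-range delete expressed with pymongo (the driver used at the top of this page) would look roughly like this; the connection string and database name are assumptions:
from datetime import datetime
import pymongo

# 'mydb' is a placeholder database name; the collection name comes from the question.
collection = pymongo.MongoClient('mongodb://localhost')['mydb']['TEST_COMM']

# Delete every document whose insertDate lies between the two timestamps (UTC).
result = collection.delete_many({
    '$and': [
        {'insertDate': {'$gte': datetime(2019, 10, 15, 15, 16, 45, 328000)}},
        {'insertDate': {'$lte': datetime(2019, 10, 15, 17, 2, 44, 17000)}}
    ]
})
print(result.deleted_count)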

logstash-input-mongodb loop on a "restarting error" - Timestamp

I am trying to use the mongodb plugin as an input for Logstash.
Here is my simple configuration:
input {
    mongodb {
        uri => 'mongodb://localhost:27017/testDB'
        placeholder_db_dir => '/Users/TEST/Documents/WORK/ELK_Stack/LogStash/data/'
        collection => 'logCollection_ALL'
        batch_size => 50
    }
}
filter {}
output { stdout {} }
But I'm facing a "loop issue", probably due to a "timestamp" field, and I don't know what to do.
[2018-04-25T12:01:35,998][WARN ][logstash.inputs.mongodb ] MongoDB Input threw an exception, restarting {:exception=>#TypeError: wrong argument type String (expected LogStash::Timestamp)>}
With also a DEBUG log:
[2018-04-25T12:01:34.893000 #2900] DEBUG -- : MONGODB | QUERY | namespace=testDB.logCollection_ALL selector={:_id=>{:$gt=>BSON::ObjectId('5ae04f5917e7979b0a000001')}} flags=[:slave_ok] limit=50 skip=0 project=nil | runtime: 39.0000ms
How can I parameterize my Logstash config to get my output in the stdout console?
It's because of the #timestamp field, which has an ISODate data type.
You must remove this field from all documents.
db.getCollection('collection1').update({}, {$unset: {"#timestamp": 1}}, {multi: true})
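If you would rather do this from Python than the mongo shell, a rough pymongo equivalent (database and collection names taken from the Logstash config above, connection string assumed) is:
import pymongo

db = pymongo.MongoClient('mongodb://localhost:27017')['testDB']

# Remove the '#timestamp' field from every document in the collection.
db['logCollection_ALL'].update_many({}, {'$unset': {'#timestamp': 1}})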

Mongodb Lookup error in Ansible playbook

I am using MongoDB with Ansible, and I am trying to query a MongoDB collection from an Ansible playbook.
main.yaml (vars file where I have declared all variables)
source_dir_qry_res: "/home/dpatel/Desktop/query_output/"
mongodb_parameters:
  - { collection: '1.1.1.1-mx', filter: {"config.groups.interfaces.interface.name": "xyz"}, qry_res_file: 'allconfig' }
In the above file, 'filter' is the query run against the MongoDB collection. I wrote the filter that way by referring to the "Mongo Lookup" section of the Ansible lookup docs: http://docs.ansible.com/ansible/playbooks_lookups.html
# main_run.yaml
- name: query from mongodb collection
  connection: local
  gather_facts: no
  vars_files:
    - 'vars/main.yml'
  tasks:
    - name: "query to db"
      query_config:
        collection: "{{ item.collection }}"
        filter: "{{ item.filter }}"
        qry_res_file: "{{ item.qry_res_file }}"
        source_dir: "{{ source_dir_qry_res }}"
      with_items: "{{ mongodb_parameters }}"
      tags: "save-query-result"
query_config (custom ansible module)
def main():
    module = AnsibleModule(
        argument_spec=dict(
            #host=dict(required=True),
            collection=dict(required=False),
            qry_res_file=dict(required=False),
            filter=dict(required=False),
            source_dir=dict(required=True),
            logfile=dict(required=False, default=None)),
        supports_check_mode=True)
    m_args = module.params
    m_results = dict(changed=False)
    try:
        conn = pymongo.MongoClient()
        #db = conn.m_args['dtbase']
        db = conn.mydb
        coll_name = m_args['collection']
        print "Connected successfully!!!"
    except Exception as ex:
        module.fail_json(msg='mongodb connection error: %s' % ex.message)
    try:
        query = db[coll_name].find(m_args['filter'])
        if(query):
            target_json_file = m_args['source_dir'] + m_args['collection'] + m_args['qry_res_file'] + ".json"
            for ele in query:
                del ele['_id']
                with open(target_json_file, 'a') as the_file:
                    json.dump(ele, the_file)
    except Exception as ex:
        module.fail_json(msg="failed to get query result: %s" % ex.message)
    module.exit_json(**m_results)

from ansible.module_utils.basic import *

if __name__ == '__main__':
    main()
mongo collection (mydb["1.1.1.1-mx"].find())
{"config": { "groups": {"name": "123", "system": {"host-name": "something"}, "interfaces": {"interface": {"name": "xyz"}}}}}
When I run the query below from the mongo shell, it works fine.
mydb["1.1.1.1-mx"].find({"config.groups.interfaces.interface.name": "xyz"})
but when I try to run the same query through the code, it gives me the error below
$ ansible-playbook -i hosts main_run.yml
error msg:"failed to get query result: filter must be an instance of dict, bson.son.SON, or other type that inherits from collections.Mapping"
Please see the screenshot for the detailed error message. If anyone has any idea how to solve this problem, please share your thoughts; that would be really helpful.
I would try to set the required type in your argument_spec:
argument_spec=dict(
    collection=dict(required=False),
    qry_res_file=dict(required=False),
    filter=dict(required=False, type='dict'),
    source_dir=dict(required=True),
    logfile=dict(required=False, default=None))
because the default argument type is string, so you end up with your filter object converted to a string (see invocation.module_args in your output).
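A minimal sketch of the failure mode outside of Ansible (assuming a local MongoDB and the database/collection names from the module code) shows why the type matters:
import pymongo

coll = pymongo.MongoClient()['mydb']['1.1.1.1-mx']  # 'mydb' as in the module code

# A dict filter is accepted: pymongo allows any Mapping here.
coll.find({"config.groups.interfaces.interface.name": "xyz"})

# The same filter passed as a string -- which is what an untyped Ansible
# argument becomes -- raises:
# TypeError: filter must be an instance of dict, bson.son.SON, or other type
# that inherits from collections.Mapping
coll.find('{"config.groups.interfaces.interface.name" : "xyz"}')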

Replica set read preference nearest is still slow

NOTICE: I've also posted this to dba.stackexchange.com. I'm not sure where this question belongs. If it's not here, tell me, and I'll delete it.
I'm testing my replica set, in particular the read preference, and I'm still getting slow reads even with a nearest read preference set.
For the purpose of this question, we can just assume there are 2 mongodb instances (there are in fact 3). PRIMARY is in Amsterdam (AMS). SECONDARY is in Singapore (SG).
I also have 2 application servers in those 2 locations where I am running my test scripts (node+mongoose).
1. From the AMS app server (so low latency with PRIMARY), if I run a simple find query, I get a response in under a second.
2. However, if I run the same query from my app server in SG, I get response times of ~4-7 seconds.
3. If I just connect to the SG SECONDARY from the SG app server, my query times drop to <1s, similar to (1).
Going back to a standard replica set setting (with nearest): looking at the logs, I've noticed that if I send a query to SG using 'nearest', I can see the query there, but I also see an entry for that same query (with fewer lines) in the PRIMARY log. It is interesting that there is always an entry in the PRIMARY log even when querying the SECONDARY. I'm not sure if that is somehow related.
So, if I connect directly to the nearest machine, I get a response <1s, but when using the replica set, unless I'm next to the PRIMARY, response times are >4s.
My question is then: why? Have I set up my replica set incorrectly? Is it a problem on the client side (mongoose/mongodb), or is it in fact working as it is meant to and I've misunderstood how it works under the hood?
Here are my files (apologies for the wall of text):
test.js
mongoose.connect(configDB.url);

var start = new Date().getTime();
Model.find({})
    .exec(function(err, betas) {
        var end = new Date().getTime();
        var time = end - start;
        console.log(time / 1000);
        console.log('finished');
        console.log(betas.length);
    });
config, also tried with server and replSet options
module.exports = {
    'url': 'user:pwd@ip-primary/db,user:pwd@ip-secondary/db,user:pwd@ip-secondary/db'
}
Betas model
var betaSchema = mongoose.Schema({
    // .. some fields
}, { read: 'n' });
And the log output from doing a read query as above from the SG app server:
LOG OF PRIMARY:
2015-09-16T07:49:23.120-0400 D COMMAND [conn12520] run command db.$cmd { listIndexes: "betas", cursor: {} }
2015-09-16T07:49:23.120-0400 I COMMAND [conn12520] command db.$cmd command: listIndexes { listIndexes: "betas", cursor: {} } keyUpdates:0 writeConflicts:0 numYields:0 reslen:296 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
LOG OF SECONDARY
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Running query:
ns=db.betas limit=1000 skip=0
Tree: $and
Sort: {}
Proj: {}
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] Running query: query: {} sort: {} projection: {} skip: 0 limit: 1000
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Beginning planning...
=============================
Options = INDEX_INTERSECTION KEEP_MUTATIONS
Canonical query:
ns=db.betas limit=1000 skip=0
Tree: $and
Sort: {}
Proj: {}
=============================
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Index 0 is kp: { _id: 1 } io: { v: 1, key: { _id: 1 }, name: "_id_", ns: "db.betas" }
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Index 1 is kp: { email: 1 } io: { v: 1, unique: true, key: { email: 1 }, name: "email_1", ns: "db.betas", background: true, safe: null }
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Rated tree:
$and
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Planner: outputted 0 indexed solutions.
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Planner: outputting a collscan:
COLLSCAN
---ns = db.betas
---filter = $and
---fetched = 1
---sortedByDiskLoc = 0
---getSort = []
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] Only one plan is available; it will be run but will not be cached. query: {} sort: {} projection: {} skip: 0 limit: 1000, planSummary: COLLSCAN
2015-09-16T07:49:19.368-0400 D QUERY [conn11831] [QLOG] Not caching executor but returning 109 results.
2015-09-16T07:49:19.368-0400 I QUERY [conn11831] query db.betas planSummary: COLLSCAN ntoreturn:1000 ntoskip:0 nscanned:0 nscannedObjects:109 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:109 reslen:17481 locks:{ Global: { acquireCount: { r: 2 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } 0ms
The information in your output shows that the database server is processing the query quickly, so the issue likely lies outside of the database itself, probably in the client.
Are you running the same query multiple times and timing each execution?
I suspect that this may be due to some initial discovery on your MongoDB client's part - how is it to know which node is nearest before responding if it doesn't initially hit every node and time the responses?
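For what it's worth, the behaviour is easy to reproduce outside mongoose with pymongo (used earlier on this page): the driver only routes to the lowest-latency member after it has discovered the set and measured round-trip times via its monitoring heartbeats, so the very first query can pay that discovery cost. A sketch with placeholder hosts and set name:
import pymongo

# readPreference='nearest' routes reads to the member with the lowest measured
# round-trip time; the driver needs a few heartbeats to learn which that is.
client = pymongo.MongoClient(
    'mongodb://user:pwd@ip-primary,ip-secondary,ip-secondary-2/db',  # placeholder hosts
    replicaSet='rs0',                                                # placeholder set name
    readPreference='nearest')

betas = list(client['db'].betas.find({}))
print(len(betas))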

mongodb status of index creation job

I'm using MongoDB and have a collection with roughly 75 million records.
I have added a compound index on two "fields" by using the following command:
db.my_collection.ensureIndex({"data.items.text":1, "created_at":1},{background:true}).
Two days later I'm trying to see the status of the index creation. Running db.currentOp() returns {}; however, when I try to create another index I get this error message:
cannot add index with a background operation in progress.
Is there a way to check the status/progress of the index creation job?
One thing to add - I am using mongodb version 2.0.6. Thanks!
In the mongo shell, type the command below to see the current progress:
rs0:PRIMARY> db.currentOp(true).inprog.forEach(function(op){ if(op.msg!==undefined) print(op.msg) })
Index Build (background) Index Build (background): 1431577/55212209 2%
To do a real-time running status log:
> while (true) { db.currentOp(true).inprog.forEach(function(op){ if(op.msg!==undefined) print(op.msg) }); sleep(1000); }
Index Build: scanning collection Index Build: scanning collection: 43687948/47760207 91%
Index Build: scanning collection Index Build: scanning collection: 43861991/47760228 91%
Index Build: scanning collection Index Build: scanning collection: 44993874/47760246 94%
Index Build: scanning collection Index Build: scanning collection: 45968152/47760259 96%
You could use currentOp with a true argument which returns a more verbose output, including idle connections and system operations.
db.currentOp(true)
... and then you could use db.killOp() to kill the desired operation.
The following should print out index progress:
db
.currentOp({"command.createIndexes": { $exists : true } })
.inprog
.forEach(function(op){ print(op.msg) })
outputs:
Index Build (background) Index Build (background): 5311727/27231147 19%
Unfortunately, DR9885's answer didn't work for me; it has spaces in the code (syntax error), and even with the spaces removed it returns nothing.
This works as of Mongo Shell v3.6.0
db.currentOp().inprog.forEach(function(op){ if(op.msg) print(op.msg) })
I didn't read Bajal's answer until after I posted mine, but it's almost exactly the same, except that the code is slightly shorter, and it also works.
I like:
db.currentOp({
'msg' :{ $exists: true },
'command': { $exists: true },
$or: [
{ 'command.createIndexes': { $exists: true } },
{ 'command.reIndex': { $exists: true } }
]
}).inprog.forEach(function(op) {
print(op.msg);
});
Output example:
Index Build Index Build: 84826/335739 25%
The documentation suggests:
db.adminCommand(
{
currentOp: true,
$or: [
{ op: "command", "command.createIndexes": { $exists: true } },
{ op: "none", "msg" : /^Index Build/ }
]
}
)
Active Indexing Operations example.
A simple one to check the progress of a single index build that is in progress:
db.currentOp({"msg":/Index/}).inprog[0].progress;
outputs:
{ "done" : 86007212, "total" : 96868386 }
To find the progress of index jobs, a nice one-liner:
> db.currentOp().inprog.map(a => a.msg)
[
undefined,
undefined,
undefined,
undefined,
undefined,
undefined,
"Index Build: scanning collection Index Build: scanning collection: 16448156/54469342 30%",
undefined,
undefined
]
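If you would rather check from Python than from the mongo shell, a small pymongo sketch along the same lines (connection string assumed) is:
import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')

# currentOp is run against the admin database; each in-progress index build
# reports its progress in the 'msg' field, e.g.
# "Index Build: scanning collection ...: 16448156/54469342 30%"
for op in client.admin.command({'currentOp': True})['inprog']:
    if op.get('msg'):
        print(op['msg'])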