use cursor as iterator with a query - mongodb

I was reading about MongoDB and came across this part of the tutorial: http://www.mongodb.org/display/DOCS/Tutorial It says:
> var cursor = db.things.find();
> printjson(cursor[4]);
{ "_id" : ObjectId("4c220a42f3924d31102bd858"), "x" : 4, "j" : 3 }
"When using a cursor this way, note that all values up to the highest accessed (cursor[4] above) are loaded into RAM at the same time. This is inappropriate for large result sets, as you will run out of memory. Cursors should be used as an iterator with any query which returns a large number of elements."
How do I use a cursor as an iterator with a query? Thanks for the help.

You've tagged that you're using pymongo, so I'll give you two pymongo examples using the cursor as an iterator:
import pymongo
cursor = pymongo.MongoClient().test_db.test_collection.find()
for item in cursor:
    # this will print the item as a dictionary
    print(item)
and
import pymongo
cursor = pymongo.MongoClient().test_db.test_collection.find()
results = [item['some_attribute'] for item in cursor]
# this builds a list containing the value of some_attribute
# for each item in the collection
In addition, you can set the size of batches returned to the pymongo driver by doing this:
import pymongo
cursor = pymongo.MongoClient().test_db.test_collection.find()
cursor.batch_size(20)  # sets the size of batches of items the cursor will return to 20
It is usually unnecessary to mess with the batch size, but if the machine you are running the driver on is having memory issues and page faulting while you are manipulating results from the query, you might have to set this to achieve better performance (this really seems like a painful optimization to me and I've always left the default).
As for the JavaScript driver (the one that loads when you launch the "shell"), that part of the documentation is cautioning you not to use "array mode". From the online manual:
Array Mode in the Shell
Note that in some languages, like JavaScript, the driver supports an
"array mode". Please check your driver documentation for specifics.
In the db shell, to use the cursor in array mode, use array index []
operations and the length property.
Array mode will load all data into RAM up to the highest index
requested. Thus it should not be used for any query which can return
very large amounts of data: you will run out of memory on the client.
You may also call toArray() on a cursor. toArray() will load all
objects queries into RAM.
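
To make the memory difference concrete, here is a minimal Python sketch. FakeCursor is a hypothetical stand-in for a driver cursor (it is not pymongo's implementation, and the batch size of 3 is arbitrary). It only illustrates that iterating holds one batch at a time, while array-style access keeps every document up to the requested index.

```python
class FakeCursor:
    """Hypothetical stand-in for a driver cursor that fetches in batches."""

    def __init__(self, total, batch_size=3):
        self.total = total            # number of documents the "query" matches
        self.batch_size = batch_size  # documents fetched per simulated round trip
        self.fetched = 0              # how many documents have been fetched so far

    def __iter__(self):
        # Iterator mode: only the current batch is held in memory.
        while self.fetched < self.total:
            end = min(self.fetched + self.batch_size, self.total)
            batch = [{"x": i} for i in range(self.fetched, end)]
            self.fetched = end
            for doc in batch:
                yield doc


def iterate(cursor):
    """Process documents one at a time; returns how many were seen."""
    count = 0
    for _doc in cursor:
        count += 1
    return count


def array_mode(cursor, index):
    """Mimic cursor[index]: keep every document up to `index` in a list."""
    loaded = []
    for doc in cursor:
        loaded.append(doc)
        if len(loaded) > index:
            break
    return loaded[index], len(loaded)
```

With a real result set, the list built by array_mode grows with the index you ask for, which is the behaviour the manual warns about.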

Using the MongoDB Java driver it should be something like:
DBCursor cursor = collection.find( query );
while( cursor.hasNext() ) {
    DBObject obj = cursor.next();
    // do something with obj
}
In the mongo console you can do something like:
var cursor = db.things.find();
while(cursor.hasNext()) { printjson(cursor.next()); }

MongoDB returns results in batches. To see how many objects are left in a batch, we use objLeftInBatch() like this:
var c = db.Schools.find();
var doc = function() { return c.hasNext() ? c.next() : null; }
c.objLeftInBatch();
To iterate through this batch we can call doc(), which we set up in the code block above. More on cursors can be found at https://docs.mongodb.com/

Related

MongoDB findOneAndReplace log if added as new document or replaced

I'm using mongo's findOneAndReplace() with upsert = true and returnNewDocument = true, basically as a way to avoid inserting duplicates. But I want to get the _id of the newly inserted document (or the old existing document) so I can pass it to a background processing task.
BUT I also want to log if the document was Added-As-New or if a Replacement took place.
I can't see any way to use findOneAndReplace() with these parameters and answer that question.
The only thing I can think of is to find and insert in two different requests, which seems a bit counter-productive.
ps. I'm actually using pymongo's find_one_and_replace() but it seems identical to the JS mongo function.
EDIT: edited for clarification.
Is it not possible to use the replace_one function? In Java I am able to use replaceOne, which returns an UpdateResult. That has methods for finding out whether a document was updated or not. I see replace_one in pymongo and it should behave the same. Here is the doc: PyMongo Doc. Look for replace_one.
The way I'm going to implement it for now (in python):
import pymongo
def find_one_and_replace_log(collection, find_query,
                             document_data,
                             log=None):
    ''' behaves like find_one_and_replace(upsert=True,
        return_document=pymongo.ReturnDocument.AFTER)
        and records in log['new_document'] whether the document was new
    '''
    if log is None:
        log = {}  # avoid a shared mutable default argument
    is_new = False
    document = collection.find_one(find_query)
    if not document:
        # document didn't exist
        # log as NEW
        is_new = True
    new_or_replaced_document = collection.find_one_and_replace(
        find_query,
        document_data,
        upsert=True,
        return_document=pymongo.ReturnDocument.AFTER
    )
    log['new_document'] = is_new
    return new_or_replaced_document
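
As an alternative to the two-step lookup above, the replace_one suggestion can be sketched as follows. This is a minimal sketch, not a definitive implementation: replace_and_report is a hypothetical helper name, and it assumes the replacement document still matches find_query (for example because the query is on the de-duplication key). The upserted_id attribute of pymongo's UpdateResult is None when an existing document was replaced.

```python
def replace_and_report(collection, find_query, document_data):
    """Upsert `document_data` and report whether it was newly inserted.

    Returns (document_id, is_new). `collection` is assumed to be a
    pymongo Collection (or anything with the same replace_one/find_one API).
    """
    result = collection.replace_one(find_query, document_data, upsert=True)
    if result.upserted_id is not None:
        # No document matched the filter, so a new one was inserted.
        return result.upserted_id, True
    # An existing document matched and was replaced; look up its _id.
    existing = collection.find_one(find_query, {"_id": 1})
    return existing["_id"], False
```

The insert case needs a single round trip; only the replace case pays for a second, small query to fetch the _id.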

call custom python function on every document in a collection Mongo DB

I want to call a custom Python function on some existing attribute of every document in the entire collection and store the result as a new key-value pair in that (same) document. Is there any way to do that (each call is independent of the others)?
I noticed cursor.forEach, but can't it be done efficiently using just Python?
A simple example would be to split the string in text and store the number of words as a new attribute.
def split_count(text):
    # some complex preprocessing...
    return len(text.split())
# Need something like this...
db.collection.update_many({}, {'$set': {"split": split_count('$text') }}, upsert=True)
But it seems like setting a new attribute in a document based on the value of another attribute in the same document is not possible this way yet. This post is old but the issues seem to be still open.
I found a way to call any custom python function on a collection using parallel_scan in PyMongo.
import threading
def process_text(cursor):
    for row in cursor.batch_size(200):
        # Any complex preprocessing here...
        split_text = row['text'].split()
        db.collection.update_one({'_id': row['_id']},
                                 {'$set': {'split_text': split_text,
                                           'num_words': len(split_text) }},
                                 upsert=True)
def preprocess(num_threads=4):
    # Get up to max 'num_threads' cursors.
    cursors = db.collection.parallel_scan(num_threads)
    threads = [threading.Thread(target=process_text, args=(cursor,)) for cursor in cursors]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
This is not really faster than cursor.forEach (but not that slow either), but it helps me execute any arbitrarily complex python code and save the results from within Python itself.
Also if I have an array of ints in one of the attributes, doing cursor.forEach converts them to floats which I don't want. So I preferred this way.
But I would be glad to know if there're any better ways than this :)
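
For readers on a driver without parallel_scan (as far as I know it is no longer available in PyMongo 4.x), the same per-document update can be sketched with a single cursor. This is a minimal, single-threaded sketch: add_word_counts is a hypothetical helper name, and `collection` is assumed to be a pymongo Collection.

```python
def split_count(text):
    # Placeholder for arbitrarily complex preprocessing.
    return len(text.split())


def add_word_counts(collection, batch_size=200):
    """Store split_count(text) as 'num_words' on every document that has 'text'.

    Returns the number of documents processed.
    """
    updated = 0
    # Fetch only the field the function needs, to keep each batch small.
    cursor = collection.find({"text": {"$exists": True}}, {"text": 1})
    for row in cursor.batch_size(batch_size):
        collection.update_one(
            {"_id": row["_id"]},
            {"$set": {"num_words": split_count(row["text"])}},
        )
        updated += 1
    return updated
```

Each document still makes a round trip to the client, so this trades speed for the freedom to run any Python code on the value.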
It is quite unlikely that it will ever be efficient to do this kind of thing in python. This is because the document would have to make a round trip and go through the python function on the client machine.
In your example code, you are passing the result of a function to a mongodb update query, which won't work. You can't run any python code inside mongodb queries on the db server.
As the answer to your linked question suggests, this type of action has to be performed in the mongo shell, e.g.:
db.collection.find().snapshot().forEach(
    function (elem) {
        // count the space-separated words in the text field
        var splitLength = elem.text.split(" ").length;
        db.collection.update(
            {
                _id: elem._id
            },
            {
                $set: {
                    split: splitLength
                }
            }
        );
    }
);

Updating multiple complex array elements in MongoDB

I know this has been asked before, but I have yet to find a solution that works efficiently. I am working with the MongoDB C# driver, though this is more of a general question about MongoDB operations.
I have a document structure that looks something like this:
field1: value1
field2: value2
...
users: [ {...user 1 subdocument...}, {...user 2 subdocument...}, ... ]
Some facts:
Each user subdocument includes further sub-arrays & subdocuments (so they're fairly complex).
The average users array only contains about 5 elements, but in the worst case can surpass 100.
Several thousand update operations on multiple users may be conducted per day in this system, each on one document at a time. Larger arrays will receive more frequent updates due to their data size.
I am trying to figure out how to do this efficiently. From what I've heard, you cannot directly set several array elements to new values all at once, so I had to try something else.
I tried using the $pullAll / $addToSet + $each operations to remove the old array and replace it with a modified one. I am aware that $pullAll can also remove only the elements that I need, but I would like to preserve the order of elements.
The C# code:
try
{
    WriteConcernResult wcr = collection.Update(query,
        Update.Combine(Update.PullAll("users"),
            Update.AddToSetEach("users", newUsers.ToArray())));
}
catch (WriteConcernException wce)
{
    return wce.Message;
}
In this case newUsers is a List<BsonValue> converted to an array. However, I am getting the following exception message:
Cannot update 'users' and 'users' at the same time
By the looks of it, I can't have two update statements in use on the same field in the same write operation.
I also tried Update.Set("users", newUsers.ToArray()), but apparently the Set statement doesn't work with arrays, just basic values:
Argument 2: cannot convert from 'MongoDB.Bson.BsonValue[]' to 'MongoDB.Bson.BsonValue'
So then I tried converting that array to a BsonDocument:
Update.Set("users", newUsers.ToArray().ToBsonDocument());
And got this:
An Array value cannot be written to the root level of a BSON document.
I could try replacing the whole document, but that seems like overkill and definitely not very efficient.
So the only thing I can think of now is to run two separate write operations: one to remove the unwanted old users and another to replace them with their newer versions:
WriteConcernResult wcr = collection.Update(query, Update.PullAll("users"));
WriteConcernResult wcr = collection.Update(query, Update.AddToSetEach("users", newUsers.ToArray()));
Is this my best option? Or is there another, better way of doing this?
Your code should work with a minor change:
Update.Set("users", new BsonArray(newUsers));
BsonArray is a BsonValue, whereas an array of documents is not, and we don't implicitly convert arrays like we do other primitive values.
This extension method solved my problem:
using System.Collections;
using MongoDB.Bson;
public static class MongoExtension
{
    public static BsonArray ToBsonArray(this IEnumerable list)
    {
        var array = new BsonArray();
        // every item is expected to already be a BsonValue
        foreach (var item in list)
            array.Add((BsonValue) item);
        return array;
    }
}
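
For comparison with the Update.Set / BsonArray fix above, the same single-write replacement of the array can be sketched in Python. This is a minimal sketch under assumptions: replace_users is a hypothetical helper name, and `collection` is assumed to be a pymongo Collection. A $set with a list value overwrites the whole array and keeps the order of the list.

```python
def replace_users(collection, query, new_users):
    """Replace the whole 'users' array of the matching document in one write.

    Returns the number of documents that matched `query`.
    """
    # $set with a list value overwrites the array and preserves list order.
    result = collection.update_one(query, {"$set": {"users": list(new_users)}})
    return result.matched_count
```

This sends the full array on every update, which should be acceptable for arrays of the size described in the question but is worth measuring for the 100-element worst case.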

MongoDB remove mapreduce collection

Due to an error in client code, MongoDB has created many "mr.mapreduce...." collections. How can I remove them all (by mask, maybe)?
I ran this script in the interactive shell:
function f() {
    var names = db.getCollectionNames();
    for (var i = 0; i < names.length; i++) {
        // drop every collection whose name starts with "mr."
        if (names[i].indexOf("mr.") == 0) {
            db[names[i]].drop();
        }
    }
}
f();
It resolved my problem.
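
The same prefix-based cleanup can be sketched from Python. This is a minimal sketch: drop_collections_with_prefix is a hypothetical helper name, and `db` is assumed to be a pymongo Database, whose list_collection_names() and drop_collection() methods are used here.

```python
def drop_collections_with_prefix(db, prefix="mr."):
    """Drop every collection in `db` whose name starts with `prefix`.

    Returns the list of dropped collection names.
    """
    dropped = []
    for name in db.list_collection_names():
        if name.startswith(prefix):
            db.drop_collection(name)
            dropped.append(name)
    return dropped
```

Returning the dropped names makes it easy to log what was removed, or to print the list first and only drop after checking it.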
Temporary map-reduce collections should be cleaned up when the connection that created them is closed:
map/reduce is invoked via a database
command. The database creates a
temporary collection to hold output of
the operation. The collection is
cleaned up when the client connection
closes, or when explicitly dropped.
Alternatively, one can specify a
permanent output collection name. map
and reduce functions are written in
JavaScript and execute on the server.
-- MongoDB docs
If not, you could delete them using the same method you would use to delete any other collection. It might get a bit repetitive though.
Another way to achieve the same thing is this snippet:
db.system.namespaces.find({name: /tmp.mr/}).forEach(function(z) {
    try {
        db.getMongo().getCollection( z.name ).drop();
    } catch(err) {}
});
Pro: It won't try to collect all your namespaces into a JavaScript Array. MongoDB segfaults on too many namespaces.
Temporary map-reduce collections should be cleaned up when the connection that created them is closed. However, sometimes they remain and increase the database size. You can remove them using the following script:
var database = db.getSiblingDB('yourDatabaseName');
// list only temporary collections whose name contains "tmp.mr"
var tmpCollections = database.getCollectionInfos(
    {
        name: {$regex: /tmp\.mr/},
        'options.temp': true,
    });
tmpCollections.forEach(function (collection) {
    database.getCollection(collection.name).drop();
});
print(`${tmpCollections.length} tmp collection(s) deleted`);
The drop.tmp.js script can be executed from the command line as follows:
mongo --quiet mongodb://mongodb:27137 drop.tmp.js

How to print out more than 20 items (documents) in MongoDB's shell?

db.foo.find().limit(300)
won't do it. It still prints out only 20 documents.
db.foo.find().toArray()
db.foo.find().forEach(printjson)
will both print out a very expanded view of each document instead of the 1-line version for find().
DBQuery.shellBatchSize = 300
MongoDB Docs - Configure the mongo Shell - Change the mongo Shell Batch Size
From the shell you can use:
db.collection.find().toArray()
to display all documents without having to type "it".
You can type "it" inside the shell to iterate over the next 20 results. Just type "it" when you see "has more" and you will see the next 20 items.
Could always do:
db.foo.find().forEach(function(f){print(tojson(f, '', true));});
To get that compact view.
Also, I find it very useful to limit the fields returned by the find so:
db.foo.find({},{name:1}).forEach(function(f){print(tojson(f, '', true));});
which would return only the _id and name field from foo.
With the newer version of the mongo shell (mongosh), use the following syntax:
config.set("displayBatchSize", 300)
instead of the deprecated:
DBQuery.shellBatchSize = 300
Future find() or aggregate() operations will only return 300 documents per cursor iteration.
I suggest keeping a ~/.mongorc.js file so you do not have to set the default size every time.
# execute in your terminal
touch ~/.mongorc.js
echo 'DBQuery.shellBatchSize = 100;' > ~/.mongorc.js
# add one more line to always pretty-print the output
echo 'DBQuery.prototype._prettyShell = true; ' >> ~/.mongorc.js
To learn what else you can do, I suggest looking at this article: http://mo.github.io/2017/01/22/mongo-db-tips-and-tricks.html
In the mongo shell, if the returned cursor is not assigned to a variable using the var keyword, the cursor is automatically iterated to access up to the first 20 documents that match the query. You can set the DBQuery.shellBatchSize variable to change the number of automatically iterated documents.
Reference - https://docs.mongodb.com/v3.2/reference/method/db.collection.find/
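
The shell behaviour described above (show up to shellBatchSize documents, then wait for "it") can be mimicked in a script. This is a minimal Python sketch, assuming any iterable cursor such as the one pymongo's find() returns. The name batches is a hypothetical helper, and 20 is used as the default only because it is the shell default mentioned above.

```python
from itertools import islice


def batches(cursor, size=20):
    """Yield lists of up to `size` documents from any iterable cursor."""
    iterator = iter(cursor)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each yielded list corresponds to one "page" of output, so a caller can print a page, wait for input, and then ask for the next one.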
First run show dbs, then use <your database name>. In my case that is use smartbank.
Then run show collections to check the collection names, and finally db.<your collection name>.find() or find({}):
show dbs
use smartbank
show collections
db.users.find()
db.users.find({})
db.users.find({_id: ObjectId("60c8823cbe9c1c21604f642b")})
db.users.find({}).limit(20)
You can specify _id: ObjectId(<the document id>) to get a single document, or you can specify a limit, as in db.users.find({}).limit(20).