What is a Cursor in MongoDB?

We are troubled by occasionally occurring "cursor not found" exceptions for some Morphia queries' asList() calls, and I've found a hint on SO that this might be quite memory-consuming.
Now I'd like to know a bit more about the background: can somebody explain (in plain English) what a cursor in MongoDB actually is? Why can it be kept open, or not be found?
The documentation defines a cursor as:
A pointer to the result set of a query. Clients can iterate through a cursor to retrieve results. By default, cursors timeout after 10 minutes of inactivity
But this is not very telling. Maybe it could be helpful to define a batch for query results, because the documentation also states:
The MongoDB server returns the query results in batches. Batch size will not exceed the maximum BSON document size. For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes. [...] For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.
Note: the queries in question don't use sort at all, nor limit or skip.

Here's a comparison between toArray() and cursors after a find() in the Node.js MongoDB driver. Common code:
var MongoClient = require('mongodb').MongoClient,
    assert = require('assert');

MongoClient.connect('mongodb://localhost:27017/crunchbase', function (err, db) {
    assert.equal(err, null);
    console.log('Successfully connected to MongoDB.');

    const query = { category_code: "biotech" };

    // toArray() vs. cursor code goes here
});
Here's the toArray() code that goes in the section above.
db.collection('companies').find(query).toArray(function (err, docs) {
    assert.equal(err, null);
    assert.notEqual(docs.length, 0);

    docs.forEach(doc => {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    });

    db.close();
});
Per the documentation,
The caller is responsible for making sure that there is enough memory to store the results.
Here's the cursor-based approach, using the cursor.forEach() method:
const cursor = db.collection('companies').find(query);

cursor.forEach(
    function (doc) {
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    },
    function (err) {
        assert.equal(err, null);
        return db.close();
    }
);
With the forEach() approach, instead of fetching all the data into memory, we're streaming it to our application. find() returns a cursor immediately, because it doesn't actually make a request to the database until we try to consume some of the documents it will provide; the point of the cursor is to describe our query. The second parameter to cursor.forEach() is a callback invoked when an error occurs or the cursor is exhausted.
In the toArray() version above, it was the toArray() call that forced the database request: it meant we needed ALL the documents and wanted them in an array.
Note that MongoDB returns data in batches: the application issues repeated requests on the cursor (getMore operations), and the server answers each with the next batch.
forEach() scales better than toArray() because we can process documents as they come in, all the way to the end. Contrast that with toArray(), where we wait for ALL the documents to be retrieved and the entire array to be built before doing anything. This means we get no advantage from the fact that the driver and the database system work together to batch results to the application. Batching is meant to provide efficiency in terms of memory overhead and execution time; take advantage of it in your application if you can.
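Batching can also be tuned explicitly. Here's a minimal sketch reusing db, query, and assert from the example above; the batch size of 50 is an arbitrary assumption, not something the original example prescribes:

const tunedCursor = db.collection('companies').find(query).batchSize(50);

tunedCursor.forEach(
    function (doc) {
        // each getMore round trip now returns at most 50 documents
        console.log(`${doc.name} is a ${doc.category_code} company.`);
    },
    function (err) {
        assert.equal(err, null);
        return db.close();
    }
);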

I am by no means a MongoDB expert, but I just want to add some observations from working on a medium-sized Mongo system for the last year. Also thanks to @xameeramir for the excellent walkthrough of how cursors work in general.
The causes of a "cursor not found" exception may be several. One that I have noticed is explained in this answer.
The cursor lives server-side. It is not distributed over a replica set but exists on the instance that is primary at the time of creation. This means that if another instance takes over as primary, the cursor will be lost to the client. If the old primary is still up and around, it may still be there, but of no use; I guess it is garbage-collected after a while. So if your replica set is unstable, or you have a shaky network in front of it, you are out of luck when doing any long-running queries.
If the full content of what the cursor wants to return does not fit in memory on the server, the query may be very slow. The RAM on your servers needs to be larger than the largest query you run.
All this can partly be avoided by better design. For a use case with large, long-running queries, you may be better off with several smaller collections instead of one big one.
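For example, one common way to avoid a single long-lived cursor is to page through the data with a series of short-lived queries keyed on _id. A minimal mongo shell sketch, with the collection name bigcollection and the batch size of 1000 both assumed:

var lastId = null;
while (true) {
    var filter = (lastId === null) ? {} : { _id: { $gt: lastId } };
    var batch = db.bigcollection.find(filter).sort({ _id: 1 }).limit(1000).toArray();
    if (batch.length === 0) break;          // no documents left
    batch.forEach(function (doc) { /* process doc */ });
    lastId = batch[batch.length - 1]._id;   // each iteration opens a fresh, short-lived cursor
}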

The collection's find() method returns a cursor; this points to the set of documents (called the result set) matched by the query filter. The result set is the actual documents returned by the query, but these live on the database server.
The client program, for example the mongo shell, gets a cursor. You can think of the cursor as an API or a handle for working with the result set. The cursor has many methods which can be run to perform actions on the result set; some of them affect the result set's data, and some provide status or info about it.
As the cursor maintains information about the result set, some of this information can change as you consume the result set's data by applying other cursor methods. You use these methods and this information to suit your application, i.e., how and what you want to do with the queried data.
Working with the result set using the cursor, and some of its commonly used methods and features, from the mongo shell:
The count() method returns the number of documents in the result set, as initially matched by the query. It is constant at any point in the life of the cursor; this piece of information remains the same even after the cursor is closed or exhausted.
As you read documents from the result set, the result set gets exhausted; once completely exhausted, you cannot read any more. hasNext() tells you whether there are documents still available to be read, returning true or false. next() returns a document if one is available (you first check with hasNext(), then call next()). These two methods are commonly used to iterate over the result set's data. Another iteration method is forEach().
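For example, the typical iteration pattern in the mongo shell looks like this (using the same test collection as the example further below):

var cur = db.test.find();
while (cur.hasNext()) {
    printjson(cur.next());   // read documents one by one until the result set is exhausted
}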
The data is retrieved from the server in batches, which have a default size. Within the first batch you read documents one by one, and when all of its documents are read, the following next() call retrieves the next batch, and so on, until all documents are read from the result set. This batch size can be configured, and you can also inspect its status.
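For instance, in the mongo shell the batch size can be set with batchSize() and the current batch inspected with objsLeftInBatch(); a small sketch (the exact numbers depend on the collection's contents):

var cur = db.test.find().batchSize(10);  // ask the server for batches of 10 documents
cur.next();                              // triggers the fetch of the first batch
cur.objsLeftInBatch();                   // documents remaining in the current batch (9 here)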
If you apply the toArray() method to the cursor, all the remaining documents in the result set are loaded into the memory of your client computer and are available as a JavaScript array, and the result set's data is exhausted. A following hasNext() will return false, and next() will throw an error (once you exhaust the cursor, you cannot read from it anymore). Since this method loads the entire result set into your client's memory (the array), it can be memory-consuming for large result sets.
itcount() returns the count of the remaining documents in the result set and exhausts the cursor.
There are cursor methods like isClosed(), isExhausted(), and size() which give status information about the cursor and its underlying result set as you work with your data.
Those are the basic features of cursors and result sets. There are many more cursor methods; you can try them out to get a better understanding.
References:
mongo shell cursor methods
cursor behavior with the aggregate method (the collection's aggregate() method also returns a cursor)
Example usage in mongo shell:
Assume the test collection has 200 documents (run the commands in the same sequence).
var cur = db.test.find( { } ).limit(25) creates a result set with 25 documents only.
But cur.count() will show 200, which is the actual count of documents matched by the query's filter.
cur.hasNext() will return true.
cur.next() will return a document.
cur.itcount() will return 24 (and exhausts the cursor).
cur.itcount() again will return 0.
cur.count() will still show 200.

This error also occurs when you have a large data set and are doing batch processing on it, with each batch taking long enough that the total exceeds the default cursor lifetime.
In that case you need to change that default, telling Mongo not to expire the cursor until processing is done.
Do check the No TimeOut documentation.
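In the mongo shell, that option looks like the sketch below (collection and query names are assumed). Since the server will no longer reap such a cursor, be sure to close it explicitly when done:

var cur = db.mycollection.find(query).noCursorTimeout();
// ... long-running batch processing ...
cur.close();   // mandatory cleanup: the server will not time this cursor out on its own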

A cursor is an object returned by calling db.collection.find() that enables iterating through the documents (the NoSQL equivalent of SQL "rows") of a MongoDB collection (the NoSQL equivalent of a "table").

In case your cluster is stable and no members were down or changing state, the most likely reason for not finding the cursor is this:
The default idle cursor timeout is 10 minutes, but in versions >= 3.6 the cursor is also associated with a session, which has a default timeout of 30 minutes. So even if you set the cursor to never expire with the noCursorTimeout() option, you are still limited by the 30-minute session timeout. To avoid your cursor being killed by the session timeout, you will need to periodically execute the refreshSessions command from your code:
db.adminCommand({"refreshSessions" : [sessionId]})
This extends the session by another 30 minutes, so that your cursor is not killed while you do something with the data before fetching the next batch.
Check the docs here for details on how to do it:
https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/
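A minimal mongo shell sketch of the idea; the database/collection names and the processSlowly() function are hypothetical, and it assumes the shell's session API, where getSessionId() returns the session id document that refreshSessions expects:

var session = db.getMongo().startSession();
var cur = session.getDatabase('mydb').mycollection.find(query).noCursorTimeout();
while (cur.hasNext()) {
    processSlowly(cur.next());   // hypothetical slow, per-document work
    // reset the 30min session timeout; in real code, do this every few minutes, not per document
    db.adminCommand({ refreshSessions: [session.getSessionId()] });
}
cur.close();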

Related

MongoDB: Cursor isolation and transactions

I'm giving MongoDB a try for a new project; I've never worked with it before.
The manual on cursors says:
Because the cursor is not isolated during its lifetime, intervening write operations on a document may result in a cursor that returns a document more than once if that document has changed. To handle this situation, see the information on snapshot mode.
Does that mean I always have to use snapshot() on reads and/or $isolated on write operations to ensure consistent result sets, in other words, to apply some kind of transactionality? Is this correct? Or why should I not use snapshot(), given that not using it would risk getting inconsistent data?
You should use snapshot() when you are modifying the results of the cursor itself: while iterating the cursor and modifying the documents during iteration, or when you are querying a collection that you expect to be modified between issuing the query and iterating the cursor.
You should also consider snapshot() if your cursor's result is larger than 1 megabyte; queries with a result set of less than 1 megabyte are effectively snapshotted by default.
Note that you can't use snapshot() on a sharded collection.
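In the mongo shell, snapshot() is just a cursor modifier; a minimal sketch with collection and filter assumed. Note that snapshot() was deprecated in MongoDB 3.6 and removed in 4.0, where hint({ _id: 1 }) gives a similar guarantee:

var cur = db.mycollection.find({ qty: { $gt: 10 } }).snapshot();
cur.forEach(printjson);   // traverses the _id index, so a document that moves on disk is not returned twice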

When will a MongoDB cursor expire?

I have no knowledge of MongoDB, and I just want to ask whether something is possible and, if so, how it can be done. My question is: how can we know when a cursor will expire? Is there any API for this purpose?
I would be grateful for any comments and recommendations.
Best regards.
From the MongoDB documentation:
By default, MongoDB will automatically close a cursor when the client has exhausted all results in the cursor. However, for capped collections you may use a Tailable Cursor that remains open after the client exhausts the results in the initial cursor.
http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/
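For illustration, here is a minimal mongo shell sketch of a tailable cursor; the collection name log and the size are assumptions:

db.createCollection('log', { capped: true, size: 1048576 });   // tailable cursors require a capped collection
var cur = db.log.find().tailable({ isAwaitData: true });
while (cur.hasNext()) {
    printjson(cur.next());   // the cursor stays open and waits for new documents instead of closing on exhaustion
}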
Other factors which could cause a cursor to expire are the batch size and the timeout. To sum it up, the factors which expire a cursor are:
result exhaustion
batchSize: http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
timeout: http://api.mongodb.org/java/2.6/com/mongodb/MongoOptions.html
I came across this thread looking for a solution to the following error with pymongo: CursorNotFound: Cursor not found, cursor id .... Obviously the reason is the same as in this thread, but no solution was provided for pymongo.
From the FAQ in the pymongo documentation the following links answer the question: here and here.
The quickest solution is to disable the timeout, so the server does not expire the cursor and it stays valid.
client = MongoClient("localhost")
db = client.testdatabase
cursor = db.testcollection.find({}, no_cursor_timeout=True)
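Be aware that with no_cursor_timeout=True the server keeps the cursor alive until you close it yourself, so call cursor.close() when you are done (or use the cursor as a context manager in a with statement); otherwise it lingers on the server indefinitely.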
In the find() documentation you can find alternatives by changing the cursor type:
cursor_type (optional): the type of cursor to return. The valid options are defined by CursorType:
NON_TAILABLE - the result of this find call will return a standard cursor over the result set.
TAILABLE - the result of this find call will be a tailable cursor - tailable cursors are only for use with capped collections. They are not closed when the last data is retrieved but are kept open, and the cursor location marks the final document position. If more data is received, iteration of the cursor will continue from the last document received. For details, see the tailable cursor documentation.
TAILABLE_AWAIT - the result of this find call will be a tailable cursor with the await flag set. The server will wait for a few seconds after returning the full result set so that it can capture and return additional data added during the query.
EXHAUST - the result of this find call will be an exhaust cursor. MongoDB will stream batched results to the client without waiting for the client to request each batch, reducing latency. See notes on compatibility below.
Update: after using no_cursor_timeout I was wondering what happens to such cursors when they are never exhausted. Here is the answer: PyMongo: What happens to cursor when no_cursor_timeout=True.
Ordinarily, a cursor "dies" on the database server after a certain length of time (approximately 10 minutes) and will be closed when the client has exhausted all of the results.
Some drivers have an "immortal" option; the equivalent in the Java driver is setting the NoTimeout option:
dbcoll.find(...).addOption(Bytes.QUERYOPTION_NOTIMEOUT)
Set this option as above if you intend to keep a cursor open over such a period of time.
http://api.mongodb.org/java/current/com/mongodb/Bytes.html#QUERYOPTION_NOTIMEOUT

Can lock yielding break query isolation?

The Mongo docs talk about queries yielding locks to avoid blocking other operations. Will Mongo yield the lock from a read to a write that changes the read's result?
Say I've got docs {x:1}, {x:2}, {x:2}, {x:1} and I'm running find({x:2}). Assume the fourth doc isn't in the working set, so Mongo page-faults, yielding the lock to an update({x:1}, {$set: {x:2}}, {multi: true}), which completes and returns the lock to the find. The find would now include the fourth doc but omit the first. Does Mongo work like this?
In MongoDB there is no guarantee of query isolation - in fact, across multiple documents, you are not guaranteed to be looking at the same point in time.
What you describe can absolutely happen, and it does. The same is true of multi-document queries that fetch a large number of documents in batches (i.e., when you use a cursor): when you issue a getMore for the next batch, the state of the data is not guaranteed to be the same as when you received the previous batch of results.

Solution to Bulk FindAndModify in MongoDB

My use case is as follows:
I have a collection of documents in MongoDB which I have to send for analysis.
The format of the documents is as follows:
{
    _id: ObjectId("517e769164702dacea7c40d8"),
    date: "1359911127494",
    status: "available",
    other_fields...
}
I have a reader process which picks the first 100 documents with status:available, sorted by date, and modifies them to status:processing.
The ReaderProcess sends the documents for analysis. Once the analysis is complete, the status is changed to processed.
Currently, the reader process first fetches 100 documents sorted by date and then updates the status to processing for each document in a loop. Is there any better/more efficient solution for this case?
Also, for future scalability, we might go with more than one reader process.
In that case, I want the 100 documents picked by one reader process to not get picked by another reader process. But fetching and updating are separate queries right now, so it is very much possible that multiple reader processes pick the same documents.
Bulk findAndModify (with limit) would have solved all these problems. But unfortunately it is not provided in MongoDB yet. Is there any solution to this problem?
As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:
1. Reader selects X documents with the appropriate limit and sorting.
2. Reader marks the documents returned by 1) with its own unique reader ID, e.g. update({_id: {$in: [<result set ids>]}, state: "available", $isolated: 1}, {$set: {readerId: <your reader's ID>, state: "processing"}}, false, true).
3. Reader selects all documents marked as processing with its own reader ID. At this point you are guaranteed exclusive access to the resulting set of documents.
4. Offer the result set from 3) for your processing.
Note that this works even in highly concurrent situations, because a reader can never reserve documents already reserved by another reader (step 2 can only reserve currently available documents, and writes are atomic per document). I would add a timestamp with the reservation time as well if you want to be able to time out reservations (for example, for scenarios where readers might crash or fail).
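A minimal mongo shell sketch of this pattern; the collection name docs and the readerId value are hypothetical, while the status/date field names follow the question's document format (the steps above call the same field state):

var readerId = 'reader-1';
// step 1: select the candidate documents
var ids = db.docs.find({ status: 'available' })
                 .sort({ date: 1 })
                 .limit(100)
                 .toArray()
                 .map(function (d) { return d._id; });
// step 2: claim them; only documents still available can be claimed, and each write is atomic
db.docs.update(
    { _id: { $in: ids }, status: 'available', $isolated: 1 },
    { $set: { status: 'processing', readerId: readerId, reservedAt: new Date() } },
    false, true   // upsert = false, multi = true
);
// step 3: read back exactly the documents this reader actually won
var reserved = db.docs.find({ status: 'processing', readerId: readerId }).toArray();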
EDIT: More details :
All write operations can occasionally yield to pending operations if the write takes a relatively long time. This means that step 3) might not see all the documents marked by step 2) unless you take the following steps:
Use an appropriate "w" (write concern) value, meaning 1 or higher. This will ensure that the connection on which the write operation is invoked will wait for it to complete regardless of it yielding.
Make sure you do the read in step 3 on the same connection (only relevant for replica sets with slaveOk-enabled reads) or thread, so that the operations are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (Java documentation here).
Add the $isolated flag to your multi-updates to ensure they cannot be interleaved with other write operations.
Also see the comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated; they are not, or at least not by default.

mongodb get count without repeating find

When performing a query in MongoDb, I need to obtain a total count of all matches, along with the documents themselves as a limited/paged subset.
I can achieve the goal with two queries, but I do not see how to do this with one query. I am hoping there is a mongo feature that is, in some sense, equivalent to SQL_CALC_FOUND_ROWS, as it seems like overkill to have to run the query twice. Any help would be great. Thanks!
EDIT: Here is Java code to do the above.
DBCursor cursor = collection.find(searchQuery).limit(10);
System.out.println("total objects = " + cursor.count());
I'm not sure which language you're using, but you can typically call a count method on the cursor that results from a find query and then use that same cursor to obtain the documents themselves.
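For example, in the mongo shell that looks like the sketch below (collection and query names assumed). Note that cursor.count() ignores limit() by default, which is exactly what you want for the total here:

var cursor = db.items.find(searchQuery).limit(10);
print('total matches = ' + cursor.count());   // total across the whole result set, ignoring the limit
cursor.forEach(printjson);                    // then iterate just the limited page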
It's not only overkill to run the query twice, but there is also the risk of inconsistency. The collection might change between the two queries, or each query might be routed to a different peer in a replica set, which may have different versions of the collection.
The count() function on cursors (in the MongoDB JavaScript shell) actually runs another query; you can see that by typing cursor.count (without parentheses), so it is no better than running two queries.
In the C++ driver, cursors don't even have a count function. There is itcount, but it only loops over the cursor and fetches all the results, which is not what you want (for performance reasons). The Java driver also has itcount, and there the documentation says it should be used for testing only.
It seems there is no way to do a "find some and get total count" consistently and efficiently.