Use allowDiskUse in criteria query with Grails and the MongoDB plugin?

In order to iterate over all the documents in a MongoDB (2.6.9) collection using Grails (2.5.0) and the MongoDB Plugin (3.0.2) I created a forEach like this:
class MyObjectService {
def forEach(Closure func) {
def criteria = MyObject.createCriteria()
def ids = criteria.list { projections { id() } }
ids.each { func(MyObject.get(it)) }
}
}
Then I do this:
class AnalysisService{
def myObjectService
@Async
def analyze(){
MyObject.withStatelessSession {
myObjectService.forEach { myObject ->
doSomethingAwesome(myObject)
}
}
}
}
This works great...until I hit a collection that is large (>500K documents) at which point a CommandFailureException is thrown because the size of the aggregation result is greater than 16MB.
Caused by CommandFailureException: { "serverUsed" : "foo.bar.com:27017" , "errmsg" : "exception: aggregation result exceeds maximum document size (16MB)" , "code" : 16389 , "ok" : 0.0}
In reading about this, I think that one way to handle this situation is to use the option allowDiskUse in the aggregation function that runs on the MongoDB side so that the 16MB memory limit won't apply and I can get a larger aggregation result.
How can I pass this option to my criteria query? I've been reading the docs and the Javadoc for the Grails MongoDB plugin, but I can't seem to find it. Is there another way to approach the generic problem (iterating over all members of a large collection of domain objects)?

This is not possible with the current implementation of MongoDB Grails plugin. https://github.com/grails/grails-data-mapping/blob/master/grails-datastore-gorm-mongodb/src/main/groovy/org/grails/datastore/mapping/mongo/query/MongoQuery.java#L957
If you look at the above line, then you will see that the default options are being used for building AggregationOptions instance so there is no method to provide an option.
But there is a hackish way to do it using Groovy's metaclass. Let's do it :-)
Store the original method reference of builder() method before writing criteria in your service:
MetaMethod originalMethod = AggregationOptions.metaClass.static.getMetaMethod("builder", [] as Class[])
Then, replace the builder method to provide your implementation.
AggregationOptions.metaClass.static.builder = { ->
def builderInstance = new AggregationOptions.Builder()
builderInstance.allowDiskUse(true) // solution to your problem
return builderInstance
}
Now your service method can run the criteria query, and it should no longer result in the aggregation error you are getting, since allowDiskUse is now set to true.
Finally, restore the original method so that the override does not affect any other call (optional).
AggregationOptions.metaClass.static.addMetaMethod(originalMethod)
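The save, replace, restore sequence above is easy to get wrong if the criteria query throws between the replace and the restore. A minimal sketch of the same pattern, written in plain JavaScript rather than Groovy and using a hypothetical Options object in place of AggregationOptions, wraps the override in try/finally:

```javascript
// Hypothetical stand-in for a class with a static factory method.
const Options = {
  builder() {
    return { allowDiskUse: false };
  },
};

// Runs `work` with Options.builder temporarily replaced, then restores it.
function withPatchedBuilder(work) {
  const original = Options.builder;
  Options.builder = () => ({ ...original.call(Options), allowDiskUse: true });
  try {
    return work();
  } finally {
    Options.builder = original; // restored even if work() throws
  }
}

// While the override is active the builder enables allowDiskUse...
const during = withPatchedBuilder(() => Options.builder().allowDiskUse);
// ...and afterwards the original behaviour is back.
const after = Options.builder().allowDiskUse;
```

In Groovy the equivalent would be a try/finally block around the criteria call, with the restore in the finally part. Keep in mind that such an override is global while it is active, so concurrent callers would see it too.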
Hope this helps!
Apart from this, why are you pulling all the IDs in the forEach method and then fetching each instance again using get()? That wastes database queries and will hurt performance. Also, if you follow the approach below, you don't need the changes above.
An example (UPDATED):
class MyObjectService {
void forEach(Closure func) {
List<MyObject> instanceList = MyObject.createCriteria().list {
// Your criteria code
eq("status", "ACTIVE") // an example
}
// Don't do any of this
// println(instanceList)
// println(instanceList.size())
// *** explained below
instanceList.each { myObjectInstance ->
func(myObjectInstance)
}
}
}
(I'm not adding the code of AnalysisService since there is no change)
*** This is the main point. Whenever you write a criteria query on a domain class (without projections, and on Mongo), Grails/gmongo does not fetch the records from the database immediately after the criteria code executes. The records are only loaded in full when you call methods such as toString(), size() or dump() on the result.
When you call each on that instance list, you are not loading all the instances into memory. Instead you are iterating over a Mongo cursor behind the scenes, and a MongoDB cursor pulls records from the database in batches, which is very memory friendly. So it is safe to call each directly on your criteria result; it will not blow up the JVM unless you call one of the methods that trigger loading all the records from the database.
You can confirm this behaviour even in the code: https://github.com/grails/grails-data-mapping/blob/master/grails-datastore-gorm-mongodb/src/main/groovy/org/grails/datastore/mapping/mongo/query/MongoQuery.java#L1775
Whenever you write a criteria query without a projection, you get an instance of MongoResultList, which has a method named initializeFully() that is called by toString() and some other methods. But you can see that MongoResultList also implements an iterator, which in turn calls the MongoDB cursor to iterate over the large collection, and that is, again, memory safe.
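To illustrate why batch-wise iteration keeps memory bounded, here is a minimal simulation in plain JavaScript. It does not use GORM or a MongoDB driver; the in-memory collection array, fetchBatch, and batchedCursor are all hypothetical names invented for this sketch:

```javascript
// Simulated data source standing in for a MongoDB collection (hypothetical).
const collection = Array.from({ length: 10 }, (_, i) => ({ _id: i }));

// Returns one batch of documents starting at `offset`, much like a cursor
// asking the server for its next batch.
function fetchBatch(offset, batchSize) {
  return collection.slice(offset, offset + batchSize);
}

// Lazily yields documents, holding at most one batch in memory at a time.
function* batchedCursor(batchSize, stats) {
  let offset = 0;
  while (true) {
    const batch = fetchBatch(offset, batchSize);
    if (batch.length === 0) return; // no more documents
    stats.fetches += 1;
    stats.maxResident = Math.max(stats.maxResident, batch.length);
    yield* batch;
    offset += batch.length;
  }
}

const stats = { fetches: 0, maxResident: 0 };
let visited = 0;
for (const doc of batchedCursor(4, stats)) {
  visited += 1; // process `doc` here
}
```

All ten documents are visited, but never more than one batch of four is held at once. Calling something like size() on the full result is the equivalent of collecting every batch into one array first.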
Hope this helps!

Related

Bulk.getOperations() in MongoDB Node driver

I'd like to view the results of a bulk operation, specifically to know the IDs of the documents that were updated. I understand that this information is made available through the Bulk.getOperations() method. However, it doesn't appear that this method is available through the MongoDB NodeJS library (at least, the one I'm using).
Could you please let me know if there's something I'm doing wrong here:
const bulk = db.collection('companies').initializeOrderedBulkOp()
const results = getLatestFinanicialResults() // from remote API
results.forEach(result =>
bulk.find({
name: result.companyName,
report: { $ne: result.report }
}).updateOne([
{ $unset: 'prevReport' },
{ $set: { prevReport: '$report' } },
{ $unset: 'report' },
{ $set: { report: result.report } }
]))
await bulk.execute()
await bulk.getOperations() // <-- fails, undefined in Typescript library
I get a static IDE error:
Uncaught TypeError: bulk.getOperations is not a function
I'd like to view the results of a bulk operation, specifically to know the IDs of the documents that were updated
As of MongoDB server v6.x, there is no method that returns the IDs of the documents updated by a bulk operation (IDs are only returned for insert and upsert operations). However, there may be a workaround, depending on your use case.
The manual page that you linked for Bulk.getOperations() is for mongosh, the MongoDB Shell application. If you look at the source code of getOperations() in mongosh, it is just a convenience wrapper around batches, which returns the list of operations sent for the bulk execution.
As you are utilising ordered bulk operations, MongoDB executes the operations serially. If an error occurs during the processing of one of the write operations, MongoDB will return without processing any remaining write operations in the list.
Depending on the use case, if you modify the bulk.find() part to contain a search for _id for example:
bulk.find({"_id": result._id}).updateOne({$set:{prevReport:"$report"}});
You should be able to see the _id value of the operation in the batches, i.e.
await bulk.execute();
console.log(JSON.stringify(bulk.batches));
Example output:
{
"originalZeroIndex":0,
"currentIndex":0,
"originalIndexes":[0],
"batchType":2,
"operations":[{"q":{"_id":"634354787d080d3a1e3da51f"},
"u":{"$set":{"prevReport":"$report"}}}],
"size":0,
"sizeBytes":0
}
For additional information, you can also inspect the BulkWriteResult returned by execute(). For example, getLastOp retrieves the last operation (useful in case of a failure).
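As a rough sketch of how the _id values could be pulled out of that structure, the helper below walks an array of batch objects shaped like the example output above. The function name queuedIds and the sample data are assumptions made for illustration, and the batch layout is internal to the driver, so it may differ between versions:

```javascript
// Collects the _id used in each operation's query filter (`q`).
// Operations whose filter does not include an _id are skipped.
function queuedIds(batches) {
  return batches.flatMap((batch) =>
    batch.operations
      .map((op) => op.q && op.q._id)
      .filter((id) => id !== undefined)
  );
}

// Sample data modelled on the example output (hypothetical values).
const sampleBatches = [
  {
    batchType: 2,
    operations: [
      { q: { _id: "634354787d080d3a1e3da51f" }, u: { $set: { prevReport: "$report" } } },
      { q: { name: "Acme" }, u: { $set: { prevReport: "$report" } } },
    ],
  },
];

const ids = queuedIds(sampleBatches);
```

Note that this lists the IDs that were queued for update, not the ones that were actually modified; an operation whose filter matched nothing still appears here.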

spring data mongo - mongotemplate count with query hint

The mongo docs specify that you can specify a query hint for count queries using the following syntax:
db.orders.find(
{ ord_dt: { $gt: new Date('01/01/2012') }, status: "D" }
).hint( { status: 1 } ).count()
Can you do this using the mongo template? I have a Query object and am calling the withHint method. I then call mongoTemplate.count(query); However, I'm pretty sure it's not using the hint, though I'm not positive.
Sure. There are a few ways to do this, including dropping down to the basic driver, but assuming you are using your own defined classes, you can do:
Date date = new DateTime(2012,1,1,0,0).toDate();
Query query = new Query();
query.addCriteria(Criteria.where("ord_dt").gt(date));
query.addCriteria(Criteria.where("status").is("D"));
query.withHint("status_1");
long count = mongoOperation.count(query, Class);
So you basically build up a Query object and use that object passed to your operation, which is .count() in this case.
The "hint" here is the name of the index to use on the collection, given as a string. By default that is probably something like "status_1", but use whatever name the index actually has.
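If the index was created without an explicit name, MongoDB names it by joining each indexed field and its direction with underscores. A small sketch of that rule in JavaScript (the helper name defaultIndexName is made up for this example):

```javascript
// Builds the default MongoDB index name for a key specification,
// e.g. { status: 1 } -> "status_1".
function defaultIndexName(keys) {
  return Object.entries(keys)
    .map(([field, direction]) => `${field}_${direction}`)
    .join("_");
}

const single = defaultIndexName({ status: 1 });
const compound = defaultIndexName({ ord_dt: -1, status: 1 });
```

Here single is "status_1" and compound is "ord_dt_-1_status_1". If in doubt, list the indexes on the collection and copy the name from there.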

Read response from MongoDB operation

In my application I insert/update some documents.
I need to act depending on the result of the operation, but I do not understand how to use the WriteResult object.
This is the toString() of an update that completed successfully:
Update write result: { "serverUsed" : "xxx.xxx.xxx.xxx:27017" , "ok" : 1 , "n" : 1 , "updatedExisting" : true}
Now, from the documentation I read that the getLastError methods are deprecated.
I have getN, which just tells me how many records have been updated (meaningless with inserts).
I have no method to retrieve the OK value.
Do you have any suggestion on how to use the WriteResult object to understand the result of an operation?
Thanks in advance,
Samuel
I was using another framework that was hiding this deprecation, but since having removed that framework, I also came upon this problem.
I was also puzzled as you were on what the documentation was trying to say. Having gone through the MongoDB source code it looks like what was previously this:
WriteResult result = collection.update(query, update, true, false);
if (!result.getLastError().ok()) {
// handle error here
}
now looks like this:
try {
collection.update(query, update, true, false);
}
catch (CommandFailureException e) {
CommandResult result = e.getCommandResult();
// you can use result.ok() here, but it should almost always return false.
}
MongoException is a runtime exception that has several subclasses, including CommandFailureException, so you may want to take a look at the driver's Javadocs.
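Once success is signalled by the absence of an exception, the remaining question is what the write actually did. A hedged sketch in JavaScript, using the field names from the toString() output quoted in the question (n and updatedExisting); the helper classifyUpdate is hypothetical and not part of any driver:

```javascript
// Interprets the acknowledged result of an update.
// `result` mirrors the fields printed in the question's output.
function classifyUpdate(result) {
  if (result.updatedExisting) {
    return `updated ${result.n} existing document(s)`;
  }
  if (result.n > 0) {
    return "inserted a new document via upsert";
  }
  return "matched nothing and changed nothing";
}

const outcome = classifyUpdate({ ok: 1, n: 1, updatedExisting: true });
```

In the Java driver the same information should be available from the WriteResult getters, such as getN(), rather than from a parsed object; check the Javadocs for the exact methods your driver version offers.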

MongoDB: Retrieving the first document in a collection

I'm new to Mongo, and I'm trying to retrieve the first document from a find() query:
> db.scores.save({a: 99});
> var collection = db.scores.find();
[
{ "a" : 99, "_id" : { "$oid" : "51a91ff3cc93742c1607ce28" } }
]
> var document = collection[0];
JS Error: result is undefined
This is a little weird, since a collection looks a lot like an array. I'm aware of retrieving a single document using findOne(), but is it possible to pull one out of a collection?
The find method returns a cursor, which works like an iterator over the result set. If there are too many results to display on the screen at once, the shell displays only the first 20 and leaves the cursor positioned after them. Typing the it command displays the next 20 results, and so on.
In your example I think that you have hidden from us one line in the shell.
This command
> var collection = db.scores.find();
will just assign the result to the collection variable and will not print anything on the screen. So, that makes me believe that you have also run:
> collection
Now, here is what is really happening. If you did use the above command to display the content of the collection, then the cursor has reached the end of the result set (since you have only one document in your collection) and has been closed automatically. That's why you get the error.
There is nothing wrong with your syntax. You can use it any time you want. Just make sure that your cursor is still open and has results. You can use the collection.hasNext() method for that.
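The effect can be reproduced without a database. The sketch below uses a hand-written, hypothetical makeCursor in plain JavaScript that mimics only the hasNext()/next() part of a cursor; it is not the shell's implementation:

```javascript
// Minimal forward-only cursor over an array of documents (hypothetical).
function makeCursor(docs) {
  let position = 0;
  return {
    hasNext: () => position < docs.length,
    next() {
      if (position >= docs.length) throw new Error("cursor exhausted");
      return docs[position++];
    },
  };
}

// Evaluating the cursor in the shell prints its documents, which consumes it.
const cursor = makeCursor([{ a: 99 }]);
const printed = [];
while (cursor.hasNext()) printed.push(cursor.next());
const stillHasResults = cursor.hasNext(); // nothing left to read

// A fresh cursor that has not been printed still holds the first document.
const fresh = makeCursor([{ a: 99 }]);
const first = fresh.hasNext() ? fresh.next() : null;
```

Checking hasNext() before reading, as in the last line, avoids the error on an exhausted cursor.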
Is that the Mongo shell? What version? When I try the commands you type, I don't get any extra output:
MongoDB shell version: 2.4.3
connecting to: test
> db.scores.save({a: 99});
> var collection = db.scores.find();
> var document = collection[0];
In the Mongo shell, find() returns a cursor, not an array. In the docs you can see the methods you can call on a cursor.
findOne() returns a single document and should work for what you're trying to accomplish.
So you can have several options.
The examples here use Java. One option is to get a DB cursor and iterate over the elements that are returned, or simply grab the first one and stop.
DBCursor cursor = db.getCollection(COLLECTION_NAME).find();
List<DOCUMENT_TYPE> retVal = new ArrayList<DOCUMENT_TYPE>(cursor.count());
while (cursor.hasNext()) {
retVal.add(cursor.next());
}
return retVal;
If you're looking for a particular object within the document, you can write a query and search all the documents for it. You can use the findOne method or simply find and get a list of objects matching your query. See below:
DBObject query = new BasicDBObject();
query.put(SOME_ID, ID);
DBObject result = db.getCollection(COLLECTION_NAME).findOne(query); // for a single object
DBCursor cursor = db.getCollection(COLLECTION_NAME).find(query); // for a cursor of multiple objects

is it possible to update a mongo collection from the finalize method of the map reduce engine?

I tried to pass the collection to be updated as a scope variable - no dice.
I tried to invoke db.getCollection from the finalize body - no dice, I get this:
db assertion failure, assertion: 'invoke failed: JS Error: TypeError: db has no properties nofile_b:18', assertionCode: 9004
I guess it means that db is undefined within a finalize method. So, is it possible?
EDIT
Here is my finalize method:
function(key, value) {
function flatten(value, collector) {
var items = value;
if (!(value instanceof Array)) {
if (!value.items) {
collector.push(value);
return;
}
items = value.items;
}
for (var i = 0; i < items.length && collector.length < max_group_size; ++i) {
flatten(items[i], collector);
}
}
var collector = [];
flatten(value, collector);
return collector;
}
I would like to replace collector.push(value) with insert into some collection.
It is not possible to modify another collection from inside a Map/Reduce/Finalize function.
Here is a link to a question from a user with a similar question. The answer, unfortunately, is "no".
How to change the structure of MongoDB's map-reduce results?
Part of the reason for this is that MapReduce is designed to work in a sharded environment. The computations are distributed among the different shards, and the results are then aggregated. If each function running on each shard were allowed to modify collections, then each shard could end up with different data.
If you would like a separate collection to be modified as a result of a Map Reduce operation, the best strategy is to run the Map Reduce operation, get the results, and then have your application update the separate collection.
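The shape of that strategy can be sketched in plain JavaScript with an in-memory stand-in for the map-reduce step. Everything here (the mapReduce helper, the scores data, the summaries array playing the role of the second collection) is hypothetical and only illustrates the order of the steps:

```javascript
// Tiny in-memory map-reduce: the functions passed in stay pure and have
// no access to any other collection.
function mapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  for (const doc of docs) {
    const [key, value] = map(doc);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: finalize(key, reduce(key, values)),
  }));
}

const scores = [
  { user: "a", points: 2 },
  { user: "b", points: 5 },
  { user: "a", points: 3 },
];

// Step 1: run the map-reduce and collect its results.
const results = mapReduce(
  scores,
  (doc) => [doc.user, doc.points],
  (key, values) => values.reduce((sum, v) => sum + v, 0),
  (key, total) => ({ total })
);

// Step 2: the application, not finalize, writes to the second collection.
const summaries = [];
for (const r of results) {
  summaries.push({ user: r._id, ...r.value });
}
```

With a real deployment, step 2 would be ordinary insert or update calls issued by the application after the map-reduce command returns.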
If you would like the results of multiple Map Reduce operations to be merged, it is possible to do this via an incremental Map Reduce. The documentation on this may be found here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce