Want to retrieve records from MongoDB in batches - mongodb

I am trying to retrieve records from MongoDB; the count is approximately 50,000, but when I execute the query, Java runs out of heap space and the server crashes.
Following is my code:
List<FormData> forms = ds.find(FormData.class).field("formId").equal(formId).asList();
Can anyone help me, syntactically, with fetching records from MongoDB in batches?
Thanks in advance

I am not sure if the Java driver has this, but in the C# version there is a SetBatchSize method.
Using that, I can do:
foreach (var item in coll.Find(...).SetBatchSize(1000)) {
}
This will fetch all items the find matches, but it will not fetch them all at once; rather, it fetches 1000 in each batch. Your code will not see this "batching", as it is all handled within the enumeration. Once the loop tries to get the 1001st item, another batch is fetched from the MongoDB server.
This should lessen the heap space problem.
http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
You could still have other problems depending on what you do within the loop, but that will be under your control.
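The Java driver offers the same knob: DBCursor has a batchSize() method, and iterating the cursor instead of calling asList() keeps only the current batch in memory. A rough sketch against the legacy 2.x driver (the database, collection and field values here are assumptions based on the question, not your actual setup):
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class BatchRead {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost");             // 2.x-era connection; newer drivers use MongoClient
        DB db = mongo.getDB("mydb");                      // assumed database name
        DBCollection coll = db.getCollection("formData"); // assumed collection name

        DBCursor cursor = coll.find(new BasicDBObject("formId", 42)) // 42 stands in for your formId
                              .batchSize(1000);           // at most 1000 documents per round trip
        try {
            while (cursor.hasNext()) {
                DBObject form = cursor.next();            // only the current batch is held in memory
                // process the document here instead of collecting everything into a List
            }
        } finally {
            cursor.close();
        }
    }
}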

Fetching 50k entries doesn't sound like a good idea with any database.
Depending on your use case, you might want to change your query or work with limit and offset:
ds.find(FormData.class)
  .field("formId").equal(formId)
  .limit(20).offset(0)
  .asList();
Note that a range-based query is more efficient than working with limit and offset.
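To illustrate, here is a hedged sketch of what such a range-based query could look like with Morphia, paging on the document id instead of using offset(). The class and field names follow the question, getId() is an assumed accessor, and the exact import packages differ between Morphia versions (com.google.code.morphia.* in 0.x, org.mongodb.morphia.* later):
import java.util.List;
import org.bson.types.ObjectId;

public class RangePaging {
    static void processAllForms(Datastore ds, String formId) {
        ObjectId lastSeenId = null;                          // id of the last document of the previous page
        while (true) {
            Query<FormData> q = ds.find(FormData.class)
                    .field("formId").equal(formId)
                    .order("id")                             // assumes the entity's @Id field is named "id"
                    .limit(500);                             // page size; tune to your heap
            if (lastSeenId != null) {
                q.field("id").greaterThan(lastSeenId);       // range condition instead of offset()
            }
            List<FormData> page = q.asList();
            if (page.isEmpty()) {
                break;                                       // no more documents
            }
            for (FormData form : page) {
                // process one page's worth of documents here
            }
            lastSeenId = page.get(page.size() - 1).getId();  // assumes FormData exposes getId()
        }
    }
}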

Related

Mybatis cursor query more than 100k records

Using MyBatis, I am querying a large amount of data from the database (about 50k records), but I run into a problem with limited memory and the application restarts. I am currently using List<>; maybe this is the problem.
I am planning to use Cursor<>; can it solve the problem, even if the records grow beyond 100k?
Adding a cursor could solve your problem. Another option is batching your data. Is there a field like an id on which you could apply batching?
SELECT TOP(1000) * FROM yourTable WHERE id > {record.id} ORDER BY id
In a loop you can retrieve a dataset of the size you want, use it for what you need, save the last record.id, and call this query again. That way your application will never run out of memory, even if the number of records in the database increases.
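As for the Cursor<> option: a hedged sketch of how it could look with MyBatis 3.4+, where SqlSession.selectCursor streams rows from the underlying JDBC ResultSet. The mapper statement name "selectAllRecords" is made up, and depending on the JDBC driver you may also need to configure a fetch size for true streaming:
import java.util.Map;
import org.apache.ibatis.cursor.Cursor;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class CursorExample {
    // sqlSessionFactory is assumed to be configured elsewhere
    void processAll(SqlSessionFactory sqlSessionFactory) {
        try (SqlSession session = sqlSessionFactory.openSession();
             // "selectAllRecords" is a hypothetical mapper statement returning Map rows
             Cursor<Map<String, Object>> cursor = session.selectCursor("selectAllRecords")) {
            for (Map<String, Object> row : cursor) {
                // rows are materialized one at a time as you iterate,
                // so the full result set is never held in memory
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}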

Updating data in Mongo sorted by a particular field

I posted this question on the Software Engineering portal without conducting any tests. It was also brought to my notice that it needed to be posted on SO, not there. Thanks for the help in advance!
I need Mongo to return the documents sorted by a field value. The easiest way to achieve this would be to run the command db.collectionName.find().sort({field:priority}); however, I tried this method on a dummy collection of 1000 documents and it runs in 22ms. I also tried running db.collectionName.find() on the same data, and it runs in 3ms, which means that Mongo is taking time to sort and return the documents (which is understandable). Both tests were done in the same environment, by adding .explain("executionStats") to the query.
I will be working with a large amount of data and concurrent requests to access DB, so I need the querying to be faster. My question is, is there a way to always keep the data sorted by a field in the DB so that I don't have to sort it over and over for all requests? For instance, some sort of update command that could sort the entire DB once a week or so?
A non-unique index on that field in this collection will give you the results you're after and avoid the inefficient in-memory sort.
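Creating the index is a one-off operation; MongoDB then keeps it up to date on every write, so queries that sort on that field can walk the index instead of sorting in memory. A minimal sketch with the legacy Java driver, assuming the sort field is called priority (adapt the names to your schema):
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;

public class SortIndex {
    // coll is the DBCollection backing collectionName; "priority" is the assumed sort field
    static void ensureSortIndex(DBCollection coll) {
        coll.createIndex(new BasicDBObject("priority", 1)); // 1 = ascending; non-unique by default
    }
}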

Unable to fetch more than 10k records

I am developing an app where I have more than 10k records added to a class in Parse. Now I am trying to fetch those records using PFQuery (I am using the "skip" property), but I am unable to fetch records beyond 10k and I get the following error message:
"Skips larger than 10000 are not allowed"
This is a big problem for me since I need all the data.
Has anybody come across such a problem? Please share your views.
Thanks
The problem is indeed due to the cost of Mongo skip operations. You can formulate a query such that you don't need the skip operator. My preferred method is to order by objectId and then add a condition that objectId > the last yielded objectId. This type of query can be indexed and remains fast, unlike skip pagination, which has an O(N^2) cost in seeks.
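A hedged sketch of that objectId-based pagination using the Parse Android SDK's ParseQuery (the same pattern applies to PFQuery on iOS); the class name "Record" is only illustrative:
import java.util.List;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;

public class FetchAll {
    // Walks every object of the (hypothetical) "Record" class without ever using skip.
    static void fetchAllRecords() throws ParseException {
        String lastObjectId = null;
        while (true) {
            ParseQuery<ParseObject> query = ParseQuery.getQuery("Record");
            query.orderByAscending("objectId");
            query.setLimit(1000);                  // Parse caps a single query at 1000 results
            if (lastObjectId != null) {
                query.whereGreaterThan("objectId", lastObjectId);
            }
            List<ParseObject> page = query.find(); // synchronous; use findInBackground from a UI thread
            if (page.isEmpty()) {
                break;                             // no more records
            }
            for (ParseObject record : page) {
                // process the record
            }
            lastObjectId = page.get(page.size() - 1).getObjectId();
        }
    }
}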
My assumption would be that the limit exists because of performance issues with MongoDB's skip implementation:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.

Morphia: is there a difference between fetch and asList, performance-wise?

We are using Morphia 0.99 and Java driver 2.7.3. I would like to know whether there is any difference between fetching records one by one using fetch and retrieving the results via asList (assume that there is enough memory to retrieve all records through asList).
We iterate over a large collection; while using fetch I sometimes encounter a "cursor not found" exception on the server during the fetch operation, so I need to execute another command to continue. What could be the reason for this?
1) Fetch the record.
2) Do some calculation on it.
3) Save it back to the database again.
4) Fetch another record and repeat the steps until there are no more records.
So which one would be faster: fetching records one by one, or retrieving results in bulk using asList? Or isn't there any difference between them in the Morphia implementation?
Thanks for the answers
As far as I understand the implementation, fetch() streams results from the DB while asList() will load all query results into memory. So they will both get every object that matches the query, but asList() will load them all into memory while fetch() leaves it up to you.
For your use case, neither would be faster in terms of CPU, but fetch() should use less memory and not blow up in case you have a lot of DB records.
Judging from the source code, asList() uses fetch() and aggregates the results for you, so I can't see much difference between the two.
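For reference, a hedged sketch of the fetch-process-save loop described in the question; MyEntity and recalculate() are placeholders, not Morphia API, and the imports come from your Morphia version (com.google.code.morphia.* in 0.99):
public class FetchLoop {
    static void processAll(Datastore ds) {
        // fetch() returns an Iterable backed by a live cursor, so documents are
        // streamed batch by batch instead of being collected into one big List.
        for (MyEntity entity : ds.find(MyEntity.class).fetch()) {
            entity.recalculate();   // hypothetical "do some calculation" step
            ds.save(entity);        // save it back to the database
        }                           // each further iteration pulls the next record from the cursor
    }
}
One thing to keep in mind: if the work inside the loop is slow, the server may time out the idle cursor, which is a common cause of the "cursor not found" error mentioned in the question.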
One very useful difference would be if the following two conditions applied to your scenario:
You were using offset and limit in the query.
You were changing values on the object such that it would no longer be returned in the query.
So say you were doing a query on awesome=true, and you were using offset and limit to do multiple queries, returning 100 records at a time to make sure you didn't use up too much memory. If, in your iteration loop, you set awesome=false on an object and saved it, it would cause you to miss updating some records.
In a case like this, fetch() would be a better approach.

Mongoid create vs collection.insert

I'm not sure how to put this. Recently I worked on a Rails project with Mongoid, and I had the task of inserting multiple records into MongoDB.
Say I need to insert multiple PartPriceRecord records into the database. After googling this, I came across the collection.insert command:
PartPriceRecord.collection.insert(multiple_part_price_records)
But on inserting a large number of records, MongoDB always prompts me with the error message:
Exceded maximum insert size of 16,000,000 bytes
Googling around, I found that this is MongoDB's upper limit for a single insert, but surprisingly, when I changed my above query to this:
multiple_part_price_records.each do |mppr|
  PartPriceRecord.create(mppr)
end
the above error no longer appears.
Can anyone explain in depth what exactly the difference between the two is under the hood?
Thanks
The maximum size for a single bulk insert is 16M bytes; that is what you are attempting in your first example, and your batch exceeds the limit.
In your second example, you're inserting each document individually, so each insert stays under the maximum.
@Kyle explained the difference in his answer quite succinctly (+1'd), but as a solution to your problem you may want to look at doing batch inserts:
BATCH_SIZE = 200
multiple_part_price_records.each_slice(BATCH_SIZE) do |batch|
  PartPriceRecord.collection.insert(batch)
end
This will slice the records into batches of 200 (or whatever size is best for your situation) and insert them within that limit. It will also be a lot faster than calling save on each record individually, which would send far more requests to the database.
A few quick things to note about collection.insert:
It does not run validations on your models prior to insertion, so you may want to check them yourself before inserting.
It expects documents (hashes) rather than model instances, unlike save. You can easily convert a model to a document by calling as_document on it.