Using MyBatis, I am querying a huge amount of data from the database (about 50k records), but I run into a problem with limited memory and the application restarts. I am currently using List<>; maybe this is the problem.
I am planning to use Cursor<>. Can it solve the problem, even if the records grow to above 100k?
Adding a cursor could solve your problem. Another option is batching your data. Is there a field like an id on which you could apply batching?
SELECT TOP(1000) * FROM yourTable WHERE id > {record.id} ORDER BY id
In a loop you can retrieve a dataset of the size you want, use it for whatever you need, save the last record.id and call this query again. This way your application will never run out of memory, even if the number of records in the database increases.
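To make both options concrete, here is a minimal sketch in Java. The mapper interface YourMapper, the record type YourRecord, the process() call and the selectBatchAfter(lastId, batchSize) mapper method (which would run the keyset query above) are placeholder names, not MyBatis API; only Cursor<T> and the session handling come from MyBatis itself.

import java.io.IOException;
import java.util.List;

import org.apache.ibatis.cursor.Cursor;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class LargeResultExamples {

    // Option 1: stream the result through a Cursor so only one row at a time
    // is materialized instead of a 50k-element List.
    public void streamWithCursor(SqlSessionFactory sqlSessionFactory) throws IOException {
        try (SqlSession session = sqlSessionFactory.openSession()) {
            YourMapper mapper = session.getMapper(YourMapper.class);
            try (Cursor<YourRecord> cursor = mapper.selectAll()) {
                for (YourRecord record : cursor) {
                    process(record); // placeholder for your per-record work
                }
            }
        }
    }

    // Option 2: the batching loop described above, built on the keyset query
    // ("WHERE id > ? ORDER BY id") behind a placeholder mapper method.
    public void processInBatches(YourMapper mapper) {
        long lastId = 0;
        List<YourRecord> batch;
        do {
            batch = mapper.selectBatchAfter(lastId, 1000);
            for (YourRecord record : batch) {
                process(record);
            }
            if (!batch.isEmpty()) {
                lastId = batch.get(batch.size() - 1).getId(); // remember where this batch ended
            }
        } while (!batch.isEmpty());
    }

    private void process(YourRecord record) {
        // placeholder for your per-record work
    }
}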
I have to process a file which has records with the same ID and different dates. If a specific ID has multiple records with different dates, all of them have to be summed. Currently, my solution is writing one item per chunk and letting an SQL query do the summation part, because I don't have a way to know whether multiple entries of the same ID are in the same chunk. Is there a huge performance effect of doing it this way, especially since I am working with 100k worth of data?
Is there a huge performance effect of doing it this way, especially since I am working with 100k worth of data?
Yes, this could impact the performance of your step since each item will be processed in its own transaction. With 100k you would have 100k transactions, whereas if chunk-size=1000 for example, you would have only 100 transactions.
The chunk-oriented processing model is not really suitable for what you are trying to do, as items with the same ID could span different chunks. A common technique for this kind of requirement is to load your data into a temporary table (which could be a very fast step if done against SQLite, for example) and then run your aggregation SQL query against that table.
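For reference, the chunk size is configured on the step itself. A minimal sketch in the Spring Batch 4.x builder style, where InputRecord and the reader/writer beans are placeholders; loading into the temporary table would be a chunk-oriented step like this one, and the aggregation query would run in a separate step afterwards.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LoadStepConfig {

    // Items are still read one by one, but they are written and committed
    // 1,000 at a time, so 100k records cost 100 transactions instead of 100k.
    @Bean
    public Step loadIntoStagingStep(StepBuilderFactory steps,
                                    ItemReader<InputRecord> reader,
                                    ItemWriter<InputRecord> writer) {
        return steps.get("loadIntoStagingStep")
                .<InputRecord, InputRecord>chunk(1000)
                .reader(reader)
                .writer(writer)
                .build();
    }
}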
I am querying a CosmosDB with a huge list of ids, and I get an exception saying I have exceeded the permissible limit of 256 characters.
What is the best way to handle such huge queries?
The only way I can think of is to split the list and execute in batches.
Any other suggestions?
If you're querying data this way then your model is likely not optimal. I would look to remodel your data such that you can query on another property shared by the items you are looking for (and ideally within a single partition as well).
Note that this could also be achieved by using Change Feed to copy the data into another container with a different partition key and a new property that groups the data together. Whether you do this will depend on how often you run this query and whether it is cheaper than running the query in multiple batches.
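If you do end up batching, the splitting itself is straightforward. A minimal, SDK-agnostic sketch in Java, where queryBatch stands in for whatever Cosmos DB call you issue per slice of ids (it is not an SDK method, and the batch size is arbitrary):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BatchedIdQuery {

    // Splits the oversized id list into slices and runs one smaller query per
    // slice, collecting the results as it goes.
    public static <T> List<T> queryInBatches(List<String> ids,
                                             int batchSize,
                                             Function<List<String>, List<T>> queryBatch) {
        List<T> results = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += batchSize) {
            List<String> slice = ids.subList(from, Math.min(from + batchSize, ids.size()));
            results.addAll(queryBatch.apply(slice)); // e.g. a parameterized IN / ARRAY_CONTAINS query
        }
        return results;
    }
}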
I am trying to run a simple query to find the number of all records with a particular value using:
db.ColName.find({id_c:1201}).count()
I have 200GB of data. When I run this query, MongoDB takes up all the RAM and my system starts lagging. After an hour of futile waiting, I give up without getting any results.
What can be the issue here and how can I solve it?
I believe the right approach in the NoSQL world isn't trying to perform a full query like that, but to accumulate stats over time.
For example, you could have a stats collection of arbitrary objects, each owning a kind or id property that can take a value like "totalUserCount". Whenever you add a user, you also update this count.
This way you'll get instant results: it's just reading a property value from a small stats collection.
BTW, the slowness is most likely caused by querying on a non-indexed property in your collection. Try indexing id_c and you'll probably get quicker results.
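A minimal sketch of both suggestions using the MongoDB Java driver; the connection string, database name and counter document id are placeholders:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class StatsAndIndexExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> stats = client.getDatabase("mydb").getCollection("stats");
            MongoCollection<Document> col = client.getDatabase("mydb").getCollection("ColName");

            // Maintain a running counter alongside every insert of a matching record;
            // reading it later is one tiny document lookup instead of a full scan.
            stats.updateOne(Filters.eq("_id", "countForId1201"),
                    Updates.inc("count", 1),
                    new UpdateOptions().upsert(true));

            // Index id_c so ad-hoc counts stop scanning the whole 200GB collection.
            col.createIndex(Indexes.ascending("id_c"));
            System.out.println(col.countDocuments(Filters.eq("id_c", 1201)));
        }
    }
}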
That amount of data can easily be managed by MySQL, MSSQL or Oracle with the given hardware specification. You don't need a NoSQL database for that; NoSQL databases are made for much larger storage needs, which actually require lots of hardware (RAM, hard disks) to be efficient.
You need to define an index on that id and use a normal SQL database.
I am trying to retrieve records from MongoDB whose count is approximately 50,000, but when I execute the query Java runs out of heap space and the server crashes.
Following is my code:
List<FormData> forms = ds.find(FormData.class).field("formId").equal(formId).asList();
Can anyone help me, syntactically, to fetch records batch-wise from MongoDB?
Thanks in advance
I am not sure if the Java implementation has this, but in the C# version there is a setBatchSize method.
Using that I could do:
foreach (var item in coll.find(...).setBatchSize(1000)) {
    // process one item; the driver fetches the next batch of 1000 behind the scenes
}
This will fetch all items the find matches, but it will not fetch them all at once; rather, 1000 in each batch. Your code will not see this "batching" as it is all handled within the enumeration. Once the loop tries to get the 1001st item, another batch will be fetched from the MongoDB server.
This should lessen the heap space problem.
http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
You could still have other problems depending on what you do within the loop, but that will be under your control.
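For the Java side, the modern MongoDB Java driver exposes the same knob as batchSize() on the result iterable; whether your Morphia version surfaces it directly may vary. A minimal sketch, with the connection string, database, collection and formId value as placeholders:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class BatchedFind {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> forms =
                    client.getDatabase("mydb").getCollection("formData");
            // The cursor pulls 1,000 documents per round trip and refills itself as the
            // loop advances, so the full 50k result set is never held in memory at once.
            for (Document doc : forms.find(Filters.eq("formId", "someFormId")).batchSize(1000)) {
                // process one document at a time
            }
        }
    }
}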
Fetching 50k entries doesn't sound like a good idea with any database.
Depending on your use case, you might want to change your query or work with limit and offset:
ds.find(FormData.class)
  .field("formId").equal(formId)
  .limit(20).offset(0)
  .asList();
Note that a range-based query is more efficient than working with limit and offset.
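A minimal sketch of such a range-based query with the legacy Morphia fluent API (org.mongodb.morphia packages), assuming FormData exposes its ObjectId via getId(); the page size, the processing step and the bookkeeping are illustrative only:

import java.util.List;

import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.query.Query;

public class RangePagedFetch {

    // Walks the matching records in pages of 1,000, resuming each page after
    // the last _id seen instead of using an ever-growing offset.
    public void processByIdRange(Datastore ds, String formId) {
        ObjectId lastSeenId = null;
        List<FormData> page;
        do {
            Query<FormData> query = ds.find(FormData.class)
                    .field("formId").equal(formId)
                    .order("_id")   // stable order so the range filter below is meaningful
                    .limit(1000);
            if (lastSeenId != null) {
                query.field("_id").greaterThan(lastSeenId);
            }
            page = query.asList();
            for (FormData form : page) {
                // process each record here
            }
            if (!page.isEmpty()) {
                lastSeenId = page.get(page.size() - 1).getId(); // resume after the last record seen
            }
        } while (!page.isEmpty());
    }
}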
I'm not sure how to put this. Well, recently I worked on a Rails project with Mongoid, and I had the task of inserting multiple records into MongoDB.
Say I insert multiple records of PartPriceRecord into the database. After googling this I came across the collection.insert command:
PartPriceRecord.collection.insert(multiple_part_price_records)
But on inserting a large number of records, MongoDB always seemed to prompt me with the error message:
Exceded maximum insert size of 16,000,000 bytes
Googling around, I found that this is the upper limit in MongoDB for a single document, but surprisingly when I changed my above query to this:
multiple_part_price_records.each do |mppr|
  PartPriceRecord.create(mppr)
end
the above error does not appear any more.
Can anyone explain in depth what exactly the difference between the two is under the hood?
Thanks
The maximum size for a single bulk insert is 16 million bytes (16 MB), and a single bulk insert is what you're attempting in your first example.
In your second example, you're inserting each document individually. Therefore, each insert is under the max limit for an insert.
@Kyle explained the difference in his answer quite succinctly (+1'd), but as for a solution to your problem, you may want to look at doing batch inserts:
BATCH_SIZE = 200
multiple_part_price_records.each_slice(BATCH_SIZE) do |batch|
  PartPriceRecord.collection.insert(batch)
end
This will slice the records into batches of 200 (or whatever size works best for your situation) and insert them within that limit. It will be a lot faster than calling save on each record individually, which would send far more requests to the database.
A few quick things to note about collection.insert:
It does not run validations on your models prior to insertion; you may want to check these yourself before inserting.
It requires the records to be in document format, unlike save, which requires a model. You can easily convert a model to a document by calling as_document on it.