Mongoid create vs collection.insert - mongodb

I'm not sure how to put this. Well, recently I worked on a rails project with mongoid, and I had the task of inserting multiple records in Mongodb.
Say insert multiple records of PartPriceRecord in the database. After googling this I came across the collection.insert commands:
PartPriceRecord.collection.insert(multiple_part_price_records)
But when inserting a large number of records, MongoDB always returned this error message:
Exceded maximum insert size of 16,000,000 bytes
Googling around, I found that this is MongoDB's upper limit for a single document, but surprisingly, when I changed my query above to this:
multiple_part_price_records.each do|mppr|
PartPriceRecord.create(mppr)
end
the error no longer appears.
Can anyone explain in depth what exactly the difference is between the two under the hood?
Thanks

The maximum size for a single, bulk insert is 16M bytes. That's what you're trying to do in your first example.
In your second example, you're inserting each document individually. Therefore, each insert is under the max limit for an insert.

@Kyle explained the difference in his answer quite succinctly (+1'd), but as for a solution to your problem you may want to look at doing batch inserts:
BATCH_SIZE = 200
multiple_part_price_records.each_slice(BATCH_SIZE) do |batch|
PartPriceRecord.collection.insert(batch)
end
This will slice the records into batches of 200 (or whatever size is best for your situation) and insert them within that limit. This will be a lot faster than running save on each one individually, which would send far more requests to the database.
A few quick things to note about collection.insert:
It does not run validations on your models prior to insertion, so you may want to validate them yourself first.
It requires plain documents, unlike save, which requires a model. You can easily convert a model to a document by calling as_document on it.
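A fixed batch count can still go over the limit if individual documents are large, so here is a rough sketch of batching by estimated size as well as by count. The helper name size_limited_batches and its parameters are made up for illustration, and JSON length is only an approximation of the real BSON size, so leave some headroom under the actual limit.

```ruby
require 'json'

# Hypothetical helper (not part of Mongoid or the driver): groups documents
# into batches whose estimated size stays under max_bytes and whose count
# stays at or under max_docs. JSON length is a rough proxy for BSON size.
def size_limited_batches(docs, max_bytes:, max_docs: 200)
  batches = []
  current = []
  current_bytes = 0

  docs.each do |doc|
    doc_bytes = JSON.generate(doc).bytesize
    # Close the current batch when adding this document would exceed a limit.
    if !current.empty? && (current_bytes + doc_bytes > max_bytes || current.size >= max_docs)
      batches << current
      current = []
      current_bytes = 0
    end
    current << doc
    current_bytes += doc_bytes
  end

  batches << current unless current.empty?
  batches
end

# Sample data; in the question these would be the part price record documents.
records = Array.new(10) { |i| { 'part_id' => i, 'price' => 9.99 } }
batches = size_limited_batches(records, max_bytes: 100, max_docs: 4)
puts batches.map(&:size).inspect
```

Each resulting batch could then be passed to PartPriceRecord.collection.insert in place of the each_slice batches above.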

Related

Where should I use sharding in mongodb or run multiple instance of mongodb?

Issue
I have at least 10 text files (CSV), each up to 5GB in size. There is no issue when I import the first text file, but when I start importing the second text file it fails with the maximum size limit (16MB) error.
My primary purpose for using the database is searching for customers using the customer_id index.
Given below are the details of one CSV file.
Collection Name | Documents | Avg. Document Size | Total Document Size | Num. Indexes | Total Index Size | Properties
Customers | 8,874,412 | 1.8 KB | 15.7 GB | 3 | 262.0 MB |
To overcome this, the MongoDB community recommends GridFS, but the problem with GridFS is that the data is stored as bytes, and it's not possible to query for a specific index in the text file.
I don't know if it's possible to query for a specific index in a text file when using GridFS. If someone knows, any help is appreciated.
The other solution I thought about was creating multiple instances of MongoDB running on different ports to solve the issue. Is this method feasible?
But a lot of the tutorials on multiple instances show how to create a replica set, thereby storing the same data in the PRIMARY and the SECONDARY.
The SECONDARY instances don't allow writes and only allow reads.
Is it possible to create multiple instances of MongoDB without creating a replica set, with both write and read operations allowed on them? If yes, how? Can this method overcome the 16MB limit?
The second solution I thought about was sharding the collections. Can this method overcome the 16MB limit? If yes, any help with this is appreciated.
Of the two solutions, which is more efficient for searching the data (in terms of speed)? As I mentioned earlier, I just want to search for customers in this database.
The error message shows exactly where the problem is: entry #8437: line 13530, column 627
Have a look at the file and correct it there.
The error extraneous " in field ... is quite clear. In your CSV file you have an opening quote " that is never closed, i.e. the rest of the entire file is treated as one single field.

Updating data in Mongo sorted by a particular field

I posted this question on Software Engineering portal without conducting any tests. It was also brought to my notice that this needs to be posted on SO, not there. Thanks for the help in advance!
I need Mongo to return the documents sorted by a field value. The easiest way to achieve this would be running the command db.collectionName.find().sort({field:priority}), however, I tried this method on a dummy collection of 1000 documents; it runs in 22ms. I also tried running db.collectionName.find() on the same data, it runs in 3ms, which means that Mongo is taking time to sort and return the documents (which is understandable). Both tests were done in the same environment and were done by adding .explain("executionStats") to the query.
I will be working with a large amount of data and concurrent requests to access DB, so I need the querying to be faster. My question is, is there a way to always keep the data sorted by a field in the DB so that I don't have to sort it over and over for all requests? For instance, some sort of update command that could sort the entire DB once a week or so?
A non-unique index with that field in this collection will give the results you're after and avoid the inefficient in-memory sorting.
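To give a feel for why an index helps here, below is a purely conceptual sketch in plain Ruby. It is not how MongoDB implements its indexes, and the PriorityIndex class is invented for illustration. The idea it shows is that entries are put in order once, at write time, so a sorted read is a simple scan with no sort step.

```ruby
# Conceptual stand-in for an index on `priority`: entries stay ordered as they
# are written, so a sorted read is a plain scan. Not MongoDB's implementation.
class PriorityIndex
  def initialize
    @entries = [] # [priority, doc_id] pairs, always in ascending priority order
  end

  def add(priority, doc_id)
    # Binary search for the first entry whose priority is >= the new one.
    position = @entries.bsearch_index { |(prio, _)| prio >= priority } || @entries.size
    @entries.insert(position, [priority, doc_id])
  end

  # Reading in priority order needs no sorting at query time.
  def sorted_ids
    @entries.map { |(_, doc_id)| doc_id }
  end
end

index = PriorityIndex.new
[[5, 'a'], [1, 'b'], [3, 'c'], [1, 'd']].each { |priority, id| index.add(priority, id) }
puts index.sorted_ids.inspect
```

The cost moves to the write side: every insert does a little extra work so that reads, which are far more frequent in the scenario described, stay cheap.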

Handling Failure in Bulk insertion in mongodb

Hi, I need to insert around 100,000 records into MongoDB. I am using the BulkWriteOperation API to insert records in batches of 1000. If any one record in a batch fails to insert, the whole batch is not inserted. Is there any way to get the list of records from the failed batch alone, so that I can retry and insert the remaining records? Or is there any way to do a bulk insert so that all the records except the failed ones are inserted?
Thanks in advance.
Could you also make sure to mention which language you are using?
In the case of Python, I found that using the insert_many operation with ordered=False is better for this use case. This only works if the ordering of your inserts doesn't matter; as required, the remaining inserts still go through if some of them fail. The BulkWriteError gives the details of the inserts that failed, and you can use the error code to decide what to do afterwards.
It should work similarly for other languages. Do let me know if it doesn't work.
Edit: This question seems similar to another question
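The unordered behaviour described above can be imitated in plain Ruby to show what to expect. This is a simulation only: FakeCollection and insert_unordered are invented for illustration and are not a driver API. The real driver does this server-side and reports failures through its bulk write error.

```ruby
# In-memory stand-in for a collection with a unique _id constraint.
# NOT a driver API; it only imitates the unordered-insert behaviour.
class FakeCollection
  DuplicateKey = Class.new(StandardError)

  attr_reader :docs

  def initialize
    @docs = {}
  end

  def insert_one(doc)
    id = doc.fetch('_id')
    raise DuplicateKey, "duplicate _id #{id}" if @docs.key?(id)

    @docs[id] = doc
  end
end

# Tries every document, keeps going after a failure, and reports what failed.
def insert_unordered(collection, batch)
  failures = []
  batch.each_with_index do |doc, index|
    begin
      collection.insert_one(doc)
    rescue FakeCollection::DuplicateKey => e
      failures << { index: index, doc: doc, error: e.message }
    end
  end
  failures
end

collection = FakeCollection.new
batch = [{ '_id' => 1 }, { '_id' => 2 }, { '_id' => 1 }, { '_id' => 3 }]
failures = insert_unordered(collection, batch)
puts "inserted #{collection.docs.size}, failed #{failures.size}"
```

The returned failures carry the position and content of each rejected record, which is the information you would need to log or retry them.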

Want to retrieve records from mongodb in batchwise

I am trying to retrieve approximately 50,000 records from MongoDB, but when I execute the query Java runs out of heap space and the server crashes.
Here is my code:
List<FormData> forms = ds.find(FormData.class).field("formId").equal(formId).asList();
Can anyone help me with the syntax to fetch records from MongoDB in batches?
Thanks in advance
I am not sure if the Java implementation has this, but in the C# version there is a setBatchSize method.
Using that I could do
foreach(var item in coll.find(...).setBatchSize(1000)) {
}
This will fetch all items the find matches, but not all at once; it fetches 1000 in each batch. Your code will not see this "batching" as it is all handled within the enumeration. Once the loop tries to get the 1001st item, another batch will be fetched from the MongoDB server.
This should lessen the heap space problem.
http://docs.mongodb.org/manual/reference/method/cursor.batchSize/
You could still have other problems depending on what you do within the loop, but that will be under your control.
Fetching 50k entries doesn't sound like a good idea with any database.
Depending on your use case, you might want to change your query or work with limit and offset:
ds.find(FormData.class)
.field("formId").equal(formId)
.limit(20).offset(0)
.asList();
Note that a range based query is more efficient than working with limit and offset.
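As a rough illustration of the range-based idea (in Ruby rather than Java, over an in-memory array instead of a real collection): each page asks for "id greater than the last one seen", so nothing has to be skipped over. FORMS, fetch_after and each_page are hypothetical names for this sketch.

```ruby
# Sample data standing in for a collection sorted by an indexed id field.
FORMS = (1..25).map { |i| { id: i, form_id: 'f1' } }.freeze

# Hypothetical stand-in for a query like "id > last_id, sorted by id, limit n".
def fetch_after(last_id, limit)
  FORMS.select { |form| form[:id] > last_id }.first(limit)
end

# Walks the whole set in fixed-size pages, remembering only the last id seen.
def each_page(limit)
  last_id = 0
  loop do
    page = fetch_after(last_id, limit)
    break if page.empty?

    yield page
    last_id = page.last[:id]
  end
end

page_sizes = []
each_page(10) { |page| page_sizes << page.size }
puts page_sizes.inspect
```

Only one page is held in memory at a time, which is the point for the heap space problem in the question.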

How to store query output in temp db?

I am really new to the programming but I am studying it. I have one problem which I don't know how to solve.
I have collection of docs in mongoDB and I'm using Elasticsearch to query the fields. The problem is I want to store the output of search back in mongoDB but in different DB. I know that I have to create temporary DB which has to be updated with every search result. But how to do this? Or give me documentation to read so I could learn it. I will really appreciate your help!
Mongo does not natively support "temp" collections.
The typical approach here is not to write the entire results output to another DB at all. That would be pointless, since Elasticsearch does its own caching, so you don't need another layer on top.
Also, due to IO concerns it is normally a bad idea to write, say, a result set of 10k records to Mongo or another DB.
There is a feature request for what you talk of: https://jira.mongodb.org/browse/SERVER-3215 but no planning as of yet.
Example
You could have a table of results.
Within this table you would have a doc that looks like:
{keywords: ['bok', 'mongodb']}
Each time you search and scroll through each result item you would write a row to this table populating the keywords field with keywords from that search result. This would be per search result per search result list per search. It would probably be best to just stream each search result to MongoDB as they come in. I have never programmed Python (though I wish to learn) so an example in pseudo:
var elastic_results = [{'elasticresult'}];
foreach(elastic_results as result){
//split down the phrases in this result and make a keywords array
db.results_collection.insert(array_formed_from_splitting_down_result); // Lets just lazy insert no need for batch or trying to shrink the amount of data to one go or whatever, lets just stream it in.
}
So as you go through your results you basically just insert as fast as possible, creating a sort of "stream" of input to MongoDB. It can handle this quite well.
This should then give you a shardable list of words and language verbs that you can run things like MRs on to aggregate statistics about them.
Without knowing more about your scenario, this is pretty much my best answer.
This does not use the temp table concept but instead makes your data permanent which is fine by the sounds of it since you wish to use Mongo as a storage engine for further tasks.
Actually there is MongoDB river plugin to work with Elasticsearch...
db.your_table.find().forEach(function(doc) { db.another_table.insert(doc); } );