Handling Failure in Bulk insertion in mongodb - mongodb

Hi, I need to insert around 100,000 records into MongoDB. I am using the BulkWriteOperation API and inserting the records in batches of 1,000. If any one record in a batch fails to insert, the whole batch is not inserted. Is there any way to get the list of records from the failed batch alone, so that I can recursively retry and insert the remaining records? Or is there a way to do a bulk insert so that all the records except the failed ones are inserted?
Thanks in advance.

Could you also make sure to mention which language you are using?
In the case of Python, I found that using the insert_many operation with ordered=False works better for this use case. (This only works if the order of your inserts doesn't matter. As required, the remaining documents are still inserted when some of the inserts fail.) The BulkWriteError raised afterwards gives the details of the inserts that failed, and you can use the error code to decide what to do next.
It should work similarly for other languages. Do let me know if it doesn't work.
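A minimal sketch of the Python approach described above, assuming the pymongo driver. The helper names split_failed and insert_batch are made up for illustration; the "writeErrors" list with its "index" and "code" fields is what pymongo exposes on BulkWriteError.details. With ordered=False every document in the batch is attempted, so anything not listed in "writeErrors" can be treated as inserted.

```python
def split_failed(batch, details):
    """Split a batch into (inserted, failed) using BulkWriteError.details.

    Each entry of details["writeErrors"] carries the position ("index") of
    the failing document within the batch, plus an error "code".
    """
    errors = {e["index"]: e for e in details.get("writeErrors", [])}
    inserted = [doc for i, doc in enumerate(batch) if i not in errors]
    failed = [(doc, errors[i]["code"]) for i, doc in enumerate(batch) if i in errors]
    return inserted, failed


def insert_batch(collection, batch):
    """Insert a batch unordered and return the (document, code) pairs that failed."""
    # Imported here so split_failed can be used without the driver installed.
    from pymongo.errors import BulkWriteError

    try:
        collection.insert_many(batch, ordered=False)
    except BulkWriteError as exc:
        _, failed = split_failed(batch, exc.details)
        return failed
    return []
```

Calling insert_batch once per batch of 1,000 and collecting the returned pairs gives the list of failed records without losing the rest of the batch.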
Edit: This question seems similar to another question

Related

Async in loop to insert into mongodb

I'm using mongo and I have multiple queries to insert at a time, so I use a for loop to insert into the db. The problem is that each query falls under a key, so I check whether the key exists or not: if it doesn't, I add it to the db, and if it does, I append to it. If I have multiple queries with the same key (since mongo inserts asynchronously), two identical keys could both be identified as "nonexistent" in the db, since they could be running in parallel. Is there a way around this?
If you're writing a lot of documents, you're probably better off using bulk operations in mongo: https://docs.mongodb.com/manual/core/bulk-write-operations/
You can write the queries as upserts. I think this question is very similar to what you are trying to accomplish: How to properly do a Bulk upsert/update in MongoDB.
If you do it as an ordered bulk operation you should not have the problem with two queries running simultaneously.
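A rough sketch of the upsert idea in Python, assuming the pymongo driver. The field names "key" and "values" and both function names are hypothetical, chosen only for illustration. Grouping the values per key first means each key produces a single upsert, so two writes for the same key never race within one bulk operation.

```python
from collections import defaultdict


def build_upsert_specs(pairs):
    """Group (key, value) pairs so that each key yields one upsert spec."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # One (filter, update) pair per key; $push with $each appends all values.
    return [
        ({"key": key}, {"$push": {"values": {"$each": values}}})
        for key, values in grouped.items()
    ]


def bulk_upsert(collection, pairs):
    """Send all upserts in one ordered bulk write."""
    from pymongo import UpdateOne  # imported lazily; requires the pymongo driver

    ops = [UpdateOne(flt, upd, upsert=True) for flt, upd in build_upsert_specs(pairs)]
    # bulk_write rejects an empty list of operations, so skip the call in that case.
    return collection.bulk_write(ops, ordered=True) if ops else None
```

With upsert=True the document is created when the key is missing and appended to when it exists, which replaces the separate "check, then insert or append" step.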

How do I get feedback on whether a MongoDB query succeeded and how many rows were changed?

I have a MongoDB database and am using MongoChef to write scripts for it. I have a script that reads in data from a collection and inserts the records into another collection. The script runs fine, but I don't get any feedback on what occurred. Is there a way to get acknowledgement that the script has finished running (that is, all the records are inserted)? Is there a way to get an output of how many records were (in this case) inserted? (I realize I could write another statement to count records, but I want to know how many records were actually inserted by the insert statement). What I'd like to see is something like "Script successful. 1200 records inserted into collection properties." Can someone show me how to turn on this output for MongoChef? Thank you.
Below is an image of my script. This is after it's been run. Notice that there's nothing in the results tabs; there's no indication that the queries have been run, that they ran successfully, or how many records have been updated.
You can go through the MongoDB documentation on write concerns and see what applies to your MongoDB version. Previously, getLastError was used to get error information about the last executed CRUD statement; it reports any error that occurred during that operation.
You can also use the WriteResult returned by the insert, update, remove and save operations to get the number of affected documents. It also contains properties such as writeError, which holds the error information specific to that operation.
Sample (pseudocode, not specific to MongoChef):
var wr = db.properties.insert(doc);  // insert() returns a WriteResult
print("Inserted " + wr.nInserted + " document(s) into properties");

Mongodb - add if not exists (non id field)

I have a large number of records to iterate over (coming from an external data source) and then insert into a mongo db.
I do not want to allow duplicates. How can this be done in a way that will not affect performance?
The number of records is around 2 million.
I can think of two fairly straightforward ways to do this in mongodb, although a lot depends upon your use case.
One, you can use the upsert:true option to update, using whatever you define as your unique key as the query for the update. If it does not exist it will insert it, otherwise it will update it.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
Two, you could just create a unique index on that key and then insert ignoring the error generated. Exactly how to do this will be somewhat dependent on language and driver used along with version of mongodb. This has the potential to be faster when performing batch inserts, but YMMV.
2 million records is not a number large enough to affect performance; splitting your records' fields into different collections will be good enough.
I suggest creating a unique index on your unique key before inserting into MongoDB.
The unique index will reject the duplicate records, and you can ignore the resulting errors.
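A sketch of the unique-index approach in Python, assuming the pymongo driver. The function names are hypothetical; 11000 is MongoDB's duplicate-key error code. Duplicate-key errors are treated as expected and swallowed, while any other write error is re-raised rather than ignored.

```python
DUPLICATE_KEY = 11000  # MongoDB error code for a unique-index violation


def non_duplicate_errors(details):
    """Return the write errors that are NOT duplicate-key errors."""
    return [e for e in details.get("writeErrors", []) if e.get("code") != DUPLICATE_KEY]


def insert_skipping_duplicates(collection, docs, unique_field):
    """Insert docs, skipping those that violate the unique index."""
    from pymongo.errors import BulkWriteError  # requires the pymongo driver

    # Creating an index that already exists with the same options is a no-op.
    collection.create_index(unique_field, unique=True)
    try:
        # ordered=False keeps going past a duplicate instead of stopping there.
        collection.insert_many(docs, ordered=False)
    except BulkWriteError as exc:
        # Duplicates are expected and ignored; anything else is re-raised.
        if non_duplicate_errors(exc.details):
            raise
```

For around 2 million records this would normally be called once per batch rather than with the whole data set at once.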

mongodb read, copy, process and delete

I have to write an app that constantly polls a MongoDB collection in a given db. If it finds documents, it reads them, copies them to another db, does some extra processing and deletes them from the original db.
What is the most efficient way to implement this? What are the best practices?
Is it better to process one doc at a time: read one document, copy the document then delete it
or is it better to read all documents, copy all of them, then delete all of them?
What would be the best way to handle failures in the middle of one of these read, write deletes?
Bulk reads, inserts and deletes are almost always more performant than single document actions. But try to limit it to a maximum number of documents, e.g. in our setup 500 seemed to be optimal.
For handling errors, you could use the following pseudo transaction pattern:
1. findAndModify, setting "state":"pending" on all documents that are read
2. process the documents
3. bulk insert
4. delete all documents with "state":"pending"
If something goes wrong in the processing part or the bulk insert, you can unlock all locked documents (remove their "pending" state) and try again.
A more elaborate example of this kind of pseudo transaction can be found in the MongoDB Tutorial:
http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
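The pseudo transaction steps above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the answerer's actual code: move_batch and process are hypothetical names, source and target are expected to behave like pymongo collections, and, as a small variation on the final step, documents are deleted by the _id values that were claimed rather than by their "pending" state, so that a second worker's pending documents are left alone.

```python
def move_batch(source, target, process, limit=500):
    """Claim up to `limit` documents, copy them to `target`, then delete them."""
    claimed = []
    try:
        # Step 1: claim documents one at a time by marking them as pending.
        while len(claimed) < limit:
            doc = source.find_one_and_update(
                {"state": {"$ne": "pending"}},
                {"$set": {"state": "pending"}},
            )  # by default this returns the document as it was before the update
            if doc is None:
                break
            claimed.append(doc)
        if not claimed:
            return 0
        # Steps 2 and 3: process the documents, then bulk insert the results.
        target.insert_many([process(doc) for doc in claimed])
    except Exception:
        # Something failed: release the claimed documents so they can be retried.
        ids = [doc["_id"] for doc in claimed]
        source.update_many({"_id": {"$in": ids}}, {"$unset": {"state": ""}})
        raise
    # Step 4: delete the documents that were copied successfully.
    source.delete_many({"_id": {"$in": [doc["_id"] for doc in claimed]}})
    return len(claimed)
```

If the bulk insert fails part-way, some documents may already be in the target, so a retry needs to tolerate duplicate _id values there.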

Mongoid create vs collection.insert

I'm not sure how to put this. Well, recently I worked on a rails project with mongoid, and I had the task of inserting multiple records in Mongodb.
Say I need to insert multiple records of PartPriceRecord in the database. After googling this I came across the collection.insert command:
PartPriceRecord.collection.insert(multiple_part_price_records)
But when inserting a large number of records, MongoDB always seemed to respond with this error message:
Exceded maximum insert size of 16,000,000 bytes
Googling around, I found that this is the upper size limit in MongoDB for a single document, but surprisingly, when I changed my query above to this:
multiple_part_price_records.each do |mppr|
  PartPriceRecord.create(mppr)
end
the above error does not seem to appear any more.
Can anyone explain in depth what exactly the difference between the two is under the hood?
Thanks
The maximum size for a single, bulk insert is 16M bytes. That's what you're trying to do in your first example.
In your second example, you're inserting each document individually. Therefore, each insert is under the max limit for an insert.
@Kyle explained the difference in his answer quite succinctly (+1'd), but as for a solution to your problem, you may want to look at doing batch inserts:
BATCH_SIZE = 200
multiple_part_price_records.each_slice(BATCH_SIZE) do |batch|
  PartPriceRecord.collection.insert(batch)
end
This will slice the records into batches of 200 (or whatever size is best for your situation) and insert them within that limit. This will be a lot faster than running save on each one individually which would be sending far more requests to the database.
A few quick things to note about collection.insert:
It does not run validations on your models prior to insertion, so you may want to validate them yourself before inserting.
It requires its input in document format, unlike save, which requires a model. You can easily convert a model to a document by calling as_document on it.