I'm currently trying to write a script that inserts documents into a MongoDB database and returns the id under which each document is stored. That is very simple thanks to insert_many(); my problem occurs when there is an error during the insert.
I won't be able to get the ids that have just been inserted.
from pymongo import MongoClient
client = MongoClient(...)
db = client.test
r = db.test.insert_many([{'foo': 1}, {'foo': 2}, {'foo': 3}])
r.inserted_ids
#: [ObjectId('56b2a592dfcce9001a6efff8'),
#: ObjectId('56b2a592dfcce9001a6efff9'),
#: ObjectId('56b2a592dfcce9001a6efffa')]
list(db.test.find())
#: [{'_id': ObjectId('56b2a592dfcce9001a6efff8'), 'foo': 1},
#: {'_id': ObjectId('56b2a592dfcce9001a6efff9'), 'foo': 2},
#: {'_id': ObjectId('56b2a592dfcce9001a6efffa'), 'foo': 3}]
# This is dead stupid, but forcing an error by re-using the ObjectId we just generated
r2 = db.test.insert_many([{'foo': 4}, {'_id': r.inserted_ids[0], 'foo': 6}, {'foo': 7}])
#: ---------------------------------------------------------------------------
#: BulkWriteError Traceback (most recent call last)
#: <Cut in the interest of time>
Of course, r2 is never assigned, so I can't ask it for inserted_ids. However, one record has still been inserted into the database:
list(db.test.find())
#: [{'_id': ObjectId('56b2a592dfcce9001a6efff8'), 'foo': 1},
#: {'_id': ObjectId('56b2a592dfcce9001a6efff9'), 'foo': 2},
#: {'_id': ObjectId('56b2a592dfcce9001a6efffa'), 'foo': 3},
#: {'_id': ObjectId('56b2a61cdfcce9001a6efffd'), 'foo': 4}]
What I want is to be able to reliably figure out which ids were inserted, in order. Something like:
r2.inserted_ids
#: [ObjectId('56b2a61cdfcce9001a6efffd'),
#: None, # or maybe even some specific error for this point.
#: None]
Setting ordered=False still raises the error, so r2 still won't be assigned (and it wouldn't reliably return the ids in the order I gave anyway).
Is there any option here?
pymongo sets the _id field at client side, before sending it to the server. It modifies the documents you pass in place.
This means that all the documents you pass are left with the _id field set -- the successful ones and the failed ones.
So you just need to figure out which ones were successful. This can be done as @Austin explained.
Something like:
import pymongo.errors

docs = [{'foo': 1}, {'foo': 2}, {'foo': 3}]
try:
    r = db.test.insert_many(docs)
except pymongo.errors.OperationFailure as exc:
    # insert_many has already added an _id to every document in docs
    inserted_ids = [doc['_id'] for doc in docs if not is_failed(doc, exc)]
else:
    inserted_ids = r.inserted_ids
is_failed(doc, exc) can be implemented by searching for doc in the list of failed documents in the exception details, as explained by @Austin.
Catch the thrown exception. At least according to this site, the returned error details include the bad record. That should enable you to determine the successful records.
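A minimal sketch of the idea in the two answers above, assuming the exception's details dict carries a writeErrors list whose entries have an index into the original document list (as in PyMongo's BulkWriteError.details). The helper name inserted_ids_in_order and its ordered flag are hypothetical, not part of PyMongo. With an ordered insert, documents after the first failure are never attempted, so they are reported as None too.

```python
def inserted_ids_in_order(docs, details, ordered=True):
    """Return one entry per document: its _id if it was inserted, else None.

    docs    -- the list passed to insert_many (PyMongo has already added _id)
    details -- the BulkWriteError.details dict
    ordered -- the value of ordered= that was passed to insert_many
    """
    failed = {err['index'] for err in details.get('writeErrors', [])}
    if ordered and failed:
        # An ordered insert stops at the first error, so nothing at or
        # after that position reached the collection.
        first_failure = min(failed)
        failed.update(range(first_failure, len(docs)))
    return [None if i in failed else doc['_id'] for i, doc in enumerate(docs)]


# Stand-in data shaped like the example in the question (ids are made up).
docs = [
    {'_id': 'id-4', 'foo': 4},
    {'_id': 'id-1', 'foo': 6},  # duplicate _id, fails
    {'_id': 'id-7', 'foo': 7},
]
details = {'nInserted': 1, 'writeErrors': [{'index': 1, 'code': 11000}]}

ordered_result = inserted_ids_in_order(docs, details, ordered=True)
unordered_result = inserted_ids_in_order(docs, details, ordered=False)
```

In real code, docs would be the list given to insert_many and details would come from the caught exception (exc.details).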
Related
I have some data like this:
data = [{'_id': 1, 'val': 5},
        {'_id': 2, 'val': 1}]
Current data in the DB:
>>> db.collection.find_one()
{'_id': 1, 'val': 3}
I always receive unique rows, but I'm not sure whether any of them already exist in the DB (as in the case above). I want to update them according to two different requirements.
Requirement 1:
Do NOT update the rows if _id already exists. This is kinda easy in a way:
from pymongo.errors import BulkWriteError
try:
    # ordered=False lets the remaining rows be attempted after a duplicate
    db.collection.insert_many(data, ordered=False)
except BulkWriteError:
    pass
Executing the above inserts the 2nd row and doesn't update the first, but it also raises the exception.
1. Is there a better way of doing the above operation (for bulk inserts)?
Requirement 2:
This is like update-if-exists and insert-if-not-exists combined. So the following data:
data2 = [{'_id': 1, 'val': 9},
         {'_id': 3, 'val': 4}]
should update the row with _id=1 and insert the 2nd row in the DB.
The problem is that I get thousands of rows at a time, and I'm not sure that checking and updating them one by one is efficient.
2. Is this requirement possible in MongoDB without iterating over each row and with as few operations as possible?
You can generate a list of upserts and pass it to the bulk write API. All the operations are sent to the server together; they are still executed one by one on the server, but without causing an error.
from pymongo import UpdateOne
data2 = [{'_id': 1, 'val': 9}, {'_id': 3, 'val': 4}]
upserts = [UpdateOne({'_id': x['_id']}, {'$setOnInsert': x}, upsert=True) for x in data2]
result = db.test.bulk_write(upserts)
You can see in the result that when _id is found the operation is a no-op, but when it's not found, it's an insert.
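For Requirement 2 (update existing rows, insert the rest) the same bulk pattern should work with $set in place of $setOnInsert. The sketch below only builds the filter and update documents as plain dicts; build_upsert_specs is a hypothetical helper, and the commented lines show how the pairs would be wrapped in UpdateOne, assuming db.test as in the answer above.

```python
def build_upsert_specs(rows, overwrite_existing):
    """Build (filter, update) pairs for one upsert per row.

    overwrite_existing=False -> $setOnInsert: existing documents are left alone.
    overwrite_existing=True  -> $set: existing documents are updated.
    """
    operator = '$set' if overwrite_existing else '$setOnInsert'
    specs = []
    for row in rows:
        # _id goes in the filter; the remaining fields go in the update.
        fields = {key: value for key, value in row.items() if key != '_id'}
        specs.append(({'_id': row['_id']}, {operator: fields}))
    return specs


data2 = [{'_id': 1, 'val': 9}, {'_id': 3, 'val': 4}]
specs = build_upsert_specs(data2, overwrite_existing=True)

# With a live connection the pairs would be sent in one bulk_write call:
#   from pymongo import UpdateOne
#   ops = [UpdateOne(flt, upd, upsert=True) for flt, upd in specs]
#   result = db.test.bulk_write(ops, ordered=False)
```

The sketch assumes every row has at least one field besides _id; a row with only _id would produce an empty update document, which some server versions reject.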
Let's assume that I have a collection called my_collection, with three documents:
{'_id': 1, 'foo': 'foo_val', 'bar': 'bar_val'},
{'_id': 2, 'foo': 'foo_val2', 'bar': 'bar_val2'},
{'_id': 3, 'foo': 'foo_val', 'bar': 'bar_val2'}
I'd like to query it by given pairs of key-values, in this case e.g. I'd like to filter it by:
[{'foo': 'foo_val', 'bar': 'bar_val'}, {'foo': 'foo_val2', 'bar': 'bar_val2'}]
so it should return documents with ids 1 and 2.
Is there an elegant way to do this in one call to the DB? I tried using the $in operator, but it doesn't work the way I want.
You'll want to use the $or operator:
db.your_collection.find({$or:[{'foo': 'foo_val', 'bar': 'bar_val'},{'foo': 'foo_val2', 'bar': 'bar_val2'}]})
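From PyMongo the operator name has to be a quoted string, since $or is not a valid Python identifier. A minimal sketch; build_pairs_filter is a hypothetical helper and my_collection is the collection name from the question.

```python
def build_pairs_filter(pairs):
    """Match documents that satisfy every key/value of at least one pair."""
    return {'$or': [dict(pair) for pair in pairs]}


pairs = [
    {'foo': 'foo_val', 'bar': 'bar_val'},
    {'foo': 'foo_val2', 'bar': 'bar_val2'},
]
query = build_pairs_filter(pairs)

# With a live connection:
#   matching = list(db.my_collection.find(query))
```

$or expects a non-empty array, so an empty pairs list needs to be handled before querying.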
Is there a way of telling MongoDB to skip (instead of crashing) if we want to insert a document which has the same _id as an existing one?
Say, I have this list and I want to persist it:
to_mongo = [
    {'_id': 'aaaaaa', 'content': 'hey'},
    {'_id': 'bbbbbb', 'content': 'you'},
    {'_id': 'aaaaaa', 'content': 'hey'}
]
mongo_collection.insert_many(to_mongo)
I'd like the last item to be simply ignored, instead of causing the whole request to crash.
Try passing ordered=False to the insert_many() method, i.e.:
to_mongo = [
    {'_id': 'aaaaaa', 'content': 'hey'},
    {'_id': 'bbbbbb', 'content': 'you'},
    {'_id': 'aaaaaa', 'content': 'hey'}
]
mongo_collection.insert_many(to_mongo, ordered=False)
This makes sure that all write operations are attempted, even if there are errors. From the docs:
ordered (optional): If True (the default) documents will be inserted
on the server serially, in the order provided. If an error occurs all
remaining inserts are aborted. If False, documents will be inserted on
the server in arbitrary order, possibly in parallel, and all document
inserts will be attempted.
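Even with ordered=False, insert_many() raises BulkWriteError after attempting every document, so skipping duplicates in practice means catching that exception. Below is a sketch of deciding whether the exception can safely be ignored, assuming the details layout of PyMongo's BulkWriteError (a writeErrors list with a numeric code, where 11000 is a duplicate key error). only_duplicate_key_errors is a hypothetical helper, not a PyMongo function.

```python
DUPLICATE_KEY = 11000


def only_duplicate_key_errors(details):
    """True if every reported write error is a duplicate key error."""
    write_errors = details.get('writeErrors', [])
    no_other_failures = not details.get('writeConcernErrors')
    return no_other_failures and all(
        err.get('code') == DUPLICATE_KEY for err in write_errors
    )


# Stand-in for exc.details after inserting the to_mongo list above.
sample_details = {
    'nInserted': 2,
    'writeErrors': [
        {'index': 2, 'code': 11000, 'errmsg': 'E11000 duplicate key error'},
    ],
    'writeConcernErrors': [],
}
can_ignore = only_duplicate_key_errors(sample_details)

# With a live connection:
#   try:
#       mongo_collection.insert_many(to_mongo, ordered=False)
#   except pymongo.errors.BulkWriteError as exc:
#       if not only_duplicate_key_errors(exc.details):
#           raise
```

Re-raising anything other than a duplicate key error keeps unrelated failures (validation errors, for instance) from being silently swallowed.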
I'm doing a bulk insert into a MongoDB database. I know that 99% of the records will fail to insert because of a duplicate key error. After the insert, I would like to print how many new records were inserted into the database. All this is being done in Python through Motor, the Tornado driver for MongoDB, but that probably doesn't matter much.
try:
    bulk_write_result = yield db.collections.probe.insert(dataarray, continue_on_error=True)
    nr_inserts = bulk_write_result["nInserted"]
except pymongo.errors.DuplicateKeyError as e:
    nr_inserts = ????  # <--- what should I put here?
Since an exception was thrown, bulk_write_result is empty. Obviously I can (except for concurrency issues) do a count of the full collection before and after the insert, but I don't like the extra roundtrips to the database for just a line in the logfile. So is there any way I can discover how many records were actually inserted?
It is not clear to me why you yield your insert result. But, concerning the bulk inserts:
you should use insert_many as insert is deprecated;
when setting the ordered keyword to False, your inserts will continue in case of error;
in case of error, insert_many will raise a BulkWriteError, that you can query to obtain the number of inserted documents.
All of this leads to something like this:
try:
    insert_many_result = db.collections.probe.insert_many(dataarray, ordered=False)
    nr_inserts = len(insert_many_result.inserted_ids)
except pymongo.errors.BulkWriteError as bwe:
    nr_inserts = bwe.details["nInserted"]
If you need to identify the reason behind the write error, you will have to examine the bwe.details['writeErrors'] array. A code value of 11000 means "Duplicate key error":
>>> pprint(e.details['writeErrors'])
[{'code': 11000,
  'errmsg': 'E11000 duplicate key error index: test.w.$k_1 dup key: { : 1 }',
  'index': 0,
  'op': {'_id': ObjectId('555465cacf96c51208587eac'), 'k': 1}},
 {'code': 11000,
  'errmsg': 'E11000 duplicate key error index: test.w.$k_1 dup key: { : 3 }',
  'index': 1,
  'op': {'_id': ObjectId('555465cacf96c51208587ead'), 'k': 3}}]
Here, as you can see, I tried to insert two documents in the w collection of the test db. Both inserts failed because of a duplicate key error.
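For the log line the question asks about, the details dict can be condensed into a few counters. A small sketch, assuming the nInserted and writeErrors keys shown above; summarize_bulk_error is a hypothetical helper and the sample dict is a trimmed stand-in for bwe.details.

```python
from collections import Counter


def summarize_bulk_error(details):
    """Condense BulkWriteError.details into counts suitable for logging."""
    codes = Counter(err['code'] for err in details.get('writeErrors', []))
    return {
        'inserted': details.get('nInserted', 0),
        'failed': sum(codes.values()),
        'by_code': dict(codes),
    }


# Trimmed stand-in for the details printed above: two duplicate key errors.
sample_details = {
    'nInserted': 0,
    'writeErrors': [
        {'index': 0, 'code': 11000},
        {'index': 1, 'code': 11000},
    ],
}
summary = summarize_bulk_error(sample_details)
```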
Regular insert with continue_on_error can't report the info you want. If you're on MongoDB 2.6 or later, however, we have a high-performance solution with good error reporting. Here's a complete example using Motor's BulkOperationBuilder:
import pymongo.errors
from tornado import gen
from tornado.ioloop import IOLoop
from motor import MotorClient
db = MotorClient()
dataarray = [{'_id': 0},
             {'_id': 0},  # Duplicate.
             {'_id': 1}]

@gen.coroutine
def my_insert():
    try:
        bulk = db.collections.probe.initialize_unordered_bulk_op()

        # Prepare the operation on the client.
        for doc in dataarray:
            bulk.insert(doc)

        # Send to the server all at once.
        bulk_write_result = yield bulk.execute()
        nr_inserts = bulk_write_result["nInserted"]
    except pymongo.errors.BulkWriteError as e:
        print(e)
        nr_inserts = e.details['nInserted']

    print('nr_inserts: %d' % nr_inserts)

IOLoop.instance().run_sync(my_insert)
Full documentation: http://motor.readthedocs.org/en/stable/examples/bulk.html
Heed the warning about poor bulk insert performance on MongoDB before 2.6! It'll still work but requires a separate round-trip per document. In 2.6+, the driver sends the whole operation to the server in one round trip, and the server reports back how many succeeded and how many failed.
My "parents" documents have the following structure:
{childrenIdList: [23, 24, 34]}
{childrenIdList: [23, 88]}
{childrenIdList: [1, 5, 8]}
How can I select parents by a childId in their childrenIdList?
Such a query must return the first two of the three documents in my example if childId = 23.
I tried to use $elemMatch, but it seemingly works only with objects, i.e. it would work only if my data were {childrenIdList: [{Id: 1}, {Id: 5}, {Id: 8}]}
You can just use db.collection.find({childrenIdList: 23}). See the Query Arrays section in the manual for more details.
db.collection.find({"parent.childrenIdList": {$in: [23]}})
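To illustrate what the plain {childrenIdList: 23} filter means, here is an in-memory sketch of the matching rule for a scalar value: a document matches if the field equals the value, or if the field is an array containing it. This is a simplification for illustration, not how MongoDB evaluates queries; matches_value is a hypothetical helper, and the field is assumed to sit at the top level of each document as in the question.

```python
def matches_value(doc, field, value):
    """Mimic {field: value} for a scalar value against a scalar or list field."""
    stored = doc.get(field)
    if isinstance(stored, list):
        return value in stored
    return stored == value


parents = [
    {'childrenIdList': [23, 24, 34]},
    {'childrenIdList': [23, 88]},
    {'childrenIdList': [1, 5, 8]},
]

# The PyMongo filter for the real query would be this plain dict.
query = {'childrenIdList': 23}

matched = [doc for doc in parents if matches_value(doc, 'childrenIdList', 23)]
```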