Finding number of inserted documents in a bulk insert with duplicate keys - mongodb

I'm doing a bulk insert into a MongoDB database. I know that 99% of the records inserted will fail because of a duplicate key error. I would like to print, after the insert, how many new records were inserted into the database. All of this is done in Python through the Tornado Motor MongoDB driver, but that probably doesn't matter much.
try:
    bulk_write_result = yield db.collections.probe.insert(dataarray, continue_on_error=True)
    nr_inserts = bulk_write_result["nInserted"]
except pymongo.errors.DuplicateKeyError as e:
    nr_inserts = ????  <--- what should I put here?
Since an exception was thrown, bulk_write_result is empty. Obviously I can (except for concurrency issues) do a count of the full collection before and after the insert, but I don't like the extra roundtrips to the database for just a line in the logfile. So is there any way I can discover how many records were actually inserted?

It is not clear to me why you yield your insert result. But, concerning the bulk inserts:
you should use insert_many as insert is deprecated;
when setting the ordered keyword to False, your inserts will continue in case of error;
in case of error, insert_many will raise a BulkWriteError, that you can query to obtain the number of inserted documents.
All of this leads to something like this:
try:
    insert_many_result = db.collections.probe.insert_many(dataarray, ordered=False)
    nr_inserts = len(insert_many_result.inserted_ids)
except pymongo.errors.BulkWriteError as bwe:
    nr_inserts = bwe.details["nInserted"]
If you need to identify the reason behind the write error, you will have to examine the bwe.details['writeErrors'] array. A code value of 11000 means "Duplicate key error":
>>> pprint(bwe.details['writeErrors'])
[{'code': 11000,
  'errmsg': 'E11000 duplicate key error index: test.w.$k_1 dup key: { : 1 }',
  'index': 0,
  'op': {'_id': ObjectId('555465cacf96c51208587eac'), 'k': 1}},
 {'code': 11000,
  'errmsg': 'E11000 duplicate key error index: test.w.$k_1 dup key: { : 3 }',
  'index': 1,
  'op': {'_id': ObjectId('555465cacf96c51208587ead'), 'k': 3}}]
Here, as you can see, I tried to insert two documents in the w collection of the test db. Both inserts failed because of a duplicate key error.
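If all you need for your log line is how many of those failures were duplicates, a minimal sketch (assuming bwe is the BulkWriteError caught above):

DUPLICATE_KEY = 11000  # server error code for a duplicate key error

dup_errors = [err for err in bwe.details.get('writeErrors', [])
              if err['code'] == DUPLICATE_KEY]
print('inserted: %d, duplicate keys skipped: %d'
      % (bwe.details['nInserted'], len(dup_errors)))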

Regular insert with continue_on_error can't report the info you want. If you're on MongoDB 2.6 or later, however, we have a high-performance solution with good error reporting. Here's a complete example using Motor's BulkOperationBuilder:
import pymongo.errors
from tornado import gen
from tornado.ioloop import IOLoop
from motor import MotorClient

db = MotorClient()

dataarray = [{'_id': 0},
             {'_id': 0},  # Duplicate.
             {'_id': 1}]

@gen.coroutine
def my_insert():
    try:
        bulk = db.collections.probe.initialize_unordered_bulk_op()

        # Prepare the operation on the client.
        for doc in dataarray:
            bulk.insert(doc)

        # Send to the server all at once.
        bulk_write_result = yield bulk.execute()
        nr_inserts = bulk_write_result["nInserted"]
    except pymongo.errors.BulkWriteError as e:
        print(e)
        nr_inserts = e.details['nInserted']

    print('nr_inserts: %d' % nr_inserts)

IOLoop.instance().run_sync(my_insert)
Full documentation: http://motor.readthedocs.org/en/stable/examples/bulk.html
Heed the warning about poor bulk insert performance on MongoDB before 2.6! It'll still work but requires a separate round-trip per document. In 2.6+, the driver sends the whole operation to the server in one round trip, and the server reports back how many succeeded and how many failed.
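For what it's worth, the bulk-op builder used above has since been deprecated in newer Motor/PyMongo releases in favor of insert_many and bulk_write. A minimal sketch of the same idea with insert_many (assuming a recent Motor with native coroutine support; client and collection names as in the example above):

import pymongo.errors
from tornado.ioloop import IOLoop
from motor.motor_tornado import MotorClient

db = MotorClient()

dataarray = [{'_id': 0},
             {'_id': 0},  # Duplicate.
             {'_id': 1}]

async def my_insert():
    try:
        # ordered=False keeps inserting after a duplicate-key failure.
        result = await db.collections.probe.insert_many(dataarray, ordered=False)
        nr_inserts = len(result.inserted_ids)
    except pymongo.errors.BulkWriteError as e:
        nr_inserts = e.details['nInserted']
    print('nr_inserts: %d' % nr_inserts)

IOLoop.current().run_sync(my_insert)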

Related

Mongo .find() returning duplicate documents (with same _id) (!)

Mongo appears to be returning duplicate documents for the same query, i.e. it returns more documents than there are unique _ids in the returned documents:
lobby-brain> count_iterated = 0; ids = {}
{}
lobby-brain> db.the_collection.find({
    'a_boolean_key': true
}).forEach((el) => {
    count_iterated += 1;
    ids[el._id] = (ids[el._id] || 0) + 1;
})
lobby-brain> count_iterated
278
lobby-brain> Object.keys(ids).length
251
That is, the number of unique _id returned is 251 -- but there were 278 documents returned by the cursor.
Investigating further:
lobby-brain> ids
{
    '60cb8cb92c909a974a96a430': 1,
    '61114dea1a13c86146729f21': 1,
    '6111513a1a13c861467d3dcf': 1,
    ...
    '61114c491a13c861466d39cf': 2,
    '61114bcc1a13c861466b9f8e': 2,
    ...
}
lobby-brain> db.the_collection.find({
    _id: ObjectId("61114c491a13c861466d39cf")
}).forEach((el) => print("foo"));
foo
>
That is, there aren't actually duplicate documents with the same _id -- it's just an issue with the .find().
I tried restarting the database, and rebuilding an index involving 'a_boolean_key', with the same results.
I've never seen this before and this seems impossible... what is causing this and how can I fix it?
Version info:
Using MongoDB: 5.0.5
Using Mongosh: 1.0.4
It is a stand-alone database, no replica set or sharding or anything like that.
Further Info
One thing to note is, there is a compound index with a_boolean_key as the first index, and a datetime field as the second. The boolean key is rarely updated on the database (~once/day), but the datetime field is frequently updated.
Maybe these updates are causing the duplicate return values?
Update Feb 15, 2022: I added a Mongo JIRA task here.
Check whether you have an index on the a_boolean_key field:
When performing a count, MongoDB can return the count using only the
index.
So perhaps the index does not cover all of your documents, and the count method's result does not match your manual count.
According to Louis Williams over at the MongoDB JIRA, this is not a bug but expected behavior: a query that is not snapshot-isolated can return the same document more than once if that document moves within the index being scanned, for example because its frequently updated datetime field (the second key of the compound index) changes mid-scan.
Learn something new every day!
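If you need stable results while the collection is being written to, one workaround is to de-duplicate on the client side by _id. A minimal sketch in PyMongo (the database and collection names are assumptions taken from the shell session above):

from pymongo import MongoClient

coll = MongoClient()['lobby-brain']['the_collection']

seen = set()
unique_docs = []
for doc in coll.find({'a_boolean_key': True}):
    # A document that moves within the index during the scan can be
    # returned twice; keep only the first occurrence of each _id.
    if doc['_id'] not in seen:
        seen.add(doc['_id'])
        unique_docs.append(doc)

print(len(unique_docs))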

Bulk update is too slow

I am using pymongo to do a bulk update.
The names list below is a distinct list of names (each name might have multiple documents in the collection).
Code 1:
bulk = db.collection.initialize_unordered_bulk_op()
for name in names:
    bulk.find({"A": {"$exists": False}, 'Name': name}).update({"$set": {'B': b, 'C': c, 'D': d}})
print bulk.execute()
Code 2:
bulk = db.collection.initialize_unordered_bulk_op()
counter = 0
for name in names:
    bulk.find({"A": {"$exists": False}, 'Name': name}).update({"$set": {'B': b, 'C': c, 'D': d}})
    counter = counter + 1
    if (counter % 100 == 0):
        print bulk.execute()
        bulk = db.collection.initialize_unordered_bulk_op()

if (counter % 100 != 0):
    print bulk.execute()
I have 50000 documents in my collection.
If I get rid of the counter and if statement (Code 1), the code gets stuck!
With the if statement (Code 2), I am assuming this operation shouldn't take more than a couple of minutes but it is taking way more than that! Can you please help me make it faster or am I wrong in my assumption?!
You most likely forgot to add indexes to support your queries!
This triggers a full collection scan for each of your operations, which is painfully slow (as you noticed).
The following code tests both update_many and the bulk operations, without and with indexes on the 'Name' and 'A' fields. The numbers you get speak for themselves.
Remark: I was not patient enough to run this for 50000 documents without the indexes, so I used 10000 documents.
Results for 10000 are:
without index and update_many: 38.6 seconds
without index and bulk update: 28.7 seconds
with index and update_many: 3.9 seconds
with index and bulk update: 0.52 seconds
For 50000 documents with the added indexes it takes 2.67 seconds. I ran the test on a Windows machine, with mongo running on the same host in Docker.
For more information about indexes see https://docs.mongodb.com/manual/indexes/#indexes. In short: indexes are kept in RAM and allow for fast query and lookup of documents. Indexes have to be chosen specifically to match your queries.
from pymongo import MongoClient
import random
from timeit import timeit

col = MongoClient()['test']['test']
col.drop()  # erase all documents in collection 'test'

docs = []

# initialize 10000 documents; use a random number between 0 and 1 converted
# to a string as name. For the documents with a name > 0.5 add the key A
for i in range(0, 10000):
    number = random.random()
    if number > 0.5:
        doc = {'name': str(number),
               'A': True}
    else:
        doc = {'name': str(number)}
    docs.append(doc)

col.insert_many(docs)        # insert all documents into the collection
names = col.distinct('name') # get all distinct values for the key name from the collection

def update_with_update_many():
    for name in names:
        col.update_many({'A': {'$exists': False}, 'Name': name},
                        {'$set': {'B': 1, 'C': 2, 'D': 3}})

def update_with_bulk():
    bulk = col.initialize_unordered_bulk_op()
    for name in names:
        bulk.find({'A': {'$exists': False}, 'Name': name}).\
            update({'$set': {'B': 1, 'C': 2, 'D': 3}})
    bulk.execute()

print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))

col.create_index('A')     # this adds an index on key A
col.create_index('Name')  # this adds an index on key Name

print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))
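Note that initialize_unordered_bulk_op was removed in PyMongo 4; the same bulk update can be expressed with bulk_write and UpdateMany. A rough sketch reusing the names from the test above:

from pymongo import MongoClient, UpdateMany

col = MongoClient()['test']['test']
names = col.distinct('name')

def update_with_bulk_write():
    # Build all the update operations on the client, send them in one call.
    requests = [UpdateMany({'A': {'$exists': False}, 'Name': name},
                           {'$set': {'B': 1, 'C': 2, 'D': 3}})
                for name in names]
    result = col.bulk_write(requests, ordered=False)
    print(result.modified_count)

update_with_bulk_write()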

insert_many with upsert - PyMongo [duplicate]

This question already has answers here:
Fast or Bulk Upsert in pymongo
(6 answers)
Closed 3 years ago.
I have some data like this:
data = [{'_id': 1, 'val': 5},
        {'_id': 2, 'val': 1}]
current data in db:
>>> db.collection.find_one()
{'_id': 1, 'val': 3}
I always receive unique rows but am not sure whether any of them already exist in the DB (such as in the case above). I want to update them based on two types of requirements.
Requirement 1:
Do NOT update the rows if _id already exists. This is kinda easy in a way:
from pymongo.errors import BulkWriteError

try:
    db.collection.insert_many(data, ordered=False)
except BulkWriteError:
    pass
Executing the above inserts the 2nd row but doesn't touch the first; however, it also raises the exception.
1. Is there any better way of doing the above operation (for bulk inserts) ?
Requirement 2
This is similar to update_if_exists & insert if not exists combined. So the following data:
data2 = [{'_id': 1, 'val': 9},
         {'_id': 3, 'val': 4}]
should update the row with _id=1 and insert the 2nd row in DB.
The problem is I get thousands of rows at one time and am not sure if checking and updating one-by-one is efficient.
2. Is this requirement possible in MongoDB without iterating over each row and with as few operations as possible ?
You can generate a list of update operations and pass them to the bulk write API. All the operations are sent to the server together, but they are still executed one by one there, without raising an error:
from pymongo import UpdateOne

data2 = [{'_id': 1, 'val': 9}, {'_id': 3, 'val': 4}]
upserts = [UpdateOne({'_id': x['_id']}, {'$setOnInsert': x}, upsert=True)
           for x in data2]
result = db.test.bulk_write(upserts)
You can see in the result that when _id is found the operation is a no-op, but when it's not found, it's an insert.
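For requirement 2 (update the document if the _id exists, insert it otherwise), a similar sketch should work with ReplaceOne instead of $setOnInsert, since the replacement is applied whether or not the document already exists:

from pymongo import ReplaceOne

data2 = [{'_id': 1, 'val': 9}, {'_id': 3, 'val': 4}]
# Upsert: replace the existing document with the same _id, or insert it.
upserts = [ReplaceOne({'_id': x['_id']}, x, upsert=True) for x in data2]
result = db.test.bulk_write(upserts)
print(result.upserted_count, result.modified_count)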

Mongodb bulk write error

I'm executing bulk write
bulk = new_packets.initialize_ordered_bulk_op()
bulk.insert(packet)
output = bulk.execute()
and getting an error that I interpret to mean that packet is not a dict. However, I do know that it is a dict. What could be the problem?
Here is the error:
BulkWriteError Traceback (most recent call last)
<ipython-input-311-93f16dce5714> in <module>()
2
3 bulk.insert(packet)
----> 4 output = bulk.execute()
C:\Users\e306654\AppData\Local\Continuum\Anaconda\lib\site-packages\pymongo\bulk.pyc in execute(self, write_concern)
583 if write_concern and not isinstance(write_concern, dict):
584 raise TypeError('write_concern must be an instance of dict')
--> 585 return self.__bulk.execute(write_concern)
C:\Users\e306654\AppData\Local\Continuum\Anaconda\lib\site-packages\pymongo\bulk.pyc in execute(self, write_concern)
429 self.execute_no_results(generator)
430 elif client.max_wire_version > 1:
--> 431 return self.execute_command(generator, write_concern)
432 else:
433 return self.execute_legacy(generator, write_concern)
C:\Users\e306654\AppData\Local\Continuum\Anaconda\lib\site-packages\pymongo\bulk.pyc in execute_command(self, generator, write_concern)
296 full_result['writeErrors'].sort(
297 key=lambda error: error['index'])
--> 298 raise BulkWriteError(full_result)
299 return full_result
300
BulkWriteError: batch op errors occurred
There can be many reasons...
The best approach is to try/except the exception and check the errors:
from pymongo.errors import BulkWriteError

try:
    bulk.execute()
except BulkWriteError as bwe:
    print(bwe.details)
    # you can also take this component and do more analysis
    # werrors = bwe.details['writeErrors']
    raise
OK, the problem was that I was assigning _id explicitly, and it turns out the string was larger than the 12-byte limit. My bad.
You should check 2 things:
1. Duplicates, if you are defining your own key.
2. Whether your custom types can be handled. In my case I was trying to pass a hash-type object that could not be converted into a valid ObjectId, which led me back to the first point and into a vicious circle (I solved it by converting myObject to a string).
Inserting the documents one by one will give you an idea of what is happening; see the sketch below.
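A quick sketch of that last point (the packets list is an assumption; new_packets is the collection from the question): fall back to single inserts so the failing document and its error message become obvious:

import pymongo.errors

for i, packet in enumerate(packets):
    try:
        new_packets.insert_one(packet)
    except pymongo.errors.PyMongoError as exc:
        # Report which document failed and why.
        print(i, packet, exc)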
In addition to the above, check your unique indexes. If you're bulk inserting and have declared a unique index on a field that doesn't exist in your data, you will get this error.
For example, I had accidentally specified name as a unique index, and the data I was inserting had no keys called name. After the first entry is inserted into mongo, it will throw this error because you're technically inserting another document with a unique name of null.
Here's a part of my model definition where I'm declaring a unique index:
self.conn[self.collection_name].create_index(
    [("name", ASCENDING)],
    unique=True,
)
And here are the details of the error being thrown:
{'writeErrors': [{'index': 1, 'code': 11000, 'keyPattern': {'name': 1},
'keyValue': {'name': None}, 'errmsg': 'E11000 duplicate key error collection:
troposphere.temp index: name_1 dup key: { name: null }'
...
more resources:
MongoDB E11000 duplicate key error
I was trying to insert two documents with the same "_id" and other keys.
Solution:
insert different "_id"s for different documents, OR
remove the "_id" and you get an auto-generated one.
Try using a debugger; it should give you the errmsg with the exact error, and the op object it was trying to insert.

How to delete documents by query efficiently in mongo?

I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because it matters. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce,
                                 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. query the original collection by the entry key, order it by test_date
        #     ascending, limit to the group size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
Map-reduce the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. The last two steps are foreach-found-remove, and this is the important detail of my question that changes everything; I should have been more specific about it - sorry.
Now, about the collection remove command. It does accept a query, but mine includes sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to mess up mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say, you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
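A minimal sketch of that chunked approach in PyMongo (the collection name and batch size are assumptions):

from pymongo import MongoClient

coll = MongoClient()['mydb']['mycoll']  # hypothetical database/collection
query = {'name': 'John'}
batch_size = 1000

while True:
    # Grab the next batch of matching _ids and delete just those.
    ids = [d['_id'] for d in coll.find(query, {'_id': 1}).limit(batch_size)]
    if not ids:
        break
    coll.delete_many({'_id': {'$in': ids}})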
You can remove it directly using MongoDB scripting language:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m doc collection. Documentation at (https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany)
db.collection.deleteMany(
    <filter>,
    {
        writeConcern: <document>,
        collation: <document>
    }
)
I would recommend paging if you have a large number of records.
First: Get the count of data you want to delete:
// -------------------------- COUNT --------------------------
var query = {"FEILD": "XYZ", 'DATE': {$lt: new ISODate("2019-11-10")}};
db.COL.aggregate([
    {$match: query},
    {$count: "all"}
])
Second: Start deleting chunk by chunk:
// -------------------------- DELETE --------------------------
var query = {"FEILD": "XYZ", 'date': {$lt: new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
    {$match: query},
    {$limit: 5}
])
cursor.forEach(function (doc) {
    db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query={"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.tags.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});