Bulk update is too slow - mongodb

I am using pymongo to do a bulk update.
The names list below is a distinct list of names (each name might have multiple documents in the collection).
Code 1:
bulk = db.collection.initialize_unordered_bulk_op()
for name in names:
    bulk.find({"A": {"$exists": False}, 'Name': name}).update({"$set": {'B': b, 'C': c, 'D': d}})
print bulk.execute()
Code 2:
bulk = db.collection.initialize_unordered_bulk_op()
counter = 0
for name in names:
    bulk.find({"A": {"$exists": False}, 'Name': name}).update({"$set": {'B': b, 'C': c, 'D': d}})
    counter = counter + 1
    if counter % 100 == 0:
        print bulk.execute()
        bulk = db.collection.initialize_unordered_bulk_op()
if counter % 100 != 0:
    print bulk.execute()
I have 50000 documents in my collection.
If I get rid of the counter and if statement (Code 1), the code gets stuck!
With the if statement (Code 2), I am assuming this operation shouldn't take more than a couple of minutes but it is taking way more than that! Can you please help me make it faster or am I wrong in my assumption?!

You most likely forgot to add indexes to support your queries!
This will trigger a full collection scan for each of your operations, which is painfully slow (as you noticed).
The following code tests both update_many and the bulk API, without and with indexes on the 'name' and 'A' fields. The numbers you get speak for themselves.
Note: I was not patient enough to run the un-indexed case for 50000 documents, so the figures below are for 10000 documents.
Results for 10000 are:
without index and update_many: 38.6 seconds
without index and bulk update: 28.7 seconds
with index and update_many: 3.9 seconds
with index and bulk update: 0.52 seconds
For 50000 documents with the index added it takes 2.67 seconds. I ran the test on a Windows machine, with MongoDB running in Docker on the same host.
For more information about indexes see https://docs.mongodb.com/manual/indexes/#indexes. In short: indexes are kept in RAM and allow fast querying and lookup of documents. Indexes have to be chosen specifically to match your queries.
from pymongo import MongoClient
import random
from timeit import timeit

col = MongoClient()['test']['test']
col.drop()  # start with an empty 'test' collection

docs = []
# Initialize 10000 documents, using a random number between 0 and 1 converted
# to a string as the name. For documents whose number is > 0.5, add the key A.
for i in range(0, 10000):
    number = random.random()
    if number > 0.5:
        doc = {'name': str(number), 'A': True}
    else:
        doc = {'name': str(number)}
    docs.append(doc)
col.insert_many(docs)  # insert all documents into the collection

names = col.distinct('name')  # get all distinct values for the key name

def update_with_update_many():
    for name in names:
        col.update_many({'A': {'$exists': False}, 'name': name},
                        {'$set': {'B': 1, 'C': 2, 'D': 3}})

def update_with_bulk():
    bulk = col.initialize_unordered_bulk_op()
    for name in names:
        bulk.find({'A': {'$exists': False}, 'name': name}).\
            update({'$set': {'B': 1, 'C': 2, 'D': 3}})
    bulk.execute()

print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))

col.create_index('A')     # add an index on key A
col.create_index('name')  # add an index on key name

print(timeit(update_with_update_many, number=1))
print(timeit(update_with_bulk, number=1))
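Note that initialize_unordered_bulk_op was removed in PyMongo 4; with current drivers the same batching is done with bulk_write. A minimal sketch of the bulk case above using that API (collection and field names as in the test code):
from pymongo import MongoClient, UpdateMany

# Sketch: the same unordered bulk update expressed with bulk_write,
# which replaces initialize_unordered_bulk_op in PyMongo 4+.
col = MongoClient()['test']['test']
names = col.distinct('name')

requests = [UpdateMany({'A': {'$exists': False}, 'name': name},
                       {'$set': {'B': 1, 'C': 2, 'D': 3}})
            for name in names]
if requests:
    result = col.bulk_write(requests, ordered=False)
    print(result.bulk_api_result)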

Related

MongoDB range query with a sort - how to speed up?

I have a query which is routinely taking around 30 seconds to run for a collection with 1 million documents. This query is to form part of a search engine, where the requirement is that every search completes in less than 5 seconds. Using a simplified example here (the actual docs has embedded documents and other attributes), let's say I have the following:
1 millions docs of a Users collections where each looks as follows:
{
    name: Dan,
    age: 30,
    followers: 400
},
{
    name: Sally,
    age: 42,
    followers: 250
}
... etc
Now, let's say I want to return the IDs of 10 users with a follower count between 200 and 300, sorted by age in descending order. This can be achieved with the following:
db.users.find({
    'followers': { $gt: 200, $lt: 300 },
}).
projection({ '_id': 1 }).
sort({ 'age': -1 }).
limit(10)
I have the following compound index created, which winningPlan tells me is being used:
db.users.createIndex({ 'followed_by': -1, 'age': -1 })
But this query is still taking ~30 seconds as it's having to examine thousands of docs, near equal to the amount of docs in this case that match the find query. I have experimented with different indexes (with different positions and sort orders) with no luck.
So my question is, what else can I do to either reduce the number of documents examined by the query, or speed up the process of examining them?
The query is taking this long both in production and on my local dev environment, somewhat ruling out network and hardware factors. currentOp shows that the query is not waiting for locks while running, and that there are no other queries running at the same time.
It looks to me like you have the wrong index for your query: { 'followed_by': -1, 'age': -1 }. You should have an index { 'followers': 1 } (but take the cardinality of that field into consideration). Even with that index you will still need an in-memory sort, but it should be much faster as long as the field has high cardinality, because you will no longer scan the whole collection for the filtering step as you do now with the index prefix followed_by.
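A minimal pymongo sketch of that suggestion, assuming the filter field really is called followers as in the question; the explain() call lets you confirm which plan wins:
from pymongo import MongoClient, ASCENDING, DESCENDING

# Sketch: add the suggested single-field index, then inspect the plan.
db = MongoClient().my_database
db.users.create_index([('followers', ASCENDING)])

plan = (db.users.find({'followers': {'$gt': 200, '$lt': 300}}, {'_id': 1})
        .sort('age', DESCENDING)
        .limit(10)
        .explain())
print(plan['queryPlanner']['winningPlan'])  # expect an IXSCAN on followers, with the sort done in memory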

Iterating through collections and counting the number of same value appearances pymongo

I have similar documents in 5 collections in MongoDB, as follows:
{
    "_id" : ObjectId("53490030cf3b942d63cfbc7b"),
    "uNr" : "abdc123abcd",
}
I want to iterate through each collection and check if a uNr matches in any collection. If it does, add that uNr with its count to a new collection. For example, if there is a match in 3 collections, it should show {"uNr" : "abcd123", "count": 3}.
If your total number of uNr values is small enough to fit in memory (at most a few million of them), you can total them client-side with a Counter and store them in a MongoDB collection:
from collections import Counter
from pymongo import MongoClient, InsertOne

db = MongoClient().my_database
counts = Counter()
for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    for doc in collection.find():
        counts[doc['uNr']] += 1

# Empty the target collection.
db.counts.delete_many({})
inserts = [InsertOne({'_id': uNr, 'n': cnt}) for uNr, cnt in counts.items()]
db.counts.bulk_write(inserts)
Otherwise, query a thousand uNr values at a time and update counts in a separate collection:
from pymongo import MongoClient, UpdateOne, ASCENDING

db = MongoClient().my_database

# Empty the target collection.
db.counts.delete_many({})
db.counts.create_index([('uNr', ASCENDING)])

for collection in [db.collection1,
                   db.collection2,
                   db.collection3]:
    cursor = collection.find(no_cursor_timeout=True)
    # "with" statement helps ensure the cursor is closed, since the server
    # will never auto-close it.
    with cursor:
        updates = []
        for doc in cursor:
            updates.append(UpdateOne({'_id': doc['uNr']},
                                     {'$inc': {'n': 1}},
                                     upsert=True))
            if len(updates) == 1000:
                db.counts.bulk_write(updates)
                updates = []
        if updates:
            # Last batch.
            db.counts.bulk_write(updates)
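If the server is recent enough, the counting can also be pushed entirely to the database. This is an assumed alternative rather than part of the answer above: it relies on $unionWith (MongoDB 4.4+) and $merge (4.2+), with illustrative collection names:
from pymongo import MongoClient

db = MongoClient().my_database
# Combine the collections server-side, group by uNr, and write the totals
# into the 'counts' collection.
db.collection1.aggregate([
    {'$unionWith': 'collection2'},
    {'$unionWith': 'collection3'},
    {'$group': {'_id': '$uNr', 'n': {'$sum': 1}}},
    {'$merge': {'into': 'counts', 'whenMatched': 'replace'}},
])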

Finding number of inserted documents in a bulk insert with duplicate keys

I'm doing a bulk-insert into a mongodb database. I know that 99% of the records inserted will fail because of a duplicate key error. I would like to print after the insert how many new records were inserted into the database. All this is being done in python through the tornado motor mongodb driver, but probably this doesn't matter much.
try:
    bulk_write_result = yield db.collections.probe.insert(dataarray, continue_on_error=True)
    nr_inserts = bulk_write_result["nInserted"]
except pymongo.errors.DuplicateKeyError as e:
    nr_inserts = ???? <--- what should I put here?
Since an exception was thrown, bulk_write_result is empty. Obviously I can (except for concurrency issues) do a count of the full collection before and after the insert, but I don't like the extra roundtrips to the database for just a line in the logfile. So is there any way I can discover how many records were actually inserted?
It is not clear to me why you yield your insert result. But, concerning the bulk inserts:
you should use insert_many as insert is deprecated;
when setting the ordered keyword to False, your inserts will continue in case of error;
in case of error, insert_many will raise a BulkWriteError, which you can query to obtain the number of inserted documents.
All of this leads to something like this:
try:
    insert_many_result = db.collections.probe.insert_many(dataarray, ordered=False)
    nr_inserts = len(insert_many_result.inserted_ids)
except pymongo.errors.BulkWriteError as bwe:
    nr_inserts = bwe.details["nInserted"]
If you need to identify the reason behind the write error, you will have to examine the bwe.details['writeErrors'] array. A code value of 11000 means "Duplicate key error":
>>> pprint(e.details['writeErrors'])
[{'code': 11000,
  'errmsg': 'E11000 duplicate key error index: test.w.$k_1 dup key: { : 1 }',
  'index': 0,
  'op': {'_id': ObjectId('555465cacf96c51208587eac'), 'k': 1}},
 {'code': 11000,
  'errmsg': 'E11000 duplicate key error index: test.w.$k_1 dup key: { : 3 }',
  'index': 1,
  'op': {'_id': ObjectId('555465cacf96c51208587ead'), 'k': 3}}]
Here, as you can see, I tried to insert two documents in the w collection of the test db. Both inserts failed because of a duplicate key error.
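Building on that structure, a small sketch of how you might separate duplicate-key errors from any other write errors inside the except block (using the bwe variable from the code above):
# Sketch: classify the write errors reported by BulkWriteError.
write_errors = bwe.details['writeErrors']
duplicates = [err for err in write_errors if err['code'] == 11000]
others = [err for err in write_errors if err['code'] != 11000]
print('%d duplicate key errors, %d other write errors'
      % (len(duplicates), len(others)))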
Regular insert with continue_on_error can't report the info you want. If you're on MongoDB 2.6 or later, however, we have a high-performance solution with good error reporting. Here's a complete example using Motor's BulkOperationBuilder:
import pymongo.errors
from tornado import gen
from tornado.ioloop import IOLoop
from motor import MotorClient

db = MotorClient()

dataarray = [{'_id': 0},
             {'_id': 0},  # Duplicate.
             {'_id': 1}]

@gen.coroutine
def my_insert():
    try:
        bulk = db.collections.probe.initialize_unordered_bulk_op()

        # Prepare the operation on the client.
        for doc in dataarray:
            bulk.insert(doc)

        # Send to the server all at once.
        bulk_write_result = yield bulk.execute()
        nr_inserts = bulk_write_result["nInserted"]
    except pymongo.errors.BulkWriteError as e:
        print(e)
        nr_inserts = e.details['nInserted']

    print('nr_inserts: %d' % nr_inserts)

IOLoop.instance().run_sync(my_insert)
Full documentation: http://motor.readthedocs.org/en/stable/examples/bulk.html
Heed the warning about poor bulk insert performance on MongoDB before 2.6! It'll still work but requires a separate round-trip per document. In 2.6+, the driver sends the whole operation to the server in one round trip, and the server reports back how many succeeded and how many failed.
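For reference, the explicit bulk builder shown above has since been removed in PyMongo 4 (and therefore in current Motor); insert_many(ordered=False) gives the same unordered behaviour. A rough asyncio-based sketch, with illustrative database and collection names:
import asyncio
import pymongo.errors
from motor.motor_asyncio import AsyncIOMotorClient

# Sketch with current Motor: unordered insert_many instead of the bulk builder.
db = AsyncIOMotorClient().test

dataarray = [{'_id': 0},
             {'_id': 0},  # Duplicate.
             {'_id': 1}]

async def my_insert():
    try:
        result = await db.probe.insert_many(dataarray, ordered=False)
        nr_inserts = len(result.inserted_ids)
    except pymongo.errors.BulkWriteError as e:
        nr_inserts = e.details['nInserted']
    print('nr_inserts: %d' % nr_inserts)

asyncio.run(my_insert())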

How to delete documents by query efficiently in mongo?

I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because it matters. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce,
                                 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. Query the original collection by the entry key, order it by test_date
        # ascending, limit to the group size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
MR the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So, the last two steps are foreach-found-remove, and this is the important detail of my question that changes everything; I should have been more specific about it - sorry.
Now, about the collection remove command. It does accept a query, but mine includes sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to mess up mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say, you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
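A rough pymongo sketch of that chunked approach; the collection name, filter, and batch size are illustrative:
from pymongo import MongoClient

# Sketch: delete matching documents in batches instead of one big remove.
coll = MongoClient().my_database.my_collection
query = {'name': 'John'}
batch_size = 1000

while True:
    ids = [doc['_id'] for doc in coll.find(query, {'_id': 1}).limit(batch_size)]
    if not ids:
        break
    coll.delete_many({'_id': {'$in': ids}})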
You can remove a document directly in the mongo shell:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m doc collection. Documentation at (https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany)
db.collection.deleteMany(
    <filter>,
    {
        writeConcern: <document>,
        collation: <document>
    }
)
I would recommend paging if there is a large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query= {"FEILD":"XYZ", 'DATE': {$lt:new ISODate("2019-11-10")}};
db.COL.aggregate([
{$match:query},
{$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query= {"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
{$match:query},
{ $limit : 5 }
])
cursor.forEach(function (doc){
db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query={"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.tags.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});

Jumbled up ids in mongodb

We have around 20 million records in our mongodb. In my collection called 'posts' there is a field called 'id' which was supposed to be unique but now it has gotten all messed up. We just want it to be unique and there are many many duplicates now.
We just wanted to do something like iterating over every record and assigning it a unique id in a loop from 1 to 20 million.
What would be the easiest way to do this?
There are not many options here, really.
1. Pick your language and driver of choice.
2. Fetch N documents.
3. Assign unique ids to them (several options here: 1) copy _id; 2) assign a new ObjectId; 3) assign a plain integer).
4. Save those documents.
5. Fetch the next N documents. Go to step 3.
To fetch next N documents, you should note the last processed document's _id and do this:
db.collection.find({_id: {$gt: last_processed_id}}).sort({_id: 1}).limit(N);
Do not use skip here. It will be too slow.
And, of course, you can always truncate the collection, create unique index on id and populate it again.
You can use a simple script like this:
db.posts.dropIndex("*id index name here*"); // Drop the unique index

counter = 0;
page = 1;
slice = 1000;
total = db.posts.count();
conditions = {};

while (counter < total) {
    cursor = db.posts.find(conditions, {_id: true}).sort({_id: 1}).limit(slice);
    while (cursor.hasNext()) {
        row = cursor.next();
        db.posts.update({_id: row._id}, {$set: {id: ++counter}});
    }
    conditions['_id'] = {$gt: row._id};
    print("Processed " + counter + " rows");
}

print('Adding id index');
db.posts.ensureIndex({id: 1}, {unique: true, background: false});
print("done");
save it to assignids.js, and run as
$ mongo dbname assignids.js
the outer while selects 1000 rows at a time and prevents cursor timeouts; the inner while assigns each row a new incremental id.
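For completeness, a rough pymongo sketch of the same renumbering, batching the updates with bulk_write to keep round trips down (the database name is illustrative):
from pymongo import MongoClient, UpdateOne, ASCENDING

# Sketch: walk the collection by _id and assign incremental ids in batches.
posts = MongoClient().my_database.posts
counter = 0
last_id = None

while True:
    query = {'_id': {'$gt': last_id}} if last_id is not None else {}
    batch = list(posts.find(query, {'_id': 1}).sort('_id', ASCENDING).limit(1000))
    if not batch:
        break
    updates = []
    for doc in batch:
        counter += 1
        updates.append(UpdateOne({'_id': doc['_id']}, {'$set': {'id': counter}}))
    posts.bulk_write(updates)
    last_id = batch[-1]['_id']
    print('Processed %d rows' % counter)

posts.create_index('id', unique=True)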