MongoDb update a field in a huge collection using pymongo fast - mongodb

I have 13 GB of documents in a MongoDB collection in which I need to update the field ip_address. The original and replacement values are given in an Excel sheet. I am looping through each value from the sheet and updating it using:
old_value={"ip_address":original_value}
new_value={"$set":{"ip_address":replacement_value}}
tableConnection.update_many(old_value,new_value)
Each update takes over 2 minutes to process, and I have 1500 updates to do. Is there a better way to do it?

Bulk operations won't speed up your updates by much; the best way to achieve a performance increase is to add an index. This can be as simple as:
db.collection.createIndex({'ip_address': 1})
Refer to the documentation regarding potential blocking on certain older versions of the database: https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/
The index will take up additional storage; if that is an issue you can delete the index once you've completed the updates.

To add to the above answer by Belly Buster, the syntax for creating the index and using bulk_write in PyMongo that worked for me is:
from pymongo import UpdateMany
from pymongo.errors import BulkWriteError
db.collection.create_index("ip_address")
requests = [
    UpdateMany({'ip_address': 'xx.xx.xx.xx'}, {'$set': {'ip_address': 'yy.yy.yy.yy'}}),
    UpdateMany({'ip_address': 'xx.xx.xx.xx'}, {'$set': {'ip_address': 'yy.yy.yy.yy'}}),
]
try:
    db.collection.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    print(bwe.details)
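As a rough sketch of how the pairs read from the spreadsheet could feed that bulk write: the helper names build_update_specs and apply_updates are made up for illustration, and pairs is assumed to be a list of (original, replacement) tuples already loaded from the Excel sheet.

```python
def build_update_specs(pairs):
    """Turn (original, replacement) pairs into (filter, update) documents.

    Pairs whose two values are identical are skipped, since they would
    rewrite documents without changing them.
    """
    return [
        ({"ip_address": old}, {"$set": {"ip_address": new}})
        for old, new in pairs
        if old != new
    ]


def apply_updates(collection, pairs):
    """Index ip_address, then send all updates in one unordered bulk write."""
    # Imported here so build_update_specs can be used without PyMongo installed.
    from pymongo import UpdateMany

    collection.create_index("ip_address")
    requests = [UpdateMany(flt, upd) for flt, upd in build_update_specs(pairs)]
    if not requests:
        return None  # bulk_write raises on an empty request list
    return collection.bulk_write(requests, ordered=False)
```

With ordered=False the server may apply the operations in any order, so if a replacement value also appears as an original value elsewhere in the sheet, the result can depend on execution order. In that case ordered=True is probably the safer choice.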

Related

How should I efficiently delete a lot of records from a MongoDB collection?

I am using Mongo to store multi-tenant data. As part of data cleanup for a tenant I want to delete everything related to that tenant. The tenantId is indexed, but there are a lot of rows, so the delete takes a long time and I have no easy way to track its progress.
Currently I do something like:
db.records.deleteMany({tenantId: x})
Is there a better way?
I am thinking of doing it in batches: query for x records, then build a list of ids to delete. It seems very manual, but is it the recommended way?
Some options that I can think of:
Drop any indexes that the delete does not need before deleting, and recreate them after the deletion. Keep the index on tenantId, since the delete uses it to find the documents.
Change the write concern to a lower value, possibly 0. With w: 0 the request does not wait for any acknowledgement, so errors will not be reported either.
db.records.deleteMany({tenantId: x}, {writeConcern: {w: 0}});
If there is another field with enough cardinality to reduce the number of documents per command, try including that in the query.
Ex: if anotherField has 0,1,2,3 as values, then execute the delete command 4 times, each time with a different value.
db.records.deleteMany({tenantId: x, anotherField: 0}, {writeConcern: {w: 0}});
db.records.deleteMany({tenantId: x, anotherField: 1}, {writeConcern: {w: 0}});
db.records.deleteMany({tenantId: x, anotherField: 2}, {writeConcern: {w: 0}});
db.records.deleteMany({tenantId: x, anotherField: 3}, {writeConcern: {w: 0}});
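The batching idea from the question could look roughly like this in PyMongo. This is only a sketch: delete_in_batches is a hypothetical helper, the batch size is arbitrary, and it assumes an acknowledged write concern so that deleted_count is available for progress reporting.

```python
def delete_in_batches(collection, query, batch_size=1000):
    """Delete documents matching `query` in batches, yielding a running total."""
    total = 0
    while True:
        # Fetch only the _id values of the next batch of matching documents.
        cursor = collection.find(query, {"_id": 1}).limit(batch_size)
        ids = [doc["_id"] for doc in cursor]
        if not ids:
            break
        result = collection.delete_many({"_id": {"$in": ids}})
        total += result.deleted_count
        yield total
```

Something like `for done in delete_in_batches(db.records, {"tenantId": x}): print(done)` would then report progress after every batch, at the cost of one extra query per batch.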
Performance may depend on a variety of factors, but here are some options you can try to improve it.
Bulk operations
Bulk operations might help here. bulk.find(query).remove() is a version of db.collection.remove(query) that is optimized for large numbers of operations. You can read more about it here
You can use it as follows:
Declare a search query:
var query= {tenantId: x};
Initialize and use a bulk:
var bulk = db.yourCollection.initializeUnorderedBulkOp()
bulk.find(query).remove() // or try delete() instead of remove()
bulk.execute()
The idea here is not so much to speed up the removal as to produce less load.
You could also try bulkWrite():
db.yourCollection.bulkWrite([
  { deleteMany: { "filter": query } }
])
TTL indexes
This may not suit your use case, but there is an entirely different approach that does not require you to remove the data yourself.
If it is acceptable to delete data based on a timestamp, then a TTL index might help you. The idea is that a record is removed automatically once its TTL expires.
Implemented as a special index type, TTL collections make it possible
to store data in MongoDB and have the mongod automatically remove data
after a specified period of time.
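A minimal PyMongo sketch of creating such an index; the helper name ensure_ttl_index, the field name createdAt, and the 30-day lifetime are assumptions for illustration, and the field must hold a BSON date for documents to expire.

```python
def ensure_ttl_index(collection, field="createdAt", ttl_seconds=30 * 24 * 3600):
    """Create a TTL index so documents expire ttl_seconds after the date in `field`."""
    # expireAfterSeconds is what turns an ordinary single-field index into a TTL index.
    return collection.create_index(field, expireAfterSeconds=ttl_seconds)
```

Expired documents are removed by a background task that runs periodically, so deletion is not instantaneous once the lifetime has passed.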
Use deleteMany. I think there must be something in common between all the rows that you want to remove from the collection.
Find out what that is and then create a query accordingly.
This will help you remove those records fast.
Let me give you one example. I want to remove all the records where username does not exist.
db.collection.deleteMany({ username: {$exists: false} })
The best place to start is to find something that all the records have in common, in order to remove them all at once.
For example, the following code deletes all entries that don't contain an email address.
db.users.deleteMany({ email: { $exists: false } })
The MongoDB documentation has great examples. Link provided below.
https://www.mongodb.com/docs/manual/reference/method/db.collection.deleteMany/#delete-multiple-documents
You might also want to consider dropping the index, since it can be recreated after you're done with the operation.
Finally, you might want to lower the write concern of your operation in order to speed things up. A complete list of options can be found here
https://www.mongodb.com/docs/v5.0/reference/write-concern/#w-option
I found a good tutorial on https://www.geeksforgeeks.org/mongodb-delete-multiple-documents-using-mongoshell/ that might help you further.
Apologies for any grammatical mistakes, since English is not my native tongue.
I would suggest two solutions. Also, please export your data first so that you have a backup if anything goes wrong, or try this in your test DB first.
1) You can use tenantId as the condition instead of matching on _id, with extra logic: if a record has a tenantId field at all, delete it. This way all of your tenant data is removed with a single query. Note that this removes the data of every tenant, not just one.
db.records.deleteMany({tenantId : {$exists: true}})
// Suggestion: if some of your tenant data has a tenantId field that is null, you can also check for a null value to delete those records.
2) Find data that is common to all of the records and, if there is any, use it as the condition to delete them.
For example, if all of your tenant data has a common field called type with the same value, use a delete statement like:
db.records.deleteMany({type : 1})

How to use mongodb change stream instead of periodic query?

I want to calculate the number of documents in my collection that satisfy a query, without polling the collection. How can I do this with a MongoDB change stream?
For example, there are documents in the database that all have some property: {"destination": "Target1"}, and I want to know the number of documents that satisfy this requirement.
I don't want to run a query on every change of the collection, because the documents change very often.
I am looking for something similar to Oracle's CQN.
You can use a change stream and watch changes as follows:
watchCursor = db.getSiblingDB("mydatabase").mycollection.watch()
while (!watchCursor.isExhausted()) {
  if (watchCursor.hasNext()) {
    printjson(watchCursor.next());
  }
}
changeStream docs
But perhaps you could simply run a query backed by a good index?
It seems you can just execute:
db.collection.count({destination:"Target1"})
and if you have an index on the "destination" field it will be pretty quick.
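If a change stream is still preferred, one possible way to keep a running count in PyMongo is to track the _id values of matching documents. This is a sketch under several assumptions: apply_change and watch_count are hypothetical helpers, matches is a plain Python predicate mirroring the query, the _id values are hashable, and the set of ids fits in memory.

```python
def apply_change(matching_ids, event, matches):
    """Update the set of matching _ids from one change event; return the new count."""
    if "documentKey" not in event:
        return len(matching_ids)  # e.g. drop or invalidate events carry no document key
    doc_id = event["documentKey"]["_id"]
    doc = event.get("fullDocument")
    if event["operationType"] != "delete" and doc is not None and matches(doc):
        matching_ids.add(doc_id)
    else:
        matching_ids.discard(doc_id)
    return len(matching_ids)


def watch_count(collection, query, matches):
    """Print the number of matching documents whenever the collection changes."""
    # full_document="updateLookup" makes update events include the current document.
    with collection.watch(full_document="updateLookup") as stream:
        # Seed the set after opening the stream so that later changes are not missed.
        matching_ids = {doc["_id"] for doc in collection.find(query, {"_id": 1})}
        for event in stream:
            print(apply_change(matching_ids, event, matches))
```

For the example in the question, matches could be `lambda d: d.get("destination") == "Target1"`. Whether this beats an indexed count depends on how many documents match and how often the count is read.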

Performance degrade with Mongo when using bulkwrite with upsert

I am using Mongo Java driver 3.11.1 and MongoDB version 4.2.0 for my development. I am still learning Mongo. My application receives data and has to either insert a new document or replace the existing one, i.e. do an upsert.
Each document is currently 780-1000 bytes and each collection can have more than 3 million records.
Approach 1: I tried using findOneAndReplace for each document, and it took more than 15 minutes to save the data.
Approach 2: I changed it to bulkWrite as shown below, which took ~6-7 minutes to save 20000 records.
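List<Data> dataList;
// Assumed declarations; the original post does not show them.
List<WriteModel<Document>> updates = new ArrayList<>();
ReplaceOptions updateOptions = new ReplaceOptions().upsert(true);
dataList.forEach(data -> {
    Document updatedDocument = new Document(data.getFields());
    updates.add(new ReplaceOneModel<>(eq("DataId", data.getId()), updatedDocument, updateOptions));
});
final BulkWriteResult bulkWriteResult = mongoCollection.bulkWrite(updates);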
Approach 3: I tried using collection.insertMany, which takes 2 seconds to store the data.
As per the driver code, insertMany internally uses MixedBulkWriteOperation to insert the data, similar to bulkWrite.
My queries are:
a) I have to do an upsert operation. Please let me know where I am making mistakes.
- Created an index on the DataId field, but it made less than 2 milliseconds of difference in performance.
- Tried using a writeConcern of W1, but performance is still the same.
b) Why is insertMany so much faster than bulkWrite? I could understand a difference of a few seconds, but I am unable to figure out the reason for 2-3 seconds with insertMany versus 5-7 minutes with bulkWrite.
c) Are there any approaches that can be used to solve this situation?
This problem was solved to a great extent by adding an index on the DataId field. I had created the index previously, but forgot to create it again after creating the collection.
The linked question "How to improve MongoDB insert performance" helped in resolving the problem.
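For comparison, the same fix sketched in PyMongo: make sure the index on DataId exists before the upserts run, so each ReplaceOne filter does not scan the collection. The helper names chunked and upsert_all and the chunk size of 1000 are illustrative assumptions, and documents is assumed to be a list of dicts that each contain a DataId key.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_all(collection, documents, chunk_size=1000):
    """Upsert documents keyed by DataId, making sure the index exists first."""
    # Imported here so chunked() can be used without PyMongo installed.
    from pymongo import ReplaceOne

    collection.create_index("DataId")  # no-op if an identical index already exists
    for chunk in chunked(documents, chunk_size):
        requests = [
            ReplaceOne({"DataId": doc["DataId"]}, doc, upsert=True) for doc in chunk
        ]
        collection.bulk_write(requests, ordered=False)
```

Calling create_index on every run is cheap when the index is already there, which guards against forgetting it after the collection is recreated.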

How to disable a MongoDB index while updating a large collection

My requirement is to update or add an array field in a large collection. I have an index on the field "Roles". Updating this collection takes around 3 minutes. Before creating the index on the "role" field it took less than 40 seconds to update/add fields in the collection. We need the index to read the collection, but it causes trouble during updates. Is it possible to disable an index during an update in MongoDB? Are there any functions available in Mongo for this? My MongoDB version is 2.6.5.
Please advise.
In MongoDB, indexes are updated synchronously with the insert/update. There is no way to pause index updates.
If your indexes are already created, then you have two options:
1) Drop the index and recreate it afterwards, which has the following impacts:
- Queries executed while the insert/update is happening will not be able to use the index.
- Rebuilding the index is expensive.
2) Wait for the index to be updated.
Queries will not use partially-built indexes: the index will only be usable once the index build is complete.
Source: http://docs.mongodb.org/manual/core/index-creation/
That means your index will block any query on the field/collection as long as the index is not complete. Therefore you have no choice but to wait for the index to be updated after adding new data.
Maybe try using another index.
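The first option (drop, update, recreate) could be wrapped roughly like this in PyMongo. index_dropped is a hypothetical helper, and keys is assumed to be the same index specification that was used to create the index, for example [("Roles", 1)].

```python
from contextlib import contextmanager


@contextmanager
def index_dropped(collection, keys):
    """Drop the index on `keys` for the duration of the block, then rebuild it."""
    collection.drop_index(keys)
    try:
        yield
    finally:
        # Rebuild even if the update fails, so reads get their index back.
        collection.create_index(keys)
```

The bulk update would then go inside `with index_dropped(db.collection, [("Roles", 1)]):`. Reads that rely on the index will be slow until the rebuild finishes, so this probably only makes sense in a maintenance window.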

MongoDB update is slower with relevant indexes set

I am testing a small example for a sharded setup, and I notice that updating an embedded field is slower when the search fields are indexed.
I know that indexes are updated during inserts, but are the indexes used by the query part of an update also updated?
The query for the update and the fields that are updated are not related in any manner.
e.g. (tested with toy data) :
{
id:... (sharded on the id)
embedded :[{ 'a':..,'b':...,'c':.... (indexed on a,b,c),
data:.... (data is what gets updated)
},
...
]
}
In the example above the query for the update is on a,b,c
and the values for the update affect only the data.
The only reason I can think of is that indexes are updated even if the updates are not on the indexed fields. The search part of the update does seem to use the indexes when I issue a "find" query with explain.
Could there be another reason?
I think wdberkeley, in the comments, gives the best explanation.
The document moves because it grows larger, and the indexes are updated every time that happens.
As he also notes, updating multiple keys is "bad", so I think I will avoid this design for now.