MongoDB takes an hour to delete_many() 1GB of data

We have a 3GB collection in MongoDB 4.2, and this Python 3.7 / PyMongo 3.12 function deletes rows from the collection:
from pymongo import MongoClient

def delete_from_mongo_collection(table_name):
    # connect to mongo cluster
    cluster = MongoClient(MONGO_URI)
    db = cluster["cbbap"]
    # remove rows and return
    query = {'competitionId': {'$in': [30629, 30630]}}
    db[table_name].delete_many(query)
    return
Here is the relevant info on this collection. Note that it has 360MB worth of indexes, which are there to speed up retrieval of data from this collection by our Node API, although they may be the problem here.
The delete_many() is part of a pattern where we (a) remove stale data and (b) upload fresh data, each day. However, given that it is taking over an hour to remove the rows that match the query { 'competitionId': { '$in': [30629, 30630] } }, we'd be better off just dropping and re-inserting the entire table. What's frustrating is that competitionId is indexed, and it is the first field in our compound indexes, so I thought it would be very fast to drop rows using it. I wonder if having 360MB of indexes is responsible for the slow deletes?
We cannot use the hint parameter because we are on MongoDB 4.2, not 4.4, and we do not want to upgrade to 4.4 yet as we are worried about major breaking changes in our pipelines and our Node API.
What else can be done here to improve the performance of delete_many()?
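Two things commonly worth checking in this situation are whether the delete's filter actually uses the competitionId index, and whether deleting in smaller batches helps, since each batch then does a bounded amount of index maintenance and produces smaller oplog entries. A minimal pymongo sketch, reusing the MONGO_URI, database name, and query from above (the collection name is a placeholder):

from pymongo import MongoClient

cluster = MongoClient(MONGO_URI)
coll = cluster["cbbap"]["some_table"]  # placeholder collection name

query = {'competitionId': {'$in': [30629, 30630]}}

# An equivalent find() can be explained on 4.2 even though delete hints cannot:
# check that the winning plan is an IXSCAN on a competitionId-prefixed index.
print(coll.find(query).explain()["queryPlanner"]["winningPlan"])

# If the plan looks right but a single delete_many() is still too slow, delete
# in chunks of _ids so each batch touches a limited amount of index and oplog.
while True:
    ids = [d["_id"] for d in coll.find(query, {"_id": 1}).limit(10000)]
    if not ids:
        break
    coll.delete_many({"_id": {"$in": ids}})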

Related

How to avoid mongo from returning duplicated documents by iterating a cursor in a constantly updated big collection?

Context
I have a big collection with millions of documents that is constantly updated by production workload. When performing a query, I have noticed that a document can be returned multiple times; my workload migrates the documents to a SQL system that enforces unique row ids, hence it crashes.
Problem
Because the collection is so big and lots of users are updating it after the query is started, iterating over the cursor's result may give me documents with the same id (old and updated version).
What I've tried
const cursor = db.collection.find(query, { snapshot: true });
while (cursor.hasNext()) {
    const doc = cursor.next();
    // do some stuff
}
Based on old documentation for the mongo driver (I'm using Node.js, but this is applicable to any official MongoDB driver), there is an option called snapshot which is said to avoid what is happening to me. Sadly, the driver returns an error indicating that this option does not exist (it was deprecated).
Question
Is there a way to iterate through the documents of a collection in a safe fashion, so that I don't get the same document twice?
I only see a viable option with aggregation pipeline, but I want to explore other options with standard queries.
Finally, I got the answer from a MongoDB changelog page:
MongoDB 3.6.1 deprecates the snapshot query option.
For MMAPv1, use hint() on the { _id: 1} index instead to prevent a cursor from returning a document more than once if an intervening write operation results in a move of the document.
For other storage engines, use hint() with { $natural : 1 } instead.
So, from my code example:
const cursor = db.collection.find(query).hint({ $natural: 1 });
while (cursor.hasNext()) {
    const doc = cursor.next();
    // do some stuff
}
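For what it's worth, the same workaround can be expressed from pymongo as well; a small sketch, with the database, collection, and filter names assumed:

from pymongo import MongoClient

coll = MongoClient(MONGO_URI)["mydb"]["mycollection"]  # assumed names
query = {}  # whatever filter the migration uses

# Hinting the natural order keeps the cursor from returning a document twice
# when a concurrent update rewrites it (the documented replacement for snapshot).
for doc in coll.find(query).hint([("$natural", 1)]):
    pass  # do some stuff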

Performance degradation with Mongo when using bulkWrite with upsert

I am using Mongo Java driver 3.11.1 and Mongo version 4.2.0 for my development. I am still learning Mongo. My application receives data and has to either insert a new document or replace an existing one, i.e. do an upsert.
Each document is 780-1000 bytes as of now, and each collection can have more than 3 million records.
Approach 1: I tried using findOneAndReplace for each document, and it was taking more than 15 minutes to save the data.
Approach 2: I changed it to bulkWrite using the code below, which resulted in ~6-7 minutes for saving 20,000 records.
List<Data> dataList;                                            // the incoming records
List<WriteModel<Document>> updates = new ArrayList<>();
ReplaceOptions updateOptions = new ReplaceOptions().upsert(true);
dataList.forEach(data -> {
    Document updatedDocument = new Document(data.getFields());
    updates.add(new ReplaceOneModel<>(eq("DataId", data.getId()), updatedDocument, updateOptions));
});
final BulkWriteResult bulkWriteResult = mongoCollection.bulkWrite(updates);
Approach 3: I tried using collection.insertMany, which takes 2 seconds to store the data.
As per the driver code, insertMany also internally uses MixedBulkWriteOperation to insert the data, similar to bulkWrite.
My queries are:
a) I have to do an upsert operation. Please let me know whether I am making any mistakes.
- Created an index on the DataId field, but it made less than a 2 millisecond difference in performance.
- Tried using a writeConcern of W1, but performance is still the same.
b) Why is insertMany faster than bulkWrite? I could understand a difference of a few seconds, but I am unable to figure out why insertMany takes 2-3 seconds while bulkWrite takes 5-7 minutes.
c) Are there any approaches that can be used to solve this situation?
This problem was solved to a great extent by adding an index on the DataId field. I had previously created an index on DataId, but forgot to recreate it after recreating the collection.
This link, How to improve MongoDB insert performance, helped in resolving the problem.
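To make the shape of that fix concrete, here is a rough pymongo sketch of the same index-plus-upsert pattern rather than the Java driver (database and collection names are placeholders; DataId is taken from the question):

from pymongo import MongoClient, ReplaceOne

coll = MongoClient(MONGO_URI)["mydb"]["mycollection"]  # placeholder names

# Without an index on DataId, every ReplaceOne filter in the bulk is a full
# collection scan, which is where most of the 5-7 minutes goes.
coll.create_index([("DataId", 1)])

requests = [
    ReplaceOne({"DataId": doc["DataId"]}, doc, upsert=True)
    for doc in incoming_docs  # incoming_docs: the batch to save (assumed)
]
result = coll.bulk_write(requests, ordered=False)
print(result.upserted_count, result.modified_count)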

Parse-Server source _rperm and _wperm mongo queries, and should I index them?

I'm going through my mongo logs to add indexes for my unindexed queries. By default, mongo only logs queries that take over 100ms to complete.
I've found that I have several slow queries on the _wperm and _rperm keys. I see that this is how the ACL gets broken down. But what type of Parse.Query call might create a query like this in the logs?
query: { orderby: {}, $query: { _rperm: { $in: [ null, "*", "[UserId]" ] } } }
I'm even noticing that this query is on a class that has only 8 total objects, yet it is taking 133ms to complete, which seems really slow for such a small class, even if it had to do an in-memory sort and scan.
Should I solve this at the code level, modifying my query to avoid this type of mongo query? Or should I add an index for these types of queries?
I notice I also have a few that are showing up in the Slow Queries tab on mLab. The query looks like {"_id":"<val>","_wperm":{"$in":["<vals>"]}}, with the suggested index {"_id": 1, "_wperm": 1}, but it has the following note:
"_id" is in the existing {"_id": 1} unique index. The following index recommendation should only be necessary in certain circumstances.
Yet this is one of my slower queries, taking 320ms to complete. It's on the _User class. Is that just because the _User class has a lot of rows? Since _id is unique, I feel like adding a _wperm index shouldn't really make a difference, since I end up with only a single object.
I'm curious if I will see a benefit from taking action on these queries or if I should safely ignore them.
You should index your collections following the MongoDB recommendations. In the good old parse.com days, those indexes were created automatically based on the observed workloads; now you need to create them yourself. Both make sense. The _rperm key is hit by every query run without a masterKey, and _wperm by every write. In the future, we could automatically create the _id + _wperm index, as all writes use this index.
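For example, creating those indexes by hand could look something like this pymongo sketch (the database name and the non-default class name are placeholders):

from pymongo import MongoClient

db = MongoClient(MONGO_URI)["parse"]  # placeholder database name

# _rperm is checked by every query that runs without the masterKey, so give
# read-heavy classes an index that covers it (a single-field index here;
# a compound index leading with your usual filter keys also works).
db["MyClass"].create_index([("_rperm", 1)])  # "MyClass" is a placeholder

# The index mLab suggested for slow writes on _User; as the note above says,
# it should only be necessary in certain circumstances.
db["_User"].create_index([("_id", 1), ("_wperm", 1)])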

MongoDB - how to get fields fill-rates as quickly as possible?

We have a very big MongoDB collection of documents with some pre-defined fields that can either have a value or not.
We need to gather the fill rates of those fields. We wrote a script that goes over all documents and counts the fill rate of each field, but the problem is that it takes a long time to process all the documents.
Is there a way to use db.collection.aggregate or db.collection.mapReduce to run such a script server-side?
Should it have significant performance improvements?
Will it slow down other usages of that collection (e.g. holding a major lock)?
Answering my own question: I was able to migrate my script, which used a cursor to scan the whole collection, to a map-reduce query, and running it on a sample of the collection it seems to be at least twice as fast using map-reduce.
Here's how the old script worked (in node.js):
var cursor = collection.find(query, projection).sort({_id: 1}).limit(limit);
var next = function() {
    cursor.nextObject(function(err, doc) {
        processDoc(doc, next);
    });
};
next();
and this is the new script:
collection.mapReduce(
    function () {
        var processDoc = function(doc) {
            ...
        };
        processDoc(this);
    },
    function (key, values) {
        return Array.sum(values);
    },
    {
        query: query,
        out: {inline: 1}
    },
    function (error, results) {
        // print results
    }
);
processDoc stayed basically the same, but instead of incrementing a counter on a global stats object, I do:
emit(field_name, 1);
Running the old and the new script on a sample of 100k documents, the old one took 20 seconds and the new one took 8.
Some notes:
map-reduce's limit option doesn't work on sharded collections; I had to query for _id : { $gte, $lte } to create the sample size needed.
map-reduce's performance-boost option jsMode: true doesn't work on sharded collections either (it might have improved performance even more); it might work to run the map-reduce manually on each shard to gain that feature.
As I understand it, what you want to achieve is to compute something over your documents, after which you have a new "document" that can be queried. You don't need to store the computed "new values".
If you don't need to write those "new values" into the documents, you can use the Aggregation Framework.
Aggregation operations process data records and return computed results. Aggregation operations group values from multiple documents together, and can perform a variety of operations on the grouped data to return a single result.
https://docs.mongodb.com/manual/aggregation/
Since the Aggregation Framework has a lot of features, I can't give you more information about how to resolve your issue here.
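For illustration, a fill-rate computation along those lines could look roughly like the pymongo sketch below (collection and field names are placeholders; it assumes MongoDB 3.4+ for the $type expression, and a field present with a null value counts as filled under this definition):

from pymongo import MongoClient

coll = MongoClient(MONGO_URI)["mydb"]["mycollection"]  # placeholder names
fields = ["field_a", "field_b", "field_c"]             # pre-defined fields (placeholders)

# One $group pass: count all documents, and for each field add 1 whenever the
# field exists in the document ($type returns "missing" for absent fields).
group = {"_id": None, "total": {"$sum": 1}}
for f in fields:
    group[f] = {"$sum": {"$cond": [{"$eq": [{"$type": "$" + f}, "missing"]}, 0, 1]}}

result = next(coll.aggregate([{"$group": group}]))
fill_rates = {f: result[f] / result["total"] for f in fields}
print(fill_rates)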

MongoDB update is slower with relevant indexes set

I am testing a small example for a sharded setup and I notice that updating an embedded field is slower when the search fields are indexed.
I know that indexes are updated during inserts, but are the indexes used to find the documents for the update also updated?
The query for the update and the fields that are updated are not related in any manner.
e.g. (tested with toy data):
{
    id: ...                              (sharded on the id)
    embedded: [
        {
            'a': ..., 'b': ..., 'c': ... (indexed on a, b, c)
            data: ...                    (data is what gets updated)
        },
        ...
    ]
}
In the example above, the query for the update is on a, b, c, and the values for the update affect only the data field.
The only reason I can think of is that the indexes are updated even if the updates do not touch the indexed fields. The search part of the update does seem to use the indexes when I issue the equivalent "find" query with explain.
Could there be another reason?
I think wdberkeley, in the comments, gives the best explanation.
The document moves because it grows larger, and the indexes are updated every time.
As he also notes, updating multiple keys is "bad"... I think I will avoid this design for now.