In my Postgres DB, I'm performing a deletion operation using another table, as below:
DELETE FROM user_records
USING to_delete_records
WHERE user_records.record_id = to_delete_records.record_id
The user_records table contains around 200 million records, while the to_delete_records table contains around 5-10 million records. Every day the to_delete_records table is updated with a new set of records, and I have to perform the above deletion operation. (Similar to the deletions, insertions of around 5-10 million records take place as well, so the total size of user_records stays around 200 million.)
Now I'm replacing the Postgres DB with MongoDB, and the following is the script I'm using to delete records from the user_records collection:
db.to_delete_records.find({}, {_id: 0}).forEach(function(doc){
    db.user_records.deleteOne({record_id: doc.record_id});
});
As this runs a loop and issues one delete per document, it seems inefficient.
Is there a better way to delete documents of a collection using another collection in Mongo?
If record_id is a unique field in both user_records and to_delete_records, you can build a unique index for the field for each collection if you have not done so.
db.user_records.createIndex({record_id: 1}, {unique:true});
db.to_delete_records.createIndex({record_id: 1}, {unique:true});
Afterwards, you can use a $merge stage to add an auxiliary field toDelete to the user_records collection, based on the content of to_delete_records:
db.to_delete_records.aggregate([
    {
        "$merge": {
            "into": "user_records",
            "on": "record_id",
            "whenMatched": [
                {
                    $set: {
                        "toDelete": true
                    }
                }
            ],
            // avoid inserting to_delete_records documents that have no match in user_records
            "whenNotMatched": "discard"
        }
    }
])
Finally, run a deleteMany on user_records:
db.user_records.deleteMany({toDelete: true});
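If you would rather not write an auxiliary field into user_records, a rough sketch of an alternative is to pull the record_id values in batches and delete with $in (the batch size of 100000 is an arbitrary assumption; scale it to your memory budget):
var batchSize = 100000;
var batch = [];
db.to_delete_records.find({}, { _id: 0, record_id: 1 }).forEach(function (doc) {
    batch.push(doc.record_id);
    if (batch.length === batchSize) {
        // delete one batch of matching user_records and start a new batch
        db.user_records.deleteMany({ record_id: { $in: batch } });
        batch = [];
    }
});
if (batch.length > 0) {
    db.user_records.deleteMany({ record_id: { $in: batch } });
}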
var product = db.GetCollection<Product>("Product");
var lookup1 = new BsonDocument(
"$lookup",
new BsonDocument {
{ "from", "Variant" },
{ "localField", "Maincode" },
{ "foreignField", "Maincode" },
{ "as", "variants" }
}
);
var pipeline = new[] { lookup1};
var result = product.Aggregate<Product>(pipeline).ToList();
The data of the collection is very large, so it takes me 30 seconds to put the data in the list.
What should I do to make a faster lookup?
What that query is doing is retrieving every document from the Product collection, and then for each document found, performing a find query in the Variant collection. If there is no index on the Maincode field in the Variant collection, it will be reading the entire collection for each document.
This means that if there are, say, 1000 total products, with 3000 total variants (3 per product, on average), this query will be reading all 1000 documents from Product, and if that index isn't there, it would read all 3000 documents from Variant 1000 times, i.e. it will be examining 3 million documents.
Some ways to possibly speed this up:
create an index on {Maincode:1} in the Variant collection (see the shell sketch after this list)
This will reduce the number of documents that must be read in order to complete the lookup
change the schema
If the variants are stored in the same document with the product, there is no need for a lookup
filter the products prior to lookup
Again, reducing the documents read during the lookup
use a cursor to retrieve the documents in batches
If you perform any necessary sorting first, and the lookup last, you can return the documents to the application in batches, which would allow the application to display or begin processing the first batch before the second batch is available. This doesn't make the query itself faster, but it can reduce the perceived wait in the application.
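As a rough sketch of the first and third suggestions in the mongo shell (the category condition in $match is a hypothetical example of a product filter, not a field from your schema):
// index the field used as foreignField so each lookup is an index scan
db.Variant.createIndex({ Maincode: 1 });

// filter Product documents before the lookup (the $match condition is hypothetical)
db.Product.aggregate([
    { $match: { category: "books" } },
    { $lookup: {
        from: "Variant",
        localField: "Maincode",
        foreignField: "Maincode",
        as: "variants"
    } }
]);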
I have a simple query that I want to run on two different collections (each collection has around 50K records):
db.collectionA.updateMany({}, { $mul: { score: 0.3 } });
db.collectionB.updateMany({}, { $mul: { score: 0.8 } });
So basically I want to multiply the field score by a certain amount. On collectionA it takes 2-3s, and on collectionB it takes ~50s. I noticed that I don't have any index on the field score on collectionA (it is a dynamic field, updated very often), and for some reason I have a compound index containing this field on collectionB.
My question is simple: why does it take so much time for Mongo to execute this query, and what can I do to make it faster?
Thanks
I think you have answered the question yourself: there is an index on that field in the second collection, and when the value changes in a document, that index needs to be updated as well. You could try to drop that index and re-create it after the update has finished.
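A rough sketch of that approach in the mongo shell; the index name and key pattern below are hypothetical, so check the real ones with getIndexes() first:
// list the existing indexes to find the one containing score
db.collectionB.getIndexes();

// drop the compound index (name is hypothetical)
db.collectionB.dropIndex("score_1_otherField_1");

// run the bulk update without the index maintenance overhead
db.collectionB.updateMany({}, { $mul: { score: 0.8 } });

// rebuild the index afterwards (key pattern is hypothetical)
db.collectionB.createIndex({ score: 1, otherField: 1 });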
Hello,
I have a mongodb collection with around 20k documents. The deleted documents are stored in the collection as well, meaning they are "soft" deleted.
Now, I want to query the documents based on the "status" key. The status can be either "open", "closed" or "deleted". I need the records with status "closed".
I see that the document count fulfilling my criteria is 25 only. However, the documents scanned (after applying index) are 18k.
Hence, my query takes around 2 minutes to execute, and it often times out.
My first questions is:
1. Should a query executing on 20k documents take so much time? 20k isn't such a huge count, right?
2. Can someone guide me with optimizing the query further, if it can be?
Pushing the deleted records into a separate archive collection is the last thing I would like to do.
Here is my current query:
db.collectionname.find({ $and: [
    { $and: [ { status: { $ne: 'open' } }, { status: { $ne: 'deleted' } } ] },
    { 'submittedDate': { $gte: new Date("2019-02-01T00:00:00.000Z"), $lte: new Date("2019-02-02T00:00:00.000Z") } }
] })
You presumably have a single-field index {status: 1}. Replacing it with a compound index {status: 1, submittedDate: 1} will improve your performance. 20k is nothing if properly indexed. If you have only 3 status values, rewrite your query as follows:
db.collectionname.find({status: 'closed',
'submittedDate': { $gte: new Date("2019-02-01T00:00:00.000Z"),
$lte: new Date("2019-02-02T00:00:00.000Z") }})
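A sketch of that index change; the old index name status_1 is the default name and an assumption, so confirm it with getIndexes() first:
// replace the single-field index with a compound index covering both
// the equality on status and the range on submittedDate
db.collectionname.dropIndex("status_1");
db.collectionname.createIndex({ status: 1, submittedDate: 1 });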
I am using the below query on my MongoDB collection which is taking more than an hour to complete.
db.collection.find({language:"hi"}).sort({_id:-1}).skip(5000).limit(1)
I am trying to get the results in batches of 5000 to process, in either ascending or descending order, for documents with "hi" as the value of the language field. So I am using this query, in which I skip the already processed documents every time by incrementing the "skip" value.
The document count in this collection is just above 20 million.
An index on the field "language" is already created.
The MongoDB version I am using is 2.6.7.
Is there a more appropriate index for this query which can get the result faster?
When you want to sort descending, you should create a multi-field index which uses the field(s) you sort on as descending field(s). You do that by setting those field(s) to -1.
This index should greatly increase the performance of your sort:
db.collection.ensureIndex({ language: 1, _id: -1 });
When you also want to speed up the other case - retrieving sorted in ascending order - create a second index like this:
db.collection.ensureIndex({ language: 1, _id: 1 });
Keep in mind that when you do not sort your results, you receive them in natural order. Natural order is often insertion order, but there is no guarantee for that. There are various events which can cause the natural order to get messed up, so when you care about the order you should always sort explicitly. The only exception to this rule is capped collections, which always maintain insertion order.
In order to efficiently "page" through results in the way that you want, it is better to use a "range query" and keep the last value you processed.
Your desired "sort key" here is _id, so that makes things simple:
First you want your index in the correct order, which is done with .createIndex() (not the deprecated .ensureIndex()):
db.collection.createIndex({ "language": 1, "_id": -1 })
Then you want to do some simple processing, from the start:
var lastId = null;
var cursor = db.collection.find({language:"hi"});
cursor.sort({_id: -1}).limit(5000).forEach(function(doc) {
    // do something with your document. But always set the next line
    lastId = doc._id;
});
That's the first batch. Now when you move on to the next one:
var cursor = db.collection.find({ "language":"hi", "_id": { "$lt": lastId });
cursor.sort({_id:-1}).limit(5000).forEach(funtion(doc) {
// do something with your document. But always set the next line
lastId = doc._id;
})
That way the lastId value is always considered when making the selection. You store it between batches, and continue on from the last one.
That is much more efficient than processing with .skip(), which regardless of the index will "still" need to "skip" through all data in the collection up to the skip point.
Using the $lt operator here "filters" all the results you already processed, so you can move along much more quickly.
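Putting those two pieces together, a minimal sketch of the full batch loop (the per-document work is left as a placeholder comment):
var lastId = null;
while (true) {
    var filter = { language: "hi" };
    if (lastId !== null) {
        filter._id = { $lt: lastId };
    }
    var docs = db.collection.find(filter).sort({ _id: -1 }).limit(5000).toArray();
    if (docs.length === 0) {
        break;   // nothing left to process
    }
    docs.forEach(function (doc) {
        // do something with your document
        lastId = doc._id;
    });
}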
I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because they matter. Here is the complete Python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. query the original collection by the entry key, order it by test_date ascending,
        #     limit to the group size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
MR the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most N newest records. So, the last two steps are foreach-found-remove, and this is the important detail of my question that changes everything; I should have been more specific about it - sorry.
Now, about the collection remove command: it does accept a query, but mine includes sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to mess up Mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, that the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.
You can use a query to remove all matching documents:
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say, you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than 1 query that deletes all 100k documents.
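A rough sketch of that chunked deletion in the mongo shell, assuming _id-based batching and an arbitrary batch size of 1000:
var query = { name: 'John' };
while (true) {
    // collect the _ids of the next batch of matching documents
    var ids = db.collection.find(query, { _id: 1 }).limit(1000).toArray().map(function (d) { return d._id; });
    if (ids.length === 0) {
        break;   // nothing left to delete
    }
    db.collection.remove({ _id: { $in: ids } });
}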
You can remove it directly using the MongoDB shell:
db.mycoll.remove({_id:'your_id_here'});
Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m-document collection. Documentation: https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany
db.collection.deleteMany(
    <filter>,
    {
        writeConcern: <document>,
        collation: <document>
    }
)
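For instance, reusing the example filter from the first answer above, a call might look like this:
var result = db.mycoll.deleteMany({ name: 'John' });
printjson(result);   // { "acknowledged" : true, "deletedCount" : <n> }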
I would recommend paging if there is a large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query= {"FEILD":"XYZ", 'DATE': {$lt:new ISODate("2019-11-10")}};
db.COL.aggregate([
{$match:query},
{$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query= {"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
{$match:query},
{ $limit : 5 }
])
cursor.forEach(function (doc){
db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query={"FEILD":"XYZ", 'date': {$lt:new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.tags.deleteMany({"_id": { "$in": ids.map(r => r._id)}});
Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});
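Note that remove() is deprecated in newer Mongoose releases; a sketch using deleteOne instead (assuming User is a Mongoose model):
// deleteOne returns a thenable query, so no callback is needed
User.deleteOne({ _id: req.body.id })
    .then(function () { /* document deleted */ })
    .catch(function (err) { /* handle the error */ });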