MongoDB Batch update performance problems - mongodb

I understand that MongoDB supports batch inserts, but not batch updates.
Batch-inserting thousands of documents is pretty fast, but updating thousands of documents is incredibly slow. So slow that the workaround I'm using now is to remove the old documents and batch-insert the updated ones. I had to add flags to mark documents as invalid, plus all the machinery needed to compensate for a failed mock 'bulk update'. :(
I know this is an awful and unsafe solution, but it's the only way I've been able to reach the required performance.
If you know a better way, please help me.
Thanks!

As long as you're using MongoDB v2.6 or higher, you can use bulk operations to perform updates as well.
Example from the docs:
var bulk = db.items.initializeUnorderedBulkOp();
bulk.find( { status: "D" } ).update( { $set: { status: "I", points: "0" } } );
bulk.find( { item: null } ).update( { $set: { item: "TBD" } } );
bulk.execute();
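For reference, on MongoDB 3.2 or newer the same batching is also exposed through db.collection.bulkWrite(); a minimal sketch of the equivalent updates on the same items collection:
db.items.bulkWrite([
  { updateMany: { filter: { status: "D" }, update: { $set: { status: "I", points: "0" } } } },
  { updateMany: { filter: { item: null }, update: { $set: { item: "TBD" } } } }
], { ordered: false });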

I had a similar situation. After some trial and error I created an index (in MongoDB directly or through Mongoose), and now updating thousands of documents is pretty fast using bulk operations, i.e. bulk.find({}).upsert().update({}).
Example:
var bulk = items.collection.initializeOrderedBulkOp();
bulk.find({fieldname: value, active: false}).upsert().updateOne({
  $set: updatejsondata,
  $setOnInsert: createjsondata
});
Note: if you need to store values in an array, include a $push operator alongside $set.
example:
bulk.find({name: value, active: false}).upsert().updateOne({
  $set: updatejsondata,
  $push: {logdata: filename + " - " + new Date()},
  $setOnInsert: createjsondata
});
Creating the index: in the above case, on the Items collection you need to create an index on the search fields, i.e. name and active.
Example:
Mongo Command Line:
db.items.createIndex({name: 1, active: 1}, {unique: false})
Mongoose Schema:
ItemSchema.index({name: 1, active: 1}, {name: "itemnameactive"});
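Note that the batch is not sent to the server until execute() is called. A rough end-to-end sketch under the same assumptions (Item is a hypothetical Mongoose model; value, updatejsondata and createjsondata are placeholders from the snippets above):
var bulk = Item.collection.initializeOrderedBulkOp();
bulk.find({name: value, active: false}).upsert().updateOne({
  $set: updatejsondata,
  $setOnInsert: createjsondata
});
bulk.execute(function (err, result) {
  if (err) return console.error(err);
  console.log("matched:", result.nMatched, "upserted:", result.nUpserted);
});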
Hope this helps you with bulk operations.

Related

performance issue for updating mongo documents

I have a mongo collection with about 4M records. Two fields in its documents are dates stored as strings, and I need to change them to ISODate, so I wrote this small script to do that:
db.vendors.find().forEach(function(el){
  el.lastEdited = new Date(el.lastEdited);
  el.creationDate = new Date(el.creationDate);
  db.vendors.save(el);
})
but it takes forever... I added indexes on those fields; is there some other way to do this that will complete in a reasonable time?
Currently you do a find against the whole collection, which can take a while, and each save call requires a network round trip from the client (the shell) to the server.
EDIT: Removed suggestion to use $match and do this in batches, because $out in fact replaces the collection on each run as noted in #Stennie's comment.
(Needless to say, test this on a sample dataset in a test environment first. I haven't tested the behavior of new Date() here, as I don't know your data format.)
db.vendors.aggregate([
  {
    $project: {
      _id: 1,
      lastEdited: { $add: [new Date('$lastEdited')] },
      creationDate: { $add: [new Date('$creationDate')] },
      field1: 1,
      field2: 1,
      // .. (important to repeat all necessary fields)
    },
  },
  {
    $out: 'vendors',
  },
]);
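One caveat with the pipeline above: new Date('$lastEdited') is evaluated by the shell before the pipeline is sent to the server, so it never sees the actual field value (on MongoDB 4.0+ the server-side $toDate operator is the cleaner way to convert). An alternative that keeps the conversion in the shell but batches the writes, avoiding one round trip per document, is bulkWrite (MongoDB 3.2+); a rough, untested sketch:
var ops = [];
db.vendors.find({}, { lastEdited: 1, creationDate: 1 }).forEach(function (el) {
  ops.push({
    updateOne: {
      filter: { _id: el._id },
      update: { $set: {
        lastEdited: new Date(el.lastEdited),
        creationDate: new Date(el.creationDate)
      } }
    }
  });
  if (ops.length === 1000) {                      // flush in batches of 1000 updates
    db.vendors.bulkWrite(ops, { ordered: false });
    ops = [];
  }
});
if (ops.length) db.vendors.bulkWrite(ops, { ordered: false });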

Is there any way I can know the creation date of an index for MongoDB tables (collections)?

Is there any way I can know the creation date of an index for MongoDB tables (collections)? Recently we saw some indexes which led to problems of space and performance, and we wonder if we can get the creation timestamp of those indexes. I'm not sure if there is a way to do that in the most recent version of MongoDB, and if not, is there any workaround? Thanks a lot.
Check the oplog.rs collection in the local database.
db.oplog.rs.find({ "o.createIndexes": { "$exists": true } }).sort({ts: -1})
A sample response (when an index named unique_email was created on the users collection in the test database):
{ op: 'c',
ns: 'test.$cmd',
ui: UUID("47041281-c7d2-4d28-90ef-2baa49c6c31f"),
o:
{ createIndexes: 'users',
v: 2,
unique: true,
key: { email: 1 },
name: 'unique_email',
background: false },
ts: Timestamp({ t: 1659682522, i: 2 }),
t: 62,
v: 2,
wall: 2022-08-05T06:55:22.326Z }
CAVEATS
oplog retention is limited both in terms of size of the oplog collection and how long ago the operation was performed. You can read more about these parameters in the official documentation.
All write operations are recorded in the oplog, including index creation, but you are more likely to find the oplog entry if the index was created recently.
Therefore, this method will only work in some cases.
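To see how far back your oplog currently reaches, you can check the replication info from the shell:
rs.printReplicationInfo()   // prints the configured oplog size and the time range it currently covers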
db.t1.aggregate( [ { $indexStats: { } } ] ), and check the accesses.since field in the output; note that this only reflects when statistics collection for the index began (it resets on server restart), so it is at best an approximation of the creation date.

Meteor collection get last document of each selection

Currently I use the following find query to get the latest document of a certain ID
Conditions.find({
caveId: caveId
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
How can I do the same for multiple ids, for example with $in?
I tried the following query. The problem is that it limits the result to one document across all the matched caveIds, but it should apply the limit per caveId.
Conditions.find({
caveId: {$in: caveIds}
},
{
sort: {diveDate:-1},
limit: 1,
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
One solution I came up with is using the aggregate functionality.
var conditionIds = Conditions.aggregate(
  [
    {"$match": { caveId: {"$in": caveIds}}},
    {"$sort": { diveDate: 1 }}, // $last only makes sense on sorted input
    {
      $group:
      {
        _id: "$caveId",
        conditionId: {$last: "$_id"},
        diveDate: { $last: "$diveDate" }
      }
    }
  ]
).map(function(child) { return child.conditionId});
var conditions = Conditions.find({
_id: {$in: conditionIds}
},
{
fields: {caveId: 1, "visibility.visibility":1, diveDate: 1}
});
You don't want to use $in here as noted. You could solve this problem by looping through the caveIds and running the query on each caveId individually.
You're basically looking at a join query here: you need all caveIds and then a lookup of the last dive for each.
In my opinion this is a question of database schema/denormalization (but this is only an opinion!):
You could, as mentioned here, look up all caveIds and then run a single query for each, every time you need to look up the last dives.
However, I think you are much better off recording/updating the last dive inside your cave document, and then looking up all the caveIds of interest while pulling only the lastDive field.
That will immediately give you what you need, rather than going through expensive search/sort queries. This is at the expense of maintaining that field in the document, but it sounds like it should be fairly trivial, as you only need to update the one field when a new event occurs.
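A minimal sketch of that denormalized approach (Caves, lastDive, and newCondition are assumed/hypothetical names):
// When a new condition/dive is recorded, also update the corresponding cave document:
var conditionId = Conditions.insert(newCondition);
Caves.update(
  { _id: newCondition.caveId },
  { $set: { lastDive: { conditionId: conditionId, diveDate: newCondition.diveDate } } }
);

// Fetching the latest dive for many caves then becomes a single indexed query:
Caves.find({ _id: { $in: caveIds } }, { fields: { lastDive: 1 } });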

How to remove duplicates based on a key in Mongodb?

I have a collection in MongoDB with around ~3 million records. A sample record looks like:
{
  "_id" : ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
  "source_references" : [
    {
      "_id" : ObjectId("5045xxxxxxxxxxxxxx"),
      "name" : "xxx",
      "key" : 123
    }
  ]
}
I have a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean the same source_references.key, not the same _id.)
I want to remove duplicate records based on source_references.key; I'm thinking of writing some PHP code to traverse each record and remove it if a duplicate exists.
Is there a way to remove the duplicates directly from the Mongo command line?
This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested on: MongoDB duplicate documents even after adding unique key.
If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
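For MongoDB 3.0+ (where dropDups is no longer available), a rough sketch of the aggregation-based approach, keeping one arbitrary document per source_references.key and deleting the rest (test on a copy first; assumes MongoDB 3.2+ for deleteMany):
db.things.aggregate([
  { $unwind: "$source_references" },
  { $group: { _id: "$source_references.key", ids: { $addToSet: "$_id" } } },
  { $match: { "ids.1": { $exists: true } } }         // only keys that appear in more than one document
], { allowDiskUse: true }).forEach(function (group) {
  group.ids.shift();                                 // keep one _id per key
  db.things.deleteMany({ _id: { $in: group.ids } });
});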
This is the easiest query I used on my MongoDB 3.2
db.myCollection.find({}, {myCustomKey:1}).sort({_id:1}).forEach(function(doc){
db.myCollection.remove({_id:{$gt:doc._id}, myCustomKey:doc.myCustomKey});
})
Index your customKey before running this to increase speed
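For example (using the field name from the snippet above):
db.myCollection.createIndex({ myCustomKey: 1 })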
While #Stennie's is a valid answer, it is not the only way. In fact, the MongoDB manual asks you to be very cautious when doing that. There are two other options:
Let MongoDB do it for you using Map Reduce.
Do it programmatically yourself, which is less efficient.
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search for each of those keys and delete all but one document when the search returns more than one.
db.collection.distinct("key").forEach((num)=>{
var i = 0;
db.collection.find({key: num}).forEach((doc)=>{
if (i) db.collection.remove({key: num}, { justOne: true })
i++
})
});
I had a similar requirement but I wanted to retain the latest entry. The following query worked with my collection which had millions of records and duplicates.
/** Create an array to store the ids of all duplicate records */
var duplicates = [];
/** Start the aggregation pipeline */
db.collection.aggregate([
  {
    $match: { /** Add any filter here. Add an index for the filter keys */
      filterKey: {
        $exists: false
      }
    }
  },
  {
    $sort: { /** Sort it in such a way that the element you want to retain comes first */
      createdAt: -1
    }
  },
  {
    $group: {
      _id: {
        key1: "$key1", key2: "$key2" /** These are the keys which define a duplicate. Here documents with the same value for key1 and key2 are considered duplicates */
      },
      dups: {
        $push: {
          _id: "$_id"
        }
      },
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        "$gt": 1
      }
    }
  }
],
{
  allowDiskUse: true
}).forEach(function(doc){
  doc.dups.shift();
  doc.dups.forEach(function(dupId){
    duplicates.push(dupId._id);
  })
})
/** Delete the duplicates */
var i, j, temparray, chunk = 100000;
for (i = 0, j = duplicates.length; i < j; i += chunk) {
  temparray = duplicates.slice(i, i + chunk);
  db.collection.bulkWrite([{deleteMany: {"filter": {"_id": {"$in": temparray}}}}])
}
Expanding on Fernando's answer, I found that it was taking too long, so I modified it.
var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
var i = 0;
db.collection.find({ "field": fieldValue }).forEach(doc => {
if (i) {
db.collection.remove({ _id: doc._id });
}
i++;
x += 1;
if (x % 100 === 0) {
print(x); // Every time we process 100 docs.
}
});
});
The improvement is basically using the document id for the remove, which should be faster, and also printing the progress of the operation; you can change the iteration value to whatever amount you prefer.
Also, indexing the field before the operation helps.
pip install mongo_remove_duplicate_indexes
Create a script in any language.
Iterate over your collection.
Create a new collection and create a new index on it with unique set to true. Remember, this index has to be on the same field (with the same name) as the one you wish to remove duplicates from in your original collection.
For example: you have a collection gaming, and in this collection you have a field genre which contains duplicates that you wish to remove, so just create a new collection:
db.createCollection("cname")
Create the new index:
db.cname.createIndex({'genre': 1}, {unique: true})
Now when you insert a document with a genre that already exists, only the first will be accepted; the others will be rejected with a duplicate key error.
Now just insert the JSON values you retrieved into the new collection and handle the exceptions,
for example pymongo.errors.DuplicateKeyError.
Check out the package source code for mongo_remove_duplicate_indexes for a better understanding.
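The same idea can also be sketched directly in the mongo shell (gaming and gaming_dedup are placeholder collection names):
db.gaming_dedup.createIndex({ genre: 1 }, { unique: true });
db.gaming.find().forEach(function (doc) {
  try {
    db.gaming_dedup.insertOne(doc);   // a document with an already-seen genre raises a duplicate key error
  } catch (e) {
    // duplicate key error (E11000): skip this document
  }
});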
If you have enough memory, you can do something like this in Scala:
cole.find().toList
  .groupBy(_.customField)
  .filter(_._2.size > 1)
  .flatMap(_._2.tail)
  .map(_.id)
  .foreach(id => cole.remove("id" $eq id))

mongodb mapreduce function does not provide skip functionality, is there any solution to this?

MongoDB's mapReduce function does not provide any way to skip records from the database the way the find function does. It has query, sort & limit options, but I want to skip some records and I cannot find any way to do it. Please provide solutions.
Thanks in advance.
Ideally a well-structured map-reduce query would allow you to skip particular documents in your collection.
Alternatively, as Sergio points out, you can simply not emit particular documents in map(). Using scope to define a global counter variable is one way to restrict emit to a specified range of documents. As an example, to skip the first 20 docs that are sorted by ObjectID (and thus, sorted by insertion time):
db.collection_name.mapReduce(map, reduce, {out: "example_output", sort: {_id: -1}, scope: {counter: 0}});
Map function:
function(){
counter ++;
if (counter > 20){
emit(key, value);
}
}
I'm not sure in which version this feature first became available, but certainly in MongoDB 2.6 the mapReduce() function provides a query parameter:
query : document
Optional. Specifies the selection criteria using query operators for determining the documents input to the map function.
Example
Consider the following map-reduce operations on a collection orders that contains documents of the following prototype:
{
_id: ObjectId("50a8240b927d5d8b5891743c"),
cust_id: "abc123",
ord_date: new Date("Oct 04, 2012"),
status: 'A',
price: 25,
items: [ { sku: "mmm", qty: 5, price: 2.5 },
{ sku: "nnn", qty: 5, price: 2.5 } ]
}
Perform the map-reduce operation on the orders collection using the mapFunction2, reduceFunction2, and finalizeFunction2 functions.
db.orders.mapReduce( mapFunction2,
reduceFunction2,
{
out: { merge: "map_reduce_example" },
query: { ord_date:
{ $gt: new Date('01/01/2012') }
},
finalize: finalizeFunction2
}
)
This operation uses the query field to select only those documents with ord_date greater than new Date(01/01/2012). Then it outputs the results to the collection map_reduce_example. If the map_reduce_example collection already exists, the operation merges the existing contents with the results of this map-reduce operation.