Update MongoDB collection using $toLower

I have an existing MongoDB collection containing user names. The user names contain both lower case and upper case letters.
I want to update all the user names so they only contain lower case letters.
I have tried this script, but it didn't work:
db.myCollection.find().forEach(
    function(e) {
        e.UserName = $toLower(e.UserName);
        db.myCollection.save(e);
    }
)

MongoDB does not have a concept of $toLower as a command. The solution is to run a big for loop over the data and issue the updates individually.
You can do this in any driver or from the shell:
db.myCollection.find().forEach(
    function(e) {
        e.UserName = e.UserName.toLowerCase();
        db.myCollection.save(e);
    }
)
You can also replace the save with an atomic update (inside the same forEach loop):
db.myCollection.update({_id: e._id}, {$set: {UserName: e.UserName.toLowerCase()}})
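Put together, the same loop with the atomic update in place of save looks like this (a sketch, reusing the collection and field names from above):
db.myCollection.find().forEach(
    function(e) {
        db.myCollection.update(
            { _id: e._id },
            { $set: { UserName: e.UserName.toLowerCase() } }
        );
    }
)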
Again, you could also do this from any of the drivers; the code will be very similar.
EDIT: Remon brings up a good point. The $toLower command does exist as part of the aggregation framework, but this has nothing to do with updating. The documentation for updating is here.

Starting with Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the update of a field based on its own value:
// { username: "Hello World" }
db.collection.updateMany(
    {},
    [{ $set: { username: { $toLower: "$username" } } }]
)
// { username: "hello world" }
The first part {} is the match query, filtering which documents to update (in this case all documents).
The second part [{ $set: { username: { $toLower: "$username" } } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline):
$set is a new aggregation operator which in this case modifies the value for "username".
Using $toLower, we replace the value of "username" with its lowercase version.

Very similar solution, but this worked for me in Mongo 3.2.
Execute the following in the Mongo shell or an equivalent DB tool like MongoChef:
db.tag.find({hashtag: {$exists: true}}).forEach(
    function(e) {
        e.hashtag = e.hashtag.toLowerCase();
        db.tag.save(e);
    });

With the accepted solution, it's trivial to do the same for an array of elements; just in case:
db.myCollection.find().forEach(
    function(e) {
        for (var i = 0; i < e.articles.length; i++) {
            e.articles[i] = e.articles[i].toLowerCase();
        }
        db.myCollection.save(e);
    }
)

A little late to the party, but the answer below works very well with Mongo 3.4 and above.
First fetch only those records whose userId contains upper-case characters, and update only those records in bulk.
The performance of this approach is many times better.
var bulk = db.myCollection.initializeUnorderedBulkOp();
var count = 0;
db.myCollection.find({userId: {$regex: '.*[A-Z]'}}).forEach(function(e) {
    var newId = e.userId.toLowerCase();
    bulk.find({_id: e._id}).updateOne({$set: {userId: newId}});
    count++;
    if (count % 500 === 0) {
        bulk.execute();
        bulk = db.myCollection.initializeUnorderedBulkOp();
        count = 0;
    }
})
if (count > 0) bulk.execute();

Just a note: make sure the field exists for all entries in your collection. If it does not, you will need an if statement, like the following:
if (e.UserName) e.UserName = e.UserName.toLowerCase();
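Putting it together with the accepted answer's loop, a guarded sketch:
db.myCollection.find().forEach(
    function(e) {
        if (e.UserName) {
            e.UserName = e.UserName.toLowerCase();
            db.myCollection.save(e);
        }
    }
)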


How to remove duplicates based on a key in MongoDB?

I have a collection in MongoDB with around ~3 million records. My sample record looks like this:
{ "_id" = ObjectId("50731xxxxxxxxxxxxxxxxxxxx"),
"source_references" : [
"_id" : ObjectId("5045xxxxxxxxxxxxxx"),
"name" : "xxx",
"key" : 123
]
}
I have a lot of duplicate records in the collection with the same source_references.key. (By duplicate I mean same source_references.key, not same _id.)
I want to remove the duplicate records based on source_references.key; I'm thinking of writing some PHP code to traverse each record and remove it if a duplicate exists.
Is there a way to remove the duplicates from the Mongo shell itself?
This answer is obsolete: the dropDups option was removed in MongoDB 3.0, so a different approach will be required in most cases. For example, you could use aggregation as suggested on: MongoDB duplicate documents even after adding unique key.
If you are certain that the source_references.key identifies duplicate records, you can ensure a unique index with the dropDups:true index creation option in MongoDB 2.6 or older:
db.things.ensureIndex({'source_references.key' : 1}, {unique : true, dropDups : true})
This will keep the first unique document for each source_references.key value, and drop any subsequent documents that would otherwise cause a duplicate key violation.
Important Note: Any documents missing the source_references.key field will be considered as having a null value, so subsequent documents missing the key field will be deleted. You can add the sparse:true index creation option so the index only applies to documents with a source_references.key field.
Obvious caution: Take a backup of your database, and try this in a staging environment first if you are concerned about unintended data loss.
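Since dropDups is gone in 3.0+, here is a minimal sketch of the aggregation-based alternative mentioned above (collection name from the example; it assumes source_references.key identifies the duplicates and keeps the first document in each group):
// Group documents by the duplicate key, keep the first _id per group,
// and delete the rest in one pass.
var dupIds = [];
db.things.aggregate([
    { $group: { _id: "$source_references.key", ids: { $push: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true }).forEach(function (group) {
    group.ids.shift(); // keep the first document in each group
    dupIds = dupIds.concat(group.ids);
});
db.things.remove({ _id: { $in: dupIds } });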
This is the easiest query I used on my MongoDB 3.2:
db.myCollection.find({}, {myCustomKey: 1}).sort({_id: 1}).forEach(function(doc) {
    db.myCollection.remove({_id: {$gt: doc._id}, myCustomKey: doc.myCustomKey});
})
Index your myCustomKey field before running this to increase speed.
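For example, a minimal index on the field used above:
db.myCollection.createIndex({ myCustomKey: 1 })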
While @Stennie's is a valid answer, it is not the only way. In fact, the MongoDB manual asks you to be very cautious while doing that. There are two other options:
Let MongoDB do it for you using Map Reduce.
Do it programmatically, which is less efficient.
Here is a slightly more 'manual' way of doing it:
Essentially, first get a list of all the unique keys you are interested in.
Then perform a search using each of those keys, and delete all but one of the matching documents.
db.collection.distinct("key").forEach((num)=>{
var i = 0;
db.collection.find({key: num}).forEach((doc)=>{
if (i) db.collection.remove({key: num}, { justOne: true })
i++
})
});
I had a similar requirement but I wanted to retain the latest entry. The following query worked with my collection which had millions of records and duplicates.
/** Create an array to store the ids of all duplicate records */
var duplicates = [];
/** Start the aggregation pipeline */
db.collection.aggregate([
    {
        $match: { /** Add any filter here. Add an index for the filter keys */
            filterKey: {
                $exists: false
            }
        }
    },
    {
        $sort: { /** Sort it in such a way that the element you want to retain comes first */
            createdAt: -1
        }
    },
    {
        $group: {
            _id: {
                key1: "$key1", key2: "$key2" /** These are the keys which define a duplicate. Here documents with the same value for key1 and key2 are considered duplicates */
            },
            dups: {
                $push: {
                    _id: "$_id"
                }
            },
            count: {
                $sum: 1
            }
        }
    },
    {
        $match: {
            count: {
                "$gt": 1
            }
        }
    }
],
{
    allowDiskUse: true
}).forEach(function(doc) {
    doc.dups.shift();
    doc.dups.forEach(function(dupId) {
        duplicates.push(dupId._id);
    })
})
/** Delete the duplicates in chunks */
var i, j, temparray, chunk = 100000;
for (i = 0, j = duplicates.length; i < j; i += chunk) {
    temparray = duplicates.slice(i, i + chunk);
    db.collection.bulkWrite([{ deleteMany: { "filter": { "_id": { "$in": temparray } } } }]);
}
Expanding on Fernando's answer, I found that it was taking too long, so I modified it.
var x = 0;
db.collection.distinct("field").forEach(fieldValue => {
    var i = 0;
    db.collection.find({ "field": fieldValue }).forEach(doc => {
        if (i) {
            db.collection.remove({ _id: doc._id });
        }
        i++;
        x += 1;
        if (x % 100 === 0) {
            print(x); // Print progress every 100 processed docs.
        }
    });
});
The improvement is basically using the document id for the removal, which should be faster, and also printing the progress of the operation; you can change the iteration value to your desired amount.
Also, indexing the field before the operation helps.
pip install mongo_remove_duplicate_indexes
Create a script in any language and iterate over your collection. Create a new collection with a new unique index on it; remember this index has to be on the same field from which you wish to remove duplicates in your original collection, with the same name.
For example, you have a collection gaming, and in this collection you have a field genre which contains duplicates that you wish to remove, so just create a new collection:
db.createCollection("cname")
Create the new index:
db.cname.createIndex({ 'genre': 1 }, { unique: true })
Now when you insert a document with a similar genre, only the first will be accepted; the others will be rejected with a duplicate key error.
Now just insert the JSON values you received into the new collection, and handle the exceptions using exception handling, for example pymongo.errors.DuplicateKeyError.
Check out the package source code for mongo_remove_duplicate_indexes for a better understanding.
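For reference, a rough mongo-shell sketch of the same idea (gaming and genre are from the example above; gaming_dedup is a hypothetical target collection):
db.createCollection("gaming_dedup");
db.gaming_dedup.createIndex({ genre: 1 }, { unique: true });
db.gaming.find().forEach(function (doc) {
    // Inserts with an already-seen genre fail with a duplicate key error
    // (code 11000), which we simply ignore.
    var res = db.gaming_dedup.insert(doc);
    if (res.hasWriteError() && res.getWriteError().code !== 11000) {
        printjson(res.getWriteError());
    }
});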
If you have enough memory, you can do something like this in Scala (a sketch using a driver query DSL):
cole.find().groupBy(_.customField).filter(_._2.size > 1).map(_._2.tail).flatten.map(_.id)
    .foreach(x => cole.remove("id" $eq x))

Update with expression instead of value

I am totally new to MongoDB... I am missing a "newbie" tag, so the experts would not have to see this question.
I am trying to update all documents in a collection using an expression. The query I was expecting to solve this was:
db.QUESTIONS.update({}, { $set: { i_pp : i_up * 100 - i_down * 20 } }, false, true);
That, however, results in the following error message:
ReferenceError: i_up is not defined (shell):1
At the same time, the database did not have any problem with eating this one:
db.QUESTIONS.update({}, { $set: { i_pp : 0 } }, false, true);
Do I have to do this one document at a time or something? That just seems excessively complicated.
Update
Thank you Sergio Tulentsev for telling me that it does not work. Now, I am really struggling with how to do this. I offer 500 Profit Points to the helpful soul, who can write this in a way that MongoDB understands. If you register on our forum I can add the Profit Points to your account there.
I just came across this while searching for the MongoDB equivalent of SQL like this:
update t
set c1 = c2
where ...
Sergio is correct that you can't reference another property as a value in a straight update. However, db.c.find(...) returns a cursor and that cursor has a forEach method:
Queries to MongoDB return a cursor, which can be iterated to retrieve results. The exact way to query will vary with language driver. Details below focus on queries from the MongoDB shell (i.e. the mongo process).
The shell find() method returns a cursor object which we can then iterate to retrieve specific documents from the result. We use hasNext() and next() methods for this purpose.
for (var c = db.parts.find(); c.hasNext();) {
    print(c.next());
}
Additionally in the shell, forEach() may be used with a cursor:
db.users.find().forEach( function(u) { print("user: " + u.name); } );
So you can say things like this:
db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
    db.QUESTIONS.update(
        { _id: q._id },
        { $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
    );
});
to update them one at a time without leaving MongoDB.
If you're using a driver to connect to MongoDB then there should be some way to send a string of JavaScript into MongoDB; for example, with the Ruby driver you'd use eval:
connection.eval(%q{
    db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
        db.QUESTIONS.update(
            { _id: q._id },
            { $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
        );
    });
})
Other languages should be similar.
// The only difference is to make it look like an aggregation pipeline.
db.table.updateMany({}, [{
    $set: {
        col3: { "$sum": ["$col1", "$col2"] }
    }
}])
You can't use expressions in updates. Or, rather, you can't use expressions that depend on fields of the document. Simple self-contained math expressions are fine (e.g. 2 * 2).
If you want to set a new field for all documents that is a function of other fields, you have to loop over them and update manually. Multi-update won't help here.
Rha7 gave a good idea, but the code above does not work without defining a temporary variable.
This sample code produces an approximate calculation of the age (leap years are ignored behind the scenes) based on the 'birthday' field, and inserts the value into a suitable field for all documents that do not already contain one:
db.employers.find({age: {$exists: false}}).forEach(function(doc) {
    var new_age = parseInt((ISODate() - doc.birthday) / (3600 * 1000 * 24 * 365));
    db.employers.update({_id: doc._id}, {$set: {age: new_age}});
});
Example to remove "00" from the beginning of a caller id:
db.call_detail_records_201312.find(
    { destination: /^001/ },
    { "destination": true }
).forEach(function(row) {
    db.call_detail_records_201312.update(
        { _id: row["_id"] },
        { $set: {
            destination: row["destination"].replace(/^001/, '1')
        } }
    )
});

mongoose node.js, query with $lt and $gt not working

I want to get all the pupils whose last mark is between 15 and 20. The models are working fine (all the other queries are OK). To do so, I perform the following query on my MongoDB using Mongoose:
Pupils.find({"marks[-1].value": {'$lt': 20}, "marks[-1].value": {'$gt': 15}}, function(err, things) {
    // ...
});
This is not working; is there something I missed?
UPDATE
I found something like:
Pupils.find({ "marks[-1].value": {$gt : 15, $lt : 20}});
But this does not work either. Is there a way to get the last mark of the marks array in this case?
Let's consider your Pupils collection:
Pupils
{
    _id,
    Marks(integer),
    LatestMark(int)
}
I suggest adding the latest mark to the Pupil document (as you can see in the document above), and updating it each time you add a new mark to the nested collection (a sketch of that update follows the query below).
Then you will be able to query on it like this:
db.Pupils.find({ "LatestMark": {$gt : 15, $lt : 20}});
You can also query the latest mark using $where, but be careful, because:
JavaScript executes more slowly than the native operators, but it is very flexible.
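A minimal $where sketch for the original last-mark requirement (slow, since it runs JavaScript for every document; the marks structure is taken from the question):
db.Pupils.find({
    $where: "this.marks && this.marks.length && this.marks[this.marks.length - 1].value > 15 && this.marks[this.marks.length - 1].value < 20"
});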
I believe it's not working because the embedded collections in mongo are accessed like this:
"marks.0.value", although I haven't used mongoose.
Unfortunately for your scenario, I do not think there is a way to use negative indexing. (Mongo doesn't guarantee a preserved natural order unless you use a capped collection anyway though)
You may be able to accomplish this using Map/Reduce or a group command
http://www.mongodb.org/display/DOCS/MapReduce
This is a tip rather than an answer.
Use double quotes for special keywords like $elemMatch, $gt, $lt etc. while using Mongoose.
In the following code, $gt will not work properly.
var elmatch = { companyname: mycompany };
var condition = { company_info: { $elemMatch: elmatch } };
if (isValid(last_id)) {
    condition._id = { $gt: new ObjectId(last_id) };
}
console.log(condition);
User.find(condition).limit(limit).sort({_id: 1}).exec(function (err, lists) {
    if (!err) {
        res.send(lists);
        res.end();
    } else {
        res.send(err);
        res.end();
    }
});
But this issue is solved when I use double quotes for the special keywords:
var elmatch = { companyname: mycompany };
var condition = { company_info: { "$elemMatch": elmatch } };
if (isValid(last_id)) {
    condition._id = { "$gt": new ObjectId(last_id) };
}
console.log(condition);
User.find(condition).limit(limit).sort({_id: 1}).exec(function (err, lists) {
    if (!err) {
        res.send(lists);
        res.end();
    } else {
        res.send(err);
        res.end();
    }
});
I hope it will be helpful.

Removing duplicate records using MapReduce

I'm using MongoDB and need to remove duplicate records. I have a listing collection that looks like this (simplified):
[
    { "MlsId": "12345" },
    { "MlsId": "12345" },
    { "MlsId": "23456" },
    { "MlsId": "23456" },
    { "MlsId": "0" },
    { "MlsId": "0" },
    { "MlsId": "" },
    { "MlsId": "" }
]
A listing is a duplicate if the MlsId is not "" or "0" and another listing has that same MlsId. So in the example above, the 2nd and 4th records would need to be removed.
How would I find all duplicate listings and remove them? I started looking at MapReduce but couldn't find an example that fit my case.
Here is what I have so far, but it doesn't check if the MlsId is "0" or "":
m = function () {
    emit(this.MlsId, 1);
}
r = function (k, vals) {
    return Array.sum(vals);
}
res = db.Listing.mapReduce(m, r);
db[res.result].find({value: {$gt: 1}});
db[res.result].drop();
I have not used MongoDB, but I have used mapreduce. I think you are on the right track in terms of the mapreduce functions. To exclude the 0 and empty strings, you can add a check in the map function itself... something like:
m = function () {
    if (this.MlsId != 0 && this.MlsId != "") {
        emit(this.MlsId, 1);
    }
}
In MongoDB, the reduce function returns the aggregated value for the key rather than emitting it, so it should be:
r = function (k, vals) {
    return Array.sum(vals);
}
After this, you should have a set of key-value pairs in the output such that the key is MlsId and the value is the number of times this particular ID occurs. I am not sure about the db.drop() part. As you pointed out, it will most probably delete all MlsIds instead of removing only the duplicate ones. To get around this, maybe you can call drop() first and then recreate the MlsId once. Will that work for you?
In MongoDB you can use a query to restrict the documents that are passed in for mapping. You probably want to do that for the ones you don't care about. Then in the reduce function you can ignore the dups and only return one of the docs for each duplicate key.
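A sketch of that restriction using mapReduce's query option, reusing m and r from the question (listing_dupes is a hypothetical output collection name):
res = db.Listing.mapReduce(m, r, {
    query: { MlsId: { $nin: ["", "0"] } }, // skip the values we don't care about
    out: "listing_dupes"
});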
I'm a little confused about your goal though. If you just want to find duplicates and remove all but one of them then you can just create a unique index on that field and use the dropDups option; the process of creating the index will drop duplicate docs. Keeping the index will ensure that it doesn't happen again.
http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues
You can use an aggregation operation to remove duplicates: unwind, introduce a dummy $group and $sum stage, and ignore the counts in your next stage. Something like this:
db.myCollection.aggregate([
    {
        $unwind: '$list'
    },
    {
        $group: {
            '_id': {
                'listing_id': '$_id', 'MlsId': '$list.MlsId'
            },
            'count': {
                '$sum': 1
            }
        }
    },
    {
        $group: {
            '_id': '$_id.listing_id',
            'list': {
                '$addToSet': {
                    'MlsId': '$_id.MlsId'
                }
            }
        }
    }
]);
This is how I followed @harri's answer to remove duplicates:
// Contains the duplicated document ids and the number of duplicates
db.createCollection("myDupesCollection")
res = db.sampledDB.mapReduce(m, r, { out: "myDupesCollection" });
// Iterate through the duplicated docs and remove the duplicates (keeping one).
// Note: the mapReduce output's _id is the emitted key (the MlsId itself),
// and remove() takes a justOne flag rather than a count, hence the loop.
db.myDupesCollection.find({value: {$gt: 1}}).forEach(function(myDoc) {
    var u_id = myDoc._id;
    var count = myDoc.value;
    for (var i = 0; i < count - 1; i++) { // if there are 3 docs, remove 3 - 1 = 2 of them
        db.sampledDB.remove({MlsId: u_id}, {justOne: true});
    }
});

How do I convert a property in MongoDB from text to date type?

In MongoDB, I have a document with a field called "ClockInTime" that was imported from CSV as a string.
What does an appropriate db.ClockTime.update() statement look like to convert these text-based values to a date datatype?
This code should do it:
> var cursor = db.ClockTime.find()
> while (cursor.hasNext()) {
... var doc = cursor.next();
... db.ClockTime.update({_id : doc._id}, {$set : {ClockInTime : new Date(doc.ClockInTime)}})
... }
I have exactly the same situation as Jeff Fritz.
In my case I have succeeded with the following simpler solution:
db.ClockTime.find().forEach(function(doc) {
    doc.ClockInTime = new Date(doc.ClockInTime);
    db.ClockTime.save(doc);
})
This is generic sample code in Python, using PyMongo:
from pymongo import MongoClient
from datetime import datetime

def fixTime(host, port, database, collection, attr, date_format):
    # host is where the mongodb is hosted, e.g. "localhost"
    # port is the mongodb port, e.g. 27017
    # database is the name of the database, e.g. "test"
    # collection is the name of the collection, e.g. "test_collection"
    # attr is the field name which needs to be modified
    # date_format is the format of the string, e.g. "%Y-%m-%d %H:%M:%S.%f"
    # http://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
    client = MongoClient(host, port)
    db = client[database]
    col = db[collection]
    for obj in col.find():
        if obj[attr]:
            if type(obj[attr]) is not datetime:
                time = datetime.strptime(obj[attr], date_format)
                col.update({'_id': obj['_id']}, {'$set': {attr: time}})
For more info: http://salilpa.com/home/content/how-convert-property-mongodb-text-date-type-using-pymongo
Starting with Mongo 4.x:
db.collection.update() can accept an aggregation pipeline, finally allowing the update of a field based on its current value (Mongo 4.2+).
There is a new $toDate aggregation operator (Mongo 4.0).
Such that:
// { a: "2018-03-03" }
db.collection.updateMany(
{},
[{ $set: { a: { $toDate: "$a" } } }]
)
// { a: ISODate("2018-03-03T00:00:00Z") }
The first part {} is the match query, filtering which documents to update (in this case all documents).
The second part [{ $set: { a: { $toDate: "$a" } } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline). $set is a new aggregation operator which in this case replaces the field's value, the replaced value being the field itself converted to an ISODate object. Note how a is modified directly based on its own value ($a).
If you need to check whether the field has already been converted, you can use this condition:
/usr/bin/mongo mydb --eval 'db.mycollection.find().forEach(function(doc){
    if (doc.date instanceof Date !== true) {
        doc.date = new ISODate(doc.date);
        db.mycollection.save(doc);
    }
});'
Otherwise the command line may break.