Removing white spaces (leading and trailing) from string value - mongodb

I have imported a csv file in mongo using mongoimport and I want to remove leading and trailing white spaces from my string value.
Is it possible to apply a trim function to the whole collection directly in mongo, or do I need to write a script for that?
My collection contains elements such as:
{
"_id" : ObjectId("53857680f7b2eb611e843a32"),
"category" : "Financial & Legal Services "
}
I want to apply a trim function to the whole collection so that "category" does not contain any leading or trailing spaces.

It is not currently possible for an update in MongoDB to refer to the existing value of a current field when applying the update. So you are going to have to loop:
db.collection.find({},{ "category": 1 }).forEach(function(doc) {
doc.category = doc.category.trim();
db.collection.update(
{ "_id": doc._id },
{ "$set": { "category": doc.category } }
);
})
Note the use of the $set operator there, and that only the "category" field is projected in order to reduce network traffic.
You might limit what that processes with a $regex to match:
db.collection.find({
"$and": [
{ "category": /^\s+/ },
{ "category": /\s+$/ }
]
})
That form uses $and, which MongoDB only needs when multiple conditions are applied to the same field (otherwise $and is implicit across all arguments), and it matches only documents that have both leading and trailing white-space. A single $regex without $and is simpler:
db.collection.find({ "category": /^\s+|\s+$/ })
Which restricts the matched documents to process to only those with leading or trailing white-space.
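To see which values each filter would select, the two forms can be tried against plain strings in any JavaScript shell. This is only a sketch: the sample values and the helper names `matchesBoth` and `matchesEither` are made up for illustration, and the server uses its own regex engine rather than JavaScript's, although for ordinary spaces and tabs these simple patterns should behave the same way.

```javascript
// Hypothetical sample values; only the patterns come from the queries above.
var samples = [
  "Financial & Legal Services ",  // trailing space only
  " IT ",                         // leading and trailing
  " Retail",                      // leading space only
  "Clean"                         // nothing to trim
];

var leading = /^\s+/;
var trailing = /\s+$/;
var either = /^\s+|\s+$/;

// Equivalent of the "$and" query: both conditions must hold.
function matchesBoth(s) {
  return leading.test(s) && trailing.test(s);
}

// Equivalent of the single-pattern query: either side is enough.
function matchesEither(s) {
  return either.test(s);
}

var bothHits = samples.filter(matchesBoth);
var eitherHits = samples.filter(matchesEither);
```

Here `bothHits` contains only " IT ", while `eitherHits` contains every value except "Clean", which is why the single pattern is the better filter for a trim pass.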
If you are worried about the number of documents to loop over, bulk updating should help if you have MongoDB 2.6 or greater available:
var batch = [];
db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1 }).forEach(
function(doc) {
batch.push({
"q": { "_id": doc._id },
"u": { "$set": { "category": doc.category.trim() } }
});
if ( batch.length % 1000 == 0 ) {
db.runCommand({ "update": "collection", "updates": batch });
batch = [];
}
}
);
if ( batch.length > 0 )
db.runCommand({ "update": "collection", "updates": batch });
Or even with the bulk operations API for MongoDB 2.6 and above:
var counter = 0;
var bulk = db.collection.initializeOrderedBulkOp();
db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1}).forEach(
function(doc) {
bulk.find({ "_id": doc._id }).update({
"$set": { "category": doc.category.trim() }
});
counter = counter + 1;
if ( counter % 1000 == 0 ) {
bulk.execute();
bulk = db.collection.initializeOrderedBulkOp();
}
}
);
if ( counter % 1000 != 0 )
bulk.execute();
With modern APIs this is best done with bulkWrite(), which uses the Bulk Operations API (technically everything does now) but degrades safely on older versions of MongoDB. In all honesty that would mean versions prior to MongoDB 2.6, and you would be well out of coverage for official support options on such a version. The code is somewhat cleaner:
var batch = [];
db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1}).forEach(
function(doc) {
batch.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": { "$set": { "category": doc.category.trim() } }
}
});
if ( batch.length % 1000 == 0 ) {
db.collection.bulkWrite(batch);
batch = [];
}
}
);
if ( batch.length > 0 ) {
db.collection.bulkWrite(batch);
batch = [];
}
All of these send operations to the server only once per 1000 documents, or as many modifications as you can fit under the 16MB BSON limit.
These are just a few ways to approach the problem. Alternatively, clean up your CSV file before importing.
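The flush-every-1000 pattern shared by these variants can be factored into a small helper. The sketch below is only an illustration: `makeBatcher` and `fakeCollection` are hypothetical names, and the stand-in collection merely records batch sizes so the logic can be exercised without a server. In the shell you would pass `db.collection` instead.

```javascript
// Hypothetical helper: queue operations and flush them in fixed-size batches.
function makeBatcher(collection, size) {
  var batch = [];
  return {
    add: function (op) {
      batch.push(op);
      if (batch.length === size) this.flush();
    },
    // Send whatever is queued; does nothing when the queue is empty.
    flush: function () {
      if (batch.length === 0) return;
      collection.bulkWrite(batch);
      batch = [];
    }
  };
}

// Stand-in for db.collection that only records batch sizes (assumption for the demo).
var fakeCollection = {
  sizes: [],
  bulkWrite: function (ops) { this.sizes.push(ops.length); }
};

var batcher = makeBatcher(fakeCollection, 1000);
for (var i = 0; i < 2500; i++) {
  batcher.add({
    updateOne: {
      filter: { _id: i },
      update: { $set: { category: "trimmed" } }
    }
  });
}
batcher.flush(); // send the final partial batch
```

Guarding `flush` against an empty queue means the final call is always safe, whether or not the document count is an exact multiple of the batch size.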

Starting Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the update of a field based on its own value.
Starting Mongo 4.0, the $trim operator can be applied on a string to remove its leading/trailing white spaces:
// { category: "Financial & Legal Services " }
// { category: " IT " }
db.collection.updateMany(
{},
[{ $set: { category: { $trim: { input: "$category" } } } }]
)
// { category: "Financial & Legal Services" }
// { category: "IT" }
Note that:
The first part {} is the match query, filtering which documents to update (in this case all documents).
The second part [{ $set: { category: { $trim: { input: "$category" } } } }] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline):
$set is a new aggregation stage which in this case replaces the value for "category".
With $trim we modify and trim the value for "category".
Note that $trim can take an optional parameter chars which allows specifying which characters to trim.
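As a rough sketch of how the optional chars parameter fits in, the pipeline can be built as a plain object and then handed to updateMany. The helper name `buildTrimPipeline` and the " -" character set are hypothetical and chosen only for illustration; the regex filter in the comment is borrowed from the earlier answer so that only documents needing a trim are touched.

```javascript
// Hypothetical helper: build the update pipeline for trimming one field.
// "chars" is optional; when omitted, $trim removes whitespace.
function buildTrimPipeline(field, chars) {
  var trim = { input: "$" + field };
  if (chars !== undefined) trim.chars = chars;
  var stage = { $set: {} };
  stage.$set[field] = { $trim: trim };
  return [stage];
}

var whitespaceOnly = buildTrimPipeline("category");
var spacesAndDashes = buildTrimPipeline("category", " -");

// In the shell (MongoDB 4.2+), assuming the same collection as above:
// db.collection.updateMany({ category: /^\s+|\s+$/ }, whitespaceOnly)
```

When chars is given, $trim strips any of the listed characters from both ends, so " -" would remove leading and trailing spaces and dashes.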

A small correction to Neil's answer on the bulk operations API: the method is initializeOrderedBulkOp, not initializeBulkOrderedOp. The counter also needs to be incremented inside the forEach. In summary:
var counter = 0;
var bulk = db.collection.initializeOrderedBulkOp();
db.collection.find({ "category": /^\s+|\s+$/ },{ "category": 1}).forEach(
function(doc) {
bulk.find({ "_id": doc._id }).update({
"$set": { "category": doc.category.trim() }
});
counter++;
if ( counter % 1000 == 0 ) {
bulk.execute();
// start a fresh batch; an executed bulk cannot be reused
bulk = db.collection.initializeOrderedBulkOp();
}
}
);
// flush the final partial batch, if any
if ( counter % 1000 != 0 )
bulk.execute();
Note: I don't have enough reputation to comment, hence adding an answer

You can run JavaScript around a MongoDB update command by calling it inside a cursor method such as forEach:
db.collection.find({},{ "category": 1 }).forEach(function(doc) {
db.collection.update(
{ "_id": doc._id },
{ "$set": { "category": doc.category.trim() } }
);
})
If you have a ton of records and need to batch process, you might want to look at the other answers here.

Related

Upsert issue when updating multiple documents using an array of IDs with $in

This query is doing the job fine :
db.collection.update(
{ "_id": oneIdProvided },
{ $inc: { "field": 5 } },{ upsert: true }
)
Now I would like to do the same operation multiple times with different IDs. I thought the right way was to use $in, so I tried:
db.collection.update(
{ "_id": { $in: oneArrayOfIds} },
{ $inc: { "field": 5 } },{ upsert: true }
)
The problem is: if one of the IDs in the array does not exist in the collection, a new document is created (which is what I want), but it is given an automatically generated ID rather than the ID I provided.
One solution I see would be to first run an insert with my array of IDs (those already existing would not be modified) and then run my update with upsert: false.
Is there a way to do this in only one query?
We can do this by performing multiple write operations using the bulkWrite() method.
function* range(start, end, step) {
for (let val=start; val<end; val+=step)
yield val
}
let oneArrayOfIds = [1, 2, 3, 4]; // example IDs
let bulkOp = oneArrayOfIds.map( id => {
return {
"updateOne": {
"filter": { "_id": id },
// $inc, as in the question, so existing documents are incremented
"update": { "$inc": { "field": 5 } },
"upsert": true
}
};
});
const limit = 1000;
const len = bulkOp.length;
// send the operations in chunks of at most `limit`
for (let index of range(0, len, limit)) {
db.collection.bulkWrite(bulkOp.slice(index, index+limit));
}
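The reason the $in version ends up with a generated _id is that an upsert generally seeds the new document only from equality conditions in the filter, and $in is not an equality match. One operation per ID, each filtering on a single _id, avoids that. A minimal sketch, where `buildUpsertOps` is a hypothetical helper name and the $inc of 5 is taken from the question:

```javascript
// Hypothetical helper: one upsert per ID so each filter is an equality match on _id.
function buildUpsertOps(ids, field, amount) {
  return ids.map(function (id) {
    var inc = {};
    inc[field] = amount;
    return {
      updateOne: {
        filter: { _id: id },
        update: { $inc: inc },
        upsert: true
      }
    };
  });
}

var upsertOps = buildUpsertOps([1, 2, 3, 4], "field", 5);
// In the shell: db.collection.bulkWrite(upsertOps)
```

Each missing ID then produces a new document carrying exactly that _id, and each existing one is simply incremented.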

Remove the max value from an array in a document

Using mongodb. I have a collection of vehicles, each one has an array of accidents, and each accident has a date.
Vehicle {
_id: ...,
GasAmount: ...,
Type: ...,
Accidents: [
{
Date: ISODate(...),
Type: ..,
Cost: ..
},
{
Date: ISODate(..),
Type: ..,
Cost:...,
}
]
}
How can I remove the oldest accident of each vehicle without using aggregate?
It is important not to use the aggregate method.
Unfortunately, you may have to use aggregation in this case as it's near impossible to find a non-aggregation based solution that can be as efficient.
Aggregation is useful here to get the embedded documents with the oldest date. Once you get them it's easier to do an update. The following demonstrates this concept, using MongoDB's bulk API to update your collection:
var bulk = db.vehicles.initializeUnorderedBulkOp(),
counter = 0,
pipeline = [
{ "$unwind": "$Accidents" },
{
"$group": {
"_id": "$_id",
"oldestDate": { "$min": "$Accidents.Date" }
}
}
];
var cur = db.vehicles.aggregate(pipeline);
cur.forEach(function (doc){
bulk.find({ "_id": doc._id }).updateOne({
"$pull": { "Accidents": { "Date": doc.oldestDate } }
});
counter++;
if (counter % 100 == 0) {
bulk.execute();
bulk = db.vehicles.initializeUnorderedBulkOp();
}
});
if (counter % 100 != 0) bulk.execute();
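What the pipeline and the $pull do to a single document can be traced in plain JavaScript. The vehicle below is a made-up example that follows the schema in the question; the two steps only mirror the server-side behaviour and are not how MongoDB implements it.

```javascript
// Hypothetical vehicle document; field names follow the schema in the question.
var vehicle = {
  _id: 1,
  Accidents: [
    { Date: new Date("2015-03-01T00:00:00Z"), Type: "minor", Cost: 200 },
    { Date: new Date("2012-07-15T00:00:00Z"), Type: "major", Cost: 5000 },
    { Date: new Date("2014-01-20T00:00:00Z"), Type: "minor", Cost: 350 }
  ]
};

// Step 1 (what $unwind + $group with $min computes): the earliest date.
var oldestTime = Math.min.apply(null, vehicle.Accidents.map(function (a) {
  return a.Date.getTime();
}));

// Step 2 (what $pull does): drop every element whose Date equals that value.
var remaining = vehicle.Accidents.filter(function (a) {
  return a.Date.getTime() !== oldestTime;
});
```

Note that $pull removes every matching element, so if two accidents share the oldest date, both are removed.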

Sum of Substrings in mongodb

We have field(s) in mongodb which has numbers in string form, values such as "$123,00,89.00" or "1234$" etc
Is it possible to customize $sum accumulators in mongodb, so that, certain processing can be done at each field value while the sum is performed. Such as substring or reg-ex processing etc.
The .mapReduce() method is what you need here. You cannot "cast" values in the aggregation framework from one "type" to another ( with the exception of "to string" or from Date to numeric ).
The JavaScript processing means that you can convert a string into a value for "summing". Something like this (with a bit more work on a "safe" regex for the required "currency" values):
db.collection.mapReduce(
function() {
emit(null, this.amount.replace(/\$|,|\./g,"") / 100 );
},
function(key,values) {
return Array.sum(values);
},
{ "out": { "inline": 1 } }
)
Or with .group(), which also uses JavaScript processing but is a bit more restrictive in its requirements:
db.collection.group({
"key": null,
"reduce": function( curr,result ) {
result.total += curr.amount.replace(/\$|,|\./g,"") /100;
},
"initial": { "total": 0 }
});
So JavaScript processing is your only option, as these sorts of operations are not supported in the aggregation framework.
A number can be a string:
db.junk.aggregate([{ "$project": { "a": { "$substr": [ 1,0,1 ] } } }])
{ "_id" : ObjectId("55a458c567446a4351c804e5"), "a" : "1" }
And a Date can become a number:
db.junk.aggregate([{ "$project": { "a": { "$subtract": [ new Date(), new Date(0) ] } } }])
{ "_id" : ObjectId("55a458c567446a4351c804e5"), "a" : NumberLong("1436835669446") }
But there are no other operators to "cast" a "string" to "numeric", or even anything to do a regex replace as shown above.
If you want to use .aggregate() then you need to fix your data into a format that will support it, thus "numeric":
var bulk = db.collection.initializeOrderedBulkOp(),
count = 0;
db.collection.find({ "amount": /\$|,|\./ }).forEach(function(doc) {
doc.amount = doc.amount.replace(/\$|,|\./g,"") /100;
bulk.find({ "_id": doc._id }).updateOne({
"$set": { "amount": doc.amount }
});
count++;
// execute once in 1000 operations
if ( count % 1000 == 0 ) {
bulk.execute();
bulk = db.collection.initializeOrderedBulkOp();
}
});
// clean up queued operations
if ( count % 1000 != 0 )
bulk.execute();
Then you can use .aggregate() on your "numeric" data:
db.collection.aggregate([
{ "$group": { "_id": null, "total": { "$sum": "$amount" } } }
])
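The replace-and-divide expression used in the examples above treats the last two digits as cents, so a value without a decimal part such as "1234$" would come out as 12.34. A slightly safer conversion, sketched below, strips only the decoration and keeps the decimal point. The helper name `parseAmount` is hypothetical, and it assumes "$" and "," are decoration and "." is the decimal separator, which may not hold for your data.

```javascript
// Hypothetical helper: turn a currency-like string into a number.
// Assumes "$" and "," are decoration and "." is the decimal separator.
function parseAmount(text) {
  var cleaned = text.replace(/[$,]/g, "");
  var value = parseFloat(cleaned);
  if (isNaN(value)) {
    throw new Error("not a numeric amount: " + text);
  }
  return value;
}

var amounts = ["$123,00,89.00", "1234$", "12.50"];
var total = amounts.map(parseAmount).reduce(function (sum, n) {
  return sum + n;
}, 0);
```

The same function could replace the inline expression in the mapReduce emit or in the conversion loop, and throwing on unparseable input makes bad values visible instead of silently summing NaN.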

Update MongoDB collection with $toLower or $toUpper

I would like to convert the 'state' field for all 'Organization' to all UPPER case. So
'Ky' becomes 'KY'
'tX' becomes 'TX'
'ca' becomes 'CA'
Why doesn't this work?
db.organizations.update(state:{ $exists : true }},{$set:{state:{ $toUpper : state }}}, false, true)
The $toLower and $toUpper operators you reference are for use with the aggregation framework only, and by themselves do not alter documents in a collection as the .update() statement does. It is also not presently possible to reference the value of an existing field within an update statement to produce a new value.
What you need to do is "loop" the collection and make your changes:
db.organizations.find({ "state": { "$exists": true } }).forEach(function(doc) {
db.organizations.update(
{ "_id": doc._id },
{ "$set": { "state": doc.state.toUpperCase() } }
);
});
With MongoDB 2.6 or greater you can make this a bit better with the bulk operations API:
var bulk = db.organizations.initializeOrderedBulkOp();
var count = 0;
db.organizations.find({ "state": { "$exists": true } }).forEach(function(doc) {
bulk.find({ "_id": doc._id }).updateOne({
"$set": { "state": doc.state.toUpperCase() } }
);
count++;
if ( count % 500 == 0 ) {
bulk.execute();
bulk = db.organizations.initializeOrderedBulkOp();
count = 0;
}
});
if ( count > 0 )
bulk.execute();
While still basically looping the results, the writes are only sent to the database once every 500 documents or whatever you choose to set staying under the 16MB BSON limit for the operation.
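One way to cut the number of writes further is to skip documents whose value is already upper case. The sketch below works on a made-up array standing in for the cursor results, and `upperOps` is a hypothetical name; the resulting operations are in the shape bulkWrite() expects.

```javascript
// Hypothetical documents standing in for db.organizations.find(...) results.
var docs = [
  { _id: 1, state: "Ky" },
  { _id: 2, state: "TX" },   // already upper case, no write needed
  { _id: 3, state: "ca" }
];

// Build update operations only where the value would actually change.
var upperOps = docs
  .filter(function (doc) { return doc.state !== doc.state.toUpperCase(); })
  .map(function (doc) {
    return {
      updateOne: {
        filter: { _id: doc._id },
        update: { $set: { state: doc.state.toUpperCase() } }
      }
    };
  });
```

With this sample only two operations are queued, for "Ky" and "ca".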
You have to put toUpperCase() like this:
"$set": { "state": doc.state.toUpperCase() } }

Convert a field to an array using update operation

In Mongo, how do you convert a field to an array containing only the original value using only the update operation?
Given a document:
{
"field": "x"
}
Then one or more update operation(s):
db.items.update(...)
Should result in:
{
"field": ["x"]
}
MongoDB does not currently allow you to refer to the existing value of a field within an update type of operation. In order to make changes that refer to the existing field values you would need to loop the results. But the array conversion part is simple:
db.collection.find().forEach(function(doc) {
db.collection.update(
{ _id: doc._id },
{ "$set": { "field": [doc.field] } }
);
})
Or even better with bulk update functionality in MongoDB 2.6:
var batch = [];
db.collection.find().forEach(function(doc) {
batch.push({
"q": { _id: doc._id },
"u": { "$set": { "field": [doc.field] } }
});
if ( batch.length % 500 == 0 ) {
db.runCommand({ "update": "collection", "updates": batch });
batch = [];
}
});
if ( batch.length > 0 )
db.runCommand({ "update": "collection", "updates": batch });
Or even using the new Bulk API helpers:
var counter = 0;
var bulk = db.collection.initializeUnorderedBulkOp();
db.collection.find().forEach(function(doc) {
bulk.find({ _id: doc._id }).update({
"$set": { "field": [doc.field] }
});
counter++;
if ( counter % 500 == 0 ) {
bulk.execute();
bulk = db.collection.initializeUnorderedBulkOp();
counter = 0;
}
});
if ( counter > 0 )
bulk.execute();
Both of the last two approaches send updates to the server only once per 500 items, or whatever batch size you choose under the 16MB BSON limit. All updates are still performed individually, but this removes a lot of write/confirmation traffic from the overall operation and is much faster.