Most efficient way to change a string field value to its substring - mongodb

I have a collection filled with documents that look like this:
{
data: 11,
version: "0.0.32"
}
and some have a test suffix to version:
{
data: 55,
version: "0.0.42-test"
}
The version field has different values but it always conforms to the pattern: 0.0.XXX. I would like to update all the documents to look like this:
{
data: 11,
version: 32
}
and the suffixed version (for test documents - version should be negative):
{
data: 55,
version: -42
}
The collection with these documents is used by our critical system, that needs to be turned off while updating the data - so I want the update/change to be as fast as possible. There are about 66_000_000 documents in this collection, and it's about 100GB in size.
Which type of mongodb operation would be the most efficient one?

The most efficient way to do this needs the upcoming release of MongoDB as of this writing. Use the $split operator to split the string on ".", then assign the last element of the resulting array to a variable using the $let and $arrayElemAt operators.
Next, the $switch operator applies a case statement to that variable.
The case condition uses $indexOfCP, which returns -1 when there is no occurrence of "test", so the $gt comparison against -1 is true only when the value contains "test". In that case the "then" expression splits the string on "-" and returns the $concat of "-" and the first element of the newly computed array. If the condition evaluates to false, the default branch just returns the variable.
let cursor = db.collection.aggregate(
[
{ "$project": {
"data": 1,
"version": {
"$let": {
"vars": {
"v": {
"$arrayElemAt": [
{ "$split": [ "$version", "." ] },
-1
]
}
},
"in": {
"$switch": {
"branches": [
{
"case": {
"$gt": [
{ "$indexOfCP": [ "$$v", "test" ] },
-1
]
},
"then": {
"$concat": [
"-",
"",
{ "$arrayElemAt": [
{ "$split": [ "$$v", "-" ] },
0
]}
]
}
}
],
"default": "$$v"
}
}
}
}
}}
]
)
The aggregation query produces something like this:
{ "_id" : ObjectId("57a98773cbbd42a2156260d8"), "data" : 11, "version" : "32" }
{ "_id" : ObjectId("57a98773cbbd42a2156260d9"), "data" : 55, "version" : "-42" }
As you can see, the "version" values are strings. If the data type of that field does not matter, you can simply use the $out aggregation pipeline stage to write the result into a new collection or replace your collection.
{ "$out": "collection" }
If you need to convert the values to numbers, MongoDB does not provide a way to do type conversion out of the box, except for integer to string. The only option is to iterate the aggregation Cursor object, convert each value using parseFloat or Number, then update your documents using the $set operator and the bulkWrite() method for maximum efficiency.
let requests = [];
cursor.forEach(doc => {
requests.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": {
"$set": {
"data": doc.data,
"version": parseFloat(doc.version)
}
}
}
});
if ( requests.length === 1000 ) {
// Execute per 1000 ops and re-init
db.collection.bulkWrite(requests);
requests = [];
}}
);
// Clean up queues
if(requests.length > 0) {
db.collection.bulkWrite(requests);
}
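The batching in the loop above can be checked without a database. The following is a minimal sketch in plain JavaScript, assuming the aggregation results are available as an array of documents. buildBatches is a hypothetical helper name, not part of the MongoDB API; each batch it returns is what would be handed to bulkWrite().

```javascript
// Hypothetical helper (not a MongoDB API): turns aggregation output into
// batches of updateOne operations, each batch holding at most batchSize requests.
function buildBatches(docs, batchSize) {
  const batches = [];
  let requests = [];
  for (const doc of docs) {
    requests.push({
      updateOne: {
        filter: { _id: doc._id },
        // parseFloat turns "32" into 32 and "-42" into -42
        update: { $set: { version: parseFloat(doc.version) } }
      }
    });
    if (requests.length === batchSize) {
      // A full batch is ready: store it and start a new one
      batches.push(requests);
      requests = [];
    }
  }
  // Flush whatever is left over after the last full batch
  if (requests.length > 0) {
    batches.push(requests);
  }
  return batches;
}

// Sample input shaped like the aggregation output above (made-up _id values)
const sample = [
  { _id: 1, data: 11, version: "32" },
  { _id: 2, data: 55, version: "-42" },
  { _id: 3, data: 7, version: "5" }
];
const batches = buildBatches(sample, 2);
console.log(batches.length); // 2
```

With a batch size of 2 the three sample documents produce one full batch and one leftover batch, which is the same shape as the "execute per 1000 ops, then clean up" logic in the loop.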
While the aggregation query works in MongoDB 3.4 or newer, our best bet from MongoDB 3.2 backwards is mapReduce with the bulkWrite() method.
var results = db.collection.mapReduce(
function() {
var v = this.version.split(".")[2];
emit(this._id, v.indexOf("-") > -1 ? "-"+v.replace(/\D+/g, '') : v)
},
function(key, value) {},
{ "out": { "inline": 1 } }
)["results"];
results looks like this:
[
{
"_id" : ObjectId("57a98773cbbd42a2156260d8"),
"value" : "32"
},
{
"_id" : ObjectId("57a98773cbbd42a2156260d9"),
"value" : "-42"
}
]
From here you use the previous .forEach loop to update your documents.
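The string handling inside the map function can be tried on its own. This is a small sketch that lifts the same expression into a standalone function; extractVersion is a name chosen here for illustration only.

```javascript
// Same logic as the body of the map function, without emit()
function extractVersion(version) {
  // "0.0.42-test" -> "42-test", "0.0.32" -> "32"
  const v = version.split(".")[2];
  // A "-" marks a suffixed version: keep only the digits and prepend "-"
  return v.indexOf("-") > -1 ? "-" + v.replace(/\D+/g, "") : v;
}

console.log(extractVersion("0.0.32"));      // "32"
console.log(extractVersion("0.0.42-test")); // "-42"
```

The values are still strings at this point, which is why the update loop converts them with parseFloat.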
From MongoDB 2.6 to 3.0 you will need to use the now deprecated Bulk() API and its associated methods, as shown in my answer here.

Related

How can I unset all document properties except for one or two in mongodb?

I have a collection of documents about entities that have status property that could be 1 or 0. Every document contains a lot of data and occupies space.
I want to get rid of most of the data on the documents with status equal 0.
So, I want every document in the collection that looks like
{
_id: 234,
myCode: 101,
name: "sfsdf",
status: 0,
and: 23243423.1,
a: "dsf",
lot: 3234,
more: "efsfs",
properties: "sdfsd"
}
...to be a lot smaller
{
_id: 234,
myCode: 101,
status: 0
}
So, basically I can do
db.getCollection('docs').update(
{'statusCode': 0},
{
$unset: {
and: "",
a: "",
lot: "",
more: "",
properties: ""
}
},
{multi:true}
)
But there are about 40 properties which would be a huge list, and also I'm not sure that all the objects follow the same schema.
Is there a way to unset all except two properties?
The best thing to do here is to actually pass all the possible properties to $unset and let it do its job. You cannot "wildcard" such arguments, so there really is not a better way without writing to another collection.
If you don't want to type them all out or even know all of them, then simply perform a process to "collect" all the other top level properties.
You can do this for example with .mapReduce():
var fields = db.getCollection('docs').mapReduce(
function() {
Object.keys(this)
.filter(k => k !== '_id' && k !== 'myCode')
.forEach( k => emit(k,1) )
},
function() {},
{
"out": { "inline": 1 }
}
).results.map( o => o._id )
.reduce((acc,curr) => Object.assign(acc,{ [curr]: "" }),{})
Gives you an object with the full fields list to provide to $unset as:
{
"a" : "",
"and" : "",
"lot" : "",
"more" : "",
"name" : "",
"properties" : "",
"status" : ""
}
And that is taken from all possible top level fields in the whole collection.
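The "collect every other top-level key" step is easy to try on a plain array first. This is a sketch only: collectUnsetFields is a hypothetical helper, the sample documents are made up, and the keep list here also names status on the assumption that it should survive, as the question asks.

```javascript
// Hypothetical helper: builds the argument for $unset from every top-level
// key seen in docs, skipping the keys listed in keep.
function collectUnsetFields(docs, keep) {
  const fields = {};
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      if (!keep.includes(key)) {
        fields[key] = ""; // $unset ignores the value, "" is conventional
      }
    }
  }
  return fields;
}

// Made-up documents that do not share the same schema
const docs = [
  { _id: 234, myCode: 101, name: "sfsdf", status: 0, a: "dsf" },
  { _id: 235, myCode: 102, status: 1, lot: 3234 }
];

const unsetFields = collectUnsetFields(docs, ["_id", "myCode", "status"]);
console.log(Object.keys(unsetFields).sort()); // [ 'a', 'lot', 'name' ]
```

Because the keys are gathered across all documents, it does not matter that individual documents are missing some of them.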
You can do the same thing with .aggregate() in MongoDB 3.4 using $objectToArray:
var fields = db.getCollection('docs').aggregate([
{ "$project": {
"fields": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"as": "d",
"cond": {
"$and": [
{ "$ne": [ "$$d.k", "_id" ] },
{ "$ne": [ "$$d.k", "myCode" ] }
]
}
}
}
}},
{ "$unwind": "$fields" },
{ "$group": {
"_id": "$fields.k"
}}
]).map( o => o._id )
.reduce((acc,curr) => Object.assign(acc,{ [curr]: "" }),{});
Whatever way you obtain the list of names, then simply send them to $unset:
db.getCollection('docs').update(
{ "statusCode": 0 },
{ "$unset": fields },
{ "multi": true }
)
Bottom line is that $unset does not care if the properties are present in the document or not, but will simply remove them where they exist.
The alternate case is to simply write everything out to a new collection if that also suits your needs. This is a simple use of $out as an aggregation pipeline stage:
db.getCollection('docs').aggregate([
{ "$match": { "statusCode": 0 } },
{ "$project": { "myCode": 1 } },
{ "$out": "newdocs" }
])

MongoDB query for values contained in array and results contain only those values

Let's say I have the following DB:
pizzas = [{
name: "pizza1",
toppings: ['mushrooms', 'pepperoni', 'sausage']
},
{
name: "pizza2",
toppings: ['mushrooms', 'pepperoni']
},
{
name: "pizza3",
toppings: ['mushrooms', 'onions']
},
{
name: "pizza4",
toppings: ['mushrooms']
}]
Now I want to fetch the pizzas that have 'mushrooms', 'pepperoni', or 'onions' and any combination of those. Then the query could be:
pizzas.find({toppings: ['mushrooms', 'pepperoni', 'onions']})
This would return all four pizzas in my db. But here's the problem. What if I wanted pizzas with any combination of only those three toppings, i.e. a pizza can not contain a different topping like 'sausage'. For this query, I only want "pizza2", "pizza3", and "pizza4" to be returned. I could make a query like:
pizzas.find({$and: [{toppings: ['mushrooms', 'pepperoni', 'onions']}, {$not: {toppings: ['sausage']}}]})
The problem is that this requires me to know all of the possible toppings to exclude. Is there a better way to construct this query?
You basically need to find the "Set Difference" between the stored array and the desired list, and see if there are any stored items that are not one of the desired ingredients. If the size of that difference is greater than 0, the pizza contains a topping that is not in the desired list.
If you have at least MongoDB 2.6, there is a $setDifference operator you can use in a $redact statement:
db.pizzas.aggregate([
{ "$match": {
"toppings": { "$in": [ "mushrooms", "pepperoni", "onions" ] }
}},
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$size": {
"$setDifference": [
"$toppings",
[ "mushrooms", "pepperoni", "onions" ]
]
}},
0
]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
If your MongoDB is older than that, then you can implement the same logic in JavaScript using $where:
db.pizzas.find({
"toppings": { "$in": [ "mushrooms", "pepperoni", "onions" ] },
"$where": function() {
return this.toppings.filter(function(topping) {
return [ "mushrooms", "pepperoni", "onions" ].indexOf(topping) == -1;
}).length == 0;
}
})
Both exclude "pizza1" from results by the same comparison, with the native operators in .aggregate() being faster:
{
"_id" : ObjectId("564d44a59f28c6e0feabceea"),
"name" : "pizza2",
"toppings" : [
"mushrooms",
"pepperoni"
]
}
{
"_id" : ObjectId("564d44a59f28c6e0feabceeb"),
"name" : "pizza3",
"toppings" : [
"mushrooms",
"onions"
]
}
{
"_id" : ObjectId("564d44a59f28c6e0feabceec"),
"name" : "pizza4",
"toppings" : [
"mushrooms"
]
}
Note that it is still wise to use $in to filter first, as it at least narrows down the possible results and does not need a brute force match of the whole collection. You use it as opposed to a "raw array" as in your question, since your demonstrated form would match only documents whose array is exactly that array, in that order.
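The set-difference test used in both queries can be sketched in plain JavaScript against the sample data from the question. The function name onlyAllowed is an assumption made for this illustration.

```javascript
const allowed = ["mushrooms", "pepperoni", "onions"];

// True when toppings minus allowedList is empty (the $setDifference check)
function onlyAllowed(toppings, allowedList) {
  const extra = toppings.filter(t => !allowedList.includes(t));
  return extra.length === 0;
}

const pizzas = [
  { name: "pizza1", toppings: ["mushrooms", "pepperoni", "sausage"] },
  { name: "pizza2", toppings: ["mushrooms", "pepperoni"] },
  { name: "pizza3", toppings: ["mushrooms", "onions"] },
  { name: "pizza4", toppings: ["mushrooms"] }
];

const names = pizzas
  // Equivalent of the $in pre-filter: at least one allowed topping
  .filter(p => p.toppings.some(t => allowed.includes(t)))
  // Equivalent of the $redact / $where step: no topping outside the list
  .filter(p => onlyAllowed(p.toppings, allowed))
  .map(p => p.name);

console.log(names); // [ 'pizza2', 'pizza3', 'pizza4' ]
```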

How to select custom data in mongo

Is there a way how to include custom data in the mongo query response?
What I mean is a mongo alternative for something like this in MySQL code:
SELECT
value,
'7' AS min_value
FROM
my_table
WHERE
value >= 7
...while the 7 should probably be a variable in the language where the mongo query is being called from.
Try the $literal operator if using the aggregation framework with a $match pipeline step as your query filter. For example, create a sample collection in mongo shell that has 10 test documents with the value field as an increasing integer (0 to 9):
for(x=0;x<10;x++){ db.my_table.insert({value: x }) }
Running the following aggregation pipeline:
var base = 7;
db.my_table.aggregate([
{
"$match": {
"value": { "$gte": base }
}
},
{
"$project": {
"value": 1,
"min_value": { "$literal": base }
}
}
])
would produce the result:
/* 0 */
{
"result" : [
{
"_id" : ObjectId("561e2bcc3d8f561c1548d39b"),
"value" : 7,
"min_value" : 7
},
{
"_id" : ObjectId("561e2bcc3d8f561c1548d39c"),
"value" : 8,
"min_value" : 7
},
{
"_id" : ObjectId("561e2bcc3d8f561c1548d39d"),
"value" : 9,
"min_value" : 7
}
],
"ok" : 1
}
The only things in MongoDB query actions that actually "modify" the results returned, beyond the original document or "field selection", are the .aggregate() method or the JavaScript manipulation alternative in mapReduce.
Otherwise documents are returned "as is", or at least with just the selected fields or array entry specified.
So if you want something else returned from the server, then you need to use one of those methods:
var seven = 7;
db.collection.aggregate([
{ "$match": {
"value": { "$gt": seven }
}},
{ "$project": {
"value": 1,
"min_value": { "$literal": seven }
}}
])
This is where the $literal operator comes into play. In versions prior to 2.6 and greater or equal to 2.2 (when the aggregation framework was introduced) you can use $const instead:
var seven = 7;
db.collection.aggregate([
{ "$match": {
"value": { "$gt": seven }
}},
{ "$project": {
"value": 1,
"min_value": { "$const": seven }
}}
])
Or just use mapReduce and its JavaScript translation:
var seven = 7;
db.collection.mapReduce(
function() {
emit(this._id,{ "value": this.value, "min_value": seven });
},
function() {}, // no reduce at all since all _id unique
{
"out": { "inline": 1 },
"query": { "value": { "$gt": seven } },
"scope": { "seven": seven }
}
);
Those are basically your options.

MongoDB lists - get every Nth item

I have a Mongodb schema that looks roughly like:
[
{
"name" : "name1",
"instances" : [
{
"value" : 1,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 2,
"date" : ISODate("2015-04-01T00:00:00.000Z")
},
{
"value" : 2.5,
"date" : ISODate("2015-03-05T00:00:00.000Z")
},
...
]
},
{
"name" : "name2",
"instances" : [
...
]
}
]
where the number of instances for each element can be quite big.
I sometimes want to get only a sample of the data, that is, get every 3rd instance, or every 10th instance... you get the picture.
I can achieve this goal by getting all instances and filtering them in my server code, but I was wondering if there's a way to do it by using some aggregation query.
Any ideas?
Updated
Assuming the data structure was flat as @SylvainLeroux suggested below, that is:
[
{"name": "name1", "value": 1, "date": ISODate("2015-03-04T00:00:00.000Z")},
{"name": "name2", "value": 5, "date": ISODate("2015-04-04T00:00:00.000Z")},
{"name": "name1", "value": 2, "date": ISODate("2015-04-01T00:00:00.000Z")},
{"name": "name1", "value": 2.5, "date": ISODate("2015-03-05T00:00:00.000Z")},
...
]
will the task of getting every Nth item (of specific name) be easier?
It seems that your question clearly asked "get every nth instance" which does seem like a pretty clear question.
Query operations like .find() can really only return the document "as is" with the exception of general field "selection" in projection and operators such as the positional $ match operator or $elemMatch that allow a singular matched array element.
Of course there is $slice, but that just allows a "range selection" on the array, so again does not apply.
The "only" things that can modify a result on the server are .aggregate() and .mapReduce(). The former does not "play very well" with "slicing" arrays in any way, at least not by "n" elements. However since the "function()" arguments of mapReduce are JavaScript based logic, then you have a little more room to play with.
For analytical processes, and for analytical purposes "only", just filter the array contents via mapReduce using .filter():
db.collection.mapReduce(
function() {
var id = this._id;
delete this._id;
// filter the content of "instances" to every 3rd item only
this.instances = this.instances.filter(function(el,idx) {
return ((idx+1) % 3) == 0;
});
emit(id,this);
},
function() {},
{ "out": { "inline": 1 } } // or output to collection as required
)
It's really just a "JavaScript runner" at this point, but if this is just for analysis/testing then there is nothing generally wrong with the concept. Of course the output is not "exactly" how your document is structured, but it's as near a facsimile as mapReduce can get.
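The filter condition used in the map function can be tried on a plain array. This is a minimal sketch; everyNth is a name picked here for illustration.

```javascript
// Keeps the nth, 2nth, 3nth ... element, using the same (idx + 1) % n test
function everyNth(items, n) {
  return items.filter((el, idx) => (idx + 1) % n === 0);
}

console.log(everyNth([1, 2, 3, 4, 5, 6, 7], 3)); // [ 3, 6 ]
```

The "+ 1" makes the count one-based, so with n = 3 the elements at indexes 2 and 5 are kept.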
The other suggestion I see here requires creating a new collection with all the items "denormalized" and inserting the "index" from the array as part of the unique _id key. That may produce something you can query directly, but for the "every nth item" you would still have to do:
db.resultCollection.find({
"_id.index": { "$in": [2,5,8,11,14] } // and so on ....
})
So work out and provide the index value of "every nth item" in order to get "every nth item". So that doesn't really seem to solve the problem that was asked.
If the output form seemed more desirable for your "testing" purposes, then a better subsequent query on those results would be using the aggregation pipeline, with $redact:
db.newCollection.aggregate([
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$mod": [ { "$add": [ "$_id.index", 1] }, 3 ] },
0 ]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
That at least uses a "logical condition" much the same as what was applied with .filter() before to just select the "nth index" items without listing all possible index values as a query argument.
No $unwind is needed here. You can use $push with $arrayElemAt to project the array value at requested index inside $group aggregation.
Something like
db.colname.aggregate(
[
{"$group":{
"_id":null,
"valuesatNthindex":{"$push":{"$arrayElemAt":["$instances",N]}
}}
},
{"$project":{"valuesatNthindex":1}}
])
You might like this approach using the $lookup aggregation. It is probably the most convenient and fastest way, without any aggregation trick.
Create a collection Names with the following schema
[
{ "_id": 1, "name": "name1" },
{ "_id": 2, "name": "name2" }
]
and then Instances collection having the parent id as "nameId"
[
{ "nameId": 1, "value" : 1, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 1, "value" : 2, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 1, "value" : 3, "date" : ISODate("2015-03-05T00:00:00.000Z") },
{ "nameId": 2, "value" : 7, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 2, "value" : 8, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 2, "value" : 4, "date" : ISODate("2015-03-05T00:00:00.000Z") }
]
Now with the $lookup aggregation 3.6 syntax you can use $sample inside the $lookup pipeline to pick N instances at random for each name. Note that this is a random sample of size N, not strictly every Nth element.
db.Names.aggregate([
{ "$lookup": {
"from": Instances.collection.name,
"let": { "nameId": "$_id" },
"pipeline": [
{ "$match": { "$expr": { "$eq": ["$nameId", "$$nameId"] }}},
{ "$sample": { "size": N }}
],
"as": "instances"
}}
])
You can test it here
Unfortunately, with the aggregation framework it's not possible, as this would require an option for $unwind to emit the array index/position, which aggregation currently can't handle. There is an open JIRA ticket for this: SERVER-4588.
However, a workaround would be to use MapReduce but this comes at a huge performance cost since the actual calculations of getting the array index are performed using the embedded JavaScript engine (which is slow), and there still is a single global JavaScript lock, which only allows a single JavaScript thread to run at a single time.
With mapReduce, you could try something like this:
Mapping function:
var map = function(){
for(var i=0; i < this.instances.length; i++){
emit(
{ "_id": this._id, "index": i },
{ "index": i, "value": this.instances[i] }
);
}
};
Reduce function:
var reduce = function(){}
You can then run the following mapReduce function on your collection:
db.collection.mapReduce( map, reduce, { out : "resultCollection" } );
And then you can query the result collection to get a list/array of every Nth item of the instances array by using the map() cursor method:
var thirdInstances = db.resultCollection.find({"_id.index": N})
.map(function(doc){return doc.value.value})
You can use below aggregation:
db.col.aggregate([
{
$project: {
instances: {
$map: {
input: { $range: [ 0, { $size: "$instances" }, N ] },
as: "index",
in: { $arrayElemAt: [ "$instances", "$$index" ] }
}
}
}
}
])
$range generates a list of indexes. The third parameter is a non-zero step. For N = 2 it will be [0,2,4,6...], for N = 3 it will be [0,3,6,9...] and so on. Then you can use $map to get the corresponding items from the instances array.
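The index arithmetic behind $range and $map can be mirrored in plain JavaScript. This is a sketch under the assumption that range below behaves like $range for a positive step; both function names are made up for the example.

```javascript
// Mimics $range for a positive step: start, start + step, ... below end
function range(start, end, step) {
  const out = [];
  for (let i = start; i < end; i += step) {
    out.push(i);
  }
  return out;
}

// Mimics the $map over the generated indexes with $arrayElemAt
function pickByStep(items, n) {
  return range(0, items.length, n).map(i => items[i]);
}

console.log(range(0, 7, 3));                                       // [ 0, 3, 6 ]
console.log(pickByStep(["a", "b", "c", "d", "e", "f", "g"], 3));   // [ 'a', 'd', 'g' ]
```

Note that this selection starts at index 0. To pick the Nth, 2Nth, ... items instead, the range would have to start at N - 1.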
Or with just a find block:
db.Collection.find({}).then(function(data) {
var ret = [];
for (var i = 0, len = data.length; i < len; i++) {
if (i % 3 === 0 ) {
ret.push(data[i]);
}
}
return ret;
});
Returns a promise whose then() you can invoke to fetch the Nth modulo'ed data.

Mongodb: find documents with array field that contains more than one SAME specified value

There are three documents in collection test:
// document 1
{
"id": 1,
"score": [3,2,5,4,5]
}
// document 2
{
"id": 2,
"score": [5,5]
}
// document 3
{
"id": 3,
"score": [5,3,3]
}
I want to fetch documents whose score field contains [5,5].
query:
db.test.find( {"score": {"$all": [5,5]}} )
will return document 1, 2 and 3, but I only want to fetch document 1 and 2.
How can I do this?
After reading your problem, I personally think MongoDB does not yet support this kind of query. If anyone knows how to do this with a mongo query, they will definitely post an answer here.
But I think this is possible using the mongo forEach method, so the code below will match your criteria:
db.collectionName.find().forEach(function(myDoc) {
var scoreCounts = {};
var arr = myDoc.score;
for (var i = 0; i < arr.length; i++) {
var num = arr[i];
scoreCounts[num] = scoreCounts[num] ? scoreCounts[num] + 1 : 1;
}
if (scoreCounts[5] >= 2) { //scoreCounts[5] this find occurrence of 5
printjsononeline(myDoc);
}
});
Changed in version 2.6.
The $all is equivalent to an $and operation of the specified values; i.e. the following statement:
{ tags: { $all: [ "ssl" , "security" ] } }
is equivalent to:
{ $and: [ { tags: "ssl" }, { tags: "security" } ] }
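That equivalence explains the result in the question: with plain values, each listed value is tested for membership separately, so repeating 5 adds nothing. A small plain JavaScript sketch of that behaviour, where matchesAll is a name invented for the example:

```javascript
// Rough model of $all with plain values: every required value must be present
function matchesAll(values, required) {
  return required.every(r => values.includes(r));
}

const testDocs = [
  { id: 1, score: [3, 2, 5, 4, 5] },
  { id: 2, score: [5, 5] },
  { id: 3, score: [5, 3, 3] }
];

const allMatches = testDocs
  .filter(d => matchesAll(d.score, [5, 5]))
  .map(d => d.id);

console.log(allMatches); // [ 1, 2, 3 ]
```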
I think you need to pass in a nested array -
So try
db.test.find( {"score": {"$all": [[5,5]]}} )
Source
Changed in version 2.6.
When passed an array of a nested array (e.g. [ [ "A" ] ] ), $all can now match documents where the field contains the nested array as an element (e.g. field: [ [ "A" ], ... ]), or the field equals the nested array (e.g. field: [ "A" ]).
http://docs.mongodb.org/manual/reference/operator/query/all/
You can do it with an aggregation. The first step can use an index on { "score" : 1 } but the rest is hard work.
db.test.aggregate([
{ "$match" : { "score" : 5 } },
{ "$unwind" : "$score" },
{ "$match" : { "score" : 5 } },
{ "$group" : { "_id" : "$_id", "sz" : { "$sum" : 1 } } }, // use $first here to include other fields in the results
{ "$match" : { "sz" : { "$gte" : 2 } } }
])
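What the pipeline computes is a per-document count of the value 5, kept when the count is at least 2. The same check on the question's three documents, as a plain JavaScript sketch with a hypothetical helper name:

```javascript
// True when target occurs at least minCount times in values
function hasAtLeast(values, target, minCount) {
  return values.filter(v => v === target).length >= minCount;
}

const scoreDocs = [
  { id: 1, score: [3, 2, 5, 4, 5] },
  { id: 2, score: [5, 5] },
  { id: 3, score: [5, 3, 3] }
];

const matched = scoreDocs
  .filter(d => hasAtLeast(d.score, 5, 2))
  .map(d => d.id);

console.log(matched); // [ 1, 2 ]
```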