Mongo aggregation vs Java for loop and performance - mongodb

I have the following Mongo document stored:
{
"Field1": "ABC",
"Field2": [
{ "Field3": "ABC1","Field4": [ {"id": "123" }, { "id" : "234" }, { "id":"345" }] },
{ "Field3": "ABC2","Field4": [ {"id": "123" }, { "id" : "234" }, { "id":"345" }] },
{ "Field3": "ABC3","Field4": [{ "id":"345" }] }
]
}
From the above, I want to fetch the subdocuments that contain id "123",
i.e.
{
"Field3" : "ABC1",
"Field4" : [ { "id": "123"} ]
} ,
{
"Field3" : "ABC2",
"Field4" : [ { "id": "123"} ]
}
1. Java way
A. use Mongo find method to get the ABC document from Mongo DB
B. for Loop to Iterate the Field2 Json Array
C. Again for Loop to Iterate over Field4 Json Array
D. Inside the nested for loop I've if condition to Match id value to "123"
E. Store the Matching subdocument into List
2. Mongo Way
A. Use an aggregation query to get the desired output from the DB. No loops or conditions on the Java side.
B. The aggregation query has the following stages:
I) $match - match the ABC document
II) $unwind - Field2
III) $unwind - Field4
IV) $match - match the id (value is "123")
V) $group - group the document based on Field3 (based on "ABC1" or "ABC2")
VI) execute aggregation and return results
Both work and return the correct results.
The question is: which one is better to follow, and why? I use the aggregation in a RESTful service GET method, so will executing the aggregation query 1000 or more times in parallel cause any performance problems?

With Aggregation, the whole query is executed as a single process on the MongoDB server - the application program will get the results cursor from the server.
With the Java program you also get a cursor from the database server as input to the processing in the application. In this case the response from the server is a larger set of data and uses more network bandwidth. The application then has to process it, which adds more steps to complete the query.
I think the aggregation option is a better choice - as all the processing (the initial match and filtering the array) happens on the database server as a single process.
Also, note that the aggregation steps you posted can be done more efficiently. Instead of multiple stages (II, III, IV and V), you can do those operations in two stages: use $addFields with $map on the outer array and $filter on the inner array, and then $filter the outer array.
The aggregation:
db.test.aggregate( [
{
$addFields: {
Field2: {
$map: {
input: "$Field2",
as: "fld2",
in: {
Field3: "$$fld2.Field3",
Field4: {
$filter: {
input: "$$fld2.Field4",
as: "fld4",
cond: { $eq: [ "$$fld4.id", "123" ] }
}
}
}
}
}
}
},
{
$addFields: {
Field2: {
$filter: {
input: "$Field2",
as: "f2",
cond: { $gt: [ { $size: "$$f2.Field4" }, 0 ] }
}
}
}
},
] )
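As a rough illustration of what those two stages compute, the same logic can be written in plain JavaScript against the sample document from the question. This is only a sketch that runs without a database; the function name keepMatchingIds is hypothetical, and in practice the work would stay in the aggregation on the server.

```javascript
// Sample document from the question.
const doc = {
  Field1: "ABC",
  Field2: [
    { Field3: "ABC1", Field4: [{ id: "123" }, { id: "234" }, { id: "345" }] },
    { Field3: "ABC2", Field4: [{ id: "123" }, { id: "234" }, { id: "345" }] },
    { Field3: "ABC3", Field4: [{ id: "345" }] }
  ]
};

// Hypothetical helper mirroring the two $addFields stages.
function keepMatchingIds(document, id) {
  // Stage 1: $map over Field2 and $filter each inner Field4 by id.
  const mapped = document.Field2.map(function (fld2) {
    return {
      Field3: fld2.Field3,
      Field4: fld2.Field4.filter(function (fld4) { return fld4.id === id; })
    };
  });
  // Stage 2: $filter the outer array, dropping entries whose Field4 is now empty.
  const filtered = mapped.filter(function (f2) { return f2.Field4.length > 0; });
  return Object.assign({}, document, { Field2: filtered });
}

const result = keepMatchingIds(doc, "123");
console.log(JSON.stringify(result.Field2));
```

Only the ABC1 and ABC2 entries survive, each with the single matching Field4 element, which is the shape asked for in the question.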

The second way is probably better because it returns a smaller result from the datastore; shlepping bits over the wire is expensive.

Related

how to project fields using another field's value in mongo db?

I have a mongo document like this:
{"_id": {"$oid":"xx"} ,"start": "a", "elements": {"a":"large object", "b": "large object"}}
My expected query result is to project only the element named by start; in this case, it is {"elements.a": "large object"}. But since the value of "start" is unknown before the query, I don't know how to write the query.
2 undesirable alternatives:
One way I could figure out is to query once by _id and project start to get "a", and then query again for elements.a.
Another way is to query everything and pick the start element in code. But I don't want to query everything at once, because the document may be very large.
You can make use of $objectToArray, $arrayToObject and $filter operators.
The below query will be helpful:
db.collection.aggregate([
{
$project: {
elements: {
$arrayToObject: {
$filter: {
input: {
$objectToArray: "$elements"
},
as: "e",
cond: {
$eq: [
"$$e.k",
"$start"
]
}
}
}
}
}
}
])
Output:
[
{
"_id": 1,
"elements": {
"a": "large object"
}
}
]
I hope, this is what you want.
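To make the three operators easier to follow, here is a plain-JavaScript sketch of the same idea. It is not the query itself, only an analogy that runs without a database; the helper name projectStartElement is hypothetical.

```javascript
// Sample document shaped like the one in the question.
const doc = { _id: 1, start: "a", elements: { a: "large object", b: "large object" } };

// Hypothetical helper: keep only the entry of `elements` whose key equals `start`.
function projectStartElement(document) {
  // Roughly what $objectToArray does: turn the object into key/value pairs.
  const pairs = Object.entries(document.elements);
  // Roughly what $filter with { $eq: ["$$e.k", "$start"] } does.
  const kept = pairs.filter(function (pair) { return pair[0] === document.start; });
  // Roughly what $arrayToObject does: rebuild an object from the remaining pairs.
  return { _id: document._id, elements: Object.fromEntries(kept) };
}

console.log(JSON.stringify(projectStartElement(doc)));
```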

MongoDB aggregate query return document based on match query and priority values

In a MongoDB collection, I have the following documents:
{"id":"1234","name":"John","stateCode":"CA"}
{"id":"1234","name":"Smith","stateCode":"CA"}
{"id":"1234","name":"Tony","stateCode":"GA"}
{"id":"3323", "name":"Neo","stateCode":"OH"}
{"id":"3323", "name":"Sam","stateCode":"US"}
{"id":"4343","name":"Bruce","stateCode":"NV"}
I am trying to write a Mongo aggregate query which does the following:
Match based on the id field.
Give more priority to documents having values other than "NV" or "GA" in the stateCode field.
If all the documents have either "NV" or "GA", then ignore the priority.
If any of the documents has a stateCode other than "NV" or "GA", then return those documents.
Example 1:
id = "1234"
then return
{"id":"1234","name":"John","stateCode":"CA"}
{"id":"1234","name":"Smith","stateCode":"CA"}
Example 2:
id = "4343"
then return
{"id":"4343","name":"Bruce","stateCode":"NV"}
Could you please help with a query to achieve this?
I tried a query, but I am stuck with this error:
Failed to execute script.
Error: command failed: {
"ok" : 0,
"errmsg" : "input to $filter must be an array not string",
"code" : 28651,
"codeName" : "Location28651"
} : aggregate failed
Query :
db.getCollection('emp').aggregate([{$match:{
'id': "1234"
}
},
{
$project: {
"data": {
$filter: {
input: "$stateCode",
as: "data",
cond: { $ne: [ "$data", "GA" ],$ne: [ "$data", "NV" ] }
}
}
}
}
])
I actually recommend splitting this into two queries: first try to find documents whose stateCode is neither "NV" nor "GA", and if that returns nothing, retrieve the rest.
With that said, here is a working pipeline that does it in one go. Because we can't know in advance whether the condition holds, we need to iterate over all the documents that match the id. This makes it VERY inefficient when the id is shared by many documents; if that cannot happen, using this pipeline is fine.
db.getCollection('emp').aggregate([
{
$match: {
'id': "1234"
}
},
{ //we have to group so we can check
$group: {
_id: null,
docs: {$push: "$$ROOT"}
}
},
{
$addFields: {
highPriorityDocs: {
$filter: {
input: "$docs",
as: "doc",
cond: {$and: [{$ne: ["$$doc.stateCode", "NV"]}, {$ne: ["$$doc.stateCode", "GA"]}]}
}
}
}
},
{
$project: {
finalDocs: {
$cond: [ // if size of high priority docs gt 0 return them.
{$gt: [{$size: "$highPriorityDocs"}, 0]},
"$highPriorityDocs",
"$docs"
]
}
}
},
{
$unwind: "$finalDocs"
},
{
$replaceRoot: {newRoot: "$finalDocs"}
}
])
The last two stages are just to restore the original structure, you can drop them if you don't care about it.
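The selection rule the pipeline implements can be sketched in plain JavaScript using the sample documents from the question. This is only an illustration of the logic, not a replacement for the pipeline; the function name selectByPriority is hypothetical.

```javascript
// Sample documents from the question.
const emp = [
  { id: "1234", name: "John", stateCode: "CA" },
  { id: "1234", name: "Smith", stateCode: "CA" },
  { id: "1234", name: "Tony", stateCode: "GA" },
  { id: "3323", name: "Neo", stateCode: "OH" },
  { id: "3323", name: "Sam", stateCode: "US" },
  { id: "4343", name: "Bruce", stateCode: "NV" }
];

// Hypothetical helper mirroring $match, $filter and the $cond on $size.
function selectByPriority(docs, id) {
  // $match: keep the documents with the requested id.
  const matched = docs.filter(function (d) { return d.id === id; });
  // $filter: high priority means stateCode is neither "NV" nor "GA".
  const highPriority = matched.filter(function (d) {
    return d.stateCode !== "NV" && d.stateCode !== "GA";
  });
  // $cond: return the high priority documents if there are any, otherwise all matches.
  return highPriority.length > 0 ? highPriority : matched;
}

console.log(selectByPriority(emp, "1234").map(function (d) { return d.name; }));
console.log(selectByPriority(emp, "4343").map(function (d) { return d.name; }));
```

For id "1234" this keeps John and Smith and drops Tony; for id "4343" it falls back to Bruce, matching the two examples in the question.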

MongoDB select from array based on multiple conditions [duplicate]

I have the following structure (this can't be changed; it is what I have to work with):
{
"_id" : ObjectId("abc123"),
"notreallyusedfields" : "dontcare",
"data" : [
{
"value" : "value1",
"otherSometimesInterestingFields": 1,
"type" : ObjectId("asd123=type1")
},
{
"value" : "value2",
"otherSometimesInterestingFields": 1,
"type" : ObjectId("asd1234=type2")
},
many others
]
}
So basically the fields for a schema are inside an array and they can be identified based on the type field inside 1 array element (1 schema field and its value is 1 element in the array). For me this is pretty strange, but I'm new to NoSQL so this may be ok. (Also, for different data some fields may be missing, and the element order in the data array is not guaranteed.)
Maybe it's easier to understand like this:
Table a: type1 column | type2 column | and so on (and these are stored in the array like the above)
My question is: how can you select multiple fields with conditions? What I mean is (in SQL): SELECT * FROM table WHERE type1=value1 AND type2=value2
I can select 1 like this:
db.a.find( {"data.type":ObjectId("asd1234=type2"), "data.value":"value2"}).pretty()
But I don't know how I could include that type1=value1, or even more conditions. And I think this is not even correct, because it can match any data.value field inside the array, so it doesn't guarantee that the element whose type is type2 is the one whose value is value2.
How would you solve this?
I was thinking of doing 1 query for 1 condition and then do another based on the result. So something like the pipeline for aggregation but as I see $match can't be used more times in an aggregation. I guess there is a way to pipeline these commands but this is pretty ugly.
What am I missing? Or because of the structure of the data I have to do these strange multiple queries?
I've also looked at $filter but the condition there also applies to any element of the array. (Or I'm doing it wrong)
Thanks!
Sorry if I'm not clear enough! Please ask and I can elaborate.
(Basically what I'm trying to do based on this structure ( Retrieve only the queried element in an object array in MongoDB collection ) is this: if shape is square then filter for blue colour, if shape is round then filter for red colour === if type is type1 value has to be value1, if type is type2 value has to be value2)
This can be done like:
db.document.find( { $and: [
{ type:ObjectId('abc') },
{ data: { $elemMatch: { type: a, value: DBRef(...)}}},
{ data: { $elemMatch: { type: b, value: "string"}}}
] } ).pretty()
So you can add any number of "conditions" using $and so you can specify that an element has to have type a and a value b, and another element type b and value c...
If you want to project only the matching elements then use aggregate with filter:
db.document.aggregate([
{$match: { $and: [
{ type:ObjectId('a') },
{ data: { $elemMatch: { Type: ObjectId("b"), value: DBRef(...)}}},
{ data: { $elemMatch: { Type: ObjectId("c"), value: "abc"}}}
] }
},
{$project: {
metadata: {
$filter: {
input: "$data",
as: "data",
cond: { $or: [
{$eq: [ "$$data.Type", ObjectId("a") ] },
{$eq: [ "$$data.Type", ObjectId("b") ] }]}
}
}
}
}
]).pretty()
This is pretty ugly so if there is a better way please post it! Thanks!
If you need to retrieve documents that have array elements matching multiple conditions, you have to use the $elemMatch query operator.
db.collection.find({
data: {
$elemMatch: {
type: "type1",
value: "value1"
}
}
})
This will output whole documents where an element matches.
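The difference from querying with two separate dotted fields ("data.type" and "data.value") can be sketched in plain JavaScript. The two helper names below are hypothetical and only mimic the matching rules; they are not MongoDB APIs.

```javascript
// A document where no single array element has both type1 and value1.
const doc = {
  data: [
    { type: "type1", value: "value2" },
    { type: "type2", value: "value1" }
  ]
};

// Mimics { "data.type": t, "data.value": v }:
// each condition may be satisfied by a different array element.
function matchesDotted(document, t, v) {
  return document.data.some(function (e) { return e.type === t; }) &&
         document.data.some(function (e) { return e.value === v; });
}

// Mimics { data: { $elemMatch: { type: t, value: v } } }:
// a single array element must satisfy both conditions.
function matchesElemMatch(document, t, v) {
  return document.data.some(function (e) { return e.type === t && e.value === v; });
}

console.log(matchesDotted(doc, "type1", "value1"));    // true
console.log(matchesElemMatch(doc, "type1", "value1")); // false
```

The dotted form matches this document even though type1 and value1 sit in different elements, which is exactly the false positive the question worries about.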
To output only first matching element in array, you can combine it with $elemMatch projection operator.
db.collection.find({
data: {
$elemMatch: {
type: "type1",
value: "value1"
}
}
},
{
data: {
$elemMatch: {
type: "type1",
value: "value1"
}
}
})
Warning: don't forget to project all the other fields you need outside the data array.
And if you need to output all matching elements in the array, then you have to use $filter in an aggregation $project stage, like this:
db.collection.aggregate([
{
$project: {
data: {
$filter: {
input: "$data",
as: "data",
cond: {
$and: [
{
$eq: [
"$$data.type",
"type1"
]
},
{
$eq: [
"$$data.value",
"value1"
]
}
]
}
}
}
}
}
])

How to update string field in mongodb and manipulate string values?

I have a MongoDB collection with some documents that have a field called Personal.FirstName and another field called Personal.Surname. Some documents are messed up and have the person's first name and last name in both fields. For example, there are some documents that have Personal.FirstName = 'John Doe' and Personal.Surname = 'John Doe'.
I want to write a mongo update statement that will do the following:
Find all of the documents that have a Personal section
Find all of the documents where Personal.FirstName == Personal.Surname
Update Personal.FirstName to be just the first part of Personal.FirstName before the space
Update Personal.Surname to be just the second part of Personal.Surname after the space
Is this possible in a mongo update statement? I am new to mongo and know very little about how to query it.
EDIT: here is an example document
{
"_id" : LUUID("fcd140b1-ec0f-0c49-aa79-fed00899290e"),
"Personal" : {
"FirstName" : "John Doe",
"Surname" : "John Doe"
}
}
you can't do this in a single query, but you can achieve this by iterating over result like this :
db.name.find({$and: [{Personal: {$exists: true}}, {$where: "this.Personal.FirstName == this.Personal.Surname"}]}).forEach(function(e,i){
var parts = e.Personal.FirstName.split(" ");
e.Personal.FirstName = parts[0];
e.Personal.Surname = parts[1];
db.name.save(e);
})
result:
{ "_id" : "fcd140b1-ec0f-0c49-aa79-fed00899290e", "Personal" : { "FirstName" : "John", "Surname" : "Doe" } }
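The split step deserves a little care: with parts[1], a value with more than two words loses everything after the second word, and a single-word value yields an undefined surname. Below is a minimal plain-JavaScript sketch of one way to handle that. The helper name splitFullName and the choice to treat everything after the first word as the surname are assumptions, not part of the answer above.

```javascript
// Hypothetical helper: split a combined "First Last" value into two fields.
function splitFullName(fullName) {
  // Split on any run of whitespace after trimming the ends.
  const parts = fullName.trim().split(/\s+/);
  return {
    FirstName: parts[0],
    // Assumption: everything after the first word belongs to the surname.
    Surname: parts.slice(1).join(" ")
  };
}

console.log(splitFullName("John Doe"));
console.log(splitFullName("Mary Ann Smith"));
```

Inside the forEach loop, the result of such a helper could be assigned to e.Personal.FirstName and e.Personal.Surname before saving.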
The idea is to get a subset of the documents from your collection by filtering for the documents that match the specified criteria. Once you have the subset, you iterate over it and update each document within a loop.
Now, to get the subset, you can run an aggregation pipeline, which is faster than filtering with find() and the $where operator. Take the following example aggregate() operation, which uses $redact as the filtering mechanism and then a $project stage to create additional fields that you can use in your update. The cursor returned by aggregate() can then be iterated with its forEach() method to update the matching documents in the collection:
db.collection.aggregate([
{
"$redact": {
"$cond": [
{
"$and": [
{ "$eq": [ "$Personal.FirstName", "$Personal.Surname" ] },
{
"$gt": [
{
"$size": {
"$split": ["$Personal.FirstName", " "]
}
},
0
]
}
]
},
"$$KEEP",
"$$PRUNE"
]
}
},
{
"$project": {
"FirstName": {
"$arrayElemAt": [
{ "$split": ["$Personal.FirstName", " "] },
0
]
},
"Surname": {
"$arrayElemAt": [
{ "$split": ["$Personal.FirstName", " "] },
1
]
}
}
}
]).forEach(function(doc) {
db.collection.updateOne(
{ "_id": doc._id },
{
"$set": {
"Personal.FirstName": doc.FirstName,
"Personal.Surname": doc.Surname,
}
}
)
})
Using the aggregation framework with the $redact pipeline operator allows you to process the logical condition with the $cond operator and uses the special operations $$KEEP to "keep" the document where the logical condition is true or $$PRUNE to "remove" the document where the condition was false.
This should improve performance significantly, because the $redact operator uses MongoDB's native operators, whilst a query with the $where operator calls the JavaScript engine to evaluate JavaScript code on every document, which can be very slow. (MongoDB evaluates non-$where query operations before $where expressions, and only the non-$where parts of a query may use an index.)

MongoDB lists - get every Nth item

I have a Mongodb schema that looks roughly like:
[
{
"name" : "name1",
"instances" : [
{
"value" : 1,
"date" : ISODate("2015-03-04T00:00:00.000Z")
},
{
"value" : 2,
"date" : ISODate("2015-04-01T00:00:00.000Z")
},
{
"value" : 2.5,
"date" : ISODate("2015-03-05T00:00:00.000Z")
},
...
]
},
{
"name" : "name2",
"instances" : [
...
]
}
]
where the number of instances for each element can be quite big.
I sometimes want to get only a sample of the data, that is, get every 3rd instance, or every 10th instance... you get the picture.
I can achieve this goal by getting all instances and filtering them in my server code, but I was wondering if there's a way to do it by using some aggregation query.
Any ideas?
Updated
Assuming the data structure was flat as @SylvainLeroux suggested below, that is:
[
{"name": "name1", "value": 1, "date": ISODate("2015-03-04T00:00:00.000Z")},
{"name": "name2", "value": 5, "date": ISODate("2015-04-04T00:00:00.000Z")},
{"name": "name1", "value": 2, "date": ISODate("2015-04-01T00:00:00.000Z")},
{"name": "name1", "value": 2.5, "date": ISODate("2015-03-05T00:00:00.000Z")},
...
]
will the task of getting every Nth item (of specific name) be easier?
It seems that your question clearly asked "get every nth instance" which does seem like a pretty clear question.
Query operations like .find() can really only return the document "as is" with the exception of general field "selection" in projection and operators such as the positional $ match operator or $elemMatch that allow a singular matched array element.
Of course there is $slice, but that just allows a "range selection" on the array, so again does not apply.
The "only" things that can modify a result on the server are .aggregate() and .mapReduce(). The former does not "play very well" with "slicing" arrays in any way, at least not by "n" elements. However since the "function()" arguments of mapReduce are JavaScript based logic, then you have a little more room to play with.
For analytical purposes "only", you can just filter the array contents via mapReduce using .filter():
db.collection.mapReduce(
function() {
var id = this._id;
delete this._id;
// filter the content of "instances" to every 3rd item only
this.instances = this.instances.filter(function(el,idx) {
return ((idx+1) % 3) == 0;
});
emit(id,this);
},
function() {},
{ "out": { "inline": 1 } } // or output to collection as required
)
It's really just a "JavaScript runner" at this point, but if this is just for analysis/testing then there is nothing generally wrong with the concept. Of course the output is not "exactly" how your document is structured, but it's as near a facsimile as mapReduce can get.
The other suggestion I see here requires creating a new collection with all the items "denormalized" and inserting the "index" from the array as part of the unique _id key. That may produce something you can query directly, but for the "every nth item" you would still have to do:
db.resultCollection.find({
"_id.index": { "$in": [2,5,8,11,14] } // and so on ....
})
So work out and provide the index value of "every nth item" in order to get "every nth item". So that doesn't really seem to solve the problem that was asked.
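To illustrate what "working out the index values" means in practice, here is a small plain-JavaScript sketch that produces the list passed to $in above. The helper name nthIndexes is hypothetical, and the array length has to be known up front, which is part of why this approach is awkward.

```javascript
// Hypothetical helper: zero-based positions of every nth item among `length` items.
function nthIndexes(n, length) {
  const indexes = [];
  // The first "nth" item sits at position n - 1, then every n positions after that.
  for (let i = n - 1; i < length; i += n) {
    indexes.push(i);
  }
  return indexes;
}

console.log(nthIndexes(3, 15).join(",")); // 2,5,8,11,14
```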
If the output form seemed more desirable for your "testing" purposes, then a better subsequent query on those results would be using the aggregation pipeline, with $redact:
db.newCollection.aggregate([
{ "$redact": {
"$cond": {
"if": {
"$eq": [
{ "$mod": [ { "$add": [ "$_id.index", 1] }, 3 ] },
0 ]
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
That at least uses a "logical condition" much the same as what was applied with .filter() before to just select the "nth index" items without listing all possible index values as a query argument.
No $unwind is needed here. You can use $push with $arrayElemAt to project the array value at requested index inside $group aggregation.
Something like
db.colname.aggregate(
[
{"$group":{
"_id":null,
"valuesatNthindex":{"$push":{"$arrayElemAt":["$instances",N]}
}}
},
{"$project":{"valuesatNthindex":1}}
])
You might like this approach using the $lookup aggregation. And probably the most convenient and fastest way without any aggregation trick.
Create a collection Names with the following schema
[
{ "_id": 1, "name": "name1" },
{ "_id": 2, "name": "name2" }
]
and then Instances collection having the parent id as "nameId"
[
{ "nameId": 1, "value" : 1, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 1, "value" : 2, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 1, "value" : 3, "date" : ISODate("2015-03-05T00:00:00.000Z") },
{ "nameId": 2, "value" : 7, "date" : ISODate("2015-03-04T00:00:00.000Z") },
{ "nameId": 2, "value" : 8, "date" : ISODate("2015-04-01T00:00:00.000Z") },
{ "nameId": 2, "value" : 4, "date" : ISODate("2015-03-05T00:00:00.000Z") }
]
Now, with the $lookup pipeline syntax available from MongoDB 3.6, you can use $sample inside the $lookup pipeline to pick N instances at random. Note that this is a random sample of size N, not strictly every Nth element.
db.Names.aggregate([
{ "$lookup": {
"from": Instances.collection.name,
"let": { "nameId": "$_id" },
"pipeline": [
{ "$match": { "$expr": { "$eq": ["$nameId", "$$nameId"] }}},
{ "$sample": { "size": N }}
],
"as": "instances"
}}
])
Unfortunately, this is not possible with the aggregation framework, as it would require an option for $unwind to emit the array index/position, which aggregation currently can't handle. There is an open JIRA ticket for this: SERVER-4588.
However, a workaround would be to use MapReduce but this comes at a huge performance cost since the actual calculations of getting the array index are performed using the embedded JavaScript engine (which is slow), and there still is a single global JavaScript lock, which only allows a single JavaScript thread to run at a single time.
With mapReduce, you could try something like this:
Mapping function:
var map = function(){
for(var i=0; i < this.instances.length; i++){
emit(
{ "_id": this._id, "index": i },
{ "index": i, "value": this.instances[i] }
);
}
};
Reduce function:
var reduce = function(){}
You can then run the following mapReduce function on your collection:
db.collection.mapReduce( map, reduce, { out : "resultCollection" } );
And then you can query the result collection to get a list/array of every Nth item of the instance array by using the map() cursor method:
var thirdInstances = db.resultCollection.find({"_id.index": N})
.map(function(doc){return doc.value.value})
You can use below aggregation:
db.col.aggregate([
{
$project: {
instances: {
$map: {
input: { $range: [ 0, { $size: "$instances" }, N ] },
as: "index",
in: { $arrayElemAt: [ "$instances", "$$index" ] }
}
}
}
}
])
$range generates a list of indexes. The third parameter is a non-zero step. For N = 2 it will be [0,2,4,6...], for N = 3 it will return [0,3,6,9...] and so on. Then you can use $map to get the corresponding items from the instances array.
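The same index arithmetic can be sketched in plain JavaScript to check which positions end up selected. This is only an analogy for the $range and $map combination; the helper names range and everyNth are hypothetical, and range here only handles a positive step.

```javascript
// Hypothetical helper mimicking { $range: [ start, end, step ] } for a positive step.
function range(start, end, step) {
  const out = [];
  for (let i = start; i < end; i += step) {
    out.push(i);
  }
  return out;
}

// Hypothetical helper mimicking $map + $arrayElemAt over the generated indexes.
function everyNth(instances, n) {
  return range(0, instances.length, n).map(function (index) {
    return instances[index];
  });
}

const instances = ["a", "b", "c", "d", "e", "f", "g"];
console.log(everyNth(instances, 3).join(",")); // a,d,g
```

Note that this selection starts at index 0, whereas the mapReduce variant with (idx+1) % 3 starts at index 2; passing N - 1 as the first argument of $range would line the two up.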
Or with just a find block:
db.Collection.find({}).then(function(data) {
var ret = [];
for (var i = 0, len = data.length; i < len; i++) {
if (i % 3 === 0 ) {
ret.push(data[i]);
}
}
return ret;
});
Returns a promise whose then() you can invoke to fetch the Nth modulo'ed data.