How to combine Documents in aggregation pipeline with MongoDB Java driver 3.6?

I am using an aggregation pipeline with the MongoDB Java driver version 3.6. If I have documents that look something like:
doc1 --
{
"CAR": {
"VIN": "ASDF1234",
"YEAR": "2018",
"MAKE": "Honda",
"MODEL": "Accord"
},
"FEATURES": [
{
"AUDIO": "MP3",
"TIRES": "All Season",
"BRAKES": "ABS"
}
]
}
doc2 --
{
"CAR": {
"VIN": "ASDF1234",
"AVAILABILITY": "In Stock"
}
}
And if I submit a query like:
collection.aggregate(
Arrays.asList(
Aggregates.match(
and(
in("CAR.VIN", vinList),
or(
eq("CAR.MAKE", carMake),
eq("CAR.AVAILABILITY", carAvailability)
)
)
)
)
)
Let us assume that there are exactly two different records for which the "CAR.VIN" criteria match for every VIN, and I am going to get two results. Rather than deal with two results each time, I would like to merge the documents so that the result looks like this:
{
"CAR": {
"VIN": "ASDF1234",
"YEAR": "2018",
"MAKE": "Honda",
"MODEL": "Accord",
"AVAILABILITY": "In Stock"
},
"FEATURES": [
{
"AUDIO": "MP3",
"TIRES": "All Season",
"BRAKES": "ABS"
}
]
}
The example where I have two and only two results trivializes my need for this. Imagine that vinList is a list of 10000 values, and it might return 2 x 10000 documents. When I return an AggregateIterable to the client that is calling my code, I do not want to impose the requirement that they have to group or collate the results in any way, but that they will receive one document for each result that has all of the information that they will want to parse, cleanly and easily.
Of course, people will suggest that the data is simply combined into one document with all of the data in the MongoDB collection. For reasons that I cannot control, there are two separate documents corresponding to each VIN in the same collection, and that is something that I am unable to change. There is a value in our system that makes this more reasonable than it might seem, so please don't focus on this apparent problem with the data.
I am trying, without much luck, to use the Aggregates.group() operation to merge the fields in my aggregation pipeline. Accumulators.push seems to be the closest operation to what I need, but I do not want to complicate the document structure with extra arrays, etc. Is there a straightforward approach that I am not seeing?

You can try $mergeObjects, added in MongoDB 3.6:
db.cc.aggregate(
[
{
$group: {
_id : "$CAR.VIN",
CAR : {$mergeObjects : "$CAR"},
FEATURES : {$mergeObjects : {$arrayElemAt : ["$FEATURES", 0 ]}}
}
}
]
).pretty()
result
{
"_id" : "ASDF1234",
"CAR" : {
"VIN" : "ASDF1234",
"YEAR" : "2018",
"MAKE" : "Honda",
"MODEL" : "Accord",
"AVAILABILITY" : "In Stock"
},
"FEATURES" : {
"AUDIO" : "MP3",
"TIRES" : "All Season",
"BRAKES" : "ABS"
}
}
To get FEATURES as an array, use $push instead:
db.cc.aggregate(
[
{
$group: {
_id : "$CAR.VIN",
CAR : {$mergeObjects : "$CAR"},
FEATURES : {$push : {$arrayElemAt : ["$FEATURES", 0 ]}}
}
}
]
).pretty()
Result (the null entry comes from doc2, which has no FEATURES array):
{
"_id" : "ASDF1234",
"CAR" : {
"VIN" : "ASDF1234",
"YEAR" : "2018",
"MAKE" : "Honda",
"MODEL" : "Accord",
"AVAILABILITY" : "In Stock"
},
"FEATURES" : [
{
"AUDIO" : "MP3",
"TIRES" : "All Season",
"BRAKES" : "ABS"
},
null
]
}
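To make the merge semantics concrete, here is a minimal in-memory sketch in plain JavaScript of what the first $group stage with $mergeObjects does. It does not talk to MongoDB. The helper name mergeByVin and the sampleCars array are assumptions made for illustration, and the sketch only mirrors the behaviour visible in the outputs above.

```javascript
// Hypothetical in-memory stand-in for:
//   { $group: { _id: "$CAR.VIN", CAR: { $mergeObjects: "$CAR" },
//               FEATURES: { $mergeObjects: { $arrayElemAt: ["$FEATURES", 0] } } } }
function mergeByVin(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const vin = doc.CAR.VIN;
    if (!groups.has(vin)) {
      groups.set(vin, { _id: vin, CAR: {}, FEATURES: {} });
    }
    const merged = groups.get(vin);
    // Like $mergeObjects: keys from later documents overwrite earlier ones.
    Object.assign(merged.CAR, doc.CAR);
    // Like $arrayElemAt ["$FEATURES", 0]: a missing array yields null,
    // and a null source contributes nothing to the merge.
    const firstFeature = Array.isArray(doc.FEATURES) ? doc.FEATURES[0] : null;
    Object.assign(merged.FEATURES, firstFeature);
  }
  return Array.from(groups.values());
}

// The two documents from the question.
const sampleCars = [
  {
    CAR: { VIN: "ASDF1234", YEAR: "2018", MAKE: "Honda", MODEL: "Accord" },
    FEATURES: [{ AUDIO: "MP3", TIRES: "All Season", BRAKES: "ABS" }],
  },
  { CAR: { VIN: "ASDF1234", AVAILABILITY: "In Stock" } },
];

const mergedCars = mergeByVin(sampleCars);
console.log(JSON.stringify(mergedCars, null, 2));
```

Because keys from later documents win, the two documents for a VIN should not carry conflicting values for the same field, or the result depends on the order in which they are processed.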

Related

How to boost Mongodb search result based on Criteria given

I am working on a requirement where I have to write a query such that, if a user enters the acronym of a university (e.g. MIT), the matching documents are returned from the database. The documents look like this:
{
"_id" : ObjectId("5d68cdcac8acd826e6a386b2"),
"name" : "Massachusetts Institute of Technology",
"acronyms" : [
"MIT"
]
}
,
{
"_id" : ObjectId("5d68ce0bc8acd826e6a45b29"),
"name" : "Manukau Institute of Technology",
"acronyms" : [
"MIT"
]
}
The user might enter the name as well, so I have written an $or query for that:
db.getCollection('universityCollection').find(
{$or: [{"name":"MIT"},{"acronyms":"MIT"}]}
)
Now my requirement is that if the user's input matches an acronym, those documents should be returned first, followed by the documents that match on name.
The current $or query does not return the results in that order.
Any pointers will help.
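Please try the query below.
db.getCollection('universityCollection').aggregate([
{ $match : { $or : [ { "name": "MIT" }, { "acronyms": "MIT" } ] } },
{ $project : {
"name": 1,
"acronyms": 1,
// true when "MIT" is one of the document's acronyms
"sortOrder": { $setIsSubset : [ [ "MIT" ], "$acronyms" ] }
} },
// true sorts after false, so descending order puts acronym matches first
{ $sort : { "sortOrder": -1 } }
])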
If you are not familiar with MongoDB aggregation, see the links below.
https://docs.mongodb.com/manual/reference/method/db.collection.aggregate/
https://docs.mongodb.com/manual/reference/operator/aggregation/setIsSubset/
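As a rough illustration of the ordering idea, the following plain JavaScript sketch mimics the $match, $project and $sort stages in memory. The helper rankByAcronym is hypothetical, and the third sample document (a university literally named "MIT") is invented here purely to show a name-only match.

```javascript
// Hypothetical in-memory equivalent of the pipeline:
// match on name or acronym, flag acronym matches, sort flagged ones first.
function rankByAcronym(docs, input) {
  return docs
    .filter((d) => d.name === input || (d.acronyms || []).includes(input))
    .map((d) => ({ ...d, sortOrder: (d.acronyms || []).includes(input) }))
    // true (1) before false (0), like { $sort: { sortOrder: -1 } }
    .sort((a, b) => Number(b.sortOrder) - Number(a.sortOrder));
}

const universities = [
  { name: "MIT", acronyms: [] }, // invented name-only match
  { name: "Massachusetts Institute of Technology", acronyms: ["MIT"] },
  { name: "Manukau Institute of Technology", acronyms: ["MIT"] },
];

const ranked = rankByAcronym(universities, "MIT");
console.log(ranked.map((d) => d.name));
```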

Updating matched array by identifier with multiple names [duplicate]

I have a large DB with various inconsistencies. One of the items I would like to clear up is changing the country status based on the population.
A Sample of the data is:
{ "_id" : "D", "name" : "Deutschland", "pop" : 70000000, "country" : "Large Western" }
{ "_id" : "E", "name" : "Eire", "pop" : 4500000, "country" : "Small Western" }
{ "_id" : "G", "name" : "Greenland", "pop" : 30000, "country" : "Dependency" }
{ "_id" : "M", "name" : "Mauritius", "pop" : 1200000, "country" : "Small island"}
{ "_id" : "L", "name" : "Luxembourg", "pop" : 500000, "country" : "Small Principality" }
Obviously I would like to change the country field to something more uniform, based on population size.
I've tried this approach, but I am obviously missing some way of tying it into an update of the country field.
db.country.updateMany( { case : { $lt : ["$pop" : 20000000] }, then : "Small country" }, { case : { $gte : ["$pop" : 20000000] }, then : "Large country" }
Edit: Posted before I was finished writing.
I was thinking to use $cond functionality, to basically return if true, do X, if false, do y, while using the updateMany.
Is this possible, or is there a workaround?
You really want bulkWrite() with two "updateMany" statements within it instead. Aggregation expressions cannot be used to do "alternate selection" in any form of update statement.
db.country.bulkWrite([
{ "updateMany": {
"filter": { "pop": { "$lt": 20000000 } },
"update": { "$set": { "country": "Small Country" } }
}},
{ "updateMany": {
"filter": { "pop": { "$gte": 20000000 } },
"update": { "$set": { "country": "Large Country" } }
}}
])
There is still an outstanding "feature request" on SERVER-6566 for "conditional syntax", but this is not yet resolved. The "bulk" API was actually introduced after this request was raised, and really can be adapted as shown to do more or less the same thing.
Also using $out in an aggregation statement as was otherwise suggested is not an option to "update" and can only write to a "new collection" at present. The slated change from MongoDB 4.2 onwards would allow $out to actually "update" an existing collection, however this would only be where the collection to be updated is different from any other collection used within the gathering of data from the aggregation pipeline. So it is not possible to use an aggregation pipeline to update the same collection as what you are reading from.
In short, use bulkWrite().
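If the threshold or labels may change, the operations array can be built by a small function. This is only a sketch: buildCountryOps is a hypothetical helper, and it uses $lt and $gte (as in the question's attempt) so that a population exactly equal to the threshold is covered by one of the two updates.

```javascript
// Hypothetical helper that builds the two "updateMany" operations.
function buildCountryOps(threshold) {
  return [
    {
      updateMany: {
        filter: { pop: { $lt: threshold } },
        update: { $set: { country: "Small Country" } },
      },
    },
    {
      updateMany: {
        filter: { pop: { $gte: threshold } },
        update: { $set: { country: "Large Country" } },
      },
    },
  ];
}

const countryOps = buildCountryOps(20000000);
console.log(JSON.stringify(countryOps, null, 2));
// In the mongo shell the array would then be passed on:
// db.country.bulkWrite(countryOps);
```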

For each document retrieve object with $max field from array

I have the following documents in my collection. Each document contains historical weather data about a specific location:
{
'location':'new york',
'history':[
{'timestamp':1524542400, 'temp':79, 'wind_speed':1, 'wind_direction':'SW'},
{'timestamp':1524548400, 'temp':80, 'wind_speed':2, 'wind_direction':'SW'},
{'timestamp':1524554400, 'temp':82, 'wind_speed':3, 'wind_direction':'S'},
{'timestamp':1524560400, 'temp':78, 'wind_speed':4, 'wind_direction':'S'}
]
},
{
'location':'san francisco',
'history':[
{'timestamp':1524542400, 'temp':80, 'wind_speed':5, 'wind_direction':'SW'},
{'timestamp':1524548400, 'temp':81, 'wind_speed':6, 'wind_direction':'SW'},
{'timestamp':1524554400, 'temp':82, 'wind_speed':7, 'wind_direction':'S'},
{'timestamp':1524560400, 'temp':73, 'wind_speed':8, 'wind_direction':'S'}
]
},
{
'location':'miami',
'history':[
{'timestamp':1524542400, 'temp':84, 'wind_speed':9, 'wind_direction':'SW'},
{'timestamp':1524548400, 'temp':85, 'wind_speed':10, 'wind_direction':'SW'},
{'timestamp':1524554400, 'temp':86, 'wind_speed':11, 'wind_direction':'S'},
{'timestamp':1524560400, 'temp':87, 'wind_speed':12, 'wind_direction':'S'}
]
}
I would like to get a list of the most recent weather data for each location (more or less) like so:
{
'location':'new york',
'history':{'timestamp':1524560400, 'temp':78, 'wind_speed':4, 'wind_direction':'S'}
},
{
'location':'san francisco',
'history':{'timestamp':1524560400, 'temp':73, 'wind_speed':8, 'wind_direction':'S'}
},
{
'location':'miami',
'history':{'timestamp':1524560400, 'temp':87, 'wind_speed':12, 'wind_direction':'S'}
}
I was pretty sure it needed some sort of $group aggregate but can't figure out how to select an entire object by $max:<field>. For example the below query only returns the max timestamp itself, without any of the accompanying fields.
db.collection.aggregate([{
'$unwind': '$history'
}, {
'$group': {
'_id': '$location',
'timestamp': {
'$max': '$history.timestamp'
}
}
}])
returns
{ "_id" : "new york", "timestamp" : 1524560400 }
{ "_id" : "san francisco", "timestamp" : 1524560400 }
{ "_id" : "miami", "timestamp" : 1524560400 }
The actual collection and arrays are very large so client side processing won't be ideal. Any help would be much appreciated.
Well as the author of the answer you found, I think we can actually do a bit better with modern MongoDB versions.
Single match per document
In short we can actually apply $max to your particular case, used with $indexOfArray and $arrayElemAt to extract the matched value:
db.collection.aggregate([
{ "$addFields": {
"history": {
"$arrayElemAt": [
"$history",
{ "$indexOfArray": [ "$history.timestamp", { "$max": "$history.timestamp" } ] }
]
}
}}
])
Which will return you:
{
"_id" : ObjectId("5ae9175564de8a00a66b3974"),
"location" : "new york",
"history" : {
"timestamp" : 1524560400,
"temp" : 78,
"wind_speed" : 4,
"wind_direction" : "S"
}
}
{
"_id" : ObjectId("5ae9175564de8a00a66b3975"),
"location" : "san francisco",
"history" : {
"timestamp" : 1524560400,
"temp" : 73,
"wind_speed" : 8,
"wind_direction" : "S"
}
}
{
"_id" : ObjectId("5ae9175564de8a00a66b3976"),
"location" : "miami",
"history" : {
"timestamp" : 1524560400,
"temp" : 87,
"wind_speed" : 12,
"wind_direction" : "S"
}
}
That is of course without actually needing to "group" anything and simply find the $max value from within each document, as you seem to be trying to do. This avoids you needing to "mangle" any other document output by forcing it through a $group or indeed an $unwind.
The usage essentially is that the $max returns the "maximum" value from the specified array property since $history.timestamp is a short way of notating to extract "just those values" from within the objects of the array.
This is used in comparison with the same "list of values" to determine the matching "index" via $indexOfArray, which takes an array as its first argument and the value to match as the second.
The $arrayElemAt operator also takes an array as its first argument. Here we use the full "$history" array since we want to extract the "full object", which we do by the "returned index" value of the $indexOfArray operator.
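The three operators map quite directly onto ordinary array operations. The following plain JavaScript sketch shows the same steps in memory for a single document's history array. The function name latestEntry is an assumption for illustration, and the sketch assumes a non-empty array.

```javascript
// Hypothetical in-memory version of $max + $indexOfArray + $arrayElemAt.
function latestEntry(history) {
  // "$history.timestamp": just the timestamp values, in array order
  const timestamps = history.map((h) => h.timestamp);
  // $max: the largest timestamp
  const maxTs = timestamps.reduce((m, t) => Math.max(m, t), -Infinity);
  // $indexOfArray: index of the first element equal to that maximum
  const idx = timestamps.indexOf(maxTs);
  // $arrayElemAt: the full object at that index
  return history[idx];
}

// The "new york" history from the question.
const nyHistory = [
  { timestamp: 1524542400, temp: 79, wind_speed: 1, wind_direction: "SW" },
  { timestamp: 1524548400, temp: 80, wind_speed: 2, wind_direction: "SW" },
  { timestamp: 1524554400, temp: 82, wind_speed: 3, wind_direction: "S" },
  { timestamp: 1524560400, temp: 78, wind_speed: 4, wind_direction: "S" },
];

const nyLatest = latestEntry(nyHistory);
console.log(nyLatest);
```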
"Multiple" matches per document
Of course that's fine for "single" matches, but if you wanted to expand that to "multiple" matches of the same $max value, then you would use $filter instead:
db.collection.aggregate([
{ "$addFields": {
"history": {
"$filter": {
"input": "$history",
"cond": { "$eq": [ "$$this.timestamp", { "$max": "$history.timestamp" } ] }
}
}
}}
])
Which would output:
{
"_id" : ObjectId("5ae9175564de8a00a66b3974"),
"location" : "new york",
"history" : [
{
"timestamp" : 1524560400,
"temp" : 78,
"wind_speed" : 4,
"wind_direction" : "S"
}
]
}
{
"_id" : ObjectId("5ae9175564de8a00a66b3975"),
"location" : "san francisco",
"history" : [
{
"timestamp" : 1524560400,
"temp" : 73,
"wind_speed" : 8,
"wind_direction" : "S"
}
]
}
{
"_id" : ObjectId("5ae9175564de8a00a66b3976"),
"location" : "miami",
"history" : [
{
"timestamp" : 1524560400,
"temp" : 87,
"wind_speed" : 12,
"wind_direction" : "S"
}
]
}
The main difference being of course that the "history" property is still an "array" since that is what $filter will produce. Also noting of course that if there were in fact "multiple" entries with the same timestamp value, then this would of course return them all and not just the "first index" matched.
The comparison is basically done instead against "each" array element to see if the "current" ( "$$this" ) object has the specified property which matches the $max result, and ultimately returning only those array elements which are a match for the supplied condition.
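The $filter variant can be pictured the same way. In this plain JavaScript sketch the helper latestEntries is hypothetical, and the sample array is invented so that two entries share the maximum timestamp.

```javascript
// Hypothetical in-memory version of $filter with a $max comparison.
function latestEntries(history) {
  const maxTs = history.reduce((m, h) => Math.max(m, h.timestamp), -Infinity);
  // Keep every element whose timestamp equals the maximum ("$$this.timestamp").
  return history.filter((h) => h.timestamp === maxTs);
}

// Invented data: two readings share the latest timestamp.
const tiedHistory = [
  { timestamp: 1524554400, temp: 82 },
  { timestamp: 1524560400, temp: 78 },
  { timestamp: 1524560400, temp: 77 },
];

const tiedLatest = latestEntries(tiedHistory);
console.log(tiedLatest);
```

The result is always an array, even when only one element matches, which mirrors the shape of the $filter output.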
These are essentially your "modern" approaches which avoid the overhead of $unwind, and indeed $sort and $group where they may not be needed. Of course they are not needed for just dealing with individual documents.
If however you really need to $group across "multiple documents" by a specific grouping key and consideration of values "inside" the array, then the initial approach outlined as you discovered is actually the fit for that scenario, as ultimately you "must" $unwind to deal with items "inside" an array in such a way. And also with consideration "across documents".
So be mindful to use stages like $group and $unwind only where you actually need to and where "grouping" is your actual intent. If you are just looking to find something "in the document", then there are far more efficient ways to do this without all the additional overhead that those stages bring with them to processing.

Update an array element with inc mongo update

Hi all, I have this data in MongoDB:
{"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 1
}
],
"count" : NumberLong(1),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
I want to update it using this new data
{"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 3
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(6),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
What I need in the output is:
{"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 4
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(7),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
Could you please suggest how I can achieve this using an update operation?
My update query will have the tags field as the query parameter.
You'll never get this in a single query operation as presently there is no way for MongoDB updates to refer to the existing values of fields. The exception of course is operators such as $inc, but this has a bit more going on than can be really handled by this.
You need multiple updates, but there is a consistent model to follow and the Bulk Operations API can at least help with sending all of those updates in a single request:
var updoc = {
"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 3
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(6),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
};
var bulk = db.collection.initializeOrderedBulkOp();
// Inspect the document variable for update
// For each array entry
updoc.articleId.forEach(function(doc) {
// First try to match the document and array entry to update
bulk.find({
"tags": updoc.tags,
"articleId.articleId": doc.articleId
}).update({
"$inc": { "articleId.$.articleCount": doc.articleCount }
});
// Then try to "push" the array entry where it does not exist
bulk.find({
"tags": updoc.tags,
"articleId.articleId": { "$ne": doc.articleId }
}).update({
"$push": { "articleId": doc }
});
})
// Finally increment the overall count
bulk.find({ "tags": updoc.tags }).update({
"$inc": { "count": updoc.count }
});
bulk.execute();
Now that is not "truly" atomic, and there is a very small chance that the modified document could be read without all of the modifications in place. But since the Bulk API sends these over to the server to process all at once, that is a lot better than individual operations between the client and server, where the chance of the document being read in a non-consistent state would be much higher.
So for each array member in the document to "merge" you want to both try to $inc where the member is matched in the query and to $push a new member where it was not. Finally you just want to $inc again for the total count on the merged document with the existing one.
For this sample that is a total of 5 update operations, but all sent in one package. Note though that the response will confirm that only 3 operations were applied here, as 2 of the operations would not actually match a document due to the conditions specified:
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 0,
"nUpserted" : 0,
"nMatched" : 3,
"nModified" : 3,
"nRemoved" : 0,
"upserted" : [ ]
})
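The same list of operations can also be generated as plain data, which makes it easy to inspect what will be sent. This is a sketch only: buildMergeOps is a hypothetical helper, plain numbers stand in for NumberLong, and the resulting array is in the shape accepted by db.collection.bulkWrite(), whose operations are ordered by default.

```javascript
// Hypothetical helper: one $inc and one $push attempt per array entry,
// followed by a single $inc of the overall count.
function buildMergeOps(updoc) {
  const ops = [];
  for (const entry of updoc.articleId) {
    // Increment the counter where the array entry already exists
    ops.push({
      updateMany: {
        filter: { tags: updoc.tags, "articleId.articleId": entry.articleId },
        update: { $inc: { "articleId.$.articleCount": entry.articleCount } },
      },
    });
    // Push the entry where no array member has that articleId yet
    ops.push({
      updateMany: {
        filter: { tags: updoc.tags, "articleId.articleId": { $ne: entry.articleId } },
        update: { $push: { articleId: entry } },
      },
    });
  }
  // Finally increment the overall count
  ops.push({
    updateMany: {
      filter: { tags: updoc.tags },
      update: { $inc: { count: updoc.count } },
    },
  });
  return ops;
}

const incomingDoc = {
  articleId: [
    { articleId: "9514666", articleCount: 3 },
    { articleId: "9514667", articleCount: 3 },
  ],
  count: 6,
  timeStamp: 1416634200000,
  interval: 1,
  tags: "famous",
};

const mergeOps = buildMergeOps(incomingDoc);
console.log(mergeOps.length);
// In the mongo shell: db.collection.bulkWrite(mergeOps);
```

The order matters: if the $push attempt ran before the $inc attempt for the same entry, a newly pushed member would be incremented a second time.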
So that is one way to handle it. Another may be to just submit each document individually and then periodically "merge" the data into grouped documents using the aggregation framework. It depends on how "real time" you want to do this. The above is as close to "real time" updates as you can generally get.
Delayed Processing
As mentioned, there is another approach to this where you can consider a "delayed" processing of this "merging" where you do not need the data to be updated in real time. The approach considers the use of the aggregation framework to perform the "merge", and you could even use the aggregation as the general query for the data, but you probably want to accumulate in a collection instead.
The basic premise of the aggregation is that you store each "change" document as a separate document in the collection, rather than merge in real time. So two documents in the collection would be represented like this:
{
"_id" : ObjectId("548fe1c78ad2c25d4c952eee"),
"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 1
}
],
"count" : NumberLong(1),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
},
{
"_id" : ObjectId("548fe2286032bac607405eb3"),
"articleId" : [
{
"articleId" : "9514666",
"articleCount" : 3
},
{
"articleId" : "9514667",
"articleCount" : 3
}
],
"count" : NumberLong(6),
"timeStamp" : NumberLong("1416634200000"),
"interval" : 1,
"tags" : "famous"
}
In order to "merge" these results for a given "tags" value, you want an aggregation pipeline like this:
db.collection.aggregate([
// Unwinds the array members to de-normalize
{ "$unwind": "$articleId" },
// Group the elements by "tags" value and "articleId"
{ "$group": {
"_id": {
"tags": "$tags",
"articleId": "$articleId.articleId",
},
"articleCount": { "$sum": "$articleId.articleCount" },
"timeStamp": { "$max": "$timeStamp" },
"interval": { "$max": "$interval" },
}},
// Now group again creating the array of "merged" items
{ "$group": {
"_id": "$_id.tags",
"articleId": {
"$push": {
"articleId": "$_id.articleId",
"articleCount": "$articleCount"
}
},
"count": { "$sum": "$articleCount" },
"timeStamp": { "$max": "$timeStamp" },
"interval": { "$max": "$interval" },
}}
])
So using "tags" and "articleId" ( the inner value ) you group the results together, taking the $sum of the "articleCount" fields where both of those fields are the same and the $max value for the rest of the fields, which makes sense.
In a second $group pass you then just break the result documents down to "tags", pushing each matching "articleId" value under that into an array. To avoid any duplication the document "count" is summed at this stage and the other values are just taken from the same groupings.
The result is the same "merged" document, which you could either use the above aggregation query to simply return your results from such a collection, or use those results to either just create a new collection for the merged documents ( see the $out operator for one option ) or use a similar process to the first example to "merge" these "merged" results with an existing "merged" collection.
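For readers who want to check the arithmetic, here is a small in-memory sketch in plain JavaScript of the same two-pass grouping applied to the two change documents. The helper mergeChanges is hypothetical and plain numbers stand in for NumberLong. It is meant to show the expected shape of the result, not to replace the pipeline.

```javascript
// Hypothetical in-memory equivalent of $unwind + $group + $group.
function mergeChanges(docs) {
  const byTag = new Map();
  for (const doc of docs) {
    if (!byTag.has(doc.tags)) {
      byTag.set(doc.tags, { counts: new Map(), timeStamp: -Infinity, interval: -Infinity });
    }
    const group = byTag.get(doc.tags);
    // One iteration per array member, like $unwind
    for (const article of doc.articleId) {
      const previous = group.counts.get(article.articleId) || 0;
      group.counts.set(article.articleId, previous + article.articleCount); // $sum
      group.timeStamp = Math.max(group.timeStamp, doc.timeStamp);           // $max
      group.interval = Math.max(group.interval, doc.interval);              // $max
    }
  }
  // Second pass: one document per tag with the merged array and total count
  return Array.from(byTag, ([tags, group]) => {
    const articleId = Array.from(group.counts, ([id, articleCount]) => ({
      articleId: id,
      articleCount,
    }));
    return {
      _id: tags,
      articleId,
      count: articleId.reduce((sum, a) => sum + a.articleCount, 0),
      timeStamp: group.timeStamp,
      interval: group.interval,
    };
  });
}

const changeDocs = [
  {
    articleId: [{ articleId: "9514666", articleCount: 1 }],
    count: 1, timeStamp: 1416634200000, interval: 1, tags: "famous",
  },
  {
    articleId: [
      { articleId: "9514666", articleCount: 3 },
      { articleId: "9514667", articleCount: 3 },
    ],
    count: 6, timeStamp: 1416634200000, interval: 1, tags: "famous",
  },
];

const mergedChanges = mergeChanges(changeDocs);
console.log(JSON.stringify(mergedChanges, null, 2));
```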
Accumulating data like this is generally a wide topic, even though it is a common use case for many. There is a reference project maintained by MongoDB solutions architecture called HVDF, or High Volume Data Feed. It is aimed at providing a framework, or at least a reference example, for handling volume feeds ( for which change document accumulation is a case ) and aggregating these in a series manner for analysis.
The actual approaches depend on the overall needs of your application. Concepts such as these are employed internally by a framework like HVDF, it's just a matter of how much complexity you need and the approach that suits your application best for how you need to access the data.

Mongodb Update/Upsert array exact match

I have a collection :
gStats : {
"_id" : "id1",
"criteria" : [{"key1":"value1"}, {"key2":"value2"}],
"groups" : [
{"id":"XXXX", "visited":100, "liked":200},
{"id":"YYYY", "visited":30, "liked":400}
]
}
I want to be able to update a document of the stats Array of a given array of criteria (exact match).
I try to do this on 2 steps :
Pull the stat document from the array of a given "id" :
db.gStats.update({
"criteria" : {$size : 2},
"criteria" : {$all : [{"key1" : "2096955"},{"value1" : "2015610"}]}
},
{
$pull : {groups : {"id" : "XXXX"}}
}
)
Push the new document
db.gStats.findAndModify({
query : {
"criteria" : {$size : 2},
"criteria" : {$all : [{"key1" : "2015610"}, {"key2" : "2096955"}]}
},
update : {
$push : {groups : {"id" : "XXXX", "visited" : 29, "liked" : 144}}
},
upsert : true
})
The Pull query works perfect.
The Push query gives an error :
2014-12-13T15:12:58.571+0100 findAndModifyFailed failed: {
"value" : null,
"errmsg" : "exception: Cannot create base during insert of update. Caused by :ConflictingUpdateOperators Cannot update 'criteria' and 'criteria' at the same time",
"code" : 12,
"ok" : 0
} at src/mongo/shell/collection.js:614
Neither query is working in reality. You cannot use a key name like "criteria" more than once unless it is under an operator such as $and. You are also specifying different fields (i.e. groups) and querying elements that do not exist in your sample document.
So hard to tell what you really want to do here. But the error is essentially caused by the first issue I mentioned, with a little something extra. So really your { "$size": 2 } condition is being ignored and only the second condition is applied.
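The overwriting is easy to see in plain JavaScript, since the shell builds the query from an ordinary object literal. A minimal sketch, where the variable name duplicateKeyQuery is just for illustration:

```javascript
// An object literal with the same key twice keeps only the last value.
const duplicateKeyQuery = {
  criteria: { $size: 2 },
  criteria: { $all: [{ key1: "2015610" }, { key2: "2096955" }] },
};

console.log(Object.keys(duplicateKeyQuery)); // a single "criteria" key
console.log(JSON.stringify(duplicateKeyQuery.criteria)); // only the $all condition
```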
A valid query form should look like this:
query: {
"$and": [
{ "criteria" : { "$size" : 2 } },
{ "criteria" : { "$all": [{ "key1": "2015610" }, { "key2": "2096955" }] } }
]
}
As each set of conditions is specified within the array provided by $and the document structure of the query is valid and does not have a hash-key name overwriting the other. That's the proper way to write your two conditions, but there is a trick to making this work where the "upsert" is failing due to those conditions not matching a document. We need to overwrite what is happening when it tries to apply the $all arguments on creation:
update: {
"$setOnInsert": {
"criteria" : [{ "key1": "2015610" }, { "key2": "2096955" }]
},
"$push": { "stats": { "id": "XXXX", "visited": 29, "liked": 144 } }
}
That uses $setOnInsert so that when the "upsert" is applied and a new document is created, the "criteria" value specified here is used rather than the field values taken from the query portion of the statement.
Of course, if what you are really looking for is truly an exact match of the content in the array, then just use that for the query instead:
query: {
"criteria" : [{ "key1": "2015610" }, { "key2": "2096955" }]
}
Then MongoDB will be happy to apply those values when a new document is created and does not get confused on how to interpret the $all expression.
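Put together, the exact-match form of the findAndModify argument could be assembled as below. This is a sketch under assumptions: buildExactMatchUpsert is a hypothetical helper, and it pushes to the "groups" array named in the question's sample document.

```javascript
// Hypothetical helper that builds the findAndModify argument for an
// exact match on the whole "criteria" array, with upsert enabled.
function buildExactMatchUpsert(criteria, group) {
  return {
    query: { criteria: criteria },
    update: { $push: { groups: group } },
    upsert: true,
  };
}

const upsertSpec = buildExactMatchUpsert(
  [{ key1: "2015610" }, { key2: "2096955" }],
  { id: "XXXX", visited: 29, liked: 144 }
);
console.log(JSON.stringify(upsertSpec, null, 2));
// In the mongo shell: db.gStats.findAndModify(upsertSpec);
```

Note that an exact array match is order sensitive, so the criteria elements have to be supplied in the same order as they are stored.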