Scala / MongoDB - removing duplicates

I have seen very similar questions with solutions to this problem, but I am unsure how I would incorporate them into my own query. I'm programming in Scala and using the MongoDB Aggregates "framework".
val getItems = Seq (
Aggregates.lookup(Store...)...
Aggregates.lookup(Store.STORE_NAME, "relationship.itemID", "uniqueID", "item"),
Aggregates.unwind("$item"),
// filter duplicates here ?
Aggregates.lookup(Store.STORE_NAME, "item.content", "ID", "content"),
Aggregates.unwind("$content"),
Aggregates.project(Projections.fields(Projections.include("store", "item", "content")))
)
The query returns duplicate objects which is undesirable. I would like to remove these. How could I go about incorporating Aggregates.group and "$addToSet" to do this? Or any other reasonable solution would be great too.
Note: I have to omit some details about the query, so the store lookup aggregate is not there. However, I want to remove the duplicates later in the query so it hopefully shouldn't matter.
Please let me know if I need to provide more information.
Thanks.
EDIT (31/07/2019, 13:47):
I have tried the following:
val getItems = Seq (
Aggregates.lookup(Store...)...
Aggregates.lookup(Store.STORE_NAME, "relationship.itemID", "uniqueID", "item"),
Aggregates.unwind("$item"),
Aggregates.group("$item.itemID",
Accumulators.first("ID", "$ID"),
Accumulators.first("itemName", "$itemName"),
Accumulators.addToSet("item", "$item")),
Aggregates.unwind("$items"),
Aggregates.lookup(Store.STORE_NAME, "item.content", "ID", "content"),
Aggregates.unwind("$content"),
Aggregates.project(Projections.fields(Projections.include("store", "items", "content")))
)
But my query now returns zero results instead of the duplicate result.

You can use $first to remove the duplicates.
Suppose I have the following data:
[
{"_id": 1,"item": "ABC","sizes": ["S","M","L"]},
{"_id": 2,"item": "EFG","sizes": []},
{"_id": 3, "item": "IJK","sizes": "M" },
{"_id": 4,"item": "LMN"},
{"_id": 5,"item": "XYZ","sizes": null}
]
Now let's aggregate it with $first and with $unwind and compare the results.
First, using $first:
db.collection.aggregate([
{ $sort: {
item: 1
}
},
{ $group: {
_id: "$item",firstSize: {$first: "$sizes"}}}
])
Output
[
{"_id": "XYZ","firstSize": null},
{"_id": "ABC","firstSize": ["S","M","L" ]},
{"_id": "IJK","firstSize": "M"},
{"_id": "EFG","firstSize": []},
{"_id": "LMN","firstSize": null}
]
Now, Let's aggregate it using $unwind
db.collection.aggregate([
{
$unwind: "$sizes"
}
])
Output
[
{"_id": 1,"item": "ABC","sizes": "S"},
{"_id": 1,"item": "ABC","sizes": "M"},
{"_id": 1,"item": "ABC","sizes": "L"},
{"_id": 3,"item": "IJK","sizes": "M"}
]
You can see that $first removes the duplicates, whereas $unwind keeps them.
Using $unwind and $first together:
db.collection.aggregate([
{ $unwind: "$sizes"},
{
$group: {
_id: "$item",firstSize: {$first: "$sizes"}}
}
])
Output
[
{"_id": "IJK", "firstSize": "M"},
{"_id": "ABC","firstSize": "S"}
]
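The behaviour shown in the outputs above can be modelled in plain JavaScript. This is only a sketch of the semantics, not MongoDB driver code: the helper names unwind and groupFirst are made up for illustration, and it assumes $unwind is used without preserveNullAndEmptyArrays.

```javascript
// Plain-JavaScript model of two aggregation behaviours (not MongoDB driver code).

// $unwind without preserveNullAndEmptyArrays:
// missing, null, and empty-array values produce no output documents;
// a non-array value is treated as a one-element array.
function unwind(docs, field) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    if (value === undefined || value === null) return [];
    if (!Array.isArray(value)) return [{ ...doc }];
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

// $group with a $first accumulator: one output per key, keeping the first value seen.
function groupFirst(docs, keyField, valueField) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc[keyField])) groups.set(doc[keyField], doc[valueField]);
  }
  return [...groups].map(([key, first]) => ({ _id: key, firstSize: first }));
}

// Same sample data as in the answer.
const data = [
  { _id: 1, item: "ABC", sizes: ["S", "M", "L"] },
  { _id: 2, item: "EFG", sizes: [] },
  { _id: 3, item: "IJK", sizes: "M" },
  { _id: 4, item: "LMN" },
  { _id: 5, item: "XYZ", sizes: null },
];

const unwound = unwind(data, "sizes");
const grouped = groupFirst(unwound, "item", "sizes");
console.log(unwound.length, grouped);
```

Note that documents 2, 4 and 5 disappear at the unwind step, which is why only two groups remain afterwards.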

Grouping and then using $addToSet is an effective way to deal with your problem.
It looks like this in the mongo shell:
db.sales.aggregate(
[
{
$group:
{
_id: { day: { $dayOfYear: "$date"}, year: { $year: "$date" } },
itemsSold: { $addToSet: "$item" }
}
}
]
)
In Scala you can do it like this:
Aggregates.group("$groupfield", Accumulators.addToSet("fieldName","$expression"))
If you have multiple fields to group by:
Aggregates.group(new BasicDBObject().append("fieldAname","$fieldA").append("fieldBname","$fieldB"), Accumulators.addToSet("fieldName","$expression"))
Then unwind the resulting array.
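As a rough illustration, here is a plain-JavaScript model of group, $addToSet, then unwind. It is a sketch of the semantics rather than Scala driver code. The helper names and sample rows are hypothetical, and comparing values via JSON.stringify is an assumption that only holds when equal objects have the same key order. It also shows one likely reason the attempt in the question returns nothing: the group stage writes a field named item, but the next stage unwinds $items, and unwinding a field that does not exist yields no documents.

```javascript
// Plain-JavaScript model of $group + $addToSet (not MongoDB driver code).
function groupAddToSet(docs, keyOf, valueOf, outField) {
  const groups = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    if (!groups.has(key)) groups.set(key, new Map());
    const value = valueOf(doc);
    // Assumption: equal values serialise identically (same key order).
    groups.get(key).set(JSON.stringify(value), value);
  }
  return [...groups].map(([key, set]) => ({ _id: key, [outField]: [...set.values()] }));
}

// Simplified $unwind: only array fields produce output; anything else is dropped.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    Array.isArray(doc[field]) ? doc[field].map((v) => ({ ...doc, [field]: v })) : []
  );
}

// Hypothetical rows after the lookup/unwind, containing one duplicate item.
const rows = [
  { item: { itemID: 1, name: "a" } },
  { item: { itemID: 1, name: "a" } },
  { item: { itemID: 2, name: "b" } },
];

const groupedItems = groupAddToSet(rows, (d) => d.item.itemID, (d) => d.item, "item");
const deduplicated = unwind(groupedItems, "item");  // field written by the group stage
const wrongField = unwind(groupedItems, "items");   // field that does not exist
console.log(deduplicated.length, wrongField.length);
```

If this matches the real situation, unwinding $item instead of $items (and projecting item rather than items) would be the first thing to try.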


Limit number of objects pushed to array in MongoDB aggregation

I've been trying to find a way to limit the number of objects I'm pushing to arrays I'm creating while using "aggregate" on a MongoDB collection.
I have a collection of students - each has these relevant keys:
class number it takes this semester (only one value),
percentile in class (exists if is enrolled in class, null if not),
current score in class (> 0 if enrolled in class, else - 0),
total average (GPA),
max grade
I need to group all students who never failed, per class, in one array that contains those with a GPA higher than 80, and another array containing those without this GPA, sorted by their score in this specific class.
This is my query:
db.getCollection("students").aggregate([
{"$match": {
"class_number":
{"$in": [49, 50, 16]},
"grades.curr_class.percentile":
{"$exists": true},
"grades.min": {"$gte": 80},
}},
{"$sort": {"grades.curr_class.score": -1}},
{"$group": {"_id": "$class_number",
"studentsWithHighGPA":
{"$push":
{"$cond": [{"$gte": ["$grades.gpa", 80]},
{"id": "$_id"},
"$$REMOVE"]
}
},
"studentsWithoutHighGPA":
{"$push":
{"$cond": [{"$lt": ["$grades.gpa", 80]},
{"id": "$_id"},
"$$REMOVE"]
},
},
},
},
])
What I'm trying to do is limit the number of students in each of these arrays. I only want the top 16 in each array, but I'm not sure how to approach this.
Thanks in advance!
I've tried using $limit in different variations, and $slice too, but none seem to work.
Since MongoDB 5.0, one option is to use $setWindowFields for this, and in particular its $rank operator. This makes it possible to keep only the relevant students and limit their count even before the $group step:
1. $match only the relevant students, as in the question.
2. $set a groupId for $setWindowFields (as it can currently partition by one key only).
3. $setWindowFields to define the rank of each student within their array.
4. $match only students with the wanted rank.
5. $group by class_number, as in the question:
db.collection.aggregate([
{$match: {
class_number: {$in: [49, 50, 16]},
"grades.curr_class.percentile": {$exists: true},
"grades.min": {$gte: 80}
}},
{$set: {
groupId: {$concat: [
{$toString: "$class_number"},
{$toString: {$toBool: {$gte: ["$grades.gpa", 80]}}}
]}
}},
{$setWindowFields: {
partitionBy: "$groupId",
sortBy: {"grades.curr_class.score": -1},
output: {rank: {$rank: {}}}
}},
{$match: {rank: {$lte: rankLimit}}},
{$group: {
_id: "$class_number",
studentsWithHighGPA: {$push: {
$cond: [{$gte: ["$grades.gpa", 80]}, {id: "$_id"}, "$$REMOVE"]}},
studentsWithoutHighGPA: {$push: {
$cond: [{$lt: ["$grades.gpa", 80]}, {id: "$_id"}, "$$REMOVE"]}}
}}
])
Note: this solution limits students by rank, so an array can end up with more than n students when several students tie at rank n. Adding a $slice step handles that edge case.
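The tie edge case can be seen in a small plain-JavaScript model of ranking within one partition. This is a sketch, not MongoDB code: the sample students and the limit of 2 are hypothetical, and the JavaScript slice call stands in for a $slice applied to the array after $group.

```javascript
// Plain-JavaScript model of $rank within one partition (not MongoDB driver code).
// Ties share a rank and the following rank is skipped, e.g. 1, 2, 2, 4.
function rankByScore(students) {
  const sorted = [...students].sort((a, b) => b.score - a.score);
  let rank = 0;
  let previousScore;
  return sorted.map((student, index) => {
    if (student.score !== previousScore) {
      rank = index + 1;
      previousScore = student.score;
    }
    return { ...student, rank };
  });
}

// Hypothetical partition with a tie at rank 2.
const partition = [
  { id: "a", score: 90 },
  { id: "b", score: 85 },
  { id: "c", score: 85 },
  { id: "d", score: 70 },
];
const limit = 2;

// Filtering on rank alone keeps both tied students, so three remain.
const withinRank = rankByScore(partition).filter((s) => s.rank <= limit);
// Slicing afterwards enforces the hard cap.
const capped = withinRank.slice(0, limit);
console.log(withinRank.map((s) => s.id), capped.map((s) => s.id));
```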
MongoDB's $facet stage may be a solution: it lets you run several sub-pipelines within one aggregation call.
Something like this:
const pipeline = [
{
'$facet': {
'studentsWithHighGPA': [
{ '$match': { 'grade': { '$gte': 80 } } },
{ '$sort': { 'grade': -1 } },
{ '$limit': 16 }
],
'studentsWithoutHighGPA': [
{ '$match': { 'grade': { '$lt': 80 } } },
{ '$sort': { 'grade': -1 } },
{ '$limit': 16 }
]
}
}
];
coll.aggregate(pipeline)
This should end up with one document including two arrays.
studentsWithHighGPA (array)
0 (object)
1 (object)
...
studentsWithoutHighGPA (array)
0 (object)
1 (object)
See each facet as an aggregation pipeline on its own. So you can also include $group to group by classes or something else.
https://www.mongodb.com/docs/manual/reference/operator/aggregation/facet/
I don't think there is a MongoDB-provided operator that applies a limit inside a $group stage.
You could use $accumulator, but that requires server-side scripting to be enabled and may affect performance.
Limiting studentsWithHighGPA to 16 throughout the grouping might look something like:
"studentsWithHighGPA": {
  "$accumulator": {
    // Start each group with an empty list.
    init: function() {
      return {combined: []};
    },
    // Collect students with a GPA of 80 or more, keeping at most the first 16 seen.
    accumulate: function(state, id, score) {
      if (score >= 80) {
        state.combined.push({_id: id, score: score});
      }
      return {combined: state.combined.slice(0, 16)};
    },
    accumulateArgs: ["$_id", "$grades.gpa"],
    // Combine two partial states, sorted by the carried score in descending order.
    merge: function(A, B) {
      return {combined:
        A.combined.concat(B.combined).sort(
          function(SA, SB) {
            return SB.score - SA.score;
          })
      };
    },
    // Cut the final list to 16 entries and drop the score.
    finalize: function(s) {
      return s.combined.slice(0, 16).map(function(A) {
        return {_id: A._id};
      });
    },
    lang: "js"
  }
}
Note that the score is also carried through until the very end so that partial result sets from different shards can be combined properly.

MongoDB - Obtain full document of a group taking into account the minimum value of one property

Good afternoon. I'm getting started with MongoDB and I have a question about the $group aggregation stage.
From the following set of documents, I need to get the cheapest room among similar ones (grouping by room identifier).
{"_id":"874521035","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"1"},"name":"Doble"},"board":{"id":{"$numberInt":"1"},"name":"Sólo alojamiento"},"fare":{"id":"NRF","name":"No reembolsable"},"price":{"cost":{"$numberInt":"115"},"net":{"$numberInt":"116"},"pvp":{"$numberInt":"126"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
{"_id":"123456789","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"1"},"name":"Doble"},"board":{"id":{"$numberInt":"2"},"name":"Alojamiento y desayuno"},"fare":{"id":"NOR","name":"Reembolsable"},"price":{"cost":{"$numberInt":"120"},"net":{"$numberInt":"121"},"pvp":{"$numberInt":"131"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
{"_id":"987654321","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"2"},"name":"Triple"},"board":{"id":{"$numberInt":"1"},"name":"Sólo alojamiento"},"fare":{"id":"NOR","name":"Reembolsable"},"price":{"cost":{"$numberInt":"125"},"net":{"$numberInt":"126"},"pvp":{"$numberInt":"136"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
{"_id":"852963147","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"3"},"name":"Doble uso individual"},"board":{"id":{"$numberInt":"1"},"name":"Sólo alojamiento"},"price":{"cost":{"$numberInt":"99"},"net":{"$numberInt":"100"},"pvp":{"$numberInt":"110"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
So far I have managed to obtain only the cheapest price, the room identifier and the number of repetitions.
db.consolidation.aggregate([
{
$group: {
_id: "$room.id",
"cheapest": {$min: "$price.pvp"},
"qty": {$sum: 1}
}
}]);
{"_id": 2, "cheapest": 136, "qty": 1}
{"_id": 3, "cheapest": 110, "qty": 1}
{"_id": 1, "cheapest": 126, "qty": 2}
While investigating, I saw that data can be obtained with $first or $last, but that is not what I need, since the result depends on the position of the document.
What I need is the full document of the cheapest room in each group. This is the result I expect:
{"_id":"874521035","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"1"},"name":"Doble"},"board":{"id":{"$numberInt":"1"},"name":"Sólo alojamiento"},"fare":{"id":"NRF","name":"No reembolsable"},"price":{"cost":{"$numberInt":"115"},"net":{"$numberInt":"116"},"pvp":{"$numberInt":"126"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
{"_id":"987654321","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"2"},"name":"Triple"},"board":{"id":{"$numberInt":"1"},"name":"Sólo alojamiento"},"fare":{"id":"NOR","name":"Reembolsable"},"price":{"cost":{"$numberInt":"125"},"net":{"$numberInt":"126"},"pvp":{"$numberInt":"136"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
{"_id":"852963147","provider":{"id":{"$numberInt":"2"},"name":"HotelBeds"},"accommodation":{"id":{"$numberInt":"36880"},"name":"Hotel Goya"},"room":{"id":{"$numberInt":"3"},"name":"Doble uso individual"},"board":{"id":{"$numberInt":"1"},"name":"Sólo alojamiento"},"price":{"cost":{"$numberInt":"99"},"net":{"$numberInt":"100"},"pvp":{"$numberInt":"110"}},"fees":{"agency":{"$numberInt":"10"},"cdv":{"$numberInt":"1"}},"cancellation-deadeline":"2019-12-31","payment-deadeline":"2019-12-30"}
I hope I have explained this clearly.
Thanks in advance.
Regards.
You can capture $$ROOT as part of your $group stage and then use $filter to compare the list of documents for each room against the min value. $replaceRoot then restores the original document shape:
db.collection.aggregate([
{
$group: {
_id: "$room.id",
"cheapest": {
$min: "$price.pvp"
},
"qty": { $sum: 1 },
docs: { $push: "$$ROOT" }
}
},
{
$replaceRoot: {
newRoot: { $arrayElemAt: [ { $filter: { input: "$docs", cond: { $eq: [ "$$this.price.pvp", "$cheapest" ] } } }, 0 ] }
}
}
])
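To make the intent of the pipeline concrete, here is a plain-JavaScript sketch of the same idea: for each room.id, keep the whole document with the lowest price.pvp. It is a model of the result, not MongoDB code. The helper name cheapestPerRoom is hypothetical and the sample documents are reduced to the three fields that matter.

```javascript
// Plain-JavaScript model: keep the full document with the lowest price.pvp per room.id.
// On a tie, the document seen first is kept.
function cheapestPerRoom(docs) {
  const best = new Map();
  for (const doc of docs) {
    const current = best.get(doc.room.id);
    if (current === undefined || doc.price.pvp < current.price.pvp) {
      best.set(doc.room.id, doc);
    }
  }
  return [...best.values()];
}

// Documents from the question, trimmed to _id, room.id and price.pvp.
const offers = [
  { _id: "874521035", room: { id: 1 }, price: { pvp: 126 } },
  { _id: "123456789", room: { id: 1 }, price: { pvp: 131 } },
  { _id: "987654321", room: { id: 2 }, price: { pvp: 136 } },
  { _id: "852963147", room: { id: 3 }, price: { pvp: 110 } },
];

const cheapest = cheapestPerRoom(offers);
console.log(cheapest.map((d) => d._id));
```

One design point: pushing $$ROOT stores every document of a group in memory before filtering, whereas this model only ever holds one candidate per room. For large groups that difference may matter.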

Mongodb aggregation taking more than 15 seconds

I have more than 100k records in my collection, and a new record is added every 5 seconds. I have an aggregate query that returns approximately 720 records from the last year of data.
The aggregate query:
db.collectionName.aggregate([
{"$match": {
"Id": "****-id-****",
"receivedDate": {
"$gte": ISODate("2016-06-26T18:30:00.463Z"),
"$lt": ISODate("2017-06-26T18:30:00.463Z")
}
}
},
{"$group": {
"_id": {
"$add": [
{"$subtract": [
{"$subtract": ["$receivedDate", ISODate("1970-01-01T00:00:00.000Z")]},
{"$mod": [
{"$subtract": ["$receivedDate", ISODate("1970-01-01T00:00:00.000Z")]},
43200000
]}
]},
ISODate("1970-01-01T00:00:00.000Z")
]
},
"_rid": {"$first": "$_id"},
"_data": {"$first": "$receivedData.data"},
"count": {"$sum": 1}
}
},
{"$sort": {"_id": -1}},
{"$project": {
"_id": "$_rid",
"receivedDate": "$_id",
"receivedData": {"data": "$_data"}
}
}
])
I am not sure why it takes more than 15 seconds; when I query one month of data it works fine.
It is late to answer this question, but this may be helpful for others.
A compound index might help in this situation, since compound indexes support queries that match on multiple fields.
You can create a compound index on the Id and receivedDate fields:
db.collectionName.createIndex({ Id: -1, receivedDate: -1 });
The order of the fields listed in a compound index is important. The index will contain references to documents sorted first by the values of the Id field and, within each value of the Id field, sorted by values of the receivedDate field.
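As a side note on the pipeline in the question, the _id expression in its $group stage floors each receivedDate to the start of a 12-hour window (43200000 ms), which is how a year of data collapses to roughly 720 groups. A plain-JavaScript sketch of that arithmetic, with bucketStart as a hypothetical helper name:

```javascript
// Plain-JavaScript version of the date arithmetic in the $group _id expression.
const HALF_DAY_MS = 43200000; // 12 hours in milliseconds

// Subtract the remainder so the timestamp lands on the start of its 12-hour window (UTC).
function bucketStart(date) {
  const ms = date.getTime(); // milliseconds since 1970-01-01T00:00:00Z
  return new Date(ms - (ms % HALF_DAY_MS));
}

const evening = bucketStart(new Date("2016-06-26T18:30:00.463Z"));
const morning = bucketStart(new Date("2016-06-26T11:59:59.999Z"));
console.log(evening.toISOString()); // 2016-06-26T12:00:00.000Z
console.log(morning.toISOString()); // 2016-06-26T00:00:00.000Z
```

Because this value is computed for every matched document, the $match stage is the only part of the pipeline that an index on Id and receivedDate can speed up.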

MongoDB returns the whole dates array even if only one element matches [duplicate]

MongoDB returns the whole dates array even when only one of its elements matches.
I created a new collection with the data below:
db.details.insert({
"_id": 1,
"name": "johnson",
"dates": [
{"date": ISODate("2016-05-01")},
{"date": ISODate("2016-08-01")}
]
})
Fetching Back:
db.details.find().pretty()
Output:
{
"_id": 1,
"name": "johnson",
"dates": [
{"date": ISODate("2016-05-01T00:00:00Z")},
{"date": ISODate("2016-08-01T00:00:00Z")}
]
}
So here there is an array called dates embedded in a document of the details collection.
Now I want to filter the entries inside dates with a greater-than condition, so that the result shows only "2016-08-01".
But when I search like this:
db.details.find(
{"dates.date": {$gt: ISODate("2016-07-01")}},
{"dates.date": 1, "_id": 0}
).pretty()
I get the result below: it returns the entire array even though only one date matches:
{
"dates": [
{"date": ISODate("2016-05-01T00:00:00Z")},
{"date": ISODate("2016-08-01T00:00:00Z")}
]
}
Please help me get the expected data, i.e.:
{
"date": ISODate("2016-08-01T00:00:00Z")
}
You can use the aggregation framework for this:
db.details.aggregate([
{$unwind: '$dates'},
{$match: {'dates.date': {$gt: ISODate("2016-07-01")}}},
{$project: {_id: 0, 'dates.date': 1}}
]);
Another way (works only in MongoDB 3.2 and later, where $filter was introduced):
db.details.aggregate([
{$project: {
_id: 0,
dates: {
$filter: {
input: '$dates',
as: 'item',
cond: {
$gte: ['$$item.date', ISODate('2016-08-01T00:00:00Z')]
}
}
}
}
}]);
To only return the date field:
db.details.aggregate([
{$unwind: '$dates'},
{$match: {'dates.date': {$gt: ISODate('2016-07-01')}}},
{$group: {_id: '$dates.date'}},
{$project: {_id: 0, date: '$_id'}}
]);
Returns:
{
"date" : ISODate("2016-08-01T00:00:00Z")
}
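The difference between the original find and the pipelines above comes down to matching a document versus filtering its array. A plain-JavaScript sketch of that distinction (a model of the semantics, not MongoDB code):

```javascript
// Plain-JavaScript model: matching a document versus filtering its array.
const detail = {
  _id: 1,
  name: "johnson",
  dates: [{ date: new Date("2016-05-01") }, { date: new Date("2016-08-01") }],
};
const cutoff = new Date("2016-07-01");

// A query condition on "dates.date" asks: does ANY element satisfy it?
// If so, the document matches and its whole array is returned.
const documentMatches = detail.dates.some((entry) => entry.date > cutoff);

// $filter (or $unwind followed by $match) keeps only the elements that satisfy it.
const matchingDates = detail.dates.filter((entry) => entry.date > cutoff);

console.log(documentMatches, matchingDates.map((entry) => entry.date.toISOString()));
```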

MongoDB: How can I get a count of a field in a collection grouped by first character and matching a 2nd field?

Following this question's answer (https://stackoverflow.com/a/20817040/2656506) I was able to group a field based on its first character with this command:
db.kits.aggregate({ $group: {_id: {$substr: ['$kit', 0, 1]}, count: {$sum: 1}}})
But I can't figure out how I can additionally group only those documents which match an additional condition like _id: 'abc' in the same query. Can it be done in one query?
Thanks in advance!
Add a $match pipeline stage to your aggregation query:
db.kits.aggregate(
[
{
$match: {
_id: 'abc'
}
},
{
$group: {
_id: {
$substr: ['$kit', 0, 1]
},
count: {$sum: 1}
}
}
]
)
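The order of the two stages matters: documents are filtered first, and only the survivors are counted. A plain-JavaScript sketch of that flow follows. It is a model rather than MongoDB code; the helper name countByFirstChar, the owner field and the sample kits are hypothetical.

```javascript
// Plain-JavaScript model of $match followed by $group on the first character.
function countByFirstChar(docs, matches) {
  const counts = new Map();
  // Filter first (the $match stage), then count per first character (the $group stage).
  for (const doc of docs.filter(matches)) {
    const key = doc.kit.substring(0, 1);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([first, count]) => ({ _id: first, count }));
}

// Hypothetical sample data; "owner" stands in for whatever field you match on.
const kits = [
  { kit: "alpha", owner: "abc" },
  { kit: "apex", owner: "abc" },
  { kit: "beta", owner: "abc" },
  { kit: "axle", owner: "xyz" },
];

const countsForOwner = countByFirstChar(kits, (doc) => doc.owner === "abc");
console.log(countsForOwner);
```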