Removing duplicates in mongodb with aggregate query - mongodb

db.games.aggregate([
{ $unwind : "$rounds"},
{ $match: {
"rounds.round_values.gameStage": "River",
"rounds.round_values.decision": "BetPlus" }
},
{ $project: {"FinalFundsChange":1, "GameID":1}
}])
The resulting output is:
{ "_id" : ObjectId("57cbce66e281af12e4d0731f"), "GameID" : "229327202", "FinalFundsChange" : 0.8199999999999998 }
{ "_id" : ObjectId("57cbe2fce281af0f34020901"), "FinalFundsChange" : -0.1599999999999997, "GameID" : "755030199" }
{ "_id" : ObjectId("57cbea3ae281af0f340209bc"), "FinalFundsChange" : 0.10000000000000009, "GameID" : "231534683" }
{ "_id" : ObjectId("57cbee43e281af0f34020a25"), "FinalFundsChange" : 1.7000000000000002, "GameID" : "509975754" }
{ "_id" : ObjectId("57cbee43e281af0f34020a25"), "FinalFundsChange" : 1.7000000000000002, "GameID" : "509975754" }
As you can see, the last element is a duplicate; that's because $unwind produces two documents from it, as expected. How can I (while keeping the aggregate structure of the query) keep only the first element of each duplicate, or only the last element?
I have seen that the ways to do this seem to involve either $addToSet or $setUnion (any details on how these work exactly are appreciated as well), but I don't understand how I can choose the 'subset' by which I want to identify the duplicates (in my case that's the 'GameID'; other values are allowed to differ), nor how I can select whether I want the first or the last element.

You could group by _id via $group and then use the $last and $first operators respectively to keep the last or first values.
db.games.aggregate([
{ $unwind : "$rounds"},
{ $match: {
"rounds.round_values.gameStage": "River",
"rounds.round_values.decision": "BetPlus" }
},
{ $group: {
_id: "$_id",
"FinalFundsChange": { $first: "$FinalFundsChange" },
"GameID": { $last: "$GameID" }
}
}
])
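Since you want to identify duplicates by GameID (with the other values allowed to differ), here is a sketch of the same idea grouping on GameID instead; $first keeps the first occurrence and $last would keep the last (doc_id is just a hypothetical name for retaining the original _id):
db.games.aggregate([
{ $unwind : "$rounds"},
{ $match: {
"rounds.round_values.gameStage": "River",
"rounds.round_values.decision": "BetPlus" }
},
{ $group: {
_id: "$GameID",  // deduplicate on GameID
"FinalFundsChange": { $first: "$FinalFundsChange" },  // or $last to keep the last occurrence
"doc_id": { $first: "$_id" }  // hypothetical field keeping the original _id
}
}
])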

My problem was to find all users who purchased the same product, where a user can purchase a product multiple times.
https://mongoplayground.net/p/UTuT4e_N6gn
db.payments.aggregate([
{
"$lookup": {
"from": "user",
"localField": "user",
"foreignField": "_id",
"as": "user_docs"
}
},
{
"$unwind": "$user_docs",
},
{
"$group": {
"_id": "$user_docs._id",
"name": {
"$first": "$user_docs.name"
},
}
},
{
"$project": {
"_id": 0,
"id": "$_id",
"name": "$name"
}
}
])
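For context, this pipeline assumes payment documents that reference the user collection by _id, along the lines of the hypothetical documents below (field names are taken from the pipeline; the values and the product field are invented for illustration, and the real data is on the linked playground):
// hypothetical data shaped to match the pipeline's localField/foreignField
db.user.insertMany([
{ _id: 1, name: "alice" },
{ _id: 2, name: "bob" }
])
db.payments.insertMany([
{ _id: 101, user: 1, product: "widget" },  // alice buys the product twice
{ _id: 102, user: 1, product: "widget" },
{ _id: 103, user: 2, product: "widget" }
])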


Count _id occurrences in other collection

We have a DB structure similar to the following:
Pet owners:
/* 1 */
{
"_id" : ObjectId("5baa8b8ce70dcbe59d7f1a32"),
"name" : "bob"
}
/* 2 */
{
"_id" : ObjectId("5baa8b8ee70dcbe59d7f1a33"),
"name" : "mary"
}
Pets:
/* 1 */
{
"_id" : ObjectId("5baa8b4fe70dcbe59d7f1a2a"),
"name" : "max",
"owner" : ObjectId("5baa8b8ce70dcbe59d7f1a32")
}
/* 2 */
{
"_id" : ObjectId("5baa8b52e70dcbe59d7f1a2b"),
"name" : "charlie",
"owner" : ObjectId("5baa8b8ce70dcbe59d7f1a32")
}
/* 3 */
{
"_id" : ObjectId("5baa8b53e70dcbe59d7f1a2c"),
"name" : "buddy",
"owner" : ObjectId("5baa8b8ee70dcbe59d7f1a33")
}
I need a list of all pet owners and additionally the number of pets they own. Our current query looks similar to the following:
db.getCollection('owners').aggregate([
{ $lookup: { from: 'pets', localField: '_id', foreignField: 'owner', as: 'pets' } },
{ $project: { '_id': 1, name: 1, numPets: { $size: '$pets' } } }
]);
This works; however, it's quite slow, and I'm asking myself if there's a more efficient way to perform the query.
[update and feedback] Thanks for the answers. The solutions work; however, I unfortunately see no performance improvement compared to the query given above. Obviously, MongoDB still needs to scan the entire pets collection. My hope was that the owner index (which is present) on the pets collection could somehow be exploited to get just the counts (without needing to touch the pet documents), but this does not seem to be the case.
Are there any other ideas or solutions for a very fast retrieval of the 'pet count' beside explicitly storing the count within the owner documents?
In MongoDB 3.6 you can create a custom $lookup pipeline and return just the count instead of the entire pet documents; try:
db.owners.aggregate([
{
$lookup: {
from: "pets",
let: { ownerId: "$_id" },
pipeline: [
{ $match: { $expr: { $eq: [ "$$ownerId", "$owner" ] } } },
{ $count: "count" }
],
as: "numPets"
}
},
{
$unwind: "$numPets"
}
])
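Note that the plain $unwind above drops owners whose lookup result is empty (owners with no pets, since $count emits no document when nothing matches). Here is a sketch that keeps them with a count of 0, assuming the same collections and MongoDB 3.6+:
db.owners.aggregate([
{
$lookup: {
from: "pets",
let: { ownerId: "$_id" },
pipeline: [
{ $match: { $expr: { $eq: [ "$$ownerId", "$owner" ] } } },
{ $count: "count" }
],
as: "numPets"
}
},
// keep owners whose numPets array is empty instead of dropping them
{ $unwind: { path: "$numPets", preserveNullAndEmptyArrays: true } },
// turn the { count: N } subdocument (or a missing value) into a plain number
{ $addFields: { numPets: { $ifNull: [ "$numPets.count", 0 ] } } }
])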
You can try the aggregation below:
db.owners.aggregate([
{ "$lookup": {
"from": "pets",
"let": { "ownerId": "$_id" },
"pipeline": [
{ "$match": { "$expr": { "$eq": [ "$$ownerId", "$owner" ] }}},
{ "$count": "count" }
],
"as": "numPets"
}},
{ "$project": {
"_id": 1,
"name": 1,
"numPets": { "$ifNull": [{ "$arrayElemAt": ["$numPets.count", 0] }, 0]}
}}
])
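With the sample owner and pet documents above, this pipeline should return something like:
{ "_id" : ObjectId("5baa8b8ce70dcbe59d7f1a32"), "name" : "bob", "numPets" : 2 }
{ "_id" : ObjectId("5baa8b8ee70dcbe59d7f1a33"), "name" : "mary", "numPets" : 1 }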

Lookup and sort the foreign collection

So I have a collection users, and each document in this collection has, among other properties, an array of ids of documents in another collection: workouts.
Every document in the collection workouts has a property named date.
And here's what I want to get:
For a specific user, I want to get an array of {workoutId, workoutDate} for the workouts that belong to that user, sorted by date.
This is my attempt, which is working fine.
Users.aggregate([
{
$match : {
_id : ObjectId("whateverTheUserIdIs")
}
},
{
$unwind : {
path : "$workouts"
}
}, {
$lookup : {
from : "workouts",
localField : "workouts",
foreignField : "_id",
as : "workoutDocumentsArray"
}
}, {
$project : {
_id : false,
workoutData : {
$arrayElemAt : [
"$workoutDocumentsArray",
0
]
}
}
}, {
$project : {
date : "$workoutData.date",
id : "$workoutData._id"
}
}, {
$sort : {date : -1}
}
])
However, I refuse to believe I need all this for what would be such a simple query in SQL!? I believe I must at least be able to merge the two $project stages into one, but I've not been able to figure out how from the docs.
Thanks in advance for taking the time! ;)
====
EDIT - This is some sample data
Collection users:
[{
_id:xxx,
workouts: [2,4,6]
},{
_id: yyy,
workouts: [1,3,5]
}]
Collection workouts:
[{
_id:1,
date: 1/1/1901
},{
_id:2,
date: 2/2/1902
},{
_id:3,
date: 3/3/1903
},{
_id:4,
date: 4/4/1904
},{
_id:5,
date: 5/5/1905
},{
_id:6,
date: 6/6/1906
}]
And after running my query, for example for user xxx, I would like to get only the workouts that belong to him (whose ids appear in his workouts array), so the result I want would look like:
[{
id:6,
date: 6/6/1906
},{
id:4,
date: 4/4/1904
},{
id:2,
date: 2/2/1902
}]
You don't need to $unwind the workouts array before the $lookup, since localField already accepts an array of _ids, and you can use $replaceRoot instead of the $project stages:
Users.aggregate([
{ "$match": { "_id" : ObjectId("whateverTheUserIdIs") }},
{ "$lookup": {
"from" : "workouts",
"localField" : "workouts",
"foreignField" : "_id",
"as" : "workoutDocumentsArray"
}},
{ "$unwind": "$workoutDocumentsArray" },
{ "$replaceRoot": { "newRoot": "$workoutDocumentsArray" }}
{ "$sort" : { "date" : -1 }}
])
or even with the new $lookup syntax:
Users.aggregate([
{ "$match" : { "_id": ObjectId("whateverTheUserIdIs") }},
{ "$lookup" : {
"from" : "workouts",
"let": { "workouts": "$workouts" },
"pipeline": [
{ "$match": { "$expr": { "$in": ["$_id", "$$workouts"] }}},
{ "$sort" : { "date" : -1 }}
],
"as" : "workoutDocumentsArray"
}},
{ "$unwind": "$workoutDocumentsArray" },
{ "$replaceRoot": { "newRoot": "$workoutDocumentsArray" }}
])
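As an aside, the two $project stages in the original attempt can also be merged into one by reaching through the looked-up array directly; here is a sketch of just that combined stage, using the same field names:
{ $project : {
_id : false,
// "$workoutDocumentsArray.date" resolves to an array of dates; take its first element
date : { $arrayElemAt : [ "$workoutDocumentsArray.date", 0 ] },
id : { $arrayElemAt : [ "$workoutDocumentsArray._id", 0 ] }
}
}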

Filter $lookup results

I have 2 collections (with example documents):
reports
{
id: "R1",
type: "xyz",
}
reportfiles
{
id: "F1",
reportid: "R1",
time: ISODate("2016-06-13T14:20:25.812Z")
},
{
id: "F14",
reportid: "R1",
time: ISODate("2016-06-15T09:20:29.809Z")
}
As you can see one report may have multiple reportfiles.
I'd like to perform a query, matching a report id, returning the report document as is, plus an additional key storing as subdocument the reportfile with the most recent time (even better without reportid, as it would be redundant), e.g.
{
id: "R1",
type: "xyz",
reportfile: {
id: "F14",
reportid: "R1",
time: ISODate("2016-06-15T09:20:29.809Z")
}
}
My problem here is that every report type has its own set of properties, so using $project in an aggregation pipeline is not the best way.
So far I got
db.reports.aggregate([{
$match : { id: 'R1' }
}, {
$lookup : {
from : 'reportfiles',
localField : 'id',
foreignField : 'reportid',
as : 'reportfile'
}
}
])
returning, of course, as 'reportfile' the list of all files with the given reportid. How can I efficiently filter that list to get the only element I need?
efficiently -> I tried using $unwind as the next pipeline step, but the resulting document was frighteningly and pointlessly long.
Thanks in advance for any suggestion!
You need to add another $project stage to your aggregation pipeline after the $lookup stage.
{ "$project": {
"id": "R1",
"type": "xyz",
"reportfile": {
"$let": {
"vars": {
"obj": {
"$arrayElemAt": [
{ "$filter": {
"input": "$reportfile",
"as": "report",
"cond": { "$eq": [ "$$report.time", { "$max": "$reportfile.time" } ] }
}},
0
]
}
},
"in": { "id": "$$obj.id", "time": "$$obj.time" }
}
}
}}
The $filter operator filters the $lookup result and returns an array containing the documents that satisfy your condition. The condition here is $eq, which returns true for the document whose time equals the $max value.
The $arrayElemAt operator then picks the element at index 0 from the $filter result, which you assign to a variable using the $let operator. From there, you can easily access the fields you want in your result with dot notation.
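Since every report type has its own set of properties, an alternative is to keep the whole report document and only overwrite the reportfile field; here is a sketch assuming MongoDB 3.4+ for $addFields:
db.reports.aggregate([
{ $match: { id: "R1" } },
{
$lookup: {
from: "reportfiles",
localField: "id",
foreignField: "reportid",
as: "reportfile"
}
},
// replace the joined array with the single most recent file
{
$addFields: {
reportfile: {
$arrayElemAt: [
{ $filter: {
input: "$reportfile",
as: "rf",
cond: { $eq: [ "$$rf.time", { $max: "$reportfile.time" } ] }
}},
0
]
}
}
}
])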
Another approach is to run the aggregation on the reportfiles collection instead: do the "join" on the reports collection with $lookup, flatten the result (with $unwind), order the documents (with $sort), and then group by reportid, outputting the desired result with the $first accumulator operator.
The following demonstrates this approach:
db.reportfiles.aggregate([
{ "$match": { "reportid": "R1" } },
{
"$lookup": {
"from": 'reports',
"localField" : 'reportid',
"foreignField" : 'id',
"as": 'report'
}
},
{ "$unwind": "$report" },
{ "$sort": { "time": -1 } },
{
"$group": {
"_id": "$reportid",
"type": { "$first": "$report.type" },
"reportfile": {
"$first": {
"id": "$id",
"reportid": "$reportid",
"time": "$time"
}
}
}
}
])
Sample Output:
{
"_id" : "R1",
"type" : "xyz",
"reportfile" : {
"id" : "F14",
"reportid" : "R1",
"time" : ISODate("2016-06-15T09:20:29.809Z")
}
}

MongoDB - get $max among fields at different levels

I have a MongoDB collection with documents of this (simplified) form
{
"_id": "Doc"
"created": NumberLong("1422526079335")
}
Additionally, this documents may have an additional edited field
{
"_id": "Doc"
"created": NumberLong("1422526079335")
"edited": {
"date": NumberLong("1458128507498")
}
}
What I need is to get the most recent timestamp (among created and edited.date) for a subset of these documents, matching certain conditions.
What I achieved so far is to get the most recent created timestamp
db.myCollection.aggregate([ { $match: { ... } },
{ $project: { _id:0, created: 1 } },
{ $group: { _id: 'latest', latest: { $max: '$created' } } }
])
which returns
{ "_id" : "latest", "latest" : NumberLong("1422526079335") }
How can I integrate the check against edited.date in the $max logic above? Or alternatively, is there another solution? Thanks in advance!
Try this script; it's a simple use of the $max operator.
I have following documents in collection
{
"_id" : "Doc",
"created" : NumberLong(1422526079335),
"edited" : {
"date" : NumberLong(1458128507498)
}
}
{
"_id" : "Doc1",
"created" : NumberLong(1422526079335)
}
Try running the following query:
db.doc.aggregate([
{
$match: { ... }
},
{
$project:{
latest:{ $max:["$created", "$edited.date"]}
}
}
])
Output will be:
{
"_id" : "Doc",
"latest" : NumberLong(1458128507498)
}
{
"_id" : "Doc1",
"latest" : NumberLong(1422526079335)
}
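If, as in the original $group, you want a single overall latest value rather than one per document, you could collapse the per-document maxima with a $group stage; a sketch:
db.doc.aggregate([
{ $match: { ... } },
{ $project: { _id: 0, latest: { $max: [ "$created", "$edited.date" ] } } },
// reduce the per-document maxima to one overall latest timestamp
{ $group: { _id: "latest", latest: { $max: "$latest" } } }
])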
You can also use $cond in the $project stage instead; lastModifiedDate does the trick :-)
db.ill.aggregate([{
$project: {
lastModifiedDate: {
$cond: {
if: { $gte: [ "$created", "$edited.date" ] },
then: "$created",
else: "$edited.date"
}
}
}
}])

mongodb sorting array documents

This is my document. I want to sort the array documents in ascending order; my query is in the following code, but I am not getting the docs in sorted order.
The query is
db.sample.find({_id: ObjectId("55b32f5957e47fabd30c5d2e")}).sort({'naresh.ts':1}).pretty();
This is the result I am getting
{
"_id" : ObjectId("55b32f5957e47fabd30c5d2e"),
"naresh" : [
{
"ts" : "hi",
"created_by" : 1437806425105
},
{
"ts" : "hello",
"created_by" : 1437806425105
},
{
"ts" : "waht",
"created_by" : 1437807757261
},
{
"ts" : "lefo",
"created_by" : 1437807768514
},
{
"ts" : "lefow",
"created_by" : 1437807775719
}
]
}
You can use aggregation like the following query:
db.collection.aggregate({
"$match": {
"_id": ObjectId("55b32f5957e47fabd30c5d2e")
}
}, {
$unwind: "$naresh"
}, {
$sort: {
"naresh.ts": 1
}
}, {
"$group": {
_id: "$_id",
"naresh": {
$push: "$naresh"
}
}
})
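With the document above, this should return the array sorted by ts (string comparison), along the lines of:
{
"_id" : ObjectId("55b32f5957e47fabd30c5d2e"),
"naresh" : [
{ "ts" : "hello", "created_by" : 1437806425105 },
{ "ts" : "hi", "created_by" : 1437806425105 },
{ "ts" : "lefo", "created_by" : 1437807768514 },
{ "ts" : "lefow", "created_by" : 1437807775719 },
{ "ts" : "waht", "created_by" : 1437807757261 }
]
}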
The cursor .sort() only looks at the values in the array in order to use the "smallest" value of the specified field (in ascending order) to determine how to "sort" the documents in the response. It does not "sort" the array content itself.
In order to sort the array, you need to use the aggregation framework to manipulate the document:
db.sample.aggregate([
{ "$match": { "_id": ObjectId("55b32f5957e47fabd30c5d2e") },
{ "$unwind": "$naresh" },
{ "$sort": { "$naresh.ts": 1 } },
{ "$group": {
"_id": "$_id",
"naresh": { "$push": "$naresh" }
}}
])
That sorts the array.
Better yet, if you "always" want the results sorted, then do it as you update the document:
db.sample.update({}, { "$push": { "naresh": { "$each": [], "$sort": { "ts": 1 } } } }, { "multi": true })
And use those same $each and $sort modifiers when adding new elements to the array, and the content will remain sorted.
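For example, pushing a new element while keeping the array sorted might look like this (the new entry here is purely hypothetical):
db.sample.update(
{ "_id": ObjectId("55b32f5957e47fabd30c5d2e") },
{ "$push": {
"naresh": {
"$each": [ { "ts": "zzz", "created_by": 1437807800000 } ],  // hypothetical new element
"$sort": { "ts": 1 }
}
}}
)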
If you just want to query the collection and get the output sorted, then Blackes Seven's answer will work perfectly for you.
However, if you want to update the documents so that the array is kept in sorted order, go with this update query:
db.sample.update(
{
_id: ObjectId("55b32f5957e47fabd30c5d2e")
},
{
$push: {
naresh: {
$each: [],
$sort: {created_by: 1}
}
}
}
)