Find every document with the same field but different case

I'm having trouble with my database because I have user documents whose email field is stored with inconsistent case (due to the ability to create 'ghost' users that are waiting to register). When a user registers, I use the lowercased version of their email and overwrite the previous entry. The problem is that the ghost email was never lowercased.
So if a ghost is created for Foo@bar.com and Foo@bar.com then registers, they will be known as 'foo@bar.com', and the Foo@bar.com entry will just pollute my database.
I'm looking for a way to find the duplicate entries and remove the irrelevant ones (by hand) before I push my fix for the casing. Ideas?
Thank you!

Try this:
db.users.aggregate([
  { $match: {
    "username": { $exists: true }
  }},
  { $project: {
    "username": { "$toLower": "$username" }
  }},
  { $group: {
    _id: "$username",
    total: { $sum: 1 }
  }},
  { $match: {
    total: { $gte: 2 }
  }},
  { $sort: {
    total: -1
  }}
]);
This will find every user with a username, lowercase the usernames, group them by username, and display the usernames that occur more than once.
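Note that the $project above replaces each document with just the lowercased username, so the output won't tell you which concrete documents collide. A variant that also collects the raw spellings and document ids for manual review, sketched here under the question's assumption that the field is email:
db.users.aggregate([
  { $match: { "email": { $exists: true } } },
  { $group: {
    _id: { $toLower: "$email" },         // canonical lowercased key
    variants: { $addToSet: "$email" },   // every raw spelling seen
    ids: { $push: "$_id" },              // documents to inspect by hand
    total: { $sum: 1 }
  }},
  { $match: { total: { $gte: 2 } } }
]);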

You can use $project with the $toLower operator to achieve what you are looking for. Assuming that your attribute is named "email" in your collection's documents, here is an example:
db.yourcollection.aggregate([
  { $project: {
    "email": { "$toLower": "$email" }
  }},
  { $match: {
    "email": /foo@bar.com/
  }}
]);
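If you only need to check a single address regardless of case, a case-insensitive regular expression against the raw field avoids the $project pass entirely; a sketch using the question's example address, with the dot escaped and the pattern anchored so it matches the exact address:
db.yourcollection.find({ "email": /^foo@bar\.com$/i });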

Related

MongoDB $group by id where id name varies (conditionals?)

I need to group all entries by id using $group. However, the id lives under different objects: it is $student.id for some entries and $teacher.id for others. I've tried $cond, $is, and every other conditional I could find, but haven't had any luck. What I want is something like this:
lessons.aggregate([
  // matches, lookups, etc.
  { $group: {
    "_id": {
      "id": (if student exists "student.id", else if teacher exists "teacher.id")
    },
    // other fields
  }}
]);
How can I do this? I've scoured the MongoDB docs for hours yet nothing works. I'm new to this company and trying to debug something so not familiar with the tech yet, so apologies if this is rudimentary stuff!
Update: providing some sample data to demonstrate what I want, shortened from the real thing to fit the question. After all the matches, lookups, etc., and before using $group, the data looks like this. As the student._id of the first and second objects is the same, I want those two to be grouped.
{
  student: {
    _id: new ObjectId("61dc0fce904d07184b461c03"),
    name: "Jess W"
  },
  duration: 30
},
{
  student: {
    _id: new ObjectId("61dc0fce904d07184b461c03"),
    name: "Jess W"
  },
  duration: 30
},
{
  teacher: {
    _id: new ObjectId("61dc0f6a904d07184b461be7"),
    name: "Michael S"
  },
  duration: 30
},
{
  teacher: {
    _id: new ObjectId("61dc1087904d07184b461c6a"),
    name: "Andrew J"
  },
  duration: 30
},
If the fields are mutually exclusive (each document has either student or teacher, never both), then you can simply combine them:
{ $group: { _id: { student: "$student._id", teacher: "$teacher._id" } /* other fields */ } }
Concatenating them should also work, though ObjectIds need to be converted to strings and missing fields defaulted to "", otherwise $concat yields null:
{ $group: { _id: { $concat: [{ $ifNull: [{ $toString: "$student._id" }, ""] }, { $ifNull: [{ $toString: "$teacher._id" }, ""] }] } /* other fields */ } }
You can just use $ifNull and chain them, like so:
db.collection.aggregate([
  {
    $group: {
      _id: {
        $ifNull: [
          "$student._id",
          { $ifNull: [ "$teacher._id", "$archer._id" ] }
        ]
      }
    }
  }
])
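Worth noting: on MongoDB 5.0 and newer, $ifNull accepts more than two arguments, so the chain above can be flattened ($archer._id being the same hypothetical third field as above):
db.collection.aggregate([
  {
    $group: {
      // the first non-null, non-missing expression wins
      _id: { $ifNull: [ "$student._id", "$teacher._id", "$archer._id" ] }
    }
  }
])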

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
  { "_id": "42.abc", "ts_utc": "2019-05-27T23:43:16.963Z" },
  { "_id": "42.def", "ts_utc": "2019-05-27T23:43:17.055Z" },
  { "_id": "69.abc", "ts_utc": "2019-05-27T23:43:17.147Z" },
  { "_id": "69.def", "ts_utc": "2019-05-27T23:44:02.427Z" }
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
  { "_id": /^42\..*/ }
).sort(
  { "ts_utc": -1 }
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the max ts_utc per each first (numeric) part.
That way you can do it in one shot, instead of iterating per group.
db.foo.aggregate([
  { $project: { id_parts: { $split: ["$_id", "."] }, ts_utc: 1 } },
  { $group: { "_id": { $arrayElemAt: ["$id_parts", 0] }, max: { $max: "$ts_utc" } } }
])
As @danh mentioned in the comments, the best approach is probably to add an auxiliary field to indicate the grouping. You can then index the auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
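If you go the route of actually storing the auxiliary group field on each document (as suggested above) rather than deriving it in the pipeline, a compound index can serve the sort cheaply; a minimal sketch, assuming the stored field is named group:
// Assumes a persisted "group" field on each document (hypothetical name).
db.collection.createIndex({ group: 1, ts_utc: -1 });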

Calculating the sum of specific fields from a complex array object

I would like to migrate one of my Firebase projects to Mongo and move the calculations from the server side to the DB. I have already written most of the queries, but this one is beyond my knowledge.
Player data is saved by week, and I need to calculate the sum of donations and points for each player (the rest of the fields can be ignored).
PS: Some of the players are already banned, so it would be enough to calculate the fields for a given player set (like: tag in ['playerId1', 'playerId2', ...]). If that is too complex, I will do this filtering later on the server side.
[
  {
    "week": "2021-01",
    "players": [
      {
        "donations": 20,
        "games": 3,
        "name": "Player1",
        "points": 258,
        "tag": "playerId1"
      },
      {
        "donations": 37,
        "games": 5,
        "name": "Player2",
        "points": 634,
        "tag": "playerId2"
      },
      { ... }
    ]
  },
  {
    "week": "2021-02",
    "players": [ { ... } ]
  }
]
So the result should be something like this:
[
  {
    "name": "Player1",
    "tag": "playerId1",
    "donations": 90,
    "points": 980
  },
  {
    "name": "Player2",
    "tag": "playerId2",
    "donations": 80,
    "points": 1211
  }
]
I think the $unwind and the $group operators could be the key but I can't figure out how to use them properly here.
$unwind deconstructs the players array
$group groups by name, summing donations and points and taking the first tag
$project shows the required fields
db.collection.aggregate([
{ $unwind: "$players" },
{
$group: {
_id: "$players.name",
donations: { $sum: "$players.donations" },
points: { $sum: "$players.points" },
tag: { $first: "$players.tag" }
}
},
{
$project: {
_id: 0,
name: "$_id",
points: 1,
tag: 1,
donations: 1
}
}
])
PS: Some of the players are already banned, so it would be enough to calculate the fields for a given player set (like: tag in ['playerId1', 'playerId2', ...]).
You can put a $match stage right after the $unwind stage:
{ $match: { "players.tag": { $in: ['playerId1', 'playerId2', ..more] } } }
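In context, the filter then runs between the $unwind and the $group, so banned players never reach the grouping; a sketch combining it with the pipeline above:
db.collection.aggregate([
  { $unwind: "$players" },
  { $match: { "players.tag": { $in: ["playerId1", "playerId2"] } } }, // keep only the wanted players
  {
    $group: {
      _id: "$players.name",
      donations: { $sum: "$players.donations" },
      points: { $sum: "$players.points" },
      tag: { $first: "$players.tag" }
    }
  }
])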
You were right:
db.collection.aggregate([
{//Denormalize
"$unwind": "$players"
},
{//Group by name
"$group": {
"_id": "$players.name",
"donations": {
"$sum": "$players.donations"
},
"points": {
"$sum": "$players.points"
},
}
}
])
You can add a $project stage if you really need name as a key rather than _id.
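For example, appending something like this to the pipeline (a sketch):
{ $project: { _id: 0, name: "$_id", donations: 1, points: 1 } }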

MongoDB use $last with $cond

Is there any way to get the last non-zero value in an aggregation? Is that possible?
Scenario:
I have an events collection, in which I store all the events from users. I want to fetch the list of users whose last purchased item cost $1.9 and who logged in at least once last week.
My events collection will have records like
{_id:ObjectId("58af54d5ab7df73d71822708"),uid:1,event:"login"}
{_id:ObjectId("58db7189296fdedde1c04bc1"),uid:2,event:"login"}
{_id:ObjectId("5888419bfa4b69dc4af7c76c"),uid:2,event:"purchase",amount:3}
{_id:ObjectId("5888419bfa4b69dc4af7d45c"),uid:1,event:"purchase",amount:1.9}
{_id:ObjectId("5888819bfa4b69dc4af7c76c"),uid:1,event:"custom",type:3,value:2}
What I am trying to do:
db.events.aggregate([{
  $group: {
    _id: '$uid',
    last_login: {
      $max: {
        $cond: [{
          $eq: ['$event', 'login']
        }, '$_id', 0]
      }
    },
    last_amount: {
      $last: {
        $cond: [{
          $eq: ['$event', 'purchase']
        }, '$amount', 0]
      }
    }
  }
}, {
  $match: {
    last_login: {
      $gte: ObjectId("58af54d50000000000000000")
    },
    last_amount: 1.9
  }
}])
which obviously will fail, because $last will end up with 0 as the last item.
The output I am expecting is
{_id: 1, last_login: ObjectId("58af54d5ab7df73d71822708"), last_amount: 1.9}
The query is system generated. Please help.
To answer your question: you cannot modify the behaviour of $last using $cond, i.e. there is no 'not the last entry because of a criterion'.
There are a few alternatives you could try:
Alter your document schema for this access pattern, especially if this is a frequently used query in your application. You should use the schema as leverage to optimise your application's queries.
Use one of the MongoDB supported drivers and compute the expected outcome in application code.
Depending on the use case, you could execute 2 separate aggregation queries: first query all the users (uid) that logged in at least once last week, and second keep only those users whose last purchased item cost $1.9 (a sketch follows this list).
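A minimal sketch of that two-query approach in the shell, using the field names from the sample events above (the ObjectId cutoff trick is taken from the question):
// 1) uids that logged in at least once since the cutoff
//    (an ObjectId embeds its creation timestamp, so it can bound a time range)
var cutoff = ObjectId("58af54d50000000000000000");
var loggedIn = db.events.distinct("uid", { event: "login", _id: { $gte: cutoff } });

// 2) keep only the users whose most recent purchase had amount 1.9
var result = loggedIn.filter(function(uid) {
  var last = db.events.find({ uid: uid, event: "purchase" })
                      .sort({ _id: -1 }).limit(1).toArray()[0];
  return last !== undefined && last.amount === 1.9;
});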
I found a workaround. Instead of $last, I used $push in the $group stage and added $slice with $setDifference in $project to remove the null values. The query will now look something like:
db.events.aggregate([{
  $group: {
    _id: '$uid',
    last_login: {
      $max: {
        $cond: [{
          $eq: ['$event', 'login']
        }, '$_id', 0]
      }
    },
    last_amount: {
      $push: {
        $cond: [{
          $eq: ['$event', 'purchase']
        }, '$amount', null]
      }
    }
  }
}, {
  $project: {
    last_login: 1,
    last_amount: { $slice: [{ $setDifference: ['$last_amount', [null]] }, -1, 1] }
  }
}, {
  $match: {
    last_login: {
      $gte: ObjectId("58af54d50000000000000000")
    },
    last_amount: 1.9
  }
}])

Sub-query in MongoDB

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:
{_id: ObjectId("xxxxx"), country: "UK",...}
and actions like
{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}
I am trying to count events and distinct users, split by country. The first half of this works fine; however, when I try to add a sub-query to pull in the country, I only get nulls out for country:
db.events.aggregate({
$match: {
createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },
user: { $exists: true }
}
},
{
$group: {
_id: {
year: { $year: "$createdAt" },
user_obj: "$user"
},
count: { $sum: 1 }
}
},
{
$group: {
_id: {
year: "$_id.year",
country: db.users.findOne({
_id: { $eq: "$_id.user_obj" },
country: { $exists: true }
}).country
},
total: { $sum: "$count" },
distinct: { $sum: 1 }
}
})
No Joins in here, just us bears
So MongoDB "does not do joins". You might have tried something like this in the shell for example:
db.events.find().forEach(function(event) {
  event.user = db.users.findOne({ "_id": event.user });
  printjson(event);
})
But this does not do what you seem to think it does. It does exactly what it looks like: it runs a separate query on the "users" collection for every item returned from the "events" collection, shuttling data "to and from" the client; none of it runs on the server.
For the same reason, your 'embedded' statement within the aggregation pipeline does not work like that. Unlike the above, the "whole pipeline" logic is sent to the server before execution. So suppose you did something like this to 'select "UK" users':
db.events.aggregate([
{ "$match": {
"user": {
"$in": db.users.distinct("_id",{ "country": "UK" })
}
}}
])
Then that .distinct() query is actually evaluated on the "client" and not the server, and therefore has no access to any document values in the aggregation pipeline. So the .distinct() runs first, returns its array as an argument, and then the whole pipeline is sent to the server. That is the order of execution.
Correcting
You need at least some level of de-normalization for the sort of query you want to run. So you generally have two choices:
Embed your whole user object data within the event data.
At least embed "some" of the user object data within the event data. In this case "country", because you are going to use it.
So if you follow the "second" case and at least "extend" your existing data a little to include the "country", like this:
{
"_id": ObjectId("yyyyy"),
"createdAt": ISODate(),
"user": {
"_id": ObjectId("xxxxx"),
"country": "UK"
}
}
Then the "aggregation" process becomes simple:
db.events.aggregate([
  { "$match": {
    "createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
    "user": { "$exists": true }
  }},
  { "$group": {
    "_id": {
      "year": { "$year": "$createdAt" },
      "user_id": "$user._id",
      "country": "$user.country"
    },
    "count": { "$sum": 1 }
  }},
  { "$group": {
    "_id": "$_id.country",
    "total": { "$sum": "$count" },
    "distinct": { "$sum": 1 }
  }}
])
We're not normal
Fixing your data to include the information it needs in a single collection, where we "do not do joins", is a relatively simple process. It is really just a variant of the original query sample above:
var bulk = db.events.initializeUnorderedBulkOp(),
    count = 0;

db.users.find().forEach(function(user) {
  // update multiple events for user
  bulk.find({ "user": user._id }).update({
    "$set": { "user": { "_id": user._id, "country": user.country } }
  });
  count++;

  // Send batch every 1000
  if ( count % 1000 == 0 ) {
    bulk.execute();
    bulk = db.events.initializeUnorderedBulkOp();
  }
});

// Clear any queued
if ( count % 1000 != 0 )
  bulk.execute();
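On newer shells and drivers the same backfill can be expressed with the CRUD bulkWrite API; a minimal sketch under the same assumptions:
// Batch updateMany operations, flushing every 1000 users.
var ops = [];
db.users.find().forEach(function(user) {
  ops.push({ updateMany: {
    filter: { "user": user._id },
    update: { "$set": { "user": { "_id": user._id, "country": user.country } } }
  }});
  if (ops.length === 1000) {
    db.events.bulkWrite(ops);
    ops = [];
  }
});
if (ops.length > 0)
  db.events.bulkWrite(ops);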
So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.
If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating over. It is perfectly okay to have data sitting in separate collections as well; for instance, "users" would generally have more information attached to them than just an "_id" and a "country".
But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.
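One caveat for later readers: this answer predates it, but MongoDB 3.2 and newer ship a $lookup stage that performs a left outer join inside the pipeline, which makes the original per-country count possible without denormalizing. A sketch against the two collections above (userDoc is just an arbitrary output field name):
db.events.aggregate([
  { "$match": {
    "createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
    "user": { "$exists": true }
  }},
  // join each event to its user document
  { "$lookup": {
    "from": "users",
    "localField": "user",
    "foreignField": "_id",
    "as": "userDoc"
  }},
  { "$unwind": "$userDoc" },
  { "$group": {
    "_id": {
      "year": { "$year": "$createdAt" },
      "user_id": "$user",
      "country": "$userDoc.country"
    },
    "count": { "$sum": 1 }
  }},
  { "$group": {
    "_id": "$_id.country",
    "total": { "$sum": "$count" },
    "distinct": { "$sum": 1 }
  }}
])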