Counting distinct number of users from beginning - mongodb

I have a MongoDB aggregation pipeline that has been frustrating me for a while now, because it never seems to be accurate or correct to my needs. The aim is to count the number of new unique users each day per chatbot, starting from the very beginning.
Here's what my pipeline looks like right now.
[
{
"$project" : {
"_id" : 0,
"bot_id" : 1,
"customer_id" : 1,
"timestamp" : {
"$ifNull" : [
'$incoming_log.created_at', '$outcome_log.created_at'
]
}
}
},
{
"$project" : {
"customer_id" : 1,
"bot_id" : 1,
"timestamp" : {
"$dateFromString" : {
"dateString" : {
"$substr" : [
"$timestamp", 0, 10
]
}
}
}
}
},
{
"$group" : {
"_id" : "$customer_id",
"timestamp" : {
"$first" : "$timestamp"
},
"bot_id" : {
"$addToSet" : "$bot_id"
}
}
},
{
"$unwind" : "$bot_id"
},
{
"$group" : {
"_id" : {
"bot_id" : "$bot_id",
"customer_id" : "$_id"
},
"timestamp" : {
"$first" : "$timestamp"
}
}
},
{
"$project" : {
"_id" : 0,
"timestamp" : 1,
"customer_id" : "$_id.customer_id",
"bot_id" : "$_id.bot_id"
}
},
{
"$group" : {
"_id": {
"timestamp" : "$timestamp",
"bot_id" : "$bot_id"
},
"new_users" : {
"$sum" : 1
}
}
},
{
"$project" : {
"_id" : 0,
"timestamp" : "$_id.timestamp",
"bot_id" : "$_id.bot_id",
"new_users" : 1
}
}
]
Some sample data for an idea of what the data looks like...
{
"mid" : "...",
"bot_id" : "...",
"bot_name" : "JOBBY",
"customer_id" : "U122...",
"incoming_log" : {
"created_at" : ISODate("2020-12-08T09:14:16.237Z"),
"event_payload" : "",
"event_type" : "text"
},
"outcome_log" : {
"created_at" : ISODate("2020-12-08T09:14:18.145Z"),
"distance" : 0.25,
"incoming_msg" : "🥺"
}
}
My expected outcome is something along the lines of:
{
"new_users" : 1187.0,
"timestamp" : ISODate("2021-01-27T00:00:00.000Z"),
"bot_id" : "5ffd......."
},
{
"new_users" : 1359.0,
"timestamp" : ISODate("2021-01-27T00:00:00.000Z"),
"bot_id" : "6def......."
}
Have I overcomplicated my pipeline somewhere? I seem to get a reasonable number of new users per bot each day, but for some reason my colleague tells me that the number is too high. I need some tips, please!

I really have no idea what you are looking for.
"The aim is to count the number of new unique users each day per chatbot, starting from the very beginning."
What is "new unique users"? What do you mean by "starting from the very beginning"? You ask for count per day but you use {"$group": {"_id": "$customer_id", "timestamp": { "$first": "$timestamp" } } }
To me, your grouping does not make any sense. With only a single sample document, it is almost impossible to guess what you would like to count.
Regarding grouping per day: I prefer to always work with Date values rather than strings, because it is less error-prone. You may also have to consider time zones, because UTC midnight is not your local midnight. Working with Dates gives you better control over that.
The $project stages are unnecessary when a $group follows them. Typically you only need a single $project stage at the end.
So, here is something to start with:
db.collection.aggregate([
{
$set: {
day: {
$dateToParts: {
date: { $ifNull: ["$incoming_log.created_at", "$outcome_log.created_at"] }
}
}
}
},
{
$group: {
_id: "$customer_id",
timestamp: {$min: { $dateFromParts: { year: "$day.year", month: "$day.month", day: "$day.day" } }}
}
}
]);
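Building on that starting point, here is a minimal sketch of how the rest of the count could look. It assumes that "new user" means the first day a customer_id ever appears for a given bot_id, that MongoDB 5.0+ is available for $dateTrunc, and that days are cut at UTC midnight. The names newUsersPerBotPerDay and TIMEZONE are hypothetical, and the collection name in the usage line is a placeholder.

```javascript
// Assumption: days are bucketed in this time zone; change it if local days are needed.
const TIMEZONE = "UTC";

const newUsersPerBotPerDay = [
  // One timestamp per message, preferring the incoming log.
  { $set: { ts: { $ifNull: ["$incoming_log.created_at", "$outcome_log.created_at"] } } },
  // Earliest message per (bot, customer) pair; $min does not depend on input order.
  { $group: {
      _id: { bot_id: "$bot_id", customer_id: "$customer_id" },
      firstSeen: { $min: "$ts" }
  } },
  // Count the pairs by the day of their first message.
  { $group: {
      _id: {
        bot_id: "$_id.bot_id",
        day: { $dateTrunc: { date: "$firstSeen", unit: "day", timezone: TIMEZONE } }
      },
      new_users: { $sum: 1 }
  } },
  // Flatten to the shape shown in the question.
  { $project: { _id: 0, bot_id: "$_id.bot_id", timestamp: "$_id.day", new_users: 1 } },
  { $sort: { timestamp: 1, bot_id: 1 } }
];
```

In mongosh this would be run as db.collection.aggregate(newUsersPerBotPerDay). Unlike $first without a preceding $sort, $min gives the same first-seen date regardless of the order in which documents are read.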

Related

Creating measures in a mongodb aggregation pipeline

I have a report that has been developed in PowerBI. It runs over a collection of jobs, and for a given month and year counts the number of jobs that were created, due or completed in that month using measures.
I am attempting to reproduce this report using a mongoDB aggregation pipeline. At first, I thought I could just use the $group stage to do this, but quickly realised that grouping by a specific date would exclude jobs.
Some sample documents are below (most fields excluded as they are not relevant):
{
"_id": <UUID>,
"createdOn": ISODate("2022-07-01T00:00"),
"dueOn": ISODate("2022-08-01T00:00"),
"completedOn": ISODate("2022-07-29T00:00")
},
{
"_id": <UUID>,
"createdOn": ISODate("2022-06-01T00:00"),
"dueOn": ISODate("2022-08-01T00:00"),
"completedOn": ISODate("2022-07-24T00:00")
}
For example, if I group by created date, the record for July 2022 would show 1 created job and only 1 completed job, but it should show 2.
How can I go about recreating this report? One idea was that I needed to determine the minimum and maximum of all the possible dates across those 3 date fields in my collection, but I don't know where to go from there.
I ended up solving this by using a $facet stage. I followed this process:
Each facet field grouped by a different date field from the source document, and then aggregated the relevant field (e.g. counts, or sums as required). I ensured each of these fields in the facet had a unique name.
I then added a $project stage where I took each of the facet output fields (arrays) and concatenated them into a single array.
I unwound the array, and then replaced the root to make it simpler to work with.
I then grouped again by the _id field, which had been set to the relevant date inside each facet, and summed the relevant fields.
The relevant parts of the pipeline are below:
db.getCollection("jobs").aggregate(
// Pipeline
[
// Stage 3
{
$facet: {
//Facet 1, group by created date, count number of jobs created
//facet 2, group by completed date, count number of jobs completed
//facet 3, group by due date, count number of jobs due
"created" : [
{
$addFields : {
"monthStarting" : {
"$dateFromString" : {
"dateString" : {
"$dateToString" : {
"date" : {
"$dateTrunc" : {
"date" : "$createdAt",
"unit" : "month",
"binSize" : 1.0,
"timezone" : "$timezone",
"startOfWeek" : "mon"
}
},
"timezone" : "$timezone"
}
}
}
},
"yearStarting" : {
"$dateFromString" : {
"dateString" : {
"$dateToString" : {
"date" : {
"$dateTrunc" : {
"date" : "$createdAt",
"unit" : "year",
"binSize" : 1.0,
"timezone" : "$timezone"
}
},
"timezone" : "$timezone"
}
}
}
}
}
},
{
$group : {
"_id" : {
"year" : "$yearStarting",
"month" : "$monthStarting"
},
"monthStarting" : {
"$first" : "$monthStarting"
},
"yearStarting" : {
"$first" : "$yearStarting"
},
"createdCount": {$sum: 1}
}
}
],
"completed" : [
{
$addFields : {
"monthStarting" : {
"$dateFromString" : {
"dateString" : {
"$dateToString" : {
"date" : {
"$dateTrunc" : {
"date" : "$completedDate",
"unit" : "month",
"binSize" : 1.0,
"timezone" : "$timezone",
"startOfWeek" : "mon"
}
},
"timezone" : "$timezone"
}
}
}
},
"yearStarting" : {
"$dateFromString" : {
"dateString" : {
"$dateToString" : {
"date" : {
"$dateTrunc" : {
"date" : "$completedDate",
"unit" : "year",
"binSize" : 1.0,
"timezone" : "$timezone"
}
},
"timezone" : "$timezone"
}
}
}
}
}
},
{
$group : {
"_id" : {
"year" : "$yearStarting",
"month" : "$monthStarting"
},
"monthStarting" : {
"$first" : "$monthStarting"
},
"yearStarting" : {
"$first" : "$yearStarting"
},
"completedCount": {$sum: 1}
}
}
],
"due": [
{
$match: {
"dueDate": {$ne: null}
}
},
{
$addFields : {
"monthStarting" : {
"$dateFromString" : {
"dateString" : {
"$dateToString" : {
"date" : {
"$dateTrunc" : {
"date" : "$dueDate",
"unit" : "month",
"binSize" : 1.0,
"timezone" : "$timezone",
"startOfWeek" : "mon"
}
},
"timezone" : "$timezone"
}
}
}
},
"yearStarting" : {
"$dateFromString" : {
"dateString" : {
"$dateToString" : {
"date" : {
"$dateTrunc" : {
"date" : "$dueDate",
"unit" : "year",
"binSize" : 1.0,
"timezone" : "$timezone"
}
},
"timezone" : "$timezone"
}
}
}
}
}
},
{
$group : {
"_id" : {
"year" : "$yearStarting",
"month" : "$monthStarting"
},
"monthStarting" : {
"$first" : "$monthStarting"
},
"yearStarting" : {
"$first" : "$yearStarting"
},
"dueCount": {$sum: 1},
"salesRevenue": {$sum: "$totalSellPrice"},
"costGenerated": {$sum: "$totalBuyPrice"},
"profit": {$sum: "$profit"},
"avgValue": {$avg: "$totalSellPrice"},
"finalisedRevenue": {$sum: {
$cond: {
"if": {$in: ["$status",["Finalised","Closed"]]},
"then": "$totalSellPrice",
"else": 0
}
}}
}
}
]
}
},
// Stage 4
{
$project: {
"docs": {$concatArrays: ["$created","$completed","$due"]}
}
},
// Stage 5
{
$unwind: {
path: "$docs",
}
},
// Stage 6
{
$replaceRoot: {
// specifications
"newRoot": "$docs"
}
},
// Stage 7
{
$group: {
_id: "$_id",
"monthStarting" : {
"$first" : "$monthStarting"
},
"yearStarting" : {
"$first" : "$yearStarting"
},
"createdCountSum" : {
"$sum" : "$createdCount"
},
"completedCountSum" : {
"$sum" : "$completedCount"
},
"dueCountSum" : {
"$sum" : "$dueCount"
},
"salesRevenue" : {
"$sum" : "$salesRevenue"
},
"costGenerated" : {
"$sum" : "$costGenerated"
},
"profit" : {
"$sum" : "$profit"
},
"finalisedRevenue" : {
"$sum" : "$finalisedRevenue"
},
"avgJobValue": {
$sum: "$avgValue"
}
}
},
],
);
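The same facet-and-merge pattern can be sketched more compactly by building each facet branch from a small helper. This is only an illustration under assumptions: the field names createdOn, completedOn and dueOn are taken from the sample documents (the pipeline above uses different names, so adjust to the real schema), monthlyCount and monthlyMeasures are hypothetical names, time zones are ignored, and $dateTrunc needs MongoDB 5.0+.

```javascript
// Hypothetical helper: one $facet branch that counts documents per calendar
// month of `dateField` and stores the count under `countName`.
function monthlyCount(dateField, countName) {
  return [
    // Skip documents where the date is null or missing.
    { $match: { [dateField]: { $ne: null } } },
    { $group: {
        _id: { $dateTrunc: { date: "$" + dateField, unit: "month" } },
        [countName]: { $sum: 1 }
    } }
  ];
}

const monthlyMeasures = [
  // Each branch groups the same input by a different date field.
  { $facet: {
      created: monthlyCount("createdOn", "createdCount"),
      completed: monthlyCount("completedOn", "completedCount"),
      due: monthlyCount("dueOn", "dueCount")
  } },
  // Merge the three result arrays and turn each element into a document.
  { $project: { docs: { $concatArrays: ["$created", "$completed", "$due"] } } },
  { $unwind: "$docs" },
  { $replaceRoot: { newRoot: "$docs" } },
  // Combine the per-month rows; $sum treats a missing count as 0.
  { $group: {
      _id: "$_id",
      createdCount: { $sum: "$createdCount" },
      completedCount: { $sum: "$completedCount" },
      dueCount: { $sum: "$dueCount" }
  } },
  { $sort: { _id: 1 } }
];
```

With the two sample documents, the row for July 2022 should then carry createdCount 1 and completedCount 2, which is the behaviour the question asked for.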

MongoDB: get sum by year/month with nested values

I'm trying to sum spending by month/year over a collection with nested amounts, with no luck.
This is the collection (extract):
[
{
"_id" : ObjectId("5faaf88d0657287993e541a5"),
"segment" : {
"l1" : "Segment A",
"l2" : "001"
},
"invoiceNo" : "2020.10283940",
"invoicePos" : 3,
"date" : ISODate("2019-09-06T00:00:00.000Z"),
"amount" : {
"document" : {
"amount" : NumberDecimal("125.000000000000"),
"currCode" : "USD"
},
"local" : {
"amount" : NumberDecimal("123.800000000000"),
"currCode" : "CHF"
},
"global" : {
"amount" : NumberDecimal("123.800000000000"),
"currCode" : "CHF"
}
}
},
...
]
I would like to sum up the aggregated invoice volume per month in "global" currency.
I tried this query on MongoDB:
db.invoices.aggregate(
{$project : {
month : {$month : "$date"},
year : {$year : "$date"},
amount : 1
}},
{$unwind: '$amount'},
{$group : {
_id : {month : "$month" ,year : "$year" },
total : {$sum : "$amount.global.amount"}
}})
This is the result I am getting:
/* 1 */
{
"_id" : ObjectId("5faaf88d0657287993e541a5"),
"amount" : {
"document" : {
"amount" : NumberDecimal("125.000000000000"),
"currCode" : "USD"
},
"local" : {
"amount" : NumberDecimal("123.800000000000"),
"currCode" : "CHF"
},
"global" : {
"amount" : NumberDecimal("123.800000000000"),
"currCode" : "CHF"
}
},
"month" : 9,
"year" : 2019
}
/* 2 */
{
"_id" : ObjectId("5faaf88d0657287993e541ac"),
"amount" : {
"document" : {
"amount" : NumberDecimal("105.560000000000"),
"currCode" : "CHF"
},
"local" : {
"amount" : NumberDecimal("105.560000000000"),
"currCode" : "CHF"
},
"global" : {
"amount" : NumberDecimal("105.560000000000"),
"currCode" : "CHF"
}
},
"month" : 11,
"year" : 2020
}
This however does not sum up all invoices per month, but looks like single invoice lines - no aggregation.
I would like to get a result like this:
[
{
"month": 11,
"year": 2020,
"amount" : NumberDecimal("99999.99")
},
{
"month": 10,
"year": 2020,
"amount" : NumberDecimal("99999.99")
},
{
"month": 9,
"year": 2020,
"amount" : NumberDecimal("99999.99")
}
]
What is wrong with my query?
Would this be helpful?
db.invoices.aggregate([
{
$group: {
_id: {
month: {
$month: "$date"
},
year: {
$year: "$date"
}
},
total: {
$sum: "$amount.global.amount"
}
}
},
{$sort:{"_id.year":-1, "_id.month":-1}}
])
If you need any extra explanation let me know, but the code is pretty short and self-explanatory.
In principle your aggregation pipeline is fine, but there are a few mistakes:
An aggregation pipeline expects an array.
$unwind is useless, because $amount is not an array. One document in -> one document out.
You can use the date functions directly in $group.
So, short and simple:
db.invoices.aggregate([
{
$group: {
_id: { month: { $month: "$date" }, year: { $year: "$date" } },
total: { $sum: "$amount.global.amount" }
}
}
])
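If the flat output shape from the question (month, year, amount) is wanted, a final $project can lift the fields out of _id. A minimal sketch, with monthlyTotals as a hypothetical name:

```javascript
const monthlyTotals = [
  // Sum the global-currency amount per calendar month.
  { $group: {
      _id: { year: { $year: "$date" }, month: { $month: "$date" } },
      amount: { $sum: "$amount.global.amount" }
  } },
  // Newest month first, as in the expected result.
  { $sort: { "_id.year": -1, "_id.month": -1 } },
  // Flatten the grouping key into top-level fields.
  { $project: { _id: 0, month: "$_id.month", year: "$_id.year", amount: 1 } }
];
```

In the shell this would be run as db.invoices.aggregate(monthlyTotals).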

Mongodb aggregate by day and delete duplicate value

I'm trying to clean a huge database.
Sample DB :
{
"_id" : ObjectId("59fc5249d5ab401d99f3de7f"),
"addedAt" : ISODate("2017-11-03T11:26:01.744Z"),
"__v" : 0,
"check" : 17602,
"lastCheck" : ISODate("2018-04-05T11:47:00.609Z"),
"tracking" : [
{
"timeCheck" : ISODate("2017-11-06T13:17:20.861Z"),
"_id" : ObjectId("5a0060e00f3c330012bafe39"),
"rank" : 2395,
},
{
"timeCheck" : ISODate("2017-11-06T13:22:31.254Z"),
"_id" : ObjectId("5a0062170f3c330012bafe77"),
"rank" : 2395,
},
{
"timeCheck" : ISODate("2017-11-06T13:27:40.551Z"),
"_id" : ObjectId("5a00634c0f3c330012bafebe"),
"rank" : 2379,
},
{
"timeCheck" : ISODate("2017-11-06T13:32:41.084Z"),
"_id" : ObjectId("5a0064790f3c330012baff03"),
"rank" : 2395,
},
{
"timeCheck" : ISODate("2017-11-06T13:37:51.012Z"),
"_id" : ObjectId("5a0065af0f3c330012baff32"),
"rank" : 2379,
},
{
"timeCheck" : ISODate("2017-11-07T13:37:51.012Z"),
"_id" : ObjectId("5a0065af0f3c330012baff34"),
"rank" : 2379,
}]
}
I have a lot of duplicate values, but I need to remove duplicates only within each day.
For example, to obtain this:
{
"_id" : ObjectId("59fc5249d5ab401d99f3de7f"),
"addedAt" : ISODate("2017-11-03T11:26:01.744Z"),
"__v" : 0,
"check" : 17602,
"lastCheck" : ISODate("2018-04-05T11:47:00.609Z"),
"tracking" : [
{
"timeCheck" : ISODate("2017-11-06T13:17:20.861Z"),
"_id" : ObjectId("5a0060e00f3c330012bafe39"),
"rank" : 2395,
},
{
"timeCheck" : ISODate("2017-11-06T13:27:40.551Z"),
"_id" : ObjectId("5a00634c0f3c330012bafebe"),
"rank" : 2379,
},
{
"timeCheck" : ISODate("2017-11-07T13:37:51.012Z"),
"_id" : ObjectId("5a0065af0f3c330012baff34"),
"rank" : 2379,
}]
}
How can I aggregate by day and then delete the later duplicate values?
I need to keep the values per day even if they are identical with another day.
The aggregation framework cannot update data at this stage. However, you can use the following aggregation pipeline in order to get the desired output and then use e.g. a bulk replace to update all your documents:
db.collection.aggregate({
$unwind: "$tracking" // flatten the "tracking" array into separate documents
}, {
$sort: {
"tracking.timeCheck": 1 // sort by timeCheck to allow us to use the $first operator in the next stage reliably
}
}, {
$group: {
_id: { // group by
"_id": "$_id", // "_id" and
"rank": "$tracking.rank", // "rank" and
"date": { // the "date" part of the "timeCheck" field
$dateFromParts : {
year: { $year: "$tracking.timeCheck" },
month: { $month: "$tracking.timeCheck" },
day: { $dayOfMonth: "$tracking.timeCheck" } // day of the month, so each calendar date gets its own group
}
}
},
"doc": { $first: "$$ROOT" } // only keep the first document per group
}
}, {
$sort: {
"doc.tracking.timeCheck": 1 // restore ascending sort order - may or may not be needed...
}
}, {
$group: {
_id: "$_id._id", // merge everything again per "_id"
"addedAt": { $first: "$doc.addedAt" },
"__v": { $first: "$doc.__v" },
"check": { $first: "$doc.check" },
"lastCheck": { $first: "$doc.lastCheck" },
"tracking": { $push: "$doc.tracking" } // in order to join the tracking values into an array again
}
})
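The bulk update mentioned above could look roughly like the following sketch. It assumes the aggregation output has the shape produced by the last $group (an _id plus the cleaned tracking array). toBulkOps is a hypothetical helper, and it uses updateOne with $set rather than a full replace so that other fields stay untouched.

```javascript
// Hypothetical helper: turn aggregation results into bulkWrite operations
// that overwrite only the "tracking" array of each original document.
function toBulkOps(results) {
  return results.map((doc) => ({
    updateOne: {
      filter: { _id: doc._id },
      update: { $set: { tracking: doc.tracking } }
    }
  }));
}

// Example with a tiny, made-up result set.
const exampleOps = toBulkOps([{ _id: 1, tracking: [{ rank: 2395 }, { rank: 2379 }] }]);
```

In the shell, the resulting array would be passed to db.collection.bulkWrite(...). For a very large collection it is probably safer to iterate the aggregation cursor and send the operations in batches instead of materialising everything at once.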

Get lowest per date from multiple arrays in mongodb

I've the following structure of docs:
{
"_id" : ObjectId("5786458371d24d924d8b4575"),
"uniqueNumber" : "3899822714",
"lastUpdatedAt" : ISODate("2016-07-13T20:11:11.000Z"),
"new" : [
{
"price" : 8.4,
"created" : ISODate("2016-07-13T13:11:28.000Z")
},
{
"price" : 10.0,
"created" : ISODate("2016-07-13T14:50:56.000Z")
}
],
"used" : [
{
"price" : 10.99,
"created" : ISODate("2016-07-08T13:46:31.000Z")
},
{
"price" : 8.59,
"created" : ISODate("2016-07-13T13:11:28.000Z")
}
]
}
Now I need to get a list that gives me the lowest price of each array per date.
So, as example:
{
"uniqueNumber" : 1234,
"prices" : {
"created" : 2016-07-08,
"minNew" : 123,
"minUsed" : 22
}
}
So far I've built the following query:
db.getCollection('col').aggregate([
{
$match : {
"uniqueNumber" : "3899822714"
}
},
{
$unwind : "$used"
},
{
$project : {
"uniqueNumber" : "$uniqueNumber",
"price" : "$used.price",
"ts" : "$used.created"
}
},
{
$sort : { "ts" : 1 }
},
{
$group : {_id: "$uniqueNumber", priceOfMaxTS : { $min: "$price" }, ts : { $last: "$ts" }}
}
]);
But this one only gives me the lowest price for the highest date. I couldn't find anything that points me in the right direction to get the desired result.
UPDATE
I've found a way to get the lowest price of the used array grouped by day with this query:
db.getCollection('col').aggregate([
{
$match : {
"uniqueNumber" : "3899822714"
}
},
{
$unwind : "$used"
},
{
$project : {
"asin" : "$uniqueNumber",
"price" : "$used.price",
"ts" : "$used.created",
"y" : { "$year" : "$used.created" },
"m" : { "$month" : "$used.created" },
"d" : { "$dayOfMonth" : "$used.created" }
}
},
{
$group : { _id : { "year" : "$y", "month" : "$m", "day" : "$d" }, minPriceOfDay : { $min: "$price" }}
}
]);
Now I only need to find a way to do this for the new array as well, in the same query.
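One possible way to cover both arrays in a single query is to tag each entry with its source array, merge the two arrays, and then take a conditional minimum per day. This is only a sketch: minPricesPerDay and the kind field are hypothetical names, days are cut in UTC, and it relies on $min skipping null values, which is how I understand the documented behaviour.

```javascript
const minPricesPerDay = [
  { $match: { uniqueNumber: "3899822714" } },
  // Tag every entry with the array it came from, then merge both arrays.
  { $project: {
      uniqueNumber: 1,
      offers: { $concatArrays: [
        { $map: { input: { $ifNull: ["$new", []] }, as: "o",
                  in: { kind: "new", price: "$$o.price", created: "$$o.created" } } },
        { $map: { input: { $ifNull: ["$used", []] }, as: "o",
                  in: { kind: "used", price: "$$o.price", created: "$$o.created" } } }
      ] }
  } },
  { $unwind: "$offers" },
  // One group per item and day; each $min only sees prices of its own kind.
  { $group: {
      _id: {
        uniqueNumber: "$uniqueNumber",
        day: { $dateToString: { format: "%Y-%m-%d", date: "$offers.created" } }
      },
      minNew: { $min: { $cond: [{ $eq: ["$offers.kind", "new"] }, "$offers.price", null] } },
      minUsed: { $min: { $cond: [{ $eq: ["$offers.kind", "used"] }, "$offers.price", null] } }
  } },
  { $sort: { "_id.day": 1 } }
];
```

A day that has entries in only one of the arrays should come back with null for the other minimum.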

Multiplication on nested fields on MongoDB

In a database in MongoDB I am trying to group some data by their date (one group for each day of the year), and then add an additional field that would be the result of the multiplication of two of the already existing fields.
The data structure is:
{
"_id" : ObjectId("567a7c6d9da4bc18967a3947"),
"units" : 3.0,
"price" : 50.0,
"name" : "Name goes here",
"datetime" : ISODate("2015-12-23T10:50:21.560+0000")
}
I first tried a two stage approach using $project and then $group like this
db.things.aggregate(
[
{
$project: {
"_id" : 1,
"name" : 1,
"units" : 1,
"price" : 1,
"datetime":1,
"unitsprice" : { $multiply: [ "$price", "$units" ] }
}
},
{
$group: {
"_id" : {
"day" : {
"$dayOfMonth" : "$datetime"
},
"month" : {
"$month" : "$datetime"
},
"year" : {
"$year" : "$datetime"
}
},
"things" : {
"$push" : "$$ROOT"
}
}
}
],
)
In this case, the first step (the $project) gives the expected output (with the expected value of unitsprice), but the second step (the $group) fails with this error:
"errmsg": "$multiply only supports numeric types, not String",
"code":16555
I also tried it the other way around, doing the $group step first and then the $project:
db.things.aggregate(
[
{
$group: {
"_id" : {
"day" : {
"$dayOfMonth" : "$datetime"
},
"month" : {
"$month" : "$datetime"
},
"year" : {
"$year" : "$datetime"
}
},
"things" : {
"$push" : "$$ROOT"
}
}
},
{
$project: {
"_id" : 1,
"things":{
"name" : 1,
"units" : 1,
"price" : 1,
"datetime":1,
"unitsprice" : { $multiply: [ "$price", "$units" ] }
}
}
}
],
);
But in this case, the result of the multiplication is: unitsprice:null
Is there any way of doing this multiplication? Also, it would be nice to do it in a way that the output would not have nested fields, so it would look like:
{"_id":
"units":
"price":
"name":
"datetime":
"unitsprice":
}
Thanks in advance
PS: I am running MongoDB 3.2
I finally found the error. When the data was imported, a few of the price fields were created as strings. Surprisingly, the error didn't come up when first doing the multiplication in the $project step (the output looked normal until it reached the first bad field, then it stopped), but only when doing the $group step.
In order to find the text fields I used this query:
db.things.find( { price: { $type: 2 } } );
Thanks for the hints
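To guard against such stray strings and get the flat output asked for in the question, the multiplication can be done on numeric values only, without pushing $$ROOT into a nested array. A minimal sketch, assuming the "number" type alias is available (as far as I know it has been since MongoDB 3.2); flatWithUnitsPrice is a hypothetical name:

```javascript
const flatWithUnitsPrice = [
  // Keep only documents whose price and units are stored as numbers.
  { $match: { price: { $type: "number" }, units: { $type: "number" } } },
  // Flat output: the original fields plus the computed product.
  { $project: {
      units: 1,
      price: 1,
      name: 1,
      datetime: 1,
      unitsprice: { $multiply: ["$price", "$units"] }
  } },
  // Chronological order keeps documents of the same day next to each other.
  { $sort: { datetime: 1 } }
];
```

This would be run as db.things.aggregate(flatWithUnitsPrice). Documents with a string price are silently skipped here, so it may be worth fixing the imported data as well.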