Data processing before sending result back in MongoDB

We have a requirement to rank our records based on a defined algorithm. We have four fields in our MongoDB documents, like the following:
{
  "rating" : 3.5,
  "review" : 4,
  "revenue" : 100,
  "used" : 3.9
},
{
  "rating" : 1.5,
  "review" : 2,
  "revenue" : 10,
  "used" : 2.1
}
While querying the data, we will send percentages as weightages for our calculation. So assume we are sending 30% for rating, 30% for review, and 20% each for revenue and used.
Now we need to score each record based on the following calculation:
Score per column = ((Existing Value - Average(Column)) / StandardDeviation(Column)) * %weightage
for rating = (3.5 - 2.5) / 1 * 30% = 0.3
So we need to compute a score for each column (or field), and the total of all four fields will give a score for each record.
Is it possible to do such a calculation with any built-in MongoDB function?
Thanks in advance

Mongo has built-in operators for finding the standard deviation ($stdDevPop) and for finding the average ($avg), and obviously for subtraction, multiplication, and division as well.
So, using all these operators, you can definitely write an aggregation for what you require.
I've done it for rating below; you can use the same logic for the other fields.
Also, replace 0.3 with your %weightage.
db.collection.aggregate([
  {
    $match: {
      rating: { $ne: null }
    }
  },
  {
    $group: {
      _id: null,
      ratings: { $push: "$rating" },
      avg_rating: { $avg: "$rating" },
      std_deviation_rating: { $stdDevPop: "$rating" }
    }
  },
  {
    $project: {
      ratings: {
        $map: {
          input: "$ratings",
          as: "rating",
          in: {
            $multiply: [
              { $divide: [{ $subtract: ["$$rating", "$avg_rating"] }, "$std_deviation_rating"] },
              0.3
            ]
          }
        }
      }
    }
  }
])
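If you need the total score per record rather than a per-field array, you can push the whole documents in the $group stage and compute all four weighted z-scores in a single $map. A minimal, untested sketch, assuming the field names from the question, fractional weightages (0.3, 0.3, 0.2, 0.2), and non-zero standard deviations ($divide would fail if $stdDevPop returned 0):
db.collection.aggregate([
  {
    $group: {
      _id: null,
      docs: { $push: "$$ROOT" },          // keep the original documents
      avgRating: { $avg: "$rating" },
      sdRating: { $stdDevPop: "$rating" },
      avgReview: { $avg: "$review" },
      sdReview: { $stdDevPop: "$review" },
      avgRevenue: { $avg: "$revenue" },
      sdRevenue: { $stdDevPop: "$revenue" },
      avgUsed: { $avg: "$used" },
      sdUsed: { $stdDevPop: "$used" }
    }
  },
  {
    $project: {
      scored: {
        $map: {
          input: "$docs",
          as: "d",
          in: {
            original: "$$d",
            // Sum of the four weighted z-scores:
            score: {
              $add: [
                { $multiply: [{ $divide: [{ $subtract: ["$$d.rating", "$avgRating"] }, "$sdRating"] }, 0.3] },
                { $multiply: [{ $divide: [{ $subtract: ["$$d.review", "$avgReview"] }, "$sdReview"] }, 0.3] },
                { $multiply: [{ $divide: [{ $subtract: ["$$d.revenue", "$avgRevenue"] }, "$sdRevenue"] }, 0.2] },
                { $multiply: [{ $divide: [{ $subtract: ["$$d.used", "$avgUsed"] }, "$sdUsed"] }, 0.2] }
              ]
            }
          }
        }
      }
    }
  }
])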

Related

MongoDB Query lower/upper ranges with an array as input

In MongoDB I have a collection that looks like:
{ low: 1, high: 5 },
{ low: 6, high: 15 },
{ low: 16, high: 412 },
...
I have input that's an array of integers:
[ 4, 16, ...]
I want to find all the documents in the collection which have values included in the range depicted by low and high. In this example it would pick the first and third documents.
I've found lots of Q&A here on how to filter using a single value as the input but could not find one that included an array as input. It could be that my search failed me and that this has been answered.
Update: I should have mentioned that I'm constructing this query in an application and not running it in the CLI. Given that flexibility, what if I create an $or query with each of the inputs? Something like:
$or: [{
  high: { $gte: 4 },
  low: { $lte: 4 }
}, {
  high: { $gte: 16 },
  low: { $lte: 16 }
},
...
]
It could be massive and have thousands of elements in the $or.
You can use $anyElementTrue along with $map to check if any value is included within a range defined in your documents:
db.collection.find({
  $expr: {
    $anyElementTrue: {
      $map: {
        input: [ 4, 16 ],
        in: {
          $and: [
            { $gte: [ "$$this", "$low" ] },
            { $lte: [ "$$this", "$high" ] }
          ]
        }
      }
    }
  }
})
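Since the query is being constructed in an application anyway, the same filter can be assembled from an arbitrary input array instead of a massive $or. A hypothetical sketch using the Node.js driver (the collection name and variable names are my assumptions):
// Build the $expr filter from any input array.
const inputs = [4, 16]; // the array of integers to test

const filter = {
  $expr: {
    $anyElementTrue: {
      $map: {
        input: inputs,
        in: {
          $and: [
            { $gte: ["$$this", "$low"] },
            { $lte: ["$$this", "$high"] }
          ]
        }
      }
    }
  }
};

// db is an already-connected Db instance from the mongodb driver.
const docs = await db.collection("ranges").find(filter).toArray();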
// Working code from the Mongo shell CLI 4.2.6 on Windows 10
// You can use forEach and loop through to check whether a value exists between two numbers
> print("MongoDB",db.version());
MongoDB 4.2.6
> db.lhColl.find();
{ "_id" : ObjectId("5f889258f3b30cd04c8a78e5"), "low" : 1, "high" : 5 }
{ "_id" : ObjectId("5f889258f3b30cd04c8a78e6"), "low" : 6, "high" : 15 }
{ "_id" : ObjectId("5f889258f3b30cd04c8a78e7"), "low" : 16, "high" : 412 }
> var arrayInput = [4,16,500];
> var inputLength = arrayInput.length;
> db.lhColl.aggregate([
... {$match:{}}
... ]).forEach(function(doc){
... for (i=0; i<inputLength; i++){
... if (arrayInput[i]>=doc.low){
... if(arrayInput[i] <= doc.high)
... print("arrayInputs value match:",arrayInput[i]);
... }
... }
... });
arrayInputs value match: 4
arrayInputs value match: 16

MongoDB aggregate with total count as variable

I have a condition which says: create a MongoDB query that pulls 5% of the total settled claims by claims examiner. My documents, for example, are:
claims collection
{
"_id" : ObjectId("5dbbb6b693f50332a533f4db"),
"active" : true,
"status" : "settled",
}
{
"_id" : ObjectId("5dbbb6b693f50332a533f4db"),
"active" : true,
"status" : "unsettled",
}
I can calculate the total like this:
db.getCollection("claims").aggregate([
  { $match: { 'status': 'trm' } },
  { $count: "total" }
])
It gives me a count of 42, for example.
So what I am trying to achieve is: calculate the total count of settled data, set the total as a variable, and apply the 5% formula in the $limit stage, as in
{
  $limit: $total * 0.05
}
I am unable to set the total from one pipeline as a variable and apply it in another pipeline.
Help please. How can I achieve this type of condition?
You can do this by adding a $project stage and using the $multiply operator, like this:
db.getCollection("claims").aggregate([
  {
    $match: {
      "status": "trm"
    }
  },
  {
    $count: "total"
  },
  {
    $project: {
      "total": {
        $multiply: [
          "$total",
          0.05 // 5% of the total
        ]
      }
    }
  }
])
https://mongoplayground.net/p/VsypWLI6iJt
Docs: https://docs.mongodb.com/manual/reference/operator/aggregation/multiply
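Note that $limit only accepts a constant, not an expression, so the computed total can't be fed into $limit within the same pipeline. A minimal, untested sketch of a two-step workaround in the mongo shell (the "settled" status value and the rounding policy are my assumptions):
// Step 1: count the matching claims.
var counted = db.getCollection("claims").aggregate([
  { $match: { "status": "settled" } },
  { $count: "total" }
]).toArray();

// Step 2: fetch 5% of them (rounded down, but at least 1).
var limit = Math.max(1, Math.floor(counted[0].total * 0.05));
db.getCollection("claims").aggregate([
  { $match: { "status": "settled" } },
  { $limit: limit }
]);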

Aggregation pipeline slow with large collection

I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or get averages from). I'm currently running into some performance issues and I'm hoping for advice on how I could optimize/scale MongoDB, or suggestions for alternative solutions. I'm running the latest stable MongoDB version using WiredTiger. The documents basically look like the following:
{
"dimensions": {
"account_id": ObjectId("590889944befcf34204dbef2"),
"url": "https://test.com",
"date": ISODate("2018-03-04T23:00:00.000+0000")
},
"metrics": {
"cost": 155,
"likes": 200
}
}
I have three indexes on this collection, as there are various aggregations being run on it (a sketch of the index definitions follows this list):
account_id
date
account_id and date
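For reference, these might be declared along these lines (an assumption based on the document shape shown above):
db.large_collection.createIndex({ "dimensions.account_id": 1 })
db.large_collection.createIndex({ "dimensions.date": 1 })
db.large_collection.createIndex({ "dimensions.account_id": 1, "dimensions.date": 1 })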
The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:
db.large_collection.aggregate(
[
{
$match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
},
{
$match: { "dimensions.account_id": { $in: [ "590889944befcf34204dbefc", "590889944befcf34204dbf1f", "590889944befcf34204dbf21" ] }}
},
{
$group: {
cost: { $sum: "$metrics.cost" },
likes: { $sum: "$metrics.likes" },
_id: {
year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
}
}
},
{
$project: {
cost: 1,
likes: 1
}
}
],
{
cursor: {
batchSize: 50
},
allowDiskUse: true
}
);
This query takes about 25-30 seconds to complete, and I'm looking to reduce this to at most 5-10 seconds. It's currently a single MongoDB node, no shards or anything. The explain output can be found here: https://pastebin.com/raw/fNnPrZh0 and executionStats here: https://pastebin.com/raw/WA7BNpgA As you can see, MongoDB is using indexes, but there are still 1.3 million documents that need to be read. I currently suspect I'm facing some I/O bottlenecks.
Does anyone have an idea how I could improve this aggregation pipeline? Would sharding help at all? Is MongoDB the right tool here?
The following could improve performance if and only if precomputing dimensions within each record is an option.
If this type of query represents an important portion of the queries on this collection, then including additional fields to make these queries faster could be a viable alternative.
This hasn't been benchmarked.
One of the costly parts of this query probably comes from working with dates.
First, during the $group stage, while computing the year and the ISO week associated with a specific time zone for each matching record.
Then, to a lesser extent, during the initial filtering, when keeping dates from the last 3 months.
The idea would be to store the year and the ISO week in each record; for the given example this would be { "year" : 2018, "week" : 10 }. This way, the _id key in the $group stage wouldn't need any computation (which would otherwise represent 1.3M complex date operations).
In a similar fashion, we could also store the associated month in each record, which would be { "month" : "201803" } for the given example. This way, the first match could be on months [2, 3, 4, 5] before applying a more precise and costlier filter on the exact timestamps. This would reduce the initial, costlier Date filtering on 200M records to a simple Int filtering.
Let's create a new collection with these new pre-computed fields (in a real scenario, these fields would be included during the initial insert of the records):
db.large_collection.aggregate([
{ $addFields: {
"prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
"prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
"prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
}},
{ "$out": "large_collection_precomputed" }
])
which will store these documents:
{
"dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") },
"metrics" : { "cost" : 155, "likes" : 200 },
"prec" : { "year" : 2018, "week" : 10, "month" : "201803" }
}
And let's query:
db.large_collection_precomputed.aggregate([
// Initial coarse filtering of dates (months) (on 200M documents):
{ $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
{ $match: {
"dimensions.account_id": { $in: [
ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
]}
}},
// Exact filtering of dates (costlier, but only on ~1.5M documents).
{ $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
{ $group: {
// The _id is now extremely fast to retrieve:
_id: { year: "$prec.year", "week": "$prec.week" },
cost: { $sum: "$metrics.cost" },
likes: { $sum: "$metrics.likes" }
}},
...
])
In this case we would use indexes on account_id and month.
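For instance, a compound index covering both (a sketch, against the precomputed collection created above):
db.large_collection_precomputed.createIndex({ "dimensions.account_id": 1, "prec.month": 1 })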
Note: here, months are stored as Strings ("201803") since I'm not sure how to cast them to Int within an aggregation query. But it would be best to store them as Int when the records are inserted.
As a side effect, this will obviously increase the disk/RAM footprint of the collection.

Get average value from array consisting of objects based on objects fields

I have a document of the following structure:
{
"_id" : ObjectId("598446bb13c7141f1"),
"trackerId" : "598446ba-fa9b-4000-8000-4ea290e",
"powerMetrics" : [
{
"duration" : 0.15,
"powerConsumption" : 0.1
},
{
"duration" : 0.1,
"powerConsumption" : 0.05
}
]
}
My goal is to get another document, which would contain a single value, avgMetric. This avgMetric should be calculated using the powerMetrics array in the following way:
(powerMetrics[0].powerConsumption/powerMetrics[0].duration
+ powerMetrics[1].powerConsumption/powerMetrics[1].duration) / powerMetrics.size()
So avgMetric should represent the average of all (powerConsumption/duration) ratios from the powerMetrics array.
After experimenting with queries, I could not achieve this.
The size of the powerMetrics array can vary; the MongoDB version is 3.2.14.
Could someone please help with that?
Thanks
You can use $map with $avg to output the average in MongoDB 3.2:
db.col_name.aggregate([
  { "$project": {
    "avgMetrics": {
      "$avg": {
        "$map": {
          "input": "$powerMetrics",
          "as": "val",
          "in": { "$divide": ["$$val.powerConsumption", "$$val.duration"] }
        }
      }
    }
  }}
])
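For the sample document above, this works out to (0.1/0.15 + 0.05/0.1) / 2 ≈ 0.583, so the result should look something like:
{ "_id" : ObjectId("598446bb13c7141f1"), "avgMetrics" : 0.583 }
Alternatively, with $unwind and $group: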
db.collection.aggregate(
  // Pipeline
  [
    // Stage 1: unwind powerMetrics into one document per array element
    {
      $unwind: {
        path: "$powerMetrics",
        preserveNullAndEmptyArrays: true // optional
      }
    },
    // Stage 2: collect the per-element ratios
    // ($push rather than $addToSet, so equal ratios are not deduplicated)
    {
      $group: {
        _id: '$_id',
        metrics: {
          $push: {
            $divide: ['$powerMetrics.powerConsumption', '$powerMetrics.duration']
          }
        }
      }
    },
    // Stage 3: average the ratios
    {
      $project: {
        avgVal: {
          $avg: '$metrics'
        }
      }
    }
  ]
);

MongoDB aggregate query to Spring Data MongoDB

I have the MongoDB aggregate query below and would like to have its equivalent Spring Data MongoDB query.
MongoDB aggregate query:
db.response.aggregate(
// Pipeline
[
// Stage 1 : Group by Emotion & Category
{
$group: {
_id: {
emotion: "$emotion",
category: "$category"
},
count: {
$sum: 1
},
point: {
$first: '$point'
}
}
},
// Stage 2 : Total Points
{
$addFields: {
"totalPoint": {
$multiply: ["$point", "$count"]
}
}
},
// Stage 3 : Group By Category - Overall Response Total & totalFeedbacks
{
$group: {
_id: '$_id.category',
totalFeedbacks: {
$sum: "$count"
},
overallResponseTotal: {
$sum: "$totalPoint"
}
}
},
// Stage 4 - Overall Response Total & totalFeedbacks
{
$project: {
_id: 1,
overallResponseTotal: '$overallResponseTotal',
maxTotalFrom: {
"$multiply": ["$totalFeedbacks", 3.0]
},
percent: {
"$multiply": [{
"$divide": ["$overallResponseTotal", "$maxTotalFrom"]
}, 100.0]
}
}
},
// Stage 5 - Percentage Monthwise
{
$project: {
_id: 1,
overallResponseTotal: 1,
maxTotalFrom: 1,
percent: {
"$multiply": [{
"$divide": ["$overallResponseTotal", "$maxTotalFrom"]
}, 100.0]
}
}
}
]
);
I have tried its equivalent in Spring Data but got stuck at Stage 2, on how to convert $addFields to Java code. I searched multiple sites but couldn't find anything useful. Please see my equivalent Java code for Stage 1 below.
// Stage 1 - Group by Emotion and Category and return the count
GroupOperation groupEmotionAndCategory = Aggregation.group("emotion","category").count().as("count").first("point")
.as("point");
Aggregation aggregation = Aggregation.newAggregation(groupEmotionAndCategory);
AggregationResults<CategoryWiseEmotion> output = mongoTemplate.aggregate(aggregation, Response.class, CategoryWiseEmotion.class);
Any help will be highly appreciated.
$addFields is not yet supported by Spring Data MongoDB.
One workaround is to pass the raw aggregation pipeline to Spring.
But since you have a limited number of fields after Stage 1, you could also downgrade Stage 2 to a projection:
{
  $project: {
    // _id is included by default
    "count": 1, // include count
    "point": 1, // include point
    "totalPoint": {
      $multiply: ["$point", "$count"] // compute totalPoint
    }
  }
}
I haven't tested it myself, but this projection should translate to something like:
ProjectionOperation p = project("count", "point").and("point").multiply(Fields.field("count")).as("totalPoint");
Then you can translate stages 3, 4 and 5 similarly and pass the whole pipeline to Aggregation.newAggregation().