How to spot an outlier in MongoDB

Assume the following records in MongoDB:
{
_id: // primary key
age: // some age.
}
The system generates the primary key, which is guaranteed to increase monotonically.
The business logic provides the value for age. Age should always be increasing; however, due to a bug, in some rare cases the age can decrease.
E.g. age could go 1 yr, 2 yr, 3 yr, "2 yr", 4 yr, 5 yr, etc.
How can I write a query to spot the outlier in the age?

Assuming your collection is called 'junk' (sorry, no bad intentions here) I think this might work...
db.junk.aggregate([
{$lookup: {
from: "junk",
let: { age: "$age", id: "$_id" },
pipeline: [
{ $match :
{ $expr:
{ $and:
[
{$gt: ["$_id", "$$id"]},
{ $lt: ["$age", "$$age"] }
]
}
}
}
],
as: "data"
}},
{ $project: { _id: 1, "age": 1, "data": 1, "found": { $gt: [{ $size: "$data" }, 0] } } },
{ $match : { found: true }}
])
The intent is to self-join the collection: for each document, the $lookup collects every document that has a greater _id but a smaller age. The $project stage then checks whether that list is non-empty, and the final $match keeps only the documents that have at least one such later document with a smaller age.
Example Collections:
So, for testing this I populated a collection called 'junk' with 7 documents...
> db.junk.find()
{ "_id" : ObjectId("5daf4700090553aca6da1535"), "age" : 0 }
{ "_id" : ObjectId("5daf4700090553aca6da1536"), "age" : 1 }
{ "_id" : ObjectId("5daf4700090553aca6da1537"), "age" : 2 }
{ "_id" : ObjectId("5daf471b090553aca6da1538"), "age" : 3 }
{ "_id" : ObjectId("5daf471e090553aca6da1539"), "age" : 4 }
{ "_id" : ObjectId("5daf4721090553aca6da153a"), "age" : 3 }
{ "_id" : ObjectId("5daf4724090553aca6da153b"), "age" : 5 }
Results:
Here is what my results look like after running this query...
{ "_id" : ObjectId("5daf471e090553aca6da1539"), "age" : 4, "data" : [ { "_id" : ObjectId("5daf4721090553aca6da153a"), "age" : 3 } ], "found" : true }
It found a record that is followed by an outlier (ObjectId 5daf471e090553aca6da1539 precedes the outlier, ObjectId 5daf4721090553aca6da153a is the outlier). Obviously this could be projected differently to show just the outlier, but I wanted to first verify that the query works as expected and not invest more time in an inadequate approach.
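To sanity-check the self-join logic without a database, the same comparison can be sketched in plain JavaScript. This is only an in-memory illustration, not a replacement for the query: the helper name findLaterSmaller is made up, and the ObjectIds are shortened to plain numbers.

```javascript
// Sample ages in _id order, mirroring the test collection above
// (ids are shortened to plain numbers for illustration).
const docs = [
  { _id: 1, age: 0 },
  { _id: 2, age: 1 },
  { _id: 3, age: 2 },
  { _id: 4, age: 3 },
  { _id: 5, age: 4 },
  { _id: 6, age: 3 },
  { _id: 7, age: 5 },
];

// Hypothetical helper: for each document, collect the documents with a
// greater _id and a smaller age (like the $lookup sub-pipeline), then
// keep only the documents for which that list is non-empty.
function findLaterSmaller(allDocs) {
  return allDocs
    .map((doc) => ({
      ...doc,
      data: allDocs.filter(
        (other) => other._id > doc._id && other.age < doc.age
      ),
    }))
    .filter((doc) => doc.data.length > 0);
}

const flagged = findLaterSmaller(docs);
console.log(JSON.stringify(flagged));
```

The sketch compares every pair of documents, so it is only meant for small samples.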

Related

Add field to documents after $sort aggregation pipeline which include its index in sorted list using MongoDb aggregation

I want to get the position of a user in a list after a $sort aggregation stage.
Let's say we have a leaderboard, and I need to get my rank in the leaderboard with a single query that returns only my data.
I have tried $addFields and some queries with $map.
Let's say we have these documents:
/* 1 createdAt:8/18/2019, 4:42:41 PM*/
{
"_id" : ObjectId("5d5963e1c6c93b2da849f067"),
"name" : "x4",
"points" : 69
},
/* 2 createdAt:8/18/2019, 4:42:41 PM*/
{
"_id" : ObjectId("5d5963e1c6c93b2da849f07b"),
"name" : "x24",
"points" : 968
},
/* 3 createdAt:8/18/2019, 4:42:41 PM*/
{
"_id" : ObjectId("5d5963e1c6c93b2da849f06a"),
"name" : "x7",
"points" : 997
},
And I want to write a query like this
db.table.aggregate(
[
{ $sort : { points : 1 } },
{ $addFields: { order : "$index" } },
{ $match : { name : "x24" } }
]
)
I need to inject the order field with something like $index
I expect to have something like this in return
{
"_id" : ObjectId("5d5963e1c6c93b2da849f07b"),
"name" : "x24",
"points" : 968,
"order" : 2
}
I need something like the metadata of the result here, which returns 2:
/* 2 createdAt:8/18/2019, 4:42:41 PM*/
One workaround is to collect all the documents into a single array and then use $unwind with includeArrayIndex to recover each document's index; a final $project reshapes the output with the required fields.
db.collection.aggregate([
{ $sort: { points: 1 } },
{
$group: {
_id: 1,
register: { $push: { _id: "$_id", name: "$name", points: "$points" } }
}
},
{ $unwind: { path: "$register", includeArrayIndex: "order" } },
{ $match: { "register.name": "x24" } },
{
$project: {
_id: "$register._id",
name: "$register.name",
points: "$register.points",
order: 1
}
}
]);
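To make it more efficient, reduce the input before the $group stage where your requirements allow it (for example with $limit after the $sort), because $group pushes every remaining document into a single array. Note that the index produced by includeArrayIndex is zero-based.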
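The idea behind the pipeline (sort, then read off the array position) can be sketched in plain JavaScript. This is an in-memory illustration only; the helper name orderOf is an assumption, and the data is copied from the question.

```javascript
// Documents from the question, reduced to the fields that matter here.
const players = [
  { name: "x4", points: 69 },
  { name: "x24", points: 968 },
  { name: "x7", points: 997 },
];

// Hypothetical helper: sort ascending by points and return the
// zero-based position of the named player, or -1 if it is absent.
function orderOf(list, name) {
  const sorted = [...list].sort((a, b) => a.points - b.points);
  return sorted.findIndex((p) => p.name === name);
}

console.log(orderOf(players, "x24"));
```

Because the position is zero-based, as with includeArrayIndex, add 1 if you want a 1-based rank such as the 2 expected in the question.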

Check if subdocument in range complies with a condition

I'm working on a mongoDB query.
I have several documents which I query with following results:
{
"_id" : 1000.0,
"date" : ISODate("2018-05-25T00:20:00.000Z"),
"value" : true
}
{
"_id" : 1000.0,
"date" : ISODate("2018-05-25T00:26:00.000Z"),
"value" : false
}
{
"_id" : 1000.0,
"date" : ISODate("2018-05-25T00:30:00.000Z"),
"value" : false
}
The original documents are filtered so that I get only documents from the last 15 minutes, and there is no way of knowing how many entries fall in that time range.
I need to extend my existing query so that it returns a status based on "value": if no value is true, the status should be 0; if at least one value is true but not all of them, the status should be 1; and if all values are true, the status should be 2.
For example:
{
"_id" : 1000,
"status" : 1
},
{
"_id" : 1001,
"status" : 2
}
Is there a way of accomplishing this using mongoDB? Or would it be better/easier to do it on java side? Note that there are several _id in the database.
You can gather all values from each group into one array (using $group and $push) and then use $switch to apply your logic. To determine whether the array contains any true value or only true values, you can use $anyElementTrue and $allElementsTrue:
db.col.aggregate([
{
$group: {
_id: "$_id",
values: { $push: "$value" }
}
},
{
$project: {
_id: 1,
status: {
$switch: {
branches: [
{ case: { $allElementsTrue: "$values" }, then: 2 },
{ case: { $anyElementTrue: "$values" }, then: 1 },
],
default: 0
}
}
}
}
])
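If you would rather apply the same rule on the application side, the grouping and the three-way status can be sketched as follows. This is a plain JavaScript illustration under assumptions: the helper statusById is made up, and the document with _id 1001 is invented sample data added to show status 2.

```javascript
// Sample rows: the three documents from the question plus one
// hypothetical document for _id 1001.
const rows = [
  { _id: 1000, value: true },
  { _id: 1000, value: false },
  { _id: 1000, value: false },
  { _id: 1001, value: true },
];

// Hypothetical helper: group values by _id, then map each group to
// 2 (all true), 1 (some true), or 0 (none true).
function statusById(docs) {
  const groups = new Map();
  for (const { _id, value } of docs) {
    if (!groups.has(_id)) groups.set(_id, []);
    groups.get(_id).push(value);
  }
  const result = [];
  for (const [_id, values] of groups) {
    let status = 0;
    if (values.every(Boolean)) status = 2;
    else if (values.some(Boolean)) status = 1;
    result.push({ _id, status });
  }
  return result;
}

console.log(JSON.stringify(statusById(rows)));
```

The order of the checks matters in the same way as the order of the $switch branches: the "all true" case has to be tested before the "any true" case.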

Multiple conditional sums in mongodb aggregation

I'm trying to return the total of requests by type based on their status:
If there is no status set, the request should be added to requested
If the status is ordered, the request should be added to ordered
If the status is arrived, the request should be added to arrived
caseRequest.aggregate([{
$group: {
_id: "$product",
suggested: {
$sum: {
$cond: [{
$ifNull: ["$status", true]
},
1, 0
]}
},
ordered: {
$sum: {
$cond: [{
$eq: ["$status", "ordered"]
},
1, 0
]
}
},
arrived: {
$sum: {
$cond: [{
$eq: ["$status", "arrived"]
},
1, 0
]
}
}
}
}
But for some reason it doesn't find any request status ordered or arrived. If in the database I have 48 requests, 45 of them without status, 2 with ordered and 1 with arrived, it returns:
[
{
_id: "xxx",
suggested: 48,
ordered: 0,
arrived: 0,
},
...
]
Try this approach.
To return the total number of requests by type based on their status, the simplest way is to use an aggregation pipeline with $group on the status field:
db.stackoverflow.aggregate([{ $group: {_id: "$status", count: {$sum:1}} }])
This gives a result similar to this:
{ "_id" : "", "count" : 2 }
{ "_id" : "arrived", "count" : 3 }
{ "_id" : "ordered", "count" : 4 }
The schema used for these records is deliberately simple so that it is easy to follow: each document has a top-level status field whose value can be "ordered", "arrived", or empty.
Schema
{ "_id" : ObjectId("5798c348d345404e7f9e0ced"), "status" : "ordered" }
The collection is populated with 9 records whose status is ordered, arrived, or empty:
db.stackoverflow.find()
{ "_id" : ObjectId("5798c348d345404e7f9e0ced"), "status" : "ordered" }
{ "_id" : ObjectId("5798c349d345404e7f9e0cee"), "status" : "ordered" }
{ "_id" : ObjectId("5798c34ad345404e7f9e0cef"), "status" : "ordered" }
{ "_id" : ObjectId("5798c356d345404e7f9e0cf0"), "status" : "arrived" }
{ "_id" : ObjectId("5798c357d345404e7f9e0cf1"), "status" : "arrived" }
{ "_id" : ObjectId("5798c358d345404e7f9e0cf2"), "status" : "arrived" }
{ "_id" : ObjectId("5798c35ad345404e7f9e0cf3"), "status" : "ordered" }
{ "_id" : ObjectId("5798c361d345404e7f9e0cf4"), "status" : "" }
{ "_id" : ObjectId("5798c362d345404e7f9e0cf5"), "status" : "" }
db.stackoverflow.count()
9
Hope it Helps!!
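If the counts are needed as separate fields per product, as in the question, the branching can be sketched in plain JavaScript. This is an illustration only: the sample documents and the countByStatus helper are assumptions, and a missing, null, or empty status is treated as "requested".

```javascript
// Hypothetical sample requests for a single product.
const requests = [
  { product: "xxx", status: "ordered" },
  { product: "xxx", status: "arrived" },
  { product: "xxx" },
  { product: "xxx", status: "" },
];

// Hypothetical helper: one counter object per product, with one
// counter per status bucket.
function countByStatus(docs) {
  const totals = {};
  for (const { product, status } of docs) {
    if (!totals[product]) {
      totals[product] = { requested: 0, ordered: 0, arrived: 0 };
    }
    const t = totals[product];
    if (status === "ordered") t.ordered += 1;
    else if (status === "arrived") t.arrived += 1;
    else if (status == null || status === "") t.requested += 1;
  }
  return totals;
}

console.log(JSON.stringify(countByStatus(requests)));
```

Separately, note that in the question's pipeline `{ $ifNull: ["$status", true] }` evaluates to the status string whenever one is set and to true otherwise, and $cond treats both as true, which would explain why suggested counts every request. It does not by itself explain the zero counts for ordered and arrived.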

Mongo DB aggregation framework calculate avg documents

I have question collection each profile can have many questions.
{"_id":"..." , "pid":"...",.....}
Using mongo DB new aggregation framework how can I calculate the avg number of questions per profile?
tried the following without success:
{ "aggregate" : "question" , "pipeline" : [ { "$group" : { "_id" : "$pid" , "qCount" : { "$sum" : 1}}} , { "$group" : { "qavg" : { "$avg" : "qCount"} , "_id" : null }}]}
Can it be done with only one group operator?
Thanks.
For this you just need to know the number of questions and the number of distinct profiles (uniquely identified by "pid", I presume). With the aggregation framework, you need to do that in two stages:
First, calculate the number of questions per pid.
Then, calculate the average of those counts across all pids.
You'd do that like this:
Step one:
db.profiler.aggregate( [
{ $group: { _id: '$pid', count: { '$sum': 1 } } },
] );
Which outputs (in my case, with some sample data):
{
"result" : [
{ "_id" : 2, "count" : 7 },
{ "_id" : 1, "count" : 1 },
{ "_id" : 3, "count" : 3 },
{ "_id" : 4, "count" : 5 }
],
"ok" : 1
}
I have four profiles, with 7, 1, 3, and 5 questions respectively.
Now we run another group on this result, but in this case we don't really want to group by anything, so we set the _id value to null, as you see in the second group below:
db.profiler.aggregate( [
{ $group: { _id: '$pid', count: { '$sum': 1 } } },
{ $group: { _id: null, avg: { $avg: '$count' } } }
] );
And then this outputs:
{
"result" : [
{ "_id" : null, "avg" : 4 }
],
"ok" : 1
}
Which tells me that I have on average, 4 questions per profile.
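The arithmetic behind the two $group stages can be checked with a small in-memory sketch. It is plain JavaScript rather than a pipeline; the helper averagePerProfile is an assumption, and the sample reproduces the 7, 1, 3, and 5 questions from the output above.

```javascript
// Hypothetical sample: one entry per question, holding only its pid.
const questions = [
  ...Array(7).fill({ pid: 2 }),
  ...Array(1).fill({ pid: 1 }),
  ...Array(3).fill({ pid: 3 }),
  ...Array(5).fill({ pid: 4 }),
];

// Hypothetical helper: count questions per pid (first $group), then
// average those counts (second $group). Returns null for no input.
function averagePerProfile(docs) {
  const counts = new Map();
  for (const { pid } of docs) {
    counts.set(pid, (counts.get(pid) || 0) + 1);
  }
  let total = 0;
  for (const c of counts.values()) total += c;
  return counts.size === 0 ? null : total / counts.size;
}

console.log(averagePerProfile(questions)); // (7 + 1 + 3 + 5) / 4 = 4
```

As the sketch shows, the average is simply the total number of questions divided by the number of distinct pids.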

Obtaining $group result with group count

Assuming I have a collection called "posts" (in reality it is a more complex collection, posts is too simple) with the following structure:
> db.posts.find()
{ "_id" : ObjectId("50ad8d451d41c8fc58000003"), "title" : "Lorem ipsum", "author" : "John Doe", "content" : "This is the content", "tags" : [ "SOME", "RANDOM", "TAGS" ] }
I expect this collection to grow to hundreds of thousands, perhaps millions, of documents. I need to query for posts by tags, group the results by tag, and display the results paginated. This is where the aggregation framework comes in. I plan to use the aggregate() method to query the collection:
db.posts.aggregate([
{ "$unwind" : "$tags" },
{ "$group" : {
_id: { tag: "$tags" },
count: { $sum: 1 }
} }
]);
The catch is that to create the paginator I would need to know the length of the output array. I know that to do that you can do:
db.posts.aggregate([
{ "$unwind" : "$tags" },
{ "$group" : {
_id: { tag: "$tags" },
count: { $sum: 1 }
} },
{ "$group" : {
_id: null,
total: { $sum: 1 }
} }
]);
But that would discard the output of the previous stage (the first group). Is there a way to combine the two operations while preserving each stage's output? I know that the output of the whole aggregate operation can be cast to an array in some language and its contents counted, but the pipeline output may exceed the 16 MB limit. Also, running the same query again just to obtain the count seems like a waste.
So is obtaining the document result and count at the same time possible? Any help is appreciated.
Use $project to save the tag and count into tmp.
Use $push or $addToSet to store tmp in your data list.
Code:
db.test.aggregate([
{$unwind: '$tags'},
{$group:{_id: '$tags', count:{$sum:1}}},
{$project:{tmp:{tag:'$_id', count:'$count'}}},
{$group:{_id:null, total:{$sum:1}, data:{$addToSet:'$tmp'}}}
])
Output:
{
"result" : [
{
"_id" : null,
"total" : 5,
"data" : [
{
"tag" : "SOME",
"count" : 1
},
{
"tag" : "RANDOM",
"count" : 2
},
{
"tag" : "TAGS1",
"count" : 1
},
{
"tag" : "TAGS",
"count" : 1
},
{
"tag" : "SOME1",
"count" : 1
}
]
}
],
"ok" : 1
}
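The shape of that result (a total plus the per-tag list in a single document) can be reproduced in memory to see what the last $group does. The sketch below is plain JavaScript under assumptions: the helper tagSummary and the second sample post are invented for illustration.

```javascript
// Hypothetical sample posts; only the tags matter for this sketch.
const posts = [
  { title: "Lorem ipsum", tags: ["SOME", "RANDOM", "TAGS"] },
  { title: "Dolor sit", tags: ["RANDOM"] },
];

// Hypothetical helper: count posts per tag ($unwind + first $group),
// then return the per-tag list together with its length (second $group).
function tagSummary(docs) {
  const counts = new Map();
  for (const { tags = [] } of docs) {
    for (const tag of tags) {
      counts.set(tag, (counts.get(tag) || 0) + 1);
    }
  }
  const data = [...counts].map(([tag, count]) => ({ tag, count }));
  return { total: data.length, data };
}

console.log(JSON.stringify(tagSummary(posts)));
```

Keep in mind that in the pipeline version the whole data array lives in one result document, so the 16 MB document limit mentioned in the question still applies to it.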
I'm not sure you need the aggregation framework for this, other than for counting all the tags, e.g.:
db.posts.aggregate([
{ "$unwind" : "$tags" },
{ "$group" : {
_id: { tag: "$tags" },
count: { $sum: 1 }
} }
]);
To paginate through the posts for a given tag, you can just use the normal query syntax, like so:
db.posts.find({tags: "RANDOM"}).skip(10).limit(10)
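What skip and limit select can be illustrated with a plain array. This is only an in-memory sketch with a made-up helper named page and invented sample items; it says nothing about how the server executes the query.

```javascript
// Hypothetical result set: 25 matching posts, numbered 0 to 24.
const matching = Array.from({ length: 25 }, (_, i) => ({ n: i }));

// Hypothetical helper: drop the first `skip` items, then keep at most
// `limit` items, like .skip(skip).limit(limit) on a cursor.
function page(items, skip, limit) {
  return items.slice(skip, skip + limit);
}

const secondPage = page(matching, 10, 10);
console.log(secondPage.map((d) => d.n).join(","));
```

For stable pages the query should also have an explicit sort, and very large skip values tend to get slow because the skipped documents still have to be walked.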