Count and range MongoDB - mongodb

Let's say I have a bunch of documents in this format:
{Person: "X" , Note: 4}
What I need to do is count how many Person documents have a Note value in each of the ranges 0-50, 51-100, 101-150, and 151 or more.
Something like this:
// range of Note    // total of persons in this range
0-50                14
51-100              32
101-150             34
151+                21

In MongoDB you can use the comparison operators $gte and $lte (or $gt and $lt for exclusive bounds) to select values within a range.
You can then apply $count to the matched documents, like this:
db.table.aggregate([
  {
    // inclusive bounds, so Note values of exactly 0 and 50 are counted too
    $match: {
      Note: { $gte: 0, $lte: 50 }
    }
  },
  {
    $count: "0-50"
  }
])
It will show a result like:
{ "0-50" : 14 }
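The query above counts one range per call. As a rough sketch of how all four ranges might be counted in a single pass, the $bucket stage (MongoDB 3.4 and later) can group by lower bounds. The helper name buildRangeCountPipeline and the "151+" label are made up for illustration, and the sketch assumes Note holds non-negative integers; documents with a missing or negative Note would also land in the default bucket, so a $match stage may be needed first.

```javascript
// Build a $bucket pipeline that counts documents per range of `field`.
// `lowerBounds` are the inclusive lower edges of each range, in ascending order.
// Values at or above the last bound (and anything else outside the bounds)
// fall into the bucket labelled `overflowLabel`.
function buildRangeCountPipeline(field, lowerBounds, overflowLabel) {
  return [
    {
      $bucket: {
        groupBy: "$" + field,
        boundaries: lowerBounds,
        default: overflowLabel,
        output: { total: { $sum: 1 } }
      }
    }
  ];
}

// Ranges from the question: 0-50, 51-100, 101-150, 151 or more.
const pipeline = buildRangeCountPipeline("Note", [0, 51, 101, 151], "151+");

// In the mongo shell this would be run as: db.table.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

Each result document has the lower bound of its range as _id (or the overflow label) and the count in total.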

Related

How to use nested mongoDB query to calculate percentage?

It is our coursework today, this is the dataset, and this is the column description:
VARIABLE DESCRIPTIONS:
Column
1 Class (0 = crew, 1 = first, 2 = second, 3 = third)
10 Age (1 = adult, 0 = child)
19 Sex (1 = male, 0 = female)
28 Survived (1 = yes, 0 = no)
and the last question is:
What percentage of passengers survived? (use a nested mongodb query)
I know that to calculate the percentage I can count the rows where Survived = 1 and the rows in total, then divide .find({Survived: 1}).count() by .find().count(). I also know I can use aggregate to solve the problem, but that does not meet the requirement. Any ideas?
Consider the following data:
db.titanic.insert({ Survived: 1 })
db.titanic.insert({ Survived: 1 })
db.titanic.insert({ Survived: 1 })
db.titanic.insert({ Survived: 0 })
You can use $group with $sum. Passing 1 as the argument gives the total count, while passing "$Survived" adds up the Survived values, which counts the survivors because the field is either 1 or 0. Then you can use $divide to get the fraction that survived.
db.titanic.aggregate([
{
$group: {
_id: null,
Survived: { $sum: "$Survived" },
Total: { $sum: 1 }
}
},
{
$project: {
_id: 0,
SurvivedPercentage: { $divide: [ "$Survived", "$Total" ] }
}
}
])
which outputs: { "SurvivedPercentage" : 0.75 }, that is, 75%. Wrap the $divide in a $multiply by 100 if you need the value as a percentage.
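For comparison, the count-and-divide idea from the question can be sketched without aggregate. This is only a sketch, assuming it runs in mongosh, where countDocuments() returns a number directly; the function name survivedPercentage is made up for illustration.

```javascript
// Intended for mongosh, where countDocuments() returns a plain number.
// `coll` is any collection handle, for example db.titanic.
function survivedPercentage(coll) {
  const total = coll.countDocuments({});
  if (total === 0) {
    return null; // avoid dividing by zero on an empty collection
  }
  const survived = coll.countDocuments({ Survived: 1 });
  return (survived / total) * 100;
}

// Usage in mongosh: survivedPercentage(db.titanic)
```

With the four sample documents above (three survivors out of four) this would evaluate to 75.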

MongoDB: calculate 90th percentile among all documents

I need to calculate the 90th percentile of the duration where the duration for each document is defined as finish_time - start_time.
My plan was to:
1. Create a $project stage to calculate the duration in seconds for each document.
2. Calculate the index (in the sorted list of documents) that corresponds to the 90th percentile: 90th_percentile_index = 0.9 * amount_of_documents.
3. Sort the documents by the duration field that was created.
4. Use the 90th_percentile_index to $limit the documents.
5. Choose the last document of the limited subset, i.e. the one with the largest duration.
I'm new to MongoDB so I guess the query can be improved. So, the query looks like:
db.getCollection('scans').aggregate([
{
$project: {
duration: {
$divide: [{$subtract: ["$finish_time", "$start_time"]}, 1000] // duration is in seconds
},
Percentile90Index: {
$multiply: [0.9, "$total_number_of_documents"] // I don't know how to get the total number of documents..
}
}
},
{
$sort : {"$duration": 1},
},
{
$limit: "$Percentile90Index"
},
{
$group: {
_id: "_id",
percentiles90 : { $max: "$duration" } // selecting the max, i.e, first document after the limit , should give the result.
}
}
])
The problem I have is that I don't know how to get the total_number_of_documents and therefore I can't calculate the index.
Example:
let's say I have only 3 documents:
{
"_id" : ObjectId("1"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:01:00.000Z"),
}
{
"_id" : ObjectId("2"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:03:00.000Z"),
}
{
"_id" : ObjectId("3"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:08:00.000Z"),
}
So I would expect the result to be something like:
{
percentiles50 : 3 // in minutes, since percentiles50=3 is the minimum value that satisfies the requirement that at least 50% of the documents have duration <= percentiles50
}
I used the 50th percentile in the example because I only gave 3 documents, but it really doesn't matter. Please just show me a query for the i-th percentile and it will be fine :-)
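One possible approach, sketched under assumptions rather than taken from this thread: sort the durations, push them into a single array with $group, and pick the element at the nearest-rank index with $arrayElemAt. The function names are made up for illustration. The sketch assumes 0 < p <= 1 and that all durations fit in one document (the 16 MB BSON limit applies). MongoDB 7.0 and later also provide a $percentile accumulator that may be a better fit.

```javascript
// Nearest-rank index (zero-based): the smallest position such that at least
// a fraction p of the n sorted values are <= the value at that position.
// Assumes n >= 1 and 0 < p <= 1.
function nearestRankIndex(n, p) {
  return Math.ceil(p * n) - 1;
}

// Build a pipeline returning the p-th percentile of
// (finish_time - start_time), expressed in seconds.
function percentilePipeline(p) {
  return [
    { $project: { duration: { $divide: [{ $subtract: ["$finish_time", "$start_time"] }, 1000] } } },
    { $sort: { duration: 1 } },
    // Collect the sorted durations into one array so its size is known.
    { $group: { _id: null, durations: { $push: "$duration" } } },
    {
      $project: {
        _id: 0,
        percentile: {
          $arrayElemAt: [
            "$durations",
            // same arithmetic as nearestRankIndex, expressed in aggregation operators
            { $subtract: [{ $ceil: { $multiply: [p, { $size: "$durations" }] } }, 1] }
          ]
        }
      }
    }
  ];
}

// Plain JavaScript check with the three sample durations (in seconds).
const durations = [60, 180, 480];
const p50 = durations[nearestRankIndex(durations.length, 0.5)];
console.log(p50 / 60); // 3 (minutes)

// In the mongo shell: db.getCollection('scans').aggregate(percentilePipeline(0.9))
```

The plain JavaScript part only mirrors the index arithmetic; it reproduces the expected 3 minutes for the 50th percentile of the sample documents.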

Mongoose: Score query then sort by score - non text fields

In my db, I have a collection of books.
Each has:
a count of upvotes
a count of downvotes
a count of views
I would like to sort my db by scoring as follows:
upvote: 8 points
downvote: -4 points
view: 1/2 point
So the score will be:
(NumberOfViews*(1/2)) + (NumberOfDownvotes*-4)+ (NumberOfUpvotes*8)
So if I have:
book1 = {name:'book1', views:3000,upvotes:340, downvotes:120}
book2 = {name:'book2', views:9000,upvotes:210, downvotes:620}
book3 = {name:'book3', views:7000,upvotes:6010, downvotes:2}
The score should be:
book1Score = 3740
book2Score = 3700
book3Score = 51572
And the query should output
book3,book1,book2
How can I achieve such a thing in mongoose?
Bonus: What if I want records that are more recent to rank higher than older records on that same query?
Thanks
Well I ended up doing it all inside mongoose.
I run this query every 24 hours to re-score my collection.
Book.aggregate(
[
//I match my query
{$match:query},
{
$project: {
//take the id for reference
_id: 1,
//calculate the score of the views
viewScore: {
$multiply: [ "$views", 0.5 ]
},
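//calculate the score of the upvotes
//(upvotes is stored as an array here, so $size gives the count)
upvoteScore: {
$multiply: [ {$size: '$upvotes'}, 8 ]
},
//calculate the score of the downvotes
//(downvotes is stored as an array as well)
downvoteScore: {
$multiply: [ {$size: '$downvotes'}, -4 ]
}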
}
},
{
//project a second time
$project: {
//take my id for reference
_id: 1,
//get my total score
score: {
$add:['$viewScore','$upvoteScore','$downvoteScore']
},
}
},
//sort by the score.
{$sort : {'score' : -1}},
]
)
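If views, upvotes and downvotes are stored as plain numbers, as in the question, the two $project stages could probably be collapsed into one $addFields stage (MongoDB 3.4 and later). The following is a sketch under that assumption; the names WEIGHTS, scoreOf and scorePipeline are made up for illustration, and the plain JavaScript part only re-checks the scores from the question.

```javascript
// Points per event, taken from the question.
const WEIGHTS = { views: 0.5, upvotes: 8, downvotes: -4 };

// Score of one book document, computed in plain JavaScript.
function scoreOf(book) {
  return (
    book.views * WEIGHTS.views +
    book.upvotes * WEIGHTS.upvotes +
    book.downvotes * WEIGHTS.downvotes
  );
}

// Equivalent aggregation pipeline; assumes the three fields are numbers.
const scorePipeline = [
  {
    $addFields: {
      score: {
        $add: [
          { $multiply: ["$views", WEIGHTS.views] },
          { $multiply: ["$upvotes", WEIGHTS.upvotes] },
          { $multiply: ["$downvotes", WEIGHTS.downvotes] }
        ]
      }
    }
  },
  { $sort: { score: -1 } } // highest score first
];

// Sample books from the question.
const books = [
  { name: "book1", views: 3000, upvotes: 340, downvotes: 120 },
  { name: "book2", views: 9000, upvotes: 210, downvotes: 620 },
  { name: "book3", views: 7000, upvotes: 6010, downvotes: 2 }
];

const ranked = books
  .slice()
  .sort((a, b) => scoreOf(b) - scoreOf(a))
  .map((book) => book.name);
console.log(ranked); // [ 'book3', 'book1', 'book2' ]

// With mongoose this would be run as: Book.aggregate(scorePipeline)
```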
I think the best way would be to query mongoose for the list of books and then do the sorting yourself.
Something like:
// Get query results from mongoose then ...
const score = (book) => (book.views * (1/2)) + (book.downvotes * -4) + (book.upvotes * 8);
// b - a sorts in descending order, highest score first
books.sort((a, b) => score(b) - score(a));
This sorts the books in descending order of score, so the highest-scoring book comes first.
EDIT: The above answer is for sorting after you've received the query results.
If you want to receive the results already sorted, you could instead calculate the score at the time you save the book and store it as a field. Then use mongoose's Query#sort, which would look something like:
query.sort({ score: 'desc'});
More info on Query#sort: http://mongoosejs.com/docs/api.html#query_Query-sort

MongoDB pagination with location criteria

I want to get data sorted by a field. For example:
db.Users.find().limit(200).sort({'rating': -1}).skip(0)
This works: I get sorted data and can use pagination.
But if I add the criteria .find({'location':{$near : [12,32], $maxDistance: 10}}), the sorting doesn't work correctly.
The full query:
db.Users.find({'location':{$near : [12,32], $maxDistance: 10}}).limit(200).sort({'rating': -1}).skip(0)
For example
Without location criteria:
offset 0
rating 100
rating 99
rating 98
rating 97
rating 96
offset 5
rating 95
rating 94
rating 93
rating 92
rating 91
offset 10
rating 90
rating 89
rating 88
rating 87
rating 86
With location criteria:
offset 0
rating 100
rating 99
rating 98
rating 97
rating 96
offset 5
rating 90
rating 89
rating 88
rating 87
rating 86
offset 10
rating 95
rating 94
rating 93
rating 92
rating 91
What could be the problem? Can I use pagination with location criteria in MongoDB?
The aggregation framework has a way to do this using the $geoNear pipeline stage. Basically it will "project" a "distance" field which you can then use in a combination sort:
db.collection.aggregate([
{ "$geoNear": {
"near": [12,32],
"distanceField": "distance",
"maxDistance": 10
}},
{ "$sort": { "distance": 1, "rating": -1 } },
{ "$skip": 0 },
{ "$limit": 25 }
])
That should be fine, but "skip" and "limit" are not efficient over large skips. If you can get away without "page numbering" and only need to move forwards, then try a different technique.
The basic principle is to keep track of the last distance value found for the page and also the _id values of the documents from that page or a few previous, which can then be filtered out using the $nin operator:
db.collection.aggregate([
{ "$geoNear": {
"near": [12,32],
"distanceField": "distance",
"maxDistance": 10,
"minDistance": lastSeenDistanceValue,
"query": {
"_id": { "$nin": seenIds },
"rating": { "$lte": lastSeenRatingValue }
},
"num": 25 // removed in MongoDB 4.2; on newer servers use a $limit stage instead
}},
{ "$sort": { "distance": 1, "rating": -1 } }
])
Essentially that is going to be a lot better, but it won't help you with jumps to "page" 25 for example. Not without a lot more effort in working that out.
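The bookkeeping for this forward-only paging could look roughly like the sketch below. The function names and the shape of the state object are hypothetical; the sketch uses a $limit stage instead of "num", leaves out the rating filter for simplicity, and assumes each returned page is sorted by ascending distance.

```javascript
// Build the pipeline for the next page from what earlier pages returned.
// `state.lastSeenDistance` and `state.seenIds` are kept by the caller.
function nextPagePipeline(near, maxDistance, state, pageSize) {
  return [
    {
      $geoNear: {
        near: near,
        distanceField: "distance",
        maxDistance: maxDistance,
        minDistance: state.lastSeenDistance,
        // skip documents already shown, including ties at the last distance
        query: { _id: { $nin: state.seenIds } }
      }
    },
    { $sort: { distance: 1, rating: -1 } },
    { $limit: pageSize }
  ];
}

// After fetching a page, remember the farthest distance and the ids returned.
function rememberPage(state, page) {
  if (page.length === 0) {
    return state; // nothing new, keep the previous state
  }
  return {
    lastSeenDistance: page[page.length - 1].distance,
    seenIds: state.seenIds.concat(page.map((doc) => doc._id))
  };
}

// Example of the state moving forward by one page.
let state = { lastSeenDistance: 0, seenIds: [] };
state = rememberPage(state, [
  { _id: "a", distance: 1.5 },
  { _id: "b", distance: 2.25 }
]);
console.log(state); // { lastSeenDistance: 2.25, seenIds: [ 'a', 'b' ] }
```

Here seenIds grows with every page. Since minDistance already excludes closer documents, keeping only the ids from the last page or two, as described above, should be enough in practice.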

MongoDB - Querying between a time range of hours

I have a MongoDB datastore set up with location data stored like this:
{
"_id" : ObjectId("51d3e161ce87bb000792dc8d"),
"datetime_recorded" : ISODate("2013-07-03T05:35:13Z"),
"loc" : {
"coordinates" : [
0.297716,
18.050614
],
"type" : "Point"
},
"vid" : "11111-22222-33333-44444"
}
I'd like to be able to perform a query similar to the date range example but on a time range instead, i.e. retrieve all points recorded between 12 PM and 4 PM (this can be done with 1200 and 1600 in 24-hour time as well).
e.g.
With points:
"datetime_recorded" : ISODate("2013-05-01T12:35:13Z"),
"datetime_recorded" : ISODate("2013-06-20T05:35:13Z"),
"datetime_recorded" : ISODate("2013-01-17T07:35:13Z"),
"datetime_recorded" : ISODate("2013-04-03T15:35:13Z"),
a query
db.points.find({'datetime_recorded': {
$gte: Date(1200 hours),
$lt: Date(1600 hours)}
});
would yield only the first and last point.
Is this possible? Or would I have to do it for every day?
Well, the best way to solve this is to store the minutes separately as well. But you can get around this with the aggregation framework, although that is not going to be very fast:
db.so.aggregate( [
{ $project: {
loc: 1,
vid: 1,
datetime_recorded: 1,
minutes: { $add: [
{ $multiply: [ { $hour: '$datetime_recorded' }, 60 ] },
{ $minute: '$datetime_recorded' }
] }
} },
{ $match: { 'minutes' : { $gte : 12 * 60, $lt : 16 * 60 } } }
] );
In the first step, $project, we calculate the minutes as hour * 60 + minute, which we then match against in the second step, $match. Note that $hour and $minute return UTC values by default.
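The same minutes-of-day arithmetic can be checked in plain JavaScript against the four sample points from the question. This is only an illustration of the calculation, not a replacement for the query; the function name minutesOfDayUTC is made up.

```javascript
// Minutes elapsed since midnight UTC, mirroring $hour * 60 + $minute.
function minutesOfDayUTC(date) {
  return date.getUTCHours() * 60 + date.getUTCMinutes();
}

// The four sample points from the question.
const points = [
  "2013-05-01T12:35:13Z",
  "2013-06-20T05:35:13Z",
  "2013-01-17T07:35:13Z",
  "2013-04-03T15:35:13Z"
].map((iso) => new Date(iso));

// Keep the points recorded from 12:00 (inclusive) to 16:00 (exclusive).
const matched = points.filter((date) => {
  const minutes = minutesOfDayUTC(date);
  return minutes >= 12 * 60 && minutes < 16 * 60;
});

console.log(matched.map((date) => date.toISOString()));
// [ '2013-05-01T12:35:13.000Z', '2013-04-03T15:35:13.000Z' ]
```

Only the first and last points fall in the range, as the question expects.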
I'm adding an answer because I disagree with the other answers: even though there are great things you can do with the aggregation framework, it really is not an optimal way to perform this type of query.
If your identified application usage pattern is that you rely on querying for "hours" or other times of the day without wanting to look at the "date" part, then you are far better off storing that as a numeric value in the document. Something like "milliseconds from start of day" would be granular enough for as many purposes as a BSON Date, but of course gives better performance without the need to compute for every document.
Set Up
This does require some set-up in that you need to add the new fields to your existing documents and make sure you add these on all new documents within your code. A simple conversion process might be:
MongoDB 4.2 and upwards
This can actually be done in a single request due to aggregation operations being allowed in "update" statements now.
db.collection.updateMany(
{},
[{ "$set": {
"timeOfDay": {
"$mod": [
{ "$toLong": "$datetime_recorded" },
1000 * 60 * 60 * 24
]
}
}}]
)
Older MongoDB
var batch = [];
db.collection.find({ "timeOfDay": { "$exists": false } }).forEach(doc => {
batch.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": {
"$set": {
"timeOfDay": doc.datetime_recorded.valueOf() % (60 * 60 * 24 * 1000)
}
}
}
});
// write once only per reasonable batch size
if ( batch.length >= 1000 ) {
db.collection.bulkWrite(batch);
batch = [];
}
})
if ( batch.length > 0 ) {
db.collection.bulkWrite(batch);
batch = [];
}
If you can afford to write to a new collection, then looping and rewriting would not be required:
db.collection.aggregate([
{ "$addFields": {
"timeOfDay": {
"$mod": [
{ "$subtract": [ "$datetime_recorded", new Date(0) ] },
1000 * 60 * 60 * 24
]
}
}},
{ "$out": "newcollection" }
])
Or with MongoDB 4.0 and upwards:
db.collection.aggregate([
{ "$addFields": {
"timeOfDay": {
"$mod": [
{ "$toLong": "$datetime_recorded" },
1000 * 60 * 60 * 24
]
}
}},
{ "$out": "newcollection" }
])
All using the same basic conversion of:
1000 milliseconds in a second
60 seconds in a minute
60 minutes in an hour
24 hours a day
A BSON date is stored internally as the number of milliseconds since the epoch, so taking that value modulo the number of milliseconds in a day extracts the milliseconds elapsed in the current day (in UTC).
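As a small worked example of that conversion, here is the arithmetic in plain JavaScript for the sample document from the question. The names timeOfDayMs and hoursToMs are made up for illustration, and the sketch assumes dates on or after 1970 (a negative timestamp would give a negative remainder).

```javascript
const MS_PER_DAY = 1000 * 60 * 60 * 24;

// Milliseconds since midnight UTC for a given Date.
function timeOfDayMs(date) {
  return date.valueOf() % MS_PER_DAY;
}

// Convert whole hours into the same millisecond scale.
function hoursToMs(hours) {
  return hours * 60 * 60 * 1000;
}

// datetime_recorded from the sample document: 05:35:13 UTC.
const recorded = new Date("2013-07-03T05:35:13Z");
const timeOfDay = timeOfDayMs(recorded);
console.log(timeOfDay); // 20113000 = 5 h + 35 min + 13 s in milliseconds

// Same comparison a range query on timeOfDay would make for 12:00 to 16:00.
const inRange = timeOfDay >= hoursToMs(12) && timeOfDay < hoursToMs(16);
console.log(inRange); // false
```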
Query
Querying is then really simple, and as per the question example:
db.collection.find({
"timeOfDay": {
"$gte": 12 * 60 * 60 * 1000, "$lt": 16 * 60 * 60 * 1000
}
})
This uses the same conversion from hours into milliseconds to match the stored format. As before, you can use whatever scale you actually need.
Most importantly, because this is a real document property that doesn't rely on computation at run-time, you can place an index on it:
db.collection.createIndex({ "timeOfDay": 1 })
So not only does this remove the run-time overhead of calculating the value, but with an index you can also avoid collection scans, as outlined on the linked page on indexing for MongoDB.
For optimal performance you never want to calculate such things at query time: at any real-world scale, processing every document in the collection just to work out which ones you want takes an order of magnitude longer than referencing an index and fetching only those documents.
The aggregation framework may just be able to help you rewrite the documents here, but it really should not be used as a production system method of returning such data. Store the times separately.