MongoDB: Sorting, subtracting dates and counting - mongodb

I'm a beginner in MongoDB and I'm practising the aggregation framework. In my example, I would like to get the wines produced in the last 5 years (counting back 5 years from the newest wine), and then count how many wines were produced in that period. The database stores each wine's year as an integer.
I believe I first have to sort the wines by year, then get the year of the newest wine and subtract five years, and use that period to count the wines. But I don't know how to write all of that as an aggregation pipeline.
Thanks!

You need to use various aggregation pipeline stages to transform your data.
MongoDB’s aggregation framework is modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.
As you mentioned, you first have to get the year of the newest wine.
I use $group to put all documents into a single group, $max to compute newestWineYear, and $push with $$ROOT to keep every whole document in an array called data.
Stage1
{
$group: {
_id: null,
"newestWineYear": {
$max: "$year"
},
data: {
$push: "$$ROOT"
}
}
}
The output of the first stage is a single document containing newestWineYear and the array data, which holds all the original documents.
To flatten the data array into one document per wine, $unwind is used.
Stage2
{
$unwind: "$data"
}
Finally, get the count of the wines that have been produced in the last 5 years.
I use $group again, and the count is computed with $sum.
Stage3
{
$group: {
_id: null,
count: {
"$sum": {
$cond: [
{
"$gte": [
"$data.year",
{
"$subtract": [
"$newestWineYear",
5
]
}
]
},
1,
0
]
}
}
}
}
A condition is added to $sum so that only the wines produced in the last 5 years are counted.
The condition is expressed with $cond.
It says:
If data.year >= newestWineYear - 5, then add 1 to count, else add 0.
data.year is used because the first stage pushed each whole wine document, including its year, into the data array.
Final aggregation query can be found here: Playground
An alternative without $cond inside $group can be found here; it adds a $match stage that keeps only the wines produced in the last 5 years.
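Put together, the three stages form a single pipeline array. The sketch below assembles them in a small helper; the function name countRecentWinesPipeline and its windowYears parameter are illustrative and not part of the original answer, while the stages themselves are the ones shown above.

```javascript
// Hypothetical helper: returns the three stages above as one pipeline array.
// windowYears is an assumed parameter; the answer hard-codes 5.
function countRecentWinesPipeline(windowYears = 5) {
  return [
    // Stage 1: find the newest year and keep every document in `data`.
    {
      $group: {
        _id: null,
        newestWineYear: { $max: "$year" },
        data: { $push: "$$ROOT" },
      },
    },
    // Stage 2: one output document per wine, each carrying newestWineYear.
    { $unwind: "$data" },
    // Stage 3: add 1 for each wine inside the window, 0 otherwise.
    {
      $group: {
        _id: null,
        count: {
          $sum: {
            $cond: [
              { $gte: ["$data.year", { $subtract: ["$newestWineYear", windowYears] }] },
              1,
              0,
            ],
          },
        },
      },
    },
  ];
}

// Usage in the mongo shell (collection name assumed):
// db.collection.aggregate(countRecentWinesPipeline(5))
```

Building the pipeline in a function keeps the window size in one place if you later want, say, the last 3 years instead of 5.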

As you guessed, you need to use MongoDB's aggregation framework.
Start with a simple pipeline of 3 steps:
group with $group to get the newest year and collect all documents (into an array)
flatten the array back into single documents ($unwind operator)
filter documents with the $match operator
Finally, you can replace the root of your documents to get better-formed results.
Imagine having data like this:
[
{
"wine": "Red xxx",
"year": 2018
},
{
"wine": "Red yyy",
"year": 2017
},
{
"wine": "Red zzz",
"year": 2017
},
{
"wine": "White 1",
"year": 2016
},
{
"wine": "White 2",
"year": 2013
},
{
"wine": "White 3",
"year": 2017
},
{
"wine": "White 4",
"year": 2009
}
]
Here's the pipeline and here you can see the results in playground.
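db.collection.aggregate([
  {
    // Newest year in the collection, plus every document kept in `data`
    $group: {
      _id: null,
      lastYear: { $max: "$year" },
      data: { $push: "$$ROOT" }
    }
  },
  // Back to one document per wine, each carrying lastYear
  { $unwind: "$data" },
  {
    // Keep wines no older than lastYear - 5
    $match: {
      $expr: {
        $gte: ["$data.year", { $subtract: ["$lastYear", 5] }]
      }
    }
  },
  // Restore the original document shape
  { $replaceWith: "$data" }
])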
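To check what the pipeline should return for the sample data above, the same logic can be mirrored in plain JavaScript. This is only an in-memory illustration that needs no MongoDB; the variable names are made up for the example.

```javascript
// Plain JavaScript mirror of the pipeline: $max, then $gte with $subtract.
const wines = [
  { wine: "Red xxx", year: 2018 },
  { wine: "Red yyy", year: 2017 },
  { wine: "Red zzz", year: 2017 },
  { wine: "White 1", year: 2016 },
  { wine: "White 2", year: 2013 },
  { wine: "White 3", year: 2017 },
  { wine: "White 4", year: 2009 },
];

// $group with $max: the newest year in the collection.
const lastYear = Math.max(...wines.map((w) => w.year));

// $match with $gte / $subtract: keep wines no older than lastYear - 5.
const recent = wines.filter((w) => w.year >= lastYear - 5);

console.log(lastYear);      // 2018
console.log(recent.length); // 6 (only "White 4" from 2009 is dropped)
```

Note that the comparison is inclusive, so a wine from exactly lastYear - 5 (2013 here) is kept.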

Related

How to find all the ['id'] by month in pymongo?

I have a huge dataset; the documents in my collection have fields like this:
{"id":"f3fd1b6c",
"originalVersion":"v2",
"rotation":[{"0.5"},{"-0.5"},{"-0.5"},{"-0.5"}],
"scale":[{"1.0"},{"1.0"},{"1.0"}],
"translation":[{"-2.8820719718933105"},{"11.548246383666992"},{"0.0"}],
"timestamp":"2020-03-27T13:28:09.883+00:00"}
I want to get the ids of all documents that were created in the same month.
So far I have tried using "find" with an exact timestamp query:
db.collection.find({'timestamp':date})
But I want to get all the elements that were created in the same month.
If you want to find records for a given month, you can do a simple find with $expr and $month. This assumes timestamp is stored as a date; if it is stored as a string, as in the sample above, $month raises an error and the value has to be converted first, for example with $toDate.
db.collection.find({
$expr: {
$eq: [
{
$month: "$timestamp"
},
3
]
}
})
Here is the Mongo playground for your reference.
If you want to group by month and collect the ids of each month together, you can do it like this:
db.collection.aggregate([
{
$group: {
_id: {
"$month": "$timestamp"
},
idsToFetch: {
"$push": "$id"
}
}
}
])
Here is the Mongo playground for your reference.
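Both queries above work on the month number alone, so March 2020 and March 2021 land together, and $month is evaluated for every document. If what you need is one particular month of one particular year, a plain range on the date field is an alternative, and unlike an $expr on $month it can be served by an index on timestamp. A minimal sketch, assuming timestamp is stored as a BSON date and months are taken in UTC; monthRange is a hypothetical helper name:

```javascript
// Hypothetical helper: UTC bounds [start, end) of a calendar month (month is 1-12).
function monthRange(year, month) {
  const start = new Date(Date.UTC(year, month - 1, 1));
  // Date.UTC rolls month index 12 over to January of the next year.
  const end = new Date(Date.UTC(year, month, 1));
  return { start, end };
}

const { start, end } = monthRange(2020, 3);
const filter = { timestamp: { $gte: start, $lt: end } };

// Usage (collection name assumed): db.collection.find(filter, { id: 1 })
console.log(start.toISOString()); // 2020-03-01T00:00:00.000Z
console.log(end.toISOString());   // 2020-04-01T00:00:00.000Z
```

The half-open range ($gte start, $lt end) avoids having to know how many days the month has.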

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format shown above, you can split the _id into two parts (on the dot character) and use aggregation to find the maximum ts_utc for each value of the first (numeric) part.
That way you can do it in one shot instead of iterating over each group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
As @danh mentioned in the comments, the best approach is probably to add an auxiliary field that indicates the grouping. You can then index that field to improve performance.
Here is an ad hoc way to derive the field and get the latest document per group:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
Here is the Mongo playground for your reference.
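To sanity-check what the pipeline should return, here is the same "latest document per group" logic in plain JavaScript, run against the sample data from the question. It is only an in-memory illustration, not a replacement for the server-side aggregation.

```javascript
const docs = [
  { _id: "42.abc", ts_utc: "2019-05-27T23:43:16.963Z" },
  { _id: "42.def", ts_utc: "2019-05-27T23:43:17.055Z" },
  { _id: "69.abc", ts_utc: "2019-05-27T23:43:17.147Z" },
  { _id: "69.def", ts_utc: "2019-05-27T23:44:02.427Z" },
];

const latestByGroup = new Map();
for (const doc of docs) {
  // Same derivation as $split on "." followed by $arrayElemAt 0.
  const group = doc._id.split(".")[0];
  const best = latestByGroup.get(group);
  // Keep the document with the greatest timestamp seen so far for this group.
  if (!best || Date.parse(doc.ts_utc) > Date.parse(best.ts_utc)) {
    latestByGroup.set(group, doc);
  }
}

console.log(latestByGroup.get("42")._id); // 42.def
console.log(latestByGroup.get("69")._id); // 69.def
```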

MongoDB slow aggregate time

I'm facing an issue where the aggregate function performs very slowly: it takes about 30 seconds to gather all my data. Assume one record has this structure:
{
"_id":{
"$oid":"5909a5cefece40f172895a6b"
},
"Record":1,
"Link":"https://www.google.com",
"Location":["loc1", "loc2", "loc3"],
"Organization":["org1", "org2", "org3"],
"Date":2017,
"PeoplePPL":["ppl1", "ppl2", "ppl3"]
}
And the aggregate query as follows:
db.testdata_4.aggregate([{
"$unwind": "$PeoplePPL"
},{
"$unwind": "$Location"
},{
"$match": {
Date: {
$gte: lowerBoundYear,
$lte: upperBoundYear
}
}
},{
"$group": {
"_id": {
"People": "$PeoplePPL",
"Date": "$Date"
},
Links: {
$addToSet: "$Link"
},
Locations: {
$addToSet: "$Location"
}
}
},{
"$group": {
"_id": "$_id.People",
Record: {
$push: {
"Country": "$Locations",
"Year": "$_id.Date",
"Links": "$Links"
}
}
}
}]).toArray()
There are a total of 154 records in the "testdata_4" collection, and the aggregation returns 5571 records with a query time of 28 seconds. I have run ensureIndex() on "Locations" and "Date". Is this normal as the number of records returned increases?
If it isn't normal, is there a workaround to bring the query time down to 5 seconds at most instead of 28 seconds or more?
It's very likely that the index on Date isn't being used.
The $match and $sort stages can take advantage of indexes when they are used at the beginning of the pipeline. In this case, the filter is applied after several $unwind stages, which means the index likely isn't being used.
Suggestions:
Move the $match stage to the beginning of the pipeline
In the sample document, "Date" and "Link" aren't arrays, but "Location" and "PeoplePPL" are, so each $unwind multiplies the number of documents flowing through the pipeline. Check whether you really need both $unwind stages and remove any you don't.
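As a sketch of the first suggestion, here is the question's pipeline with the $match stage moved to the front. Because "Date" is a scalar, filtering before the $unwind stages selects the same documents while letting the index on "Date" be considered. The wrapper function buildPipeline is a hypothetical name; the stages are copied from the question.

```javascript
// Hypothetical helper: the question's pipeline with $match moved to the front.
function buildPipeline(lowerBoundYear, upperBoundYear) {
  return [
    // Filter first, so only matching documents are unwound and grouped.
    { $match: { Date: { $gte: lowerBoundYear, $lte: upperBoundYear } } },
    { $unwind: "$PeoplePPL" },
    { $unwind: "$Location" },
    {
      $group: {
        _id: { People: "$PeoplePPL", Date: "$Date" },
        Links: { $addToSet: "$Link" },
        Locations: { $addToSet: "$Location" },
      },
    },
    {
      $group: {
        _id: "$_id.People",
        Record: {
          $push: { Country: "$Locations", Year: "$_id.Date", Links: "$Links" },
        },
      },
    },
  ];
}

// Usage in the mongo shell:
// db.testdata_4.aggregate(buildPipeline(2015, 2017)).toArray()
```

Running the aggregation with explain enabled before and after the change is a way to confirm whether the index is actually picked up.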

Mongo $subtract date doesn't work in aggregation $match block

I am creating a Mongo aggregation query which uses a $subtract operator in my $match block, as shown in the code below.
This query doesn't work:
db.coll.aggregate(
[
{
$match: {
timestamp: {
$gte: {
$subtract: [new Date(), 24 * 60 * 60 * 1000]
}
}
}
},
{
$group: {
_id: {
timestamp: "$timestamp"
},
total: {
$sum: 1
}
}
},
{
$project: {
_id: 0,
timestamp: "$_id.timestamp",
total: "$total",
}
},
{
$sort: {
timestamp: -1
}
}
]
)
However, this second query works:
db.coll.aggregate(
[
{
$match: {
timestamp: {
$gte: new Date(new Date() - 24 * 60 * 60 * 1000)
}
}
},
{
$group: {
_id: {
timestamp: "$timestamp"
},
total: {
$sum: 1
}
}
},
{
$project: {
_id: 0,
timestamp: "$_id.timestamp",
total: "$total",
}
},
{
$sort: {
timestamp: -1
}
}
]
)
I need to use $subtract on my $match block so I can't use the last query.
As of MongoDB 3.6 you can use $subtract in the $match stage via $expr. Here are the docs: https://docs.mongodb.com/manual/reference/operator/query/expr/
I was able to build a query like the one you're describing with $expr and $$NOW, a system variable introduced in MongoDB 4.2. Here is my query, which returns orders created within the last 4 hours:
[
{ $match:
{ $expr:
{ $gt: [
"$_created_at",
{ $subtract: [ "$$NOW", 4 * 60 * 60 * 1000] } ]
}
}
}
]
Well, you cannot do that, and you are not meant to either. You also say you "need" to do this, but in reality you do not.
Pretty much all of the general aggregation operators outside of the pipeline operators are really only valid within a $project or a $group pipeline stage. Mostly within $project but certainly not in others.
A $match pipeline is really the same as a general "query" operation, so the only things valid in there are the query operators.
As for the case for your "need", any "value" that is submitted within an aggregation pipeline and particularly within a $match needs to be evaluated outside of the actual pipeline before the BSON representation is sent to the server.
The only exception is the notation that refers to fields in the document, such as "$fieldname", and then only really in $project or $group. That notation "refers" to an existing value of a document, and that is something that cannot be done within any type of "query" document expression.
If you need to work with the value of another field in the document then you work it out with $project first, as in:
db.collection.aggregate([
{ "$project": {
"fieldMath": { "$subtract": [ "$fieldOne", "$fieldTwo" ] }
}},
{ "$match": { "fieldMath": { "$gt": 2 } }}
])
For any other purpose you really want to evaluate the value "outside" the pipeline.
The above answers the question you asked, but this answers the question you didn't ask.
Your pipeline doesn't make any sense since grouping on the "timestamp" alone would be unlikely to group anything since the values are of millisecond accuracy and there is likely not to be more than just a few at best for very active systems.
It appears like you are looking for the math to group by "day", which you can do like this:
db.collection.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$timestamp", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$timestamp", new Date(0) ] },
1000 * 60 * 60 * 24
]}
]
},
"total": { "$sum": 1 }
}}
])
That "rounds" your timestamp value to a single day and has a much better chance of "aggregating" something than you would otherwise have.
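The arithmetic in that $group key can be tried out in plain JavaScript. The sketch below is an illustration of the same calculation, not MongoDB code; floorToUtcDay is a made-up name, and it assumes timestamps on or after 1970-01-01.

```javascript
const MS_PER_DAY = 1000 * 60 * 60 * 24;

// Same arithmetic as the $group key: milliseconds since the epoch,
// minus the remainder left over after whole days.
function floorToUtcDay(date) {
  const ms = date.getTime();     // { $subtract: ["$timestamp", new Date(0)] }
  return ms - (ms % MS_PER_DAY); // outer $subtract with $mod
}

const key = floorToUtcDay(new Date("2020-03-27T13:28:09.883Z"));
console.log(new Date(key).toISOString()); // 2020-03-27T00:00:00.000Z
```

As in the pipeline, the key is a number of milliseconds rather than a date, so it has to be converted back if you want to display it as a day.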
Or you can use the "date aggregation operators" to do much the same thing with a composite key.
So if you want to "query" then it evaluates externally. If you want to work on a value "within the document" then you must do so in either a $project or $group pipeline stage.
The $subtract operator is a projection-operator. It is only available during a $project step. So your options are:
(not recommended) Add a $project-step before your $match-step to convert the timestamp field of all documents for the following match-step. I would not recommend you to do this because this operation needs to be performed on every single document on your database and prevents the database from using an index on the timestamp field, so it could cost you a lot of performance.
(recommended) Generate the Date you want to match against in the shell / in your application. Generate a new Date() object, store it in a variable, subtract 24 hours from it and perform your 2nd query using that variable.
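The recommended option can be sketched as follows. The helper name matchSince and its injectable now parameter are assumptions made for the example; the field name timestamp comes from the question.

```javascript
// Hypothetical helper: a $match stage for documents newer than `hours` ago.
// `now` is injectable so the cutoff can be tested; it defaults to the current time.
function matchSince(hours, now = new Date()) {
  const cutoff = new Date(now.getTime() - hours * 60 * 60 * 1000);
  return { $match: { timestamp: { $gte: cutoff } } };
}

const stage = matchSince(24, new Date("2020-03-27T13:00:00.000Z"));
console.log(stage.$match.timestamp.$gte.toISOString()); // 2020-03-26T13:00:00.000Z

// Usage in the mongo shell: db.coll.aggregate([matchSince(24), /* other stages */])
```

Because the cutoff is a plain date by the time the query is sent, the comparison is an ordinary range match on the timestamp field.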

Mongodb query specific month|year not date

How can I query for a specific month in MongoDB, not a date range? I need the month to build a list of customer birthdays for the current month.
In SQL it would be something like this:
SELECT * FROM customer WHERE MONTH(bday)='09'
Now I need to translate that in mongodb.
Note: My dates are already saved as MongoDate. I chose this type thinking it would be easy to work with, but now I can't easily find how to do this simple thing.
With MongoDB 3.6 and newer, you can use the $expr operator in your find() query. This allows you to build query expressions that compare fields from the same document in a $match stage.
db.customer.find({ "$expr": { "$eq": [{ "$month": "$bday" }, 9] } })
For older MongoDB versions, consider an aggregation pipeline that uses the $redact operator. In a single stage it combines what would otherwise take a $project, to create a field holding the month of the date, and a $match, to keep only the documents whose month is September.
In the pipeline below, $redact uses the $cond ternary operator to supply the conditional expression that decides the redaction. The logical expression in $cond checks whether the month extracted from the date field equals the given value; if it does, $redact keeps the document via the $$KEEP system variable, otherwise it discards it via $$PRUNE.
Running the following pipeline should give you the desired result:
db.customer.aggregate([
{ "$match": { "bday": { "$exists": true } } },
{
"$redact": {
"$cond": [
{ "$eq": [{ "$month": "$bday" }, 9] },
"$$KEEP",
"$$PRUNE"
]
}
}
])
This is similar to a $project + $match combination, but with $project you then need to list all the other fields that should pass through the pipeline:
db.customer.aggregate([
{ "$match": { "bday": { "$exists": true } } },
{
"$project": {
"month": { "$month": "$bday" },
"bday": 1,
"field1": 1,
"field2": 1,
.....
}
},
{ "$match": { "month": 9 } }
])
Another alternative, albeit a slow one, is the find() method with $where:
db.customer.find({ "$where": "this.bday.getMonth() === 8" })
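The 8 in the $where version is not a typo: JavaScript's getMonth() counts months from 0, while MongoDB's $month counts from 1, so September is 8 in one and 9 in the other. Since the question asks for the current month, a small helper can do the conversion. This is a sketch; birthdayMonthFilter is a hypothetical name, and it takes the month in UTC, which is also what $month uses when no timezone is given.

```javascript
// Hypothetical helper: filter for birthdays in the month of `today` (UTC).
function birthdayMonthFilter(today = new Date()) {
  const month = today.getUTCMonth() + 1; // getUTCMonth(): 0-11, $month: 1-12
  return { $expr: { $eq: [{ $month: "$bday" }, month] } };
}

const filter = birthdayMonthFilter(new Date("2020-09-15T00:00:00.000Z"));
console.log(JSON.stringify(filter));
// {"$expr":{"$eq":[{"$month":"$bday"},9]}}

// Usage in the mongo shell: db.customer.find(birthdayMonthFilter())
```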
You can do that using aggregate with the $month projection operator:
db.customer.aggregate([
{$project: {name: 1, month: {$month: '$bday'}}},
{$match: {month: 9}}
]);
First, check whether the field is stored as an ISODate.
If not, you can convert it as in the following example:
db.collectionName.find().forEach(function (doc) {
  // Replace the stored value with a real date and write the document back
  doc.your_date_field = new ISODate(doc.your_date_field);
  db.collectionName.save(doc);
});
Now you can query it in two ways:
db.collectionName.find({ $expr: {
$eq: [{ $year: "$your_date_field" }, 2017]
}});
Or by aggregation:
db.collectionName.aggregate([
  { $project: {
    field1_you_need_in_result: 1,
    field12_you_need_in_result: 1,
    your_year_variable: { $year: '$your_date_field' },
    your_month_variable: { $month: '$your_date_field' }
  }},
  { $match: { your_year_variable: 2017, your_month_variable: 3 } }
]);
Yes, you can match on both the month and the year of a date like this:
db.collection.find({
$expr: {
$and: [
{
"$eq": [
{
"$month": "$date"
},
3
]
},
{
"$eq": [
{
"$year": "$date"
},
2020
]
}
]
}
})
If you're concerned about efficiency, you may want to store the month data in a separate field within each document.
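One way to follow that advice is to compute the month when the document is written. The sketch below assumes a field called bdayMonth and a helper called withBirthMonth, both invented for the example, and uses a made-up customer record.

```javascript
// Hypothetical helper: copy of `doc` with the UTC month (1-12) of `bday` stored alongside it.
function withBirthMonth(doc) {
  return { ...doc, bdayMonth: doc.bday.getUTCMonth() + 1 };
}

// Made-up sample record, for illustration only.
const customer = withBirthMonth({
  name: "Ana",
  bday: new Date("1990-09-15T00:00:00.000Z"),
});
console.log(customer.bdayMonth); // 9

// The query then becomes a plain equality match that an index on bdayMonth can serve:
// db.customer.find({ bdayMonth: 9 })
```

The trade-off is that the extra field has to be kept in sync whenever bday changes.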