Trying to Compute Difference Between Two Fields in MongoDB

Good evening. I have a project in MongoDB and I keep getting an error while trying to run my last query. The dataset I have is about wines. What I want to do is print the wines that match the Red wine category and that are at least 20 years old (I have a start date, the year they were put in their barrels, and an end date, the year they were bottled). Note: I want all these fields to be printed at the end, sorted by their rating.
let me give an example of the data:
{
  _id: ObjectId("638f389d8830abb3f19aaf51"),
  tconst: 'tt0040030',
  wineType: 'Red',
  Brand: '#brand',
  startYear: 1990,
  endYear: 2002,
  rating: 6.6
}
When I use $match with my three criteria everything works just fine, but I haven't figured out how to subtract the two year fields so I can find the wines that are at least 20 years old and print the results properly.

Query
You need to refer to a field to compute the difference, so you need $expr and aggregation operators.
The query below keeps red wines with a year difference >= 20, and then sorts by rating.
db.wines.aggregate([   // collection name `wines` assumed
  { "$match":
    { "$expr":
      { "$and":
        [ { "$eq": [ "$wineType", "Red" ] },
          { "$gte": [ { "$subtract": [ "$endYear", "$startYear" ] }, 20 ] } ] } } },
  { "$sort": { "rating": -1 } }
])

As an alternative, the $where query operator evaluates a JavaScript expression against each document and can therefore also compare two fields; the expression can be passed as a function or as a string. It is generally slower than $expr, because it runs JavaScript per document.
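For illustration, a minimal $where sketch of the same filter (the collection name wines is an assumption, as above):

db.wines.find({
  $where: function () {
    // keep red wines that aged at least 20 years between barrel and bottle
    return this.wineType === "Red" && (this.endYear - this.startYear) >= 20;
  }
}).sort({ rating: -1 })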

Related

MongoDB big data processing takes a huge amount of time

I ran into a pretty big issue in my project, where I use MongoDB as the database. I have a growing collection with 6 million documents, in which I would like to find duplicates from the past 90 days. The issue is that the optimal, logical solution I came up with doesn't work because of a bottleneck, and everything I have tried since is suboptimal.
The documents: each document contains 26 fields, but the relevant ones are created_at and fingerprint. The first is self-explanatory; the second is composed of several other fields, and it is what I use to detect duplicate documents.
The problem: I made a function that needs to remove duplicate documents in the past 90 days.
Solutions that I've tried:
The aggregation pipeline: first, $search using range with gt: today - 90 days; second, $project the documents' _id, fingerprint, and posted_at fields; third, $group by fingerprint, count, and $push the items into an array; finally, $match the documents with a count greater than 1.
Here I found out that the $group stage is really slow, and I've read that it can be improved by sorting on the grouped field first. But sorting after the $search stage is itself a bottleneck, and if I remove the $search stage I get even more data, which makes things even worse. I have also noticed that not pushing the items in the $group stage significantly improves performance, but I need them.
If someone had this problem before, or knows the solution to it, please advise me on how to handle it.
I'm using the MongoDB C# driver, but that shouldn't matter, I believe.
First, you should know that $group is not very fast, but an index can help it by enabling a COUNT_SCAN.
I think you need to create a compound index on created_at and fingerprint:
schema.index({created_at: 1, fingerprint: 1});
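The line above uses the Mongoose schema API; the same index in the plain mongo shell would be (collection name assumed):

db.collection.createIndex({ created_at: 1, fingerprint: 1 })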
Then you need to structure your aggregation pipeline like this:
const time = new Date(new Date().getTime() - (90 * 24 * 60 * 60 * 1000));
db.collection.aggregate([
  { $match: { created_at: { $gt: time } } },                 // the index will be used
  { $group: { _id: "$fingerprint", count: { $sum: 1 } } },   // the index will be used
  { $match: { count: { $gt: 1 } } }
])
After getting the fingerprint values that are duplicated, you can issue delete requests that keep only one document per fingerprint, since using $push inside $group would slow that stage down.
Note: the code here is in JS language.
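A minimal sketch of that cleanup step in the shell, assuming the newest document per fingerprint should be kept (collection name and the `time` variable as above):

db.collection.aggregate([
  { $match: { created_at: { $gt: time } } },
  { $group: { _id: "$fingerprint", count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
]).forEach(function (dup) {
  // fetch this fingerprint's documents newest-first and drop all but the first
  var extraIds = db.collection
    .find({ fingerprint: dup._id, created_at: { $gt: time } })
    .sort({ created_at: -1 })
    .toArray()
    .slice(1)
    .map(function (d) { return d._id; });
  db.collection.deleteMany({ _id: { $in: extraIds } });
});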

MongoDB keysExamined:return Ratio

In MongoDB, we use aggregate with $group when we want a count of some action. Suppose we need the total number of deposits users have made in the last 2 months: we use $group and then $sum to get the count. Now the MongoDB Atlas profiler shows this as a very time-consuming and intensive operation, because it scans the keys of 2 months' worth of data and returns only 1 document (the count). So is this a good way to get a count or not?
If your query is "get the total number of deposits for a set of users between two dates, where that set can be all users", and deposits is a collection with a field user or similar, then you do not need $group. Simply count() the filtered set:
db.deposits.find({ $and: [
  { "user": { $in: [ list ] } },
  { "depositDate": { $gte: startDate } },
  { "depositDate": { $lt: endDate } }
] }).count();
or with the aggregation pipeline:
db.deposits.aggregate([
  { $match: { $and: [
    { "user": { $in: [ list ] } },
    { "depositDate": { $gte: startDate } },
    { "depositDate": { $lt: endDate } }
  ] } },
  { $count: "n" }
]);
Note you should be using a real datetime type for 'depositDate' -- not a string, not even ISO 8601 -- to facilitate more sophisticated queries involving dates.
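If existing documents hold string dates, one way to convert them in place is a pipeline-style update; a hedged sketch, assuming MongoDB 4.2+ and the field and collection names above:

db.deposits.updateMany(
  { depositDate: { $type: "string" } },                       // only touch string values
  [ { $set: { depositDate: { $toDate: "$depositDate" } } } ]  // parse into a real Date
);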

Does order matter within a MongoDB $match block having $text search?

I am performing an aggregate query that contains a match, group, project, and then a sort stage. I am wondering whether placing the $text search block first vs. last within my $match block makes any performance difference. I am currently going by the Robo 3T response-time metric and not noticing a difference, and I wanted to confirm whether my observation holds up.
The query looks something like this:
db.COLLECTION.aggregate([
  { $match: { $text: { $search: 'xyz' }, ... } },
  { $group: { ... } },
  { $project: { ... } },
  { $sort: { ... } }
])

ISO8601 Date Strings in MongoDB (Indexing and Querying)

I've got a collection of documents in MongoDB that all have a "date" field, which is an ISO8601 formatted date string (i.e. created by the moment.js format() method).
I'd like to be able to efficiently run queries against this collection that expresses the sentiment "documents that have a date that is after A and before B" (i.e. documents within a range of dates).
Should I be storing my dates as something other than ISO-8601 strings? Contingent upon that answer, what would the MongoDB query operation look like for my aforementioned requirement? I'm envisioning something like:
{ $and: [
  { date: { $gt: "2017-05-02T03:15:22-04:00" } },
  { date: { $lt: "2017-06-02T03:15:22-04:00" } }
] }
Does that "just work"? I need some convincing.
You definitely want to use the built-in date data type, with an index on your date field, rather than strings. String comparison is going to be slow as hell compared to comparing dates, which are just 64-bit integers.
As far as querying is concerned, check out the answer here:
return query based on date
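A minimal sketch of that approach, assuming a collection named docs whose date field holds real Date values (the timestamps are the UTC equivalents of the strings above):

db.docs.createIndex({ date: 1 })   // index the date field for efficient range scans
db.docs.find({
  date: {
    $gt: ISODate("2017-05-02T07:15:22Z"),
    $lt: ISODate("2017-06-02T07:15:22Z")
  }
})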

Improve distinct query performance using indexes

Documents in the collection follow the schema as given below -
{
"Year": String,
"Month": String
}
I want to accomplish the following tasks -
I want to run distinct queries like
db.col.distinct('Month', {Year: '2016'})
I have a compound index on {'Year': 1, 'Month': 1}, so intuitively it looks like the answer could be computed by looking at the index alone. Query performance on a collection with millions of documents is really poor right now. Is there any way this could be done?
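One way to check whether the index is actually used, sketched in the shell (a covered plan shows a DISTINCT_SCAN stage in the output):

db.col.createIndex({ Year: 1, Month: 1 })   // the compound index described above
db.col.explain("executionStats").distinct("Month", { Year: "2016" })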
I want to find the latest month of a given year. One way is to sort the result of the above distinct query and take the first element as the answer.
A much better and faster solution, as pointed out by @Blakes Seven in the comments, is to use the query db.col.find({ "Year": "2016" }).sort({ "Year": -1, "Month": -1 }).limit(1)