MongoDB index: why is a string field like dateStr ("2022-12-30") faster than a Date field? - mongodb

We have a collection holding 20M+ documents. Querying it by time range is a frequent operation.
There are also some other fields, like type and category.
We found that if we search data in a long range, like a month, with {modified: {$gte: new Date("2022-12-01"), $lte: new Date("2022-12-31")}, ...}, it costs more than {modifiedStr: {$gte: "2022-12-01", $lte: "2022-12-31"}, ...}.
Of course they both hit their indexes.
According to my knowledge, an index on a Date field has better cardinality than one on a date string.
Could anyone give us an explanation for this? Or is this just an illusion?
Would B+ tree depth be the answer?

Related

MongoDB big data processing takes a huge amount of time

I ran into a pretty big issue in my project, where I use MongoDB as the database. I have a growing collection, with 6 million documents, in which I would like to find duplicates from the past 90 days. The issue is that the optimal, logical solution I came up with doesn't work because of a bottleneck, and everything I have tried since has been suboptimal.
The documents: Each document contains 26 fields, but the relevant ones are created_at and fingerprint. The first one is obvious; the second one is composed from several other fields - based on this field I detect duplicate documents.
The problem: I made a function that needs to remove duplicate documents in the past 90 days.
Solutions that I've tried:
The aggregation pipeline: first $search using a range, with gt: today - 90 days; in the second step I $project the documents' _id, fingerprint, and posted_at fields; in the third step I $group by fingerprint, count it, and push the items into an array; in the last step I $match the documents that have a count greater than 1.
Here I found out that the $group phase is really slow, and I've read that it can be improved by sorting on the grouped field first. But sorting after the search phase is itself a bottleneck, and if I remove the search phase I get even more data, which makes the calculations even worse. Also, I have noticed that not pushing the items in the group phase significantly improves performance, but I need them.
If someone had this problem before, or knows the solution to it, please advise me on how to handle it.
I'm using the MongoDB C# driver, but that shouldn't matter, I believe.
First, you should know that $group is not very fast, but an index can help it by enabling a COUNT_SCAN.
I think you need to create a compound index on created_at and fingerprint:
schema.index({created_at: 1, fingerprint: 1});
Then you need to adjust your aggregation pipeline to look like this:
const time = new Date(new Date().getTime() - (90 * 24 * 60 * 60 * 1000));
db.collection.aggregate([
  { $match: { created_at: { $gt: time } } }, // the index will be used
  { $group: { _id: "$fingerprint", count: { $sum: 1 } } }, // the index will be used
  { $match: { count: { $gt: 1 } } },
])
After getting the fingerprints that have duplicate values, you can issue delete requests that keep only one document per fingerprint; using $push inside $group would slow down the group step.
Note: the code here is JavaScript.
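As a sketch of that last step (a hypothetical helper, not from the answer above): once the pipeline has returned the duplicated fingerprints, fetch their documents and keep the first _id per fingerprint, collecting the rest for deletion. The logic is shown in plain Node with in-memory data; the field names _id and fingerprint come from the question.

```javascript
// Given the docs that share one fingerprint, return every _id except the
// first one, i.e. the ids to delete so exactly one copy survives.
function idsToDelete(docsOfOneFingerprint) {
  return docsOfOneFingerprint.slice(1).map((doc) => doc._id);
}

// Group an array of {_id, fingerprint} docs and collect all duplicate ids.
function duplicateIds(docs) {
  const byFingerprint = new Map();
  for (const doc of docs) {
    if (!byFingerprint.has(doc.fingerprint)) {
      byFingerprint.set(doc.fingerprint, []);
    }
    byFingerprint.get(doc.fingerprint).push(doc);
  }
  const ids = [];
  for (const group of byFingerprint.values()) {
    ids.push(...idsToDelete(group));
  }
  return ids;
}
```

With the real driver you would then run something like db.collection.deleteMany({_id: {$in: ids}}), batching the ids if the list is large.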

Query mongodb for a specific date range

I've googled and googled this, and for the life of me I can't make it work...which I know I am just missing something totally simple, so I'm hoping someone here can save my sanity.
I am trying to look up a range of documents from a MongoDB collection (using Mongoose) based on a date range.
I have this code:
var startDate = new Date(req.query.startDate);
var stopDate = new Date(req.query.stopDate);
studentProgression.find({ $and: [ {'ACT_START': {$gte: startDate, $lte: stopDate} } ]})
But it doesn't return any documents.
The date in the MongoDb collection is stored as a string, and looks like this (for example):
ACT_START: "25-MAY-20"
I know I'm probably tripping somewhere over the fact that it's a string and not a date object in Mongo, as I haven't had this issue before. I'd rather not go through the collection and change all the string dates to actual date objects.
Is there any way to do this lookup the way I've laid it out?
Unfortunately, you have to change the format of your ACT_START field and cast it to a Date.
When you use $gte and/or $lte on a string field, Mongo compares the two strings lexicographically (alphabetical order), and a day-first format like "25-MAY-20" does not sort chronologically.
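A quick illustration of why that format breaks range queries (plain Node, no database needed): for day-first strings, lexicographic order disagrees with chronological order as soon as the day digits differ.

```javascript
// "25-MAY-20" is May 25 and "24-JUN-20" is June 24 of the same year.
// Chronologically May 25 comes first; lexicographically it comes LAST,
// because the comparison starts with the day digits "25" vs "24".
const may25 = "25-MAY-20";
const jun24 = "24-JUN-20";

const lexicographicOrder = may25 < jun24;          // false - wrong order
const chronologicalOrder =
  new Date("2020-05-25") < new Date("2020-06-24"); // true - correct order
```

This is exactly the comparison a $gte/$lte range on the string field performs, which is why the query silently returns the wrong (or no) documents.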

why does $eq comparison not work on mongodb with dates

I have the following query on mongo
db.getCollection('someCollection').find({"status":"failed","start_date": {$gt: new Date("2019/03/01")}})
That retrieves all the records that have a failed status and a start_date equal to or greater than "2019/03/01".
But when i try to only retrieve records specifically for "2019/03/01":
db.getCollection('someCollection').find({"status":"failed","start_date": {$eq: new Date("2019/03/01")}})
It doesn't retrieve anything.
$lt and $gt queries work; it's just $eq that doesn't. Is this the correct way to use $eq?
thank you
When you use new Date("2019/03/01") the actual date being searched is 2019/03/01 00:00:00.00. That is, the Date is exactly midnight 2019/03/01 (down to the millisecond). Unless your record also has the exact same date recorded, it will not match. That's why you need to use {$gt: new Date("2019/03/01"), $lt: new Date("2019/03/02")}. Just think of it as that 1 "day" is actually a range of 24 hours. Which is why you need to specify it as a range, and not as a simple $eq comparison.
The reason why your first query works is because when you search for $gt 2019/03/01 you're really searching for $gt 2019/03/01 00:00:00.00. And so, of course every record started on 2019/03/01 will match.
One thing to take note of: technically, if you have a document made exactly at midnight, i.e. with a timestamp of 2019/03/01 00:00:00.00, it won't match a $gt comparison. So you really should use {$gte: new Date('2019/03/01')} (note the extra 'e'). $gte is greater than or equal. This may sound trivial but is actually important: if you don't use actual timestamps and record just the date when you create the record (i.e. you insert with {start_date: new Date('2019/03/01')}), all those timestamps will actually have 00:00:00.00 as their time component, and they won't match a $gt comparison, only a $gte comparison. You probably use new Date() when creating the record, so you're getting the full timestamp, which is why it hasn't bitten you yet. But to be semantically correct, you should use $gte in your first query. Does that make sense?
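Putting that together, the "one day" query can be built as a half-open range (a minimal sketch; the collection and field names are taken from the question, and UTC midnight is assumed):

```javascript
// Treat "2019/03/01" as the 24-hour window [midnight, next midnight).
const dayStart = new Date("2019-03-01T00:00:00Z");
const dayEnd = new Date("2019-03-02T00:00:00Z");

const filter = {
  status: "failed",
  start_date: { $gte: dayStart, $lt: dayEnd },
};
```

db.getCollection('someCollection').find(filter) then matches every document whose start_date falls anywhere on that day, including exactly midnight, and excludes midnight of the next day.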

ISO8601 Date Strings in MongoDB (Indexing and Querying)

I've got a collection of documents in MongoDB that all have a "date" field, which is an ISO8601 formatted date string (i.e. created by the moment.js format() method).
I'd like to be able to efficiently run queries against this collection that expresses the sentiment "documents that have a date that is after A and before B" (i.e. documents within a range of dates).
Should I be storing my dates as something other than ISO-8601 strings? Contingent upon that answer, what would the MongoDB query operation look like for my aforementioned requirement? I'm envisioning something like:
{$and: [
  {date: {$gt: "2017-05-02T03:15:22-04:00"}},
  {date: {$lt: "2017-06-02T03:15:22-04:00"}},
]}
Does that "just work"? I need some convincing.
You definitely want to use the built-in Date data type, with an index on your date field, and not strings. String comparison is going to be slow as hell compared to comparing dates, which are just 64-bit integers.
As far as querying is concerned, check out the answer here:
return query based on date
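One extra caution worth spelling out (not in the original answer): ISO-8601 strings only sort chronologically when they all carry the same UTC offset, and the moment.js format() output shown in the question embeds the local offset. A small Node sketch of the pitfall:

```javascript
// Two instants: a is LATER than b in real time, but a compares as
// lexicographically smaller because its local date part is "06-01".
const a = "2017-06-01T23:30:00-04:00"; // = 2017-06-02T03:30:00Z
const b = "2017-06-02T01:00:00+02:00"; // = 2017-06-01T23:00:00Z

const stringSaysABeforeB = a < b;                    // true  (wrong)
const clockSaysABeforeB = new Date(a) < new Date(b); // false (correct)
```

So even if string ranges appear to "just work" for data produced in one timezone, they break the moment documents from a different offset enter the collection. Storing real Date values (or normalizing every string to UTC in one fixed format) avoids this entirely.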

MongoDB: To Find Objects with Three Integer Values, Sort Range of Three Values or Perform Three Queries?

My basic question is this: Which is more efficient?
mongo_db[collection].find(year: 2000)
mongo_db[collection].find(year: 2001)
mongo_db[collection].find(year: 2002)
or
mongo_db[collection].find(year: { $gte: 2000, $lte: 2002 }).sort({ year: 1 })
More detail: I have a MongoDB query in which I'll be selecting objects with 'year' attribute values of either 2000, 2001, or 2002, but no others. Is this best done as a find() with a sort(), or three separate find()s for each value? If it depends on the size of my collection, at what size does the more efficient search pattern change?
The single query is going to be faster because Mongo only has to scan the collection (or its index) once instead of three times. But you don't need a sort clause for your range query unless you actually want the results sorted for a separate reason.
You could also use $in for this:
mongo_db[collection].find({year: {$in: [2000, 2001, 2002]}})
Its performance should be very similar to your range query.
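To see that the range and $in filters select the same documents, here is an in-memory sketch of the matching logic in JavaScript (the driver syntax in the question looks like Ruby; the data here is made up for illustration):

```javascript
const docs = [
  { year: 1999 }, { year: 2000 }, { year: 2001 },
  { year: 2002 }, { year: 2003 },
];

// Equivalent of {year: {$gte: 2000, $lte: 2002}}
const byRange = docs.filter((d) => d.year >= 2000 && d.year <= 2002);

// Equivalent of {year: {$in: [2000, 2001, 2002]}}
const byIn = docs.filter((d) => [2000, 2001, 2002].includes(d.year));
```

Both forms can use an index on year; the range form expresses "everything between the bounds", while $in enumerates the exact values, which only matters once the wanted values are no longer contiguous.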