Improve distinct query performance using indexes - MongoDB

Documents in the collection follow the schema given below:
{
  "Year": String,
  "Month": String
}
I want to accomplish the following tasks -
I want to run distinct queries like
db.col.distinct('Month', {Year: '2016'})
I have a compound index on {'Year': 1, 'Month': 1}, so intuitively it looks like the answer could be computed by looking at the index alone. The query performance on a collection with millions of documents is really poor right now. Is there any way this can be done?
I want to find the latest month of a given year. One way is to sort the result of the above distinct query and take the first element as the answer.
A much better and faster solution, as pointed out by Blakes Seven in the discussion below, would be to use the query db.col.find({ "Year": "2016" }).sort({ "Year": -1, "Month": -1 }).limit(1)
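If it helps, here is a minimal sketch (assuming the mongo shell and the collection db.col from the question) for checking with explain() that the compound index is actually used and the query is covered; the projection that drops _id is only there so the result can come entirely from the index:
db.col.createIndex({ "Year": 1, "Month": 1 })
// project only the indexed fields so the result can be answered from the index alone
db.col.find({ "Year": "2016" }, { "_id": 0, "Year": 1, "Month": 1 })
      .sort({ "Year": -1, "Month": -1 })
      .limit(1)
      .explain("executionStats")
// look for an IXSCAN stage and totalDocsExamined: 0 in the output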

Related

Trying to Compute Difference Between Two Fields in MongoDB

Good evening. I have a project in MongoDB and I keep getting an error while trying to run my last query. The dataset I have is about wines. What I want to do is print the wines that match the Red wine category and that are at least 20 years old (I have a start date, the date they were put in their barrels, and an end date, which is the date they were bottled). Note: I want all these fields to be printed at the end, sorted according to their rating.
Let me give an example of the data:
{
  _id: ObjectId("638f389d8830abb3f19aaf51"),
  tconst: 'tt0040030',
  wineType: 'Red',
  Brand: '#brand',
  startYear: 1990,
  endYear: 2002,
  rating: 6.6
}
When I use the $match stage for my three criteria everything works just fine, but I haven't figured out how to subtract the two date fields so I can find the wines that are at least 20 years old and print the results properly.
Query
You need to refer to a field to compute the difference, so you need $expr and aggregation operators.
The pipeline below keeps red wines with a year difference >= 20, and then sorts by rating.
aggregate(
[{"$match":
{"$expr":
{"$and":
[{"$eq": ["$wineType", "Red"]},
{"$gte": [{"$subtract": ["$endYear", "$startYear"]}, 20]}]}}},
{"$sort": {"rating": -1}}])
In MongoDB, the $where query operator matches documents that satisfy a JavaScript expression; you can also pass the expression as a string. It is evaluated per document, so it is generally slower than $expr.
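For comparison only, a hedged sketch of the same filter written with $where, assuming the collection is named wines (the question does not give a collection name); since it runs JavaScript per document, the $expr pipeline above should normally be preferred:
// red wines aged at least 20 years, filtered with per-document JavaScript
db.wines.find({
  $where: "this.wineType == 'Red' && (this.endYear - this.startYear) >= 20"
}).sort({ rating: -1 })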

MongoDB big data processing takes a huge amount of time

I ran into a pretty big issue in my project, where I use MongoDB as the database. I have a growing collection with 6 million documents, in which I would like to find duplicates from the past 90 days. The issue is that the optimal, logical solution I came up with doesn't work because of a bottleneck, and everything I have tried since then is suboptimal.
The documents: each document contains 26 fields, but the relevant ones are created_at and fingerprint. The first one is obvious; the second is composed of several other fields, and it is the field by which I detect duplicate documents.
The problem: I made a function that needs to remove duplicate documents from the past 90 days.
Solutions that I've tried:
The aggregation pipeline: first a $search using range, with gt: today - 90 days; in the second step I $project the document's _id, fingerprint, and posted_at fields; in the third step I $group by fingerprint, count the documents, and push the items into an array; in the last step I $match the groups whose count is greater than 1.
Here I found out that the $group stage is really slow, and I've read that it can be improved by sorting on the grouped field first. The problem is that sorting after the search stage is a bottleneck, but if I remove the search stage I get even more data, which makes the calculations even worse. I have also noticed that not pushing the items in the group stage significantly improves performance, but I need that data.
If anyone has had this problem before, or knows the solution to it, please advise me on how to handle it.
I'm using the MongoDB C# driver, but that shouldn't matter, I believe.
First, you should know that $group is not very fast, but an index can help it by producing a COUNT_SCAN plan.
I think you need to create a compound index on created_at and fingerprint:
schema.index({created_at: 1, fingerprint: 1});
Then you need to structure your aggregation pipeline like this:
const time = new Date(new Date().getTime() - (90 * 24 * 60 * 60 * 1000));
db.collection.aggregate([
  { $match: { created_at: { $gt: time } } },                // the index will be used
  { $group: { _id: "$fingerprint", count: { $sum: 1 } } },  // the index will be used
  { $match: { count: { $gt: 1 } } }
])
After getting the fingerprint values that have duplicates, you can make requests to delete the duplicates and keep only one document each, because using $push in $group would slow the group stage down.
Note: the code here is in JS language.
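A minimal sketch of that clean-up step, assuming the placeholder collection name db.collection and the time variable from the snippet above; it keeps one document per duplicate fingerprint and deletes the rest, without using $push in $group:
db.collection.aggregate([
  { $match: { created_at: { $gt: time } } },
  { $group: { _id: "$fingerprint", count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
]).forEach(dup => {
  // pick one document to keep for this fingerprint
  const keep = db.collection.findOne(
    { fingerprint: dup._id, created_at: { $gt: time } },
    { _id: 1 }
  );
  // delete every other document with the same fingerprint in the 90-day window
  db.collection.deleteMany({
    fingerprint: dup._id,
    created_at: { $gt: time },
    _id: { $ne: keep._id }
  });
});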

How to perform queries with large $in data set in MongoDB

I have simple query like this: {"field": {$nin: ["value1","value2","valueN"]}}.
The problem is the large number of unique values to exclude (using the $nin operator). It's about 50000 unique values to filter, with about 1Kb of query length.
Question: Is there an elegant and performant way to do such operations?
Example.
The collection daily_stat has 56M docs, and each day adds another 100K docs. Example of a document:
{
  "day": "2020-04-15",
  "username": "uniq_name",
  "total": 12345
}
I run the following query:
{
  "date": "2020-04-15",
  "username": {
    $nin: [
      "name1",
      "name2",
      "...",
      "name50000"
    ]
  }
}
MongoDB version: 3.6.12
I would say the big $nin array is the elegant solution. If there is an index on field then it will also be performant -- but only in terms of quickly excluding those docs not to be returned in the cursor. If you have, say, 10 million docs in a collection and you do a find() to exclude 50000, you are still dragging 9,950,000 records out of the DB and across the wire; that is non-trivial.
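As a rough sketch of that advice, assuming the exclusion list is already loaded into an array called excludedNames and using the day field from the sample document above:
// compound index: equality on day first, then username for the exclusion filter
db.daily_stat.createIndex({ day: 1, username: 1 })

const excludedNames = ["name1", "name2", "name50000"]  // 50000 values in practice

db.daily_stat.find({
  day: "2020-04-15",
  username: { $nin: excludedNames }
})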
If you can find a pattern in the values you pass in, you can try a regex instead. Example given below:
db.persons.find({'field':{$nin:[/san/i]}},{_id:0,"field":1})
More details on regex can be found at
https://docs.mongodb.com/manual/reference/operator/query/regex/

Does MongoDB have some 'soft' indexing optimization?

I have a 10 GB collection with pretty small documents (~1kb), all of which contain the field 'date'. I need to run a daily mapReduce over the documents, only on the last day.
I have a few options:
no index
index over 'date'
create a field "day" which is date without the time.
have one collection per day. myCollection_20140106 for instance
I am thinking of option 3 because it looks to me like a good compromise between indexing (slow) and reading the entire unindexed database (slow). Sorting the array 1, 3, 2, 3, 3, 2, 2, 1, 3, 3, 1, 2 might be faster than sorting 1, 13, 2, 8, 5, 4, 6, 3, 7, 11 because there are more equal-valued items. Does this apply to MongoDB indexes? Is option 3 good for this, or is it just stupid and no faster than option 2?
Any advice on this is most welcomed. Thank you very much!
EDIT: MR code:
db.my_col.mapReduce(map, reduce, {
  finalize: finalize,
  out: { merge: "day" },
  query: { "date": { $gte: start_date, $lt: end_date, $exists: true } }
})
map/reduce/finalize are basic functions that compute the average of a given field over the day, grouped by another field (e.g. date, name, price -> compute the average price per person for a given day). (This is not exactly the case, but you can consider it is; I think the mapReduce/query parts are the things of interest here and I don't want to pollute the question by adding extra weight.)
Given that you are using date for your initial selection criteria, having an index over date makes more sense than having an index over day. date is a superset of the day values, and in terms of entries the two indexes would still be of a similar (to be cautious, not the same) order of magnitude.
M/R functions are not analyzed and cannot use any indexes in MongoDB. However, as in your case, the query and sort portions of the command can take advantage of the indexes defined in MongoDB.
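A one-line sketch of the index this implies, assuming the collection name my_col from the question:
// index on date so the query portion of the mapReduce scans only the last day's documents
db.my_col.createIndex({ "date": 1 })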
I would also suggest taking a look at Mongodb MapReduce performance using Indexes.

MongoDB: To Find Objects with Three Integer Values, Sort Range of Three Values or Perform Three Queries?

My basic question is this: Which is more efficient?
mongo_db[collection].find(year: 2000)
mongo_db[collection].find(year: 2001)
mongo_db[collection].find(year: 2002)
or
mongo_db[collection].find(year: { $gte: 2000, $lte: 2002 }).sort({ year: 1 })
More detail: I have a MongoDB query in which I'll be selecting objects with 'year' attribute values of either 2000, 2001, or 2002, but no others. Is this best done as a find() with a sort(), or three separate find()s for each value? If it depends on the size of my collection, at what size does the more efficient search pattern change?
The single query is going to be faster because Mongo only has to scan the collection (or its index) once instead of three times. But you don't need a sort clause for your range query unless you actually want the results sorted for a separate reason.
You could also use $in for this:
mongo_db[collection].find({year: {$in: [2000, 2001, 2002]}})
Its performance should be very similar to your range query.
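To check this on your own data, a small hedged sketch in the mongo shell (assuming a collection named col with an index on year); explain() shows whether the $in query resolves to a single index scan with three point bounds:
db.col.createIndex({ year: 1 })
db.col.find({ year: { $in: [2000, 2001, 2002] } })
      .explain("executionStats")
// expect an IXSCAN with index bounds on year: [2000, 2000], [2001, 2001], [2002, 2002]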