How mongo index intersection works - mongodb

If I create two indexes on the same fields, one with the time fields descending and one with them ascending, will queries for both old and new data be able to use whichever index fits and get good performance in both cases?
Currently I have the index in descending order, so when I query old data it comes back really, really slowly. I am thinking of creating one more index in ascending order and trying it out. Since I have a huge number of documents (32 million), I thought I would ask here first.
This is my index and the query that cause me issues when the start/end time is a bit old. I have a TTL of close to 100 days, which keeps my collection at around 32 million documents.
index: {
"source_type" : 1.0 ,
"source_id" : 1.0 ,
"key" : 1.0 ,
"start" : -1.0 ,
"end" : -1.0
}
query: keys = diag_db.telemetry_series.aggregate([
{ '$match': {
'source_type': 'SERVER',
'start': { '$gte': start },
'end': { '$lte': end },
'$or': stream_id_query
}},
{ '$project': {
'source_id': 1,
'key': 1
}},
{ '$group':
{ '_id': { 'source': '$source_id', 'key': '$key' }
}}
])['result']
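For what it's worth, a quick way to test the idea is to build the ascending-time variant of the same compound index and compare plans with explain(). This is only a sketch based on the index and query above; the index name and the start/end values are placeholders, not from the original question.
// A sketch, not from the thread: the ascending-time variant of the index above.
db.telemetry_series.createIndex(
  { source_type: 1, source_id: 1, key: 1, start: 1, end: 1 },
  { name: "source_key_start_end_asc" }
)

// Placeholder "old" time range.
var start = ISODate("2015-01-01T00:00:00Z")
var end   = ISODate("2015-01-08T00:00:00Z")

// A $match at the head of a pipeline is planned like the equivalent find(), so
// explain() here shows which index wins and how many keys/documents were examined.
db.telemetry_series.find({
  source_type: "SERVER",
  start: { $gte: start },
  end:   { $lte: end }
}).explain("executionStats")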

Related

Efficient group distinct count in MongoDB aggregation

I have the following data and I would like to know how many accounts have LogCounts >= 7.
Account  LogCounts
AAA      2
BBB      7
AAA      7
AAA      8
AAA      3
CCC      2
Here is my working MongoDB pipeline
[
{
'$match': {
'LogCounts': {
'$gt': 6
}
}
}, {
'$project': {
'Account': 1
}
}, {
'$group': {
'_id': '$Account'
}
}, {
'$count': 'FinalAccountCounts'
}
]
But it took about 5 minutes on a collection of ~800 million records. I'd like to know if there's a better, faster or more efficient way to solve this problem.
Thank you.
From a query perspective you are pretty much as efficient as you can be, under the assumption that you have an index on LogCounts.
What you can try is to split your query into different ranges of LogCounts values; this requires prior knowledge of your data distribution, but with this approach you can "map-reduce" the query results. It will not help much if the cardinality is extremely low, e.g. if LogCounts has a maximum value of 7.
Let's assume for a second that LogCounts can only be in the range 7 to 21, i.e. in [7,8,9,10,...,20,21], with an equal distribution for each value.
In this case you could execute x queries, each of them only querying a certain range of the values.
For example, 5 queries at once would look like:
// in nodejs, all queries are executed at once.
const results = await Promise.all([
  db.collection.aggregate([ { $match: { LogCounts: { $gte: 7,  $lte: 10 } } }, ...restOfPipeline ]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 10, $lte: 13 } } }, ...restOfPipeline ]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 13, $lte: 16 } } }, ...restOfPipeline ]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 16, $lte: 19 } } }, ...restOfPipeline ]).toArray(),
  db.collection.aggregate([ { $match: { LogCounts: { $gt: 19, $lte: 21 } } }, ...restOfPipeline ]).toArray(),
])
// now merge results in memory.
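For concreteness, here is a minimal sketch of that in-memory merge (not from the original answer). It assumes restOfPipeline ends at the $group stage, so each query returns documents like { _id: '<Account>' }; an account whose LogCounts fall into more than one range must be deduplicated before counting.
// A sketch: merge the partial results and count distinct accounts.
// Assumes each element of `results` is an array of { _id: <Account> } documents.
const distinctAccounts = new Set();
for (const partial of results) {
  for (const doc of partial) {
    distinctAccounts.add(doc._id);
  }
}
console.log({ FinalAccountCounts: distinctAccounts.size });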
I can tell you that, from testing this approach on an 800M-document collection with string values and high cardinality, a single query ran in 3 min, while splitting it into 2 queries brought it down to 2.1 min including the in-memory "reduce" part.
Obviously, optimizing this will require trial and error, as you have several parameters to consider: the number of buckets, value cardinality (one query could cover 7 to 10 and another 10 to 21, depending on the distribution), the number of results, etc.
If you do end up choosing my approach I'd be happy to get an update after some testing.

Performance when filtering after $geoNear query

I have a MongoDB collection which contains a location (GeoJSON Point) and other fields to filter on.
{
"Location" : {
"type" : "Point",
"coordinates" : [
-118.42359,
33.974563
]
},
"Filters" : [
{
"k" : 1,
"v" : 5
},
{
"k" : 2,
"v" : 8
}
]
}
My query uses the aggregation framework because it performs a sequence of filtering, sorting, grouping, etc. The first stage, the filtering, is where I'm having trouble performing the geo-near operation.
$geoNear: {
spherical: true,
near: [-118.236391, 33.782092],
distanceField: 'Distance',
query: {
// Filter by other fields.
Filters: {
$all: [
{ $elemMatch: { k: 1 /* Bedrooms */, v: 5 } }
]
}
},
maxDistance: 8046
},
For indexing I tried two approaches:
Approach #1: Create two separate indexes, one with the Location field and one with the fields we subsequently filter on. This approach is slow: with very little data in my collection it takes 3+ seconds to query within a 5-mile radius.
db.ResidentialListing.ensureIndex( { Location: '2dsphere' }, { name: 'ResidentialListingGeoIndex' } );
db.ResidentialListing.ensureIndex( { "Filters.k": 1, "Filters.v": 1 }, { name: 'ResidentialListingGeoQueryIndex' } );
Approach #2: Create one index with both the Location and other fields we filter on. Creating the index never completed, as it generated a ton of warnings about "Insert of geo object generated a high number of keys".
db.ResidentialListing.ensureIndex( { Location: '2dsphere', "Filters.k": 1, "Filters.v": 1 }, { name: 'ResidentialListingGeoIndex' } );
The geo index itself seems to work fine: if I only perform the $geoNear operation and don't filter afterwards, it executes in 60 ms. However, as soon as I add the query on the other fields it gets slow. Any ideas on how to set up the query and indexes correctly so that this performs well would be appreciated...
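In case it helps, a way to see what the planner does with each index layout (a sketch, not part of the original question) is to explain the $geoNear stage on its own and then with the embedded query document added, and compare the plans:
// A sketch: default explain() verbosity (queryPlanner) is enough to see the chosen index.
db.ResidentialListing.explain().aggregate([
  { $geoNear: {
      spherical: true,
      near: [-118.236391, 33.782092],
      distanceField: 'Distance',
      maxDistance: 8046,
      query: { Filters: { $all: [ { $elemMatch: { k: 1, v: 5 } } ] } }
  }}
])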

MongoDB: calculate 90th percentile among all documents

I need to calculate the 90th percentile of the duration where the duration for each document is defined as finish_time - start_time.
My plan was to:
Create $project to calculate that duration in seconds for each document.
Calculate the index (in sorted documents list) that correspond to the 90th percentile: 90th_percentile_index = 0.9 * amount_of_documents.
Sort the documents by the $duration variable that was created.
Use the 90th_percentile_index to $limit the documents.
Choose the first document out of the limited subset of documents.
I'm new to MongoDB, so I guess the query can be improved. The query looks like:
db.getCollection('scans').aggregate([
{
$project: {
duration: {
$divide: [{$subtract: ["$finish_time", "$start_time"]}, 1000] // duration is in seconds
},
Percentile90Index: {
$multiply: [0.9, "$total_number_of_documents"] // I don't know how to get the total number of documents..
}
}
},
{
$sort : {"$duration": 1},
},
{
$limit: "$Percentile90Index"
},
{
$group: {
_id: "_id",
percentiles90 : { $max: "$duration" } // selecting the max, i.e, first document after the limit , should give the result.
}
}
])
The problem I have is that I don't know how to get the total_number_of_documents and therefore I can't calculate the index.
Example:
let's say I have only 3 documents:
{
"_id" : ObjectId("1"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:01:00.000Z"),
}
{
"_id" : ObjectId("2"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:03:00.000Z"),
}
{
"_id" : ObjectId("3"),
"start_time" : ISODate("2019-02-03T12:00:00.000Z"),
"finish_time" : ISODate("2019-02-03T12:08:00.000Z"),
}
So I would expect the result to be something like:
{
percentiles50 : 3 // in minutes, since percentiles50=3 is the minimum value that satisfies the requirement that at least 50% of the documents have duration <= percentiles50
}
I used the 50th percentile in the example because I only gave 3 documents, but it really doesn't matter; please just show me a query for the i-th percentile and it will be fine :-)
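For reference, one way to carry out the plan above without knowing the document count up front (a sketch, not an answer from the original thread) is to push all sorted durations into a single array and index into it with $arrayElemAt. Note that the $group/$push stage holds every duration in one document, so this only works while that array stays under the 16 MB BSON limit.
// A sketch: 90th percentile via $push + $arrayElemAt (requires MongoDB 3.2+).
// Index used: ceil(p * n) - 1; for the 3-document example and p = 0.5 it picks
// the 2nd smallest duration, 180 seconds = 3 minutes, matching the expected result.
db.getCollection('scans').aggregate([
  { $project: {
      duration: { $divide: [ { $subtract: ["$finish_time", "$start_time"] }, 1000 ] }
  }},
  { $sort: { duration: 1 } },
  { $group: { _id: null, durations: { $push: "$duration" } } },
  { $project: {
      _id: 0,
      percentile90: {
        $arrayElemAt: [
          "$durations",
          { $subtract: [ { $ceil: { $multiply: [0.9, { $size: "$durations" }] } }, 1 ] }
        ]
      }
  }}
])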

"too much data for sort()" on a small collection

When trying to do a find and sort on a mongodb collection I get the error below. The collection is not large at all - I have only 28 documents and I start getting this error when I cross the limit of 23 records.
The special thing about that document is that it holds a large ArrayCollection inside, but I am not fetching that specific field at all; I am only trying to get a DateTime field.
db.ANEpisodeBreakdown.find({creationDate: {$exists:true}}, {creationDate: true} ).limit(23).sort( { creationDate: 1} )
{ "$err" : "too much data for sort() with no index. add an index or specify a smaller limit", "code" : 10128 }
So the problem here is the 32MB limit and that you have no index that can be used for an "index only" or "covered" query to get to the result. Without one, your "big field" is still loaded into the data being sorted.
Easy to replicate:
var string = "";
for ( var n=0; n < 10000000; n++ ) {
string += 0;
}
for ( var x=0; x < 4; x++ ) {
db.large.insert({ "large": string, "date": new Date() });
sleep(1000);
}
So this query will blow up, unless you limit to 3:
db.large.find({},{ "date": 1 }).sort({ "date": -1 })
To overcome this:
Either create an index on "date" (and any other fields used) so the whole document does not need to be loaded and the projection is covered by the index:
db.large.ensureIndex({ "date": 1 })
db.large.find({},{ "_id": 0, "date": 1 }).sort({ "date": -1 })
{ "date" : ISODate("2014-07-07T10:08:33.067Z") }
{ "date" : ISODate("2014-07-07T10:08:31.747Z") }
{ "date" : ISODate("2014-07-07T10:08:30.391Z") }
{ "date" : ISODate("2014-07-07T10:08:29.038Z") }
Or don't index and use aggregate instead, as $project there does not suffer the same limitation: the documents are actually reshaped (reduced to just the projected fields) before being passed to $sort.
db.large.aggregate([
{ "$project": { "_id": 0, "date": 1 }},
{ "$sort": {"date": -1 }}
])
{ "date" : ISODate("2014-07-07T10:08:33.067Z") }
{ "date" : ISODate("2014-07-07T10:08:31.747Z") }
{ "date" : ISODate("2014-07-07T10:08:30.391Z") }
{ "date" : ISODate("2014-07-07T10:08:29.038Z") }
Either way gets you the results under the limit without modifying cursor limits in any way.
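If you take the index route, explain() can confirm that the sort is served by the index and that the query is covered (a sketch, not part of the original answer; the same collection and field names as above are assumed):
// A sketch: with the { date: 1 } index in place, the winning plan should be an
// IXSCAN with no in-memory SORT stage, and executionStats.totalDocsExamined should
// be 0 because the { "_id": 0, "date": 1 } projection is covered by the index.
db.large.find({}, { "_id": 0, "date": 1 }).sort({ "date": -1 }).explain("executionStats")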
Without an index, the amount of data you can sort only extends as far as shellBatchSize, which by default is 20.
DBQuery.shellBatchSize = 23;
This should do the trick.
The problem is that projection in this particular scenario still loads the entire document; it just sends it to your application without the large array field.
As such, MongoDB is still sorting with too much data for its 32MB limit.

How to count the number of documents on date field in MongoDB

Scenario: Consider that I have the following collection in MongoDB:
{
"_id" : "CustomeID_3723",
"IsActive" : "Y",
"CreatedDateTime" : "2013-06-06T14:35:00Z"
}
Now I want to know the count of documents created on a particular day (say 2013-03-04).
So I am trying to find a solution using the aggregation framework.
Information:
So far I have the following query built:
collection.aggregate([
{ $group: {
_id: '$CreatedDateTime'
}
},
{ $group: {
_id: null,
count: { $sum: 1 }
}
},
{ $project: {
_id: 0,
"count" :"$count"
}
}
])
Issue: Now, considering the above query, it's giving me a count, but not based on the date only! It's taking the time into consideration as well for the unique count.
Question: Considering the field holds an ISO date, can anyone tell me how to count the documents based on the date only (i.e. excluding the time)?
Replace your two groups with
{$project: {day: {$dayOfMonth: '$CreatedDateTime'}, month: {$month: '$CreatedDateTime'}, year: {$year: '$CreatedDateTime'}}},
{$group: {_id: {day: '$day', month: '$month', year: '$year'}, count: {$sum: 1}}}
You can read more about the date operators here: http://docs.mongodb.org/manual/reference/aggregation/#date-operators
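If only one specific day is needed (as in the 2013-03-04 example), a $match on a date range avoids grouping every document. This is a sketch, assuming CreatedDateTime is stored as a BSON date rather than a string (the date operators and this range comparison do not work on plain strings), using the same collection handle as in the question:
// A sketch: count only the documents created on 2013-03-04.
collection.aggregate([
  { $match: {
      CreatedDateTime: {
        $gte: ISODate("2013-03-04T00:00:00Z"),
        $lt:  ISODate("2013-03-05T00:00:00Z")
      }
  }},
  { $group: { _id: null, count: { $sum: 1 } } },
  { $project: { _id: 0, count: 1 } }
])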