Does MongoDB have some 'soft' indexing optimization?

I have a 10 GB collection of fairly small documents (~1 KB), each containing a 'date' field. I need to run a daily mapReduce over the documents, restricted to the last day.
I have a few options:
1. no index
2. an index on 'date'
3. create a field "day", which is the date without the time
4. one collection per day, e.g. myCollection_20140106
I am thinking of option 3 because it looks to me like a good compromise between indexing (slow) and reading the entire unindexed collection (slow). Sorting the array 1, 3, 2, 3, 3, 2, 2, 1, 3, 3, 1, 2 might be faster than sorting 1, 13, 2, 8, 5, 4, 6, 3, 7, 11 because there are more equal-valued items. Does the same apply to MongoDB indexes? Is option 3 a good fit here, or is it simply no faster than option 2?
Any advice on this is most welcome. Thank you very much!
EDIT : MR code :
db.my_col.mapReduce(map, reduce, {
    finalize: finalize,
    out: { merge: "day" },
    query: { "date": { $gte: start_date, $lt: end_date, $exists: true } }
})
map/reduce/finalize are basic functions that compute the average of a given field over the day, "grouped by" another field (e.g. given date, name, price: compute the average price per person for a given day). (This is not exactly the case, but you can assume it is; I think the mapReduce query is the thing of interest here and I don't want to pollute the question with extra weight.)
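The averaging logic described above can be sketched in plain JavaScript, simulating what the map/reduce/finalize functions would do inside MongoDB. The field names `name` and `price` are illustrative assumptions, not taken from the actual collection:

```javascript
// Sketch of the map/reduce/finalize pattern for "average price per name",
// simulated outside MongoDB. In a real map function, `this` is the current
// document and `emit` is a global provided by the server.
function map(doc, emit) {
  emit(doc.name, { sum: doc.price, count: 1 });
}

function reduce(key, values) {
  // Must be associative and commutative: combine partial sums and counts.
  var out = { sum: 0, count: 0 };
  values.forEach(function (v) {
    out.sum += v.sum;
    out.count += v.count;
  });
  return out;
}

function finalize(key, value) {
  // Turn the accumulated sum/count pair into the average.
  return value.sum / value.count;
}

// Minimal stand-in for the framework: group emitted values by key,
// reduce each group, then finalize.
function runMapReduce(docs) {
  var groups = {};
  docs.forEach(function (doc) {
    map(doc, function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    });
  });
  var result = {};
  Object.keys(groups).forEach(function (key) {
    result[key] = finalize(key, reduce(key, groups[key]));
  });
  return result;
}

var docs = [
  { name: "alice", price: 10 },
  { name: "alice", price: 20 },
  { name: "bob", price: 5 },
];
var averages = runMapReduce(docs); // { alice: 15, bob: 5 }
```

The `query` option in the real mapReduce call restricts which documents ever reach `map`, which is exactly the part that can use the 'date' index.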

Given that you are using date for your initial selection criteria, having an index on date makes more sense than having an index on day. Date values are a superset of day values, and in terms of entries the two indexes would still be of a similar (to be cautious: not the same) order of magnitude.
Map/reduce functions are not analyzed and cannot use any indexes in MongoDB. However, as in your case, the query and sort portions of the command can take advantage of the indexes defined in MongoDB.
I would also suggest taking a look at Mongodb MapReduce performance using Indexes.

Related

Is there a way to find the 10 documents that are "nearest" search values in Mongo using Aggregation Pipeline?

This is not for geospatial data.
Each document has up to 10 integer values:
{ val1: 5,
val2: 8,
val3: -4,
...
How do I find the 10 documents "nearest" to { val1: 4, val2: -1, ... }?
I can see using $addFields to create a distance for each document:
{ $addFields: { distance: { $add: [ { $pow: [ { $subtract: [ "$val1", 4 ] }, 2 ] }, { $pow: [ { $add: [ "$val2", 1 ] }, 2 ] }, ... ] } } }
And then sorting and $limit...
But I'm not sure if this will be performant (although there are only 40K documents)...
Perhaps there is a better way?
Without any preprocessing there is no better way. If you have some prior knowledge of the data distribution you could add rigid rules to filter out most documents in the initial match (like val1: {$lt: 11, $gt: 0}); however, if this returns fewer than 10 results you'll have to query again.
This is a very common access pattern that already has some production-grade solutions. I recommend you choose one of those unless you want to develop something custom-fit to your needs.
Use a database built for vector querying; for example, Elasticsearch has a vector type. Once the data is indexed, it gives you out-of-the-box search-engine functionality with various distance formulas.
Allow lower accuracy in your queries; combined with some preprocessing approaches, this can help query time. Here is a very interesting post on how Spotify dealt with this issue. They basically split their data into different clusters so that each query does not need to scan the entire dataset. Again, this compromises query accuracy.
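At 40K documents, the brute-force approach is essentially what the $addFields + $sort + $limit pipeline would do server-side. A minimal client-side sketch in plain JavaScript (the field names val1/val2 come from the question; the target point is illustrative):

```javascript
// Brute-force k-nearest sketch: compute the squared Euclidean distance of
// every document to a target point, then keep the k smallest. Missing
// fields are treated as 0 for simplicity.
function nearest(docs, target, k) {
  return docs
    .map(function (doc) {
      var dist = 0;
      Object.keys(target).forEach(function (field) {
        var d = (doc[field] || 0) - target[field];
        dist += d * d;
      });
      return { doc: doc, distance: dist };
    })
    .sort(function (a, b) { return a.distance - b.distance; })
    .slice(0, k);
}

var docs = [
  { val1: 5, val2: 8 },
  { val1: 4, val2: -1 },
  { val1: 0, val2: 0 },
];
var top2 = nearest(docs, { val1: 4, val2: -1 }, 2);
// top2[0].doc is the exact match { val1: 4, val2: -1 } at distance 0
```

This is O(n log n) per query, which is usually acceptable at 40K documents but is exactly what cluster-based approaches like Spotify's avoid at larger scales.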

Improve distinct query performance using indexes

Documents in the collection follow the schema given below:
{
"Year": String,
"Month": String
}
I want to accomplish the following tasks -
I want to run distinct queries like
db.col.distinct('Month', {Year: '2016'})
I have a compound index on {'Year': 1, 'Month': 1}, so intuitively it looks like the answer could be computed from the index alone. Right now, query performance on a collection with millions of documents is really poor. Is there any way this could be done?
I want to find the latest month of a given year. One way is to sort the result of the above distinct query and take the first element as the answer.
A much better and faster solution, as pointed out by @Blakes Seven in the discussion below, would be to use the query db.col.find({ "Year": "2016" }).sort({ "Year": -1, "Month": -1 }).limit(1)
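The difference between the two approaches can be sketched over an in-memory array (server-side, the sort/limit form can walk the { Year: 1, Month: 1 } index backwards and stop after one entry, while distinct must visit every matching key). Note this assumes zero-padded month strings, so lexicographic order matches chronological order:

```javascript
// Two ways to get the latest month of a given year, simulated in plain
// JavaScript. Months are zero-padded strings ("03", "11"), so string
// comparison matches chronological order.
var docs = [
  { Year: "2016", Month: "03" },
  { Year: "2016", Month: "11" },
  { Year: "2016", Month: "07" },
  { Year: "2015", Month: "12" },
];

// Approach 1: distinct('Month', { Year: '2016' }), then take the max.
var months = docs
  .filter(function (d) { return d.Year === "2016"; })
  .map(function (d) { return d.Month; })
  .filter(function (m, i, arr) { return arr.indexOf(m) === i; }); // dedupe
var latestViaDistinct = months.sort().reverse()[0];

// Approach 2: find({ Year: '2016' }).sort({ Month: -1 }).limit(1).
var latestViaSort = docs
  .filter(function (d) { return d.Year === "2016"; })
  .sort(function (a, b) { return b.Month.localeCompare(a.Month); })[0].Month;
// Both yield "11", but approach 2 needs only the first index entry.
```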

MongoDB: To Find Objects with Three Integer Values, Sort Range of Three Values or Perform Three Queries?

My basic question is this: Which is more efficient?
mongo_db[collection].find(year: 2000)
mongo_db[collection].find(year: 2001)
mongo_db[collection].find(year: 2002)
or
mongo_db[collection].find(year: { $gte: 2000, $lte: 2002 }).sort({ year: 1 })
More detail: I have a MongoDB query in which I'll be selecting objects with 'year' attribute values of either 2000, 2001, or 2002, but no others. Is this best done as a find() with a sort(), or three separate find()s for each value? If it depends on the size of my collection, at what size does the more efficient search pattern change?
The single query is going to be faster because Mongo only has to scan the collection (or its index) once instead of three times. But you don't need a sort clause for your range query unless you actually want the results sorted for a separate reason.
You could also use $in for this:
mongo_db[collection].find({year: {$in: [2000, 2001, 2002]}})
Its performance should be very similar to your range query.

Compound Index along with Single Index

I have two fields in a document I want to index. One of them is Receive Time, and the other one is Serial Number. I want users to be able to query on Serial Number alone or on both Serial Number and Receive Time.
The way I see it, I have two options.
A.
db.collection.ensureIndex({SerialNumber: 1, ReceiveTime: 1})
db.collection.ensureIndex({ReceiveTime: 1})
B.
db.collection.ensureIndex({ReceiveTime: 1, SerialNumber: 1})
db.collection.ensureIndex({SerialNumber: 1})
Apparently, option A is a better choice than option B (you want fields with low uniqueness later in an index). Why is that the case?
At the same time, however, the MongoDB documentation states that if your index grows monotonically ("increments"), the whole index need not fit in RAM. If this is a very write-heavy application, would B then be the better option? (Compound indexes are larger than single-field indexes, and the compound index in B increments, whereas the one in A doesn't.)
The decision between {SerialNumber: 1, ReceiveTime: 1} and {ReceiveTime: 1, SerialNumber: 1} should be based on the type of queries that you plan to perform. If you generally query for a specific SerialNumber but a large range of possible ReceiveTimes, then you want to use {SerialNumber: 1, ReceiveTime: 1}. Conversely, if your queries are specific for ReceiveTime but more general for SerialNumber then go for {ReceiveTime: 1, SerialNumber: 1}. This way each query is likely to require fewer pages of the index, and will minimize the amount of swapping that the OS has to do.
Similarly, if you are always querying by, say, the most recent ReceiveTimes, then you can keep the working set small by using {ReceiveTime: 1, SerialNumber: 1}. You will only need to keep the pages corresponding to the most recent ReceiveTimes in memory. This is what the documentation you linked to is suggesting.
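The reason option A covers both query shapes is MongoDB's index-prefix rule: an index can serve a query whose fields form a leading prefix of the index's fields. A toy checker illustrating just that rule (a simplification; the real query planner also handles field reordering, ranges, and sorts):

```javascript
// Toy illustration of MongoDB's index-prefix rule: an index supports a
// query when the queried fields are a leading prefix of the index fields.
// This deliberately ignores planner details like query-field reordering.
function prefixSupports(indexFields, queryFields) {
  return queryFields.every(function (field, i) {
    return indexFields[i] === field;
  });
}

// Option A from the question: a compound index plus a single-field index.
var optionA = [["SerialNumber", "ReceiveTime"], ["ReceiveTime"]];

// Query on SerialNumber alone: served by the compound index's prefix.
var serialAlone = optionA.some(function (idx) {
  return prefixSupports(idx, ["SerialNumber"]);
});

// Query on both fields: served by the full compound index.
var both = optionA.some(function (idx) {
  return prefixSupports(idx, ["SerialNumber", "ReceiveTime"]);
});

// Query on ReceiveTime alone: NOT a prefix of the compound index,
// which is why option A carries the separate { ReceiveTime: 1 } index.
var receiveAlone = optionA.some(function (idx) {
  return prefixSupports(idx, ["ReceiveTime"]);
});
// serialAlone, both, and receiveAlone are all true for option A.
```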

Sorting on Multiple fields mongo DB

I have a query in mongo such that I want to give preference to the first field and then the second field.
Say I have to query such that
db.col.find({category: A}).sort({updated: -1, rating: -1}).limit(10).explain()
So I created the following index
db.col.ensureIndex({category: 1, rating: -1, updated: -1})
It worked just fine, scanning only as many objects as needed, i.e. 10.
But now I need to query
db.col.find({category: { $ne: A}}).sort({updated: -1, rating: -1}).limit(10)
So I created the following index:
db.col.ensureIndex({rating: -1, updated: -1})
but this leads to a scan of the whole collection, whereas when I create
db.col.ensureIndex({ updated: -1, rating: -1 })
it scans fewer documents.
I just want to be clear about sorting on multiple fields and what order is preserved when doing so. Reading the MongoDB documentation, it seems the field we need to sort on should be the last field in the index. That is what I assumed in my $ne query above. Am I doing anything wrong?
The MongoDB query optimizer works by trying different plans to determine which approach works best for a given query. The winning plan for that query pattern is then cached for the next ~1,000 queries or until you do an explain().
To understand which query plans were considered, you should use explain(1), e.g.:
db.col.find({category:'A'}).sort({updated: -1}).explain(1)
The allPlans detail will show all plans that were compared.
If you run a query which is not very selective (for example, if many records match your criteria of {category: { $ne:'A'}}), it may be faster for MongoDB to find results using a BasicCursor (table scan) rather than matching against an index.
The order of fields in the query generally does not make a difference for the index selection (there are a few exceptions with range queries). The order of fields in a sort does affect the index selection. If your sort() criteria does not match the index order, the result data has to be re-sorted after the index is used (you should see scanAndOrder:true in the explain output if this happens).
It's also worth noting that MongoDB will only use one index per query (with the exception of $ors).
So if you are trying to optimize the query:
db.col.find({category:'A'}).sort({updated: -1, rating: -1})
You will want to include all three fields in the index:
db.col.ensureIndex({category: 1, updated: -1, rating: -1})
FYI, if you want to force a particular query to use an index (generally not needed or recommended), there is a hint() option you can try.
That is true, but there are two layers of ordering here, since you are sorting on a compound index.
As you noticed, when the first field of the index matches the first field of the sort, it worked and the index was used. When it is the other way around, it is not.
So, by your own observations, the order to preserve is the query's field order, from first to last. The Mongo query analyser can sometimes reorder fields to match an index, but normally it will just try to match the first field and, if it cannot, skip the index.
Try this code; it will sort the data first by 'name', then, within each 'name', by 'filter':
var cursor = db.collection('vc').find({ "name" : { $in: [ /cpu/, /memo/ ] } }, { _id: 0, }).sort( { "name":1 , "filter": 1 } );
Sort and Index Use
MongoDB can obtain the results of a sort operation from an index which
includes the sort fields. MongoDB may use multiple indexes to support
a sort operation if the sort uses the same indexes as the query
predicate. ... Sort operations that use an index often have better
performance than blocking sorts.
db.restaurants.find().sort( { "borough": 1, "_id": 1 } )
More information: https://docs.mongodb.com/manual/reference/method/cursor.sort/