Indexing two fields in mongo: timestamp and text

I run a lot of find requests on a collection, like this:
{'$and': [
    {'time': {'$lt': 1375214400}},
    {'time': {'$gte': 1375128000}},
    {'$or': [{'uuid': 'test'}, {'uuid': 'test2'}]}
]}
Which index should I create: a compound index, two single-field indexes, or both?
uuid - the name of the data collector.
time - a timestamp.
I want to retrieve data collected by one or a few collectors in a specified time interval.

Your query would be better written without the $and and using $in instead of $or:
{
    'time': {'$lt': 1375214400, '$gte': 1375128000},
    'uuid': {'$in': ['test', 'test2']}
}
Then it's pretty clear you need a compound index that covers both time and uuid for the best query performance. But it's important to always confirm that your index is being used as you expect by running explain().
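As a minimal sketch (the collection name is assumed), the equality-matched field (uuid, via $in) generally belongs before the range field (time) in the compound index, per MongoDB's equality-before-range guideline:
db.collection.ensureIndex({uuid: 1, time: 1})
db.collection.find({
    'time': {'$lt': 1375214400, '$gte': 1375128000},
    'uuid': {'$in': ['test', 'test2']}
}).explain()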

Related

MongoDB: Get whole document with aggregate method

I'm trying to achieve something like this:
I have a collection of activities that belong to some user.
I want to get the distinct activity names ordered by 'added_time', so I used a 'group by' on the activity name and took the max value of 'added_time'.
I also want to sort them by 'added_time', and then get the whole document.
The only thing I have reached so far is getting only the name that I grouped by and the 'added_time' property.
This is the query:
db.getCollection('user_activities').aggregate([
    {$match: {'type': 'food', 'user_id': '123'}},
    {$group: {'_id': '$name', 'added_time': {$max: '$added_time'}}},
    {$sort: {'added_time': -1}},
    {$project: {_id: 0, name: '$_id', 'added_time': 1}}
])
Can someone help me with getting the whole document?
Thanks!
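One common approach (sketched here as an assumption, not the thread's accepted answer) is to sort first and keep each group's whole document with $first on $$ROOT; note that $replaceRoot requires MongoDB 3.4+:
db.getCollection('user_activities').aggregate([
    {$match: {'type': 'food', 'user_id': '123'}},
    {$sort: {'added_time': -1}},
    // each group's first document is the one with the max added_time
    {$group: {'_id': '$name', 'doc': {$first: '$$ROOT'}}},
    {$replaceRoot: {newRoot: '$doc'}},
    {$sort: {'added_time': -1}}
])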

MongoDB compound index, aggregation performance

I've gone through many articles about indexes in MongoDB but still have no clear understanding of an indexing strategy for my particular case.
I have the following collection with more than 10 million documents:
{
    active: BOOLEAN,
    afp: ARRAY_of_integers
}
Previously I was using aggregation with this pipeline:
pipeline = [
    {'$match': {'active': True}},
    {'$project': {
        'sub': {'$setIsSubset': [afp, '$afp']}
    }},
    {'$match': {'sub': True}}
]
Queries were pretty slow, so I started experimenting without aggregation:
db.collection.find({'active': True, 'afp': {'$all': afp}})
The latter query using $all runs much faster with the same indexes - no idea why...
I've tried these indexes (not much variations possible though):
{afp: 1}
{active: 1}
{active: 1, afp: 1}
{afp: 1, active: 1}
I don't know why, but the last index gives the best performance - any ideas about the reason would be much appreciated.
Then I decided to add constraints to possibly improve speed.
Since the number of integers in the "afp" field can vary, there's no reason to scan documents that have fewer integers than the query. So I created one more field on all documents, called "len_afp", which holds that number (the afp length).
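A hedged backfill sketch (mine, not the poster's), using a simple shell loop:
db.collection.find().forEach(function(doc) {
    // store the array length alongside the array itself
    db.collection.update({_id: doc._id}, {$set: {len_afp: doc.afp.length}});
});
On MongoDB 4.2+, a single pipeline-style update would do the same: db.collection.updateMany({}, [{$set: {len_afp: {$size: '$afp'}}}]).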
Now documents look like this:
{
    active: BOOLEAN,
    afp: ARRAY_of_integers,
    len_afp: INTEGER
}
The query is:
db.collection.find({
    'active': True,
    'afp': {'$all': afp},
    'len_afp': {'$gte': len_afp}
})
Also I've added three new indexes:
{afp: 1, len_afp: 1, active: 1}
{afp: 1, active: 1, len_afp: 1}
{len_afp: 1, afp: 1, active: 1}
The last index gives the best performance - again, for unknown reasons.
So my question is: what is the logic behind field order in compound indexes, and how should this logic be considered when creating indexes?
It's also interesting why $all works faster than $setIsSubset with all other conditions the same.
Can index intersection be used in my case instead of compound indexes?
Still, the performance is pretty low - what am I doing wrong?
Can sharding help in my particular case (will it work with aggregation, or $all, or $gte)?
Sorry for the huge question, and thank you in advance!
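One way to probe the field-order question (a mongo-shell sketch under assumptions, not an answer from the thread) is to force each candidate index with hint() and compare the executionStats that explain() reports, especially totalDocsExamined versus nReturned:
db.collection.find({active: true, afp: {$all: afp}, len_afp: {$gte: len_afp}})
    .hint({len_afp: 1, afp: 1, active: 1})
    .explain('executionStats')
Repeating this for each index makes it visible how many documents each field order forces MongoDB to examine.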

MongoDB complex indices

I'm trying to understand how to best work with indices in MongoDB. Let's say that I have a collection of documents like this one:
{
    _id: 1,
    keywords: ["gap", "casual", "shorts", "oatmeal"],
    age: 21,
    brand: "Gap",
    color: "Black",
    gender: "female",
    retailer: "Gap",
    style: "Casual Shorts",
    student: false,
    location: "US"
}
and I regularly run a query to find all documents that match a set of criteria for each of those fields, something like:
db.items.find({
    age: { $gt: 13, $lt: 40 },
    brand: { $in: ['Gap', 'Target'] },
    retailer: { $in: ['Gap', 'Target'] },
    gender: { $in: ['male', 'female'] },
    style: { $in: ['Casual Shorts', 'Jeans'] },
    location: { $in: ['US', 'International'] },
    color: { $in: ['Black', 'Green'] },
    keywords: { $all: ['gap', 'casual'] }
})
I'm trying to figure out what sort of index I can create to improve the speed of a query such as this. Should I create a compound index like this:
db.items.ensureIndex({ age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
or is there a better set of indices I can create to optimize this query?
Should I create a compound index like this:
db.items.ensureIndex({age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
You can create an index like the one above, but you'd be indexing almost the entire collection. Indexes take space (usually RAM, although they can be swapped out); the more fields in the index, the more space is used. They also incur a write penalty.
Your index seems wasteful, since indexing just a few of those fields will probably already let MongoDB scan a set of documents close to the expected result of the find operation.
Is there a better set of indices I can create to optimize this query?
Like I said before, probably yes. But this question is very difficult to answer without knowing details of the collection: the number of documents it has, which values each field can have, how those values are distributed across the collection (50% male, 50% female?), how they correlate with each other, etc.
There are a few indexing strategies, but normally you should strive to create indexes with high selectivity. Choose "small" field combinations that help MongoDB locate the desired documents while scanning a "reasonable" number of them. Again, "small" and "reasonable" depend on the characteristics of the collection and the query you are performing.
Since this is a fairly complex subject, here are some references that should help you build more appropriate indexes:
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
http://docs.mongodb.org/manual/faq/indexes/#how-do-you-determine-what-fields-to-index
http://docs.mongodb.org/manual/tutorial/create-queries-that-ensure-selectivity/
And use cursor.explain to evaluate your indexes.
http://docs.mongodb.org/manual/reference/method/cursor.explain/
A large index like this one will penalize you on writes. It is better to index just what you need and let Mongo's optimizer do most of the work for you. You can always give it a hint or, as a last resort, reindex if your application or data usage changes drastically.
Your query will use the index for the fields that have one (fast), and fall back to scanning the remaining documents (slow).
Depending on your application, a few standalone indexes could be better. Adding more indexes will not necessarily improve performance; with the write penalty, it could even make things worse (YMMV).
Here is a basic algorithm for selecting fields to put in an index (a hedged sketch applying it follows the list):
What single field is in a query the most often?
If that single field is present in a query, will a table scan be expensive?
What other field could you index to further reduce the table scan?
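Applied to the collection above, and assuming age is the field that appears in queries most often (my assumption, not the poster's), the three steps might look like this in the shell:
// step 1: index the single most-queried field
db.items.ensureIndex({ age: 1 })
// step 2: check whether the residual scan is expensive
db.items.find({ age: { $gt: 13, $lt: 40 } }).explain()
// step 3: add one more field that further narrows the scan
db.items.ensureIndex({ age: 1, keywords: 1 })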
This index seems very reasonable for your query. MongoDB calls this a covered query for this index, since there is no need to access the documents; all data can be fetched from the index. (A caveat: because keywords is an array, the index is multikey, and a multikey index cannot cover queries over the array field.)
from the docs:
"Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query. An index can also cover an aggregation pipeline operation on unsharded collections."
Some remarks:
This index will only be used by queries that include a filter on age. A query that only filters by brand or retailer will probably not use this index.
Adding an index on only one or two of the most selective fields of your query will already bring a significant performance boost. The more fields you add, the larger the index will be on disk.
You may want to generate some random sample data and measure the performance of this with different indexes or sets of indexes. This is obviously the safest way to know.
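A hedged sketch of that measurement (sample values and sizes are invented):
// generate random sample documents
for (var i = 0; i < 100000; i++) {
    db.items.insert({
        age: 10 + Math.floor(Math.random() * 40),
        gender: Math.random() < 0.5 ? 'male' : 'female',
        brand: ['Gap', 'Target'][Math.floor(Math.random() * 2)]
    });
}
// compare documents examined vs. returned under each candidate index
db.items.find({ age: { $gt: 13, $lt: 40 }, gender: 'female' }).explain()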

Ranged pagination when querying & sorting on dynamic, non-unique fields in mongodb

Ranged pagination is cut and dried when you're paginating on single unique fields, but how does it work, if at all, with non-unique fields, perhaps several of them at a time?
TL;DR: Is it reasonable or possible to paginate and sort an "advanced search" type query using range-based pagination? This means querying on, and sorting on, user-selected, perhaps non-unique fields.
For example say I wanted to paginate a search for played word docs in a word game. Let's say each doc has a score and a word and I'd like to let users filter and sort on those fields. Neither field is unique. Assume a sorted index on the fields in question.
Starting simple, say the user wants to see all words with a score of 10:
// page 1
db.words.find({score: 10}).limit(pp)
// page 2, all words with the score, ranged on a unique _id, easy enough!
db.words.find({score: 10, _id: {$gt: last_id}}).limit(pp)
But what if the user wanted to get all words with a score less than 10?
// page 1
db.words.find({score: {$lt: 10}}).limit(pp)
// page 2, getting ugly...
db.words.find({
// OR because we need everything lt the last score, but also docs with
// the *same* score as the last score we haven't seen yet
$or: [
    {score: last_score, _id: {$gt: last_id}},
    {score: {$lt: last_score}}
]
}).limit(pp)
Now what if the user wanted words with a score less than 10 and an alphabetic value greater than "FOO"? The query quickly escalates in complexity, and this is just one variation of the search form with the default sort.
// page 1
db.words.find({score: {$lt: 10}, word: {$gt: "FOO"}}).limit(pp)
// page 2, officially ugly.
db.words.find({
$or: [
// triple OR because now we need docs that have the *same* score but a
// higher word OR those that have the *same* word but a lower score, plus
// the rest
{score: last_score, word: {$gt: last_word}, _id: {$gt: last_id}},
{word: last_word, score: {$lt: last_score}, _id: {$gt: last_id}},
{score: {$lt: last_score}, word: {$gt: last_word}}
]
}).limit(pp)
I suppose writing a query builder for this sort of pattern would be doable (a sketch of one follows the edit below), but it seems terribly messy and error-prone. I'm leaning toward falling back to skip pagination with a capped result size, but I'd like to use ranged pagination if possible. Am I completely wrong in my thinking of how this would have to work? Is there a better way?
Edit: For the record...
With no viable alternatives so far, I'm actually just using skip-based pagination with a limited result set, keeping the skip manageable. For my purposes this is actually sufficient, as there's no real need to search and then paginate into the thousands.
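For what it's worth, such a builder can stay fairly small. Here is a hypothetical sketch (the function name and last_doc variable are mine, not from the thread) that generates the $or cascade for any sort specification that ends in a unique tiebreaker:
// Build the next-page filter for ranged pagination over an arbitrary
// sort spec, e.g. [['score', -1], ['word', 1], ['_id', 1]].
function nextPageFilter(sortSpec, lastDoc) {
    var branches = [];
    for (var i = 0; i < sortSpec.length; i++) {
        var branch = {};
        // earlier sort keys must match the last seen document exactly
        for (var j = 0; j < i; j++) {
            branch[sortSpec[j][0]] = lastDoc[sortSpec[j][0]];
        }
        // the current key must be strictly past the last seen value
        var field = sortSpec[i][0];
        branch[field] = sortSpec[i][1] === 1
            ? {$gt: lastDoc[field]}
            : {$lt: lastDoc[field]};
        branches.push(branch);
    }
    return {$or: branches};
}
// usage: combine with the base filter and a matching sort
db.words.find({
    $and: [
        {score: {$lt: 10}, word: {$gt: 'FOO'}},
        nextPageFilter([['score', -1], ['word', 1], ['_id', 1]], last_doc)
    ]
}).sort({score: -1, word: 1, _id: 1}).limit(pp)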
You can get ranged pagination by sorting on a unique field and saving the value of that field from the last result. For example:
// first page
var page = db.words.find({
score:{$lt:10},
word:{$gt:"FOO"}
}).sort({"_id":1}).limit(pp);
// Get the _id from the last result
var page_results = page.toArray();
var last_id = page_results[page_results.length-1]._id;
// Use last_id to get your next page
var next_page = db.words.find({
score:{$lt:10},
word:{$gt:"FOO"},
_id:{$gt:last_id}
}).sort({"_id":1}).limit(pp);

MongoDB - Pagination based on non-unique fields

I am familiar with the best practice of range-based pagination on large MongoDB collections; however, I am struggling to figure out how to paginate a collection where the sort value is a non-unique field.
For example, I have a large collection of users, and there is a field for the number of times they have done something. This field is definitely non-unique and could have large groups of documents with the same value.
I would like to return results sorted by that 'numTimesDoneSomething' field.
Here is a sample data set:
{_id: ObjectId("50c480d81ff137e805000003"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000005"), numTimesDoneSomething: 9}
{_id: ObjectId("50c480d81ff137e805000006"), numTimesDoneSomething: 7}
{_id: ObjectId("50c480d81ff137e805000007"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000002"), numTimesDoneSomething: 15}
{_id: ObjectId("50c480d81ff137e805000008"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000009"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000004"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000010"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000011"), numTimesDoneSomething: 1}
How would I return this data set sorted by 'numTimesDoneSomething' with 2 records per page?
@cubbuk shows a good example using offset (skip), but you can also adapt the query he shows for ranged pagination:
db.collection.find().sort({numTimesDoneSomething:-1, _id:1})
Since the _id here will be unique and you are sorting on it secondarily, you can actually range by _id, and the results, even between two records having a numTimesDoneSomething of 12, should be consistent as to whether they land on one page or the next.
So doing something as simple as
var q = db.collection.find({_id: {$gt: last_id}}).sort({numTimesDoneSomething: -1, _id: 1}).limit(2)
should work quite well for ranged pagination.
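One caution worth adding as an assumption: filtering on _id alone while sorting primarily on numTimesDoneSomething can skip documents whose _id is below last_id but whose count hasn't been paged through yet. A safer sketch mirrors the $or pattern from the ranged-pagination question above (last_count and last_id assumed saved from the previous page):
db.collection.find({
    $or: [
        {numTimesDoneSomething: last_count, _id: {$gt: last_id}},
        {numTimesDoneSomething: {$lt: last_count}}
    ]
}).sort({numTimesDoneSomething: -1, _id: 1}).limit(2)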
You can sort on multiple fields, in this case on numTimesDoneSomething and the _id field. Since the _id field is already ascending according to insertion timestamp, you will be able to paginate through the collection without iterating over duplicate data, unless new data is inserted during the iteration.
db.collection.find().sort({numTimesDoneSomething: -1, _id: 1}).skip(index).limit(2)