Is the MongoDB query
{key: {$in: ["Only One Element"]}}
equivalent to
{key: "Only One Element"}
in terms of performance?
I wrote a simple test script: https://gist.github.com/nellessen/5f2d36de4ef74b5a34b3
with the following result:
Average millis for $in with index: 0.12271258831
Average millis for simple query with index: 0.114794611931
Average millis for $in without index: 1.51113886833
Average millis for simple query without index: 1.40885763168
So {key: {$in: ["Only One Element"]}} comes out roughly 7% slower than {key: "Only One Element"}!
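For anyone re-running the comparison, the two query shapes can also be checked against the query planner in the mongo shell; with an index on key, both should produce the same IXSCAN plan (the collection name items below is a placeholder):
db.items.createIndex({key: 1})
db.items.find({key: "Only One Element"}).explain("executionStats")
db.items.find({key: {$in: ["Only One Element"]}}).explain("executionStats")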
Related
{
field_1: [
{A: 1, ...},
{A: 10, ...},
{A: 2, ...}
]
}
What if field_1 has tens of thousands of subdocuments in its array?
What kind of query performance would be negatively affected? And why?
What if the subdocuments are covered by a multikey index?
If field_1 has a million subdocuments, it will overload the stack at runtime and MongoDB will not be able to process that amount of data; it will end up crashing your code. (In practice you would also hit MongoDB's 16 MB document size limit long before reaching that many subdocuments.)
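For reference, indexing a field inside the array gives you a multikey index automatically, since MongoDB marks an index as multikey as soon as an indexed field contains an array. A minimal sketch, with the collection name coll being a placeholder:
db.coll.createIndex({"field_1.A": 1})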
I've gone through many articles about indexes in MongoDB but still have no clear understanding of indexing strategy for my particular case.
I have the following collection with more than 10 million items:
{
active: BOOLEAN,
afp: ARRAY_of_integers
}
Previously I was using aggregation with this pipeline:
pipeline = [
{'$match': {'active': True}},
{'$project': {
'sub': {'$setIsSubset': [afp, '$afp']}
}},
{'$match': {'sub': True}}
]
Queries were pretty slow, so I started experimenting without aggregation:
db.collection.find({'active': True, 'afp': {'$all': afp}})
The latter query using $all runs much faster with the same indexes - no idea why...
I've tried these indexes (there aren't many possible variations, though):
{afp: 1}
{active: 1}
{active: 1, afp: 1}
{afp: 1, active: 1}
I don't know why, but the last index gives the best performance - any ideas about the reason would be much appreciated.
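One way to investigate is to compare the candidate plans directly and force a specific index with hint(); a sketch, where the afp values are placeholders:
db.collection.find({active: true, afp: {$all: [1, 2, 3]}}).explain("executionStats")
db.collection.find({active: true, afp: {$all: [1, 2, 3]}}).hint({afp: 1, active: 1}).explain("executionStats")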
Then I decided to add constraints in order to possibly improve speed.
Considering that the number of integers in the "afp" field can differ, there's no reason to scan documents that contain fewer integers than the query array. So I created one more field on all documents, called "len_afp", which holds that number (the length of afp).
Now documents look like this:
{
active: BOOLEAN,
afp: ARRAY_of_integers,
len_afp: INTEGER
}
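A minimal sketch of backfilling such a field in the mongo shell (assuming the collection is called collection; len_afp also has to be kept in sync whenever afp changes):
db.collection.find({len_afp: {$exists: false}}).forEach(function(doc) {
    db.collection.update({_id: doc._id}, {$set: {len_afp: doc.afp.length}});
});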
Query is:
db.collection.find({
'active': True,
'afp': {'$all': afp},
'len_afp': {'$gte': len_afp}
})
Also I've added three new indexes:
{afp: 1, len_afp: 1, active: 1}
{afp: 1, active: 1, len_afp: 1}
{len_afp: 1, afp: 1, active: 1}
The last index gives the best performance - again for reasons I don't understand.
So my question is: what is the logic behind field order in compound indexes, and how should that logic be applied when creating them?
It's also interesting why $all works faster than $setIsSubset when all other conditions are the same.
Can index intersection be used for my case instead of compound indexes?
Still the performance is pretty low - what am I doing wrong?
Can sharding help in my particular case (will it work using aggregation, or $all, or $gte)?
Sorry for the huge question, and thank you in advance!
I run a lot of find requests on a collection like this:
{'$and': [{'time': {'$lt': 1375214400}},
{'time': {'$gte': 1375128000}},
{'$or': [{'uuid': 'test'},{'uuid': 'test2'}]}
]}
Which index should I create: a compound index, two single-field indexes, or both?
uuid - the name of a data collector.
time - a timestamp.
I want to retrieve data collected by one or a few collectors in a specified time interval.
Your query would be better written without the $and and using $in instead of $or:
{
'time': {'$lt': 1375214400, '$gte': 1375128000},
'uuid': {'$in': ['test', 'test2']}
}
Then it's pretty clear you need a compound index that covers both time and uuid for best query performance. But it's important to always confirm your index is being used as you expect by using explain().
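For example, putting the equality field before the range field is one reasonable ordering (a sketch; explain() will show whether the planner actually uses it):
db.collection.createIndex({uuid: 1, time: 1})
db.collection.find({uuid: {$in: ['test', 'test2']}, time: {$gte: 1375128000, $lt: 1375214400}}).explain("executionStats")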
I'm fairly new to MongoDB so forgive me if this question has a simple answer.
I'm trying to design a query that aggregates across event-generated documents' "states". In particular, I'm interested in the time "spent" in each state.
Let's say I have a MongoDB collection with the following schema:
{
timestamp: {type: Number, required: true},
state: {type: Number, required: true}
}
I want to generate a list of states and the amount of time spent in each state. For example, if I have the following documents (ordered by timestamp),
{timestamp: 100, state: 0},
{timestamp: 102, state: 1},
{timestamp: 110, state: 1},
{timestamp: 120, state: 0},
{timestamp: 123, state: 1}
then I would like to produce [{state: 0, time: 5}, {state: 1, time: 18}] where the first entry's time is due to (102-100)+(123-120) and the second entry's time is due to (120-102).
I'm aware that Mongo's aggregation framework uses streams, so it seems like this sort of state-dependent aggregation would be pretty straightforward. However, I haven't come across such a mechanism or a term for this kind of technique yet.
Any suggestions? Is there a built-in mechanism to do something like this?
I'm going to answer my own question with the solution I ended up implementing.
I realized that I was really interested in the previous state of each document. In my case documents are inserted in large batches in temporal order. So, I simply created a state_prev field and a delta field (the difference between sequential documents' timestamp values).
{
timestamp: Number,
state: Number,
state_prev: Number,
delta: Number
}
I'm now able to $sum the new delta field and $group by the state_prev field to achieve my desired aggregate computation.
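Concretely, the aggregation looks roughly like this with the fields above:
db.collection.aggregate([
    {$group: {_id: "$state_prev", time: {$sum: "$delta"}}},
    {$project: {_id: 0, state: "$_id", time: 1}}
])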
I am familiar with the best practice of range-based pagination on large MongoDB collections; however, I am struggling to figure out how to paginate a collection where the sort value is a non-unique field.
For example, I have a large collection of users, and there is a field for the number of times they have done something. This field is definitely non-unique and could have large groups of documents with the same value.
I would like to return results sorted by that 'numTimesDoneSomething' field.
Here is a sample data set:
{_id: ObjectId("50c480d81ff137e805000003"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000005"), numTimesDoneSomething: 9}
{_id: ObjectId("50c480d81ff137e805000006"), numTimesDoneSomething: 7}
{_id: ObjectId("50c480d81ff137e805000007"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000002"), numTimesDoneSomething: 15}
{_id: ObjectId("50c480d81ff137e805000008"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000009"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000004"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000010"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000011"), numTimesDoneSomething: 1}
How would I return this data set sorted by 'numTimesDoneSomething' with 2 records per page?
#cubbuk shows a good example using an offset (skip), but you can also mould the query he shows for ranged pagination:
db.collection.find().sort({numTimesDoneSomething:-1, _id:1})
Since the _id here will be unique and you are using it as the secondary sort key, you can then range on the pair (numTimesDoneSomething, _id), and the results, even between two records having a numTimesDoneSomething of 12, will be split consistently between one page and the next.
So doing something as simple as
var q = db.collection.find({$or: [{numTimesDoneSomething: {$lt: last_value}}, {numTimesDoneSomething: last_value, _id: {$gt: last_id}}]}).sort({numTimesDoneSomething: -1, _id: 1}).limit(2)
(where last_value and last_id come from the final document of the previous page) should work quite well for ranged pagination.
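A compound index matching the sort keeps both the range and the sort cheap (a sketch, using the same field names):
db.collection.createIndex({numTimesDoneSomething: -1, _id: 1})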
You can sort on multiple fields, in this case on numTimesDoneSomething and the _id field. Since the _id field already increases with insertion time, you will be able to paginate through the collection without iterating over duplicate data, unless new data is inserted during the iteration.
db.collection.find().sort({numTimesDoneSomething:-1, _id:1}).skip(index).limit(2)
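Note that skip-based paging still walks past everything before the offset, so the cost grows with the page number; for deep pages the ranged approach above scales better. A usage sketch (page numbering starting at 0):
var page = 3, pageSize = 2;
db.collection.find().sort({numTimesDoneSomething: -1, _id: 1}).skip(page * pageSize).limit(pageSize)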