State-dependent aggregation in MongoDB

I'm fairly new to MongoDB so forgive me if this question has a simple answer.
I'm trying to design a query that aggregates across event-generated documents' "states". In particular, I'm interested in the time "spent" in each state.
Let's say I have a MongoDB collection with the following schema:
{
timestamp: {type: Number, required: true},
state: {type: Number, required: true}
}
I want to generate a list of states and the amount of time spent in each state. For example, if I have the following documents (ordered by timestamp),
{timestamp: 100, state: 0},
{timestamp: 102, state: 1},
{timestamp: 110, state: 1},
{timestamp: 120, state: 0},
{timestamp: 123, state: 1}
then I would like to produce [{state: 0, time: 5}, {state: 1, time: 18}] where the first entry's time is due to (102-100)+(123-120) and the second entry's time is due to (120-102).
I'm aware that Mongo's aggregation framework uses streams, so it seems like this sort of state-dependent aggregation would be pretty straightforward. However, I haven't come across such a mechanism or a term for this kind of technique yet.
Any suggestions? Is there a built-in mechanism to do something like this?

I'm going to answer my own question with the solution I ended up implementing.
I realized that I was really interested in the previous state of each document. In my case documents are inserted in large batches in temporal order. So, I simply created a state_prev field and a delta field (the difference between sequential documents' timestamp values).
{
timestamp: Number,
state: Number,
state_prev: Number,
delta: Number
}
I'm now able to $sum the new delta field and $group by the state_prev field to achieve my desired aggregate computation.
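Roughly, the aggregation now looks like this (the collection name events and the $match guard for the first document of a batch are illustrative, not something defined above):
db.events.aggregate([
  // ignore documents that have no predecessor (e.g. the first event of a batch)
  { $match: { state_prev: { $ne: null } } },
  // sum the time spent in the state we were in before each event
  { $group: { _id: '$state_prev', time: { $sum: '$delta' } } },
  { $project: { _id: 0, state: '$_id', time: 1 } }
])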

Related

Mongodb is slow compared to MySQL on massive update

I'm trying to migrate some MySQL code to MongoDB (intermediate to senior in MySQL, beginner to intermediate in MongoDB). I need to perform several updates an hour that involve all documents in the collection. The updates don't overlap document-to-document, but they use each document's own values (see code below).
Before diving into actual migration of the code, I've done a benchmark :
Let's take this structure :
const MySchema = new Schema({
amount: {type: Number, required: true, default: 0},
myField: {type: Number, required: true, default: 0}
});
For the benchmark, I need to perform an update that will do the following: myField = myField + (amount * 2). The real calculation will be much more complex, so it's critical that a "simple" operation like this works.
Below is the implementation (NB: mongoose is used):
await MyModel.updateMany({amount: {$gte: 0}}, [
{$set: {
myField: {
$add: [
'$myField',
{$multiply: ['$amount', 2]}
]
}
}}
]);
Please note that a "forEach()" approach is not something I have in mind, because the number of rows will increase significantly; all I need is the fastest method to do this.
I seeded 50k entries with random amounts.
The execution of this query takes between 1200ms and 1400ms.
The equivalent in MySQL took about 75ms, so a ratio of 1:18 in the worst case.
The filter {amount: {$gte: 0}} can be omitted (it decreases execution time by about 200ms).
Now the question :
Is it "normal" considering that's a simple update, that MySQL performs faster than MongoDB on a basic 50k rows set ?
Does something in my implementation is missing to go as low as 75ms update, or should I consider mongoDB is not intended for this type of usage ?

MongoDB range query with a sort - how to speed up?

I have a query which is routinely taking around 30 seconds to run for a collection with 1 million documents. This query is to form part of a search engine, where the requirement is that every search completes in less than 5 seconds. Using a simplified example here (the actual docs have embedded documents and other attributes), let's say I have the following:
1 million docs in a Users collection, where each looks as follows:
{
name: "Dan",
age: 30,
followers: 400
},
{
name: "Sally",
age: 42,
followers: 250
}
... etc
Now, let's say I want to return the IDs of 10 users with a follower count between 200 and 300, sorted by age in descending order. This can be achieved with the following:
db.users.find({
'followers': { $gt: 200, $lt: 300 },
}).
projection({ '_id': 1 }).
sort({ 'age': -1 }).
limit(10)
I have the following compound Index created, which winningPlan tells me is being used:
db.users.createIndex({ 'followed_by': -1, 'age': -1 })
But this query is still taking ~30 seconds, as it's having to examine thousands of docs, nearly equal to the number of docs that match the find query in this case. I have experimented with different indexes (with different positions and sort orders) with no luck.
So my question is, what else can I do to either reduce the number of documents examined by the query, or speed up the process of examining them?
The query is taking long both in production and on my local dev environment, which somewhat rules out network and hardware factors. currentOp shows that the query is not waiting for locks while running, and that there are no other queries running at the same time.
To me, it looks like you have an incorrect index for your query: { 'followed_by': -1, 'age': -1 }. You should have an index on { 'followers': 1 } (but take the cardinality of that field into consideration). Even with that index, you will still need to do an in-memory sort. Still, provided the field has high cardinality, it should be much faster, because you will not need to scan the whole collection for the filtering step as you do with the followed_by index prefix.
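A minimal sketch of what I mean (assuming the field really is called followers; verify the winning plan afterwards):
db.users.createIndex({ followers: 1 })

// same query as above; explain should now show an index scan on followers,
// with the sort on age still done in memory
db.users.find({ followers: { $gt: 200, $lt: 300 } }, { _id: 1 })
        .sort({ age: -1 })
        .limit(10)
        .explain('executionStats')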

MongoDB compound index, aggregation performance

I've gone through many articles about indexes in MongoDB but still have no clear understanding of indexing strategy for my particular case.
I have the following collection with more than 10 million items:
{
active: BOOLEAN,
afp: ARRAY_of_integers
}
Previously I was using aggregation with this pipeline:
pipeline = [
{'$match': {'active': True}},
{'$project': {
'sub': {'$setIsSubset': [afp, '$afp']}
}},
{'$match': {'sub': True}}
]
Queries were pretty slow, so I started experimenting without aggregation:
db.collection.find({'active': True, 'afp': {'$all': afp}})
The latter query using $all runs much faster with the same indexes - no idea why...
I've tried these indexes (not much variations possible though):
{afp: 1}
{active: 1}
{active: 1, afp: 1}
{afp: 1, active: 1}
I don't know why, but the last index gives the best performance - any ideas about the reason would be much appreciated.
Then I decided to add constraints in order to possibly improve speed.
Considering that the number of integers in the "afp" field can differ, there's no reason to scan documents that have fewer integers than the query. So I created one more field on all documents called "len_afp", which contains that number (the afp length).
Now documents look like this:
{
active: BOOLEAN,
afp: ARRAY_of_integers,
len_afp: INTEGER
}
Query is:
db.collection.find({
'active': True,
'afp': {'$all': afp},
'len_afp': {'$gte': len_afp}
})
Also I've added three new indexes:
{afp: 1, len_afp: 1, active: 1}
{afp: 1, active: 1, len_afp: 1}
{len_afp: 1, afp: 1, active: 1}
The last index gives the best performance - again for an unknown reason.
So my questions are: what is the logic behind field order in compound indexes, and how should this logic be taken into account when creating indexes?
Also, why does $all work faster than $setIsSubset when all other conditions are the same?
Can index intersection be used for my case instead of compound indexes?
Still the performance is pretty low - what am I doing wrong?
Can sharding help in my particular case (will it work using aggregation, or $all, or $gte)?
Sorry for the huge question, and thank you in advance!

MongoDB complex indices

I'm trying to understand how to best work with indices in MongoDB. Let's say that I have a collection of documents like this one:
{
_id: 1,
keywords: ["gap", "casual", "shorts", "oatmeal"],
age: 21,
brand: "Gap",
color: "Black",
gender: "female",
retailer: "Gap",
style: "Casual Shorts",
student: false,
location: "US",
}
and I regularly run a query to find all documents that match a set of criteria for each of those fields, something like:
db.items.find({ age: { $gt: 13, $lt: 40 },
brand: { $in: ['Gap', 'Target'] },
retailer: { $in: ['Gap', 'Target'] },
gender: { $in: ['male', 'female'] },
style: { $in: ['Casual Shorts', 'Jeans']},
location: { $in: ['US', 'International'] },
color: { $in: ['Black', 'Green'] },
keywords: { $all: ['gap', 'casual'] }
})
I'm trying to figure what sort of index I can create to improve the speed of a query such as this. Should I create a compound index like this:
db.items.ensureIndex({ age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
or is there a better set of indices I can create to optimize this query?
Should I create a compound index like this:
db.items.ensureIndex({age: 1, brand: 1, retailer: 1, gender: 1, style: 1, location: 1, color: 1, keywords: 1})
You can create an index like the one above, but you'd be indexing almost the entire collection. Indexes take space; the more fields in the index, the more space is used (usually RAM, although they can be swapped out). They also incur a write penalty.
Your index seems wasteful: indexing just a few of those fields will probably already make MongoDB scan a set of documents that is close to the expected result of the find operation.
Is there a better set of indices I can create to optimize this query?
Like I said before, probably yes. But this question is very difficult to answer without knowing details of the collection, like the number of documents it has, which values each field can have, how those values are distributed in the collection (50% gender male, 50% gender female?), how they correlate to each other, etc.
There are a few indexing strategies, but normally you should strive to create indexes with high selectivity. Choose "small" field combinations that will help MongoDB locate the desired documents while scanning a "reasonable" number of them. Again, "small" and "reasonable" will depend on the characteristics of the collection and the query you are performing.
Since this is a fairly complex subject, here are some references that should help you build more appropriate indexes.
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
http://docs.mongodb.org/manual/faq/indexes/#how-do-you-determine-what-fields-to-index
http://docs.mongodb.org/manual/tutorial/create-queries-that-ensure-selectivity/
And use cursor.explain to evaluate your indexes.
http://docs.mongodb.org/manual/reference/method/cursor.explain/
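For example, something along these lines (the field subset here is only an illustration):
db.items.find({ age: { $gt: 13, $lt: 40 }, brand: { $in: ['Gap', 'Target'] } }).explain()
// compare how many index entries and documents were scanned with how many
// results came back; a big gap means the index is not selective enough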
A large index like this one will penalize you on writes. It is better to index just what you need and let Mongo's optimiser do most of the work for you. You can always give it a hint or, as a last resort, reindex if your application or data usage changes drastically.
Your query will use the index for the fields that have one (fast), and use a table scan (slow) on the remaining documents.
Depending on your application, a few standalone indexes could be better. Adding more indexes will not necessarily improve performance; with the write penalty, it could even make things worse (YMMV).
Here is a basic algorithm for selecting fields to put in an index:
What single field is in a query the most often?
If that single field is present in a query, will a table scan be expensive?
What other field could you index to further reduce the table scan?
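For instance, if keywords plus the age range already narrow the result set down far enough, a much smaller index may be all you need (the field choice here is only an illustration; you would pick it by answering the three questions above):
// index only the most selective fields; the remaining filters are applied
// to the documents that survive the index scan
db.items.ensureIndex({ keywords: 1, age: 1 })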
This index seems to be very reasonable for your query. MongoDB calls the query a covered query for this index, since there is no need to access the documents. All data can be fetched from the index.
from the docs:
"Because the index “covers” the query, MongoDB can both match the query conditions and return the results using only the index; MongoDB does not need to look at the documents, only the index, to fulfill the query. An index can also cover an aggregation pipeline operation on unsharded collections."
Some remarks:
This index will only be used by queries that include a filter on age. A query that only filters by brand or retailer will probably not use this index.
Adding an index on only one or two of the most selective fields of your query will already bring a very significant performance boost. The more fields you add, the larger the index will be on disk.
You may want to generate some random sample data and measure the performance of this with different indexes or sets of indexes. This is obviously the safest way to know.
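Something like this is enough to get started (every value below is made up; adjust the volume and distributions to resemble your real data):
// generate sample documents from the shell
var brands = ['Gap', 'Target'];
var colors = ['Black', 'Green', 'Blue'];
for (var i = 0; i < 100000; i++) {
  db.items.insert({
    keywords: ['gap', 'casual', 'kw' + (i % 50)],
    age: 13 + (i % 40),
    brand: brands[i % 2],
    retailer: brands[i % 2],
    gender: (i % 2) ? 'male' : 'female',
    style: (i % 2) ? 'Casual Shorts' : 'Jeans',
    color: colors[i % 3],
    location: (i % 2) ? 'US' : 'International',
    student: (i % 4) === 0
  });
}
// then re-run the find() from the question with explain() under each candidate index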

MongoDB - Pagination based on non-unique fields

I am familiar with the best practice of range based pagination on large MongoDB collections, however I am struggling with figuring out how to paginate a collection where the sort value is on a non-unique field.
For example, I have a large collection of users, and there is a field for the number of times they have done something. This field is definitely non-unique, and could have large groups of documents that have the same value.
I would like to return results sorted by that 'numTimesDoneSomething' field.
Here is a sample data set:
{_id: ObjectId("50c480d81ff137e805000003"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000005"), numTimesDoneSomething: 9}
{_id: ObjectId("50c480d81ff137e805000006"), numTimesDoneSomething: 7}
{_id: ObjectId("50c480d81ff137e805000007"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000002"), numTimesDoneSomething: 15}
{_id: ObjectId("50c480d81ff137e805000008"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000009"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000004"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000010"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000011"), numTimesDoneSomething: 1}
How would I return this data set sorted by 'numTimesDoneSomething' with 2 records per page?
@cubbuk shows a good example using an offset (skip), but you can also mould the query he shows for ranged pagination:
db.collection.find().sort({numTimesDoneSomething:-1, _id:1})
Since the _id here will be unique and you are using it as the secondary sort key, you can then range by _id, and the results, even between two records having a numTimesDoneSomething of 12, should be consistent as to whether they land on one page or the next.
So doing something as simple as
var q = db.collection.find({_id: {$gt: last_id}}).sort({numTimesDoneSomething:-1, _id:1}).limit(2)
Should work quite well for ranged pagination.
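One caveat: if _ids on later pages can be smaller than last_id (i.e. insertion order does not follow the sort), the boundary usually has to reference both sort keys. A sketch of that variant, where last_num and last_id come from the final document of the previous page:
var q = db.collection.find({
  $or: [
    { numTimesDoneSomething: { $lt: last_num } },               // strictly later in the sort
    { numTimesDoneSomething: last_num, _id: { $gt: last_id } }  // tie on the sort value, break by _id
  ]
}).sort({ numTimesDoneSomething: -1, _id: 1 }).limit(2)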
You can sort on multiple fields, in this case on numTimesDoneSomething and the _id field. Since the _id field is already ascending according to the insertion timestamp, you will be able to paginate through the collection without iterating over duplicate data, unless new data is inserted during the iteration.
db.collection.find().sort({numTimesDoneSomething:-1, _id:1}).skip(index).limit(2)