We have several map-reduce jobs that run on a scheduler and aggregate some counts for us. We'd like to switch these over to real-time aggregate calls. The problem is that the map-reduce, in all its infinite flexibility, is tallying 4 different counts for the collection it runs against. Things like:
var result =
{
    MessageId: this.MessageId,
    Date: this.Created,
    Queued: (this.Status == 0 ? 1 : 0),
    Sent: (this.Status == 1 ? 1 : 0),
    Failed: (this.Status == 2 ? 1 : 0),
    Total: 1,
    Unsubscribes: 0
};
If I were in SQL I don't think I could pull this off with a single GROUP BY/SUM, because I need a different filter for each SUM. Is it possible in mongo, or do I need to run 4 separate $group statements with different $match clauses?
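For reference, a single $group can express all four tallies by wrapping each count in $cond, so no separate $match passes are needed. This is only a sketch: the collection name is a placeholder, the grouping key on MessageId/Created is assumed from the emitted value above, and the Status codes are taken from the map function.
db.messages.aggregate([
    { $group: {
        _id: { MessageId: "$MessageId", Date: "$Created" },
        Queued: { $sum: { $cond: [{ $eq: ["$Status", 0] }, 1, 0] } },
        Sent:   { $sum: { $cond: [{ $eq: ["$Status", 1] }, 1, 0] } },
        Failed: { $sum: { $cond: [{ $eq: ["$Status", 2] }, 1, 0] } },
        Total:  { $sum: 1 }
    } }
]);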
Related
I have a collection of millions of docs as follows:
{
    customerId: "12345", // string of numbers
    foo: "xyz"
}
I want to read every document in the collection and use the data in each for a large batch job. Each customer is independent, but 1 customer may have multiple docs which must be processed together.
I would like to split the work into N separate queries, i.e. N tasks (which can be spread over M clients if N > M).
How can each query efficiently cover a different, mutually exclusive and contiguous set of customers?
One way might be for task 1 to query all customers whose ids start with "1", task 2 to query all docs for customers whose ids start with "2", and so on, giving N=10, which is spreadable over up to 10 clients. I'm not sure whether querying by substring is fast, though. Is there a better method?
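As a data point, a regular expression anchored to the start of the string can use an index on customerId, so the prefix idea need not scan the whole collection. A sketch, assuming such an index exists or is created first:
// index so anchored-prefix queries stay fast (assumption: not already present)
db.collection.createIndex({ customerId: 1 });
// task 1: all docs for customers whose ids start with "1"
db.collection.find({ customerId: /^1/ });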
You may use $skip / $limit operators to split your data into separate queries.
Pseudocode
I assume the MongoDB driver automatically generates an ObjectId for the _id field.
var N = 10; // batch size per task
var M = db.collection.count({});
// Calculate how many tasks we should execute
var tasks = Math.floor(M / N) + (M % N > 0 ? 1 : 0);
// Iterate over the tasks to get a fixed amount of data for each job
for (var i = 0; i < tasks; i++) {
    var batch = db.collection.aggregate([
        { $sort : { _id : 1 } },
        { $skip : i * N },
        { $limit : N },
        // Optionally add a $lookup stage here to pull in the related "multiple docs"
    ]).toArray();
    // i=0   -> docs 0 - 9
    // i=1   -> docs 10 - 19
    // i=2   -> docs 20 - 29
    ...
    // i=100 -> docs 1000 - 1009
    // Note: if fewer than N documents remain, MongoDB simply returns whatever is left (0 to N records)
    // Process batch here
}
Traceability
How can you know whether a job finished, or where it got stuck?
Add extra fields once you finish job execution:
jobId - so you know which task processed this data
startDate - when data processing started
endDate - when data processing finished
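A minimal sketch of stamping those fields once a batch completes. Here batchIds and batchStartedAt are hypothetical variables (the _id values the task processed and the time processing began), and jobId reuses the loop index i from the pseudocode above:
db.collection.updateMany(
    { _id: { $in: batchIds } },  // the documents this task handled (assumption)
    { $set: { jobId: i, startDate: batchStartedAt, endDate: new Date() } }
);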
Let's assume I have a collection of 100,000 entries.
What is the approach to fetch only 50 documents at a time rather than all 100,000? Pulling the whole dataset in one call is wasteful.
My Dataset is kind of this type:
{
    "_id" : ObjectId("5a2e282417d0b91708fa83b5"),
    "post" : "Hello world",
    "createdate" : ISODate("2017-12-11T06:39:32.035Z"),
    "__v" : 0
}
What do I have to add to my query?
// What filter do I have to add?
db.collection.find({}).sort({'createdate': 1}).exec(function(err, data){
console.log(data);
});
db.collection.find({}).sort({'createdate': 1}).skip(0).limit(50).exec(function(err, data){
console.log(data);
});
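If createdate is not already indexed, adding an index keeps the sort from scanning the whole collection as the offset grows (a small sketch in the mongo shell):
// ascending index to back sort({'createdate': 1}) above
db.collection.createIndex({ createdate: 1 });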
There are two more ways to do pagination:
one is the mongoose-paginate npm module, link: https://www.npmjs.com/package/mongoose-paginate
second is the aggregation pipeline with the $skip and $limit stages
e.g.:
// from 1 to 50 records
db.col.aggregate([{$match:{}},{$sort:{_id:-1}},{$skip:0},{$limit:50}]);
// from 51 to 100 records
db.col.aggregate([{$match:{}},{$sort:{_id:-1}},{$skip:50},{$limit:50}]);
First, we have to sort the data and then apply skip and limit.
db.collection.aggregate([{ $sort: { f2: -1 } }, { $skip: 5 }, { $limit: 2 }]);
Using limit with find:
db.collection.find().limit(3)
Using limit with aggregate:
db.collection.aggregate([{ $limit: 2 }])
Usually aggregate is used when we need the pipelined output, for example when we need limit and sort together:
// sorting happens only on the pipelined output from limit.
db.collection.aggregate([{ $limit: 50 }, { $sort: { _id: -1 } }]);
// cursor methods - sorting happens on the entire result set even though .sort() comes last.
db.collection.find().limit(50).sort({ _id: -1 });
The same with skip added to get an offset:
db.collection.aggregate([{ $skip: 50 }, { $limit: 50 }, { $sort: { _id: -1 } }]);
db.collection.find().skip(50).limit(50).sort({ _id: -1 });
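Putting the pieces together, a page number can be turned into a skip offset like this (a sketch; pageNumber and pageSize are placeholder variables, and the sort reuses createdate from the question):
var pageSize = 50;
var pageNumber = 3; // 1-based page index
// page 3 = documents 101-150 in createdate order
db.collection.find({})
    .sort({ createdate: 1 })
    .skip((pageNumber - 1) * pageSize)
    .limit(pageSize);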
I have a collection foo:
{ "_id" : ObjectId("5837199bcabfd020514c0bae"), "x" : 1 }
{ "_id" : ObjectId("583719a1cabfd020514c0baf"), "x" : 3 }
{ "_id" : ObjectId("583719a6cabfd020514c0bb0") }
I use this query:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:1}}})
Then I get a result:
{ "_id" : 1, "avg" : 2, "sum" : 3 }
What does {$sum:1} mean in this query?
From the official docs:
When used in the $group stage, $sum has the following syntax and returns the collective sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key:
{ $sum: <expression> }
Since in your example the expression is 1, it will aggregate a value of one for each document in the group, thus yielding the total number of documents per group.
Basically it adds up the value of the expression for each document. In this case, since there are 3 documents in the group, it gives 1+1+1 = 3. For more details please check the MongoDB documentation: https://docs.mongodb.com/v3.2/reference/operator/aggregation/sum/
For example, if the query was:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$sum:"$x"}}})
then the sum value would be 1+3 = 4 (the document without an "x" field contributes nothing to $sum).
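To see the "per group" part of the definition, you could group on an actual field instead of the constant _id: 1 (a sketch; the status field is hypothetical):
// counts how many documents share each distinct status value
db.foo.aggregate([{ $group: { _id: "$status", total: { $sum: 1 } } }])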
I'm not sure what MongoDB version was around 6 years ago or whether it had all these goodies, but it seems to stand to reason that {$sum:1} is nothing but a hack for {$count:{}}.
In fact, $sum here is more expensive than $count, as it is performed as an extra step, whereas $count is closer to the engine. And even if you don't put much stock in performance, think of why you're even asking: because that is a less-than-obvious hack.
My choice would be:
db.foo.aggregate({$group:{_id:1, avg:{$avg:"$x"}, sum:{$count:{}}}})
I just tried this on Mongo 5.0.14 and it runs fine.
The good old "Just because you can, doesn't mean you should." is still a thing, no?
I am using MongoDB 2.4.10. I have a collection of four million records and a query that creates a subset of no more than 50,000 even for our power users. I need to select 30 random items from this subset, and, given the potential performance issues with skip and limit (especially when doing it 30 times with random skip amounts from 1-50,000), I stumbled across the following solution:
Create a field for each record which is a completely random number
Create an index over this field
Sort by the field, and use skip(X).limit(30) to get a page of 30 items that, while consecutive in terms of the random field, actually bear no relation to each other. To the user, they seem random.
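For reference, the backfill of that random field plus the supporting indexes might look roughly like this (a sketch only, using 2.4-era shell helpers; the field and index names mirror the ones shown below):
// one-time backfill: give every record a random value in d (assumes d is not yet set)
db.content.find({ d: { $exists: false } }).forEach(function (doc) {
    db.content.update({ _id: doc._id }, { $set: { d: Math.random() } });
});
// compound index ending in the random field, plus the standalone one
db.content.ensureIndex({ a: 1, b: 1, c: 1, d: 1 });
db.content.ensureIndex({ d: 1 });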
My index looks like this:
{a: 1, b: 1, c: 1, d: 1}
I also have a separate index:
{d : 1}
'd' is the randomised field.
My query looks like this:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.sort({d : 1}).skip(X).limit(30)
When the collection is small, this works perfectly. However, on our performance and live systems, this query fails, because instead of using the a, b, c, d index, it uses this index only:
{d : 1}
As a result, the query ends up scanning more records than it needs to (by a factor of 25), so I introduced a hint:
db.content.find({a : {$in : ["xyz", "abc"]}, b : "ok", c : "Image"})
.hint({a : 1, b : 1, c : 1, d : 1}).sort({d : 1}).skip(X).limit(30)
This now works great with all values of X up to 11000, and explain() shows the correct index in use. But, when the skip amount exceeds 11000, I get:
{
    "$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
    "code" : 10128
}
Presumably, the risk of hitting this error is why the query (without the hint) wasn't using this index earlier. So:
Why does Mongo think that the sort has no index to use, when I've forced it to use an index that explicitly includes the sorting field at the end?
Is there a better way of doing this?
I have a MongoDB collection with over 5 million items. Each item has "start" and "end" fields containing integer values.
Items don't have overlapping starts and ends.
e.g. this would be invalid:
{start:100, end:200}
{start:150, end:250}
I am trying to locate an item where a given value is between start and end
start <= VALUE <= end
The following query works, but it takes 5 to 15 seconds to return
db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1);
I've added the following indexes for testing, with very little improvement:
db.blocks.ensureIndex({start:1});
db.blocks.ensureIndex({end:1});
// also a compound one
db.blocks.ensureIndex({start:1,end:1});
Edit:
Running explain() on the query gives:
> db.blocks.find({ "start" : { $lt : 3232235521 }, "end" :{ $gt : 3232235521 }}).limit(1).explain();
{
    "cursor" : "BtreeCursor end_1",
    "nscanned" : 1160982,
    "nscannedObjects" : 1160982,
    "n" : 0,
    "millis" : 5779,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "end" : [
            [
                3232235521,
                1.7976931348623157e+308
            ]
        ]
    }
}
What would be the best approach to speeding this specific query up?
Actually I'm working on a similar problem and my friend found a nice way to solve it.
If you don't have overlapping data, you can do this:
query using the start field and sort
validate with the end field
For example you can do:
var x = 100;
var results = db.collection.find({start: {$lte: x}}).sort({start: -1}).limit(1).toArray();
if (results.length > 0) {
    var result = results[0];
    if (result.end > x) {
        return result;
    } else {
        return null; // no range contains x
    }
}
return null; // nothing starts at or before x
If you are sure that there will always be a range containing x, then you do not have to validate the result.
By using this piece of code, you only have to index a single field (start here, or end if you mirror the query), and your query becomes a lot faster.
--- edit
I did some benchmarking: using the composite index takes 100-100,000 ms per query; using the single index, on the other hand, takes 1-5 ms per query.
I guess a compound index should work faster for you:
db.blocks.ensureIndex({start:1, end:1});
You can also use explain to see the number of scanned objects, etc., and choose the best index.
Also, if you are using MongoDB < 2.0 you should upgrade to 2.0+, because indexes work faster there.
You can also limit results to optimize the query.
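A sketch of comparing indexes that way, reusing hint and explain from the question (the numbers to watch are nscanned and millis):
// force the compound index and inspect how many objects are scanned
db.blocks.find({ "start" : { $lt : 3232235521 }, "end" : { $gt : 3232235521 } })
    .limit(1).hint({ start: 1, end: 1 }).explain();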
This might help: how about introducing some redundancy? If there is not a big variance in the lengths of the intervals, you can introduce a tag field for each record. This tag field is a single value or string that represents a large interval - say, tag 50,000 is used to tag all records whose intervals fall at least partially in the range 0-50,000, tag 100,000 covers all intervals in the range 50,000-100,000, and so on. Now you can index on the tag as primary and one of the endpoints of the record's range as secondary.
Records on the edge of a big interval would have more than one tag, so we are talking multikey indexes. In your query you would of course calculate the big-interval tag and use it in the query.
You would roughly want SQRT of the total records per tag - just a starting point for tests; then you can fine-tune the big-interval size.
Of course this would make writes a bit slower.
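A rough sketch of that tagging idea (the 50,000 bucket size, the tags field, and the tagsFor helper are all illustrative, not part of the original setup):
var BUCKET = 50000; // size of the "big interval" (illustrative)
// when writing a record, tag it with every bucket its interval touches
function tagsFor(start, end) {
    var tags = [];
    for (var t = Math.floor(start / BUCKET) * BUCKET; t <= end; t += BUCKET) {
        tags.push(t);
    }
    return tags;
}
db.blocks.insert({ start: 100, end: 200, tags: tagsFor(100, 200) });
// multikey index: tag as primary, one endpoint as secondary
db.blocks.ensureIndex({ tags: 1, start: 1 });
// query: compute the bucket that contains the value, then check the endpoints
var x = 3232235521;
db.blocks.find({ tags: Math.floor(x / BUCKET) * BUCKET,
                 start: { $lte: x }, end: { $gte: x } }).limit(1);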