Bucketing and counting for a histogram in MongoDB

I want to implement a histogram based on data stored in MongoDB, getting counts per bucket. The buckets must be created from a single input value, the number of groups; for example, groups = 4.
Consider multiple transactions running, where the time required to finish each transaction is stored as a field. I want to count transactions by the time they took to finish.
How can I use the aggregation framework or map-reduce to create these buckets?
Sample data:
{
  "transactions": {
    "149823": {
      "timerequired": 5
    },
    "168243": {
      "timerequired": 4
    },
    "168244": {
      "timerequired": 10
    },
    "168257": {
      "timerequired": 15
    },
    "168258": {
      "timerequired": 8
    },
    "timerequired": 18
  }
}
In the output I want to print the bucket ranges and the count of transactions that fall into each bucket.
Bucket   Count
0-5      2
5-10     2
10-15    1
15-20    1

Starting with MongoDB 3.4, the $bucket and $bucketAuto aggregation stages are available. They can easily solve your request:
db.transactions.aggregate([
  {
    $bucketAuto: {
      groupBy: "$timerequired",
      buckets: 4
    }
  }
])
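If you want the exact fixed ranges from the question (0-5, 5-10, 10-15, 15-20) rather than automatically balanced ones, $bucket takes explicit boundaries. A minimal sketch, assuming (like the $bucketAuto example above) that each transaction is stored as its own document with a top-level timerequired field; note that $bucket treats each lower bound as inclusive and each upper bound as exclusive, so a value of exactly 5 lands in the 5-10 bucket:
db.transactions.aggregate([
  {
    $bucket: {
      groupBy: "$timerequired",
      boundaries: [0, 5, 10, 15, 20],   // [0,5), [5,10), [10,15), [15,20)
      default: "other",                 // catch-all for out-of-range values
      output: { count: { $sum: 1 } }
    }
  }
])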

Related

Write efficiency difference between Update, Set, and Set with merge

Cloud Firestore has this limit:
"Maximum writes per second per database of 10,000 (up to 10 MiB per second)"
https://firebase.google.com/docs/firestore/quotas
I was wondering how the 10 MiB per second limit works.
Let's say I have a document doc that is 100 KB, and I run a transaction in which I only want to update 1 KB of it.
Using update (passing the entire object)
doc.foo.bar = calculateNewBar(doc.foo.bar)
t.update(docRef, doc);
Using update (targeting specific field)
t.update(docRef, {
  "foo.bar": calculateNewBar(doc.foo.bar),
});
Using set (passing the entire object)
doc.foo.bar = calculateNewBar(doc.foo.bar)
t.set(docRef, doc);
Using set (with merge)
t.set(docRef, {
  foo: { bar: calculateNewBar(doc.foo.bar) },
}, {
  merge: true,
});
Which way would be the most efficient?
Does Cloud Firestore smartly calculate the diff in all 4 cases above and only write 1 KB?
Or would 1 and 3 write the entire 100 KB, and 2 and 4 write 1 KB?

Mongo - split 1 query into N queries

I have a collection of millions of docs as follows:
{
  customerId: "12345", // string of numbers
  foo: "xyz"
}
I want to read every document in the collection and use the data in each for a large batch job. Each customer is independent, but 1 customer may have multiple docs which must be processed together.
I would like to split the work into N separate queries i.e. N tasks (that can be spread over M clients if N > M).
How can each query efficiently cover a different, mutually exclusive, and adjoining set of customers?
One way might be for task 1 to query all docs for customers whose ids start with "1", task 2 to query all docs for customers whose ids start with "2", and so on, giving N=10, which can be spread over up to 10 clients. I'm not sure whether querying by substring is fast, though. Is there a better method?
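(For reference, the prefix query described here might look like the following; an anchored, case-sensitive regex such as /^1/ is a prefix match, so it can use an index on customerId. The collection and field names are taken from the question.)
// Index on customerId so each prefix query becomes an index range scan
db.collection.createIndex({ customerId: 1 })
// Task 1: all docs for customers whose ids start with "1"
db.collection.find({ customerId: /^1/ })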
You may use the $skip / $limit operators to split your data into separate queries.
Pseudocode
I assume the MongoDB driver automatically generates an ObjectId for the _id field.
var N = 10; // documents per task
var M = db.collection.count({});
// Calculate how many tasks we should execute
var tasks = Math.ceil(M / N);
// Iterate over the tasks, fetching a fixed amount of data for each job
for (var i = 0; i < tasks; i++) {
  var batch = db.collection.aggregate([
    { $sort: { _id: 1 } },  // stable order, so batches never overlap
    { $skip: i * N },       // skip the batches already handled
    { $limit: N },
    // Use $lookup here to pull in each customer's "multiple docs"
  ]).toArray();
  // i=0 -> docs 0-9
  // i=1 -> docs 10-19
  // i=2 -> docs 20-29
  // ...
  // Note: if fewer than N documents remain, MongoDB returns whatever is left (0 to N).
  // Process batch here
}
Traceability
How do you know whether a job finished, and where it got stuck?
Add extra fields once you finish a job's execution (see the sketch below):
jobId - identifies which task processed this data
startDate - when data processing started
endDate - when data processing finished
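A minimal sketch of stamping those fields, assuming batchIds holds the _id values of the documents in the current batch and jobId identifies the running task (both names are hypothetical):
// Mark the batch as picked up by this job
db.collection.updateMany(
  { _id: { $in: batchIds } },
  { $set: { jobId: jobId, startDate: new Date() } }
)
// ... process the batch ...
// Mark the batch as finished
db.collection.updateMany(
  { _id: { $in: batchIds } },
  { $set: { endDate: new Date() } }
)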

Aggregation using $sample

With an aggregation using { $sample: { size: 3 } }, I'll get 3 random documents returned.
How can I use a percentage of all documents instead?
Something that'd look like { $sample: { size: 50% } }?
You cannot do that, as the size argument to $sample must be a positive integer.
If you still need to use $sample, you can get the total count of documents in the collection, take half of it, and then run $sample:
1) Count the documents in the collection and take half (mongo shell):
var halfOfDocumentsCount = Math.floor(db.yourCollectionName.count() / 2);
print(halfOfDocumentsCount); // Replace with console.log() in driver code
2) $sample for random documents:
db.yourCollectionName.aggregate([ { $sample: { size: halfOfDocumentsCount } } ])
Note:
If you want half of the documents in the collection (i.e. 50% of them), $sample might not be a good option: it can become an inefficient query, and the result of $sample can contain duplicate documents (so you might not actually get a unique 50% of the collection). Read more about it here: $sample
If someone is looking for this solution in PHP, just add this at the end of the aggregate pipeline as required (i.e. before the projection), and avoid using limit and sort:
[
    '$sample' => [
        'size' => 30
    ]
]
Starting in Mongo 4.4, you can use the $sampleRate operator:
// { x: 1 }
// { x: 2 }
// { x: 3 }
// { x: 4 }
// { x: 5 }
// { x: 6 }
db.collection.aggregate([ { $match: { $sampleRate: 0.33 } } ])
// { x: 3 }
// { x: 5 }
This matches a random selection of input documents (33%). The number of documents selected approximates the sample rate expressed as a percentage of the total number of documents.
Note that this is equivalent to drawing a random number between 0 and 1 for each document and keeping the document if that value is below 0.33. As a result, you may get more or fewer documents in the output, and running this several times won't necessarily give you the same output.
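That equivalence can be written out explicitly with the $rand expression (available from MongoDB 4.4.2); this is only an illustration of what $sampleRate does, not a replacement for it:
db.collection.aggregate([
  // Keep a document iff a fresh random number in [0, 1) is below 0.33
  { $match: { $expr: { $lt: [{ $rand: {} }, 0.33] } } }
])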

MongoDB aggregation over a range

I have documents of the following format:
[
  { date: "2014-07-07", value: 20 },
  { date: "2014-07-08", value: 29 },
  { date: "2014-07-09", value: 24 },
  { date: "2014-07-10", value: 21 }
]
I want to run an aggregation query that gives me results in date ranges, for example:
[
  { sum: 49 },
  { sum: 45 }
]
These are daily values; I need to know the sum of the value field for the last 7 days, and for the 7 days before that - for example, the sum from May 1 to May 7 and then the sum from May 8 to May 14.
Can I use aggregation with multiple groups and ranges to get this result in a single MongoDB query?
You can use aggregation to group by anything that can be computed from the source documents, as long as you know exactly what you want to do.
Based on your document content and sample output, I'm guessing that you are summing over two-day intervals. Here is how you would write the aggregation to produce this output on your sample data:
var range1 = { $and: [{ $gte: ["$date", "2014-07-07"] }, { $lte: ["$date", "2014-07-08"] }] };
var range2 = { $and: [{ $gte: ["$date", "2014-07-09"] }, { $lte: ["$date", "2014-07-10"] }] };
db.range.aggregate([
  { $project: {
      dateRange: { $cond: { if: range1, then: "dateRange1", else: { $cond: { if: range2, then: "dateRange2", else: "NotInRange" } } } },
      value: 1
  } },
  { $group: { _id: "$dateRange", sum: { $sum: "$value" } } }
])
{ "_id" : "dateRange2", "sum" : 45 }
{ "_id" : "dateRange1", "sum" : 49 }
Substitute your own dates for the strings in range1 and range2. Optionally, you can filter first so you only operate on documents that fall inside the full ranges you are aggregating over, as sketched below.
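A sketch of that optional pre-filter, reusing range1 and range2 from above; the $match bounds are simply the union of the two ranges:
db.range.aggregate([
  // Drop documents outside both ranges before projecting
  { $match: { date: { $gte: "2014-07-07", $lte: "2014-07-10" } } },
  { $project: {
      dateRange: { $cond: { if: range1, then: "dateRange1", else: { $cond: { if: range2, then: "dateRange2", else: "NotInRange" } } } },
      value: 1
  } },
  { $group: { _id: "$dateRange", sum: { $sum: "$value" } } }
])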

mongodb complex map/reduce - or so I think

I have a MongoDB collection that contains every sale; documents look like this:
{
  _id: '999',
  buyer: { city: 'Dallas', state: 'Texas', ... },
  products: { ... },
  order_value: 1000,
  date: "2011-11-23T11:34:33Z"
}
I need to show stats about order volumes, by state, for the last 30, 60, and 90 days, to get something like this:
State     Last 30   Last 60   Last 90
Arizona   12000     22000     35000
Texas      5000      9000     16000
How would you do this in a single query?
That's not very difficult:
map = function() {
  // emit takes (key, value) as two arguments, not a single object
  emit(this.buyer.state, this.order_value);
};

reduce = function(key, values) {
  var sum = 0;
  values.forEach(function(o) {
    sum += o;
  });
  return sum;
};
Then you map-reduce your collection with the query { date: { $gt: [today minus 30 days] } } (I don't remember the exact syntax, but see the excellent map-reduce docs on the MongoDB site).
To make more efficient use of map-reduce, think in terms of incremental map-reduce: query the last 30 days first, then map-reduce again (incrementally), filtering -60 to -30 days, to get the last 60 days; finally, run an incremental map-reduce filtering -90 to -60 days to get the last 90 days.
This is not bad: you have three queries, but you only recompute the aggregation on data you haven't processed yet.
You should be able to do it by yourself now; a minimal sketch of the general shape of the call follows.
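This sketch assumes the collection is named "sales" and that dates are stored as ISO-8601 strings, as in the sample document (both are assumptions):
// ISO-8601 strings compare lexicographically in date order
var thirtyDaysAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString();
db.sales.mapReduce(map, reduce, {
  query: { date: { $gt: thirtyDaysAgo } },
  // out: { reduce: ... } merges new results into the output collection
  // through the reduce function - this is what makes incremental runs work
  out: { reduce: "order_volume_by_state" }
})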