mongodb complex map/reduce - or so I think

I have a mongodb collection that contains every sale and looks like this:
{
    _id: '999',
    buyer: { city: 'Dallas', state: 'Texas', ... },
    products: { ... },
    order_value: 1000,
    date: "2011-11-23T11:34:33Z"
}
I need to show stats about order volumes, by state, for the last 30, 60, and 90 days.
So, I want to get something like this:
State     Last 30   Last 60   Last 90
Arizona   12000     22000     35000
Texas     5000      9000      16000
How would you do this in a single query?

That's not very difficult:
map = function() {
    emit(this.buyer.state, this.order_value);
}
reduce = function(key, values) {
    var sum = 0;
    values.forEach(function(o) {
        sum += o;
    });
    return sum;
}
and then you map-reduce your collection with the query {date: {$gt: [today minus 30 days]}}
(I don't remember the exact syntax, but you should read the excellent map-reduce documentation on the MongoDB site).
To make more efficient use of map-reduce, think in terms of incremental map-reduce: query the last 30 days first, then map-reduce again (incrementally) filtering -60 to -30 days to get the last 60 days. Finally, run an incremental map-reduce filtering -90 to -60 days to get the last 90 days.
This is not bad: you run three queries, but you only recompute the aggregation on data you haven't already processed.
I can provide an example, but you should be able to do it by yourself now.
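For instance, here is a minimal sketch of the first of those runs, assuming a hypothetical collection named "sales", a hypothetical output collection named "sales_by_state", and dates stored as ISO strings as in the question:
// Sketch: map-reduce the last 30 days only. map and reduce are the functions
// defined above; the cutoff is an ISO string because the documents store dates as strings.
var cutoff = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString();
db.sales.mapReduce(map, reduce, {
    query: { date: { $gt: cutoff } },
    out: { reduce: "sales_by_state" }  // "reduce" output merges incremental runs into the existing collection
});
The later runs would use the same call with the -60/-30 and -90/-60 day windows in the query.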

Related

Mongo - split 1 query into N queries

I have a collection of millions of docs as follows:
{
    customerId: "12345", // string of numbers
    foo: "xyz"
}
I want to read every document in the collection and use the data in each for a large batch job. Each customer is independent, but 1 customer may have multiple docs which must be processed together.
I would like to split the work into N separate queries i.e. N tasks (that can be spread over M clients if N > M).
How can each query efficiently cover a different, mutually exclusive, adjoining set of customers?
One way might be for task 1 to query all docs for customers whose ids start with "1", task 2 to query all docs for customers whose ids start with "2", etc., giving N=10, which can be spread over up to 10 clients. I'm not sure whether querying by substring is fast though. Is there a better method?
You may use $skip / $limit operators to split your data into separate queries.
Pseudocode
I assume the MongoDB driver automatically generates an ObjectId for the _id field.
var N = 10;                      // batch size
var M = db.collection.count({});
// Calculate how many tasks we need to run
var tasks = Math.ceil(M / N);
// Iterate over the tasks, fetching a fixed amount of data for each job
for (var i = 0; i < tasks; i++) {
    var batch = db.collection.aggregate([
        { $sort: { _id: 1 } },
        { $skip: i * N },
        { $limit: N }
        // Optionally add $lookup / $group here to pull a customer's "multiple docs" together
    ]).toArray();
    // i = 0   -> documents 0 - 9
    // i = 1   -> documents 10 - 19
    // i = 2   -> documents 20 - 29
    // ...
    // i = 100 -> documents 1000 - 1009
    // Note: if fewer than N documents remain, MongoDB simply returns what is left (0 ... N)
    // Process batch here
}
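If $skip becomes slow on millions of documents (it still has to walk past the skipped ones), a possible alternative, which is my suggestion rather than part of the answer above, is to page by _id range so each batch resumes where the previous one stopped:
// Sketch: range pagination on _id instead of $skip (assumes the default ObjectId _id).
var lastId = null;
while (true) {
    var filter = lastId ? { _id: { $gt: lastId } } : {};
    var batch = db.collection.find(filter).sort({ _id: 1 }).limit(N).toArray();
    if (batch.length === 0) break;
    // Process batch here
    lastId = batch[batch.length - 1]._id;
}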
Traceability
How do you know whether a job finished, or where it got stuck?
Add extra fields once a job finishes executing:
jobId - which task processed this data
startDate - when the data processing started
endDate - when the data processing finished
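For example, a minimal sketch of tagging a processed batch (same placeholder collection as above; batchStartedAt is a hypothetical variable captured before processing):
// Sketch: stamp traceability fields onto every document of the batch just processed.
var batchStartedAt = new Date();   // capture before processing the batch
// ... process batch here ...
var ids = batch.map(function(doc) { return doc._id; });
db.collection.updateMany(
    { _id: { $in: ids } },
    { $set: { jobId: i, startDate: batchStartedAt, endDate: new Date() } }
);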

Mongo $geoNear query - incorrect nscanned number and incorrect results

I have a collection of around 6k documents with a 2dsphere index on the location field, example below:
"location" : {
"type" : "Point",
"coordinates" : [
138.576187,
-35.010441
]
}
When using the query below I only get around 450 docs returned, with nscanned around 3k. Every document has a location, and many locations are duplicated. Distances returned for GeoJSON points are in meters, and a distance multiplier of 0.000625 converts the distances to miles. As a test, I'm expecting a max distance of 32180000000000 to return every document on the planet, i.e. all 6000.
db.x.aggregate([
    { "$geoNear": {
        "near": {
            "type": "Point",
            "coordinates": [-0.3658702, 51.45686]
        },
        "distanceField": "distance",
        "limit": 100000,
        "distanceMultiplier": 0.000625,
        "maxDistance": 32180000000000,
        "spherical": true
    }}
])
Why don't I get 6000 documents returned? I'm unable to find the logic behind this behaviour in Mongo. I found this on the Mongo forums:
"geoNear's major limitation is that as a command it can return a result set up to the maximum document size as all of the matched documents are returned in a single result document."
I'm pretty sure that MongoDB has a limit of 16 MB on the results of $geoNear. In https://github.com/mongodb/mongo/blob/master/src/mongo/db/commands/geo_near_cmd.cpp you can see that while the results of the geoNear are being built, there's this condition:
// Don't make a too-big result object.
if (resultBuilder.len() + resObj.objsize() > BSONObjMaxUserSize) {
    warning() << "Too many geoNear results for query " << rewritten.toString()
              << ", truncating output.";
    break;
}
And in https://github.com/mongodb/mongo/blob/master/src/mongo/bson/util/builder.h you'll see it's limited to 16 MB.
const int BSONObjMaxUserSize = 16 * 1024 * 1024;
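If you really do need all ~6000 documents back, one possible workaround (my suggestion, not something the quoted source describes) is to page through the matches in distance bands with minDistance / maxDistance, so each call stays well under the 16 MB cap:
// Sketch: fetch matches band by band instead of all at once. The band width is a
// hypothetical value; the maximum surface distance on Earth is roughly 20,000 km.
var band = 5000000;        // 5,000 km per band, in meters
var earthMax = 21000000;   // a bit over half the Earth's circumference, in meters
for (var lower = 0; lower < earthMax; lower += band) {
    var docs = db.x.aggregate([
        { "$geoNear": {
            "near": { "type": "Point", "coordinates": [-0.3658702, 51.45686] },
            "distanceField": "distance",
            "spherical": true,
            "limit": 100000,
            "minDistance": lower,
            "maxDistance": lower + band
        }}
    ]).toArray();
    // process docs here
}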

MongoDB aggregation over a range

I have documents of the following format:
[
{
date:"2014-07-07",
value: 20
},
{
date:"2014-07-08",
value: 29
},
{
date:"2014-07-09",
value: 24
},
{
date:"2014-07-10",
value: 21
}
]
I want to run an aggregation query that gives me results in date ranges, for example:
[
{ sum: 49 },
{ sum:45 },
]
So these are daily values; I need to know the sum of the value field for the last 7 days, and for the 7 days before that. For example, the sum from May 1 to May 6 and then the sum from May 7 to May 14.
Can I use aggregation with multiple groups and range to get this result in a single mongodb query?
You can use aggregation to group by anything that can be computed from the source documents, as long as you know exactly what you want to do.
Based on your document content and sample output, I'm guessing that you are summing over two-day intervals. Here is how you would write the aggregation to output this for your sample data:
var range1 = { $and: [ { $gte: ["$date", "2014-07-07"] }, { $lte: ["$date", "2014-07-08"] } ] };
var range2 = { $and: [ { $gte: ["$date", "2014-07-09"] }, { $lte: ["$date", "2014-07-10"] } ] };
db.range.aggregate([
    { $project: {
        dateRange: { $cond: { if: range1, then: "dateRange1",
                    else: { $cond: { if: range2, then: "dateRange2", else: "NotInRange" } } } },
        value: 1
    }},
    { $group: { _id: "$dateRange", sum: { $sum: "$value" } } }
])
{ "_id" : "dateRange2", "sum" : 45 }
{ "_id" : "dateRange1", "sum" : 49 }
Substitute your own dates for the strings in range1 and range2; optionally, you can add a $match first so you only operate on documents that already fall inside the full ranges you are aggregating over.
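On MongoDB 3.4 or newer, a shorter way to express the same grouping (assuming the dates stay in this string format) is $bucket with one boundary per range; this is a sketch, not part of the original answer:
// Sketch: the same two 2-day windows as above, expressed as $bucket boundaries.
// Each bucket is half-open: [boundary, next boundary).
db.range.aggregate([
    { $bucket: {
        groupBy: "$date",
        boundaries: ["2014-07-07", "2014-07-09", "2014-07-11"],
        default: "NotInRange",
        output: { sum: { $sum: "$value" } }
    }}
])
With the sample data this produces the same 49 and 45 sums; for the real 7-day windows you would just widen the boundaries.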

search in limited number of records MongoDB

I want to search within the first 1000 records of my collection, which is named CityDB. I used the following code:
db.CityDB.find({'index.2':"London"}).limit(1000)
but it does not work: it returns the first 1000 matches, whereas I want to search only within the first 1000 records, not all of them. Could you please help me?
Thanks,
Amir
Note that there is no guarantee that your documents are returned in any particular order by a query, as long as you don't sort explicitly. Documents in a new collection are usually returned in insertion order, but various things can cause that order to change unexpectedly, so don't rely on it. By the way: auto-generated _ids start with a timestamp, so when you sort by _id, the objects are returned by creation date.
Now to your actual question. When you want to first limit the documents and then perform a filter operation on that limited set, you can use the aggregation pipeline. It allows you to use the $limit operator first and then the $match operator on the remaining documents.
db.CityDB.aggregate([
    // { $sort: { _id: 1 } }, // <- uncomment when you want the first 1000 by creation time
    { $limit: 1000 },
    { $match: { 'index.2': "London" } }
])
I can think of two ways to achieve this:
1) You keep a global counter, and every time you insert data into your collection you add a field count = currentCounter and increase currentCounter by 1. When you need to select your first k elements, you find them this way:
db.CityDB.find({
    'index.2': "London",
    count: { $gte: currentCounter - k }
})
This is not atomic and might sometimes give you more than k elements on a heavily loaded system (but it can be backed by indexes).
Here is another approach which works nice in the shell:
2) Create your dummy data:
var k = 100;
for (var i = 1; i < k; i++) {
    db.a.insert({
        _id: i,
        z: Math.floor(1 + Math.random() * 10)
    });
}
output = [];
Now find, within the first k records, those where z == 3:
k = 10;
db.a.find().sort({ $natural: -1 }).limit(k).forEach(function(el) {
    if (el.z == 3) {
        output.push(el);
    }
});
As you can see, output contains the matching elements:
output
I think it is pretty straightforward to modify my example for your needs.
P.S. Also take a look at the aggregation framework; there might be a way to achieve what you need with it.

optimizing hourly statistics retrieval with mongodb

I've collected about 10 million documents spanning a few weeks in my MongoDB database, and I want to be able to calculate some simple statistics and output them.
The statistic I'm trying to get is the average rating of the documents within a timespan, in one-hour intervals.
To give an idea of what I'm trying to do, here is some pseudocode:
var dateTimeStart;
var dateTimeEnd;
var distinctHoursBetweenDateTimes = getHours(dateTimeStart, dateTimeEnd);
var totalResult=[];
foreach( distinctHour in distinctHoursBetweenDateTimes )
tmpResult = mapreduce_getAverageRating( distinctHour, distinctHour +1 )
totalResult[distinctHour] = tmpResult;
return totalResult;
My document structure is something like:
{_id, rating, topic, created_at}
created_at is the date I'm basing my statistics on (time of insertion and time of creation are not always the same).
I've created an index on the created_at field.
The following is my mapreduce:
map = function () {
    emit(this.Topic, { total: this.Rating, num: 1 });
};
reduce = function (key, values) {
    var n = { total: 0, num: 0 };
    for (var i = 0; i < values.length; i++) {
        n.total += values[i].total;
        n.num += values[i].num;
    }
    return n;
};
finalize = function (key, res) {
    res.avg = res.total / res.num;
    return res;
};
I'm pretty sure this can be done more effectively - possibly by letting mongo do more work, instead of running several map-reduce statements in a row.
At this point each map-reduce takes about 20-25 seconds so counting statistics for all the hours over a few days suddenly takes up a very long time.
My impression is that mongo should be suited for this kind of work - hence I must obviously be doing something wrong.
Thanks for your help!
I assume the time is part of the documents you are map-reducing? If you run the MapReduce over all documents, determine the hour in the map function, and add it to the key you emit, you can do all of this in a single MapReduce.
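A minimal sketch of that single MapReduce, assuming the lowercase field names rating, topic and created_at from the document structure above, and a hypothetical collection name "ratings":
map = function () {
    // Truncate created_at to the hour and make the hour part of the emitted key
    var hour = new Date(this.created_at);
    hour.setMinutes(0, 0, 0);
    emit({ topic: this.topic, hour: hour }, { total: this.rating, num: 1 });
};
reduce = function (key, values) {
    var n = { total: 0, num: 0 };
    values.forEach(function (v) {
        n.total += v.total;
        n.num += v.num;
    });
    return n;
};
finalize = function (key, res) {
    res.avg = res.total / res.num;
    return res;
};
// One pass over the whole timespan instead of one MapReduce per hour;
// dateTimeStart / dateTimeEnd are the bounds from the pseudocode in the question.
db.ratings.mapReduce(map, reduce, {
    finalize: finalize,
    query: { created_at: { $gte: dateTimeStart, $lt: dateTimeEnd } },
    out: { inline: 1 }
});
The output then has one entry per (topic, hour) pair, so collecting the hourly averages is a single pass over the result instead of one 20-25 second MapReduce per hour.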