My objects are of the following structure:
{id: 1234, ownerId: 1, typeId: 3456, date:...}
{id: 1235, ownerId: 1, typeId: 3456, date:...}
{id: 1236, ownerId: 1, typeId: 12, date:...}
I would like to query the database so that it returns all the items that belong to a given ownerId, but only the first item of a given typeId, i.e. the typeId field is unique in the results. I would also like to be able to use skip and limit.
In SQL the query would be something like:
SELECT * FROM table WHERE ownerId=1 GROUP BY typeId ORDER BY date LIMIT 10 OFFSET 300
I currently have the following query (using pymongo), but it is giving me errors for using $sort, $limit and $skip:
search_dict['ownerId'] = 1
search_dict['$sort'] = {'date': -1}
search_dict['$limit'] = 10
search_dict['$skip'] = 200
collectionName.group(['typeId'], search_dict, {'list': []}, 'function(obj, prev) {prev.list.push(obj)}')
I have also tried the aggregation route, but as I understand it, grouping will touch all the items in the collection, group them, and only then limit and skip. This will be too computationally expensive and slow. I need an iterative grouping algorithm.
search_dict = {'ownerId': 1}
collectionName.aggregate([
    {'$match': search_dict},
    {'$sort': {'date': -1}},
    {'$group': {'_id': '$typeId'}},
    {'$skip': skip},
    {'$limit': 10}
])
Your aggregation looks correct. You need to include the fields you want in the output in the $group stage using $first.
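For example, a sketch of that $group stage (shown in mongo shell syntax; the same pipeline works as Python dicts in pymongo, with collectionName standing in for your collection, and the extra output fields chosen from your sample documents):

db.collectionName.aggregate([
    { $match: { ownerId: 1 } },
    { $sort: { date: -1 } },
    // One result per typeId; $first picks the value from the first
    // document in each group, i.e. the newest one after the sort.
    { $group: {
        _id:  "$typeId",
        id:   { $first: "$id" },
        date: { $first: "$date" }
    } },
    { $skip: 200 },
    { $limit: 10 }
])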
grouping will touch all the items in the collection, group them, and then limit and skip. This will be too computationally expensive and slow.
It won't touch all items in the collection. If the match + sort is indexed ({ "ownerId" : 1, "date" : -1 }), the index will be used for the match + sort, and the group will only process the documents that are the result of the match.
The constraint is hardly ever CPU, except in cases of an unindexed sort. It's usually disk I/O.
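For example, such a compound index could be created like this (a sketch, with collectionName standing in for your collection):

db.collectionName.createIndex({ ownerId: 1, date: -1 })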
I need an iterative grouping algorithm.
What precisely do you mean by "iterative grouping"? The grouping is iterative, as it iterates over the result of the previous stage and checks which group each document belongs to!
I am not too sure how you got the idea that this operation should be computationally expensive. This isn't really true for most SQL databases, and it surely isn't for MongoDB. All you need is to create an index over your sort criterion.
Here is how to prove it. Open up a mongo shell and execute the following:
var bulk = db.speed.initializeOrderedBulkOp();
for (var i = 1; i <= 100000; i++) {
    bulk.insert({field1: i, field2: i * i, date: new ISODate()});
    if ((i % 100) == 0) { print(i); }
}
bulk.execute();
The bulk execution may take some seconds. Next, we create a helper function:
Array.prototype.avg = function() {
    var av = 0;
    var cnt = 0;
    var len = this.length;
    for (var i = 0; i < len; i++) {
        var e = +this[i];
        // skip entries that are not numeric
        if (!e && this[i] !== 0 && this[i] !== '0') e--;
        if (this[i] == e) { av += e; cnt++; }
    }
    return av / cnt;
}
The troupe is ready, the stage is set:
var times = new Array();
for (var i = 0; i < 10000; i++) {
    var start = new Date();
    db.speed.find().sort({date: -1}).skip(Math.random() * 100000).limit(10);
    times.push(new Date() - start);
}
print(times.avg() + " msecs");
The output is in msecs. This is the output of 5 runs for comparison:
0.1697 msecs
0.1441 msecs
0.1397 msecs
0.1682 msecs
0.1843 msecs
The test server runs inside a Docker image, which in turn runs inside a VM (boot2docker), on my 2.13 GHz Intel Core 2 Duo with 4 GB of RAM, running OS X 10.10.2, with a lot of Safari windows, iTunes, Mail, Spotify and Eclipse open additionally. Not quite a production system. And that collection does not even have an index on the date field. With the index, the averages of 5 runs look like this:
0.1399 msecs
0.1431 msecs
0.1339 msecs
0.1441 msecs
0.1767 msecs
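(For reference, the index used for the second set of runs is not shown above; it would be created with something along these lines, a sketch only:)

db.speed.createIndex({ date: -1 })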
qed, hth.
Related
I want to calculate a moving average for my data in MongoDB. My data structure is as below:
{
"_id" : NUUID("54ab1171-9c72-57bc-ba20-0a06b4f858b3"),
"DateTime" : ISODate("2018-05-30T21:31:05.957Z"),
"Type" : 3,
"Value" : NumberDecimal("15.905414991993847")
}
I want to calculate the average of Value for each Type, over 2 days, in 5-second intervals. In this case I put Type in the $match stage, but I would prefer to group the results by Type and keep the results separated per Type. What I did is below:
var start = new Date("2018-05-30T21:31:05.957Z");
var end = new Date("2018-06-01T21:31:05.957Z");
var arr = new Array();
for (var i = 0; i < 34560; i++) {
    start.setSeconds(start.getSeconds() + 5);
    if (start <= end) {
        var a = new Date(start);
        arr.push(a);
    }
}
db.Data.aggregate([
    {$match: {"DateTime": {$gte: new Date("2018-05-30T21:31:05.957Z"),
                           $lte: new Date("2018-06-01T21:31:05.957Z")},
              "Type": 3}},
    {$bucket: {
        groupBy: "$DateTime",
        boundaries: arr,
        default: "Other",
        output: {
            "count": {$sum: 1},
            "Value": {$avg: "$Value"}
        }
    }}
])
It seems to work, but the performance is too slow. How can I make it faster?
I reproduced the behavior you describe with 2 days' worth of 1-second observations in the DB and a $match that pulls just one day's worth. The agg works "fine" if you bucket by, say, 60 seconds. But 15-second buckets took 6 times as long, about 30 seconds, and 5-second buckets took 144 seconds. 5 seconds yields an array of 17,280 buckets. Yep.
So I went client-side, dragged all 43200 docs to the client, and wrote a naive linear-search bucket slot finder and the averaging calc in JavaScript.
// arr is the boundary array built earlier, buckets is the output map,
// findSlot is the slot finder described above, and osv/endv are the
// start and end dates of the window.
c = db.foo.aggregate([
    {$match: {"date": {$gte: new Date(osv), $lte: new Date(endv)}}}
]);
c.forEach(function(r) {
    var x = findSlot(arr, r['date']);
    if (buckets[x] == undefined) {
        buckets[x] = {lb: arr[x], ub: arr[x + 1], n: 0, v: 0};
    }
    var zz = buckets[x];
    zz['n']++;
    zz['v'] += r['val'];
});
This actually worked somewhat faster, but in the same order of performance: about 92 seconds.
Next, I changed the linear search in findSlot to a bisection search. The 5-second bucketing went from 144 seconds to 0.750 seconds: almost 200x faster. This includes dragging the 43200 records to the client and running the forEach and bucketing logic above. So it stands to reason that $bucket may not be using a great algorithm and suffers when the bucket array is more than a couple hundred entries long.
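(The original does not show findSlot itself; a minimal sketch of the bisection version, assuming arr is the sorted array of bucket boundaries, might look like this:)

// Return the index i such that arr[i] <= d < arr[i+1], using binary search.
function findSlot(arr, d) {
    var lo = 0, hi = arr.length - 1;
    while (lo < hi) {
        var mid = Math.floor((lo + hi + 1) / 2);
        if (arr[mid] <= d) {
            lo = mid;      // d is at or after this boundary
        } else {
            hi = mid - 1;  // d is before this boundary
        }
    }
    return lo;
}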
Acknowledging this, we can instead make use of $floor of the delta between the start time and the observation time to bucket the data:
db.foo.aggregate([
{$match:{"date":{$gte:now, $lte:new Date(endv) }}}
// Bucket by turning offset from "now" into floor divided by the number
// of seconds of grouping. In this way, the resulting number becomes the
// slot into the virtual buckets, e.g.:
// date now diff/1000 floor # 5 seconds:
// 1514764800000 1514764800000 0 0
// 1514764802000 1514764800000 2 0
// 1514764804000 1514764800000 4 0
// 1514764806000 1514764800000 6 1
// 1514764808000 1514764800000 8 1
// 1514764810000 1514764800000 10 2
,{$addFields: {"ff": {$floor: {$divide: [ {$divide: [ {$subtract: [ "$date", now ]}, 1000.0 ]}, secondsBucket ] }} }}
// Now just group by the numeric slot number!
,{$group: {_id: "$ff", n: {$sum:1}, tot: {$sum: "$val"}, avg: {$avg: "$val"}} }
// Get it in 0-n order....
,{$sort: {_id: 1}}
]);
found 17280 in 204 millis
So we now have a server-side solution that takes just 0.204 seconds, or 700x faster. And you don't have to sort the input, because $group will take care of bundling the slot numbers, and the $sort after the $group is optional (but sort of handy...).
I have an index on id_profile and I do db.myCollection.count({"id_profile": xxx}). It's quite fast if the count is low, but if the count is large, it starts getting slow. For example, if there are 1,000,000 records matching {"id_profile": xxx}, it can take up to 500 ms to return the count. I think that internally the engine is simply loading all the documents matching {"id_profile": xxx} to count them.
Is there a way to quickly retrieve a count when the filter exactly matches an index? I would like to avoid using a counter collection :(
NOTE: I'm on MongoDB 3.6.3 and this is the script I used:
db.createCollection("following");
db.following.createIndex( {"id_profile": 1}, {unique: false} );
function randInt(n) { return parseInt(Math.random()*n); }
for(var j=0; j<10; j++) {
print("Building op "+j);
var bulkop=db.following.initializeOrderedBulkOp() ;
for (var i = 0; i < 1000000; ++i) {
bulkop.insert(
{
id_profile: NumberLong("-4578128619402503089"),
id_following: NumberLong(randInt(9223372036854775807))
}
)
};
print("Executing op "+j);
bulkop.execute();
}
db.following.count({"id_profile":NumberLong("-4578128619402503089")});
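(One way to see what the count is actually doing, assuming the collection and index above, is to look at its query plan; a COUNT_SCAN stage in the output means the count is answered from the index alone:)

db.following.explain("executionStats")
            .count({"id_profile": NumberLong("-4578128619402503089")});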
I need to group data into subgroups of a set size. For example, if there are 6 records, ordered by date:
[1,2,3,4,5,6]
and I have a subgroup size of 2, I would end up with an array (length 3) of arrays (each of length 2):
[[1,2],[3,4],[5,6]]
Nothing about the records factors into the grouping, just how they are ordered overall and the subgroup size.
Does the aggregation framework have something that would help with this?
The best way to currently do this is with mapReduce:
db.collection.mapReduce(
    function() {
        var result = [];
        var x = 0;
        // Step through the array two elements at a time.
        for ( x = 0; x + 2 <= this.array.length; x += 2 ) {
            result.push( this.array.slice( x, x + 2 ) );
        }
        // If the length is odd, push the remaining element as a final,
        // shorter subgroup.
        if ( x < this.array.length )
            result.push( this.array.slice( x ) );
        emit( this._id, result );
    },
    function(){},
    {
        "out": { "inline": 1 }
    }
);
Or basically something along those lines.
The aggregation framework does not handle slice-type operations well, but JavaScript does, especially in this case.
Hi, I'm a newbie to MongoDB and I need to translate this SQL query to MongoDB using two techniques: first with MapReduce, and then with the aggregation framework. Can someone help?
select
sum(l_extendedprice*l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1994-01-01'
and l_shipdate < date '1994-01-01' + interval '1' year
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
http://www.mongodb.org/display/DOCS/MapReduce
For your sample, using map/reduce
var m = function () { emit(1, this.l_extendedprice * this.l_discount); };
var r = function (k, vals) {
    var sum = 0;
    for (var i = 0; i < vals.length; i++) {
        sum += vals[i];
    }
    return sum;
}
var res = db.stuff.mapReduce(m, r, {
    out: "stuff_aggr",
    query: {
        // note: each field appears once; duplicate keys in an object
        // literal would silently overwrite the earlier condition
        "l_shipdate": {$gte: ISODate("1994-01-01T00:00:00.000Z"),
                       $lt:  ISODate("1995-01-01T00:00:00.000Z")},
        "l_discount": {$gte: 0.05, $lte: 0.07},
        "l_quantity": {$lt: 24}
    }
});
Aggregation is still a beta feature; MapReduce is still the better option. I am assuming you wanted to see whether a complex WHERE clause can be handled easily... It's not that different from SQL as long as you restrict yourself to one collection/table.
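(On later MongoDB versions, where the aggregation framework is no longer beta, the same query can be expressed as a pipeline; a sketch, assuming the collection is named lineitem as in the SQL:)

db.lineitem.aggregate([
    { $match: {
        l_shipdate: { $gte: ISODate("1994-01-01T00:00:00.000Z"),
                      $lt:  ISODate("1995-01-01T00:00:00.000Z") },
        l_discount: { $gte: 0.05, $lte: 0.07 },
        l_quantity: { $lt: 24 }
    }},
    // Single group: sum of extendedprice * discount over all matches.
    { $group: {
        _id: null,
        revenue: { $sum: { $multiply: ["$l_extendedprice", "$l_discount"] } }
    }}
])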
I just started with Mongo and am already having issues with querying. I have a collection called 'externalTransaction' and I want to write an equivalent of this MySQL query:
select transactionCode,
       sum(amount) as totalSum,
       count(amount) as totalCount
from externalTransaction
where transactionCode in ('aa','bb','cc')
group by transactionCode
below is my attempt:
{
    "collectionName": "externalTransaction",
    sort: {transactionCode: -1},
    query: {this._id: {$in: ['aa','bb','cc']}},
    mapReduce: {
        'map': 'function(){
            emit(this.transactionCode, this.amount);
        }',
        'reduce': 'function(key, values){
            var result = {count: 0, sum: 0.0};
            values.forEach(function(value) {
                result.count++;
                result.sum += value.amount;
            });
            return result;
        }',
        'out': 'sumAmount'
    }
}
The above query gives me a result set looking like this:
_id value.count value.sum
ct 2.0 NaN
bb 40.0 NaN
fg 71.0 NaN
fd 36.0 NaN
sd 5.0 NaN
as 4.0 NaN
aa 71.0 NaN
df 4.0 NaN
cc 10.0 NaN
From the documentation, on version 2.0.6 I can't use the aggregation framework just yet, so how do I handle simple queries like mine in Mongo? Thanks for reading and excuse the triviality of my question.
You have a few errors in your map and reduce functions. First, in map you emit a plain number, but in reduce you try to take the amount of a number; it doesn't have that property. Second, the outputs of map and reduce must be uniform, because reduce is supposed to be runnable over partially reduced results. Try these functions:
var map = function() {
    emit(this.transactionCode, {sum: this.amount, count: 1});
}

var reduce = function(k, vals) {
    var result = {sum: 0, count: 0};
    vals.forEach(function(v) {
        result.sum += v.sum;
        result.count += v.count;
    });
    return result;
}
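(A sketch of how you might run them, keeping the filter and output collection from your attempt:)

db.externalTransaction.mapReduce(map, reduce, {
    query: { transactionCode: { $in: ['aa', 'bb', 'cc'] } },
    out: 'sumAmount'
});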