How to store MapReduce results hierarchically in MongoDB - mongodb

I want to perform a map-reduce operation on some metric and store its result both aggregated and as a time series.
Storing the aggregated result seems very simple, but how can I store the result in a time-series fashion, i.e. so that whenever the map-reduce function runs, the value at that interval is also recorded in the result collection (i.e. time-series data)?
Let's say I have the following result out of my map-reduce aggregation:
> db.result.find()
{ "_id" : { "eventId" : 1}, "value" : { "sum" : 21 } }
{ "_id" : { "eventId" : 2}, "value" : { "sum" : 31 } }
I am able to achieve the above very easily with map-reduce.
I want the result to be stored as a time series as well, like below:
> db.result.find()
{ "_id" : { "eventId" : 1}, "value" : { "sum" : 21, "ts": {1: 15, 2: 4, 3: 2 } } }
{ "_id" : { "eventId" : 2}, "value" : { "sum" : 31, "ts": {1: 12, 2: 12, 3: 7 } } }
Now, whenever the map-reduce function runs, it should update the result collection.
I tried numerous ways to do so but was unable to succeed. Any idea how I can achieve this?
Also, if this could be done within the same map-reduce function call, that would be great.

The general recommendation for such time series data is to use pre-aggregated reports.
If that is not possible, first consider using the aggregation pipeline instead of map-reduce. It's faster and easier if your use case allows it.
With both the aggregation pipeline and map-reduce, you can use the results to create the desired document. $setOnInsert may be helpful.
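As an illustration, here is a minimal shell sketch of that follow-up step, assuming the per-run output lands in a temporary collection named tmp_result and that a variable interval labels the current run (both names are hypothetical):
// Hypothetical: 'interval' labels this run (e.g. 3) and 'tmp_result'
// holds the output of the latest map-reduce / aggregation pass.
var interval = 3;
db.tmp_result.find().forEach(function(doc) {
    var update = { $inc: { "value.sum": doc.value.sum }, $set: {} };
    update.$set["value.ts." + interval] = doc.value.sum;  // record this interval
    db.result.update(
        { "_id": doc._id },  // one result document per eventId
        update,
        { upsert: true }     // create the document on the first run
    );
});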

Related

Can sorting before grouping improve query performance in Mongo using the aggregate framework?

I'm trying to aggregate data for 100 accounts for a 14-15 month period, grouping by year and month.
However, the query performance is horrible as it takes 22-27 seconds. There are currently over 15 million records in the collection and I've got an index on the match criteria and can see using explain() that the optimizer uses it.
I tried adding another index on the sort criteria in the query below and after adding the index, the query now takes over 50 seconds! This happens even after I remove the sort from the query.
I'm extremely confused. I thought that because grouping can't utilize an index, sorting the collection beforehand would make the grouping much faster. Is this assumption correct? If not, what other options do I have? I can tolerate query times of up to 5 seconds, but nothing more than that.
//Document Structure
{
    Acc: 1,
    UIC: true,
    date: ISODate("2015-12-01T05:00:00Z"),
    y: 2015,
    mm: 12,
    value: 22.3
}
//Query
db.MyCollection.aggregate([
{ "$match" : { "UIC" : true, "Acc" : { "$in" : [1, 2, 3, ..., 99, 100] }, "date" : { "$gte" : ISODate("2015-12-01T05:00:00Z"), "$lt" : ISODate("2017-02-01T05:00:00Z") } } },
//{ "$sort" : { "UIC" : 1, "Acc" : 1, "y" : -1, "mm" : 1 } },
{ "$group" : { "_id" : { "Acc" : "$Acc", "Year" : "$y", "Month" : "$mm" }, "Sum" : { "$sum" : "$value" } } }
])
What I would suggest is to write a script (it can be in Node.js) that aggregates the data into a different collection. When you have long-running queries like this, it's advisable to maintain a separate collection containing the pre-aggregated data and query that instead.
My second piece of advice would be to create a composite key in this aggregated collection and search it by regular expression. In your case I would make a key of the form accountId:period. For example, for account 1 and February 2016, the key would be something like 1:201602.
Then you would be able to query by account and period using regular expressions. For instance, if you wanted the 2016 records for account 1, you could do something like:
db.aggregatedCollection.find({ _id: /^1:2016/ })
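For completeness, a rough sketch of such a pre-aggregation script in the shell, assuming the accountId:period key format above (the aggregatedCollection name is hypothetical):
// Periodic pre-aggregation pass (e.g. run from cron): materialize
// per-account, per-month sums keyed as "account:period".
db.MyCollection.aggregate([
    { "$match": { "UIC": true } },
    { "$group": {
        "_id": { "Acc": "$Acc", "y": "$y", "mm": "$mm" },
        "Sum": { "$sum": "$value" }
    }}
]).forEach(function(doc) {
    var period = doc._id.y * 100 + doc._id.mm;  // e.g. 201602
    db.aggregatedCollection.update(
        { "_id": doc._id.Acc + ":" + period },
        { "$set": { "Sum": doc.Sum } },
        { upsert: true }
    );
});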
Hope my answer was helpful.

MongoDb Max Query

If I have the following data in MongoDb:
{ "_id" : ObjectId("1"), "name" : "call", "hour": 10, "number" : 14 }
{ "_id" : ObjectId("2"), "name" : "call", "hour": 11, "number" : 100 }
{ "_id" : ObjectId("3"), "name" : "call", "hour": 12, "number" : 200 }
I want to query MongoDB to show which hour has the highest number of calls, so the result of the query would look like this (I want it to show me the whole document):
{ "_id" : ObjectId("3"), "name" : "call", "hour": 12, "number" : 200 }
I see that MongoDB has its own max() function, but it does not work like this. Could anybody tell me how to do this simple query?
Thanks for kij's idea. So if I query like this:
db.xxx.find({name:"call"}).sort({number:-1}).limit(1)
It gives me the correct result, but it feels clumsy. Is there a more direct way?
this should do the job:
db.collection.find().limit(1).sort( { number: -1 } )
Instead, you could sort descending and take the first result. I think it's an alternative if max does not work as expected.
I haven't practiced in a long time, but maybe with something like:
db.collection.find().sort({ "object.number": -1 }).limit(1)
kij pointed to the correct answer, I believe. Just add an index on your "number" field and the query will be optimized.
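For example, in the shell (collection and field names taken from the question):
db.xxx.createIndex({ number: -1 })
// or a compound index that also covers the filter on "name":
db.xxx.createIndex({ name: 1, number: -1 })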
The order of sort, limit and find is always the same in MongoDB find queries, no matter how the query is written (see the last paragraph at the bottom of the page):
http://docs.mongodb.org/manual/reference/method/db.collection.find/
Aggregation in MongoDB gives more control over order, sort and limit, I think.
Peter

Comparing documents between two MongoDB collections

I have two existing collections and need to populate a third collection based on the comparison between the two existing.
The two collections that need to be compared have the following schema:
// Settings collection:
{
"Identifier":"ABC123",
"C":"1",
"U":"V",
"Low":116,
"High":124,
"ImportLogId":1
}
// Data collection
{
"Identifier":"ABC123",
"C":"1",
"U":"V",
"Date":"11/6/2013 12AM",
"Value":128,
"ImportLogId": 1
}
I am new to MongoDB and NoSQL in general, so I am having a tough time grasping how to do this. The SQL would look something like this:
SELECT s.Identifier, r.ReadValue, r.U, r.C, r.Date
FROM Settings s
JOIN Reads r
ON s.Identifier = r.Identifier
AND s.C = r.C
AND s.U = r.U
WHERE (r.Value <= s.Low OR r.Value >= s.High)
In this case, using the sample data, I would want to return a record because the value from the Data collection is greater than the High value from the Settings collection. Is this possible using Mongo queries or map-reduce, or is this bad collection structure (i.e. maybe all of this should be in one collection)?
A few more additional notes:
The Settings collection should really only have 1 record per "Identifier". The Data collection will have many records per "Identifier". This process could potentially scan hundreds of thousands of documents at a time, so resource usage is an important consideration.
There is no good way of performing an operation like this in MongoDB. If you want the BAD way, you can use code like this:
db.settings.find().forEach(
    function(doc) {
        var data = db.data.find({
            Identifier: doc.Identifier,
            C: doc.C,
            U: doc.U,
            $or: [{Value: {$lte: doc.Low}}, {Value: {$gte: doc.High}}]
        }).toArray();
        // Do what you need
    }
)
but don't expect it to perform even remotely as well as any decent RDBMS.
You could rebuild your schema and embed documents from data collection like this:
{
    "_id" : ObjectId("527a7f4b07c17a1f8ad009d2"),
    "Identifier" : "ABC123",
    "C" : "1",
    "U" : "V",
    "Low" : 116,
    "High" : 124,
    "ImportLogId" : 1,
    "Data" : [
        {
            "Date" : ISODate("2013-11-06T00:00:00Z"),
            "Value" : 128
        },
        {
            "Date" : ISODate("2013-10-09T00:00:00Z"),
            "Value" : 99
        }
    ]
}
It may work if the number of embedded documents is low, but to be honest, working with arrays of documents is far from a pleasant experience. Not to mention that you can easily hit the document size limit as the Data array grows.
If this kind of operation is typical for your application, I would consider using a different solution. As much as I like MongoDB, it works well only with certain types of data and access patterns.
Without the concept of JOIN, you must change your approach and denormalize.
In your case, it looks like you're doing data log validation. My advice is to loop over the settings collection and, for each setting, use findAndModify to set a validation flag on the matching data collection records; after that, you can just use find on the data collection, filtering by the new flag.
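A rough sketch of that flagging pass in the shell, using a multi-document update rather than findAndModify (since several data records may match each setting); the outOfRange flag name is made up:
db.settings.find().forEach(function(s) {
    db.data.update(
        { Identifier: s.Identifier, C: s.C, U: s.U,
          $or: [ { Value: { $lte: s.Low } }, { Value: { $gte: s.High } } ] },
        { $set: { outOfRange: true } },  // hypothetical validation flag
        { multi: true }
    );
});
// Afterwards, the out-of-range readings are a plain find:
db.data.find({ outOfRange: true })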
Starting in Mongo 4.4, we can achieve this type of "join" with the new $unionWith aggregation stage coupled with a classic $group stage:
// > db.settings.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Low" : 116 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Low" : 416 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Low" : 142 }
// > db.data.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 14 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Value" : 43 }
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 45 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Value" : 8 }
db.data.aggregate([
{ $unionWith: "settings" },
{ $group: {
_id: { Identifier: "$Identifier", C: "$C", U: "$U" },
Values: { $push: "$Value" },
Low: { $mergeObjects: { v: "$Low" } }
}},
{ $match: { "Low.v": { $lt: 150 } } },
{ $out: "result-collection" }
])
// > db["result-collection"].find()
// { _id: { Identifier: "ABC123", C: "1", U: "V" }, Values: [14, 45], Low: { v: 116 } }
// { _id: { Identifier: "GHI789", C: "1", U: "W" }, Values: [43], Low: { v: 142 } }
This:
Starts with a union of both collections into the pipeline via the new $unionWith stage.
Continues with a $group stage that:
Groups records based on Identifier, C and U
Accumulates Values into an array
Accumulates Lows via a $mergeObjects operation in order to get a value of Low that isn't null. Using $first wouldn't work, since it could pick up null first (for elements coming from the data collection), whereas $mergeObjects discards null values when merging in an object containing a non-null value.
Then discards grouped records whose Low value isn't below, let's say, 150.
And finally outputs the resulting records to a third collection via an $out stage.
A feature we've developed called Data Compare & Sync might be able to help here.
It lets you compare two MongoDB collections and see the differences (e.g. spot the same, missing, or different fields).
You can then export these comparison results to a CSV file, and use that to create your new, third collection.
Disclosure: We are the creators of the MongoDB GUI, Studio 3T.

Using Spring Data Mongodb, is it possible to get the max value of a field without pulling and iterating over an entire collection?

Using mongoTemplate.find(), I specify a Query with which I can call .limit() or .sort():
.limit() returns a Query object
.sort() returns a Sort object
Given this, I can say Query().limit(int).sort(), but this does not perform the desired operation; it merely sorts a limited result set.
I cannot call Query().sort().limit(int) either, since .sort() returns a Sort.
So using Spring Data, how do I perform the following, as shown in the MongoDB shell? Maybe there's a way to pass a raw query that I haven't found yet?
I would be ok with extending the Paging interface if need be...just doesn't seem to help any. Thanks!
> j = { order: 1 }
{ "order" : 1 }
> k = { order: 2 }
{ "order" : 2 }
> l = { order: 3 }
{ "order" : 3 }
> db.test.save(j)
> db.test.save(k)
> db.test.save(l)
> db.test.find()
{ "_id" : ObjectId("4f74d35b6f54e1f1c5850f19"), "order" : 1 }
{ "_id" : ObjectId("4f74d3606f54e1f1c5850f1a"), "order" : 2 }
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
> db.test.find().sort({ order : -1 }).limit(1)
{ "_id" : ObjectId("4f74d3666f54e1f1c5850f1b"), "order" : 3 }
You can do this in spring-data-mongodb. Mongo will optimize sort/limit combinations IF the sort field is indexed (or is the @Id field). This produces very fast O(log N) or better results. Otherwise it is still O(N), as opposed to O(N log N), because it uses a top-k algorithm and avoids the global sort (see the MongoDB sort documentation). This is from Mkyong's example, but I apply the sort first and then set the limit to one.
Query query = new Query();
query.with(new Sort(Sort.Direction.DESC, "idField"));
query.limit(1);
MyObject maxObject = mongoTemplate.findOne(query, MyObject.class);
Normally, things that are done with aggregate SQL queries can be approached in (at least) three ways in NoSQL stores:
with Map/Reduce. This effectively goes through all the records, but is more optimized (works with multiple threads, and in clusters). Here's the map/reduce tutorial for MongoDB.
pre-calculate the max value on each insert, and store it separately (see the sketch below). So, whenever you insert a record, you compare it to the previous max value, and if it's greater, update the max value in the db.
fetch everything into memory and do the calculation in the code. That's the most trivial solution. It would probably work well for small data sets.
Choosing one over the other depends on your usage of this max value. If it is needed rarely, for example for some corner-case reporting, you can go with map/reduce. If it is used often, then store the current max.
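A minimal shell sketch of the second option, assuming a separate stats document (its _id here is made up) and MongoDB 2.6+ for the $max update operator:
// After each insert, fold the new value into a running max.
db.test.insert({ order: 4 });
db.stats.update(
    { _id: "test_max" },        // hypothetical stats document
    { $max: { maxOrder: 4 } },  // only updates if 4 > current maxOrder
    { upsert: true }
);
// Reading the max back is then a single lookup by _id:
db.stats.findOne({ _id: "test_max" }).maxOrder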
As far as I am aware, Mongo fully supports sort then limit: see http://www.mongodb.org/display/DOCS/Sorting+and+Natural+Order
Getting the max/min via map-reduce is going to be very slow and should be avoided at all costs.
I don't know anything about Spring Data, but I can recommend Morphia to help with queries. Otherwise, a basic way with the Java driver would be:
DBCollection coll = db.getCollection("...");
DBCursor cur = coll.find(new BasicDBObject()).sort(new BasicDBObject("order", -1))
        .limit(1);
if (cur.hasNext())
    System.out.println(cur.next());
Use the aggregation $max.
As $max is an accumulator operator available only in the $group stage, you need to do a trick: in the group operator, use any constant as the _id.
Let's take the example given on the MongoDB site:
Consider a sales collection with the following documents:
{ "_id" : 1, "item" : "abc", "price" : 10, "quantity" : 2, "date" : ISODate("2014-01-01T08:00:00Z") }
{ "_id" : 2, "item" : "jkl", "price" : 20, "quantity" : 1, "date" : ISODate("2014-02-03T09:00:00Z") }
{ "_id" : 3, "item" : "xyz", "price" : 5, "quantity" : 5, "date" : ISODate("2014-02-03T09:05:00Z") }
{ "_id" : 4, "item" : "abc", "price" : 10, "quantity" : 10, "date" : ISODate("2014-02-15T08:00:00Z") }
{ "_id" : 5, "item" : "xyz", "price" : 5, "quantity" : 10, "date" : ISODate("2014-02-15T09:05:00Z") }
Say you want to find the max price among all the items:
db.sales.aggregate(
[
{
$group:
{
_id: "1", //** This is the trick
maxPrice: { $max: "$price" }
}
}
]
)
Please note the value of "_id": it is the constant "1". You can put any constant...
Since the first answer is correct but the code is obsolete, I'm replying with a similar solution that worked for me:
Query query = new Query();
query.with(Sort.by(Sort.Direction.DESC, "field"));
query.limit(1);
Entity maxEntity = mongoTemplate.findOne(query, Entity.class);

Chaining time-based sort and limit issue

Lately I've encountered some strange behaviours (strange meaning that they are, IMHO, counter-intuitive) while playing with mongo and sort/limit.
Let's suppose I do have the following collection:
> db.fred.find()
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 3, "time" : ISODate("2011-12-03T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 4, "time" : ISODate("2011-12-04T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 5, "time" : ISODate("2011-12-05T00:00:00Z") }
What I would like is to retrieve, in time order, the 2 records preceding "record": 4, plus record 4 itself (i.e. records 2, 3 and 4).
Naively, I was about to run something along the lines of:
db.fred.find({time: {$lte: ISODate("2011-12-04T00:00:00Z")}}).sort({time: -1}).limit(2).sort({time: 1})
but it does not work the way I expected:
{ "_id" : ObjectId("..."), "record" : 1, "time" : ISODate("2011-12-01T00:00:00Z") }
{ "_id" : ObjectId("..."), "record" : 2, "time" : ISODate("2011-12-02T00:00:00Z") }
I was thinking the result would have been records 2, 3 and 4.
From what I can tell, it seems that the second sort is applied before the limit:
sort({time: -1}) => record 4, record 3, record 2, record 1
sort({time: -1}).limit(2) => record 4, record 3
sort({time: -1}).limit(2).sort({time: 1}) => record 1, record 2
i.e. it's like the second sort was applied to the cursor returned by find (i.e. the whole set), and only then was the limit applied.
What is my mistake here and how can I achieve the expected behavior?
BTW: running mongo 2.0.1 on Ubuntu 11.01
The MongoDB shell lazily evaluates cursors, which is to say, the series of chained operations you've done results in one query being sent to the server, using the final state based on the chained operations. So when you say "sort({time: -1}).limit(2).sort({time: 1})" the second call to sort overrides the sort set by the first call.
To achieve your desired result, you're probably better off reversing the cursor output in your application code, especially if you're limiting to a small result set (here you're using 2). The exact code to do so depends on the language you're using, which you haven't specified.
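For instance, in the shell (note the limit of 3, to get record 4 plus the two records before it):
db.fred.find({ time: { $lte: ISODate("2011-12-04T00:00:00Z") } })
       .sort({ time: -1 })  // newest first: records 4, 3, 2
       .limit(3)
       .toArray()
       .reverse()           // back to ascending time order: records 2, 3, 4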
Applying sort() to the same query multiple times makes no sense here. The effective sort will be taken from the last sort() call. So
sort({time: -1}).limit(2).sort({time: 1})
is the same as
sort({time: 1}).limit(2)