Use the aggregation framework to get peaks from a pre-aggregated dataset - MongoDB

I have a few collections of metrics that are stored pre-aggregated into hour and minute collections like this:
"_id" : "12345CHA-2RU020130104",
"metadata" : {
"adaptor_id" : "CHA-2RU",
"processor_id" : NumberLong(0),
"date" : ISODate("2013-01-04T00:00:00Z"),
"processor_type" : "CHP",
"array_serial" : NumberLong(12345)
},
"hour" : {
"11" : 4.6665907,
"21" : 5.9431519999999995,
"7" : 0.6405864,
"17" : 4.712744,
---etc---
},
"minute" : {
"11" : {
"33" : 4.689972,
"32" : 4.7190895,
---etc---
},
"3" : {
"45" : 5.6883,
"59" : 4.792,
---etc---
}
The minute collection has a sub-document for each hour, and each hour sub-document has an entry for each minute holding the metric's value at that minute.
My question is about the aggregation framework: how should I process this collection if I want to find all minutes where the metric was above a certain high-water mark? Investigating the aggregation framework, I see the $unwind operator, but that seems to work only on arrays.
Would the map/reduce functionality be better suited for this? With that I could simply emit any entry above the high-water mark and count them.
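For reference, a minimal sketch of that map/reduce idea (the collection name and the 4.7 threshold are illustrative): emit a count for every minute over the mark, and sum per hour:minute key:
db.yourcoll.mapReduce(
    function () {
        // walk each hour sub-document, then each minute entry within it
        for (var hour in this.minute) {
            for (var min in this.minute[hour]) {
                if (this.minute[hour][min] > highwater) {
                    emit(hour + ":" + min, 1); // one hit per minute over the mark
                }
            }
        }
    },
    function (key, values) { return Array.sum(values); },
    { out: { inline: 1 }, scope: { highwater: 4.7 } }
);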

You could build an array of "keys" using a reduce function that iterates through the object's attributes.
reduce: function (obj, prev) {
    for (var key in obj.minute) {
        prev.results.push({ hour: key, minutes: obj.minute[key] });
    }
}
will give you something like
{
    "results" : [
        {
            "hour" : "11",
            "minutes" : {
                "33" : 4.689972,
                "32" : 4.7190895
            }
        },
        {
            "hour" : "3",
            "minutes" : {
                "45" : 5.6883,
                "59" : 4.792
            }
        }
    ]
}
I've just done a quick test using group() - you'll need something more complex to iterate through the sub-sub-documents (minutes), but hopefully this points you in the right direction.
db.yourcoll.group({
    initial: { results: [] },
    reduce: function (obj, prev) {
        for (var key in obj.minute) {
            prev.results.push({ hour: key, minutes: obj.minute[key] });
        }
    }
});
In the finalizer you could reshape the data again. It's not going to be pretty; it might be easier to hold the minute and hour data as arrays rather than as elements of the document.
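As a rough sketch of that deeper iteration (the 4.7 high-water mark is illustrative), the reduce function can walk each hour's minute sub-document and collect only the entries above the threshold:
db.yourcoll.group({
    initial: { results: [] },
    reduce: function (obj, prev) {
        var highwater = 4.7; // illustrative threshold
        for (var hour in obj.minute) {
            for (var min in obj.minute[hour]) {
                if (obj.minute[hour][min] > highwater) {
                    prev.results.push({ hour: hour, minute: min, value: obj.minute[hour][min] });
                }
            }
        }
    }
});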
Hope it helps a bit.

Related

MongoDB Finding nested object value

I'm trying to find documents within my collection that have a numeric value greater than some amount. The documentation explains how to do this for top-level values; however, I'm struggling to retrieve the correct data for values that are within child objects.
Sample JSON
{
    "_id" : ObjectId("5c32646c9f3315c3e8300673"),
    "key" : "20190107",
    "__v" : 0,
    "chart" : [
        {
            "_id" : ObjectId("5c3372e5c35e924984f28e03"),
            "volume" : "0",
            "close" : "47.24",
            "time" : "09:30 AM"
        },
        {
            "_id" : ObjectId("5c3372e5c35e924984f28d34"),
            "volume" : "50",
            "close" : "44.24",
            "time" : "09:50 AM"
        }
    ]
}
I want to retrieve volumes greater than 10. I've tried:
db.symbols.find({"chart.volume": { $gt: 10 } } )
db.symbols.find({"volume": { $gt: 10 } } )
Any help appreciated.
Your sample JSON has string values for the chart.volume field. If it were numeric, then your first query:
db.symbols.find({"chart.volume": { $gt: 10 } } )
would work fine. The docs do explain how to do this.
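If changing the stored type isn't possible, one hedged workaround (assuming MongoDB 4.0+ for $toInt, and that every document has a chart array) is to convert at query time with $expr:
db.symbols.find({
    $expr: {
        $anyElementTrue: [ {
            $map: {
                input: "$chart",
                as: "c",
                in: { $gt: [ { $toInt: "$$c.volume" }, 10 ] }
            }
        } ]
    }
})
Storing volume as a number in the first place is still the better fix, since a plain {"chart.volume": {$gt: 10}} query can then use an index.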

Optimizing query with &elemMatch inside array

First of all, I'd like to apologize for my English.
I have a serious issue with the performance of my query. Unfortunately, I'm pretty new to MongoDB. I have a collection test which looks similar to this:
{
    "_id" : ObjectId("1"),
    [...]
    "statusHistories" : [
        {
            "created" : ISODate("2016-03-15T14:59:11.597Z"),
            "status" : "STAT1"
        },
        {
            "created" : ISODate("2016-03-15T14:59:20.465Z"),
            "status" : "STAT2"
        },
        {
            "created" : ISODate("2016-03-15T14:51:11.000Z"),
            "status" : "STAT3"
        }
    ]
}
statusHistories is an array.
More than 3000 records are inserted into that collection daily.
What I want to achieve is to find all tests that have a given status with a created date between two dates. So I have prepared a query like this:
db.getCollection('test').find({
    'statusHistories': {
        $elemMatch: {
            created: {
                "$gte": ISODate("2016-07-11T00:00:00.052Z"),
                "$lte": ISODate("2016-07-11T23:59:00.052Z")
            },
            'status': 'STAT1'
        }
    }
})
It gives the expected result. Unfortunately, it takes around 120 seconds to complete, which is way too long. Surprisingly, if I split it into two separate queries, each takes far less time:
db.getCollection('test').find({
    'statusHistories': {
        $elemMatch: {
            created: {
                "$gte": ISODate("2016-07-11T00:00:00.052Z"),
                "$lte": ISODate("2016-07-11T23:59:00.052Z")
            }
        }
    }
})
db.getCollection('test').find({
    'statusHistories': {
        $elemMatch: {
            'status': 'STAT1'
        }
    }
})
Both of them need less than a second to complete.
So what am I doing wrong with my original query? I need to fetch those records in one query, but when I combine the two conditions into a single $elemMatch it takes ages. I tried ensureIndex on statusHistories, but it didn't work out. Any suggestion would be really helpful.
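One thing that may be worth trying (a hedged suggestion, since the question doesn't show which index was created): a compound multikey index on both fields used inside the $elemMatch, equality field first, so the server can match both predicates against the same array element via the index:
db.getCollection('test').createIndex({
    'statusHistories.status': 1,
    'statusHistories.created': 1
})
Running the combined query with .explain() should confirm whether the index is actually being used.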

MongoDB aggregation query to split and convert JSON?

I have a JSON file with a horrific data structure
{ "#timestamp" : "20160226T065604,39Z",
"#toplevelentries" : "941",
"viewentry" : [ { "#noteid" : "161E",
"#position" : "1",
"#siblings" : "941",
"entrydata" : [
and entrydata is a list of 941 entries, each of which looks like this:
{ "#columnnumber" : "0",
"#name" : "$Created",
"datetime" : { "0" : "20081027T114133,55+01" }
},
{ "#columnnumber" : "1",
"#name" : "WriteLog",
"textlist" : { "text" : [ { "0" : "2008.OCT.28 12:54:39 CET # EMI" },
{ "0" : "2008.OCT.28 12:56:13 CET # EMI" },
There are many more columns. The structure is always this:
{
    "#columnnumber": "17",
    "#name": "PublicDocument",
    "text": {
        "0": "TMI-1-2005.pdf"
    }
}
There's a column number, which we can throw away; a #name, which is the important part; and then one of the text, datetime, or textlist fields, where the value is always this weird sub-document with a 0 key holding the actual value.
All 941 entries have the same number of these column entries, and each column entry always has the same structure. I.e., if "#columnnumber": "13" has #name: foo, then it will always be foo, and if it has a datetime key then it will always have a datetime key, never a text or textlist. This monster was born out of a SQL or similar database somewhere at the very far end of the pipeline, but I have no access to the source beyond this. The goal is to revert the transformation and turn it into something a SELECT statement would produce (except textlist, although I guess array_agg and similar could produce that too).
Is there a way to get 941 separate JSON entries out of MongoDB looking like:
{
    $Created: "20081027T114133,55+01",
    WriteLog: ["2008.OCT.28 12:54:39 CET # EMI", "2008.OCT.28 12:56:13 CET # EMI"],
    PublicDocument: "TMI-1-2005.pdf"
}
Is viewentry also a list?
If you do an aggregate on the collection and $unwind on viewentry.entrydata, you will get one document for every entrydata. It should then be possible to do a $project to reformat these documents and produce the output you need.
This is a nice challenge.
To get output like this:
{
    "_id" : "161E",
    "field" : [
        {
            "name" : "$Created",
            "datetime" : {
                "0" : "20081027T114133,55+01"
            }
        },
        {
            "name" : "WriteLog",
            "textlist" : {
                "text" : [
                    { "0" : "2008.OCT.28 12:54:39 CET # EMI" },
                    { "0" : "2008.OCT.28 12:56:13 CET # EMI" }
                ]
            }
        }
    ]
}
Use this aggregation pipeline:
db.chx.aggregate([
    { $unwind: "$viewentry" },
    { $unwind: "$viewentry.entrydata" },
    { $group: {
        "_id": "$viewentry.#noteid",
        field: { $push: {
            "name": "$viewentry.entrydata.#name",
            datetime: "$viewentry.entrydata.datetime",
            textlist: "$viewentry.entrydata.textlist"
        } }
    } }
]).pretty()
The next step should be extracting the log entries, but I have no idea right now, as my brain is already fried tonight, so I'll probably return to this later...
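Picking up where that pipeline leaves off, here is a rough client-side sketch of the final reshaping (note that text also has to be added to the $push, for columns like PublicDocument); it unwraps the odd { "0": value } sub-documents and prints documents in the shape the question asks for:
db.chx.aggregate([
    { $unwind: "$viewentry" },
    { $unwind: "$viewentry.entrydata" },
    { $group: {
        _id: "$viewentry.#noteid",
        field: { $push: {
            name: "$viewentry.entrydata.#name",
            text: "$viewentry.entrydata.text",
            datetime: "$viewentry.entrydata.datetime",
            textlist: "$viewentry.entrydata.textlist"
        } }
    } }
]).forEach(function (doc) {
    var out = {};
    doc.field.forEach(function (f) {
        if (f.datetime) out[f.name] = f.datetime["0"];  // single datetime value
        else if (f.text) out[f.name] = f.text["0"];     // single text value
        else if (f.textlist) out[f.name] = f.textlist.text.map(function (t) {
            return t["0"];                              // list of text values
        });
    });
    printjson(out);
});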

Queries on arrays with timestamps

I have documents that look like this:
{
    "_id" : ObjectId("5191651568f1f6000282b81f"),
    "updated_at" : "2013-05-16T09:46:16.199660",
    "activities" : [
        {
            "worker_name" : "image",
            "completed_at" : "2013-05-13T21:34:59.293711"
        },
        {
            "worker_name" : "image",
            "completed_at" : "2013-05-16T07:33:22.550405"
        },
        {
            "worker_name" : "image",
            "completed_at" : "2013-05-16T07:41:47.845966"
        }
    ]
}
I would like to find only those documents where the updated_at time is greater than the last activities.completed_at time (the array is in time order).
I currently have this, but it matches any activities[].completed_at:
{
"activities.completed_at" : {"$gte" : "updated_at"}
}
Thanks!
Update
Well, I have different workers, and each has its own "completed_at".
I'll have to invert activities as follows:
activities: {
    image: {
        last: {
            completed_at: t3
        },
        items: [
            { completed_at: t0 },
            { completed_at: t1 },
            { completed_at: t2 },
            { completed_at: t3 }
        ]
    }
}
and use this query:
{
"activities.image.last.completed_at" : {"$gte" : "updated_at"}
}
Assuming that you don't know how many activities you have (it would be easy if you always had, say, 3 activities, since you could then address the last one with an activities.2.completed_at dot-notation path), and since there's no $last positional operator, the short answer is that you cannot do this efficiently.
When the activities are inserted, I would update the record's updated_at value (or another field). Then it becomes a trivial problem.
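A sketch of that idea (the collection and field names are illustrative; note also that the original query's {"$gte" : "updated_at"} compares against the literal string "updated_at", not the field's value): maintain a top-level last_completed_at at push time, then compare the two fields directly, e.g. with $expr on MongoDB 3.6+:
// on insert, push the activity and mirror its timestamp at the top level
var now = new Date().toISOString();
db.docs.update(
    { _id: someId }, // someId is whatever document you're updating
    {
        $push: { activities: { worker_name: "image", completed_at: now } },
        $set: { last_completed_at: now }
    }
)
// the question then reduces to comparing two top-level fields
db.docs.find({ $expr: { $gt: [ "$updated_at", "$last_completed_at" ] } })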

Upsert with pymongo and a custom _id field

I'm attempting to store pre-aggregated performance metrics in a sharded MongoDB according to this document.
I'm trying to update the minute sub-documents in a record that may or may not exist with an upsert like so (self.collection is a pymongo collection instance):
self.collection.update(query, data, upsert=True)
query:
{ '_id': u'12345CHA-2RU020130304',
  'metadata': { 'adaptor_id': 'CHA-2RU',
                'array_serial': 12345,
                'date': datetime.datetime(2013, 3, 4, 0, 0, tzinfo=<UTC>),
                'processor_id': 0}
}
data:
{ 'minute': { '16': { '45': 1.6693091}}}
The problem is that the 'minute' sub-document only ever holds the last { hour: { minute: metric } } entry; it never accumulates entries for other hours, it just keeps overwriting that one entry.
I've also tried this with a $set style data entry:
{ '$set': { 'minute': { '16': { '45': 1.6693091}}}}
but it ends up being the same.
What am I doing wrong?
In both of the examples listed you are simply setting a field ('minute') to a particular value; the only reason it is an addition the first time you update is that the field itself does not exist and so must be created.
It's hard to determine exactly what you are shooting for here, but I think what you could do is alter your schema a little so that 'minute' is an array. Then you could use $push to add values regardless of whether they are already present or $addToSet if you don't want duplicates.
I had to alter your document a little to make it valid in the shell, so my _id (and some other fields) are slightly different from yours, but it should still be close enough to be illustrative:
db.foo.find({'_id': 'u12345CHA-2RU020130304'}).pretty()
{
    "_id" : "u12345CHA-2RU020130304",
    "metadata" : {
        "adaptor_id" : "CHA-2RU",
        "array_serial" : 12345,
        "date" : ISODate("2013-03-18T23:28:50.660Z"),
        "processor_id" : 0
    }
}
Now let's add a minute field with an array of documents instead of a single document:
db.foo.update({'_id': 'u12345CHA-2RU020130304'}, { $addToSet : {'minute': { '16': {'45': 1.6693091}}}})
db.foo.find({'_id': 'u12345CHA-2RU020130304'}).pretty()
{
    "_id" : "u12345CHA-2RU020130304",
    "metadata" : {
        "adaptor_id" : "CHA-2RU",
        "array_serial" : 12345,
        "date" : ISODate("2013-03-18T23:28:50.660Z"),
        "processor_id" : 0
    },
    "minute" : [
        {
            "16" : {
                "45" : 1.6693091
            }
        }
    ]
}
Then, to illustrate the addition, add a slightly different entry (since I am using $addToSet, this is required for a new field to be added):
db.foo.update({'_id': 'u12345CHA-2RU020130304'}, { $addToSet : {'minute': { '17': {'48': 1.6693391}}}})
db.foo.find({'_id': 'u12345CHA-2RU020130304'}).pretty()
{
    "_id" : "u12345CHA-2RU020130304",
    "metadata" : {
        "adaptor_id" : "CHA-2RU",
        "array_serial" : 12345,
        "date" : ISODate("2013-03-18T23:28:50.660Z"),
        "processor_id" : 0
    },
    "minute" : [
        {
            "16" : {
                "45" : 1.6693091
            }
        },
        {
            "17" : {
                "48" : 1.6693391
            }
        }
    ]
}
I ended up setting the fields using dot notation, like this:
query:
{ '_id': u'12345CHA-2RU020130304',
  'metadata': { 'adaptor_id': 'CHA-2RU',
                'array_serial': 12345,
                'date': datetime.datetime(2013, 3, 4, 0, 0, tzinfo=<UTC>),
                'processor_id': 0}
}
I'm setting the metrics like this:
data = {"$set": {}}
for metric in csv:
date_utc = metric['date'].astimezone(pytz.utc)
data["$set"]["minute.%d.%d" % (date_utc.hour,
date_utc.minute)] = float(metric['metric'])
which creates data like this:
{"$set": {'minute.16.45': 1.6693091,
'minute.16.46': 1.566343,
'minute.16.47': 1.22322}}
So when self.collection.update(query, data, upsert=True) is run, it updates exactly those fields.
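For reference, the reason this works where the earlier attempts didn't is that a dotted $set path only touches the named leaf, so sibling hours and minutes survive each upsert. A quick shell illustration, reusing the _id from the answer above:
// first upsert creates the nested structure
db.foo.update(
    { _id: "u12345CHA-2RU020130304" },
    { $set: { "minute.16.45": 1.6693091 } },
    { upsert: true }
)
// a later update to a different hour/minute leaves 16.45 in place
db.foo.update(
    { _id: "u12345CHA-2RU020130304" },
    { $set: { "minute.17.5": 1.22322 } },
    { upsert: true }
)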