I have time series data in mongodb as follows:
{
"_id" : ObjectId("558912b845cea070a982d894"),
"code" : "ZL0KOP",
"time" : NumberLong("1420128024000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d895"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128025000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d896"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128003000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d897"),
"code" : "ZL0KOP",
"time" : NumberLong("1420041724000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d89e"),
"code" : "YBUHCW",
"time" : NumberLong("1420041732000"),
"direction" : "10",
"siteId" : "0002"
}
{
"_id" : ObjectId("558912b845cea070a982d8a1"),
"code" : "U48AIW",
"time" : NumberLong("1420041729000"),
"direction" : "10",
"siteId" : "0002"
}
{
"_id" : ObjectId("558912b845cea070a982d8a0"),
"code" : "OJ3A06",
"time" : NumberLong("1420300927000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d89d"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300885000"),
"direction" : "10",
"siteId" : "0003"
}
{
"_id" : ObjectId("558912b845cea070a982d8a2"),
"code" : "ZLV05H",
"time" : NumberLong("1420300922000"),
"direction" : "10",
"siteId" : "0001"
}
{
"_id" : ObjectId("558912b845cea070a982d8a3"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300928000"),
"direction" : "10",
"siteId" : "0000"
}
The codes that match two or more conditions need to be filtered out.
For example:
condition1: 1420128000000 < time < 1420128030000,siteId == 0000
condition2: 1420300880000 < time < 1420300890000,siteId == 0003
results for the first condition:
{
"_id" : ObjectId("558912b845cea070a982d894"),
"code" : "ZL0KOP",
"time" : NumberLong("1420128024000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d895"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128025000"),
"direction" : "10",
"siteId" : "0000"
}
{
"_id" : ObjectId("558912b845cea070a982d896"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420128003000"),
"direction" : "10",
"siteId" : "0000"
}
results for the second condition:
{
"_id" : ObjectId("558912b845cea070a982d89d"),
"code" : "AQ0ZSQ",
"time" : NumberLong("1420300885000"),
"direction" : "10",
"siteId" : "0003"
}
The only code that matches all the conditions above should be:
{"code" : "AQ0ZSQ", "count":2}
"count" means, the code "AQ0ZSQ" appeared in both conditions
The only solution I can think of is using two queries. For example, using Python:
result1 = list(db.codes.objects({'time': {'$gt': 1420128000000,'$lt': 1420128030000}, 'siteId': "0000"}).only("code"))
result2 = list(db.codes.objects({'time': {'$gt': 1420300880000,'$lt': 1420300890000}, 'siteId': '0003'}).only("code"))
and then finding the codes shared by both results.
The problem is that there are millions of documents in the collection, and both queries can easily exceed the 16MB limit.
So is it possible to do that in one query? Or should I change the document structure?
What you are asking for here requires the usage of the aggregation framework in order to calculate that there was an intersection between results on the server.
The first part of the logic is you need an $or query for the two conditions, then there will be some additional projection and filtering on those results:
db.collection.aggregate([
// Fetch all possible documents for consideration
{ "$match": {
"$or": [
{
"time": { "$gt": 1420128000000, "$lt": 1420128030000 },
"siteId": "0000"
},
{
"time": { "$gt": 1420300880000, "$lt": 1420300890000 },
"siteId": "0003"
}
]
}},
// Logically compare the conditions against results and add a score
{ "$project": {
"code": "$code",
"score": { "$add": [
{ "$cond": [
{ "$and":[
{ "$gt": [ "$time", 1420128000000 ] },
{ "$lt": [ "$time", 1420128030000 ] },
{ "$eq": [ "$siteId", "0000" ] }
]},
1,
0
]},
{ "$cond": [
{ "$and":[
{ "$gt": [ "$time", 1420300880000 ] },
{ "$lt": [ "$time", 1420300890000 ] },
{ "$eq": [ "$siteId", "0003" ] }
]},
1,
0
]}
]}
}},
// Now Group the results by "code"
{ "$group": {
"_id": "$code",
"score": { "$sum": "$score" }
}},
// Now filter to keep only results with score 2
{ "$match": { "score": 2 } }
])
So break that down and see how it works.
First you want a query with $match to get all the possible documents for "all" of your conditions of "intersection". That is what the $or expression allows here by considering that matched documents must meet either set. You need all of them to work out the "intersection" here.
In the second $project pipeline stage a boolean test of your conditions is performed with each set. Notice the usage of $and here as well as other boolean operators of the aggregation framework is slightly different to that of the query usage form.
In the aggregation framework form ( outside of $match which uses normal query operators ) these operators take an array of arguments, to typically represent "two" values for comparison rather than the operation being assigned to the "right" of the field name.
Since these conditions are logical or "boolean" we want to return the result as "numeric" rather than a true/false value. This is what $cond does here. So where the condition is true for the document inspected a score of 1 is emitted otherwise it is 0 when false.
Finally in this $project expression both of your conditions are wrapped with $add to form the "score" result. So if neither condition is true ( not possible after the $match ) the score would be 0, if "one" is true then 1, or where "both" are true then 2.
Note that the specific conditions asked for here will never score above 1 for a single document, since no document can fall in both time ranges or have "two" "siteId" values.
Now the important part is to $group by the "code" value and $sum the score value to get a total per "code".
This leaves the final $match filter stage of the pipeline to only keep those documents with a "score" value that is equal to the number of conditions you asked for. In this case 2.
There is a failing there, however: where the same "code" matches one condition more than once ( as "AQ0ZSQ" does for the first condition here ), every match adds to the total, so the "score" would be incorrect.
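To make that failing concrete, here is a rough plain-JavaScript model of the scoring pipeline, run against the four sample documents that pass the $match. It is only a sketch: the names `docs`, `conditions` and `scoreByCode` are made up for illustration and nothing here talks to MongoDB.

```javascript
// The four documents from the question that satisfy either condition.
const docs = [
  { code: "ZL0KOP", time: 1420128024000, siteId: "0000" },
  { code: "AQ0ZSQ", time: 1420128025000, siteId: "0000" },
  { code: "AQ0ZSQ", time: 1420128003000, siteId: "0000" },
  { code: "AQ0ZSQ", time: 1420300885000, siteId: "0003" },
];

// One predicate per condition, mirroring the two $cond tests.
const conditions = [
  (d) => d.time > 1420128000000 && d.time < 1420128030000 && d.siteId === "0000",
  (d) => d.time > 1420300880000 && d.time < 1420300890000 && d.siteId === "0003",
];

// Models $project (score per document) followed by $group ($sum per code).
function scoreByCode(documents) {
  const totals = {};
  for (const d of documents) {
    const score = conditions.reduce((n, test) => n + (test(d) ? 1 : 0), 0);
    totals[d.code] = (totals[d.code] || 0) + score;
  }
  return totals;
}

const scores = scoreByCode(docs);
// Models the final { "$match": { "score": 2 } } stage.
const kept = Object.keys(scores).filter((code) => scores[code] === 2);

console.log(scores); // "AQ0ZSQ" totals 3 because it matched the first condition twice
console.log(kept);   // so the score-equals-2 filter keeps nothing
```

The code that should have been returned ends up with a total of 3, which is why counting matches is not the same as counting conditions.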
So after the introduction to the principles of using logical operators in aggregation, you can fix that fault by essentially "tagging" each result logically as to which condition "set" it applies to. Then you can basically consider which "code" appeared in "both" sets in this case:
db.collection.aggregate([
{ "$match": {
"$or": [
{
"time": { "$gt": 1420128000000, "$lt": 1420128030000 },
"siteId": "0000"
},
{
"time": { "$gt": 1420300880000, "$lt": 1420300890000 },
"siteId": "0003"
}
]
}},
// If it's the first logical condition it's "A" otherwise it can
// only be the other, therefore "B". Extend for more sets as needed.
{ "$group": {
"_id": {
"code": "$code",
"type": { "$cond": [
{ "$and":[
{ "$gt": [ "$time", 1420128000000 ] },
{ "$lt": [ "$time", 1420128030000 ] },
{ "$eq": [ "$siteId", "0000" ] }
]},
"A",
"B"
]}
}
}},
// Simply add up the results for each "type"
{ "$group": {
"_id": "$_id.code",
"score": { "$sum": 1 }
}},
// Now filter to keep only results with score 2
{ "$match": { "score": 2 } }
])
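The same idea can be sketched in plain JavaScript to see why tagging fixes the count. This is an illustration only, assuming the input has already passed the $or filter ( so anything that is not "A" must be "B" ); `matched`, `isFirstCondition` and `codesInBothSets` are hypothetical names.

```javascript
// Documents that already passed the $or $match stage.
const matched = [
  { code: "ZL0KOP", time: 1420128024000, siteId: "0000" },
  { code: "AQ0ZSQ", time: 1420128025000, siteId: "0000" },
  { code: "AQ0ZSQ", time: 1420128003000, siteId: "0000" },
  { code: "AQ0ZSQ", time: 1420300885000, siteId: "0003" },
];

// Same test as the $cond inside the first $group.
const isFirstCondition = (d) =>
  d.time > 1420128000000 && d.time < 1420128030000 && d.siteId === "0000";

function codesInBothSets(documents) {
  // First $group: collapse duplicates so each (code, type) pair exists once.
  const typesByCode = new Map();
  for (const d of documents) {
    const type = isFirstCondition(d) ? "A" : "B";
    if (!typesByCode.has(d.code)) typesByCode.set(d.code, new Set());
    typesByCode.get(d.code).add(type);
  }
  // Second $group plus $match: count distinct types and keep those with 2.
  const result = [];
  for (const [code, types] of typesByCode) {
    if (types.size === 2) result.push({ code: code, count: types.size });
  }
  return result;
}

const both = codesInBothSets(matched);
console.log(both);
```

Because duplicates collapse in the first grouping, a code can contribute at most once per condition, no matter how many documents matched it.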
It might be a bit to take in if this is your first time using the aggregation framework. Please take the time to look at the operators used as defined with the links here and also look at Aggregation Pipeline Operators in general.
Beyond simple data selection, this is the tool you should be reaching for most often when using MongoDB, so you would do well to understand all the operations that are possible.
I have the following query in MongoDB:
db.getCollection('message').aggregate([
{
"$match": {
"who" : { "$in" : ["manager", "woker"] },
"sendTo": { "$in": ["userId:243369", "userId:160921"] },
"exceptSendTo": { "$nin": ["userId:37355"] },
"msgTime": { "$lt": 1559716155 },
"isInvalid": { "$exists": false }
}
},
{
"$sort": { "msgTime": 1, "who": 1, "sendTo": 1 }
},
{
"$group": { "_id": "$who", "doc": { "$first": "$type" } }
}
], { allowDiskUse: true})
Never mind what the fields mean. I have these indexes:
/* 1 */
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "db.message"
},
{
"v" : 1,
"key" : {
"who" : 1.0,
"sendTo" : 1.0
},
"name" : "who_sendTo",
"ns" : "db.message"
},
{
"v" : 1,
"key" : {
"msgTime" : 1.0
},
"name" : "msgTime_1",
"ns" : "db.message"
},
{
"v" : 1,
"key" : {
"msgTime" : 1.0,
"who" : 1.0,
"sendTo" : 1.0
},
"name" : "msgTime_1.0_who_1.0_sendTo_1.0",
"ns" : "db.message",
"background" : true
}
]
Running the query above takes 1.52s. Using explain, I can see that it does use the msgTime_1.0_who_1.0_sendTo_1.0 index.
Why is the query still slow when an index is used? Is there any way to fix the slowness, such as changing the index?
I don't think you are using the sort the way you intend to.
The $first operator requires a sort on the actual first argument:
https://docs.mongodb.com/manual/reference/operator/aggregation/first/
You need to sort on the key you want the first element of.
Or you could use $$ROOT, which returns the whole first document.
I think you should modify it to something like:
{"$sort": {"who": 1, "msgTime": 1, "sendTo": 1}},
{"$group": {"_id": "$who", "doc": {"$first": "$$ROOT"}}},
In this case the $group operator can find the result for each group "instantly" since they are all next to each other.
If you are only interested in the type, add a projection:
{'$project': {'doc.type': 1}}
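As a rough illustration of the sort-then-$first idea, here is a plain-JavaScript sketch with made-up messages. It only models the result of sorting on the group key first and then taking the first document of each group; it is not a statement about how MongoDB executes the pipeline, and `messages` and `firstPerGroup` are hypothetical names.

```javascript
// Hypothetical sample messages, deliberately out of order.
const messages = [
  { who: "manager", msgTime: 30, type: "c" },
  { who: "woker", msgTime: 10, type: "a" },
  { who: "manager", msgTime: 20, type: "b" },
  { who: "woker", msgTime: 40, type: "d" },
];

function firstPerGroup(docs) {
  // Models { "$sort": { "who": 1, "msgTime": 1 } }: group key first, then time.
  const sorted = [...docs].sort((a, b) => {
    if (a.who !== b.who) return a.who < b.who ? -1 : 1;
    return a.msgTime - b.msgTime;
  });
  // Models $group with $first: since members of a group are adjacent,
  // the first document seen for each key is the one to keep.
  const result = [];
  for (const doc of sorted) {
    const last = result[result.length - 1];
    if (!last || last._id !== doc.who) result.push({ _id: doc.who, doc: doc });
  }
  return result;
}

const firsts = firstPerGroup(messages);
console.log(firsts.map((g) => [g._id, g.doc.type]));
```

With this sample data each group keeps its earliest message, so "manager" maps to type "b" and "woker" to type "a".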
I want to aggregate my data into an array of the stored dates, grouped by user and day, for the current day only. For example, given this data (assuming today is February 24th):
{
"_id" : ObjectId("58b0b4b732d3cd188cea9e1b"),
"user" : 1,
"heure" : ISODate("2017-02-24T22:33:27.858Z")
}
{
"_id" : ObjectId("58b0b4b732d3cd188cea9e1b"),
"user" : 1,
"heure" : ISODate("2017-02-24T23:33:27.858Z")
}
{
"_id" : ObjectId("58b0b4b732d3cd188cea9e1b"),
"user" : 2,
"heure" : ISODate("2017-02-24T22:34:27.858Z")
}
{
"_id" : ObjectId("58b0b4b732d3cd188cea9e1b"),
"user" : 1,
"heure" : ISODate("2017-02-25T07:21:27.858Z")
}
Get this :
{
"_id" : {user : 1, jour : 55},
"date" : [ISODate("2017-02-24T22:33:27.858Z"), ISODate("2017-02-24T23:33:27.858Z") ]
}
{
"_id" : {user : 2, jour : 55},
"date" : [ISODate("2017-02-24T22:34:27.858Z") ]
}
I tried using $push or $match, but everything failed.
Optionally, I want the time between two dates, e.g. for user 1, another field containing 1 hour. But I don't want to use a date more than once, so with 4 dates in the array I need only one sum: the difference between the first and second plus the difference between the third and fourth. I want to see this to learn how to use $cond properly.
Here is my current pipeline:
[
{ $match : {$eq : [{$dayOfYear : "$heure"}, {$dayOfYear : ISODate()}] } },
{
$group : {
_id : {
user : "$user",
},
date : {$push: "$heure"},
nombre: { $sum : 1 }
}
}
]
For now, I don't handle the second part of the aggregation.
For the first filter part you need to use the $redact pipeline stage: a $cond based on the $dayOfYear date operator returns the $$KEEP system variable for documents that match the condition, and $$PRUNE to discard the others.
Consider composing your final aggregate pipeline as:
[
{
"$redact": {
"$cond": [
{
"$eq": [
{ "$dayOfYear": "$heure" },
{ "$dayOfYear": new Date() }
]
},
"$$KEEP",
"$$PRUNE"
]
}
},
{
"$group": {
"_id": {
"user": "$user",
"jour": { "$dayOfYear": "$heure" }
},
"date": { "$push": "$heure" },
"nombre": { "$sum": 1 }
}
}
]
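To check the expected shape without a database, the same logic can be sketched in plain JavaScript. This is a minimal model under a few assumptions: `dayOfYear`, `groupByUserAndDay` and `pairedGapMs` are illustrative names, the day is computed in UTC, and "today" is passed in explicitly so the example is repeatable. The last helper is one possible reading of the optional part of the question ( first with second, third with fourth ), assuming the dates are in ascending order.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Day of the year in UTC, 1-based (January 1st is 1).
function dayOfYear(date) {
  const year = date.getUTCFullYear();
  const startOfDay = Date.UTC(year, date.getUTCMonth(), date.getUTCDate());
  const lastDayOfPreviousYear = Date.UTC(year, 0, 0);
  return (startOfDay - lastDayOfPreviousYear) / MS_PER_DAY;
}

// Models the $redact filter followed by the $group stage.
function groupByUserAndDay(docs, today) {
  const jour = dayOfYear(today);
  const groups = new Map();
  for (const doc of docs) {
    if (dayOfYear(doc.heure) !== jour) continue; // $$PRUNE
    if (!groups.has(doc.user)) {
      groups.set(doc.user, { _id: { user: doc.user, jour: jour }, date: [], nombre: 0 });
    }
    const group = groups.get(doc.user);
    group.date.push(doc.heure); // $push
    group.nombre += 1;          // $sum: 1
  }
  return [...groups.values()];
}

// Sums the gaps of non-overlapping pairs: (2nd - 1st) + (4th - 3rd) + ...
function pairedGapMs(dates) {
  let total = 0;
  for (let i = 0; i + 1 < dates.length; i += 2) {
    total += dates[i + 1] - dates[i];
  }
  return total;
}

const sample = [
  { user: 1, heure: new Date("2017-02-24T22:33:27.858Z") },
  { user: 1, heure: new Date("2017-02-24T23:33:27.858Z") },
  { user: 2, heure: new Date("2017-02-24T22:34:27.858Z") },
  { user: 1, heure: new Date("2017-02-25T07:21:27.858Z") },
];

const grouped = groupByUserAndDay(sample, new Date("2017-02-24T12:00:00Z"));
console.log(grouped);
console.log(pairedGapMs(grouped[0].date) / 3600000, "hour(s) for user 1");
```

Note that comparing only the day of the year also matches the same day in other years, in the sketch as in the pipeline, so a year check may be needed on data spanning more than one year.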
The question is: calculate the average age of the users who have more than 3 strengths listed.
One of the documents looks like this:
{
"_id" : 1.0,
"user_id" : "jshaw0",
"first_name" : "Judy",
"last_name" : "Shaw",
"email" : "jshaw0#merriam-webster.com",
"age" : 39.0,
"status" : "disabled",
"join_date" : "2016-09-05",
"last_login_date" : "2016-09-30 23:59:36 -0400",
"address" : {
"city" : "Deskle",
"province" : "PEI"
},
"strengths" : [
"star schema",
"dw planning",
"sql",
"mongo queries"
],
"courses" : [
{
"code" : "CSIS2300",
"total_questions" : 118.0,
"correct_answers" : 107.0,
"incorect_answers" : 11.0
},
{
"code" : "CSIS3300",
"total_questions" : 101.0,
"correct_answers" : 34.0,
"incorect_answers" : 67.0
}
]
}
I know I need to count how many strengths each document has, compare that with $gt, and then calculate the average age.
However, I don't know how to write the two functions, count and average, in one query. Do I need to use aggregation, and if so, how?
Thanks so much
Use $redact to match your array size & $group to calculate the average :
db.collection.aggregate([{
"$redact": {
"$cond": [
{ "$gt": [{ "$size": "$strengths" }, 3] },
"$$KEEP",
"$$PRUNE"
]
}
}, {
$group: {
_id: 1,
average: { $avg: "$age" }
}
}])
The $redact stage matches documents whose strengths array has more than 3 elements: it will $$KEEP records that match this condition and $$PRUNE those that don't. Check the $redact documentation.
The $group stage just computes the average with $avg.
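For a quick sanity check of the logic outside MongoDB, the two steps can be modelled in plain JavaScript. This is only a sketch: the first user is taken from the question, the other two are invented for the example, and `averageAge` is a hypothetical helper. Unlike $size, it silently skips documents where strengths is not an array.

```javascript
const users = [
  { user_id: "jshaw0", age: 39, strengths: ["star schema", "dw planning", "sql", "mongo queries"] },
  { user_id: "hypothetical1", age: 25, strengths: ["sql"] },
  { user_id: "hypothetical2", age: 41, strengths: ["a", "b", "c", "d", "e"] },
];

function averageAge(docs, minStrengths) {
  // Models $redact: keep documents with more than minStrengths strengths.
  const kept = docs.filter(
    (d) => Array.isArray(d.strengths) && d.strengths.length > minStrengths
  );
  if (kept.length === 0) return null; // nothing left to average
  // Models $group with $avg over the remaining documents.
  const total = kept.reduce((sum, d) => sum + d.age, 0);
  return total / kept.length;
}

const average = averageAge(users, 3);
console.log(average); // 40, the mean of 39 and 41
```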
I have a collection of documents, each has a field which is an array of subdocuments, and all subdocuments have a common field 'status'. I want to find all documents that have the same status for all subdocuments.
collection:
{
"name" : "John",
"wives" : [
{
"name" : "Mary",
"status" : "dead"
},
{
"name" : "Anne",
"status" : "alive"
}
]
},
{
"name" : "Bill",
"wives" : [
{
"name" : "Mary",
"status" : "dead"
},
{
"name" : "Anne",
"status" : "dead"
}
]
},
{
"name" : "Mohammed",
"wives" : [
{
"name" : "Jane",
"status" : "dead"
},
{
"name" : "Sarah",
"status" : "dying"
}
]
}
I want to check if all wives are dead and find only Bill.
You can use the following aggregation query to get the records of people whose wives are all dead:
db.collection.aggregate(
{$project: {name:1, wives:1, size:{$size:'$wives'}}},
{$unwind:'$wives'},
{$match:{'wives.status':'dead'}},
{$group:{_id:'$_id',name:{$first:'$name'}, wives:{$push: '$wives'},size:{$first:'$size'},count:{$sum:1}}},
{$project:{_id:1, wives:1, name:1, cmp_value:{$cmp:['$size','$count']}}},
{$match:{cmp_value:0}}
)
Output:
{ "_id" : ObjectId("56d401de8b953f35aa92bfb8"), "name" : "Bill", "wives" : [ { "name" : "Mary", "status" : "dead" }, { "name" : "Anne", "status" : "dead" } ], "cmp_value" : 0 }
If you need to find records of users who have the same status, then you may remove the initial match stage.
The most efficient way to handle this is always going to be to "match" on the status of "dead" as the opening query, otherwise you are processing items that cannot possibly match. After that, the logic follows quite simply with $map and $allElementsTrue:
db.collection.aggregate([
{ "$match": { "wives.status": "dead" } },
{ "$redact": {
"$cond": {
"if": {
"$allElementsTrue": {
"$map": {
"input": "$wives",
"as": "wife",
"in": { "$eq": [ "$$wife.status", "dead" ] }
}
}
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
])
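In plain JavaScript terms, $map followed by $allElementsTrue behaves much like Array.prototype.every. The sketch below models the two stages against the sample collection; `people` and `allDead` are illustrative names, and nothing here queries MongoDB.

```javascript
const people = [
  { name: "John", wives: [{ name: "Mary", status: "dead" }, { name: "Anne", status: "alive" }] },
  { name: "Bill", wives: [{ name: "Mary", status: "dead" }, { name: "Anne", status: "dead" }] },
  { name: "Mohammed", wives: [{ name: "Jane", status: "dead" }, { name: "Sarah", status: "dying" }] },
];

const allDead = people
  // Models the opening $match: at least one element with status "dead".
  .filter((p) => p.wives.some((w) => w.status === "dead"))
  // Models $redact with $allElementsTrue over $map: every element must match.
  .filter((p) => p.wives.every((w) => w.status === "dead"))
  .map((p) => p.name);

console.log(allDead); // [ 'Bill' ]
```

All three sample documents pass the first filter, which shows that the opening $match only narrows the candidates; the second test does the real work.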
Or the same thing with $where:
db.collection.find({
"wives.status": "dead",
"$where": function() {
return this.wives.length
== this.wives.filter(function(el) {
return el.status == "dead";
}).length;
}
})
Both essentially test the "status" value of all elements to make sure they match in the fastest possible way. But the aggregate pipeline with just $match and $redact should be faster. And fewer pipeline stages ( essentially each a pass through the data ) means faster as well.
Of course keeping a property on the document is always fastest, but it would involve logic to set that only where "all elements" are the same property. Which of course would typically mean inspecting the document by loading it from the server prior to each update.
I am building a dashboard that rotates between different webpages. I want to pull all slides that are part of the "Test" deck and order them appropriately. After the query, my result would ideally look like this:
[
{ "url" : "http://10.0.1.187", "position": 1, "duration": 10 },
{ "url" : "http://10.0.1.189", "position": 2, "duration": 3 }
]
I currently have a dataset that looks like the following
{
"_id" : ObjectId("53a612043c24d08167b26f82"),
"url" : "http://10.0.1.189",
"decks" : [
{
"title" : "Test",
"position" : 2,
"duration" : 3
}
]
}
{
"_id" : ObjectId("53a6103e3c24d08167b26f81"),
"decks" : [
{
"title" : "Test",
"position" : 1,
"duration" : 2
},
{
"title" : "Other Deck",
"position" : 1,
"duration" : 10
}
],
"url" : "http://10.0.1.187"
}
My attempted query looks like:
db.slides.aggregate([
{
"$match": {
"decks.title": "Test"
}
},
{
"$sort": {
"decks.position": 1
}
},
{
"$project": {
"_id": 0,
"position": "$decks.position",
"duration": "$decks.duration",
"url": 1
}
}
]);
But it does not yield my desired results. How can I query my dataset and get my expected results in an optimal way?
Well, to truly "flatten" the document as your title suggests, $unwind is always going to be employed, as there really is no other way to do that. There are however some different approaches if you can live with the array being filtered down to the matching element.
Basically speaking, if you really only have one thing to match in the array then your fastest approach is to simply use .find() matching the required element and projecting:
db.slides.find(
{ "decks.title": "Test" },
{ "decks.$": 1 }
).sort({ "decks.position": 1 }).pretty()
That is still an array but as long as you have only one element that matches then this does work. Also the items are sorted as expected, though of course the "title" field is not dropped from the matched documents, as that is beyond the possibilities for simple projection.
{
"_id" : ObjectId("53a6103e3c24d08167b26f81"),
"decks" : [
{
"title" : "Test",
"position" : 1,
"duration" : 2
}
]
}
{
"_id" : ObjectId("53a612043c24d08167b26f82"),
"decks" : [
{
"title" : "Test",
"position" : 2,
"duration" : 3
}
]
}
Another approach, as long as you have MongoDB 2.6 or greater available, is using the $map operator and some others in order to both "filter" and re-shape the array "in-place" without actually applying $unwind:
db.slides.aggregate([
{ "$project": {
"url": 1,
"decks": {
"$setDifference": [
{
"$map": {
"input": "$decks",
"as": "el",
"in": {
"$cond": [
{ "$eq": [ "$$el.title", "Test" ] },
{
"position": "$$el.position",
"duration": "$$el.duration"
},
false
]
}
}
},
[false]
]
}
}},
{ "$sort": { "decks.position": 1 }}
])
The advantage there is that you can make the changes without "unwinding", which can reduce processing time with large arrays as you are not essentially creating new documents for every array member and then running a separate $match stage to "filter" or another $project to reshape.
{
"_id" : ObjectId("53a6103e3c24d08167b26f81"),
"decks" : [
{
"position" : 1,
"duration" : 2
}
],
"url" : "http://10.0.1.187"
}
{
"_id" : ObjectId("53a612043c24d08167b26f82"),
"url" : "http://10.0.1.189",
"decks" : [
{
"position" : 2,
"duration" : 3
}
]
}
You can again either live with the "filtered" array or if you want you can again "flatten" this truly by adding in an additional $unwind where you do not need to filter with $match as the result already contains only the matched items.
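For comparison, the filter-then-flatten result can be sketched in plain JavaScript against the two sample documents. This is only a model of the output shape, with `slides` and `slidesForDeck` as hypothetical names; it needs a runtime with Array.prototype.flatMap ( Node 11 or later ).

```javascript
const slides = [
  {
    url: "http://10.0.1.189",
    decks: [{ title: "Test", position: 2, duration: 3 }],
  },
  {
    url: "http://10.0.1.187",
    decks: [
      { title: "Test", position: 1, duration: 2 },
      { title: "Other Deck", position: 1, duration: 10 },
    ],
  },
];

function slidesForDeck(docs, title) {
  return docs
    // Filter each array in place, then emit one flat row per matching
    // element, which is what the extra $unwind would do.
    .flatMap((doc) =>
      doc.decks
        .filter((deck) => deck.title === title)
        .map((deck) => ({ url: doc.url, position: deck.position, duration: deck.duration }))
    )
    // Models { "$sort": { "position": 1 } } on the flattened rows.
    .sort((a, b) => a.position - b.position);
}

const rows = slidesForDeck(slides, "Test");
console.log(rows);
```

With the sample data this gives the slide at 10.0.1.187 ( position 1, duration 2 ) followed by the one at 10.0.1.189 ( position 2, duration 3 ).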
But generally speaking if you can live with it then just use .find() as it will be the fastest way. Otherwise what you are doing is fine for small data, or there is the other option for consideration.
Well as soon as I posted I realized I should be using an $unwind. Is this query the optimal way to do it, or can it be done differently?
db.slides.aggregate([
{
"$unwind": "$decks"
},
{
"$match": {
"decks.title": "Test"
}
},
{
"$sort": {
"decks.position": 1
}
},
{
"$project": {
"_id": 0,
"position": "$decks.position",
"duration": "$decks.duration",
"url": 1
}
}
]);