MongoDB: how to aggregate data from different collections based on id

I have documents in which the data looks like this:
collection A
{
    "_id" : 69.0,
    "values" : [
        {
            "date_data" : "2016-12-16 10:00:00",
            "valueA" : 8,
            "valueB" : 9
        },
        {
            "date_data" : "2016-12-16 11:00:00",
            "valueA" : 8,
            "valueB" : 9
        },
        ...
    ]
}
collection B
{
    "_id" : 69.0,
    "values" : [
        {
            "date_data" : "2017-12-16 10:00:00",
            "valueA" : 8,
            "valueB" : 9
        },
        {
            "date_data" : "2017-12-16 11:00:00",
            "valueA" : 8,
            "valueB" : 9
        },
        ...
    ]
}
Data is stored every hour, and since it all accumulates in one document, a document may reach the 16MB limit at some point. That's why I'm thinking of spreading the data across years: in each collection, every id would hold its data on a yearly basis. But when we want to show the data combined, how can we use the aggregate function?
For example, collectionA has data from 7 Dec '16 to 7 Dec '17 and collectionB has data from 6 Dec '15 to 6 Dec '16. How can I show the data between 1 Dec '16 and 1 Jan '17, which sits in different collections?

Very simple: use MongoDB's $lookup stage, which is the equivalent of a left outer join. All the documents on the left are scanned for a value inside a field, and the documents from the right (considered the foreign documents) are matched against that value. For your case, the data would be split into a parent collection and a child collection, for example (a hypothetical sketch; the "uniques"/"values" collection names and the "parent_id" field are assumptions used by the query below):
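// Parent collection "uniques" (hypothetical)
{ "_id" : 69.0 }

// Child collection "values" (hypothetical): one document per reading,
// each pointing back to its parent via "parent_id"
{
    "_id" : ObjectId("..."),
    "parent_id" : 69.0,
    "date_data" : "2016-12-16 10:00:00",
    "valueA" : 8,
    "valueB" : 9
}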
Now all we have to do is query from the parent collection with a very simple $lookup aggregation:
db.getCollection('uniques').aggregate([
    {
        "$lookup": {
            "from": "values",             // Get data from the "values" collection
            "localField": "_id",          // The _id field of the current collection "uniques"
            "foreignField": "parent_id",  // The foreign field containing a matching value
            "as": "related"               // An array containing all items under 69
        }
    },
    {
        "$unwind": "$related"             // Unwind that array
    },
    {
        "$project": {                     // Project only what you need
            "value_id": "$related._id",
            "date": "$related.date_data",
            "a": "$related.valueA",
            "b": "$related.valueB"
        }
    }
], { "allowDiskUse": true })
Remember a few things:
The local field for the lookup doesn't care whether you have indexed it or not, so run the aggregation over the collection with the fewest documents.
The foreign field works best when indexed, or directly on an _id.
There is an option to specify a pipeline and do some custom filtering work while matching; I won't recommend it, as pipelines are ridiculously slow.
Don't forget "allowDiskUse" if you are going to aggregate large amounts of data.

Related

Matching on a tuple key in an array on a mongo document

Background
I have a collection of items. Here is one:
{ "_id" : ObjectId("5d3e315975132b3a43149225"),
"thetime" : 201812,
"name": "watermelon",
"cost" : 7,
"info" : "empty"
"taglist" : [
{ "color" : "red" },
{ "color": "green"},
{ "store" : "market" },
{ "taste" : "sweet" } ]
}
I am trying to do an aggregation with a $match on every item that has the key color in its taglist (at least once). Later I want to group on the total cost per color, or per store, etc. So my output for just this item collection would be (red: $7, green: $7). Point is: I don't want to use find(); I want to use $match.
Question:
How do I match on a tuple key in an array?
What I have so far
This query works for getting items that have the key/value pair (color: red):
db.items.aggregate([ { $match: { "taglist": { "color": "red" } } } ]);
But, I do not know how to change the query to return all the items with any color.
Note: I'd prefer to not start with an $unwind because the documents can actually be larger than this one and performance is important.
If you want to return all the data that has the key color, you can use $exists:
db.items.aggregate([ { $match: { "taglist.color": { $exists: true } } } ])
Before the match you need to unwind taglist:
db.items.aggregate([
    {
        $unwind: "$taglist"
    },
    {
        $match: { "taglist.color": { $exists: true } }
    }
]);
On the basis of the key you can then group further.
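For example, a minimal sketch of that grouping, summing cost per color as the question asks (field names taken from the sample document):

db.items.aggregate([
    { $unwind: "$taglist" },
    { $match: { "taglist.color": { $exists: true } } },
    // Group on the color value and sum the cost of each matching item
    { $group: { _id: "$taglist.color", totalCost: { $sum: "$cost" } } }
]);
// For the sample item this yields { _id : "red", totalCost : 7 }
// and { _id : "green", totalCost : 7 }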

Projecting nested documents in Mongo

I am trying to find and project from a nested structure. For example, I have the following document where each unit might have an embedded sub-unit:
{
"_id" : 1,
"unit" : {
"_id" : 2,
"unit" : {
"_id" : 3,
"unit" : {
"_id" : 4
}
}
}
}
And I want to get the id's of all the subunits under unit 1:
[{_id:2}, {_id:3}, {_id:4}]
$graphLookup does not seem to handle this kind of nested structure. As far as I understand, it works when the units are saved at a single level, without nesting, and each keeps a reference to its parent unit.
What is the correct way to retrieve the desired result?
Firstly, $graphLookup isn't the operator for your problem, because it performs a recursive search on a collection, not a recursive search within a document:
$graphLookup performs a recursive search on a collection, with options
for restricting the search by recursion depth and query filter.
Therefore, it does not search recursively inside your document; it only searches recursively across a collection (that is, across multiple documents), so it cannot handle your problem.
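For contrast, a minimal sketch of the flat, parent-referencing shape that $graphLookup does handle (the "units" collection and "parent" field here are hypothetical):

// { _id: 1, parent: null }
// { _id: 2, parent: 1 }
// { _id: 3, parent: 2 }
// { _id: 4, parent: 3 }
db.units.aggregate([
    { $match: { _id: 1 } },
    { $graphLookup: {
        from: "units",
        startWith: "$_id",
        connectFromField: "_id",
        connectToField: "parent",
        as: "subunits"
    }}
])
// "subunits" would then contain the documents with _id 2, 3 and 4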
As for your problem, I think it is not MongoDB's responsibility, because you have already retrieved the document you wanted. You want to parse the retrieved document into an array of sub-documents, and you can do that in your language.
For example, if you use JavaScript (Node.js for the backend), you can parse this document into an array:
const a = {
    "_id": 1,
    "unit": {
        "_id": 2,
        "unit": {
            "_id": 3,
            "unit": {
                "_id": 4
            }
        }
    }
};

// Recursively collect this unit's _id followed by those of all nested sub-units
const parse = o => {
    const { _id } = o;
    if (!o.unit) return [{ _id }];
    return [{ _id }, ...parse(o.unit)];
};

console.log(parse(a.unit)); // [ { _id: 2 }, { _id: 3 }, { _id: 4 } ]
You cannot do that from a MongoDB query. MongoDB will return the document with _id: 1 and will not search recursively inside the document.
What you can do is: retrieve the document from MongoDB, then parse it into a map-like object and retrieve the information from that map, recursively.
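A minimal sketch of that post-processing, assuming doc is the document retrieved for _id: 1 (iterative rather than recursive):

// Walk the nested "unit" chain and collect each _id
const subunits = [];
let node = doc.unit;
while (node) {
    subunits.push({ _id: node._id });
    node = node.unit;
}
// subunits => [ { _id: 2 }, { _id: 3 }, { _id: 4 } ]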

MongoDB query for finding number of people with conflicting schedules [duplicate]

I have startTime and endTime for all records, like this:
{
    "startTime" : 21345678,
    "endTime" : 31345678
}
I am trying to find the number of all the conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)].
As an algorithm, I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with a start time less than that end time. This will be O(n^2) time. A better approach would be using an interval tree, inserting each record into the tree, and finding the counts when overlaps occur. This will be O(n log n) time.
I have not used MongoDB much, so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done, and which one you implement really depends on what your data and use case are best suited to.
Current Range Match
MongoDB 3.6 $lookup
The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
  { "$lookup": {
    "from": "collection",
    "let": {
      "_id": "$_id",
      "starttime": "$starttime",
      "endtime": "$endtime"
    },
    "pipeline": [
      { "$match": {
        "$expr": {
          "$and": [
            { "$ne": [ "$$_id", "$_id" ] },
            { "$or": [
              { "$and": [
                { "$gte": [ "$$starttime", "$starttime" ] },
                { "$lte": [ "$$starttime", "$endtime" ] }
              ]},
              { "$and": [
                { "$gte": [ "$$endtime", "$starttime" ] },
                { "$lte": [ "$$endtime", "$endtime" ] }
              ]}
            ]}
          ]
        }
      }},
      { "$count": "count" }
    ],
    "as": "overlaps"
  }},
  { "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime"
$or "endtime" values of the "current document" falls between the same properties of the "processed document". Noting here that these as well as the respective $gte and $lte operators are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents, and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real concern. So it's really something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
  var overlaps = db.getCollection('collection').find({
    "_id": { "$ne": d._id },
    "$or": [
      { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
      { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
  }).toArray();

  return ( overlaps.length !== 0 )
    ? Object.assign(
        d,
        {
          "overlaps": {
            "count": overlaps.length,
            "documents": overlaps
          }
        }
      )
    : null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put, we return the array rather than the cursor via .toArray(), so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is no document-fetching overhead.
The output is then simply merged with the existing document, where the other important distinction is that since these are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering that there will be results where the count ( or array length ) is 0, and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server, and this filtering is "post processing" in some form or another.
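A minimal sketch of the count-only variant mentioned above, under the same query conditions:

db.getCollection('collection').find().map(d => {
  // Count the overlapping documents without fetching them
  var count = db.getCollection('collection').count({
    "_id": { "$ne": d._id },
    "$or": [
      { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
      { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
  });
  return ( count !== 0 )
    ? Object.assign(d, { "overlaps": { "count": count } })
    : null;
}).filter(e => e != null);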
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution, "where you can make it fit", is to instead store a "hard value" representative of the interval on each document. For instance, we could "presume" that there are solid "booking" periods of one hour within a day, for a total of 24 booking periods. This "could" be represented as something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } },
{ "$unwind": "$docs" },
{ "$group": {
"_id": "$docs",
"intervals": { "$push": "$_id" }
}}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where the data you have would allow it for the sort of analysis required then this is the far more efficient approach. So if you can keep the "granularity" to be fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you mentioned as a "better" approach anyway, with the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.

How to join to two additional collections with conditions

select tb1.*,tb3 from tb1,tb2,tb3
where tb1.id=tb2.profile_id and tb2.field='<text>'
and tb3.user_id = tb2.id and tb3.status =0
I actually converted the SQL into the following Mongo aggregation, which is the query I used:
db.getCollection('tb1').aggregate([
    { $lookup:
        {
            from: 'tb2',
            localField: 'id',
            foreignField: 'profile_id',
            as: 'tb2detail'
        }
    },
    { $lookup:
        {
            from: 'tb3',
            localField: 'tb2.id',
            foreignField: 'user_id',
            as: 'tb3details'
        }
    },
    { $match:
        {
            'status': { '$ne': 'closed' },
            'tb2.profile_type': 'agent',
            'tb3.status': 0
        }
    }
])
but I am not getting the expected result.
Any help will be appreciated.
What you are missing here is that $lookup produces an "array" in the output field specified by as in its arguments. This is the general concept of MongoDB "relations", in that a "relation" between documents is represented as a "sub-property" that is "within" the document itself, being either singular or an "array" for many.
Since MongoDB is "schemaless", the general presumption of $lookup is that you mean "many" and the result is therefore "always" an array. So when looking for the "same result as in SQL", you need to $unwind that array after the $lookup. Whether it's "one" or "many" is of no consequence, since it's still "always" an array:
db.getCollection('tb1').aggregate([
// Filter conditions from the source collection
{ "$match": { "status": { "$ne": "closed" } }},
// Do the first join
{ "$lookup": {
"from": "tb2",
"localField": "id",
"foreignField": "profileId",
"as": "tb2"
}},
// $unwind the array to denormalize
{ "$unwind": "$tb2" },
// Then match on the condition for tb2
{ "$match": { "tb2.profile_type": "agent" } },
// join the second additional collection
{ "$lookup": {
"from": "tb3",
"localField": "tb2.id",
"foreignField": "id",
"as": "tb3"
}},
// $unwind again to de-normalize
{ "$unwind": "$tb3" },
// Now filter the condition on tb3
{ "$match": { "tb3.status": 0 } },
// Project only wanted fields. In this case, exclude "tb2"
{ "$project": { "tb2": 0 } }
])
Here you need to note the other things you are missing in the translation:
Sequence is "important"
Aggregation pipelines are more "tersely expressive" than SQL. They are in fact best considered as "a sequence of steps" applied to the datasource in order to collate and transform the data. The best analog to this is "piped" command line instructions, such as:
ps -ef | grep mongod | grep -v grep | awk '{ print $1 }'
Where the "pipe" | can be considered as a "pipeline stage" in a MongoDB aggregation "pipeline".
As such we want to $match in order to filter things from the "source" collection as our first operation. And this is generally good practice since it removes any documents that did not meet required conditions from further conditions. Just like what is happening in our "command line pipe" example, where we take "input" then "pipe" to a grep to "remove" or "filter".
Paths Matter
Where the very next thing you do here is "join" via $lookup. The result is an "array" of the items from the "from" collection argument matched by the supplied fields to output in the "as" "field path" as an "array".
The naming chosen here is important, since now the "document" from the source collection considers all items from the "join" to now exist at that given path. To make this easy, I use the same "collection" name as the "join" for the new "path".
So starting from the first "join" the output is to "tb2" and that will hold all the results from that collection. There is also an important thing to note with the following sequence of $unwind and then $match, as to how MongoDB actually processes the query.
Certain Sequences "really" matter
Since it "looks like" there are "three" pipeline stages, being $lookup then $unwind and then $match. But in "fact" MongoDB really does something else, which is demonstrated in the output of { "explain": true } added to the .aggregate() command:
{
"$lookup" : {
"from" : "tb2",
"as" : "tb2",
"localField" : "id",
"foreignField" : "profileId",
"unwinding" : {
"preserveNullAndEmptyArrays" : false
},
"matching" : {
"profile_type" : {
"$eq" : "agent"
}
}
}
},
{
"$lookup" : {
"from" : "tb3",
"as" : "tb3",
"localField" : "tb2.id",
"foreignField" : "id",
"unwinding" : {
"preserveNullAndEmptyArrays" : false
},
"matching" : {
"status" : {
"$eq" : 0.0
}
}
}
},
So aside from the first point of "sequence" applying where you need to put the $match statements where they are needed and do the "most good", this actually becomes "really important" with the concept of "joins". The thing to note here is that our sequences of $lookup then $unwind and then $match, actually get processed by MongoDB as just the $lookup stages, with the other operations "rolled up" into the one pipeline stage for each.
This is an important distinction to other ways of "filtering" the results obtained by $lookup. Since in this case, the actual "query" conditions on the "join" from $match are performed on the collection to join "before" the results are returned to the parent.
This in combination with $unwind ( which is translated into unwinding ) as shown above is how MongoDB actually deals with the possibility that the "join" could result in producing an array of content in the source document which causes it to exceed the 16MB BSON limit. This would only happen in cases where the result being joined to is very large, but the same advantage is in where the "filter" is actually applied, being on the target collection "before" results are returned.
It is that kind of handling that best "correlates" to the same behavior as a SQL JOIN. It is also therefore the most effective way to obtain results from a $lookup where there are other conditions to apply to the JOIN aside from simply the "local" or "foreign" key values.
Also note that the other behavior change from what is essentially a LEFT JOIN performed by $lookup is that the "source" document would always be retained regardless of the presence of a matching document in the "target" collection. Instead, the $unwind adds to this by "discarding" any results from the "source" which did not have anything matching from the "target" by the additional conditions in $match.
In fact they are even discarded beforehand due to the implied preserveNullAndEmptyArrays: false which is included and would discard anything where the "local" and "foreign" keys did not even match between the two collections. This is a good thing for this particular type of query as the "join" is intended to the "equal" on those values.
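If you did want to preserve true LEFT JOIN behavior instead, a minimal sketch is the object form of $unwind with the preserving option, in place of the bare $unwind stages above:

// Keep "tb1" documents even when the join produced no matches
{ "$unwind": { "path": "$tb2", "preserveNullAndEmptyArrays": true } }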
Conclude
As noted before, MongoDB generally treats "relations" a lot differently to how you would use a "Relational Database" or RDBMS. The general concept of "relations" is in fact "embedding" the data, either as a single property or as an array.
You may actually desire such output, which is also part of the reason why, without the $unwind sequence here, the output of $lookup is actually an "array". However, using $unwind in this context is actually the most effective thing to do, as well as giving a guarantee that the "joined" data does not cause the aforementioned BSON limit to be exceeded as a result of that "join".
If you actually want arrays of output, then the best thing to do here would be to use the $group pipeline stage, and possibly as multiple stages in order to "normalize" and "undo the results" of $unwind:
{ "$group": {
"_id": "$_id",
"tb1_field": { "$first": "$tb1_field" },
"tb1_another": { "$first": "$tb1_another" },
"tb3": { "$push": "$tb3" }
}}
Where you would in fact for this case list all the fields you required from "tb1" by their property names using $first to only keep the "first" occurrence ( essentially repeated by results of "tb2" and "tb3" unwound ) and then $push the "detail" from "tb3" into an "array" to represent the relation to "tb1".
But the general form of the aggregation pipeline as given is the exact representation of how results would be obtained from the original SQL, which is "denormalized" output as a result of the "join". Whether you want to "normalize" results again after this is up to you.

MongoDB insert or add in embedded document

This is my final document structure. I am on mongo 3.0.
{
_id: "1",
stockA: [ { size: "S", qty: 25 }, { size: "M", qty: 50 } ],
stockB: [ { size: "S", qty: 27 }, { size: "M", qty: 58 } ]
}
I am randomly adding elements inside the stockA or stockB arrays.
I would like to come up with a query which satisfies the following rules:
If the item does not exist, create the whole document and insert the element into the stockA or stockB array.
If the item exists but the stockA or stockB array is missing, create the array and add the element to it.
If the item exists and the array exists, just append the element to the array.
Question: Is it possible to achieve all that in one query? My main requirement is that the insert has to be extremely fast and scalable.
If yes, could you please help me come up with the required query?
If no, could you please guide me on how to achieve the requirements in the fastest and cleanest way?
Thanks for any help!
The operator you are looking for is $addToSet or $push in combination with upsert:true.
Both operators can take multiple key/value pairs. It will then operate separately on each key. Both will create the array when it doesn't exist yet, and in either case will add the value.
The difference is that $addToSet will first check if an identical element already exists in the array. When performance is an issue, keep in mind that MongoDB must perform a linear search to do this. So for very large arrays, $addToSet can become quite slow.
The upsert:true option to the update function will result in the document being created when it isn't found.
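For example, a minimal sketch of the $addToSet variant with upsert (same document shape as in the question):

db.collection.update(
    { "_id": 1 },
    {
        "$addToSet": {
            // Each array is created if missing; an identical element
            // already present in the array is not added again
            "stockA": { "size": "M", "qty": 21 },
            "stockB": { "size": "M", "qty": 51 }
        }
    },
    { "upsert": true }
)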
You can use the update method with the upsert option which will create a new document when no document matches the query criteria.
db.collection.update({ "_id": 1 },
{
"$push" : {
"stockB" : {
"size" : "M",
"qty" : 51
},
"stockA" : {
"size" : "M",
"qty" : 21
}
}
},
{ "upsert": true }
)