Let's say I have 1,000,000,000 entities in a MongoDB collection, and each entity has 3 numerical properties: A, B, and C.
for example:
entity1 : { A: 35, B: 60, C: 5 }
entity2 : { A: 15, B: 10, C: 55 }
entity3 : { A: 10, B: 10, C: 10 }
...
Now I need to query the database. The input of the query would be 3 numbers: (a, b, c). The result would be a list of entities in descending order as defined by the weighted average, or A * a + B * b + C * c.
so q(1, 100, 1) would return (entity1, entity2, entity3)
and q(1, 1, 100) would return (entity2, entity3, entity1)
Can something like this be achieved with MongoDB, without calculating the weighted average of every entity on every query? I am not bound to MongoDB, but am learning the MEAN stack. If I have to use something else, that is fine too.
NOTE: I chose 1,000,000,000 entities as an extreme example. My actual use case will only have ~5000 entities, so iterating over everything might be OK, I am just interested in a more clever solution.
Well of course you have to calculate it if you are providing input and cannot use a pre-calculated field, but the only difference here would be returning all items and sorting them in the client or letting the server do the work:
var a = 1,
    b = 1,
    c = 100;

db.collection.aggregate(
    [
        { "$project": {
            "A": 1,
            "B": 1,
            "C": 1,
            "weight": {
                "$add": [
                    { "$multiply": [ "$A", a ] },
                    { "$multiply": [ "$B", b ] },
                    { "$multiply": [ "$C", c ] }
                ]
            }
        }},
        { "$sort": { "weight": -1 } }
    ],
    { "allowDiskUse": true }
)
So the key here is the .aggregate() method allows for document manipulation which is required to generate the value on which to apply the $sort.
The calculated value is provided in a $project pipeline stage before this, using $multiply against each field value and its corresponding external variable fed into the pipeline, with a final $add over the results to produce "weight" as a field to sort on.
You cannot directly feed algorithms to any "sort" methods in MongoDB, as they need to act on a field present in the document. The aggregation framework provides the means to "project" this value, so a later pipeline stage can then perform the sort required.
The other case here is that, given the number of documents you are proposing, it is better to supply "allowDiskUse" as an option to force the aggregation process to store processed documents temporarily on disk and not in memory, as there is a restriction on the amount of memory that can be used in an aggregation process without this option.
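Since you mention the MEAN stack, here is a minimal sketch of running the same pipeline from Node.js, assuming the official mongodb driver, a database named test and a collection named entities (all of these names are illustrative, not taken from your setup):

// Sketch: rank entities by an application-supplied weighting (a, b, c).
const { MongoClient } = require("mongodb");

async function rankByWeights(a, b, c) {
    const client = await MongoClient.connect("mongodb://localhost:27017");
    const entities = client.db("test").collection("entities");

    // Same pipeline as above, with the weights injected as plain numbers.
    const results = await entities.aggregate([
        { "$project": {
            "A": 1,
            "B": 1,
            "C": 1,
            "weight": {
                "$add": [
                    { "$multiply": [ "$A", a ] },
                    { "$multiply": [ "$B", b ] },
                    { "$multiply": [ "$C", c ] }
                ]
            }
        }},
        { "$sort": { "weight": -1 } }
    ], { "allowDiskUse": true }).toArray();

    await client.close();
    return results;
}

// e.g. rankByWeights(1, 100, 1).then(docs => console.log(docs));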
I have a MongoDB collection that holds some data. I have simplified the below example but imagine each object has 10 keys, their data types being a mixture of numbers, dates and arrays of numbers and sub-documents.
{
    '_id': ObjectId,
    A: number,
    B: number,
    C: datetime,
    D: [
        number, number, number
    ]
}
I have an application that can send queries against any of the keys A, B, C and D in any combination, for example { A: 1, C: 'ABC' } and { B: 10, D: 2 }. Aside from a couple of fields, it is expected that each query should be performant enough to return in under 5 seconds.
I understand MongoDB compound indexes are only used when the query key order matches that of the index. So even if I made an index on every key { A: 1, B: 1, C: 1, D: 1 }, then queries to { A: 2, D: 1 } would not use the index. Is my best option therefore to make indexes for every combination of keys? This seems quite arduous given the number of keys on each document, but I am unsure how else I could solve this. I have considered making all queries query each key, so that the order is always the same, but I am unsure how I could write a query when a particular key is not queried. For example, if the application wants to query on some value of B, it would also need:
{
    A: SomeAllMatchingValue?,
    B: 1,
    C: SomeAllMatchingValue?,
    D: SomeAllMatchingValue?
}
I am wondering if keeping the least queried fields to the last in the query would make sense, as then index prefixes would work for the majority of the more common use cases, while reducing the number of indexes that need to be generated.
What would be the recommended best practice for this use case? Thanks!
EDIT:
Having researched further, I think the attribute pattern is the way to go. The document keys that are numeric could all be moved into attributes and one index could cover all bases.
https://www.mongodb.com/blog/post/building-with-patterns-the-attribute-pattern
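For illustration, a sketch of what that reshaping might look like (the field names and values below are just assumptions for this example, not a settled design):

// Numeric keys are moved into an "attributes" array of { k, v } pairs.
db.collection.insertOne({
    C: ISODate("2020-01-01T00:00:00Z"),
    attributes: [
        { k: "A", v: 1 },
        { k: "B", v: 10 }
    ]
})

// A single compound index then covers equality queries on any attribute:
db.collection.createIndex({ "attributes.k": 1, "attributes.v": 1 })

// e.g. querying for A = 1:
db.collection.find({ attributes: { $elemMatch: { k: "A", v: 1 } } })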
Your case seems a perfect use case for a wildcard index, which was introduced in MongoDB v4.2.
You can create a wildcard index of all top-level fields like this:
db.collection.createIndex( { "$**" : 1 } )
Running with arbitrary criteria:
{
    D: 3,
    B: 2
}
or
{
    A: 1,
    C: ISODate('1970-01-01T00:00:00.000+00:00')
}
will result in an IXSCAN in explain():
{
    "explainVersion": "1",
    "queryPlanner": {
        ...
        "parsedQuery": {
            "$and": [
                ...
            ]
        },
        ...
        "winningPlan": {
            "stage": "FETCH",
            "filter": {
                "C": {
                    "$eq": {
                        "$date": {
                            "$numberLong": "0"
                        }
                    }
                }
            },
            "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                    "$_path": 1,
                    "A": 1
                },
                "indexName": "$**_1",
                ...
                "indexBounds": {
                    ...
                }
            }
        },
        "rejectedPlans": [
            ...
        ]
    },
    ...
}
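As an aside, if there are top-level fields you never query on, the wildcardProjection option can restrict which paths the wildcard index covers. A sketch that excludes D (the choice of D here is purely illustrative):

db.collection.createIndex(
    { "$**": 1 },
    { "wildcardProjection": { "D": 0 } }
)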
Let's say I have a collection that looks like:
{
    _id: 'aaaaaaaaaaaaaaaaaaaaaaaaa',
    score: 10,
    hours: 50
},
{
    _id: 'aaaaaaaaaaaaaaaaaaaaaaaab',
    score: 5,
    hours: 55
},
{
    _id: 'aaaaaaaaaaaaaaaaaaaaaaaac',
    score: 15,
    hours: 60
}
I want to sort this list by a custom order, namely
value = (score - 1) / (T + 2) ^ G
score: score
T: current_hours - hours
G: some constant
How do I do this? I assume this is going to require writing a custom sorting function that compares the score and hours fields in addition to taking a current_hours as an input, performs that comparison and returns the sorted list. Note that hours and current_hours are simply the number of hours that have elapsed since some arbitrary starting point. So if I'm running this query 80 hours after the application started, current_hours takes the value of 80.
Creating an additional field value and keeping it constantly updated is probably too expensive for millions of documents.
I know that if this is possible, this is going to look something like
db.items.aggregate([
    { "$project" : {
        "_id" : 1,
        "score" : 1,
        "hours" : 1,
        "value" : { SOMETHING HERE, ALSO REQUIRES PASSING current_hours }
    }},
    { "$sort" : { "value" : 1 } }
])
but I don't know what goes into value
I think value will look something like this:
"value": {
    $let: {
        vars: {
            score: "$score",
            t: {
                "$subtract": [
                    80,
                    "$hours"
                ]
            },
            g: 3
        },
        in: {
            "$divide": [
                {
                    "$subtract": [
                        "$$score",
                        1
                    ]
                },
                {
                    "$pow": [
                        {
                            "$add": [
                                "$$t",
                                2
                            ]
                        },
                        "$$g"
                    ]
                }
            ]
        }
    }
}
Playground example here
Although it's verbose, it should be reasonably straightforward to follow. It uses the arithmetic expression operators to build the calculation that you are requesting. A few specific notes:
We use $let here to set some vars for usage. This includes the "runtime" value for current_hours (80 in the example per the description) and 3 as an example for G. We also "reuse" score here which is not strictly necessary, but done for consistency of the next point.
$ refers to fields in the document, whereas $$ refers to variables. That's why everything in the vars definition uses $ and everything for the actual calculation in in uses $$. The reference to score inside of in could have been done via just the field name ($), but I personally prefer the consistency of this approach.
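Putting the expression into the full pipeline from the question, with current_hours and G supplied by the application at query time (80 and 3 here simply match the example values above), the call would look something like:

var current_hours = 80;
var G = 3;

db.items.aggregate([
    { "$project": {
        "_id": 1,
        "score": 1,
        "hours": 1,
        "value": {
            "$let": {
                "vars": {
                    "score": "$score",
                    "t": { "$subtract": [ current_hours, "$hours" ] },
                    "g": G
                },
                "in": {
                    "$divide": [
                        { "$subtract": [ "$$score", 1 ] },
                        { "$pow": [ { "$add": [ "$$t", 2 ] }, "$$g" ] }
                    ]
                }
            }
        }
    }},
    { "$sort": { "value": 1 } }
])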
I have a structure whereby a user-created object can end up in a specific Document key. I know what the key is, but I have no idea what the structure of the underlying value is. For the purposes of my problem, let's assume it's an array, a single value, or a dictionary.
For extra fun, I am also trying to solve this problem for nested dictionaries.
What I am trying to do is run an aggregation across all objects that have this key, and summarize the values of the terminal nodes of the structure. For example, if I have the following:
ObjectA.foo = {"a": 2, "b": 4}
ObjectB.foo = {"a": 8, "b": 16}
ObjectC.bar = {"nested": {"d": 20}}
ObjectD.bar = {"nested": {"d": 30}}
I want to end up with an output value of
foo.a = 10
foo.b = 20
bar.nested.d = 50
My initial thought is to try to figure out how to get Mongo to flatten the keys of the hierarchy. If I could break the source data down from objects to a series of key-values where a key represents the entire path to the value, I could easily do the aggregation on that. However, I am not sure how to do that.
Ideally, I'd have something like $unwindKeys, but alas there is no such operator. There is $objectToArray, which I imagine I could then $unwind, but at that point I already start getting lost in stacking these operators. It also does not answer the problem of arbitrary depth, though I suppose a single-depth solution would be a good start.
Any ideas?
EDIT: So I've solved the single-depth problem using $objectToArray. Behold:
db.mytable.aggregate(
    [
        {
            '$project': {
                '_id': false,
                'value': {
                    '$objectToArray': '$data.input_field_with_dict'
                }
            }
        },
        {
            '$unwind': '$value'
        },
        {
            '$group': {
                '_id': '$value.k',
                'sum': {
                    '$sum': '$value.v'
                }
            }
        }
    ]
)
This will give you key-value pairs across your chosen docs that you can then iterate on. So in case of my sample above involving ObjectA and ObjectB, the result of the above query would be:
{"_id": "a", "sum": 10}
{"_id": "b", "sum": 20}
I still don't know how to traverse the structure recursively though. The $objectToArray solution works fine on a single known level with unknown keys, but I don't have a solution if you have both unknown keys and unknown depth.
The search goes on: how do I recursively sum or at least project fields with nested structures and preserve their key sequences? In other words, how do I flatten a structure of unknown depth? If I could flatten, I could easily aggregate on keys at that point.
If your collection is like this
/* 1 */
{
    "a" : 2,
    "b" : 4
}

/* 2 */
{
    "a" : 8,
    "b" : 16
}

/* 3 */
{
    "nested" : {
        "d" : 20
    }
}

/* 4 */
{
    "nested" : {
        "d" : 30
    }
}
the query below will get you the required result.
db.sof.aggregate([
    { '$group': {
        '_id': null,
        'a': { $sum: '$a' },
        'b': { $sum: '$b' },
        'd': { $sum: '$nested.d' }
    }}
])
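That does require knowing the field names in advance, though. For the "unknown keys and unknown depth" case raised in the question, one client-side sketch (assuming the documents are small enough to iterate in the shell or a driver; the collection and field layout follow the earlier examples) is to flatten each document recursively into dotted key paths and accumulate the sums per path:

// Recursively flatten a (sub)document into { "dotted.path": value } pairs.
function flatten(obj, prefix, out) {
    for (var k in obj) {
        var path = prefix ? prefix + "." + k : k;
        if (obj[k] !== null && typeof obj[k] === "object" && !Array.isArray(obj[k])) {
            flatten(obj[k], path, out);
        } else {
            out[path] = obj[k];
        }
    }
    return out;
}

// Sum every numeric terminal node, keyed by its full path.
var totals = {};
db.mytable.find({}, { _id: 0 }).forEach(function (doc) {
    var flat = flatten(doc, "", {});
    for (var path in flat) {
        if (typeof flat[path] === "number") {
            totals[path] = (totals[path] || 0) + flat[path];
        }
    }
});
printjson(totals);   // e.g. { "foo.a": 10, "foo.b": 20, "bar.nested.d": 50 }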
I have startTime and endTime for all records like this:
{
    startTime : 21345678,
    endTime : 31345678
}
I am trying to find the number of all the conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)].
As an algorithm, I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with a start time less than that end time. This will be O(n^2) time. A better approach would be using an interval tree, inserting each record into the tree and finding the counts when overlaps occur. This will be O(n log n) time.
I have not used mongoDB much so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. What follows basically covers how they are done; which one you implement really depends on what your data and use case are best suited to.
Current Range Match
MongoDB 3.6 $lookup
The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
    { "$lookup": {
        "from": "collection",
        "let": {
            "_id": "$_id",
            "starttime": "$starttime",
            "endtime": "$endtime"
        },
        "pipeline": [
            { "$match": {
                "$expr": {
                    "$and": [
                        { "$ne": [ "$$_id", "$_id" ] },
                        { "$or": [
                            { "$and": [
                                { "$gte": [ "$$starttime", "$starttime" ] },
                                { "$lte": [ "$$starttime", "$endtime" ] }
                            ]},
                            { "$and": [
                                { "$gte": [ "$$endtime", "$starttime" ] },
                                { "$lte": [ "$$endtime", "$endtime" ] }
                            ]}
                        ]}
                    ]
                }
            }},
            { "$count": "count" }
        ],
        "as": "overlaps"
    }},
    { "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection, allowing you to keep the "current document" values for "_id", "starttime" and "endtime" via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in the subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime" $or "endtime" values of the "current document" fall between the same properties of the "processed document". Noting here that these, as well as the respective $gte and $lte operators, are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents, and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real concern. So it's really just something to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
    var overlaps = db.getCollection('collection').find({
        "_id": { "$ne": d._id },
        "$or": [
            { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
            { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
        ]
    }).toArray();

    return ( overlaps.length !== 0 )
        ? Object.assign(
            d,
            {
                "overlaps": {
                    "count": overlaps.length,
                    "documents": overlaps
                }
            }
        )
        : null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.
The output is then simply merged with the existing document, where the other important distinction is that since these are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering that there will be results where the count ( or array length ) is 0, and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server, and this filtering is "post processing" in some form or another.
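For completeness, a sketch of the count-only variant mentioned above (same query conditions, but counting on the server instead of fetching the overlapping documents):

db.getCollection('collection').find().map( d => {
    // Only count the overlapping documents rather than fetching them.
    var overlaps = db.getCollection('collection').count({
        "_id": { "$ne": d._id },
        "$or": [
            { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
            { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
        ]
    });

    return ( overlaps > 0 )
        ? Object.assign(d, { "overlaps": { "count": overlaps } })
        : null;
}).filter(e => e != null);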
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution "where you can make it fit" is to instead store a "hard value" representative of the interval on each document. For instance we could "presume" that there are solid "booking" periods of one hour within a day for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
    { "$unwind": "$booking" },
    { "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
    { "$match": { "docs.1": { "$exists": true } } },
    { "$unwind": "$docs" },
    { "$group": {
        "_id": "$docs",
        "intervals": { "$push": "$_id" }
    }}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where the data you have allows it for the sort of analysis required, this is the far more efficient approach. So if you can keep the "granularity" fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you basically mentioned as a "better" approach anyway, with the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.
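If your source documents only store starttime and endtime, a rough sketch of how such an interval array could be derived when writing documents (assuming hour-long slots counted from an arbitrary epoch in milliseconds; the names and values here are purely illustrative):

// Bucket a time range into whole-hour slots.
function hourSlots(starttime, endtime) {
    var slots = [];
    var first = Math.floor(starttime / 3600000);
    var last = Math.floor(endtime / 3600000);
    for (var h = first; h <= last; h++) {
        slots.push(h);
    }
    return slots;
}

var start = Date.now(),
    end = start + 2 * 3600000;   // a two-hour booking

db.booking.insertOne({
    "_id": "E",
    "starttime": start,
    "endtime": end,
    "booking": hourSlots(start, end)
})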
My gut feeling is that the answer is no, but is it possible to perform a search in Mongodb comparing the similarity of arrays where order is important?
E.g.
I have three documents like so
{'_id':1, "my_list": ["A",2,6,8,34,90]},
{'_id':2, "my_list": ["A","F",2,6,19,8,90,55]},
{'_id':3, "my_list": [90,34,8,6,3,"A"]}
1 and 2 are the most similar, 3 is wildly different irrespective of the fact it contains all of the same values as 1.
Ideally I would do a search similar to {"my_list" : ["A",2,6,8,34,90] } and the results would be document 1 and 2.
It's almost like a regex search with wild cards. I know I can do this in python easily enough, but speed is important and I'm dealing with 1.3 million documents.
Any "comparison" or "selection" is actually more or less subjective to the actual logic applied. But as a general principle you could always consider the product of the matched indices from the array to test against and the array present in the document. For example:
var sample = ["A",2,6,8,34,90];
db.getCollection('source').aggregate([
    { "$match": { "my_list": { "$in": sample } } },
    { "$addFields": {
        "score": {
            "$add": [
                { "$cond": {
                    "if": {
                        "$eq": [
                            { "$size": { "$setIntersection": [ "$my_list", sample ] } },
                            { "$size": { "$literal": sample } }
                        ]
                    },
                    "then": 100,
                    "else": 0
                }},
                { "$sum": {
                    "$map": {
                        "input": "$my_list",
                        "as": "ml",
                        "in": {
                            "$multiply": [
                                { "$indexOfArray": [
                                    { "$reverseArray": "$my_list" },
                                    "$$ml"
                                ]},
                                { "$indexOfArray": [
                                    { "$reverseArray": { "$literal": sample } },
                                    "$$ml"
                                ]}
                            ]
                        }
                    }
                }}
            ]
        }
    }},
    { "$sort": { "score": -1 } }
])
Would return the documents in order like this:
/* 1 */
{
    "_id" : 1.0,
    "my_list" : [ "A", 2, 6, 8, 34, 90 ],
    "score" : 155.0
}

/* 2 */
{
    "_id" : 2.0,
    "my_list" : [ "A", "F", 2, 6, 19, 8, 90, 55 ],
    "score" : 62.0
}

/* 3 */
{
    "_id" : 3.0,
    "my_list" : [ 90, 34, 8, 6, 3, "A" ],
    "score" : 15.0
}
The key being that with $reverseArray applied, the values produced by $indexOfArray will be "larger" for matches that occur earlier in the original "first to last" order ( since the arrays are reversed ), which gives a larger "weight" to matches at the beginning of the array than to those towards the end.
Of course you should take into consideration that the second document does in fact contain "most" of the matches, and that having more array entries places a "larger" weight on the initial matches than in the first document.
From the above "A" scores more in the second document than in the first because the array is longer even though both matched "A" in the first position. However there is also some effect that "F" is a mismatch and therefore has a greater negative effect than it would if it was later in the array. Same applies to "A" in the last document, where at the end of the array the match has little bearing on the overall weight.
The counter to this in consideration is to add some logic to consider the "exact match" case, such as here the $size comparison from the $setIntersection of the sample and the current array. This would adjust the scores to ensure that something that matched all provided elements actually scored higher than a document with less positional matches, but more elements overall.
With a "score" in place you can then filter out results ( e.g. with $limit ) or apply whatever other logic you need in order to only return the actual results wanted. But the first step is calculating a "score" to work from.
So it's all generally subjective to what logic actually means a "nearest match", but the $reverseArray and $indexOfArray operations are generally key to putting "more weight" on the earlier index matches rather than the last.
Overall you are looking for "calculation" of logic. The aggregation framework has some of the available operators, but which ones actually apply are up to your end implementation. I'm just showing something that "logically works" to put more weight on "earlier matches" in an array comparison rather than "latter matches", and of course the "most weight" where the arrays are actually the same.
NOTE: Similar logic could be achieved using the includeArrayIndex option of $unwind for earlier versions of MongoDB that lack the main operators used above. However the process requires usage of $unwind to deconstruct the arrays in the first place, and the performance hit this would incur would probably negate the effectiveness of the operation.