I have a MongoDB collection that holds some data. I have simplified the example below, but imagine each object has 10 keys, with data types that are a mixture of numbers, dates, and arrays of numbers and sub-documents.
{
'_id': ObjectId,
A: number,
B: number,
C: datetime,
D: [
number, number, number
]
}
I have an application that can send queries against any of the keys A, B, C and D in any combination, for example { A: 1, C: 'ABC' } and { B: 10, D: 2 }. Aside from a couple of fields, it is expected that each query should be performant enough to return in under 5 seconds.
I understand MongoDB compound indexes are only used when the query key order matches that of the index. So even if I made an index on every key { A: 1, B: 1, C: 1, D: 1 }, queries on { A: 2, D: 1 } would not use the index. Is my best option therefore to make indexes for every combination of keys? This seems quite arduous given the number of keys on each document, but I am unsure how else I could solve this. I have considered making every query touch every key, so that the order is always the same, but I am unsure how I would write a query when a particular key is not being queried. For example, if the application wants to query on some value of B, it would also need:
{
A: SomeAllMatchingValue?,
B: 1,
C: SomeAllMatchingValue?,
D: SomeAllMatchingValue?
}
I am wondering whether keeping the least queried fields last in the key order would make sense, as then index prefixes would cover the majority of the more common use cases while reducing the number of indexes that need to be created.
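(As I understand it, a compound index like { A: 1, B: 1, C: 1, D: 1 } fully supports queries on its prefixes { A }, { A, B }, { A, B, C } and { A, B, C, D }, but a query on { B, D } alone would not use it efficiently.)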
What would be the recommended best practice for this use case? Thanks!
EDIT:
Having researched further, I think the attribute pattern is the way to go. The numeric document keys could all be moved into an attributes array, and one index could cover all the bases.
https://www.mongodb.com/blog/post/building-with-patterns-the-attribute-pattern
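As a rough sketch of what I mean (the reshaped field names below are just illustrative, not my real schema):
db.collection.insertOne({
  C: ISODate('2020-01-01T00:00:00Z'),
  attributes: [
    { k: 'A', v: 1 },
    { k: 'B', v: 10 },
    { k: 'D', v: [1, 2, 3] }
  ]
})

// One compound index then covers queries on any attribute name/value combination.
db.collection.createIndex({ 'attributes.k': 1, 'attributes.v': 1 })

// For example, querying on B:
db.collection.find({ attributes: { $elemMatch: { k: 'B', v: 10 } } })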
Your case seems like a perfect use case for a wildcard index,
which was introduced in MongoDB v4.2.
You can create a wildcard index on all top-level fields like this:
db.collection.createIndex( { "$**" : 1 } )
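If it turns out only a subset of the top-level fields is ever queried, the index can optionally be narrowed with wildcardProjection (not required for your case, just a sketch):
// Restrict the wildcard index to the queryable fields only.
db.collection.createIndex(
  { "$**": 1 },
  { wildcardProjection: { A: 1, B: 1, C: 1, D: 1 } }
)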
Running with arbitrary criteria:
{
D: 3,
B: 2,
}
or
{
A: 1,
C: ISODate('1970-01-01T00:00:00.000+00:00')
}
will result in an IXSCAN in explain():
{
"explainVersion": "1",
"queryPlanner": {
...
"parsedQuery": {
"$and": [
...
]
},
...
"winningPlan": {
"stage": "FETCH",
"filter": {
"C": {
"$eq": {
"$date": {
"$numberLong": "0"
}
}
}
},
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"$_path": 1,
"A": 1
},
"indexName": "$**_1",
...
"indexBounds": {
...
}
}
},
"rejectedPlans": [
...
]
},
...
}
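For reference, output like the above can be produced by running explain() directly on the query, e.g.:
db.collection.find({
  A: 1,
  C: ISODate('1970-01-01T00:00:00.000+00:00')
}).explain()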
Related
In a Mongo collection, createTime and lastUpdateTime are defined as timestamps and keep increasing. Users might frequently query for the latest updated value. Given that this collection might contain 100 million documents, an index needs to be created.
Sample documents:
{ _id:"1", a: "xx", b: "yy", createTime:ISODate("2022-05-16T06:07:47.280Z"), lastUpdateTime:ISODate("2022-05-16T06:07:47.280Z"), v: ["v1"] }
{ _id:"2", a: "xx", b: "yy", createTime:ISODate("2022-05-16T06:07:47.280Z"), lastUpdateTime:ISODate("2022-05-17T07:07:47.280Z"), v: ["v1", "v2"] }
Typical query pattern: db["data"].find({a: "xx", "b": "yy", "createTime": { "$lte":ISODate("2022-05-16T06:07:47.280Z") }}).sort({lastUpdateTime :-1}).limit(1)
There are two options for creating the index based on the ESR (Equality, Sort, Range) rule:
createIndex({ a: 1, b: 1, lastUpdateTime: 1, createTime: 1 })
createIndex({ a: 1, b: 1, lastUpdateTime: -1, createTime: 1 })
Both indexes can help this query. I don't think there is an obvious difference between them, though I suspect they may have different performance impacts during insert/update/delete. Since the timestamps are increasing by nature, I guess the ascending index should be created, but I'm not sure how to verify it.
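My current idea for checking it (a sketch, assuming the collection is named data) is to build each candidate index in turn and compare the explain() output:
db.data.createIndex({ a: 1, b: 1, lastUpdateTime: -1, createTime: 1 })

// An IXSCAN with no separate SORT stage in winningPlan means the index
// satisfied the sort; executionStats shows keys and docs examined.
db.data.find({
  a: "xx",
  b: "yy",
  createTime: { $lte: ISODate("2022-05-16T06:07:47.280Z") }
}).sort({ lastUpdateTime: -1 }).limit(1).explain("executionStats")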
Let's say we have a document like
{
someField: [
[
{
a: 1,
b: 2
},
{
a: 3,
b: 4
}
],
[
{
a: 5,
b: 6
}
]
]
}
An index has been created on someField.a.
I've tried some different filters (query criteria) for searching.
{'someField.a': 5} doesn't return the expected result.
{'someField': {$elemMatch: {$elemMatch: {'a': 5}}}} does return the expected documents, but the query planner shows that no index is available for this query.
So could anyone help me with how to query such an embedded field using the index?
I have a structure whereby a user-created object can end up in a specific Document key. I know what the key is, but I have no idea what the structure of the underlying value is. For the purposes of my problem, let's assume it's an array, a single value, or a dictionary.
For extra fun, I am also trying to solve this problem for nested dictionaries.
What I am trying to do is run an aggregation across all objects that have this key, and summarize the values of the terminal nodes of the structure. For example, if I have the following:
ObjectA.foo = {"a": 2, "b": 4}
ObjectB.foo = {"a": 8, "b": 16}
ObjectC.bar = {"nested": {"d": 20}}
ObjectD.bar = {"nested": {"d": 30}}
I want to end up with an output value of
foo.a = 10
foo.b = 20
bar.nested.d = 50
My initial thought is to try to figure out how to get Mongo to flatten the keys of the hierarchy. If I could break the source data down from objects to a series of key-values where a key represents the entire path to the value, I could easily do the aggregation on that. However, I am not sure how to do that.
Ideally, I'd have something like $unwindKeys, but alas there is no such operator. There is $objectToArray, which I imagine I could then $unwind, but at that point I already start getting lost in stacking these operators. It also does not answer the problem of arbitrary depth, though I suppose a single-depth solution would be a good start.
Any ideas?
EDIT: So I've solved the single-depth problem using $objectToArray. Behold:
db.mytable.aggregate(
[
{
'$project': {
'_id': false,
'value': {
'$objectToArray': '$data.input_field_with_dict'
}
}
},
{
'$unwind': '$value'
},
{
'$group': {
'_id': '$value.k',
'sum': {
'$sum': '$value.v'
}
}
}
]
)
This will give you key-value pairs across your chosen docs that you can then iterate on. So in the case of my sample above involving ObjectA and ObjectB, the result of the above query would be:
{"_id": "a", "sum": 10}
{"_id": "b", "sum": 20}
I still don't know how to traverse the structure recursively though. The $objectToArray solution works fine on a single known level with unknown keys, but I don't have a solution if you have both unknown keys and unknown depth.
The search goes on: how do I recursively sum or at least project fields with nested structures and preserve their key sequences? In other words, how do I flatten a structure of unknown depth? If I could flatten, I could easily aggregate on keys at that point.
If your collection is like this
/* 1 */
{
"a" : 2,
"b" : 4
}
/* 2 */
{
"a" : 8,
"b" : 16
}
/* 3 */
{
"nested" : {
"d" : 20
}
}
/* 4 */
{
"nested" : {
"d" : 30
}
}
then the query below will get you the required result.
db.sof.aggregate([
{'$group': {
'_id': null,
'a': {$sum: '$a'},
'b': {$sum: '$b'},
'd': {$sum: '$nested.d'}
}}
])
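With the four sample documents above, this returns a single grouped document:
{ "_id" : null, "a" : 10, "b" : 20, "d" : 50 }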
Let's say I have 1,000,000,000 entities in a MongoDB, and each entity has 3 numerical properties, A, B, and C.
for example:
entity1 : { A: 35, B: 60, C: 5 }
entity2 : { A: 15, B: 10, C: 55 }
entity3 : { A: 10, B: 10, C: 10 }
...
Now I need to query the database. The input of the query would be 3 numbers: (a, b, c). The result would be a list of entities in descending order as defined by the weighted average, or A * a + B * b + C * c.
so q(1, 100, 1) would return (entity1, entity2, entity3)
and q(1, 1, 100) would return (entity2, entity3, entity1)
Can something like this be achieved with MongoDB, without calculating the weighted average of every entity on every query? I am not bound to MongoDB, but am learning the MEAN stack. If I have to use something else, that is fine too.
NOTE: I chose 1,000,000,000 entities as an extreme example. My actual use case will only have ~5000 entities, so iterating over everything might be OK, I am just interested in a more clever solution.
Well, of course you have to calculate it if you are providing the input and cannot use a pre-calculated field; the only real choice here is between returning all items and sorting them in the client, or letting the server do the work:
var a = 1,
b = 1,
c = 100;
db.collection.aggregate(
[
{ "$project": {
"A": 1,
"B": 1,
"C": 1,
"weight": {
"$add": [
{ "$multiply": [ "$A", a ] },
{ "$multiply": [ "$B", b ] },
{ "$multiply": [ "$C", c ] }
]
}
}},
{ "$sort": { "weight": -1 } }
],
{ "allowDiskUse": true }
)
So the key here is that the .aggregate() method allows for document manipulation, which is required to generate the value on which to apply the $sort.
The calculated value is produced in a $project pipeline stage before the sort, using $multiply to combine each field value with the corresponding external variable fed into the pipeline, and a final $add over those products to produce "weight" as a field to sort on.
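For example, with the sample entities from the question and a = 1, b = 1, c = 100, the projected weights would be 35 + 60 + 500 = 595 for entity1, 15 + 10 + 5500 = 5525 for entity2 and 10 + 10 + 1000 = 1020 for entity3, so the descending sort returns entity2, entity3, entity1.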
You cannot directly feed algorithms to any "sort" methods in MongoDB, as they need to act on a field present in the document. The aggregation framework provides the means to "project" this value, so a later pipeline stage can then perform the sort required.
The other point here is that, given the number of documents you are proposing, it is better to supply "allowDiskUse" as an option so the aggregation process can temporarily store processed documents on disk rather than only in memory, as there is a restriction on the amount of memory an aggregation can use without this option.
It seems to me that when you are creating a Mongo document and have a field {key: value} which is sometimes not going to have a value, you have two options:
Write {key: null} i.e. write null value in the field
Don't store the key in that document at all
Both options are easily queryable: in one you query for {key: null} and in the other you query for {key: {$exists: false}}.
I can't really think of any differences between the two options that would have any impact in an application scenario (except that option 2 has slightly less storage).
Can anyone tell me if there are any reasons one would prefer either of the two approaches over the other, and why?
EDIT
After asking the question, it also occurred to me that indexes may behave differently in the two cases, i.e. a sparse index can be created for option 2.
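For example (a sketch using the placeholder key from above), a sparse index only contains entries for documents that actually have the field:
// Documents that do not contain "key" are omitted from this index entirely.
db.collection.createIndex({ key: 1 }, { sparse: true })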
Indeed, you also have a third possibility:
key: "" (empty value)
And you are forgetting a particular behaviour of the null value.
A query on key: null will retrieve all documents where key is null or where key doesn't exist,
whereas a query on $exists: false will retrieve only documents where the field key doesn't exist.
To come back to your exact question, it depends on your queries and on what the data represents.
If you need to record that, for example, a user set a value and then unset it, you should keep the field as null or empty. If you don't need that, you may remove the field.
Note that, since MongoDB doesn't use field name dictionary compression, field: null consumes disk space and RAM, while storing no key at all doesn't consume those resources.
It really comes down to:
Your scenario
Your querying manner
Your index needs
Your language
I personally have chosen to store null keys. It makes it much easier to integrate into my app. I use PHP with Active Record, and using null values makes my life a lot easier since I don't have to put the stress of field dependency on the app. I also don't need to write any complex code to deal with magic for setting non-existent variables.
I personally would not store an empty value like "", since if you're not careful you could end up with two empty values, null and "", and then you'll have a haphazard time querying for either specifically. So I prefer null for empty values.
As for space and indexes: it depends on how many rows might not have this column, but I doubt you will really notice the index size increase from a few extra docs with null in them. The difference in storage is minute, especially if the corresponding key name is small as well. That goes for large setups too.
I am, quite frankly, unsure of the index usage between $exists and null; however, null could be a more standardised method by which to query for existence. Remember that MongoDB is schemaless, which means there is no requirement to have that field in the doc, which again produces two empty values: non-existent and null. So it's better to choose one or the other.
I choose null.
Another point you might want to consider is whether you use OGM tools like Hibernate OGM.
If you are using Java, Hibernate OGM supports the JPA standard. So if you can write a JPQL query, it would theoretically be easy to switch to an alternate NoSQL datastore that is supported by the OGM tool.
JPA does not define an equivalent for $exists in Mongo. So if you have optional attributes in your collection, then you cannot write a proper JPQL query for them. In such a case, if the attribute's value is stored as NULL, it is still possible to write a valid JPQL query like the one below.
SELECT p FROM pppoe p where p.logout IS null;
I think in terms of disk space the difference is negligible. If you need to create an index on this field, then consider a partial index.
An index with { partialFilterExpression: { key: { $exists: true } } } can be much smaller than a normal index.
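For illustration (using the placeholder key field from the question), such an index could be created like this:
// Only documents where "key" exists get an entry in this index.
db.collection.createIndex(
  { key: 1 },
  { partialFilterExpression: { key: { $exists: true } } }
)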
It should also be noted that queries behave differently; consider values like these:
db.collection.insertMany([
{ _id: 1, a: 1 },
{ _id: 2, a: '' },
{ _id: 3, a: undefined },
{ _id: 4, a: null },
{ _id: 5 }
])
db.collection.aggregate([
{
$set: {
type: { $type: "$a" },
ifNull: { $ifNull: ["$a", true] },
defined: { $ne: ["$a", undefined] },
existing: { $ne: [{ $type: "$a" }, "missing"] }
}
}
])
{ _id: 1, a: 1, type: double, ifNull: 1, defined: true, existing: true }
{ _id: 2, a: "", type: string, ifNull: "", defined: true, existing: true }
{ _id: 3, a: undefined, type: undefined, ifNull: true, defined: false, existing: true }
{ _id: 4, a: null, type: null, ifNull: true, defined: true, existing: true }
{ _id: 5, type: missing, ifNull: true, defined: false, existing: false }
Or with db.collection.find():
db.collection.find({ a: { $exists: false } })
{ _id: 5 }
db.collection.find({ a: { $exists: true} })
{ _id: 1, a: 1 },
{ _id: 2, a: '' },
{ _id: 3, a: undefined },
{ _id: 4, a: null }
db.collection.find({ a: null })
{ _id: 3, a: undefined },
{ _id: 4, a: null },
{ _id: 5 }
db.collection.find({ a: {$ne: null} })
{ _id: 1, a: 1 },
{ _id: 2, a: '' },
db.collection.find({ a: {$type: "null"} })
{ _id: 4, a: null }