I want to query against a subdocument. The problem is that MongoDB seems to respect the order of the object keys, so if I run the following query, I won't receive any results:
db.getCollection('test').find({docId:'tLDmtdeYuG9DiGGrL',optimizeParams:{"deviceType":"mobile","daytime":"night"}})
If I change the order of optimizeParams, I will get a result:
db.getCollection('test').find({docId:'tLDmtdeYuG9DiGGrL',optimizeParams:{"daytime":"night","deviceType":"mobile"}})
Now a solution would be to use dot notation, but in this case I will receive ALL documents that contain both keys. But I only want the documents that ONLY have these two keys (no others):
db.getCollection('test').find({docId:'tLDmtdeYuG9DiGGrL',"optimizeParams.deviceType":"mobile","optimizeParams.daytime":"night"})
Is there a way to execute a query without respecting the key order?
Even if there were a way to ignore the key order, I would recommend not relying on it. JSON documents are not meant to depend on key order, and you should design your schema with this in mind.
For your particular use case, you could do this if you're using version 3.6 or higher:
db.test.aggregate([
    // First narrow down to documents that have both key/value pairs
    { $match: {
        "optimizeParams.daytime": "night",
        "optimizeParams.deviceType": "mobile"
    }},
    // Count how many keys optimizeParams actually contains
    { $addFields: {
        count: { $size: { $objectToArray: "$optimizeParams" } }
    }},
    // Keep only documents where those two keys are the only ones
    { $match: { count: 2 } }
])
This answer helped me with the above solution. Please keep indexes in mind when using it.
Although this can be done, "find documents with only these two keys" is an unusual requirement from an application's perspective. There is usually something meaningful that this situation represents. You could store another field to express that meaning and filter on it instead.
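For illustration, a minimal sketch of that idea; the field name paramsProfile and its value are made up, the point is only that the meaning is recorded explicitly when the document is written:
db.test.insertOne({
    docId: "tLDmtdeYuG9DiGGrL",
    optimizeParams: { deviceType: "mobile", daytime: "night" },
    paramsProfile: "mobile-night"   // hypothetical field naming what this combination means
})
// The query then no longer depends on the key order or key count of optimizeParams:
db.test.find({ docId: "tLDmtdeYuG9DiGGrL", paramsProfile: "mobile-night" })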
My DB has many documents with mostly random field order as displayed in Mongo Compass. The first field is always _id but the rest of the fields could be in any order. This makes scanning records by eye very difficult.
I have read that this reordering due to upserts no longer happens with Mongo 4.2 and I have upgraded - but the problem remains.
Is there a way for me to reorder my fields so each document in a collection has the same field order, say _id first and then a-z?
You can use $replaceWith to do this.
https://mongoplayground.net/p/VBzpabZuJpy
db.YOURCOLLECTION.updateMany({}, [
    { $replaceWith: {
        $mergeObjects: [
            // List the fields in the order you want them to appear; merging $$ROOT
            // afterwards keeps the stored values (and any fields not listed here)
            // while the field order follows this template.
            {
                "fieldA": "$fieldA",
                "fieldB": "$fieldB",
                "fieldC": "$fieldC",
                "fieldD": "$fieldD",
                "fieldE": "$fieldE"
            },
            "$$ROOT"
        ]
    }}
])
You can try reading each document in a language that preserves hash key order, reordering the fields as you see fit, and then writing each document back.
Since BSON stores documents as ordered lists of key-value pairs, I would expect all regular tools to preserve whatever key order currently exists in each individual document.
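A rough sketch of that approach in the legacy mongo shell (whose objects preserve key insertion order), sorting every field except _id alphabetically; the collection name is a placeholder:
db.YOURCOLLECTION.find().forEach(function (doc) {
    var reordered = { _id: doc._id };
    Object.keys(doc)
        .filter(function (k) { return k !== "_id"; })
        .sort()
        .forEach(function (k) { reordered[k] = doc[k]; });
    db.YOURCOLLECTION.replaceOne({ _id: doc._id }, reordered);
});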
When looking for pagination techniques on the internet, one usually finds two ways:
Offset-based pagination:
Collection.find(
{ where_this: "equals that" },
{ skip: 15, limit: 5 }
)
Cursor-based pagination:
Collection.find(
{ where_this: "equals that", _id: { $gt: cursor }},
{ sort: { _id: 1 }}
)
But is there a way to have cursor-based pagination without sorting the collection according to that cursor? Like telling Mongo: "Alright, I want the 5 next items after that _id, no matter in which order the _ids are, just give me 5 items after you see that _id". Something along those lines:
Collection.find(
{ where_this: "equals that", _id: { $must_come_after: cursor }},
{ sort: { other_field: 1 }}
)
It is not always possible to use the field you're sorting with as the cursor. First of all, because these fields can be of different types and you might allow your app users to sort, for example, tables as they please. With a strongly-typed API framework like GraphQL, this would be a mess to handle. Secondly, you could have two or more equal values for that field following each other in the sorted collection. If your pages split in the middle, asking for the next page will either give you duplicates or ignore items.
Is there a way to do that? How is it usually done to allow custom sort fields without offset-based pagination? Thanks.
When we talk about "paginating a result set", the implicit assumption is that the result set stays the same throughout the process. In order for the result set to stay the same, it must normally be sorted. Without specifying an order, the database is free to return the documents in any order, and this order may change from one retrieval to the next. Reordering of documents that a user is paginating through creates a poor user experience.
you could have two or more equal values for that field following each other in the sorted collection. If your pages split in the middle, asking for the next page will either give you duplicates or ignore items.
The database can return the documents which compare equal in any order, and this order can change between adjacent queries. This is why when sorting on a field that has a low cardinality, it is a good idea to add another field with a high cardinality to the sort expression to ensure the documents are returned in a stable order.
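A common way to apply that advice to cursor-based pagination is to make _id the tie-breaker in both the sort and the cursor condition. A rough sketch, assuming lastOtherField and lastId hold the values from the last item of the previous page and the collection name is illustrative:
db.items.find({
    where_this: "equals that",
    $or: [
        { other_field: { $gt: lastOtherField } },
        { other_field: lastOtherField, _id: { $gt: lastId } }
    ]
}).sort({ other_field: 1, _id: 1 }).limit(5)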
But is there a way to have cursor-based pagination without sorting the collection according to that cursor?
You can encode offset in a cursor identifier and use skip/limit.
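A minimal sketch of that idea with the Node.js driver, names illustrative: the client only ever sees an opaque cursor string, while the server decodes it back into a skip value.
function encodeCursor(offset) {
    return Buffer.from(JSON.stringify({ offset: offset })).toString("base64");
}
function decodeCursor(cursor) {
    return JSON.parse(Buffer.from(cursor, "base64").toString("utf8")).offset;
}

// Inside an async request handler: `cursor` comes from the client, possibly null
const offset = cursor ? decodeCursor(cursor) : 0;
const page = await collection.find({ where_this: "equals that" })
    .sort({ other_field: 1 })
    .skip(offset)
    .limit(5)
    .toArray();
const nextCursor = page.length ? encodeCursor(offset + page.length) : null;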
My main object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
...
"intervalAbsenceDates" : [
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a06560fd7047318463b")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184467")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184468")),
DBRef("ScheduleIntervalContainer", ObjectId("56eb0a05560fd70473184469")),
An embedded ScheduleIntervalContainer object looks like this:
{
"_id" : ObjectId("56eb0a06560fd7047318463b"),
"end" : ISODate("2022-08-23T07:06:00Z"),
"available" : true,
"confirmation" : true,
"start" : ISODate("2022-08-19T09:33:00Z")
}
Now I want to query all ScheduleIntervalContainers whose start and end fall in a given range.
I have tried a lot, but I cannot even query a single ScheduleIntervalContainer by id.
This is my approach:
db.InstitutionUserConnection.find( {
"intervalAbsenceDates" : {
"$ref" : "ScheduleIntervalContainer",
"$id" : ObjectId("56eb0a05560fd7047318446d")
}
})
Could anyone give me a hint on how to query all ScheduleIntervalContainers which have start and end in a time range?
Using DBRef is fraught with problems and its usage is somewhat "antiquated", as it was more of a "shoehorn" solution to requests for a mechanism to reference external collection data than a really well-thought-out solution.
The general recommendation is to not use DBRef and rather simply use a plain ObjectId or other unique identifier and resolve the actual "collection" or even "database container" with other information, being either another standard property stored in the document or just a plain external reference in your code which identifies the target collection.
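As a rough sketch of what that recommendation looks like in practice (the extra field name here is invented), the parent simply stores plain ObjectId values plus something that tells the application which collection they belong to:
{
    "_id" : ObjectId("56eb0a06560fd7047318465a"),
    "absenceRefCollection" : "ScheduleIntervalContainer",
    "intervalAbsenceDates" : [
        ObjectId("56eb0a06560fd7047318463b"),
        ObjectId("56eb0a05560fd70473184467")
    ]
}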
The Basic BSON Problem
A good reason for this is that people often make the common mistake (just like you have) of interpreting the "serialized" output from a stored object containing a DBRef as actual "properties" present on the object. This is not true; the object in fact has its own BSON type, just like Date and ObjectId. The only time that serialized form is valid is when used with a "strict mode" JSON parser, which would yield the actual DBRef object.
The manual itself is not particularly helpful on that point and says "DBRefs have the following fields". This is misleading, since those properties are not actually available as fields for query and are not exposed other than through the object inspection available to JavaScript processing in $where or mapReduce, neither of which is a good approach to use for "query" purposes.
So those properties are not available for query, but you can of course specify the BSON form directly from your API. Such as:
db.InstitutionUserConnection.find({
"intervalAbsenceDates": DBRef(
"ScheduleIntervalContainer",
ObjectId("56eb0a05560fd70473184469")
)
})
Which resolves and matches correctly since the correct BSON form was sent in a query and the element will actually match.
From this principle we can then move on to the concept of "ranges" as asked in the question.
MongoDB does not "really" do Joins
Until recent releases ( MongoDB 3.2.x series ) the statement was actually "MongoDB does not do joins", and this has been a general distinction in design philosophy away from relational databases.
The general mantra "has" been that "joins are costly" and therefore do not scale well in distributed data systems such as what MongoDB is principally designed for.
So if you are asking to reference a property of the document which the DBRef points to, then you are basically out of luck. The only possible action for filtering results based on such an external property would be:
Look at all the data in the master collection, either loaded to query in whole or processed individually
For each retrieved document, look up and expand the DBRef values to their target collection data.
Filter out documents whose expanded data from external references do not actually meet the conditions.
This means all the "expansion" and "filtering" must take place on the "client" to the database rather than on the server itself. There simply is no mechanism to do this, so you end up pulling a lot of data over the network connection.
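As a rough illustration of how much work that pushes onto the client, here is a shell-side sketch; the date range is a placeholder, and reading $ref/$id from the DBRef objects is exactly the kind of JavaScript object inspection mentioned above, not a server-side query:
var rangeStart = new Date("2016-03-19"),
    rangeEnd = new Date("2016-03-24");

db.InstitutionUserConnection.find().forEach(function (parent) {
    // Expand each DBRef by hand with an extra query per reference
    var expanded = parent.intervalAbsenceDates.map(function (ref) {
        return db.getCollection(ref.$ref).findOne({ "_id": ref.$id });
    });
    // Then filter in application code, not on the server
    var matches = expanded.some(function (ivl) {
        return ivl && ivl.start >= rangeStart && ivl.end < rangeEnd;
    });
    if (matches) printjson(parent._id);
});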
With a DBRef in place this is simply not possible to perform on the server, even with modern releases. Again it's the same BSON type problem, since the "source" contains DBRef types and the "target" contains ObjectId types.
If however you can simply live with the fact that your "range" is looking at the "creation date" data that would be inherently present in any ObjectId, then there is of course another approach that does not involve a "join".
Filtering on ObjectId "range"
Every ObjectId starts with 4 bytes that represent the current timestamp value (excluding milliseconds) at the time the ObjectId was created. This is generally a good indicator of the time of "insertion" for the document in which it was used.
From this you can determine that the "created date" of the documents referenced with DBRef is approximately equal to that portion of the ObjectId value used by the target document. This allows you to construct a "range" of ObjectId values that fall between the given dates. As a consequence, you can then construct DBRef BSON objects that work with range operators:
// Define start and end dates to query
var dateStart = new Date("2016-03-17T19:48:21Z"), // equal to 56eb0a05
dateEnd = new Date("2016-03-17T19:48:25Z"); // equal to 56eb0a09
// Convert to hex and pad to ObjectId length
var startRange = new ObjectId(
( dateStart.valueOf() / 1000 ).toString(16) + "0000000000000000"
),
// Yields ObjectId("56eb0a050000000000000000")
endRange = new ObjectId(
( dateEnd.valueOf() / 1000 ).toString(16) + "ffffffffffffffff"
);
// Yields ObjectId("56eb0a09ffffffffffffffff")
// Now query with constructed DBRef values
db.InstitutionUserConnection.find({
"intervalAbsenceDates": {
"$elemMatch": {
"$gte": DBRef("ScheduleIntervalContainer",startRange),
"$lt": DBRef("ScheduleIntervalContainer",endRange),
}
}
})
So as long as "created" is what you are looking for, then that method should suffice for selecting the matching parents without first expanding the DBRef values in the array for further inspection.
The Reverse Case Lookup
Of course the other case here is to simply query the "joined" collection first and then look for documents in the "master" collection that contain the ObjectId values within the DBRef. This does of course mean multiple queries need to be issued, but it does avoid expanding every DBRef just to match the related properties:
// Create array of matching DBRef values
var refs = db.ScheduleIntervalContainer.find({
    "start": { "$lte": targetDate },    // targetDate is the Date being checked against
    "end": { "$gte": targetDate }
}).map(function(doc) {
    return DBRef("ScheduleIntervalContainer", doc._id)
});
// Find documents that match the DBRef's within the array
db.InstitutionUserConnection.find({
"intervalAbsenceDates": { "$in": refs }
})
The practicality of this depends on the number of matches from the related collection that end up in the array passed to $in, but it does actually yield the desired result.
Actually doing Joins
I mentioned earlier that "modern" MongoDB releases now have an approach to "joining" data from different collections. This is the $lookup aggregation pipeline operator.
But while this can be used to "join" data, the current usage of DBRef does not work here. I did also mention earlier that the basic problem is the data in the array is a DBRef, but the data in the referenced collection is instead an ObjectId.
So if you wanted to use a $lookup approach, then you would first need to use plain ObjectId values in place of the existing DBRef values:
{
"_id" : ObjectId("56eb0a06560fd7047318465a"),
"intervalAbsenceDates" : [
ObjectId("56eb0a06560fd7047318463b"),
ObjectId("56eb0a05560fd70473184467"),
ObjectId("56eb0a05560fd70473184468"),
ObjectId("56eb0a05560fd70473184469")
]
}
With data in that structure you could then use $lookup and other aggregation pipeline stages to return only the documents that actually match the value of a related property, i.e. "end" within the related object:
db.InstitutionUserConnection.aggregate([
// Presently you need to unwind the array first
{ "$unwind": "$intervalAbsenceDates" },
// Then $lookup to get a resulting array of matches for each member
{ "$lookup": {
"from": "ScheduleIntervalContainer",
"localField": "intervalAbsenceDates",
"foreignField": "_id",
"as": "absenceDates"
}},
// unwind the array result field as well
{ "$unwind": "$absenceDates" },
// Now reform the documents
{ "$group": {
"_id": "$_id",
"intervalAbsenceDates": { "$push": "$absenceDates" }
}},
// Then query on the "end" property for the range
{ "$match": {
"intervalAbsenceDates": {
"$elemMatch": {
"end": {
"$gte": new Date("2016-03-23"),
"$lt": new Date("2016-03-24")
}
}
}
}}
])
The current behaviour of $lookup is that you cannot directly process on an array property in the document, so the procedure shown in "$lookup on ObjectId's in an array" is used to replace the current array with the expanded objects from the other collection.
Once the operations here actually produce a document that now has that related data embedded, it's a straightforward process of looking at the properties of the documents within the array to see if they match the query conditions.
Conclusion
This all should show that DBRef is not a good idea for storing references. Whilst it is possible, as demonstrated, to work around the problem using the ObjectId values, you generally want a plain ObjectId or other key value as the reference and to resolve it by other means. And even where the workaround is sufficient, it works just the same with plain ObjectId values or anything else that presents a natural range.
When it comes to using the "values" of referenced properties in such a "join", then of course, regardless of using DBRef or any other value, it is not possible without $lookup to use them in query conditions on the server. All data would first need to be loaded to the client and then resolved with additional queries to the database before those properties could be inspected for filtering.
Since the mechanism of $lookup will in fact result in a form that looks exactly as if you had "embedded" the data in the first place, "embedding" is most often the correct approach, since the data is then already present in the source collection and available for query.
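As a rough sketch of what that buys you: with the interval documents embedded directly in the array, the original range question becomes a single query (the dates are placeholders):
db.InstitutionUserConnection.find({
    "intervalAbsenceDates": {
        "$elemMatch": {
            "start": { "$gte": new Date("2016-03-19") },
            "end": { "$lt": new Date("2016-03-24") }
        }
    }
})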
There is a lot of "scare media" around regarding the BSON limit of 16MB, claiming this is why you should keep data in another collection. Sometimes this does indeed apply, but most of the time it does not. After all, 16MB is really quite a large amount of data, more than most general applications would actually use per document.
Citation from MongoDB: The Definitive Guide
To give you an idea of how much 16MB is, the entire text of War and Peace is just 3.14MB.
To examine the data in queries you need to get to an "embedded form" anyway, and it is arguable that if you can store an array of DBRef or ObjectId values or whatever as embedded data, then storing all of the content they actually point to is not really that much more of a stretch.
The general lesson is that you should design based on the actual usage patterns your application applies to the data. If you are querying "related data" all of the time, then it makes the most sense to keep that data all in the one collection. Of course other factors apply, but always keep in mind the performance trade-offs of what you are doing.
Let's say I have a document that looks like this:
{
_id: ObjectId("5260ca3a1606ed3e76bf3835"),
event_id: "20131020_NFL_SF_TEN",
team: {
away: "SF",
home: "TEN"
}
}
I want to query for any game with "SF" as the away team or home team. So I put an index on team.away and team.home and run an $or query to find all San Francisco games.
Another option:
{
_id: ObjectId("5260ca3a1606ed3e76bf3835"),
event_id: "20131020_NFL_SF_TEN",
team: [
{
name: "SF",
loc: "AWAY"
},
{
name: "TEN",
loc: "HOME"
}
]
}
In the array above, I could put an index on team.name instead of two indexes as before. Then I would query team.name for any game with "SF" inside.
Which query would be more efficient? Thanks!
I believe that you would want to use the second example you gave with the single index on team.name.
There are some special considerations that you need to know when working with the $or operator. Quoting from the documentation (with some additional formatting):
When using indexes with $or queries, remember that each clause of an $or query will execute in parallel. These clauses can each use their own index.
db.inventory.find ( { $or: [ { price: 1.99 }, { sale: true } ] } )
For this query, you would create one index on price:
db.inventory.ensureIndex({ price: 1 })
and another index on sale:
db.inventory.ensureIndex({ sale: 1 })
rather than a compound index.
Taking your first example into consideration, it doesn't make much sense to index a field that you are not going to specifically query. When you say that you don't mind if SF is playing on an away or home game, you would always include both the away and home fields in your query, so you're using two indexes where all you need to query is one value - SF.
It seems appropriate to mention at this stage that you should always consider the majority of your queries when thinking about the format of your documents. Think about the queries that you are planning to make most often and build your documents accordingly. It's always better to handle 80% of the cases as best you can rather than trying to solve all the possibilities (which might lead to worse performance overall).
Looking at your second example, with the array of nested documents, you would, as you said, only need one index (saving valuable space on your server).
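For comparison, a quick sketch of the two strategies; the collection name is made up:
// First layout: two single-field indexes plus an $or query
db.events.createIndex({ "team.away": 1 })
db.events.createIndex({ "team.home": 1 })
db.events.find({ $or: [ { "team.away": "SF" }, { "team.home": "SF" } ] })

// Second layout: one multikey index on the array of subdocuments
db.events.createIndex({ "team.name": 1 })
db.events.find({ "team.name": "SF" })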
Some more relevant quotes from the $or docs (again with added formatting):
Also, when using the $or operator with the sort() method in a query, the query will not use the indexes on the $or fields. Consider the following query which adds a sort() method to the above query:
db.inventory.find ({ $or: [{ price: 1.99 }, { sale: true }] }).sort({item:1})
This modified query will not use the index on price nor the index on sale.
So the question now is - are you planning to use the sort() function? If the answer is yes then you should be aware that your indexes might turn out to be useless! :(
The take-away from this is pretty much "it depends!". Consider the queries you plan to make, and consider what document structure and indexes will be most beneficial to you according to your usage projections.
The docs for MongoDB seem to suggest that in order to sort the results of an aggregate call you should specify a dictionary/object like this:
db.users.aggregate(
{ $sort : { age : -1, posts: 1 } }
);
This is supposed to sort by age and then by posts.
What do I do if I want to sort by posts and then by age? Changing the order of the keys seems to have no effect, probably because this is a JS object's properties. In other words, sorting, it seems, is always according to the lexical order of the keys, which seems rather odd as a design choice...
Am I missing something? Is there a way to specify an ordered list of keys to sort by?
From the docs:
As python dictionaries don’t maintain order you should use SON or collections.OrderedDict where explicit ordering is required eg "$sort"
from bson.son import SON
db.users.aggregate([
{"$sort": SON([("posts", 1), ("age", -1)])}
])