MongoDB Aggregate Framework - Grouping with Multiple Fields in _id - mongodb

Before marking this question as a duplicate - please read through. I don't think a sufficiently conclusive and general answer has been given yet, as most questions have focused on specific examples.
The MongoDB documentation says that you can specify an aggregate key for the _id value of a $group operation. There are a number of previously answered questions about using MongoDB's aggregate framework to group over multiple fields in this way, i.e:
{$group: {_id:{field_a:'$field_a', field_b:'$field_b'} } }
Q: In the most general sense, what does this action do?
If grouping documents by field A condenses any documents sharing the same value of field A into a single document, does grouping by fields A and B condense documents with matching values of both A and B into a single document?
Is the grouping operation sequential?
If so, does that imply any level of precedence between 'field_a' and 'field_b' depending on their ordering?

If grouping documents by field A condenses any documents sharing the same value of field A into a single document, does grouping by fields A and B condense documents with matching values of both A and B into a single document?
Let A = { a:A, b:B }, then that automatically follows from the assumption. You didn't make any assumption about the type of A, which is correct: the type doesn't matter. If the type of A is document, the usual comparison rules apply (equal content is considered equal).
Is the grouping operation sequential?
I'm not sure what that means. The aggregation pipeline runs accumulator functions on all items in each stage, so it certainly iterates the entire set, but I'd refrain from making assumptions about the exact order that happens in, i.e. from performing any non-associative operations.
If so, does that imply any level of precedence between 'field_a' and 'field_b' depending on their ordering?
No, documents are compared field-by-field and there are no strict guarantees on the ordering of fields ("attempts to...") in MongoDB. However, one can, in principle, create documents that contain multiple fields of the same name where the ordering might matter. But it's hard to do so, since most client interfaces don't allow different fields of equal name.

Related

View MongoDB array in order of indices with Compass

I am working on a database of Polish verbs and I'd like to find out how to display my results such that each verb conjugation appears in the following order: 1ps (1st person singular), 2ps, 3ps, 1ppl (1st person plural, etc.), 2ppl, 3ppl. It displays fine when I insert documents:
verb "żyć/przeżyć" conjugation as array and nested document
But when I go to perform queries it jumbles all the array elements up, in the first case (I want to see them in order of array indices), and sorts the nested document elements into alphabetical order (whereas I want to see them in the order in which they were inserted).
verb "żyć/przeżyć" conjugation array/document query
This should be an easy one to solve, I hope this comes across as a reasonable beginner's question. I have searched for answers but couldn't find much info on this topic. Any and all help is greatly appreciated!
Cheers,
LC.
Your screenshots highlight two different views in MongoDB Compass.
The Schema view is based on a sampling of multiple documents and the order of the fields displayed cannot be specified. The schema analysis (as at Compass 1.7) lists fields in case-insensitive alphabetical order with the _id field at the top. Since this is an aggregate schema view based on multiple documents, the ordering of fields is not expected to reflect individual document order.
If you want to work with individual documents and field ordering you need to use the Documents view, as per your second screenshot. In addition to displaying the actual documents, this view allows you to include sort and skip options for queries:

How does MongoDB order their docs in one collection? [duplicate]

This question already has answers here:
How does MongoDB sort records when no sort order is specified?
(2 answers)
Closed 7 years ago.
In my User collection, MongoDB usually orders each new doc in the same order I create them: the last one created is the last one in the collection. But I have detected another collection where the last one I created has the 6 position between 27 docs.
Why is that?
Which order follows each doc in MongoDB collection?
It's called natural order:
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This confirms that in general you get them in the same order you inserted, but that's not guaranteed–as you noticed.
Return in Natural Order
The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Index Use
Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index.
MMAPv1
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents.
Obviously, like the docs mentioned, you should not rely on this default order (This ordering is an internal implementation feature, and you should not rely on any particular structure within it.).
If you need to sort the things, use the sort solutions.
Basically, the following two calls should return documents in the same order (since the default order is $natural):
db.mycollection.find().sort({ "$natural": 1 })
db.mycollection.find()
If you want to sort by another field (e.g. name) you can do that:
db.mycollection.find().sort({ "name": 1 })
For performance reasons, MongoDB never splits a document on the hard drive.
When you start with an empty collection and start inserting document after document into it, mongoDB will place them consecutively on the disk.
But what happens when you update a document and it now takes more space and doesn't fit into its old position anymore without overlapping the next? In that case MongoDB will delete it and re-append it as a new one at the end of the collection file.
Your collection file now has a hole of unused space. This is quite a waste, isn't it? That's why the next document which is inserted and small enough to fit into that hole will be inserted in that hole. That's likely what happened in the case of your second collection.
Bottom line: Never rely on documents being returned in insertion order. When you care about the order, always sort your results.
MongoDB does not "order" the documents at all, unless you ask it to.
The basic insertion will create an ObjectId in the _id primary key value unless you tell it to do otherwise. This ObjectId value is a special value with "monotonic" or "ever increasing" properties, which means each value created is guaranteed to be larger than the last.
If you want "sorted" then do an explicit "sort":
db.collection.find().sort({ "_id": 1 })
Or a "natural" sort means in the order stored on disk:
db.collection.find().sort({ "$natural": 1 })
Which is pretty much the standard unless stated otherwise or an "index" is selected by the query criteria that will determine the sort order. But you can use that to "force" that order if query criteria selected an index that sorted otherwise.
MongoDB documents "move" when grown, and therefore the _id order is not always explicitly the same order as documents are retrieved.
I could find out more about it thanks to the link Return in Natural Order provided by Ionică Bizău.
"The $natural parameter returns items according to their natural order within the database.This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents."

MongoDB skip & limit when querying two collections

Let's say I have two collections, A and B, and a single document in A is related to N documents in B. For example, the schemas could look like this:
Collection A:
{id: (int),
propA1: (int),
propA2: (boolean)
}
Collection B:
{idA: (int), # id for document in Collection A
propB1: (int),
propB2: (...),
...
propBN: (...)
}
I want to return properties propB2-BN and propA2 from my API, and only return information where (for example) propA2 = true, propB6 = 42, and propB1 = propA1.
This is normally fairly simple - I query Collection B to find documents where propB6 = 42, collect the idA values from the result, query Collection A with those values, and filter the results with the Collection A documents from the query.
However, adding skip and limit parameters to this seems impossible to do while keeping the behavior users would expect. Naively applying skip and limit to the first query means that, since filtering occurs after the query, less than limit documents could be returned. Worse, in some cases no documents could be returned when there are actually still documents in the collection to be read. For example, if the limit was 10 and the first 10 Collection B documents returned pointed to a document in Collection A where propA2 = false, the function would return nothing. Then the user would assume there's nothing left to read, which may not be the case.
A slightly less naive solution is to simply check if the return count is < limit, and if so, repeat the queries until the return count = limit. The problem here is that skip/limit queries where the user would expect exclusive sets of documents returned could actually return the same documents.
I want to apply skip and limit at the mongo query level, not at the API level, because the results of querying collection B could be very large.
MapReduce and the aggregation framework appear to only work on a single collection, so they don't appear to be alternatives.
This seems like something that'd come up a lot in Mongo use - any ideas/hints would be appreciated.
Note that these posts ask similar sounding questions but don't actually address the issues raised here.
Sounds like you already have a solution (2).
You cannot optimize/skip/limit on first query, depending on search you can perhaps do it on second query.
You will need a loop around it either way, like you write.
I suppose, the .skip will always be costly for you, since you will need to get all the results and then throw them away, to simulate the skip, to give the user consistent behavior.
All the logic would have to go to your loop - unless you can match in a clever way to second query (depending on requirements).
Out of curiosity: Given the time passed, you should have a solution by now?!

Mongodb store and select order

Basic question. Does mongodb find command will always return documents in the order they where added to collection? If no how is it possible to implement selection docs in the right order?
Sort? But what if docs where added simultaneously and say created date is the same, but there was an order still.
Well, yes and ... not exactly.
Documents are default sorted by natural order. Which is initially the order the documents are stored on disk, which is indeed the order in which the documents had been added to a collection.
This order however, is not deterministic, as document may be moved on disk once these documents grow after update operations, and can't be fit into current space anymore. This way the initial (insert) order may change.
The way to guarantee insert order sort is sort by {_id : 1} as long as the _id is of type ObjectId. This will return your documents sorted in ascending order.
Write operations do not take place simultaneously. Write locks are imposed in database level (V 2.4 and on). The first four bytes of _id is insert timestamp, and 3 last digits is a random counter used to distinguish (and sort) between ObjectId instances with same timestamp.
_id field is indexed by default

The fastest way to show Documents with certain property first in MongoDB

I have collections with huge amount of Documents on which I need to do custom search with various different queries.
Each Document have boolean property. Let's call it "isInTop".
I need to show Documents which have this property first in all queries.
Yes. I can easy do sort in this field like:
.sort( { isInTop: -1 } );
And create proper index with field "isInTop" as last field in it. But this will be work slowly, as indexes in mongo works best with unique fields.
So is there is solution to show Documents with field "isInTop" on top of each query?
I see two solutions here.
First: set Documents wich need to be in top the _id from "future". As you know, ObjectId contains timestamp. So I can create ObjectId with timestamp from future and use natural order
Second: create separate collection for Ducuments wich need to be in top. And do queries in it first.
Is there is any other solutions for this problem? Which will work fater?
UPDATE
I have done this issue with sorting on custom field which represent rank.
Using the _id field trick you mention has the problem that at some point in time you will reach the special time, and you can't change the _id field (without inserting a new document and removing the old one).
Creating a special collection which just holds the ones you care about is probably the best option. It gives you the ability to logically (and to some extent, physically) separate the documents.
Newly introduced in mongodb there is also support for a "sparse" index which may fulfill your needs as well. You could only set the "isInTop" field when you want it to be special, and then create a sparse index on it which would not have the problems you would normally have with a single indexed boolean field (in btrees).