Sort collection permanently in Mongodb - mongodb

Whenever we do db.collection.find().sort(), only the output is sorted, not the collection itself,
i.e. if I then do db.collection.find() I see the original collection, not the sorted one.
Is there any way to sort the collection itself instead of just sorting the output?
Exporting the sorted result into an entirely new collection would also work.
I have a numbered _id field (like _id:1, _id:2, _id:3 and so on).

Although I do not see any reason for doing this (an index on the field you are going to sort on will let you get this sort fast), here is a solution for your problem:
You have your test collection this way
{ "_id" : ObjectId("5273f6987c6c502364ddfe94"), "n" : 5 }
{ "_id" : ObjectId("5273f6e57c6c502364ddfe95"), "n" : 14}
{ "_id" : ObjectId("5273f6ee7c6c502364ddfe96"), "n" : -5}
Then the following command will create a sorted collection for you
db.test.find().sort({n : 1}).forEach(function(e){
    db.testSorted.insert(e);
})
You can achieve exactly the same thing with this (which I assume might perform faster, but I have not done any testing):
db.testSorted.insert(db.test.find().sort({n : 1}).toArray());
And just to make this answer complete, although I understand that this is overkill, you can also do this with the aggregation framework's $out option.
Just to highlight: with all of this you can solve a bigger problem, namely saving some sort of modification or subset of an existing collection into another collection.
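For the $out route, the pipeline itself is just a list of stages. The following Python sketch builds one; the helper name build_sorted_copy_pipeline is made up for illustration, and running it assumes a pymongo collection object (for example db.test) is at hand.

```python
def build_sorted_copy_pipeline(sort_field, target_collection, ascending=True):
    """Build an aggregation pipeline that sorts on one field and writes
    the result into another collection (hypothetical helper name)."""
    direction = 1 if ascending else -1
    return [
        {"$sort": {sort_field: direction}},  # order the documents
        {"$out": target_collection},         # write the result to the target collection
    ]


pipeline = build_sorted_copy_pipeline("n", "testSorted")
print(pipeline)
# With pymongo this would be run as: db.test.aggregate(pipeline)
```

Note that a plain find() on the copied collection is still not guaranteed to return documents in that order, so queries that care about order should keep an explicit sort.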

Documents in a collection are stored in natural order which is affected by document moves (when the document grows larger than the current record space allocated) and deletions (free space can be reused for inserted/moved documents). There is currently (as at MongoDB 2.4) no option to control the order of documents on disk aside from using a capped collection, which is a fixed-size collection that maintains insertion order but is subject to a number of restrictions.
An index is the appropriate way to efficiently return documents in an expected sort order. For more information see: Using Indexes to Sort Query Results in the MongoDB manual.
A related feature is a clustered index, which would store documents on disk to match an index ordering. This is not a current feature of MongoDB, although it has been requested (see SERVER-3294).

Related

MongoDB - Using Index to get nested IDs is slow

I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
_id: ...,
_session: {
_id: ...
},
data: {...}
}
I need to get all the _session._id for my application. The following approach (python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed this query such that I get all the _session._id very fast?
In the mongo shell you can use hint() to tell the query optimizer to use the available index, as follows:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
The following test is confirmed to work from Python:
import pymongo
db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
mydb = db["test"]
# Hint the standard index on "x.y" (its definition is shown below).
docs = mydb.test2.find({}).hint([("x.y", pymongo.ASCENDING)])
for i in docs:
    print(i)
db.test2.createIndex({"x.y":1})
{
"v" : 2,
"key" : {
"x.y" : 1
},
"name" : "x.y_1"
}
Tested with python 3.7, pymongo 3.11.2, mongod 5.0.5.
In your case it seems to be a text index (by the way, it seems a bit strange for the session field to have a text index). For a text index, something like this should work:
db.test2.find({}).hint("x.y_text").explain()
And here is a working example with a text index:
import pymongo
db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
print('Get matching docs from test.test2:')
mydb = db["test"]
# Hint the text index by name (its definition is shown below).
docs = mydb.test2.find({"x.y": "3"}).hint("x.y_text")
print("===start:====")
for i in docs:
    print(i)
db.test2.createIndex({"x.y":"text"}):
{
"v" : 2,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "x.y_text",
"weights" : {
"x.y" : 1
},
"default_language" : "english",
"language_override" : "language",
"textIndexVersion" : 3
}
There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in @R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in @R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of @R2D2's answer.
Only the first of these is likely to be relevant to the original question. Note that a text index is a specialized index meant for more advanced text searching. Such indexes are not required for simple string matching or value retrieval. Standard indexes, such as { '_session._id': 1 }, also store string values and are the relevant kind here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation. That would be if we used it to perform a covered query, which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would have to scan the full index, so it is not clear whether it would be faster. But you could certainly try. To do so you would need to follow @R2D2's instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
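From Python, the covered-query attempt above might be wrapped as follows. This is only a sketch: session_ids is a made-up helper name, and collection is assumed to be a pymongo collection that already has the {"_session._id": 1} index. Whether it is actually faster than the plain query is something to measure rather than assume.

```python
def session_ids(collection):
    """Collect every _session._id, hinting the standard index so the
    server can try to answer from the index alone (hypothetical helper)."""
    cursor = collection.find(
        {},                                # no filter: every document
        {"_id": 0, "_session._id": 1},     # project only the indexed field
    ).hint([("_session._id", 1)])          # 1 means ascending

    ids = []
    for doc in cursor:
        session = doc.get("_session")
        # Skip documents that have no _session._id at all.
        if session and "_id" in session:
            ids.append(session["_id"])
    return ids


# Usage, assuming a pymongo collection object named `collection`:
# all_ids = session_ids(collection)
```

Excluding _id in the projection matters here, because _id is not part of the hinted index and returning it would force the server to read the documents.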
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this was the case above. To find out whether the database is using the index, you could navigate to the Explain tab in Compass and take a look. The explain plan visualization should indicate whether the index was used. Remember that you will need to hint the index for this query.
Is there a way to speed this query such that I get all the _session._id very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?

Fundamental misunderstanding of MongoDB indices

So, I read the following definition of indexes from the MongoDB Docs.
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
Indexes are special data structures that store a small portion of the
collection’s data set in an easy to traverse form. The index stores
the value of a specific field or set of fields, ordered by the value
of the field. The ordering of the index entries supports efficient
equality matches and range-based query operations. In addition,
MongoDB can return sorted results by using the ordering in the index.
I have a sample database with a collection called pets. Pets have the following structure.
{
"_id": ObjectId(123abc123abc),
"name": "My pet's name"
}
I created an index on the name field using the following code.
db.pets.createIndex({"name":1})
What I expect is that the documents in the collection, pets, will be indexed in ascending order based on the name field during queries. The result of this index can potentially reduce the overall query time, especially if a query is strategically structured with available indices in mind. Under that assumption, the following query should return all pets sorted by name in ascending order, but it doesn't.
db.pets.find({},{"_id":0})
Instead, it returns the pets in the order that they were inserted. My conclusion is that I lack a fundamental understanding of how indices work. Can someone please help me to understand?
Yes, it is a misunderstanding of how indexes work.
Indexes don't change the output of a query, only the way the query is processed by the database engine. So db.pets.find({},{"_id":0}) will return the documents in natural order irrespective of whether there is an index or not.
Indexes will be used only when you make use of them in your query. Thus,
db.pets.find({name : "My pet's name"},{"_id":0}) and db.pets.find({}, {_id : 0}).sort({name : 1}) will use the {name : 1} index.
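To make the distinction concrete, here is a toy model in Python. It is not how MongoDB is implemented; ToyCollection and its methods are invented for illustration. Documents stay in insertion order, and the index is a separate sorted structure that is only consulted when a query asks for something the index can help with.

```python
import bisect


class ToyCollection:
    """Toy model: documents in insertion order plus a sorted name index."""

    def __init__(self):
        self.docs = []        # stands in for "natural" order
        self.name_index = []  # sorted list of (name, position in self.docs)

    def insert(self, doc):
        self.docs.append(doc)
        bisect.insort(self.name_index, (doc["name"], len(self.docs) - 1))

    def find(self):
        # No filter and no sort: the index is never consulted.
        return list(self.docs)

    def find_sorted_by_name(self):
        # Sort requested: walk the index in key order.
        return [self.docs[pos] for _, pos in self.name_index]

    def find_by_name(self, name):
        # Equality match: binary search instead of scanning every document.
        start = bisect.bisect_left(self.name_index, (name, -1))
        result = []
        for key, pos in self.name_index[start:]:
            if key != name:
                break
            result.append(self.docs[pos])
        return result


pets = ToyCollection()
for pet_name in ["Rex", "Bella", "Max"]:
    pets.insert({"name": pet_name})

print([d["name"] for d in pets.find()])                 # insertion order
print([d["name"] for d in pets.find_sorted_by_name()])  # index order
print(pets.find_by_name("Max"))
```

Creating the index changes nothing about what find() returns; it only gives the other two operations a faster path.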
You should run explain on your queries to check if indexes are being used or not.
You may want to refer to the documentation on how indexes work.
https://docs.mongodb.com/manual/indexes/
https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/

How does MongoDB order their docs in one collection? [duplicate]

In my User collection, MongoDB usually keeps new docs in the same order I create them: the last one created is the last one in the collection. But I have found another collection where the last doc I created is in 6th position among 27 docs.
Why is that?
What order do docs follow in a MongoDB collection?
It's called natural order:
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This confirms that in general you get them in the same order you inserted, but that's not guaranteed–as you noticed.
Return in Natural Order
The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Index Use
Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index.
MMAPv1
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents.
Obviously, like the docs mentioned, you should not rely on this default order (This ordering is an internal implementation feature, and you should not rely on any particular structure within it.).
If you need the results sorted, use an explicit sort.
Basically, the following two calls should return documents in the same order (since the default order is $natural):
db.mycollection.find().sort({ "$natural": 1 })
db.mycollection.find()
If you want to sort by another field (e.g. name) you can do that:
db.mycollection.find().sort({ "name": 1 })
For performance reasons, MongoDB never splits a document on the hard drive.
When you start with an empty collection and insert document after document into it, MongoDB will place them consecutively on the disk.
But what happens when you update a document and it now takes more space and doesn't fit into its old position anymore without overlapping the next? In that case MongoDB will delete it and re-append it as a new one at the end of the collection file.
Your collection file now has a hole of unused space. This is quite a waste, isn't it? That's why the next document which is inserted and small enough to fit into that hole will be inserted in that hole. That's likely what happened in the case of your second collection.
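The move-and-reuse behaviour described above can be played through with a toy model in Python. This is a deliberately simplified sketch, not MongoDB's actual allocator; ToyDataFile, its slot layout and the sizes are all invented for illustration.

```python
class ToyDataFile:
    """Toy data file: a list of slots, each either holding a document
    or left empty as a hole after a document moved away."""

    def __init__(self):
        self.slots = []  # each slot: {"capacity": int, "doc": (id, size) or None}

    def insert(self, doc_id, size):
        # Reuse the first hole that is large enough, otherwise append.
        for slot in self.slots:
            if slot["doc"] is None and slot["capacity"] >= size:
                slot["doc"] = (doc_id, size)
                return
        self.slots.append({"capacity": size, "doc": (doc_id, size)})

    def grow(self, doc_id, new_size):
        for slot in self.slots:
            if slot["doc"] is not None and slot["doc"][0] == doc_id:
                if new_size <= slot["capacity"]:
                    slot["doc"] = (doc_id, new_size)  # still fits in place
                else:
                    slot["doc"] = None                # leave a hole behind
                    self.slots.append(
                        {"capacity": new_size, "doc": (doc_id, new_size)}
                    )
                return
        raise KeyError(doc_id)

    def natural_order(self):
        return [s["doc"][0] for s in self.slots if s["doc"] is not None]


data_file = ToyDataFile()
for doc_id in ["a", "b", "c"]:
    data_file.insert(doc_id, 10)

data_file.grow("a", 50)    # "a" no longer fits and moves to the end
data_file.insert("d", 8)   # "d" is small enough to land in the hole
print(data_file.natural_order())
```

In this model the newest document "d" ends up first and the oldest document "a" ends up last, which is the kind of reordering the question describes.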
Bottom line: Never rely on documents being returned in insertion order. When you care about the order, always sort your results.
MongoDB does not "order" the documents at all, unless you ask it to.
A basic insert will create an ObjectId for the _id primary key unless you tell it to do otherwise. An ObjectId starts with a timestamp, so values are roughly "monotonic" or "ever increasing": an ObjectId generated later will normally sort after one generated earlier, although this is not a strict guarantee when several clients generate ids within the same second.
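The reason ObjectIds tend to increase is that their first four bytes are a Unix timestamp in seconds. A small Python sketch using only the standard library shows how to read it back; the function names are made up for illustration, and the two sample ids are ones that appear in a shell session elsewhere on this page.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Seconds since the Unix epoch stored in the first 4 bytes of an ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    return int(oid_hex[:8], 16)


def objectid_datetime(oid_hex):
    """The same timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(objectid_timestamp(oid_hex), tz=timezone.utc)


first = "55814b944e019172b7d358a0"
second = "55814ba44e019172b7d358a1"

print(objectid_datetime(first).isoformat())
# Seconds elapsed between the creation of the two ids:
print(objectid_timestamp(second) - objectid_timestamp(first))
```

Because the timestamp has one-second resolution, ids created within the same second are ordered by their remaining bytes rather than by time.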
If you want "sorted" then do an explicit "sort":
db.collection.find().sort({ "_id": 1 })
Or use a "natural" sort, which means the order in which documents are stored on disk:
db.collection.find().sort({ "$natural": 1 })
This is pretty much the default unless stated otherwise, or unless the query criteria select an index, in which case that index determines the order. You can use $natural to "force" disk order when the query criteria would otherwise select an index that orders the results differently.
MongoDB documents "move" when they grow, and therefore the _id order is not always the same as the order in which documents are retrieved.
I could find out more about it thanks to the link Return in Natural Order provided by Ionică Bizău.
"The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents."

Does MongoDB find() query return documents sorted by creation time?

I need documents sorted by creation time (from oldest to newest).
Since ObjectID saves timestamp by default, we can use it to get documents sorted by creation time with CollectionName.find().sort({_id: 1}).
Also, I noticed that regular CollectionName.find() query always returns the documents in same order as CollectionName.find().sort({_id: 1}).
My question is:
Is CollectionName.find() guaranteed to return documents in same order as CollectionName.find().sort({_id: 1}) so I could leave sorting out?
No. Well, not exactly.
A db.collection.find() will give you the documents in the order they appear in the data files most of the time, though this isn't guaranteed.
Result Ordering
Unless you specify the sort() method or use the $near operator, MongoDB does not guarantee the order of query results.
As long as your data files are relatively new and few updates happen, the documents might (and most of the time will) be returned in what appears to be _id order, since ObjectIds generally increase over time.
Later in the lifecycle, old documents may have been moved from their old position (because they increased in size and documents are never partitioned) and new ones are written in the place formerly occupied by another document. In this case, a newer document may be returned in a position between two old documents.
There is nothing wrong with sorting documents by _id, since the index will be used for that, adding only some latency for document retrieval.
However, I would strongly recommend against using the ObjectId for date operations for several reasons:
ObjectIds are awkward for date comparison queries. To query for all documents created between date x and date y, you either have to construct boundary ObjectIds from those dates and compare on _id, or load all documents, extract the date from each ObjectId and compare it, which is extremely inefficient.
If the creation date matters, it should be explicitly addressable in the documents.
I see ObjectIds as a choice of last resort for the _id field and tend to use other values (compound on occasions) as _ids, since the field is indexed by default and it is very likely that one can save precious RAM by using a more meaningful value as id.
For example, you could use the following, which utilizes DBRefs:
{
_id: {
creationDate: new ISODate(),
user: {
"$ref" : "creators",
"$id" : "mwmahlberg",
"$db" : "users"
}
}
}
And do a quite cheap sort by using
db.collection.find().sort({"_id.creationDate":1})
Is CollectionName.find() guaranteed to return documents in same order as CollectionName.find().sort({_id: 1})
No, it's not! If you didn't specify any order, then a so-called "natural" ordering is used. Meaning that documents will be returned in the order in which they physically appear in data files.
Now, if you only insert documents and never modify them, this natural order will coincide with ascending _id order. Imagine, however, that you update a document in such a way that it grows in size and has to be moved to a free slot inside of a data file (usually this means somewhere at the end of the file). If you were to query documents now, they wouldn't follow any sensible (to an external observer) order.
So, if you care about order, make it explicit.
Source: http://docs.mongodb.org/manual/reference/glossary/#term-natural-order
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
Testing script (for the confused)
> db.foo.insert({name: 'Joe'})
WriteResult({ "nInserted" : 1 })
> db.foo.insert({name: 'Bob'})
WriteResult({ "nInserted" : 1 })
> db.foo.find()
{ "_id" : ObjectId("55814b944e019172b7d358a0"), "name" : "Joe" }
{ "_id" : ObjectId("55814ba44e019172b7d358a1"), "name" : "Bob" }
> db.foo.update({_id: ObjectId("55814b944e019172b7d358a0")}, {$set: {answer: "On a sharded collection the $natural operator returns a collection scan sorted in natural order, the order the database inserts and stores documents on disk. Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index. You cannot specify $natural sort order if the query includes a $text expression."}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
> db.foo.find()
{ "_id" : ObjectId("55814ba44e019172b7d358a1"), "name" : "Bob" }
{ "_id" : ObjectId("55814b944e019172b7d358a0"), "name" : "Joe", "answer" : "On a sharded collection the $natural operator returns a collection scan sorted in natural order, the order the database inserts and stores documents on disk. Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index. You cannot specify $natural sort order if the query includes a $text expression." }

In MongoDB, is db.collection.find() same as db.collection.find().sort({$natural:1})?

I'm sure this is an easy one, but I just wanted to make sure. Is find() with some search and projection criterion same as applying a sort({$natural:1}) on it?
Also, what is the default natural sort order? How is it different from a sort({_id:1}), say?
db.collection.find() returns the same result as db.collection.find().sort({$natural:1}).
{"$natural" : 1} forces the find query to do a collection scan (the default order); when used in a sort, it specifies on-disk order.
When you update a document, MongoDB may move it to another place on disk.
For example, insert documents as below:
{
_id : 0,
},
{
_id : 1,
}
then update:
db.collection.update({ _id : 0} , { $set : { blob : BIG DATA}})
And when you perform the find query you will get
{
"_id" : 1
},
{
"_id" : 0,
"blob" : BIG DATA
}
As you can see, the order of the documents has changed, so the default order is not by _id.
If you don't specify a sort, then MongoDB's find() will return documents in the order they are stored on disk. Document storage on disk may coincide with insertion order, but that's not always going to be true. It is also worth noting that the location of a document on disk may change. For instance, in the case of an update, MongoDB may move a document from one place to another if needed.
If the query uses an index, the default order will be the order in which entries are found in that index.
$natural is the order in which documents are found on disk.
It is recommended that you specify the sort explicitly to be sure of the sorting order.