MongoDB - Using Index to get nested IDs is slow - mongodb

I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
    _id: ...,
    _session: {
        _id: ...
    },
    data: {...}
}
I need to get all the _session._id for my application. The following approach (python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed this query such that I get all the _session._id very fast?

In the mongo shell you can hint() the query optimizer to use the available index as follows:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
The following test is confirmed to work from Python:
import pymongo
# Connect and select the "test" database
client = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
mydb = client["test"]
# Force the query to use the index on "x.y"
docs = mydb.test2.find({}).hint([("x.y", pymongo.ASCENDING)])
for i in docs:
    print(i)
The index on test2 was created with db.test2.createIndex({"x.y":1}):
{
    "v" : 2,
    "key" : {
        "x.y" : 1
    },
    "name" : "x.y_1"
}
Tested with Python 3.7, pymongo 3.11.2, mongod 5.0.5.
In your case it seems to be a text index (by the way, it is a bit strange for a session ID to have a text index). For a text index, something like this should work:
db.test2.find({}).hint("x.y_text").explain()
And here is a working example with a text index:
import pymongo
client = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
print('Get first 10 docs from test.test:')
mydb = client["test"]
# Hint the text index by its name
docs = mydb.test2.find({"x.y": "3"}).hint("x.y_text")
print("===start:====")
for i in docs:
    print(i)
The text index was created with db.test2.createIndex({"x.y":"text"}):
{
    "v" : 2,
    "key" : {
        "_fts" : "text",
        "_ftsx" : 1
    },
    "name" : "x.y_text",
    "weights" : {
        "x.y" : 1
    },
    "default_language" : "english",
    "language_override" : "language",
    "textIndexVersion" : 3
}

There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in @R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in @R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of @R2D2's answer.
Only the first of these is likely to be relevant to the original question. Note the difference: a text index is a specialized index meant for more advanced text searching. Such indexes are not required for simple string matching or value retrieval. Standard indexes, such as { '_session._id': 1}, also store string values and are the relevant kind here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation. That would be if we used it to perform a covered query, which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would still have to scan the full index, so it is not clear whether it would be faster. But you could certainly try. To do so you would need to follow @R2D2's instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
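For reference, the same covered query can be issued from Python. This is only a minimal sketch, assuming PyMongo, an existing {"_session._id": 1} index, and that every document has a _session._id as in the question's format. The function name fetch_session_ids and the batch_size value are illustrative choices, not something from the answers above.

```python
def fetch_session_ids(collection, batch_size=10_000):
    """Return every _session._id, hinting the index so the query can be covered.

    `collection` is expected to be a pymongo Collection that already has
    an index on {"_session._id": 1} (assumption).
    """
    cursor = (
        collection.find(
            {},
            # Exclude the top-level _id so the index alone can answer the query.
            projection={"_id": 0, "_session._id": 1},
        )
        # The list-of-tuples form names the index by key pattern; 1 means ascending.
        .hint([("_session._id", 1)])
        # Larger batches mean fewer round trips (illustrative value).
        .batch_size(batch_size)
    )
    # Each result document looks like {"_session": {"_id": ...}}.
    return [doc["_session"]["_id"] for doc in cursor]
```

Against a real deployment this would be called with something like client["mydb"]["mycollection"], where both names are placeholders. Whether it beats a plain collection scan still depends on the index size relative to the 40GB of documents.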
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this was the case above. To find out whether the database is using the index, navigate to the Explain tab in Compass. The explain plan visualization should indicate whether the index was used. Remember that you will need to hint the index for your query.
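The same check can be done in code by inspecting the dictionary that PyMongo's cursor.explain() returns. The sketch below assumes the classic explain layout (queryPlanner.winningPlan with nested inputStage / inputStages); newer server versions may nest the plan differently. The helper names and the sample document are made up for illustration and are not real server output.

```python
def plan_stages(explain_output):
    """List the stage names found in the winning plan."""
    stages = []
    pending = [explain_output["queryPlanner"]["winningPlan"]]
    while pending:
        node = pending.pop()
        stages.append(node.get("stage"))
        # A stage has either one child ("inputStage") or several ("inputStages").
        if "inputStage" in node:
            pending.append(node["inputStage"])
        pending.extend(node.get("inputStages", []))
    return stages


def uses_index(explain_output):
    """True if any stage of the winning plan is an index scan."""
    return "IXSCAN" in plan_stages(explain_output)


# Hand-written sample shaped like explain output for a covered query (not real output).
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "PROJECTION_COVERED",
            "inputStage": {"stage": "IXSCAN", "indexName": "_session._id_1"},
        }
    }
}
print(plan_stages(sample))  # ['PROJECTION_COVERED', 'IXSCAN']
print(uses_index(sample))   # True
```

A COLLSCAN stage in the list would mean the documents themselves were scanned and the index was not used.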
Is there a way to speed this query such that I get all the _session._id very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?

Related

MongoDB: Indexes, Sorting

After reading the official documentation on indexes, sorting, and index intersection, I'm a little confused about how everything works together.
I'm having trouble making my query use the indexes I've created. I'm working with MongoDB 3.0.3, on a collection of ~4 million documents.
To simplify, let's say my document is composed of 6 fields:
{
    a:<text>,
    b:<boolean>,
    c:<text>,
    d:<boolean>,
    e:<date>,
    f:<date>
}
The query I want to achieve is the following :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
So intuitively I've created two indexes
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1, e:1 }, {background: true,name: "test1"})
db.mycoll.createIndex({f:1}, {background: true,name: "test2"})
But explain() shows that the first index is not used at all.
I know there is some kind of limitation when ranges are in play in the filter (on the e field), but I can't find my way around it.
Also, instead of a single index on f, I tried a compound index on {e:1,f:1}, but it didn't change anything.
So what have I misunderstood?
Thanks for your support.
Update: I also found the following guidance for MongoDB 2.6:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values.
Second, the field(s) on which you will sort.
Finally, field(s) on which you will query for a range of values (e.g., $gt, $lt, $in)
An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading.
Does this also apply to the 3.x versions?
Update 2: following the rule above, I created the following index:
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1 , f:1, e:1}, {background: true,name: "test1"})
And for the same query :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
the index is indeed used. However, too many keys seem to be scanned; I may need to find a better order for the fields in the query/index.
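The rule of thumb quoted above (exact-match fields, then sort fields, then range fields) can be sketched as a small helper that builds the key list for a compound index. The function name order_index_fields is hypothetical; the output is in the list-of-tuples form that PyMongo's create_index accepts.

```python
def order_index_fields(equality, sort, range_):
    """Build a compound index key list: equality fields, then sort, then range.

    `sort` is a list of (field, direction) pairs; `equality` and `range_`
    are lists of field names, which are indexed ascending here.
    """
    keys = [(field, 1) for field in equality]
    keys += list(sort)
    # Skip range fields that are already present (e.g. also used for sorting).
    seen = {field for field, _ in keys}
    keys += [(field, 1) for field in range_ if field not in seen]
    return keys


keys = order_index_fields(
    equality=["a", "b", "c", "d"],
    sort=[("f", 1)],
    range_=["e"],
)
print(keys)
# [('a', 1), ('b', 1), ('c', 1), ('d', 1), ('f', 1), ('e', 1)]
```

For the query in the question this reproduces the key order of the index created in Update 2.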
Mongo sometimes acts a bit strangely when it comes to index selection.
Mongo automagically decides which index to use. In my experience, the smaller an index is, the more likely it is to be used (especially indexes with only one field). Maybe this happens because it is more often already loaded in RAM? To find out which index to use, Mongo runs trial executions of the candidate plans and caches the winner. However, the result is sometimes unexpected.
Therefore, if you know which index to use, you can force a query to use a specific index with the $hint option. You should try that.
The two indexes used for the query and for the sort do not overlap, so MongoDB cannot use them for index intersection:
Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.

How to use an index with MongoCollection.Update()

I am writing a method that updates a single document in a very large MongoCollection,
and I have an index that I want the MongoCollection.Update() call to use to drastically reduce lookup time, but I can't seem to find anything like MongoCursor.SetHint(string indexName).
Is using an index on an update operation possible? If so, how?
You can create an index that matches the query part of your update command.
For example if you have this collection, named data:
> db.data.find()
{ "_id" : ObjectId("5334908bd7f87918dae92eaf"), "name" : "omid" }
{ "_id" : ObjectId("5334943fd7f87918dae92eb0"), "name" : "ali" }
{ "_id" : ObjectId("53349478d7f87918dae92eb1"), "name" : "reza" }
and if you run this update:
> db.data.update({name: 'ali'}, {name: 'Ali'})
without any index defined, the number of scanned documents is 2:
"nscanned" : 2,
But if you define an index that matches your query, here on the name field:
db.data.createIndex({name:1})
Now if you update it again:
> db.data.update({name: 'Ali'}, {name: 'ALI'})
MongoDB uses your index for the update, and the number of scanned documents is 1:
"nscanned" : 1,
But if you want to hint an update, note that a find() cursor cannot be passed as the update's query. MongoDB 4.2 and later accept a hint option on the update itself:
// Assume that the index and its field exist.
> db.data.update({name: 'ALI'}, {name: 'ALI'}, {hint: {family: 1}})
If you have already indexed your collection, update will use the correct index right away. There is no point in providing a hint (in fact, at the time of writing you can't hint with update).
Hint is only for debugging and testing purposes. Mongo is in most cases smart enough to automatically decide which index (if you have many of them) should be used in a particular query and it reviews its strategy from time to time.
So short answer - do nothing. If you have an index and it is useful, it will be automatically used on find, update, delete, findOne.
If you want to see if it is used - take the part of the query which searches for something and run it through find with explain.
Example for hellboy. This is just an example and in real life it can be more complex.
So you have a collection with docs like this: {a : int, b : timestamp}. You have 2 indexes: one on a, another on b. Right now you need to run a query like a > 5 and b is after 2014. For some reason it uses index a, which does not give you the fastest time (maybe because you have 1000 elements, most of them are bigger than 5, and only 10 are after 2014). So you decide to hint it to use index b. Cool, it works much faster now. But your collection changes: now it is the year 2020 and most of your documents have b after 2014. So index b is no longer doing much work. But Mongo still uses it, because you told it to.

Sort collection permanently in Mongodb

Whenever we do db.Collection.find().sort(), only the output is sorted, not the collection itself,
i.e. if I do db.collection.find() then I see the original collection, not the sorted one.
Is there any way to sort the collection itself instead of just sorting the output?
Exporting the sorted result into an entirely new collection would also work.
In case it matters, I have a numbered _id field (like _id:1, _id:2, _id:3 and so on).
Although I do not see any reason for doing this (an index on the field you are going to sort by will make the sort fast), here is a solution for your problem:
You have your test collection this way
{ "_id" : ObjectId("5273f6987c6c502364ddfe94"), "n" : 5 }
{ "_id" : ObjectId("5273f6e57c6c502364ddfe95"), "n" : 14}
{ "_id" : ObjectId("5273f6ee7c6c502364ddfe96"), "n" : -5}
Then the following command will create a sorted collection for you
db.test.find().sort({n : 1}).forEach(function(e){
    db.testSorted.insert(e);
})
You can achieve exactly the same with this (which I assume might perform faster, but I have not done any testing):
db.testSorted.insert(db.test.find().sort({n : 1}).toArray());
And just to make this answer complete: although I understand that this is overkill, you can also do this with the aggregation framework's $out stage.
Just to highlight: with all this you can solve a bigger problem, namely saving some modification or subset of one collection into another collection.
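As a rough illustration of the $out variant mentioned above, here is how the pipeline could be built in Python. The helper name sorted_copy_pipeline is hypothetical; the collection and field names are the ones from the example.

```python
def sorted_copy_pipeline(sort_field, target, ascending=True):
    """Aggregation pipeline that writes a sorted copy of a collection to `target`."""
    direction = 1 if ascending else -1
    return [
        {"$sort": {sort_field: direction}},
        # $out must be the last stage; it replaces `target` if it already exists.
        {"$out": target},
    ]


pipeline = sorted_copy_pipeline("n", "testSorted")
print(pipeline)
# [{'$sort': {'n': 1}}, {'$out': 'testSorted'}]
```

With PyMongo this would be run as db.test.aggregate(pipeline, allowDiskUse=True), where allowDiskUse lets a large sort spill to disk. Note that the copy is only written in sorted order; a later find() without sort() is still not guaranteed to return documents in that order.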
Documents in a collection are stored in natural order which is affected by document moves (when the document grows larger than the current record space allocated) and deletions (free space can be reused for inserted/moved documents). There is currently (as at MongoDB 2.4) no option to control the order of documents on disk aside from using a capped collection, which is a fixed-size collection that maintains insertion order but is subject to a number of restrictions.
An index is the appropriate way to efficiently return documents in an expected sort order. For more information see: Using Indexes to Sort Query Results in the MongoDB manual.
A related feature is a clustered index, which would store documents on disk to match an index ordering. This is not a current feature of MongoDB, although it has been requested (see SERVER-3294).

How to implement persisted sorted list which is often updated and you need maintain the order

I need to display the members of a community sorted by last visit. There are millions of communities, each of which can have millions of members. The list should be scrollable. Because it is sorted by last visit time, the order is updated very often.
In an RDBMS this functionality could simply be done with an ordinary B-tree index. But how can I do it with a NoSQL approach?
My current thoughts are:
The standard NoSQL scrollable-list approach, which uses chained buckets of fixed length, doesn't help much because of the reordering requirement.
Cassandra keeps values ordered by column name. So theoretically I could use the last visit time as the column key, but for each update I would need to delete the existing column and insert a new one, which doesn't sound very efficient.
Apache Lucene is not NoSQL storage but is also an option because it creates a sorted index. But I'm not sure how well it scales for massive updates.
Redis Sorted Sets sound really promising, but I have no experience with them.
What other options do I have?
If you keep the last modification date in the object you could sort at query time in many NoSQL db's:
MongoDB (see docs on indexes):
db.collection.find({ ... spec ... }).sort({ key: 1 })
db.collection.ensureIndex( { "username" : 1, "timestamp" : -1 } )
Elastic search has sorting in queries too:
{
"sort" : [
{ "date" : {"order" : "asc"} }
],
"query" : {
...
}
}
Some stores like CouchDB seem to lack a built-in sorting feature altogether, so it pays off to look closely at a particular solution before investing in it.

Adding an index to a MongoDB collection hash field

I have a MongoDB collection that I would like to add an index on. For the purpose of this post, let's say the collection name is Cats. I have a hash key on the Cats collection so if you do db.cats.findOne(); it'll look like the following:
> db.cats.findOne();
{
    "_id" : ObjectId("4f248f8ae4b0b775c9eb002d"),
    "metaData" : {
        "type" : "cute",
        "id" : "4ed3b6c599114b488be52bc3"
    },
    ....
}
I query very often (using Mongoid), with something like this:
Cat.first(:conditions => { "metaData.id" => an_id })
I'd really like to be able to take advantage of indexes here, but I'm not entirely sure if I should index all of metaData or just metaData.id (I query against id specifically, and very often).
Would love any solution to this problem because I think I can dramatically speed up queries if I do the right thing here. Also, this is a unique index.
Also, metaData is not an embedded document. It does not have its own collection. It is simply a hash with a 1:1 mapping in each cats object.
You can just define an index on the embedded document. This is covered here:
http://www.mongodb.org/display/DOCS/Indexes#Indexes-UsingDocumentsasKeys
For your specific example, this would be:
db.cats.createIndex({ "metaData.id" : 1}, {unique : true})
To compare your results, run some of your standard queries in the shell with .explain() and compare the speed with and without the index. If you are not running a lot of queries you might need to hint the index to use, so that the shell does not rely on a previously cached "best" index (don't forget there is one on _id by default). More explain info here:
http://www.mongodb.org/display/DOCS/Explain
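For comparison, here is a small sketch of creating the same unique index on the nested field from Python with PyMongo. The function name ensure_unique_metadata_index is made up for illustration; create_index with a list of (field, direction) pairs and unique=True is the PyMongo counterpart of the shell call above.

```python
def ensure_unique_metadata_index(collection):
    """Create a unique ascending index on metaData.id and return its name.

    `collection` is expected to be a pymongo Collection (assumption).
    """
    # Dot notation indexes just the nested field, not the whole metaData hash.
    return collection.create_index([("metaData.id", 1)], unique=True)
```

Index creation fails if existing documents already contain duplicate metaData.id values, so it may be worth checking for duplicates first.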