Adding an index to a MongoDB collection hash field - mongodb

I have a MongoDB collection that I would like to add an index to. For the purpose of this post, let's say the collection name is Cats. I have a hash field on the Cats collection, so if you do db.cats.findOne(); it'll look like the following:
> db.cats.findOne();
{
    "_id" : ObjectId("4f248f8ae4b0b775c9eb002d"),
    "metaData" : {
        "type" : "cute",
        "id" : "4ed3b6c599114b488be52bc3"
    },
    ....
}
I query very often (using Mongoid), with something like this:
Cat.first(:conditions => { "metaData.id" => an_id })
I'd really like to be able to take advantage of indexes here, but I'm not entirely sure if I should index all of metaData or just metaData.id (I query against id specifically, and very often).
Would love any solution to this problem, because I think I can dramatically speed up queries if I do the right thing here. Also, this should be a unique index.
Also note that metaData does not have its own collection; it is not a reference to a separate document, just a hash with a 1:1 mapping in each cat object.

You can define an index on a field inside the embedded document using dot notation. This is covered here:
http://www.mongodb.org/display/DOCS/Indexes#Indexes-UsingDocumentsasKeys
For your specific example, this would be:
db.Cats.ensureIndex({ "metaData.id" : 1}, {unique : true})
To compare your results, run some of your standard queries in the shell with a .explain() to check the speed with and without the index. If you are not doing a lot of queries you might need to hint() the index to use, so that the optimizer doesn't stick with a cached "best" index (don't forget there is one on _id by default). More explain info here:
http://www.mongodb.org/display/DOCS/Explain
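For completeness, a minimal PyMongo sketch of the same setup; the connection URI and database name are placeholders:
import pymongo

# Placeholder connection URI and database name.
client = pymongo.MongoClient("mongodb://localhost:27017")
db = client["test"]

# Unique index on the nested field, equivalent to the ensureIndex call above
# (create_index is idempotent, so re-running it is safe).
db.cats.create_index([("metaData.id", pymongo.ASCENDING)], unique=True)

# explain() confirms whether the query uses the index.
plan = db.cats.find({"metaData.id": "4ed3b6c599114b488be52bc3"}).explain()
print(plan["queryPlanner"]["winningPlan"])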

Related

MongoDB - Using Index to get nested IDs is slow

I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
    _id: ...,
    _session: {
        _id: ...
    },
    data: {...}
}
I need to get all the _session._id values for my application. The following approach (Python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed up this query so that I get all the _session._id values very fast?
In the mongo shell you can hint() the query optimizer to use the available index as follows:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
The following test is confirmed to work via Python:
import pymongo

# Note: "@" separates the credentials from the host in a connection URI.
db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
mydb = db["test"]
docs = mydb.test2.find({}).hint([("x.y", pymongo.ASCENDING)])
for i in docs:
    print(i)
db.test2.createIndex({"x.y":1})
{
    "v" : 2,
    "key" : {
        "x.y" : 1
    },
    "name" : "x.y_1"
}
Tested with Python 3.7, PyMongo 3.11.2, mongod 5.0.5.
In your case it seems to be a text index (by the way, it seems a bit strange for a session ID to have a text index). For a text index, something like this should work:
db.test2.find({}).hint("x.y_text").explain()
And here is a working example with a text index:
import pymongo

db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
print('Get first 10 docs from test.test:')
mydb = db["test"]
docs = mydb.test2.find({"x.y": "3"}).hint("x.y_text")
print("===start:====")
for i in docs:
    print(i)
The text index was created with db.test2.createIndex({"x.y":"text"}); its spec (as shown by getIndexes()) is:
{
    "v" : 2,
    "key" : {
        "_fts" : "text",
        "_ftsx" : 1
    },
    "name" : "x.y_text",
    "weights" : {
        "x.y" : 1
    },
    "default_language" : "english",
    "language_override" : "language",
    "textIndexVersion" : 3
}
There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in @R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in @R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of @R2D2's answer.
Only the first of these is likely to be relevant to the original question. Note the difference: a text index is a specialized index meant for more advanced text searching. Such indexes are not required for simple string matching or value retrieval. A standard index, { "_session._id": 1 }, will also store string values and is the relevant one here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index will become. In your situation you are looking to retrieve data from all of the documents in the collection which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation. That would be if we used it to perform a covered query, which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would have to scan the full index, so it is not clear whether it would be faster or not. But you could certainly try. To do so you would need to follow @R2D2's instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
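If you want to verify from Python that the query is actually covered, here is a hedged sketch using the explain command (URI, database, and collection names are placeholders); a covered plan reports totalDocsExamined as 0:
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# Ask the server to explain the hinted, _id-suppressed projection query.
plan = db.command({
    "explain": {
        "find": "collection",  # placeholder collection name
        "filter": {},
        "projection": {"_id": 0, "_session._id": 1},
        "hint": {"_session._id": 1},
    },
    "verbosity": "executionStats",
})

# A covered query is answered from the index alone, without touching documents.
print(plan["executionStats"]["totalDocsExamined"])  # 0 when covered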
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this was the case above. To find out whether the database is using the index, you can navigate to the Explain tab in Compass and take a look. The explain plan visualization should indicate whether the index was used. Remember that you will need to hint the index based on your query.
Is there a way to speed up this query so that I get all the _session._id values very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?

Using object as _id in MongoDB causes COLLSCAN on queries

I'm having some issues with using a custom object as my _id value in MongoDB.
The objects I'm storing in _id looks like this:
"_id" : {
"EDIEL" : "1010101010101",
"StartDateTicks" : NumberLong(636081120000000000)
}
Now, when I'm performing the following query:
.find({
    "_id.EDIEL": { $eq: "1010101010101" },
    "_id.StartDateTicks": { $gte: 636082776000000000, $lt: 636108696000000000 }
}).explain()
It does a COLLSCAN, and I can't figure out why exactly. Is it because I'm not querying against the _id object with a complete object?
Does anyone know what I'm doing wrong here? :-)
Edit:
Tried to create a compound index containing the EDIEL and StartDateTicks fields, ran the query again, and now it uses the index instead of a collection scan. While this works, it would still be nice to avoid having the extra index and just use the _id index (since it's basically a "free" index). So the question still stands: why can't I query against _id.EDIEL and _id.StartDateTicks and make use of the index?
Indexes are used on keys and not on objects, so when you use an object for _id, the index on the object can't be used for a query on an individual field of that object.
This is true not only for _id but for any subdocument.
{
    "name" : "awesome book",
    "detail" : {
        "pages" : 375,
        "alias" : "AB"
    }
}
Now if you have an index on detail and you query by detail.pages or detail.alias, the index on detail cannot be used, and certainly not for range queries. You need indexes on detail.pages and detail.alias instead.
When an index is applied to an object, it indexes the object as a whole, not per field; that's why queries on object fields are not able to use object-level indexes.
Hope that helps
You will need to index the two fields separately, since an index on the whole embedded document can't serve queries on its individual fields. Thus creating a compound index, or creating multiple indexes on the fields (which the planner can then combine via index intersection), are the options for you. A sketch follows below.
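As a minimal PyMongo sketch of the compound-index workaround (the URI, database, and the readings collection name are placeholders):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# Compound index over the two dotted _id subfields.
db.readings.create_index([
    ("_id.EDIEL", pymongo.ASCENDING),
    ("_id.StartDateTicks", pymongo.ASCENDING),
])

# This query can now be served by the compound index instead of a COLLSCAN.
docs = db.readings.find({
    "_id.EDIEL": "1010101010101",
    "_id.StartDateTicks": {"$gte": 636082776000000000, "$lt": 636108696000000000},
})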

Sort collection permanently in MongoDB

Whenever we do db.collection.find().sort(), only our output is sorted, not the collection itself,
i.e. if I do db.collection.find() then I see the original collection, not the sorted one.
Is there any way to sort the collection itself instead of just sorting the output?
Exporting the sorted result into an entirely new collection would also work,
e.g. if I have a numbered _id field (like _id:1, _id:2, _id:3 and so on).
While I do not see any reason for doing this (an index on the field you are going to sort by will get you this sort fast), here is a solution for your problem:
Say you have your test collection like this:
{ "_id" : ObjectId("5273f6987c6c502364ddfe94"), "n" : 5 }
{ "_id" : ObjectId("5273f6e57c6c502364ddfe95"), "n" : 14}
{ "_id" : ObjectId("5273f6ee7c6c502364ddfe96"), "n" : -5}
Then the following command will create a sorted collection for you
db.test.find().sort({n : 1}).forEach(function(e){
    db.testSorted.insert(e);
})
You can achieve exactly the same thing with this (which I assume might perform faster, but I have not done any testing):
db.testSorted.insert(db.test.find().sort({n : 1}).toArray());
And just to make this answer complete (I understand that this is overkill), you can also do it with the aggregation framework's $out stage, as sketched below.
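For example, a minimal PyMongo sketch of the $out approach, reusing the test and testSorted names from above (the URI and database name are placeholders):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# Sort by n and write the result into a new collection in one server-side pass.
db.test.aggregate([
    {"$sort": {"n": 1}},
    {"$out": "testSorted"},
])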
Just to highlight: with all this you can solve a bigger problem: saving some sort of modification or subset of a previous collection into another collection.
Documents in a collection are stored in natural order, which is affected by document moves (when a document grows larger than its currently allocated record space) and deletions (free space can be reused for inserted or moved documents). There is currently (as of MongoDB 2.4) no option to control the order of documents on disk aside from using a capped collection, which is a fixed-size collection that maintains insertion order but is subject to a number of restrictions.
An index is the appropriate way to efficiently return documents in an expected sort order. For more information see: Using Indexes to Sort Query Results in the MongoDB manual.
A related feature is a clustered index, which would store documents on disk to match an index ordering. This is not a current feature of MongoDB, although it has been requested (see SERVER-3294).
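As a quick illustration of sorting via an index, a PyMongo sketch (URI and names are placeholders):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# With an index on n, the server can return documents in sorted order
# by walking the index instead of performing an in-memory sort.
db.test.create_index([("n", pymongo.ASCENDING)])
for doc in db.test.find().sort("n", pymongo.ASCENDING):
    print(doc)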

Can you match sub-fields with $all in Mongo?

I have a collection of documents, where each document looks like this:
{'name' : 'John', 'locations' :
    [
        {'place' : 'Paris', 'been' : true},
        {'place' : 'Moscow', 'been' : false},
        {'place' : 'Berlin', 'been' : true}
    ]
}
Where the locations array could have any length.
I want to match documents where the been field is true for all elements in the locations array. Looking at the documentation it looks like I should use $and somehow but I'm not sure if it works with sub-fields.
There are several options:
use $ne: db.destinations.find({"locations.been":{$ne:false}})
change your business logic to precompute that value before saving the document. Otherwise, this search must look through all records and then all places. This value could be indexed.
use the $where operator, but, understand the performance implications. It may require a full table scan. In this case, it would.
write a map-reduce function with the filter logic and only emit those that are valid. You'd need to incrementally update it per the docs.
write a query using the aggregation framework. There are a lot of good examples here. Although, like other solutions, this could end up looping through the entire collection. A sketch follows after this list.
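For the aggregation option, a hedged PyMongo sketch using $expr with $allElementsTrue (requires MongoDB 3.6+; the URI and the destinations collection name from the first option are assumptions):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# "$locations.been" resolves to the array of been values per document;
# $allElementsTrue passes only documents where every element is true.
# Note: a document with an empty locations array would also pass.
pipeline = [
    {"$match": {"$expr": {"$allElementsTrue": ["$locations.been"]}}},
]
for doc in db.destinations.aggregate(pipeline):
    print(doc)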
I think it's impossible to do with standard MongoDB operators like $elemMatch or $all. The only possible way is to write a custom JS query:
db.test.find("return this.locations.every(function(loc){return loc.been});")

mongodb: create a top-level index for a nested document instead of having to index each individual sublevel?

This question is about how I can use indexes in MongoDB to look something up in nested documents, without having to index each individual sublevel.
I have a collection "test" in MongoDB which basically goes something like this:
{
    "_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
    "othercol" : "bladiebla",
    "scenario" : {
        "1" : [1,2,3],
        "2" : [4,5,6]
    }
}
Scenario has multiple keys, and each document can have any subset of the scenarios (i.e. from none to a subset to all). Also: scenario can't be an array because I need it as a dictionary in Python. I created an index on the "scenario" field.
My issue is that I want to select on the collection, filtering for documents that have a certain value. So this works fine functionally:
db.test.find({"scenario.1": {$exists: true}})
However, it won't use any index I've put on scenario. Only if I put an index on "scenario.1" is an index used. But I can have thousands (or more) scenarios (and the collection itself has 100,000s of records), so I would prefer not to!
So I tried alternatives:
db.test.find({"scenario": "1"})
This will use the index on scenario, but won't return results. Making scenario an array still gives the same index issue.
Is my question clear? Can anyone give a pointer on how I could achieve the best performance here?
P.S. I have seen this: How to Create a nested index in MongoDB? But that solution is not possible in my case (due to the number of scenarios).
Putting an index on a subobject like scenario is useless in this case as it would only be used when you're filtering on complete scenario objects rather than individual fields (think of it as a binary blob comparison).
You either need to add an index on each of your possible fields ("scenario.1", "scenario.2", etc.) or rework your schema to get rid of the dynamic keys by doing something like this:
{
    "_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
    "othercol" : "bladiebla",
    "scenario" : [
        { id: "1", value: [1,2,3] },
        { id: "2", value: [4,5,6] }
    ]
}
Then you can add a single index to scenario.id to support the queries you need to perform.
I know you said you need scenario to be a dict and not an array, but I don't see how you have much choice.
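A minimal PyMongo sketch of the reworked schema in use (the URI, database, and collection names are placeholders):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# One index covers every dynamic scenario id.
db.test.create_index([("scenario.id", pymongo.ASCENDING)])

# Find documents that contain scenario "1".
docs = db.test.find({"scenario.id": "1"})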
JohnnyHK's answer is nicely explained and should be used in general cases. I will just suggest a workaround to solve your issue if you have to have many scenarios and don't need complex querying. Instead of keeping the values under the scenario field, hold just the ids of the scenarios under that field, and hold the values as separate fields in the document, using the scenario id as part of each field name.
Example:
{
    "_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
    "othercol" : "bladiebla",
    "scenario" : [ "1", "2" ],
    "scenario_1": [1,2,3],
    "scenario_2": [4,5,6]
}
With this schema you can use the index on scenario to find specific scenarios. But if you need to query for specific scenario values, you again need an index on each scenario value field, i.e. scenario_1, scenario_2, etc. If you need indexes for each field, then don't change your original schema; instead use sparse indexes for each nested field, which might help reduce the size of your indexes. A sketch of a sparse index follows below.
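A short PyMongo sketch of the sparse-index idea, assuming the scenario_1 field name from above (URI and names are placeholders):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["test"]  # placeholder database name

# A sparse index only holds entries for documents that actually have
# scenario_1, so documents without that scenario add nothing to the index.
db.test.create_index([("scenario_1", pymongo.ASCENDING)], sparse=True)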