Short version:
If I have an index {"category": 1}, and a document {"category": {type: "memory", class: "DDR400"}}, how can I do a query such as {"category.type": "memory"} that uses my index?
Long version:
With MongoDB, I want to use an embedded document as a key for an index.
For example, I might have some documents such as this (for a hypothetical product database):
{"category": {"type": "hard-drive", "form_factor": "2.5in", "size": "500GB"}, ...}
{"category": {"type": "hard-drive", "form_factor": "3.5in", ...}, ...}
{"category": {"type": "memory", "class": "DDR400", ...}, ...}
For the above examples, I might want to do queries such as:
{"category.type": "hard-drive"}
{"category.type": "hard-drive", "category.form_factor": "2.5in"}
{"category.type": "memory"}
{"category.type": "memory", "category.class": "DDR400"}
My issue is creating an index. The documentation at http://www.mongodb.org/display/DOCS/Indexes#Indexes-DocumentsasKeys describes two options:
The first option is to create a compound index, for example { "category.type": 1, "category.class": 1 }. This does not work well for my case, as I might have many different types of sub-categories.
The second option is to use the document as the key: { "category": 1 }. Now a query such as {"category": {"type": "memory", "class": "DDR400"}} would use the index, but {"category": {"type": "memory"}} would return nothing, and {"category.type": "memory"} would not use the index. Is there a way to do a query using this index that would give the same results as {"category.type": "memory"}?
I suspect a query using something like {"category": {"$gt": ..., "$lt": ...}} should work, but what should I put in the blank spaces there?
Creating a separate index for category.type (probably in addition to category) seems like the best option.
You could use a range query with $gt and $lt. These compare against the binary representation of the embedded object, so the trick only works for the first field (in storage order), and only if that first field is the same in all documents. It is therefore not very flexible and easy to break.
{"category" : {"$gt": {"type": "memory"}, "$lt": {"type": "memoryX" } } }
"memoryX" here serves as a cut-off point: everything with "memory" will sort before that.
Note that this requires the "type" field to be the first one in the binary representation of every document that has it. It also ONLY works for the "type" field (you cannot query on other fields in the first position; you have to choose one up front), so it gives you practically no advantage over a dedicated "category.type" index apart from space savings.
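To make the ordering argument concrete, here is a minimal sketch in plain JavaScript (not a MongoDB API). It assumes a simplified comparison model: key/value pairs are compared in stored order, and a document that runs out of pairs first sorts lower. Only string values are handled, and compareDocs and inRange are hypothetical helper names.

```javascript
// Simplified model of how embedded documents are ordered: compare key/value
// pairs in stored order; a document that runs out of pairs first sorts lower.
// Assumption: all values are strings (real BSON comparison also considers types).
function compareDocs(a, b) {
  const ea = Object.entries(a);
  const eb = Object.entries(b);
  const n = Math.min(ea.length, eb.length);
  for (let i = 0; i < n; i++) {
    const [ka, va] = ea[i];
    const [kb, vb] = eb[i];
    if (ka !== kb) return ka < kb ? -1 : 1; // field names first
    if (va !== vb) return va < vb ? -1 : 1; // then field values
  }
  return ea.length - eb.length; // all shared pairs equal: shorter sorts lower
}

// Mimics {"$gt": lower, "$lt": upper} on the whole embedded document.
function inRange(doc, lower, upper) {
  return compareDocs(doc, lower) > 0 && compareDocs(doc, upper) < 0;
}

const lower = { type: "memory" };
const upper = { type: "memoryX" };

console.log(inRange({ type: "memory", class: "DDR400" }, lower, upper));          // true
console.log(inRange({ type: "hard-drive", form_factor: "2.5in" }, lower, upper)); // false
// Same fields, different key order: no longer matches.
console.log(inRange({ class: "DDR400", type: "memory" }, lower, upper));          // false
```

The last call shows why key order matters: once "type" is not the first stored field, the document falls outside the range even though it is logically a memory item.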
I was experimenting with this idea before, see this thread on the mailing list. It does work, but you have to be careful what you are doing:
It is both supported and stable. Many of the sharding/replication
internals use _id values that are embedded docs.
The only thing to watch out for here is the ordering of the keys in
embedded element. They are sorted by their binary representation so
{x:1, y:1} is different than {y:1, x:1}, and sorted differently. Not
only are they sorted differently, they are different values. Some
languages always sort the keys in a dictionary/hash/map by default.
Again, consider creating extra indexes on the fields that you need.
In my case I'll only need to query on 'a', 'a,b' or 'a,b,c', or on 'a,x,y', where documents containing x never contain 'b' or 'c'
That would probably work then. I'd still do two composite indexes a,b and a,x, though. Or maybe just b and x. Given that a document contains b or x, you probably already have effectively filtered out the irrelevant documents with regard to a ( form_factor = 2.5in already tells you it is a hard disk, class = DDR400 already makes it memory).
And after filtering by a,b, you may not need an index to drill down further on c.
By using this tricky query on the binary representation you are making yourself dependent on what could be called an implementation detail. You may be hit by drivers that like to re-order fields, or something like this issue about Mongo itself reshuffling things sometimes.
If there is one basic property that you search on for each "type", then simply add it as a separate field as well, and create a compound index, e.g.:
{"category": {"type": "hard-drive", "form_factor": "2.5in", "searchfield": "2.5in", ...}, ...}
{"category": {"type": "memory", "class": "DDR400", "searchfield": "DDR400", ...}, ...}
If there are several fields you are searching for, but the values for these fields differ, you could add the values as tags and, again, create a compound key:
{"category": {"type": "hard-drive", "form_factor": "2.5in", "size": "500GB", "tags": ["2.5in", "500GB"]}, ...}
{"category": {"type": "memory", "class": "DDR400", "tags": ["DDR400"], ...}, ...}
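As a rough sketch of how such a "tags" array could be filled in before a document is saved: the helper name withTags and the list of searchable fields below are assumptions made for illustration, not part of any MongoDB API.

```javascript
// Hypothetical helper: copy the values of the searchable fields into "tags".
function withTags(category, searchableFields) {
  const tags = searchableFields
    .filter((field) => field in category) // skip fields this type does not have
    .map((field) => category[field]);
  return { ...category, tags };
}

// Assumed list of fields that should be searchable across all types.
const searchable = ["form_factor", "size", "class"];

const drive = withTags({ type: "hard-drive", form_factor: "2.5in", size: "500GB" }, searchable);
const memory = withTags({ type: "memory", class: "DDR400" }, searchable);

console.log(drive.tags);  // [ '2.5in', '500GB' ]
console.log(memory.tags); // [ 'DDR400' ]
```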
{
field_1: ["a", "b", "c"],
field_2: ["x", "y", "z"]
}
Can I create a compound index with field_1 and field_2? If not, what is my alternative to get around this?
Experimentally it is pretty easy to confirm that the answer to the first question is "no". As demonstrated in this playground example, attempting to do so will generate an error similar to the following (with the behavior dependent on whether the index or the data was created first):
cannot index parallel arrays [field_2] [field_1]
This limitation is documented here and a request to lift this restriction seems to be tracked here.
The response to the second question is a little more open-ended. How to work around this depends on your environment and your goals.
The most straightforward thing to do would be to create indexes that only include the more selective of these two fields. If field_1 is reasonably selective (potentially along with other indexed non-array fields), then the filtering the database will have to do for field_2 may be perfectly acceptable from a performance and efficiency perspective.
If you really wanted, though I don't necessarily recommend this, you could modify the schema to change these "parallel arrays" into nested ones. This might look something like the following:
{
arr: [
{ field_1: "a", field_2: ["x", "y", "z"] },
{ field_1: "b", field_2: ["x", "y", "z"] },
{ field_1: "c", field_2: ["x", "y", "z"] }
]
}
You can see in this playground example that the following index can be successfully built with that schema:
{ "arr.field_1": 1, "arr.field_2": 1 }
This is most feasible if the size of one of the arrays is bounded and pretty small.
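The size caveat follows from the duplication involved. A minimal sketch of the reshaping in plain JavaScript, where toNested is a hypothetical helper name, might look like this:

```javascript
// Hypothetical helper: turn { field_1: [...], field_2: [...] } into the nested
// shape shown above, repeating the field_2 array under each field_1 value.
function toNested(doc) {
  const { field_1, field_2, ...rest } = doc;
  return {
    ...rest,
    arr: field_1.map((value) => ({ field_1: value, field_2: [...field_2] })),
  };
}

const nested = toNested({ field_1: ["a", "b", "c"], field_2: ["x", "y", "z"] });
console.log(JSON.stringify(nested));

// The number of stored field_2 values is field_1.length * field_2.length.
const storedValues = nested.arr.reduce((n, entry) => n + entry.field_2.length, 0);
console.log(storedValues); // 9
```

Because every field_1 value carries its own copy of field_2, the document grows with the product of the two array lengths, which is why one of them should stay small.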
If neither of these is appropriate, then you may need to consider more drastic changes to your schema and overall data model.
The example below shows two possible document structures to be used for a contact in a contacts collection on MongoDB 3.4. Note the relationship between the contact and the campaigns he belongs to.
Approach A: campaigns is an object which holds campaigns as key:value pairs, where the key is the campaign ID and the value is the other campaign data.
{
"first_name": "John",
"last_name": "Doe",
"user_id": 1170,
"campaigns": {
3452: {
subscription_dt: ISODate("2017-01-28T19:00:00Z"),
score: 19
},
243: {
subscription_dt: ISODate("2017-01-15T16:45:00Z"),
score: 27
}
}
}
Approach B: campaigns is an array which simply holds campaigns as objects.
{
"first_name": "John",
"last_name": "Doe",
"user_id": 1170,
"campaigns": [
{
campaign_id: 3452,
subscription_dt: ISODate("2017-01-28T19:00:00Z"),
score: 19
},
{
campaign_id: 243,
subscription_dt: ISODate("2017-01-15T16:45:00Z"),
score: 27
}
]
}
Please imagine any kind of query on the collection:
Which is the best approach for queries?
Is there any particular query which is harder, or even impossible, to write using one of the solutions?
Which is the best approach for better performance? (I mean for example, the creation of a compound index user_id, campaign_id)
For the purpose of the analysis assume that the relationship must be placed in the contact document.
I would choose approach B; it is the common pattern, and it works well for queries against the campaigns array.
You can create an index on campaigns.campaign_id to speed up queries. You can also create a compound multikey index on user_id and campaigns.campaign_id, which answers your question. A disadvantage of multikey indexes is that they need more storage than other indexes, but they let you query arrays efficiently.
In approach A, to query by campaign ID you would have to create an index for each campaign ID, which makes no sense (I am not sure whether anyone uses that approach, but I would not). Every new campaign would force you to create a new index to keep queries fast. Someone may have a better answer for approach A, but my experience with MongoDB tells me to choose approach B here.
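If existing data already follows approach A, a small conversion step can reshape it into approach B. The following is only a sketch in plain JavaScript: campaignsToArray is a hypothetical helper, and new Date(...) stands in for the shell's ISODate(...).

```javascript
// Hypothetical helper: convert the approach A object into the approach B array.
function campaignsToArray(contact) {
  const campaigns = Object.entries(contact.campaigns).map(([id, data]) => ({
    campaign_id: Number(id), // object keys are strings, so convert back to a number
    ...data,
  }));
  return { ...contact, campaigns };
}

const contactA = {
  first_name: "John",
  last_name: "Doe",
  user_id: 1170,
  campaigns: {
    3452: { subscription_dt: new Date("2017-01-28T19:00:00Z"), score: 19 },
    243: { subscription_dt: new Date("2017-01-15T16:45:00Z"), score: 27 },
  },
};

const contactB = campaignsToArray(contactA);
// JavaScript iterates integer-like keys in ascending order, so 243 comes first.
console.log(contactB.campaigns.map((c) => c.campaign_id)); // [ 243, 3452 ]
```

With the array shape, every campaign exposes the same field path (campaigns.campaign_id), so a single index can serve all campaigns instead of one index per campaign key.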
I need a compound index for my collection, but I'm not sure about the key order.
My item:
{
_id,
location: {
type: "Point",
coordinates: [<lng>, <lat>]
},
isActive: true,
till: ISODate("2016-12-29T22:00:00.000Z"),
createdAt : ISODate("2016-10-31T12:02:51.072Z"),
...
}
My main query is:
db.collection.find({
$and: [
{
isActive: true
}, {
'till': {
$gte: new Date()
}
},
{
'location': { $geoWithin: { $box: [ [ SWLng,SWLat], [ NELng, NELat] ] } }
}
]
}).sort({'createdAt': -1 })
In plain language: I need all active, unexpired items in the visible part of my map, newest first.
Is it reasonable to create this index:
db.collection.createIndex( { "isActive": 1, "till": -1, "location": "2dsphere", "createdAt": -1 } )
And what is the best order for performance, for disk usage? Or maybe I have to create several indexes...
Thank you!
The order of fields in an index should be:
fields on which you will query for exact values.
fields on which you will sort.
fields on which you will query for a range of values.
In your case it would be:
db.collection.createIndex( { "isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" } )
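The ordering rule above can be sketched as a tiny helper that assembles the index specification from the three groups. buildIndexSpec is a hypothetical name, and treating the geo field as a "range" field simply follows the index shown above.

```javascript
// Hypothetical helper: build an index specification with equality fields
// first, then sort fields, then range fields. Object spread keeps the
// insertion order of the keys.
function buildIndexSpec({ equality = {}, sort = {}, range = {} }) {
  return { ...equality, ...sort, ...range };
}

const spec = buildIndexSpec({
  equality: { isActive: 1 },
  sort: { createdAt: -1 },
  range: { till: -1, location: "2dsphere" },
});

console.log(Object.keys(spec)); // [ 'isActive', 'createdAt', 'till', 'location' ]
```

The resulting object has the same key order as the createIndex call above, so it could be passed to createIndex as the specification.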
However, indexes on boolean fields are often not very useful, as on average MongoDB will still need to access half of your documents. So I would advise you to do the following:
duplicate the collection for test purposes
create the index you would like to test (e.g. {"isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" })
in the mongo shell, create a variable:
var exp = db.testCollection.explain('executionStats')
execute exp.find({'your query'}); it will return statistics describing the execution of the winning plan
analyze keys such as "nReturned", "totalKeysExamined" and "totalDocsExamined"
drop the index, create a new one (e.g. {"createdAt": -1, "till": -1, "location": "2dsphere"}), execute exp.find({'your query'}) again and compare the result to the previous one
In Mongo, many things depend on the data and its access patterns. There are a few things to consider when creating an index on your collection:
How the data will be accessed from the application. (You already know the main query, so this part is almost done.)
The data size, cardinality and data span.
Operations on the data (how often reads and writes happen, and in what pattern).
A query generally uses only one index at a time (exceptions include $or clauses and index intersection).
Index usage is not static. Mongo's query planner picks the index heuristically and may change its choice over time. So if you see index1 being used at some point, Mongo may switch to index2 later, once enough data of a different type or cardinality has been inserted.
Indexes can help your application's performance, but they can also hurt it. It is best to test them via the shell or Compass before using them in production.
var ex = db.<collection>.explain("executionStats")
Entering the above line in the mongo shell gives you an explainable object, which can then be used to investigate performance issues.
ex.find(<Your query>).sort(<sort predicate>)
Points to note in the output of the above are
"executionTimeMillis"
"totalKeysExamined"
"totalDocsExamined"
"stage"
"nReturned"
We aim to minimize the first three items (executionTimeMillis, totalKeysExamined and totalDocsExamined). "stage" is important because it tells you what is happening. If the stage is "COLLSCAN", every document is being scanned to fulfil the query; if the stage is "SORT", the sort is being done in memory. Neither is good.
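To compare candidate indexes, it can help to boil each explain output down to these few numbers. The sketch below is plain JavaScript: summarize and collectStages are hypothetical helpers, the sample object is hand-written rather than real server output, and the code assumes the output nests stages under "inputStage" or "inputStages".

```javascript
// Collect every stage name from an executionStages tree. Stages nest through
// "inputStage" (single child) or "inputStages" (several children).
function collectStages(stage, out = []) {
  if (!stage) return out;
  out.push(stage.stage);
  if (stage.inputStage) collectStages(stage.inputStage, out);
  (stage.inputStages || []).forEach((child) => collectStages(child, out));
  return out;
}

// Reduce an explain("executionStats") result to the values worth comparing.
function summarize(explainOutput) {
  const stats = explainOutput.executionStats;
  const stages = collectStages(stats.executionStages);
  return {
    executionTimeMillis: stats.executionTimeMillis,
    nReturned: stats.nReturned,
    totalKeysExamined: stats.totalKeysExamined,
    totalDocsExamined: stats.totalDocsExamined,
    stages,
    warnings: stages.filter((s) => s === "COLLSCAN" || s === "SORT"),
  };
}

// Hand-written sample for illustration only, NOT real server output.
const sample = {
  executionStats: {
    nReturned: 10,
    executionTimeMillis: 42,
    totalKeysExamined: 0,
    totalDocsExamined: 5000,
    executionStages: { stage: "SORT", inputStage: { stage: "COLLSCAN" } },
  },
};

console.log(summarize(sample).warnings); // [ 'SORT', 'COLLSCAN' ]
```

In the shell, the object returned by ex.find(...).sort(...) would take the place of the sample.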
Coming to your query, there are a few things to consider:
If "till" is going to have a fixed value, such as the end-of-month date for all items entered during a month, then it's not a good idea to index it. The DB will still have to scan many documents even with this index. Moreover, there will be only 12 distinct entries in this index per year, given that it is a month-end date.
If "till" is a fixed offset after "createdAt", then it is not useful to index both.
Indexing "isActive" is not useful because it can take only two values.
So please try with the actual data, create the indexes below, and determine which one fits best considering time, number of documents examined, etc.
1. {"location": "2dsphere" , "createdAt": -1}
2. {"till":1, "location": "2dsphere" , "createdAt": -1}
Apply both indexes to the collection and execute ex.find().sort(), where ex is the explainable object. Then analyze both outputs and compare them to decide which is best.
We are currently using a collection called items which contains 10 million entries in our MongoDB database.
This collection contains (amongst many others) two columns named title and country_code. One such entry looks like this
{
"_id": ObjectId("566acf868fdd29578f35e8db"),
"feed": ObjectId("566562f78fdd2933aac85b42"),
"category": "Mobiles & Tablets",
"title": "360DSC Crystal Clear Transparent Ultra Slim Shockproof TPU Case for Iphone 5 5S (Transparent Pink)",
"URL": "http://www.lazada.co.id/60dsc-crystal-clear-transparent-ultra-slim-shockproof-tpu-case-for-iphone-5-5s-transparent-pink-3235992.html",
"created_at": ISODate("2015-12-11T13:28:38.470Z"),
"barcode": "36834ELAA1XCWOANID-3563358",
"updated_at": ISODate("2015-12-11T13:28:38.470Z"),
"country_code": "ID",
"picture-url": "http://id-live.slatic.net/p/image-2995323-1-product.jpg",
"price": "41000.00"
}
The cardinality on column country_code is very high. We have created a text index covering both of these columns:
db.items.createIndex({title: "text", country_code: "text"})
In our examples we try to query:
db.items.find({"title": { "$regex": "iphone", "$options": "i" }, country_code: "US"}).limit(10)
This query takes around 6 seconds to complete, which seems unusually long for this type of database.
Whenever we try a country_code (e.g., country_code: "UK") that has fewer results, it returns results within milliseconds.
Is there any particular reason why these queries differ so much in the time they take to return results?
EDIT:
All answers here helped, so if you have this issue yourself, try all 3 of the solutions. I could only mark 1 as correct though.
Switch around the order of the fields in your index. Order matters.
db.items.createIndex({country_code: "text", title: "text"})
Ensure you maintain this order when querying:
db.items.find({country_code: "US", "title": { "$regex": "iphone", "$options": "i" }}).limit(10)
What this will do is drastically decrease the number of title fields in which you need to search for a substring.
Also, as mentioned by @Jaco, you should take advantage of your "text" index. See how to query a text index here.
As you do an exact search on country_code, you can add the text index on title only:
db.items.createIndex({title:"text"})
and add a separate index on country_code:
db.items.createIndex({country_code:1})
As you have defined a text index on title you don't have to use a regular expression, but instead you can do a text search like this:
db.items.find({$text:{$search:"iphone"},country_code:"US"})
You should build an index like {country_code: 1, title: "text"}.
An equality match is much faster than a regular expression, so make it count.
UPDATE: I need to add that the point of this question is to allow me to define schemas for Json Rest Stores. The user can search by any one key, or several keys. So, I cannot easily predict what the users will search by -- it could be 1, 2, 5 fields (this is especially true for data-rich fields like people, bookings, etc.)
Imagine that I have an index as such:
{ "item": 1, "location": 1, "stock": 1 }
Following the MongoDB manual on indexes:
MongoDB can use this index to support queries that include:
the item field,
the item field and the location field,
the item field and the location field and the stock field, or
only the item and stock fields; however, this index would be less efficient than an index on only item and stock.
MongoDB cannot use this index to support queries that include:
only the location field,
only the stock field, or
only the location and stock fields.
Now, suppose I have a schema with exactly these fields:
item: String
location: String
stock: String
qty: number
And imagine I want to make sure every query is indeed indexed. I would do:
For item:
item, location, stock, qty
item, location, qty, stock
item, stock, qty, location
item, stock, location, qty
item, qty, location, stock
item, qty, stock, location
For location:
...you know the gist
Now... this seems a little insane. If you have a database where you have TEN searchable fields, this becomes clearly unworkable as the number of indexes grows exponentially.
Am I missing something? My idea was to define a schema, define which fields were searchable, and write a function that makes up all of the needed indexes regardless of what fields were present and what fields weren't. However, I am thinking about it, and... well, I must be missing something.
Am I?
I will try to explain what this means by example. B-tree-based indexes are not something MongoDB-specific; on the contrary, they are a rather common concept.
So when you create an index, you show the database an easier way to find something. The index is stored separately, with pointers to the locations of the original documents. This information is ordered, and you can think of it as a binary tree, which has a really nice property: the search is reduced from O(n) (linear scan) to O(log(n)). That is much faster, because each step cuts the search space in half (for 10^6 entries, we can potentially go from 10^6 lookups to about 20). For example, suppose we have a big collection with documents {a : some int, b: 'some other things'}. If we index it by a, we end up with another data structure which is sorted by a. It looks like this (I do not mean that it is another collection; this is just for demonstration):
{a : 1, pointer: to the document with a = 1}, // if 1 is the smallest value of a in the collection
...
{a : 999, pointer: to the document with a = 999} // assuming that 999 is the biggest value of a
Now suppose we are searching for a document with a = 18. Instead of going through all elements one by one, we take something in the middle, and if it is bigger than 18, we divide the lower part in half and check the element there. We continue until we find a = 18. Then we follow the pointer to retrieve the original document.
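The halving procedure just described is an ordinary binary search. Here is a minimal sketch over a toy index in plain JavaScript; the "doc-N" pointers and the lookup helper are made up for illustration, and a real B-tree is more involved than a sorted array.

```javascript
// Toy single-field index: entries sorted by "a", each pointing at a document.
const index = [];
for (let a = 1; a <= 999; a++) index.push({ a, pointer: `doc-${a}` });

// Binary search: halve the remaining range on every step.
function lookup(entries, target) {
  let lo = 0;
  let hi = entries.length - 1;
  let steps = 0;
  while (lo <= hi) {
    steps++;
    const mid = (lo + hi) >> 1;
    if (entries[mid].a === target) return { pointer: entries[mid].pointer, steps };
    if (entries[mid].a < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return { pointer: null, steps }; // not in the index
}

const hit = lookup(index, 18);
console.log(hit.pointer, hit.steps <= 10); // doc-18 true

// For a million entries, about 20 halvings are enough.
console.log(Math.ceil(Math.log2(1e6))); // 20
```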
The situation with a compound index is similar (instead of ordering by one element, we order by many). For example, you have a collection:
{ "item": 5, "location": 1, "stock": 3, 'a lot of other fields' } // was stored at position 5 on the disk
{ "item": 1, "location": 3, "stock": 1, 'a lot of other fields' } // position 1 on the disk
{ "item": 2, "location": 5, "stock": 7, 'a lot of other fields' } // position 3 on the disk
... huge amount of other data
{ "item": 1, "location": 1, "stock": 1, 'a lot of other fields' } // position 9 on the disk
{ "item": 1, "location": 1, "stock": 2, 'a lot of other fields' } // position 7 on the disk
and want an index { "item": 1, "location": 1, "stock": 1 }. The lookup table would look like this (once more: this is not another collection, this is just for demonstration):
{ "item": 1, "location": 1, "stock": 1, pointer = 9 }
{ "item": 1, "location": 1, "stock": 2, pointer = 7 }
{ "item": 1, "location": 3, "stock": 1, pointer = 1 }
{ "item": 2, "location": 5, "stock": 7, pointer = 3 }
.. huge amount of other data (but not necessarily here: an entry with item = 1 would sit next to the other item = 1 entries)
{ "item": 5, "location": 1, "stock": 3, pointer = 5 }
Note that here everything is sorted by item, then by location, and then by stock.
As with a single-field index, we do not need to scan everything. If we have a query which looks for item = 2, location = 5 and stock = 7, we can quickly identify where the documents with item = 2 are, then in the same way quickly identify which of those have location = 5, and so on.
And now the interesting part. Although we created just one index (it is a compound index, but still one index), we can use it to quickly find elements
only by item. All we need to do is the first step. So there is no point in creating another index {item : 1}, because it is already covered by the compound index.
by item and location (we need only 2 steps).
Cool: one index, but it helps us in three different ways. But wait a minute: what if we want to search by item and stock? It looks like we can speed up this query as well. In log(n) we can find all elements with a specific item and ... here we have to stop; the magic has finished. We need to iterate through all of them. Still pretty good.
But maybe it can help us with other queries. Let's look at a query by location, which might look as if it were already ordered. But look closely and you see that it is a mess: a 1 at the beginning and then another at the end. The index cannot help you at all.
I hope this clarifies a few things:
why indexes are good (they reduce time from O(n) to potentially O(log(n)))
why compound indexes can help with some queries even though we have not created an index on that particular field, and cannot help with some other queries
which indexes are covered by a compound index
why indexes can harm (each creates an additional data structure which has to be maintained)
And this should tell you another valid thing: an index is not a silver bullet. You cannot speed up all your queries, so it is silly to think that by creating indexes on all fields EVERYTHING would be super fast.
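The three cases can be checked on the five sample documents from above. This sketch in plain JavaScript sorts them the way the compound index would and looks at where the matches land; positions and contiguous are hypothetical helper names.

```javascript
// The five sample documents, with "pointer" as their position on disk.
const docs = [
  { item: 5, location: 1, stock: 3, pointer: 5 },
  { item: 1, location: 3, stock: 1, pointer: 1 },
  { item: 2, location: 5, stock: 7, pointer: 3 },
  { item: 1, location: 1, stock: 1, pointer: 9 },
  { item: 1, location: 1, stock: 2, pointer: 7 },
];

// Sort the entries the way the compound index orders them:
// by item, then by location, then by stock.
const fields = ["item", "location", "stock"];
const entries = [...docs].sort((x, y) => {
  for (const f of fields) if (x[f] !== y[f]) return x[f] - y[f];
  return 0;
});

// Positions in the sorted index whose entries match a predicate.
const positions = (pred) =>
  entries.map((e, i) => (pred(e) ? i : -1)).filter((i) => i >= 0);
// True when the positions form one unbroken run.
const contiguous = (pos) => pos.every((p, i) => i === 0 || p === pos[i - 1] + 1);

const byItem = positions((e) => e.item === 1);
const byItemAndStock = positions((e) => e.item === 1 && e.stock === 1);
const byLocation = positions((e) => e.location === 1);

console.log(byItem, contiguous(byItem));                 // [ 0, 1, 2 ] true
console.log(byItemAndStock, contiguous(byItemAndStock)); // [ 0, 2 ] false
console.log(byLocation, contiguous(byLocation));         // [ 0, 1, 4 ] false
```

Matches on item form one unbroken run that a search can jump to. Matches on item and stock stay inside that run but still have to be picked out one by one, and matches on location alone are scattered across the whole index.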
What are your real query patterns? It's very unlikely that you would need to create all of these possible index combinations. I also doubt that including qty in the index would be of much use. Do you need to search for things where qty == 4 independent of location and item type?
An index doesn't need to identify every single record, it just needs to be specific enough to make any final scan small. Given an item code or a stock value are there really that many locations that you'd also need to index on them?
I suspect in this case an index on item, an index on location and an index on stock would be sufficient to answer most likely queries with sufficient speed. (But we'd need to know more about what these field names mean and what the count and distribution of values is within them).
Use explain with your queries and you can see how well they are performing. Add indexes as necessary; don't create every possible ordering.