MongoDB add fields of low cardinality to compound indexes? - mongodb

I have read that putting indexes on low-cardinality fields is pointless.
Would this hold true for a compound index such as this:
db.perms.createIndex({"owner": 1, "object_type": 1, "target": 1});
With queries such as:
db.perms.find({"owner": "me", "object_type": "square"});
db.perms.find({"owner": "me", "object_type": "circle", "target": "you"});
The number of distinct object_types would grow over time (probably to no more than 10 or 20) but would start out at only about 2 or 3.
Similarly, would a hashed index be worth looking into?
UPDATE:
owner and target would grow immensely. Think of this like a file system in which the owner would "own" a target (i.e. a file). But, as in Unix systems, a file could be a folder, a symlink, or a regular file (hence the type). So although there are only 3 object_types, an owner and target combination could have thousands of entries with an even distribution of types.
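For context, a hypothetical document in the perms collection might look like the sketch below (the field values are made up for illustration, not taken from the actual data):
// Hypothetical sample document: owner and target have very high cardinality,
// while object_type comes from a small fixed set ("file", "folder", "symlink").
db.perms.insertOne({
    owner: "alice",
    object_type: "file",
    target: "/home/alice/report.txt"
});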

I may not be able to answer your question, but here are my two cents on index cardinality:
Index cardinality refers to the number of index points for each of the different index types that MongoDB supports.
Regular - for every single key that we put in the index, there is going to be an index point. In addition, if a document is missing the key, there is going to be an index point under the null entry. So we get a 1:1 ratio relative to the number of documents in the collection in terms of index cardinality. That makes the index a certain size: it's proportional to the collection size in terms of its pointers to documents.
Sparse - when a document is missing the key being indexed, it's not in the index, because we don't keep nulls in a sparse index. We're going to have an index point count that is less than or equal to the number of documents.
Multikey - this is an index on array values. There'll be multiple index points (for each element of the array) for each document. So, it'll be greater than the number of documents.
Let's say you update a document with a key called tags and that update causes the document to need to be moved on disk (assume you are using the MMAPv1 storage engine). If the document has 100 tags in it, and the tags array is indexed with a multikey index, then 100 index points need to be updated in the index to accommodate the move.
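As a rough illustration of those three cases in the shell (collection and field names here are made up for the example):
// Regular index: one index entry per document (with a null entry if the key is missing).
db.things.createIndex({ owner: 1 });
// Sparse index: documents without a "nickname" field get no index entry at all,
// so the index can have fewer entries than there are documents.
db.things.createIndex({ nickname: 1 }, { sparse: true });
// Multikey index: indexing an array field such as "tags" creates one entry per
// array element, so the index can have more entries than there are documents.
db.things.createIndex({ tags: 1 });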

Related

What is the correct way to index in MongoDB when a big combination of fields exists

Consider that I have a search panel that includes multiple options, like in the picture below:
I'm working with Mongo and have created a compound index on 3-4 properties with a specific order.
But when I run different combinations of searches I see a different execution plan (explain()) every time. Sometimes it ends up as a collection scan (bad), and sometimes it fits the index exactly (IXSCAN).
The selective fields that should be handled by Mongo indexes are: (brand, Types, Status, Warehouse, Carries, Search - only by id)
My question is:
Do I have to create every combination of all the fields in different orders? That could be 10-20 compound indexes. Or 1-3 big compound indexes, but again that will not solve the ordering problem.
What is the best strategy to deal with a big variety of field combinations?
I use the same query structure with different combinations of pairs.
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
    '$and': [
        {
            'status': {
                '$in': ['XXX', 'YYY']
            }
        },
        {
            'searchId': {
                '$in': ['3859447']
            }
        },
        {
            'origin.brand': {
                '$in': ['aaaa', 'bbbb', 'cccc', 'ddd', 'eee', 'bundle']
            }
        },
        {
            '$or': [
                { 'origin.carries': 'YYY' },
                { 'origin.carries': 'ZZZ' },
                { 'origin.carries': 'WWWW' }
            ]
        }
    ]
}).sort({"timestamp": 1})
// My compound index is:
{ status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1 }
But that is only one combination... there could be plenty, like:
a. {status: 1}
b. {status: 1, searchId: -1}
c. {status: 1, searchId: -1, "origin.brand": 1}
d. {status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1}
...
Additionally, what will happen to write/read performance? I think writes will slow down relative to reads...
The query patterns are:
1. find(...) with '$and'/'$or' + sort
2. Aggregation with match/sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
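As a rough sketch of that approach (the index and the explain call below are illustrative assumptions based on the fields in the question, not a prescription):
// Compound index for a popular, selective combination (e.g. brand + carrier).
db.orders.createIndex({ "origin.brand": 1, "origin.carries": 1 });
// Check whether a brand-only query still uses the index prefix,
// or whether the planner prefers a collection scan.
db.orders.find({ "origin.brand": "aaaa" }).explain("executionStats");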
So you have subdocuments, range queries, and sorting by one field only.
That eliminates most of the possible permutations, assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what the man says and at least upvote.
The other thing to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with range queries: $in, $lt, $or, etc.
These are common rules for all B-trees. Now for things that are specific to Mongo:
A compound index can have no more than 1 multikey field - i.e. a field indexed through an array, like "origin.brand" here. Again, I assume origins are embedded docs, so the document's shape is like this:
{
    _id: ...,
    status: ...,
    timestamp: ...,
    origin: [
        {brand: ..., carries: ...},
        {brand: ..., carries: ...},
        {brand: ..., carries: ...}
    ]
}
For your query, the best index would be:
{
    searchId: 1,
    timestamp: 1,
    status: 1, /** only if it is selective enough **/
    "origin.carries": 1 /** or brand, depending on data **/
}
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM, otherwise it will be really slow.
Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, and collect stats from users' queries to remove unused indexes and free up some room. Basically, apply common sense.
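One way to collect those usage stats (available since MongoDB 3.2) is the $indexStats aggregation stage; a minimal sketch against the orders collection used above:
// Per-index usage counters since the last server restart.
// Indexes with a very low "accesses.ops" count are candidates for removal.
db.orders.aggregate([ { $indexStats: {} } ]);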
I noticed this one-year-old topic because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding-key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones which make up the sharding-key, then we're stuck with a full-database search. Our database is some tens of TB in size...
Indexes should fit in RAM? That can only be achieved with small databases, meaning some hundreds of GB at most. What about my 37 TB database? Indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each containing 100 chunks
at insert time, we take some fields which we know yield a good cardinality of the data, and we compute the sharding-key with those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk-number (equal to our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one for each such field): that document contains the value of this crucial field, plus an array with the chunk-number where the original full document has been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunk in collection "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded based on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array in "Crucial_col_A" with chunk-numbers will be updated (with 'merge') to contain the different or same chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion for the crucial field (say field "B") will run very quickly (because sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk-numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one for each shardkey-value=chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query-steps: first in the "Crucial_col_X" collection to obtain the array with chunk-numbers where the full documents exist, and then the second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known, thus this query goes very fast.
The second (set of) queries is done with precise values for the sharding-keys (= the chunk numbers), so these are also expected to be very fast.
This way of working would eliminate the burden of defining many index combinations.
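A very rough sketch of those two steps in the shell (the field names value, chunkNumbers, chunkNumber and fieldB are assumptions made up for the example, not part of the original design):
// Step 1: fetch the 'pointer' document for crucial field "B" from its sharded lookup collection.
var ptr = db.Crucial_col_B.findOne({ value: "some-B-value" });

// Step 2: query Main_col restricted to the chunk numbers found in step 1,
// then apply whatever additional criteria the user supplied.
db.Main_col.find({
    chunkNumber: { $in: ptr.chunkNumbers },
    fieldB: "some-B-value"
    // ...plus any optional extra user criteria
});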

DB Compound indexing best practices Mongo DB

How costly is it to index some fields in MongoDB?
I have a collection where I want uniqueness on the combination of two fields. Everywhere I search, a compound index with unique set to true is suggested. But what I was doing is appending both fields as field1_field2 and making that the key, so that field2 will always be unique for a given field1 (plus application logic), as I thought indexing was costly.
Also, as the MongoDB documentation advises us not to use a custom ObjectID like an auto-incrementing number, I end up giving big numbers to models like Classes, Students, etc. (where I could easily have used 1, 2, 3 in SQLite); I didn't think to add a new field for numbering and index that field for querying.
What is the best-practice advice for production?
The advantage of using compound indexes over your own indexed-field system is that compound indexes allow quicker sorting than regular indexed fields. They also lower the size of every document.
In your case, if you want to get the documents sorted with values in field1 ascending and in field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value contained in field1_field2, it does not really matter whether you use a compound index or a regular indexed field.
However, if you already have field1 and field2 as separate fields in the documents, and you also have a field containing field1_field2, it would be better to use a compound index on field1 and field2, and simply delete the field containing field1_field2. This lowers the size of every document and ultimately reduces the size of your database.
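For example, a compound unique index that enforces the field1 + field2 uniqueness directly (a minimal sketch; the collection name is a placeholder):
// Enforce uniqueness on the combination of field1 and field2,
// instead of maintaining a concatenated field1_field2 key in application code.
db.myCollection.createIndex({ field1: 1, field2: 1 }, { unique: true });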
Regarding the cost of the indexing, you almost have to index field1_field2 if you want to go down that route anyway. Queries on unindexed fields in MongoDB are really slow. And it does not take much more time to add a document to the database when the document has an indexed field (we're talking 1 millisecond or so). Note that adding an index on many existing documents can take a few minutes. This is why you usually plan the indexing strategy before adding any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!

MongoDB query by one index, sort by another

The relevant fields of the documents in my collection are the following:
{
    point: {
        type: "Point",
        coordinates: [15.6446464, 45.231323]
    },
    score: 24
}
I have a 2dsphere index on point and a "normal", descending index on score.
I want to run the following query:
db.properties.find({point: {$geoWithin: <some polygon> }}).sort({score: -1}).limit(2000)
Is there any way to make mongo use the index on point for the find part, and then the index on score for sorting?
The collection has about 700k documents; the find part can return tens of thousands of documents, each of which can be up to a MB in size.
The current problem is that, when using the point index, the size of the returned collection is too big for sorting in memory. When using the score index, the query is too slow because of a sequential scan on coordinates.
When executing your current query, MongoDB will only use the index on point. After running the find you will have a subset of the data and therefore Mongo can no longer use the index on score. You could instead create a composite index on point and score with score indexed in descending order. Even though the first values are unique, it helps to speed up the sorting as MongoDB can use the index to sort on score rather than having to scan through the entire document (which can be up to a MB in size).
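A sketch of that compound index, using the field names from the question (whether it helps will depend on the data, as noted below):
// Geospatial filter key first, then the sort key in descending order.
db.properties.createIndex({ point: "2dsphere", score: -1 });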
The composite index follows the general rule of thumb when indexing. In general the order of an index should be:
Fields on which you will query for exact values.
Fields on which you will sort.
Fields on which you will query for a range of values.
However, as per your comment this composite index is not very fast, which suggests that MongoDB can't load the entire index into memory. How much RAM have you allocated to MongoDB? Is there any chance you can increase it?
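To get a feel for whether the indexes fit in memory, you can compare the collection's total index size with the RAM available to MongoDB, e.g.:
// Total size (in bytes) of all indexes on the collection.
db.properties.totalIndexSize();
// Per-index sizes are available in the collection stats.
db.properties.stats().indexSizes;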

What does the digit "1" mean when creating indexes in mongodb

I am new to MongoDB and want to make indexes for a specific collection. I have seen people use the digit "1" after the field name when they want to create an index. For example:
db.users.ensureIndex({user_name: 1})
Now I want to know what this digit means and whether it is necessary to use it.
It's the type of index. MongoDB supports different kinds of indexes; however, only the first two kinds below can be combined into a compound index.
1: Ascending B-tree index.
-1: Descending B-tree index. Very similar to the default index, but the difference can matter for the behavior of compound indexes.
"hashed": A hashtable index. Very fast for lookup by exact value, especially in very large collections. But not usable for inexact queries ($gt, $regex or similar).
"text": A text index designed for searching for words in strings with natural language.
"2d": A geospatial index on a flat plane
"2dsphere": A geospatial index on a sphere
For more information, see the documentation of index types.
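For illustration, here is how each type is declared in the shell (collection and field names are placeholders):
db.users.createIndex({ user_name: 1 });           // ascending
db.users.createIndex({ created_at: -1 });         // descending
db.users.createIndex({ email: "hashed" });        // hashed
db.articles.createIndex({ body: "text" });        // text search
db.places.createIndex({ position: "2d" });        // geospatial on a flat plane
db.places.createIndex({ location: "2dsphere" });  // geospatial on a sphere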
It defines the index type on that specific field. For example, the value 1 creates an index in ascending order, while the value -1 creates the index in descending order.
For more information, see the Manual

Slow creation of four-field index in MongoDB

I have a ProductRequest collection in MongoDB. It is a somewhat large collection, but does not have that many documents. The number of documents is a bit over 300,000, but the average size of a document is close to 1 MB, thus the data footprint is large.
To speed up certain queries I am setting up index on this collection:
db.ProductRequest.ensureIndex({processed: 1, parsed: 1, error: 1, processDate: 1})
The first three fields are Boolean; the last one is a datetime.
The command has been running for almost 24 hours and has not come back.
I already have an index on the ‘processed’ and ‘parsed’ fields (together) and a separate one on ‘error’. Why does creation of this four-field index take forever? My understanding is that the size of an individual record should not matter in this case; am I wrong?
Additional Info:
MongoDB version 2.6.1 64-bit
Host OS Centos 6.5
Sharding: yes, shard key is _id. Number of shards: 2, number of replica sets in each shard is 3.
I believe it's because of putting indexes on Boolean fields.
Since there are only two values (true or false), if you have 300,000 documents, an index on such a field will still have to scan roughly 150,000 documents to find all matches; and in your case you have 3 Boolean fields, which makes it even slower.
You won't see a huge benefit from an index on those three fields and processDate compared to an index just on processDate. Indexes on boolean fields aren't very useful in the presence of other index-able fields because they aren't very selective. If you give a process date, there are only 8 possibilities for the combination of the other fields to further narrow down the results via the index.
Also, you should switch the order. Put processDate first as it is much more selective than a boolean field. That should greatly simplify the index and speed up the index build.
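A sketch of the reordered index (using the same ensureIndex call style as the question; on current MongoDB versions createIndex is the equivalent):
// processDate first, since it is far more selective than the Boolean flags.
db.ProductRequest.ensureIndex({ processDate: 1, processed: 1, parsed: 1, error: 1 })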
Finally, index creation in MongoDB is sometimes unavoidably slow and expensive because it involves creating large B-trees. The payoff, which is absolutely worth it, of course, is faster queries. It's possible that more than 24 hours are needed for an index build. Have you checked what the saturated resource is? It's likely the CPU for an index build. Your best option for this case is to create the index in the background. Background index builds
don't block read and write operations for their duration, unlike foreground index builds
take longer
produce initially larger indexes that will converge to the size of an equivalent foreground index over time
You set an index build to occur in the background with an extra option to the ensureIndex call:
db.myCollection.ensureIndex({ "myField" : 1 }, { "background" : 1 })