sparse indexes and null values in mongo - mongodb

I'm not sure I understand sparse indexes correctly.
I have a sparse unique index on fbId
{
"ns" : "mydb.users",
"key" : {
"fbId" : 1
},
"name" : "fbId_1",
"unique" : true,
"sparse" : true,
"background" : false,
"v" : 0
}
And I was expecting that would allow me to insert records with null as the fbId, but that throws a duplicate key exception. It only allows me to insert if the fbId property is removed completely.
Isn't a sparse index supposed to deal with that?

Sparse indexes do not contain documents that are missing the indexed field. However, if the field exists and has a value of null, the document will still be indexed. So, if the absence of the field and its equality to null mean the same thing to your application and you want to maintain uniqueness of fbId, just don't insert the field until you have a value for it.
You need sparse indexes when you have a large number of documents, only a small portion of them contain some field, and you want to be able to quickly find documents by that field. Creating a normal index would be too expensive: you would just waste precious RAM on indexing documents you're not interested in.
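To make the difference between a missing field and a null field concrete, here is a minimal sketch in plain JavaScript. It is a toy model of a unique sparse index, not MongoDB's implementation; the helper makeSparseUniqueIndex and the sample documents are made up for illustration.

```javascript
// Toy model of a single-field unique + sparse index (not MongoDB's code).
function makeSparseUniqueIndex(field) {
  const keys = new Set();
  return {
    insert(doc) {
      // Sparse: a document that lacks the field gets no index entry at all.
      if (!Object.prototype.hasOwnProperty.call(doc, field)) return "ok (not indexed)";
      // A field that is present is indexed even when its value is null.
      const key = JSON.stringify(doc[field]);
      if (keys.has(key)) return "duplicate key";
      keys.add(key);
      return "ok (indexed)";
    },
  };
}

const fbIdIndex = makeSparseUniqueIndex("fbId");
const results = [
  fbIdIndex.insert({ name: "ann" }),             // fbId missing
  fbIdIndex.insert({ name: "bob" }),             // fbId missing again
  fbIdIndex.insert({ name: "cat", fbId: null }), // first explicit null
  fbIdIndex.insert({ name: "dan", fbId: null }), // second explicit null
];
console.log(results);
```

Under this model the two documents without fbId are both accepted, while the second explicit null is rejected, which matches the behaviour described in the question.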

To get the most out of an index, we may want to leave out the documents that do NOT contain the field being indexed. For this MongoDB has the sparse property, which works as follows:
db.addresses.ensureIndex( { "secondAddress": 1 }, { sparse: true } );
This index will omit all the documents that do not contain the secondAddress field, and queries that use the index will never scan those documents.
Let me share this article about basic indexes and some of their properties:
Geospatial, Text, Hash indexes and unique and sparse properties: http://mongodbspain.com/en/2014/02/03/mongodb-indexes-part-2-geospatial-2d-2dsphere/

{a:1, b:5, c:2}
{a:8, b:15, c:7}
{a:4, b:7}
{a:3, b:10}
Let's assume that we wish to create a unique index on the above documents. Creating an index on a and b will not be a problem. But what if we need to create an index on c? The unique constraint will not work for c, because the two documents that lack c are both indexed with a null value, which is a duplicate. The solution in this case is to use the sparse option, which tells the database not to index the documents that are missing the key. The command in question is db.collectionName.createIndex({c:1}, {unique:true, sparse:true}). The sparse index also takes less space.
Note that even with a sparse index, the database may still scan all documents, especially when sorting, because the sparse index does not cover every document. This can be seen in the winning plan section of explain's result.
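As a rough illustration of both points, the sketch below builds the entries a regular and a sparse index on c would hold for the four documents above. It is plain JavaScript, not MongoDB code, and indexEntries is a hypothetical helper.

```javascript
const docs = [
  { a: 1, b: 5, c: 2 },
  { a: 8, b: 15, c: 7 },
  { a: 4, b: 7 },
  { a: 3, b: 10 },
];

// Build index entries on `field`: a regular index stores null for a missing
// field, a sparse index stores nothing for that document.
function indexEntries(collection, field, sparse) {
  const entries = [];
  collection.forEach((doc, position) => {
    const present = Object.prototype.hasOwnProperty.call(doc, field);
    if (!present && sparse) return;
    entries.push({ key: present ? doc[field] : null, position });
  });
  return entries;
}

const regular = indexEntries(docs, "c", false);
const sparse = indexEntries(docs, "c", true);
const nullKeys = regular.filter((e) => e.key === null).length;

console.log(nullKeys);      // null keys in the regular index (these collide under unique)
console.log(sparse.length); // documents reachable through the sparse index
console.log(docs.length);   // documents a sort over the whole collection must return
```

The sparse index holds fewer entries than the collection has documents, so a sort driven only by that index would silently drop the documents without c. That is the reason a full scan can still show up in the plan.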

Sparse indexes only contain entries for documents that have the indexed field, even if the index field contains a null value. The index skips over any document that is missing the indexed field. The index is "sparse" because it does not include all documents of a collection.

Related

Mongo sparse index is not working as I expected

I said 'as I expected', because I might be misunderstanding how it should work.
I have a model containing objects like this one :
{
"_id" : ObjectId("56408d76ef82679937000008"),
"_type" : "ford",
"year" : 1986,
"model" : "sierra",
"model_unique" : 1,
"__v" : 0
}
I need a compound unique index that will not allow to insert two objects with the same _type and model combination unless specified.
The way I thought I could specify that, was using the model_unique column and make the index sparse, so adding the former document twice should fail, whereas the following should be allowed (note that there is no model_unique field):
{
"_id" : ObjectId("56408e0d636779c83700000a"),
"_type" : "veridianDynamics",
"year" : 1986,
"model" : "sierra",
"__v" : 0
}
{
"_id" : ObjectId("another ID"),
"_type" : "veridianDynamics",
"year" : 2003,
"model" : "sierra",
"__v" : 0
}
I thought this would work with this index:
Schema.index({"_type": 1, "model": 1, "model_unique": 1}, { unique: true, sparse: true });
But it is actually failing with:
[MongoError: insertDocument :: caused by :: 11000 E11000 duplicate key error index: mongoose-schema-extend.vehicles.$_type_1_model_1_model_unique_1 dup key: { : "veridianDynamics", : "sierra", : null }]
So apparently it is considering that the undefined fields have a null value.
I'm using mongod --version
db version v2.6.11
And npm -v mongoose
2.14.4
From the documentation on sparse compound indexes:
Sparse compound indexes that only contain ascending/descending index
keys will index a document as long as the document contains at least
one of the keys.
What this means in your case is that only when all three components of the compound index are missing from the document, will the document be excluded from the index, and thus exempt from the unique constraint.
So the sparse index you're trying to add would allow multiple docs without any of the three keys, but for all other cases, the combination of all three fields must be unique, with any missing fields getting a value of null.
In your example docs, they both would look like the following from the perspective of the unique index:
{
"_type" : "veridianDynamics",
"model" : "sierra",
"model_unique" : null
}
And thus, not unique.
FYI, there are exceptions to this rule where the existence of a geospatial or text index in your compound, sparse index changes the rules to only consider that specially indexed field when determining whether to include the document in the index.
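The rule for ascending/descending keys can be sketched in plain JavaScript as follows. This is a toy model written for this answer, not MongoDB's implementation, and makeCompoundSparseUniqueIndex is a hypothetical helper; it ignores the geospatial and text exceptions mentioned above.

```javascript
// Toy model of a unique, sparse compound index on ascending keys.
function makeCompoundSparseUniqueIndex(fields) {
  const keys = new Set();
  return {
    insert(doc) {
      const has = (f) => Object.prototype.hasOwnProperty.call(doc, f);
      // The document is skipped only when none of the indexed fields is present.
      if (!fields.some(has)) return "ok (not indexed)";
      // Any missing field is filled with null in the index key.
      const key = JSON.stringify(fields.map((f) => (has(f) ? doc[f] : null)));
      if (keys.has(key)) return "duplicate key " + key;
      keys.add(key);
      return "ok (indexed)";
    },
  };
}

const vehicles = makeCompoundSparseUniqueIndex(["_type", "model", "model_unique"]);
const outcomes = [
  vehicles.insert({ _type: "ford", year: 1986, model: "sierra", model_unique: 1 }),
  vehicles.insert({ _type: "veridianDynamics", year: 1986, model: "sierra" }),
  vehicles.insert({ _type: "veridianDynamics", year: 2003, model: "sierra" }),
  vehicles.insert({ year: 1999 }), // none of the three keys: exempt
  vehicles.insert({ year: 2000 }), // none of the three keys: exempt
];
outcomes.forEach((o) => console.log(o));
```

In this model the second veridianDynamics document collides on the key ["veridianDynamics","sierra",null], while the two documents that carry none of the indexed fields are both accepted.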
According to the unique index documentation for missing fields
Unique Index and Missing Field
If a document does not have a value for a field, the index entry for that item will be null in any index that includes it. Thus, in many situations you will want to combine the unique constraint with the sparse option. Sparse indexes skip over any document that is missing the indexed field, rather than storing null for the index entry. Since unique indexes cannot have duplicate values for a field, without the sparse option, MongoDB will reject the second document and all subsequent documents without the indexed field. Consider the following prototype.
Therefore, it seems legitimate to think that this will also work on compound indexes.
This was reported as a bug on jira.
MongoDB developers decided not to include this functionality and closed the request
It makes more sense to exclude documents from the index if ALL fields in the index are missing. Compound indexes also serve queries on the first, first+second, etc fields in the index, and so an index on a,b,c should be able to find all the documents where a=1, not only the ones where b and/or c also have values. This is more intuitive, and should be the default behavior.
Although some suggestions were made in an effort to define a proper semantics to differentiate the two possible cases
{sparse : true, sparseIfAnyValueMissing : true}
It could be useful not only for what I describe in the question, but also for document inheritance and support partial indexing
I have the situation where one of my columns is null when I first create it, but may get set to an ID later. And when it's set to an ID it needs to be unique with another column.
Unfortunately I can't enforce this using a unique index because it will fail since many rows may have null in one of the columns. If I were using a regular RDBMS with a sparse unique multi-column index, this would work fine. Unfortunately Mongo has chosen to work in a different way from all of the RDBMS' out there and cannot support this scenario.
Given that partial indexes are not a quick thing to add and don't seem like they will be added anytime soon, why is this issue closed? Please reopen and consider implementing this issue.
Unfortunately, it is not possible yet (I hope it will be at some point)

Dealing with mongodb unique, sparse, compound indexes

Because mongodb will index sparse, compound indexes that contain 1 or more of the indexed fields, it is causing my unique, sparse index to fail because one of those fields is optional, and is being coerced to null by mongodb for the purpose of the index.
I need database-level enforcement of uniqueness for the combination of this field and a few others, and having to manage this at the application level via some concatenated string worries me.
As an alternative, I considered setting the default value of the possibly null indexed field to 'null ' + anObjectId, because it would allow me to keep the index without causing errors. Does this seem like a sensible (although hacky) solution? Does anyone know of a better way I could enforce database-level uniqueness on a compound index?
Edit: I was asked to elaborate on the actual problem domain a bit more, so here it goes.
We get large data feeds from our customers that we need to integrate into our database. These feeds include various (3) unique identifiers supplied by the customer that we use for updating the versions we store in our database when the data feeds refresh. I need to tie uniqueness of these identifiers to the customer, because the same identifier could appear from multiple sources, and we want to allow that.
The document structure looks like this:
{
"identifiers": {
"identifierA": ...,
"identifierB": ...,
"identifierC": ...
},
"client": ...
}
Because each individual identifier is optional (at least one of the three is required), I need to uniquely index the combination of each identifier with the client (e.g. one index is the combination of client plus identifierA). However, the index entry must only exist when the identifier exists, and this is not supported by mongodb (see the hyperlink above).
I was considering the above solution, but I would like to hear if anyone else has solved this or has suggestions.
https://docs.mongodb.org/manual/core/index-partial/
As of MongoDB 3.2 you can create a partial index to support this as well.
db.users.createIndex(
{ name: 1, email: 1 },
{ unique: true, partialFilterExpression: { email: { $exists: true } } }
)
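Applied to the client/identifier documents from the question, the idea can be sketched like this. It is a plain JavaScript toy model, not MongoDB code: makePartialUniqueIndex and the sample values are hypothetical, and the predicate stands in for a filter of the form { "identifiers.identifierA": { $exists: true } }, which is my assumption about how the question's schema would be filtered.

```javascript
// Toy model of a unique partial index: only documents matching `filter` are indexed.
function makePartialUniqueIndex(keyOf, filter) {
  const keys = new Set();
  return {
    insert(doc) {
      if (!filter(doc)) return "ok (not indexed)";
      const key = JSON.stringify(keyOf(doc));
      if (keys.has(key)) return "duplicate key";
      keys.add(key);
      return "ok (indexed)";
    },
  };
}

// Stand-in for the partial filter: the document must carry identifierA.
const hasIdentifierA = (doc) =>
  doc.identifiers !== undefined && "identifierA" in doc.identifiers;

const clientPlusA = makePartialUniqueIndex(
  (doc) => [doc.client, doc.identifiers.identifierA],
  hasIdentifierA
);

const partialResults = [
  clientPlusA.insert({ client: "acme", identifiers: { identifierA: "x1" } }),
  clientPlusA.insert({ client: "acme", identifiers: { identifierB: "y1" } }),
  clientPlusA.insert({ client: "acme", identifiers: { identifierB: "y2" } }),
  clientPlusA.insert({ client: "globex", identifiers: { identifierA: "x1" } }),
  clientPlusA.insert({ client: "acme", identifiers: { identifierA: "x1" } }),
];
console.log(partialResults);
```

Documents without identifierA never enter the index, so any number of them can share a client. The same identifier is accepted for a different client and rejected only when the client matches too.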
A sparse index avoids indexing a document whose indexed field doesn't exist.
A unique index prevents documents with the same field values from being inserted.
Unfortunately as of MongoDB 2.6.7, the unique constraint is always enforced even when creating a compound index (indexing two or more fields) with the sparse and unique properties.
Example:
db = db.getSiblingDB("test");
db.a.drop();
db.a.insert([
{},
{a : 1},
{b : 1},
{a : 1, b : 1}
]);
db.a.ensureIndex({a:1,b:1}, { sparse: true, unique: true } );
db.a.insert({a : 1}); // throws Error but wanted insert to be valid.
However, it works as expected for a single index field with sparse and unique properties.
I feel like this is a bug that will get fixed in future releases.
Anyhow, here are two solutions to get around this problem.
1) Add a non-null hash field to each document that is only computed when all the required fields for checking the uniqueness are supplied.
Then create a sparse unique index on the hash field.
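// MD5 is assumed to come from a hashing library; it is not defined here.
function createHashForUniqueCheck(obj){
if( obj.firstName && obj.id){
return MD5( String( obj.firstName) + String(obj.id) );
}
// No hash: leave the hash field out of the document rather than storing null.
return null;
}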
2) On the application side, check for uniqueness before insertion into Mongodb. :-)
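For solution 1, a runnable variant might look like the sketch below. It assumes Node.js and its built-in crypto module; the field name uniqueHash and the helper withUniqueHash are made up for illustration. It differs from the snippet above in two ways: it leaves the field out instead of storing null (a stored null would still be indexed by a sparse index), and it encodes the parts so that different field combinations cannot concatenate to the same string.

```javascript
const nodeCrypto = require("crypto");

// Returns an MD5 hex digest when every field needed for the uniqueness check
// is present, otherwise undefined so the caller can leave the field out.
function createHashForUniqueCheck(obj) {
  const present = (v) => v !== undefined && v !== null;
  if (!present(obj.firstName) || !present(obj.id)) return undefined;
  // JSON-encode the parts so ("ab", "c") and ("a", "bc") hash differently.
  const material = JSON.stringify([String(obj.firstName), String(obj.id)]);
  return nodeCrypto.createHash("md5").update(material).digest("hex");
}

// Attach the hash only when it exists; the sparse unique index would then
// be created on the (hypothetical) uniqueHash field.
function withUniqueHash(obj) {
  const hash = createHashForUniqueCheck(obj);
  return hash === undefined ? { ...obj } : { ...obj, uniqueHash: hash };
}

const complete = withUniqueHash({ firstName: "Ann", id: 7 });
const incomplete = withUniqueHash({ firstName: "Ann" });
console.log("uniqueHash" in complete, "uniqueHash" in incomplete);
```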
sparse index doc
A hash index ended up being sufficient for this

Mongodb indexing optional fields

I have some fields in my mongodb collection that are optional parts of a search. How can I index this query consistently (i.e. every query, regardless of parameters will use an index) if I don't know what fields the user might be querying?
You can use a Sparse Index
Sparse indexes only contain entries for documents that have the
indexed field, even if the index field contains a null value. The
index skips over any document that is missing the indexed field. The
index is “sparse” because it does not include all documents of a
collection. By contrast, non-sparse indexes contain all documents in a
collection, storing null values for those documents that do not
contain the indexed field.
db.addresses.createIndex( { "xmpp_id": 1 }, { sparse: true } )

Mongodb performance difference between Hash and Ascending indices (Any reason not to use hash in a not ordered field?)

In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)
I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"errmsg" : "exception: Currently only single field hashed index supported.",
"code" : 16763,
"ok" : 0
}
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant? Is the hashed key O(1)?
How I got to the question:
Before I knew that you could not have multi-key indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (hash usually is O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of is the hashed index faster for searches by a key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g.
{key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...
Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.
That means that I had two possible queries:
For the first page db.products.find( { key: "a" } ).limit(10).sort({"sortOrder": 1})
And for the other pages db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({"sortOrder": 1})
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.
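For reference, this is roughly how I picture the ascending compound index serving those two queries: a sorted list of (key, sortOrder) entries, a binary search to find the start of the page, then a short forward scan. The sketch is plain JavaScript over an in-memory array, not MongoDB internals, and all names in it are made up.

```javascript
// Toy model of a compound index { key: 1, sortOrder: 1 }:
// entries are kept sorted by key, then by sortOrder.
function compare(a, b) {
  if (a.key !== b.key) return a.key < b.key ? -1 : 1;
  return a.sortOrder - b.sortOrder;
}

// Binary search for the first entry that is >= target.
function lowerBound(entries, target) {
  let lo = 0;
  let hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compare(entries[mid], target) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// In spirit: find({ key, sortOrder: { $gt: after } }).sort({ sortOrder: 1 }).limit(limit)
function page(entries, key, after, limit) {
  const out = [];
  for (let i = lowerBound(entries, { key, sortOrder: after }); i < entries.length; i++) {
    const e = entries[i];
    if (e.key !== key || out.length === limit) break;
    if (e.sortOrder > after) out.push(e.sortOrder);
  }
  return out;
}

const entries = [];
for (const key of ["a", "b"]) {
  for (let sortOrder = 1; sortOrder <= 25; sortOrder++) entries.push({ key, sortOrder });
}

const firstPage = page(entries, "a", -Infinity, 10);
const secondPage = page(entries, "a", 10, 10);
const lastPage = page(entries, "a", 20, 10);
console.log(firstPage, secondPage, lastPage);
```

The lookup cost here is one O(log(n)) search plus a scan of at most one page, with no separate sort step, because the entries are already in sortOrder within each key.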
For the query db.products.find( { key: "a" } ), which one is more performant?
Given that the field key is indexed in both cases, the complexity of the index search itself would be very similar, as the hashed value of a would still be stored in the index tree.
If we're looking at the overall performance cost, the hashed version would incur an extra (negligible) cost of hashing the value of a before matching the value in the index tree. See also mongo/db/index/hash_access_method.h
Also, a hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (e.g. country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of the second field.
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
A hashed index is commonly used for a specific case of sharding. When a shard key value is monotonically increasing or decreasing, new data would likely go into one shard only. This is where a hashed shard key would be able to improve the distribution of writes. It's a minor trade-off to greatly improve your sharded cluster. See also Hashed vs Ranged Sharding.
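A small simulation shows the effect. It assumes Node.js and its built-in crypto module, and it simplifies heavily: real hashed sharding assigns ranges of hash values to chunks, whereas this sketch just takes the hash modulo a made-up shard count. The function names and numbers are illustrative only.

```javascript
const cryptoLib = require("crypto");
const SHARDS = 4;

// Ranged placement: each shard owns a contiguous range of key values.
function rangedShard(id, maxId) {
  return Math.min(SHARDS - 1, Math.floor((id / maxId) * SHARDS));
}

// Hashed placement (simplified): hash the key, then spread hashes over the shards.
function hashedShard(id) {
  const digest = cryptoLib.createHash("md5").update(String(id)).digest();
  return digest.readUInt32BE(0) % SHARDS;
}

// Count where the most recent 1000 inserts of a monotonically increasing id land.
function distribution(place) {
  const counts = new Array(SHARDS).fill(0);
  for (let id = 9001; id <= 10000; id++) counts[place(id)] += 1;
  return counts;
}

const ranged = distribution((id) => rangedShard(id, 10000));
const hashed = distribution(hashedShard);
console.log("ranged:", ranged);
console.log("hashed:", hashed);
```

With ranged placement all of the newest ids fall into the last shard, while the hashed placement spreads them across all four.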
is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?
Whether it's worth it, depends on the use case. A custom hash value would mean that any query for the hash value would have to go through a custom hashing code i.e. application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.
In a specific type of usage the index will be smaller!
Yes! In a very specific scenario where all three of the following conditions are satisfied.
Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (key-value lookup, e.g., finding a product by the SKU, or finding a user by their ID, etc.)
You don't need range based queries or sorting for the indexed field.
Your field is a very large string and Mongo's numerical hash of the field is smaller than the original field.
For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.
// The type of data in the collection. Each document is a random string with 65 characters.
{
"myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}
The index is about 1/4 of the normal version!
mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
// A regular index. This one is sorted by the value of myLargeRandomString
"myLargeRandomString_-1" : 23074062336,
// The hashed version of the index for the same field. It is around 1/4 of the original size.
"myLargeRandomString_hashed" : 6557511680,
}
NOTE:
If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default.
As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.

Sparse Index in MongoDB not working?

I'm trying to understand sparse index in MongoDB. I understand that if I do this :
> db.check.ensureIndex({"id":1},{sparse:true, unique:true})
I can only insert documents where the id field is not a duplicate and is also not absent.
Hence, I tried,
> db.check.insert({id:1})
> db.check.insert({id:1})
which, as I expected, gave :
E11000 duplicate key error index: test.check.$id_1 dup key: { : 1.0 }
However, inserting a document with a nonexistent id field :
> db.check.insert({})
works! What is going wrong?
A sparse unique index means that a document doesn't need to have the indexed field(s), but when it has that field, it must be unique.
You can add any number of documents into the collection when the field is absent. Note that when you insert an empty document, the _id field will get an auto-generated ObjectID which is (as good as guaranteed) unique.
That's pretty much what sparse means. From the docs;
Sparse indexes only contain entries for documents that have the indexed field. Any document that is missing the field is not indexed.
In other words, your missing id field makes the index not even consider that entry for a unique check.