Dealing with mongodb unique, sparse, compound indexes - mongodb

Because MongoDB indexes a document in a sparse compound index when it contains one or more of the indexed fields, my unique, sparse index is failing: one of those fields is optional, and MongoDB coerces it to null for the purposes of the index.
I need database-level assurance of uniqueness for the combination of this field and a few others, and having to manage this at the application level via some concatenated string worries me.
As an alternative, I considered setting the default value of the possibly null indexed field to 'null ' + anObjectId, because it would allow me to keep the index without causing errors. Does this seem like a sensible (although hacky) solution? Does anyone know of a better way I could enforce database-level uniqueness on a compound index?
Edit: I was asked to elaborate on the actual problem domain a bit more, so here it goes.
We get large data feeds from our customers that we need to integrate into our database. These feeds include various (3) unique identifiers supplied by the customer that we use for updating the versions we store in our database when the data feeds refresh. I need to tie uniqueness of these identifiers to the customer, because the same identifier could appear from multiple sources, and we want to allow that.
The document structure looks like this:
{
"identifiers": {
"identifierA": ...,
"identifierB": ...,
"identifierC": ...
},
"client": ...
}
Because each individual identifier is optional (at least one of the three is required), I need to uniquely index the combination of each identifier with the client (e.g. one index is the combination of client plus identifierA). However, this index must only apply when the identifier exists, and this is not supported by MongoDB (see the hyperlink above).
I was considering the above solution, but I would like to hear if anyone else has solved this or has suggestions.

https://docs.mongodb.org/manual/core/index-partial/
As of MongoDB 3.2, you can create a partial index to support this as well.
db.users.createIndex(
{ name: 1, email: 1 },
{ unique: true, partialFilterExpression: { email: { $exists: true } } }
)
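Applied to the document structure from the question, the same idea might look like the sketch below. The helper name buildIdentifierIndexes and the collection name items are assumptions for illustration; only createIndex, unique and partialFilterExpression come from the example above.

```javascript
// Hypothetical helper: builds one unique partial index definition per optional
// identifier, so uniqueness is enforced per client only where that identifier exists.
function buildIdentifierIndexes(identifierNames) {
  return identifierNames.map((name) => {
    const path = `identifiers.${name}`;
    return {
      keys: { client: 1, [path]: 1 },
      options: {
        unique: true,
        partialFilterExpression: { [path]: { $exists: true } },
      },
    };
  });
}

const indexes = buildIdentifierIndexes(["identifierA", "identifierB", "identifierC"]);
console.log(JSON.stringify(indexes[0]));

// In the shell, each definition would then be applied with something like:
//   indexes.forEach((ix) => db.items.createIndex(ix.keys, ix.options));
// ("items" is an assumed collection name.)
```

Note that { $exists: true } also matches a field that is present with an explicit null value, so an absent identifier should be left out of the document rather than stored as null.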

A sparse index avoids indexing a field that doesn't exist.
A unique index prevents documents with the same field values from being inserted.
Unfortunately as of MongoDB 2.6.7, the unique constraint is always enforced even when creating a compound index (indexing two or more fields) with the sparse and unique properties.
Example:
db = db.getSiblingDB("test");
db.a.drop();
db.a.insert([
{},
{a : 1},
{b : 1},
{a : 1, b : 1}
]);
db.a.ensureIndex({a:1,b:1}, { sparse: true, unique: true } );
db.a.insert({a : 1}); // throws Error but wanted insert to be valid.
However, it works as expected for a single index field with sparse and unique properties.
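To see why the last insert is rejected, here is a small in-memory model of the behavior described above. It is a simplification written for this explanation, not MongoDB code: a document is skipped only when it has none of the indexed fields, and any missing field is stored as null in the key.

```javascript
// Simplified model (not MongoDB code) of a sparse + unique compound index.
function indexKey(doc, fields) {
  // The document is left out of the index only when ALL indexed fields are missing.
  const hasAny = fields.some((f) => f in doc);
  if (!hasAny) return null; // not indexed at all
  // Fields that are missing are represented as null in the index key.
  return JSON.stringify(fields.map((f) => (f in doc ? doc[f] : null)));
}

function tryInsert(keys, doc, fields) {
  const key = indexKey(doc, fields);
  if (key !== null && keys.has(key)) return false; // duplicate key error
  if (key !== null) keys.add(key);
  return true;
}

const fields = ["a", "b"];
const keys = new Set();
const results = [{}, { a: 1 }, { b: 1 }, { a: 1, b: 1 }, { a: 1 }, {}].map((doc) =>
  tryInsert(keys, doc, fields)
);
console.log(results); // [ true, true, true, true, false, true ]
```

The second { a: 1 } collides on the key (1, null), while the second empty document is accepted because it never enters the index.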
I feel like this is a bug that will get fixed in future releases.
Anyhow, here are two solutions to get around this problem.
1) Add a non-null hash field to each document that is only computed when all the required fields for checking the uniqueness are supplied.
Then create a sparse unique index on the hash field.
function createHashForUniqueCheck(obj){
if( obj.firstName && obj.id){
return MD5( String( obj.firstName) + String(obj.id) );
}
return null;
}
2) On the application side, check for uniqueness before insertion into Mongodb. :-)
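Option 1 could be sketched in Node.js roughly as follows. The function name uniqueHash and the use of Node's built-in crypto module are assumptions (the MD5 helper above is not defined in the snippet). Two deliberate differences from the snippet: it returns undefined instead of null so that the field can be omitted entirely (a field stored as null would still enter a sparse index), and it serializes the parts as a JSON array so that "ab" + "c" and "a" + "bc" do not produce the same hash.

```javascript
const crypto = require("crypto");

// Returns a hex MD5 digest when both fields are present, otherwise undefined.
function uniqueHash(obj) {
  // `== null` catches both null and undefined but still accepts 0 or "".
  if (obj.firstName == null || obj.id == null) return undefined;
  // A JSON array keeps the boundary between the two values unambiguous.
  const material = JSON.stringify([String(obj.firstName), String(obj.id)]);
  return crypto.createHash("md5").update(material).digest("hex");
}

// Only attach the hash field when there is something to hash.
function withUniqueHash(obj) {
  const hash = uniqueHash(obj);
  return hash === undefined ? { ...obj } : { ...obj, uniqueHash: hash };
}

console.log(withUniqueHash({ firstName: "Ann", id: 7 }));
console.log(withUniqueHash({ firstName: "Ann" })); // no uniqueHash field
```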

A hash index ended up being sufficient for this

Related

Mongodb performance difference between Hash and Ascending indices (Any reason not to use hash in a not ordered field?)

In mongodb there are multiple types of index. For this question I'm interested in the ascending (or descending) index which can be used for sorting and the hash index which according to the documentation is "primarily used with sharded clusters to support hashed shard keys" (source) ensuring "a more even distribution of data"(source)
I know that you can't create an index like: db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) because you get an error
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"errmsg" : "exception: Currently only single field hashed index supported.",
"code" : 16763,
"ok" : 0
}
My question:
Between the indices:
db.test.ensureIndex( { "key": 1 } )
db.test.ensureIndex( { "key": "hashed" } )
For the query db.products.find( { key: "a" } ), which one is more performant? Is the hashed index lookup O(1)?
How I got to the question:
Before I knew that you could not have compound (multi-field) indices with hashed, I created an index of the form db.test.ensureIndex( { "key": 1, "sortOrder": 1 } ), and while creating it I wondered if the hashed index was more performant than the ascending one (hash usually is O(1)). I left the key as it is now because (as I mentioned above) db.test.ensureIndex( { "key": "hashed", "sortOrder": 1 } ) was not allowed. But the question of whether the hashed index is faster for searches by key stayed in my mind.
The situation in which I made the index was:
I had a collection that contained a sorted list of documents classified by keys.
e.g.
{key: a, sortOrder: 1, ...}, {key: a, sortOrder: 2, ...}, {key: a, sortOrder: 3, ...}, {key: b, sortOrder: 1, ...}, {key: b, sortOrder: 2, ...}, ...
Since I used the key to classify and the sortOrder for pagination, I always queried filtering with one value for the key and using the sortOrder for the order of the documents.
That means that I had two possible queries:
For the first page db.products.find( { key: "a" } ).limit(10).sort({"sortOrder": 1})
And for the other pages db.products.find( { key: "a" , sortOrder: { $gt: 10 } } ).limit(10).sort({"sortOrder": 1})
In this specific scenario, searching with O(1) for the key and O(log(n)) for the sortOrder would have been ideal, but that wasn't allowed.
For the query db.products.find( { key: "a" } ), which one is more performant?
Given that the field key is indexed in both cases, the complexity of the index search itself would be very similar, because the hashed value of "a" is also stored in an index tree.
If we're looking at the overall performance cost, the hashed version would incur an extra (negligible) cost of hashing the value "a" before matching it in the index tree. See also mongo/db/index/hash_access_method.h
Also, a hashed index would not be able to utilise index prefix compression (WiredTiger). Index prefix compression is especially effective for some data sets, like those with low cardinality (e.g., country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of the second field.
Any reason not to use hash in a non-ordered field?
Generally there is no reason to hash a non-range value. To choose a shard key, consider the cardinality, frequency, and rate of change of the value.
A hashed index is commonly used for a specific case of sharding. When a shard key value increases or decreases monotonically, the data is likely to go into one shard only. This is where a hashed shard key can improve the distribution of writes. It's a minor trade-off to greatly improve your sharded cluster. See also Hashed vs Ranged Sharding.
is it worth to insert a random hash or value with the document, and use that for sharding instead of a hash generated on the _id ?
Whether it's worth it, depends on the use case. A custom hash value would mean that any query for the hash value would have to go through a custom hashing code i.e. application.
The advantage for utilising the built-in hash function is that MongoDB automatically computes the hashes when resolving queries using hashed indexes. Therefore, applications do not need to compute hashes.
In a specific type of usage the index will be smaller!
Yes! In a very specific scenario where all three of the following conditions are satisfied.
Your access pattern (how you search) must be only to find documents with a specific value for the indexed field (key-value lookup, e.g., finding a product by the SKU, or finding a user by their ID, etc.)
You don't need range based queries or sorting for the indexed field.
Your field is a very large string and Mongo's numerical hash of the field is smaller than the original field.
For example, I created two indexes, and for the hashed version, the size of the index was smaller. This can result in better memory and disk utilization.
// The type of data in the collection. Each document is a random string with 65 characters.
{
"myLargeRandomString": "40a9da87c3e22fe5c47392b0209f296529c01cea3fa35dc3ba2f3d04f1613f8e"
}
The index is about 1/4 of the normal version!
mongos> use MyDb
mongos> db.myCollection.stats()["indexSizes"]
{
// A regular index. This one is sorted by the value of myLargeRandomString
"myLargeRandomString_-1" : 23074062336,
// The hashed version of the index for the same field. It is around 1/4 of the original size.
"myLargeRandomString_hashed" : 6557511680,
}
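As a quick check of the "around 1/4" figure, the arithmetic on the two sizes reported above can be run directly. This is only a calculation on the numbers from the stats output, not a benchmark.

```javascript
// Index sizes in bytes, copied from the stats output above.
const indexSizes = {
  "myLargeRandomString_-1": 23074062336,
  "myLargeRandomString_hashed": 6557511680,
};

const ratio = indexSizes["myLargeRandomString_hashed"] / indexSizes["myLargeRandomString_-1"];
const toGiB = (bytes) => bytes / 2 ** 30;

console.log(ratio.toFixed(3)); // 0.284
console.log(toGiB(indexSizes["myLargeRandomString_-1"]).toFixed(1)); // 21.5
console.log(toGiB(indexSizes["myLargeRandomString_hashed"]).toFixed(1)); // 6.1
```

So the hashed index is roughly 28% of the size of the regular one for this data set.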
NOTE:
If you're already using _id as the foreign key for your documents, then this is not relevant since collections will have an _id index by default.
As always, do your own testing of your data to check if this change will actually benefit you. There is a significant tradeoff in terms of search capabilities on this type of index.

Mongo indexes when searching on one value, sorting on another

I've started to get a hang of proper indexing in Mongo, but there's one thing I'm a bit confused on.
If I want to search on one field (level) and sort on another (random), how do I setup that index? Which field comes first?
Note: Above, by random, I mean I have a field called random. I am not sorting on a randomly selected field.
Sorting on a "random" field in MongoDB is not a good idea. If the sort is not supported by an index, the sorting will be done in memory, which is a problem for large result sets.
An index can support sort operations on a non-prefix subset of the index key pattern. To do so, the query must include equality conditions on all the prefix keys that precede the sort keys. So if your index is { a : 1, b : 1} you can have query like this
db.data.find( { a: "foo" } ).sort( { b: 1 } )
The MongoDB documentation explains it well: http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
Edit: Based on updated question your index should be { level : 1, random : 1}
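The reason the equality field goes first can be illustrated with a small in-memory model of how a { level : 1, random : 1 } index orders its entries. This is a sketch with made-up sample values, not how MongoDB stores data internally.

```javascript
// Made-up sample documents, for illustration only.
const docs = [
  { level: 2, random: 0.9 },
  { level: 1, random: 0.5 },
  { level: 2, random: 0.1 },
  { level: 1, random: 0.2 },
  { level: 2, random: 0.4 },
];

// Entries of an index on { level: 1, random: 1 } are ordered by level, then by random.
const indexOrder = [...docs].sort((x, y) => x.level - y.level || x.random - y.random);

// An equality match on level selects one contiguous run of entries...
const positions = indexOrder
  .map((doc, i) => (doc.level === 2 ? i : -1))
  .filter((i) => i !== -1);

// ...and inside that run the entries are already ordered by random,
// so no separate in-memory sort is needed.
const randoms = positions.map((i) => indexOrder[i].random);
console.log(positions); // [ 2, 3, 4 ]
console.log(randoms); // [ 0.1, 0.4, 0.9 ]
```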

MongoDB workaround for not supported sparse unique compound index

I need a workaround because MongoDB does not support sparse unique compound indexes (it will set the values to null if not present whereas it doesn't add the field to the index when it's a non-compound index). See https://jira.mongodb.org/browse/SERVER-2193
In my particular case I have events. They can either be one-time or recurring. I have a field parent which is only present when the event is an instance of a recurring event (I periodically create new copies of the parent to have the recurring events for the next weeks in the system).
I thought I'd just add this index in order to prevent duplicate copies when the cronjob runs twice
events.ensureIndex({ dateFrom: 1, dateTo: 1, parent: 1 }, { sparse: true, unique: true })
Unfortunately as said above MongoDB does not support sparse on compound indexes. What this means is that for one-time events the parent field is not present and is set to null by MongoDB. If I now have a second one-time event at the same time, it causes a duplicate key error, which I only want when parent is set.
Any ideas?
Edit: I've seen MongoDB: Unique and sparse compound indexes with sparse values , but checking for uniqueness on application level is a no-go. I mean that's what the database is there for, to guarantee uniqueness.
You can add a 4th field which would be dateFrom+dateTo+parent (string concatenation). When the parent is null, choose a uid, for example from ObjectId function, and then index that field (unique).
This way you can enforce the uniqueness you want. However you can hardly use it for anything else than enforcing this constraint. (Although queries like "get docs where the string starts with blah blah" may be pretty efficient)
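A minimal sketch of building that fourth field in Node.js follows. The function name dedupeKey, the "|" separator and the use of crypto.randomUUID() as a stand-in for an ObjectId are all assumptions for illustration; dates are assumed to already be strings.

```javascript
const crypto = require("crypto");

// Builds the value for a unique-indexed field from dateFrom, dateTo and parent.
function dedupeKey(event) {
  // When there is no parent, a random token makes the key unique on purpose,
  // so one-time events never collide with each other.
  const parentPart =
    event.parent != null ? `parent:${event.parent}` : `none:${crypto.randomUUID()}`;
  return [event.dateFrom, event.dateTo, parentPart].join("|");
}

const recurringA = dedupeKey({ dateFrom: "2015-01-05T10:00", dateTo: "2015-01-05T11:00", parent: "p1" });
const recurringB = dedupeKey({ dateFrom: "2015-01-05T10:00", dateTo: "2015-01-05T11:00", parent: "p1" });
const oneTimeA = dedupeKey({ dateFrom: "2015-01-05T10:00", dateTo: "2015-01-05T11:00" });
const oneTimeB = dedupeKey({ dateFrom: "2015-01-05T10:00", dateTo: "2015-01-05T11:00" });

console.log(recurringA === recurringB); // true: a second copy would hit the unique index
console.log(oneTimeA === oneTimeB); // false: one-time events are allowed to overlap
```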

sparse indexes and null values in mongo

I'm not sure I understand sparse indexes correctly.
I have a sparse unique index on fbId
{
"ns" : "mydb.users",
"key" : {
"fbId" : 1
},
"name" : "fbId_1",
"unique" : true,
"sparse" : true,
"background" : false,
"v" : 0
}
And I was expecting that would allow me to insert records with null as the fbId, but that throws a duplicate key exception. It only allows me to insert if the fbId property is removed completely.
Isn't a sparse index supposed to deal with that?
Sparse indexes do not contain documents that lack the indexed field. However, if the field exists and has a value of null, it will still be indexed. So, if the absence of the field and its equality to null look the same to your application and you want to maintain uniqueness of fbId, just don't insert it until you have a value for it.
You need sparse indexes when you have a large number of documents, but only a small portion of them contains some field, and you want to be able to quickly find documents by that field. Creating a normal index would be too expensive, you would just waste precious RAM on indexing documents you're not interested in.
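One way to follow the "just don't insert it" advice on the application side is to strip null and undefined fields before saving. The helper name omitNullFields is hypothetical, and the sketch only handles top-level fields.

```javascript
// Returns a shallow copy of doc without fields whose value is null or undefined.
function omitNullFields(doc) {
  return Object.fromEntries(
    Object.entries(doc).filter(([, value]) => value != null)
  );
}

const user = omitNullFields({ name: "Ann", fbId: null });
console.log("fbId" in user); // false
console.log(user.name); // Ann
```

A document prepared this way has no fbId field at all, so it stays out of the sparse unique index.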
To get the most out of an index, you may want to leave out of it the documents that do NOT contain the indexed field. For this, MongoDB has the sparse property, which works as follows:
db.addresses.ensureIndex( { "secondAddress": 1 }, { sparse: true } );
This index will omit all the documents not containing the secondAddress field, and when performing a query, those documents will never be scanned.
Let me share this article about basic indexes and some of their properties:
Geospatial, Text, Hash indexes and unique and sparse properties: http://mongodbspain.com/en/2014/02/03/mongodb-indexes-part-2-geospatial-2d-2dsphere/
{a:1, b:5, c:2}
{a:8, b:15, c:7}
{a:4, b:7}
{a:3, b:10}
Let's assume that we wish to create a unique index on the above documents. Creating the index on a and b will not be a problem. But what if we need to create the index on c? The unique constraint will not work for c because the null value is duplicated for the 2 documents that lack it. The solution in this case is to use the sparse option. This option tells the database not to include the documents that are missing the key. The command in question is db.collectionName.createIndex({thing:1}, {unique:true, sparse:true}). The sparse index lets us use less space as well.
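The difference can be worked through on the four documents above with a small in-memory model. This is a simplification for illustration, not MongoDB code: a regular index stores null for a missing c, while a sparse index stores no entry at all.

```javascript
const docs = [
  { a: 1, b: 5, c: 2 },
  { a: 8, b: 15, c: 7 },
  { a: 4, b: 7 },
  { a: 3, b: 10 },
];

// Regular index on c: a missing field is indexed as null.
const regularKeys = docs.map((doc) => ("c" in doc ? doc.c : null));

// Sparse index on c: documents without the field get no entry.
const sparseKeys = docs.filter((doc) => "c" in doc).map((doc) => doc.c);

const hasDuplicates = (keys) => new Set(keys).size !== keys.length;

console.log(regularKeys, hasDuplicates(regularKeys)); // [ 2, 7, null, null ] true
console.log(sparseKeys, hasDuplicates(sparseKeys)); // [ 2, 7 ] false
```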
Notice that even if we have a sparse index, the database may still scan all documents, especially when sorting. This can be seen in the winning plan section of explain's result.
Sparse indexes only contain entries for documents that have the indexed field, even if the index field contains a null value. The index skips over any document that is missing the indexed field. The index is "sparse" because it does not include all documents of a collection.

MongoDB: Unique Key in Embedded Document

Is it possible to set a unique key for a key in an embedded document?
I have a Users collection with the following sample documents:
{
Name: "Bob",
Items: [
{
Name: "Milk"
},
{
Name: "Bread"
}
]
},
{
Name: "Jim"
},
Is there a way to create an index on the property Items.Name?
I got the following error when I tried to create an index:
> db.Users.ensureIndex({"Items.Name": 1}, {unique:true});
E11000 duplicate key error index: GroceryGuruApp.Users.$Items.Name_1 dup key: { : null }
Any suggestions? Thank you!
Unique indexes enforce uniqueness only across the documents of a collection. To enforce uniqueness and other constraints within a document, you must do it in client code. (Probably virtual collections would allow that, you could vote for it.)
What you are trying to do in your case is create an index on the key Items.Name, which doesn't exist in any of the documents (it doesn't refer to embedded documents inside array Items), thus it's null and violates the unique constraint across the collection.
You can create a unique compound sparse index to accomplish something like what you are hoping for. It may not be the best option (client side still might be better), but it can do what you're asking depending on specific requirements.
To do it, you'll need to create another field on the same level as Name: Bob that is unique to each top-level record (could do FirstName + LastName + Address, we'll call this key Identifier).
Then create an index like this:
ensureIndex({'Identifier':1, 'Items.Name':1},{'unique':1, 'sparse':1})
A sparse index will ignore items that don't have the field, so that should get around your NULL key issue. Combining your unique Identifier and Items.Name as a compound unique index should ensure that you can't have the same item name twice per person.
Although I should add that I've only been working with Mongo for a couple of months and my science could be off. This is not based on empirical evidence but rather observed behavior.
More on MongoDB Indexes
Compound Keys Indexes
Sparse Indexes
An alternative would be to model the items as a hash with the item name as the key.
Items: { "Milk": 1, "Bread": 1 }
I'm not sure about whether you're trying to use the index for performance or purely for the constraint. The right way to approach depends on your use cases, and determining whether the atomic operations are enough to keep your data consistent.
The index will be across all Users and since you asked it for 'unique', no user will be able to have two of the same named item AND no two users will be able to have the same named Item.
Is that what you want?
Furthermore, it appears that it's objecting to two Users having a 'null' value for Items.Name, clearly Jim does, is there another record like that?
It would be unusual to require uniqueness on an indexed collection like this.
MongoDB does allow unique indexes where it indexes only the first of each value, see
http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues, but I suspect the real solution is to not require uniqueness in this case.
If you want to ensure uniqueness only within the Items for a single user you might want to try the $addToSet option. See http://www.mongodb.org/display/DOCS/Updating#Updating-%24addToSet
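A sketch of what such an update could look like for the Users documents above. The helper name addItemUpdate is hypothetical; it only builds the filter and update documents, which would then be passed to an update call.

```javascript
// Builds the filter and update documents for adding an item to one user.
function addItemUpdate(userName, itemName) {
  return {
    filter: { Name: userName },
    // $addToSet appends the element only if an identical element is not already in the array.
    update: { $addToSet: { Items: { Name: itemName } } },
  };
}

const op = addItemUpdate("Bob", "Milk");
console.log(JSON.stringify(op.update)); // {"$addToSet":{"Items":{"Name":"Milk"}}}

// In the shell this would be used roughly as:
//   db.Users.update(op.filter, op.update);
```

Keep in mind that $addToSet compares whole array elements, so it only prevents duplicates when the embedded documents are exactly identical. An item with the same Name but an extra field would still be added.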
You can use findAndModify to create a sequence/counter function.
function getNextSequence(name) {
var ret = db.counters.findAndModify({
query: { _id: name },
update: { $inc: { seq: 1 } },
new: true,
upsert: true
});
return ret.seq;
}
Then use it whenever a new id is needed...
db.users.insert({
_id: getNextSequence("userid"),
name: "Sarah C."
})
This is from http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/. Check it out.