MongoDB workaround for unsupported sparse unique compound index

I need a workaround because MongoDB does not support sparse unique compound indexes: for a compound index it treats missing fields as null, whereas a single-field sparse index simply leaves documents without the field out of the index. See https://jira.mongodb.org/browse/SERVER-2193
In my particular case I have events. They can either be one-time or recurring. I have a field parent which is only present when the event is an instance of a recurring event (I periodically create new copies of the parent to have the recurring events for the next weeks in the system).
I thought I'd just add this index in order to prevent duplicate copies when the cronjob runs twice
events.ensureIndex({ dateFrom: 1, dateTo: 1, parent: 1 }, { sparse: true, unique: true })
Unfortunately, as said above, MongoDB does not honor sparse on compound indexes. This means that for one-time events the parent field is not present and is treated as null in the index. If I then have a second one-time event at the same time, it causes a duplicate key error, which I only want when parent is set.
Any ideas?
Edit: I've seen MongoDB: Unique and sparse compound indexes with sparse values, but checking for uniqueness at the application level is a no-go. That's what the database is there for: to guarantee uniqueness.

You can add a 4th field which would be dateFrom+dateTo+parent (string concatenation). When parent is not set, substitute a unique value instead, for example one generated by ObjectId(), and then create a unique index on that field.
This way you can enforce the uniqueness you want. However, you can hardly use that field for anything other than enforcing this constraint. (Although prefix queries like "get docs where the string starts with blah blah" may be pretty efficient.)
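A minimal sketch of that workaround in the mongo shell (the uniqueKey field name, the collection name, and the exact concatenation format are assumptions, not part of the original answer):

// Build a synthetic key from dateFrom + dateTo + parent; when there is no
// parent, fall back to a freshly generated ObjectId so one-time events at
// the same time never collide with each other.
function buildUniqueKey(event) {
  var parentPart = event.parent ? String(event.parent) : String(ObjectId());
  return String(event.dateFrom) + "|" + String(event.dateTo) + "|" + parentPart;
}

db.events.ensureIndex({ uniqueKey: 1 }, { unique: true });

// At insert time (values below are placeholders):
var copy = { dateFrom: ISODate("2015-03-02"), dateTo: ISODate("2015-03-03"), parent: ObjectId() };
copy.uniqueKey = buildUniqueKey(copy);
db.events.insert(copy);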

Related

How do I remove the duplicate key check before insert in MongoDB?

I use MongoDB to manage the DB for a yearly contest. Every year many users just renew their registration. MongoDB rejects duplicate emails, so they cannot register if they participated in any edition of earlier years.
My question is: is there any way to remove that limitation? Or maybe change the duplicate-key check to be on, e.g., the "_id" (or whatever) instead of the "email"?
Apart from the mandatory _id field, MongoDB will only enforce uniqueness based on additional unique indexes that have been created. In your situation it sounds like there may be a unique index defined on { email: 1 }.
If that is not the logic that you wish to enforce, then you should drop that index and replace it with a different one. How exactly you define that really depends on your desired application logic. If you had a registrationYear field, for example, perhaps a compound unique index on both of those fields ({ email: 1, registrationYear: 1 }) would be appropriate.
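For example (the collection name, the default index name email_1, and the registrationYear field are assumptions about your schema):

db.registrations.dropIndex("email_1")  // drop the old single-field unique index
db.registrations.createIndex({ email: 1, registrationYear: 1 }, { unique: true })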
But this probably isn't the only way to solve the problem. An alternative approach may be to combine a unique index with a partial index. With this approach, you could define an index as follows, assuming there is some active field in the document that becomes false after the specified amount of time:
db.foo.createIndex({ email: 1 }, { unique: true, partialFilterExpression: { active: true } })
Such an index would only include documents that were currently considered active, therefore only enforcing uniqueness on them. Once it was time to renew and an old document was no longer active, the database would accept a new one.
Another alternative approach would be to just update the existing documents rather than creating new ones. Again this depends on what exactly you are trying to achieve, but you could use a similar approach of marking a document as no longer active and having the registration process perform an upsert (either an insert or an update).
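A rough sketch of that upsert approach (collection and field names are placeholders):

// Creates the registration the first time, refreshes it on renewal.
db.registrations.update(
  { email: "user@example.com" },
  { $set: { active: true, registeredAt: new Date() } },
  { upsert: true }
)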
Finally, if you don't need historical information at all then you could additionally do some sort of archival or deletion (perhaps via a TTL index) to expire the old documents.
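And a TTL sketch, assuming a createdAt date field and a one-year lifetime (both assumptions):

// MongoDB's TTL monitor removes documents once createdAt is older than ~365 days.
db.registrations.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 365 })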

Unique multi-key hashed index in MongoDB

I have a collection with several billion documents and need to create a unique multi-key index for every attribute of my documents.
The problem is, I get an error if I try to do that because the generated keys would be too large.
pymongo.errors.OperationFailure: WiredTigerIndex::insert: key too large to index, failing
I found out MongoDB lets you create hashed indexes, which would resolve this problem, however they are not to be used for multi-key indexes.
How can I resolve this?
My first idea was to create another attribute for each of my documents containing a hash of every value of its attributes, then create an index on that new field.
However, this would mean recalculating the hash every time I want to add a new attribute, plus the excessive amount of time needed to create both the hashes and the indexes.
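For what it's worth, a sketch of that hash idea in the mongo shell could look like the following (the attrHash field, the collection name, and the use of the shell's hex_md5 helper are all assumptions):

// Concatenate all attribute values in a stable (sorted-key) order and hash the result.
function attributeHash(doc) {
  var keys = Object.keys(doc).filter(function (k) { return k !== "_id"; }).sort();
  var joined = keys.map(function (k) { return k + "=" + String(doc[k]); }).join("|");
  return hex_md5(joined);
}

db.items.createIndex({ attrHash: 1 }, { unique: true });

var doc = { attr1: "some very long value", attr2: 42 };
doc.attrHash = attributeHash(doc);
db.items.insert(doc);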
This is behavior introduced in MongoDB 2.6 to prevent the total size of an index entry from exceeding 1024 bytes (also known as the Index Key Length Limit).
In MongoDB 2.6, if you attempt to insert or update a document so that the value of an indexed field is longer than the Index Key Length Limit, the operation will fail and return an error to the client. In previous versions of MongoDB, these operations would successfully insert or modify a document but the index or indexes would not include references to the document.
For migration purposes and other temporary scenarios you can revert to the 2.4 handling of this case, where the error is not raised, by setting this MongoDB server parameter:
db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
This however is not recommended.
Also consider that creating indexes for every attribute of your documents may not be the optimal solution at all.
Have you examined how you query your documents and which fields you key on? Have you used explain to view the query plan? It would be an exception to the rule if you tell us that you query on all fields all the time.
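For reference, inspecting a query plan in the shell looks roughly like this (collection name and filter are placeholders):

// Shows which index (if any) is used and how much of the collection is scanned.
db.items.find({ someField: "someValue" }).explain()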
Here are the recommended MongoDB indexing strategies.
Excessive indexing has a price as well and should be avoided.

Dealing with mongodb unique, sparse, compound indexes

Because MongoDB includes a document in a sparse compound index as long as it contains at least one of the indexed fields, my unique, sparse index fails: one of those fields is optional and is being coerced to null by MongoDB for the purposes of the index.
I need database-level assurance of uniqueness for the combination of this field and a few others, and having to manage this at the application level via some concatenated string worries me.
As an alternative, I considered setting the default value of the possibly-null indexed field to 'null ' + anObjectId, because it would allow me to keep the index without causing errors. Does this seem like a sensible (although hacky) solution? Does anyone know of a better way I could enforce database-level uniqueness on a compound index?
Edit: I was asked to elaborate on the actual problem domain a bit more, so here it goes.
We get large data feeds from our customers that we need to integrate into our database. These feeds include three distinct unique identifiers supplied by the customer, which we use for updating the versions we store in our database when the data feeds refresh. I need to tie the uniqueness of these identifiers to the customer, because the same identifier could appear from multiple sources, and we want to allow that.
The document structure looks like this:
{
  "identifiers": {
    "identifierA": ...,
    "identifierB": ...,
    "identifierC": ...
  },
  "client": ...
}
Because each individual identifier is optional (at least one of the three is required), I need to uniquely index the combination of each identifier with the client (e.g. one index is the combination of client plus identifierA). However, each of these indexes must only apply when the identifier exists, and this is not supported by MongoDB (see the hyperlink above).
I was considering the above solution, but I would like to hear if anyone else has solved this or has suggestions.
https://docs.mongodb.org/manual/core/index-partial/
As of MongoDB 3.2 you can create a partial index to support this as well.
db.users.createIndex(
  { name: 1, email: 1 },
  { unique: true, partialFilterExpression: { email: { $exists: true } } }
)
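Applied to the document structure from the question, a sketch for one of the three identifiers might be (the collection name is an assumption; you would create one such index per identifier):

db.feedDocuments.createIndex(
  { client: 1, "identifiers.identifierA": 1 },
  { unique: true, partialFilterExpression: { "identifiers.identifierA": { $exists: true } } }
)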
A sparse index avoids indexing a field that doesn't exist.
A unique index prevents documents from being inserted that have the same field values.
Unfortunately as of MongoDB 2.6.7, the unique constraint is always enforced even when creating a compound index (indexing two or more fields) with the sparse and unique properties.
Example:
db = db.getSiblingDB("test");
db.a.drop();
db.a.insert([
  {},
  { a: 1 },
  { b: 1 },
  { a: 1, b: 1 }
]);
db.a.ensureIndex({ a: 1, b: 1 }, { sparse: true, unique: true });
db.a.insert({ a: 1 }); // throws a duplicate key error, but we wanted this insert to be valid
However, it works as expected for a single index field with sparse and unique properties.
I feel like this is a bug that will get fixed in future releases.
Anyhow, here are two solutions to get around this problem.
1) Add a non-null hash field to each document that is only computed when all the required fields for checking the uniqueness are supplied.
Then create a sparse unique index on the hash field.
function createHashForUniqueCheck(obj) {
  if (obj.firstName && obj.id) {
    return MD5(String(obj.firstName) + String(obj.id));
  }
  return null;
}
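The corresponding sparse unique index on that hash field could then be created roughly like this (the uniquenessHash field name is an assumption, and MD5 above is whatever hash helper your environment provides; note that for the sparse index to skip a document, the field must be omitted entirely, not stored as null):

db.users.ensureIndex({ uniquenessHash: 1 }, { sparse: true, unique: true });

// At insert time, only set the field when the hash could be computed:
var doc = { firstName: "Ada", id: 42 };
var hash = createHashForUniqueCheck(doc);
if (hash !== null) {
  doc.uniquenessHash = hash;
}
db.users.insert(doc);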
2) On the application side, check for uniqueness before insertion into Mongodb. :-)
sparse index doc
A hash index ended up being sufficient for this

Mongodb id on bulk insert performance

I have a class/object that has a GUID, and I want to use that field as the _id when it is saved to MongoDB. Is it possible to use another value instead of the ObjectId?
Are there any performance considerations when doing a bulk insert and supplying the _id field? Is _id an index? If I set the _id to a different field, would it slow down the bulk insert? I'm inserting about 10 million records.
1) Yes, you can use that field as the id. There is no mention of what API (if any) you are using for inserting the documents, so if you were doing the insertion from the shell, the command would be:
db.collection.insert({_id : <BSONString_version_of_your_guid_value>, field1 : value1, ...});
It doesn't have to be a BSON string. Use whatever BSON type most closely matches your GUID's original type (except the array type; arrays aren't allowed as the value of the _id field).
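For instance, in the mongo shell a GUID can be stored directly as the _id using the UUID helper (the hex string below is a placeholder):

// UUID(...) produces BinData subtype 4, which is a valid _id value.
db.collection.insert({ _id: UUID("0123456789abcdef0123456789abcdef"), field1: "value1" });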
2) As far as I know, there IS an effect on performance when you provide your own ids to db.collection.insert, especially in bulk, BUT if the ids are increasing/sorted there shouldn't be a performance loss. The reason, quoting:
The structure of index is a B-tree. ObjectIds have an excellent insertion order as far as the index tree is concerned: they are always increasing, meaning they are always inserted at the right edge of B-tree. This, in turn, means that MongoDB only has to keep the right edge of the B-Tree in memory.
Conversely, a random value in the _id field means that _ids will be inserted all over the tree. Then the machine must move a page of the index into memory, update a tiny piece of it, then probably ignore it until it slides out of memory again. This is less efficient.
— from the book `50 Tips and Tricks for MongoDB Developers`
The tip's title says - "Override _id when you have your own simple, unique id." Clearly it is better to use your id if you have one and you don't need the properties of an ObjectId. And it is best if your ids are increasing for the reason stated above.
3) There is a default index on _id field by MongoDB.
So...
Yes. It is possible to use types other than ObjectId, including a GUID, which will be saved as BinData.
Yes, there are considerations. It's better if your _id is always increasing (like a growing number, or an ObjectId); otherwise inserts land all over the index B-tree, which is less efficient (see the quote above). If you plan on using sharding, the _id values should also distribute evenly across shards (for example via a hashed shard key).
_id indeed has an index automatically.
It depends on the type you choose. See section 2.
Conclusion: It's better to keep using ObjectId unless you have a good reason not to.

Should I use sparse index for boolean flags in mongodb?

I have a boolean flag :finished. Should I
A: index({ finished: 1 })
B: index({ finished: 1 }, {sparse: true})
C: use flag :unfinished instead, to query by that
D: other?
Ruby Mongoid syntax. Most of my records will have finished=true, and most operations fetch the unfinished ones, obviously. I'm not sure I understand when to use sparse and when not to. Thanks!
The sparse flag is a little weird. To understand when to use it, you have to understand why "sparse" exists in the first place.
When you create a simple index on one field, there is an entry for each document, even documents that don't have that field.
For example, if you have an index on {rarely_set_field : 1}, you will have an index that is filled mostly with nulls because that field doesn't exist in most documents. This is a waste of space and inefficient to search.
The {sparse: true} option skips documents that lack the field, so you get an index that only contains entries where rarely_set_field is defined.
Back to your case.
You are asking about using a boolean + sparse. But sparse doesn't really care about "boolean"; sparse cares about "is set vs. is not set".
In your case, you are trying to fetch unfinished. To leverage sparse the key is not the boolean value, but the fact that unfinished entries have that key and that "finished" entries have no key at all.
{ _id: 1, data: {...}, unfinished: true }
{ _id: 2, data: {...} } // this entry is finished
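In shell terms (collection name and field names are placeholders; the Mongoid mapping is left aside), the sparse-index pattern looks like this:

// Only documents that actually have the unfinished field end up in the index.
db.tasks.ensureIndex({ unfinished: 1 }, { sparse: true })

// Fetching pending work scans only that small index:
db.tasks.find({ unfinished: true })

// Marking a task done removes it from the index by unsetting the field:
db.tasks.update({ _id: taskId }, { $unset: { unfinished: "" } })  // taskId is a placeholder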
It sounds like you are using a Queue
You can definitely leverage the information above to implement a sparse index. However, it actually sounds like you are using a Queue. MongoDB is serviceable as a Queue, here are two examples.
However, if you look at those queue implementations, they are not doing it the way you are doing it. I'm personally using MongoDB as a Queue for some production systems and it runs pretty well, but test your expected load, as a dedicated Queue will perform much better.
Sparse only helps when the field is missing entirely, not when it is false. When you say "most will have finished=true", I'm guessing that finished is set on most documents, making sparse not very beneficial.
And since most documents share the same single value, I doubt any type of index on that field would help at all unless your queries are more specific.