Custom MongoDB Object _id vs Compound index - mongodb

I need to create a lookup collection in MongoDB to verify uniqueness. The requirement is to check whether the same pair of values is repeated. In SQL, I would do something like this:
SELECT count(id) WHERE key1 = 'value1' AND key2 = 'value2'
If the above query returns a non-zero count, the combination is not unique. I have two solutions in mind, but I am not sure which one is more scalable. There are 30M+ docs against which I need to create this mapping.
Solution1:
I create a collection of docs with a compound index on key1 and key2
{
_id: <MongoID>,
key1: <value1>,
key2: <value2>
}
Solution2:
I write application logic to create custom _id by concatenating value1 and value2
{
_id: <value1>_<value2>
}
Personally, I feel the second one is more optimised as it only has a single index and the size of doc is also smaller. But I am not sure if it is a good practice to create my own _id indexes as they may not be completely random. What do you think?
Thanks in advance.
Update:
My database already has a lot of indexes which take up memory, so I want to keep index size as small as possible, especially for collections which are only used to verify uniqueness.

I would suggest Solution 1, i.e. use a compound index and keep key1 and key2 as two separate properties:
db.yourCollection.createIndex( { "key1": 1, "key2": 1 }, { unique: true } )
You can easily query by an individual field if required, since key1 and key2 are stored separately (note that the compound index itself only helps queries that include key1, its prefix). If you build _id by concatenating the keys, querying by an individual field becomes hard.
Document size is usually the least of your concerns when designing documents in MongoDB.
If, in the near future, you need to change the key values of an existing document, it will be easy (an _id cannot be changed once a document is inserted). Keep this in mind if documents in other collections reference this document.
In terms of scalability, the default _id index would be sequential and easily shardable, and you can let MongoDB manage it.
If you search by those keys, the query will use the compound index; otherwise it will use whichever other indexes your search requires.
If you are still more concerned about document size than about searching, you can use a custom _id, but make it an embedded document:
{_id:{key1:<value1>,key2:<value2>}}
This way you can also query on _id.key1 specifically.
Update:
Yes, if document size is a bigger concern for you than maintainability. If you are sure the keys of a document will not be modified in the future (or, if they are modified, that no other collection references them), then you can use the custom _id approach. Just store the keys as an embedded document rather than joining them with an underscore _. You can also add more keys later if needed.
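To illustrate why an embedded document can be safer than joining the values with an underscore, here is a minimal JavaScript sketch. The helper names concatId and embeddedId are hypothetical and exist only for this illustration; it assumes the values themselves may contain underscores.

```javascript
// Hypothetical helpers for illustration; not part of any MongoDB API.

// Solution 2 as described in the question: join the two values with "_".
function concatId(value1, value2) {
  return `${value1}_${value2}`;
}

// Alternative: keep the two values as separate fields of an embedded _id.
function embeddedId(value1, value2) {
  return { key1: value1, key2: value2 };
}

// Two different pairs produce the same concatenated string...
const c1 = concatId("a_b", "c");
const c2 = concatId("a", "b_c");
console.log(c1 === c2); // true, both are "a_b_c"

// ...while the embedded form keeps them distinct.
const e1 = JSON.stringify(embeddedId("a_b", "c"));
const e2 = JSON.stringify(embeddedId("a", "b_c"));
console.log(e1 === e2); // false
```

Note that MongoDB compares embedded documents field by field, in order, so the application should always build such an _id with the same field order.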

I think Solution 2 is more suitable for your requirement. It is absolutely OK to generate the _id value yourself in MongoDB. Many applications populate the _id value with a UUID. In your case, it makes sense to concatenate value1 and value2 for the _id value, assuming this collection is primarily used for verifying uniqueness (i.e. as a kind of temporary table) or for lookup purposes.
Solution 1 is expensive as it requires an additional index. Again, it depends on whether you are going to use this collection for verifying uniqueness alone or for some other use case as well.
Please note that if you do go with the compound index, you need to create it as a unique index, so that it rejects inserts with duplicate values.
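With a custom _id, the uniqueness check can rely on the _id index itself rather than on a separate count query. The following is only a sketch under stated assumptions: claimPair and fakeCollection are hypothetical names, the collection is assumed to behave like a mongosh collection (where insertOne is called synchronously), and a duplicate _id is assumed to surface as an error whose code is 11000. With the Node.js driver the call would need to be awaited.

```javascript
// Sketch: let the _id index enforce uniqueness instead of counting first.
// `claimPair` is a hypothetical helper, not a MongoDB API.
function claimPair(coll, value1, value2) {
  try {
    coll.insertOne({ _id: `${value1}_${value2}` });
    return true; // the pair was new
  } catch (err) {
    if (err.code === 11000) return false; // duplicate key: pair already exists
    throw err; // anything else is a real failure
  }
}

// In-memory stand-in for a collection so the sketch runs without a server.
function fakeCollection() {
  const seen = new Set();
  return {
    insertOne(doc) {
      if (seen.has(doc._id)) {
        const err = new Error("E11000 duplicate key error");
        err.code = 11000;
        throw err;
      }
      seen.add(doc._id);
    },
  };
}

const lookup = fakeCollection();
console.log(claimPair(lookup, "value1", "value2")); // true
console.log(claimPair(lookup, "value1", "value2")); // false
```

Inserting and catching the duplicate-key error avoids the race between a separate "does it exist?" query and the insert that follows it.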

Related

DB Compound indexing best practices Mongo DB

How costly is it to index some fields in MongoDB?
I have a table where I want uniqueness across the combination of two fields. Everywhere I searched, people suggested a compound index with unique set to true. But what I was doing instead was appending both fields as field1_field2 and making that a key, so that field2 is always unique for a given field1 (enforced with application logic), because I thought indexing was costly.
Also, since the MongoDB documentation advises against custom object IDs such as auto-incrementing numbers, I end up giving big numbers to models like Classes, Students etc. (where I could easily have used 1, 2, 3 in SQLite). I didn't think of adding a new field for numbering and indexing that field for querying.
What are the best practices for production?
The advantage of using compound indexes over your own indexed field system is that compound indexes allow quicker sorting than regular indexed fields. It also lowers the size of every document.
In your case, if you want to get the documents sorted with values in field1 ascending and in field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value contained in field1_field2, it does not really matter if you use compound indexes or a regular indexed field.
However, if you already have field1 and field2 in separate fields in the documents, and you also have a field containing field1_field2, it could be better to use a compound index on field1 and field2, and simply delete the field containing field1_field2. This could lower the size of every document and ultimately reduce the size of your database.
Regarding the cost of the indexing, you almost have to index field1_field2 if you want to go down that route anyways. Queries based on unindexed fields in MongoDB are really slow. And it does not take much more time adding a document to a database when the document has an indexed field (we're talking 1 millisecond or so). Note that adding an index on many existing documents can take a few minutes. This is why you usually plan the indexing strategy before adding any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!

Mongoose indexes at both field and schema levels

I understand that indexing can be a valuable tool for quickly retrieving data, if implemented properly. I would like to be able to scan my documents for a certain field value or a combination of field values.
There are two fields I would be indexing (category, tags). Category is a string and tags is an array. I need to be able to query for items in a specific category and/or items that contain a specific tag.
Here are three examples:
Show me all of the documents in the category: "cars"
Show me all of the documents that contain the tag: "electric"
Show me all of the documents in the "cars" category that contain the "electric" tag
Will a schema level index for both fields suffice for all three scenarios?
docSchema.index({category:1, tags:1});
Or do I also need to define them at the field level, to support the scenarios when I am only searching through a single field?
docSchema = mongoose.Schema({
category: {
type: String,
index: true
},
tags: {
type: [String],
index: true
}
});
docSchema.index({category:1, tags:1}); is a compound index.
This compound index supports the scenarios 1 and 3:
-> Show me all of the documents in the category: "cars"
-> Show me all of the documents in the "cars" category that contain the "electric" tag
To support scenario 2 you will need to define an additional single-field index on the tags field.
docSchema.index({tags:1});
A compound index supports queries that involve all fields in the compound index as well as queries that involve a prefix of the compound index. In this case your compound index supports queries involving both categories and tags as well as queries involving just categories.
To better understand the logic please take a look at the Compound Indexes articles on MongoDB documentation site. Pay special attention to the section that talks about Prefixes.
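The prefix rule can be sketched in a few lines of plain JavaScript. This is a simplified model, not how the query planner is implemented: usablePrefixLength is a hypothetical helper, and it only looks at which fields a query constrains.

```javascript
// Hypothetical helper: how many leading keys of a compound index a query
// constrains. 0 means the query cannot use the index via the prefix rule.
function usablePrefixLength(indexKeys, queryFields) {
  let n = 0;
  for (const key of indexKeys) {
    if (!queryFields.includes(key)) break;
    n += 1;
  }
  return n;
}

// Key order of docSchema.index({category:1, tags:1})
const compound = ["category", "tags"];

console.log(usablePrefixLength(compound, ["category"]));         // 1 (scenario 1)
console.log(usablePrefixLength(compound, ["tags"]));             // 0 (scenario 2)
console.log(usablePrefixLength(compound, ["category", "tags"])); // 2 (scenario 3)
```

Scenario 2 yields 0 because tags is not a leading key of the compound index, which is why it needs its own index.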
You need a single-field index on category and a multikey index on tags. You might be tempted to use a compound index instead of one of them, but that is not mandatory if you are using MongoDB >= 2.6, as it has a nice feature called index intersection.
Show me all of the documents in the category: "cars"
Show me all of the documents that contain the tag: "electric"
Show me all of the documents in the "cars" category that contain the "electric" tag
(1) will use the index on category (incl. any index having category as a prefix)
(2) will use the index on tags (incl. any index having tags as a prefix)
(3) will use the index on tags or the index on category or the index intersection of both of them (depending on the choice of the query planner).
As a reference, there is a nice discussion about index intersection in the MongoDB blog. Worth reading the entire article. But to quote the conclusion, mostly comparing index intersection to compound indexes:
To be clear, compound indexing will ALWAYS be more performant [than index intersection] IF you know what you are going to be querying on and can create one ahead of time. Furthermore, if your working set is entirely in memory, then you will not reap any of the benefits of Index Intersection as it is primarily based on reducing IO. But in a more ad-hoc case where one cannot predict the shape of the queries and the working set is much larger than available memory, index intersection will automatically take over and choose the most performant path.
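As a rough mental model of index intersection, each single-field index can be thought of as a map from a value to the ids of the matching documents, with the two id sets intersected afterwards. The sketch below is a toy model in plain JavaScript with made-up sample documents; buildIndex and intersect are hypothetical helpers and say nothing about how MongoDB implements this internally.

```javascript
// Toy model only: each "index" maps a field value to the ids of matching docs.
const docs = [
  { _id: 1, category: "cars", tags: ["electric", "suv"] },
  { _id: 2, category: "cars", tags: ["diesel"] },
  { _id: 3, category: "bikes", tags: ["electric"] },
];

function buildIndex(items, field) {
  const index = new Map();
  for (const doc of items) {
    // An array field gets one entry per element, like a multikey index.
    const values = Array.isArray(doc[field]) ? doc[field] : [doc[field]];
    for (const v of values) {
      if (!index.has(v)) index.set(v, new Set());
      index.get(v).add(doc._id);
    }
  }
  return index;
}

const byCategory = buildIndex(docs, "category");
const byTag = buildIndex(docs, "tags");

// Intersect the two id sets found through the separate indexes.
function intersect(a, b) {
  return [...a].filter((id) => b.has(id));
}

const hits = intersect(byCategory.get("cars"), byTag.get("electric"));
console.log(hits); // [ 1 ]
```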

Mongodb id on bulk insert performance

I have a class/object that has a GUID, and I want to use that field as the _id when it is saved to MongoDB. Is it possible to use another value instead of the ObjectId?
Is there any performance consideration when doing a bulk insert with a custom _id field? Is _id an index? If I set the _id to a different field, would it slow down the bulk insert? I'm inserting about 10 million records.
1) Yes you can use that field as the id. There is no mention of what API (if any) you are using for inserting the documents. So if you would do the insertion at the command line, the command would be:
db.collection.insert({_id : <BSONString_version_of_your_guid_value>, field1 : value1, ...});
It doesn't have to be BsonString. Change it to whatever Bson value is closest matching to your guid's original type (except the array type. Arrays aren't allowed as the value of _id field).
2) As far as I know, there IS an effect on the performance of db.collection.insert when you provide your own ids, especially in bulk, BUT if the ids are sorted etc., there shouldn't be a performance loss. The reason, quoting:
The structure of index is a B-tree. ObjectIds have an excellent
insertion order as far as the index tree is concerned: they are always
increasing, meaning they are always inserted at the right edge of
B-tree. This, in turn, means that MongoDB only has to keep the right
edge of the B-Tree in memory.
Conversely, a random value in the _id field means that _ids will be
inserted all over the tree. Then the machine must move a page of the
index into memory, update a tiny piece of it, then probably ignore it
until it slides out of memory again. This is less efficient.
:from the book `50 Tips and Tricks for MongoDB Developers`
The tip's title says - "Override _id when you have your own simple, unique id." Clearly it is better to use your id if you have one and you don't need the properties of an ObjectId. And it is best if your ids are increasing for the reason stated above.
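The effect described in the quote can be made visible with a small analogy. In the sketch below a sorted JavaScript array stands in for the B-tree behind the _id index (an assumption made purely for illustration; a real B-tree works page by page), and the hypothetical helper rightEdgeInserts counts how many inserts land at the right edge.

```javascript
// Analogy only: a sorted array stands in for the B-tree behind the _id index.
// Returns the position at which `key` must be inserted to keep `sorted` ordered.
function insertPosition(sorted, key) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sorted[mid] < key) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Counts how many inserts land at the right edge (the end of the array).
function rightEdgeInserts(keys) {
  const sorted = [];
  let atEdge = 0;
  for (const key of keys) {
    const pos = insertPosition(sorted, key);
    if (pos === sorted.length) atEdge += 1;
    sorted.splice(pos, 0, key);
  }
  return atEdge;
}

const increasing = [1, 2, 3, 5, 7, 8, 9]; // ObjectId-like: always growing
const scattered = [5, 1, 9, 3, 7, 2, 8];  // random-looking ids

console.log(rightEdgeInserts(increasing)); // 7
console.log(rightEdgeInserts(scattered));  // 2
```

With increasing keys every insert touches only the end of the structure; with scattered keys most inserts land somewhere in the middle, which is the access pattern the quote warns about.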
3) There is a default index on _id field by MongoDB.
So...
1) Yes. It is possible to use other types than ObjectId, including GUID that will be saved as BinData.
2) Yes, there are considerations. It's better if your _id is always increasing (like a growing number, or ObjectId) otherwise the index needs to rebuild itself more often. If you plan on using sharding, the _id should also be hashed evenly.
3) _id indeed has an index automatically.
4) It depends on the type you choose. See point 2.
Conclusion: It's better to keep using ObjectId unless you have a good reason not to.

MongoDB index on many (nested) fields/attributes

In e-commerce application I have documents like this:
{ category:'A', ..., price:122,
attr:{ width:6, height:4, hasLCD:true, lcdType:'some text', ..., a36:null }
}
I.e. every product has many attributes of various simple types.
Now I want to filter products by dynamic queries containing top level fields plus some attributes. For example:
find({category:'A', price:{$lt:200}, ...,
'attr.height':{$lt:6}, 'attr.hasLCD':true, 'attr.lcdType':{$in:[...]}, ...})
And I'd like this to perform fast.
Trying to index on all possible 'attr.*' variants gives me an error (too many compound keys). I also suspect that if I index it that way and then omit one of attrs in query index won't work.
Trying to index on 'attr' as a whole does not help either.
What is the proper way to model this under MongoDB?
Update
I have tried this approach (also mentioned here). I.e. store attributes as array of key-value pairs:
attr2: [ {tag:'lcgType', value:'some text'}, ...
And index it like this:
ensureIndex({ 'attr2.tag':1, 'attr2.value':1 })
And query like this:
find({attr2:{$all:[
{$elemMatch:{tag:'bestseller',value:true}},
{$elemMatch:{tag:'weight',value:{$lte:100}}}
]}})
Now explain() says that it is using "BtreeCursor attr2.tag_1_attr2.value_1" but still "nscanned" : 31607 and the whole execution time has actually increased (compared to the non-indexed scenario).
Something is wrong here.
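For reference, the key-value layout and the query shape described in this update can be sketched in plain JavaScript. The helper names toAttrPairs and attrFilter are hypothetical; the field name attr2 and the $all / $elemMatch shape are taken from the snippets above.

```javascript
// Hypothetical helpers for the key-value layout described above.

// { width: 6, hasLCD: true } -> [ {tag:'width', value:6}, {tag:'hasLCD', value:true} ]
function toAttrPairs(attr) {
  return Object.entries(attr).map(([tag, value]) => ({ tag, value }));
}

// Builds the $all / $elemMatch filter from a map of tag -> condition.
function attrFilter(conditions) {
  return {
    attr2: {
      $all: Object.entries(conditions).map(([tag, value]) => ({
        $elemMatch: { tag, value },
      })),
    },
  };
}

const pairs = toAttrPairs({ width: 6, hasLCD: true });
const filter = attrFilter({ bestseller: true, weight: { $lte: 100 } });
console.log(JSON.stringify(pairs));
console.log(JSON.stringify(filter));
```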
Sub-question
What if I select some (less than 31) most frequently queried attributes and try to index on those. If I put all of them in single compound index:
ensureIndex({'attr.a1':1, 'attr.a2':1, ...})
According to the docs this index won't be used for queries missing attr.a1 attribute.
How to define index in this case?
If you really have to allow a lot of filters, combinations and possibly even sorts, MongoDB is not a good fit because it generally uses only one index per query. The number of indexes then grows far too fast, because compound indexes are rather inflexible (that should answer the sub-question), and this becomes a performance hog.
Use a search database such as Elasticsearch or Solr instead, which comes with the features you need. You can then use $in on the ids that the search server returned if you want to keep the base information in MongoDB. (It is usually a good idea to have the search database simply replicate the information from the primary data store, so you don't need to sync changes two-way, which would be a nightmare.)
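A minimal sketch of that second step, in plain JavaScript: build the $in filter from the ids the search server returned, then restore the search ranking on the application side, since $in does not return documents in the order of the id list. The names filterForIds, inSearchOrder and the sample ids are hypothetical, and the ids are assumed to be strings or numbers.

```javascript
// Hypothetical sketch: `searchIds` stands for whatever ids the search server returned.
function filterForIds(searchIds) {
  return { _id: { $in: searchIds } };
}

// Restore the search ranking after fetching the documents.
// Assumes ids are primitives (strings or numbers) so they work as Map keys.
function inSearchOrder(docs, searchIds) {
  const rank = new Map(searchIds.map((id, i) => [id, i]));
  return [...docs].sort((a, b) => rank.get(a._id) - rank.get(b._id));
}

const searchIds = ["p3", "p1", "p2"];
// Stand-in for the result of a find(filterForIds(searchIds)) call.
const fromMongo = [{ _id: "p1" }, { _id: "p2" }, { _id: "p3" }];

console.log(JSON.stringify(filterForIds(searchIds)));
console.log(inSearchOrder(fromMongo, searchIds).map((d) => d._id)); // [ 'p3', 'p1', 'p2' ]
```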

Mongo _id for subdocument array

I wish to add an _id as property for objects in a mongo array.
Is this good practice ?
Are there any problems with indexing ?
I wish to add an _id as property for objects in a mongo array.
I assume:
{
g: [
{ _id: ObjectId(), property: '' },
// next
]
}
is the type of structure this question refers to.
Is this good practice ?
Not normally. _ids are unique identifiers for entities. As such if you are looking to add _id within a sub-document object then you might not have normalised your data very well and it could be a sign of a fundamental flaw within your schema design.
Sub-documents are designed to contain repeating data for that document, e.g. the addresses of a user.
That being said _id is not always a bad thing to add. Take the example I just stated with addresses. Imagine you were to have a shopping cart system and (for some reason) you didn't replicate the address to the order document then you would use an _id or some other identifier to get that sub-document out.
Also you have to take into consideration linking documents. If that _id describes another document and the properties are custom attributes for that document in relation to that linked document then that's okay too.
Are there any problems with indexing ?
An ObjectId is still quite sizeable so that is something to take into consideration over a smaller, less unique id or not using an _id at all for sub-documents.
For indexes it doesn't really work any differently from the standard _id field on the document itself, and a unique index on the field should work across the collection (scenario dependent, test your queries).
NB: MongoDB will not add an _id to sub-documents for you.
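Since the application has to assign these ids itself, one way to do it before saving is sketched below in plain JavaScript. withSubdocIds is a hypothetical helper, and newId stands for whatever id generator the application uses (for example the driver's ObjectId constructor); a simple counter is used here so the sketch has no dependencies.

```javascript
// Hypothetical helper: give every element of an array field its own _id,
// leaving elements that already have one untouched.
function withSubdocIds(doc, field, newId) {
  return {
    ...doc,
    [field]: doc[field].map((sub) =>
      sub._id === undefined ? { _id: newId(), ...sub } : sub
    ),
  };
}

let next = 0;
const result = withSubdocIds(
  { g: [{ property: "a" }, { _id: 99, property: "b" }] },
  "g",
  () => ++next // stand-in id generator
);
console.log(JSON.stringify(result));
// {"g":[{"_id":1,"property":"a"},{"_id":99,"property":"b"}]}
```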