mongodb sharding, use multiple fields as the shard key? - mongodb

I have documents with the following schema:
{
idents: {
list: ['foo', 'bar', ...],
id: 123
}
...
}
the field idents.list is an array of string and always contains at least one element.
the field idents.id may or may not be existant.
over time more entries are added to 'idents.list' and at some point in the future the field idents.id may be set too.
these two fields are used to clearly identify a document and therefore are relevant for a shard key.
is it possible to use sharding with this schema?
UPDATE:
documents are always queried via {idents.list: 'foo'} OR { $or: [ {idents.list: 'foo'}, {idents.id: 42} ] }

Yes,you can do this. The documentation says:
Use a compound shard key that uses two or three values from all documents that provide the right mix of cardinality with scalable write operations and query isolation.
https://docs.mongodb.org/manual/tutorial/choose-a-shard-key/

Related

Custom MongoDB Object _id vs Compound index

So I need to create a lookup collection in MongoDB to verify uniqueness. The requirement is to check if the same 2 values are being repeated or not. In SQL, I would something like this
SELECT count(id) WHERE key1 = 'value1' AND key2 = 'value2'
If the above query returns a count then it means the combination is not unique. I have 2 solutions in mind but I am not sure which one is more scalable. There are 30M+ docs against which I need to create this mapping.
Solution1:
I create a collection of docs with compound index on key1 and key2
{
_id: <MongoID>,
key1: <value1>,
key2: <value2>
}
Solution2:
I write application logic to create custom _id by concatenating value1 and value2
{
_id: <value1>_<value2>
}
Personally, I feel the second one is more optimised as it only has a single index and the size of doc is also smaller. But I am not sure if it is a good practice to create my own _id indexes as they may not be completely random. What do you think?
Thanks in advance.
Update:
My database already has a lot of indexes which take up memory so I want to keep index size to as low as possible specially for collections which are only used to verify uniqueness.
I would suggest Solution 1 i.e to use compound index and use two different properties key1 and key2
db.yourCollection.ensureIndex( { "key1": 1, "key2": 1 }, { unique: true } )
You can search easily by individual field if required. i.e if you require to search only by key1 or key2 then it would be easy with compound index. If you make _id with combination of keys, then it will be hard to search by individual field.
Size of document in Mongo is very least bothered while designing document.
If in near future if you would required to change keys values of same document with respect to other values, it will be easy. Keep in mind if you are using reference of this document in other collection's document.
In terms of your scalability, _id index would be sequential, easily shardable, and you can let MongoDB manage it.
If you are searching with those keys then it will use that index otherwise it will use the other required indexes for your search.
If you are still thinking of size of document than searching then you can go with Solution 1, make _id like
{_id:{key1:<value1>,key2:<value2>}}
By this you can search specific _id.key1 too.
Update:
Yes if document size is your concern than maintaining. And if you are sure about keys will not modify in future of same document and if it still modifying and do not have reference in other collections, then you can use Solution 1. Just use keys as objects than underscore _. You can add more keys later too if wanted in future.
I think the solution 2 is more suitable for your requirement. It is absolutely ok to generate the _id value of MongoDB. Most of the applications does populate the _id value with UUID. In your case, it make sense to concatenate value 1 and 2 for _id value assuming this collection is primarily used for verifying the uniqueness (i.e kind of temporary table) or lookup purpose.
Solution 1 is expensive as it requires additional index. Again, it depends on whether you are going to use this collection for verifying the uniqueness purpose alone or for some other use case as well.
Please note that you need to create the unique compound index, so that it doesn't allow to insert data for duplicate values.

In MongoDB, when to use a simple subdocument, when an array with 2-field elements?

Background
I am storing table rows as MongoDb documents, with each column having a name. Let's say table has these columns of interest: Identifier, Person, Date, Count. The MongoDb document also has some extra fields separate from the table data, represented by timestamp. Columns are not fixed (which is why I use schema-free database to store them in the first place).
There will be need to do various complex, but so far unspecified queries. I am not very concerned about performance, though query performance may conceivably become a bottleneck. Once inserted, documents will not be modifed (a new document with same Identifier will be created instead), and insertions are not very frequent (let's say, 1000 new MongoDb documents per day). So amount of data will steadily grow over time.
Example
The straight-forward approach is having a collection of MongoDb documents like:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: {
Identifier: "AB002",
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
}
Now I have seen an alternative approach (for example in accepted answer of this question), using array with two fields per object:
{
_id: XXXX,
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
data: [
{ field: "Identifier", value: "AB002" },
{ field: "Person", value: "John001" },
{ field: "Date", value: ISODate("2013-11-16T21:26:17Z") },
{ field: "Count", value: 1 }
]
}
Questions
Does the 2nd approach make any sense at all?
If yes, then how to choose which to use? Especially, are there some specific kinds of queries which are easy/cheap with one approach, hard/costly with another? Any "rules of thumb" on which way to go, or pro-con lists for both? Example real-life cases of one aproach being inconvenient would be especially valuable.
In your specific example the First version is a lot more appropriate and simple. You have to think in terms of how you would query your document.
It is a lot simpler to query your database like this: db.collection.find({"data.Identifier": "AB002"})
Although I'm not 100% sure why you even need the inner document. Why can't structure your document like:
{
_id: "AB002",
insertDate: ISODate("2012-10-15T21:26:17Z"),
flag: true,
Person: "John002",
Date: ISODate("2013-11-16T21:26:17Z"),
Count: 1
}
Pros of first example:
Simple to query
Enforces unique keys, but your data won't have two columns with the same name anyway
I would assume mongoDB would generate better query plans because the structure is a lot more simple (haven't tested)
Pros of second example:
Allows multiple entries with the same key/field, but I don't feel that is useful in your case
A single index on the array can be used for all of its entries regardless of their field name
I don't think that the situation in the other example here and yours are the same. In the other example, they're creating a list of items with one of two answers, which would be more appropriately in an array, and the goal is to return a list of subdocuments that match the criteria. In your example, you're really just describing an object since they all hold different types of information, and you won't need to retrieve searchable bits of the subdocuments.

Can a list field be a shard key in MongoDB?

Have some data that looks like this:
widget:
{
categories: ['hair', 'nails', 'dress']
colors: ['red', 'white']
}
The data needs to be queried like this:
SELECT * FROM widget_table WHERE categories == 'hair' AND colors == 'red'
Would like to put this data into a MongoDB sharded cluster. However, it seems like an ideal shard key would not be a list field. In this case, that is not possible because all of the fields are list fields.
Is it possible to use a list field, such as the field categories as the shard key in MongoDB?
If so, what things should I look out for / be aware of?
Thanks so much!
Based on some of the feed back I am getting that seems to assert that it is not possible to shard using a list field as a shard key, I wanted to illustrate how this use case could be sharded using the limitations of MongoDB:
Original object:
widget:
{
primary_key: '2389sdjsdafnlfda'
categories: ['hair', 'nails', 'dress']
colors: ['red', 'white']
#All the other fields in the document that don't need to be queried upon:
...
...
}
Data layer splits object into multiple pointer objects based on the number of elements in the field chosen for the shard key:
widget_pointer:
{
primary_key: '2389sdjsdafnlfda'
categories: 'hair',
colors: ['red', 'white']
}
widget_pointer:
{
primary_key: '2389sdjsdafnlfda'
categories: 'nails',
colors: ['red', 'white']
}
widget_pointer:
{
primary_key: '2389sdjsdafnlfda'
categories: 'dress',
colors: ['red', 'white']
}
Explanation:
The field categories can now be the shard key in MongoDB.
The original object will now be stored in a key-value store. Queries against the data in MongoDB will return a pointer object that will be used to get the object from the key-value store.
Queries on the MongoDB data will hit only one shard.
Insertions on the MongoDB data will hit as many shards as there are elements in the list, in most cases, only a small subset of the total number of shards will be affected.
Sharding in MongoDB (as at 2.4) works by partitioning your documents into ranges of values based on the shard key. A list or array shard key does not make sense as a shard key because it contains multiple values.
It's also worth noting that the shard key is immutable (cannot be changed once set for a document), so you do not want to choose fields that you intend to update.
If you do not have any candidate fields in your documents, you could always add one. A straightforward solution in your case could be to use the new hashed sharding in MongoDB 2.4:
The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys work well with fields that increase monotonically like ObjectId values or timestamps.
An obvious question to consider before sharding is "do you need to shard?". Sharding is an approach for scaling out writes with MongoDB, but can be overkill if you aren't yet pushing the limits of your current configuration.

mongodb: create a top-level index for a nested document instead of having to index each individual sublevel?

This question is about how I can use indexes in MongoDB to look something up in nested documents, without having to index each individual sublevel.
I have a collection "test" in MongoDB which basically goes something like this:
{
"_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
"othercol" : "bladiebla",
"scenario" : {
"1" : { [1,2,3] },
"2" : { [4,5,6] }
}}
Scenario has multiple keys, each document can have any subset of the scenarios (i.e. from none to a subset to all). Also: Scenario can't be an array because i need it as a dictionary in Python. I created an index on the "scenario" field.
My issue is that i want to select on the collection, filtering for documents that have a certain value. So this works fine functionally:
db.test.find({"scenario.1": {$exists: true}})
However, it won't use any index i've put on scenario. Only if i put an index on the "scenario.1" an index is used. But I can have thousands (or more) scenarios (and the collection itself has 100.000s of records), so i would prefer not to!
So i tried alternatives:
db.test.find({"scenario": "1"})
This will use the index on scenario, but won't return results. Making scenario an array still gives the same index issue.
Is my question clear? Can anyone give a pointer on how I could achieve the best performance here?
P.s. I have seen this: How to Create a nested index in MongoDB? but that solution is not possible in my case (due to the amount of scenarios)
Putting an index on a subobject like scenario is useless in this case as it would only be used when you're filtering on complete scenario objects rather than individual fields (think of it as a binary blob comparison).
You either need to add an index on each of your possible fields ("scenario.1", "sceanario.2", etc.) or rework your schema to get rid of the dynamic keys by doing something like this:
{
"_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
"othercol" : "bladiebla",
"scenario" : [
{ id: "1", value: [1,2,3] },
{ id: "2", value: [4,5,6] }
}}
Then you can add a single index to scenario.id to support the queries you need to perform.
I know you said you need scenario to be a dict and not an array, but I don't see how you have much choice.
Johnny HK's answer is a nice explained answer and should be used in general cases. I will just suggest a workaround for you to solve your issue if you have to have many scenarios and don't need complex querying. Instead of keeping values under scenario field, just hold the id of the scenario under that field, and hold the values as another field in the document and use the scenario id as the key of this field.
Example:
{
"_id" : ObjectId("50fdd7d71d41c82875a5b6c1"),
"othercol" : "bladiebla",
"scenario" : [ "1", "2"],
"scenario_1": [1,2,3],
"scenario_2": [4,5,6]
}}
With this schema you can use index on scenario to find specific scenarios. But if you need to query for specific scenario values, you again need to have an index on each scenario value field i.e scenario_1, scenario_2, etc.. If you need to have indexes for each field, then don't change your original schema and use sparse indexes for each nested field and that might help reduce the size of your indexes.

MongoDB: Unique Key in Embedded Document

Is it possible to set a unique key for a key in an embedded document?
I have a Users collection with the following sample documents:
{
Name: "Bob",
Items: [
{
Name: "Milk"
},
{
Name: "Bread"
}
]
},
{
Name: "Jim"
},
Is there a way to create an index on the property Items.Name?
I got the following error when I tried to create an index:
> db.Users.ensureIndex({"Items.Name": 1}, {unique:true});
E11000 duplicate key error index: GroceryGuruApp.Users.$Items.Name_1 dup key: {
: null }
Any suggestions? Thank you!
Unique indexes exist only across collection. To enforce uniqueness and other constraints across document you must do it in client code. (Probably virtual collections would allow that, you could vote for it.)
What are you trying to do in your case is to create index on key Items.Name which doesn't exist in any of the documents (it doesn't refer to embedded documents inside array Items), thus it's null and violates unique constraint across collection.
You can create a unique compound sparse index to accomplish something like what you are hoping for. It may not be the best option (client side still might be better), but it can do what you're asking depending on specific requirements.
To do it, you'll need to create another field on the same level as Name: Bob that is unique to each top-level record (could do FirstName + LastName + Address, we'll call this key Identifier).
Then create an index like this:
ensureIndex({'Identifier':1, 'Items.name':1},{'unique':1, 'sparse':1})
A sparse index will ignore items that don't have the field, so that should get around your NULL key issue. Combining your unique Identifier and Items.name as a compound unique index should ensure that you can't have the same item name twice per person.
Although I should add that I've only been working with Mongo for a couple of months and my science could be off. This is not based on empirical evidence but rather observed behavior.
More on MongoDB Indexes
Compound Keys Indexes
Sparse Indexes
An alternative would be to model the items as a hash with the item name as the key.
Items: { "Milk": 1, "Bread": 1 }
I'm not sure about whether you're trying to use the index for performance or purely for the constraint. The right way to approach depends on your use cases, and determining whether the atomic operations are enough to keep your data consistent.
The index will be across all Users and since you asked it for 'unique', no user will be able to have two of the same named item AND no two users will be able to have the same named Item.
Is that what you want?
Furthermore, it appears that it's objecting to two Users having a 'null' value for Items.Name, clearly Jim does, is there another record like that?
It would be unusual to require uniqueness on an indexed collection like this.
MongoDB does allow unique indexes where it indexes only the first of each value, see
http://www.mongodb.org/display/DOCS/Indexes#Indexes-DuplicateValues, but I suspect the real solution is to not require uniqueness in this case.
If you want to ensure uniqueness only within the Items for a single user you might want to try the $addToSet option. See http://www.mongodb.org/display/DOCS/Updating#Updating-%24addToSet
You can use use findAndModify to create a sequence/counter function.
function getNextSequence(name) {
var ret = db.counters.findAndModify({
query: { _id: name },
update: { $inc: { seq: 1 } },
new: true,
upsert: true
});
return ret.seq;
}
Then use it whenever a new id is needed...
db.users.insert({
_id: getNextSequence("userid"),
name: "Sarah C."
})
This is from http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/. Check it out.