Can a list field be a shard key in MongoDB? - mongodb

I have some data that looks like this:
widget:
{
categories: ['hair', 'nails', 'dress'],
colors: ['red', 'white']
}
The data needs to be queried like this:
SELECT * FROM widget_table WHERE categories = 'hair' AND colors = 'red'
I would like to put this data into a MongoDB sharded cluster. However, it seems that an ideal shard key would not be a list field. In this case, avoiding a list field is not possible because all of the fields are list fields.
Is it possible to use a list field, such as the field categories as the shard key in MongoDB?
If so, what things should I look out for / be aware of?
Thanks so much!

Based on some of the feedback I am getting, which seems to assert that it is not possible to use a list field as a shard key, I wanted to illustrate how this use case could be sharded within the limitations of MongoDB:
Original object:
widget:
{
primary_key: '2389sdjsdafnlfda',
categories: ['hair', 'nails', 'dress'],
colors: ['red', 'white'],
#All the other fields in the document that don't need to be queried upon:
...
...
}
Data layer splits object into multiple pointer objects based on the number of elements in the field chosen for the shard key:
widget_pointer:
{
primary_key: '2389sdjsdafnlfda',
categories: 'hair',
colors: ['red', 'white']
}
widget_pointer:
{
primary_key: '2389sdjsdafnlfda',
categories: 'nails',
colors: ['red', 'white']
}
widget_pointer:
{
primary_key: '2389sdjsdafnlfda',
categories: 'dress',
colors: ['red', 'white']
}
Explanation:
The field categories can now be the shard key in MongoDB.
The original object will now be stored in a key-value store. Queries against the data in MongoDB will return a pointer object that will be used to get the object from the key-value store.
Queries on the MongoDB data will hit only one shard.
Insertions on the MongoDB data will hit as many shards as there are elements in the list. In most cases, only a small subset of the total number of shards will be affected.
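As a rough sketch of the splitting step described above: the helper name toPointerDocs is made up for illustration, and this is plain JavaScript that runs without a cluster, not a finished data layer.

```javascript
// Hypothetical helper: build one pointer document per element of the
// list field chosen as the shard key (here: categories).
function toPointerDocs(widget) {
  return widget.categories.map((category) => ({
    primary_key: widget.primary_key, // points back to the full object in the key-value store
    categories: category,            // now a single value, so it can act as a shard key
    colors: widget.colors.slice(),   // copy, so pointer docs do not share one array
  }));
}

const widget = {
  primary_key: '2389sdjsdafnlfda',
  categories: ['hair', 'nails', 'dress'],
  colors: ['red', 'white'],
};

const pointers = toPointerDocs(widget);
console.log(pointers.map((p) => p.categories)); // [ 'hair', 'nails', 'dress' ]
```

Each pointer document would then be inserted separately, which is why one logical insert can touch as many shards as there are categories.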

Sharding in MongoDB (as of 2.4) works by partitioning your documents into ranges of values based on the shard key. A list or array field does not make sense as a shard key because it contains multiple values, so a single document would not map to a single range.
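To make the range idea concrete, here is a small simulation in plain JavaScript. The chunk boundaries and shard names are invented for the example, and this is not how MongoDB stores its chunk metadata; it only shows why a scalar key lands in exactly one range while an array does not.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of shard key values.
const chunks = [
  { min: '',  max: 'f',  shard: 'shard0' },
  { min: 'f', max: 'k',  shard: 'shard1' },
  { min: 'k', max: null, shard: 'shard2' }, // null means "no upper bound"
];

// A scalar key falls into exactly one chunk.
function shardFor(key) {
  const chunk = chunks.find(
    (c) => key >= c.min && (c.max === null || key < c.max)
  );
  return chunk.shard;
}

// An array has no single position in the key space:
// each element may fall into a different chunk.
function shardsForArray(keys) {
  return [...new Set(keys.map(shardFor))];
}

console.log(shardFor('hair'));                           // shard1
console.log(shardsForArray(['hair', 'nails', 'dress'])); // [ 'shard1', 'shard2', 'shard0' ]
```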
It's also worth noting that the shard key is immutable (cannot be changed once set for a document), so you do not want to choose fields that you intend to update.
If you do not have any candidate fields in your documents, you could always add one. A straightforward solution in your case could be to use the new hashed sharding in MongoDB 2.4:
The field you choose as your hashed shard key should have a good cardinality, or large number of different values. Hashed keys work well with fields that increase monotonically like ObjectId values or timestamps.
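A toy comparison of why hashing helps with monotonically increasing keys, in plain JavaScript. Everything here is an assumption made for the illustration: the chunk bounds and timestamps are invented, and FNV-1a is only a stand-in, since MongoDB's hashed indexes use a different hash function.

```javascript
const NUM_SHARDS = 4;

// FNV-1a (32-bit), used purely as a stand-in hash for this sketch.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value as unsigned 32-bit
  }
  return h;
}

// Hypothetical upper bounds of the first three range chunks.
const rangeBounds = [1000000000, 1300000000, 1600000000];

// Range sharding: the key's position in the key space picks the shard.
function rangeShard(ts) {
  const i = rangeBounds.findIndex((bound) => ts < bound);
  return i === -1 ? NUM_SHARDS - 1 : i;
}

// Hashed sharding: the hash of the key picks the shard.
function hashedShard(ts) {
  return fnv1a(String(ts)) % NUM_SHARDS;
}

const rangeCounts = new Array(NUM_SHARDS).fill(0);
const hashedCounts = new Array(NUM_SHARDS).fill(0);
for (let i = 0; i < 1000; i++) {
  const ts = 1700000000 + i; // monotonically increasing key
  rangeCounts[rangeShard(ts)]++;
  hashedCounts[hashedShard(ts)]++;
}

console.log('range :', rangeCounts);  // every new key lands on the last shard
console.log('hashed:', hashedCounts); // keys are spread over all shards
```

With the range layout, all 1000 new keys pile onto the last shard; with hashing they are spread across all four.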
An obvious question to consider before sharding is "do you need to shard?". Sharding is an approach for scaling out writes with MongoDB, but can be overkill if you aren't yet pushing the limits of your current configuration.

Related

Sharding with array in Cloud Firestore with composite index

I have read in the documentation, that writes per second can be limited to 500 per second if a collection has sequential values with an index.
I can add a shard field to avoid this.
Therefore I should add the shard field before the sequential field in a composite index.
But what if my sequential field is an array?
An array must always be the first field in a composite index.
For example:
I have a Collection "users" with an array field "reminders".
The field reminders contains time strings like ["12:15", "17:45", "20:00", ...].
I think these values could result in hot spotting but maybe I am wrong.
I don't know how Firestore handles arrays in composite indexes.
Could my array reminders slow down the writes per second? And if so, how could I implement a shard field? Or is there a completely different solution?

Best shard key (or optimised query) for range query on sub-document array

Below is a simplified version of a document in my database:
{
_id : 1,
main_data : 100,
sub_docs: [
{
_id : a,
data : 100
},
{
_id: b,
data : 200
},
{
_id: c,
data: 150
}
]
}
So imagine I have lots of these documents with varied data values (say 0 - 1000).
Currently my query is something like:
db.myDb.find(
{ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
)
Is there any shard key I could use to help this query? As currently it is querying all shards.
If not is there a better way to structure my query?
Jackson,
You are thinking about this problem the right way. The problem with broadcast queries in MongoDB is that they can't scale.
Any MongoDB query that does not filter on the shard key will be broadcast to all shards. Also, range queries are likely to either cause broadcasts or, at the very least, cause your queries to be sent to multiple shards.
So here are some things to think about:
Query Frequency -- Is the range query your most frequent query? What is the expected workload?
Range Logic -- Is there any intrinsic logic to how you are going to apply the ranges? Let's say 0-200 is small and 200-400 is medium. You could potentially add another field to your document that records this and shard on it.
Additional shard key candidates -- Sometimes there are other fields that can be included in all or most of your queries and that would provide good distribution. By combining a filter on such a field with your range queries, you could restrict your query to one or a few shards.
Break array -- You could potentially have multiple documents instead of an array. In this scenario, you would have one document for each element of the array, and main data would be duplicated across multiple documents. A range query on this item would still be a problem, but it could involve several shards, not necessarily all (it depends on your data demographics and query patterns).
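The "Range Logic" and "Break array" ideas above can be sketched together in plain JavaScript. The field names parent_id, sub_id and bucket, the bucket width of 200, and the assumption that data values are integers are all made up for illustration.

```javascript
// Hypothetical bucket width, following the 0-200 / 200-400 example.
const BUCKET_WIDTH = 200;

function bucketOf(value) {
  return Math.floor(value / BUCKET_WIDTH);
}

// "Break array": one document per sub-document, with main_data duplicated
// and a derived bucket field that could serve as (part of) a shard key.
function breakArray(doc) {
  return doc.sub_docs.map((sub) => ({
    parent_id: doc._id,
    main_data: doc.main_data,
    sub_id: sub._id,
    data: sub.data,
    bucket: bucketOf(sub.data),
  }));
}

// Buckets that a range query [gte, lt) has to touch (integer values assumed).
function bucketsForRange(gte, lt) {
  const buckets = [];
  for (let b = bucketOf(gte); b <= bucketOf(lt - 1); b++) buckets.push(b);
  return buckets;
}

const doc = {
  _id: 1,
  main_data: 100,
  sub_docs: [
    { _id: 'a', data: 100 },
    { _id: 'b', data: 200 },
    { _id: 'c', data: 150 },
  ],
};

console.log(breakArray(doc).map((d) => d.bucket)); // [ 0, 1, 0 ]
console.log(bucketsForRange(110, 160));            // [ 0 ]
```

A range query could then add a filter on the bucket values it needs, so that it only has to reach the shards holding those buckets. Whether that beats a broadcast depends on how wide your typical ranges are.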
It boils down to the nature of your data and queries. The sample document that you provided is very anonymized so it is harder to know what would be good shard key candidates in your domain.
One last piece of advice is to be careful on your insert/update query patterns if you plan to update your document frequently to add more entries to the array. Growing documents present scaling problems for MongoDB. See this article on this topic.

mongodb sharding, use multiple fields as the shard key?

I have documents with the following schema:
{
idents: {
list: ['foo', 'bar', ...],
id: 123
}
...
}
the field idents.list is an array of string and always contains at least one element.
the field idents.id may or may not exist.
over time more entries are added to 'idents.list' and at some point in the future the field idents.id may be set too.
these two fields are used to clearly identify a document and therefore are relevant for a shard key.
is it possible to use sharding with this schema?
UPDATE:
documents are always queried via { 'idents.list': 'foo' } OR { $or: [ { 'idents.list': 'foo' }, { 'idents.id': 42 } ] }
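Yes, you can shard on multiple fields by using a compound shard key. Note, however, that MongoDB does not allow an array field such as idents.list to be part of a shard key, so the key would have to be built from scalar fields. The documentation says: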
Use a compound shard key that uses two or three values from all documents that provide the right mix of cardinality with scalable write operations and query isolation.
https://docs.mongodb.org/manual/tutorial/choose-a-shard-key/

mongodb range sharding with string field

I use 'id' field in mongodb documents which is the HASH of '_id' (ObjectId field generated by mongo). I want to use RANGE sharding with 'id' field. The question is the following:
How can I set ranges for each shard when 'shardKey' is some long String (for example 64 chars)?
If you want your data to be distributed based on a hash key, MongoDB has a built-in way of doing that:
sh.shardCollection("yourDB.yourCollection", { _id: "hashed" })
This way, data will be distributed between your shards randomly, as well as uniformly (or very close to it).
Please note that you can't have both logical key ranges and random data distribution. It's either one or the other, they are mutually exclusive. So:
If you want random data distribution, use { fieldName: "hashed" } as your shard key definition.
If you want to manually control how data is distributed and accessed, use a normal shard key and define shard tags.
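For the manual route with a long hex-string key, one way to pick range boundaries is to split the key space on a short prefix, since strings compare lexicographically. The sketch below is plain JavaScript with made-up helper names, and it assumes the keys are lowercase hex strings that are roughly uniformly distributed (as hash output normally is).

```javascript
// Evenly spaced boundaries for `numShards` ranges over lowercase hex strings.
// Only a short prefix is needed because comparison is lexicographic.
function hexSplitPoints(numShards, prefixLen = 2) {
  const space = 16 ** prefixLen; // number of distinct prefixes
  const points = [];
  for (let i = 1; i < numShards; i++) {
    const boundary = Math.floor((space * i) / numShards);
    points.push(boundary.toString(16).padStart(prefixLen, '0'));
  }
  return points;
}

// Index of the range a key falls into, given the split points.
function shardIndexFor(key, points) {
  const i = points.findIndex((p) => key < p);
  return i === -1 ? points.length : i;
}

const points = hexSplitPoints(4);
console.log(points);                          // [ '40', '80', 'c0' ]
console.log(shardIndexFor('3fa9', points));   // 0
console.log(shardIndexFor('c0ffee', points)); // 3
```

Boundaries like these could then be used as the minimum and maximum values of your tag ranges, with the real 64-character keys falling between them.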

Mongo _id for subdocument array

I wish to add an _id as property for objects in a mongo array.
Is this good practice ?
Are there any problems with indexing ?
I wish to add an _id as property for objects in a mongo array.
I assume:
{
g: [
{ _id: ObjectId(), property: '' },
// next
]
}
That is the type of structure I am assuming for this question.
Is this good practice ?
Not normally. _ids are unique identifiers for entities. As such if you are looking to add _id within a sub-document object then you might not have normalised your data very well and it could be a sign of a fundamental flaw within your schema design.
Sub-documents are designed to contain repeating data for that document, i.e. the addresses of a user or something.
That being said _id is not always a bad thing to add. Take the example I just stated with addresses. Imagine you were to have a shopping cart system and (for some reason) you didn't replicate the address to the order document then you would use an _id or some other identifier to get that sub-document out.
Also you have to take into consideration linking documents. If that _id describes another document and the properties are custom attributes for that document in relation to that linked document then that's okay too.
Are there any problems with indexing ?
An ObjectId is still quite sizeable so that is something to take into consideration over a smaller, less unique id or not using an _id at all for sub-documents.
For indexes it doesn't really work any differently from the standard _id field on the document itself, and a unique index across the field should work across the collection (scenario dependent, test your queries).
NB: MongoDB will not add an _id to sub-documents for you.