Is a hashed shard key right for me? - mongodb

Assume my documents look something like this:
{
    "_id" : ObjectId("53d9560f2521e7a28f550a78"),
    "tenantId" : "tenant1",
    "body" : "Some text - it's the point of the document."
}
There are a couple of obviously bad shard key choices:
{tenantId : 1} This would eventually give me large, unsplittable chunks.
{_id : 1} There are a lot of writes and no updates. The ascending key would give me hotspots.
I think I'm left with two possibilities:
{tenantId : 1, _id : 1} The hotspot problem with _id is mitigated by the addition of tenantId. I can easily search with this full key.
{_id : "hashed"} No hotspots, but I have concerns....
My concern with the hashed key is that it's now random. In Scaling MongoDB, the author warns against random keys because:
The configuration server notices that Shard 2 has 10 more chunks than Shard 1 and decides it should even things out. MongoDB now has to load a random five chunks' worth of data into memory and send it to Shard 1. This is data that wouldn't have been in memory ordinarily, because it's a completely random order of data. So, now MongoDB is going to be putting a lot more pressure on RAM and there's going to be a lot of disk IO going on (which is always slow).
So, my question is: Are hashed keys only a good choice if your only other choice is a monotonically ascending key? In my case, would the combination of tenantId and _id be better?
Update: To answer a question in the comments, we only ever retrieve these documents one-by-one. So depending on which shard key we choose, queries would be like these:
{_id : "53d9560f2521e7a28f550a78"}
or
{_id : "53d9560f2521e7a28f550a78", tenantId : "tenant1"}

Related

MongoDB querying performance for over 5 million records

We've recently passed 2 million records in one of our main collections, and we've started to suffer from major performance issues on that collection.
The documents in the collection have about 8 fields which you can filter by using the UI, and the results are supposed to be sorted by a timestamp field recording when the record was processed.
I've added several compound indexes with the filtered fields and the timestamp,
e.g:
db.events.ensureIndex({somefield: 1, timestamp:-1})
I've also added a couple of indexes covering several filters at once, hoping to achieve better performance. But some filters still take an awfully long time to perform.
I've made sure, using explain(), that the queries do use the indexes I've created, but performance is still not good enough.
I was wondering if sharding is the way to go now... but we will soon have about 1 million new records per day in that collection, so I'm not sure it will scale well.
EDIT: example for a query:
> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey#acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
    "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 30060,
    "nscanned" : 30060,
    "nscannedObjectsAllPlans" : 120241,
    "nscannedAllPlans" : 120241,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 1,
    "nChunkSkips" : 0,
    "millis" : 26495,
    "indexBounds" : {
        "user.userName" : [
            [
                "nickey#acme.com",
                "nickey#acme.com"
            ]
        ],
        "timestamp" : [
            [
                {
                    "$maxElement" : 1
                },
                {
                    "$minElement" : 1
                }
            ]
        ]
    },
    "server" : "yarin:27017"
}
Please note that deviceType has only 2 values in my collection.
This is searching for a needle in a haystack. We'd need some explain() output for those queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:
1. Ensure it's not because of insufficient RAM and excessive paging.
2. Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds for a query or command; anything slower will be logged). See the sketch after this list.
3. Inspect the slow queries in db.system.profile and run them manually using explain().
4. Try to identify the slow operations in the explain() output, such as scanAndOrder or large nscanned values.
5. Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing that filter setting for the end user, or show them a warning dialog that the operation might be slow.
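As a minimal sketch of steps 2 and 3 (the 100 ms threshold is an arbitrary assumption):

// Log every operation slower than 100 ms
db.setProfilingLevel(1, 100)
// Later: inspect the slowest recent operations recorded by the profiler
db.system.profile.find({ millis : { $gt : 100 } }).sort({ ts : -1 }).limit(5).pretty()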
A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.
Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.
Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents (since 3 in 5 documents match, 100 / 0.6 ≈ 167), which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use with time is high, you might end up having to scan thousands of documents before finding a hundred matches.
So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.
Things get worse when the data is too big for RAM, because then having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk randomly, which takes a lot of time.
The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.
If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in. But that is fraught with its own perils.
-- EDIT --
The explain output you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey#acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by a high-selectivity index. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'MOBILE'.
I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if they switched between mobile and web on a daily basis, the query would be fast).
Making this particular query faster could be done using a compound index that contains the device type, e.g. using
a) ensureIndex({'user.userName' : 1, 'userAgent.deviceType' : 1, 'timestamp' : -1})
or
b) ensureIndex({'userAgent.deviceType' : 1, 'user.userName' : 1, 'timestamp' : -1})
Unfortunately, that means that queries like find({'user.userName' : 'foo'}).sort({'timestamp' : -1}) can't use the same index anymore, so, as described, the number of indexes will grow very quickly.
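For example, to keep a query like find({'user.userName' : 'foo'}).sort({'timestamp' : -1}) fast alongside option (a), you would still need a dedicated index such as:

db.audit.ensureIndex({'user.userName' : 1, 'timestamp' : -1})

which, per the explain output above, is exactly the index that already exists here. Each distinct query shape tends to demand its own compound index.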
I'm afraid there's no very good solution for this using mongodb at this time.
Mongo only uses 1 index per query.
So if you want to filter on 2 fields, mongo will use an index on one of them, but still needs to scan the entire subset that the index returns.
This means that basically you'll need an index for every type of query in order to achieve the best performance.
Depending on your data, it might not be a bad idea to have one query per field and process the results in your app.
This way you'll only need an index on each field, though it may be too much data to process.
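A minimal sketch of that approach (collection and field names are illustrative assumptions): run one indexed query per filter, project only the _id, and intersect in the application:

// One indexed query per filter, fetching only _ids (as hex strings)
var idsA = db.events.find({ fieldA : 'x' }, { _id : 1 }).toArray().map(function (d) { return d._id.str; });
var idsB = db.events.find({ fieldB : 'y' }, { _id : 1 }).toArray().map(function (d) { return d._id.str; });
// Intersect the two id sets in the app
var seen = {};
idsA.forEach(function (id) { seen[id] = true; });
var both = idsB.filter(function (id) { return seen[id]; });

As noted above, this only pays off when each per-field result set is small enough to ship to the app.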
A caveat about $in: despite a common claim that MongoDB never uses an index for $in, it can (the explain output above shows the user.userName_1_timestamp_-1 index being chosen for a $in query). Still, a $in with many values produces many index bounds to scan, so rewriting a single-value $in as a plain equality match keeps the plan simpler. See:
http://docs.mongodb.org/manual/core/query-optimization/

Mongodb choose shard key

I have a mongodb collection which I want to shard. This collection holds messages from users and a document from the collection has the following properties
{
    _id : ObjectId,
    conversationId : ObjectId,
    created : DateTime
}
All queries will be done using the conversationId property and sorted by created.
Sharding by _id obviously won't work because I need to query by conversationId (plus _id is of type ObjectId, which won't scale very well under many inserts).
Sharding by conversationId would be a logical choice in terms of query isolation, but I'm afraid it won't scale very well with many inserts (even if I use a hashed shard key on conversationId, or change the property's type from ObjectId to something non-incremental like a GUID), because some conversations might be much more active than others (i.e. have many more messages added to them).
From what I see in the mongo documentation: "The shard key is either an indexed field or an indexed compound field that exists in every document in the collection."
Does this mean that I can create a shard key on a compound index?
Bottom line is that:
creating a hashed shard key from the _id property would offer good distribution of the data
creating a shard key on conversationId would offer good query isolation
So a combination of these two things would be great, if it could be done.
Any ideas?
Thanks
For your case, neither field on its own looks like a good choice for sharding. For instance, if you shard on conversationId, it will result in hotspotting, i.e. most of your inserts will happen on the last shard, as conversationId would monotonically increase over time. The same problem applies to the other two fields as well.
Also, conversationId will not offer a high degree of isolation, as conversationId would monotonically increase over time (newer conversations will get updated much more frequently than very old ones).
In your case, a "hashed shard key" (version 2.4 onwards) over conversationId would be the smart choice, as one would imagine there can be tons of conversations going on in parallel.
Refer to the following link for details on creating a hashed shard key: http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/
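A minimal sketch of that setup (database and collection names are assumptions, not from the question):

sh.enableSharding("chat")
db.messages.ensureIndex({ conversationId : "hashed" })
sh.shardCollection("chat.messages", { conversationId : "hashed" })

Equality queries on conversationId stay targeted to a single shard (mongos hashes the queried value), while the hash spreads the monotonically increasing ObjectIds evenly across shards.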

Good Shard Keys in MongoDB

From the book Scaling MongoDB:
The general case
We can generalize this to a formula for shard keys:
{coarseLocality : 1, search : 1}
So my question is: is that correct? Shouldn't it be the opposite for better writes?
Also from the book:
This pattern continues: everything will always be added to the "last" chunk, meaning everything will be added to one shard. This shard key gives you a single, undistributable hot spot.
So, given that my app always searches by user_id and for the last entries in the collection, what is the best shard key I should have? This:
{_id:1, user_id:1}
or:
{user_id:1,_id:1}
Kristina (author of Scaling MongoDB) wrote a blog post which has some example strategies explained in the guise of a game: How to Choose a Shard Key: The Card Game.
There are many considerations to choosing a good shard key based on your application requirements and use cases.
The general advice of {coarseLocality : 1, search : 1} order is to ensure there is some locality of your data for reading.
So in your case, you would most likely want: {user_id:1,_id:1}.
That will provide some locality of data for the same user_id when querying, and ideally your common queries will be able to get their data from a single shard.
The opposite order may provide for better write distribution (assuming _id is not a monotonically increasing key like a default ObjectId) but a potential downside is reliability: if your data for a read query is scattered across all shards, you will have retrieval problems if any one shard is down.
So, given that my app always searches by user_id and for the last entries in the collection.
If you commonly search by user_id (and without _id) this will also affect your choice of shard key and index optimization. To find the last entries MongoDB will have to do a sort; you will want to be doing that sort on a single shard rather than having to gather the data from all shards and sorting. If your _id happens to be date-based that would be beneficial as part of the shard key in order to find the last entries.
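To make that concrete, a minimal sketch (database and collection names are assumptions):

sh.shardCollection("mydb.entries", { user_id : 1, _id : 1 })
// Targeted to one shard, and the shard-key index supports the sort
// (someUserId is a placeholder for the user being queried)
db.entries.find({ user_id : someUserId }).sort({ _id : -1 }).limit(10)

Since a default ObjectId begins with a creation timestamp, sorting on _id descending approximates "last entries" for that user.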

choosing a mongodb sharding key

I am launching a data storage service for my application. MongoDB is running as the storage mechanism, and I have created 2 shards to start.
The application will be storing event data, and all data will be structured as follows:
{
    _id: '4fa2f7e25626cd1374000002',
    created_at: '2012-05-03T21:25:54+00:00',
    name: 'client_session_connect',
    session_remote_id: '74ACF9AA-9E09-11E1-8C9E-8462380DA5E6',
    zone_id: '74ACF9AA-9E09-11E1-8C9E-1231380DA5E6',
    additional: {
        some_other_key: 'value'
    }
}
Events will have a variety of names, and any new event can be created at any time with a new event name. There will be plenty of events in the system with the same name. _id, created_at, and name will be part of every event, but no other values are guaranteed.
Based on what I have read (here, and here), it seems that the best sharding key would be { name: 1, created_at: 1 }. Would I be correct in this interpretation?
From what you've stated, it seems like that would be a good shard key, with a few caveats:
- Shard keys are immutable, so if you ever need to change the "name" field of a document, you'll need to delete and reinsert it (probably this isn't an issue for you, unless you intend to change names often).
- If you write a lot of documents with the same "name" in quick succession, all these writes will go to the same chunk, since "created_at" is presumably an increasing field. Eventually the chunk will be split into multiple chunks and balanced off to another machine, so this is only a problem if you expect to receive a huge volume of writes of docs with the same "name".
- If the "name"s are not uniformly distributed, you could hash the name and store the result in a new field of your document, then make the shard key {hashedName : 1, created_at : 1}. This might give a more even load distribution, reducing the amount of balancing later. It does add a little complexity to your documents, though (a sketch follows below).
Assuming you're aware of these things, {name: 1, created_at: 1} may very well be the best shard key for you.
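As a sketch of the hashed-name variant from the last caveat (the hashedName field, the collection name, and the use of the shell's hex_md5 helper are illustrative assumptions):

// Hypothetical write path: compute a hash of the name before inserting
var doc = { name : 'client_session_connect', created_at : new ISODate() };
doc.hashedName = hex_md5(doc.name);   // hex_md5 is built into the legacy mongo shell
db.events.insert(doc);
// Then shard on the hash plus the timestamp
sh.shardCollection("mydb.events", { hashedName : 1, created_at : 1 })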

do multiple shard keys help performance in mongodb?

Since a sharded database uses the shard key to split chunks AND route queries, I think maybe more fields in the shard key can help make more queries targeted.
I tried to specify multiple keys like this:
db.runCommand({ shardcollection : "test.users", key : { _id : 1, email : 1, address : 1 } })
but I have no idea if it works, or what the downsides of doing this are.
To be clear here, you can only have one shard key. So you cannot have multiple shard keys.
However, you are suggesting a compound index as the shard key. This can be done, but there are some limitations.
For example, the combination of _id, email, and address must be unique.
See the documentation on choosing a shard key; there are several more considerations than I can list here.
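For reference, a minimal sketch of sharding on that compound key using the sh helpers (assuming an empty or already-indexed collection):

sh.enableSharding("test")
db.users.ensureIndex({ _id : 1, email : 1, address : 1 })
sh.shardCollection("test.users", { _id : 1, email : 1, address : 1 })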
Selection of a shard key is based on:
{coarseLocality : 1, search : 1}
where coarseLocality is whatever locality you want for your data, and search is a common search on your data.
"You must have an index on the key you shard by, so if you choose a randomly-valued key that you don't query by, you're basically wasting an index. Every additional index makes writes slower, so it's important to keep the number of indexes as low as possible."
So, adding more fields to the shard key doesn't help much by itself.
Extract taken from Kristina Chodorow's book "Scaling MongoDB".