MongoDB: using modulo to compute sharding key - mongodb

I am going to move my 500-million-row table from PostgreSQL to a sharded collection in MongoDB.
I am in the middle of choosing a proper shard key.
The table is Posts(id, users_id, title, content). Each post belongs to a specific user. Users have from 100 to 1 million posts.
Is it possible to set the shard key to a modulo of users_id (e.g. users_id % 128)? I query the database by users_id (in the WHERE clause).
Is it a good idea? I am asking because I haven't found anything about using modulo in a shard key.

You probably want your shard key to be {users_id: 'hashed'}; this way MongoDB will take care of the distribution for you. Read more here: http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/
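
A minimal sketch of what that could look like, assuming a database named `mydb` and a collection named `posts` (both names are hypothetical). It only builds the documents for the `enableSharding` and `shardCollection` admin commands; actually sending them needs a connection to a `mongos`, which is shown in the trailing comment using PyMongo.

```python
def shard_commands(db_name, coll_name, field):
    """Build the admin commands that shard a collection on a hashed field."""
    namespace = f"{db_name}.{coll_name}"
    return [
        {"enableSharding": db_name},
        {"shardCollection": namespace, "key": {field: "hashed"}},
    ]

# "mydb" and "posts" are placeholder names, not taken from the question.
commands = shard_commands("mydb", "posts", "users_id")
for cmd in commands:
    print(cmd)

# To run them against a real cluster (assumes PyMongo and a reachable mongos):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   for cmd in commands:
#       client.admin.command(cmd)
```

Compared with a hand-computed `users_id % 128`, the hashed key keeps the full cardinality of users_id, whereas the modulo would leave only 128 distinct key values and therefore at most 128 chunks. In both cases all posts of one user still hash to the same value, so a user with a million posts stays in a single chunk.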

Related

Mongodb: Determining shard key strategy on compound index

I have a collection with 170 millions+ documents and it is only going
to increase. The size of the collection is not that huge, currently
around 70 GB.
The collection has a compound index on two fields: {AgentId:1, PropertyId:1}.
Generally one imports a huge file (millions of documents) belonging to
a particular AgentId, but the PropertyId (non-numeric, nullable) is
mostly a random unique value.
Currently I have two shards with a shard key based on {_id: hashed}. But
I am planning to change the shard key to the compound index {AgentId:1,
PropertyId:1} because I think it will improve query performance (most
of the queries filter on AgentId). I am not sure whether one can
have a nullable field in the shard key. If that is a problem, the app
will make sure that the PropertyId is always a random number.
So I am looking to get a picture of:
How will the data be distributed to shards during insertion,
and how are the chunk ranges calculated during insertion?
Since the PropertyId is a random value, does the compound key fit the
definition of a monotonically increasing value?
I am a newbie to MongoDB and wanted to know if I am on the right path.
Thanks
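There is no automatic support in MongoDB for changing a shard key after sharding a collection (newer releases, MongoDB 5.0 and later, add a reshardCollection command, so check the documentation for your version).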
This reality underscores the importance of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
1. Dump all data from MongoDB into an external format.
2. Drop the original sharded collection.
3. Configure sharding using a more ideal shard key.
4. Pre-split the shard key range to ensure an initial even distribution.
5. Restore the dumped data into MongoDB.
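
The pre-split step can be sketched as below, under the assumption that the leading shard key field is an integer in a known range (the question does not say what AgentId looks like, so the 0 to 1000 range here is purely illustrative). The function only computes evenly spaced boundaries; each boundary would then be passed to the `split` admin command or the `sh.splitAt()` shell helper.

```python
def split_points(lo, hi, n_chunks):
    """Return n_chunks - 1 evenly spaced integer boundaries between lo and hi."""
    if n_chunks < 2:
        return []  # a single chunk needs no split point
    width = hi - lo
    return [lo + (width * i) // n_chunks for i in range(1, n_chunks)]

# Hypothetical key range 0..1000 divided into 4 initial chunks.
points = split_points(0, 1000, 4)
print(points)
```

Evenly spaced boundaries only give an even data distribution if the key values themselves are spread evenly over the range; for skewed data the boundaries would have to come from the actual key distribution in the dump.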

Choosing the right shard key in MongoDB

We are building our first MongoDB database and are currently trying to choose the right shard key.
Each document in our main collection contains around 40 voice-call-related fields, and the main field that we use in queries is the UserId field. This is why we are thinking about a compound shard key of UserId and CallStartTime.
We are not sure about the second field, since CallStartTime is always advancing and one might argue that it is not random enough. This led us to consider replacing it with UserId and a hashed _id (MongoDB's internal id after hashing).
Is the first option OK, or would we be better off using the latter?
Consider the recommendations in the documentation here: http://docs.mongodb.org/manual/core/sharded-cluster-internals/#shard-keys
Or, if there is no natural choice, consider using a hashed shard key (MongoDB 2.4+):
http://docs.mongodb.org/manual/reference/glossary/#term-hashed-shard-key
What sort of queries are you performing? What are the access patterns?
Ideally you want a key with good cardinality, write scaling and query isolation.
In your examples above you would need to know the CallStartTime or the hash to avoid scatter-gather operations.

MongoDB: Should sharding be enabled from the first day?

I'm new to Mongo as well as sharding.
Our app will be served by MongoDB and we expect billions of records to be stored. However, the db will grow slowly, i.e. it will probably take years to reach that huge size.
Furthermore, we will mostly use a special encrypted value to look up records. This holds all the info needed to find the given record, i.e. the primary key, the shard key, etc.
My question is: Should we enable sharding from the first day and encrypt shard key + PK? Or could we enable sharding later (when needed) and tell mongo to look up certain records (the ones whose encrypted ID holds no shard key) in the default, "unsharded" collection?
What's the best way to do this?
Thanks in advance!

Can we move the document dynamically across shards in mongo db?

I am building a tracking platform which has the following use cases.
Need to track 50,000 vehicles
Each vehicle relays its location every 60 secs.
Get API which returns all the vehicles in the X km range.
So I need to scale writes and also achieve query isolation.
I can create a sharded cluster with the geographical region (geohash) as the shard key. This will help me balance the writes and also achieve query isolation. But what happens when a vehicle moves across regions? Does MongoDB automatically move the document to the new shard in this case?
You cannot change the shard key fields for a record once written. Using the region as the shard key would prevent you from moving across regions unless you delete the record in the original region and then insert it using the new one.
On choosing a shard key, look for one which matches your most common query pattern. Querying on the shard key will allow you to retrieve a record directly from a shard. Queries which don't use the shard key will have to perform a scatter-gather query against all shards.
If you are on or can use MongoDB 2.4, and you don't need to perform range-based queries, you may want to consider using a hashed shard key, which will allow for even distribution even if your shard key is monotonically increasing. See this page for advice on choosing a shard key.
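
The difference between ranged and hashed placement for a monotonically increasing key can be illustrated with a toy simulation. This is a sketch, not MongoDB's actual algorithm: the four shards, the chunk boundaries, and the use of MD5 followed by a modulo are all simplifying assumptions (MongoDB range-partitions the hashed values into chunks rather than taking a modulo).

```python
import hashlib
from bisect import bisect_right
from collections import Counter

N_SHARDS = 4  # hypothetical cluster size

def range_shard(key, boundaries):
    """Ranged placement: index of the chunk whose range contains the key."""
    return bisect_right(boundaries, key)

def hashed_shard(key, n_shards=N_SHARDS):
    """Hashed placement stand-in: hash the key, then place it by hash value."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# Chunk boundaries that were chosen when only keys 0..999 existed.
boundaries = [250, 500, 750]
# New inserts carry ever-increasing keys, e.g. counters or timestamps.
new_keys = range(1000, 2000)

by_range = Counter(range_shard(k, boundaries) for k in new_keys)
by_hash = Counter(hashed_shard(k) for k in new_keys)
print("ranged:", dict(by_range))
print("hashed:", dict(by_hash))
```

With ranged placement every new key falls into the last chunk, so one shard takes all the writes until the chunk is split and migrated. With hashed placement the same keys are spread over all shards, at the cost of turning range queries on that key into scatter-gather queries.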

Use Native Java UUID.getRandom() as Sharding key and as the _id?

I understand that with a write-heavy application, using the ObjectId is a really bad idea for a shard key. However, would it be a good idea to use the native UUID.randomUUID() from Java as a shard key, since these values are truly random and won't cause hotspotting on a single shard?
These IDs are 128 bits long and look like:
5842fa92557947f1b020041ff74868a4
308947443e564d80b97dd8411b4b727e
f8a7ee765bed4ce3bcc5800ac3a2a710
1bcfd08b89e94c58ae7695b3e7a1bc4f
It's very similar to an ObjectId (a 96-bit value).
Plus, since it is mandatory to have an index on the _id, the shard key would be the _id and we would save RAM by not creating another index for the shard key. Every collection would be ready for sharding.
Is it a matter of performance within mongod, or of disk/RAM space?
The collision rate for a UUID is (from Wikipedia):
only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%. The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
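
The order of magnitude of the quoted figure can be checked with the usual birthday-bound approximation, p ≈ 1 - exp(-n² / (2 · 2^bits)). The sketch below assumes version-4 UUIDs, which is what Java's UUID.randomUUID() produces, with 122 random bits; the function and variable names are illustrative.

```python
import math

RANDOM_BITS = 122  # a version-4 UUID has 122 random bits out of 128

def collision_probability(n, bits=RANDOM_BITS):
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return -math.expm1(-(n * n) / (2.0 * 2.0 ** bits))

seconds_per_century = 100 * 365.25 * 24 * 3600
n = 1e9 * seconds_per_century  # one billion UUIDs per second for 100 years
p = collision_probability(n)
print(f"{n:.3e} UUIDs -> collision probability ~ {p:.2f}")
```

The result lands in the same ballpark as the quoted "about 50%", which is enough to show that collisions are not a practical concern at the scale of a single collection.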
Using UUIDs would distribute your write access across the shards, but you would have no query isolation, so your queries would perform less than optimally. The fastest queries are the ones answered by only one shard.
http://docs.mongodb.org/manual/core/sharding-internals/#sharding-shard-key-query-isolation
It would help to know what is in your collection in order to advise you better.
Using UUIDs is perfectly OK (provided that you are only going to look up those documents by their primary/shard key). One of the purposes of a shard key is to group related documents together. If we were building, say, Flickr, our shard key would start with user_id, so that the photos of a user sit together on one shard. If your documents are not related and the primary key is also the shard key, then there's no problem.
You may run into a problem due to https://jira.mongodb.org/browse/JAVA-403, which is slated to be fixed in the next release.