I'm new to NoSQL databases and have started using MongoDB. I have a question about the MongoDB shard key: what does it actually do? Is it related to query performance? And how do we choose a good shard key for a collection?
Thanks in advance
From 10gen's docs: http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
Choosing a shard-key is very dependent on your data and your use case.
Here's some more documentation you may find relevant:
http://docs.mongodb.org/manual/faq/sharding/
http://docs.mongodb.org/manual/sharding/
Specifically:
http://docs.mongodb.org/manual/core/sharding/
Essentially sharding allows you to partition your data across different servers. This means different writes/reads are going to different servers -- distributing the load of the application across multiple servers.
The shard key is the field (or fields) in each document of the collection that MongoDB evaluates to determine which shard/server the document is routed to.
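To make the routing idea concrete, here is a minimal sketch in plain Python (no MongoDB involved). It assumes range-based placement: the shard key space is cut into chunks and each chunk is owned by one shard. The field name "user_id", the boundary values and the shard names are made up for illustration only.

```python
from bisect import bisect_right

# Hypothetical chunk boundaries on a shard key "user_id" (illustrative values).
# Chunk 0 covers keys below 1000, chunk 1 covers [1000, 2000), and so on.
BOUNDARIES = [1000, 2000, 3000]
# One owning shard per chunk (len(BOUNDARIES) + 1 chunks in total).
CHUNK_OWNERS = ["shard0", "shard1", "shard2", "shard0"]


def route(shard_key_value):
    """Return the shard that owns the chunk containing this shard key value."""
    chunk_index = bisect_right(BOUNDARIES, shard_key_value)
    return CHUNK_OWNERS[chunk_index]


docs = [{"user_id": 42}, {"user_id": 1000}, {"user_id": 2500}, {"user_id": 9999}]
# Map each document's shard key value to the shard it would be written to.
placement = {doc["user_id"]: route(doc["user_id"]) for doc in docs}
```

Reads work the same way: a query that specifies the shard key value can be sent only to the shard that owns the matching chunk.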
You can find more explanation of how shard keys work and how to select one in Kristina Chodorow's book "Scaling MongoDB".
My team will deploy a new version of our app (it captures social media posts, hashtags, etc.). They create a different DB for each user, and we may have thousands of collections in each DB. I read all the MongoDB sharding documentation and saw that I can only shard one collection or one DB at a time. Am I missing something?
We will start this new version fresh, without any databases, and grow from 0 again (for now we have 23k users), but this number will grow really quickly (100,000+ by the end of the year).
My question is: do I really need a sharded cluster? (My test setup has 3 shards with 3 microshards, 3 config servers and 2 mongos.) For now, in production, I have one large server doing all the hard work, but I don't want to keep scaling up; I think horizontal scaling is the best choice.
Can I shard all my databases automatically, or do I really need to do it one by one, going through the shard key procedure for each?
Thanks in advance
You are reading correctly. What you intend to do is so far away from what any sensible person would do that MongoDB doesn't offer any tools to support this. If you really want to go with this WTF solution, your application will be responsible for setting up sharding for each collection it creates. This forces you to give administrative permissions to the application (contrary to what any security guide recommends).
"Will you really need a sharded cluster" - that depends on how much data you will have and how often you query it with what kind of query. But it is unlikely to work anyway, because your sharded cluster will have to manage (100,000 databases * 1,000 collections) = a hundred million collections. MongoDB is not designed for scaling in that direction. The cluster will likely be so busy with bookkeeping that you won't really see any notable performance gain.
It is also questionable whether sharding would make sense even in theory. Sharding is usually only useful when you have very large collections. But in your scenario, where your data is so heavily fragmented into millions of collections, each individual collection is unlikely to be very large.
If you really want to go this route, it might in fact be a better solution to separate the databases physically by assigning each user to a database server.
Or you could just build a database architecture like a normal team would with one database for all users and one collection per type of document. You would then speed up lookups by creating a compound index on user and whatever criteria you used to tell which database a document belonged to. This index might also be a good shard key.
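As a rough illustration of that last suggestion, the sketch below folds the old per-user database and per-topic collection names into fields of each document and derives a compound key from them. It is plain Python with no database involved, and the field names "user_id" and "stream" are hypothetical stand-ins for whatever identifies the user and the former collection.

```python
# Hypothetical field names: "user_id" identifies the owner, "stream" replaces
# the name of the per-user collection the document used to live in.
COMPOUND_KEY = [("user_id", 1), ("stream", 1)]


def to_shared_document(user_id, stream, payload):
    """Fold the old database/collection names into fields of the document."""
    doc = {"user_id": user_id, "stream": stream}
    doc.update(payload)
    return doc


def key_of(doc):
    """Extract the compound key values in index order."""
    return tuple(doc[field] for field, _direction in COMPOUND_KEY)


posts = [
    to_shared_document(7, "hashtags", {"text": "#mongodb"}),
    to_shared_document(7, "posts", {"text": "hello"}),
    to_shared_document(3, "posts", {"text": "first"}),
]
# Sorting by the compound key groups each user's documents together,
# which is the order a compound index on these fields would keep them in.
ordered_keys = sorted(key_of(d) for d in posts)
```

With PyMongo, a list of (field, direction) pairs like COMPOUND_KEY is the shape that create_index accepts, so the same specification could serve for the index and as the basis of the shard key.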
I am trying to understand the sharding concept with respect to MongoDB. To understand the concept, let's say we have two scenarios:
I have two databases 'customer' and 'item'.
I have two collections 'customer' and 'item' in the same database.
Both 'customer' and 'item' datasets are huge (in TB).
My question is: in the scenarios listed above, how is sharding designed, and which one is preferred?
The examples I have come across talk about sharding a single collection. But how do we handle it when we have multiple databases/collections?
Please point me in the right direction.
MongoDB distributes data, or shards, at the collection level.
See here:
https://docs.mongodb.org/manual/core/sharding-introduction/#data-partitioning
The procedure requires you to first enable sharding on the database level (which will not automatically shard any collection).
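The two-step procedure can be sketched as the command documents involved. The snippet below only builds the documents in plain Python; the database name "shop" and the key fields "customer_id" and "item_id" are assumptions made for illustration, not a recommendation for the asker's data.

```python
def enable_sharding_command(database):
    """Command document that allows a database to hold sharded collections."""
    return {"enableSharding": database}


def shard_collection_command(database, collection, key):
    """Command document that shards one collection on the given key."""
    return {"shardCollection": f"{database}.{collection}", "key": key}


# Hypothetical names: a "shop" database with the two collections from the question.
commands = [
    enable_sharding_command("shop"),
    shard_collection_command("shop", "customer", {"customer_id": 1}),
    shard_collection_command("shop", "item", {"item_id": 1}),
]
```

Each document would be run against the admin database through a mongos, for example with PyMongo's client.admin.command(cmd). Note that every collection gets its own shardCollection command and its own key, whether the collections live in one database or in two.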
I think you should read through the docs carefully as sharding is by no means magic, and requires thorough planning and understanding of the mechanics.
I have a system where users save metadata of files.
The system serves different companies; each can have up to millions of files arranged in a classic folder structure.
I need to choose a shard key. Any pointers on that?
To prevent queries from having to check all the shards for results, the shard-key should be something which appears in each of your find- or update-queries. In a multi-client solution I would expect that the company is part of each query, so the company would make a good shard-key. When your companies have very different usage patterns and you notice that some of them have such heavy use that a single shard isn't enough for them, you could add the filename to the shard-key. But that's an optimization you can't consider before you have real-world usage statistics.
For further advice, consult the chapter Considerations for Selecting Shard Keys from the documentation.
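The difference between a query that contains the shard key and one that does not can be sketched as follows. This is a toy model in plain Python: real MongoDB places data by chunk ranges rather than by the modulo rule used here, and the shard names and query fields are made up for illustration.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]
SHARD_KEY = "company"  # assumed shard key field, as suggested in the answer


def shard_for(value):
    """Toy placement rule: hash the shard key value onto one of the shards."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def shards_to_query(query):
    """Return the shards a router would have to contact for this query."""
    if SHARD_KEY in query:
        return [shard_for(query[SHARD_KEY])]  # targeted: exactly one shard
    return list(SHARDS)  # no shard key in the query: every shard is asked


targeted = shards_to_query({"company": "acme", "folder": "/reports"})
broadcast = shards_to_query({"filename": "budget.xls"})
```

Here the first query touches a single shard because it names the company, while the second has to be sent to all three.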
Any recommended readings for setting up mongodb for sharding/scalability?
I'm looking for best practices. I don't know a lot about sharding or scaling DB solutions. Are there practical, real-world examples out there?
I apologize if I'm using the wrong terms.
Is my understanding correct in that mongodb acts like a "single database" but knows how to distribute data across disparate instances of mongodb (maybe located in different locations, etc)
Are each of those instances called shards? is that data replicated across all instances?
MongoDB provides two types of scaling.
Read scaling is provided by Replica Sets.
Write scaling is provided by Sharding.
Those links are a reasonable place to start.
There are also numerous slides and videos from the multiple Mongo conferences that have run recently. Here are some recent ones with use cases.
are each of those instances called shards? is that data replicated across all instances?
Think of a shard as a "slice" of your data. Each shard is generally composed of a replica set. So each shard has multiple computers managing replication of data.
is my understanding correct in that mongodb acts like a "single database" but knows how to distribute data across disparate instances of mongodb...
Sharding allows MongoDB to automatically distribute writes. But there's a little more to it, so I think it's best you work through some of the presentations.
MongoDB has great documentation. Topics like sharding and replica sets are documented in depth:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Replica+Sets
Apart from that, there are a lot of presentations
http://www.10gen.com/presentations
and videos
http://www.10gen.com/presentations
dealing with your questions.
Please research first and come up with some more specific questions.
I'm reading up on GridFS and the possibility of sharding it across different machines.
Reading the documentation here, the suggested shard key is chunks.files_id. This key refers to the _id of the files collection, and that _id is incremental: every new file I save in GridFS gets a new, larger _id.
In the O'Reilly book "Scaling MongoDB", using an incremental shard key is discouraged in order to avoid hotspots (the last shard would receive all the writes and reads).
What is your suggestion for sharding the GridFS collections?
Has anybody experienced the hotspot problem?
Thank you.
You should shard on files_id to keep file chunks together, but you are correct that that will create a hotspot. If you can, use something other than ObjectId for _ids in the fs.files collection (probably MD5s would be better than ObjectIds).
We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.
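The hotspot and the effect of an MD5-style key can be simulated in a few lines of plain Python. This is only a sketch under simplifying assumptions: keys are treated as 32-bit integers, the key space is split into four equal ranges (one per shard), and the MD5 of the id stands in for a content hash.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

SHARDS = 4
KEY_SPACE = 2 ** 32  # keys are treated as 32-bit integers in this sketch
# Split points that cut the key space into SHARDS equal ranges, one per shard.
SPLITS = [KEY_SPACE * i // SHARDS for i in range(1, SHARDS)]


def owner(key):
    """Range-based placement: index of the shard whose range contains key."""
    return bisect_right(SPLITS, key)


def md5_key(n):
    """Derive a 32-bit key from the MD5 digest of n (stand-in for a content hash)."""
    return int(hashlib.md5(str(n).encode("utf-8")).hexdigest()[:8], 16)


# Simulate 1,000 new files. Incremental ids are always larger than existing
# ones, so they are modelled as sitting at the top of the key space.
start = KEY_SPACE - 1000
new_ids = range(start, start + 1000)

incremental_writes = Counter(owner(i) for i in new_ids)
hashed_writes = Counter(owner(md5_key(i)) for i in new_ids)
```

In this model incremental_writes has a single entry (every insert lands on the last shard), while hashed_writes spreads the same inserts over all four shards.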
You can shard GridFS data because GridFS is just two collections: chunks and files. Sharding GridFS is a very useful thing. As for the GridFS shard key, a purely random or a purely incremental shard key is generally a poor choice. With an incremental shard key the data is not distributed evenly across shards: all writes go to the last shard, which keeps growing, and once the difference in chunk counts between shards reaches 10 or more, the balancer starts moving data to other shards. Moving data to another shard is always an expensive task that should be avoided as much as possible.
So when you choose a shard key, you should make sure the data is distributed evenly.
Also, if you get lucky, maybe Kristina, the author of 'Scaling MongoDB' (and a great specialist in shard keys), will answer your question.
The documentation says that in common cases you should use the default index files_id: 1, n: 1 as the shard key:
There are different ways that GridFS can be sharded, depending on the need. One common way to shard, based on pre-existing indexes, is:
"files" collection is not sharded. All file records will live in 1 shard. It is highly recommended to make that shard very resilient (at least 3 node replica set)
"chunks" collection gets sharded using the existing index "files_id: 1, n: 1". Some files at the end of ranges may have their chunks split across shards, but most files will be fully contained within the same shard.
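The remark about files at the end of ranges can be illustrated with a small plain-Python sketch. It assumes range-based placement on the compound key (files_id, n); the six files, four chunks per file and the two split points are made-up values chosen so that one split falls in the middle of a file.

```python
from bisect import bisect_right

# Hypothetical data: 6 files with 4 chunks each, keyed by (files_id, n).
chunk_keys = [(file_id, n) for file_id in range(6) for n in range(4)]

# Two illustrative split points on the compound key give three ranges/shards.
# The second split deliberately falls in the middle of file 3.
SPLITS = [(2, 0), (3, 2)]


def owner(key):
    """Index of the range (shard) that contains this (files_id, n) key."""
    return bisect_right(SPLITS, key)


# Collect, for every file, the set of shards that hold at least one of its chunks.
shards_per_file = {}
for files_id, n in chunk_keys:
    shards_per_file.setdefault(files_id, set()).add(owner((files_id, n)))

# Files whose chunks ended up on more than one shard.
split_files = sorted(f for f, shards in shards_per_file.items() if len(shards) > 1)
```

Only the file that straddles a split point is spread over two shards; all chunks of the other files stay together because they share the files_id prefix of the key.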
As of version 1.8.1, MongoDB only supports sharding the chunks collection on the "files_id" field, because GridFS uses an MD5 check to verify the upload and that check doesn't work across shards yet. So you cannot split a single file across shards.
Answer on Google group