Can we have more than 1024 nodes in Couchbase?

Disclaimer: I just started with NoSQL.
As I understand it, with multiple nodes the 1024 vBuckets are divided evenly between the available nodes.
So in a 2-node system, 512 vBuckets reside on each node.
Similarly, with 4 nodes, 256 vBuckets reside on each node.
Extrapolating the same distribution, how will the system behave when a 1025th node is added to the cluster?

Couchbase has a fixed number of vBuckets: there will always be 1024. This also means that the maximum number of nodes a Couchbase cluster can have is 1024, which is about 10x bigger than the biggest clusters we have seen so far. (Yes, some clients have clusters with ~100 nodes in them.)
The advantage of sharding data into 1024 vBuckets is that you won't ever need to reshard your data (an expensive operation in MongoDB, for instance). It also makes Couchbase very easy to scale out (we just need to move some vBuckets to the new node) and very easy to recover from a node failure (we just need to guarantee the correct number of replicas of each vBucket).
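For illustration, here is a minimal Python sketch of that mapping. The hash shown is an assumption for the example; real Couchbase SDKs use a CRC32-based key hash together with a vbucket map served by the cluster manager:

import zlib

NUM_VBUCKETS = 1024  # fixed in Couchbase, regardless of cluster size

def vbucket_for_key(key: bytes) -> int:
    # Reduce the key hash to one of the 1024 vbuckets.
    return zlib.crc32(key) % NUM_VBUCKETS

def vbuckets_per_node(num_nodes: int) -> list:
    # Spread vbuckets round-robin over the nodes; the real map is
    # computed by the cluster manager and pushed to the clients.
    owners = [vb % num_nodes for vb in range(NUM_VBUCKETS)]
    return [owners.count(node) for node in range(num_nodes)]

for nodes in (2, 4, 8):
    print(nodes, "nodes ->", vbuckets_per_node(nodes)[0], "vbuckets on node 0")

With 2 nodes each node owns 512 vBuckets, with 4 nodes 256, and so on; adding a node just moves some vBuckets, never re-hashes the data.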

Related

OpenSearch: Data node costs

I don't understand the costs of having 1 data node vs having 2 or more data nodes.
Will I have the same cost regardless of the number of nodes?
If I have 2 data nodes, does that mean I will have double the cost of the instances?
Thanks
Depends on the instance size: i3.2xlarge would be ~2x more expensive than i3.xlarge.
If you use one instance size then yes, 2 nodes would be 2x more expensive than 1 node but you'll get more resilience (if one node goes down your cluster can still get updates and serve data) and rolling restarts.
Also, OpenSearch needs a majority (quorum) of master-eligible nodes for master election to work reliably, so an odd number of nodes is preferable: 3 smaller nodes might be better than 2 larger ones.
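A rough sketch of that trade-off; the hourly prices below are placeholders, not real AWS figures:

# Placeholder hourly prices, purely for illustration.
PRICE_PER_HOUR = {"i3.xlarge": 0.31, "i3.2xlarge": 0.62}

def monthly_cost(instance: str, nodes: int) -> float:
    return PRICE_PER_HOUR[instance] * nodes * 24 * 30

def tolerates_node_loss(master_eligible: int) -> bool:
    # The cluster keeps accepting writes only while a majority
    # (quorum) of master-eligible nodes is still up.
    quorum = master_eligible // 2 + 1
    return master_eligible - 1 >= quorum

for instance, nodes in [("i3.2xlarge", 1), ("i3.xlarge", 2), ("i3.xlarge", 3)]:
    print(f"{nodes} x {instance}: ${monthly_cost(instance, nodes):.0f}/month, "
          f"survives a node loss: {tolerates_node_loss(nodes)}")

With these made-up prices, 2 small nodes cost about the same as 1 big one but still cannot survive a node loss; only the 3-node layout can.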

Erasure Coded Pool suggested PG count

I'm messing around with the PG calculator to figure out the best PG count for my cluster. I have an erasure-coded FS pool which will most likely use half of the cluster's space in the foreseeable future. But the PG calculator only has options for replicated pools. Should I just enter the erasure-code ratio as the replica count, or is there another way around this?
From Ceph Nautilus onwards there's a pg-autoscaler that does the scaling for you; you just need to create the pool with an initial (possibly low) value. As for the calculation itself, your assumption is correct: you take the number of chunks (K+M) into account when planning the PG count.
From the Red Hat docs:
3.3.4. Calculating PG Count
If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution. If you have less than 50 OSDs, choosing among the PG Count for Small Clusters is ideal. For a single pool of objects, you can use the following formula to get a baseline:
Total PGs = (OSDs * 100) / pool size
Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph osd erasure-code-profile get).
You should then check if the result makes sense with the way you designed your Ceph cluster to maximize data durability, data distribution and minimize resource usage.
The result should be rounded up to the nearest power of two. Rounding up is optional, but recommended for CRUSH to evenly balance the number of objects among placement groups.
For a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows:
(200 * 100) / 3 = 6667. Nearest power of 2: 8192
With 8192 placement groups distributed across 200 OSDs, that evaluates to approximately 41 placement groups per OSD. You also need to consider the number of pools you are likely to use in your cluster, since each pool will create placement groups too. Ensure that you have a reasonable maximum PG count.
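As a sketch, the quoted formula can be applied like this (for erasure-coded pools, pool size is the K+M sum, as the docs state):

def target_pg_count(osds: int, pool_size: int) -> int:
    # Baseline: (OSDs * 100) / pool size, rounded up to a power of two.
    baseline = osds * 100 / pool_size
    power = 1
    while power < baseline:
        power *= 2
    return power

print(target_pg_count(200, 3))      # replicated pool, size 3      -> 8192
print(target_pg_count(200, 4 + 2))  # erasure-coded pool, k=4, m=2 -> 4096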

Choosing the compute resources of the nodes in the cluster with horizontal scaling

Horizontal scaling means that we scale by adding more machines to the pool of resources. Still, there is a choice of how much power (CPU, RAM) each node in the cluster will have.
When the cluster is managed with Kubernetes, it is extremely easy to set any CPU and memory limits for Pods. How do you choose the optimal CPU and memory size for cluster nodes (or Pods in Kubernetes)?
For example, there are 3 nodes in a cluster with 1 vCPU and 1GB RAM each. To handle more load there are 2 options:
Add the 4th node with 1 vCPU and 1GB RAM
Give each of the 3 nodes more power (e.g. 2 vCPU and 2GB RAM)
A straightforward solution is to calculate the throughput and cost of each option and choose the cheaper one. Are there any more advanced approaches for choosing the compute resources of the nodes in a cluster with horizontal scalability?
For this particular example I would go for 2x vCPU rather than another 1 vCPU node, mainly because I believe running an OS for anything serious on a single vCPU is just wrong. For the system to behave decently it needs 2+ cores available; otherwise it's too easy to overwhelm that one vCPU and grind the node to a halt. There is no ideal algorithm for this, though. It will depend on your budget, the characteristics of your workloads, etc.
As a rule of thumb, don't stick to instances that are too small, as there is a bunch of stuff that has to run on every node regardless of its size, and the more nodes, the more overhead. 3x 4 vCPU + 16/32GB RAM sounds like a nice plan for starters, but again... it depends on what you want, need and can afford.
The answer is related to performance metrics such as latency and throughput:
Latency is the time interval between sending a request and receiving the response.
Throughput is the request processing rate (requests per second).
Latency influences throughput: higher latency = lower throughput.
If a business transaction consists of multiple sequential calls to services that cannot be parallelized, then compute resources (CPU and memory) have to be chosen based on the desired latency. Adding more instances of the services (horizontal scaling) will not have any positive influence on latency in this case.
Adding more instances of a service increases throughput, allowing more requests to be processed in parallel (if there are no bottlenecks).
In other words: allocate CPU and memory so that the service has the desired response time, and add more service instances (scale horizontally) to handle more requests in parallel.
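As a back-of-the-envelope illustration of that distinction (using Little's law; the numbers are placeholders):

def throughput_rps(instances: int, concurrency_per_instance: int,
                   latency_seconds: float) -> float:
    # Little's law: throughput = requests in flight / latency.
    # Scaling out multiplies the in-flight capacity, but each
    # request still takes the same time end to end.
    return instances * concurrency_per_instance / latency_seconds

print(throughput_rps(3, 100, 0.2))  # 3 nodes, 200 ms latency -> 1500 req/s
print(throughput_rps(4, 100, 0.2))  # a 4th node -> 2000 req/s, latency unchanged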

What's the maximum size of a ZooKeeper ensemble?

How many nodes at most can be part of a ZooKeeper ensemble? Is it 255? If you want to go beyond that, should there be multiple ensembles?
Here is a similar question: Maximum servers in a ZooKeeper ensemble cluster?
I'm not sure about the actual limits in the ZooKeeper code, but any ensemble larger than, say, 13 servers would be really strange; at some point write performance starts to suffer significantly.
Proper scaling would be having multiple clusters for different use cases, or alternatively using Observers, which don't affect write speed.
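A quick illustration of why: every write has to be acknowledged by a quorum (majority) of voting servers, so the bigger the ensemble, the more acknowledgements each write waits for (Observers don't vote, which is why they scale reads without slowing writes):

def quorum_size(voting_servers: int) -> int:
    # ZooKeeper commits a write once a majority of voters acknowledges it.
    return voting_servers // 2 + 1

for n in (3, 5, 7, 13, 255):
    print(f"{n:3d} servers: each write waits for {quorum_size(n)} acks, "
          f"tolerates {n - quorum_size(n)} failures")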

What is the max size of a collection in MongoDB?

I would like to know what the maximum size of a collection in MongoDB is.
The MongoDB limitations documentation mentions that a single MMAPv1 database has a maximum size of 32TB.
Does this mean the max size of a collection is 32TB?
If I want to store more than 32TB in one collection, what is the solution?
There are theoretical limits, as I will show below, but even the lower bound is pretty high. It is not easy to calculate the limits correctly, but the order of magnitude should be sufficient.
mmapv1
The actual limit depends on a few things like the length of shard names and the like (that adds up if you have a couple of hundred thousand of them), but here is a rough calculation with real-life data.
Each shard needs some space in the config database, which is limited like any other database to 32TB on a single machine or in a replica set. On the servers I administrate, the average size of an entry in config.shards is 112 bytes. Furthermore, each chunk needs about 250 bytes of metadata. Let us assume optimal chunk sizes of close to 64MB.
We can have at most 500,000 chunks per server. 500,000 * 250 bytes equals 125MB of chunk information per shard. So, per shard, we need 125.000112 MB in the config database if we max everything out. Dividing 32TB by that value shows that we can have a maximum of slightly under 256,000 shards in a cluster.
Each shard in turn can hold 32TB worth of data. 256,000 * 32TB is 8.192 exabytes, or 8,192,000 terabytes. That would be the limit for our example.
Let's say it's 8 exabytes. As of now, this easily translates to "enough for all practical purposes". To give you an impression: all the data held by the Library of Congress (arguably one of the biggest libraries in the world in terms of collection size) is estimated at around 20TB, including audio, video, and digital materials. You could fit that into our theoretical MongoDB cluster some 400,000 times. Note that this is the lower bound of the maximum size, using conservative values.
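For reference, here is the same back-of-the-envelope calculation as a sketch, using the values from the answer (decimal units; exact figures will vary per deployment):

SHARD_ENTRY_BYTES = 112         # average config.shards document (measured above)
CHUNK_METADATA_BYTES = 250      # metadata per chunk
CHUNKS_PER_SHARD = 500_000      # ~64 MB chunks, maxed out
CONFIG_DB_LIMIT_MB = 32_000_000 # 32 TB mmapv1 limit for the config database
SHARD_DATA_LIMIT_TB = 32        # the same 32 TB limit applies to each shard

metadata_per_shard_mb = (CHUNKS_PER_SHARD * CHUNK_METADATA_BYTES
                         + SHARD_ENTRY_BYTES) / 1_000_000
max_shards = CONFIG_DB_LIMIT_MB / metadata_per_shard_mb
max_data_exabytes = max_shards * SHARD_DATA_LIMIT_TB / 1_000_000

print(f"{metadata_per_shard_mb:.6f} MB of config metadata per shard")
print(f"~{max_shards:,.0f} shards, ~{max_data_exabytes:.3f} exabytes of data")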
WiredTiger
Now for the good part: the WiredTiger storage engine does not have this limitation. The database size is not limited (since there is no limit on how many data files can be used), so we can have an unlimited number of shards. Even when those shards run on mmapv1 and only our config servers on WiredTiger, the size of a cluster becomes nearly unlimited – the limitation of 16.8M TB of RAM on a 64-bit system might cause problems somewhere, forcing the indices of the config.shards collection to be swapped to disk and stalling the system. I can only guess, since my calculator refuses to work with numbers in that area (and I am too lazy to do it by hand), but I estimate the limit here to be in the two-digit yottabyte area (and the space needed to host it somewhere around the size of Texas).
Conclusion
Do not worry about the maximum data size in a sharded environment. No matter what, it is more than enough, even with the most conservative approach. Use sharding, and you are done. By the way: even 32TB is a hell of a lot of data; most clusters I know hold less data and shard because IOPS and RAM utilization exceeded a single node's capacity.