Shard and rebalance via CLI - nosql

Is it possible to rebalance shards via the RethinkDB command line?
I tried to do it, but all the data remains in one of the shards. In the web interface I can rebalance automatically.
Thanks!

This screencast (starting from 8:48) explains how to set up a cluster with a mix of command line and web interface.
In the documentation: Sharding and replication (section: Sharding via the command-line interface) there is some explanation on how to set up split points.
Unfortunately, there is little documentation on such specific things right now.

You can shard with the CLI, but this is done by manually setting split points (and not by setting the number of shards).
The syntax is
split shard <TABLE> <SPLIT-POINT>
The web interface infers good split points from the distribution of keys; the CLI currently doesn't do this.
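For example, assuming a table called users whose primary keys are user names, you could divide it into two shards at the key m with something like:
split shard users m
The exact split-point format depends on the key type, so check the "Sharding via the command-line interface" section linked above for details.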

Related

Kafka on multiple instances of EC2

I am new to Kafka and trying to do a project. I wanted to do it as it would be done in real life, but I am a bit confused. While searching the internet I found that if I want to have 3 brokers and 3 ZooKeepers, to provide a replication factor of 2 and a quorum, I need 6 EC2 instances. I have been looking through YouTube for examples, but as far as I can see, all of them show multiple brokers running on a single machine. From my understanding it's better to keep the ZooKeepers and the brokers on separate VMs, so if one goes down I still have all of the rest. Can you confirm that?
Also, I'm wondering how to set partitioning. Is it important to decide when creating a topic, or can I change it later when I need to scale?
Thanks in advance
I've been looking for information on YouTube and Google.
My suggestion would be to use MSK Serverless and forget how many machines exist.
Kafka 3.3.1 doesn't need ZooKeeper anymore. ZooKeeper doesn't need to run on separate machines (although that is recommended). You can also run multiple brokers on one server... so I'm not sure why you would need 6 instances for a replication factor of 2.
Regarding partitions: yes, create them ahead of time (over-provision if necessary), since you cannot easily move data across partitions when you do scale.
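For example, here is a minimal sketch with the Java kafka-clients AdminClient (the broker addresses, topic name, and counts are hypothetical) that creates a topic up front with 6 partitions and replication factor 2:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical broker addresses; replace with your own bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 2. Partitions are fixed at creation,
            // so over-provision if you expect to add consumers later.
            NewTopic topic = new NewTopic("orders", 6, (short) 2);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}

You can add partitions to an existing topic later, but existing data is not redistributed across them, which is why over-provisioning up front is the safer choice.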

MongoDB read preferences Secondary

I am just starting to use MongoDB while testing it with YCSB and I have a couple of questions about read preferences and its implementation.
I have set up 1 primary and 2 secondary nodes, and set the read preference in the YCSB Java client like this: mongo.setReadPreference(ReadPreference.secondary());
1. Why, if I point YCSB at the primary node, can it still perform read operations without generating an error message? I also checked the logs and can see that the primary is the node that served these requests.
2. How do clients know about secondary nodes in a production environment? Where do you connect clients by default? Do all the clients go to the primary, retrieve the list of secondaries, and then reconnect to the secondaries to perform reads?
3. By browsing the source code I found that the logic for selecting an appropriate replica based on preferences is in replica_set_monitor.cpp, although it is not yet clear to me where this code is executed: on the primary, a secondary, or the client?
Thank you
When your application connects only to the primary, it doesn't learn about any secondaries. ReadPreference.secondary() is just a preference, not a mandate. When the application doesn't know that a secondary exists, it will read from the primary.
To make your application aware of the secondaries, you need to use the class DBClientReplicaSet instead of DBClientConnection; it takes a std::vector of hosts as a constructor argument, and this vector should include all members of the set.
If you would prefer to keep the application unaware of the replica-set members, you could set up a sharded cluster (which might consist of only a single shard) and connect to the router. The mongos process will then handle the replica-set abstraction.
When an application connects to any active replica-set member, it issues an internal variant of rs.status(), which is in fact the isMaster command (http://docs.mongodb.org/meta-driver/latest/legacy/connect-driver-to-replica-set/), and caches the response for a specific time until it is deemed appropriate to refresh that information. The C++ driver even tells you the class that holds the cache: http://api.mongodb.org/cxx/current/classmongo_1_1_replica_set_monitor.html
Holds state about a replica set and provides a means to refresh the local view.
There are a number of ways for the application to discover the set; the most common is to provide a seed list in the connection string that your application code passes to the driver. That way it can connect to any member and ask: "What is here?"
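For instance, here is a sketch with the current MongoDB Java driver (the YCSB snippet above uses an older driver API; the host names here are hypothetical), combining a seed list with a secondary read preference:

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class SeedListExample {
    public static void main(String[] args) {
        // Seed list of replica-set members; the driver discovers the full topology
        // from whichever member it reaches first.
        ConnectionString uri = new ConnectionString(
            "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0");
        MongoClientSettings settings = MongoClientSettings.builder()
            .applyConnectionString(uri)
            .readPreference(ReadPreference.secondary()) // route reads to secondaries when possible
            .build();
        try (MongoClient client = MongoClients.create(settings)) {
            System.out.println(client.getDatabase("test").runCommand(new Document("ping", 1)));
        }
    }
}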

MongoDB - how to best achieve active/active configuration?

I have an application which is very low on writes. I'm therefore interested in deploying a mongo installation which maximizes the read throughput for the hardware I have (3 database servers in one location). I don't really care for redundancy (backups), but would like automatic failover. Additionally, I'm fine with "eventual consistency", and don't mind if data which isn't the latest data is returned.
I've looked into both sharding and replica sets, and as far as I can tell I don't really need sharding, as its benefits are more suited to applications with many writes.
I therefore went ahead and installed a replica set on the three servers I have, and then set the read preference to "nearest", as that would allow reads to take place on any server.
The problem is, I later read that the client is "sticky": once it has chosen a "nearest" mongo server, it's not likely to change it. Besides, even if it were to check for the nearest server again, it would probably choose the same one. This pretty much results in an active/passive configuration without any load balancing. I do have two application servers, so if they choose different mongo servers it might work OK, but say I wanted more than 3 mongo servers in the replica set; then any servers besides a specific two would be passive.
Basically my question is, what's the best way to have an active/active configuration for my deployment? All I want is for requests to go to free mongo servers rather than busy ones.
One way I thought of to force this is to create three sharded clusters (each server participating in all three), where each server is the primary in one of these clusters - but this is still not optimal, because besides the relative complexity involved in this configuration, it doesn't guarantee complete load balancing (for example, if all requests at a given moment happen to go to one specific shard).
What's the right way to achieve what I want? If it's not possible to achieve this kind of load balancing with mongo, would you recommend that I go with the sharded-clusters solution?
As you already suspected, scaling reads is not a "one size fits all" problem. Everything will depend on your data, your access patterns, your requirements and probably a few other things only you can determine.
In a nutshell, the main thing to consider is why a single server can't handle your read load. If it's because of the size of your data set and the size of your indexes then sharding your data across three shards will reduce the RAM requirements of each of them (or to put it another way will give you the combined RAM of all three systems). As long as you pick a good shard key (one that will distribute the load approximately evenly across all the systems) you will get almost three times the throughput on targeted queries.
If the main requirement for your reads is to reduce the latency of reading the data as much as possible, then a replica set can serve your purposes well, since reading from the "nearest" node reduces the network round-trip time without changing the duration of the operation on the MongoDB server. This assumes that your writes are infrequent enough or that your application has tolerance for possibly stale data.
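If stale reads are acceptable, one way to spread reads across members is the nearest read preference combined with a wider latency window. Here is a sketch with the current MongoDB Java driver (hypothetical host names; localThresholdMS widens the window the driver uses when choosing the "nearest" member, which counters the stickiness described above):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class NearestReads {
    public static void main(String[] args) {
        // readPreference=nearest lets the driver pick any member whose ping time is
        // within localThresholdMS of the fastest member, so reads can spread across
        // several servers instead of always hitting the same one.
        try (MongoClient client = MongoClients.create(
                "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"
                + "&readPreference=nearest&localThresholdMS=50")) {
            System.out.println(client.getDatabase("app").runCommand(new Document("ping", 1)));
        }
    }
}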

In Mongo what is the difference between sharding and replication?

Replication seems to be a lot simpler than sharding, unless I am missing the benefits of what sharding is actually trying to achieve. Don't they both provide horizontal scaling?
In the context of scaling MongoDB:
replication creates additional copies of the data and allows for automatic failover to another node. Replication may help with horizontal scaling of reads if you are OK with reading data that potentially isn't the latest.
sharding allows for horizontal scaling of data writes by partitioning data across multiple servers using a shard key. It's important to choose a good shard key. For example, a poor choice of shard key could lead to "hot spots" of data only being written on a single shard.
A sharded environment does add more complexity because MongoDB now has to manage distributing data and requests between shards -- additional configuration and routing processes are added to manage those aspects.
Replication and sharding are typically combined to create a sharded cluster, where each shard is supported by a replica set.
From a client application point of view, you also have some control over the replication/sharding interaction, in particular (see the sketch after this list):
Read preferences
Write concerns
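A rough sketch of both settings with the current MongoDB Java driver (the connection string and namespace are hypothetical):

import com.mongodb.ReadPreference;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ReadWriteTuning {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://host1,host2,host3/?replicaSet=rs0")) {
            MongoCollection<Document> events = client.getDatabase("app")
                .getCollection("events")
                .withReadPreference(ReadPreference.secondaryPreferred()) // reads may go to secondaries
                .withWriteConcern(WriteConcern.MAJORITY);                // writes acknowledged by a majority
            events.insertOne(new Document("type", "click"));
            System.out.println(events.countDocuments());
        }
    }
}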
Suppose you have a great music collection on your hard disk; you store the music in logical order, based on year of release, in different folders.
You are concerned that your collection will be lost if drive fails.
So you get a new disk and occasionally copy the entire collection keeping the same folder structure.
Sharding >> Keeping your music files in different folders
Replication >> Syncing your collection to other drives
Replication is a mostly traditional master/slave setup: data is synced to backup members, and if the primary fails, one of them can take its place. It is a reasonably simple tool, primarily meant for redundancy, although you can scale reads by adding replica set members. That's a little complicated, but it works very well for some apps.
Sharding sits on top of replication, usually. "Shards" in MongoDB are just replica sets with something called a "router" in front of them. Your application will connect to the router, issue queries, and it will decide which replica set (shard) to forward things on to. It's significantly more complex than a single replica set because you have the router and config servers to deal with (these keep track of what data is stored where).
If you want to scale Mongo horizontally, you'd shard. 10gen likes to call the router/config server setup auto-sharding. It's also possible to do a cruder form of sharding where you have the app decide which DB to write to.
Sharding
Sharding is a technique for splitting up a large collection among multiple servers. When we shard, we deploy multiple mongod servers, with mongos, a router, in front of them. The application talks to this router, and the router talks to the various mongod servers. The application and the mongos are usually co-located on the same server, and we can have multiple mongos processes running on the same machine. It is also recommended to run a set of multiple mongods (together called a replica set) instead of one single mongod on each server. A replica set keeps the data in sync across several different instances, so that if one of them goes down we won't lose any data. Logically, each replica set can be seen as a shard. Sharding is transparent to the application; the way MongoDB decides where data goes is via a shard key that we choose.
Assume that for a student collection we have stdt_id as the shard key (it could also be a compound key). The mongos server uses a range-based system, so based on the stdt_id we send as the shard key, it will route the request to the right mongod instance.
So, what do we really need to know as developers? (A short sketch follows this list.)
an insert must include the shard key, so if it's a multi-part shard key, we must include the entire shard key
we have to know what the shard key is on the collection itself
for an update, remove, or find - if mongos is not given a shard key - it will have to broadcast the request to all the shards that hold the collection
for an update - if we don't specify the entire shard key, we have to make it a multi-update so that mongos knows it needs to broadcast it
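Here is a short sketch of those rules with the MongoDB Java driver, assuming a hypothetical mongos host and a students collection sharded on stdt_id as above:

import static com.mongodb.client.model.Filters.eq;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ShardKeyUsage {
    public static void main(String[] args) {
        // Connect to the mongos router, not to an individual shard.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoCollection<Document> students =
                client.getDatabase("school").getCollection("students");

            // An insert must carry the full shard key.
            students.insertOne(new Document("stdt_id", 1001).append("name", "Ada"));

            // The filter includes the shard key, so mongos can target a single shard.
            System.out.println(students.find(eq("stdt_id", 1001)).first());

            // No shard key in the filter: mongos broadcasts this to every shard.
            System.out.println(students.countDocuments(eq("name", "Ada")));

            // Without the full shard key, make it a multi-update so it can be broadcast.
            students.updateMany(eq("name", "Ada"), Updates.set("year", 2));
        }
    }
}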
Whenever you're thinking about sharding or replication, you need to think in the context of write/update operations. If you don't need to scale writes, then replication, being fairly simple, is a good choice for you.
On the other hand, if your workload is mostly updates/writes, then at some point you'll hit a write bottleneck: when a write request comes in, Mongo blocks other write requests, and those requests block until the first one is done. If you want to scale writes and parallelize them, then you need to implement sharding.
Just to put this somewhere...
The most basic way to run mongo is as a standalone server:
you write a config (file or CLI options)
you start the server using mongod
A replica set is a set of servers initialized exactly as above, but with a different config file.
To link them, we connect to one of them and initialize replica-set mode.
They will mirror each other (in the most common configuration). This system guarantees high availability of data.
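A minimal sketch of that initialization step, assuming three hypothetical hosts and using the replSetInitiate admin command (which the shell helper rs.initiate() wraps), shown here with the MongoDB Java driver:

import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class InitReplicaSet {
    public static void main(String[] args) {
        // Connect to a single member (hypothetical host) before the set exists.
        try (MongoClient client = MongoClients.create("mongodb://host1:27017")) {
            Document config = new Document("_id", "rs0")
                .append("members", Arrays.asList(
                    new Document("_id", 0).append("host", "host1:27017"),
                    new Document("_id", 1).append("host", "host2:27017"),
                    new Document("_id", 2).append("host", "host3:27017")));
            // Equivalent of rs.initiate(config) in the mongo shell.
            client.getDatabase("admin").runCommand(new Document("replSetInitiate", config));
        }
    }
}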
Sharding is not about replicating data, but about fragmenting data.
Each fragment of data is called a chunk and goes to a different shard (shard = a replica set).
There is also a "main" server running mongos instead of mongod; this is a router for queries from the client.
Obvious: The trade-off is a more complex architecture.
Novelty: the configuration server (again, a different config file).
There is much more to add, but this covers the essentials.
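As an illustrative sketch (hypothetical host names, reusing the stdt_id example from the earlier answer), wiring the pieces together from the router comes down to a few admin commands, which the shell helpers sh.addShard(), sh.enableSharding(), and sh.shardCollection() wrap:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class WireCluster {
    public static void main(String[] args) {
        // Connect to the router (mongos), not to an individual shard.
        try (MongoClient router = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoDatabase admin = router.getDatabase("admin");
            // Register a shard, identified by its replica-set name and members.
            admin.runCommand(new Document("addShard", "rs0/host1:27017,host2:27017,host3:27017"));
            // Enable sharding for a database, then shard a collection on a key.
            admin.runCommand(new Document("enableSharding", "school"));
            admin.runCommand(new Document("shardCollection", "school.students")
                .append("key", new Document("stdt_id", 1)));
        }
    }
}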
Even MongoDB recommends studying your case carefully before sharding. Vertical scaling is probably a good idea at least once before horizontal scaling.
Vertical scaling is done by upgrading hardware (CPU, RAM, etc.); horizontal scaling needs more computers (but they can be cheap computers).
Both replication and sharding can be used (individually or together) for horizontal scaling of a MongoDB installation.
Sharding is MongoDB's solution for meeting the demands of data growth. Sharding stores data records across multiple servers to provide faster throughput on read and write queries, particularly for very large data sets.
Each shard in the cluster responds to read and write operations for its own portion of the data, which greatly speeds up query responses.
Replication is MongoDB's solution for providing stability, backup, and disaster recovery to a MongoDB installation. This process copies and synchronizes the replica data set across multiple servers. This prevents downtime if one server goes offline.
Any of the secondary servers can respond to read queries, but only the primary server will perform write operations. The results of the write operation will then be propagated out to the secondary servers.
Scenario 1: Fault-Tolerance
In this scenario, the user is storing billing data in a MongoDB installation. This data is mission-critical to the user's business, and needs to be available 24/7, even if a server crashes or is taken offline.
MongoDB replication is the best solution for this user. With replication, the entire data set is mirrored on multiple servers. If a server fails or is taken offline, the other servers in the cluster take over.
Scenario 2: High Performance
In this scenario, the user is running a social networking site which is run from a MongoDB database. As the social network grows, the MongoDB data set has grown along with it. The user is seeing query times and page loads increase beyond an acceptable point. It is critical that the user's MongoDB installation receives a major performance boost.
Setting up a sharded MongoDB cluster is the best solution for this user. The sharded cluster will break up the user's data set and store parts of it on separate shards. Each shard can respond to read or write queries on its portion of the data, which greatly improves the installation's responsiveness.
MongoDB Atlas is a database-as-a-service in the cloud. It supports the three major cloud providers: Azure, AWS, and GCP. In a cloud environment, we usually talk about high availability and scalability. In Atlas, "clusters" can be either replica sets or sharded clusters.
These two address the high-availability and scalability needs of a cloud environment.
In general, a cluster is a group of servers used to achieve a specific task. Sharded clusters are used to store data across multiple machines to meet the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data or provide acceptable read and write throughput. Sharded clusters support the horizontal scalability of the underlying cloud environment.
A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments. In a replica set, one node is the primary, which receives all write operations. All other instances (the secondaries) apply operations from the primary so that they have the same data set. Replica sets mainly focus on the availability of data.
Please check the documentation
Thank You.

MongoS Call Distributions Analysis

I would like to see how good my shard key is, and I am thinking of monitoring how many calls go to each shard from the mongos for every 100 parallel batch inserts that I do. I can probably do this at the application layer, but is there a way to record this at the mongos level?
I am using mongostat, but I want the details of mongos. Also, the mongos log does not say much from what I gather.
Do you have trending graphs or some other form of monitoring software? 10gen actually provides a free one called MMS.
If you are monitoring the activity on your shards, that should correlate to the calls being made from mongos. The only caveat here is that activity is not broken out by collection or by database. So if you're sharding multiple DBs on the same instances this may not work.
Otherwise, just look at the activity on the shards and that should clearly tell you what's happening.
If you use mongostat --discover, you can see the traffic per shard as well as the total traffic going through the mongos that mongostat is connected to. This should give you full insight into your load distribution in real time.
Note that your shard key always "works", because MongoDB splits chunks at the median of your shard key rather than simply splitting the data in two; provided your shard key has high enough cardinality, your data will always be well balanced (assuming the balancer has had time to move chunks appropriately).
In addition to MMS and mongostat, you can also see the overall status and health of your sharded cluster via the printShardingStatus() function, as outlined here.
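For example (the mongos host name is hypothetical):
mongostat --discover --host mongos-host:27017
and, from a mongo shell connected to the same mongos:
db.printShardingStatus()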