I've noticed that when deploying a stateless service to a cluster, the maximum number of instances cannot exceed the number of physical nodes. Two questions about that:
Is that a hard limit by design (i.e. not just a dev cluster issue)?
What happens to that service's processing ability when half the cluster is down?
No, the number of instances cannot exceed the number of nodes in the cluster; there is a placement constraint that ensures that. You can modify placement constraints for your cluster, but I don't think you can set it to ignore the ReplicaExclusionStatic constraint completely. It's also worth noting that in the default setting it is treated as a warning, i.e. if you have instance count = 7 and your cluster has 5 nodes, you will get a warning that it is unable to place 2 instances.
In Service Fabric there are both instances and partitions. Most commonly, instances are associated with Stateless services and partitions with Stateful ones, but that doesn't mean you cannot have partitions of Stateless services (though it might not make sense in most cases).
If you wish to have more actual instances of your service running, you can change the partition count in order to create further partitions on each node. The default for a Stateless service is a SingletonPartition and an instance count of -1, which puts one instance on each available node:
In ApplicationManifest.xml:
<Service Name="MyService">
  <StatelessService ServiceTypeName="MyServiceType" InstanceCount="[MyService_InstanceCount]">
    <SingletonPartition />
  </StatelessService>
</Service>
And in your ApplicationParameters/Cloud.xml:
<Parameters>
  <Parameter Name="MyService_InstanceCount" Value="-1" />
</Parameters>
On a 5 node cluster you will get 5 instances running, nothing strange there. If you change to instance count = 6 then you will get the placement warning (and still no more than 5 actual instances running).
Now, if you want to have more instances (or replicas) running on each node you could modify the partition count of your Stateless service:
<Service Name="MyService">
  <StatelessService ServiceTypeName="MyServiceType" InstanceCount="[MyService_InstanceCount]">
    <UniformInt64Partition PartitionCount="4" LowKey="0" HighKey="4" />
  </StatelessService>
</Service>
On the same cluster and with the same instance count you will instead get 20 instances (5 nodes x 4 partitions) of your service running.
What complicates things now is that you always have to include a partition key when you talk to your Stateless service. It might not make sense in the same way it does for a Stateful service, since there is no expected state, but your choice of partition keys should be evenly distributed to spread the load.
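A common way to pick an evenly distributed partition key is to hash whatever identifier you already route on and map it into the key range declared in the manifest. A minimal sketch in plain Java (no Service Fabric client APIs; the identifier and the 0..4 key range are only illustrative):

public class PartitionKeys {

    // Maps an arbitrary identifier onto a key in [lowKey, highKey],
    // matching the UniformInt64Partition range declared in the manifest above.
    static long partitionKeyFor(String id, long lowKey, long highKey) {
        long range = highKey - lowKey + 1;
        long hash = Integer.toUnsignedLong(id.hashCode()); // non-negative
        return lowKey + (hash % range);
    }

    public static void main(String[] args) {
        // With LowKey="0" and HighKey="4" as above, keys are spread over 0..4.
        System.out.println(partitionKeyFor("order-42", 0, 4));
    }
}

As long as the identifiers you hash are themselves well spread, the resulting keys (and therefore the load) will be spread evenly over the partitions.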
When you then scale your cluster you will always get the same number of replicas per partition, but the number of partitions per node may differ if you set the instance count to a specific number instead of -1.
This article discusses partitioning, instances and replicas in more detail.
https://blogs.msdn.microsoft.com/mvpawardprogram/2015/10/13/understanding-service-fabric-partitions/
I've been experimenting with Vert.x's high availability features to test horizontal scalability and resiliency. I have a cluster of several nodes based on Hazelcast. I'm creating verticles on any node via an HTTP API. Verticles have the HA flag set when they are created.
Testing scalability
If I have n nodes loaded with HA verticles and I add one additional node, no verticles are migrated from the existing nodes onto the new one so that the load would be balanced. Is there a way to tell Vert.x to do so, or not? I believe it's not so simple...
Testing resilience
If I have n nodes loaded with HA verticles and I kill one of the nodes, all the verticles from that node are migrated, but they are all migrated onto a single one of the remaining nodes, which is not always the least loaded one. That destination node may become overloaded and the whole cluster would be at risk of freezing or crashing. Same question as before: is there a way to force Vert.x to balance the restarted verticles across all nodes, or at least onto the least loaded node?
Your observations are correct; there is no way:
to distribute verticles from a failed node over the rest of the nodes
to prevent starting verticles in a node that is already loaded
Improving the HA features is not on the Vert.x roadmap.
If, as it seems, you need more than basic failover, I would recommend using specialized infrastructure tools that can leverage info from monitoring systems and start/stop new nodes as needed.
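For reference, this is roughly how a verticle ends up with the HA flag in a clustered Vert.x instance; a minimal sketch assuming the Vert.x 3.x callback API, a Hazelcast cluster manager on the classpath, and a placeholder verticle class name:

import io.vertx.core.DeploymentOptions;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

public class HaDeploy {
    public static void main(String[] args) {
        // HA only works in clustered mode with a cluster manager (e.g. vertx-hazelcast).
        VertxOptions options = new VertxOptions().setHAEnabled(true);

        Vertx.clusteredVertx(options, res -> {
            if (res.succeeded()) {
                Vertx vertx = res.result();
                // The HA flag marks the deployment for failover: if this node dies,
                // another node in the cluster redeploys the verticle. It does not
                // trigger any rebalancing when nodes are added.
                vertx.deployVerticle("com.example.MyVerticle",
                        new DeploymentOptions().setHa(true));
            } else {
                res.cause().printStackTrace();
            }
        });
    }
}

Failover simply picks a surviving node and redeploys there; as noted above, there is no built-in logic to spread the redeployed verticles by load.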
I am learning Kubernetes by following the official documentation, and in the Creating Highly Available clusters with kubeadm part it is recommended to use 3 masters and 3 workers as the minimum required to set up an HA cluster.
This recommendation is given with no explanation of the reasons behind it. In other words, why is a configuration of 2 masters and 2 workers not OK for HA?
You want an uneven number of master-eligible nodes so you can form a proper quorum (two out of three, three out of five). The total number of nodes doesn't actually matter. Smaller installations often make the same nodes master-eligible and data-holding at the same time, so in that case you'd prefer an uneven number of nodes. Once you move to a setup with dedicated master-eligible nodes, you're freed from that restriction. You could also run 4 nodes with a quorum of 3, but that makes the cluster unavailable if any two nodes die. The worst setup is 2 nodes, since you can only safely run with a quorum of 2, so if a node dies you're unavailable.
(This was an answer from here which I think is a good explanation)
This is exactly why:
https://etcd.io/docs/v3.3/faq/#why-an-odd-number-of-cluster-members
Then check the concept of quorum; you can find plenty of info, especially in the Pacemaker/Corosync documentation.
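To make the arithmetic concrete, here is a small sketch (plain Java, no Kubernetes or etcd APIs) that computes the quorum size and the number of node failures tolerated for a few cluster sizes:

public class QuorumMath {
    public static void main(String[] args) {
        for (int nodes = 2; nodes <= 6; nodes++) {
            int quorum = nodes / 2 + 1;     // majority needed to stay available
            int tolerated = nodes - quorum; // failures the cluster can survive
            System.out.printf("%d nodes -> quorum %d, tolerates %d failure(s)%n",
                    nodes, quorum, tolerated);
        }
    }
}

Running it shows why 2 masters buy you nothing over 1 (quorum of 2, zero tolerated failures) and why 4 buy you nothing over 3 (both tolerate exactly one failure).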
Service Fabric offers the capability to rebalance partitions whenever a node is removed from or added to the cluster. The Service Fabric Cluster Resource Manager will move one or more partitions to this node so more work can be done.
Imagine a reliable actor service which has thousands of actors running who are distributed across multiple partitions. If the Resource Manager decides to move one or more partitions, will this cause any downtime? Or does rebalancing partitions work the same as upgrading a service?
They act pretty much the same way. The main difference I can point out is that upgrades affect only the services being upgraded, while rebalancing might affect multiple services at once. During an upgrade, the cluster might also rebalance services to fit the new service instances onto nodes.
I would compare adding or removing nodes more to node failures. In either case, services are rebalanced because the cluster capacity changed, not because the service metrics/load changed.
The main difference between a node failure and cluster scaling (adding/removing a node) is that rebalancing takes the service state into account during the process: when an infrastructure notification comes in saying that a node is being shut down (for updates, maintenance, or scaling down), Service Fabric asks the infrastructure to wait so it can prepare for this announced 'failure', and then starts rebalancing the services.
Even though rebalancing cares about service state on a scale-down, it should not be considered more reliable than a node failure, because the infrastructure will only wait so long before shutting down the node (the limit depends on the reliability tier you defined for your cluster) while Service Fabric checks that the services meet health conditions, shutting down old instances, creating new ones, and checking that they run without errors. If this process takes too long, the services might be killed once the timeout is reached and the infrastructure proceeds with the changes. Also, the new instances of the services might fail on the new nodes, forcing the services to move again.
When you design your services, it is safer to treat rebalancing as a node failure, because in the end it is not much different: your services will move around, data stored only in memory will be lost if not persisted, the service address will change, and so on. Services should replicate their data, and clients should always use retry logic and re-resolve the service location to reduce downtime.
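As an illustration of that client-side pattern (retry plus re-resolve), here is a minimal sketch in plain Java; the Resolver and Call interfaces are hypothetical stand-ins for whatever naming/resolution and transport mechanism you actually use:

import java.net.URI;

public class RetryingClient {

    // Hypothetical helpers: Resolver asks the naming service for the current
    // address of the service/partition; Call performs the actual request.
    interface Resolver { URI resolveEndpoint() throws Exception; }
    interface Call<T> { T invoke(URI endpoint) throws Exception; }

    static <T> T callWithRetry(Resolver resolver, Call<T> call, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                // Re-resolve on every attempt: after a rebalance or failover the
                // replica may have moved to another node, so cached addresses go stale.
                return call.invoke(resolver.resolveEndpoint());
            } catch (Exception e) {
                last = e;
                Thread.sleep(500L * (attempt + 1)); // simple linear backoff
            }
        }
        throw last;
    }
}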
The main difference between a service upgrade and service rebalancing is that during an upgrade, all replicas from all partitions on a particular node get turned off. According to the documentation, balancing is done on a per-replica basis, i.e. only some replicas from some partitions get moved, so there shouldn't be any outage.
According to https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#sc_zkMulitServerSetup
Cross Machine Requirements: For the ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority. For this reason, ZooKeeper deployments are usually made up of an odd number of machines.
To achieve the highest probability of tolerating a failure you should try to make machine failures independent. For example, if most of the machines share the same switch, failure of that switch could cause a correlated failure and bring down the service. The same holds true of shared power circuits, cooling systems, etc.
My question is:
What should we do after we have identified a node failure within the ZooKeeper cluster to make the cluster 2F+1 again? Do we need to restart all the ZooKeeper nodes? Also, the clients connect to the ZooKeeper cluster; suppose we used DNS names and the recovered node uses the same DNS name.
For example:
10.51.22.89 zookeeper1
10.51.22.126 zookeeper2
10.51.23.216 zookeeper3
if 10.51.22.89 dies and we bring up 10.51.22.90 as zookeeper1, can all the nodes identify this change?
If you bring up 10.51.22.90 as zookeeper1 (with the same myid file and configuration as 10.51.22.89 had before) and the data dir is empty, the process will connect to the current leader (zookeeper2 or zookeeper3) and copy a snapshot of the data. After successful initialization the node informs the rest of the cluster and you have 2F+1 again.
Try this yourself with tail -f on the log files. It won't hurt the cluster and you will learn a lot about ZooKeeper internals ;-)
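On the client side nothing needs to change as long as the replacement machine keeps the same DNS name, since clients address the ensemble through the connect string. A minimal sketch with the plain ZooKeeper Java client (the session timeout and the missing error handling are just placeholders):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The connect string lists all ensemble members by DNS name, so swapping
        // the machine behind zookeeper1 is transparent to the client.
        ZooKeeper zk = new ZooKeeper(
                "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
                15000, // session timeout in ms
                (WatchedEvent event) -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });

        connected.await();
        System.out.println("Connected, session id: " + zk.getSessionId());
        zk.close();
    }
}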
I wonder about the best strategy with regard to ZooKeeper and SolrCloud clusters. Should one ZooKeeper cluster be dedicated to each SolrCloud cluster, or can multiple SolrCloud clusters share one ZooKeeper cluster? I guess the former must be a very safe approach, but I am wondering if the second option is fine as well.
As far as I know, SolrCloud uses ZooKeeper to share cluster state (which nodes are up or down) and to load shared core configurations (solrconfig.xml, schema.xml, etc.) on boot. If you have clients based on SolrJ's CloudSolrServer implementation, they will mostly perform reads of the cluster state.
In this respect, I think it should be fine to share the same ZK ensemble. Many reads and few writes is exactly what ZK is designed for.
SolrCloud puts very little load on a ZooKeeper cluster, so if it's purely a performance consideration then there's no problem. It would probably be a waste of resources to have one ZK cluster per SolrCloud if they're all on a local network. Just make sure each SolrCloud uses its own ZooKeeper path (chroot). For example, using -zkHost host:port/path1 for one SolrCloud and replacing "path1" with "path2" for the second one will put the Solr files in separate paths within ZooKeeper to ensure they don't conflict.
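On the client side the same chroot shows up in the SolrJ connection. A minimal sketch assuming SolrJ 7+ (where CloudSolrClient.Builder accepts a chroot) and placeholder ZooKeeper host names zk1/zk2/zk3:

import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class SeparateChroots {
    public static void main(String[] args) throws Exception {
        // One ZK ensemble serves both SolrClouds; only the chroot differs.
        List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");

        try (CloudSolrClient cloudA =
                 new CloudSolrClient.Builder(zkHosts, Optional.of("/path1")).build();
             CloudSolrClient cloudB =
                 new CloudSolrClient.Builder(zkHosts, Optional.of("/path2")).build()) {
            // Each client only sees the znode tree under its own chroot.
            cloudA.connect();
            cloudB.connect();
        }
    }
}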
Note that the ZK cluster should be well configured and robust, because if it goes down then none of the SolrClouds will be able to respond to changes in node availability or state (e.g. if a SolrCloud leader is lost or unreachable, or a node enters the recovering state).