Cassandra racks & replication factor - NoSQL

I have 2 Cassandra DCs:
DC1
+-----+
|RAC1 |
+-----+
|node1|
+-----+
|node2|
+-----+
|node3|
+-----+
|node4|
+-----+
DC2
+-----+-----+-----+
|RAC1 |RAC2 |RAC3 |
+-----+-----+-----+
|node1|node1|node1|
+-----+-----+-----+
|node2|node2|node2|
+-----+-----+-----+
Can I use RF=3 in DC2, or must the number of nodes per rack be higher than the RF?

Based on the documentation, I think the rule you are referring to is this:
As a general rule, the replication factor should not exceed the number
of nodes in the cluster.
Your replication factor exceeds the number of nodes in each rack, but I think that's ok. Are you using NetworkTopologyStrategy? The same doc also indicates that:
NetworkTopologyStrategy places replicas in the same data center by
walking the ring clockwise until reaching the first node in another
rack. NetworkTopologyStrategy attempts to place replicas on distinct
racks because nodes in the same rack (or similar physical grouping)
often fail at the same time due to power, cooling, or network issues.
So if you are using NetworkTopologyStrategy, then I think your replication factor of 3 for DC2 should work just fine.
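If it helps, here is a minimal sketch of how such a keyspace could be created with NetworkTopologyStrategy, issued from Scala via the DataStax Java driver 4.x; the keyspace name, contact point, and per-DC replica counts are illustrative, not taken from your setup:

import java.net.InetSocketAddress
import com.datastax.oss.driver.api.core.CqlSession

// Hypothetical contact point and local DC; adjust to your cluster.
val session = CqlSession.builder()
  .addContactPoint(new InetSocketAddress("10.0.0.1", 9042))
  .withLocalDatacenter("DC1")
  .build()

// RF=3 per data center; NetworkTopologyStrategy will spread the DC2 replicas
// across its three racks (RAC1, RAC2, RAC3).
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS my_keyspace
    |WITH replication = {
    |  'class': 'NetworkTopologyStrategy',
    |  'DC1': 3,
    |  'DC2': 3
    |};""".stripMargin)

session.close()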

Related

workload balance and partition relationship

In the Azure cloud, I have an Apache Spark pool with 3 nodes, scalable to 10 nodes.
My query is taking a long time to run, but the nodes are not getting scaled up. It always uses only 3 nodes.
Is there anything I have to change in my query?

Difference in Spark SQL Shuffle partitions

I am trying to understand Spark SQL shuffle partitions, which is set to 200 by default.
The data looks like this, followed by the number of partitions created for the two cases.
scala> flightData2015.show(3)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Romania| 15|
| United States| Croatia| 1|
| United States| Ireland| 344|
+-----------------+-------------------+-----+
scala> println(flightData2015.sort("DEST_COUNTRY_NAME").rdd.getNumPartitions)
104
scala> println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)
200
Both cases cause a shuffle stage, which should result in 200 partitions (the default value). Can someone explain why there is a difference?
According to the Spark documentation, there are two ways of repartitioning the data. One is via the configuration spark.sql.shuffle.partitions, which defaults to 200 and is always applied when you run any join or aggregation, as you can see here.
When we are talking about sort(), it is not that simple: Spark uses a planner to identify how skewed the data is across the dataset. If it is not too skewed, then instead of using a sort-merge join that would result in 200 partitions as you expected, it prefers to broadcast the data across the partitions, avoiding a full shuffle. This can save time during the sort by reducing the amount of network traffic; more details here.
The difference between these two situations is that sort and groupBy use different partitioners under the hood.
groupBy - uses hash partitioning, which means that it computes the hash of the key and then takes pmod by 200 (or whatever the number of shuffle partitions is set to), so it will always create 200 partitions (even though some of them may be empty)
sort/orderBy - uses range partitioning, which means that it runs a separate job to sample the data and, based on that, creates the boundaries for the partitions (trying to make 200 of them). Based on the sampled data distribution and the actual row count, it may create fewer than 200 boundaries, which is why you got only 104.
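To see both behaviors side by side, you can lower the setting and reuse the flightData2015 DataFrame from the question. This is just an illustrative sketch; the exact sort() partition count depends on the sampled range boundaries, and adaptive query execution (if enabled) can change the numbers as well.

// Reuse flightData2015 from the question in spark-shell.
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Hash partitioning: exactly the configured number of partitions (some may be empty);
// with adaptive query execution enabled, Spark may coalesce them afterwards.
println(flightData2015.groupBy("DEST_COUNTRY_NAME").count().rdd.getNumPartitions)  // 50

// Range partitioning: a sampling job picks the boundaries, so the count can come out lower.
println(flightData2015.sort("DEST_COUNTRY_NAME").rdd.getNumPartitions)             // <= 50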

Kubernetes: why would you need more than 2 nodes?

Given a K8s cluster (a managed cluster, for example AKS) with 2 worker nodes, I've read that if one node fails, all the pods will be restarted on the second node.
Why would you need more than 2 worker nodes per cluster in this scenario? You can always select the number of nodes you want, and the more you select, the more expensive it is.
It depends on the solution that you are deploying in the Kubernetes cluster and the nature of high availability that you want to achieve.
If you want to work in an active-standby mode, where the pods are moved to other nodes if one node fails, two nodes work fine (as long as the single surviving node has the capacity to run all the pods).
Some databases / stateful applications, for instance, need a minimum of three replicas, so that you can reconcile if there is a mismatch/conflict in the data due to a network partition (i.e. you can pick the content held by two of the replicas).
etcd, for instance, needs 3 replicas.
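To illustrate the quorum arithmetic behind that rule of thumb, here is a small, self-contained Scala sketch; it is just the standard majority-quorum math, not tied to any particular product:

// A majority quorum needs floor(n/2) + 1 replicas to stay reachable.
def quorum(replicas: Int): Int = replicas / 2 + 1
def toleratedFailures(replicas: Int): Int = replicas - quorum(replicas)

Seq(1, 2, 3, 5).foreach { n =>
  println(s"$n replica(s): quorum = ${quorum(n)}, tolerates ${toleratedFailures(n)} failure(s)")
}
// 2 replicas tolerate no failures at all, which is why quorum-based stores
// like etcd are typically run with 3 (or 5) replicas.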
If whatever you are building needs only two nodes, then you don't need more than 2. If you are building anything bigger, where the amount of compute and memory needed is much higher, then instead of opting for expensive nodes with huge CPU and RAM, you can join more and more lower-priced nodes to the cluster. This is called horizontal scaling.

Kafka cluster: not all replicas are coming up as ISR

We have a multi-node Kafka 1.0.1 cluster (we tried an earlier version as well). When we create this cluster we set the replication factor to n-1, so with 7 nodes we use a replication factor of 6, but not all replicas become ISRs. This happens for some topics, while for others all ISRs are created. If we recreate the multi-node cluster 2-3 times, eventually all replicas become ISRs. Has anyone seen a similar issue? Any help on this is greatly appreciated.
Your replication factor seems to be quite high. Why do you set it to n-1? The replication factor should be independent of the cluster size. For most use cases a replication factor of 3 is sufficient. There are also use cases with stronger demands that use a replication factor of 5; a higher replication factor than that would be very rare. If you don't have special needs, 3 should be sufficient.
The larger the replication factor, the more time it takes to replicate the data: if you continuously write to the leader, the followers need some time to copy the data, and the network bandwidth for copying is obviously limited. Thus, with a larger replication factor, each individual follower has less bandwidth for copying the data and might fall behind, and thus never become an ISR.
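As an illustration, the replication factor is set per topic at creation time and does not have to track the broker count. Here is a sketch using the Kafka AdminClient from Scala; the bootstrap server, topic name, and partition count are placeholders:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

// Placeholder broker address.
val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")

val admin = AdminClient.create(props)
// Fixed replication factor of 3 instead of n-1, regardless of cluster size.
val topic = new NewTopic("events", 12, 3.toShort)  // name, partitions, replication factor
admin.createTopics(Collections.singleton(topic)).all().get()
admin.close()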

Issue about workload balance in Flink streaming

I have a WordCount program running on a 4-worker-node Flink cluster which reads data from a Kafka topic.
This topic contains a lot of pre-loaded text (words). The words in the topic follow a Zipf distribution. The topic has 16 partitions, and each partition holds around 700 MB of data.
There is one node which is much slower than the others. As you can see in the picture, worker2 is the slower node, but the slower node is not always worker2; from my tests, worker3 or other nodes in the cluster can also be the slow one.
There is always such a slow worker node in the cluster, though. Each worker node has 4 task slots, so there are 16 task slots in total.
After some time, the records sent to the other worker nodes (all except the slower node) stop increasing, while the records sent to the slower node keep increasing up to the same level as the others, and at a much faster rate.
Can anyone explain why this situation occurs? Also, what am I doing wrong in my setup?
Here is the throughput (count by words at the Keyed Reduce -> Sink stage) of the cluster.
From this picture we can see that the throughput of the slower node, node2, is much higher than that of the others. This means that node2 received more records from the first stage. I think this is because of the Zipf distribution of the words in the topic: the words with very high frequency are mapped to node2.
When a node spends more compute resources on the Keyed Reduce -> Sink stage, its speed of reading data from Kafka decreases. Once all the data in the partitions corresponding to node1, node3 and node4 has been processed, the throughput of the cluster drops.
As your data follows a Zipf distribution, this behavior is expected. Some workers just receive more data due to the imbalance in the distribution itself. You would observe this behavior in other systems, too.
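To make the effect concrete, here is a small, self-contained Scala sketch (plain Scala, not Flink code) that hash-partitions a Zipf-like word distribution onto 16 slots, mimicking how keyBy spreads keys over task slots; all numbers are illustrative:

// Simulate a Zipf-like word frequency distribution (frequency ~ 1/rank)
// and assign each word to one of 16 slots by hashing, as a keyBy would.
val slots = 16
val words = (1 to 1000).map(rank => (s"word$rank", 1000.0 / rank))

val loadPerSlot = words
  .groupBy { case (word, _) => math.abs(word.hashCode) % slots }
  .map { case (slot, ws) => slot -> ws.map(_._2).sum }
  .toSeq
  .sortBy { case (_, load) => -load }

loadPerSlot.foreach { case (slot, load) => println(f"slot $slot%2d receives ~$load%8.1f records") }
// Whichever slot happens to receive "word1" (the most frequent key) carries
// far more load than the rest, which is exactly the hot worker you observe.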