What happens when a ZooKeeper sequential node runs out of numbers - apache-zookeeper

From the Java docs, we know:
"actual path name of a sequential node will be the given path plus a suffix "i" where i is the current sequential number of the node. The sequence number is always fixed length of 10 digits, 0 padded. Once such a node is created, the sequential number will be incremented by one"
I am wondering what happens when the sequential number overflows.
For example, if we create a sequential node mysequential9999999999, what is the next number?
Does an error pop up?
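For reference, a minimal sketch of creating such a node with the Java client from Scala (the connection string, timeout, and path are just for illustration):

import org.apache.zookeeper.{CreateMode, WatchedEvent, ZooDefs, ZooKeeper}

// Each create with a SEQUENTIAL mode returns the actual path, with the
// 10-digit, zero-padded counter appended by the server.
val zk = new ZooKeeper("localhost:2181", 30000, (_: WatchedEvent) => ())

val actualPath = zk.create(
  "/mysequential",
  Array.emptyByteArray,
  ZooDefs.Ids.OPEN_ACL_UNSAFE,
  CreateMode.PERSISTENT_SEQUENTIAL)

println(actualPath) // e.g. /mysequential0000000000 on the first call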

Related

How to use ZooKeeper to distribute work across a cluster of servers

I'm studying up for system design interviews and have run into this pattern in several different problems. Imagine I have a large volume of work that needs to be repeatedly processed at some cadence. For example, I have a large number of alert configurations that need to be checked every 5 min to see if the alert threshold has been breached.
The general approach is to split the work across a cluster of servers for scalability and fault tolerance. Each server would work as follows:
start up
read assigned shard
while true:
    process the assigned shard
    sleep 5 min
Based on this answer (Zookeeper for assigning shard indexes), I came up with the following approach using ZooKeeper:
When a server starts up, it adds itself as a child under the node /service/{server-id} and watches the children of the node. ZooKeeper assigns a unique sequence number to the server.
Server reads its unique sequence number i from ZooKeeper. It also reads the total number of children n under the /service node.
Server identifies its shard by dividing the total volume of work into n pieces and locating the ith piece.
While true:
If the watch triggers (because servers have been added to or removed from the cluster), server recalculates its shard.
Server processes its shard.
Sleep 5 min.
Does this sound reasonable? Is this generally the way that it is done in real world systems? A few questions:
In step #2, when the server reads the number of children, does it need to wait a period of time to let things settle down? What if every server is joining at the same time?
I'm not sure how timely the watch would be. Seems like there would be a time period where the server is still processing its shard and reassignment of shards might cause another server to pick up a shard that overlaps with what this server is processing, causing duplicate processing (which may or may not be ok). Is there any way to solve this?
Thanks!
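For concreteness, a rough sketch of steps 1 and 2 using the Java client from Scala (it assumes the parent znode /service already exists and an ensemble at localhost:2181; the "server-" prefix and the watcher body are illustrative only):

import org.apache.zookeeper.{CreateMode, WatchedEvent, ZooDefs, ZooKeeper}
import scala.jdk.CollectionConverters._

val zk = new ZooKeeper("localhost:2181", 30000, (_: WatchedEvent) => ())

// Step 1: register this server; the returned path carries its unique sequence number.
val myPath = zk.create("/service/server-", Array.emptyByteArray,
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)

// Step 2: read the current membership and set a watch that fires when it changes.
def readShard(): (Int, Int) = {
  val children = zk.getChildren("/service",
    (_: WatchedEvent) => { /* membership changed: call readShard() again */ }
  ).asScala.toList.sorted
  val i = children.indexOf(myPath.stripPrefix("/service/"))
  val n = children.size
  (i, n) // this server owns shard i of n
}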

Spark executors idle after groupBy operator

We are working with Spark Streaming.
Our DataFrame contains the following columns:
[unitID, source, avrobyte, schemeType]
The unitID values are [10, 76, 510, 269, 7, 0, 508, 509, 511, 507].
We run the following command:
val dfGrouped: KeyValueGroupedDataset[Int, Car] = dfSource.groupByKey(car1 => car1.unitID)
val afterLogic: Dataset[CarLogic] = dfGrouped.flatMapGroups {
  case (unitID: Int, messages: Iterator[Car]) => performeLogic(...)
}
We allocate 8 Spark executors.
In our Dataset we have 10 different units, so we have 10 different unitID values, and we expected the processing to be split across all the executors roughly equally. But when we look at executor performance via the UI, only 2 executors are working and all the others are idle for the duration of the job.
What are we doing wrong? Or how can we divide the job over all the executors more or less equally?
What you are seeing can be explained by the low cardinality of your key space. Spark uses a HashPartitioner (by default) to assign keys to partitions (by default 200 partitions). On a low cardinality key space this is rather problematic and requires careful attention as each collision has a massive impact. Even further, these partitions then have to be assigned to executors. At the end of this process it's not surprising to end up with a rather sub-optimal distribution of data.
You have a few options:
If applicable, attempt to increase the cardinality of your keys, e.g. by salting them (appending some randomness temporarily). That has the advantage that you can also better handle skew in the data (when the amount of data per keys is not equally distributed). In a following step you can then remove the random part again and combine the partial results.
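As a rough sketch of that salting idea, assuming the Dataset[Car] setup from the question, that spark.implicits._ is in scope, and that the per-group logic can be split into a partial step and a merge step (partialLogic and combineLogic are hypothetical names):

// Salt the key: (unitID, salt) has up to 10 * 8 = 80 distinct values instead of 10.
// A deterministic salt derived from the row avoids problems on task retries.
val salted = dfSource.groupByKey(car => (car.unitID, Math.floorMod(car.hashCode, 8)))

// First pass: compute partial results per salted key.
val partials = salted.mapGroups { case ((unitID, _), cars) => (unitID, partialLogic(cars)) }

// Second pass: drop the salt and merge the partial results per unitID.
val combined = partials.groupByKey(_._1)
  .mapGroups { case (unitID, rows) => combineLogic(unitID, rows.map(_._2)) }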
If you absolutely require a partition per key (and the key space is static and well-known), you should configure spark.sql.shuffle.partitions to match the cardinality n of your key space and assign each key a partition id in [0, n) ahead of time (to avoid collisions when hashing). Then you can use this partition id in your groupBy.
Just for completeness, using the RDD API you could provide you own custom partitioner that does the same as described above: rdd.partitionBy(n, customPartitioner)
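For illustration, a sketch of such a partitioner using the unitID values from the question (keyedRdd is an assumed RDD[(Int, Car)] keyed by unitID):

import org.apache.spark.Partitioner

// One partition per known key: each unitID is assigned its own partition id
// ahead of time, so there are no hash collisions between the 10 keys.
class UnitIdPartitioner(ids: Seq[Int]) extends Partitioner {
  private val idToPartition = ids.zipWithIndex.toMap
  override def numPartitions: Int = ids.size
  override def getPartition(key: Any): Int = idToPartition(key.asInstanceOf[Int])
}

val unitIds = Seq(10, 76, 510, 269, 7, 0, 508, 509, 511, 507)
val partitioned = keyedRdd.partitionBy(new UnitIdPartitioner(unitIds))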
Though, one final word: Even following one of the latter two options above, using 8 executors for 10 keys (equals 10 non-empty partitions) is a poor choice. If your data is equally distributed, you will still end up with 2 executors doing double the work. If your data is skewed things might even be worse (or you are accidentally lucky) - in any case, it's out of your control.
So it's best to make sure that the number of partitions can be equally distributed among your executors.

Fixed size window for Apache Beam

How do I define a window of fixed size (a fixed number of items) in Apache Beam?
I know that we have
FixedWindows.of(Duration.standardMinutes(10))
but I do not care about time, only about the number of items.
More details:
I am writing a significant amount of data (53 GB) to S3. Currently my process uses
FileIO.<KV<...>>writeDynamic()
.by(kv -> kv.getKey())
(grouping by key). This causes a severe performance bottleneck because of the skewed key distribution. My total data size is 53 GB, but the data for one key is 37 GB. This single key takes an hour to write (the writing happens on a single executor, single thread, while the rest of the cluster sits idle).
I do not need any special grouping. Ideally I want a uniform distribution of the data, so the writing happens concurrently and finishes as soon as possible.
Guaranteeing exactly equal sized grouping is fairly hard, but you can get pretty close by using hashes of your data modulo some constant as the keys. For example:
FileIO.<KV<...>>writeDynamic()
.by(kv -> kv.hashCode() % 530)
This will give roughly equal 100MB partitions.
Additionally, if you are using the DataflowRunner, you don't need to specify keys at all; the system will automatically group up the data, and dynamically rebalance the load to avoid stragglers. For this, use FileIO.write() instead of FileIO.writeDynamic().

How can Kafka reads be constant irrespective of the data size?

As per the documentation of Kafka
the data structure used in Kafka to store the messages is a simple log where all writes are actually just appends to the log.
What I don't understand here is that many claim Kafka's performance is constant irrespective of the data size it handles.
How can random reads be constant time in a linear data structure?
If I have a single-partition topic with 1 billion messages in it, how can the time taken to retrieve the first message be the same as the time taken to retrieve the last message, if the reads are always sequential?
In Kafka, the log for each partition is not a single file. It is actually split in segments of fixed size.
For each segment, Kafka knows the start and end offsets. So for a random read, it's easy to find the correct segment.
Then each segment has a couple of indexes (time-based and offset-based). Those are the files named *.index and *.timeindex. These files enable jumping directly to a location near (or at) the desired read.
So you can see that the total number of segments (also total size of the log) does not really impact the read logic.
Note also that the size of segments, the size of indexes and the index interval are all configurable settings (even at the topic level).
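Purely as an illustration of that lookup (this is not Kafka's actual code, and the segment layout below is made up), the segment selection step can be pictured like this:

// Each segment file is named after its base (starting) offset.
final case class Segment(baseOffset: Long, logFile: String)

val segments = Vector(
  Segment(0L,       "00000000000000000000.log"),
  Segment(500000L,  "00000000000000500000.log"),
  Segment(1000000L, "00000000000001000000.log"))

// Pick the last segment whose base offset is <= the requested offset; since the
// segments are sorted by base offset, this can also be done with a binary search.
def segmentFor(offset: Long): Segment =
  segments.takeWhile(_.baseOffset <= offset).last

segmentFor(750123L) // -> Segment(500000, "00000000000000500000.log")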

Do I need to use coalesce before saving my RDD data to a file

Imagine I have an RDD with 100 records and I partition it into 10 partitions, so each partition now holds 10 records. I convert the RDD to a key-value pair RDD and save it to a file, and my output data is divided into 10 partitions, which is fine for me. But is it best practice to use the coalesce function before saving the output data to a file? For example, rdd.coalesce(1) gives just one file as output; doesn't it shuffle data between nodes? I want to know where coalesce should be used.
Thanks
Avoid coalesce if you don't need it. Only use it to reduce the number of files generated.
As with anything, it depends on your use case; coalesce() can be used to either increase or decrease the number of partitions, but there is a cost associated with it.
If you are attempting to increase the number of partitions (in which the shuffle parameter must be set to true), you will incur the cost of redistributing data through a HashPartitioner. If you are attempting to decrease the number of partitions, the shuffle parameter can be set to false but the number of nodes actively grabbing from the current set of partitions will be the number of partitions you are coalescing to. For example, if you are coalescing to 1 partition, only 1 node will be active in pulling data from the parent partitions (this can be dangerous if you are coalescing a large amount of data).
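As a small sketch of the two cases just described (rdd is assumed to be an existing RDD with, say, 10 partitions):

// Decrease without a shuffle: each output partition just pulls whole parent
// partitions; with 1 output partition, a single task reads everything.
val single = rdd.coalesce(1)

// Increase (or rebalance) with a shuffle: data is redistributed across
// partitions via hashing, at the cost of a full shuffle.
val spread = rdd.coalesce(20, shuffle = true)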
Coalescing can be useful though as sometimes you can make your job run more efficiently by decreasing your partition set size (e.g. after a filter or a sparse inner join).
You can simply use it like this:
rdd.coalesce(numberOfPartitions)
It doesn't shuffle data if you decrease the number of partitions, but it does shuffle data if you increase them; which one you want depends on the use case. Be careful with it, though: if you decrease the number of partitions to fewer than the number of cores in your cluster, you can't use your cluster's full resources. On the other hand, decreasing the number of partitions to match the number of cores can reduce shuffle data and network IO and so improve the performance of your system.