Legal Hierarchical Quorums in Zookeeper - apache-zookeeper

I am trying to understand hierarchical quorums in ZooKeeper, but I may be misreading the example shown in the documentation (here). Are votes [from at least two servers from each of two different groups] enough to form a legal quorum?
In my opinion, the example does not gain a majority of the total weight; it gains only 4 ballots, whereas a simple majority quorum over 9 servers would need at least 5 ballots (9/2 + 1 = 5).
I also read the source code. The algorithm implementation is shown from line 352 to line 371: ZooKeeper only checks, for each group, whether the received votes form a majority of that group's weight, and then whether the number of such groups is larger than half of the total number of groups.

Maybe I have found the answer.
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4.
Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.
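To check my reading of that rule, here is a minimal sketch of the check in Python (my own illustration, not ZooKeeper's code; the dictionary and function names are made up):

    def contains_quorum(votes, groups, weights):
        # A majority of groups must each contribute more than half of
        # their own group's total weight.
        groups_with_majority = 0
        for group_id, members in groups.items():
            group_total = sum(weights[s] for s in members)
            voted = sum(weights[s] for s in members if s in votes)
            if 2 * voted > group_total:
                groups_with_majority += 1
        return 2 * groups_with_majority > len(groups)

    # 9 servers, 3 groups of 3, all weights 1: two servers from each of
    # two groups (4 votes in total) already form a legal quorum.
    groups = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9]}
    weights = {s: 1 for s in range(1, 10)}
    print(contains_quorum({1, 2, 4, 5}, groups, weights))   # True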

Related

Consistent hashing, why are Vnodes a thing?

My understanding of consistent hashing is that you take a key space, hash the key and then mod by say 360, and place the values in a ring. Then you equally space nodes on that ring. You pick the node to handle this key by looking clockwise from where your hashed key landed.
Then in many explanations they go on to describe Vnodes. In the Riak docs, which refer to the Dynamo paper, they say:
The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution.
Then they go on to propose Vnodes as a way to ensure uniform distribution of the input key space around the ring. The gist, as I understand it, is that Vnodes divide up the ranges many more times than you have machines. So if you have 10 machines you might have 100 Vnodes, and an individual machine's Vnodes would be scattered randomly around the ring.
Now my question is why this extra Vnode step is required. Hash functions are supposed to provide a uniform distribution of their output, so it would seem this is unneeded. According to this answer, even the modulo of a hash function is still uniformly distributed.
Imo, the missing piece of key information with most explanations of consistent hashing is that they don't detail the part about "multiple hash functions."
For whatever reason, most "consistent hashing for dummies" articles gloss over the implementation detail that makes the virtual nodes work with random distribution.
Before talking about how it does work, let me clarify the above with an example of how it does not work.
How it does not work
A naive implementation of vnodes looks like this:
[figure omitted: vnodes placed around the ring in a fixed repeating order, so each physical node's vnodes always appear in the same sequence]
This is naive because you'll notice that, for example, the green vnode always precedes the blue vnode. This means that if the green vnode goes offline then it will be replaced solely by the blue vnode, which defeats the entire purpose of moving from single-token nodes to distributed virtual nodes.
The article briefly mentions that, in practice, Vnodes are randomly distributed across the cluster, and shows a separate picture indicating this, but it does not explain the mechanics of how that is achieved.
How it does work
Random distribution of vnodes is achieved via the use of multiple, unique hash functions. These multiple functions are where the random distribution comes from.
This makes the implementation look roughly like this:
A) Ring Formation
You have a ring consisting of n physical nodes via physical_nodes = ['192.168.1.1', '192.168.1.2', '192.168.1.3', '192.168.1.4']; (think of this as B/R/P/G in the prior picture's left-side)
You decide to distribute each physical node into k "virtual slices," i.e. a single physical node is sliced into k pieces
In this example, we use k = 4, but in practice we should use k ≈ log2(num_items) to obtain reasonably balanced loads for storing a total of num_items in the entire datastore
This means that num_virtual_nodes == n * k; (this corresponds to the 16 pieces in the prior picture's right-side)
Assign a unique hashing algo to each of the k slices via hash_funcs = [md5, sha, crc, etc]
(You can also use a single function that is recursively called k times)
Divvy up the ring as follows:

    virtual_physical_map = {}   # vnode hash -> owning physical node
    virtual_node_ids = []       # vnode positions on the ring

    for hash_func in hash_funcs:              # one pass per slice k
        for physical_node in physical_nodes:
            virtual_hash = hash_func(physical_node)
            virtual_node_ids.append(virtual_hash)
            virtual_physical_map[virtual_hash] = physical_node

    virtual_node_ids.sort()                   # clockwise order around the ring
You now have a ring composed of n * k virtual nodes, which are randomly distributed across the n physical nodes.
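As an aside, one hedged way to build the hash_funcs list above from a single digest, in the spirit of the parenthetical note about reusing one function, is to salt it with the slice index (the helper below is my own, not from the article; integer digests also keep the ring positions comparable for the binary search later):

    import hashlib

    def make_hash_funcs(k):
        # k distinct hash functions, derived by salting one digest with the slice index
        def make(i):
            return lambda s: int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
        return [make(i) for i in range(k)]

    hash_funcs = make_hash_funcs(4)   # k = 4, as in the example above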
B) Partition Request Flow
A partition-request is made with a provided key_tuple to key off of
The key_tuple is hashed to get key_hash
Find the next clockwise node via virtual_node = binary_search(key_hash, virtual_node_ids)
Lookup the real node via physical_id = virtual_physical_map[virtual_node]
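A minimal sketch of that request flow, assuming the key hash and the vnode ids live in the same comparable space (e.g. the integer digests sketched above) and wrapping around past the end of the ring:

    import bisect, hashlib

    def lookup(key_tuple, virtual_node_ids, virtual_physical_map):
        # Route a key to the physical node that owns the next clockwise vnode.
        key_hash = int(hashlib.md5(repr(key_tuple).encode()).hexdigest(), 16)
        i = bisect.bisect_right(virtual_node_ids, key_hash)          # next vnode clockwise
        virtual_node = virtual_node_ids[i % len(virtual_node_ids)]   # wrap past the last vnode
        return virtual_physical_map[virtual_node]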
Page 6 of this Stanford Lecture was very helpful to me in understanding this.
The net effect is that the distribution of vnodes across the ring looks like this:
[figure omitted: vnodes from different physical nodes interleaved roughly uniformly around the ring]
First, the random position assignment of each node on the ring leads to non-uniform data and load distribution.
Good hash functions provide uniform distribution, but the inputs also have to be sufficiently numerous for the outputs to appear spread out. The keys are numerous, but the servers aren't. So a million keys that are hashed and modulo'd by 360 will be evenly distributed around the ring, but if you only use, say, 3 servers S1 through S3 to hold the key-value pairs, there is no guarantee that they will be hashed (with the same hash function used for the keys) uniformly onto the ring at positions 0, 120 and 240. S1 might hash to 10, S2 to 12 and S3 to 50, so S2 will hold far fewer KV pairs than the other two. By having virtual servers, you increase the chances of them being hashed uniformly around the ring.
The other benefit is the even re-distribution of keys when a server is added or removed as mentioned in the doc.
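A quick toy simulation (my own, not from the answer) makes the difference visible: hash a handful of server ids onto a ring and compare the share of the ring each server owns with and without vnodes:

    import hashlib

    RING = 2**32   # large ring so toy collisions are negligible

    def pos(name):
        return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING

    def arc_fractions(positions):
        # Fraction of the ring (i.e. of the keys) owned by each position.
        pts = sorted(positions)
        return [((p - prev) % RING) / RING for p, prev in zip(pts, [pts[-1]] + pts[:-1])]

    servers = ['S1', 'S2', 'S3']
    print(arc_fractions([pos(s) for s in servers]))   # typically very uneven for only 3 points

    # 32 vnodes per server: each server's total share evens out considerably
    vnode_owner = {pos(f"{s}#{i}"): s for s in servers for i in range(32)}
    share = {s: 0.0 for s in servers}
    for p, frac in zip(sorted(vnode_owner), arc_fractions(vnode_owner)):
        share[vnode_owner[p]] += frac
    print(share)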

Determine number of clusters for different datasets

I performed a clustering analysis of the media usage of different users in order to find different groups that use a specific set of media (e.g. group 1 uses media A, B and C and group 2 uses media B, C and D). Then I divided the dataset into different groups, since each user belongs to a specific group (as a consequence the original dataset and the new datasets have different sizes). Within these groups I would like to cluster again to see which different media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!
Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups ABCD of media usage. You should first evaluate if this hypothesis is adequate. Now what you want to do in your next step is to assign the labels to subsets of the data. First of all, you should be able to simply subset this from the previous labels. But if this really is different data, you can label new data, for example using the most similar record (nearest neighbor classification). But that is classification now, because your classes are fixed.
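A minimal sketch of that last step (labeling new records by their most similar already-labeled record), using scikit-learn as an assumed toolkit and toy data in place of your media-usage features:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_old = rng.random((100, 4))               # users you already clustered (toy data)
    labels_old = rng.integers(0, 3, size=100)  # the cluster labels you assigned to them
    X_new = rng.random((20, 4))                # new / subsetted users to label consistently

    clf = KNeighborsClassifier(n_neighbors=1)  # "most similar record"
    clf.fit(X_old, labels_old)
    labels_new = clf.predict(X_new)            # classification now: the classes are fixed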

rapidminer: cluster performance operators: what do the different values mean?

I have to check performance of various clustering algos using different performance operators in rapidminer. For that I want to know the following things:
What does the cluster number index value (the output of the cluster count performance operator) show?
What do small and large values of avg. within cluster distance and avg. within centroid distance mean in terms of good and bad clustering?
I also want to check other index values like the Dunn index, Jaccard index and Fowlkes–Mallows index for various clustering algos, but RapidMiner doesn't have an operator for these. What should I do about that? I don't have experience with R.
I have copied part of the answer I gave on the Rapid-I forum
The cluster number index is the count of clusters - pointless, you might say, but when used with DBSCAN it can be quite interesting: http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
The avg within cluster and centroid distances are hard to interpret in isolation - one thing to search for is "elbow criterion" in this context. As the number of clusters varies, note how the validity measure changes and look for an "elbow": the point beyond which adding more clusters yields little further improvement in the measure.
R has many validity measures and it's worth investing some time because you can always call the R process from RapidMiner which makes it easier to work out what is going on.
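As an illustration of that elbow search (a sketch using scikit-learn rather than a RapidMiner operator, with within-cluster sum of squares as the validity measure and toy data standing in for yours):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.random((300, 5))

    for k in range(2, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)   # plot k against inertia and look for the elbow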

Mahout K-means has different behavior based on the number of mapping tasks

I experience a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10 MB, ~10000 vectors).
When K-means is executed with a single mapper (the default considering the Hadoop split size which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test if there would be any improvement/deterioration in the algorithm's execution speed by firing more mapping tasks (the Hadoop cluster has in total 6 nodes).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make mahout fire 2 mapping tasks (Case B).
I indeed succeeded in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-map execution. What I mean is that after close inspection of the clusterDump for the first iteration in both cases, I found that in case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means actually does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split onto multiple machines, there seems to be an error in the way they compute their means. Ouch. This will amplify if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
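To make that offset concrete, here is a toy re-derivation of the bias described above (my own arithmetic, not Mahout code):

    # n points split across t partial clusters, each of whose S1 starts at the
    # previous mean m_prev instead of the zero vector.  After merging:
    #   S1 = sum(points) + t * m_prev
    #   S1 / S0 = mean(points) + (t / n) * m_prev
    n, t, m_prev = 100, 4, 10.0
    points = [2.0] * n                                          # true mean is 2.0
    partials = [sum(points[i::t]) + m_prev for i in range(t)]   # each partial S1 seeded with m_prev
    center = sum(partials) / n
    print(center)                                               # 2.4 == 2.0 + (t / n) * m_prev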
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online kmeans, while this is Lloyd style batch iterations). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few CPU instructions more, no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
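For reference, here is the incremental (Welford-style) update the answer recommends, keeping the true mean plus a sum of squared deviations instead of raw S1/S2, as a minimal sketch:

    def update(mean, m2, count, x):
        # Welford-style incremental update of the mean and the sum of squared deviations.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # uses the updated mean
        return mean, m2, count

    mean, m2, count = 0.0, 0.0, 0
    for x in [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]:
        mean, m2, count = update(mean, m2, count, x)
    print(mean - 1e9, m2 / count)  # 10.0 and 22.5, computed without a huge raw sum of squares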

How to design a distributed system for "finding something within X miles"?

Question:
Design a distributed system to respond to clients' queries of the form "find something within X miles".
If X is infinite, return all the "something" in the world (assuming they are all stored in your database).
You can think about two approaches:
when the number of potential results is small and the number of queries is big: divide the coordinate space between the available machines and send each query only to the machines responsible for areas that intersect the X-mile circle (a sketch of this approach follows below)
when the number of potential results is big: store objects dispersed, so that they are uniformly distributed across all machines (you can choose the machine by randomization or by the object's origin, it depends), post every query to all machines, and merge the received results.
Further changes depend on getting more information about the problem nature.
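For the first approach, here is a minimal sketch of how a coordinator might pick which machines to fan a radius query out to (the grid-cell scheme, cell size and machine count are my own illustration, and the miles-per-degree conversion is deliberately rough):

    import math

    CELL_DEG = 1.0                      # partition the coordinate space into 1-degree grid cells
    MACHINES = 16                       # assumed number of shard servers

    def cell_of(lat, lon):
        return (int(lat // CELL_DEG), int(lon // CELL_DEG))

    def machine_for(cell):
        return hash(cell) % MACHINES    # each cell is owned by exactly one machine

    def machines_for_query(lat, lon, x_miles):
        # Machines owning any cell that could intersect the X-mile circle.
        span = math.ceil(x_miles / 69.0 / CELL_DEG) + 1   # ~69 miles per degree of latitude
        cells = {cell_of(lat + dlat * CELL_DEG, lon + dlon * CELL_DEG)
                 for dlat in range(-span, span + 1)
                 for dlon in range(-span, span + 1)}
        return {machine_for(c) for c in cells}

    # The coordinator sends the query only to these machines and merges their results.
    print(machines_for_query(37.77, -122.42, x_miles=50))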