How can I understand "value" in Basic Paxos?

I am reading Lamport's Paxos Made Simple, and I am confused about the meaning of "value" here.
For example, Lamport says:
If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v
I don't know what value v means here:
Does it mean a different value of a certain variable, such as variable x's value being 1 or 42?
Or is it something like one log entry in Raft, such as x=1 or y=42?
I think the first interpretation is right, and that Basic Paxos can't determine multiple values: it just goes Propose-Accept-Chosen, and then the whole Basic Paxos instance has finished its mission.
However, I am not sure.

Your second interpretation is correct ("It's like one log entry in Raft").
You are also correct that Basic Paxos can't choose multiple values; it chooses exactly one, like a single log entry in Raft. To choose a series of values you need to chain multiple Basic Paxos instances together, as in Multi-Paxos or Raft.
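To make that concrete, here is a rough sketch (my own toy code, nothing from the paper; the whole propose/accept/chosen exchange is collapsed into a single method) of how a replicated log can be built by running one Basic Paxos instance per log slot:

// Hypothetical sketch: one Basic Paxos instance per log slot.
// Each instance chooses exactly one value; the "log" is just the map of chosen values.
final case class Command(text: String)            // the "value" being chosen, e.g. "x = 1"

class BasicPaxosInstance {
  private var chosen: Option[Command] = None      // stands in for the full propose/accept protocol

  def propose(value: Command): Command = {
    if (chosen.isEmpty) chosen = Some(value)      // once chosen, every later proposal yields the same value
    chosen.get
  }
}

class ReplicatedLog {
  private val slots = scala.collection.mutable.Map.empty[Long, BasicPaxosInstance]

  // Multi-Paxos / Raft-style usage: slot i gets its own single-value consensus instance.
  def append(slot: Long, value: Command): Command =
    slots.getOrElseUpdate(slot, new BasicPaxosInstance).propose(value)
}

Each instance still settles on exactly one value; the "log" is nothing more than the collection of those single chosen values, which is essentially what Multi-Paxos and Raft give you.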

Related

Clarify "the order of execution for the subtractor and adder is not defined"

The Streams DSL documentation includes a caveat about using the aggregate method to transform a KGroupedTable → KTable, as follows (emphasis mine):
When subsequent non-null values are received for a key (e.g., UPDATE), then (1) the subtractor is called with the old value as stored in the table and (2) the adder is called with the new value of the input record that was just received. The order of execution for the subtractor and adder is not defined.
My interpretation of that last line is that one of three things can happen:
subtractor can be called before adder
adder can be called before subtractor
adder and subtractor could be called at the same time
Here is the question I'm looking to get answered:
Are all 3 scenarios above actually possible when using the aggregate method on a KGroupedTable?
Or am I misinterpreting the documentation? For my use-case (detailed below), it would be ideal if the subtractor were always called before the adder.
Why is this question important?
If the adder and subtractor are non-commutative operations and the order in which they are executed can vary, you can end up with different results depending on which one runs first. An example of a useful non-commutative operation would be aggregating records into a Set:
.aggregate[Set[Animal]](Set.empty)(
adder = (zooKey, animalValue, setOfAnimals) => setOfAnimals + animalValue,
subtractor = (zooKey, animalValue, setOfAnimals) => setOfAnimals - animalValue
)
In this example, for duplicated events, if the adder is called before the subtractor you would end up removing the value entirely from the set (which would be problematic for most use-cases I imagine).
Why am I doubting the documentation (assuming my interpretation of it is correct)?
Seems like an unusual design choice
When I've run unit tests (using TopologyTestDriver and EmbeddedKafka), I always see the subtractor is called before the adder. Unfortunately, if there is some kind of race condition involved, it's entirely possible that I would never hit the other scenarios.
I did try looking into the kafka-streams codebase as well. The KTableProcessorSupplier that calls the user-supplied adder/subtractor functions appears to be this one: https://github.com/apache/kafka/blob/18547633697a29b690a8fb0c24e2f0289ecf8eeb/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableAggregate.java#L81 and on line 92 you can even see a comment saying "first try to remove the old value". Seems like this would answer my question definitively, right? Unfortunately, in my own testing, what I saw was that the process function itself is called twice: first with a Change<V> value that includes only the old value, and then again with a Change<V> value that includes only the new value. Unfortunately, I haven't been able to dig deep enough to find the internal code that generates the old-value record and the new-value record (upon receiving an update) to determine whether it actually produces those records in that order.
The order is hard-coded (i.e., no race condition), but there is no guarantee that the order won't change in future releases without notice (i.e., it's not a public contract, and no KIP is needed to change it). I guess there would be a Jira about it, though... But as a matter of fact, it does not really matter (details below).
Of the three scenarios you mentioned, the third one cannot happen though: aggregators are executed in a single thread (per shard), and thus either the adder or the subtractor is called first.
first with a Change value that includes only the old value and then the process function is called again with a Change value that includes only the new value.
In general, both records might be processed by different threads, and thus it's not possible to send only one record. It's just that the TopologyTestDriver simulates a single-threaded execution, so both records always end up in the same processor.
Cf. "TopologyTestDriver sending incorrect message on KTable aggregations".
However, the order actually only matters if both records really end up in the same processor (if the grouping key did not change during the upstream update).
Furthermore, the order actually depends not on the downstream aggregate implementation, but on the order of writes into the repartition topic of the groupBy(), and with multiple parallel upstream processors those writes are interleaved anyway. Thus, in general, you should think of the "add" and "subtract" parts as independent entities and not make any assumptions about their order (also, even if the key did not change, both records might be interleaved with other records...).
The only guarantee provided (given that you configured the producer correctly to avoid re-ordering during send()) is that, if the grouping key does not change, the sends of the old and new values will not be re-ordered relative to each other. The order of the sends is hard-coded in the upstream processor, though:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableRepartitionMap.java#L93-L99
Thus, the order of the downstream aggregate processor is actually meaningless.
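Given that, one defensive option (a sketch only, reusing the question's hypothetical Animal/zooKey names and the same Scala aggregate shape, not a tested topology) is to make the aggregation commutative, e.g. by keeping signed counts instead of a plain Set, so that the interleaving order cannot change the final result:

// Sketch: signed counts make adder and subtractor commute.
def bump(counts: Map[Animal, Int], animal: Animal, delta: Int): Map[Animal, Int] = {
  val n = counts.getOrElse(animal, 0) + delta
  if (n == 0) counts - animal else counts.updated(animal, n)  // counts may go negative transiently
}

.aggregate[Map[Animal, Int]](Map.empty)(
  adder = (zooKey, animalValue, counts) => bump(counts, animalValue, +1),
  subtractor = (zooKey, animalValue, counts) => bump(counts, animalValue, -1)
)

Because +1 and -1 commute, a duplicated event leaves the count at 1 instead of removing the value, regardless of which side is processed first.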

Converting binary to decimal within a case structure in LabVIEW 2018

I have two controls (buttons, specifically) that, when activated, act as one bit each.
So it basically means the highest number I can produce is 2 by having both buttons activated at the same time. EDIT: Okay, what I meant to say was that the highest output I'm going to be able to produce is two, because I only have 2 buttons, each representing a 1. So 1+1=2.
However, this is only understood logically, because the bits are yet to be converted to a numerical (decimal) format. I can use a 'Boolean to 0,1' converter directly to get the values, but I'm instructed to use a case structure to complete this.
Right now I'm completely perplexed, because a case structure needs exactly ONE case selector but I have TWO buttons. Secondly, this problem seems way too SIMPLE to require a case structure, which makes it genuinely harder to use the more complex method.
So it basically means the highest number I can produce is 2 by having both buttons activated at the same time.
A 2-bit number can have four values, 0...3, hmm?
In general, if the two booleans are bits of the number, or the number can somehow be calculated from the bools, do it.
But if the number can have predefined values which depend on the booleans but cannot be calculated from them, you need some other kind of case distinction. Maybe whoever instructed you had this in mind.
You can make a case structure for the first boolean and, in each case, insert a second case structure for the second boolean. This is good when there will be more complex code and logic depending on the booleans, so you can easily concentrate on one combination of values. For simple cases, this lacks overview, and when adding a third boolean, it's a lot of work.
Calculate an interim value and connect it to a single case structure. Now there is only one case structure, but you have no overview over all cases. Note I've changed the radix of the case structure to boolean, so you can see the bits in the selector.
Use a simple array to take a value from
Create a lookup table with predefined conditions and values
(Note that the first two solutions force you to implement each case, while the last two do not - what if your arrays are only of size 3?)
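LabVIEW is graphical, so there is no text code to paste, but the "interim value" and "lookup table" options above are easy to see in sketch form (a Scala-style sketch with invented output values, just to show the logic):

// The two boolean controls (invented values)
val button0 = true
val button1 = false

// "Interim value": treat the booleans as the bits of a 2-bit number, 0..3
val index = (if (button1) 2 else 0) + (if (button0) 1 else 0)

// "Lookup table": one predefined output per combination, here the number of pressed buttons
val lookup = Vector(0, 1, 1, 2)
val result = lookup(index)   // 1 for the values above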

How to implement deterministic single threaded network simulation

I read about how FoundationDB does its network testing/simulation here: http://www.slideshare.net/FoundationDB/deterministic-simulation-testing
I would like to implement something very similar, but cannot figure out how they actually implemented it. How would one go about writing, for example, a C++ class that does what they do? Is it possible to do the kind of simulation they do without doing any code generation (as they presumably do)?
Also: how can a simulation be repeated if it contains random events? Each run would have to choose new random values and thus would not be the same as the run before. Maybe I am missing something here... I hope somebody can shed a bit of light on the matter.
You can find a little bit more detail in the talk that went along with those slides here: https://www.youtube.com/watch?v=4fFDFbi3toc
As for the determinism question, you're right that a simulation cannot be repeated exactly unless all possible sources of randomness and other non-determinism are carefully controlled. To that end:
(1) Generate all random numbers from a PRNG that you seed with a known value.
(2) Avoid any sort of branching or conditionals based on facts about the world which you don't control (e.g. the time of day, the load on the machine, etc.), or if you can't help that, then pseudo-randomly simulate those things too.
(3) Ensure that whatever mechanism you pick for concurrency has a mode in which it can guarantee a deterministic execution order.
Since it's easy to mess all those things up, you'll also want to have a way of checking whether determinism has been violated.
All of this is covered in greater detail in the talk that I linked above.
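A minimal sketch of points (1) and (3) (all names are mine, nothing from FoundationDB's code): every bit of randomness comes from one seeded PRNG, and the "network" is a single-threaded event queue ordered by (time, sequence number), so ties break deterministically and the same seed replays the same schedule:

import scala.collection.mutable
import scala.util.Random

final case class Event(time: Double, seq: Long, action: () => Unit)

class Simulation(seed: Long) {
  val rng = new Random(seed)                          // the only source of randomness
  private var now = 0.0
  private var nextSeq = 0L
  private val queue = mutable.PriorityQueue.empty[Event](
    Ordering.by[Event, (Double, Long)](e => (e.time, e.seq)).reverse)   // earliest (time, seq) first

  def schedule(delay: Double)(action: => Unit): Unit = {
    queue.enqueue(Event(now + delay, nextSeq, () => action))
    nextSeq += 1
  }

  def run(): Unit =
    while (queue.nonEmpty) {
      val e = queue.dequeue()
      now = e.time
      e.action()
    }
}

// Usage: every "random" decision draws from sim.rng, so the same seed replays the same run.
//   val sim = new Simulation(seed = 42L)
//   sim.schedule(sim.rng.nextDouble() * 10) { println("message delivered") }
//   sim.run()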
In the sims I've built, the biggest issue with repeatability ends up being proper seed management (as per the previous answer). You want your simulations to give different results only when you supply a different seed to your random number generators than before.
After that, the biggest issue I've seen tends to be making sure you don't iterate over collections with nondeterministic ordering. For instance, in Java, you'd use a LinkedHashMap instead of a HashMap.
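The Scala equivalent of that advice, as a tiny illustration:

import scala.collection.mutable

// Insertion-ordered map: iteration order is deterministic and matches insertion order.
val peers = mutable.LinkedHashMap("c" -> 3, "a" -> 1, "b" -> 2)
peers.keys.foreach(println)   // always c, a, b

// mutable.HashMap gives no such guarantee, so simulation logic must never depend on its order.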

How to handle two signals depending on each other?

I read Deprecating the Observer Pattern with Scala.React and found reactive programming very interesting.
But there is a point I can't figure out: the author describes the signals as nodes in a DAG (directed acyclic graph). So what if you have two signals (or event sources, or models, whatever) depending on each other, i.e. the 'two-way binding', like a model and a view in web front-end programming?
Sometimes it's just inevitable, because the user can change the view, the back end (an asynchronous request, for example) can change the model, and you want the other side to reflect the change immediately.
Loop dependencies in a reactive programming language can be handled with a variety of semantics. The one that appears to have been chosen in scala.React is that of synchronous reactive languages, specifically that of Esterel. You can find a good explanation of these semantics and their alternatives in the paper "The Synchronous Languages 12 Years Later" by A. Benveniste, P. Caspi, S. A. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone, available at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1173191&tag=1 or http://virtualhost.cs.columbia.edu/~sedwards/papers/benveniste2003synchronous.pdf.
Replying to @Matt Carkci here, because a comment wouldn't suffice.
In the paper, section 7.1 Change Propagation, you have:
Our change propagation implementation uses a push-based approach based on a topologically ordered dependency graph. When a propagation turn starts, the propagator puts all nodes that have been invalidated since the last turn into a priority queue which is sorted according to the topological order, briefly level, of the nodes. The propagator dequeues the node on the lowest level and validates it, potentially changing its state and putting its dependent nodes, which are on greater levels, on the queue. The propagator repeats this step until the queue is empty, always keeping track of the current level, which becomes important for level mismatches below. For correctly ordered graphs, this process monotonically proceeds to greater levels, thus ensuring data consistency, i.e., the absence of glitches.
and later, in section 7.6 Level Mismatch:
We therefore need to prepare for an opaque node n to access another node that is on a higher topological level. Every node that is read from during n’s evaluation, first checks whether the current propagation level which is maintained by the propagator is greater than the node’s level. If it is, it proceeds as usual, otherwise it throws a level mismatch exception containing a reference to itself, which is caught only in the main propagation loop. The propagator then hoists n by first changing its level to a level above the node which threw the exception, reinserting n into the propagation queue (since its level has changed) for later evaluation in the same turn and then transitively hoisting all of n’s dependents.
While there's no mention of any topological constraint (cyclic vs. acyclic), something is not clear (at least to me).
First, there's the question of how the topological order is defined.
And then the implementation suggests that mutually dependent nodes would loop forever in the evaluation, through the exception mechanism explained above.
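For concreteness, this is how I read the propagation loop of 7.1, as a rough sketch (all names are mine, and the level-mismatch hoisting of 7.6 is left out):

import scala.collection.mutable

final class Node(val level: Int, val validate: () => Unit) {
  var dependents: List[Node] = Nil   // wired up after construction
}

def propagate(invalidated: Seq[Node]): Unit = {
  // lowest topological level first (PriorityQueue is a max-heap, hence the reversed ordering)
  val queue = mutable.PriorityQueue.empty[Node](Ordering.by[Node, Int](_.level).reverse)
  queue.enqueue(invalidated: _*)
  val done = mutable.Set.empty[Node]
  while (queue.nonEmpty) {
    val node = queue.dequeue()
    if (done.add(node)) {
      node.validate()                       // may change the node's state
      queue.enqueue(node.dependents: _*)    // dependents sit on strictly greater levels
    }
  }
}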
What do you think?
After scanning the paper, I can't find where they mention that it must be acyclic. There's nothing stopping you from creating cyclic graphs in dataflow/reactive programming. Acyclic graphs only allow you to create Pipeline Dataflow (e.g. Unix command line pipes).
Feedback and cycles are a very powerful mechanism in dataflow. Without them you are restricted to the types of programs you can create. Take a look at Flow-Based Programming - Loop-Type Networks.
Edit after second post by pagoda_5b
One statement in the paper made me take notice...
For correctly ordered graphs, this process monotonically proceeds to greater levels, thus ensuring data consistency, i.e., the absence of glitches.
To me that says that loops are not allowed within the Scala.React framework. A cycle between two nodes would seem to cause the system to continually try to raise the level of both nodes forever.
But that doesn't mean that you have to encode the loops within their framework. It could be possible to have one path from the item you want to observe and then another, separate, path back to the GUI.
To me, it always seems that too much emphasis is placed on a programming system completing and giving one answer. Loops make it difficult to determine when to terminate. Libraries that use the term "reactive" tend to subscribe to this thought process. But that is just a result of the von Neumann architecture of computers... a focus on solving an equation and returning the answer. Libraries that shy away from loops seem to be worried about program termination.
Dataflow doesn't require a program to have one right answer or ever terminate. The answer is the answer at this moment of time due to the inputs at this moment. Feedback and loops are expected if not required. A dataflow system is basically just a big loop that constantly passes data between nodes. To terminate it, you just stop it.
Dataflow doesn't have to be so complicated. It is just a very different way to think about programming. I suggest you look at J. Paul Morrison's book "Flow-Based Programming" for a field-tested version of dataflow, or my book (once it's done).
Check your MVC knowledge. The view doesn't update the model, so it won't send signals to it. The controller updates the model. For a C/F converter, you would have two controllers (one for the F control, one for the C control). Both controllers would send signals to a single model (which stores the only real temperature, Kelvin, in a lossless format). The model sends signals to two separate views (one for the C view, one for the F view). No cycles.
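A sketch of that shape (the class and method names are mine, purely for illustration):

// Acyclic MVC shape: controllers -> model -> views, never the other way around.
class TemperatureModel {
  private var kelvin: Double = 273.15
  var listeners: List[Double => Unit] = Nil                      // the views subscribe here

  def set(k: Double): Unit = { kelvin = k; listeners.foreach(_(kelvin)) }
}

class CelsiusController(model: TemperatureModel) {
  def userEntered(c: Double): Unit = model.set(c + 273.15)       // controller -> model only
}

class FahrenheitController(model: TemperatureModel) {
  def userEntered(f: Double): Unit = model.set((f - 32) * 5 / 9 + 273.15)
}

// Views only read: model -> view, so the graph stays acyclic.
val model = new TemperatureModel
model.listeners = List(
  k => println(f"C view: ${k - 273.15}%.1f"),
  k => println(f"F view: ${(k - 273.15) * 9 / 5 + 32}%.1f")
)
new CelsiusController(model).userEntered(100)   // both views print the update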
Based on the answer from @pagoda_5b, I'd say that you are likely allowed to have cycles (7.6 should handle it, at the cost of performance), but you must guarantee that there is no infinite regress. For example, you could have the controllers also receive signals from the model, as long as you guaranteed that receipt of said signal never caused a signal to be sent back to the model.
I think the above is a good description, but it uses the word "signal" in a non-FRP style. "Signals" in the above are really messages. If the description in 7.1 is correct and complete, loops in the signal graph would always cause infinite regress as processing the dependents of a node would cause the node to be processed and vice-versa, ad inf.
As @Matt Carkci said, there are FRP frameworks that allow loops, at least to a limited extent. They will either not be push-based, use non-strictness in interesting ways, enforce monotonicity, or introduce "artificial" delays so that when the signal graph is expanded along the temporal dimension (turning it into a value graph) the cycles disappear.
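To illustrate the last option, here is a generic sketch (not scala.React's API) of a model/view "cycle" broken by a one-step delay; unrolled over time, there is no cycle left in the value graph:

// view(t) depends on model(t); model(t+1) depends on view(t). No cycle per time step.
final case class Model(value: Int, lastRendered: String)

def view(m: Model): String = s"value = ${m.value}"

def nextModel(previousView: String, userInput: Option[Int], previous: Model): Model =
  Model(userInput.getOrElse(previous.value), previousView)

val inputs = List(None, Some(3), None, Some(7))
val states = inputs.scanLeft(Model(0, "")) { (m, in) => nextModel(view(m), in, m) }
states.map(view).foreach(println)   // value = 0, 0, 3, 3, 7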

A simple/practical example of the fuzzy c-means algorithm

I am writing my master thesis on the subject of dynamic keystroke authentication. To support ongoing research, I am writing code to test out different methods of feature extraction and feature matching.
My current simple approach just checks whether the reference password keycodes match the currently typed-in keycodes, and also checks whether the keypress times (dwell) and the key-to-key times (flight) are the same as the reference times +/- 100 ms (tolerance). This is of course very limited, and I want to extend it with some sort of fuzzy c-means pattern matching.
For each key the features look like: keycode, dwelltime, flighttime (first flighttime is always 0).
Obviously the keycodes can be taken out of the fuzzy algorithm because they have to be exactly the same.
In this context, what would a practical implementation of fuzzy c-means look like?
Generally, you would do the following:
Determine how many clusters you want (2? "Authentic" and "Fake"?)
Determine what elements you want to cluster (individual keystrokes? login attempts?)
Determine what your feature vectors will look like (dwell time, flight time?)
Determine what distance metric you will be using (how will you measure the distance of each sample from each cluster?)
Create exemplar training data for each cluster type (what does an authentic login look like?)
Run the FCM algorithm on the training data to generate the clusters
To create the membership vector for any given login attempt sample, run it through the FCM algorithm using the clusters you found in step 6
Use the resulting membership vector to determine (based on some threshold criteria) whether the login attempt is authentic
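To make the last few steps concrete, here is a minimal, generic FCM sketch (plain Scala, no library; Euclidean distance, fuzzifier m = 2, and the dwell/flight numbers are invented rather than real keystroke data). It computes cluster centers from training samples and then the membership vector for a new login attempt:

import scala.util.Random

type Vec = Vector[Double]

def dist(a: Vec, b: Vec): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// u(i)(j): membership of sample j in cluster i, given the current centers.
def memberships(data: Vector[Vec], centers: Vector[Vec], m: Double): Vector[Vector[Double]] =
  centers.map { ci =>
    data.map { xj =>
      val dij = math.max(dist(xj, ci), 1e-9)
      1.0 / centers.map(ck => math.pow(dij / math.max(dist(xj, ck), 1e-9), 2 / (m - 1))).sum
    }
  }

// New center i = sum_j u(i)(j)^m * x_j / sum_j u(i)(j)^m.
def newCenters(data: Vector[Vec], u: Vector[Vector[Double]], m: Double): Vector[Vec] =
  u.map { ui =>
    val w = ui.map(math.pow(_, m))
    val weighted = data.zip(w).map { case (x, wi) => x.map(_ * wi) }
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    weighted.map(_ / w.sum)
  }

def fcm(data: Vector[Vec], c: Int, m: Double = 2.0, iters: Int = 50, seed: Long = 1): Vector[Vec] = {
  val rng = new Random(seed)
  var centers = Vector.fill(c)(data(rng.nextInt(data.size)))   // random initial centers from the data
  for (_ <- 1 to iters) centers = newCenters(data, memberships(data, centers, m), m)
  centers
}

// Invented training samples: (dwell, flight) features per attempt.
val training = Vector(Vector(95.0, 130.0), Vector(100.0, 120.0), Vector(300.0, 400.0), Vector(280.0, 390.0))
val centers = fcm(training, c = 2)

// Membership vector for one new attempt; thresholding it is the final step.
val sample = Vector(Vector(98.0, 125.0))
println(memberships(sample, centers, m = 2.0).map(_.head))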
I'm not an expert, but this seems like an odd approach to determining whether a login attempt is authentic or not. I've seen FCM used for pattern recognition (e.g. which facial expression am I making?), which makes sense because you're dealing with several categories (e.g. happy, sad, angry, etc.) with defining characteristics. In your case, you really only have one category (authentic) with defining characteristics. Non-authentic keystrokes are simply "not like" authentic keystrokes, so they won't cluster.
Perhaps I am missing something?
I don't think you really want to do clustering here. You might want to do some proper fuzzy matching though instead of just allowing some delta on each value.
For clustering, you need to have many data points. Additionally, you'd need to know the proper number of means you need.
But what are these multiple objects meant to be? You have one data point for every keycode. You don't want to have the user type the password 100 times to see if he can do it consistently. And even then, what do you expect the clusters to be? You already know which keycode comes at which position; you don't want to find out which keycodes the user uses for his password...
Sorry, I really don't see any clustering here. The term "fuzzy" seems to have misled you to this clustering algorithm. Try "fuzzy logic" instead.