nondeterminism.njoin: maxQueued and prefetching - scala

Why does the njoin prefetch the data before processing? It seems like an unnecessary complication, unless it has something to do with how Processes of Processes are merged?
I have a stream that runs effects whenever a new element is generated. I'd like to keep the effects to a minimum, so with an njoin of, say, maxOpen = 4, at most 4 elements should be generated at the same time (no element should be generated unless it can be processed immediately).
Is there a way to solve this gracefully with njoin? Right now I'm using a bounded queue of "tickets" (an element is generated only after it got a ticket).

See https://github.com/scalaz/scalaz-stream/issues/274, specifically the comment below from djspiewak.
"From a conceptual level, the problem here is the interface point between the "pull" model of Process and the "push" model that is required for any concurrent stream merging. Both wye and njoin sit at this boundary point and "cheat" by actively pulling on their source processes to fill an inbound queue, pushing results into an outbound queue pending the pull on the output process. (obviously, both wye and njoin make their inbound queues implicit via Actor) For the most part, this works extremely well and it preserves most of the properties that users care about (e.g. propagation of termination, back pressure, etc)."
The second parameter to njoin, maxQueued, bounds the amount of prefetching. If that parameter is 0, there is no limit on the queue size, and thus no limit on the prefetching. The docs for mergeN, which calls njoin, explain the reasoning for this prefetching behavior a bit more: "Internally mergeN keeps small buffer that reads ahead up to n values of A where n equals to number of active source streams. That does not mean that every source process is consulted in this read-ahead cache, it just tries to be as much fair as possible when processes provide their A on almost the same speed." So njoin is dealing with what happens when all the sources provide a value at nearly the same time, while trying to prevent any one of the joined streams from crowding out slower streams.
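For concreteness, here is a sketch of what bounding both parameters might look like (assuming the scalaz-stream 0.8.x API; the element-producing effect is a placeholder). Setting maxQueued to 1 keeps the read-ahead minimal, which is about as close to "generate only what can be processed immediately" as njoin gets without an explicit ticket queue:

```scala
import scalaz.concurrent.{Strategy, Task}
import scalaz.stream._

object NjoinSketch {
  // njoin needs a Strategy in scope
  implicit val S: Strategy = Strategy.DefaultStrategy

  // placeholder: each inner process runs the effect that generates one element
  def expensiveElement(i: Int): Process[Task, Int] =
    Process.eval(Task.delay(i))

  val sources: Process[Task, Process[Task, Int]] =
    Process.range(0, 100).map(expensiveElement)

  // maxOpen = 4, maxQueued = 1: pull from at most 4 sources concurrently,
  // with very little prefetching per merge slot
  val merged: Process[Task, Int] =
    nondeterminism.njoin(4, 1)(sources)
}
```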

Do memory instructions pass through the load-store queue and issue queue in the microarchitecture

What is the difference between the issue queue and the LSQ (load-store queue) for memory instructions? Do memory instructions pass through both queues, or only through the LSQ? If they pass through both queues, in what order?
I'm assuming you are using ARM-like nomenclature here, so the issue queue is what Intel calls the RS (reservation station), and by "issue" you mean sending a uop that is ready for execution.
The answer is that memory instructions need to pass through both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc.). Let's rephrase: all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads; for stores there is usually an internal split into store-address and store-data uops, so the store-address part behaves like a load in that sense and calculates its address during that step.
There is usually a dedicated execution port and dedicated execution units for this, because the address calculation usually follows one of a few specific addressing modes (each architecture has a different set of these). Aside from that, the execution needs to follow the same rules as any other operation in the CPU: it needs to have its sources ready and read from the register file or bypassed from an in-flight operation, and it needs to be arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wake-ups if the data was not available in the L1), the LSU will signal the issue queue again in order to wake up anyone who depended on the result.
For a store, the store-address uop may fetch the line into the cache in advance as an optimization, but the actual next step is usually to wake up after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth mentioning that some CPUs try to optimize loads that can forward the data directly from prior stores instead of fetching it from the cache/memory. This can include store-to-load forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).
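To make that ordering concrete, here is a toy Scala model of a load uop's lifecycle (purely illustrative; it only encodes the sequencing described above, not any real hardware interface):

```scala
// Toy model of the path described above: a load waits in the issue queue for its
// sources, computes its address on an AGU port when issued, enters the LSU for
// translation and cache access, and finally completes and wakes its dependents.
sealed trait Stage
case object InIssueQueue     extends Stage // waiting for source operands
case object AddressGenerated extends Stage // AGU has computed base + offset
case object InLSU            extends Stage // TLB lookup, cache access, miss handling
case object Completed        extends Stage // result broadcast, dependents woken up

final case class LoadUop(base: Long, offset: Long, stage: Stage = InIssueQueue) {
  def address: Long     = base + offset
  def issue: LoadUop    = copy(stage = AddressGenerated) // issue queue -> AGU
  def dispatch: LoadUop = copy(stage = InLSU)            // AGU -> LSQ/LSU
  def complete: LoadUop = copy(stage = Completed)        // data returned
}
```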

Should I add a buffer after a Kafka source in Akka stream

According to this blog post:
If the source of the stream polls an external entity for new messages and the downstream processing is non-uniform, inserting a buffer can be crucial to realizing good throughput. For example, a large buffer inserted after the Kafka Consumer from the Reactive Streams Kafka library can improve performance by an order of magnitude in some situations. Otherwise, the source may not poll Kafka fast enough to keep the downstream saturated with work, with the source oscillating between backpressuring and polling Kafka.
The documentation for the Alpakka Kafka connector doesn't mention that, so I was wondering whether it makes sense to use a buffer in this case. Also, does the same thing apply to Kafka sinks (should I add a buffer before them)?
...I was wondering if it makes sense to use a buffer in this case
Consider the following segment from the blog post you quoted:
...the downstream processing is non-uniform....
One of the points of that section of the post is to illustrate the similar effects that a user-defined buffer and an asynchronous boundary can have on a stream. The default behavior, in which there are no buffers or asynchronous boundaries, is to enable operator fusion, which runs a stream in a single actor. This essentially means that for every Kafka message that is consumed, the message must go through the entire pipeline of the stream, from source to sink, before the next message goes through the pipeline. In other words, a message m2 will not go through the pipeline until the preceding message m1 is finished processing.
If the processing that occurs downstream from a Kafka connector source is "non-uniform" (i.e., it can take varying amounts of time: sometimes the processing happens quickly, sometimes it takes a while), then introducing a buffer or an asynchronous boundary could improve the overall throughput. This is because a buffer or asynchronous boundary can allow the source to continue consuming Kafka messages even if the downstream processing happens to take a long time. That is, if m1 takes a long time to process, the source could consume messages m2, m3, etc. (until the buffer is full), without waiting for m1 to finish. As Colin Breck states in his post:
The buffer improves performance by decoupling stages, allowing the upstream or downstream to continue to process elements, on average, even if one of them is busy processing a relatively expensive workload.
This potential performance boost doesn't apply for all situations. Again quoting Breck:
Similar to the async method discussed in the previous section, it should be noted that inserting buffers indiscriminately will not improve performance and simply consume additional resources. If adjacent workloads are relatively uniform, the addition of a buffer will not change the performance, as the overall performance of the stream will simply be dominated by the slowest processing stage.
One obvious way to determine whether using a buffer (i.e., .buffer) makes sense in your case is to try it. You might also try adding an asynchronous boundary (i.e., .async) instead. Compare the following three approaches: (1) the default fused behavior without buffering, (2) .buffer, and (3) .async, and see which one results in the best performance.
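As a rough sketch of what those three variants could look like with the Alpakka Kafka connector (topic name, group id, buffer size, and the process step are placeholders; Akka 2.6+ is assumed so the implicit ActorSystem provides the materializer):

```scala
import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer

object BufferComparison extends App {
  implicit val system: ActorSystem = ActorSystem("kafka-buffer-comparison")

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092") // placeholder
    .withGroupId("buffer-comparison")       // placeholder

  // stand-in for the non-uniform downstream work
  def process(record: ConsumerRecord[String, String]): String = record.value()

  def kafkaSource: Source[ConsumerRecord[String, String], Consumer.Control] =
    Consumer.plainSource(settings, Subscriptions.topics("my-topic")) // placeholder topic

  // (1) default: fully fused, no buffer or asynchronous boundary
  val fused = kafkaSource.map(process).to(Sink.ignore)

  // (2) explicit buffer right after the source
  val buffered = kafkaSource
    .buffer(1024, OverflowStrategy.backpressure)
    .map(process)
    .to(Sink.ignore)

  // (3) asynchronous boundary right after the source
  val asyncBoundary = kafkaSource.async.map(process).to(Sink.ignore)

  // Run one variant at a time and measure throughput, e.g.:
  fused.run()
}
```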

How can re-running event handlers in production be done?

In production environments, event numbers scale massively. In cases of emergency, how can you re-run all the handlers when doing so can take days if there are too many of them?
That depends on which sort of emergency you are describing.
If the nature of your emergency is that your event handlers have fallen massively behind the writers (e.g., your message consumers were blocked, and you now have 48 hours of backlog waiting for you) -- not much. If your consumer is parallelizable, you may be able to speed things up by using a data structure like the LMAX Disruptor to support parallel recovery.
(Analog: you decide to introduce a new read model, which requires processing a huge backlog of data to achieve the correct state. There isn't any "answer", except chewing through them all. In some cases, you may be able to create an approximation based on some manageable number of events, while waiting for the real answer to complete, but there's no shortcut to processing all events).
On the other hand, in cases where the history is large, but the backlog is manageable (ie - the write model wasn't producing new events), you can usually avoid needing a full replay.
In the write model: most event-sourced solutions leverage an event store that supports multiple event streams - each aggregate in the write model has a dedicated stream. Massive event numbers usually mean massive numbers of manageable streams. Where that's true, you can just leave the write model alone -- load the entire history on demand.
In cases where that assumption doesn't hold -- a part of the write model that has an extremely large stream, or pieces of the read model that compose events from multiple streams -- the usual answer is snapshotting.
Which is to say, in the healthy system, the handlers persist their state on some schedule, and include in the metadata an identifier that tracks where in the history that snapshot was taken.
To recover, you reload the snapshot, and the identifier. You then start the replay from that point (this assumes you've got an event store that allows you to start the replay from an arbitrary point in the history).
So managing recovery time is simply a matter of tuning the snapshotting interval so that you are never more than your recovery SLA behind "latest". The creation of the snapshots can happen in a completely separate process. (In truth, your persistent snapshot store looks a lot like a persisted read model.)
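A minimal sketch of that recovery scheme (the store interfaces and names here are illustrative, not taken from any particular library): the snapshot carries the position of the last event folded in, and recovery replays only the events after that position.

```scala
// Illustrative types: a snapshot is the handler's state plus the position of the
// last event it has already incorporated.
final case class Snapshot[S](state: S, lastSeenPosition: Long)

trait EventStore[E] {
  // replay (position, event) pairs strictly after the given position
  def eventsAfter(position: Long): Iterator[(Long, E)]
}

trait SnapshotStore[S] {
  def loadLatest(): Option[Snapshot[S]]
  def save(snapshot: Snapshot[S]): Unit
}

// Recovery: start from the latest snapshot (or an empty state) and fold in only
// the tail of the history, instead of replaying everything from the beginning.
def recover[S, E](events: EventStore[E],
                  snapshots: SnapshotStore[S],
                  empty: S)(fold: (S, E) => S): Snapshot[S] = {
  val start = snapshots.loadLatest().getOrElse(Snapshot(empty, -1L))
  events.eventsAfter(start.lastSeenPosition).foldLeft(start) {
    case (Snapshot(state, _), (pos, event)) => Snapshot(fold(state, event), pos)
  }
}
```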

MSI/MESI: How can we get "read miss" in shared state?

In The Cache Memory Book by Jim Handy (excerpt below), the author gives a table describing the MESI protocol. The table looks very unclear to me, and unfortunately the text does not help.
The first question (in green on the picture):
Is this right? A data block is in the cache of a CPU, and it is in the shared state, but when the CPU reads it, the CPU gets a read miss. How is this possible?
The second question (in purple):
Who creates all these messages ("Read miss", etc.), and when? (AFAIK, the system bus just relays the messages of others.)
And finally, the third question (not on the pic):
Do all these cache coherency protocols (MSI, MESI, MOESI, Firefly, Dragon...) maintain the sequential consistency memory model? Are there protocols that maintain other consistency models?
For the first question, while there is a data block (in shared state) in the cache, it is the wrong block (i.e., the tag does not match), so there is a cache miss. On a cache miss, the cache still needs to write back data to memory if it is in the modified state.
For the second question, each bus is providing address and request information (read or write) to the cache; the system bus is providing this from a remote processor. The hit or miss is whether the address provided is a hit in the cache. Requests on the system bus are filtered by the remote processor's cache.
On a read by processor 1 (P1), if P1's cache has a hit, no signal needs to be sent to the system bus. On a write by P1, if P1's cache has a hit and the state is exclusive or modified, then no signals need to be sent to the system bus; but if the state in P1's cache is shared, then the other (P0) cache must be told (via the system bus) that P1 is performing a write and that any entry in P0's cache for that address must be invalidated.
(Note that "shared" state does not necessarily mean that the block of memory is present in some other cache. Most snoop-based protocols allow silent [no system bus communication] dropping of cache blocks in shared state.)
(By the way, the MESI protocol given by that book is a bit unusual. It is more common for Exclusive state to be entered by a read miss that found no other caches with the block [this allows silent update to modified on a later write] and for writes not to use write through to memory on a hit that was in shared state.)
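For reference, here is a toy state machine for the more common MESI variant mentioned in the parenthetical above (heavily simplified: no data transfer or writeback is modeled, and the snoop response on a read miss is reduced to a single boolean):

```scala
sealed trait MesiState
case object Modified  extends MesiState
case object Exclusive extends MesiState
case object Shared    extends MesiState
case object Invalid   extends MesiState

object Mesi {
  // otherCachesHaveLine is what the snoop response on a miss tells us
  def onLocalRead(s: MesiState, otherCachesHaveLine: Boolean): MesiState = s match {
    case Invalid => if (otherCachesHaveLine) Shared else Exclusive // read miss
    case other   => other                                          // read hit: no bus traffic
  }

  def onLocalWrite(s: MesiState): MesiState = s match {
    case Invalid   => Modified // write miss: read-for-ownership, invalidate others
    case Shared    => Modified // must broadcast an invalidate first
    case Exclusive => Modified // silent upgrade, no bus traffic
    case Modified  => Modified
  }

  def onSnoopedRead(s: MesiState): MesiState = s match {
    case Modified | Exclusive => Shared // another cache reads the line: downgrade
    case other                => other
  }

  def onSnoopedWrite(s: MesiState): MesiState = Invalid // another cache takes ownership
}
```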
For the third question, cache coherence protocols only address coherence, not consistency. The consistency model that the hardware provides determines what kinds of activity can be buffered and completed out of order (in a way that is visible to software). For example, it helps performance if a write miss does not force any following read hits to wait until the write notification has been seen by all remote caches, but doing so can violate sequential consistency.
Here is an example of how sequential consistency can prevent buffering of writes:
P0 reads "B_old" at location B (cache hit), writes "A_new" to location A (a cache miss), then reads location B.
P1 reads "A_old" at location A (cache hit), writes "B_new" to location B (a cache miss), then reads location A.
If writes are buffered and reads are allowed to proceed without waiting for the preceding write to be recognized by the other cache, then P0's second read of B can read "B_old" (since it will still be a cache hit) and P1's second read of A can read "A_old" (since P0's write has not yet been seen). There is no way to construct a sequential ordering of these memory accesses, so sequential consistency would be violated.
However, if P0's second read of B waited until P0's write to A was recognized by P1 (and P1's second read of A waited until P1's write to B was recognized by P0), then a sequential ordering would be possible.
If P1 saw P0's write to A before its write to B, then sequential ordering is possible, including:
P0.read "B_old", P1.read "A_old", P0.write "A_new", P1.write "B_new", P0.read "B_new", P1.read "A_new"
P0.read "B_old", P1.read "A_old", P0.write "A_new", P0.read "B_old", P1.write "B_new", P1.read "A_new"
Note that hardware can speculate that ordering requirements are not violated and reorder the completion of memory accesses; however, it needs to be able to handle incorrect speculation (e.g., by restoration to a checkpoint). (A classic early paper on this is "Is SC + ILP = RC?" [PDF].)
There is a one-to-many mapping between cache lines and blocks of main memory: many different memory blocks map to the same cache line. A cache line in the shared state may not hold the block of memory your program is looking for. What actually happened is that the block of memory your program wants and the block of memory present in the cache, though different, have the same cache line index. In a direct-mapped cache, consecutive blocks of memory map to consecutive cache lines until the lines run out, and subsequent blocks of memory wrap around, mapping to cache lines starting with the first line all over again.
I found an intuitive explanation here: https://medium.com/breaktheloop/direct-mapping-map-cache-and-main-memory-d5e4c1cbf73e
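As a small sketch of that index/tag arithmetic for a direct-mapped cache (the geometry below is made up purely for illustration), two addresses whose index bits match compete for the same line even though their tags differ, so the lookup misses even though that line currently holds some block in the shared state:

```scala
object DirectMappedSketch {
  // illustrative geometry: 64-byte lines, 256 lines
  val lineSize = 64
  val numLines = 256

  def index(addr: Long): Long = (addr / lineSize) % numLines // which line
  def tag(addr: Long): Long   = (addr / lineSize) / numLines // which block occupies that line

  def main(args: Array[String]): Unit = {
    val a = 0x10000L
    val b = a + lineSize.toLong * numLines // same index, different tag
    println(s"index(a)=${index(a)}, index(b)=${index(b)}") // equal -> same cache line
    println(s"tag(a)=${tag(a)}, tag(b)=${tag(b)}")         // different -> tag mismatch, miss
  }
}
```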

Akka and state among actors in cluster

I am working on my BSc thesis project, which should be a Minecraft server written in Scala and Akka. The server should be easily deployable in the cloud or onto a cluster (not sure whether I use the proper terminology... it should run on multiple nodes). I am, however, a newbie in Akka, and I have been wondering how to implement such a thing. The problem I'm trying to figure out right now is how to share state among actors on different nodes. My first idea was to have a Camel actor that would read the TCP stream from Minecraft clients and then send it to a load balancer, which would select a node to process the request and then send some response to the client via TCP. Let's say I have an actor implementing an AuthenticationService that checks whether the credentials provided by a user are valid. Every node would have such an actor (or perhaps more of them), and all the actors should have exactly the same database (or state) of users all the time. My question is, what is the best approach to keeping this state? I have come up with some solutions I could think of, but I haven't done anything like this before, so please point out the faults:
Solution #1: Keep state in a database. This would probably work very well for this authentication example, where the state is only represented by something like a list of usernames and passwords, but it probably wouldn't work in cases where the state contains objects that can't easily be broken into integers and strings.
Solution #2: Every time there is a request to a certain actor that would change its state, the actor will, after processing the request, broadcast information about the change to all other actors of the same type, which would change their state according to the info sent by the original actor. This seems very inefficient and rather clumsy.
Solution #3: Have a certain node serve as a sort of state node, containing actors that represent the state of the entire server. Any other actor, except the actors in such a node, would have no state and would ask the actors in the "state node" every time they needed some data. This also seems inefficient and not very fault-tolerant.
So there you have it. The only solution I actually like is the first one, but like I said, it probably works for only a very limited subset of problems (when the state can be broken into Redis structures). Any response from more experienced gurus would be very appreciated.
Regards, Tomas Herman
Solution #1 could possibly be slow. Also, it is a bottleneck and a single point of failure (meaning the application stops working if the node with the database fails). Solution #3 has similar problems.
Solution #2 is less trivial than it seems. First, it is a single point of failure. Second, there are no atomicity or other ordering guarantees (such as regularity) for reads or writes, unless you do a total order broadcast (which is more expensive than a regular broadcast). In fact, most distributed register algorithms will do broadcasts under-the-hood, so, while inefficient, it may be necessary.
From what you've described, you need atomicity for your distributed register. What do I mean by atomicity? Atomicity means that any read or write in a sequence of concurrent reads and writes appears as if it occurs at a single point in time.
Informally, in Solution #2 with a single actor holding a register, this guarantees that if two subsequent writes W1 and then W2 to the register occur (meaning two broadcasts), then no other actor reading values from the register will read them in an order different from first W1 and then W2 (it's actually more involved than that). If you go through a couple of examples of subsequent broadcasts where messages arrive at their destination at different points in time, you will see that such an ordering property isn't guaranteed at all.
If ordering guarantees or atomicity aren't an issue, some sort of a gossip-based algorithm might do the trick to slowly propagate changes to all the nodes. This probably wouldn't be very helpful in your example.
If you want something fully fault-tolerant and atomic, I recommend reading this book on reliable distributed programming by Rachid Guerraoui and Luís Rodrigues, or at least the parts related to distributed register abstractions. These algorithms are built on top of a message-passing communication layer and maintain a distributed register supporting read and write operations. You can use such an algorithm to store distributed state information. However, they aren't applicable to thousands of nodes or large clusters because they do not scale, typically having complexity polynomial in the number of nodes.
On the other hand, you may not need the state of the distributed register replicated across all of the nodes - replicating it across a subset of your nodes (instead of just one node) and accessing that subset for reads and writes provides a certain level of fault tolerance (the register information is lost only if the entire subset of nodes fails). You can possibly adapt the algorithms in the book to serve this purpose.
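As a sketch of the majority-quorum idea behind those register algorithms (greatly simplified: no networking or failure handling, always contacting the same majority, and a real ABD-style read also writes the value back before returning):

```scala
// A value tagged with a monotonically increasing timestamp.
final case class Versioned[A](value: A, timestamp: Long)

// One replica of the register; keeps only the highest-timestamped value it has seen.
final class Replica[A](initial: Versioned[A]) {
  @volatile private var current: Versioned[A] = initial
  def read(): Versioned[A] = current
  def write(v: Versioned[A]): Unit =
    if (v.timestamp > current.timestamp) current = v
}

// Reads and writes go to a majority of replicas. Any two majorities intersect,
// so a read always observes the latest completed write.
final class QuorumRegister[A](replicas: Vector[Replica[A]]) {
  private val quorum = replicas.size / 2 + 1

  def read(): A =
    replicas.take(quorum).map(_.read()).maxBy(_.timestamp).value

  def write(value: A): Unit = {
    val nextTs = replicas.take(quorum).map(_.read().timestamp).max + 1
    replicas.take(quorum).foreach(_.write(Versioned(value, nextTs)))
  }
}
```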