Requesting a clear, picturesque explanation of Reactive Extensions (RX)? - system.reactive

For a long time now I have been trying to wrap my head around Rx, and, to be honest, I am never sure whether I have understood it or not.
Today, I found an explanation on http://reactive-extensions.github.com/RxJS/ which - in my opinion - is horrible. It says:
RxJS is to events as promises are to async.
Great. This is a sentence so full of complexity that if you do not have the slightest idea of what RX is about, after that sentence you are quite as dumb as before.
And this is basically my problem: all the explanations of Rx you find in the usual places make (at least me) feel dumb. They explain Rx as a highly sophisticated concept with lots of complicated words and terms and so on, and I am never quite sure what it is actually about.
So my question is: how would you explain Rx to someone who is five years old? I'd like a clear, picturesque explanation of what it is, what it is good for, and what its main concepts are.

So, LINQ (in JavaScript, these are high-level array methods like map, filter, reduce, etc - if you're not a C# dev, just replace that whenever I mention 'LINQ') gives you a bunch of tools that you can apply to Sequences ("Lists" in a crude sense), in order to filter and transform an input into an output (aka "A list that's actually interesting to me"). But what is a list?
What is a List?
A List is some elements in a particular order. I can take any list and transform it into a better list with LINQ.
(Not necessarily sorted order, but an order).
An Event is a List
But what about an Event? Let's subscribe to an event:
OnKeyUp += (o, e) => Console.WriteLine(e.Key);
>>> 'H'
>>> 'e'
>>> 'l'
>>> 'l'
>>> 'o'
Hm. That looks like some things, in a particular order. It suddenly dawns on you: a list and an event are the same thing!
If Lists and Events are the Same....
...then why can't I transform and filter input events into more interesting events? That's what Rx is. It takes everything you know about dealing with sequences, including all of the LINQ operators like Select, Where, and Aggregate, and applies it to events.
Easy peasy.
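For illustration, here is a minimal sketch using RxScala-style operators (the original answer is about .NET Rx, where the same operators are called Where and Select); the Subject standing in for the key-up event is an assumption made purely for the example:

import rx.lang.scala.Subject

// A stand-in for the OnKeyUp event: a stream of characters we can push into.
val keyUps = Subject[Char]()

// Treat the event like a list: filter and transform it with the same
// operators you would use on a sequence ("Where" and "Select" in LINQ terms).
keyUps
  .filter(_.isLetter)
  .map(_.toUpper)
  .subscribe(c => println(c))

"Hello".foreach(keyUps.onNext)   // simulate the key-up events arriving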
A Callback is a Sequence Too
Isn't a Callback basically just an Event that only happens once? Isn't it basically just a List with one item? It turns out it is, and one of the interesting things about Rx is that it lets us treat Events and Callbacks (and things like Geolocation requests) with the same language (i.e. we can combine the two, wait for either one or the other, etc.).
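As a hedged sketch of that idea (again in RxScala, with made-up values standing in for a geolocation callback), a one-shot result is just an Observable of a single element, so it composes with other streams using ordinary sequence operators:

import rx.lang.scala.Observable

// A callback-style, one-shot result modelled as an Observable of one element.
val position: Observable[String] = Observable.just("52.52,13.40")   // pretend this arrives asynchronously
val cityName: Observable[String] = Observable.just("Berlin")

// Because both are just (very short) sequences, we can combine them with
// the same operators we use for multi-element event streams, e.g. zip.
position.zip(cityName).subscribe { case (pos, name) => println(s"$name @ $pos") }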

Along with Paul's excellent answer I'd like to add the concept of pulling vs pushing data.
Pipeline
Let's take the example of some code that generates a series of numbers and outputs the result. If you think of this as a stream, on one end you have a producer that is creating new numbers for you, and on the other end you have a consumer that is doing something with those numbers.
Pull - Primes List
Let's say the producer is generating a list of prime numbers. Normally you would have some function that yields the numbers one at a time, and each time it returned it would hand the next value it has calculated through the pipe to the consumer, which would output that number to the screen.
Prime Generator ---> Console.WriteLine
In this scenario it is easy to see that the producer is doing most of the work, and the consumer would be sitting around waiting for the producer to send the next value. The consumer is pulling on the pipeline, waiting for the producer to return the next value.
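A minimal pull-side sketch in Scala (the naive trial-division generator is just for illustration): the consumer drives the loop and asks for each next value.

// Pull: the consumer asks for the next prime whenever it is ready for one.
def primes: Iterator[Int] =
  Iterator.from(2).filter(n => (2 until n).forall(n % _ != 0))

primes.take(10).foreach(println)   // the Console.WriteLine end of the pipe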
Push - Progress percent events from a fast process (Reactive)
Ok, let's say you have a function that is processing 1,000,000 items. Each item takes milliseconds to process, and then the function yields out a percentage value of how far it has gotten. So lots of progress values, very fast.
At the other end of the pipeline you have a progress bar. Now if the progress bar were to handle every update, the UI would block trying to keep up with the stream of values.
1-Million-Items-Processor ---> Progress Bar
In this scenario the data is being pushed through the pipeline by the producer and then the consumer is blocking because too much data is being pushed for it to handle.
Rx lets you introduce delays, windows, or sampling into the pipeline, depending on how you wish to consume the data. In this case I would sample the data once per second before updating the progress bar.
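A sketch of the sampling idea in RxScala (the one-millisecond interval just simulates the fast producer; the operator names are essentially the same across Rx ports):

import rx.lang.scala.Observable
import scala.concurrent.duration._

// Simulate a producer pushing a progress update every millisecond.
val progress = Observable.interval(1.millisecond).take(1000000)

// Only look at the most recent value once per second instead of reacting to
// every single push, so the progress bar (the consumer) is never flooded.
progress.sample(1.second).subscribe(p => println(s"processed $p items"))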
Lists vs Events
So lists and events are kinda the same. The difference is whether the data is pulled or pushed through the system. With lists the data is pulled. With events the data is pushed.

Related

Event Sourcing - stream design

I am sitting here looking into CQRS and event sourcing, really interesting topics. When it comes to stream design and aggregate roots, I feel a bit left in the dark. How do you do it?
Let's imagine that I have a UI where I can add stuff to a basket, generating lines in a basket.
Would I have:
a stream per basket (with basic info attached, like shipping details, name, email, etc.)
a stream per basket line
So I would have many streams:
streams/basket-[basketid]
streams/basketline-[basketid]
Basically I only send the minimal data over the wire.
Or would I simply have one stream:
stream/basket-[basketid]
And every time I add a line to my basket, I send the whole basket over the wire.
As I understand it, it is best to have many smaller streams rather than one big streams/basket stream. Or am I mistaken here as well?
My focus here is streams. Any "best practices" on this kind of design (links, books, etc.) would be appreciated.
How do you do it?
Start by watching All Our Aggregates are Wrong (Mauro Servienti, 2019), which considers the question of how many different aggregates you might need to represent a digital shopping cart.
I tend to think of aggregates as graphs of information - if two pieces of information must change together (A changes, and therefore B must also change RIGHT NOW; or A can't change, because its range of allowed values is constrained by B), then they belong to the same aggregate. The boundary of the aggregate separates information that is tightly coupled together from everything else.
Because distributed transactions are hard, it follows that we want our aggregates stored in such a way that changing an aggregate only requires holding one single lock. For example, we won't normally spread a single instance of an aggregate across multiple databases, because ensuring that all of the databases change in exactly the right way at the "same" time is really hard.
We normally store all of the information that is tightly coupled together in a single event stream for exactly the same reason: there's only a single lock to manage.
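To make that concrete, here is a minimal sketch (the event types and the append function are hypothetical stand-ins for whatever your event store client looks like): everything that must change together goes into the one stream per basket, and only the new event crosses the wire.

// Hypothetical events for the basket aggregate; they all live in the
// single stream streams/basket-[basketid].
sealed trait BasketEvent
case class BasketCreated(basketId: String, email: String)        extends BasketEvent
case class LineAdded(basketId: String, sku: String, qty: Int)     extends BasketEvent
case class ShippingDetailsSet(basketId: String, address: String)  extends BasketEvent

// Stand-in for your event store client: appending only the new event,
// not the whole basket, keeps the payload minimal.
def append(streamId: String, event: BasketEvent): Unit =
  println(s"append to $streamId: $event")

append("streams/basket-42", LineAdded("42", "sku-123", qty = 1))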

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID, PCollection<KV<ID,Measurement>>, and something like a changelog stream of additional information for that ID, PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be implemented quite easily with a KStream-KTable join. With Beam, however, none of my approaches so far seem to work. I have already thought about the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream is updating irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that there is at least one updated information for each ID.
Idea 2: CoGroupByKey with a global window and a non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream, as otherwise my panes would quickly become too large. This simply does not work: when I configure my pipeline that way, the additional information stream also discards changes. Setting both triggers to accumulating works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but this solution is also not really scalable, at least unless I'm missing something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that consume the side input. (It's a bit hard to find any information regarding this.) We would like not to make any assumptions regarding the number of IDs and the size of the additional info, so using a side input does not seem to work here either.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
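Here is a rough sketch of the lookup-with-cache idea in Scala (deliberately not tied to Beam's API; Measurement, SomeIDInfo, and the store are hypothetical stand-ins): after a GroupByKey, all measurements for one ID arrive together, so a single lookup serves the whole group.

case class Measurement(value: Double)
case class SomeIDInfo(label: String)

// Stand-in for a call to the external key-value store.
val store = Map("sensor-1" -> SomeIDInfo("configured"))
def lookupInfo(id: String): SomeIDInfo = store.getOrElse(id, SomeIDInfo("default"))

// After grouping by ID, one lookup per key enriches every measurement
// in that group (this is what gives the good cache-hit ratio).
def enrich(id: String, group: Iterable[Measurement]): Iterable[(Measurement, SomeIDInfo)] = {
  val info = lookupInfo(id)
  group.map(m => (m, info))
}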

Parallel design of program working with Flink and scala

This is the context:
There is an input event stream.
There are some methods to apply to the stream, each of which applies different logic to evaluate an event and decides whether it is a "good" or "bad" event.
An event is a real "good" one only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries the result for each event and its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't exploit the advantages of stream processing; at the same time, it takes Time(M1) + Time(M2) + Time(M3) + ... (M = Method), which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel; each method saves the bad events into permanent storage, and the Main method then queries the permanent storage to get the result for each event. But this has some problems to solve:
how to execute the methods in parallel in the programming language (e.g. Scala), and what about the performance (network, CPU, memory)?
how to solve the synchronization problem? The methods need some time to calculate and save their flag into the permanent storage, but the Main method takes less time to query the flag, so a delay issue occurs.
etc.
This is as much a design question as a technical one; I would like to hear your ideas on how to solve the problem. Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
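A hedged sketch of the gathering step using Flink's API from Scala (the case classes and the expected count are assumptions for the example): the unioned stream of partial results is keyed by the unique event ID, and keyed state counts how many evaluations have arrived for that event.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// One partial result per (event, method), emitted by each parallel evaluator.
case class PartialResult(eventId: String, passed: Boolean)
case class FinalResult(eventId: String, good: Boolean)
case class Progress(seen: Int, allPassed: Boolean)

// Used after unioning the result streams and keying them by eventId, e.g.:
// resultStreams.keyBy(_.eventId).flatMap(new GatherResults(expected = 3))
class GatherResults(expected: Int) extends RichFlatMapFunction[PartialResult, FinalResult] {
  private var progress: ValueState[Progress] = _

  override def open(parameters: Configuration): Unit =
    progress = getRuntimeContext.getState(
      new ValueStateDescriptor("progress", classOf[Progress]))

  override def flatMap(in: PartialResult, out: Collector[FinalResult]): Unit = {
    val prev = Option(progress.value()).getOrElse(Progress(0, allPassed = true))
    val next = Progress(prev.seen + 1, prev.allPassed && in.passed)
    if (next.seen == expected) {           // last evaluation for this event has arrived
      out.collect(FinalResult(in.eventId, next.allPassed))
      progress.clear()
    } else {
      progress.update(next)
    }
  }
}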

Converting Rx-Observables to Twitter Futures in Scala

I want to implement the following functions in the most reactive way possible. I need them to implement the bijections for automatic conversion between the said types.
def convertScalaRXObservableToTwitterFuture[A](a: Observable[A]): TwitterFuture[A] = ???
def convertScalaRXObservableToTwitterFutureList[A](a: Observable[A]): TwitterFuture[List[A]] = ???
I came across this article on a related subject but I can't get it working.
Unfortunately the claim in that article is not correct, and there can't be a true bijection between Observable and anything like Future. The thing is that Observable is a more powerful abstraction that can represent things that can't be represented by Future. For instance, an Observable might represent an infinite sequence; see Observable.interval. Obviously there is no way to represent something like this with a Future. The Observable.toList call used in that article explicitly mentions that:
Returns a Single that emits a single item, a list composed of all the items emitted by the finite source ObservableSource.
and later it says:
Sources that are infinite and never complete will never emit anything through this operator and an infinite source may lead to a fatal OutOfMemoryError.
Even if you limit yourself to only finite Observables, a Future still can't fully express the semantics of an Observable. Consider Observable.intervalRange, which generates a limited range one element at a time over some time period. With Observable the first event comes after initialDelay and then you get an event every period. With Future you can get only one value, and only once the sequence has been fully generated and the Observable has completed. It means that by transforming Observable[A] into Future[List[A]] you immediately lose the main benefit of Observable, its reactivity: you can't process events one by one, you have to process them all in a single bunch.
To sum up, the claim in the first paragraph of the article:
convert between the two, without loosing asynchronous and event-driven nature of them.
is false, because the conversion Observable[A] -> Future[List[A]] exactly loses the "event-driven nature" of Observable, and there is no way to work around this.
P.S. Actually, the fact that Future is less powerful than Observable should not be a big surprise. If it were not, why would anybody have created Observable in the first place?
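For completeness, here is a hedged sketch of the one direction that does make sense for finite Observables (using RxScala and com.twitter.util); note that the buffering is exactly the loss of reactivity described above, and this will never complete for an infinite source:

import com.twitter.util.{Future => TwitterFuture, Promise => TwitterPromise}
import rx.lang.scala.Observable
import scala.collection.mutable.ListBuffer

// Only meaningful for Observables that actually complete: we must buffer
// every element and wait for onCompleted before the Future can fire.
def convertScalaRXObservableToTwitterFutureList[A](a: Observable[A]): TwitterFuture[List[A]] = {
  val promise = new TwitterPromise[List[A]]()
  val buffer  = ListBuffer.empty[A]
  a.subscribe(
    elem => buffer += elem,                    // onNext: accumulate
    error => promise.setException(error),      // onError: fail the Future
    () => promise.setValue(buffer.toList)      // onCompleted: fire once, with everything
  )
  promise
}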

How to model very large work queues in Akka?

I am writing a scala script to download all items from the hacker news API. There are ~12M items, each being a JSON of ~200 bytes.
I identified the following issues:
Storing the data: I tried to save each item as a single JSON file, but it became very hard even to list them (using Linux, ext4 file system). So I changed it to append the JSON items to multiple (100) files (by taking the item's id modulo 100).
Keeping track of what has been downloaded, because I want to be able to stop/continue the application. First I tried writing the downloaded ids to a text file, but it turned out a little buggy. So now I just read all the items and collect the ids. (It works.)
All this is done with 1 Master actor and an arbitrary number of Worker actors (tens). The Master has a Queue[Int]; Workers ask for work and the Master pops the next id off the queue for them.
The problem I am having is fairly simple but I haven't been able to solve it in a nice way.
I can collect the ids from items already downloaded in a list. But what I really need is the complement to that set; I need all the items I have not downloaded, up to the highest item id.
I tried using a range (1 to maxItemId) and subtracting the set of done jobs but it is really slow. reaaaaaaally slow.
Now I am using a Stream, and when a worker asks for a job, I check whether the stream's head (the next job) has already been done. If it has, I skip it and check the next one; otherwise I give it to the Worker.
The problem with this approach is that I can not put jobs back at the stream if they fail. That would be easy with the Queue; but then again I am having trouble just setting up the queue with millions of items.
What could be a better approach to this? I don't think the issues here are trivial (this is a very large number of tasks to perform and keep track of), but it shouldn't be that hard either.
Thanks!
As far as I understand your question, I don't think you need a very complicated data structure here.
Assuming your ids are sequential from 1 to maxItemId, you can use an array of Booleans of size maxItemId to keep track of processed items. You initialize this array by reading the processed ids, and you find the next job by searching for the next false entry.
Assuming that your maxItemId is around 12M, iterating over all items is pretty much instantaneous.
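A minimal sketch of that bookkeeping in Scala (the id range and the source of already-downloaded ids are assumptions for the example):

val maxItemId = 12000000
val done = new Array[Boolean](maxItemId + 1)   // one flag per item id, ids starting at 1

// Initialize from whatever you have already stored on disk.
val alreadyDownloaded: Iterator[Int] = Iterator.empty   // stand-in for reading your files
alreadyDownloaded.foreach(id => done(id) = true)

// Hand out work by scanning forward for the next false entry;
// a failed job can be retried simply by never setting its flag to true.
def nextJob(from: Int): Option[Int] = (from to maxItemId).find(id => !done(id))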