How to sessionize stream with Apache Flink? - scala

I want to sessionize this stream: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,0,3,3,3,5, ... to these sessions:
1,1,1
2,2,2,2,2
3,3,3,3,3,3,3
0
3,3,3
5
I've wrote CustomTrigger to detect when stream elements change from 1 to 2 (2 to 3, 3 to 0 and so on) and then fire the trigger. But this is not the solution, because when I processing the first element of 2's, and fire the trigger the window will be [1,1,1,2] but I need to fire the trigger on the last element of 1's.
Here is the pesudo of my onElement function in my custom trigger class:
override def onElement(element: Session, timestamp: Long, window: W, ctx: TriggerContext): TriggerResult = {
if (prevState == element.value) {
prevState = element.value
TriggerResult.CONTINUE
} else {
prevState = element.value
TriggerResult.FIRE
}
}
How can I solve this problem?

I think a FlatMapFunction with a ListState is the easiest way to implement this use-case.
When a new element arrives (i.e., the flatMap() method is called), you check if the value changed. If the value did not changed, you append the element to the state. If the value changed, you emit the current list state as a session, clear the list, and insert the new element as the first to the list state.
However, you should keep in mind that this assumes that the order of elements is preserved. Flink ensures within a partition, i.e, as long as elements are not shuffled and all operators run with the same parallelism.

Related

Is there an Operation to block onComplete?

I am trying to learn reactive programming, so forgive me if I ask a silly question. I'm also open to advice on changing my design.
I am working in scala-swing to display the results of a simulator. With one setting, a chart is displayed as a histogram; with the other setting the chart is displayed as the cumulative sum. (I'm probably using the wrong word; in the first setting you might have bin1=2, bin2=5, bin3=3; in the second setting the first height is 2, the second is 2 + 5, the third is 2 + 5 + 3, etc.). The simulator can be slow, so I originally used a Future to compute it, and the set the data into the chart. I decided to try a reactive approach, so my requirements are: 1. I don't want to recreate the data when I change the display mode, and 2. I want to set the Observable once for the chart and have the chart listen to the same Observable permanently.
I got this to work when I started the chain with a PublishSubject and the Future set the data into the start of the chain. When the display mode changed, I created a new PublishSubject().map(newRenderingLogic).subscribe(theChartsObservable). I am now trying to do what looks like the "right way," but it's not working correctly. I've tried to simplify what I have done:
val textObservable: Subject[String] = PublishSubject()
textObservable.subscribe(text => {
println(s"Text: ${text}")
})
var textSubscription: Option[Subscription] = None
val start = Observable.from(Future {
"Base text"
}).cache
var i = 0
val button = new Button() {
text = "Click"
reactions += {
case event => {
i += 1
if (textSubscription.isDefined) {
textSubscription.get.unsubscribe()
}
textSubscription = Some(start.map(((j: Int) => { (base: String) => s"${base} ${j}" })(i)).subscribe(textObservable))
}
}
}
On start, an Observable is created and logic to print some text is added to it. Then, an Observable with the generated data is created and a cache is added so that the result is replayed if the next subscription comes in after its results are generated. Then, a button is created. Then on button clicks a middle observable is chained with unique logic (it's a function that creates a function to append the value of i into the string, run with the current value of i; I tried to make something that couldn't just be reused) that is supposed to change with each click. Then the first Observable is subscribed to it so that the results of the whole chain end up being printed.
In theory, the cache operation takes care of not regenerating the data, and this works once, but onComplete is called on textObservable and then it can't be used again. It works if I subscribe it like this:
textSubscription = Some(start.map(((j: Int) => { (base: String) => s"${base} ${j}" })(i)).subscribe(text => textObservable.onNext(text)))
because the call to onComplete is intercepted, but this looks wrong and I wanted to know if there was a more typical way to do this, or architect it. It makes me think that I don't understand how this is supposed to be done if there isn't an out-of-the-box operation to do this.
Thank you.
I'm not 100% sure if I got the essence of your question right, but: if you have an Observable that may complete and you want to turn it into an Observable that never completes, you can just concatenate it with Observable.never.
For example:
// will complete after emitting those three elements:
val completes = Observable.from(List(1, 2, 3))
// will emit those three elements, but will never complete:
val wontComplete = completes ++ Observable.never

RxJS interleaving merged observables (priority queue?)

UPDATE
I think I've figured out the solution. I explain it in this video. Basically, use timeoutWith, and some tricks with zip (within zip).
https://youtu.be/0A7C1oJSJDk
If I have a single observable like this:
A-1-2--B-3-4-5-C--D--6-7-E
I want to put the "numbers" as lower priority; it should wait until the "letters" is filled up (a group of 2 for example) OR a timeout is reached, and then it can emit. Maybe the following illustration (of the desired result) can help:
A------B-1-----C--D-2----E-3-4-5-6-7
I've been experimenting with some ideas... one of them: first step is to split that stream (groupBy), one containing letters, and the other containing numbers..., then "something in the middle" happen..., and finally those two (sub)streams get merged.
It's that "something in the middle" what I'm trying to figure out.
How to achieve it? Is that even possible with RxJS (ver 5.5.6)? If not, what's the closest one? I mean, what I want to avoid is having the "numbers" flooding the stream, and not giving enough chance for the "letters" to be processed in timely manner.
Probably this video I made of my efforts so far can clarify as well:
Original problem statement: https://www.youtube.com/watch?v=mEmU4JK5Tic
So far: https://www.youtube.com/watch?v=HWDI9wpVxJk&feature=youtu.be
The problem with my solution so far (delaying each emission in "numbers" substream using .delay) is suboptimal, because it keeps clocking at slow pace (10 seconds) even after the "characters" (sub)stream has ended (not completed -- no clear boundary here -- just not getting more value for indeterminate amount of time). What I really need is, to have the "numbers" substream raise its pace (to 2 seconds) once that happen.
Unfortunately I don't know RxJs5 that much and use xstream myself (authored by one of the contributor to RxJS5) which is a little bit simpler in terms of the number of operators.
With this I crafted the following example:
(Note: the operators are pretty much the same as in Rx5, the main difference is with flatten wich is more or less like switch but seems to handle synchronous streams differently).
const xs = require("xstream").default;
const input$ = xs.of("A",1,2,"B",3,4,5,"C","D",6,7,"E");
const initialState = { $: xs.never(), count: 0, buffer: [] };
const state$ = input$
.fold((state, value) => {
const t = typeof value;
if (t === "string") {
return {
...state,
$: xs.of(value),
count: state.count + 1
};
}
if (state.count >= 2) {
const l = state.buffer.length;
return {
...state,
$: l > 0 ? xs.of(state.buffer[0]) : xs.of(value) ,
count: 0,
buffer: state.buffer.slice(1).concat(value)
};
}
return {
...state,
$: xs.never(),
buffer: state.buffer.concat(value),
};
}, initialState);
xs
.merge(
state$
.map(s => s.$),
state$
.last()
.map(s => xs.of.apply(xs, s.buffer))
)
.flatten()
.subscribe({
next: console.log
});
Which gives me the result you are looking for.
It works by folding the stream on itself, looking at the type of values and emitting a new stream depending on it. When you need to wait because not enough letters were dispatched I emit an emptystream (emits no value, no errors, no complete) as a "placeholder".
You could instead of emitting this empty stream emit something like
xs.empty().endsWith(xs.periodic(timeout)).last().mapTo(value):
// stream that will emit a value only after a specified timeout.
// Because the streams are **not** flattened concurrently you can
// use this as a "pending" stream that may or may not be eventually
// consumed
where value is the last received number in order to implement timeout related conditions however you would then need to introduce some kind of reflexivity with either a Subject in Rx or xs.imitate with xstream because you would need to notify your state that your "pending" stream has been consumed wich makes the communication bi-directionnal whereas streams / observables are unidirectionnal.
The key here the use of timeoutWith, to switch to the more aggresive "pacer", when the "events" kicks in. In this case the "event" is "idle detected in the higher-priority stream".
The video: https://youtu.be/0A7C1oJSJDk

Spark - how to handle with lazy evaluation in case of iterative (or recursive) function calls

I have a recursive function that needs to compare the results of the current call to the previous call to figure out whether it has reached a convergence. My function does not contain any action - it only contains map, flatMap, and reduceByKey. Since Spark does not evaluate transformations (until an action is called), my next iteration does not get the proper values to compare for convergence.
Here is a skeleton of the function -
def func1(sc: SparkContext, nodes:RDD[List[Long]], didConverge: Boolean, changeCount: Int) RDD[(Long] = {
if (didConverge)
nodes
else {
val currChangeCount = sc.accumulator(0, "xyz")
val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
if (currChangeCount.value == changeCount) {
func1(sc, newNodes, true, currChangeCount.value)
} else {
func1(sc, newNode, false, currChangeCount.value)
}
}
}
performSomeOps only contains map, flatMap, and reduceByKey transformations. Since it does not have any action, the code in performSomeOps does not execute. So my currChangeCount does not get the actual count. What that implies, the condition to check for the convergence (currChangeCount.value == changeCount) is going to be invalid. One way to overcome is to force an action within each iteration by calling a count but that is an unnecessary overhead.
I am wondering what I can do to force an action w/o much overhead or is there another way to address this problem?
I believe there is a very important thing you're missing here:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Because of that accumulators cannot be reliably used for managing control flow and are better suited for job monitoring.
Moreover executing an action is not an unnecessary overhead. If you want to know what is the result of the computation you have to perform it. Unless of course the result is trivial. The cheapest action possible is:
rdd.foreach { case _ => }
but it won't address the problem you have here.
In general iterative computations in Spark can be structured as follows:
def func1(chcekpoinInterval: Int)(sc: SparkContext, nodes:RDD[List[Long]],
didConverge: Boolean, changeCount: Int, iteration: Int) RDD[(Long] = {
if (didConverge) nodes
else {
// Compute and cache new nodes
val newNodes = performSomeOps(nodes, currChangeCount).cache
// Periodically checkpoint to avoid stack overflow
if (iteration % checkpointInterval == 0) newNodes.checkpoint
/* Call a function which computes values
that determines control flow. This execute an action on newNodes.
*/
val changeCount = computeChangeCount(newNodes)
// Unpersist old nodes
nodes.unpersist
func1(checkpointInterval)(
sc, newNodes, currChangeCount.value == changeCount,
currChangeCount.value, iteration + 1
)
}
}
I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore the only way to perform all updates is to execute all these functions and count is the easiest way to achieve that and gives the lowest overhead compared to other ways (cache + count, first or collect).
Previous answers put me on the right track to solve a similar convergence detection problem.
foreach is presented in the docs as:
foreach(func) : Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
It seems like instead of using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.
I'm unable to produce a scala example, but here's a basic java version, if it can still help:
// Convergence is reached when two iterations
// return the same number of results
long previousCount = -1;
long currentCount = 0;
while (previousCount != currentCount){
rdd = doSomethingThatUpdatesRdd(rdd);
// Count entries in new rdd with foreach + accumulator
rdd.foreach(tuple -> accumulator.add(1));
// Update helper values
previousCount = currentCount;
currentCount = accumulator.sum();
accumulator.reset();
}
// Convergence is reached

Only Take Last Item After X Seconds

I am trying to skip items from a stream until n-seconds, and then take the last item that was passed in the stream. This is what I have so far:
const delayedState$ = state$.delay(1000);
state$.buffer(
delayedState$
).filter(
(buffer) => buffer && buffer.length > 0
).publishReplay(1).refCount().map(
(buffer) => buffer.slice(-1).pop()
).subscribe((state) => {
saveState({
buttonCount: state.buttonCount
});
})
But this seems messy, and doesn't seem to work when the stream has many changes in very short succession. I am basically trying to follow this:
https://github.com/tayiorbeii/egghead.io_idiomatic_redux_course_notes/blob/master/03-Persisting_the_State_to_the_Local_Storage.md
My constraint; it has to be the last item after n-seconds, not the first item and then wait n-seconds.
In this demo the source Observable emits 10 values with 1s delay between them.
All values are ignored until the inner Observable delayed by 5s emits a value and then only the last values is passed to the subscriber (the last value is 9 because interval() counts value from 0):
let source = Observable.interval(1000).take(10);
source.skipUntil(Observable.of(true).delay(5000))
.takeLast(1)
.subscribe(val => console.log(val));
See live demo: http://plnkr.co/edit/zklJWnnKzHu3smNmr0T2?p=preview

Why I am getting only one item out of this Observable?

I have a cold observable with static number of items, I needed some time delay between each item, I have combined it with another IObservable I got through Observable.Timer. I am using Zip .
var ob1 = Observable.Range(1, 100);
var ob2 = Observable.Timer(TimeSpan.FromSeconds(1.0));
var myObservable = Observable.Zip(ob1, ob2, (a, b) => b);
myObservable.Subscribe(a => Console.WriteLine("Item encountered"));
///Allow enough time for Timer observable to give back multiple ticks
Thread.Sleep(3000);
But output only prints "Item encountered" once. What am I missing ?
To confirm the commentary, Observable.Interval is the way to go for just a single argument - and thus it has always been!
I found the solution. Observable.Timer takes two arguments for my scenario, first one is due time for first item and second due time is for all subsequent items. And if only one TimeSpan argument is supplied, it would yield only one item.
Observable.Timer(TimeSpan.FromSeconds(1.0), TimeSpan.FromSeconds(1.0));