Is there a way to specify infinite allowed lateness in Apache Beam? - apache-beam

I'm using fixed windows to batch data by event time in order to send it to an external API efficiently (batches of 60 seconds), accumulation mode is set to DISCARDING because it doesn't matter if late data is sent to the external API without the previous data.
Is it possible to specify an infinite allowed lateness, so late data is never discarded?

It is definitely possible, you can set allowed lateness to a very high Duration (for instance, Duration.standardDays(36500)). On the other hand , doing so would result in your state growing indefinitely, which might not be what you want. Every open window (every window ever seen) will have at least a timer called a GC timer - a timer set for the end of the window + allowed lateness. Every timer has to be kept in state and therefore, the size of your state will grow over time.
If you do not need batching based on event-time, it might be a better option to use GroupIntoBatches, which should not suffer from this problem (you don't need to set allowed lateness and the size of your state will not grow).

Related

Temporary adjustment of delay time

I have the following problem which I am unable to solve:
I have a situation where a security point (added as delay) holds every half an hour a 15 min break. After the break, the security guards increase their speed till the queue is shorter than 10pp.
I wanted to model this as follows: a state chart with delay.set_capacity(0) after 30 minutes and delay.set_capacity(1) again after the 15 min break. For the increased speed after the break, I added an additional state with condition: queue.size()>10 and now I want to set the action such that the delay function changes the delay time from exponential (1/10) to exponential (1/5) as long as queue.size()>10.
Anyone experience with which function in the action box to use? Or would you suggest a different function?
Since you are using, or at least want to use a statechart I would suggest the following design, where you have composite states inside the working state to indicate if the security agent is working fast or normal and a message transition to let it move from one state to the next.
It is advised to use a message transition and trigger it as needed instead of a conditional state which gets chected for every change inside the agent since this can be a computational expensive exercise.
I assume you already implemented the correct capacity settings for the different on enter actions for working and breaking
Now you simply need to send the message every time an agent enters the queue and every time it exits the delay block, and of course, see the delay time based on the state of the statechart.
Aee screenshot below.

Does Apache Beam stateful Processing consider window lateness constraints (withAllowedLateness) for resetting state?

I'm trying to implement a valueState to filter records in my ParDo transformation. The high level flow is this:
Fixed-Window of 1hr size, with allowedLateness (10min)
The first message (for a given key) that is processed in the ParDo shall set the valueState(boolean) to true. Subsequent messages for the same key shall be dropped if corresponding valueState is set to true. (Allow only first message for a given key in every window).
The messages (that are not dropped in step 2) will be written out as output.
While testing this however, I see that, after the Fixed window time-period ends (1hr), the state is reset/lost. Ideally, the state should be available to process late records until allowedLateness period (10min is complete).
These parts are right:
Each 1 hour window expires when the watermark reaches the end of the hour plus 10 minutes.
For a given window, the state is cleaned up after the window expires.
Here are the parts that I have corrections
State is never reset.
Elements with timestamps in different windows are processed totally independently. Many windows may be receiving data at the same time. Each hour window happened after another, when the data was generated. But it is not processed after the other.
Allowed lateness will not cause elements from a later window to be processed using the state from the prior window. It will simply allow the state to stay longer and the elements to not be dropped.

What is the correct operation of a CANopen inhibit timer?

I understand that the operation of a CANopen inhibit timer is to ensure a minimum time between successive transmissions of the same message, but the specification does not make it clear what to do if the data changes during the inhibit time (and the transmission is on change-of-state). Should I buffer the data and transmit it when the inhibit timer expires, or discard it and wait for a change after the timer has expired?
My assumption would be, since it is not clearly defined, I can choose whichever approach I want, but I'd appreciate the input of any experienced architects / developers on this.
Thanks.
You're correct that the inhibit time is simply the minimum time between consecutive CAN frames with the same CAN-ID. The standard does not specify the behavior for multiple events during the inhibit time window, because it depends on the situation.
For services like NMT, EMCY and perhaps LSS, you'd want to buffer the messages and send them later. In this case the inhibit time is simply a means to help slow (or badly programmed) devices to handle short bursts of messages. I've seen devices that could only handle 3 CAN frames at once, so it's often necessary, but you would not want them to miss messages.
For event-driven Transmit-PDOs, it depends on what the PDO represents. If you use it to track state, it might make sense to drop events during the inhibit window. They're invalidated by subsequent events anyway. To ensure you always emit the latest state, you can store the most recent event and transmit it once the inhibit time has elapsed, or use the event-timer to ensure you're never too far behind. I've used this strategy in the past for analog inputs where line noise would sometimes cause event bursts.
If you use PDOs to track events (or state changes), you'd be better of buffering them so no events get lost. However, this can introduce potentially unbounded delays if the event period is shorter than the inhibit time.
For the products we're working on at Lely (dairy farm robots), we actually prefer to use SYNC-driven PDOs instead. It results in a much more predictable CAN bus load. And we don't have to track state at the receiver side because we receive a full update on every SYNC. However, the receiver is always one SYNC period behind the transmitter, so this may not be appropriate for your use case.

Handle Too Late data in Spark Streaming

Watermark allows late arriving data to be considered for inclusion against already computed results for a period of time using windows. Its premise is that it tracks to a point in time before which it is assumed no more late events are supposed to arrive, but if they do, they are none-the-less discarded.
Is there a way to store the discarded data, that can be used for reconciliation purpose later?
Say In my Structured Streaming, I set the watermark to 1 hour.
I am doing window operation for each 10 min and received a later event 20 min late.
Is there a way I can store the discarded data say at a different location rather than discarding it?
No, there is no way to achieve this aspect.

Delay fixed window from triggering for several minutes

Using Fixed Windows in Apache Beam. The watermark is set by the event time.
Some data may arrive out of order and cause the window to close.
How can a trigger be defined in Java to occur say 2 minutes after the last data was seen?
It's not entire clear what behavior you expect. One question is what do you expect to happen if the data arrives within the two minutes? Do you want to restart the two minutes interval, don't restart it, re-emit the data or not?
Looks like the trigger you are trying to describe is something along these lines:
wait until the watermark passed the end of window, in event time;
wait for additional 2 minutes in processing time;
emit the data;
If in step 2 it was event time, i.e. you wanted to re-emit the window if a late element arrives that fits within window + 2min, then you could use withAllowedLateness(). Though it sounds different from what you want, because it can keep re-emitting the window contents every time a matching late element arrives.
With processing-time in step 2 this is not possible in general with basic triggers that are available in Beam. You can probably achieve a behavior you want if you manually manage state and timers in your own ParDo, e.g. you can watch for the incoming elements, keep track on them in the state, and then on timer emit what you want. This can become very complicated and might still be not flexible enough for your specific use case.
One of the major problems is that there is no good way to define processing time triggers in Beam in general. It would be complicated to define a general mechanism of working with timers in this manner. For example, when you want to express "wait for 2 minutes", the framework needs to understand in relation to what these two minutes are, when to start the timer, so you need a mechanism to express that as well. And with composition, continuation and other complications this doesn't seem easy to reason about. So it's not in the framework in this general form.
In order to implement only the "wait for 2 minutes after the last element was seen in the window", the framework has to watch for it and set the timer. Technically it is possible to do something like this but doesn't seem like anyone has done it yet.
There seems to be only one meaningful processing time trigger available in Beam but it's not generic enough and doesn't do what you want. You can look at composite triggers like AfterFirst or AfterAll but they likely won't help you without a better general processing time trigger.
I decided against using Beam and implemented the solution in Kafka Streams.
I basically grouped by, then used fixed windows and the aggregated the result.
The "grace" on the window allows data to arrive late.
KGroupedStream<Long, OxyStreamItem> grouped = input.groupByKey();
TimeWindowedKStream<Long, OxyStreamItem> windowed =
grouped.windowedBy(
TimeWindows.of(WIN_SIZE)
.advanceBy(WIN_SIZE)
.grace(Duration.ofSeconds(5L)));
return windowed
.aggregate(
makeInitializer(),
makeAggregator(),
Materialized
.<Long, Aggregate, WindowStore<Bytes, byte[]>>as("tmp")
.withValueSerde(new AggregateSerde()))
.suppress(
Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.map(calculateAvg());