FlatMapGroupsWithState behaviour when state exists - Scala

A Spark group state never expires if, after a new state is set (with an expiration), the group is invoked again before the initial expiration.
I am setting a timeout of "5 seconds" for every group state. If I send an event which creates a new state and wait for 5 seconds, the group times out successfully. However, if I send an event which creates a new state and then send a similar event which falls into the same group, that group state never times out.
// in main class
val aggregation = inputStream
  .groupByKey(r => r.key)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(Aggregator.aggregate)

// in another object in another file
object Aggregator {
  def aggregate(key: String,
                rows: Iterator[InputSchema],
                state: GroupState[List[InputSchema]]): Iterator[InputSchema] = {

    if (state.hasTimedOut) {
      print("time out")
      val output = state.get
      state.remove()
      return output.toIterator
    }

    if (state.exists) {
      println("exists")
      state.update(state.get ++ rows.toList)
      return Iterator()
    }

    println("coming to update")
    state.update(rows.toList)
    state.setTimeoutDuration("5 seconds")
    Iterator()
  }
}
With the above code, the following happens:
1) If I send a message where r.key is 12345:
the print message "coming to update" is printed;
after 5 seconds, "time out" is printed.
Both of these are the expected behaviour.
2) If I send a message where r.key is abcd at time t1, and then send the same message at t2, where t2 - t1 is less than 5 seconds (the group timeout initially set), the following happens:
the print message "coming to update" is printed (for the first event at t1);
the print message "exists" is printed at t2;
even after waiting for well over 5 seconds, the state never times out and the "time out" message is never printed.
The expected behaviour is that the state should time out 5 seconds after t1.
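For reference, the GroupState API documentation notes that the timeout is reset every time the function is called for a group, so it has to be set again on every invocation that should keep a timeout armed; calling state.update() in the state.exists branch without a new setTimeoutDuration leaves the group with no timeout at all. Below is a minimal sketch of that branch with the timeout re-armed (note this restarts the 5-second clock from the latest batch; preserving the original t1 deadline would require storing the first-seen time in the state and setting the remaining duration instead):
if (state.exists) {
  println("exists")
  state.update(state.get ++ rows.toList)
  // Re-arm the timeout: it is cleared on every invocation for this group,
  // so it must be set again whenever new data arrives.
  state.setTimeoutDuration("5 seconds")
  return Iterator()
}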

Related

KStream-KStream leftJoin not consistently emitting after window expiry

We have a service where people can order a battery with their solar panels. As part of provisioning, we try to fetch some details about the battery product; this sometimes fails to return any data, but we still want to send the order through to our CRM system.
To achieve this we are using the latest version of Kafka Streams leftJoin:
We receive an event on the order-received topic.
We filter out orders that do not contain a battery product.
We then wait up to 30mins for an event to come through on the order-battery-details topic.
If we don't receive that event, we want to send a new event to the battery-order topic with the data we do have.
This seems to work fine when we receive both events; however, it is inconsistent when we only receive the first event. Sometimes the order comes through immediately after the 30-minute window, sometimes it takes several hours.
My question is: if the window has expired (i.e. we failed to receive the right side of the join), what determines when the event will be sent? And what could be causing the long delay?
Here's a high level example of our service:
@Component
class BatteryOrderProducer {

    @Autowired
    fun buildPipeline(streamsBuilder: StreamsBuilder) {
        // listen for new orders and filter out everything except orders with a battery
        val orderReceivedReceivedStream = streamsBuilder.stream(
            "order-received",
            Consumed.with(Serdes.String(), JsonSerde<OrderReceivedEvent>())
        ).filter { _, order ->
            // check if the order contains a battery product
        }.peek { key, order ->
            log.info("Received order with a battery product: $key", order)
        }

        // listen for battery details events
        val batteryDetailsStream = streamsBuilder.stream(
            "order-battery-details",
            Consumed.with(Serdes.String(), JsonSerde<BatteryDetailsEvent>())
        ).peek { key, order ->
            log.info("Received battery details: $key", order)
        }

        val valueJoiner: ValueJoiner<OrderReceivedEvent, BatteryDetailsEvent, BatteryOrder> =
            ValueJoiner { orderReceived: OrderReceivedEvent, batteryDetails: BatteryDetailsEvent? ->
                // new BatteryOrder
                if (batteryDetails != null) {
                    // add battery details to the order if we get them
                }
                // return the BatteryOrder
            }

        // we always want to send through the battery order, even if we don't get the 2nd event.
        orderReceivedReceivedStream.leftJoin(
            batteryDetailsStream,
            valueJoiner,
            JoinWindows.ofTimeDifferenceAndGrace(
                Duration.ofMinutes(30),
                Duration.ofMinutes(1)
            ),
            StreamJoined.with(
                Serdes.String(),
                JsonSerde<OrderReceivedEvent>(),
                JsonSerde<BatteryDetailsEvent>()
            ).withStoreName("battery-store")
        ).peek { key, value ->
            log.info("Merged BatteryOrder", value)
        }.to(
            "battery-order",
            Produced.with(
                Serdes.String(),
                JsonSerde<BatteryOrder>()
            )
        )
    }
}
The leftJoin will not trigger as long as there are no new records. So if you have an order-received record with key A at time t, and then no new record arrives (on either side of the join) for the next 5 hours, there will be no output from the join for those 5 hours, because the leftJoin is never triggered. In particular, the leftJoin needs to receive a record with a timestamp greater than t + 30m before a null result is sent.
I think that to satisfy your requirements, you need to work with the lower-level Processor API: https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html
In a Processor, you can define a Punctuator that runs regularly, checks whether an order has been waiting for its details for more than half an hour, and sends off the null record accordingly.
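As a rough illustration of that suggestion (not part of the original answer), here is a sketch of such a processor against the Kafka Streams Processor API, written in Scala. The store name "pending-battery-orders", the receivedAt field, and the toBatteryOrder helper are assumptions made for the example; the event types are borrowed from the question, and the processor plus its state store would still need to be wired into the topology (e.g. via addStateStore and process).
import java.time.Duration

import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.processor.PunctuationType
import org.apache.kafka.streams.processor.api.{Processor, ProcessorContext, Record}
import org.apache.kafka.streams.state.KeyValueStore

// Minimal stand-ins for the question's event types; receivedAt (epoch millis) is an assumed field.
case class OrderReceivedEvent(orderId: String, receivedAt: Long)
case class BatteryOrder(orderId: String, batteryDetails: Option[String])

class PendingOrderTimeoutProcessor extends Processor[String, OrderReceivedEvent, String, BatteryOrder] {

  private var context: ProcessorContext[String, BatteryOrder] = _
  private var pending: KeyValueStore[String, OrderReceivedEvent] = _

  override def init(ctx: ProcessorContext[String, BatteryOrder]): Unit = {
    context = ctx
    // "pending-battery-orders" is a hypothetical store name; the store must be attached to this processor.
    pending = ctx.getStateStore("pending-battery-orders")

    // Wall-clock punctuator: every minute, flush orders that have waited more than 30 minutes for details.
    ctx.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, (now: Long) => {
      val it = pending.all()
      try {
        while (it.hasNext) {
          val entry: KeyValue[String, OrderReceivedEvent] = it.next()
          if (now - entry.value.receivedAt >= Duration.ofMinutes(30).toMillis) {
            // No battery details arrived in time: emit the order with the data we do have.
            context.forward(new Record(entry.key, toBatteryOrder(entry.value), now))
            pending.delete(entry.key)
          }
        }
      } finally it.close()
    })
  }

  override def process(record: Record[String, OrderReceivedEvent]): Unit =
    // Park the order until either its details arrive elsewhere or the punctuator times it out.
    pending.put(record.key, record.value)

  // Assumed mapping from the raw order to the outgoing event when no details were received.
  private def toBatteryOrder(order: OrderReceivedEvent): BatteryOrder =
    BatteryOrder(order.orderId, batteryDetails = None)
}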

How can I change the period for Flowable.interval

Is there a way to change the Flowable.interval period at runtime?
LOGGER.info("Start generating bullshit for 7 seconds:");
Flowable.interval(3, TimeUnit.SECONDS)
.map(tick -> random.nextInt(100))
.subscribe(tick -> LOGGER.info("tick = " + tick));
TimeUnit.SECONDS.sleep(7);
LOGGER.info("Change interval to 2 seconds:");
I have a workaround, but the best way would be to create a new operator.
How does this solution work?
You have a trigger source, which provides values indicating when to start a new interval. That source is switchMapped with an interval as the inner stream; the inner stream takes the value emitted by the upstream source and uses it as the new interval period.
switchMap
When the source emits a time (Long), the switchMap lambda is invoked and the returned Flowable is subscribed to immediately. When a new value arrives at the switchMap, the currently subscribed inner interval Flowable is unsubscribed from and the lambda is invoked once again; the newly returned interval Flowable is then subscribed to.
This means that on each emission from the source, a new interval is created.
How does it behave?
If the interval is subscribed to and about to emit a new value when a new value arrives from the source, the inner stream (the interval) is unsubscribed from, so that pending value is never emitted. The new interval Flowable is subscribed to and emits according to its own configuration.
Solution
lateinit var scheduler: TestScheduler

@Before
fun init() {
    scheduler = TestScheduler()
}

@Test
fun `62232235`() {
    val trigger = PublishSubject.create<Long>()

    val switchMap = trigger.toFlowable(BackpressureStrategy.LATEST)
        // seed the upstream source with a value, so that at least one interval emits
        // even before the trigger provides its first value
        .startWith(3L)
        .switchMap {
            Flowable.interval(it, TimeUnit.SECONDS, scheduler)
                .map { tick: Long? -> tick }
        }

    val test = switchMap.test()

    scheduler.advanceTimeBy(10, TimeUnit.SECONDS)
    test.assertValues(0L, 1L, 2L)

    // send a new onNext value at absolute time 10
    trigger.onNext(10L)

    // the inner stream is unsubscribed and a new stream with interval(10) is subscribed to,
    // so the first value of the new interval is emitted at 20 (current time 10 + the configured 10)
    scheduler.advanceTimeTo(21, TimeUnit.SECONDS)

    // if the switch had not happened, there would be 7 values by now
    test.assertValues(0L, 1L, 2L, 0L)
}
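For completeness, here is a rough non-test sketch of the same pattern, written here in Scala against the RxJava 2 API (the explicit Function and Consumer instances just keep the Java interop unambiguous). This is an illustrative sketch, not part of the original answer; the BehaviorProcessor seeds the initial 3-second period and a later onNext switches the interval to 2 seconds at runtime, mirroring the question:
import java.util.concurrent.TimeUnit

import io.reactivex.Flowable
import io.reactivex.functions.{Consumer, Function}
import io.reactivex.processors.BehaviorProcessor
import org.reactivestreams.Publisher

object SwitchableInterval extends App {
  // The period (in seconds) is itself a stream; createDefault seeds the initial 3-second period.
  val period = BehaviorProcessor.createDefault[java.lang.Long](3L)

  val ticks: Flowable[java.lang.Long] =
    period.switchMap[java.lang.Long](new Function[java.lang.Long, Publisher[java.lang.Long]] {
      override def apply(p: java.lang.Long): Publisher[java.lang.Long] =
        Flowable.interval(p.longValue(), TimeUnit.SECONDS)
    })

  ticks.subscribe(new Consumer[java.lang.Long] {
    override def accept(tick: java.lang.Long): Unit = println(s"tick = $tick")
  })

  TimeUnit.SECONDS.sleep(7)
  println("Change interval to 2 seconds:")
  // The inner interval is unsubscribed and a new interval(2) starts immediately.
  period.onNext(2L)
  TimeUnit.SECONDS.sleep(7)
}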

Kafka streams aggregation duplication

There is a Topology:
.mapValues((key, messages) -> remoteService.sendMessages(messages))
.flatMapValues(results -> results)
.map((key, result) -> KeyValue.pair(getAggregationKey(result), getAggregationResult(systemClock, result)))
.groupByKey(Grouped.with(createJsonSerde(AggregationKey.class), createJsonSerde(AggregationResult.class)))
.windowedBy(timeWindows)
.reduce((aggregatedResult, v) -> {
    int count = aggregatedResult.getCount();
    return aggregatedResult.toBuilder().count(count + 1).build();
})
.suppress(untilWindowCloses(Suppressed.BufferConfig.unbounded()))
TimeWindows:
Duration duration = Duration.ofSeconds(60);
TimeWindows timeWindows = TimeWindows.of(duration).grace(Duration.ZERO);
My assumption is that aggregation results should be sent to the sink topic every 60 seconds or so, but I noticed that it sometimes sends duplicates (the numbers below are not precise): a first event was sent at the 50th second with counter 1000, and then at the 58th second an event with the same key was sent with counter 1050. It does not happen every minute, but it happens quite frequently. Why could this happen?
What I also noticed is that the second event always has a smaller timestamp than the first one but a larger offset. The same holds for the internal reduce topic.

Sample most recent element of Akka Stream with trigger signal, using zipWith?

I have a Planning system that computes a kind of global Schedule from customer orders. This schedule changes over time as customers place or revoke orders, or as certain resources used by events within the schedule become unavailable.
Now another system needs to know the status of certain events in the Schedule. The system sends a StatusRequest(EventName) on a message queue to which I must react with a corresponding StatusSignal(EventStatus) on another queue.
The Planning system gives me an akka-streams Source[Schedule] which emits a Schedule whenever the schedule changed, and I also have a Source[StatusRequest] from which I receive StatusRequests and a Sink[StatusSignal] to which I can send StatusSignal responses.
Whenever I receive a StatusRequest I must inspect the current schedule, ie, the most recent value emitted by Source[Schedule], and send a StatusSignal to the sink.
I came up with the following flow
scheduleSource
  .zipWith(statusRequestSource) { (schedule, statusRequest) =>
    findEventStatus(schedule, statusRequest.eventName)
  }
  .map(eventStatus => makeStatusSignal(eventStatus))
  .runWith(statusSignalSink)
but I am not at all sure when this flow actually emits values and whether it actually implements my requirement (see the requirement above).
The zipWith reference says (emphasis mine):
emits when all of the inputs have an element available
What does this mean? When statusRequestSource emits a value does the flow wait until scheduleSource emits, too? Or does it use the last value scheduleSource emitted? Likewise, what happens when scheduleSource emits a value? Does it trigger a status signal with the last element in statusRequestSource?
If the flow doesn't implement what I need, how could I achieve it instead?
To answer your first set of questions regarding the behavior of zipWith, here is a simple test:
val source1 = Source(1 to 5)
val source2 = Source(1 to 3)
source1
.zipWith(source2){ (s1Elem, s2Elem) => (s1Elem, s2Elem) }
.runForeach(println)
// prints:
// (1,1)
// (2,2)
// (3,3)
zipWith will emit downstream as long as both inputs have respective elements that can be zipped together.
One idea to fulfill your requirement is to decouple scheduleSource and statusRequestSource. Feed scheduleSource to an actor, and have the actor track the most recent element it has received from the stream. Then have statusRequestSource query this actor, which will reply with the most recent element from scheduleSource. This actor could look something like the following:
class LatestElementTracker extends Actor with ActorLogging {
  var latestSchedule: Option[Schedule] = None

  def receive = {
    case schedule: Schedule =>
      latestSchedule = Some(schedule)
    case status: StatusRequest =>
      if (latestSchedule.isEmpty) {
        log.debug("No schedules have been received yet.")
      } else {
        val eventStatus = findEventStatus(latestSchedule.get, status.eventName)
        sender() ! eventStatus
      }
  }
}
To integrate with the above actor:
scheduleSource.runForeach(s => trackerActor ! s)
statusRequestSource
.ask[EventStatus](parallelism = 1)(trackerActor) // adjust parallelism as needed
.map(eventStatus => makeStatusSignal(eventStatus))
.runWith(statusSignalSink)

Omitting all Scala Actor messages except the last

I want to omit all messages of the same type except the last one:
def receive = {
case Message(type:MessageType, data:Int) =>
// remove previous and get only last message of passed MessageType
}
for example when I send:
actor ! Message(MessageType.RUN, 1)
actor ! Message(MessageType.RUN, 2)
actor ! Message(MessageType.FLY, 1)
then I want to receive only:
Message(MessageType.RUN, 2)
Message(MessageType.FLY, 1)
Of course, this should only happen if they are sent very fast, or under high CPU load.
You could wait a very short amount of time, storing the most recent messages that arrive, and then process only those most recent ones. This can be accomplished by sending messages to yourself and using scheduleOnce. See the second example under the Akka HowTo: Common Patterns, Scheduling Periodic Messages. Instead of scheduling ticks whenever the last tick ends, you can wait until new messages arrive. Here's an example of something like that:
import scala.concurrent.duration._

case class ProcessThis(msg: Message)
case object ProcessNow

var onHold = Map.empty[MessageType, Message]
var timer: Option[Cancellable] = None

def receive = {
  case msg @ Message(t, _) =>
    onHold += t -> msg
    if (timer.isEmpty) {
      import context.dispatcher
      timer = Some(context.system.scheduler.scheduleOnce(1.millis, self, ProcessNow))
    }
  case ProcessNow =>
    timer foreach { _.cancel() }
    timer = None
    for (m <- onHold.values) self ! ProcessThis(m)
    onHold = Map.empty
  case ProcessThis(Message(t, data)) =>
    // really process the message
}
Incoming Messages are not actually processed right away, but are stored in a Map that keeps only the last of each MessageType. On the ProcessNow tick message, they are really processed.
You can change the length of time you wait (in my example set to 1 millisecond) to strike a balance between responsiveness (the time from a message arriving to its response) and efficiency (CPU or other resources used or held up).
type is not a good name for a field, so let's use messageType instead. This code should do what you want:
var lastMessage: Option[Message] = None

def receive = {
  case m: Message =>
    if (lastMessage.fold(false)(_.messageType != m.messageType)) {
      // do something with lastMessage.get
    }
    lastMessage = Some(m)
}