I am planning to set up a MySQL to Kafka flow, with the end goal being to schedule a process that recalculates a MongoDB document based on the changed data.
This might involve directly patching the MongoDB documents, or running a process that recreates an entire document.
My question is this: if a set of changes to the MySQL database all relate to one MongoDB document, I don't want to re-run the recalculation for each change in real time; I want to wait for the changes to 'settle' so that I only run the recalculation as needed.
Is there a way to 'debounce' the Kafka stream? E.g. is there a well defined pattern for a Kafka consumer that I can use to implement the logic I want?
At present there's no easy way to debounce events.
The problem, in short, is that Kafka doesn't act based on 'wall clock time'. Processing is generally triggered by incoming events (and the data contained therein), not by arbitrary triggers, like system time.
I'll cover why Suppressed and SessionWindows don't work, the proposed solution in KIP-424, and an untested workaround.
Suppressed
Suppressed has an untilTimeLimit() function, but it isn't suitable for debouncing.
If another record for the same key arrives in the meantime, it replaces the first record in the buffer but does not restart the timer.
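As a minimal sketch of that behaviour (untested, with hypothetical topic names and an arbitrary 30 second limit): untilTimeLimit() emits at most one update per key per time limit, measured from when the key first enters the buffer, so it rate limits rather than debounces.
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Suppressed;

StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> changes =
        builder.table("mysql-changes", Consumed.with(Serdes.String(), Serdes.String()));

changes
        // Hold each key's latest value back for up to 30s, then emit it. The 30s
        // is counted from the first buffered update, not from the most recent one.
        .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(30),
                Suppressed.BufferConfig.unbounded()))
        .toStream()
        .to("settled-changes");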
SessionWindows
I thought that using SessionWindows.ofInactivityGapAndGrace() might work.
First I grouped, windowed, aggregated, and suppressed the input KStream:
val windowedData: KTable<Windowed<Key>, Value> =
inputTopicKStream
.groupByKey()
.windowedBy(
SessionWindows.ofInactivityGapAndGrace(
WINDOW_INACTIVITY_DURATION,
WINDOW_INACTIVITY_DURATION,
)
)
.aggregate(...)
.suppress(
Suppressed.untilWindowCloses(
Suppressed.BufferConfig.unbounded()
)
)
Then I aggregated the windows so that I would have a final state:
windowedData
.groupBy(...)
.reduce(
/* adder */
{ a, b -> a + b },
/* subtractor */
{ a, b -> a - b },
)
However, the problem is that SessionWindows will not close without additional records coming in. Kafka will not close the window on its own; it needs further records to realise that the window can be closed and that suppress() can forward a new record.
This is noted in Confluent's blog https://www.confluent.io/de-de/blog/kafka-streams-take-on-watermarks-and-triggers/
[I]f you stop getting new records wall-clock time will continue to advance, but stream time will freeze. Wall-clock time advances because that little quartz watch in your computer keeps ticking away, but stream time only advances when you get new records. With no new records, stream time is frozen.
KIP-424
KIP-424 proposed an improvement that would allow Suppress to act as a debouncer, but there has been no progress on it for a couple of years.
Workaround
Andrey Bratus provided a simple workaround in the JIRA ticket for KIP-424, KAFKA-7748. I tried it but it didn't compile - I think the Kafka API has evolved since the workaround was posted. I've updated the code, but I haven't tested it.
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;
import org.apache.kafka.streams.state.ValueAndTimestamp;
/**
* THIS PROCESSOR IS UNTESTED
* <br>
* This processor mirrors the source, but waits for an inactivity gap before forwarding records.
* <br>
* The suppression is key based. Newer values will replace previous values, and reset the inactivity
* gap.
*/
public class SuppressProcessor<K, V> implements Processor<K, V, K, V> {
private final String storeName;
private final Duration debounceCheckInterval;
private final long suppressTimeoutMillis;
private TimestampedKeyValueStore<K, V> stateStore;
private ProcessorContext<K, V> context;
/**
* @param storeName The name of the {@link TimestampedKeyValueStore} which will hold
* records while they are being debounced.
* @param suppressTimeout The duration of inactivity before records will be forwarded.
* @param debounceCheckInterval How regularly all records will be checked to see if they are
* eligible to be forwarded. The interval should be shorter than
* {@code suppressTimeout}.
*/
public SuppressProcessor(
String storeName,
Duration suppressTimeout,
Duration debounceCheckInterval
) {
this.storeName = storeName;
this.suppressTimeoutMillis = suppressTimeout.toMillis();
this.debounceCheckInterval = debounceCheckInterval;
}
@Override
public void init(ProcessorContext<K, V> context) {
this.context = context;
stateStore = context.getStateStore(storeName);
context.schedule(debounceCheckInterval, PunctuationType.WALL_CLOCK_TIME, this::punctuate);
}
@Override
public void process(Record<K, V> record) {
final var key = record.key();
final var value = record.value();
// Always store the current wall-clock time so that a newer value for the
// same key resets the inactivity gap, as described in the class Javadoc.
stateStore.put(key, ValueAndTimestamp.make(value, System.currentTimeMillis()));
}
private void punctuate(long timestamp) {
try (var iterator = stateStore.all()) {
while (iterator.hasNext()) {
KeyValue<K, ValueAndTimestamp<V>> storedRecord = iterator.next();
if (timestamp - storedRecord.value.timestamp() > suppressTimeoutMillis) {
final var record = new Record<>(
storedRecord.key,
storedRecord.value.value(),
storedRecord.value.timestamp()
);
context.forward(record);
stateStore.delete(storedRecord.key);
}
}
}
}
}
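For completeness, here is a sketch of how the processor could be wired into a topology. It is untested, the topic and store names and the 30s/1s durations are placeholders, and it assumes Kafka Streams 3.3 or newer, where process() with the new Processor API returns a KStream. The state store must be a TimestampedKeyValueStore registered under the same name that is passed to the processor.
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();
String storeName = "debounce-store";

// The processor reads and writes a TimestampedKeyValueStore, so register one.
builder.addStateStore(
        Stores.timestampedKeyValueStoreBuilder(
                Stores.persistentTimestampedKeyValueStore(storeName),
                Serdes.String(),
                Serdes.String()));

ProcessorSupplier<String, String, String, String> debounce =
        () -> new SuppressProcessor<String, String>(
                storeName, Duration.ofSeconds(30), Duration.ofSeconds(1));

builder.stream("mysql-changes", Consumed.with(Serdes.String(), Serdes.String()))
        .process(debounce, storeName)
        .to("settled-changes", Produced.with(Serdes.String(), Serdes.String()));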
If you are using a Kafka Streams app, you could try to use suppress().
It is designed for windowed KStreams and KTables to "hold back" an update, and it is very useful for rate limiting or for notifications at the end of a window.
There is a useful explanation at https://www.confluent.de/blog/kafka-streams-take-on-watermarks-and-triggers/
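A rough sketch of what that looks like (untested, with a hypothetical topic name, and assuming a recent Kafka Streams version that has TimeWindows.ofSizeAndGrace()): only one result per key and window is forwarded, after the window closes, instead of one result per update.
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(1), Duration.ofSeconds(10)))
        .count()
        // Nothing is emitted until the window closes (window end plus grace).
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
        .toStream()
        .print(Printed.<Windowed<String>, Long>toSysOut().withLabel("closed-windows"));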
Related
I'm consuming multiple Kafka topics, windowing them hourly, and writing them into separate Parquet files for each topic. However, if one of the topics is idle, the window does not get triggered and nothing is written to the FS. For this example, I'm consuming 2 topics with a single partition, with taskmanager.numberOfTaskSlots: 2 and parallelism.default: 1. What is the proper way of solving this problem in Apache Beam with the Flink Runner?
pipeline
.apply(
"ReadKafka",
KafkaIO
.read[String, String]
.withBootstrapServers(bootstrapServers)
.withTopics(topics)
.withCreateTime(Duration.standardSeconds(0))
.withReadCommitted
.withKeyDeserializer(classOf[StringDeserializer])
.withValueDeserializer(classOf[StringDeserializer])
.withoutMetadata()
)
.apply("ConvertToMyEvent", MapElements.via(new KVToMyEvent()))
.apply(
"WindowHourly",
Window.into[MyEvent](FixedWindows.of(Duration.standardHours(1)))
)
.apply(
"WriteParquet",
FileIO
.writeDynamic[String, MyEvent]()
.by(new BucketByEventName())
//...
)
TimeWindow needs data. If the topic is idle, there is no data to close the window, and the window stays open until data arrives. If you want to window data based on processing time instead of actual event time, try using a simple process function:
public class MyProcessFunction extends
        KeyedProcessFunction<KeyDataType, InputDataType, OutputDataType> {
    // The data types can be primitives like String or your custom classes
    private transient ValueState<Long> windowTime;

    @Override
    public void open(final Configuration conf) {
        final ValueStateDescriptor<Long> windowDesc = new ValueStateDescriptor<>("windowTime", Long.class);
        this.windowTime = this.getRuntimeContext().getState(windowDesc); // a normal variable declaration does not work; declare state like this and use it inside the functions
    }

    @Override
    public void processElement(InputDataType input, Context context, Collector<OutputDataType> collector)
            throws IOException {
        if (this.windowTime.value() != null) {
            // delete any existing timer if you want to reset it
            context.timerService().deleteProcessingTimeTimer(this.windowTime.value());
        }
        // register a timer that fires <window interval> milliseconds from now (milliseconds are recommended)
        this.windowTime.update(context.timerService().currentProcessingTime() + <window interval>);
        context.timerService().registerProcessingTimeTimer(this.windowTime.value());
        // ...
    }

    @Override
    public void onTimer(long timestamp, KeyedProcessFunction<KeyDataType, InputDataType, OutputDataType>.OnTimerContext context,
            Collector<OutputDataType> collector) throws IOException {
        // This method is executed when the timer fires
        collector.collect( <whatever you want to stream out> ); // this data will be available downstream in the pipeline
    }
}
Everything I read says this should work: I need my listener to trigger every 10 seconds with events. What I am getting now is a listener trigger for every incoming event. What am I missing? The basic requirement is to create summarized statistics every 10s. Ideally I just want to pump data into the runtime. So, in this example, I would expect a dump of 10 records, once every 10 seconds.
class StreamTest {
private final Configuration configuration = new Configuration();
private final EPRuntime runtime;
private final CompilerArguments args = new CompilerArguments();
private final EPCompiler compiler;
public StreamTest() {
configuration.getCommon().addEventType(CommonLogEntry.class);
runtime = EPRuntimeProvider.getRuntime(this.getClass().getSimpleName(), configuration);
args.getPath().add(runtime.getRuntimePath());
compiler = EPCompilerProvider.getCompiler();
}
@Test
void testDisplayStatsEvery10S() throws Exception{
// Display stats every 10s about the traffic during those 10s:
EPCompiled compiled = compiler.compile("select * from CommonLogEntry.win:time(10)", args);
runtime.getDeploymentService().deploy(compiled).getStatements()[0].addListener(
(old, newEvents, epStatement, epRuntime) ->
Arrays.stream(old).forEach(e -> System.out.format("%s: received %n", LocalTime.now()))
);
new BufferedReader(new InputStreamReader(this.getClass().getResourceAsStream("/access.log"))).lines().map(CommonLogEntry::new).forEachOrdered(e -> {
runtime.getEventService().sendEventBean(e, e.getClass().getSimpleName());
try {
Thread.sleep(TimeUnit.SECONDS.toMillis(1));
} catch (InterruptedException ex) {
System.err.println(ex);
}
});
}
}
Which currently outputs every second, corresponding to the sleep in my stream:
11:00:54.676: received
11:00:55.684: received
11:00:56.689: received
11:00:57.694: received
11:00:58.698: received
11:00:59.700: received
A time window is a sliding window. The chapter on basic concepts in the Esper documentation explains how they work.
It is not clear what the requirements are, but I think what you want to achieve is collecting events for a while and then releasing them. You can draw inspiration from the Esper solution patterns.
This will collect events for 10 seconds.
create schema StockTick(symbol string, price double);
create context CtxBatch start @now end after 10 seconds;
context CtxBatch select * from StockTick#keepall output snapshot when terminated;
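Adapted to the question's setup, a sketch (untested) could compile both EPL statements as one module and attach the listener to the second statement; CommonLogEntry is already registered as an event type, so no create schema is needed:
EPCompiled compiled = compiler.compile(
        "create context CtxBatch start @now end after 10 seconds;\n"
        + "context CtxBatch select * from CommonLogEntry#keepall output snapshot when terminated;",
        args);
// getStatements()[1] is the context-bound select statement
runtime.getDeploymentService().deploy(compiled).getStatements()[1].addListener(
        (newEvents, oldEvents, statement, rt) ->
                System.out.format("%s: received %d events%n",
                        LocalTime.now(), newEvents == null ? 0 : newEvents.length)
);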
I would like to create a Kafka Streams-based application that processes a topic and takes messages in batches of size X (e.g. 50), but if the stream has low flow, gives me whatever the stream has within Y seconds (e.g. 5).
So, instead of processing messages one by one, I process a List[Record] where the size of the list is 50 (or maybe less).
This is to make some I/O bound processing more efficient.
I know that this can be implemented with the classic Kafka API but was looking for a stream-based implementation that can also handle offset committing natively, taking errors/failures into account.
I couldn't find anything related in the docs or by searching around, and was wondering if anyone has a solution to this problem.
@Matthias J. Sax's answer is nice; I just want to add an example for it, as I think it might be useful for someone.
Let's say we want to combine incoming values into the following type:
public class MultipleValues { private List<String> values; } // assumes a Lombok-style builder and getter for the builder()/toBuilder()/getValues() calls below
To collect messages into batches with a maximum size, we need to create a transformer:
public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
private ProcessorContext processorContext;
private String stateStoreName;
private KeyValueStore<String, MultipleValues> keyValueStore;
private Cancellable scheduledPunctuator;
public MultipleValuesTransformer(String stateStoreName) {
this.stateStoreName = stateStoreName;
}
@Override
public void init(ProcessorContext processorContext) {
this.processorContext = processorContext;
this.keyValueStore = (KeyValueStore) processorContext.getStateStore(stateStoreName);
scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
}
@Override
public KeyValue<String, MultipleValues> transform(String key, String value) {
MultipleValues itemValueFromStore = keyValueStore.get(key);
if (isNull(itemValueFromStore)) {
itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
} else {
List<String> values = new ArrayList<>(itemValueFromStore.getValues());
values.add(value);
itemValueFromStore = itemValueFromStore.toBuilder()
.values(values)
.build();
}
if (itemValueFromStore.getValues().size() >= 50) {
processorContext.forward(key, itemValueFromStore);
keyValueStore.put(key, null);
} else {
keyValueStore.put(key, itemValueFromStore);
}
return null;
}
private void doPunctuate(long timestamp) {
    // close the store iterator when done to avoid leaking resources
    try (KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all()) {
        while (valuesIterator.hasNext()) {
            KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
            if (nonNull(keyValue.value)) {
                processorContext.forward(keyValue.key, keyValue.value);
                keyValueStore.put(keyValue.key, null);
            }
        }
    }
}
@Override
public void close() {
scheduledPunctuator.cancel();
}
}
Then we need to create a key-value store, add it to the StreamsBuilder, and build the KStream flow using the transform method:
Properties props = new Properties();
...
Serde<MultipleValues> multipleValuesSerde = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
StreamsBuilder builder = new StreamsBuilder();
String storeName = "multipleValuesStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerde);
builder.addStateStore(storeBuilder);
builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
.transform(() -> new MultipleValuesTransformer(storeName), storeName)
.print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
With this approach we aggregate by the incoming key. If you need to collect messages not by key but by some of the message's fields, you need the following flow to trigger repartitioning of the KStream (by using an intermediate topic):
.selectKey(..)
.through(intermediateTopicName)
.transform( ..)
The simplest way might be to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.
To enforce processing if you don't reach the limit within a certain amount of time, you can register a wall-clock punctuation.
It seems that there is no need to use Processors or Transformers and transform() to batch events by count. Regular groupBy() and reduce()/aggregate() should do the trick:
KeyValueSerde keyValueSerde = new KeyValueSerde(); // simple custom Serde
final AtomicLong batchCount = new AtomicLong(0L);
myKStream
.groupBy((k,v) -> KeyValue.pair(k, batchCount.getAndIncrement() / batchSize),
Grouped.keySerde(keyValueSerde))
.reduce(this::windowReducer) // <-- how you want to aggregate values in batch
.toStream()
.filter((k,v) -> /* pass through full batches only */)
.selectKey((k,v) -> k.key)
...
You'd also need to add a straightforward Serde for the standard KeyValue<String, Long>.
This option is obviously only helpful when you don't need a "punctuator" to emit incomplete batches on a timeout. It also doesn't guarantee the order of elements within a batch in the case of distributed processing.
You can also concatenate the count with the key string to form the new key (instead of using KeyValue). That would simplify the example even further (to using Serdes.String()), as sketched below.
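A sketch of that simplification (untested; it reuses the hypothetical myKStream, batchSize, and windowReducer names from the snippet above):
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.Grouped;

final AtomicLong batchCount = new AtomicLong(0L);
myKStream
        // fold the batch number into the key itself, e.g. "myKey#17"
        .groupBy((k, v) -> k + "#" + (batchCount.getAndIncrement() / batchSize),
                Grouped.keySerde(Serdes.String()))
        .reduce(this::windowReducer)
        .toStream()
        // the "pass through full batches only" filter from above goes here
        .selectKey((k, v) -> k.substring(0, k.lastIndexOf('#')));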
Good day, colleagues.
I have a Kafka project using Spring Kafka that listens to a specific topic.
Once a day, I need to read all of the messages, put them into a collection, and find a specific message there.
I couldn't figure out how to read all of the messages in one @KafkaListener method.
My class is:
@Component
public class KafkaIntervalListener {
public CountDownLatch intervalLatch = new CountDownLatch(1);
private final SCDFRunnerService scdfRunnerService;
public KafkaIntervalListener(SCDFRunnerService scdfRunnerService) {
this.scdfRunnerService = scdfRunnerService;
}
@KafkaListener(topics = "${kafka.interval-topic}", containerFactory = "intervalEventKafkaListenerContainerFactory")
public void intervalListener(IntervalEvent event) throws UnsupportedEncodingException, JSONException {
System.out.println("Recieved interval message: " + event);
IntervalType type = event.getType();
Instant instant = event.getInterval();
List<IntervalEvent> events = new ArrayList<>();
events.add(event);
events.size();
this.intervalLatch.countDown();
}
}
My events collection always has size = 1;
I tried to use different loops, but then my collection became filled with the same message 530,000,000 times.
UPDATE:
I have found a way to do it with factory.setBatchListener(true); but I need to launch it with @Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow"). Right now this method is always listening. Now I am trying something like this:
@Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow")
public void run() throws Exception {
kafkaIntervalListener.intervalLatch.await();
}
It doesn't work; in debug mode my breakpoint at this point is never hit.
The listener container is, by design, message-driven.
For fetching messages on-demand, it's better to use the Kafka Consumer API directly and fetch messages using the poll() method.
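A minimal sketch of that idea (untested; the topic name, schedule, and plain-String values are placeholders, and the usual org.apache.kafka.clients.consumer imports are omitted): create a consumer inside the scheduled method, read everything currently in the topic, then close it.
@Scheduled(cron = "${kafka.cron}", zone = "Europe/Moscow")
public void readWholeTopicOncePerDay() {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

    List<String> messages = new ArrayList<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        // assign all partitions manually so no consumer group is involved
        List<TopicPartition> partitions = consumer.partitionsFor("interval-topic").stream()
                .map(p -> new TopicPartition(p.topic(), p.partition()))
                .collect(Collectors.toList());
        consumer.assign(partitions);
        consumer.seekToBeginning(partitions);

        // read until a poll comes back empty, i.e. the topic has been drained
        ConsumerRecords<String, String> records;
        while (!(records = consumer.poll(Duration.ofSeconds(2))).isEmpty()) {
            records.forEach(r -> messages.add(r.value()));
        }
    }
    // search `messages` for the specific message here
}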
My use case is to identify, in real time, entities from which expected events have not been received after X amount of time.
For example:
If we received a PaymentInitiated event at time T but didn't receive any of PaymentFailed / PaymentAborted / PaymentSucceeded by T+X, then raise a trigger saying PaymentStuck, along with the details of the PaymentInitiated event.
How can I model such use cases in Apache Samza, given that the time period X rolls forward with each event rather than being a fixed interval?
Thanks, Harish
I'm not aware of any native support for this in Samza, but I can imagine a work-around that uses WindowableTask.
public class PaymentEvent implements Comparable<PaymentEvent> {
// if current time > timestamp, payment is stuck
public long timestamp;
// we want a corresponding PaymentFailed... event with the same id
public long interactionId;
// PaymentRequest, PaymentAborted, PaymentSucceeded...
public EventType type; // hypothetical enum with the values listed above
...
@Override
public int compareTo(PaymentEvent o){
return Long.compare(timestamp, o.timestamp);
}
}
Now in your process method you would have something like:
PriorityQueue<PaymentEvent> pqueue;
Map<Long, PaymentEvent> responses;
public void process(...) {
PaymentEvent e = new PaymentEvent(envelope.getMessage());
if (e.type == EventType.PAYMENT_REQUEST) {
pqueue.add(e);
} else {
responses.put(e.interactionId, e);
}
}
And finally, during your window you would pop off the priority queue everything with a timestamp at or before the current time and check whether there is a corresponding event in the map.
public void window(...) {
while (!pqueue.isEmpty() && pqueue.peek().timestamp <= currentTime) {
if (!responses.containsKey(pqueue.poll().interactionId)) {
// send the trigger via the collector
}
}
}
Then lastly you would set the window interval in your configuration to however often you want window() to be called. The config is task.window.ms.