I have a KStream pipeline that groups by key, windows on some interval, and then applies a custom aggregation:
KStream<String, Integer> input = /* define input stream */;

/* group by key and then apply windowing */
KTable<Windowed<String>, MyAggregate> aggregateTable =
    input.groupByKey()
         .windowedBy(/* window definition here */)
         .aggregate(MyAggregate::new, (key, value, agg) -> agg.addAndReturn(value));

// I need a changelog of aggregateTable, so:
aggregateTable.toStream().to("output-topic");
The problem is that the majority of input records will not change the internal state of the MyAggregate object. The structure is similar to:
class MyAggregate {
    private Set<Integer> checkBeforeInsert = /* some predefined values */;
    private List<Integer> actualState = new ArrayList<>();

    public MyAggregate addAndReturn(Integer value) {
        /* for 99% of records the if check passes */
        if (checkBeforeInsert.contains(value)) {
            /* do nothing and return; note that the state hasn't been changed */
            return this;
        } else {
            actualState.add(value);
            return this;
        }
    }
}
However, Kafka Streams has no clue that the aggregate object hasn't changed: it still stores the aggregate (which is the same as the old one), propagates the same old value to the changelog topic, and triggers aggregateTable.toStream() with the same old value.
Although the semantics of my application still work (the rest of the application is aware that an unchanged state might arrive), this creates a huge amount of noise traffic on intermediate topics. I need a way to tell Kafka Streams whether an aggregate has really changed and should be stored, or whether it is the same as the previous record and should simply be ignored.
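For illustration only, a minimal sketch of a partial workaround: keep a "changed" flag on the aggregate and filter unchanged results before they reach the output topic. The isChangedSinceLastEmit accessor is a hypothetical addition, not part of the code above, and this only cuts downstream traffic; the store and changelog writes still happen:

// Hedged sketch: isChangedSinceLastEmit() is a hypothetical flag set by addAndReturn().
aggregateTable
    .toStream()
    .filter((windowedKey, agg) -> agg != null && agg.isChangedSinceLastEmit())
    .to("output-topic");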
I'm using Kafka Streams for an event-deduplication problem over short time windows (<= 1 minute).
First I tried to tackle the problem with the DSL API and the .suppress(Suppressed.untilWindowCloses(...)) operator, but given that wall-clock-time suppression is not yet supported (I've seen KIP-424), this operator is not viable for my use case.
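For reference, a minimal sketch of the DSL approach that was ruled out, assuming String values, a 1-minute window, and an illustrative output topic name; the suppression below advances on stream time rather than wall-clock time, which is exactly the limitation mentioned above:

KTable<Windowed<String>, String> deduped = input
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ZERO))
    .reduce((first, second) -> first) // keep a single event per key and window
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
deduped.toStream().to("deduplicated-topic");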
Then I followed this official Confluent example, in which the low-level Processor API is used. It works fine but has one major limitation for my use case: the single event (obtained by deduplication) is emitted at the beginning of the time window, and subsequent duplicate events are "suppressed". I need the reverse of that, meaning that a single event should be emitted at the end of the window.
I'm asking for suggestions on how to implement this use case with the Processor API.
My idea was to use the Processor API with a custom Transformer and a Punctuator.
The transformer stores the distinct keys it receives in a WindowStore without returning any KeyValue. In parallel, I schedule a punctuator with an interval equal to the window size used in the WindowStore. The punctuator iterates over the entries in the store and forwards them downstream.
The following are some core parts of the logic:
DeduplicationTransformer (slightly modified from the official Confluent example):
@Override
@SuppressWarnings("unchecked")
public void init(final ProcessorContext context) {
    this.context = context;
    eventIdStore = (WindowStore<E, V>) context.getStateStore(this.storeName);

    // Schedule the punctuator for this transformer.
    context.schedule(Duration.ofMillis(this.windowSizeMs), PunctuationType.WALL_CLOCK_TIME,
        new DeduplicationPunctuator<E, V>(eventIdStore, context, this.windowSizeMs));
}
@Override
public KeyValue<K, V> transform(final K key, final V value) {
    final E eventId = idExtractor.apply(key, value);
    if (eventId == null) {
        return KeyValue.pair(key, value);
    } else {
        if (!isDuplicate(eventId)) {
            rememberNewEvent(eventId, value, context.timestamp());
        }
        return null;
    }
}
DeduplicationPunctuator:
public DeduplicationPunctuator(WindowStore<E, V> eventIdStore, ProcessorContext context,
                               long retainPeriodMs) {
    this.eventIdStore = eventIdStore;
    this.context = context;
    this.retainPeriodMs = retainPeriodMs;
}

@Override
public void punctuate(long invocationTime) {
    LOGGER.info("Punctuator invoked at {}, searching from {}", new Date(invocationTime), new Date(invocationTime - retainPeriodMs));

    KeyValueIterator<Windowed<E>, V> it =
        eventIdStore.fetchAll(invocationTime - retainPeriodMs, invocationTime + retainPeriodMs);

    while (it.hasNext()) {
        KeyValue<Windowed<E>, V> next = it.next();
        LOGGER.info("Punctuator running on {}", next.key.key());
        context.forward(next.key.key(), next.value);

        // Delete from the store with a tombstone
        eventIdStore.put(next.key.key(), null, invocationTime);
        context.commit();
    }
    it.close();
}
Is this a valid approach?
With the previous code, I'm running some integration tests and I have some synchronization issues. How can I be sure that the start of the window coincides with the punctuator's scheduled interval?
Also, as an alternative approach, I was wondering (I've googled with no result) whether there is any event triggered by the window closing to which I could attach a callback, in order to iterate over the store and publish only the distinct events.
Thanks.
I have a stream of objects, and I'd like to calculate the average of a field in each object and then save that average back onto the object. I'd like a tumbling window of 5 minutes with a retention of 1 hour. I'm new to Kafka, so I'm wondering whether this is the correct way to approach the problem.
First, I create a persistent store:
StoreBuilder<WindowStore<String, Double>> averagesStoreSupplier =
    Stores.windowStoreBuilder(
        Stores.persistentWindowStore(WINDOW_STORE_NAME, Duration.ofHours(1), Duration.ofMinutes(5), true),
        Serdes.String(),
        Serdes.Double());

streamsBuilder.addStateStore(averagesStoreSupplier);
Then I call my transformer using:
otherKTable
    .leftJoin(objectKTable.transformValues(new AveragingTransformerSupplier(WINDOW_STORE_NAME), WINDOW_STORE_NAME),
              myValueJoiner)
    .to("outputTopic");
And here is my transformer:
public class AveragingTransformerSupplier implements ValueTransformerWithKeySupplier<String, MyObject, MyObject> {

    private final String stateStoreName;

    public AveragingTransformerSupplier(final String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    public ValueTransformerWithKey<String, MyObject, MyObject> get() {
        return new ValueTransformerWithKey<>() {
            private WindowStore<String, Double> averagesStore;

            @Override
            public void init(ProcessorContext processorContext) {
                averagesStore = Try.of(() -> (WindowStore<String, Double>) processorContext.getStateStore(stateStoreName))
                        .getOrElse((WindowStore<String, Double>) null);
            }

            @Override
            public MyObject transform(String s, MyObject myObject) {
                if (averagesStore != null) {
                    averagesStore.put(s, myObject.getNumber());
                    Instant timeFrom = Instant.ofEpochMilli(0); // beginning of time = oldest available
                    Instant timeTo = Instant.now();
                    WindowStoreIterator<Double> itr = averagesStore.fetch(s, timeFrom, timeTo);
                    double sum = 0.0;
                    int size = 0;
                    while (itr.hasNext()) {
                        KeyValue<Long, Double> next = itr.next();
                        size++;
                        sum += next.value;
                    }
                    itr.close();
                    myObject.setNumber(sum / size);
                }
                return myObject;
            }

            @Override
            public void close() {
                if (averagesStore != null) {
                    averagesStore.flush();
                }
            }
        };
    }
}
I have a couple of questions.
First, is the way I define the WindowStore the correct way to form a tumbling window? How would I create a hopping window?
Second, inside my transformer I get all the items from the store from the beginning of time to now. Since I defined it with a 5-minute window and 1-hour retention, does that mean the items in the store are a snapshot of 5 minutes' worth of data? What does the retention do here?
I have this working for trivial cases, but I'm not sure whether there is a better way to do this using aggregations and joins, or even whether I'm doing it correctly. Also, I had to wrap the retrieval of the store in a try/catch because init gets called multiple times, and sometimes I get a "Processor has no access to StateStore" exception.
I would recommend using the DSL instead of the Processor API for this use case. Cf. https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Stream+Usage+Patterns for details.
I have a couple of questions. First, is the way I define the WindowStore the correct way to form a tumbling window? How would I create a hopping window?
A windowed store can be used for either hopping or tumbling windows -- it depends on how you use it in your processor, not on how you create the store, which window semantics you get.
Second, inside my transformer I get all the items from the store from the beginning of time to now. Since I defined it with a 5-minute window and 1-hour retention, does that mean the items in the store are a snapshot of 5 minutes' worth of data? What does the retention do here?
The windowSize parameter you pass when creating the store does not work the way you expect. You would need to code up the windowing logic manually in your Transformer by using put(key, value, windowStartTimestamp) -- at the moment you are using put(key, value), which uses context.timestamp(), i.e. the current record timestamp, as windowStartTimestamp -- I doubt that is what you want. The retention time is based on the window timestamps, i.e. old windows will be deleted after they expire.
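To illustrate what coding up the windowing logic manually could look like, here is a hedged sketch meant to run inside transform(), assuming the ProcessorContext from init() is kept in a field and assuming a 5-minute window (plus a 1-minute advance for the hopping case); all parameters and variable names are illustrative, not taken from the question:

// Tumbling window: align each write to the start of the window the record falls into.
long windowSizeMs = Duration.ofMinutes(5).toMillis();
long recordTs = processorContext.timestamp();
long tumblingStart = recordTs - (recordTs % windowSizeMs);
averagesStore.put(key, value, tumblingStart);

// Hopping window: the same record belongs to several overlapping windows,
// one per advance step, so it is written once per window start.
long advanceMs = Duration.ofMinutes(1).toMillis();
for (long start = recordTs - (recordTs % advanceMs); start > recordTs - windowSizeMs; start -= advanceMs) {
    averagesStore.put(key, value, start);
}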
I'm writing a pipeline to replicate data from one source to another. Info about the data sources is stored in a database (BigQuery). How can I use this data to build read/write endpoints dynamically?
I tried to pass the Pipeline object to my custom DoFn, but it can't be serialized. Later I tried to call getPipeline() on a passed view, but that doesn't work either -- which is actually expected.
I can't know in advance all the tables I need to replicate, so I have to read them from the database (or any other source).
// builds some random view
PCollectionView<IdWrapper> idView = ...;

// reads table meta and replicates data for each table
pipeline.apply(getTableMetaEndpoint().read())
        .apply(ParDo.of(new MyCustomReplicator(idView)).withSideInputs(idView));

private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {
    private final PCollectionView<IdWrapper> idView;

    private MyCustomReplicator(PCollectionView<IdWrapper> idView) {
        this.idView = idView;
    }

    // TableMeta {string: sourceTable, string: destTable}
    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, ProcessContext ctx) {
        long id = ctx.sideInput(idView).getValue();

        // builds a read endpoint which depends on the table meta
        // updates entities
        // stores entities using another endpoint
        idView
            .getPipeline()
            .apply(createReadEndpoint(tableMeta).read())
            .apply(ParDo.of(new SomeFunction(tableMeta, id)))
            .apply(createWriteEndpoint(tableMeta).insert());

        ctx.output(tableMeta);
    }
}
I expect it to replicate the data specified by each TableMeta, but I can't use the pipeline within the DoFn object because it can't be serialized/deserialized.
Is there any way to implement the intended behavior?
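A hedged sketch of one common alternative, not taken from the question: resolve the table list at pipeline-construction time (on the launcher) and expand the graph once per table. fetchTableMetas() and fetchId() are hypothetical helpers that query the metadata source directly; createReadEndpoint, SomeFunction, and createWriteEndpoint are the question's own helpers:

// Hypothetical: query the metadata source directly before building the graph.
List<TableMeta> tableMetas = fetchTableMetas();
long id = fetchId(); // hypothetical stand-in for the idView side input

for (TableMeta tableMeta : tableMetas) {
    pipeline
        .apply(createReadEndpoint(tableMeta).read())
        .apply(ParDo.of(new SomeFunction(tableMeta, id)))
        .apply(createWriteEndpoint(tableMeta).insert());
}
pipeline.run();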
I'm trying to write a simple Kafka Streams application (targeting Kafka 2.2/Confluent 5.2) to transform an input topic with at-least-once semantics into an exactly-once output stream. I'd like to encode the following logic:
For each message with a given key:
Read a message timestamp from a string field in the message value
Retrieve the greatest timestamp we've previously seen for this key from a local state store
If the message timestamp is less than or equal to the timestamp in the state store, don't emit anything
If the timestamp is greater than the timestamp in the state store, or the key doesn't exist in the state store, emit the message and update the state store with the message's key/timestamp
(This is guaranteed to provide correct results based on ordering guarantees that we get from the upstream system; I'm not trying to do anything magical here.)
At first I thought I could do this with the Kafka Streams flatMapValues operator, which lets you map each input message to zero or more output messages with the same key. However, that documentation explicitly warns:
This is a stateless record-by-record operation (cf. transformValues(ValueTransformerSupplier, String...) for stateful value transformation).
That sounds promising, but the transformValues documentation doesn't make it clear how to emit zero or one output message per input message. Unless that's what the "// or null" aside in the example is trying to say?
flatTransform also looked somewhat promising, but I don't need to manipulate the key, and if possible I'd like to avoid repartitioning.
Anyone know how to properly perform this kind of filtering?
You could use a Transformer to implement the stateful operation you described. To not propagate a message downstream, return null from the transform method; this is mentioned in the Transformer Javadoc. You can control propagation explicitly via processorContext.forward(key, value). A simplified example is provided below:
kStream.transform(() -> new DemoTransformer(stateStoreName), stateStoreName);

public class DemoTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext processorContext;
    private String stateStoreName;
    private KeyValueStore<String, String> keyValueStore;

    public DemoTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    public void init(ProcessorContext processorContext) {
        this.processorContext = processorContext;
        this.keyValueStore = (KeyValueStore<String, String>) processorContext.getStateStore(stateStoreName);
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        String existingValue = keyValueStore.get(key);
        if (/* your condition */) {
            processorContext.forward(key, value);
            keyValueStore.put(key, value);
        }
        return null;
    }

    @Override
    public void close() {
    }
}
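One detail the example leaves out is registering the state store with the topology before use; a minimal sketch, assuming an illustrative store name and String serdes for both the key and the stored timestamp value:

// Hedged sketch: register the key-value store referenced by stateStoreName above.
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("latest-timestamp-store"), // illustrative name
        Serdes.String(),
        Serdes.String()));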
I have this case: users collect orders as order lines. I implemented this with a Kafka topic containing order-change events; they are merged, stored in a local key-value store, and broadcast to a second topic as order versions.
I need to somehow react to abandoned orders - ones that were started but have had no change for at least the last x hours.
A simple solution would be to scan the local store every y minutes and post an event changing the order status to Abandoned. It seems I cannot access the store from outside a processor... and it is also not very elegant. Any suggestions are welcome.
--edit
I cannot just add punctuation to the merge/validation transformer, because its output is different and should be routed elsewhere, as in this image (a single Kafka Streams app):
So the "abandoned orders processor/transformer" will be a no-op for its input (the only trigger here is time). Another thing is that in such a case (as in the image) my transformer gets a ForwardingDisabledProcessorContext upon initialization, so I cannot emit any messages from the punctuator. I could just pass a kafkaTemplate bean in and produce new messages directly, but then the whole processor/transformer is just an empty shell used only to access the local store...
This is a snippet of the code I used:
public class AbandonedOrdersTransformer implements ValueTransformer<OrderEvent, OrderEvent> {

    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);

        // main scheduler
        this.context.schedule(TimeUnit.MINUTES.toMillis(5), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            KeyValueIterator<String, Order> iter = this.stateStore.all();
            while (iter.hasNext()) {
                KeyValue<String, Order> entry = iter.next();
                if (OrderStatuses.NEW.equals(entry.value.getStatus()) &&
                        (timestamp - entry.value.getLastChanged().getTime()) > TimeUnit.HOURS.toMillis(4)) {
                    // SEND ABANDON EVENT "event"
                    context.forward(entry.key, event);
                }
            }
            iter.close();
            context.commit();
        });
    }

    @Override
    public OrderEvent transform(OrderEvent orderEvent) {
        // do nothing
        return null;
    }

    @Override
    public void close() {
        // do nothing
    }
}
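A hedged sketch of one way around the ForwardingDisabledProcessorContext limitation: a ValueTransformer is not allowed to forward, but a Transformer wired in via KStream#transform() may call context.forward() from a punctuator. The class name below is illustrative and the construction of the abandon event is elided, as in the original snippet:

public class AbandonedOrdersScanner implements Transformer<String, OrderEvent, KeyValue<String, OrderEvent>> {

    private ProcessorContext context;
    private KeyValueStore<String, Order> stateStore;

    @Override
    public void init(ProcessorContext processorContext) {
        this.context = processorContext;
        this.stateStore = (KeyValueStore<String, Order>) processorContext.getStateStore(KafkaConfig.OPENED_ORDERS_STORE);

        // Forwarding from a punctuator is allowed here, unlike in a ValueTransformer.
        context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Order> iter = stateStore.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, Order> entry = iter.next();
                    // build the abandon event for stale NEW orders and forward it (elided)
                    // context.forward(entry.key, abandonEvent);
                }
            }
        });
    }

    @Override
    public KeyValue<String, OrderEvent> transform(String key, OrderEvent orderEvent) {
        return null; // input is ignored; output is produced only by the punctuator
    }

    @Override
    public void close() {
    }
}

The KStream returned by stream.transform(AbandonedOrdersScanner::new, KafkaConfig.OPENED_ORDERS_STORE) can then be routed to its own topic, keeping it separate from the merge/validation output.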