Using Kafka Streams, I want to emit the latest record of a topic within a time window, and I want to set the timestamp of the output record to the end of the time window the record was registered in.
My problem is that I cannot access the window properties from inside the aggregator.
Here is the code I have so far:
KS0
.groupByKey()
.windowedBy(
TimeWindows.of(Duration.ofSeconds(this.periodicity)).grace(Duration.ZERO)
)
.aggregate(
Constants::getInitialAssetTimeValue,
this::aggregator,
Materialized.<AssetKey, AssetTimeValue, WindowStore<Bytes, byte[]>>as(this.getStoreName()) /* state store name */
.withValueSerde(assetTimeValueSerde) /* serde for aggregate value */
)
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.peek((key, value) -> log.info("key={}, value={}", key, value.toString()))
.to(this.toTopic);
And the aggregation function I am using is this one:
private AssetTimeValue aggregator(AssetKey aggKey, AssetTimeValue newValue, AssetTimeValue aggValue){
// I want to do something like that, but this only works with windowed Keys to which I do
// not have access through the aggregator
// windowEndTime = aggKey.window().endTime().getEpochSecond();
return AssetTimeValue.newBuilder()
.setTimestamp(windowEndTime)
.setName(newValue.getName())
.setValue(newValue.getValue())
.build();
}
Many thanks for your help!
You can manipulate timestamps only via the Processor API. However, you can easily embed the Processor API within the DSL.
For your case, you can insert a transform() between toStream() and to(). Within the Transformer you call context.forward(key, value, To.all().withTimestamp(...)) to set a new timestamp. Additionally, you return null at the end (null means the transformer emits no record itself, since you already emit via context.forward()).
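A minimal sketch of that approach, reusing the names from the question and assuming the stream key after toStream() is Windowed<AssetKey>, so the window end can be read from the key itself (serdes for the final to() are omitted):
KS0
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(this.periodicity)).grace(Duration.ZERO))
    .aggregate(Constants::getInitialAssetTimeValue, this::aggregator,
        Materialized.<AssetKey, AssetTimeValue, WindowStore<Bytes, byte[]>>as(this.getStoreName())
            .withValueSerde(assetTimeValueSerde))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .transform(() -> new Transformer<Windowed<AssetKey>, AssetTimeValue, KeyValue<AssetKey, AssetTimeValue>>() {
        private ProcessorContext context;

        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
        }

        @Override
        public KeyValue<AssetKey, AssetTimeValue> transform(final Windowed<AssetKey> key, final AssetTimeValue value) {
            final long windowEnd = key.window().end(); // window end in epoch milliseconds
            final AssetTimeValue updated = AssetTimeValue.newBuilder()
                    .setTimestamp(windowEnd)
                    .setName(value.getName())
                    .setValue(value.getValue())
                    .build();
            // forward with the new record timestamp and unwrap the windowed key
            context.forward(key.key(), updated, To.all().withTimestamp(windowEnd));
            return null; // the record was already emitted via forward()
        }

        @Override
        public void close() { }
    })
    .to(this.toTopic);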
I am new to Spring WebFlux, and my current Spring Boot application uses a scheduler (annotated with @Scheduled) to read a list of data from the DB, call a REST API concurrently in batches, and then write to an event stream.
I want to achieve the same in Spring WebFlux.
Should I use @Scheduled or schedulePeriodically from WebFlux?
How can I batch items from the DB into smaller sets (say 10 items) and concurrently call the REST API?
At present the app fetches at most 100 records in one scheduler run and then processes them. I am planning to shift to R2DBC; if I do so, can I limit the flow of data to, say, 100 records?
Thanks
1. Should I use @Scheduled or schedulePeriodically from WebFlux?
@Scheduled is an annotation that is part of the Spring Framework scheduling package, while schedulePeriodically is a function that is part of Reactor, so you can't really compare the two. I don't see any problem in using the annotation, since it is part of the core framework.
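A small sketch of what either trigger could look like; syncPipeline() is a hypothetical method that builds the reactive pipeline (DB read, batching, REST calls), and only the scheduling mechanism differs:
// Option 1: Spring's annotation on a bean method; subscribe() starts the pipeline
// because there is no framework subscriber here.
@Scheduled(fixedDelay = 60_000)
public void triggerSync() {
    syncPipeline().subscribe();
}

// Option 2: Reactor's Scheduler API
Disposable task = Schedulers.boundedElastic()
        .schedulePeriodically(() -> syncPipeline().subscribe(), 0, 60, TimeUnit.SECONDS);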
2. How can I batch items from the DB into smaller sets (say 10 items) and concurrently call the REST API?
By using the Flux#buffer functions, which emit a list of items whenever the buffer is full.
Flux.just("1", "2", "3", "4")
.buffer(2)
.doOnNext(list -> {
System.out.println(list.size());
}).subscribe()
Will print 2 each time.
3. At present the app fetches at most 100 records in one scheduler run and then processes them. I am planning to shift to R2DBC; if I do so, can I limit the flow of data to 100?
Yes, as written above: you fetch, then buffer the results into lists of 100. You can then place each list in its own Flux and emit the items again, or process each list of 100 items; that's up to you.
There are a lot of functions in the buffer family, check them out:
Flux#buffer
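A hedged sketch of how the fetch / batch / concurrent-call pipeline could look, assuming a reactive repository in the Spring Data R2DBC style (itemRepository.findAll() returning a Flux) and a hypothetical callRestApi(List<Item>) helper that returns a Mono:
itemRepository.findAll()                          // Flux<Item> streamed from the DB
        .buffer(10)                               // group items into batches of 10
        .flatMap(batch -> callRestApi(batch), 4)  // call the REST API, at most 4 batches in flight
        .subscribe();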
Flux.buffer will collect the emitted items and emit them as lists of the given buffer size.
For batching purposes, you can use Flux.expand or Mono.expand. You only have to provide a condition inside expand to either keep expanding or finally end it.
Here are the examples:
public static void main(String[] args) {
    List<String> list = new ArrayList<>();
    list.add("1");

    Flux.just(list)
        .buffer(2)
        .doOnNext(ls -> {
            // buffer wraps each emitted List<String> into a list, so ls is a List<List<String>>
            System.out.println(ls.getClass());
            System.out.println(ls);
        }).subscribe();

    Flux.just(list)
        .expand(listObj -> {
            // Condition that finally ends the expansion
            if (listObj.size() > 4) {
                return Flux.empty();
            }
            // Add as much data as you need before re-emitting the batch
            list.add("a");
            return Flux.just(listObj);
        }).map(ls -> {
            // expand re-emits the original element type (List<String>), not a list of lists as with buffer
            System.out.println(ls.getClass());
            System.out.println(ls);
            return ls;
        }).subscribe();
}
Output:
class java.util.ArrayList
[[1]] /// Output of buffer list of list
class java.util.ArrayList
[1]
class java.util.ArrayList
[1, a]
class java.util.ArrayList
[1, a, a]
class java.util.ArrayList
[1, a, a, a]
class java.util.ArrayList
[1, a, a, a, a]
I wonder how to write non-blocking code with WebFlux.
Here is what I want to do:
Get all Products by a ProductProperties field (returned as Flux<Product>)
Get a list of values from Flux<Product>.availabilityCalendar
Use the data retrieved in step 2 to fetch some other data (returned as Flux<>) - everything should be a non-blocking operation.
How can I do that? How do I get values from a Flux<Product> and then fetch some other data returned as Flux<>, avoiding blocking calls like block() to retrieve the data needed in the next step for the final data to return?
public Flux<Product> getAllProductsByAvailability(Flux<ProductProperties> productProperties,
Map<String, String> searchParams) {
productProperties
.flatMap(property -> productRepository.findByProductPropertiesId(property.getId())) //1. return Products
.flatMap(product -> Flux.just(product.getAvailabilityCalendar())) //2. how to get Product.availabilityCalendar list as non-blocking operation to work with this data afterwards?
(...)
where:
productRepository.findByProductPropertiesId returns Flux<Product>
Product has field: ArrayList<ProductAvailability> availabilityCalendar
Is it a good approach?
Thank you!
Like this, for example, is how I check that the tags are valid:
Flux.fromIterable(vo.getTags())
    .flatMap(tag -> tagService.findByCode(tag.getCode()).map(TagBo::createByVo))
    .filter(Objects::nonNull)
    .collectList()
    .doOnNext(l -> vo.setTags(l));
Or work on each emitted Product with map/flatMap (Flux has no onNext operator; doOnNext is only for side effects and cannot return a value):
productRepository.findByProductPropertiesId(property.getId())
    .flatMap(product -> {
        // Do things here and return a Publisher (or use map to return a plain value)
        return Flux.fromIterable(product.getAvailabilityCalendar());
    })
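Putting the pieces together, here is a hedged sketch of the fully non-blocking chain from the question; otherDataRepository.findByAvailability and the OtherData result type are hypothetical stand-ins for the "fetch some other data" step, while productRepository and getAvailabilityCalendar come from the question:
public Flux<OtherData> getAllProductsByAvailability(Flux<ProductProperties> productProperties) {
    return productProperties
            // 1. resolve each ProductProperties to its Products
            .flatMap(property -> productRepository.findByProductPropertiesId(property.getId()))
            // 2. fan out over each Product's availability calendar, still non-blocking
            .flatMap(product -> Flux.fromIterable(product.getAvailabilityCalendar()))
            // 3. use each availability entry to fetch the next piece of data
            .flatMap(availability -> otherDataRepository.findByAvailability(availability));
}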
I'm trying to use Kafka Streams to reduce a series of numbers, and I only want a record emitted when the data has changed. It works perfectly, but the problem is that it does not catch up on data from Kafka if the service running the code has been down. So I guess the solution is wrong?
My code:
KGroupedStream<String, JsonNode> groupedStream = filteredStream.groupByKey( Serdes.String(), jsonSerde);
KTable<String, JsonNode> reducedTable = groupedStream.reduce(
(aggValue, newValue) -> Calculate.newValue( newValue, aggValue, logger) ,/* adder */
"reduced-stream-store" /* state store name */);
KStream<String, JsonNode> reducedStream = reducedTable.toStream();
the "Calculate" method :
if (value != oldValue)
return value
else return null.
thanks if you have comments/sugestions
Returning null in your code will delete the entry from the result table. Hence, your code does not do what you expect.
In fact, the DSL operators emit "on update", not "on change", and thus you cannot use the DSL for your use case. There is a ticket that suggests adding "emit on change" semantics (https://issues.apache.org/jira/browse/KAFKA-8770).
As a workaround, you will need to use a custom transform() with a state store instead. For each input record, you check if it exists in the store. If not, emit the record and put it into the store. If it does exist and is the same, don't emit anything. If it is different, emit it and update the store.
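A hedged sketch of that workaround, reusing filteredStream and jsonSerde from the question; the store name "last-emitted-store" is illustrative and builder is assumed to be your StreamsBuilder:
// register a state store that remembers the last emitted value per key
StoreBuilder<KeyValueStore<String, JsonNode>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("last-emitted-store"),
                Serdes.String(),
                jsonSerde);
builder.addStateStore(storeBuilder);

KStream<String, JsonNode> emitOnChange = filteredStream.transform(
        () -> new Transformer<String, JsonNode, KeyValue<String, JsonNode>>() {
            private KeyValueStore<String, JsonNode> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(final ProcessorContext context) {
                store = (KeyValueStore<String, JsonNode>) context.getStateStore("last-emitted-store");
            }

            @Override
            public KeyValue<String, JsonNode> transform(final String key, final JsonNode value) {
                final JsonNode previous = store.get(key);
                if (previous != null && previous.equals(value)) {
                    return null; // unchanged: emit nothing
                }
                store.put(key, value); // new or changed: remember it and emit it
                return KeyValue.pair(key, value);
            }

            @Override
            public void close() { }
        },
        "last-emitted-store");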
I would like help deciding between two paths I could follow, from those more experienced with Kafka Streams in Java. I have two working Java apps that take an inbound stream of integers, perform various calculations and tasks, and create four resultant outbound streams to different topics. The actual calculations/tasks are not important; I am concerned with the two possible methods I could use to define the handlers that perform the math, and any associated risks with my favorite.
Method 1 uses separately defined functions that return an Iterable (a List in practice).
Method 2 uses the more common inline style that places the function within the KStream declaration.
I am very new to Kafka Streams and do not want to head down the wrong path. I like Method 1 because the code is very readable and easy to follow, and the handlers can be tested offline without needing to drive traffic through the streams.
Method 2 seems more common, but as the complexity grows, the code in main() gets polluted. Additionally, I am boxed in to testing the algorithms with stream traffic, which slows development.
Method 1: Separable handlers (partial):
// Take inbound stream from math-input and perform transformations A-D, then write out to 4 streams.
KStream<String, String> source = src_builder.stream("math-input");
source.flatMapValues(value -> transformInput_A(Arrays.asList(value.split("\\W+"))) ).to("math-output-A");
source.flatMapValues(value -> transformInput_B(Arrays.asList(value.split("\\W+"))) ).to("math-output-B");
source.flatMapValues(value -> transformInput_C(Arrays.asList(value.split("\\W+"))) ).to("math-output-C");
source.flatMapValues(value -> transformInput_D(Arrays.asList(value.split("\\W+"))) ).to("math-output-D");
// More code here, removed for brevity.
// Transformation handlers A, B, C, and D.
// ******************************************************************
// Perform data transformation using method A
public static Iterable transformInput_A (List str_array) {
// Imagine some very complex math here using the integer
// values. This could be 50+ lines of code.
for (int i = 0; i < str_array.size(); i++) {
// grab values and perform ops
}
// Return results in string format
return math_results;
}
// End of Transformation Method A
// ******************************************************************
// Imagine similar handlers for methods B, C, and D below.
Method 2: Handlers internal to KStream declaration (partial):
// Take inbound stream from math-input and perform transformations A-D, then write out to 4 streams.
KStream<String, String> inputStream = src_builder.stream("math-input");
KStream<String, String> outputStream_A = inputStream.mapValues(new ValueMapper<String, String>() {
    @Override
    public String apply(String s) {
        String[] str_array = s.split("\\W+");
        // Imagine some very complex math here using the integer
        // values. This could be 50+ lines of code.
        for (int i = 0; i < str_array.length; i++) {
            // grab values and perform ops
        }
        // Return results in string format
        return math_results;
    }
});
// Send the data to the outbound topic A.
outputStream_A.to("math-output-A");
KStream<String, String> outputStream_B ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_B.to("math-output-B");
KStream<String, String> outputStream_C ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_C.to("math-output-C");
KStream<String, String> outputStream_D ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_D.to("math-output-D");
Other than my desire to keep main() neat and push the complexity out of view, am I heading in the wrong direction with Method 1?
Since I am working with JSON values I haven't set up default serdes.
I consume a KStream with the necessary String and Product (JSON) serdes, but the next step (a map operation) fails:
val props = Properties()
props[StreamsConfig.APPLICATION_ID_CONFIG] = applicationName
props[StreamsConfig.BOOTSTRAP_SERVERS_CONFIG] = kafkaBootstrapServers
val productSerde: Serde<Product> = Serdes.serdeFrom(JsonPojoSerializer<Product>(), JsonPojoDeserializer(Product::class.java))
builder.stream(INVENTORY_TOPIC, Consumed.with(Serdes.String(), productSerde))
.map { key, value ->
KeyValue(key, XXX)
}
.aggregate(...)
If I remove the map operation, the execution works fine.
I haven't found a way to specify the serdes for map(). How can it be done?
Error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.ByteArraySerializer / value: org.apache.kafka.common.serialization.ByteArraySerializer) is not compatible to the actual key or value type (key type: java.lang.String / value type: com.codependent.kafkastreams.inventory.dto.Product). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:92)
Multiple issues:
After you call map() you call groupByKey().aggregate(). This triggers a data repartitioning, so after map() the data is written into an internal repartition topic. Therefore, you need to provide the corresponding Serdes within groupByKey(), too.
However, because you don't modify the key, you should actually call mapValues() instead, to avoid the unnecessary repartitioning.
Note that you need to provide Serdes for each operator that should not use the default Serdes from the config. Serdes are not passed along downstream; they are in-place overwrites per operator. (There is work in progress for Kafka 2.1 to improve this.)
If anyone lands here like I did, here's a practical example that demonstrates Matthias' solution. The following code will fail due to a ClassCastException.
builder.stream(
"my-topic",
Consumed.with(Serdes.String(), customSerde))
.map((k, v) -> {
// You've modified the key. Get ready for an error!
String newKey = k.split(":")[0];
return KeyValue.pair(newKey.toString(), v);
})
.groupByKey()
.aggregate(
MyAggregate::new,
(key, data, aggregation) -> {
// You'll never reach this block due to an error similar to:
// ClassCastException while producing data to topic
return aggregation.updateFrom(data);
},
Materialized.<String, MyAggregate> as(storeSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(aggregateSerde)
)
As Matthias has mentioned, "you need to provide corresponding Serdes within groupByKey". Here's how:
builder.stream(
"my-topic",
Consumed.with(Serdes.String(), customSerde))
.map((k, v) -> {
// You've modified the key. Get ready for an error!
String newKey = k.split(":")[0];
return KeyValue.pair(newKey.toString(), v);
})
// Voila, you've specified the serdes required by the new internal
// repartition topic and you can carry on with your work
.groupByKey(Grouped.with(Serdes.String(), customSerde))
.aggregate(
MyAggregate::new,
(key, data, aggregation) -> {
// With the serdes in place, the aggregation now runs without the ClassCastException
return aggregation.updateFrom(data);
},
Materialized.<String, MyAggregate> as(storeSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(aggregateSerde)
)