I want to join two KTables with custom value types.
The documentation makes it clear for default types (with default serdes) - https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#ktable-ktable-join
KTable<String, Long> left = ...;
KTable<String, Double> right = ...;
// Java 8+ example, using lambda expressions
KTable<String, String> joined = left.leftJoin(right,
(leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue /* ValueJoiner */
);
but when I use custom values I get a serialization error and there are no overloads for passing custom serdes. How can I accomplish this?
KTable<String, ModelA> left = ...;
KTable<String, ModelB> right = ...;
// Java 8+ example, using lambda expressions
KTable<String, ModelC> joined = left.leftJoin(right,
(leftValue, rightValue) -> new ModelC("left=" + leftValue.Name + ", right=" + rightValue.Name) /* ValueJoiner */
);
I eventually understood what I was doing wrong.
The error message was a bit misleading:
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters
But I did not want to change the default serdes, and the KTable join has no overload for passing serdes.
The real problem was that I had created the KTable with the stream.toTable method, which has no overload for passing serdes either. What I did instead was to declare the KTable up front (with serdes) and then write the stream to its topic with the stream.to method.
Probably a newbie mistake, but here it is.
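For anyone hitting the same error, here is a rough sketch of what that looks like (the topic name, the streamA/builder/modelASerde variables, and the model types are just placeholders):
// write the stream out with explicit serdes ...
streamA.to("model-a-topic", Produced.with(Serdes.String(), modelASerde));

// ... and declare the KTable up front, again with explicit serdes
KTable<String, ModelA> left =
        builder.table("model-a-topic", Consumed.with(Serdes.String(), modelASerde));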
Using Kafka Streams, I want to emit the latest record of a topic within a time window, and I want to set the timestamp of the output record to the end of the time window the record was registered in.
My problem is that I cannot access the window properties inside the aggregator.
Here is the code I have for now:
KS0
.groupByKey()
.windowedBy(
TimeWindows.of(Duration.ofSeconds(this.periodicity)).grace(Duration.ZERO)
)
.aggregate(
Constants::getInitialAssetTimeValue,
this::aggregator,
Materialized.<AssetKey, AssetTimeValue, WindowStore<Bytes, byte[]>>as(this.getStoreName()) /* state store name */
.withValueSerde(assetTimeValueSerde) /* serde for aggregate value */
)
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.peek((key, value) -> log.info("key={}, value={}", key, value.toString()))
.to(this.toTopic);
And the aggregation function I am using is this one:
private AssetTimeValue aggregator(AssetKey aggKey, AssetTimeValue newValue, AssetTimeValue aggValue){
// I want to do something like that, but this only works with windowed Keys to which I do
// not have access through the aggregator
// windowEndTime = aggKey.window().endTime().getEpochSecond();
return AssetTimeValue.newBuilder()
.setTimestamp(windowEndTime)
.setName(newValue.getName())
.setValue(newValue.getValue())
.build();
}
Many thanks for your help!
You can manipulate timestamps only via the Processor API. However, you can easily use the Processor API embedded in the DSL.
For your case, you can insert a transform() between toStream() and to(). Within the Transformer you call context.forward(key, value, To.all().withTimestamp(...)) to set a new timestamp. Additionally, you return null at the end (returning null means no record is emitted, because you already forwarded it via context.forward()).
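For illustration, here is a minimal sketch of such a Transformer (class and variable names are placeholders; it assumes the key coming out of toStream() is still the Windowed<AssetKey>):
public class WindowEndTimestampTransformer
        implements Transformer<Windowed<AssetKey>, AssetTimeValue, KeyValue<Windowed<AssetKey>, AssetTimeValue>> {

    private ProcessorContext context;

    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<Windowed<AssetKey>, AssetTimeValue> transform(final Windowed<AssetKey> key,
                                                                  final AssetTimeValue value) {
        // forward the record with its timestamp set to the end of its window
        context.forward(key, value, To.all().withTimestamp(key.window().end()));
        // return null so the record is not emitted a second time
        return null;
    }

    @Override
    public void close() { }
}
You would then plug it in between toStream() and to(), e.g. .transform(WindowEndTimestampTransformer::new).to(this.toTopic).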
I'm trying to use Kafka Streams to reduce a series of numbers, and I only want a record out when the data has changed. It works perfectly, but the problem is that it does not catch up on data from Kafka if the service running the code has been down. So I guess the solution is wrong?
My code:
KGroupedStream<String, JsonNode> groupedStream = filteredStream.groupByKey( Serdes.String(), jsonSerde);
KTable<String, JsonNode> reducedTable = groupedStream.reduce(
(aggValue, newValue) -> Calculate.newValue( newValue, aggValue, logger) ,/* adder */
"reduced-stream-store" /* state store name */);
KStream<String, JsonNode> reducedStream = reducedTable.toStream();
the "Calculate" method :
if (value != oldValue)
return value
else return null.
Thanks if you have any comments/suggestions!
Returning null in your code will delete the entry from the result table. Hence, your code does not do what you expect.
In fact, the DSL operators emit "on update", not "on change", and thus you cannot use the DSL for your use case. There is a ticket that suggests adding "emit on change" semantics (https://issues.apache.org/jira/browse/KAFKA-8770).
As a workaround, you will need to use a custom transform() with a state store instead. For each input record, you check whether it exists in the store. If not, emit the record and put it into the store. If it does exist and is the same, don't emit anything. If it is different, emit it and update the store.
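For reference, a rough sketch of that workaround (the store name and types are placeholders, and it reuses the jsonSerde and filteredStream from the question):
StoreBuilder<KeyValueStore<String, JsonNode>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("last-emitted-store"),
                Serdes.String(),
                jsonSerde);
builder.addStateStore(storeBuilder);

KStream<String, JsonNode> changesOnly = filteredStream.transform(() ->
        new Transformer<String, JsonNode, KeyValue<String, JsonNode>>() {
            private KeyValueStore<String, JsonNode> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(final ProcessorContext context) {
                store = (KeyValueStore<String, JsonNode>) context.getStateStore("last-emitted-store");
            }

            @Override
            public KeyValue<String, JsonNode> transform(final String key, final JsonNode value) {
                JsonNode previous = store.get(key);
                if (previous != null && previous.equals(value)) {
                    return null; // unchanged -> emit nothing
                }
                store.put(key, value); // new or changed -> remember it and emit it
                return KeyValue.pair(key, value);
            }

            @Override
            public void close() { }
        },
        "last-emitted-store");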
I would like help from those more experienced with Kafka Streams in Java in deciding between two paths I could follow. I have two working Java apps that take an inbound stream of integers and perform various calculations and tasks, creating four resultant outbound streams to different topics. The actual calculations/tasks are not important; I am concerned with the two possible methods I could use to define the handler that performs the math, and any associated risks with my favorite.
Method 1 uses a separately defined function that returns an Iterable (here backed by a List).
Method 2 uses the more common inline approach that places the function within the KStream declaration.
I am very new to Kafka Streams and do not want to head down the wrong path. I like Method 1 because the code is very readable, easy to follow, and the handlers can be tested offline without needing to drive traffic through the streams.
Method 2 seems more common, but as the complexity grows, the code in main() gets polluted. Additionally, I am boxed in to testing algorithms using stream traffic, which slows development.
Method 1: Separable handlers (partial):
// Take inbound stream from math-input and perform transformations A-D, then write out to 4 streams.
KStream<String, String> source = src_builder.stream("math-input");
source.flatMapValues(value -> transformInput_A(Arrays.asList(value.split("\\W+"))) ).to("math-output-A");
source.flatMapValues(value -> transformInput_B(Arrays.asList(value.split("\\W+"))) ).to("math-output-B");
source.flatMapValues(value -> transformInput_C(Arrays.asList(value.split("\\W+"))) ).to("math-output-C");
source.flatMapValues(value -> transformInput_D(Arrays.asList(value.split("\\W+"))) ).to("math-output-D");
// More code here, removed for brevity.
// Transformation handlers A, B, C, and D.
// ******************************************************************
// Perform data transformation using method A
public static Iterable<String> transformInput_A (List<String> str_array) {
    List<String> math_results = new ArrayList<>();
    // Imagine some very complex math here using the integer
    // values. This could be 50+ lines of code.
    for (int i = 0; i < str_array.size(); i++) {
        // grab values and perform ops
    }
    // Return results in string format
    return math_results;
}
// End of Transformation Method A
// ******************************************************************
// Imagine similar handlers for methods B, C, and D below.
Method 2: Handlers internal to KStream declaration (partial):
// Take inbound stream from math-input and perform transformations A-D, then write out to 4 streams.
KStream<String, String> inputStream = src_builder.stream("math-input");
KStream<String, String> outputStream_A = inputStream.mapValues(new ValueMapper<String, String>() {
    @Override
    public String apply(String s) {
        // Imagine some very complex math here using the integer
        // values. This could be 50+ lines of code.
        String[] str_array = s.split("\\W+");
        String math_results = "";
        for (int i = 0; i < str_array.length; i++) {
            // grab values and perform ops
        }
        // Return results in string format
        return math_results;
    }
});
// Send the data to the outbound topic A.
outputStream_A.to("math-output-A");
KStream<String, String> outputStream_B ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_B.to("math-output-B");
KStream<String, String> outputStream_C ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_C.to("math-output-C");
KStream<String, String> outputStream_D ....
// Use ValueMapper in the KStream declaration just like above. 50+ lines of code
outputStream_D.to("math-output-D");
Other than my desire to keep main() neat and push the complexity out of view, am I heading in the wrong direction with Method 1?
Suppose I have a MutableMap<String, Integer>, and I want to sort on the Integer value.
What would be the recommended way to do that with this library? Is there a utility or method or recommended way of going about this with the Eclipse Collections library?
For example, suppose:
MutableMap<String, Integer> mutableMap = Maps.mutable.empty();
mutableMap.add(Tuples.pair("Three", 3));
mutableMap.add(Tuples.pair("One", 1));
mutableMap.add(Tuples.pair("Two", 2));
And I'd like to end up with a MutableMap<String, Integer> containing the same elements, but ordered/sorted so that the first element is ("One", 1), the second element ("Two", 2), and the third element ("Three", 3).
There is currently no direct API available in Eclipse Collections to sort a Map based on its values.
One alternative would be to flip the map into a MutableSortedMap using flipUniqueValues.
MutableSortedMap<Integer, String> sortedMap = SortedMaps.mutable.empty();
sortedMap.putAll(mutableMap.flipUniqueValues());
System.out.println(sortedMap);
This will give you a MutableSortedMap that is sorted on the Integer keys. The output here will be: {1=One, 2=Two, 3=Three}
You could also store the Pairs in a List first and then group them uniquely using the String key to create the MutableMap. If the values in the Map are the Pair instances, they can be used to create a sorted List, SortedSet or SortedBag using direct APIs.
MutableList<Pair<String, Integer>> list = Lists.mutable.with(
Tuples.pair("Three", 3),
Tuples.pair("One", 1),
Tuples.pair("Two", 2)
);
MutableMap<String, Pair<String, Integer>> map =
list.groupByUniqueKey(Pair::getOne);
System.out.println(map);
MutableList<Pair<String, Integer>> sortedList =
map.toSortedListBy(Pair::getTwo);
MutableSortedSet<Pair<String, Integer>> sortedSet =
map.toSortedSetBy(Pair::getTwo);
MutableSortedBag<Pair<String, Integer>> sortedBag =
map.toSortedBagBy(Pair::getTwo);
System.out.println(sortedList);
System.out.println(sortedSet);
System.out.println(sortedBag);
Outputs:
{One=One:1, Three=Three:3, Two=Two:2}
[One:1, Two:2, Three:3]
[One:1, Two:2, Three:3]
[One:1, Two:2, Three:3]
All of the toSorted methods above operate on the values only. This is why I stored the values as the Pair instances.
Note: I am a committer for Eclipse Collections.
Since I am working with JSON values, I haven't set up default serdes.
I process a KStream, consuming it with the necessary String and Product (JSON) serdes, but the next step (a map operation) fails:
val props = Properties()
props[StreamsConfig.APPLICATION_ID_CONFIG] = applicationName
props[StreamsConfig.BOOTSTRAP_SERVERS_CONFIG] = kafkaBootstrapServers
val productSerde: Serde<Product> = Serdes.serdeFrom(JsonPojoSerializer<Product>(), JsonPojoDeserializer(Product::class.java))
builder.stream(INVENTORY_TOPIC, Consumed.with(Serdes.String(), productSerde))
.map { key, value ->
KeyValue(key, XXX)
}
.aggregate(...)
If I remove the map operation, the execution goes OK.
I haven't found a way to specify the serdes for the map(). How can it be done?
Error:
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.ByteArraySerializer / value: org.apache.kafka.common.serialization.ByteArraySerializer) is not compatible to the actual key or value type (key type: java.lang.String / value type: com.codependent.kafkastreams.inventory.dto.Product). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:92)
Multiple issues:
After you call map(), you call groupByKey().aggregate(). This triggers a data repartitioning, and thus after map() the data is written into an internal topic for repartitioning. Therefore, you need to provide the corresponding Serdes within groupByKey(), too.
However, because you don't modify the key, you should actually call mapValues() instead, to avoid the unnecessary repartitioning.
Note that you need to provide Serdes for each operator that should not use the default Serdes from the config. Serdes are not passed along downstream; they are in-place overwrites for the individual operator. (There is work in progress for Kafka 2.1 to improve this.)
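As a side note on the mapValues() point, here is a minimal sketch in Java (enrich() is just a placeholder for whatever value transformation you need; the topic, Product type, and productSerde mirror the question). Because the key stays untouched, Kafka Streams does not have to repartition before the grouping:
KStream<String, Product> products =
        builder.stream("inventory-topic", Consumed.with(Serdes.String(), productSerde));

// value-only transformation: the key is not modified, so no repartition topic is needed
KStream<String, Product> mapped = products.mapValues(value -> enrich(value));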
If anyone lands here like I did, here's a practical example that demonstrates Matthias' solution. The following code will fail due to a ClassCastException.
builder.stream(
"my-topic",
Consumed.with(Serdes.String(), customSerde))
.map((k, v) -> {
// You've modified the key. Get ready for an error!
String newKey = k.split(":")[0];
return KeyValue.pair(newKey.toString(), v);
})
.groupByKey()
.aggregate(
MyAggregate::new,
(key, data, aggregation) -> {
// You'll never reach this block due to an error similar to:
// ClassCastException while producing data to topic
return aggregation.updateFrom(data);
},
Materialized.<String, MyAggregate> as(storeSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(aggregateSerde)
)
As Matthias has mentioned, "you need to provide corresponding Serdes within groupByKey". Here's how:
builder.stream(
"my-topic",
Consumed.with(Serdes.String(), customSerde))
.map((k, v) -> {
// You've modified the key, so the stream will need to be repartitioned
String newKey = k.split(":")[0];
return KeyValue.pair(newKey.toString(), v);
})
// Voila, you've specified the serdes required by the new internal
// repartition topic and you can carry on with your work
.groupByKey(Grouped.with(Serdes.String(), customSerde))
.aggregate(
MyAggregate::new,
(key, data, aggregation) -> {
// With the serdes provided to groupByKey() above, the aggregation
// now runs without the ClassCastException
return aggregation.updateFrom(data);
},
Materialized.<String, MyAggregate> as(storeSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(aggregateSerde)
)