Spring Cloud 3.1 documentation on how to use KTable in a Spring Boot app

I'm struggling to find any documentation on how to use Spring Cloud Stream to take a Kafka topic and put it into a KTable.
Having looked for documentation, for example here: https://cloud.spring.io/spring-cloud-static/spring-cloud-stream-binder-kafka/3.0.0.RC1/reference/html/spring-cloud-stream-binder-kafka.html#_materializing_ktable_as_a_state_store, nothing is very concrete about how to do this in Spring Boot using annotations.
I was hoping I could just create a simple KTable from a KStream, where my application.properties has this:
spring.cloud.stream.bindings.process-in-0.destination: my-topic
Then in my configuration I was hoping I could do something like this:
@Bean
public Consumer<KStream<String, String>> process() {
    return input -> input.toTable(Materialized.as("my-store"));
}
Please advise what I'm missing.

If all you want to do is consume data from a Kafka topic as a KTable, then you can do it as below.
@Bean
public Consumer<KTable<String, String>> process() {
    return input -> {
    };
}
If you want to materialize the table into a named store, then you can add this to the configuration.
spring.cloud.stream.kafka.streams.bindings.process-in-0.consumer.materializedAs: my-store
You could also do what you had in the question, i.e. receive it as a KStream and then convert it to a KTable. However, if that is all you need to do, you might as well receive it as a KTable in the first place, as I suggest here.
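For completeness, here is a rough sketch of how a store materialized under that name is typically read back afterwards, using the binder's InteractiveQueryService; the store name my-store matches the examples above, and the lookup key is just a placeholder.
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.stream.binder.kafka.streams.InteractiveQueryService;
import org.springframework.stereotype.Component;

@Component
public class MyStoreReader {

    @Autowired
    private InteractiveQueryService interactiveQueryService;

    public String lookup(String key) {
        // Obtain the state store the binder materialized under the name "my-store"
        ReadOnlyKeyValueStore<String, String> store =
                interactiveQueryService.getQueryableStore("my-store", QueryableStoreTypes.keyValueStore());
        return store.get(key); // returns null if the key is not present
    }
}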

Related

Flink KafkaSource with multiple topics, each topic with a different Avro schema of data

If I create multiple of the below, one for each topic:
KafkaSource<T> kafkaDataSource = KafkaSource.<T>builder()
        .setBootstrapServers(consumerProps.getProperty("bootstrap.servers"))
        .setTopics(topic)
        .setDeserializer(deserializer)
        .setGroupId(identifier)
        .setProperties(consumerProps)
        .build();
the deserializer seems to get into some issue and ends up reading data from a different topic, with a different schema than the one it was meant for, and fails!
If I provide all the topics in the same KafkaSource, then the watermarks seem to progress across the topics together.
DataStream<T> dataSource = environment.fromSource(
        kafkaDataSource,
        WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
                .withTimestampAssigner((event, timestamp) -> { ... }),
        "");
Also, the Avro data in Kafka itself holds the leading magic byte for the schema (the schema info is embedded); I am not using any external Avro registry (it's all in the libraries).
It works fine with FlinkKafkaConsumer (created multiple instances of it).
FlinkKafkaConsumer<T> kafkaConsumer = new FlinkKafkaConsumer<>(topic, deserializer, consumerProps);
kafkaConsumer.assignTimestampsAndWatermarks(
        WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
                .withTimestampAssigner((event, timestamp) -> { ... }));
I'm not sure whether the problem is the way I am using it? Any pointers on how to solve this would be appreciated. Also, FlinkKafkaConsumer is deprecated.
I figured it out based on the code in Custom avro message deserialization with Flink: implement the open method and make the instance fields of the deserializer transient.
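For reference, a minimal sketch of that fix, assuming a hypothetical Avro-generated class MyRecord; the decoding details are illustrative, the point being the transient field re-created in open():
import java.io.IOException;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// MyRecord is a placeholder for your Avro-generated class
public class MyAvroDeserializer implements KafkaRecordDeserializationSchema<MyRecord> {

    // Non-serializable helpers must be transient: Flink serializes the schema
    // instance when it ships the job graph to the task managers.
    private transient SpecificDatumReader<MyRecord> datumReader;

    @Override
    public void open(DeserializationSchema.InitializationContext context) {
        // Re-created on each task after the schema instance has been deserialized
        datumReader = new SpecificDatumReader<>(MyRecord.class);
    }

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<MyRecord> out) throws IOException {
        // record.value() is assumed to hold plain Avro bytes; adjust here if your format prepends schema/magic bytes
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(record.value(), null);
        out.collect(datumReader.read(null, decoder));
    }

    @Override
    public TypeInformation<MyRecord> getProducedType() {
        return TypeInformation.of(MyRecord.class);
    }
}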

How to make a Spring Cloud Stream consumer in a Webflux application?

I have a Webflux-based microservice that has a simple reactive repository:
public interface NotificationRepository extends ReactiveMongoRepository<Notification, ObjectId> {
}
Now I would like to extend this microservice to consume event messages from Kafka. The message/event will then be saved into the database.
For the Kafka listener I used Spring Cloud Stream. I created a simple Consumer and it works great: I'm able to consume the message and save it into the database.
@Bean
public Consumer<KStream<String, Event>> documents(NotificationRepository repository) {
    return input ->
            input.foreach((key, value) -> {
                LOG.info("Received event, Key: {}, value: {}", key, value);
                repository.save(initNotification(value)).subscribe();
            });
}
But is this the correct way to connect a Spring Cloud Stream consumer and a reactive repository? It doesn't look like it is when I have to call subscribe() at the end.
I read the Spring Cloud Stream documentation (for the 3.0.0 release) and it says
Native support for reactive programming - since v3.0.0 we no longer distribute spring-cloud-stream-reactive modules and instead relying on native reactive support provided by spring cloud function. For backward compatibility you can still bring spring-cloud-stream-reactive from previous versions.
and also in this presentation video they mention they have reactive programming support using Project Reactor. So I guess there is a way, I just don't know it. Can you show me how to do it right?
I apologize if this all sounds too stupid, but I'm very new to Spring Cloud Stream and reactive programming and haven't found many articles describing this.
Just use Flux as the consumed type, something like this:
@Bean
public Consumer<Flux<Message<Event>>> documents(NotificationRepository repository) {
    return input -> input
            .map(message -> /* map the necessary value, e.g.: */ message.getPayload().getEventValue())
            .concatMap(value -> repository.save(initNotification(value)))
            .subscribe();
}
If you use a Function with an empty return type (Function<Flux<Message<Event>>, Mono<Void>>) instead of a Consumer, then the framework can subscribe automatically. With a Consumer you have to subscribe manually, because the framework has no reference to the stream. But in the Consumer case you subscribe not to the repository but to the whole stream, which is OK.
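To illustrate, a sketch of that Function variant, assuming the same NotificationRepository and initNotification helper from the question; returning Mono<Void> lets the framework do the subscribing:
@Bean
public Function<Flux<Message<Event>>, Mono<Void>> documents(NotificationRepository repository) {
    return flux -> flux
            .map(Message::getPayload)
            .concatMap(event -> repository.save(initNotification(event)))
            .then(); // the framework subscribes to the returned Mono<Void>
}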

How does Spring Cloud Stream Kafka consume in bulk?

With native Kafka, KafkaConsumer.poll() can be used to obtain messages in batches. How do I configure this after integrating spring-cloud-starter-stream-kafka (3.0.10.RELEASE)? I didn't understand the official documentation.
I hope you can give an example, thank you.
See https://docs.spring.io/spring-cloud-stream/docs/3.0.10.RELEASE/reference/html/spring-cloud-stream.html#_batch_consumers
Set the batch-mode consumer binding property and use
@Bean
Consumer<List<Foo>> input() {
    return list -> {
        System.out.println(list);
    };
}
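As a sketch, the corresponding property, assuming the default binding name input-in-0 that the framework derives from the input() bean above:
spring.cloud.stream.bindings.input-in-0.consumer.batch-mode: true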

Spring Cloud Stream Kafka with Data Transformation

I'm currently creating a microservice using Spring Cloud Stream with Kafka. I'm new to both worlds, so please forgive me if I ask a stupid question.
The flow of this service is to consume from a topic, transform the data and then produce the result to various topics.
The data that comes into the consumer topic is database changes. Basically we have another service monitoring the database log and producing the changes to the topic that my new service consumes from.
In my new service I have defined consumer and producer bindings. The database data comes in as byte[]; the consumer reads the data and decodes the byte[] into a Java object DBData, and then, based on the table name, different conversions are performed.
Please see the following code example.
@StreamListener
@SendTo("OUTPUT_MOCK")
public KStream<String, Mock> process(@Input("DB_SOURCE") KStream<String, byte[]> input) {
    return input
            .map((String key, byte[] bytes) -> {
                try {
                    return new KeyValue<>(key, DBDecoder.decode(bytes)); // decodes byte[] into DBData
                } catch (Exception e) {
                    return new KeyValue<>(key, null);
                }
            })
            .filter((key, v) -> (v instanceof DBData)) // filter value type
            .map((key, v) -> new KeyValue<>(key, (DBData) v))
            .filter((String key, DBData v) -> v.getTableName().equals("MOCK")) // check the table name
            .flatMap((String key, DBData v) -> extractMockDataChanges(v)); // convert to Mock objects
}
From the code example you can see that the DB data comes in and is decoded into DBData format. The result is then filtered based on the table name, and eventually the converted Mock objects are produced to the OUTPUT_MOCK topic.
This code works perfectly, but my main issue is the DBData transformation part. The OUTPUT_MOCK topic is only one of many producer topics. I have to do this for many other tables, and each time I would have to repeat the decoding process, which seems unnecessary and redundant.
Is there a better way to handle the database data transformation so that the transformed DBData is available to other stream processors?
PS: I looked into the state store, but it seems to be overkill, as Kafka would serialize the data when adding it to the store and deserialize it again upon extraction. That is extra overhead which I'm trying to avoid.
If you are building on Confluent, then the Kafka Connect JDBC source connector will be an ideal use case for your scenario.
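For illustration only, a rough sketch of what such a JDBC source connector configuration might look like; every name and value below is a placeholder, not taken from the question:
name=db-changes-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://localhost:5432/mydb
connection.user=user
connection.password=secret
mode=timestamp+incrementing
timestamp.column.name=updated_at
incrementing.column.name=id
table.whitelist=MOCK
topic.prefix=db-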

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream to different Kafka topics.
Basically, I have a network with a leader linked to many children. For each child, the leader needs to write the stream it reads to a child-specific Kafka topic, so that the child can read it.
When a child is started, it registers itself in the Kafka topic read by the leader.
The problem is that I don't know a priori how many children I have.
For example, if I read 1 from the Kafka topic, I want to write the stream to just one Kafka topic, named Topic1.
If I read 1-2, I want to write to two Kafka topics (Topic1 and Topic2).
I don't know if this is possible, because in order to write to a topic I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires the number of sinks to be known a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend outputting it as a pair (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
    @Override
    public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
        for (String topic : topics) {
            out.collect(Tuple2.of(topic, value));
        }
    }
});
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
        "default-topic",
        new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
            @Override
            public String getTargetTopic(Tuple2<String, MyValue> element) {
                return element.f0;
            }
            ...
        },
        kafkaProperties)
);
KeyedSerializationSchema is now deprecated. Instead you have to use KafkaSerializationSchema.
The same can be achieved by overriding the serialize method:
@Override
public ProducerRecord<byte[], byte[]> serialize(String inputString, @Nullable Long timestamp) {
    // customTopicName and key are fields of the enclosing KafkaSerializationSchema implementation
    return new ProducerRecord<>(customTopicName,
            key.getBytes(StandardCharsets.UTF_8),
            inputString.getBytes(StandardCharsets.UTF_8));
}