Spring Cloud Stream Kafka with Data Transformation - apache-kafka

I'm currently creating a microservice using Spring Cloud Stream with Kafka. I'm new to both worlds, so please forgive me if I ask a stupid question.
The flow of this service is to consume from a topic, transform the data and then produce the result to various topics.
The data that comes into the consumer topic is database changes. Basically, we have another service monitoring the database log and producing the changes to the topic that my new service consumes from.
In my new service I have defined consumer and producer bindings. The database data comes in as byte[], the consumer reads the data and decodes the byte[] into a Java object DBData, and then, based on the table name, a different conversion is performed.
Please see the following code example.
@StreamListener
@SendTo("OUTPUT_MOCK")
public KStream<String, Mock> process(@Input("DB_SOURCE") KStream<String, byte[]> input) {
    return input
        .map((String key, byte[] bytes) -> {
            try {
                return new KeyValue<>(key, DBDecoder.decode(bytes)); // decodes byte[] into DBData
            } catch (Exception e) {
                return new KeyValue<>(key, null);
            }
        })
        .filter((key, v) -> (v instanceof DBData)) // filter value type
        .map((key, v) -> new KeyValue<>(key, (DBData) v))
        .filter((String key, DBData v) -> v.getTableName().equals("MOCK")) // check the table name
        .flatMap((String key, DBData v) -> extractMockDataChanges(v)); // convert to Mock object
}
From the code example you can see that the DB data comes in and is decoded into DBData format. The result is then filtered based on the table name, and the converted Mock objects are eventually produced to the OUTPUT_MOCK topic.
This code works perfectly, but my main issue is the DBData transformation part. The OUTPUT_MOCK topic is only one of many producer topics. I have to do this for many other tables, and each time I would have to repeat the decoding process, which seems unnecessary and redundant.
Is there a better way to handle the database data transformation so the transformed DBData would be available for other stream processors?
PS: I looked into the state store, but it seems to be overkill, as Kafka would serialize the data when adding it to the store and deserialize it again on retrieval. That is extra overhead which I'm trying to avoid.
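To make the intent concrete, here is a sketch of the kind of single-decode pipeline I'm after (assuming the Kafka Streams binder's support for multiple @SendTo output bindings; OUTPUT_OTHER and extractOtherDataChanges are hypothetical placeholders):

@StreamListener
@SendTo({"OUTPUT_MOCK", "OUTPUT_OTHER"}) // OUTPUT_OTHER is a hypothetical extra binding
public KStream<String, ?>[] process(@Input("DB_SOURCE") KStream<String, byte[]> input) {
    // decode the raw bytes exactly once
    KStream<String, DBData> decoded = input
        .mapValues(bytes -> {
            try {
                Object decodedValue = DBDecoder.decode(bytes);
                return decodedValue instanceof DBData ? (DBData) decodedValue : null;
            } catch (Exception e) {
                return null;
            }
        })
        .filter((key, v) -> v != null);

    // derive one output stream per table from the shared decoded stream
    return new KStream[] {
        decoded.filter((key, v) -> v.getTableName().equals("MOCK"))
               .flatMap((key, v) -> extractMockDataChanges(v)),
        decoded.filter((key, v) -> v.getTableName().equals("OTHER"))  // hypothetical table
               .flatMap((key, v) -> extractOtherDataChanges(v))       // hypothetical converter
    };
}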

If you are building on Confluent, then the Kafka Connect JDBC source connector is an ideal fit for your scenario.

How to test a ProducerInterceptor in a Kafka Streams topology?

I have the requirement to pipe records from one topic to another, keeping the original partitioning intact (the original producers use a non-native Kafka partitioner). The reason for this is that the source topic is uncompressed, and we wish to "reprocess" the data into a compressed topic - transparently, from the point of view of the original producers and consumers.
I have a trivial KStreams topology that does this using a ProducerInterceptor:
void buildPipeline(StreamsBuilder streamsBuilder) {
    streamsBuilder
        .stream(topicProperties.getInput().getName())
        .to(topicProperties.getOutput().getName());
}
together with:
interceptor.classes: com.acme.interceptor.PartitionByHeaderInterceptor
This interceptor looks at the message headers (which contain a partition id header) and simply redirects the ProducerRecord to the original partition of the topic:
@Override
public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
    int partition = extractSourcePartition(record);
    return new ProducerRecord<>(record.topic(), partition, record.timestamp(), record.key(), record.value(), record.headers());
}
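extractSourcePartition is not shown above; a minimal sketch of what it might look like, assuming a hypothetical header named source-partition that carries the partition id as a UTF-8 string:

// Hypothetical helper: the header name and its encoding are assumptions, not part of the question.
private int extractSourcePartition(ProducerRecord<K, V> record) {
    org.apache.kafka.common.header.Header header = record.headers().lastHeader("source-partition");
    if (header == null) {
        throw new IllegalStateException("Missing source-partition header");
    }
    return Integer.parseInt(new String(header.value(), java.nio.charset.StandardCharsets.UTF_8));
}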
My question is: how can I test this interceptor in a test topology (i.e. integration test)?
I've tried adding:
streamsConfiguration.put(StreamsConfig.producerPrefix("interceptor.classes"),
PartitionByHeaderInterceptor.class.getName());
(which is how I enable the interceptor in production code)
to my test topology stream configuration, but my interceptor is not called by the test topology driver.
Is what I'm trying to do currently technically possible?

How to check messages for uniqueness in Kafka?

There is a topic Users with partitions.
Each partition has messages with user data.
How can I avoid duplicates, for example not allow inserting the same user name twice?
If I got this right, I should create a separate topic Usernames and append all requested usernames to it.
Then, before adding a new user to topic Users, I check that there are no duplicates in topic Usernames, right?
Preferably using Streams.
I assume you are talking about a scenario where you are trying to publish events to a Kafka topic from a microservice.
Also, I am assuming you want to publish user profiles, with the username as the key and the user profile as the value.
There are 2 deduplication issues here:
1.) Your service might receive the same username at different times and publish it to the topic each time.
2.) Duplicate message processing - during a broker failure (ack not received) or Kafka client failure, the same message can be reprocessed because the Kafka client does not have the ack info.
This can be taken care of by enabling idempotence on the Kafka producers and using atomic transactions (refer to exactly-once processing); a sketch follows below.
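As a side note, a minimal sketch of the producer settings involved (standard Kafka producer client properties; the bootstrap address and transactional id are placeholders):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");            // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// idempotence guards against broker-side duplicates caused by retries
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.ACKS_CONFIG, "all");
// a transactional id enables atomic, exactly-once writes across topics/partitions
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "user-service-tx-1");          // placeholder

Producer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();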
I believe your question is about 1.), where your service receives duplicate messages.
Solution 1:
If you are using a microservice, you can keep an in-memory cache/DB of usernames and publish to Kafka only when no duplicate is found.
Solution 2: (handle it in Kafka itself using Streams)
Input topic - users
Build a Kafka Streams client with a state store (KeyValueStore) and a transformer to implement your dedupe logic.
So your Kafka Streams client consumes the events from the users topic, transforms them in UsersDedupeTransformer (where you have the dedupe logic) and then produces to the output topic (as per your requirement):
StoreBuilder<KeyValueStore<String, String>> storeBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("userDedupeStore"),
        Serdes.String(),
        Serdes.String())
    .withCachingEnabled();

builder.addStateStore(storeBuilder)
    .stream("users-topic", Consumed.with(Serdes.String(), Serdes.String()))
    .transform(() -> new UsersDedupeTransformer(), "userDedupeStore")
    .to("destination-topic");
In UsersDedupeTransformer, the dedupe store is configured in init() and the transform method is overridden:
@Override
public void init(ProcessorContext context) {
    this.context = context;
    dedupeStore = (KeyValueStore<String, String>) context.getStateStore("userDedupeStore");
}

@Override
public KeyValue<String, String> transform(String key, String value) {
    // forward only keys that have not been seen before, and remember them in the store
    if (key != null && dedupeStore.get(key) == null) {
        dedupeStore.put(key, value);
        return KeyValue.pair(key, value);
    }
    return null; // duplicate (or null key): drop the record
}
This dedupe store can be configured as in-memory, or persisted using RocksDB.
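For instance, an in-memory variant of the store builder above (same Streams API, store name unchanged; state would be rebuilt from the changelog topic on restart) might look like:

StoreBuilder<KeyValueStore<String, String>> inMemoryStoreBuilder = Stores.keyValueStoreBuilder(
        Stores.inMemoryKeyValueStore("userDedupeStore"),
        Serdes.String(),
        Serdes.String());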

How to read from a string-delimited Kafka topic then convert it to JSON?

I'm currently working on a POC project that uses Kafka Connect to read from a Kafka topic and write to a database. I am aware that the JDBC sink connector requires a schema to work. However, all of our Kafka topics are string-delimited. Apart from creating a new topic with JSON or Avro, I'm planning to create an API that converts the strings to JSON. Are there any other solutions that I can try?
If I understood your use case, I would definitely try Kafka Streams to process your delimited String topic and sink it into a JSON topic. Then, you can use this JSON topic as input to your database.
With Kafka Streams, you can map each String record of your topic and apply logic to it to generate JSON from your data. The processed records could be written to your results topic either as Strings in JSON format or as a JSON type using the Kafka JSON Serdes (Kafka Streams data types).
This approach also adds flexibility with your input data, you can process, transform or ignore each delimited field of your string.
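A minimal sketch of that idea (the delimiter, field names and topic names below are assumptions, and the JSON is built with Jackson's ObjectMapper):

ObjectMapper mapper = new ObjectMapper();

StreamsBuilder builder = new StreamsBuilder();
builder.stream("delimited-input-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .mapValues(line -> {
           String[] parts = line.split("\\|");      // assuming e.g. "id|name|email"
           ObjectNode json = mapper.createObjectNode();
           json.put("id", parts[0]);
           json.put("name", parts[1]);
           json.put("email", parts[2]);
           return json.toString();                  // a JSON-formatted String for the sink
       })
       .to("json-output-topic", Produced.with(Serdes.String(), Serdes.String()));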
If you want to get started, check out the Kafka Demo and the Confluent Kafka-Streams examples for more complex use cases.
Let's create a Kafka Streams application. You would consume the data as
KStream<String, String> (assuming the key is also a String)
and then, in the Kafka Streams app, transform it to KStream<String, YourAvroClass> and produce the messages to the target Avro topic.
KStream<String, YourAvroClass> avroKStream = ...
avroKStream.to("avro-output-topic", Produced.with(Serdes.String(), yourAvroClassSerde));
The solution to your problem would be creating a custom Converter. Unfortunately, there is no blog post or online resource on how to do that step by step, but you can have a look at how the StringConverter is built. It's a pretty straightforward API that uses serializers and deserializers to convert the data.
Later edit:
A converter uses serializers and deserializers, so the only thing you need to do is create your own:
public class CustomStringConverter implements Converter {

    private final CustomStringSerializer serializer = new CustomStringSerializer();
    private final CustomStringDeserializer deserializer = new CustomStringDeserializer();

    // --- boilerplate code resulting from implementing the Converter interface ---

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        try {
            return serializer.serialize(topic, value == null ? null : value.toString());
        } catch (SerializationException e) {
            throw new DataException("Failed to serialize to a string: ", e);
        }
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        try {
            return new SchemaAndValue(createYourOwnSchema(topic, value), deserializer.deserialize(topic, value));
        } catch (SerializationException e) {
            throw new DataException("Failed to deserialize string: ", e);
        }
    }
}
You will notice that toConnectData returns a SchemaAndValue object, and that is exactly what you need to customize. You can create your own createYourOwnSchema method that builds the schema based on the topic (and possibly the value), and then the deserializer deserializes your data.
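As an illustration only, a hypothetical createYourOwnSchema for a fixed, pipe-delimited layout could be built with Kafka Connect's SchemaBuilder (the field names are made up; note that if you return a struct schema, the accompanying value would also need to be converted into a matching Struct rather than a plain String):

// Hypothetical: a fixed schema for a delimited record such as "id|name|email".
private Schema createYourOwnSchema(String topic, byte[] value) {
    return SchemaBuilder.struct()
            .name(topic + "-schema")
            .field("id", Schema.STRING_SCHEMA)
            .field("name", Schema.STRING_SCHEMA)
            .field("email", Schema.OPTIONAL_STRING_SCHEMA)
            .build();
}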

Kafka Streams multiple objects in a topic and deserialization

In my Kafka Streams application, I use one topic for multiple object types, serialized as JSON. I use the class name as the key, and my idea was that consumers would filter only a subset of incoming entries by key and deserialize them from JSON. I assumed that I could apply the initial filtering without defining serdes, but in that case the source stream is inferred as <Object, Object> and the following code does not compile:
return streamsBuilder.stream("topic")
    .filter((k, v) -> k.equals("TestClassA"))
    .groupByKey()
    .reduce((oldValue, newValue) -> newValue,
        Materialized.<String, TestClassA, KeyValueStore<Bytes, byte[]>>as(StoreManager.STORE_NAME)
            .withKeySerde(Serdes.String())
            .withValueSerde(new JsonSerde<>(TestClassA.class)));
It compiles if I add types to the stream definition:
return streamsBuilder.stream(businessEntityTopicName, Consumed.with(Serdes.String(), new JsonSerde<>(TestClassA.class))) {...}
But in this case I get runtime exceptions when, for example, an object of TestClassB appears in the topic.
What is the best practice for such cases or should I just use different topics for different objects?
If you don't specify any Serde in stream() and don't overwrite the default from StreamsConfig, Kafka Streams will use the byte-array serdes. Thus, you would get
KStream<byte[], byte[]> streams = builder.<byte[], byte[]>stream("topicName");
Note that Java itself falls back to KStream<Object, Object> if you don't specify the correct types on the right-hand side, as shown above. But the actual type at runtime would be byte[] in both cases.
Thus, you could apply a filter, but it would need to work on the byte[] data type.
I guess what you actually want to do is to specify only a String serde for the key:
KStream<String, byte[]> streams = builder.<String, byte[]>stream("topicName", Consumed.with(Serdes.String(), null)); // null falls back to the default Serde from StreamsConfig
This allows you to apply your filter() based on String keys before the groupByKey() operation.
I have a similar use case. I make all the possible objects inherit a common interface (Event) and annotate it with @JsonTypeInfo so Jackson can deserialize the objects properly.
streamsBuilder.stream("topic") // needs some sort of JsonSerde<Event> on this stream call; I personally use the one bundled with Spring
    .filter((k, v) -> v instanceof TestClassA)
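For reference, a minimal sketch of that annotation setup with Jackson (the type-property name and the listed subtypes are just examples):

@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, include = JsonTypeInfo.As.PROPERTY, property = "type")
@JsonSubTypes({
        @JsonSubTypes.Type(value = TestClassA.class, name = "TestClassA"),
        @JsonSubTypes.Type(value = TestClassB.class, name = "TestClassB")
})
public interface Event {
    // common accessors shared by all event types
}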

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream on different Kafka Topics.
Basically, I have a network with a leader linked to many children. For each child, the Leader needs to write the stream it reads to a child-specific Kafka topic, so that the child can read it.
When a child is started, it registers itself in the Kafka topic that the Leader reads from.
The problem is that I don't know a priori how many children I have.
For example, if I read 1 from the Kafka topic, I want to write the stream to just one Kafka topic named Topic1.
If I read 1-2, I want to write to two Kafka topics (Topic1 and Topic2).
I don't know if this is possible, because in order to write to a topic I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires the number of sinks to be known a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend outputting pairs (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
    @Override
    public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
        for (String topic : topics) {
            out.collect(Tuple2.of(topic, value));
        }
    }
});
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
    "default-topic",
    new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
        @Override
        public String getTargetTopic(Tuple2<String, MyValue> element) {
            return element.f0;
        }
        ... // serializeKey and serializeValue omitted
    },
    kafkaProperties)
);
KeyedSerializationSchema is now deprecated; instead you have to use KafkaSerializationSchema.
The same can be achieved by overriding the serialize method.
@Override
public ProducerRecord<byte[], byte[]> serialize(
        String inputString, @Nullable Long timestamp) {
    // customTopicName and key come from the enclosing KafkaSerializationSchema instance
    return new ProducerRecord<>(customTopicName,
            key.getBytes(StandardCharsets.UTF_8), inputString.getBytes(StandardCharsets.UTF_8));
}