How to test a ProducerInterceptor in a Kafka Streams topology? - apache-kafka

I have the requirement to pipe records from one topic to another, keeping the original partitioning intact (the original producers use a non-native Kafka partitioner). The reason for this is that the source topic is uncompressed, and we wish to "reprocess" the data into a compressed topic, transparently from the point of view of the original producers and consumers.
I have a trivial KStreams topology that does this using a ProducerInterceptor:
void buildPipeline(StreamsBuilder streamsBuilder) {
    streamsBuilder
        .stream(topicProperties.getInput().getName())
        .to(topicProperties.getOutput().getName());
}
together with:
interceptor.classes: com.acme.interceptor.PartitionByHeaderInterceptor
This interceptor looks in the message headers (which contain a partition id header) and simply redirects the ProducerRecord to the original partition:
@Override
public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
    int partition = extractSourcePartition(record);
    return new ProducerRecord<>(record.topic(), partition, record.timestamp(), record.key(), record.value(), record.headers());
}
My question is: how can I test this interceptor in a test topology (i.e. integration test)?
I've tried adding:
streamsConfiguration.put(StreamsConfig.producerPrefix("interceptor.classes"),
        PartitionByHeaderInterceptor.class.getName());
(which is how I enable the interceptor in production code)
to my test topology stream configuration, but my interceptor is not called by the test topology driver.
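For reference, my test setup looks roughly like this (a minimal sketch assuming a standard TopologyTestDriver harness; the application id, bootstrap servers, and topic name are illustrative):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TopologyTestDriver;

Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-app");           // illustrative
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");      // illustrative
// the interceptor is configured here, but it is not invoked by TopologyTestDriver
streamsConfiguration.put(StreamsConfig.producerPrefix("interceptor.classes"),
        PartitionByHeaderInterceptor.class.getName());

StreamsBuilder builder = new StreamsBuilder();
buildPipeline(builder);

try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), streamsConfiguration)) {
    TestInputTopic<String, String> input = driver.createInputTopic(
            "input-topic", Serdes.String().serializer(), Serdes.String().serializer());
    input.pipeInput("key", "value");
    // assertions on driver.createOutputTopic(...) would go here
}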
Is what I'm trying to do currently technically possible?

Related

Apache Kafka Streams: Out-of-Order messages

I have an Apache Kafka 2.6 Producer which writes to topic-A (TA).
I also have a Kafka streams application which consumes from TA and writes to topic-B (TB).
In the streams application, I have a custom timestamp extractor which extracts the timestamp from the message payload.
For one of my failure handling test cases, I shut down the Kafka cluster while my applications are running.
When the producer application tries to write messages to TA, it cannot, because the cluster is down, and hence (I assume) it buffers the messages.
Let's say it receives 4 messages m1,m2,m3,m4 in increasing time order. (i.e. m1 is first and m4 is last).
When I bring the Kafka cluster back online, the producer sends the buffered messages to the topic, but they are not in order. I receive, for example, m2, then m3, then m1, and then m4.
Why is that? Is it because the buffering in the producer is multi-threaded, with each thread producing to the topic at the same time?
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But it does not. Or maybe my understanding of the timestamp extractor is wrong.
I got one solution from SO here, to just stream all events from TA to an intermediate topic (say TA') using the timestamp extractor. But I am not sure whether this will cause the events to be reordered based on the extracted timestamp.
My code for the Producer is as shown below (I am using Spring Cloud for creating the Producer):
Producer.java
@Service
public class Producer {

    private String topicName = "input-topic";
    private ApplicationProperties appProps;

    @Autowired
    private KafkaTemplate<String, MyEvent> kafkaTemplate;

    public Producer() {
        super();
    }

    @Autowired
    public void setAppProps(ApplicationProperties appProps) {
        this.appProps = appProps;
        this.topicName = appProps.getInput().getTopicName();
    }

    public void sendMessage(String key, MyEvent ce) {
        ListenableFuture<SendResult<String, MyEvent>> future = this.kafkaTemplate.send(this.topicName, key, ce);
    }
}
Why is that? Is it because the buffering in the producer is multi-threaded, with each thread producing to the topic at the same time?
By default, the producer allows up to 5 parallel in-flight requests to a broker, and thus if some requests fail and are retried, the request order might change.
To avoid this re-ordering issue, you can either set max.in.flight.requests.per.connection = 1 (which may have a performance hit) or set enable.idempotence = true.
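As a rough sketch, these are plain producer properties (the Spring-based producer from the question would set the same keys through its producer configuration; bootstrap servers are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative

// Option 1: strict ordering at the cost of throughput
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

// Option 2: idempotent producer keeps ordering even when retries happen
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);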
Btw: you did not say whether your topic has a single partition or multiple partitions, and whether your messages have a key. If your topic has more than one partition and your messages are sent to different partitions, there is no ordering guarantee on read anyway, because offset ordering is only guaranteed within a partition.
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But it does not. Or maybe my understanding of the timestamp extractor is wrong.
The timestamp extractor only extracts a timestamp. Kafka Streams does not re-order any messages, but always processes messages in offset order.
If not, then what are the specific uses of the timestamp extractor ? Just to associate a timestamp with an event ?
Correct.
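For illustration, a minimal sketch of a payload-based timestamp extractor (assuming the payload type MyEvent from the question exposes a getEventTime() accessor, which is an illustrative name):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Assigns each record the timestamp embedded in its payload; it does NOT reorder records.
public class PayloadTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof MyEvent) {
            return ((MyEvent) record.value()).getEventTime(); // illustrative accessor
        }
        return partitionTime; // fall back to the previous valid timestamp
    }
}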
I got one solution from SO here, to just stream all events from TA to an intermediate topic (say TA') using the timestamp extractor. But I am not sure whether this will cause the events to be reordered based on the extracted timestamp.
No, it won't do any reordering. The other SO question is just about changing the timestamp; if you read messages in order a, b, c, the result would be written in order a, b, c (just with different timestamps, but offset order is preserved).
This talk explains some more details: https://www.confluent.io/kafka-summit-san-francisco-2019/whats-the-time-and-why/

How to check messages for uniqueness in Kafka?

There is a topic Users with partitions.
Each partition has messages with user data.
How do I avoid duplicates, for example not allowing insertion of the same user's name?
If I got this right, I should create a separate topic Usernames and append all requested usernames.
Then, before adding a new user to the topic Users, I ensure that there are no duplicates in the topic Usernames, right?
Accordingly, this would be done using streams?
I assume you are talking about a scenario where you are trying to publish events to a Kafka topic from a microservice.
Also, assuming you want to publish user profiles: username as the key, user profile as the value.
There are 2 deduplication concerns here:
1.) You might receive the same username in your service at different times and publish it to the topic more than once.
2.) Duplicate message processing: during a broker failure (ack not received) or Kafka client failures, the same message can be re-processed, because the Kafka client does not have the ack info.
This can be taken care of by enabling idempotence on the Kafka producers and using atomic transactions (refer to exactly-once processing).
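A rough sketch of what that producer-side configuration could look like (the transactional id, topic name, and payloads are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // illustrative
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);                     // de-duplicates broker-side retries
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "users-producer-1");         // illustrative, enables atomic writes
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("Users", "username", "user-profile-json"));
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}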
I believe your question is about 1.), where your service receives duplicate messages.
Solution 1:
If you are using a microservice, you can keep an in-memory cache/DB of usernames and publish to Kafka only if no duplicate is found.
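A minimal sketch of that idea, with an in-process set standing in for the cache/DB (class, topic, and method names are illustrative):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserPublisher {
    // In-memory "seen usernames" cache; a real service might back this with a DB or Redis.
    private final Set<String> seenUsernames = ConcurrentHashMap.newKeySet();
    private final Producer<String, String> producer;

    public UserPublisher(Producer<String, String> producer) {
        this.producer = producer;
    }

    public void publishUser(String username, String profileJson) {
        // add() returns false if the username was already present, i.e. a duplicate
        if (seenUsernames.add(username)) {
            producer.send(new ProducerRecord<>("Users", username, profileJson));
        }
    }
}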
Solution 2: (handle it in Kafka itself using Streams)
Input topic: users
Build a Kafka Streams client with a state store (KeyValueStore) and a transformer to implement your dedupe logic.
So your Kafka Streams client consumes the events from the users topic, transforms them in UsersDedupeTransformer (where you have the dedupe logic), and then produces to the output topic (as per your requirement).
StoreBuilder<KeyValueStore<String, String>> storeBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("UserDedupeStoreName"),
        Serdes.String(),
        Serdes.String())
    .withCachingEnabled();

builder.addStateStore(storeBuilder)
       .stream("users-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .transform(() -> new UsersDedupeTransformer(), "UserDedupeStoreName")
       .to("destination-topic");
In UsersDedupeTransformer, configure the dedupe store and override the transform method:
@Override
public void init(ProcessorContext context) {
    this.context = context;
    dedupeStore = (KeyValueStore<String, String>) context.getStateStore("UserDedupeStoreName");
}

@Override
public KeyValue<String, String> transform(String key, String value) {
    // forward the record only if this key has not been seen before
    if (key != null && dedupeStore.get(key) == null) {
        dedupeStore.put(key, value);
        return KeyValue.pair(key, value);
    }
    return null; // duplicate: drop it
}
This dedupe store can be configured as in-memory, or persisted using RocksDB.
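For example, swapping the persistent (RocksDB-backed) store for an in-memory one is just a change of store supplier (a sketch; store name as above):

// In-memory variant: state lives on the heap and is restored from the changelog topic on restart
StoreBuilder<KeyValueStore<String, String>> inMemoryStoreBuilder = Stores.keyValueStoreBuilder(
        Stores.inMemoryKeyValueStore("UserDedupeStoreName"),
        Serdes.String(),
        Serdes.String());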

Kafka exactly_once processing - do you need your streams app to produce to kafka topic as well?

I have a Kafka Streams app consuming from a Kafka topic. It only consumes and processes the data but doesn't produce anything.
For Kafka's exactly_once processing to work, do you also need your streams app to write to a kafka topic?
How can you achieve exactly_once if your streams app wants to process the message only once but not produce anything?
Providing “exactly-once” processing semantics really means that distinct updates to the state of an operator that is managed by the stream processing engine are only reflected once. “Exactly-once” by no means guarantees that processing of an event, i.e. execution of arbitrary user-defined logic, will happen only once.
The above is the explanation of "exactly-once" semantics.
It is not always necessary to publish the output to a topic in a KStreams application.
When you are using KStreams applications, you have to define an application id for each one, which uses a consumer in the backend. In the application, you have to configure a few parameters, like setting processing.guarantee to exactly_once and enabling enable.idempotence.
Here are the details:
https://kafka.apache.org/22/documentation/streams/developer-guide/config-streams#processing-guarantee
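A small sketch of that setting in a Streams configuration (application id and bootstrap servers are illustrative):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");      // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // illustrative
// exactly_once turns on idempotent, transactional producers internally
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);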
I am not disputing the exactly-once stream pattern, because that's the beauty of Kafka Streams; however, it is possible to use Kafka Streams without producing to other topics.
The exactly-once stream pattern is simply the ability to execute a read-process-write operation exactly one time. This means you consume one message at a time, process it, publish it to another topic, and commit. The commit is handled by Streams automatically, one message at a time.
Kafka Streams achieves this by setting the parameters below, which cannot be overridden:
isolation.level (read_committed): consumers will always read committed data only
enable.idempotence (true): the producer will always have idempotence enabled
max.in.flight.requests.per.connection (5): the producer allows at most five in-flight requests per connection, which is safe to combine with idempotence
In case of any error in the consumer or producer, Kafka Streams always retries a specific configured number of attempts.
Kafka Streams does not guarantee anything about what happens inside your processing logic; you still need to handle that yourself. For example, if there is a requirement for a DB operation and the DB connection fails, Kafka is not aware of it, so you need to handle it on your own.
As per the pattern definition, yes, we need consume, process, and produce steps, but in general nothing stops you from not outputting to another topic. You can still consume exactly one item at a time with the default commit interval (DEFAULT_COMMIT_INTERVAL_MS), and again you need to handle failures of your own transactional logic yourself.
Here is a sample:
StreamsBuilder builder = new StreamsBuilder();
Properties props = getStreamProperties();

KStream<String, String> textLines = builder.stream(Pattern.compile("topic"));
textLines.process(() -> new ProcessInternal());

KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    logger.info("Completed VQM stream");
    streams.close();
    latch.countDown(); // release the main thread waiting below
}));

logger.info("Streaming start...");
try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
class ProcessInternal implements Processor<String, String> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void close() {
        // Any code for clean-up would go here.
    }

    @Override
    public void process(String key, String value) {
        // Your transactional processing business logic
    }
}

How to implement Exactly-Once Kafka Consumer without manually assigning partitions

I was going through this article, which explains how to ensure a message is processed exactly once by doing the following:
Read (topic, partition, offset) from database on start/restart
Read message from specific (topic, partition, offset)
Atomically do following things:
Processing message
Commit offset to database as (topic, partition, offset)
As you can see, it explicitly specifies which partition to read messages from. I feel it's not a good idea, as it does not allow Kafka to assign a fair share of the partitions to active consumers. I am not able to come up with logic to implement similar functionality without explicitly specifying partitions while polling the Kafka topic inside the consumer. Is it possible to do this?
Good analysis. You have a very good point, and if possible you should certainly let Kafka handle the partition assignment to consumers.
There is an alternative to consumer.Assign(Partition[]). The Kafka brokers will notify your consumers when a partition is revoked from or assigned to the consumer. For example, the dotnet client library has 'SetPartitionsRevokedHandler' and 'SetPartitionsAssignedHandler' handlers that consumers can use to manage their offsets.
When a partition is revoked, persist your last processed offset for each partition being revoked to the database. When a new partition is assigned, get the last processed offset for that partition from the database and use that.
C# Example:
public class Program
{
    public static void Main(string[] args)
    {
        using (
            var consumer = new ConsumerBuilder<string, string>(config)
                .SetErrorHandler(ErrorHandler)
                .SetPartitionsRevokedHandler(HandlePartitionsRevoked)
                .SetPartitionsAssignedHandler(HandlePartitionsAssigned)
                .Build()
        )
        {
            while (true)
            {
                consumer.Consume(); // equivalent of Poll()
            }
        }
    }

    public static IEnumerable<TopicPartitionOffset> HandlePartitionsRevoked(
        IConsumer<string, string> consumer,
        List<TopicPartitionOffset> currentTopicPartitionOffsets)
    {
        // Persist(<last processed offset for each partition in 'currentTopicPartitionOffsets'>);
        return currentTopicPartitionOffsets;
    }

    public static IEnumerable<TopicPartitionOffset> HandlePartitionsAssigned(
        IConsumer<string, string> consumer,
        List<TopicPartition> tps)
    {
        List<TopicPartitionOffset> tpos = FetchOffsetsFromDbForTopicPartitions(tps);
        return tpos;
    }
}
Java Example from the ConsumerRebalanceListener Docs:
If writing in Java, there is a 'ConsumerRebalanceListener' interface that you can implement. You then pass your implementation of the interface into the consumer.subscribe(topics, listener) method. The example below is taken verbatim from the Kafka docs linked above:
public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
    private Consumer<?,?> consumer;

    public SaveOffsetsOnRebalance(Consumer<?,?> consumer) {
        this.consumer = consumer;
    }

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // save the offsets in an external store using some custom code not described here
        for(TopicPartition partition: partitions)
            saveOffsetInExternalStore(consumer.position(partition));
    }

    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // read the offsets from an external store using some custom code not described here
        for(TopicPartition partition: partitions)
            consumer.seek(partition, readOffsetFromExternalStore(partition));
    }
}
If my understanding is correct, you would call the Java version like this: consumer.subscribe(Collections.singletonList("My topic"), new SaveOffsetsOnRebalance(consumer)).
For more information, see the 'Storing Offsets Outside Kafka' section of the kafka docs.
Here's an excerpt from those docs that summarizes how to store the partitions and offsets for exactly-once processing:
Each record comes with its own offset, so to manage your own offset you just need to do the following:
Configure enable.auto.commit=false
Use the offset provided with each ConsumerRecord to save your position.
On restart restore the position of the consumer using seek(TopicPartition, long).
This type of usage is simplest when the partition assignment is also done manually (this would be likely in the search index use case described above). If the partition assignment is done automatically special care is needed to handle the case where partition assignments change. This can be done by providing a ConsumerRebalanceListener instance in the call to subscribe(Collection, ConsumerRebalanceListener) and subscribe(Pattern, ConsumerRebalanceListener). For example, when partitions are taken from a consumer the consumer will want to commit its offset for those partitions by implementing ConsumerRebalanceListener.onPartitionsRevoked(Collection). When partitions are assigned to a consumer, the consumer will want to look up the offset for those new partitions and correctly initialize the consumer to that position by implementing ConsumerRebalanceListener.onPartitionsAssigned(Collection). Another common use for ConsumerRebalanceListener is to flush any caches the application maintains for partitions that are moved elsewhere.
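As a rough sketch of those steps in Java, reusing the SaveOffsetsOnRebalance listener from the example above (topic name, group id, and the processAndSaveOffset helper are illustrative; error handling is omitted):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // illustrative
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // step 1: no auto-commit
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// the rebalance listener restores positions from the external store on assignment (step 3)
consumer.subscribe(Collections.singletonList("my-topic"), new SaveOffsetsOnRebalance(consumer));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        // process the record and atomically save (topic, partition, offset) with the result (step 2)
        processAndSaveOffset(record); // illustrative helper
    }
}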

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream on different Kafka Topics.
Basically, I have a network with a leader linked to many children. For each child, the Leader needs to write the stream read in a child-specific Kafka Topic, so that the child can read it.
When the child is started, it registers itself in the Kafka topic read from the Leader.
The problem is that I don't know a priori how many children I have.
For example, if I read 1 from the Kafka topic, I want to write the stream to just one Kafka topic named Topic1.
If I read 1-2, I want to write to two Kafka topics (Topic1 and Topic2).
I don't know if it is possible, because in order to write to a topic I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires knowing the number of sinks a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one per output topic). I would recommend outputting pairs (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(
    new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
        @Override
        public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
            for (String topic : topics) {
                out.collect(Tuple2.of(topic, value));
            }
        }
    });
Then you can use the topic previously computed by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
    "default-topic",
    new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
        @Override
        public String getTargetTopic(Tuple2<String, MyValue> element) {
            return element.f0;
        }
        ...
    },
    kafkaProperties)
);
KeyedSerializationSchema is now deprecated. Instead you have to use KafkaSerializationSchema.
The same can be achieved by overriding its serialize method:
@Override
public ProducerRecord<byte[], byte[]> serialize(
        String inputString, @Nullable Long timestamp) {
    // customTopicName and key are fields of the enclosing schema, computed per record
    return new ProducerRecord<>(customTopicName,
            key.getBytes(StandardCharsets.UTF_8),
            inputString.getBytes(StandardCharsets.UTF_8));
}