Kafka Streams: Custom TimestampExtractor for aggregation - apache-kafka

I am building a pretty straightforward Kafka Streams demo application to test a use case.
I am not able to upgrade the Kafka broker I am using (which is currently on version 0.10.0), and there are several messages written by a pre-0.10.0 producer, so I am using a custom TimestampExtractor, which I add as the default to the config at the beginning of my main class:
config.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, GenericRecordTimestampExtractor.class);
When consuming from my source topic, this works perfectly fine. But when using an aggregation operator, I run into an exception because the FailOnInvalidTimestamp implementation of TimestampExtractor is used instead of the custom implementation when consuming from the internal aggregation topic.
The code of the Streams app looks something like this:
...
KStream<String, MyValueClass> clickStream = streamsBuilder
    .stream("mytopic", Consumed.with(Serdes.String(), valueClassSerde));
KTable<Windowed<Long>, Long> clicksByCustomerId = clickStream
    .map((key, value) -> new KeyValue<>(value.getId(), value))
    .groupByKey(Serialized.with(Serdes.Long(), valueClassSerde))
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1)))
    .count();
...
The Exception I'm encountering is the following:
Exception in thread "click-aggregator-b9d77f2e-0263-4fa3-bec4-e48d4d6602ab-StreamThread-1" org.apache.kafka.streams.errors.StreamsException:
Input record ConsumerRecord(topic = click-aggregator-KSTREAM-AGGREGATE-STATE-STORE-0000000002-repartition, partition = 9, offset = 0, CreateTime = -1, serialized key size = 8, serialized value size = 652, headers = RecordHeaders(headers = [], isReadOnly = false), key = 11230, value = org.example.MyValueClass@2a3f2ea2) has invalid (negative) timestamp.
Possibly because a pre-0.10 producer client was used to write this record to Kafka without embedding a timestamp, or because the input topic was created before upgrading the Kafka cluster to 0.10+. Use a different TimestampExtractor to process this data.
Now the question is: Is there any way I can make Kafka Streams use the custom TimestampExtractor when reading from the internal aggregation topic (optimally while still using the Streams DSL)?

You cannot change the timestamp extractor (as of v1.0.0). This is not allowed for correctness reasons.
But I am really wondering how a record with timestamp -1 was written into this topic in the first place. Kafka Streams uses the timestamp that was provided by your custom extractor when writing the record. Also note that KafkaProducer does not allow writing records with a negative timestamp.
Thus, the only explanation I can think of is that some other producer wrote into the repartition topic -- and this is not allowed... Only Kafka Streams should write into the repartition topic.
I guess you will need to delete this topic and let Kafka Streams recreate it to get back into a clean state.
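The usual way to get back to a clean state is the Kafka Streams application reset tool. If you prefer to delete the repartition topic programmatically, a minimal AdminClient sketch could look like this (just a sketch: the Streams application must be stopped first, the topic name is copied from the exception above, the bootstrap server is a placeholder, and your broker/client versions must support the DeleteTopics API):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class DeleteRepartitionTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // topic name taken from the exception above
            admin.deleteTopics(Collections.singleton(
                    "click-aggregator-KSTREAM-AGGREGATE-STATE-STORE-0000000002-repartition"))
                 .all().get();
        }
    }
}
Kafka Streams will recreate the repartition topic on the next start of the application.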
From the discussion/comment of the other answer:
You need the 0.10+ message format to work with Kafka Streams. If you upgrade your brokers but keep the 0.9 message format or older, Kafka Streams might not work as expected.

It is a well-known issue :-). I have the same problem with old clients in projects which are still using older Kafka clients like 0.9, and also when communicating with some "not certified" .NET clients.
Therefore I wrote a dedicated class:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class MyTimestampExtractor implements TimestampExtractor {

    private static final Logger LOG = LogManager.getLogger(MyTimestampExtractor.class);

    @Override
    public long extract(ConsumerRecord<Object, Object> consumerRecord, long previousTimestamp) {
        final long timestamp = consumerRecord.timestamp();
        if (timestamp < 0) {
            final String msg = consumerRecord.toString().trim();
            LOG.warn("Record has wrong Kafka timestamp: {}. It will be patched with local timestamp. Details: {}", timestamp, msg);
            return System.currentTimeMillis();
        }
        return timestamp;
    }
}
When there are many messages you may want to skip logging, as it may flood the logs.
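A wiring sketch for the extractor above, either globally (as in the question) or per input topic; topic and serde names are placeholders, and note that this only covers source topics, not internal repartition topics:
// Register globally (as in the question) ...
Properties config = new Properties();
config.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, MyTimestampExtractor.class);

// ... or per input topic:
KStream<String, MyValueClass> stream = streamsBuilder.stream("mytopic",
        Consumed.with(Serdes.String(), valueClassSerde)
                .withTimestampExtractor(new MyTimestampExtractor()));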

After reading Matthias' answer I double-checked everything, and the cause of the issue was incompatible versions between the Kafka broker and the Kafka Streams app. I was stupid enough to use Kafka Streams 1.0.0 with a 0.10.1.1 broker, which is clearly stated as incompatible in the Kafka Wiki here.
Edit (thx to Matthias): The actual cause of the problem was the fact, that the log format used by our 0.10.1.x broker was still 0.9.0.x, which is incompatible with Kafka Streams.

Related

Flink KafkaSource with multiple topics, each topic with different avro schema of data

If I create multiple of the below, one for each topic:
KafkaSource<T> kafkaDataSource = KafkaSource.<T>builder()
        .setBootstrapServers(consumerProps.getProperty("bootstrap.servers"))
        .setTopics(topic)
        .setDeserializer(deserializer)
        .setGroupId(identifier)
        .setProperties(consumerProps)
        .build();
The deserializer seems to run into some issue and ends up reading data from a different topic, with a different schema than the one it is meant for, and fails!
If I provide all topics in the same KafkaSource, then the watermarks seem to progress across the topics together.
DataStream<T> dataSource = environment.fromSource(kafkaDataSource,
        WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
                .withTimestampAssigner((event, timestamp) -> {...}),
        "");
Also, the Avro data in Kafka itself holds the first magic byte for the schema (the schema info is embedded); I am not using any external Avro registry (it's all in the libraries).
It works fine with FlinkKafkaConsumer (created multiple instances of it).
FlinkKafkaConsumer<T> kafkaConsumer = new FlinkKafkaConsumer<>(topic, deserializer, consumerProps);
kafkaConsumer.assignTimestampsAndWatermarks(WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
        .withTimestampAssigner((event, timestamp) -> {...}));
Not sure if the problem is in the way I am using it? Any pointers on how to solve this would be appreciated. Also, FlinkKafkaConsumer is deprecated.
Figured it out based on the code in Custom avro message deserialization with Flink: I implemented the open method and made the instance fields of the deserializer transient, roughly as sketched below.
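A rough sketch of that fix (AvroDecoder here is a hypothetical stand-in for the embedded-schema Avro decoding described above): the decoder field is transient and rebuilt in open(), so each source subtask gets its own instance instead of sharing serialized state across topics.
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class MyAvroDeserializer<T> implements KafkaRecordDeserializationSchema<T> {

    private final Class<T> clazz;
    private transient AvroDecoder<T> decoder; // not serialized with the job graph

    public MyAvroDeserializer(Class<T> clazz) {
        this.clazz = clazz;
    }

    @Override
    public void open(DeserializationSchema.InitializationContext context) {
        // build the (hypothetical) decoder per subtask, after deployment
        this.decoder = new AvroDecoder<>(clazz);
    }

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<T> out) {
        out.collect(decoder.decode(record.value()));
    }

    @Override
    public TypeInformation<T> getProducedType() {
        return TypeInformation.of(clazz);
    }
}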

Apache Kafka Streams: Out-of-Order messages

I have an Apache Kafka 2.6 Producer which writes to topic-A (TA).
I also have a Kafka streams application which consumes from TA and writes to topic-B (TB).
In the streams application, I have a custom timestamp extractor which extracts the timestamp from the message payload.
For one of my failure handling test cases, I shutdown the Kafka cluster while my applications are running.
When the producer application tries to write messages to TA, it cannot because the cluster is down and hence (I assume) buffers the messages.
Let's say it receives 4 messages m1,m2,m3,m4 in increasing time order. (i.e. m1 is first and m4 is last).
When I bring the Kafka cluster back online, the producer sends the buffered messages to the topic, but they are not in order. I receive for example, m2 then m3 then m1 and then m4.
Why is that? Is it because the buffering in the producer is multi-threaded, with each thread producing to the topic at the same time?
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But it does not. Or maybe my understanding of the timestamp extractor is wrong.
I got one solution from SO here: just stream all events from tA to another intermediate topic (say tA'), using the timestamp extractor along the way. But I am not sure if this will cause the events to get reordered based on the extracted timestamp.
My code for the Producer is as shown below (I am using Spring Cloud for creating the Producer):
Producer.java
@Service
public class Producer {

    private String topicName = "input-topic";
    private ApplicationProperties appProps;

    @Autowired
    private KafkaTemplate<String, MyEvent> kafkaTemplate;

    public Producer() {
        super();
    }

    @Autowired
    public void setAppProps(ApplicationProperties appProps) {
        this.appProps = appProps;
        this.topicName = appProps.getInput().getTopicName();
    }

    public void sendMessage(String key, MyEvent ce) {
        ListenableFuture<SendResult<String, MyEvent>> future = this.kafkaTemplate.send(this.topicName, key, ce);
    }
}
Why is that? Is it because the buffering in the producer is multi-threaded, with each thread producing to the topic at the same time?
By default, the producer allows up to 5 parallel in-flight requests per broker connection, and thus if some requests fail and are retried, the request order might change.
To avoid this reordering issue, you can either set max.in.flight.requests.per.connection = 1 (which may have a performance impact) or set enable.idempotence = true.
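As a sketch, the relevant settings on the producer side look like this (shown as plain producer Properties; with Spring Kafka you would typically pass the same keys through the producer properties):
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// Either cap in-flight requests at 1 (preserves order, may reduce throughput) ...
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
// ... or enable the idempotent producer, which also prevents reordering on retries.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");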
Btw: you did not say if your topic has a single partition or multiple partitions, and if your messages have a key. If your topic has more than one partition and your messages are sent to different partitions, there is no ordering guarantee on read anyway, because offset ordering is only guaranteed within a partition.
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But it does not. Or maybe my understanding of the timestamp extractor is wrong.
The timestamp extractor only extracts a timestamp. Kafka Streams does not re-order any messages, but processes messages always in offset-order.
If not, then what are the specific uses of the timestamp extractor? Just to associate a timestamp with an event?
Correct.
I got one solution from SO here: just stream all events from tA to another intermediate topic (say tA'), using the timestamp extractor along the way. But I am not sure if this will cause the events to get reordered based on the extracted timestamp.
No, it won't do any reordering. The other SO question is just about changing the timestamp; if you read messages in order a, b, c, the result would be written in order a, b, c (just with different timestamps, but offset order is preserved).
This talk explains some more details: https://www.confluent.io/kafka-summit-san-francisco-2019/whats-the-time-and-why/

Write multiple topics in Kafka by Flink dynamically Exception handling

I am currently reading data from a single Kafka topic and writing to a dynamic topic based on the data itself. I have implemented the following code (which dynamically selects the topic based on the accountId of the data) and it is working just fine:
class KeyedEnrichableEventSerializationSchema(schemaRegistryUrl: String)
  extends KafkaSerializationSchema[KeyedEnrichableEvent]
  with KafkaContextAware[KeyedEnrichableEvent] {

  private val enrichableEventClass = classOf[EnrichableEvent]
  private val enrichableEventSerialization: AvroSerializationSchema[EnrichableEvent] =
    ConfluentRegistryAvroSerializationSchema.forSpecific(enrichableEventClass, enrichableEventClass.getCanonicalName, schemaRegistryUrl)

  override def serialize(element: KeyedEnrichableEvent, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] =
    new ProducerRecord("trackingevents." + element.value.getEventMetadata.getAccountId, element.key, enrichableEventSerialization.serialize(element.value))

  override def getTargetTopic(element: KeyedEnrichableEvent): String =
    "trackingevents." + element.value.getEventMetadata.getAccountId
}
The problem/concern is: if the topic does not exist, I get an exception in the JobManager UI about the topic not being present, and the whole processing is halted until I create the topic. Maybe this is the recommended behaviour, but is there an alternative, like putting the data in a different topic and resuming processing instead of stopping the whole pipeline? Or at least, is there any notification mechanism available in Flink which notifies immediately that the processing is halted?
I think that the simplest solution in this case is to enable auto-creation of topics in Kafka, so that the problem is solved entirely.
However, if for some reason that is impossible in your case, the simplest solution IMHO would be to create a ProcessFunction, which would keep a connection to Kafka using KafkaConsumer or AdminClient and periodically check if the topic that would be used for a given message exists. You could then either push the data to a side output or try to create the missing topics.
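A rough sketch of that periodic check with the Kafka AdminClient (the class name, partition count, replication factor, and bootstrap servers are placeholders); it could be held and invoked from the ProcessFunction described above:
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicEnsurer implements AutoCloseable {

    private final AdminClient admin;

    public TopicEnsurer(String bootstrapServers) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        this.admin = AdminClient.create(props);
    }

    // Creates the topic if it does not exist yet (3 partitions, replication factor 1 as placeholders).
    public void ensureTopic(String topic) throws Exception {
        Set<String> existing = admin.listTopics().names().get();
        if (!existing.contains(topic)) {
            admin.createTopics(Collections.singleton(new NewTopic(topic, 3, (short) 1))).all().get();
        }
    }

    @Override
    public void close() {
        admin.close();
    }
}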

Ordering of streams while reading data from multiple kafka partitions

I have used the Flink streaming API in my application, where the source of the stream is Kafka. My Kafka producer publishes data in ascending order of time to different Kafka partitions, and the consumer reads data from these partitions. However, some Kafka partitions may be slow due to some operation and produce late results. Is there any way to maintain order in this stream even though the data arrives out of order? I have tried BoundedOutOfOrdernessTimestampExtractor but it didn't serve the purpose. While digging into this problem I came across your documentation (URL: https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams) and tried to implement it, but it didn't work. I also tried the Table API orderBy, but it seems orderBy is not supported in Flink 1.5. Please suggest a workaround for this. I am using the custom watermark generator below with parallelism 4.
DataStream<Document> streamSource = env
        .addSource(kafkaConsumer).setParallelism(4);

public class BoundedOutOfOrdernessGenerator implements AssignerWithPeriodicWatermarks<Document> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds
    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(Document event, long previousElementTimestamp) {
        Map timeStamp = (Map) event.get("ts");
        this.currentMaxTimestamp = (long) timeStamp.get("value");
        return currentMaxTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // return the watermark as current highest timestamp minus the out-of-orderness bound
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
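The assigner is attached to the stream in the standard way for the pre-Flink-1.11 watermark API (a minimal sketch, using the names above):
DataStream<Document> withTimestamps = streamSource
        .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator());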
Thanks,

Apache Flink dynamic number of Sinks

I am using Apache Flink and the KafkaConsumer to read some values from a Kafka Topic.
I also have a stream obtained from reading a file.
Depending on the received values, I would like to write this stream to different Kafka topics.
Basically, I have a network with a leader linked to many children. For each child, the Leader needs to write the stream it reads to a child-specific Kafka topic, so that the child can read it.
When a child is started, it registers itself in the Kafka topic that the Leader reads from.
The problem is that I don't know a priori how many children I have.
For example, if I read 1 from the Kafka topic, I want to write the stream to just one Kafka topic named Topic1.
If I read 1-2, I want to write to two Kafka topics (Topic1 and Topic2).
I don't know if this is possible, because in order to write to a topic I am using the Kafka producer along with the addSink method, and to my understanding (and from my attempts) it seems that Flink requires the number of sinks to be known a priori.
But then, is there no way to obtain such behavior?
If I understood your problem well, I think you can solve it with a single sink, since you can choose the Kafka topic based on the record being processed. It also seems that one element from the source might be written to more than one topic, in which case you would need a FlatMapFunction to replicate each source record N times (one for each output topic). I would recommend outputting a pair (aka Tuple2) of (topic, record).
DataStream<Tuple2<String, MyValue>> stream = input.flatMap(new FlatMapFunction<MyValue, Tuple2<String, MyValue>>() {
    @Override
    public void flatMap(MyValue value, Collector<Tuple2<String, MyValue>> out) {
        for (String topic : topics) {
            out.collect(Tuple2.of(topic, value));
        }
    }
});
Then you can use the previously computed topic by creating the FlinkKafkaProducer with a KeyedSerializationSchema in which you implement getTargetTopic to return the first element of the pair.
stream.addSink(new FlinkKafkaProducer010<>(
        "default-topic",
        new KeyedSerializationSchema<Tuple2<String, MyValue>>() {
            @Override
            public String getTargetTopic(Tuple2<String, MyValue> element) {
                return element.f0;
            }
            ...
        },
        kafkaProperties)
);
KeyedSerializationSchema is now deprecated. Instead you have to use KafkaSerializationSchema.
The same can be achieved by overriding the serialize method:
@Override
public ProducerRecord<byte[], byte[]> serialize(
        String inputString, @Nullable Long timestamp) {
    // customTopicName and key are placeholders derived from the record being serialized
    return new ProducerRecord<>(customTopicName,
            key.getBytes(StandardCharsets.UTF_8),
            inputString.getBytes(StandardCharsets.UTF_8));
}
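Putting it together for the dynamic-topic case from the first answer, a sketch with the newer API could look like this (Tuple2.f0 carries the target topic; MyValue and its byte representation, here naively via toString, are assumptions):
import java.nio.charset.StandardCharsets;
import javax.annotation.Nullable;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DynamicTopicSchema implements KafkaSerializationSchema<Tuple2<String, MyValue>> {

    @Override
    public ProducerRecord<byte[], byte[]> serialize(Tuple2<String, MyValue> element,
                                                    @Nullable Long timestamp) {
        // element.f0 carries the target topic, element.f1 the payload
        return new ProducerRecord<>(element.f0,
                element.f1.toString().getBytes(StandardCharsets.UTF_8));
    }
}
The schema is then passed to the universal FlinkKafkaProducer together with a default topic, the producer Properties, and a delivery Semantic.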