What happens to the timestamp of a message in a stream when it's mapped into another stream? - apache-kafka

I've an application where I process a stream and convert it into another. Here is a sample:
public void run(final String... args) {
final Serde<Event> eventSerde = new EventSerde();
final Properties props = streamingConfig.getProperties(
applicationName,
concurrency,
Serdes.String(),
eventSerde
);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, EXACTLY_ONCE);
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimestampExtractor.class);
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, Event> eventStream = builder.stream(inputStream);
final Serde<Device> deviceSerde = new DeviceSerde();
eventStream
.map((key, event) -> {
final Device device = modelMapper.map(event, Device.class);
return new KeyValue<>(key, device);
})
.to("device_topic", Produced.with(Serdes.String(), deviceSerde));
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();
}
Here are some details about the app:
Spring Boot 1.5.17
Kafka 2.1.0
Kafka Streams 2.1.0
Spring Kafka 1.3.6
Although a timestamp is set in the messages inside the input stream, I also place an implementation of TimestampExtractor to make sure that a proper timestamp is attached into all messages (as other producers may send messages into the same topic).
Within the code, I receive a stream of events and I basically convert them into different objects and eventually route those objects into different streams.
I'm trying to understand whether the initial timestamp I set is still attached to the messages published into device_topic in this particular case.
The receiving end (of device stream) is like this:
#KafkaListener(topics = "device_topic")
public void onDeviceReceive(final Device device, #Header(KafkaHeaders.RECEIVED_TIMESTAMP) final long timestamp) {
log.trace("[{}] Received device: {}", timestamp, device);
}
Unfortunetely the printed timestamp seems to be wall clock time. Is this the expected behaviour or am I missing something?

Spring Kafka 1.3.x uses a very old 0.11 client; perhaps it doesn't propagate the timestamp. I just tested with Boot 2.1.3 and Spring Kafka 2.2.4 and the timestamp is propagated ok...
#SpringBootApplication
#EnableKafkaStreams
public class So54771130Application {
public static void main(String[] args) {
SpringApplication.run(So54771130Application.class, args);
}
#Bean
public ApplicationRunner runner(KafkaTemplate<String, String> template) {
return args -> {
template.send("so54771130", 0, 42L, null, "baz");
};
}
#Bean
public KStream<String, String> stream(StreamsBuilder builder) {
KStream<String, String> stream = builder.stream("so54771130");
stream
.map((k, v) -> {
System.out.println("Mapping:" + v);
return new KeyValue<>(null, "bar");
})
.to("so54771130-1");
return stream;
}
#Bean
public NewTopic topic1() {
return new NewTopic("so54771130", 1, (short) 1);
}
#Bean
public NewTopic topic2() {
return new NewTopic("so54771130-1", 1, (short) 1);
}
#KafkaListener(id = "so54771130", topics = "so54771130-1")
public void listen(String in, #Header(KafkaHeaders.RECEIVED_TIMESTAMP) long ts) {
System.out.println(in + "#" + ts);
}
}
and
Mapping:baz
bar#42

Related

Reactive program exiting early before sending all messages to Kafka

This is a subsequent question to a previous reactive kafka issue (Issue while sending the Flux of data to the reactive kafka).
I am trying to send some log records to the kafka using the reactive approach. Here is the reactive code sending messages using reactive kafka.
public class LogProducer {
private final KafkaSender<String, String> sender;
public LogProducer(String bootstrapServers) {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.CLIENT_ID_CONFIG, "log-producer");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
SenderOptions<String, String> senderOptions = SenderOptions.create(props);
sender = KafkaSender.create(senderOptions);
}
public void sendMessages(String topic, Flux<Logs.Data> records) throws InterruptedException {
AtomicInteger sentCount = new AtomicInteger(0);
sender.send(records
.map(record -> {
LogRecord lrec = record.getRecords().get(0);
String id = lrec.getId();
Thread.sleep(0, 5); // sleep for 5 ns
return SenderRecord.create(new ProducerRecord<>(topic, id,
lrec.toString()), id);
})).doOnNext(res -> sentCount.incrementAndGet()).then()
.doOnError(e -> {
log.error("[FAIL]: Send to the topic: '{}' failed. "
+ e, topic);
})
.doOnSuccess(s -> {
log.info("[SUCCESS]: {} records sent to the topic: '{}'", sentCount, topic);
})
.subscribe();
}
}
public class ExecuteQuery implements Runnable {
private LogProducer producer = new LogProducer("localhost:9092");
#Override
public void run() {
Flux<Logs.Data> records = ...
producer.sendMessages(kafkaTopic, records);
.....
.....
// processing related to the messages sent
}
}
So even when the Thread.sleep(0, 5); is there, sometimes it does not send all messages to kafka and the program exists early printing the SUCCESS message (log.info("[SUCCESS]: {} records sent to the topic: '{}'", sentCount, topic);). Is there any more concrete way to solve this problem. For example, using some kind of callback, so that thread will wait for all messages to be sent successfully.
I have a spring console application and running ExecuteQuery through a scheduler at fixed rate, something like this
public class Main {
private ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
private ExecutorService executor = Executors.newFixedThreadPool(POOL_SIZE);
public static void main(String[] args) {
QueryScheduler scheduledQuery = new QueryScheduler();
scheduler.scheduleAtFixedRate(scheduledQuery, 0, 5, TimeUnit.MINUTES);
}
class QueryScheduler implements Runnable {
#Override
public void run() {
// preprocessing related to time
executor.execute(new ExecuteQuery());
// postprocessing related to time
}
}
}
Your Thread.sleep(0, 5); // sleep for 5 ns does not have any value for a main thread to be blocked, so it exits when it needs and your ExecuteQuery may not finish its job yet.
It is not clear how you start your application, but I recommended Thread.sleep() exactly in a main thread to block. To be precise in the public static void main(String[] args) { method impl.

Spring Cloud Sleuth with Reactor Kafka

I'm using Reactor Kafka in a Spring Boot Reactive app, with Spring Cloud Sleuth for distributed tracing.
I've setup Sleuth to use a custom propagation key from a header named "traceId".
I've also customized the log format to print the header in my logs, so a request like
curl -H "traceId: 123456" -X POST http://localhost:8084/parallel
will print 123456 in every log anywhere downstream starting from the Controller.
I would now like this header to be propagated via Kafka too. I understand that Sleuth has built-in instrumentation for Kafka too, so the header should be propagated automatically, however I'm unable to get this to work.
From my Controller, I produce a message onto a Kafka topic, and then have another Kafka consumer pick it up for processing.
Here's my Controller:
#RestController
#RequestMapping("/parallel")
public class BasicController {
private Logger logger = Loggers.getLogger(BasicController.class);
KafkaProducerLoadGenerator generator = new KafkaProducerLoadGenerator();
#PostMapping
public Mono<ResponseEntity> createMessage() {
int data = (int)(Math.random()*100000);
return Flux.just(data)
.doOnNext(num -> logger.info("Generating document for {}", num))
.map(generator::generateDocument)
.flatMap(generator::sendMessage)
.doOnNext(result ->
logger.info("Sent message {}, offset is {} to partition {}",
result.getT2().correlationMetadata(),
result.getT2().recordMetadata().offset(),
result.getT2().recordMetadata().partition()))
.doOnError(error -> logger.error("Error in subscribe while sending message", error))
.single()
.map(tuple -> ResponseEntity.status(HttpStatus.OK).body(tuple.getT1()));
}
}
Here's the code that produces messages on to the Kafka topic
#Component
public class KafkaProducerLoadGenerator {
private static final Logger logger = Loggers.getLogger(KafkaProducerLoadGenerator.class);
private static final String bootstrapServers = "localhost:9092";
private static final String TOPIC = "load-topic";
private KafkaSender<Integer, String> sender;
private static int documentIndex = 0;
public KafkaProducerLoadGenerator() {
this(bootstrapServers);
}
public KafkaProducerLoadGenerator(String bootstrapServers) {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.CLIENT_ID_CONFIG, "load-generator");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, IntegerSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
SenderOptions<Integer, String> senderOptions = SenderOptions.create(props);
sender = KafkaSender.create(senderOptions);
}
#NewSpan("generator.sendMessage")
public Flux<Tuple2<DataDocument, SenderResult<Integer>>> sendMessage(DataDocument document) {
return sendMessage(TOPIC, document)
.map(result -> Tuples.of(document, result));
}
public Flux<SenderResult<Integer>> sendMessage(String topic, DataDocument document) {
ProducerRecord<Integer, String> producerRecord = new ProducerRecord<>(topic, document.getData(), document.toString());
return sender.send(Mono.just(SenderRecord.create(producerRecord, document.getData())))
.doOnNext(record -> logger.info("Sent message to partition={}, offset={} ", record.recordMetadata().partition(), record.recordMetadata().offset()))
.doOnError(e -> logger.error("Error sending message " + documentIndex, e));
}
public DataDocument generateDocument(int data) {
return DataDocument.builder()
.header("Load Data")
.data(data)
.traceId("trace"+data)
.timestamp(Instant.now())
.build();
}
}
My consumer looks like this:
#Component
#Scope(scopeName = ConfigurableBeanFactory.SCOPE_PROTOTYPE)
public class IndividualConsumer {
private static final Logger logger = Loggers.getLogger(IndividualConsumer.class);
private static final String bootstrapServers = "localhost:9092";
private static final String TOPIC = "load-topic";
private int consumerIndex = 0;
public ReceiverOptions setupConfig(String bootstrapServers) {
Map<String, Object> properties = new HashMap<>();
properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
properties.put(ConsumerConfig.CLIENT_ID_CONFIG, "load-topic-consumer-"+consumerIndex);
properties.put(ConsumerConfig.GROUP_ID_CONFIG, "load-topic-multi-consumer-2");
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, IntegerDeserializer.class);
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, DataDocumentDeserializer.class);
return ReceiverOptions.create(properties);
}
public void setIndex(int i) {
consumerIndex = i;
}
#EventListener(ApplicationReadyEvent.class)
public Disposable consumeMessage() {
ReceiverOptions<Integer, DataDocument> receiverOptions = setupConfig(bootstrapServers)
.subscription(Collections.singleton(TOPIC))
.addAssignListener(receiverPartitions -> logger.debug("onPartitionsAssigned {}", receiverPartitions))
.addRevokeListener(receiverPartitions -> logger.debug("onPartitionsRevoked {}", receiverPartitions));
Flux<ReceiverRecord<Integer, DataDocument>> messages = Flux.defer(() -> {
KafkaReceiver<Integer, DataDocument> receiver = KafkaReceiver.create(receiverOptions);
return receiver.receive();
});
Consumer<? super ReceiverRecord<Integer, DataDocument>> acknowledgeOffset = record -> record.receiverOffset().acknowledge();
return messages
.publishOn(Schedulers.newSingle("Parallel-Consumer"))
.doOnError(error -> logger.error("Error in the reactive chain", error))
.delayElements(Duration.ofMillis(100))
.doOnNext(record -> {
logger.info("Consumer {}: Received from partition {}, offset {}, data with index {}",
consumerIndex,
record.receiverOffset().topicPartition(),
record.receiverOffset().offset(),
record.value().getData());
})
.doOnNext(acknowledgeOffset)
.doOnError(error -> logger.error("Error receiving record", error))
.retryBackoff(100, Duration.ofSeconds(5), Duration.ofMinutes(5))
.subscribe();
}
}
I would expect Sleuth to automatically carry over the built-in Brave trace and the custom headers to the consumer, so that the trace covers the entire transaction.
However I have two problems.
The generator bean doesn't get the same trace as the one in the Controller. It uses a different (and new) trace for every message sent.
The trace isn't propagated from Kafka producer to Kafka consumer.
I can resolve #1 above by replacing the generator bean with a simple Java class and instantiating it in the controller. However that means I can't autowire other dependencies, and in any case it doesn't solve #2.
I am able to load an instance of the bean brave.kafka.clients.KafkaTracing so I know it's being loaded by Spring. However, it doesn't look the instrumentation is working. I inspected the content on Kafka using Kafka Tool, and no headers are populated on any message.
In fact the consumer doesn't have a trace at all.
2020-05-06 23:57:32.898 INFO parallel-consumer:local [123-21922,578c510e23567aec,578c510e23567aec] 8180 --- [reactor-http-nio-3] rja.parallelconsumers.BasicController : Generating document for 23965
2020-05-06 23:57:32.907 INFO parallel-consumer:local [52e02d36b59c5acd,52e02d36b59c5acd,52e02d36b59c5acd] 8180 --- [single-11] r.p.kafka.KafkaProducerLoadGenerator : Sent message to partition=17, offset=0
2020-05-06 23:57:32.908 INFO parallel-consumer:local [123-21922,578c510e23567aec,578c510e23567aec] 8180 --- [single-11] rja.parallelconsumers.BasicController : Sent message 23965, offset is 0 to partition 17
2020-05-06 23:57:33.012 INFO parallel-consumer:local [-,-,-] 8180 --- [parallel-5] r.parallelconsumers.IndividualConsumer : Consumer 8: Received from partition load-topic-17, offset 0, data with index 23965
In the log above, [123-21922,578c510e23567aec,578c510e23567aec] is [custom-trace-header, brave traceId, brave spanId]
What am I missing?

Why is windowing now working for Kafka Streams?

I am running a simple Kafka Streams program on my eclipse which is running successfully, but it is not able to implement the windowing concept.
I want to process all the messages received in a window of 5 seconds to the output topic. I googled and understand that I need to implement the tumbling window concept. However, I see that the output is sent to the output topic instantly.
What am I doing wrong here? Below is the main method that I am running:
public static void main(String[] args) throws Exception {
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
final StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("wc-input");
#SuppressWarnings("deprecation")
KTable<Windowed<String>, Long> counts = source
.flatMapValues(new ValueMapper<String, Iterable<String>>() {
#Override
public Iterable<String> apply(String value) {
return Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" "));
}
})
.groupBy(new KeyValueMapper<String, String, String>() {
#Override
public String apply(String key, String value) {
return value;
}
})
.count(TimeWindows.of(10000L)
.until(10000L),"Counts");
// need to override value serde to Long type
counts.to("wc-output");
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-wordcount-shutdown-hook") {
#Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
long windowSizeMs = TimeUnit.MINUTES.toMillis(50000); // 5 * 60 * 1000L
TimeWindows.of(windowSizeMs);
TimeWindows.of(windowSizeMs).advanceBy(windowSizeMs);
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
Windowing does not mean "one output" per window. If you want to get only one output per window, you want so use suppress() on the result KTable.
Compare this article: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/

Kafka + Spring Batch Listener Flush Batch

Using Kafka Broker: 1.0.1
spring-kafka: 2.1.6.RELEASE
I'm using a batched consumer with the following settings:
// Other settings are not shown..
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
I use spring listener in the following way:
#KafkaListener(topics = "${topics}", groupId = "${consumer.group.id}")
public void receive(final List<String> data,
#Header(KafkaHeaders.RECEIVED_PARTITION_ID) final List<Integer> partitions,
#Header(KafkaHeaders.RECEIVED_TOPIC) Set<String> topics,
#Header(KafkaHeaders.OFFSET) final List<Long> offsets) { // ......code... }
I always find the a few messages remain in the batch and not received in my listener. It appears to be that if the remaining messages are less than a batch size, it isn't consumed (may be in memory and published to my listener). Is there any way to have a setting to auto-flush the batch after a time interval so as to avoid the messages not being flushed?
What's the best way to deal with such kind of situation with a batch consumer?
I just ran a test without any problems...
#SpringBootApplication
public class So50370851Application {
public static void main(String[] args) {
SpringApplication.run(So50370851Application.class, args);
}
#Bean
public ApplicationRunner runner(KafkaTemplate<String, String> template) {
return args -> {
for (int i = 0; i < 230; i++) {
template.send("so50370851", "foo" + i);
}
};
}
#KafkaListener(id = "foo", topics = "so50370851")
public void listen(List<String> in) {
System.out.println(in.size());
}
#Bean
public NewTopic topic() {
return new NewTopic("so50370851", 1, (short) 1);
}
}
and
spring.kafka.consumer.auto-offset-reset=earliest
spring.kafka.consumer.enable-auto-commit=false
spring.kafka.consumer.max-poll-records=100
spring.kafka.listener.type=batch
and
100
100
30
Also, the debug logs shows after a while that it is polling and fetching 0 records (and this gets repeated over and over).
That implies the problem is on the sending side.

Kafka Consumer committing manually based on a condition.

#kafkaListener consumer is commiting once a specific condition is met. Let us say a topic gets the following data from a producer
"Message 0" at offset[0]
"Message 1" at offset[1]
They are received at the consumer and commited with help of acknowledgement.acknowledge()
then the below messages come to the topic
"Message 2" at offset[2]
"Message 3" at offset[3]
The consumer which is running receive the above data. Here condition fail and the above offsets are not committed.
Even if new data comes at the topic, then also "Message 2" and "Message 3" should be picked up by any consumer from the same consumer group as they are not committed. But this is not happening,the consumer picks up a new message.
When I restart my consumer then I get back Message2 and Message3. This should have happened while the consumers were running.
The code is as follows -:
KafkaConsumerConfig file
enter code here
#Configuration
#EnableKafka
public class KafkaConsumerConfig {
#Bean
KafkaListenerContainerFactory<ConcurrentMessageListenerContainer<String, String>> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.setConcurrency(3);
factory.setBatchListener(true);
factory.getContainerProperties().setAckMode(AbstractMessageListenerContainer.AckMode.MANUAL_IMMEDIATE);
factory.getContainerProperties().setSyncCommits(true);
return factory;
}
#Bean
public ConsumerFactory<String, String> consumerFactory() {
return new DefaultKafkaConsumerFactory<>(consumerConfigs());
}
#Bean
public Map<String, Object> consumerConfigs() {
Map<String, Object> propsMap = new HashMap<>();
propsMap.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
propsMap.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
propsMap.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "100");
propsMap.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "15000");
propsMap.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
propsMap.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
propsMap.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");
propsMap.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
propsMap.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG,"1");
return propsMap;
}
#Bean
public Listener listener() {
return new Listener();
}
}
Listner Class
public class Listener {
public CountDownLatch countDownLatch0 = new CountDownLatch(3);
private Logger LOGGER = LoggerFactory.getLogger(Listener.class);
static int count0 =0;
#KafkaListener(topics = "abcdefghi", group = "group1", containerFactory = "kafkaListenerContainerFactory")
public void listenPartition0(String data, #Header(KafkaHeaders.RECEIVED_PARTITION_ID) List<Integer> partitions,
#Header(KafkaHeaders.OFFSET) List<Long> offsets, Acknowledgment acknowledgment) throws InterruptedException {
count0 = count0 + 1;
LOGGER.info("start consumer 0");
LOGGER.info("received message via consumer 0='{}' with partition-offset='{}'", data, partitions + "-" + offsets);
if (count0%2 ==0)
acknowledgment.acknowledge();
LOGGER.info("end of consumer 0");
}
How can i achieve my desired result?
That's correct. The offset is a number which is pretty easy to keep tracking in the memory on consumer instance. We need offsets commited for newly arrived consumers in the group for the same partitions. That's why it works as expected when you restart an application or when rebalance happens for the group.
To make it working as you would like you should consider to implement ConsumerSeekAware in your listener and call ConsumerSeekCallback.seek() for the offset you would like to star consume from the next poll cycle.
http://docs.spring.io/spring-kafka/docs/2.0.0.M2/reference/html/_reference.html#seek:
public class Listener implements ConsumerSeekAware {
private final ThreadLocal<ConsumerSeekCallback> seekCallBack = new ThreadLocal<>();
#Override
public void registerSeekCallback(ConsumerSeekCallback callback) {
this.seekCallBack.set(callback);
}
#KafkaListener()
public void listen(...) {
this.seekCallBack.get().seek(topic, partition, 0);
}
}