Exception using withCheckStopReadingFn with KafkaIO.read - apache-beam

My topic has 2 partitions, and my flow is to have the Dataflow job stop consuming messages if the message is from partition 0. I am trying to use withCheckStopReadingFn. Sample code:
public class CheckPartitionStatus implements SerializableFunction<TopicPartition, Boolean> {
    @Override
    public Boolean apply(TopicPartition input) {
        boolean value = false;
        if (input.equals(new TopicPartition(topicName, 0))) {
            value = true;
        }
        return value;
    }
}
The pipeline is as follows:
PTransform<PBegin, PCollection<KV<GenericData.Record, GenericData.Record>>> kafka =
    KafkaIO.<GenericData.Record, GenericData.Record>read()
        .withBootstrapServers(brokerurl)
        .withTopic(inputPersonTopic)
        .withConsumerConfigUpdates(props)
        .withKeyDeserializer(ConfluentSchemaRegistryDeserializerProvider
            .of(url, schemakey, null, csrConfig))
        .withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider
            .of(url, schemavalue, null, csrConfig))
        .withReadCommitted()
        .commitOffsetsInFinalize()
        .withCheckStopReadingFn(new CheckPartitionStatus())
        .withoutMetadata();

pipeline.apply(kafka)
    .apply(Values.<GenericData.Record>create())
    .apply("ProcessMessage", ParDo.of(new ProcessMessage()));

pipeline.run().waitUntilFinish();
I am running into this exception:
Caused by: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException: Last attempted offset should not be null. No work was claimed in non-empty range [7, 9223372036854775807).
According to the topic, the last message offset is 7. Any help on how I can use this function?

Related

Apache Beam: how to discard messages?

I have a pipeline that:
Reads messages from pubsub
Converts them to a domain object
Applies fixed window
Sends data back to a pubsub topic
I would like to process only specific messages - for example, those having a specific attribute - and discard all other messages. How can this be done in Beam?
Can I simply skip c.outputWithTimestamp(...); for the messages that should be discarded?
My code:
pipeline.apply("Read PubSub messages",
PubsubIO.
readStrings().
fromSubscription(pubsubSub))
.apply("Convert to DeviceData",
ParDo.of(new DoFn<String, KV<String, DeviceData>>() {
#Override
public Duration getAllowedTimestampSkew() {
return new Duration(Long.MAX_VALUE);
}
#ProcessElement
public void processElement(ProcessContext c) {
String message = c.element();
DeviceData data = new Gson().fromJson(message, DeviceData.class);
String sourceId = data.getSensorId() != null ? data.getSensorId() : data.getFormulaId();
// use timestamp from payload
Long timeInNanoSeconds = data.getTimeInNanoSeconds();
Instant timestamp = ClockUtil.fromNanos(timeInNanoSeconds);
long millis = timestamp.toEpochMilli();
c.outputWithTimestamp(KV.of(sourceId, data), new org.joda.time.Instant(millis));
}
}))
.apply("Apply fixed window", window)
.apply("Group by inputId", GroupByKey.create())
.apply("Collect created buckets", ParDo.of(new GatherBuckets(options.getWindowSize())))
.apply("Send to Pub/sub", PubsubIO.writeStrings().to(topic));
Can I simply skip c.outputWithTimestamp(...); for the messages that should be discarded?
Yes, a DoFn can emit any number of output messages per input message, including zero.
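For example, a minimal sketch of the "Convert to DeviceData" step with such a filter; the getSensorId() null check is only a stand-in for whatever "specific attribute" means in your domain:

ParDo.of(new DoFn<String, KV<String, DeviceData>>() {
    @Override
    public Duration getAllowedTimestampSkew() {
        // same skew override as in the question, since payload timestamps may be in the past
        return new Duration(Long.MAX_VALUE);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        DeviceData data = new Gson().fromJson(c.element(), DeviceData.class);
        // Hypothetical predicate: keep only messages that carry the attribute of interest.
        // Returning without calling output()/outputWithTimestamp() discards the element.
        if (data.getSensorId() == null) {
            return;
        }
        Instant timestamp = ClockUtil.fromNanos(data.getTimeInNanoSeconds());
        c.outputWithTimestamp(KV.of(data.getSensorId(), data),
                new org.joda.time.Instant(timestamp.toEpochMilli()));
    }
})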

Spring cloud Kafka does infinite retry when it fails

Currently, I am having an issue where one of the consumer functions throws an error which makes Kafka retry the records again and again.
@Bean
public Consumer<List<RuleEngineSubject>> processCohort() {
    return personDtoList -> {
        for (RuleEngineSubject subject : personDtoList) {
            processSubject(subject);
        }
    };
}
This is the consumer; processSubject throws a custom error, which causes it to fail.
processCohort-in-0:
  destination: internal-process-cohort
  consumer:
    max-attempts: 1
    batch-mode: true
    concurrency: 10
  group: process-cohort-group
The above is my Kafka binder configuration.
Currently, I am attempting to retry 2 times and then send the record to a dead letter queue, but I have been unsuccessful and am not sure which approach is right.
I have tried to implement a custom handler that handles the error so it does not retry again, but I am not sure how to send the record to a dead letter queue.
@Bean
ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> customizer() {
    return (container, dest, group) -> {
        if (group.equals("process-cohort-group")) {
            container.setBatchErrorHandler(new BatchErrorHandler() {
                @Override
                public void handle(Exception thrownException, ConsumerRecords<?, ?> data) {
                    data.records(dest).forEach(r -> System.out.println(r.value()));
                    System.out.println("failed payload: " + thrownException.getLocalizedMessage());
                }
            });
        }
    };
}
This stops the infinite retry but does not send to a dead letter queue. Can I get suggestions on how to retry two times and then send to a dead letter queue? From my understanding, the batch listener does not know how to recover when there is an error; could someone help shed light on this?
Retry 15 times, then send it to the topicname.DLT topic:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setCommonErrorHandler(
            new DefaultErrorHandler(
                    new DeadLetterPublishingRecoverer(kafkaTemplate()), kafkaBackOffPolicy()));
    factory.setConsumerFactory(kafkaConsumerFactory());
    return factory;
}

@Bean
public ExponentialBackOffWithMaxRetries kafkaBackOffPolicy() {
    var exponentialBackOff = new ExponentialBackOffWithMaxRetries(15);
    exponentialBackOff.setInitialInterval(Duration.ofMillis(500).toMillis());
    exponentialBackOff.setMultiplier(2);
    exponentialBackOff.setMaxInterval(Duration.ofSeconds(2).toMillis());
    return exponentialBackOff;
}
You need to configure a suitable error handler in the listener container; you can disable retry and DLQ in the binding and use a DeadLetterPublishingRecoverer instead. See the answer to Retry max 3 times when consuming batches in Spring Cloud Stream Kafka Binder.
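Applied to the customizer from the question, a minimal sketch of that combination, assuming spring-kafka 2.8+ and an available KafkaOperations bean (the kafkaTemplate parameter name here is an assumption): with FixedBackOff(1000L, 2L) the failed records are retried twice and then published to <topic>.DLT.

@Bean
ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> customizer(
        KafkaOperations<Object, Object> kafkaTemplate) { // assumed bean used for DLT publishing
    return (container, dest, group) -> {
        if (group.equals("process-cohort-group")) {
            // two retries with a 1 s back-off, then publish the failed record(s) to <topic>.DLT
            container.setCommonErrorHandler(new DefaultErrorHandler(
                    new DeadLetterPublishingRecoverer(kafkaTemplate),
                    new FixedBackOff(1000L, 2L)));
        }
    };
}

Keeping max-attempts: 1 in the binding (as in the question) prevents the binder's own retry from stacking on top of the container-level error handler.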

Pause Kafka Consumer with spring-cloud-stream and Functional Style

I'm trying to implement a retry mechanism for my Kafka stream application. The idea is that I would get the consumer and partition ID, as well as the topic name, from the input topic, and then pause the consumer for the duration stored in the payload.
I've searched for documentation and examples, but all I found are examples based on the classic bindings provided by spring-cloud-stream. I'm trying to see if there's a way to get access to this information with the functional style.
For example, the following code gives me access to the consumer with the classic binding style:
@StreamListener(Sink.INPUT)
public void in(String in, @Header(KafkaHeaders.CONSUMER) Consumer<?, ?> consumer) {
    System.out.println(in);
    consumer.pause(Collections.singleton(new TopicPartition("myTopic", 0)));
}
How do I get the equivalent with the functional style?
I tried the following code, but I'm getting an exception saying no such binding is found.
@Bean
public Function<Message<?>, KStream<String, String>> process() {
    return message -> {
        Consumer<?, ?> consumer = message.getHeaders().get(KafkaHeaders.CONSUMER, Consumer.class);
        String topic = message.getHeaders().get(KafkaHeaders.RECEIVED_TOPIC, String.class);
        Integer partitionId = message.getHeaders().get(KafkaHeaders.RECEIVED_PARTITION_ID, Integer.class);
        CustomPayload payload = (CustomPayload) message.getPayload();
        if (payload.getRetryTime() < System.currentTimeMillis()) {
            consumer.pause(Collections.singleton(new TopicPartition(topic, partitionId)));
        }
    };
}
The exception I got:
Caused by: java.lang.IllegalStateException: No factory found for binding target type: org.springframework.messaging.Message among registered factories: channelFactory,messageSourceFactory,kStreamBoundElementFactory,kTableBoundElementFactory,globalKTableBoundElementFactory
at org.springframework.cloud.stream.binding.AbstractBindableProxyFactory.getBindingTargetFactory(AbstractBindableProxyFactory.java:82)
at org.springframework.cloud.stream.binder.kafka.streams.function.KafkaStreamsBindableProxyFactory.bindInput(KafkaStreamsBindableProxyFactory.java:191)
at org.springframework.cloud.stream.binder.kafka.streams.function.KafkaStreamsBindableProxyFactory.afterPropertiesSet(KafkaStreamsBindableProxyFactory.java:111)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1853)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1790)
... 96 more
In your functional bean example, you are mixing both Message and KStream. That is the reason for that specific exception. The functional bean could be rewritten as below.
@Bean
public java.util.function.Consumer<Message<?>> process() {
    return message -> {
        Consumer<?, ?> consumer = message.getHeaders().get(KafkaHeaders.CONSUMER, Consumer.class);
        String topic = message.getHeaders().get(KafkaHeaders.RECEIVED_TOPIC, String.class);
        Integer partitionId = message.getHeaders().get(KafkaHeaders.RECEIVED_PARTITION_ID, Integer.class);
        CustomPayload payload = (CustomPayload) message.getPayload();
        if (payload.getRetryTime() < System.currentTimeMillis()) {
            consumer.pause(Collections.singleton(new TopicPartition(topic, partitionId)));
        }
    };
}

Flink getting past bad messages in Kafka: "poison message"

First time I'm trying to get this to work, so bear with me. I'm trying to learn checkpointing with Kafka and handling "bad" messages, restarting without losing state.
Use case:
Use checkpointing.
Read a stream of integers from Kafka, keep a running sum.
If a "bad" Kafka message is read, restart the app, skip the "bad" message, keep state.
My stream would look something like this:
set1,5
set1,7
set1,foobar
set1,6
I want my app to keep a running sum of the integers it has seen, and restart if it crashes without losing state, so the app behavior/running sum would be:
5,
12,
app crashes and restarts, reads checkpoint
18
etc.
However, I'm finding that when my app restarts, it keeps reading the bad "foobar" message and doesn't get past it. Source code is below. The mapper bombs when I try to parse "foobar" as an Integer.
How can I modify the app to get past the "poison" message?
env.enableCheckpointing(1000L);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500L);
env.getCheckpointConfig().setCheckpointTimeout(10000);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.setStateBackend(new FsStateBackend("hdfs://mymachine:9000/flink/checkpoints"));

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", BROKERS);
properties.setProperty("zookeeper.connect", ZOOKEEPER_HOST);
properties.setProperty("group.id", "consumerGroup1");

FlinkKafkaConsumer08<String> kafkaConsumer =
        new FlinkKafkaConsumer08<>(topicName, new SimpleStringSchema(), properties);
DataStream<String> messageStream = env.addSource(kafkaConsumer);
DataStream<Tuple2<String, Integer>> sums = messageStream
        .map(new NumberMapper())
        .keyBy(0)
        .sum(1);
sums.print();

private static class NumberMapper implements MapFunction<String, Tuple2<String, Integer>> {

    @Override
    public Tuple2<String, Integer> map(String input) throws Exception {
        return parseData(input);
    }

    private Tuple2<String, Integer> parseData(String record) {
        String[] tokens = record.toLowerCase().split(",");
        // Get Key
        String key = tokens[0];
        // Get Integer Value
        String integerValue = tokens[1];
        System.out.println("Trying to Parse=" + integerValue);
        Integer value = Integer.parseInt(integerValue);
        // Build Tuple
        return new Tuple2<String, Integer>(key, value);
    }
}
You could change the NumberMapper into a FlatMapFunction and filter out invalid elements:
private static class NumberMapper implements FlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String input, Collector<Tuple2<String, Integer>> collector) throws Exception {
        Optional<Tuple2<String, Integer>> optionalResult = parseData(input);
        optionalResult.ifPresent(collector::collect);
    }

    private Optional<Tuple2<String, Integer>> parseData(String record) {
        String[] tokens = record.toLowerCase().split(",");
        // Get Key
        String key = tokens[0];
        // Get Integer Value
        String integerValue = tokens[1];
        try {
            Integer value = Integer.parseInt(integerValue);
            // Build Tuple
            return Optional.of(Tuple2.of(key, value));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
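The pipeline wiring then changes from map() to flatMap(), while the rest of the job stays the same:

// invalid records are simply never emitted by the FlatMapFunction above
DataStream<Tuple2<String, Integer>> sums = messageStream
        .flatMap(new NumberMapper())
        .keyBy(0)
        .sum(1);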

KTable Reduce function does not honor windowing

Requirement: we need to consolidate all the messages having the same orderid and perform subsequent operations on the consolidated message.
Explanation: the snippet of code below tries to capture all order messages received from a particular tenant and consolidate them into a single order message after waiting for a specific period of time.
It does the following:
Repartitions messages based on OrderId, so each order message will have tenantId and groupId as its key
Performs a group-by-key operation followed by a windowed operation of 2 minutes
Performs the reduce operation once windowing is completed
Converts the KTable back to a stream, whose output is then sent to another Kafka topic
Expected output: if 5 messages having the same order id are sent within the window period, the final Kafka topic should have only one message, the one from the last reduce operation.
Actual output: all 5 messages are seen, indicating that windowing is not happening before the reduce operation is invoked. Each message seen in Kafka does have the reduce operation properly applied as it is received.
Queries: in Kafka Streams library version 0.11.0.0, the reduce function used to accept a time window as its argument. I see that this is deprecated in Kafka Streams version 1.0.0. Is the windowing done in the code below correct? Is windowing supported in the newer Kafka Streams library version 1.0.0? If so, is there something that can be improved in the snippet of code below?
String orderMsgTopic = "sampleordertopic";

JsonSerializer<OrderMsg> orderMsgJSONSerialiser = new JsonSerializer<>();
JsonDeserializer<OrderMsg> orderMsgJSONDeSerialiser = new JsonDeserializer<>(OrderMsg.class);
Serde<OrderMsg> orderMsgSerde = Serdes.serdeFrom(orderMsgJSONSerialiser, orderMsgJSONDeSerialiser);

KStream<String, OrderMsg> orderMsgStream = this.builder.stream(orderMsgTopic, Consumed.with(Serdes.ByteArray(), orderMsgSerde))
    .map(new KeyValueMapper<byte[], OrderMsg, KeyValue<? extends String, ? extends OrderMsg>>() {
        @Override
        public KeyValue<? extends String, ? extends OrderMsg> apply(byte[] byteArr, OrderMsg value) {
            TenantIdMessageTypeDeserializer deserializer = new TenantIdMessageTypeDeserializer();
            TenantIdMessageType tenantIdMessageType = deserializer.deserialize(orderMsgTopic, byteArr);
            String newTenantOrderKey = null;
            if ((tenantIdMessageType != null) && (tenantIdMessageType.getMessageType() == 1)) {
                Long tenantId = tenantIdMessageType.getTenantId();
                newTenantOrderKey = tenantId.toString() + value.getOrderKey();
            } else {
                newTenantOrderKey = value.getOrderKey();
            }
            return new KeyValue<String, OrderMsg>(newTenantOrderKey, value);
        }
    });

final KTable<Windowed<String>, OrderMsg> orderGrouping = orderMsgStream
    .groupByKey(Serialized.with(Serdes.String(), orderMsgSerde))
    .windowedBy(TimeWindows.of(windowTime).advanceBy(windowTime))
    .reduce(new OrderMsgReducer());

orderGrouping.toStream().map(new KeyValueMapper<Windowed<String>, OrderMsg, KeyValue<String, OrderMsg>>() {
    @Override
    public KeyValue<String, OrderMsg> apply(Windowed<String> key, OrderMsg value) {
        return new KeyValue<String, OrderMsg>(key.key(), value);
    }
}).to("newone11", Produced.with(Serdes.String(), orderMsgSerde));
I realised that I had set StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG to 0 and had also set the default commit interval of 1000 ms. Changing these values helped me, to some extent, to get the windowing working.
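For reference, a minimal sketch of the two settings involved (the concrete values are assumptions to be tuned, not recommendations):

Properties streamsConfig = new Properties();
// re-enable record caching (it was disabled by setting it to 0), so the windowed
// KTable does not forward every single update downstream
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
// commit (and flush the cache) less often than the 1000 ms used before
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30 * 1000L);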