How to Commit Kafka Offsets Manually in Flink

I have a Flink job that consumes a Kafka topic and sinks it to another topic. The job is configured with auto.commit and a 3-minute interval (checkpointing is disabled), so on the monitoring side there is a 3-minute lag. We want to monitor the processing in real time without that 3-minute lag, so we would like the FlinkKafkaConsumer to be able to commit the offset immediately after the sink function.
Is there a way to achieve this within the Flink framework?
Or are there any other options?
Near the end of main() in the code below, I create a KafkaConsumer instance and call commitSync() to try to make this work, but it has no effect.
public class CEPJobTest {
    private final static String TOPIC = "test";
    private final static String BOOTSTRAP_SERVERS = "localhost:9092";

    public static void main(String[] args) throws Exception {
        System.out.println("start cep test job...");
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "console-consumer-cep");
        properties.setProperty("enable.auto.commit", "false");
        // offset interval
        //properties.setProperty("auto.commit.interval.ms", "500");

        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),
                properties);
        // set commit offsets by checkpoint
        consumer.setCommitOffsetsOnCheckpoints(false);
        System.out.println("checkpoint enabled:" + consumer.getEnableCommitOnCheckpoints());

        DataStream<String> stream = env.addSource(consumer);
        stream.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                return new Date().toString() + ": " + value;
            }
        }).print();

        // here, I want to commit the offset manually after processing the message...
        KafkaConsumer<?, ?> kafkaConsumer = new KafkaConsumer<>(properties);
        kafkaConsumer.commitSync();

        env.execute("Flink Streaming");
    }

    private static Consumer<Long, String> createConsumer() {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "KafkaExampleConsumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, LongDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        final Consumer<Long, String> consumer = new KafkaConsumer<>(props);
        return consumer;
    }
}

This does not work the way your code intends.
env.execute() submits the job to the cluster, and execution only starts after that call. Everything before that line merely builds the job graph; it does not execute anything. So the commitSync() in main() runs once at submission time, not after each record is processed.
To commit after the sink, you should put the commit inside your sink function:
class MySink extends RichSinkFunction[String] {
  override def invoke(value: String, context: SinkFunction.Context): Unit = {
    // write `value` to the downstream system first, then commit
    val kafkaConsumer = new KafkaConsumer[String, String](properties)
    kafkaConsumer.commitSync()
  }
}
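For completeness, here is a rough Java sketch of the same idea. It is only a sketch under assumptions, not the Flink-sanctioned approach: ConsumedRecord is a hypothetical POJO carrying the topic, partition and offset of each record (e.g. extracted via a KafkaDeserializationSchema), and a fresh KafkaConsumer commits nothing unless you pass it explicit offsets, hence the commitSync(Map) overload. Offsets committed this way only make external lag monitoring more current; Flink does not use them for its fault-tolerance guarantees.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

// ConsumedRecord is a hypothetical POJO with topic, partition and offset fields.
public class OffsetCommittingSink extends RichSinkFunction<ConsumedRecord> {

    private transient KafkaConsumer<byte[], byte[]> committer;

    @Override
    public void open(Configuration parameters) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "console-consumer-cep"); // same group as the Flink source
        props.setProperty("key.deserializer", ByteArrayDeserializer.class.getName());
        props.setProperty("value.deserializer", ByteArrayDeserializer.class.getName());
        committer = new KafkaConsumer<>(props);
    }

    @Override
    public void invoke(ConsumedRecord record, Context context) {
        // 1. write the record to the downstream system here ...

        // 2. then commit exactly this record's offset (+1, per the Kafka convention)
        Map<TopicPartition, OffsetAndMetadata> offsets = Collections.singletonMap(
                new TopicPartition(record.topic, record.partition),
                new OffsetAndMetadata(record.offset + 1));
        committer.commitSync(offsets);
    }

    @Override
    public void close() {
        if (committer != null) {
            committer.close();
        }
    }
}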

Related

Why is windowing not working for Kafka Streams?

I am running a simple Kafka Streams program in Eclipse. It runs successfully, but the windowing does not take effect.
I want all messages received within a 5-second window to be aggregated and written to the output topic. I googled and understand that I need to implement the tumbling window concept. However, I see that the output is sent to the output topic instantly.
What am I doing wrong here? Below is the main method that I am running:
public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    final StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("wc-input");

    @SuppressWarnings("deprecation")
    KTable<Windowed<String>, Long> counts = source
            .flatMapValues(new ValueMapper<String, Iterable<String>>() {
                @Override
                public Iterable<String> apply(String value) {
                    return Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" "));
                }
            })
            .groupBy(new KeyValueMapper<String, String, String>() {
                @Override
                public String apply(String key, String value) {
                    return value;
                }
            })
            .count(TimeWindows.of(10000L)
                    .until(10000L), "Counts");

    // need to override value serde to Long type
    counts.to("wc-output");

    final Topology topology = builder.build();
    final KafkaStreams streams = new KafkaStreams(topology, props);
    final CountDownLatch latch = new CountDownLatch(1);

    // attach shutdown handler to catch control-c
    Runtime.getRuntime().addShutdownHook(new Thread("streams-wordcount-shutdown-hook") {
        @Override
        public void run() {
            streams.close();
            latch.countDown();
        }
    });

    try {
        streams.start();
        long windowSizeMs = TimeUnit.MINUTES.toMillis(50000); // 5 * 60 * 1000L
        TimeWindows.of(windowSizeMs);
        TimeWindows.of(windowSizeMs).advanceBy(windowSizeMs);
        latch.await();
    } catch (Throwable e) {
        System.exit(1);
    }
    System.exit(0);
}
Windowing does not mean "one output" per window. If you want to get only one output per window, you want to use suppress() on the result KTable.
Compare this article: https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
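As a hedged sketch of what that can look like for the word count above (suppress() requires Kafka Streams 2.1 or later; the topic names come from the question, the rest of the wiring is illustrative):

import java.time.Duration;
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class SuppressedWordCount {

    static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        KTable<Windowed<String>, Long> counts = builder.<String, String>stream("wc-input")
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
                .groupBy((key, word) -> word)
                // 5-second tumbling windows; grace(ZERO) means late records are not accepted
                .windowedBy(TimeWindows.of(Duration.ofSeconds(5)).grace(Duration.ZERO))
                .count()
                // emit a single, final result per window, only once the window has closed
                .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));

        counts.toStream()
                .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
                .to("wc-output", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}

Note that suppress() releases a window's result only once stream time has advanced past the window end plus the grace period, i.e. only when newer records arrive, so a quiet input topic will not flush the last open window.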

Kafka consumers from two partitions

In Kafka, I need to consume a topic with two partitions from two consumers (partition 1 to consumer 1 and partition 2 to consumer 2) using Java.
This is my producer code:
public class KafkaClientOperationProducer {
    KafkaClientOperationConsumer kac = new KafkaClientOperationConsumer();

    public void initiateProducer(ClientOperation clientOperation,
            ClientOperationManager activityManager, Logger logger) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        Producer<String, ClientOperation> producer = new KafkaProducer<>(props);
        try {
            ProducerRecord<String, ClientOperation> record = new ProducerRecord<String, ClientOperation>(
                    topicName, key, clientOperation);
            producer.send(record);
        } finally {
            producer.flush();
            producer.close();
            kac.initiateConsumer(activityManager); // calling the consumer
        }
    }
}
This is my consumer code:
public class KafkaClientOperationConsumer {
    String topicName = "CA_Topic";
    String groupName = "CA_TopicGroup";

    public void initiateConsumer(ClientOperationManager activityManager) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        props.put("group.id", groupName);
        props.put("enable.auto.commit", "true");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, ClientOperation> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topicName));
        ConsumerRecords<String, ClientOperation> records = consumer.poll(100);
        try {
            for (ConsumerRecord<String, ClientOperation> record : records) {
                activityManager.save(record.value()); // saves data in the database
            }
        } finally {
            consumer.close();
        }
    }
}
The above code works fine for a single consumer but not for multiple consumers.
clientOperation is an object that holds data about a client operation.
The partition count is three (which you can see from the code). When I tried to call initiateConsumer from threads (i.e. an ExecutorService executor), I got duplicate values in the database.
Please change my code so that I can consume CA_Topic using two consumers. I can't use two JVMs due to a memory problem. Thanks in advance.
I guess you must use the KafkaConsumer.assign method. Here is a little example:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ip:port");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group_id");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
final Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
// Topic name and partition id to be assigned to this consumer.
// In the other consumer's configuration, use a partition id other than 0.
TopicPartition topicPartition = new TopicPartition("topic", 0);
List<TopicPartition> partitionList = new ArrayList<TopicPartition>();
partitionList.add(topicPartition);
consumer.assign(partitionList); // here, partition 0 is assigned to this consumer
You can see the details in the Kafka documentation: https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#assign(java.util.Collection)
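Applied to the question, here is a hedged sketch of running two manually-assigned consumers in one JVM, one thread per partition. The topic name and broker list come from the question; the String deserializers, the poll timeout and the executor wiring are assumptions, and offsets are not committed in this sketch.

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignedPartitionConsumers {

    static Runnable consumerFor(final int partition) {
        return () -> {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092,localhost:9093,localhost:9094");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // offsets are not committed here
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Manual assignment: no group management, so the two consumers never
                // rebalance against each other and each one reads exactly one partition.
                consumer.assign(Collections.singletonList(new TopicPartition("CA_Topic", partition)));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(100);
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        };
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        executor.execute(consumerFor(0)); // consumer 1 -> partition 0
        executor.execute(consumerFor(1)); // consumer 2 -> partition 1
    }
}

Because assign() bypasses group management, each thread owns its partition outright, so the two consumers can never read the same records.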

kafka streams not fetching all records

I have a Java Kafka Streams application that reads from a topic, does some filtering and transformations, and writes the data back to Kafka into a different topic.
I print the stream at every step.
I noticed that if I send more than a few dozen records to the input topic, some records are not consumed by my Kafka Streams application.
When I use kafka-console-consumer.sh to consume from the input topic, I do receive all records.
I'm running Kafka 1.0.0 with one broker and one partition topic.
Any idea why?
public static void main(String[] args) {
    final String bootstrapServers = System.getenv("KAFKA");
    final String inputTopic = System.getenv("INPUT_TOPIC");
    final String outputTopic = System.getenv("OUTPUT_TOPIC");
    final String gatewayTopic = System.getenv("GATEWAY_TOPIC");

    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "PreProcess");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "PreProcess-client");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 300L);

    final StreamsBuilder builder = new StreamsBuilder();
    final KStream<String, String> textLines = builder.stream(inputTopic);
    textLines.print();

    StreamsTransformation streamsTransformation = new StreamsTransformation(builder);
    KTable<String, Gateway> gatewayKTable = builder.table(gatewayTopic, Consumed.with(Serdes.String(), SerdesUtils.getGatewaySerde()));
    KStream<String, Message> gatewayIdMessageKStream = streamsTransformation.getStringMessageKStream(textLines, gatewayKTable);
    gatewayIdMessageKStream.print();

    KStream<String, FlatSensor> keyFlatSensorKStream = streamsTransformation.transformToKeyFlatSensorKStream(gatewayIdMessageKStream);
    keyFlatSensorKStream.to(outputTopic, Produced.with(Serdes.String(), SerdesUtils.getFlatSensorSerde()));
    keyFlatSensorKStream.print();

    KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
    streams.cleanUp();
    streams.start();

    // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        streams.close();
    }));
}

sending DataStream from VM socket to Kafka and receiving on Host OS's Flink program : Deserialization Issue

I am sending a data stream from a VM to Kafka's test topic (running on the host OS at IP 192.168.0.12) using the code below:
public class WriteToKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Use ingestion time => TimeCharacteristic == EventTime + IngestionTimeExtractor
        env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
        DataStream<JoinedStreamEvent> joinedStreamEventDataStream = env
                .addSource(new JoinedStreamGenerator()).assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.0.12:9092");
        properties.setProperty("zookeeper.connect", "192.168.0.12:2181");
        properties.setProperty("group.id", "test");

        DataStreamSource<JoinedStreamEvent> stream = env.addSource(new JoinedStreamGenerator());
        stream.addSink(new FlinkKafkaProducer09<JoinedStreamEvent>("test", new TypeInformationSerializationSchema<>(stream.getType(), env.getConfig()), properties));
        env.execute();
    }
}
JoinedStreamEvent is of type DataStream<Tuple3<Integer, Integer, Integer>>; it basically joins two streams, respirationRateStream and heartRateStream:
public JoinedStreamEvent(Integer patient_id, Integer heartRate, Integer respirationRate) {
    Patient_id = patient_id;
    HeartRate = heartRate;
    RespirationRate = respirationRate;
}
There is another Flink program running on the host OS that tries to read the data stream from Kafka. I am using localhost here because Kafka and ZooKeeper are running on the host OS.
public class ReadFromKafka {
    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "test");

        DataStream<String> message = env.addSource(new FlinkKafkaConsumer09<String>("test", new SimpleStringSchema(), properties));
        /* DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
                new , properties)); */
        message.print();
        env.execute();
    } //main
} //ReadFromKafka
I am getting output something like this
I think I need to implement a deserializer for JoinedStreamEvent. Can someone please give me an idea of how to write the deserializer for JoinedStreamEvent of type DataStream<Tuple3<Integer, Integer, Integer>>?
Please let me know if something else needs to be done.
P.S. I thought of writing the following deserializer, but I don't think it is right:
DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
        new TypeInformationSerializationSchema<JoinedStreamEvent>(), properties));
I was able to receive events from the VM in the same format by writing a custom serializer and deserializer for both the VM and host OS programs, as shown below.
public class JoinSchema implements DeserializationSchema<JoinedStreamEvent>, SerializationSchema<JoinedStreamEvent> {

    @Override
    public JoinedStreamEvent deserialize(byte[] bytes) throws IOException {
        return JoinedStreamEvent.fromstring(new String(bytes));
    }

    @Override
    public boolean isEndOfStream(JoinedStreamEvent joinedStreamEvent) {
        return false;
    }

    @Override
    public TypeInformation<JoinedStreamEvent> getProducedType() {
        return TypeExtractor.getForClass(JoinedStreamEvent.class);
    }

    @Override
    public byte[] serialize(JoinedStreamEvent joinedStreamEvent) {
        return joinedStreamEvent.toString().getBytes();
    }
} //JoinSchema
Please note that you may have to write a fromstring() method in your event type, as I have added fromstring() to the JoinedStreamEvent class below:
public static JoinedStreamEvent fromstring(String line) {
    String[] token = line.split(",");
    Integer val1 = Integer.valueOf(token[0]);
    Integer val2 = Integer.valueOf(token[1]);
    Integer val3 = Integer.valueOf(token[2]);
    return new JoinedStreamEvent(val1, val2, val3);
} //fromstring
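For the round trip to work, toString() has to produce the same comma-separated layout that fromstring() parses. A minimal sketch of that counterpart, assuming the field names from the constructor shown earlier:

@Override
public String toString() {
    // must match the "patientId,heartRate,respirationRate" layout parsed by fromstring()
    return Patient_id + "," + HeartRate + "," + RespirationRate;
}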
Events were sent from the VM using the code below:
stream.addSink(new FlinkKafkaProducer09<JoinedStreamEvent>("test", new JoinSchema(), properties));
Events were received using the following code:
public static void main(String[] args) throws Exception {
    // create execution environment
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "localhost:9092");
    properties.setProperty("zookeeper.connect", "localhost:2181");
    properties.setProperty("group.id", "test");

    DataStream<JoinedStreamEvent> message = env.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
            new JoinSchema(), properties));
    message.print();
    env.execute();
} //main

Kafka10 consumer vs Kafka8 consumer

I have been using Kafka8 and am trying to move to Kafka10.
We have a topic with 10 partitions, and we used to create a consumer group with 10 consumers as shown below.
public void run(int a_numThreads) {
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, new Integer(a_numThreads));
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

    // now launch all the threads
    executor = Executors.newFixedThreadPool(a_numThreads);

    // now create an object to consume the messages
    int threadNumber = 0;
    for (final KafkaStream stream : streams) {
        executor.execute(new ConsumerTest(stream, threadNumber));
        threadNumber++;
    }
}
Here, we used to pass the number of threads based on the number of partitions.
But with the Kafka10 consumer I am not sure if there is anything like that; it does not return streams per partition.
public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "192.168.33.10:9092");
    props.put("group.id", "group-1");
    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "1000");
    props.put("auto.offset.reset", "earliest");
    props.put("session.timeout.ms", "30000");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
    kafkaConsumer.subscribe(Arrays.asList("HelloKafkaTopic"));
    while (true) {
        ConsumerRecords<String, String> records = kafkaConsumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, value = %s", record.offset(), record.value());
            System.out.println();
        }
    }
}
Thanks in advance.
The new consumer enables a simple and efficient implementation that can handle all I/O from a single thread. That is quite different from the old consumer. See this blog for further details:
https://www.confluent.io/blog/tutorial-getting-started-with-the-new-apache-kafka-0-9-consumer-client/
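In practice, the new-consumer equivalent of the old per-stream thread pool is simply one KafkaConsumer instance per thread, all sharing the same group.id, so the broker-side group coordinator splits the 10 partitions across them. A rough sketch under that assumption (broker address, group id and topic are copied from the question; the thread-pool wiring is illustrative only):

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerGroupThreads {

    public static void main(String[] args) {
        int numThreads = 10; // one consumer per partition, like the old high-level consumer
        ExecutorService executor = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < numThreads; i++) {
            executor.execute(ConsumerGroupThreads::runConsumer);
        }
    }

    private static void runConsumer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.33.10:9092");
        props.put("group.id", "group-1"); // same group => the 10 partitions are shared across the 10 consumers
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // KafkaConsumer is not thread-safe, so every thread gets its own instance.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("HelloKafkaTopic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("thread=%s offset=%d value=%s%n",
                            Thread.currentThread().getName(), record.offset(), record.value());
                }
            }
        }
    }
}

If one of the threads dies, the group coordinator rebalances its partitions onto the surviving consumers, which is essentially the behavior the old consumer group gave you.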