kafka streams not fetching all records - apache-kafka

I have a Java Kafka Streams application that reads from a topic, does some filtering and transformations, and writes the data back to a different Kafka topic.
I print the stream object at every step.
I noticed that if I send more than a few dozen records to the input topic, some records are not consumed by my Kafka Streams application.
When using kafka-console-consumer.sh to consume from the input topic, I do receive all records.
I'm running Kafka 1.0.0 with one broker and a single-partition topic.
Any idea why?
public static void main(String[] args) {
    final String bootstrapServers = System.getenv("KAFKA");
    final String inputTopic = System.getenv("INPUT_TOPIC");
    final String outputTopic = System.getenv("OUTPUT_TOPIC");
    final String gatewayTopic = System.getenv("GATEWAY_TOPIC");

    final Properties streamsConfiguration = new Properties();
    streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "PreProcess");
    streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "PreProcess-client");
    streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 300L);

    final StreamsBuilder builder = new StreamsBuilder();
    final KStream<String, String> textLines = builder.stream(inputTopic);
    textLines.print();

    StreamsTransformation streamsTransformation = new StreamsTransformation(builder);
    KTable<String, Gateway> gatewayKTable = builder.table(gatewayTopic, Consumed.with(Serdes.String(), SerdesUtils.getGatewaySerde()));
    KStream<String, Message> gatewayIdMessageKStream = streamsTransformation.getStringMessageKStream(textLines, gatewayKTable);
    gatewayIdMessageKStream.print();

    KStream<String, FlatSensor> keyFlatSensorKStream = streamsTransformation.transformToKeyFlatSensorKStream(gatewayIdMessageKStream);
    keyFlatSensorKStream.to(outputTopic, Produced.with(Serdes.String(), SerdesUtils.getFlatSensorSerde()));
    keyFlatSensorKStream.print();

    KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
    streams.cleanUp();
    streams.start();

    // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        streams.close();
    }));
}
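
One way to narrow this down (not from the original post) is to count what actually landed on the output topic with a plain consumer, independent of the Streams application, reusing bootstrapServers and outputTopic from the code above. This is only a sketch; the group id "preprocess-verify", the byte-array deserializers, and the fixed number of polls are assumptions.

// Hypothetical verification snippet: count records on the output topic as raw bytes.
Properties verifyProps = new Properties();
verifyProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
verifyProps.put(ConsumerConfig.GROUP_ID_CONFIG, "preprocess-verify"); // made-up group id
verifyProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
verifyProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
verifyProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

long count = 0;
try (KafkaConsumer<byte[], byte[]> verifier = new KafkaConsumer<>(verifyProps)) {
    verifier.subscribe(Collections.singletonList(outputTopic));
    for (int i = 0; i < 10; i++) { // poll a few times so the group joins and fetches from the beginning
        count += verifier.poll(1000).count();
    }
}
System.out.println("records found on " + outputTopic + ": " + count);

Comparing that count with the number of records sent to the input topic at least separates "not consumed" from "consumed but filtered out or not produced".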

Related

How to Commit Kafka Offsets Manually in Flink

I have a Flink job that consumes a Kafka topic and sinks it to another topic. The job is set to auto-commit with an interval of 3 minutes (checkpointing disabled), but on the monitoring side that shows up as 3 minutes of lag. We want to monitor the processing in real time without that 3-minute lag, so we would like the FlinkKafkaConsumer to commit the offset immediately after the sink function.
Is there a way to achieve this within the Flink framework?
Or any other options?
Near the end of main() below, I create a KafkaConsumer instance and call commitSync() to try to make this work, but it does not work.
public class CEPJobTest {
    private final static String TOPIC = "test";
    private final static String BOOTSTRAP_SERVERS = "localhost:9092";

    public static void main(String[] args) throws Exception {
        System.out.println("start cep test job...");
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "console-consumer-cep");
        properties.setProperty("enable.auto.commit", "false");
        // offset interval
        //properties.setProperty("auto.commit.interval.ms", "500");

        FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(), properties);
        // set commit offsets on checkpoints
        consumer.setCommitOffsetsOnCheckpoints(false);
        System.out.println("checkpoint enabled:" + consumer.getEnableCommitOnCheckpoints());

        DataStream<String> stream = env.addSource(consumer);
        stream.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) throws Exception {
                return new Date().toString() + ": " + value;
            }
        }).print();

        // here, I want to commit the offset manually after processing each message...
        KafkaConsumer<?, ?> kafkaConsumer = new KafkaConsumer(properties);
        kafkaConsumer.commitSync();

        env.execute("Flink Streaming");
    }

    private static Consumer<Long, String> createConsumer() {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, BOOTSTRAP_SERVERS);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "KafkaExampleConsumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, LongDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        final Consumer<Long, String> consumer = new KafkaConsumer<>(props);
        return consumer;
    }
}
This does not work the way your code expects.
env.execute() submits the job to the cluster, and only then does execution start; the code before that line just builds the job graph, it doesn't execute anything.
To commit after the sink, you should put the commit inside your sink function:
class MySink extends RichSinkFunction<String> {
    @Override
    public void invoke(String value, Context context) {
        // 'properties' has to be made available to the sink, e.g. passed in through its constructor
        KafkaConsumer<?, ?> kafkaConsumer = new KafkaConsumer<>(properties);
        kafkaConsumer.commitSync();
    }
}
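
Wired into the job from the question, that replaces the commitSync() call that currently sits after env.execute(); a rough sketch, assuming MySink somehow receives the consumer properties (constructor, field, etc.):

// attach the committing sink instead of creating a KafkaConsumer after env.execute()
stream.map(new MapFunction<String, String>() {
    @Override
    public String map(String value) throws Exception {
        return new Date().toString() + ": " + value;
    }
}).addSink(new MySink()); // MySink from the sketch above
env.execute("Flink Streaming");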

Kafka consumers from two partitions

In Kafka, I need to consume a topic with two partitions from two consumers (partition 1 going to consumer 1 and partition 2 to consumer 2) using Java.
This is my Producer Code
public class KafkaClientOperationProducer {
    KafkaClientOperationConsumer kac = new KafkaClientOperationConsumer();

    public void initiateProducer(ClientOperation clientOperation,
            ClientOperationManager activityManager, Logger logger) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        Producer<String, ClientOperation> producer = new KafkaProducer<>(props);
        try {
            // topicName and key are defined elsewhere in the original class
            ProducerRecord<String, ClientOperation> record = new ProducerRecord<String, ClientOperation>(
                    topicName, key, clientOperation);
            producer.send(record);
        } finally {
            producer.flush();
            producer.close();
            kac.initiateConsumer(activityManager); // calling the consumer
        }
    }
}
This is my Consumer code
public class KafkaClientOperationConsumer {
    String topicName = "CA_Topic";
    String groupName = "CA_TopicGroup";

    public void initiateConsumer(ClientOperationManager activityManager) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092,localhost:9093,localhost:9094");
        props.put("group.id", groupName);
        props.put("enable.auto.commit", "true");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, ClientOperation> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topicName));
        ConsumerRecords<String, ClientOperation> records = consumer.poll(100);
        try {
            for (ConsumerRecord<String, ClientOperation> record : records) {
                activityManager.save(record.value()); // saves data in the database
            }
        } finally {
            consumer.close();
        }
    }
}
The above code works fine for a single consumer but not for multiple consumers.
clientOperation is an object that holds data about a client operation.
The partition count is three (which you can see from the code). When I tried to call initiateConsumer from threads, i.e. via an ExecutorService, I got duplicate values in the database.
Please change my code so that I can consume CA_Topic using two consumers. I can't use two JVMs due to a memory problem. Thanks in advance.
I guess you must use the KafkaConsumer.assign method. Here is a little example:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "ip:port");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "group_id");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
final Consumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);

// topic name and partition id assigned to this consumer; the other consumer must use a partition id other than 0
TopicPartition topicPartition = new TopicPartition("topic", 0);
List<TopicPartition> partitionList = new ArrayList<TopicPartition>();
partitionList.add(topicPartition);
consumer.assign(partitionList); // here, partition 0 is assigned to this consumer
You can see the details in the Kafka documentation: https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#assign(java.util.Collection)
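
For the two-consumers-in-one-JVM case from the question, a minimal sketch could look like the following; the thread pool, the String deserializers, and printing instead of activityManager.save() are my assumptions, not taken from the original code:

// Hypothetical sketch: one KafkaConsumer per partition of CA_Topic, each on its own thread.
ExecutorService executor = Executors.newFixedThreadPool(2);
for (int partition = 0; partition < 2; partition++) {
    final int p = partition;
    executor.execute(() -> {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092,localhost:9093,localhost:9094");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "CA_TopicGroup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() bypasses group rebalancing, so each thread owns exactly one partition
            consumer.assign(Collections.singletonList(new TopicPartition("CA_Topic", p)));
            while (!Thread.currentThread().isInterrupted()) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                records.forEach(r -> System.out.println("partition " + r.partition() + ": " + r.value()));
            }
        }
    });
}

Because the partitions are assigned explicitly, the two consumers never overlap, which should also avoid the duplicates you saw.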

Does Kafka have any configuration which limits number of messages of Kafka Consumer?

My problem is that my Kafka consumer always hangs after receiving about 10,000 messages from Kafka; when I restart the consumer it starts reading again and hangs at the next 10,000 messages. Whether I consume from all partitions or from just one partition, the consumer does not read past 10,000 messages.
P/S: If I use KafkaSpout to read messages from Kafka, KafkaSpout stops emitting after about 30,000 messages too.
Here is my code:
Properties props = new Properties();
props.put("group.id", "Tornado");
props.put("zookeeper.connect", TwitterPropertiesLoader.getInstance().getZookeeperServer());
props.put("zookeeper.connection.timeout.ms", "200000");
props.put("auto.offset.reset", "smallest");
props.put("auto.commit.enable", "true");
props.put("auto.commit.interval.ms", "1000");

ConsumerConfig consumerConfig = new ConsumerConfig(props);
final ConsumerConnector consumer = Consumer.createJavaConsumerConnector(consumerConfig);

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(TwitterConstant.Kafka.TWITTER_STREAMING_TOPIC, 1);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(TwitterConstant.Kafka.TWITTER_STREAMING_TOPIC);
final KafkaStream<byte[], byte[]> stream0 = streams.get(0);
logger.info("Client ID=" + stream0.clientId());

for (MessageAndMetadata<byte[], byte[]> message : stream0) {
    try {
        String messageReceived = new String(message.message(), "UTF-8");
        logger.info("partition = " + message.partition() + ", offset=" + message.offset() + " => " + messageReceived);
        //consumer.commitOffsets(true);
        writeMessageToDatabase(messageReceived);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}
Edit: in the log file, after 10,000 messages something happens that looks like a consumer rebalance (I'm not sure), but the KafkaStream cannot continue reading messages.

Kafka10 consumer vs Kafka8 consumer

I have been using Kafka8 and am trying to move to Kafka10.
We have a topic with 10 partitions and used to create a consumer group with 10 consumers, as shown below.
public void run(int a_numThreads) {
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, new Integer(a_numThreads));
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

    // now launch all the threads
    executor = Executors.newFixedThreadPool(a_numThreads);

    // now create an object to consume the messages
    int threadNumber = 0;
    for (final KafkaStream stream : streams) {
        executor.execute(new ConsumerTest(stream, threadNumber));
        threadNumber++;
    }
}
Here, we used to pass the number of threads based on the number of partitions.
But with the Kafka10 consumer I'm not sure there is anything like that; it doesn't return streams per partition.
public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "192.168.33.10:9092");
    props.put("group.id", "group-1");
    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "1000");
    props.put("auto.offset.reset", "earliest");
    props.put("session.timeout.ms", "30000");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props);
    kafkaConsumer.subscribe(Arrays.asList("HelloKafkaTopic"));

    while (true) {
        ConsumerRecords<String, String> records = kafkaConsumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, value = %s", record.offset(), record.value());
            System.out.println();
        }
    }
}
Thanks in Advance
The new consumer enables a simple and efficient implementation that can handle all IO from a single thread. That's quite different from the old consumer. See this blog post for further details:
https://www.confluent.io/blog/tutorial-getting-started-with-the-new-apache-kafka-0-9-consumer-client/
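
If you want to keep the one-thread-per-partition shape of your old code, the usual pattern with the new consumer is one KafkaConsumer instance per thread (the client is not thread-safe); the group coordinator then spreads the 10 partitions across the instances. A rough sketch, reusing the Properties built in your main() (the thread count and the runnable structure are assumptions on my part):

// Hypothetical sketch: N consumers in the same group; Kafka assigns the partitions among them.
int numThreads = 10; // one per partition, as in the old consumer setup
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
for (int i = 0; i < numThreads; i++) {
    executor.execute(() -> {
        // each thread creates its own KafkaConsumer -- instances must not be shared between threads
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("HelloKafkaTopic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
            }
        }
    });
}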

High level Kafka Consumer Only Consumes Half Of Messages Sent By Producer

I have implemented a high-level consumer per the example page: https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example
When the code runs, it only consumes half of the messages produced. I have a basic 3-node ZooKeeper cluster and 2 Kafka brokers. When I run the simple-consumer code (not the high-level consumer), all the messages are consumed. Any ideas would be appreciated.
Consumer code
public void run() {
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put("test", 2);
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    List<KafkaStream<byte[], byte[]>> streams = consumerMap.get("test");

    executor = Executors.newFixedThreadPool(2);
    int threadNumber = 0;
    for (final KafkaStream stream : streams) {
        executor.submit(new Consumer(stream, threadNumber));
        threadNumber++;
    }
}
private static ConsumerConfig createConsumerConfig() {
    Properties props = new Properties();
    props.put("zookeeper.connect", "zookeeper01:2181,zookeeper02:2181,zookeeper03:2181");
    props.put("group.id", "Consumers");
    props.put("zookeeper.session.timeout.ms", "10000");
    props.put("enable.auto.commit", "true");
    props.put("zookeeper.sync.time.ms", "1000");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("auto.commit.interval.ms", "500");
    return new ConsumerConfig(props);
}