I am running the confluent-oss-5.0.0-2.11 Kafka server with the default server properties (from etc/kafka) on my local PC, and created topics test1 and test2 with the command below:
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic
Below are the properties I have set:
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10);
props.put(BATCH_SIZE_CONFIG, 16384);
props.put(LINGER_MS_CONFIG, 10);
props.put(BUFFER_MEMORY_CONFIG, 33554432);
props.put(METADATA_MAX_AGE_CONFIG, "10");
Here is what the producer does
String padded = RandomStringUtils.random(2000, true, true);
for(int i=0;i<1000000;i++) {
kafkaProducer.send("a" + i, "aa" + padded + i, "test1");
kafkaProducer.send("a" + i, "bb" + padded + i, "test2");
}
kafkaProducer.flush();
Here is what the consumer does
KTable<String, String> a = builder.table("test1");
KTable<String, String> b = builder.table("test2");
a.join(b, new ValueJoiner<String, String, String>() {
@Override
public String apply(String value1, String value2) {
return "a" + value1;
}
}).toStream().to("finalTopic");
And below is how I observe the performance of "finalTopic" population
AtomicInteger counter = new AtomicInteger();
builder.<String, String>stream("finalTopic").peek((key, value) -> {
if(counter.incrementAndGet()%1000 == 0) {
logger.info("date {}, final join key {}, value size {}, joins performed {}", System.currentTimeMillis(), key, value.length(), counter.get());
}
});
Using the producer, I managed to get around 55,000 messages per second into the two Kafka topics.
However, on the consumer side, messages are populated into "finalTopic" at a rate of only around 110 messages per second.
Any pointer is appreciated!
How to know that Kafka has processed all the messages?
Is there any command or log file that specifies the Kafka offset currently being processed and the last Kafka offset?
You could use the command line tool kafka-consumer-groups.sh to check the consumer lag of your ConsumerGroup. It will show the end offset of the topic the ConsumerGroup is consuming and the last offset the ConsumerGroup committed:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group mygroup
GROUP    TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  OWNER
mygroup  test-topic  0          5               15              10   xxx
mygroup  test-topic  1          10              15              5    xxx
If you want to do it programmatically, e.g. from a Spring application:
@Bean
public ApplicationRunner runner(KafkaAdmin admin, ConsumerFactory<String, String> cf) {
return args -> {
try (
AdminClient client = AdminClient.create(admin.getConfig());
Consumer<String, String> consumer = cf.createConsumer("dummyGroup", "clientId", "");
) {
Collection<ConsumerGroupListing> groups = client.listConsumerGroups()
.all()
.get(10, TimeUnit.SECONDS);
groups.forEach(group -> {
Map<TopicPartition, OffsetAndMetadata> map;
try {
    map = client.listConsumerGroupOffsets(group.groupId())
            .partitionsToOffsetAndMetadata()
            .get(10, TimeUnit.SECONDS);
}
catch (InterruptedException | ExecutionException | TimeoutException e) {
    e.printStackTrace();
    return; // skip this group if its offsets could not be fetched, instead of hitting an NPE below
}
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(map.keySet());
map.forEach((tp, off) -> {
System.out.println("group: " + group + " tp: " + tp
+ " current offset: " + off.offset()
+ " end offset: " + endOffsets.get(tp));
});
});
}
};
}
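If only the lag number is needed, it can be derived from the two offsets already fetched in the loop above, for example (a small sketch reusing the same variables):
long lag = endOffsets.get(tp) - off.offset(); // log-end offset minus committed offset = consumer lag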
I am using the Reactor library to fetch a large stream of data from the network and send it to the Kafka broker using the reactive Kafka (reactor-kafka) approach.
Below is the Kafka Producer I am using
public class LogProducer {
private final KafkaSender<String, String> sender;
public LogProducer(String bootstrapServers) {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.CLIENT_ID_CONFIG, "log-producer");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
SenderOptions<String, String> senderOptions = SenderOptions.create(props);
sender = KafkaSender.create(senderOptions);
}
public void sendMessages(String topic, Flux<Logs.Data> records) {
AtomicInteger sentCount = new AtomicInteger(0);
AtomicInteger fCount = new AtomicInteger(0);
records.doOnNext(r -> fCount.incrementAndGet()).subscribe();
System.out.println("Total Records: " + fCount);
sender.send(records.doOnNext(r -> sentCount.incrementAndGet())
.map(record -> {
LogRecord lrec = record.getRecords().get(0);
String id = lrec.getId();
return SenderRecord.create(new ProducerRecord<>(topic, id,
lrec.toString()), id);
})).then()
.doOnError(e -> {
log.error("[FAIL]: Send to the topic: '{}' failed. "
+ e, topic);
})
.doOnSuccess(s -> {
log.info("[SUCCESS]: {} records sent to the topic: '{}'", sentCount, topic);
})
.subscribe();
}
}
The total number of records in the Flux (fCount) and the number of records sent to the Kafka topic (sentCount) do not match; it does not give any error and completes successfully.
For example, in one case the total number of records in the Flux is 2758, while the count sent to Kafka is 256. Is there any Kafka configuration that needs to be modified, or am I missing something?
===========================================================
Updated based on the comments
sender.send(records
.map(record -> {
LogRecord lrec = record.getRecords().get(0);
String id = lrec.getId();
sleep(5); // sleep for 5 ns
return SenderRecord.create(new ProducerRecord<>(topic, id,
lrec.toString()), id);
})).then()
.doOnError(e -> {
log.error("[FAIL]: Send to the topic: '{}' failed. "
+ e, topic);
})
.doOnSuccess(s -> {
log.info("[SUCCESS]: {} records sent to the topic: '{}'", sentCount, topic);
})
.subscribe();
sleep(10); // sleep for 10 ns
The above code worked fine on one system but failed to send all messages on another.
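(For reference, a minimal sketch of waiting for the send to complete instead of sleeping, reusing sender, records and topic from the snippet above; Duration here is java.time.Duration.)
sender.send(records.map(record -> {
            LogRecord lrec = record.getRecords().get(0);
            String id = lrec.getId();
            return SenderRecord.create(new ProducerRecord<>(topic, id, lrec.toString()), id);
        }))
        .doOnError(e -> log.error("[FAIL]: Send to the topic: '{}' failed.", topic, e))
        .then()
        .block(Duration.ofMinutes(1)); // returns only after every record has been sent (or times out)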
I am using a Kafka 0.10.2.1 cluster. I am using Kafka's offsetsForTimes API to seek to a particular offset, and I would like to break out of the loop when I have reached the end timestamp.
My code is like this:
//package kafka.ex.test;
import java.util.*;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
public class ConsumerGroup {
public static OffsetAndTimestamp fetchOffsetByTime( KafkaConsumer<Long, String> consumer , TopicPartition partition , long startTime){
Map<TopicPartition, Long> query = new HashMap<>();
query.put(
partition,
startTime);
final Map<TopicPartition, OffsetAndTimestamp> offsetResult = consumer.offsetsForTimes(query);
if( offsetResult == null || offsetResult.isEmpty() ) {
System.out.println(" No Offset to Fetch ");
System.out.println(" Offset Size "+offsetResult.size());
return null;
}
final OffsetAndTimestamp offsetTimestamp = offsetResult.get(partition);
if(offsetTimestamp == null ){
System.out.println("No Offset Found for partition : "+partition.partition());
}
return offsetTimestamp;
}
public static KafkaConsumer<Long, String> assignOffsetToConsumer( KafkaConsumer<Long, String> consumer, String topic , long startTime ){
final List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
System.out.println("Number of Partitions : "+partitionInfoList.size());
final List<TopicPartition> topicPartitions = new ArrayList<>();
for (PartitionInfo pInfo : partitionInfoList) {
TopicPartition partition = new TopicPartition(topic, pInfo.partition());
topicPartitions.add(partition);
}
consumer.assign(topicPartitions);
for(TopicPartition partition : topicPartitions ){
OffsetAndTimestamp offSetTs = fetchOffsetByTime(consumer, partition, startTime);
if( offSetTs == null ){
System.out.println("No Offset Found for partition : " + partition.partition());
consumer.seekToEnd(Arrays.asList(partition));
}else {
System.out.println(" Offset Found for partition : " +offSetTs.offset()+" " +partition.partition());
System.out.println("FETCH offset success"+
" Offset " + offSetTs.offset() +
" offSetTs " + offSetTs);
consumer.seek(partition, offSetTs.offset());
}
}
return consumer;
}
public static void main(String[] args) throws Exception {
String topic = args[0].toString();
String group = args[1].toString();
long start_time_Stamp = Long.parseLong( args[3].toString());
String bootstrapServers = args[2].toString();
long end_time_Stamp = Long.parseLong( args[4].toString());
Properties props = new Properties();
boolean reachedEnd = false;
props.put("bootstrap.servers", bootstrapServers);
props.put("group.id", group);
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("session.timeout.ms", "30000");
props.put("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<Long, String> consumer = new KafkaConsumer<Long, String>(props);
assignOffsetToConsumer(consumer, topic, start_time_Stamp);
System.out.println("Subscribed to topic " + topic);
int i = 0;
int arr[] = {0,0,0,0,0};
while (true) {
ConsumerRecords<Long, String> records = consumer.poll(6000);
int count= 0;
long lasttimestamp = 0;
long lastOffset = 0;
for (ConsumerRecord<Long, String> record : records) {
count++;
if(arr[record.partition()] == 0){
arr[record.partition()] =1;
}
if (record.timestamp() >= end_time_Stamp) {
reachedEnd = true;
break;
}
System.out.println("record=>"+" offset="
+record.offset()
+ " timestamp="+record.timestamp()
+ " :"+record);
System.out.println("recordcount = "+count+" bitmap"+Arrays.toString(arr));
}
if (reachedEnd) break;
if (records == null || records.isEmpty()) break; // dont wait for records
}
}
}
I face the following problems:
consumer.poll fails (returns no records) even with a 1000 millisecond timeout; I had to poll a few times in a loop when using 1000 milliseconds, and I am using an extremely large value now. But having already seeked to the relevant offsets within a partition, how do I reliably set the poll timeout so that data is returned immediately?
My observation is that when data is returned it is not always from all partitions, and even when data is returned from all partitions, not all records are returned. The number of records in the topic is more than 1000, but the number actually fetched and printed in the loop is far fewer (~200). Is there any issue with my current usage of the Kafka APIs?
How do I reliably break out of the loop once I have obtained all the data between the start and end timestamps, and not prematurely?
The number of records fetched per poll depends on your consumer config.
You are breaking out of the loop when one of the partitions reaches the end timestamp, which is not what you want. You should check that all the partitions have reached the end timestamp before exiting the poll loop (see the sketch after these points).
The poll call is asynchronous and fetch requests and responses are per node, so you may or may not get all the responses in a single poll depending on the broker response time.
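To illustrate the first two points, here is a rough sketch (property values are only examples, and it reuses props, consumer, topicPartitions and end_time_Stamp from the question): the fetch-related consumer properties bound how much a single poll can return, and a per-partition "finished" set lets the loop exit only once every assigned partition has passed the end timestamp.
// Fetch-related settings that bound what one poll() can return (example values only)
props.put("max.poll.records", "500");               // cap on records returned per poll
props.put("fetch.min.bytes", "1");                  // do not make the broker wait to batch more data
props.put("max.partition.fetch.bytes", "1048576");  // per-partition cap per fetch

// Exit only when every assigned partition has passed the end timestamp
Set<Integer> finishedPartitions = new HashSet<>();
while (finishedPartitions.size() < topicPartitions.size()) {
    ConsumerRecords<Long, String> records = consumer.poll(1000);
    for (ConsumerRecord<Long, String> record : records) {
        if (record.timestamp() >= end_time_Stamp) {
            finishedPartitions.add(record.partition()); // this partition is done
            continue;                                   // keep draining the remaining partitions
        }
        // process the record ...
    }
    // a partition that simply runs out of data never crosses the end timestamp,
    // so an empty-poll counter (or comparing positions against endOffsets) is also needed in practice
}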
I am using Kafka 0.10.0 and ZooKeeper 3.4.6 on my production server. I have 20 topics, each with approximately 50 partitions. I have a total of 100 consumers, each subscribed to different topics and partitions, and all the consumers have the same groupId. So will it be the case that if a consumer is added or removed for a specific topic, the consumers attached to a different topic will also undergo rebalancing?
My consumer code is:
public static void main(String[] args) {
String groupId = "prod"
String topicRegex = args[0]
String consumerTimeOut = "10000"
int n_threads = 1
if (args && args.size() > 1) {
ConfigLoader.init(args[1])
}
else {
ConfigLoader.init('development')
}
if(args && args.size() > 2 && args[2].isInteger()){
n_threads = (args[2]).toInteger()
}
ExecutorService executor = Executors.newFixedThreadPool(n_threads)
addShutdownHook(executor)
String zooKeeper = ConfigLoader.conf.zookeeper.hostName
List<Runnable> taskList = []
for(int i = 0; i < n_threads; i++){
KafkaConsumer example = new KafkaConsumer(zooKeeper, groupId, topicRegex, consumerTimeOut)
taskList.add(example)
}
taskList.each{ task ->
executor.submit(task)
}
executor.shutdown()
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS)
}
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId, String consumerTimeOut) {
Properties props = new Properties()
props.put("zookeeper.connect", a_zookeeper)
props.put("group.id", a_groupId)
props.put("zookeeper.session.timeout.ms", "10000")
props.put("rebalance.backoff.ms","10000")
props.put("zookeeper.sync.time.ms","200")
props.put("rebalance.max.retries","10")
props.put("enable.auto.commit", "false")
props.put("consumer.timeout.ms", consumerTimeOut)
props.put("auto.offset.reset", "smallest")
return new ConsumerConfig(props)
}
public void run(String topicRegex) {
String threadName = Thread.currentThread().getName()
logger.info("{} [{}] main Starting", TAG, threadName)
Map<String, Integer> topicCountMap = new HashMap<String, Integer>()
List<KafkaStream<byte[], byte[]>> streams = consumer.createMessageStreamsByFilter(new Whitelist(topicRegex),1)
ConsumerConnector consumerConnector = consumer
for (final KafkaStream stream : streams) {
ConsumerIterator<byte[], byte[]> consumerIterator = stream.iterator()
List<Object> batchTypeObjList = []
String topic
String topicObjectType
String method
String className
String deserialzer
Integer batchSize = 200
while (true){
boolean hasNext = false
try {
hasNext = consumerIterator.hasNext()
} catch (InterruptedException interruptedException) {
//if (exception instanceof InterruptedException) {
logger.error("{} [{}]Interrupted Exception: {}", TAG, threadName, interruptedException.getMessage())
throw interruptedException
//} else {
} catch(ConsumerTimeoutException timeoutException){
logger.error("{} [{}] Timeout Exception: {}", TAG, threadName, timeoutException.getMessage())
topicListMap.each{ eachTopic, value ->
batchTypeObjList = topicListMap.get(eachTopic)
if(batchTypeObjList != null && !batchTypeObjList.isEmpty()) {
def dbObject = topicConfigMap.get(eachTopic)
logger.debug("{} [{}] Timeout Happened.. Indexing remaining objects in list for topic: {}", TAG, threadName, eachTopic)
className = dbObject.get(KafkaTopicConfigEntity.CLASS_NAME_KEY)
method = dbObject.get(KafkaTopicConfigEntity.METHOD_NAME_KEY)
int sleepTime = 0
if(dbObject.get(KafkaTopicConfigEntity.CONUSMER_SLEEP_IN_MS) != null)
sleepTime = dbObject.get(KafkaTopicConfigEntity.CONUSMER_SLEEP_IN_MS)?.toInteger()
executeMethod(className, method, batchTypeObjList)
batchTypeObjList.clear()
topicListMap.put(eachTopic,batchTypeObjList)
sleep(sleepTime)
}
}
consumer.commitOffsets()
continue
} catch(Exception exception){
logger.error("{} [{}]Exception: {}", TAG, threadName, exception.getMessage())
throw exception
}
if(hasNext) {
def consumerObj = consumerIterator.next()
logger.debug("{} [{}] partition name: {}", TAG, threadName, consumerObj.partition())
topic = consumerObj.topic()
DBObject dbObject = topicConfigMap.get(topic)
logger.debug("{} [{}] topic name: {}", TAG, threadName, topic)
topicObjectType = dbObject.get(KafkaTopicConfigEntity.TOPIC_OBJECT_TYPE_KEY)
deserialzer = KafkaConfig.DEFAULT_DESERIALIZER
if(KafkaConfig.DESERIALIZER_MAP.containsKey(topicObjectType)){
deserialzer = KafkaConfig.DESERIALIZER_MAP.get(topicObjectType)
}
className = dbObject.get(KafkaTopicConfigEntity.CLASS_NAME_KEY)
method = dbObject.get(KafkaTopicConfigEntity.METHOD_NAME_KEY)
boolean isBatchJob = dbObject.get(KafkaTopicConfigEntity.IS_BATCH_JOB_KEY)
if(dbObject.get(KafkaTopicConfigEntity.BATCH_SIZE_KEY) != null)
batchSize = dbObject.get(KafkaTopicConfigEntity.BATCH_SIZE_KEY)
else
batchSize = 1
Object queueObj = (Class.forName(deserialzer)).deserialize(consumerObj.message())
int sleepTime = 0
if(dbObject.get(KafkaTopicConfigEntity.CONUSMER_SLEEP_IN_MS) != null)
sleepTime = dbObject.get(KafkaTopicConfigEntity.CONUSMER_SLEEP_IN_MS)?.toInteger()
if(isBatchJob == true){
batchTypeObjList = topicListMap.get(topic)
batchTypeObjList.add(queueObj)
if(batchTypeObjList.size() == batchSize) {
executeMethod(className, method, batchTypeObjList)
batchTypeObjList.clear()
sleep(sleepTime)
}
topicListMap.put(topic,batchTypeObjList)
} else {
executeMethod(className, method, queueObj)
sleep(sleepTime)
}
consumer.commitOffsets()
}
}
logger.debug("{} [{}] Shutting Down Process ", TAG, threadName)
}
}
Any help will be appreciated.
Whenever a consumer leaves or joins a consumer group, the entire group undergoes a rebalance. Since the group tracks all partitions across all topics that its members are subscribed to, you are right in thinking that this can lead to rebalancing of consumers that are not subscribed to the topic in question.
Please see below for a small test illustrating this point. I have a broker with two topics, test1 (2 partitions) and test2 (9 partitions), and I start two consumers, both in the same consumer group, each subscribing to only one of the two topics. As you can see, when consumer2 joins the group, consumer1 gets all its partitions revoked and reassigned, because the entire group rebalances.
Subscribing consumer1 to topic test1
Starting thread for consumer1
Polling consumer1
consumer1 got 0 partitions revoked!
consumer1 got 2 partitions assigned!
Polling consumer1
Polling consumer1
Polling consumer1
Subscribing consumer2 to topic test2
Starting thread for consumer2
Polling consumer2
Polling consumer1
consumer2 got 0 partitions revoked!
Polling consumer1
Polling consumer1
consumer1 got 2 partitions revoked!
consumer2 got 9 partitions assigned!
consumer1 got 2 partitions assigned!
Polling consumer2
Polling consumer1
Polling consumer2
Polling consumer1
Polling consumer2
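The test code itself is not shown above; a minimal sketch of what it could look like (topic names, group id and timings are assumptions, and it runs inside a main(String[]) throws Exception):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "rebalance-test");   // both consumers share the same group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

for (String entry : new String[]{"consumer1:test1", "consumer2:test2"}) {
    String name = entry.split(":")[0];
    String topic = entry.split(":")[1];
    System.out.println("Subscribing " + name + " to topic " + topic);
    new Thread(() -> {
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topic), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                System.out.println(name + " got " + partitions.size() + " partitions revoked!");
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                System.out.println(name + " got " + partitions.size() + " partitions assigned!");
            }
        });
        while (true) {
            System.out.println("Polling " + name);
            consumer.poll(1000);
        }
    }, name).start();
    Thread.sleep(5000); // let the first consumer settle before the second one joins
}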
I am very new to Kafka, and today I tried creating a Java producer that produces messages to a Kafka topic across different partitions.
First I created a package raggieKafka, under which I created 2 classes: TestProducer and SimplePartitioner.
The TestProducer class has the following code:
package raggieKafka;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.*;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
public class TestProducer{
public static void main(String args[]) throws Exception
{
long events = 0;
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
events = Integer.parseInt(reader.readLine());
Random rnd = new Random();
Properties props = new Properties();
props.put("metadata.broker.list", "localhost:9092");
props.put("topic.metadata.refresh.interval.ms", "1");
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("partitioner.class", "raggieKafka.SimplePartitioner");
props.put("request.required.acks", "1");
ProducerConfig config = new ProducerConfig(props);
Producer<String, String> prod = new Producer<String, String>(config);
for(long i = 0; i < events; i++)
{
long runtime = new Date().getTime();
String ip = "192.168.2." + rnd.nextInt(255);
String msg = runtime + ",www.example.com, " + ip;
KeyedMessage<String,String> data = new KeyedMessage<String, String>("page_visits", ip, msg);
prod.send(data);
}
prod.close();
}
}
and the SimplePartitioner class has the following code:
package raggieKafka;
import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;
public class SimplePartitioner implements Partitioner{
public SimplePartitioner(VerifiableProperties props)
{
}
public int partition(Object Key, int a_numPartitions)
{
int partition = 0;
String stringKey = (String) Key;
int offset = stringKey.indexOf(stringKey);
if(offset > 0)
{
partition = Integer.parseInt(stringKey.substring(offset+1)) % a_numPartitions;
}
return partition;
}
}
Before compiling these Java programs, I created the topic on the Kafka broker:
C:\kafka_2.11-0.9.0.1>.\bin\windows\kafka-topics.bat --create --topic page_visits --zookeeper localhost:2181 --partitions 5 --replication-factor 1
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "page_visits".
Now when I run the Java program, it puts all the messages into only 1 partition, i.e. page_visits-0, while all the other partitions remain empty.
Can someone tell me why my Java producer is NOT distributing my messages to the other partitions?
In fact, I looked on Google and then added one more property:
props.put("topic.metadata.refresh.interval.ms", "1");
but the producer still isn't producing messages to all the partitions.
PLEASE HELP.
Your SimplePartitioner code has a bug in the following line:
int offset = stringKey.indexOf(stringKey);
It always returns 0, so offset is always 0, and since it is never greater than 0 your if block never executes; the method therefore always returns partition 0.
Solution: since your key is an IP address, the following change should work as expected:
int offset = stringKey.lastIndexOf('.');
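Applied to the partition method, the corrected logic would look roughly like this (the rest of the class stays as posted):
public int partition(Object key, int a_numPartitions)
{
    int partition = 0;
    String stringKey = (String) key;
    // index of the last '.' in the IP address, e.g. "192.168.2.57" -> the dot before "57"
    int offset = stringKey.lastIndexOf('.');
    if (offset > 0)
    {
        // hash on the last octet so keys spread across the available partitions
        partition = Integer.parseInt(stringKey.substring(offset + 1)) % a_numPartitions;
    }
    return partition;
}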
Hope this helps!