Kafka producer with default partitioning - apache-kafka

Right now my Kafka producer is sinking all messages to a single partition of a Kafka topic that actually has more than one partition.
How can I create a producer that uses the default partitioner and distributes messages among the different partitions of the topic?
Code snippet of my Kafka producer:
Properties props = new Properties();
props.put(ProducerConfig.RETRIES_CONFIG, 0);
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.ACKS_CONFIG, "all");
I am using flink kafka producer to sink the messages on kafka topic.
speStream.addSink(
new FlinkKafkaProducer011(kafkaTopicName,
new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
props,
FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));

With the default partitioner, messages are assigned a partition using the following logic:
keyed messages: a hash of the key is computed and used to select a partition, which means messages with the same key will end up on the same partition
unkeyed messages: partitions are assigned in a round-robin fashion
One explanation for the behaviour you see is that you are using the same key for all your messages; with the default partitioner they will then all end up on the same partition.
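As a minimal sketch of the two cases with the plain Java client (the topic name and broker address here are placeholders, not from the question):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitioningDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed: hash(key) picks the partition, so the same key always
            // lands on the same partition.
            producer.send(new ProducerRecord<>("my-topic", "user-42", "payload"));
            // Unkeyed (null key): the default partitioner spreads records
            // across the topic's partitions instead of pinning them to one.
            producer.send(new ProducerRecord<>("my-topic", null, "payload"));
        }
    }
}
```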

Solved this by changing the Flink producer to:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,new SimpleStringSchema(), props));

Related

Flink kafka consumer fetch messages from specific partition

We want to achieve parallelism while reading messages from Kafka, hence we wanted to specify the partition number in FlinkKafkaConsumer. It reads messages from all partitions in Kafka instead of a specific partition number. Below is the sample code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "Message-Test-Consumers");
properties.setProperty("partition", "1"); //not sure about this syntax.
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<String>("EventLog", new SimpleStringSchema(), properties);
Please suggest any better option to get the parallelism.
I don't believe there is a mechanism to restrict which partitions Flink will read from. Nor do I see how this would help you achieve your goal of reading from the partitions in parallel, which Flink does regardless.
The Flink Kafka source connector reads from all available partitions, in parallel. Simply set the parallelism of the kafka source connector to whatever parallelism you desire, keeping in mind that the effective parallelism cannot exceed the number of partitions. In this way, each instance of Flink's Kafka source connector will read from one or more partitions. You can also configure the kafka consumer to automatically discover new partitions that may be created while the job is running.
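For example, here is a sketch assuming the universal FlinkKafkaConsumer connector; the topic name and group id come from the question, while the parallelism value and the `flink.partition-discovery.interval-millis` setting are illustrative:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ParallelKafkaSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "Message-Test-Consumers");
        // Discover partitions created while the job is running (interval in ms).
        props.setProperty("flink.partition-discovery.interval-millis", "10000");

        FlinkKafkaConsumer<String> kafkaConsumer =
            new FlinkKafkaConsumer<>("EventLog", new SimpleStringSchema(), props);

        // Effective read parallelism is capped by the number of partitions.
        env.addSource(kafkaConsumer).setParallelism(4).print();

        env.execute("parallel-kafka-source");
    }
}
```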

How can a consumer in Kafka be assigned to a specific partition? The partition in my case is assigned through the hash value of the key.

When a partition is assigned by the producer using a number, for example:
kafkaTemplate.send(topic, 1, "[" + LocalDateTime.now() + "]" + "Message to partition 1");
the second parameter, 1, defines the partition id where I want my message to be sent, so the consumer can consume this message:
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer1.assign(Arrays.asList(partition1));
But how do I achieve this when the producer chooses a partition based on the hash of the key, using the DefaultPartitioner? Example:
kafkaTemplate.send(topic, "forpartition1", "testkey");
Here the key is "forpartition1". How do I assign my consumer to consume from the partition derived from the hash of "forpartition1"? Do I compute the hash for that key again in the consumer, or are there other ways to achieve this? I am pretty new to this technology.
Based on the information that you are new to Kafka, I am tempted to guess you are unintentionally attempting an advanced use case, and that is probably not what you want.
The common use case is that you publish messages to a topic. Each message gets assigned a partition based on its key, and all messages with the same key end up on the same partition.
On the consumer side, you subscribe to the whole topic (without explicitly asking for a partition) and Kafka handles the distribution of partitions among all the available consumers.
This gives you the guarantee that all messages with a specific key will be processed by the same consumer (they all go to the same partition and only one consumer handles each partition) and in the same order they were sent.
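A minimal consumer sketch for this common case (broker address, group id, and topic name are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("group.id", "my-group");                // assumed group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe to the whole topic; Kafka assigns partitions to the
            // members of the consumer group automatically.
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                        r.partition(), r.key(), r.value());
                }
            }
        }
    }
}
```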
If you really want to choose the partition yourself, you can write a partitioner class and configure your producer to use it by setting the partitioner.class configuration.
From the Kafka documentation:
NAME: partitioner.class
DESCRIPTION: Partitioner class that implements the org.apache.kafka.clients.producer.Partitioner interface.
TYPE: class
DEFAULT: org.apache.kafka.clients.producer.internals.DefaultPartitioner
IMPORTANCE: medium
A few example tutorials on how to do this can be found online. Here are some for reference:
Write An Apache Kafka Custom Partitioner
Apache Kafka Foundation Course - Custom Partitioner
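As a sketch of what such a partitioner could look like (the class name and the routing rule here are invented for illustration, and the hash shown uses hashCode() rather than the murmur2 hash the DefaultPartitioner actually uses):

```java
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Hypothetical example: route keys starting with "vip-" to partition 0
// and everything else by hash of the key.
public class VipPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        String k = (String) key;
        if (k != null && k.startsWith("vip-")) {
            return 0;
        }
        // Simplified hash-mod; Kafka's DefaultPartitioner uses murmur2.
        return k == null ? 0 : (k.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

The producer is then pointed at it via `props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, VipPartitioner.class.getName());`.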

Kafka Consumer seektoBeginning

I did not use a partition to publish to a Kafka topic.
ProducerRecord(String topic, K key, V value)
In the consumer, I would like to go to the beginning.
seekToBeginning(Collection partitions)
Is it possible to seek to the beginning without using a partition? Does Kafka assign a default partition?
https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/producer/ProducerRecord.html
https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
When producing, if you don't explicitly specify a partition, the producer will pick one automatically from your topic.
In your consumer, if you are subscribed to your topic, you can seek to the start of all the partitions your consumer is currently assigned to using:
consumer.seekToBeginning(consumer.assignment())
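One caveat worth noting: with subscribe(), consumer.assignment() is empty until the first poll() has completed the group rebalance. A common pattern, sketched here under the assumption that `consumer` is an already-configured KafkaConsumer<String, String>, is to rewind from a rebalance listener:

```java
// Rewind to the beginning whenever partitions are assigned to this consumer.
consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        consumer.seekToBeginning(partitions);
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // nothing to do on revoke for this use case
    }
});
```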

KafkaProducer round-robin distribution not working for the same key

I'm trying to understand how Kafka works. I've read that by default Kafka will distribute the messages from a producer in a round-robin fashion among the partitions.
But, why are the messages always put in the same partition if the messages have the same key ? (no partition key strategy configured).
For instance, using the code below the messages are always put in the same partition:
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String key = properties.getProperty("dev.id");
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), key, value), new EventGeneratorCallback(key));
With a different key, messages are distributed in a round-robin fashion:
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);
String key = properties.getProperty("dev.id") + UUID.randomUUID().toString();
producer.send(new ProducerRecord<String, String>(properties.getProperty("kafka.topic"), key, value), new EventGeneratorCallback(key));
This is exactly how a Kafka producer works. This behaviour is defined by the DefaultPartitioner class, which you can find in the official repo.
If no key is specified, the producer distributes messages in a round-robin fashion across all the topic's partitions. If a key is specified, the partitioner computes a hash of the key modulo the number of partitions, so messages with the same key go to the same partition. Because Kafka guarantees ordering at the partition level (not the topic level), this is also a way to have all messages with the same key land on the same partition, and therefore be received by the consumer in the same order they were sent.
Finally, another possible way of sending messages is for the producer to specify the partition destination explicitly: in this case there is no key and no round robin; the message is sent exactly to the partition specified by the producer itself.
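The hash-modulo scheme can be illustrated in plain Java. Note this is a simplification: the real DefaultPartitioner applies murmur2 to the serialized key bytes, not String.hashCode():

```java
public class HashModDemo {
    // Simplified stand-in for the default partitioner's keyed path.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // The same key always maps to the same partition...
        System.out.println(partitionFor("dev-1", numPartitions)
            == partitionFor("dev-1", numPartitions)); // true
        // ...while appending a random suffix spreads keys across partitions.
        System.out.println(partitionFor("dev-1" + java.util.UUID.randomUUID(), numPartitions));
    }
}
```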
This is the expected behavior. All messages with the same key are put in the same partition. If you want round-robin assignment for all messages, you should not provide a key. To Kafka, the reason to have a key is to distribute data throughout the partitions, and putting identical keys in different partitions would break this contract.

KafKa partitioner class, assign message to partition within topic using key

I am new to Kafka, so apologies if I sound stupid, but what I have understood so far is: a stream of messages can be defined as a topic, like a category, and every topic is divided into one or more partitions (each partition can have multiple replicas), so they act in parallel.
From the Kafka main site they say
The producer is able to choose which message to assign to which partition within the topic.
This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the message).
Does this mean that while consuming I will be able to choose the message offset from a particular partition?
While running multiple partitions, is it possible to choose from one specific partition, i.e. partition 0?
In Kafka 0.7 quick start they say
Send a message with a partition key. Messages with the same key are sent to the same partition.
And the key can be provided while creating the producer as below
ProducerData<String, String> data = new ProducerData<String, String>("test-topic", "test-key", "test-message");
producer.send(data);
Now how do I consume messages based on this key? What is the actual impact of using this key while producing in Kafka?
While creating a producer in 0.8beta we can provide the partitioner class attribute through the config file.
A custom partitioner class can be created by implementing the Kafka Partitioner interface.
But I'm a little confused about how exactly it works; the 0.8 docs also do not explain much. Any advice, or am I missing something?
This is what I've found so far:
Define our own custom partitioner class by implementing the Kafka Partitioner interface. The method to implement takes two arguments: first, the key we provide from the producer, and second, the number of partitions available. So we can define our own logic to decide which message key goes to which partition.
Now, while creating the producer, we can specify our own partitioner class using the "partitioner.class" attribute:
props.put("partitioner.class", "path.to.custom.partitioner.class");
If we don't specify it, Kafka will use its default class and try to distribute messages evenly among the available partitions.
Also tell Kafka how to serialize the key:
props.put("key.serializer.class", "kafka.serializer.StringEncoder");
Now, if we send a message using a key from the producer, the message will be delivered to a specific partition (based on the logic in our custom partitioner class), and at the consumer (SimpleConsumer) level we can specify the partition from which to retrieve the messages.
In case we need to pass a String as a key, the same should be handled in the custom partitioner class (e.g. take the hash value of the key and then take the first two digits).
Each topic in Kafka is split into many partitions. Partitioning allows for parallel consumption, increasing throughput.
The producer publishes messages to a topic using the Kafka producer client library, which balances the messages across the available partitions using a Partitioner. The broker the producer connects to takes care of forwarding the message to the broker that is the leader of that partition, using the partition-owner information in ZooKeeper. Consumers use Kafka's high-level consumer library (which implicitly handles broker leader changes, managing offset info in ZooKeeper, figuring out partition-owner info, etc.) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts, all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams, so each stream will be assigned 5 partitions (depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the ordering of messages will be jumbled between partitions.
Each consumer uses more than one stream (say C1 uses 3, C2 uses 3 and C3 uses 4).
In this model, when C1 starts, all 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams, and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceeds the number of partitions, some streams will not get any messages as they will not be assigned any partitions.
Does this mean while consuming I will be able to choose the message offset from particular partition? While running multiple partitions is it possible to choose from one specific partition i.e partition 0?
Yes, you can choose messages from one specific partition in your consumer, but if you want the partition to be identified dynamically then it depends on how you have implemented the Partitioner class in your producer.
Now how do I consume message based on this key? what is the actual impact of using this key while producing in Kafka ?
There are two ways of consuming the messages. One uses a Zookeeper host and the other uses a static host. The Zookeeper host consumes messages from all partitions. However, if you are using a static host, then you can provide the broker with the partition number that needs to be consumed.
Please check below example of Kafka 0.8
Producer
KeyedMessage<String, String> data = new KeyedMessage<String, String>(<<topicName>>, <<KeyForPartition>>, <<Message>>);
Partition Class
public int partition(Object key, int numPartitions) {
    // key is the key given while producing; numPartitions is the number
    // of partitions the topic has
    long organizationId = Long.parseLong((String) key);
    // if the given key fits within the available partitions, use it
    // directly; otherwise send the message to the last partition
    if (organizationId >= numPartitions) {
        return numPartitions - 1;
    }
    // alternatively: return (int) (organizationId % numPartitions);
    return (int) organizationId;
}
So the partitioner class decides where to send the message based on your logic.
Consumer (note: I have used the Storm Kafka 0.8 integration)
HostPort hosts = new HostPort("10.**.**.***",9092);
GlobalPartitionInformation gpi = new GlobalPartitionInformation();
gpi.addPartition(0, hosts);
gpi.addPartition(2, hosts);
StaticHosts statHost = new StaticHosts(gpi);
SpoutConfig spoutConf = new SpoutConfig(statHost, <<topicName>>, "/kafkastorm", <<spoutConfigId>>);