Can a single Kafka producer produce messages to multiple topics, and how? - apache-kafka

I am just exploring Kafka. Currently I am using one producer and one topic to produce messages, and they are consumed by one consumer. Very simple.
I was reading the Kafka documentation, which says the new Producer API is thread-safe and that sharing a single instance will improve performance.
Does this mean I can use a single producer to publish messages to multiple topics?

I have never tried it myself, but I believe you can, since the example code for creating a producer and sending records is (from https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html):
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
    producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));
So, if you just pass different topic names in the ProducerRecord, it should be possible.
Also, the 0.8.1 documentation (http://kafka.apache.org/081/documentation.html#producerapi) explicitly says that you can use the method send(List<KeyedMessage<K,V>> messages) to write to multiple topics.
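To make that concrete, here is a minimal sketch of one producer instance writing to two topics. The broker address, topic names, and serializers are assumptions for illustration; the point is only that a single KafkaProducer is reused across topics.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MultiTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // One producer instance, shared for every topic it writes to.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("topic-a", "key1", "hello topic-a")); // assumed topic names
            producer.send(new ProducerRecord<>("topic-b", "key2", "hello topic-b"));
            producer.flush(); // make sure both records are delivered before closing
        }
    }
}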

If I understand you correctly, you are looking to use the same producer instance to send the same or multiple messages to multiple topics.
I am not sure about Java, but in C# (.NET) you can do this with the Kafka .NET client's DependentProducerBuilder:
using (var producer = new ProducerBuilder<string, string>(config).Build())
using (var producer2 = new DependentProducerBuilder<Null, int>(producer.Handle).Build())
{
    producer.ProduceAsync("first-topic", new Message<string, string> { Key = "my-key-value", Value = "my-value" });
    producer2.ProduceAsync("second-topic", new Message<Null, int> { Value = 42 });
    producer2.ProduceAsync("first-topic", new Message<Null, int> { Value = 107 });
    producer.Flush(TimeSpan.FromSeconds(10));
}

Related

How to run several instances of a Kafka transactional producer on the same broker with the same transactionalId?

I am using a Kafka transactional producer to post atomically to 2 topics on a broker. My code looks similar to this:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "my-transactional-id");
Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
producer.initTransactions();
try {
    producer.beginTransaction();
    for (int i = 0; i < 100; i++)
        producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // We can't recover from these exceptions, so our only option is to close the producer and exit.
    producer.close();
} catch (KafkaException e) {
    // For all other exceptions, just abort the transaction and try again.
    producer.abortTransaction();
}
producer.close();
According to the Kafka docs and my understanding, initTransactions() has to be called only once per producer session, and it registers that producer instance with the broker under the specified transactional.id.
Now, in my case I need to deploy this code to several servers that use the same Kafka broker.
Do I need a different transactional ID for each instance?
Is there any way to close initTransactions() once it has been called, so that it doesn't block other producers from executing transactions when they share the same transactional.id?
P.S. I don't want to close the producer and re-instantiate it after every transaction, as I believe this would hurt performance. How can we implement an efficient solution to this problem?
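For what it's worth, one common approach is to give each concurrently running instance its own transactional.id, since a second initTransactions() call with the same id fences the earlier producer. A minimal sketch, where the broker address and id prefix are placeholders and the hostname is just one way to make the id unique per instance:
import java.net.InetAddress;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PerInstanceTransactionalId {
    public static Producer<String, String> buildProducer() throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Derive a unique transactional.id per deployed instance (hostname is just one option).
        props.put("transactional.id", "my-app-" + InetAddress.getLocalHost().getHostName());
        Producer<String, String> producer =
                new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
        producer.initTransactions(); // called once for the lifetime of this producer instance
        return producer;
    }
}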

java KafkaConsumer never get results

I'm new to Kafka. I have the following sample code:
KafkaConsumer<String, String> kc = new KafkaConsumer<String, String>(props);
while (true) {
    List<String> topicNames = Arrays.asList(topics.split(","));
    if (!kc.assignment().isEmpty()) {
        kc.unsubscribe();
    }
    kc.subscribe(topicNames);
    ConsumerRecords<String, String> recv = kc.poll(1000L);
    if (!recv.isEmpty()) {
        System.out.println("NOT EMPTY");
    }
}
recv is always empty, but if I increase the poll timeout the records are returned; the same happens if I remove the unsubscribe part.
I've taken this piece of code from proprietary integration software and I cannot modify it.
So my question is: is this only a timing problem, or is there something more?
There is a lot that happens when a consumer (re)subscribes to a topic.
Very roughly, and as far as I remember, the consumer will:
request cluster information
request consumer group metadata
make a JOIN_GROUP request
be assigned certain partitions
The underlying mechanisms are even more complicated if there are more consumers within the same group, because the partitions have to be reassigned among all the consumers in the group.
That is why:
1000 millis might not be enough for all of this, so you didn't poll anything in time
you polled something when you increased the timeout, because Kafka managed to perform all of these bootstrapping operations
you polled something when you removed the unsubscribe call, because your consumer was most likely already subscribed
So there is a timing issue. And I think there is something more: unsubscribing and resubscribing to a topic within an infinite loop makes no sense to me (see the other answer).
You should subscribe to your topics only once at the beginning. Like this:
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    final ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}

KafkaSpout multithread or not

The Kafka 0.8.x documentation shows how to multithread a Kafka consumer:
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(a_numThreads));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

// now launch all the threads
executor = Executors.newFixedThreadPool(a_numThreads);

// now create an object to consume the messages
int threadNumber = 0;
for (final KafkaStream stream : streams) {
    executor.execute(new ConsumerTest(stream, threadNumber));
    threadNumber++;
}
But KafkaSpout in Storm does not seem to multithread.
Maybe use multiple tasks instead of multiple threads with KafkaSpout:
builder.setSpout(SqlCollectorTopologyDef.KAFKA_SPOUT_NAME, new KafkaSpout(spoutConfig), nThread);
Which one is better? Thanks
Since you mentioned Kafka 0.8.x, I am assuming the KafkaSpout you use is from storm-kafka rather than storm-kafka-client.
The first code snippet uses the high-level consumer API, which can employ multiple threads to consume multiple partitions.
As for the Kafka spout, the effect is similar, but Storm uses the low-level consumer, namely SimpleConsumer; there will be one SimpleConsumer instance created for each spout executor (task).
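In other words, parallelism with the storm-kafka spout comes from running several spout executors rather than threads inside one spout. A minimal sketch, assuming Storm 1.x package names (older releases use backtype.storm), a local ZooKeeper, and a topic "my-topic" with 4 partitions:
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class SpoutParallelismSketch {
    public static void main(String[] args) {
        // Assumed ZooKeeper address, topic name, ZK root, and consumer id.
        ZkHosts zkHosts = new ZkHosts("localhost:2181");
        SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "my-topic", "/kafka-spout", "spout-id");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        int numPartitions = 4; // usually sized to the topic's partition count

        TopologyBuilder builder = new TopologyBuilder();
        // One executor per partition: each executor runs its own SimpleConsumer,
        // so parallelism comes from multiple executors, not threads inside one spout.
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), numPartitions);
    }
}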

How to set max.poll.records in Kafka-Connect API

I am using the Confluent 3.0.1 platform and building a Kafka-to-Elasticsearch connector. For this I am extending SinkConnector and SinkTask (Kafka Connect APIs) to get data from Kafka.
As part of this code I am overriding the taskConfigs method of SinkConnector to return "max.poll.records" so that only 100 records are fetched at a time. But it is not working: I am getting all records at the same time and failing to commit offsets within the stipulated time. Can anyone help me configure "max.poll.records"?
public List<Map<String, String>> taskConfigs(int maxTasks) {
    ArrayList<Map<String, String>> configs = new ArrayList<Map<String, String>>();
    for (int i = 0; i < maxTasks; i++) {
        Map<String, String> config = new HashMap<String, String>();
        config.put(ConfigurationConstants.CLUSTER_NAME, clusterName);
        config.put(ConfigurationConstants.HOSTS, hosts);
        config.put(ConfigurationConstants.BULK_SIZE, bulkSize);
        config.put(ConfigurationConstants.IDS, elasticSearchIds);
        config.put(ConfigurationConstants.TOPICS_SATELLITE_DATA, topics);
        config.put(ConfigurationConstants.PUBLISH_TOPIC, topicTopublish);
        config.put(ConfigurationConstants.TYPES, elasticSearchTypes);
        config.put("max.poll.records", "100");
        configs.add(config);
    }
    return configs;
}
You can't override most Kafka consumer configs like max.poll.records in the connector configuration. You can, however, do so in the Connect worker configuration, using a consumer. prefix.
It was solved. I added the configuration below to connect-avro-standalone.properties:
group.id=mygroup
consumer.max.poll.records=1000
and ran the command below to run my connector:
sh ./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./etc/kafka-connect-elasticsearch/connect-elasticsearch-sink.properties

Consume only specific partition message

Here is my Kafka message producer:
ProducerRecord producerRecord = new ProducerRecord(topic, "k1", message);
producer.send(producerRecord);
And here is my consumer:
TopicPartition partition0 = new TopicPartition(topic, 0);
consumer.assign(Arrays.asList(partition0));
final int minBatchSize = 200;
List<ConsumerRecord<String, byte[]>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, byte[]> records = consumer.poll(100);
    for (ConsumerRecord<String, byte[]> record : records) {
        buffer.add(record);
        System.out.println("KEY: " + record.key() + ", VALUE: " + record.value());
    }
}
How is it possible to consume only the topic messages that have k1 as the partition key?
The only way I can see to implement such behavior is to have the number of partitions equal to the number of possible keys and use a custom partitioner to maintain key uniqueness per partition (the default hash partitioner would probably work). But this solution is far from optimal and I can't recommend it. Beyond that, you can't use any built-in mechanism to achieve this behavior; you'll have to filter messages on the client side.
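As an illustration of that client-side filtering, here is a minimal sketch that only processes records whose key is "k1". The broker address, group id, topic name, and deserializers are assumptions for illustration:
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class KeyFilteringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "k1-filter-group");         // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0))); // assumed topic name
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(100);
                for (ConsumerRecord<String, byte[]> record : records) {
                    // Keep only messages produced with key "k1"; everything else is skipped.
                    if ("k1".equals(record.key())) {
                        System.out.println("offset = " + record.offset() + ", key = " + record.key());
                    }
                }
            }
        }
    }
}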
One proposal is to remember the partition and offset of your specific message, then use assign and seek before polling on the consumer side (also set max.poll.records=1 on the consumer so that only one message is fetched at a time).
assign: assign the specific partition to the consumer;
seek: seek to the specific offset, so the next poll will return your expected message with key k1.
Note: this works like a "random" seek and will reduce message consumption performance.
The 0.10 new consumer and the max.poll.records config are required.
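A minimal sketch of that assign-and-seek approach, assuming you already know the partition and offset of the message you want (the topic name, broker address, group id, and offset value here are placeholders):
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekToMessage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "seek-demo");                // assumed group id
        props.put("max.poll.records", "1");                // fetch one message per poll, as suggested above
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        TopicPartition partition = new TopicPartition("my-topic", 0); // assumed topic, known partition
        long knownOffset = 42L;                                       // known offset of the target message

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Arrays.asList(partition));
            consumer.seek(partition, knownOffset); // position exactly on the remembered record
            ConsumerRecords<String, byte[]> records = consumer.poll(1000);
            for (ConsumerRecord<String, byte[]> record : records) {
                System.out.println("offset = " + record.offset() + ", key = " + record.key());
            }
        }
    }
}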