KafkaSpout multithread or not - apache-kafka

The Kafka 0.8.x documentation shows how to use multiple threads in a Kafka consumer:
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(a_numThreads));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
// now launch all the threads
//
executor = Executors.newFixedThreadPool(a_numThreads);
// now create an object to consume the messages
//
int threadNumber = 0;
for (final KafkaStream stream : streams) {
    executor.execute(new ConsumerTest(stream, threadNumber));
    threadNumber++;
}
But KafkaSpout in Storm does not seem to use multiple threads.
Perhaps KafkaSpout relies on multiple tasks instead of multiple threads:
builder.setSpout(SqlCollectorTopologyDef.KAFKA_SPOUT_NAME, new KafkaSpout(spoutConfig), nThread);
Which one is better? Thanks.

Since you mention Kafka 0.8.x, I assume the KafkaSpout you use comes from storm-kafka rather than storm-kafka-client.
The first code snippet uses the high-level consumer API, which can employ multiple threads to consume multiple partitions.
The Kafka spout achieves much the same effect, but Storm uses the low-level consumer, namely SimpleConsumer, and one SimpleConsumer instance is created for each spout executor (task).
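As a rough illustration of why spout parallelism takes the place of consumer threads: each spout task ends up reading its own subset of the topic's partitions. The sketch below assumes a simple round-robin split by task index, which is how I understand storm-kafka to divide partitions. The class and method names are hypothetical, and this is not Storm's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class SpoutPartitionSplit {
    /**
     * Returns the partition ids one spout task would read, assuming a
     * round-robin split: task t takes partitions t, t + totalTasks, ...
     * (An assumption for illustration, not copied from Storm.)
     */
    public static List<Integer> partitionsForTask(int numPartitions, int totalTasks, int taskIndex) {
        if (totalTasks <= 0 || taskIndex < 0 || taskIndex >= totalTasks) {
            throw new IllegalArgumentException("taskIndex must be in [0, totalTasks)");
        }
        List<Integer> assigned = new ArrayList<>();
        for (int p = taskIndex; p < numPartitions; p += totalTasks) {
            assigned.add(p);
        }
        return assigned;
    }

    public static void main(String[] args) {
        int numPartitions = 5;
        int totalTasks = 2; // plays the role of the parallelism hint passed to setSpout
        for (int task = 0; task < totalTasks; task++) {
            System.out.println("task " + task + " -> "
                    + partitionsForTask(numPartitions, totalTasks, task));
        }
    }
}
```

Under this assumption, a parallelism hint larger than the number of partitions leaves some tasks with nothing to read, so the partition count is the practical upper bound for the hint.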

Related

Creating kafka stream API for JSON's

I am trying to write Kafka Streams code that converts a JSON array into individual JSON elements. Since I am new to Kafka Streams, can anyone help me write the code, for example what should go in the KStream and KTable?
My input stream will be in the following format:
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
]
[
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""},
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
]
My output must be in this form:
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":0,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":1,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359126933+05:30","data":2,"unit":""}
{"timestamp":"2017-10-24T12:44:09.359175426+05:30","data":3,"unit":""}
Can anyone help me write the code?
If you want to use Kafka Streams, you can use flatMap(). Something like:
// using new 1.0 API
StreamsBuilder builder = new StreamsBuilder();
builder.stream("topic").flatMap(...).to("output-topic");
Check out the examples and docs for more details:
https://docs.confluent.io/current/streams/developer-guide/index.html
https://github.com/confluentinc/kafka-streams-examples
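The part left as `...` above is the function that turns one array message into several element messages. Here is a minimal sketch of that splitting step in plain Java, with no Kafka dependency so it can be run on its own. `JsonArraySplitter` and `splitArray` are hypothetical names, and the hand-rolled scanner only handles an array of JSON objects; a real topology would more likely parse the value with a JSON library inside the function passed to flatMap() or flatMapValues().

```java
import java.util.ArrayList;
import java.util.List;

public class JsonArraySplitter {
    /**
     * Splits the text of a JSON array of objects into one string per
     * top-level object. Braces inside string literals are ignored.
     */
    public static List<String> splitArray(String jsonArray) {
        List<String> elements = new ArrayList<>();
        int depth = 0;            // current nesting depth of { }
        boolean inString = false; // true while inside a "..." literal
        boolean escaped = false;  // true right after a backslash in a string
        int start = -1;           // index where the current object began
        for (int i = 0; i < jsonArray.length(); i++) {
            char c = jsonArray.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;
                }
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    elements.add(jsonArray.substring(start, i + 1));
                }
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        String input = "[{\"data\":0,\"unit\":\"\"},{\"data\":1,\"unit\":\"\"}]";
        // Each element would become its own output record.
        for (String element : splitArray(input)) {
            System.out.println(element);
        }
    }
}
```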
In Python:
from kafka import KafkaConsumer
consumer = KafkaConsumer('topicName')
for message in consumer:
    print(message)
Specify the bootstrap_servers parameter in KafkaConsumer.
For Java, look at the CloudKarafka examples, which are really good:
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topic));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("msg = %s\n", record.value());
    }
}

Kafka consumer does not start from latest message

I want to have a Kafka Consumer which starts from the latest message in a topic.
Here is the Java code:
private static Properties properties = new Properties();
private static KafkaConsumer<String, String> consumer;
static
{
    properties.setProperty("bootstrap.servers","localhost");
    properties.setProperty("enable.auto.commit", "true");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("group.id", "test");
    properties.setProperty("auto.offset.reset", "latest");
    consumer = new KafkaConsumer<>(properties);
    consumer.subscribe(Collections.singletonList("mytopic"));
}
@Override
public StreamHandler call() throws Exception
{
    while (true)
    {
        ConsumerRecords<String, String> consumerRecords = consumer.poll(200);
        Iterable<ConsumerRecord<String, String>> records = consumerRecords.records("mytopic");
        for(ConsumerRecord<String, String> rec : records)
        {
            System.out.println(rec.value());
        }
    }
}
Although auto.offset.reset is set to latest, the consumer starts from messages that are two days old and then catches up with the latest messages.
What am I missing?
Have you run this same code before with the same group.id? The auto.offset.reset parameter is only used if there is not an existing offset already stored for your consumer. So if you've run the example previously, say two days ago, and then you run it again, it will start from the last consumed position.
Use seekToEnd() if you would like to manually go to the end of the topic.
See https://stackoverflow.com/a/32392174/1392894 for a slightly more thorough discussion of this.
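The rule described in this answer can be written down as a small decision function. This is a simplified model for illustration only, not Kafka's code: `StartOffsetModel` and `startingOffset` are hypothetical names, and the model ignores the "none" policy and the case where a committed offset has fallen out of range.

```java
import java.util.OptionalLong;

public class StartOffsetModel {
    /**
     * Picks the offset a consumer group starts from on one partition.
     * A committed offset always wins; auto.offset.reset is consulted
     * only when the group has no committed offset for the partition.
     */
    public static long startingOffset(OptionalLong committed, String autoOffsetReset,
                                      long logStartOffset, long logEndOffset) {
        if (committed.isPresent()) {
            return committed.getAsLong();
        }
        switch (autoOffsetReset) {
            case "earliest":
                return logStartOffset;
            case "latest":
                return logEndOffset;
            default:
                throw new IllegalArgumentException(
                        "policy not covered by this sketch: " + autoOffsetReset);
        }
    }

    public static void main(String[] args) {
        // Group "test" ran two days ago and committed offset 1200; the log now ends at 5000.
        System.out.println(startingOffset(OptionalLong.of(1200), "latest", 0, 5000));
        // A group.id that has never committed starts at the end of the log.
        System.out.println(startingOffset(OptionalLong.empty(), "latest", 0, 5000));
    }
}
```

In the question's situation the first case applies, which is why the consumer replays two days of messages even though the policy is latest.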
If you want to manually control the position of your offsets you need to set enable.auto.commit = false.
If you want to position all offsets to the end of each partition then call seekToEnd()
https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seekToEnd(java.util.Collection)

Can single Kafka producer produce messages to multiple topics and how?

I am just exploring Kafka. Currently I am using one producer and one topic to produce messages, which are consumed by one consumer. Very simple.
I read on the Kafka page that the new Producer API is thread-safe and that sharing a single instance will improve performance.
Does that mean I can use a single producer to publish messages to multiple topics?
I have never tried it myself, but I guess you can. The code for creating a producer and sending a record is (from https://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html):
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++) {
    producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));
}
So I guess that if you just put different topics in the ProducerRecord, then it should be possible.
Also, here http://kafka.apache.org/081/documentation.html#producerapi it explicitly says that you can use a method send(List<KeyedMessage<K,V>> messages) to write into multiple topics.
If I understand you correctly, you are looking to use the same producer instance to send the same or multiple messages to multiple topics.
I am not sure about Java, but in C# (.NET) you can do this with the Kafka .NET client's DependentProducerBuilder:
using (var producer = new ProducerBuilder<string, string>(config).Build())
using (var producer2 = new DependentProducerBuilder<Null, int>(producer.Handle).Build())
{
    producer.ProduceAsync("first-topic", new Message<string, string> { Key = "my-key-value", Value = "my-value" });
    producer2.ProduceAsync("second-topic", new Message<Null, int> { Value = 42 });
    producer2.ProduceAsync("first-topic", new Message<Null, int> { Value = 107 });
    producer.Flush(TimeSpan.FromSeconds(10));
}

Consume only specific partition message

Here is my Kafka message producer:
ProducerRecord producerRecord = new ProducerRecord(topic, "k1", message);
producer.send(producerRecord);
Here is my consumer:
TopicPartition partition0 = new TopicPartition(topic, 0);
consumer.assign(Arrays.asList(partition0));
final int minBatchSize = 200;
List<ConsumerRecord<String, byte[]>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, byte[]> records = consumer.poll(100);
    for (ConsumerRecord<String, byte[]> record : records) {
        buffer.add(record);
        System.out.println(record.key() + "KEY: " + record.value());
    }
}
How can I consume only the messages in the topic that have k1 as the partition key?
The only way I see to implement such behavior is to make the number of partitions equal to the number of possible keys and use a custom partitioner that keeps each key in its own partition (the default hash partitioner does not guarantee this, because two keys can hash to the same partition). But this solution is very far from optimal and I can't recommend it. Beyond that, there is no built-in mechanism for this; you will have to filter messages on the client side.
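Client-side filtering amounts to dropping every polled record whose key is not the one you want. A minimal sketch of that step follows, using a hypothetical `Message` class as a stand-in for Kafka's ConsumerRecord so the example runs without a broker; `KeyFilter` and `withKey` are also made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KeyFilter {
    /** Hypothetical stand-in for a consumed record; a real consumer would use ConsumerRecord. */
    public static final class Message {
        public final String key;
        public final String value;

        public Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Keeps only the messages whose key equals wantedKey; null keys never match. */
    public static List<Message> withKey(List<Message> polled, String wantedKey) {
        List<Message> matches = new ArrayList<>();
        for (Message m : polled) {
            if (wantedKey.equals(m.key)) {
                matches.add(m);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Pretend this batch came back from one poll of partition 0.
        List<Message> polled = Arrays.asList(
                new Message("k1", "first"),
                new Message("k2", "second"),
                new Message("k1", "third"));
        for (Message m : withKey(polled, "k1")) {
            System.out.println(m.key + ": " + m.value);
        }
    }
}
```

The consumer still fetches every message in the partition; the filter only decides which ones the application processes.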
One proposal is to remember the partition and offset of your specific message, then use assign and seek before polling on the consumer side (also set the consumer's max.poll.records=1 so that each poll fetches one message):
assign: assigns a specific partition to the consumer;
seek: seeks to a specific offset, so the next poll returns the expected message with key k1.
Note: this works like a "random" seek, but it will reduce message consumption performance.
The 0.10 new consumer and the new max.poll.records config are required.

How to read data using key in Kafka Consumer API?

I'm constructing messages using the code below:
Producer<String, String> producer = new kafka.javaapi.producer.Producer<String, String>(producerConfig);
KeyedMessage<String, String> keyedMsg = new KeyedMessage<String, String>(topic, "device-420", "{message:'hello world'}");
producer.send(keyedMsg);
And consuming them using the following code block:
//Key = topic name, Value = No. of threads for topic
Map<String, Integer> topicCount = new HashMap<String, Integer>();
topicCount.put(topic, 1);
//ConsumerConnector creates the message stream for each topic
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = consumerConnector.createMessageStreams(topicCount);
// Get Kafka stream for topic
List<KafkaStream<byte[], byte[]>> kStreamList = consumerStreams.get(topic);
// Iterate stream using ConsumerIterator
for (final KafkaStream<byte[], byte[]> kStreams : kStreamList) {
    ConsumerIterator<byte[], byte[]> consumerIte = kStreams.iterator();
    while (consumerIte.hasNext()) {
        MessageAndMetadata<byte[], byte[]> msg = consumerIte.next();
        System.out.println(topic.toUpperCase() + ">"
            + " Partition:" + msg.partition()
            + " | Key:"+ new String(msg.key())
            + " | Offset:" + msg.offset()
            + " | Message:"+ new String(msg.message()));
    }
}
Everything works fine because I'm reading data by topic. Is there any way to consume data by message key, i.e. device-420 in this example?
Short answer: no.
The smallest granularity in Kafka is a partition. You can write a client that reads only from a single partition. However, a partition can contain multiple keys and you need to consume all the keys contained in this partition.
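To see why a partition usually holds several keys, consider how a key is mapped to a partition: hash the key, then take the result modulo the partition count. The sketch below uses String.hashCode() purely as a stand-in hash (Kafka's default partitioner hashes the serialized key bytes with murmur2, so the actual partition numbers will differ); `KeyPartitionSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyPartitionSketch {
    /** Maps a key to a partition with a stand-in hash (NOT Kafka's murmur2). */
    public static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Groups keys by the partition they would land in under the stand-in hash. */
    public static Map<Integer, List<String>> groupByPartition(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                    .computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                    .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add("device-" + i);
        }
        // Ten keys over three partitions: at least one partition must hold several keys.
        groupByPartition(keys, 3).forEach((partition, inPartition) ->
                System.out.println("partition " + partition + " -> " + inPartition));
    }
}
```

Whatever hash is used, more keys than partitions means keys share partitions, so reading "only device-420" still means reading its whole partition and discarding the other keys.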