Storm - Reliable Spout - apache-kafka

How do I set a Storm spout as reliable?
Is there any property to set?
Is KafkaSpout reliable by default?
How do I change the reliability of KafkaSpout?

KafkaSpout is reliable by default. Here's the code from the PartitionManager class, which is responsible for reading messages from a Kafka topic:
collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset()));
As you can see, the second parameter of the emit method is a KafkaMessageId. You can pass a message ID in your own spouts in a similar way; the message ID can be an ordinary integer.
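For illustration, here is a minimal sketch of a custom reliable spout, assuming Storm 1.x packages (org.apache.storm) and a hypothetical readNextMessage() source. Emitting with a message ID anchors the tuple, so Storm calls ack() or fail() on the spout once the tuple tree completes or times out:

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class ReliableSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private int nextId = 0;                                       // an ordinary integer as message ID
    private final Map<Integer, String> pending = new HashMap<>(); // emitted but not yet acked

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String message = readNextMessage();                   // hypothetical message source
        if (message != null) {
            int id = nextId++;
            pending.put(id, message);                         // keep it until acked
            collector.emit(new Values(message), id);          // second argument is the message ID
        }
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                                // fully processed downstream
    }

    @Override
    public void fail(Object msgId) {
        collector.emit(new Values(pending.get(msgId)), msgId); // replay the failed tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }

    private String readNextMessage() {
        return null; // stand-in; a real spout would read from its source here
    }
}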


Kafka message composition

A Kafka message has: key, value, compression type, headers (optional key-value pairs), partition + offset, and timestamp.
The key is hashed to determine which partition the producer writes to.
So why do we need the partition as part of the message?
Also, how does the producer know the offset? The offset seems more like a property of the Kafka server. And doesn't it cause coupling between the server and the producer?
And how would it work if multiple producers are writing to a topic, since the offsets sent by them may clash?
Why do we need the partition as part of the message?
Setting the record partition is optional for the client. The partition is still needed in the protocol because the key is not hashed server-side and then rerouted; the client determines the partition before the record leaves the producer.
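A small sketch of that client-side choice, assuming the standard Java producer API and a hypothetical topic "my-topic":

import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionChoice {
    public static void main(String[] args) {
        // Partition chosen explicitly by the client and carried in the produce request:
        ProducerRecord<String, String> explicit =
                new ProducerRecord<>("my-topic", 3, "my-key", "my-value");

        // Partition left unset: the producer's partitioner hashes "my-key" client-side
        // before sending, so the broker never needs to hash the key and reroute.
        ProducerRecord<String, String> hashed =
                new ProducerRecord<>("my-topic", "my-key", "my-value");
    }
}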
How does the producer know the offset, since the offset seems more like a property of the Kafka server?
The producer doesn't know the offset when a batch is sent; it would need a callback to receive the RecordMetadata after the broker has appended the batch.
And doesn't it cause coupling between server and producer?
In a sense, yes: it's part of the Kafka protocol. The consumer is just as "coupled" to the server, because it must understand how to communicate with it.
What if multiple producers are writing to a topic, since the offsets sent by them may clash?
If max.in.flight.requests.per.connection is more than 1 and retries are enabled, then yes, batches may get reordered. But send requests are initially ordered, and clients do not set the record offset; the broker does, as it appends each batch to the partition log.
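As a sketch of the configs involved, assuming the Java producer (the property names are real; the values are illustrative):

import java.util.Properties;

public class OrderingConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // With more than one in-flight request plus retries, a failed batch can be
        // retried behind a newer one, reordering records within a partition:
        props.put("max.in.flight.requests.per.connection", "1"); // 1 preserves ordering
        props.put("retries", "3");
        // Note: the offset never appears here; the partition leader assigns it
        // when it appends the batch to the log.
    }
}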

Is there any way to use a different auto.offset.reset strategy for different input topics in a Kafka Streams app?

The use case is: I have a Kafka Streams app that consumes from an input topic and outputs to an intermediate topic; then, within the same application, another part of the topology consumes from this intermediate topic.
Whenever the application ID is updated, both topics start being consumed from earliest. I want to set auto.offset.reset for the intermediate topic to latest while keeping it at earliest for the input topic.
Yes. You can set the reset strategy for each topic via:
// Processor API
topology.addSource(AutoOffsetReset offsetReset, String name, String... topics);
// DSL
builder.stream(String topic, Consumed.with(AutoOffsetReset offsetReset));
builder.table(String topic, Consumed.with(AutoOffsetReset offsetReset));
All of those methods have overloads that let you set it.
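For example, a minimal DSL sketch (the topic names are illustrative):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class PerTopicReset {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Input topic: reprocess from the beginning when no committed offset exists.
        KStream<String, String> input = builder.stream("input-topic",
                Consumed.with(Topology.AutoOffsetReset.EARLIEST));

        // Intermediate topic: skip old data and start from the end.
        KStream<String, String> intermediate = builder.stream("intermediate-topic",
                Consumed.with(Topology.AutoOffsetReset.LATEST));

        return builder.build();
    }
}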

Can a Kafka client select specific partitions to consume?

I have a single Kafka client instance that is consuming from 200 partitions; now I want it to consume only the first 3 partitions, for debugging and sampling purposes.
Is there a way I can do that?
Alternatively, I could consume from all partitions and drop messages that are not from the first 3. Is there a way to find out which partition a message came from?
You can use KafkaConsumer.assign(java.util.Collection<TopicPartition> partitions) to assign a specific set of partitions. To find out the partition of a message, you can use ConsumerRecord.partition().
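A minimal sketch putting both together, assuming the Java consumer and a hypothetical topic "my-topic" (poll(Duration) assumes a 2.x-or-newer client):

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class FirstThreePartitions {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() bypasses the group protocol: this consumer reads exactly these partitions.
            consumer.assign(Arrays.asList(
                    new TopicPartition("my-topic", 0),
                    new TopicPartition("my-topic", 1),
                    new TopicPartition("my-topic", 2)));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // record.partition() reports which partition the message came from
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}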
If you only want to consume a subset of the partitions, you can implement org.apache.kafka.clients.consumer.internals.PartitionAssignor. Apache Kafka's own tests already do this with MockPartitionAssignor (which extends PartitionAssignor). Implement PartitionAssignor and set it via the "partition.assignment.strategy" config.
Reference: https://kafka.apache.org/documentation/#newconsumerconfigs
Since you haven't specified which consumer API you use, I am going to give an example based on the Python kafka-python library.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['host:9092'])
for message in consumer:
    if message.partition in (0, 1, 2):
        print(message.value)  # do something with the message
If you really want to read only from a subset of partitions, you can do something like this (note that assign requires a consumer created without a topic subscription):
consumer = KafkaConsumer(group_id='my-group',
                         bootstrap_servers=['host:9092'])
consumer.assign([TopicPartition('foobar', 2)])
next_message = next(consumer)
But I would suggest the first approach. Whichever language you develop in, every Kafka client library implements the features above.

Checking Kafka Cache against a particular field

I have a Kafka topic that is receiving messages. My consumer receives an object with various fields, let's say id and name. After some time I receive another object, and so on. Later, I want to check whether an object with id = {some number} is present in Kafka's "cache" or not. So, is there a way to check Kafka's cache against a particular field?
You are talking about two different consumers here: Consumer 1 just consumes messages from the Kafka topic and does some processing based on your logic, while Consumer 2 consumes all the messages in the topic from the beginning and matches them against a set of id(s).
The second case becomes more and more expensive as messages are added to the topic. This is not the use case Kafka is built for; you could do this more effectively in ActiveMQ.
If you still want to use Kafka, the other option is to maintain a small in-memory HashSet of ids against which you can do the comparison, as in the sketch below.
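A minimal sketch of that in-memory approach, assuming the Java consumer, that each record's key carries the id, and an illustrative topic "my-topic" (the consume loop would run on its own thread):

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class IdCache {
    // ids seen so far; use ConcurrentHashMap.newKeySet() if queried from another thread
    private final Set<String> seenIds = new HashSet<>();

    public void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "id-cache");
        props.put("auto.offset.reset", "earliest"); // read the topic from the start
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    seenIds.add(record.key()); // remember every id we have consumed
                }
            }
        }
    }

    // "Is id = X present?" becomes a set lookup instead of a topic rescan.
    public boolean contains(String id) {
        return seenIds.contains(id);
    }
}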

Kafka consumer retrieve data using offset

While writing data into Kafka with a producer, can I get the offset of the record?
Can I use the same offset and partition to retrieve that specific record?
Please share an example if you can.
When you send a record to Kafka, you can learn the offset and the partition assigned to that record by using one of the overloaded versions of the send method: the one with a Callback parameter, whose onCompletion method provides you a RecordMetadata instance with the information you want.
You can take a look at the Kafka Producer API for that here:
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
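A minimal sketch of the producer side, assuming the Java producer and an illustrative topic "my-topic":

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SendWithCallback {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("my-topic", "my-key", "my-value");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    // the broker assigned these; save them to re-read the record later
                    System.out.printf("partition=%d offset=%d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes pending sends, so the callback fires before exit
    }
}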
From the consumer side, if you want to retrieve a specific record starting at a specific offset, you can use the assign method (instead of subscribe) to have the consumer assigned to a specific partition, and then use seek to specify the offset. Note that the consumer won't receive just one record but all the records starting from that offset.
For this, see the Kafka Consumer API as well:
https://kafka.apache.org/10/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
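And a consumer-side sketch, assuming a record was previously written to partition 2 at offset 42 (illustrative numbers) of a hypothetical topic "my-topic":

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 2);
            consumer.assign(Collections.singletonList(tp)); // no group subscription needed
            consumer.seek(tp, 42L);                         // start exactly at offset 42
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                break; // we only wanted the record at the given offset
            }
        }
    }
}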