Logging Kafka Partition and offset from Apache Storm Trident

Where are the Kafka partitions and their corresponding offsets stored while consuming messages from Kafka using Apache Storm Trident? I could find something in Storm's ZooKeeper under ls /transactional/<StreamName>/coordinator/meta, but I am unable to understand what these offsets are or which partitions they belong to. How can I check consumer lag while running a Trident topology?
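Those znodes can be inspected directly. Below is a minimal sketch, assuming ZooKeeper runs on localhost:2181 and the stream is named my-stream (both placeholders), that recursively dumps everything Trident checkpoints under /transactional using the plain ZooKeeper client; for the Kafka Trident spouts the payloads are typically JSON carrying topic, partition, and offset fields, which you could then diff against the broker's log end offsets to estimate lag.

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class TridentOffsetDump {
    public static void main(String[] args) throws Exception {
        // Connect string and stream name are placeholder assumptions.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
        dump(zk, "/transactional/my-stream");
        zk.close();
    }

    // Print every znode path and its payload below the given path.
    private static void dump(ZooKeeper zk, String path) throws Exception {
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " -> "
                + (data == null ? "(empty)" : new String(data, StandardCharsets.UTF_8)));
        List<String> children = zk.getChildren(path, false);
        for (String child : children) {
            dump(zk, path + "/" + child);
        }
    }
}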

Related

How can I run a Kafka consumer processor instance on multiple nodes with Apache NiFi

Currently we are using Apache NiFi to consume messages via a Kafka consumer. The output of the Kafka consumer is connected to a Hive processor.
I'm looking into how to run a Kafka consumer instance on a NiFi cluster.
I have a 3-node NiFi cluster and a Kafka topic with 3 partitions, and I want the Kafka consumer to run on each node so that each consumer polls messages from one of the topic partitions.
After I started the Kafka consumer processor, I can see that the consumer always runs on a single node, not on all nodes.
Is there any configuration that I missed?
NiFi uses the Apache Kafka client, which is what performs the assignment of consumers to partitions. When you start the processor, assuming you have it set to 1 concurrent task, you should have 1 consumer on each node of your cluster, and each consumer should get assigned a different partition.
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
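For background, the partition assignment is performed by the Kafka consumer group protocol rather than by NiFi itself: all consumers sharing a group.id divide the topic's partitions among themselves. Here is a minimal standalone sketch of that mechanism, using the newer poll(Duration) client API; the broker address, topic name, and group id are placeholder assumptions. Run one copy per node and each process should end up with a different partition:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupAssignmentDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "nifi-demo-group");         // the shared group id drives the split
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                // poll() also drives the group rebalance that assigns partitions.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                System.out.println("assigned: " + consumer.assignment());
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}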

Apache Kafka and Apache Storm Integration

What is the difference between the KafkaSpout and KafkaBolt objects? Usually KafkaSpout is used for reading data from Kafka producers, but why would we use KafkaBolt?
Bolts in Storm write data; spouts read it. The KafkaSpout is a Kafka consumer and the KafkaBolt is a Kafka producer, and you read from the broker directly, not from producers.
For example, you can use a spout to read anything, transform that data within the topology, then set up a KafkaBolt to produce the results into Kafka.
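A minimal sketch of that read-transform-write shape with the storm-kafka-client API; the broker address, topic names, and the omitted transform bolt are placeholder assumptions:

import java.util.Properties;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class ReadTransformWriteTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // The KafkaSpout consumes from the broker (not from producers).
        builder.setSpout("kafka-spout", new KafkaSpout<>(
                KafkaSpoutConfig.builder("localhost:9092", "input-topic").build()));

        // A user-defined transform bolt would normally sit between the two and
        // emit tuples with "key" and "message" fields for the mapper below.

        // The KafkaBolt produces tuples into another topic.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaBolt<String, String> kafkaBolt = new KafkaBolt<String, String>()
                .withProducerProperties(props)
                .withTopicSelector(new DefaultTopicSelector("output-topic"))
                .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<>());
        builder.setBolt("kafka-bolt", kafkaBolt).shuffleGrouping("kafka-spout");
        // builder.createTopology() would then be submitted with StormSubmitter.
    }
}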

Kafka Stream Internal Topics lag increases on taking Kafka Broker down

I have a Kafka Streams application (version 0.11.0.1) which takes data from a few topics, joins the data, and puts it in another topic.
Kafka Configuration:
5 Kafka brokers - version 0.11
Kafka topics - 15 partitions and replication factor 3.
A few million records are consumed/produced every hour.
Note: whenever I take any Kafka broker down, it brings down a few consumers, the Kafka Streams consumers rebalance, and the lag of the internal topics increases from 0 to a few million (1-10 million).
Is this because of some local state store configuration or something else? How can I handle this?
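The lag being described can be measured from the outside: a Kafka Streams application commits offsets using its application.id as the consumer group id, and its internal topics are named with that id as a prefix. A minimal sketch that compares committed offsets with log end offsets per partition; the broker address, group id, and topic name are placeholder assumptions:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class StreamsLagChecker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-streams-app");          // the Streams application.id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            for (PartitionInfo pi : consumer.partitionsFor("input-topic")) { // placeholder topic
                TopicPartition tp = new TopicPartition(pi.topic(), pi.partition());
                OffsetAndMetadata committed = consumer.committed(tp);
                long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
                // Lag = log end offset minus the group's committed offset.
                long lag = (committed == null) ? end : end - committed.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            }
        }
    }
}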

Kafka Stream reprocessing old messages on rebalancing

I have a Kafka Streams application which reads data from a few topics, joins the data and writes it to another topic.
This is the configuration of my Kafka cluster:
5 Kafka brokers
Kafka topics - 15 partitions and replication factor 3.
My Kafka Streams applications are running on the same machines as my Kafka brokers.
A few million records are consumed/produced per hour. Whenever I take a broker down, the application goes into a rebalancing state, and after rebalancing many times it starts consuming very old messages.
Note: when the Kafka Streams application was running fine, its consumer lag was almost 0. But after rebalancing, its lag went from 0 to 10 million.
Could this be because of offsets.retention.minutes?
This is the log and offset retention policy configuration of my Kafka broker:
log retention policy: 3 days
offsets.retention.minutes: 1 day
In the link below I read that this could be the cause:
Offset Retention Minutes reference
Any help in this would be appreciated.
Offset retention can have an impact. Cf. this FAQ: https://docs.confluent.io/current/streams/faq.html#why-is-my-application-re-processing-data-from-the-beginning
Also cf. How to commit manually with Kafka Stream? about how commits work.
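The likely chain of events, following that FAQ: once the broker deletes committed offsets after offsets.retention.minutes, the consumer falls back to its auto.offset.reset policy, and Kafka Streams defaults that policy to earliest, which replays the input topics from the beginning. A minimal sketch of pinning the policy explicitly (the application id and broker address are placeholder assumptions):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsResetPolicy {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // The application id doubles as the consumer group id whose offsets can expire.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Kafka Streams defaults this to "earliest"; "latest" avoids replaying
        // old records once the committed offsets have been deleted.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        return props;
    }
}

The complementary broker-side fix is raising offsets.retention.minutes so that committed offsets outlive the log retention period.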

Kafka Consumer - JMX Properties

I enabled JMX on the Kafka brokers on port 8081. When I view the MBean properties in JConsole, I only see the following under kafka.consumer:
kafka.consumer:type=FetchRequestAndResponseMetrics,name=FetchRequestRateAndTimeMs,clientId=ReplicaFetcherThread-2-413
kafka.consumer:type=FetchRequestAndResponseMetrics,name=FetchResponseSize,clientId=ReplicaFetcherThread-0-413
But none of the other beans identified here under Kafka Consumer Metrics are emitted via JMX.
Kafka version: 0.8.2.1
I am specifically interested in -
kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)
Any thoughts?
The JMX port you are listening on is the broker's port, but the kafka.consumer MBeans are metrics of the consumer JVM. So if you have another JVM that consumes a topic, you will see the kafka.consumer MBeans there. (The FetchRequestAndResponseMetrics you do see belong to the brokers' replica fetcher threads, which are themselves consumers.)
ConsumerLag is an overloaded term in Kafka; it can stand for:
Consumer's metric: the calculated difference between the consumer's current offset and the log end offset (the offset of the most recently produced message). You can find it under a JMX bean if you're using a Java/Scala-based consumer (e.g. the pykafka consumer doesn't export metrics):
kafka v0.8.2.x:
kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)
kafka v0.9+:
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+)
Consumer offsets used to be stored in ZooKeeper (Kafka <= v0.8); newer versions of Kafka have a special topic, __consumer_offsets, that stores each consumer's committed offsets. There are tools (e.g. kafka-manager) that compute lag by consuming messages from this topic and comparing committed offsets with log end offsets. In kafka-manager you have to enable this feature for each cluster:
[ ] Poll consumer information (Not recommended for large # of consumers)
Broker's metric: represents the offset difference between partition leaders and their followers. You can find this metric under the JMX bean:
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
This may help for finding it on 0.8, but I am currently running a Kafka 0.10 broker and consumer. Using a console consumer, I pointed JConsole at that consumer and found on the MBeans tab: kafka.consumer -> consumer-fetch-manager-metrics -> consumer-1 -> Attributes -> records-lag-max.
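The same attribute can also be read programmatically over a remote JMX connection. A minimal sketch, assuming the consumer JVM was started with remote JMX enabled on localhost:9999 and that its client id is consumer-1 (both placeholders):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConsumerLagJmxReader {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint of the consumer JVM, not the broker.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // New-consumer (v0.9+) bean; the client id is a placeholder.
            ObjectName bean = new ObjectName(
                    "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=consumer-1");
            Object maxLag = mbsc.getAttribute(bean, "records-lag-max");
            System.out.println("records-lag-max = " + maxLag);
        }
    }
}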