Storm Kafka Spout unable to read the last read offset - apache-kafka

I am using storm-kafka-0.9.3 to read data from Kafka and process it in Storm. Below is the Kafka spout I am using. The problem is that when I kill the Storm cluster, it does not read the old data that was sent while it was down; it starts reading from the latest offset.
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,
        "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME, UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
// Never should make this true
spoutConfig.forceFromStart = false;
// -2 is kafka.api.OffsetRequest.EarliestTime()
spoutConfig.startOffsetTime = -2;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
return kafkaSpout;

Thanks all. Since I was running the topology in local mode, Storm did not store the offset in ZooKeeper; when I ran the topology in production mode, it got resolved.
Sougata

I believe this might happen because, while the topology is running, it keeps all of its state information in ZooKeeper under the path SpoutConfig.zkRoot + "/" + SpoutConfig.id, so that in case of failure it can resume from the last offset written to ZooKeeper.
From the documentation:
Important: When re-deploying a topology make sure that the settings for SpoutConfig.zkRoot and SpoutConfig.id were not modified, otherwise the spout will not be able to read its previous consumer state information (i.e. the offsets) from ZooKeeper -- which may lead to unexpected behavior and/or to data loss, depending on your use case.
In your case, since SpoutConfig.id is a random value (UUID.randomUUID().toString()), it's not able to retrieve the last committed offset.
Also, from the same page:
when a topology has run once the setting KafkaConfig.startOffsetTime will not have an effect for subsequent runs of the topology because now the topology will rely on the consumer state information (offsets) in ZooKeeper to determine from where it should begin (more precisely: resume) reading. If you want to force the spout to ignore any consumer state information stored in ZooKeeper, then you should set the parameter KafkaConfig.ignoreZkOffsets to true. If true, the spout will always begin reading from the offset defined by KafkaConfig.startOffsetTime as described above
You could possibly use a static id to see if the spout is able to retrieve the offsets.
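For example, a minimal sketch of the same spout with a fixed id (the string "transaction-spout" here is an illustrative placeholder, not from your code):
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
// A fixed id lets the spout find its previously committed offsets under
// zkRoot + "/" + id in ZooKeeper after a restart or redeploy.
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,
        "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME, "transaction-spout");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);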

You need to set spoutConfig.zkServers and spoutConfig.zkPort:
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,
        "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME, "test");
// Tell the spout where to store/read its offsets in ZooKeeper
spoutConfig.zkPort = Constants.ZOOKEEPER_PORT;
spoutConfig.zkServers = Constants.ZOOKEEPER_SERVERS;
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
return kafkaSpout;

Related

How can I know that I have consumed all of a Kafka Topic?

I am using Flink v1.4.0. I am consuming data from a Kafka topic using a Flink Kafka consumer, as per the code below:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");

FlinkKafkaConsumer08<String> myConsumer =
        new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
myConsumer.setStartFromEarliest();     // start from the earliest record possible
myConsumer.setStartFromLatest();       // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour

DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in
each partition is called the consumer lag. If the Flink topology is consuming
the data slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
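If you do want to compare offsets yourself outside of Flink, one option (assuming the plain Kafka 0.10+ consumer API and brokers that support it) is to fetch each partition's end offset and compare it against the group's committed offset. A sketch; note that Flink only commits offsets back to Kafka when checkpointing (or its offset-committing mode) is enabled, so the committed value may trail the job's actual progress:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test"); // the same group.id the Flink job uses
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    List<TopicPartition> partitions = new ArrayList<>();
    for (PartitionInfo p : consumer.partitionsFor("topic")) {
        partitions.add(new TopicPartition("topic", p.partition()));
    }
    // Latest offset per partition vs. what the group has committed
    Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
    for (TopicPartition tp : partitions) {
        OffsetAndMetadata committed = consumer.committed(tp);
        long lag = endOffsets.get(tp) - (committed == null ? 0L : committed.offset());
        System.out.println(tp + " lag=" + lag); // lag == 0 means caught up
    }
}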
Kafka is used as a streaming source, and a stream does not have an end.
If I'm not wrong, Flink's Kafka connector pulls data from a topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you when there is new data inside a topic.
So, in your case, you could set a timeout, and if you don't read any data within that time, assume you have read all of the data inside your topic.
Anyway, if you need to read a finite batch of data, you can use one of Flink's windows or introduce some kind of markers into your Kafka topic to delimit the start and end of the batch.

How to obtain Offset of Kafka consumer?

Working with Kafka (v2.11-0.10.1.0) and Spark Streaming (v2.0.1-bin-hadoop2.7).
I have a Kafka producer and a Spark Streaming consumer to produce and consume. All works fine until I stop the consumer (for approximately 2 minutes) and start it again. The consumer starts and reads data, absolutely perfectly. But I lose the 2 minutes of data that arrived while the consumer was off.
Kafka consumer/server.properties are unchanged.
Kafka producer with properties:
Properties properties = new Properties();
properties.put("bootstrap.servers", AppCoding.KAFKA_HOST);
properties.put("auto.create.topics.enable", true);
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("retries", 1);
logger.info("Initializing Kafka Producer.");
Producer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(AppCoding.KAFKA_TOPIC, "", documentAsString));
Consuming using the Spark Streaming API as:
SparkConf sparkConf = new SparkConf().setMaster(args[4]).setAppName("Streaming");
// Create the context with 60 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60000 * 5));
//input arguments:localhost:2181 sparkS incoming 10 local[*]
Set<String> topicsSet = new HashSet<>(Arrays.asList(args[2].split(";")));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
//input arguments: localhost:9092 "" incoming 10 local[*]
JavaPairInputDStream<String, String> kafkaStream =
        KafkaUtils.createDirectStream(jssc,
                String.class, String.class,
                StringDecoder.class, StringDecoder.class,
                kafkaParams, topicsSet);
On the other end, I have been using ActiveMQ, and the ActiveMQ consumer could fetch the data produced while it was off.
Help me out if there's a configuration problem.
In Kafka, consumers actually have no direct relationship with producers. Each consumer has an offset which tracks what has been consumed in the partitions. If a consumer has no offset tracked, Kafka will automatically reset its offset to the largest one, because of the default value of the config 'auto.offset.reset'. In your case, when the brand-new consumer is started, due to the default policy it does not see the messages produced previously. You could set 'auto.offset.reset' to earliest (for the new consumer) or smallest (for the old consumer).
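For the direct stream in the question, which uses the old (0.8-style) consumer properties, that would be a one-line addition to the kafkaParams map (a sketch reusing the question's variable names):
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
// For the old consumer, "smallest" plays the role of "earliest": when no
// offset is tracked for the group, start from the beginning of each partition.
kafkaParams.put("auto.offset.reset", "smallest");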
Kafka maintains offsets per partition. While the consumer was off for the 2-minute duration, its offset value would still be stored (in the offsets topic for the new consumer), and when the consumer is started again after 2 minutes, it would resume from the last offset stored in Kafka.
I think what you need to check is the Kafka broker's data retention policy. If it is less than 2 minutes, the data would be lost; and if the data corresponding to the stored offset is no longer present, the consumer would start reading from the latest offset, since the default for newly arriving data is auto.offset.reset=latest.
I would suggest checking the Kafka data retention policy, and changing it accordingly if it is less than 2 minutes.
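If you are on a Kafka version that ships the AdminClient (0.11+; the 0.10.1.0 in the question predates it, so this only applies after an upgrade), you could inspect the topic's retention setting programmatically. A sketch; the topic name "incoming" is taken from the question's arguments:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "incoming");
    Config config = admin.describeConfigs(Collections.singleton(topic))
            .all().get().get(topic);
    // retention.ms is the per-topic retention; it falls back to the broker's
    // log.retention.* settings when not set explicitly on the topic.
    System.out.println("retention.ms = " + config.get("retention.ms").value());
}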

Apache Storm: Support of topic wildcards in Kafka spout

We have a topology that has multiple kafka spout tasks. Each spout task is supposed to read a subset of messages from a set of Kafka topics. Topics have to be subscribed using a wild card such as AAA.BBB.*. The expected behaviour would be that all spout tasks collectively will consume all messages in all of the topics that match the wild card. Each message is only routed to a single spout task (Ignore failure scenarios). Is this currently supported?
Perhaps you could use the DynamicBrokersReader class.
Map conf = new HashMap();
...
conf.put("kafka.topic.wildcard.match", true);
DynamicBrokersReader wildCardBrokerReader =
        new DynamicBrokersReader(conf, connectionString, masterPath, "AAA.BBB.*");
List<GlobalPartitionInformation> partitions = wildCardBrokerReader.getBrokerInfo();
...
for (GlobalPartitionInformation eachTopic : partitions) {
    StaticHosts hosts = new StaticHosts(eachTopic);
    SpoutConfig spoutConfig = new SpoutConfig(hosts, eachTopic.topic, zkRoot, id);
    KafkaSpout spout = new KafkaSpout(spoutConfig);
}
... // Wrap the created spout instances into a container

FlinkKafkaConsumer082 auto.offset.reset setting doesn't work?

I have a Flink streaming program which reads data from a Kafka topic. In the program, auto.offset.reset is set to "smallest". When testing in the IDE (IntelliJ IDEA), the program can always read data from the beginning of the topic. Then I set up a Flink/Kafka cluster and produced some data into the Kafka topic. The first time I ran the streaming job, it read data from the beginning of the topic. But after I stopped the streaming job and ran it again, it would not read data from the beginning of the topic. How can I make the program always read data from the beginning of the topic?
Properties properties = new Properties();
properties.put("bootstrap.servers", kafkaServers);
properties.put("zookeeper.connect", zkConStr);
properties.put("group.id", group);
properties.put("topic", topics);
properties.put("auto.offset.reset", offset);
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer082<String>(topics, new SimpleStringSchema(), properties));
If you want to always read from the beginning, you need to disable checkpointing in your stream context.
Also disable auto-commit at the level of the consumer properties:
enable.auto.commit=false or auto.commit.enable=false (depending on the Kafka version)
Another way: you can keep checkpointing for failover but generate a new group.id whenever you need to read from the beginning (and occasionally clean up ZooKeeper), as sketched below.
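A sketch of that second approach, reusing the question's properties (the timestamp suffix is just one way to get a fresh consumer group):
Properties properties = new Properties();
properties.put("bootstrap.servers", kafkaServers);
properties.put("zookeeper.connect", zkConStr);
// A fresh group.id has no committed offsets in ZooKeeper, so
// auto.offset.reset=smallest kicks in and reading starts at the beginning.
properties.put("group.id", group + "-" + System.currentTimeMillis());
properties.put("auto.offset.reset", "smallest");
DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer082<String>(topics, new SimpleStringSchema(), properties));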

Kafka High Level Consumer Fetch All Messages From Topic Using Java API (Equivalent to --from-beginning)

I am testing the Kafka high-level consumer using the ConsumerGroupExample code from the Kafka site. I would like to retrieve all the existing messages on the topic called "test" that I have in the Kafka server config. Looking at other blogs, auto.offset.reset should be set to "smallest" to be able to get all the messages:
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    props.put("group.id", a_groupId);
    props.put("auto.offset.reset", "smallest");
    props.put("zookeeper.session.timeout.ms", "10000");
    return new ConsumerConfig(props);
}
The question I really have is this: what is the equivalent Java api call for the High Level Consumer that is the equivalent of:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Basically, every time a new consumer group tries to consume a topic, it'll read messages from the beginning. If you're consuming from the beginning each time just for testing purposes, then every time you initialise your consumer with a new group ID, it'll read the messages from the beginning. Here's how I did it:
properties.put("group.id", UUID.randomUUID().toString());
and read messages from the beginning each time!
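Combined with the createConsumerConfig method from the question, that looks like this (a sketch using the old high-level consumer API):
// With a brand-new group.id there are no stored offsets in ZooKeeper, so
// auto.offset.reset=smallest makes the consumer start from the beginning.
ConsumerConfig config = createConsumerConfig("localhost:2181", UUID.randomUUID().toString());
ConsumerConnector consumer = kafka.consumer.Consumer.createJavaConsumerConnector(config);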
Looks like you need to use the "low level SimpleConsumer API"
For most applications, the high level consumer Api is good enough.
Some applications want features not exposed to the high level consumer
yet (e.g., set initial offset when restarting the consumer). They can
instead use our low level SimpleConsumer Api. The logic will be a bit
more complicated and you can follow the example in here.
This example worked for getting all messages from a topic with the following arguments (note that the port is the Kafka port, not the ZooKeeper port; the topics were set up from this example):
10 my-replicated-topic 0 localhost 9092
Specifically, there is a method to get readOffset which takes kafka.api.OffsetRequest.EarliestTime():
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
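For reference, this is roughly what that helper looks like (a sketch based on the SimpleConsumer example from the Kafka documentation):
import java.util.HashMap;
import java.util.Map;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                 long whichTime, String clientName) {
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<>();
    // whichTime = kafka.api.OffsetRequest.EarliestTime() asks for the first
    // available offset; LatestTime() would return the next offset to be written.
    requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
    kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
            requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
    OffsetResponse response = consumer.getOffsetsBefore(request);
    if (response.hasError()) {
        System.out.println("Error fetching offset data. Reason: "
                + response.errorCode(topic, partition));
        return 0;
    }
    return response.offsets(topic, partition)[0];
}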
Here is another post that may provide some alternate ideas on how to sort this out: How to get data from old offset point in Kafka?
To fetch messages from the beginning, you can do this:
import kafka.utils.ZkUtils;

// Deleting the consumer group's path in ZooKeeper wipes its stored offsets,
// so the next run starts according to auto.offset.reset.
ZkUtils.maybeDeletePath("zkhost:zkport", "/consumers/group.id");
then just proceed with the usual consuming routine...
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");
props.put("group.id", UUID.randomUUID().toString());
These properties will help you.
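Note that these are new-consumer (Kafka 0.9+) properties; to actually run with them you also need deserializers. A sketch, with the topic name "test" taken from the question:
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
// A random group.id has no committed offsets, so auto.offset.reset=earliest
// makes this consumer read the topic from the beginning.
consumer.subscribe(Collections.singletonList("test"));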