Working with Kafka(v2.11-0.10.1.0)-spark-streaming(v-2.0.1-bin-hadoop2.7).
I have Kafka Producer and Spark-streaming consumer to produce and consume. All works fine till I stop consumer(for approx 2-min) and start again. The consumer starts and reads data, absolutely perfect. But, I'm lost with the 2-min data, where consumer was off.
Kafka consumer/server.properties are unchanged.
Kafka producer with properties:
Properties properties = new Properties();
properties.put("bootstrap.servers", AppCoding.KAFKA_HOST);
properties.put("auto.create.topics.enable", true);
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("retries", 1);
logger.info("Initializing Kafka Producer.");
Producer<String, String> producer = new KafkaProducer<>(properties);
producer.send(new ProducerRecord<String, String>(AppCoding.KAFKA_TOPIC, "", documentAsString));
Consuming using Spark-streaming api as:
SparkConf sparkConf = new SparkConf().setMaster(args[4]).setAppName("Streaming");
// Create the context with 60 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(60000 * 5));
//input arguments:localhost:2181 sparkS incoming 10 local[*]
Set<String> topicsSet = new HashSet<>(Arrays.asList(args[2].split(";")));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
//input arguments: localhost:9092 "" incoming 10 local[*]
JavaPairInputDStream<String, String> kafkaStream =
KafkaUtils.createDirectStream(jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
kafkaParams,
topicsSet);
On the other end i have been using ActiveMQ. While ActiveMQ Consumer could fetch me the data while its off.
Help me out if there's a confuguration problem.
In Kafka, consumers actually have no direct relationship with producers. Each consumer has an offset which tracks what has been consumed in the partitions. If a consumer has no offset tracked, Kafka will automatically reset its offset to the largest one because of the default value of config 'auto.offset.reset'. In your case, when the brand-new consumer is started, due to the default policy, it does not see the messages produced previously. You could set 'auto.offset.reset' to earliest (for new consumer) or smallest (for old consumer).
Kafka maintains offset per partition per record basis. While consumer was off for 2 minute duration, offset value would be stored in topic metadata for new-consumer, and again when the consumer is started back after 2minutes, it would read last offset which was stored in kafka topic.
I think what you need to check is kafka broker data retention policy if it is less than 2 minutes , data would be lost , if data corresponding to offset is not present , it would start reading from latest as by default value is set to latest auto.offset.reset=latest for new data arriving.
I would suggest to check and change kafka data retention policy accordingly if it is less than 2 minutes
Related
I am fairly new to Kafka and streaming.I have a requirement like every time I run the kafka producer and consumer I should get the only message produced by producer.
Below is the basic code for Producer and consumer
Producer
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
val record = new ProducerRecord[String, String]("test", "key", jsonstring)
producer.send(record)
producer.close()
Consumer
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("auto.offset.reset", "earliest")
props.put("group.id", "13")
val consumer: KafkaConsumer[String, Map[String,Any]] = new KafkaConsumer[String, Map[String,Any]](props)
consumer.subscribe(util.Arrays.asList("test"))
while (true) {
val record = consumer.poll(1000).asScala
for (data <- record.iterator){
println(data.value())
}
The Input Json I am using is the below
{
"id":1,
"Name":"foo"
}
Now the Problem I am facing is each time I run the program I am getting the duplicated values.For example If I run the code twice the consumer output looks like this
{
"id":1,
"Name":"foo"
}
{
"id":1,
"Name":"foo"
}
I want the output like if I run the program the only message that is processed by producer should be consume and should be printed.
I hv tried few things like changing the consumer properties for offset to latest
props.put("auto.offset.reset", "latest")
I also tried things mentioned like below but it didnot work for me
How can I get the LATEST offset of a kafka topic?
Can you please suggest any alternatives??
Consumer read messages from a topic partition on sequential order.
If you call to poll(), it returns records written to Kafka that consumers in our group have not read yet. Kafka tracks their consumption offset on each partition to know where to start to consume in case of restart.
Consumers maintain their partition offset in topic __consumer_offsets by using commit.
Commit is the action of updating the current position in
__consumer_offsets.
If a consumer restarted, In order to know where to start to consume, the consumer will read the latest committed offset of each partition and continue from there.
You can control the commit by two ways either set auto-commit true with commit interval
1.By enable.auto.commit true
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
2.Manual commit
consumer.commitAsync();//asyn commit
or
consumer.commitSync();//sync commit
If you fail to commit it will restart from the last committed position as shown on below pics
auto.offset.reset:
Once the consumer restarted the first time it uses auto.offset.reset to determine the initial position for each assigned partition. Please note when the group first created with a unique group id, before any messages have been consumed, the position is set according to a configurable offset reset policy (auto.offset.reset). After that, it will continue consuming message incrementally and use commit (as explained above) to track the latest consume message
Note: If the consumer crashes before any offset has been committed,
then the consumer which takes over its partitions will use the reset
policy.
So in your case
Either use manual offset commit or enable.auto.commit true for auto-commit.
Always use the same group id if you change group if it will treat different consumers and use auto.offset.reset to assign offset.
Reference: https://www.confluent.io/resources/kafka-the-definitive-guide/
I try to implement a very simple Kafka (0.9.0.1) consumer in scala (code below).
For my understanding, Kafka (or better say the Zookeeper) stores for each groupId the offset of the last consumed message for a giving topic. So given the following scenario:
Consumer with groupId1 which Yesterday consumed the only 5
messages in a topic. Now last consumed message has offset 4 (considering the
first message with offset 0)
During the night 2 new messages arrive to the topic
Today I restart the consumer, with the same groupId1, there will
be two options:
Option 1: The consumer will read the last 2 new messages which arrived during the night if I set the following property as "latest":
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
Option 2: The consumer will read all the 7 messages in the topic if I set the following property as "earliest":
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
Problem: For some reason, if I change the groupId of the consumer to groupId2, that is a new groupId for the given topic, so it never consumed any message before and its latest offset should be 0. I was expecting that by setting
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
The consumer will read during the first execution all the messages stored in the topic (the equivalent of having earliest). And then for following executions it will consume just the new ones. However this is not what happens.
If I set a new groupId and keep AUTO_OFFSET_RESET_CONFIG as latest, the consumer is not able to read any message. What I need to do then is for the first run set AUTO_OFFSET_RESET_CONFIG as earliest, and once there is already an offset different to 0 for the groupID I can move to latest.
Is this how it should be working my consumer? Is there a better solution than switching the AUTO_OFFSET_RESET_CONFIGafter the first time I run the consumer?
Below is the code I am using as a simple consumer:
class KafkaTestings {
val brokers = "listOfBrokers"
val groupId = "anyGroupId"
val topic = "anyTopic"
val props = createConsumerConfig(brokers, groupId)
def createConsumerConfig(brokers: String, groupId: String): Properties = {
val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000")
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000")
props.put(ConsumerConfig.CLIENT_ID_CONFIG, "12321")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
props
}
def run() = {
consumer.subscribe(Collections.singletonList(this.topic))
Executors.newSingleThreadExecutor.execute( new Runnable {
override def run(): Unit = {
while (true) {
val records = consumer.poll(1000)
for (record <- records) {
println("Record: "+record.value)
}
}
}
})
}
}
object ScalaConsumer extends App {
val testConsumer = new KafkaTestings()
testConsumer.run()
}
This was used as a reference to write this simple consumer
This is working as documented.
If you start a new consumer group (i.e. one for which there are no existing offsets stored in Kafka), you have to choose if the consumer should be starting from the EARLIEST possible messages (the oldest message still available in the topic) or from the LATEST (only messages that produced from now on).
Is there a better solution than switching the AUTO_OFFSET_RESET_CONFIG after the first time I run the consumer?
You can keep it at EARLIEST, because the second time you run the consumer, it will already have stored offsets and just pick up there. The reset policy is only used when a new consumer group is created.
Today I restart the consumer, with the same groupId1, there will be two options:
Not really. Since the consumer group was running the day before, it will find its committed offsets and just pick up where it left off. So no matter what you set the reset policy to, it will get these two new messages.
By aware though, that Kafka does not store these offsets forever, I believe the default is just a week. So if you shut down your consumers for more than that, the offsets may be aged out, and you could run into an accidental reset to EARLIEST (which may be expensive for large topics). Given that, it is probably prudent to change it to LATEST anyway.
You can keep it at EARLIEST, because the second time you run the consumer, it will already have stored offsets and just pick up there. The reset policy is only used when a new consumer group is created.
In my testing, I often want to read from the earliest offset, but as noted, once you've read messages with a given groupId, then your offset remains at that pointer.
I do this:
properties.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID());
Assume there are already 10 data in my topic and now I start my consumer written Flink, the consumer will consume the 11th data.
Therefore, I have 3 questions:
How to get the number of partitions of current topic and the offset of each partition respectively?
How to set the starting position for each partition for consumer manually?
If the Flink consumer crashes and after several minutes it gets recovered. How will the consumer know where to restart?
Any help is appreciated. The sample codes (I tried FlinkKafkaConsumer08, FlinkKafkaConsumer10 but all exception.):
public class kafkaConsumer {
public static void main(String[] args) throws Exception {
// create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.95.2:9092");
properties.setProperty("group.id", "test");
properties.setProperty("auto.offset.reset", "earliest");
FlinkKafkaConsumer09<String> myConsumer = new FlinkKafkaConsumer09<>(
"game_event", new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(myConsumer);
stream.map(new MapFunction<String, String>() {
private static final long serialVersionUID = -6867736771747690202L;
#Override
public String map(String value) throws Exception {
return "Stream Value: " + value;
}
}).print();
env.execute();
}
}
And pom.xml:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>1.6.1</version>
</dependency>
In order consume messages from a partition starting from a particular offset you can refer to the Flink Documentationl:
You can also specify the exact offsets the consumer should start from
for each partition:
Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L);
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L);
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L);
myConsumer.setStartFromSpecificOffsets(specificStartOffsets);
The above example configures the consumer to start from the specified
offsets for partitions 0, 1, and 2 of topic myTopic. The offset values
should be the next record that the consumer should read for each
partition. Note that if the consumer needs to read a partition which
does not have a specified offset within the provided offsets map, it
will fallback to the default group offsets behaviour (i.e.
setStartFromGroupOffsets()) for that particular partition.
Note that these start position configuration methods do not affect the
start position when the job is automatically restored from a failure
or manually restored using a savepoint. On restore, the start position
of each Kafka partition is determined by the offsets stored in the
savepoint or checkpoint (please see the next section for information
about checkpointing to enable fault tolerance for the consumer).
In case one of the consumers crashes, once it recovers Kafka will refer to consumer_offsets topic in order to continue processing the messages from the point it was left to before crashing. consumer_offsets is a topic which is used to store meta-data information regarding committed offsets for each triple (topic, partition, group). It is also periodically compacted so that only latest offsets are available. You can also refer to Flink's Kafka Connectors Metrics:
Flink’s Kafka connectors provide some metrics through Flink’s metrics
system to analyze the behavior of the connector. The producers export
Kafka’s internal metrics through Flink’s metric system for all
supported versions. The consumers export all metrics starting from
Kafka version 0.9. The Kafka documentation lists all exported metrics
in its documentation.
In addition to these metrics, all consumers expose the current-offsets
and committed-offsets for each topic partition. The current-offsets
refers to the current offset in the partition. This refers to the
offset of the last element that we retrieved and emitted successfully.
The committed-offsets is the last committed offset.
The Kafka Consumers in Flink commit the offsets back to Zookeeper
(Kafka 0.8) or the Kafka brokers (Kafka 0.9+). If checkpointing is
disabled, offsets are committed periodically. With checkpointing, the
commit happens once all operators in the streaming topology have
confirmed that they’ve created a checkpoint of their state. This
provides users with at-least-once semantics for the offsets committed
to Zookeeper or the broker. For offsets checkpointed to Flink, the
system provides exactly once guarantees.
The offsets committed to ZK or the broker can also be used to track
the read progress of the Kafka consumer. The difference between the
committed offset and the most recent offset in each partition is
called the consumer lag. If the Flink topology is consuming the data
slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
I am using Flink v1.4.0. I am consuming data from a Kafka Topic using a Kafka FLink Consumer as per the code below:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
DataStream<String> stream = env
.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest(); // start from the earliest record possible
myConsumer.setStartFromLatest(); // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour
DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in
each partition is called the consumer lag. If the Flink topology is consuming
the data slower from the topic than new data is added, the lag will increase
and the consumer will fall behind. For large production deployments we
recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
Kafka it's used as a streaming source and a stream does not have an end.
If im not wrong, Flink's Kafka connector pulls data from a Topic each X miliseconds, because all kafka consumers are Active consumers, Kafka does not notify you if there's new data inside a topic
So, in your case, just set a timeout and if you don't read data in that time, you have readed all of the data inside your topic.
Anyways, if you need to read a batch of finite data, you can use some of Flink's Windows or introduce some kind of marks inside your Kafka topic, to delimit the start and the of the batch.
I am using storm-kafka-0.9.3 to read data from the Kafka and process those data in Storm. Below is the Kafka Spout I am using.But the problem is when I kill the Storm cluster, it does not read old data which was sent during the time it was dead, it start reading from the latest offset.
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME
, "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
//Never should make this true
spoutConfig.forceFromStart=false;
spoutConfig.startOffsetTime =-2;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
return kafkaSpout;
Thanks All,
Since I was running the Topology in Local mode,Storm did not store Offset in ZK, when I ran the topology in Prod mode It got resolved.
Sougata
I believe this might happen because while the topology is running it used to keep all the state information to zookeeper using the following path SpoutConfig.zkRoot+ "/" + SpoutConfig.id so that in case of failure it can resume from the last written offset in zookeeper.
Got this from the doc
Important:When re-deploying a topology make sure that the settings for SpoutConfig.zkRoot and SpoutConfig.id were not modified, otherwise the spout will not be able to read its previous consumer state information (i.e. the offsets) from ZooKeeper -- which may lead to unexpected behavior and/or to data loss, depending on your use case.
In your case as the SpoutConfig.id is a random value UUID.randomUUID().toString() Its not able to retrieve the last committed offset.
Also read from the same page
when a topology has run once the setting KafkaConfig.startOffsetTime will not have an effect for subsequent runs of the topology because now the topology will rely on the consumer state information (offsets) in ZooKeeper to determine from where it should begin (more precisely: resume) reading. If you want to force the spout to ignore any consumer state information stored in ZooKeeper, then you should set the parameter KafkaConfig.ignoreZkOffsets to true. If true, the spout will always begin reading from the offset defined by KafkaConfig.startOffsetTime as described above
You could possibly use a static id to see if it is able to retrieve.
You need to set the spoutConfig.zkServers and spoutConfig.zkPort :
BrokerHosts hosts = new ZkHosts(Constants.ZOOKEEPER_HOST);
SpoutConfig spoutConfig = new SpoutConfig(hosts, CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME
, "/" + CommonConstants.KAFKA_TRANSACTION_TOPIC_NAME,"test");
spoutConfig.zkPort=Constants.ZOOKEEPER_PORT;
spoutConfig.zkServers=Constants.ZOOKEEPER_SERVERS;
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
return kafkaSpout;