Why doesn't a new consumer group id start from the beginning - apache-kafka

I have a kafka 0.10 cluster with several topics that have messages produced to them.
When I subscribe to the topics with a KafkaConsumer and a new group Id I get no records returned, but if I subscribe to the topics with a ConsumerRebalanceListener that seeks to the beginning with the same group Id, then I get the records in the topic.
#Grab('org.apache.kafka:kafka-clients:0.10.0.0')
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.consumer.ConsumerRecords
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.PartitionInfo
Properties props = new Properties()
props.with {
put("bootstrap.servers","***********:9091")
put("group.id","script-test-noseek")
put("enable.auto.commit","true")
put("key.deserializer","org.apache.kafka.common.serialization.StringDeserializer")
put("value.deserializer","org.apache.kafka.common.serialization.StringDeserializer")
put("session.timeout.ms",30000)
}
KafkaConsumer consumer = new KafkaConsumer(props)
def topicMap = [:]
consumer.listTopics().each { topic, partitioninfo ->
topicMap[topic] = 0
}
topicMap.each {topic, count ->
def stopTime = new Date().time + 30_000
def stop = false
println "Starting topic: $topic"
consumer.subscribe([topic])
//consumer.subscribe([topic], new CRListener(consumer:consumer))
while(!stop) {
ConsumerRecords<String, String> records = consumer.poll(5_000)
topicMap[topic] += records.size()
consumer.commitAsync()
if ( new Date().time > stopTime || records.size() == 0) {
stop = true
}
}
consumer.unsubscribe()
}
def total = 0
println "------------------- Results -----------------------"
topicMap.each { k,v ->
if ( v > 0 ) {
println "Topic: ${k.padRight(64,' ')} Records: ${v}"
}
total += v
}
println "==================================================="
println "Total: ${total}"
def dummy = "Process End"
class CRListener implements ConsumerRebalanceListener {
KafkaConsumer consumer
void onPartitionsAssigned(java.util.Collection partitions) {
consumer.seekToBeginning(partitions)
}
void onPartitionsRevoked(java.util.Collection partitions) {
consumer.commitSync()
}
}
The code is Groovy 2.4.x. And I've masked the bootstrap server.
If I uncomment the consumer subscribe line with the listener, it does what I expect it to do. But as it is I get no results.
Assume that I change the group Id for each run, just so as to not be picking up where another execution leaves off.
I can't see what I'm doing wrong. Any help would be appreciated.

If you use a new consumer group id and want to read the whole topic from beginning, you need to specify parameter "auto.offset.reset=earliest" in your properties. (default value is "latest")
Properties props = new Properties()
props.with {
// all other values...
put("auto.offset.reset","earliest")
}
On consumer start-up the following happens:
look for (valid) committed offset for use group.id
if (valid) offset is found, resume from there
if no (valid) offset is found, set offset according to auto.offset.reset

Related

Confluent Kafka how to get specific topic metadata without subscribe

I use some code for determination difference between read offsets and max values. It needed for inner diagnostics, for example if difference below "base line" value - so, read speed is good.
Provided code sample works good, but I should to know partition count for topic. And, later get max offset value for each partition in a topic.
How to get topic metadata without subscribing to a topic?
Something like: topic "TestTopic" have 4 partitions.
public async Task MaxOffsetValues()
{
// just for tests
// ToDo: use config from settings
while (true)
{
var topicName = "testTopic";
var config = new ConsumerConfig
{
BootstrapServers = "192.168.1.1:9092,192.168.1.2:9092,192.168.1.3:9092",
GroupId = Guid.NewGuid().ToString(),
ClientId = Dns.GetHostName(),
EnableAutoCommit = false,
};
using (var consumer = new ConsumerBuilder<Ignore, string>(config).Build())
{
var offsetBorders = consumer.QueryWatermarkOffsets(new TopicPartition(topicName, 0), TimeSpan.FromSeconds(10));
_log.Debug($"[Diagnostic] Topic: ({topicName}), Partition: ({0}) Minimal offset: ({offsetBorders.Low}) Maximum offset: ({offsetBorders.High})");
}
await Task.Delay(TimeSpan.FromSeconds(60));
}
}
If you are looking for programmatic & non-programmatic way to get metadata then there are 3 ways you can get information about a topic without subscribing it.
Admin client (https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/admin/AdminClient.html)
If you have access to confluent-control-center then it does show metadata about a topic.
You can also use Kafka-topics.sh CLI tool
For my goal was enought (code part reduced for clarity):
// ...
var brokerServers = _diagnosticSettings.Value.BrokerHosts;
var brokerDelayInSeconds = TimeSpan.FromSeconds(_diagnosticSettings.Value.BrokerDelayInSeconds);
var adminClientConfig = new AdminClientConfig
{
BootstrapServers = brokerServers,
ClientId = Dns.GetHostName(),
};
adminClient = new AdminClientBuilder(adminClientConfig).Build();
Metadata topicMetadata = null;
topicMetadata = adminClient.GetMetadata(topicExternalName, brokerDelayInSeconds);
partitionCount = topicMetadata.Topics[0].Partitions.Count;
// ...

Kafka Consumer Poll runs indefinitely and doesn't return anything

I am facing difficulty with KafkaConsumer.poll(duration timeout), wherein it runs indefinitely and never come out of the method. Understand that this could be related to connection and I have seen it a bit inconsistent sometimes. How do I handle this should poll stops responding? Given below is the snippet from KafkaConsumer.poll()
public ConsumerRecords<K, V> poll(final Duration timeout) {
return poll(time.timer(timeout), true);
}
and I am calling the above from here :
Duration timeout = Duration.ofSeconds(30);
while (true) {
final ConsumerRecords<recordID, topicName> records = consumer.poll(timeout);
System.out.println("record count is" + records.count());
}
I am getting the below error:
org.apache.kafka.common.errors.SerializationException: Error
deserializing key/value for partition at offset 2. If
needed, please seek past the record to continue consumption.
I stumbled upon some useful information while trying to fix the problem I was facing above. I will provide the piece of code which should be able to handle this, but before that it is important to know what causes this.
While producing or consuming message or data to Apache Kafka, we need schema structure to that message or data, in my case Avro schema. If there is a conflict of message being produced to Kafka that conflict with that message schema, it will have an effect on consumption.
Add below code in your consumer topic in the method where it consume records --
do remember to import below packages:
import org.apache.kafka.common.TopicPartition;
import org.jsoup.SerializationException;
try {
while (true) {
ConsumerRecords<String, GenericRecord> records = null;
try {
records = consumer.poll(10000);
} catch (SerializationException e) {
String s = e.getMessage().split("Error deserializing key/value
for partition ")[1].split(". If needed, please seek past the record to
continue consumption.")[0];
String topics = s.split("-")[0];
int offset = Integer.valueOf(s.split("offset ")[1]);
int partition = Integer.valueOf(s.split("-")[1].split(" at") .
[0]);
TopicPartition topicPartition = new TopicPartition(topics,
partition);
//log.info("Skipping " + topic + "-" + partition + " offset "
+ offset);
consumer.seek(topicPartition, offset + 1);
}
for (ConsumerRecord<String, GenericRecord> record : records) {
System.out.printf("value = %s \n", record.value());
}
}
} finally {
consumer.close();
}
I ran into this while setting up a test environment.
Running the following command on the broker printed out the stored records as one would expect:
bin/kafka-console-consumer.sh --bootstrap-server="localhost:9092" --topic="foo" --from-beginning
It turned out that the Kafka server was misconfigured. To connect from an external
IP address listeners must have a valid value in kafka/config/server.properties, e.g.
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092

Aug 2019 - Kafka Consumer Lag programmatically

Is there any way we can programmatically find lag in the Kafka Consumer.
I don't want external Kafka Manager tools to install and check on dashboard.
We can list all the consumer group and check for lag for each group.
Currently we do have command to check the lag and it requires the relative path where the Kafka resides.
Spring-Kafka, kafka-python, Kafka Admin client or using JMX - is there any way we can code and find out the lag.
We were careless and didn't monitor the process, the consumer was in zombie state and the lag went to 50,000 which resulted in lot of chaos.
Only when the issue arises we think of these cases as we were monitoring the script but didn't knew it will be result in zombie process.
Any thoughts are extremely welcomed!!
you can get this using kafka-python, run this on each broker or loop through list of brokers, it will give all topic partitions consumer lag.
BOOTSTRAP_SERVERS = '{}'.format(socket.gethostbyname(socket.gethostname()))
client = BrokerConnection(BOOTSTRAP_SERVERS, 9092, socket.AF_INET)
client.connect_blocking()
list_groups_request = ListGroupsRequest_v1()
future = client.send(list_groups_request)
while not future.is_done:
for resp, f in client.recv():
f.success(resp)
for group in future.value.groups:
if group[1] == 'consumer':
#print(group[0])
list_mebers_in_groups = DescribeGroupsRequest_v1(groups=[(group[0])])
future = client.send(list_mebers_in_groups)
while not future.is_done:
for resp, f in client.recv():
#print resp
f.success(resp)
(error_code, group_id, state, protocol_type, protocol, members) = future.value.groups[0]
if len(members) !=0:
for member in members:
(member_id, client_id, client_host, member_metadata, member_assignment) = member
member_topics_assignment = []
for (topic, partitions) in MemberAssignment.decode(member_assignment).assignment:
member_topics_assignment.append(topic)
for topic in member_topics_assignment:
consumer = KafkaConsumer(
bootstrap_servers=BOOTSTRAP_SERVERS,
group_id=group[0],
enable_auto_commit=False
)
consumer.topics()
for p in consumer.partitions_for_topic(topic):
tp = TopicPartition(topic, p)
consumer.assign([tp])
committed = consumer.committed(tp)
consumer.seek_to_end(tp)
last_offset = consumer.position(tp)
if last_offset != None and committed != None:
lag = last_offset - committed
print "group: {} topic:{} partition: {} lag: {}".format(group[0], topic, p, lag)
consumer.close(autocommit=False)
Yes. We can get consumer lag in kafka-python. Not sure if this is best way to do it. But this works.
Currently we are giving our consumer manually, you also get consumers from kafka-python, but it gives only the list of active consumers. So if one of your consumers is down. It may not show up in the list.
First establish client connection
from kafka import BrokerConnection
from kafka.protocol.commit import *
import socket
#This takes in only one broker at a time. So to use multiple brokers loop through each one by giving broker ip and port.
def establish_broker_connection(server, port, group):
'''
Client Connection to each broker for getting consumer offset info
'''
bc = BrokerConnection(server, port, socket.AF_INET)
bc.connect_blocking()
fetch_offset_request = OffsetFetchRequest_v3(group, None)
future = bc.send(fetch_offset_request)
Next we need to get the current offset for each topic the consumer is subscribed to. Pass the above future and bc here.
from kafka import SimpleClient
from kafka.protocol.offset import OffsetRequest, OffsetResetStrategy
from kafka.common import OffsetRequestPayload
def _get_client_connection():
'''
Client Connection to the cluster for getting topic info
'''
# Give comma seperated info of kafka broker "broker1:port1, broker2:port2'
client = SimpleClient(BOOTSTRAP_SEREVRS)
return client
def get_latest_offset_for_topic(self, topic):
'''
To get latest offset for a topic
'''
partitions = self.client.topic_partitions[topic]
offset_requests = [OffsetRequestPayload(topic, p, -1, 1) for p in partitions.keys()]
client = _get_client_connection()
offsets_responses = client.send_offset_request(offset_requests)
latest_offset = offsets_responses[0].offsets[0]
return latest_offset # Gives latest offset for topic
def get_current_offset_for_consumer_group(future, bc):
'''
Get current offset info for a consumer group
'''
while not future.is_done:
for resp, f in bc.recv():
f.success(resp)
# future.value.topics -- This will give all the topics in the form of a list.
for topic in self.future.value.topics:
latest_offset = self.get_latest_offset_for_topic(topic[0])
for partition in topic[1]:
offset_difference = latest_offset - partition[1]
offset_difference gives the difference between the last offset produced in the topic and the last offset (or message) consumed by your consumer.
If you are not getting current offset for a consumer for a topic then it means your consumer is probably down.
So you can raise alerts or send mail if the offset difference is above a threshold you want or if you get empty offsets for your consumer.
The java client exposes the lag for its consumers over JMX; in this example we have 5 partitions...
Spring Boot can publish these to micrometer.
I'm writing code in scala but use only native java API from KafkaConsumer and KafkaProducer.
You need only know the name of Consumer Group and Topics.
it's possible to avoid pre-defined topic, but then you will get Lag only for Consumer Group which exist and which state is stable not rebalance, this can be a problem for alerting.
So all that you really need to know and use are:
KafkaConsumer.commited - return latest committed offset for TopicPartition
KafkaConsumer.assign - do not use subscribe, because it causes to CG rebalance. You definitely do not want that your monitoring process to influence on the subject of monitoring.
kafkaConsumer.endOffsets - return latest produced offset
Consumer Group Lag - is a difference between the latest committed and latest produced
import java.util.{Properties, UUID}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import scala.collection.JavaConverters._
import scala.util.Try
case class TopicPartitionInfo(topic: String, partition: Long, currentPosition: Long, endOffset: Long) {
val lag: Long = endOffset - currentPosition
override def toString: String = s"topic=$topic,partition=$partition,currentPosition=$currentPosition,endOffset=$endOffset,lag=$lag"
}
case class ConsumerGroupInfo(consumerGroup: String, topicPartitionInfo: List[TopicPartitionInfo]) {
override def toString: String = s"ConsumerGroup=$consumerGroup:\n${topicPartitionInfo.mkString("\n")}"
}
object ConsumerLag {
def consumerGroupInfo(bootStrapServers: String, consumerGroup: String, topics: List[String]) = {
val properties = new Properties()
properties.put("bootstrap.servers", bootStrapServers)
properties.put("auto.offset.reset", "latest")
properties.put("group.id", consumerGroup)
properties.put("key.deserializer", classOf[StringDeserializer])
properties.put("value.deserializer", classOf[StringDeserializer])
properties.put("key.serializer", classOf[StringSerializer])
properties.put("value.serializer", classOf[StringSerializer])
properties.put("client.id", UUID.randomUUID().toString)
val kafkaProducer = new KafkaProducer[String, String](properties)
val kafkaConsumer = new KafkaConsumer[String, String](properties)
val assignment = topics
.map(topic => kafkaProducer.partitionsFor(topic).asScala)
.flatMap(partitions => partitions.map(p => new TopicPartition(p.topic, p.partition)))
.asJava
kafkaConsumer.assign(assignment)
ConsumerGroupInfo(consumerGroup,
kafkaConsumer.endOffsets(assignment).asScala
.map { case (tp, latestOffset) =>
TopicPartitionInfo(tp.topic,
tp.partition,
Try(kafkaConsumer.committed(tp)).map(_.offset).getOrElse(0), // TODO Warn if Null, Null mean Consumer Group not exist
latestOffset)
}
.toList
)
}
def main(args: Array[String]): Unit = {
println(
consumerGroupInfo(
bootStrapServers = "kafka-prod:9092",
consumerGroup = "not-exist",
topics = List("events", "anotherevents")
)
)
println(
consumerGroupInfo(
bootStrapServers = "kafka:9092",
consumerGroup = "consumerGroup1",
topics = List("events", "anotehr events")
)
)
}
}
if anyone is looking for consumer lag in confluent cloud here is simple script
BOOTSTRAP_SERVERS = "<>.aws.confluent.cloud"
CCLOUD_API_KEY = "{{ ccloud_apikey }}"
CCLOUD_API_SECRET = "{{ ccloud_apisecret }}"
ENVIRONMENT = "dev"
CLUSTERID = "dev"
CACERT = "/usr/local/lib/python{{ python3_version }}/site-packages/certifi/cacert.pem"
def main():
client = KafkaAdminClient(bootstrap_servers=BOOTSTRAP_SERVERS,
ssl_cafile=CACERT,
security_protocol='SASL_SSL',
sasl_mechanism='PLAIN',
sasl_plain_username=CCLOUD_API_KEY,
sasl_plain_password=CCLOUD_API_SECRET)
for group in client.list_consumer_groups():
if group[1] == 'consumer':
consumer = KafkaConsumer(
bootstrap_servers=BOOTSTRAP_SERVERS,
ssl_cafile=CACERT,
group_id=group[0],
enable_auto_commit=False,
api_version=(0,10),
security_protocol='SASL_SSL',
sasl_mechanism='PLAIN',
sasl_plain_username=CCLOUD_API_KEY,
sasl_plain_password=CCLOUD_API_SECRET
)
list_members_in_groups = client.list_consumer_group_offsets(group[0])
for (topic,partition) in list_members_in_groups:
consumer.topics()
tp = TopicPartition(topic, partition)
consumer.assign([tp])
committed = consumer.committed(tp)
consumer.seek_to_end(tp)
last_offset = consumer.position(tp)
if last_offset != None and committed != None:
lag = last_offset - committed
print("group: {} topic:{} partition: {} lag: {}".format(group[0], topic, partition, lag))
consumer.close(autocommit=False)

Kafka: KafkaConsumer not able to pull all records

I'm pretty new at Kafka.
For the purpose of stress testing my cluster, and building operational experience, I created two simple Java applications: one that repeatedly publishes messages to a topic (a sequence of integers) and another application that loads the entire topic (all records) and verifies that the sequence is complete. Expectation is that no messages get lost due to operations on the cluster (restart a node, replacing a node, topics partitions reconfigurations, etc).
The topic "sequence" has two partitions, and replication factor 3. The cluster is made of 3 virtual nodes (its for testing purposes, hence they are running on the same machine). The topic is configured to retain all messages (retention.ms set to -1)
I currently have two issues, that I have difficulties figuring out:
If I use bin/kafka-console-consumer.sh --bootstrap-server kafka-test-server:9090,kafka-test-server:9091,kafka-test-server:9092 --topic sequence --from-beginning I see ALL messages (even though not ordered, as expected) loaded on console. On the other hand, if I use the consumer application that I wrote, I see different results being loaded at each cycle: https://i.stack.imgur.com/tMK10.png - In the console output, the first line after the divisor is a call to records.partitions(), hence records are only sometimes pulled from both partitions. Why and why is the java app not behaving like bin/kafka-console-consumer.sh?
When the topic gets to big, the bin/kafka-console-consumer.sh is still able to show all messages, while the application is able to load only about 18'000 messages. I have tried playing around with
the various consumer-side configurations, with no progress. Again, the question is why is there a difference?
Thank you in advance for any hint!
Here are for ref. the two app discussed:
package ch.demo.toys;
import java.io.FileInputStream;
import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
public class SequenceProducer {
public static void main(String[] args) throws Exception {
Properties properties = new Properties();
properties.load(new FileInputStream("toy.properties"));
properties.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("acks", "1");
properties.put("retries", "3");
properties.put("compression.type", "snappy");
properties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1);
for (Integer sequence_i = 0; true; sequence_i++) {
try(Producer<Integer, String> producer = new KafkaProducer<>(properties)) {
ProducerRecord<Integer, String> record = new ProducerRecord<>("sequence", sequence_i, "Sequence number: " + String.valueOf(sequence_i));
Future<RecordMetadata> sendFuture = producer.send(record, (metadata, exception) -> {
System.out.println("Adding " + record.key() + " to partition " + metadata.partition());
if (exception != null) {
exception.printStackTrace();
}
});
}
Thread.sleep(200);
}
}
}
package ch.demo.toys;
import java.io.FileInputStream;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
public class CarthusianConsumer {
private static Properties getProperties() throws Exception {
Properties properties = new Properties();
properties.load(new FileInputStream("toy.properties"));
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.IntegerDeserializer.class);
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringDeserializer.class);
properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, Integer.MAX_VALUE);
properties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 60 * 1000);
properties.put(ConsumerConfig.GROUP_ID_CONFIG, "carthusian-consumer");
properties.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 60 * 1000);
properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1024 * 1024 * 1024);
return properties;
}
private static boolean checkConsistency(List<Integer> sequence) {
Collections.sort(sequence);
Iterator<Integer> iterator = sequence.iterator();
int control = 0;
while(iterator.hasNext()) {
int value = iterator.next();
if (value != control) {
System.out.println("");
System.out.println("Gap found:");
System.out.println("\tSequence: " + value);
System.out.println("\tControl: " + control);
return false;
}
control++;
}
System.out.print(".");
return true;
}
public static void main(String[] args) throws Exception {
// Step 1: create a base consumer object
Consumer<Integer, String> consumer = new KafkaConsumer<>(getProperties());
// Step 2: load topic configuration and build list of TopicPartitons
List<TopicPartition> topicPartitions = consumer
.partitionsFor("sequence")
.stream()
.parallel()
.map(partitionInfo -> new TopicPartition(partitionInfo.topic(), partitionInfo.partition()))
.collect(Collectors.toList());
while (true) {
List<Integer> sequence = new ArrayList<>();
for (TopicPartition topicPartition : topicPartitions) {
// Step 3. specify the topic-partition to "read" from
// System.out.println("Partition specified: " + topicPartition);
consumer.assign(Arrays.asList(topicPartition));
// Step 4. set offset at the beginning
consumer.seekToBeginning(Arrays.asList(topicPartition));
// Step 5. get all records from topic-partition
ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofMillis(Long.MAX_VALUE));
// System.out.println("\tCount: " + records.count());
// System.out.println("\tPartitions: " + records.partitions());
records.forEach(record -> { sequence.add(record.key()); });
}
System.out.println(sequence.size());
checkConsistency(sequence);
Thread.sleep(2500);
}
}
}
Thank you Mickael-Maison, here is my answer:
On producer: thanks you for the comment. I admit taking the example from the book and modifying it directly without performances considerations.
On consumer: as mentioned in the comments above, subscription was the first approach attempted which unfortunately yielded the same result described in my question: results from individual partitions, and only rarely from both partitions in the same call. I'd also love to understand the reasons for this apparently random behavior!
More on consumer: I rewind to the beginning of the topic at every cycle, because the purpose is to continuously verify that the sequence did not break (hence no messages lost). At every cycle I load all the messages and check them.
Because the single call based on topic subscription yielded an apparently random behavior (unsure when the full content of the topic is returned); I had to read off from each individual partition and join the lists of records manually before checking them - which is not what I wanted to do initially!
Are my approaches wrong?
There are a few things you should change in your clients logic.
Producer:
You are creating a new producer for each record you're sending. This is terrible in terms of performance as each producer as to first bootstrap before sendign a record. Also as a single record is sent by each producer, no batching happens. Finally compression on a single record is also inexistant.
You should first create a Producer and use it to send all records, ie move the creation out of the loop, something like:
try (Producer<Integer, String> producer = new KafkaProducer<>(properties)) {
for (int sequence_i = 18310; true; sequence_i++) {
ProducerRecord<Integer, String> record = new ProducerRecord<>("sequence", sequence_i, "Sequence number: " + String.valueOf(sequence_i));
producer.send(record, (metadata, exception) -> {
System.out.println("Adding " + record.key() + " to partition " + metadata.partition());
if (exception != null) {
exception.printStackTrace();
}
});
Thread.sleep(200L);
}
}
Consumer:
At every iteration of the for loop, you change the assignment and seek back to the beginning of the partition, so at best you will reconsume the same messages every time!
To begin, you should probably use the subscribe() API (like kafka-console-consumer.sh), so you don't have to fiddle with partitions. For example:
try (Consumer<Integer, String> consumer = new KafkaConsumer<>(properties)) {
consumer.subscribe(Collections.singletonList("topic"));
while (true) {
List<Integer> sequence = new ArrayList<>();
ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofSeconds(1L));
records.forEach(record -> {
sequence.add(record.key());
});
System.out.println(sequence.size());
checkConsistency(sequence);
Thread.sleep(2500L);
}
}

Spark Streaming from Kafka topic throws offset out of range with no option to restart the stream

I have a streaming job running on Spark 2.1.1, polling Kafka 0.10. I am using the Spark KafkaUtils class to create a DStream, and everything is working fine until I have data that ages out of the topic because of the retention policy. My problem comes when I stop my job to make some changes if any data has aged out of the topic I get an error saying that my offsets are out of range. I have done a lot of research including looking at the spark source code, and I see lots of comments like the comments in this issue: SPARK-19680 - basically saying that data should not be lost silently - so auto.offset.reset is ignored by spark. My big question, though, is what can I do now? My topic will not poll in spark - it dies on startup with the offsets exception. I don't know how to reset the offsets so my job will just get started again. I have not enabled checkpoints since I read that those are unreliable for this use. I used to have a lot of code to manage offsets, but it appears that spark ignores requested offsets if there are any committed, so I am currently managing offsets like this:
val stream = KafkaUtils.createDirectStream[String, T](
ssc,
PreferConsistent,
Subscribe[String, T](topics, kafkaParams))
stream.foreachRDD { (rdd, batchTime) =>
val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
Log.debug("processing new batch...")
val values = rdd.map(x => x.value())
val incomingFrame: Dataset[T] = SparkUtils.sparkSession.createDataset(values)(consumer.encoder()).persist
consumer.processDataset(incomingFrame, batchTime)
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}
ssc.start()
ssc.awaitTermination()
As a workaround I have been changing my group ids but that is really lame. I know this is expected behavior and should not happen, I just need to know how to get the stream running again. Any help would be appreciated.
Here is a block of code I wrote to get by this until a real solution is introduced to spark-streaming-kafka. It basically resets the offsets for the partitions that have aged out based on the OffsetResetStrategy you set. Just give it the same Map params, _params, you provide to KafkaUtils. Call this before calling KafkaUtils.create****Stream() from your driver.
final OffsetResetStrategy offsetResetStrategy = OffsetResetStrategy.valueOf(_params.get(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG).toString().toUpperCase(Locale.ROOT));
if(OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy) || OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
LOG.info("Going to reset consumer offsets");
final KafkaConsumer<K,V> consumer = new KafkaConsumer<>(_params);
LOG.debug("Fetching current state");
final List<TopicPartition> parts = new LinkedList<>();
final Map<TopicPartition, OffsetAndMetadata> currentCommited = new HashMap<>();
for(String topic: this.topics()) {
List<PartitionInfo> info = consumer.partitionsFor(topic);
for(PartitionInfo i: info) {
final TopicPartition p = new TopicPartition(topic, i.partition());
final OffsetAndMetadata m = consumer.committed(p);
parts.add(p);
currentCommited.put(p, m);
}
}
final Map<TopicPartition, Long> begining = consumer.beginningOffsets(parts);
final Map<TopicPartition, Long> ending = consumer.endOffsets(parts);
LOG.debug("Finding what offsets need to be adjusted");
final Map<TopicPartition, OffsetAndMetadata> newCommit = new HashMap<>();
for(TopicPartition part: parts) {
final OffsetAndMetadata m = currentCommited.get(part);
final Long begin = begining.get(part);
final Long end = ending.get(part);
if(m == null || m.offset() < begin) {
LOG.info("Adjusting partition {}-{}; OffsetAndMeta={} Begining={} End={}", part.topic(), part.partition(), m, begin, end);
final OffsetAndMetadata newMeta;
if(OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(begin);
} else if(OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(end);
} else {
newMeta = null;
}
LOG.info("New offset to be {}", newMeta);
if(newMeta != null) {
newCommit.put(part, newMeta);
}
}
}
consumer.commitSync(newCommit);
consumer.close();
}
auto.offset.reset=latest/earliest will be applied only when consumer starts first time.
there is Spark JIRA to resolve this issue, till then we need live with work arounds.
https://issues.apache.org/jira/browse/SPARK-19680
Try
auto.offset.reset=latest
Or
auto.offset.reset=earliest
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
One more thing that affects what offset value will correspond to smallest and largest configs is log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 10 messages, and then an hour later you post 10 more messages. The largest offset will still remain the same but the smallest one won't be able to be 0 because Kafka will already remove these messages and thus the smallest available offset will be 10.
This problem was solved in the stream structuring structure by including "failOnDataLoss" = "false". It is unclear why there is no such option in the spark DStream framework.
This is a BIG quesion for spark developers!
In our projects, we tried to solve this problem by resetting the offsets form ealiest + 5 minutes ... it helps in most cases.