How to query kafka streams state store correctly? - apache-kafka

I built a Kafka Streams application with 20 stream threads a month ago. The app calculates how much different people spend in a fixed time interval. Recently I found that the amount spent per person, as queried from the local state store, is less than the real amount. I have read the official documentation and every other document I could find, but have not found a solution.
I use Kafka 0.11.0.3 throughout: the brokers and the Kafka Streams API are both 0.11.0.3. There is only one application instance, with 20 stream threads.
Some important info:
Kafka streams config:
replication.factor 3
num.stream.threads 20
commit.interval.ms 1000
partition.assignment.strategy StickyAssignor.class.getName()
fetch.max.wait.ms 500
max.poll.records 5000
max.poll.interval.ms 300000
heartbeat.interval.ms 3000
session.timeout.ms 30000
auto.offset.reset latest
Kafka message structure:
key = the person's name
value = the amount of money the person spent
time = the time at which the message was created
Kafka streams build code:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, Double> peopleSpendStream = kStreamBuilder.stream(topic);
peopleSpendStream.groupByKey()// group by people's name
.aggregate(() -> new HashMap<String, Double>(8192),
(key, value, aggregate) -> {
aggregate.merge(key, value, Double::sum);
return aggregate;
},
TimeWindows.of(ONE_MINUTE).until(ONE_HOUR * 10), // 1-min window, retained for 10 hours
new HashMapSerde<>(), // serialize and deserialize by jackson actually
PEOPLE_SPEND_STORE_NAME);
Query code:
long time = System.currentTimeMillis();
for (String name : names) { // query by people's name
try (WindowStoreIterator<HashMap<String, Double>> iterator = store.fetch(name, time - TEN_MINUTE_MILLES, time)) {
iterator.forEachRemaining(kv -> log.info("name = {}, time = {}, cost = {}", name, kv.key, kv.value));
}
}
Am I doing something wrong? Any help would be appreciated.
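When checking the fetch range in the query code, it may help to see which window entries a given range can return. The sketch below is plain arithmetic and does not call Kafka Streams. It assumes tumbling windows aligned to the epoch and a fetch range that is inclusive and applies to window start timestamps; the class and method names are hypothetical. Note that windows are assigned by record timestamp, while the query uses wall-clock time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Kafka Streams API.
class WindowStarts {

    // Start of the epoch-aligned tumbling window containing a non-negative timestamp.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Starts of all tumbling windows whose start lies in [fromMs, toMs], both inclusive.
    static List<Long> windowStartsInRange(long fromMs, long toMs, long windowSizeMs) {
        List<Long> starts = new ArrayList<>();
        long first = windowStart(fromMs, windowSizeMs);
        if (first < fromMs) {
            first += windowSizeMs; // the window containing fromMs started before the range
        }
        for (long start = first; start <= toMs; start += windowSizeMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long oneMinuteMs = 60_000L;
        long tenMinutesMs = 10 * oneMinuteMs;
        long now = 6_030_000L; // example timestamp: 100 minutes and 30 seconds after the epoch
        List<Long> starts = windowStartsInRange(now - tenMinutesMs, now, oneMinuteMs);
        System.out.println("window starts in range: " + starts);
    }
}
```

Under these assumptions, the window that contains `time - TEN_MINUTE_MILLES` is not part of the result unless the range start falls exactly on a window boundary.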

Related

Kafka Streams windowedBy not returning after the specified time

My requirement is to group the messages with the same key that arrive on my Kafka topic, using a time window of 5 seconds, so my application should return the grouped elements every 5 seconds. Below is the application I have written. The problem is that it does not return the group of events after 5 seconds; it returns them much later, after 15 seconds, 30 seconds, or more, seemingly at random.
KStream<String, String> source = builder
.stream(sourceTopic, Consumed.with(Serdes.String(), Serdes.String()))
.filter((key, value) -> Objects.nonNull(value));
final KTable<Windowed<String>, List<String>> aggTable = source
.groupByKey(Serialized.with(Serdes.String(), new JsonSerde<>(String.class, objectMapper)))
.windowedBy(TimeWindows.of(TimeUnit.SECONDS.toMillis(5)))
.aggregate(ArrayList::new, (key, value, aggregater) -> {
aggregater.add(value);
return aggregater;
},
Materialized.<String, List<String>, WindowStore<Bytes, byte[]>>as("stateStore")
.withValueSerde(newStatusEventHolderJsonSerde()));
Do we need any extra code to make the stream return results as soon as the window time has elapsed?

Streaming application with state stores takes up to 1 hour to restart

We are using spring cloud stream with Kafka 2.0.1 and utilizing the InteractiveQueryService to fetch data from the stores. There are 4 stores that persist data on disk after aggregating data. The code for the topology looks like this:
@Slf4j
@EnableBinding(SensorMeasurementBinding.class)
public class Consumer {
public static final String RETENTION_MS = "retention.ms";
public static final String CLEANUP_POLICY = "cleanup.policy";
@Value("${windowstore.retention.ms}")
private String retention;
/**
* Process the data flowing in from a Kafka topic. Aggregate the data to:
* - 2 minute
* - 15 minutes
* - one hour
* - 12 hours
*
* @param stream
*/
@StreamListener(SensorMeasurementBinding.ERROR_SCORE_IN)
public void process(KStream<String, SensorMeasurement> stream) {
Map<String, String> topicConfig = new HashMap<>();
topicConfig.put(RETENTION_MS, retention);
topicConfig.put(CLEANUP_POLICY, "delete");
log.info("Changelog and local window store retention.ms: {} and cleanup.policy: {}",
topicConfig.get(RETENTION_MS),
topicConfig.get(CLEANUP_POLICY));
createWindowStore(LocalStore.TWO_MINUTES_STORE, topicConfig, stream);
createWindowStore(LocalStore.FIFTEEN_MINUTES_STORE, topicConfig, stream);
createWindowStore(LocalStore.ONE_HOUR_STORE, topicConfig, stream);
createWindowStore(LocalStore.TWELVE_HOURS_STORE, topicConfig, stream);
}
private void createWindowStore(
LocalStore localStore,
Map<String, String> topicConfig,
KStream<String, SensorMeasurement> stream) {
// Configure how the state store should be materialized using the provided storeName
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
.as(localStore.getStoreName());
// Set retention of changelog topic
materialized.withLoggingEnabled(topicConfig);
// Configure how windows looks like and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
// Processing description:
// The input data are 'samples' with key <installationId>:<assetId>:<modelInstanceId>:<algorithmName>
// 1. With the map we add the Tag to the key and we extract the error score from the data
// 2. With the groupByKey we group the data on the new key
// 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized will make it stored in a table
stream
.map(getInstallationAssetModelAlgorithmTagKeyMapper())
.groupByKey()
.windowedBy(configuredTimeWindows)
.reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}
private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
// Return the result of until() instead of relying on in-place mutation of the instance.
return TimeWindows.of(windowSizeMs).until(retentionMs);
}
/**
* Determine the max error score to keep by looking at the aggregated error signal and
* freshly consumed error signal
*
* @param aggValue
* @param newValue
* @return
*/
private ErrorScore getMaxErrorScore(ErrorScore aggValue, ErrorScore newValue) {
if(aggValue.getErrorSignal() > newValue.getErrorSignal()) {
return aggValue;
}
return newValue;
}
private KeyValueMapper<String, SensorMeasurement,
KeyValue<? extends String, ? extends ErrorScore>> getInstallationAssetModelAlgorithmTagKeyMapper() {
return (s, sensorMeasurement) -> new KeyValue<>(s + "::" + sensorMeasurement.getT(),
new ErrorScore(sensorMeasurement.getTs(), sensorMeasurement.getE(), sensorMeasurement.getO()));
}
}
So we are materializing aggregated data to four different stores after determining the max value within a specific window for a specific key.
Please note that retention is set to two months of data and the cleanup policy is delete. We don't compact data.
The size of the individual state stores on disk is between 14 and 20 GB.
We are making use of Interactive Queries: https://docs.confluent.io/current/streams/developer-guide/interactive-queries.html#interactive-queries
In our setup we have 4 instances of our streaming app running as one consumer group, so every instance stores a specific part of the data in its store.
This all seems to work nicely, until we restart one or more instances and wait for them to become available again. I would expect the restart not to take that long, but unfortunately it takes up to 1 hour. I guess the issue is caused by the amount of data in combination with restoring the state stores, but I'm not sure. Since we persist the state store data on persistent volumes outside of the container that runs on Kubernetes, I would have expected the app to receive the last offset from the broker and only have to continue from that point, as the previously consumed data is already in the state store. Unfortunately I don't have a clue how to resolve this.
Restarting our app triggers a restore task:
-StreamThread-2] Restoring task 4_3's state store twelve-hours-error-score from beginning of the changelog anomaly-timeline-twelve-hours-error-score-changelog-3.
This process takes quite a while. Why is it restoring from beginning and why does it take so long? I do have auto.offset.reset set to "earliest" but that is only being used when the offset is unknown isn't it?
Here are my stream settings. Note the max.bytes.buffering set to 0. I changed this, but that didn't make a difference. I also read about a bug with the num.stream.threads where > 1 causes issues, but also putting this on 1 doesn't improve restart speed.
2019-03-05 13:44:53,360 INFO main org.apache.kafka.common.config.AbstractConfig StreamsConfig values:
application.id = anomaly-timeline
application.server = localhost:5000
bootstrap.servers = [localhost:9095]
buffered.records.per.partition = 1000
cache.max.bytes.buffering = 0
client.id =
commit.interval.ms = 500
connections.max.idle.ms = 540000
default.deserialization.exception.handler = class org.apache.kafka.streams.errors.LogAndFailExceptionHandler
default.key.serde = class org.apache.kafka.common.serialization.Serdes$StringSerde
default.production.exception.handler = class org.apache.kafka.streams.errors.DefaultProductionExceptionHandler
default.timestamp.extractor = class errorscore.raw.boundary.ErrorScoreTimestampExtractor
default.value.serde = class errorscore.raw.boundary.ErrorScoreSerde
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
num.standby.replicas = 1
num.stream.threads = 2
partition.grouper = class org.apache.kafka.streams.processor.DefaultPartitionGrouper
poll.ms = 100
processing.guarantee = at_least_once
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
replication.factor = 1
request.timeout.ms = 40000
retries = 0
retry.backoff.ms = 100
rocksdb.config.setter = null
security.protocol = PLAINTEXT
send.buffer.bytes = 131072
state.cleanup.delay.ms = 600000
state.dir = ./state-store
topology.optimization = none
upgrade.from = null
windowstore.changelog.additional.retention.ms = 86400000
It also logs these messages after a while:
CleanupThread] Deleting obsolete state directory 1_1 for task 1_1 as 1188421ms has elapsed (cleanup delay is 600000ms).
Also something to note, I did add the following code in order to override the default cleanUp on start and stop where the stores by default are deleted:
@Bean
public CleanupConfig cleanupConfig() {
return new CleanupConfig(false, false);
}
Any help would be appreciated!
We think we solved the issue. Each instance had its own persistent volume. When restarting the instances, it seems that some, or sometimes all, instances got linked to other persistent volumes instead of the ones they were previously using. This made the state stores obsolete and caused the restoration process to kick in. We solved this by using NFS to share the persistent volumes so that all instances point to the same state store directory structure. This seems to solve the issue.

java KafkaConsumer never get results

I'm new to Kafka and have the following sample code:
KafkaConsumer<String,String> kc = new KafkaConsumer<String, String>(props);
while(true) {
List<String> topicNames = Arrays.asList(topics.split(","));
if (!kc.assignment().isEmpty()) {
kc.unsubscribe();
}
kc.subscribe(topicNames);
ConsumerRecords<String, String> recv = kc.poll(1000L);
if (!recv.isEmpty()) {
System.out.println("NOT EMPTY");
}
}
recv is always empty, but if I increase the poll timeout the records are returned. They are also returned if I cut out the unsubscribe part.
I've taken this piece of code from a proprietary integration software package and I cannot modify it.
So my question is: is this only a timing problem, or is there more to it?
There is a lot that happens when a consumer (re)subscribes to a topic.
Very roughly and as far as I remember the consumer will:
request cluster information
request consumer group metadata
make a JOIN_GROUP request
be assigned certain partitions
The underlying mechanisms are even more complicated if there are more consumers within the same group. That's because the partitions should be reassigned between all the consumers within the group.
That is why:
1000 millis might not be enough for all this and you didn't poll anything in time
you polled something when you increased the timeout because Kafka managed to perform all of these bootstrapping operations
you polled something when you removed the unsubscription to the topics because most likely your consumer was already subscribed
So there is a timing issue. And I think that there is something more - un/subscribing to a topic within an infinite loop makes no sense to me (see the other answer).
You should subscribe to your topics only once at the beginning. Like this:
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
final ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records)
System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
}

Spark Streaming from Kafka topic throws offset out of range with no option to restart the stream

I have a streaming job running on Spark 2.1.1, polling Kafka 0.10. I am using the Spark KafkaUtils class to create a DStream, and everything is working fine until I have data that ages out of the topic because of the retention policy. My problem comes when I stop my job to make some changes if any data has aged out of the topic I get an error saying that my offsets are out of range. I have done a lot of research including looking at the spark source code, and I see lots of comments like the comments in this issue: SPARK-19680 - basically saying that data should not be lost silently - so auto.offset.reset is ignored by spark. My big question, though, is what can I do now? My topic will not poll in spark - it dies on startup with the offsets exception. I don't know how to reset the offsets so my job will just get started again. I have not enabled checkpoints since I read that those are unreliable for this use. I used to have a lot of code to manage offsets, but it appears that spark ignores requested offsets if there are any committed, so I am currently managing offsets like this:
val stream = KafkaUtils.createDirectStream[String, T](
ssc,
PreferConsistent,
Subscribe[String, T](topics, kafkaParams))
stream.foreachRDD { (rdd, batchTime) =>
val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
Log.debug("processing new batch...")
val values = rdd.map(x => x.value())
val incomingFrame: Dataset[T] = SparkUtils.sparkSession.createDataset(values)(consumer.encoder()).persist
consumer.processDataset(incomingFrame, batchTime)
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}
ssc.start()
ssc.awaitTermination()
As a workaround I have been changing my group IDs, but that is really lame. I know this is expected behavior and that this situation should not arise in the first place; I just need to know how to get the stream running again. Any help would be appreciated.
Here is a block of code I wrote to get by this until a real solution is introduced to spark-streaming-kafka. It basically resets the offsets for the partitions that have aged out based on the OffsetResetStrategy you set. Just give it the same Map params, _params, you provide to KafkaUtils. Call this before calling KafkaUtils.create****Stream() from your driver.
final OffsetResetStrategy offsetResetStrategy = OffsetResetStrategy.valueOf(_params.get(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG).toString().toUpperCase(Locale.ROOT));
if(OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy) || OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
LOG.info("Going to reset consumer offsets");
final KafkaConsumer<K,V> consumer = new KafkaConsumer<>(_params);
LOG.debug("Fetching current state");
final List<TopicPartition> parts = new LinkedList<>();
final Map<TopicPartition, OffsetAndMetadata> currentCommited = new HashMap<>();
for(String topic: this.topics()) {
List<PartitionInfo> info = consumer.partitionsFor(topic);
for(PartitionInfo i: info) {
final TopicPartition p = new TopicPartition(topic, i.partition());
final OffsetAndMetadata m = consumer.committed(p);
parts.add(p);
currentCommited.put(p, m);
}
}
final Map<TopicPartition, Long> begining = consumer.beginningOffsets(parts);
final Map<TopicPartition, Long> ending = consumer.endOffsets(parts);
LOG.debug("Finding what offsets need to be adjusted");
final Map<TopicPartition, OffsetAndMetadata> newCommit = new HashMap<>();
for(TopicPartition part: parts) {
final OffsetAndMetadata m = currentCommited.get(part);
final Long begin = begining.get(part);
final Long end = ending.get(part);
if(m == null || m.offset() < begin) {
LOG.info("Adjusting partition {}-{}; OffsetAndMeta={} Begining={} End={}", part.topic(), part.partition(), m, begin, end);
final OffsetAndMetadata newMeta;
if(OffsetResetStrategy.EARLIEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(begin);
} else if(OffsetResetStrategy.LATEST.equals(offsetResetStrategy)) {
newMeta = new OffsetAndMetadata(end);
} else {
newMeta = null;
}
LOG.info("New offset to be {}", newMeta);
if(newMeta != null) {
newCommit.put(part, newMeta);
}
}
}
consumer.commitSync(newCommit);
consumer.close();
}
auto.offset.reset=latest/earliest is applied only when the consumer starts for the first time.
There is a Spark JIRA issue to resolve this; until then we need to live with workarounds.
https://issues.apache.org/jira/browse/SPARK-19680
Try
auto.offset.reset=latest
Or
auto.offset.reset=earliest
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
One more thing that affects which offsets the smallest and largest settings correspond to is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 10 messages, and then an hour later you produce 10 more. Retention does not change the largest offset, but the smallest one can no longer be 0, because Kafka will already have removed the first messages; the smallest available offset will be 10.
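The retention example can be worked through with plain numbers. The following is a minimal sketch that uses no Kafka classes; the class, enum and method names are hypothetical. It applies the same check as the workaround earlier in this thread: a committed offset that is missing or smaller than the beginning offset is replaced according to the reset strategy.

```java
// Hypothetical helper for illustration; it only mirrors the offset check, without any Kafka calls.
class OffsetClamp {

    enum Reset { EARLIEST, LATEST }

    // committed == null models a group with no committed offset for the partition.
    static long resolve(Long committed, long beginning, long end, Reset reset) {
        boolean agedOut = committed == null || committed < beginning;
        if (!agedOut) {
            return committed; // still inside the retained range, keep it
        }
        return reset == Reset.EARLIEST ? beginning : end;
    }

    public static void main(String[] args) {
        // 20 messages produced in total; retention has removed the first 10.
        long beginning = 10L;
        long end = 20L;
        System.out.println(resolve(3L, beginning, end, Reset.EARLIEST));
        System.out.println(resolve(3L, beginning, end, Reset.LATEST));
        System.out.println(resolve(15L, beginning, end, Reset.EARLIEST));
    }
}
```

A committed offset of 3 has aged out in this scenario, so it moves to 10 or 20 depending on the strategy, while a committed offset of 15 is left alone.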
This problem was solved in Spark Structured Streaming by setting "failOnDataLoss" = "false". It is unclear why there is no such option in the Spark DStream framework.
This is a BIG question for the Spark developers!
In our projects, we tried to solve this problem by resetting the offsets from earliest + 5 minutes ... it helps in most cases.

Batching not working in Kafka Producer with Scala

I am writing a producer in Scala and I want to do batching. The way batching should work is that it holds the messages in a queue until it is full and then posts all of them together to the topic. But somehow it's not working: the moment I start sending messages, it posts them one by one. Does anyone know how to use batching in the Kafka producer?
val kafkaStringSerializer = "org.apache.kafka.common.serialization.StringSerializer"
val batchSize: java.lang.Integer = 163840
val props = new Properties()
props.put("key.serializer", kafkaStringSerializer)
props.put("value.serializer", kafkaStringSerializer)
props.put("batch.size", batchSize);
props.put("bootstrap.servers", "localhost:9092")
val producer = new KafkaProducer[String,String](props)
val TOPIC="topic"
val inlineMessage = "adsdasdddddssssssssssss"
for(i<- 1 to 10){
val record: ProducerRecord[String, String] = new ProducerRecord(TOPIC, inlineMessage )
val futureResponse: Future[RecordMetadata] = producer.send(record)
futureResponse.isDone
println("Future Response ==========>" + futureResponse.get().serializedValueSize())
}
You have to set linger.ms in your props.
By default it is zero, meaning that a message is sent immediately if possible.
You can increase it (for example to 100) so that batching occurs. This means higher latency, but also higher throughput.
batch.size is a maximum: if you reach it before linger.ms has passed, the data will be sent without waiting any longer.
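As a rough sketch of the settings described above, the producer properties could look like this in Java. Only java.util.Properties is used, so the snippet runs without a broker; the class name is hypothetical, the linger.ms value of 100 is the example figure from this answer, and the other values are copied from the question.

```java
import java.util.Properties;

// Hypothetical holder class; it only builds the configuration and does not create a producer.
class ProducerBatchingProps {

    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Upper bound on the size of one batch, in bytes (value taken from the question).
        props.setProperty("batch.size", "163840");
        // Wait up to 100 ms for more records before sending a batch that is not yet full.
        props.setProperty("linger.ms", "100");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting Properties object would then be passed to the KafkaProducer constructor in place of the props from the question.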
To view the batches actually sent, you will need to configure your logging. Batching is done on a background thread and you cannot see which batches are built through the producer API: you can't send or receive batches, only send a record and receive its response; communication with the broker in batches is done internally.
First, if not already done, bind a log4j properties file (-Dlog4j.configuration=file:path/to/log4j.properties):
log4j.rootLogger=WARN, stderr
log4j.logger.org.apache.kafka.clients.producer.internals.Sender=TRACE, stderr
log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
log4j.appender.stderr.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.appender.stderr.Target=System.err
For example, I will receive
TRACE Sent produce request to 2: (type=ProduceRequest, magic=1, acks=1, timeout=30000, partitionRecords=({test-1=[(record=LegacyRecordBatch(offset=0, Record(magic=1, attributes=0, compression=NONE, crc=2237306008, CreateTime=1502444105996, key=0 bytes, value=2 bytes))), (record=LegacyRecordBatch(offset=1, Record(magic=1, attributes=0, compression=NONE, crc=3259548815, CreateTime=1502444106029, key=0 bytes, value=2 bytes)))]}), transactionalId='' (org.apache.kafka.clients.producer.internals.Sender)
This is a batch of 2 records. A batch contains records sent to the same broker.
Then, play with batch.size and linger.ms to see the difference. Note that each record carries some overhead, so a batch.size of 1000 will not hold 10 messages of size 100.
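The effect of per-record overhead can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a Kafka API; the class and method names are hypothetical, and the overhead of 34 bytes is a placeholder assumption, since the real figure depends on the message format version and the key size.

```java
// Hypothetical estimator for illustration; it does not inspect real producer batches.
class BatchCapacity {

    // Number of whole records that fit in one batch, given payload and overhead per record.
    static int recordsPerBatch(int batchSizeBytes, int payloadBytes, int overheadBytes) {
        if (batchSizeBytes <= 0 || payloadBytes <= 0 || overheadBytes < 0) {
            throw new IllegalArgumentException("sizes must be positive and overhead non-negative");
        }
        return batchSizeBytes / (payloadBytes + overheadBytes);
    }

    public static void main(String[] args) {
        int assumedOverheadBytes = 34; // placeholder assumption, see the note above
        System.out.println("without overhead: " + recordsPerBatch(1000, 100, 0));
        System.out.println("with assumed overhead: " + recordsPerBatch(1000, 100, assumedOverheadBytes));
    }
}
```

With any non-zero overhead the second figure is smaller than 10, which is the point made above.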
Note that I did not find documentation that lists all the loggers and what they do (like log4j.logger.org.apache.kafka.clients.producer.internals.Sender). You can enable DEBUG/TRACE on rootLogger and find the data you want, or explore the code.
You are producing the data to the Kafka server synchronously. That is, when you call producer.send followed by futureResponse.get, it returns only after the data has been stored on the Kafka server.
Store the response in a separate list, and call futureResponse.get outside the for loop.
With default configuration, Kafka supports batching, see linger.ms and batch.size
List<Future<RecordMetadata>> responses = new ArrayList<>();
for (int i=1; i<=10; i++) {
ProducerRecord<String, String> record = new ProducerRecord<>(TOPIC, inlineMessage);
Future<RecordMetadata> response = producer.send(record);
responses.add(response);
}
for (Future<RecordMetadata> response : responses) {
response.get(); // verify whether the message is sent to the broker.
}