Grafana Timeout while querying large amount of logs from Loki

Grafana Timeout while querying large amount of logs from Loki - grafana

I have a Loki server running on AWS Graviton (arm, 4 vCPU, 8 GiB) configured as following:
common:
replication_factor: 1
ring:
kvstore:
store: etcd
etcd:
endpoints: ['127.0.0.1:2379']
storage_config:
boltdb_shipper:
active_index_directory: /opt/loki/index
cache_location: /opt/loki/index_cache
shared_store: s3
aws:
s3: s3://ap-south-1/bucket-name
limits_config:
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h # 7d
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
per_stream_rate_limit: 8MB
ingester:
lifecycler:
join_after: 30s
chunk_block_size: 10485760
compactor:
working_directory: /opt/loki/compactor
shared_store: s3
compaction_interval: 5m
schema_config:
configs:
- from: 2022-01-01
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: loki_
period: 24h
table_manager:
retention_period: 360h #15d
retention_deletes_enabled: true
index_tables_provisioning: # unused
provisioned_write_throughput: 500
provisioned_read_throughput: 100
inactive_write_throughput: 1
inactive_read_throughput: 100
Ingestion is working fine and I'm able to query logs for long durations from streams with less data sizes. I'm also able to query small durations of logs for streams with TiBs of data.
I see the following error in Loki when I try to query 24h of data from a large data stream and Grafana timeout after 5 mins:
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.186137309Z caller=retry.go:73 org_id=fake msg="error processing request" try=2 err="context canceled"
Feb 11 08:27:32 loki-01 loki[19490]: level=info ts=2022-02-11T08:27:32.186304708Z caller=metrics.go:92 org_id=fake latency=fast query="{filename=\"/var/log/server.log\",host=\"web-199\",ip=\"192.168.20.239\",name=\"web\"} |= \"attachDriver\"" query_type=filter range_type=range length=24h0m0s step=1m0s duration=0s status=499 limit=1000 returned_lines=0 throughput=0B total_bytes=0B
Feb 11 08:27:32 loki-01 loki[19490]: level=info ts=2022-02-11T08:27:32.23882892Z caller=metrics.go:92 org_id=fake latency=slow query="{filename=\"/var/log/server.log\",host=\"web-199\",ip=\"192.168.20.239\",name=\"web\"} |= \"attachDriver\"" query_type=filter range_type=range length=24h0m0s step=1m0s duration=59.813829694s status=400 limit=1000 returned_lines=153 throughput=326MB total_bytes=20GB
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.238959314Z caller=scheduler_processor.go:199 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=192.168.5.138:9095
Feb 11 08:27:32 loki-01 loki[19490]: level=error ts=2022-02-11T08:27:32.23898877Z caller=scheduler_processor.go:154 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=192.168.5.138:9095
Query: {filename="/var/log/server.log",host="web-199",ip="192.168.20.239",name="web"} |= "attachDriver"
Is there a way to stream the results instead of waiting for the response? Can I optimize Loki to process such queries better?

Related

How to optimize EmbddedKafka and Mongo logs in Spring Boot

how to properly keep only relevant logs when using MongoDB and Kafka in a SpringBoot application
2022-08-02 11:14:58.148 INFO 363923 --- [ main] kafka.server.KafkaConfig : KafkaConfig values:
advertised.listeners = null
alter.config.policy.class.name = null
alter.log.dirs.replication.quota.window.num = 11
alter.log.dirs.replication.quota.window.size.seconds = 1
authorizer.class.name =
auto.create.topics.enable = true
auto.leader.rebalance.enable = true
background.threads = 10
broker.heartbeat.interval.ms = 2000
broker.id = 0
broker.id.generation.enable = true
broker.rack = null
broker.session.timeout.ms = 9000
client.quota.callback.class = null
compression.type = producer
connection.failed.authentication.delay.ms = 100
connections.max.idle.ms = 600000
...
2022-08-02 11:15:11.005 INFO 363923 --- [er-event-thread] state.change.logger : [Controller id=0 epoch=1] Changed partition test_cfr_prv_customeragreement_event_disbursement_ini-0 from NewPartition to OnlinePartition with state LeaderAndIsr(leader=0, leaderEpoch=0, isr=List(0), zkVersion=0)
2022-08-02 11:15:11.005 INFO 363923 --- [er-event-thread] state.change.logger : [Controller id=0 epoch=1] Changed partition test_cfr_prv_customeragreement_event_receipt_ini-0 from NewPartition to OnlinePartition with state LeaderAndIsr(leader=0, leaderEpoch=0, isr=List(0), zkVersion=0)
2022-08-02 11:15:11.017 INFO 363923 --- [er-event-thread] state.change.logger : [Controller id=0 epoch=1] Sending LeaderAndIsr request to broker 0 with 2 become-leader and 0 become-follower partitions
2022-08-02 11:15:11.024 INFO 363923 --- [er-event-thread] state.change.logger : [Controller id=0 epoch=1] Sending UpdateMetadata request to brokers HashSet(0) for 2 partitions
2022-08-02 11:15:11.026 INFO 363923 --- [er-event-thread] state.change.logger : [Controller id=0 epoch=1] Sending UpdateMetadata request to brokers HashSet() for 0 partitions
2022-08-02 11:15:11.028 INFO 363923 --- [quest-handler-0] state.change.logger : [Broker id=0] Handling LeaderAndIsr request correlationId 1 from controller 0 for 2 partitions
example of undesired logs
2022-08-02 11:15:04.578 INFO 363923 --- [ Thread-3] o.s.b.a.mongo.embedded.EmbeddedMongo : {"t":{"$date":"2022-08-02T11:15:04.578+02:00"},"s":"I", "c":"CONTROL", "id":51765, "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Ubuntu","version":"20.04"}}}
2022-08-02 11:15:04.579 INFO 363923 --- [ Thread-3] o.s.b.a.mongo.embedded.EmbeddedMongo : {"t":{"$date":"2022-08-02T11:15:04.578+02:00"},"s":"I", "c":"CONTROL", "id":21951, "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"net":{"bindIp":"127.0.0.1","port":34085},"replication":{"oplogSizeMB":10,"replSet":"rs0"},"security":{"authorization":"disabled"},"storage":{"dbPath":"/tmp/embedmongo-db-66eab1ce-d099-40ec-96fb-f759ef3808a4","syncPeriodSecs":0}}}}
2022-08-02 11:15:04.585 INFO 363923 --- [ Thread-3] o.s.b.a.mongo.embedded.EmbeddedMongo : {"t":{"$date":"2022-08-02T11:15:04.585+02:00"},"s":"I", "c":"STORAGE", "id":22297, "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
Please find here a link to a sample project github.com/smaillns/springboot-mongo-kafka
If we run a test we'll get a bunch of logs ! What's wrong with the current configuration ?

While checking server logs ERROR for ROCFM , facing disconnection in the ZOOKEEPER Connectivity

In the server logs, noticing that there is a disconnection in the zookeeper connection.
Details of the log is below :-
[ INFO] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [zookeeperutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(ChildrenNodeCheckWatcher) Extra zookeeper event occur in ChildrenNodeCheckWatcher, with zookeeper type and state on path -1 3 ]
[ ERROR] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [zookeeperutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(ChildrenNodeCheckWatcher) zookeeperutil.cc:85 - Failed to get correct Event Type) - Event Type - ZOO_CHILD_EVENT]
[ INFO] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [zookeeperutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(ChildrenNodeCheckWatcher) Extra zookeeper event occur in ChildrenNodeCheckWatcher, with zookeeper type and state on path -1 3 ]
[ ERROR] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [zookeeperutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(ChildrenNodeCheckWatcher) zookeeperutil.cc:85 - Failed to get correct Event Type) - Event Type - ZOO_CHILD_EVENT]
[ INFO] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [zookeeperutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(ExistsCheckWatcher) Extra zookeeper event occure in ExistsCheckWatcher with zookeeper type and state on path -1 3 ]
[ ERROR] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [zookeeperutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(ExistsCheckWatcher) zookeeperutil.cc:156 - Failed to get correct Event Type) - Expected Event Type - ZOO_CREATED_EVENT]
[ INFO] [Mon Apr 04 11:02:58 2022] [Server:sbin/recordprocessor] [programmanagerutil.cc] [10.101.103.9] [183325] [recordprocessor] [] [2483025664] [(reInitializewatcher) Zookeeper connection got the ZOO_CONNECTED_STATE .........]

Infinite retry in spring kafka consumer #retryabletopic

I am using #RetryableTopic to implement retry logic in kafka consumer.
I gave config as below:
#RetryableTopic(
attempts = "4",
backoff = #Backoff(delay = 300000, multiplier = 10.0),
autoCreateTopics = "false",
topicSuffixingStrategy = SUFFIX_WITH_INDEX_VALUE
)
However, instead of retrying for 4 times, it retries infinitely, and that too in no delay time. Can someone please help me with the code?
I want the message to be retried 4 times with first time delay- after 5 mins, then 2nd delay after 10 mins, third delay after 20 mins...
Code is below:
int i = 1;
#RetryableTopic(
attempts = "4",
backoff = #Backoff(delay = 300000, multiplier = 10.0),
autoCreateTopics = "false",
topicSuffixingStrategy = SUFFIX_WITH_INDEX_VALUE
)
#KafkaListener(topics = "topic_string_data", containerFactory = "default")
public void consume(#Payload String message , #Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
String prachi = null;
System.out.println("current time: " + new Date());
System.out.println("retry method invoked -> " + i++ + " times from topic: " + topic);
System.out.println("current time: " + new Date());
prachi.equals("abc");
}
#DltHandler
public void listenDlt(String in, #Header(KafkaHeaders.RECEIVED_TOPIC) String topic,
#Header(KafkaHeaders.OFFSET) long offset) {
System.out.println("current time dlt: " + new Date());
System.out.println("DLT Received: " + in + " from " + topic + " offset " + offset + " -> " + i++ + " times");
System.out.println("current time dlt: " + new Date());
//dump event to dlt queue
}
Kafka config:
#Bean
public ConsumerFactory<String, String> consumerFactory() {
Map<String, Object> config = new HashMap<>();
config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
config.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
config.put(ConsumerConfig.GROUP_ID_CONFIG, "grp_STRING");
return new DefaultKafkaConsumerFactory<>(config);
//inject consumer factory to kafka listener consumer factory
}
#Bean(name = "default")
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory factory = new ConcurrentKafkaListenerContainerFactory();
factory.setConsumerFactory(consumerFactory());
return factory;
}
Logs when running app: this is not the complete logs:
32:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.186 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Setting offset for partition topic_string_data-retry-1-0 to the committed offset FetchPosition{offset=1, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.187 INFO 96675 --- [ner#6-dlt-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-7, groupId=grp_STRING] Setting offset for partition topic_string_data-dlt-1 to the committed offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.187 INFO 96675 --- [3-retry-0-0-C-1] o.s.k.l.KafkaMessageListenerContainer : grp_STRING: partitions assigned: [topic_string_data-retry-0-0]
2022-02-23 13:58:40.187 INFO 96675 --- [ntainer#2-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-2, groupId=grp_STRING] Setting offset for partition topic_string_data-1 to the committed offset FetchPosition{offset=27, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.187 INFO 96675 --- [4-retry-1-0-C-1] o.s.k.l.KafkaMessageListenerContainer : grp_STRING: partitions assigned: [topic_string_data-retry-1-0]
2022-02-23 13:58:40.187 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Setting offset for partition topic_string_data-retry-2-0 to the committed offset FetchPosition{offset=1, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.188 INFO 96675 --- [ntainer#2-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-2, groupId=grp_STRING] Setting offset for partition topic_string_data-0 to the committed offset FetchPosition{offset=24, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.188 INFO 96675 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-4, groupId=grp_STRING] Setting offset for partition topic_string-1 to the committed offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.188 INFO 96675 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-4, groupId=grp_STRING] Setting offset for partition topic_string-0 to the committed offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.188 INFO 96675 --- [5-retry-2-0-C-1] o.s.k.l.KafkaMessageListenerContainer : grp_STRING: partitions assigned: [topic_string_data-retry-2-0]
2022-02-23 13:58:40.188 INFO 96675 --- [ntainer#2-0-C-1] o.s.k.l.KafkaMessageListenerContainer : grp_STRING: partitions assigned: [topic_string_data-1, topic_string_data-0]
2022-02-23 13:58:40.189 INFO 96675 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer : grp_STRING: partitions assigned: [topic_string-1, topic_string-0]
2022-02-23 13:58:40.188 INFO 96675 --- [ner#6-dlt-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-7, groupId=grp_STRING] Setting offset for partition topic_string_data-dlt-2 to the committed offset FetchPosition{offset=0, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.190 INFO 96675 --- [ner#6-dlt-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-grp_STRING-7, groupId=grp_STRING] Setting offset for partition topic_string_data-dlt-0 to the committed offset FetchPosition{offset=14, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=Optional[197mumnb29632:9092 (id: 0 rack: null)], epoch=0}}
2022-02-23 13:58:40.191 INFO 96675 --- [ner#6-dlt-0-C-1] o.s.k.l.KafkaMessageListenerContainer : grp_STRING: partitions assigned: [topic_string_data-dlt-0, topic_string_data-dlt-1, topic_string_data-dlt-2]
2022-02-23 13:58:40.196 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 1 for partition topic_string_data-retry-1-0
2022-02-23 13:58:40.196 WARN 96675 --- [4-retry-1-0-C-1] essageListenerContainer$ListenerConsumer : Seek to current after exception; nested exception is org.springframework.kafka.listener.ListenerExecutionFailedException: Listener failed; nested exception is org.springframework.kafka.listener.KafkaBackoffException: Partition 0 from topic topic_string_data-retry-1 is not ready for consumption, backing off for approx. 26346 millis.
current time: Wed Feb 23 13:58:40 IST 2022
retry method invoked -> 3 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:40 IST 2022
2022-02-23 13:58:40.713 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 2 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:40 IST 2022
retry method invoked -> 4 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:40 IST 2022
2022-02-23 13:58:41.228 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 3 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:41 IST 2022
retry method invoked -> 5 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:41 IST 2022
2022-02-23 13:58:41.740 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 4 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:41 IST 2022
retry method invoked -> 6 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:41 IST 2022
2022-02-23 13:58:42.254 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 5 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:42 IST 2022
retry method invoked -> 7 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:42 IST 2022
2022-02-23 13:58:42.777 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 6 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:42 IST 2022
retry method invoked -> 8 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:42 IST 2022
2022-02-23 13:58:43.298 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 7 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:43 IST 2022
retry method invoked -> 9 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:43 IST 2022
2022-02-23 13:58:43.809 INFO 96675 --- [3-retry-0-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-3, groupId=grp_STRING] Seeking to offset 8 for partition topic_string_data-retry-0-0
current time: Wed Feb 23 13:58:43 IST 2022
retry method invoked -> 10 times from topic: topic_string_data-retry-0
current time: Wed Feb 23 13:58:43 IST 2022
current time: Wed Feb 23 13:59:10 IST 2022
retry method invoked -> 11 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:10 IST 2022
2022-02-23 13:59:10.733 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 2 for partition topic_string_data-retry-1-0
2022-02-23 13:59:10.736 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 1 for partition topic_string_data-retry-2-0
2022-02-23 13:59:10.737 WARN 96675 --- [5-retry-2-0-C-1] essageListenerContainer$ListenerConsumer : Seek to current after exception; nested exception is org.springframework.kafka.listener.ListenerExecutionFailedException: Listener failed; nested exception is org.springframework.kafka.listener.KafkaBackoffException: Partition 0 from topic topic_string_data-retry-2 is not ready for consumption, backing off for approx. 29483 millis.
current time: Wed Feb 23 13:59:10 IST 2022
retry method invoked -> 12 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:10 IST 2022
2022-02-23 13:59:11.249 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 3 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:11 IST 2022
retry method invoked -> 13 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:11 IST 2022
2022-02-23 13:59:11.769 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 4 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:11 IST 2022
retry method invoked -> 14 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:11 IST 2022
2022-02-23 13:59:12.286 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 5 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:12 IST 2022
retry method invoked -> 15 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:12 IST 2022
2022-02-23 13:59:12.805 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 6 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:12 IST 2022
retry method invoked -> 16 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:12 IST 2022
2022-02-23 13:59:13.339 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 7 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:13 IST 2022
retry method invoked -> 17 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:13 IST 2022
2022-02-23 13:59:13.856 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 8 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:13 IST 2022
retry method invoked -> 18 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:13 IST 2022
2022-02-23 13:59:14.372 INFO 96675 --- [4-retry-1-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-6, groupId=grp_STRING] Seeking to offset 9 for partition topic_string_data-retry-1-0
current time: Wed Feb 23 13:59:14 IST 2022
retry method invoked -> 19 times from topic: topic_string_data-retry-1
current time: Wed Feb 23 13:59:14 IST 2022
current time: Wed Feb 23 13:59:40 IST 2022
retry method invoked -> 20 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:40 IST 2022
2022-02-23 13:59:40.846 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 2 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:40 IST 2022
DLT Received: prachi from topic_string_data-dlt offset 14 -> 21 times
current time dlt: Wed Feb 23 13:59:46 IST 2022
current time: Wed Feb 23 13:59:46 IST 2022
retry method invoked -> 22 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:46 IST 2022
2022-02-23 13:59:47.466 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 3 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:47 IST 2022
DLT Received: prachi from topic_string_data-dlt offset 15 -> 23 times
current time dlt: Wed Feb 23 13:59:47 IST 2022
current time: Wed Feb 23 13:59:47 IST 2022
retry method invoked -> 24 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:47 IST 2022
2022-02-23 13:59:47.981 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 4 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:47 IST 2022
DLT Received: prachisharma from topic_string_data-dlt offset 16 -> 25 times
current time dlt: Wed Feb 23 13:59:47 IST 2022
current time: Wed Feb 23 13:59:47 IST 2022
retry method invoked -> 26 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:47 IST 2022
2022-02-23 13:59:48.493 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 5 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:48 IST 2022
DLT Received: hie from topic_string_data-dlt offset 17 -> 27 times
current time dlt: Wed Feb 23 13:59:48 IST 2022
current time: Wed Feb 23 13:59:48 IST 2022
retry method invoked -> 28 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:48 IST 2022
2022-02-23 13:59:49.011 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 6 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:49 IST 2022
DLT Received: hieeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee from topic_string_data-dlt offset 18 -> 29 times
current time dlt: Wed Feb 23 13:59:49 IST 2022
current time: Wed Feb 23 13:59:49 IST 2022
retry method invoked -> 30 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:49 IST 2022
2022-02-23 13:59:49.527 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 7 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:49 IST 2022
DLT Received: hie from topic_string_data-dlt offset 19 -> 31 times
current time dlt: Wed Feb 23 13:59:49 IST 2022
current time: Wed Feb 23 13:59:49 IST 2022
retry method invoked -> 32 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:49 IST 2022
current time dlt: Wed Feb 23 13:59:50 IST 2022
DLT Received: hi from topic_string_data-dlt offset 20 -> 33 times
current time dlt: Wed Feb 23 13:59:50 IST 2022
2022-02-23 13:59:50.039 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 8 for partition topic_string_data-retry-2-0
current time: Wed Feb 23 13:59:50 IST 2022
retry method invoked -> 34 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:50 IST 2022
2022-02-23 13:59:50.545 INFO 96675 --- [5-retry-2-0-C-1] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-grp_STRING-1, groupId=grp_STRING] Seeking to offset 9 for partition topic_string_data-retry-2-0
current time dlt: Wed Feb 23 13:59:50 IST 2022
DLT Received: hi from topic_string_data-dlt offset 21 -> 35 times
current time dlt: Wed Feb 23 13:59:50 IST 2022
current time: Wed Feb 23 13:59:50 IST 2022
retry method invoked -> 36 times from topic: topic_string_data-retry-2
current time: Wed Feb 23 13:59:50 IST 2022
current time dlt: Wed Feb 23 13:59:51 IST 2022
DLT Received: hi from topic_string_data-dlt offset 22 -> 37 times
current time dlt: Wed Feb 23 13:59:51 IST 2022

It seems there are two separate problems.
One is that you seem to already have records in the topics, and if you have it configured to earliest the app will read all those records when it starts up. You can either set ConsumerConfig.AUTO_OFFSET_RESET_CONFIG to latest, or, if you're running locally on docker, you can stop the Kafka container and prune the volumes with something like docker system prune --volumes (note that this will erase data from all your stopped containers - use wisely).
Can you try one of these and test again?
The other problem is that the framework is wrongly setting the default maxDelay of 30s even though the annotation states the default is to ignore. I'll open an issue for that and add the link here.
For now You can set a maxDelay such as #Backoff(delay = 600000, multiplier = 3.0, maxDelay = 5400000), then the application should have the correct delays for 10, 30 and 90 minutes as you wanted.
Let me know if that works out for you, or if you have any other problems related to this issue.
EDIT: Issue opened, you can follow the development there https://github.com/spring-projects/spring-kafka/issues/2137
It should be fixed in the next release.
EDIT 2: Actually the phrasing in the #BackOff annotation is rather ambiguous, but seems like the behavior is correct and you should explicitly set a larger maxDelay.
The documentation should clarify this behavior in the next release.
EDIT 3: To answer your question in the comments, the way retryable topics work is the partition is paused for the duration of the delay, but the consumer keeps polling the broker, so longer delays don't trigger rebalancing.
From your logs the rebalancing is from the main topic's partitions, so it's unlikely it has anything to do with this feature.
EDIT 4: The Retryable Topics feature was released in Spring for Apache Kafka 2.7.0, which uses kafka-clients 2.7.0. However, there have been several improvements to the feature, so I recommend using the latest Spring Kafka version (currently 2.8.3) if possible to benefit from those.

kubedns can't nslookup any service

version 1.6.1.I setup kubernetes with kubeadm.DNS cant resolve kubernetes service within heapster. But using ip of kubernetes is OK. Below is the container log of kubedns
[root#node-47 ~]# docker logs 624905719fa2
I0512 04:29:57.374905 1 dns.go:49] version: v1.5.2-beta.0+$Format:%h$
I0512 04:29:57.380148 1 server.go:70] Using configuration read from directory: /kube-dns-config%!(EXTRA time.Duration=10s)
I0512 04:29:57.380219 1 server.go:112] FLAG: --alsologtostderr="false"
I0512 04:29:57.380231 1 server.go:112] FLAG: --config-dir="/kube-dns-config"
I0512 04:29:57.380238 1 server.go:112] FLAG: --config-map=""
I0512 04:29:57.380242 1 server.go:112] FLAG: --config-map-namespace="kube-system"
I0512 04:29:57.380245 1 server.go:112] FLAG: --config-period="10s"
I0512 04:29:57.380268 1 server.go:112] FLAG: --dns-bind-address="0.0.0.0"
I0512 04:29:57.380272 1 server.go:112] FLAG: --dns-port="10053"
I0512 04:29:57.380278 1 server.go:112] FLAG: --domain="cluster.local."
I0512 04:29:57.380283 1 server.go:112] FLAG: --federations=""
I0512 04:29:57.380288 1 server.go:112] FLAG: --healthz-port="8081"
I0512 04:29:57.380292 1 server.go:112] FLAG: --initial-sync-timeout="1m0s"
I0512 04:29:57.380295 1 server.go:112] FLAG: --kube-master-url=""
I0512 04:29:57.380300 1 server.go:112] FLAG: --kubecfg-file=""
I0512 04:29:57.380303 1 server.go:112] FLAG: --log-backtrace-at=":0"
I0512 04:29:57.380309 1 server.go:112] FLAG: --log-dir=""
I0512 04:29:57.380315 1 server.go:112] FLAG: --log-flush-frequency="5s"
I0512 04:29:57.380318 1 server.go:112] FLAG: --logtostderr="true"
I0512 04:29:57.380323 1 server.go:112] FLAG: --nameservers=""
I0512 04:29:57.380326 1 server.go:112] FLAG: --stderrthreshold="2"
I0512 04:29:57.380330 1 server.go:112] FLAG: --v="2"
I0512 04:29:57.380334 1 server.go:112] FLAG: --version="false"
I0512 04:29:57.380340 1 server.go:112] FLAG: --vmodule=""
I0512 04:29:57.380386 1 server.go:175] Starting SkyDNS server (0.0.0.0:10053)
I0512 04:29:57.380816 1 server.go:197] Skydns metrics enabled (/metrics:10055)
I0512 04:29:57.380828 1 dns.go:147] Starting endpointsController
I0512 04:29:57.380832 1 dns.go:150] Starting serviceController
I0512 04:29:57.381406 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0512 04:29:57.381417 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0512 04:29:57.439494 1 dns.go:264] New service: monitoring-influxdb
I0512 04:29:57.439579 1 dns.go:462] Added SRV record &{Host:monitoring-influxdb.kube-system.svc.cluster.local. Port:8086 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0512 04:29:57.439601 1 dns.go:462] Added SRV record &{Host:monitoring-influxdb.kube-system.svc.cluster.local. Port:8083 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0512 04:29:57.439623 1 dns.go:264] New service: dai-1-daiym-context-46
I0512 04:29:57.439642 1 dns.go:264] New service: kubernetes
I0512 04:29:57.439722 1 dns.go:462] Added SRV record &{Host:kubernetes.default.svc.cluster.local. Port:443 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0512 04:29:57.439746 1 dns.go:264] New service: nginxservice
I0512 04:29:57.439766 1 dns.go:264] New service: heapster
I0512 04:29:57.439784 1 dns.go:264] New service: kube-dns
I0512 04:29:57.439803 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0512 04:29:57.439818 1 dns.go:462] Added SRV record &{Host:kube-dns.kube-system.svc.cluster.local. Port:53 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0512 04:29:57.439834 1 dns.go:264] New service: kubernetes-dashboard
I0512 04:29:57.439851 1 dns.go:264] New service: monitoring-grafana
I0512 04:29:57.882534 1 dns.go:171] Initialized services and endpoints from apiserver
I0512 04:29:57.882995 1 server.go:128] Setting up Healthz Handler (/readiness)
I0512 04:29:57.883004 1 server.go:133] Setting up cache handler (/cache)
I0512 04:29:57.883009 1 server.go:119] Status HTTP port 8081
But I can't resolve any service within kubedns.
[root#node-47 ~]# docker exec -it 624905719fa2 sh
/ # nslookup nginxservice localhost
Server: 127.0.0.1
Address 1: 127.0.0.1 localhost
nslookup: can't resolve 'nginxservice': Try again
/ # nslookup heapster localhost
Server: 127.0.0.1
Address 1: 127.0.0.1 localhost
nslookup: can't resolve 'heapster': Try again

Please use full DNS name of service try again if service do not have the same namespace with DNS? Check https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/ to get full dns name.

Mongodb replica set polluted logs and arbiter in "initial startup"

I'm running a replica set on with MongoBD v.2.0.3, here is the latest status:
+-----------------------------------------------------------------------------------------------------------------------+
| Member |id|Up| cctime |Last heartbeat|Votes|Priority| State | Messages | optime |skew|
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27002 |0 |1 |13 hrs |2 secs ago |1 |1 |PRIMARY | |4f619079:2|1 |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27003 |1 |1 |12 hrs |1 sec ago |1 |1 |SECONDARY | |4f619079:2|1 |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27001 |2 |1 |2.5e+02 hrs|1 sec ago |1 |0 |SECONDARY (hidden)| |4f619079:2|-1 |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27000 (me)|3 |1 |2.5e+02 hrs| |1 |1 |ARBITER |initial startup|0:0 | |
|--------------------+--+--+-----------+--------------+-----+--------+------------------+---------------+----------+----|
|127.0.1.1:27004 |4 |1 |9.5 hrs |2 secs ago |1 |1 |SECONDARY | |4f619079:2|-1 |
+-----------------------------------------------------------------------------------------------------------------------+
I'm puzzled by the following:
1) The arbiter always report the same message "initial startup" and "optime" of 0:0.
What does the "initial startup" mean, and is it normal for this message not to change?
Why the "optime" is always "0:0"?
2) What information does the skew column convey?
I've set up my replicas according to MongoDB's documentation and data seems to replicate across the set nicely, so no problem with that.
Another thing is that logs across all MongoDB hosts are polluted with such entries:
Thu Mar 15 03:25:29 [initandlisten] connection accepted from 127.0.0.1:38599 #112781
Thu Mar 15 03:25:29 [conn112781] authenticate: { authenticate: 1, nonce: "99e2a4a5124541b9", user: "__system", key: "417d42d26643b2c2d014b89900700263" }
Thu Mar 15 03:25:32 [clientcursormon] mem (MB) res:12 virt:244 mapped:32
Thu Mar 15 03:25:34 [conn112779] end connection 127.0.0.1:38586
Thu Mar 15 03:25:34 [initandlisten] connection accepted from 127.0.0.1:38602 #112782
Thu Mar 15 03:25:34 [conn112782] authenticate: { authenticate: 1, nonce: "a021e521ac9e19bc", user: "__system", key: "14507310174c89cdab3b82decb52b47c" }
Thu Mar 15 03:25:36 [conn112778] end connection 127.0.0.1:38585
Thu Mar 15 03:25:36 [initandlisten] connection accepted from 127.0.0.1:38604 #112783
Thu Mar 15 03:25:37 [conn112783] authenticate: { authenticate: 1, nonce: "58bcf511e040b760", user: "__system", key: "24c5b20886f6d390d1ea8ea1c61fd109" }
Thu Mar 15 03:26:00 [conn112781] end connection 127.0.0.1:38599
Thu Mar 15 03:26:00 [initandlisten] connection accepted from 127.0.0.1:38615 #112784
Thu Mar 15 03:26:00 [conn112784] authenticate: { authenticate: 1, nonce: "8a8f24fe012a03fe", user: "__system", key: "9b0be0c7fc790021b25aeb4511d85848" }
Thu Mar 15 03:26:01 [conn112780] end connection 127.0.0.1:38598
Thu Mar 15 03:26:01 [initandlisten] connection accepted from 127.0.0.1:38616 #112785
Thu Mar 15 03:26:01 [conn112785] authenticate: { authenticate: 1, nonce: "420808aa9a12947", user: "__system", key: "90e8654a2eb3981219c370208989e97a" }
Thu Mar 15 03:26:04 [conn112782] end connection 127.0.0.1:38602
Thu Mar 15 03:26:04 [initandlisten] connection accepted from 127.0.0.1:38617 #112786
Thu Mar 15 03:26:04 [conn112786] authenticate: { authenticate: 1, nonce: "b46ac4868db60973", user: "__system", key: "43cda53cc503bce942040ba8d3c6c3b1" }
Thu Mar 15 03:26:09 [conn112783] end connection 127.0.0.1:38604
Thu Mar 15 03:26:09 [initandlisten] connection accepted from 127.0.0.1:38621 #112787
Thu Mar 15 03:26:10 [conn112787] authenticate: { authenticate: 1, nonce: "20fae7ed47cd1780", user: "__system", key: "f7b81c2d53ad48343e917e2db9125470" }
Thu Mar 15 03:26:30 [conn112784] end connection 127.0.0.1:38615
Thu Mar 15 03:26:30 [initandlisten] connection accepted from 127.0.0.1:38632 #112788
Thu Mar 15 03:26:31 [conn112788] authenticate: { authenticate: 1, nonce: "38ee5b7b665d26be", user: "__system", key: "49c1f9f4e3b5cf2bf05bfcbb939ee422" }
Thu Mar 15 03:26:33 [conn112785] end connection 127.0.0.1:38616
It seems like many connections are established and dropped. Is that the replica set heartbeat?
Additional information
Arbiter config
dbpath=/var/lib/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
port = 27000
bind_ip = 127.0.1.1
rest = true
journal = true
replSet = myreplname
keyFile = /etc/mongodb/set.key
oplogSize = 8
quiet = true
Replica set member config
dbpath=/root/local/var/mongodb
logpath=/root/local/var/log/mongodb.log
logappend=true
port = 27002
bind_ip = 127.0.1.1
rest = true
journal = true
replSet = myreplname
keyFile = /root/local/etc/set.key
quiet = true
MongoDB instances are running on different machines and connect to each other over SSH tunnels setup in fully connected mesh.

The arbiter doesn't do anything besides participate in elections, so it has no further operations after startup. "Skew" is clock skew in seconds between this member and the others in the set. Yes, the connect / disconnect messages are heartbeats.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse