I think the word 'log' is used in more than one way when it comes to Kafka. I'm talking about log output that ends up in stdout, your-app.log, or Splunk/Datadog/etc.
Every 30 seconds, something happens 3 times, and each time it happens approximately 65 log events appear. I'm wondering:
What is this something?
Can I cause all of its output to appear on a single line? (My log 'provider' charges per log event, and each line counts as a separate event.)
The logs are like this:
INFO - Kafka version: ...
INFO - Kafka commitId: ...
INFO - Kafka startTimeMs: ...
INFO - App info kafka.admin.client for adminclient-...
INFO - Metrics scheduler closed
INFO - Closing reporter org.apache.kafka.common.metrics.JmxReporter
INFO - Metrics reporters closed
INFO - AdminClientConfig values:
bootstrap.servers = [...
foo = ...
bar = ...
baz = ...
qux = ...
Each line is an SLF4J event. If you want to change the format, whether from your client or on the broker, you'll need to modify your logging framework's configuration. On the broker, that's the log4j.properties file.
Not all of the output can be made to appear on a single line; each INFO line, for example, is an individual event. You can, however, reduce the number of events by raising the log level for (or disabling) the Java packages that print them, for example as sketched below.
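A minimal sketch of the kind of entries you could add to a log4j.properties file (log4j 1.x syntax, as used by the broker); the logger names are assumptions based on the classes that typically emit the lines in your excerpt, so verify them against your own output and adapt them to whatever logging framework your client uses:
# assumed logger names -- check them against your own log output
log4j.logger.org.apache.kafka.clients.admin.AdminClientConfig=WARN
log4j.logger.org.apache.kafka.common.utils.AppInfoParser=WARN
log4j.logger.org.apache.kafka.common.metrics.Metrics=WARN
# or quieten the whole client library at once
log4j.logger.org.apache.kafka=WARN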
The alternative is to install another log forwarder on your systems, such as Fluentd, and parse/filter/forward the data with that.
I am attempting to use MirrorMaker 2 to replicate data between AWS Managed Kafka (MSK) clusters in 2 different AWS regions - one in eu-west-1 (CLOUD_EU) and the other in us-west-2 (CLOUD_NA), both running Kafka 2.6.1. For testing I am currently trying to replicate topics one way only, from EU -> NA.
I am starting a MirrorMaker Connect cluster using ./bin/connect-mirror-maker.sh and a properties file (included below).
This works fine for topics with small messages on them, but one of my topics has binary messages up to 20MB in size. When I try to replicate that topic I get an error every 30 seconds:
[2022-04-21 13:47:05,268] INFO [Consumer clientId=consumer-29, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: {}. (org.apache.kafka.clients.FetchSessionHandler:481)
org.apache.kafka.common.errors.DisconnectException
When logging in DEBUG to get more information we get
[2022-04-21 13:47:05,267] DEBUG [Consumer clientId=consumer-29, groupId=null] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient:784)
[2022-04-21 13:47:05,268] DEBUG [Consumer clientId=consumer-29, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=consumer-29, correlationId=35) due to node 2 being disconnected (org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:593)
It gets stuck in a loop, constantly disconnecting with a request timeout every 30s and then retrying.
Looking at this, I suspect the problem is that request.timeout.ms is at its default (30s) and the fetch times out trying to read the topic with many large messages.
I followed the guide at https://github.com/apache/kafka/tree/trunk/connect/mirror to try to configure the consumer properties. However, no matter what I set, the consumer timeout remains fixed at the default, confirmed both by the config that Kafka prints in the log and by timing the interval between the disconnect messages. For example, I set:
CLOUD_EU.consumer.request.timeout.ms=120000
in the properties file that I start MM-2 with.
Based on various guides I have found while looking into this, I have also tried:
CLOUD_EU.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.override.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.override.request.timeout.ms=120000
None of these have worked.
How can I change the consumer request.timeout.ms setting? The log is approximately 10,000 lines long, but everywhere the ConsumerConfig is logged it shows request.timeout.ms = 30000.
Properties file I am using:
# specify any number of cluster aliases
clusters = CLOUD_EU, CLOUD_NA
# connection information for each cluster
CLOUD_EU.bootstrap.servers = kafka.eu-west-1.amazonaws.com:9092
CLOUD_NA.bootstrap.servers = kafka.us-west-2.amazonaws.com:9092
# enable and configure individual replication flows
CLOUD_EU->CLOUD_NA.enabled = true
CLOUD_EU->CLOUD_NA.topics = METRICS_ATTACHMENTS_OVERSIZE_EU
CLOUD_NA->CLOUD_EU.enabled = false
replication.factor=3
tasks.max = 1
############################# Internal Topic Settings #############################
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
############################ Kafka Settings ###################################
# CLOUD_EU cluster overrides
CLOUD_EU.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.session.timeout.ms=150000
We are using the ELK stack (7.10.2) in Kubernetes (1.21.5). Some time ago our service provider Gardener changed the OS version on the nodes (318.9.0 -> 576.1.0), and our troubles with the logging stack started.
It seems that Kafka (v2.8.1, 2 pods) does not stream data to Logstash (7.10.2, 2 pods) continuously, but sends it in chunks every few moments. In fact, in Kibana we do not see log records being added continually; we see a bunch of new records every few moments. Under high load (e.g. when debugging some component in the k8s cluster), this delay grows to minutes.
We discovered that the metric "delayed fetch in purgatory" jumps in a very similar pattern (see screenshot), like a saw. When I downgrade the OS version on the nodes from the current one (576.2.0, orange) to the previous one (318.9.0, blue), the problem disappears. As you might expect, we can't stay on the old OS version for much longer.
I asked Gardener staff for assistance, but without a root cause they are not able to help us. We did not change any settings or component versions, just the OS version on the nodes.
From Logstash's debug log I can see that Logstash is continuously connecting to and disconnecting from Kafka:
[2022-01-17T08:53:33,232][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=elk-logstash-indexer-6c84d6bf8c-58gnz-containers-10, groupId=containers] Attempt to heartbeat failed since group is rebalancing
[2022-01-17T08:53:30,501][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator] [Consumer clientId=elk-logstash-indexer-6c84d6bf8c-ct29t-containers-49, groupId=containers] Discovered group coordinator elk-kafka-0.kafka.logging.svc.cluster.local:9092 (id: 2147483647 rack: null)
[2022-01-17T08:53:30,001][INFO ][org.apache.kafka.common.utils.AppInfoParser] Kafka startTimeMs: 1642409610000
These lines keep repeating in a loop.
I can see a similar situation on Kafka:
[2022-01-20 11:55:04,241] DEBUG [broker-0-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,241] DEBUG [broker-0-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,342] DEBUG [broker-0-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,342] DEBUG [broker-0-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
[2022-01-20 11:55:04,365] DEBUG Accepted connection from /10.250.1.127:53678 on /100.96.30.21:9092 and assigned it to processor 1, sendBufferSize [actual|requested]: [102400|102400] recvBufferSize [actual|requested]: [102400|102400] (kafka.network.Acceptor)
[2022-01-20 11:55:04,365] DEBUG Processor 1 listening to new connection from /10.250.1.127:53678 (kafka.network.Processor)
[2022-01-20 11:55:04,368] DEBUG [SocketServer listenerType=ZK_BROKER, nodeId=0] Connection with /10.250.1.127 disconnected (org.apache.kafka.common.network.Selector)
I attempted:
doubling the resources for Kafka and Logstash (no change)
changing the container engine from Docker to containerd (the problem was worse with containerd, ~400 -> ~1000)
changing the Logstash parameters for the Kafka plugin (no change)
comparing kernel settings (5.4.0 -> 5.10.0, I did not spot any interesting changes)
temporarily disabling Karydia for Kafka, Logstash and ZooKeeper (no change)
temporarily upgrading Logstash (7.10.2 -> 7.12.0) without success; all tested versions show the same bad behaviour, and moving to a higher version is currently not possible without changing the versions of other ELK components
Unfortunately, I am not a Kafka expert, so I am not sure whether the connecting/disconnecting is the result of some non-optimal settings on our side, or whether the communication is being interfered with by something unknown to us.
I would like to ask the community for help with this problem. Any suggestions on how to continue the investigation are very welcome too.
I have a simple test setup, and during performance tests I've noticed that NiFi reads records from Kafka in an awkward way: it does not preserve order. I have a ConsumeKafka_2_0 processor connected to LogMessage, as in the screenshot below.
Both processors are configured to use only one concurrent task, but LogMessage shows messages like these:
2021-12-30 17:10:44,612 INFO nifi-0 [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.LogMessage LogMessage[id=0c0931ad-017e-1000-ebfe-53193c54a8b8] caa6dcba-8d44-44be-b7c8-e9ea95481a1c - - record read from kafka - 71b72dcb-8e31-488c-9fa4-bbeda6494014 - kafka timestamp: 1640884241030 - kafkadebug - offset: 10057 partition: 8
2021-12-30 17:10:47,132 INFO nifi-0 [Timer-Driven Process Thread-4] o.a.nifi.processors.standard.LogMessage LogMessage[id=0c0931ad-017e-1000-ebfe-53193c54a8b8] 90c3cc6f-1142-4aa7-953e-8fa8810877c2 - - record read from kafka - 71b72dcb-8e31-488c-9fa4-bbeda6494014 - kafka timestamp: 1640884239426 - kafkadebug - offset: 9985 partition: 8
As you can see, the second offset (9985) is far before the first one (10057).
Does someone know why this happens?
The ConsumeKafka_2_0 configuration looks as follows:
I was experimenting with the Max Poll Records setting, but this does not fix the issue.
Kafka provides ordering only within a partition, according to this Cloudera blog post.
Try selecting the FirstInFirstOutPrioritizer on the queue between ConsumeKafka and LogMessage; without a prioritizer configured, NiFi does not guarantee the order in which FlowFiles are pulled from a connection.
I run MirrorMaker2 with the high-level driver, as documented here, with ./bin/connect-mirror-maker.sh mm2.properties running in 3 pods in a k8s deployment.
The mm2.properties file looks like this:
clusters = source, dest
source.bootstrap.servers = ***:9092
dest.bootstrap.servers = ***:9092
source->dest.enabled = true
dest->source.enabled = false
source->dest.topics = event\.PROD\.some_id.*
replication.factor=3
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
sync.topic.acls.enabled = false
This works fine, with all topics matching the event\.PROD\.some_id.* regex being replicated.
Now that I need to add another topic to the whitelist, I expected to be able to simply scale everything down, update the regex, and scale everything up again.
When I update the whitelist regex to source->dest.topics = event\.PROD\.(some_id|another_id).*, the topics matching "another_id" are created in the dest cluster, but no data is replicated, and MirrorMaker seems to be stuck just committing offsets:
[2020-05-28 20:33:19,496] INFO WorkerSourceTask{id=MirrorHeartbeatConnector-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:424)
[2020-05-28 20:33:19,496] INFO WorkerSourceTask{id=MirrorHeartbeatConnector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:441)
[2020-05-28 20:33:19,499] INFO WorkerSourceTask{id=MirrorHeartbeatConnector-0} Finished commitOffsets successfully in 3 ms (org.apache.kafka.connect.runtime.WorkerSourceTask:523)
Is this a limitation of the high level driver, or am I doing something wrong? From my understanding, being able to dynamically add topics to the whitelist was one of the motivations for MM2.
I am playing with mmv2 as well. Can you try setting these configurations? I had to enable the sync.topic.configs.enabled parameter so my mmv2 would detect the new topics and their data.
refresh.topics.enabled = true
sync.topic.configs.enabled = true
refresh.topics.interval.seconds = 10
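If you would rather scope these to a single flow than set them globally, the per-flow prefix used elsewhere in your mm2.properties should, as far as I can tell, work here too; this is only a sketch, assuming the same source->dest prefix convention as your existing flow settings:
# assumed flow-scoped variants of the settings above
source->dest.refresh.topics.enabled = true
source->dest.refresh.topics.interval.seconds = 10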
PS: I am adding my reply as an answer because I wanted to paste some configs.
Our Flink application has a Kafka datasource.
Application is run with 32 parallelism.
When I look at the logs, I see a lot of statements about FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:47,753 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-81, groupId=sampleGroup]
Node 26 was unable to process the fetch request with (sessionId=439766827, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
2020-05-04 11:04:48,230 INFO org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-78, groupId=sampleGroup]
Node 28 was unable to process the fetch request with (sessionId=281654250, epoch=42): FETCH_SESSION_ID_NOT_FOUND.
What do these log statements mean?
What are the possible negative effects?
Note: I have no experience with Apache Kafka.
Thanks.
This can happen for a few reasons but the most common one is the FetchSession cache being full on the brokers.
By default, brokers cache up to 1000 FetchSessions (configured via max.incremental.fetch.session.cache.slots). When this fills up, brokers can evict cache entries. If your client's cache entry is evicted, it will receive the FETCH_SESSION_ID_NOT_FOUND error.
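If evictions are the cause, one broker-side knob you could experiment with is the cache size itself; the value below is only an illustration, and sensible sizing depends on how many consumers fetch from each broker concurrently:
# server.properties (broker); illustrative value, the default is 1000
max.incremental.fetch.session.cache.slots=2000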
This error is not fatal and consumers should send a new full FetchRequest automatically and keep working.
You can check the size of the FetchSession cache using the kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions metric.