AWS MSK - Internal Brokers communication - sockets

I am using AWS MSK for our production workload, and we have been noticing some unclear log messages in CloudWatch. The messages concern the internal communication between brokers (more on the cluster setup later):
[2022-05-14 06:50:17,171] INFO [SocketServer brokerId=2] Failed authentication with ec2-18-185-175-128.eu-central-1.compute.amazonaws.com/18.185.175.128 ([97fe8ff0-ee38-46c5-ae21-1545fd571224]: Access denied) (org.apache.kafka.common.network.Selector)
Our logs are cluttered with these recurring messages. The logs can be found on all three brokers, all referencing the brokerId=2, as per the message above.
I am assuming the instance referenced is one of the MSK brokers.
Although the logs are at INFO level and the cluster seems to work fine, I'd like to know whether anyone has seen these messages before.
The MSK config is the following:
3 brokers over 3 availability zones
encryption in transit: client_broker = TLS, encryption in cluster
client_authentication: SASL/IAM
cluster properties: auto.create.topics.enable = true, default.replication.factor = 3, num.partitions = 3, delete.topic.enable = true, min.insync.replicas = 2, log.retention.hours = 168, compression.type = gzip
kafka version: 2.7.0
I would like to know how to get rid of this log message and whether it is a cause for concern.
Thanks,
Alessio
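
One way to test the assumption that the rejected peer is one of the brokers is to tally the remote addresses in these log lines and compare them with the broker addresses. The following is a minimal sketch, assuming the log lines keep exactly the shape of the sample quoted above; the function name count_failed_auth and the regular expression are illustrative, not part of any Kafka or AWS tooling.

```python
import re
from collections import Counter

# Pattern derived from the single sample line in the question; this is an
# assumption and may need adjusting for other log layouts.
FAILED_AUTH = re.compile(
    r"\[SocketServer brokerId=(?P<broker>\d+)\] Failed authentication with "
    r"(?P<host>[^/\s]*)/(?P<ip>[\d.]+)"
)

def count_failed_auth(lines):
    """Count failed-authentication events per (reporting broker, remote IP)."""
    counts = Counter()
    for line in lines:
        match = FAILED_AUTH.search(line)
        if match:
            counts[(int(match.group("broker")), match.group("ip"))] += 1
    return counts

sample = (
    "[2022-05-14 06:50:17,171] INFO [SocketServer brokerId=2] Failed "
    "authentication with ec2-18-185-175-128.eu-central-1.compute.amazonaws.com"
    "/18.185.175.128 ([97fe8ff0-ee38-46c5-ae21-1545fd571224]: Access denied) "
    "(org.apache.kafka.common.network.Selector)"
)
result = count_failed_auth([sample])
print(result)
```

If the remote IPs that show up do not belong to any broker, the messages are more likely to come from some other client that fails authentication than from inter-broker traffic.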

Related

How to set consumer config values for Kafka Mirrormaker-2 2.6.1?

I am attempting to use mirrormaker 2 to replicate data between AWS Managed Kafkas (MSK) in 2 different AWS regions - one in eu-west-1 (CLOUD_EU) and the other in us-west-2 (CLOUD_NA), both running Kafka 2.6.1. For testing I am currently trying just to replicate topics 1 way, from EU -> NA.
I am starting a mirrormaker connect cluster using ./bin/connect-mirror-maker.sh and a properties file (included)
This works fine for topics with small messages on them, but one of my topics has binary messages up to 20MB in size. When I try to replicate that topic I get an error every 30 seconds:
[2022-04-21 13:47:05,268] INFO [Consumer clientId=consumer-29, groupId=null] Error sending fetch request (sessionId=INVALID, epoch=INITIAL) to node 2: {}. (org.apache.kafka.clients.FetchSessionHandler:481)
org.apache.kafka.common.errors.DisconnectException
When logging in DEBUG to get more information we get
[2022-04-21 13:47:05,267] DEBUG [Consumer clientId=consumer-29, groupId=null] Disconnecting from node 2 due to request timeout. (org.apache.kafka.clients.NetworkClient:784)
[2022-04-21 13:47:05,268] DEBUG [Consumer clientId=consumer-29, groupId=null] Cancelled request with header RequestHeader(apiKey=FETCH, apiVersion=11, clientId=consumer-29, correlationId=35) due to node 2 being disconnected (org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient:593)
It gets stuck in a loop constantly disconnecting with request timeout every 30s and then trying again.
Looking at this, I suspect that the problem is the request.timeout.ms is on the default (30s) and it times out trying to read the topic with many large messages.
I followed the guide at https://github.com/apache/kafka/tree/trunk/connect/mirror to configure the consumer properties. However, no matter what I set, the consumer timeout stays at the default, as confirmed both by the config Kafka prints in the log and by timing the interval between the disconnect messages. For example, I set:
CLOUD_EU.consumer.request.timeout.ms=120000
in the properties that I start MM-2 with.
Based on various guides I found while looking into this, I have also tried:
CLOUD_EU.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.override.request.timeout.ms=120000
CLOUD_EU.cluster.consumer.override.request.timeout.ms=120000
None of which have worked.
How can I change the consumer request.timeout setting? The log is approx 10,000 lines long, but everywhere where the ConsumerConfig is logged out it logs request.timeout.ms = 30000
Properties file I am using:
# specify any number of cluster aliases
clusters = CLOUD_EU, CLOUD_NA
# connection information for each cluster
CLOUD_EU.bootstrap.servers = kafka.eu-west-1.amazonaws.com:9092
CLOUD_NA.bootstrap.servers = kafka.us-west-2.amazonaws.com:9092
# enable and configure individual replication flows
CLOUD_EU->CLOUD_NA.enabled = true
CLOUD_EU->CLOUD_NA.topics = METRICS_ATTACHMENTS_OVERSIZE_EU
CLOUD_NA->CLOUD_EU.enabled = false
replication.factor=3
tasks.max = 1
############################# Internal Topic Settings #############################
checkpoints.topic.replication.factor=3
heartbeats.topic.replication.factor=3
offset-syncs.topic.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
config.storage.replication.factor=3
############################ Kafka Settings ###################################
# CLOUD_EU cluster overrides
CLOUD_EU.consumer.request.timeout.ms=120000
CLOUD_EU.consumer.session.timeout.ms=150000
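
To see which timeout each client actually runs with, the values can be pulled out of the config dumps in the log rather than searched for by hand. The following is a minimal sketch, assuming the usual Kafka client dump layout of a "...Config values:" line followed by indented "key = value" lines; the function name logged_values and the sample log lines are made up for illustration.

```python
import re

HEADER = re.compile(r"(\w+Config) values:")
SETTING = re.compile(r"^\s*([\w.]+) = (.*)$")

def logged_values(log_lines, key):
    """Return (config_type, value) pairs for `key` in each logged config dump."""
    found = []
    current = None  # name of the config dump we are currently inside, if any
    for line in log_lines:
        header = HEADER.search(line)
        if header:
            current = header.group(1)
            continue
        setting = SETTING.match(line)
        if setting and current:
            if setting.group(1) == key:
                found.append((current, setting.group(2).strip()))
        elif not setting:
            current = None  # left the indented "key = value" block

    return found

# Illustrative sample only, not taken from the real log.
sample_log = [
    "[2022-04-21 13:46:35,001] INFO ConsumerConfig values: ",
    "\tmax.poll.records = 500",
    "\trequest.timeout.ms = 30000",
    "[2022-04-21 13:46:35,010] INFO ProducerConfig values: ",
    "\trequest.timeout.ms = 120000",
]
values = logged_values(sample_log, "request.timeout.ms")
print(values)
```

Grouping the results by config type makes it easier to tell whether an override reached the consumers, the producers, or neither.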

Problems with Amazon MSK default configuration and publishing with transactions

Recently we started testing our Kafka connectors against MSK, Amazon's managed Kafka service. Publishing records seems to work fine, but not when transactions are enabled.
Our cluster consists of 2 brokers (because we have 2 zones) using the default MSK configuration. We are creating our Java Kafka producer using the following properties:
bootstrap.servers=x.us-east-1.amazonaws.com:9094,y.us-east-1.amazonaws.com:9094
client.id=kafkautil
max.block.ms=5000
request.timeout.ms=5000
security.protocol=SSL
transactional.id=transactions
However, when the producer is started with the transactional.id setting, which enables transactions, the initTransactions() method hangs:
producer = new KafkaProducer<Object, Object>(kafkaProperties);
if (kafkaProperties.containsKey(ProducerConfig.TRANSACTIONAL_ID_CONFIG)) {
    // this hangs
    producer.initTransactions();
}
Looking at the log output we see streams of the following, and it didn't seem like it ever timed out.
TransactionManager - Enqueuing transactional request (type=FindCoordinatorRequest,
coordinatorKey=y, coordinatorType=TRANSACTION)
TransactionManager - Request (type=FindCoordinatorRequest, coordinatorKey=y,
coordinatorType=TRANSACTION) dequeued for sending
NetworkClient - Found least loaded node z:9094 (id: -2 rack: null) connected with no
in-flight requests
Sender - Sending transactional request (type=FindCoordinatorRequest, coordinatorKey=y,
coordinatorType=TRANSACTION) to node z (id: -2 rack: null)
NetworkClient - Sending FIND_COORDINATOR {coordinator_key=y,coordinator_type=1} with
correlation id 424 to node -2
NetworkClient - Completed receive from node -2 for FIND_COORDINATOR with
correlation id 424, received {throttle_time_ms=0,error_code=15,error_message=null,
coordinator={node_id=-1,host=,port=-1}}
TransactionManager LogContext.java:129 - Received transactional response
FindCoordinatorResponse(throttleTimeMs=0, errorMessage='null',
error=COORDINATOR_NOT_AVAILABLE, node=:-1 (id: -1 rack: null)) for request
(type=FindCoordinatorRequest, coordinatorKey=xxx, coordinatorType=TRANSACTION)
As far as I can determine, the broker is available and each of the hosts in the bootstrap.servers property is reachable. If I connect to each of them and publish without transactions, it works.
Any idea what we are missing?
This turned out to be a problem with the default AWS MSK properties and the number of brokers. If you create a Kafka cluster with fewer than 3 brokers, the following settings will need to be adjusted.
The following settings should be set (I think) to the number of brokers:
Property | Kafka Default | AWS Default | Description
default.replication.factor | 1 | 3 | Default replication factors for automatically created topics.
min.insync.replicas | 1 | 2 | Minimum number of replicas that must acknowledge a write for the write to be considered successful
offsets.topic.replication.factor | 3 | 3 | Internal topic that shares offsets on topics.
transaction.state.log.replication.factor | 3 | 3 | Replication factor for the transaction topic.
Here's the Kafka docs on broker properties.
Because we have 2 brokers, we ended up with:
default.replication.factor=2
min.insync.replicas=2
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
This seemed to resolve the issue. IMHO this is a real problem with the AWS MSK and the default configuration. They need to auto-generate the default configuration and tune it depending on the number of brokers in the cluster.
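
The check described in this answer can be sketched as a small helper that flags any of the four settings whose value is larger than the number of brokers. This is an illustration only: the function name settings_exceeding_brokers is hypothetical, and the values are the AWS defaults listed above.

```python
REPLICATION_SETTINGS = (
    "default.replication.factor",
    "min.insync.replicas",
    "offsets.topic.replication.factor",
    "transaction.state.log.replication.factor",
)

def settings_exceeding_brokers(config, broker_count):
    """Return the settings whose value is larger than the number of brokers."""
    return {
        name: int(config[name])
        for name in REPLICATION_SETTINGS
        if name in config and int(config[name]) > broker_count
    }

# AWS defaults as listed in the answer, checked against a 2-broker cluster.
aws_defaults = {
    "default.replication.factor": 3,
    "min.insync.replicas": 2,
    "offsets.topic.replication.factor": 3,
    "transaction.state.log.replication.factor": 3,
}
too_high = settings_exceeding_brokers(aws_defaults, broker_count=2)
print(sorted(too_high))
```

With the adjusted values from the answer (all set to 2) the helper returns an empty result for a 2-broker cluster.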

kafka listening on multiple interfaces

I have a requirement as below:
Kafka needs to listen on multiple interfaces: one external and one internal. All other components within the system will connect to Kafka on the internal interface.
At installation time, the internal IPs on other hosts are not reachable; some configuration is needed to make them reachable, and we do not have control over that. So assume that when Kafka comes up, the internal IPs of the nodes cannot reach each other.
Scenario:
I have two nodes in the cluster:
node1 (External IP: 10.10.10.4, Internal IP: 5.5.5.4)
node2 (External IP: 10.10.10.5, Internal IP: 5.5.5.5)
During installation, 10.10.10.4 can ping 10.10.10.5 and vice versa, but 5.5.5.4 cannot reach 5.5.5.5. That only becomes possible once the Kafka installation is done and someone applies the configuration that makes them reachable, so we cannot make them reachable before installing Kafka.
The requirement is that the Kafka brokers exchange messages on the 10.10.10.x interface, so that the cluster forms, while clients send messages on the 5.5.5.x interface.
What I tried was as below:
listeners=USERS://0.0.0.0:9092,REPLICATION://0.0.0.0:9093
advertised.listeners=USERS://5.5.5.5:9092,REPLICATION://5.5.5.5:9093
Where 5.5.5.5 is the internal ip address.
But with this, when restarting Kafka, I see the logs below:
{"log":"[2020-06-23 19:05:34,923] INFO Creating /brokers/ids/2 (is it secure? false) (kafka.zk.KafkaZkClient)\n","stream":"stdout","time":"2020-06-23T19:05:34.923403973Z"}
{"log":"[2020-06-23 19:05:34,925] INFO Result of znode creation at /brokers/ids/2 is: OK (kafka.zk.KafkaZkClient)\n","stream":"stdout","time":"2020-06-23T19:05:34.925237419Z"}
{"log":"[2020-06-23 19:05:34,926] INFO Registered broker 2 at path /brokers/ids/2 with addresses: ArrayBuffer(EndPoint(5.5.5.5,9092,ListenerName(USERS),PLAINTEXT), EndPoint(5.5.5.5,9093,ListenerName(REPLICATION),PLAINTEXT)) (kafka.zk.KafkaZkClient)\n","stream":"stdout","time":"2020-06-23T19:05:34.926127438Z"}
.....
{"log":"[2020-06-23 19:05:35,078] INFO Kafka version : 1.1.0 (org.apache.kafka.common.utils.AppInfoParser)\n","stream":"stdout","time":"2020-06-23T19:05:35.078444509Z"}
{"log":"[2020-06-23 19:05:35,078] INFO Kafka commitId : fdcf75ea326b8e07 (org.apache.kafka.common.utils.AppInfoParser)\n","stream":"stdout","time":"2020-06-23T19:05:35.078471358Z"}
{"log":"[2020-06-23 19:05:35,079] INFO [KafkaServer id=2] started (kafka.server.KafkaServer)\n","stream":"stdout","time":"2020-06-23T19:05:35.079436798Z"}
{"log":"[2020-06-23 19:05:35,136] ERROR [KafkaApi-2] Number of alive brokers '0' does not meet the required replication factor '2' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)\n","stream":"stdout","time":"2020-06-23T19:05:35.136792119Z"}
After that, this message keeps appearing:
{"log":"[2020-06-23 19:05:35,166] ERROR [KafkaApi-2] Number of alive brokers '0' does not meet the required replication factor '2' for the offsets topic (configured via 'offsets.topic.replication.factor'). This error can be ignored if the cluster is starting up and not all brokers are up yet. (kafka.server.KafkaApis)\n","stream":"stdout","time":"2020-06-23T19:05:35.166895344Z"}
Is there any way we can achieve that?
With regards,
-M-
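
The registration log above shows both listeners advertised on 5.5.5.5, an address the other broker cannot reach at installation time. One direction to try, offered as an untested sketch rather than a confirmed fix, is to advertise the REPLICATION listener on the 10.10.10.x address and point inter-broker traffic at it with inter.broker.listener.name. The settings named are standard Kafka broker properties; the helper listener_properties is hypothetical and only renders the lines for one node.

```python
def listener_properties(external_ip, internal_ip):
    """Render per-broker listener settings: clients use the internal IP,
    inter-broker traffic uses the external IP."""
    return "\n".join([
        "listeners=USERS://0.0.0.0:9092,REPLICATION://0.0.0.0:9093",
        f"advertised.listeners=USERS://{internal_ip}:9092,"
        f"REPLICATION://{external_ip}:9093",
        "listener.security.protocol.map=USERS:PLAINTEXT,REPLICATION:PLAINTEXT",
        "inter.broker.listener.name=REPLICATION",
    ])

# node2 from the scenario above.
node2 = listener_properties(external_ip="10.10.10.5", internal_ip="5.5.5.5")
print(node2)
```

Whether this removes the "Number of alive brokers '0'" error in this particular environment is an assumption that would need to be verified on the cluster.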

Kafka Connect AWS S3 sink connector doesn't read from topic

I have a simple standalone S3 sink connector. Here is the relevant part of worker configuration properties:
plugin.path = <plugins directory>
bootstrap.servers = <List of servers on Amazon MSK>
security.protocol = SSL
...
It works fine when I connect it to a locally running Kafka. However when I connect it to a Kafka broker on AWS (with SSL), it doesn't consume anything. No errors, nothing. As if the topic was empty:
[2020-01-30 10:50:03,597] INFO Started S3 connector task with assigned partitions: [] (io.confluent.connect.s3.S3SinkTask:116)
[2020-01-30 10:50:03,598] INFO WorkerSinkTask{id=xxx} Sink task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:302)
When I enabled DEBUG mode in connect-log4j.properties, I started seeing lots of error messages:
Completed connection to node -2. Fetching API versions. (org.apache.kafka.clients.NetworkClient:914)
Initiating API versions fetch from node -2. (org.apache.kafka.clients.NetworkClient:928)
Connection with YYY disconnected (org.apache.kafka.common.network.Selector:607)
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:119)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:424)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:385)
...
Node -2 disconnected. (org.apache.kafka.clients.NetworkClient:894)
Initialize connection to node XXX (id: -3 rack: null) for sending metadata request (org.apache.kafka.clients.NetworkClient:1125)
Initiating connection to node XXX (id: -3 rack: null) using address XXX (org.apache.kafka.clients.NetworkClient:956)
Am I missing something with SSL configuration? Note that manually created org.apache.kafka.clients.consumer.KafkaConsumers can successfully read from this topic having only set "security.protocol = SSL".
EDIT:
Here are the connector properties:
name = my-connector
connector.class = io.confluent.connect.s3.S3SinkConnector
topics = some_topic
timestamp.extractor = Record
locale = de_DE
timezone = UTC
storage.class = io.confluent.connect.s3.storage.S3Storage
partitioner.class = io.confluent.connect.storage.partitioner.HourlyPartitioner
format.class = io.confluent.connect.s3.format.bytearray.ByteArrayFormat
s3.bucket.name = some-s3-bucket
s3.compression.type = gzip
flush.size = 3
s3.region = eu-central-1
I had a similar problem, which was solved once I specified the security protocol for the consumer in addition to the global one. So just add
consumer.security.protocol = SSL
to the configuration properties.
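
In Kafka Connect, the consumers used by sink tasks take their settings from worker properties with the consumer. prefix, which is why the top-level security.protocol alone was not enough here. The following is a minimal sketch of a check for that situation; the function name missing_consumer_overrides and the choice of key prefixes are assumptions made for illustration.

```python
SECURITY_PREFIXES = ("security.", "ssl.", "sasl.")

def missing_consumer_overrides(worker_props):
    """List worker-level security keys that have no `consumer.`-prefixed twin."""
    return sorted(
        key
        for key in worker_props
        if key.startswith(SECURITY_PREFIXES)
        and f"consumer.{key}" not in worker_props
    )

# Worker properties from the question (placeholders kept as placeholders).
worker = {
    "plugin.path": "<plugins directory>",
    "bootstrap.servers": "<List of servers on Amazon MSK>",
    "security.protocol": "SSL",
}
missing = missing_consumer_overrides(worker)
print(missing)

# After applying the fix from the answer above, nothing is reported.
worker["consumer.security.protocol"] = "SSL"
remaining = missing_consumer_overrides(worker)
print(remaining)
```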

Kafka 0.10 SASL/PLAIN producer timeout

I've got a 3 broker kerberised Kafka 0.10 install running in Cloudera and I'm trying to authenticate with SASL/PLAIN
I'm passing kafka_server_jaas.conf into the JVM on each of the brokers.
KafkaServer {
org.apache.kafka.common.security.plain.PlainLoginModule required
username=admin
password=password1
user_admin=password1
user_remote=password1;
};
My server.properties (or kafka.properties as Cloudera renames it) is set as below;
listeners=SASL_SSL://10.10.3.47:9093 # ip set for each broker
advertised.listeners=SASL_SSL://10.10.3.47:9093 # ip set for each broker
sasl.enabled.mechanisms=GSSAPI,PLAIN
security.inter.broker.protocol=SASL_SSL
sasl.mechanism.inter.broker.protocol=GSSAPI
When Kafka starts up, the inter-broker communication is all fine, but when I try to connect using the console producer I get a Timeout failed to update metadata
bin/kafka-console-producer --broker-list 10.10.3.161:9093 --topic test1 --producer.config client.properties.plain
client.properties.plain is set to
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
Finally, the client-side jaas.conf:
KafkaClient {
org.apache.kafka.common.security.plain.PlainLoginModule required
username="remote"
password="password1";
};
As far as I can tell I've followed all instructions correctly, can anyone see anything wrong?
Update
I've turned the logging on the console consumer up a bit, I'm getting the following error;
[2017-03-02 13:17:50,817] TRACE SSLHandshake NEED_UNWRAP channelId -1, handshakeResult Status = OK HandshakeStatus = FINISHED
bytesConsumed = 101 bytesProduced = 0, appReadBuffer pos 0, netReadBuffer pos 0, netWriteBuffer pos 101 (org.apache.kafka.common.network.SslTransportLayer)
[2017-03-02 13:17:50,817] TRACE SSLHandshake FINISHED channelId -1, appReadBuffer pos 0, netReadBuffer pos 0, netWriteBuffer pos 101 (org.apache.kafka.common.network.SslTransportLayer)
[2017-03-02 13:17:50,817] DEBUG Set SASL client state to RECEIVE_HANDSHAKE_RESPONSE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator)
[2017-03-02 13:17:50,818] DEBUG Set SASL client state to INITIAL (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator)
[2017-03-02 13:17:50,819] DEBUG Set SASL client state to INTERMEDIATE (org.apache.kafka.common.security.authenticator.SaslClientAuthenticator)
[2017-03-02 13:17:50,820] DEBUG Connection with <IPADDESS_REMOVED> disconnected (org.apache.kafka.common.network.Selector)
java.io.EOFException
at org.apache.kafka.common.network.SslTransportLayer.read(SslTransportLayer.java:488)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:81)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.receiveResponseOrToken(SaslClientAuthenticator.java:239)
at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.authenticate(SaslClientAuthenticator.java:182)
at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:64)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:318)
at org.apache.kafka.common.network.Selector.poll(Selector.java:283)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:134)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:183)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:974)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at kafka.consumer.NewShinyConsumer.<init>(BaseConsumer.scala:61)
at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:64)
at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:51)
at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
[2017-03-02 13:17:50,821] DEBUG Node -1 disconnected. (org.apache.kafka.clients.NetworkClient)
I had a similar issue with SASL_PLAINTEXT auth. I was able to connect to the broker (via kafka-python), but any messages I sent from the producer would simply time out.
I ended up advertising both SASL_PLAINTEXT and PLAINTEXT listeners, but only publicly exposed the SASL_PLAINTEXT listener via AWS security groups.
My server_jaas.conf was basically the same.
My server.properties used these settings:
security.inter.broker.protocol=PLAINTEXT
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN
advertised.listeners=SASL_PLAINTEXT://example.com:9095,PLAINTEXT://example.com:9092
listeners = SASL_PLAINTEXT://0.0.0.0:9095,PLAINTEXT://0.0.0.0:9092
I was debugging this with the kafka-python client and my command looked like this (python)
from kafka import KafkaProducer
producer = KafkaProducer(
    bootstrap_servers='example.com:9095',
    security_protocol="SASL_PLAINTEXT",
    sasl_mechanism='PLAIN',
    sasl_plain_username='username',
    sasl_plain_password='password',
)
With this setup I was able to have username/password authentication and also produce and consume messages to the broker without timeouts.
Hope this helps in some way :)
In my case there was no need for adding a plaintext listener or for advertising the listener. Instead, the issue was in my kafka_server_jaas.conf. Setting the username property to the name used by the client to log in solved the issue for me.
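
With PlainLoginModule, the broker accepts the logins defined by its user_<name> options, while username and password are the credentials the broker itself presents on inter-broker connections. A quick way to check that the client's login is actually defined on the server side is to extract those entries from the JAAS section. The following is a minimal sketch; the function name accepted_users and the regular expression are illustrative, and the section text is the one from the question.

```python
import re

# Matches user_<name>=<password>, with or without double quotes around the value.
USER_ENTRY = re.compile(r'\buser_(\w+)\s*=\s*"?([^";\s]+)"?')

def accepted_users(jaas_text):
    """Map each user_<name> entry in a JAAS section to its password."""
    return dict(USER_ENTRY.findall(jaas_text))

server_section = '''
KafkaServer {
org.apache.kafka.common.security.plain.PlainLoginModule required
username=admin
password=password1
user_admin=password1
user_remote=password1;
};
'''
users = accepted_users(server_section)
# The client in the question logs in as "remote" with "password1".
client_ok = users.get("remote") == "password1"
print(users, client_ok)
```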