How are TCP connections managed by the kafka-clients library? - scala

I am using the kafka-clients library to integrate Kafka with a Scala application, and I am finding it difficult to understand how and when TCP connections are made between brokers and producers/consumers.
Please verify my understanding of the points below (a minimal setup sketch follows them):
(1) No TCP connection is established on initialisation of a KafkaProducer instance.
val producer = new KafkaProducer[String, String](properties)
This also holds true for KafkaConsumer.
val consumer = new KafkaConsumer[String, String](properties)
(2) The first TCP connection (between Broker and Producer) is established on producing a record to the Broker.
producer.send(record1)
(3) Subsequent send() calls from the same Producer to the same Broker will share the same TCP connection, irrespective of the Topic.
producer.send(record2)
(4) In the case of the Consumer, the first TCP connection is established on polling a Topic (not on Subscription).
val records = consumer.poll(timeout)
(5) Subsequent calls to poll by the same Consumer to the same Broker share the same connection.
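For reference, here is a minimal, self-contained version of the snippets above. The broker addresses, topic name and group id are placeholders I made up, not values from the original setup:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Producer side -- broker addresses and topic name are placeholders
val producerProps = new Properties()
producerProps.put("bootstrap.servers", "broker1:9092,broker2:9092")
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](producerProps)
val record1  = new ProducerRecord[String, String]("topic-a", "key1", "value1")
val record2  = new ProducerRecord[String, String]("topic-a", "key2", "value2")
producer.send(record1)   // point (2)
producer.send(record2)   // point (3)

// Consumer side -- group id is a placeholder
val consumerProps = new Properties()
consumerProps.put("bootstrap.servers", "broker1:9092,broker2:9092")
consumerProps.put("group.id", "example-group")
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](consumerProps)
consumer.subscribe(java.util.Collections.singletonList("topic-a"))   // point (4): no poll yet
val records = consumer.poll(java.time.Duration.ofMillis(1000))       // or poll(1000L) on pre-2.0 clients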

No TCP connection is established on initialisation of a KafkaProducer instance.
Not exactly. KafkaProducer initialisation starts the Sender thread, from within which TCP connections to the bootstrap servers will be established. Those sockets are used to retrieve metadata from the cluster.
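If you want to observe this yourself, one option is to read the client's own connection-count metric right after construction, reusing the producer from point (1). A hedged sketch, assuming a kafka-clients version that exposes Metric.metricValue() (1.0 or later):

import org.apache.kafka.clients.producer.KafkaProducer
import scala.collection.JavaConverters._

// Reads the producer's own "connection-count" metric (group "producer-metrics").
def connectionCount(producer: KafkaProducer[String, String]): Double =
  producer.metrics().asScala
    .collectFirst {
      case (name, metric) if name.name == "connection-count" =>
        metric.metricValue().asInstanceOf[Double]
    }
    .getOrElse(0.0)

// Right after `new KafkaProducer(...)` this is often still 0 for a moment and
// then climbs as the background Sender thread opens the bootstrap connections
// to fetch metadata -- before any send() has happened.
println(connectionCount(producer))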
The first TCP connection (between Broker and Producer) is established on producing a record to the Broker.
Almost correct. Actually, the client always creates multiple TCP connections to the brokers; this is true even when you have a single broker. For the producer, it often creates two connections: one for updating metadata and the other for sending messages. For the consumer (assuming you are using a consumer group), it seems to create three connections: one for finding the coordinator; one for group management (including join/sync group and offset commits/fetches); and the last for fetching messages.
UPDATE: the consumer creates 3 connections instead of the 4 I previously claimed. Thanks @ppatierno for the reminder.
Subsequent send() calls from the same Producer to the same Broker will share the same TCP connection, irrespective of the Topic.
Subsequent send calls reuse the second connection the producer creates.
In the case of the Consumer, the first TCP connection is established on polling a Topic (not on Subscription).
Yes, all connections are created in the poll call.
Subsequent calls to poll by the same Consumer to the same Broker share the same connection.
Subsequent calls to poll reuse the last connection the consumer creates.
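To make the consumer side concrete, here is a hedged sketch of where those connections come into being during the first poll; the broker address, group id and topic are placeholders, and the comments reflect the answer above rather than a guaranteed contract:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")   // placeholder address
props.put("group.id", "example-group")           // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("topic-a"))
// subscribe() only records the subscription locally; no TCP connection yet.

val records = consumer.poll(java.time.Duration.ofMillis(1000))   // or poll(1000L) on pre-2.0 clients
// During this first poll() the consumer (per the answer above) connects to a
// bootstrap broker for metadata, asks it for the group coordinator, connects
// to the coordinator for join/sync/offset traffic, and connects to the
// partition leaders to fetch records.

records.asScala.foreach { r =>
  println(s"${r.topic}-${r.partition}@${r.offset}: ${r.value}")
}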

Subsequent send() calls from the same Producer to the same Broker will share the same TCP connection, irrespective of the Topic.
Just to add (to the great answer by @amethystic) that if the producer tries to send to a new topic and the broker to which it's connected isn't the leader, the producer needs to fetch metadata about that topic and open a new connection to the broker which is the leader for that topic. So saying "share the same TCP connection irrespective of the Topic" is not completely correct.
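In other words, something like the following can end up using more than one connection. The topic names are made up, and whether a second connection is actually opened depends entirely on which brokers lead the partitions involved:

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Topic names are placeholders; whether the second send opens a new connection
// depends on which broker leads each topic's partitions.
def sendToTwoTopics(producer: KafkaProducer[String, String]): Unit = {
  producer.send(new ProducerRecord[String, String]("topic-a", "k", "v"))  // goes to topic-a's partition leader
  producer.send(new ProducerRecord[String, String]("topic-b", "k", "v"))  // may trigger a metadata fetch and a
                                                                          // connection to a different leader broker
}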

Related

Do we need a Kafka connection pool on the client side?

I have a REST web service with traffic of about 1 million requests per day. In that REST service, for each request, we send a message to a remote Kafka topic (Confluent Platform). Do I have to set up a Kafka connection pool to improve performance?
No, you don't need Kafka connection pooling. The Kafka client keeps its connections to the Kafka cluster open and manages them itself. As long as you have enough partitions configured for the Kafka topic, it should be alright.
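In practice this usually means creating one long-lived KafkaProducer and sharing it across all request handlers (the producer is documented as thread-safe). A hedged Scala sketch; the broker address and topic name are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A single, long-lived producer shared by all request handlers, instead of a pool.
object KafkaSink {
  private val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")   // placeholder address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  // KafkaProducer is thread-safe, so one instance can serve every request thread.
  private val producer = new KafkaProducer[String, String](props)

  def publish(key: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String]("requests-topic", key, value))  // placeholder topic

  def close(): Unit = producer.close()
}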

Does a producer make two different connections to the broker and the Schema Registry? If yes, is it an HTTP connection?

I'm using spring-kafka 2.2.7.RELEASE and have a producer and a consumer. My cluster has ZooKeeper nodes, brokers and a schema registry to handle Avro schema validation. So, in my producer configuration, I pass in both the broker URLs and the Schema Registry URL. Now I have a couple of questions:
When publishing/producing a message, does the producer make two different connections, to the broker and to the schema registry, or just one connection to the broker, with the broker then communicating with the schema registry?
If it opens only one connection, how long would the connection stay open? Can the producer use the same connection to produce multiple messages, or should it open multiple connections to produce multiple messages?
If there is a connection open, does it use the HTTP/HTTPS protocol to communicate?
The Schema Registry connection has nothing to do with the Kafka protocol; it is a separate HTTP connection made directly from the client.
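For illustration, here is a hedged sketch of what such a producer configuration typically looks like with the Confluent Avro serializer; the URLs are placeholders, and the property names assume the Confluent serializer is on the classpath:

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")                     // Kafka wire protocol over TCP (placeholder)
props.put("schema.registry.url", "http://schema-registry:8081")    // used by the Avro serializer over HTTP(S) (placeholder)
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")

// Records flow to the broker over Kafka's TCP connections; the serializer
// registers/looks up schemas against the registry over its own HTTP connection.
val producer = new KafkaProducer[String, AnyRef](props)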

Does the Kafka client connect to ZooKeeper, or is it behind the scenes?

Kafka client code refers directly to the broker IP and port. If that broker is down, will ZooKeeper direct the client to another broker? Is ZooKeeper always working behind the scenes?
If you provide only one broker address in the client code, that broker goes down, and your client restarts, then your client will also be down. ZooKeeper will not be used here because the broker will not be reachable.
If you give more than one broker address in the client, then it's more resilient, in that the Kafka Controller process periodically fetches a list of all alive brokers in the cluster from ZooKeeper and is responsible for sending that information back to the clients via the leaders of the partitions they get assigned. ZooKeeper is indirectly used here, but it does not communicate with any external clients.
If I understood the question correctly, the answer is no.
The Kafka clients need a connection only to the Kafka brokers, and ZooKeeper isn't involved at all. Clients need to write to/read from the leader partitions on the brokers.
If the Kafka brokers set in the brokers list aren't available, the clients cannot connect and cannot start to send/receive messages.
Only in the old 0.8.0 version was ZooKeeper involved, for consumers which saved offsets in ZooKeeper. Starting from 0.9.0, consumers save offsets in a Kafka topic, so ZooKeeper isn't needed anymore.
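So a client configuration only ever names brokers; a hedged sketch with made-up addresses:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

// Broker addresses are placeholders. Note there is no zookeeper.connect-style
// property here: modern clients are configured with broker addresses only.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092")
props.put("group.id", "example-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

// Listing several brokers only matters for the initial bootstrap; after that
// the client discovers the live brokers from the cluster metadata itself.
val consumer = new KafkaConsumer[String, String](props)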

Apache Kafka Producer Broker Connection

I have a set of Kafka broker instances running as a cluster. I have a client that is producing data to Kafka:
props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092");
When we monitor using tcpdump, I can see that only the connections to broker1 and broker2 are ESTABLISHED, while for broker3 there is no connection from my producer. I have a single topic with just one partition.
My questions:
What is the relation between the number of brokers and topic partitions? Should I always have number of brokers = number of partitions?
Why, in my case, am I not able to connect to broker3? Or at least, why does my network monitoring not show a connection from my producer established with broker3?
It would be great if I could get some deeper insight into how the connection to the brokers work from a Producer stand point.
Obviously, your producer does not need to connect to broker3 :)
I'll try to explain what happens when you are producing data to Kafka:
You spin up some brokers, let's say 3, then create a topic foo with 2 partitions and replication factor 2. Quite a simple example, yet it could be a real case for someone.
You create a producer with metadata.broker.list (or bootstrap.servers in the new producer) configured with these brokers. Worth mentioning: you don't necessarily have to specify all the brokers in your cluster; in fact, you can specify only one of them and it will still work. I'll explain this in a bit too.
You send a message to topic foo using your producer.
The producer looks up its local metadata cache to see which brokers are leaders for each partition of topic foo and how many partitions your foo topic has. As this is the first send from the producer, the local cache contains nothing (the sketch after this answer shows one way to inspect this metadata).
The producer sends a TopicMetadataRequest to each broker in metadata.broker.list sequentially until the first successful response. That's why I mentioned that one broker in that list would work, as long as it's alive.
The returned TopicMetadataResponse will contain the information about the requested topics (in your case foo) and the brokers in the cluster. Basically, this response contains the following:
list of brokers in the cluster, where each broker has an ID, host and port. This list may not contain the entire list of brokers in the cluster, but should contain at least the list of brokers that are responsible for servicing the subject topic.
list of topic metadata, where each entry has topic name, number of partitions, leader broker ID for each partition and ISR broker IDs for each partition.
Based on TopicMetadataResponse your producer builds up its local cache and now knows exactly that the request for topic foo partition 0 should go to broker X.
Based on the number of partitions in the topic, the producer partitions your message and accumulates it, with the knowledge that it should be sent as part of a batch to some broker.
When the batch is full or linger.ms timeout passes, your producer flushes the batch to the broker. By "flushes" I mean "opens a new connection to a broker or reuses an existing one, and sends the ProduceRequest".
The producer does not need to open unnecessary connections to all brokers, as the topic you are producing to may not be serviced by some brokers, and your cluster could be quite large. Imagine a 1000-broker cluster with lots of topics, where one of the topics has just one partition - you only need that one connection, not 1000.
In your particular case I'm not 100% sure why you have 2 open connections to brokers if you have just a single partition, but I assume one connection was opened during metadata discovery and was cached for reuse, and the second one is the actual broker connection used to produce data. However, I might be wrong in this case.
But anyway, there is no need at all to have a connection for the third broker.
Regarding your question about "Should I always have number of brokers = number of partitions?", the answer is most likely no. If you explain what you are trying to achieve, maybe I'll be able to point you in the right direction, but this is too broad to explain in general. I recommend reading this to clarify things.
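If you want to see the metadata described above for yourself, partitionsFor() forces a metadata fetch for a topic and returns what the producer learns. A hedged sketch; the broker address and topic name are placeholders:

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import scala.collection.JavaConverters._

// Broker address and topic name are placeholders.
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// partitionsFor() forces a metadata fetch for the topic and returns what the
// producer caches: one PartitionInfo per partition, with leader and replicas.
producer.partitionsFor("foo").asScala.foreach { p =>
  println(s"partition ${p.partition} -> leader ${p.leader}, replicas ${p.replicas.mkString(",")}")
}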
Update, to answer the question in the comments:
Metadata cache is updated in 2 cases:
If the producer fails to communicate with a broker for any reason - this includes the case where the broker is not reachable at all and the case where the broker responds with an error (like "I'm not the leader for this partition anymore, go away")
If no failures happen, the client still refreshes metadata every metadata.max.age.ms (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/CommonClientConfigs.java#L42-L43) to discover new brokers and partitions itself.
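For completeness, here is a hedged sketch of the producer settings touched on in this answer; the broker address is a placeholder and the values are illustrative (some are simply the defaults):

import java.util.Properties

val producerProps = new Properties()
producerProps.put("bootstrap.servers", "broker1:9092")   // one reachable broker is enough for the initial bootstrap
producerProps.put("linger.ms", "5")                      // how long a batch may wait before it is flushed
producerProps.put("batch.size", "16384")                 // maximum batch size in bytes per partition (default)
producerProps.put("metadata.max.age.ms", "300000")       // periodic metadata refresh interval (default, 5 minutes)
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")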

Scala Akka TCP Actors

I have a question about the Akka 2.4 TCP API.
I am running a server and have two TCP servers in Akka TCP: one for incoming clients and one for my server's worker nodes (which are on other computers/IPs). I currently have one connection to a client and one connection to a worker node.
When receiving a message from a client, I want to pass some of that information on to the worker node, but the Akka TCP actor representing the worker-node connection doesn't seem to react when I send it messages from the actor handling the client connection.
So, as an example, if the client sends a message to delete a file, and the partitions of that file are on a worker node, I want to send a TCP message to that worker node telling it to delete those partitions.
How can I, from the client actor, send a message to the worker-node actor that it should then pass on to the worker-node server through TCP? When just doing the regular workerActorRef ! msg, it doesn't receive it at all and no logging is shown.
I hope this question isn't unclear, but essentially I want workerActorRef to have some functionality along the lines of "send this through the TCP socket".
Cheers,
Johan
Have you looked at Akka Remoting at all? If used properly it should be able to achieve what you want. You might want to look into Clustering too as it's built on top of Remoting.
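If you stay with plain Akka IO instead of Remoting, the usual pattern is to put the Tcp connection actor behind a small proxy that turns domain messages into Tcp.Write frames, so other actors can just do workerActorRef ! msg. A hedged sketch; the DeletePartitions message and the wire format are made up for illustration:

import akka.actor.{Actor, ActorRef, Props}
import akka.io.Tcp
import akka.util.ByteString

// Hypothetical domain message -- not from the question.
final case class DeletePartitions(file: String)

// Wraps the Tcp connection actor for one worker node. The client actor sends
// plain domain messages here; this proxy serialises them onto the socket.
class WorkerConnectionProxy(connection: ActorRef) extends Actor {
  import Tcp._

  // The proxy must be registered as the connection's handler elsewhere:
  //   connection ! Register(self)
  def receive: Receive = {
    case DeletePartitions(file) =>
      connection ! Write(ByteString(s"DELETE $file\n"))   // made-up wire format
    case Received(data) =>
      println(s"worker replied: ${data.utf8String}")      // handle replies from the worker node
    case CommandFailed(_: Write) =>
      context.stop(self)                                  // write failed; real code would retry or back off
    case _: ConnectionClosed =>
      context.stop(self)
  }
}

object WorkerConnectionProxy {
  def props(connection: ActorRef): Props = Props(new WorkerConnectionProxy(connection))
}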