Java Producer for AWS MSK - apache-kafka

As part of my existing application (which runs on the public network), I want to send messages to an AWS MSK Kafka cluster. Since my Kafka cluster sits inside a VPC, I can't connect to it directly; that much I understand. In my research I came across a pattern: running Confluent's kafka-rest on an EC2 instance in a public subnet of the same VPC. But then my application has to make an HTTP call for every message it sends to the Kafka cluster, which adds extra latency.
I'm trying to figure out a way to hold a persistent connection and send messages to the MSK cluster directly.
Any Thoughts Please!!
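
Assuming the network path to the brokers is solved first (e.g. VPC peering, a VPN/Direct Connect link, or an otherwise reachable TLS listener), a plain Java producer already gives you the persistent connection described above: it holds long-lived TCP connections to the brokers instead of making one HTTP call per message. A minimal sketch, where the bootstrap string and topic name are hypothetical placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MskProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hypothetical MSK TLS bootstrap string from the MSK console
            props.put("bootstrap.servers",
                    "b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094");
            props.put("security.protocol", "SSL");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The producer keeps long-lived TCP connections to the brokers,
                // so there is no per-message HTTP round trip.
                producer.send(new ProducerRecord<>("my-topic", "hello from outside the VPC"));
                producer.flush();
            }
        }
    }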

Related

Exposing Kafka brokers using Google click to deploy environment

I have used the Kafka cluster (with replication) click-to-deploy container from Google on Kubernetes. How do I expose the brokers so I can consume from an external consumer? I'm very new to Kubernetes.
I have tried exposing the broker nodes with a load balancer, using the external IP it was given, and I opened the port with a firewall rule.
But when connecting from my consumer, it throws an error about being disconnected.
Any help would be great, I can provide more info if asked.
You cannot use a load balancer. Kafka clients must talk directly to the brokers.
The firewall is a good step, but you need to ensure each of the brokers is exposed properly via advertised.listeners, both within and outside of the VPC.
Alternatively, within GKE, you can run the Strimzi operator, which manages the Kubernetes resources for the Kafka cluster for you.
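
If it's unclear which addresses the brokers are advertising, a small AdminClient sketch can print exactly what external clients will be told to connect to (the bootstrap address below is a hypothetical placeholder):

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class AdvertisedEndpoints {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Hypothetical external address of any one broker
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "203.0.113.10:9094");
            try (AdminClient admin = AdminClient.create(props)) {
                // These are the advertised.listeners addresses; every one of
                // them must be reachable from the consumer's network.
                admin.describeCluster().nodes().get()
                        .forEach(n -> System.out.println(n.host() + ":" + n.port()));
            }
        }
    }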

How to expand confluent cloud kafka cluster?

I have set up a Confluent Cloud multi-zone cluster and it got created with just one bootstrap server. There was no setting for choosing the number of servers while creating the cluster. Even after creation, I can't edit the number of bootstrap servers.
I want to know how to increase the number of servers in a Confluent Cloud Kafka cluster.
Under the hood, the Confluent Cloud cluster is already running multiple brokers. Depending on your cluster configuration (specifically, whether you're running Standard or Dedicated, and what region and cloud you're in), the cluster will have between six and several dozen brokers.
The way a Kafka client bootstrap server config works is that the client reaches out to the bootstrap server and requests a list of all brokers, and then uses those broker endpoints to actually produce/consume from Kafka (reference: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-bootstrap-servers.html)
In Confluent Cloud, the provided bootstrap server is actually a load balancer in front of all of the brokers; when the client connects to the bootstrap server it'll receive the actual endpoints for all of the actual brokers, and then use that for subsequent connections.
So TL;DR, in your client, you only need to specify the one bootstrap server; under the hood, the Kafka client will connect to the (many) brokers running in Confluent Cloud, and it should all just work.
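For example, a minimal producer pointed at that single bootstrap endpoint; the endpoint and credentials below are hypothetical placeholders, and Confluent Cloud clusters use SASL_SSL with an API key and secret:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CloudProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // The single bootstrap endpoint from the Confluent Cloud UI (hypothetical)
            props.put("bootstrap.servers", "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092");
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "PLAIN");
            props.put("sasl.jaas.config",
                    "org.apache.kafka.common.security.plain.PlainLoginModule required "
                    + "username=\"<API_KEY>\" password=\"<API_SECRET>\";");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Metadata for every broker behind the bootstrap load balancer
                // is fetched automatically on first use.
                producer.send(new ProducerRecord<>("test-topic", "key", "value"));
            }
        }
    }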
Source: I work at Confluent.

Can Kafka Connect consume data from a separate kerberized Kafka instance and then route to Splunk?

My pipeline is:
Kerberized Kafka --> Logstash (hosted on a different server) --> Splunk.
Can I replace the Logstash component with Kafka Connect?
Could you point me to a resource/guide where I can use kerberized Kafka as a source for my Kafka connect (which is hosted separately)?
From the documentation, what I understood is that this works when Kafka Connect is hosted on the same cluster as Kafka itself. But I don't have that option right now, as our Kafka cluster is multi-tenant and hence not approved for running additional processes.
Kerberos keytabs aren't typically machine- or JVM-specific, so yes, Kafka Connect can be configured very similarly to Logstash, since both are JVM processes speaking the native Kafka protocol.
You shouldn't run Connect on the brokers anyway.
If you can't add Kafka Connect to the existing Kafka cluster's machines, you will have to spin up a separate Kafka Connect deployment (distributed or standalone).
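As a sketch of what "configured very similarly" means in practice: the Kerberos settings are ordinary Kafka client properties, so the same keys work for any JVM client, including Connect (in a distributed Connect worker they also go in the worker properties, prefixed with consumer. / producer. for the connectors' clients). The broker address, keytab path, and principal below are hypothetical:

    import java.util.Properties;

    public class KerberosClientConfig {
        // Security settings shared by any JVM Kafka client
        public static Properties kerberosProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kerberized-broker:9093"); // hypothetical
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "GSSAPI");
            props.put("sasl.kerberos.service.name", "kafka");
            props.put("sasl.jaas.config",
                    "com.sun.security.auth.module.Krb5LoginModule required "
                    + "useKeyTab=true storeKey=true "
                    + "keyTab=\"/etc/security/keytabs/connect.keytab\" " // hypothetical path
                    + "principal=\"connect@EXAMPLE.COM\";");             // hypothetical principal
            return props;
        }
    }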

Is there a way to dump Amazon MSK Topic to S3 directly?

I have planned to use Amazon MSK and I want to dump the topic data to S3, but I don't see any option for it. Do I need to write my own consumer, or is there a way to send Amazon MSK topic data to S3 directly?
Kafka Connect is generally the best (easiest/scalable/portable/resilient) way to get data between Kafka and systems downstream (and upstream) such as S3.
MSK Connect can run Kafka Connect workloads for your MSK on AWS.
Another option you have is to run your own Kafka Connect worker (which connects to MSK) and use the S3 sink connector.
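As an illustration, once a Connect worker is up, the S3 sink is created by POSTing JSON to the worker's REST API. A sketch using Java's built-in HTTP client, where the worker address, connector name, topic, bucket, and region are all hypothetical, and Confluent's S3 sink connector is assumed to be installed on the worker:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CreateS3Sink {
        public static void main(String[] args) throws Exception {
            // Hypothetical connector definition for Confluent's S3 sink
            String connectorJson = """
                {
                  "name": "msk-to-s3",
                  "config": {
                    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
                    "topics": "my-topic",
                    "s3.bucket.name": "my-bucket",
                    "s3.region": "us-east-1",
                    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
                    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
                    "flush.size": "1000",
                    "tasks.max": "1"
                  }
                }
                """;
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://connect-worker:8083/connectors")) // hypothetical
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                    .build();
            // The worker distributes the sink tasks across the Connect cluster
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }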
There is not a direct way to do it from MSK. You can use an external consumer, or preferably run Kafka Connect on an EC2 instance within the same VPC as MSK.
Either way, you need to consider high availability and data transfer costs. For HA, run consumers in different AZs. For costs, use Kafka 2.4.1 on MSK, which allows consumers to fetch data from the closest replica (see the sketch below).
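Closest-replica fetching (KIP-392, added in Kafka 2.4) is opt-in on the consumer side. A sketch of the relevant setting, where the AZ ID is a hypothetical example and the brokers are assumed to have matching broker.rack values and a rack-aware replica selector configured:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class RackAwareConsumerProps {
        public static Properties withClientRack(Properties props) {
            // AZ ID of this consumer (hypothetical); must match the brokers'
            // broker.rack values for closest-replica fetching to kick in
            props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "use1-az1");
            return props;
        }
    }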

Apache Kafka consumer groups and microservices running on Kubernetes, are they compatible?

So far, I have been using Spring Boot apps (with Spring Cloud Stream) and Kafka running without any supporting infrastructure (PaaS).
Since our corporate platform runs on Kubernetes, we need to move those Spring Boot apps into K8s so they can scale and so on. Obviously there will be more than one instance of every application, so we will define a consumer group per application to ensure that each message is delivered and processed only once.
Kafka will be running outside Kubernetes.
Now my doubt is: since the apps deployed on K8s are accessed through a K8s Service that abstracts the underlying pods, and individual application pods can't be accessed directly from outside the K8s cluster, Kafka won't know how to reach individual instances of the consumer group to deliver the messages, will it?
How can I make them work together?
Kafka brokers do not push data to clients; rather, clients poll() and pull data from the brokers. As long as the consumers can connect to the bootstrap servers, and the Kafka brokers advertise an IP and port that the clients can connect to and poll(), it will all work fine.
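To make the pull model concrete, here is a minimal consumer loop; the bootstrap address, group id, and topic are hypothetical. Nothing in it requires the pod to be reachable from outside the cluster, because every connection is opened outbound by the consumer:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PollLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka.example.com:9092"); // hypothetical advertised address
            props.put("group.id", "my-app"); // one consumer group per application
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders")); // hypothetical topic
                while (true) {
                    // The consumer opens outbound connections and pulls; the broker
                    // never dials into the pod, so no K8s Service/Ingress is needed.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> System.out.println(r.value()));
                }
            }
        }
    }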
Can Spring Cloud Data Flow solve your requirement to control the number of instances deployed?
Also, there is a community-released Spring Cloud Data Flow server for OpenShift:
https://github.com/donovanmuller/spring-cloud-dataflow-server-openshift