I have planned to used Amazon MSK and i want to dump consumer logs to S3 . But i don't see any options. Do i need to write my own consumer or is there a way to consume Amazon MSK consumer output to s3 directly ?
Kafka Connect is generally the best (easiest/scalable/portable/resilient) way to get data between Kafka and systems down (and up) stream such as S3. Learn more about Kafka Connect here and in this talk here.
MSK Connect can run Kafka Connect workloads for your MSK on AWS.
Another option you have is to run your own Kafka Connect worker (which connects to MSK) and use the S3 sink connector (tutorial).
There is not a direct way to do it from MSK. You can use an external consumer to do it or preferably use KafkaConnect in an EC2 within the same VPC as MSK.
Either way you need to consider for high availability and data transfer costs. For HA, use consumers in different AZs. For costs, use MSK 2.4.1 that allows consumers to fetch data from the closest replica.
Related
I am going to implement a Snowflake Kafka connector with the continuous ingestion of data to target database snowflake.
What are the best practices for :
Kafka for its clusters
Kafka and its related parameters
Monitoring resources
Kafka for its clusters
Run at least 3 brokers
Kafka and its related parameters
That's too broad and has nothing to do with running a Connect cluster or implementing one. The defaults are mostly fine. You can find the production recommendations in the Kafka documentation.
Monitoring resources
Use JMX. https://docs.confluent.io/platform/current/kafka/monitoring.html
going to implement a Snowflake Kafka connector
Snowflake already has a connector... I'd start by forking rather than making your own
My pipeline is:
Kerberized Kafka --> Logstash (hosted on a different server) --> Splunk.
Can I replace the Logstash component with Kafka Connect?
Could you point me to a resource/guide where I can use kerberized Kafka as a source for my Kafka connect (which is hosted separately)?
From the documentation, what I understood is that if Kafka Connect is hosted on the same cluster as that of Kafka, that's quite possible. But I don't have that option right now, as our Kafka cluster is multi-tenant and hence not approved for additional processes on the cluster.
Kerberos keytabs aren't commonly machine/JVM specific, so yes, Kafka Connect should be able to be configured very similarly to Logstash since both are JVM processes using native Kafka protocol.
You shouldn't run Connect on the brokers anyway
If you can't add Kafka Connect to an existing Kafka cluster, you will have to spin up a separate Kafka Connect (Cluster or standalone).
I've written about it here: enter link description here
I am using Kafka MSK in AWS. So we don't have native kafka connect with all required connectors like on confluent.
Actually I work with kakfa mongo connector and I want to find a way to push the kafka mongo connector jar to an on an instance of kafka MSK cluster.
The path to which the jar will be pushed is the plugins.path as defined in the properties of the used connector.
ANy way to make it please ?
MSK doesn't give you a hosted Kafka Connect worker. You'd need to provision and run this yourself, e.g. on EC2. This work would then connect to your Kafka cluster (MSK in this case)
To be clear: MSK is only the hosted Kafka brokers (and Zookeeper). It does not include Kafka Connect, which is what you need in order to run connectors.
How do I use Kafka Connect adapters with Amazon MSK?
As per the AWS documentation, it supports Kafka connect but not documented about how to setup adapters and use it.
Edit Oct 2021: MSK Connect has been launched, see https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/
AFAIK Amazon MSK does not provide managed connectors, so you have to run them yourself. This is done by running the Kafka Connect worker process (a JVM) and then providing it one or more connector configurations to run.
From the point of view of a Kafka Connect worker it just needs a Kafka cluster to connect to; it shouldn't matter whether it's MSK or on-premises, since it's ultimately 'just' a consumer/producer underneath.
You can see more, including a live demo, here: https://rmoff.dev/bbuzz19-kafka-connect
For an example of configuring Kafka Connect to use a cloud-hosted Kafka platform (in this case, Confluent Cloud), see this article.
If you are interested in managed connectors in the Cloud, check out the connectors that are provided in Confluent Cloud.
Disclaimer: I work for Confluent :)
AWS now supports MSK Connect, a new feature of MSK service based on Kafka Connect allowing you to deploy managed Kafka connectors built for Kafka connect
Check the announcement here: https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/
There are two aspects to this
Kafka Connect is a framework which should be deployed separately from kafka brokers. MSK only provides kafka brokers. If you want to use Kafka Connect with MSK you would need to use EC2 instances and deploy the kafka binaries.Kafka Connect framework is bundled along with kafka
Coming to connectors if you donot have a confluent subscription or similar - I am afraid your choices get very limited. But having said you can always write your own connectors. Writing new connectors is not that difficult rather you can apply your business specific logic and be on your way quite quickly.
I want to know if Kafka and Kafka-connect can run on different servers? So a connector would be started on server A and send data from a kafka topic on server B to HDFS or S3 etc. Thanks
Yes, and for Production deployments this is typically recommended for resource reasons. Generally you'd deploy a cluster of Kafka Brokers (3+ for HA), and then a cluster of Kafka Cluster workers (as many as needed for throughput capacity / resilience) -- all on separate nodes.
For more details, see the Confluent Enterprise Reference Architecture.
Yes, you can do it.
I have my set of kafka servers and kafka connect applications are running in different machines and writing data in hdfs. you have to mention list of brokers in bootstrap.servers under worker properties file (config/connect-distributed.properties or config/connect-standalone.properties) instead of localhost:9092