I want to create and deploy a Kafka cluster for a data pipeline.
What is the preferred way to deploy it in the cloud, VMs or Kubernetes?
Kafka can run either way. However, if this is the question you are asking, you might want to question whether you really want to manage your own Kafka cluster in the cloud at all. Why not use an existing Kafka-as-a-service offering?
I am planning to run Kafka on GCP (Google Cloud Platform).
What I am wondering is what happens to the data in a Kafka topic when a pod fails. By default a new pod will be created, but will the data in the Kafka topic be lost? How can I avoid data loss in this situation?
I appreciate any help. Thanks in advance :)
Kafka itself needs persistent storage, so you will probably want a cloud-native storage solution. At a high level: create a StorageClass defining your storage requirements (replication factor, snapshot policy, performance profile), then deploy Kafka as a StatefulSet on Kubernetes, so each broker keeps a stable identity and its own PersistentVolumeClaim across pod restarts.
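As a minimal sketch of that idea, assuming a StorageClass named kafka-ssd already exists in the cluster (the image tag, sizes, and names below are illustrative, not a recommendation):

```yaml
# StatefulSet skeleton: volumeClaimTemplates give each broker pod its own
# PersistentVolumeClaim, so the log data survives pod rescheduling.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless   # assumed headless Service giving each broker stable DNS
  replicas: 3
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:7.4.0   # example image only
          ports:
            - containerPort: 9092
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: kafka-ssd   # assumed to exist, see text above
        resources:
          requests:
            storage: 100Gi
```

Combined with a Kafka replication factor greater than one, a failed pod comes back with the same name and the same volume, so the broker rejoins the cluster instead of losing its log.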
I don't understand your purpose exactly, but in this case you cannot guarantee Kafka's data durability when a pod fails or is evicted unless the data directory sits on a persistent volume. Maybe you should try a native VM with Kafka installed and configure it to be fully backed up, so it is restorable at any time when disaster happens.
It depends on what exactly you need. It's quite a general question.
There are some ready-made Kafka deployments if you use the GCP Marketplace.
As you are asking about pods, I guess you want to use Google Kubernetes Engine. On the internet you can find many guides about running Kafka on Kubernetes.
For example, you can refer to Kafka with ZooKeeper on Portworx. One of the steps there provides a StorageClass YAML. In GKE the default StorageClass is set to reclaimPolicy: Delete, but you can create a new StorageClass with reclaimPolicy: Retain, which will keep the disk in GCP after the pod and its claim are deleted.
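A sketch of such a StorageClass, assuming the in-tree Portworx provisioner from that guide (the name and parameter values here are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: px-kafka-sc            # hypothetical name
provisioner: kubernetes.io/portworx-volume
reclaimPolicy: Retain          # keep the underlying disk when the PVC is deleted
parameters:
  repl: "3"                    # Portworx-level replication factor
  priority_io: "high"
```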
In GCP you also have the option to create disk snapshots.
In addition, you can find some best practices for running Kafka on Kubernetes here.
I have recently been reading more about infrastructure as a service (IaaS) and platform as a service (PaaS) and had some questions. I see that when we opt for a PaaS solution, it is generally very easy to create the infrastructure, as the cloud provider handles that for us, and we can even automate the deployment using an infrastructure-as-code tool like Terraform.
But if we use an IaaS solution, or even a local on-premises cluster, it seems we lose a lot of the automation that PaaS allows. So I was curious: are there any good tools out there for automating infrastructure deployment on a local cluster that is not in the cloud?
The best thing I could think of was to run a local Kubernetes cluster and then Dockerize each of the infrastructure components, but this seems difficult, as each node in the cluster will need its own specific configuration files.
From my basic Googling, it seems like there is not a good solution to this.
Edit:
I was not clear enough with my original intentions. I have two problems I am trying to solve.
How do I automate infrastructure deployment locally? For example, suppose I wanted to create a Hadoop HDFS cluster. I would need to configure one node to be the namenode with an accessible IP, and the other nodes to be datanodes that are aware of the namenode's IP. At the moment, I have to do this manually by logging into each node, checking its IP, and then configuring each one. How would I automate this? If I were to use a Kubernetes approach, how do I specify that one of the running pods should be the namenode and the others datanodes? How do I find the pods' IPs and make them aware of the namenode's IP?
The next problem is very similar to the first, but with a slight modification. How would I deploy specific configuration files to each node? For instance, in Kafka the configuration file for one node requires the IPs of the ZooKeeper nodes, as well as the IP it should listen on, and these may be different for every node in the cluster. Is there a good way to make these config files pod-specific, so that I do not have to do bash text processing to insert the correct contents into each pod's config files?
You can use Terraform for all of your on-premises infrastructure automation, and Ansible for configuration management.
Let's say you have three HPE servers. Install K8s or VMware on them using Ansible; then you can treat them as three availability zones in one region, just like AWS. From there you can start deploying Dockerized apps or Helm charts using Terraform.
Summary:
Ansible for installing and configuring K8s.
Terraform for provisioning workloads on K8s.
Helm for installing apps on K8s.
After this you are going to have a basic automated on-premises infrastructure.
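As a rough illustration of the Ansible half, here is a playbook sketch that prepares a group of servers for a kubeadm-based cluster. The inventory group name, package list, and the assumption that the Kubernetes apt repository is already configured on the hosts are all mine, not from the answer above:

```yaml
# playbook.yml -- hypothetical sketch; inventory group "hpe_servers" is assumed,
# as is a pre-configured Kubernetes apt repository on each host.
- name: Prepare on-prem nodes for Kubernetes
  hosts: hpe_servers
  become: true
  tasks:
    - name: Install container runtime and kubeadm packages
      ansible.builtin.apt:
        name:
          - containerd
          - kubeadm
          - kubelet
          - kubectl
        state: present
        update_cache: true

    - name: Ensure kubelet is enabled and running
      ansible.builtin.service:
        name: kubelet
        state: started
        enabled: true
```

Running `ansible-playbook -i inventory playbook.yml` gives you repeatable node setup across all three servers; Terraform and Helm then take over at the cluster level.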
I have a large application with lots of microservices that communicate through Kafka. Right now it's working on GKE.
We are moving Kafka to confluent.io and we were planning to move some of the microservices to Google Cloud Run (fully managed).
BUT... it looks like Google Cloud Run (fully managed) does not support listening for Kafka events, right? Are there any plans to support it? Is there a workaround?
EDIT:
This post, shared by andres-s, shows that you can implement your own Cloud Run service and have it connected to Confluent Kafka, on Anthos.
It would be great to have this option in the fully managed Google Cloud Run service.
But in the meantime, the question would be: is it possible to implement this in a regular GKE cluster (not Anthos)?
Google Cloud has a fully managed Kafka solution through its SaaS partner Confluent, which uses Cloud Run for Anthos (with GKE).
Google Pub/Sub is the GCP alternative to Kafka, but through Confluent you can use Kafka on GCP.
Cloud Run is just Knative Serving. It is stateless and spins up when it receives events. Because of this, it can't really subscribe to a topic and pull events itself.
Knative Eventing is more stateful in nature; it handles the pulls and then triggers the pods running Knative Serving. Ideally the two are used together to give you the full serverless experience.
The good news is that there is a "hack": you can go Kafka to Pub/Sub, then Pub/Sub to Cloud Run. If you are adventurous and don't mind OSS software, there are a number of Knative Eventing tutorials at serverlesseventing.com.
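For the GKE route, a sketch of the Knative Eventing side, assuming the Kafka source add-on (sources.knative.dev) is installed in the cluster; the broker address, topic, and service name are placeholders:

```yaml
# A KafkaSource pulls from the topic and pushes each record over HTTP
# to the sink, which fits Cloud Run's request-driven model.
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: orders-source                         # hypothetical name
spec:
  consumerGroup: orders-consumer
  bootstrapServers:
    - my-bootstrap.confluent.example:9092     # placeholder Confluent endpoint
  topics:
    - orders
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: orders-handler                    # the Knative Service receiving events
```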
We have 12 APIs deployed on a cluster, and we are using Kafka, which is deployed on 3 EC2 instances. Should I move the Kafka servers into K8s too, or should I keep them as they are? Or should I start using AWS MSK?
We are still experimenting, so any suggestions or good documentation would be helpful.
This is opinion-based, so it's probably going to be closed, but check out https://strimzi.io/. It's been working great for us.
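To give a feel for what Strimzi looks like, here is a minimal Kafka custom resource sketch; the cluster name and sizes are illustrative, and the operator fills in the rest (StatefulSets, services, volumes):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false   # keep the PVCs if the cluster object is deleted
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}
```

With `kubectl apply -f` the Strimzi operator handles the deployment details, which sidesteps most of the hand-rolled setup questions above.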
We're trying to set up a Storm cluster on Google Compute Engine but are having a hard time finding resources. Most tutorials only cover deploying single applications to GCE. We've Dockerized the project but don't know how to deploy it to GCP. Any suggestions?
You may try configuring an instance template and creating instances with the COS (Container-Optimized OS) image, which already has Docker installed.
You can find more information about this here.
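On COS you can have each instance start your container at boot via cloud-init user-data; a sketch along those lines, where the image name and unit name are placeholders:

```yaml
#cloud-config
# Attach this as the "user-data" metadata key on the instance template.
write_files:
  - path: /etc/systemd/system/storm.service
    permissions: "0644"
    content: |
      [Unit]
      Description=Run the Dockerized Storm node
      After=docker.service
      Requires=docker.service
      [Service]
      ExecStart=/usr/bin/docker run --rm --name=storm my-registry/storm-node:latest
      ExecStop=/usr/bin/docker stop storm
      Restart=always
runcmd:
  - systemctl daemon-reload
  - systemctl start storm.service
```

Every instance created from the template will then pull and run the container on boot, so a managed instance group gives you a crude self-healing Storm cluster.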
Another option is Kubernetes Engine (GKE), which has more features that give you more control over your workloads, and it also supports autoscaling, auto-upgrades, and node repairs.