How to run Kafka-Connect in Minikube? - apache-kafka

To run the Confluent S3 sink connector consuming a Kafka topic on my local Mac, I did something like the below:
1. Installed the Confluent S3 connector and ran Kafka connect-standalone.sh:
ML-C02Z605SLVDQ:kafka_2.12-2.5.0 e192270$ confluent-hub install confluentinc/kafka-connect-s3:latest --component-dir /usr/local/share/java --worker-configs config/connect-distributed.properties 
ML-C02Z605SLVDQ:kafka_2.12-2.5.0 e192270$ cd kafka_2.12-2.5.0
ML-C02Z605SLVDQ:kafka_2.12-2.5.0 e192270$ bin/connect-standalone.sh config/connect-standalone.properties s3-sink.properties  # s3-sink.properties sets connector.class=io.confluent.connect.s3.S3SinkConnector
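where s3-sink.properties looks roughly like the sketch below (bucket, region, and topic names are placeholders):
```
# rough sketch of s3-sink.properties; bucket, region, and topic are placeholders
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=my-topic
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000
```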
Now, to run the Kafka S3 connector in Minikube, I have installed Kafka Connect (kafka-connect-s3) in Minikube using cp-helm-charts, with the help of this tutorial: Using a connector with Helm-installed Kafka/Confluent.
How do I copy Kafka config and script files into the kafka-connect pod?
Do I need to log in to the kafka-connect pod to run the connect-standalone.sh command?

There is a from-scratch procedure here. The only requirement is Minikube.
The steps you need are the following:
Start Minikube
Deploy a Kafka cluster using the Strimzi Operator
Build your own custom image including required plugins and dependencies
Deploy Kafka Connect cluster in distributed mode using that image
Create a KafkaConnector instance passing a configuration YAML
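For the last step, a minimal KafkaConnector sketch could look like the following, assuming a KafkaConnect cluster named my-connect-cluster whose image already contains the Confluent S3 sink plugin; the apiVersion depends on your Strimzi version, and topic, bucket, and region are placeholders:
```
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: s3-sink-connector
  labels:
    # must match the name of your KafkaConnect resource
    strimzi.io/cluster: my-connect-cluster
spec:
  class: io.confluent.connect.s3.S3SinkConnector
  tasksMax: 1
  config:
    topics: my-topic
    s3.bucket.name: my-bucket
    s3.region: us-east-1
    storage.class: io.confluent.connect.s3.storage.S3Storage
    format.class: io.confluent.connect.s3.format.json.JsonFormat
    flush.size: 1000
```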

How to copy kafka config and script files inside kafka-connect pod
You shouldn't copy anything. Everything is configured via env-vars, and the Helm charts mostly document how those variables work.
The Docker image runs Connect in distributed mode, where connectors are created via the REST API rather than property files, and confluentinc/cp-kafka-connect already contains S3 Connect.
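As a rough illustration of what "configured by env-vars" means for the confluentinc/cp-kafka-connect image (each CONNECT_* variable maps to a key in the worker config; the broker address, topic names, and version tag below are placeholders, whether set via Helm values or plain Docker):
```
# sketch: run the Connect worker, driven entirely by environment variables
docker run -d --name kafka-connect -p 8083:8083 \
  -e CONNECT_BOOTSTRAP_SERVERS=my-kafka:9092 \
  -e CONNECT_REST_ADVERTISED_HOST_NAME=kafka-connect \
  -e CONNECT_GROUP_ID=connect-cluster \
  -e CONNECT_CONFIG_STORAGE_TOPIC=connect-configs \
  -e CONNECT_OFFSET_STORAGE_TOPIC=connect-offsets \
  -e CONNECT_STATUS_STORAGE_TOPIC=connect-status \
  -e CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR=1 \
  -e CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR=1 \
  -e CONNECT_STATUS_STORAGE_REPLICATION_FACTOR=1 \
  -e CONNECT_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
  -e CONNECT_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
  -e CONNECT_PLUGIN_PATH=/usr/share/java,/usr/share/confluent-hub-components \
  confluentinc/cp-kafka-connect:5.5.0
```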

You can also take a look at https://strimzi.io/.
The project is aimed at making the installation and management of a Kafka and Kafka Connect cluster on Kubernetes very easy.

Related

How to create connectors for Kafka-connect on Kubernetes?

I am deploying Kafka Connect on Google Kubernetes Engine (GKE) using the cp-kafka-connect Helm chart in distributed mode.
A working Kafka cluster with broker and ZooKeeper is already running on the same GKE cluster. I understand I can create connectors by sending POST requests to the http://localhost:8083/connectors endpoint once it is available.
However, the Kafka Connect container goes into the RUNNING state and then starts loading the jar files, and until all the jar files are loaded the endpoint mentioned above is unreachable.
I am looking for a way to automate the steps of manually exec-ing into the pod, checking whether the endpoint is ready, and then sending the POST requests. I have a shell script with a bunch of curl -X POST requests to this endpoint to create the connectors, and also config files for these connectors, which work fine in standalone mode (using the Confluent Platform, as shown in this Confluent blog).
Now there are only two ways to create the connector:
Somehow identify when the container is actually ready (when the endpoint has started listening) and then run the shell script containing the curl requests
OR use the configuration files as we do in standalone mode (Example: $ <path/to/CLI>/confluent local load connector_name -- -d /connector-config.json)
Which of the above approaches is better?
Is the second approach (config files) even doable with distributed mode?
If YES: How to do that?
If NO: How to successfully do what is explained in the first approach?
EDIT:
With reference to this GitHub issue (thanks to cricket_007's answer below), I added the following as the container command, and the connectors get created after the endpoint becomes ready:
...
command:
  - /bin/bash
  - -c
  - |
    /etc/confluent/docker/run &
    echo "Waiting for Kafka Connect to start listening on kafka-connect"
    while : ; do
      curl_status=`curl -s -o /dev/null -w %{http_code} http://localhost:8083/connectors`
      echo -e `date` " Kafka Connect listener HTTP state: " $curl_status " (waiting for 200)"
      if [ $curl_status -eq 200 ] ; then
        break
      fi
      sleep 5
    done
    echo -e "\n--\n+> Creating Kafka Connector(s)"
    /tmp/scripts/create-connectors.sh
    sleep infinity
...
/tmp/scripts/create-connectors.sh is an externally mounted script containing a bunch of POST requests (via curl) to the Kafka Connect API.
confluent local doesn't interact with a remote Connect cluster, such as one in Kubernetes.
Please refer to the Kafka Connect REST API.
You'd connect to it like any other RESTful API running in the cluster (via a NodePort, or an Ingress/API gateway, for example).
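For example, a minimal sketch of creating a connector through that REST API once it is reachable; the service name and JSON file name are placeholders:
```
# forward the Connect REST port from the cluster to your machine
kubectl port-forward svc/my-kafka-connect 8083:8083 &

# create a connector from a local JSON file of the form {"name": "...", "config": {...}}
curl -s -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json
```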
the endpoint mentioned above is unreachable.
Localhost is the physical machine you're typing the commands into, not the remote GKE cluster
Somehow identify when the container is actually ready
Kubernetes health checks are responsible for that
kubectl get services
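As an illustration of the health-check idea (this probe is an assumption, not something taken from the chart above), a readinessProbe on the Connect container could gate readiness on the REST port:
```
# mark the pod Ready only once the Connect REST API responds on port 8083
readinessProbe:
  httpGet:
    path: /connectors
    port: 8083
  initialDelaySeconds: 30
  periodSeconds: 10
```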
there are only two ways to create the connector
That's not true. You could additionally run Landoop's Kafka Connect UI or Confluent Control Center in your cluster to point and click.
But if you have local config files, you could also write code to interact with the API.
Or try and see if you can make a PR for this issue
https://github.com/confluentinc/cp-docker-images/issues/467

Unable to read local files in spark kubernetes cluster mode

I am facing an issue while reading a file stored on my system in a Spark cluster-mode program. It gives me a "File not found" error, but the file is present at the defined location. Please suggest some ideas so that I can read a local file in a Spark cluster running on Kubernetes.
You cannot refer to local files on your machine when you submit Spark on Kubernetes.
The available solutions for your case might be:
Use a Resource Staging Server. It is not available in the main branch of the Apache Spark codebase, so the whole integration is on your side.
Put your file in an HTTP/HDFS-accessible location: refer to the docs
Put your file inside the Spark Docker image and refer to it as local:///path/to/your-file.jar (see the sketch after this list)
If you are running a local Kubernetes cluster like Minikube, you can also create a Kubernetes Volume with the files you are interested in and mount it to the Spark pods: refer to the docs. Be sure to mount that volume to both the driver and the executors.
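A minimal sketch of the third option (baking the file into the image), assuming mainline Spark 2.3+ with built-in Kubernetes support; the API server address, image name, main class, and jar path are placeholders:
```
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name my-app \
  --class com.example.Main \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/app/your-file.jar
```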

How to get flink streaming jar to kubernetes

With Maven I am building a fat jar for my streaming app. I have to deploy the jar to a k8s cluster. The enterprise doesn't have an internal Docker hub, so my option is to build the image as part of Jenkins and use it in the k8s job manager config. I would appreciate any example demonstrating the project layout and the steps to deploy.
I used the build.sh script from https://github.com/apache/flink/blob/release-1.7/flink-container/docker/README.md and was able to build a Docker image, and using docker-compose I am able to get the app running. But when trying Kubernetes as specified in https://github.com/apache/flink/blob/release-1.7/flink-container/kubernetes/README.md#deploy-flink-job-cluster I am seeing "image not found".
Kubernetes does not manage images; it relies on Docker for that. You can check the Docker documentation About images, containers, and storage drivers.
In Kubernetes you can use the following registries: Google Container Registry, AWS EC2 Container Registry, Azure Container Registry, IBM Cloud Container Registry, and your own private registry.
You can read the Kubernetes documentation on how to Pull an Image from a Private Registry
You can find many projects helping with the setup of your own private registry.
One of the easiest ones is the project k8s-local-docker-registry by SeldonIO.
Start/stop the private registry in the cluster:
Start the private registry:
./start-docker-private-registry
Stop the private registry:
./stop-docker-private-registry
Check that the registry catalog can be accessed and that an image can be pushed:
(set -x && curl -X GET http://127.0.0.1:5000/v2/_catalog && docker pull busybox && docker tag busybox 127.0.0.1:5000/busybox && docker push 127.0.0.1:5000/busybox)
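From there, a rough sketch of publishing the image built by build.sh to that registry and pointing the Kubernetes manifests at it; the flink-job image name is an assumption, so use whatever name you passed to build.sh:
```
# tag and push the locally built Flink job image into the in-cluster registry
docker tag flink-job:latest 127.0.0.1:5000/flink-job:latest
docker push 127.0.0.1:5000/flink-job:latest

# then set image: 127.0.0.1:5000/flink-job:latest in the job-cluster and
# taskmanager manifests from the flink-container/kubernetes README above
```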

Storage plugin configuration on Zookeeper for Apache Drill + Zookeeper in Kubernetes cluster

I am running Apache Drill and ZooKeeper on a Kubernetes cluster.
Drill connects to ZooKeeper through a zookeeper-service running on port 2181, and I am trying to persist the storage plugin configuration on ZooKeeper. The Apache Drill docs (https://drill.apache.org/docs/persistent-configuration-storage/) say that the sys.store.provider.zk.blobroot key needs to be added to drill-override.conf, but I am not able to figure out a value for this key if I want to connect to the ZooKeeper service in Kubernetes.
The value should be:
<name-of-your-zk-service>.<namespace-where-zk-is-running>.svc.cluster.local:2181
That's how services get resolved internally in Kubernetes. You can always test it by creating a Pod, connecting to it using kubectl exec -it <pod-name> sh, and running:
ping <name-of-your-zk-service>.<namespace-where-zk-is-running>.svc.cluster.local
Hope it helps!
This is an optional config. You can specify it to change where the ZooKeeper PStore provider offloads query profile data [1], or you can remove this property from your drill-override.conf and restart the drillbits.
[1] http://doc.mapr.com/display/MapR/Persistent+Configuration+Storage
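Putting the two answers together, a minimal drill-override.conf sketch might look like the following; the cluster id, service and namespace names, and blobroot path are placeholders, and the blobroot line is optional:
```
drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "<name-of-your-zk-service>.<namespace-where-zk-is-running>.svc.cluster.local:2181",
  # optional: where the ZooKeeper PStore provider offloads query profile data
  sys.store.provider.zk.blobroot: "hdfs://<path-to-a-dfs-directory>"
}
```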

Setting up Spring Cloud Data Flow on Kubernetes

Do I need to install an instance of Spring Cloud Data Flow on the master server myself, or is this getting installed "automatically" as part of the deployment?
This isn't quite clear from the description at
http://docs.spring.io/spring-cloud-dataflow-server-kubernetes/docs/current-SNAPSHOT/reference/htmlsingle/#_deploying_streams_on_kubernetes
I've followed the guide, though I removed every config for MySQL (maybe that is required). I'm somewhat stuck since the service is just not being assigned an external IP, and I do not see why, how to debug it, or whether I missed installing some required component.
Edit:
To clarify, I see an scdf service entry when I run
kubectl get svc
But this service never gets an external IP.
Do I need to install an instance of Spring Cloud Data Flow on the master server myself, or is this getting installed "automatically" as part of the deployment?
The Spring Cloud Data Flow server needs to be set up either outside Kubernetes (configured so that it knows how to connect to the Kubernetes environment), or you can use the Spring Cloud Data Flow server Docker image to run it inside Kubernetes; the latter approach is better.
Step 6 in the link you posted above runs the SCDF docker image inside the kubernetes cluster:
```
Deploy the Spring Cloud Data Flow Server for Kubernetes using the Docker image and the configuration settings you just modified.
$ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml
$ kubectl create -f src/etc/kubernetes/scdf-secrets.yml
$ kubectl create -f src/etc/kubernetes/scdf-service.yml
$ kubectl create -f src/etc/kubernetes/scdf-controller.yml
```
MySQL is required; that's why it's in the steps.
Spring Cloud Data Flow uses an RDBMS instead of Redis for stream/task
definitions, application registration, and for job repositories.
You can also use any of the other supported RDBMSes.
You can install it using Helm Charts.
https://dataflow.spring.io/docs/installation/kubernetes/helm/
First, install Helm.
Then install Spring Cloud Data Flow:
helm install --name my-release stable/spring-cloud-data-flow
This will install and configure the relevant pods, such as spring-cloud-dataflow-server, mysql, skipper, rabbitmq, etc.
You can also customize versions and configurations.
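For example, a minimal sketch of customizing the chart with Helm 2; the release name and values file name are placeholders:
```
# dump the chart's default values, then edit what you need (versions, middleware, etc.)
helm inspect values stable/spring-cloud-data-flow > scdf-values.yml

# install with your overrides
helm install --name my-release -f scdf-values.yml stable/spring-cloud-data-flow
```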