How to create connectors for Kafka Connect on Kubernetes?

I am deploying Kafka Connect on Google Kubernetes Engine (GKE) using the cp-kafka-connect Helm chart in distributed mode.
A working Kafka cluster with a broker and ZooKeeper is already running on the same GKE cluster. I understand I can create connectors by sending POST requests to the http://localhost:8083/connectors endpoint once it is available.
However, the Kafka Connect container goes into the RUNNING state and then starts loading the JAR files; until all the JAR files are loaded, the endpoint mentioned above is unreachable.
I am looking for a way to automate the steps of manually exec-ing into the pod, checking whether the endpoint is ready, and then sending the POST requests. I have a shell script with a bunch of curl -X POST requests to this endpoint to create the connectors, and I also have config files for these connectors which work fine in standalone mode (using the Confluent Platform, as shown in this Confluent blog).
Now there are only two ways to create the connectors:
1. Somehow identify when the container is actually ready (i.e. when the endpoint has started listening) and then run the shell script containing the curl requests, or
2. Use the configuration files as we do in standalone mode (example: $ <path/to/CLI>/confluent local load connector_name -- -d /connector-config.json).
Which of the two approaches is better?
Is the second approach (config files) even doable in distributed mode?
If YES: how do I do that?
If NO: how do I successfully do what is described in the first approach?
EDIT:
With reference to this GitHub issue (thanks to cricket_007's answer below), I added the following as the container command, and the connectors get created once the endpoint is ready:
...
command:
  - /bin/bash
  - -c
  - |
    /etc/confluent/docker/run &
    echo "Waiting for Kafka Connect to start listening on kafka-connect"
    while : ; do
      curl_status=`curl -s -o /dev/null -w %{http_code} http://localhost:8083/connectors`
      echo -e `date` " Kafka Connect listener HTTP state: " $curl_status " (waiting for 200)"
      if [ $curl_status -eq 200 ] ; then
        break
      fi
      sleep 5
    done
    echo -e "\n--\n+> Creating Kafka Connector(s)"
    /tmp/scripts/create-connectors.sh
    sleep infinity
...
/tmp/scripts/create-connectors.sh is a script mounted externally, containing a series of curl POST requests to the Kafka Connect REST API.
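For illustration, each entry in that script is just a POST of a connector config to /connectors; a minimal sketch (the connector name, class and settings below are placeholders, not the actual script contents) looks like:

```bash
#!/bin/bash
# Sketch of one entry in create-connectors.sh: register a connector via the
# Kafka Connect REST API. Name and config values are placeholders.
curl -s -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "example-file-sink",
        "config": {
          "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
          "tasks.max": "1",
          "topics": "example-topic",
          "file": "/tmp/example-sink.txt"
        }
      }'
```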

confluent local doesn't interact with a remote Connect cluster, such as one in Kubernetes.
Please refer to the Kafka Connect REST API
You'd connect to it like any other RESTful API running in the cluster (via a NodePort, or an Ingress/API gateway, for example).
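For example, during development you can simply port-forward the Connect service and hit the API from your machine (the service name below is an assumption; check what the Helm chart actually created):

```bash
# Forward the Connect REST port to your machine; the service name is an
# assumption based on typical cp-kafka-connect chart releases.
kubectl port-forward svc/my-release-cp-kafka-connect 8083:8083

# In another terminal: list the connectors through the forwarded port
curl -s http://localhost:8083/connectors
```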
the endpoint mentioned above is unreachable.
Localhost is the physical machine you're typing the commands into, not the remote GKE cluster
Somehow identify when the container is actually ready
Kubernetes health checks are responsible for that
kubectl get services
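For example, a readiness probe against the REST endpoint keeps the pod out of the Service until Connect is actually listening; a minimal sketch for the container spec (timings are illustrative):

```yaml
# Sketch of a readiness probe for the Connect container: the pod is only
# marked Ready once the REST API answers on port 8083.
readinessProbe:
  httpGet:
    path: /connectors
    port: 8083
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 30
```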
there are only two ways to create the connector
That's not true. You could additionally run Landoop's Kafka Connect UI or Confluent Control Center in your cluster to point and click.
But if you have local config files, you could also write code to interact with the API
Or try and see if you can make a PR for this issue
https://github.com/confluentinc/cp-docker-images/issues/467

Related

Kafka Connect on CFK: Discrepancies between Connect Rest api and Connect CR by CFK

I installed Confluent using the CFK (Confluent for Kubernetes) way of deployment; the setup went fine, using the vanilla YAML files for all the components (ZooKeeper, Kafka, Connect, ksqlDB, Control Center, Schema Registry).
I tried to use kind: Connector to configure my SQL Server source connector, and the connector was created successfully.
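For reference, the Connector CR looks roughly like the sketch below (the class and config values are placeholders, not my exact manifest, and field names may differ slightly between CFK versions):

```yaml
# Rough sketch of a CFK Connector custom resource (placeholder values).
apiVersion: platform.confluent.io/v1beta1
kind: Connector
metadata:
  name: mssql-source-conn
  namespace: confluent
spec:
  class: io.debezium.connector.sqlserver.SqlServerConnector
  taskMax: 1
  connectClusterRef:
    name: connect
  configs:
    database.hostname: mssql.example.internal
    database.port: "1433"
```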
The problem came when I tried to list the connector via the below curl request (after port-forwarding to the pod)
curl localhost:8083/connectors | jq .
I got nothing registered; however, when I ran the command below:
kubectl confluent connector list
it shows that I have registered connectors, as below. I assume both are two faces of the same coin.
NAME                STATUS   TASKS-READY   TASKS-FAILED   AGE
bq-sink-conn                 0                            9h
mssql-source-conn            0                            9h
My question is: why is there this discrepancy? Or am I missing something?
Also, after a week of searching and looking on the internet, I can't find enough resources with examples of how to use CFK and specifically the Connector CR.
Thanks,

How to run Kafka-Connect in Minikube?

To run the Confluent S3 sink connector to consume a Kafka topic on my local Mac, I did something like the below:
1. Installed the Confluent Kafka S3 connector and ran connect-standalone.sh
ML-C02Z605SLVDQ:kafka_2.12-2.5.0 e192270$ confluent-hub install confluentinc/kafka-connect-s3:latest --component-dir /usr/local/share/java --worker-configs config/connect-distributed.properties 
ML-C02Z605SLVDQ:kafka_2.12-2.5.0 e192270$ cd kafka_2.12-2.5.0
ML-C02Z605SLVDQ:kafka_2.12-2.5.0 e192270$ bin/connect-standalone.sh config/connect-standalone.properties s3-sink.properties   # s3-sink.properties sets connector.class=io.confluent.connect.s3.S3SinkConnector
Now, to run the Kafka S3 connector in Minikube, I have installed Kafka Connect (kafka-connect-s3) in Minikube using cp-helm-charts with the help of this tutorial: Using a connector with Helm-installed Kafka/Confluent.
How do I copy the Kafka config and script files into the kafka-connect pod?
Do I need to log in to the kafka-connect pod to run the connect-standalone.sh command?
There is a from scratch procedure here. The only requirement is Minikube.
The steps you need are the following:
Start Minikube
Deploy a Kafka cluster using the Strimzi Operator
Build your own custom image including required plugins and dependencies
Deploy Kafka Connect cluster in distributed mode using that image
Create a KafkaConnector instance passing a configuration YAML
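For the last step, the KafkaConnector resource is a small piece of YAML like the sketch below (connector class and config values are illustrative; the strimzi.io/cluster label must match the name of your KafkaConnect resource):

```yaml
# Sketch of a Strimzi KafkaConnector resource; values are illustrative.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: my-s3-sink
  labels:
    # Must match the KafkaConnect cluster that should run this connector
    strimzi.io/cluster: my-connect-cluster
spec:
  class: io.confluent.connect.s3.S3SinkConnector
  tasksMax: 1
  config:
    topics: my-topic
    s3.bucket.name: my-bucket
    s3.region: us-east-1
    flush.size: 100
```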
How to copy kafka config and script files inside kafka-connect pod
You shouldn't copy anything. Everything is configured via environment variables. The Helm charts mostly document how those variables work.
The Docker image runs Connect in distributed mode, where connectors are created via the REST API rather than property files. And confluentinc/cp-kafka-connect already contains S3 Connect.
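So instead of connect-standalone.sh, you would POST the same connector configuration to the distributed worker's REST API; a rough sketch (the worker address, bucket, topic and flush size are placeholders):

```bash
# Create the S3 sink connector on the distributed worker via the REST API.
# The host/port and all config values are placeholders.
curl -s -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
        "name": "s3-sink",
        "config": {
          "connector.class": "io.confluent.connect.s3.S3SinkConnector",
          "tasks.max": "1",
          "topics": "my-topic",
          "s3.bucket.name": "my-bucket",
          "s3.region": "us-east-1",
          "storage.class": "io.confluent.connect.s3.storage.S3Storage",
          "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
          "flush.size": "100"
        }
      }'
```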
You can also take a look at https://strimzi.io/.
The project is aimed at making the installation and management of a Kafka and Kafka Connect cluster on Kubernetes very easy.

How to get Kubernetes cluster name from K8s API using client-go

How to get Kubernetes cluster name from K8s API mentions that
curl http://metadata/computeMetadata/v1/instance/attributes/cluster-name -H "Metadata-Flavor: Google"
(from within the cluster), or
kubectl run curl --rm --restart=Never -it --image=appropriate/curl -- -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-name
(from outside the cluster), can be used to retrieve the cluster name. That works.
Is there a way to perform the same programmatically using the k8s client-go library? Maybe using RESTClient()? I've tried, but kept getting "the server could not find the requested resource".
UPDATE
What I'm trying to do is get the cluster name from an app that runs either on a local computer or within a k8s cluster. The k8s client-go library allows initialising the clientset via in-cluster or out-of-cluster authentication.
With the two commands mentioned at the top that is achievable. I was wondering if there was a way from the client-go library to achieve the same, instead of having to do kubectl or curl depending on where the service is run from.
The data that you're looking for (the name of the cluster) is available at the GCP level. The name itself is a resource within GKE, not Kubernetes. This means that this specific information is not available using client-go.
So in order to get this data, you can use the Google Cloud Client Libraries for Go, designed to interact with GCP.
As a starting point, you can consult this document.
First you have to download the container package:
➜ go get google.golang.org/api/container/v1
Before you launch your code, you will have to authenticate in order to fetch the data.
Google has a very good document on how to achieve that.
Basically, you have to generate a ServiceAccount key and pass it in the GOOGLE_APPLICATION_CREDENTIALS environment variable:
➜ export GOOGLE_APPLICATION_CREDENTIALS=sakey.json
Regarding the information that you want, you can fetch the cluster information (including its name) by following this example; a minimal sketch is shown below.
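Here is what such a program could look like (a sketch only: it lists the clusters in the given project/zone and prints their names and status; the flag names match the command below):

```go
// Sketch: list GKE clusters (name, status, master version) in a project/zone
// using the container/v1 API with Application Default Credentials.
package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	container "google.golang.org/api/container/v1"
)

func main() {
	project := flag.String("project", "", "GCP project ID")
	zone := flag.String("zone", "us-central1-a", "GKE zone")
	flag.Parse()

	ctx := context.Background()
	// Credentials are picked up from GOOGLE_APPLICATION_CREDENTIALS.
	svc, err := container.NewService(ctx)
	if err != nil {
		log.Fatalf("creating container service: %v", err)
	}

	resp, err := svc.Projects.Zones.Clusters.List(*project, *zone).Do()
	if err != nil {
		log.Fatalf("listing clusters: %v", err)
	}
	for _, c := range resp.Clusters {
		fmt.Printf("Cluster %q (%s) master_version: v%s\n", c.Name, c.Status, c.CurrentMasterVersion)
	}
}
```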
Once you do this, you can launch your application like this:
➜ go run main.go -project <google_project_name> -zone us-central1-a
And the result would be information about your cluster:
Cluster "tom" (RUNNING) master_version: v1.14.10-gke.17 -> Pool "default-pool" (RUNNING) machineType=n1-standard-2 node_version=v1.14.10-gke.17 autoscaling=false%
Also it is worth mentioning that if you run this command:
curl http://metadata/computeMetadata/v1/instance/attributes/cluster-name -H "Metadata-Flavor: Google"
You are also interacting with the GCP APIs, and you can go unauthenticated as long as it's run from within a GCE machine/GKE cluster; this provides automatic authentication.
You can read more about it in Google's Storing and retrieving instance metadata document.
Finally, one great advantage of doing this with the Cloud Client Libraries, is that it can be launched externally (as long as it's authenticated) or internally within pods in a deployment.
Let me know if it helps.
If you're running inside GKE, you can get the cluster name through the instance attributes: https://pkg.go.dev/cloud.google.com/go/compute/metadata#InstanceAttributeValue
More specifically, the following should give you the cluster name:
metadata.InstanceAttributeValue("cluster-name")
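Wrapped in a small program, that call could look like the sketch below (it only works for code running inside GCE/GKE, since it queries the instance metadata server):

```go
// Sketch: read the GKE cluster name from the GCE instance metadata server.
// This only works when the code runs inside GCE/GKE.
package main

import (
	"fmt"
	"log"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	if !metadata.OnGCE() {
		log.Fatal("not running on GCE/GKE, metadata server unavailable")
	}
	clusterName, err := metadata.InstanceAttributeValue("cluster-name")
	if err != nil {
		log.Fatalf("reading cluster-name attribute: %v", err)
	}
	fmt.Println("cluster name:", clusterName)
}
```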
The example shared by Thomas lists all the clusters in your project, which may not be very helpful if you just want to query the name of the GKE cluster hosting your pod.

Setting up Spring Cloud Data Flow on Kubernetes

Do I need to install an instance of Spring Cloud Data Flow on the master server myself, or is this getting installed "automatically" as part of the deployment?
This isn't quite clear from the description at
http://docs.spring.io/spring-cloud-dataflow-server-kubernetes/docs/current-SNAPSHOT/reference/htmlsingle/#_deploying_streams_on_kubernetes
I've followed the guide, though I removed every config for MySQL. Maybe this is required. I'm somewhat stuck since it's just not assigning an external IP, and I do not see why, how to debug it, or whether I missed installing some required component.
Edit:
To clarify, I see a scdf service entry when I run
kubectl get svc
But this service never gets an external IP.
Do I need to install an instance of Spring Cloud Data Flow on the master server myself, or is this getting installed "automatically" as part of the deployment?
The Spring Cloud Data Flow server needs to be set up either outside the cluster (configured so that it knows how to connect to the Kubernetes environment), or you can run the Spring Cloud Data Flow server Docker image inside Kubernetes; the latter approach is better.
Step 6 in the link you posted above runs the SCDF Docker image inside the Kubernetes cluster:
```
Deploy the Spring Cloud Data Flow Server for Kubernetes using the Docker image and the configuration settings you just modified.
$ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml
$ kubectl create -f src/etc/kubernetes/scdf-secrets.yml
$ kubectl create -f src/etc/kubernetes/scdf-service.yml
$ kubectl create -f src/etc/kubernetes/scdf-controller.yml
```
MySQL is required; that's why it's in the steps.
Spring Cloud Data Flow uses an RDBMS instead of Redis for stream/task
definitions, application registration, and for job repositories.
You can also use any of the other supported RDBMSes.
You can install it using Helm Charts.
https://dataflow.spring.io/docs/installation/kubernetes/helm/
First, install Helm.
Then install Spring Cloud Data Flow:
helm install --name my-release stable/spring-cloud-data-flow
It will install and configure the relevant pods, such as spring-cloud-dataflow-server, mysql, skipper, rabbitmq, etc.
You can also customize versions and configurations.

In Dataproc how can I access the Spark and Hadoop job history?

In Google Cloud Dataproc how can I access the Spark or Hadoop job history servers? I want to be able to look at my job history details when I run jobs.
To do this, you will need to create an SSH tunnel to the cluster and then use a SOCKS proxy with your browser. This is because, while the web interfaces are open on the cluster, firewall rules prevent anyone from connecting (for security).
To access the Spark or Hadoop job history server, you will first need to create an SSH tunnel to the master node of your cluster:
gcloud compute ssh --zone=<master-host-zone> \
--ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" <master-host-name>
Once you have the SSH tunnel in place, you need to configure a browser to use a SOCKS proxy. Assuming you're using Chrome and know the path to Chrome on your system, you can launch Chrome with a SOCKS proxy using:
<Google Chrome executable path> \
--proxy-server="socks5://localhost:1080" \
--host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
--user-data-dir=/tmp/
The full details on how to do this can be found here.