kafka after reboot cluster or os remove all connectors

kafka after reboot cluster or os remove all connectors - apache-kafka

I use rest api for Kafka Connect. After reboot server all connectors are deleted and return result is empty.
curl -H "Accept:application/json" localhost:8083/connectors
output is : []

Connectors are stored in Kafka topics
By default, Kafka (and Zookeeper) stores its topics/data under /tmp, which is wiped on restart
As mentioned in the comments, there are also other ways you'd end up with temporary data

Related

Kafka connect - completely removing a connector

my question is split to two. I've read Kafka Connect - Delete Connector with configs?. I'd like to completely remove a connector, with offsets and all, so I can recreate it with the same name later. Is this possible? To my understanding, a tombstone message will kill this connector indefinitely.
The second part is - is there a way to have the kafka-connect container automatically delete all connectors he created when bringing it down?
Thanks

There is no such command to completely cleanup connector state. For sink connectors, you can use kafka-consumer-groups to reset it's offsets. For source connectors, it's not as straightforward, as you'll need to manually produce data into the Connect-managed offsets topic.
The config and status topics also persist historical data, but shouldn't prevent you from recreating the connector with the same name/details.
The Connect containers published by Confluent and Debezium always uses Distributed mode. You'll need to override the entrypoint of the container to use standalone mode to not persist the connector metadata in Kafka topics (this won't be fault tolerant, but it'll be fine for testing)

How to connect Redpanda cluster and topic to Redpanda Schema Registry?

I am trying to connect a running redpanda kafka cluster to a redpanda schema registry, so that the schema registry verifies incoming messages to the topic, and/or messages being read from the topic.
I am able to add a schema to the registry and read it back with curl requests, as well as add messages to the kafka topic I created in the redpanda cluster.
My question is, how to implement the schema registry with a topic in a kafka cluster? Like how do I instruct the schema registry and/or the kafka topic to validate incoming messages against the schema I added to the registry?
Thanks for your help or a point in the right direction!
Relevant info:
Cluster & topic creation:
https://vectorized.io/docs/guide-rpk-container
rpk container start -n 3
rpk topic create -p 6 -r 3 new-topic --brokers <broker1_address>,<broker2_address>...
Schema Registry creation:
https://vectorized.io/blog/schema_registry/
Command to add a schema:
curl -s \
-X POST \
"http://localhost:8081/subjects/sensor-value/versions" \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"long\"}]}"}' \
| jq

The client is responsible for such integration. For example, the Confluent Schema Registry includes KafkaAvroSerializer Java class that wraps an HTTP Client that handles the schema registration and message validation. The broker doesn't handle "topic schemas", since schemas are really per-record. Unless RedPanda has something similar, broker-side validation is only offered by Enterprise "Confluent Server."
RedPanda is primarily a server that exposes a Kafka-compatible API; I assume it is up to you to create (de)serializer interfaces for your respective client languages. There is a Python example on Vectorized Github.
That being said, the Confluent Schema Registry should work with RedPanda as well, and so you can use its serializers and HTTP client libraries with it.

Mongodb Kafka messages not seen by topic

I encountered that my topic despite running and operating doesn't register events occuring in my MongoDB.
Everytime I insert/modify record I'm not getting anymore logs from kafka-console-consumer command.
Is there a way clear Kafka's cache/offset maybe?
Source and sink connection are up and running. Entire cluster is also healthy, thing is that everything worked as usual but every couple weeks I see this coming back or when I log into my Mongo cloud from other location.
--partition 0 parameter didn't help, changing retention_ms to 1 too.
I checked both connectors' status and got RUNNING:
curl localhost:8083/connectors | jq
curl localhost:8083/connectors/monit_people/status | jq
Running docker-compose logs connect I found:
WARN Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286
If the resume token is no longer available then there is the potential for data loss.
Saved resume tokens are managed by Kafka and stored with the offset data.
When running Connect in standalone mode offsets are configured using the:
`offset.storage.file.filename` configuration.
When running Connect in distributed mode the offsets are stored in a topic.
Use the `kafka-consumer-groups.sh` tool with the `--reset-offsets` flag to reset offsets.
Resetting the offset will allow for the connector to be resume from the latest resume token.
Using `copy.existing=true` ensures that all data will be outputted by the connector but it will duplicate existing data.
Future releases will support a configurable `errors.tolerance` level for the source connector and make use of the `postBatchResumeToken

Issue requires more practice with Confluent Platform thus for now I re-built entire environment by removing entire container with:
docker system prune -a -f --volumes
docker container stop $(docker container ls -a -q -f "label=io.confluent.docker").
After running docker-compose up -d all is up and working.

Starting multiple connectors in Kafka Connect withing single distributed worker?

How to start multiple Kafka connectors in a Kafka Connect world within a single distributed worker(running on 3 different servers)?
Right now I have a need of 4 Kafka Connectors in this distributed worker(same group.id).
Currently, I am adding one connector at a time using following curl command.
curl -X POST -H "Content-type: application/json" -d '<my_single_connector_config>' 'http://localhost:8083/connectors'
Issue:
For each new connector I add, previous/existing connector(s) restarts along with new connector.
Question:
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
I tried to search for the same but didn't come across any workaround for this.
Thanks.

For each new connector I add, previous/existing connector(s) restarts along with new connector.
Yes, that's the current behaviour of Kafka Connect. For further discussion see:
https://issues.apache.org/jira/browse/KAFKA-5505
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
You can't do it in a single REST call
If you want to isolate your connectors from each other when creating/updating them, you can just run multiple distributed clusters.
So instead of 1 distributed Connect cluster running 3 connectors, you could have 3 distributed Connect clusters each running 1 connector.
Remember in practice a 'distributed Cluster' could just be of a single node, and indeed could all run on the same machine. You'd scale out for resilience and throughput capacity.

Kafka connect cluster setup or launching connect workers

I am going through kafka connect, and i am trying to get the concepts.
Let us say I have kafka cluster (nodes k1, k2 and k3) setup and it is running, now i want to run kafka connect workers in different nodes say c1 and c2 in distributed mode.
Few questions.
1) To run or launch kafka connect in distributed mode I need to use command ../bin/connect-distributed.sh, which is available in kakfa cluster nodes, so I need to launch kafka connect from any one of the kafka cluster nodes? or any node from where I launch kafka connect needs to have kafka binaries so that i will be able to use ../bin/connect-distributed.sh
2) I need to copy the my connector plugins to any kafka cluster node( or to all cluster nodes?) from where I do the step 1?
3) how does kafka copies these connector plugins to worker node before starting jvm process on the worker node? because the plugin is the one which has my task code and it needs to be copied to worker in order to start the process in worker.
4) Do i need to install anything in connect cluster nodes c1 and c2, like need to install java or any kafka connect related?
5) In some places it says use confluent platform but i would like to start it with apache kafka connect alone first.
can some one please through some light or even pointer to some resources would also help.
Thank you.

1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here. For improved performance, Connect should be ran independently of the broker and Zookeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine that you are planning to run kafka-connect.
3) kafka-connect will load the connectors on startup. You don't need to handle this. Note that if your kafka-connect instance is running and a new connector is added, you need to restart the service.
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform
(Java 1.9 is currently not supported). You should run with the
Garbage-First (G1) garbage collector. For more information, see the
Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka and it comes as a more complete distribution adding schema management, connectors and clients. It also comes with KSQL which is quite useful if you need to act on certain events. Confluent simply adds on top of the Apache Kafka distribution, it's not a modified version.

Answer given by Giorgos is correct. I ran few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka connect there are two things involved one is Worker and second is connector.Below is on details about running distributed Kafka connect.
Kafka connect Worker is a Java process on which the connector/connect task will run. So first thing is we need to launch worker, to run/launch a worker we need java installed on that machine then we need Kafka connect related sh/bat files to launch worker and kafka libs which will be used by kafka connect worker, for this we will just simply copy/install Kafka in the worker machine, also we need to copy all the connector and connect-task related jars/dependencies in "plugin.path" as defined in the below worker properties file, now worker machine is ready, to start worker we need to invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, here connect-distributed.properties will have configuration for worker. The same thing has to be repeated in each machine where we need to run Kafka connect.
Now the worker java process is running in all machines, the woker config will have group.id property, the workers which have this same property value will be forming a group/cluster of workers.
Each worker process will expose rest endpoint (default http://localhost:8083/connectors), to launch/start a connector on the running workers, we need do http-post a connector config json, based on the given config the worker will start the connector and the number of tasks in the above group/cluster workers.
Example: Connect post,
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse