https://issues.apache.org/jira/browse/IGNITE-13442
Regarding the above issue: this is not yet implemented/fixed.
Alternatively, in place of the Sink Connector, could we write our own consumers that listen to the Kafka topic for cache events? Those consumers would inspect the events and apply them to the specified cluster for DC replication.
Since Ignite is Open Source, it might be easier to fix it yourself than to write your own consumer from scratch.
Having said that, there's nothing "special" about the Kafka adapter. It's entirely possible to write an application that reads from Kafka and sends puts or removes to an Ignite cluster.
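To illustrate, the per-record logic of such a bridge might look like the sketch below. The class and method names are invented for this example, and the actual KafkaConsumer poll loop and IgniteCache handle are only noted in comments; a Kafka tombstone (null value) maps to a cache remove, anything else to a put.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the core dispatch logic a Kafka->Ignite replicator
// would run for every record it polls. In a real application the Map would
// be an IgniteCache obtained from an Ignite client, and key/value would come
// from KafkaConsumer#poll on the topic carrying the cache events.
public class KafkaToIgniteBridge {

    // Apply one Kafka record to the target cache:
    // a null value (tombstone) means remove, otherwise put.
    public static void applyRecord(Map<String, String> cache, String key, String value) {
        if (value == null) {
            cache.remove(key);     // tombstone -> remove from the replica cluster
        } else {
            cache.put(key, value); // normal event -> upsert into the replica cluster
        }
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>(); // stand-in for IgniteCache
        applyRecord(cache, "k1", "v1");
        applyRecord(cache, "k1", null);
        System.out.println(cache.containsKey("k1")); // false - tombstone removed it
    }
}
```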
Related
My question is in two parts. I've read Kafka Connect - Delete Connector with configs?. I'd like to completely remove a connector, offsets and all, so I can later recreate it with the same name. Is this possible? To my understanding, a tombstone message will kill this connector indefinitely.
The second part is: is there a way to have the kafka-connect container automatically delete all connectors it created when it is brought down?
Thanks
There is no single command to completely clean up connector state. For sink connectors, you can use kafka-consumer-groups to reset the connector's offsets. For source connectors, it's not as straightforward, as you'll need to manually produce data into the Connect-managed offsets topic.
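As a concrete (hypothetical) illustration of the sink case: Connect names the consumer group connect-<connector-name> by default, so a reset could look like this, with the broker address and connector name as placeholders. This requires a live cluster, and the connector must be stopped/deleted first, since the reset only works while the group is empty:

```shell
# Reset the offsets of a (stopped) sink connector named "my-sink"
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group connect-my-sink \
  --reset-offsets --to-earliest --all-topics --execute
```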
The config and status topics also persist historical data, but shouldn't prevent you from recreating the connector with the same name/details.
The Connect containers published by Confluent and Debezium always use distributed mode. You'll need to override the container's entrypoint to use standalone mode so that connector metadata is not persisted in Kafka topics (this won't be fault-tolerant, but it'll be fine for testing).
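For example, with the Confluent image, the override could look roughly like this in a compose file (the image tag and file paths are placeholders, and the two properties files would need to be written and mounted by you):

```yaml
# Hypothetical docker-compose override: run the Connect image in
# standalone mode instead of its default distributed-mode entrypoint.
connect:
  image: confluentinc/cp-kafka-connect:7.0.0
  volumes:
    - ./worker.properties:/tmp/worker.properties
    - ./connector.properties:/tmp/connector.properties
  command: connect-standalone /tmp/worker.properties /tmp/connector.properties
```

In standalone mode, offsets go to a local file (offset.storage.file.filename) instead of Kafka topics, so removing the container removes the state with it.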
Environment
Apache Kafka 2.7.0
Apache Flume 1.9.0
Problem
Currently, in our architecture, we are using Flume with a Kafka channel (no source) and an HDFS sink.
In the future, we are going to build a Kafka HA cluster using Kafka MirrorMaker.
The goal is that even if one cluster goes down, clients can fail over to the other cluster and keep working without interruption.
To do this, I think we need Flume to subscribe to topics with a regex pattern.
Assume that cluster A and cluster B exist, and both clusters have a topic called ex. MirrorMaker copies ex in each direction, so cluster A has topics ex and b.ex, and cluster B has topics ex and a.ex.
For example, while reading ex and b.ex from cluster A, if there is a failure, Flume would switch to the opposite cluster and read ex and a.ex.
Like below.
test.channel = c1 c2
c1.channels.kafka.topics.regex = .*ex (impossible with the Kafka channel)
...
c1.source.kafka.topics.regex = .*ex (possible with the Kafka source)
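To illustrate the kind of matching I mean (a quick stdlib check, with the topic names from the example above): Kafka's pattern-based subscription matches the pattern against the entire topic name, so one regex can cover both the local topic and its mirrored copies.

```java
import java.util.List;
import java.util.regex.Pattern;

// Demonstrates which topic names a subscription pattern would cover.
// Kafka's Pattern-based subscribe matches against the full topic name,
// which corresponds to Matcher#matches() here.
public class TopicPatternDemo {

    // Matches "ex" itself plus any "<prefix>.ex" mirrored copy,
    // without accidentally matching unrelated names ending in "ex".
    static final Pattern SUBSCRIPTION = Pattern.compile("(.*\\.)?ex");

    public static boolean covered(String topic) {
        return SUBSCRIPTION.matcher(topic).matches();
    }

    public static void main(String[] args) {
        for (String t : List.of("ex", "a.ex", "b.ex", "index")) {
            System.out.println(t + " -> " + covered(t));
        }
    }
}
```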
The Flume Kafka source has a property to subscribe to topics by regex pattern, but no such property exists for the Kafka channel.
Is there any good way to do this?
I'd appreciate it if you could suggest a better way. Thank you.
Sure, using a regex (or simply a list of both topics) would be preferred, but you then end up with data split across different directories based on the topic name, leaving HDFS clients to merge the data back together.
A channel includes a producer, thus why a regex isn't possible
By going to the opposite cluster
There's no way Flume will automatically do that unless you modify its bootstrap servers config and restart it. The same applies to any Kafka client, really... This isn't exactly what I'd call "highly available", because all clients pointing at the down cluster will experience downtime.
Instead, you should be running a Flume pipeline (or Kafka Connect) against each cluster. That being said, MirrorMaker would then only be making extra copies of your data, or allowing clients to consume data from the other cluster for their own purposes, rather than acting as a backup/fallback.
Aside: it's unclear from the question, but make sure you are using MirrorMaker 2. That would imply you're already running Kafka Connect, so you could install the HDFS sink connector rather than needing Flume.
We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (since we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (since we would be adding the overhead of running a Kafka producer before the flush)?
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously adding more blocking code makes the other processes slower overall
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
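For reference, the worker's own connection settings live in that file; a minimal illustration with placeholder values:

```properties
# connect-distributed.properties (worker config, not connector config)
bootstrap.servers=broker1:9092,broker2:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```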
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to file a Kafka JIRA requesting such a feature of exposing the worker configs, though.)
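To make the workaround concrete: define your own key in the connector config and read it back in the task. The property name metadata.bootstrap.servers below is invented for this example:

```json
{
  "name": "pubsub-sink",
  "config": {
    "connector.class": "...",
    "topics": "events",
    "metadata.bootstrap.servers": "broker1:9092,broker2:9092"
  }
}
```

The value then shows up in SinkTask#start(Map<String, String> props), where props.get("metadata.bootstrap.servers") can be used to configure your producer.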
I have a critical Kafka application that needs to be up and running all the time. The source topics are created by the Debezium Kafka Connect connector for the MySQL binlog. Unfortunately, many things can go wrong with this setup. A lot of times the Debezium connectors fail and need to be restarted, and then so do my apps (because, without throwing any exception, they just hang and stop consuming). My manual way of discovering a failure is to check the Kibana logs and then consume the suspicious topic through a terminal. I could mimic this in code, but that is obviously far from best practice. I wonder if the Kafka Streams API offers a way to do such a health check, and to check other parts of the Kafka cluster?
Another point that bothers me is whether I can keep the stream alive and rejoin the topics once the connectors are up again.
You can check the Kafka Streams state to see whether it is rebalancing/running, which would indicate healthy operation. Although, if no data is getting into the topology, I would assume there would be no errors happening, so you then need to look up the health of your upstream dependencies.
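As a sketch (illustrative only, since it needs the kafka-streams dependency and an already-built KafkaStreams instance), the relevant APIs are KafkaStreams#state and KafkaStreams#setStateListener:

```java
// Sketch only - requires the kafka-streams dependency;
// "streams" is an org.apache.kafka.streams.KafkaStreams instance.
KafkaStreams.State state = streams.state();
boolean healthy = state == KafkaStreams.State.RUNNING
               || state == KafkaStreams.State.REBALANCING;

// Or react to transitions as they happen (register before streams.start()):
streams.setStateListener((newState, oldState) -> {
    if (newState == KafkaStreams.State.ERROR) {
        // alerting / restart logic here
    }
});
```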
Overall, it sounds like you might want to invest some time into monitoring tools like Consul or Sensu, which can run local service health checks and send out alerts when services go down. Or at the very least Elasticsearch alerting.
As far as Kafka health checking goes, you can do that in several ways
Are the broker and ZooKeeper processes running? (SSH to the node and check the processes)
Are the broker and ZooKeeper ports open? (use a socket connection)
Are there important JMX metrics you can track? (Metricbeat)
Can you find an active controller broker? (use AdminClient#describeCluster)
Does a required minimum number of brokers respond as part of the controller metadata (which can also be obtained from AdminClient)?
Do the topics you use have the proper configuration (retention, min-ISR, replication factor, partition count, etc.)? (again, use AdminClient)
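Of those checks, the port check needs nothing but the JDK. A minimal sketch, where host and port are placeholders for your broker or ZooKeeper node:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal TCP port check: succeeds if something accepts the connection
// within the timeout, fails on refusal or timeout.
public class PortCheck {

    public static boolean isOpen(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;  // something is listening
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    public static void main(String[] args) {
        // 9092 is the default Kafka broker port; adjust host/port as needed.
        System.out.println("broker port open: " + isOpen("localhost", 9092, 2000));
    }
}
```

The same approach works for ZooKeeper's client port (2181 by default). Note this only proves the port accepts connections, not that the broker is healthy, which is why the JMX and AdminClient checks above are still worthwhile.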
I need a simple health checker for Apache Kafka. I don't want something large and complex like Yahoo Kafka Manager; basically I want to check whether a topic is healthy and whether a consumer is healthy.
My first idea was to create a separate heart-beat topic and periodically send and read messages to/from it in order to check availability and latency.
The second idea is to read all the data from Apache ZooKeeper. I can get all brokers, partitions, topics, etc. from ZK, but I don't know if ZK can provide something like failure-detection info.
As I said, I need something simple that I can use in my app health checker.
Some existing tools you can try out, if you haven't yet:
Burrow - LinkedIn's Kafka consumer lag checking
Exhibitor - Netflix's ZooKeeper co-process for instance monitoring, backup/recovery, cleanup and visualization
Kafka System Tools - Kafka command-line tools