Kafka 2.0 - Multiple Kerberos Principals in KafkaConnect Connectors - apache-kafka

We are currently using HDF (Hortonworks DataFlow) 3.3.1, which bundles Kafka 2.0.0. The problem is running multiple connectors with different configurations (Kerberos principals) on the same Kafka Connect cluster.
In this Kafka version, all connectors have to use the same consumer/producer properties, which are set in the worker configuration with the consumer.* or producer.* prefix. But as I said, we have multiple users (apps) running their own connectors, and we can't use a single Kerberos principal to allow reads on all topics.
So I just wanted to check with the experts whether there is any way to overcome this security limitation. The only option I can think of is to run a separate Kafka Connect cluster for each Kafka user (different principals), but what implications would it have if we run many Kafka Connect clusters on the same nodes? Will it have an impact in terms of resources (Java heap etc.), or is this the only way (the standard procedure) to handle this?
PS: In later releases (2.3+) this problem is fixed via KAFKA-8265 and these settings can be overridden per connector, but even if we upgrade to the latest HDF we will only get Kafka 2.1, which will not solve this issue.
Thanks for your help!

I think upgrading is your best option to get the linked feature. As I commented, you can get the latest Kafka versions on your own; Hortonworks/Cloudera doesn't offer support for Connect anyway. They'd rather you use Spark/Flink/NiFi (I think Storm is no longer around?)
"What implications would it have if we run many Kafka Connect clusters on the same nodes? Will it have an impact in terms of resources (Java heap etc.)?"
Heap is the main one (for batching, especially in sink connectors). Network and CPU load could also come into play, depending on the message rate.
As long as the advertised ports for each cluster process aren't colliding, you should be able to use the same group ids and internal topics, though
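For readers who are able to move to Kafka 2.3+, the per-connector override from KAFKA-8265 mentioned in the question can be sketched roughly as follows. This is only an illustration, not a tested deployment: the connector name, topic, principal, keytab path, security protocol and the choice of FileStreamSinkConnector are placeholder assumptions. It also assumes the worker was started with connector.client.config.override.policy=All, because overrides are rejected by default in 2.3.

```python
import json


def connector_with_own_principal(name, topics, principal, keytab):
    """Build a connector definition whose consumer uses its own Kerberos login.

    Assumes Kafka 2.3+ and a worker configured with
    connector.client.config.override.policy=All. All argument values used
    below are hypothetical placeholders.
    """
    jaas = (
        "com.sun.security.auth.module.Krb5LoginModule required "
        "useKeyTab=true storeKey=true "
        f'keyTab="{keytab}" principal="{principal}";'
    )
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "topics": topics,
            "file": f"/tmp/{name}.out",
            # consumer.override.* replaces the worker-level consumer.* settings
            # for this connector only (a source connector would use
            # producer.override.* instead).
            "consumer.override.security.protocol": "SASL_PLAINTEXT",
            "consumer.override.sasl.kerberos.service.name": "kafka",
            "consumer.override.sasl.jaas.config": jaas,
        },
    }


if __name__ == "__main__":
    definition = connector_with_own_principal(
        "app-a-sink",                            # hypothetical connector name
        "app-a-topic",                           # hypothetical topic
        "app-a@EXAMPLE.COM",                     # hypothetical principal
        "/etc/security/keytabs/app-a.keytab",    # hypothetical keytab path
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON has the shape expected by a POST to the Connect REST endpoint /connectors, so each application could submit its own definition with its own principal to a shared cluster.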

Related

Where to run Kafka stream processor?

I'm playing around with Apache Kafka a bit and have a functional multi-node cluster configured. I now want to introduce a Kafka Streams processor. I'll just do something simple, but here's my question: where do I run it? I know I can run it as a standalone jar on any machine, but is that the correct place to run it? Do I run it on a worker node? Can I run it via the distributed Kafka Connect worker API? I saw documentation that says multiple instances of the same processor will be aware of each other... how? Is that handled in the Java Kafka libraries behind the scenes?
Basically, how do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor.
Assume I am NOT using Kubernetes for this, please. Also assume I am using the community-only packages for the Confluent Platform.
Kafka Connect does not run Kafka Streams applications.
ksqlDB, on the other hand, offers an abstraction layer for Kafka Streams applications and offers an embedded Connect worker.
Otherwise, yes, you simply run the Kafka Streams JAR files, anywhere that has network access to your Kafka cluster. Ideally, not on the cluster itself as it'll be competing for RAM and disk space.
And none of the above requires Confluent Platform.
"How do I deploy a processor at scale? Presumably I wouldn't manually start 10 (or 100 or 1000) instances of the same processor."
Well, the number of active threads is capped at the number of partitions of your processor's input topics. You control the total thread count with num.stream.threads and the number of Streams processes.
If you're not deploying into Kubernetes, then you can still use other options like Puppet, Ansible, Supervisor, Hashicorp Nomad's Java Driver, etc.
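The partition cap described above can be made concrete with a little arithmetic. The sketch below is a simplified model, assuming a single sub-topology whose task count equals the number of input partitions; the function name and the numbers are made up for illustration.

```python
def stream_thread_usage(input_partitions, instances, threads_per_instance):
    """Return (active_threads, idle_threads) under a simplified model.

    Assumes one sub-topology, so the number of tasks equals the number of
    input partitions, and every thread beyond that count gets no work.
    threads_per_instance corresponds to num.stream.threads.
    """
    total_threads = instances * threads_per_instance
    active = min(total_threads, input_partitions)
    return active, total_threads - active


if __name__ == "__main__":
    # 12 partitions, 4 processes with 2 threads each: all 8 threads are busy.
    print(stream_thread_usage(12, 4, 2))
    # 12 partitions, 10 processes with 2 threads each: 8 threads sit idle.
    print(stream_thread_usage(12, 10, 2))
```

Under this model, starting more instances than the partition count allows only adds idle standby capacity, so the partition count is the number to plan deployments around.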

Kafka Connect instead of Flume Ingestion

I have been looking into the concepts and applications of Kafka Connect, and I even worked on a project based on it during one of my internships. At my current job, I am now considering replacing the architecture of our real-time data ingestion platform, which is currently Flume -> Kafka, with Kafka Connect and Kafka.
The reasons I am considering the switch are mainly the following:
With Flume we need to install the agent on each remote machine, which generates a lot of extra DevOps work, especially where I work, where machine access is managed rigidly and it is hard to maintain utilities on machines belonging to other departments.
Another reason is that the machines' OS environments vary. If we install Flume on a variety of machines, some machines with different OSes and JDKs (I have met some with the IBM JDK) just cannot run Flume properly, which in the worst case can result in zero data ingestion.
It looks like with Kafka Connect we can deploy it centrally alongside our Kafka cluster, so that the DevOps cost goes down. Besides, we can avoid installing Flume on machines belonging to others and avoid the risk of incompatible environments, ensuring stable ingestion of data from every remote machine.
Also, our main ingestion scenario is just ingesting log text files that are written in real time on remote machines (on Linux and Unix file systems) into Kafka topics; that is it. So I won't need advanced connectors that are not supported in the Apache version of Kafka.
But I am not sure whether I understand the usage and use cases of Kafka Connect correctly. I am also wondering whether Kafka Connect should be deployed on the same machines as the data sources, or whether it is OK for them to reside on different machines. If they can be different, then why does Flume require the agent to run on the same machine as the data source? I hope someone more experienced can shed some light on this.
Is Kafka Connect appropriate for ingesting data to Kafka? Yes.
Does Kafka Connect run local to the data source? Only if it has to (e.g. reading a local file with the Kafka Connect spooldir plugin, the FilePulse plugin, etc.).
Should you rip out something that works and replace it with Kafka Connect? Not unless it's fixing a problem that you have.
If you're not using either yet, should you use Kafka Connect instead of Flume? Quite possibly.
Learn more about Kafka Connect here: https://dev.to/rmoff/crunchconf-2019-from-zero-to-hero-with-kafka-connect-81o
For file ingest alone there are other tools too, like Filebeat.

Kafka consumer groups suddenly stopped balancing messages among instances

We have a microservice architecture that communicates through Kafka on Confluent, where each service has its own consumer group in order to balance message delivery across its instances.
For example:
SERVICE_A_INSTANCE_1 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_2 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_3 (CONSUMER_GROUP_A)
SERVICE_B_INSTANCE_1 (CONSUMER_GROUP_B)
SERVICE_B_INSTANCE_2 (CONSUMER_GROUP_B)
When a message is emitted it should only be consumed by one instance of each consumer group.
This worked fine until two days ago. All of a sudden, each message is being delivered to all the instances, so each message is processed multiple times. Basically, the consumer groups stopped working and messages are not being distributed.
Important points:
We use Kafka PaaS in Confluent on GCP.
We tested this in a different environment and everything worked as expected.
No changes have been made to our consumers.
No changes have been made on our part to the cluster (we can't know if Confluent changed something).
We suspect it might be a problem on Confluent's side or an update that is not compatible with our current configuration. Kafka 2.2.0 was recently released and it has some changes to consumer group behavior.
We are currently working on migrating to AWS MSK to see if the issue persists.
Any ideas on what could be causing this?
TL;DR: We solved the issue by moving away from Confluent into our own Kafka cluster on GCP.
I will answer my own question since it's been a while and we have already solved this. Also, my insights might help others make more informed decisions on where to deploy their Kafka infrastructure.
Unfortunately we could not get to the bottom of the problem with Confluent. It is most likely something on their side because we simply migrated to our own self managed instances on GCP and everything went back to normal.
Some important clarifications before my final thoughts and warnings about using Confluent as a managed Kafka service:
We think this is related to something that affected Node.js in particular. We tested client libraries in languages other than Node and the behavior was as expected. When testing with several of the most popular Node libraries, the problem persisted.
We did not have premium support with Confluent.
I cannot confirm that this issue is not our fault.
With all of those points in mind, our conclusion is that for companies that decide on using a managed service with Confluent, it's best to calculate costs with premium support included. Otherwise, Kafka turns into a completely closed black box, making it impossible to diagnose issues. In my personal opinion, the dependency on the Confluent team during a problem is so large that not having them ready to help when needed renders the service non-production ready.

two kafka versions running on same cluster

I am trying to configure two Kafka servers on a cluster of 3 nodes. There is already one Kafka broker (version 0.8) running with the application, and there is a dependency on that Kafka 0.8 version, which cannot be disturbed or upgraded.
Now, for a POC, I need to configure 1.0.0, since my new code is compatible with this version and above.
My task is to push data from Oracle to Hive tables. For this I am using the JDBC connector to fetch data from Oracle and Hive JDBC to push data to the Hive tables. It should be a fast and easy way to do it.
I need help with the following:
Can I use spark-submit to run this data push to Hive?
Can I simply copy kafka_2.12-1.0.0 to one of the nodes on my Linux server and run my code on it? I think I need to configure my zookeeper.properties and server.properties with ports that are not in use and start the new ZooKeeper and Kafka services separately. Please note that I cannot disturb the existing ZooKeeper and Kafka that are already running.
Kindly help me achieve this.
I'm not sure running two very memory intensive applications (Kafka and/or Kafka Connect) on the same machines is considered very safe. Especially if you do not want to disturb existing applications. Realistically, a rolling restart w/ upgrade will be best for performance and feature reasons. And, no, two Kafka versions should not be part of the same cluster, unless you are in the middle of a rolling upgrade scenario.
If at all possible, please use new hardware. I assume Kafka 0.8 is running on machines that could be old and out of warranty? In that case, there's no significant reason that I know of not to use a newer version of Kafka. But yes, extract it on any machine you'd like, perhaps using something like Ansible, or whichever config management tool you prefer, to do it for you.
You can actually share the same ZooKeeper cluster; just make sure each Kafka cluster uses a different chroot path in zookeeper.connect. For example:
Cluster 0.8
zookeeper.connect=zoo.example.com:2181/kafka08
Cluster 1.x
zookeeper.connect=zoo.example.com:2181/kafka10
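To double-check that two server.properties files really point at different ZooKeeper namespaces, a small helper like the one below could compare their zookeeper.connect values. It is only a sketch: the function names are made up, and it treats two ensembles as the same only if they list at least one identical host:port entry, which will miss aliases for the same host.

```python
def split_zookeeper_connect(value):
    """Split 'host1:port,host2:port/chroot' into (hosts, chroot).

    A value without a path uses the ZooKeeper root, returned here as '/'.
    """
    hosts, sep, path = value.partition("/")
    chroot = "/" + path.strip("/") if sep else "/"
    return hosts.split(","), chroot


def shares_namespace(connect_a, connect_b):
    """True if both values name an overlapping ensemble and the same chroot."""
    hosts_a, chroot_a = split_zookeeper_connect(connect_a)
    hosts_b, chroot_b = split_zookeeper_connect(connect_b)
    return bool(set(hosts_a) & set(hosts_b)) and chroot_a == chroot_b


if __name__ == "__main__":
    old = "zoo.example.com:2181/kafka08"
    new = "zoo.example.com:2181/kafka10"
    # Different chroots, so the two Kafka clusters keep separate metadata.
    print(shares_namespace(old, new))
```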
Also, it's not clear where Spark fits into this architecture. Please don't use the JDBC sink for Hive. Use the proper HDFS Kafka Connect sink, which has direct Hive support via the metastore. And while the JDBC source might work for Oracle, chances are you might already be able to afford a license for GoldenGate.
I was able to get two Kafka versions, 0.8 and 1.0, running on the same server with their respective ZooKeepers.
Steps followed:
1. Copy the version package folder to the desired location on the server.
2. Change the configuration settings in zookeeper.properties and server.properties (here you need to set ports that are not in use on that particular server).
3. Start the services and push data to Kafka topics.
Note: this setup is only for a POC and is not an ideal production environment. As answered above, we should upgrade rather than do what is described here.
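Before editing the port settings in step 2, it can help to confirm that the candidate ports are actually unused on the server. The sketch below is one possible way to do that with the standard library; the port numbers 2182 and 9093 are arbitrary examples, not values from the original setup, and the check only covers the loopback address unless another host is passed in.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def conflicting_ports(ports, host="127.0.0.1"):
    """Return the subset of ports that are already taken on this host."""
    return [port for port in ports if not port_is_free(port, host)]


if __name__ == "__main__":
    # Hypothetical ports for the second ZooKeeper and Kafka instances.
    candidates = [2182, 9093]
    print("ports already in use:", conflicting_ports(candidates))
```

A port can still be claimed by another process between the check and the service start, so this is a sanity check rather than a guarantee.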

Kafka and IIDR CDC

I am trying to build a CDC pipeline using DB2 -> IBM CDC -> Kafka,
and I am trying to figure out the right way to set this up.
I tried the following:
1. Set up a 3-node Kafka cluster on Linux, on-prem.
2. Installed the IIDR CDC software on Linux, on-prem, using the setup-iidr-11.4.0.1-5085-linux-x86.bin file. The CDC instance is up and running.
The various online documentation suggests installing the 'IIDR Management Console' to configure the source datastore, the CDC server configuration and the Kafka subscription configuration to build the pipeline.
Currently I do not have the Management Console installed.
A few questions on this:
1. Is there any alternative to the IBM CDC Management Console for setting up the Kafka-CDC pipeline?
2. How can I get the IIDR Management Console? And if we install it on a local Windows desktop and try to connect to CDC/Kafka, which are on remote Linux servers, will it work?
3. Is there any other method to set up data ingestion from IIDR CDC to Kafka?
I am fairly new to CDC/IIDR, please help!
I own the development of the IIDR Kafka target for our CDC Replication product.
Management Console is the best way to set up the subscription initially. You can install it on a Windows box.
Technically, I believe you can use our scripting language, called CHCCLP, to set up a subscription as well. But I recommend using the GUI.
Here are links to our resources on our IIDR (CDC) Kafka Target. Search for the "Kafka" section.
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W8d78486eafb9_4a06_a482_7e7962f5ac59/page/IIDR%20Wiki
An example of setting up a subscription and replicating is shown in this video:
https://ibm.box.com/s/ur8jokg6tclsx5fcav5g86a3n57mqtd5
Management console and access server can be obtained from IBM fix central.
I have installed MC/Access server on my VM and on my personal windows box to use it against my linux VMs. You will need connectivity of course.
You can definitely follow up with our Support and they'll be able to sort you out. Plus we have docs in our knowledge centre on MC starting here.... https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.mcadminguide.doc/concepts/overview_of_cdc.html
You'll find our Kafka target is very flexible: it comes with five different formats to write data into Kafka, and you can choose to capture data in an audit format or in the Kafka-compaction-compatible "key, null for a delete" method.
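To illustrate what the "key, null for a delete" convention means for a compacted topic in general, here is a toy model in Python. It is not IIDR code and not Kafka's actual compaction, which runs per log segment and keeps delete markers around for a retention period; the record keys and values are invented for the example.

```python
def compact(log):
    """Reduce a list of (key, value) records to the latest value per key.

    A record whose value is None acts as a delete marker (tombstone):
    after compaction, the key no longer appears at all.
    """
    latest = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)   # delete: drop whatever we had for this key
        else:
            latest[key] = value     # insert or update: keep only the newest value
    return latest


if __name__ == "__main__":
    # Hypothetical change stream for a table keyed by customer id.
    log = [
        ("42", {"name": "Ada", "city": "London"}),     # insert
        ("7", {"name": "Linus", "city": "Helsinki"}),  # insert
        ("42", {"name": "Ada", "city": "Paris"}),      # update
        ("7", None),                                   # delete
    ]
    print(compact(log))
```

The audit format mentioned above serves the opposite purpose: every change is kept as its own record, so the topic holds history rather than only the current state.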
Additionally you can even use the product to write several records to several different topics in several formats from a single insert operation. This is useful if some of your consumer apps want JSON and others Avro binary. Additionally you can use this to put all the data to more secure topics, and write out just some of the data to topics that more people have access to.
We even have customers who encrypt columns in flight when replicating.
Finally the product's transformations can be parallelized even if you choose to only use one producer to write out data.
Actually, one more "finally": we also provide the option to use a special consumer that produces database ACID semantics for data written into Kafka and shredded across topics and partitions. It re-orders the data. We call it the transactionally consistent consumer. It provides operation order and bookmarks for restarting applications, and it allows parallelism for performance while still giving ordered, exactly-once, deduplicated consumption of data.
From my talk at the Kafka Summit...
https://www.confluent.io/kafka-summit-sf18/a-solution-for-leveraging-kafka-to-provide-end-to-end-acid-transactions