Route messages based on IP origin with SMTs to a specific destination topic - apache-kafka

I have to produce messages based on virtual IPs (those IPs all target the same Kafka cluster behind them).
So I need to extract the IP from the URL (producer request) to route the message to a specific topic before the message is persisted to Kafka.
**Example**
Static IPs available on the host machine:
192.168.0.2
192.168.0.3
192.168.0.4
192.168.0.5
Destination topics:
Dest02 for IP 192.168.0.2
Dest03 for IP 192.168.0.3
Dest04 for IP 192.168.0.4
Dest05 for IP 192.168.0.5
So I publish record001 to topicA from the service (virtual IP 192.168.0.2 set in the producer config)
=> record001 is routed to the Dest02 destination topic.
If you wonder why I want to route my messages this way, it's because I can change neither the upstream service (producer) nor the downstream services (consumers).
One more thing: I need to base this logic on the virtual IP, as it's the discriminating element I use to make the decision; otherwise I would not be able to know where to route my message.
Thanks for your help
I am investigating SMTs with an HTTP source connector to try to catch the message before it's written to the Kafka brokers, but maybe it's not a good approach.
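One way this could be wired up, if the approach turns out to be workable: run one connector instance per virtual IP (so the source IP is implied by which instance received the request) and let a per-connector SMT rewrite the destination topic. A minimal sketch follows; the connector name and class are placeholders for whichever HTTP source connector is used, while RegexRouter is a standard Apache Kafka SMT:
# One connector instance per virtual IP; the instance name is made up and
# connector.class is a placeholder for the actual HTTP source connector in use.
name=http-source-192-168-0-2
connector.class=<your HTTP source connector class>
# RegexRouter rewrites whatever topic the connector would have written to, forcing Dest02.
transforms=routeToDest
transforms.routeToDest.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.routeToDest.regex=.*
transforms.routeToDest.replacement=Dest02
The same pattern would be repeated for 192.168.0.3 => Dest03, and so on.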

Related

Find producer domain or IP sending traffic to kafka

I am receiving huge traffic to one of the topics and am unable to identify the source. Is there any way to find the IP or domain of the producer sending this traffic, using Kafka/ZooKeeper metrics or commands?

how to send message to Kafka cluster from outside of vpn

We have 3 Kafka brokers in a cluster running in our private VPN. Our client wants to send messages to our Kafka cluster. Our admin has created a public IP for only one of the 3 Kafka brokers. The client cluster is able to communicate with our VPN through this public IP (checked using Wireshark), but we are not receiving any messages.
Do we need to create public IPs for the remaining 2 Kafka brokers as well?
Can anyone suggest which configs we should change to make this work?
Clients are required to communicate directly with the leader brokers of the partitions, so all brokers need to advertise some address for themselves that is resolvable through the VPN.
I think you're using the VPN term wrong. You probably mean VPC or some kind of private network.
As mentioned by OneCricketeer, you must assign a public address to every broker. You must also set the advertised.listeners property (or KAFKA_ADVERTISED_LISTENERS) to the public IP address, or to a public DNS name pointing to that public address; otherwise your client will not be able to connect.
More details: https://www.confluent.io/blog/kafka-listeners-explained/
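For example, a broker bound on its private interface but advertising its public address might look like the sketch below (placeholder addresses from documentation ranges; adjust ports and listener names to your environment). Each of the 3 brokers needs its own advertised public address.
# Bind on all interfaces, advertise the private IP to other brokers and the
# public IP to external clients (both addresses here are placeholders).
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
advertised.listeners=INTERNAL://10.0.0.11:9092,EXTERNAL://203.0.113.11:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL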

how to change Kafka broker list ip

I have 3 Kafka brokers running in an isolated network region. My client cannot connect to them directly, so I have to use a VIP (virtual IP) to connect to the brokers.
For example:
my brokers' IPs are: 10.5.1.5, 10.5.1.6, 10.5.1.7
my VIPs are: 200.100.1.5, 200.100.1.6, 200.100.1.7, paired one to one with the brokers.
So when I set the bootstrap list to 200.100.1.5, the cluster responds with a mix of VIPs and broker IPs, such as 10.5.1.5, 10.5.1.6, 200.100.1.5, 200.100.1.6, ..., and then the connection fails, because my program cannot reach the brokers' IPs, only the VIPs.
My current configuration is as follows; it responds with both the IPs and the VIPs:
listeners=INTERNAL://:9092,EXTERNAL_PLAINTEXT://:8080
advertised.listeners=EXTERNAL_PLAINTEXT://200.100.1.5:8080,INTERNAL://10.5.1.5:9092
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL_PLAINTEXT:PLAINTEXT
inter.broker.listener.name=INTERNAL
How can I make Kafka respond with only the VIP list?
I've got the answer; it could be the following:
advertised.listeners=PLAINTEXT://200.100.1.5:8080
listeners=PLAINTEXT://10.5.1.5:9092
And remove the listener.security.protocol.map and inter.broker.listener.name settings.
You can use the broker setting called advertised.listeners to tell your brokers to include a different IP/hostname in their response to clients.
advertised.listeners:
Listeners to publish to ZooKeeper for clients to use, if different
than the listeners config property. In IaaS environments, this may
need to be different from the interface to which the broker binds. If
this is not set, the value for listeners will be used. Unlike
listeners it is not valid to advertise the 0.0.0.0 meta-address.
In your example, for the first broker you can have:
advertised.listeners=PLAINTEXT://200.100.1.5:9092
listeners=PLAINTEXT://10.5.1.5:9092
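Following the same pattern (and assuming the same port layout), the remaining two brokers from the question would each advertise their own VIP:
# broker 2
advertised.listeners=PLAINTEXT://200.100.1.6:9092
listeners=PLAINTEXT://10.5.1.6:9092
# broker 3
advertised.listeners=PLAINTEXT://200.100.1.7:9092
listeners=PLAINTEXT://10.5.1.7:9092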

kafka bootstrap.servers as DNS A-Record with multiple IPs

I have a Kafka cluster with 5 brokers, and I'm using Consul service discovery to put their IPs into a DNS record.
kafka.service.domain.cc A 1.1.1.1 2.2.2.2 ... 5.5.5.5
Is it recommended to use only one domain name:
kafka.bootstrap.servers = kafka.service.domain.cc:30000
or is it better to have multiple domain names (at least 2), each resolving to one broker,
kafka1.service.domain.cc A 1.1.1.1
kafka2.service.domain.cc A 2.2.2.2
and then use them in Kafka:
kafka.bootstrap.servers = kafka1.service.domain.cc:30000,kafka2.service.domain.cc:30000
My concern with the first approach is that the domain name will be resolved only once, to a random broker, and if that broker is down, no new DNS resolution will take place.
From the book Mastering Apache Kafka:
bootstrap.servers is a comma-separated list of host and port pairs
that are the addresses of the Kafka brokers in a "bootstrap" Kafka
cluster that a Kafka client connects to initially to bootstrap itself.
bootstrap.servers provides the initial hosts that act as the
starting point for a Kafka client to discover the full set of alive
servers in the cluster. Since these servers are just used for the
initial connection to discover the full cluster membership (which may
change dynamically), this list does not have to contain the full set
of servers (you may want more than one, though, in case a server is
down).
Clients (producers or consumers) make use of all servers irrespective
of which servers are specified in bootstrap.servers for bootstrapping.
So, since the property bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster, I think both approaches will work. But since they kept the value of the property as a comma-separated list, I guess the second approach is the recommended one. Another problem with approach 1 is that, while bootstrapping, the randomly chosen broker may be down, and the client will not get the cluster information it needs to continue. So it is always better to provide more than one host as a fallback in case one broker is down during bootstrapping.
Kafka 2.1 included support for handling multiple DNS resource records in bootstrap.servers.
If you set client.dns.lookup="use_all_dns_ips" in your client configuration, it will use all of the IP addresses returned by DNS, not just the first (or a random one).
See KIP-235 and KIP-302 for more information.
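For example, a client using the single round-robin record from the question might be configured like this (a sketch assuming Kafka 2.1+ clients and the hostname and port given in the question):
# Resolve the A record and try every returned IP, not just the first one.
bootstrap.servers=kafka.service.domain.cc:30000
client.dns.lookup=use_all_dns_ips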

Zookeeper - what will happen if I pass in a connection string only some of the nodes from the zk cluster (ensemble)?

I have a ZooKeeper cluster consisting of N nodes (which know about each other). What if I pass only M < N of the nodes' addresses in the ZooKeeper client connection string? What will the cluster's behavior be?
In a more specific case, what if I pass the host address of only 1 ZooKeeper node from the cluster? Is it possible then for the ZooKeeper client to connect to other hosts in the cluster? What if this one host is down? Will the client be able to connect to other ZooKeeper nodes in the ensemble?
The other question is: is it possible to limit the client to using only specific nodes from the ensemble?
What if I pass only M < N of the nodes' addresses in the ZooKeeper client
connection string? What will the cluster's behavior be?
ZooKeeper clients will connect only to the M nodes specified in the connection string. The ZooKeeper ensemble's back-end interactions (leader election and processing write transaction proposals) will continue to be processed by all N nodes in the cluster. Any of the N nodes still could become the ensemble leader. If a ZooKeeper server receives a write transaction request, and that server is not the current leader, then it will forward the request to the current leader.
In a more specific case, what if I pass the host address of only 1 ZooKeeper node
from the cluster? Is it possible then for the ZooKeeper client to connect to other
hosts in the cluster? What if this one host is down? Will the client be
able to connect to other ZooKeeper nodes in the ensemble?
No, the client would only be able to connect to the single address specified in the connection string. That address effectively becomes a single point of failure for the application, because if the server goes down, the client will not have any other options for establishing a connection.
The other question is: is it possible to limit the client to using only specific nodes from the ensemble?
Yes, you can limit the nodes that the client considers for establishing a connection by listing only those nodes in the client's connection string. However, keep in mind that any of the N nodes in the cluster could still become the leader, and then all client write requests will get forwarded to that leader. In that sense, the client is using the other nodes indirectly, but the client is not establishing a direct socket connection to those nodes.
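For example, listing only some of the ensemble members in the connection string (placeholder hostnames below) limits the client's sessions to those servers, even though the whole ensemble still participates in leader election and write replication:
bin/zkCli.sh -server zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181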
The ZooKeeper Overview page in the Apache documentation has further discussion of client and server behavior in a ZooKeeper cluster. For example, there is a relevant quote in the Implementation section:
As part of the agreement protocol all write requests from clients are
forwarded to a single server, called the leader. The rest of the
ZooKeeper servers, called followers, receive message proposals from
the leader and agree upon message delivery. The messaging layer takes
care of replacing leaders on failures and syncing followers with
leaders.