How to set up an Apache Kafka consumer to get data across the internet?

I've configured two Kafka instances with basic settings on two different servers across the internet, one in the UK and one in India. My scenario is very simple: the UK is the publisher and India is the consumer, but neither of them can get any data.
I've checked my firewalls; there is no port blocking whatsoever. I've also tested my scenario with Redis pub/sub and it worked, but I wasn't successful with Kafka.
How should I set up my Kafka instances to do this? Or is it possible at all with Kafka?

Kafka is not recommended when you want to interact across multiple data centres. Kafka is designed to give you high throughput when you are producing and consuming from the same data centre, where network latency is minimal.
Why?
Once you have consumers in a different data centre, latency comes into play and affects all the coordination Kafka does with consumers (group rebalancing, offset commits, heartbeats). With producers in a different data centre, the latency of getting acks for each message sent is considerable, slowing down the rate at which you can produce messages.
So, in theory, you can very well have the setup if your network is reliable.
Now, if you are thinking of distributing Kafka brokers across data centres, it will be more costly. All inter-broker communication will be delayed, effectively creating replica lag, a lot of network calls (over the internet), broker heartbeat timeouts, etc. Again, it is theoretically feasible.
In practice, for these scenarios it is better to have a local Kafka cluster in each DC, where applications hosted locally produce and consume messages, and use MirrorMaker to aggregate messages between data centres.
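As an illustration, here is a minimal sketch of what that aggregation could look like with MirrorMaker 2 (the connect-mirror-maker.properties format); the cluster aliases and addresses below are made up:

# Hypothetical cluster aliases, one per data centre
clusters = uk, india
uk.bootstrap.servers = uk-kafka.example.com:9092
india.bootstrap.servers = india-kafka.example.com:9092

# Replicate topics produced in the UK cluster into the India cluster
uk->india.enabled = true
uk->india.topics = .*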

Found the solution:
In the Apache Kafka broker config (server.properties), add the following line:
advertised.listeners=PLAINTEXT://xxx.xxx.xxx.xxx:pppp
# x = your IP
# p = your port
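For context, a slightly fuller sketch of the relevant broker settings in server.properties; the public IP here is a placeholder from the documentation address range:

# Interface the broker binds to (all interfaces, so remote clients can connect)
listeners=PLAINTEXT://0.0.0.0:9092
# Address the broker hands back to clients in metadata; it must be reachable from the
# consumer's network, otherwise clients connect to the bootstrap address but never receive data
advertised.listeners=PLAINTEXT://203.0.113.10:9092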

Related

How to best position Apache Kafka to be a message broker for many isolated clients

My organization has an AWS-hosted Spring Boot application with Apache Kafka currently facilitating message exchange for ~50 topics ("the cloud application"). Within client facilities (physical locations) we have a processing machine which handles logic and commands from the cloud. All clients have their own local machines. The desire is to allow the clients/cloud to make use of the same Kafka topics but disallow one client from receiving the others.
Running many separate Kafka instances does not scale. What is the appropriate way to enable Kafka to do this?
"Running many separate Kafka instances does not scale"
That depends on how you manage it. You can use Ansible/Puppet/Chef and Terraform to quickly set up Kafka clusters in any environment.
You can use MirrorMaker or Kafka Connect to pull topics from a cloud datacenter into a "physical" private one.
But, the end result is that you have some consumer pulling data from a remote cluster and processing it.
"disallow one client from receiving the others"
"Other" what? Other datacenters? That would be a network rule, not a Kafka problemm, IMO.
Other Kafka topics? You can set up SASL + JAAS to get a basic authentication layer.
More information - Kafka Security 101 (old post, information may be somewhat outdated)
Docs on Kafka Security
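As a rough sketch of what the client side of such a SASL + JAAS layer could look like with the SASL/PLAIN mechanism (the broker address, credentials, and topic name below are all made up for illustration):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class Client1Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9093");   // hypothetical broker address
        props.put("security.protocol", "SASL_SSL");                 // TLS + SASL authentication
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"client-1\" password=\"client-1-secret\";");
        props.put("group.id", "client-1-consumers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // With broker-side ACLs restricting this principal, it can only read its own topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("client-1-commands"));
        }
    }
}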
I don't see an ideal solution here, but if your load requires Kafka and you need client isolation by authorization, then the way to go would be to create a destination topic per client and enforce ACLs on topic READ / WRITE as mentioned in the post above.
The possible drawback of this approach is a performance hit, or the need to grow the cluster enough to support the load and keep the required SLAs.
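A hedged sketch of enforcing that per-client topic isolation with the Java AdminClient ACL API (the principal and topic names are invented; the same ACL can also be created with the kafka-acls command-line tool):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class PerClientAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9093");   // hypothetical admin connection

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the "client-1" principal to read its own destination topic; with the broker
            // configured to deny by default, other clients cannot read this topic.
            AclBinding readOwnTopic = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "client-1-commands", PatternType.LITERAL),
                    new AccessControlEntry("User:client-1", "*",
                            AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(Collections.singletonList(readOwnTopic)).all().get();
        }
    }
}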
Planning and calculating estimations
As presented in the blog post about partition number optimization, the general rule of thumb for keeping your Kafka cluster safe, and our first step, is:
NumPartitionsPerBroker = 100 x NumOfBrokers x ReplicationFactor
Where:
NumPartitionsPerBroker = Maximum load of partitions on a single Kafka broker in the cluster.
NumOfBrokers = Number of Kafka brokers in the current cluster setting.
ReplicationFactor = Default / Average replication factor, essentially how many peer brokers can share the load of partition leadership.
The next step would be to figure out how many partitions you expect over the coming months / years:
TotalExpectedPartitions =
(NumTopics x AvgNumParts) x
[(1 + % GrowthTopics) x (1 + % GrowthParts)] ^ TimeInterval
Where:
NumTopics = Number of topics
AvgNumParts = Average number of partitions per topic (producers / consumers per topic)
GrowthTopics = Expected growth in topics
GrowthParts = Expected growth in partitions
TimeInterval = Estimation of how many Months / Years / etc are you planning ahead
And finally, the two numbers should line up so that the cluster's capacity covers the expected total:
NumPartitionsPerBroker x NumOfBrokers >= TotalExpectedPartitions
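As a purely illustrative worked example (all numbers invented): with 3 brokers and a replication factor of 2, the guideline gives 100 x 3 x 2 = 600 partitions per broker, or 600 x 3 = 1800 across the cluster. Starting from 50 topics of 10 partitions each (500 partitions) and assuming 10% yearly growth in both topics and partitions over a 2-year horizon, TotalExpectedPartitions = 500 x (1.1 x 1.1)^2 ≈ 732, which fits comfortably within that capacity.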
Hope this helps :)

What are the practical limits of Kafka regex-topics / listening to multiple topics

I am exploring different PubSub platforms and I was wondering what the limits are in Kafka for listening to multiple topics. Consider for instance this use case: we have trains, station entry gates, and devices that all publish their telemetry. Currently this is done on an MQ, but as data rates increase (smart trains, etc.) we need to move to a new PubSub/streaming platform, and Kafka is of course on that list.
As I see it there are two strategies for aggregating this telemetry into a stream:
aggregate on consumption, in which each train/device initially gets its own topic and topic aggregation is done using a regex-topic / virtual topic
aggregate on production, in which all trains produce to a single topic and consumers use filters if necessary to single out individual producers
As I understand it, Kafka is not particularly suited to a high number of topics (>10,000), but it could be done. Would a regex-topic be able to aggregate 2,000-3,000 topics?
From the technical point of view, it could be done; but in practice, this is not common. Why? ZooKeeper. It is advised for a cluster to have a maximum of around 4,000 partitions per broker. This is partly due to the overhead of performing leader election for all of them in ZooKeeper.
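For what it's worth, the pattern subscription itself is a one-liner in recent Kafka clients; the regex is not the bottleneck, the total number of partitions behind it is. A minimal sketch, assuming a hypothetical telemetry.* topic naming scheme:

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TelemetryAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "telemetry-aggregator");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches every existing topic named telemetry.<something>; newly created
            // topics that match are picked up on the next metadata refresh.
            consumer.subscribe(Pattern.compile("telemetry\\..*"));
            consumer.poll(Duration.ofMillis(500));
        }
    }
}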
I recommend you to read these blog posts about this interesting topic on Confluent's blog:
How to choose the number of topics/partitions in a Kafka cluster?
Apache Kafka Supports 200K Partitions Per Cluster
Apache Kafka Made Simple: A First Glimpse of a Kafka Without ZooKeeper

How many Partitions can we have in Kafka?

I have a requirement in my IoT project: a custom Java application called "NorthBound" (NB) can manage a maximum of 3,000 devices. Devices send data to SouthBound (SB, a Java application), SB sends data to Kafka, and from Kafka NB consumes the messages.
To manage around 100K devices, I am planning to start multiple instances (around 35) of NorthBound, but I want the same instance to always receive messages from the same devices, e.g. Device1 sends data to NB_instance1, Device2 sends data to NB_instance2, etc.
To handle this, I am thinking of creating 35 partitions of the same topic (Device-Messages) so that each NB instance consumes one partition and the same device's data always goes to the same NB instance. Is this the right approach, or is there a better way?
How many partitions can we have in a Kafka cluster, and what is a recommended value for a cluster with 3 nodes (brokers)?
Currently, we have only 1 node in Kafka. Can we continue with a single node and 35 partitions?
Say on startup I might have only 5-6K devices; then I will have only 2 partitions with 2 NB instances. Gradually, as we add more devices, we will keep adding more partitions and NB instances. Can we do that without restarting Kafka? Is it possible to create partitions dynamically?
Regards,
Krishan
As you can imagine the number of partitions you can have depends on a number of factors.
Assuming you have recent hardware, since Kafka 1.1 you can have thousands of partitions per broker. Moreover, Kafka has been tested with over 100,000 partitions in a cluster (Link 1).
As a rule of thumb, it's recommended to over-partition a bit in order to allow for future growth in traffic/usage. Kafka allows you to add partitions at runtime, but that changes the partitioning of keyed messages, which can be an issue depending on your use case.
Finally, it's not recommended to run a single broker for production workloads: if it were to crash or fail, you'd be exposed to an outage and possibly data loss. It's best to have at least 2 brokers with a replication factor of 2, even with only 35 partitions.
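To illustrate the keyed-message point above: if SB uses the device ID as the record key, the default partitioner hashes it, so a given device's messages always land in the same partition and therefore reach the same NB instance, as long as the partition count doesn't change. A minimal sketch, with made-up device data:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SouthBoundProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String deviceId = "device-0042";          // hypothetical device ID, used as the record key
            String payload = "{\"temp\": 21.5}";      // hypothetical telemetry payload
            // Same key -> same partition -> same consuming NB instance.
            producer.send(new ProducerRecord<>("Device-Messages", deviceId, payload));
        }
    }
}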

How does Kafka message processing scale in publish-subscribe mode?

All, forgive me, I am a newbie, just a beginner with Kafka. I am currently reading the Kafka documentation about the difference between traditional messaging systems like ActiveMQ and Kafka.
As the documentation puts it, traditional messaging systems cannot scale the message processing, since:
Publish-subscribe allows you broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
I think this makes sense to me.
But for Kafka, the documentation says that Kafka can scale the message processing even in publish-subscribe mode. (Please correct me if I am wrong. Thanks.)
The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.
The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
So my question is: how does Kafka do this? I mean scaling the processing in publish-subscribe mode. Thanks.
The main unique features in Kafka that enable scalable pub/sub are:
1. Partitioning individual topics and spreading the active partitions across multiple brokers in the cluster to take advantage of more machines, disks, and cache memory. Producers and consumers often connect to many or all nodes in the cluster, not just a single master node for a given topic/queue.
2. Storing all messages in a sequential commit log and not deleting them when consumed. This leads to more sequential reads and writes, and relieves the broker from having to keep track of different copies of messages, delete individual messages, handle fragmentation, or track which consumer has acknowledged consuming which messages.
3. Enabling smart parallel processing of individual consumers and consumer groups in a way that each parallel message stream can come from the distributed partitions mentioned in #1, while offloading the offset management and partition assignment logic onto the clients themselves. Kafka scales with more consumers because the consumers do some of the work (unlike most other pub/sub brokers, where the bulk of the work is done in the broker).
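A small sketch of how this looks from the client side in recent Kafka clients (the topic and group names are made up): consumers in different groups each receive every message on the topic, while consumers sharing a group.id split the topic's partitions between them.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemo {
    // Helper: build a consumer that belongs to the given consumer group.
    static KafkaConsumer<String, String> newConsumer(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // "billing" and "analytics" both get every record on the topic (pub/sub across groups);
        // several processes sharing the group.id "billing" would split its partitions instead
        // (queue-like scaling within a group).
        KafkaConsumer<String, String> billing = newConsumer("billing");
        KafkaConsumer<String, String> analytics = newConsumer("analytics");
        billing.subscribe(Collections.singletonList("events"));
        analytics.subscribe(Collections.singletonList("events"));

        for (ConsumerRecord<String, String> r : billing.poll(Duration.ofMillis(500))) {
            System.out.printf("billing got %s from partition %d%n", r.value(), r.partition());
        }
        billing.close();
        analytics.close();
    }
}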

Does Kafka scale well for big number of clients?

I know Kafka can handle tons of traffic. However, how well does it scale to a large number of concurrent clients?
Each client would have its own unique group_id (and as a consequence Kafka would be keeping track of each one's offsets).
Would that be an issue for Kafka 0.9+ with offsets stored internally?
Would that be an issue for Kafka 0.8 with offsets stored in Zookeeper?
Some Kafka users such as LinkedIn have reported in the past that a single Kafka broker can support ~10K client connections. This number may vary depending on hardware, configuration, etc.
As long as the request rate is not too high, the limiting factor is probably just the open-file-descriptors limit as configured in the operating system; see e.g. http://docs.confluent.io/current/kafka/deployment.html#file-descriptors-and-mmap for more information.