Apache Flink Kafka Integration Partition Seperation - apache-kafka

I need to implement below data flow. I have one kafka topic which has 9 partitions. I can read this topic with 9 parallelism level. I have also 3 node Flink cluster. Each of nodes of this cluster has 24 task slot.
First of all, I want to spread my kafka like, each server has 3 partition like below. Order is not matter, I only transform kafka message and send it DB.
Second thing is, I want to increase my parallelism degree while saving NoSQL DB. If I increase my parallelism 48, since sending DB is IO operation, it does not consume CPU, I want to be sure, When Flink rebalance my message, my message will stay in the same server.
Is there any advice for me?

If you want to spread you Kafka readers across all 3 nodes, I would recommend to start them with 3 slots each and set the parallelism of the Kafka source to 9.
The problem is that at the moment it is not possible to control how tasks are placed if there are more slots available than the required parallelism. This means if you have fewer sources than slots, then it might happen that all sources will be deployed to one machine, leaving the other machines empty (source-wise).
Being able to spread out tasks across all available machines is a feature which the community is currently working on.

Related

Can I use Kafka for multiple independent consumers sequential reads?

I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting with "earliest)?
An alternative is to set an HDFS storage (we use Azure so it's called Storage Account) and simply supply a mount point. However, I do not have control of the throughput in this case.
Throughput calculation:
let's say 25 consumers run concurrently, each reading at 30MB/s -> 750MB/s .
Assuming data is read from disk, and disk rate is 50MB/s, I need to read concurrently from 750/50 = 15 disks.
Does it mean I need to have 15 brokers? I did not see how one broker can allocate partitions to several disks attached to it.
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
Processing will probably be in Spark, but not mandatory
An alternative is to set an HDFS storage (we use Azure)
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
If want to use Kafka, don't base consumption rate on disk speed alone, especially when Kafka can do zero-copy transfers. Use kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than timestamp that you can order by, then use that.
It's not really clear if each "50 students" does the same processing on the data set, or some pre computations can be done, but if so, Kafka Streams KTables can be setup to aggregate some static statistics of the data, if it's all streamed though a topic, that way, you can distribute load for those queries, and not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale or Influx, maybe Druid . Which could also be used with Spark, or queried directly.
If you are using Apache Spark 3.0+ there are ways around consumer per partition bound, as it can use more executor threads than partitions are, so it's mostly about how fast your network and disks are.
Kafka stores latest offsets in memory, so probably for your use case most of reads will be from memory.
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html

kafka consume an undefined number of topics

The topics are dynamically created, and there could be thousands of them. I need a way to detect when messages are produced so I can consume them. Moreover, I need to consume each topic independently so that I can then bulk a large number of messages into a database, each topic corresponding to a different table. So let's say I start consuming a topic, I would consume 1000 messages, bulk them in a database in one operation, then commit the reading in kafka. If I have 10 topics, I could use 10 consumers in parallel. The problem is if I end up with a large number of topics, and that most of them are idle (empty), I need a way to be notified that some topics become suddenly active, so that I don't have to launch thousands of idle consumers that do nothing most of the time.
The only solution I thought so far is using a single signal topic in addition to the real topics, in which the producers would produce in addition to the real topic. But I was wondering if there was another solution. Like polling the meta-data in kafka, maybe. But for what I've seen, I would have to iterate through all the topics matching a regex, then check the offsets of the partitions for each. I don't think it's possible to do that efficiently, but maybe I'm wrong.
You could track JMX metrics from the broker for incoming bytes per topic using Prometheus JMX Exporter, for example, then combine that with AlertManager to send some event/webhook upon some threshold of data to a consuming REST service, which would then start some consumers (maybe Kafka Connect tasks for a database?).
Or, like you said, use a signal topic since producer requests can be made to multiple topics at once.
If I have 10 topics, I could use 10 consumers in parallel
You can have more parallel consumers if any of those topics have multiple partitions
could be thousands of them
There's are reasonable limits on the number of topics a Kafka cluster can support, by the way, but it's upwards of hundreds of thousands, as of latest releases. Something to keep in mind, though.
launch thousands of idle consumers that do nothing most of the time.
You could also use solutions like AWS Lambda or Kubernetes KEDA to auto scale up/down based on topic data (lag)

Kafka Random Access to Logs

I am trying to implement a way to randomly access messages from Kafka, by using KafkaConsumer.assign(partition), KafkaConsumer.seek(partition, offset).
And then read poll for a single message.
Yet i can't get past 500 messages per second in this case. In comparison if i "subscribe" to the partition i am getting 100,000+ msg/sec. (#1000 bytes msg size)
I've tried:
Broker, Zookeeper, Consumer on the same host and on different hosts. (no replication is used)
1 and 15 partitions
default threads configuration in "server.properties" and increased to 20 (io and network)
Single consumer assigned to a different partition each time and one consumer per partition
Single thread to consume and multiple threads to consume (calling multiple different consumers)
Adding two brokers and a new topic with it's partitions on both brokers
Starting multiple Kafka Consumer Processes
Changing message sizes 5k, 50k, 100k -
In all cases the minimum i get is ~200 msg/sec. And the maximum is 500 if i use 2-3 threads. But going above, makes the ".poll()", call take longer and longer (starting from 3-4 ms on a single thread to 40-50 ms with 10 threads).
My naive kafka understanding is that the consumer opens a connection to the broker and sends a request to retrieve a small portion of it's log. While all of this has some involved latency, and retrieving a batch of messages will be much better - i would imagine that it would scale with the number of receivers involved, with the expense of increased server usage on both the VM running the consumers and the VM running the broker. But both of them are idling.
So apparently there is some synchronization happening on broker side, but i can't figure out if it is due to my usage of Kafka or some inherent limitation of using .seek
I would appreaciate some hints of whether i should try something else, or this is all i can get.
Kafka is a streaming platform by design. It means there are many, many things has been developed for accelerating sequential access. Storing messages in batches is just one thing. When you use poll() you utilize Kafka in such way and Kafka do its best. Random access is something for what Kafka don't designed.
If you want fast random access to distributed big data you would want something else. For example, distributed DB like Cassandra or in-memory system like Hazelcast.
Also you could want to transform Kafka stream to another one which would allow you to use sequential way.

How many Partitions can we have in Kafka?

I have a requirement in my IoT project like, a custom java application called "NorthBound" (NB) can manage 3000 devices maximum. Devices send data to SouthBound (SB - Java Application), SB sends data to Kafka and from Kafka, NB consume the messages.
To manage around 100K devices, I am planning to start multiple instances (around 35) of NorthBound, but i want same instance should receive the messages from same devices. e.g. Device1 is sending data to NB_instance1, Device2 is sending data to NB_instance2 etc.
To handle this, i am thinking of creating 35 partitions of same topic (Device-Messages) so that each NB instance can consume one partition and same device's data should go to same NB instance. Is it the right approach? Or is there any better way?
How many partitions can we make in a Kafka cluster? and What is a recommended value considering 3 nodes (Brokers) in a cluster?
Currently, we have only 1 node in Kafka. Can we continue with single node and 35 partitions?
Say on startup I might have only 5-6K devices, then I will have only 2 partitions with 2 NB instances. Gradually when we add more devices, we will keep adding more partitions and NB instances. Can we do it without restarting Kafka? Is it possible to create partitions dynamically?
Regards,
Krishan
As you can imagine the number of partitions you can have depends on a number of factors.
Assuming you have recent hardware, since Kafka 1.1, you can have 1000s of partitions per broker. Moreover Kafka has been tested with over 100000 partitions in a cluster. Link 1
As a rule of thumb, it's recommended to over partition a bit in order to allow future growth in traffic/usage. Kafka allows to add partitions at runtime but that will change partitioning of keyed messages which can be an issue depending on your use case.
Finally, it's not recommended to run a single broker for production workloads as if it was to crash or fail, you'd be exposed to an outage and possibly data loss. It's best to at least have 2 of them with a replication factor of 2 even with only 35 partitions.

Hint about kafka cluster setup

I have the following scenario:
4 wearable sensors attached on individuals.
Potentially infinite individuals.
A Kafka cluster.
I have to perform real-time processing on data streams on a cluster with a running instance of apache flink.
Kafka is the data hub between flink cluster and sensors.
Moreover, subject's streams are totally independent and also different streams belonging to same subject are independent each other.
I imagine this setup in my mind:
I set a specific topic for each subject and each topic is partitioned in 4 partition, each one for each sensor on specific person.
In this way I though to establish a consumer group for every topic.
Actually, my data amount is not so much big but mine interest is to build an easily scalable system. A day maybe I can have hundreds of individuals for instance...
My questions are:
Is this setup good? What do you think about it?
In this way I will have 4 kafka broker and each one handles a partition, right (without consider potential backups)?
Destroy me guys,
and thanks in advance
You can't have an infinite number of topics in a Kafka cluster so if you plan to scale beyond 10,000 or more topics then you should consider another design. Instead of giving each individual a dedicated topic, you can use an individual's ID as a key and publish data as a key/value pair to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
Also consider more partitions. Each of your 4 brokers can handle many partitions. If you only have 4 partitions in a topic then you can only have at most 4 consumers working together in parallel in a consumer group (in your case in Flink)