I've got a question about design.
We are building de-duplication processing on top of Kafka.
As far as I understand, the Kafka Streams API is the likely candidate for this.
Our topics are divided into 100 partitions. To scale the processing, we are considering running
a single Docker container (K8S pod) as a processing task per partition or per group of partitions.
My question is whether the Kafka Streams API allows creating a stream from a specific topic.
Since our time window is very long (2 weeks), the stateful store grows very large.
I would appreciate any ideas on the best way to tackle this.
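To make the question concrete, this is roughly the kind of topology I have in mind, sketched with a windowed state store (assuming Kafka Streams 3.3+, where process() can forward downstream; the topic names, the store name, and the Deduplicator class are placeholders, not our actual code):

import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class DedupeSketch {

    // Forwards a record only if its key has not been seen within the window.
    static class Deduplicator implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private WindowStore<String, Long> seen;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.seen = context.getStateStore("dedup-store");
        }

        @Override
        public void process(Record<String, String> record) {
            long now = record.timestamp();
            long windowStart = now - Duration.ofDays(14).toMillis();
            boolean duplicate;
            // Any earlier occurrence of the same key inside the window means "drop".
            try (WindowStoreIterator<Long> earlier = seen.fetch(record.key(), windowStart, now)) {
                duplicate = earlier.hasNext();
            }
            if (!duplicate) {
                seen.put(record.key(), now, now);
                context.forward(record);
            }
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedupe-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // 14-day windowed store, backed by RocksDB on local disk plus a changelog
        // topic, so the 2-week state does not have to fit in memory.
        builder.addStateStore(Stores.windowStoreBuilder(
                Stores.persistentWindowStore("dedup-store",
                        Duration.ofDays(14), Duration.ofDays(14), false),
                Serdes.String(), Serdes.Long()));

        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
                .process(Deduplicator::new, "dedup-store")
                .to("deduped-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}

The part I'm unsure about is whether each pod can be pinned to its own partition (or group of partitions) of the input topic, and how the store size behaves at that retention.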
I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting up a Kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting from "earliest")?
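For reference, this is roughly what I picture each student's consumer doing in that relaxed case (a sketch with the Java client; the topic name and bootstrap servers are made up): it assigns all partitions itself, without a consumer group, and rewinds to the beginning.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class FullTopicReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        // No group.id: each student reads independently and commits no offsets.

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Assign every partition of the topic and rewind to the beginning.
            List<TopicPartition> partitions = consumer.partitionsFor("dataset-topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            long count = 0;
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) break;   // fixed, read-only dataset: stop when drained
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    count++;                    // per-student processing would go here
                }
            }
            System.out.println("read " + count + " messages");
        }
    }
}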
An alternative is to set up HDFS storage (we use Azure, so it's called a Storage Account) and simply supply a mount point. However, I do not have control over the throughput in this case.
Throughput calculation:
Let's say 25 consumers run concurrently, each reading at 30 MB/s -> 750 MB/s in total.
Assuming data is read from disk, and disk rate is 50MB/s, I need to read concurrently from 750/50 = 15 disks.
Does that mean I need 15 brokers? I did not see how one broker can spread partitions across several disks attached to it.
similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
Processing will probably be in Spark, but not mandatory
An alternative is to set an HDFS storage (we use Azure)
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than the timestamp that you can order by, then use that.
It's not really clear whether each of the 50 students does the same processing on the data set, or whether some pre-computations can be shared. If they can, Kafka Streams KTables can be set up to aggregate static statistics of the data as it's streamed through a topic; that way you can distribute the load for those queries and not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale, or Influx, maybe Druid, which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, as it can use more executor threads than there are partitions, so it's mostly about how fast your network and disks are.
Kafka serves recently written data from memory (the OS page cache), so for your use case most reads will probably come from memory rather than disk.
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
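For instance, a rough sketch of a batch read with that option (the topic name, servers, and the value 200 are made-up assumptions), letting Spark split the 50 Kafka partitions into more tasks:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StudentReader {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("student-reader")
                .getOrCreate();

        // Batch read of the fixed, read-only topic; swap read() for readStream()
        // if a streaming job is preferred.
        Dataset<Row> df = spark.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")
                .option("subscribe", "dataset-topic")
                .option("startingOffsets", "earliest")
                .option("endingOffsets", "latest")
                // Hint: split the 50 Kafka partitions into ~200 Spark tasks.
                .option("minPartitions", "200")
                .load();

        System.out.println("messages read: " + df.count());
    }
}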
The topics are dynamically created, and there could be thousands of them. I need a way to detect when messages are produced so I can consume them. Moreover, I need to consume each topic independently so that I can then bulk-load a large number of messages into a database, each topic corresponding to a different table. So let's say I start consuming a topic: I would consume 1000 messages, bulk them into the database in one operation, then commit the offsets in Kafka. If I have 10 topics, I could use 10 consumers in parallel.

The problem is that if I end up with a large number of topics, and most of them are idle (empty), I need a way to be notified when some topics suddenly become active, so that I don't have to launch thousands of idle consumers that do nothing most of the time.
The only solution I have thought of so far is to use a single signal topic in addition to the real topics, to which the producers would also write alongside the real topic. But I was wondering if there is another solution, like polling the metadata in Kafka, maybe. From what I've seen, I would have to iterate through all the topics matching a regex, then check the offsets of the partitions for each. I don't think it's possible to do that efficiently, but maybe I'm wrong.
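For reference, this is roughly the metadata-polling approach I was picturing, as a sketch with the Java AdminClient (the topic prefix and bootstrap servers are made up); my worry is how well this scales when there are thousands of topics:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class ActiveTopicScan {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 1. List the topics of interest (here: by prefix).
            List<String> topics = new ArrayList<>();
            for (String name : admin.listTopics().names().get()) {
                if (name.startsWith("table-")) {
                    topics.add(name);
                }
            }

            // 2. Describe them to learn their partitions.
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            descriptions.forEach((topic, description) ->
                    description.partitions().forEach(p ->
                            request.put(new TopicPartition(topic, p.partition()), OffsetSpec.latest())));

            // 3. One batched request for the latest offset of every partition.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();

            // Comparing these against the offsets seen on the previous scan (or the
            // committed consumer offsets) would tell me which topics became active.
            endOffsets.forEach((tp, info) -> System.out.println(tp + " -> " + info.offset()));
        }
    }
}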
You could track JMX metrics from the broker for incoming bytes per topic using Prometheus JMX Exporter, for example, then combine that with AlertManager to send some event/webhook upon some threshold of data to a consuming REST service, which would then start some consumers (maybe Kafka Connect tasks for a database?).
Or, like you said, use a signal topic since producer requests can be made to multiple topics at once.
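A sketch of what the producer side of that could look like (the signal topic name and the record shape are just assumptions, not a fixed convention):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SignalingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String realTopic = "table-orders";   // one of the thousands of dynamic topics

            // 1. The real record goes to its own topic, as before.
            producer.send(new ProducerRecord<>(realTopic, "some-key", "some-value"));

            // 2. A tiny marker goes to the shared signal topic; its key is the real
            //    topic name, so a single consumer of the signal topic can see which
            //    topics just received data and start consumers for them.
            producer.send(new ProducerRecord<>("topic-activity-signal", realTopic, ""));

            producer.flush();
        }
    }
}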
If I have 10 topics, I could use 10 consumers in parallel
You can have more parallel consumers if any of those topics have multiple partitions.
could be thousands of them
There are reasonable limits on the number of topics a Kafka cluster can support, by the way, but it's upwards of hundreds of thousands as of the latest releases. Something to keep in mind, though.
launch thousands of idle consumers that do nothing most of the time.
You could also use solutions like AWS Lambda or Kubernetes KEDA to auto-scale consumers up/down based on topic data (lag).
Background
We have a Kafka topic with a steady stream of data. To process it we have a stateless Flink pipeline that consumes that topic and writes to another topic.
From time to time we have bursts of information that our Flink is not configured to handle. We don't want to configure our Flink pipeline and cluster to always support the maximum load we might have; we want to scale dynamically according to the load (for budget reasons $$$).
Solutions we thought of
One way to do this is to add/remove nodes to/from the Flink cluster and change the parallelism of the Flink pipeline operators. This would require stopping the Flink job with a savepoint, reconfiguring the parallelism, and restarting with the new parallelism.
This would be great but we cannot allow ourselves the downtime it produces. We have to scale up/down without downtime.
If we would use regular Kafka consumers it would be as simple as adding a consumer (assuming we have enough Kafka partitions) and Kafka would redistribute the topic partitions between all the consumers.
The Flink Kafka consumer manages the partition assignment and the offsets on its own, which is what allows exactly-once semantics (which we don't need). The drawback is that a single Flink job always consumes all of the topic's partitions.
We thought we could create another instance of Flink that would subscribe to the same topic with the same group and let Kafka distribute the partitions between them. But for that we would need the Flink Kafka consumer to let Kafka manage which partitions are assigned to which consumer.
What are we looking for
We couldn't find a library that contains such a consumer, or a configuration option for it in the existing consumer. We could write it ourselves (not so difficult), but if there is an existing solution we'd rather use it.
Are we missing something? Are we misunderstanding something? Is there a better solution?
Thanks!
The most straightforward approach, since you said that at worst you'll need double the capacity, would be to modify your topology to be able to write Kafka messages you can't process quickly enough to a second overflow Kafka topic. Both input and output Kafka topic names would be configurable. Maybe you would have a threshold backlog delay that automatically triggers this writing or maybe you would have a flag in the topology that you can externally set while the topology is running. That's a design detail you can work through that has operational implications.
This gives you a Flink topology that can handle some maximum number of messages in a timely fashion while writing the rest of the messages that can't be handled to a second Kafka topic. You can then run a second instance of the same Flink topology that reads from that secondary topic and writes, if necessary, to a third topic. If the writing to the overflow topic happens very early in the topology processing, you can chain several of these instances together via Kafka with minimal latency and without having to reconfigure and restart any topologies.
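Here is a rough sketch of that shape, assuming the newer KafkaSource/KafkaSink connectors (Flink 1.14+); the topic names, the diversion flag, and the string payloads are placeholders for your own types and configuration:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class OverflowTopology {
    // Records this instance chooses not to handle are emitted through this side output.
    private static final OutputTag<String> OVERFLOW = new OutputTag<String>("overflow") {};

    // Placeholder for the externally set flag / backlog-delay threshold described
    // above; in a real job this would come from configuration, metrics, or a
    // broadcast stream rather than a static field.
    private static volatile boolean divertToOverflow = false;

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Input topic is configurable: the first instance reads "input-topic",
        // the next instance of the same topology reads "overflow-topic-1", etc.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker1:9092")
                .setTopics("input-topic")
                .setGroupId("pipeline-instance-1")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> input =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-in");

        // Very early in the pipeline: keep what we can handle, divert the rest.
        SingleOutputStreamOperator<String> accepted =
                input.process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        if (divertToOverflow) {
                            ctx.output(OVERFLOW, value);
                        } else {
                            out.collect(value);
                        }
                    }
                });

        accepted.sinkTo(kafkaSink("output-topic"));
        accepted.getSideOutput(OVERFLOW).sinkTo(kafkaSink("overflow-topic-1"));

        env.execute("overflow-aware pipeline");
    }

    private static KafkaSink<String> kafkaSink(String topic) {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker1:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic(topic)
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
    }
}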
In Kafka, I can split my topic into many partitions. Within a consumer group, I cannot have more consumers than partitions, because the partition is the unit used to scale out a topic. If I have more load, I can increase the number of partitions, which allows me to increase the number of consumers, which allows me to have more threads/processes working on a given topic.
In Kafka, there is also the concept of a consumer group. If we have 10 consumer groups on a single topic, each consumer group gets the opportunity to process every message in the topic. Each consumer group still takes advantage of the scalability from the partitions (i.e. each consumer group can have up to 'n' consumers, where 'n' is the number of partitions on the topic). This is the beauty of Kafka: scalability and multi-channel reading are two separate concepts with two separate knobs to turn.
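To make the Kafka side concrete, here is a minimal consumer sketch (all names are placeholders); the only thing that separates "scaling out within one group" from "independent channels" is the group id:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Processes started with the SAME group id split the topic's partitions
        // between them (scaling); processes with DIFFERENT group ids each see
        // every message (independent channels).
        props.put(ConsumerConfig.GROUP_ID_CONFIG, args.length > 0 ? args[0] : "group-A");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("group=%s partition=%d offset=%d%n",
                            props.get(ConsumerConfig.GROUP_ID_CONFIG), r.partition(), r.offset());
                }
            }
        }
    }
}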
In Kinesis, we are told that if you use the Kinesis Client Library (KCL) you can get the same functionality as consumer groups by defining different Kinesis Applications. In other words, we can have different Kinesis Applications independently streaming all records from the same stream at different times.
We are also told that "Amazon Kinesis Client Library (KCL) automatically creates an Amazon DynamoDB table for each Amazon Kinesis Application to track and maintain state information such as resharding events and sequence number checkpoints."
OK, so I'm getting ready to start reading through the KCL code here, but I'm hoping someone can answer these questions to save me some time.
How does the KCL actually do this?
Are there diagrams somewhere explaining the process?
If I started a new Kinesis Application (MyKinesisApp1) after a record was already produced and consumed by all prior Kinesis Applications, will the new Kinesis Application (MyKinesisApp1) still have an opportunity to consume that record? In other words, does Kinesis remove the record from its stream after it has been processed, or does it leave it there for the 7 days no matter what?
I have seen this question here but it doesn't answer my question. Especially my third question! Also, this question does a direct comparison between two similar technologies. It will help people that know Kafka, learn Kinesis more quickly.
In the KCL configuration, there is a setting "appName", which corresponds to "Application Name", and that is the same as a "consumer group" in Kafka. For each consumer group (i.e. Kinesis Streams consumer application) there is a DynamoDB table. You can see an example DynamoDB table here (the KCL appName is 'quickstats-development'): AWS Kinesis leaseOwner confusion
No, as far as I know there is not. "Kinesis Streams" is similar to Kafka, but other than that, there is not much graphical representation of the process.
Yes. Each Kinesis Application (the equivalent of a Kafka consumer group) is tracked in its own DynamoDB table, so different Kinesis consumer applications can consume the same record independently. The checkpoint in Kinesis plays the role of the offset in Kafka, and the checkpoint stored in DynamoDB is the cursor marking the read position in a Kinesis shard. Kinesis does not remove a record when it is consumed; it stays in the stream until the retention period expires. Read this answer for a similar example: https://stackoverflow.com/a/42833193/1622134
I have the following scenario:
4 wearable sensors attached to each individual.
A potentially unbounded number of individuals.
A Kafka cluster.
I have to perform real-time processing on the data streams on a cluster with a running instance of Apache Flink.
Kafka is the data hub between the Flink cluster and the sensors.
Moreover, the subjects' streams are totally independent, and different streams belonging to the same subject are also independent of each other.
This is the setup I have in mind:
I set up a dedicated topic for each subject, and each topic is divided into 4 partitions, one for each sensor on that specific person.
In this way I thought of establishing a consumer group for every topic.
Actually, my data volume is not that big, but my interest is in building an easily scalable system. One day I might have hundreds of individuals, for instance...
My questions are:
Is this setup good? What do you think about it?
In this way I will have 4 Kafka brokers and each one handles a partition, right (without considering potential replicas)?
Destroy me guys,
and thanks in advance
You can't have an infinite number of topics in a Kafka cluster, so if you plan to scale beyond 10,000 or so topics you should consider another design. Instead of giving each individual a dedicated topic, you can use an individual's ID as the key and publish the data as key/value pairs to a smaller number of topics. In Kafka you can have an (almost) infinite number of keys.
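For example (a sketch with made-up topic and ID values), each sensor reading could be published to one shared topic, keyed by subject and sensor, so all data for a given subject lands in the same partition and stays ordered:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String subjectId = "subject-42";
            String sensorId = "sensor-3";
            String reading = "{\"ts\": 1700000000, \"value\": 0.87}";

            // One shared topic for all subjects; the key ("subjectId-sensorId")
            // keeps each stream ordered within its partition, and the number of
            // individuals is no longer limited by the number of topics.
            producer.send(new ProducerRecord<>("sensor-readings",
                    subjectId + "-" + sensorId, reading));
            producer.flush();
        }
    }
}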
Also consider more partitions. Each of your 4 brokers can handle many partitions. If you only have 4 partitions in a topic, then you can have at most 4 consumers working together in parallel in a consumer group (in your case, in Flink).