Is it possible to configure Spark with the spark-streaming-kafka-0-10 library to read multiple Kafka partitions or an entire Kafka topic with a single task instead of creating a different Spark task for every Kafka partition available?
Please excuse my rough understanding of these technologies; I'm still new to Spark and Kafka. The architecture and settings are mostly just experimentation to see how these technologies work together.
I have four virtual hosts, one with a Spark master and each with a Spark worker. One of the hosts also runs a Kafka broker, based on Spotify's Docker image. Each host has four cores and about 8 GB of unused RAM.
The Kafka broker has 206 topics, and each topic has 10 partitions. So there are a total of 2,060 partitions for applications to read from.
I'm using the spark-streaming-kafka-0-10 library (currently experimental) to subscribe to topics in Kafka from a Spark Streaming job. I am using the SubscribePattern class to subscribe to all 206 topics from Spark:
import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("(pid\\.)\\d+"), kafkaParams)
)
When I submit this job to the Spark master, it looks like 16 executors are started, one for each core in the cluster. It also looks like each Kafka partition gets its own task, for a total of 2,060 tasks. I think my cluster of 16 executors is having trouble churning through so many tasks because the job keeps failing at different points between 1500 and 1800 tasks completed.
I found a tutorial by Michael Noll from 2014 which addresses using the spark-streaming-kafka-0-8 library to control the number of consumer threads for each topic:
val kafkaParams: Map[String, String] = Map("group.id" -> "terran", ...)
val consumerThreadsPerInputDstream = 3
val topics = Map("zerg.hydra" -> consumerThreadsPerInputDstream)
val stream = KafkaUtils.createStream(ssc, kafkaParams, topics, ...)
Is it possible to configure Spark with the spark-streaming-kafka-0-10 library to read multiple Kafka partitions or an entire Kafka topic with a single task instead of creating a different Spark task for every Kafka partition available?
You could alter the number of generated partitions by calling repartition on the stream, but then you lose the 1:1 correspondence between Kafka and RDD partitions.
The number of tasks generated from the Kafka partitions isn't related to the fact that you have 16 executors. The number of executors depends on your settings and the resource manager you're using.
There is a 1:1 mapping between Kafka partitions and RDD partitions with the direct streaming API: each executor gets a subset of these partitions to consume from Kafka and process, and each partition is independent and can be computed on its own. This is unlike the receiver-based API, which creates a single receiver on an arbitrary executor and consumes the data itself via threads on that node.
If you have 206 topics with 10 partitions each, you had better have a decent-sized cluster that can handle the load of the generated tasks. You can control the maximum number of messages generated per partition, but you can't alter the number of partitions unless you're willing to incur the shuffle caused by the repartition transformation.
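For example, here is a minimal sketch of those two knobs (the app name, rate limit, and partition count are arbitrary placeholders, and stream is the direct stream created above):

import org.apache.spark.SparkConf

// Cap how many records each Kafka partition contributes per second to a batch.
val conf = new SparkConf()
  .setAppName("kafka-direct-example")                        // placeholder app name
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // records/sec per Kafka partition

// Reducing the number of partitions trades the 1:1 Kafka-to-RDD mapping for a shuffle.
val coalesced = stream.map(record => record.value).repartition(32) // 32 is an arbitrary example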
The second approach will suit your requirements best; you only have to set consumerThreadsPerInputDstream = 1. That way only one thread is created per read operation, so a single machine in the cluster handles the consumption.
Related
I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and read-only.
The data should be read at a "reasonable speed" of more than 30 MB/s for each consumer.
I was thinking of setting up a Kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting from "earliest")?
An alternative is to set up HDFS storage (we use Azure, so it's called a Storage Account) and simply supply a mount point. However, I do not have control over the throughput in this case.
Throughput calculation:
Let's say 25 consumers run concurrently, each reading at 30 MB/s -> 750 MB/s.
Assuming data is read from disk, and the disk rate is 50 MB/s, I need to read concurrently from 750/50 = 15 disks.
Does it mean I need to have 15 brokers? I did not see how one broker can allocate partitions to several disks attached to it.
"Processing will probably be in Spark, but not mandatory"
"An alternative is to set an HDFS storage (we use Azure)"
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
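As a rough sketch of that route (the storage account, container, path, and file format below are placeholders; it assumes a SparkSession named spark and the hadoop-azure/ABFS connector on the classpath):

spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net", // placeholder account
  "<storage-account-key>")

val dataset = spark.read.parquet(
  "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/timeseries/") // placeholder path

dataset.count() // each student scans the full read-only dataset independently; add executors to scale throughput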
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go with one partition. Or, better, if your data has some key other than the timestamp that you can order by, then use that.
It's not really clear whether each of the 50 students does the same processing on the dataset, or whether some precomputation can be done, but if so, Kafka Streams KTables can be set up to aggregate some static statistics of the data, provided it's all streamed through a topic. That way you can distribute the load for those queries and not need 50 parallel consumers.
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, Timescale, or InfluxDB, or maybe Druid, which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, since Spark can use more executor threads than there are Kafka partitions, so it's mostly about how fast your network and disks are.
Kafka serves recently written data from the OS page cache, so for your use case most reads will probably come from memory.
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
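A minimal Structured Streaming sketch of that option (the broker, topic, and the value 200 below are placeholders, and a SparkSession named spark is assumed):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
  .option("subscribe", "timeseries")                 // placeholder topic
  .option("startingOffsets", "earliest")
  .option("minPartitions", "200")                    // hint: split Kafka partitions into ~200 Spark partitions
  .load()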
I'm working on a project that's implemented with spring-cloud-stream on top of Apache Kafka. We are currently tuning the Kafka configs to make the application more performant.
One thing I noticed is the num.stream.threads config for the Kafka Streams application. From reading the documentation, my understanding of this config is that once this value is set, the Streams application will have that number of threads to process the Kafka streams.
However, when reading the application's log I saw that these stream threads are also assigned to the input topic partitions.
Here are my questions:
If stream threads are responsible for consuming from the input topics, does that mean we cannot have more stream threads than the total number of partitions of the input topics?
If my assumption in the first question is correct, then what about multiple input topics for one Streams application? Let's say we have a Streams application that consumes from 2 input topics and produces to 1 output topic, and both input topics were created with 12 partitions. Once I set num.stream.threads to 12, will each stream thread's consumer consume from 2 partitions, one from each topic?
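(For reference, outside of Spring the same knob is Kafka Streams' num.stream.threads setting; a minimal plain Kafka Streams sketch, with placeholder application id and broker, could look like this:)

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")  // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092") // placeholder broker
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "12")          // these threads share the input topic partitions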
I am very new to Spark Streaming. I have some basic doubts. Can someone please help me clarify this:
My message size is standard: 1 KB per message.
The number of topic partitions is 30, and I am using the DStream approach to consume messages from Kafka.
The number of cores given to the Spark job:
(spark.cores.max=6, spark.executor.cores=2)
As I understand it, the number of Kafka partitions = the number of RDD partitions.
In this case, with the DStream approach:
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // process the records of this Kafka partition
  }
}
**Question**: Will this foreachPartition loop execute 30 times, since there are 30 Kafka partitions?
Also, since I have given the job 6 cores, how many partitions will be consumed from Kafka in parallel?
Question: Is it 6 partitions at a time, or 30/6 = 5 partitions at a time?
Can someone please give a little detail on how exactly this works in the DStream approach?
"Is it 6 partitions at a time or
30/6 =5 partitions at a time?"
As you said already, the resulting RDDs within the Direct Stream will match the number of partitions of the Kafka topic.
On each micro-batch, Spark will create 30 tasks, one to read each partition. As you have set the maximum number of cores to 6, the job is able to read 6 partitions in parallel. As soon as one of the tasks finishes, a new partition can be consumed.
Remember, even if you have no new data in one of the partitions, the resulting RDD still gets 30 partitions, so yes, the foreachPartition loop will iterate 30 times within each micro-batch.
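A quick way to see this in a running job (a small sketch using the dstream from the question; getNumPartitions is a standard RDD method):

dstream.foreachRDD { rdd =>
  // With the direct stream this matches the topic's 30 Kafka partitions,
  // even for batches in which some partitions received no new data.
  println(s"partitions in this batch: ${rdd.getNumPartitions}")
}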
I am studying Kafka Streams, KTable, GlobalKTable, etc., and I am now confused about them.
What exactly is a GlobalKTable?
But overall, if I have a topic with N partitions and one Kafka Streams application, how many stream partitions will I have after I send some data to the topic?
I did some tests and noticed that the mapping is 1:1. But what if I replicate the topic over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application, whereas a KTable is partitioned over all of the instances of your application. In other words, every instance of your Kafka Streams application has access to all records in the GlobalKTable; hence it is used for more static data and for looking up records in joins.
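A small sketch of the difference using the Kafka Streams Scala DSL (the topic names are placeholders, and the import paths can vary slightly between Kafka versions):

import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

val builder = new StreamsBuilder()

// KTable: partitioned, so each application instance holds only a subset of the keys.
val userProfiles = builder.table[String, String]("user-profiles") // placeholder topic

// GlobalKTable: fully replicated to every instance; handy for lookup joins on static data.
val countryCodes = builder.globalTable[String, String]("country-codes") // placeholder topic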
As for a topic with N-partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each application would process half of the number of partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions, the workload is split across all running instances with the same application-id.
Topics are replicated across different brokers in Kafka, with 3 being a typical replication factor. A replication factor of 3 means the records for a given partition are stored on the lead broker for that partition and on two other follower brokers (assuming a three-node broker cluster).
Hope this clears things up some.
-Bill
There is a requirement in our systems to maintain proper sequencing and an ordering guarantee for records inside a Kafka topic partition.
As observed in our test runs, Kafka MirrorMaker does not provide an ordering guarantee within a partition; records tend to be shuffled between the source and target cluster topics.
We are planning to use Confluent Replicator for cross-cluster data replication. In a test run of Confluent Community Edition 5.3.1, we observed that the source and destination topics maintained exactly the same partitions and their respective record counts. (Replicator was run with a single-thread configuration.)
But does Replicator guarantee exact ordering of records within a partition?
And if I increase the number of replication threads for parallelism and better throughput, does it still guarantee ordering (including in the case of a thread failure)?
MirrorMaker (1.0) will repartition data using the DefaultPartitioner, so the only way you'd manage to get "out of order data" is by having producers override their partitioner. In addition, MirrorMaker does not guarantee that destination topics have the same number of partitions or the same configurations as the source.
Replicator and MirrorMaker 2.0 (available with Kafka 2.4.0) preserve the input partition counts and topic configs. Order is guaranteed, just as for any other consumer group. However, it is possible for records to be produced more than once during delivery due to edge cases in network transmission errors.
Increasing connector tasks will add more consumers to the group (again, the same as any other consumer application), and input and output partitions should still match.