Limit the data consumed from Kafka in Spark Streaming - apache-kafka

I am working on a Spark Streaming project in which Spark gets its data from Kafka. I want to limit the number of records consumed by Spark Streaming, because there is a very large amount of data in Kafka. I am using the spark.streaming.kafka.maxRatePerPartition=1 property to limit the records in Spark, but I am still getting 13,400 messages in a 5-minute batch. My Spark program cannot handle more than 1,000 messages per 5 minutes. The Kafka topic has 3 partitions. My Spark driver has 5 GB of memory, and I have 3 executors with 3 GB each. How can I limit the messages consumed from Kafka in Spark Streaming?

Did you try setting the properties below?
spark.streaming.backpressure.enabled
spark.streaming.backpressure.initialRate
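A minimal sketch of setting these properties together with the rate limit already mentioned, assuming a 5-minute batch interval; the values are illustrative, not tuned for the question's workload:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-rate-limited")
  // Let Spark adapt the ingestion rate to the actual processing speed.
  .set("spark.streaming.backpressure.enabled", "true")
  // Rate used for the first batch, before backpressure has any feedback.
  .set("spark.streaming.backpressure.initialRate", "10")
  // Hard cap: records per second read from each Kafka partition.
  .set("spark.streaming.kafka.maxRatePerPartition", "1")

val ssc = new StreamingContext(conf, Seconds(300)) // 5-minute batches

Note that spark.streaming.kafka.maxRatePerPartition is a per-second, per-partition limit, so the maximum batch size is that rate times the number of partitions times the batch interval in seconds.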

Related

spark streaming and kafka, increase the number of messages that spark pulls from kafka

I have an application that produces 60,000 messages per second. I send the messages to Kafka and I want to receive them with Spark Streaming in another application, but the rate at which Spark receives messages is only ~40,000 per second. I want to increase the number of messages that Spark receives per interval; how can I do that?
In Kafka, the degree of parallelism is determined by the number of partitions of the topic, so you need to increase the number of partitions in the topic. You also need to set the number of executors running Spark Streaming to match the number of Kafka partitions as closely as possible. This will give you optimal performance.
Try increasing spark.streaming.kafka.maxRatePerPartition. Since the limit is applied per second and per partition, you can use the equation (spark.streaming.kafka.maxRatePerPartition) * (number of partitions) = 60,000. I would suggest keeping spark.streaming.kafka.maxRatePerPartition a little higher than that, in case of spikes in incoming messages; a sizing sketch follows.
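A minimal sizing sketch, assuming for illustration that the topic has 4 partitions and that ~20% headroom is enough for spikes (both numbers are assumptions, not taken from the question); sparkConf is an existing SparkConf:

val targetMessagesPerSecond = 60000
val partitions = 4          // assumed partition count, for illustration only
val headroom = 1.2          // ~20% extra capacity to absorb spikes

// Per-partition rate so that partitions * rate comfortably covers 60,000 msg/s.
val maxRatePerPartition =
  math.ceil(targetMessagesPerSecond.toDouble / partitions * headroom).toInt

sparkConf.set("spark.streaming.kafka.maxRatePerPartition", maxRatePerPartition.toString)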

Apache Storm with Kafka spout bottleneck

I'm trying to achieve maximum performance with my Storm setup. I'm sending tens of thousands of messages through Kafka, which are received by the Storm topology. When I look at the Storm UI, I notice that all the messages go to a single executor rather than being load balanced across all the executors (see the attached screenshot). Is there a reason for this, and how can I load balance the Kafka messages?
Storm UI Screenshot
Since you have 3 partitions, try creating the Kafka Spout with a parallelism hint of 3 and HBase Bolt with a parallelism hint of 3. Use Partial Key grouping in the HBase Bolt to load balance the messages between the bolts on the basis of a key.
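A minimal wiring sketch of that suggestion, assuming Storm 1.x package names and that a Kafka spout and an HBase bolt have already been constructed elsewhere (kafkaSpout, hbaseBolt and the "rowKey" field are placeholders, not from the question):

import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

val builder = new TopologyBuilder()

// Parallelism hint of 3 on the spout: one executor per Kafka partition.
builder.setSpout("kafka-spout", kafkaSpout, 3)

// Parallelism hint of 3 on the bolt, with partial key grouping so tuples
// sharing the same key are balanced between candidate bolt instances.
builder.setBolt("hbase-bolt", hbaseBolt, 3)
  .partialKeyGrouping("kafka-spout", new Fields("rowKey"))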

how to better process huge historical data in a kafka topic using spark streaming

I am having trouble starting Spark Streaming on a really big Kafka topic: there are already around 150 million records in this topic and it is growing very fast.
When I try to start Spark Streaming and read from the beginning of the topic by setting the Kafka parameter ("auto.offset.reset" -> "smallest"), it always tries to process all 150 million records in the first batch and fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded", even though this Spark Streaming app does not do a lot of calculation.
Is there a way to process the historical data in this topic over the first several batches instead of all in the first batch?
Thanks a lot in advance!
James
You can control the Spark Kafka input reading rate with the spark.streaming.kafka.maxRatePerPartition configuration. You set it to the number of records you want to read per second from each partition:
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "<records-per-second>")
With this config, each batch processes at most <records-per-second> * <number of partitions> * <batch interval in seconds> records. You can find more details on this setting in the Spark Streaming configuration documentation.
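A minimal sketch combining this cap with backpressure so the 150-million-record backlog is drained gradually; the rate, partition count and batch interval below are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("drain-history-topic")
  // Cap each partition's read rate so the first batch stays bounded.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // records/sec/partition (assumed)
  // Let Spark lower the rate further if batches start falling behind.
  .set("spark.streaming.backpressure.enabled", "true")

// With, say, 3 partitions and 60-second batches, each batch is capped at
// roughly 1000 * 3 * 60 = 180,000 records instead of the whole topic.
val ssc = new StreamingContext(sparkConf, Seconds(60))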

spark streaming cannot receive data from kafka if some messages were sent to kafka beforehand

I produce some messages first, and these messages are persisted on disk by Kafka's brokers. Then I start the Spark Streaming program to process this data, but I don't receive anything in Spark Streaming, and there is no error in the logs.
However, if I produce messages while the Spark Streaming program is running, it receives the data.
Can Spark Streaming only receive real-time data from Kafka?
To control what data is consumed when a new consumer stream starts, you should provide auto.offset.reset as part of the properties used to create the Kafka stream.
auto.offset.reset can take the following values:
earliest => the kafka topic will be consumed from the earliest offset available
latest => the kafka topic will be consumed starting at the current latest offset
Also note that, depending on the Kafka consumer model you are using (receiver-based or direct), the behavior of a restarted Spark Streaming job will be different.
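A minimal sketch of a direct stream that reads the messages produced before the job started, assuming the spark-streaming-kafka-0-10 integration and placeholder broker, topic and group names:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // placeholder broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-streaming-group",               // placeholder group id
  // Read messages that were already on the topic before the job started.
  "auto.offset.reset" -> "earliest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                              // existing StreamingContext
  PreferConsistent,
  Subscribe[String, String](Array("my-topic"), kafkaParams)
)

Keep in mind that auto.offset.reset only takes effect when the consumer group has no committed offsets for the topic; otherwise consumption resumes from the committed position.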

how to make spark streaming read asynchronously from Kafka

I have one Kafka partition and one Spark Streaming application, on one server with 10 cores. When Spark Streaming gets one message from Kafka, the subsequent processing takes 5 seconds (that part is my code). I found that Spark Streaming reads Kafka messages very slowly, and I'm guessing that after Spark reads a message it waits until the message has been processed, so the reading and the processing are synchronized.
I was wondering whether I can make the Spark reading asynchronous, so that reading from Kafka is not held back by the subsequent processing. Then Spark would consume data from Kafka very quickly, and I could focus on the slow data processing inside Spark. By the way, I'm using the foreachRDD function.
You can increase the number of partitions in Kafka; that should improve the parallelism. You can also try the direct Kafka approach (createDirectStream), which really improves performance when your app is reading from Kafka. A sketch of spreading the slow per-record work across your cores is shown below.
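A minimal sketch, assuming a direct stream over the spark-streaming-kafka-0-10 API already exists and that the 5-second per-record work can be parallelised by repartitioning the RDD across the 10 cores (the process function is a placeholder, not from the question):

stream.foreachRDD { rdd =>
  // With a single Kafka partition the RDD has one partition; repartition it
  // so the slow per-record work runs on all 10 cores instead of just one.
  rdd.repartition(10).foreachPartition { records =>
    records.foreach(record => process(record.value()))  // process is a placeholder
  }
}

The repartition adds a shuffle, so it only pays off when the per-record processing is much more expensive than the data movement, as it is here.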