PySpark - Job getting bogged down when writing dataframe to Kafka

I have a dataframe that I'm transforming and want to write to Kafka, but the job gets bogged down during the write. There are close to ~400 GB of shuffle reads.
How can I fix this?
[screenshot: code where the job gets clogged]
[screenshot: stuck YARN job]
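
Since the code isn't shown, here is a minimal sketch of what a batch dataframe-to-Kafka write typically looks like in PySpark (it needs the spark-sql-kafka package on the classpath); the broker, topic name, and the coalesce(50) value are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-to-kafka").getOrCreate()
df = spark.table("transformed_table")  # placeholder for the transformed dataframe

# The Kafka sink expects a string/binary 'value' column (and optionally 'key').
out = df.select(F.to_json(F.struct(*df.columns)).alias("value"))

# Capping the number of output partitions limits how many tasks (and Kafka
# producers) hit the brokers at once after the big shuffle.
(out.coalesce(50)
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
    .option("topic", "my_topic")                        # assumed topic
    .save())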

Related

Spark Kafka Streaming - There is a lot of delay in processing the batch

I am running Spark Streaming with Kafka for a word count program, and there is a lot of delay in batch creation and processing - around 2 minutes for each batch.
How can I reduce this time? Are there any properties to configure so this happens as quickly as possible - at the Spark Streaming level or the Kafka level?
You should define the interval between each batch in your (unstructured streaming) StreamingContext, for example:
val ssc = new StreamingContext(new SparkConf(), Minutes(1))
In Structured Streaming you have the option kafkaConsumer.pollTimeoutMs, with 512 ms as the default value; more information: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
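For reference, a minimal sketch of passing that option when building a Structured Streaming Kafka source in PySpark; the broker and topic names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ss-kafka-source").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
    .option("subscribe", "my_topic")                     # placeholder topic
    .option("kafkaConsumer.pollTimeoutMs", "2000")       # raise if executor-side polls time out
    .load())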
Another problem can come from Kafka lag. Your application can take a long time to process a specific offset, maybe 2 minutes, and only once that offset is finished will it poll the next ones for processing. Try comparing the current offset of your consumer group with the last offset of your topic.
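
One way to compare those two offsets programmatically is sketched below using the kafka-python package (an assumption here; the kafka-consumer-groups CLI shipped with Kafka reports the same lag). Broker, topic, and group names are placeholders:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="broker1:9092",  # placeholder broker
    group_id="my-streaming-group",     # placeholder consumer group
    enable_auto_commit=False,
)

tp = TopicPartition("my_topic", 0)                # placeholder topic / partition 0
end_offset = consumer.end_offsets([tp])[tp]       # last offset in the partition
committed = consumer.committed(tp) or 0           # last offset the group committed

print("lag for partition 0:", end_offset - committed)
consumer.close()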

Limit data consumed from Kafka in Spark Streaming

I am working on a Spark Streaming project in which Spark gets data from Kafka. I want to limit the records consumed by Spark Streaming, because there is a very huge amount of data on Kafka. I am using the spark.streaming.kafka.maxRatePerPartition=1 property to limit the records in Spark, but I still get 13400 messages in a 5-minute batch. My Spark program cannot handle more than 1000 messages per 5 minutes. The Kafka topic has 3 partitions. My Spark driver memory is 5 GB and I have 3 executors with 3 GB each. How can I limit the messages consumed from Kafka in Spark Streaming?
Did you try setting the props below?
spark.streaming.backpressure.enabled
spark.streaming.backpressure.initialRate
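
A minimal PySpark sketch of wiring those two properties together with the per-partition cap the question already uses; the numbers and the 300-second batch interval are illustrative, not recommendations:

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

conf = (SparkConf()
    .setAppName("rate-limited-stream")
    # Let Spark adapt the ingestion rate to the observed processing speed.
    .set("spark.streaming.backpressure.enabled", "true")
    # Records/sec used for the very first batch, before the feedback loop kicks in.
    .set("spark.streaming.backpressure.initialRate", "1")
    # Hard per-partition, per-second cap (applies to the direct Kafka approach).
    .set("spark.streaming.kafka.maxRatePerPartition", "1"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
# With 300-second batches and 3 partitions the cap is 1 * 3 * 300 = 900 records per batch.
ssc = StreamingContext(spark.sparkContext, 300)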

Connecting Spark streaming to streamsets input

I was wondering if it would be possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not listed among the supported StreamSets connector destinations: https://streamsets.com/connectors/ .
I am exploring whether there are other ways to connect them for a sample POC.
The best way to process data coming in from StreamSets Data Collector (SDC) in Apache Spark Streaming would be to write the data out to a Kafka topic and read the data from there. This allows you to separate out Spark Streaming from SDC, so each can proceed at its own rate of processing.
SDC microbatches are defined by record count, while Spark Streaming microbatches are dictated by time. This means that each SDC batch may not (and probably will not) correspond to a Spark Streaming batch (most likely a Spark Streaming batch will contain data from several SDC batches). SDC "commits" each batch once it is sent to the destination - so writing batches directly to Spark Streaming would require each SDC batch to correspond to a Spark Streaming batch to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches - so to recover from a situation like this, you'd really have to write to something like Kafka that allows you to re-process the batches. So having a direct connector that writes from SDC to Spark Streaming would be complex and likely have data loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.
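
A minimal sketch of the Spark Streaming side of that pipeline in PySpark, assuming SDC writes to a Kafka topic with its Kafka Producer destination; the topic, broker, and checkpoint path are placeholders, and KafkaUtils is the DStream API available through Spark 2.x:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # DStream Kafka API (Spark 1.x/2.x)

sc = SparkContext(appName="sdc-kafka-consumer")
ssc = StreamingContext(sc, 60)  # time-based microbatches, independent of SDC's record-count batches
ssc.checkpoint("hdfs:///tmp/sdc-stream-checkpoint")  # placeholder path; supports recovery/re-processing

# Read the topic that SDC writes to; Kafka decouples the two processing rates.
stream = KafkaUtils.createDirectStream(
    ssc, ["sdc_output_topic"],                    # placeholder topic
    {"metadata.broker.list": "broker1:9092"})     # placeholder broker

stream.map(lambda kv: kv[1]).pprint()  # value only; plug real processing in here
ssc.start()
ssc.awaitTermination()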

How to better process huge historical data in a Kafka topic using Spark Streaming

I am experiencing an issue starting Spark Streaming on a really big Kafka topic; there are around 150 million records in this topic already and the topic is growing super fast.
When I try to start Spark Streaming and read from the beginning of this topic by setting the Kafka parameter ("auto.offset.reset" -> "smallest"), it always tries to process all 150 million records in the first batch and returns a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. There isn't a lot of calculation in this Spark Streaming app, though.
Is there a way to process the historical data in this topic over the first several batches, rather than all of it in the first batch?
Thanks a bunch in advance!
James
You can control the Spark Kafka input read rate with the following Spark configuration: spark.streaming.kafka.maxRatePerPartition.
You configure it by giving the maximum number of records you want to read per second, per partition:
sparkConf.set("spark.streaming.kafka.maxRatePerPartition","<docs-count>")
The above config processes roughly <docs-count> * <number of partitions> * <batch interval in seconds> records per batch.
You can find more info about this config in the Spark configuration documentation.
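
In PySpark the same idea looks roughly like this; the rate, batch interval, topic, and broker are illustrative placeholders, and KafkaUtils is the DStream API available through Spark 2.x:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # DStream Kafka API (Spark 1.x/2.x)

conf = (SparkConf()
    .setAppName("kafka-history-backfill")
    # Cap on records read per partition, per second of batch interval.
    .set("spark.streaming.kafka.maxRatePerPartition", "10000"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 60)  # 60-second batches

# "smallest" starts from the beginning of the topic, but the cap above keeps the
# first batch to at most 10000 * <partitions> * 60 records instead of all 150M.
stream = KafkaUtils.createDirectStream(
    ssc, ["big_topic"],                           # placeholder topic
    {"metadata.broker.list": "broker1:9092",      # placeholder broker
     "auto.offset.reset": "smallest"})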

How to make Spark Streaming read from Kafka asynchronously

I have one Kafka partition and one Spark Streaming application, on one server with 10 cores. When Spark Streaming gets one message from Kafka, the subsequent processing takes 5 seconds (this is my code). So I found that Spark Streaming reads Kafka messages very slowly; I'm guessing that when Spark reads a message, it waits until that message is processed, so reading and processing are synchronized.
I was wondering, can I make the Spark reading asynchronous, so that reading from Kafka isn't dragged down by the subsequent processing? Then Spark would consume data from Kafka very quickly, and I could focus on the slow data processing inside Spark. BTW, I'm using the foreachRDD function.
You can increase the number of partitions in Kafka; it should improve the parallelism. You can also try the direct Kafka approach ("direct Kafka receiver"), which really improves performance when your app is reading from Kafka.
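
A minimal sketch of both suggestions in PySpark, assuming the DStream API (Spark 1.x/2.x); the topic, broker, and the hypothetical slow_work() function stand in for the asker's 5-second-per-message processing:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # direct (receiver-less) Kafka API

def slow_work(value):
    # hypothetical stand-in for the 5-second-per-message processing
    pass

sc = SparkContext(appName="parallel-kafka-processing")
ssc = StreamingContext(sc, 5)  # illustrative 5-second batches

# Direct approach: each batch reads an offset range, no long-running receiver thread.
stream = KafkaUtils.createDirectStream(
    ssc, ["my_topic"], {"metadata.broker.list": "broker1:9092"})  # placeholders

def process(rdd):
    # With one Kafka partition the RDD has a single partition; repartitioning
    # spreads the expensive per-record work across the 10 available cores.
    rdd.repartition(10).foreach(lambda record: slow_work(record[1]))

stream.foreachRDD(process)
ssc.start()
ssc.awaitTermination()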