Spark Structured Streaming with foreach - scala

I am using Spark Structured Streaming to read data from Kafka.
val readStreamDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", config.getString("kafka.source.brokerList"))
  .option("startingOffsets", config.getString("kafka.source.startingOffsets"))
  .option("subscribe", config.getString("kafka.source.topic"))
  .load()
Based on a uid in the message read from Kafka, I have to make an API call to an external service, fetch data, and write the result back to another Kafka topic.
For this I am using a custom foreach writer to process every message.
import org.apache.spark.sql.functions._
import spark.implicits._

val eventData = readStreamDF
  .select(from_json(col("value").cast("string"), event).alias("message"), col("timestamp"))
  .withColumn("uid", col("message.eventPayload.uid"))
  .drop("message")
val q = eventData
  .writeStream
  .foreach(new CustomForEachWriter())
  .start()
The CustomForEachWriter makes an API call against the given uid and fetches the results from a service. The result is an array of ids. These ids are then written back to another Kafka topic via a Kafka producer.
There are 30 Kafka partitions, and I have launched Spark with the following config:
num-executors = 30
executor-cores = 3
executor-memory = 10GB
But still the Spark job starts lagging and is not able to keep up with the incoming data rate.
The incoming data rate is around 10K messages per second, and the average time to process a single message is 100 ms.
I want to understand how Spark processes this in the case of Structured Streaming.
Is there one dedicated executor which is responsible for reading data from all partitions of Kafka?
Does that executor distribute tasks based on the number of partitions in Kafka?
The data in a batch gets processed sequentially. How can it be made to process in parallel so as to maximize throughput?

I think the CustomForEachWriter will work on a single row/record of the dataset. If you are using Spark 2.4, you can experiment with foreachBatch, though that API is still marked as Evolving.
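A quick capacity check makes the lag plausible: 30 executors × 3 cores gives roughly 90 concurrently running tasks, and at ~100 ms per message that caps throughput near 900 messages/sec, an order of magnitude below the 10K/sec input rate. Below is a minimal sketch of the foreachBatch alternative, assuming Spark 2.4+; callService and the broker/topic values are placeholders, not part of the original code. Working per partition lets you reuse connections and batch the API calls instead of paying the per-row ForeachWriter overhead.

import org.apache.spark.sql.DataFrame

// Hypothetical client for the external service; returns the ids for a uid.
def callService(uid: String): Seq[String] = ???

val q = eventData
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Relies on the earlier import spark.implicits._ for the encoders.
    batchDF
      .select($"uid").as[String]
      // One iterator per partition: reuse HTTP connections and batch calls here.
      .mapPartitions(uids => uids.flatMap(callService))
      .toDF("value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("topic", "output-topic")                  // placeholder
      .save()
  }
  .start()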

Related

Spark structured streaming avoid delay and checkpointing: startingOffsets latest does not work?

I am developing a Spark Structured Streaming process for a real-time application.
I need to read current Kafka messages without any delay.
Messages older than 30 seconds are not relevant in this project.
I am reading old messages with a big delay (minutes) from the current timestamp; it seems that Spark Structured Streaming does not honor the startingOffsets property set to latest.
I guess that the problem is the HDFS checkpoint location of the topic I write to.
I do not want to read old messages; only current ones are important!
I have tested many different configurations, Kafka properties, etc., but nothing worked.
Here is my code and the relevant config (the kafka.bootstrap.servers and kafka.ssl.* properties are not included here, but they exist).
Spark version: 2.4.0-cdh6.3.3
Consumer properties used at readStream
offsets.retention.minutes -> 1,
groupIdPrefix -> 710b6fb4-4454-4a52-819e-f565e047ecb7,
subscribe -> topic_x,
consumer.override.auto.offset.reset -> latest,
failOnDataLoss -> false,
startingOffset -> latest,
offsets.retention.check.interval.ms -> 1000
Reader Kafka topic readerB
val readerB = spark
  .readStream
  .format("kafka")
  .options(consumerProperties)
  .load()
Producer properties used at writeStream
topic -> topic_destination,
checkpointLocation -> checkPointLocation
Write stream block
val sendToKafka = dtXXXKafka
  .select(to_sr(struct($"*"), xxx_out_topic, xxx_out_topic, schemaRegistryNamespace).alias("value"))
  .writeStream
  .format("kafka")
  .options(producerProperties(properties, xxx_agg_out_topic, xxx_agg_out_localcheckpoint_hdfs))
  .outputMode("append")
  .start()
The startingOffsets property is applied only when a query is started, meaning for the first batch of your streaming query only. Afterwards, this property is ignored and the batches are defined according to the checkpoint data.
In your case, the streaming query starts by reading only the freshest data from your Kafka topic (thanks to the startingOffset -> latest setting). But the second batch (and all subsequent batches) will be defined according to the checkpoint; in other words, they start exactly where the previous batch ended. The offsets folder in the checkpoint contains the ending offsets (exclusive) for each batch, so batch X starts from the offsets saved for batch X-1. This is how exactly-once delivery semantics are achieved in Spark Structured Streaming.
docs: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
If you want to process only the data from the last 30 seconds, I would suggest filtering the DataFrame by the timestamp field to include only the data from the desired time frame.
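For example, a minimal sketch of such a filter, assuming the timestamp column that the Kafka source attaches to each record (the cutoff is evaluated when each micro-batch is planned):

import org.apache.spark.sql.functions._

// Keep only records whose Kafka timestamp is at most 30 seconds old.
val freshOnly = readerB.filter(
  col("timestamp") >= current_timestamp() - expr("INTERVAL 30 SECONDS")
)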

Spark Structured Streaming checkpoint usage in production

I have trouble understanding how checkpoints work in Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new Spark process that
reads the events from the Hive log table into a DataFrame
joins those events with the stream of confirmation events using Spark Structured Streaming
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; below is the pseudocode (I'm using Scala).
val tableA = spark.table("tableA")
val startingOffsets = "earliest"
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()
val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")
joinTableAWithStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner join those new events with my log table, so as to write only new data to the final HBase table.
I created a directory in HDFS where to store the checkpoint file.
I provided that location to the spark-submit command I use to call the spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory
--all_the_other_settings_and_libraries
At the moment the code runs every 15 minutes without any error, but it basically doesn't do anything, since it is not dumping the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some files have to be written there?
And does the readStream function need to be adapted so as to start reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) // ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application it is best to have the checkpoint location set directly in your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When restarting, the application will check these checkpoint files to understand where to continue reading from Kafka, so it does not skip or miss any messages. You do not need to set startingOffsets manually.
It is important to keep in mind that only specific changes in your application's code are allowed such that the checkpoint files can be used for secure re-starts. A good overview can be found in the Structured Streaming Programming Guide on Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // The startingOffsets option is only relevant the very first time this
  // application runs. After that, the checkpoint files are used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()
// perform any kind of transformations on streaming DataFrames
val processedStreamOfData = streamOfData.[...]
val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()
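As a side note on the 15-minute schedule: if the job is launched by an external scheduler instead of being kept running, Trigger.Once() is a common alternative to Trigger.ProcessingTime; each run then processes whatever accumulated since the offsets recorded in the checkpoint and stops:

.trigger(Trigger.Once())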

Spark Structured Streaming receiving duplicate messages - scala

I am using Kafka and Spark 2.4.5 Structured Streaming. I am doing an average operation, but I am facing issues due to duplicate records arriving from the Kafka topic in the current batch.
For example, these Kafka topic messages are received in the first batch, in update mode:
car,Brand=Honda,speed=110,1588569015000000000
car,Brand=Honda,speed=90,1588569015000000000
car,Brand=ford,speed=80,1588569015000000000
Here the result is the average per car brand per timestamp,
i.e. grouping by 1588569015000000000 and Brand=Honda, the result we get is
(110+90)/2 = 100
Now the second batch receives late data containing a duplicated message with the same timestamp:
car,Brand=Honda,speed=50,1588569015000000000
car,Brand=Honda,speed=50,1588569015000000000
I am expecting the average to update to (110+90+50)/3 = 83.33,
but the result updates to (110+90+50+50)/4 = 75, which is wrong.
val rawDataStream: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as data")
Then I group by timestamp and brand, and write to Kafka with a checkpoint.
How can I do this with Spark Structured Streaming, or is there anything wrong in the code?
Spark Structured Streaming allows deduplication on a streaming DataFrame using dropDuplicates. You need to specify the fields that identify a duplicate record; across batches, Spark will keep only the first record per combination, and records with duplicate values will be discarded.
The snippet below will deduplicate your streaming DataFrame on the Brand, speed and timestamp combination:
rawDataStream.dropDuplicates("Brand", "speed", "timestamp")
Refer to the Spark documentation on streaming deduplication.
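One caveat not covered above: on a streaming DataFrame the state kept for dropDuplicates grows without bound unless a watermark is also defined. A hedged sketch, assuming the nanosecond epoch in the message is first cast to a proper timestamp column (parsedStream and the column names are illustrative, not from the original code):

import org.apache.spark.sql.functions._

val deduped = parsedStream
  // Convert the nanosecond epoch to a TimestampType column (assumed layout).
  .withColumn("eventTime", (col("timestamp") / 1000000000L).cast("timestamp"))
  // Bound the deduplication state to 10 minutes of event time.
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("Brand", "speed", "eventTime")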

How can it be possible? duplicate records in Kafka queue?

I'm using Apache NiFi, Spark and Kafka to send messages between them. First of all, I take data with NiFi and send it to Spark to process it. Then, I send data from Spark back to NiFi to insert it in a DB.
My problem is that each time I run Spark, I get the same 3,142 records. The first part of NiFi is stopped and the second is running, yet every run of Spark produces the same 3,142 records, and I'm not able to understand this data.
Where does it come from?
I've tried to see if I have data on Kafka-Queue-I (from NiFi to Spark) or Kafka-Queue-II (from Spark to NiFi), but in both cases the answer is no. Only when I run Spark do 3,142 records appear in Kafka-Queue-II, and this doesn't happen on Kafka-Queue-I...
In NiFi, I use PublishKafka_1_0 1.7.0.
In Spark, the KafkaConsumer:
val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("group.id", "spark_kafka_consumer")
  .option("startingOffsets", "latest")
  .option("enable.auto.commit", true)
  .option("failOnDataLoss", "false")
  .option("subscribe", input_topic)
  .load()
It shows me a lot of false_alarm messages...
Some processing...
var parsed_data = raw_data
  .select(from_json(col("value").cast("string"), schema).alias("data"))
  .select("data.*")
  ...
Kafka sink in Spark
var query = parsed_data
  .select(to_json(schema_out).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("zookeeper.connect", zookeeper_servers)
  .option("checkpointLocation", checkpoint_location)
  .option("topic", output_topic)
  .start()
query.awaitTermination()
And, KafkaConsumer 1.7.0 in NiFi
I suspect that you're using a new (auto-generated) consumer group every time you start Spark and you have an offset reset policy of earliest. The result of this would be that Spark starts from the beginning of the topic every time.
Kafka does not remove messages from the topic when they are consumed (unlike other pub-sub systems). To avoid seeing old messages, you will need to set a consumer group and have Spark commit the offsets as it processes. These offsets are stored, and the next time a consumer starts with that group, it will pick up from the last stored offset for that group.
I would also note that Kafka, outside of some very specific usage patterns and technology choices, does not promise "exactly-once" messaging but instead "at-least-once"; my general advice there would be to try to be tolerant of duplicated records.
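One Spark-specific point worth adding: the Structured Streaming Kafka source does not rely on consumer-group offset commits to resume; it records offsets in the query's checkpoint, and startingOffsets only applies to the very first run against a fresh checkpoint. A sketch of the relevant options, reusing the variable names from the question above:

val raw_data = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("subscribe", input_topic)
  .option("startingOffsets", "latest") // first run against a fresh checkpoint only
  .load()

parsed_data.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("topic", output_topic)
  .option("checkpointLocation", checkpoint_location) // stable path => resume from stored offsets
  .start()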

structured streaming read based on kafka partitions

I am using Spark Structured Streaming to read incoming messages from a Kafka topic and write to multiple Parquet tables based on the incoming message.
So I created a single readStream, as the Kafka source is common, and for each Parquet table created a separate write stream in a loop. This works fine, but the readStream is creating a bottleneck: for each writeStream a new readStream is created, and there is no way to cache the dataframe which has already been read.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.servers)
  .option("subscribe", conf.topics)
  // .option("earliestOffset", "true")
  .option("failOnDataLoss", false)
  .load()
tableNames.foreach { tableName =>
  // filter the data from the source based on the table name
  val parquetDf = ...
  // write to Parquet
  parquetDf.writeStream.format("parquet")
    .option("path", outputFolder + File.separator + tableName)
    .option("checkpointLocation", "checkpoint_" + tableName)
    .outputMode("append")
    .trigger(Trigger.Once())
    .start()
}
Now every write stream creates a new consumer group and reads the entire data from Kafka, then does the filtering and writes to Parquet. This creates a huge overhead. To avoid it, I could partition the Kafka topic to have as many partitions as there are tables; then each read stream should only read from a given partition. But I don't see a way to specify partition details as part of the Kafka read stream.
If the data volume is not that high, write your own sink that collects the data of each micro-batch; then you should be able to cache that dataframe and write to different locations. It needs some tweaks, but it will work.
You can use the foreachBatch sink and cache the dataframe. Hopefully it works.
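A minimal sketch of that foreachBatch idea, assuming Spark 2.4+; tableNames and the routing filter on a hypothetical table column stand in for the real per-table logic from the question:

import java.io.File
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

// One readStream, one query: each micro-batch is read from Kafka once,
// cached, and fanned out to the per-table Parquet paths.
val query = kafkaDf.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    tableNames.foreach { tableName =>
      batchDF
        .filter(col("table") === tableName) // placeholder routing condition
        .write
        .mode("append")
        .parquet(outputFolder + File.separator + tableName)
    }
    batchDF.unpersist()
  }
  .option("checkpointLocation", "checkpoint_all_tables")
  .trigger(Trigger.Once())
  .start()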