I am using Spark Structured Streaming to read incoming messages from a Kafka topic and write them to multiple Parquet tables based on the incoming message.
Since the Kafka source is common, I created a single readStream and, for each Parquet table, a separate writeStream in a loop. This works fine, but the read side becomes a bottleneck: each writeStream ends up with its own read from Kafka, and there is no way to cache the DataFrame that has already been read.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.servers)
  .option("subscribe", conf.topics)
  // .option("earliestOffset","true")
  .option("failOnDataLoss", false)
  .load()
tableNames.foreach { tableName =>
  // filter the data from the source based on the table name
  // write to Parquet
  parquetDf.writeStream.format("parquet")
    .option("path", outputFolder + File.separator + tableName)
    .option("checkpointLocation", "checkpoint_" + tableName)
    .outputMode("append")
    .trigger(Trigger.Once())
    .start()
}
Now every writeStream creates a new consumer group and reads the entire data from Kafka, and only then filters it and writes to Parquet. This creates a huge overhead. To avoid it, I could partition the Kafka topic so that it has as many partitions as there are tables, and have each read stream consume only a given partition. But I don't see a way to specify partition details as part of the Kafka read stream.
If the data volume is not that high, write your own sink and collect the data of each micro-batch; then you can cache that DataFrame and write it to the different locations. It needs some tweaks, but it will work.
You can use the foreachBatch sink and cache the DataFrame. Hopefully it works.
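To make that suggestion concrete, here is a minimal sketch of the foreachBatch approach under the question's setup; tableNames and the filter condition are placeholders, since the question does not show how a row is mapped to its table, while kafkaDf and outputFolder come from the question's code:

import java.io.File
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

// Single streaming query: each micro-batch is read from Kafka once, cached,
// and then filtered and written once per table.
val query = kafkaDf.writeStream
  .trigger(Trigger.Once())
  .option("checkpointLocation", "checkpoint_all_tables")
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()
    tableNames.foreach { tableName =>
      batchDf
        .filter(col("value").cast("string").contains(tableName)) // placeholder: derive the table from the message however your format requires
        .write
        .mode("append")
        .parquet(outputFolder + File.separator + tableName)
    }
    batchDf.unpersist()
  }
  .start()

With this structure there is one consumer group and one pass over the Kafka data per micro-batch, while each table still gets its own Parquet output; note that foreachBatch is available from Spark 2.4 onwards.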
I have trouble understanding how checkpoints work with Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new Spark process that:
reads the events from the Hive log table into a DataFrame
joins those events with the stream of confirmation events using Spark Structured Streaming
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; below is the pseudocode (I'm using Scala).
val tableA = spark.table("tableA")

val startingOffsets = "earliest"
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")

joinTableAWithStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner-join those new events with my log table, so that only new data is written to the final HBase table.
I created a directory in HDFS in which to store the checkpoint files.
I provided that location to the spark-submit command I use to call the Spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory
--all_the_other_settings_and_libraries
At the moment the code runs every 15 minutes without any error, but it basically doesn't do anything, since it is not dumping the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some files should be written there?
And does the readStream function need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the Spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application, it is best to set the checkpoint location directly on your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and record the offsets already read from Kafka.
When restarting, the application checks these checkpoint files to understand where to continue reading from Kafka, so it does not skip or miss any messages. You do not need to set startingOffsets manually.
It is important to keep in mind that only specific changes in your application's code are allowed if the checkpoint files are to be used for safe restarts. A good overview can be found in the Structured Streaming Programming Guide under Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka, I recommend the following structure:
val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // startingOffsets is only relevant the very first time this application runs;
  // after that, the checkpoint files are used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

// perform any kind of transformations on the streaming DataFrame
val processedStreamOfData = streamOfData.[...]

val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()
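As a side note beyond the answer above: since the question describes launching the job through spark-submit every 15 minutes, another option, sketched here only as a suggestion, is to keep that external scheduler and use Trigger.Once(), which processes whatever arrived since the last checkpointed offsets and then stops:

import org.apache.spark.sql.streaming.Trigger

// Same query as above, but it drains everything new since the last checkpoint
// and then exits, so an external scheduler can start it every 15 minutes.
val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(writeDataToHBaseTable())
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.Once())
  .start()

streamingQuery.awaitTermination()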
I am using Kafka and Spark 2.4.5 Structured Streaming. I am doing an average operation, but I am running into issues because of duplicate records arriving from the Kafka topic in the current batch.
For example, these Kafka topic messages are received in the 1st batch (output mode is update):
car,Brand=Honda,speed=110,1588569015000000000
car,Brand=ford,speed=90,1588569015000000000
car,Brand=Honda,speed=80,15885690150000000000
Here the result is the average per car brand per timestamp, i.e. grouping by 1588569015000000000 and Brand=Honda; the result we got is
(110 + 90) / 2 = 100
Now, in the second batch, late data is received containing a duplicate message with the same timestamp:
car,Brand=Honda,speed=50,1588569015000000000
car,Brand=Honda,speed=50,1588569015000000000
I am expecting the average to update to (110 + 90 + 50) / 3 = 83.33,
but the result updates to (110 + 90 + 50 + 50) / 4 = 75, which is wrong.
val rawDataStream: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", "topic1") // Both topics on same stream!
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as data")
I then group by timestamp and Brand, and write the result to Kafka with a checkpoint.
How can I do this with Spark Structured Streaming, or is there anything wrong with the code?
Spark Structured Streaming allows deduplication of a streaming DataFrame using dropDuplicates. You specify the fields that identify a duplicate record; across batches, Spark will keep only the first record per combination, and records with duplicate values will be discarded.
The snippet below deduplicates your streaming DataFrame on the combination of Brand, speed and timestamp.
rawDataStream.dropDuplicates("Brand", "speed", "timestamp")
Refer to the Spark documentation here.
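As a sketch of how this could be wired into the pipeline from the question; parsedDataStream is assumed to be the raw stream with the CSV value already parsed into Brand, speed and timestamp columns (that parsing is not shown in the question), and note that without a watermark the deduplication state grows over time, so adding an event-time column plus withWatermark would bound it:

import org.apache.spark.sql.functions.{avg, col}

// Assumed: parsedDataStream is rawDataStream with the CSV value already parsed
// into Brand, speed and timestamp columns.
val deduped = parsedDataStream.dropDuplicates("Brand", "speed", "timestamp")

// The aggregation now runs on the deduplicated stream, so the late duplicate
// record is counted only once and no longer skews the average.
val averages = deduped
  .groupBy(col("timestamp"), col("Brand"))
  .agg(avg(col("speed")).alias("avg_speed"))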
I'm using Apache NiFi, Spark and Kafka to send messages between them. First, I take data with NiFi and send it to Spark to process it. Then I send data from Spark back to NiFi to insert it into a DB.
My problem is that each time I run Spark, I get the same 3.142 records. I have the first part of NiFi stopped and the second part running, and each time I run Spark I get the same 3.142 records, and I'm not able to make sense of this data.
Where does it come from?
I've tried to see if I have data on Kafka-Queue-I (from NiFi to Spark) or Kafka-Queue-II (from Spark to NiFi), but in both cases the answer is no. Only when I run Spark do 3.142 records appear in Kafka-Queue-II, but this doesn't happen on Kafka-Queue-I...
In NiFi, PublishKafka_1_0 1.7.0:
In Spark, KafkaConsumer:
val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("group.id", "spark_kafka_consumer")
  .option("startingOffsets", "latest")
  .option("enable.auto.commit", true)
  .option("failOnDataLoss", "false")
  .option("subscribe", input_topic)
  .load()
It shows me a lot of false-alarm warnings...
Some processing...
var parsed_data = raw_data
  .select(from_json(col("value").cast("string"), schema).alias("data"))
  .select("data.*")
  . ...
Kafka sink in Spark:
var query = parsed_data
  .select(to_json(schema_out).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("zookeeper.connect", zookeeper_servers)
  .option("checkpointLocation", checkpoint_location)
  .option("topic", output_topic)
  .start()

query.awaitTermination()
And the KafkaConsumer 1.7.0 processor in NiFi:
I suspect that you're using a new (auto-generated) consumer group every time you start Spark, and that you have an offset reset policy of earliest. The result is that Spark starts from the beginning of the topic every time.
Kafka does not remove messages from a topic when they are consumed (unlike other pub-sub systems). To stop seeing old messages, you need to use a fixed consumer group and have Spark commit the offsets as it processes. Those offsets are stored, and the next time a consumer starts with that group, it picks up from the last stored offset for that group.
I would also note that Kafka, outside of some very specific usage patterns and technology choices, does not promise "exactly-once" messaging but rather "at-least-once"; my general advice is to try to be tolerant of duplicated records.
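A note specific to Spark Structured Streaming: the Kafka source manages its own consumer group and does not commit offsets back to Kafka; progress is instead recorded in the query's checkpointLocation, so keeping that location stable across runs is what makes the query resume from where it left off rather than re-reading the topic. A minimal sketch, reusing the names from the question, with a placeholder checkpoint path:

// With a fixed checkpointLocation, the query resumes from the offsets it last
// processed, so restarting Spark does not re-read the whole topic.
val raw_data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("startingOffsets", "latest") // only used on the very first run, before a checkpoint exists
  .option("subscribe", input_topic)
  .load()

val query = raw_data
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_servers)
  .option("topic", output_topic)
  .option("checkpointLocation", "/path/to/stable/checkpoint") // placeholder; must stay the same across runs
  .start()

query.awaitTermination()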
I am using a Twitter stream function which gives me a DStream. I am required to use the Spark writeStream function (writeStream function link), like this:
// Write key-value data from a DataFrame to a specific Kafka topic specified in an option
val ds = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .start()
The 'df' needs to be a streaming Dataset/DataFrame. If df is a normal DataFrame, it gives an error saying 'writeStream' can be called only on streaming Dataset/DataFrame.
I have already done:
1. get the stream from Twitter
2. filter and map it to get a tag for each tweet (Positive, Negative, Neutral)
The last step is to group by tag, count each group, and pass the result to Kafka.
Do you guys have any idea how to transform a DStream into a streaming Dataset/DataFrame?
Edited: the foreachRDD function does change the DStream into a normal DataFrame, but 'writeStream' can be called only on a streaming Dataset/DataFrame (the writeStream link is provided above):
org.apache.spark.sql.AnalysisException: 'writeStream' can be called only on streaming Dataset/DataFrame;
How to transform a DStream into a streaming Dataset/DataFrame?
A DStream is an abstraction over a series of RDDs.
A streaming Dataset is an "abstraction" over a series of Datasets (I use quotes since the difference between streaming and batch Datasets is the isStreaming property of a Dataset).
It is possible to convert a DStream into a streaming Dataset and keep the behaviour of the DStream, but I don't think you really want that.
All you need is to take the tweets using the DStream and save them to a Kafka topic (and you think you need Structured Streaming for that). I think you simply need Spark SQL (the underlying engine of Structured Streaming).
Pseudo-code would then be as follows (sorry, it's been a while since I used the old-fashioned Spark Streaming):

val spark: SparkSession = ...
val tweets: DStream[String] = ...
tweets.foreachRDD { rdd =>
  import spark.implicits._
  rdd.toDF("value") // the Kafka sink expects a "value" column
    .write.format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "topic1")
    .save()
}
I am using Spark Structured Streaming to read data from Kafka.
val readStreamDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", config.getString("kafka.source.brokerList"))
  .option("startingOffsets", config.getString("kafka.source.startingOffsets"))
  .option("subscribe", config.getString("kafka.source.topic"))
  .load()
Based on a uid in the message read from Kafka, I have to make an API call to an external source, fetch the data, and write it back to another Kafka topic.
For this I am using a custom foreach writer and processing every message.
import spark.implicits._

val eventData = readStreamDF
  .select(from_json(col("value").cast("string"), event).alias("message"), col("timestamp"))
  .withColumn("uid", col("message.eventPayload.uid"))
  .drop("message")

val q = eventData
  .writeStream
  .format("console")
  .foreach(new CustomForEachWriter())
  .start()
The CustomForEachWriter makes an API call and fetches results for the given uid from a service. The result is an array of ids. These ids are then written back to another Kafka topic via a Kafka producer.
There are 30 Kafka partitions and I have launched Spark with the following config:
num-executors = 30
executor-cores = 3
executor-memory = 10GB
But the Spark job still starts lagging and is not able to keep up with the incoming data rate.
The incoming data rate is around 10K messages per second, and the average time to process a single message is 100 ms.
I want to understand how Spark processes this in the case of Structured Streaming.
My understanding is that in Structured Streaming there is one dedicated executor responsible for reading data from all partitions of Kafka.
Does that executor distribute tasks based on the number of Kafka partitions?
The data in a batch gets processed sequentially. How can it be made to process in parallel so as to maximize the throughput?
I think the CustomForEachWriter will work on a single row/record of the dataset. If you are using Spark 2.4, you can experiment with foreachBatch, but note that it is still marked as Evolving.
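A minimal sketch of what that could look like for this use case; fetchIdsForUid, the broker address, the output topic and the checkpoint path are placeholders for the API call and producer logic described in the question:

import org.apache.spark.sql.{DataFrame, Encoders}

// Sketch: foreachBatch hands each micro-batch to user code, so the per-uid API
// calls run as a normal distributed job instead of row-by-row in a ForeachWriter.
val q = eventData
  .writeStream
  .option("checkpointLocation", "/path/to/checkpoint") // placeholder
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    val ids = batchDf
      .select("uid")
      .as(Encoders.STRING)
      .mapPartitions(uids => uids.flatMap(uid => fetchIdsForUid(uid)))(Encoders.STRING) // placeholder for the API call from the question
    ids.toDF("value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("topic", "output_topic")                   // placeholder
      .save()
  }
  .start()

Because the enrichment happens inside a regular batch job, its parallelism follows the partitioning of the micro-batch DataFrame rather than being tied to a per-record ForeachWriter, and the batch can be repartitioned to raise it further.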