Measure time taken for repartitioning of spark structured streaming dataframe - spark-structured-streaming

I want to do a performance comparison in which I want to measure the time taken in repartitioning of spark structured streaming dataframe.
for eg time taken for the following step
df.repartition($"colname")
Any ideas how can this be achieved ?

Related

Spark Streaming: Aggregating values across batches

We've a Spark Streaming job that calculates some values in each batch. What we need to do now is aggregate values across ALL batches. What is the best strategy to do this in Spark Streaming. Should we use 'Spark Accumulators' for this?

Are spark.streaming.backpressure.* properties applicable to Spark Structured Streaming?

My understanding is that Spark structured streaming is build on top of Spark SQL and not Spark Streaming. Hence, the following question, does the properties that apply to spark streaming also applies to spark structured streaming such as:
spark.streaming.backpressure.initialRate
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
No, these settings are applicable only to DStream API.
Spark Structured Streaming does not have a backpressure mechanism. You can find more details in this discussion: How Spark Structured Streaming handles backpressure?
No.
Spark Structured Stream processes data asap by default - after finishing the current batch. You can control via the rate of processing for various types, e.g. maxFilesPerTrigger for files and maxOffsetsPerTrigger for KAFKA.
This link http://javaagile.blogspot.com/2019/03/everything-you-needed-to-know-about.html explains that back pressure is not relevant.
It quotes: "Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka.".
I am not sure this aspect is relevant as KAFKA buffers the data. None-the-less the article has good merit imho.

Does Spark maintain parquet partitioning on read?

I am having a lot trouble finding the answer to this question. Let's say I write a dataframe to parquet and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See Below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words if a parquet file is partitioned does spark maintain that partitioning when reading it into a spark dataframe. Or is it randomly partitioned?
Also the why and why not to this answer would be helpful as well.
The number of partitions acquired when reading data stored as parquet follows many of the same rules as reading partitioned text:
If SparkContext.minPartitions >= partitions count in data, SparkContext.minPartitions will be returned.
If partitions count in data >= SparkContext.parallelism, SparkContext.parallelism will be returned, though in some very small partition cases, #3 may be true instead.
Finally, if the partitions count in data is somewhere between SparkContext.minPartitions and SparkContext.parallelism, generally you'll see the partitions reflected in the dataset partitioning.
Note that it's rare for a partitioned parquet file to have full data locality for a partition, meaning that, even when the partitions count in data matches the read partition count, there is a strong likelihood that the dataset should be repartitioned in memory if you're trying to achieve partition data locality for performance.
Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.
val df = spark.read.parquet("/path/to/parquet/file")
df.repartition(col("DATE"))
You would get the number of partitions based on the spark config spark.sql.files.maxPartitionBytes which defaults to 128MB. And the data would not be partitioned as per the partition column which was used while writing.
Reference https://spark.apache.org/docs/latest/sql-performance-tuning.html
In your question, there are two ways we could say the data are being "partitioned", which are:
via repartition, which uses a hash partitioner to distribute the data into a specific number of partitions. If, as in your question, you don't specify a number, the value in spark.sql.shuffle.partitions is used, which has default value 200. A call to .repartition will usually trigger a shuffle, which means the partitions are now spread across your pool of executors.
via partitionBy, which is a method specific to a DataFrameWriter that tells it to partition the data on disk according to a key. This means the data written are split across subdirectories named according to your partition column, e.g. /path/to/parquet/file/DATE=<individual DATE value>. In this example, only rows with a particular DATE value are stored in each DATE= subdirectory.
Given these two uses of the term "partitioning," there are subtle aspects in answering your question. Since you used partitionBy and asked if Spark "maintain's the partitioning", I suspect what you're really curious about is if Spark will do partition pruning, which is a technique used drastically improve the performance of queries that have filters on a partition column. If Spark knows values you seek cannot be in specific subdirectories, it won't waste any time reading those files and hence your query completes much quicker.
If the way you're reading the data isn't partition aware, you'll get a number of partitions something like what's in bsplosion's answer. Spark won't employ partition pruning, and hence you won't get the benefit of Spark automatically ignoring reading certain files to speed things up1.
Fortunately, reading parquet files in Spark that were written with partitionBy is a partition-aware read. Even without a metastore like Hive that tells Spark the files are partitioned on disk, Spark will discover the partitioning automatically. Please see partition discovery in Spark for how this works in parquet.
I recommend testing reading your dataset in spark-shell so that you can easily see the output of .explain, which will let you verify that Spark correctly finds the partitions and can prune out the ones that don't contain data of interest in your query. A nice writeup on this can be found here. In short, if you see PartitionFilters: [], it means that Spark isn't doing any partition pruning. But if you see something like PartitionFilters: [isnotnull(date#3), (date#3 = 2021-01-01)], Spark is only reading in a specific set of DATE partitions, and hence the query execution is usually a lot faster.
1A separate detail is that parquet stores statistics about data in its columns inside of the files themselves. If these statistics can be used to eliminate chunks of data that can't match whatever filtering you're doing, e.g. on DATE, then you'll see some speedup even if the way you read the data isn't partition-aware. This is called predicate pushdown. It works because the files on disk will still contain only specific values of DATE when using .partitionBy. More info can be found here.

Spark Structured Streaming - writeStream split data hourly

Currently we have Spark streaming running in Production. I am in process of converting the code to use Structured streaming.
I am able to successfully read data from Kinesis and write (sink) to Parquet file in S3 .
Our business logic demands, we write the streamed data in hourly folders. The incoming data from kinesis does not have any date-time field. So cannot partitionBy date-time.
We have a defined a function [getSubfolderNameFromDate()] which gets us the current hour+date (1822062017 - 18th Hour of the day, 22nd June 2017) so we can write data in hourly folders.
With Spark streaming the Context gets re-initialized and automatically writes data in next hour folder, but I am not able to achieve the same with structured streaming.
For example, 2 million records were streamed in 4th hour of the day, it should be written to "S3_location/0422062017.parquet/", data streamed in the following hour should be in "S3_location/0522062017.parquet/" so on and so forth.
With Structured streaming it continuous to write to same folder throughout the day, which I understand is because it evaluates the folder name just once and continuous to append the data to it.
But I want to append new data to hourly folders, is there a way to achieve this ?
I am currently using below query :
val query = streamedDataDF
.writeStream
.format("parquet")
.option("checkpointLocation", checkpointDir)
.option("path", fileLocation + FileDirectory.getSubfolderNameFromDate() + ".parquet/")
.outputMode("append")
.start()

Connecting Spark streaming to streamsets input

I was wondering if it would be possible to provide input to spark streaming from StreamSets. I noticed that Spark streaming is not supported within the StreamSets connectors destination https://streamsets.com/connectors/ .
I exploring if there are other ways to connect them for a sample POC.
The best way to process data coming in from Streamsets Data Collector (SDC) in Apache Spark Streaming would be to write the data out to a Kafka topic and read the data from there. This allows you to separate out Spark Streaming from SDC, so both can proceed at its own rate of processing.
SDC microbatches are defined record count while Spark Streaming microbatches are dictated by time. This means that each SDC batch may not (and probably will not) correspond to a Spark Streaming batch (most likely that Spark Streaming batch will probably have data from several SDC batches). SDC "commits" each batch once it is sent to the destination - having a batch written to Spark Streaming will mean that each SDC batch will need to correspond to a Spark Streaming batch to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches - so to recover from a situation like this, you'd really have to write to something like Kafka that allows you to re-process the batches. So having a direct connector that writes from SDC to Spark Streaming would be complex and likely have data loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.