pyspark performance and processing time

I have 5 steps which produce df_a, df_b, df_c, df_e and df_f.
Each step generates a dataframe (df_a, for instance) and persists it as Parquet files.
Each file is then used as the input of the subsequent step (df_b, for example).
Computing df_e and df_f took about 30 minutes.
However, when I started a new Spark session, read the df_c Parquet files into a dataframe and then computed df_e and df_f, it took less than a minute.
Was it because, when we read the Parquet file back, the data is stored in a more compact way?
Should I overwrite the dataframe with the result of spark.read after writing it to storage, to improve performance?
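For reference, the pattern the question describes (write the intermediate result, then continue from the re-read dataframe) looks roughly like the sketch below; the path and the stand-in dataframe are illustrative, not taken from the original job.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("intermediate-parquet").getOrCreate()

// Stand-in for the df_c produced by the earlier steps.
val dfC = spark.range(0, 1000000).toDF("key")

val dfCPath = "/tmp/intermediate/df_c"  // illustrative path

// Persist the intermediate result, then replace the in-memory reference with
// the version read back from Parquet. This breaks the lineage, so df_e and
// df_f are computed from the materialized files instead of re-running the
// whole chain of earlier steps.
dfC.write.mode("overwrite").parquet(dfCPath)
val dfCMaterialized = spark.read.parquet(dfCPath)

// build df_e and df_f from dfCMaterialized from here on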

Using the Parquet file format is definitely best from a processing point of view, but, as you said, in your case:
Why don't you convert the very first step to Parquet (i.e. start by reading a Parquet file in the very first step)?
If storing an intermediate dataframe as Parquet improves your performance, then maybe you can also apply a filter early on to focus on the part of the data you need to read; this can be as good as reading from a Parquet file, because storing an intermediate dataframe as Parquet also costs time to write the data out and read it back into Spark.

Related

Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk

Checkpoint version:
val savePath = "/some/path"
spark.sparkContext.setCheckpointDir(savePath)
df.checkpoint()
Write to disk version:
df.write.parquet(savePath)
val df = spark.read.parquet(savePath)
I think both break the lineage in the same way.
In my experiments the checkpoint is almost 30 times bigger on disk than the Parquet output (689 GB vs. 24 GB). In terms of running time, checkpointing takes 1.5 times longer (10.5 min vs. 7.5 min).
Considering all this, what would be the point of using checkpoint instead of saving to file? Am I missing something?
Checkpointing is a process of truncating an RDD's lineage graph and saving it to a reliable distributed (HDFS) or local file system. If you have a large RDD lineage graph and you want to freeze the content of the current RDD, i.e. materialize the complete RDD before proceeding to the next step, you generally use persist or checkpoint. The checkpointed RDD can then be used for some other purpose.
When you checkpoint, the RDD is serialized and stored on disk. It is not stored in Parquet format, so the data on disk is not particularly storage-optimized, contrary to Parquet, which provides various compression and encoding schemes to optimize how the data is stored. This would explain the difference in size.
You should definitely think about checkpointing in a noisy cluster. A cluster is called noisy if there are lots of jobs and users competing for resources and there are not enough resources to run all the jobs simultaneously.
You must think about checkpointing if your computations are really expensive and take a long time to finish, because it can be faster to write an RDD to HDFS and read it back in parallel than to recompute it from scratch.
There was also a slight inconvenience prior to the Spark 2.1 release: there was no way to checkpoint a dataframe, so you had to checkpoint the underlying RDD. This was resolved in Spark 2.1 and later versions.
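A minimal sketch of both approaches is below; the checkpoint directory is illustrative, and the pre-2.1 workaround rebuilds the dataframe from the checkpointed RDD.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-example").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // illustrative path

val df = spark.range(0, 1000000).toDF("id")

// Spark 2.1+: checkpoint the Dataset directly. checkpoint() returns a new,
// checkpointed Dataset, so keep working with the returned value.
val dfCheckpointed = df.checkpoint()

// Before 2.1: checkpoint the underlying RDD and rebuild the dataframe from it.
val rdd = df.rdd
rdd.checkpoint()
rdd.count()  // force materialization of the checkpoint
val dfRebuilt = spark.createDataFrame(rdd, df.schema)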
The problems with saving to disk as Parquet and reading it back are that:
It can be inconvenient to code: you need to save and read back multiple times.
It can slow down the overall performance of the job, because when you save as Parquet and read it back, the dataframe needs to be reconstructed again.
This wiki could be useful for further investigation
As presented in the dataset checkpointing wiki
Checkpointing is actually a feature of Spark Core (that Spark SQL uses for distributed computations) that allows a driver to be restarted on failure with previously computed state of a distributed computation described as an RDD. That has been successfully used in Spark Streaming - the now-obsolete Spark module for stream processing based on RDD API.
Checkpointing truncates the lineage of a RDD to be checkpointed. That has been successfully used in Spark MLlib in iterative machine learning algorithms like ALS.
Dataset checkpointing in Spark SQL uses checkpointing to truncate the lineage of the underlying RDD of a Dataset being checkpointed.
One difference is that if your Spark job needs a certain in-memory partitioning scheme, e.g. if you use a window function, then checkpoint will persist that partitioning to disk, whereas writing to Parquet will not.
I'm not aware of a way, with current versions of Spark, to write Parquet files and then read them back in with a particular in-memory partitioning strategy. Folder-level partitioning doesn't help with this.
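A rough illustration of that claim, with a made-up key column and illustrative paths: after checkpointing a repartitioned dataframe, a window over the same key can reuse the partitioning, while after a Parquet round trip Spark no longer knows about it and shuffles again.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder().appName("partitioning-vs-parquet").getOrCreate()
import spark.implicits._

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // illustrative

val df = spark.range(0, 1000000).withColumn("key", col("id") % 100)

// Checkpoint keeps the in-memory partitioning produced by repartition,
// so the window below does not need another shuffle on "key".
val byKey = df.repartition($"key").checkpoint()
val windowed = byKey.withColumn("rn",
  row_number().over(Window.partitionBy("key").orderBy("id")))

// After a Parquet round trip the partitioning information is gone,
// so the same window function shuffles the data again.
df.repartition($"key").write.mode("overwrite").parquet("/tmp/by_key")
val reread = spark.read.parquet("/tmp/by_key")
val rereadWindowed = reread.withColumn("rn",
  row_number().over(Window.partitionBy("key").orderBy("id")))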

Read parquet file to multiple partitions [duplicate]

So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, we have also tried changing the compression of the parquet to none (from gzip). No matter what we do the first stage of the spark job only has a single partition (once a shuffle occurs it gets repartitioned into 100 and thereafter obviously things are much much faster).
Now according to a few sources (like below) parquet should be splittable (even if using gzip!), so I'm super confused and would love some advice.
https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html
I'm using spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact all the defaults for parallelism are much more than 1, so I don't understand what's going on.
You should write your Parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size configuration on the writer.
The source of ParquetOutputFormat is here, if you want to dig into details.
The block size is the minimum amount of data you can read out of a Parquet file that is logically readable (since Parquet is columnar, you can't just split it by line or something trivial like that), so you can't have more reading threads than input blocks.
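A small sketch of lowering the block size at write time, assuming it is set through the Hadoop configuration that the Parquet writer picks up; the 16 MB value and the paths are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-block-size").getOrCreate()

// Lower the Parquet row-group ("block") size before writing, so a single file
// contains more, smaller blocks and can be split across more reading threads.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)  // 16 MB

val df = spark.range(0, 10000000).toDF("id")
df.write.mode("overwrite").parquet("/tmp/small-blocks")  // illustrative path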
The new way of doing it (Spark 2.x) is setting
spark.sql.files.maxPartitionBytes
Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it misses the .sql)
From my experience, Hadoop settings no longer have effect.
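For example, a sketch of setting it when building the session; the 32 MB value and the path are illustrative, and the resulting partition count also depends on the file's row-group layout.
import org.apache.spark.sql.SparkSession

// Cap how much file data Spark packs into a single read partition (Spark 2.x+).
val spark = SparkSession.builder()
  .appName("max-partition-bytes")
  .config("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)  // 32 MB
  .getOrCreate()

val df = spark.read.parquet("/tmp/the-big-table.parquet")  // illustrative path
println(df.rdd.getNumPartitions)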
Maybe your Parquet file only takes up one HDFS block. Create a big Parquet file that spans many HDFS blocks and load it:
val k = sqlContext.parquetFile("the-big-table.parquet")
k.partitions.length
You'll see the same number of partitions as HDFS blocks. This worked fine for me (spark-1.1.0).
You mentioned that you want to control the distribution while writing to Parquet. When you create Parquet files from an RDD, Parquet preserves the partitions of the RDD. So, if you create an RDD with 100 partitions and write it out as a dataframe in Parquet format, it will write 100 separate Parquet files to the filesystem.
For reads you can specify the spark.sql.shuffle.partitions parameter.
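A short sketch of controlling the number of output files by repartitioning before the write; the paths and the partition count of 100 are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("control-output-files").getOrCreate()

val df = spark.read.parquet("/tmp/the-big-table.parquet")  // illustrative path

// 100 partitions at write time -> roughly 100 output part-files.
df.repartition(100)
  .write
  .mode("overwrite")
  .parquet("/tmp/the-big-table-100-files")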
To achieve that, you should use the SparkContext to set the Hadoop configuration (sc.hadoopConfiguration) property mapreduce.input.fileinputformat.split.maxsize.
By setting this property to a lower value than hdfs.blockSize, you will get as many partitions as there are splits.
For example:
When hdfs.blockSize = 134217728 (128MB),
and one file is read which contains exactly one full block,
and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64MB)
Then there will be two partitions into which those splits are read.
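A sketch of that setting (the path is illustrative); note the earlier answer's caveat that on Spark 2.x the spark.sql.files.maxPartitionBytes option is what actually controls the file-scan partitioning.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-maxsize").getOrCreate()

// Ask the input format to split files into at most 64 MB chunks,
// so a single 128 MB HDFS block yields two read partitions.
spark.sparkContext.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

val df = spark.read.parquet("/tmp/the-big-table.parquet")  // illustrative path
println(df.rdd.getNumPartitions)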

Spark: repartition output by key

I'm trying to output records using the following code:
spark.createDataFrame(asRow, struct)
.write
.partitionBy("foo", "bar")
.format("text")
.save("/some/output-path")
I don't have a problem when the data is small. However, when I'm processing ~600 GB of input, I end up writing around 290k files, including small files per partition. Is there a way to control the number of output files per partition? Right now I'm writing a lot of small files, and that's not good.
Having lots of files is the expected behavior, as each partition (resulting from whatever computation you had before the write) will write its relevant files into the output partitions you requested.
If you wish to avoid that you need to repartition before the write:
spark.createDataFrame(asRow, struct)
.repartition("foo","bar")
.write
.partitionBy("foo", "bar")
.format("text")
.save("/some/output-path")
You have multiple files per partition because each node writes its output to its own file. That means the only way to have a single file per partition is to repartition the data before writing. Please note that this will be very inefficient, because the repartition causes a shuffle of your data.

s3 parquet write - too many partitions, slow writing

I have a Scala Spark job that writes to S3 as Parquet files. It is 6 billion records so far, and it will keep growing daily. Per the use case, our API will query the Parquet data by id. So, to make the query results faster, I am writing the Parquet partitioned on id. However, we have 1,330,360 unique ids, so this creates 1,330,360 Parquet partitions while writing; the write step is therefore very slow, it has been writing for the past 9 hours and is still running.
output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")
Is there any way I can reduce the number of partitions and still keep the read queries fast? Or any other better way to handle this scenario? Thanks.
EDIT: id is an integer column with random numbers.
You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest something specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
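Two hedged sketches of what that could look like; the range width of 10,000, the bucket count of 200, the table name, and the stand-in dataframe are all illustrative, while the s3a path comes from the question.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("id-range-partitioning").getOrCreate()

// Stand-in for the 6-billion-row dataframe; "id" is the integer column from the question.
val output = spark.range(0, 1000000).toDF("id").withColumn("payload", rand())

// Option 1: partition by a coarse range of ids instead of the raw id.
// A few thousand range directories instead of ~1.3M id directories keeps the
// write manageable, and a query for one id only scans its range directory.
val withRange = output.withColumn("id_range", (col("id") / 10000).cast("long"))
withRange.write.mode("append").partitionBy("id_range").parquet("s3a://datalake/db/")

// Option 2: bucket by id; bucketing requires saving as a table rather than a plain path.
output.write
  .bucketBy(200, "id")
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable("datalake_db_bucketed")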

Performance Issue with writing data to snowflake using spark df

I am trying to read data from an AWS RDS system and write it to Snowflake using Spark.
My Spark job makes a JDBC connection to RDS and pulls the data into a dataframe; I then write that same dataframe to Snowflake using the Snowflake connector.
Problem statement: when I try to write the data, even 30 GB of data takes a long time to write.
Solutions I tried:
1) repartitioning the dataframe before writing.
2) caching the dataframe.
3) taking a count of the df before writing to reduce scan time at write.
It may have been a while since this question was asked. If you are preparing the dataframe, or using another tool to prepare your data before moving it to Snowflake, the Python connector integrates very nicely.
Some general recommendations for troubleshooting the query, in addition to the comments recommended above, which are great: were you able to resolve the JDBC connection with the recent updates?
Some other troubleshooting to consider:
Saving time by going directly from Spark to Snowflake with the Spark connector (see the sketch below): https://docs.snowflake.net/manuals/user-guide/spark-connector.html
For larger data sets, in general, increasing the warehouse size for the session you are using and loading the data in smaller 10 MB to 100 MB files will increase compute speed.
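A rough sketch of a direct Spark-to-Snowflake write with the Spark connector; the option names follow the connector documentation linked above, and every value shown here is a placeholder, so check them against your connector version.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("rds-to-snowflake").getOrCreate()

// Placeholder connection options; the names follow the Snowflake Spark connector
// documentation, the values are illustrative.
val sfOptions = Map(
  "sfURL"       -> "myaccount.snowflakecomputing.com",
  "sfUser"      -> "my_user",
  "sfPassword"  -> "my_password",
  "sfDatabase"  -> "MY_DB",
  "sfSchema"    -> "PUBLIC",
  "sfWarehouse" -> "MY_WH"
)

// Stand-in for the dataframe pulled from RDS over JDBC elsewhere in the job.
val df = spark.range(0, 1000000).toDF("id")

df.write
  .format("net.snowflake.spark.snowflake")  // the connector's data source name
  .options(sfOptions)
  .option("dbtable", "MY_TABLE")            // illustrative target table
  .mode(SaveMode.Append)
  .save()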
Let me know what you think, I would love to hear how you solved it.