Spark partitionBy much slower than without it - scala

I tested writing with:
df.write.partitionBy("id", "name")
.mode(SaveMode.Append)
.parquet(filePath)
However if I leave out the partitioning:
df.write
.mode(SaveMode.Append)
.parquet(filePath)
It executes 100x(!) faster.
Is it normal for the same amount of data to take 100x longer to write when partitioning?
There are 10 and 3000 unique id and name column values respectively.
The DataFrame has 10 additional integer columns.

The first code snippet will write a parquet file per partition to file system (local or HDFS). This means that if you have 10 distinct ids and 3000 distinct names this code will create 30000 files. I suspect that overhead of creating files, writing parquet metadata, etc is quite large (in addition to shuffling).
Spark is not the best database engine, if your dataset fits in memory I suggest to use a relational database. It will be faster and easier to work with.

Related

Writing wide table (40,000+ columns) to Databricks Hive Metastore for use with AutoML

I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.
To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.
print((df.count(), len(df.columns)))
(33030, 45502)
This data is written to a table in Hive using the following PySpark command (I believe this is standard):
df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
Unfortunately this job does not finish within 'acceptable' time (10 hours). I cancel and hence don't have an error message.
When I reduce the number of columns with
df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")
it fares better and finishes in 9.87 minutes, so the method should work.
Can this be solved:
With a better compute instance?
With a better script?
Not at all and if so, is there another approach?
[EDIT to address questions in comments]
Runtime and driver summary:
2-16 Workers 112-896 GB Memory 32-256 Cores (Standard_DS5_v2)
1 Driver 56 GB Memory, 16 Cores (Same as worker)
Runtime10.4.x-scala2.12
To give an impression of the timings I've added a table below.
columns
time (mins)
10
1.94
100
1.92
200
3.04
500
9.87
1000
25.91
5000
938.4
Data type of the remaining columns is Integer.
As far as I know I'm writing the table on the same environment that I am working on. Data flow: Azure Blob CSV -> Data read and wrangling -> PySpark DataFrame -> Hive Table. Last three steps are on the same cloud machine.
Hope this helps!
I think your case is not related to either Spark resource configuration or network connection, it's related to Spark design itself.
Long in short, Spark is designed for long and narrow data, which is exactly opposite of your dataframe. When you look at your experiment, the time consuming is in exponential growth when your column size increase. Although it's about reading the csv but not writing table, you can check this post for a good explanation on why Spark is not good at handling wide dataframe: Spark csv reading speed is very slow although I increased the number of nodes
Although I didn't use the Azure AutoML before, based on the dataset to achieve your goal, I think you can try:
Try to use python pandas dataframe and Hive connection library to see if there is any performance enhancement
Concatenate all your column into a single Array / Vector before you write to Hive

Issues while trying to export pyspark pandas dataframe to csv in pyspark

df=df_full[df_fill.part_col.isin(['part_a','part_b'])]
df=df[df.some_other_col =='some_value']
#df has shape of roughly 240k,200
#df_full has shape of roughly 30m, 200
df.to_pandas().reset_index().to_csv('testyyy.csv',index=False)
If I do any groupby operation it is amazingly fast. However the issue lies when I try to export small subset of this large dataset to csv. While I am eventually able to export the dataframe to csv but it is taking too much time.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate parquet partitioned files. Also I am using pyspark.pandas in spark3+
So question is what is happening? And what is most efficient wat to export the filtered dataframe to csv?

Spark binary file and Delta Table

I have batches of binary files (~3mb each) that I receive in batches of ~20000 files at a time. These files are used downstream for further processing, but I want to process them and store in Delta tables.
I can do this easily:
df = spark.read.format(“binaryFile”).load(<path-to-batch>)
df = df.withColumn(“id”, expr(“uuid()”)
dt = DeltaTable.forName(“myTable”)
dt.alias(“a”).merge(
df.alias(“a”),
“a.path = b.path”
).whenNotMatchedInsert(
values={“id”: “b.id”, “content”: “b.content”}
).execute()
This makes the table quite slow already, but later I need to query certain IDs, do collect and write them individually back to binary files.
Questions:
Would my table benefit from a batch column and partition?
Should I partition by id? I know this is not ideal, but might make querying individual rows easier?
Is there a better way to write the files out again, rather than .collect()? I have seen when I select about 1000 specific ids write them out that about 10 minutes is just for collect and then less than a minute to write. I do something like:
for row in df.collect():
with open(row.id, “wb”) as fw:
fw.write(row.content)
As uuid() returns random values, I'm afraid we cannot use it to compare existing data with new records. (Sorry if I misunderstood the idea)
I don't think using partition by id will help as the id column has obviously high cardinality.
Instead of using collect() which loads all records into Driver, I think it would be better if you can write the records in the Spark dataframe directly and simultaneously from all the worker nodes into a temporary location on ADLS first and then aggregate a few data files from that location.

EMR-Spark is slow writing a DataFrame with an Array of Strings to S3

I'm trying to write a dataframe to S3 from EMR-Spark and I'm seeing some really slow write times where the writing comes to dominate the total runtime (~80%) of the script. For what it's worth, I've tried both .csv and .parquet formats, it doesn't seem make a difference.
My data can be formatted in two ways, here's the preferred format:
ID : StringType | ArrayOfIDs : ArrayType
(The number of unique IDs in the first column numbers in the low millions. ArrayOfIDs contains GUID formatted strings, and can contain anywhere from ~100 - 100,000 elements)
Writing the first form to S3 is incredibly slow. For what it's worth, I've tried setting the mapreduce.fileoutputcommitter.algorithm.version to 2 as described here: https://issues.apache.org/jira/browse/SPARK-20107 to no real effect.
However my data can also be formatted as an adjacency list, like this:
ID1 : StringType | ID2 : StringType
This appears to be much faster for writing to S3, but I am at a loss for why. Here are my specific questions:
Ultimately I'm trying to get my data into an Aurora RDS Postgres cluster (I was told firmly by those before me that the Spark JDBC connector is too slow for the job, which is why I'm currently trying to dump the data in S3 before loading it into Postgres with a COPY command). I'm not married to using S3 as an intermediate store if there are better alternatives for getting these data frames into RDS Postgres.
I don't know why the first schema with the Array of Strings is so much slower on write. The total data written is actually far less than the second schema on account of eliminating ID duplication from the first column. Would also be nice to understand this behavior.
Well, I still don't know why writing arrays directly from Spark is so much slower than the adjacency list format. But best practice seems to dictate that I avoid writing to S3 directly from Spark.
Here's what i'm doing now:
Write the data to HDFS (anecdotally, the write speed of the adjacency list vs the array now falls in line with my expectations).
From HDFS, use EMR's s3-dist-cp utility to wholesale write the data to S3 (this also seems reasonably performant with array typed data).
Bring the data into Aurora Postgres with the aws_s3.table_import_from_s3 extension.

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append") but I find that Spark sometimes fails midway through so I end up loosing how much of my data is written and how much I still need to write.
I am using parquet because the partitioning substantially increases my querying in the future. As well, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or another format) is greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If you're using unsorted partitioning your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing to have all the data per output file on the same partitions:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition