Writing wide table (40,000+ columns) to Databricks Hive Metastore for use with AutoML - pyspark

I want to train a regression prediction model with Azure Databricks AutoML using the GUI. The training data is very wide. All of the columns except for the response variable will be used as features.
To use the Databricks AutoML GUI I have to store the data as a table in the Hive metastore. I have a large DataFrame df with more than 40,000 columns.
print((df.count(), len(df.columns)))
(33030, 45502)
This data is written to a table in Hive using the following PySpark command (I believe this is standard):
df.write.mode('overwrite').saveAsTable("WIDE_TABLE")
Unfortunately, this job does not finish within an acceptable time (10 hours). I cancel it and hence don't have an error message.
When I reduce the number of columns with
df.select(df.columns[:500]).write.mode('overwrite').saveAsTable("WIDE_TABLE")
it fares better and finishes in 9.87 minutes, so the method should work.
Can this be solved:
With a better compute instance?
With a better script?
Or not at all, and if not, is there another approach?
[EDIT to address questions in comments]
Runtime and driver summary:
2-16 Workers 112-896 GB Memory 32-256 Cores (Standard_DS5_v2)
1 Driver 56 GB Memory, 16 Cores (Same as worker)
Runtime: 10.4.x-scala2.12
To give an impression of the timings I've added a table below.
columns | time (mins)
     10 |   1.94
    100 |   1.92
    200 |   3.04
    500 |   9.87
   1000 |  25.91
   5000 | 938.4
Data type of the remaining columns is Integer.
As far as I know, I'm writing the table in the same environment I am working in. Data flow: Azure Blob CSV -> data read and wrangling -> PySpark DataFrame -> Hive table. The last three steps run on the same cloud machine.
Hope this helps!

I think your case is not related to Spark resource configuration or the network connection; it's related to the design of Spark itself.
Long story short, Spark is designed for long and narrow data, which is exactly the opposite of your DataFrame. As your experiment shows, the time taken grows roughly exponentially as the number of columns increases. Although it's about reading CSVs rather than writing tables, you can check this post for a good explanation of why Spark does not handle wide DataFrames well: Spark csv reading speed is very slow although I increased the number of nodes
Although I haven't used Azure AutoML before, given your dataset and your goal, I think you can try:
Using a Python pandas DataFrame and a Hive connection library to see if there is any performance improvement
Concatenating all your columns into a single Array / Vector column before you write to Hive (see the sketch below)
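For the second suggestion, here is a minimal PySpark sketch using Spark ML's VectorAssembler. The column names ("label", "features") and the table name WIDE_TABLE_VEC are illustrative placeholders, and you would need to verify that the AutoML GUI accepts a vector feature column.
# Minimal sketch: collapse all feature columns into one vector column before writing.
# "label", "features" and WIDE_TABLE_VEC are placeholder names, not from the question.
from pyspark.ml.feature import VectorAssembler

feature_cols = [c for c in df.columns if c != "label"]  # everything except the response
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
narrow_df = assembler.transform(df).select("label", "features")

# Two columns instead of 45,000+, so the write plan stays small.
narrow_df.write.mode("overwrite").saveAsTable("WIDE_TABLE_VEC")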

Related

Pyspark Dataframe count taking too long

So we have a PySpark DataFrame which has around 25k records. We are trying to perform a count/empty check on this and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing is also taking quite a long time (cancelled after 20 mins)
Could you please help us understand what we are doing wrong?
Note: there are no duplicate values in df, and we have done multiple joins to form the df.
Without looking at the output of df.explain() it's challenging to know the specific issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the other partitions to finish.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
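As a minimal sketch, assuming a Spark 3.x session where these settings exist, they can be enabled on the current SparkSession like this:
# Enable adaptive query execution and its skew-join handling for this session.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")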
Count is not taking too long; it's taking the time it needs to complete what you asked Spark to do. To speed it up, you should do things you are likely already doing: filter the data before joining so only critical data is transferred into the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source data/tables to 1,000 or 10,000 records and see if it runs fast. Then, one at a time, remove the limit from a single table/data source (and keep the limit on all the others) to find the table that is the source of your problem. Then study that table/data source and figure out how you can work around the issue (if adaptive query execution can't fix it).
(Finally, if you are using Hive tables, you should make sure the table stats are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;

sparksql with dataset of several GBs size

I didn't find the answer to this question on the web or in other questions, so I'm trying here:
The size of my dataset is several GBs (~5GB to ~15GB).
I have multiple tables, some of them containing ~50M rows.
I'm using PostgreSQL, which has its own query optimizations (parallel workers and indexing).
50% of my queries take advantage of indexing and multiple workers to finish faster.
Some of my queries use joins.
I read that SparkSQL is intended for huge datasets.
If I have multiple servers to run SparkSQL on, can I get better performance with SparkSQL?
Is a 15GB dataset a good fit for SparkSQL or PostgreSQL?
When is it best to choose SparkSQL over PostgreSQL?
If I have multiple servers to run SparkSQL on, can I get better performance with SparkSQL?
-> If your data does not have a lot of skew, SparkSQL will give better performance in terms of query speed, as the query would run on the Spark cluster.
Is a 15GB dataset a good fit for SparkSQL or PostgreSQL?
-> SparkSQL is simply a query engine built into Apache Spark, so it will process the data and allow you to create views in memory, but that is it. Once the application terminates, the view is removed.
PostgreSQL, on the other hand, is, and I quote, a DATABASE. It will let you query data and store the results in its own native format.
Now, coming to your question: 15GB of data is not a lot to process for either of the engines, and your query performance will depend on the data model.
When is it best to choose SparkSQL over PostgreSQL?
-> Choose SparkSQL when you wish to perform ad-hoc queries and the dataset sizes are in the terabyte range.
Choose PostgreSQL when you wish to store transactional data, or datasets that are simply being used to drive BI tools, custom UIs, or applications.
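To illustrate the point about in-memory views, here is a minimal PySpark sketch; the file path and view name are placeholders, not from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-view-demo").getOrCreate()

# Read some data and register it as a temporary view.
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("people_view")

# Query it with SparkSQL; nothing is persisted anywhere.
spark.sql("SELECT COUNT(*) AS n FROM people_view").show()

# Once the application stops, the view is gone.
spark.stop()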

EMR-Spark is slow writing a DataFrame with an Array of Strings to S3

I'm trying to write a DataFrame to S3 from EMR-Spark and I'm seeing some really slow write times, where the writing comes to dominate the total runtime (~80%) of the script. For what it's worth, I've tried both .csv and .parquet formats; it doesn't seem to make a difference.
My data can be formatted in two ways; here's the preferred format:
ID : StringType | ArrayOfIDs : ArrayType
(The number of unique IDs in the first column numbers in the low millions. ArrayOfIDs contains GUID formatted strings, and can contain anywhere from ~100 - 100,000 elements)
Writing the first form to S3 is incredibly slow. For what it's worth, I've tried setting the mapreduce.fileoutputcommitter.algorithm.version to 2 as described here: https://issues.apache.org/jira/browse/SPARK-20107 to no real effect.
However my data can also be formatted as an adjacency list, like this:
ID1 : StringType | ID2 : StringType
This appears to be much faster for writing to S3, but I am at a loss for why. Here are my specific questions:
Ultimately I'm trying to get my data into an Aurora RDS Postgres cluster (I was told firmly by those before me that the Spark JDBC connector is too slow for the job, which is why I'm currently trying to dump the data in S3 before loading it into Postgres with a COPY command). I'm not married to using S3 as an intermediate store if there are better alternatives for getting these data frames into RDS Postgres.
I don't know why the first schema with the Array of Strings is so much slower on write. The total data written is actually far less than the second schema on account of eliminating ID duplication from the first column. Would also be nice to understand this behavior.
Well, I still don't know why writing arrays directly from Spark is so much slower than the adjacency list format. But best practice seems to dictate that I avoid writing to S3 directly from Spark.
Here's what I'm doing now:
Write the data to HDFS (anecdotally, the write speed of the adjacency list vs the array now falls in line with my expectations).
From HDFS, use EMR's s3-dist-cp utility to wholesale write the data to S3 (this also seems reasonably performant with array typed data).
Bring the data into Aurora Postgres with the aws_s3.table_import_from_s3 extension.
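As a rough sketch of that pipeline, where the paths, bucket, region and table names below are placeholders and the exact aws_s3 call should be checked against the Aurora documentation:
# 1) Write from Spark to HDFS first instead of directly to S3. The path is a placeholder.
df.write.mode("overwrite").csv("hdfs:///tmp/my_export")

# 2) Copy from HDFS to S3 with EMR's s3-dist-cp utility, run on the master node:
#      s3-dist-cp --src hdfs:///tmp/my_export --dest s3://my-bucket/my_export
#
# 3) Load into Aurora Postgres with the aws_s3 extension, roughly:
#      SELECT aws_s3.table_import_from_s3(
#          'my_table', '', '(format csv)',
#          aws_commons.create_s3_uri('my-bucket', 'my_export/part-00000.csv', 'us-east-1'));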

Optimised way of doing cumulative sum on large number of columns in pyspark

I have a DataFrame with 752 columns (id, date, and 750 feature columns) and around 1.5 million rows, and I need to apply a cumulative sum to all 750 feature columns, partitioned by id and ordered by date.
Below is the approach I am following currently:
from pyspark.sql import Window
from pyspark.sql.functions import col, sum

# putting all 750 feature columns in a list
required_columns = ['ts_1', 'ts_2', ..., 'ts_750']
# defining the window
sumwindow = Window.partitionBy('id').orderBy('date')
# applying the window to calculate the cumulative sum of each feature column
for current_col in required_columns:
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum(col(current_col)).over(sumwindow))
# saving the result to a parquet file
df.write.format('parquet').save(output_path)
I am getting the error below while running this approach:
py4j.protocol.Py4JJavaError: An error occurred while calling o2428.save.
: java.lang.StackOverflowError
Please let me know an alternative solution. It seems like a cumulative sum is a bit tricky with this amount of data. Please suggest any alternative approach or any Spark configuration I can tune to make it work.
I expect you have the issue of too large a lineage. Take a look at your explain plan after you reassign the DataFrame so many times.
The standard solution for this is to checkpoint your dataframe every so often to truncate the explain plan. This is sort of like caching but for the plan rather than the data and is often needed for iterative algorithms that modify dataframes.
Here is a nice pyspark explanation of caching and checkpointing
I suggest df.checkpoint() every 5-10 modifications to start with
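A minimal sketch of what that could look like, where the checkpoint directory and the interval of 10 columns are assumptions you would tune for your cluster:
from pyspark.sql.functions import col, sum

# required_columns and sumwindow are as defined in the question above.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # placeholder directory

for i, current_col in enumerate(required_columns, start=1):
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum(col(current_col)).over(sumwindow))
    if i % 10 == 0:  # interval is an assumption; tune it
        df = df.checkpoint()  # materializes df and truncates the accumulated plan

df.write.format('parquet').save(output_path)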
Let us know how it goes

Spark partitionBy much slower than without it

I tested writing with:
df.write.partitionBy("id", "name")
.mode(SaveMode.Append)
.parquet(filePath)
However if I leave out the partitioning:
df.write
.mode(SaveMode.Append)
.parquet(filePath)
It executes 100x(!) faster.
Is it normal for the same amount of data to take 100x longer to write when partitioning?
There are 10 and 3000 unique id and name column values respectively.
The DataFrame has 10 additional integer columns.
The first code snippet will write a parquet file per partition to the file system (local or HDFS). This means that if you have 10 distinct ids and 3000 distinct names, this code will create 30,000 files. I suspect that the overhead of creating files, writing parquet metadata, etc. is quite large (in addition to the shuffling).
Spark is not the best database engine; if your dataset fits in memory, I suggest using a relational database. It will be faster and easier to work with.
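That said, if you do stick with the partitioned write in Spark, a sketch of one common way to keep the file count down (a mitigation not spelled out in the answer above) is to repartition by the same columns before calling partitionBy, so each (id, name) combination is written by a single task; in PySpark, with filePath as a placeholder:
# Sketch: repartition by the partitionBy columns first so each (id, name)
# combination is written by a single task, i.e. roughly one file per combination
# instead of one per task per combination. filePath is a placeholder.
(df.repartition("id", "name")
   .write
   .partitionBy("id", "name")
   .mode("append")
   .parquet(filePath))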