Too many withColumn statements? - pyspark

I'm relatively new to Spark. Right now I'm working on a dataset that is really messy. As a result, I had to write a lot of withColumn statements that change strings in a column. I just counted them, and in total I have about 35. Most of them change only two or three columns, over and over again. My statements look as follows:
.withColumn(
    "id",
    F.when(
        (F.col("country") == "SE") &
        (F.col("company") == "ABC"),
        "22030"
    ).otherwise(F.col("id"))
)
Anyway, sometimes the job runs through and sometimes it doesn't; it seems to crash my driver. Is this a problem because there are too many withColumn statements? My understanding is that they shouldn't cause a collect, so they can be executed on the workers independently, right? Also, the dataset itself doesn't have a lot of rows, approximately 25,000. Or is there something wrong with the way I'm tackling the problem? Should I rewrite the withColumn statements? How can I find out where the problem lies?
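For reference, the select-based rewrite recommended in the last answer below applies directly to the snippet above. A minimal sketch of the idea, where the second ("NO"/"XYZ") rule is made up purely for illustration:

from pyspark.sql import functions as F

def apply_id_overrides(df):
    # Chain all the "id" rules into one when/otherwise expression and apply it
    # in a single select, so the plan grows by one projection instead of one
    # projection per rule.
    new_id = (
        F.when((F.col("country") == "SE") & (F.col("company") == "ABC"), "22030")
         .when((F.col("country") == "NO") & (F.col("company") == "XYZ"), "22031")  # hypothetical second rule
         .otherwise(F.col("id"))
         .alias("id")
    )
    # Keep every other column as-is and swap "id" for the combined expression
    return df.select([new_id if c == "id" else c for c in df.columns])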

Related

Spark - adding multiple columns under the same when condition

I need to add a couple of columns to a Spark DataFrame.
The value for both columns is conditional, using a when clause, but the condition is the same for both of them.
val df: DataFrame = ???
df
  .withColumn("colA", when(col("condition").isNull, f1).otherwise(f2))
  .withColumn("colB", when(col("condition").isNull, f3).otherwise(f4))
Since the condition in both when clauses is the same, is there a way I can rewrite this without repeating myself? I don't mean just extracting the condition to a variable, but actually reducing it to a single when clause, to avoid having to run the test multiple times on the DataFrame.
Also, in case I leave it like that, will Spark calculate the condition twice, or will it be able to optimize the work plan and run it only once?
The corresponding columns f1/f3 and f2/f4 can be packed into an array and then separated into two different columns after evaluating the condition.
df.withColumn("colAB", when(col("condition").isNull, array('f1, 'f3)).otherwise(array('f2, 'f4)))
  .withColumn("colA", 'colAB(0))
  .withColumn("colB", 'colAB(1))
The physical plans for my code and the code in the question are (ignoring the intermediate column colAB) the same:
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colA#71, colB#78]
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colAB#47, colA#54, colB#62]
so in both cases the condition is evaluated only once. This is at least true if condition is a regular column.
A reason to combine the two when statements could be that the code is more readable, although that judgement depends on the reader.
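If you want to check this on your own data, you can print the plans yourself with explain. A minimal sketch in PySpark (the sample schema below is made up to mirror the question; in Scala the equivalent call is df.explain(true)):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample data with the same column names as the question
df = spark.createDataFrame(
    [(1, 2, 3, 4, None), (5, 6, 7, 8, 9)],
    "f1 int, f2 int, f3 int, f4 int, condition int",
)

result = (
    df.withColumn("colAB", F.when(F.col("condition").isNull, F.array("f1", "f3"))
                            .otherwise(F.array("f2", "f4")))
      .withColumn("colA", F.col("colAB")[0])
      .withColumn("colB", F.col("colAB")[1])
)

# Prints the parsed, analyzed, optimized and physical plans; check how often
# the isNull predicate appears in the physical plan.
result.explain(True)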

Avoid loading into table dataframe that is empty

I am creating a process in Spark Scala within an ETL that checks for certain events that occurred during the ETL process. I start with an empty dataframe, and if events occur this dataframe is filled with information (a dataframe can't be filled in place, it can only be combined with other dataframes that have the same structure). At the end of the process, the generated dataframe is loaded into a table, but it can happen that the dataframe ends up empty because no event has occurred, and I don't want to load an empty dataframe because that makes no sense. So I'm wondering if there is an elegant way to load the dataframe into the table only if it is not empty, without using an if condition. Thanks!!
I recommend creating the dataframe anyway. If you don't create it with the same schema, even when it's empty, your operations/transformations on the DF could fail because they may refer to columns that are not present.
To handle this, you should always create a DataFrame with the same schema, meaning the same column names and datatypes, regardless of whether the data exists yet; you can populate it with data later.
If you still want to do it your way, here are a few options for Spark 2.1.0 and above:
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
These are equivalent.
I don't recommend using df.count > 0 because it has to count every row (linear in the size of the data), and you would still have to do a check like df != null beforehand.
A much better solution would be:
df.rdd.isEmpty
Or since Spark 2.4.0 there is also Dataset.isEmpty.
As you can see, whatever you decide to do, there is a check you need to perform somewhere, so you can't really get rid of the if condition; your own requirement already implies one: load the table only if the dataframe is not empty.
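To make that concrete, a minimal sketch of the guarded write in PySpark (the Scala version is a direct translation; the table name and write mode here are assumptions, not from the question):

# events_df is the dataframe accumulated during the ETL run
if not events_df.rdd.isEmpty():  # DataFrame.isEmpty() also exists on newer Spark versions
    events_df.write.mode("append").saveAsTable("etl_events")  # hypothetical target table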

Spark Scala - Comparing Datasets Column by Column

I'm just getting started with Spark; I've previously used Python with pandas. One of the things I do regularly is compare datasets to see which columns have differences. In Python/pandas this looks something like this:
merged = df1.merge(df2, on="by_col")
for col in cols:
    diff = merged[col + "_x"] != merged[col + "_y"]
    if diff.sum() > 0:
        print(f"{col} has {diff.sum()} diffs")
I'm simplifying this a bit, but this is the gist of it, and of course after this I'd drill down and look at, for example:
col = "col_to_compare"
diff = merged[col+"_x"] != merged[col+"_y"]
print(merged[diff][[col+"_x",col+"_y"]])
Now in Spark/Scala this is turning out to be extremely inefficient. The same logic works, but this dataset is roughly 300 columns wide, and the following code takes about 45 minutes to run for a 20 MB dataset, because it submits 300 different Spark jobs in sequence, not in parallel, so I seem to be paying the startup cost of a Spark job 300 times. For reference, the pandas version takes something like 300 ms.
for (col <- cols) {
  val cnt = merged.filter(merged("dev_" + col) <=> merged("prod_" + col)).count
  if (cnt != merged.count) {
    println(col + " = " + cnt + "/ " + merged.count)
  }
}
What's the faster, more Spark-idiomatic way of doing this type of thing? My understanding is that I want this to be a single Spark job that creates one plan. I was looking at transposing to a very tall dataset, and while that could potentially work, it ends up being quite complicated and the code is not straightforward at all. Also, although this example fits in memory, I'd like to be able to use this function across datasets, and we have a few that are multiple terabytes, so it needs to scale for large datasets as well, whereas with Python/pandas that would be a pain.
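One common way to get this down to a single job is to fold all the per-column comparisons into one aggregation, so Spark builds one plan and scans the data once. A minimal sketch in PySpark, assuming merged and cols as in the snippet above (the Scala agg version is analogous):

from pyspark.sql import functions as F

# One aggregation: each column contributes one mismatch count, so the whole
# comparison is a single job instead of one job per column.
diff_counts = merged.agg(*[
    F.sum(
        F.when(~F.col("dev_" + c).eqNullSafe(F.col("prod_" + c)), 1).otherwise(0)
    ).alias(c)
    for c in cols
]).first().asDict()

for c, n in diff_counts.items():
    if n:  # n is None when merged has no rows
        print(f"{c} has {n} diffs")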

Pyspark to get the column names of the columns that contains null values

I have a DataFrame and I want to get the names of the columns that contain one or more null values.
So far what I've done :
df.select([c for c in tbl_columns_list if df.filter(F.col(c).isNull()).count() > 0]).columns
I have almost 500 columns in my dataframe, and when I execute that code it becomes incredibly slow, and I don't know why. Do you have any clue how I can make it work and how I can optimize it? I need an optimized solution in PySpark, please. Thanks in advance.
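For what it's worth, a minimal sketch of doing this in a single aggregation rather than one count() job per column (assuming tbl_columns_list holds the column names, as in the snippet above):

from pyspark.sql import functions as F

# Count the nulls of every column in one job (one scan of the data), then
# keep only the columns whose null count is greater than zero.
null_counts = df.agg(*[
    F.sum(F.col(c).isNull().cast("int")).alias(c)
    for c in tbl_columns_list
]).first().asDict()

columns_with_nulls = [c for c, n in null_counts.items() if n]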

Why is my Code Repo warning me about using withColumn in a for/while loop?

I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal use of the PySpark API?
We've noticed in practice that using withColumn inside a for/while loop leads to poor query planning performance as discussed over here. This is not obvious when writing code for the first time in Foundry, so we've built a feature to warn you about this behavior.
We'd recommend you follow the Scala docs recommendation:
withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Since: 2.0.0
Note: this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
i.e.
my_other_columns = [...]
df = df.select(
    *[col_name for col_name in df.columns if col_name not in my_other_columns],
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns]
)
is vastly preferred over
my_other_columns = [...]
for col_name in my_other_columns:
    df = df.withColumn(
        col_name + "_suffix",
        F.col(col_name)
    )
While this may technically be a normal use of the PySpark API, it will result in poor query planning performance if withColumn is called too many times in your job, so we'd prefer you avoid this problem entirely.