I need to add a couple of columns to a Spark DataFrame.
The value for both columns is conditional, using a when clause, but the condition is the same for both of them.
val df: DataFrame = ???
df
.withColumn("colA", when(col("condition").isNull, f1).otherwise(f2))
.withColumn("colB", when(col("condition").isNull, f3).otherwise(f4))
Since the condition in both when clauses is the same, is there a way I can rewrite this without repeating myself? I don't mean just extracting the condition to a variable, but actually reducing it to a single when clause, to avoid having to run the test multiple times on the DataFrame.
Also, in case I leave it like that, will Spark calculate the condition twice, or will it be able to optimize the work plan and run it only once?
The corresponding columns f1/f3 and f2/f4 can be packed into an array and then separated into two different columns after evaluating the condition.
df.withColumn("colAB", when(col("condition").isNull, array('f1, 'f3)).otherwise(array('f2, 'f4)))
.withColumn("colA", 'colAB(0))
.withColumn("colB", 'colAB(1))
The physical plans for my code and the code in the question are (ignoring the intermediate column colAB) the same:
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colA#71, colB#78]
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colAB#47, colA#54, colB#62]
so in both cases the condition is evaluated only once. This is at least true if condition is a regular column.
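For reference, such plans can be printed with Dataset.explain(); a minimal sketch, where resultDf is an illustrative name for the DataFrame after the withColumn calls:
resultDf.explain()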
One reason to combine the two when statements could be readability, although that judgement depends on the reader.
Related
Suppose I create a polars Lazyframe from a list of csv files using pl.concat():
df = pl.concat([pl.scan_csv(file) for file in ['file1.csv', 'file2.csv']])
Is the data in the resulting dataframe guaranteed to have the exact order of the input files, or could there be a scenario where the query optimizer would mix things up?
The order is maintained. The engine may execute them in a different order, but the final result will always have the same order as the lazy computations provided by the caller.
We generally add columns to existing Spark DataFrames using the withColumn function.
I just wanted to know: if we have millions of rows in a Dataset, will the
withColumn("columnName", when(condition1, valueA).when(condition2, valueB))
method check the conditions for each row of the Dataset?
If yes, isn't that poor performance? And is there a better way?
Yes, withColumn("columnName", column expression) will be evaluated for each and every row, millions of them. This is a map operation, so it is linearly scalable. I wouldn't worry about the performance, which will depend on the complexity of the operation.
If your operation needs data from each row, then you must execute it for each row.
If your operation is the same for the whole DataFrame or for a partition of it, you can execute that operation once per DataFrame or partition and write the result to each row; this can reduce some overhead.
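As an illustrative sketch of the second case, a value that is constant for the whole DataFrame can be computed once on the driver and attached with lit, instead of being recomputed inside a per-row expression (computeExpensiveConstant and df are hypothetical names):
import org.apache.spark.sql.functions.lit

val threshold = computeExpensiveConstant()                      // runs once, on the driver
val withThreshold = df.withColumn("threshold", lit(threshold))  // each row just copies the literal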
I'm relatively new to Spark. Right now I'm working on a dataset which is really messy. As a result, I had to write a lot of withColumn statements that change strings in a column. I just counted them, and in total I have something like 35. Most of them change only two or three columns, over and over again. My statements look as follows:
.withColumn(
    "id",
    F.when(
        (F.col("country") == "SE") &
        (F.col("company") == "ABC"),
        "22030"
    )
    .otherwise(F.col("id"))
)
Anyway, sometimes I succeed in running the job and sometimes I don't; it seems to crash my driver. Is the problem that there are too many withColumn statements? My understanding is that these shouldn't cause a collect, so they can be executed on the workers independently, right? Also, the dataset itself doesn't have a lot of rows, approximately 25,000. Or is there something wrong in the way I'm tackling the problem? Should I rewrite the withColumn statements? How can I find out where the problem lies?
I am currently working on 11,000 files. Each file will generate a DataFrame which will be unioned with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach( filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on the 11,000 files. Then I added the following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what should be done to avoid the StackOverflowError without this slowdown?
Thanks!
The issue is that Spark manages a DataFrame as a set of transformations. It begins with the toDF of the first DataFrame, then performs the transformations on it (e.g. withColumn), then unionAll with the previous DataFrame, etc.
The unionAll is just another such transformation, and the tree becomes very long (with 11K unionAll calls you have an execution tree of depth 11K). Building up that lineage information can run into a stack overflow.
The caching doesn't solve this. However, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDDs for iterative processes (your example is actually not iterative but purely parallel: you can simply save each separate DataFrame along the way, then convert them to RDDs and use an RDD union).
Since your case seems to be simply unioning a bunch of DataFrames without true iteration, you can also do the union in a tree manner (i.e. union pairs, then pairs of pairs, etc.); this changes the depth from O(N) to O(log N), where N is the number of unions.
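A rough sketch of such a tree-style union, assuming the per-file DataFrames have already been collected into a sequence (treeUnion and allFileDataFrames are illustrative names):
import org.apache.spark.sql.DataFrame

def treeUnion(dfs: Seq[DataFrame]): DataFrame =
  if (dfs.size <= 1) dfs.head
  else treeUnion(
    dfs.grouped(2).map {
      case Seq(a, b) => a.unionAll(b)   // union neighbouring pairs
      case Seq(a)    => a               // odd DataFrame out passes through unchanged
    }.toSeq
  )

val combined = treeUnion(allFileDataFrames)   // plan depth is now O(log N)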
Lastly, you can write the DataFrame to disk and read it back. The idea is that after every X (e.g. 20) unions, you do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read it back, the lineage of the single DataFrame is just the file read itself. The cost, of course, is the writing and reading of the file.
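A minimal sketch of that periodic write/read trick inside the loop, where checkpointDir and the counter i are illustrative names:
if (i % 20 == 0) {
  val path = s"$checkpointDir/chunk_$i"
  df1.write.parquet(path)
  df1 = spark.read.parquet(path)   // the lineage of df1 now starts at the file read
}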
I have two DataFrames, each containing a different number of columns.
I need to compare three fields between them to check whether they are equal.
I tried the following approach, but it's not working.
if(df_table_stats("rec_cnt").equals(df_aud("REC_CNT")) || df_table_stats("hashcount").equals(df_aud("HASH_CNT")) || round(df_table_stats("hashsum"),0).equals(round(df_aud("HASH_TTL"),0)))
{
println("Job executed succefully")
}
df_table_stats("rec_cnt"), this returns Column rather than actual value hence condition becoming false.
Also, please explain difference between df_table_stats.select("rec_cnt") and df_table_stats("rec_cnt").
Thanks.
Use SQL and inner join both DataFrames with your conditions.
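One possible reading of that suggestion, reusing the column names from the question (the temporary view names are illustrative):
df_table_stats.createOrReplaceTempView("stats")
df_aud.createOrReplaceTempView("aud")
val matched = spark.sql(
  """SELECT *
    |FROM stats s JOIN aud a
    |  ON s.rec_cnt = a.REC_CNT
    |  OR s.hashcount = a.HASH_CNT
    |  OR round(s.hashsum, 0) = round(a.HASH_TTL, 0)""".stripMargin)
if (matched.count() > 0) println("Job executed successfully")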
Per my comment, the syntax you're using is just simple column references; they don't actually return data. Assuming you MUST use Spark for this, you'd want a method that actually returns the data, known in Spark as an action. For this case you can use take to return the first Row of data and extract the desired columns:
val tableStatsRow: Row = df_table_stats.take(1).head
val audRow: Row = df_aud.take(1).head
val tableStatsRecCount = tableStatsRow.getAs[Int]("rec_cnt")
val audRecCount = audRow.getAs[Int]("REC_CNT")
//repeat for the other values you need to capture
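The extracted values can then be compared as plain Scala values; a minimal sketch mirroring one branch of the original condition:
if (tableStatsRecCount == audRecCount) {
  println("Job executed successfully")
}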
However, Spark is definitely overkill if this is all you're using it for. You could use a simple JDBC library for Scala, like ScalikeJDBC, to do these queries and capture the primitives in the results.