spark dataframe is loading all nulls from csv file - scala

I have a file with the following data:
####$ cat products.csv
1,tv,sony,hd,699
2,tv,sony,uhd,799
3,tv,samsung,hd,599
4,tv,samsung,uhd,799
5,phone,iphone,x,999
6,phone,iphone,11,999
7,phone,samsung,10,899
8,phone,samsung,10note,999
9,phone,pixel,4,799
10,phone,pixel,3,699
I'm trying to load this into a Spark DataFrame. It gives me no errors, but it loads all nulls.
scala> val productSchema = StructType((Array(StructField("productId",IntegerType,true),StructField("productType",IntegerType,true),StructField("company",IntegerType,true),StructField("model",IntegerType,true),StructField("price",IntegerType,true))))
productSchema: org.apache.spark.sql.types.StructType = StructType(StructField(productId,IntegerType,true), StructField(productType,IntegerType,true), StructField(company,IntegerType,true), StructField(model,IntegerType,true), StructField(price,IntegerType,true))
scala> val df = spark.read.format("csv").option("header", "false").schema(productSchema).load("/path/products_js/products.csv")
df: org.apache.spark.sql.DataFrame = [productId: int, productType: int ... 3 more fields]
scala> df.show
+---------+-----------+-------+-----+-----+
|productId|productType|company|model|price|
+---------+-----------+-------+-----+-----+
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
| null| null| null| null| null|
+---------+-----------+-------+-----+-----+
Now I tried a different way to load the data, and it worked:
scala> val temp = spark.read.csv("/path/products_js/products.csv")
temp: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 3 more fields]
scala> temp.show
+---+-----+-------+------+---+
|_c0| _c1| _c2| _c3|_c4|
+---+-----+-------+------+---+
| 1| tv| sony| hd|699|
| 2| tv| sony| uhd|799|
| 3| tv|samsung| hd|599|
| 4| tv|samsung| uhd|799|
| 5|phone| iphone| x|999|
| 6|phone| iphone| 11|999|
| 7|phone|samsung| 10|899|
| 8|phone|samsung|10note|999|
| 9|phone| pixel| 4|799|
| 10|phone| pixel| 3|699|
+---+-----+-------+------+---+
In the second approach it loaded the data, but I cannot apply my schema to the DataFrame. What is the difference between the two methods of loading data, and why does the first approach load nulls? Can anyone help me?

You defined the string columns as IntegerType, which is wrong. This works:
import org.apache.spark.sql.types.{StructType, IntegerType, StringType}
val productSchema = new StructType()
  .add("productId", "int")
  .add("productType", "string")
  .add("company", "string")
  .add("model", "string")
  .add("price", "int")

val df = spark.read.format("csv")
  .option("header", "false")
  .schema(productSchema)
  .load("test.csv")

df.show()
The result is:
+---------+-----------+-------+------+-----+
|productId|productType|company| model|price|
+---------+-----------+-------+------+-----+
| 1| tv| sony| hd| 699|
| 2| tv| sony| uhd| 799|
| 3| tv|samsung| hd| 599|
| 4| tv|samsung| uhd| 799|
| 5| phone| iphone| x| 999|
| 6| phone| iphone| 11| 999|
| 7| phone|samsung| 10| 899|
| 8| phone|samsung|10note| 999|
| 9| phone| pixel| 4| 799|
| 10| phone| pixel| 3| 699|
+---------+-----------+-------+------+-----+
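As a side note (a minimal sketch, not part of the original answer): setting the reader's mode option to FAILFAST makes Spark throw an exception on the first value it cannot cast to the declared type, instead of silently filling the row with nulls as the default PERMISSIVE mode does, so schema mistakes like the all-IntegerType one above surface immediately.
// Sketch: fail loudly on type mismatches instead of getting all-null rows.
// With the corrected productSchema this behaves like the read above; with the
// original all-IntegerType schema, show() fails with a parse exception.
val strictDf = spark.read.format("csv")
  .option("header", "false")
  .option("mode", "FAILFAST")
  .schema(productSchema)
  .load("test.csv")
strictDf.show()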

Related

Pyspark add column based on other column and a running counter

I have data in a PySpark DataFrame (it is a very big table with 900M rows).
The dataframe contains a column with these values:
+---------------+
|prev_display_id|
+---------------+
| null|
| null|
| 1062|
| null|
| null|
| null|
| null|
| 18882624|
| 11381128|
| null|
| null|
| null|
| null|
| 2779|
| null|
| null|
| null|
| null|
+---------------+
I am trying to generate a new column based on this column, that will look like this:
+---------------+------+
|prev_display_id|result|
+---------------+------+
| null| 0|
| null| 1|
| 1062| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 18882624| 0|
| 11381128| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 2779| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
+---------------+------+
The function for the new column is something like:
new_col = 0 if (prev_display_id!=null) else col = col+1
Where col is like a running counter that resets to zero when a non-null value is met.
How can that be done efficiently in pyspark?
UPDATE
I tried the solution suggested by @anki below. It works great for small datasets, but it generates this warning:
WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Unfortunately it seems that for my big dataset it kills the cluster.
See image below for the error when running on the big dataset with 2 rd5.2xlarge data nodes:
Any idea how to solve this issue?
From what I understand, you can create an id column with monotonically_increasing_id and then take a sum over the window for cases where prev_display_id is not null, then take the row number partitioned by that column and subtract 1:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(F.monotonically_increasing_id())
w1 = F.sum((F.col("prev_display_id").isNotNull()).cast("integer")).over(w)

df.withColumn("result", F.row_number()
    .over(Window.partitionBy(w1).orderBy(w1)) - 1).show()
+---------------+------+
|prev_display_id|result|
+---------------+------+
| null| 0|
| null| 1|
| 1062| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 18882624| 0|
| 11381128| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
| 2779| 0|
| null| 1|
| null| 2|
| null| 3|
| null| 4|
+---------------+------+
You can get this by running the following (note that this example uses a sample column named consumer_id in place of prev_display_id):
from pyspark.sql import functions as f
from pyspark.sql.window import Window

window = Window.orderBy(f.monotonically_increasing_id())
df.withColumn('row', f.row_number().over(window))\
  .withColumn('ne', f.when(f.col('consumer_id').isNotNull(), f.col('row')))\
  .withColumn('result', f.when(f.col('ne').isNull(), f.col('row') - f.when(f.last('ne', ignorenulls=True)\
  .over(window).isNull(), 1).otherwise(f.last('ne', ignorenulls=True).over(window))).otherwise(0))\
  .drop('row', 'ne').show()
+-----------+------+
|consumer_id|result|
+-----------+------+
| null| 0|
| null| 1|
| null| 2|
| 11| 0|
| 11| 0|
| null| 1|
| null| 2|
| 12| 0|
| 12| 0|
+-----------+------+

Pyspark iterate over rows and compute counter with logic on result column

I have data in a PySpark DataFrame (it is a very big table with 900M rows).
This is the data that I have:
+-------+---------+----------+
| key| time| cond|
+-------+---------+----------+
| 6| 3704| null|
| 6| 74967| 1062|
| 6|151565068| null|
| 6|154999554| null|
| 6|160595800| null|
| 6|166192324| null|
| 6|166549533| null|
| 6|171318946| null|
| 6|754759092| null|
| 6|754999359| 18882624|
| 6|755171746| 11381128|
| 6|761097038| null|
| 6|774496554| null|
| 6|930609982| null|
| 6|930809622| null|
| 1| 192427| null|
| 1| 192427| 2779|
| 1| 717931| null|
| 1| 1110573| null|
| 1| 1155854| null|
| 1| 70049289| null|
| 1| 70687548| null|
| 1| 71222733| null|
| 1| 85006084| null|
| 1| 85029676| null|
| 1| 85032605| 1424537|
| 1| 85240114| null|
| 1| 85573757| null|
| 1| 85710915| null|
| 1| 85870370| null|
+-------+---------+----------+
This is what I need to be doing with the dataframe (intermediate step):
+-------+---------+----------+--------+
| key| time| cond| result|
+-------+---------+----------+--------+
| 6| 3704| null| 0|
| 6| 74967| 1062| 1|
| 6|151565068| null| 0|
| 6|154999554| null| 1|
| 6|160595800| null| 2|
| 6|166192324| null| 3|
| 6|166549533| null| 4|
| 6|171318946| null| 5|
| 6|754759092| null| 6|
| 6|754999359| 18882624| 7|
| 6|755171746| 11381128| 0|
| 6|761097038| null| 0|
| 6|774496554| null| 1|
| 6|930609982| null| 2|
| 6|930809622| null| 3|
| 1| 192427| null| 0|
| 1| 192427| 2779| 1|
| 1| 717931| null| 0|
| 1| 1110573| null| 1|
| 1| 1155854| null| 2|
| 1| 70049289| null| 3|
| 1| 70687548| null| 4|
| 1| 71222733| null| 5|
| 1| 85006084| null| 6|
| 1| 85029676| null| 7|
| 1| 85032605| 1424537| 8|
| 1| 85240114| null| 0|
| 1| 85573757| null| 1|
| 1| 85710915| null| 2|
| 1| 85870370| null| 3|
+-------+---------+----------+--------+
The logic for the 'result' column is as follows: keep a running counter per key, and zero the counter if the 'cond' column is not null.
We can assume that the table is ordered by orderBy("key", asc("time")).
My end result is actually to average 'result' (per key) over the rows where 'cond' is not null.
It should look like this for the above data (final result):
+--------+--------------+
| key | avg_per_key |
+--------+--------------+
| 6| 2.66666665| ==> (1+7+0)/3
| 1| 4.5| ==> (1+8)/2
+--------+--------------+
I plan to do it like this:
df_results = df3[df3.cond.isNotNull()].groupby(['key']).agg(
    F.expr("avg(result)").alias("avg_per_key")
)
I assume it should work, but maybe there is a better way of doing it without the intermediate step in the middle.
How can that be done efficiently in pyspark? (remember that the dataset is huge)
Try this. The result is calculated by using an incremental sum over conditions, and then using those groupings as partitionBy in another window for row_number() - 1 to get the desired result. Filtering before the groupBy should be performant by reducing shuffle.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("key").orderBy("time")
w1 = Window().partitionBy("key", "result").orderBy("time")

conditions = F.when((F.col("cond").isNotNull()) & (F.col("lag").isNotNull() &
                    F.col("lead").isNull()), F.lit(1))\
             .when((F.col("cond").isNull()) & (F.col("lag").isNotNull()), F.lit(1))\
             .otherwise(F.lit(0))

df.withColumn("lag", F.lag("cond").over(w))\
  .withColumn("lead", F.lead("cond").over(w))\
  .withColumn("result", F.sum(conditions).over(w))\
  .withColumn("result", F.row_number().over(w1) - 1).filter("cond is not null")\
  .groupBy("key").agg(F.mean(F.col("result")).alias("avg_per_key")).show()
#+---+------------------+
#|key| avg_per_key|
#+---+------------------+
#| 6|2.6666666666666665|
#| 1| 4.5|
#+---+------------------+
This was my solution for it. I am not saying that it is optimal, but it worked for my case, when other attempts crashed the cluster.
I am a beginner in Spark, so I understand that this approach might cause issues, since it pulls the dataset into memory.
If I had more time to play with it, I would try something with sortWithinPartitions.
import numpy as np

def handleRow(row):
    temp = list(row[1])
    temp = np.array([temp[x:x+2] for x in range(0, len(temp), 2)])
    temp[:, 0] = temp[:, 0].astype(float)
    temp = temp[temp[:, 0].argsort()]
    avg_per_key = []
    counter = 0
    for time, cond in temp:
        if cond != None:
            avg_per_key.append(counter)
            counter = 0
        else:
            counter = counter + 1
    return [(row[0], -1 if len(avg_per_key) == 0 else np.mean(avg_per_key))]

count = df3.rdd.map(lambda x: (x.key, (x.time, x.cond)))\
    .reduceByKey(lambda a, b: a + b)\
    .flatMap(handleRow)\
    .collect()

Scala - Which is the most efficient way to get the names of the null columns from spark dataframe?

I have a df like this:
+----------+----------+----------+----------+----------+----------+----------+
| user_id| apple| orange| banana| pear| table| desk|
+----------+----------+----------+----------+----------+----------+----------+
| 1| 13| null| 55| null| null| null|
| 2| 30| null| null| null| null| null|
| 3| null| null| 50| null| null| null|
| 4| 1| null| 3| null| null| null|
+----------+----------+----------+----------+----------+----------+----------+
I would like to get back an Array[String] which contains the fruit column names that have only null values. I would like to do this on a very big data frame, so I don't want to sum the columns; I need a faster and much more efficient way. I need Scala code.
So I need this list:
List(orange, pear)
I have this solution now, summing columns, but I need a solution without summing all of the columns:
val fruitList: Array[String] = ... // the fruit column names
val nullFruits: Array[String] = fruitList.filter(col => dataFrame.agg(sum(col)).first.get(0) == null)
You can achieve this by using Spark's describe too:
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .summary("count")

// alternatively
val r1 = df.select(fruitList.head, fruitList.tail: _*)
  .describe()
  .filter($"summary" === "count")
+-------+-----+------+------+----+
|summary|apple|orange|banana|pear|
+-------+-----+------+------+----+
| count| 3| 0| 3| 0|
+-------+-----+------+------+----+
And to extract the desired values:
r1.columns.tail
  .map(c => (c, r1.select(c).head.getString(0) == "0"))
  .filter(_._2 == true)
  .map(_._1)
which gives:
Array(orange, pear)
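If you would rather skip describe/summary, a rough alternative (a sketch, assuming the same df and fruitList as above) is a single aggregation pass: count(col) counts only non-null values, so a count of 0 means the column holds nothing but nulls.
import org.apache.spark.sql.functions.{col, count}

// One pass over the data: count the non-null values in each fruit column.
val counts = df.select(fruitList.map(c => count(col(c)).alias(c)): _*).head()
// Columns whose non-null count is zero are the all-null ones.
val nullFruits: Array[String] = fruitList.filter(c => counts.getAs[Long](c) == 0L)
// nullFruits: Array(orange, pear)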

Null values from a csv on Scala and Apache Spark

I'm using Apache Spark 2.3.0. When I load a CSV file and then call df.show, it shows me the table with all null values, and I would like to know why, because everything looks fine in the CSV.
val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
Rank,Grade,Channelname,VideoUploads,Subscribers,Videoviews
1st,A++ ,Zee TV,82757,18752951,20869786591
2nd,A++ ,T-Series,12661,61196302,47548839843
3rd,A++ ,Cocomelon - Nursery Rhymes,373,19238251,9793305082
4th,A++ ,SET India,27323,31180559,22675948293
5th,A++ ,WWE,36756,32852346,26273668433
6th,A++ ,Movieclips,30243,17149705,16618094724
7th,A++ ,netd müzik,8500,11373567,23898730764
8th,A++ ,ABS-CBN Entertainment,100147,12149206,17202609850
9th,A++ ,Ryan ToysReview,1140,16082927,24518098041
10th,A++ ,Zee Marathi,74607,2841811,2591830307
11th,A+ ,5-Minute Crafts,2085,33492951,8587520379
12th,A+ ,Canal KondZilla,822,39409726,19291034467
13th,A+ ,Like Nastya Vlog,150,7662886,2540099931
14th,A+ ,Ozuna,50,18824912,8727783225
15th,A+ ,Wave Music,16119,15899764,10989179147
16th,A+ ,Ch3Thailand,49239,11569723,9388600275
17th,A+ ,WORLDSTARHIPHOP,4778,15830098,11102158475
18th,A+ ,Vlad and Nikita,53,-- ,1428274554
So if we load without a schema we see the following:
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+--------------------+------------+-----------+-----------+
|Rank|Grade| Channelname|VideoUploads|Subscribers| Videoviews|
+----+-----+--------------------+------------+-----------+-----------+
| 1st| A++ | Zee TV| 82757| 18752951|20869786591|
| 2nd| A++ | T-Series| 12661| 61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...| 373| 19238251| 9793305082|
| 4th| A++ | SET India| 27323| 31180559|22675948293|
| 5th| A++ | WWE| 36756| 32852346|26273668433|
| 6th| A++ | Movieclips| 30243| 17149705|16618094724|
| 7th| A++ | netd müzik| 8500| 11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...| 100147| 12149206|17202609850|
| 9th| A++ | Ryan ToysReview| 1140| 16082927|24518098041|
|10th| A++ | Zee Marathi| 74607| 2841811| 2591830307|
|11th| A+ | 5-Minute Crafts| 2085| 33492951| 8587520379|
|12th| A+ | Canal KondZilla| 822| 39409726|19291034467|
|13th| A+ | Like Nastya Vlog| 150| 7662886| 2540099931|
|14th| A+ | Ozuna| 50| 18824912| 8727783225|
|15th| A+ | Wave Music| 16119| 15899764|10989179147|
|16th| A+ | Ch3Thailand| 49239| 11569723| 9388600275|
|17th| A+ | WORLDSTARHIPHOP| 4778| 15830098|11102158475|
|18th| A+ | Vlad and Nikita| 53| -- | 1428274554|
+----+-----+--------------------+------------+-----------+-----------+
If we apply your schema we see this:
scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+-----------+-------------+----------+----------+
|Rank|Grade|Channelname|Video Uploads|Suscribers|Videoviews|
+----+-----+-----------+-------------+----------+----------+
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
|null| null| null| null| null| null|
+----+-----+-----------+-------------+----------+----------+
Now if we look at your data, we see that Subscribers contains non-integer values ("--") and Videoviews contains values which exceed the Integer max value (2,147,483,647).
So if we change the schema to conform with the data:
scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",StringType,true),StructField("Videoviews",LongType,true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Rank,StringType,true), StructField(Grade,StringType,true), StructField(Channelname,StringType,true), StructField(Video Uploads,IntegerType,true), StructField(Suscribers,StringType,true), StructField(Videoviews,LongType,true))
scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]
scala> df.show
+----+-----+--------------------+-------------+----------+-----------+
|Rank|Grade| Channelname|Video Uploads|Suscribers| Videoviews|
+----+-----+--------------------+-------------+----------+-----------+
| 1st| A++ | Zee TV| 82757| 18752951|20869786591|
| 2nd| A++ | T-Series| 12661| 61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...| 373| 19238251| 9793305082|
| 4th| A++ | SET India| 27323| 31180559|22675948293|
| 5th| A++ | WWE| 36756| 32852346|26273668433|
| 6th| A++ | Movieclips| 30243| 17149705|16618094724|
| 7th| A++ | netd müzik| 8500| 11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...| 100147| 12149206|17202609850|
| 9th| A++ | Ryan ToysReview| 1140| 16082927|24518098041|
|10th| A++ | Zee Marathi| 74607| 2841811| 2591830307|
|11th| A+ | 5-Minute Crafts| 2085| 33492951| 8587520379|
|12th| A+ | Canal KondZilla| 822| 39409726|19291034467|
|13th| A+ | Like Nastya Vlog| 150| 7662886| 2540099931|
|14th| A+ | Ozuna| 50| 18824912| 8727783225|
|15th| A+ | Wave Music| 16119| 15899764|10989179147|
|16th| A+ | Ch3Thailand| 49239| 11569723| 9388600275|
|17th| A+ | WORLDSTARHIPHOP| 4778| 15830098|11102158475|
|18th| A+ | Vlad and Nikita| 53| -- | 1428274554|
+----+-----+--------------------+-------------+----------+-----------+
The reason for the null values is that the default "mode" for the csv API is PERMISSIVE:
mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
- PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When the length of parsed CSV tokens is shorter than the expected length of the schema, it sets null for the extra fields.
- DROPMALFORMED: ignores the whole corrupted records.
- FAILFAST: throws an exception when it meets corrupted records.
csv API
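As the quoted documentation mentions, you can also keep the raw text of the problem rows instead of getting silent nulls by adding a string field for corrupt records to the schema. A small sketch below, reusing the original schema from the question; _corrupt_record is Spark's default name for the columnNameOfCorruptRecord field.
// Sketch: capture malformed rows instead of silently nulling them out.
val schemaWithCorrupt = schema.add(StructField("_corrupt_record", StringType, true))
val dfWithCorrupt = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .load("data.csv")
// Rows that fail to parse (e.g. "--" in an Integer column) keep their original text here.
dfWithCorrupt.filter($"_corrupt_record".isNotNull).show(false)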

How to fill missing values in DataFrame?

After querying a mysql db and building the corresponding data frame, I am left with this:
mydata.show
+--+------+------+------+------+------+------+
|id| sport| var1| var2| var3| var4| var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234| | | | |
| 2|soccer| null| null| null| null| null|
| 3|soccer|330101| | | | |
| 4|soccer| null| null| null| null| null|
| 5|soccer| null| null| null| null| null|
| 6|soccer| null| null| null| null| null|
| 7|soccer| null| null| null| null| null|
| 8|soccer|330024|330401| | | |
| 9|soccer|330055|330106| | | |
|10|soccer| null| null| null| null| null|
|11|soccer|390027| | | | |
|12|soccer| null| null| null| null| null|
|13|soccer|330101| | | | |
|14|soccer|330059| | | | |
|15|soccer| null| null| null| null| null|
|16|soccer|140242|140281| | | |
|17|soccer|330214| | | | |
|18|soccer| | | | | |
|19|soccer|330055|330196| | | |
|20|soccer|210022| | | | |
+--+------+------+------+------+------+------+
Every var column is a:
string (nullable = true)
So I'd like to change all the empty cells to "null", so as to be able to treat empty cells and cells with "null" as equal, possibly without leaving the data frame for an RDD...
My approach would be to create a list of expressions. In Scala this can be done using a map; in Python you'd use a list comprehension.
After that, you should unpack that list inside a df.select instruction, as in the examples below.
Inside the expression, empty strings are replaced with a null value.
Scala:
import org.apache.spark.sql.functions.{col, when}

val exprs = df.columns.map(x => when(col(x) === "", null).otherwise(col(x)).as(x))
df.select(exprs: _*).show()
Python:
from pyspark.sql.functions import col, when

# Creation of a dummy dataframe:
df = sc.parallelize([("", "19911201", 1, 1, 20.0),
                     ("", "19911201", 2, 1, 20.0),
                     ("hola", "19911201", 2, 1, 20.0),
                     (None, "20111201", 3, 1, 20.0)]).toDF()
df.show()

exprs = [when(col(x) == '', None).otherwise(col(x)).alias(x)
         for x in df.columns]
df.select(*exprs).show()
E.g:
+----+--------+---+---+----+
| _1| _2| _3| _4| _5|
+----+--------+---+---+----+
| |19911201| 1| 1|20.0|
| |19911201| 2| 1|20.0|
|hola|19911201| 2| 1|20.0|
|null|20111201| 3| 1|20.0|
+----+--------+---+---+----+
+----+--------+---+---+----+
| _1| _2| _3| _4| _5|
+----+--------+---+---+----+
|null|19911201| 1| 1|20.0|
|null|19911201| 2| 1|20.0|
|hola|19911201| 2| 1|20.0|
|null|20111201| 3| 1|20.0|
+----+--------+---+---+----+
One option would be to do the opposite - replace nulls with empty values (I personally hate nulls...), for which you can use the coalesce function:
import org.apache.spark.sql.functions._
val result = input.withColumn("myCol", coalesce(input("myCol"), lit("")))
To do that for multiple columns:
val cols = Seq("var1", "var2", "var3", "var4", "var5")
val result = cols.foldLeft(input) { case (df, colName) =>
  df.withColumn(colName, coalesce(df(colName), lit("")))
}
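A related shortcut (a sketch, not part of the answer above): for string columns, the built-in na.fill does the same thing in one call, replacing nulls with the given value in the listed columns.
// Same effect for the string columns listed in cols: replace nulls with "".
val filled = input.na.fill("", cols)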