Concatenate names of the columns in a dataframe which contain true - scala

I have a dataframe that looks like this:
+---+----+-------+------+------+
|id |type|isBlack|isHigh|isLong|
+---+----+-------+------+------+
|1  |A   |true   |false |null  |
|2  |B   |true   |true  |true  |
|3  |C   |false  |null  |null  |
+---+----+-------+------+------+
I'm trying to concatenate names of columns which contain 'true' into another column to get this:
+---+----+-------+------+------+---------------------+
|id |type|isBlack|isHigh|isLong|Description          |
+---+----+-------+------+------+---------------------+
|1  |A   |true   |false |null  |isBlack              |
|2  |B   |true   |true  |true  |isBlack,isHigh,isLong|
|3  |C   |false  |null  |null  |null                 |
+---+----+-------+------+------+---------------------+
Now, I have a predefined list of column names that I need to check for (in this example it would be Seq("isBlack", "isHigh", "isLong")), all of which are present in the dataframe (this list can be somewhat long).

You can build the description column from your predefined list like this:
val cols = Seq("isBlack", "isHigh", "isLong")
df.withColumn("description", concat_ws(",", cols.map(x => when(col(x), x)): _*)).show(false)
+---+----+-------+------+------+---------------------+
|id |type|isBlack|isHigh|isLong|description |
+---+----+-------+------+------+---------------------+
|1 |A |true |false |null |isBlack |
|2 |B |true |true |true |isBlack,isHigh,isLong|
|3 |C |false |null |null | |
+---+----+-------+------+------+---------------------+
First, map each column to its name when its value is true (and to null otherwise):
cols.map(x => when(col(x), x))
Then use concat_ws to combine the resulting columns, which skips the nulls:
concat_ws(",", cols.map(x => when(col(x), x)): _*)
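Note that when none of the flags are true, concat_ws produces an empty string rather than the null shown in the desired output (see the last row above). If you need an actual null there, a minimal sketch of one way to do it (this wrapping is not from the original answer, just the same concat_ws expression reused inside a when):
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

val cols = Seq("isBlack", "isHigh", "isLong")
val description = concat_ws(",", cols.map(x => when(col(x), x)): _*)

// Turn the empty string produced by concat_ws into a real null
df.withColumn("description", when(description === "", lit(null)).otherwise(description))
  .show(false)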

Related

Scala Spark dataframe filter using multiple column based on available value

I need to filter a dataframe with the below criteria.
I have 2 columns: 4Wheel (Subaru, Toyota, GM, null/empty) and 2Wheel (Yamaha, Harley, Indian, null/empty).
I have to filter on 4Wheel with the values (Subaru, Toyota); if 4Wheel is empty/null, then filter on 2Wheel with the values (Yamaha, Harley).
I couldn't find this type of filtering in the examples I looked at. I am new to Spark/Scala, so I could not work out how to implement this.
Thanks,
Barun.
You can use the Spark SQL built-in function when to check whether a column is null or empty, and filter accordingly:
import org.apache.spark.sql.functions.{col, when}

dataframe.filter(
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
So if you have the following input:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|3 |GM |null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
|8 |null |Indian|
|9 | |Indian|
|10 |null |null |
+---+------+------+
You get the following filtered output:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
+---+------+------+
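For reference, a minimal self-contained sketch of the same approach, rebuilding the sample input above by hand (it assumes a SparkSession named spark with spark.implicits._ in scope; it is not part of the original answer):
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

// Recreate the sample input; null / "" stand in for the missing values
val dataframe = Seq(
  (1, "Toyota", null), (2, "Subaru", null), (3, "GM", null),
  (4, null, "Yamaha"), (5, "", "Yamaha"), (6, null, "Harley"),
  (7, "", "Harley"), (8, null, "Indian"), (9, "", "Indian"),
  (10, null, null)
).toDF("id", "4Wheel", "2Wheel")

// Filter on 4Wheel, falling back to 2Wheel when 4Wheel is null or empty
val filtered = dataframe.filter(
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
filtered.show(false)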

How to replace values with NULL in all columns starting with "stage_{col}" when a condition is satisfied

I have a scenario where the final dataframe, shown below, is the result of joining stage and foundation.
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-02-02 21:50:25.585|9123 |20.00 |1000.00 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
I am looking for logic that, on each row where final_{timestamp} > stage_{timestamp}, replaces the values of all columns starting with stage_ with null.
Like below:
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |null |null |null |null |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
It would be great if you could help me with the logic.
"""
Here's the code:
import org.apache.spark.sql.functions.{col, lit, when}

// Select all stage columns from the dataframe
val stageColumns = df.columns.filter(_.startsWith("stage"))

// For each stage column, nullify it when final_{timestamp} > stage_{timestamp}
val result = stageColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col("final_{timestamp}") > col("stage_{timestamp}"), lit(null)).otherwise(col(c)))
}
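As a quick sanity check, here is a minimal sketch of that foldLeft on a two-row toy dataframe (the column names mirror the ones above; a SparkSession named spark with spark.implicits._ in scope is assumed, and this is not part of the original answer):
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val df = Seq(
  ("222", "2019-02-02", "9123", "2019-03-02"),  // final > stage  -> stage columns become null
  ("333", "2020-03-03", "8123", "2020-01-03")   // final <= stage -> stage columns are kept
).toDF("ID_key", "stage_{timestamp}", "stage_{code}", "final_{timestamp}")

val stageColumns = df.columns.filter(_.startsWith("stage"))
val result = stageColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col("final_{timestamp}") > col("stage_{timestamp}"), lit(null)).otherwise(col(c)))
}
result.show(false)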
Check the code below.
Condition
scala> val expr = col("final_{timestamp}") > col("stage_{timestamp}")
Condition Matched Columns
scala> val matched = df
.columns
.filter(_.startsWith("stage"))
.map(c => (when(expr,lit(null)).otherwise(col(c))).as(c))
Condition Not Matched Columns
scala> val notMatched = df
.columns
.filter(!_.startsWith("stage"))
.map(c => col(c).as(c))
Combining Not Matched & Matched Columns
scala> val allColumns = notMatched ++ matched
Final Result
scala> df.select(allColumns:_*).show(false)
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key|ICC_key|suff_key|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |null |null |null |null |
|333 |333 |1 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |
|444 |444 |1 |null |null |null |null |null |null |null |null |
|555 |333 |1 |2020-05-03 21:50:25.585|813 |30.00 |200.00 |null |null |null |null |
|111 |111 |1 |null |null |null |null |null |null |null |null |
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
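Note that building the projection as notMatched ++ matched puts all the non-stage columns before the stage_ columns, which is why the output above is reordered. If you would rather keep the original column order, a small variation (reusing the same expr defined above; this is a sketch, not part of the original answer) is:
scala> val allColumns = df.columns.map { c =>
  if (c.startsWith("stage")) when(expr, lit(null)).otherwise(col(c)).as(c)
  else col(c)
}

scala> df.select(allColumns:_*).show(false)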
PySpark solution:
from pyspark.sql import functions as F

#test_data
tst = sqlContext.createDataFrame([(867,0.12,'G','2020-07-01 17-49-32','2020-07-02 17-49-32'),(430,0.72,'R','2020-07-01 17-49-32','2020-07-02 17-49-32'),(658,0.32,'A','2020-07-01 17-49-32','2020-06-01 17-49-32'),\
(157,0.83,'R','2020-07-01 17-49-32','2020-06-01 17-49-32'),(521,0.12,'G','2020-07-01 17-49-32','2020-08-01 17-49-32'),(867,0.49,'A','2020-07-01 16-45-32','2020-08-01 17-49-32'),
(430,0.14,'G','2020-07-01 16-45-32','2020-07-01 17-49-32'),(867,0.12,'G','2020-07-01 16-45-32','2020-07-01 17-49-32')],
schema=['stage_1','stage_2','RAG','timestamp','timestamp1'])
# change the string column to date
tst_format= (tst.withColumn('timestamp',F.to_date('timestamp',format='yyyy-MM-dd'))).withColumn('timestamp1',F.to_date('timestamp1',format='yyyy-MM-dd'))
# extract column information with name 'stage' and retain others
col_info =[x for x in tst_format.columns if 'stage_' in x]
col_orig = list(set(tst_format.columns)-set(col_info))
# Build the query expression
expr = col_orig+[(F.when(F.col('timestamp1')>F.col('timestamp'),F.col(x)).otherwise(F.lit(None))).alias(x) for x in col_info]
# execute query
tst_res = tst_format.select(*expr)
The results are:
+---+----------+----------+-------+-------+
|RAG|timestamp1| timestamp|stage_1|stage_2|
+---+----------+----------+-------+-------+
| G|2020-07-02|2020-07-01| 867| 0.12|
| R|2020-07-02|2020-07-01| 430| 0.72|
| A|2020-06-01|2020-07-01| null| null|
| R|2020-06-01|2020-07-01| null| null|
| G|2020-08-01|2020-07-01| 521| 0.12|
| A|2020-08-01|2020-07-01| 867| 0.49|
| G|2020-07-01|2020-07-01| null| null|
| G|2020-07-01|2020-07-01| null| null|
+---+----------+----------+-------+-------+

getting duplicate count but retaining duplicate rows in pyspark

I am trying to find the duplicate count of rows in a PySpark dataframe. I found a similar answer here,
but it only outputs a binary flag. I would like to have the actual count for each row.
To use the original post's example, if I have a dataframe like so:
+--+--+--+--+
|a |b |c |d |
+--+--+--+--+
|1 |0 |1 |2 |
|0 |2 |0 |1 |
|1 |0 |1 |2 |
|0 |4 |3 |1 |
|1 |0 |1 |2 |
+--+--+--+--+
I would like to result in something like:
+--+--+--+--+---------+
|a |b |c |d |row_count|
+--+--+--+--+---------+
|1 |0 |1 |2 |3        |
|0 |2 |0 |1 |0        |
|1 |0 |1 |2 |3        |
|0 |4 |3 |1 |0        |
|1 |0 |1 |2 |3        |
+--+--+--+--+---------+
Is this possible?
Thank You
Assuming df is your input dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy([F.col("a"), F.col("b"), F.col("c"), F.col("d")])
df = df.select(F.col("a"), F.col("b"), F.col("c"), F.col("d"),
               F.count(F.col("a")).over(w).alias("row_count"))
If, as per your example, you want to replace every count 1 with 0 do:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy([F.col("a"), F.col("b"), F.col("c"), F.col("d")])
df = (df.select(F.col("a"), F.col("b"), F.col("c"), F.col("d"),
                F.count(F.col("a")).over(w).alias("row_count"))
        .select("a", "b", "c", "d",
                F.when(F.col("row_count") == F.lit(1), F.lit(0))
                 .otherwise(F.col("row_count")).alias("row_count")))

Value of column changes after changing the Date format in scala spark

This is my data frame before changing the date format:
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T03:00:00+00:00|YUH |null |null |null |1976-12-31T00:00:00+00:00|true |false |1976-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T03:00:00+00:00|YUH |null |null |null |1983-12-31T00:00:00+00:00|true |false |1983-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T03:00:00+00:00|YUH |null |null |null |1976-12-31T00:00:00+00:00|true |false |1976-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
This is what I do to change the date format:
val df2resultTimestamp = finalXmlDf.withColumn("FilingDateTime_1", date_format(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1", date_format(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1", date_format(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", regexp_replace(format_number($"CumulativeAdjustmentFactor_1".cast(DoubleType), 5), ",", ""))
This is the output I get, where the FilingDateTime_1 column value has changed:
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T08:30:00Z|YUH |null |null |null |1976-12-31T05:30:00Z|true |false |1976-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T08:30:00Z|YUH |null |null |null |1983-12-31T05:30:00Z|true |false |1983-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T08:30:00Z|YUH |null |null |null |1976-12-31T05:30:00Z|true |false |1976-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
The value should be 1984-02-14T03:00:00Z.
I don't know what I am missing here.
All you need is to add the to_timestamp built-in function, as below. Without it, date_format implicitly casts the string to a timestamp, which applies the +00:00 offset and renders the instant in the session time zone, shifting the hours; parsing explicitly with a pattern that ignores the offset keeps the original wall-clock time.
val df2resultTimestamp = df.withColumn("FilingDateTime_1", date_format(to_timestamp(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1", date_format(to_timestamp(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1", date_format(to_timestamp(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", regexp_replace(format_number($"CumulativeAdjustmentFactor_1".cast(DoubleType), 5), ",", ""))
which should give you the correct output as
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T03:00:00Z|YUH |null |null |null |1976-12-31T00:00:00Z|true |false |1976-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T03:00:00Z|YUH |null |null |null |1983-12-31T00:00:00Z|true |false |1983-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T03:00:00Z|YUH |null |null |null |1976-12-31T00:00:00Z|true |false |1976-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
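To see the difference in isolation, here is a minimal sketch (it assumes a Spark 2.x session named spark with spark.implicits._ available; the first column's value depends on spark.sql.session.timeZone, and none of this is from the original answer):
import org.apache.spark.sql.functions.{col, date_format, to_timestamp}
import spark.implicits._

val demo = Seq("1984-02-14T03:00:00+00:00").toDF("ts")
demo.select(
  // Implicit cast: the +00:00 offset is honoured and the instant is rendered in the session time zone
  date_format(col("ts"), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("implicit_cast"),
  // Explicit parse: the pattern ignores the trailing offset, so the wall-clock time round-trips unchanged
  date_format(to_timestamp(col("ts"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("explicit_parse")
).show(false)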

clean missing values spark with aggregation function

I would like to clean missing values by replacing them with the mean. This source code used to work; I do not know why it doesn't work now. Any help will be appreciated.
Here is the dataset I use:
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe
0,,72,160,5,,2.9421,,3,4
1,54,70,,5,0.6301,2.7273,,3,
2,,51,164,5,,2.9834,,3,4
3,,74,170,5,0.6966,2.9654,2.3699,3,4
4,108,62,,5,0.6087,2.7093,2.1619,3,4
Here is what I did:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true)
  .format("com.databricks.spark.csv")
  .load("C:/Users/mhattabi/Desktop/data_with_missing_values3.csv")
df.show(false)

var newDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  newDF = newDF.na.fill(df.agg(max(colName)).first()(0).toString, Seq(colName))
}
newDF.show(false)
Here is the result; nothing happened:
initial_data
+---------+-----+---+------+---+------+---------+--------+-----+------+
|RowNumber|Poids|Age|Taille|0MI|Hmean |CoocParam|LdpParam|Test2|Classe|
+---------+-----+---+------+---+------+---------+--------+-----+------+
|0 |null |72 |160 |5 |null |2.9421 |null |3 |4 |
|1 |54 |70 |null |5 |0.6301|2.7273 |null |3 |null |
|2 |null |51 |164 |5 |null |2.9834 |null |3 |4 |
|3 |null |74 |170 |5 |0.6966|2.9654 |2.3699 |3 |4 |
|4 |108 |62 |null |5 |0.6087|2.7093 |2.1619 |3 |4 |
+---------+-----+---+------+---+------+---------+--------+-----+------+
new_data
+---------+-----+---+------+---+------+---------+--------+-----+------+
|RowNumber|Poids|Age|Taille|0MI|Hmean |CoocParam|LdpParam|Test2|Classe|
+---------+-----+---+------+---+------+---------+--------+-----+------+
|0 |null |72 |160 |5 |null |2.9421 |null |3 |4 |
|1 |54 |70 |null |5 |0.6301|2.7273 |null |3 |null |
|2 |null |51 |164 |5 |null |2.9834 |null |3 |4 |
|3 |null |74 |170 |5 |0.6966|2.9654 |2.3699 |3 |4 |
|4 |108 |62 |null |5 |0.6087|2.7093 |2.1619 |3 |4 |
+---------+-----+---+------+---+------+---------+--------+-----+------+
What should I do?
You can use the withColumn API with the when function to check for null values in the columns, as:
import org.apache.spark.sql.functions.{col, max, when}

var newDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  val fill = df.agg(max(col(s"`$colName`"))).first()(0).toString
  newDF = newDF.withColumn(colName, when(col(s"`$colName`").isNull, fill).otherwise(col(s"`$colName`")))
}
newDF.show(false)
I hope this solves your issue
If you are trying to replace the null values with the mean value, then calculate the mean and fill it in as:
import org.apache.spark.sql.functions.mean
val data = spark.read.option("header", true)
.option("inferSchema", true).format("com.databricks.spark.csv")
.load("data.csv")
//Calculate the mean for each column and create a map with its column name
//and use na.fill() method to replace null with that mean
data.na.fill(data.columns.zip(
data.select(data.columns.map(mean(_)): _*).first.toSeq
).toMap)
I have tested the code locally and it works fine.
Output:
+---------+-----+---+------+---+------------------+---------+------------------+-----+------+
|RowNumber|Poids|Age|Taille|0MI| Hmean|CoocParam| LdpParam|Test2|Classe|
+---------+-----+---+------+---+------------------+---------+------------------+-----+------+
| 0| 81| 72| 160| 5|0.6451333333333333| 2.9421|2.2659000000000002| 3| 4|
| 1| 54| 70| 164| 5| 0.6301| 2.7273|2.2659000000000002| 3| 4|
| 2| 81| 51| 164| 5|0.6451333333333333| 2.9834|2.2659000000000002| 3| 4|
| 3| 81| 74| 170| 5| 0.6966| 2.9654| 2.3699| 3| 4|
| 4| 108| 62| 164| 5| 0.6087| 2.7093| 2.1619| 3| 4|
+---------+-----+---+------+---+------------------+---------+------------------+-----+------+
Hope this helps!
This should do it:
var imputeDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  imputeDF = imputeDF.na.fill(df.agg(max(colName)).first()(0).toString, Seq(colName))
}
Note that it is not good practice to use mutable variables in Scala.
Depending on your data, you can use a SQL join or something else to replace the nulls with a more suitable value.
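If the goal is specifically mean imputation of numeric columns, another option worth knowing about (not mentioned in the answers above) is Spark ML's Imputer. A minimal sketch, assuming Spark 2.2+ and that the listed columns exist and can be cast to doubles:
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col

// Imputer expects double/float inputs in older Spark versions, so cast first
val numericCols = Array("Poids", "Taille", "Hmean", "LdpParam")
val casted = numericCols.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("double")))

val imputer = new Imputer()
  .setInputCols(numericCols)
  .setOutputCols(numericCols.map(_ + "_imputed"))  // write into new columns
  .setStrategy("mean")                             // replace nulls/NaNs with the per-column mean

imputer.fit(casted).transform(casted).show(false)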