Concatenate names of the columns in a dataframe which contain true - scala

I have a dataframe that looks like this:
+---+----+-------+------+------+
|id |type|isBlack|isHigh|isLong|
+---+----+-------+------+------+
|1  |A   |true   |false |null  |
|2  |B   |true   |true  |true  |
|3  |C   |false  |null  |null  |
+---+----+-------+------+------+
I'm trying to concatenate names of columns which contain 'true' into another column to get this:
+---+----+-------+------+------+---------------------+
|id |type|isBlack|isHigh|isLong|Description          |
+---+----+-------+------+------+---------------------+
|1  |A   |true   |false |null  |isBlack              |
|2  |B   |true   |true  |true  |isBlack,isHigh,isLong|
|3  |C   |false  |null  |null  |null                 |
+---+----+-------+------+------+---------------------+
Now, I have a predefined list of column names that I need to check for (in this example it would be Seq("isBlack", "isHigh", "isLong")), all of which are present in the dataframe (this list can be somewhat long).

You can build the description column from your predefined list like this:
val cols = Seq("isBlack", "isHigh", "isLong")
df.withColumn("description", concat_ws(",", cols.map(x => when(col(x), x)): _*)).show(false)
+---+----+-------+------+------+---------------------+
|id |type|isBlack|isHigh|isLong|description |
+---+----+-------+------+------+---------------------+
|1 |A |true |false |null |isBlack |
|2 |B |true |true |true |isBlack,isHigh,isLong|
|3 |C |false |null |null | |
+---+----+-------+------+------+---------------------+
First, map each column to its name when its value is true (and to null otherwise):
cols.map(x => when(col(x), x))
Then use concat_ws to combine the resulting columns, which skips the nulls:
concat_ws(",", cols.map(x => when(col(x), x)): _*)
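Note that when none of the flags are true, concat_ws produces an empty string rather than the null shown in the desired output (see the last row above). If you need an actual null there, a minimal sketch of one way to do it (this wrapping is not from the original answer, just the same concat_ws expression reused inside a when):
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

val cols = Seq("isBlack", "isHigh", "isLong")
val description = concat_ws(",", cols.map(x => when(col(x), x)): _*)

// Turn the empty string produced by concat_ws into a real null
df.withColumn("description", when(description === "", lit(null)).otherwise(description))
  .show(false)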

Related

Scala Spark dataframe filter using multiple column based on available value

I need to filter a dataframe with the below criteria.
I have 2 columns: 4Wheel (Subaru, Toyota, GM, null/empty) and 2Wheel (Yamaha, Harley, Indian, null/empty).
I have to filter on 4Wheel with the values (Subaru, Toyota); if 4Wheel is empty/null, then filter on 2Wheel with the values (Yamaha, Harley).
I couldn't find this type of filtering in the examples I looked at. I am new to Spark/Scala, so I could not work out how to implement this.
Thanks,
Barun.
You can use the Spark SQL built-in function when to check whether a column is null or empty, and filter accordingly:
import org.apache.spark.sql.functions.{col, when}

dataframe.filter(
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
So if you have the following input:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|3 |GM |null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
|8 |null |Indian|
|9 | |Indian|
|10 |null |null |
+---+------+------+
You get the following filtered output:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
+---+------+------+
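For reference, a minimal self-contained sketch of the same approach, rebuilding the sample input above by hand (it assumes a SparkSession named spark with spark.implicits._ in scope; it is not part of the original answer):
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

// Recreate the sample input; null / "" stand in for the missing values
val dataframe = Seq(
  (1, "Toyota", null), (2, "Subaru", null), (3, "GM", null),
  (4, null, "Yamaha"), (5, "", "Yamaha"), (6, null, "Harley"),
  (7, "", "Harley"), (8, null, "Indian"), (9, "", "Indian"),
  (10, null, null)
).toDF("id", "4Wheel", "2Wheel")

// Filter on 4Wheel, falling back to 2Wheel when 4Wheel is null or empty
val filtered = dataframe.filter(
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
filtered.show(false)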

How to replace values with NULL in all columns starting with "stage_{col}" when a condition is satisfied

I have a scenario where the final dataframe, shown below, is the result of joining stage and foundation.
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-02-02 21:50:25.585|9123 |20.00 |1000.00 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
I am looking for logic that, on each row where final_{timestamp} > stage_{timestamp}, replaces the values of all columns starting with stage_ with null.
Like below:
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key |ICC_key |suff_key |stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |null |null |null |null |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |
|333 |333 |1 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |
|444 |444 |1 |2020-04-03 21:50:25.585|8123 |30.00 |200.00 |null |null |null |null |
|555 |333 |1 |null |null |null |null |2020-05-03 21:50:25.585|813 |30.00 |200.00 |
|111 |111 |1 |2020-01-01 21:50:25.585|A123 |10.00 |99.00 |null |null |null |null |
+-------------------+------------------------+---------------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
It would be great if you could help me with the logic.
"""
Here's the code:
import org.apache.spark.sql.functions.{col, lit, when}

// Select all stage columns from the dataframe
val stageColumns = df.columns.filter(_.startsWith("stage"))

// For each stage column, nullify it when final_{timestamp} > stage_{timestamp}
val result = stageColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col("final_{timestamp}") > col("stage_{timestamp}"), lit(null)).otherwise(col(c)))
}
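As a quick sanity check, here is a minimal sketch of that foldLeft on a two-row toy dataframe (the column names mirror the ones above; a SparkSession named spark with spark.implicits._ in scope is assumed, and this is not part of the original answer):
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val df = Seq(
  ("222", "2019-02-02", "9123", "2019-03-02"),  // final > stage  -> stage columns become null
  ("333", "2020-03-03", "8123", "2020-01-03")   // final <= stage -> stage columns are kept
).toDF("ID_key", "stage_{timestamp}", "stage_{code}", "final_{timestamp}")

val stageColumns = df.columns.filter(_.startsWith("stage"))
val result = stageColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col("final_{timestamp}") > col("stage_{timestamp}"), lit(null)).otherwise(col(c)))
}
result.show(false)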
Check the code below.
Condition
scala> val expr = col("final_{timestamp}") > col("stage_{timestamp}")
Condition Matched Columns
scala> val matched = df
.columns
.filter(_.startsWith("stage"))
.map(c => (when(expr,lit(null)).otherwise(col(c))).as(c))
Condition Not Matched Columns
scala> val notMatched = df
.columns
.filter(!_.startsWith("stage"))
.map(c => col(c).as(c))
Combining Not Matched & Matched Columns
scala> val allColumns = notMatched ++ matched
Final Result
scala> df.select(allColumns:_*).show(false)
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|ID_key|ICC_key|suff_key|final_{timestamp} |final_{code}|final_{dol1}|final_{dol2}|stage_{timestamp} |stage_{code}|stage_{dol1}|stage_{dol2}|
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
|222 |222 |1 |2019-03-02 21:50:25.585|7123 |30.00 |200.00 |null |null |null |null |
|333 |333 |1 |2020-01-03 21:50:25.585|823 |30.00 |200.00 |2020-03-03 21:50:25.585|8123 |30.00 |200.00 |
|444 |444 |1 |null |null |null |null |null |null |null |null |
|555 |333 |1 |2020-05-03 21:50:25.585|813 |30.00 |200.00 |null |null |null |null |
|111 |111 |1 |null |null |null |null |null |null |null |null |
+------+-------+--------+-----------------------+------------+------------+------------+-----------------------+------------+------------+------------+
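Note that building the projection as notMatched ++ matched puts all the non-stage columns before the stage_ columns, which is why the output above is reordered. If you would rather keep the original column order, a small variation (reusing the same expr defined above; this is a sketch, not part of the original answer) is:
scala> val allColumns = df.columns.map { c =>
  if (c.startsWith("stage")) when(expr, lit(null)).otherwise(col(c)).as(c)
  else col(c)
}

scala> df.select(allColumns:_*).show(false)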
PySpark solution:
from pyspark.sql import functions as F

#test_data
tst = sqlContext.createDataFrame([(867,0.12,'G','2020-07-01 17-49-32','2020-07-02 17-49-32'),(430,0.72,'R','2020-07-01 17-49-32','2020-07-02 17-49-32'),(658,0.32,'A','2020-07-01 17-49-32','2020-06-01 17-49-32'),\
(157,0.83,'R','2020-07-01 17-49-32','2020-06-01 17-49-32'),(521,0.12,'G','2020-07-01 17-49-32','2020-08-01 17-49-32'),(867,0.49,'A','2020-07-01 16-45-32','2020-08-01 17-49-32'),
(430,0.14,'G','2020-07-01 16-45-32','2020-07-01 17-49-32'),(867,0.12,'G','2020-07-01 16-45-32','2020-07-01 17-49-32')],
schema=['stage_1','stage_2','RAG','timestamp','timestamp1'])
# change the string column to date
tst_format= (tst.withColumn('timestamp',F.to_date('timestamp',format='yyyy-MM-dd'))).withColumn('timestamp1',F.to_date('timestamp1',format='yyyy-MM-dd'))
# extract column information with name 'stage' and retain others
col_info =[x for x in tst_format.columns if 'stage_' in x]
col_orig = list(set(tst_format.columns)-set(col_info))
# Build the query expression
expr = col_orig+[(F.when(F.col('timestamp1')>F.col('timestamp'),F.col(x)).otherwise(F.lit(None))).alias(x) for x in col_info]
# execute query
tst_res = tst_format.select(*expr)
The results are:
+---+----------+----------+-------+-------+
|RAG|timestamp1| timestamp|stage_1|stage_2|
+---+----------+----------+-------+-------+
| G|2020-07-02|2020-07-01| 867| 0.12|
| R|2020-07-02|2020-07-01| 430| 0.72|
| A|2020-06-01|2020-07-01| null| null|
| R|2020-06-01|2020-07-01| null| null|
| G|2020-08-01|2020-07-01| 521| 0.12|
| A|2020-08-01|2020-07-01| 867| 0.49|
| G|2020-07-01|2020-07-01| null| null|
| G|2020-07-01|2020-07-01| null| null|
+---+----------+----------+-------+-------+

getting duplicate count but retaining duplicate rows in pyspark

I am trying to find the duplicate count of rows in a PySpark dataframe. I found a similar answer here,
but it only outputs a binary flag. I would like to have the actual count for each row.
To use the original post's example, if I have a dataframe like so:
+--+--+--+--+
|a |b |c |d |
+--+--+--+--+
|1 |0 |1 |2 |
|0 |2 |0 |1 |
|1 |0 |1 |2 |
|0 |4 |3 |1 |
|1 |0 |1 |2 |
+--+--+--+--+
I would like to result in something like:
+--+--+--+--+---------+
|a |b |c |d |row_count|
+--+--+--+--+---------+
|1 |0 |1 |2 |3        |
|0 |2 |0 |1 |0        |
|1 |0 |1 |2 |3        |
|0 |4 |3 |1 |0        |
|1 |0 |1 |2 |3        |
+--+--+--+--+---------+
Is this possible?
Thank You
Assuming df is your input dataframe:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy([F.col("a"), F.col("b"), F.col("c"), F.col("d")])
df = df.select(F.col("a"), F.col("b"), F.col("c"), F.col("d"),
               F.count(F.col("a")).over(w).alias("row_count"))
If, as per your example, you want to replace every count 1 with 0 do:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy([F.col("a"), F.col("b"), F.col("c"), F.col("d")])
df = (df.select(F.col("a"), F.col("b"), F.col("c"), F.col("d"),
                F.count(F.col("a")).over(w).alias("row_count"))
        .select("a", "b", "c", "d",
                F.when(F.col("row_count") == F.lit(1), F.lit(0))
                 .otherwise(F.col("row_count")).alias("row_count")))

Value of column changes after changing the Date format in scala spark

This is my data frame before changing the date format:
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T03:00:00+00:00|YUH |null |null |null |1976-12-31T00:00:00+00:00|true |false |1976-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T03:00:00+00:00|YUH |null |null |null |1983-12-31T00:00:00+00:00|true |false |1983-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T03:00:00+00:00|YUH |null |null |null |1976-12-31T00:00:00+00:00|true |false |1976-12-31T00:00:00+00:00 |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+-------------------------+----------------+------------+-----+-----------+-------------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
This is what I do to change the date format:
val df2resultTimestamp = finalXmlDf.withColumn("FilingDateTime_1", date_format(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1", date_format(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1", date_format(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", regexp_replace(format_number($"CumulativeAdjustmentFactor_1".cast(DoubleType), 5), ",", ""))
This is the output I get, where the FilingDateTime_1 column value has changed:
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T08:30:00Z|YUH |null |null |null |1976-12-31T05:30:00Z|true |false |1976-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T08:30:00Z|YUH |null |null |null |1983-12-31T05:30:00Z|true |false |1983-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T08:30:00Z|YUH |null |null |null |1976-12-31T05:30:00Z|true |false |1976-12-31T05:30:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
The value should be 1984-02-14T03:00:00Z.
I don't know what I am missing here.
All you need is to add the to_timestamp built-in function, as below. Without it, date_format implicitly casts the string to a timestamp, which applies the +00:00 offset and renders the instant in the session time zone, shifting the hours; parsing explicitly with a pattern that ignores the offset keeps the original wall-clock time.
val df2resultTimestamp = df.withColumn("FilingDateTime_1", date_format(to_timestamp(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1", date_format(to_timestamp(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1", date_format(to_timestamp(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", regexp_replace(format_number($"CumulativeAdjustmentFactor_1".cast(DoubleType), 5), ",", ""))
which should give you the correct output as
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|Source_organizationId|Source_sourceId|FilingDateTime_1 |SourceTypeCode_1|DocumentId_1|Dcn_1|DocFormat_1|StatementDate_1 |IsFilingDateTimeEstimated_1|ContainsPreliminaryData_1|CapitalChangeAdjustmentDate_1|CumulativeAdjustmentFactor_1|ContainsRestatement_1|FilingDateTimeUTCOffset_1|ThirdPartySourceCode_1|ThirdPartySourcePriority_1|SourceTypeId_1|ThirdPartySourceCodeId_1|FFAction|!|_1|DataPartition_1|TimeStamp |
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
|4295876589 |1 |1977-02-14T03:00:00Z|YUH |null |null |null |1976-12-31T00:00:00Z|true |false |1976-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:03:27+00:00|
|4295876589 |8 |1984-02-14T03:00:00Z|YUH |null |null |null |1983-12-31T00:00:00Z|true |false |1983-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T09:46:58+00:00|
|4295876589 |1 |1977-02-14T03:00:00Z|YUH |null |null |null |1976-12-31T00:00:00Z|true |false |1976-12-31T00:00:00Z |0.82457 |false |540 |SS |1 |3013057 |1000716240 |I|!| |Japan |2018-05-03T07:30:16+00:00|
+---------------------+---------------+--------------------+----------------+------------+-----+-----------+--------------------+---------------------------+-------------------------+-----------------------------+----------------------------+---------------------+-------------------------+----------------------+--------------------------+--------------+------------------------+-------------+---------------+-------------------------+
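To see the difference in isolation, here is a minimal sketch (it assumes a Spark 2.x session named spark with spark.implicits._ available; the first column's value depends on spark.sql.session.timeZone, and none of this is from the original answer):
import org.apache.spark.sql.functions.{col, date_format, to_timestamp}
import spark.implicits._

val demo = Seq("1984-02-14T03:00:00+00:00").toDF("ts")
demo.select(
  // Implicit cast: the +00:00 offset is honoured and the instant is rendered in the session time zone
  date_format(col("ts"), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("implicit_cast"),
  // Explicit parse: the pattern ignores the trailing offset, so the wall-clock time round-trips unchanged
  date_format(to_timestamp(col("ts"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("explicit_parse")
).show(false)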

clean missing values spark with aggregation function

I would like to clean missing values by replacing them with the mean. This source code used to work; I do not know why it doesn't work now. Any help will be appreciated.
Here is the dataset I use:
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe
0,,72,160,5,,2.9421,,3,4
1,54,70,,5,0.6301,2.7273,,3,
2,,51,164,5,,2.9834,,3,4
3,,74,170,5,0.6966,2.9654,2.3699,3,4
4,108,62,,5,0.6087,2.7093,2.1619,3,4
Here is what I did:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true)
  .format("com.databricks.spark.csv")
  .load("C:/Users/mhattabi/Desktop/data_with_missing_values3.csv")
df.show(false)

var newDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  newDF = newDF.na.fill(df.agg(max(colName)).first()(0).toString, Seq(colName))
}
newDF.show(false)
Here is the result; nothing happened:
initial_data
+---------+-----+---+------+---+------+---------+--------+-----+------+
|RowNumber|Poids|Age|Taille|0MI|Hmean |CoocParam|LdpParam|Test2|Classe|
+---------+-----+---+------+---+------+---------+--------+-----+------+
|0 |null |72 |160 |5 |null |2.9421 |null |3 |4 |
|1 |54 |70 |null |5 |0.6301|2.7273 |null |3 |null |
|2 |null |51 |164 |5 |null |2.9834 |null |3 |4 |
|3 |null |74 |170 |5 |0.6966|2.9654 |2.3699 |3 |4 |
|4 |108 |62 |null |5 |0.6087|2.7093 |2.1619 |3 |4 |
+---------+-----+---+------+---+------+---------+--------+-----+------+
new_data
+---------+-----+---+------+---+------+---------+--------+-----+------+
|RowNumber|Poids|Age|Taille|0MI|Hmean |CoocParam|LdpParam|Test2|Classe|
+---------+-----+---+------+---+------+---------+--------+-----+------+
|0 |null |72 |160 |5 |null |2.9421 |null |3 |4 |
|1 |54 |70 |null |5 |0.6301|2.7273 |null |3 |null |
|2 |null |51 |164 |5 |null |2.9834 |null |3 |4 |
|3 |null |74 |170 |5 |0.6966|2.9654 |2.3699 |3 |4 |
|4 |108 |62 |null |5 |0.6087|2.7093 |2.1619 |3 |4 |
+---------+-----+---+------+---+------+---------+--------+-----+------+
What should I do?
You can use the withColumn API with the when function to check for null values in the columns, as:
import org.apache.spark.sql.functions.{col, max, when}

var newDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  val fill = df.agg(max(col(s"`$colName`"))).first()(0).toString
  newDF = newDF.withColumn(colName, when(col(s"`$colName`").isNull, fill).otherwise(col(s"`$colName`")))
}
newDF.show(false)
I hope this solves your issue
If you are trying to replace the null values with the mean value, then calculate the mean and fill it in as:
import org.apache.spark.sql.functions.mean
val data = spark.read.option("header", true)
.option("inferSchema", true).format("com.databricks.spark.csv")
.load("data.csv")
//Calculate the mean for each column and create a map with its column name
//and use na.fill() method to replace null with that mean
data.na.fill(data.columns.zip(
data.select(data.columns.map(mean(_)): _*).first.toSeq
).toMap)
I have tested the code locally and it works fine.
Output:
+---------+-----+---+------+---+------------------+---------+------------------+-----+------+
|RowNumber|Poids|Age|Taille|0MI| Hmean|CoocParam| LdpParam|Test2|Classe|
+---------+-----+---+------+---+------------------+---------+------------------+-----+------+
| 0| 81| 72| 160| 5|0.6451333333333333| 2.9421|2.2659000000000002| 3| 4|
| 1| 54| 70| 164| 5| 0.6301| 2.7273|2.2659000000000002| 3| 4|
| 2| 81| 51| 164| 5|0.6451333333333333| 2.9834|2.2659000000000002| 3| 4|
| 3| 81| 74| 170| 5| 0.6966| 2.9654| 2.3699| 3| 4|
| 4| 108| 62| 164| 5| 0.6087| 2.7093| 2.1619| 3| 4|
+---------+-----+---+------+---+------------------+---------+------------------+-----+------+
Hope this helps!
This should do it:
var imputeDF = df
df.dtypes.foreach { x =>
  val colName = x._1
  imputeDF = imputeDF.na.fill(df.agg(max(colName)).first()(0).toString, Seq(colName))
}
Note that it is not good practice to use mutable variables in Scala.
Depending on your data, you can use a SQL join or something else to replace the nulls with a more suitable value.
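If the goal is specifically mean imputation of numeric columns, another option worth knowing about (not mentioned in the answers above) is Spark ML's Imputer. A minimal sketch, assuming Spark 2.2+ and that the listed columns exist and can be cast to doubles:
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.functions.col

// Imputer expects double/float inputs in older Spark versions, so cast first
val numericCols = Array("Poids", "Taille", "Hmean", "LdpParam")
val casted = numericCols.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("double")))

val imputer = new Imputer()
  .setInputCols(numericCols)
  .setOutputCols(numericCols.map(_ + "_imputed"))  // write into new columns
  .setStrategy("mean")                             // replace nulls/NaNs with the per-column mean

imputer.fit(casted).transform(casted).show(false)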