How to build map rows from comma-delimited strings? - scala

var clearedLine = ""
var dict = collection.mutable.Map[String, String]()
val rdd = BufferedSource.map(line=> ({
if (!line.endsWith(", ")) {
clearedLine = line+", "
} else{
clearedLine = line.trim
}
clearedLine.split(",")(0).trim->clearedLine.split(",")(1).trim
}
//,clearedLine.split(",")(1).trim->clearedLine.split(",")(0).trim
)
//dict +=clearedLine.split(",")(0).trim.replace(" TO ","->")
)
for ((k,v) <- rdd) printf("key: %s, value: %s\n", k, v)
OUTPUT:
key: EQU EB.AR.DESCRIPT TO 1, value: EB.AR.ASSET.CLASS TO 2
key: EB.AR.CURRENCY TO 3, value: EB.AR.ORIGINAL.VALUE TO 4
I want to split each field on ' TO ' and then produce a single dict of key -> value pairs. Please help.
key: 1, value: EQU EB.AR.DESCRIPT
key: 2, value: EB.AR.ASSET.CLASS
key: 3, value: EB.AR.CURRENCY
key: 4, value: EB.AR.ORIGINAL.VALUE

Assuming your input consists of lines like these:
EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2
EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4
you can try this Spark DataFrame solution in Scala:
scala> val df = Seq(("EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2"),("EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: string]
scala> df.show(false)
+----------------------------------------------+
|a |
+----------------------------------------------+
|EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2|
|EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4|
+----------------------------------------------+
scala> val df2 = df.select(split($"a",",").getItem(0).as("a1"),split($"a",",").getItem(1).as("a2"))
df2: org.apache.spark.sql.DataFrame = [a1: string, a2: string]
scala> df2.show(false)
+-----------------------+--------------------------+
|a1 |a2 |
+-----------------------+--------------------------+
|EQU EB.AR.DESCRIPT TO 1|EB.AR.ASSET.CLASS TO 2 |
|EB.AR.CURRENCY TO 3 | EB.AR.ORIGINAL.VALUE TO 4|
+-----------------------+--------------------------+
scala> val df3 = df2.flatMap( r => { (0 until r.size).map( i=> r.getString(i) ) })
df3: org.apache.spark.sql.Dataset[String] = [value: string]
scala> df3.show(false)
+--------------------------+
|value |
+--------------------------+
|EQU EB.AR.DESCRIPT TO 1 |
|EB.AR.ASSET.CLASS TO 2 |
|EB.AR.CURRENCY TO 3 |
| EB.AR.ORIGINAL.VALUE TO 4|
+--------------------------+
scala> df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).show(false)
+---+---------------------+
|key|value |
+---+---------------------+
|1 |EQU EB.AR.DESCRIPT |
|2 |EB.AR.ASSET.CLASS |
|3 |EB.AR.CURRENCY |
|4 | EB.AR.ORIGINAL.VALUE|
+---+---------------------+
If you want them as a single "map" column:
scala> val df4 = df3.select(regexp_extract($"value",""" TO (\d+)\s*$""",1).as("key"),regexp_replace($"value",""" TO (\d+)\s*$""","").as("value")).select(map($"key",$"value").as("kv"))
df4: org.apache.spark.sql.DataFrame = [kv: map<string,string>]
scala> df4.show(false)
+----------------------------+
|kv |
+----------------------------+
|[1 -> EQU EB.AR.DESCRIPT] |
|[2 -> EB.AR.ASSET.CLASS] |
|[3 -> EB.AR.CURRENCY] |
|[4 -> EB.AR.ORIGINAL.VALUE]|
+----------------------------+
scala> df4.printSchema
root
|-- kv: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
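For comparison, the same split can be sketched with plain Scala collections, which is closer to the mutable dict the question started from. This is only a minimal sketch: the object name `ToPairs` and its helpers are hypothetical, and it assumes the key is whatever follows the last " TO " in each comma-separated field.

```scala
object ToPairs {
  private val Sep = " TO "

  // Turn one field such as "EB.AR.CURRENCY TO 3" into ("3", "EB.AR.CURRENCY").
  // Fields that do not contain the separator are skipped (an assumption).
  def parseField(field: String): Option[(String, String)] = {
    val i = field.lastIndexOf(Sep)
    if (i < 0) None
    else Some(field.substring(i + Sep.length).trim -> field.substring(0, i).trim)
  }

  // Split every line on commas, trim each field, and collect the pairs into one Map.
  def parseLines(lines: Iterator[String]): Map[String, String] =
    lines
      .flatMap(line => line.split(",").toSeq)
      .map(_.trim)
      .filter(_.nonEmpty)
      .flatMap(field => parseField(field).toList)
      .toMap

  def main(args: Array[String]): Unit = {
    val lines = Iterator(
      "EQU EB.AR.DESCRIPT TO 1,EB.AR.ASSET.CLASS TO 2",
      "EB.AR.CURRENCY TO 3, EB.AR.ORIGINAL.VALUE TO 4")
    parseLines(lines).toSeq.sortBy(_._1).foreach {
      case (k, v) => println(s"key: $k, value: $v")
    }
  }
}
```

Unlike the regexp version above, this trims each field, so the leading space in front of EB.AR.ORIGINAL.VALUE is dropped.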

Related

Add new record before another in Spark

I have a Dataframe:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:02 2
I want to use Spark with Scala to add a new record before each record whose value is 2, with the same fields but a timestamp one second earlier.
The output would be:
| ID | TIMESTAMP | VALUE |
1 15:00:01 3
1 17:04:01 2
1 17:04:02 2
Thanks
You need a .flatMap()
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val data = (spark.createDataset(Seq(
(1, "15:00:01", 3),
(1, "17:04:02", 2)
)).toDF("ID", "TIMESTAMP_STR", "VALUE")
.withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp").as("TIMESTAMP"))
.drop("TIMESTAMP_STR")
.select("ID", "TIMESTAMP", "VALUE")
)
data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
if(r._3 == 2) {
Seq(
(r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3),
(r._1, r._2, r._3)
)
} else {
Seq((r._1, r._2, r._3))
}
}).toDF("ID", "TIMESTAMP", "VALUE").show()
Which results in:
+---+-------------------+-----+
| ID| TIMESTAMP|VALUE|
+---+-------------------+-----+
| 1|2019-03-04 15:00:01| 3|
| 1|2019-03-04 17:04:01| 2|
| 1|2019-03-04 17:04:02| 2|
+---+-------------------+-----+
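The same idea can be tried without a Spark session, since flatMap on an ordinary Scala Seq behaves the same way: each input row yields zero or more output rows. The sketch below is illustrative only; `InsertBefore` and `Reading` are hypothetical names, and it uses java.time.LocalTime instead of java.sql.Timestamp to keep the example small.

```scala
import java.time.LocalTime

object InsertBefore {
  final case class Reading(id: Int, time: LocalTime, value: Int)

  // For every row whose value is 2, emit a copy one second earlier followed by
  // the row itself; all other rows pass through unchanged.
  def expand(rows: Seq[Reading]): Seq[Reading] =
    rows.flatMap { r =>
      if (r.value == 2) Seq(r.copy(time = r.time.minusSeconds(1)), r)
      else Seq(r)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Reading(1, LocalTime.parse("15:00:01"), 3),
      Reading(1, LocalTime.parse("17:04:02"), 2))
    expand(rows).foreach(println)
  }
}
```

Note that LocalTime wraps around midnight, so a real job working on full timestamps would behave differently at 00:00:00.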
You can add an array column that is Array(-1, 0) when value = 2 and Array(0) otherwise, then explode that column and add each exploded element, as seconds, to the timestamp. The following should work for you:
scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]
scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]
scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]
scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp |value|newc |
+---+-------------------+-----+-------+
|1 |2019-03-04 15:00:01|3 |[0] |
|1 |2019-03-04 17:04:02|2 |[-1, 0]|
+---+-------------------+-----+-------+
scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]
scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2 |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:01|2 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
If you want only the time part, you can do this:
scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1 |15:00:01 |3 |
|1 |17:04:01 |2 |
|1 |17:04:02 |2 |
+---+---------+-----+

In spark iterate through each column and find the max length

I am new to Spark and Scala, and I have the following situation.
I have a table "TEST_TABLE" on the cluster (it can be a Hive table).
I am converting it to a DataFrame as follows:
scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")
Now the DF can be viewed as
scala> testDF.show()
COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd
I want an output like below
COLUMN_NAME|MAX_LENGTH
COL1|3
COL2|8
COL3|6
Is this feasible in Spark with Scala?
Plain and simple:
import org.apache.spark.sql.functions._
val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*)
You can try in the following way:
import org.apache.spark.sql.functions.{length, max}
import spark.implicits._
val df = Seq(("abc","abcd","abcdef"),
("a","BCBDFG","qddfde"),
("MN","1234B678","sd"),
(null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()
val output = df.columns.map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first())).toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
output.show()
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
| COL1| 3|
| COL2| 8|
| COL3| 6|
+-----------+----------+
I think it's a good idea to cache the input dataframe df to make the computation faster, since this runs one aggregation per column.
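To make the per-column logic concrete, here is a minimal sketch on plain Scala collections rather than a DataFrame. The names `MaxLengths` and `maxLengths` are hypothetical. Null cells are skipped, which mirrors how max ignores nulls in Spark; returning 0 for a column with no non-null cells is an assumption of this sketch (Spark would return null there).

```scala
object MaxLengths {
  // columns: column names in order.
  // rows: one Seq per row, with one (possibly null) cell per column.
  def maxLengths(columns: Seq[String], rows: Seq[Seq[String]]): Seq[(String, Int)] =
    columns.zipWithIndex.map { case (name, i) =>
      // Lengths of the non-null cells in column i.
      val lengths = rows.flatMap(row => Option(row(i)).map(_.length))
      // Assumption: report 0 when the column has no non-null cells.
      name -> (if (lengths.isEmpty) 0 else lengths.max)
    }

  def main(args: Array[String]): Unit = {
    val columns = Seq("COL1", "COL2", "COL3")
    val rows = Seq(
      Seq("abc", "abcd", "abcdef"),
      Seq("a", "BCBDFG", "qddfde"),
      Seq("MN", "1234B678", "sd"),
      Seq(null, "", "sd"))
    maxLengths(columns, rows).foreach { case (c, n) => println(s"$c|$n") }
  }
}
```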
Here is one more way to get the report, with the column names listed vertically:
scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]
scala> df.show(false)
+----+--------+------+
|COL1|COL2 |COL3 |
+----+--------+------+
|abc |abcd |abcdef|
|a |BCBDFG |qddfde|
|MN |1234B678|sd |
+----+--------+------+
scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)
scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]
scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3 |8 |6 |
+---------+---------+---------+
scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1 |_2 |
+----+---+
|COL1|3 |
|COL2|8 |
|COL3|6 |
+----+---+
To get the results into Scala collections, say Map()
scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)
scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)

Scala : Passing elements of a Dataframe from every row and get back the result in separate rows

In my requirement, I need to pass two strings from two columns of my dataframe to a function, get back a string result, and store that result back in a dataframe.
When I pass the values as strings, the function always returns the same value, so every row is populated with the same value (in my case, PPPPP).
Is there a way to pass the elements of those two columns from every row and get a separate result for each row?
I am ready to modify my function to accept a DataFrame and return a DataFrame, or to accept an array of strings and return an array of strings, but I don't know how to do that as I am new to programming. Can someone please help me?
Thanks.
def myFunction(key: String , value :String ) : String = {
//Do my functions and get back a string value2 and return this value2 string
value2
}
val DF2 = DF1.select (
DF1("col1")
,DF1("col2")
,DF1("col5") )
.withColumn("anyName", lit(myFunction ( DF1("col3").toString() , DF1("col4").toString() )))
/* DF1:
+-----+----+--------+----+----+
|col1 |col2|col3    |col4|col5|
+-----+----+--------+----+----+
|Hello|5   |valueAAA|XXX |123 |
|How  |3   |valueCCC|YYY |111 |
|World|5   |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
DF2:
+-----+----+----+-------+
|col1 |col2|col5|anyName|
+-----+----+----+-------+
|Hello|5   |123 |PPPPP  |
|How  |3   |111 |PPPPP  |
|World|5   |222 |PPPPP  |
+-----+----+----+-------+
*/
After you define the function, you need to wrap it as a UDF with udf() and apply it to the columns. The udf() function is available in org.apache.spark.sql.functions. Check this out:
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
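As far as I can tell, the version in the question produced the same value everywhere because myFunction was called once on the driver with the string form of the Column objects, and lit(...) then turned that single result into a constant column. A UDF is instead evaluated once per row. The plain-Scala sketch below only illustrates that per-row evaluation; `PerRow` and `Rec` are hypothetical names and no Spark API is involved.

```scala
object PerRow {
  final case class Rec(col1: String, col2: Int, col3: String, col4: String, col5: Int)

  // Same helper as in the answer above.
  def myConcat(a: String, b: String): String = a + "--" + b

  // The function runs once for each row, using that row's own col3 and col4.
  def withDerived(rows: Seq[Rec]): Seq[(String, Int, Int, String)] =
    rows.map(r => (r.col1, r.col2, r.col5, myConcat(r.col3, r.col4)))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Rec("Hello", 5, "valueAAA", "XXX", 123),
      Rec("How", 3, "valueCCC", "YYY", 111),
      Rec("World", 5, "valueDDD", "ZZZ", 222))
    withDerived(rows).foreach(println)
  }
}
```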

Convert Spark DataFrame to HashMap of HashMaps

I have a dataframe that looks like this:
column1_ID column2 column3 column4
A_123 12 A 1
A_123 12 B 2
A_123 23 A 1
B_456 56 DB 4
B_456 56 BD 5
B_456 60 BD 3
I would like to convert the above dataframe/RDD into the output below, keyed by column1_ID, where each value is a HashMap(Long, HashMap(String, Long)):
'A_123': {12 : {'A': 1, 'B': 2}, 23: {'A': 1} },
'B_456': {56 : {'DB': 4, 'BD': 5}, 60: {'BD': 3} }
I tried reduceByKey and groupByKey but couldn't get the output into the expected shape.
This can be done by building a nested structure from the last three columns and then applying a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{collect_list, struct, udf}
import spark.implicits._
val data = List(
("A_123", 12, "A", 1),
("A_123", 12, "B", 2),
("A_123", 23, "A", 1),
("B_456", 56, "DB", 4),
("B_456", 56, "BD", 5),
("B_456", 60, "BD", 3))
val df = data.toDF("column1_ID", "column2", "column3", "column4")
// Pack column3 and column4 into one struct per row.
val twoLastCompacted = df.withColumn("lastTwo", struct($"column3", $"column4"))
twoLastCompacted.show(false)
// Collect the structs for each (column1_ID, column2) pair.
val groupedByTwoFirst = twoLastCompacted.groupBy("column1_ID", "column2").agg(collect_list("lastTwo").alias("lastTwoArray"))
groupedByTwoFirst.show(false)
// Pack column2 together with its collected structs.
val threeLastCompacted = groupedByTwoFirst.withColumn("lastThree", struct($"column2", $"lastTwoArray"))
threeLastCompacted.show(false)
// Collect those structs for each column1_ID.
val groupedByFirst = threeLastCompacted.groupBy("column1_ID").agg(collect_list("lastThree").alias("lastThreeArray"))
groupedByFirst.printSchema()
groupedByFirst.show(false)
// Convert the array of (column2, array of (column3, column4)) structs into a Map[Int, Map[String, Int]].
val structToMap = (value: Seq[Row]) =>
value.map(v => v.getInt(0) ->
v.getSeq[Row](1).map(r => r.getString(0) -> r.getInt(1)).toMap)
.toMap
val structToMapUDF = udf(structToMap)
groupedByFirst.select($"column1_ID", structToMapUDF($"lastThreeArray")).show(false)
Output:
+----------+-------+-------+-------+-------+
|column1_ID|column2|column3|column4|lastTwo|
+----------+-------+-------+-------+-------+
|A_123 |12 |A |1 |[A,1] |
|A_123 |12 |B |2 |[B,2] |
|A_123 |23 |A |1 |[A,1] |
|B_456 |56 |DB |4 |[DB,4] |
|B_456 |56 |BD |5 |[BD,5] |
|B_456 |60 |BD |3 |[BD,3] |
+----------+-------+-------+-------+-------+
+----------+-------+----------------+
|column1_ID|column2|lastTwoArray |
+----------+-------+----------------+
|B_456 |60 |[[BD,3]] |
|A_123 |12 |[[A,1], [B,2]] |
|B_456 |56 |[[DB,4], [BD,5]]|
|A_123 |23 |[[A,1]] |
+----------+-------+----------------+
+----------+-------+----------------+---------------------------------+
|column1_ID|column2|lastTwoArray |lastThree |
+----------+-------+----------------+---------------------------------+
|B_456 |60 |[[BD,3]] |[60,WrappedArray([BD,3])] |
|A_123 |12 |[[A,1], [B,2]] |[12,WrappedArray([A,1], [B,2])] |
|B_456 |56 |[[DB,4], [BD,5]]|[56,WrappedArray([DB,4], [BD,5])]|
|A_123 |23 |[[A,1]] |[23,WrappedArray([A,1])] |
+----------+-------+----------------+---------------------------------+
root
|-- column1_ID: string (nullable = true)
|-- lastThreeArray: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column2: integer (nullable = false)
| | |-- lastTwoArray: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- column3: string (nullable = true)
| | | | |-- column4: integer (nullable = false)
+----------+--------------------------------------------------------------+
|column1_ID|lastThreeArray |
+----------+--------------------------------------------------------------+
|B_456 |[[60,WrappedArray([BD,3])], [56,WrappedArray([DB,4], [BD,5])]]|
|A_123 |[[12,WrappedArray([A,1], [B,2])], [23,WrappedArray([A,1])]] |
+----------+--------------------------------------------------------------+
+----------+----------------------------------------------------+
|column1_ID|UDF(lastThreeArray) |
+----------+----------------------------------------------------+
|B_456 |Map(60 -> Map(BD -> 3), 56 -> Map(DB -> 4, BD -> 5))|
|A_123 |Map(12 -> Map(A -> 1, B -> 2), 23 -> Map(A -> 1)) |
+----------+----------------------------------------------------+
You can convert the DF to an rdd and apply the operations like below:
scala> case class Data(col1: String, col2: Int, col3: String, col4: Int)
defined class Data
scala> var x: Seq[Data] = List(Data("A_123",12,"A",1), Data("A_123",12,"B",2), Data("A_123",23,"A",1), Data("B_456",56,"DB",4), Data("B_456",56,"BD",5), Data("B_456",60,"BD",3))
x: Seq[Data] = List(Data(A_123,12,A,1), Data(A_123,12,B,2), Data(A_123,23,A,1), Data(B_456,56,DB,4), Data(B_456,56,BD,5), Data(B_456,60,BD,3))
scala> sc.parallelize(x).groupBy(_.col1).map{a => (a._1, HashMap(a._2.groupBy(_.col2).map{b => (b._1, HashMap(b._2.groupBy(_.col3).map{c => (c._1, c._2.map(_.col4).head)}.toArray: _*))}.toArray: _*))}.toDF()
res26: org.apache.spark.sql.DataFrame = [_1: string, _2: map<int,map<string,int>>]
Here sc.parallelize(x) initializes an RDD with the same data structure as in your case.
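The nested grouping itself can be checked on a local collection before running it on an RDD. The following is a minimal sketch with hypothetical names (`NestedMaps`, `nest`); it uses immutable Maps rather than HashMap, and when the same column3 appears twice under one column2 the last value wins, whereas the RDD version above takes the first.

```scala
object NestedMaps {
  // (column1_ID, column2, column3, column4)
  type Rec = (String, Int, String, Int)

  // Group by column1_ID, then by column2, then map column3 -> column4.
  def nest(rows: Seq[Rec]): Map[String, Map[Int, Map[String, Int]]] =
    rows.groupBy(_._1).map { case (id, byId) =>
      id -> byId.groupBy(_._2).map { case (c2, byC2) =>
        c2 -> byC2.map(r => r._3 -> r._4).toMap
      }
    }

  def main(args: Array[String]): Unit = {
    val data: Seq[Rec] = Seq(
      ("A_123", 12, "A", 1), ("A_123", 12, "B", 2), ("A_123", 23, "A", 1),
      ("B_456", 56, "DB", 4), ("B_456", 56, "BD", 5), ("B_456", 60, "BD", 3))
    nest(data).foreach(println)
  }
}
```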

delete constant columns spark having issue with timestamp column

Hi guys, I wrote this code to drop columns that hold a constant value.
I start by computing the standard deviation of each column and then drop the ones whose standard deviation is zero. However, I get the following error when a column has a timestamp type. What should I do?
cannot resolve 'stddev_samp(time.1)' due to data type mismatch: argument 1 requires double type, however, 'time.1' is of timestamp type.;;
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
//val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
df.show(5)
//df.columns.map(p=>s"`${p}`")
//val aggs = df.columns.map(c => stddev(c).as(c))
val aggs = df.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = df.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(df.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
df.select(columnsToKeep: _*).show(5,false)
Using stddev
stddev is only defined on numeric columns. To compute the standard deviation of a date or timestamp column, first convert it to a numeric Unix timestamp (seconds since the epoch) with unix_timestamp:
scala> var myDF = (0 to 10).map(x => (x, scala.util.Random.nextDouble)).toDF("id", "rand_double")
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double]
scala> myDF = myDF.withColumn("Date", current_date())
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: date (nullable = false)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|2017-03-21|
| 1| 0.5968932024004612|2017-03-21|
| 2|0.05912760417456575|2017-03-21|
| 3|0.29974600653895667|2017-03-21|
| 4| 0.8448407414817856|2017-03-21|
| 5| 0.2049495659443249|2017-03-21|
| 6| 0.4184846380144779|2017-03-21|
| 7|0.21400484330739022|2017-03-21|
| 8| 0.9558142791013501|2017-03-21|
| 9|0.32530639391058036|2017-03-21|
| 10| 0.5100585655062743|2017-03-21|
+---+-------------------+----------+
scala> myDF = myDF.withColumn("Date", unix_timestamp($"Date"))
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: long (nullable = true)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|1490072400|
| 1| 0.5968932024004612|1490072400|
| 2|0.05912760417456575|1490072400|
| 3|0.29974600653895667|1490072400|
| 4| 0.8448407414817856|1490072400|
| 5| 0.2049495659443249|1490072400|
| 6| 0.4184846380144779|1490072400|
| 7|0.21400484330739022|1490072400|
| 8| 0.9558142791013501|1490072400|
| 9|0.32530639391058036|1490072400|
| 10| 0.5100585655062743|1490072400|
+---+-------------------+----------+
At this point all of the columns are numeric so your code will run fine:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val aggs = myDF.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = myDF.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(myDF.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
myDF.select(columnsToKeep: _*).show(false)
// Exiting paste mode, now interpreting.
+---+-------------------+
|id |rand_double |
+---+-------------------+
|0 |0.3786008989478248 |
|1 |0.5968932024004612 |
|2 |0.05912760417456575|
|3 |0.29974600653895667|
|4 |0.8448407414817856 |
|5 |0.2049495659443249 |
|6 |0.4184846380144779 |
|7 |0.21400484330739022|
|8 |0.9558142791013501 |
|9 |0.32530639391058036|
|10 |0.5100585655062743 |
+---+-------------------+
aggs: Array[org.apache.spark.sql.Column] = Array(stddev_samp(id) AS `id`, stddev_samp(rand_double) AS `rand_double`, stddev_samp(Date) AS `Date`)
stddevs: org.apache.spark.sql.DataFrame = [id: double, rand_double: double ... 1 more field]
columnsToKeep: Seq[org.apache.spark.sql.Column] = ArrayBuffer(id, rand_double)
Using countDistinct
All that being said, it would be more general to use countDistinct:
scala> val distCounts = myDF.select(myDF.columns.map(c => countDistinct(c) as c): _*).first.toSeq.zip(myDF.columns)
distCounts: Seq[(Any, String)] = ArrayBuffer((11,id), (11,rand_double), (1,Date))
scala> distCounts.foldLeft(myDF)((accDF, dc_col) => if (dc_col._1 == 1) accDF.drop(dc_col._2) else accDF).show
+---+-------------------+
| id| rand_double|
+---+-------------------+
| 0| 0.3786008989478248|
| 1| 0.5968932024004612|
| 2|0.05912760417456575|
| 3|0.29974600653895667|
| 4| 0.8448407414817856|
| 5| 0.2049495659443249|
| 6| 0.4184846380144779|
| 7|0.21400484330739022|
| 8| 0.9558142791013501|
| 9|0.32530639391058036|
| 10| 0.5100585655062743|
+---+-------------------+
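The distinct-count rule does not depend on the column type, which is easy to see on a plain Scala collection. The sketch below uses hypothetical names (`DropConstant`, `dropConstantColumns`) and models a table as a map from column name to cell values. One assumed difference from the Spark code above: countDistinct ignores nulls, while this sketch counts null as an ordinary value.

```scala
object DropConstant {
  // table: column name -> all cell values in that column.
  // Keep only the columns that hold more than one distinct value.
  def dropConstantColumns(table: Map[String, Seq[Any]]): Map[String, Seq[Any]] =
    table.filter { case (_, values) => values.distinct.size > 1 }

  def main(args: Array[String]): Unit = {
    val table: Map[String, Seq[Any]] = Map(
      "id" -> Seq(0, 1, 2),
      "rand_double" -> Seq(0.37, 0.59, 0.05),
      "Date" -> Seq(1490072400L, 1490072400L, 1490072400L))
    println(dropConstantColumns(table).keys.toSeq.sorted.mkString(", "))
  }
}
```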