Apply QuantileDiscretizer to all columns in a DataFrame - scala

Assume that I have a dataframe with an id column and 100 other columns. I want to apply QuantileDiscretizer to each column and return a new dataframe that keeps the id column alongside new columns holding the discretized values.
Example for two columns only:
Input
id | col1 | col2
----|------|------
0 | 18.0 | 20.0
1 | 19.0 | 30.0
2 | 8.0 | 35.0
3 | 5.0 | 10.0
4 | 2.2 | 5.0
Output
id | col1Disc | col2Disc
----|----------|----------
0 | 2 | 2
1 | 2 | 3
2 | 1 | 3
3 | 2 | 1
4 | 0 | 0

You can use the Pipeline API:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.QuantileDiscretizer
val df = Seq(
(0, 18.0, 20.0), (1, 19.0, 30.0), (2, 8.0, 35.0), (3, 5.0, 10.0), (4, 2.2, 5.0)
).toDF("id", "col1", "col2")
val pipeline = new Pipeline().setStages(for {
c <- df.columns
if c != "id"
} yield new QuantileDiscretizer().setInputCol(c).setOutputCol(s"${c}Disc"))
val result = pipeline.fit(df).transform(df)
result.drop(df.columns.diff(Seq("id")): _*).show
+---+--------+--------+
| id|col1Disc|col2Disc|
+---+--------+--------+
| 0| 1.0| 1.0|
| 1| 1.0| 1.0|
| 2| 1.0| 1.0|
| 3| 0.0| 0.0|
| 4| 0.0| 0.0|
+---+--------+--------+
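The output above only contains 0.0 and 1.0 because numBuckets defaults to 2. As a minimal sketch, assuming Spark 2.3 or later (where QuantileDiscretizer accepts several columns at once through setInputCols and setOutputCols) and an arbitrarily chosen bucket count of 3, the same result can be produced with a single estimator instead of one pipeline stage per column. The local SparkSession is only there to make the snippet self-contained.

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("discretize-all").getOrCreate()
import spark.implicits._

val df = Seq(
  (0, 18.0, 20.0), (1, 19.0, 30.0), (2, 8.0, 35.0), (3, 5.0, 10.0), (4, 2.2, 5.0)
).toDF("id", "col1", "col2")

// Every column except the key is discretized.
val inputCols = df.columns.filter(_ != "id")
val outputCols = inputCols.map(c => s"${c}Disc")

val discretizer = new QuantileDiscretizer()
  .setInputCols(inputCols)
  .setOutputCols(outputCols)
  .setNumBuckets(3) // assumption: 3 buckets; the default is 2

// Keep only the id and the discretized columns.
val result = discretizer.fit(df).transform(df).select("id", outputCols: _*)
result.show()
```

Bucket boundaries come from approximate quantiles, so the exact bucket assigned to each row can vary with the data size and the relativeError setting.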

Related

Retrieve column value given a column of column names (spark / scala)

I have a dataframe like the following:
+-----------+-----------+---------------+------+---------------------+
|best_col |A |B | C |<many more columns> |
+-----------+-----------+---------------+------+---------------------+
| A | 14 | 26 | 32 | ... |
| C | 13 | 17 | 96 | ... |
| B | 23 | 19 | 42 | ... |
+-----------+-----------+---------------+------+---------------------+
I want to end up with a DataFrame like this:
+-----------+-----------+---------------+------+---------------------+----------+
|best_col |A |B | C |<many more columns> | result |
+-----------+-----------+---------------+------+---------------------+----------+
| A | 14 | 26 | 32 | ... | 14 |
| C | 13 | 17 | 96 | ... | 96 |
| B | 23 | 19 | 42 | ... | 19 |
+-----------+-----------+---------------+------+---------------------+----------+
Essentially, I want to add a column result that will choose the value from the column specified in the best_col column. best_col only contains column names that are present in the DataFrame. Since I have dozens of columns, I want to avoid using a bunch of when statements to check when col(best_col) === A etc. I tried doing col(col("best_col").toString()), but this didn't work. Is there an easy way to do this?
Using map_filter, introduced in Spark 3.0:
import org.apache.spark.sql.functions._
val df = Seq(
("A", 14, 26, 32),
("C", 13, 17, 96),
("B", 23, 19, 42),
).toDF("best_col", "A", "B", "C")
df.withColumn("result", map(df.columns.tail.flatMap(c => Seq(col(c), lit(col("best_col") === lit(c)))): _*))
.withColumn("result", map_filter(col("result"), (a, b) => b))
.withColumn("result", map_keys(col("result"))(0))
.show()
+--------+---+---+---+------+
|best_col| A| B| C|result|
+--------+---+---+---+------+
| A| 14| 26| 32| 14|
| C| 13| 17| 96| 96|
| B| 23| 19| 42| 19|
+--------+---+---+---+------+
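The map in the answer above is keyed by the column values, so a row in which two value columns hold the same number may hit a duplicate-key error under the default map key deduplication policy in Spark 3.x. A variation that sidesteps this, sketched here under the assumption that every value column has the same type, keys the map by column name and looks up best_col directly. The names valueCols and byName are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, map}

val spark = SparkSession.builder().master("local[*]").appName("lookup-by-name").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", 14, 26, 32),
  ("C", 13, 17, 96),
  ("B", 23, 19, 42)
).toDF("best_col", "A", "B", "C")

// Per row, build the map {"A" -> A, "B" -> B, "C" -> C}, keyed by column name.
val valueCols = df.columns.filter(_ != "best_col")
val byName = map(valueCols.flatMap(c => Seq(lit(c), col(c))): _*)

// Extract the entry whose key equals the name stored in best_col.
val result = df.withColumn("result", byName(col("best_col")))
result.show()
```

Because the keys are the distinct column names, repeated values within a row do not matter here.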

How to make a transform of return type DataFrame => DataFrame that adds the product of two columns as the value and column1_column2 as the name

Input
+----+---+---+---+
| id | a | b | c |
+----+---+---+---+
| 1  | 1 | 0 | 1 |
+----+---+---+---+
Output
+----+---+---+---+-----+-----+-----+
| id | a | b | c | a_b | a_c | b_c |
+----+---+---+---+-----+-----+-----+
| 1  | 1 | 0 | 1 | 0   | 1   | 0   |
+----+---+---+---+-----+-----+-----+
Basically, I have a sequence of pairs, Seq((a,b),(a,c),(b,c)),
and the values of the new columns will be col(a)*col(b), col(a)*col(c) and col(b)*col(c).
I know how to add them to a DataFrame, but I am not able to write a transform of return type DataFrame => DataFrame.
Is this what you want?
Take a look at the API page. You will save yourself some time :)
val df = Seq((1, 1, 0, 1))
.toDF("id", "a", "b", "c")
.withColumn("a_b", $"a" * $"b")
.withColumn("a_c", $"a" * $"c")
.withColumn("b_c", $"b" * $"c")
Output:
+---+---+---+---+---+---+---+
| id| a| b| c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
| 1| 1| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+---+
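The question asks specifically for a function of type DataFrame => DataFrame driven by a sequence of column pairs. A minimal sketch of that shape follows; the name withProducts is hypothetical, and the function is applied with Dataset.transform. It folds over the pairs and adds one product column per pair.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("pair-products").getOrCreate()
import spark.implicits._

// Hypothetical helper: returns a DataFrame => DataFrame that appends
// one column named "<x>_<y>" holding col(x) * col(y) for each pair.
def withProducts(pairs: Seq[(String, String)]): DataFrame => DataFrame =
  df => pairs.foldLeft(df) { case (acc, (x, y)) =>
    acc.withColumn(s"${x}_${y}", col(x) * col(y))
  }

val df = Seq((1, 1, 0, 1)).toDF("id", "a", "b", "c")
val pairs = Seq(("a", "b"), ("a", "c"), ("b", "c"))

val result = df.transform(withProducts(pairs))
result.show()
```

Returning a function rather than taking the DataFrame as a second argument lets several such steps be chained with repeated transform calls.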

How to convert Columns to rows in Spark scala or spark sql?

I have the Data like this.
+------+------+------+----------+----------+----------+----------+----------+----------+
| Col1 | Col2 | Col3 | Col1_cnt | Col2_cnt | Col3_cnt | Col1_wts | Col2_wts | Col3_wts |
+------+------+------+----------+----------+----------+----------+----------+----------+
| AAA | VVVV | SSSS | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| BBB | BBBB | TTTT | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
| CCC | DDDD | YYYY | 3 | 4 | 5 | 0.5 | 0.4 | 0.6 |
+------+------+------+----------+----------+----------+----------+----------+----------+
I have tried but I am not getting any help here.
val df = Seq(("G",Some(4),2,None),("H",None,4,Some(5))).toDF("A","X","Y", "Z")
I want the output in the form of below table
+-----------+---------+---------+
| Cols_name | Col_cnt | Col_wts |
+-----------+---------+---------+
| Col1 | 3 | 0.5 |
| Col2 | 4 | 0.4 |
| Col3 | 5 | 0.6 |
+-----------+---------+---------+
Here's a general approach for transposing a DataFrame:
1. For each of the pivot columns (say c1, c2, c3), combine the column name and associated value columns into a struct (e.g. struct(lit(c1), c1_cnt, c1_wts))
2. Put all these struct-typed columns into an array, which is then exploded into rows of struct columns
3. Group by the pivot column name to aggregate the associated struct elements
The following sample code has been generalized to handle an arbitrary list of columns to be transposed:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")
val pivotCols = Seq("c1", "c2", "c3")
val valueColSfx = Seq("_cnt", "_wts")
val arrStructs = pivotCols.map{ c => struct(
Seq(lit(c).as("_pvt")) ++
valueColSfx.map((c, _)).map{ case (p, s) => col(p + s).as(s) }: _*
).as(c + "_struct")
}
val valueColAgg = valueColSfx.map(s => first($"struct_col.$s").as(s + "_first"))
df.
select(array(arrStructs: _*).as("arr_structs")).
withColumn("struct_col", explode($"arr_structs")).
groupBy($"struct_col._pvt").agg(valueColAgg.head, valueColAgg.tail: _*).
show
// +----+----------+----------+
// |_pvt|_cnt_first|_wts_first|
// +----+----------+----------+
// | c1| 3| 0.5|
// | c3| 5| 0.6|
// | c2| 4| 0.4|
// +----+----------+----------+
Note that function first is used in the above example, but it could be any other aggregate function (e.g. avg, max, collect_list) depending on the specific business requirement.
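For comparison, here is a sketch of a different technique from the struct/array/explode approach above: the SQL stack generator, which emits one row per pivot column directly. It assumes the same c1/c2/c3 column naming as the sample code and that the _cnt and _wts columns each share one type. Since the sample rows all carry identical counts and weights, distinct() stands in for the aggregation step.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("unpivot-stack").getOrCreate()
import spark.implicits._

val df = Seq(
  ("AAA", "VVVV", "SSSS", 3, 4, 5, 0.5, 0.4, 0.6),
  ("BBB", "BBBB", "TTTT", 3, 4, 5, 0.5, 0.4, 0.6),
  ("CCC", "DDDD", "YYYY", 3, 4, 5, 0.5, 0.4, 0.6)
).toDF("c1", "c2", "c3", "c1_cnt", "c2_cnt", "c3_cnt", "c1_wts", "c2_wts", "c3_wts")

val pivotCols = Seq("c1", "c2", "c3")

// Builds: 'c1', c1_cnt, c1_wts, 'c2', c2_cnt, c2_wts, 'c3', c3_cnt, c3_wts
val stackArgs = pivotCols.map(c => s"'$c', ${c}_cnt, ${c}_wts").mkString(", ")

// stack(n, ...) splits its arguments into n rows of (name, cnt, wts).
val result = df
  .selectExpr(s"stack(${pivotCols.size}, $stackArgs) as (Cols_name, Col_cnt, Col_wts)")
  .distinct()
result.show()
```

If the values differ between input rows, replace distinct() with a groupBy on Cols_name and whichever aggregate the requirement calls for.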

Extracting array index in Spark Dataframe

I have a Dataframe with a Column of Array Type
For example :
val df = List(("a", Array(1d,2d,3d)), ("b", Array(4d,5d,6d))).toDF("ID", "DATA")
df: org.apache.spark.sql.DataFrame = [ID: string, DATA: array<double>]
scala> df.show
+---+---------------+
| ID| DATA|
+---+---------------+
| a|[1.0, 2.0, 3.0]|
| b|[4.0, 5.0, 6.0]|
+---+---------------+
I wish to explode the array and have index like
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
|  a|         1| 1.0|
|  a|         2| 2.0|
|  a|         3| 3.0|
|  b|         1| 4.0|
|  b|         2| 5.0|
|  b|         3| 6.0|
+---+----------+----+
I would like to be able to do that with Scala, and with sparklyr or SparkR.
I'm using Spark 1.6.
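There is a posexplode function available in Spark's functions object:
import org.apache.spark.sql.functions._
// select takes either all Strings or all Columns, so ID is passed as a Column too
df.select($"ID", posexplode($"DATA"))
PS: posexplode is only available in the DataFrame functions API from Spark 2.1.0 onward. The generated position column is 0-based.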
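To get the column names and the 1-based index requested in the question, the generator output can be aliased and shifted. This is a sketch assuming Spark 2.1 or later; the local SparkSession is included only so the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.posexplode

val spark = SparkSession.builder().master("local[*]").appName("posexplode-index").getOrCreate()
import spark.implicits._

val df = List(("a", Array(1d, 2d, 3d)), ("b", Array(4d, 5d, 6d))).toDF("ID", "DATA")

val result = df
  // posexplode yields two columns (position, value); alias both at once.
  .select($"ID", posexplode($"DATA").as(Seq("DATA_INDEX", "DATA")))
  // Positions start at 0, so shift by one for a 1-based index.
  .withColumn("DATA_INDEX", $"DATA_INDEX" + 1)
result.show()
```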
With Spark 1.6, you can register your dataframe as a temporary table and then run Hive QL over it to get the desired result.
df.registerTempTable("tab")
sqlContext.sql("""
select
ID, exploded.DATA_INDEX + 1 as DATA_INDEX, exploded.DATA
from
tab
lateral view posexplode(tab.DATA) exploded as DATA_INDEX, DATA
""").show
+---+----------+----+
| ID|DATA_INDEX|DATA|
+---+----------+----+
| a| 1| 1.0|
| a| 2| 2.0|
| a| 3| 3.0|
| b| 1| 4.0|
| b| 2| 5.0|
| b| 3| 6.0|
+---+----------+----+

Read value from table and apply condition in Spark

I have dataframe: df1
+------+--------+--------+--------+
| Name | value1 | value2 | value3 |
+------+--------+--------+--------+
| A | 100 | null | 200 |
| B | 10000 | 300 | 10 |
| c | null | 10 | 100 |
+------+--------+--------+--------+
second dataframe: df2:
+------+------+
| Col1 | col2 |
+------+------+
| X | 1000 |
| Y | 2002 |
| Z | 3000 |
+------+------+
I want to read the values value1, value2 and value3 from table 1 and apply conditions to table 2, adding new columns:
cond1: when Name = A and col2 > value1, flag Y, otherwise N
cond2: when Name = B and col2 > value2, flag Y, otherwise N
cond3: when Name = c and col2 > value1 and col2 > value3, flag Y, otherwise N
Source code:
// value1, value2 and value3 stand for the values read from df1
df2.withColumn("cond1", when($"col2" > value1, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond2", when($"col2" > value2, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond3", when($"col2" > value1 && $"col2" > value3, lit("Y")).otherwise(lit("N")))
output:
+------+------+-------+-------+-------+
| Col1 | col2 | cond1 | cond2 | cond3 |
+------+------+-------+-------+-------+
| X | 1000 | Y | Y | Y |
| Y | 2002 | N | Y | Y |
| Z | 3000 | Y | Y | Y |
+------+------+-------+-------+-------+
If I understand your question correctly, you can join the two dataframes and create the condition columns as shown below. A couple of notes:
1) Given the described conditions, null in df1 is replaced with Int.MinValue to simplify the integer comparison
2) Since df1 is small, a broadcast join is used to minimize sorting/shuffling for better performance
val df1 = Seq(
("A", 100, Int.MinValue, 200),
("B", 10000, 300, 10),
("C", Int.MinValue, 10, 100)
).toDF("Name", "value1", "value2", "value3")
val df2 = Seq(
("A", 1000),
("B", 2002),
("C", 3000),
("A", 5000),
("A", 150),
("B", 250),
("B", 12000),
("C", 50)
).toDF("Col1", "col2")
val df3 = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select(
df2("Col1"),
df2("col2"),
when(df2("col2") > df1("value1"), "Y").otherwise("N").as("cond1"),
when(df2("col2") > df1("value2"), "Y").otherwise("N").as("cond2"),
when(df2("col2") > df1("value1") && df2("col2") > df1("value3"), "Y").otherwise("N").as("cond3")
)
df3.show
+----+-----+-----+-----+-----+
|Col1| col2|cond1|cond2|cond3|
+----+-----+-----+-----+-----+
| A| 1000| Y| Y| Y|
| B| 2002| N| Y| N|
| C| 3000| Y| Y| Y|
| A| 5000| Y| Y| Y|
| A| 150| Y| Y| N|
| B| 250| N| N| N|
| B|12000| Y| Y| Y|
| C| 50| Y| Y| N|
+----+-----+-----+-----+-----+
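If the nulls in df1 have to stay as real nulls rather than being replaced up front, the same comparison can be expressed with coalesce at query time. The following is a sketch only: the helper name threshold is hypothetical, and it assumes that a missing threshold should count as "no lower bound", which matches the Int.MinValue substitution described in note 1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, coalesce, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("null-thresholds").getOrCreate()
import spark.implicits._

// Option[Int] keeps the missing thresholds as SQL nulls.
val df1 = Seq[(String, Option[Int], Option[Int], Option[Int])](
  ("A", Some(100), None, Some(200)),
  ("B", Some(10000), Some(300), Some(10)),
  ("C", None, Some(10), Some(100))
).toDF("Name", "value1", "value2", "value3")

val df2 = Seq(("A", 150), ("B", 250), ("C", 50)).toDF("Col1", "col2")

// Hypothetical helper: a null threshold is treated as Int.MinValue.
def threshold(name: String) = coalesce(df1(name), lit(Int.MinValue))

val result = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select(
  df2("Col1"),
  df2("col2"),
  when(df2("col2") > threshold("value1"), "Y").otherwise("N").as("cond1"),
  when(df2("col2") > threshold("value2"), "Y").otherwise("N").as("cond2"),
  when(df2("col2") > threshold("value1") && df2("col2") > threshold("value3"), "Y")
    .otherwise("N").as("cond3")
)
result.show()
```

Without the coalesce, a comparison against null evaluates to null and the row falls through to the otherwise branch.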
You can create rowNo column in both dataframes as below
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf1 = df1.withColumn("rowNo", row_number().over(Window.orderBy("Name")))
val tempdf2 = df2.withColumn("rowNo", row_number().over(Window.orderBy("Col1")))
Then you can join them with the created column as below
val joinedDF = tempdf2.join(tempdf1, Seq("rowNo"), "left")
Finally you can use select and when function to get the final dataframe
joinedDF.select($"Col1",
$"col2",
when($"col2">$"value1" || $"value1".isNull, "Y").otherwise("N").as("cond1"),
when($"col2">$"value2" || $"value2".isNull, "Y").otherwise("N").as("cond2"),
when(($"col2">$"value1" && $"col2">$"value3") || $"value3".isNull, "Y").otherwise("N").as("cond3"))
You should then have your desired dataframe:
+----+----+-----+-----+-----+
|Col1|col2|cond1|cond2|cond3|
+----+----+-----+-----+-----+
|X |1000|Y |Y |Y |
|Y |2002|N |Y |Y |
|Z |3000|Y |Y |Y |
+----+----+-----+-----+-----+
I hope the answer is helpful