spark dataframe withColumn when condition - scala

My requirement is as below.
I joined two data frames like this:
var c = a.join(b, keys, "fullouter")
c.printSchema() gives:
|-- add: string (nullable = true)
|-- sub: string (nullable = true)
|-- delete: string (nullable = true)
|-- mul: long (nullable = true)
|-- ADD: string (nullable = true)
|-- SUB: string (nullable = true)
|-- DELETE: string (nullable = true)
|-- MUL: long (nullable = true)
It's fine up to this point.
Now I am applying a withColumn with a when condition as below:
val d = c.withColumn("column", when(c("a.add") === c("b.ADD"), "Neardata"))
I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Cannot resolve column name "a.add"
I also tried:
val d = c.withColumn("column", when(col("a.add") === col("b.ADD"), "Neardata"))
Again, I get an error.
Please suggest.

You have to define aliases, with dataframe.as("a") and dataframe1.as("b").
Example :
import org.apache.spark.sql.functions.{col, when}
import spark.sqlContext.implicits._
val data = List(("James","","Smith","36636","M",60000),
  ("Michael","Rose","","40288","M",70000),
  ("Robert","","Williams","42114","",400000),
  ("Maria","Anne","Jones","39192","F",500000),
  ("Jen","Mary","Brown","","F",0))
val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*).as("a")
val df2 = df.withColumn("a.new_gender", when(col("a.gender") === "M","Male")
  .when(col("a.gender") === "F","Female")
  .otherwise("Unknown"))
df2.show()
Output :
+----------+-----------+---------+-----+------+------+------------+
|first_name|middle_name|last_name| dob|gender|salary|a.new_gender|
+----------+-----------+---------+-----+------+------+------------+
| James| | Smith|36636| M| 60000| Male|
| Michael| Rose| |40288| M| 70000| Male|
| Robert| | Williams|42114| |400000| Unknown|
| Maria| Anne| Jones|39192| F|500000| Female|
| Jen| Mary| Brown| | F| 0| Female|
+----------+-----------+---------+-----+------+------+------------+
I think that without the alias you are trying to access the columns like this, which might be the cause:
val df2 = df.withColumn("df.new_gender", when(col("df.gender") === "M","Male")
.when(col("df.gender") === "F","Female")
.otherwise("Unknown")).show

Related

Explode multiple columns of same type with different lengths

I have a Spark data frame with the following format that needs to be exploded. I checked other solutions such as this one. However, in my case, before and after can be arrays of different lengths.
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
For instance, if the data frame has just one row, before is an array of size 2, and after is an array of size 3, the exploded version should have 5 rows with the following schema:
root
|-- id: string (nullable = true)
|-- type: string (nullable = true)
|-- start_time: integer (nullable = false)
|-- end_time: string (nullable = true)
|-- area: string (nullable = true)
where type is a new column which can be "before" or "after".
I can do this with two separate explodes, where I make a type column in each explode and then union them.
val dfSummary1 = df.withColumn("before_exp", explode($"before"))
  .withColumn("type", lit("before"))
  .withColumn("start_time", $"before_exp.start_time")
  .withColumn("end_time", $"before_exp.end_time")
  .withColumn("area", $"before_exp.area")
  .drop("before_exp", "before")

val dfSummary2 = df.withColumn("after_exp", explode($"after"))
  .withColumn("type", lit("after"))
  .withColumn("start_time", $"after_exp.start_time")
  .withColumn("end_time", $"after_exp.end_time")
  .withColumn("area", $"after_exp.area")
  .drop("after_exp", "after")

val dfResult = dfSummary1.unionAll(dfSummary2)
But, I was wondering if there is a more elegant way to do this. Thanks.
You can also achieve this without a union. With the data:
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
you can do
df
.select($"id",
explode(
array(
struct(lit("before").as("type"), $"before".as("data")),
struct(lit("after").as("type"), $"after".as("data"))
)
).as("step1")
)
.select($"id",$"step1.type", explode($"step1.data").as("step2"))
.select($"id",$"type", $"step2.*")
.show()
+---+------+----------+--------+----+
| id| type|start_time|end_time|area|
+---+------+----------+--------+----+
| 1|before| 01:00| 01:30| 10|
| 1|before| 02:00| 02:30| 20|
| 1| after| 07:00| 07:30| 70|
| 1| after| 08:00| 08:30| 80|
| 1| after| 09:00| 09:30| 90|
+---+------+----------+--------+----+
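The key point is that both branches are wrapped in structs with identical field names (type and data), so they can live in one homogeneous array: the first explode produces one row per before/after branch and the second explode flattens that branch's elements, which is what replaces the union.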
I think exploding the two columns separately followed by a union is a decent straightforward approach. You could simplify the StructField-element selection a little and create a simple method for the repetitive explode process, like below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
def explodeCol(df: DataFrame, colName: String): DataFrame = {
  val expColName = colName + "_exp"
  df.withColumn("type", lit(colName))
    .withColumn(expColName, explode(col(colName)))
    .select("id", "type", expColName + ".*")
}
val dfResult = explodeCol(df, "before") union explodeCol(df, "after")
dfResult.show
// +---+------+----------+--------+----+
// | id| type|start_time|end_time|area|
// +---+------+----------+--------+----+
// | 1|before| 01:00| 01:30| 10|
// | 1|before| 02:00| 02:30| 20|
// | 1| after| 07:00| 07:30| 70|
// | 1| after| 08:00| 08:30| 80|
// | 1| after| 09:00| 09:30| 90|
// +---+------+----------+--------+----+

Spark: How to split struct type into multiple columns?

I know this question has been asked many times on Stack Overflow and has been satisfactorily answered in most posts, but I'm not sure if this is the best way in my case.
I have a Dataset that has several struct types embedded in it:
root
|-- STRUCT1: struct (nullable = true)
| |-- FIELD_1: string (nullable = true)
| |-- FIELD_2: long (nullable = true)
| |-- FIELD_3: integer (nullable = true)
|-- STRUCT2: struct (nullable = true)
| |-- FIELD_4: string (nullable = true)
| |-- FIELD_5: long (nullable = true)
| |-- FIELD_6: integer (nullable = true)
|-- STRUCT3: struct (nullable = true)
| |-- FIELD_7: string (nullable = true)
| |-- FIELD_8: long (nullable = true)
| |-- FIELD_9: integer (nullable = true)
|-- ARRAYSTRUCT4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD_10: integer (nullable = true)
| | |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1| STRUCT2 | STRUCT3 | ARRAYSTRUCT4 |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
I want to convert this into:
1. A dataset where the structs are expanded into columns.
2. A data set where the array (ARRAYSTRUCT4) is exploded into rows.
root
|-- FIELD_1: string (nullable = true)
|-- FIELD_2: long (nullable = true)
|-- FIELD_3: integer (nullable = true)
|-- FIELD_4: string (nullable = true)
|-- FIELD_5: long (nullable = true)
|-- FIELD_6: integer (nullable = true)
|-- FIELD_7: string (nullable = true)
|-- FIELD_8: long (nullable = true)
|-- FIELD_9: integer (nullable = true)
|-- FIELD_10: integer (nullable = true)
|-- FIELD_11: integer (nullable = true)
+-------+------------+------------+---------+ ---------+----------+
|FIELD_1| FIELD_2 | FIELD_3 | FIELD_4 | |FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... ---------+----------+
|1 |2 |3 | aa | | 1a | 2b |
+-------+------------+------------+-----------------------------------+
To achieve this, I could use:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "ARRAYSTRUCT4")
followed by an explode:
val exploded = expanded.select(explode(expanded("ARRAYSTRUCT4")))
However, I was wondering if there's a more functional way to do this, especially the select. I could use withColumn as below:
data.withColumn("FIELD_1", $"STRUCT1".getItem(0))
.withColumn("FIELD_2", $"STRUCT1".getItem(1))
.....
But I have 80+ columns. Is there a better way to achieve this?
You can first make all columns struct-type by exploding any Array(struct) columns into struct columns via foldLeft, then use map to interpolate each of the struct column names into col.*, as shown below:
import org.apache.spark.sql.functions._
case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)
val df = Seq(
(S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
(S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// | STRUCT1| STRUCT2| STRUCT3| ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes.filter( t => t._2.startsWith("ArrayType(StructType") ).
  map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)

val expandedDF = arrayCols.foldLeft(df)((accDF, c) =>
  accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)

val structCols = expandedDF.columns

expandedDF.select(structCols.map(c => col(s"$c.*")): _*).
  show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 1| 1|
// | a1| 101| 11| a2| 102| 12| a3| 103| 13| 3| 3|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 2| 2|
// | b1| 201| 21| b2| 202| 22| b3| 203| 23| 4| 4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
Note that for simplicity it's assumed that your DataFrame has only struct and Array(struct)-type columns. If there are other data types, just apply filtering conditions to arrayCols and structCols accordingly.
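As a rough sketch of that filtering (an assumption about how the answer could be extended, reusing expandedDF and the imports above; structOnlyCols and otherCols are hypothetical names):
// Split the remaining top-level columns by type: expand struct columns with "col.*",
// pass everything else through unchanged.
val structOnlyCols = expandedDF.dtypes.filter(_._2.startsWith("StructType")).map(_._1)
val otherCols = expandedDF.dtypes.filterNot(_._2.startsWith("StructType")).map(_._1)
expandedDF.select(otherCols.map(c => col(c)) ++ structOnlyCols.map(c => col(s"$c.*")): _*).show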

How to find the "lowest" element from array<struct>?

I have a dataframe with the following schema:
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
VALUES are like -
[["1","a"],["2","b"],["3","c"],["4","d"]]
[["4","g"]]
[["3","e"],["4","f"]]
I want to take the VALUES element with the lowest integer, i.e. the result df should look like this (the column will be a StructType now, not an Array[Struct]):
["1","a"]
["4","g"]
["3","e"]
Can someone please guide me on how I can approach this problem, perhaps by creating a UDF?
Thanks in advance.
You don't need a UDF for that. Just use sort_array and pick the first element.
df.show
+--------------------+
| data_arr|
+--------------------+
|[[4,a], [2,b], [1...|
| [[1,a]]|
| [[3,b], [1,v]]|
+--------------------+
df.printSchema
root
|-- data_arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = false)
import org.apache.spark.sql.functions.sort_array
df.withColumn("first_asc", sort_array($"data_arr")(0)).show
+--------------------+---------+
| data_arr|first_asc|
+--------------------+---------+
|[[4,a], [2,b], [1...| [1,c]|
| [[1,a]]| [1,a]|
| [[3,b], [1,v]]| [1,v]|
+--------------------+---------+
Using the dataframe from the question:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val findSmallest = udf((rows: Seq[Row]) => {
  rows.map(row => (row.getAs[String](0), row.getAs[String](1))).sorted.head
})
df.withColumn("SMALLEST", findSmallest($"VALUES"))
Will give a result like this:
+---+--------------------+--------+
| ID| VALUES|SMALLEST|
+---+--------------------+--------+
| 1|[[1,a], [2,b], [3...| [1,a]|
| 2| [[4,g]]| [4,g]|
| 3| [[3,e], [4,f]]| [3,e]|
+---+--------------------+--------+
If you only want the final values, use select("SMALLEST").
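For instance, a small usage sketch reusing the findSmallest UDF from above:
// Keep only the computed struct column.
df.withColumn("SMALLEST", findSmallest($"VALUES")).select("SMALLEST").show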

How to transform Spark Dataframe columns to a single column of a string array

I want to know how I can "merge" multiple dataframe columns into one as a string array.
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
| 1|Jack| 125| Text|
| 2|Mary| 152| Text2|
+---+----+------+-------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- Name: string (nullable = true)
|-- Number: string (nullable = true)
|-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id| List|
+---+-----------------+
| 1| [Jack,125,Text]|
| 2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
|-- Id: integer (nullable = false)
|-- List: Array (nullable = true)
| |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List |
// +---+------------------+
// |1 |[Jack, 125, Text] |
// |2 |[Mary, 152, Text2]|
// +---+------------------+
It can also be used with withColumn:
import org.apache.spark.sql.{functions => F}
df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))

spark 2.0 dataframe: collect multiple rows as array by column

I have a dataframe like the one below, and I want to combine multiple rows into one when the key column values are the same.
val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2")
data.show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
| a| b| sum| 0|
| a| b| avg| 2|
+---+---+----+------+
I want to convert it to the following:
+---+---+----+
|id1|id2| agg|
+---+---+----+
| a| b| 0,2|
+---+---+----+
and the printSchema should be like below:
root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = true)
| |-- sum: int (nullable = true)
| |-- avg: int (nullable = true)
You can:
import org.apache.spark.sql.functions._
val data = Seq(
("a","b","sum",0),("a","b","avg",2)
).toDF("id1","id2","type","value2")
val result = data.groupBy($"id1", $"id2").agg(struct(
first(when($"type" === "sum", $"value2"), true).alias("sum"),
first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))
result.show
+---+---+-----+
|id1|id2| agg|
+---+---+-----+
| a| b|[0,2]|
+---+---+-----+
result.printSchema
root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = false)
| |-- sum: integer (nullable = true)
| |-- avg: integer (nullable = true)
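If you later need the aggregated values back as top-level columns, a small follow-up on the result DataFrame from above could be:
// Pull the struct fields out as ordinary columns again.
result.select($"id1", $"id2", $"agg.sum".alias("sum"), $"agg.avg".alias("avg")).show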