spark scala conditional join replace null values

I have two dataframes. I want to replace the null values in col1 of df1 with the values from col1 of df2. Please keep in mind that df1 can have more than 10^6 rows (and so can df2), and that df1 has some additional columns that differ from the additional columns of df2.
I know how to do a plain join, but I do not know how to do this kind of conditional join in Spark with Scala.
df1
name | col1 | col2 | col3
----------------------------
foo | 0.1 | ...
bar | null |
hello | 0.6 |
foobar | null |
df2
name | col1 | col7
--------------------
lorem | 0.1 |
bar | 0.52 |
foobar | 0.47 |
EDIT:
This is my current solution:
df1.select("name", "col2", "col3").join(df2, (df1("name") === df2("name")), "left").select(df1("name"), col("col1"))
EDIT2:
val df1 = Seq(
  ("foo", Seq(0.1), 10, "a"),
  ("bar", Seq(), 20, "b"),
  ("hello", Seq(0.1), 30, "c"),
  ("foobar", Seq(), 40, "d")
).toDF("name", "col1", "col2", "col3")

val df2 = Seq(
  ("lorem", Seq(0.1), "x"),
  ("bar", Seq(0.52), "y"),
  ("foobar", Seq(0.47), "z")
).toDF("name", "col1", "col7")

display(df1.
  join(df2, Seq("name"), "left_outer").
  select(df1("name"), coalesce(df1("col1"), df2("col1")).as("col1")))
returns:
name | col1
bar | []
foo | [0.1]
foobar | []
hello | [0.1]

Consider using coalesce on col1 after performing the left join. To handle both nulls and empty arrays (in the case of ArrayType), as per the revised requirement in the comments section, a when/otherwise clause is used, as shown below:
val df1 = Seq(
  ("foo", Some(Seq(0.1)), 10, "a"),
  ("bar", None, 20, "b"),
  ("hello", Some(Seq(0.1)), 30, "c"),
  ("foobar", Some(Seq()), 40, "d")
).toDF("name", "col1", "col2", "col3")

val df2 = Seq(
  ("lorem", Seq(0.1), "x"),
  ("bar", Seq(0.52), "y"),
  ("foobar", Seq(0.47), "z")
).toDF("name", "col1", "col7")

df1.
  join(df2, Seq("name"), "left_outer").
  select(
    df1("name"),
    coalesce(
      when(lit(df1.schema("col1").dataType.typeName) === "array" && size(df1("col1")) === 0, df2("col1")).otherwise(df1("col1")),
      df2("col1")
    ).as("col1")
  ).
  show
/*
+------+------+
| name| col1|
+------+------+
| foo| [0.1]|
| bar|[0.52]|
| hello| [0.1]|
|foobar|[0.47]|
+------+------+
*/
UPDATE:
It appears that Spark, surprisingly, does not short-circuit conditionA && conditionB the way most other languages do: even when conditionA is false, conditionB will still be evaluated, and replacing && with nested when/otherwise does not resolve the issue either. This might be due to limitations in how the internally translated case/when/else SQL is executed.
As a result, the above when/otherwise data-type check fails on the array-specific function size() when col1 is not an ArrayType. Given that, I would forgo the dynamic column type check and run different queries depending on whether col1 is an ArrayType, assuming that is known upfront:
df1.
  join(df2, Seq("name"), "left_outer").
  select(
    df1("name"),
    coalesce(
      when(size(df1("col1")) === 0, df2("col1")).otherwise(df1("col1")), // <-- if col1 is an array
      // df1("col1"), // <-- if col1 is not an array
      df2("col1")
    ).as("col1")
  ).
  show
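Alternatively (not part of the original answer, just a sketch of the same idea): since col1's data type is already known on the driver through the schema, the ArrayType branch can be chosen in plain Scala instead of inside the Column expression, which sidesteps the non-short-circuiting && problem entirely:
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.functions.{coalesce, size, when}

// Pick the replacement expression up front, based on col1's actual data type.
val col1Fixed = df1.schema("col1").dataType match {
  case _: ArrayType =>
    // treat empty arrays like nulls
    coalesce(when(size(df1("col1")) === 0, df2("col1")).otherwise(df1("col1")), df2("col1"))
  case _ =>
    // plain null handling is enough for non-array types
    coalesce(df1("col1"), df2("col1"))
}

df1.
  join(df2, Seq("name"), "left_outer").
  select(df1("name"), col1Fixed.as("col1")).
  show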

Related

How to Filter a List in Spark with another column of the same dataframe (Version 2.2)

I have a requirement to filter a List using another column in the same dataframe.
Below is my DataFrame. Here, I want to filter the col3 list against col1 and keep only the active children for each parent.
Df.show(10,false):
=============================
Col1   Col2     col3            flag
P1     Parent   [c1,c2,c3,c4]   Active
c1     Child    []              InActive
c2     Child    []              Active
c3     Child    []              Active
Expected Output:
===================
Df.show(10,false):
Col1   Col2     col3       flag
P1     Parent   [c2,c3]    Active
c2     Child    []         Active
c3     Child    []         Active
Can someone help me get the above result?
I generated your dataframe like this:
val df = Seq(("p1", "Parent", Seq("c1", "c2", "c3", "c4"), "Active"),
("c1", "Child", Seq(), "Inactive"),
("c2", "Child", Seq(), "Active"),
("c3", "Child", Seq(), "Active"))
.toDF("Col1", "Col2", "col3", "flag")
Then I filter only the active children in one dataframe which is one part of your output:
val active_children = df.where('flag === "Active").where('Col2 === "Child")
I also generate a flattened dataframe of parent/child relationships with explode:
val rels = df.withColumn("child", explode('col3))
.select("Col1", "Col2", "flag", "child")
scala> rels.show
+----+------+------+-----+
|Col1| Col2| flag|child|
+----+------+------+-----+
| p1|Parent|Active| c1|
| p1|Parent|Active| c2|
| p1|Parent|Active| c3|
| p1|Parent|Active| c4|
+----+------+------+-----+
and a dataframe with only one column corresponding to active children like this:
val child_filter = active_children.select('Col1 as "child")
Then I use this child_filter dataframe to filter (with a join) the parents you are interested in, and a groupBy to aggregate the lines back into your output format:
val parents = rels
.join(child_filter, "child")
.groupBy("Col1")
.agg(first('Col2) as "Col2",
collect_list('child) as "col3",
first('flag) as "flag")
scala> parents.show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
+----+------+--------+------+
Finally, a union yields the expected output:
scala> parents.union(active_children).show
+----+------+--------+------+
|Col1| Col2| col3| flag|
+----+------+--------+------+
| p1|Parent|[c2, c3]|Active|
| c2| Child| []|Active|
| c3| Child| []|Active|
+----+------+--------+------+
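A side note on the join above, as a sketch rather than part of the original answer: child_filter is typically much smaller than the exploded rels dataframe, so a broadcast hint can avoid shuffling the larger side:
import org.apache.spark.sql.functions.{broadcast, collect_list, first}

val parentsBcast = rels
  .join(broadcast(child_filter), "child")   // hint: ship the small filter to every executor
  .groupBy("Col1")
  .agg(first('Col2) as "Col2",
       collect_list('child) as "col3",
       first('flag) as "flag")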

Finding size of distinct array column

I am using Scala and Spark to create a dataframe. Here's my code so far:
val df = transformedFlattenDF
.groupBy($"market", $"city", $"carrier").agg(count("*").alias("count"), min($"bandwidth").alias("bandwidth"), first($"network").alias("network"), concat_ws(",", collect_list($"carrierCode")).alias("carrierCode")).withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>")).withColumn("Carrier Count", collect_set("carrierCode"))
The column carrierCode becomes an array column. The data is present as follows:
CarrierCode
1: [12,2,12]
2: [5,2,8]
3: [1,1,3]
I'd like to create a column that counts the number of distinct values in each array. I tried doing collect_set, however, it gives me an error saying "grouping expressions sequence is empty". Is it possible to find the number of distinct values in each row's array? That way, in the same example, there could be a column like so:
Carrier Count
1: 2
2: 3
3: 2
collect_set is an aggregate function and hence should be applied within your groupBy-agg step:
val df = transformedFlattenDF.groupBy($"market", $"city", $"carrier").agg(
count("*").alias("count"), min($"bandwidth").alias("bandwidth"),
first($"network").alias("network"),
concat_ws(",", collect_list($"carrierCode")).alias("carrierCode"),
size(collect_set($"carrierCode")).as("carrier_count") // <-- ADDED `collect_set`
).
withColumn("carrierCode", split(($"carrierCode"), ",").cast("array<string>"))
If you don't want to change the existing groupBy-agg code, you can create a UDF like in the following example:
import org.apache.spark.sql.functions._
val codeDF = Seq(
Array("12", "2", "12"),
Array("5", "2", "8"),
Array("1", "1", "3")
).toDF("carrier_code")
def distinctElemCount = udf( (a: Seq[String]) => a.toSet.size )
codeDF.withColumn("carrier_count", distinctElemCount($"carrier_code")).
show
// +------------+-------------+
// |carrier_code|carrier_count|
// +------------+-------------+
// | [12, 2, 12]| 2|
// | [5, 2, 8]| 3|
// | [1, 1, 3]| 2|
// +------------+-------------+
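If you are on Spark 2.4 or later, the built-in array_distinct function makes the UDF unnecessary; a sketch against the same codeDF:
import org.apache.spark.sql.functions.{array_distinct, size}

// count distinct elements per row without a UDF
codeDF.withColumn("carrier_count", size(array_distinct($"carrier_code"))).show
// yields the same counts as the UDF above: 2, 3, 2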
For posterity, here is an approach without a UDF, converting to an RDD and back to a DF:
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", 2, 100, 2), ("F", 7, 100, 1), ("B", 10, 100, 100)
)).toDF("c1", "c2", "c3", "c4")
val x = df.select("c1", "c2", "c3", "c4").rdd.map(x => (x.get(0), List(x.get(1), x.get(2), x.get(3))) )
val y = x.map {case (k, vL) => (k, vL.toSet.size) }
// Manipulate back to your DF, via conversion, join, what not.
y.collect returns:
res15: Array[(Any, Int)] = Array((A,2), (F,3), (B,2))
The solution above is better; this one is included more for posterity.
You can also use a UDF and do it like this:
//Input
df.show
+-----------+
|CarrierCode|
+-----------+
|1:[12,2,12]|
| 2:[5,2,8]|
| 3:[1,1,3]|
+-----------+
//udf
val countUDF = udf { (str: String) =>
  val strArr = str.split(":")
  strArr(0) + ":" + strArr(1).split(",").distinct.length.toString
}
df.withColumn("Carrier Count",countUDF(col("CarrierCode"))).show
//Sample Output:
+-----------+-------------+
|CarrierCode|Carrier Count|
+-----------+-------------+
|1:[12,2,12]| 1:3|
| 2:[5,2,8]| 2:3|
| 3:[1,1,3]| 3:3|
+-----------+-------------+

Row aggregations in Scala

I am looking for a way to get a new column in a data frame in Scala that calculates the min/max of the values in col1, col2, ..., col10 for each row.
I know I can do it with a UDF but maybe there is an easier way.
Thanks!
Porting this Python answer by user6910411:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 3, 0, 9, "a", "b", "c")
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "Col7")
val cols = Seq("col1", "col2", "col3", "col4")
val rowMax = greatest(
cols map col: _*
).alias("max")
val rowMin = least(
cols map col: _*
).alias("min")
df.select($"*", rowMin, rowMax).show
// +----+----+----+----+----+----+----+---+---+
// |col1|col2|col3|col4|col5|col6|Col7|min|max|
// +----+----+----+----+----+----+----+---+---+
// | 1| 3| 0| 9| a| b| c|0.0|9.0|
// +----+----+----+----+----+----+----+---+---+
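Since the question mentions col1 through col10, the column list can also be built programmatically instead of being typed out; a sketch that assumes the columns really are named col1 .. col10:
import org.apache.spark.sql.functions.{col, greatest, least}

val valueCols = (1 to 10).map(i => s"col$i")   // "col1" .. "col10"; adjust to your schema
val rowMax = greatest(valueCols map col: _*).alias("max")
val rowMin = least(valueCols map col: _*).alias("min")

df.select($"*", rowMin, rowMax)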

Merging the results of a scala spark dataframe as an array of results in another dataframe's column

Is there a way to take the following two dataframes and join them by the col0 field producing the output below?
//dataframe1
val df1 = Seq(
(1, 9, 100.1, 10)
).toDF("pk", "col0", "col1", "col2")
//dataframe2
val df2 = Seq(
(1, 9 "a1", "b1"),
(2, 9 "a2", "b2")
).toDF("pk", "col0", "str_col1", "str_col2")
//expected dataframe result
+---+-----+----+---------------------------+
| pk| col1|col2| new_arr_col |
+---+-----+----+---------------------------+
| 1|100.1| 10|[[1,9,a1, b1],[2,9,a2, b2]]|
+---+-----+----+---------------------------+
import org.apache.spark.sql.functions._
import spark.implicits._
// creating new array column out of all df2 columns:
val df2AsArray = df2.select($"col0", array(df2.columns.map(col): _*) as "new_arr_col")
val result = df1.join(df2AsArray, "col0")
.groupBy(df1.columns.map(col): _*) // grouping by all df1 columns
.agg(collect_list("new_arr_col") as "new_arr_col") // collecting array of arrays
.drop("col0")
result.show(false)
// +---+-----+----+--------------------------------------------------------+
// |pk |col1 |col2|new_arr_col |
// +---+-----+----+--------------------------------------------------------+
// |1 |100.1|10 |[WrappedArray(2, 9, a2, b2), WrappedArray(1, 9, a1, b1)]|
// +---+-----+----+--------------------------------------------------------+
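One design note, with a sketch that is not part of the original answer: array needs a single element type, so the mixed Int/String columns of df2 get coerced to a common type (strings here) inside new_arr_col. Using struct instead preserves each column's original type:
import org.apache.spark.sql.functions.{col, collect_list, struct}

// same join/groupBy shape as above, but the nested elements are structs, not arrays
val df2AsStruct = df2.select($"col0", struct(df2.columns.map(col): _*) as "new_struct_col")

val resultStruct = df1.join(df2AsStruct, "col0")
  .groupBy(df1.columns.map(col): _*)          // grouping by all df1 columns, as above
  .agg(collect_list("new_struct_col") as "new_arr_col")
  .drop("col0")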

Fill scala column with nulls

I am getting the error Caused by: scala.MatchError: Null (of class scala.reflect.internal.Types$ClassNoArgsTypeRef) when I try to fill a DataFrame with null values to replace other values in it. How can I do this using Scala Spark 2.1?
You can use isin and when. Required imports:
import org.apache.spark.sql.functions.when
Example data:
val toReplace = Seq("foo", "bar")
val df = Seq((1, "Jane"), (2, "foo"), (3, "John"), (4, "bar")).toDF("id", "name")
Query:
df.withColumn("name", when(!$"name".isin(toReplace: _*), $"name")).
and the result:
+---+----+
| id|name|
+---+----+
| 1|Jane|
| 2|null|
| 3|John|
| 4|null|
+---+----+
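For completeness: leaving out the otherwise branch is what sidesteps the original scala.MatchError, because Spark never has to infer a column type for Scala's Null; unmatched rows simply default to null. If you prefer an explicit null, it has to be a cast literal, roughly like this sketch:
import org.apache.spark.sql.functions.{lit, when}

// explicit-null variant: the null literal is cast to a concrete column type
df.withColumn("name",
  when($"name".isin(toReplace: _*), lit(null).cast("string")).otherwise($"name")
).show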