An Apache Spark Join including null keys - scala

My objective is to join two dataframes and keep information from both, even though I can have nulls in my join keys. These are my two dataframes:
val data1 = Seq(
(601, null, null, "8121000868-10", "CN88"),
(3925, null, null, "8121000936-50", "CN88")
)
val df1 = data1.toDF("id", "work_order_number", "work_order_item_number", "tally_number", "company_code")
val data2 = Seq(
(null, null, "8121000868-10", "CN88", "popo"),
(null, null, "8121000936-50", "CN88", "Smith")
)
val df2 = data2.toDF("work_order_number", "work_order_item_number", "tally_number", "company_code", "name")
My objective is to get the "id" from df1, rename it as "tally_summary_id" and be able to re-attach some other information to every single id. This is my code:
val final_df =
df1.select(col("id").alias("tally_summary_id"), col("work_order_number"), col("work_order_item_number"),
col("tally_number"), col("company_code"))
.join(df2, Seq("tally_number", "work_order_number", "work_order_item_number", "company_code"), "full")
A left join gives me:
+-------------+-----------------+----------------------+------------+----------------+----+
| tally_number|work_order_number|work_order_item_number|company_code|tally_summary_id|name|
+-------------+-----------------+----------------------+------------+----------------+----+
|8121000868-10| null| null| CN88| 601|null|
|8121000936-50| null| null| CN88| 3925|null|
+-------------+-----------------+----------------------+------------+----------------+----+
A right join gives me:
+-------------+-----------------+----------------------+------------+----------------+-----+
| tally_number|work_order_number|work_order_item_number|company_code|tally_summary_id| name|
+-------------+-----------------+----------------------+------------+----------------+-----+
|8121000868-10| null| null| CN88| null| popo|
|8121000936-50| null| null| CN88| null|Smith|
+-------------+-----------------+----------------------+------------+----------------+-----+
A full join gives me:
+-------------+-----------------+----------------------+------------+----------------+-----+
| tally_number|work_order_number|work_order_item_number|company_code|tally_summary_id| name|
+-------------+-----------------+----------------------+------------+----------------+-----+
|8121000868-10| null| null| CN88| 601| null|
|8121000868-10| null| null| CN88| null| popo|
|8121000936-50| null| null| CN88| 3925| null|
|8121000936-50| null| null| CN88| null|Smith|
+-------------+-----------------+----------------------+------------+----------------+-----+
What can I do to have something like this:
+-------------+-----------------+----------------------+------------+----------------+-----+
| tally_number|work_order_number|work_order_item_number|company_code|tally_summary_id| name|
+-------------+-----------------+----------------------+------------+----------------+-----+
|8121000868-10| null| null| CN88| 601|popo |
|8121000936-50| null| null| CN88| 3925|Smith|
+-------------+-----------------+----------------------+------------+----------------+-----+

You can use the <=> equality operator, which is null-safe, as shown here.
I added a schema to the dataframe creation because without it the automatic schema inference didn't assign a type to the columns containing only nulls, and the join failed.
The resulting dataframe is exactly the one you wanted:
import scala.collection.JavaConversions._ // implicit Seq[Row] -> java.util.List[Row] for createDataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val data1 = Seq(
Row(601, null, null, "8121000868-10", "CN88"),
Row(3925, null, null, "8121000936-50", "CN88")
)
val schema1 = StructType(List(
StructField("id", IntegerType, false),
StructField("work_order_number", StringType, true),
StructField("work_order_item_number", StringType, true),
StructField("tally_number", StringType, true),
StructField("company_code", StringType, true)
))
val df1 = sparkSession.createDataFrame(data1, schema1)
val data2 = Seq(
Row(null, null, "8121000868-10", "CN88", "popo"),
Row(null, null, "8121000936-50", "CN88", "Smith")
)
val schema2 = StructType(Seq(
StructField("work_order_number", StringType, true),
StructField("work_order_item_number", StringType, true),
StructField("tally_number", StringType, true),
StructField("company_code", StringType, true),
StructField("name", StringType, false)
))
val df2 = sparkSession.createDataFrame(data2, schema2)
val final_df =
df1.join(df2, df1("tally_number") <=> df2("tally_number")
&& df1("work_order_number") <=> df2("work_order_number")
&& df1("work_order_item_number") <=> df2("work_order_item_number")
&& df1("company_code") <=> df2("company_code")
, "inner")
.select(df1("tally_number"),
df1("work_order_number"),
df1("work_order_item_number"),
df1("company_code"),
df1("id").as("tally_summary_id"),
df2("name"))

Related

How can I convert a nested json to a specific dataframe

I have a json which looks like this:
"A":{"B":1,"C":[{"D":2,"E":3},{"D":6,"E":7}]}
I would like to create a dataframe based on this with two rows.
+---+---+---+
|  B|  D|  E|
+---+---+---+
|  1|  2|  3|
|  1|  6|  7|
+---+---+---+
In an imperative language, I would use a for loop. However, I saw that this is not recommended with Scala and therefore I am wondering how I can explode this inner JSON and use the entries in new rows.
Thank you very much
You can use from_json to parse the JSON string values:
val df = Seq(
""""A":{"B":1,"C":[{"D":2,"E":3},{"D":6,"E":7}]}"""
).toDF("col")
val df1 = df.withColumn(
"col",
from_json(
regexp_extract(col("col"), "\"A\":(.*)", 1), // extract the part after "A":
lit("struct<B:int,C:array<struct<D:int,E:int>>>")
)
).select(col("col.B"), expr("inline(col.C)"))
df1.show
//+---+---+---+
//| B| D| E|
//+---+---+---+
//| 1| 2| 3|
//| 1| 6| 7|
//+---+---+---+
You can also pass the schema as a StructType to the from_json function, which you define like this:
val schema = StructType(Array(
StructField("B", IntegerType, true),
StructField("C", ArrayType(StructType(Array(
StructField("D", IntegerType, true),
StructField("E", IntegerType, true)
))))
)
)
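That StructType schema can then be passed directly to from_json in place of the DDL string; a minimal sketch (not part of the original answer) reusing the same regexp_extract step as above:
val parsed = df.withColumn(
  "col",
  from_json(regexp_extract(col("col"), "\"A\":(.*)", 1), schema) // same extraction, StructType schema
).select(col("col.B"), expr("inline(col.C)"))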

agg function to transform multiple rows into multiple columns with different types

I want to transform the values of multiple rows with the same id into columns, but each column has a different type.
Input data:
val dataInput = List(
Row( "meta001","duration", 2 , null, null),
Row("meta001","price", 300 , null , null),
Row("meta001","name", null , null , "name"),
Row("meta001","exist", null , true , null),
Row("meta002","price", 400 , null, null),
Row("meta002","duration", 3 , null, null)
)
val schemaInput = new StructType()
.add("id",StringType,true)
.add("code",StringType,true)
.add("integer value",IntegerType,true)
.add("boolean value",BooleanType,true)
.add("string value",StringType,true)
var dfInput = spark.createDataFrame(
spark.sparkContext.parallelize(dataInput),
schemaInput
)
+-------+--------+-------------+-------------+------------+
| id| code|integer value|boolean value|string value|
+-------+--------+-------------+-------------+------------+
|meta001|duration| 2| null| null|
|meta001| price| 300| null| null|
|meta001| name| null| null| name|
|meta001| exist| null| true| null|
|meta002| price| 400| null| null|
|meta002|duration| 3| null| null|
+-------+--------+-------------+-------------+------------+
Expected output:
+-------+--------+-------------+-------------+------------+
| id|duration|price |name |exist |
+-------+--------+-------------+-------------+------------+
|meta001| 2| 300| name| true|
|meta002| 3| 400| null| null|
+-------+--------+-------------+-------------+------------+
I think I should use the groupBy and pivot functions, but I am a little bit lost about what I should put in agg:
dfInput.groupBy("id").pivot("code", Seq("duration","price","name","exist")).agg(???)
You don't need pivot here, just combine first with a when:
dfInput
.groupBy($"id")
.agg(
first(when($"code" === "duration", $"integer value"), ignoreNulls = true).as("duration"),
first(when($"code" === "price", $"integer value"), ignoreNulls = true).as("price"),
first(when($"code" === "name", $"string value"), ignoreNulls = true).as("name"),
first(when($"code" === "exist", $"boolean value"), ignoreNulls = true).as("exist")
)
.show()
gives
+-------+--------+-----+----+-----+
| id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001| 2| 300|name| true|
|meta002| 3| 400|null| null|
+-------+--------+-----+----+-----+
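If the list of codes grows, the same first/when aggregations can also be generated from a code-to-source-column mapping instead of being written out by hand; a sketch of that idea (the mapping below is an assumption based on the sample data, not part of the original answer):
val codeToColumn = Seq(
  "duration" -> "integer value",
  "price"    -> "integer value",
  "name"     -> "string value",
  "exist"    -> "boolean value"
)
// One first(when(...)) aggregation per code, aliased to the code name
val aggs = codeToColumn.map { case (code, src) =>
  first(when($"code" === code, col(src)), ignoreNulls = true).as(code)
}
dfInput.groupBy($"id").agg(aggs.head, aggs.tail: _*).show()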
You can use the coalesce function, but all columns should then be the same type, e.g. "string".
df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
.agg(first(coalesce($"integer value".cast("string"), $"boolean value".cast("string"), $"string value".cast("string"))))
.show()
+-------+--------+-----+----+-----+
| id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001| 2| 300|name| true|
|meta002| 3| 400|null| null|
+-------+--------+-----+----+-----+
If you want to preserve the data types, the function becomes more complex and might need conditional statements. Or just pivot with all the aggregations and select the columns you need.
df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
.agg(first($"integer value").as("int"), first($"boolean value").as("bool"), first($"string value").as("string"))
.select("id", "duration_int", "price_int", "name_string", "exist_bool")
.show()
+-------+------------+---------+-----------+----------+
| id|duration_int|price_int|name_string|exist_bool|
+-------+------------+---------+-----------+----------+
|meta001| 2| 300| name| true|
|meta002| 3| 400| null| null|
+-------+------------+---------+-----------+----------+
root
|-- id: string (nullable = true)
|-- duration_int: integer (nullable = true)
|-- price_int: integer (nullable = true)
|-- name_string: string (nullable = true)
|-- exist_bool: boolean (nullable = true)

Fill a DataFrame with values given in a Map

My objective is to create a function that, taking a Map and a data frame as parameters:
fillNa(columnsToFill, originalDF)
can fill the data frame with the values given in the Map.
I'm working with a DataFrame similar to the one you can see below:
+---------+-------------+----------------+-------------------+
|seller_id| nickname|successful_items|power_seller_status|
+---------+-------------+----------------+-------------------+
|260341211|HEBICOTE62617| 15| null|
|269984665|VACAPERVIAJES| 12| null|
|223499446|GAFAOCOSSR005| 10| gold|
|265004480|NEFCOTEOC8179| null| silver|
|265200651|RUBENTARARIRA| 11| null|
+---------+-------------+----------------+-------------------+
The desired output, therefore, is the following:
+---------+-------------+----------------+-------------------+
|seller_id| nickname|successful_items|power_seller_status|
+---------+-------------+----------------+-------------------+
|260341211|HEBICOTE62617| 15| normal|
|269984665|VACAPERVIAJES| 12| normal|
|223499446|GAFAOCOSSR005| 10| gold|
|265004480|NEFCOTEOC8179| 0| silver|
|265200651|RUBENTARARIRA| 11| normal|
+---------+-------------+----------------+-------------------+
The code that generates the DataFrame is the following:
val someData = Seq(
Row("260341211", "HEBICOTE62617", 15, null),
Row("269984665", "VACAPERVIAJES", 12, null),
Row("223499446", "GAFAOCOSSR005", 10, "gold"),
Row("265004480", "NEFCOTEOC8179", null, "silver"),
Row("265200651", "RUBENTARARIRA", 11, null)
)
val someSchema = List(
StructField("seller_id", StringType, true),
StructField("nickname", StringType, true),
StructField("successful_items", IntegerType, true),
StructField("power_seller_status", StringType, true)
)
val originalDF = spark.createDataFrame(
spark.sparkContext.parallelize(someData),
StructType(someSchema)
)
However, when I tried to create a function that takes a string and fills the values, I couldn't do it for both fields. The best I could do was:
1- Replace only one column
2- Duplicate the rows
The map used as a parameter is the following:
val columnsToFill = Map("power_seller_status" -> "normal",
"successful_items" -> "0")
The functions I've created:
Version 1
def fillNa_version1(replacements: Map[String, String], dataFrame: DataFrame): DataFrame = {
dataFrame.na.fill(replacements.values.head, Seq(replacements.keys.head))
}
Version 2
def fillNa_version2(replacements: Map[String, String], dataFrame: DataFrame)= {
replacements.map{keyVal => dataFrame.na.fill(keyVal._2, Seq(keyVal._1))}.reduce(_.union(_))
}
originalDF.na.fill(columnsToFill).show()
yields:
+---------+-------------+----------------+-------------------+
|seller_id| nickname|successful_items|power_seller_status|
+---------+-------------+----------------+-------------------+
|260341211|HEBICOTE62617| 15| normal|
|269984665|VACAPERVIAJES| 12| normal|
|223499446|GAFAOCOSSR005| 10| gold|
|265004480|NEFCOTEOC8179| 0| silver|
|265200651|RUBENTARARIRA| 11| normal|
+---------+-------------+----------------+-------------------+
which appears to be what you want, no?
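The fillNa function you asked for can therefore simply delegate to na.fill; a minimal sketch (not in the original answer), assuming the Map values are compatible with the column types as shown above:
def fillNa(replacements: Map[String, String], dataFrame: DataFrame): DataFrame =
  dataFrame.na.fill(replacements) // column-name -> replacement-value map, as in the output above
fillNa(columnsToFill, originalDF).show()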
If all you want to do is replace your nulls with some sort of default value, there are much easier ways to do that. You can use withColumn to derive a new column.
originalDF.select(
$"seller_id",
$"nickname",
$"successful_items",
$"power_seller_status").
withColumn("derived_successful_items", when($"successful_items".isNull,"0").otherwise($"successful_items")).
withColumn("derived_power_seller",when ($"power_seller_status".isNull,"normal").otherwise($"power_seller_status")).show
You could also use coalesce (returns the first non-null argument):
withColumn("coalesced_successful_items",coalesce($"successful_items",lit("0")))

Spark Scala row-wise average by handling null

I have a dataframe with a high volume of data and "n" number of columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to compute a row-wise average, keeping in mind not to consider cells with "null" in the average.
This needs to be implemented in Spark / Scala. I've tried to explain this in the attached image.
What I have tried so far:
By referring - Calculate row mean, ignoring NAs in Spark Scala
val df = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(df.map(f => col(f)).reduce(_ + _) / lit(df.length) as "row_mean")
Here df_raw contains the columns that need to be aggregated (row-wise, of course). There are more than 80 columns. They arbitrarily contain data and nulls, and the count of nulls needs to be excluded from the denominator when calculating the average. It works fine when all the columns contain data, but even a single null in a column returns null.
Edit:
I've tried to adapt this answer by Terry Dactyl:
def average(l: Seq[Double]): Option[Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Double]))
val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1",$"col2",$"col3",$"col4",$"col5",$"col6")).as("row_avg"))
rowAvgDF.show(10,false)
rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
+------------------+
|row_avg |
+------------------+
|0.0 |
|15.333333333333334|
|4.708333333333333 |
|0.0 |
|61.58583333333333 |
|4.708333333333333 |
|0.0 |
|23.989166666666666|
|2.065 |
|39.63 |
+------------------+
Spark >= 2.4
It is possible to use aggregate:
val row_mean = expr("""aggregate(
CAST(array(_1, _2, _3) AS array<double>),
-- Initial value
-- Note that aggregate is picky about types
CAST((0.0 as sum, 0.0 as n) AS struct<sum: double, n: double>),
-- Merge function
(acc, x) -> (
acc.sum + coalesce(x, 0.0),
acc.n + CASE WHEN x IS NULL THEN 0.0 ELSE 1.0 END),
-- Finalize function
acc -> acc.sum / acc.n)""")
Usage:
df.withColumn("row_mean", row_mean).show
Result:
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
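For completeness (not part of the original answer): from Spark 3.0 the same accumulator logic can be written with the typed aggregate function in org.apache.spark.sql.functions instead of a SQL expression string; a sketch under that assumption:
import org.apache.spark.sql.functions._
val row_mean_typed = aggregate(
  array($"_1", $"_2", $"_3").cast("array<double>"),
  // Accumulator: running sum and count of non-null values
  struct(lit(0.0).as("sum"), lit(0.0).as("n")),
  (acc, x) => struct(
    (acc("sum") + coalesce(x, lit(0.0))).as("sum"),
    (acc("n") + when(x.isNull, 0.0).otherwise(1.0)).as("n")
  ),
  acc => acc("sum") / acc("n")
)
df.withColumn("row_mean", row_mean_typed).show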
Version independent
Compute the sum and the count of NOT NULL columns and divide one by the other:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def row_mean(cols: Column*) = {
// Sum of values ignoring nulls
val sum = cols
.map(c => coalesce(c, lit(0)))
.foldLeft(lit(0))(_ + _)
// Count of not null values
val cnt = cols
.map(c => when(c.isNull, 0).otherwise(1))
.foldLeft(lit(0))(_ + _)
sum / cnt
}
Example data:
val df = Seq(
(None, None, None),
(Some(2.0), None, None),
(Some(50.0), Some(34.0), None),
(Some(1.0), Some(2.0), Some(3.0))
).toDF
Result:
df.withColumn("row_mean", row_mean($"_1", $"_2", $"_3")).show
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
def average(l: Seq[Integer]): Option[Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Integer]))
val df = List((Some(1),Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
res0: Array[org.apache.spark.sql.Row] = Array([1.5], [1.0], [null])
Testing on the data you supplied gives the correct result:
val df = List(
(Some(10),Some(5), Some(5), None, None),
(None, Some(5), Some(5), None, Some(5)),
(Some(2), Some(8), Some(5), Some(1), Some(2)),
(None, None, None, None, None)
).toDF("col1", "col2", "col3", "col4", "col5")
Array[org.apache.spark.sql.Row] = Array([6.666666666666667], [5.0], [3.6], [null])
Note: if you have columns you do not want included, make sure they are filtered out when populating the array passed to the UDF.
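A sketch of that filtering (assuming, purely for illustration, that the columns to average share a "col" name prefix):
val colsToAverage = df.schema.fieldNames.filter(_.startsWith("col")).map(c => col(c))
val avgDf = df.select(avgUdf(array(colsToAverage: _*)).as("average"))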
Finally:
val df = List(
(Some(14), Some(5), Some(73), None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
Which again is the correct result.
If you want to use Doubles...
def average(l: Seq[java.lang.Double]): Option[java.lang.Double] = {
val nonNull = l.flatMap(i => Option(i))
if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _) / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[java.lang.Double]))
val df = List(
(Some(14.0), Some(5.0), Some(73.0), None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])

How to parallelize processing of a dataframe in Apache Spark with combinations over a column

I'm looking for a solution to build an aggregation over all combinations of a column. For example, I have a data frame like the one below:
val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")
+---+-----+
| id|value|
+---+-----+
| A| 1|
| B| 2|
| C| 3|
| A| 4|
| B| 5|
+---+-----+
And I'm looking for an aggregation over all combinations of the column "id". Below is a solution I found, but it cannot use the parallelism of Spark; it works only on the driver node or on a single executor. Is there a better solution to get rid of the for loop?
import spark.implicits._;
val list =df.select($"id").distinct().orderBy($"id").as[String].collect();
val combinations = (1 to list.length flatMap (x => list.combinations(x))) filter(_.length >1)
val schema = StructType(
StructField("indexvalue", IntegerType, true) ::
StructField("segment", StringType, true) :: Nil)
var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)
for (x <- combinations) {
initialDF = initialDF.union(df.filter($"id".isin(x: _*))
.agg(expr("sum(value)").as("indexvalue"))
.withColumn("segment",lit(x.mkString("+"))))
}
initialDF.show()
+----------+-------+
|indexvalue|segment|
+----------+-------+
| 12| A+B|
| 8| A+C|
| 10| B+C|
| 15| A+B+C|
+----------+-------+
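One possible direction (a hedged sketch, not an answer from the original thread): materialize the combinations once as an (id, segment) lookup table, then a single join plus groupBy lets the executors do the aggregation instead of running one job per combination on the driver:
import org.apache.spark.sql.functions._
import spark.implicits._
val ids = df.select($"id").distinct().as[String].collect().sorted
val combinations = (2 to ids.length).flatMap(n => ids.combinations(n).toSeq)
// One row per (id, segment) pair, e.g. ("A", "A+B"), ("B", "A+B"), ("A", "A+B+C"), ...
val segmentLookup = combinations
  .flatMap(combo => combo.map(id => (id, combo.mkString("+"))))
  .toDF("id", "segment")
df.join(segmentLookup, Seq("id"))
  .groupBy($"segment")
  .agg(sum($"value").as("indexvalue"))
  .show()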