Conditional Spark map() function based on input columns - scala

What I'm trying to achieve here is sending to Spark SQL map function conditionally generated columns depending on if they have null, 0 or any other value I may want.
Take for example this initial DF.
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
From that initial DataFrame I want to generate yet another column which will be a map, like this.
initialDF.withColumn("thisMap", MY_FUNCTION)
My current approach to this is basically take a Seq[String] in a method a flatMap the key-value pairs that the Spark SQL method receives, like this.
def toMap(columns: String*): Column = {
map(
columns.flatMap(column => List(lit(column), col(column))): _*
)
}
But then, filtering becomes a Scala thing and is quite a mess.
What I would like to obtain after the processing would be, for each of those rows, the next DataFrame.
val initialDF = Seq(
("a", "b", 1, Map("field1" -> "a", "field2" -> "b", "field3" -> 1)),
("a", "b", null, Map("field1" -> "a", "field2" -> "b")),
("a", null, 0, Map("field1" -> "a"))
)
.toDF("field1", "field2", "field3", "thisMap")
I was wondering if this can be achieved using the Column API which is way more intuitive with .isNull or .equalTo?

Here's a small improvement on Lamanus' answer above which only loops over df.columns once:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
df.withColumn("thisMap", map_concat(
df.columns.map { colName =>
when(col(colName).isNull or col(colName) === 0, map())
.otherwise(map(lit(colName), col(colName)))
}: _*
)).show(false)
// +------+------+------+---------------------------------------+
// |field1|field2|field3|thisMap |
// +------+------+------+---------------------------------------+
// |a |b |1 |[field1 -> a, field2 -> b, field3 -> 1]|
// |a |b |null |[field1 -> a, field2 -> b] |
// |a |null |0 |[field1 -> a] |
// +------+------+------+---------------------------------------+

UPDATE
I found a way to achieve the expected result but it is a bit dirty.
val df2 = df.columns.foldLeft(df) { (df, n) => df.withColumn(n + "_map", map(lit(n), col(n))) }
val col_cond = df.columns.map(n => when(not(col(n + "_map").getItem(n).isNull || col(n + "_map").getItem(n) === lit("0")), col(n + "_map")).otherwise(map()))
df2.withColumn("map", map_concat(col_cond: _*))
.show(false)
ORIGINAL
Here is my try with the function map_from_arrays that is possible to use in spark 2.4+.
df.withColumn("array", array(df.columns.map(col): _*))
.withColumn("map", map_from_arrays(lit(df.columns), $"array")).show(false)
Then, the result is:
+------+------+------+---------+---------------------------------------+
|field1|field2|field3|array |map |
+------+------+------+---------+---------------------------------------+
|a |b |1 |[a, b, 1]|[field1 -> a, field2 -> b, field3 -> 1]|
|a |b |null |[a, b,] |[field1 -> a, field2 -> b, field3 ->] |
|a |null |0 |[a,, 0] |[field1 -> a, field2 ->, field3 -> 0] |
+------+------+------+---------+---------------------------------------+

Related

Pyspark group by collect list, to_json and pivot

Summary: Combining multiple rows to columns for a user
Input DF:
Id
group
A1
A2
B1
B2
1
Alpha
1
2
null
null
1
AlphaNew
6
8
null
null
2
Alpha
7
4
null
null
2
Beta
null
null
3
9
Note: The group values are dynamic
Expected Output DF:
Id
Alpha_A1
Alpha_A2
AlphaNew_A1
AlphaNew_A2
Beta_B1
Beta_B2
1
1
2
6
8
null
null
2
7
4
null
null
3
9
Attempted Solution:
I thought of making a json of non-null columns for each row, then a group by and concat_list of maps. Then I can explode the json to get the expected output.
But I am stuck at the stage of a nested json. Here is my code
vcols = df.columns[2:]
df\
.withColumn('json', F.to_json(F.struct(*vcols)))\
.groupby('id')\
.agg(
F.to_json(
F.collect_list(
F.create_map('group', 'json')
)
)
).alias('json')
Id
json
1
[{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}]
2
[{Alpha: {A1:7, A2:4}}, {Beta: {B1:3, B2:9}}]
What I am trying to get:
Id
json
1
[{Alpha_A1:1, Alpha_A2:2, AlphaNew_A1:6, AlphaNew_A2:8}]
2
[{Alpha_A1:7, Alpha_A2:4, Beta_B1:3, Beta_B2:9}]
I'd appreciate any help. I'm also trying to avoid UDFs as my true dataframe's shape is quite big
There's definitely a better way to do this but I continued your to json experiment.
Using UDFs:
After you get something like [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}] you could create a UDF to flatten the dict. But since it's a JSON string you'll have to parse it to dict and then back again to JSON.
After that you would like to explode and pivot the table but that's not possible with JSON strings, so you have to use F.from_json with defined schema. That will give you MapType which you can explode and pivot.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from collections import MutableMapping
import json
from pyspark.sql.types import (
ArrayType,
IntegerType,
MapType,
StringType,
)
def flatten_dict(d, parent_key="", sep="_"):
items = []
for k, v in d.items():
new_key = parent_key + sep + k if parent_key else k
if isinstance(v, MutableMapping):
items.extend(flatten_dict(v, new_key, sep=sep).items())
else:
items.append((new_key, v))
return dict(items)
def flatten_groups(data):
result = []
for item in json.loads(data):
result.append(flatten_dict(item))
return json.dumps(result)
if __name__ == "__main__":
spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
data = [
(1, "Alpha", 1, 2, None, None),
(1, "AlphaNew", 6, 8, None, None),
(2, "Alpha", 7, 4, None, None),
(2, "Beta", None, None, 3, 9),
]
columns = ["Id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)
vcols = df.columns[2:]
df = (
df.withColumn("json", F.struct(*vcols))
.groupby("id")
.agg(F.to_json(F.collect_list(F.create_map("group", "json"))).alias("json"))
)
# Flatten groups
flatten_groups_udf = F.udf(lambda x: flatten_groups(x))
schema = ArrayType(MapType(StringType(), IntegerType()))
df = df.withColumn("json", F.from_json(flatten_groups_udf(F.col("json")), schema))
# Explode and pivot
df = df.select(F.col("id"), F.explode(F.col("json")).alias("json"))
df = (
df.select("id", F.explode("json"))
.groupby("id")
.pivot("key")
.agg(F.first("value"))
)
At the end dataframe looks like:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
Without UDFs:
vcols = df.columns[2:]
df = (
df.withColumn("json", F.to_json(F.struct(*vcols)))
.groupby("id")
.agg(
F.collect_list(
F.create_map(
"group", F.from_json("json", MapType(StringType(), IntegerType()))
)
).alias("json")
)
)
df = df.withColumn("json", F.explode(F.col("json")).alias("json"))
df = df.select("id", F.explode(F.col("json")).alias("root", "value"))
df = df.select("id", "root", F.explode(F.col("value")).alias("sub", "value"))
df = df.select(
"id", F.concat(F.col("root"), F.lit("_"), F.col("sub")).alias("name"), "value"
)
df = df.groupBy(F.col("id")).pivot("name").agg(F.first("value"))
Result:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
I found a slightly better way than the json approach:
Stack the input dataframe value columns A1, A2,B1, B2,.. as rows
So the structure would look like id, group, sub, value where sub has the column name like A1, A2, B1, B2 and the value column has the value associated
Filter out the rows that have value as null
And, now we are able to pivot by the group. Since the null value rows are removed, we wont have the initial issue of the pivot making extra columns
import pyspark.sql.functions as F
data = [
(1, "Alpha", 1, 2, None, None),
(1, "AlphaNew", 6, 8, None, None),
(2, "Alpha", 7, 4, None, None),
(2, "Beta", None, None, 3, 9),
]
columns = ["id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)
# Value columns that need to be stacked
vcols = df.columns[2:]
expr_str = ', '.join([f"'{i}', {i}" for i in vcols])
expr_str = f"stack({len(vcols)}, {expr_str}) as (sub, value)"
df = df\
.selectExpr("id", "group", expr_str)\
.filter(F.col("value").isNotNull())\
.select("id", F.concat("group", F.lit("_"), "sub").alias("group"), "value")\
.groupBy("id")\
.pivot("group")\
.agg(F.first("value"))
df.show()
Result:
+---+-----------+-----------+--------+--------+-------+-------+
| id|AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
| 1| 6| 8| 1| 2| null| null|
| 2| null| null| 7| 4| 3| 9|
+---+-----------+-----------+--------+--------+-------+-------+

How write code that creates a Dataset with columns that have the elements of an array column as values and their names being positions?

Input data:
val inputDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF
println("Input:")
inputDf.show(false)
Here is how look Input:
+---------+
|value |
+---------+
|[a, b, c]|
|[X, Y, Z]|
+---------+
Here is how look Expected:
+---+---+---+
|0 |1 |2 |
+---+---+---+
|a |b |c |
|X |Y |Z |
+---+---+---+
I tried use code like this:
val ncols = 3
val selectCols = (0 until ncols).map(i => $"arr"(i).as(s"col_$i"))
inputDf
.select(selectCols:_*)
.show()
But I have errors, because I need some :Unit
Another way to create a dataframe ---
df1 = spark.createDataFrame([(1,[4,2, 1]),(4,[3,2])], [ "col2","col4"])
OUTPUT---------
+----+---------+
|col2| col4|
+----+---------+
| 1|[4, 2, 1]|
| 4| [3, 2]|
+----+---------+
package spark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
object ArrayToCol extends App {
val spark = SparkSession.builder()
.master("local")
.appName("DataFrame-example")
.getOrCreate()
import spark.implicits._
val inptDf = Seq(Seq("a", "b", "c"), Seq("X", "Y", "Z")).toDF("value")
val d = inptDf
.withColumn("0", col("value").getItem(0))
.withColumn("1", col("value").getItem(1))
.withColumn("2", col("value").getItem(2))
.drop("value")
d.show(false)
}
// Variant 2
val res = inptDf.select(
$"value".getItem(0).as("col0"),
$"value".getItem(1).as("col1"),
$"value".getItem(2).as("col2")
)
// Variant 3
val res1 = inptDf.select(
col("*") +: (0 until 3).map(i => col("value").getItem(i).as(s"$i")): _*
)
.drop("value")

Select a literal based on a column value in Spark

I have a map:
val map = Map("A" -> 1, "B" -> 2)
And I have a DataFrame. a column in the data frame contains the keys in the map. I am trying to select a column in a new DF that has the map values in it based on the key:
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
But this is resulting in the following error:
java.lang.RuntimeException: Unsupported literal type class scala.None$ None
I made sure that the column ColumnWithKey has As and Bs only and does not have empty values in it.
Is there another way to get the result I am looking for? Any help would be appreciated.
The Problem in this statement (besides syntax problems)
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
lit(map.get(col(ColumnWithKey).toString()).get) as newColumn)
is that col(ColumnWithKey) will not take the value of a specific row, but is only given by the schema, i.e. has a constant value.
In your case I would suggest to join your map to your dataframe :
val map = Map("A" -> 1, "B" -> 2)
val df_map = map.toSeq.toDF("key","value")
val DfThatContainsTheKeyColumn = Seq(
"A",
"A",
"B",
"B"
).toDF("myCol")
DfThatContainsTheKeyColumn
.join(broadcast(df_map),$"mycol"===$"key")
.select(concat($"mycol",lit("|"),$"value").as("newColumn"))
.show()
gives
|newColumn|
+---------+
| A|1|
| A|1|
| B|2|
| B|2|
+---------+
You can use case classes to make it easy. This is an example:
Given this input
val givenMap = Map("A" -> 1, "B" -> 2)
import spark.implicits._
val df = Seq(
(1, "A"),
(2, "A"),
(3, "B"),
(4, "B")
).toDF("col_a", "col_b")
df.show()
Above code looks like:
+-----+-----+
|col_a|col_b|
+-----+-----+
| 1| A|
| 2| A|
| 3| B|
| 4| B|
+-----+-----+
givenMap: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2)
import spark.implicits._
df: org.apache.spark.sql.DataFrame = [col_a: int, col_b: string]
The code that you need will look like:
case class MyInput(col_a: Int, col_b: String)
case class MyOutput(col_a: Int, col_b: String, new_column: Int)
df.as[MyInput].map(row=> MyOutput(row.col_a, row.col_b, givenMap(row.col_b))).show()
With the case classes you can cast your df and use object notation to access to your column values within a .map. Above code will output:
+-----+-----+----------+
|col_a|col_b|new_column|
+-----+-----+----------+
| 1| A| 1|
| 2| A| 1|
| 3| B| 2|
| 4| B| 2|
+-----+-----+----------+
defined class MyInput
defined class MyOutput
You can lookup a map using key from a column as,
val map = Map("A" -> 1, "B" -> 2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
df.map{ row =>
val k = row.getAs[String]("key")
val v = map.getOrElse(k,0)
(k,v)
}.toDF("key", "value").show(false)
Result -
+---+-----+
|key|value|
+---+-----+
|A |1 |
+---+-----+
You can look up a map present inside a column using a literal key using Column.getItem, please see an example below.
val mapKeys = Array("A","B")
val mapValues = Array(1,2)
val df = spark.createDataset(Seq("dummy"))
.withColumn("key",lit("A"))
.withColumn("keys",lit(mapKeys))
.withColumn("values",lit(mapValues))
.withColumn("map",map_from_arrays($"keys",$"values"))
.withColumn("lookUpTheMap",$"map".getItem("A"))
//A dataframe with Map is created.
//A map is looked up using a hard coded String key.
df.show(false)
Result
+-----+---+------+------+----------------+------------+
|value|key|keys |values|map |lookUpTheMap|
+-----+---+------+------+----------------+------------+
|dummy|A |[A, B]|[1, 2]|[A -> 1, B -> 2]|1 |
+-----+---+------+------+----------------+------------+
To look up a map present inside a column based on another column containing the key - you can use an UDF or use map function on the dataframe the way I am showing below.
//A map is looked up using a Column key.
df.map{ row =>
val m = row.getAs[Map[String,Int]]("map")
val k = row.getAs[String]("key")
val v = m.getOrElse(k,0)
(m,k,v)
}.toDF("map","key", "value").show(false)
Result
+----------------+---+-----+
|map |key|value|
+----------------+---+-----+
|[A -> 1, B -> 2]|A |1 |
+----------------+---+-----+
I think a simpler option could be to use typedLit:
val map = typedLit(Map("A" -> 1, "B" -> 2))
val newDF = DfThatContainsTheKeyColumn.select(concat(col(SomeColumn), lit("|"),
map(col(ColumnWithKey))) as newColumn)

spark expression rename the column list after aggregation

I have written below code to group and aggregate the columns
val gmList = List("gc1","gc2","gc3")
val aList = List("val1","val2","val3","val4","val5")
val cype = "first"
val exprs = aList.map((_ -> cype )).toMap
dfgroupBy(gmList.map (col): _*).agg (exprs).show
but this create a columns with appending aggregation name in all column as shown below
so I want to alias that name first(val1) -> val1, I want to make this code generic as part of exprs
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
| gc1 | gc2 | gc3 | first(val1) | first(val2)| first(val3) | first(val4) | first(val5) |
+----------+----------+-------------+-------------------------+------------------+---------------------------+------------------------+-------------------+
One approach would be to alias the aggregated columns to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 10, "a1", "a2", "a3"),
(1, 10, "b1", "b2", "b3"),
(2, 20, "c1", "c2", "c3"),
(2, 30, "d1", "d2", "d3"),
(2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")
val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))
df.
groupBy(gmList.map(col): _*).agg(afPairs.toMap).
select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// | 2| 20| c1| c2| c3|
// | 1| 10| a1| a2| a3|
// | 2| 30| d1| d2| d3|
// +---+---+----+----+----+
You can slightly change the way you are generating the expression and use the function alias in there:
import org.apache.spark.sql.functions.col
val aList = List("val1","val2","val3","val4","val5")
val exprs = aList.map(c => first(col(c)).alias(c) )
dfgroupBy( gmList.map(col) : _*).agg(exprs.head , exprs.tail: _*).show
Here's a more generic version that will work with any aggregate functions and doesn't require naming your aggregate columns up front. Build your grouped df as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).

Alternative to groupBy on several columns in spark dataframe

I have a spark data frame with columns like so:
df
--------------------------
A B C D E F amt
"A1" "B1" "C1" "D1" "E1" "F1" 1
"A2" "B2" "C2" "D2" "E2" "F2" 2
I would like to perform groupBy with column combinations
(A, B, sum(amt))
(A, C, sum(amt))
(A, D, sum(amt))
(A, E, sum(amt))
(A, F, sum(amt))
such that the resulting data frame looks like:
df_grouped
----------------------
A field value amt
"A1" "B" "B1" 1
"A2" "B" "B2" 2
"A1" "C" "C1" 1
"A2" "C" "C2" 2
"A1" "D" "D1" 1
"A2" "D" "D2" 2
My attempt at this was the following:
val cols = Vector("B","C","D","E","F")
//code for creating empty data frame with structs for the cols A, field, value and act
for (col <- cols){
empty_df = empty_df.union (df.groupBy($"A",col)
.agg(sum(amt).as(amt)
.withColumn("field",lit(col)
.withColumnRenamed(col, "value"))
}
I feel that the usage "for" or "foreach" may be clumsy for a distributed env such as spark. Are there any alternatives with map functionality for what I am doing? In my mind, aggregateByKey and collect_list may work; however, I am unable to imagine a complete solution. Please advise.
foldLeft is very powerful function devised in Scala if you know how to play with it. I am suggesting you to use foldLeft function ( I have commented for clarity in the code and for explanation)
//selecting the columns without A and amt
val columnsForAggregation = df.columns.tail.toSet - "amt"
//creating an empty dataframe (format for final output
val finalDF = Seq(("empty", "empty", "empty", 0.0)).toDF("A", "field", "value", "amt")
//using foldLeft for the aggregation and merging each aggreted results
import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)){(tempdf, column) => {
//aggregation on the dataframe with A and one of the column and finally selecting as required in the outptu
val aggregatedf = tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
.select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
//union the aggregated results and transferring dataframes for next loop
(df, tempdf._2.union(aggregatedf))
}
}
//finally removing the dummy row created
transformeddf.filter(col("A") =!= "empty")
.show(false)
You should have the dataframe you desire
+---+-----+-----+---+
|A |field|value|amt|
+---+-----+-----+---+
|A1 |E |E1 |1.0|
|A2 |E |E2 |2.0|
|A1 |F |F1 |1.0|
|A2 |F |F2 |2.0|
|A2 |B |B2 |2.0|
|A1 |B |B1 |1.0|
|A2 |C |C2 |2.0|
|A1 |C |C1 |1.0|
|A1 |D |D1 |1.0|
|A2 |D |D2 |2.0|
+---+-----+-----+---+
I hope the answer is helpful
Concised form of above foldLeft function is
import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)){(tempdf, column) =>
(df, tempdf._2.union(tempdf._1.groupBy("A", column).agg(sum("amt").as("amt")).select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))))
}