Alternative to groupBy on several columns in spark dataframe - scala

I have a spark data frame with columns like so:
df
--------------------------
A B C D E F amt
"A1" "B1" "C1" "D1" "E1" "F1" 1
"A2" "B2" "C2" "D2" "E2" "F2" 2
I would like to perform groupBy with column combinations
(A, B, sum(amt))
(A, C, sum(amt))
(A, D, sum(amt))
(A, E, sum(amt))
(A, F, sum(amt))
such that the resulting data frame looks like:
df_grouped
----------------------
A field value amt
"A1" "B" "B1" 1
"A2" "B" "B2" 2
"A1" "C" "C1" 1
"A2" "C" "C2" 2
"A1" "D" "D1" 1
"A2" "D" "D2" 2
My attempt at this was the following:
val cols = Vector("B","C","D","E","F")
//code for creating empty data frame with structs for the cols A, field, value and amt
for (c <- cols){
  empty_df = empty_df.union(df.groupBy($"A", col(c))
    .agg(sum("amt").as("amt"))
    // union is positional, so select the columns in the order A, field, value, amt
    .select($"A", lit(c).as("field"), col(c).as("value"), $"amt"))
}
I feel that using "for" or "foreach" may be clumsy in a distributed environment such as Spark. Are there any alternatives with map functionality for what I am doing? In my mind, aggregateByKey and collect_list may work; however, I am unable to imagine a complete solution. Please advise.

foldLeft is a powerful Scala function once you get used to it, and I suggest using it here (I have commented the code for clarity and explanation):
//selecting the columns other than A and amt
val columnsForAggregation = df.columns.tail.toSet - "amt"
//creating a dummy one-row dataframe (the format of the final output)
val finalDF = Seq(("empty", "empty", "empty", 0.0)).toDF("A", "field", "value", "amt")
//using foldLeft for the aggregation and merging each aggregated result
import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)){(tempdf, column) => {
//aggregation on the dataframe with A and one of the columns, finally selecting the columns required in the output
val aggregatedf = tempdf._1.groupBy("A", column).agg(sum("amt").as("amt"))
.select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))
//union the aggregated results and transferring dataframes for next loop
(df, tempdf._2.union(aggregatedf))
}
}
//finally removing the dummy row created
transformeddf.filter(col("A") =!= "empty")
.show(false)
You should have the dataframe you desire
+---+-----+-----+---+
|A |field|value|amt|
+---+-----+-----+---+
|A1 |E |E1 |1.0|
|A2 |E |E2 |2.0|
|A1 |F |F1 |1.0|
|A2 |F |F2 |2.0|
|A2 |B |B2 |2.0|
|A1 |B |B1 |1.0|
|A2 |C |C2 |2.0|
|A1 |C |C1 |1.0|
|A1 |D |D1 |1.0|
|A2 |D |D2 |2.0|
+---+-----+-----+---+
I hope the answer is helpful
A more concise form of the foldLeft above is:
import org.apache.spark.sql.functions._
val (originaldf, transformeddf) = columnsForAggregation.foldLeft((df, finalDF)){(tempdf, column) =>
(df, tempdf._2.union(tempdf._1.groupBy("A", column).agg(sum("amt").as("amt")).select(col("A"), lit(column).as("field"), col(column).as("value"), col("amt"))))
}
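If the loop-and-union pattern itself is the concern, one possible alternative (a sketch, not part of the answer above) is to unpivot the columns first and then aggregate once. It assumes a local SparkSession, recreates the sample data from the question, and assumes that columns B to F all share one type (string here), since they are packed into a single array. The names fieldCols, pairs and grouped are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, explode, lit, struct, sum}

// Assumption: a local session for experimentation
val spark = SparkSession.builder().master("local[*]").appName("unpivot-sketch").getOrCreate()
import spark.implicits._

// Sample data recreated from the question
val df = Seq(
  ("A1", "B1", "C1", "D1", "E1", "F1", 1),
  ("A2", "B2", "C2", "D2", "E2", "F2", 2)
).toDF("A", "B", "C", "D", "E", "F", "amt")

// Every column except the grouping key A and the measure amt
val fieldCols = df.columns.filterNot(Set("A", "amt"))

// One (field, value) struct per column, collected into an array column
val pairs = array(fieldCols.map(c => struct(lit(c).as("field"), col(c).as("value"))): _*)

// Explode the array so each input row yields one row per field, then group once
val grouped = df
  .select(col("A"), explode(pairs).as("kv"), col("amt"))
  .groupBy(col("A"), col("kv.field").as("field"), col("kv.value").as("value"))
  .agg(sum("amt").as("amt"))

grouped.show(false)
```

This replaces the repeated groupBy and union with a single aggregation and needs no dummy row. Here amt stays an integral sum rather than the double produced by the union with the dummy row above.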

Related

Conditional Spark map() function based on input columns

What I'm trying to achieve is to pass conditionally generated columns to the Spark SQL map function, depending on whether they contain null, 0 or any other value I choose.
Take for example this initial DF.
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
From that initial DataFrame I want to generate yet another column which will be a map, like this.
initialDF.withColumn("thisMap", MY_FUNCTION)
My current approach is basically to take a Seq[String] in a method and flatMap it into the key-value pairs that the Spark SQL map function receives, like this.
def toMap(columns: String*): Column = {
map(
columns.flatMap(column => List(lit(column), col(column))): _*
)
}
But then, filtering becomes a Scala thing and is quite a mess.
What I would like to obtain after processing is, for each of those rows, the following DataFrame.
val initialDF = Seq(
("a", "b", 1, Map("field1" -> "a", "field2" -> "b", "field3" -> 1)),
("a", "b", null, Map("field1" -> "a", "field2" -> "b")),
("a", null, 0, Map("field1" -> "a"))
)
.toDF("field1", "field2", "field3", "thisMap")
I was wondering if this can be achieved using the Column API which is way more intuitive with .isNull or .equalTo?
Here's a small improvement on Lamanus' answer above which only loops over df.columns once:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
df.withColumn("thisMap", map_concat(
df.columns.map { colName =>
when(col(colName).isNull or col(colName) === 0, map())
.otherwise(map(lit(colName), col(colName)))
}: _*
)).show(false)
// +------+------+------+---------------------------------------+
// |field1|field2|field3|thisMap |
// +------+------+------+---------------------------------------+
// |a |b |1 |[field1 -> a, field2 -> b, field3 -> 1]|
// |a |b |null |[field1 -> a, field2 -> b] |
// |a |null |0 |[field1 -> a] |
// +------+------+------+---------------------------------------+
UPDATE
I found a way to achieve the expected result but it is a bit dirty.
val df2 = df.columns.foldLeft(df) { (df, n) => df.withColumn(n + "_map", map(lit(n), col(n))) }
val col_cond = df.columns.map(n => when(not(col(n + "_map").getItem(n).isNull || col(n + "_map").getItem(n) === lit("0")), col(n + "_map")).otherwise(map()))
df2.withColumn("map", map_concat(col_cond: _*))
.show(false)
ORIGINAL
Here is my attempt using the function map_from_arrays, which is available in Spark 2.4+.
df.withColumn("array", array(df.columns.map(col): _*))
.withColumn("map", map_from_arrays(lit(df.columns), $"array")).show(false)
Then, the result is:
+------+------+------+---------+---------------------------------------+
|field1|field2|field3|array |map |
+------+------+------+---------+---------------------------------------+
|a |b |1 |[a, b, 1]|[field1 -> a, field2 -> b, field3 -> 1]|
|a |b |null |[a, b,] |[field1 -> a, field2 -> b, field3 ->] |
|a |null |0 |[a,, 0] |[field1 -> a, field2 ->, field3 -> 0] |
+------+------+------+---------+---------------------------------------+

Convert multiple columns into a column of map on Spark Dataframe using Scala

I have a dataframe with a variable number of columns, like Col1, Col2, Col3.
I need to combine Col1 and Col2 into one column of data type map, using the code below.
val df_converted = df.withColumn("ConvertedCols", map(lit("Col1"), col("Col1"), lit("Col2"), col("Col2")))
But how can I do it for all columns when I don't know the number and names of the columns?
One approach would be to expand the column list of the DataFrame via flatMap into a Seq(lit(c1), col(c1), lit(c2), col(c2), ...) and apply Spark's map as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("a", "b", "c", "d"),
("e", "f", "g", "h")
).toDF("c1", "c2", "c3", "c4")
val kvCols = df.columns.flatMap(c => Seq(lit(c), col(c)))
df.withColumn("ConvertedCols", map(kvCols: _*)).show(false)
// +---+---+---+---+---------------------------------------+
// |c1 |c2 |c3 |c4 |ConvertedCols |
// +---+---+---+---+---------------------------------------+
// |a |b |c |d |Map(c1 -> a, c2 -> b, c3 -> c, c4 -> d)|
// |e |f |g |h |Map(c1 -> e, c2 -> f, c3 -> g, c4 -> h)|
// +---+---+---+---+---------------------------------------+
Another way is to use from_json and to_json to get a map type column:
val df2 = df.withColumn(
"ConvertedCols",
from_json(to_json(struct("*")), lit("map<string,string>"))
)
df2.show(false)
+---+---+---+---+------------------------------------+
|c1 |c2 |c3 |c4 |ConvertedCols |
+---+---+---+---+------------------------------------+
|a |b |c |d |[c1 -> a, c2 -> b, c3 -> c, c4 -> d]|
|e |f |g |h |[c1 -> e, c2 -> f, c3 -> g, c4 -> h]|
+---+---+---+---+------------------------------------+

Get the number of null per row in PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want a pandas solution.
EDIT3: The solution explained with sums or means does not work as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val df = List(
("ABC", "1", "a"),
("1", "2", "b"),
("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")
val expected = "ABC"
val complexColumn: Column = df.schema.fieldNames.map(c => when(col(c) === lit(expected), 1).otherwise(0)).reduce((a, b) => a + b)
df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
dfvals = [
(None, "1", "a"),
("1", "2", "b"),
("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])
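from functools import reduce  # built in on Python 2, needs this import on Python 3
from pyspark.sql import functions as func
new_df = df.withColumn('null_cnt', reduce(lambda x, y: x + y,
    map(lambda x: func.when(func.col(x).isNull(), 1).otherwise(0),
        df.schema.names)))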
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+
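The type mismatch quoted in EDIT3 comes from adding a boolean (the isNull result) to an integer. A minimal Scala sketch of one way around it is to cast each null check to int before summing. It assumes a local SparkSession and recreates the question's sample data as strings; the names nullCount and counted are illustrative. The same idea should carry over to PySpark with `.isNull().cast("int")`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation
val spark = SparkSession.builder().master("local[*]").appName("null-count-sketch").getOrCreate()
import spark.implicits._

// Sample data from the question, with None standing in for null
val df = Seq[(Option[String], Option[String], Option[String])](
  (None, Some("1"), Some("a")),
  (Some("1"), Some("2"), Some("b")),
  (Some("2"), Some("3"), None)
).toDF("col1", "col2", "col3")

// isNull yields a boolean column; casting to int gives 1 or 0, which can be summed
val nullCount = df.columns.map(c => col(c).isNull.cast("int")).reduce(_ + _)

val counted = df.withColumn("number_of_null", nullCount)
counted.show()
```

Because the expression is built from df.columns, it does not require listing the columns by hand.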

how to access the column index for spark dataframe in scala for calculation

I am new to Scala programming. I have worked with R extensively, but in Scala I am finding it hard to loop over specific columns and perform computations on their values.
Let me explain with an example:
I have a final dataframe obtained by joining two dataframes,
and now I need to perform a calculation like
The above is the computation with reference to the columns; after the computation we get the Spark dataframe below
How do I refer to columns by index in a for loop to compute new column values in a Spark dataframe in Scala?
Here is one solution:
Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+
Zip the two lists of column names, which drops the column without a partner (e1):
val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols)
//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
Use the foldLeft function to add the columns dynamically:
import org.apache.spark.sql.functions._
// for each pair (c1, c2), add Diff_c1 = (c2 - c1) / c2
val result = cols.foldLeft(data) { case (acc, (c1, c2)) =>
  acc.withColumn(s"Diff_$c1", (col(c2) - col(c1)) / col(c2))
}
Here is the result:
result.show(false)
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1 |Diff_b1|Diff_c1 |Diff_d1 |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26 |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
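Since the question asks about column indices specifically, here is a hedged sketch of a positional variant. It assumes a local SparkSession and assumes the layout shown in the input data above: the first five columns form the "1" group and the remaining four form the "2" group. The split index 5 and the names names, pairs and diffs are illustrative assumptions, not something the question states.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation
val spark = SparkSession.builder().master("local[*]").appName("column-index-sketch").getOrCreate()
import spark.implicits._

// Input row recreated from the answer above
val data = Seq((24, 74, 74, 21, 66, 65, 100, 27, 19))
  .toDF("a1", "b1", "c1", "d1", "e1", "a2", "b2", "c2", "d2")

// data.columns is an Array[String], so names(i) is the column name at index i
val names = data.columns

// Pair index 0 with 5, 1 with 6, and so on; zip stops at the shorter side, so e1 is left out
val pairs = names.take(5).zip(names.drop(5))

// One derived column per pair: (second - first) / second
val diffs = pairs.map { case (c1, c2) => ((col(c2) - col(c1)) / col(c2)).as(s"Diff_$c1") }

// Add all derived columns in a single select instead of repeated withColumn calls
val result = data.select(names.map(col) ++ diffs: _*)
result.show(false)
```

Pairing by position is fragile if the column order changes, which is why the answer above matches on names instead.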

Scala LEFT JOIN on dataframes using two columns (case insensitive)

I have created the below method which takes two Dataframes; lhs & rhs and their respective first and second columns as input. The method should return the result of a left join between these two frames using the two columns provided for each dataframe (ignoring their case sensitivity).
The problem I am facing is that it behaves more like an inner join. It is returning 3 times the number of rows in the lhs data frame (due to duplicate values in rhs), but as it is a left join the duplication and number of rows in the rhs dataframe should not matter.
def leftJoinCaseInsensitive(lhs: DataFrame, rhs: DataFrame, leftTableColumn: String, rightTableColumn: String, leftTableColumn1: String, rightTableColumn1: String): DataFrame = {
val joined: DataFrame = lhs.join(rhs, upper(lhs.col(leftTableColumn)) === upper(rhs.col(rightTableColumn)) && upper(lhs.col(leftTableColumn1)) === upper(rhs.col(rightTableColumn1)), "left");
return joined
}
If there are duplicate values in rhs, then it is normal for lhs rows to be replicated. If the values in the joining columns of an lhs row match multiple rhs rows, the joined dataframe will contain that lhs row once for each matching rhs row.
for example
lhs dataframe
+--------+--------+--------+
|col1left|col2left|col3left|
+--------+--------+--------+
|a |1 |leftside|
+--------+--------+--------+
And
rhs dataframe
+---------+---------+---------+
|col1right|col2right|col3right|
+---------+---------+---------+
|a |1 |rightside|
|a |1 |rightside|
+---------+---------+---------+
Then it is normal to have left join as
left joined lhs with rhs
+--------+--------+--------+---------+---------+---------+
|col1left|col2left|col3left|col1right|col2right|col3right|
+--------+--------+--------+---------+---------+---------+
|a |1 |leftside|a |1 |rightside|
|a |1 |leftside|a |1 |rightside|
+--------+--------+--------+---------+---------+---------+
"but as it is a left join the duplication and number of rows in rhs dataframe should not matter"
Not true. Your leftJoinCaseInsensitive method looks good to me. A left join would still produce more rows than the left table's if the right table has duplicated key column(s), as shown below:
val dfR = Seq(
(1, "a", "x"),
(1, "a", "y"),
(2, "b", "z")
).toDF("k1", "k2", "val")
val dfL = Seq(
(1, "a", "u"),
(2, "b", "v"),
(3, "c", "w")
).toDF("k1", "k2", "val")
leftJoinCaseInsensitive(dfL, dfR, "k1", "k1", "k2", "k2").show
+---+---+---+----+----+----+
| k1| k2|val| k1| k2| val|
+---+---+---+----+----+----+
| 1| a| u| 1| a| y|
| 1| a| u| 1| a| x|
| 2| b| v| 2| b| z|
| 3| c| w|null|null|null|
+---+---+---+----+----+----+
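If the goal really is at most one output row per lhs row, one option is to deduplicate rhs on the upper-cased keys before joining. The following is only a sketch under assumptions: a local SparkSession, string key columns, a hypothetical helper named leftJoinOneMatchPerKey, and temporary columns _key1 and _key2 that are assumed not to clash with existing column names. Which of the duplicate rhs rows survives dropDuplicates is not deterministic.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.upper

// Assumption: a local session for experimentation
val spark = SparkSession.builder().master("local[*]").appName("dedup-left-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: case-insensitive left join keeping one rhs match per key pair
def leftJoinOneMatchPerKey(lhs: DataFrame, rhs: DataFrame,
    leftCol1: String, rightCol1: String,
    leftCol2: String, rightCol2: String): DataFrame = {
  // Normalise the rhs keys and keep a single (arbitrary) row per normalised key pair
  val rhsOnePerKey = rhs
    .withColumn("_key1", upper(rhs.col(rightCol1)))
    .withColumn("_key2", upper(rhs.col(rightCol2)))
    .dropDuplicates("_key1", "_key2")

  lhs.join(
      rhsOnePerKey,
      upper(lhs.col(leftCol1)) === rhsOnePerKey.col("_key1") &&
        upper(lhs.col(leftCol2)) === rhsOnePerKey.col("_key2"),
      "left")
    .drop("_key1", "_key2") // remove the temporary key columns
}

val dfL = Seq(("1", "a", "u"), ("2", "B", "v"), ("3", "c", "w")).toDF("k1", "k2", "val")
val dfR = Seq(("1", "A", "x"), ("1", "a", "y"), ("2", "b", "z")).toDF("k1", "k2", "val")

val joined = leftJoinOneMatchPerKey(dfL, dfR, "k1", "k1", "k2", "k2")
joined.show()
```

With this sample data the result has as many rows as dfL, and the row for key 3 carries nulls on the right-hand side. If a specific rhs row should win among duplicates, an explicit ordering (for example a window function) would be needed instead of dropDuplicates.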