How to apply a customized function with multiple parameters to each group of a dataframe and union the resulting dataframes in Scala Spark?

I have a customized function that looks like this and returns a new dataframe as the output
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame = {...}
and I want to apply this function to each group of
df.groupBy("type")
then append the output dataframes from each type into one dataframe.
This is a little different from other questions about applying customized functions to grouped dataframes, in that this function also takes other inputs in addition to the dataframe being grouped, df.groupBy("type").
What's the best way to do this?

You can filter down the original df to the different groups, call customizedfun for each group and then union the results.
I assume that customizedfun is a function that simply adds the two parameters as a new column, but it could be any function:
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame =
data.withColumn("newCol", lit(s"$param2 $param1"))
I need two helper functions that calculate the values of param1 and param2 depending on the value of type. In a real-world application, these functions could for example be a lookup into a dictionary, as sketched after the helpers below.
def calcParam1(typ: Integer): Boolean = typ % 2 == 0
def calcParam2(typ: Integer): String = s"type is $typ"
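As a minimal sketch of the dictionary idea (the lookup tables and the *ViaLookup helper names below are purely illustrative, hard-coded to mirror the computed values above):
//hypothetical lookup tables; in a real application they could come from a
//configuration file or a reference dataset instead of being hard-coded
val param1Lookup = Map(1 -> false, 2 -> true, 3 -> false)
val param2Lookup = Map(1 -> "type is 1", 2 -> "type is 2", 3 -> "type is 3")
def calcParam1ViaLookup(typ: Integer): Boolean = param1Lookup.getOrElse(typ, false)
def calcParam2ViaLookup(typ: Integer): String = param2Lookup.getOrElse(typ, s"unknown type $typ")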
Now the original df is filtered into the different groups, customizedfun is called for each group, and the results are unioned:
//create some test data
val df = Seq((1, "A", "a"), (1, "B", "b"), (1, "C", "c"), (2, "D", "d"), (2, "E", "e"), (3, "F", "f"))
.toDF("type", "val1", "val2")
//+----+----+----+
//|type|val1|val2|
//+----+----+----+
//| 1| A| a|
//| 1| B| b|
//| 1| C| c|
//| 2| D| d|
//| 2| E| e|
//| 3| F| f|
//+----+----+----+
//get the distinct values of column type
val distinctTypes = df.select("type").distinct().as[Integer].collect()
//call customizedfun for each group
val resultPerGroup = for (typ <- distinctTypes)
  yield customizedfun(df.filter(s"type = $typ"), calcParam1(typ), calcParam2(typ))
//the final union
val result = resultPerGroup.tail.foldLeft(resultPerGroup.head)(_ union _)
//+----+----+----+---------------+
//|type|val1|val2| newCol|
//+----+----+----+---------------+
//| 1| A| a|type is 1 false|
//| 1| B| b|type is 1 false|
//| 1| C| c|type is 1 false|
//| 3| F| f|type is 3 false|
//| 2| D| d| type is 2 true|
//| 2| E| e| type is 2 true|
//+----+----+----+---------------+
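As a side note, since the foldLeft above already requires resultPerGroup to be non-empty (head would fail otherwise), the same union can be written more compactly with reduce; this is only a stylistic variant, not a change in behaviour:
//equivalent to the foldLeft above, assuming resultPerGroup is non-empty
val resultViaReduce = resultPerGroup.reduce(_ union _)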

Related

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like the one shown below (it is reproduced as the INPUT table in the answers).
I am trying to create another dataframe from it with 2 columns - the column name and the sum of values in each column - like the OUTPUT table in the answers.
So far, I've tried this (in Spark 2.2.0), but it throws a stack trace:
val get_count: (String => Long) = (c: String) => {
  df.groupBy("id")
    .agg(sum(c) as "s")
    .select("s")
    .collect()(0)
    .getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives for accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
(0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
.toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations collected in aggs. One would expect something simpler, like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)
maybe to ensure that at least one column is used, and this is why you need to split aggs into aggs.head and aggs.tail.
What I do is define a method that creates a struct from the desired values:
def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives a list of columns to transpose (the last 3 columns in your case) and transforms them into structs with the column name as key and the column value as value.
Then use that method to create the structs and process them as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function to turn the desired columns into the structs described above, then deconstruct each struct into a key column and a value column per row.
Finally, group by the column name and sum the values.
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

Group by and find count before doing pivot spark

I have a dataframe like below
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy on A and B, pivot on column C, and sum column D.
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However, I also need to find the count after the groupBy. If I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count)
I get an output like
A B large small large_count small_count
Is there a way to get only one count after the groupBy, before doing the pivot?
On the output, try:
output.withColumn("count", $"large_count" + $"small_count").show
You can drop the two count columns afterwards if you want to; a short sketch follows.
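A minimal sketch of that, assuming the pivoted result from the question is called output and its per-pivot count columns are named large_count and small_count as shown in the question:
//'output' and the column names large_count/small_count are taken from the question above
output
  .withColumn("count", $"large_count" + $"small_count") //becomes null when a group has no rows for one of the pivot values
  .drop("large_count", "small_count")
  .show()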
To do it before the pivot, try
df.groupBy("A", "B").agg(count("C"))
Is this what you are expecting?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
This won't require a join. Is this what you are looking for?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.registerTempTable("dummy")
spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

Grouping by values on a Spark Dataframe

I'm working on a Spark dataframe containing this kind of data:
A,1,2,3
B,1,2,3
C,1,2,3
D,4,2,3
I want to aggregate this data on the last three columns, so the output would be:
ABC,1,2,3
D,4,2,3
How can I do it in Scala? (This is not a big dataframe, so performance is secondary here.)
As mentioned in the comments, you can first use groupBy on the last three columns and then use concat_ws on your first column. Here is one way of doing it:
//create you original DF
val df = Seq(("A",1,2,3),("B",1,2,3),("C",1,2,3),("D",4,2,3)).toDF("col1","col2","col3","col4")
df.show
//output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 1| 2| 3|
| B| 1| 2| 3|
| C| 1| 2| 3|
| D| 4| 2| 3|
+----+----+----+----+
//group by "col2","col3","col4" and store "col1" as list and then
//convert it to string
df.groupBy("col2","col3","col4")
.agg(collect_list("col1").as("col1"))
//you can change the string separator by concat_ws first arg
.select(concat_ws("", $"col1") as "col1",$"col2",$"col3",$"col4").show
//output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| D| 4| 2| 3|
| ABC| 1| 2| 3|
+----+----+----+----+
Alternatively, you can map by your key (in this case c2, c3, c4) and then concatenate the values via reduceByKey. At the end, each row is formatted as needed in the last map. It should be something like the following:
val data=sc.parallelize(List(
("A", "1", "2", "3"),
("B", "1", "2", "3"),
("C", "1", "2", "3"),
("D", "4", "2", "3")))
val res = data.map{ case (c1, c2, c3, c4) => ((c2, c3, c4), String.valueOf(c1)) }
.reduceByKey((x, y) => x + y)
.map(v => v._2.toString + "," + v._1.productIterator.toArray.mkString(","))
.collect
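Collected, res should contain the two strings "ABC,1,2,3" and "D,4,2,3", although neither the order of the two rows nor the concatenation order within a key is strictly guaranteed by reduceByKey.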

How to use Spark ML ALS algorithm? [duplicate]

Is it possible to factorize a Spark dataframe column? By factorizing I mean creating a mapping so that every occurrence of the same unique value in the column gets the same ID.
Example, the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In Scala itself it would be fairly simple, but since Spark distributes its dataframes over nodes, I'm not sure how to keep a mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading one entire column into the memory of a single machine might not be possible.
Can it be done?
You can use StringIndexer to encode letters into indices:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("col3")
.setOutputCol("col3Index")
val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
Data:
val df = spark.createDataFrame(Seq(
(1473490929, "4060600988513370", "A"),
(1473492972, "4060600988513370", "A"),
(1473509764, "4060600988513370", "B"),
(1473513432, "4060600988513370", "C"),
(1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")
You can use a user-defined function.
First you create the mapping you need:
val updateFunction = udf {(x: String) =>
x match {
case "A" => 0
case "B" => 1
case "C" => 2
case _ => 3
}
}
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))

Spark join produces wrong results

Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.
I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.
scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]
scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]
scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
| a| b| 1| a| b| 2|
+-----+-----+---+---+---+---+
I didn't expect that result at all. I expected the first row, but not the second. I suspected it was the null-safe equality, so I tried it without:
scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
+-----+-----+---+---+---+---+
OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question to deal with that warning here: Spark SQL performing carthesian join instead of inner join
So I create a new column that avoids the warning.
scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb| c|newc| a| b| c|
+-----+-----+---+----+---+---+---+
| a| b| 1| 1| a| b| 1|
| a| b| 1| 1| a| b| 2|
+-----+-----+---+----+---+---+---+
But now the result is wrong again!
I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.
Is the behaviour a bug, or is this expected behaviour? If expected, why?
If you want the expected behavior, use either a join on names:
val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
val a = b.where("c = 1")
a.join(b, Seq("a", "b", "c")).show
// +---+---+---+
// | a| b| c|
// +---+---+---+
// | a| b| 1|
// +---+---+---+
or aliases:
val aa = a.alias("a")
val bb = b.alias("b")
aa.join(bb, $"a.a" === $"b.a" && $"a.b" === $"b.b" && $"a.c" === $"b.c")
You can use <=> as well:
aa.join(bb, $"a.a" <=> $"b.a" && $"a.b" <=> $"b.b" && $"a.c" <=> $"b.c")
As far as I remember there's been a special case for simple equality for a while. That's why you get correct results despite the warning.
The second behavior indeed looks like a bug, related to the fact that you still have a.c in your data. It looks like it is picked downstream before b.c, and the evaluated condition is actually a.newc = a.c. Renaming the column before the join avoids the ambiguity:
val expr = $"filta" === $"a" and $"filtb" === $"b" and $"newc" === $"c"
a.withColumnRenamed("c", "newc").join(b, expr, "left_outer")