How to create a map column to count occurrences without udaf - scala

I would like to create a Map column which counts the number of occurrences.
For instance:
+---+----+
| b| a|
+---+----+
| 1| b|
| 2|null|
| 1| a|
| 1| a|
+---+----+
would result in
+---+--------------------+
| b| res|
+---+--------------------+
| 1|[a -> 2.0, b -> 1.0]|
| 2| []|
+---+--------------------+
For the moment, in Spark 2.4.6, I was able to do this with a UDAF.
While upgrading to Spark 3, I was wondering whether I could get rid of this UDAF (I tried the new aggregate function without success).
Is there an efficient way to do it?
(I can easily test the efficiency part myself.)

Here is a Spark 3 solution:
import org.apache.spark.sql.functions._
df.groupBy($"b",$"a").count()
.groupBy($"b")
.agg(
map_from_entries(
collect_list(
when($"a".isNotNull,struct($"a",$"count"))
)
).as("res")
)
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
Here is a solution using Aggregator:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Encoder
val countOcc = new Aggregator[String, Map[String,Int], Map[String,Int]] with Serializable {
  def zero: Map[String,Int] = Map.empty
  // skip nulls; getOrElse is used because a withDefaultValue default does not survive encoding
  def reduce(b: Map[String,Int], a: String) = if(a!=null) b + (a -> (b.getOrElse(a, 0) + 1)) else b
  // add up the counts of both partial buffers, key by key
  def merge(b1: Map[String,Int], b2: Map[String,Int]) = {
    val keys = b1.keySet.union(b2.keySet)
    keys.map{ k => (k -> (b1.getOrElse(k, 0) + b2.getOrElse(k, 0))) }.toMap
  }
  def finish(b: Map[String,Int]) = b
  def bufferEncoder: Encoder[Map[String,Int]] = ExpressionEncoder[Map[String,Int]]()
  def outputEncoder: Encoder[Map[String, Int]] = ExpressionEncoder[Map[String, Int]]()
}
val countOccUDAF = udaf(countOcc)
df
.groupBy($"b")
.agg(countOccUDAF($"a").as("res"))
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
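To make the Aggregator contract easier to follow, here is a minimal plain-Scala sketch (no Spark involved) of how zero, reduce and merge could combine partial results. The split into two partitions and the extra null value are hypothetical and only serve to illustrate the flow; Spark decides the real partitioning.

```scala
// Plain-Scala model of the Aggregator contract; this is not Spark code.
type Counts = Map[String, Int]

val zero: Counts = Map.empty

// Fold one input value into the buffer, skipping nulls.
def reduce(buf: Counts, a: String): Counts =
  if (a == null) buf else buf.updated(a, buf.getOrElse(a, 0) + 1)

// Combine two partial buffers key by key.
def merge(b1: Counts, b2: Counts): Counts =
  b2.foldLeft(b1) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0) + n) }

// Hypothetical split of one group's values across two partitions,
// with a null added to exercise the null check.
val partition1 = Seq("b", null, "a")
val partition2 = Seq("a")

val partial1 = partition1.foldLeft(zero)(reduce)
val partial2 = partition2.foldLeft(zero)(reduce)
val result = merge(partial1, partial2)
```

Each partition is reduced independently starting from zero, and the partial maps are merged afterwards, which is why merge has to cope with keys that are present in only one of the two buffers.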

You could always use collect_list with a UDF, but only if your groupings are not too large:
val udf_histo = udf((x:Seq[String]) => x.groupBy(identity).map{ case (k, v) => k -> v.size })
df.groupBy($"b")
.agg(
collect_list($"a").as("as")
)
.select($"b",udf_histo($"as").as("res"))
.show()
gives:
+---+----------------+
| b| res|
+---+----------------+
| 1|[b -> 1, a -> 2]|
| 2| []|
+---+----------------+
This should be faster than a UDAF; see "Spark custom aggregation: collect_list+UDF vs UDAF".

We can achieve this in Spark 2.4:
//GET THE COUNTS
val groupedCountDf = originalDf.groupBy("b","a").count
//CREATE MAPS FOR EVERY COUNT | EMPTY MAP FOR NULL KEY
//AGGREGATE THEM AS ARRAY
val dfWithArrayOfMaps = groupedCountDf
.withColumn("newMap", when($"a".isNotNull, map($"a",$"count")).otherwise(map()))
.groupBy("b").agg(collect_list($"newMap") as "multimap")
//EXPRESSION TO CONVERT ARRAY[MAP] -> MAP
val mapConcatExpr = expr("aggregate(multimap, map(), (k, v) -> map_concat(k, v))")
val finalDf = dfWithArrayOfMaps.select($"b", mapConcatExpr.as("merged_data"))
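The last step, turning an array of single-entry maps into one map, can be pictured with plain Scala collections. This is only an analogy for the aggregate/map_concat expression, not Spark code, and the sample values are assumed from the example data for b = 1.

```scala
// One small map per (b, a) pair of the group b = 1;
// the empty map stands in for a row whose key was null.
val multimap: Seq[Map[String, Long]] =
  Seq(Map("b" -> 1L), Map("a" -> 2L), Map.empty)

// Start from an empty map and concatenate every element into it,
// mirroring aggregate(multimap, map(), (k, v) -> map_concat(k, v)).
val merged = multimap.foldLeft(Map.empty[String, Long])(_ ++ _)
```

Because the rows were first grouped by both b and a, every key appears in at most one of the small maps, so the concatenation never has to resolve duplicate keys.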

Here is a solution with a single groupBy and a slightly complex SQL expression. This solution works for Spark 2.4+.
df.groupBy("b")
.agg(expr("sort_array(collect_set(a)) as set"),
expr("sort_array(collect_list(a)) as list"))
.withColumn("res",
expr("map_from_arrays(set,transform(set, x -> size(filter(list, y -> y=x))))"))
.show()
Output:
+---+------+---------+----------------+
| b| set| list| res|
+---+------+---------+----------------+
| 1|[a, b]|[a, a, b]|[a -> 2, b -> 1]|
| 2| []| []| []|
+---+------+---------+----------------+
The idea is to collect the data from column a twice: once into a set and once into a list. Then, with the help of transform, the number of occurrences of each element of the set is counted in the list. Finally, the set and the counts are combined with map_from_arrays.
However, I cannot say whether this approach is really faster than a UDAF.
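The same idea can be sketched with plain Scala collections, which may help when reading the SQL expression. This is an analogy rather than Spark code; the sample values are taken from the group b = 1 of the example.

```scala
// Values of column a for the group b = 1 (nulls are not collected).
val list = Seq("a", "a", "b")      // plays the role of sort_array(collect_list(a))
val set  = list.distinct.sorted    // plays the role of sort_array(collect_set(a))

// transform(set, x -> size(filter(list, y -> y = x)))
val counts = set.map(x => list.count(_ == x))

// map_from_arrays(set, counts)
val res = set.zip(counts).toMap
```

Note that every element of the set triggers a full pass over the list, so the work per group grows with the number of distinct values times the number of rows in that group.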

Related

How to create multiples columns from a MapType columns efficiently (without foldleft)

My goal is to create columns from another MapType column. The names of the columns being the keys of the Map and their associated values.
Below my starting dataframe:
+-----------+---------------------------+
|id | mapColumn |
+-----------+---------------------------+
| 1 |Map(keyA -> 0, keyB -> 1) |
| 2 |Map(keyA -> 4, keyB -> 2) |
+-----------+---------------------------+
Below the desired output:
+-----------+----+----+
|id |keyA|keyB|
+-----------+----+----+
| 1 | 0| 1|
| 2 | 4| 2|
+-----------+----+----+
I found a solution using foldLeft with an accumulator (it works but is extremely slow):
val colsToAdd = startDF.collect()(0)(1).asInstanceOf[Map[String,Integer]].map(x => x._1).toSeq
res1: Seq[String] = List(keyA, keyB)
val endDF = colsToAdd.foldLeft(startDF)((startDF, key) => startDF.withColumn(key, lit(0)))
//(lit(0) for testing)
The real starting dataframe being enormous, I need optimization.
You could simply use explode function to explode the map type column and then use pivot to get each key as new column. Something like this:
val df = Seq((1,Map("keyA" -> 0, "keyB" -> 1)), (2,Map("keyA" -> 4, "keyB" -> 2))
).toDF("id", "mapColumn")
df.select($"id", explode($"mapColumn"))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
.show()
Gives:
+---+----+----+
| id|keyA|keyB|
+---+----+----+
| 1| 0| 1|
| 2| 4| 2|
+---+----+----+
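What explode followed by pivot does can be modelled with plain Scala collections. The sketch below is only an illustration of the data flow, not Spark code; None is used here as a stand-in for the null that pivot would produce when a key is missing for an id.

```scala
val rows = Seq(
  1 -> Map("keyA" -> 0, "keyB" -> 1),
  2 -> Map("keyA" -> 4, "keyB" -> 2)
)

// explode: one (id, key, value) triple per map entry
val exploded = for {
  (id, m) <- rows
  (k, v)  <- m.toSeq
} yield (id, k, v)

// pivot: the distinct keys become the new column names
val columns = exploded.map(_._2).distinct.sorted

// groupBy(id), then look up the value of each key for that id
val pivoted = exploded.groupBy(_._1).map { case (id, triples) =>
  val byKey = triples.map(t => t._2 -> t._3).toMap
  id -> columns.map(byKey.get)
}
```

The distinct keys have to be known before the output columns can be laid out, which is the extra pass that pivot performs when no explicit list of values is supplied.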

Spark GroupBy and Aggregate Strings to Produce a Map of Counts of the Strings Based on a Condition

I have a dataframe with multiple columns, two of which are id and label, as shown below.
+---+-----+
| id|label|
+---+-----+
|  1|"abc"|
|  1|"abc"|
|  1|"def"|
|  2|"def"|
|  2|"def"|
+---+-----+
I want to groupBy "id" and aggregate the label column into a map of label counts (ignoring nulls). The expected result is shown below:
+---+------------------+
| id|label             |
+---+------------------+
|  1|{"abc":2, "def":1}|
|  2|{"def":2}         |
+---+------------------+
Is it possible to do this without using user-defined aggregate functions? I saw a similar answer here, but it doesn't aggregate based on the count of each item.
I apologize if this question is silly, I am new to both Scala and Spark.
Thanks
Without Custom UDFs
import org.apache.spark.sql.functions.{map, collect_list}
df.groupBy("id", "label")
.count
.select($"id", map($"label", $"count").as("map"))
.groupBy("id")
.agg(collect_list("map"))
.show(false)
+---+------------------------+
|id |collect_list(map) |
+---+------------------------+
|1 |[[def -> 1], [abc -> 2]]|
|2 |[[def -> 2]] |
+---+------------------------+
Using Custom UDF,
import org.apache.spark.sql.functions.udf
val customUdf = udf((seq: Seq[String]) => {
seq.groupBy(x => x).map(x => x._1 -> x._2.size)
})
df.groupBy("id")
.agg(collect_list("label").as("list"))
.select($"id", customUdf($"list").as("map"))
.show(false)
+---+--------------------+
|id |map |
+---+--------------------+
|1 |[abc -> 2, def -> 1]|
|2 |[def -> 2] |
+---+--------------------+

Writing Spark UDAFs in Scala to return Array type as output

I have a dataframe as below -
val myDF = Seq(
(1,"A",100),
(1,"E",300),
(1,"B",200),
(2,"A",200),
(2,"C",300),
(2,"D",100)
).toDF("id","channel","time")
myDF.show()
+---+-------+----+
| id|channel|time|
+---+-------+----+
| 1| A| 100|
| 1| E| 300|
| 1| B| 200|
| 2| A| 200|
| 2| C| 300|
| 2| D| 100|
+---+-------+----+
For each id, I want the channels sorted by time in ascending order. I want to implement a UDAF for this logic.
I would like to call this UDAF as follows:
scala > spark.sql("""select customerid , myUDAF(customerid,channel,time) group by customerid """).show()
The output dataframe should look like this:
+---+-------+
| id|channel|
+---+-------+
| 1|[A,B,E]|
| 2|[D,A,C]|
+---+-------+
I am trying to write a UDAF but am unable to implement it:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class myUDAF extends UserDefinedAggregateFunction {
  // These are the input fields for the aggregate function
  override def inputSchema : StructType =
    StructType(
      StructField("id" , IntegerType) ::
      StructField("channel", StringType) ::
      StructField("time", IntegerType) :: Nil
    )
  // These are the internal fields we keep for computing the aggregate
  // output
  override def bufferSchema : StructType =
    StructType(
      StructField("Sequence", ArrayType(StringType)) :: Nil
    )
  // This is the output type of my aggregate function
  override def dataType : DataType = ArrayType(StringType)
  // no comments here
  override def deterministic : Boolean = true
  // initialize
  override def initialize(buffer: MutableAggregationBuffer) : Unit = {
    buffer(0) = Seq("")
  }
  // update, merge and evaluate are still missing
}
Please help.
This will do it (no need to define your own UDF):
df.groupBy("id")
.agg(sort_array(collect_list( // NOTE: sort based on the first element of the struct
struct("time", "channel"))).as("stuff"))
.select("id", "stuff.channel")
.show(false)
+---+---------+
|id |channel |
+---+---------+
|1 |[A, B, E]|
|2 |[D, A, C]|
+---+---------+
I would not write a UDAF for that. In my experience, UDAFs are rather slow, especially with complex types. I would use the collect_list & UDF approach:
val sortByTime = udf((rws:Seq[Row]) => rws.sortBy(_.getInt(0)).map(_.getString(1)))
myDF
.groupBy($"id")
.agg(collect_list(struct($"time",$"channel")).as("channel"))
.withColumn("channel", sortByTime($"channel"))
.show()
+---+---------+
| id| channel|
+---+---------+
| 1|[A, B, E]|
| 2|[D, A, C]|
+---+---------+
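A much simpler way without a UDF. Note that Spark does not guarantee that collect_list preserves the orderBy ordering after the shuffle caused by groupBy, so check the result on your data.
import org.apache.spark.sql.functions._
myDF.orderBy($"time".asc).groupBy($"id").agg(collect_list($"channel") as "channel").show()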
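The sort_array answer above relies on structs being compared field by field, so putting time first makes it the sort key. Scala tuples are ordered the same way, which allows a small plain-Scala sketch of the idea (an analogy, not Spark code; the values are those of id = 2 in the example).

```scala
// (time, channel) pairs for id = 2, in input order
val collected = Seq((200, "A"), (300, "C"), (100, "D"))

// Tuples are ordered by their first element, then by the second,
// so sorting the pairs sorts the channels by time.
val channels = collected.sorted.map(_._2)
```

Since the sort happens on the collected values themselves, the result does not depend on the order in which the rows arrived.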

Spark find key/value pairs with key equals to other values and join

If we have the following key value pairs:
[T,V] [V,W] [A,B] [B,C]
I need the result to be
[T,V] [V,W] [T,W] [A,B] [B,C] [A,C]
So basically to generate [T,W] from [T,V] and [V,W] and append to the existing set
I'm not sure how to do this in spark with scala, please help.
val df = sc.parallelize(
Array(("T","V"),("V","W"),("A","B"),("B","C"))
).toDF("key","value")
df.show
+---+-----+
|key|value|
+---+-----+
| T| V|
| V| W|
| A| B|
| B| C|
+---+-----+
df.join(
df.toDF("keyR", "valueR"),
$"value" === $"keyR"
).explode($"key",$"value",$"keyR",$"valueR"){row => Seq(
(row.getString(0), row.getString(1)),
(row.getString(2), row.getString(3)),
(row.getString(0), row.getString(3))
)}.select($"_1" as "key", $"_2" as "value").show
+---+-----+
|key|value|
+---+-----+
| A| B|
| B| C|
| A| C|
| T| V|
| V| W|
| T| W|
+---+-----+
Using purely Scala collection functions (in Set) - I don't use Spark:
val ex = Set("T" -> "V", "V" -> "W", "A" -> "B", "B" -> "C")
val keysEquallingValues = ex.flatMap { tuple =>
ex.find(t => tuple._2 == t._1).map(t => tuple -> t)
}
val r = ex ++ keysEquallingValues.map(pair => pair._1._1 -> pair._2._2)
Explanation:
ex is your example input Set
We flatMap over it, using an expression that returns an Option[((String,String), (String, String))]. That is, if the condition "is there a tuple with a key the same as the current value?" is true, we get a Some containing a tuple of the two tuples (!) that satisfy the condition.
Using flatMap and Option like this allows us to drop out non-matching cases (like a filter) but also simultaneously transform the content of the collection in the one pass.
Finally we cherrypick the key of the first tuple and the value of the second, to get the desired combination, and add it to your original Set.
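One thing to keep in mind is that find returns at most one match per tuple. If a value could match several keys, a for-comprehension over all pairs is one possible variant; the sketch below is an assumption about what may be wanted in that case and, like the code above, performs a single derivation step rather than a full transitive closure.

```scala
val ex = Set("T" -> "V", "V" -> "W", "A" -> "B", "B" -> "C")

// Pair every tuple with every tuple whose key equals its value.
val derived = for {
  (k1, v1) <- ex
  (k2, v2) <- ex
  if v1 == k2
} yield k1 -> v2

// Append the derived pairs to the original set.
val r = ex ++ derived
```

This compares every tuple with every other tuple, so it is only suitable for small in-memory sets.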

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A B C
--------------
4 blah 2
2 3
56 foo 3
And add a column to the end based on whether B is empty or not:
A B C D
--------------------
4 blah 2 1
2 3 0
56 foo 3 1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
| A| B| C| D|
+---+----+---+---+
| 4|blah| 2| 1|
| 2| | 3| 0|
| 56| foo| 3| 1|
|100|null| 5| 0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0 but, as noted in the source comment of when, it works on version 1.4.0 and later.
My bad, I had missed one part of the question.
Best, cleanest way is to use a UDF.
Explanation within the code.
// create some example data...BY DataFrame
// note, third record has an empty string
case class Stuff(a:String,b:Int)
val d= sc.parallelize(Seq( ("a",1),("b",2),
("",3) ,("d",4)).map { x => Stuff(x._1,x._2) }).toDF
// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty, 1 otherwise
val func = udf( (s:String) => if(s.isEmpty) 0 else 1 )
// create new dataframe with added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )
scala> r.show
+---+---+--------+
| a| b|notempty|
+---+---+--------+
| a| 1| 1|
| b| 2| 1|
| | 3| 0|
| d| 4| 1|
+---+---+--------+
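The function passed to udf above calls s.isEmpty directly, which throws a NullPointerException when the column contains null. A null-safe variant of the predicate could look like the following plain-Scala sketch; notEmptyFlag is a hypothetical helper name, and treating null like an empty string is an assumption about the desired behaviour.

```scala
// Hypothetical helper: 1 when the string is present and non-empty, 0 otherwise.
def notEmptyFlag(s: String): Int =
  if (s == null || s.isEmpty) 0 else 1

val flags = Seq("a", "", null, "d").map(notEmptyFlag)
```

The same function body could then be wrapped with udf in place of the original lambda.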
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
case Array() => df
case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.