Recently, I started working with Spark window functions and I am trying to understand what happens under the hood in the Spark executors when these windowing functions are applied.
My question is: since every window must be created with the partitionBy function, which means shuffling data across the cluster, is it normal to use multiple windows?
For example, I have this dataframe:
cli date item1 item2
--------- ---------- ------- -------
1234567 20191030 A D
1234567 20191030 B D
1234567 20191029 A E
1234567 20191026 A F
7456123 20191026 C D
7456123 20191025 C F
The aim is to calculate the frequency of each item, for each client, at every date, based on the history up to that date.
For example, client 1234567 at 20191030 has used 4 item1 values from 20191030 backwards, so the frequency of A is 3/4 and that of B is 1/4.
I chose to calculate these frequencies per day using windows, since a window computes a value for each row, but I need three windows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType

// This will give me the number of items used by a client
// on that day and in all history.
val lByCliWindow = Window.partitionBy(col("cli"))

// This will give me how many times a client used this exact item1 (e.g. A)
// on that day and back in history (here my history is 120 days).
val lByCliItem1Window = Window
  .partitionBy(col("cli"), col("item1"))
  .orderBy(to_timestamp(col("date"), "yyyyMMdd").cast(LongType))
  .rangeBetween(-86400L * 120, 0)

// This will give me how many times a client used this exact item2 (e.g. F)
// on that day and back in history (here my history is 120 days).
val lByCliItem2Window = Window
  .partitionBy(col("cli"), col("item2"))
  .orderBy(to_timestamp(col("date"), "yyyyMMdd").cast(LongType))
  .rangeBetween(-86400L * 120, 0)
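For context, counts over these windows would then be turned into frequencies roughly like this (a sketch; the intermediate column names and the final division are illustrative assumptions, not code from the original post):

import org.apache.spark.sql.functions.{col, count, lit}

// Sketch: per-client denominator and per-(client, item) numerators over the
// 120-day history; the ratio of the two gives the frequency on each row.
val withFreqs = df
  .withColumn("cli_total", count(lit(1)).over(lByCliWindow))
  .withColumn("item1_count", count(lit(1)).over(lByCliItem1Window))
  .withColumn("item2_count", count(lit(1)).over(lByCliItem2Window))
  .withColumn("frequency_item1", col("item1_count") / col("cli_total"))
  .withColumn("frequency_item2", col("item2_count") / col("cli_total"))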
The expected output is:
cli date frequency_item1 frequency_item2
--------- ---------- ------------------------- --------------------------------
1234567   20191030   Map(A -> 3/4, B -> 1/4)   Map(D -> 2/4, E -> 1/4, F -> 1/4)
1234567 20191029 Map(A -> 2/2) Map(E -> 1/2, F -> 1/2)
1234567 20191026 Map(A -> 1/1) Map(F -> 1/1)
7456123 20191026 Map(C -> 2/2) Map(D -> 1/2, F -> 1/2)
7456123 20191025 Map(C -> 1/1) Map(F -> 1/1)
When I run explain() on this approach, I see many Exchange (hashpartitioning) steps in the plan, which is expected since we do a partitionBy every time.
Given that I have almost 30 variables, this means partitioning the data 30 times, which is a lot of shuffling.
What I want to understand is whether this approach is normal. Will Spark perform these partitionings in parallel (creating multiple windows at the same time, and therefore partitioning the dataframe in several different ways at once), or sequentially?
Can we use multiple windows at all? And which is more costly: the groupBy shuffle or the partitionBy shuffle of the windows?
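One way to make the shuffle comparison concrete is to count the Exchange operators in the physical plan. A minimal sketch (ShuffleExchangeExec is the operator class in recent Spark versions, and resultDf is a placeholder for whichever DataFrame is being compared):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Count the shuffle (Exchange) operators in the executed physical plan.
def countShuffles(resultDf: DataFrame): Int =
  resultDf.queryExecution.executedPlan.collect {
    case e: ShuffleExchangeExec => e
  }.size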
Thank you for your replies, and don't hesitate to propose a different approach for calculating these frequencies with windows.
I have a solution that involves only one window. I'll explain with comments.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// The columns you are interested in
val items = df.columns.filter(_.startsWith("item"))

// collect_list aggregations: we group by cli and date so that there is a
// single row per (cli, date), each item column becoming a list of values.
// collect_list keeps every occurrence, which is what we need for frequencies.
val aggs = items.map(c => collect_list(col(c)) as c)

// A window over "cli", ordered by date. With an orderBy and no explicit frame,
// it spans from the start of the partition up to the current row, i.e. the
// whole history up to and including each date.
val win = Window.partitionBy("cli").orderBy("date")

// A UDF that computes the frequencies you want.
// It takes a Seq of Seq because of the first aggregation we do.
val compute_freqs = udf((s: Seq[Seq[String]]) => {
  val flat_s = s.flatten
  val total = flat_s.size
  flat_s.groupBy(identity).mapValues(_.size.toDouble / total)
})

// For each item, collect the values over the window and compute the frequencies.
val frequency_columns = items
  .map(item => compute_freqs(collect_list(col(item)) over win)
  .alias(s"frequency_$item"))

// Then put everything together.
val result = df
  .groupBy("cli", "date")
  .agg(aggs.head, aggs.tail: _*)
  .select((Seq("cli", "date").map(col) ++ frequency_columns): _*)
  .orderBy($"cli", $"date".desc)
And here is the result:
scala> result.show(false)
+-------+--------+----------------------+--------------------------------+
|cli |date |frequency_item1 |frequency_item2 |
+-------+--------+----------------------+--------------------------------+
|1234567|20191030|[A -> 0.75, B -> 0.25]|[D -> 0.5, F -> 0.25, E -> 0.25]|
|1234567|20191029|[A -> 1.0] |[F -> 0.5, E -> 0.5] |
|1234567|20191026|[A -> 1.0] |[F -> 1.0] |
|7456123|20191026|[C -> 1.0] |[D -> 0.5, F -> 0.5] |
|7456123|20191025|[C -> 1.0] |[F -> 1.0] |
+-------+--------+----------------------+--------------------------------+
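If the 120-day history limit from the question is also needed, the single window can presumably be bounded the same way as the original windows (a sketch reusing the timestamp-based ordering from the question; I have not benchmarked it):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, to_timestamp}

// Same single window, but restricted to the last 120 days of history,
// ordering by the date parsed as epoch seconds.
val win120 = Window
  .partitionBy("cli")
  .orderBy(to_timestamp(col("date"), "yyyyMMdd").cast("long"))
  .rangeBetween(-86400L * 120, 0)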
Related
I have a Spark dataframe in the following format:
Name LD_Value
A37 Map(10 -> 0.20,5 -> 0.30,17 -> 0.25)
A39 Map(11 -> 0.40,6 -> 0.67,24 -> 0.45)
I need to sort each record's LD_Value map by its keys, in descending order.
Expected output:
Name LD_Value
A37 Map(17 -> 0.25,10 -> 0.20,5 -> 0.30)
A39 Map(24 -> 0.45,11 -> 0.40,6 -> 0.67)
Is it possible to sort a map-type column in a Spark dataframe? I looked into Spark's higher-order functions but had no luck.
You can first get the keys of the map using the map_keys function, sort that array of keys, then use transform to look up the corresponding value for each key in the original map, and finally rebuild the map column from the two arrays with map_from_arrays.
For Spark 3+, you can sort the array of keys in descending order by passing a comparator function as the second argument to array_sort:
from pyspark.sql import functions as F
df1 = df.withColumn(
"LD_Value_keys",
F.expr("array_sort(map_keys(LD_Value), (x, y) -> case when x > y then -1 when x < y then 1 else 0 end)")
).withColumn("LD_Value_values", F.expr("transform(LD_Value_keys, x -> LD_Value[x])")) \
.withColumn("LD_Value", F.map_from_arrays(F.col("LD_Value_keys"), F.col("LD_Value_values"))) \
.drop("LD_Value_keys", "LD_Value_values")
df1.show()
#+----+----------------------------------+
#|Name|LD_Value |
#+----+----------------------------------+
#|A37 |[17 -> 0.25, 10 -> 0.2, 5 -> 0.3] |
#|A39 |[24 -> 0.45, 11 -> 0.4, 6 -> 0.67]|
#+----+----------------------------------+
For Spark < 3, you can sort an array in descending order using this UDF:
# array_sort_udf(array, reverse): if reverse=True, sorts in descending order
from pyspark.sql.types import ArrayType, StringType

array_sort_udf = F.udf(lambda arr, r: sorted(arr, reverse=r), ArrayType(StringType()))
And use it like this:
df.withColumn("LD_Value_keys", array_sort_udf(F.map_keys(F.col("LD_Value")), F.lit(True)))
I have the following 2 maps:
val map12:Map[(String,String),Double]=Map(("Sam","0203") -> 16216.0, ("Jam","0157") -> 50756.0, ("Pam","0129") -> 3052.0)
val map22:Map[(String,String),Double]=Map(("Jam","0157") -> 16145.0, ("Pam","0129") -> 15258.0, ("Sam","0203") -> -1638.0, ("Dam","0088") -> -8440.0,("Ham","0104") -> 4130.0,("Hari","0268") -> -108.0, ("Om","0169") -> 5486.0, ("Shiv","0181") -> 275.0, ("Brahma","0148") -> 18739.0)
In the first approach I am using foldLeft to achieve the merging and accumulation:
val t1 = System.nanoTime()
val merged1 = (map12 foldLeft map22)((acc, kv) => acc + (kv._1 -> (kv._2 + acc.getOrElse(kv._1, 0.0))))
val t2 = System.nanoTime()
println(" First Time taken :" + (t2 - t1))
In the second approach I am trying to use aggregate() function which supports parallel operation:
def merge(map12:Map[(String,String),Double], map22:Map[(String,String),Double]):Map[(String,String),Double]=
map12 ++ map22.map{case(k, v) => k -> (v + (map12.getOrElse(k, 0.0)))}
val inArr= Array(map12,map22)
val t5 = System.nanoTime()
val mergedNew12 = inArr.par.aggregate(Map[(String,String),Double]())(merge,merge)
val t6 = System.nanoTime()
println(" Second Time taken :"+ (t6-t5))
But I notice the foldLeft is much faster than the aggregate.
I am looking for advice on how to make this operation as efficient as possible.
If you want aggregate to be more efficient when run with par, try a Vector instead of an Array; it is one of the better collections for parallel algorithms.
On the other hand, parallel execution has some overhead, so with this little data it may not pay off.
With the data you gave us, Vector.par.aggregate is better than Array.par.aggregate, but the sequential Vector.aggregate is better than foldLeft.
val inVector= Vector(map12,map22)
val t7 = System.nanoTime()
val mergedNew12_2 = inVector.aggregate(Map[(String,String),Double]())(merge,merge)
val t8 = System.nanoTime()
println(" Third Time taken :"+ (t8-t7))
These are my times:
First Time taken :6431723
Second Time taken :147474028
Third Time taken :4855489
The following code gives me the amounts per filtered key. How do I sum and average all of these values together, i.e. combine the results across all filtered keys?
val f = p.groupBy(d => d.Id)
  .mapValues(totavg =>
    totavg.groupBy(_.Day).filterKeys(Set(2, 3, 4)).mapValues(_.map(_.Amount)))
Sample output:
Map(A9 -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
I would like to add all the values together and compute the total average,
i.e. (473.3 + 676.48 + 685.45 + 812.73) / 4.
For the given Map, you can apply flatMap twice to flatten out the nested values first, then calculate the average:
val m = Map("A9" -> Map(2 -> List(473.3, 676.48), 4 -> List(685.45, 812.73)))
val s = m.flatMap(_._2.flatMap(_._2))
// s: scala.collection.immutable.Iterable[Double] = List(473.3, 676.48, 685.45, 812.73)
s.sum/s.size
// res14: Double = 661.99
Having the following RDD:
BBBBBBAAAAAAABAABBBBBBBB
AAAAABBAAAABBBAAABBCCCCC
I need to count the number of groups (consecutive runs) of each event, so for this example the expected output should be:
BBBBBBAAAAAAABAABBBBBBBB A -> 2 B -> 3
AAAAABBAAAABBBAAABBCCCCC A -> 3 B -> 4 C-> 1
Final Output -> A -> 5 B -> 7 C-> 1
I have implemented a split and then a sliding over each character to try to obtain these values, but I cannot get the expected result.
Thanks,
val baseRDD = sc.parallelize(Seq("BBBBBBAAAAAAABAABBBBBBBB", "AAAAABBAAAABBBAAABBCCCC"))

baseRDD
  .flatMap(x => "(\\w)\\1*".r.findAllMatchIn(x).map(m => (m.matched.charAt(0), 1)).toList)
  .reduceByKey((accum, current) => accum + current)
  .foreach(println(_))
Result
(C,1)
(B,6)
(A,5)
Hope this is what you wanted.
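If the per-line counts from the expected output are needed as well, the same regex can be applied per record before the global reduce. A small sketch, assuming the same baseRDD:

// For each input string, count how many runs (consecutive groups)
// of each character it contains, keeping the string alongside the counts.
val perLine = baseRDD.map { s =>
  val runCounts = "(\\w)\\1*".r
    .findAllMatchIn(s)
    .toList
    .groupBy(_.matched.charAt(0))
    .map { case (ch, runs) => ch -> runs.size }
  (s, runCounts)
}
perLine.collect().foreach(println)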
I need to write values with key 1 to file file1.txt and values with key 2 to file2.txt:
val ar = Array (1 -> 1, 1 -> 2, 1 -> 3, 1 -> 4, 1 -> 5, 2 -> 6, 2 -> 7, 2 -> 8, 2 -> 9)
val distAr = sc.parallelize(ar)
val grk = distAr.groupByKey()
How can I do this without iterating over the collection grk twice?
We write data from different customers to different tables, which is essentially the same use case. The common pattern we use is something like this:
val customers:List[String] = ???
customers.foreach { customer =>
  rdd.filter(record => belongsToCustomer(record, customer)).saveToFoo()
}
This probably does not fulfill the wish of 'not iterating over the RDD twice (or n times)', but filter is a cheap operation in a parallel, distributed environment, and it works, so I think it does comply with the 'general Spark way' of doing things.
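Applied to the example above, that pattern would look roughly like this (a sketch; note that saveAsTextFile produces a directory named file1.txt / file2.txt rather than a single plain file):

// One filter + save per key: each filter scans the RDD,
// but the scan is cheap and fully distributed.
Seq(1, 2).foreach { k =>
  distAr.filter { case (key, _) => key == k }
        .values
        .saveAsTextFile(s"file$k.txt")
}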