Spark Scala: Aggregate DataFrame Column Values into an Ordered List - scala

I have a Spark Scala DataFrame with four columns: (id, day, val, order). I want to create a new DataFrame with columns (id, day, value_list: List(val1, val2, ..., valn)), where val1 through valn are sorted in ascending order by the order value.
For instance:
(50, 113, 1, 1),
(50, 113, 1, 3),
(50, 113, 2, 2),
(51, 114, 1, 2),
(51, 114, 2, 1),
(51, 113, 1, 1)
would become:
((51,113),List(1))
((51,114),List(2, 1))
((50,113),List(1, 2, 1))
I'm close, but I don't know what to do after I've aggregated the data into a list. I'm not sure how to have Spark order each value list by the order int:
import org.apache.spark.sql.Row
val testList = List((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1))
val testDF = sqlContext.sparkContext.parallelize(testList).toDF("id1", "id2", "val", "order")
val rDD1 = testDF.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
where the output looks like:
((51,113),List((1,1)))
((51,114),List((1,2), (2,1)))
((50,113),List((1,3), (1,1), (2,2)))
The next step would be to produce:
((51,113),List((1,1)))
((51,114),List((2,1), (1,2)))
((50,113),List((1,1), (2,2), (1,3)))

You will just need to map over your RDD and use sortBy:
scala> val df = Seq((50, 113, 1, 1), (50, 113, 1, 3), (50, 113, 2, 2), (51, 114, 1, 2), (51, 114, 2, 1), (51, 113, 1, 1)).toDF("id1", "id2", "val", "order")
df: org.apache.spark.sql.DataFrame = [id1: int, id2: int, val: int, order: int]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rDD1 = df.map{case Row(key1: Int, key2: Int, val1: Int, val2: Int) => ((key1, key2), List((val1, val2)))}
rDD1: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[10] at map at <console>:28
scala> val rDD2 = rDD1.reduceByKey{case (x, y) => x ++ y}
rDD2: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = ShuffledRDD[11] at reduceByKey at <console>:30
scala> val rDD3 = rDD2.map(x => (x._1, x._2.sortBy(_._2)))
rDD3: org.apache.spark.rdd.RDD[((Int, Int), List[(Int, Int)])] = MapPartitionsRDD[12] at map at <console>:32
scala> rDD3.collect.foreach(println)
((51,113),List((1,1)))
((50,113),List((1,1), (2,2), (1,3)))
((51,114),List((2,1), (1,2)))
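To get the exact output shape asked for in the question (just the values, with the order ints dropped), one more map over the sorted lists is enough; a minimal sketch of that last step (the rDD4 name is just for illustration):
// drop the order component, keeping only the values in sorted order
val rDD4 = rDD3.map { case (key, vals) => (key, vals.map(_._1)) }
rDD4.collect.foreach(println)
// ((51,113),List(1))
// ((50,113),List(1, 2, 1))
// ((51,114),List(2, 1))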

testDF.groupBy("id1","id2").agg(collect_list($"val")).show
+---+---+-----------------+
|id1|id2|collect_list(val)|
+---+---+-----------------+
| 51|113|              [1]|
| 51|114|           [1, 2]|
| 50|113|        [1, 1, 2]|
+---+---+-----------------+
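Note that collect_list on its own gives no guarantee about ordering with respect to the order column. To stay entirely in the DataFrame API, one possible sketch is to collect (order, val) structs and sort them; this assumes a Spark version where sort_array, struct, and collect_list are all available, and relies on structs comparing field by field, left to right:
import org.apache.spark.sql.functions.{collect_list, sort_array, struct}

val ordered = testDF
  .groupBy($"id1", $"id2")
  .agg(sort_array(collect_list(struct($"order", $"val"))).alias("sorted"))
  .withColumn("value_list", $"sorted.val")  // extract just the val field from each struct
  .drop("sorted")
ordered.show()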

Related

Grouping by an alternating sequence in Spark

I have a set of data where I can identify an alternating sequence. However, I want to group this data into one chunk, leaving all the other data unchanged. That is, wherever flickering is occurring in id, I want to overwrite that group of id with the single id that makes sense for the ordering. As a small example, consider
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4), // flickering; id=0
("a", "silom", 1, 5), // flickering; id=0
("a", "silom", 0, 6), // flickering; id=0
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11), // flickering and so on
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val resultDataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 15), // grouped by flickering summing on time_sec
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 60),
("a", "silom", 5, 15). // grouped by flickering summing on time_sec
).toDF("user", "cat", "id", "time_sec")
Now a more realistic MWE. In this case we can have multiple users and cats; unfortunately, this approach doesn't use the DataFrame API and needs to collect the data to the driver. It isn't scalable and needs to call getGrps recursively, dropping the length of the returned array of indices each time.
How can I implement this using the DataFrame API, so that I don't need to collect the data to the driver (which would be impossible due to its size)? Also, if there is a better way to do this, what would that be?
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.DoubleType
import scala.collection.mutable.WrappedArray
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4),
("a", "silom", 1, 5),
("a", "silom", 0, 6),
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11),
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15),
("a", "suk", 18, 1),
("a", "suk", 19, 2),
("a", "suk", 20, 3),
("a", "suk", 21, 4),
("a", "suk", 20, 5),
("a", "suk", 21, 6),
("a", "suk", 0, 7),
("a", "suk", 1, 8),
("a", "suk", 2, 9),
("a", "suk", 3, 10),
("a", "suk", 4, 11),
("a", "suk", 3, 12),
("a", "suk", 4, 13),
("a", "suk", 3, 14),
("a", "suk", 5, 15),
("b", "silom", 4, 1),
("b", "silom", 3, 2),
("b", "silom", 2, 3),
("b", "silom", 1, 4),
("b", "silom", 0, 5),
("b", "silom", 1, 6),
("b", "silom", 2, 7),
("b", "silom", 3, 8),
("b", "silom", 4, 9),
("b", "silom", 5, 10),
("b", "silom", 6, 11),
("b", "silom", 7, 12),
("b", "silom", 8, 13),
("b", "silom", 9, 14),
("b", "silom", 10, 15),
("b", "suk", 11, 1),
("b", "suk", 12, 2),
("b", "suk", 13, 3),
("b", "suk", 14, 4),
("b", "suk", 13, 5),
("b", "suk", 14, 6),
("b", "suk", 13, 7),
("b", "suk", 12, 8),
("b", "suk", 11, 9),
("b", "suk", 10, 10),
("b", "suk", 9, 11),
("b", "suk", 8, 12),
("b", "suk", 7, 13),
("b", "suk", 6, 14),
("b", "suk", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val recastDataDF = dataDF.withColumn("id", $"id".cast(DoubleType))
val category = recastDataDF.select("cat").distinct.collect.map(x => x(0).toString)
val data = recastDataDF
.select($"*" +: category.map(
name =>
lag("id", 1).over(
Window.partitionBy("user", "cat").orderBy("time_sec")
)
.alias(s"lag_${name}_id")): _*)
.withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
.otherwise(($"lag_suk_id" - $"id")))
.drop("lag_silom_id", "lag_suk_id")
.withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
.withColumn("zipped", array("user", "cat", "sequencing_diff", "rn", "id"))
// non dataframe API approach (not scalable)
// needs to collect data to driver to process
val iterTuples = data.select("zipped").collect.map(x => x(0).asInstanceOf[WrappedArray[Any]]).map(x => x.toArray)
val shifted: Array[Array[Any]] = iterTuples.drop(1)
val combined = iterTuples
.zipAll(shifted, Array("", "", Double.NaN, Double.NaN, Double.NaN), Array("", "", Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map{
case (data0, data1) =>
if(data1(3).toString.toDouble > 2 && data0(3).toString.toDouble > 2 && data1(0) == data0(0) && data1(1) == data0(1)) {
if(data0(2) != data1(2) && data0(2).toString.toDouble + data1(2).toString.toDouble == 0) {
(data1(0), data1(1), data1(3), data0(4))
}
else ("", "", Double.NaN, Double.NaN)
}
else ("", "", Double.NaN, Double.NaN)
}
.filter(t => t._1 != "" && t._2 != "" && t._3 == t._3 && t._4 == t._4) // fast NaN removal
val typeMappedArray = testArr.map(x => (x._1.toString, x._2.toString, x._3.toString.toDouble, x._4.toString.toDouble))
def getGrps(arr: Array[(String, String, Double, Double)]): (Array[Double], Double, String, String) = {
if(arr.nonEmpty) {
val user = arr.take(1)(0)._1
val cat = arr.take(1)(0)._2
val rowNum = arr.take(1)(0)._3
val keepID = arr.take(1)(0)._4
val newArr = arr.drop(1)
val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
case (tups, idx) =>
if(rowNum + idx + 1 == tups._3) {
rowNum + 1 + idx
}
else Double.NaN
}
.filter(v => v == v)
(rowNums, keepID, user, cat)
}
else (Array(Double.NaN), Double.NaN, "", "")
}
// after overwriting, this would allow me to group by user, cat, id to sum the time
getGrps(typeMappedArray) // returns row numbers to overwrite, the id value to overwrite them with, user, cat
res0: (Array[Double], Double, String, String) = (Array(5.0, 6.0, 7.0),0.0,a,silom)
getGrps(typeMappedArray.drop(3))
res1: (Array[Double], Double, String, String) = (Array(11.0, 12.0, 13.0, 14.0),4.0,a,silom)
A second approach uses collect_list, but it relies on getGrps working recursively, which I cannot get working properly. Here is the code I have so far, with getGrps modified for the collect_list approach, minus the recursion.
val data = recastDataDF
.select($"*" +: category.map(
name =>
lag("id", 1).over(
Window.partitionBy("user", "cat").orderBy("time_sec")
)
.alias(s"lag_${name}_id")): _*)
.withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
.otherwise(($"lag_suk_id" - $"id")))
.drop("lag_silom_id", "lag_suk_id")
.withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
.withColumn("id_rn", array($"id", $"rn", $"sequencing_diff"))
.groupBy($"user", $"cat").agg(collect_list($"id_rn").alias("array_data"))
// collect one row to develop how the UDF would work
val testList = data.where($"user" === "a" && $"cat" === "silom").select("array_data").collect
.map(x => x(0).asInstanceOf[WrappedArray[WrappedArray[Any]]])
.map(x => x.toArray)
.head
.map(x => (x(0).toString.toDouble, x(1).toString.toDouble, x(2).asInstanceOf[Double]))
// this code would be in the UDF; that is, we would pass array_data to the UDF
scala.util.Sorting.stableSort(testList, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
val shifted: Array[(Double, Double, Double)] = testList.drop(1)
val combined = testList
.zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map{
case (data0, data1) =>
if(data0._3 != data1._3 && data0._2 > 1) {
(data0._2, data0._1)
}
else (Double.NaN, Double.NaN)
}
.filter(t => t._1 == t._1 && t._2 == t._2) // fast NaN removal
// called inside the UDF
def getGrps2(arr: Array[(Double, Double)]): (Array[Double], Double) = {
// no need for user or cat
if(arr.nonEmpty) {
val rowNum = arr.take(1)(0)._1
val keepID = arr.take(1)(0)._2
val newArr = arr.drop(1)
val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
case (tups, idx) =>
if(rowNum + idx + 1 == tups._1) {
rowNum + 1 + idx
}
else Double.NaN
}
.filter(v => v == v)
(rowNums, keepID)
}
else (Array(Double.NaN), Double.NaN)
}
We would then .withColumn("data_to_update", udf) and the data_to_update column would be a WrappedArray[Tuple2[Array[Double], Double]] containing the row numbers and the id to overwrite them with. The result for user a, cat silom would be
WrappedArray((Array(4.0, 5.0, 6.0),0.0), (Array(10.0, 11.0, 12.0, 13.0),4.0))
The array pieces are the row numbers, and the Double is the id to update those rows with.
The following recursive method, applied in a UDF operating on the array_data column, will create the desired results:
def getGrps(arr: Array[(Double, Double)]): Array[(Array[Double], Double)] = {
def returnAlternatingIDs(arr: Array[(Double, Double)],
altIDs: Array[(Array[Double], Double)]): Array[(Array[Double], Double)] = arr match {
case arr if arr.nonEmpty =>
val rowNum = arr.take(1)(0)._1
val keepID = arr.take(1)(0)._2
val newArr = arr.drop(1)
val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
case (tups, idx) =>
if(rowNum + idx + 1 == tups._1) {
rowNum + 1 + idx
}
else {
Double.NaN
}
}
.filter(v => v == v)
val updateArray = altIDs ++ Array((rowNums, keepID))
returnAlternatingIDs(arr.drop(rowNums.length), updateArray)
case _ => altIDs
}
returnAlternatingIDs(arr, Array((Array(Double.NaN), Double.NaN))).drop(1)
}
The return value for the first collect_list is Array((Array(5.0, 6.0, 7.0),0.0), (Array(11.0, 12.0, 13.0, 14.0),4.0)) as desired.
Full UDF
import org.apache.spark.sql.expressions.UserDefinedFunction
import scala.util.Sorting.stableSort
val identifyFlickeringIDs: UserDefinedFunction = udf {
(colArrayData: WrappedArray[WrappedArray[Double]]) =>
val newArray: Array[(Double, Double, Double)] = colArrayData.toArray
.map(x => (x(0).toDouble, x(1).toDouble, x(2).toDouble))
// sort array by rn via less than relation
stableSort(newArray, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
val shifted: Array[(Double, Double, Double)] = newArray.toArray.drop(1)
val combined = newArray
.zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
val parsedArray = combined.map{
case (data0, data1) =>
if(data0._3 != data1._3 && data0._2 > 1 && data0._3 + data1._3 == 0) {
(data0._2, data0._1)
}
else (Double.NaN, Double.NaN)
}
.filter(t => t._1 == t._1 && t._2 == t._2) // fast NaN removal
getGrps(parsedArray).filter(data => data._1.length > 1)
}
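As a rough usage sketch, the UDF would be applied to the array_data column of the grouped data DataFrame from the second approach above (flaggedDF is just an illustrative name; the final overwrite-and-regroup step is not shown):
// attach the flickering groups (row numbers + replacement id) to each (user, cat) row
val flaggedDF = data.withColumn("data_to_update", identifyFlickeringIDs($"array_data"))
flaggedDF.show(false)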

printing length of string of Tuples in Scala

I am a newbie to Scala.
I have a tuple of (Int, String) pairs:
((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
For the above list, I want to print all strings where the corresponding length is 4.
val tuples = List((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
println(tuples.map(x => (x._2, x._2.length)))
//List((alpha,5), (beta,4), (gamma,5), (zeta,4), (omega,5))
I want to print all strings where the corresponding length is 4
You can filter first and then print as
val tuples = List((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
tuples.filter(_._2.length == 4).foreach(x => println(x._2))
it should print
beta
zeta
You can convert your tuple to a List and then map and filter as you need:
tuple.productIterator.toList
.map{case (a,b) => b.toString}
.filter(_.length==4)
Example:
For the given input:
val tuple = ((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
tuple: ((Int, String), (Int, String), (Int, String), (Int, String), (Int, String)) = ((1,alpha),(2,beta),(3,gamma),(4,zeta),(5,omega))
Output:
List[String] = List(beta, zeta)
Let's suppose you have a list of tuples and you need all the values whose string length equals 4.
You can do a filter on the list:
val filteredList = list.filter(_._2.length == 4)
And then iterate over each element to print them:
filteredList.foreach(tuple => println(tuple._2))
Here is a way to achieve this:
scala> val x = ((1, "alpha"), (2, "beta"), (3, "gamma"), (4, "zeta"), (5, "omega"))
x: ((Int, String), (Int, String), (Int, String), (Int, String), (Int, String)) = ((1,alpha),(2,beta),(3,gamma),(4,zeta),(5,omega))
scala> val y = x.productIterator.toList.collect{
case ele : (Int, String) if ele._2.length == 4 => ele._2
}
y: List[String] = List(beta, zeta)

Join two lists with unequal length in Scala

I have 2 lists:
val list_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val list_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
I want to replace each key in the second list with its value from the first list, resulting in a new list that looks like:
List ((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
This is my attempt:
object UniqueTest {
def main(args: Array[String]){
val l_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val l_2 = List((1, 111), (2,122), (3, 133), (4, 144), (1, 123), (2, 234))
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
print(l_3)
}
def f(i: Int, list: List[(Int, Int)]): Int = {
for(pair <- list){
if(i == pair._1){
return pair._2
}
}
return 0
}
}
This results in:
((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
Is the program above a good way to do this? Are there built-in functions in Scala to handle this need, or another way to do this manipulation?
The only real over-complication you make is this line:
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
Your f function uses an imperative style to loop over a list to find a key. Any time you find yourself doing this, it's a good indication that what you want is a Map. Running the for loop for every element blows up the computational complexity, whereas a Map lets you fetch the value for a given key in O(1). With a Map you first convert your list of key-value pairs into a data structure that explicitly supports key-based lookup.
Thus, the first thing you should do is build your map. Scala provides a really easy way to do this with toMap:
val map_1 = list_1.toMap
Then it is just a matter of 'mapping':
val result = list_2.map { case (key, value) => (map_1.getOrElse(key, 0), value) }
This takes each pair in your list_2, looks up the first value (the key) in map_1, retrieves the corresponding value (or the default 0), and puts it as the first element of a new tuple.
You can do:
val map = l_1.toMap // transform l_1 to a Map[Int, Int]
// for each (a, b) in l_2, retrieve the new value v of a and return (v, b)
val res = l_2.map { case (a, b) => (map.getOrElse(a, 0), b) }
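For the lists in the question this produces the requested pairing:
// res: List[(Int, Int)] = List((11,111), (12,122), (13,133), (14,144), (11,123), (12,234))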
The most idiomatic way is zipping them together and then transforming according to your needs:
(list_1 zip list_2) map { case ((k1, v1), (k2, v2)) => (v1, v2) }

Make RDD from List in scala&spark

Original data:
ID, NAME, SEQ, NUMBER
A, John, 1, 3
A, Bob, 2, 5
A, Sam, 3, 1
B, Kim, 1, 4
B, John, 2, 3
B, Ria, 3, 5
To make the ID group list, I did the below:
val MapRDD = originDF.map { x => (x.getAs[String](colMap.ID), List(x)) }
val ListRDD = MapRDD.reduceByKey { (a: List[Row], b: List[Row]) => List(a, b).flatten }
My goal is to build this RDD (the purpose is, within each ID group, to find the previous row's (SEQ-1) NAME and the difference in NUMBER):
ID, NAME, SEQ, NUMBER, PRE_NAME, DIFF
A, John, 1, 3, NULL, NULL
A, Bob, 2, 5, John, 2
A, Sam, 3, 1, Bob, -4
B, Kim, 1, 4, NULL, NULL
B, John, 2, 3, Kim, -1
B, Ria, 3, 5, John, 2
Currently ListRDD would be like
A, ([A,John,1,3], [A,Bob,2,5], ..)
B, ([B,Kim,1,4], [B,John,2,3], ..)
This is the code I tried to build my goal RDD from ListRDD (it is not working as I want):
def myFunction(ListRDD: RDD[(String, List[Row])]) = {
var rows: List[Row] = Nil
ListRDD.foreach( row => {
rows ::: make(row._2)
})
// rows has nothing in it, and it's not an RDD
}
def make(eachList: List[Row]): List[Row] = {
eachList.foreach { x => // ... make PRE_NAME and DIFF in a new List
}
My final goal is to save this RDD as CSV (RDD.saveAsFile...). How can I make this RDD (not a List) from this data?
Window functions look like a good fit here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
val df = sc.parallelize(Seq(
("A", "John", 1, 3),
("A", "Bob", 2, 5),
("A", "Sam", 3, 1),
("B", "Kim", 1, 4),
("B", "John", 2, 3),
("B", "Ria", 3, 5))).toDF("ID", "NAME", "SEQ", "NUMBER")
val w = Window.partitionBy($"ID").orderBy($"SEQ")
df.select($"*",
lag($"NAME", 1).over(w).alias("PREV_NAME"),
($"NUMBER" - lag($"NUMBER", 1).over(w)).alias("DIFF"))

Distribute elements of the first list to another list/array

I have an Array[(List[String], (Int, Int))] like this:
((123, 456, 789), (1, 24))
((89, 284), (2, 6))
((125, 173, 88, 222), (3, 4))
I would like to distribute each element of the first list to the second list, like this
(123, (1, 24))
(456, (1, 24))
(789, (1, 24))
(89, (2, 6))
(284, (2, 6))
(125, (3, 4))
(173, (3, 4))
(88, (3, 4))
(222, (3, 4))
Can anyone help me with this? Thank you very much.
For input data defined as follows:
val data = Array((List("123", "456", "789"), (1, 24)), (List("89", "284"), (2, 6)), (List("125", "173", "88", "222"), (3, 4)))
you can use:
data.flatMap { case (l, ii) => l.map((_, ii)) }
which yields:
Array[(String, (Int, Int))] = Array(("123", (1, 24)), ("456", (1, 24)), ("789", (1, 24)), ("89", (2, 6)), ("284", (2, 6)), ("125", (3, 4)), ("173", (3, 4)), ("88", (3, 4)), ("222", (3, 4)))
which I believe matches what you are looking for.
Based on your example, it seemed to me that you were using a single type.
scala> val xs: List[(List[Int], (Int, Int))] =
| List( ( List(123, 456, 789), (1, 24) ),
| ( List(89, 284), (2,6)),
| ( List(125, 173, 88, 222), (3, 4)) )
xs: List[(List[Int], (Int, Int))] = List((List(123, 456, 789), (1,24)),
(List(89, 284),(2,6)),
(List(125, 173, 88, 222),(3,4)))
Then I wrote this function:
scala> def f[A](xs: List[(List[A], (A, A))]): List[(A, (A, A))] =
| for {
| x <- xs
| head <- x._1
| } yield (head, x._2)
f: [A](xs: List[(List[A], (A, A))])List[(A, (A, A))]
Apply f to xs.
scala> f(xs)
res9: List[(Int, (Int, Int))] = List((123,(1,24)), (456,(1,24)),
(789,(1,24)), (89,(2,6)), (284,(2,6)), (125,(3,4)),
(173,(3,4)), (88,(3,4)), (222,(3,4)))