Scala transformation and action

I have an RDD[(String, List[Int])] like List(("A",List(1,2,3,4)), ("B",List(5,6,7))).
How can I transform it into List(("A",1),("A",2),("A",3),("A",4),("B",5),("B",6),("B",7))?
The action would then reduce by key and produce a result like List(("A",2.5),("B",6)), where 2.5 is the average for "A" and 6 is the average for "B".
I have tried map(e => List(e._1, e._2)), but it does not give the desired result.
Help me with this set of transformations and actions.
Thanks in advance.

There are several ways to get what you want. You could use a for comprehension as well, but the first one that came to my mind is this implementation:
val l = List(("A", List(1, 2, 3)), ("B", List(1, 2, 3)))

val flattenList = l.flatMap {
  case (elem, _elemList) =>
    _elemList.map((elem, _))
}
Output:
List((A,1), (A,2), (A,3), (B,1), (B,2), (B,3))
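If you also need the averages from these flattened pairs, here is a minimal follow-up sketch on the plain Scala collection used above (the next answer shows a Spark version that avoids flattening altogether):

val averages = flattenList
  .groupBy { case (key, _) => key }
  .map { case (key, pairs) => (key, pairs.map(_._2).sum.toDouble / pairs.size) }
  .toList
// for the sample data above this contains ("A", 2.0) and ("B", 2.0)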

If what you want in the end is the average of each list, it isn't necessary to break the lists up into individual elements with a flatMap. With a large data set, doing so would unnecessarily shuffle a lot of data in the subsequent aggregation.
Since they are already aggregated by key, just transform them with something like this:
val l = spark.sparkContext.parallelize(Seq(
  ("A", List(1, 2, 3, 4)),
  ("B", List(5, 6, 7))
))

val avg = l.map(r => {
  (r._1, r._2.sum.toDouble / r._2.length.toDouble)
})
avg.collect.foreach(println)
Bear in mind that this won't behave sensibly if any of your lists is empty (the division yields NaN). If you have zero-length lists, you'll have to put a check condition in the map.
The above code gives you:
(A,2.5)
(B,6.0)
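A minimal sketch of that check, assuming an empty list should simply map to 0.0:

val avgSafe = l.map { case (key, values) =>
  // guard against empty lists before dividing
  val mean = if (values.isEmpty) 0.0 else values.sum.toDouble / values.length
  (key, mean)
}
avgSafe.collect.foreach(println)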

You can try explode()
scala> val df = List(("A",List(1,2,3,4)),("B",List(5,6,7))).toDF("x","y")
df: org.apache.spark.sql.DataFrame = [x: string, y: array<int>]
scala> df.withColumn("z",explode('y)).show(false)
+---+------------+---+
|x |y |z |
+---+------------+---+
|A |[1, 2, 3, 4]|1 |
|A |[1, 2, 3, 4]|2 |
|A |[1, 2, 3, 4]|3 |
|A |[1, 2, 3, 4]|4 |
|B |[5, 6, 7] |5 |
|B |[5, 6, 7] |6 |
|B |[5, 6, 7] |7 |
+---+------------+---+
scala> val df2 = df.withColumn("z",explode('y))
df2: org.apache.spark.sql.DataFrame = [x: string, y: array<int> ... 1 more field]
scala> df2.groupBy("x").agg(sum('z)/count('z) ).show(false)
+---+-------------------+
|x |(sum(z) / count(z))|
+---+-------------------+
|B |6.0 |
|A |2.5 |
+---+-------------------+
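As a side note, the same aggregation can be written with the built-in avg function; this line is a sketch and was not part of the original session, but it should produce the same averages:

df2.groupBy("x").agg(avg('z).as("avg_z")).show(false)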

Related

How to get all combinations of an array column in Spark?

Suppose I have an array column group_ids
+-------+----------+
|user_id|group_ids |
+-------+----------+
|1 |[5, 8] |
|3 |[1, 2, 3] |
|2 |[1, 4] |
+-------+----------+
Schema:
root
|-- user_id: integer (nullable = false)
|-- group_ids: array (nullable = false)
| |-- element: integer (containsNull = false)
I want to get all combinations of pairs:
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
So far I have created the simplest solution, using a UDF:
spark.udf.register("permutate", udf((xs: Seq[Int]) => xs.combinations(2).toSeq))
dataset.withColumn("group_ids", expr("permutate(group_ids)"))
What I'm looking for is something implemented via Spark built-in functions. Is there a way to implement the same logic without a UDF?
Some higher order functions can do the trick. Requires Spark >= 2.4.
val df2 = df.withColumn(
  "group_ids",
  expr("""
    filter(
      transform(
        flatten(
          transform(
            group_ids,
            x -> arrays_zip(
              array_repeat(x, size(group_ids)),
              group_ids
            )
          )
        ),
        x -> array(x['0'], x['group_ids'])
      ),
      x -> x[0] < x[1]
    )
  """)
)
df2.show(false)
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
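For reference, the same logic can also be expressed with the typed column DSL instead of a SQL string. This is a sketch assuming Spark 3.0+, where transform and filter are exposed as Scala functions (on 2.4 they are only available inside expr):

import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "group_ids",
  filter(
    transform(
      flatten(
        transform(
          col("group_ids"),
          // pair every element with the whole array
          x => arrays_zip(array_repeat(x, size(col("group_ids"))), col("group_ids"))
        )
      ),
      // keep the two values of each zipped struct as a 2-element array
      pair => array(pair("0"), pair("group_ids"))
    ),
    // drop self-pairs and mirrored duplicates
    pair => pair(0) < pair(1)
  )
)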
You can get the max size of the column group_ids. Then, using combinations on the range (1 to maxSize) together with when expressions, create the sub-array combinations from the original array, and finally filter out the null elements from the resulting array:
val maxSize = df.select(max(size($"group_ids"))).first.getAs[Int](0)

val newCol = (1 to maxSize).combinations(2)
  .map(c =>
    when(
      size($"group_ids") >= c(1),
      array(element_at($"group_ids", c(0)), element_at($"group_ids", c(1)))
    )
  ).toSeq

df.withColumn("group_ids", array(newCol: _*))
  .withColumn("group_ids", expr("filter(group_ids, x -> x is not null)"))
  .show(false)
//+-------+------------------------+
//|user_id|group_ids |
//+-------+------------------------+
//|1 |[[5, 8]] |
//|3 |[[1, 2], [1, 3], [2, 3]]|
//|2 |[[1, 4]] |
//+-------+------------------------+
A solution based on explode and joins:
val exploded = df.select(col("user_id"), explode(col("group_ids")).as("e"))

// to have combinations
val joined1 = exploded.as("t1")
  .join(exploded.as("t2"), Seq("user_id"), "outer")
  .select(col("user_id"), col("t1.e").as("e1"), col("t2.e").as("e2"))

// to filter out redundant combinations
val joined2 = joined1.as("t1")
  .join(joined1.as("t2"), $"t1.user_id" === $"t2.user_id" && $"t1.e1" === $"t2.e2" && $"t1.e2" === $"t2.e1")
  .where("t1.e1 < t2.e1")
  .select("t1.*")

// group into array
val result = joined2.groupBy("user_id")
  .agg(collect_set(struct("e1", "e2")).as("group_ids"))
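To inspect the result (note that collect_set gives no guarantee about the order of the pairs inside each array):

result.orderBy("user_id").show(false)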

Convert an array to custom string format in Spark with Scala

I created a DataFrame as follows:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
  (1, List(1,2,3)),
  (1, List(5,7,9)),
  (2, List(4,5,6)),
  (2, List(7,8,9)),
  (2, List(10,11,12))
).toDF("id", "list")
val df1 = df.groupBy("id").agg(collect_set($"list").as("col1"))
df1.show(false)
Then I tried to convert the WrappedArray row value to string as follows:
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
val d = df1.withColumn("col1", arrayToString($"col1"))
d: org.apache.spark.sql.DataFrame = [id: int, col1: string]
scala> d.show(false)
+---+----------------------------+
|id |col1 |
+---+----------------------------+
|1 |1, 2, 3, 5, 7, 9 |
|2 |4, 5, 6, 7, 8, 9, 10, 11, 12|
+---+----------------------------+
What I really want is to generate an output like the following:
+---+----------------------------+
|id |col1 |
+---+----------------------------+
|1 |1$2$3, 5$7$9 |
|2 |4$5$6, 7$8$9, 10$11$12 |
+---+----------------------------+
How can I achieve this?
You don't need a udf function; a simple concat_ws should do the trick for you:
import org.apache.spark.sql.functions._
val df1 = df.withColumn("list", concat_ws("$", col("list")))
  .groupBy("id")
  .agg(concat_ws(", ", collect_set($"list")).as("col1"))
df1.show(false)
which should give you
+---+----------------------+
|id |col1 |
+---+----------------------+
|1 |1$2$3, 5$7$9 |
|2 |7$8$9, 4$5$6, 10$11$12|
+---+----------------------+
As usual, udf functions should be avoided when built-in functions are available, since a udf requires serializing the column data to primitive types for the calculation and deserializing the results back to columns afterwards.
Even more concisely, you can avoid the withColumn step:
val df1 = df.groupBy("id")
  .agg(concat_ws(", ", collect_set(concat_ws("$", col("list")))).as("col1"))
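For reference, on Spark 2.4+ the inner concat_ws over the array could also be replaced by array_join; this is only a sketch under that assumption (the cast makes the integer array explicit as strings, and df2 is just a hypothetical name):

import org.apache.spark.sql.functions._

val df2 = df.groupBy("id")
  .agg(concat_ws(", ", collect_set(array_join(col("list").cast("array<string>"), "$"))).as("col1"))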
I hope the answer is helpful

PrefixSpan output formatting

I am trying to run following example code:
import org.apache.spark.mllib.fpm.PrefixSpan
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)

val model = prefixSpan.run(sequences)

model.freqSequences.collect().foreach { freqSequence =>
  println(
    freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") +
      ", " + freqSequence.freq
  )
}
I need to format model.freqSequences into something similar to the following (a DataFrame with sequence and freq):
|[WrappedArray(2,3)] | 3
|[WrappedArray(1)] | 2
|[WrappedArray(2,1)] | 1
Using flatten on freqSequence.sequence and applying toDF should give your desired output
model.freqSequences
  .map(freqSequence => (freqSequence.sequence.flatten, freqSequence.freq))
  .toDF("array", "freq")
  .show(false)
which should give you
+------+----+
|array |freq|
+------+----+
|[2] |3 |
|[3] |2 |
|[1] |3 |
|[2, 1]|3 |
|[1, 3]|2 |
+------+----+
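If you prefer to keep the nested sequence structure (closer to the format shown in the question) instead of flattening it, a sketch along these lines should also work, assuming spark.implicits._ is in scope so toDF can be called on the RDD:

model.freqSequences
  .map(fs => (fs.sequence.map(_.toSeq).toSeq, fs.freq))
  .toDF("sequence", "freq")
  .show(false)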
I hope the answer is helpful

How can I compute the average vector in a Spark Dataset w. Scala? [duplicate]

This question already has answers here:
How to find mean of grouped Vector columns in Spark SQL?
Let's say that I have a dataset in Apache Spark as follows:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[1, 2, 3, 4] |
| 0|[2, 3, 4, 5] |
| 0|[6, 7, 8, 9] |
| 1|[1, 2, 3, 4] |
| 1|[5, 6, 7, 8] |
+---+--------------------+
And the vec is a List of Doubles.
How can I create a dataset from this that contains the ids and the average of the vectors associated with that id, like so:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[3, 4, 5, 6] |
| 1|[3, 4, 5, 6] |
+---+--------------------+
Thanks in advance!
I created a case class to match the input schema of the Dataset, grouped the Dataset by id, and used foldLeft to accumulate the average for each index of the vectors in each group.
scala> case class Test(id: Int, vec: List[Double])
defined class Test
scala> val inputList = List(
| Test(0, List(1, 2, 3, 4)),
| Test(0, List(2, 3, 4, 5)),
| Test(0, List(6, 7, 8, 9)),
| Test(1, List(1, 2, 3, 4)),
| Test(1, List(5, 6, 7, 8)))
inputList: List[Test] = List(Test(0,List(1.0, 2.0, 3.0, 4.0)), Test(0,List(2.0, 3.0, 4.0, 5.0)), Test(0,List(6.0, 7.0, 8.0, 9.0)), Test(1,
List(1.0, 2.0, 3.0, 4.0)), Test(1,List(5.0, 6.0, 7.0, 8.0)))
scala>
scala> import spark.implicits._
import spark.implicits._
scala> val ds = inputList.toDF.as[Test]
ds: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> ds.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|0 |[1.0, 2.0, 3.0, 4.0]|
|0 |[2.0, 3.0, 4.0, 5.0]|
|0 |[6.0, 7.0, 8.0, 9.0]|
|1 |[1.0, 2.0, 3.0, 4.0]|
|1 |[5.0, 6.0, 7.0, 8.0]|
+---+--------------------+
scala>
scala> val outputDS = ds.groupByKey(_.id).mapGroups {
| case (key, valuePairs) =>
| val vectors = valuePairs.map(_.vec).toArray
| // compute the length of the vectors for each key
| val len = vectors.length
| // get average for each index in vectors
| val avg = vectors.head.indices.foldLeft(List[Double]()) {
| case (acc, idx) =>
| val sumOfIdx = vectors.map(_ (idx)).sum
| acc :+ (sumOfIdx / len)
| }
| Test(key, avg)
| }
outputDS: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> outputDS.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|1 |[3.0, 4.0, 5.0, 6.0]|
|0 |[3.0, 4.0, 5.0, 6.0]|
+---+--------------------+
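The per-index average can also be computed a bit more compactly with transpose instead of foldLeft; here is a sketch of an equivalent mapGroups body:

val outputDS2 = ds.groupByKey(_.id).mapGroups { (key, rows) =>
  val vectors = rows.map(_.vec).toList
  // transpose yields one list per index, which can be averaged directly
  Test(key, vectors.transpose.map(column => column.sum / vectors.length))
}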
Hope this helps!

How to replicate an element in Spark dataframe in Scala?

Suppose I have a DataFrame:
val testDf = sc.parallelize(Seq(
  (1, 2, "x", Array(1, 2, 3, 4))
)).toDF("one", "two", "X", "Array")
+---+---+---+------------+
|one|two| X| Array|
+---+---+---+------------+
| 1| 2| x|[1, 2, 3, 4]|
+---+---+---+------------+
I want to replicate the single elements, let's say 4 times, in order to achieve a single row DataFrame with each field as an array of four elements. The desired output would be:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
You can use the built-in array function to replicate a column of your choice n times.
Below is proof-of-concept code.
import org.apache.spark.sql.functions._
val replicate = (n: Int, colName: String) => array((1 to n).map(s => col(colName)):_*)
val replicatedCol = Seq("one", "two", "X").map(s => replicate(4, s).as(s))
val cols = col("Array") +: replicatedCol
val testDf = sc.parallelize(Seq(
  (1, 2, "x", Array(1, 2, 3, 4))
)).toDF("one", "two", "X", "Array").select(cols: _*)
testDf.show(false)
+------------+------------+------------+------------+
|Array |one |two |X |
+------------+------------+------------+------------+
|[1, 2, 3, 4]|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|
+------------+------------+------------+------------+
In case you want a different n for each column:
val testDf = sc.parallelize(Seq(
  (1, 2, "x", Array(1, 2, 3, 4))
)).toDF("one", "two", "X", "Array")
  .select(replicate(2, "one").as("one"), replicate(3, "X").as("X"), replicate(4, "two").as("two"), $"Array")
testDf.show(false)
+------+---------+------------+------------+
|one |X |two |Array |
+------+---------+------------+------------+
|[1, 1]|[x, x, x]|[2, 2, 2, 2]|[1, 2, 3, 4]|
+------+---------+------------+------------+
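On Spark 2.4+ there is also a built-in array_repeat function that can replace the hand-rolled replicate helper; this is a sketch under that assumption, starting again from the original single-row DataFrame (here named original for illustration):

import org.apache.spark.sql.functions.array_repeat

val original = sc.parallelize(Seq(
  (1, 2, "x", Array(1, 2, 3, 4))
)).toDF("one", "two", "X", "Array")

original.select(
  array_repeat($"one", 4).as("one"),
  array_repeat($"two", 4).as("two"),
  array_repeat($"X", 4).as("X"),
  $"Array"
).show(false)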
Well, here is my solution:
First declare the columns you want to replicate:
val columnsToReplicate = List("one", "two", "X")
Then define the replication factor and the udf to perform it:
val replicationFactor = 4

val replicate = (s: String) => {
  for {
    i <- 1 to replicationFactor
  } yield s
}

val replicateudf = functions.udf(replicate)
Then just perform the foldLeft on the DataFrame when the column name belongs to your list of desired column names:
testDf.columns.foldLeft(testDf) { (acc, colname) =>
  if (columnsToReplicate.contains(colname))
    acc.withColumn(colname, replicateudf(acc.col(colname)))
  else acc
}
Output:
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
+------------+------------+------------+------------+
Note: You need this import:
import org.apache.spark.sql.functions
EDIT:
A variable replicationFactor, as suggested in the comments:
val mapColumnsToReplicate = Map("one" -> 4, "two" -> 5, "X" -> 6)

val replicateudf2 = functions.udf((s: String, replicationFactor: Int) =>
  for {
    i <- 1 to replicationFactor
  } yield s
)

testDf.columns.foldLeft(testDf) { (acc, colname) =>
  if (mapColumnsToReplicate.keys.toList.contains(colname))
    acc.withColumn(colname, replicateudf2($"$colname", functions.lit(mapColumnsToReplicate(colname))))
  else acc
}
Output with those values above:
+------------+---------------+------------------+------------+
| one| two| X| Array|
+------------+---------------+------------------+------------+
|[1, 1, 1, 1]|[2, 2, 2, 2, 2]|[x, x, x, x, x, x]|[1, 2, 3, 4]|
+------------+---------------+------------------+------------+
You can use explode and groupBy/collect_list:
val testDf = sc.parallelize(
  Seq((1, 2, "x", Array(1, 2, 3, 4)),
      (3, 4, "y", Array(1, 2, 3)),
      (5, 6, "z", Array(1)))
).toDF("one", "two", "X", "Array")

testDf
  .withColumn("id", monotonically_increasing_id())
  .withColumn("tmp", explode($"Array"))
  .groupBy($"id")
  .agg(
    collect_list($"one").as("cl_one"),
    collect_list($"two").as("cl_two"),
    collect_list($"X").as("cl_X"),
    first($"Array").as("Array")
  )
  .select(
    $"cl_one".as("one"),
    $"cl_two".as("two"),
    $"cl_X".as("X"),
    $"Array"
  )
  .show()
+------------+------------+------------+------------+
| one| two| X| Array|
+------------+------------+------------+------------+
| [5]| [6]| [z]| [1]|
|[1, 1, 1, 1]|[2, 2, 2, 2]|[x, x, x, x]|[1, 2, 3, 4]|
| [3, 3, 3]| [4, 4, 4]| [y, y, y]| [1, 2, 3]|
+------------+------------+------------+------------+
This solution has the advantage that it does not rely on constant array sizes.