RxPy: How to convert stream of sets of elements into a stream of single elements - flatmap

I have two pieces of code:
- the first produces a stream of sets of the alerts currently active in the system;
- the second consumes raise/fall events for individual alerts.
Assuming the first part produces the following stream
["a", "b"],
["c"],
["e", "f", "g"],
I want to push these as
("a", True),
("b", True),
("c", True),
("a", False),
("b", False),
("e", True),
("f", True),
("g", True),
("c", False).
to the second part of the system.
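I can do the following
from __future__ import print_function  # the tuple-unpacking lambdas below are Python 2 syntax
from rx import Observable
import rx.subjects
events=[["a", "b"], ["c"], ["e", "f", "g"]]
alerts = Observable\
    .from_(events)\
    .map(lambda x : set(x))\
    .scan(lambda (prev, events), curr : (curr, {(i, True) for i in curr - prev}.union(\
        {(i, False) for i in prev - curr})),\
        (set(), set()))\
    .map(lambda (prev, events) : events)
subject = rx.subjects.Subject()
def my_flatten(set):
    # push every event of the set into the subject, one at a time
    for x in set:
        subject.on_next(x)
subject.subscribe(lambda x : print(x))
alerts.subscribe(my_flatten)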
This produces the following result, which is correct:
('a', True)
('b', True)
('b', False)
('a', False)
('c', True)
('c', False)
('g', True)
('e', True)
('f', True)
But I hoped to have a solution without a subject, something like the following
events=[["a", "b"], ["c"], ["e", "f", "g"]]
alerts = Observable\
    .from_(events)\
    .map(lambda x : set(x))\
    .scan(lambda (prev, events), curr : (curr, {(i, True) for i in curr - prev}.union(\
        {(i, False) for i in prev - curr})),\
        (set(), set()))\
    .flat_map(lambda (prev, events) : events)
alerts.subscribe(lambda x : print(x))
But it produces the following, which is incorrect: the set of active alerts cannot be reconstructed from it, because c ends up active at the end.
('a', True)
('b', True)
('b', False)
('a', False)
('c', False)
('c', True)
('g', True)
('e', True)
('f', True)
flat_map does not preserve the order. Is there another solution?
Thank you.


Grouping by an alternating sequence in Spark

I have a set of data in which I can identify an alternating sequence. I want to collapse each such sequence into one chunk while leaving all the other data unchanged. That is, wherever flickering occurs in id, I want to overwrite that group of ids with the single id that makes sense for the order. As a small example, consider
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4), // flickering; id=0
("a", "silom", 1, 5), // flickering; id=0
("a", "silom", 0, 6), // flickering; id=0
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11), // flickering and so on
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val resultDataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 15), // grouped by flickering summing on time_sec
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 60), // grouped by flickering summing on time_sec
("a", "silom", 5, 15)
).toDF("user", "cat", "id", "time_sec")
Now a more realistic MWE. In this case, we can have multiple users and cat values. Unfortunately, this approach doesn't use the DataFrame API and needs to collect the data to the driver. This isn't scalable, and it needs to call getGrps recursively, dropping as many elements as the length of the returned array of indices.
How can I implement this using the DataFrame API so that I don't need to collect the data to the driver, which would be impossible due to its size? Also, if there is a better way to do this, what would that be?
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.DoubleType
import scala.collection.mutable.WrappedArray
val dataDF = Seq(
("a", "silom", 3, 1),
("a", "silom", 2, 2),
("a", "silom", 1, 3),
("a", "silom", 0, 4),
("a", "silom", 1, 5),
("a", "silom", 0, 6),
("a", "silom", 1, 7),
("a", "silom", 2, 8),
("a", "silom", 3, 9),
("a", "silom", 4, 10),
("a", "silom", 3, 11),
("a", "silom", 4, 12),
("a", "silom", 3, 13),
("a", "silom", 4, 14),
("a", "silom", 5, 15),
("a", "suk", 18, 1),
("a", "suk", 19, 2),
("a", "suk", 20, 3),
("a", "suk", 21, 4),
("a", "suk", 20, 5),
("a", "suk", 21, 6),
("a", "suk", 0, 7),
("a", "suk", 1, 8),
("a", "suk", 2, 9),
("a", "suk", 3, 10),
("a", "suk", 4, 11),
("a", "suk", 3, 12),
("a", "suk", 4, 13),
("a", "suk", 3, 14),
("a", "suk", 5, 15),
("b", "silom", 4, 1),
("b", "silom", 3, 2),
("b", "silom", 2, 3),
("b", "silom", 1, 4),
("b", "silom", 0, 5),
("b", "silom", 1, 6),
("b", "silom", 2, 7),
("b", "silom", 3, 8),
("b", "silom", 4, 9),
("b", "silom", 5, 10),
("b", "silom", 6, 11),
("b", "silom", 7, 12),
("b", "silom", 8, 13),
("b", "silom", 9, 14),
("b", "silom", 10, 15),
("b", "suk", 11, 1),
("b", "suk", 12, 2),
("b", "suk", 13, 3),
("b", "suk", 14, 4),
("b", "suk", 13, 5),
("b", "suk", 14, 6),
("b", "suk", 13, 7),
("b", "suk", 12, 8),
("b", "suk", 11, 9),
("b", "suk", 10, 10),
("b", "suk", 9, 11),
("b", "suk", 8, 12),
("b", "suk", 7, 13),
("b", "suk", 6, 14),
("b", "suk", 5, 15)
).toDF("user", "cat", "id", "time_sec")
val recastDataDF = dataDF.withColumn("id", $"id".cast(DoubleType))
val category = recastDataDF.select("cat").distinct.collect.map(x => x(0).toString)
val data = recastDataDF
  .select($"*" +: category.map(
    name =>
      lag("id", 1).over(
        Window.partitionBy("user", "cat").orderBy("time_sec")
      ).alias(s"lag_${name}_id")): _*)
  .withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
    .otherwise(($"lag_suk_id" - $"id")))
  .drop("lag_silom_id", "lag_suk_id")
  .withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
  .withColumn("zipped", array("user", "cat", "sequencing_diff", "rn", "id"))
// non dataframe API approach (not scalable)
// needs to collect data to driver to process
val iterTuples = data.select("zipped").collect.map(x => x(0).asInstanceOf[WrappedArray[Any]]).map(x => x.toArray)
val shifted: Array[Array[Any]] = iterTuples.drop(1)
val combined = iterTuples
.zipAll(shifted, Array("", "", Double.NaN, Double.NaN, Double.NaN), Array("", "", Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map{
  case (data0, data1) =>
    if(data1(3).toString.toDouble > 2 && data0(3).toString.toDouble > 2 && data1(0) == data0(0) && data1(1) == data0(1)) {
      if(data0(2) != data1(2) && data0(2).toString.toDouble + data1(2).toString.toDouble == 0) {
        (data1(0), data1(1), data1(3), data0(4))
      }
      else ("", "", Double.NaN, Double.NaN)
    }
    else ("", "", Double.NaN, Double.NaN)
}
  .filter(t => t._1 != "" && t._2 != "" && t._3 == t._3 && t._4 == t._4) // fast NaN removal
val typeMappedArray = testArr.map(x => (x._1.toString, x._2.toString, x._3.toString.toDouble, x._4.toString.toDouble))
def getGrps(arr: Array[(String, String, Double, Double)]): (Array[Double], Double, String, String) = {
  if(arr.nonEmpty) {
    val user = arr.take(1)(0)._1
    val cat = arr.take(1)(0)._2
    val rowNum = arr.take(1)(0)._3
    val keepID = arr.take(1)(0)._4
    val newArr = arr.drop(1)
    val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
      case (tups, idx) =>
        if(rowNum + idx + 1 == tups._3) {
          rowNum + 1 + idx
        }
        else Double.NaN
    }
      .filter(v => v == v)
    (rowNums, keepID, user, cat)
  }
  else (Array(Double.NaN), Double.NaN, "", "")
}
// after overwriting, this would allow me to group by user, cat, id to sum the time
getGrps(typeMappedArray) // returns rows number to overwrite, value to overwrite id with, user, cat
res0: (Array(5.0, 6.0, 7.0),0.0,a,silom)
res1: (Array(11.0, 12.0, 13.0, 14.0),4.0,a,silom)
A second approach uses collect_list, but it relies on getGrps working recursively, which I cannot get to work properly. Here is the code I have so far, with a getGrps modified for collect_list but without the recursion.
val data = recastDataDF
  .select($"*" +: category.map(
    name =>
      lag("id", 1).over(
        Window.partitionBy("user", "cat").orderBy("time_sec")
      ).alias(s"lag_${name}_id")): _*)
  .withColumn("sequencing_diff", when($"cat" === "silom", ($"lag_silom_id" - $"id").cast(DoubleType))
    .otherwise(($"lag_suk_id" - $"id")))
  .drop("lag_silom_id", "lag_suk_id")
  .withColumn("rn", row_number.over(Window.partitionBy("user", "cat").orderBy("time_sec")).cast(DoubleType))
  .withColumn("id_rn", array($"id", $"rn", $"sequencing_diff"))
  .groupBy($"user", $"cat").agg(collect_list($"id_rn").alias("array_data"))
// collect one row to develop how the UDF would work
val testList = data.where($"user" === "a" && $"cat" === "silom").select("array_data").collect
.map(x => x(0).asInstanceOf[WrappedArray[WrappedArray[Any]]])
.map(x => x.toArray)
.map(x => (x(0).toString.toDouble, x(1).toString.toDouble, x(2).asInstanceOf[Double]))
// this code would be in the UDF; that is, we would pass array_data to the UDF
scala.util.Sorting.stableSort(testList, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
val shifted: Array[(Double, Double, Double)] = testList.drop(1)
val combined = testList
.zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
val testArr = combined.map{
  case (data0, data1) =>
    if(data0._3 != data1._3 && data0._2 > 1) {
      (data0._2, data0._1)
    }
    else (Double.NaN, Double.NaN)
}
  .filter(t => t._1 == t._1 && t._2 == t._2) // NaN removal
// called inside the UDF
def getGrps2(arr: Array[(Double, Double)]): (Array[Double], Double) = {
  // no need for user or cat
  if(arr.nonEmpty) {
    val rowNum = arr.take(1)(0)._1
    val keepID = arr.take(1)(0)._2
    val newArr = arr.drop(1)
    val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
      case (tups, idx) =>
        if(rowNum + idx + 1 == tups._1) {
          rowNum + 1 + idx
        }
        else Double.NaN
    }
      .filter(v => v == v)
    (rowNums, keepID)
  }
  else (Array(Double.NaN), Double.NaN)
}
We would call .withColumn("data_to_update", udf); the data_to_update column would then be a WrappedArray[Tuple2[Array[Double], Double]] pairing the row numbers to overwrite with the id to write into them. The result for user a, cat silom would be
WrappedArray((Array(4.0, 5.0, 6.0),0.0), (Array(10.0, 11.0, 12.0, 13.0),4.0))
The arrays hold row numbers, and the Double is the id to write into those rows.
The following recursive method, applied in a UDF operating on the array_data column, will create the desired results:
def getGrps(arr: Array[(Double, Double)]): Array[(Array[Double], Double)] = {
  // peel off one run of consecutive row numbers per call, then recurse on the rest
  def returnAlternatingIDs(arr: Array[(Double, Double)],
                           altIDs: Array[(Array[Double], Double)]): Array[(Array[Double], Double)] = arr match {
    case arr if arr.nonEmpty =>
      val rowNum = arr.take(1)(0)._1
      val keepID = arr.take(1)(0)._2
      val newArr = arr.drop(1)
      val rowNums = (Array(rowNum)) ++ newArr.zipWithIndex.map{
        case (tups, idx) =>
          if(rowNum + idx + 1 == tups._1) {
            rowNum + 1 + idx
          }
          else {
            Double.NaN
          }
      }
        .filter(v => v == v)
      val updateArray = altIDs ++ Array((rowNums, keepID))
      returnAlternatingIDs(arr.drop(rowNums.length), updateArray)
    case _ => altIDs
  }
  // the NaN seed only fixes the array's element type; drop(1) removes it again
  returnAlternatingIDs(arr, Array((Array(Double.NaN), Double.NaN))).drop(1)
}
The return value for the first collect_list is Array((Array(5.0, 6.0, 7.0),0.0), (Array(11.0, 12.0, 13.0, 14.0),4.0)) as desired.
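The grouping step in getGrps amounts to splitting an ascending list of row numbers into runs of consecutive values. As a rough illustration only, here is a minimal sketch of that one step using a fold instead of explicit recursion. It assumes Int row numbers rather than Double and leaves out the id paired with each run; the names ConsecutiveRuns and consecutiveRuns are hypothetical and not part of the code above.

```scala
object ConsecutiveRuns {
  // Split an ascending list of row numbers into runs of consecutive values.
  // Runs are built in reverse while folding and put back in order at the end.
  def consecutiveRuns(rows: List[Int]): List[List[Int]] =
    rows.foldLeft(List.empty[List[Int]]) {
      // Extend the current run when the next row number directly follows it.
      case (run :: done, r) if run.head + 1 == r => (r :: run) :: done
      // Otherwise start a new run.
      case (acc, r) => List(r) :: acc
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit =
    println(consecutiveRuns(List(5, 6, 7, 11, 12, 13, 14)))
}
```

Runs of length one could then be filtered out, as the full UDF does with data._1.length > 1.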
Full UDF
import org.apache.spark.sql.expressions.UserDefinedFunction
import scala.util.Sorting.stableSort
val identifyFlickeringIDs: UserDefinedFunction = udf {
  (colArrayData: WrappedArray[WrappedArray[Double]]) =>
    val newArray: Array[(Double, Double, Double)] = colArrayData.toArray
      .map(x => (x(0).toDouble, x(1).toDouble, x(2).toDouble))
    // sort array by rn via less than relation
    stableSort(newArray, (e1: (Double, Double, Double), e2: (Double, Double, Double)) => e1._2 < e2._2)
    val shifted: Array[(Double, Double, Double)] = newArray.toArray.drop(1)
    val combined = newArray
      .zipAll(shifted, (Double.NaN, Double.NaN, Double.NaN), (Double.NaN, Double.NaN, Double.NaN))
    val parsedArray = combined.map{
      case (data0, data1) =>
        if(data0._3 != data1._3 && data0._2 > 1 && data0._3 + data1._3 == 0) {
          (data0._2, data0._1)
        }
        else (Double.NaN, Double.NaN)
    }
      .filter(t => t._1 == t._1 && t._2 == t._2) // NaN removal
    // keep only runs longer than one row, i.e. actual flickering
    getGrps(parsedArray).filter(data => data._1.length > 1)
}

spark convert spark-SQL to RDD API

Spark SQL is pretty clear to me. However, I am just getting started with Spark's RDD API. As "spark apply function to columns in parallel" points out, this should allow me to get rid of the slow shuffles in
def handleBias(df: DataFrame, colName: String, target: String = this.target) = {
  val w1 = Window.partitionBy(colName)
  val w2 = Window.partitionBy(colName, target)
  df.withColumn("cnt_group", count("*").over(w2))
    .withColumn("pre2_" + colName, mean(target).over(w1))
    .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
}
In pseudo code: df foreach column (handleBias(column))
So a minimal data frame is loaded up
val input = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
)
val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
but fails to map correctly
val rdd1_inputDf = inputDf.rdd.flatMap { x => {(0 until x.size).map(idx => (idx, x(idx)))}}
It fails with
java.lang.ClassNotFoundException: scala.Any
An example can be found at https://github.com/geoHeil/sparkContrastCoding; the code for the problem outlined in this question is at https://github.com/geoHeil/sparkContrastCoding/blob/master/src/main/scala/ColumnParallel.scala.
When you call .rdd on a DataFrame you get an RDD[Row] which is not strongly typed. If you want to be able to map over the elements you will need to pattern match over Row:
scala> val input = Seq(
| (0, "A", "B", "C", "D"),
| (1, "A", "B", "C", "D"),
| (0, "d", "a", "jkl", "d"),
| (0, "d", "g", "C", "D"),
| (1, "A", "d", "t", "k"),
| (1, "d", "c", "C", "D"),
| (1, "c", "B", "C", "D")
| )
input: Seq[(Int, String, String, String, String)] = List((0,A,B,C,D), (1,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D))
scala> val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
inputDf: org.apache.spark.sql.DataFrame = [TARGET: int, col1: string ... 3 more fields]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = inputDf.rdd
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at rdd at <console>:27
scala> val typedRDD = rowRDD.map{case Row(a: Int, b: String, c: String, d: String, e: String) => (a,b,c,d,e)}
typedRDD: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[20] at map at <console>:29
scala> typedRDD.keyBy(_._1).groupByKey.foreach{println}
[Stage 7:> (0 + 0) / 4]
(0,CompactBuffer((A,B,C,D), (d,a,jkl,d), (d,g,C,D)))
(1,CompactBuffer((A,B,C,D), (A,d,t,k), (d,c,C,D), (c,B,C,D)))
Otherwise you can use a typed Dataset:
scala> val ds = input.toDS
ds: org.apache.spark.sql.Dataset[(Int, String, String, String, String)] = [_1: int, _2: string ... 3 more fields]
scala> ds.rdd
res2: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[8] at rdd at <console>:30
scala> ds.rdd.keyBy(_._1).groupByKey.foreach{println}
[Stage 0:> (0 + 0) / 4]
(0,CompactBuffer((0,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D)))
(1,CompactBuffer((1,A,B,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D)))

Join two lists with unequal length in Scala

I have 2 lists:
val list_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
val list_2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))
I want to replace each key in the second list with the corresponding value from the first list, resulting in a new list that looks like:
List ((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
This is my attempt:
object UniqueTest {
  def main(args: Array[String]): Unit = {
    val l_1 = List((1, 11), (2, 12), (3, 13), (4, 14))
    val l_2 = List((1, 111), (2,122), (3, 133), (4, 144), (1, 123), (2, 234))
    val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
  }
  // linear search for key i; returns 0 when the key is missing
  def f(i: Int, list: List[(Int, Int)]): Int = {
    for(pair <- list){
      if(i == pair._1){
        return pair._2
      }
    }
    return 0
  }
}
This results in:
((11, 111), (12, 122), (13, 133), (14, 144), (11, 123), (12, 234))
Is the program above a good way to do this? Are there built-in functions in Scala to handle this need, or another way to do this manipulation?
The only real over-complication you make is this line:
val l_3 = l_2.map(x => (f(x._1, l_1), x._2))
Your f function uses an imperative style to loop over a list to find a key. Any time you find yourself doing this, it's a good indication that what you want is a map. The for loop makes every lookup a linear scan of the list, so the total cost grows with the product of the two list lengths; a hash-based map fetches the value for a given key in effectively constant time. Converting your list of key-value pairs to a map also gives you a data structure that is explicit about the key-value relationship.
Thus, the first thing you should do is build your map. Scala provides a really easy way to do this with toMap:
val map_1 = list_1.toMap
Then it is just a matter of 'mapping':
val result = list_2.map { case (key, value) => (map_1.getOrElse(key, 0), value) }
This takes each pair in your list_2, looks up its first element (key) in map_1, retrieves the corresponding value (or the default 0) and puts it as the first element of a new key-value tuple.
You can do:
val map = l_1.toMap // transform l_1 to a Map[Int, Int]
// for each (a, b) in l_2, retrieve the new value v of a and return (v, b)
val res = l_2.map { case (a, b) => (map.getOrElse(a, 0), b) }
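For illustration, here is a small self-contained sketch of the lookup approach. The object and method names (KeyLookup, rekey, rekeyStrict) are made up for this example, and the extra pair (9, 999) is added only to show what happens to a key with no match. rekeyStrict is an assumed variant that drops unmatched pairs instead of mapping them to a default; the question does not say which behaviour is wanted.

```scala
object KeyLookup {
  // Replace each key in `pairs` by its value in `keys`; unknown keys become `default`.
  def rekey(keys: List[(Int, Int)], pairs: List[(Int, Int)], default: Int = 0): List[(Int, Int)] = {
    val lookup = keys.toMap
    pairs.map { case (k, v) => (lookup.getOrElse(k, default), v) }
  }

  // Variant that drops pairs whose key has no match instead of using a default.
  def rekeyStrict(keys: List[(Int, Int)], pairs: List[(Int, Int)]): List[(Int, Int)] = {
    val lookup = keys.toMap
    pairs.flatMap { case (k, v) => lookup.get(k).map(newKey => (newKey, v)) }
  }

  def main(args: Array[String]): Unit = {
    val l1 = List((1, 11), (2, 12), (3, 13), (4, 14))
    val l2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234), (9, 999))
    println(rekey(l1, l2))       // (9, 999) becomes (0, 999)
    println(rekeyStrict(l1, l2)) // (9, 999) is dropped
  }
}
```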
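If the two lists had the same length and were aligned by position, the most idiomatic way would be zipping them together and then transforming according to your needs. Note that zip pairs elements by position, not by key, and stops at the end of the shorter list, so for the lists above it yields only the first four pairs: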
(list_1 zip list_2) map { case ((k1, v1), (k2, v2)) => (v1, v2) }
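For reference, zip pairs elements purely by position and stops at the end of the shorter list. Here is a quick check on the lists from the question; the object name ZipTruncation is just for illustration.

```scala
object ZipTruncation {
  val list1 = List((1, 11), (2, 12), (3, 13), (4, 14))
  val list2 = List((1, 111), (2, 122), (3, 133), (4, 144), (1, 123), (2, 234))

  // Pairs element i of list1 with element i of list2; the two extra
  // elements at the end of list2 have no partner and are left out.
  val zipped: List[(Int, Int)] =
    (list1 zip list2).map { case ((_, v1), (_, v2)) => (v1, v2) }

  def main(args: Array[String]): Unit =
    println(zipped)
}
```

So zipping only fits when both lists are known to line up element by element; otherwise a key lookup is the safer choice.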

Spark table transformation (ERROR: 5063)

I have the following data:
val RDDApp = sc.parallelize(List("A", "B", "C"))
val RDDUser = sc.parallelize(List(1, 2, 3))
val RDDInstalled = sc.parallelize(List((1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"))).groupByKey
val RDDCart = RDDUser.cartesian(RDDApp)
I want to map this data so that I have an RDD of tuples of the form (userId, Boolean), where the Boolean says whether the user has the given letter installed. I thought I found a solution with this:
val results = RDDCart.map (entry =>
  (entry._1, RDDInstalled.lookup(entry._1).contains(entry._2))
)
If I call results.first, I get org.apache.spark.SparkException: SPARK-5063. I see the problem with the Action within the Mapping function but do not know how I can work around it so that I get the same result.
Just join and mapValues:
RDDCart.join(RDDInstalled).mapValues{case (x, xs) => xs.toSeq.contains(x)}
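One thing to keep in mind is that join is an inner join, so a user with no entry in RDDInstalled would not appear in the result at all. To illustrate the underlying idea (group the installed apps per user once, then check every user/app pair against that grouping), here is a sketch using plain Scala collections rather than Spark. The names InstalledLookup and installedFlags are hypothetical, and treating a user without installed apps as false for every app is an assumption, not something the question specifies.

```scala
object InstalledLookup {
  // For every (user, app) combination, report whether the user has the app.
  def installedFlags(users: List[Int],
                     apps: List[String],
                     installed: List[(Int, String)]): List[(Int, Boolean)] = {
    // Group the installed apps per user once, up front.
    val byUser: Map[Int, Set[String]] =
      installed.groupBy(_._1).map { case (user, pairs) => user -> pairs.map(_._2).toSet }
    // Cartesian product of users and apps, checked against the local map.
    // Users missing from the map get an empty set, i.e. false for every app.
    for {
      user <- users
      app  <- apps
    } yield (user, byUser.getOrElse(user, Set.empty[String]).contains(app))
  }

  def main(args: Array[String]): Unit = {
    val installed = List((1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"))
    installedFlags(List(1, 2, 3), List("A", "B", "C"), installed).foreach(println)
  }
}
```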

Distribute elements of the first list to another list/array

I have an Array[(List[String], (Int, Int))] like this
((123, 456, 789), (1, 24))
((89, 284), (2, 6))
((125, 173, 88, 222), (3, 4))
I would like to distribute each element of the first list to the second list, like this
(123, (1, 24))
(456, (1, 24))
(789, (1, 24))
(89, (2, 6))
(284, (2, 6))
(125, (3, 4))
(173, (3, 4))
(88, (3, 4))
(222, (3, 4))
Can anyone help me with this? Thank you very much.
For input data defined as follows:
val data = Array((List("123", "456", "789"), (1, 24)), (List("89", "284"), (2, 6)), (List("125", "173", "88", "222"), (3, 4)))
you can use:
data.flatMap { case (l, ii) => l.map((_, ii)) }
which yields:
Array[(String, (Int, Int))] = Array(("123", (1, 24)), ("456", (1, 24)), ("789", (1, 24)), ("89", (2, 6)), ("284", (2, 6)), ("125", (3, 4)), ("173", (3, 4)), ("88", (3, 4)), ("222", (3, 4)))
which I believe matches what you are looking for.
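The same flatMap can be wrapped in a small generic helper so that it works for any element type and any attached value. This is only a sketch; the names Distribute and distribute are made up for the example.

```scala
object Distribute {
  // Pair every element of each list with the value attached to that list.
  def distribute[A, B](data: Seq[(List[A], B)]): Seq[(A, B)] =
    data.flatMap { case (items, tag) => items.map(item => (item, tag)) }

  def main(args: Array[String]): Unit = {
    val data = Seq((List("123", "456"), (1, 24)), (List("89"), (2, 6)))
    distribute(data).foreach(println)
  }
}
```

An entry whose list is empty simply contributes nothing to the result.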
Based on your example, it seemed to me that you were using a single type.
scala> val xs: List[(List[Int], (Int, Int))] =
| List( ( List(123, 456, 789), (1, 24) ),
| ( List(89, 284), (2,6)),
| ( List(125, 173, 88, 222), (3, 4)) )
xs: List[(List[Int], (Int, Int))] = List((List(123, 456, 789), (1,24)),
(List(89, 284),(2,6)),
(List(125, 173, 88, 222),(3,4)))
Then I wrote this function:
scala> def f[A](xs: List[(List[A], (A, A))]): List[(A, (A, A))] =
| for {
| x <- xs
| head <- x._1
| } yield (head, x._2)
f: [A](xs: List[(List[A], (A, A))])List[(A, (A, A))]
Apply f to xs.
scala> f(xs)
res9: List[(Int, (Int, Int))] = List((123,(1,24)), (456,(1,24)),
(789,(1,24)), (89,(2,6)), (284,(2,6)), (125,(3,4)),
(173,(3,4)), (88,(3,4)), (222,(3,4)))