Flink WindowFunction Fold - scala

I created a sliding window and want to collect all the elements that arrive within each window period into a single list.
This is the relevant chunk of the code:
.map(x => ((x.pickup.get.latitude, x.pickup.get.longitude), (x.dropoff.get.latitude, x.dropoff.get.longitude)))
.windowAll(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
.fold(List[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
I want to build a List whose elements are tuples, but this does not work.
I tried the following on a plain list, and it works:
val l2 : List[((Int, Int), (Int, Int))] = List(((1, 1), (2, 2)))
val newl2 = l2 :+ ((3, 3), (4, 4))
How can I do this?
Thanks so much

The first argument of the fold function needs to be the initial value, not the type. Changing the last line to:
.fold(List.empty[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
should do the trick.
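The same point can be sketched on an ordinary Scala collection, without Flink. This is only an illustration: the object name FoldSeed, the type aliases, and the sample coordinates are made up, and foldLeft on a Seq stands in for the windowed fold.

```scala
object FoldSeed {
  // Hypothetical aliases for readability; they are not part of the question.
  type Point = (Double, Double)
  type Trip  = (Point, Point)

  // The first argument list of foldLeft takes a value (the seed), not a type.
  // List.empty[Trip] is an empty list value whose element type is Trip.
  def collect(trips: Seq[Trip]): List[Trip] =
    trips.foldLeft(List.empty[Trip])((acc, v) => acc :+ v)

  def main(args: Array[String]): Unit = {
    val trips = Seq(((1.0, 2.0), (3.0, 4.0)), ((5.0, 6.0), (7.0, 8.0)))
    println(collect(trips))
  }
}
```

Appending with :+ copies the list on every step, so for large windows prepending with :: and reversing once at the end may be cheaper.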

Related

Spark scala faster way to groupbykey and sort rdd values

I have an RDD in which each row has the format (key, (int, double)).
I would like to transform it into (key, ((int, double), (int, double), ...)),
where the values in the new RDD are the top N pairs, sorted by the double.
So far I have come up with the solution below, but it is really slow and runs forever. It works fine with a smaller RDD, but now the RDD is too big:
val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
.mapValues(x => x.takeRight(n))
Is there a better and faster way to do this?
The most efficient way is probably aggregateByKey:
type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???
//TODO: implement a function that adds a value to a sorted array and keeps top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
//TODO: implement a function that merges 2 sorted arrays and keeps top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ??? //TODO
val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
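One possible way to fill in the two TODO helpers is sketched below on plain arrays, so it can be tried without a Spark cluster. It is an assumption-laden sketch, not the answerer's implementation: the object name TopNArrays and the constant N are hypothetical, the arrays are kept in descending order of the Double, and each call returns a new array rather than mutating its input.

```scala
object TopNArrays {
  type V = (Int, Double)

  // Hypothetical limit on how many values to keep per key.
  val N = 2

  // Insert newValue into an array sorted in descending order of the Double,
  // then keep only the N largest entries. Returns a new array.
  def addToSortedArray(arr: Array[V], newValue: V): Array[V] = {
    val (larger, rest) = arr.span(_._2 >= newValue._2)
    ((larger :+ newValue) ++ rest).take(N)
  }

  // Merge two descending arrays by inserting each element of arr2 into arr1.
  def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] =
    arr2.foldLeft(arr1)(addToSortedArray)
}
```

Because the accumulator never holds more than N elements, the per-key state stays small regardless of how many values a key has, which is the reason aggregateByKey can outperform a full sort followed by groupByKey.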
Since you're interested only in the top-N values in your RDD, I would suggest that you avoid sorting across the entire RDD. In addition, use the more performant reduceByKey rather than groupByKey if at all possible. Below is an example using a topN method, borrowed from this blog:
def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
  // Reorder l so that its smallest element (by the Double) sits at the head
  def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = l match {
    case Nil => l
    case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
      if (x._2 <= acc.head._2) x :: acc else acc :+ x
    )
  }
  // Replace the current minimum with e when e is larger
  def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = {
    if (e._2 > l.head._2) bigHead((e :: l.tail)) else l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}
val rdd = sc.parallelize(Seq(
("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))
val n = 2
rdd.
map{ case (k, v) => (k, List(v)) }.
reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
collect
// res1: Array[(String, List[(Int, Double)])] =
// Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0))))

How to get the indices of duplicate pairs in a Scala list

I have a Scala list like the one below:
val slist = List("a","b","c","a","d","c","a")
I want to get the indices of every pair of equal elements in this list.
For example, the result for this slist is
(0,3),(0,6),(3,6),(2,5)
where (0,3) means slist(0)==slist(3),
(0,6) means slist(0)==slist(6),
and so on.
Is there a method to do this in Scala?
Thanks very much
There are simpler approaches, but starting with zipWithIndex leads down this path. zipWithIndex returns a list of Tuple2s, each pairing a letter with its index. From there we groupBy the letter to get a map from each letter to its (letter, index) pairs, and filter for the letters with more than one entry. Taking the values then gives MapLike.DefaultValuesIterable(List((a,0), (a,3), (a,6)), List((c,2), (c,5))),
from which we take the indices and make combinations.
scala> slist.zipWithIndex.groupBy(zipped => zipped._1).filter(t => t._2.size > 1).values.flatMap(xs => xs.map(t => t._2).combinations(2))
res40: Iterable[List[Int]] = List(List(0, 3), List(0, 6), List(3, 6), List(2, 5))
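Since the question asks for pairs such as (0,3) rather than two-element lists, the same zipWithIndex / groupBy / combinations pipeline can be wrapped in a small function that returns tuples. This is a sketch; the object and method names are made up for illustration.

```scala
object DuplicateIndexPairs {
  // For each value that occurs more than once, emit every pair (i, j) with i < j.
  def pairs[A](xs: Seq[A]): List[(Int, Int)] =
    xs.zipWithIndex
      .groupBy(_._1)                 // value -> all (value, index) pairs for it
      .values
      .flatMap { group =>
        // combinations(2) yields nothing for a group of size 1,
        // so no explicit size filter is needed here.
        group.map(_._2).combinations(2).map { case Seq(i, j) => (i, j) }
      }
      .toList
      .sorted                        // Map iteration order is unspecified, so sort the result
}
```

For the slist from the question, DuplicateIndexPairs.pairs(slist) gives List((0,3), (0,6), (2,5), (3,6)).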
Indexing a List is rather inefficient, so I recommend converting to a Vector and then back again (if needed).
val svec = slist.toVector
svec.indices
.map(x => (x,svec.indexOf(svec(x),x+1)))
.filter(_._2 > 0)
.toList
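//res0: List[(Int, Int)] = List((0,3), (2,5), (3,6))
//note: each index is paired only with the next occurrence of its element, so (0,6) is not reported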
val v = slist.toVector; val s = v.size
for(i<-0 to s-1;j<-0 to s-1;if(i<j && v(i)==v(j))) yield (i,j)
In Scala REPL:
scala> for(i<-0 to s-1;j<-0 to s-1;if(i<j && v(i)==v(j))) yield (i,j)
res34: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,3), (0,6), (2,5), (3,6))

The operation about Merge two tuples

I have two tuples like these:
1st tuple:((A,1),(B,3),(D,5)......)
2nd tuple:((A,3),(B,1),(E,6)......)
I need a function that merges these two tuples into this:
((A,1,3),(B,3,1),(D,5,0),(E,0,6)......)
If the first tuple contains a key that is not in the second tuple, set the value to 0, and vice versa. How could I code this function in scala?
Let's say you get the input in the format
val tuple1: List[(String, Int)] = List(("A",1),("B",3),("D",5),("E",0))
val tuple2: List[(String, Int)] = List(("A",3),("B",1),("D",6))
You can write a merge function as
def merge(tuple1: List[(String, Int)], tuple2: List[(String, Int)]): List[(String, Int, Int)] =
{
  val map1 = tuple1.toMap
  val map2 = tuple2.toMap
  // iterate over the keys of both inputs so that keys present in only one of them are kept
  (map1.keySet ++ map2.keySet).toList.sorted.map { k =>
    (k, map1.getOrElse(k, 0), map2.getOrElse(k, 0))
  }
}
On calling the function
merge(tuple1,tuple2)
You will get the output as
res0: List[(String, Int, Int)] = List((A,1,3), (B,3,1), (D,5,6), (E,0,0))
Please let me know if that answers your question.
val t1= List(("A",1),("B",3),("D",5),("E",0))
val t2= List(("A",3),("B",1),("D",6),("H",5))
val t3 = t2.filter{ case (k,v) => !t1.exists{case (k1,_) => k1==k} }.map{case (k,_) => (k,0)}
val t4 = t1.filter{ case (k,v) => !t2.exists{case (k1,_) => k1==k} }.map{case (k,_) => (k,0)}
val t5=(t1 ++ t3).sortBy{case (k,v) => k}
val t6=(t2 ++ t4).sortBy{case (k,v) => k}
t5.zip(t6).map{case ((k,v1),(_,v2)) => (k,v1,v2) }
res: List[(String, Int, Int)] = List(("A", 1, 3), ("B", 3, 1), ("D", 5, 6), ("E", 0, 0), ("H", 0, 5))
Here is what is happening:
t3 and t4 hold the keys that are missing from t1 and t2 respectively, each paired with a zero value.
t5 and t6 are the completed lists (t1 with t3 and t2 with t4), sorted by key. Finally they are zipped together and transformed into the desired output.

Reduce/fold over scala sequence with grouping

In Scala, given an Iterable of pairs, say Iterable[(String, Int)],
is there a way to accumulate or fold over the ._2s based on the ._1s? For example, in the following list, add up all the numbers paired with A and, separately, the numbers paired with B:
List(("A", 2), ("B", 1), ("A", 3))
I could do this in 2 steps with groupBy
val mapBy1 = list.groupBy( _._1 )
for ((key,sublist) <- mapBy1) yield (key, sublist.foldLeft(0) (_+_._2))
but then I would be allocating the sublists, which I would rather avoid.
You could build the Map as you go and convert it back to a List after the fact.
listOfPairs.foldLeft(Map[String,Int]().withDefaultValue(0)){
case (m,(k,v)) => m + (k -> (v + m(k)))
}.toList
You could do something like:
list.foldLeft(Map[String, Int]()) {
case (map, (k,v)) => map + (k -> (map.getOrElse(k, 0) + v))
}
You could also use groupBy with mapValues:
list.groupBy(_._1).mapValues(_.map(_._2).sum).toList
res1: List[(String, Int)] = List((A,5), (B,1))
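If Scala 2.13 or later is available (an assumption, since the question does not state a version), the standard library's groupMapReduce does the grouping and the summing in one pass without building the intermediate sublists. The object and method names below are made up for illustration.

```scala
object SumByKey {
  // groupMapReduce(key)(value)(combine):
  //   key     picks the grouping key from each pair,
  //   value   picks what to accumulate,
  //   combine merges two accumulated values for the same key.
  def sumByKey(pairs: Iterable[(String, Int)]): Map[String, Int] =
    pairs.groupMapReduce(_._1)(_._2)(_ + _)

  def main(args: Array[String]): Unit =
    println(sumByKey(List(("A", 2), ("B", 1), ("A", 3))))
}
```

Call .toList on the result if a list of pairs is needed instead of a Map.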

How to sum a List[(Char,Int)] into a Map[Char,Int] in Scala?

I've got list of pairs:
List(('a',3),('b',3),('a',1))
and I would like to transform it by grouping by _1 and summing _2. The result should be like
Map('a'->4, 'b' -> 3)
I'm very new to Scala, so please be kind :)
More direct version. We fold over the list, using a Map as the accumulator. The withDefaultValue means we don't have to test if we have the entry in the map already.
val xs = List(('a',3),('b',3),('a',1))
xs.foldLeft(Map[Char, Int]() withDefaultValue 0) {
  case (m, (c, i)) => m updated (c, m(c) + i)
}
//> res0: scala.collection.immutable.Map[Char,Int] = Map(a -> 4, b -> 3)
list.groupBy(_._1).mapValues(_.map(_._2).sum)
which can be written as
list.groupBy(_._1).mapValues { tuples =>
val ints = tuples.map { case (c, i) => i }
ints.sum
}