The data consists of two columns:
A B
A C
A D
B A
B C
B D
B E
C A
C B
C D
C E
D A
D B
D C
D E
E B
E C
E D
In the first row, think of it as A is friends with B, etc.
How do I find their common friends?
(A,B) -> (C D)
Meaning A and B have common friends C and D. The closest I got was a groupByKey, with the following result.
(B,CompactBuffer(A, C, D, E))
(A,CompactBuffer(B, C, D))
(C,CompactBuffer(A, B, D, E))
(E,CompactBuffer(B, C, D))
(D,CompactBuffer(A, B, C, E))
The code:
val rdd: RDD[String] = spark.sparkContext.textFile("twocols.txt")
val splitrdd: RDD[(String, String)] = rdd.map { s =>
  val str = s.split(" ")
  (str(0), str(1))
}
val group: RDD[(String, Iterable[String])] = splitrdd.groupByKey()
group.foreach(println)
First, swap each pair so the friend becomes the key:
val swapped = splitrdd.map(_.swap)
Then self-join and swap back:
val shared = swapped.join(swapped).map(_.swap)
Finally, filter out self-pairs and mirrored duplicates (if needed) and groupByKey:
shared.filter { case ((x, y), _) => x < y }.groupByKey
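Putting it all together, starting from the splitrdd defined in the question (a minimal sketch; spark is assumed to be an existing SparkSession):

// (friend, person) pairs, keyed by the shared friend
val swapped = splitrdd.map(_.swap)

// Self-join on the shared friend, then make the pair of people the key:
// ((person1, person2), friend)
val shared = swapped.join(swapped).map(_.swap)

// x < y drops self-pairs like (A,A) and keeps one of (A,B)/(B,A)
val common = shared.filter { case ((x, y), _) => x < y }.groupByKey()

common.foreach(println)
// e.g. ((A,B),CompactBuffer(C, D))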
This is just an ugly attempt:
Suppose you have converted your two columns into Array[Array[String]] (or List[List[String]], it's really the same), say
val pairs=Array(
Array("A","B"),
Array("A","C"),
Array("A","D"),
Array("B","A"),
Array("B","C"),
Array("B","D"),
Array("B","E"),
Array("C","A"),
Array("C","B"),
Array("C","D"),
Array("C","E"),
Array("D","A"),
Array("D","B"),
Array("D","C"),
Array("D","E"),
Array("E","B"),
Array("E","C"),
Array("E","D")
)
Define the group for which you want to find their common friends:
val group=Array("C","D")
The following will find the friends for each member in your group
val friendsByMemberOfGroup=group.map(
i => pairs.filter(x=> x(1) contains i)
.map(x=>x(0))
)
For example, pairs.filter(x => x(1) contains "C").map(x => x(0)) returns the friends of "C", where "C" is taken from the second column and its friends from the first:
scala> pairs.filter(x=> x(1) contains "C").map(x=>x(0))
res212: Array[String] = Array(A, B, D, E)
And the following loop will find the common friends of all the members in your group
var commonFriendsOfGroup = friendsByMemberOfGroup(0).toSet
for (i <- 1 until friendsByMemberOfGroup.size) {
  commonFriendsOfGroup =
    commonFriendsOfGroup.intersect(friendsByMemberOfGroup(i).toSet)
}
So you get
scala> commonFriendsOfGroup.toArray
res228: Array[String] = Array(A, B, E)
If you change your group to val group=Array("A","B","E") and apply the previous lines then you will get
scala> commonFriendsOfGroup.toArray
res230: Array[String] = Array(C, D)
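As a side note, the intersection loop above can also be written as a single reduce (a sketch, assuming the group is non-empty):

val commonFriendsOfGroup =
  friendsByMemberOfGroup.map(_.toSet).reduce(_ intersect _)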
Continuing from where you left off:
val group: RDD[(String, Iterable[String])] = splitrdd.groupByKey()
val group_map = group.collectAsMap
val common_friends = group.flatMap { case (x, friends) =>
  friends.map { y =>
    ((x, y), group_map.get(y).get.toSet.intersect(friends.toSet))
  }
}
scala> common_friends.foreach(println)
((B,A),Set(C, D))
((B,C),Set(A, D, E))
((B,D),Set(A, C, E))
((B,E),Set(C, D))
((D,A),Set(B, C))
((D,B),Set(A, C, E))
((D,C),Set(A, B, E))
((D,E),Set(B, C))
((A,B),Set(C, D))
((A,C),Set(B, D))
((A,D),Set(B, C))
((C,A),Set(B, D))
((C,B),Set(A, D, E))
((C,D),Set(A, B, E))
((C,E),Set(B, D))
((E,B),Set(C, D))
((E,C),Set(B, D))
((E,D),Set(B, C))
Note: this assumes your data has the relationship in both directions, as in your example (both A B and B A). If that's not the case, you need to add some code to deal with the fact that group_map.get(y) might return None.
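For instance, a defensive variant could fall back to an empty set when the reverse edge is missing (a sketch using getOrElse):

val common_friends = group.flatMap { case (x, friends) =>
  friends.map { y =>
    val yFriends = group_map.getOrElse(y, Iterable.empty).toSet
    ((x, y), yFriends.intersect(friends.toSet))
  }
}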
So I ended up doing this on the client side, collecting everything to the driver. DO NOT DO THIS:
val arr: Array[(String, Iterable[String])] = group.collect()
//arr.foreach(println)
var arr2 = scala.collection.mutable.Set[((String, String), List[String])]()
for (i <- arr)
  for (j <- arr)
    if (i != j) {
      val s1 = i._2.toSet
      val s2 = j._2.toSet
      val s3 = s1.intersect(s2).toList
      //println(s3)
      val pair = if (i._1 < j._1) (i._1, j._1) else (j._1, i._1)
      arr2 += ((pair, s3))
    }
arr2.foreach(println)
The result is
((B,E),List(C, D))
((A,C),List(B, D))
((A,B),List(C, D))
((A,D),List(B, C))
((B,D),List(A, C, E))
((C,D),List(A, B, E))
((B,C),List(A, D, E))
((C,E),List(B, D))
((D,E),List(B, C))
((A,E),List(B, C, D))
I am wondering if I can do this using transformations within Spark.
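Yes: the nested loop above translates almost mechanically to a cartesian over group (a sketch; note that cartesian shuffles the full cross product, so the join-based answer above scales much better):

val common_friends = group.cartesian(group)
  .filter { case ((x, _), (y, _)) => x < y } // drops self-pairs and mirrored duplicates
  .map { case ((x, xs), (y, ys)) =>
    ((x, y), xs.toSet.intersect(ys.toSet).toList)
  }

common_friends.foreach(println)
// e.g. ((A,B),List(C, D))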
I have a List "a, b, c, d", and this is the expected result:
a
ab
abc
abcd
b
bc
bcd
c
cd
d
I tried brute force, but I think there might be a more efficient solution, given that I have a very long list.
Here's a one-liner.
"abcd".tails.flatMap(_.inits.toSeq.init.reverse).mkString(",")
//res0: String = a,ab,abc,abcd,b,bc,bcd,c,cd,d
The mkString() is added just so we can see the result. Otherwise the result is an Iterator[String], which is a pretty memory-efficient collection type.
The reverse is only there so that it comes out in the order you specified. If the order of the results is unimportant then that can be removed.
The toSeq.init is there to remove empty elements left behind by the inits call. If those can be dealt with elsewhere then this can also be removed.
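For reference, here is roughly what the building blocks produce (results shown as comments):

"abcd".tails.toList               // List("abcd", "bcd", "cd", "d", "")
"abcd".inits.toList               // List("abcd", "abc", "ab", "a", "")
"abcd".inits.toSeq.init.reverse   // "a", "ab", "abc", "abcd"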
This may not be the best solution, but one way of doing it is by using the sliding function, as follows:
val lst = List('a', 'b', 'c', 'd')
val groupedElements = (1 to lst.size).flatMap(x => lst.sliding(x, 1))
groupedElements.foreach(x => println(x.mkString("")))
//output
/* a
b
c
d
ab
bc
cd
abc
bcd
abcd
*/
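Note that this orders the output by window length (all single elements first, then all pairs, and so on) rather than by starting element as in the question; since the results here are strings, a plain lexicographic sort of groupedElements.map(_.mkString("")) would restore the requested order.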
It may not be the best solution, but I think it is a good one, and it's tail-recursive.
First, this function gets the possible sublists of a List:
def getSubList[A](lista: Seq[A]): Seq[Seq[A]] = {
  for {
    i <- 1 to lista.length
  } yield lista.take(i)
}
And then this one performs the recursion, calling the first function to obtain all the possible sublists:
import scala.annotation.tailrec

def getSubListRecursive[A](lista: Seq[A]): Seq[Seq[A]] = {
  @tailrec
  def go(acc: Seq[Seq[A]], rest: Seq[A]): Seq[Seq[A]] = {
    rest match {
      case Seq() => acc
      case l => go(acc = acc ++ getSubList(l), rest = l.tail)
    }
  }
  go(Nil, lista)
}
The output, with val l = List("a", "b", "c", "d"):
getSubListRecursive(l)
res4: Seq[Seq[String]] = List(List(a), List(a, b), List(a, b, c), List(a, b, c, d), List(b), List(b, c), List(b, c, d), List(c), List(c, d), List(d))
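For comparison, getSubList can also be expressed with the built-in inits, which yields the prefixes from longest to empty (a sketch):

def getSubList[A](lista: Seq[A]): Seq[Seq[A]] =
  lista.inits.toSeq.reverse.tail // reverse to ascending order, drop the empty prefix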
def map2[A,B,C](a: Par[A], b: Par[B])(f: (A,B) => C): Par[C] =
  (es: ExecutorService) => {
    val af = a(es)
    val bf = b(es)
    UnitFuture(f(af.get, bf.get))
  }

def map3[A,B,C,D](pa: Par[A], pb: Par[B], pc: Par[C])(f: (A,B,C) => D): Par[D] =
  map2(map2(pa, pb)((a, b) => (c: C) => f(a, b, c)), pc)(_(_))
I have map2 and need to produce map3 in terms of map2. I found the solution on GitHub, but it is hard to understand. Could anyone take a look at it and explain map3, and also what (_(_)) does?
On a purely abstract level, map2 means you can run two tasks in parallel, and that is a new task in itself. The implementation provided for map3 is: run in parallel (the task that consists of running the first two in parallel) and (the third task).
Now down to the code. First, let's give names to all the objects created (I have also expanded the _ notations for clarity):
def map3[A,B,C,D](pa: Par[A], pb: Par[B], pc: Par[C])(f: (A,B,C) => D): Par[D] = {
  def partialCurry(a: A, b: B)(c: C): D = f(a, b, c)
  val pc2d: Par[C => D] = map2(pa, pb)((a, b) => partialCurry(a, b))
  def applyFunc(func: C => D, c: C): D = func(c)
  map2(pc2d, pc)((c2d, c) => applyFunc(c2d, c))
}
Now remember that map2 takes two Par[_], and a function to combine the eventual values, to get a Par[_] of the result.
The first time you use map2 (the inside one), you parallelize the first two tasks, and combine them into a function. Indeed, using f, if you have a value of type A and a value of type B, you just need a value of type C to build one of type D, so this exactly means that partialCurry(a, b) is a function of type C => D (partialCurry itself is of type (A, B) => C => D).
Now you have again two values of type Par[_], so you can again map2 on them, and there is only one natural way to combine them to get the final value.
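So the (_(_)) in the original one-liner is just shorthand for (c2d, c) => c2d(c), i.e. applyFunc above: the first underscore is the C => D function produced by the inner map2, and the second is the C value.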
The previous answer is correct, but I found it easier to think about it like this:
def map3[A, B, C, D](a: Par[A], b: Par[B], c: Par[C])(f: (A, B, C) => D): Par[D] = {
  val f1 = (a: A, b: B) => (c: C) => f(a, b, c)
  val f2: Par[C => D] = map2(a, b)(f1)
  map2(f2, c)((f3: C => D, c: C) => f3(c))
}
Create a function f1 that is a version of f with the first two arguments partially applied, then pass it to map2 together with a and b to get a function of type C => D in the Par context (f2).
Finally, we can use f2 and c as arguments to map2, and apply f3 (of type C => D) to c to get a D in the Par context.
Hope this helps someone!
Problem
Hey, I have a DStream whose elements have type List[A]. What's the best way to transform this DStream into a DStream of A?
To help illustrate my goal, I want
List(A, A, A, ....), List(A, A, ...), List(A, A, A, ...), ...
to be
A, A, A, A, A, ...
Basically it's very similar to a flatten operation in concept. Thanks!
Update:
I think I figured it out, a simple flatMap should do it. Thanks anyways!
Just in case anyone wants the answer.
If x is some DStream of List[A], then applying a flatMap on x, where the transformation function simply returns the list, will flatten those lists into a DStream of A.
val x: DStream[List[A]] = ...
val y: DStream[A] = x.flatMap(k => k)
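Equivalently, using identity:
val y2: DStream[A] = x.flatMap(identity)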
So I'm running into a speed issue where I have a dataset that needs to be aggregated multiple times.
Initially my team had set up three accumulators and was running a single foreach loop over the data, something along the lines of:
val accum1: Accumulable[A]
val accum2: Accumulable[B]
val accum3: Accumulable[C]

data.foreach { u =>
  accum1 += u
  accum2 += u
  accum3 += u
}
I am trying to switch these accumulations into an aggregation so that I can get a speed boost and have access to accumulators for debugging. I am currently trying to figure out a way to aggregate these three types at once, since running 3 separate aggregations is significantly slower. Does anyone have any thoughts as to how I can do this? Perhaps aggregating agnostically then pattern matching to split into two RDDs?
Thank you
As far as I can tell, all you need here is aggregate with zeroValue, seqOp and combOp corresponding to the operations performed by your accumulators.
val zeroValue: (A, B, C) = ??? // (accum1.zero, accum2.zero, accum3.zero)

def seqOp(r: (A, B, C), t: T): (A, B, C) = r match {
  case (a, b, c) => {
    // Apply operations equivalent to
    //   accum1.addAccumulator(a, t)
    //   accum2.addAccumulator(b, t)
    //   accum3.addAccumulator(c, t)
    // and return the first argument (r)
  }
}

def combOp(r1: (A, B, C), r2: (A, B, C)): (A, B, C) = (r1, r2) match {
  case ((a1, b1, c1), (a2, b2, c2)) => {
    // Apply operations equivalent to
    //   accum1.addInPlace(a1, a2)
    //   accum2.addInPlace(b1, b2)
    //   accum3.addInPlace(c1, c2)
    // and return the first argument (r1)
  }
}
val rdd: RDD[T] = ???
val accums: (A, B, C) = rdd.aggregate(zeroValue)(seqOp, combOp)
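As a concrete illustration of the shape of this, here is a runnable sketch with made-up accumulator semantics (sum, count, and max over an RDD[Int]; sc is assumed to be an existing SparkContext):

val rdd = sc.parallelize(Seq(1, 5, 2, 4, 3))

// Identities for (sum, count, max)
val zeroValue = (0, 0L, Int.MinValue)

// Fold one element into the running tuple (runs within each partition)
def seqOp(r: (Int, Long, Int), t: Int): (Int, Long, Int) = r match {
  case (sum, count, max) => (sum + t, count + 1, math.max(max, t))
}

// Merge the per-partition tuples
def combOp(r1: (Int, Long, Int), r2: (Int, Long, Int)): (Int, Long, Int) =
  (r1, r2) match {
    case ((s1, c1, m1), (s2, c2, m2)) => (s1 + s2, c1 + c2, math.max(m1, m2))
  }

val (sum, count, max) = rdd.aggregate(zeroValue)(seqOp, combOp)
// (15, 5L, 5) -- all three aggregates in a single pass over the data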