RDD product without repeated tuples - Scala

I have the following RDD[String]:
TTT
SSS
AAA
and I am having trouble getting the following tuples:
(TTT, SSS)
(TTT, AAA)
(SSS, AAA)
I was doing:
val res = input.cartesian(input).filter{ case (a,b) => a != b }
But the result is:
(TTT,SSS)
(TTT,AAA)
(SSS,TTT)
(SSS,AAA)
(AAA,TTT)
(AAA,SSS)
What is the best way to do that?

You could impose an order in the tuple to obtain the combinations:
val res = input.cartesian(input).filter{ case (a,b) => a < b }
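As a rough local illustration of the ordering trick, here is a sketch on a plain List instead of an RDD (the names product, pairs, indexed and byPosition are made up for this example). Note that a < b orders each tuple lexicographically, so the pairs may come out as (SSS,TTT) rather than (TTT,SSS). If the original input order matters, comparing positions instead of values is one possible variant; RDDs also offer zipWithIndex, though this sketch has only been reasoned about on local collections.

```scala
// Local stand-in for the RDD; the values come from the question.
val input = List("TTT", "SSS", "AAA")

// Cartesian product of the list with itself, like input.cartesian(input).
val product = for (a <- input; b <- input) yield (a, b)

// Keeping only a < b retains exactly one ordering of each unordered pair.
val pairs = product.filter { case (a, b) => a < b }

// Variant: compare positions instead of values to keep the input order.
val indexed = input.zipWithIndex
val byPosition = for ((a, i) <- indexed; (b, j) <- indexed if i < j) yield (a, b)

println(pairs)
println(byPosition)
```

The value-based filter also drops pairs of equal strings, so duplicated input elements would never be paired with each other; the position-based variant would keep them.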

Related

Spark - aggregateByKey Type mismatch error

I am trying to find the problem behind this error. I am trying to find the maximum Marks of each student using aggregateByKey.
import spark.implicits._ // needed for toDF; assumes spark is a SparkSession
val data = Seq(("R1","M",22),("R1","E",25),("R1","F",29),
("R2","M",20),("R2","E",32),("R2","F",52))
.toDF("Name","Subject","Marks")
def seqOp = (acc:Int,ele:(String,Int)) => if (acc>ele._2) acc else ele._2
def combOp =(acc:Int,acc1:Int) => if(acc>acc1) acc else acc1
val r = data.rdd.map{case(t1,t2,t3)=> (t1,(t2,t3))}.aggregateByKey(0)(seqOp,combOp)
I am getting an error saying that aggregateByKey accepts (Int,(Any,Any)) but the actual type is (Int,(String,Int)).
Your map function is incorrect since you have a Row as input, not a Tuple3.
Fix the last line with:
val r = data.rdd.map { r =>
  val t1 = r.getAs[String](0)
  val t2 = r.getAs[String](1)
  val t3 = r.getAs[Int](2)
  (t1,(t2,t3))
}.aggregateByKey(0)(seqOp,combOp)
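To see what the aggregation is meant to compute, here is a minimal local sketch on plain Scala collections rather than Spark (rows and maxMarks are illustrative names). It keys the records by student name and folds each group's values with the question's seqOp, starting from the zero value 0. In Spark, combOp would additionally merge the per-partition results; with a single local collection it is not needed.

```scala
// The sample records from the question as plain tuples.
val rows = List(
  ("R1", "M", 22), ("R1", "E", 25), ("R1", "F", 29),
  ("R2", "M", 20), ("R2", "E", 32), ("R2", "F", 52)
)

// Same shape as the question's seqOp: keep the larger of the accumulator and the mark.
val seqOp = (acc: Int, ele: (String, Int)) => if (acc > ele._2) acc else ele._2

// Key by name, then fold each group's (subject, marks) values starting from 0.
val maxMarks = rows
  .map { case (name, subject, marks) => (name, (subject, marks)) }
  .groupBy(_._1)
  .map { case (name, kvs) => name -> kvs.map(_._2).foldLeft(0)(seqOp) }

println(maxMarks)
```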

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is the day and the characters after the colon are events.
I have to find, for each event, the day on which it occurs most often.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A occurs 20 times on day 2 and B occurs 10 times on day 3)
I am splitting the original dataset like this:
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What would be the best implementation for the counting and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD

val rdd = sqlContext.sparkContext.makeRDD(Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
// For each day, count how often each event character occurs.
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map { str =>
  val split = str.split(":")
  (s"Day${split.head}", split(1).groupBy(a => a.toString).map { case (event, occ) => event -> occ.length })
}
// For each event, pick the day with the highest count.
keys.map { key =>
  key -> seqOfMaps.mapValues(_.getOrElse(key, 0)).sortBy(a => -a._2).first._1
}.toMap
The processing consists of transforming the data into an RDD to which functions such as finding the maximum of a list are easy to apply.
I will try to explain it step by step.
I've used dummy data for the "A" and "B" chars.
The foo RDD is the first step; it gives you an RDD[(String, Array[String])].
Next, count the occurrences of each char in the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we will flatMap over values to expand our rdd by char
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the RDD so that it is easier to work with:
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char:
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the entry with the maximum count in each group:
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
Hope this helps.
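The steps above can be condensed into a short local sketch on plain Scala collections (no Spark involved; lines, counts and maxDayPerEvent are illustrative names). It builds (event, count, day) triples per line, groups them by event and keeps the day with the largest count.

```scala
// The three sample lines from the question.
val lines = List(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
)

// One (event, count, day) triple per distinct event on each day.
val counts = lines.flatMap { line =>
  val parts = line.split(":")
  val day = parts(0)
  val events = parts(1)
  events.groupBy(identity).map { case (event, occurrences) =>
    (event, occurrences.length, day)
  }
}

// For each event, keep the day of the triple with the highest count.
val maxDayPerEvent = counts
  .groupBy(_._1)
  .map { case (event, triples) => event -> ("Day" + triples.maxBy(_._2)._3) }

println(maxDayPerEvent)
```

If two days tie for an event, maxBy returns the first maximal triple it encounters, so the reported day then depends on the order of the input.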
The previous answers are good, but I prefer this solution:
val data = Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
  val parts = s.split(":")
  val charCount = parts(1).groupBy(i => i).mapValues(_.length)
  charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)

Scala zip with id on sorted collections

I have two collections in Scala whose objects contain an id, and I want to zip them together by id. So in an example like this:
case class A(id: Long)
case class B(id: Long)
val col1 = A(1) :: A(2) :: A(5) :: Nil
val col2 = B(2) :: B(2) :: B(5) :: Nil
I would expect as result:
List(
  (A(1), List()),
  (A(2), List(B(2), B(2))),
  (A(5), List(B(5)))
)
How can I do this the easy way?
If I know col1 and col2 are already sorted by id, would that help somehow?
I can't figure out a good way to do it without an intermediate variable, but how about something like this:
val map = col2.groupBy(_.id).withDefault(_ => List.empty)
col1.map { a => a -> map(a.id) }
For 3-element lists it doesn't matter, but note that the main difference from the other answer is that this runs in linear time, whereas filtering col2 once for every element of col1 is quadratic.
One way would be to map the first collection and inside the map do a filter on the second and build a tuple:
scala> col1.map { c1 =>
| (c1, col2.filter(_.id == c1.id))
| }
res0: List[(A, List[B])] = List((A(1),List()), (A(2),List(B(2), B(2))), (A(5),List(B(5))))
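For comparison, here is a small self-contained sketch that runs both approaches on the question's data and checks that they agree (byId, viaLookup and viaFilter are names chosen for this example). It uses withDefaultValue(Nil) as a slight variation on the withDefault call shown above.

```scala
case class A(id: Long)
case class B(id: Long)

val col1 = List(A(1), A(2), A(5))
val col2 = List(B(2), B(2), B(5))

// Lookup approach: build the id -> List[B] map once, then make one pass over col1.
val byId = col2.groupBy(_.id).withDefaultValue(Nil)
val viaLookup = col1.map(a => a -> byId(a.id))

// Filter approach: rescans col2 for every element of col1.
val viaFilter = col1.map(a => (a, col2.filter(_.id == a.id)))

println(viaLookup)
println(viaLookup == viaFilter)
```

Neither version relies on the inputs being sorted; groupBy keeps the elements of each group in their original order, so the two results match element for element.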

Summing items within a Tuple

Below is a list of tuples, of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to sum the Int values associated with each id, so the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List here only for local testing, but the same solution should work with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
which corresponds to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of the values in the map (i.e. all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs groupBy f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test this.)
def f(tuple: (String, String, Int)): String = {
  tuple._1
}
Then you will have to iterate through the list for each id in the map and sum up the integer values. That is straightforward, but if you still need help, ask in the comments.
The following is the most readable, efficient and scalable:
data.map {
  case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. Using reduceByKey means the summation is parallelized: for very large groups it is distributed, and partial sums are computed on the map side. Think about the case where there are only 10 groups but billions of records; using .sum won't scale, as the work can only be distributed across 10 cores.
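The map-side summation can be pictured with a small local simulation on plain Scala collections. This is only a sketch of the idea, not how Spark implements it: the two inner lists stand in for partitions, and sumByKey, partials and totals are hypothetical names for this example.

```scala
// Two simulated partitions holding ((id, name), value) records.
val partitions: List[List[((String, String), Int)]] = List(
  List((("id1", "a"), 1), (("id1", "a"), 1)),
  List((("id1", "a"), 1), (("id2", "a"), 1))
)

// Sums the values per key, analogous to what _ + _ does in reduceByKey.
def sumByKey(pairs: List[((String, String), Int)]): Map[(String, String), Int] =
  pairs.foldLeft(Map.empty[(String, String), Int]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

// "Map side": each partition reduces its own records independently.
val partials = partitions.map(sumByKey)

// "Reduce side": only the per-partition partial sums are merged.
val totals = sumByKey(partials.flatMap(_.toList))

// Flatten the key back into the (id, name, sum) shape the question asks for.
val result = totals.toList.map { case ((id, name), n) => (id, name, n) }

println(result)
```

The point of the two stages is that the merge step only sees one partial sum per key and partition, however many raw records each partition held.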
A few more notes about the other answers:
Using head here is unnecessary, because the grouping key already holds both ids: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)) followed by .values can be replaced with .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is needlessly verbose when, as shown above, .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

scala - using filter with pattern matching

I have the following lists:
case class myPair(ids:Int,vals:Int)
val someTable = List((20,30), (89,90), (40,65), (45,75), (35,45))
val someList:List[myPair] =
someTable.map(elem => myPair(elem._1, elem._2)).toList
I would like to filter all "ids" > 45.
I tried something like this (following an article on filtering with pattern matching):
someList.filter{ case(myPair) => ids >= 45 }
but without success.
I appreciate your help.
You don't need pattern matching at all, since the type is known at compile time:
someList.filter(_.ids >= 45)
or slightly more verbose/readable:
someList.filter(pair => pair.ids >= 45)
You mean like:
someList.filter{ case MyPair(ids,vals) => ids >= 45 }
Renamed myPair to MyPair; identifiers beginning with lowercase are considered variables, much like ids and vals in the above. (Actually this is not true; see @RandallSchulz's comment.)
Going further(1):
val someList = someTable.map { case (ids, vals) => MyPair(ids, vals) }.toList
Even more(2):
val someList = someTable.map(elem => MyPair.tupled(elem)).toList
Way more(3):
val someList = someTable.map(MyPair.tupled).toList
Of course, only (1) is about pattern matching. (2) and (3) turn the arguments of MyPair.apply(Int, Int) into a Tuple2[Int, Int].
Here's one more variant using pattern matching
someTable collect {case (i, v) if i > 45 => MyPair(i, v)}
collect combines a filter operation and a map operation.
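To compare the variants side by side, here is a small self-contained sketch on the question's data (byField, byPattern and viaCollect are names picked for this example; the threshold >= 45 follows the filter answers above, not the > 45 used in the collect snippet).

```scala
case class MyPair(ids: Int, vals: Int)

val someTable = List((20, 30), (89, 90), (40, 65), (45, 75), (35, 45))
val someList = someTable.map { case (ids, vals) => MyPair(ids, vals) }

// Plain field access: no pattern matching needed.
val byField = someList.filter(_.ids >= 45)

// Pattern matching on the case class inside filter.
val byPattern = someList.filter { case MyPair(ids, _) => ids >= 45 }

// collect filters with the guard and maps in a single pass over the tuples.
val viaCollect = someTable.collect { case (i, v) if i >= 45 => MyPair(i, v) }

println(byField)
```

All three produce the same list here; collect just skips the intermediate someList.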
case class myPair(ids:Int,vals:Int)
val someTable = List((20,30), (89,90), (40,65), (45,75), (35,45))
val someList:List[myPair] = for( elem <- someTable; if elem._1 > 45) yield myPair(elem._1, elem._2)
Which gives
someList: List[myPair] = List(myPair(89,90))